Importance of Data Visualization in WSO2 ML: Part I

Why Do We Need Visualization?

Whether you are a data scientist or simply someone with some data on your hands, visualization plays a key role in data analysis. When you are handed a data set, you need a clear understanding of it before you can carry out any effective advanced analysis. You may be an expert in the field the data comes from and have deep knowledge of the domain, but that does not guarantee that you know every detail of the data, and some hidden patterns can never be spotted just by looking at the numbers. Those numbers, whether raw data or summary statistics, can sometimes lead to incorrect conclusions if they are used without a proper understanding of how the data behaves.

As a simple example, the 'mean' (or 'average') is often considered the best statistic to represent a sample of data, and more often than not the whole data set is summarized by the mean alone. But if the values are plotted against their frequencies, the graph may take one of the following overall shapes.

figure 1 (left skewed)
figure 2 (symmetric)
figure 3 (right skewed)

Taking the mean to represent the data is a poor choice if the data is distributed as in figure 1 or figure 3 (i.e. if the data is skewed), because the mean is pulled toward the low-frequency extreme values in the long tail and no longer reflects where most of the data lies. In such situations the median is a better statistic to represent the data. Thus, in Machine Learning with WSO2 ML, when missing data has to be handled, it is important to look at each feature separately before selecting the imputation method, rather than always going with 'mean imputation', where missing values are replaced with the mean of the rest of the data.
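To make the difference concrete, here is a minimal Python sketch of mean versus median imputation on a right-skewed feature. It is only an illustration and not how WSO2 ML implements imputation; the impute helper and the incomes values are made up for this example.

```python
import statistics


def impute(values, strategy="median"):
    """Replace None entries with the mean or median of the observed values.

    Hypothetical helper for illustration only; not part of WSO2 ML.
    """
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    else:
        raise ValueError("strategy must be 'mean' or 'median'")
    return [fill if v is None else v for v in values]


# Made-up right-skewed feature: most values are small, one is extreme.
incomes = [20, 22, 25, 27, 30, None, 400]

mean_filled = impute(incomes, "mean")
median_filled = impute(incomes, "median")

print("mean imputation fills with:  ", mean_filled[5])
print("median imputation fills with:", median_filled[5])
```

With these values the median fill is 26, which sits among the bulk of the data, while the mean fill is roughly 87, a value larger than every observation except the single extreme one. A histogram of the feature would reveal this skew before an imputation method is chosen.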

Another very common use case is detecting multicollinearity, which visualizing the data in two or more dimensions helps with. Put simply, multicollinearity is when one predictor variable can be expressed in terms of one or more other predictor variables with only a trivial loss of precision. In Machine Learning, the time taken to train a model depends on the complexity of the data; in other words, the higher the number of features (predictor variables), the more time it takes to train the model. As WSO2 ML is intended to work with big data, dimension reduction can have a significant effect on model building time. It is therefore important to eliminate redundant predictor variables among which multicollinearity exists, reducing the dimensions of the data and thereby the model building time.
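A scatter plot of two features is the visual check for this; a rough numeric counterpart is the pairwise Pearson correlation. The sketch below flags feature pairs whose absolute correlation exceeds a threshold. The feature names, the data, and the 0.95 threshold are assumptions chosen for illustration, and a pairwise check only catches the case where one predictor is nearly a linear function of a single other predictor, not of several combined.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def collinear_pairs(features, threshold=0.95):
    """Return (name_a, name_b, r) for every pair with |r| >= threshold."""
    flagged = []
    for (name_a, col_a), (name_b, col_b) in combinations(features.items(), 2):
        r = pearson(col_a, col_b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, round(r, 3)))
    return flagged


# Made-up data: height_in is just height_cm in other units, age is unrelated.
features = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [40, 23, 35, 52, 29],
}

pairs = collinear_pairs(features)
print(pairs)
```

Here only the height_cm and height_in pair is flagged, so one of the two could be dropped before training with very little loss of information.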

Likewise, even though data scientists often overlook visualization when it comes to Machine Learning, it can really help to improve the outcome in terms of both accuracy and efficiency.