Building Your First Predictive Model with WSO2 Machine Learner

WSO2 Machine Learner is a tool for predictive analytics on big data. Its outstanding feature is the step-by-step wizard, which makes it easy for anyone to build even advanced models in just a few clicks. (The article at [1] is a good read on the abilities and features of WSO2 ML.)

In this article I will address how to deal with a classification problem with WSO2 Machine Learner. I will use WSO2 ML to build a simple Random Forest model for the well-known "Iris" flower data set. My ultimate goal is to train a machine learning model that predicts the flower "Type" from the rest of the features of the flowers.


Prerequisites


  • Download and extract WSO2 Machine Learner (WSO2_ML) 1.0.0 from here.
  • Download the Iris Dataset from here. Make sure to save it in ".csv" format.

Navigate to the <WSO2_ML>/bin directory and start the server using the "./wso2server.sh" command (or wso2server.bat on Windows). Then locate the URL of the ML web UI (not the Carbon console) in the terminal log and open it (https://localhost:9443/ml by default). You will be directed to the following window.


Then log in to the ML UI using "admin" as both the username and the password.



Import Dataset

Once you have logged in to the ML UI, you will see the following window. Since we haven't uploaded any datasets or created any projects yet, it shows the number of datasets and projects as "0".


The first thing we need to do is import the Iris dataset into the WSO2 ML server. To do that, click on "Add Dataset", and you will get the following dataset upload form.



Give the dataset a name and a description. Since we are uploading the dataset file from our local machine, select "File" as the "Source Type". Then browse for the file and select it. Choose "CSV" as the data format, and select "No" for the column headers, as the CSV file does not contain any headers. Once everything is filled in, select "Create Dataset". You will be navigated to a new page where the newly created dataset is listed. Refresh the page and click on the dataset name to expand the view.
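
Before selecting "No" for the column headers, it can help to confirm that the file really has no header row. The following is a minimal Python sketch of such a check, not something WSO2 ML provides. It assumes the last column holds the class label and every other column is numeric; the function name and the sample rows are hypothetical.

```python
import csv
import io


def first_row_looks_like_header(csv_text, label_columns=1):
    """Return True if any feature field in the first row is non-numeric."""
    first_row = next(csv.reader(io.StringIO(csv_text)))
    # Assumption: the trailing `label_columns` fields are class labels,
    # so only the leading fields are expected to be numbers.
    feature_fields = first_row[:len(first_row) - label_columns]
    for field in feature_fields:
        try:
            float(field)
        except ValueError:
            return True  # a non-numeric feature field suggests a header row
    return False


# Hypothetical sample content, for illustration only.
without_header = "5.1,3.5,1.4,0.2,0\n4.9,3.0,1.4,0.2,0\n"
with_header = "SL,SW,PL,PW,Type\n5.1,3.5,1.4,0.2,0\n"

print(first_row_looks_like_header(without_header))  # False
print(first_row_looks_like_header(with_header))     # True
```

To check a real file, read its contents into a string first and pass that string to the function.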



If the dataset was uploaded correctly, you will see a green tick.


Explore Dataset

Now let's visually explore the dataset to see what characteristics it holds. Click on the "Explore" button, which can be found in the same tab as the dataset name. You will be navigated to the following window.



WSO2 ML provides a range of visualization tools such as the Scatter plot, Parallel set, Trellis chart and Cluster diagram. I will not discuss here what each of these charts can be used for, and will address that in a separate article. For now, let's have a look at the Scatter plot. Here I have plotted PL (Petal Length) against PW (Petal Width) and colored the points by "Type". As we can see, when plotted against PL and PW, the points separate clearly into three clusters, one per "Type". This is good evidence that both PW and PL are important factors in deciding (predicting) the flower "Type". Hence, we should include those two features as inputs to our model. Similarly, you can try plotting different combinations of variables to see which features are important and which are not.


Create a Project

Now that we have imported our data, we need to create a project that uses the uploaded data in order to start working on it. To do that, click the "Create Project" button next to the dataset name.



Enter a name and a description for the project, and make sure the dataset we uploaded earlier is selected as the dataset (it is selected by default if you create the project as described in the previous step). Then click on "Create Project" to complete the action, which will show you the following window.



Here you will see a yellow exclamation mark next to the project name, indicating that no analyses are available. This is because, to start analysing our dataset, we need to create an analysis.



Create Analysis

To create an analysis, type the name of the analysis you want to create in the text box under the project and click "Create Analysis". You will then be directed to the following Preprocess window, which shows an overall summary of the dataset.



Here we can see the data type (Categorical/Numerical) and the overall distribution of each feature. A histogram is displayed for numerical features, and a pie chart for categorical features. We can also decide which features to include in our analysis ("Include") and how to treat the missing values of each feature/variable.

As per the data exploration we did earlier, I will include all the features in my analysis, so every feature stays ticked in the "Include" column. For simplicity, I will use the "Discard" option for missing values. (This means that if a value is missing anywhere in a row, that complete row is discarded.) Once we have selected the options we need, let's proceed to the next step by clicking the "Next" button in the top-right corner. We will be redirected to the data explore step again, which is optional at this stage since we have already explored our data. Therefore, let's skip this step and proceed to the model building phase by clicking "Next" again.
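
The effect of the "Discard" option can be sketched in a few lines of Python. This only illustrates the idea of dropping any row that has a missing field; it is not WSO2 ML's implementation, and the function name, the set of missing-value markers and the sample rows are all assumptions.

```python
def discard_incomplete_rows(rows, missing_markers=("", "?", "NA")):
    """Keep only the rows in which no field is a missing-value marker."""
    return [
        row for row in rows
        if not any(
            field is None or str(field).strip() in missing_markers
            for field in row
        )
    ]


# Hypothetical rows, for illustration only.
rows = [
    ["5.1", "3.5", "1.4", "0.2", "0"],
    ["4.9", "",    "1.4", "0.2", "0"],  # one missing value: whole row dropped
    ["6.3", "3.3", "6.0", "2.5", "2"],
]

clean = discard_incomplete_rows(rows)
print(len(clean))  # 2
```

Discarding is the simplest strategy, but it shrinks the dataset, so it suits cases where missing values are rare.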



Build Model

In this step, we choose which algorithm to use to train our model. As I stated at the beginning, I will use "Random Forest" as the algorithm, and the flower "Type" as the variable that I want to predict (the response variable) with my model.




We can also define what proportion of the data is used to train the model and what proportion is used to validate/evaluate the built model. As is common practice, let's use a proportion of 0.7 (70%) of the data to train the model. Once again, click "Next" to proceed to the next step, where we can set the hyper-parameters the algorithm uses to build the model.
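
To make the train/validation proportion concrete, here is a minimal Python sketch of a 70/30 split. It assumes a shuffle followed by a cut at a fixed index; WSO2 ML's own sampling may work differently, so its validation set need not contain exactly 30% of the rows. The function name and the seed value are hypothetical.

```python
import random


def split_rows(rows, train_fraction=0.7, seed=42):
    """Shuffle a copy of the rows and split it into (train, validation)."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is repeatable
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# 150 placeholder row indices stand in for the 150 rows of the Iris dataset.
train, validation = split_rows(range(150))
print(len(train), len(validation))  # 105 45
```

Fixing the seed keeps the split identical between runs, which makes it easier to compare models trained with different settings.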




These hyper-parameters are specific to the Random Forest algorithm. Each of the hyper-parameters represents the following:
  • Num Trees - Number of trees in the random forest. Increasing the number of trees decreases the variance in predictions, improving the model’s test-time accuracy. Training time also increases roughly linearly with the number of trees. This parameter value should be an integer greater than 0.
  • Max Depth - Maximum depth of each tree. This helps you control the size of the trees to prevent overfitting. This parameter value should be an integer greater than 0.
  • Max Bins - Number of bins used when discretizing continuous features. This must be at least the maximum number of categories M for any categorical feature. Increasing this allows the algorithm to consider more split candidates and make fine-grained split decisions. However, it also increases computation and communication. This parameter value should be an integer greater than 0.
  • Impurity - A measure of the homogeneity of the labels at the node. This parameter value should be either 'gini' or 'entropy' (without quotes).
  • Feature Subset Strategy - Number of features to use as candidates for splitting at each tree node. The number is specified as a fraction or function of the total number of features. Decreasing this number will speed up training, but can sometimes impact performance if too low. This parameter value can take values 'auto', 'all', 'sqrt', 'log2' and 'onethird' (without quotes).
  • Seed - Seed for the random number generator.
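
To illustrate what the two Impurity options measure, the sketch below computes Gini impurity and entropy for the labels that reach a tree node, using the standard textbook definitions (Gini = 1 - sum of p squared, entropy = -sum of p times log2 p). It is a stand-alone Python illustration with hypothetical function names and labels, not code taken from WSO2 ML.

```python
import math
from collections import Counter


def class_proportions(labels):
    """Return the fraction of labels that belongs to each class."""
    if not labels:
        raise ValueError("labels must not be empty")
    total = len(labels)
    return [count / total for count in Counter(labels).values()]


def gini(labels):
    """Gini impurity: 0.0 when all labels at the node are identical."""
    return 1.0 - sum(p * p for p in class_proportions(labels))


def entropy(labels):
    """Entropy in bits: 0.0 when all labels at the node are identical."""
    return sum(-p * math.log2(p) for p in class_proportions(labels))


# Hypothetical node contents, for illustration only.
pure_node = ["0"] * 6
mixed_node = ["0", "0", "1", "1", "2", "2"]

print(gini(pure_node), entropy(pure_node))    # 0.0 0.0
print(gini(mixed_node), entropy(mixed_node))  # roughly 0.667 and 1.585
```

Both measures are lowest for a pure node and highest when the classes are evenly mixed, so a split that lowers impurity is one that separates the classes better.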

These hyper-parameters have default values in WSO2 ML. Even though they are not optimized for the Iris dataset, they do a decent job on many datasets, so I will use the default values to keep things simple. Click "Next" once more to proceed to the last step of the model building phase, where we are asked to select a dataset version. This is useful when there are multiple versions of the same dataset, but since we have only one version (the default version) for now, we can keep it as it is.


We are now done with all the steps needed to build our model. Finally, click "Run" in the top-right corner to build the model. You will then be directed to a page that lists all the models you have built under the analysis. Since we have built only one model so far, only one model is listed. The page also states the current status of the model building process. Initially it is shown as "In Progress". After a couple of seconds, refresh the page to update the status. It should then say "Completed" with a green tick next to it.




Evaluate Model

Our next and final step is to evaluate our model to see how well it performs. To do that, click on the "View" button of the model, which takes you to a page with the following results.




On this page we can see the model's overall accuracy, the confusion matrix, and a Scatter plot in which data points are marked according to their classification status (correctly classified/incorrectly classified). These are generated from the 30% of the data we set aside at the beginning of our analysis.

As we can see from the above output, our model has an overall accuracy of 95.74%, which is high. The confusion matrix shows a breakdown of this accuracy. In the ideal scenario, accuracy would be 100% and all the off-diagonal cells of the confusion matrix would be zero. Such an ideal case is rare in real-world scenarios, though, and the accuracy we got here is well towards the higher end. Therefore, looking at the evaluation results, we can conclude that our model is good enough to predict on future data.
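
The relationship between a confusion matrix and the overall accuracy can be sketched in Python as follows. The counts are hypothetical and are not the matrix produced in this run; they are merely chosen so that 45 of 47 validation rows are correct, which works out to 95.74%. The sketch also assumes that rows are actual classes and columns are predicted classes.

```python
def accuracy(confusion_matrix):
    """Overall accuracy: the diagonal (correct) count over the total count."""
    correct = sum(confusion_matrix[i][i] for i in range(len(confusion_matrix)))
    total = sum(sum(row) for row in confusion_matrix)
    return correct / total


def per_class_recall(confusion_matrix):
    """For each actual class, the fraction of its rows predicted correctly."""
    return [
        row[i] / sum(row) if sum(row) else 0.0
        for i, row in enumerate(confusion_matrix)
    ]


# Hypothetical counts: rows are actual classes, columns are predicted classes.
matrix = [
    [16, 0, 0],
    [0, 14, 1],
    [0, 1, 15],
]

print(round(accuracy(matrix) * 100, 2))  # 95.74
print(per_class_recall(matrix))
```

The off-diagonal cells show which classes get confused with each other, which the single accuracy figure cannot tell you.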



Predict

Optionally, you can use this model to predict new data points from within WSO2 ML itself. This is primarily intended for testing. To do that, navigate back to the model listing page and click on "Predict".


Here we have two options: either upload a file and predict for all the rows in the file, or enter a set of values (which represents a single row) and get the prediction for that single instance. I will use the latter option here. As in the above figure, select "Feature Values", enter the desired values for the features, and click on "Predict". The output value is shown right below the Predict button. In this case, I got "1" as the predicted output for the values I entered, which means that the flower belongs to Type 1.

For more information on WSO2 Machine Learner, please refer to the official documentation found at [1].

References

