Building a Random Forest Model with R

This post walks through building a Random Forest model for a classification problem in R. Two R packages are commonly used to train Random Forest models: "randomForest" and "caret". Here I'll train the model with "randomForest" and use "caret" only for splitting the data and evaluating the result. The trained model can then be exported as a PMML file, an XML-based standard format for exchanging predictive models between applications. I will use the famous iris data set for the demonstration.


Prerequisites:

You need to have R installed. If you are on Linux, you may want to install RStudio as well: on Windows R ships with a GUI, but on Linux it offers only a console for executing commands. RStudio provides a GUI for R operations, which is much more convenient.


Install Necessary Packages:

We need to install a few external libraries. Start R and execute the following commands in the console. (You can also create a new script and execute all the commands at once.)
install.packages('caret')
install.packages('randomForest')
install.packages('pmml')
Once the libraries are installed, we need to load them into the session.
library(randomForest)
library(caret)
library(pmml)

Prepare Data:

Next, let's import the data set (the iris data set) into R.
data <- read.csv("/home/supun/Desktop/intersects4.csv",header = TRUE, sep = ",")
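If you do not have such a CSV file at hand, the iris data that ships with R can stand in for it. The snippet below is a minimal sketch that assumes the CSV has the same five columns as the built-in data, with Species stored as the integer codes 0, 1 and 2. The mapping of species names to codes is my assumption, not something the file confirms.

```r
# Assumption: the CSV mirrors R's built-in iris data, with Species
# stored as integer codes instead of species names.
data <- iris

# as.integer() on a factor returns the level index (1, 2, 3);
# subtracting 1 gives the codes 0, 1, 2 (setosa, versicolor, virginica).
data$Species <- as.integer(data$Species) - 1L

# Inspect the column types: all five columns are now numeric/integer,
# which is what read.csv would produce for a numerically encoded file.
str(data)
table(data$Species)
```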
Now we need to split this data set into two portions: one to train the model and the other to test/validate the model we build. Here I'll use 70% for training and the remaining 30% for testing.
# Take a 70% partition of the row indices, based on the response column (Species)
split <- createDataPartition(data$Species, p = 0.7)[[1]]
# Create the training set from that 70% of the data
trainset <- data[split, ]
# Create the test set from the remaining 30% of the data
testset <- data[-split, ]
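createDataPartition samples rows at random, so every run gives a different split. The following is a small sketch of how the split could be made repeatable and sanity-checked. It uses the built-in iris data as a stand-in, and the seed value 42 is an arbitrary choice of mine.

```r
library(caret)

# Fixing the seed makes the random partition repeatable (42 is arbitrary).
set.seed(42)

# Stand-in data: the built-in iris data set, where Species is already a factor.
data <- iris
split <- createDataPartition(data$Species, p = 0.7)[[1]]
trainset <- data[split, ]
testset <- data[-split, ]

# Every row should land in exactly one of the two sets.
nrow(trainset)
nrow(testset)

# Class counts in each set, to check that no class is badly under-represented.
table(trainset$Species)
table(testset$Species)
```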
Now we need to pay special attention to the data types in our data set. The Species column of the iris data set is categorical, and the remaining four features are numerical. In this copy of the data, Species has been encoded as numbers, so when R imports the file it treats every column as numerical. We need R to treat Species as categorical and the rest as numerical, so we convert the Species column to a factor, as follows. Do this only for categorical columns, not for numerical ones. Here I'm creating a new train set and a new test set from the converted data.
trainset2 <- data.frame(as.factor(trainset$Species), trainset$Sepal.Length, trainset$Sepal.Width, trainset$Petal.Length, trainset$Petal.Width)
testset2<- data.frame(as.factor(testset$Species), testset$Sepal.Length, testset$Sepal.Width, testset$Petal.Length, testset$Petal.Width)
# Rename the column headers
colnames(trainset2) <- c("Species","Sepal.Length","Sepal.Width","Petal.Length","Petal.Width")
colnames(testset2) <- c("Species","Sepal.Length","Sepal.Width","Petal.Length","Petal.Width")
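The same conversion can also be done in place, which avoids rebuilding the data frame and renaming the columns. This is only an alternative sketch, using the built-in iris data with numerically encoded Species as a stand-in for the imported training set.

```r
# Stand-in for the imported training data: iris with Species as integer codes.
trainset <- iris
trainset$Species <- as.integer(trainset$Species) - 1L

# Copy the data frame and convert only the categorical column to a factor.
# The four numeric columns are left untouched.
trainset2 <- trainset
trainset2$Species <- as.factor(trainset2$Species)

# Check the result: Species should be a factor, everything else numeric.
sapply(trainset2, class)
levels(trainset2$Species)
```

Note that this variant keeps the columns in their imported order, so any later step that selects the predictors by position would have to select them by name instead.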

Train Model:

Before we start training, let's look at the inputs the randomForest function needs. Here I'm going to pass five parameters, which are as follows, in order.
  • Formula - formula defining the relationship between the response variable and the predictor variables. (Here y~. means that y is the response variable and all other variables are predictors.)
  • Dataset - data set to be used for training. It should contain the variables named in the formula above.
  • importance - Boolean value indicating whether to calculate feature importance.
  • ntree - Number of trees to grow. This should not be set too small, so that every input row gets predicted at least a few times. That said, a large number (say, about 100) can make the output model very large (a few GBs, depending on the data), and exporting it as PMML would then take a long time.
  • mtry - Number of variables randomly sampled as candidates at each split. Note that the default values are different for classification (sqrt(p), where p is the number of variables in x) and regression (p/3).
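To make the mtry defaults concrete, here is the arithmetic for the four iris predictors. The rounding shown (floor, with a minimum of 1 for regression) follows the defaults documented on the randomForest help page.

```r
# Number of predictor columns in the iris data
# (Sepal.Length, Sepal.Width, Petal.Length, Petal.Width).
p <- 4

# Default for classification: floor(sqrt(p))
mtry_classification <- floor(sqrt(p))

# Default for regression: floor(p / 3), but never less than 1
mtry_regression <- max(floor(p / 3), 1)

mtry_classification  # 2
mtry_regression      # 1
```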
If you need further details on randomForest, execute the following command in R, which opens the help page for the function.
help(randomForest)
Now we need to find the best mtry value for our case. Execute the following command and, in the resulting graph, pick the mtry value that gives the lowest out-of-bag (OOB) error.
bestmtry <- tuneRF(trainset2[-1], trainset2$Species, ntreeTry = 10, stepFactor = 1.5, improve = 0.1, trace = TRUE, plot = TRUE, doBest = FALSE)

 
According to the graph, the OOB error is lowest at mtry=2, so I will use that value for training. To train the model, execute the following command. Here I'm training the Random Forest with 10 trees.
model <- randomForest(Species~.,data=trainset2, importance=TRUE, ntree=10, mtry=2)
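Before moving on, it can be useful to look at the OOB error of the trained forest itself. The sketch below trains a comparable model on the built-in iris data (a stand-in for trainset2, with an arbitrary seed) and reads the error estimates stored in the model object.

```r
library(randomForest)

# Arbitrary seed so that the forest is repeatable.
set.seed(42)

# Stand-in for trainset2: the built-in iris data, where Species is a factor.
model <- randomForest(Species ~ ., data = iris,
                      importance = TRUE, ntree = 10, mtry = 2)

# Printing the model shows the OOB error estimate and the OOB confusion matrix.
print(model)

# err.rate has one row per tree: the cumulative OOB error and per-class errors.
# The last row is the estimate for the full forest.
tail(model$err.rate, 1)
```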
Let's see how important each feature is to our model.
importance(model)
This will produce the following output.
                    0        1         2 MeanDecreaseAccuracy MeanDecreaseGini
Sepal.Length 1.257262 1.579965  1.985794             2.694172         8.639374
Sepal.Width  1.083289 0.000000 -1.054093             1.085028         2.917022
Petal.Length 6.455716 4.398722  4.412181             6.071185        39.194641
Petal.Width  2.213921 2.045408  3.581145             3.338613        18.181343
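The importance matrix is easier to read when the features are ranked. As a rough sketch (again on the built-in iris data with an arbitrary seed, so the numbers will differ from the ones above), the rows can be sorted by one of the measures, and varImpPlot from the randomForest package draws the same information as a chart.

```r
library(randomForest)

set.seed(42)  # arbitrary seed
model <- randomForest(Species ~ ., data = iris,
                      importance = TRUE, ntree = 10, mtry = 2)

# importance() returns a matrix with one row per feature.
imp <- importance(model)

# Rank the features by mean decrease in Gini impurity, most important first.
ranked <- imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ]
print(ranked)

# Dot chart of both importance measures.
varImpPlot(model)
```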

Evaluate Model:

Now that we have a model, we need to check how good it is. This is where the test data set comes into play. We will predict the response variable "Species" for the rows in the test set and then compare the predicted values with the actual values.
prediction <- predict(model, testset2)
Let's calculate the confusion matrix to evaluate how accurate our model is.
confMatrix <- confusionMatrix(prediction,testset2$Species)
print(confMatrix)
You will get output like the following:
Confusion Matrix and Statistics

          Reference
Prediction  0  1  2
         0 13  0  0
         1  0 17  0
         2  0  0 15

Overall Statistics
                                     
               Accuracy : 1          
                 95% CI : (0.9213, 1)
    No Information Rate : 0.3778     
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
 Mcnemar's Test P-Value : NA         

Statistics by Class:

                     Class: 0 Class: 1 Class: 2
Sensitivity            1.0000   1.0000   1.0000
Specificity            1.0000   1.0000   1.0000
Pos Pred Value         1.0000   1.0000   1.0000
Neg Pred Value         1.0000   1.0000   1.0000
Prevalence             0.2889   0.3778   0.3333
Detection Rate         0.2889   0.3778   0.3333
Detection Prevalence   0.2889   0.3778   0.3333
Balanced Accuracy      1.0000   1.0000   1.0000

As you can see in the confusion matrix, all the values that are NOT on the diagonal of the matrix are zero. This is the best result we can get for a classification problem: 100% accuracy on the test set. In most real-world scenarios it is pretty hard to get this kind of highly accurate output, but it all depends on the data set.
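To see where the headline numbers come from, the accuracy and the per-class prevalence can be recomputed by hand from the confusion matrix printed above. This sketch only re-enters that matrix and uses base R.

```r
# The confusion matrix from the output above (rows: prediction, columns: reference).
cm <- matrix(c(13,  0,  0,
                0, 17,  0,
                0,  0, 15),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = c("0", "1", "2"),
                             Reference  = c("0", "1", "2")))

# Accuracy: correctly classified rows (the diagonal) over all rows.
accuracy <- sum(diag(cm)) / sum(cm)

# Prevalence: share of each actual class in the test set.
prevalence <- colSums(cm) / sum(cm)

# No Information Rate: accuracy of always guessing the most common class.
no_information_rate <- max(prevalence)

accuracy                       # 1
round(prevalence, 4)           # 0.2889 0.3778 0.3333
round(no_information_rate, 4)  # 0.3778
```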
Now that we know our model is highly accurate, let's export it from R so that it can be used in other applications. Here I'm using the PMML [2] format to export the model.
RFModel <- pmml(model);
write(toString(RFModel),file = "/home/supun/Desktop/RFModel.pmml");
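A quick way to check the export is to read the file back and confirm that it parses as XML. The following is a sketch under a few assumptions: it writes to a temporary file instead of the path above, trains a small model on the built-in iris data, and relies on the XML package, which should already be installed as a dependency of pmml.

```r
library(randomForest)
library(pmml)
library(XML)  # assumption: available because pmml depends on it

set.seed(42)  # arbitrary seed
model <- randomForest(Species ~ ., data = iris, ntree = 10, mtry = 2)

# Export to a temporary file (hypothetical location, used only for this check).
path <- tempfile(fileext = ".pmml")
write(toString(pmml(model)), file = path)

# Parse the file again; xmlParse stops with an error if it is not well-formed XML.
doc <- xmlParse(path)
root <- xmlRoot(doc)

# Name of the root element and size of the file in bytes.
xmlName(root)
file.info(path)$size
```

The file size grows with the number of trees, which is worth checking before exporting a much larger forest.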

[1]
[2] https://en.wikipedia.org/wiki/Predictive_Model_Markup_Language
