Understanding Random Forests
Random forest is a supervised machine learning technique and one of the most popular algorithms for both regression and classification. It is based on the concept of ensemble learning, which simply means combining many classifiers to improve performance. Simply put, a random forest is a group of decision trees; it takes the majority vote of the trees' outputs to improve predictions and results.
Random forest is an improvement on the decision tree algorithm. Decision trees have a massive limitation, overfitting, but the random forest overcomes it by taking the prediction from each tree and, based on the majority vote, predicting the final output. The more trees in the forest, the better the accuracy tends to be, and overfitting is prevented or at least reduced. The following sections give further intuition about the algorithm and show how it counters this overfitting. The emphasis of this article is on classification.
Structure and workings of the Random Forest
To understand random forest well, please check the previous article on decision trees before continuing. As mentioned previously, the random forest combines decision trees, so some decision trees may predict the correct output while others may not.
As can be seen in the figure above, there are several decision trees, each working independently to form its own output and give a prediction. The random forest then takes the prediction from each tree and selects the class predicted by the majority of trees as the final predicted class.
The algorithm works in this way:
- Select the number of decision trees to be built
- Randomly sample some data points from the training set (with replacement)
- Build a decision tree using the sampled data points
- Repeat the second and third steps for every tree
- For any new data point(s), compute the prediction of each decision tree, and using the majority vote of the trees, assign the new data point(s) to that class.
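The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the article's implementation: it uses scikit-learn's DecisionTreeClassifier as the base learner and a toy dataset from make_classification, and names such as n_trees are assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Toy dataset standing in for real data
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

n_trees = 10
rng = np.random.default_rng(0)
trees = []
for _ in range(n_trees):
    # Bootstrap sample of the data, then fit a tree on it
    idx = rng.integers(0, len(X), size=len(X))
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(X[idx], y[idx])
    trees.append(tree)

# Prediction: majority vote across the trees
votes = np.stack([t.predict(X) for t in trees])   # shape (n_trees, n_samples)
majority = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
print(majority[:10])
```

In practice scikit-learn's RandomForestClassifier does all of this (plus random feature selection at each split) in a single call, as shown later in the article.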
The following example would therefore expand the idea of random forest for better understanding.
Let's consider an example where there are two animals in a dataset, as can be seen in the figure above, and this is given to a random forest classifier to classify the animals based on certain features. Each decision tree outputs a decision based on its own computations, so the trees may have different 'opinions' on what the animal is. Decision tree A outputs that it's a giraffe, decision tree B thinks it's an elephant, and each of the other decision trees gives its own output.
The next step is the majority voting, where the output from each tree is counted and the animal predicted most often is deemed the correct animal based on the features. In our example, giraffe was the animal most decision trees predicted, so the random forest assigns this data point to the giraffe class.
The random forest algorithm can be represented mathematically as:

C(x) = argmax_j Σ_{i=1}^{N} 1(C_i(x) = j)

where class j refers to the classes in the data, i indexes the decision trees from 1 up to the Nth tree, C_i(x) is the prediction of the ith tree, and 1(·) is the indicator function that counts a tree's vote for class j. Argmax selects the class with the maximum count, in other words, the majority vote.
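As a quick illustration of the majority vote, suppose five hypothetical trees have voted on class labels for a single data point (this example is not from the article):

```python
import numpy as np

# Class labels predicted by five hypothetical decision trees for one point
votes = np.array([1, 0, 1, 1, 2])

# Count the votes per class and take the argmax, i.e. the majority vote
counts = np.bincount(votes)          # [1, 3, 1]: class 1 got three votes
predicted_class = counts.argmax()
print(predicted_class)               # 1
```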
Implementation of Random Forest Using Python
Python is the programming language used for this implementation. The dataset is used to determine whether an advertising company should give out promotions, based on the number of visitors to their site, the amount of money they spend on marketing, and the revenue they generate. The target label contains three classes: No Promotion, Promotion Red, and Promotion Blue.
We begin by importing the necessary libraries for the task:
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
Pandas is used to load the data and perform all the data manipulations. We import the random forest classifier from scikit-learn to train and run inference with the model, train_test_split to split our data into training and testing sets, and the accuracy score metric to check how our model performed. Matplotlib and plot_tree are imported to visualize the trees later.
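The article's advertising dataset is not shown being loaded. As a runnable stand-in, a small synthetic `data` frame with the columns described could look like the following; the feature column names here are assumptions, and in practice the real file would be loaded with something like pd.read_csv:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the advertising dataset described in the article.
# Column names (Visitors, Marketing_Spend, Revenue) are assumptions.
rng = np.random.default_rng(0)
n = 300
data = pd.DataFrame({
    'Visitors': rng.integers(100, 10_000, size=n),
    'Marketing_Spend': rng.uniform(1_000, 50_000, size=n).round(2),
    'Revenue': rng.uniform(5_000, 200_000, size=n).round(2),
    'Promo': rng.choice(['No Promotion', 'Promotion Red', 'Promotion Blue'],
                        size=n),
})
print(data.head())
```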
X = data.drop(columns=['Promo'])
y = data['Promo']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
We then divide our data into X and y, with X holding all the columns except the target variable (Promo) and y containing just the target variable. We split the data in a 70%-30% ratio, storing it as X_train (input training data), X_test (input testing data), y_train (output training data) and y_test (output testing data). This lets us use 70% of the data to train the random forest classifier and the remaining 30% to test how the model did.
Rf = RandomForestClassifier()
Rf.fit(X_train, y_train)
We first instantiate the classifier and store it under the name Rf. We then train the algorithm by calling the fit method, which fits the inputs (X_train) to the outputs (y_train).
predictions = Rf.predict(X_test)

We call the predict function to make predictions on the test set and store them as predictions.
accuracy_score(y_test, predictions)

To check how our model did, we call the accuracy score function, which counts the number of correct predictions and divides it by the total number of predictions (expressed here as a percentage). Our model achieved an accuracy of 77.87%, which indicates that, using the number of visitors to the site, the company's marketing spend, and the revenue generated, the random forest algorithm can predict whether a promotion should be given out, and if so, which type, with 77.87% accuracy.
To further evaluate the algorithm, a comparison was made with the decision tree classifier, which achieved an accuracy of 69.19%. This clearly shows that ensembling a number of decision trees and taking the majority vote to predict the outcome (i.e., a random forest) gives a better result than a single decision tree.
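The comparison can be reproduced in outline as follows. This is a self-contained sketch on a synthetic three-class dataset, not the article's data, so the exact accuracies will differ from the 77.87% and 69.19% reported above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic three-class data standing in for the advertising dataset
X, y = make_classification(n_samples=500, n_features=5, n_informative=3,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=0)

# Train a forest and a single tree on the same split, then compare
rf = RandomForestClassifier(random_state=0).fit(X_train, y_train)
dt = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

rf_acc = accuracy_score(y_test, rf.predict(X_test))
dt_acc = accuracy_score(y_test, dt.predict(X_test))
print(f"Random forest: {rf_acc:.2%}, single tree: {dt_acc:.2%}")
```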
fig, axes = plt.subplots(nrows=1, ncols=4, figsize=(10, 2), dpi=900)
for index in range(0, 4):
    plot_tree(Rf.estimators_[index], filled=True, max_depth=2, ax=axes[index])
    axes[index].set_title('Estimator: ' + str(index), fontsize=11)
To visualize the random forest, we use the plot_tree function together with matplotlib. Only the first 4 decision trees of the forest are plotted, to get a good look at them, and only 2 levels of splits (max_depth) are shown to keep the visualization clean.
The figure above shows just 4 of the decision trees in the forest; the point is not to show the detailed output of each tree but how to visualize the random forest using Python. Each estimator is one tree, each tree outputs a decision as discussed previously, and the random forest makes the final prediction by taking the majority vote.
We have looked at the random forest algorithm in detail: its structure, how it works, and finally how to implement it. The main advantages of this algorithm are:
- It reduces overfitting
- It is less prone to bias, since the final decision is based on the combined decisions of all the trees
- The algorithm can handle missing data well
- It's a stable algorithm because any new data point that is introduced does not affect the whole process, it may affect just one tree
- It can be used for both classification and regression tasks
A disadvantage is that due to the number of trees, it is computationally expensive and can take a long time to train.
Overall, this algorithm is a go-to algorithm for a lot of machine learning tasks and users should consider it when trying to score high accuracies in competitions or personal projects.