Supervised machine learning as we know, has labels attached to the data set and has regression and classification problems as its subset. Decision trees are a special kind of algorithm that can be used for both regression(predicting a number) and classification tasks.
Decision trees that are used for regression tasks are called Continous variable decision tree and the one used for classification is called the Categorical variable decision tree. The name decision tree comes from the fact that it has a tree-like structure, which beings at the root node and further splits into branches, which enables us to visualize how the model makes decisions and its decision-making process. The following sections give a detailed description of how the algorithm works
Structure of a Decision Tree
We mentioned beforehand that decision trees have a tree-like structure, comprising of many parts. Looking at the figure above, the topmost part of a decision tree is the root node, which consists of the entire population and gets further split into smaller branches. The splitting of the root node yields two sub-branches called the decision node and the leaf node. The decision node is where the decisions by the tree are made, and consist of a lot of sub-branches, whiles the leaf node contains the output of the decisions made by the decision node and does not consist of any further branches.
A decision tree can have as many branches as possible based on the context of the decision is trying to reach. A very simple explanation of how the decision tree works are that it asks a simple question, and based on the answers, it splits into smaller trees to find a solution to the answer.
There are a few terminologies associated with decision trees, some of which we have been mentioned previously:
- Root node: The beginning of the decision tree consisting of the entire population
- Decision node: When the nodes can be further split into smaller nodes or subnodes
- Leaf node: Final output of the decision, and the nodes can not be further split
- Splitting: When the decision node can be further divided into smaller nodes based on the given condition
- Pruning: The process of removing subnodes, which is kind of the opposite of splitting
- Parent node/Child node: The topmost node(Root node) is the parent node and any other node is a child node since it came from the root node
Decision Tree Steps
It was mentioned earlier that decision trees are used for both classification and regression tasks, but it is mostly used for classification tasks, so the rest of the article is going to focus on classification. In performing the classification, predicting the class for a particular dataset, the decision tree algorithm starts the process from the root node. Below are the steps, the decision tree goes through in making decisions.
- The decision starts from the root node, which contains the entire population
- The best attribute in the population(dataset) is determined using a technique called Attribute selection measure
- The root node is then split into subnodes containing the best attributes from the dataset
- The decision tree node is generated which contains the best attribute
- Repeat the process iteratively, by generating new decision trees, using the subnodes from the dataset, until a stopping criterion is reached, where there can be no further splitting of the nodes called the leaf node
Considering the figure above, let's assume we want to determine the type of fruits available based on their features. Using a decision tree to solve this problem, we start from the root node, which is the color of the fruit.
The root node is further split into the next decision node, based on the conditions, which in this case, Red or Green, so if the color is Red, then it's a leaf node, which indicates the output of a decision(Apple). If it's green, then the decision node is further split based on the conditions. If the size of the fruit is big, then it's an apple, if it's small, then it's lime. So the Apple and lime are leaf nodes which are the outputs and they cannot be further split.
Attribute Selection Measure
I am sure so far, it is all well and good, except for the fact that we don't know how the algorithm selects the best attribute. That's when the aforementioned Attribute selection measure(ASM) comes in, which enables us to select the best attribute of every feature in the dataset. For decision trees, there are two popular techniques used, which are:
- Information gain
- Gini index
This information gain is the amount of information gained about a class in the target variable. Using the amount of information gained, the node is split and used to build the decision trees. Decision trees are always trying to maximum this information gain attribute, and the node having the highest information gain is split first, in other words, the node which provides us with the most useful information in arriving at a decision is split first. Before we move on any further, let's talk about entropy.
Entropy is simply a measure of disorder or impurity or uncertainty in a dataset. Entropy can be represented mathematically as:
Where p(i) is the probability of a class(i) in a dataset. Suppose there are 100 responses in a dataset, in which there are two classes, Yes or no, the p(i) can be either of them, if there are 60 data points belonging to Yes, then p(yes) is 6/10 and p(no) is 4/10.
Information gain can also, therefore, be represented mathematically as :
Gini index is a measure of how frequently a randomly chosen data point from the whole dataset would be incorrectly identified. A low Gini index is always preferred compared to a high Gini index. Gini index creates binary splits and cannot be used to create more than 2 splits. Gini index can be represented mathematically as:
Where p(i) is the proportion of data points that belong to a particular class for a particular node
Implementation Of Decision Trees
We used python program software to implement a simple decision Tree for a particular dataset. We are using the very popular Iris dataset, where the data set consists of several iris flowers with certain features such as the sepal length and petal width, and we are to predict the type of iris flower it is based on those features. One excellent thing in python is that there are several libraries developed that help us with our machine learning tasks easily. So we import the necessary libraries
from sklearn.tree import DecisionTreeClassifier import matplotlib.pyplot as plt from sklearn.tree import plot_tree from sklearn.metrics import accuracy_score from sklearn.model_selection import train_test_split
We import the decision tree algorithm to build and train the model, the matplotlib to help us visualize our data with a few lines of code, the plot tree to plot the decision tree process, the train test split to split our data into training and testing so we use some of it to train and the rest to evaluate how good our model is and finally the accuracy score to calculate exactly how good the model is.
X = data.drop(columns=['Species']) y = data['Species'] X_train, X_test, y_train, y_test= train_test_split(X, y, test_size= 0.2, random_state=0)
We first divide our data into X and y, with the X containing all features except the target label which is species, and y containing just the target label.
We then split our data into 80% for the training and 20% for the testing and partitioned it into four groups, the input training data(X_train), input testing data(X_test), output training data(y_train), and output testing data(y_test).
Dt_gini = DecisionTreeClassifier(criterion='gini') Dt_entropy = DecisionTreeClassifier(criterion='entropy')
We then instantiate our decision tree model and store it in the alias of Dt, with Dt_gini being the model using the Gini index and Dt_entropy being the model with the entropy. We would compare the two later on.
We then train both our gini and entropy models by fitting the X to the y
pred_gini = Dt_gini.predict(X_test) pred_entropy = Dt_entropy.predict(X_test)
After training, we need to test and see how good our models are, so we call the predict function and store it as pred_gini and pred_entropy for both models
We then check our accuracies, by calling the accuracy score function to compute how accurate our model was in predicting the species of Iris. Both models gave an accuracy of 96.7%, which means both models are good at predicting, and ultimately means that in using the features of the iris flowers such as the petal length and width, the decision tree algorithm using both the Gini index and entropy can classify the species with a 96.7% accuracy.
plt.figure(figsize=(9,9)) plot_tree(Dt_gini, filled = True,max_depth=3, rounded=True, ); plt.figure(figsize=(9,9)) plot_tree(Dt_entropy, filled = True,max_depth=3, rounded=True, );
Now, we have to visualize how our decision tree makes decisions regarding the iris dataset. We use the matplotlib to construct this in addition to the plot tree to get the trees.
The decision tree of both models, with 3 decisions(coming from max_depth=3). The root node contains the total number of samples and the value refers to the number of samples in each of the 3 classes.
So we can clearly see how the decision tree is split using both the Gini and entropy. Finally, on the last node, the value indicates the class the decision tree assigned, both the Gini and entropy outputted the same class, so looking at it, for the green box, it indicates that the output is the second class and the violet box it indicates that the decision is the third class.
We looked at decision trees in detail, their structure, intuitive description, and how to implement them using python. Decision trees are powerful algorithms for classification because they give both excellent accuracies and explainability to how they arrive at decisions. However, decision trees can contain multiple splits which makes it complex and it has a major liability known as overfitting.