Decision Trees are one of the simplest and most intuitive algorithms for handling non-linear data effectively. A decision tree is a non-parametric supervised learning method used for both regression and classification problems, though it is most often applied to classification. In general, decision trees are built via an algorithmic approach that identifies ways to split a data set based on different conditions. The goal is to create a model that predicts the target variable by learning simple decision rules inferred from the data features. The tree consists of IF-THEN-ELSE statements: the deeper the tree, the more complex its rules, and the more closely the model fits the training data.
Why do we use a Decision tree algorithm?
Decision trees are extremely helpful for evaluating your options and choosing between several courses of action. They provide a highly effective structure within which you can lay out options and investigate the possible outcomes of choosing them.
How do we use a Decision tree algorithm?
Before we dive into an algorithm, let’s get familiar with some terminologies:
- Instances: Refer to the vector of features or attributes that define the input space.
- Attribute: A quantity describing an instance.
- Root Node: It is the main node of the tree and further it gets divided into two or more homogeneous sets.
- Decision Node: At this node, the tree gets split into sub-nodes.
- Leaf Node: This is the final node of the tree that gives us the outcome from the target variable.
- Pruning: The splitting process keeps growing the tree until a stopping criterion is reached, but a fully grown tree is likely to overfit the data, leading to poor accuracy on unseen data. Pruning removes branches that contribute little predictive power, trading a small loss in training accuracy for better generalization.
Let’s take a simple example. An insurance agent wants to sell more policies for his firm and wants to find out which age group is most likely to take a policy. There are three columns in total: age group, gender, and policy. Here, policy taken is the target variable, and the first split (on age group) forms the root node.
A real dataset will have many more features, and this would be just one branch in a much bigger tree, but you can’t ignore the simplicity of this algorithm. Feature importance is clear, and relations can be viewed easily. This methodology is known as learning a decision tree from data, and the tree above is called a classification tree because the target is to classify whether a policy is taken. Regression trees are represented in the same manner, except that they predict continuous values, like the price of a house. In general, decision tree algorithms are referred to as CART, or Classification and Regression Trees. Behind the results a decision tree produces, there is a procedure at work: growing a tree involves deciding which features to choose and what conditions to use for splitting, along with knowing when to stop. As a tree generally grows arbitrarily, you will need to trim it down for it to look beautiful.
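The toy policy example above can be sketched with scikit-learn’s CART implementation; the encoded rows below are invented for illustration, not real insurance data.

```python
# A minimal sketch of the toy policy example. The data is made up:
# age group is encoded 0=young, 1=middle, 2=senior; gender is 0/1.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 0], [2, 1]]
# Target: whether a policy was taken (1) or not (0)
y = [0, 0, 1, 1, 1, 0]

tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(X, y)

# Print the learned IF/ELSE rules
print(export_text(tree, feature_names=["age_group", "gender"]))
```

`export_text` makes the learned IF-THEN-ELSE structure visible, which is exactly the interpretability this section highlights.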
Decision trees use multiple algorithms to decide to split a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of resultant sub-nodes. The algorithm selection is also based on the type of target variables. Let us look at some algorithms used in Decision Trees:
ID3 → (Iterative Dichotomiser 3)
C4.5 → (successor of ID3)
CART → (Classification And Regression Trees)
CHAID → (Chi-square Automatic Interaction Detection; performs multi-level splits when computing classification trees)
MARS → (Multivariate Adaptive Regression Splines)
The basic algorithm used in decision trees is known as the ID3 algorithm. The ID3 algorithm builds decision trees using a top-down, greedy approach. Briefly, the steps of the algorithm are:
- Select the best attribute, A.
- Assign A as the decision attribute (test case) for the node.
- For each value of A, create a new descendant of the node.
- Sort the training examples to the appropriate descendant leaf node.
- If the examples are perfectly classified, STOP; otherwise, iterate over the new leaf nodes.
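The "select the best attribute" step can be sketched in plain Python; the tiny weather-style dataset and attribute names below are invented purely for illustration.

```python
# Sketch of one ID3 step: pick the attribute whose split yields the
# highest information gain. The toy rows and labels are made up.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(rows, labels, attributes):
    base = entropy(labels)
    def gain(attr):
        weighted = 0.0
        for value in set(row[attr] for row in rows):
            subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
            weighted += len(subset) / len(labels) * entropy(subset)
        return base - weighted
    return max(attributes, key=gain)

rows = [{"outlook": "sunny", "windy": "no"},
        {"outlook": "sunny", "windy": "yes"},
        {"outlook": "rain",  "windy": "no"},
        {"outlook": "rain",  "windy": "yes"}]
labels = ["no", "no", "yes", "yes"]

# "outlook" perfectly separates the labels here, so ID3 would split on it
print(best_attribute(rows, labels, ["outlook", "windy"]))  # outlook
```

In a full ID3 implementation this selection runs recursively on each descendant node until the examples are perfectly classified.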
A few quantities play a major role in the decision tree algorithm, namely entropy, information gain, and the Gini index.
Entropy is a measure of the randomness in the information being processed. The higher the entropy, the harder it is to draw any conclusions from that information. Flipping a fair coin is an example of an action whose outcome is completely random and therefore carries maximal entropy.
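The coin-flip intuition translates directly into a few lines of code: a fair coin has maximum entropy (1 bit), while a certain outcome has zero.

```python
# Entropy of a label distribution: sum of -p * log2(p) over each class.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

print(entropy(["heads", "tails"]))   # fair coin: 1.0 bit
print(entropy(["heads", "heads"]))   # certain outcome: 0.0 bits
```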
Information gain or IG is a statistical property that measures how well a given attribute separates the training examples according to their target classification. Constructing a decision tree is all about finding an attribute that returns the highest information gain and the smallest entropy.
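Concretely, information gain is the parent node's entropy minus the size-weighted entropy of the child nodes a split produces; the toy labels below are invented for illustration.

```python
# Information gain of a candidate split: parent entropy minus the
# size-weighted entropy of the children.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

# A perfect split removes all uncertainty, so the gain equals the
# parent's entropy (1.0 bit for a balanced binary node).
print(information_gain(["yes", "yes", "no", "no"],
                       [["yes", "yes"], ["no", "no"]]))  # 1.0
```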
You can understand the Gini index as a cost function used to evaluate splits in the dataset. It is calculated by subtracting the sum of the squared probabilities of each class from one. It favors larger partitions and is easy to implement, whereas information gain favors smaller partitions with distinct values.
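The "one minus the sum of squared class probabilities" calculation is short enough to verify by hand:

```python
# Gini impurity: 1 - sum(p_i^2). No logarithm, so it is cheaper to
# compute than entropy; it peaks at 0.5 for a balanced binary node.
from collections import Counter

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["yes", "no"]))    # worst case for two classes: 0.5
print(gini(["yes", "yes"]))   # pure node: 0.0
```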
Entropy and the Gini index serve the same purpose in the algorithm. The main difference between them is computation speed: the entropy formula involves a logarithm, which takes more time to compute on large datasets. The diagram below shows the basic outline of the decision tree workflow: the dataset is split in two, with one part used for training and the other for testing. Training proceeds as discussed above, and model evaluation and performance measurement take place on the test set.
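That train/test workflow looks like this in scikit-learn; the bundled iris dataset is used here as a stand-in for a real dataset.

```python
# Minimal train/test workflow: hold out 30% of the data, fit a shallow
# tree on the rest, and measure accuracy on the unseen test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

model = DecisionTreeClassifier(max_depth=3, random_state=42)
model.fit(X_train, y_train)                  # training happens here
acc = accuracy_score(y_test, model.predict(X_test))
print(f"test accuracy: {acc:.2f}")
```

Capping `max_depth` is one simple way to apply the stopping/pruning idea discussed earlier.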
Advantages:
- Easy to use and understand.
- Can handle both categorical and numerical data.
- Resistant to outliers, and hence requires little data preprocessing.
- Can be used to build larger classifiers via ensemble methods.
Disadvantages:
- Prone to overfitting.
- Can be unstable, because small variations in the data might result in a completely different tree being generated. This is called variance, which can be lowered by methods like bagging and boosting.
- Decision tree learners create biased trees if some classes dominate, so it is recommended to balance the dataset before fitting a decision tree.
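The bagging remedy mentioned above is a one-liner in scikit-learn; this is a sketch using the bundled iris data rather than a real problem.

```python
# Bagging averages many trees fit on bootstrap samples, which lowers
# the variance of a single unstable tree. BaggingClassifier bags
# decision trees by default.
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
bagged = BaggingClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(bagged, X, y, cv=5)
print(f"mean CV accuracy: {scores.mean():.2f}")
```

Random forests take the same idea further by also randomizing the features considered at each split.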
A practical example of the decision tree was worked through using a dataset from Kaggle; see the accompanying sample code for the decision tree. I will come out with episode #03 as soon as possible; until then, happy learning.
If anything is worth doing, do it with all your heart. - Buddha