Today we look at one of the most frequently used supervised learning algorithms: Random Forest. It works for both regression and classification problems, and it is an ensemble technique. In this episode of the Blog Series, we will examine how individual decision trees are combined to make a random forest, and ultimately discover why random forests are so good at what they do.
Why use Random Forest?
Below are some points that explain why we should use the Random Forest algorithm:
- It takes less training time than many comparable algorithms.
- It predicts output with high accuracy and runs efficiently even on large datasets.
- It can maintain accuracy even when a large proportion of the data is missing.
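To make the points above concrete, here is a minimal sketch of training a random forest classifier with scikit-learn (assuming it is installed); the dataset is synthetic and purely illustrative:

```python
# Minimal illustrative example: fit a random forest on a synthetic dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic classification data stands in for a real table of features.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# 100 trees is a common default; more trees generally improve stability.
clf = RandomForestClassifier(n_estimators=100, random_state=42)
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
```

On held-out data this typically reaches high accuracy with no manual feature scaling or tuning, which is part of why the algorithm is so popular.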
One of the common problems with decision trees, especially deep trees grown on tables with many columns, is that they tend to overfit: sometimes the tree simply memorizes the training data. Here are typical examples of overfitting decision trees, for categorical and continuous data:
If the client is female, between 15 and 25, from the UK, likes ice cream, has a German friend, hates birds, and ate pancakes on August 31st, 2021, she is likely to download Pokémon Go.
If rainfall differs across the regions of a country because of various local factors, a single tree that predicts rainfall from past observations will latch onto region-specific quirks (time-series forecasting).
Random Forest prevents this problem: it is an ensemble of multiple decision trees rather than a single tree. And the more decision trees in the Random Forest, the better the generalization.
More precisely, Random Forest works as follows:
- Selects k features (columns) from the dataset (table) with a total of m features randomly (where k<<m). Then, it builds a Decision Tree from those k features.
- Repeats n times so that you have n Decision Trees built from different random combinations of k features (or a different random sample of the data, called bootstrap sample).
- Passes the new record through each of the n built Decision Trees to predict the outcome, and stores each predicted outcome (target), so that you have a total of n outcomes from the n Decision Trees.
- Counts the votes for each predicted target and takes the mode (the most frequent target). In other words, the highest-voted predicted target is the final prediction of the random forest algorithm.
- In case of a regression problem, for a new record, each tree in the forest predicts a value for Y (output). The final value can be calculated by taking the average of all the values predicted by all the trees in a forest. Or, in case of a classification problem, each tree in the forest predicts the category to which the new record belongs. Finally, the new record is assigned to the category that wins the majority vote.
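The steps above can be sketched from scratch. The function names below are my own, and this version samples k features once per tree, as the steps describe (scikit-learn's implementation instead samples features at every split):

```python
# Illustrative from-scratch sketch of the random forest procedure.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

def fit_forest(X, y, n_trees=25, k=None):
    """Build n_trees decision trees, each on a bootstrap sample
    using a random subset of k out of m features (k << m)."""
    m = X.shape[1]
    k = k or max(1, int(np.sqrt(m)))  # sqrt(m) is a common choice for k
    forest = []
    for _ in range(n_trees):
        rows = rng.integers(0, len(X), len(X))        # bootstrap sample
        cols = rng.choice(m, size=k, replace=False)   # random k features
        tree = DecisionTreeClassifier().fit(X[rows][:, cols], y[rows])
        forest.append((tree, cols))
    return forest

def predict_forest(forest, X):
    """Each tree votes; the mode of the votes is the final prediction."""
    votes = np.stack([tree.predict(X[:, cols]) for tree, cols in forest])
    return np.array([np.bincount(col.astype(int)).argmax() for col in votes.T])

# Demo on synthetic data.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
forest = fit_forest(X, y)
preds = predict_forest(forest, X)
print((preds == y).mean())
```

For regression, the only change in this sketch would be replacing the majority vote in `predict_forest` with the mean of the trees' predictions.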
Advantages of Random Forest
- Can be used for both classification and regression problems: Random Forest works well with both categorical and numerical features.
- Reduction in overfitting: by averaging several trees, there is a significantly lower risk of overfitting.
- Makes a wrong prediction only when more than half of the base classifiers are wrong: Random Forest is very stable. Even if a new data point is introduced into the dataset, the overall algorithm is not affected much, since new data may impact one tree but is very unlikely to impact all the trees.
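The "wrong only when more than half are wrong" claim can be illustrated with a small calculation. Assuming (for illustration) that each of n trees errs independently with probability p, the forest's error is the probability that a majority of trees err:

```python
# Hedged illustration of the majority-vote argument, under the (idealized)
# assumption that the trees' errors are independent.
from math import comb

def ensemble_error(n, p):
    """P(more than half of n classifiers are wrong), each wrong w.p. p."""
    return sum(
        comb(n, i) * p**i * (1 - p) ** (n - i)
        for i in range(n // 2 + 1, n + 1)
    )

# With 25 trees each wrong 35% of the time, the ensemble errs far less often.
print(ensemble_error(25, 0.35))
```

Real trees in a forest are correlated, so the true gain is smaller than this idealized number, but the direction of the effect is the same.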
Disadvantages of Random Forest
- Random forests have been observed to overfit some datasets with noisy classification or regression tasks.
- More complex and computationally expensive than the decision tree algorithm.
- Due to their complexity, they require much more time to train than other comparable algorithms.
Thus, Random Forest is an algorithm that builds n decision trees, each from a random selection of k out of the total m features, and takes the mode (or the average, for regression) of their predicted outcomes.
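For the regression case, the averaging in this summary can be checked directly against scikit-learn's `RandomForestRegressor`, whose prediction is the mean of its individual trees' predictions (the data below is synthetic):

```python
# Verify that the forest's regression output is the mean of its trees.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(500, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 500)  # noisy sine curve

reg = RandomForestRegressor(n_estimators=200, random_state=1).fit(X, y)

# Average the 200 individual trees' predictions by hand.
trees_mean = np.mean([tree.predict(X[:5]) for tree in reg.estimators_], axis=0)
print(np.allclose(reg.predict(X[:5]), trees_mean))  # expected: True
```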
“Time is more precious than anything. Make it accountable.”