In my previous article, I presented the Decision Tree Regressor algorithm. If you haven't read that article, I would urge you to read it before continuing, because the Decision Tree is the main building block of a Random Forest.

Random Forest is a flexible, easy-to-use machine learning algorithm that produces good results most of the time with minimal hyper-parameter tuning. It has gained popularity due to its simplicity and the fact that it can be used for both classification and regression tasks. In this article, I will present the Random Forest Regression model in detail.


Table of Contents

  • Introduction
  • Random sampling of training observations
  • Random subsets of features for splitting nodes
  • Why a Random forest is better than a single decision tree?
  • Why a Random Forest reduces overfitting?
  • How does it work?
  • When to use Random Forests?
  • Bagging
  • Advantages & Disadvantages
  • Conclusion
  • References

Introduction

Random Forest is an ensemble machine learning technique capable of performing both regression and classification tasks using multiple decision trees and a statistical technique called bagging. Bagging and boosting are two of the most popular ensemble techniques; bagging mainly aims to reduce high variance, while boosting mainly aims to reduce high bias. Rather than simply averaging the predictions of trees, a Random Forest relies on two key concepts that give it the name random:

  1. Random sampling of training observations when building trees
  2. Random subsets of features for splitting nodes

In other words, a Random Forest builds multiple decision trees and merges their predictions to get a more accurate and stable prediction than any individual decision tree would give.


Random sampling of training observations

Each tree in a random forest learns from a random sample of the training observations. The samples are drawn with replacement, known as bootstrapping, which means that some observations will be used multiple times in a single tree. The idea is that by training each tree on a different sample, although each tree might have high variance with respect to its particular set of training data, the forest as a whole will have lower variance, and not at the cost of increasing the bias. In the Sklearn implementation of Random Forest, the sub-sample size of each tree is by default the same as the original input sample size, with the samples drawn with replacement if bootstrap=True. If bootstrap=False, each tree is trained on the whole dataset, so no randomness comes from the sampling of observations.
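
To make bootstrapping concrete, here is a minimal sketch using NumPy. The ten row indices and the variable names are illustrative assumptions, not part of any library's internals; the point is only to show what drawing a same-size sample with replacement looks like.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

n_samples = 10
indices = np.arange(n_samples)  # stand-in for the row indices of a training set

# Draw a bootstrap sample: same size as the original, sampled with replacement.
bootstrap_idx = rng.choice(indices, size=n_samples, replace=True)

# Rows that were never drawn are "out-of-bag" for this particular tree.
oob_idx = np.setdiff1d(indices, bootstrap_idx)

print("bootstrap sample:", np.sort(bootstrap_idx))
print("out-of-bag rows: ", oob_idx)
```

Some indices typically appear more than once in the bootstrap sample, and the rows that are missing from it are the ones the corresponding tree never sees.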


Random subsets of features for splitting nodes

The other main concept in the random forest is that only a random subset of the features is considered when deciding how to split a node. In Sklearn this can be set with max_features="sqrt", which considers sqrt(n_features) features at each split: if there are 16 features, then at each node in each tree only 4 randomly chosen features will be considered for splitting the node.
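
As a rough illustration of the 16-feature case, the sketch below fits a small forest with max_features="sqrt" and reads back how many features each tree considers per split. The synthetic data is an assumption made up for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))            # synthetic data with 16 features
y = X[:, 0] + 0.1 * rng.normal(size=200)  # target driven by the first feature

forest = RandomForestRegressor(n_estimators=10, max_features="sqrt", random_state=0)
forest.fit(X, y)

# Each fitted tree stores the resolved number of features tried at every split.
features_per_split = forest.estimators_[0].max_features_
print("features considered per split:", features_per_split)
```

With 16 input features the resolved value should be 4, matching the square-root rule described above.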


Why a Random Forest is better than a single decision tree?

The fundamental idea behind a random forest is to combine the predictions made by many decision trees into a single model. Individually, predictions made by decision trees may not be accurate but combined together, the predictions will be closer to the true value on average.


Each individual tree brings its own sources of information to the problem, because it considers a random subset of features when forming questions and has access to a random set of the training data points. If we built only one tree, we would take advantage of only its limited scope of information, but by combining many trees' predictions our net information is much greater. If, instead, every tree used the same dataset, every tree would be affected in the same way by an anomaly or an outlier.

This increased diversity in the forest leads to more robust overall predictions, and to the name ‘random forest.’ When it comes time to make a prediction, the random forest regression model takes the average of all the individual decision tree estimates.
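
The averaging step can be checked directly in Sklearn. The sketch below, on made-up synthetic data, averages the predictions of the individual fitted trees by hand and compares the result with the forest's own prediction.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(100, 3))             # synthetic training features
y = X[:, 0] ** 2 + 0.05 * rng.normal(size=100)   # synthetic noisy target

forest = RandomForestRegressor(n_estimators=20, random_state=0).fit(X, y)

X_new = rng.uniform(0, 1, size=(5, 3))
# One row of predictions per tree, one column per new observation.
per_tree = np.stack([tree.predict(X_new) for tree in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print(np.allclose(manual_average, forest.predict(X_new)))
```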


Why a Random Forest reduces overfitting?

The objective of a machine learning model is to generalize well to new data it has never seen before. Overfitting occurs when a very flexible model (high capacity) memorizes the training data by fitting it closely. The problem is that the model learns not only the actual relationships in the training data but also any noise that is present. A flexible model is said to have high variance because the learned parameters (such as the structure of the decision tree) will vary considerably with the training data.

On the other hand, an inflexible model is said to have high bias because it makes assumptions about the training data (it’s biased towards pre-conceived ideas of the data). An inflexible model may not have the capacity to fit even the training data and in both cases — high variance and high bias — the model is not able to generalize well to new data.

The balance between creating a model that is so flexible it memorizes the training data versus an inflexible model that can’t learn the training data is known as the bias-variance tradeoff and is a foundational concept in machine learning.

As I stated in my previous article, a decision tree is prone to overfitting because it can keep growing until it has exactly one leaf node for every single observation. If the maximum depth is limited to a small value such as 2 (at most two levels of splits), the predictions on the training data are no longer 100% correct. We have reduced the variance of the decision tree, but at the cost of increasing the bias.

That said, a small change in the training data or the initial parameters of a decision tree can cause the model's predictions to vary a lot, which qualifies it as an unstable model. That is the reason we apply bagging to unstable models like decision trees: combining many trees into a single ensemble model, known as the random forest, reduces the variance (good) without the large increase in bias (bad) that limiting the depth of a single tree would bring.
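
The sketch below is one way to see this trade-off for yourself. It compares a fully grown tree, a depth-limited tree, and a random forest on a synthetic noisy sine curve; the dataset, the model settings, and the names are assumptions chosen for illustration, and the exact scores will depend on them.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 6, size=(300, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.normal(size=300)  # true signal plus noise
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = {
    "deep tree": DecisionTreeRegressor(random_state=0),
    "shallow tree (max_depth=2)": DecisionTreeRegressor(max_depth=2, random_state=0),
    "random forest": RandomForestRegressor(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # Store (training R^2, test R^2) for each model.
    scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    print(f"{name}: train R^2 = {scores[name][0]:.3f}, test R^2 = {scores[name][1]:.3f}")
```

The fully grown tree fits the training set essentially perfectly, which is the memorization described above; comparing the test scores shows how each model copes with data it has not seen.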


How does it work?

Let's understand Random Forest step by step:

Step 1: Samples are taken repeatedly from the training data so that each data point has an equal probability of being selected, and all the samples have the same size as the original training set.

Let's say we have the following data:

x = {0.1, 0.5, 0.4, 0.8, 0.6} and y = {0.1, 0.2, 0.15, 0.11, 0.13}, where x is an independent variable with 5 data points and y is the dependent variable.

Now bootstrap samples are taken with replacement from the above data set. If n_estimators (the number of trees in the random forest) is set to 3, then:

The first tree will have a bootstrap sample of size 5 (same as the original dataset), say x1 = {0.5, 0.1, 0.1, 0.6, 0.6}. Likewise, for the second and third trees:

x2 = {0.4, 0.8, 0.6, 0.8, 0.1}

x3 = {0.1, 0.5, 0.4, 0.8, 0.8}

Step 2: A decision tree regressor is trained on each bootstrap sample drawn in the above step, and a prediction is recorded from each tree.

Step 3: The ensemble prediction is then calculated by averaging the predictions of the above trees, producing the final prediction.
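
The three steps can be sketched by hand with the five data points from the example. This is a simplified illustration rather than how Sklearn implements its forest internally: the random seed, the query point 0.45, and the variable names are assumptions, and the bootstrap samples drawn here will differ from the x1, x2, x3 listed in the text.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

X = np.array([0.1, 0.5, 0.4, 0.8, 0.6]).reshape(-1, 1)
y = np.array([0.1, 0.2, 0.15, 0.11, 0.13])

rng = np.random.default_rng(42)
n_estimators = 3
trees = []
for _ in range(n_estimators):
    # Step 1: bootstrap sample of row indices, same size as the data.
    idx = rng.integers(0, len(X), size=len(X))
    # Step 2: fit one decision tree on that bootstrap sample.
    trees.append(DecisionTreeRegressor(random_state=0).fit(X[idx], y[idx]))

# Step 3: average the per-tree predictions for a new point.
X_new = np.array([[0.45]])
tree_preds = np.array([tree.predict(X_new)[0] for tree in trees])
ensemble_pred = tree_preds.mean()

print("per-tree predictions:", tree_preds)
print("ensemble prediction: ", ensemble_pred)
```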


When to use Random Forests?

There probably isn't any situation where Random Forest is not going to be at least somewhat useful, so it is always worth trying. The real question might be in what situations we should try other things as well. The short answer is that for unstructured data (image, sound, etc.) and time series data, you almost certainly want to try deep learning too.


Bagging

Random forest — a way of bagging decision trees.

Bagging is the idea of combining the predictions of n different models, each of which is only somewhat predictive, while their predictions are not correlated with each other. That would mean that the n models have profoundly different insights into the relationships in the data. If you take the average of those n models, you are effectively bringing in the insights from each of them. This idea of averaging models is a technique called ensembling.

What if we created a whole bunch of trees (big, deep, massively overfit trees), but for each of them we only pick a random 1/10 of the data? Let’s say we do that a hundred times, with a different random sample every time. Each tree would probably overfit terribly, but since they all use different random samples, they all overfit in different ways on different things. In other words, they all have errors, but the errors are random. The average of a bunch of uncorrelated random errors tends toward zero. If we take the average of these trees, each of which has been trained on a different random subset, the errors largely average out and what is left is the true relationship. That’s the random forest.

In a way, we average 10 crappy models and we get a good model?

Exactly, and the reason is that the crappy models are based on different random subsets, so their errors are not correlated with each other. If the errors were correlated, this would not work.

The key insight here is to construct multiple models which are better than nothing and where the errors are, as much as possible, not correlated with each other.
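
A small simulation can illustrate why uncorrelated errors matter. It is a sketch under simplifying assumptions: each "model error" is just a draw from a standard normal distribution, which is an idealization and not the error of a real tree.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, n_trials = 100, 10_000

# Independent errors: every model gets its own draw from N(0, 1).
independent = rng.normal(size=(n_trials, n_models))

# Fully correlated errors: all models share the same error within a trial.
shared = rng.normal(size=(n_trials, 1))
correlated = np.repeat(shared, n_models, axis=1)

# Spread of the ensemble error after averaging over the 100 models.
std_independent = independent.mean(axis=1).std()
std_correlated = correlated.mean(axis=1).std()

print(f"std of averaged independent errors: {std_independent:.3f}")
print(f"std of averaged correlated errors:  {std_correlated:.3f}")
```

Averaging 100 independent errors shrinks their spread by roughly a factor of 10 (the square root of 100), while averaging perfectly correlated errors does not shrink it at all.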

The subsets that are being selected, are they exclusive? Can there be overlaps?

By default, Sklearn selects n random rows with replacement out of a dataset with n rows, which is called bootstrapping (this can be deactivated by setting bootstrap=False). In addition, randomness can be increased by considering only a selection of the features at each node split, which decorrelates the trees in the forest even more.

As a result, random forests have an in-built validation mechanism: because each tree is trained on only part of the data, the rows it did not see can be used to estimate an out-of-bag error. The model’s performance can be calculated using the roughly 36.8% of the sample left out of each tree's bootstrap sample (the Out-of-bag (OOB) score).
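
In Sklearn the OOB score can be requested with oob_score=True and read from the oob_score_ attribute after fitting. The sketch below uses synthetic data invented for the example; for a regressor, the reported score is an R^2 value.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(500, 2))                     # synthetic features
y = X[:, 0] ** 2 + X[:, 1] + 0.2 * rng.normal(size=500)   # synthetic noisy target

forest = RandomForestRegressor(
    n_estimators=100,
    bootstrap=True,    # OOB estimates require bootstrap sampling
    oob_score=True,    # score each row using only trees that did not see it
    random_state=0,
)
forest.fit(X, y)

train_r2 = forest.score(X, y)
oob_r2 = forest.oob_score_
print(f"training R^2: {train_r2:.3f}")
print(f"OOB R^2:      {oob_r2:.3f}")
```

The OOB score is usually lower than the training score, since every row is evaluated only by trees that never saw it during training.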

How is 36.8% calculated?

A bootstrap sample from D = {X1,...,Xn} is a sample of size n drawn with replacement from D.

In a bootstrap sample, some elements of D:

  • will show up multiple times
  • won’t show up at all.

Each Xi has a probability of 1/n of being selected at any single draw, and a probability of (1−1/n) of not being selected. Since we draw n elements with replacement from D, each Xi has a probability of (1−1/n)^n of not showing up at all. As n grows, (1−1/n)^n approaches 1/e ≈ 0.368.

So we expect ~63.2% of the elements of D to show up at least once and ~36.8% not to show up at all.
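
The figure can be checked numerically. The sketch below evaluates (1−1/n)^n for a few values of n and also measures the left-out fraction in one simulated bootstrap sample; the sample sizes and the seed are arbitrary choices for illustration.

```python
import math
import numpy as np

# Analytical value for increasing n, compared with the limit 1/e.
limit = math.exp(-1)
for n in (10, 100, 1_000, 10_000):
    print(f"n = {n:>6}: (1 - 1/n)^n = {(1 - 1 / n) ** n:.4f}")
print(f"limit 1/e        = {limit:.4f}")

# Empirical check: draw one bootstrap sample and count the rows never picked.
rng = np.random.default_rng(0)
n = 10_000
sample = rng.integers(0, n, size=n)            # n draws with replacement
left_out_fraction = 1 - len(np.unique(sample)) / n
print(f"simulated left-out fraction = {left_out_fraction:.4f}")
```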


Advantages & Disadvantages

Advantages

  • Reduction in overfitting: by averaging several trees, there is a significantly lower risk of overfitting.
  • It is very easy to measure the relative importance of each feature for the prediction. Sklearn, for example, exposes this through the feature_importances_ attribute of a fitted forest. Feature importances provide a nice way to start getting insights about the data.
  • It requires very little feature engineering. In many types of situation, you do not have to take the log of the data or multiply interactions together.
  • It has few, if any, statistical assumptions. It does not assume that your data is normally distributed, that the relationship is linear, or that you have specified interactions.
  • It can be used for both regression and classification tasks.
  • It is an easy-to-use algorithm, because its default hyperparameters often produce a good prediction result. The number of hyperparameters is also not that high and they are straightforward to understand.
  • It has an in-built validation mechanism named Out-of-bag (OOB) score.
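
As an illustration of the feature-importance point in the list above, the sketch below fits a forest on synthetic data where only the first of four features drives the target, then prints Sklearn's impurity-based importances. The data and names are assumptions made up for this example.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = 3 * X[:, 0] + 0.1 * rng.normal(size=500)  # only feature 0 carries signal

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

# Impurity-based importances, normalized so that they sum to 1.
importances = forest.feature_importances_
for i, importance in enumerate(importances):
    print(f"feature {i}: {importance:.3f}")
```

Feature 0 should dominate here, while the three noise features receive importances close to zero.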

Disadvantages

  • It’s more complex and computationally expensive than the decision tree algorithm. This can make the algorithm slow and ineffective for real-time predictions, as a more accurate prediction requires more trees.
  • It cannot extrapolate at all to data outside the range it has seen.
  • The downside of random forests, as of other bagging algorithms, is their lack of interpretability.
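
The extrapolation limit in the list above is easy to demonstrate. In this sketch, which uses a made-up linear trend y = 2x, the forest is trained on x between 0 and 10 and then asked to predict far outside that range.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

X_train = np.linspace(0, 10, 200).reshape(-1, 1)
y_train = 2 * X_train[:, 0]              # simple linear trend, y = 2x

forest = RandomForestRegressor(n_estimators=50, random_state=0).fit(X_train, y_train)

X_outside = np.array([[20.0], [50.0]])   # well beyond the training range
preds = forest.predict(X_outside)

print("predictions outside the range:", preds)
print("largest target seen in training:", y_train.max())
```

Because every tree predicts an average of training targets from one of its leaves, the forest cannot return anything above the largest target it saw, so the trend is not continued beyond x = 10.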

Conclusion

Random Forest as a technique may seem intimidating at first, but it is just many simple ideas combined to yield a very accurate model that can ‘learn’ from past data. There are two fundamental ideas behind it:

  1. Constructing a flowchart of questions and answers leading to a decision
  2. The wisdom of the (random and diverse) crowd

It is the combination of these basic ideas that leads to the power of the random forest model.

Thanks for reading and I am looking forward to hearing your questions :)
Stay tuned and Happy Machine Learning.


References