In my previous article, I presented the Random Forest Regressor model. If you haven't read that article, I would urge you to read it before continuing. In simple terms, a random forest is a way of bagging decision trees.

In this article, I will present in detail some advanced tricks for the Random Forest Regression model.


Table of Contents

  • Feature Importance
  • Paradox of Random Forests
  • Out-of-bag (OOB) score
  • Important Hyperparameters
  • Hyperparameter Tuning
  • Advanced Tip
  • Conclusion
  • References‌

Feature Importance

A great quality of the random forest algorithm is that it is very easy to measure the relative importance of each feature for the prediction. Sklearn provides a great tool for this: it measures the importance of a feature by looking at how much the tree nodes that use that feature reduce impurity, averaged across all trees in the forest. It computes this score automatically for each feature after training and scales the results so that the importances sum to 1.

Looking at the feature importance, you can decide which features you may want to drop because they contribute little or nothing to the prediction process. This is important because a general rule in machine learning is that the more features you have, the more likely your model is to suffer from overfitting.

Below you can see a code snippet that calculates and plots the feature importance of a random forest model:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

m = RandomForestRegressor(n_estimators=40, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)

def rf_feat_importance(m, df):
    # One row per feature, sorted from most to least important
    return pd.DataFrame({'cols': df.columns, 'imp': m.feature_importances_}
                       ).sort_values('imp', ascending=False)

# X_train is assumed to be the pandas DataFrame the model was fitted on
fi = rf_feat_importance(m, X_train)
print(fi[:10])

def plot_fi(fi):
    return fi.plot(x='cols', y='imp', kind='barh', figsize=(12, 7), legend=False)

plot_fi(fi[:30]);
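As a rough sketch of how the importances could be used to drop weak features, the snippet below trains a forest on synthetic data, keeps only the features above a cut-off, and retrains. The synthetic dataset (from make_regression), the column names and the threshold of 0.005 are all assumptions made for illustration; a sensible threshold depends on your own data.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic data: 10 features, only 3 of which carry signal (illustrative assumption)
X_arr, y_demo = make_regression(n_samples=500, n_features=10, n_informative=3,
                                noise=1.0, random_state=0)
X_demo = pd.DataFrame(X_arr, columns=[f"f{i}" for i in range(10)])

m_all = RandomForestRegressor(n_estimators=40, n_jobs=-1, oob_score=True,
                              random_state=0)
m_all.fit(X_demo, y_demo)

fi_demo = pd.DataFrame({'cols': X_demo.columns,
                        'imp': m_all.feature_importances_})

threshold = 0.005  # hypothetical cut-off, not a recommended value
to_keep = fi_demo.loc[fi_demo.imp > threshold, 'cols'].tolist()

# Retrain on the reduced feature set and compare the OOB scores
m_small = RandomForestRegressor(n_estimators=40, n_jobs=-1, oob_score=True,
                                random_state=0)
m_small.fit(X_demo[to_keep], y_demo)

print(len(to_keep), "features kept")
print("OOB R^2, all features: ", m_all.oob_score_)
print("OOB R^2, kept features:", m_small.oob_score_)
```

If the score with fewer features is about the same or better, the dropped features were probably not helping.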

‌Paradox of Random Forests

An effective machine learning model is accurate at finding the relationships in the training data and generalizes well to new data. In bagging, this means you want each individual estimator to be as predictive as possible, while the predictions of the individual trees are as uncorrelated as possible.

The research community found that creating uncorrelated trees seems to matter more than creating more accurate trees. In Sklearn, there is another class called ExtraTreesRegressor (ExtraTreesClassifier for classification), which is an extremely randomized trees model. Rather than trying every split of every variable, it randomly tries a few splits of a few variables, which makes training much faster, so it can build more trees and generalize better. If your individual models are weak, you just need more trees to get a good end model.
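One way to see why low correlation matters so much is the standard identity for the variance of an average of equally correlated predictors: with per-tree variance σ² and pairwise correlation ρ, the average of n trees has variance ρσ² + (1 − ρ)σ²/n. The sketch below just evaluates that formula; it is an idealization (real trees are not identically distributed or equally correlated), and the function name is made up for this example.

```python
def ensemble_variance(sigma2, rho, n_trees):
    """Variance of the mean of n_trees predictors, each with variance sigma2
    and pairwise correlation rho (idealized, equal-correlation setting)."""
    return rho * sigma2 + (1 - rho) * sigma2 / n_trees

# Adding trees only shrinks the (1 - rho) part; the rho part never goes away
for rho in (0.9, 0.5, 0.1):
    row = [round(ensemble_variance(1.0, rho, n), 3) for n in (1, 10, 100)]
    print("rho =", rho, "-> variance with 1, 10, 100 trees:", row)
```

With highly correlated trees the variance barely drops as trees are added, while with weakly correlated trees it keeps falling, which is why more, less correlated trees can beat fewer, more accurate ones.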


‌Out-of-bag (OOB) score

We have all encountered a situation where the dataset was so small that it was difficult to pull out a validation set, because doing so would not leave enough data to build a good model. However, random forests have a very clever trick called out-of-bag (OOB) error which can handle this.

As we have seen, each individual tree of a random forest is trained on a bootstrap sample of the rows of the dataset, which means that some of the rows (approx. 36.8%) did not get used for training that tree. We can take advantage of this fact and pass those unused rows through the first tree, treating them as a validation set. For the second tree, we pass through the rows that were not used for the second tree, and so on for all of the trees of the Random Forest.

Effectively, we now have a different validation set for each tree. To calculate our prediction for a row, we average the predictions of all the trees where that row was not used for training. If you have hundreds of trees, it is very likely that all of the rows are going to appear many times in these out-of-bag samples. You can then calculate RMSE, R², etc. on these out-of-bag predictions.
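The 36.8% figure mentioned above comes from the bootstrap itself: when n rows are drawn with replacement from n rows, the chance that a given row is never picked is (1 − 1/n)ⁿ, which approaches 1/e ≈ 0.368. Here is a small sketch that checks this both with the formula and with a simulation; the function names are made up for this example.

```python
import math
import random

def oob_fraction(n_rows):
    # Probability that a given row is never drawn in n_rows draws with replacement
    return (1 - 1 / n_rows) ** n_rows

def simulated_oob_fraction(n_rows, seed=0):
    # Draw one bootstrap sample and measure the share of rows left out
    rng = random.Random(seed)
    drawn = {rng.randrange(n_rows) for _ in range(n_rows)}
    return 1 - len(drawn) / n_rows

print("formula:   ", oob_fraction(10_000))
print("simulation:", simulated_oob_fraction(10_000))
print("1/e:       ", math.exp(-1))
```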

In Sklearn you could just define the following parameter oob_score=True before you train your Random Forest Regressor:‌

from sklearn.ensemble import RandomForestRegressor

m = RandomForestRegressor(n_estimators=40, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)

print(m.oob_score_) # R^2 score

Setting oob_score=True will do exactly this and add an attribute called oob_score_ to the model, as you can see above.

Naturally, the OOB score helps a lot when determining the optimal hyperparameters. There will be quite a few hyperparameters that we are going to set, and we would like to find some automated way to set them. One way to do that is grid search. Sklearn has a class called GridSearchCV: you pass in the hyperparameters you want to tune and all of the values of these hyperparameters you want to try. It will run your model on every possible combination of these values and tell you which one is the best. Note that GridSearchCV itself ranks the combinations by cross-validation score; the OOB score is a great alternative criterion when you cannot spare a validation set.
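Since GridSearchCV scores candidates by cross-validation rather than by OOB, one simple way to select on the OOB score is a manual loop over the grid. The following is a minimal sketch under assumptions: the data is synthetic, the grid values are arbitrary, and ParameterGrid is used only to enumerate the combinations.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import ParameterGrid

# Synthetic regression data (illustrative assumption)
X_demo, y_demo = make_regression(n_samples=500, n_features=10, n_informative=5,
                                 noise=5.0, random_state=0)

# Arbitrary example grid; ParameterGrid yields one dict per combination
param_grid = ParameterGrid({"min_samples_leaf": [1, 5],
                            "max_features": [0.5, 1.0]})

results = []
for params in param_grid:
    m = RandomForestRegressor(n_estimators=40, oob_score=True, n_jobs=-1,
                              random_state=0, **params)
    m.fit(X_demo, y_demo)
    results.append((m.oob_score_, params))  # OOB R^2 for this combination

best_oob, best_params_oob = max(results, key=lambda r: r[0])
print("best OOB R^2:", best_oob)
print("best params: ", best_params_oob)
```

No rows are held out here: every combination is trained on the full data and judged on its out-of-bag predictions.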

Wouldn’t oob_score_ always be lower than the one for the entire forest?

The accuracy tends to be lower because each row appears in fewer trees in the OOB samples than it does in the full set of trees. So OOB R² will slightly underestimate how generalizable the model is, but the more trees you add, the less serious that underestimation is.

Does oob_score=True affect training?

All oob_score=True does is this: whatever your subsample is (it might be a bootstrap sample or a subsample), take all of the other rows (for each tree), put them into a different dataset, and calculate the error on those. So it doesn't actually impact training at all. It just gives you an additional metric, the OOB error. So if you don't have a validation set, this allows you to get a kind of quasi validation set for free.


Important Hyperparameters‌

Here I will talk about the hyperparameters of Sklearn's built-in random forest regression model.

n_estimators: is just the number of trees the algorithm builds before taking the average of the predictions. In general, a higher number of trees increases the performance and makes the predictions more stable, but it also slows down the computation.

max_features: is the maximum number of features Random Forest considers when splitting a node. It sounds similar to bootstrapping (which controls the rows each tree sees), but it is actually quite a different idea: at every split, only a random subset of the columns is considered. This means there will be fewer combinations of variable-values to check when deciding on a split of a node (i.e. faster training time). The idea is that the less correlated your trees are with each other, the better.

Imagine that you have one feature that is just super predictive. It's so predictive that every random subsample you look at always starts out by splitting on that same feature. So every tree will split on the same thing the first time, and you will not get much variation in those trees. But there may be some other interesting initial splits, because they create different interactions of variables.

With max_features=0.5, half the time that feature won't even be available at the top of the tree, so roughly half the trees are going to have a different initial split. This gives us more variation, and therefore it can help us create more generalized trees that have less correlation with each other, even though the individual trees probably won't be as predictive.
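The "half the time" claim can be checked with a small simulation. The sketch below repeatedly draws a random candidate subset of features, as happens at a split, and counts how often a given feature (index 0, standing in for the super predictive one) is left out. The function name and the choice of 20 features are assumptions for illustration; the subset size follows Sklearn's documented rule for a float max_features.

```python
import random

def fraction_without_feature(n_features, max_features, n_trials=10_000, seed=0):
    """Share of simulated splits whose candidate subset excludes feature 0."""
    rng = random.Random(seed)
    # A float max_features is turned into max(1, int(max_features * n_features))
    k = max(1, int(max_features * n_features))
    missing = 0
    for _ in range(n_trials):
        candidates = rng.sample(range(n_features), k)
        if 0 not in candidates:
            missing += 1
    return missing / n_trials

print(fraction_without_feature(20, 0.5))  # close to 0.5
print(fraction_without_feature(20, 1.0))  # 0.0, the feature is always available
```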

The reason we do that is that we want the trees to be as rich as possible. Particularly, if you were only doing a small number of trees (e.g. 10 trees) and you picked the same column set all the way through the tree, you are not really getting much variety in what kind of things it can find.

So this way, at least in theory, picking a different random subset of features at every decision point should give us a better set of trees. Good values to try are 1.0 or None (both use all features), 0.5, 'log2' and 'sqrt'.

The overall effect of the max_features is:

  • each individual tree is probably going to be less accurate
  • but the trees are going to be more varied.

The Sklearn docs show an example of different max_features methods with increasing numbers of trees. In that example, using a subset of features on each split requires more trees, but results in better models.

min_samples_leaf: is the minimum number of samples that must end up in each leaf node. For example, min_samples_leaf=3 means that a split is only allowed if it leaves at least 3 samples in each branch, so the tree stops splitting sooner (before, we were going all the way down to 1). This means there will be one or two fewer levels of decisions being made, which means there are about half as many decision criteria to train (i.e. faster training time). For each tree, rather than just taking one point, we are taking the average of at least three points, so we would expect each tree to generalize better. The numbers that work well are 1, 3, 5, 10, 25, 100, ... but this is relative to your overall dataset size. As you increase the value, if you notice that by the time you get to 10 it's already getting worse, then there is no point going further. If you get to 100 and it's still getting better, then you can keep trying.

Each time we double min_samples_leaf, we are removing one layer from the tree and halving the number of leaf nodes (e.g. 10k instead of 20k). The result of increasing min_samples_leaf is that each of our leaf nodes now has more than one sample in it, so we get a more stable average in each tree. We have a little less depth (i.e. we have fewer decisions to make) and a smaller number of leaf nodes. So again, we would expect that each estimator would be less predictive, but the estimators would also be less correlated. So this might help us avoid overfitting.

Increasing min_samples_leaf will speed up our training, because there is one less set of decisions to make, and it may help the model generalize better.
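To make the effect on tree size concrete, here is a small sketch that fits single decision trees with different min_samples_leaf values and reports the number of leaves and the depth. The synthetic data and the values 1, 5 and 25 are assumptions for illustration; a single DecisionTreeRegressor is used instead of a whole forest only to keep the output easy to read.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data with a continuous target (illustrative assumption)
rng = np.random.RandomState(0)
X_demo = rng.uniform(size=(2000, 5))
y_demo = (10 * X_demo[:, 0] + np.sin(6 * X_demo[:, 1])
          + rng.normal(scale=0.5, size=2000))

leaf_counts = {}
for msl in (1, 5, 25):
    tree = DecisionTreeRegressor(min_samples_leaf=msl, random_state=0)
    tree.fit(X_demo, y_demo)
    # Store (number of leaves, depth) for each setting
    leaf_counts[msl] = (tree.get_n_leaves(), tree.get_depth())
    print("min_samples_leaf =", msl, "-> leaves, depth:", leaf_counts[msl])
```

Real trees are not perfectly balanced, so the numbers will not halve exactly, but the number of leaves falls sharply as min_samples_leaf grows.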

Please remember that this is only a minimum: it does not prevent the tree from stopping at a node that contains more samples. For example, if you get to a leaf node where every single row has the same price, or in classification every single row belongs to the same class (i.e. they are identical in terms of the dependent variable), then there is no split that you can do that is going to improve your information.

Information is the term we use in a general sense in random forests to describe how much a split improves the model. So you will often see the term information gain, which means how much better the model got by adding an additional split point; for random forest regression models it could be based on RMSE.

Below I present some additional hyperparameters which are not tuned for accuracy but control how training is run and evaluated:

n_jobs: tells the engine how many processors it is allowed to use. If it has a value of 1, it can only use one processor. A value of -1 means that all available processors are used. It seems weird, but the default is to use one core. You will almost always get more performance by using more cores, since most computers have more than one core nowadays, so please remember to update it.

random_state: makes the model’s output replicable. The model will always produce the same results when it has a definite value of random_state and if it has been given the same hyperparameters and the same training data.

oob_score: in the sampling, about one-third of the data is not used to train each tree and can be used to evaluate its performance. These samples are called the out-of-bag samples. Setting oob_score=True computes this score.

Those are the key basic parameters that can be tuned. There are more, which you can see in the docs or by pressing shift+tab in Jupyter, but the ones you've seen are the ones that I've found useful to play with, so feel free to play with others as well.


Hyperparameter Tuning

Assuming that sufficient data cleaning/engineering has been performed and the training dataset is ready to be passed to a Random Forest Regression Model, I present below how to tune the hyperparameters of a Random Forest Regression Model using Python.

Step 1: Import some libraries:

import numpy as np
import pprint
from time import time
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import PredefinedSplit

Step 2: Create a function that reports the optimal hyperparameter:

# Reporting util for different optimizers
def report_perf(optimizer, X, y, title):
    """
    A wrapper for measuring time and performances of different optimizers
    
    optimizer = a sklearn or a skopt optimizer
    X = the training set 
    y = our target
    title = a string label for the experiment
    """
    start = time()
    optimizer.fit(X, y)
    best_score = optimizer.best_score_
    best_score_std = optimizer.cv_results_['std_test_score'][optimizer.best_index_]
    best_params = optimizer.best_params_
    print((title + " took %.2f seconds,  candidates checked: %d, best CV score: %.3f "
           +u"\u00B1"+" %.3f") % (time() - start, 
                                  len(optimizer.cv_results_['params']),
                                  best_score,
                                  best_score_std))    
    print('Best parameters:')
    pprint.pprint(best_params)
    print()
    return best_params,optimizer.best_estimator_

Step 3: Create our PredefinedSplit, using the first 70% of the rows of the training dataset for training and the remaining 30% for validation:

# The indices which have the value -1 will be kept in train.
train_indices = np.full((int(len(X_train)*0.7),), -1, dtype=int)

# The indices which have zero or positive values, will be kept in test
test_indices = np.full((len(X_train)-int(len(X_train)*0.7),), 0, dtype=int)
test_fold = np.append(train_indices, test_indices)

print(test_fold)

ps = PredefinedSplit(test_fold)

# Check how many splits will be done, based on test_fold
print(ps.get_n_splits())
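To double-check which rows end up where, it can help to build the same kind of split on a tiny example and print the indices. The sketch below assumes a hypothetical dataset of only 10 rows; the names n_rows, demo_fold and demo_ps are made up for this example.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

n_rows = 10                    # tiny hypothetical dataset
n_train = int(n_rows * 0.7)    # first 70% of the rows

# -1 marks rows that always stay in training, 0 marks the single validation fold
demo_fold = np.append(np.full(n_train, -1), np.full(n_rows - n_train, 0))
demo_ps = PredefinedSplit(demo_fold)

for train_idx, valid_idx in demo_ps.split():
    print("train rows:     ", train_idx)
    print("validation rows:", valid_idx)
```

There is exactly one split, with the first 7 rows used for training and the last 3 for validation, so the order of the rows in the training dataset matters.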

Step 4: Create a random forest regression object, specify the grid space  (values of hyperparameters to examine) and let GridSearchCV find the optimal combination:

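# random_state must be a non-negative integer; 'squared_error' is the name of
# the MSE criterion in Sklearn 1.0+ (it was called 'mse' in older versions)
clf = RandomForestRegressor(n_jobs=-1, random_state=0, criterion='squared_error')
grid_search = GridSearchCV(clf,
                           param_grid={
                                       "n_estimators": [10, 15],
                                       "min_samples_leaf": [5, 3],
                                       "max_features": ['sqrt', None]
                                       },
                           n_jobs=-1,
                           cv=ps,  # the single 70/30 split defined in Step 3
                           # the iid argument was removed from recent Sklearn versions
                           return_train_score=False)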

best_params,best_model = report_perf(grid_search, X_train, y_train,'GridSearchCV')

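This will return the optimal values for the hyperparameters of our model from the specified ranges, i.e. the combination that achieves the best score on the validation dataset (the last 30% of the rows of the training dataset), together with the best model. Note that, since no scoring argument is passed, GridSearchCV uses the estimator's default score, which is R² for a regressor; pass scoring='neg_mean_squared_error' if you want to select on MSE.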


Advanced Tip

Often, when we have a big dataset, we may choose to take a random subset, if possible, in order to speed up the training phase.

For example, let's assume that we have an entire dataset of 1,000,000 rows and, in our effort to speed up the process, we just took 20,000 rows. Random Forest will take this subset and build different trees, each using a different bootstrap sample of those 20,000 rows.

Why not take a totally different subset of 20,000 each time?

In other words, let’s leave the entire 1,000,000 records as is, and if we want to make things faster, let's force each tree to pick a different subset of 20,000 each time. So rather than bootstrapping the entire set of rows, just randomly sample a subset of the data. In that way, we don't prevent each tree from having access to the entire dataset.

Let's see both implementations:

First method: Random subset of data for Random Forest

from sklearn.ensemble import RandomForestRegressor
import pandas as pd

def random_subset(n, df, target):
    # Draw n random rows, then split them into features and target
    sample_df = df.sample(n).copy()
    return sample_df.drop(columns=target), sample_df[target]

X_train, y_train = random_subset(20000,df,target)

m = RandomForestRegressor(n_estimators=40, n_jobs=-1, oob_score=True)
m.fit(X_train, y_train)

print(m.score(X_train,y_train), m.oob_score_) # R^2 score

Second method: Random subset of data for each tree of the Random Forest

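from sklearn.ensemble import RandomForestRegressor
# Note: this import and the patch below rely on Sklearn internals
# and only work on old Sklearn versions (before 0.22)
from sklearn.ensemble import forest
import pandas as pd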


def set_rf_samples(n):
    """ Changes Scikit learn's random forests to give each tree a random sample of
    n random rows.
    """
    forest._generate_sample_indices = (lambda rs, n_samples:
        forest.check_random_state(rs).randint(0, n_samples, n))

set_rf_samples(20000)

X_train, y_train = df.drop(columns=target), df[target]

m = RandomForestRegressor(n_estimators=40, n_jobs=-1, oob_score=False)
m.fit(X_train, y_train)

print(m.score(X_train,y_train)) # R^2 score

We can revert to using a full bootstrap (sampling a new dataset as big as the original one, but with replacement) by running the following command:

def reset_rf_samples():
    """ Undoes the changes produced by set_rf_samples.
    """
    forest._generate_sample_indices = (lambda rs, n_samples:
        forest.check_random_state(rs).randint(0, n_samples, n_samples))

reset_rf_samples()

Both methods take about the same amount of time to run, but in the second case, every tree has access to the entire dataset.

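On which samples is the OOB score calculated with the second method?

Older versions of Sklearn do not support this kind of per-tree subsampling out of the box, so set_rf_samples is a custom function that patches Sklearn internals. So the OOB score cannot be calculated correctly when using the second method. From version 0.22 onwards, Sklearn offers a max_samples parameter that covers the same use case.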
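On Sklearn 0.22 or later, a similar effect can be sketched without patching internals by using the max_samples argument of RandomForestRegressor, which (with the default bootstrap=True) draws that many rows for each tree. This is a stand-in for set_rf_samples rather than the method used in this article; the synthetic data and the value of 1,000 rows per tree are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic "big" dataset (illustrative assumption)
X_big, y_big = make_regression(n_samples=5000, n_features=10, n_informative=5,
                               noise=5.0, random_state=0)

# Each tree is trained on 1,000 rows drawn at random from all 5,000 rows
m_sub = RandomForestRegressor(n_estimators=40, n_jobs=-1, oob_score=True,
                              max_samples=1000, random_state=0)
m_sub.fit(X_big, y_big)

print("train R^2:", m_sub.score(X_big, y_big))
print("OOB R^2:  ", m_sub.oob_score_)
```

Because the subsampling is handled by the library itself here, the OOB score can be requested as usual.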

Why is the process faster when using the function set_rf_samples?

set_rf_samples determines how many rows are in each tree. So before we start a new tree, we either bootstrap a sample (i.e. sampling with replacement from the whole thing) or we pull out a subsample of a smaller number of rows and then we build a tree from there. Let's present the process in detail:

  • Step 1 From the whole big dataset, we grab a few rows at random from it, and we turn them into a smaller dataset. From that, we build a tree.‌

Assuming that the tree remains balanced as we grow it, how many layers deep will this tree be (assuming we are growing it until every leaf is of size one)?

log2(20000). The depth of the tree doesn’t actually vary that much depending on the number of samples, because it is related to the log of the size:
depth = log2(sample_size / min_samples_leaf) = log2(20,000 / 1) = log2(20,000) ≈ 14.3

Once we go all the way down to the bottom, how many leaf nodes would there be?

20K. We have a linear relationship between the number of leaf nodes and the size of the sample. So when you decrease the sample size, there are fewer final decisions that can be made. Therefore, the tree is going to be less rich in terms of what it can predict, because it is making fewer distinct individual decisions and also fewer binary choices to get to those decisions.
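The two answers above can be put into a few lines of arithmetic. The sketch below assumes a perfectly balanced tree, which real trees only approximate, and the function name is made up for this example.

```python
import math

def balanced_tree_shape(sample_size, min_samples_leaf=1):
    """Depth and leaf count of an idealized, perfectly balanced tree."""
    n_leaves = sample_size / min_samples_leaf   # linear in the sample size
    depth = math.log2(n_leaves)                 # logarithmic in the sample size
    return depth, n_leaves

for n in (20_000, 1_000_000):
    depth, leaves = balanced_tree_shape(n)
    print(f"{n:>9} rows -> depth ~ {depth:.1f}, leaves = {leaves:.0f}")
```

Going from 20,000 to 1,000,000 rows multiplies the number of leaves by 50 but adds fewer than 6 levels of depth.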

Therefore, setting RF samples lower is going to mean that you overfit less, but it also means that you are going to have a less accurate individual tree model.

The way Breiman, the inventor of random forest, described this is that you are trying to do two things when you build a model with bagging.

  • One is that each individual tree/estimator is as accurate as possible (so each model is a strong predictive model).
  • But then across the estimators, the correlation between them is as low as possible so that when you average them out together, you end up with something that generalizes well.

By decreasing the number passed to set_rf_samples, we are actually decreasing the power of each estimator and also decreasing the correlation between them. Is that going to result in a better or worse validation set result for you? It depends. This is the kind of compromise that a data scientist has to figure out when building machine learning models.

The major benefit of set_rf_samples is that you can run more quickly. Particularly if you are working with a really large dataset, like a hundred million rows, it may not be possible to train on the full dataset. So you would either have to pick a subsample yourself before you start, or use set_rf_samples.


Conclusion

This brings us to the end of this article. Hope you got a basic understanding of the advanced tricks of a random forest regression model by following this post. Feel free to use the Python code snippet of this article.

Thanks for reading and I am looking forward to hearing your questions :)
Stay tuned and Happy Machine Learning.


References