As a data scientist working on regression problems, I have often come across datasets whose target variable has a right-skewed distribution. After some googling, I found out that a log transformation can help a lot.
In this article, I will try to answer my initial question: how does log-transforming the target variable into a more uniform space boost model performance?
Note: For the analysis below I used the House Prices: Advanced Regression Techniques Kaggle dataset.
✏️ Table of Contents
- When to log-transform the target variable?
- Why do tree-based models suffer a lot?
- Expected Result
- Why does it work?
- What about left-skewed distributions?
🛠 When to log-transform the target variable?
It is useful when the distribution of the target variable is right-skewed, which can be seen with a simple histogram plot. This typically occurs when there are outliers that can't be filtered out because they are important to the model.
That said, if you are sure that the points skewing the distribution are genuine outliers, then they should be filtered out instead.
Remember that the log transformation can only be applied when the target variable takes strictly positive values, since log(0) is undefined. If the target contains zeros, log(1 + y) (np.log1p in NumPy) is a common workaround.
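Besides eyeballing a histogram, skewness can be measured directly. The snippet below is a minimal sketch, not the code used for this article: the `skewness` and `log_transform` helpers are hypothetical names, and the synthetic log-normal "prices" are an assumption standing in for a real target column. It uses only the standard library so it runs anywhere.

```python
import math
import random


def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def log_transform(values):
    """Natural log of each value; refuses zeros and negative values."""
    if any(v <= 0 for v in values):
        raise ValueError("log needs strictly positive values")
    return [math.log(v) for v in values]


# Assumption: synthetic log-normal "prices" stand in for a real target.
random.seed(0)
prices = [random.lognormvariate(12.0, 0.5) for _ in range(5000)]

raw_skew = skewness(prices)
log_skew = skewness(log_transform(prices))
print(f"skewness before: {raw_skew:.2f}, after log: {log_skew:.2f}")
```

A clearly positive skewness on the raw values that drops towards zero after the log is the signal that the transformation is doing its job.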
🌳 Tree-based models suffer a lot
Tree-based models make predictions by averaging the target values of similar records. If outliers are present, this averaging can produce wildly skewed predictions (predictions that are very far off), leading to poor models.
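To make the averaging effect concrete, here is a small sketch of what happens inside a single leaf. The prices are hypothetical and are not taken from the Kaggle dataset; the only assumption about the model is that a regression tree leaf predicts the mean of the training targets it contains.

```python
from statistics import mean

# Hypothetical sale prices that ended up in the same leaf of a tree.
typical_prices = [150_000, 160_000, 155_000, 165_000, 158_000]
outlier_price = 755_000

# A leaf predicts the mean of the targets that fall into it.
leaf_without_outlier = mean(typical_prices)
leaf_with_outlier = mean(typical_prices + [outlier_price])

print(f"leaf prediction without the outlier: {leaf_without_outlier:,.0f}")
print(f"leaf prediction with the outlier:    {leaf_with_outlier:,.0f}")
```

With the outlier included, the leaf's prediction is higher than every one of the typical houses it is supposed to represent.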
🎯 Expected Result
The ultimate goal is to transform the target variable so that its distribution resembles a narrow “bell curve” without a long tail.
This is doable only if large values are shrunk a lot and smaller values only a little. That magic operation is the logarithm.
It manages to "draw in" big values, which often makes the data easier to look at and sometimes stabilizes the variance across observations. In the figure below we get a nice normally-shaped distribution without having to filter out outliers.
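The "shrink big values a lot, small values a little" behaviour can be checked with a few numbers. This is a minimal sketch with three hypothetical prices; it compares the gaps between them before and after taking the natural log.

```python
import math

# Hypothetical prices: two ordinary houses and one very expensive one.
cheap, mid, expensive = 100_000, 200_000, 800_000

raw_gap_small = mid - cheap      # gap between the two ordinary houses
raw_gap_large = expensive - mid  # gap up to the expensive house

log_gap_small = math.log(mid) - math.log(cheap)      # equals log(2)
log_gap_large = math.log(expensive) - math.log(mid)  # equals log(4)

raw_ratio = raw_gap_large / raw_gap_small
log_ratio = log_gap_large / log_gap_small
print(f"large gap / small gap on the raw scale: {raw_ratio:.1f}")
print(f"large gap / small gap on the log scale: {log_ratio:.1f}")
```

On the log scale only ratios matter: doubling a price always adds log(2), whether the house costs 100,000 or 400,000, which is why the expensive house ends up much closer to the others.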
🤔 Why does it work?
Let's look at the actual effect on some target values.
For example, a DT (decision tree) trained on the raw target variable would predict a terrible average of df.SalePrice.mean() = $180,921.
Whereas a DT trained on the log of the target variable predicts an average of
Predicting the average in log-of-price space is clearly the better option. The average is now less sensitive to outliers because we have scrunched the space (outliers are brought in close).
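The comparison can be reproduced on a toy example. The sketch below uses hypothetical prices rather than the Kaggle data, so its numbers are not the article's numbers. It averages once on the raw scale and once in log space, then maps the log-space average back to dollars with exp (which gives the geometric mean).

```python
import math
from statistics import mean, median

# Hypothetical sale prices with one very expensive house.
prices = [150_000, 160_000, 155_000, 165_000, 158_000, 755_000]

raw_average = mean(prices)

# Average in log space, then map back to dollars with exp.
log_average = mean(math.log(p) for p in prices)
back_transformed = math.exp(log_average)

typical = median(prices)
print(f"median price:                 {typical:,.0f}")
print(f"average on raw prices:        {raw_average:,.0f}")
print(f"log-space average, in dollars: {back_transformed:,.0f}")
```

The back-transformed average sits noticeably closer to the median than the raw average does, because the single expensive house has far less pull in log space.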
What about left-skewed distributions?
A log transformation applied to a left-skewed distribution will tend to make it even more left-skewed, for the same reason it often makes a right-skewed one more symmetric. All it achieves is pulling the values above the median in even more tightly and stretching the values below the median down even further.
In such cases, a power transformation can help.
Left-skewed distributions can become more symmetric by taking a power greater than 1 (for example, squaring) or by exponentiating.
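Here is a minimal sketch of that effect. The Beta(5, 1) samples are an assumption standing in for a left-skewed target, and `skewness` is a hypothetical helper; the snippet measures skewness on the raw values, after a log, and after squaring.

```python
import math
import random


def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Assumption: Beta(5, 1) samples stand in for a left-skewed target in (0, 1).
random.seed(0)
target = [random.betavariate(5, 1) for _ in range(5000)]

raw_skew = skewness(target)
log_skew = skewness([math.log(v) for v in target])
squared_skew = skewness([v ** 2 for v in target])

print(f"raw:     {raw_skew:.2f}")
print(f"log:     {log_skew:.2f}")
print(f"squared: {squared_skew:.2f}")
```

On this synthetic sample the log pushes the skewness further below zero, while squaring moves it back towards zero. A higher power may be needed for stronger left skew.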
Remember that you always need to transform the predictions back to the original scale by applying the inverse of the transformation used.
For the log transformation, the inverse is the exponential function (exp).
Log transform function: y_log = log(y)
Inverse exponential function: y = exp(y_log)
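A minimal sketch of the round trip, using hypothetical prices and the standard library's `math` functions (NumPy offers the same operations as `np.log`, `np.exp`, `np.log1p` and `np.expm1`):

```python
import math

# Hypothetical sale prices.
prices = [120_000.0, 180_000.0, 450_000.0]

# Forward: the model would be trained on these log values.
log_prices = [math.log(p) for p in prices]

# Inverse: predictions made in log space are mapped back with exp.
recovered = [math.exp(v) for v in log_prices]

# If the target was transformed with log1p (to allow zeros), invert with expm1.
log1p_prices = [math.log1p(p) for p in prices]
recovered_1p = [math.expm1(v) for v in log1p_prices]

print(recovered)
print(recovered_1p)
```

The key point is to pair each transform with its own inverse: log with exp, log1p with expm1. In scikit-learn, `TransformedTargetRegressor` can wrap a model so that both steps are applied automatically.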
📌 Other Transformations
Some other transformations that I have found are:
Based on my experience, the log transformation tends to work better for right-skewed data. There are also times when the square root makes things more symmetric, but that tends to happen with less skewed distributions.
The best way to be sure what works best is to try both and compare.
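One rough way to "try both and compare" is to look at the skewness each transformation leaves behind. The sketch below rests on assumptions: `compare_transforms` and `most_symmetric` are hypothetical helpers, and the log-normal and gamma samples stand in for a heavily and a mildly right-skewed target. Skewness is only a proxy, and the real test is the validation error of the model trained on each version.

```python
import math
import random


def skewness(values):
    """Population skewness: third central moment over variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def compare_transforms(values):
    """Skewness of the raw, square-rooted and logged values (all must be > 0)."""
    return {
        "raw": skewness(values),
        "sqrt": skewness([math.sqrt(v) for v in values]),
        "log": skewness([math.log(v) for v in values]),
    }


def most_symmetric(values):
    """Name of the transformation whose skewness is closest to zero."""
    scores = compare_transforms(values)
    return min(scores, key=lambda name: abs(scores[name]))


random.seed(0)
# Assumption: synthetic samples, heavily vs. mildly right-skewed.
heavy = [random.lognormvariate(0.0, 1.0) for _ in range(5000)]
mild = [random.gammavariate(4.0, 1.0) for _ in range(5000)]

print("heavily skewed:", compare_transforms(heavy), most_symmetric(heavy))
print("mildly skewed: ", compare_transforms(mild), most_symmetric(mild))
```

On these synthetic samples the log comes out most symmetric for the heavily skewed data, while the square root does for the mildly skewed data, where the log overshoots into left skew.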
This brings us to the end of this article. I hope you now understand why log-transforming a right-skewed target variable into a more uniform space boosts model performance.