Validation is probably one of the most important techniques a data scientist uses, because there is always a need to check the stability of a machine learning model, that is, how well it would generalize to new data. We need to be sure that the model has captured most of the patterns in the data and is not picking up too much of the noise; in other words, that it is low on both bias and variance. The goal of this article is to present the concept of cross-validation.


Table of Contents

  • What is cross-validation?
  • Why is it helpful?
  • What is overfitting & underfitting?
  • Different validation strategies
  • When to use each of the above techniques?
  • In K-fold, how many folds should we use?
  • Conclusion
  • References

What is cross-validation?

Cross-validation is a model validation technique for assessing how the results of a statistical analysis (model) will generalize to an independent data set. It is mainly used in settings where the goal is prediction and one wants to estimate how accurately a predictive model will perform in practice.

The goal of cross-validation is to define a data set on which to test the model during the training phase (i.e. a validation data set) in order to limit problems like overfitting and underfitting, and to get an insight into how the model will generalize to an independent data set. It is important that the validation and training sets be drawn from the same distribution; otherwise, validation makes things worse rather than better.


Why is it helpful?

  • Validation helps us evaluate the quality of the model.
  • Validation helps us select the model that will perform best on unseen data.
  • Validation helps us avoid overfitting and underfitting.

What is overfitting & underfitting?

  • Underfitting refers to not capturing enough patterns in the data. The model performs poorly on both the training and the test set.
  • Overfitting refers to a) capturing noise and b) capturing patterns that do not generalize well to unseen data. The model performs extremely well on the training set but poorly on the test set.

The optimal model performs well on the training set and on the test set as well.


Different Validation strategies

Different validation strategies exist, typically distinguished by the number of splits made in the dataset.

Train/Test split or Holdout: # groups =2

In this strategy, we simply split the data into two sets, train and test, so that the samples in the train and test sets do not overlap; if they do, we simply can't trust our model. That is also why it is important not to have duplicated samples in our dataset. Before building our final model, we can retrain it on the whole dataset without changing any of its hyperparameters.

But a train/test split has one major disadvantage:
What if the split we make isn't random? What if one subset of our data contains only people from a certain state, only employees with a certain income level, only women, or only people of a certain age? This can result in overfitting, even though we're trying to avoid it, because it is not certain which data points will end up in the validation set and the result might be entirely different for different splits. Thus, a single train/test split is a good choice only if we have enough data.

Implementation in Python: sklearn.model_selection.train_test_split
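A minimal sketch of a holdout split with scikit-learn follows; the dataset, model, and 80/20 split ratio are only illustrative choices, not part of the original article.

```python
# A minimal holdout (train/test) split sketch; dataset, model and test_size are illustrative.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the samples for validation; random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
print("Holdout accuracy:", model.score(X_test, y_test))
```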

K-fold: # groups = k

As there is never enough data to train a model, removing a part of it for validation poses a risk of underfitting. By reducing the training data, we risk losing important patterns/trends in the data set, which in turn increases the error induced by bias. So, what we require is a method that provides ample data for training the model and also leaves ample data for validation. K-fold cross-validation does exactly that.

It can be viewed as repeated holdout, where we simply average the scores over K different holdouts. Every data point appears in a validation set exactly once and in a training set k-1 times. This significantly reduces underfitting, as we use most of the data for fitting, and it also reduces overfitting, since all of the data is eventually used for validation.

This method is a good choice when we have a limited amount of data and we observe a sufficiently large difference in quality, or in optimal parameters, between folds. As a general rule, we choose k=5 or k=10, as these values have been shown empirically to yield test error estimates that suffer neither from excessively high bias nor from very high variance.

Implementation in Python: sklearn.model_selection.KFold
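Below is a minimal sketch of K-fold cross-validation; the dataset, model, and k=5 are only illustrative assumptions.

```python
# A minimal K-fold cross-validation sketch; dataset, model and n_splits are illustrative.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=42)

scores = []
for train_idx, val_idx in kf.split(X):
    # Each sample is used for validation exactly once across the 5 folds.
    model = LogisticRegression(max_iter=1000)
    model.fit(X[train_idx], y[train_idx])
    scores.append(model.score(X[val_idx], y[val_idx]))

# The cross-validated estimate is the average score over the folds.
print("Mean accuracy over 5 folds:", np.mean(scores))
```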

Leave one out: # groups = len(train)

It is a special case of K-fold where K equals the number of samples in our dataset. This means we iterate through every sample in the dataset, each time using k-1 samples for training and the remaining single sample as the test set.

This method can be useful if we have very little data and a model that is fast enough to retrain.

Implementation in Python: sklearn.model_selection.LeaveOneOut
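A minimal leave-one-out sketch, again on a small toy dataset; the estimator is only an example.

```python
# A minimal leave-one-out cross-validation sketch; dataset and model are illustrative.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)

# One fold per sample: the model is retrained len(X) times, each time
# leaving exactly one observation out as the validation "set".
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=LeaveOneOut())
print("Leave-one-out accuracy:", scores.mean(), "over", len(scores), "folds")
```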

Extra: Stratification

Usually, when we use a train/test split or K-fold, we shuffle the data to produce a random train/validation split. In that case, different folds can end up with different target distributions. With stratification, we ensure a similar target distribution across the different folds when we split the data.

It is useful for:

  • Small datasets
  • Unbalanced datasets
  • Multiclass classification

In general, for a big, balanced dataset, a stratified split will be quite similar to a simple shuffled (random) split.
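A minimal sketch of a stratified split using sklearn.model_selection.StratifiedKFold follows, printing the class counts per validation fold to show that the target distribution is preserved; the dataset and number of splits are illustrative.

```python
# A minimal stratified K-fold sketch; dataset and n_splits are illustrative.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold

X, y = load_iris(return_X_y=True)
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Each validation fold keeps roughly the same class proportions as the full dataset.
    print(f"Fold {fold}: validation class counts =", np.bincount(y[val_idx]))
```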


When to use each of the above techniques?

If we have enough data and are likely to get similar scores and optimal model parameters for different splits, a train/test split is a good option. If, on the contrary, the scores and optimal parameters differ for different splits, we can choose the K-fold approach, while if we have very little data we can apply leave-one-out. Stratification helps to make validation more stable and is especially useful for small and unbalanced datasets.


In K-fold, how many folds should we use?

As the number of folds increases, the error due to bias decreases but the error due to variance increases; the computational cost goes up too, since you need more time and more memory to compute all the folds.
With a lower number of folds, we reduce the error due to variance, but the error due to bias grows. It is also computationally cheaper.

As general advice: for a big dataset, k = 3 or k = 5 is usually the preferred option, while for small datasets it is recommended to use leave-one-out.


Conclusion

Cross-validation is a very useful tool for a data scientist when assessing the effectiveness of a model, especially for tackling overfitting and underfitting. In addition, it is useful for determining the hyperparameters of the model, in the sense of finding which parameters will result in the lowest test error.
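As a minimal sketch of that last idea, GridSearchCV runs K-fold cross-validation for every candidate hyperparameter value and keeps the one with the best mean validation score; the SVC estimator and the grid of C values below are only illustrative assumptions.

```python
# A minimal hyperparameter-selection sketch using cross-validation;
# the SVC estimator and the candidate values of C are only illustrative.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# GridSearchCV runs 5-fold cross-validation for each candidate C
# and keeps the value with the best mean validation score.
search = GridSearchCV(SVC(), param_grid={"C": [0.1, 1, 10]}, cv=5)
search.fit(X, y)
print("Best parameters:", search.best_params_, "| CV score:", search.best_score_)
```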

If you liked this article, please consider subscribing to my blog. That way I get to know that my work is valuable to you, and you will also be notified about future articles.
Thanks for reading, and I am looking forward to hearing your questions :)
Stay tuned and Happy Machine Learning.


References

Originally published at https://towardsdatascience.com/cross-validation-70289113a072 on Aug 16, 2018.