In this article, I present Mlxtend (machine learning extensions), a Python library of useful tools for day-to-day data science tasks. To showcase its strengths, I use the library to find the most important features of a dataset.

✏️ Table of Contents

  • How to install it?
  • Curse of dimensionality
  • Feature Selection
  • Exhaustive search
  • Forward feature selection
  • Backward feature selection
  • Stochastic feature selection
  • Python Implementation
  • Conclusion
  • References

⚙️ How to install it?

Just run the following command if you have conda installed on your machine:

conda install -c conda-forge mlxtend

Or using pip:

pip install mlxtend

Mlxtend is a handy package for a wide range of data science tasks. For feature selection, it provides wrapper methods such as:

  • SequentialFeatureSelector (supporting both Forward and Backward feature selection)
  • ExhaustiveFeatureSelector

💀 Curse of dimensionality

Keep in mind that adding more features is not always helpful. This is because:

  • Data is sparse in high dimensions
  • Impractical to use all the measured data directly
  • Some features may be detrimental to pattern recognition
  • Some features are essentially noise
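
To make the first point a bit more concrete, here is a small sketch of my own (not part of Mlxtend). It assumes points spread uniformly in a unit hypercube and asks how long the edge of a sub-cube must be to capture 10% of them. The function name `edge_length_for_fraction` is hypothetical and only used for illustration.

```python
def edge_length_for_fraction(fraction: float, n_dims: int) -> float:
    """Edge length of a sub-cube holding `fraction` of uniformly spread points
    inside a unit hypercube with `n_dims` dimensions."""
    # The sub-cube's volume is edge ** n_dims, so solve edge ** n_dims = fraction.
    return fraction ** (1.0 / n_dims)


# Edge length needed to capture 10% of the data as the dimensionality grows.
edge_by_dims = {d: edge_length_for_fraction(0.10, d) for d in (1, 2, 10, 100)}

for n_dims, edge in edge_by_dims.items():
    print(f"{n_dims:>3} dims -> edge length {edge:.3f}")
```

In one dimension a tenth of the range is enough, while in ten dimensions the sub-cube has to span roughly 79% of every axis, so a "local" neighbourhood is no longer local.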

🧠 Feature Selection

Feature selection is the process of selecting a subset of the extracted features. This is helpful because it:

  • Reduces dimensionality
  • Discards uninformative features
  • Discards deceptive features (Deceptive features appear to aid learning on the training set, but impair generalisation)
  • Speeds training/testing

In general, there are three approaches, which we will look at in more detail shortly:

  • Exhaustive search (generally too expensive)
  • Forward/backward greedy search algorithms
  • Stochastic search

The goal is:

"Given M input features, select a subset of the d most useful."

Exhaustive search tries each combination of d features and assesses which is most effective. For a fixed d, the number of combinations is the binomial coefficient C(M, d) = M! / (d! (M − d)!).

Allowing subsets of every size d = 1, ..., M gives 2^M − 1 combinations, which is prohibitively expensive for M >= 20 (2^20 ≈ 1,000,000). Because exhaustive search is potentially too expensive, forward and backward selection are usually the preferred options.
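
The counts above are easy to check with Python's standard library. This is just a quick sanity check; the helper names are my own and the value d = 5 is an arbitrary example.

```python
from math import comb


def n_subsets_of_size(m: int, d: int) -> int:
    """Number of ways to choose exactly d features out of m."""
    return comb(m, d)


def n_nonempty_subsets(m: int) -> int:
    """Number of feature subsets of any size from 1 to m."""
    return 2 ** m - 1


M = 20
print(n_subsets_of_size(M, 5))  # 15504
print(n_nonempty_subsets(M))    # 1048575
```

Every one of those subsets would need its own cross-validated model fit, which is why the exhaustive approach only scales to small M.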

🟢 Forward feature selection

Forward selection involves the following steps:

  • Train the model with each single feature and keep the one that gives the best result on the evaluation metric.
  • Select a second feature which, in combination with the first, gives the best performance.
  • Repeat, adding one feature at a time.
  • Stop when no significant improvement is observed or the limit of d features is reached.
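
The steps above can be sketched in plain Python. This is a minimal illustration and not how Mlxtend implements it: `toy_score` is a made-up stand-in for a cross-validated evaluation metric, and the `USEFULNESS` values and the `min_gain` threshold are assumptions chosen for the example.

```python
from typing import Callable, List, Tuple

# Hypothetical per-feature contributions standing in for real model performance.
USEFULNESS = [0.30, 0.05, 0.20, 0.00, 0.10]


def toy_score(features: List[int]) -> float:
    """Made-up score: the sum of the contributions of the chosen features."""
    return sum(USEFULNESS[f] for f in features)


def forward_select(
    n_features: int,
    score: Callable[[List[int]], float],
    max_features: int,
    min_gain: float = 1e-6,
) -> Tuple[List[int], float]:
    selected: List[int] = []
    best_score = float("-inf")
    while len(selected) < min(max_features, n_features):
        candidates = [f for f in range(n_features) if f not in selected]
        # Score every subset obtained by adding one more feature.
        cand_score, cand_feature = max((score(selected + [f]), f) for f in candidates)
        if cand_score - best_score < min_gain:
            break  # no significant improvement, so stop early
        selected.append(cand_feature)
        best_score = cand_score
    return selected, best_score


chosen, chosen_score = forward_select(len(USEFULNESS), toy_score, max_features=3)
print(chosen, round(chosen_score, 2))
```

With a real model, `score` would fit and evaluate the estimator on the candidate columns, which is where most of the running time goes.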

🟡 Backward feature selection

Backward selection involves the following steps:

  • Train the model using all features.
  • Discard the feature whose removal causes the smallest decrease in performance.
  • Repeat, removing one feature at a time.
  • Stop when a significant decrease in performance is observed or the limit of d features is reached.
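
Backward selection can be sketched the same way. Again, this is only an illustration under stated assumptions: `toy_score` and its `CONTRIBUTION` values are invented (feature 3 is deliberately harmful and feature 1 is pure noise), and `max_drop` is a hypothetical threshold for what counts as a significant decrease.

```python
from typing import Callable, List, Tuple

# Hypothetical contributions: feature 3 hurts the model, feature 1 adds nothing.
CONTRIBUTION = [0.30, 0.00, 0.20, -0.02, 0.10]


def toy_score(features: List[int]) -> float:
    """Made-up score: the sum of the contributions of the kept features."""
    return sum(CONTRIBUTION[f] for f in features)


def backward_eliminate(
    n_features: int,
    score: Callable[[List[int]], float],
    min_features: int,
    max_drop: float = 0.01,
) -> Tuple[List[int], float]:
    kept = list(range(n_features))
    current_score = score(kept)
    while len(kept) > max(min_features, 1):
        # Score every subset obtained by removing exactly one feature.
        cand_score, removed = max(
            (score([f for f in kept if f != g]), g) for g in kept
        )
        if current_score - cand_score > max_drop:
            break  # every removal now costs too much performance
        kept.remove(removed)
        current_score = cand_score
    return kept, current_score


kept_features, kept_score = backward_eliminate(
    len(CONTRIBUTION), toy_score, min_features=2
)
print(kept_features, round(kept_score, 2))
```

Note that backward selection starts from the most expensive model (all features), so each early step costs more than the corresponding step in forward selection.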

Both techniques are fast, but neither guarantees that some untried feature set is not better. The only thing they reliably do is eliminate features whose information content is subsumed by other features.

🟠 Stochastic feature selection

Feature selection is a combinatorial optimisation problem, so general-purpose stochastic optimisers can be applied to it:

  • Simulated annealing or genetic algorithms can be used to search for the global maximum
  • Potentially very good results
  • Potentially very expensive
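
As a rough idea of what this looks like, below is a minimal simulated annealing sketch over feature masks. Mlxtend does not provide this, as far as I know, so everything here is my own illustration: the `toy_score` function, the per-feature penalty, the linear cooling schedule and the step count are all assumptions.

```python
import math
import random
from typing import Callable, List, Tuple

# Hypothetical contributions; each selected feature also pays a small penalty.
WEIGHTS = [0.30, 0.00, 0.20, 0.01, 0.10, 0.00]
PENALTY = 0.03


def toy_score(mask: List[bool]) -> float:
    """Made-up score: contributions of selected features minus a size penalty."""
    return sum(w - PENALTY for w, on in zip(WEIGHTS, mask) if on)


def anneal_feature_subset(
    n_features: int,
    score: Callable[[List[bool]], float],
    n_steps: int = 2000,
    start_temp: float = 1.0,
    seed: int = 0,
) -> Tuple[List[bool], float]:
    rng = random.Random(seed)
    current = [rng.random() < 0.5 for _ in range(n_features)]
    current_score = score(current)
    best, best_score = list(current), current_score
    for step in range(n_steps):
        # Linear cooling; the small constant avoids division by zero.
        temp = start_temp * (1.0 - step / n_steps) + 1e-9
        neighbour = list(current)
        flip = rng.randrange(n_features)
        neighbour[flip] = not neighbour[flip]  # add or drop one feature
        delta = score(neighbour) - current_score
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature drops.
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current, current_score = neighbour, current_score + delta
            if current_score > best_score:
                best, best_score = list(current), current_score
    return best, best_score


best_mask, best_value = anneal_feature_subset(len(WEIGHTS), toy_score)
print([i for i, on in enumerate(best_mask) if on], round(best_value, 2))
```

Unlike the greedy searches, this can both add and remove features at any point, at the price of many more score evaluations.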

🐍 Python Implementation

Let's try to find the most important features of the wine dataset using the above techniques. Keep in mind the following:

  • Setting our desired number of features too low could lead to a sub-optimal solution.
  • Mlxtend's feature selectors use cross-validation internally, and we set the number of folds to 10.

Below you can see screenshots of a working example. The full notebook is available on my GitHub page.

🤖 Conclusion

This brings us to the end of this article. I hope you are now aware of the Mlxtend Python library and how it can be used for feature selection.

Thanks for reading; if you liked this article, please consider subscribing to my blog. That way I know my work is valuable to you, and I can notify you about future articles.

💪💪💪💪 As always keep studying, keep creating 🔥🔥🔥🔥

🔘 References