In this article, I present Mlxtend (machine learning extensions), a Python library of useful tools for day-to-day data science tasks. To showcase its strengths, I use the library to find the most important features of a dataset.
✏️ Table of Contents
- How to install it?
- Curse of dimensionality
- Feature Selection
- Exhaustive search
- Forward feature selection
- Backward feature selection
- Stochastic feature selection
- Python Implementation
⚙️ How to install it?
Just run the following command if you have conda installed on your PC:
conda install -c conda-forge mlxtend
Or using pip:
pip install mlxtend
Mlxtend is a useful package for diverse data science-related tasks. Among other things, it provides wrapper methods for feature selection, such as:
- SequentialFeatureSelector (supporting both Forward and Backward feature selection)
💀 Curse of dimensionality
Keep in mind that adding more features is not always helpful, for several reasons:
- Data is sparse in high dimensions
- Impractical to use all the measured data directly
- Some features may be detrimental to pattern recognition
- Some features are essentially noise
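The first point can be made concrete with a small experiment. The sketch below is my own illustration, not part of Mlxtend: it draws random points in a unit cube and compares the nearest and the farthest distance from a query point. The function name `distance_contrast` and the chosen dimensions are arbitrary assumptions.

```python
import math
import random


def distance_contrast(dim, n_points=200, seed=0):
    """Return the nearest/farthest distance ratio from one query point
    to n_points random points in the unit cube of the given dimension."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.dist(query, point))
    return min(distances) / max(distances)


low = distance_contrast(dim=2)
high = distance_contrast(dim=500)
print(f"nearest/farthest ratio in   2 dimensions: {low:.2f}")
print(f"nearest/farthest ratio in 500 dimensions: {high:.2f}")
```

In two dimensions the nearest point is much closer than the farthest one. In 500 dimensions the ratio moves towards 1, meaning every point is roughly equally far away, which is one way to see why the same amount of data becomes sparse as features are added.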
🧠 Feature Selection
Feature Selection is the process of selecting a subset of the extracted features. This is helpful because:
- Reduces dimensionality
- Discards uninformative features
- Discards deceptive features (Deceptive features appear to aid learning on the training set, but impair generalisation)
- Speeds training/testing
In general, there are three approaches, which we will analyze in more detail shortly:
- Exhaustive search (generally too expensive)
- Forward/backward greedy search algorithms
- Stochastic search
🔵 Exhaustive search
The goal is:
"Given M input features, select a subset of the d most useful."
Try each combination of d features and assess which is most effective. Number of combinations:
C(M, d) = M! / (d! (M − d)!)
Allowing subsets of every size d = 1, ..., M gives 2^M − 1 combinations. This is prohibitively expensive for M >= 20 (2^20 ≈ 1,000,000). Because exhaustive search is potentially too expensive, forward and backward selection are usually the preferred options.
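The counts above are easy to check with the standard library. The snippet below is a minimal sketch: `exhaustive_search` and the `usefulness` table are hypothetical names and made-up numbers used only for illustration. In practice the score would come from cross-validating a model on the chosen columns.

```python
from itertools import combinations
from math import comb

M = 20  # number of input features

# Number of subsets of one fixed size, here d = 5
print(comb(M, 5))  # 15504

# Number of subsets of every size d = 1, ..., M
total = sum(comb(M, d) for d in range(1, M + 1))
print(total)  # 1048575, which equals 2**M - 1


def exhaustive_search(features, d, score):
    """Score every subset of size d and return the best one."""
    return max(combinations(features, d), key=score)


# Hypothetical usefulness of four features (made-up numbers).
usefulness = {"f0": 0.5, "f1": 0.2, "f2": 0.1, "f3": 0.4}
best = exhaustive_search(
    usefulness, 2, lambda subset: sum(usefulness[f] for f in subset)
)
print(best)  # ('f0', 'f3')
```

With 4 features and d = 2 there are only 6 subsets to score. With 20 features and every subset size allowed, the same loop would need over a million model evaluations.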
🟢 Forward feature selection
Forward selection involves the following steps:
- Train the model on each feature individually and keep the one that gives the best result according to the evaluation metric.
- Select a second feature which, in combination with the first, gives the best performance.
- Continue adding one feature at a time in the same way.
- Stop when no significant improvement is observed or the limit of d features is reached.
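The steps above can be sketched in plain Python. This is only an illustration of the greedy idea, not how Mlxtend implements it: `forward_selection`, `min_gain`, `WEIGHTS`, and `toy_score` are hypothetical names, and the toy score stands in for a real cross-validated metric.

```python
def forward_selection(features, score, d, min_gain=0.0):
    """Greedily grow a feature subset, one feature at a time.

    features : iterable of feature names
    score    : callable mapping a tuple of features to a number (higher is better)
    d        : maximum number of features to keep
    min_gain : stop early if the best addition improves the score by no more than this
    """
    selected = []
    remaining = list(features)
    best_score = float("-inf")
    while remaining and len(selected) < d:
        # Try adding each remaining feature to the current subset.
        candidate = max(remaining, key=lambda f: score(tuple(selected + [f])))
        candidate_score = score(tuple(selected + [candidate]))
        if candidate_score - best_score <= min_gain:
            break  # no significant improvement
        selected.append(candidate)
        remaining.remove(candidate)
        best_score = candidate_score
    return selected, best_score


# Made-up scores: f1 is useful alone but mostly repeats what f0 already says.
WEIGHTS = {"f0": 0.50, "f1": 0.45, "f2": 0.20, "f3": 0.02}


def toy_score(subset):
    total = sum(WEIGHTS[f] for f in subset)
    if "f0" in subset and "f1" in subset:
        total -= 0.42  # redundancy penalty
    return total


selected, best = forward_selection(WEIGHTS, toy_score, d=3, min_gain=0.05)
print(selected, f"{best:.2f}")  # ['f0', 'f2'] 0.70
```

The search stops after two features: the best third candidate adds only 0.03 to the score, which is below the `min_gain` threshold.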
🟡 Backward feature selection
Backward selection involves the following steps:
- Train the model using all features.
- Discard the feature whose removal causes the smallest decrease in performance.
- Continue removing one feature at a time in the same way.
- Stop when a significant decrease in performance is observed or the limit of d features is reached.
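The same procedure can be sketched in a few lines of plain Python. As before, this is a hedged illustration rather than Mlxtend's implementation: `backward_elimination`, `max_drop`, `WEIGHTS`, and `toy_score` are hypothetical, and the scores are made up.

```python
def backward_elimination(features, score, d, max_drop=0.0):
    """Start from all features and greedily remove one at a time.

    Stops when only d features are left, or when even the cheapest
    removal would lower the score by more than max_drop.
    """
    selected = list(features)
    current = score(tuple(selected))
    while len(selected) > d:

        def score_without(f):
            return score(tuple(x for x in selected if x != f))

        # The feature whose removal leaves the highest score hurts the least.
        victim = max(selected, key=score_without)
        new_score = score_without(victim)
        if current - new_score > max_drop:
            break  # significant decrease in performance
        selected.remove(victim)
        current = new_score
    return selected, current


# Made-up scores: f1 is useful alone but mostly repeats what f0 already says.
WEIGHTS = {"f0": 0.50, "f1": 0.45, "f2": 0.20, "f3": 0.02}


def toy_score(subset):
    total = sum(WEIGHTS[f] for f in subset)
    if "f0" in subset and "f1" in subset:
        total -= 0.42  # redundancy penalty
    return total


selected, final = backward_elimination(WEIGHTS, toy_score, d=1, max_drop=0.05)
print(selected, f"{final:.2f}")  # ['f0', 'f2'] 0.70
```

Here the near-useless `f3` is dropped first, then the redundant `f1`. Removing anything further would cost at least 0.20, so the search stops with two features even though `d=1` was allowed.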
Both techniques are fast, but neither guarantees that some untried feature set would not be better. The only thing they guarantee is the elimination of features whose information content is subsumed by other features.
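A tiny constructed example shows how a greedy search can miss the best subset. The scores below are made up for illustration: features "b" and "c" are weak on their own but strong together, while "a" is the best single feature. Only forward selection is shown, and `greedy_forward` is a hypothetical helper.

```python
from itertools import combinations

# Hypothetical scores for every non-empty subset of three features.
SCORES = {
    frozenset("a"): 0.70,
    frozenset("b"): 0.60,
    frozenset("c"): 0.60,
    frozenset("ab"): 0.75,
    frozenset("ac"): 0.75,
    frozenset("bc"): 0.90,
    frozenset("abc"): 0.90,
}


def score(subset):
    return SCORES[frozenset(subset)]


def greedy_forward(features, d):
    """Add the feature that helps most at each step until d are selected."""
    selected = []
    while len(selected) < d:
        remaining = [f for f in features if f not in selected]
        selected.append(max(remaining, key=lambda f: score(selected + [f])))
    return selected


features = ["a", "b", "c"]
greedy = greedy_forward(features, d=2)
best = max(combinations(features, 2), key=score)
print(sorted(greedy), score(greedy))  # ['a', 'b'] 0.75
print(sorted(best), score(best))      # ['b', 'c'] 0.9
```

Forward selection commits to "a" in the first step and can never reach the pair {"b", "c"}, which only an exhaustive check of all pairs finds.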
🟠 Stochastic feature selection
Feature selection is a combinatorial optimisation problem:
- Simulated annealing or genetic algorithms can be used to search for the global maximum
- Potentially very good results
- Potentially very expensive
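To give a feel for the stochastic approach, here is a minimal simulated annealing sketch over feature subsets. It is an assumption-laden toy, not something Mlxtend provides: `anneal_features`, the cooling schedule, and the made-up `WEIGHTS` (negative values play the role of detrimental features) are all my own choices for illustration.

```python
import math
import random


def anneal_features(features, score, n_steps=2000, t_start=1.0, t_end=0.01, seed=0):
    """Search feature subsets with simulated annealing; return the best subset seen."""
    rng = random.Random(seed)
    features = list(features)
    current = {f for f in features if rng.random() < 0.5}  # random starting subset
    current_score = score(current)
    best, best_score = set(current), current_score
    for step in range(n_steps):
        # Geometric cooling from t_start down to t_end.
        temperature = t_start * (t_end / t_start) ** (step / max(n_steps - 1, 1))
        # Neighbour: flip one randomly chosen feature in or out of the subset.
        neighbour = current ^ {rng.choice(features)}
        neighbour_score = score(neighbour)
        delta = neighbour_score - current_score
        # Always accept improvements; accept worse moves with probability exp(delta / T).
        if delta >= 0 or rng.random() < math.exp(delta / temperature):
            current, current_score = neighbour, neighbour_score
            if current_score > best_score:
                best, best_score = set(current), current_score
    return best, best_score


# Made-up contributions: positive features help, negative ones hurt.
WEIGHTS = {"f0": 0.30, "f1": 0.25, "f2": 0.20, "f3": -0.10, "f4": -0.15, "f5": -0.05}


def toy_score(subset):
    return sum(WEIGHTS[f] for f in sorted(subset))


best, best_score = anneal_features(WEIGHTS, toy_score)
print(sorted(best), round(best_score, 2))
```

Early on, the high temperature lets the search accept worse subsets and wander freely. As the temperature drops, it settles into good regions. Each step costs one score evaluation, so with a real cross-validated model the 2000 steps used here would already be expensive.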
🐍 Python Implementation
Let's try to find the most important features of the wine dataset using the above techniques. Keep in mind the following:
- Setting our desired number of features too low could lead to a sub-optimal solution.
- Mlxtend's feature selector uses cross-validation internally, and we set the number of folds to 10.
Below you can see screenshots of a working example. The full notebook is available on my GitHub page.
🚀 For people who like video courses and want to kick-start a career in data science today, I highly recommend the below video course from Udacity:
📚 And for book lovers:
- "Python for Data Analysis" by Wes McKinney, best known for creating the Pandas project.
- "Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow" by Aurelien Geron, currently ranked first among Amazon's best sellers in AI & Machine Learning.
- "Deep Learning" by Ian Goodfellow, research scientist at OpenAI.
This brings us to the end of this article. I hope you are now aware of the Mlxtend Python library and how it can be used for feature selection.
Thanks for reading; if you liked this article, please consider subscribing to my blog. That way I know that my work is valuable to you, and I can notify you about future articles.