When loading a new data set, the first thing we do is to get an understanding of the data. This includes steps like determining the number of unique values, identifying the data type, as well as computing the number or percentage of missing values for each variable.
The well-known pandas library has many useful functions for EDA, such as df.info(null_counts=True) and df.describe(); however, they provide only a limited amount of information. (In later pandas releases the null_counts argument was renamed show_counts.) Looking online for something richer, I came across pandas-profiling, which does an excellent job of helping you perform a simple, quick EDA.
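To make the first-look steps above concrete, here is a minimal sketch that collects the data type, number of unique values and missing-value counts per column using plain pandas. The toy DataFrame and the helper name `quick_overview` are assumptions made for illustration; they are not part of pandas or pandas-profiling.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, used only for illustration.
df = pd.DataFrame(
    {
        "age": [25, 32, np.nan, 41, 32],
        "city": ["Oslo", "Rome", "Rome", None, "Oslo"],
    }
)


def quick_overview(frame: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, unique count and missing-value stats for each column."""
    missing = frame.isna()
    return pd.DataFrame(
        {
            "dtype": frame.dtypes.astype(str),
            "n_unique": frame.nunique(),        # missing values are not counted
            "n_missing": missing.sum(),
            "pct_missing": missing.mean() * 100,
        }
    )


overview = quick_overview(df)
print(overview)
```

This is roughly the information you would otherwise piece together from df.info() and df.describe(), which is the gap a profiling report aims to fill.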
✏️ Table of Contents
- How to install it?
- How to use it?
- Source Code
Pandas-profiling generates profile reports from a pandas DataFrame. As said, the pandas df.describe() function is great but a little basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.
For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:
- Essentials: type, unique values, missing values
- Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
- Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
- Most frequent values
- Correlations: highlighting of highly correlated variables; Spearman, Pearson and Kendall matrices
- Missing values: matrix, count, heatmap and dendrogram of missing values
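As a rough illustration of the correlation item in the list above, the sketch below computes Pearson and Spearman matrices with pandas' own DataFrame.corr and flags highly correlated pairs. The synthetic data, the helper name `highly_correlated` and the 0.9 threshold are all assumptions chosen for the example; this is not how pandas-profiling itself is implemented.

```python
import numpy as np
import pandas as pd

# Hypothetical numeric data: "b" is a linear function of "a", "c" is independent noise.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
df = pd.DataFrame({"a": a, "b": 2 * a + 1, "c": rng.normal(size=200)})

# DataFrame.corr also accepts method="kendall".
pearson = df.corr(method="pearson")
spearman = df.corr(method="spearman")


def highly_correlated(corr: pd.DataFrame, threshold: float = 0.9) -> list:
    """List (col_1, col_2, value) for each distinct pair with |corr| >= threshold."""
    cols = list(corr.columns)
    pairs = []
    for i, left in enumerate(cols):
        for right in cols[i + 1:]:
            value = float(corr.loc[left, right])
            if abs(value) >= threshold:
                pairs.append((left, right, value))
    return pairs


flagged = highly_correlated(pearson)  # threshold of 0.9 is an arbitrary choice
print(flagged)
```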
❓How to install it?
You can install using the pip package manager by running
pip install pandas-profiling
You can install using the conda package manager by running
conda install -c conda-forge pandas-profiling
🤔 How to use it?
import pandas as pd
import pandas_profiling
display() requires a Jupyter notebook. Alternatively, you can write the report to a file.
profile = pandas_profiling.ProfileReport(df)
display(profile)
# can output to file...
# profile.to_file(outputfile="/tmp/myoutputfile.html")
For large datasets, the analysis can run out of memory. In that case, it is useful to disable the correlation analysis as shown below:
profile = pandas_profiling.ProfileReport(df, check_correlation=False)
Another solution might be to curtail the size of the dataframe and run it only on a subset.
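A minimal sketch of the subset approach, assuming a random sample is representative enough for a first look. The helper name `subset_for_profiling`, the row cap and the seed are hypothetical choices for this example.

```python
import pandas as pd


def subset_for_profiling(
    frame: pd.DataFrame, max_rows: int = 10_000, seed: int = 0
) -> pd.DataFrame:
    """Return the frame unchanged if it is small enough, else a random sample."""
    if len(frame) <= max_rows:
        return frame
    # A fixed random_state keeps the sample reproducible between runs.
    return frame.sample(n=max_rows, random_state=seed)


# Hypothetical large frame, used only for illustration.
big = pd.DataFrame({"x": range(50_000)})
small = subset_for_profiling(big, max_rows=5_000)
print(len(small))
```

The reduced frame can then be passed to pandas_profiling.ProfileReport in place of the full one. Keep in mind that rare values and missing-value percentages in the report then describe the sample, not the whole data set.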
You can see an example on my GitHub page.
🚀 For people who like video courses and want to kick-start a career in data science today, I highly recommend the below video course from Udacity:
📚 While for book lovers:
- "Python for Data Analysis" by Wes McKinney, best known for creating the Pandas project.
- "Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow" by Aurelien Geron, currently ranking first in the best sellers Books in AI & Machine Learning on Amazon.
- "Deep Learning" by Ian Goodfellow, research scientist at OpenAI.
💻 Source Code
Since I am not a big fan of using black-box parts in my code, I am going to present below the source code for a numeric variable:
The source code is available on GitHub.
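To give a feel for what such code involves, here is a minimal sketch of how the statistics listed earlier could be computed for a numeric column with plain pandas. It is an illustration only, not pandas-profiling's actual implementation, and `describe_numeric` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd


def describe_numeric(series: pd.Series) -> dict:
    """Summarise a numeric column with plain pandas (NaN values are skipped)."""
    q1, median, q3 = series.quantile([0.25, 0.5, 0.75])
    mean = series.mean()
    std = series.std()
    minimum, maximum = series.min(), series.max()
    return {
        "count": int(series.count()),
        "n_missing": int(series.isna().sum()),
        "n_unique": int(series.nunique()),
        "min": minimum,
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": maximum,
        "range": maximum - minimum,
        "iqr": q3 - q1,
        "mean": mean,
        "std": std,
        "sum": series.sum(),
        # Median absolute deviation, computed by hand around the median.
        "mad": (series - median).abs().median(),
        # Coefficient of variation is undefined when the mean is zero.
        "cv": std / mean if mean != 0 else float("nan"),
        "kurtosis": series.kurt(),
        "skewness": series.skew(),
    }


# Hypothetical column with one missing value.
stats = describe_numeric(pd.Series([1.0, 2.0, 3.0, 4.0, np.nan]))
for name, value in stats.items():
    print(f"{name}: {value}")
```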
This brings us to the end of this article. I hope you now have a basic understanding of how Pandas Profiling can help you perform a simple, quick EDA.
Thanks for reading. If you liked this article, please consider subscribing to my blog. That way I know that my work is valuable to you, and I can notify you about future articles.