When loading a new data set, the first thing we do is to get an understanding of the data. This includes steps like determining the number of unique values, identifying the data type, as well as computing the number or percentage of missing values for each variable.

The well-known pandas library has many useful functions for EDA, such as df.info(show_counts=True) (null_counts=True in older pandas versions) and df.describe(), but they provide only a limited amount of information. Searching online, I came across pandas-profiling, which does an excellent job of helping you perform a quick, simple EDA.
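
To make the "first look" steps concrete, here is a minimal sketch of what they look like in plain pandas. The helper name column_overview and the tiny example frame are made up for illustration; this is not part of pandas or pandas-profiling.

```python
import numpy as np
import pandas as pd


def column_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, unique count, and missing-value stats."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_unique": df.nunique(dropna=True),  # NaN/None are not counted
            "n_missing": df.isna().sum(),
            "pct_missing": (df.isna().mean() * 100).round(2),
        }
    )


# Small made-up frame, used only to illustrate the helper.
example = pd.DataFrame(
    {
        "age": [25, 32, np.nan, 41],
        "city": ["Athens", "Athens", None, "Patras"],
    }
)

overview = column_overview(example)
print(overview)
```

This works, but every extra statistic means more hand-written code, which is where a profiling tool starts to pay off.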

✏️ Table of Contents

  • Introduction
  • How to install it?
  • How to use it?
  • Source Code
  • Conclusion
  • References

🌟 Introduction

Pandas-profiling generates profile reports from a pandas DataFrame. As said, the pandas df.describe() function is great but a little basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.

For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:

  • Essentials: type, unique values, missing values
  • Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
  • Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
  • Most frequent values
  • Histogram
  • Correlations: highlighting of highly correlated variables; Spearman, Pearson and Kendall matrices
  • Missing values: matrix, count, heatmap and dendrogram
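
To give a feel for what sits behind the numeric entries in this list, the following is a rough sketch that computes several of them for a single column in plain pandas. It is my own approximation, not the library's code: the function name numeric_summary is hypothetical, and "mad" here is assumed to mean the median absolute deviation around the median, as the list describes it.

```python
import pandas as pd


def numeric_summary(s: pd.Series) -> pd.Series:
    """Compute a handful of the listed statistics for one numeric column."""
    s = s.dropna()
    q1, median, q3 = s.quantile([0.25, 0.5, 0.75])
    return pd.Series(
        {
            "min": s.min(),
            "q1": q1,
            "median": median,
            "q3": q3,
            "max": s.max(),
            "range": s.max() - s.min(),
            "iqr": q3 - q1,
            "mean": s.mean(),
            "std": s.std(),
            "sum": s.sum(),
            # Median absolute deviation around the median (an assumption).
            "mad": (s - median).abs().median(),
            # Coefficient of variation; not meaningful when the mean is zero.
            "cv": s.std() / s.mean(),
            "kurtosis": s.kurt(),
            "skewness": s.skew(),
        }
    )


# Made-up values, for illustration only.
values = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0])
summary = numeric_summary(values)
print(summary)
```

The report computes all of this for every column at once and adds the plots, which is exactly the repetitive work you do not want to write by hand.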

❓How to install it?

Using pip

You can install it with the pip package manager by running:

pip install pandas-profiling

Using conda

You can install it with the conda package manager by running:

conda install -c conda-forge pandas-profiling

🤔 How to use it?

Import it

import pandas as pd
import pandas_profiling

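In a Jupyter notebook the report can be displayed inline. Outside a notebook, you can write it to an HTML file instead.

# df is a pandas DataFrame you have already loaded, e.g. with pd.read_csv(...)
profile = pandas_profiling.ProfileReport(df)
# Alternatively, write the report to an HTML file
# (newer releases name this argument output_file):
# profile.to_file(outputfile="/tmp/myoutputfile.html")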

For large datasets, the analysis can run out of memory. In that case, it is useful to disable the correlation analysis as shown below:

profile = pandas_profiling.ProfileReport(df, check_correlation=False)

Another solution might be to curtail the size of the dataframe and run it only on a subset.
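
A minimal sketch of the subset approach, using only pandas. The frame large_df, the 10% fraction and the 1,000-row cap are arbitrary choices for illustration; pick values that suit your data and memory.

```python
import numpy as np
import pandas as pd

# Made-up frame standing in for a large dataset.
rng = np.random.default_rng(0)
large_df = pd.DataFrame(
    {
        "x": rng.normal(size=10_000),
        "group": rng.choice(["a", "b", "c"], size=10_000),
    }
)

# Random 10% sample; random_state makes the subset reproducible.
subset = large_df.sample(frac=0.1, random_state=42)

# Alternatively, cap the row count at a fixed (arbitrary) size.
max_rows = 1_000
capped = large_df.sample(n=min(max_rows, len(large_df)), random_state=42)

print(len(subset), len(capped))
```

You would then pass subset (or capped) to ProfileReport in place of the full DataFrame. A random sample usually keeps the distributions representative, whereas simply taking the first rows may not if the file is sorted.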

You can see an example on my GitHub page.

💻 Source Code

Since I am not a big fan of using black-box parts in my code, I am going to present below the source code for a numeric variable:

The source code is available on GitHub.

🤖 Conclusion

This brings us to the end of this article. I hope you now have a basic understanding of how pandas-profiling can help you perform a quick, simple EDA.

Thanks for reading! If you liked this article, please consider subscribing to my blog. That way I know my work is valuable to you, and you get notified about future articles.

💪💪💪💪 As always keep studying, keep creating 🔥🔥🔥🔥

🔘 References