When loading a new data set, the first step is to get an understanding of the data. This includes determining the number of unique values, identifying the data types, and computing the number or percentage of missing values for each variable.

The well-known pandas library has many useful functions for EDA, such as df.info(show_counts=True) (null_counts=True in older pandas versions) and df.describe(); however, they provide only a limited amount of information. Searching online, I came across pandas-profiling, which does an excellent job of helping you perform a simple, quick EDA.


Table of Contents

  • Introduction
  • How to install it?
  • How to use it?
  • Source Code
  • Conclusion
  • References

Introduction

Pandas-profiling generates profile reports from a pandas DataFrame. As said, the pandas df.describe() function is great but a little basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.
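
For comparison, here is roughly what the first-look checks look like when assembled by hand with plain pandas. This is only a minimal sketch: the helper name quick_overview and the toy data are hypothetical and are not part of pandas or pandas-profiling.

```python
import pandas as pd


def quick_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, unique count and missing-value stats."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_unique": df.nunique(),               # missing values are not counted
        "n_missing": df.isna().sum(),
        "pct_missing": df.isna().mean() * 100,  # share of missing rows, in percent
    })


# Hypothetical toy data, used only to demonstrate the helper
toy = pd.DataFrame({
    "age": [22.0, 35.0, None, 35.0],
    "city": ["Athens", "Athens", "Patras", None],
})
overview = quick_overview(toy)
print(overview)
```

This covers only the "essentials" part of a profile; everything else in the report would need further hand-written code, which is where the library saves time.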

For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:

  • Essentials: type, unique values, missing values
  • Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
  • Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
  • Most frequent values
  • Histogram
  • Correlations: highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices
  • Missing values: matrix, count, heatmap and dendrogram of missing values
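
To make the quantile and descriptive statistics above more concrete, the sketch below computes several of them for a single numeric column with plain pandas. It is an illustration under my own assumptions, not the code pandas-profiling runs internally; the function name numeric_summary and the sample values are hypothetical.

```python
import pandas as pd


def numeric_summary(s: pd.Series) -> pd.Series:
    """Compute a handful of per-column statistics for a numeric Series."""
    q1, median, q3 = s.quantile([0.25, 0.5, 0.75])
    return pd.Series({
        "min": s.min(),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": s.max(),
        "range": s.max() - s.min(),
        "iqr": q3 - q1,
        "mean": s.mean(),
        "std": s.std(),
        "sum": s.sum(),
        # median absolute deviation, computed by hand
        "mad": (s - median).abs().median(),
        "cv": s.std() / s.mean(),  # coefficient of variation
        "kurtosis": s.kurt(),
        "skewness": s.skew(),
    })


# Hypothetical sample values with one large outlier
values = pd.Series([1.0, 2.0, 3.0, 4.0, 10.0])
summary = numeric_summary(values)
print(summary)
```

The outlier pulls the mean above the median and gives a positive skewness, which is the kind of signal the report surfaces for every numeric column at once.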

How to install it?

Using pip

You can install it with the pip package manager by running:

pip install pandas-profiling

Using conda

You can install it with the conda package manager by running:

conda install -c conda-forge pandas-profiling

How to use it?

Import it

import pandas as pd
import pandas_profiling

Note that display() requires a Jupyter notebook. Alternatively, you can write the report to a file.

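# df is a pandas DataFrame that you have already loaded
profile = pandas_profiling.ProfileReport(df)
display(profile)  # renders the report inline in a Jupyter notebook
# Alternatively, write the report to an HTML file
# (newer releases name this argument output_file):
# profile.to_file(outputfile="/tmp/myoutputfile.html")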

For large datasets the analysis can run out of memory. In that case it is useful to disable the correlation analysis as shown below:

profile = pandas_profiling.ProfileReport(df, check_correlation=False)

Another solution might be to reduce the size of the DataFrame and run the report only on a subset.
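
One way to take such a subset is a reproducible random sample with pandas. The sketch below is only a suggestion: the helper name sample_for_profiling, the default of 10,000 rows and the toy DataFrame are my own assumptions, so pick a row limit that fits your memory budget.

```python
import pandas as pd


def sample_for_profiling(df: pd.DataFrame, max_rows: int = 10_000, seed: int = 0) -> pd.DataFrame:
    """Return df unchanged if it is small enough, else a random sample of max_rows rows."""
    if len(df) <= max_rows:
        return df
    # random_state makes the sample reproducible between runs
    return df.sample(n=max_rows, random_state=seed)


# Hypothetical large DataFrame, used only to demonstrate the helper
big = pd.DataFrame({"x": range(100_000)})
subset = sample_for_profiling(big, max_rows=1_000)
print(len(subset))
```

The resulting subset can then be passed to ProfileReport in place of the full DataFrame. Keep in mind that statistics computed on a sample are estimates, so rare values and missing-value patterns may be under-represented.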

You can see an example on my GitHub page.


Source Code

Since I am not a big fan of using black-box parts in my code, I am going to present below the source code for a numeric variable:

The source code is available on GitHub.


Conclusion

This brings us to the end of this article. I hope you got a basic understanding of how pandas-profiling can be used for a quick EDA. Please remember to use it especially when exploring a new data set for the first time.

If you liked this article, please consider subscribing to my blog. That way I get to know that my work is valuable to you and can notify you about future articles.
Thanks for reading, and I am looking forward to hearing your questions :)
Stay tuned and Happy Machine Learning.


References