When loading a new data set, the first step is to get an understanding of the data. This includes steps like determining the number of unique values, identifying the data type, and computing the number or percentage of missing values for each variable.
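These first-look checks can be done in a few lines of plain pandas. The toy dataframe below is a made-up example just for illustration:

```python
import pandas as pd

# Hypothetical toy data set; in practice df is whatever you just loaded.
df = pd.DataFrame({
    "age": [25, 32, None, 41],
    "city": ["Athens", "Berlin", "Athens", None],
})

# Number of unique values per column (missing values excluded)
unique_counts = df.nunique()

# Data type of each column
dtypes = df.dtypes

# Percentage of missing values per column
missing_pct = df.isna().mean() * 100

print(unique_counts)
print(dtypes)
print(missing_pct)
```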
The well-known pandas library has many extremely useful functions for EDA, such as df.info(null_counts=True) and df.describe(); however, they provide only a limited amount of information. So I searched online and came across pandas-profiling, which does an excellent job of helping you perform a quick, simple EDA.
Table of Contents
- How to install it?
- How to use it?
- Source Code
Pandas-profiling generates profile reports from a pandas DataFrame. As mentioned, the pandas df.describe() function is great but a little basic for serious exploratory data analysis. pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis.
For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:
- Essentials: type, unique values, missing values
- Quantile statistics like minimum value, Q1, median, Q3, maximum, range, interquartile range
- Descriptive statistics like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
- Most frequent values
- Correlations: highlighting of highly correlated variables; Spearman, Pearson and Kendall matrices
- Missing values: matrix, count, heatmap and dendrogram of missing values
How to install it?
You can install using the pip package manager by running
pip install pandas-profiling
You can install using the conda package manager by running
conda install -c conda-forge pandas-profiling
How to use it?
import pandas as pd
import pandas_profiling
display() requires a Jupyter notebook. Alternately you can output to file.
profile = pandas_profiling.ProfileReport(df)
display(profile)
# can output to file...
# profile.to_file(outputfile="/tmp/myoutputfile.html")
For large datasets the analysis can run out of memory. In that case it is useful to disable the correlation analysis as shown below:
profile = pandas_profiling.ProfileReport(df, check_correlation=False)
Another solution might be to curtail the size of the dataframe and run the analysis only on a subset of it.
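A minimal sketch of the subset approach, using a hypothetical large dataframe: draw a random sample with pandas' built-in df.sample() and profile only that. The sizes and random_state below are arbitrary choices for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical large dataframe (100,000 rows, 5 numeric columns)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100_000, 5)), columns=list("abcde"))

# Profile only a random sample to keep memory usage manageable.
# random_state makes the subset reproducible.
sample = df.sample(n=10_000, random_state=42)

# profile = pandas_profiling.ProfileReport(sample)
print(len(sample))
```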
You can see an example on my github page.
Since I am not a big fan of using black-box parts in my code, I am going to present the source code for profiling a numeric variable:
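To give a flavour of what such source code computes (this is my own illustrative sketch of the statistics listed above, not pandas-profiling's actual implementation), a numeric-column profile can be built from standard pandas operations:

```python
import pandas as pd

def describe_numeric(series: pd.Series) -> dict:
    """Illustrative sketch: the kind of statistics a profiler reports
    for a numeric column. NOT pandas-profiling's actual source code."""
    clean = series.dropna()
    stats = {
        "n_unique": clean.nunique(),
        "n_missing": int(series.isna().sum()),
        "mean": clean.mean(),
        "std": clean.std(),
        "min": clean.min(),
        "q1": clean.quantile(0.25),
        "median": clean.median(),
        "q3": clean.quantile(0.75),
        "max": clean.max(),
        # median absolute deviation
        "mad": (clean - clean.median()).abs().median(),
        # coefficient of variation
        "cv": clean.std() / clean.mean(),
        "skewness": clean.skew(),
        "kurtosis": clean.kurt(),
    }
    stats["range"] = stats["max"] - stats["min"]
    stats["iqr"] = stats["q3"] - stats["q1"]
    return stats

result = describe_numeric(pd.Series([1.0, 2.0, 2.0, 3.0, None]))
print(result)
```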
The source code is available on GitHub.
This brings us to the end of this article. I hope you got a basic understanding of how pandas-profiling can speed up your EDA. Please remember to use it the next time you need a quick first look at a new data set.
If you liked this article, please consider subscribing to my blog. That way I get to know that my work is valuable to you, and I can also notify you about future articles.
Thanks for reading and I am looking forward to hearing your questions :)
Stay tuned and Happy Machine Learning.