When loading a new data set, the first thing we do is to get an understanding of the data. This includes steps like determining the number of unique values, identifying the data type, as well as computing the number or percentage of missing values for each variable.

The well-known pandas library has many useful functions for EDA, such as `df.info(null_counts=True)` (spelled `show_counts=True` in pandas 1.2 and later) and `df.describe()`; however, they provide only a limited amount of information. Searching online, I came across pandas-profiling, which does an excellent job of helping you perform a quick, simple **EDA**.
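To make the first-look steps above concrete, here is a minimal sketch of a hand-rolled overview using plain pandas. The helper name `quick_overview` and the toy data are my own assumptions for illustration; they are not part of pandas or pandas-profiling.

```python
import numpy as np
import pandas as pd


def quick_overview(df: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column: dtype, unique count, missing count and missing %."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_unique": df.nunique(),  # missing values are not counted as a value
            "n_missing": df.isna().sum(),
            "pct_missing": (df.isna().mean() * 100).round(1),
        }
    )


# Toy data, made up for illustration only.
df = pd.DataFrame(
    {
        "age": [22, 35, np.nan, 41],
        "city": ["Athens", "Paris", "Paris", None],
    }
)
overview = quick_overview(df)
print(overview)
```

This works, but it has to be extended by hand for every extra statistic, which is the gap a profiling report is meant to fill.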

**Table of Contents**

- Introduction
- How to install it?
- How to use it?
- Source Code
- Conclusion
- References

## Introduction

Pandas-profiling generates profile reports from a pandas `DataFrame`. As mentioned, the pandas `df.describe()` function is great but a little basic for serious exploratory data analysis. `pandas_profiling` extends the pandas DataFrame with `df.profile_report()` for quick data analysis.

For each column the following statistics - if relevant for the column type - are presented in an interactive HTML report:

- **Essentials**: type, unique values, missing values
- **Quantile statistics** like minimum value, Q1, median, Q3, maximum, range, interquartile range
- **Descriptive statistics** like mean, mode, standard deviation, sum, median absolute deviation, coefficient of variation, kurtosis, skewness
- **Most frequent values**
- **Histogram**
- **Correlations** highlighting of highly correlated variables, Spearman, Pearson and Kendall matrices
- **Missing values** matrix, count, heatmap and dendrogram of missing values
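To give a feel for what "highlighting of highly correlated variables" means, the sketch below flags column pairs whose absolute Pearson correlation exceeds a threshold, using plain pandas. The function name `highly_correlated_pairs`, the 0.9 threshold and the toy data are assumptions of mine; the report's own rule may differ.

```python
import pandas as pd


def highly_correlated_pairs(df: pd.DataFrame, threshold: float = 0.9) -> list:
    """List (col_a, col_b, r) for numeric column pairs with |Pearson r| > threshold."""
    corr = df.select_dtypes(include="number").corr(method="pearson")
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # visit each unordered pair once
            r = corr.loc[a, b]
            if abs(r) > threshold:
                pairs.append((a, b, float(r)))
    return pairs


# Toy data: "height_in" is a (rounded) rescaling of "height_cm".
toy = pd.DataFrame(
    {
        "height_cm": [150.0, 160.0, 170.0, 180.0, 190.0],
        "height_in": [59.06, 62.99, 66.93, 70.87, 74.80],
        "shoe_size": [40, 38, 44, 39, 42],
    }
)
flagged = highly_correlated_pairs(toy)
print(flagged)
```

Only the two height columns are flagged here, since one is essentially a unit conversion of the other.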

## How to install it?

### Using pip

You can install using the pip package manager by running

```
pip install pandas-profiling
```

### Using conda

You can install using the conda package manager by running

`conda install -c conda-forge pandas-profiling`

## How to use it?

**Import it**

```
import pandas as pd
import pandas_profiling
```

The example below uses `display()`, which requires a Jupyter notebook. Alternatively, you can write the report to a file.

```
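# df is assumed to be an existing pandas DataFrame, e.g. loaded with pd.read_csv(...)
profile = pandas_profiling.ProfileReport(df)
display(profile)  # renders the report inline in a Jupyter notebook
# Alternatively, write the report to an HTML file
# (the keyword is `outputfile` in older releases; newer releases use `output_file`):
# profile.to_file(outputfile="/tmp/myoutputfile.html")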
```

**For large datasets the analysis can run out of memory.** In that case, it is useful to disable the correlation analysis as shown below:

```
profile = pandas_profiling.ProfileReport(df, check_correlation=False)
```

Another solution might be to curtail the size of the dataframe and run it only on a subset.
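One simple way to take such a subset is a reproducible random sample with `DataFrame.sample`. This is a minimal sketch: `big_df` is a randomly generated stand-in for your data, and the 10% fraction and the seed are arbitrary choices of mine.

```python
import numpy as np
import pandas as pd

# Stand-in for a large DataFrame; the data is random and purely illustrative.
rng = np.random.default_rng(0)
big_df = pd.DataFrame(
    {
        "x": rng.normal(size=100_000),
        "y": rng.integers(0, 10, size=100_000),
    }
)

# Keep a reproducible random 10% of the rows.
subset = big_df.sample(frac=0.1, random_state=42)

# The subset can then be profiled in place of the full frame, e.g.
# profile = pandas_profiling.ProfileReport(subset)
print(len(big_df), len(subset))
```

Keep in mind that statistics computed on a sample are estimates, so rare values and rare missing-value patterns may not show up in the report.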

You can see an example on my GitHub page.

## Source Code

Since I am not a big fan of using black-box parts in my code, I am going to present below the source code for a numeric variable:

The source code is available on GitHub.
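As a rough illustration of the kind of statistics listed for a numeric column, the following sketch computes them with plain pandas. It is not the library's source code: the function name `describe_numeric` is hypothetical, and the exact definitions (for example of kurtosis or the median absolute deviation) may differ from what pandas-profiling uses.

```python
import numpy as np
import pandas as pd


def describe_numeric(series: pd.Series) -> dict:
    """Compute summary statistics for a numeric column (illustrative only)."""
    s = series.dropna()
    q1, median, q3 = s.quantile([0.25, 0.5, 0.75])
    mean, std = s.mean(), s.std()  # std is the sample standard deviation (ddof=1)
    return {
        "n_missing": int(series.isna().sum()),
        "n_unique": int(s.nunique()),
        "min": s.min(),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": s.max(),
        "range": s.max() - s.min(),
        "iqr": q3 - q1,
        "mean": mean,
        "std": std,
        "sum": s.sum(),
        "mad": (s - median).abs().median(),  # median absolute deviation
        "cv": std / mean if mean != 0 else np.nan,  # coefficient of variation
        "kurtosis": s.kurt(),  # pandas reports excess kurtosis
        "skewness": s.skew(),
    }


# Toy column with one missing value, made up for illustration.
values = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0, np.nan])
stats = describe_numeric(values)
for name, value in stats.items():
    print(f"{name}: {value}")
```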

## Conclusion

This brings us to the end of this article. I hope you got a basic understanding of how pandas-profiling is used. Please remember to use it, especially when you start exploring a new data set.

If you liked this article, please consider subscribing to my blog. That way I get to know that my work is valuable to you, and I can notify you about future articles.

Thanks for reading and I am looking forward to hearing your questions :)

*Stay tuned and Happy Machine Learning.*

## References

- https://github.com/pandas-profiling/pandas-profiling
- https://github.com/geodra/Articles/blob/master/Pandas_Profiling.ipynb
- Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems