top of page

Welcome
to NumpyNinja Blogs

NumpyNinja: Blogs. Demystifying Tech,

One Blog at a Time.
Millions of views. 

EDA using Pandas Profiling

Updated: Apr 8, 2022


ree

EDA (Exploratory Data Analysis) :


Exploratory Data Analysis (EDA) is an approach for data analysis that employs a variety of techniques (mostly graphical) to

  1. Maximize insight into a data set

  2. Uncover underlying structure

  3. Extract important variables

  4. Detect outliers and anomalies

  5. Test underlying assumptions

With the help of EDA, we can find out missing values, duplicated rows, no. of columns , no. of rows, shape of data, size of data, mean, median. In general various insights of data. For that we have to write more code. But with the help of Pandas library we can do that in easy way.


EDA using Pandas Profiling :


EDA can be automated with the help of pandas library caller Pandas Profiling . It is a great tool to create reports in the interactive HTML format which is quite easy to understand and analyze the data. Pandas_profiling extends the pandas DataFrame with df.profile_report() for quick data analysis. Lets explore Pandas Profiling in detail and see the benefit of this Magical single line of code.


Import Dataset :


Lets take a example of IPL Dataset :


import pandas as pd
df = pd.read_csv('ipl_data.csv')

ree

Installation of Pandas Profiling:


First step is to install using command:


! pip install pandas-profiling

Then we generate the report using these commands:


from pandas_profiling import ProfileReport
prof = ProfileReport(df)
prof.to_file(output_file ='output.html')

It will show progress like this


ree

If you open "Ouput.html" it will give full details in following sections:

  1. Overview

  2. Variables

  3. Interactions

  4. Correlations

  5. Missing Values

  6. Sample

We will discuss each sections in details


Overview :


Overview provides statistics about Dataset like no. of variables (Columns) , no. of observations (Rows) , Missing cell (no. of cells = no. of row * no. of columns ), Duplicate cells , Total size in memory. Variable type (like Categorical, Numerical, Boolean) .


ree

In Alerts/ Warning we can find warnings on variables (column) like which two columns are highly corelated , Which column has unique value, Constant value .


ree

The reproduction tab simply displays information related to the report generation. It shows the start and ends the time of the analysis, the time taken to generate the report, the software version of pandas profiling, and a configuration download option.



ree

Variables :


This section provides detail analysis on Variables/Columns/Features of the Dataset that totally depend on type of Variables/Columns/Features like Numeric, String , Boolean etc. .


Numeric variables :


For Numeric variable report provides details like distinct , missing value , mean , zeros , memory size. You also get distribution in Histogram graph.


ree

If you want to know more details like Statistics, Histogram, Common values, Extreme values you can press on the Toggle details button.


ree


String variables:


For string variables, you will get Distinct (unique) values, distinct percentage, missing, missing percentage, memory size, and a horizontal bar presentation of all the unique values with count presentation.


ree

In toggle details you will find :


ree

Boolean variables:


For boolean variables, you will get Distinct (unique) values, distinct percentage, missing, missing percentage, false and true.


ree

In toggle details you will find:


ree

Interaction :


Interaction sections gives more details with bivariate analysis/multivariate analysis .



ree

Correlations :


Correlation is a statistical measure that expresses the extent to which two variables are linearly related (meaning they change together at a constant rate). It’s a common tool for describing simple relationships without making a statement about cause and effect. In the pandas profiling report, we have 5 types of correlation coefficients: Pearson’s r, Spearman’s ρ, Kendall’s τ, Phik (φk), and Cramer's V (φc).



ree

Missing values:

This report also gives detail analysis of Missing values in the four types of graph Count, matrix, Heatmap and dendrogram.



ree

ree

Sample:

This section displays the first and last 10 rows of the dataset.

ree

Pandas profiling disadvantage :

The main disadvantage of pandas profiling is its use with large datasets. With the increase in the size of the data the time to generate the report also increases a lot.


One way to solve this problem is to generate the report from only a part of all the data we have.


An example with code:



from pandas_profiling import ProfileReport
prof = ProfileReport(df.sample(n=100)
prof.to_file(output_file ='output.html')

+1 (302) 200-8320

NumPy_Ninja_Logo (1).png

Numpy Ninja Inc. 8 The Grn Ste A Dover, DE 19901

© Copyright 2025 by Numpy Ninja Inc.

  • Twitter
  • LinkedIn
bottom of page