NumpyNinja Blogs

Exploratory Data Analysis in Python

Exploratory Data Analysis (EDA) plays a crucial role in the early stages of any data project. It is a way of looking at data to understand what is going on inside it, and it gives us a detailed statistical summary of the dataset.

With EDA, you can:

  • Get a quick summary of the data, like averages and totals

  • Identify duplicate entries

  • Find any outliers in the dataset

  • Notice patterns or trends that might be significant

Basically, it helps you clean up the data and figure out what story it’s telling before doing deeper analysis or building models. Skipping the data exploration step can lead to unreliable insights and poor outcomes for the business. Building models in Python without performing EDA increases the risk of using incomplete or biased data.

Now that we know the importance of EDA, let’s dive into some of the Python libraries and tools that can be used in this process. There are three main steps:

  • Step 1: Dataset Overview and Descriptive Statistics

  • Step 2: Data Quality Evaluation

  • Step 3: Data Visualization

Dataset Overview and Descriptive Statistics – In this step, we begin by importing the Python libraries required for the project.


Next, we create a DataFrame by loading our dataset with the pandas library. I have used the World Population dataset that is available on Kaggle. We can read CSV files using the read_csv() method; df is the DataFrame.
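As a sketch of this setup (the file name and the sample values below are assumptions, not the real Kaggle data), a tiny inline stand-in lets the snippet run without downloading the CSV:

```python
import pandas as pd

# Normally we would load the Kaggle dataset (file name is an assumption):
# df = pd.read_csv("world_population.csv")

# Tiny inline stand-in with the same kind of columns, so the example runs:
df = pd.DataFrame({
    "Country": ["China", "India", "Monaco"],
    "2022 Population": [1425887337, 1417173173, 36469],
    "Growth Rate": [1.0000, 1.0068, 1.0008],
})
print(df.head())
```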


We move on to explore the dataset using the functions that pandas provides. These functions help us gain more insight into the dataset. They are as follows:

df.info() – This function provides a high-level overview of the dataset. It reports the number of columns, their names, the count of non-null values, and each column’s data type.


We notice that the non-null count for each column is 234 up to the ‘2022 Population’ column, and then decreases through the ‘Growth Rate’ column.
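A minimal sketch of this call, using an illustrative stand-in frame (the values are assumptions) with one deliberately missing entry so the non-null counts differ:

```python
import io
import numpy as np
import pandas as pd

# Stand-in frame mimicking the dataset, with one missing value
df = pd.DataFrame({
    "Country": ["China", "India", "Monaco"],
    "2022 Population": [1425887337, 1417173173, 36469],
    "2020 Population": [1424929781.0, 1396387127.0, np.nan],
})

buf = io.StringIO()
df.info(buf=buf)       # writes the overview into the buffer
print(buf.getvalue())  # column names, non-null counts, dtypes
```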


df.describe() – This function provides a quick statistical summary of all the numeric columns in the dataset, with statistics like the count, mean, standard deviation, minimum, maximum, and quartiles.

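A short sketch of describe() on the same kind of stand-in data (values are illustrative, not the real dataset):

```python
import pandas as pd

# Illustrative stand-in for the dataset's columns
df = pd.DataFrame({
    "Country": ["China", "India", "Monaco"],
    "2022 Population": [1425887337, 1417173173, 36469],
    "Growth Rate": [1.0000, 1.0068, 1.0008],
})

# count, mean, std, min, quartiles, max for the numeric columns
summary = df.describe()
print(summary)
```

Note that with mixed dtypes, describe() summarizes only the numeric columns by default.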

Data Quality Evaluation - “Complicating factors” are things that can make working with data harder. They might be mistakes that happen when collecting or handling the data, or they could be natural quirks in the data itself.

Some common examples include:

  • Missing values

  • Uneven amounts of data across categories (imbalanced data)

  • Columns that never change (constant values)

  • Repeated rows (duplicates)


df.isnull().sum() – As the name suggests, this function helps us trace the missing values in the dataset. isnull() marks each value with a Boolean, True for null values and False for non-null values, and sum() then counts the True values per column.


As per the above output, 4 values are missing from the ‘2020 Population’ column and 7 from the ‘2000 Population’ column. Tracing the missing values is crucial to avoid basing our insights on biased data. We can then decide to handle the missing values manually or to use data-cleaning tools.
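A minimal sketch, again on an inline stand-in (the missing entries are planted so the counts are visible):

```python
import numpy as np
import pandas as pd

# Stand-in frame with deliberately missing values
df = pd.DataFrame({
    "Country": ["China", "India", "Monaco"],
    "2022 Population": [1425887337, 1417173173, 36469],
    "2020 Population": [1424929781.0, np.nan, np.nan],
})

# True (1) per null cell, summed per column
missing = df.isnull().sum()
print(missing)
```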


df.nunique() – This function gives insight into how many unique values exist in each column. 

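A quick sketch with assumed sample columns, where one column repeats values so the counts differ:

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["China", "India", "Monaco"],
    "Continent": ["Asia", "Asia", "Europe"],
})

# Distinct values per column (NaNs are excluded by default)
unique_counts = df.nunique()
print(unique_counts)
```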

df.sort_values().head() – This function sorts the dataset values in ascending or descending order, and head() shows the first rows of the result.


This gives us a list of countries ranked by their 2022 population.
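Sketched on the stand-in frame (values are illustrative), sorting by the 2022 population column looks like this:

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["Monaco", "China", "India"],
    "2022 Population": [36469, 1425887337, 1417173173],
})

# Largest populations first; head() keeps the top rows
top = df.sort_values(by="2022 Population", ascending=False).head(2)
print(top)
```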


Data Visualization – This is the next task in the EDA process. Now that we have a high-level overview of the dataset, we will work on finding the correlations between the variables.


df.corr() – This helps understand the relationship between the numeric columns in the dataset.

To avoid a data type conversion error, we explicitly restrict the calculation to columns of type ‘number’, as follows:

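A sketch of this step, assuming the select_dtypes(include="number") approach to drop the text columns before correlating (the sample values are stand-ins):

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["China", "India", "Monaco"],
    "2022 Population": [1425887337, 1417173173, 36469],
    "2020 Population": [1424929781, 1396387127, 38964],
})

# Restrict to numeric columns before correlating, avoiding the dtype error
corr = df.select_dtypes(include="number").corr()
print(corr)
```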

 

Heatmaps: A heatmap is a data visualization technique that represents values as colors in two dimensions. Here, it shows the correlation between all the numerical variables in the dataset. In simpler terms, we can plot the correlation found above as a heatmap:


This gives us the correlation between each of the numeric columns.
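A minimal sketch assuming seaborn for the heatmap (the data is an inline stand-in, and the off-screen backend is only for running headlessly):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "2022 Population": [1425887337, 1417173173, 36469],
    "2020 Population": [1424929781, 1396387127, 38964],
})

corr = df.corr()
# Colored grid of the correlation matrix, with the values annotated
ax = sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.tight_layout()
# plt.show()  # display interactively
```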


Handling Outliers – An outlier is a data item that deviates significantly from the rest of the objects. Outliers can be caused by measurement or execution errors, and the analysis for detecting them is referred to as outlier mining. There are many ways to detect outliers, and removing one is the same as removing any other row from a pandas DataFrame.

We can use a box plot to identify outliers as follows:


In the above box plot, the blue boxes span the interquartile range of each distribution and the black lines mark the upper and lower limits, so the circles beyond them are the outliers. However, in this dataset some countries genuinely have far larger populations than others, so we can leave these outliers in place.
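A box-plot sketch with matplotlib on stand-in values (one extreme entry is planted so an outlier point appears):

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; drop this in a notebook
import matplotlib.pyplot as plt
import pandas as pd

df = pd.DataFrame({
    "2022 Population": [36469, 11312, 675000, 1425887337],  # one extreme value
})

fig, ax = plt.subplots()
# Box = interquartile range, whiskers = limits, points beyond = outliers
ax.boxplot(df["2022 Population"])
ax.set_ylabel("2022 Population")
# plt.show()
```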


Conclusion: This process helps us understand the data better by showing useful stats like averages and totals. It also lets us clean up duplicate entries, spot unusual values, and find patterns or trends in the data.

 
 
