Why Python for Data Analysis
- Neetu Rathaur
- Jan 9
- 5 min read

When I started learning data analysis, I observed that visualization is a very important and effective way to analyze any data. Then a question arose in my mind: why do we use Python when we already have tools like Tableau and Power BI for visualization?
Data analysis is not only about visualization. Python is a general-purpose data language that can:
Clean messy data
Combine data from many sources (APIs, databases, files)
Perform statistics, forecasting, and machine learning
Automate entire workflows end-to-end
Tableau and Power BI are powerful, but they are limited to what their UI supports.
Python is free and open source, while Tableau and Power BI have licensing costs.
Let’s discuss some common data analysis practices—other than visualization—that every analyst typically follows:
1. Reading Data
2. Data Cleaning
3. Exploratory Data Analysis (EDA)
Reading Data with Python
Python has Pandas, a very powerful library that can be used to read data from files, databases, and APIs.
Reading data from a file:
Install the Pandas library and write an import statement to make sure the library is ready to use.
import pandas as pd
df = pd.read_csv('gestational_diabetes_data.csv')
The read_csv(file_name) method returns a Pandas DataFrame. We can use different methods for different file types, e.g. read_excel(), read_json().
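As a quick sketch (the file names here are just placeholders, not real files from this post), the other readers follow the same pattern and also return a DataFrame:
import pandas as pd

# Placeholder file names, only to show the other reader methods
excel_df = pd.read_excel('patients.xlsx')   # reading .xlsx files requires the openpyxl package
json_df = pd.read_json('patients.json')

# Quick check that the data loaded as expected
print(excel_df.head())
print(json_df.info())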
Reading from a database:
Install the required libraries for the database connection and import them:
pip install pandas sqlalchemy psycopg2-binary
from sqlalchemy import create_engine
Now create an engine to read from the database:
engine = create_engine("postgresql+psycopg2://postgres:1234@localhost:5432/questiondb")
query = """
select * from Questions where difficulty_level='Easy'
"""
df = pd.read_sql(query, engine)
df.head()
Now df is a Pandas DataFrame.

A database connection URL is a single string that tells your application how and where to connect to a database.
It is built using the database type, the driver used to connect, and the login details.
DatabaseType+Driver://username:password@hostname:Port/databaseName
A few example prefixes for different databases (a full connection sketch follows this list):
PostgreSQL → postgresql+psycopg2://
MySQL → mysql+pymysql://
SQL Server → mssql+pyodbc://
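As a small sketch with made-up credentials (not a real database from this post), the same pattern builds engines for the other databases; each one needs its own driver package installed, for example pymysql or pyodbc:
from sqlalchemy import create_engine

# Made-up credentials, only to show the databasetype+driver://user:password@host:port/dbname pattern
mysql_engine = create_engine("mysql+pymysql://analyst:secret@localhost:3306/questiondb")
mssql_engine = create_engine(
    "mssql+pyodbc://analyst:secret@localhost:1433/questiondb?driver=ODBC+Driver+17+for+SQL+Server"
)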
Reading data from an API:
Reading data from an API is one of the most common scenarios in data analysis. These APIs can be with or without authentication, and there can be nested APIs and APIs that return large amounts of data. We can learn those scenarios one after another. First, let's see how to read data from a basic API without authentication. Here I have created a simple API that fetches data from a database and returns it as a response.
Install the required library to get the response:
pip install requests
Code to get the data into a DataFrame:
import requests
url = "http://localhost:8081/questions/allQuestions"
response = requests.get(url)
api_df = pd.DataFrame(response.json())
url = "http://localhost:8081/questions/allQuestions"
response = requests.get(url)
api_df = pd.DataFrame(response.json())
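As a small extension of the code above (the error handling and the nested-data case are my own additions, not part of the original API), it is a good habit to check the HTTP status before building the DataFrame, and pd.json_normalize can flatten nested JSON:
import requests
import pandas as pd

url = "http://localhost:8081/questions/allQuestions"
response = requests.get(url)
response.raise_for_status()        # stop early if the API returned an error status

# json_normalize flattens nested objects into columns like 'topic.name'
api_df = pd.json_normalize(response.json())
api_df.head()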
Data Cleaning with Python:
Data cleaning is a crucial step of data analysis because unclean data can lead to incorrect insights. In the real world, raw data is mostly unclean. The following are the most common issues in data:
Missing values
Duplicates
Incorrect format
Mismatched column names
We need to identify steps to make the data clean and analysis-ready. There are a few common techniques:
Filling or removing missing data
Missing data can be valid data too, so be mindful when removing it or filling it with some value. For example, if a hospital has newborn children’s records with their vitals and some of those records are missing the weight, that data must be invalid and not correct for analysis. In this case either there should be some way to fill those blank values, or we must remove those rows.
Now take the example of COVID-19 survey data: it has fields like prior medical condition (heart disease, diabetes, etc.), and those fields are blank for some records. These null or blank values are considered valid data, because there could be people without any medical conditions.
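A minimal sketch of both cases, with made-up column names like weight_kg and prior_condition (they are not from the datasets mentioned above):
import pandas as pd
import numpy as np

# Hypothetical newborn records: weight is required, so drop rows where it is missing
newborn_df = pd.DataFrame({'baby_id': [1, 2, 3], 'weight_kg': [3.2, np.nan, 2.9]})
newborn_df = newborn_df.dropna(subset=['weight_kg'])

# Hypothetical survey records: a blank prior condition is valid, so fill it with an explicit label
survey_df = pd.DataFrame({'person_id': [10, 11], 'prior_condition': ['diabetes', np.nan]})
survey_df['prior_condition'] = survey_df['prior_condition'].fillna('none')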
Removing duplicates
Removing duplicate records is mandatory to make data authentic. Duplicate records can lead to many problems, such as:
o Inflate counts (e.g., number of users, sales, transactions)
o Bias averages, sums, and other statistics
To remove duplicate records, one must first determine whether the data is truly duplicated. This decision depends on whether multiple rows represent the same real-world entity. If more than one row corresponds to the same entity or event, it is considered a duplicate.
For example, in a COVID-19 survey dataset, there may be multiple rows with identical values. However, because the dataset does not include personal identifiers such as an ID or name, it is not possible to determine whether those records belong to the same individual or to different individuals with similar symptoms and health conditions. In this case, the records cannot be confidently treated as duplicates.
In contrast, consider a hospital dataset containing newborn records, where each child is assigned a unique ID. If multiple records share the same ID, they clearly represent the same child and must be considered duplicates. Such records should be removed to maintain data accuracy.
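A short sketch of the newborn case, again with invented data: drop_duplicates with a subset keeps one row per unique ID.
import pandas as pd

# Invented newborn data where baby_id 101 appears twice
newborn_df = pd.DataFrame({'baby_id': [101, 101, 102], 'weight_kg': [3.2, 3.2, 2.9]})

# Rows sharing the same baby_id describe the same child, so keep only the first one
newborn_df = newborn_df.drop_duplicates(subset=['baby_id'], keep='first')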
Formatting or changing datatypes
Data types should be converted when necessary. For example, if a column contains values such as Y/N, it can be converted to 0/1 to improve processing efficiency, since computers handle binary data more efficiently than strings. Additionally, any date or time-related column must be in the correct format to perform time series analysis.
Consider another example from survey data: a column named covid_positive contains values with the same meaning but expressed using different strings, such as “positive”, “n”, “negatively”, and “negative”. To improve readability, consistency, and ease of analysis, these values should be standardized and converted into a binary format (0/1).
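A sketch of these conversions with invented values (the column names and the exact mapping are my assumptions, not the article’s dataset):
import pandas as pd

survey_df = pd.DataFrame({
    'smoker': ['Y', 'N', 'Y'],
    'covid_positive': ['positive', 'n', 'negatively'],
    'test_date': ['2021-01-05', '2021-02-10', '2021-03-15'],
})

# Y/N flag converted to 0/1
survey_df['smoker'] = survey_df['smoker'].map({'Y': 1, 'N': 0})

# Standardize inconsistent strings into a binary column
positive_values = {'positive', 'y', 'yes'}
survey_df['covid_positive'] = (
    survey_df['covid_positive'].str.strip().str.lower().isin(positive_values).astype(int)
)

# Dates stored as a proper datetime type for time series analysis
survey_df['test_date'] = pd.to_datetime(survey_df['test_date'])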
Renaming columns
When multiple datasets need to be combined for analysis, they must use consistent column names and represent the same type of information in the same way. This may require renaming columns or transforming the data within a column.
For example, consider three COVID-19 survey datasets: Survey-1, Survey-2, and Survey-3. Surveys 1 and 2 include a week column, while Survey-3 contains a month column. To bring all datasets onto a common basis for analysis, the time information must be standardized—either by converting all data to weeks or all data to months—so that the datasets can be combined and analyzed consistently.
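One possible sketch, assuming each survey has a date-like period column so that weeks can be rolled up to months (all names and values here are invented):
import pandas as pd

survey1 = pd.DataFrame({'week': ['2021-01-04', '2021-01-11'], 'cases': [10, 12]})
survey3 = pd.DataFrame({'month': ['2021-01', '2021-02'], 'cases': [40, 35]})

# Align the column names so the datasets can be combined
survey1 = survey1.rename(columns={'week': 'period_start'})
survey3 = survey3.rename(columns={'month': 'period_start'})

# Standardize everything to months before concatenating
survey1['period'] = pd.to_datetime(survey1['period_start']).dt.to_period('M')
survey3['period'] = pd.PeriodIndex(survey3['period_start'], freq='M')

combined = pd.concat([survey1, survey3], ignore_index=True)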
After completing this entire process, you obtain data that is fully prepared for analysis, visualization, and modeling.
Exploratory Data Analysis (EDA)
EDA is an important step in data analysis where you look at your data carefully to understand it before doing any calculations or building models. The main goal is to find patterns, spot unusual values, test ideas, and make sure the data makes sense.
In this step you start by checking basic statistics like the average, middle value, or range to understand the numbers. Next, you look at how the data is spread and spot any missing values that need attention. You also watch out for outliers, which are unusual values that can affect your results. Finally, you explore connections between different pieces of data using charts and tables to see how they relate to each other.
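As a sketch of those checks on a tiny made-up DataFrame (real survey data is much larger, of course):
import pandas as pd

df = pd.DataFrame({'age': [25, 31, 29, 75], 'glucose': [90, 110, None, 250]})

print(df.describe())                          # average, spread, min/max of each numeric column
print(df.isna().sum())                        # missing values per column
print(df['glucose'].quantile([0.25, 0.75]))   # quartiles help flag outliers such as 250
print(df.corr())                              # relationships between the numeric columns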
For example, we can see the top five symptoms reported by probable COVID-19 positive respondents as below:
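One way such a summary could be produced, sketched here with invented symptom strings rather than the actual survey data behind the chart:
import pandas as pd

# Invented survey rows: symptoms reported by probable positive respondents
survey_df = pd.DataFrame({
    'covid_positive': [1, 1, 1, 0],
    'symptoms': ['fever,cough', 'fever,fatigue', 'cough,fever,headache', 'headache'],
})

positives = survey_df[survey_df['covid_positive'] == 1]
top_symptoms = (
    positives['symptoms']
    .str.split(',')      # one list of symptoms per respondent
    .explode()           # one row per reported symptom
    .value_counts()
    .head(5)
)
print(top_symptoms)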

Common Python tools for EDA
o Pandas – summary stats, handling missing values, grouping
o Matplotlib & Seaborn – visualization of distributions, correlations
o NumPy – numerical computations
o Plotly – interactive charts
Key insight:
Python is essential for data analysis because it supports the entire analytical workflow—from reading and cleaning messy data to transforming, analyzing, and preparing it for accurate insights and effective visualization—something visualization tools alone cannot fully achieve.

