Why Python for Data Analysis
- Neetu Rathaur
- Jan 9
- 5 min read

When I started learning data analysis, I observed that visualization is a very important and effective way to analyze any data. Then a question arose in my mind: why do we use Python when we already have tools like Tableau and Power BI for visualization?
Data analysis is not only about visualization. Python is a general-purpose data language that can:
Clean messy data
Combine data from many sources (APIs, databases, files)
Perform statistics, forecasting, and machine learning
Automate entire workflows end-to-end
Tableau and Power BI are powerful, but they are limited to what their UI supports.
Python is free and open source, while Tableau and Power BI have licensing costs.
Let’s discuss some common data analysis practices—other than visualization—that every analyst typically follows:
1. Reading Data
2. Data Cleaning
3. Exploratory Data Analysis (EDA)
Reading Data with Python
Python has Pandas, a very powerful library that can be used to read data from files, databases, and APIs.
Reading data from a file:
Install the Pandas library and write an import statement to make sure the library is ready to use.
import pandas as pd
df = pd.read_csv('gestational_diabetes_data.csv')
The read_csv(file_name) method returns a Pandas DataFrame. We can use different methods for different file types, e.g. read_excel(), read_json().
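As a quick sketch (the file names here are just placeholders, not real files from this post), the other readers follow the same pattern and also return a DataFrame:
import pandas as pd

# Placeholder file names, only to show the other reader methods
excel_df = pd.read_excel('patients.xlsx')   # reading .xlsx files requires the openpyxl package
json_df = pd.read_json('patients.json')

# Quick check that the data loaded as expected
print(excel_df.head())
print(json_df.info())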
Reading from a database:
Install the required libraries for the database connection and import them:
pip install pandas sqlalchemy psycopg2-binary
from sqlalchemy import create_engine
Now create an engine to read from the database:
engine = create_engine("postgresql+psycopg2://postgres:1234@localhost:5432/questiondb")
query = """
select * from Questions where difficulty_level='Easy'
"""
df = pd.read_sql(query, engine)
df.head()
Now df is a Pandas DataFrame.

A database connection URL is a single string that tells your application how and where to connect to a database.
It is built using the database type, the driver used to connect, and the login details.
DatabaseType+Driver://username:password@hostname:Port/databaseName
A few example prefixes for different databases (a full connection sketch follows this list):
PostgreSQL → postgresql+psycopg2://
MySQL → mysql+pymysql://
SQL Server → mssql+pyodbc://
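As a small sketch with made-up credentials (not a real database from this post), the same pattern builds engines for the other databases; each one needs its own driver package installed, for example pymysql or pyodbc:
from sqlalchemy import create_engine

# Made-up credentials, only to show the databasetype+driver://user:password@host:port/dbname pattern
mysql_engine = create_engine("mysql+pymysql://analyst:secret@localhost:3306/questiondb")
mssql_engine = create_engine(
    "mssql+pyodbc://analyst:secret@localhost:1433/questiondb?driver=ODBC+Driver+17+for+SQL+Server"
)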
Reading data from an API:
Reading data from an API is one of the most common scenarios in data analysis. These APIs can be with or without authentication, and there can be nested APIs and APIs that return large amounts of data. We can learn those scenarios one after another. First, let's see how to read data from a basic API without authentication. Here I have created a simple API that fetches data from a database and returns it as a response.
Install the required library to get the response:
pip install requests
Code to get the data into a DataFrame:
import requests
url = "http://localhost:8081/questions/allQuestions"
response = requests.get(url)
api_df = pd.DataFrame(response.json())
url = "http://localhost:8081/questions/allQuestions"
response = requests.get(url)
api_df = pd.DataFrame(response.json())
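As a small extension of the code above (the error handling and the nested-data case are my own additions, not part of the original API), it is a good habit to check the HTTP status before building the DataFrame, and pd.json_normalize can flatten nested JSON:
import requests
import pandas as pd

url = "http://localhost:8081/questions/allQuestions"
response = requests.get(url)
response.raise_for_status()        # stop early if the API returned an error status

# json_normalize flattens nested objects into columns like 'topic.name'
api_df = pd.json_normalize(response.json())
api_df.head()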
Data Cleaning with Python:
Data cleaning is a crucial step of data analysis because unclean data can lead to incorrect insights. In the real world, raw data is mostly unclean. The following are the most common issues in data:
Missing values
Duplicates
Incorrect format
Mismatched column names
We need to identify steps to make the data clean and analysis-ready. There are a few common techniques:
Filling or removing missing data
Missing data can be valid data too, so be mindful when removing it or filling it with some value. For example, if a hospital has newborn children’s records with their vitals and some of those records are missing the weight, that data must be invalid and not correct for analysis. In this case either there should be some way to fill those blank values, or we must remove those rows.
Now take the example of COVID-19 survey data: it has fields like prior medical condition (heart disease, diabetes, etc.), and those fields are blank for some records. These null or blank values are considered valid data, because there could be people without any medical conditions.
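A minimal sketch of both cases, with made-up column names like weight_kg and prior_condition (they are not from the datasets mentioned above):
import pandas as pd
import numpy as np

# Hypothetical newborn records: weight is required, so drop rows where it is missing
newborn_df = pd.DataFrame({'baby_id': [1, 2, 3], 'weight_kg': [3.2, np.nan, 2.9]})
newborn_df = newborn_df.dropna(subset=['weight_kg'])

# Hypothetical survey records: a blank prior condition is valid, so fill it with an explicit label
survey_df = pd.DataFrame({'person_id': [10, 11], 'prior_condition': ['diabetes', np.nan]})
survey_df['prior_condition'] = survey_df['prior_condition'].fillna('none')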
Removing duplicates
Removing duplicate records is mandatory to make data authentic. Duplicate records can lead to many problems, such as:
o Inflate counts (e.g., number of users, sales, transactions)
o Bias averages, sums, and other statistics
To remove duplicate records, one must first determine whether the data is truly duplicated. This decision depends on whether multiple rows represent the same real-world entity. If more than one row corresponds to the same entity or event, it is considered a duplicate.
For example, in a COVID-19 survey dataset, there may be multiple rows with identical values. However, because the dataset does not include personal identifiers such as an ID or name, it is not possible to determine whether those records belong to the same individual or to different individuals with similar symptoms and health conditions. In this case, the records cannot be confidently treated as duplicates.
In contrast, consider a hospital dataset containing newborn records, where each child is assigned a unique ID. If multiple records share the same ID, they clearly represent the same child and must be considered duplicates. Such records should be removed to maintain data accuracy.
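A short sketch of the newborn case, again with invented data: drop_duplicates with a subset keeps one row per unique ID.
import pandas as pd

# Invented newborn data where baby_id 101 appears twice
newborn_df = pd.DataFrame({'baby_id': [101, 101, 102], 'weight_kg': [3.2, 3.2, 2.9]})

# Rows sharing the same baby_id describe the same child, so keep only the first one
newborn_df = newborn_df.drop_duplicates(subset=['baby_id'], keep='first')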
Formatting or changing datatypes
Data types should be converted when necessary. For example, if a column contains values such as Y/N, it can be converted to 0/1 to improve processing efficiency, since computers handle binary data more efficiently than strings. Additionally, any date or time-related column must be in the correct format to perform time series analysis.
Consider another example from survey data: a column named covid_positive contains values with the same meaning but expressed using different strings, such as “positive”, “n”, “negatively”, and “negative”. To improve readability, consistency, and ease of analysis, these values should be standardized and converted into a binary format (0/1).
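A sketch of these conversions with invented values (the column names and the exact mapping are my assumptions, not the article’s dataset):
import pandas as pd

survey_df = pd.DataFrame({
    'smoker': ['Y', 'N', 'Y'],
    'covid_positive': ['positive', 'n', 'negatively'],
    'test_date': ['2021-01-05', '2021-02-10', '2021-03-15'],
})

# Y/N flag converted to 0/1
survey_df['smoker'] = survey_df['smoker'].map({'Y': 1, 'N': 0})

# Standardize inconsistent strings into a binary column
positive_values = {'positive', 'y', 'yes'}
survey_df['covid_positive'] = (
    survey_df['covid_positive'].str.strip().str.lower().isin(positive_values).astype(int)
)

# Dates stored as a proper datetime type for time series analysis
survey_df['test_date'] = pd.to_datetime(survey_df['test_date'])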
Renaming columns
When multiple datasets need to be combined for analysis, they must use consistent column names and represent the same type of information in the same way. This may require renaming columns or transforming the data within a column.
For example, consider three COVID-19 survey datasets: Survey-1, Survey-2, and Survey-3. Surveys 1 and 2 include a week column, while Survey-3 contains a month column. To bring all datasets onto a common basis for analysis, the time information must be standardized—either by converting all data to weeks or all data to months—so that the datasets can be combined and analyzed consistently.
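One possible sketch, assuming each survey has a date-like period column so that weeks can be rolled up to months (all names and values here are invented):
import pandas as pd

survey1 = pd.DataFrame({'week': ['2021-01-04', '2021-01-11'], 'cases': [10, 12]})
survey3 = pd.DataFrame({'month': ['2021-01', '2021-02'], 'cases': [40, 35]})

# Align the column names so the datasets can be combined
survey1 = survey1.rename(columns={'week': 'period_start'})
survey3 = survey3.rename(columns={'month': 'period_start'})

# Standardize everything to months before concatenating
survey1['period'] = pd.to_datetime(survey1['period_start']).dt.to_period('M')
survey3['period'] = pd.PeriodIndex(survey3['period_start'], freq='M')

combined = pd.concat([survey1, survey3], ignore_index=True)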
After completing this entire process, you obtain data that is fully prepared for analysis, visualization, and modeling.
Exploratory Data Analysis (EDA)
EDA is an important step in data analysis where you look at your data carefully to understand it before doing any calculations or building models. The main goal is to find patterns, spot unusual values, test ideas, and make sure the data makes sense.
In this step you start by checking basic statistics like the average, middle value, or range to understand the numbers. Next, you look at how the data is spread and spot any missing values that need attention. You also watch out for outliers, which are unusual values that can affect your results. Finally, you explore connections between different pieces of data using charts and tables to see how they relate to each other.
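As a sketch of those checks on a tiny made-up DataFrame (real survey data is much larger, of course):
import pandas as pd

df = pd.DataFrame({'age': [25, 31, 29, 75], 'glucose': [90, 110, None, 250]})

print(df.describe())                          # average, spread, min/max of each numeric column
print(df.isna().sum())                        # missing values per column
print(df['glucose'].quantile([0.25, 0.75]))   # quartiles help flag outliers such as 250
print(df.corr())                              # relationships between the numeric columns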
For example, we can see the top five symptoms reported by probable COVID-19 positive respondents as below:
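One way such a summary could be produced, sketched here with invented symptom strings rather than the actual survey data behind the chart:
import pandas as pd

# Invented survey rows: symptoms reported by probable positive respondents
survey_df = pd.DataFrame({
    'covid_positive': [1, 1, 1, 0],
    'symptoms': ['fever,cough', 'fever,fatigue', 'cough,fever,headache', 'headache'],
})

positives = survey_df[survey_df['covid_positive'] == 1]
top_symptoms = (
    positives['symptoms']
    .str.split(',')      # one list of symptoms per respondent
    .explode()           # one row per reported symptom
    .value_counts()
    .head(5)
)
print(top_symptoms)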

Common Python tools for EDA
o Pandas – summary stats, handling missing values, grouping
o Matplotlib & Seaborn – visualization of distributions, correlations
o NumPy – numerical computations
o Plotly – interactive charts
Key insight:
Python is essential for data analysis because it supports the entire analytical workflow—from reading and cleaning messy data to transforming, analyzing, and preparing it for accurate insights and effective visualization—something visualization tools alone cannot fully achieve.

