Hospital Data Analysis Using Python

Neetu Rathaur
Jan 13
7 min read

This blog is about visualization of data using python, here we will be going to use a healthcare data. Healthcare organizations generate huge amounts of patient data every day. Today to will do patient discharge, Provider specialty and Ambulatory visit trends analysis on Healthcare Dataset (HC Dataset) from Kaggle, appling Python and data analytics techniques. Analyzing discharge information and ambulatory visits helps hospitals, improve operational efficiency, reduce length of stay, and enhance patient outcomes. Provider specialty analysis helps in resource management.

The HC Dataset contains patient and provider-level hospital data, including visit, admission and discharge information. Key fields used in this analysis include:

· Patient ID

· Provider Specialty

· Admission Date

· Discharge Date

· Primary Diagnosis

· Discharge Disposition

· Date of Visit

Using admission and discharge dates, we calculate the Length of Stay (LOS) for each patient.

We will use Python, Pandas (data manipulation), Matplotlib & Seaborn (data visualization), Jupyter Notebook (analysis environment). We will complete below steps

1. Loading the data.

2. Cleaning/missing value handling.

3. Exploring discharge outcomes.

4. Exploring Provider Specialty

5. Visualizing trends.

Loading the data from multi tab Excel file:

Import all necessary libraries and read file using Pandas. The Excel file has multiple tabs, and each tab represents a separate table. We can read all tabs as

all_sheets = pd.read_excel("Excel_file_name", sheet_name=None)

If you want to read only specific tab you can mention tab name as

sheet_tab1 = pd.read_excel("Excel_file_name", sheet_name=”tabName”)

Cleaning/missing value handling:

Check if any missing/null value in data.

Removing null values and dropping unused columns

patient_df = all_sheets["Patients"].drop(columns=["FirstName","LastName"])

Next, we’ll merge pandas data frames using joins, just like combining tables in SQL. The basic syntax looks like this:

pd.merge(

    left_df,

    right_df,

    left_on="foreign_key_column",

    right_on="primary_key_column",

    how="left"

)

Length of Stay

Length of Stay is calculated as the difference (in days) between discharge and admission dates. Records with invalid or negative stays are removed to ensure data accuracy.

When calculating length of stay, using a simple formula like df["length_of_stay"] = (df["EDDischargeTime"] - df["VisitDate"]).dt.days can result in 0 for records where the discharge date is different from the visit date. This happens because the datetime64[ns] data type includes the time component, so stays shorter than 24 hours are counted as zero days. To fix this, we need to normalize the date columns and then calculate the length of stay, ensuring the difference is based on full days rather than including time.

df["length_of_stay"] = (

    df["EDDischargeTime"].dt.normalize()

    - df["VisitDate"].dt.normalize()

).dt.days

Why this works

· .normalize() sets time to 00:00:00

· Difference becomes date-based, not time-based

· Matches Tableau’s DATEDIFF('day', …) logic exactly

Patient Average Length of Stay by Reason for Visit

The average length of stay provides a high-level view of hospital efficiency. A lower LOS generally indicates optimized care delivery, while unusually high LOS may signal operational or clinical challenges. For reason of visit indicates which patient need prompt care to reduce LOS.

Here we can analyze stay with respect to reason of visit

avg_los_by_reason = (

    df.groupby("ReasonForVisit", as_index=False)

      .agg(avg_length_of_stay=("length_of_stay", "mean"))

)

Visualization of above analysis

Insights:

Intoxication have longest stay, they need acute care

Patients visits with cough doesn’t need hospitalization.

Count of Stay per Primary Diagnosis:

Find total number of patients stays for diagnosis. By grouping patient stays by primary diagnosis, we identify the conditions that needs for the highest hospital utilization. Diagnoses with high stay counts often represent chronic conditions or seasonal illnesses, making them key targets for preventive care and resource planning.

diagnosis_stay_count = (

    df.groupby("PrimaryDiagnosis")["PatientMRN"]

      .count()

      .sort_values(ascending=False)

)

print("\n Count of Stay per Primary Diagnosis:")

print(diagnosis_stay_count.head(10))

Visulization of above analysis

Insights:

The data show that patients diagnosed with pneumonia are more likely to remain hospitalized than those with other diagnoses. Influenza is also a common reason for hospital admission. In contrast, fewer patients with lacerations or hip fractures required extended hospital stays.

Patient Discharge Disposition:

Discharge disposition shows where patients go after leaving the hospital —such as:

· Discharged to home

· Transferred to another facility

· Expired

A high percentage of home discharges typically reflects effective treatment, while higher transfer rates may indicate specialized care needs. It helps hospitals plan follow-up care and make sure patients get the support they need.

disposition_counts = df["discharge_disposition"].value_counts()

 

print("\ Patient Discharge Disposition:")

print(disposition_counts)

Visualization of above analysis

Insights:

Above visualization shows discharge to home is quite higher than all other disposition. It shows patients are getting effective care during hospitalization.

Visit Volume by Provider Specialty:

Visit volume by provider specialty shows how many patients each type of doctor or specialist sees. It helps us understand which specialties are busiest and where more staff or resources might be needed. This helps hospitals plan better and make sure patients get care on time. Visits are grouped by Provider Specialty to understand how outpatient demand varies across clinical disciplines.

specialty_visits = (

   abVst_df.groupby("ProviderSpecialty")["VisitID"]

    .count()

    .sort_values(ascending=False)

)

print(specialty_visits)

Visualization of above analysis

Insights:

Here you can see primary care handles maximum number of ambulatory visits. It helps hospital to staff accordingly like here hospital needs more primary care providers then any other specialty.

Provider count Per Specialty:

Provider count per specialty shows how many healthcare providers, such as doctors, nurses, or specialists, work in each medical specialty. This helps to understand the distribution of staff across different areas of care. For example, a higher number of providers in pediatrics compared to primary care indicates more resources are available for general medical care. This information is useful for workforce planning, identifying gaps in care, and ensuring that patient demand matches the available providers.

provider_count_specialty = (

    abVst_df.groupby("ProviderSpecialty")["ProviderID"]

    .nunique()

    .sort_values(ascending=False)

)

print(provider_count_specialty)

Visualization of above analysis

Insights:

Here you can see hospital has maximum number of primary care providers followed by pediatrics, cardiology and lowest number was surgery providers.

Avg Length of Stay & Count of Discharges per Day Trend

Analyzing daily trends helps uncover patterns in hospital workload. Here we are going to track Number of discharges and Average LOS per day. Peaks in discharge volume may may need more staffing or scheduling practices, while spikes in LOS could indicate inefficient care or delays in diagnostics or poor discharge planning.

daily_trend = (

    df.groupby(df["EDDischargeTime"].dt.date)

      .agg(

          discharge_count=("PatientMRN", "count"),

          avg_los=("length_of_stay", "mean")

      )

      .reset_index()

)

print("\n Daily Discharge Trend:")

print(daily_trend.head())

Visualization of above analysis

A dual-axis line chart clearly displays the relationship between discharge volume and LOS over time.

Insights:

Here we can see peek in discharge count on January 1^st and January 14^thduring this time Hospital might need more staff for complete discharge processes. There is a dip on January 16^th. For LOS there is peek in the end that is around January 16^th, that time may be there were less staff or less availability of equipment. This analysis could help hospital management to know what needs to be done for effective care.

Average Length of Stay per Week:

When we look at data by week instead of by day, it helps to reduce the random ups and downs that happen each day and makes bigger patterns easier to see. For example, if the average length of stay (LOS) goes up consistently week after week, it could point to a problem in the system, like delays in treatment or discharge. On the other hand, if the weekly LOS stays about the same, it shows that patient flow is stable and predictable.

df["week"] = df["EDDischargeTime"].dt.to_period("W").astype(str)

weekly_los = (

    df.groupby("week")["length_of_stay"]

      .mean()

      .reset_index()

)

print("\nAverage Length of Stay per Week:")

print(weekly_los)

Visualization of above analysis

A line chart illustrates how LOS changes week over week.

Insights:

It’s quite visible from above graph length of stay is keep increasing. There must be some issue with planning because it looks overwhelming high in length of stay.

Daily Ambulatory Visit Trend

The daily ambulatory visit trend shows the variation in outpatient visits across the observed time. Overall, the trend indicates day-to-day fluctuations in patient volume, with identifiable peaks and declines. These variations may be influenced by factors such as weekdays versus weekends, seasonal health conditions, and accessibility of ambulatory services. Analyzing this trend provides valuable insight into patient demand patterns and supports data-driven decisions for capacity planning, staffing optimization, and operational efficiency in ambulatory care facilities.

abVst_df["VisitDay"] = abVst_df["DateofVisit"].dt.date

abVst_df["VisitWeek"] = abVst_df["DateofVisit"].dt.to_period("W").astype(str)

 

daily_visits = (

    abVst_df.groupby("VisitDay")["VisitID"]

    .count()

    .reset_index(name="VisitCount")

)

print(daily_visits.head())

Visualization of above analysis

Insights:

Above graph shows peek from March 15^th to April 1^st and dip in visits observed from January 1^st to January 15^th and April 20^th to April 25^th .

Weekly Ambulatory Visit Trend:

The weekly ambulatory visit trend shows how the number of outpatient visits changes from week to week. It helps us see overall patterns more clearly by smoothing out daily ups and downs. Some weeks have more visits, while others have fewer, which may be due to seasons, holidays, or common health issues. Understanding this trend helps hospitals plan staff, manage resources, and prepare for busy or slow weeks.

weekly_visits = (

    abVst_df.groupby("VisitWeek")["VisitID"]

    .count()

    .reset_index(name="VisitCount")

)
print(weekly_visits.head())

Visualization of above analysis

Insights:

Here we can see high volume of visits in March 11^th to March 17^th .

Key Learnings – Healthcare Data Visualization

Patient Length of Stay (LOS): Highlights hospital efficiency; long stays may indicate operational or clinical challenges.
Primary Diagnosis Impact: Pneumonia and influenza drive higher hospital use; minor injuries need shorter stays.
Discharge Disposition: Most patients go home, showing effective care; transfers indicate specialized needs.
Provider Specialty & Visits: Primary care handles the highest patient volume; staffing should match demand.
Daily & Weekly Trends: Daily data shows short-term fluctuations; weekly trends reveal broader patterns and resource needs.
Visualization Benefits: Python tools (Pandas, Matplotlib, Seaborn) make patterns clear, supporting better hospital planning and decision-making.

Welcome
to NumpyNinja Blogs

Hospital Data Analysis Using Python

Cleaning/missing value handling:

Patient Average Length of Stay by Reason for Visit

Count of Stay per Primary Diagnosis:

Patient Discharge Disposition:

Visit Volume by Provider Specialty:

Provider count Per Specialty:

Avg Length of Stay & Count of Discharges per Day Trend

Average Length of Stay per Week:

Daily Ambulatory Visit Trend

Weekly Ambulatory Visit Trend:

Recent Posts

Welcome to NumpyNinja Blogs

Cleaning/missing value handling:

Patient Average Length of Stay by Reason for Visit

Count of Stay per Primary Diagnosis:

Patient Discharge Disposition:

Visit Volume by Provider Specialty:

Provider count Per Specialty:

Avg Length of Stay & Count of Discharges per Day Trend

Average Length of Stay per Week:

Daily Ambulatory Visit Trend

Weekly Ambulatory Visit Trend:

Welcome
to NumpyNinja Blogs