Welcome
to NumpyNinja Blogs


From Raw Healthcare Data to Diabetes Insights: End-to-End Data Analytics Project Using Python & Tableau

In healthcare, data is often available in large quantities but in messy and inconsistent formats. When I explored this dataset, I noticed missing values, inconsistent formats, outliers, and unstructured clinical measurements.

This blog mainly focuses on transforming raw healthcare data into meaningful insights using Python for data cleaning and feature engineering, followed by building interactive dashboards in Tableau.

The main goal is to understand how clinical factors such as BMI, glucose, cholesterol, and blood pressure influence diabetes risk. Before building dashboards or running machine learning models, the data must be cleaned and standardized.


Project Workflow:

Below are the workflow steps I followed to complete the analysis and dashboards:


Data Understanding

⬇️

Data Cleaning & Preprocessing

⬇️

Handling Missing Values

⬇️

Outlier Detection & Treatment

⬇️

Feature Engineering

⬇️

Feature Importance Analysis

⬇️

Dashboard Development using Tableau

⬇️

Generating insights


Data Understanding:

This dataset includes patient-level healthcare information such as age, BMI-related measurements, cholesterol levels, glucose values, blood pressure readings, and diabetes indicators. While exploring it, I noticed the following issues: missing values, outliers, inconsistent data types, and raw medical measurements.



Data Cleaning & Preprocessing (Python in Jupyter Notebook)

To prepare the data for analysis, I performed the following steps:


  • Handled missing values

  • Removed less relevant columns with many missing values, such as bp.2s and bp.2d

  • Detected and treated outliers

  • Applied feature engineering techniques

  • Ran a machine learning-based feature importance analysis

  • Exported the cleaned dataset after preprocessing

  • Imported it into Tableau for visualization and built two dashboards:

    Diabetes Risk and Health Overview

    Clinical Risk Factors and Diabetes Relationship Analysis


    Here is the link for the dashboards


Understanding the Raw Dataset and Analysis Goals:

Rather than working with raw glucose numbers, I converted them into risk-level categories. This kept the analysis simple and made it easier to see how patients were distributed across the different diabetes risk groups.


Removing Irrelevant Columns

Some columns with excessive missing values were removed to simplify the analysis. For example, bp.2s and bp.2d hold sequential (repeat) blood pressure readings and were missing for many patients. Since they were not necessary for the evaluation, I dropped them.

df = df.drop(columns=['bp.2s', 'bp.2d'])

 

Handling Missing Values

Handling missing values is important because they can negatively affect statistical analysis. Missing values in numerical columns were filled with the median, since those columns are skewed, and missing values in categorical columns were filled with the mode.
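A minimal sketch of this imputation step, assuming pandas and a tiny made-up frame. The column names chol, hdl, and frame and all values here are illustrative stand-ins, not the project's actual data or code:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (names and values are assumptions).
df = pd.DataFrame({
    "chol": [203.0, 165.0, np.nan, 228.0, 249.0],
    "hdl": [56.0, np.nan, 36.0, 24.0, 28.0],
    "frame": ["medium", "large", None, "medium", "small"],
})

# Numerical columns: fill gaps with the median, which is robust to skew.
num_cols = df.select_dtypes(include="number").columns
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Categorical columns: fill gaps with the most frequent value (the mode).
cat_cols = df.select_dtypes(exclude="number").columns
for col in cat_cols:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

print(df)
```

The median is used instead of the mean because a few extreme readings pull the mean toward them, while the median stays near the bulk of the patients.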

Data Type Conversion:

To optimize the analysis, several columns were converted to the categorical data type:


df['gender'] = df['gender'].astype('category')

df['location'] = df['location'].astype('category')

df['frame'] = df['frame'].astype('category')


Outlier Detection (IQR Method) and Treatment (Capping)


Outliers can heavily influence statistical calculations, so they were identified and treated.

Outliers were identified using the Interquartile Range (IQR) method.

Instead of removing records, outliers were capped to preserve patient data.
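The IQR capping idea can be sketched as follows. This is an illustration, not the project's exact code: the helper name cap_outliers_iqr, the conventional 1.5 multiplier, and the sample glucose values are all assumptions.

```python
import pandas as pd

def cap_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values outside [Q1 - k*IQR, Q3 + k*IQR] at those bounds."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    # clip() replaces out-of-range values with the nearest bound
    # instead of dropping the row, so every patient record is kept.
    return series.clip(lower=lower, upper=upper)

# Made-up glucose readings with one extreme value.
glucose = pd.Series([80, 85, 90, 95, 100, 105, 110, 385], name="stab.glu")
capped = cap_outliers_iqr(glucose)
print(capped.tolist())
```

Here Q1 is 88.75 and Q3 is 106.25, so the upper bound is 132.5 and the reading of 385 is pulled down to that bound while the other values stay unchanged.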


Feature Engineering and Feature Importance

After cleaning the dataset, I engineered new features that transform raw medical values into meaningful clinical indicators. I also wanted to understand which health indicators contributed most to diabetes prediction, so I used a Random Forest model to calculate feature importance scores. This helped identify which variables had the strongest influence on the prediction outcome.


BMI Category Feature

BMI values were converted into categories such as Underweight, Normal, Overweight, and Obese. This made obesity analysis easier and gave a better understanding of how diabetes prevalence changes across body weight groups. Compared with raw BMI numbers, the categories made patterns easier to interpret visually in Tableau.
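One way to build such a feature is pandas' pd.cut. The post does not state its cut-offs, so the sketch below assumes the standard WHO adult ranges (below 18.5, 18.5 to under 25, 25 to under 30, 30 and above) and an already computed bmi column:

```python
import pandas as pd

# Made-up BMI values; the real column name may differ.
bmi = pd.Series([17.2, 22.4, 27.9, 34.1], name="bmi")

# right=False makes each bin include its lower edge, e.g. [18.5, 25) is Normal.
bmi_category = pd.cut(
    bmi,
    bins=[0, 18.5, 25, 30, float("inf")],
    labels=["Underweight", "Normal", "Overweight", "Obese"],
    right=False,
)
print(bmi_category.tolist())
```

Because pd.cut returns an ordered categorical, Tableau or pandas can sort the groups from Underweight to Obese rather than alphabetically.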



Age group categorization:

Patients were grouped into categories such as Young, Adult, Senior, and Elderly to simplify age-based analysis and make patterns easier to understand.


Blood pressure categories:

Blood pressure values were categorized into clinical groups such as Normal, Prehypertension, and Hypertension.


Glucose risk categories:

Similar to age and blood pressure, I categorized glucose values into risk levels: Normal, Pre-Diabetic, and Diabetic.


Instead of relying on raw values, this approach made the dashboard more intuitive for understanding the health status of patients.
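The three categorizations above could be sketched like this. Everything specific here is an assumption for illustration: the column names bp.1s, bp.1d, and stab.glu, the age bands, the JNC 7 style blood pressure thresholds (120/80 and 140/90), and the fasting-glucose style thresholds (100 and 126 mg/dL). The post does not state which cut-offs were used.

```python
import numpy as np
import pandas as pd

# Made-up patients; column names are assumed, not confirmed by the post.
df = pd.DataFrame({
    "age": [24, 47, 63, 78],
    "bp.1s": [112, 128, 145, 118],   # systolic
    "bp.1d": [70, 84, 88, 95],       # diastolic
    "stab.glu": [85, 110, 99, 210],
})

# Age bands (illustrative boundaries only).
df["age_group"] = pd.cut(
    df["age"], bins=[0, 30, 50, 70, np.inf],
    labels=["Young", "Adult", "Senior", "Elderly"], right=False,
)

# Blood pressure depends on two columns, so use np.select.
# The first matching condition wins, so the most severe rule comes first.
conditions = [
    (df["bp.1s"] >= 140) | (df["bp.1d"] >= 90),
    (df["bp.1s"] >= 120) | (df["bp.1d"] >= 80),
]
df["bp_category"] = np.select(
    conditions, ["Hypertension", "Prehypertension"], default="Normal"
)

# Glucose risk levels (assumed thresholds).
df["glucose_risk"] = pd.cut(
    df["stab.glu"], bins=[0, 100, 126, np.inf],
    labels=["Normal", "Pre-Diabetic", "Diabetic"], right=False,
)
print(df[["age_group", "bp_category", "glucose_risk"]])
```

Note the last patient: a systolic reading of 118 looks normal, but the diastolic reading of 95 alone places them in Hypertension, which is why both columns are checked.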


Outcome:

The Outcome variable is created from Glyhb (glycosylated hemoglobin): patients with Glyhb > 7 are classified as diabetic, and the rest as non-diabetic. This transformation turns a continuous medical measure into a binary classification target, which is used for modelling and analysis.
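In pandas, this rule can be expressed in one line. A small sketch with made-up glyhb values (the 1/0 encoding is an assumption; the post only says diabetic vs. non-diabetic):

```python
import pandas as pd

df = pd.DataFrame({"glyhb": [4.3, 6.9, 7.0, 7.1, 12.2]})

# 1 = diabetic (glyhb strictly greater than 7), 0 = non-diabetic.
df["outcome"] = (df["glyhb"] > 7).astype(int)
print(df)
```

With a strict comparison, a patient at exactly 7.0 falls on the non-diabetic side.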


Random Forest Feature Importance:

This model is well suited to healthcare data because it handles complex relationships between features effectively. It helps evaluate how much each feature (such as glucose, BMI, cholesterol, and blood pressure) contributes to predicting the target outcome. The resulting feature ranking highlights the key medical factors that have a strong impact on diabetes risk.




A quick overview of the code:

scikit-learn is a Python machine learning library for building predictive models.

train_test_split splits the dataset into training and testing data. If a model is trained and evaluated on the same data, it can memorize patterns rather than generalize the relationship, which gives an unrealistically high accuracy (overfitting). Hence the training set is for learning and the testing set is for evaluation.

X = df[features] -> predictor input values

y = df['outcome'] -> target variable

test_size=0.2 -> 80% of the data is used for training and 20% for testing, which leaves enough data to learn from while keeping unseen data for evaluation.

random_state=42 -> fixes the random seed. Without it, the split changes on every run and the results become inconsistent (any fixed number works).

stratify -> maintains the class balance by preserving the same class distribution in both train and test (typically passed as stratify=y). Without it, the training data may contain very few diabetic cases and evaluation becomes unreliable.
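Putting these parameters together, a hedged end-to-end sketch might look like the following. The data is synthetic, the feature list (stab.glu, bmi, chol) is an assumption because the post does not show the real one, and n_estimators=100 is simply scikit-learn's default:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data, not the real diabetes dataset.
rng = np.random.default_rng(42)
n = 200
df = pd.DataFrame({
    "stab.glu": rng.normal(105, 40, n),
    "bmi": rng.normal(29, 6, n),
    "chol": rng.normal(207, 44, n),
})
# Toy target driven almost entirely by glucose, so glucose should rank first.
df["outcome"] = (df["stab.glu"] + rng.normal(0, 10, n) > 130).astype(int)

features = ["stab.glu", "bmi", "chol"]
X = df[features]   # predictor input values
y = df["outcome"]  # target variable

# 80/20 split, reproducible, with the class ratio preserved in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)

# Importance scores sum to 1; higher means the feature was more useful for splits.
importances = pd.Series(model.feature_importances_, index=features)
importances = importances.sort_values(ascending=False)
print(importances)
print("Test accuracy:", model.score(X_test, y_test))
```

The accuracy is computed on X_test only, which the model never saw during fit, so it reflects generalization rather than memorization.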

How Random Forest works

Random samples of the data are selected, many decision trees are built, each tree predicts the outcome, and majority voting gives the final prediction.
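To make those four steps concrete, here is a hand-rolled sketch built from individual scikit-learn decision trees on synthetic data. It is a teaching illustration of bootstrap sampling plus majority voting, not how the project trained its model. As far as I know, scikit-learn's own RandomForestClassifier averages the trees' predicted class probabilities rather than counting hard votes, though the two usually agree.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data (not the diabetes dataset).
X, y = make_classification(n_samples=300, n_features=6, random_state=0)
X_train, y_train = X[:240], y[:240]
X_test, y_test = X[240:], y[240:]

rng = np.random.default_rng(0)
n_trees = 25  # odd, so a binary majority vote can never tie

trees = []
for _ in range(n_trees):
    # 1. Draw a bootstrap sample: rows chosen at random with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    # 2. Grow one tree on that sample, using a random feature subset per split.
    tree = DecisionTreeClassifier(
        max_features="sqrt", random_state=int(rng.integers(1_000_000))
    )
    trees.append(tree.fit(X_train[idx], y_train[idx]))

# 3. Every tree predicts each test row: shape (n_trees, n_test_rows).
votes = np.stack([tree.predict(X_test) for tree in trees])

# 4. Majority vote: predict 1 when more than half of the trees say 1.
majority = (votes.mean(axis=0) > 0.5).astype(int)
accuracy = (majority == y_test).mean()
print(f"Accuracy of the hand-rolled forest: {accuracy:.2f}")
```

Each tree sees a slightly different slice of the data, so individual trees make different mistakes and the vote tends to cancel them out.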


Tableau Dashboard Development

After preprocessing, I imported the cleaned dataset into Tableau for visualization and created two dashboards for better storytelling.


Dashboard 1: Diabetes Risk & Health Overview



KPI Metrics (Patients, BMI, Glucose, Cholesterol)

I created a few KPI cards. Nothing fancy, just the total number of patients (403) and the averages for BMI (important because obesity is strongly associated with diabetes risk), glucose (the average blood sugar level), and cholesterol (linked with cardiovascular and metabolic issues). These quick numbers help set the tone so anyone looking at the dashboard knows what kind of dataset we’re dealing with before diving into the deeper charts.


Diabetes Distribution (Highlight Table)

Before doing any detailed analysis, I wanted to see how many people were diabetic vs. non‑diabetic. A highlight table made this really easy to read. It gave me a simple overview of how balanced (or unbalanced) the dataset is. It’s a small chart, but it helps a lot. It shows that non-diabetic patients outnumber diabetic patients, which indicates a class imbalance in the dataset.


Age vs Diabetes

Then I checked how diabetes is spread across different age groups. I grouped the ages into categories so the pattern is easier to see. This chart basically shows whether diabetes becomes more common as age increases or if it’s spread out evenly. The analysis shows that the Adult and Senior groups contain a higher number of diabetic cases compared to younger age groups, suggesting that diabetes risk tends to increase with age.
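A quick way to cross-check a chart like this in pandas is a crosstab. The sketch below uses made-up rows and assumed column names (age_group, outcome). It prints both raw counts and per-group rates, since a group can hold more diabetic cases simply because it holds more patients:

```python
import pandas as pd

# Made-up patients for illustration only.
df = pd.DataFrame({
    "age_group": ["Young", "Young", "Adult", "Adult", "Adult",
                  "Senior", "Senior", "Elderly"],
    "outcome":   [0, 0, 0, 0, 1, 0, 1, 1],
})

# Raw number of non-diabetic (0) and diabetic (1) patients per age group.
counts = pd.crosstab(df["age_group"], df["outcome"])

# Share of each outcome within an age group; every row sums to 1.
rates = pd.crosstab(df["age_group"], df["outcome"], normalize="index")

print(counts)
print(rates.round(2))
```

In this toy data the Adult and Senior groups each have one diabetic case, yet their rates differ (one in three vs. one in two), which is the kind of distinction the rate view makes visible.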


BMI Category Analysis

For the BMI analysis, raw values were converted to categorical data to keep the analysis simple.

This chart helps analyze the role obesity plays. Most patients fall within the Overweight and Obese BMI categories.


Glucose Risk Analysis

Instead of using raw glucose values, grouping them into risk levels like Normal, Prediabetic, and Diabetic simplified the interpretation of glucose levels and made it easier to identify how patients are distributed across risk categories. The chart shows that the majority of patients fall under the Normal glucose category, while smaller groups belong to the Prediabetic and Diabetic categories. The line overlay also helps compare average glucose across risk levels.

 

Dashboard 2: Clinical Analysis Charts



BMI vs Glucose Scatterplot

I created this scatterplot to understand the relationship between BMI and glucose levels.

Each point represents a patient, and this visualization helped identify whether higher BMI is associated with higher glucose levels and potential diabetes risk patterns.


Glucose vs Cholesterol Scatterplot

This chart explores the relationship between glucose and cholesterol levels.

It helped me observe how metabolic health factors interact with each other and whether patients with high glucose also tend to show higher cholesterol levels.


Blood Pressure vs Glucose Boxplot

I used a boxplot to compare glucose levels across different blood pressure categories. This helped in understanding how glucose levels vary among patients with normal, prehypertension, and hypertension conditions.


Feature Importance Analysis

After cleaning and preparing the dataset, I trained a Random Forest model to understand which features influence diabetes prediction the most. This chart shows the relative importance of each health indicator in predicting diabetes. It helped identify which medical factors are more influential compared to others.

 

Final Thoughts:

After completing both dashboards, I was able to bring everything together into a complete analytical healthcare workflow.

This project helped me understand how to transform raw healthcare data into meaningful and actionable insights by integrating Python and Tableau.


Numpy Ninja Inc. 8 The Grn Ste A Dover, DE 19901

© Copyright 2025 by Numpy Ninja Inc.
