Outlier Detection in a Diabetes Dataset: A Visual and Statistical Approach
- saranyashanmugam200
- Jun 5
Introduction
Whenever we work with real‑world data, especially medical data, the first thing we need to do is understand what the data is and what it looks like. Before building any model, it’s important to check whether the dataset has unusual values, missing values, or anything else that might affect the results later. Outliers are simply values that don’t fit in with the rest of the data: they may be much higher or lower than what we normally expect. In healthcare datasets, these unusual values can mean two very different things: sometimes they are errors, and sometimes they are real medical conditions that we should not ignore.
In this blog, I explored a diabetes dataset and tried to understand its outliers using three methods:
Histogram
Boxplot
Standard Deviation (Z‑score)
My goal was not just to remove outliers, but to understand what they represent.
Understanding the Dataset
Before finding outliers, it is important to understand how the dataset is distributed.

The dataset contains medical information about patients and is used for diabetes prediction. It includes the following features:
Pregnancies, Glucose, BloodPressure, SkinThickness, Insulin, BMI, DiabetesPedigreeFunction, Age, Outcome
DiabetesPedigreeFunction shows the hereditary risk of diabetes, and the Outcome column tells whether the patient has diabetes or not. Here, I noticed something important: some columns contain zero values where zero is medically impossible. For example, a person’s blood pressure and glucose cannot be 0. These zeros are most likely missing values.
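A quick way to spot these suspicious zeros is to count them per column. The snippet below is a minimal sketch, assuming pandas is available. The rows are toy values made up for illustration; only the column names come from the dataset.

```python
import pandas as pd

# Toy rows with made-up values; only the column names match the dataset.
df = pd.DataFrame({
    "Pregnancies":   [4, 1, 0, 2],
    "Glucose":       [135, 0, 170, 95],
    "BloodPressure": [70, 64, 0, 0],
    "BMI":           [31.2, 25.4, 0.0, 27.9],
})

# Compare every cell with 0, then sum the True values column by column.
zero_counts = (df == 0).sum()
print(zero_counts)
```

A zero in Pregnancies is plausible, so the count alone does not decide anything. It only shows which columns need a closer look.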
Why Outlier Detection Matters in Healthcare
Outliers in medical datasets are not just “weird numbers.” They can indicate:
a serious health condition
a measurement error
a missing value
a wrongly entered value
For example, extremely high insulin values may point to insulin resistance, while very high BMI values may indicate obesity‑related risks. So, we cannot blindly remove outliers — we need to understand them.
Handling Invalid Zero Values
One major problem in this dataset is the presence of invalid zero values.

Pregnancies and Outcome legitimately contain zeros, so they were kept.
However, Glucose, BloodPressure, SkinThickness, Insulin, and BMI contained zeros that are biologically impossible. A zero in these columns usually means the value was not recorded. So, before doing any outlier analysis, I replaced these zeros with NaN so they don’t interfere with calculations.
This cleaning step helps the histogram, boxplot, and Z‑score outlier detection methods give more reliable results.
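The replacement step can be sketched as follows, assuming pandas and NumPy. The helper name zeros_to_nan and the toy rows are hypothetical and only illustrate the idea; the column list is the one named above.

```python
import numpy as np
import pandas as pd

# Columns where a zero is treated as "not recorded".
INVALID_ZERO_COLS = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]


def zeros_to_nan(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with zeros in the listed columns replaced by NaN."""
    cleaned = df.copy()
    # Only touch the listed columns that are actually present.
    cols = [c for c in INVALID_ZERO_COLS if c in cleaned.columns]
    cleaned[cols] = cleaned[cols].replace(0, np.nan)
    return cleaned


# Toy rows with made-up values.
toy = pd.DataFrame({
    "Pregnancies": [0, 2, 5],
    "Glucose":     [0, 110, 150],
    "Insulin":     [0, 0, 130],
    "Outcome":     [0, 1, 0],
})

cleaned = zeros_to_nan(toy)
print(cleaned.isna().sum())
```

Pregnancies and Outcome are left alone because they are not in the list, and the original frame is not modified.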
Why This Step Is Important
If we leave the zeros as they are:
Histograms become misleading
Boxplots show wrong quartiles
Z‑score becomes unreliable
Mean and standard deviation shift incorrectly
Replacing invalid zeros with NaN gives a more honest picture of the data.
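To see the effect on the summary statistics, here is a small sketch with made-up glucose readings (not values from the dataset), assuming pandas and NumPy. pandas skips NaN by default when it computes the mean, standard deviation, and quantiles.

```python
import numpy as np
import pandas as pd

# Made-up glucose readings; the two zeros stand in for unrecorded values.
with_zeros = pd.Series([0, 0, 90, 100, 110, 120, 130, 140])
without_zeros = with_zeros.replace(0, np.nan)

# Compare each statistic before and after the replacement.
print("mean:", with_zeros.mean(), "->", without_zeros.mean())
print("std :", with_zeros.std(), "->", without_zeros.std())
print("Q1  :", with_zeros.quantile(0.25), "->", without_zeros.quantile(0.25))
```

In this toy sample the zeros pull the mean and Q1 down and inflate the standard deviation, which is the kind of shift described in the list above.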
Outliers Using Histogram
A commonly used technique for finding outliers is the histogram. A histogram is a visualization that shows the distribution, or shape, of a numerical column, and it is good at showing which values fall outside the normal range of the data. Outliers usually appear as bars far away from the main concentration of the data. How clearly they show up depends on the distribution, skewness, and spread of the values, as well as on how often each value is observed.
Let’s look at an example to make the concept concrete.


Example: Insulin Distribution. The Insulin column in this dataset has many missing values, has several extremely high values, and is heavily right‑skewed.
These high values could be due to:
insulin resistance
abnormal metabolic conditions
recording issues
The histogram makes it clear that the data is not normally distributed.
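The idea behind the histogram can be sketched without a plotting library by counting values per bin and printing a rough text histogram. This is a sketch under assumptions: the insulin-like values are made up (the maximum of 846 is borrowed from the range mentioned later in this post), and the choice of 8 bins is arbitrary.

```python
import numpy as np
import pandas as pd

# Made-up insulin-like values: most are moderate, a few are very high,
# and NaN marks readings that were not recorded.
insulin = pd.Series(
    [np.nan, 45, 60, 75, 80, 90, 95, 110, 120, np.nan, 150, 480, 600, 846]
)

values = insulin.dropna()
counts, edges = np.histogram(values, bins=8)

# One '#' per observation in each bin.
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:7.1f} to {right:7.1f} | {'#' * int(count)}")

# Positive skewness means the long tail is on the right.
print("skewness:", round(values.skew(), 2))
```

Most of the marks land in the first bins, and the few isolated marks far to the right are the candidates to inspect. Passing the same values to a plotting function such as matplotlib's plt.hist would draw the equivalent picture.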
Outliers Using Boxplot (IQR Method)
A boxplot is another useful visualization technique for detecting outliers. It shows the descriptive statistics of a numerical column, helps identify the distribution of the data, and automatically highlights outliers as dots outside the normal data range. Unlike histograms, boxplots summarize the distribution using quartiles and the Interquartile Range (IQR).
Understanding the Boxplot
Example: if you have 100 rows of data sorted by value, the bottom quartile would be the 25 rows with the smallest values, the top quartile would be the 25 rows with the largest values, and the remaining 50 rows would be the middle half, with the median at its center. The bottom and top edges of the box represent Q1 (First Quartile) and Q3 (Third Quartile). Together with the median, these quartiles divide the data into four equal parts:
Q1 is the value below which the lowest 25% of the data falls.
Q3 is the value above which the highest 25% of the data falls.
The middle 50% of the data lies between Q1 and Q3.
The distance between Q1 and Q3 is called the Interquartile Range (IQR): IQR = Q3 - Q1
To detect outliers, the IQR is multiplied by 1.5.
The lower and upper bounds of the acceptable range are calculated as:
Lower Bound = Q1 - 1.5 × IQR
Upper Bound = Q3 + 1.5 × IQR
Any value below the lower bound or above the upper bound is considered an outlier and is drawn as a dot. So all you have to do is look for the dots.
We also have the whiskers, which are the lines extending outside the box. They reach to the smallest and largest values that are still within the bounds given by the IQR rule, so any values beyond the whiskers are potential outliers.
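The IQR rule can also be applied directly, without drawing the plot. The sketch below assumes pandas and NumPy; the function name iqr_outliers and the BMI-like values are hypothetical. It uses the default linear interpolation of pandas for the quartiles, so results may differ slightly from tools that compute quartiles another way.

```python
import numpy as np
import pandas as pd


def iqr_outliers(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values below Q1 - k*IQR or above Q3 + k*IQR."""
    values = values.dropna()
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    lower = q1 - k * iqr
    upper = q3 + k * iqr
    return values[(values < lower) | (values > upper)]


# Made-up BMI-like values; NaN marks a reading that was not recorded.
bmi = pd.Series([22.0, 24.5, 26.0, 27.5, 29.0, 30.5, 32.0, 33.5, 35.0, 58.0, np.nan])
print(iqr_outliers(bmi))
```

For these toy values Q1 is 26.375 and Q3 is 33.125, so the bounds are 16.25 and 43.25, and only 58.0 is flagged.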



Clinical Interpretation of Boxplot Outliers
Using the IQR method, the dataset shows:
Pregnancies and Age have natural high outliers
Glucose and BloodPressure contain invalid zero values
SkinThickness has extremely high values (e.g., 99)
Insulin has the highest number of extreme outliers (300–846)
BMI contains both invalid zeros and high values above 50
DiabetesPedigreeFunction has valid high outliers related to hereditary risk
Outliers Using Standard Deviation (Z‑score)
So far, we have used data visualization to identify outliers. Another method is using mathematical calculations, such as the standard deviation. Standard deviation measures how spread out the data is. A dataset with a wide range has a large standard deviation, while a tightly packed dataset has a small one.
How It Works
First, find the mean
Then measure the distance of each value from the mean
Square those distances
The average of the squared distances gives the variance
The square root of the variance is the standard deviation
The Z‑score of a value is its distance from the mean divided by the standard deviation. If the dataset is normally distributed, any value that lies more than 3 standard deviations away from the mean is considered an outlier. But if the data is skewed, contains missing values, or has unusual spikes, this method may not work well and can either miss outliers or mark normal values as outliers. Depending on the dataset, the threshold can also be adjusted, for example using 2 standard deviations to catch more outliers or 4 standard deviations to be less sensitive.
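The steps above can be sketched as follows, assuming pandas and NumPy. The function name zscore_outliers and the readings are made up for illustration. The sketch divides by the number of values, as the steps describe (population standard deviation), whereas the std() method in pandas divides by n - 1 by default.

```python
import numpy as np
import pandas as pd


def zscore_outliers(values: pd.Series, threshold: float = 3.0) -> pd.Series:
    """Return the values whose absolute Z-score exceeds the threshold."""
    values = values.dropna()
    mean = values.mean()                    # step 1: the mean
    squared = (values - mean) ** 2          # steps 2-3: squared distances
    variance = squared.mean()               # step 4: average squared distance
    std = np.sqrt(variance)                 # step 5: standard deviation
    z = (values - mean) / std
    return values[z.abs() > threshold]


# Made-up readings: nineteen values near 100 and one extreme value.
readings = pd.Series([90, 92, 94, 96, 98, 100, 100, 100, 102, 104,
                      106, 108, 110, 95, 105, 97, 103, 99, 101, 400])
print(zscore_outliers(readings))
```

The single extreme value raises the mean from 100 to 115 and inflates the standard deviation, which is one reason this method becomes less dependable when the data contains several extreme values.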
But this dataset has several problems:
It is not normally distributed
It contains invalid zeros
It has missing values
It is heavily skewed
It has extremely high values
Because of this, the Z‑score method only detected the extremely high values and missed all the zero outliers.
In this dataset:
Histogram helped understand the shape
Boxplot gave the clearest outlier picture
Z‑score was the least reliable
To understand this better, I created a scatterplot.


Conclusion
In this diabetes dataset, several features such as Insulin, BMI, Glucose, and BloodPressure had unusual values. Some of these were due to missing or incorrect data (like zeros), while others were real medical conditions.
Since the data is skewed and not normally distributed, histograms and boxplots were more effective for detecting outliers. The Z‑score method did not work well because it assumes a normal distribution.
Overall, outliers should not be removed without understanding them. In healthcare data, extreme values can be meaningful and may reveal important medical insights.

