Data Cleaning Explained: How Clean Data Drives Better Visualizations and Decisions
- Gayathri Venkatachalam
- Jan 9
- 8 min read
Updated: Jan 14

Bad data doesn’t just create bad charts — it leads to bad decisions.
In today’s data-driven world, organizations rely heavily on dashboards, charts, and reports to guide business and public-health decisions. However, when data contains missing values, duplicates, or inconsistencies, even the most visually appealing dashboards can become misleading.
This article explains why data cleaning is essential for accurate data visualization and better decision-making, using a real-world COVID-19 survey dataset as a case study. It highlights how proper data cleaning prevents false trends, improves trust in dashboards, and enables confident, data-driven decisions.
What Is Data Cleaning?
Data cleaning is the process of identifying and correcting errors in raw data to make it accurate, consistent, and ready for analysis. Real-world data is rarely perfect. It often contains missing values, duplicate records, incorrect formats, and outliers that can affect analysis and visualization.
The goal of data cleaning is to improve data quality so that insights drawn from the data are reliable. Clean data ensures that charts, dashboards, and reports truly represent reality and support correct business decisions.
Common data cleaning activities include:
Handling missing values
Removing duplicate records
Correcting data types and formats (dates, numbers, text)
Fixing inconsistent values (e.g., “USA” vs “United States”)
Identifying and treating outliers
Without proper data cleaning, even advanced analytics tools and visualization platforms can produce misleading results.
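As a minimal illustration, here is a short pandas sketch that applies several of these activities to a small, made-up sales table (the column names and values are purely hypothetical):
import pandas as pd
import numpy as np
# Toy data containing a missing value, a duplicate row, inconsistent labels, and text-typed dates
df = pd.DataFrame({
    'order_date': ['2024-01-05', '2024-01-06', '2024-01-06', '2024-01-07'],
    'country': ['USA', 'United States', 'United States', 'usa'],
    'amount': [120.0, 85.0, 85.0, np.nan]
})
df = df.drop_duplicates()                              # remove duplicate records
df['order_date'] = pd.to_datetime(df['order_date'])    # correct the data type
df['country'] = df['country'].str.lower().replace(     # fix inconsistent values
    {'usa': 'United States', 'united states': 'United States'})
df['amount'] = df['amount'].fillna(df['amount'].median())  # handle missing values
print(df)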
Impact on Visualization: Misleading Charts and Wrong Trends
When data is not cleaned properly, visualizations can go wrong in several ways:
Missing Values Create Gaps: Missing or null values can cause broken lines, sudden drops, or incomplete charts, making trends look inaccurate.
Duplicate Records Inflate Numbers: Duplicate data can increase totals and counts, leading to exaggerated bars, incorrect percentages, and misleading comparisons.
Inconsistent Categories Split Visuals: Different labels for the same category (e.g., “US”, “USA”, “United States”) can appear as separate bars or slices, confusing the audience.
Incorrect Data Types Affect Charts: Dates stored as text or numbers stored as strings can prevent proper sorting and distort time-based visualizations.
Outliers Distort the Scale: Extreme or incorrect values can stretch chart axes, hiding meaningful patterns in the rest of the data.
False Trends Lead to Wrong Insights: When visuals are built on dirty data, they may highlight trends that don’t actually exist or hide important ones.
Result: Even well-designed dashboards can tell the wrong story if the data behind them is not clean.
Impact on Business Decisions
Poor data quality doesn’t just affect charts — it directly impacts business decisions. When decision-makers rely on unclean data, the consequences can be costly.
Incorrect Insights Lead to Wrong Decisions: Dirty data can produce false insights, causing leaders to make decisions based on inaccurate information.
Revenue and Cost Miscalculations: Duplicate or missing data can inflate revenue figures or hide expenses, leading to poor budgeting and financial planning.
Misguided Strategy and Planning: When trends are wrong, businesses may invest in the wrong products, markets, or campaigns.
Loss of Trust in Data and Reports: If dashboards frequently show conflicting or incorrect numbers, stakeholders lose confidence in data-driven decision-making.
Operational Inefficiencies: Inaccurate data can lead to overstocking, understaffing, or poor resource allocation.
Risk of Compliance and Legal Issues: In industries like healthcare and finance, poor data quality can result in reporting errors and regulatory penalties.
Result: Clean data enables confident decisions, while dirty data increases risk, cost, and uncertainty.
Dataset Description: Flatten COVID‑19 Survey (PhysioNet)
For this blog, we use the Flatten COVID‑19 Survey dataset from PhysioNet, which contains approximately 294,000 survey responses collected across Canada between March and July 2020.
The dataset captures self-reported information on symptoms, travel, contact with illness, demographics, mental health, and more. It is divided into three schemas based on the timing and content of the survey:
Schema 1 (Weeks 13–14): Early phase of the pandemic (late March – early April 2020)
Schema 2 (Weeks 15–16): Mid-April 2020, with refined questions
Schema 3 (April–July): Later months, with additional demographic and mental health data
Each schema includes weekly or monthly timestamps, location information (postal codes), and various survey responses. The schemas have different column structures, requiring normalization to create a unified dataset.
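The cleaning steps that follow work with one DataFrame per schema (df_sc1, df_sc2, df_sc3). The loading sketch below shows one way to set them up; the file names are hypothetical placeholders, so adjust them to match however you exported the PhysioNet files:
import pandas as pd
# One DataFrame per survey schema (file names are hypothetical placeholders)
df_sc1 = pd.read_csv('flatten_schema1.csv')   # Schema 1: weeks 13–14
df_sc2 = pd.read_csv('flatten_schema2.csv')   # Schema 2: weeks 15–16
df_sc3 = pd.read_csv('flatten_schema3.csv')   # Schema 3: April–July
# Compare column structures across schemas before normalizing
for name, df in [('Schema 1', df_sc1), ('Schema 2', df_sc2), ('Schema 3', df_sc3)]:
    print(name, df.shape)
    print(sorted(df.columns)[:10])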
Why There Is No Primary Key
Because this is survey-based data:
Each row represents a single survey submission, not a tracked individual.
There is no inherent primary key to uniquely identify participants.
Multiple rows may have the same combination of attributes (location, month, symptoms).
All data is de-identified to protect privacy, so there are no direct identifiers.
Without a primary key, standard deduplication or merging across schemas is not possible. Creating a pseudo key allows us to analyze patterns without assuming uniqueness.
Step-by-Step Data Cleaning Process Followed
After understanding the structure and limitations of the COVID-19 Flatten Survey dataset, we carried out a systematic data cleaning process using Python in a Jupyter Notebook environment. This approach allowed us to explore the data, apply transformations step by step, and validate each cleaning decision to ensure accuracy, consistency, and analytical reliability.
Step 1: Standardize Time Granularity (Normalization)
Problem:
Schema 1 & 2 contain week-level data.
Schema 3 contains month-level data.
Direct analysis would cause misaligned trends.
Solution:
Converted week-level data to month-level for Schema 1 & 2.
Dropped the week column after conversion.
import pandas as pd
# Map ISO week numbers to calendar months (weeks 13–16 of 2020)
week_to_month = {
    13: 'March',
    14: 'April',
    15: 'April',
    16: 'April'
}
# Function to convert week-level data to month-level and tag the schema
def process_df(df, schema_label):
    if 'week' in df.columns:
        # Convert week to integer, map it to a month, then drop the original week column
        df['week'] = df['week'].astype(int)
        df['month'] = df['week'].map(week_to_month)
        df.drop(columns=['week'], inplace=True)
    # Add schema column
    df['schema'] = schema_label
    return df
# Apply to the week-level schemas
df_sc1 = process_df(df_sc1, 'Schema 1 (W13–14)')
df_sc2 = process_df(df_sc2, 'Schema 2 (W15–16)')
df_sc3['schema'] = 'Schema 3 (Apr–Jul)'  # Schema 3 is already month-level, so only tag it
# Optional: check results
print(df_sc1.head())
print(df_sc2.head())
print(df_sc3.head())
Why this matters:
Ensures a single, consistent time variable.
Enables accurate visualization of trends over time.
Step 2: Concatenating All Three Schemas
After standardizing the time variable (converting week-level data in Schema 1 & 2 to month-level), the next step is to combine all three survey schemas into a single dataset.
Why Concatenation?
Problem:
Data was collected using three different survey schemas.
There is no common individual identifier across datasets.
Joining datasets would create false person-level matches.
Solution:
Used row-wise concatenation instead of JOIN.
# Add a column to tag each schema
df_sc1['schema'] = 'Schema 1 (W13–14)'
df_sc2['schema'] = 'Schema 2 (W15–16)'
df_sc3['schema'] = 'Schema 3 (Apr–Jul)'
# Concatenate all three schemas into a single dataset
df_all = pd.concat([df_sc1, df_sc2, df_sc3], ignore_index=True)
# Check the combined dataset
df_all.info()
Why this matters:
Ensures one unified dataset for further cleaning and analysis.
Preserves population-level insights without assuming individual linkage.
Prevents incorrect matches that could distort trends or analysis.
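An optional sanity check, confirming that concatenation only stacked rows and did not drop or duplicate anything (this assumes Schema 3 already carries a month column, as described above):
# The combined row count should equal the sum of the three schemas
assert len(df_all) == len(df_sc1) + len(df_sc2) + len(df_sc3)
# Responses per schema and per month in the unified dataset
print(df_all['schema'].value_counts())
print(df_all.groupby(['schema', 'month']).size())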
Step 3: Create a Pseudo Key (Synthetic Identifier)
Problem:
Dataset has no primary key.
Cannot uniquely identify individual respondents.
Solution:
Introduced a pseudo-key that captures each response pattern.
# Work on a copy of the combined dataset so the original stays untouched
df_all_copy = df_all.copy()
# Define columns to include in pseudo-key
key_cols = ['fsa', 'month', 'fever_chills_shakes',
            'cough', 'shortness_of_breath', 'any_medical_conditions']
# Fill nulls and create pseudo-key in one step
df_all_copy['pseudo_key'] = df_all_copy[key_cols].fillna('none').agg('-'.join, axis=1)
# Check pseudo-key statistics
df_all_copy['pseudo_key'].describe()
Why this matters:
Enables pattern-level aggregation.
Avoids incorrect assumptions about individual identity.
Multiple rows can share the same pseudo-key — this is expected and correct.
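A small example of what pattern-level aggregation looks like with the new key (a sketch using the pseudo_key column created above):
# How many responses share each response pattern?
pattern_counts = df_all_copy['pseudo_key'].value_counts()
print(pattern_counts.head(10))                            # most common patterns
print((pattern_counts > 1).sum(), 'patterns occur more than once')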
Step 4: Remove Redundant Symptom Columns
After creating the pseudo-key, we noticed that some symptoms, like cough and shortness of breath, appear in both structured binary columns and the free-text symptoms column.
Problem:
Having the same information in multiple columns can cause data redundancy and potentially double-counting when analyzing or visualizing symptom trends.
Solution:
Compare the structured columns with the free-text symptoms column.
Validate the match rate — high overlap (~96–98%) confirms redundancy.
Drop the redundant binary columns while keeping the free-text symptoms column.
# Check whether the structured cough column is redundant with the free-text symptoms column
df_all_copy['cough_in_symptoms'] = (
    df_all_copy['symptoms'].fillna('').str.lower().str.split(';').apply(lambda x: 'cough' in x)
)
df_all_copy['cough_flag'] = df_all_copy['cough'].str.lower().eq('y')
print(f"Cough match rate: {(df_all_copy['cough_flag'] == df_all_copy['cough_in_symptoms']).mean():.2%}")
# Repeat the comparison for shortness of breath
df_all_copy['sob_in_symptoms'] = (
    df_all_copy['symptoms'].fillna('').str.lower().str.split(';').apply(lambda x: 'shortnessofbreath' in x)
)
df_all_copy['sob_flag'] = df_all_copy['shortness_of_breath'].str.lower().eq('y')
print(f"Shortness of breath match rate: {(df_all_copy['sob_flag'] == df_all_copy['sob_in_symptoms']).mean():.2%}")
# Drop the redundant binary columns and the temporary helper columns
cols_to_drop = ['cough', 'cough_in_symptoms', 'cough_flag',
                'shortness_of_breath', 'sob_in_symptoms', 'sob_flag']
df_all_copy.drop(columns=cols_to_drop, inplace=True, errors='ignore')
Why this matters:
Prevents double counting of symptoms
Reduces dataset redundancy
Simplifies further analysis and visualization
Keeps one reliable column (symptoms) for all symptom-related features
Step 5: Normalize Multi-Value Text Columns
Problem:
Columns like symptoms, conditions, and media_channels use semicolons (;) as separators.
Hard to analyze and visualize.
Solution:
Replaced semicolons with commas for consistency.
df_all['symptoms'] = df_all['symptoms'].str.replace(';', ', ')
df_all['conditions'] = df_all['conditions'].str.replace(';', ', ')
Why this matters:
Improves readability.
Enables easier text processing and feature engineering.
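For example, once the separator is consistent, the symptoms column can be split and exploded to count how often each symptom is reported (a sketch that assumes the comma-separated format produced above):
# One row per reported symptom, then a frequency count
symptom_counts = (
    df_all['symptoms']
    .dropna()
    .str.split(', ')
    .explode()
    .str.strip()
    .value_counts()
)
print(symptom_counts.head(10))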
Step 6: Standardize Yes / No Values
Problem:
Binary columns had inconsistent values (y, Y, n, N).
Solution:
Standardized all binary responses to Yes / No.
cols = ['probable','vulnerable','tested','self_isolating']
df_all[cols] = df_all[cols].replace({'y':'Yes','n':'No','Y':'Yes','N':'No'})
Why this matters:
Prevents category duplication in charts and reports.
Step 7: Translate and Normalize Text Values
Problem:
Some values were in French, e.g., travel status.
Solution:
Translated to consistent English labels.
df_all['travel_work_school'] = df_all['travel_work_school'].replace(
    "non: j'ai toujours travaille de chez moi.",
    "alwaysWorkFromHome"
)
Why this matters:
Ensures consistent categorization across the dataset.
Step 8: Improve Column Naming and Readability
Actions Taken:
Renamed fsa → FSA for acronym clarity (FSA stands for Forward Sortation Area, the first three characters of a Canadian postal code)
Renamed age_1 → age_category
df_all.rename(columns={'fsa':'FSA', 'age_1':'age_category'}, inplace=True)
Why this matters:
Improves self-documentation.
Makes dashboards easier to understand.
Step 9: Clean Demographic and Outcome Fields
After handling symptoms and standardizing categorical variables, the next step is to clean demographic and outcome columns to ensure accuracy in analysis and visualization.
Sex Column
Problem:
The sex column contains inconsistent values: "m", "f", "na", or other variations in uppercase/lowercase.
This inconsistency can inflate counts or misrepresent gender distribution in reports.
Solution:
Convert the column to string type
Standardize to "Male" and "Female"
Replace "na" with null (pd.NA) to indicate missing values
df_all['sex'] = (
    df_all['sex'].astype('string')
    .str.lower()
    .replace({'m': 'Male', 'f': 'Female', 'na': pd.NA})
)
Why this matters:
Ensures accurate counts in demographic analysis
Prevents misleading visuals or statistics
Makes the dataset ready for grouping, plotting, or modeling
COVID Result Column
Problem:
The covid_positive column contains inconsistent entries like "positively", "negatively", "n"
This can cause confusion when counting positive vs negative cases or plotting outcomes
Solution:
Convert the column to string type
Standardize values to "Positive", "Negative", "Inconclusive"
df_all['covid_positive'] = (
    df_all['covid_positive'].astype('string')
    .str.lower()
    .replace({'positively': 'Positive', 'negatively': 'Negative', 'n': 'Inconclusive'})
)
Why this matters:
Ensures reliable outcome analysis
Prevents misinterpretation of COVID-positive/negative counts
Supports accurate reporting and predictive modeling
Step 10: Group FSAs into Broader Regions
Problem:
The dataset contains detailed Ontario postal codes (FSAs) for each respondent. While precise, this level of granularity:
Makes visualization and regional analysis difficult
Reduces readability in charts
Makes it harder to identify trends or high-risk areas
Solution:
Group FSAs into 5 broader regions:
Eastern Ontario
Central Ontario
Toronto
Southwestern Ontario
Northern Ontario
This can be done with a mapping dictionary in Python:
# Map the first letter of each Ontario FSA to a broader region
fsa_to_region = {
    'K': 'Eastern Ontario', 'L': 'Central Ontario', 'M': 'Toronto',
    'N': 'Southwestern Ontario', 'P': 'Northern Ontario'
}
# Map FSA to regions
df_all_copy['region'] = df_all_copy['FSA'].str[0].map(fsa_to_region)
Why this matters:
Simplifies charts and analysis while still preserving geographic insights
Enables easy visualization of symptom spread and severity by region
Helps quickly identify high-risk areas for targeted interventions
Supports immediate public health decisions like quarantine measures or resource allocation
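With the region column in place, regional summaries become straightforward (a sketch; the region and month columns come from the earlier steps):
# Responses per region
print(df_all_copy['region'].value_counts())
# Responses per region and month: a simple view of how reporting evolves by area
print(df_all_copy.groupby(['region', 'month']).size().unstack(fill_value=0))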
Final Clean Dataset Ready for Analysis
After cleaning:
Consistent time variable
No false joins
Reduced redundancy
Standardized categories
Pattern-level pseudo key
Analysis-ready structure
Result: The dataset is now suitable for reliable visualization, trend analysis, and predictive modeling.
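Before handing the dataset to a visualization tool, a few quick checks help confirm the cleaning held up (a sketch; it uses df_all, so swap in df_all_copy if that is where you carried the later steps):
# Structure, data types, and remaining missing values
df_all.info()
print(df_all.isna().sum().sort_values(ascending=False).head(10))
# Standardized categories should show no stray variants left
for col in ['sex', 'covid_positive', 'probable', 'tested']:
    print(col, df_all[col].dropna().unique())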
Final Conclusion
This project demonstrates that data cleaning is the foundation of accurate visualization and reliable decision-making. Using the Flatten COVID-19 Survey dataset, we addressed real-world challenges such as multiple survey schemas, inconsistent time formats, lack of a primary key, and unstructured text data.
By standardizing time to a monthly level, using concatenation instead of joins, creating a pseudo-key, and removing redundant and inconsistent values, we ensured analytical accuracy without assuming individual-level linkage. These steps prevented misleading charts, false trends, and incorrect conclusions.
Clean data leads to trustworthy visualizations, meaningful insights, and confident public-health and business decisions.
If you’re a data analyst or aspiring data scientist, this case study highlights why data cleaning should never be skipped — no matter how advanced your tools are.

