
Data Cleaning Explained: How Clean Data Drives Better Visualizations and Decisions

Updated: Jan 14



 Bad data doesn’t just create bad charts — it leads to bad decisions.


In today’s data-driven world, organizations rely heavily on dashboards, charts, and reports to guide business and public-health decisions. However, when data contains missing values, duplicates, or inconsistencies, even the most visually appealing dashboards can become misleading.

This article explains why data cleaning is essential for accurate data visualization and better decision-making, using a real-world COVID-19 survey dataset as a case study. It highlights how proper data cleaning prevents false trends, improves trust in dashboards, and enables confident, data-driven decisions.



What Is Data Cleaning?


Data cleaning is the process of identifying and correcting errors in raw data to make it accurate, consistent, and ready for analysis. Real-world data is rarely perfect. It often contains missing values, duplicate records, incorrect formats, and outliers that can affect analysis and visualization.

The goal of data cleaning is to improve data quality so that insights drawn from the data are reliable. Clean data ensures that charts, dashboards, and reports truly represent reality and support correct business decisions.


Common data cleaning activities include:

  • Handling missing values 

  • Removing duplicate records

  • Correcting data types and formats (dates, numbers, text)

  • Fixing inconsistent values (e.g., “USA” vs “United States”)

  • Identifying and treating outliers


Without proper data cleaning, even advanced analytics tools and visualization platforms can produce misleading results.
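To make these activities concrete, here is a minimal pandas sketch; the file name and column names (age, date, country) are hypothetical and stand in for whatever your raw data contains:

import pandas as pd

# Minimal sketch of common cleaning steps (file and column names are hypothetical)
df = pd.read_csv('survey_raw.csv')

df = df.drop_duplicates()                                    # remove duplicate records
df['age'] = pd.to_numeric(df['age'], errors='coerce')        # correct data types
df['date'] = pd.to_datetime(df['date'], errors='coerce')     # parse dates stored as text
df['country'] = df['country'].replace({'US': 'United States',
                                        'USA': 'United States'})  # fix inconsistent labels
df['age'] = df['age'].fillna(df['age'].median())             # handle missing values
df = df[df['age'].between(0, 110)]                           # treat implausible outliers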


Impact on Visualization: Misleading Charts and Wrong Trends


When data is not cleaned properly, visualizations can go wrong in several ways:


  1. Missing Values Create Gaps: Missing or null values can cause broken lines, sudden drops, or incomplete charts, making trends look inaccurate.

  2. Duplicate Records Inflate Numbers: Duplicate data can increase totals and counts, leading to exaggerated bars, incorrect percentages, and misleading comparisons.

  3. Inconsistent Categories Split Visuals: Different labels for the same category (e.g., “US”, “USA”, “United States”) can appear as separate bars or slices, confusing the audience.

  4. Incorrect Data Types Affect Charts: Dates stored as text or numbers stored as strings can prevent proper sorting and distort time-based visualizations.

  5. Outliers Distort the Scale: Extreme or incorrect values can stretch chart axes, hiding meaningful patterns in the rest of the data.

  6. False Trends Lead to Wrong Insights: When visuals are built on dirty data, they may highlight trends that don’t actually exist or hide important ones.


Result: Even well-designed dashboards can tell the wrong story if the data behind them is not clean.
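To make the duplicate and split-category problems concrete, here is a small self-contained sketch with made-up numbers, showing how inconsistent labels and a duplicated row distort a simple grouped total, and how normalizing them fixes it:

import pandas as pd

# Made-up example: one country appears under three labels, and one row is duplicated
sales = pd.DataFrame({
    'country': ['US', 'USA', 'United States', 'Canada', 'Canada'],
    'orders': [10, 12, 8, 15, 15],
})

print(sales.groupby('country')['orders'].sum())
# 'US', 'USA', and 'United States' show up as three separate bars,
# and the duplicated Canada row inflates its total.

clean = sales.drop_duplicates().copy()
clean['country'] = clean['country'].replace({'US': 'United States', 'USA': 'United States'})
print(clean.groupby('country')['orders'].sum())
# One category per country, with totals that reflect the underlying data.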


Impact on Business Decisions


Poor data quality doesn’t just affect charts — it directly impacts business decisions. When decision-makers rely on unclean data, the consequences can be costly.


  1. Incorrect Insights Lead to Wrong Decisions: Dirty data can produce false insights, causing leaders to make decisions based on inaccurate information.

  2. Revenue and Cost Miscalculations: Duplicate or missing data can inflate revenue figures or hide expenses, leading to poor budgeting and financial planning.

  3. Misguided Strategy and Planning: When trends are wrong, businesses may invest in the wrong products, markets, or campaigns.

  4. Loss of Trust in Data and Reports: If dashboards frequently show conflicting or incorrect numbers, stakeholders lose confidence in data-driven decision-making.

  5. Operational Inefficiencies: Inaccurate data can lead to overstocking, understaffing, or poor resource allocation.

  6. Risk of Compliance and Legal Issues: In industries like healthcare and finance, poor data quality can result in reporting errors and regulatory penalties.


Result: Clean data enables confident decisions, while dirty data increases risk, cost, and uncertainty.


Dataset Description: Flatten COVID‑19 Survey (PhysioNet)


For this blog, we use the Flatten COVID‑19 Survey dataset from PhysioNet, which contains approximately 294,000 survey responses collected across Canada between March and July 2020.

The dataset captures self-reported information on symptoms, travel, contact with illness, demographics, mental health, and more. It is divided into three schemas based on the timing and content of the survey:


  • Schema 1 (Weeks 13–14): Early phase of the pandemic (late March – early April 2020)

  • Schema 2 (Weeks 15–16): Mid-April 2020, with refined questions

  • Schema 3 (April–July): Later months, with additional demographic and mental health data


Each schema includes weekly or monthly timestamps, location information (postal codes), and various survey responses. The schemas have different column structures, requiring normalization to create a unified dataset.
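Before cleaning, the three schemas are loaded as separate DataFrames; a minimal sketch, assuming hypothetical CSV file names for the three schema exports:

import pandas as pd

# File names are hypothetical placeholders for the three schema exports
df_sc1 = pd.read_csv('flatten_schema1.csv')   # Schema 1: Weeks 13–14
df_sc2 = pd.read_csv('flatten_schema2.csv')   # Schema 2: Weeks 15–16
df_sc3 = pd.read_csv('flatten_schema3.csv')   # Schema 3: April–July

# Compare row counts and column structures before attempting to combine them
for name, df in [('Schema 1', df_sc1), ('Schema 2', df_sc2), ('Schema 3', df_sc3)]:
    print(name, df.shape, list(df.columns)[:5])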


Why There Is No Primary Key


Because this is survey-based data:


  • Each row represents a single survey submission, not a tracked individual.

  • There is no inherent primary key to uniquely identify participants.

  • Multiple rows may have the same combination of attributes (location, month, symptoms).

  • All data is de-identified to protect privacy, so there are no direct identifiers.


Without a primary key, standard deduplication or merging across schemas is not possible. Creating a pseudo key allows us to analyze patterns without assuming uniqueness.


Step-by-Step Data Cleaning Process Followed


After understanding the structure and limitations of the COVID-19 Flatten Survey dataset, we carried out a systematic data cleaning process using Python in a Jupyter Notebook environment. This approach allowed us to explore the data, apply transformations step by step, and validate each cleaning decision to ensure accuracy, consistency, and analytical reliability.


Step 1: Standardize Time Granularity (Normalization)


Problem:

  • Schema 1 & 2 contain week-level data.

  • Schema 3 contains month-level data.

  • Direct analysis would cause misaligned trends.


Solution:

  • Converted week-level data to month-level for Schema 1 & 2.

  • Dropped the week column after conversion.

import pandas as pd
# Define week-to-month mapping
week_to_month = {
    13: 'March',
    14: 'April',
    15: 'April',
    16: 'April'
}

# Function to process a DataFrame
def process_df(df, schema_label):
    # Convert week to integer if it exists
    if 'week' in df.columns:
        df['week'] = df['week'].astype(int)
        # Map week to month
        df['month'] = df['week'].map(week_to_month)
        # Drop the original week column
        df.drop(columns=['week'], inplace=True)
    # Add schema column
    df['schema'] = schema_label
    return df

# Apply to all DataFrames
df_sc1 = process_df(df_sc1, 'Schema 1 (W13–14)')
df_sc2 = process_df(df_sc2, 'Schema 2 (W15–16)')
df_sc3['schema'] = 'Schema 3 (Apr–Jul)'  # Schema 3 is already month-level, so only the schema tag is added

# Optional: check results
print(df_sc1.head())
print(df_sc2.head())
print(df_sc3.head())

Why this matters:

  • Ensures a single, consistent time variable.

  • Enables accurate visualization of trends over time.


Step 2: Concatenating All Three Schemas


After standardizing the time variable (converting week-level data in Schema 1 & 2 to month-level), the next step is to combine all three survey schemas into a single dataset.

Why Concatenation?


Problem:

  • Data was collected using three different survey schemas.

  • There is no common individual identifier across datasets.

  • Joining datasets would create false person-level matches.


Solution:

  • Used row-wise concatenation instead of JOIN.


# Add a column to tag each schema
df_sc1['schema'] = 'Schema 1 (W13–14)'
df_sc2['schema'] = 'Schema 2 (W15–16)'
df_sc3['schema'] = 'Schema 3 (Apr–Jul)'

# Concatenate all three schemas into a single dataset
df_all = pd.concat([df_sc1, df_sc2, df_sc3], ignore_index=True)

# Check the combined dataset
df_all.info()

Why this matters:

  • Ensures one unified dataset for further cleaning and analysis.

  • Preserves population-level insights without assuming individual linkage.

  • Prevents incorrect matches that could distort trends or analysis.



Step 3: Create a Pseudo Key (Synthetic Identifier)


Problem:

  • Dataset has no primary key.

  • Cannot uniquely identify individual respondents.


Solution:

  • Introduced a pseudo-key that captures each response pattern.

# Define columns to include in the pseudo-key
key_cols = ['fsa', 'month', 'fever_chills_shakes',
            'cough', 'shortness_of_breath', 'any_medical_conditions']

# Fill nulls, cast to string, and create the pseudo-key in one step
df_all['pseudo_key'] = (
    df_all[key_cols].fillna('none').astype(str).agg('-'.join, axis=1)
)

# Check pseudo-key statistics
df_all['pseudo_key'].describe()

Why this matters:

  • Enables pattern-level aggregation (see the sketch after this list).

  • Avoids incorrect assumptions about individual identity.

  • Multiple rows can share the same pseudo-key — this is expected and correct.
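As a usage sketch, the pseudo-key makes pattern-level aggregation a simple group-by:

# Count how many survey submissions share each response pattern
pattern_counts = (
    df_all.groupby('pseudo_key')
    .size()
    .sort_values(ascending=False)
)
print(pattern_counts.head(10))   # the most common location/month/symptom patterns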


Step 4: Remove Redundant Symptom Columns


After creating the pseudo-key, we noticed that some symptoms, like cough and shortness of breath, appear in both structured binary columns and the free-text symptoms column.


Problem:

Having the same information in multiple columns can cause data redundancy and potentially double-counting when analyzing or visualizing symptom trends.


Solution:

  1. Compare the structured columns with the free-text symptoms column.

  2. Validate the match rate — high overlap (~96–98%) confirms redundancy.

  3. Drop the redundant binary columns while keeping the free-text symptoms column.


# Validate overlap between the structured flags and the free-text symptoms column
df_all['cough_in_symptoms'] = (
    df_all['symptoms'].fillna('').str.lower().str.split(';').apply(lambda x: 'cough' in x)
)
df_all['cough_flag'] = df_all['cough'].str.lower().eq('y')
print(f"Cough match rate: {(df_all['cough_flag'] == df_all['cough_in_symptoms']).mean():.2%}")

df_all['sob_in_symptoms'] = (
    df_all['symptoms'].fillna('').str.lower().str.split(';').apply(lambda x: 'shortnessofbreath' in x)
)
df_all['sob_flag'] = df_all['shortness_of_breath'].str.lower().eq('y')
print(f"Shortness of breath match rate: {(df_all['sob_flag'] == df_all['sob_in_symptoms']).mean():.2%}")

# Drop the redundant binary columns and the temporary helper columns
cols_to_drop = ['cough', 'cough_in_symptoms', 'cough_flag',
                'shortness_of_breath', 'sob_in_symptoms', 'sob_flag']
df_all.drop(columns=cols_to_drop, inplace=True, errors='ignore')

Why this matters:

  • Prevents double counting of symptoms

  • Reduces dataset redundancy

  • Simplifies further analysis and visualization

  • Keeps one reliable column (symptoms) for all symptom-related features


Step 5: Normalize Multi-Value Text Columns


Problem:

  • Columns like symptoms, conditions, media_channels used ; as separators.

  • Hard to analyze and visualize.


Solution:

  • Replaced semicolons with commas for consistency.

# Replace semicolon separators with commas in multi-value text columns
df_all['symptoms'] = df_all['symptoms'].str.replace(';', ', ')
df_all['conditions'] = df_all['conditions'].str.replace(';', ', ')
df_all['media_channels'] = df_all['media_channels'].str.replace(';', ', ')

Why this matters:

  • Improves readability.

  • Enables easier text processing and feature engineering.


Step 6: Standardize Yes / No Values


Problem:

  • Binary columns had inconsistent values (y, Y, n, N).


Solution:

  • Standardized all binary responses to Yes / No.


cols = ['probable','vulnerable','tested','self_isolating']
df_all[cols] = df_all[cols].replace({'y':'Yes','n':'No','Y':'Yes','N':'No'})

Why this matters:

  • Prevents category duplication in charts and reports.


Step 7: Translate and Normalize Text Values


Problem:

  • Some values were in French, e.g., travel status.

Solution:

  • Translated to consistent English labels.

# Translate the French response ("no: I have always worked from home.") to a consistent English label
df_all['travel_work_school'] = df_all['travel_work_school'].replace(
    "non: j'ai toujours travaille de chez moi.",
    "alwaysWorkFromHome"
)

Why this matters:

  • Ensures consistent categorization across the dataset.


Step 8: Improve Column Naming and Readability


Actions Taken:

  • Renamed fsa → FSA (acronym clarity)

  • Renamed age_1 → age_category

df_all.rename(columns={'fsa':'FSA', 'age_1':'age_category'}, inplace=True)

Why this matters:

  • Improves self-documentation.

  • Makes dashboards easier to understand.


Step 9: Clean Demographic and Outcome Fields


After handling symptoms and standardizing categorical variables, the next step is to clean demographic and outcome columns to ensure accuracy in analysis and visualization.


Sex Column


Problem:

  • The sex column contains inconsistent values: "m", "f", "na", or other variations in uppercase/lowercase.

  • This inconsistency can inflate counts or misrepresent gender distribution in reports.


Solution:

  • Convert the column to string type

  • Standardize to "Male" and "Female"

  • Replace "na" with null (pd.NA) to indicate missing values

df_all['sex'] = (
    df_all['sex'].astype('string')
    .str.lower()
    .replace({'m':'Male', 'f':'Female', 'na':pd.NA})
)

Why this matters:

  • Ensures accurate counts in demographic analysis

  • Prevents misleading visuals or statistics

  • Makes the dataset ready for grouping, plotting, or modeling


COVID Result Column


Problem:

  • The covid_positive column contains inconsistent entries like "positively", "negatively", "n"

  • This can cause confusion when counting positive vs negative cases or plotting outcomes


Solution:

  • Convert the column to string type

  • Standardize values to "Positive", "Negative", "Inconclusive"

df_all['covid_positive'] = (
    df_all['covid_positive'].astype('string')
    .str.lower()
    .replace({'positively':'Positive', 'negatively':'Negative', 'n':'Inconclusive'})
)

Why this matters:

  • Ensures reliable outcome analysis

  • Prevents misinterpretation of COVID-positive/negative counts

  • Supports accurate reporting and predictive modeling


Step 10: Group FSAs into Broader Regions


Problem:

The dataset contains detailed Ontario postal codes (FSAs) for each respondent. While precise, this level of granularity:

  • Makes visualization and regional analysis difficult

  • Reduces readability in charts

  • Makes it harder to identify trends or high-risk areas


Solution:

  • Group FSAs into 5 broader regions:

    • Eastern Ontario

    • Central Ontario

    • Toronto

    • Southwestern Ontario

    • Northern Ontario

  • This can be done with a mapping dictionary in Python:

# Example FSA to region mapping
fsa_to_region = {
    'K': 'Eastern Ontario', 'L': 'Central Ontario', 'M': 'Toronto',
    'N': 'Southwestern Ontario', 'P': 'Northern Ontario'
}

# Map FSA to regions
df_all['region'] = df_all['FSA'].str[0].map(fsa_to_region)

Why this matters:

  • Simplifies charts and analysis while still preserving geographic insights (see the sketch after this list)

  • Enables easy visualization of symptom spread and severity by region

  • Helps quickly identify high-risk areas for targeted interventions

  • Supports immediate public health decisions like quarantine measures or resource allocation
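As an illustrative follow-up, and assuming the columns created in the earlier steps, the new region column makes regional summaries straightforward:

# Responses per region
print(df_all['region'].value_counts())

# Share of 'Yes' answers to the probable-case question by region
# (assumes the Yes/No standardization from Step 6)
probable_by_region = (
    df_all.groupby('region')['probable']
    .apply(lambda s: s.eq('Yes').mean())
    .sort_values(ascending=False)
)
print(probable_by_region)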


Final Clean Dataset Ready for Analysis


After cleaning:

  • Consistent time variable

  • No false joins

  • Reduced redundancy

  • Standardized categories

  • Pattern-level pseudo key

  • Analysis-ready structure


Result: The dataset is now suitable for reliable visualization, trend analysis, and predictive modeling.
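A few quick sanity checks, sketched here using the columns created in the steps above, help confirm that the dataset is ready:

# Quick validation of the cleaned dataset
print(df_all.shape)                                              # total rows and columns
print(df_all['month'].value_counts())                            # consistent time variable
print(df_all['sex'].value_counts(dropna=False))                  # standardized demographics
print(df_all['covid_positive'].value_counts(dropna=False))       # standardized outcomes
print(df_all.isna().mean().sort_values(ascending=False).head())  # remaining missingness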


Final Conclusion


This project demonstrates that data cleaning is the foundation of accurate visualization and reliable decision-making. Using the Flatten COVID-19 Survey dataset, we addressed real-world challenges such as multiple survey schemas, inconsistent time formats, lack of a primary key, and unstructured text data.

By standardizing time to a monthly level, using concatenation instead of joins, creating a pseudo-key, and removing redundant and inconsistent values, we ensured analytical accuracy without assuming individual-level linkage. These steps prevented misleading charts, false trends, and incorrect conclusions.

Clean data leads to trustworthy visualizations, meaningful insights, and confident public-health and business decisions.


If you’re a data analyst or aspiring data scientist, this case study highlights why data cleaning should never be skipped — no matter how advanced your tools are.










