From COVID Survey Chaos to Clean Insights
- uzmafarheen
- Jan 9
- 4 min read
The Python hackathon gave me hands-on experience analyzing the Flatten COVID-19 dataset in Jupyter Notebook—the digital lab book where data scientists prototype Python, R, and more. Before writing a single line of code, I learned the first rule of data analysis: deeply understand your dataset first.
Dataset Reality Check:
This wasn't hospital lab data. It contained self-reported COVID-19 symptom surveys from Ontario residents during early 2020, split across 3 evolving schemas.
Schema Breakdown:
Schema 1: 120K+ rows, basic symptoms, minimal demographics
Schema 2 & 3: Richer details (age, sex, race, pandemic needs), fewer responses
Critical Dataset Information
Key points to note from this dataset:
Each row represents one survey submission, not one unique person. The is_most_recent flag is crucial as it identifies the latest response from each participant for person-level analysis.
This survey includes Ontario residents only, not all of Canada. Consider location and demographic limitations when drawing conclusions.
The data is self-reported, not clinical diagnoses. Responses come directly from survey participants, making it suitable for trend analysis rather than confirmed case counts.
What the Data Actually Contains
Before cleaning or analyzing the data, it's very important to check what it actually contains.
The dataset has raw variables: the original survey questions that people answered directly.
In the COVID-19 dataset, I found raw fields such as:
Response time (week/month)
Symptoms: fever, cough, shortness of breath
Exposure risks: travel, suspected contacts
Basic demographics/health indicators
It also has smart derived fields: computer-calculated columns, not direct survey answers.
These are the derived fields found in the data:
probable: Symptom + exposure match (not lab-confirmed)
vulnerable: Age/health risk flags
is_most_recent: Duplicate survey filter
Before starting any cleaning or coding, a data analyst should pause and ask a few key questions about the dataset.
What exactly is this data?
Is it survey data, lab data, sensor data, or something else?
Who is included in this dataset?
Only people from Ontario? Which age groups or populations are represented?
What does one row represent?
Is each row a submission, a person, a day, or something else? Do I need a column like is_most_recent to avoid double counting?
How was the data collected?
Is it self-reported or clinically confirmed? What kinds of biases could this introduce?
What are the important columns?
Which columns show time, symptoms, exposure risks, and demographics? Which are derived fields like probable and vulnerable?
What questions do I want to answer?
Do I want to look at trends over time, differences between groups, or risk patterns?
Asking these questions will help us understand what we're analyzing.
Let's start with the cleaning process in Jupyter Notebook!
First, I loaded schema_1.csv (the Ontario COVID survey file) using pd.read_csv(). This schema has the core symptom-tracking columns plus the is_most_recent flag for deduplication, so I applied the 6-step cleaning process shown below. The same workflow works for Schemas 2 and 3 (just with different demographic columns).
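A minimal loading sketch, assuming schema_1.csv sits in the same folder as the notebook:

```python
import pandas as pd

# Load the Ontario survey export into a DataFrame.
survey_1 = pd.read_csv("schema_1.csv")
```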

Tip after loading: once the data is in Jupyter Notebook, immediately run survey_1.shape and survey_1.info() to spot duplicates, missing values, and data types before cleaning begins; this catches most data quality issues upfront. I also used survey_1.head() to preview sample rows and understand the structure.
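Roughly what that first look can look like in a notebook cell:

```python
# First-look checks before any cleaning.
print(survey_1.shape)   # (rows, columns): a baseline row count to compare against later
survey_1.info()         # column names, dtypes, and non-null counts in one view
survey_1.head()         # preview a few raw submissions to understand the structure
```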
Data cleaning steps:
1. Filtered to keep only records where is_most_recent == 'y', ensuring one row per respondent and removing duplicate historical submissions from people who took the survey several times. Some participants created 4-5 rows each, inflating symptom counts. This step dropped about 20% of rows while preserving the latest, most accurate response from each person.
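A sketch of that filter, assuming the flag is stored as the literal string 'y':

```python
rows_before = len(survey_1)

# Keep only the latest submission per participant, flagged 'y' in is_most_recent.
survey_1 = survey_1[survey_1["is_most_recent"] == "y"].copy()

print(f"Dropped {rows_before - len(survey_1)} duplicate historical submissions")
```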

2. Renamed abbreviated column headers (like travel_outside_canada → international_travel and contact_with_illness → known_covid_exposure) to create a professional dataset structure. Clear, descriptive names prevent analysis errors and ensure charts and insights are immediately understandable for reporting.
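The two renames mentioned above expressed as a rename mapping; the notebook's full dictionary covers more columns:

```python
# Replace abbreviated survey headers with descriptive names.
survey_1 = survey_1.rename(columns={
    "travel_outside_canada": "international_travel",
    "contact_with_illness": "known_covid_exposure",
})
```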

3. Standardized the postal_fsa column by converting all entries to uppercase strings and stripping whitespace, which ensures uniformity for accurate regional grouping and census dataset linkage. This eliminates counting errors where the same Toronto FSA appears in three different spellings and gets treated as three separate regions, giving you clean COVID trend charts by region.
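One way to do that normalization, assuming postal_fsa is read in as plain text:

```python
# Uppercase and trim the forward sortation areas so values like 'm5v ' and 'M5V' group together.
survey_1["postal_fsa"] = survey_1["postal_fsa"].str.strip().str.upper()
```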

4. Converted 'y/n' symptom and exposure columns to binary 1/0 values to enable mathematical analysis and percentage calculations. This makes it possible to compute, for example, the percentage of respondents reporting fever instead of counting manually.
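A sketch of the conversion; the column names here are illustrative, since the real schema lists more symptom and exposure fields:

```python
# Hypothetical subset of the yes/no columns in Schema 1.
yn_columns = ["fever", "cough", "shortness_of_breath",
              "international_travel", "known_covid_exposure"]

# Map 'y' -> 1 and 'n' -> 0 so the columns support sums, means, and percentages.
survey_1[yn_columns] = survey_1[yn_columns].replace({"y": 1, "n": 0})

# Example payoff: the share of respondents reporting fever, no manual counting needed.
print(f"{survey_1['fever'].eq(1).mean():.1%} of respondents reported fever")
```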

5. Replaced missing values (NaN) with -1 to indicate non-responses, enabling integer column conversion while keeping non-responses distinguishable from explicit 'no' answers.
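Continuing the sketch with the same yn_columns list:

```python
# Mark non-responses explicitly so the yes/no columns can later be stored as integers.
survey_1[yn_columns] = survey_1[yn_columns].fillna(-1)
```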

6. Optimized column data types (int8 for binary flags, string for FSA codes) to reduce memory usage and enable efficient statistical analysis. Proper data types speed up pandas operations like groupby and filtering, and prevent errors from mixing incompatible formats. Clean data types also make your data summary output professional and instantly show stakeholders the dataset is analysis-ready.
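The corresponding dtype conversions, again under the column-name assumptions above:

```python
# Compact dtypes: int8 covers the -1/0/1 flags, and pandas' string dtype suits the FSA codes.
survey_1[yn_columns] = survey_1[yn_columns].astype("int8")
survey_1["postal_fsa"] = survey_1["postal_fsa"].astype("string")
```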

Finally, we can take a look at the cleaned data.
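A quick verification cell, for instance:

```python
# Verify the result: one row per respondent, descriptive names, compact dtypes.
survey_1.info(memory_usage="deep")
survey_1.head()
```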

I applied the same 6 steps to Schema 2 and Schema 3 separately, handling their additional demographic columns (age, sex, race) with identical methods for missing data, whitespace stripping, and binary encoding.
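One way to avoid copy-pasting the six steps three times is to wrap them in a reusable function; this is a sketch under the same column-name assumptions as above, not the exact notebook code:

```python
def clean_survey(df, yn_columns):
    """Apply the 6-step cleaning to any of the three survey schemas."""
    # 1. Keep only the latest response per participant.
    df = df[df["is_most_recent"] == "y"].copy()
    # 2. Clearer column names (keys missing from a schema are simply ignored by rename).
    df = df.rename(columns={"travel_outside_canada": "international_travel",
                            "contact_with_illness": "known_covid_exposure"})
    # 3. Standardize the postal FSA text.
    df["postal_fsa"] = df["postal_fsa"].str.strip().str.upper()
    # 4-6. y/n -> 1/0, missing -> -1, then compact integer flags.
    df[yn_columns] = (df[yn_columns]
                      .replace({"y": 1, "n": 0})
                      .fillna(-1)
                      .astype("int8"))
    df["postal_fsa"] = df["postal_fsa"].astype("string")
    return df

# yn_columns differs per schema; Schemas 2 and 3 add demographic fields on top.
survey_2 = clean_survey(pd.read_csv("schema_2.csv"), yn_columns)
```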
Why I Cleaned This Way
Data cleaning transforms unusable raw survey data into reliable insights. Duplicates count the same person multiple times, cryptic column names like week confuse analysts, and inconsistent formats like "y/n" prevent calculations.
Using these 6 standard steps, I filtered duplicates with is_most_recent, renamed columns for clarity, standardized postal codes, converted text answers to numbers, handled missing values, and optimized data types.
This makes symptom trends accurate, regional analysis possible, and charts understandable—turning messy CSVs into stakeholder-ready datasets.
Common Cleaning Techniques:
1. Remove Duplicates: Filter using flags like is_most_recent to keep only the latest record per person, or remove exact duplicate rows entirely.
2. Rename Columns: Change cryptic names like week to survey_week and probable to is_probable_case for clear analysis.
3. Standardize Text: Convert postal codes and text fields to uppercase and remove extra spaces for consistent grouping.
4. Convert Text to Numbers: Transform "y/n" answers into 1/0 values so you can calculate percentages and averages.
5. Handle Missing Values: Replace empty cells with -1 to mark non-responses, or use averages for numeric fields.
6. Optimize Data Types: Change columns to efficient formats like int8 for flags and string for text to save memory and speed up analysis.
Always check row counts before and after each step to verify your cleaning worked correctly. These techniques work on almost any messy dataset.
Key Takeaway:
Data analysis means first understanding your data (60% effort), then cleaning it (30%), then getting insights (10%). Clean data shows real symptom patterns people trust. Use these 6 steps next time you have messy CSV files to make chaos simple and useful.

