Covid-19 Data Analysis Using Python

Niranjana Ramasamy
Jan 11


Introduction

In this blog post, I would like to share my hackathon experience of analyzing Covid-19 data with Python. The hackathon organizers gave us guidelines and explained what they expected from the participants.

Our team had 5 members. Each of us contributed a part of the data analysis, and we presented a summary of all our work. In this post, I will mainly explain my own part in detail.

Understanding the data

For our analysis, we used Canada’s pre-clinical COVID-19 dataset, published as COVID-19 Survey Data on Symptoms, Demographics and Mental Health in Canada v1.0.

This is Canada’s first publicly available pre-clinical COVID-19 dataset, based on survey responses collected from 294,106 Canadians from March 23rd until July 30th 2020, using a platform developed by Flatten, a Canadian non-profit organization. Survey participants were solicited through social media platforms and traditional news services.

The Flatten dataset consists of three versions of the survey referred to as Schema 1, Schema 2, and Schema 3.

Survey responses associated with Schema 1 are the most numerous, consisting of 263,640 individual level records submitted in the early weeks of the pandemic (March 23rd to April 8th of 2020). These responses also correspond with the peak of Flatten’s presence in the media. Survey responses associated with Schema 2 consist of 14,932 records (April 8th to April 28th of 2020), and Schema 3 consists of 15,534 records (April 28th to July 30th of 2020). Although Schema 2 and Schema 3 contain far fewer records than Schema 1, they contain valuable information about the demographic profile of the survey participant that Schema 1 does not have, such as their race, ethnicity, sex, age, and pandemic-induced most pressing needs (i.e., food, medical, financial, emotional, other).

Objectives: This hackathon aimed to

(a) test our ability to ask good questions, and

(b) understand what questions need to be asked to provide value (analysis & insights).

The questions fell into 5 categories:

Data processing / cleaning,
Descriptive analysis,
Prescriptive analysis,
Predictive analysis and
Demo of insights using Python.

In this post, I will explain our work in the third category, Prescriptive analysis, which means analyzing the data and coming up with actionable recommendations.

Our prescriptive analysis of this Covid-19 survey data covered how far Covid-19 is spreading in each region of Canada, how many people have symptoms or infections, and who is most vulnerable. From this, we suggested what should be done to slow the spread and to provide effective medical support to people who are already affected.

Within our team, my part was the prescriptive analysis of the likely demand for oxygen cylinders and mobile testing units, based on the symptoms reported by the survey participants. This analysis can help public health officials prepare appropriate inventory and logistics arrangements to meet the demand for oxygen cylinders and testing units.

Prescriptive analysis to find Oxygen Demand

This survey dataset records symptoms such as shortness of breath, fever/chills, and cough, which are strong indicators of respiratory distress, so patients with these symptoms might need oxygen support. But not all symptoms carry the same clinical weight for oxygen needs. To reflect this, we assigned medically aligned weights to each of these symptoms as follows:

Symptom	Weightage for oxygen demand
Shortness of breath	3
Fever/ Chills/ Shakes	2
Cough	1

For each patient, I calculated the total oxygen demand score by adding up the weightage of each symptom they reported.

For example, if a person reported all 3 symptoms, their oxygen demand score is 3 + 2 + 1 = 6.

If they reported only shortness of breath (weightage 3) and cough (weightage 1), their oxygen demand score is 4.

import matplotlib.pyplot as plt

# Weight of each symptom toward the oxygen demand score
weights = {
   "shortness_of_breath": 3,
   "fever_chills_shakes": 2,
   "cough": 1
}

# Weighted sum per respondent; assumes each symptom column is 1 if reported, 0 otherwise
df["oxygen_demand_score"] = (
   df["shortness_of_breath"] * weights["shortness_of_breath"] +
   df["fever_chills_shakes"] * weights["fever_chills_shakes"] +
   df["cough"] * weights["cough"]
)
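
As a quick sanity check, the same weighted sum can be run on a few made-up rows. This is only a minimal sketch: the `sample` frame is hypothetical, and it assumes each symptom column is coded 1 when the symptom is reported and 0 otherwise.

```python
import pandas as pd

weights = {"shortness_of_breath": 3, "fever_chills_shakes": 2, "cough": 1}

# Hypothetical respondents; 1 = symptom reported, 0 = not reported (assumed coding)
sample = pd.DataFrame({
    "shortness_of_breath": [1, 1, 0],
    "fever_chills_shakes": [1, 0, 0],
    "cough": [1, 1, 1],
})

# Multiply each symptom column by its weight, then add the results across columns
sample["oxygen_demand_score"] = sum(sample[col] * w for col, w in weights.items())
print(sample["oxygen_demand_score"].tolist())  # [6, 4, 1]
```

The first two rows reproduce the worked examples above (all three symptoms give 6; shortness of breath plus cough gives 4).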

Then I aggregated the individual oxygen demand scores by region in Canada.

This dataset has a column called ‘FSA’ in all the survey records. FSA stands for Forward Sortation Area: the first three characters of a Canadian postal code, which designate a specific geographic area for mail delivery. Examples: M5A, L5M.
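
To make the FSA idea concrete, here is a small sketch of how an FSA could be taken from a full postal code. The survey data already provides the FSA column, so `extract_fsa` is a hypothetical helper for illustration only, and the letter-digit-letter pattern is a loose check that does not exclude the letters Canada Post never uses.

```python
import re

# Loose FSA shape: letter, digit, letter (e.g. "M5A")
FSA_PATTERN = re.compile(r"[A-Z]\d[A-Z]")

def extract_fsa(postal_code):
    """Return the first three characters of a postal code if they look like an FSA, else None."""
    cleaned = postal_code.replace(" ", "").upper()
    fsa = cleaned[:3]
    return fsa if FSA_PATTERN.fullmatch(fsa) else None

print(extract_fsa("M5A 1A1"))  # M5A
print(extract_fsa("l5m2b3"))   # L5M
print(extract_fsa("12345"))    # None
```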

So, I used the FSA code to group the oxygen demand scores and to find the top 15 FSAs with the highest total score.

# Total oxygen demand score per FSA, highest first
FSA_oxygen_need = (
   df.groupby("fsa")["oxygen_demand_score"]
     .sum()
     .reset_index()
     .sort_values(by="oxygen_demand_score", ascending=False)
)

# Already sorted, so the first 15 rows are the top 15 FSAs
top_fsa = FSA_oxygen_need.head(15)
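
The group-and-rank step can be illustrated on a toy table. This is a sketch with made-up scores (the `toy` frame is hypothetical), and it uses `nlargest` as a compact alternative to sorting and taking the head.

```python
import pandas as pd

# Hypothetical per-respondent scores for three FSAs
toy = pd.DataFrame({
    "fsa": ["M5A", "L5M", "M5A", "K1A", "L5M", "M5A"],
    "oxygen_demand_score": [6, 4, 1, 3, 0, 2],
})

# Sum the scores within each FSA, then keep the 2 highest totals
top_toy = (
    toy.groupby("fsa")["oxygen_demand_score"]
       .sum()
       .nlargest(2)
       .reset_index()
)
print(top_toy)
```

Here M5A totals 9 and L5M totals 4, so they rank first and second. Because a sum grows with the number of respondents, an FSA with many survey responses can rank high even when individual scores are modest; comparing against the mean score per FSA may be worthwhile.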

Visualizing Oxygen demand

Then I plotted a lollipop chart showing the total oxygen demand score for each of these top 15 FSAs, sorted from highest to lowest.

The code uses plt.hlines to draw the “sticks” and plt.scatter to place the “lollipops” at the exact value of the oxygen demand score.

plt.figure(figsize=(10, 6))

# "Sticks": one horizontal line per FSA, from 0 to its total score
plt.hlines(
   y=top_fsa["fsa"].astype(str),
   xmin=0,
   xmax=top_fsa["oxygen_demand_score"],
   linewidth=2,
   color='orange'
)

# "Lollipops": a dot at the end of each stick, drawn above the lines
plt.scatter(
   top_fsa["oxygen_demand_score"],
   top_fsa["fsa"].astype(str),
   s=80,
   zorder=3,
   color='orange'
)
plt.grid(False)
plt.title("Top 15 FSAs Requiring Oxygen Cylinders", fontsize=14, fontweight="bold")
plt.xlabel("Total Oxygen Demand Score")
plt.ylabel("FSA")
plt.tight_layout()
plt.show()

Prescriptive analysis to find demand for mobile testing units

This analysis investigates which geographic areas (FSAs) have many reported symptoms but little testing. Such areas need more attention and more testing to confirm whether the symptoms are due to Covid-19.

First, the symptom intensity is calculated per person as the number of symptoms they report out of the three main symptoms: 'fever_chills_shakes', 'cough', and 'shortness_of_breath'.

For example, if a person has all 3 symptoms, their symptom intensity is 3. If they have only 2 symptoms, their symptom intensity is 2.

Then I averaged the symptom intensity within each FSA and computed the share of respondents who had been tested. The testing priority of an FSA is its average symptom intensity multiplied by (1 - share tested), so FSAs with high symptoms and low testing get a high priority for a new testing site. The top 15 FSAs by this priority are shown in a chart.

The following is the Python code used for this analysis.


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the cleaned survey tables
demog = pd.read_csv('demographics_cleaned.csv', low_memory=False)
socio = pd.read_csv('socioeconomic_Cleaned.csv', low_memory=False)
symptoms = pd.read_csv('symptoms_df_cleaned.csv', low_memory=False)
risk = pd.read_csv('risk_flags_cleaned.csv', low_memory=False)
exposure = pd.read_csv('exposure_cleaned.csv', low_memory=False)
covid = pd.read_csv('covid_Cleaned.csv', low_memory=False)
core = pd.read_csv('core_cleaned.csv', low_memory=False)

# Place the tables side by side; this assumes every file has its rows in the same respondent order
df = pd.concat([demog, socio, symptoms, risk, exposure, covid, core], axis=1)

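# Symptom intensity: number of the three main symptoms reported (missing values count as 0)
df['symptom_intensity'] = df[['fever_chills_shakes', 'cough', 'shortness_of_breath']].fillna(0).sum(axis=1)

df['fsa'] = core['fsa']

# Keep only respondents with a known FSA
fsa_analysis = df.dropna(subset=['fsa'])

# Per FSA: average symptom intensity and share of respondents who were tested
fsa_stats = fsa_analysis.groupby('fsa').agg({
    'symptom_intensity': 'mean',
    'tested': lambda x: (x == 'Yes').mean()
}).reset_index()  # make 'fsa' a regular column so it can be used for plotting

# High symptoms combined with low testing gives a high priority
fsa_stats['prescriptive_priority'] = fsa_stats['symptom_intensity'] * (1 - fsa_stats['tested'])

fsa_top = fsa_stats.sort_values(by='prescriptive_priority', ascending=False).head(15)

plt.figure(figsize=(12, 6))
sns.barplot(data=fsa_top, x='fsa', y='prescriptive_priority', palette='Reds_r', hue='fsa', legend=False)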
plt.title('Prescriptive Testing Sites: Top 15 FSAs for Mobile Unit Deployment', fontsize=15, fontweight='bold')
plt.xlabel('FSA')
plt.ylabel('Urgency Score (High Symptoms + Low Testing)')
plt.xticks(rotation=45)
plt.tight_layout()
plt.savefig('prescriptive_testing_deserts.png')
plt.show()
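
To see how the priority score behaves, the same formula can be applied to a handful of made-up respondents. This is a minimal sketch: the `toy` frame and the column names `mean_intensity`, `share_tested`, and `priority` are hypothetical, and it assumes the testing status is recorded as 'Yes' or 'No'.

```python
import pandas as pd

# Hypothetical respondents: symptom count (0-3) and self-reported testing status
toy = pd.DataFrame({
    "fsa": ["M5A", "M5A", "L5M", "L5M"],
    "symptom_intensity": [3, 1, 2, 2],
    "tested": ["No", "No", "Yes", "No"],
})

# Per FSA: average symptom intensity and share of respondents tested
stats = toy.groupby("fsa").agg(
    mean_intensity=("symptom_intensity", "mean"),
    share_tested=("tested", lambda x: (x == "Yes").mean()),
)

# Priority rises with average symptoms and falls as the tested share grows
stats["priority"] = stats["mean_intensity"] * (1 - stats["share_tested"])
print(stats.sort_values("priority", ascending=False))
```

Both FSAs have the same average symptom intensity (2.0), but M5A has no tested respondents while half of L5M's respondents were tested, so M5A gets a priority of 2.0 and L5M gets 1.0.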

Conclusion

This prescriptive analysis can help public health officials proactively plan their actions to control the spread of Covid-19 and provide effective healthcare for affected people, protecting the community during the pandemic.

They can:

Pre-position oxygen supplies in high-score FSAs.
Optimize delivery routes for oxygen trucks.
Place more mobile testing units in high-priority FSAs to increase testing and identify emerging hotspots before they overwhelm local clinics.

By combining simple arithmetic with geographic aggregation and clean visualization, we can transform the raw data into a clear, actionable plan that can help save lives.

Reference: https://physionet.org/content/flatten-covid-survey/1.0/
