
Welcome
to NumpyNinja Blogs


Correlation between Predictive and Prescriptive Analysis

Picture Courtesy Wix.com

Introduction

There are four primary types of analytics:


  1. Descriptive analytics: To understand what happened

  2. Diagnostic analytics: To explain why it happened

  3. Predictive analytics: To predict what might happen in the future

  4. Prescriptive analytics: To recommend what should be done to achieve the best possible outcome


In this blog, I explain how predictive and prescriptive analysis relate to each other, using a real-world example from my recent hackathon experience analyzing COVID-19 data with Python.


Understanding the data

For our analysis, we used Canada’s pre-clinical COVID-19 dataset, published as "COVID-19 Survey Data on Symptoms, Demographics and Mental Health in Canada v1.0".

This is Canada’s first publicly available pre-clinical COVID-19 dataset, based on survey responses collected from 294,106 Canadians from March 23rd until July 30th 2020, using a platform developed by Flatten, a Canadian non-profit organization. 

Survey participants were solicited through social media platforms and traditional news services. 

The Flatten dataset consists of three versions of the survey referred to as Schema 1, Schema 2, and Schema 3. 

Survey responses associated with Schema 1 are the most numerous, consisting of 263,640 individual-level records submitted in the early weeks of the pandemic (March 23rd to April 8th of 2020). These responses also coincide with the peak of Flatten’s presence in the media. Survey responses associated with Schema 2 consist of 14,932 records (April 8th to April 28th of 2020), and Schema 3 consists of 15,534 records (April 28th to July 30th of 2020). Although Schema 2 and Schema 3 contain far fewer records than Schema 1, they contain valuable information about the demographic profile of the survey participants that Schema 1 does not have, such as their race, ethnicity, sex, age, and most pressing pandemic-induced needs (i.e., food, medical, financial, emotional, other).
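To put these record counts in proportion, here is a small sketch that computes each schema's share of the total. It uses only the figures quoted above; the variable names are my own.

```python
# Record counts per survey schema, as quoted in the dataset description above.
schema_counts = {"Schema 1": 263_640, "Schema 2": 14_932, "Schema 3": 15_534}

total = sum(schema_counts.values())

# Share of all responses contributed by each schema, in percent.
schema_share = {name: round(100 * n / total, 1) for name, n in schema_counts.items()}

for name, share in schema_share.items():
    print(f"{name}: {schema_counts[name]:,} records ({share}%)")
print(f"Total: {total:,}")
```

The three counts add up to the 294,106 responses mentioned earlier, and Schema 1 alone accounts for roughly nine out of ten records. Any analysis that needs age or other demographic fields therefore works with only about a tenth of the responses.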


Predictive Analysis: Which Media Has More Reach and Influences COVID-19 Testing Behavior?


In the survey, participants reported the media through which they learned about the survey. I wanted to use this data to find out which media would have more reach among participants for similar surveys in the future.


Approach


In the survey, media channel is a multi-select field: each participant can select multiple media channels as their source of information, and the selections are stored in one value separated by semicolons.

During analysis, I cleaned and sorted this multi-select data. I then created a flat data frame in which each row holds a single age group and a single media channel, and calculated the reach, i.e., the percentage of each age group using each channel.

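import pandas as pd

# Load the cleaned survey data (assumed to be the same cleaned file used later in this post)
socioeconomic_df = pd.read_csv('socioeconomic_Cleaned.csv', low_memory=False)

# Create a Series containing a list of individual media channels for each row with valid data
media_series = socioeconomic_df['media_channels'].dropna().str.split(';')

# Flatten the Series of lists into a single list, trimming stray spaces around each channel
media_flat = [item.strip() for sublist in media_series for item in sublist]

# Count how often each channel was mentioned
media_counts = pd.Series(media_flat).value_counts()
print(media_counts)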
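To make the counting step easier to follow, here is a minimal sketch on made-up data. The toy responses are hypothetical, and it uses pandas' `explode` instead of a list comprehension, which is an alternative to the code above rather than the method used in the hackathon.

```python
import pandas as pd

# Hypothetical toy responses; None marks a participant who skipped the question.
toy = pd.DataFrame({"media_channels": ["TV;Facebook", "TV", None, "Newspaper; TV"]})

# Split each answer into a list, put every channel on its own row, and trim spaces.
channels = (
    toy["media_channels"]
    .dropna()
    .str.split(";")
    .explode()
    .str.strip()
)

# Count how many participants named each channel.
toy_counts = channels.value_counts()
print(toy_counts)
```

Here TV is counted three times, while Facebook and Newspaper are counted once each. The skipped answer is ignored.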

Our next step in the analysis was to predict which media channels influence testing behavior the most.

For this, we used the Random Forest Classifier from scikit-learn in Python. This method combines the predictions of multiple decision trees to produce more accurate and stable results than a single tree.
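The idea of combining many trees can be illustrated with a short sketch. It uses synthetic data from scikit-learn's `make_classification` as a stand-in (not the survey data), and the tree count of 25 is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (not the survey): 200 rows, 5 features, 2 classes.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Each fitted tree gives its own class probabilities for the same rows...
per_tree = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# ...and the forest's probability is the average over all trees.
averaged = per_tree.mean(axis=0)
matches_forest = np.allclose(averaged, forest.predict_proba(X))
print(len(forest.estimators_), matches_forest)
```

Averaging over trees that were each trained on a different random sample of the rows is what smooths out the quirks of any single tree.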


Following is the Python code I used for this calculation and for plotting the result as a bar chart.


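import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

# Combine the two cleaned files column-wise (assumes both files have the same row order)
df = pd.concat([pd.read_csv('socioeconomic_Cleaned.csv', low_memory=False),
                pd.read_csv('covid_Cleaned.csv', low_memory=False)], axis=1)

# Keep rows with both a media answer and a testing answer, then one-hot encode the channels
df_m = df[['media_channels', 'tested']].dropna()
media_dummies = df_m['media_channels'].str.get_dummies(sep=';')

# Fit a random forest that predicts the testing answer from the channel indicators
model_m = RandomForestClassifier().fit(media_dummies, LabelEncoder().fit_transform(df_m['tested']))

# Plot the feature importance of each channel as a horizontal bar chart
plt.figure(figsize=(10, 6))
pd.Series(model_m.feature_importances_, index=media_dummies.columns).sort_values().plot(kind='barh', color='orange')
plt.title('Which Media Channels Have More Reach?', fontsize=14, fontweight='bold')
plt.grid(False)
plt.xlabel('Predictive Importance')
plt.savefig('perspective_media_testing.png')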

Output 


This code estimates which communication channels are most predictive of testing behavior. TV, newspapers and Facebook show a higher predictive weight than the other media. This suggests that traditional media (TV and newspapers), together with Facebook, has more reach.
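For readers unfamiliar with the encoding step in the code above, the sketch below shows what the model actually receives as input. The three toy rows and their "Yes"/"No" answers are hypothetical; the real values in the `tested` column may differ.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical toy rows: multi-select channels plus a yes/no testing answer.
toy = pd.DataFrame({
    "media_channels": ["TV;Facebook", "Newspaper", "TV;Newspaper"],
    "tested": ["Yes", "No", "Yes"],
})

# One 0/1 column per channel; a row can have several 1s.
features = toy["media_channels"].str.get_dummies(sep=";")

# LabelEncoder sorts the labels, so "No" becomes 0 and "Yes" becomes 1.
target = LabelEncoder().fit_transform(toy["tested"])

print(features)
print(target)
```

Each channel becomes its own column, so the feature importances reported by the model can be read per channel.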




The predictive analysis told us which media are likely to have more reach. To act on this prediction, we then carried out a prescriptive analysis.


Prescriptive Analysis: Where to Spend Money to Create More Awareness About the Survey?


Approach:


As part of this prescriptive analysis, we took the reach % from the predictive analysis and recommended a priority for each media channel when spending the marketing budget on creating awareness about the campaign. The priority is based on the reach % as follows:


    if percentage > 40: 'High Priority'

    if percentage > 20 (and at most 40): 'Medium Priority'

    otherwise: 'Low Priority'
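The thresholds above can also be applied to a whole column at once. The following is a minimal sketch using `pd.cut` on hypothetical reach values, with the two boundary values 20 and 40 included to show which side they fall on.

```python
import numpy as np
import pandas as pd

# Hypothetical reach percentages, including the two boundary values.
reach = pd.Series([55.0, 40.0, 25.0, 20.0, 5.0])

# Right-closed bins: (-inf, 20] is Low, (20, 40] is Medium, (40, inf) is High.
priority = pd.cut(
    reach,
    bins=[-np.inf, 20, 40, np.inf],
    labels=["Low Priority", "Medium Priority", "High Priority"],
)
print(pd.DataFrame({"reach_percentage": reach, "priority": priority}))
```

Because the rules use strict "greater than" comparisons, a channel with exactly 40% reach is Medium Priority and one with exactly 20% is Low Priority.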


This helps health care officials answer the following questions:

  • Which media channels are popular among younger vs. older audiences?

  • Where should we spend the marketing budget to get the desired outcome?


In this dataset, all age groups mentioned TV and newspapers, while participants under 26 years old also mentioned Instagram as a source of information.

Based on the analysis I built a heatmap showing media reach across age groups and printed the top recommended channel for each age group.


Following is the Python code I used:


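import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the cleaned files and combine them column-wise (assumes matching row order)
socio_df = pd.read_csv('socioeconomic_Cleaned.csv', low_memory=False)
demo_df = pd.read_csv('demographics_cleaned.csv', low_memory=False)

df = pd.concat([socio_df, demo_df], axis=1)
df = df.dropna(subset=['age', 'media_channels'])
df = df[df['media_channels'] != 'Not Reported']

# Build one row per (age, channel) pair from the multi-select answers
expanded_rows = []
for idx, row in df.iterrows():
    channels = str(row['media_channels']).split(';')
    for channel in channels:
        expanded_rows.append({'age': row['age'], 'channel': channel.strip()})

# Create the flat data frame once, after the loop has finished
expanded_df = pd.DataFrame(expanded_rows)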

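# Number of participants in each age group
age_counts = df['age'].value_counts()
# Number of participants in each age group who named each channel
reach_df = expanded_df.groupby(['age', 'channel']).size().reset_index(name='user_count')

# Reach = share of the age group that named the channel, in percent
reach_df['reach_percentage'] = reach_df.apply(lambda x: (x['user_count'] / age_counts[x['age']]) * 100, axis=1)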

def assign_priority(percentage):
    # Map a reach percentage to a spending priority label
    if percentage > 40:
        return 'High Priority (Primary Spend)'
    if percentage > 20:
        return 'Medium Priority (Supportive Spend)'
    return 'Low Priority (Niche/Testing)'

reach_df['spending_priority'] = reach_df['reach_percentage'].apply(assign_priority)

pivot_reach = reach_df.pivot(index="channel", columns="age", values="reach_percentage").fillna(0)

plt.figure(figsize=(14, 8))
sns.heatmap(pivot_reach, annot=True, cmap="YlGnBu", fmt=".1f", cbar_kws={'label': '% Reach in Age Group'})
plt.title('Prescriptive Spending Map: Media Reach % by Age Group', fontsize=16, fontweight='bold')
plt.ylabel('Media Channel')
plt.xlabel('Age Group')
plt.tight_layout()
plt.show()

recommendations = reach_df.sort_values(['age', 'reach_percentage'], ascending=[True, False])
print("--- Top Spending Recommendation per Age Group ---")
print(recommendations.groupby('age').head(1)[['age', 'channel', 'reach_percentage', 'spending_priority']])

Output 


TV, newspapers and Facebook are the top three sources of information for most participants. Public health campaigns should consider directing their budget toward these media channels for future surveys.

Prescriptive analytics transforms raw data into budget strategy, campaign design, and resource allocation.


This chart shows the media reach % per age group
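The reach calculation behind the chart can also be written without an explicit loop. Below is a minimal sketch on hypothetical toy data; the age labels and answers are invented for illustration, and `explode` is an alternative to the row-by-row loop in the code above, not the approach originally used.

```python
import pandas as pd

# Hypothetical toy data: one row per participant.
toy = pd.DataFrame({
    "age": ["<26", "<26", "26-44", "26-44"],
    "media_channels": ["Instagram;TV", "Instagram", "TV;Newspaper", "TV"],
})

# One row per (participant, channel) pair, without an explicit Python loop.
expanded = toy.assign(channel=toy["media_channels"].str.split(";")).explode(
    "channel", ignore_index=True
)
expanded["channel"] = expanded["channel"].str.strip()

# Reach = participants in the age group naming the channel / participants in the group.
reach = expanded.groupby(["age", "channel"]).size().reset_index(name="user_count")
reach["reach_percentage"] = reach["user_count"] / reach["age"].map(toy["age"].value_counts()) * 100

# Channel with the highest reach in each age group.
top = reach.loc[reach.groupby("age")["reach_percentage"].idxmax()]
print(top)
```

In this toy example Instagram is the top channel for the under-26 group and TV for the 26-44 group. On the full dataset, a vectorized version like this is usually much faster than iterating over rows.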

Conclusion


Predictive and prescriptive analytics complement each other and together give better insights.

In this example, combining the outcomes of the predictive and prescriptive analyses of the COVID-19 survey data lets health care officials plan future public health communication more effectively, including where to focus and spend more money to get better reach.


Together, they answered the following questions:


  • Which channels matter most?

  • Which audiences can we influence through them?

  • Where should we invest to maximize desired outcome?


© Copyright 2025 by Numpy Ninja Inc.
