SQL Data Hygiene: How to Clean Your Database for Better Insights
- Kavita
- Sep 3
- 7 min read
When I first started a data analysis project, I wondered how bad the quality of raw datasets sourced online really is, and it quickly became obvious that messy data delivers messy insights. The quality of insights depends entirely on the quality of the data, and preprocessing is the first step in transforming raw, messy data into a reliable, clean resource. In this blog, I show how to clean a real-world survey dataset in PostgreSQL so that further analysis, visualizations, and decisions are based on reliable and consistent information.
I have used the Hearing dataset from Kaggle. This dataset is a collection of survey responses designed to capture people’s experiences, perceptions, and behaviors around hearing wellness. It provides valuable insights into aspects such as daily headphone use, perceived barriers to hearing tests, instances of missed important sounds, and interest in hearing care apps.
*Dataset: The reference link is provided at the end of this blog.
For the demonstration, I am using PostgreSQL and pgAdmin to showcase Data Cleaning in SQL.
Load the data file into the Database
Step 1: Create a New Database in PostgreSQL
Launch pgAdmin.
In the left sidebar, right-click on Databases → Create → Database.
In the Create Database window:
Database Name: hearing_wellness_db
Owner: postgres (default)
Click Save.
The new database is now created.
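Alternatively, the same database can be created with plain SQL from any Query Tool window:
CREATE DATABASE hearing_wellness_db OWNER postgres;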
Step 2: Create a New Table
Right-click the newly created database hearing_wellness_db → Query Tool.
Paste the following query to create the hearing_survey table:
CREATE TABLE hearing_survey (
perceived_hearing_meaning TEXT,
hearing_fomo TEXT,
hearing_test_barrier TEXT,
missed_important_sounds TEXT,
left_out_due_to_hearing TEXT,
daily_headphone_use TEXT,
belief_early_hearing_care INT,
last_hearing_test_method TEXT,
interest_in_hearing_app TEXT,
desired_app_features TEXT,
awareness_on_hearing_and_willingness_to_invest TEXT,
paid_app_test_interest TEXT,
age_group TEXT,
ear_discomfort_after_use TEXT
);
Execute the query. The hearing_survey table is now created in the database.
Step 3: Populate the Table with the Hearing well-being Survey Report.csv file
Navigate to Schemas → public → Tables → hearing_survey.
Right-click hearing_survey → Import/Export.
Configure import settings:
Filename: Browse and select the CSV file.
Encoding: UTF-8
Format: CSV
Go to Options →
Check Header (since the CSV has column headers).
Delimiter: , (comma)
Click OK.
The table is now populated with the survey data.
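If you prefer to script the import instead of using the dialog, SQL's COPY (or psql's \copy, which reads the file from the client machine) does the same job. A sketch, with a placeholder file path:
COPY hearing_survey
FROM '/path/to/Hearing well-being Survey Report.csv' -- placeholder: adjust to your file location
WITH (FORMAT csv, HEADER true, DELIMITER ',', ENCODING 'UTF8');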
Data Cleaning
Step 1: Understand the Dataset
Before cleaning, it’s important to inspect the structure and content of the hearing_survey table. This helps identify data types, missing values, and potential inconsistencies.
Preview the first 10 rows
SELECT * FROM hearing_survey LIMIT 10;
This gives a quick overview of the kind of data in each column.
Check Total Number of Rows
SELECT COUNT(*) AS total_rows FROM hearing_survey;
Inspect Table Structure
SELECT
column_name,
data_type,
is_nullable,
column_default
FROM information_schema.columns
WHERE table_name = 'hearing_survey'
ORDER BY ordinal_position;
It shows the name of each column, the type of data stored, whether NULL values are allowed, and the default value if none is provided.
Step 2: Count Missing Values per Column
Check for NULL values in the data
SELECT
COUNT(*) FILTER (WHERE perceived_hearing_meaning IS NULL) AS missing_meaning,
COUNT(*) FILTER (WHERE hearing_fomo IS NULL) AS missing_fomo,
COUNT(*) FILTER (WHERE hearing_test_barrier IS NULL) AS missing_barrier,
COUNT(*) FILTER (WHERE missed_important_sounds IS NULL) AS missing_missed_sounds,
COUNT(*) FILTER (WHERE left_out_due_to_hearing IS NULL) AS missing_left_out,
COUNT(*) FILTER (WHERE daily_headphone_use IS NULL) AS missing_headphone_use,
COUNT(*) FILTER (WHERE belief_early_hearing_care IS NULL) AS missing_belief,
COUNT(*) FILTER (WHERE last_hearing_test_method IS NULL) AS missing_test_method,
COUNT(*) FILTER (WHERE interest_in_hearing_app IS NULL) AS missing_interest,
COUNT(*) FILTER (WHERE desired_app_features IS NULL) AS missing_features,
COUNT(*) FILTER (WHERE awareness_on_hearing_and_willingness_to_invest IS NULL) AS missing_awareness,
COUNT(*) FILTER (WHERE paid_app_test_interest IS NULL) AS missing_paid_interest,
COUNT(*) FILTER (WHERE age_group IS NULL) AS missing_age,
COUNT(*) FILTER (WHERE ear_discomfort_after_use IS NULL) AS missing_ear_discomfort
FROM hearing_survey;
There are no NULLs in any column of the hearing_survey table. This is a strong indicator of good data quality and makes further cleaning easier.
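One caveat: CSV imports often represent missing answers as empty strings rather than NULLs, which the query above would not catch. A quick spot check on two representative columns (a sketch; the same pattern extends to the other columns):
SELECT
COUNT(*) FILTER (WHERE TRIM(perceived_hearing_meaning) = '') AS blank_meaning,
COUNT(*) FILTER (WHERE TRIM(hearing_test_barrier) = '') AS blank_barrier
FROM hearing_survey;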
Step 3: Standardize Text Data Across All Columns in the Dataset
The following block standardizes common special characters such as curly quotes and dashes, then removes any remaining non-ASCII characters.
DO $$
DECLARE
col RECORD;
BEGIN
-- Loop through all text columns in hearing_survey
FOR col IN
SELECT column_name
FROM information_schema.columns
WHERE table_name = 'hearing_survey'
AND data_type LIKE '%text%'
LOOP
-- Step 1: Standardize common special characters
EXECUTE format(
'UPDATE hearing_survey
SET %I = TRANSLATE(%I, %L, %L)
WHERE %I IS NOT NULL;',
col.column_name, col.column_name,
'’“”–—…', -- curly quote, curly double quotes, dashes, ellipsis
'''""--.', -- one ASCII replacement per character (TRANSLATE maps char-for-char)
col.column_name
);
-- Step 2: Remove non-ASCII characters
EXECUTE format(
'UPDATE hearing_survey
SET %I = regexp_replace(%I, ''[^\x20-\x7E]'', '''', ''g'')
WHERE %I IS NOT NULL;',
col.column_name, col.column_name, col.column_name
);
END LOOP;
END $$;
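To confirm the cleanup worked, a quick check on one column (the same pattern applies to the rest) that no non-ASCII characters remain:
SELECT COUNT(*) AS remaining_non_ascii
FROM hearing_survey
WHERE perceived_hearing_meaning ~ '[^\x20-\x7E]';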
Step 4: Identify and Remove Duplicate Rows
SELECT *,
COUNT(*) AS duplicate_count
FROM hearing_survey
GROUP BY perceived_hearing_meaning, hearing_fomo, hearing_test_barrier,
missed_important_sounds, left_out_due_to_hearing, daily_headphone_use,
belief_early_hearing_care, last_hearing_test_method,
interest_in_hearing_app, desired_app_features,
awareness_on_hearing_and_willingness_to_invest,
paid_app_test_interest, age_group, ear_discomfort_after_use
HAVING COUNT(*) > 1;
The query returns no rows: the table is free of duplicates, which is another strong indicator of high data quality.
Step 5: Add a Unique Identifier (Primary Key) for Normalization
ALTER TABLE hearing_survey ADD COLUMN id SERIAL PRIMARY KEY;
Step 6: Standardize Categorical Columns
Mixed-type columns in surveys are common, and mapping them into consistent categories is key for meaningful analysis. Let's look at each column in the dataset one by one.
perceived_hearing_meaning column
The dataset has 25 distinct textual combinations for this column; grouping responses into broader categories (like “Enjoyment / Leisure” or “Independence / Alertness”) allows us to aggregate, count, and visualize them easily.
Create a mapping dictionary that standardizes messy raw responses into clear, consistent categories for analysis:
CREATE TABLE meaning_map (
raw_text TEXT,
category TEXT
);
INSERT INTO meaning_map (raw_text, category) VALUES
('Enjoying music, laughter, and life', 'Enjoyment / Leisure'),
('Feeling close to loved ones', 'Relationships'),
('Staying independent and alert', 'Independence / Alertness'),
('Staying connected with the world', 'Social Connectivity'),
('I havent thought about it much', 'Undecided'),
('Peace and Calm to Brain soulfully', 'Well-being'),
('Life', 'Other');
Create a normalized table by linking each survey response to its standardized category using the mapping dictionary:
CREATE TABLE hearing_meaning_normalized AS
SELECT h.id, m.category
FROM hearing_survey h
JOIN meaning_map m
ON h.perceived_hearing_meaning ILIKE '%' || m.raw_text || '%';
By creating a mapping dictionary and a normalized table, we transformed 25 messy, inconsistent textual responses into a set of clean, standardized categories.
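As a quick sanity check, the standardized categories can now be aggregated directly:
SELECT category, COUNT(*) AS responses
FROM hearing_meaning_normalized
GROUP BY category
ORDER BY responses DESC;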
hearing_fomo
Check the distinct responses
SELECT DISTINCT hearing_fomo
FROM hearing_survey
ORDER BY hearing_fomo;
The hearing_fomo column has only 4 distinct values, so it is already clean.
hearing_test_barrier
Check the distinct responses
SELECT DISTINCT hearing_test_barrier
FROM hearing_survey
ORDER BY hearing_test_barrier;
The hearing_test_barrier column is very messy, with multiple variants, combinations, typos, and extra spaces. We need to standardize and categorize the values before analysis.
Normalize multiple responses
Split comma-separated barriers into separate rows and trim spaces:
CREATE TABLE hearing_test_barrier_normalized AS
SELECT
id,
TRIM(unnest(string_to_array(hearing_test_barrier, ','))) AS barrier
FROM hearing_survey;
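To see what string_to_array and unnest do here, consider a standalone example on a made-up multi-select answer:
SELECT TRIM(part) AS barrier
FROM unnest(string_to_array('Cost, Fear, Lack of awareness', ',')) AS part;
-- returns three rows: 'Cost', 'Fear', 'Lack of awareness'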
Create a mapping table
Map each distinct barrier keyword to a category, including a catch-all 'Other / Unknown':
CREATE TABLE barrier_map (
raw_text TEXT,
category TEXT
);
INSERT INTO barrier_map (raw_text, category) VALUES
('Cost', 'Financial'),
('Facilities', 'Access / Infrastructure'),
('Lack of awareness', 'Awareness'),
('Fear', 'Psychological'),
('Shame', 'Psychological'),
('Time', 'Time Constraints'),
('Didn''t feel the need', 'Undecided'),
('Never', 'Undecided'),
('NA', 'Other / Unknown'),
('Not applicable', 'Other / Unknown');
Map normalized barriers to categories
Join the normalized table with the mapping table using ILIKE to handle partial matches:
CREATE TABLE hearing_test_barrier_categorized AS
SELECT
n.id,
n.barrier,
COALESCE(m.category, 'Other / Unknown') AS category
FROM hearing_test_barrier_normalized n
LEFT JOIN barrier_map m
ON n.barrier ILIKE '%' || m.raw_text || '%';
COALESCE ensures that any unmapped barrier gets the 'Other / Unknown' category.
Mapping hearing_test_barrier responses to clear categories turns messy text into structured data, making it easier to analyze and visualize.
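A useful follow-up is to list the raw barriers that fell into 'Other / Unknown'; anything meaningful that appears here can be promoted into barrier_map:
SELECT barrier, COUNT(*) AS occurrences
FROM hearing_test_barrier_categorized
WHERE category = 'Other / Unknown'
GROUP BY barrier
ORDER BY occurrences DESC;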
missed_important_sounds
Check the frequency of each response
SELECT missed_important_sounds, COUNT(*) AS frequency
FROM hearing_survey
GROUP BY missed_important_sounds
ORDER BY frequency DESC;
Very few distinct values are found, so mapping isn't necessary.
However, there are a few spelling mistakes that can be corrected.
UPDATE hearing_survey
SET missed_important_sounds = 'I cannot hear clearly when told important information in serious situations'
WHERE missed_important_sounds ILIKE 'I can''t hear claerly when tell important information in serious situation';
UPDATE hearing_survey
SET missed_important_sounds = 'I miss things when with friends'
WHERE missed_important_sounds ILIKE 'When with friends anol';
left_out_due_to_hearing
Check the distinct responses
SELECT DISTINCT left_out_due_to_hearing
FROM hearing_survey;
This column is already fairly clean: all responses are short, consistent, and meaningful.
daily_headphone_use column
Check the distinct responses
SELECT DISTINCT daily_headphone_use
FROM hearing_survey;
UPDATE hearing_survey
SET daily_headphone_use = CASE
WHEN daily_headphone_use ILIKE '%Less than 1%' THEN '0-1 hours'
WHEN daily_headphone_use ILIKE '%1-2%' THEN '1-2 hours'
WHEN daily_headphone_use ILIKE '%2-4%' THEN '2-4 hours'
WHEN daily_headphone_use ILIKE '%more than 4%' THEN '4+ hours'
WHEN daily_headphone_use ILIKE '%my parent%' THEN 'Family use - parent'
WHEN daily_headphone_use ILIKE '%my child%' THEN 'Family use - child'
ELSE daily_headphone_use
END;
This SQL standardizes the daily_headphone_use column by grouping similar responses into consistent, easily interpretable categories for analysis.
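A general tip: before running an in-place UPDATE like this, the mapping can be previewed with a read-only query that reuses the same CASE expression, so nothing is modified until you are happy with the result:
SELECT daily_headphone_use AS old_value,
CASE
WHEN daily_headphone_use ILIKE '%Less than 1%' THEN '0-1 hours'
WHEN daily_headphone_use ILIKE '%1-2%' THEN '1-2 hours'
WHEN daily_headphone_use ILIKE '%2-4%' THEN '2-4 hours'
WHEN daily_headphone_use ILIKE '%more than 4%' THEN '4+ hours'
WHEN daily_headphone_use ILIKE '%my parent%' THEN 'Family use - parent'
WHEN daily_headphone_use ILIKE '%my child%' THEN 'Family use - child'
ELSE daily_headphone_use
END AS new_value,
COUNT(*) AS responses
FROM hearing_survey
GROUP BY daily_headphone_use;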
belief_early_hearing_care
Check the distinct responses
SELECT DISTINCT belief_early_hearing_care
FROM hearing_survey;
The belief_early_hearing_care column contains values on a 1-to-5 scale, so no data cleaning is required.
last_hearing_test_method
Check the distinct responses
SELECT DISTINCT last_hearing_test_method
FROM hearing_survey;
The values are clean; no standardization is required.
interest_in_hearing_app
Check the distinct responses
SELECT DISTINCT interest_in_hearing_app
FROM hearing_survey;
Standardize the values using the query below:
UPDATE hearing_survey
SET interest_in_hearing_app = CASE
WHEN interest_in_hearing_app ILIKE 'Yes%' THEN 'Yes'
WHEN interest_in_hearing_app ILIKE 'No%' THEN 'No'
WHEN interest_in_hearing_app ILIKE 'Maybe%' THEN 'Maybe'
ELSE interest_in_hearing_app
END;
This will map all variations of "Yes", "No", and "Maybe" to consistent categories.
desired_app_features
Check the distinct responses
SELECT DISTINCT desired_app_features
FROM hearing_survey;
This column includes a lot of “noise”: long, free-text responses that aren’t actual selectable app features.
Create a normalized table by splitting multi-select desired_app_features into individual features while removing empty values for easier analysis.
CREATE TABLE hearing_features_normalized AS
SELECT id, feature
FROM (
SELECT id, TRIM(unnest(string_to_array(desired_app_features, ','))) AS feature
FROM hearing_survey
) t
WHERE feature <> '';
Categorize normalized app features into predefined options, grouping all other responses as "Other / Misc" for consistent analysis.
CREATE TABLE desired_features_clean AS
SELECT id,
CASE
WHEN feature IN (
'Quick tests',
'Doctor consultation',
'Game-based interaction',
'Earphone calibration',
'Soft guidance',
'Regular testing reminders',
'Detailed report generation',
'Personalized volume adjustments',
'Audio amplifier or sound booster',
'Report sharing options',
'Privacy'
) THEN feature
ELSE 'Other / Misc'
END AS feature
FROM hearing_features_normalized;
desired_features_clean serves as the standardized table of app features to reference or map for analysis.
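Because both tables share the id key, the cleaned features can be joined back to the survey for richer cuts, for example desired features by age group:
SELECT s.age_group, c.feature, COUNT(*) AS requests
FROM desired_features_clean c
JOIN hearing_survey s USING (id)
GROUP BY s.age_group, c.feature
ORDER BY s.age_group, requests DESC;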
awareness_on_hearing_and_willingness_to_invest
Check the distinct responses
SELECT DISTINCT awareness_on_hearing_and_willingness_to_invest
FROM hearing_survey;
This column has only two values, 'Yes' and 'No', so it is already clean.
paid_app_test_interest
Check the distinct responses
SELECT DISTINCT paid_app_test_interest
FROM hearing_survey;
The query below maps all variants of Yes/No/Maybe responses into consistent categories.
UPDATE hearing_survey
SET paid_app_test_interest = CASE
WHEN paid_app_test_interest ILIKE '%Yes%' THEN 'Yes'
WHEN paid_app_test_interest ILIKE '%Maybe%' THEN 'Maybe'
WHEN paid_app_test_interest ILIKE '%No%' THEN 'No'
ELSE paid_app_test_interest
END;
age_group
Check the distinct responses
SELECT DISTINCT age_group
FROM hearing_survey;
The age groups are already standardized and clean. No further transformation is needed.
ear_discomfort_after_use
Check the distinct responses
SELECT DISTINCT ear_discomfort_after_use
FROM hearing_survey;
No standardization is required.
Now the dataset is clean and standardized: free from duplicates, NULLs, and non-ASCII characters.
We can now begin further analysis on the dataset.


