SQL Data Hygiene: How to Clean Your Database for Better Insights
- Kavita
- Sep 3
- 7 min read
When I first started a data analysis project, I wondered how bad the quality of raw datasets sourced online really is, and it quickly became obvious that messy data delivers messy insights. The quality of insights depends entirely on the quality of the data, and preprocessing is the first step in transforming raw, messy data into a reliable, clean resource. In this blog, I show how to clean a real-world survey dataset in PostgreSQL so that further analysis, visualizations, and decisions are based on reliable and consistent information.
I have used the Hearing dataset from Kaggle. This dataset is a collection of survey responses designed to capture people’s experiences, perceptions, and behaviors around hearing wellness. It provides valuable insights into aspects such as daily headphone use, perceived barriers to hearing tests, instances of missed important sounds, and interest in hearing care apps.
*Dataset: The reference link is provided at the end of this blog.
For the demonstration, I am using PostgreSQL and pgAdmin to showcase Data Cleaning in SQL.
Load the data file into the Database
Step 1: Create a New Database in PostgreSQL
Launch pgAdmin.
In the left sidebar, right-click on Databases → Create → Database.
In the Create Database window:
Database Name: hearing_wellness_db
Owner: postgres (default)
Click Save.
The new database is now created.
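Alternatively, the same database can be created with plain SQL from any Query Tool window:
CREATE DATABASE hearing_wellness_db OWNER postgres;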
Step 2: Create a New Table
Right-click the newly created database hearing_wellness_db → Query Tool.
Paste the following query to create the hearing_survey table:
CREATE TABLE hearing_survey (
perceived_hearing_meaning TEXT,
hearing_fomo TEXT,
hearing_test_barrier TEXT,
missed_important_sounds TEXT,
left_out_due_to_hearing TEXT,
daily_headphone_use TEXT,
belief_early_hearing_care INT,
last_hearing_test_method TEXT,
interest_in_hearing_app TEXT,
desired_app_features TEXT,
awareness_on_hearing_and_willingness_to_invest TEXT,
paid_app_test_interest TEXT,
age_group TEXT,
ear_discomfort_after_use TEXT
);
Execute the query. The hearing_survey table is now created in the database.
Step 3: Populate the Table with the Hearing well-being Survey Report.csv file
Navigate to Schemas → public → Tables → hearing_survey.
Right-click hearing_survey → Import/Export.
Configure import settings:
Filename: Browse and select the CSV file.
Encoding: UTF-8
Format: CSV
Go to Options →
Check Header (since the CSV has column headers).
Delimiter: , (comma)
Click OK.
The table is now populated with the survey data.
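If you prefer to script the import instead of using the dialog, SQL's COPY (or psql's \copy, which reads the file from the client machine) does the same job. A sketch, with a placeholder file path:
COPY hearing_survey
FROM '/path/to/Hearing well-being Survey Report.csv' -- placeholder: adjust to your file location
WITH (FORMAT csv, HEADER true, DELIMITER ',', ENCODING 'UTF8');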
Data Cleaning
Step 1: Understand the Dataset
Before cleaning, it’s important to inspect the structure and content of the hearing_survey table. This helps identify data types, missing values, and potential inconsistencies.
Preview the first 10 rows
SELECT * FROM hearing_survey LIMIT 10;
This gives a quick overview of the kind of data in each column.
Check Total Number of Rows
SELECT COUNT(*) AS total_rows FROM hearing_survey;
Inspect Table Structure
SELECT
column_name,
data_type,
is_nullable,
column_default
FROM information_schema.columns
WHERE table_name = 'hearing_survey'
ORDER BY ordinal_position;
It shows the name of each column, the type of data stored, whether NULL values are allowed, and the default value if none is provided.
Step 2: Count Missing Values per Column
Check for NULL values in the data
SELECT
COUNT(*) FILTER (WHERE perceived_hearing_meaning IS NULL) AS missing_meaning,
COUNT(*) FILTER (WHERE hearing_fomo IS NULL) AS missing_fomo,
COUNT(*) FILTER (WHERE hearing_test_barrier IS NULL) AS missing_barrier,
COUNT(*) FILTER (WHERE missed_important_sounds IS NULL) AS missing_missed_sounds,
COUNT(*) FILTER (WHERE left_out_due_to_hearing IS NULL) AS missing_left_out,
COUNT(*) FILTER (WHERE daily_headphone_use IS NULL) AS missing_headphone_use,
COUNT(*) FILTER (WHERE belief_early_hearing_care IS NULL) AS missing_belief,
COUNT(*) FILTER (WHERE last_hearing_test_method IS NULL) AS missing_test_method,
COUNT(*) FILTER (WHERE interest_in_hearing_app IS NULL) AS missing_interest,
COUNT(*) FILTER (WHERE desired_app_features IS NULL) AS missing_features,
COUNT(*) FILTER (WHERE awareness_on_hearing_and_willingness_to_invest IS NULL) AS missing_awareness,
COUNT(*) FILTER (WHERE paid_app_test_interest IS NULL) AS missing_paid_interest,
COUNT(*) FILTER (WHERE age_group IS NULL) AS missing_age,
COUNT(*) FILTER (WHERE ear_discomfort_after_use IS NULL) AS missing_ear_discomfort
FROM hearing_survey;
There are no NULLs in any column of the hearing_survey table. This is a strong indicator of good data quality and makes further cleaning easier.
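One caveat: CSV imports often represent missing answers as empty strings rather than NULLs, which the query above would not catch. A quick spot check on two representative columns (a sketch; the same pattern extends to the other columns):
SELECT
COUNT(*) FILTER (WHERE TRIM(perceived_hearing_meaning) = '') AS blank_meaning,
COUNT(*) FILTER (WHERE TRIM(hearing_test_barrier) = '') AS blank_barrier
FROM hearing_survey;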
Step 3: Standardize Text Data Across All Columns in the Dataset
The following block standardizes common special characters such as curly quotes and dashes, then removes any remaining non-ASCII characters.
DO $$
DECLARE
col RECORD;
BEGIN
-- Loop through all text columns in hearing_survey
FOR col IN
SELECT column_name
FROM information_schema.columns
WHERE table_name = 'hearing_survey'
AND data_type LIKE '%text%'
LOOP
-- Step 1: Standardize common special characters
EXECUTE format(
'UPDATE hearing_survey
SET %I = TRANSLATE(%I, %L, %L)
WHERE %I IS NOT NULL;',
col.column_name, col.column_name,
'’“”–—…', -- curly quote, curly double quotes, dashes, ellipsis
'''""--.', -- one ASCII replacement per character (TRANSLATE maps char-for-char)
col.column_name
);
-- Step 2: Remove non-ASCII characters
EXECUTE format(
'UPDATE hearing_survey
SET %I = regexp_replace(%I, ''[^\x20-\x7E]'', '''', ''g'')
WHERE %I IS NOT NULL;',
col.column_name, col.column_name, col.column_name
);
END LOOP;
END $$;
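To confirm the cleanup worked, a quick check on one column (the same pattern applies to the rest) that no non-ASCII characters remain:
SELECT COUNT(*) AS remaining_non_ascii
FROM hearing_survey
WHERE perceived_hearing_meaning ~ '[^\x20-\x7E]';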
Step 4: Identify and Remove Duplicate Rows
SELECT *,
COUNT(*) AS duplicate_count
FROM hearing_survey
GROUP BY perceived_hearing_meaning, hearing_fomo, hearing_test_barrier,
missed_important_sounds, left_out_due_to_hearing, daily_headphone_use,
belief_early_hearing_care, last_hearing_test_method,
interest_in_hearing_app, desired_app_features,
awareness_on_hearing_and_willingness_to_invest,
paid_app_test_interest, age_group, ear_discomfort_after_use
HAVING COUNT(*) > 1;
The query returns no rows: the table is free of duplicates, which is another strong indicator of high data quality.
Step 5: Add a Unique Identifier (Primary Key) for Normalization
ALTER TABLE hearing_survey ADD COLUMN id SERIAL PRIMARY KEY;
Step 6: Standardize Categorical Columns
Mixed-type columns in surveys are common, and mapping them into consistent categories is key for meaningful analysis. Let's look at each column in the dataset one by one.
perceived_hearing_meaning column
The dataset has 25 distinct textual combinations for this column; grouping responses into broader categories (like “Enjoyment / Leisure” or “Independence / Alertness”) allows us to aggregate, count, and visualize them easily.
Create a mapping dictionary that standardizes messy raw responses into clear, consistent categories for analysis:
CREATE TABLE meaning_map (
raw_text TEXT,
category TEXT
);
INSERT INTO meaning_map (raw_text, category) VALUES
('Enjoying music, laughter, and life', 'Enjoyment / Leisure'),
('Feeling close to loved ones', 'Relationships'),
('Staying independent and alert', 'Independence / Alertness'),
('Staying connected with the world', 'Social Connectivity'),
('I havent thought about it much', 'Undecided'),
('Peace and Calm to Brain soulfully', 'Well-being'),
('Life', 'Other');
Create a normalized table by linking each survey response to its standardized category using the mapping dictionary:
CREATE TABLE hearing_meaning_normalized AS
SELECT h.id, m.category
FROM hearing_survey h
JOIN meaning_map m
ON h.perceived_hearing_meaning ILIKE '%' || m.raw_text || '%';
By creating a mapping dictionary and a normalized table, we transformed 25 messy, inconsistent textual responses into a set of clean, standardized categories.
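As a quick sanity check, the standardized categories can now be aggregated directly:
SELECT category, COUNT(*) AS responses
FROM hearing_meaning_normalized
GROUP BY category
ORDER BY responses DESC;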
hearing_fomo
Check the distinct responses
SELECT DISTINCT hearing_fomo
FROM hearing_survey
ORDER BY hearing_fomo;
The hearing_fomo column has only 4 distinct values, so it is already clean.
hearing_test_barrier
Check the distinct responses
SELECT DISTINCT hearing_test_barrier
FROM hearing_survey
ORDER BY hearing_test_barrier;
The hearing_test_barrier column is very messy, with multiple variants, combinations, typos, and extra spaces. We need to standardize and categorize the values before analysis.
Normalize multiple responses
Split comma-separated barriers into separate rows and trim spaces:
CREATE TABLE hearing_test_barrier_normalized AS
SELECT
id,
TRIM(unnest(string_to_array(hearing_test_barrier, ','))) AS barrier
FROM hearing_survey;
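To see what string_to_array and unnest do here, consider a standalone example on a made-up multi-select answer:
SELECT TRIM(part) AS barrier
FROM unnest(string_to_array('Cost, Fear, Lack of awareness', ',')) AS part;
-- returns three rows: 'Cost', 'Fear', 'Lack of awareness'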
Create a mapping table
Map each distinct barrier keyword to a category, including a catch-all 'Other / Unknown':
CREATE TABLE barrier_map (
raw_text TEXT,
category TEXT
);
INSERT INTO barrier_map (raw_text, category) VALUES
('Cost', 'Financial'),
('Facilities', 'Access / Infrastructure'),
('Lack of awareness', 'Awareness'),
('Fear', 'Psychological'),
('Shame', 'Psychological'),
('Time', 'Time Constraints'),
('Didn''t feel the need', 'Undecided'),
('Never', 'Undecided'),
('NA', 'Other / Unknown'),
('Not applicable', 'Other / Unknown');
Map normalized barriers to categories
Join the normalized table with the mapping table using ILIKE to handle partial matches:
CREATE TABLE hearing_test_barrier_categorized AS
SELECT
n.id,
n.barrier,
COALESCE(m.category, 'Other / Unknown') AS category
FROM hearing_test_barrier_normalized n
LEFT JOIN barrier_map m
ON n.barrier ILIKE '%' || m.raw_text || '%';
COALESCE ensures that any unmapped barrier gets the 'Other / Unknown' category.
Mapping hearing_test_barrier responses to clear categories turns messy text into structured data, making it easier to analyze and visualize.
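A useful follow-up is to list the raw barriers that fell into 'Other / Unknown'; anything meaningful that appears here can be promoted into barrier_map:
SELECT barrier, COUNT(*) AS occurrences
FROM hearing_test_barrier_categorized
WHERE category = 'Other / Unknown'
GROUP BY barrier
ORDER BY occurrences DESC;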
missed_important_sounds
Check the frequency of each response
SELECT missed_important_sounds, COUNT(*) AS frequency
FROM hearing_survey
GROUP BY missed_important_sounds
ORDER BY frequency DESC;
Very few distinct values are found, so mapping isn't necessary.
However, there are a few spelling mistakes that can be corrected.
UPDATE hearing_survey
SET missed_important_sounds = 'I cannot hear clearly when told important information in serious situations'
WHERE missed_important_sounds ILIKE 'I can''t hear claerly when tell important information in serious situation';
UPDATE hearing_survey
SET missed_important_sounds = 'I miss things when with friends'
WHERE missed_important_sounds ILIKE 'When with friends anol';
left_out_due_to_hearing
Check the distinct responses
SELECT DISTINCT left_out_due_to_hearing
FROM hearing_survey;
This column is already fairly clean: all responses are short, consistent, and meaningful.
daily_headphone_use column
Check the distinct responses
SELECT DISTINCT daily_headphone_use
FROM hearing_survey;
UPDATE hearing_survey
SET daily_headphone_use = CASE
WHEN daily_headphone_use ILIKE '%Less than 1%' THEN '0-1 hours'
WHEN daily_headphone_use ILIKE '%1-2%' THEN '1-2 hours'
WHEN daily_headphone_use ILIKE '%2-4%' THEN '2-4 hours'
WHEN daily_headphone_use ILIKE '%more than 4%' THEN '4+ hours'
WHEN daily_headphone_use ILIKE '%my parent%' THEN 'Family use - parent'
WHEN daily_headphone_use ILIKE '%my child%' THEN 'Family use - child'
ELSE daily_headphone_use
END;
This SQL standardizes the daily_headphone_use column by grouping similar responses into consistent, easily interpretable categories for analysis.
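A general tip: before running an in-place UPDATE like this, the mapping can be previewed with a read-only query that reuses the same CASE expression, so nothing is modified until you are happy with the result:
SELECT daily_headphone_use AS old_value,
CASE
WHEN daily_headphone_use ILIKE '%Less than 1%' THEN '0-1 hours'
WHEN daily_headphone_use ILIKE '%1-2%' THEN '1-2 hours'
WHEN daily_headphone_use ILIKE '%2-4%' THEN '2-4 hours'
WHEN daily_headphone_use ILIKE '%more than 4%' THEN '4+ hours'
WHEN daily_headphone_use ILIKE '%my parent%' THEN 'Family use - parent'
WHEN daily_headphone_use ILIKE '%my child%' THEN 'Family use - child'
ELSE daily_headphone_use
END AS new_value,
COUNT(*) AS responses
FROM hearing_survey
GROUP BY daily_headphone_use;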
belief_early_hearing_care
Check the distinct responses
SELECT DISTINCT belief_early_hearing_care
FROM hearing_survey;
The belief_early_hearing_care column contains values on a 1-to-5 scale, so no data cleaning is required.
last_hearing_test_method
Check the distinct responses
SELECT DISTINCT last_hearing_test_method
FROM hearing_survey;
The values are clean; no standardization is required.
interest_in_hearing_app
Check the distinct responses
SELECT DISTINCT interest_in_hearing_app
FROM hearing_survey;
Standardize the values using the query below:
UPDATE hearing_survey
SET interest_in_hearing_app = CASE
WHEN interest_in_hearing_app ILIKE 'Yes%' THEN 'Yes'
WHEN interest_in_hearing_app ILIKE 'No%' THEN 'No'
WHEN interest_in_hearing_app ILIKE 'Maybe%' THEN 'Maybe'
ELSE interest_in_hearing_app
END;
This will map all variations of "Yes", "No", and "Maybe" to consistent categories.
desired_app_features
Check the distinct responses
SELECT DISTINCT desired_app_features
FROM hearing_survey;
This column includes a lot of “noise”: long, free-text responses that aren’t actual selectable app features.
Create a normalized table by splitting multi-select desired_app_features into individual features while removing empty values for easier analysis.
CREATE TABLE hearing_features_normalized AS
SELECT id, feature
FROM (
SELECT id, TRIM(unnest(string_to_array(desired_app_features, ','))) AS feature
FROM hearing_survey
) t
WHERE feature <> '';
Categorize normalized app features into predefined options, grouping all other responses as "Other / Misc" for consistent analysis.
CREATE TABLE desired_features_clean AS
SELECT id,
CASE
WHEN feature IN (
'Quick tests',
'Doctor consultation',
'Game-based interaction',
'Earphone calibration',
'Soft guidance',
'Regular testing reminders',
'Detailed report generation',
'Personalized volume adjustments',
'Audio amplifier or sound booster',
'Report sharing options',
'Privacy'
) THEN feature
ELSE 'Other / Misc'
END AS feature
FROM hearing_features_normalized;
desired_features_clean serves as the standardized table of app features to reference or map for analysis.
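Because both tables share the id key, the cleaned features can be joined back to the survey for richer cuts, for example desired features by age group:
SELECT s.age_group, c.feature, COUNT(*) AS requests
FROM desired_features_clean c
JOIN hearing_survey s USING (id)
GROUP BY s.age_group, c.feature
ORDER BY s.age_group, requests DESC;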
awareness_on_hearing_and_willingness_to_invest
Check the distinct responses
SELECT DISTINCT awareness_on_hearing_and_willingness_to_invest
FROM hearing_survey;
This column has only two values, 'Yes' and 'No', so it is already clean.
paid_app_test_interest
Check the distinct responses
SELECT DISTINCT paid_app_test_interest
FROM hearing_survey;
The query below maps all variants of Yes/No/Maybe responses into consistent categories.
UPDATE hearing_survey
SET paid_app_test_interest = CASE
WHEN paid_app_test_interest ILIKE '%Yes%' THEN 'Yes'
WHEN paid_app_test_interest ILIKE '%Maybe%' THEN 'Maybe'
WHEN paid_app_test_interest ILIKE '%No%' THEN 'No'
ELSE paid_app_test_interest
END;
age_group
Check the distinct responses
SELECT DISTINCT age_group
FROM hearing_survey;
The age groups are already standardized and clean. No further transformation is needed.
ear_discomfort_after_use
Check the distinct responses
SELECT DISTINCT ear_discomfort_after_use
FROM hearing_survey;
No standardization is required.
Now the dataset is clean and standardized: free from duplicates, NULLs, and non-ASCII characters.
We can now begin further analysis on the dataset.


