NumpyNinja Blogs

Mastering SQL Data Cleaning: A Guide for Aspiring Data Analysts

Updated: May 22


If you’re just diving into the world of data analysis, you’ve most likely come across the saying,

"Garbage in, garbage out."

And honestly, it couldn’t be more accurate. No matter how sophisticated your machine learning model is or how sleek your dashboard looks, it won’t do you any good if the underlying data is messy, incomplete, or inconsistent. That’s where data cleaning steps in. It might not be the most glamorous part of the job, but it’s absolutely crucial, and it’s often where the real analytical work begins.

In this blog post, I’m excited to share a hands-on SQL data cleaning project I tackled, using a real dataset that covers global company layoffs from 2020 to 2023. If you’re an aspiring data analyst like I am, this guide will equip you with essential SQL techniques for data cleaning and getting it ready for insightful analysis.

 

Why Data Cleaning Matters

 

Imagine you’re diving into a dataset that tracks layoffs across a bunch of companies. But then you start to notice some quirks:

 

• “Google” and “google” are being treated as if they’re two separate companies.

• Some dates look like they were written in a secret code.

• The “industry” column is a bit of a mess, with missing values and inconsistent labels like "Crypto.com" and "Crypto."

• You find entire rows that are just duplicates of each other.

 

If you jump right into creating reports or dashboards without addressing these issues, your insights could be misleading, potentially leading to some poor decisions. That’s why, regardless of whether you’re a data analyst, engineer, or scientist, you’ll probably find yourself spending 60 to 80% of your time just cleaning and prepping data.

Let’s get down to business and explore how to tidy up this dataset using SQL.

 

Dataset Overview

 

 Here are the main columns of our dataset:

• Company – The name of the business

• Location – The city or region where the layoffs occurred

• Industry – The sector in which the company operates

• Total_laid_off – The total number of employees who were laid off

• Percentage_laid_off – The percentage of the workforce that was impacted

• Date – The date when the layoffs took place

• Stage – The current business stage of the company (like Startup or Public)

• Country – The country where the layoffs happened

• Funds_raised_millions – The amount of funding raised, measured in millions

 

Tools Used

 

• SQL Platform: MySQL Workbench

• Language: SQL

• Table: layoffs

Everything here is accomplished using plain SQL—no fancy external tools or packages involved!

 

Step-by-Step SQL Data Cleaning Process

 

1. Create a Staging Table

Why? It’s always a good idea to work on a copy of your data. This way, you can avoid accidentally messing up your original table.

CREATE TABLE layoffs_staging LIKE layoffs;

INSERT INTO layoffs_staging
SELECT * FROM layoffs;
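A quick sanity check (just a SELECT, so nothing here modifies data) confirms the copy matches the original:

```sql
-- Row counts should be identical after the copy
SELECT
    (SELECT COUNT(*) FROM layoffs)         AS original_rows,
    (SELECT COUNT(*) FROM layoffs_staging) AS staging_rows;
```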


2. Remove Duplicate Records

Sometimes, you might notice that the same row shows up more than once. If we don’t address this, those duplicates can really throw off our results.

To tackle this, we’ll use the ROW_NUMBER() window function to identify and mark the duplicates.

 

CREATE TABLE layoffs_staging2 AS
SELECT *,
       ROW_NUMBER() OVER (
           PARTITION BY company, location, industry, total_laid_off,
                        percentage_laid_off, `date`, stage, country,
                        funds_raised_millions
       ) AS row_num
FROM layoffs_staging;

-- Rows with row_num > 1 are exact duplicates; remove them
DELETE FROM layoffs_staging2
WHERE row_num > 1;

-- Swap the deduplicated copy back in so the later steps keep the same table name
DROP TABLE layoffs_staging;
RENAME TABLE layoffs_staging2 TO layoffs_staging;

Note: MySQL doesn’t allow you to delete rows directly through a CTE, and this table has no unique ID column, so the workaround is a second staging table that carries a row_num helper column (we drop it again in step 7).

 

3. Standardize Inconsistent Text

Values like “crypto,” “Crypto.com,” and “Crypto” all refer to the same industry. However, SQL sees them as distinct unless we clean them up.

 

-- Clean up white space and stray punctuation
UPDATE layoffs_staging
SET company  = TRIM(company),
    industry = TRIM(industry),
    country  = TRIM(TRAILING '.' FROM country);

-- Normalize case so values like 'Google' and 'google' match
UPDATE layoffs_staging
SET company = LOWER(company);

-- Fix inconsistent values manually
UPDATE layoffs_staging
SET industry = 'Crypto'
WHERE industry LIKE 'Crypto%';

 

4. Convert Text Dates to Proper Format

Raw CSV imports often store dates as plain text. To sort, filter, or group by date, we first need to convert the column to a proper DATE type.

 

UPDATE layoffs_staging
SET `date` = STR_TO_DATE(`date`, '%m/%d/%Y');

ALTER TABLE layoffs_staging
MODIFY COLUMN `date` DATE;
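After the conversion, a quick range check is a good habit; since this dataset covers 2020 to 2023, the earliest and latest dates should fall inside that window:

```sql
-- Both values should land between 2020 and 2023
SELECT MIN(`date`) AS earliest, MAX(`date`) AS latest
FROM layoffs_staging;
```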

 

5. Handle Missing Values

Not every missing value is represented as NULL; some are simply empty strings. We need to take care of those as well.

 

UPDATE layoffs_staging
SET industry = NULL
WHERE industry = '';

-- Fill in missing industries from other rows for the same company (a self-join)
UPDATE layoffs_staging t1
JOIN layoffs_staging t2
  ON t1.company = t2.company
SET t1.industry = t2.industry
WHERE t1.industry IS NULL
  AND t2.industry IS NOT NULL;

 

6. Remove Incomplete Rows

Some rows are missing layoff data, which isn't really useful for our analysis.


DELETE FROM layoffs_staging

WHERE total_laid_off IS NULL AND percentage_laid_off IS NULL; 
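Before running a DELETE like this, it’s worth running the same WHERE clause as a SELECT first, so you can see exactly how many rows are about to disappear:

```sql
-- Preview what the DELETE above will remove
SELECT COUNT(*) AS rows_to_delete
FROM layoffs_staging
WHERE total_laid_off IS NULL
  AND percentage_laid_off IS NULL;
```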


7. Drop Temporary Columns

If you've set up any helper columns, like row numbers, make sure to tidy up once you're done!

 

ALTER TABLE layoffs_staging

DROP COLUMN row_num;

 

SQL Techniques Used

 

Here’s a quick overview of the SQL techniques used:

• Backup original data – CREATE TABLE, INSERT INTO

• Remove duplicates – ROW_NUMBER(), DELETE

• Standardize text – TRIM(), LOWER(), UPDATE

• Fix dates – STR_TO_DATE(), ALTER TABLE

• Handle nulls and empties – NULL, self-join, UPDATE

• Drop bad rows – DELETE ... WHERE

• Clean up helper columns – ALTER TABLE ... DROP COLUMN

 

Real-World Tips

 

• Adopt an analyst’s mindset: your aim is to ensure the data is accurate, clean, and reliable.

• Always double-check your work: run SELECT queries frequently to confirm any changes you make.

• Keep a record of your process: it will save you time, especially when working with others or revisiting a project down the line.
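To make the second tip concrete, here’s the pattern I use, sketched against the Crypto cleanup from step 3: run the UPDATE’s WHERE clause as a SELECT before the change, then re-run it afterwards to confirm the change landed.

```sql
-- Before the UPDATE: see exactly which values will change
SELECT DISTINCT industry
FROM layoffs_staging
WHERE industry LIKE 'Crypto%';

-- After the UPDATE: the same query should return only 'Crypto'
```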

 

Quick Insights from the Cleaned Data

 

1. Layoffs Peaked in 2022–2023

 

SELECT YEAR(`date`) AS year, SUM(total_laid_off) AS total_layoffs
FROM layoffs_staging
GROUP BY YEAR(`date`)
ORDER BY year;

Insight: The most significant layoffs took place in 2022–2023, probably as a result of market adjustments following the pandemic.

 

2. Most Affected Industries

 

SELECT industry, SUM(total_laid_off) AS total_layoffs
FROM layoffs_staging
GROUP BY industry
ORDER BY total_layoffs DESC;

Top Sectors Hit: Consumer Retail, Transportation, Tech Startups

 

3. Companies with 100% Layoffs

 

SELECT company, percentage_laid_off
FROM layoffs_staging
WHERE percentage_laid_off = 1;  -- 1 here means 100% of the workforce

These could indicate complete shutdowns.
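If you want to dig one level deeper, sorting these companies by the funds_raised_millions column from the dataset overview shows how much capital some of them had raised before shutting down entirely:

```sql
-- Fully shut-down companies, biggest fundraisers first
SELECT company, funds_raised_millions
FROM layoffs_staging
WHERE percentage_laid_off = 1
ORDER BY funds_raised_millions DESC;
```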

 

4. Countries with the Most Layoffs

 

SELECT country, SUM(total_laid_off) AS total_layoffs
FROM layoffs_staging
GROUP BY country
ORDER BY total_layoffs DESC;

Finding: The US made up about 70% of all layoffs recorded in the dataset.

 

Final Thoughts

 

Data cleaning isn’t just a technical task; it’s a whole mindset. It’s about taking a look at that raw, messy data and uncovering the clean story that lies beneath. If you’re just starting out as a data analyst, diving into real datasets like this and going through the process of cleaning and prepping them is one of the best ways to sharpen your skills and build your portfolio.

Thanks for reading — and happy querying!

 

 
 
