Mastering SQL Data Cleaning: A Guide for Aspiring Data Analysts
- Gunjan Saini
- May 21
- 5 min read
Updated: May 22

If you’re just diving into the world of data analysis, you’ve most likely come across the saying,
"Garbage in, garbage out."
And honestly, it couldn’t be more accurate. No matter how sophisticated your machine learning model or how sleek your dashboard looks, it won’t do you any good if the data it’s based on is messy, incomplete, or inconsistent. That’s where data cleaning steps in. It might not be the most glamorous part of the job, but it’s absolutely crucial and it’s often where the real analytical work begins.
In this blog post, I’m excited to share a hands-on SQL data cleaning project I tackled, using a real dataset that covers global company layoffs from 2020 to 2023. If you’re an aspiring data analyst like I am, this guide will equip you with essential SQL techniques for data cleaning and getting it ready for insightful analysis.
Why Data Cleaning Matters
Imagine you’re diving into a dataset that tracks layoffs across a bunch of companies. But then you start to notice some quirks:
• “Google” and “google” are being treated as if they’re two separate companies.
• Some dates look like they were written in a secret code.
• The “industry” column is a bit of a mess, with missing values and inconsistent labels like "Crypto.com" and "Crypto."
• You find entire rows that are just duplicates of each other.
If you jump right into creating reports or dashboards without addressing these issues, your insights could be misleading, potentially leading to some poor decisions. That’s why, regardless of whether you’re a data analyst, engineer, or scientist, you’ll probably find yourself spending 60 to 80% of your time just cleaning and prepping data.
Let’s get down to business and explore how to tidy up this dataset using SQL.
Dataset Overview
Here are the main columns of our dataset:
• Company – The name of the business
• Location – The city or region where the layoffs occurred
• Industry – The sector in which the company operates
• Total_laid_off – The total number of employees who were laid off
• Percentage_laid_off – The percentage of the workforce that was impacted
• Date – The date when the layoffs took place
• Stage – The current business stage of the company (like Startup or Public)
• Country – The country where the layoffs happened
• Funds_raised_millions – The amount of funding raised, measured in millions
Tools Used
• SQL Platform: MySQL Workbench
• Language: SQL
• Table: layoffs
Everything here is accomplished using plain SQL—no fancy external tools or packages involved!
Step-by-Step SQL Data Cleaning Process
1. Create a Staging Table
Why? It’s always a good idea to work on a copy of your data. This way, you can avoid accidentally messing up your original table.
CREATE TABLE layoffs_staging LIKE layoffs;
INSERT INTO layoffs_staging SELECT * FROM layoffs;
2. Remove Duplicate Records
Sometimes, you might notice that the same row shows up more than once. If we don’t address this, those duplicates can really throw off our results.
To tackle this, we’ll use ROW_NUMBER() within a CTE to identify and mark the duplicates.
WITH duplicate_cte as
(
SELECT * ,
ROW_NUMBER() over(
PARTITION BY company, location, industry, total_laid_off, percentage_laid_off, `date`, stage, country, funds_raised_millions ) AS row_num
from layoffs_staging
)
Delete from layoffs_staging
WHERE id IN ( SELECT id FROM duplicate_cte WHERE row_num > 1);
Note: MySQL doesn’t allow you to delete directly from a CTE, so I had to come up with a workaround using IDs.
3. Standardize Inconsistent Text
When it comes to terms like “crypto,” “Crypto.com,” and “Crypto,” they all mean the same thing. However, SQL sees them as distinct unless we clean them up.
##Clean up white space and formatting in columns
UPDATE layoffs_staging
SET company = TRIM(company),
industry = TRIM(industry),
country = TRIM(TRAILING '.' FROM country);
##Converting everything to lowercase or proper case
UPDATE layoffs_staging
SET company = LOWER(company);
##Fix inconsistent values manually
UPDATE layoffs_staging
SET industry = 'Crypto'
WHERE industry LIKE 'Crypto%';
4. Convert Text Dates to Proper Format
Raw CSV files often keep dates in a text format. To work with these dates, we need to convert them into a correct format.
UPDATE layoffs_staging
SET date = STR_TO_DATE(date, '%m/%d/%Y');
ALTER TABLE layoffs_staging
MODIFY COLUMN date DATE;
5. Handle Missing Values
Not every missing value is represented as NULL; some are simply empty strings. We need to take care of those as well.
UPDATE layoffs_staging
SET industry = NULL
WHERE industry = '';
##To fill in missing values, we can use a self- join
UPDATE layoffs_staging t1
JOIN layoffs_staging t2
ON t1.company =t2.company AND
t1.industry IS NULL AND t2.industry IS NOT NULL
SET t1.industry = t2.industry
6. Remove Incomplete Rows
Some rows are missing layoff data, which isn't really useful for our analysis.
DELETE FROM layoffs_staging
WHERE total_laid_off IS NULL AND percentage_laid_off IS NULL;
7. Drop Temporary Columns
If you've set up any helper columns, like row numbers, make sure to tidy up once you're done!
ALTER TABLE layoffs_staging
DROP COLUMN row_num;
SQL Techniques Used
Here’s a quick overview of SQL Techniques Used:
Real-World Tips
• Adopt an analyst's mindset: your aim is to ensure the data is accurate, clean, and reliable.
• Always double-check your work: frequently use SELECT queries to confirm any changes you make.
• Keep a record of your process: this will save your time, especially when working with others or revisiting a project down the line.
Quick Insights from the Cleaned Data
1. Layoffs Reached their Peaked in 2022–2023
SELECT YEAR(date) AS year, SUM(total_laid_off) AS total_layoffs
FROM layoffs_staging
GROUP BY YEAR(date)
ORDER BY year;
Insight: The most significant layoffs took place in 2022–2023, probably as a result of market adjustments following the pandemic.
2. Most Affected Industries
SELECT industry, SUM(total_laid_off) AS total_layoffs
FROM layoffs_staging
GROUP BY industry
ORDER BY total_layoffs DESC;
Top Sectors Hit: Consumer Retail, Transportation, Tech Startups
3. Companies with 100% Layoffs
SELECT company, percentage_laid_off
FROM layoffs_staging
WHERE percentage_laid_off = 1;
These could indicate complete shutdowns.
4. Countries with the Most Layoffs
SELECT country, SUM(total_laid_off) AS total_layoffs
FROM layoffs_staging
GROUP BY country
ORDER BY total_layoffs DESC;
Finding: The US made up about 70% of all layoffs recorded in the dataset.
Final Thoughts
Data cleaning isn’t just a technical task; it’s a whole mindset. It’s about taking a look at that raw, messy data and uncovering the clean story that lies beneath. If you’re just starting out as a data analyst, diving into real datasets like this and going through the process of cleaning and prepping them is one of the best ways to sharpen your skills and build your portfolio.
Thanks for reading — and happy querying!


