Cleaning and Exploring Data with MSSQL
- Subasini Clattus
- Sep 2
- 5 min read
In this blog, I will share how I approached the dataset from data cleaning to exploration in detail. Before we begin, lets know about the dataset. Electric Vehicle Population data is about the number of electric vehicles in specific areas, along with its specification.
It has the following columns:
Vehicle Identification: VIN (partial), Make, Model Year
Geographic Data: County, City, State, Postal Code, Legislative District
Vehicle Specifications: Electric Vehicle Type, Electric Range, Base MSRP
Regulatory Information: CAFV Eligibility, DOL Vehicle ID
Infrastructure Data: Electric Utility, Vehicle Location, Census Tract
Data cleaning
Data cleaning is the key step in the analysis process. Without dependable data, the analysis will be unreliable and lead to making poor decisions. Data cleaning is the process of detecting, removing, updating inaccurate, inconsistent, incomplete data in a dataset. It ensures that the dataset is consistent and ready to use in analysis and to make informed decisions.
Data Cleaning and Preparation:
Initial Checks: In order to understand the data structure, I reviewed the column names and its data type. This often reveals immediate errors or any formatting issues.
-- View data types of all columns
SELECT
COLUMN_NAME,
DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'ev_population_staging';
Work with a Temporary copy (Staging Table): I created a working or staging copy of dataset before making any changes. This ensures the raw data is untouched and can present as reference or a roll back if anything goes wrong.
-- Create a temporary (staging) table
SELECT *
INTO ev_population_staging
FROM portfolio_projects..EV_Population;Standardization: It makes the dataset consistent and easy to manage.
Column naming: It is best practice to rename columns to use lowercase letters and replace spaces with underscore for standardization.
-- Converting column names to consistent standard format
EXEC sp_rename 'ev_population_staging.[Postal Code]', 'postal_code';
EXEC sp_rename 'ev_population_staging.[Model Year]', 'model_year';
EXEC sp_rename 'ev_population_staging.[Electric Range]', 'electric_range';Removing Duplicates and Invalid Records: Duplicates records can change the analysis counts, averages, and other metrics. So, identifying and removing them while keeping the most reliable and accurate record.
--Identified Duplicates and deleting it
WITH cte_duplicate AS
(
SELECT *,ROW_NUMBER() OVER(PARTITION BY dol_vehicle_id ,vin ORDER BY model_year DESC ) AS row_num
FROM ev_population_staging
)
DELETE
FROM cte_duplicate
WHERE row_num > 1Handling Missing Values: Managing missing data properly ensures completeness and reliability.
-- Count NULL or empty values per text column for a quick quality assessment
DECLARE @table_name NVARCHAR(100) = 'ev_population_staging';
DECLARE @sql NVARCHAR(MAX) = '';
SELECT @sql = STRING_AGG(
'COUNT(CASE WHEN [' + COLUMN_NAME + '] IS NULL OR [' + COLUMN_NAME + '] = '''' THEN 1 END) AS [' + COLUMN_NAME + '_null_or_empty]',
', '
)
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = @table_name
AND DATA_TYPE IN ('varchar', 'nvarchar', 'char', 'nchar', 'text');
SET @sql = 'SELECT ' + @sql + ' FROM ' + QUOTENAME(@table_name) + ';';
EXEC sp_executesql @sql; 
Data Exploration
Data Exploration is the basis of meaningful analysis, where it transforms raw data into informed decisions. It involves descriptive statistics, visualization and creating hypotheses.
I started the EV registration data exploration by defining set of questions to guide the analysis. The following are the questions and their respective analysis insights.
Overall EV Market Overview
-- What is the total number of electric vehicles, manufactures, model and year?
SELECT
COUNT(*) AS total_vehicles,
COUNT(DISTINCT Make) AS unique_manufacturers,
COUNT(DISTINCT Model) AS unique_models,
COUNT(DISTINCT County) AS total_counties,
COUNT(DISTINCT County) AS total_counties,
MIN(Model_Year) AS earliest_year,
MAX(Model_Year) AS latest_year
FROM ev_population_staging;
-- Which electric vehicle models are most popular?
SELECT
Make,
Model,
COUNT(*) AS registrations,
CAST(AVG(Electric_Range) AS DECIMAL(10,2)) AS avg_range,
AVG(Base_MSRP) AS avg_price,
MIN(Model_Year) AS first_year,
MAX(Model_Year) AS latest_year,
Electric_Vehicle_Type AS ev_type
FROM ev_population_staging
WHERE Make IS NOT NULL AND Model IS NOT NULL
GROUP BY Make, Model, Electric_Vehicle_Type
HAVING COUNT(*) >= 50
ORDER BY registrations DESC;
EV Types and Adoption
--What is the overall distribution of Electric Vehicle types?
SELECT
Electric_Vehicle_Type,
COUNT(*) AS vehicle_count,
CAST(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM ev_population_staging)AS
DECIMAL(10,2)) AS percentage,
CAST(AVG(Electric_Range)AS DECIMAL(10,2)) AS avg_range
FROM ev_population_staging
WHERE Electric_Vehicle_Type IS NOT NULL
GROUP BY Electric_Vehicle_Type
ORDER BY vehicle_count DESC;
--Which states and counties have the highest EV adoption rates?
SELECT TOP 10 state ,county,
COUNT(*) AS ev_count,
CAST(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM ev_population_staging) AS
DECIMAL(10,2)) AS percentage
FROM ev_population_staging
WHERE State IS NOT NULL AND County IS NOT NULL
GROUP BY state,county
ORDER BY count(*) DESC;
Market Trends and Evolution
--How has the EV market evolved over the years?
SELECT
Model_Year,
COUNT(*) AS vehicles_registered,
COUNT(DISTINCT Make) AS number_of_manufacturers
FROM ev_population_staging
WHERE Model_Year BETWEEN 2015 AND 2026
GROUP BY Model_Year
ORDER BY Model_Year;
-- What are the newest model year?
SELECT
Model_Year,
COUNT(*) AS vehicle_count,
COUNT(DISTINCT Make) AS manufacturer_count,
CAST(AVG(Electric_Range) AS DECIMAL(10,2)) AS avg_range
FROM ev_population_staging
WHERE Model_Year IN (2025, 2026)
GROUP BY Model_Year
ORDER BY Model_Year DESC;
Vehicle Characteristics and Pricing
--What is the relationship between vehicle price and electric range?
SELECT
CASE
WHEN Base_MSRP < 30000 THEN 'Budget (<$30K)'
WHEN Base_MSRP < 50000 THEN 'Mid-range ($30K-$50K)'
WHEN Base_MSRP < 70000 THEN 'Premium ($50K-$70K)'
WHEN Base_MSRP < 100000 THEN 'Luxury ($70K-$100K)'
ELSE 'Ultra-luxury ($100K+)'
END as price_category,
COUNT(*) AS vehicle_count,
CAST(AVG(Electric_Range) AS DECIMAL(10,2))AS avg_range,
MIN(Electric_Range) AS min_range,
MAX(Electric_Range) AS max_range,
CAST( AVG(Base_MSRP)AS DECIMAL(10,2)) AS avg_price
FROM ev_population_staging
WHERE Base_MSRP > 0 AND Electric_Range > 0
GROUP BY
CASE
WHEN Base_MSRP < 30000 THEN 'Budget (<$30K)'
WHEN Base_MSRP < 50000 THEN 'Mid-range ($30K-$50K)'
WHEN Base_MSRP < 70000 THEN 'Premium ($50K-$70K)'
WHEN Base_MSRP < 100000 THEN 'Luxury ($70K-$100K)'
ELSE 'Ultra-luxury ($100K+)'
END
ORDER BY AVG(Base_MSRP);
Top Electric Vehicle Manufacturers
--Which are the top 5 electric vehicle manufacturers by registration count?
SELECT TOP 10
Make,
COUNT(*) as vehicle_count,
CAST(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM ev_population_staging)AS
DECIMAL(10,2)) as market_share_percent
FROM ev_population_staging
WHERE Make IS NOT NULL
GROUP BY Make
ORDER BY vehicle_count DESC;
EV service by utilities
-- Which electric utilities serve the most EV customers?
SELECT TOP 10
Electric_Utility,
COUNT(*) AS ev_customers,
COUNT(DISTINCT State) AS states_served,
COUNT(DISTINCT County) AS counties_served,
CAST(AVG(Electric_Range) AS DECIMAL(10,2))AS avg_vehicle_range,
CAST(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM ev_population_staging WHERE
Electric_Utility IS NOT NULL) AS DECIMAL(10,2)) AS percentage
FROM ev_population_staging
WHERE Electric_Utility IS NOT NULL
GROUP BY Electric_Utility
HAVING COUNT(*) >= 100
ORDER BY ev_customers DESC;
CAFV eligibility impact
-- How does CAFV eligibility affect vehicle characteristics?
SELECT cafv_eligibility,
COUNT(*) AS vehicle_count,
CAST(COUNT(*) * 100.0 / (SELECT COUNT(*) FROM ev_population_staging) AS
DECIMAL(10,2)) AS percentage,
CAST(AVG(Electric_Range) AS DECIMAL(10,2)) AS avg_range ,
CAST(AVG(Base_MSRP)AS DECIMAL(10,2)) AS avg_price,
COUNT(DISTINCT Make) AS manufacturer_count
FROM ev_population_staging
WHERE cafv_eligibility IS NOT NULL
GROUP BY cafv_eligibility
ORDER BY vehicle_count DESC;
Key Insights:
Based on the above analysis,
There are 247344 electric vehicles that are registered across 49 states and 215 countries.
The vehicles registration year is between 2000 and 2026.
46 manufacturers and 177 unique models are registered, with Tesla leading the market at over 42% share.
Tesla manufactured 3 of the 4 most registered EV models, highlighting superiority.
Battery Electric Vehicles (BEVs) is 79.71% of overall distribution of EVs and imply that it offer better range than Plug-in Hybrids (PHEVs).
King County, Washington, has the highest EV count of over 12000, nearly 50% of the state’s EVs.
A high rise in EV registrations occurred between 2021 and 2023; 2026 data appear incomplete or early-stage.
Premium and ultra-luxury EVs give higher average driving ranges, correspond with price.
More than 60% of vehicles need range data, making CAFÉ eligibility not suitable for analysis.
Puget Sound Energy and the City of Seattle are the top utility providers supporting large EV customers.
Data quality issues—like missing range and pricing info, or suspicious figures (e.g., Tesla Model Y range)—limit analysis accuracy.
Conclusion
Data Cleaning and Exploration helps to make sure the data you are working with is accurate, consistent, and ready for further analysis. Taking time to do this makes the data more dependable and insights will be more helpful to make good decisions.


