๐ Cleaning Layoff Data with SQL
May 2024 ยท MySQL ยท Data Cleansing ยท Feature Engineering
๐ Project Overview
This project focused on cleaning and preparing layoff data from various sources using SQL. The raw data contained inconsistencies, missing values, and formatting issues that needed to be addressed before any meaningful analysis could be performed. I used MySQL to clean, standardise, and enrich the dataset with new features to make it more analysis-ready.
โ๏ธ Methodology
- Importing CSVs: Used
LOAD DATA INFILE to import large CSVs efficiently into staging tables.
- Data Cleaning:
- Removed duplicates using
ROW_NUMBER() over partitions.
- Normalised inconsistent company names and date formats.
- Filled missing values in categorical fields using contextual replacements (e.g. using mode).
- Feature Engineering:
- Created new fields such as
layoff_size_category,
year, and calculated rolling layoffs by year or sector.
- Used
CASE statements for classification buckets.
- Export: Final cleaned and enriched data was exported for use in Python and Tableau analysis.
๐ก Errors & Fixes
- CSV encoding issues (โใโ and โโโ characters) โ resolved using
SET NAMES utf8mb4 before import.
- Duplicate records due to company name variations โ resolved using fuzzy matching + SQL string functions.
- Missing fields caused JOIN failures โ added fallback logic and filtered NULLs in subqueries.
๐ Key Takeaways
- SQL is a powerful tool not just for querying but for full-scale data wrangling and enrichment.
- Well-designed schema staging helps isolate raw vs processed data and improve maintainability.
- Even simple datasets need thoughtful preprocessing to be useful for downstream analysis or dashboards.
๐ Repository
GitHub Repo: Layoff Data Cleansing
โ Back to Blog