📊 Cleaning Layoff Data with SQL

May 2024 · MySQL · Data Cleansing · Feature Engineering

🔍 Project Overview

This project focused on cleaning and preparing layoff data from various sources using SQL. The raw data contained inconsistencies, missing values, and formatting issues that needed to be addressed before any meaningful analysis could be performed. I used MySQL to clean, standardise, and enrich the dataset with new features to make it more analysis-ready.

⚙️ Methodology

Importing CSVs: Used LOAD DATA INFILE to import large CSVs efficiently into staging tables.
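
As a rough illustration of the import step, the sketch below loads one CSV into a text-only staging table. The table name layoffs_raw, its columns, and the file path are assumptions made for this example and are not taken from the project's actual schema.

```sql
-- Hypothetical staging table: columns stay as text so that malformed
-- numbers and dates survive the import and can be cleaned afterwards.
CREATE TABLE IF NOT EXISTS layoffs_raw (
    company        VARCHAR(255),
    industry       VARCHAR(255),
    country        VARCHAR(255),
    total_laid_off VARCHAR(32),
    layoff_date    VARCHAR(32)
) DEFAULT CHARSET = utf8mb4;

-- Placeholder path: the server must be permitted to read this location
-- (see the secure_file_priv system variable).
LOAD DATA INFILE '/var/lib/mysql-files/layoffs.csv'
INTO TABLE layoffs_raw
CHARACTER SET utf8mb4                      -- how the bytes in the file are decoded
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES                             -- skip the header row
(company, industry, country, total_laid_off, layoff_date);
```

Loading into an untyped staging table first keeps the raw rows intact, so every later cleaning step can be re-run from the same starting point.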
Data Cleaning:
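- Removed duplicates by numbering rows with ROW_NUMBER() over a partition of the identifying columns and keeping only the first row in each partition.
- Normalised inconsistent company names and date formats.
- Filled missing values in categorical fields using contextual replacements (e.g. the most frequent value, or mode, for that field).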
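
The cleaning steps above could look roughly like the following in MySQL 8.0 or later (window functions are required). This is a sketch under assumptions: the tables layoffs_raw and layoffs_staging, the column names, and the month/day/year input date format are all hypothetical.

```sql
-- Build a cleaned copy: trim text, parse dates, then keep one row per
-- group of identical values.
CREATE TABLE layoffs_staging AS
SELECT company, industry, country, total_laid_off, layoff_date
FROM (
    SELECT cleaned.*,
           ROW_NUMBER() OVER (
               PARTITION BY company, industry, country, total_laid_off, layoff_date
           ) AS row_num
    FROM (
        SELECT TRIM(company)  AS company,
               NULLIF(TRIM(industry), '') AS industry,   -- empty string becomes NULL
               TRIM(country)  AS country,
               CAST(NULLIF(TRIM(total_laid_off), '') AS UNSIGNED) AS total_laid_off,
               STR_TO_DATE(layoff_date, '%m/%d/%Y') AS layoff_date  -- assumed input format
        FROM layoffs_raw
    ) AS cleaned
) AS numbered
WHERE row_num = 1;   -- rows numbered 2 and above are duplicates

-- Fill a missing industry with the most frequent industry recorded for
-- the same company (ties are broken arbitrarily).
UPDATE layoffs_staging AS t
JOIN (
    SELECT company, industry
    FROM (
        SELECT company, industry,
               ROW_NUMBER() OVER (PARTITION BY company ORDER BY COUNT(*) DESC) AS rn
        FROM layoffs_staging
        WHERE industry IS NOT NULL
        GROUP BY company, industry
    ) AS ranked
    WHERE rn = 1
) AS m ON m.company = t.company
SET t.industry = m.industry
WHERE t.industry IS NULL;
```

Trimming and parsing before the ROW_NUMBER() step matters: rows that differ only in stray whitespace or date formatting would otherwise land in different partitions and survive as duplicates.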
Feature Engineering:
- Created new fields such as layoff_size_category and year, and calculated rolling layoff totals by year or sector.
- Used CASE statements for classification buckets.
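
A minimal sketch of these features follows, assuming a cleaned table named layoffs_staging with a numeric total_laid_off column and a DATE column layoff_date. The bucket thresholds (100 and 1000) and the column name layoff_year are illustrative choices, not the values used in the project.

```sql
ALTER TABLE layoffs_staging
    ADD COLUMN layoff_year SMALLINT,
    ADD COLUMN layoff_size_category VARCHAR(16);

-- Derive the year and classify each event with a CASE expression.
UPDATE layoffs_staging
SET layoff_year = YEAR(layoff_date),
    layoff_size_category = CASE
        WHEN total_laid_off IS NULL THEN 'unknown'
        WHEN total_laid_off < 100   THEN 'small'    -- assumed threshold
        WHEN total_laid_off < 1000  THEN 'medium'   -- assumed threshold
        ELSE 'large'
    END;

-- Running (cumulative) total of layoffs per sector, year by year.
SELECT industry,
       layoff_year,
       yearly_total,
       SUM(yearly_total) OVER (
           PARTITION BY industry
           ORDER BY layoff_year
       ) AS rolling_total
FROM (
    SELECT industry, layoff_year, SUM(total_laid_off) AS yearly_total
    FROM layoffs_staging
    GROUP BY industry, layoff_year
) AS per_year
ORDER BY industry, layoff_year;
```

The first WHEN branch handles NULL explicitly, because a NULL count fails every comparison and would otherwise fall through to the ELSE branch and be labelled 'large'.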
Export: Final cleaned and enriched data was exported for use in Python and Tableau analysis.
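
One possible way to produce the export file from inside MySQL is SELECT ... INTO OUTFILE, sketched below. The write-up does not say how the export was actually done, and the table name, columns, and output path here are assumptions.

```sql
-- Writes a CSV on the database server. The target file must not already
-- exist and the directory must be allowed by secure_file_priv.
SELECT company, industry, country, total_laid_off,
       layoff_date, layoff_year, layoff_size_category
FROM layoffs_staging
INTO OUTFILE '/var/lib/mysql-files/layoffs_clean.csv'
CHARACTER SET utf8mb4
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n';
```

INTO OUTFILE writes no header row and represents NULL as \N, so the reading side in Python or Tableau needs to supply column names and treat \N as missing.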

💡 Errors & Fixes

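CSV encoding issues (“〜” and “–” characters) – resolved using SET NAMES utf8mb4 before import. Note that LOAD DATA INFILE decodes the file using character_set_database unless the statement carries its own CHARACTER SET clause, so adding CHARACTER SET utf8mb4 to the statement is the more dependable fix.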
Duplicate records due to company name variations – resolved using fuzzy matching combined with SQL string functions.
Missing fields caused JOIN failures – added fallback logic and filtered NULLs in subqueries.
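
The NULL handling described above might be sketched as follows. In SQL, NULL = NULL does not evaluate to true, so rows with a missing join key silently drop out of an inner join. The table and column names are assumptions, and 'Unknown' is just an example fallback label.

```sql
SELECT s.company,
       COALESCE(s.industry, 'Unknown') AS industry,   -- fallback for display
       t.sector_total
FROM layoffs_staging AS s
LEFT JOIN (                                -- LEFT JOIN keeps rows with no match
    SELECT industry, SUM(total_laid_off) AS sector_total
    FROM layoffs_staging
    WHERE industry IS NOT NULL             -- keep NULL keys out of the subquery
    GROUP BY industry
) AS t ON t.industry = s.industry;
```

Rows with a missing industry still appear in the result, with 'Unknown' as the label and a NULL sector_total, instead of vanishing from the output.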

📈 Key Takeaways

SQL is a powerful tool not just for querying but for full-scale data wrangling and enrichment.
A well-designed staging schema keeps raw and processed data separate and makes the pipeline easier to maintain.
Even simple datasets need thoughtful preprocessing to be useful for downstream analysis or dashboards.

🔗 Repository

GitHub Repo: Layoff Data Cleansing
