📉 Telecom Customer Churn — End-to-End ML with Streamlit

August 2025 · Classification · Streamlit · SHAP · MLOps


🔍 Project Overview

This project delivers a production-style churn prediction workflow on the IBM Telco dataset. The pipeline is reproducible end to end and covers data acquisition, preprocessing and feature engineering, model selection, evaluation with cost–benefit analysis, SHAP-based interpretability, and a Streamlit app that offers batch scoring and an interactive “churn probability” calculator.

⚙️ Methodology

🏁 Results

Cross-validated performance (after leakage fixes):

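The app recommends an operating threshold by maximizing expected profit: profit = TP × margin − (TP + FP) × (incentive + outreach), where TP and FP are the true-positive and false-positive counts at a given threshold. Every flagged customer (TP + FP) incurs the incentive and outreach cost, while only true positives contribute margin.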
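
The threshold search can be sketched in a few lines of plain Python. This is a minimal illustration, not the app's actual code: the function names, the 0.01 to 0.99 grid, and the example values for margin, incentive, and outreach are all assumptions. It also inherits the formula's simplification that every correctly flagged churner yields the full margin.

```python
def expected_profit(y_true, y_prob, threshold, margin, incentive, outreach):
    """profit = TP * margin - (TP + FP) * (incentive + outreach)."""
    tp = fp = 0
    for label, prob in zip(y_true, y_prob):
        if prob >= threshold:          # customer is flagged for outreach
            if label == 1:
                tp += 1                # flagged and actually churned
            else:
                fp += 1                # flagged but would have stayed
    return tp * margin - (tp + fp) * (incentive + outreach)


def best_threshold(y_true, y_prob, margin, incentive, outreach, grid=None):
    """Return (threshold, profit) for the most profitable grid point.

    Ties go to the lowest threshold because max() keeps the first maximum.
    """
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]  # assumed grid: 0.01 .. 0.99
    scored = [
        (expected_profit(y_true, y_prob, t, margin, incentive, outreach), t)
        for t in grid
    ]
    profit, threshold = max(scored, key=lambda pair: pair[0])
    return threshold, profit


if __name__ == "__main__":
    # Toy labels and scores, purely illustrative.
    labels = [1, 1, 0, 0, 1, 0]
    scores = [0.9, 0.8, 0.7, 0.2, 0.6, 0.1]
    t, p = best_threshold(labels, scores, margin=100, incentive=20, outreach=5)
    print(f"recommended threshold={t:.2f}, expected profit={p}")
```

Scanning a fixed grid keeps the search easy to reason about; on held-out predictions it should be run per cross-validation fold or on a validation split so the chosen threshold is not tuned on training data.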

💡 App Features

🏗️ Repository & Architecture

🧪 Reproducibility

  1. Create and activate a venv (avoid mixing with Anaconda):
    python -m venv .venv
    source .venv/bin/activate
    python -m pip install --upgrade pip
  2. Install dependencies, pinning versions for compatibility:
    pip install -r requirements.txt
    pip install "altair<5" "pyarrow==16.1.0" openpyxl
  3. Download data with kagglehub and convert XLSX→CSV during ingest.
  4. Train:
    python -m src.models.train --csv "data/raw/Telco-Customer-Churn.csv"
  5. Run app (use the venv’s Python to avoid PATH issues):
    python -m streamlit run app/Home.py
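
Because steps 1 and 5 both depend on the venv's interpreter being the one that runs, a quick check like the following can confirm it before launching the app. It is a small sketch using only the standard library; the helper names are hypothetical and not part of the repository.

```python
import sys


def in_virtualenv() -> bool:
    """True when running inside a venv created with `python -m venv`.

    In a venv, sys.prefix points at the environment while sys.base_prefix
    points at the interpreter it was created from. A conda environment is
    not a venv, so the two values are equal there.
    """
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def describe_interpreter() -> str:
    """Summarize which Python is running, for pasting into a bug report."""
    kind = "venv" if in_virtualenv() else "system or conda"
    return f"{sys.executable} ({kind}, prefix={sys.prefix})"


if __name__ == "__main__":
    print(describe_interpreter())
```

If the printed path does not sit under `.venv`, the shell is resolving a different Python, which is the situation that `python -m streamlit run app/Home.py` is meant to avoid.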

🩺 Troubleshooting Log (what I hit & how I fixed it)

1) ImportError: attempted relative import with no known parent package
2) UnicodeDecodeError when reading CSV
3) Duplicate column names → AttributeError: 'DataFrame' object has no attribute 'dtype'
4) Perfect scores (PR-AUC/ROC-AUC ≈ 1.0) = leakage
5) Streamlit launched from Anaconda, not venv → ModuleNotFoundError: altair.vegalite.v4
6) pyarrow build failed (no cmake)
7) App “single-customer” calculator → columns are missing
8) OHE + NaN → TypeError: ufunc 'isnan' not supported
9) SHAP tab → columns are missing
10) SHAP showed “Feature 1, 2, …” instead of names
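
Items 2 and 3 can both be caught before the data ever reaches pandas. The sketch below is one possible diagnostic, not necessarily the fix used in this project: it tries a short list of encodings (an assumed list) until the whole file decodes, then reports repeated header names. Duplicates matter because selecting a duplicated column name from a DataFrame returns a DataFrame rather than a Series, which is why `.dtype` is missing.

```python
import csv
from collections import Counter

# Assumed candidates; utf-8-sig also reads plain UTF-8 and strips a BOM.
CANDIDATE_ENCODINGS = ("utf-8-sig", "cp1252", "latin-1")


def read_header(path, encodings=CANDIDATE_ENCODINGS):
    """Return (encoding, header) for the first encoding that decodes the file."""
    for enc in encodings:
        try:
            with open(path, newline="", encoding=enc) as fh:
                reader = csv.reader(fh)
                header = next(reader, [])
                for _ in reader:       # read to the end to force a full decode
                    pass
            return enc, header
        except UnicodeDecodeError:
            continue                   # try the next candidate encoding
    raise ValueError(f"none of {encodings} could decode {path}")


def duplicate_columns(header):
    """Return column names that appear more than once, ignoring outer spaces."""
    counts = Counter(name.strip() for name in header)
    return [name for name, n in counts.items() if n > 1]


if __name__ == "__main__":
    enc, header = read_header("data/raw/Telco-Customer-Churn.csv")
    print("encoding:", enc)
    print("duplicate columns:", duplicate_columns(header))
```

The detected encoding can then be passed to `pandas.read_csv(path, encoding=enc)`, and any reported duplicates renamed or dropped before preprocessing.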

📊 Key Takeaways

🔗 Repository

🔗 View Source Code on GitHub
📊 See Full Analysis

