📉 Telecom Customer Churn — End-to-End ML with Streamlit

August 2025 · Classification · Streamlit · SHAP · MLOps

🔍 Project Overview

This project delivers a production-style churn prediction workflow on the IBM Telco dataset. I built a fully reproducible pipeline spanning data acquisition, preprocessing & feature engineering, model selection, evaluation with cost–benefit analysis, interpretability with SHAP, and a Streamlit app for batch scoring and an interactive “churn probability” calculator.

⚙️ Methodology

Data acquisition: Pulled the dataset from Kaggle using kagglehub and converted the provided .xlsx file to CSV for training.
Data cleaning & normalization: Standardized headers (e.g., Senior Citizen → SeniorCitizen; Tenure Months → tenure), removed duplicate columns, coerced numeric fields (e.g., TotalCharges), and imputed missing values.
Leakage control: Dropped target-derived columns (e.g., ChurnScore, CustomerStatus, ChurnCategory, ChurnReason) and non-actionable extras (e.g., Country, State, City, Latitude/Longitude, CLTV).
Feature engineering:
- contract_length_months from Contract (Month-to-month/One year/Two year → 1/12/24)
- is_electronic_check flag from PaymentMethod
- has_tech_support from TechSupport
- tenure_bucket (bins) and charges_per_tenure
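
The engineered features listed above could be sketched roughly as follows. This is an illustration rather than the code in build_features.py: the helper name `add_features`, the tenure bin edges, and the definition of charges_per_tenure as TotalCharges divided by tenure are all assumptions.

```python
import pandas as pd

# Mapping taken from the list above; unseen contract labels would become NaN.
CONTRACT_MONTHS = {"Month-to-month": 1, "One year": 12, "Two year": 24}


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: return a copy of df with the engineered columns."""
    out = df.copy()
    out["contract_length_months"] = out["Contract"].map(CONTRACT_MONTHS)
    out["is_electronic_check"] = (out["PaymentMethod"] == "Electronic check").astype(int)
    out["has_tech_support"] = (out["TechSupport"] == "Yes").astype(int)
    # Assumed bin edges in months; the project's actual bins are not stated.
    out["tenure_bucket"] = pd.cut(
        out["tenure"],
        bins=[-1, 12, 24, 48, float("inf")],
        labels=["0-12", "13-24", "25-48", "49+"],
    ).astype(str)
    # Assumed definition: total spend per month of tenure, guarding against tenure == 0.
    out["charges_per_tenure"] = out["TotalCharges"] / out["tenure"].clip(lower=1)
    return out
```
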
Modeling: Compared Logistic Regression, Random Forest, and XGBoost with a shared preprocessing pipeline (ColumnTransformer = One-Hot for categoricals + StandardScaler for numerics).
Evaluation: 5-fold CV with ROC-AUC & PR-AUC; profit-optimized threshold using margin, incentive, and outreach costs.
Deployment: Streamlit app with three tabs—Batch scoring & metrics, Single-customer calculator, and SHAP explainability.

🏁 Results

Cross-validated performance (after leakage fixes):

Logistic Regression — PR-AUC ≈ 0.625, ROC-AUC ≈ 0.837
Random Forest — PR-AUC ≈ 0.660, ROC-AUC ≈ 0.852
XGBoost (selected) — PR-AUC ≈ 0.662, ROC-AUC ≈ 0.851

The app computes a recommended operating threshold by maximizing expected profit: profit = TP × margin − (TP + FP) × (incentive + outreach).
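
A minimal sketch of that threshold search, assuming a simple sweep over the observed scores. The function name `best_profit_threshold` and the toy numbers are hypothetical. Like the formula, it treats every contacted true churner as retained.

```python
import numpy as np


def best_profit_threshold(y_true, y_prob, margin, incentive, outreach):
    """Return (threshold, profit) maximizing TP*margin - (TP+FP)*(incentive+outreach)."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    # Baseline: a threshold above every score, i.e. contact nobody for zero profit.
    best_t, best_profit = float("inf"), 0.0
    for t in np.unique(y_prob):
        contacted = y_prob >= t
        tp = int(np.sum(contacted & (y_true == 1)))
        fp = int(np.sum(contacted & (y_true == 0)))
        profit = tp * margin - (tp + fp) * (incentive + outreach)
        if profit > best_profit:
            best_t, best_profit = float(t), float(profit)
    return best_t, best_profit


# Toy example: six customers, margin 100, incentive 30, outreach 10.
threshold, profit = best_profit_threshold(
    y_true=[1, 1, 0, 1, 0, 0],
    y_prob=[0.9, 0.8, 0.6, 0.5, 0.3, 0.1],
    margin=100,
    incentive=30,
    outreach=10,
)
```

If no threshold is profitable, the function returns an infinite threshold with zero profit, which corresponds to running no campaign at all.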

💡 App Features

Batch scoring: Upload CSV/XLSX → predictions, PR-AUC/ROC-AUC (if Churn present), and “download scored CSV”.
Probability calculator: Enter customer attributes → instant churn probability from the saved model.
SHAP explainability: Global beeswarm plot with correct feature names using pre.get_feature_names_out().

🏗️ Repository & Architecture

src/features/build_features.py — normalization, leakage drops, feature engineering, and preprocessing builder.
src/models/train.py — 5-fold CV comparison, best-by-PR-AUC selection, model persistence with joblib.
src/models/predict.py — production-safe prediction: normalizes inputs, adds any missing columns (Unknown/0), and sanitizes OHE categories.
app/Home.py — Streamlit UI (batch, calculator, SHAP) using the saved models/model.pkl.

🧪 Reproducibility

Create & use venv (avoid mixing with Anaconda)

```
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
```

Install dependencies (pin compatibility):

```
pip install -r requirements.txt
pip install "altair<5" "pyarrow==16.1.0" openpyxl
```

Download data with kagglehub and convert XLSX→CSV during ingest.

Train:

```
python -m src.models.train --csv "data/raw/Telco-Customer-Churn.csv"
```

Run app (use the venv’s Python to avoid PATH issues):
```
python -m streamlit run app/Home.py
```

🩺 Troubleshooting Log (what I hit & how I fixed it)

1) `ImportError: attempted relative import with no known parent package`

Cause: Ran the module file directly with plain `python` from inside src/, so Python had no parent package against which to resolve the relative imports.

Fix: Run as a module from repo root:

```
python -m src.models.train --csv "data/raw/..."
```

2) `UnicodeDecodeError` when reading CSV

Cause: The input was actually an .xlsx file, or a CSV that was not UTF-8 encoded.
Fix: Added a read_any() helper to support XLSX via openpyxl and CSV with fallback encodings.
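
A helper along these lines might look like the sketch below. Only the name `read_any` comes from the project; the signature, the list of fallback encodings, and the error handling are assumptions.

```python
from pathlib import Path

import pandas as pd


def read_any(path, encodings=("utf-8-sig", "cp1252", "latin-1")):
    """Sketch of a loader: Excel by extension, otherwise CSV with encoding fallbacks."""
    path = Path(path)
    if path.suffix.lower() == ".xlsx":
        # Requires the openpyxl package to be installed.
        return pd.read_excel(path, engine="openpyxl")
    last_error = None
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError as err:
            last_error = err  # fall through to the next encoding
    raise ValueError(f"Could not decode {path} with any of {encodings}") from last_error
```

latin-1 maps every byte to a character, so it never raises and works only as a last resort: the file will load, but accented characters may come out wrong if the true encoding was something else.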

3) Duplicate column names → `AttributeError: 'DataFrame' object has no attribute 'dtype'`

Cause: Normalization mapped two source headers to the same name, so df[c] returned a DataFrame (which has dtypes but no dtype) instead of a Series.
Fix: Ensured unique names (suffix __dup1, __dup2) during normalization.
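
The de-duplication could be sketched as below. The helper name `make_unique` is hypothetical, and the sketch does not handle the corner case where a generated name such as `tenure__dup1` already exists in the input.

```python
import pandas as pd


def make_unique(names):
    """Append __dup1, __dup2, ... to repeated names, keeping the first occurrence as-is."""
    seen = {}
    unique = []
    for name in names:
        count = seen.get(name, 0)
        unique.append(name if count == 0 else f"{name}__dup{count}")
        seen[name] = count + 1
    return unique


# Two headers collapsed to "tenure" during normalization.
df = pd.DataFrame([[1, 2, 3]], columns=["tenure", "tenure", "Churn"])
df.columns = make_unique(df.columns)
```

After renaming, `df["tenure"]` is a Series again, so dtype checks behave as expected.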

4) Perfect scores (PR-AUC/ROC-AUC ≈ 1.0) = leakage

Cause: Churn-derived columns (ChurnScore/CustomerStatus/etc.) leaked the label.
Fix: Dropped all churn-related fields except canonical target Churn; retrained to realistic metrics.

5) Streamlit launched from Anaconda, not venv → `ModuleNotFoundError: altair.vegalite.v4`

Cause: PATH pointed to /opt/anaconda3/bin/streamlit.
Fix: Run via venv explicitly and pin Altair v4:
```
source .venv/bin/activate
python -m pip install "altair<5"
python -m streamlit run app/Home.py
```
Verified with `which streamlit` and `python -c "import sys,streamlit,altair; ..."`

6) `pyarrow` build failed (no `cmake`)

Fix: Install a prebuilt wheel and keep pins stable:

```
pip install --upgrade pip setuptools wheel
pip uninstall -y pyarrow
pip install "pyarrow==16.1.0" --only-binary=:all:
```

7) App “single-customer” calculator → `columns are missing`

Cause: Model expected columns not present in the form (e.g., geo fields, engineered features).
Fix A (preferred): Drop non-actionable fields during cleaning so the model never expects them.
Fix B (quick): In predict.py, auto-add missing categoricals as “Unknown” and numerics as 0.
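
Fix B might be sketched as follows. The helper name `add_missing_columns` and the way the expected column lists are passed in are assumptions; in practice they would come from whatever the fitted preprocessor was trained on.

```python
import pandas as pd


def add_missing_columns(df, cat_cols, num_cols):
    """Hypothetical helper: make sure every column the model expects exists."""
    out = df.copy()
    for col in cat_cols:
        if col not in out.columns:
            out[col] = "Unknown"  # placeholder category
    for col in num_cols:
        if col not in out.columns:
            out[col] = 0  # placeholder numeric value
    return out


# The calculator form only collects some of the expected fields.
form = pd.DataFrame({"Contract": ["One year"], "tenure": [12]})
full = add_missing_columns(
    form,
    cat_cols=["Contract", "PaymentMethod"],
    num_cols=["tenure", "MonthlyCharges"],
)
```

The placeholders keep the pipeline from failing, but a 0 or "Unknown" is still an input the model reacts to, which is why dropping unused fields at training time is the preferred fix.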

8) OHE + NaN → `TypeError: ufunc 'isnan' not supported`

Cause: OneHotEncoder was fitted with NaNs inside categories_.
Fix: Filled missing categoricals with “Unknown” before encoding and enforced str dtype; additionally sanitized the fitted encoder’s categories_ at inference.
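
The pre-encoding cleanup could look like this sketch (the helper name `clean_categoricals` is hypothetical). The order matters: filling before casting avoids turning NaN into the literal string "nan".

```python
import numpy as np
import pandas as pd


def clean_categoricals(df, cat_cols):
    """Fill missing categoricals with 'Unknown', then force a string dtype."""
    out = df.copy()
    for col in cat_cols:
        out[col] = out[col].fillna("Unknown").astype(str)
    return out


raw = pd.DataFrame({"TechSupport": ["Yes", None, np.nan], "tenure": [1, 2, 3]})
clean = clean_categoricals(raw, ["TechSupport"])
```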

9) SHAP tab → `columns are missing`

Cause: Sent raw upload directly to the preprocessor.
Fix: Ran clean_columns(...) on the uploaded data and ensured expected columns existed before pre.transform.

10) SHAP showed “Feature 1, 2, …” instead of names

Cause: Passed a NumPy array without feature names to SHAP.
Fix: Built names with pre.get_feature_names_out() and wrapped Xt into a named DataFrame.

📊 Key Takeaways

“Perfect” scores usually mean leakage—systematically audit features before celebrating.
Keep preprocessing identical between training and inference (same clean_columns everywhere).
Pin tricky packages (altair<5, pyarrow==16.1.0) and always launch Streamlit from your venv.
Design UIs around actionable features; drop geography/CLTV unless you truly need them.

🔗 Repository

🔗 View Source Code on GitHub
📊 See Full Analysis
