August 2025 · Classification · Streamlit · SHAP · MLOps
This project delivers a production-style churn prediction workflow on the IBM Telco dataset. I built a fully reproducible pipeline spanning data acquisition, preprocessing & feature engineering, model selection, evaluation with cost–benefit analysis, interpretability with SHAP, and a Streamlit app for batch scoring and an interactive “churn probability” calculator.
Downloaded the dataset with kagglehub and converted the provided .xlsx file to CSV for training.
Normalized column names (e.g., Senior Citizen → SeniorCitizen; Tenure Months → tenure), removed duplicate columns, coerced numeric fields (e.g., TotalCharges), and imputed missing values.
Dropped label-leaking columns (ChurnScore, CustomerStatus, ChurnCategory, ChurnReason) and non-actionable extras (e.g., Country, State, City, Latitude/Longitude, CLTV).
Engineered features:
contract_length_months from Contract (Month-to-month/One year/Two year → 1/12/24)
is_electronic_check flag from PaymentMethod
has_tech_support from TechSupport
tenure_bucket (bins) and charges_per_tenure
Preprocessing (ColumnTransformer = One-Hot for categoricals + StandardScaler for numerics).
Cross-validated performance (after leakage fixes): candidate models were compared with 5-fold CV and the best was selected by PR-AUC.
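The engineered features and the preprocessing step listed above could be sketched roughly as follows. This is a minimal illustration, not the project's actual build_features.py: the function names add_features and build_preprocessor are hypothetical, the tenure bin edges are an assumption, and charges_per_tenure is assumed to mean TotalCharges divided by tenure (with tenure floored at 1 to avoid division by zero).

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Mapping stated in the write-up: contract type -> length in months.
CONTRACT_MONTHS = {"Month-to-month": 1, "One year": 12, "Two year": 24}


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: adds the engineered columns to a cleaned frame."""
    out = df.copy()
    out["contract_length_months"] = out["Contract"].map(CONTRACT_MONTHS)
    out["is_electronic_check"] = (out["PaymentMethod"] == "Electronic check").astype(int)
    out["has_tech_support"] = (out["TechSupport"] == "Yes").astype(int)
    # Assumed bin edges; the original bins are not given in the write-up.
    out["tenure_bucket"] = pd.cut(
        out["tenure"],
        bins=[-1, 12, 24, 48, np.inf],
        labels=["0-12", "13-24", "25-48", "49+"],
    ).astype(str)
    # Assumed definition: total charges per month of tenure (tenure floored at 1).
    out["charges_per_tenure"] = out["TotalCharges"] / out["tenure"].clip(lower=1)
    return out


def build_preprocessor(categorical: list, numeric: list) -> ColumnTransformer:
    """One-hot encode categoricals and standardize numerics."""
    return ColumnTransformer(
        [
            ("cat", OneHotEncoder(handle_unknown="ignore"), categorical),
            ("num", StandardScaler(), numeric),
        ]
    )
```

Setting handle_unknown="ignore" is a design choice in this sketch: categories never seen in training encode as all zeros instead of raising at inference time.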
The app computes a recommended operating threshold by maximizing expected profit: profit = TP × margin − (TP + FP) × (incentive + outreach).
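As a rough illustration of the profit formula above, the threshold search can be written as a sweep over candidate cut-offs. The function names, the threshold grid, and the example numbers are assumptions made for this sketch; only the profit expression itself comes from the text.

```python
import numpy as np


def expected_profit(y_true, y_prob, threshold, margin, incentive, outreach):
    """profit = TP * margin - (TP + FP) * (incentive + outreach)."""
    y_true = np.asarray(y_true).astype(bool)
    flagged = np.asarray(y_prob) >= threshold  # customers we would contact
    tp = np.sum(flagged & y_true)    # contacted churners
    fp = np.sum(flagged & ~y_true)   # contacted non-churners
    return tp * margin - (tp + fp) * (incentive + outreach)


def best_threshold(y_true, y_prob, margin, incentive, outreach, grid=None):
    """Return (threshold, profit) maximizing expected profit over a grid."""
    # Assumed grid: 0.05 to 0.95 in steps of 0.05.
    grid = np.linspace(0.05, 0.95, 19) if grid is None else np.asarray(grid)
    profits = np.array(
        [expected_profit(y_true, y_prob, t, margin, incentive, outreach) for t in grid]
    )
    i = int(np.argmax(profits))  # first maximum wins on ties
    return float(grid[i]), float(profits[i])


# Toy example with made-up labels, probabilities, and costs.
y_true = [1, 1, 0, 0]
y_prob = [0.90, 0.62, 0.52, 0.12]
threshold, profit = best_threshold(y_true, y_prob, margin=100, incentive=20, outreach=5)
```

Because every contacted customer incurs the incentive and outreach cost, lowering the threshold only pays off while the extra true positives outweigh the extra contacts.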
Batch scoring in the app (including uploads where Churn is present), and “download scored CSV”.
SHAP explanations, with feature names taken from pre.get_feature_names_out().
src/features/build_features.py: normalization, leakage drops, feature engineering, and preprocessing builder.
src/models/train.py: 5-fold CV comparison, best-by-PR-AUC selection, model persistence with joblib.
src/models/predict.py: production-safe prediction that normalizes inputs, adds any missing columns (Unknown/0), and sanitizes OHE categories.
app/Home.py: Streamlit UI (batch, calculator, SHAP) using the saved models/model.pkl.
python -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
pip install -r requirements.txt
pip install "altair<5" "pyarrow==16.1.0" openpyxl
Use kagglehub and convert XLSX→CSV during ingest.
python -m src.models.train --csv "data/raw/Telco-Customer-Churn.csv"
python -m streamlit run app/Home.py
ImportError: attempted relative import with no known parent package
Cause: running python from src/.
Fix: run the script as a module: python -m src.models.train --csv "data/raw/..."
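UnicodeDecodeError when reading CSV
Cause: the file was .xlsx or non-UTF8.
Fix: read Excel files with openpyxl and CSV with fallback encodings.
AttributeError: 'DataFrame' object has no attribute 'dtype'
Cause: duplicate column names (df[c] returned a DataFrame).
Fix: suffix duplicates (__dup1, __dup2) during normalization.
Unrealistically high metrics
Cause: post-outcome columns (ChurnScore/CustomerStatus/etc.) leaked the label.
Fix: dropped them and kept only Churn as the label; retrained to realistic metrics.
ModuleNotFoundError: altair.vegalite.v4
Cause: Streamlit was launched from /opt/anaconda3/bin/streamlit rather than the venv.
Fix:
source .venv/bin/activate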
python -m pip install "altair<5"
python -m streamlit run app/Home.py
Verify with which streamlit and python -c "import sys,streamlit,altair; ...".
pyarrow build failed (no cmake)
Fix:
pip install --upgrade pip setuptools wheel
pip uninstall -y pyarrow
pip install "pyarrow==16.1.0" --only-binary=:all:
columns are missing
Fix: in predict.py, auto-add missing categoricals as “Unknown” and numerics as 0.
TypeError: ufunc 'isnan' not supported
Cause: mixed types in the encoder's categories_.
Fix: cast categoricals to str dtype; additionally sanitized the fitted encoder’s categories_ at inference.
columns are missing (on uploaded data)
Fix: ran clean_columns(...) on the uploaded data and ensured expected columns existed before pre.transform.
SHAP feature names: used pre.get_feature_names_out() and wrapped Xt into a named DataFrame.
Normalize column names consistently (clean_columns everywhere).
Pin fragile dependencies (altair<5, pyarrow==16.1.0) and always launch Streamlit from your venv.