Predicting Employee Churn
Project Context: Employee turnover imposes substantial costs on organizations through lost expertise, productivity disruptions, and expenses tied to recruitment and training. For this BANA 273 machine learning project, my team analyzed a publicly available Kaggle dataset of 4,653 employees across three major Indian cities to answer a central question about which employee characteristics most strongly predict voluntary churn. The goal was not just to understand the drivers, but to build a model capable of identifying at-risk employees early enough to enable targeted intervention.
Business Question: Which employee attributes and organizational factors most strongly influence churn, and how accurately can they be used to identify at-risk employees?
Churn is not randomly distributed across the workforce. Both models consistently flagged the same groups as highest risk, with risk concentrated where tenure, pay, education, and location intersect rather than in any single factor acting alone.
- Mid-pay tier employees show the highest churn — the dissatisfaction appears sharpest when compensation is neither entry-level nor high enough to feel rewarding. The logistic model found mid-pay employees have 2.2x the churn odds of low-pay employees.
- Master's degree holders churn at higher rates than both Bachelor's and PhD employees — likely reflecting greater external mobility and a mismatch between expectations and available advancement. Odds ratio: 2.4x versus Bachelor's.
- Employees who have been benched show noticeably elevated churn — periods without project assignment appear to signal instability or exclusion. Odds ratio: 1.5x versus never benched.
- Pune employees have the highest churn rate of the three cities — at 1.6x the odds of Bangalore employees, pointing to regional labor market dynamics or workplace conditions that differ from the other hubs.
- Female employees show a higher churn proportion than male employees — a pattern that emerged consistently in descriptive analysis and was confirmed in the logistic model.
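As a sketch of how odds ratios like those above are derived, the snippet below fits a logistic regression and exponentiates its coefficients. The column names, sample size, and simulated effect sizes are illustrative assumptions, not the project's actual dataset:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical columns standing in for the real dataset's features.
rng = np.random.default_rng(42)
n = 500
X = pd.DataFrame({
    "pay_tier_mid": rng.integers(0, 2, n),       # 1 = mid pay tier
    "education_masters": rng.integers(0, 2, n),  # 1 = Master's degree
    "ever_benched": rng.integers(0, 2, n),       # 1 = has been benched
})
# Simulate churn whose odds rise with each risk factor (illustrative only).
logit = -1.0 + 0.8 * X["pay_tier_mid"] + 0.9 * X["education_masters"] + 0.4 * X["ever_benched"]
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

model = LogisticRegression(class_weight="balanced").fit(X, y)
# An odds ratio is exp(coefficient): e.g. 2.2 means 2.2x the churn odds.
odds_ratios = pd.Series(np.exp(model.coef_[0]), index=X.columns)
print(odds_ratios.round(2))
```

Because the odds ratio is just the exponentiated coefficient, any value above 1 marks a factor that multiplies the churn odds, which is what makes this model so directly readable for HR audiences.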
Two supervised learning models were built and compared. The tuned Decision Tree outperformed logistic regression on the primary objective of maximizing recall for churners, while logistic regression provided clearer interpretability through odds ratios.
The two models offer complementary perspectives. Logistic regression surfaces broad linear trends and interpretable effect sizes. The decision tree reveals how churn risk unfolds through conditional pathways, where tenure, pay, and education combine in ways that neither variable alone would predict.
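A minimal sketch of that two-model comparison, using synthetic data in place of the real employee features; the interaction-driven label is a stand-in for the conditional pathways a tree can capture that a linear model cannot:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

# Synthetic features; the project's real inputs (tenure, pay tier,
# education, city) would replace these.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 4))
# Churn here depends on an AND-style interaction of two features,
# which axis-aligned tree splits capture natively.
y = ((X[:, 0] > 0) & (X[:, 1] > 0)).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

logit = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)
tree = DecisionTreeClassifier(max_depth=4, class_weight="balanced",
                              random_state=0).fit(X_tr, y_tr)

for name, m in [("logistic", logit), ("tree", tree)]:
    print(name, round(recall_score(y_te, m.predict(X_te)), 2))
```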
The modeling results translate directly into five retention strategies that HR teams and managers can act on, targeting the specific employee groups identified as highest risk.
- Choosing the right evaluation metric matters more than choosing the right model. Once recall replaced accuracy as the primary objective, everything about how we built and compared models changed. A model that looked decent by accuracy was actually catching fewer than half of the churners it needed to find.
- Threshold tuning is an underused lever. Lowering the decision threshold from 0.50 to 0.35 lifted recall from 0.64 to 0.79 on the decision tree without any additional feature engineering. It is one of the most practical tools available for imbalanced classification problems.
- Different models tell different stories about the same problem. Logistic regression identified benching history as a strong predictor through its odds ratios, while the decision tree assigned it minimal importance. Neither is wrong — they are capturing different types of relationships in the data, and both perspectives are useful.
- Class imbalance is not just a modeling problem, it is a framing problem. Recognizing that missing a churner is far more costly than flagging a stable employee shaped every modeling decision we made, from class weighting to threshold selection to how we reported results.
- The clearest actionable output of a churn model is not the model itself, it is the ranked list of risk factors it produces. Tenure, pay tier, education, and location each point to a concrete HR intervention. The model is only useful if it translates into something a manager can act on.
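The threshold-tuning lever described in the lessons above can be sketched as follows; the synthetic data, tree settings, and class balance are illustrative assumptions, with only the 0.50 and 0.35 thresholds taken from the project:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import recall_score

# Synthetic imbalanced data (churners are the minority class, in the
# spirit of the project; exact proportions are assumptions).
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 4))
p = 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1] - 1.2)))
y = rng.binomial(1, p)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)
tree = DecisionTreeClassifier(max_depth=3, random_state=1).fit(X_tr, y_tr)

proba = tree.predict_proba(X_te)[:, 1]  # P(churn) per employee
for threshold in (0.50, 0.35):
    preds = (proba >= threshold).astype(int)  # flag churn above threshold
    print(threshold, round(recall_score(y_te, preds), 2))
```

Lowering the threshold can only add positive predictions, so recall is guaranteed to be non-decreasing; the trade-off is that precision typically falls, which is acceptable when missing a churner costs more than a false alarm.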
- Python: primary language for the entire analysis. Used to load and preprocess the dataset, build and evaluate both models, tune hyperparameters, and generate all visualizations.
- pandas: used for data manipulation, one-hot encoding of categorical variables, and building the summary tables used to compare model performance across configurations.
- Cross-Validation: used throughout to confirm that model performance generalized beyond the training split. 5-fold CV was applied to both the logistic regression and decision tree, with recall scores evaluated across each fold to verify stability before finalizing any model.
- scikit-learn: provided LogisticRegression, DecisionTreeClassifier, GridSearchCV, and cross_val_score. Used to fit baseline and tuned versions of both models and evaluate recall across folds.
- matplotlib and seaborn: used to produce all visualizations including the churn outcome distribution, attribute distribution charts, odds ratio bar chart, information gain rankings, and the recall comparison chart.
- Logistic Regression: applied with L1 regularization and class-weight balancing to produce interpretable odds ratios and identify the linear predictors of churn risk.
- Decision Tree Classifier: applied with pre-pruning constraints and threshold tuning to capture nonlinear interactions and achieve the highest recall of any model in the project.
- GridSearchCV: used to systematically search hyperparameter combinations for both models, selecting configurations that maximized cross-validated recall.
- Business Communication: translated model coefficients and information gain rankings into five concrete HR recommendations, each tied to a specific employee group and a specific intervention strategy.
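A hedged sketch of how the GridSearchCV and 5-fold cross-validation steps listed above might fit together; the parameter grid and synthetic data are assumptions, not the project's actual configuration:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

# Synthetic stand-in data; real employee features would replace this.
rng = np.random.default_rng(2)
X = rng.normal(size=(600, 4))
y = (X[:, 0] + rng.normal(scale=0.5, size=600) > 0.8).astype(int)

grid = GridSearchCV(
    DecisionTreeClassifier(class_weight="balanced", random_state=0),
    param_grid={"max_depth": [3, 4, 5], "min_samples_leaf": [5, 10, 20]},
    scoring="recall",  # select for churner recall, not accuracy
    cv=5,
)
grid.fit(X, y)

# Re-check the selected model's recall fold by fold for stability.
fold_recalls = cross_val_score(grid.best_estimator_, X, y,
                               scoring="recall", cv=5)
print(grid.best_params_, fold_recalls.round(2))
```

Inspecting the per-fold scores, rather than only the mean, is what verifies the stability mentioned in the cross-validation step: a model whose recall swings widely across folds would be a poor basis for HR decisions.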