Predicting Employee Churn

BANA 273 · University of California, Irvine · Fall 2025
Logistic Regression · Decision Tree · Python · Machine Learning · Predictive Analytics
4,653 employee records analyzed
2 models compared
79% best recall achieved

Project Context: Employee turnover imposes substantial costs on organizations through lost expertise, productivity disruptions, and expenses tied to recruitment and training. For this BANA 273 machine learning project, my team analyzed a publicly available Kaggle dataset of 4,653 employees across three major Indian cities to answer a central question: which employee characteristics most strongly predict voluntary churn? The goal was not just to understand the drivers, but to build a model capable of identifying at-risk employees early enough to enable targeted intervention.

Business Question: Which employee attributes and organizational factors most strongly influence churn, and how accurately can they be used to identify at-risk employees?

Key Terms
Churn: When an employee voluntarily leaves the organization. In this dataset, the outcome variable indicates whether an employee left within two years of the observation.
Recall: The proportion of actual churners the model correctly identifies. High recall means fewer at-risk employees are missed. This was the primary evaluation metric because missing a churner (a false negative) is far more costly than flagging a stable employee.
Class Imbalance: When one outcome is much more common than the other. In this dataset, 65.6% of employees stay and only 34.4% churn. Without addressing this, a model can achieve decent accuracy by simply predicting that everyone stays, while catching no churners at all.
Logistic Regression: A supervised learning model that estimates the probability of an outcome from a linear combination of predictors. It produces interpretable coefficients and odds ratios that show how each variable independently influences churn risk.
Decision Tree: A model that recursively splits the data into groups based on input features. Unlike logistic regression, it captures nonlinear relationships and interaction effects, showing how combinations of attributes create churn risk pathways.
Odds Ratio: From the logistic regression model, this measures how much higher the odds of churn are for one group versus another. An odds ratio of 2.4 for Master's degree holders means their odds of churning are 2.4 times those of Bachelor's degree holders, holding other variables constant.
Benched: An employee who has been placed on the bench, meaning they are not currently assigned to a project. The dataset tracks whether this has ever occurred, as bench time is associated with elevated churn risk.
Threshold Tuning: Adjusting the probability cutoff at which a model classifies someone as a churner. Lowering the threshold from 0.50 to 0.35 causes the model to flag more borderline cases as churners, increasing recall at the cost of some false positives.
Cross-Validation: A technique that tests the model on multiple splits of the data to confirm that performance is stable and not dependent on any one training set.
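The threshold tuning idea above can be shown in a few lines. This is a minimal illustration with made-up probabilities, not the project's actual predictions: the same scores produce different labels depending on the cutoff.

```python
import numpy as np

# Hypothetical P(churn) scores for five employees (fabricated for illustration)
proba = np.array([0.20, 0.38, 0.47, 0.55, 0.81])

flags_default = (proba >= 0.50).astype(int)  # default cutoff: [0, 0, 0, 1, 1]
flags_tuned = (proba >= 0.35).astype(int)    # lowered cutoff: [0, 1, 1, 1, 1]
```

Lowering the cutoff can only add positive predictions, which is why recall rises while false positives increase.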
Who Is Most Likely to Leave?

Churn is not randomly distributed across the workforce. Both models consistently pointed to the same groups as highest risk, with risk concentrated at the intersection of tenure, pay, education, and location rather than any single factor acting alone.

  • Mid-pay tier employees show the highest churn — the dissatisfaction appears sharpest when compensation is neither entry-level nor high enough to feel rewarding. The logistic model found mid-pay employees have 2.2x the churn odds of low-pay employees.
  • Master's degree holders churn at higher rates than both Bachelor's and PhD employees — likely reflecting greater external mobility and a mismatch between expectation and available advancement. Odds ratio: 2.4x versus Bachelor's.
  • Employees who have been benched show noticeably elevated churn — periods without project assignment appear to signal instability or exclusion. Odds ratio: 1.5x versus never benched.
  • Pune employees have the highest churn rate of the three cities — at 1.6x the odds of Bangalore employees, pointing to regional labor market dynamics or workplace conditions that differ from the other hubs.
  • Female employees show a higher churn proportion than male employees — a pattern that emerged consistently in descriptive analysis and was confirmed in the logistic model.
Model Performance Comparison

Two supervised learning models were built and compared. The tuned Decision Tree outperformed logistic regression on the primary objective of maximizing recall for churners, while logistic regression provided clearer interpretability through odds ratios.

Logistic Regression: 63% recall (churn class)
After L1 regularization and class-weight balancing. Identified education, pay tier, city, benching, and gender as the strongest predictors. Strong interpretability through odds ratios.
Tuned Decision Tree: 79% recall (churn class)
Pre-pruned with balanced class weights and a lowered decision threshold of 0.35. Identified joining year and payment tier as the dominant splits, capturing nonlinear interactions the logistic model cannot.

The two models offer complementary perspectives. Logistic regression surfaces broad linear trends and interpretable effect sizes. The decision tree reveals how churn risk unfolds through conditional pathways, where tenure, pay, and education combine in ways that neither variable alone would predict.
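A hedged sketch of where the odds ratios quoted throughout come from: exponentiating each fitted logistic regression coefficient gives the multiplicative change in churn odds per unit change in that predictor. The data and feature names below are synthetic stand-ins, not the project's.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data; feature names are illustrative only
X, y = make_classification(n_samples=500, n_features=4, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X, y)

# exp(coefficient) -> odds ratio; a value of 2.4 reads "2.4x the churn odds"
odds_ratios = np.exp(model.coef_[0])
for name, ratio in zip(["masters_degree", "mid_pay_tier", "ever_benched", "city_pune"],
                       odds_ratios):
    print(f"{name}: {ratio:.2f}x")
```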

Recommendations

The modeling results translate directly into five retention strategies that HR teams and managers can act on, targeting the specific employee groups identified as highest risk.

High Priority
Target Long-Tenured Mid-Pay Employees First
Primary risk group across both models
The decision tree identified joining year and payment tier as the two dominant predictors of churn. Longer-tenured employees in the mid-pay tier are the highest-risk combination in the dataset. Targeted outreach — career-development conversations, internal mobility options, or differentiated pay adjustments — should be concentrated here before turnover occurs.
High Priority
Review Mid-Tier Compensation Structure
Mid-pay tier — 19.7% of employees, 2.2x churn odds
Mid-tier dissatisfaction was the most consistent finding across both models. A pay equity analysis focused on this band, with clearer promotion pathways and salary benchmarking against market rates, could address the perceived mismatch between contribution and compensation that appears to be driving elevated turnover.
Medium Priority
Minimize Bench Time and Improve Communication During Gaps
Ever-benched employees — 10.3% of workforce, 1.5x churn odds
Employees who have been benched face meaningfully higher churn odds. Reducing unassigned periods where possible, and proactively communicating with employees when benching is unavoidable, can reduce the sense of exclusion or instability that bench time appears to create.
Medium Priority
Develop Location-Specific Retention Programs for Pune
Pune employees — 1.6x churn odds vs Bangalore
Churn rates in Pune are significantly higher than in Bangalore and New Delhi. Retention strategies should not be applied uniformly across all three cities. Local labor market conditions, compensation benchmarks, and engagement programs should be evaluated and tailored specifically for the Pune workforce.
Lower Priority
Create Advancement Pathways for Highly Educated Employees
Master's degree holders — 18.8% of workforce, 2.4x churn odds
Master's degree holders have more than double the churn odds of Bachelor's degree employees. Creating clearer recognition, advancement tracks, and career development opportunities for this group can narrow the gap between expectation and reality that likely drives their higher departure rate.
Methodology

Step 1
Data Preparation
The Employee Future Prediction dataset from Kaggle contains 4,653 employee records across 9 attributes covering education, joining year, city of employment, payment tier, age, gender, benching history, domain experience, and the churn outcome. The dataset had no missing values and required minimal preprocessing. For logistic regression, categorical variables (education, city, payment tier) were one-hot encoded to ensure appropriate estimation. The decision tree used original categorical encodings directly. A 70/30 train-test split with a fixed random state of 42 was used throughout for reproducibility.
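A minimal sketch of this preprocessing, assuming column names that follow the dataset schema described above; the rows themselves are fabricated for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Fabricated rows mimicking the Kaggle schema described in this writeup
df = pd.DataFrame({
    "Education":   ["Bachelors", "Masters", "PHD", "Bachelors", "Masters", "Bachelors"],
    "City":        ["Bangalore", "Pune", "New Delhi", "Pune", "Bangalore", "New Delhi"],
    "PaymentTier": [3, 2, 1, 2, 3, 2],
    "JoiningYear": [2015, 2017, 2013, 2018, 2016, 2014],
    "Age":         [28, 34, 41, 30, 26, 38],
    "LeaveOrNot":  [0, 1, 0, 1, 0, 0],
})

# One-hot encode the categorical predictors for the logistic regression
X = pd.get_dummies(df.drop(columns="LeaveOrNot"),
                   columns=["Education", "City", "PaymentTier"], drop_first=True)
y = df["LeaveOrNot"]

# 70/30 split with a fixed random state for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)
```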
Step 2
Addressing Class Imbalance
65.6% of employees stay and only 34.4% churn — a meaningful imbalance that makes overall accuracy a misleading metric. A model predicting everyone stays would achieve 65.6% accuracy while catching zero actual churners. To counter this, both models used class_weight="balanced" to adjust the loss function and place greater emphasis on correctly identifying the minority churn class. Recall for the churn class became the primary evaluation metric throughout the project.
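The accuracy trap described above is easy to demonstrate. This sketch generates synthetic labels with the stated 65.6/34.4 split and scores a trivial "everyone stays" model: accuracy looks respectable while churn recall is zero.

```python
import numpy as np

# Synthetic labels with the class proportions reported in the dataset
rng = np.random.default_rng(42)
y = rng.choice([0, 1], size=4653, p=[0.656, 0.344])  # 0 = stays, 1 = churns

always_stay = np.zeros_like(y)               # predict "stays" for every employee
accuracy = (always_stay == y).mean()         # close to 0.656 -- looks decent
churn_recall = (always_stay[y == 1] == 1).mean()  # 0.0 -- catches no churners
```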
Step 3
Logistic Regression Modeling
A baseline logistic regression was estimated first using default scikit-learn settings, achieving a churn recall of only 0.41. The model was then improved using GridSearchCV, which selected L1 regularization (lasso penalty), class-weight balancing, and a regularization strength of C=0.1. L1 regularization shrinks weaker coefficients to zero, removing uninformative predictors and improving interpretability. The improved model raised churn recall to approximately 0.63, with 5-fold cross-validation confirming stability (mean recall 0.629 across folds).
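A hedged sketch of this tuning step: GridSearchCV over penalty type, class weighting, and regularization strength, scored on recall. The grid and data below are stand-ins, not the project's exact configuration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data standing in for the employee dataset
X, y = make_classification(n_samples=600, weights=[0.66, 0.34], random_state=42)

param_grid = {
    "C": [0.01, 0.1, 1.0],              # regularization strength
    "penalty": ["l1", "l2"],
    "class_weight": [None, "balanced"],
}
search = GridSearchCV(
    LogisticRegression(solver="liblinear", max_iter=1000),  # liblinear supports L1
    param_grid, scoring="recall", cv=5)
search.fit(X, y)
best = search.best_estimator_
```

Scoring on recall rather than accuracy is what steers the search toward configurations that catch churners, mirroring the project's metric choice.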
Step 4
Decision Tree Modeling
A baseline decision tree already outperformed the logistic baseline (recall 0.64) without any tuning. The pre-pruned version applied class_weight="balanced", a maximum depth of 8, minimum 10 samples per leaf, and minimum split size of 5 to prevent overfitting. The decision threshold was then lowered from 0.50 to 0.35, allowing the model to flag borderline cases as churners. This final tuned tree achieved test-set recall of 0.79, with 5-fold cross-validation confirming stability (mean recall 0.727).
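Putting the pieces of this step together, a sketch of the pre-pruned tree with the stated constraints, a 5-fold recall check, and the lowered 0.35 cutoff; the data is a synthetic stand-in.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the employee dataset
X, y = make_classification(n_samples=1000, weights=[0.66, 0.34], random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42)

tree = DecisionTreeClassifier(
    class_weight="balanced",
    max_depth=8,            # pre-pruning: cap tree depth
    min_samples_leaf=10,    # require at least 10 samples in each leaf
    min_samples_split=5,
    random_state=42,
).fit(X_train, y_train)

# 5-fold CV on churn recall to check stability before trusting the test score
cv_recall = cross_val_score(tree, X_train, y_train, cv=5, scoring="recall")

# Classify at the lowered 0.35 threshold instead of the default 0.50
churn_flags = (tree.predict_proba(X_test)[:, 1] >= 0.35).astype(int)
```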
What I took away from this project
  • Choosing the right evaluation metric matters more than choosing the right model. Once recall replaced accuracy as the primary objective, everything about how we built and compared models changed. A model that looked decent by accuracy was actually catching fewer than half of the churners it needed to find.
  • Threshold tuning is an underused lever. Lowering the decision threshold from 0.50 to 0.35 lifted recall from 0.64 to 0.79 on the decision tree without any additional feature engineering. It is one of the most practical tools available for imbalanced classification problems.
  • Different models tell different stories about the same problem. Logistic regression identified benching history as a strong predictor through its odds ratios, while the decision tree assigned it minimal importance. Neither is wrong — they are capturing different types of relationships in the data, and both perspectives are useful.
  • Class imbalance is not just a modeling problem, it is a framing problem. Recognizing that missing a churner is far more costly than flagging a stable employee shaped every modeling decision we made, from class weighting to threshold selection to how we reported results.
  • The clearest actionable output of a churn model is not the model itself, it is the ranked list of risk factors it produces. Tenure, pay tier, education, and location each point to a specific and concrete HR intervention. The model is only useful if it translates into something a manager can act on.
Tools and Skills
  • Python: primary language for the entire analysis. Used to load and preprocess the dataset, build and evaluate both models, tune hyperparameters, and generate all visualizations.
  • pandas: used for data manipulation, one-hot encoding of categorical variables, and building the summary tables used to compare model performance across configurations.
  • Cross-Validation: used throughout to confirm that model performance generalized beyond the training split. 5-fold CV was applied to both the logistic regression and decision tree, with recall scores evaluated across each fold to verify stability before finalizing any model.
  • scikit-learn: provided LogisticRegression, DecisionTreeClassifier, GridSearchCV, and cross_val_score. Used to fit baseline and tuned versions of both models and evaluate recall across folds.
  • matplotlib and seaborn: used to produce all visualizations including the churn outcome distribution, attribute distribution charts, odds ratio bar chart, information gain rankings, and the recall comparison chart.
  • Logistic Regression: applied with L1 regularization and class-weight balancing to produce interpretable odds ratios and identify the linear predictors of churn risk.
  • Decision Tree Classifier: applied with pre-pruning constraints and threshold tuning to capture nonlinear interactions and achieve the highest recall of any model in the project.
  • GridSearchCV: used to systematically search hyperparameter combinations for both models, selecting configurations that maximized recall on the training data.
  • Business Communication: translated model coefficients and information gain rankings into five concrete HR recommendations, each tied to a specific employee group and a specific intervention strategy.