Sun Country Airlines — Customer Segmentation
Project Context: Sun Country Airlines, a Minneapolis-based low-cost leisure carrier, had no dedicated data analyst on staff and relied on anecdotal assumptions to drive marketing decisions. Acting as an analytics consultant for a BANA 200 case study, I applied K-Means clustering to 15,144 customer-trip records from 2013 and 2014 and identified five actionable segments. I then translated the findings into three targeted campaigns, each tied to one of leadership's stated goals: Ufly loyalty enrollment, direct channel growth, and vacation package differentiation.
Business Question: Which distinct customer segments exist within Sun Country's booking data, and what targeted strategies should the airline pursue to grow loyalty enrollment, increase direct bookings, and differentiate its vacation packages?
K-Means clustering with k=5 produced five distinct, business-interpretable customer groups. The clearest structural pattern was a near-complete booking channel split between Clusters 0 and 1, which share the same Minneapolis leisure traveler profile but book through different channels. Loyalty program participation was almost entirely absent outside Cluster 3.
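
A centroid comparison of this kind can be sketched as follows. This is a minimal illustration, not the project's actual code: the three feature names and the eight toy rows are hypothetical, and it uses k=2 only because the toy data is so small (the analysis itself used k=5 on 88 features).

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical, already-normalized toy features (NOT the real dataset or schema).
features = pd.DataFrame(
    {
        "booked_direct": [1, 1, 1, 0, 0, 0, 1, 0],
        "origin_msp": [1, 1, 1, 1, 1, 1, 0, 0],
        "ufly_member": [0, 0, 1, 0, 0, 0, 1, 1],
    },
    dtype=float,
)

# k=2 is an assumption that suits eight toy rows; the source analysis used k=5.
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(features)

# One row per cluster, one column per feature: the centroid table.
centroids = pd.DataFrame(
    model.cluster_centers_, columns=features.columns
).rename_axis("cluster")

# Distance of each centroid from the overall mean highlights what sets a cluster apart.
lift = centroids - features.mean()
top_feature = lift.abs().idxmax(axis=1)

print(centroids.round(2))
print(top_feature)
```

Reading the table row by row is what turns a cluster number into a description such as "books direct, flies from MSP, not enrolled in the loyalty program".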
Before any model was run, exploratory data analysis surfaced two numbers that defined the entire business opportunity and foreshadowed what the clustering would find:
- 78.5% of customers were not enrolled in Ufly Rewards — the loyalty program was almost entirely untapped across the customer base.
- Booking channels split nearly evenly — 45.9% booked direct through SCA.com while 44.0% went through third-party platforms like Expedia and Google Flights, a structural revenue gap hiding in plain sight.
These two patterns set the direction for the entire analysis. The segmentation that followed gave them shape, revealing exactly which customer groups were driving each split and what it would take to move them.
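
Shares like the two above are typically computed with a normalized value count. The sketch below assumes hypothetical column names (`ufly_status`, `booking_channel`) and uses eight made-up rows, so its percentages will not match the figures reported here.

```python
import pandas as pd

# Hypothetical toy records; column names and values are assumptions, not the real schema.
trips = pd.DataFrame(
    {
        "ufly_status": [
            "Not enrolled", "Not enrolled", "Not enrolled", "Enrolled",
            "Not enrolled", "Enrolled", "Not enrolled", "Not enrolled",
        ],
        "booking_channel": [
            "SCA.com", "Third party", "SCA.com", "SCA.com",
            "Third party", "Third party", "Other", "SCA.com",
        ],
    }
)


def share_table(frame: pd.DataFrame, column: str) -> pd.Series:
    """Return the percentage of rows falling in each category of `column`."""
    return frame[column].value_counts(normalize=True).mul(100).round(1)


ufly_share = share_table(trips, "ufly_status")
channel_share = share_table(trips, "booking_channel")

print(ufly_share)
print(channel_share)
```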
Three campaigns were designed, each mapped directly to one of leadership's stated business objectives around Ufly enrollment, direct channel migration, and vacation package differentiation.
- The channel split between Clusters 0 and 1 was the most strategically interesting finding. Two customer groups with near-identical travel profiles diverged almost entirely on where they booked, which called for two very different interventions for what initially looked like one audience.
- Naming segments matters. Translating cluster centroids into personas like MSP Direct Bookers and MSP Deal Seekers made the results immediately usable for stakeholders who would never look at a centroid table. Communication is part of the analysis.
- Framing findings around business objectives, not just statistical patterns, is what makes an analysis actionable. Every segment was evaluated through the lens of leadership's three stated goals: loyalty enrollment, direct channel growth, and vacation package differentiation.
- Data gaps are part of the analysis. The absence of ancillary spend data, post-flight satisfaction scores, and loyalty conversion event tracking was not just a limitation to acknowledge; it became the basis for a concrete long-term data strategy recommendation.
- Unsupervised learning requires more judgment than supervised learning. There is no target variable to optimize against and no accuracy score to chase. Decisions about how many clusters to use, which features matter, and what a centroid means in business terms all come down to analyst judgment. That judgment is where the real work happens.
- Python: primary language for the entire analysis. Used to load and inspect the dataset, run all EDA, build and evaluate clustering models, and generate every visualization.
- pandas: used for data manipulation throughout, including filtering rows, computing group-level summary statistics, and building the centroid comparison tables used to profile each segment.
- scikit-learn: provided the KMeans algorithm and silhouette scoring. Used to fit models across k=2 through k=10 and select the optimal number of clusters.
- matplotlib and seaborn: used to produce all visualizations, including the elbow curve, silhouette plot, booking channel distributions, Ufly enrollment breakdown, and per-cluster destination bar charts.
- K-Means Clustering: core modeling technique. Applied to 88 normalized features to identify five behaviorally distinct customer groups from the raw booking data.
- Elbow Method and Silhouette Analysis: used together to select k=5 as the optimal cluster count, balancing statistical fit with the requirement that segments be interpretable and actionable.
- Business Communication: translated cluster centroids into named personas, mapped each segment to a specific leadership goal, and structured findings into a prioritized campaign brief for a non-technical audience.
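
The k-selection step named in the scikit-learn and Elbow Method bullets can be sketched roughly as below. It runs on synthetic blobs rather than the booking data, and the choice of `MinMaxScaler`, the `scan_k` helper, and the seed are all assumptions made for illustration, since the exact normalization and settings are not stated here.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the booking features (NOT the real data).
X_raw, _ = make_blobs(
    n_samples=600, centers=5, n_features=6, cluster_std=0.6, random_state=42
)

# Min-max scaling is an assumed normalization, chosen only for this sketch.
X = MinMaxScaler().fit_transform(X_raw)


def scan_k(features, k_values, seed=42):
    """Fit K-Means for each k and record inertia (elbow) and silhouette score."""
    results = {}
    for k in k_values:
        model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(features)
        results[k] = {
            "inertia": model.inertia_,
            "silhouette": silhouette_score(features, model.labels_),
        }
    return results


scores = scan_k(X, range(2, 11))  # k = 2 through 10, as in the analysis

# Highest silhouette is one candidate; the elbow and interpretability still need review.
best_k = max(scores, key=lambda k: scores[k]["silhouette"])

for k, s in scores.items():
    print(f"k={k:2d}  inertia={s['inertia']:8.2f}  silhouette={s['silhouette']:.3f}")
print("highest silhouette at k =", best_k)
```

The silhouette maximum is treated here as a starting point rather than a verdict: plotting inertia against k gives the elbow curve, and the final count still depends on whether the resulting segments are interpretable.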