Sun Country Airlines — Customer Segmentation

BANA 200 · University of California, Irvine · Summer 2025
K-Means Clustering · Python · Customer Analytics · Predictive Analytics
15,144 customer records analyzed
5 distinct segments identified
52.7% of customers in highest-priority segments

Project Context: Sun Country Airlines, a Minneapolis-based low-cost leisure carrier, had no dedicated data analyst on staff and relied entirely on anecdotal assumptions to drive marketing decisions. Acting as an analytics consultant for a BANA 200 case study, I applied K-Means clustering to 15,144 customer-trip records spanning 2013 to 2014, identifying five actionable segments and translating the findings into three targeted campaigns directly tied to leadership's stated goals around Ufly loyalty enrollment, direct channel growth, and vacation package differentiation.

Business Question: Which distinct customer segments exist within Sun Country's booking data, and what targeted strategies should the airline pursue to grow loyalty enrollment, increase direct bookings, and differentiate its vacation packages?

Key Terms
MSP: Minneapolis-Saint Paul International Airport. Sun Country's home base and the origin or destination for the majority of customers in this dataset.
K-Means Clustering: A machine learning algorithm that groups records into k clusters based on similarity. Each cluster represents customers who share common booking behaviors and travel patterns.
Ufly Rewards: Sun Country's loyalty program. Members earn points on flights and purchases. At the time of this analysis, enrollment was extremely low across the customer base.
Elbow Method: A technique for choosing the right number of clusters (k). You plot the model's error at each value of k and look for the "elbow" in the curve, the point where adding more clusters stops producing meaningful improvement.
Silhouette Score: A measure of how well each customer fits its assigned cluster versus neighboring clusters. Higher scores indicate cleaner, more distinct groupings.
EDA: Exploratory Data Analysis. Examining the raw data through summary statistics and visualizations before any modeling begins.
Direct Channel: A booking made through Sun Country's own website (SCA.com), which avoids third-party commissions and allows direct loyalty enrollment.
Third-Party Channel: A booking made through platforms like Expedia or Google Flights. Sun Country earns less margin and has limited ability to engage the customer post-booking.
The 5 Customer Segments

K-Means clustering with k=5 revealed five distinct, business-interpretable customer groups. The clearest structural pattern was a near-complete booking channel split between Clusters 0 and 1, which share the same Minneapolis leisure traveler profile but differ entirely in where they book. Loyalty program participation was concentrated in Cluster 3, with partial uptake in Cluster 4 and very little in the other segments.

C0: MSP Direct Bookers
27.3%
MSP locals booking on SCA.com. 74% direct channel. 99.9% non-Ufly — the single largest loyalty enrollment opportunity.
C1: MSP Deal Seekers
25.4%
Same traveler as C0 but 100% third-party channel (Expedia, Google Flights). A margin problem with a structural fix.
C2: Inbound Visitors
16.1%
95% fly INTO Minneapolis. Low spend, one-way dominant, 95% non-Ufly. Forms a cross-sell mirror with C4.
C3: Loyal Frequent Fliers
15.7%
99% Ufly Standard. Highest ticket spend and longest booking lead time of any segment. Peak travel in Q4.
C4: Returning Home Fliers
15.5%
98% flying home to MSP. 30% Ufly Standard — partial loyalty conversion already underway. Natural vacation package target.
Key Insight from EDA

Before any model was run, exploratory data analysis surfaced two numbers that defined the entire business opportunity and foreshadowed what the clustering would find:

  • 78.5% of customers were not enrolled in Ufly Rewards, leaving the loyalty program largely untapped across the customer base.
  • Booking channels split nearly evenly: 45.9% booked direct through SCA.com while 44.0% went through third-party platforms like Expedia and Google Flights, a structural revenue gap hiding in plain sight.

These two patterns set the direction for the entire analysis. The segmentation that followed gave them shape, revealing exactly which customer groups were driving each split and what it would take to move them.

Three campaigns were designed, each mapped directly to one of leadership's stated business objectives: Ufly enrollment, direct channel migration, and vacation package differentiation. A fourth, low-priority offer for C2 is included as something to test rather than as a core campaign.

High Priority
Launch Direct Booking and Ufly Enrollment Campaign
C0 + C1 — 52.7% of all customers
C0 already books direct but 99.9% are non-Ufly. A post-booking enrollment prompt with a first-flight miles bonus is low-friction and high-volume. C1 books 100% through third parties — a guaranteed 5 to 10% direct-booking discount gives them a concrete reason to switch channels, where an enrollment prompt can then be served.
Medium Priority
Introduce a Ufly Elite Tier for Loyal Frequent Fliers
C3 — 15.7% of customers
C3 is 99% Ufly Standard with the highest ticket spend of any segment, yet there is currently no Elite tier for these customers to aspire toward. A competitor offering a status match could rapidly erode years of accumulated loyalty. Gating Elite perks (priority boarding, free checked bag) behind a spend threshold creates a concrete retention mechanism before a competitor acts.
Medium Priority
Cross-Sell Outbound Vacation Packages at Checkout
C4 — 15.5% of customers
C4 flies home to MSP one-way and 30% already hold Ufly Standard. A customer completing a homebound trip via Sun Country is primed for an outbound vacation package offer at checkout. A "Plan your next escape" prompt converts one-way buyers into round-trip vacation customers and deepens Ufly engagement in a segment with existing brand affinity.
Low Priority
Round-Trip Bundle Offer
C2 — 16.1% of customers
C2 is the most transactional and lowest-spend segment. A round-trip bundle offer is worth testing but is not the core focus given limited brand attachment and low Ufly engagement.
Step 1
Data Preparation
Started with 15,144 records across 90 features covering booking channel, loyalty status, demographics, origin and destination airports, and travel seasonality. Removed two identifier columns (uid and PNRLocatorID) that carry no analytical signal. All remaining 88 features were pre-normalized to a 0 to 1 scale. The dataset had no missing values.
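The preparation checks described above can be sketched in a few lines of pandas. This is a minimal illustration on a tiny synthetic table, not the project's actual notebook code: only the uid and PNRLocatorID column names come from the write-up, while the feat_* columns and the prepare_features helper are hypothetical.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset. Only "uid" and "PNRLocatorID"
# are column names from the write-up; the feat_* columns are hypothetical.
rng = np.random.default_rng(0)
raw = pd.DataFrame(rng.random((6, 3)), columns=["feat_a", "feat_b", "feat_c"])
raw.insert(0, "PNRLocatorID", [f"PNR{i}" for i in range(6)])
raw.insert(0, "uid", list(range(6)))


def prepare_features(df: pd.DataFrame) -> pd.DataFrame:
    """Drop identifier columns, then verify the data is complete and scaled."""
    features = df.drop(columns=["uid", "PNRLocatorID"])

    # The write-up reports no missing values; fail loudly if that is not true.
    if features.isna().any().any():
        raise ValueError("Unexpected missing values in feature columns")

    # The write-up reports features already scaled to the 0 to 1 range.
    values = features.to_numpy()
    if values.min() < 0 or values.max() > 1:
        raise ValueError("Features are not on a 0 to 1 scale")

    return features


features = prepare_features(raw)
print(features.shape)
```

Running the checks as hard failures rather than silent assumptions matters for K-Means in particular, because distance-based clustering is sensitive to both missing values and features on mismatched scales.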
Step 2
Exploratory Data Analysis
Examined the distribution of key variables before any modeling. The near-even booking channel split (46% direct vs 44% third-party) and the 78.5% non-Ufly enrollment rate emerged as the two most important structural patterns. Destination frequency showed MSP dominated, confirming a locally concentrated customer base. These findings shaped the interpretation of every cluster that followed.
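Percentage breakdowns like the channel split and the non-Ufly rate are simple normalized counts. The sketch below shows the idea on ten made-up rows; the booking_channel and ufly_status column names and their values are assumptions for illustration, and the real dataset may encode these fields differently (for example as one-hot indicator columns).

```python
import pandas as pd

# Ten hypothetical bookings; column names and values are illustrative only.
bookings = pd.DataFrame(
    {
        "booking_channel": [
            "direct", "third_party", "direct", "other", "third_party",
            "direct", "third_party", "direct", "third_party", "direct",
        ],
        "ufly_status": [
            "none", "none", "standard", "none", "none",
            "none", "none", "standard", "none", "none",
        ],
    }
)


def share_pct(series: pd.Series) -> pd.Series:
    """Return each category's share of rows as a percentage."""
    return series.value_counts(normalize=True).mul(100).round(1)


channel_share = share_pct(bookings["booking_channel"])
ufly_share = share_pct(bookings["ufly_status"])

print(channel_share)
print(ufly_share)
```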
Step 3
Choosing k = 5
Tested values of k from 2 through 10. The elbow method plots within-cluster sum of squares at each k; the point where the curve flattens signals diminishing returns from adding more clusters. A clear elbow appeared at k=5, and silhouette analysis supported the same choice. Because the five resulting groups were also easy to interpret in business terms, k=5 offered the best balance of statistical fit and interpretability. See the full elbow curve in the notebook.
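A k-selection loop of this kind might look like the sketch below. It runs on synthetic blobs from scikit-learn's make_blobs rather than the Sun Country data, so the numbers it produces say nothing about the real result; the variable names are illustrative. KMeans exposes the within-cluster sum of squares as inertia_, and silhouette_score computes the mean silhouette over all samples.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import MinMaxScaler

# Synthetic data standing in for the normalized feature matrix.
X, _ = make_blobs(n_samples=500, centers=5, n_features=8, random_state=42)
X = MinMaxScaler().fit_transform(X)  # mimic the 0 to 1 scaling

inertias = {}     # within-cluster sum of squares, used for the elbow curve
silhouettes = {}  # mean silhouette score per k

for k in range(2, 11):  # k = 2 through 10
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = model.inertia_
    silhouettes[k] = silhouette_score(X, model.labels_)

for k in inertias:
    print(f"k={k:2d}  inertia={inertias[k]:10.2f}  silhouette={silhouettes[k]:.3f}")

# The k with the highest silhouette is only a candidate; the final choice
# also weighs the elbow and whether the clusters are interpretable.
best_by_silhouette = max(silhouettes, key=silhouettes.get)
print("Highest silhouette at k =", best_by_silhouette)
```

Plotting inertias against k (for example with matplotlib) gives the elbow curve. Neither metric picks k on its own, which is why the final choice is treated as a judgment call.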
Step 4
K-Means Clustering
Final model fit with k=5, random_state=42 for reproducibility, and n_init=10 to run from 10 different starting points and select the best result. Clusters were reordered by size. Each cluster's centroid was translated into a named segment profile with a clear business interpretation and a recommended action tied to leadership's stated goals.
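The final fit, the size-based relabeling, and the centroid table used for profiling can be sketched as follows. The k=5, random_state=42, and n_init=10 settings come from the description above; the synthetic data, the feat_* column names, and the relabeling approach are assumptions for illustration, since the write-up does not show how the reordering was implemented.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with five groups of unequal size (hypothetical features).
X, _ = make_blobs(n_samples=[120, 60, 200, 90, 30], n_features=4, random_state=42)
features = pd.DataFrame(X, columns=[f"feat_{i}" for i in range(4)])

# Settings taken from the write-up: k=5, fixed seed, 10 initializations.
model = KMeans(n_clusters=5, n_init=10, random_state=42).fit(features)

# K-Means label numbers are arbitrary, so relabel clusters so that
# cluster 0 is the largest, cluster 1 the next largest, and so on.
sizes = pd.Series(model.labels_).value_counts()  # sorted largest first
relabel = {old: new for new, old in enumerate(sizes.index)}
features["cluster"] = pd.Series(model.labels_).map(relabel).to_numpy()

# Segment share of all records, as a percentage.
share = features["cluster"].value_counts(normalize=True).sort_index().mul(100).round(1)

# Per-cluster feature means: the table an analyst reads to name each segment.
centroids = features.groupby("cluster").mean()

print(share)
print(centroids.round(2))
```

On 0 to 1 scaled indicator features, each centroid value reads directly as the proportion of the cluster with that attribute, which is what makes statements such as "74% direct channel" possible.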
What I took away from this project
  • The channel split between C0 and C1 was the single most strategically interesting finding. Two customer groups with near-identical travel profiles diverged entirely on where they booked, which meant two very different interventions for what initially looked like one audience.
  • Naming segments matters. Translating cluster centroids into personas like MSP Direct Bookers and MSP Deal Seekers made the results immediately usable for stakeholders who would never look at a centroid table. Communication is part of the analysis.
  • Framing findings around business objectives, not just statistical patterns, is what makes an analysis actionable. Every segment was evaluated through the lens of leadership's three stated goals: loyalty enrollment, direct channel growth, and vacation package differentiation.
  • Data gaps are part of the analysis. The missing ancillary spend data, post-flight satisfaction scores, and loyalty conversion event tracking were not just limitations to acknowledge; they became the basis for a concrete long-term data strategy recommendation.
  • Unsupervised learning requires more judgment than supervised learning. There is no target variable to optimize against and no accuracy score to chase. Decisions about how many clusters to use, which features matter, and what a centroid means in business terms all come down to analyst judgment. That judgment is where the real work happens.
Tools and Skills
  • Python: primary language for the entire analysis. Used to load and inspect the dataset, run all EDA, build and evaluate clustering models, and generate every visualization.
  • pandas: used for data manipulation throughout, including filtering rows, computing group-level summary statistics, and building the centroid comparison tables used to profile each segment.
  • scikit-learn: provided the KMeans algorithm and silhouette scoring. Used to fit models across k=2 through k=10 and select the optimal number of clusters.
  • matplotlib and seaborn: used to produce all visualizations including the elbow curve, silhouette plot, booking channel distributions, Ufly enrollment breakdown, and per-cluster destination bar charts.
  • K-Means Clustering: core modeling technique. Applied to 88 normalized features to identify five behaviorally distinct customer groups from the raw booking data.
  • Elbow Method and Silhouette Analysis: used together to select k=5 as the optimal cluster count, balancing statistical fit with the requirement that segments be interpretable and actionable.
  • Business Communication: translated cluster centroids into named personas, mapped each segment to a specific leadership goal, and structured findings into a prioritized campaign brief for a non-technical audience.