Sun Country Airlines — Customer Segmentation Analysis¶


Context: Sun Country Airlines leadership tasked a consulting team with identifying distinct customer segments from customer booking data (Jan 2013 – Dec 2014). The goal: uncover actionable insights to improve digital booking, grow Ufly Rewards enrollment, and develop differentiated vacation packages.

Dataset: 15,144 customer records × 90 features — booking channel, demographics, origin/destination, loyalty status, and travel seasonality. All features are pre-normalized to [0, 1].

Approach: Exploratory Data Analysis → Elbow Method → K-Means (k=5) → Segment Profiling → Business Recommendations


0. Setup & Imports¶

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
import seaborn as sns
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import warnings
warnings.filterwarnings('ignore')

# --- Global style settings ---
CLUSTER_COLORS = ['#4A7FA5', '#C0554E', '#4E9A72', '#D4895A', '#7B6B9E']
CLUSTER_NAMES  = ['C0: MSP Direct Bookers','C1: MSP Deal Seekers','C2: Inbound Visitors','C3: Loyal Frequent Fliers','C4: Returning Home Fliers']
SHORT_NAMES = ['MSP Direct\nBookers', 'MSP Deal\nSeekers', 'Inbound\nVisitors','Loyal Frequent\nFliers', 'Returning Home\nFliers']

plt.rcParams.update({'figure.facecolor':'white','axes.facecolor':'white', 'axes.spines.top': False, 'axes.spines.right': False, 
                     'font.family':'Times New Roman', 'axes.titlesize': 14, 'axes.labelsize': 12, 'xtick.labelsize': 10, 'ytick.labelsize': 10,})

1. Load & Inspect Data¶

In [2]:
# Load the clustering dataset
clu = pd.read_csv('Clustering Data.csv')

print(f'Dataset shape: {clu.shape[0]:,} rows × {clu.shape[1]} columns')
print(f'Missing values: {clu.isnull().sum().sum()}')
Dataset shape: 15,144 rows × 90 columns
Missing values: 0
In [3]:
# Preview the clustering dataset structure
clu.head()
Out[3]:
uid PNRLocatorID avg_amt round_trip group_size group days_pre_booked BookingChannel_Other BookingChannel_Outside_Booking BookingChannel_Reservations_Booking ... true_destination_dest_SXM true_destination_dest_TPA true_destination_dest_ZIH UflyMemberStatus_Elite UflyMemberStatus_non-ufly UflyMemberStatus_Standard seasonality_Q1 seasonality_Q2 seasonality_Q3 seasonality_Q4
0 504554455244696420493F7C2067657420746869732072... AADMLF 0.019524 0 0.000 0 0.029703 0 1 0 ... 0 0 0 0 0 1 0 0 0 1
1 46495853454E44696420493F7C20676574207468697320... AAFBOM 0.081774 1 0.000 0 0.039604 0 0 0 ... 0 0 0 0 0 1 0 0 1 0
2 534355545444696420493F7C2067657420746869732072... AAFILI 0.026650 0 0.125 1 0.069307 0 0 0 ... 0 0 0 0 1 0 1 0 0 0
3 534355545444696420493F7C2067657420746869732072... AAFILI 0.026650 0 0.125 1 0.069307 0 0 0 ... 0 0 0 0 1 0 1 0 0 0
4 44554D4D414E4E44696420493F7C206765742074686973... AAFRQI 0.000000 1 0.000 0 0.035361 0 1 0 ... 0 0 0 0 1 0 0 0 0 1

5 rows × 90 columns

Data Quality Notes:

  • All 88 numeric features are pre-normalized to [0, 1], so no additional scaling is required before clustering.
  • uid and PNRLocatorID are identifier strings — excluded from clustering features.
  • No missing values detected; the dataset is clean and ready for modeling.
  • Categorical variables (booking channel, age group, origin/destination, loyalty status, seasonality) have been one-hot encoded into binary indicator columns.
  • Key business variables: avg_amt (ticket price), days_pre_booked (booking lead time), round_trip, group_size, and the loyalty/channel/destination indicator columns.
  • Note: A secondary reservations dataset (sample_data_transformed.csv) was referenced in the original assignment but is not available. All analysis below uses the clustering dataset directly, which contains all features needed for full segment profiling.

2. Exploratory Data Analysis (EDA)¶

In [138]:
fig, axes = plt.subplots(2, 2, figsize=(8, 6))

continuous_vars = [
    ('avg_amt',         'Average Ticket Amount (normalized)',  '#4A7FA5'),
    ('group_size',      'Group Size (normalized)',             '#4E9A72'),
    ('days_pre_booked', 'Days Pre-Booked (normalized)',        '#D4895A'),
]

# Flatten to get first 3 axes, hide the 4th
ax_list = [axes[0,0], axes[0,1], axes[1,0]]
axes[1,1].set_visible(False)

for ax, (col, title, color) in zip(ax_list, continuous_vars):
    ax.hist(clu[col], bins=40, color=color, edgecolor='white', linewidth=0.5, alpha=0.85)
    ax.axvline(clu[col].mean(), color='black', linestyle='--', linewidth=1.2,
               label=f'Mean: {clu[col].mean():.3f}')
    ax.set_title(title, fontweight='bold', pad=20)
    ax.set_xlabel('Normalized value (0–1)', labelpad=10)
    ax.set_ylabel('Count', labelpad=10)
    ax.legend(fontsize=9)

plt.tight_layout()
plt.savefig('00 - eda_continuous_distributions.png', dpi=500, bbox_inches='tight')
plt.show()
No description has been provided for this image
In [163]:
def autopct_threshold(pct):
    return f'{pct:.1f}%' if pct >= 5 else ''

# Chart 1: Booking Channel Mix
channel_cols = ['BookingChannel_SCA_Website_Booking', 'BookingChannel_Outside_Booking', 'BookingChannel_Reservations_Booking', 
                'BookingChannel_Tour_Operator_Portal', 'BookingChannel_SY_Vacation', 'BookingChannel_Other']
channel_labels = ['SCA Website', 'Outside/3rd Party', 'Reservations', 'Tour Operator', 'SY Vacation', 'Other']
channel_counts = clu[channel_cols].sum().values
colors_ch = ['#4A7FA5', '#C0554E', '#4E9A72', '#D4895A', '#7B6B9E', '#9E9E9E']

fig1, ax1 = plt.subplots(figsize=(7, 7))
wedges, _, autotexts = ax1.pie(channel_counts, labels=None, colors=colors_ch, autopct=autopct_threshold, startangle=90, pctdistance=1.18,
                               wedgeprops={'edgecolor': 'white', 'linewidth': 1.5}, radius=0.75)
for at in autotexts: at.set_fontsize(15); at.set_fontweight('bold')

legend_labels_ch = [f'{l} ({c/channel_counts.sum()*100:.1f}%)' if c/channel_counts.sum()*100 < 5 else l
    for l, c in zip(channel_labels, channel_counts)]

ax1.legend(wedges, legend_labels_ch, loc='upper center', bbox_to_anchor=(0.5, 0.15), ncol=3, fontsize=12, frameon=False)
ax1.set_title('Booking Channel Mix', fontweight='bold', fontsize=17, y=0.90)
plt.tight_layout()
plt.savefig('00 - eda_booking_channel.png', dpi=500, bbox_inches='tight')
plt.show()
No description has been provided for this image
In [168]:
def autopct_threshold(pct):
    return f'{pct:.1f}%' if pct >= 5 else ''

# Chart 2: Ufly Loyalty Program Status
loyalty_cols   = ['UflyMemberStatus_non-ufly', 'UflyMemberStatus_Standard', 'UflyMemberStatus_Elite']
loyalty_labels = ['Non-Member', 'Standard', 'Elite']
loyalty_counts = clu[loyalty_cols].sum().values
colors_ly = ['#9E9E9E', '#D4895A', '#C0554E']

fig2, ax2 = plt.subplots(figsize=(6, 6))
wedges2, _, autotexts2 = ax2.pie(loyalty_counts, labels=None, colors=colors_ly, autopct=autopct_threshold, startangle=90, pctdistance=1.18,
                                 wedgeprops={'edgecolor': 'white', 'linewidth': 1.5}, radius=0.75)
for at in autotexts2: at.set_fontsize(15); at.set_fontweight('bold')

legend_labels_ly = [f'{l} ({c/loyalty_counts.sum()*100:.1f}%)' if c/loyalty_counts.sum()*100 < 5 else l
    for l, c in zip(loyalty_labels, loyalty_counts)]

ax2.legend(wedges2, legend_labels_ly, loc='upper center', bbox_to_anchor=(0.5, 0.15), ncol=3, fontsize=12, frameon=False)
ax2.set_title('Ufly Loyalty Program Status', fontweight='bold', fontsize=17, y=0.90)
plt.tight_layout()
plt.savefig('00 - eda_loyalty_status.png', dpi=500, bbox_inches='tight')
plt.show()
No description has been provided for this image
In [44]:
# EDA: Top destination airports

dest_cols = [c for c in clu.columns if c.startswith('true_destination_dest_')]
dest_means = clu[dest_cols].mean().sort_values(ascending=False).head(12)
dest_labels = [c.replace('true_destination_dest_', '') for c in dest_means.index]

fig, ax = plt.subplots(figsize=(12, 4))
bars = ax.bar(dest_labels, dest_means.values, color='#4A7FA5', edgecolor='white', linewidth=0.8)
bars[0].set_color('#C0554E')  # highlight top destination
ax.set_title('Top 12 Destinations by Booking Frequency', fontweight='bold', fontsize=15)
ax.set_ylabel('Mean booking proportion (0–1)', labelpad=20, fontsize=15)
ax.set_xlabel('Destination Airport Code', labelpad=20, fontsize=15)
ax.tick_params(axis='x', labelsize=13)
ax.tick_params(axis='y', labelsize=13)
for bar in bars:
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.003,
            f'{bar.get_height():.3f}', ha='center', va='bottom', fontsize=12)
plt.tight_layout()
plt.savefig('00 - eda_top_destinations.png', dpi=500, bbox_inches='tight')
plt.show()
No description has been provided for this image

3. Feature Selection & Clustering Preparation¶

In [8]:
# Select numeric features for clustering (exclude identifier columns)

exclude = ['uid', 'PNRLocatorID']
feature_cols = [c for c in clu.select_dtypes(include='number').columns if c not in exclude]

X = clu[feature_cols].copy()
print(f'Feature matrix shape: {X.shape}')
print(f'\nFeature categories:')
print(f'Booking behavior          : avg_amt, round_trip, group_size, group, days_pre_booked')
print(f'Booking channels          : {len([c for c in feature_cols if "BookingChannel" in c])} columns')
print(f'Age groups                : {len([c for c in feature_cols if "age_group" in c])} columns')
print(f'Origin airports           : {len([c for c in feature_cols if "true_origins" in c])} columns')
print(f'Destination airports.     : {len([c for c in feature_cols if "true_destination" in c])} columns')
print(f'Loyalty status            : {len([c for c in feature_cols if "Ufly" in c])} columns')
print(f'Seasonality               : {len([c for c in feature_cols if "seasonality" in c])} columns')
Feature matrix shape: (15144, 88)

Feature categories:
Booking behavior          : avg_amt, round_trip, group_size, group, days_pre_booked
Booking channels          : 6 columns
Age groups                : 5 columns
Origin airports           : 29 columns
Destination airports.     : 36 columns
Loyalty status            : 3 columns
Seasonality               : 4 columns

4. Choosing k: Elbow Method + Silhouette Analysis¶

In [9]:
# Run elbow method and silhouette analysis for k = 2 through 10.

k_range = range(2, 11)
inertias = []
silhouettes = []

print('Computing inertia and silhouette scores...')
for k in k_range:
    km = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = km.fit_predict(X)
    inertias.append(km.inertia_)
    sil = silhouette_score(X, labels, sample_size=3000, random_state=42)
    silhouettes.append(sil)
    print(f'  k={k}: inertia={km.inertia_:,.0f}, silhouette={sil:.4f}')
Computing inertia and silhouette scores...
  k=2: inertia=59,918, silhouette=0.1151
  k=3: inertia=55,337, silhouette=0.1000
  k=4: inertia=52,394, silhouette=0.0942
  k=5: inertia=50,292, silhouette=0.1036
  k=6: inertia=48,803, silhouette=0.0992
  k=7: inertia=47,585, silhouette=0.0991
  k=8: inertia=46,479, silhouette=0.1025
  k=9: inertia=45,236, silhouette=0.1099
  k=10: inertia=44,749, silhouette=0.0987
In [75]:
ks = list(k_range)

# Elbow plot
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(ks, inertias, 'o-', color='#4A7FA5', linewidth=2, markersize=8)
ax.axvline(5, color='#C0554E', linestyle='--', linewidth=1.5, label='k = 5 (selected)')
ax.set_title('Elbow Method - Inertia vs. k', fontweight='bold', pad=25, fontsize=20)
ax.set_xlabel('Number of clusters (k)', labelpad=15, fontsize=16)
ax.set_ylabel('Inertia (within-cluster sum of squares)', labelpad=15, fontsize=16)
ax.set_xticks(ks)
ax.tick_params(axis='x', labelsize=13)
ax.tick_params(axis='y', labelsize=13)
ax.legend(fontsize=13)
plt.tight_layout()
plt.savefig('00 - elbow.png', dpi=500, bbox_inches='tight')
plt.show()
No description has been provided for this image

5. K-Means Clustering (k=5)¶

In [11]:
#Fit final K-Means model with k=5 and assign cluster labels.

kmeans = KMeans(n_clusters=5, random_state=42, n_init=10)
kmeans.fit(X)

# Assign labels and reorder clusters by size (largest = Cluster 0)
clu['cluster'] = kmeans.labels_
order   = clu['cluster'].value_counts().index.tolist()
cmap    = {old: new for new, old in enumerate(order)}
clu['cluster'] = clu['cluster'].map(cmap)

print('Cluster sizes:')
for i, row in clu['cluster'].value_counts().sort_index().items():
    print(f'  Cluster {i} ({CLUSTER_NAMES[i]}): {row:,} customers ({row/len(clu)*100:.1f}%)')
Cluster sizes:
  Cluster 0 (C0: MSP Direct Bookers): 4,127 customers (27.3%)
  Cluster 1 (C1: MSP Deal Seekers): 3,843 customers (25.4%)
  Cluster 2 (C2: Inbound Visitors): 2,436 customers (16.1%)
  Cluster 3 (C3: Loyal Frequent Fliers): 2,385 customers (15.7%)
  Cluster 4 (C4: Returning Home Fliers): 2,353 customers (15.5%)

6. Cluster Summary Table¶

In [12]:
profile_cols = {
    'avg_amt':                              'Avg Ticket Amount',
    'round_trip':                           'Round Trip Rate',
    'group_size':                           'Avg Group Size',
    'days_pre_booked':                      'Days Pre-Booked',
    'BookingChannel_SCA_Website_Booking':   'SCA Website %',
    'BookingChannel_Outside_Booking':       'Outside/3rd Party %',
    'BookingChannel_Tour_Operator_Portal':  'Tour Operator %',
    'UflyMemberStatus_non-ufly':            'Non-Ufly Member %',
    'UflyMemberStatus_Standard':            'Ufly Standard %',
    'UflyMemberStatus_Elite':               'Ufly Elite %',
    'age_group_18-24':                      'Age 18-24 %',
    'age_group_35-54':                      'Age 35-54 %',
    'age_group_55+':                        'Age 55+ %',
    'seasonality_Q1':                       'Q1 (Jan-Mar) %',
    'seasonality_Q4':                       'Q4 (Oct-Dec) %',
    'true_origins_ori_MSP':                 'Origin: MSP %',
    'true_destination_dest_MSP':            'Destination: MSP %',
}

# Compute means per cluster
summary = clu.groupby('cluster')[[c for c in profile_cols.keys() if c in clu.columns]].mean()
summary.insert(0, 'n_customers', clu.groupby('cluster').size())
summary.insert(1, 'pct_of_total', (clu.groupby('cluster').size() / len(clu) * 100).round(1))
summary = summary.rename(columns=profile_cols)
summary.index = [f'C{i}: {CLUSTER_NAMES[i].split(": ")[1]}' for i in summary.index]

# Display rounded
display_summary = summary.copy()
for col in display_summary.columns[2:]:
    display_summary[col] = display_summary[col].map(lambda x: f'{x:.3f}')

print(display_summary.to_string())
print('\n* All feature values are normalized (0–1) means of the cluster centroid.')
                           n_customers  pct_of_total Avg Ticket Amount Round Trip Rate Avg Group Size Days Pre-Booked SCA Website % Outside/3rd Party % Tour Operator % Non-Ufly Member % Ufly Standard % Ufly Elite % Age 18-24 % Age 35-54 % Age 55+ % Q1 (Jan-Mar) % Q4 (Oct-Dec) % Origin: MSP % Destination: MSP %
C0: MSP Direct Bookers            4127          27.3             0.053           0.793          0.169           0.085         0.740               0.000           0.126             0.999           0.000        0.001       0.093       0.331     0.256          0.330          0.234         0.907              0.000
C1: MSP Deal Seekers              3843          25.4             0.053           0.794          0.150           0.088         0.000               1.000           0.000             0.998           0.000        0.002       0.111       0.350     0.213          0.263          0.259         0.888              0.001
C2: Inbound Visitors              2436          16.1             0.048           0.417          0.068           0.052         0.000               0.946           0.000             0.949           0.050        0.002       0.174       0.298     0.174          0.155          0.298         0.001              0.950
C3: Loyal Frequent Fliers         2385          15.7             0.060           0.790          0.112           0.098         0.679               0.218           0.013             0.000           0.990        0.010       0.047       0.356     0.395          0.268          0.319         0.946              0.001
C4: Returning Home Fliers         2353          15.5             0.050           0.428          0.106           0.074         0.968               0.000           0.000             0.689           0.306        0.006       0.106       0.275     0.283          0.163          0.290         0.001              0.979

* All feature values are normalized (0–1) means of the cluster centroid.

7. Visualizations¶

7A. Cross-Cluster Comparison: Feature Heatmap¶

In [54]:
hm_data = clu.groupby('cluster')[heatmap_cols].mean()
hm_data.index = [f'Cluster {i}:\n{SHORT_NAMES[i].replace(chr(10), " ")}' for i in hm_data.index]
hm_data.columns = heatmap_labels

fig, ax = plt.subplots(figsize=(20, 10))
sns.heatmap(hm_data, annot=True, fmt='.2f', cmap='RdYlGn', linewidths=0.5, linecolor='white', ax=ax,
            cbar_kws={'label': 'Mean value (0–1)', 'shrink': 0.8, 'pad': 0.02}, annot_kws={'size': 18, 'weight': 'bold'})
ax.set_title('Feature Means Across All 5 Customer Segments', fontsize=24, fontweight='bold', pad=30)
ax.set_xlabel('')
ax.set_ylabel('')
ax.tick_params(axis='x', rotation=30, labelsize=18)
ax.tick_params(axis='y', rotation=0, labelsize=18)
ax.figure.axes[-1].tick_params(labelsize=17)
ax.figure.axes[-1].set_ylabel('Mean value (0–1)', fontsize=17, labelpad=15)
plt.tight_layout(pad=2)
plt.savefig('00 - heatmap_all_clusters.png', dpi=500, bbox_inches='tight')
plt.show()
No description has been provided for this image

7B. Cross-Cluster Comparison: Ticket Price & Booking Lead Time¶

In [107]:
# Average Ticket Amount by Segment
vals   = clu.groupby('cluster')['avg_amt'].mean()
errors = clu.groupby('cluster')['avg_amt'].sem()

fig, ax = plt.subplots(figsize=(10, 6))
bars = ax.bar(range(5), vals, color=CLUSTER_COLORS, edgecolor='white', linewidth=0.8)

ax.set_xticks(range(5))
ax.set_xticklabels(FULL_NAMES, fontsize=13)
ax.set_title('Average Ticket Amount by Segment', fontweight='bold', pad=20, fontsize=18)
ax.set_ylabel('Normalized value (0-1)', labelpad=10, fontsize=14)
ax.tick_params(axis='y', labelsize=13)
for bar, val in zip(bars, vals):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.001,
            f'{val:.3f}', ha='center', va='bottom', fontsize=13, fontweight='bold')

plt.tight_layout()
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.savefig('00 - avg_ticket_amount.png', dpi=500, bbox_inches='tight')
plt.show()
No description has been provided for this image
In [106]:
# Booking Lead Time by Segment
vals   = clu.groupby('cluster')['days_pre_booked'].mean()
errors = clu.groupby('cluster')['days_pre_booked'].sem()

fig, ax = plt.subplots(figsize=(10, 6))
bars = ax.bar(range(5), vals, color=CLUSTER_COLORS, edgecolor='white', linewidth=0.8)

ax.set_xticks(range(5))
ax.set_xticklabels(FULL_NAMES, fontsize=13)
ax.set_title('Booking Lead Time by Segment', fontweight='bold', pad=20, fontsize=18)
ax.set_ylabel('Normalized days pre-booked (0-1)', labelpad=10, fontsize=14)
ax.tick_params(axis='y', labelsize=13)
for bar, val in zip(bars, vals):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.001,
            f'{val:.3f}', ha='center', va='bottom', fontsize=13, fontweight='bold')

plt.tight_layout()
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.savefig('00 - booking_lead_time.png', dpi=500, bbox_inches='tight')
plt.show()
No description has been provided for this image

7C. Booking Channel Split by Segment¶

In [137]:
# Stacked bar chart showing booking channel composition for each cluster.

FULL_NAMES = [
    'C0: MSP Direct\nBookers',
    'C1: MSP Deal\nSeekers',
    'C2: Inbound\nVisitors',
    'C3: Loyal Frequent\nFliers',
    'C4: Returning Home\nFliers'
]

# Stacked bar chart
fig, ax = plt.subplots(figsize=(9, 6))
bottom = np.zeros(5)
for col, label, color in zip(channel_cols_plot, ch_labels, ch_colors):
    vals = ch_data[col].values
    ax.bar(range(5), vals, bottom=bottom, label=label, color=color, edgecolor='white', linewidth=0.6)
    for i, (v, b) in enumerate(zip(vals, bottom)):
        if v > 0.07:
            ax.text(i, b + v/2, f'{v:.0%}', ha='center', va='center', fontsize=12,
                    color='white', fontweight='bold')
    bottom += vals

ax.set_xticks(range(5))
ax.set_xticklabels(FULL_NAMES, fontsize=12)
ax.set_title('Booking Channel Composition by Customer Segment', fontweight='bold', fontsize=14, pad=30)
ax.set_ylabel('Proportion of bookings', fontsize=12, labelpad=10)
ax.tick_params(axis='y', labelsize=12)
ax.tick_params(axis='x', labelsize=12)
ax.legend(title='Booking Channel', title_fontsize=12, ncol=3, frameon=False, fontsize=12,
          loc='upper center', bbox_to_anchor=(0.5, -0.15))
plt.tight_layout()
plt.savefig('00 - stacked_bar_channels.png', dpi=500, bbox_inches='tight')
plt.show()
No description has been provided for this image

7D. Ufly Loyalty Status by Segment¶

In [134]:
FULL_NAMES = [
    'C0: MSP Direct\nBookers',
    'C1: MSP Deal\nSeekers',
    'C2: Inbound\nVisitors',
    'C3: Loyal Frequent\nFliers',
    'C4: Returning Home\nFliers'
]

# Grouped bar chart
x = np.arange(5)
width = 0.25
fig, ax = plt.subplots(figsize=(9, 6))

for i, (col, label, color) in enumerate(zip(loyalty_cols_plot, ly_labels, ly_colors)):
    bars = ax.bar(x + (i - 1) * width, ly_data[col], width, label=label,
                  color=color, edgecolor='white', linewidth=0.8)

ax.set_xticks(x)
ax.set_xticklabels(FULL_NAMES, fontsize=12)
ax.set_title('Ufly Loyalty Program Status by Customer Segment', fontweight='bold', fontsize=14, pad=20)
ax.set_ylabel('Proportion of customers', fontsize=12, labelpad=10)
ax.tick_params(axis='y', labelsize=12)
ax.tick_params(axis='x', labelsize=12)
ax.legend(title='Ufly Status', ncol=3, frameon=False, title_fontsize=12, fontsize=12,
          loc='upper center', bbox_to_anchor=(0.45, -0.20))
ax.set_ylim(0, 1.1)
plt.tight_layout()
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.savefig('00 - grouped_bar_loyalty.png', dpi=500, bbox_inches='tight')
plt.show()
No description has been provided for this image

7E. Age Distribution by Segment¶

In [129]:
age_cols   = ['age_group_0-17', 'age_group_18-24', 'age_group_25-34', 'age_group_35-54', 'age_group_55+']
age_labels = ['0-17', '18-24', '25-34', '35-54', '55+']
age_colors = ['#4A7FA5', '#4E9A72', '#D4895A', '#7B6B9E', '#C0554E']

age_data = clu.groupby('cluster')[age_cols].mean()

x = np.arange(5)
width = 0.15
fig, ax = plt.subplots(figsize=(8, 5))

for i, (col, label, color) in enumerate(zip(age_cols, age_labels, age_colors)):
    offset = (i - 2) * width
    ax.bar(x + offset, age_data[col], width, label=f'Age {label}',
           color=color, edgecolor='white', linewidth=0.8)

ax.set_xticks(x)
ax.set_xticklabels(FULL_NAMES, fontsize=10)
ax.set_title('Age Group Distribution by Customer Segment', fontweight='bold', fontsize=14, pad=20)
ax.set_ylabel('Proportion of customers', fontsize=12, labelpad=10)
ax.tick_params(axis='y', labelsize=12)
ax.legend(title='Age Group', ncol=5, frameon=False, title_fontsize=12, fontsize=12, loc='upper center', bbox_to_anchor=(0.5, -0.25))
plt.tight_layout()
plt.savefig('00 - grouped_bar_age.png', dpi=500, bbox_inches='tight')
plt.grid(axis='y', linestyle='--', alpha=0.5)
plt.show()
No description has been provided for this image

7F. Travel Seasonality by Segment (Line Chart)¶

In [100]:
#season_cols   = ['seasonality_Q1', 'seasonality_Q2', 'seasonality_Q3', 'seasonality_Q4']
season_labels = ['Q1\n(Jan-Mar)', 'Q2\n(Apr-Jun)', 'Q3\n(Jul-Sep)', 'Q4\n(Oct-Dec)']

season_data = clu.groupby('cluster')[season_cols].mean()

fig, ax = plt.subplots(figsize=(8, 6))
for i in range(5):
    ax.plot(season_labels, season_data.iloc[i].values, 'o-', color=CLUSTER_COLORS[i], label=CLUSTER_NAMES[i], linewidth=2, markersize=5)

ax.set_title('Travel Seasonality by Customer Segment', fontweight='bold', fontsize=14, pad=20)
ax.set_ylabel('Proportion of bookings', labelpad=15, fontsize=12)
ax.tick_params(axis='x', labelsize=12)
ax.tick_params(axis='y', labelsize=12)
ax.legend(title='Cluster', title_fontsize=12, fontsize=12, frameon=False, loc='upper center', bbox_to_anchor=(0.5, -0.2), ncol=3)
ax.grid(axis='y', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.savefig('00 - line_seasonality.png', dpi=500, bbox_inches='tight')
plt.show()
No description has been provided for this image

7G. Box Plots — Ticket Price Distribution by Segment¶

In [105]:
fig, ax = plt.subplots(figsize=(9, 5))
ax.set_ylim(0, 0.35)

data_by_cluster = [clu[clu['cluster'] == i]['avg_amt'].values for i in range(5)]
bp = ax.boxplot(data_by_cluster, patch_artist=True, notch=False,
                medianprops={'color': 'white', 'linewidth': 2})
for patch, color in zip(bp['boxes'], CLUSTER_COLORS):
    patch.set_facecolor(color)
    patch.set_alpha(0.85)
for element in ['whiskers', 'caps', 'fliers']:
    for item, color in zip(bp[element], CLUSTER_COLORS * 2):
        item.set_color(color)

ax.set_xticks(range(1, 6))
ax.set_xticklabels(FULL_NAMES, fontsize=12)
ax.tick_params(axis='y', labelsize=12)
ax.set_title('Distribution of Ticket Prices by Customer Segment', fontweight='bold', fontsize=16, pad=30)
ax.set_ylabel('Avg Ticket Amount (normalized 0-1)', labelpad=15, fontsize=12)
ax.grid(axis='y', linestyle='--', alpha=0.5)
plt.tight_layout()
plt.savefig('00 - boxplot_ticket_price.png', dpi=500, bbox_inches='tight')
plt.show()
No description has been provided for this image

7J. Top Destinations by Segment¶

In [121]:
dest_cols_all = [c for c in clu.columns if c.startswith('true_destination_dest_')]

single_line_names = ['MSP Direct Bookers', 'MSP Deal Seekers', 'Inbound Visitors', 'Loyal Frequent Fliers', 'Returning Home Fliers']

fig, axes = plt.subplots(3, 2, figsize=(10, 14))
axes = axes.flatten()

for i in range(5):
    ax = axes[i]
    sub    = clu[clu['cluster'] == i][dest_cols_all].mean().sort_values(ascending=False).head(5)
    labels = [c.replace('true_destination_dest_', '') for c in sub.index]
    bars   = ax.bar(labels, sub.values, color=CLUSTER_COLORS[i], edgecolor='white', linewidth=0.8)

    for bar in bars:
        if bar.get_height() > 0.001:
            ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 0.003,
                    f'{bar.get_height():.3f}', ha='center', va='bottom', fontsize=12)

    ax.set_title(f'Cluster {i}: {single_line_names[i]}', fontsize=14, fontweight='bold',
                 color='black', pad=12)
    ax.set_ylabel('Mean proportion' if i % 2 == 0 else '', labelpad=12, fontsize=12)
    ax.set_xlabel('Destination', labelpad=12, fontsize=12)
    ax.tick_params(axis='x', labelsize=12, pad=6)
    ax.tick_params(axis='y', labelsize=12, pad=6)

axes[5].set_visible(False)

fig.suptitle('Top 5 Destinations by Customer Segment', fontsize=16, fontweight='bold', y=1.02)
plt.subplots_adjust(hspace=0.4, top=0.96, bottom=0.04, left=0.08, right=0.97)
plt.savefig('00 - top_destinations_by_segment.png', dpi=500, bbox_inches='tight')
plt.show()
No description has been provided for this image