Context
Halo Goods sells direct-to-consumer with revenue concentrated in flash sales and Black Friday week. The previous Black Friday, a scaling failure took checkout down for 90 minutes at peak.
A three-person platform team managed capacity with spreadsheets and manual GKE node-pool resizes.
The challenge
Flash-sale traffic arrives in minutes, not hours — reactive autoscaling alone couldn't keep p95 latency inside budget during the ramp.
Infra cost per order had risen every quarter; the easy fix (permanent over-provisioning) was exactly what the CFO vetoed.
Approach
Load modeling & game days
Weeks 1–3Replayed historical sale traffic against a production-shaped environment, found the four bottlenecks (checkout DB connections chief among them), and fixed them before touching the platform.
Self-service platform
Weeks 4–9Crossplane compositions gave product teams one-line provisioning for services, queues, and databases — with sizing, policies, and observability baked in. Tekton golden-path pipelines standardized deploys.
Capacity intelligence
Weeks 10–12Thanos federated metrics across clusters; capacity forecasting drives scheduled pre-scaling ahead of announced sales, with HPA handling the unannounced spikes.
Architecture
Results
Black Friday peak vs. baseline, p95 latency inside budget throughout.
Across Black Friday week and every flash sale since.
Spot pools + right-sizing + scale-to-zero for off-peak services.
Down from 11 minutes — measured from spike start to capacity-ready.
“Last year we watched dashboards and prayed. This year the platform pre-scaled itself and the on-call engineer watched a movie.”