Scaled Black Friday traffic 14× with zero incidents on a self-service platform.

A D2C retailer that lost 90 minutes of Black Friday to an outage rebuilt on GKE with Crossplane self-service — and absorbed 14× peak the next year without a single P0.

14× peak traffic absorbed
0 P0 incidents
37% lower infra cost / order

Halo Goods sells direct-to-consumer with revenue concentrated in flash sales and Black Friday week. The previous Black Friday, a scaling failure took checkout down for 90 minutes at peak.

A three-person platform team managed capacity with spreadsheets and manual GKE node-pool resizes.

Flash-sale traffic arrives in minutes, not hours — reactive autoscaling alone couldn't keep p95 latency inside budget during the ramp.

Infra cost per order had risen every quarter; the easy fix (permanent over-provisioning) was exactly what the CFO vetoed.

  1. Load modeling & game days

    Weeks 1–3

    Replayed historical sale traffic against a production-shaped environment, found four bottlenecks (checkout DB connections chief among them), and fixed them before touching the platform.

  2. Self-service platform

    Weeks 4–9

    Crossplane compositions gave product teams one-line provisioning for services, queues, and databases — with sizing, policies, and observability baked in. Tekton golden-path pipelines standardized deploys.

  3. Capacity intelligence

    Weeks 10–12

    Thanos federates metrics across clusters; capacity forecasts drive scheduled pre-scaling ahead of announced sales, while HPA handles the unannounced spikes.
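How scheduled pre-scaling and HPA split the work can be sketched with a little arithmetic. The first function below is the core rule from the Kubernetes HPA documentation (tolerance and stabilization windows are ignored here). The pre-scale helpers, the 25% headroom, the 30-minute lead time, and every number in the usage lines are illustrative assumptions, not Halo Goods' actual settings.

```python
import math
from datetime import datetime, timedelta


def hpa_desired_replicas(current_replicas: int, current_metric: float,
                         target_metric: float) -> int:
    """Core HPA rule: ceil(currentReplicas * currentMetric / targetMetric)."""
    return math.ceil(current_replicas * current_metric / target_metric)


def prescale_min_replicas(forecast_peak_rps: float, rps_per_pod: float,
                          headroom: float = 0.25) -> int:
    """Replica floor that fits the forecast peak plus headroom (hypothetical helper)."""
    return math.ceil(forecast_peak_rps * (1 + headroom) / rps_per_pod)


def prescale_window(sale_start: datetime,
                    lead: timedelta = timedelta(minutes=30),
                    hold: timedelta = timedelta(hours=2)):
    """When to raise the replica floor and when to release it (assumed policy)."""
    return sale_start - lead, sale_start + hold


if __name__ == "__main__":
    # Announced sale: raise the floor before traffic arrives.
    floor = prescale_min_replicas(forecast_peak_rps=14_000, rps_per_pod=100)
    start, end = prescale_window(datetime(2024, 11, 29, 9, 0))
    print(f"hold at least {floor} replicas from {start} to {end}")

    # Unannounced spike: HPA reacts once the metric exceeds its target.
    print(hpa_desired_replicas(current_replicas=10, current_metric=90,
                               target_metric=60))
```

The point of the split is that the floor is already in place when an announced sale starts, so HPA only has to cover the gap between forecast and reality rather than the whole ramp.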

Edge: Cloud CDN, Global LB, Cloud Armor
Delivery: Tekton pipelines, Golden paths, Crossplane APIs
Platform: GKE regional, HPA + pre-scale, Spot node pools
Observability: Thanos, Capacity forecasts, SLO burn alerts
Self-service provisioning via Crossplane; scheduled pre-scaling and HPA together absorb flash-sale ramps.
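For a sense of what a product team's side of self-service provisioning might look like, here is a minimal sketch that builds a Crossplane claim as a plain Python dict and prints it as JSON, which `kubectl apply -f -` accepts. The API group `platform.example.org`, the kind `PostgresInstance`, the `size` parameter, and the label values are all hypothetical; the real names would be whatever the platform team's own resource definitions declare.

```python
import json

# Hypothetical T-shirt sizes; a composition would map these to real settings.
SIZES = {"small", "medium", "large"}


def database_claim(name: str, namespace: str, size: str = "small") -> dict:
    """Build a claim manifest for a hypothetical platform-defined database API."""
    if size not in SIZES:
        raise ValueError(f"size must be one of {sorted(SIZES)}, got {size!r}")
    return {
        "apiVersion": "platform.example.org/v1alpha1",  # hypothetical group/version
        "kind": "PostgresInstance",                     # hypothetical claim kind
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # Parameters are whatever the platform team chose to expose.
            "parameters": {"size": size},
            # Selects which composition satisfies the claim (assumed label).
            "compositionSelector": {"matchLabels": {"provider": "gcp"}},
        },
    }


if __name__ == "__main__":
    print(json.dumps(database_claim("checkout-db", "checkout", "medium"), indent=2))
```

The team states a name and a size; sizing details, policies, and observability live in the composition behind the claim, which is what keeps the request short.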
14× peak absorbed

Black Friday peak vs. baseline, p95 latency inside budget throughout.

0 P0 incidents

Across Black Friday week and every flash sale since.

37% lower infra cost / order

Spot pools + right-sizing + scale-to-zero for off-peak services.

2 min p95 scale-out

Down from 11 minutes — measured from spike start to capacity-ready.
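A p95 scale-out figure like this one can be derived from pairs of timestamps, one pair per spike. The sketch below assumes each event is recorded as (spike start, capacity ready) in seconds and uses the nearest-rank percentile definition; both the event format and the sample data are illustrative assumptions, not the team's actual measurement pipeline.

```python
def scale_out_durations(events):
    """Seconds from spike start to capacity-ready for each (start, ready) pair."""
    return [ready - start for start, ready in events]


def percentile_nearest_rank(values, pct: int):
    """Nearest-rank percentile: the value at rank ceil(pct * n / 100)."""
    if not values:
        raise ValueError("need at least one measurement")
    ordered = sorted(values)
    rank = -(-pct * len(ordered) // 100)  # integer ceiling, avoids float rounding
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    # Made-up sample: 20 spikes, each taking between 65 and 160 seconds.
    events = [(0, 60 + 5 * i) for i in range(1, 21)]
    p95 = percentile_nearest_rank(scale_out_durations(events), 95)
    print(f"p95 scale-out: {p95} s")
```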

Last year we watched dashboards and prayed. This year the platform pre-scaled itself and the on-call engineer watched a movie.

Head of Engineering, Halo Goods
GKE, Crossplane, Thanos, Tekton