Context
Onyx runs an equities and derivatives trading platform used by institutional desks. Two hundred engineers across 14 teams shipped through a single Jenkins deploy train: every release was a three-hour, all-hands affair scheduled outside market hours.
Release freezes around trading windows meant features queued for days. Rollbacks took 45 minutes — an eternity when a regression touches order flow.
The challenge
Regulatory change-control required a full audit trail for every production change, which the team had bolted onto Jenkins with manual ticketing. Config drift between environments across the 40-service fleet made every deploy a small gamble.
The hard constraint: migrate without a single minute of downtime during market hours, and make the audit trail better, not worse.
Approach
Discovery
Weeks 1–2: Dependency-mapped all 40 services, audited the deploy train end-to-end, and scored each service for migration risk. Output: a sequenced cutover plan the compliance team signed off on.
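One way to turn a dependency map and per-service risk scores into a sequenced plan is a topological sort with risk as the tie-breaker. The sketch below is illustrative only: the service names, the risk scores, and the rule "dependencies first, lowest risk first within a wave" are assumptions, not the plan Onyx's compliance team approved.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: service -> services it depends on.
DEPENDS_ON = {
    "order-gateway": {"risk-engine", "market-data"},
    "risk-engine": {"market-data"},
    "market-data": set(),
    "reporting": set(),
}

# Hypothetical migration-risk scores (higher means riskier to move).
RISK = {"order-gateway": 9, "risk-engine": 7, "market-data": 5, "reporting": 1}


def cutover_waves(depends_on, risk):
    """Group services into waves.

    A service only appears after everything it depends on has been placed
    in an earlier wave. Within a wave, lower-risk services come first.
    """
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready(), key=lambda name: (risk[name], name))
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(cutover_waves(DEPENDS_ON, RISK), start=1):
        print(f"wave {number}: {', '.join(wave)}")
```

Migrating leaf dependencies before their callers is only one possible ordering; a real plan might also weigh trading calendars and team availability.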
Platform build
Weeks 3–10: Stood up multi-AZ EKS with ArgoCD GitOps. Every change became a pull request, so the git history is the audit trail. Ephemeral preview environments replaced the shared staging bottleneck.
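To illustrate how git history can double as an audit trail, the sketch below turns `git log` output into structured audit records. The record fields, the tab-separated format string, and the function names are assumptions made for illustration; the source does not describe how Onyx exports its trail to auditors.

```python
import subprocess
from dataclasses import dataclass

# Tab-separated fields: commit hash, author name, ISO 8601 author date, subject.
LOG_FORMAT = "%H%x09%an%x09%aI%x09%s"


@dataclass(frozen=True)
class AuditEntry:
    commit: str
    author: str
    timestamp: str
    summary: str


def parse_git_log(output):
    """Parse `git log --pretty=format:LOG_FORMAT` output into audit entries."""
    entries = []
    for line in output.split("\n"):
        if not line.strip():
            continue
        # maxsplit=3 keeps any tabs inside the subject line intact.
        commit, author, timestamp, summary = line.split("\t", 3)
        entries.append(AuditEntry(commit, author, timestamp, summary))
    return entries


def read_audit_trail(repo_path, path="."):
    """Return audit entries for changes under `path` in a local clone (hypothetical helper)."""
    result = subprocess.run(
        ["git", "-C", repo_path, "log", f"--pretty=format:{LOG_FORMAT}", "--", path],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_git_log(result.stdout)
```

Because each record comes straight from a commit, who changed what and when is answered by the repository itself rather than by a separately maintained ticket.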
Mesh & canary cutover
Weeks 11–14: Istio service mesh with mTLS everywhere. Services migrated one at a time behind canary releases: 1% of order flow first, with automated rollback on SLO breach.
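The canary logic described here can be sketched as a small decision loop. Only the 1% first step and the rollback-on-SLO-breach rule come from the text; the later traffic steps, the SLO thresholds, and the `observe` callback are hypothetical. In a real cluster the traffic weights would be applied by the mesh rather than by a Python loop.

```python
from dataclasses import dataclass

# Percent of order flow sent to the new platform at each step.
# Only the 1% first step is from the case study; the rest are assumed.
CANARY_STEPS = [1, 5, 25, 50, 100]


@dataclass
class SloSample:
    error_rate: float      # fraction of failed requests, 0.0 to 1.0
    p99_latency_ms: float  # 99th percentile latency in milliseconds


def slo_breached(sample, max_error_rate=0.001, max_p99_ms=50.0):
    """Return True if either (hypothetical) SLO threshold is exceeded."""
    return sample.error_rate > max_error_rate or sample.p99_latency_ms > max_p99_ms


def run_canary(observe, steps=CANARY_STEPS):
    """Walk through the traffic steps, stopping at the first SLO breach.

    `observe(weight)` is a caller-supplied function that shifts `weight`
    percent of traffic to the canary and returns an SloSample for it.
    Returns ("promoted", final_weight) or ("rolled_back", weight_at_breach).
    """
    for weight in steps:
        sample = observe(weight)
        if slo_breached(sample):
            return ("rolled_back", weight)
    return ("promoted", steps[-1])


if __name__ == "__main__":
    def healthy(weight):
        return SloSample(error_rate=0.0002, p99_latency_ms=12.0)

    print(run_canary(healthy))
```

Starting at 1% limits the blast radius: a regression in the order path is caught while it affects a small slice of flow, and the rollback decision needs no human in the loop.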
Hardening & handover
Weeks 15–16: OpenTelemetry tracing across the order path, error budgets per service, and on-call runbooks. Onyx's platform team ran the final five cutovers solo.
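A per-service error budget is simple arithmetic: the SLO target fixes how many requests may fail in a window, and observed failures draw that allowance down. The sketch below assumes a request-based availability SLO. The function name and the 99.9% example target are illustrative and are not figures from the engagement.

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Compute error-budget figures for one service over one window.

    slo_target is the required success fraction, e.g. 0.999 for 99.9%.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    allowed_failures = (1.0 - slo_target) * total_requests
    remaining = allowed_failures - failed_requests
    # With no traffic there is no budget to consume, so report zero consumption.
    consumed = failed_requests / allowed_failures if allowed_failures else 0.0
    return {
        "allowed_failures": allowed_failures,
        "remaining": remaining,
        "consumed_fraction": consumed,
        "exhausted": remaining < 0,
    }


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 250 failures, 99.9% target.
    print(error_budget(0.999, 1_000_000, 250))
```

A service that has exhausted its budget is a natural candidate for pausing feature deploys until reliability work catches up.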
Architecture
Results
Deploy time: down from 3 hours. Teams deploy independently, during market hours.
Infrastructure cost: reduced through right-sized requests, spot instances for non-prod, and retired duplicate staging fleets.
Zero unplanned downtime in the first 180 days post-cutover.
Rollback time: down from 45 minutes. ArgoCD reverts to the last healthy commit.
“We went from dreading release nights to shipping during market hours. The audit team likes the git trail more than the old ticketing system, too.”