Cut inference cost 62% on a 9-figure ARR workload.

Sector: AI / Infrastructure

Timeframe: 14 weeks · Q1 2025

Team: 4 operators

Scale: ~400 GPU nodes

§ 01 Challenge

Three years of growth with no cost visibility.

Kortex Labs runs proprietary language model inference across a fleet of roughly 400 GPU nodes. Costs had grown 3× over 18 months as the model fleet expanded — new model sizes, new endpoints, new teams deploying without coordination.

The engineering team was capable. The problem was structural: there was no systematic view of which endpoints, model sizes, or call patterns were driving spend. Every cost-reduction attempt had been local — one team, one endpoint, one sprint — and none had held.

The mandate had three parts: build a cost attribution layer, identify the highest-leverage intervention points, and implement the changes without increasing latency on the user-facing inference path.

§ 02 Approach

Attribution before optimization.

Weeks 1–2 were diagnostic. We instrumented every inference endpoint with cost-per-call tracking — model size, batch configuration, token throughput, hardware utilization — and built a dashboard that showed the cost breakdown in real time. The team hadn't seen this view before.
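A minimal sketch of what per-call cost tracking can look like, assuming cost is attributed as GPU time multiplied by an hourly rate. The `CallRecord` fields, the function names, and the rate are hypothetical illustrations, not the schema used in the engagement.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    """One inference call, as an attribution layer might log it (hypothetical schema)."""
    endpoint: str
    model_size: str
    gpu_seconds: float  # GPU time attributed to this call
    tokens: int         # tokens processed (prompt + completion)


def cost_per_call(record: CallRecord, gpu_hourly_rate_usd: float) -> float:
    """Cost of one call: GPU time consumed times the hourly rate."""
    return record.gpu_seconds / 3600.0 * gpu_hourly_rate_usd


def cost_per_1k_tokens(record: CallRecord, gpu_hourly_rate_usd: float) -> float:
    """Normalize cost by token volume so endpoints of different sizes compare fairly."""
    if record.tokens == 0:
        return 0.0
    return cost_per_call(record, gpu_hourly_rate_usd) / record.tokens * 1000.0
```

Normalizing by tokens as well as by call matters because a batch endpoint and a chat endpoint can have the same cost per call but very different cost per unit of work.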

The finding was immediate: 60% of compute spend was coming from 12% of calls. Specifically, bulk async jobs — nightly summaries, batch enrichment, background classification — were running through the same real-time endpoint as latency-sensitive user requests. Same model, same priority queue, same GPU allocation.
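The concentration check behind a finding like this can be sketched as a comparison of each call class's share of calls with its share of spend. The function name and the `(call_class, cost_usd)` input shape are assumptions for illustration, and the numbers in the usage example are toy values chosen only to reproduce the 12% / 60% proportions, not the client's data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def call_and_spend_share(
    calls: Iterable[Tuple[str, float]],
) -> Dict[str, Tuple[float, float]]:
    """Map each call class to (share of calls, share of spend).

    `calls` yields (call_class, cost_usd) pairs; both names are hypothetical.
    """
    counts: Dict[str, int] = defaultdict(int)
    spend: Dict[str, float] = defaultdict(float)
    for call_class, cost_usd in calls:
        counts[call_class] += 1
        spend[call_class] += cost_usd

    total_calls = sum(counts.values())
    total_spend = sum(spend.values())
    if total_calls == 0 or total_spend == 0:
        return {}
    return {
        cls: (counts[cls] / total_calls, spend[cls] / total_spend)
        for cls in counts
    }


# Toy data: 12 expensive async calls and 88 cheap real-time calls.
toy_calls = [("async", 11.0)] * 12 + [("realtime", 1.0)] * 88
shares = call_and_spend_share(toy_calls)
```

A class whose spend share is several times its call share is the natural first candidate for a separate path.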

The intervention was architectural, not algorithmic. We separated the async path, deployed quantized model variants calibrated for throughput rather than P99 latency, and introduced a spot-instance tier for all non-latency-critical workloads. No model retraining. No changes to the user-facing path.

§ 03 What we shipped

A tiered inference architecture.

The deliverable was a two-tier inference system: a real-time path with reserved capacity and the original model, and an async path with dynamic batching, quantized variants, and spot-instance scheduling. Routing logic was added upstream, classifying each call by latency requirement before it reached the model layer.
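The upstream routing step can be sketched as a small classifier that reads a latency requirement from each request and picks a tier. Everything here is an assumption for illustration: the `deadline_ms` field, the threshold value, and the rule that requests with no declared deadline stay on the real-time path.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Tier(Enum):
    REALTIME = "realtime"  # reserved capacity, original model
    ASYNC = "async"        # dynamic batching, quantized variants, spot instances


@dataclass(frozen=True)
class InferenceRequest:
    endpoint: str
    deadline_ms: Optional[int]  # None means the caller declared no latency requirement


# Hypothetical cutoff: anything that must answer within 2 seconds is real-time.
REALTIME_DEADLINE_MS = 2000


def route(request: InferenceRequest) -> Tier:
    """Classify a call by latency requirement before it reaches the model layer."""
    if request.deadline_ms is None:
        # Fail safe: an unlabeled call is treated as latency-sensitive.
        return Tier.REALTIME
    if request.deadline_ms <= REALTIME_DEADLINE_MS:
        return Tier.REALTIME
    return Tier.ASYNC
```

Defaulting unlabeled traffic to the real-time tier trades some cost for safety: a misrouted batch job wastes money, while a misrouted user request breaks the latency guarantee.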

We also shipped the cost attribution dashboard as a permanent operational tool — wired into the team's existing monitoring stack, with per-team spend breakdowns and weekly cost reports triggered automatically. The intent: the visibility that found the problem should outlast the engagement that fixed it.
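A per-team weekly breakdown of this kind reduces to grouping attributed costs by ISO week and team. The sketch below assumes records arrive as `(day, team, cost_usd)` tuples; that shape and the function name are hypothetical.

```python
from collections import defaultdict
from datetime import date
from typing import Dict, Iterable, Tuple


def weekly_spend_by_team(
    records: Iterable[Tuple[date, str, float]],
) -> Dict[Tuple[int, int, str], float]:
    """Sum cost per (ISO year, ISO week, team)."""
    totals: Dict[Tuple[int, int, str], float] = defaultdict(float)
    for day, team, cost_usd in records:
        iso_year, iso_week, _weekday = day.isocalendar()
        totals[(iso_year, iso_week, team)] += cost_usd
    return dict(totals)
```

Keying on the ISO year as well as the week keeps totals from colliding across a year boundary.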

§ 04 Outcome

62% cost reduction. Latency unchanged.

Total inference cost fell 62% within six weeks of deployment. P95 latency on the real-time path was unchanged. P50 latency on the async path improved because batching reduced queue contention.

Three weeks after delivery, the system absorbed a 40% traffic spike — a product launch — without intervention. The spot-instance tier scaled automatically; the cost attribution dashboard flagged the spike and confirmed it resolved within the expected window.

Composite engagement based on representative project structures. Client and outcome details are illustrative.


