The cost & latency cube: tokens × price × concurrency — CCA Advanced (Free Preview) — Claude Cert Academy

Every Claude bill is the product of three independent dimensions — optimize them in that order

At single-call scale, cost is a curiosity. At 50M calls per day, it becomes the dominant architectural constraint. The decomposition you need: total cost = tokens consumed × per-token price × concurrency overhead. Each dimension has different levers; confusing them produces optimization work that looks productive but moves the wrong needle.
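The decomposition can be written as a small cost model. This is a minimal sketch, not a billing formula: the function name, the single blended price per million tokens, and the idea of expressing concurrency overhead as a multiplier (1.0 means no waste) are all assumptions made for illustration.

```python
def daily_cost_usd(calls_per_day, tokens_per_call, usd_per_million_tokens,
                   overhead_factor=1.0):
    """Hypothetical model: tokens consumed x per-token price x concurrency overhead.

    usd_per_million_tokens is an assumed blended input/output price.
    overhead_factor is an assumed multiplier for retries and other waste.
    """
    tokens_consumed = calls_per_day * tokens_per_call
    return tokens_consumed / 1_000_000 * usd_per_million_tokens * overhead_factor


# Illustrative numbers only; none of these are real prices.
baseline = daily_cost_usd(50_000_000, 6_000, 3.0, overhead_factor=1.10)

# Each lever acts on exactly one factor of the product.
fewer_tokens = daily_cost_usd(50_000_000, 3_000, 3.0, overhead_factor=1.10)
cheaper_model = daily_cost_usd(50_000_000, 6_000, 1.5, overhead_factor=1.10)
less_waste = daily_cost_usd(50_000_000, 6_000, 3.0, overhead_factor=1.00)
```

Keeping the three factors as separate arguments makes it obvious which lever a proposed optimization is actually pulling.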

Tokens consumed is the volume dimension. Prompt size and output length compound with call volume. A 5K-token system prompt sent on every request to a 50M-call/day app sends 250B tokens through the API every day (roughly 7.5T per month) just for boilerplate. Prompt caching is the primary lever here.
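The boilerplate arithmetic, and a rough estimate of what caching does to it, can be sketched as follows. The cache multipliers below (a cheaper rate for cache reads, a surcharge for cache writes) are illustrative assumptions; check current pricing before relying on them. The function name and the hit-rate parameter are hypothetical.

```python
SYSTEM_PROMPT_TOKENS = 5_000
CALLS_PER_DAY = 50_000_000

# Raw boilerplate volume: the system prompt is resent on every call.
boilerplate_tokens_per_day = SYSTEM_PROMPT_TOKENS * CALLS_PER_DAY


def effective_prefix_tokens(prefix_tokens, calls, hit_rate,
                            read_multiplier=0.1, write_multiplier=1.25):
    """Billing-equivalent tokens for a cached prompt prefix.

    Assumed model: a cache hit is billed at read_multiplier x the normal
    input price, a miss (cache write) at write_multiplier x. Both defaults
    are illustrative assumptions, not quoted prices.
    """
    hits = calls * hit_rate
    misses = calls - hits
    return prefix_tokens * (hits * read_multiplier + misses * write_multiplier)


cached_equivalent = effective_prefix_tokens(
    SYSTEM_PROMPT_TOKENS, CALLS_PER_DAY, hit_rate=0.99
)
savings_fraction = 1 - cached_equivalent / boilerplate_tokens_per_day
```

Under these assumptions the savings depend almost entirely on the hit rate, so a prefix that changes between requests (timestamps, per-user text placed early in the prompt) undermines the lever.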

Per-token price is the model-selection dimension. Haiku is roughly 1/12th the cost of Opus at similar token counts, though the exact ratio varies by model generation and current list pricing. Most production workloads mix high-judgment calls (Opus) with routine calls (Haiku); the question is whether the routing logic is explicit or accidental. The Batch API offers a 50% discount in exchange for asynchronous results.
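Explicit routing can be as simple as a function that every call site goes through. The sketch below is one possible shape, not a recommended policy: the task kinds, the `Task` type, and the relative prices (chosen only to mirror the rough 1/12 ratio above) are hypothetical, and the tier names are labels rather than real model identifiers.

```python
from typing import NamedTuple

# Relative price units per million tokens (assumed, mirrors the ~1/12 ratio).
PRICE_PER_MTOK = {"haiku": 1.0, "opus": 12.0}
BATCH_DISCOUNT = 0.5  # 50% discount for asynchronous batch processing

# Hypothetical task kinds that justify the expensive tier.
HIGH_JUDGMENT_KINDS = {"legal_review", "code_migration"}


class Task(NamedTuple):
    kind: str
    tokens: int
    needs_realtime: bool = True


def route(task):
    """Return (tier, use_batch) from explicit, auditable rules."""
    tier = "opus" if task.kind in HIGH_JUDGMENT_KINDS else "haiku"
    use_batch = not task.needs_realtime
    return tier, use_batch


def task_cost(task):
    """Cost in the relative units defined by PRICE_PER_MTOK."""
    tier, use_batch = route(task)
    cost = task.tokens / 1_000_000 * PRICE_PER_MTOK[tier]
    return cost * BATCH_DISCOUNT if use_batch else cost


routine = Task("classify_ticket", tokens=2_000)
overnight = Task("legal_review", tokens=1_000_000, needs_realtime=False)
```

Here `route(routine)` selects the cheap tier synchronously, while `overnight` goes to the expensive tier through batch. The point of the design is that the decision lives in one place where it can be reviewed and measured, rather than being implied by whichever model each caller happened to hard-code.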

Concurrency overhead is the orchestration tax. Aggressive retries multiply token consumption on failed calls. Sequential calls that could be parallelized stretch latency. The cost shows up as both wasted tokens and missed throughput against rate limits.
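Both halves of the orchestration tax can be estimated before touching production code. This sketch assumes that failures are independent, that every failed attempt consumes the full token budget of the call (for example a timeout after generation has started), and that per-call latencies are known in advance. The function names are hypothetical.

```python
import heapq


def expected_attempts(failure_rate, max_retries):
    """Expected attempts per request when failures are retried.

    Attempt k+1 happens only if the first k attempts all failed, so the
    expectation is the sum of failure_rate**k for k = 0..max_retries.
    """
    return sum(failure_rate ** k for k in range(max_retries + 1))


def sequential_latency(latencies):
    """Total wall-clock time when calls run one after another."""
    return sum(latencies)


def parallel_latency(latencies, max_concurrency):
    """Wall-clock time when each call starts on the earliest free worker."""
    workers = [0.0] * max_concurrency  # time at which each worker is free
    heapq.heapify(workers)
    for latency in latencies:
        free_at = heapq.heappop(workers)
        heapq.heappush(workers, free_at + latency)
    return max(workers)


# Token multiplier from retries under the stated assumptions.
token_multiplier = expected_attempts(failure_rate=0.2, max_retries=3)

# Three independent calls (seconds), run serially versus two at a time.
call_latencies = [1.0, 2.0, 3.0]
serial = sequential_latency(call_latencies)
bounded = parallel_latency(call_latencies, max_concurrency=2)
```

The `max_concurrency` bound stands in for whatever rate limit applies: parallelizing reduces latency only up to the point where the limit, not the code, becomes the bottleneck.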

Optimization order matters: first reduce tokens (caching, prompt compression), then route to cheaper models, then fix concurrency. Doing them in the wrong order produces gains that get erased by the next layer.
