usage.credits_charged IS the truth, and usage.breakdown shows you the per-token math that derived it. This page walks through reading the breakdown end-to-end and reconciling it against the live rate card.
The receipt fields
Every billed response (POST /v1/chat/completions, POST /v1/embeddings, POST /v1/images, POST /v1/videos) carries a usage block. It comes in two shape variants, chat and embedding, corresponding to the different cost classes each surface bills against.
Chat usage block
- prompt_tokens / completion_tokens / total_tokens: OpenAI-shaped token counts. These are the billable counts.
- reasoning_tokens: top-level field present only on responses where the model emitted reasoning (typically when reasoning_effort was set, but it can also appear at the default disabled thinking mode if the model produced internal reasoning anyway). Counts the model’s internal reasoning tokens, separate from completion_tokens (which counts only visible output text). Reasoning is billed at the output rate; the credit subtotal is in breakdown.reasoning_credits.
- credits_charged: the single number you owe for this call. Currency is “credit”; the credit-to-USD conversion is in your team’s billing settings.
- breakdown.input_credits / breakdown.output_credits: credit amounts spent on input tokens vs output tokens; they sum (within 4dp rounding) to credits_charged.
- breakdown.pricing_version: the rate-card version used to compute this charge. See Pricing version pinning below.
Embedding usage block
- prompt_tokens / total_tokens: token counts (embeddings have no output tokens; the two fields are identical and present for OpenAI-shape compatibility).
- credits_charged: single credit amount owed.
- breakdown.input.{text, visual, video}: per-modality credit amounts. They sum (within rounding) to credits_charged. The video / visual fields are always present (zero when unused) so client code can sum them without optional-chaining.
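As a minimal sketch of the "sum without optional-chaining" point, the Python below checks that the per-modality amounts of an embedding usage block add up to credits_charged. The sample values are invented for illustration, and the 0.0001 tolerance is an assumption borrowed from the reconciliation threshold this page quotes elsewhere.

```python
from decimal import Decimal

# Assumed tolerance: one unit in the fourth decimal place.
TOLERANCE = Decimal("0.0001")


def embedding_breakdown_matches(usage: dict) -> bool:
    """Return True if the per-modality credits sum to credits_charged."""
    modalities = usage["breakdown"]["input"]
    # text, visual and video are always present, so plain indexing is safe.
    total = sum(Decimal(str(modalities[k])) for k in ("text", "visual", "video"))
    charged = Decimal(str(usage["credits_charged"]))
    return abs(total - charged) <= TOLERANCE


# Illustrative values only, not real rates or a real response.
sample_usage = {
    "prompt_tokens": 1200,
    "total_tokens": 1200,
    "credits_charged": 0.0024,
    "breakdown": {"input": {"text": 0.0024, "visual": 0.0, "video": 0.0}},
}
```

Converting through str before building a Decimal avoids carrying binary floating-point noise into the comparison.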
For streamed responses, usage arrives in the final non-[DONE] chunk’s usage block (matching OpenAI’s streaming-with-usage semantics, which we enable by default on our side).
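One way to pick the receipt out of a stream is sketched below. It assumes OpenAI-style server-sent-event framing ("data: " lines ending with "data: [DONE]"), which this page implies but does not spell out; usage_from_stream and the sample lines are hypothetical names and values used only for illustration.

```python
import json
from typing import Iterable, Optional


def usage_from_stream(lines: Iterable[str]) -> Optional[dict]:
    """Return the usage block of the last data chunk before [DONE], if any."""
    last_chunk = None
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        last_chunk = json.loads(payload)
    return last_chunk.get("usage") if last_chunk else None


# Illustrative stream, not a captured response.
sample_lines = [
    'data: {"choices": [{"delta": {"content": "Hi"}}]}',
    "",
    'data: {"choices": [], "usage": {"credits_charged": 0.0012}}',
    "",
    "data: [DONE]",
]
```

Here usage_from_stream(sample_lines) returns the usage dictionary from the second data chunk; a stream that ends without any data chunk returns None.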
Reconciling the charge
The simplest reconciliation: take the live per-model rate from GET /v1/models and multiply per-token-class counts by per-token-class rates.
The result should match credits_charged. If your reconciliation differs by more than 0.0001 credits in either direction, there’s either a rate-card version skew (see below) or a bug we want to know about.
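The reconciliation can be sketched in Python as follows. The rates dictionary with "input" and "output" per-token credit rates is a hypothetical shape (the actual fields inside chat_pricing are not shown on this page), and the sample numbers are invented. Reasoning tokens are priced at the output rate, as described in the receipt fields.

```python
from decimal import Decimal

FOUR_DP = Decimal("0.0001")


def expected_credits(usage: dict, rates: dict) -> Decimal:
    """Recompute the charge from token counts and per-token rates."""
    input_cost = Decimal(usage["prompt_tokens"]) * rates["input"]
    # Reasoning tokens are separate from completion_tokens but billed at the output rate.
    output_tokens = usage["completion_tokens"] + usage.get("reasoning_tokens", 0)
    output_cost = Decimal(output_tokens) * rates["output"]
    return (input_cost + output_cost).quantize(FOUR_DP)


def reconciles(usage: dict, rates: dict) -> bool:
    """True if the recomputed charge is within 0.0001 credits of the receipt."""
    charged = Decimal(str(usage["credits_charged"]))
    return abs(expected_credits(usage, rates) - charged) <= FOUR_DP


# Hypothetical rates and receipt, for illustration only.
sample_rates = {"input": Decimal("0.000001"), "output": Decimal("0.000004")}
sample_usage = {
    "prompt_tokens": 1000,
    "completion_tokens": 200,
    "reasoning_tokens": 50,
    "credits_charged": 0.002,
}
```

With these sample values the input side costs 0.001 credits and the output side (250 tokens) costs 0.001 credits, so the recomputed total is 0.002 and the receipt reconciles.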
Pricing version pinning
Rate cards can change. Every billed chat response stamps the pricing_version it billed against in usage.breakdown.
pricing_version reflects the rate-card snapshot that was in force when your request landed. If a charge stamped pricing_version: 2 shows up in your ledger and a later request stamps pricing_version: 3, the earlier charge was billed at the v2 rates. We never silently retro-bill at a new rate.
Embedding responses don’t currently include pricing_version in the breakdown — embedding rates have been stable since launch. If we change embedding pricing, we’ll add the field in lockstep.
The Aurous-Version response header you see on every call is a different concept: it is the API contract version, not the pricing version. The two evolve independently. See Aurous-Version for the API version pinning story.
Forecasting cost before the call
Three options:
1. POST /v1/embeddings/estimate (embeddings only)
Same body as the real call; returns the same usage shape minus the vector. No hold, no charge. See estimate docs.
2. POST /v1/images/estimate (image generation)
Returns the credit cost without minting an inferences row. Mirrors the request shape of POST /v1/images.
3. Multiply the live rate card client-side (chat)
We do not currently expose a /v1/chat/completions/estimate endpoint, because chat output tokens are not knowable until the model generates them. The best you can do is multiply your worst-case max_tokens against the output rate to compute an upper bound.
If you set reasoning_effort, multiply max_reasoning_tokens against the output rate too. (Models commit fewer reasoning tokens at low and more at high.) The real spend will land below the upper bound; the receipt tells you the actual.
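The upper bound described above can be sketched as a small Python helper. The function name, the rate values, and the idea of adding a client-side count of prompt tokens at the input rate are assumptions made for illustration; the output-side terms follow the text directly.

```python
from decimal import Decimal


def chat_cost_upper_bound(
    prompt_tokens: int,
    max_tokens: int,
    input_rate: Decimal,
    output_rate: Decimal,
    max_reasoning_tokens: int = 0,
) -> Decimal:
    """Worst-case credits for one chat call, given per-token rates."""
    input_cost = Decimal(prompt_tokens) * input_rate
    # Visible output and reasoning are both priced at the output rate.
    output_cost = Decimal(max_tokens + max_reasoning_tokens) * output_rate
    return input_cost + output_cost


# Hypothetical rates, for illustration only.
bound = chat_cost_upper_bound(
    prompt_tokens=1000,
    max_tokens=500,
    input_rate=Decimal("0.000001"),
    output_rate=Decimal("0.000004"),
)
```

With these made-up rates the bound works out to 0.001 credits of input plus 0.002 credits of output, 0.003 credits in total.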
The hold mechanism
When you POST /v1/chat/completions (non-streamed) or POST /v1/embeddings, we put a credit hold on your team for the maximum the call could cost: max_tokens × output_rate + max_input × input_rate. The hold reserves credits but doesn’t bill them. When the call completes:
- Success: the hold is committed to a charge for the actual credits_charged. The unused portion of the hold is released back to your available balance.
- Failure / cancellation: the entire hold is released, no charge.
available_credits (read from GET /v1/balance) is credits - held_credits and reflects what’s actually free to spend on the next call. The held_credits view lets you see open chat / embedding holds in real-time.
For streamed chat completions, the same hold mechanism applies — the hold is committed to the actual usage on the final non-[DONE] chunk; partial-completion (client disconnect) commits the tokens we’d already emitted and releases the rest.
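To make the hold arithmetic concrete, here is a toy client-side model in Python. It is not the service's implementation: the Balance class, place_hold, and settle are hypothetical names, and the rates are invented. It only mirrors the formulas given above (hold = max_tokens × output_rate + max_input × input_rate, and available_credits = credits - held_credits).

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Optional


@dataclass
class Balance:
    credits: Decimal
    held_credits: Decimal = Decimal("0")

    @property
    def available_credits(self) -> Decimal:
        return self.credits - self.held_credits


def place_hold(balance: Balance, max_tokens: int, max_input: int,
               input_rate: Decimal, output_rate: Decimal) -> Decimal:
    """Reserve the maximum the call could cost, without billing it."""
    hold = Decimal(max_tokens) * output_rate + Decimal(max_input) * input_rate
    balance.held_credits += hold
    return hold


def settle(balance: Balance, hold: Decimal,
           credits_charged: Optional[Decimal] = None) -> None:
    """Release the hold; on success, bill only the actual charge."""
    balance.held_credits -= hold
    if credits_charged is not None:  # None models failure or cancellation
        balance.credits -= credits_charged


# Hypothetical numbers, for illustration only.
balance = Balance(credits=Decimal("10"))
hold = place_hold(balance, max_tokens=500, max_input=1000,
                  input_rate=Decimal("0.000001"), output_rate=Decimal("0.000004"))
available_during_call = balance.available_credits
settle(balance, hold, credits_charged=Decimal("0.0012"))
```

While the call is in flight, 0.003 credits are held and unavailable; after settlement only the 0.0012 actually charged leaves the balance.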
Per-model spend caps (v1.1 roadmap)
For enterprise-grade cost control (“don’t let any single API key spend more than $50/day on aurous-grow-2.0-pro”), per-API-key spend caps are on the v1.1 roadmap. Today, the tools are:
- GET /v1/balance to read your team’s available + held credit summary
- GET /v1/usage with group_by=model and a model filter, to see per-model spend over time
- Auto top-up settings in the dashboard (/dashboard/billing → Auto top-up) to prevent runaway depletion
Where to next?
- How we count tokens — what’s in the input/output token classes
- Chat pricing — per-token rates for chat
- Embedding pricing — per-modality rates for embeddings
- Image pricing — image surface rate card
- GET /v1/models: per-model chat_pricing / embedding_pricing (the live rate, including any per-team override)
- GET /v1/balance: team credits + held credits

