GET /v1/models under the chat_pricing block. The rate is the caller’s effective rate — any per-team override is already applied. See the dedicated LLM pricing guide for the per-1K math and worked examples.
What you pay for
Each chat completion is billed across up to three buckets, depending on the model:

| Bucket | Counts | Rate field on /v1/models |
|---|---|---|
| Input | usage.prompt_tokens (text + image + video parts) | chat_pricing.input.credits_per_M |
| Output | usage.completion_tokens (assistant message content + tool-call arguments) | chat_pricing.output.credits_per_M |
| Reasoning | usage.reasoning_tokens (hidden deliberation on reasoning-capable models) | chat_pricing.output.credits_per_M (same as output) |
To get the per-1K rate, divide credits_per_M by 1000.
The response usage block reports each bucket and the rolled-up credits_charged:
credits_charged is the exact amount deducted from your team’s balance. Check it on every response — it’s the source of truth for the billing line.
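As a minimal sketch of the bucket table above, the expected charge can be recomputed from the usage counts and the per-million rates, then compared against credits_charged. The function name, the dictionary shapes, and the rates are assumptions for illustration. The sketch also assumes that prompt_tokens, completion_tokens, and reasoning_tokens are reported as separate, non-overlapping counts.

```python
def expected_credits(usage: dict, chat_pricing: dict) -> float:
    """Recompute the charge from token counts and per-million rates."""
    input_rate = chat_pricing["input"]["credits_per_M"]
    output_rate = chat_pricing["output"]["credits_per_M"]

    input_credits = usage.get("prompt_tokens", 0) / 1_000_000 * input_rate
    # Reasoning tokens are billed at the output rate, same as completion tokens.
    output_tokens = usage.get("completion_tokens", 0) + usage.get("reasoning_tokens", 0)
    output_credits = output_tokens / 1_000_000 * output_rate
    return input_credits + output_credits


# Hypothetical rates and counts, for illustration only.
pricing = {"input": {"credits_per_M": 100.0}, "output": {"credits_per_M": 400.0}}
usage = {"prompt_tokens": 2_000, "completion_tokens": 500, "reasoning_tokens": 250}
print(expected_credits(usage, pricing))
```

Treat the result as a sanity check only. The credits_charged value on the response stays authoritative, and a small mismatch may simply reflect a discount or a rate update.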
Hold-and-commit (credits are reserved before dispatch)
Submitting a chat completion places a hold on your team balance sized at (estimated_input_tokens × 1.10) + (max_tokens × output_rate). The hold ensures the request can’t run if you can’t afford the worst case. When the response completes:
- Actuals come back in usage.
- The hold is replaced by the actual credits_charged.
- Any unused hold is released back to your balance.
If actuals exceed the hold (rare; the 1.10 input multiplier and the max_tokens upper bound usually cover it), the overage is committed up to your team’s balance_negative_floor. The default floor is 0; overages beyond that are absorbed by the platform and logged for follow-up. Enterprise contracts can set a per-team negative floor; contact support if your workload needs one.
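The hold-and-commit flow in this section can be sketched roughly as follows. This is an illustration under stated assumptions, not the platform's implementation: the function names are hypothetical, both terms of the hold are assumed to be converted to credits using per-million rates, and the negative floor is modelled as how far below zero the balance may go.

```python
def hold_size(estimated_input_tokens: int, max_tokens: int,
              input_rate_per_M: float, output_rate_per_M: float) -> float:
    """Worst-case reservation: padded input estimate plus a full max_tokens output."""
    # Assumption: the input term is priced at the input rate, per million tokens.
    input_part = estimated_input_tokens * 1.10 / 1_000_000 * input_rate_per_M
    output_part = max_tokens / 1_000_000 * output_rate_per_M
    return input_part + output_part


def settle(balance: float, hold: float, actual: float, negative_floor: float = 0.0):
    """Return (new_balance, released, absorbed) once the response completes.

    `balance` is the balance before the hold was placed. `negative_floor` is
    how far below zero the balance may go (0 means it may not go negative).
    """
    released = max(hold - actual, 0.0)            # unused hold goes back
    lowest_allowed = -negative_floor
    # The charge cannot push the balance below the floor.
    committed = max(min(actual, balance - lowest_allowed), 0.0)
    absorbed = actual - committed                 # overage beyond the floor
    return balance - committed, released, absorbed
```

With hypothetical rates of 100 and 400 credits per million, a 1,000-token prompt with max_tokens of 500 would reserve roughly 0.31 credits under these assumptions.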
The pricing-version pin
Every chat completion snapshots the rate-card version at request time. breakdown.pricing_version on the response is the version that was applied. Admin can update chat-model rates between published Aurous-Version releases, but the price actually charged is whatever was in force when the request was created, not whatever the latest rates are at the moment you read the response.
Mutability asymmetry vs. images and videos
There is a documented asymmetry in how rates change for different inference types:

| Surface | Rate update path |
|---|---|
| Images, videos | Frozen per Aurous-Version. Pin a version, your image and video rates don’t move. |
| Chat, embeddings | Mutable without an Aurous-Version bump. Admin can update LLM rates at any time. The per-request charge is still deterministic via pricing_version. |
Call GET /v1/models and store the chat_pricing block next to your prompt; that captures the per-model rate in force at that moment.
The inferences.llm_pricing_version field on each completion row is the audit-trail proof that the rate you were quoted is the rate you paid.
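One way to keep the quoted rate and the billed version side by side is sketched below. The helper names are hypothetical, and the field layout is an assumption: the sketch reads an id and a chat_pricing block from a /v1/models entry, breakdown.pricing_version from the top level of the completion response, and credits_charged from its usage block.

```python
import time


def snapshot_quote(model_entry: dict, prompt: str) -> dict:
    """Capture the chat_pricing block next to the prompt at quote time."""
    return {
        "captured_at": time.time(),
        "model": model_entry.get("id"),          # assumed field name
        "chat_pricing": model_entry["chat_pricing"],
        "prompt": prompt,
    }


def attach_pricing_version(quote: dict, response: dict) -> dict:
    """Record which rate-card version the completion was actually billed under."""
    record = dict(quote)  # shallow copy so the original quote is untouched
    record["pricing_version"] = response["breakdown"]["pricing_version"]
    record["credits_charged"] = response["usage"]["credits_charged"]
    return record
```

Storing the combined record alongside your own request log gives you something to compare against the audit trail if a charge ever looks unexpected.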
Estimating cost before dispatch
Two ways to budget:
- Hand math: multiply expected token counts by the per-million rates on /v1/models. The per-1K rate is chat_pricing.input.credits_per_M / 1000 (and likewise for output). See LLM pricing for the full formula.
- Run a dry call: submit the request with max_tokens: 1 and read usage.prompt_tokens from the response. The input cost is then (prompt_tokens / 1_000_000) × chat_pricing.input.credits_per_M. The 1-token output charge is negligible.
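The dry-call approach might look like the sketch below. The helper names are hypothetical and the request payload shape (model, messages, max_tokens) is an assumption. Sending the request is left to whatever HTTP client you already use.

```python
import copy


def dry_run_payload(payload: dict) -> dict:
    """Copy a chat request and cap the output at one token for a cheap probe."""
    probe = copy.deepcopy(payload)   # leave the real request untouched
    probe["max_tokens"] = 1
    return probe


def input_cost_from_dry_run(response: dict, chat_pricing: dict) -> float:
    """(prompt_tokens / 1_000_000) x chat_pricing.input.credits_per_M."""
    prompt_tokens = response["usage"]["prompt_tokens"]
    return prompt_tokens / 1_000_000 * chat_pricing["input"]["credits_per_M"]
```

The dry call is itself billed, so it is probably best reserved for prompts whose token count is hard to predict, such as ones with image or video parts.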
aurous-grow-2.0-pro lands around 0.2–0.4 credits. Long-context analytical turns with reasoning can push past 1 credit. The live /v1/models response carries the current per-model rates.
Refunds and partial charges
| Outcome | Charged for |
|---|---|
| Successful completion | Actuals as reported in usage. |
| Streamed completion cancelled mid-stream | Actuals up to the abort point (tokens already delivered). The remainder of the hold is released. |
| Provider returned an empty content_filter block | 0. The hold is released. |
| Provider returned a partial content_filter block | Partial actuals (tokens delivered before the filter trip). |
| Provider returned a 5xx | 0. The hold is released. Retry with backoff. |
| Bad input (400 from the platform) | 0. No row written. |
Idempotency and billing
For non-streamed requests, an Idempotency-Key replays the cached response, including the original credits_charged, for 24 hours. You will NOT be double-charged for a retried key. See Idempotency for the full semantics.
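A retry loop that benefits from this behaviour generates the key once and reuses it on every attempt. The sketch below is illustrative: send is a caller-supplied function standing in for your HTTP client, and the retry count and backoff values are arbitrary assumptions rather than platform recommendations.

```python
import time
import uuid


def send_with_retries(send, payload: dict, attempts: int = 3, base_delay: float = 0.5):
    """Retry a non-streamed request, reusing one Idempotency-Key throughout.

    `send(payload, headers)` must return a (status_code, body) tuple.
    """
    # Generated once, outside the loop, so every attempt carries the same key.
    headers = {"Idempotency-Key": str(uuid.uuid4())}
    status, body = None, None
    for attempt in range(attempts):
        status, body = send(payload, headers)
        if status < 500:
            return status, body
        if attempt < attempts - 1:
            time.sleep(base_delay * (2 ** attempt))  # exponential backoff on 5xx
    return status, body
```

Generating a fresh key inside the loop would defeat the purpose, since each attempt would then look like a new request.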
Where to read rates
- GET /v1/models: per-model chat_pricing / embedding_pricing (the caller’s effective rate, including any per-team override) plus capability metadata.
- LLM pricing guide: per-1K math, examples, and the mutability story in long form.
Common questions
Are cached prompts cheaper? Some providers cache stable prefixes and discount the input rate for cache hits. When a hit happens, the response usage will reflect the reduced charge on breakdown.input_credits; the platform handles cache accounting transparently.
Can I see usage trends? Yes — GET /v1/usage aggregates credits by day/key/model. The dashboard’s usage tab visualizes the same data.
What if I’m rate-limited? Token throughput is limited per team (TPM bucket). Hitting it returns 429 tpm_rate_limit_exceeded with Retry-After. The TPM is in addition to the per-minute request bucket (RPM); both apply.
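A client can honour the Retry-After header with a small helper like the one sketched here. The function name and the one-second fallback are assumptions. The sketch only handles a numeric Retry-After value and falls back to the default for anything else.

```python
def retry_after_seconds(status: int, headers: dict, default: float = 1.0):
    """Return how long to wait after a 429, or None if no wait is needed."""
    if status != 429:
        return None
    # Header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get("retry-after")
    try:
        return max(float(raw), 0.0)
    except (TypeError, ValueError):
        # Missing or non-numeric value (for example an HTTP-date): use the default.
        return default
```

Because both the TPM and RPM buckets apply, waiting out one limit does not guarantee the next request clears the other, so the check is worth repeating on every response.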
