Chat completions are billed in credits at per-token rates surfaced on each model row at GET /v1/models under the chat_pricing block. The rate is the caller’s effective rate — any per-team override is already applied. See the dedicated LLM pricing guide for the per-1K math and worked examples.

What you pay for

Each chat completion is billed across up to three buckets, depending on the model:
| Bucket | Counts | Rate field on /v1/models |
| --- | --- | --- |
| Input | usage.prompt_tokens (text + image + video parts) | chat_pricing.input.credits_per_M |
| Output | usage.completion_tokens (assistant message content + tool-call arguments) | chat_pricing.output.credits_per_M |
| Reasoning | usage.reasoning_tokens (hidden deliberation on reasoning-capable models) | chat_pricing.output.credits_per_M (same as output) |
To get the credits-per-1K rate, divide credits_per_M by 1000. The response usage block reports each bucket and the rolled-up credits_charged:
{
  "usage": {
    "prompt_tokens": 200,
    "completion_tokens": 600,
    "reasoning_tokens": 50,
    "total_tokens": 850,
    "credits_charged": 0.2856,
    "breakdown": {
      "input_credits": 0.0150,
      "output_credits": 0.2700,
      "reasoning_credits": 0.0006,
      "model": "aurous-grow-2.0-pro",
      "pricing_version": 7
    }
  }
}
credits_charged is the exact amount deducted from your team’s balance. Check it on every response — it’s the source of truth for the billing line.
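
The per-bucket arithmetic can be sketched as follows. This is a minimal illustration, not the platform's billing code: the function name bucket_credits is invented for the example, and the rates are hypothetical placeholders rather than real values from /v1/models. It assumes reasoning tokens are reported separately from completion tokens and billed at the output rate, as described above.

```python
from decimal import Decimal

TOKENS_PER_MILLION = Decimal(1_000_000)


def bucket_credits(usage: dict, input_per_m: Decimal, output_per_m: Decimal) -> dict:
    """Compute per-bucket credits from a usage block and per-million rates."""
    input_credits = Decimal(usage["prompt_tokens"]) * input_per_m / TOKENS_PER_MILLION
    output_credits = Decimal(usage["completion_tokens"]) * output_per_m / TOKENS_PER_MILLION
    # Reasoning is billed at the output rate. ASSUMPTION: models without
    # reasoning may omit the field, so it defaults to 0 here.
    reasoning_tokens = usage.get("reasoning_tokens", 0)
    reasoning_credits = Decimal(reasoning_tokens) * output_per_m / TOKENS_PER_MILLION
    return {
        "input_credits": input_credits,
        "output_credits": output_credits,
        "reasoning_credits": reasoning_credits,
        "total": input_credits + output_credits + reasoning_credits,
    }


# Hypothetical rates, chosen only to keep the arithmetic easy to follow.
usage = {"prompt_tokens": 200, "completion_tokens": 600, "reasoning_tokens": 50}
result = bucket_credits(usage, input_per_m=Decimal("100"), output_per_m=Decimal("400"))
print(result["total"])
```

Decimal is used instead of float so that small per-token amounts add up without binary rounding noise. Treat the result as a cross-check only; credits_charged on the response remains the authoritative figure.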

Hold-and-commit (credits are reserved before dispatch)

Submitting a chat completion places a hold on your team balance sized at (estimated_input_tokens × 1.10 × input_rate) + (max_tokens × output_rate), where each rate is the per-token rate (credits_per_M / 1,000,000). The hold ensures the request can’t run if you can’t afford the worst case. When the response completes:
  • Actuals come back in usage.
  • The hold is replaced by the actual credits_charged.
  • Any unused hold is released back to your balance.
If actuals exceed the hold (rare — the 10% input margin and max_tokens upper bound usually cover it), the overage is committed up to your team’s balance_negative_floor. The default floor is 0 — overages beyond that are absorbed by the platform and logged for follow-up. Enterprise contracts can set a per-team negative floor; contact support if your workload needs one.
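
A rough sketch of the hold-and-commit arithmetic, for readers who want to predict how much balance a request will reserve. The function names hold_amount and settle are hypothetical, the rates are placeholders, and the sketch assumes the input term of the hold is priced at the per-token input rate. How an overage is split between your balance and the platform depends on balance_negative_floor and is not modelled here.

```python
from decimal import Decimal

PER_MILLION = Decimal(1_000_000)
INPUT_MARGIN = Decimal("1.10")  # the 10% input margin


def hold_amount(est_input_tokens: int, max_tokens: int,
                input_per_m: Decimal, output_per_m: Decimal) -> Decimal:
    """Worst-case reservation in credits.

    ASSUMPTION: the estimated input tokens are priced at the input rate.
    """
    input_part = Decimal(est_input_tokens) * INPUT_MARGIN * input_per_m / PER_MILLION
    output_part = Decimal(max_tokens) * output_per_m / PER_MILLION
    return input_part + output_part


def settle(hold: Decimal, actual: Decimal) -> tuple:
    """Return (released, overage) once the actual charge is known."""
    if actual <= hold:
        return hold - actual, Decimal(0)
    # Overage handling against balance_negative_floor is not modelled.
    return Decimal(0), actual - hold


# Hypothetical rates and token counts.
hold = hold_amount(1000, 500, input_per_m=Decimal("100"), output_per_m=Decimal("400"))
released, overage = settle(hold, actual=Decimal("0.25"))
print(hold, released, overage)
```

Because the hold scales with max_tokens, setting max_tokens close to what you actually expect keeps less of your balance tied up while a request is in flight.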

The pricing-version pin

Every chat completion snapshots the rate-card version at request time. breakdown.pricing_version on the response is the version that was applied. Admins can update chat-model rates between published Aurous-Version releases, but the price actually charged is whatever was in force when the request was created, not whatever the latest rates are at the moment you read the response.

Mutability asymmetry vs. images and videos

There is a documented asymmetry in how rates change for different inference types:
| Surface | Rate update path |
| --- | --- |
| Images, videos | Frozen per Aurous-Version. Pin a version, your image and video rates don’t move. |
| Chat, embeddings | Mutable without an Aurous-Version bump. Admin can update LLM rates at any time. The per-request charge is still deterministic via pricing_version. |
Why: image and video pricing is shaped per release; LLM rates can swing with provider tuning, new models, and per-model markup changes that don’t justify a new version pin every time. If you want the rate snapshot at the moment your client made a request, call GET /v1/models and store the chat_pricing block next to your prompt — that captures the per-model rate in force at that moment. The inferences.llm_pricing_version field on each completion row is the audit-trail proof that the rate you were quoted is the rate you paid.
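
One way to capture that snapshot is sketched below. The envelope shape is an assumption: the sketch supposes that the /v1/models response is a JSON object with a "data" list whose rows carry an "id" and a chat_pricing block, which this page does not confirm. The function name snapshot_chat_pricing and the sample rates are hypothetical.

```python
import json
from datetime import datetime, timezone


def snapshot_chat_pricing(models_response: dict, model_id: str) -> dict:
    """Extract one model's chat_pricing block and timestamp it for storage."""
    # ASSUMPTION: rows live under "data" and are keyed by "id".
    for row in models_response.get("data", []):
        if row.get("id") == model_id:
            return {
                "model": model_id,
                "captured_at": datetime.now(timezone.utc).isoformat(),
                "chat_pricing": row["chat_pricing"],
            }
    raise KeyError(f"model {model_id!r} not found in /v1/models response")


# Stand-in for a parsed GET /v1/models body; the rates are placeholders.
sample_response = {
    "data": [
        {
            "id": "aurous-grow-2.0-pro",
            "chat_pricing": {
                "input": {"credits_per_M": 100},
                "output": {"credits_per_M": 400},
            },
        }
    ]
}

snapshot = snapshot_chat_pricing(sample_response, "aurous-grow-2.0-pro")
record = json.dumps(snapshot)  # store this string next to the prompt
```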

Estimating cost before dispatch

Two ways to budget:
  1. Hand math — multiply expected token counts by the per-million rates on /v1/models. The per-1K rate is chat_pricing.input.credits_per_M / 1000 (and likewise for output). See LLM pricing for the full formula.
  2. Run a dry call — submit the request with max_tokens: 1 and read usage.prompt_tokens from the response. The input cost is then (prompt_tokens / 1_000_000) × chat_pricing.input.credits_per_M. The 1-token output charge is negligible.
A typical short conversational turn on aurous-grow-2.0-pro lands around 0.2–0.4 credits. Long-context analytical turns with reasoning can push past 1 credit. The live /v1/models response carries the current per-model rates.
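
The two budgeting approaches combine naturally: take prompt_tokens from a dry call, then bound the output cost with max_tokens. The sketch below illustrates this under stated assumptions. The function name estimate_range and the rates are hypothetical, and it does not model reasoning tokens, since this page does not say whether max_tokens bounds them.

```python
from decimal import Decimal


def estimate_range(prompt_tokens: int, max_tokens: int,
                   input_per_m: Decimal, output_per_m: Decimal) -> tuple:
    """Return (low, high) credit estimates for one request.

    low  = input cost only (the model emits almost nothing)
    high = input cost plus a full max_tokens of output
    """
    input_cost = Decimal(prompt_tokens) / 1_000_000 * input_per_m
    max_output_cost = Decimal(max_tokens) / 1_000_000 * output_per_m
    return input_cost, input_cost + max_output_cost


# prompt_tokens would come from usage.prompt_tokens on the dry call;
# the rates are placeholders for values read from /v1/models.
low, high = estimate_range(1200, 800, Decimal("100"), Decimal("400"))
print(low, high)
```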

Refunds and partial charges

| Outcome | Charged for |
| --- | --- |
| Successful completion | Actuals as reported in usage. |
| Streamed completion cancelled mid-stream | Actuals up to the abort point (tokens already delivered). The remainder of the hold is released. |
| Provider returned an empty content_filter block | 0. The hold is released. |
| Provider returned a partial content_filter block | Partial actuals (tokens delivered before the filter trip). |
| Provider returned a 5xx | 0. The hold is released. Retry with backoff. |
| Bad input (400 from the platform) | 0. No row written. |
The policy is: you pay for what was delivered.

Idempotency and billing

For non-streamed requests, an Idempotency-Key replays the cached response — including the original credits_charged — for 24 hours. You will NOT be double-charged for a retried key. See Idempotency for the full semantics.
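
A retry loop that stays safe for billing generates the key once and reuses it on every attempt. The sketch below shows that pattern with exponential backoff on 5xx responses. The send callable is a hypothetical stand-in for your HTTP client, assumed to return a (status_code, body) pair, and the retry limits are arbitrary.

```python
import time
import uuid


def send_with_retries(send, payload: dict, max_attempts: int = 3,
                      base_delay: float = 0.5):
    """Call send(payload, headers) until it returns a non-5xx status.

    `send` is a hypothetical transport callable returning (status_code, body).
    """
    # Generate the key once so every retry replays the same request.
    headers = {"Idempotency-Key": str(uuid.uuid4())}
    for attempt in range(max_attempts):
        status, body = send(payload, headers)
        if status < 500:
            return status, body
        if attempt < max_attempts - 1:
            time.sleep(base_delay * 2 ** attempt)  # exponential backoff
    return status, body


# Demo with a fake transport: fails once with 503, then succeeds.
seen_keys = []


def fake_send(payload, headers):
    seen_keys.append(headers["Idempotency-Key"])
    return (503, None) if len(seen_keys) == 1 else (200, {"ok": True})


status, body = send_with_retries(fake_send, {"model": "aurous-grow-2.0-pro"},
                                 base_delay=0)
```

Creating a fresh key inside the loop would defeat the purpose, because each attempt would then look like a new request.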

Where to read rates

  • GET /v1/models — per-model chat_pricing / embedding_pricing (the caller’s effective rate, including any per-team override) plus capability metadata.
  • LLM pricing guide — per-1K math, examples, and the mutability story in long form.

Common questions

Are cached prompts cheaper?
Some providers cache stable prefixes and discount the input rate for cache hits. When a hit happens, the response usage will reflect the reduced charge on breakdown.input_credits; the platform handles cache accounting transparently.

Can I see usage trends?
Yes. GET /v1/usage aggregates credits by day/key/model. The dashboard’s usage tab visualizes the same data.

What if I’m rate-limited?
Token throughput is limited per team (TPM bucket). Hitting it returns 429 tpm_rate_limit_exceeded with Retry-After. The TPM is in addition to the per-minute request bucket (RPM); both apply.
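
When a 429 arrives, the client should wait for the interval given in Retry-After before retrying. This page does not say which form the header takes, so the sketch below handles both forms allowed by the HTTP specification: a number of seconds or an HTTP date. The helper name retry_after_seconds is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(header_value: str, now: Optional[datetime] = None) -> float:
    """Convert a Retry-After header value into a wait time in seconds."""
    value = header_value.strip()
    if value.isdigit():
        # Delta-seconds form, e.g. "30".
        return float(value)
    # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    when = parsedate_to_datetime(value)
    current = now or datetime.now(timezone.utc)
    # Never return a negative wait if the date is already in the past.
    return max(0.0, (when - current).total_seconds())


wait_short = retry_after_seconds("30")
wait_dated = retry_after_seconds(
    "Wed, 21 Oct 2015 07:28:00 GMT",
    now=datetime(2015, 10, 21, 7, 27, 0, tzinfo=timezone.utc),
)
```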