Pricing

Chat completions are billed in credits at per-token rates surfaced on each model row at GET /v1/models under the chat_pricing block. The rate is the caller’s effective rate — any per-team override is already applied. See the dedicated LLM pricing guide for the per-1K math and worked examples.

What you pay for

Each chat completion is billed across up to three buckets, depending on the model:

Bucket	Counts	Rate field on `/v1/models`
Input	`usage.prompt_tokens` (text + image + video parts)	`chat_pricing.input.credits_per_M`
Output	`usage.completion_tokens` (assistant message content + tool-call arguments)	`chat_pricing.output.credits_per_M`
Reasoning	`usage.reasoning_tokens` (hidden deliberation on reasoning-capable models)	`chat_pricing.output.credits_per_M` (same as output)

To get a credits-per-1K rate, divide credits_per_M by 1000. The response usage block reports each bucket and the rolled-up credits_charged:

{
  "usage": {
    "prompt_tokens": 200,
    "completion_tokens": 600,
    "reasoning_tokens": 50,
    "total_tokens": 850,
    "credits_charged": 0.3075,
    "breakdown": {
      "input_credits": 0.0150,
      "output_credits": 0.2700,
      "reasoning_credits": 0.0225,
      "model": "aurous-grow-2.0-pro",
      "pricing_version": 7
    }
  }
}

credits_charged is the exact amount deducted from your team’s balance. Check it on every response — it’s the source of truth for the billing line.
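To reconcile a response against the published rates, the buckets can be recomputed from the token counts. The sketch below is a minimal illustration, not the platform's billing code: the function name recompute_credits, the rates, the token counts, and the comparison tolerance are all assumptions made for the example.

```python
from math import isclose

TOKENS_PER_M = 1_000_000

def recompute_credits(usage: dict, chat_pricing: dict) -> dict:
    """Recompute each billing bucket from token counts and per-million rates."""
    input_rate = chat_pricing["input"]["credits_per_M"]
    output_rate = chat_pricing["output"]["credits_per_M"]
    input_credits = usage["prompt_tokens"] * input_rate / TOKENS_PER_M
    output_credits = usage["completion_tokens"] * output_rate / TOKENS_PER_M
    # Reasoning tokens are billed at the output rate; treat a missing field as 0.
    reasoning_credits = usage.get("reasoning_tokens", 0) * output_rate / TOKENS_PER_M
    return {
        "input_credits": input_credits,
        "output_credits": output_credits,
        "reasoning_credits": reasoning_credits,
        "total": input_credits + output_credits + reasoning_credits,
    }

# Illustrative values only: these rates and counts are made up for the example.
pricing = {"input": {"credits_per_M": 100}, "output": {"credits_per_M": 400}}
usage = {"prompt_tokens": 1000, "completion_tokens": 400,
         "reasoning_tokens": 100, "credits_charged": 0.30}

recomputed = recompute_credits(usage, pricing)
# Compare with a small tolerance; the platform's rounding rules are not stated here.
reconciles = isclose(recomputed["total"], usage["credits_charged"], abs_tol=1e-4)
print(recomputed, reconciles)
```

If the recomputed total and credits_charged disagree, credits_charged remains the figure that was billed; a mismatch more likely points at a per-team override or a rate change than at the response.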

Hold-and-commit (credits are reserved before dispatch)

Submitting a chat completion places a hold on your team balance sized at (estimated_input_tokens × 1.10 × input_rate) + (max_tokens × output_rate), where both rates are per-token credit rates. The hold ensures the request can’t run if you can’t afford the worst case. When the response completes:

Actuals come back in usage.
The hold is replaced by the actual credits_charged.
Any unused hold is released back to your balance.

If actuals exceed the hold (rare — the 10% input margin and max_tokens upper bound usually cover it), the overage is committed up to your team’s balance_negative_floor. The default floor is 0 — overages beyond that are absorbed by the platform and logged for follow-up. Enterprise contracts can set a per-team negative floor; contact support if your workload needs one.
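The hold-and-commit flow can be sketched as two small functions. This is a simplified model under stated assumptions, not the platform's implementation: the names hold_size, settle, and Settlement are hypothetical, the input term is assumed to be priced at the input rate, and negative_floor is modeled as a non-negative magnitude (a floor of 5 lets the balance reach -5).

```python
from dataclasses import dataclass

TOKENS_PER_M = 1_000_000

def hold_size(estimated_input_tokens: int, max_tokens: int,
              input_rate_per_m: float, output_rate_per_m: float) -> float:
    """Worst-case reservation: input padded by 10%, output at max_tokens."""
    input_part = estimated_input_tokens * 1.10 * input_rate_per_m / TOKENS_PER_M
    output_part = max_tokens * output_rate_per_m / TOKENS_PER_M
    return input_part + output_part

@dataclass
class Settlement:
    committed: float      # credits actually deducted
    released: float       # unused part of the hold returned to the balance
    absorbed: float       # overage beyond the floor, carried by the platform
    balance_after: float

def settle(balance: float, hold: float, actual: float,
           negative_floor: float = 0.0) -> Settlement:
    """Replace a hold with the actual charge.

    `balance` is the available balance before the hold was placed.
    """
    released = max(hold - actual, 0.0)
    # The most that can be committed without going below -negative_floor.
    spendable = balance + negative_floor
    committed = min(actual, spendable)
    absorbed = actual - committed
    return Settlement(committed, released, absorbed, balance - committed)

# Illustrative numbers: 1,000 estimated input tokens, max_tokens of 500.
hold = hold_size(1000, 500, input_rate_per_m=100, output_rate_per_m=400)
normal = settle(balance=10.0, hold=hold, actual=0.25)
overage = settle(balance=hold, hold=hold, actual=hold + 0.04)
print(hold, normal, overage)
```

In the normal case the unused part of the hold is released; in the overage case, with the default floor of 0, the excess lands in absorbed rather than pushing the balance negative.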

The pricing-version pin

Every chat completion snapshots the rate-card version at request time. breakdown.pricing_version on the response is the version that was applied. Admin can update chat-model rates between published Aurous-Version releases, but the price actually charged is whatever was in force when the request was created — not whatever the latest rates are at the moment you read the response.

Mutability asymmetry vs. images

There is a documented asymmetry in how rates change for different inference types:

Surface	Rate update path
Images	Frozen per `Aurous-Version`. Pin a version, your image rates don’t move.
Chat, embeddings, videos	Mutable without an `Aurous-Version` bump. Admin can update these rates at any time; video rates are per-model and DB-driven. The per-request charge is still deterministic — each request snapshots the rate in force when it was created (`pricing_version`).

Why: image pricing is shaped per release; chat, embedding, and video rates can swing with provider tuning, new models, and per-model markup changes that don’t justify a new version pin every time. If you want the rate snapshot at the moment your client made a request, call GET /v1/models and store the relevant pricing block next to your request — that captures the per-model rate in force at that moment. The inferences.llm_pricing_version field on each completion row is the audit-trail proof that the rate you were quoted is the rate you paid.
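One way to keep that client-side snapshot is to copy the model's pricing block into a record stored alongside the request. The sketch below assumes a response shape that this page does not specify: a top-level "data" list whose rows carry "id" and "chat_pricing" keys. The function name snapshot_pricing and the request_id label are likewise hypothetical.

```python
import copy
from datetime import datetime, timezone

def snapshot_pricing(models_response: dict, model_id: str, request_id: str) -> dict:
    """Build a record of the rate in force for one model at capture time."""
    # Assumed shape: {"data": [{"id": ..., "chat_pricing": {...}}, ...]}
    for row in models_response["data"]:
        if row["id"] == model_id:
            return {
                "request_id": request_id,  # client-side label, not an API field
                "model": model_id,
                "captured_at": datetime.now(timezone.utc).isoformat(),
                # Deep copy so later changes to the response object don't leak in.
                "chat_pricing": copy.deepcopy(row["chat_pricing"]),
            }
    raise KeyError(f"model {model_id!r} not found in models response")

# Stand-in for a parsed GET /v1/models body; the rates are made up.
models_response = {
    "data": [
        {"id": "aurous-grow-2.0-pro",
         "chat_pricing": {"input": {"credits_per_M": 100},
                          "output": {"credits_per_M": 400}}},
    ]
}
snapshot = snapshot_pricing(models_response, "aurous-grow-2.0-pro", "req-001")
print(snapshot)
```

Storing breakdown.pricing_version from the eventual response next to this record makes it straightforward to compare the quoted rate with the rate that was applied.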

Estimating cost before dispatch

Two ways to budget:

Hand math — multiply expected token counts by the per-million rates on /v1/models. The per-1K rate is chat_pricing.input.credits_per_M / 1000 (and likewise for output). See LLM pricing for the full formula.
Run a dry call — submit the request with max_tokens: 1 and read usage.prompt_tokens from the response. The input cost is then (prompt_tokens / 1_000_000) × chat_pricing.input.credits_per_M. The 1-token output charge is negligible.
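Both approaches reduce to the same arithmetic. The sketch below shows it as a small helper; the function names, the rates, and the token counts are illustrative assumptions, and prompt_tokens would come either from your own estimate or from the dry call's usage block.

```python
TOKENS_PER_M = 1_000_000

def per_1k(credits_per_m: float) -> float:
    """Convert a per-million rate to a per-1K rate."""
    return credits_per_m / 1000

def estimate_range(prompt_tokens: int, expected_output_tokens: int, max_tokens: int,
                   input_rate_per_m: float, output_rate_per_m: float) -> tuple:
    """Return (expected, worst_case) credits for one completion.

    expected_output_tokens should include any reasoning tokens,
    since those are billed at the output rate.
    """
    input_cost = prompt_tokens / TOKENS_PER_M * input_rate_per_m
    expected = input_cost + expected_output_tokens / TOKENS_PER_M * output_rate_per_m
    worst = input_cost + max_tokens / TOKENS_PER_M * output_rate_per_m
    return expected, worst

# Made-up rates: 100 credits/M input, 400 credits/M output.
expected, worst = estimate_range(
    prompt_tokens=2000, expected_output_tokens=300, max_tokens=1000,
    input_rate_per_m=100, output_rate_per_m=400,
)
print(per_1k(100), expected, worst)
```

The worst-case figure is useful for budgeting because it tracks max_tokens, the same upper bound the hold is sized against.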

A typical short conversational turn on aurous-grow-2.0-pro lands around 0.2–0.4 credits. Long-context analytical turns with reasoning can push past 1 credit. The live /v1/models response carries the current per-model rates.

Refunds and partial charges

Outcome	Charged for
Successful completion	Actuals as reported in `usage`.
Streamed completion cancelled mid-stream	Actuals up to the abort point (tokens already delivered). The remainder of the hold is released.
Provider returned an empty `content_filter` block	0. The hold is released.
Provider returned a partial `content_filter` block	Partial actuals (tokens delivered before the filter trip).
Provider returned a 5xx	0. The hold is released. Retry with backoff.
Bad input (400 from the platform)	0. No row written.

The policy is: you pay for what was delivered.

Idempotency and billing

For non-streamed requests, an Idempotency-Key replays the cached response — including the original credits_charged — for 24 hours. You will NOT be double-charged for a retried key. See Idempotency for the full semantics.
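A retry loop that benefits from this reuses one key across every attempt. The sketch below is transport-agnostic and makes several assumptions: send is a hypothetical callable that performs the HTTP request with the given headers and returns (status, body), and the attempt count and delays are arbitrary choices rather than platform recommendations.

```python
import time
import uuid
from typing import Callable, Tuple

def send_with_retries(send: Callable[[dict], Tuple[int, dict]],
                      max_attempts: int = 4,
                      base_delay: float = 0.5,
                      sleep: Callable[[float], None] = time.sleep) -> Tuple[int, dict]:
    """Retry 5xx responses with exponential backoff, reusing one Idempotency-Key."""
    # Generate the key once so every retry refers to the same logical request.
    headers = {"Idempotency-Key": str(uuid.uuid4())}
    status, body = 0, {}
    for attempt in range(max_attempts):
        status, body = send(headers)
        if status < 500:
            # Success or a client error: retrying would not help.
            return status, body
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...
    return status, body

# Demo with a stub transport that fails twice, then succeeds.
seen_keys = []

def stub_send(headers: dict) -> Tuple[int, dict]:
    seen_keys.append(headers["Idempotency-Key"])
    return (503, {}) if len(seen_keys) < 3 else (200, {"ok": True})

demo_delays = []
demo_status, demo_body = send_with_retries(stub_send, sleep=demo_delays.append)
print(demo_status, demo_delays, len(set(seen_keys)))
```

Generating a fresh key per attempt would defeat the replay cache, so the key is created outside the loop.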

Where to read rates

GET /v1/models — per-model chat_pricing / embedding_pricing (the caller’s effective rate, including any per-team override) plus capability metadata.
LLM pricing guide — per-1K math, examples, and the mutability story in long form.

Common questions

Are cached prompts cheaper? Some providers cache stable prefixes and discount the input rate for cache hits. When a hit happens, the response usage will reflect the reduced charge on breakdown.input_credits; the platform handles cache accounting transparently.

Can I see usage trends? Yes. GET /v1/usage aggregates credits by day/key/model. The dashboard’s usage tab visualizes the same data.

What if I’m rate-limited? Token throughput is limited per team (TPM bucket). Hitting it returns 429 tpm_rate_limit_exceeded with Retry-After. The TPM is in addition to the per-minute request bucket (RPM); both apply.
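For the rate-limit case, a client needs to turn the Retry-After header into a wait time. This page does not say which form the header takes, so the sketch below accepts both forms HTTP allows (a number of seconds or an HTTP date); the function name retry_after_seconds is hypothetical.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

def retry_after_seconds(header_value: str, now: Optional[datetime] = None) -> float:
    """Convert a Retry-After header value into seconds to wait."""
    value = header_value.strip()
    if value.isdigit():
        # Delay-seconds form, e.g. "12".
        return float(value)
    # HTTP-date form, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    when = parsedate_to_datetime(value)
    if when.tzinfo is None:
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means the wait is already over.
    return max((when - now).total_seconds(), 0.0)

fixed_now = datetime(2015, 10, 21, 7, 27, 30, tzinfo=timezone.utc)
wait_from_seconds = retry_after_seconds("12")
wait_from_date = retry_after_seconds("Wed, 21 Oct 2015 07:28:00 GMT", now=fixed_now)
print(wait_from_seconds, wait_from_date)
```

Because the TPM and RPM buckets both apply, a client that waits out one 429 may still hit the other limit, so the wait belongs inside a retry loop rather than a single sleep.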
