LLM pricing

LLM workloads (chat completions and embeddings) are priced in credits per million tokens. Each model row returned by GET /v1/models carries the rate inline in its chat_pricing or embedding_pricing block. The value is the caller’s effective rate, with any per-team override already applied. To get credits per 1K tokens, divide credits_per_M by 1000.

Chat completions

Chat models bill across three token buckets:

Bucket	Rate field on `/v1/models`	What it counts
Input	`chat_pricing.input.credits_per_M`	Text, image, and video tokens on the prompt (including system messages and prior turns).
Output	`chat_pricing.output.credits_per_M`	Assistant text content + tool-call arguments.
Reasoning	`chat_pricing.output.credits_per_M` (same as output)	Hidden deliberation tokens on reasoning-capable models.

The math:

credits_charged
  = (input_tokens     / 1_000_000) × chat_pricing.input.credits_per_M
  + (output_tokens    / 1_000_000) × chat_pricing.output.credits_per_M
  + (reasoning_tokens / 1_000_000) × chat_pricing.output.credits_per_M
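As a minimal sketch, the formula above could be written in Python like this. The function name `chat_credits` and its parameter names are illustrative and not part of the API; the two rates stand for whatever `chat_pricing.input.credits_per_M` and `chat_pricing.output.credits_per_M` report for the model. `Decimal` is used here only to keep small credit amounts free of floating-point noise.

```python
from decimal import Decimal

TOKENS_PER_M = Decimal(1_000_000)


def chat_credits(input_tokens, output_tokens, reasoning_tokens,
                 input_rate, output_rate):
    """Credits for one chat completion, following the formula above.

    input_rate and output_rate are credits per million tokens
    (illustrative stand-ins for the chat_pricing fields).
    """
    input_rate = Decimal(str(input_rate))
    output_rate = Decimal(str(output_rate))

    input_credits = Decimal(input_tokens) / TOKENS_PER_M * input_rate
    # Per the formula, reasoning tokens are billed at the output rate.
    output_credits = Decimal(output_tokens) / TOKENS_PER_M * output_rate
    reasoning_credits = Decimal(reasoning_tokens) / TOKENS_PER_M * output_rate
    return input_credits + output_credits + reasoning_credits
```

For instance, `chat_credits(1000, 500, 0, 75, 450)` evaluates to 0.3 credits: 0.075 for input plus 0.225 for output.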

Worked example

A request to aurous-grow-2.0-pro returns the following response:

{
  "usage": {
    "prompt_tokens": 200,
    "completion_tokens": 600,
    "reasoning_tokens": 50,
    "credits_charged": 0.2856,
    "breakdown": {
      "input_credits":     0.0150,
      "output_credits":    0.2700,
      "reasoning_credits": 0.0006,
      "model": "aurous-grow-2.0-pro",
      "pricing_version": 7
    }
  }
}

At the day-1 rates (chat_pricing.input.credits_per_M: 75, chat_pricing.output.credits_per_M: 450):

input_credits     = (200 / 1_000_000) × 75  = 0.0150
output_credits    = (600 / 1_000_000) × 450 = 0.2700
reasoning_credits = (50  / 1_000_000) × 12  = 0.0006     # 12 is the reasoning rate in this snapshot; rates differ per pricing_version
                                              ─────────
                                              0.2856 credits charged

The breakdown.pricing_version field identifies the rate-card snapshot pinned to this inference. The exact rate used for reasoning_credits comes from the same snapshot; consult the pricing_version row if you need to reverse-derive the math.
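Reverse-deriving a rate from a response comes down to dividing a breakdown line by its token count and scaling to a million tokens. The sketch below does that for the example response; `effective_rate_per_m` is a hypothetical helper, and the credit amounts are passed as strings so that `Decimal` sees the exact published digits.

```python
from decimal import Decimal


def effective_rate_per_m(credits, tokens):
    """Back out the credits-per-million rate implied by one breakdown line."""
    if tokens == 0:
        raise ValueError("cannot derive a rate from zero tokens")
    return Decimal(str(credits)) / Decimal(tokens) * Decimal(1_000_000)


# Token counts and breakdown values copied from the example response.
usage = {
    "prompt_tokens": 200,
    "completion_tokens": 600,
    "reasoning_tokens": 50,
    "breakdown": {
        "input_credits": "0.0150",
        "output_credits": "0.2700",
        "reasoning_credits": "0.0006",
    },
}

breakdown = usage["breakdown"]
input_rate = effective_rate_per_m(breakdown["input_credits"], usage["prompt_tokens"])
output_rate = effective_rate_per_m(breakdown["output_credits"], usage["completion_tokens"])
reasoning_rate = effective_rate_per_m(breakdown["reasoning_credits"], usage["reasoning_tokens"])
```

For the example response this yields 75, 450, and 12 credits per million tokens for the input, output, and reasoning lines respectively.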

Embeddings

Embedding models bill across two buckets (text vs visual input). The video modality was retired 2026-05-24 — the provider folded video frames into the visual bucket, so the published video rate never actually fired. video_url parts on POST /v1/embeddings are now rejected with embeddings_video_unsupported; embed visual content by extracting frames in your pipeline and submitting them as image_url.

Bucket	Rate field on `/v1/models`	What it counts
Text input	`embedding_pricing.text.credits_per_M`	Text tokens in `input` (string input or `text` content parts).
Visual input	`embedding_pricing.visual.credits_per_M`	Image tokens from `image_url` content parts.

credits_charged
  = (input_text_tokens   / 1_000_000) × embedding_pricing.text.credits_per_M
  + (input_visual_tokens / 1_000_000) × embedding_pricing.visual.credits_per_M
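The same calculation as a small Python sketch. `embedding_credits` and its parameter names are illustrative; the two rates stand for `embedding_pricing.text.credits_per_M` and `embedding_pricing.visual.credits_per_M`, and fractional rates are best passed as strings so `Decimal` keeps them exact.

```python
from decimal import Decimal


def embedding_credits(text_tokens, visual_tokens, text_rate, visual_rate):
    """Credits for one embeddings request, following the formula above.

    text_rate and visual_rate are credits per million tokens.
    """
    per_m = Decimal(1_000_000)
    text_credits = Decimal(text_tokens) / per_m * Decimal(str(text_rate))
    visual_credits = Decimal(visual_tokens) / per_m * Decimal(str(visual_rate))
    return text_credits + visual_credits
```

A text-only request simply passes `0` for `visual_tokens`, which drops the second term.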

Worked example

{
  "usage": {
    "prompt_tokens": 7000,
    "credits_charged": 0.19125,
    "breakdown": {
      "input_text_credits":   0.09375,
      "input_visual_credits": 0.09750,
      "model": "aurous-embed-vision-1.0",
      "pricing_version": 3
    }
  }
}

At the day-1 rates (embedding_pricing.text.credits_per_M: 18.75, embedding_pricing.visual.credits_per_M: 48.75) with a 5,000-text-token + 2,000-visual-token input:

text_credits   = (5000 / 1_000_000) × 18.75 = 0.09375
visual_credits = (2000 / 1_000_000) × 48.75 = 0.09750
                                              ─────────
                                              0.19125 credits charged

Pricing version — mutability asymmetry

Rates for chat, embedding, and video models can change without an Aurous-Version bump, whereas image pricing is pinned to the frozen rate card under the current Aurous-Version. Video rates are per-model and DB-driven, read live from GET /v1/models (see Seedance models & pricing). The reasons:

Provider rates change more frequently than image rates do.
Per-model markup is tuned more often than image markup.
Tying every LLM rate change to a new Aurous-Version would create dozens of version pins per week and confuse downstream caches.

This is a deliberate design choice, not a bug. The pricing surface stays deterministic at the per-request level via a snapshot mechanism:

When a chat, embedding, or video request hits the platform, the current llm_model_pricing version is captured.
The inference row stores that version on inferences.llm_pricing_version (audit trail).
The response usage.breakdown.pricing_version echoes it back.
If an admin updates the model’s rates between hold placement and commit, both the held and the committed amounts use the captured version, not the latest.

You will never be billed at rates that weren’t current at the moment you sent the request.
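To make the snapshot behaviour concrete, here is a toy in-memory model of it. Everything in it is hypothetical: `RATE_CARD`, `place_hold`, `commit`, the version 8 rates, and the assumption that a hold is sized from a maximum output token count are illustrative only and say nothing about how the platform is actually implemented.

```python
from decimal import Decimal

# Hypothetical rate card: version -> rates in credits per million tokens.
RATE_CARD = {
    7: {"input": Decimal("75"), "output": Decimal("450")},
}
current_version = 7


def _credits(rates, input_tokens, output_tokens):
    total = Decimal(input_tokens) * rates["input"] + Decimal(output_tokens) * rates["output"]
    return total / Decimal(1_000_000)


def place_hold(input_tokens, max_output_tokens):
    """Capture the current pricing version and hold a worst-case amount."""
    version = current_version
    held = _credits(RATE_CARD[version], input_tokens, max_output_tokens)
    return {"pricing_version": version, "held": held}


def commit(hold, input_tokens, output_tokens):
    """Bill at the version captured by the hold, not at the latest version."""
    version = hold["pricing_version"]
    charged = _credits(RATE_CARD[version], input_tokens, output_tokens)
    return {"pricing_version": version, "credits_charged": charged}


hold = place_hold(input_tokens=200, max_output_tokens=1000)

# An admin publishes new rates while the request is still in flight.
RATE_CARD[8] = {"input": Decimal("100"), "output": Decimal("600")}
current_version = 8

result = commit(hold, input_tokens=200, output_tokens=600)
```

In this sketch `result` still reports `pricing_version` 7 and a charge of 0.285 credits, even though version 8 became current before the commit.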

How to snapshot rates client-side

If you want to record the rate at the moment your customer made a chat request (audit logs, billing transparency UIs), call GET /v1/models immediately before the chat request and store the chat_pricing block from the relevant model row next to your prompt. Combined with the pricing_version on your completion’s usage.breakdown, you have a full audit pair.
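One way to assemble such an audit pair is sketched below. It deliberately leaves out the HTTP calls and works on already-parsed JSON. The helper names, the record layout, and the assumption that each model row is a dict with an `id` field are all illustrative; check the actual `/v1/models` response shape before relying on them.

```python
from datetime import datetime, timezone


def find_chat_pricing(models, model_id):
    """Return the chat_pricing block for model_id from a parsed model listing.

    Assumes each row is a dict with an "id" key (an assumption, not confirmed).
    """
    for row in models:
        if row.get("id") == model_id:
            return row["chat_pricing"]
    raise KeyError(f"model {model_id!r} not found in listing")


def build_audit_record(models, model_id, prompt, usage):
    """Pair the pre-request rate block with the pricing_version the response echoes."""
    return {
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "model": model_id,
        "prompt": prompt,
        "chat_pricing": find_chat_pricing(models, model_id),
        "pricing_version": usage["breakdown"]["pricing_version"],
        "credits_charged": usage["credits_charged"],
    }
```

Fetch the model listing first, send the chat request, then call `build_audit_record` with the listing and the `usage` object from the completion.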

How rates relate to USD

Credits are the platform’s unit of account. The conversion from credits to USD is set by your team’s billing plan and is not exposed via the V1 API (visible only on the dashboard’s billing page). Pricing math is done in credits end-to-end so that rate changes are denominated in the same unit as your balance.

Practical guidance

For a tight estimate of input cost before dispatch, run a request with max_tokens: 1, read prompt_tokens from the response, divide by 1,000,000, and multiply by chat_pricing.input.credits_per_M.
For ongoing cost monitoring, aggregate usage.credits_charged across your completions (or use GET /v1/usage, which does the aggregation for you).
Reasoning-capable workloads can swing 3–5x in cost between reasoning_effort: "low" and "high". Match the effort to the task.
Cache hits on stable prompt prefixes are surfaced via reduced input_credits on the response — the platform handles the discount transparently.
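The first two points can be sketched as two small helpers. Both names are hypothetical; `estimate_input_credits` covers only the input side of a request, and `total_credits` assumes you have collected the `usage` objects from your own completions.

```python
from decimal import Decimal


def estimate_input_credits(prompt_tokens, input_rate_per_m):
    """Input-side cost estimate from a probe request's prompt_tokens."""
    return Decimal(prompt_tokens) / Decimal(1_000_000) * Decimal(str(input_rate_per_m))


def total_credits(usages):
    """Sum usage.credits_charged across a batch of responses."""
    return sum(
        (Decimal(str(u["credits_charged"])) for u in usages),
        Decimal(0),
    )
```

Converting each `credits_charged` through `str` before building a `Decimal` keeps the running total aligned with the digits the API printed rather than with binary float approximations.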

Related

Chat pricing: chat-specific cost rules.
Errors: balance_too_low on insufficient balance.
Idempotency: replays don’t double-charge.
