LLM workloads (chat completions and embeddings) are priced in credits per million tokens. Each model row at GET /v1/models carries the rate inline on its chat_pricing / embedding_pricing block — the value is the caller’s effective rate, with any per-team override already applied. Customers can compute credits-per-1K by dividing credits_per_M by 1000.
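As a minimal sketch, the per-1K conversion can be applied to a whole pricing block at once. This assumes each bucket inside chat_pricing or embedding_pricing is an object holding a credits_per_M number, as the rate-field paths in the tables below suggest. The helper name rates_per_1k and the example row are hypothetical.

```python
def rates_per_1k(model_row: dict, block: str) -> dict:
    """Convert every bucket in a pricing block from credits/M to credits/1K.

    model_row: one model object from GET /v1/models (parsed JSON).
    block:     "chat_pricing" or "embedding_pricing".
    Assumes each bucket (e.g. input/output, text/visual) is a dict holding
    a credits_per_M number, as in chat_pricing.input.credits_per_M.
    """
    return {
        bucket: rate["credits_per_M"] / 1000
        for bucket, rate in model_row[block].items()
    }


# Hypothetical row shape; only the pricing field paths come from this page.
row = {"chat_pricing": {"input": {"credits_per_M": 75},
                        "output": {"credits_per_M": 450}}}
print(rates_per_1k(row, "chat_pricing"))  # {'input': 0.075, 'output': 0.45}
```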

Chat completions

Chat models bill across three token buckets:
| Bucket | Rate field on /v1/models | What it counts |
| --- | --- | --- |
| Input | chat_pricing.input.credits_per_M | Text, image, and video tokens on the prompt (including system messages and prior turns). |
| Output | chat_pricing.output.credits_per_M | Assistant text content + tool-call arguments. |
| Reasoning | chat_pricing.output.credits_per_M (same as output) | Hidden deliberation tokens on reasoning-capable models. |
The math:
credits_charged
  = (input_tokens     / 1_000_000) × chat_pricing.input.credits_per_M
  + (output_tokens    / 1_000_000) × chat_pricing.output.credits_per_M
  + (reasoning_tokens / 1_000_000) × chat_pricing.output.credits_per_M
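The same formula as a small Python function. It is a sketch, not the platform's billing code: it maps input_tokens to usage.prompt_tokens and output_tokens to usage.completion_tokens (matching the worked example below), treats reasoning_tokens as a separate count, and does no rounding, because the platform's rounding rule is not documented here. The function name chat_breakdown is hypothetical.

```python
def chat_breakdown(usage: dict, chat_pricing: dict) -> dict:
    """Per-bucket credits for one chat completion.

    usage:        token counts (prompt_tokens, completion_tokens, reasoning_tokens).
    chat_pricing: the chat_pricing block from the model's GET /v1/models row.
    Reasoning tokens are billed at the output rate.
    """
    input_rate = chat_pricing["input"]["credits_per_M"]
    output_rate = chat_pricing["output"]["credits_per_M"]
    parts = {
        "input_credits": usage["prompt_tokens"] / 1_000_000 * input_rate,
        "output_credits": usage["completion_tokens"] / 1_000_000 * output_rate,
        # Assumption: an absent reasoning count (non-reasoning model) means zero.
        "reasoning_credits": usage.get("reasoning_tokens", 0) / 1_000_000 * output_rate,
    }
    parts["credits_charged"] = sum(parts.values())
    return parts


pricing = {"input": {"credits_per_M": 75}, "output": {"credits_per_M": 450}}
usage = {"prompt_tokens": 200, "completion_tokens": 600, "reasoning_tokens": 50}
print(chat_breakdown(usage, pricing))
```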

Worked example

A request to aurous-grow-2.0-pro with the following response:
{
  "usage": {
    "prompt_tokens": 200,
    "completion_tokens": 600,
    "reasoning_tokens": 50,
    "credits_charged": 0.3075,
    "breakdown": {
      "input_credits":     0.0150,
      "output_credits":    0.2700,
      "reasoning_credits": 0.0225,
      "model": "aurous-grow-2.0-pro",
      "pricing_version": 7
    }
  }
}
At the day-1 rates (chat_pricing.input.credits_per_M: 75, chat_pricing.output.credits_per_M: 450):
input_credits     = (200 / 1_000_000) × 75  = 0.0150
output_credits    = (600 / 1_000_000) × 450 = 0.2700
reasoning_credits = (50  / 1_000_000) × 450 = 0.0225
                                              ─────────
                                              0.3075 credits charged
The breakdown.pricing_version field identifies the rate-card snapshot pinned to this inference. All three buckets are priced from that same snapshot, so to reverse-derive the math, use the rates in effect for that pricing_version rather than the model's current rates.
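Because every response carries both credits_charged and its per-bucket breakdown, a client can sanity-check the two against each other before recording them. This minimal sketch assumes every credit component in breakdown has a key ending in _credits (true of the chat and embedding examples on this page) and compares with a small tolerance, since the platform's rounding is not specified; widen tol if your responses are rounded. The name breakdown_is_consistent and the response values are illustrative.

```python
import json
import math


def breakdown_is_consistent(usage: dict, tol: float = 1e-6) -> bool:
    """True if the *_credits fields in usage.breakdown sum to credits_charged."""
    breakdown = usage["breakdown"]
    # Skip non-credit fields such as "model" and "pricing_version".
    parts = math.fsum(v for k, v in breakdown.items() if k.endswith("_credits"))
    return math.isclose(parts, usage["credits_charged"], rel_tol=0.0, abs_tol=tol)


# Illustrative response: 200 input tokens at 75/M, 600 output tokens at 450/M.
response = json.loads("""
{"usage": {"credits_charged": 0.285,
           "breakdown": {"input_credits": 0.015,
                         "output_credits": 0.27,
                         "reasoning_credits": 0.0,
                         "model": "aurous-grow-2.0-pro",
                         "pricing_version": 7}}}
""")
usage = response["usage"]
print(breakdown_is_consistent(usage), usage["breakdown"]["pricing_version"])  # True 7
```

Storing pricing_version next to each checked record keeps the response traceable to the rate card that priced it.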

Embeddings

Embedding models bill across two buckets: text input and visual input. The video modality was retired on 2026-05-24; the provider folded video frames into the visual bucket, so the published video rate never actually applied. video_url parts on POST /v1/embeddings are now rejected with embeddings_video_unsupported. To embed video content, extract frames in your pipeline and submit them as image_url parts.
| Bucket | Rate field on /v1/models | What it counts |
| --- | --- | --- |
| Text input | embedding_pricing.text.credits_per_M | Text tokens in input (string input or text content parts). |
| Visual input | embedding_pricing.visual.credits_per_M | Image tokens from image_url content parts. |
credits_charged
  = (input_text_tokens   / 1_000_000) × embedding_pricing.text.credits_per_M
  + (input_visual_tokens / 1_000_000) × embedding_pricing.visual.credits_per_M
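A sketch of the embedding formula in Python. The per-modality token counts are plain arguments here because the usage object in the example below reports only the combined prompt_tokens; where your counts come from (for example, your own estimate) is an assumption, and embedding_breakdown is a hypothetical name.

```python
def embedding_breakdown(text_tokens: int, visual_tokens: int,
                        embedding_pricing: dict) -> dict:
    """Per-bucket credits for one embeddings request.

    embedding_pricing: the embedding_pricing block from GET /v1/models,
    assumed to hold text.credits_per_M and visual.credits_per_M.
    """
    text_rate = embedding_pricing["text"]["credits_per_M"]
    visual_rate = embedding_pricing["visual"]["credits_per_M"]
    parts = {
        "input_text_credits": text_tokens / 1_000_000 * text_rate,
        "input_visual_credits": visual_tokens / 1_000_000 * visual_rate,
    }
    parts["credits_charged"] = sum(parts.values())
    return parts


pricing = {"text": {"credits_per_M": 18.75}, "visual": {"credits_per_M": 48.75}}
print(embedding_breakdown(5000, 2000, pricing))
```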

Worked example

{
  "usage": {
    "prompt_tokens": 7000,
    "credits_charged": 0.19125,
    "breakdown": {
      "input_text_credits":   0.09375,
      "input_visual_credits": 0.09750,
      "model": "aurous-embed-vision-1.0",
      "pricing_version": 3
    }
  }
}
At the day-1 rates (embedding_pricing.text.credits_per_M: 18.75, embedding_pricing.visual.credits_per_M: 48.75) with a 5,000-text-token + 2,000-visual-token input:
text_credits   = (5000 / 1_000_000) × 18.75 = 0.09375
visual_credits = (2000 / 1_000_000) × 48.75 = 0.09750
                                              ─────────
                                              0.19125 credits charged
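Since video_url parts are rejected with embeddings_video_unsupported, a client that still receives video can swap each video part for per-frame image_url parts before calling POST /v1/embeddings. This sketch assumes OpenAI-style content parts ({"type": ..., ...}) with the image nested as image_url.url; neither shape is specified on this page. replace_video_parts and extract_frames are hypothetical names, and extract_frames stands in for your own frame-extraction pipeline.

```python
from typing import Callable, List

Part = dict  # one content part, e.g. {"type": "text", "text": "..."}


def replace_video_parts(parts: List[Part],
                        extract_frames: Callable[[Part], List[str]]) -> List[Part]:
    """Return a new list with each video_url part expanded into image_url parts.

    extract_frames receives the video part and returns one URL (or data URL)
    per frame to embed; the sampling strategy is up to your pipeline.
    """
    out: List[Part] = []
    for part in parts:
        if part.get("type") == "video_url":
            for url in extract_frames(part):
                # Assumed image part shape; adjust to the documented request schema.
                out.append({"type": "image_url", "image_url": {"url": url}})
        else:
            out.append(part)
    return out


def fake_frames(part: Part) -> List[str]:
    # Stand-in for real frame extraction: two fixed frame URLs.
    return ["https://example.com/f0.png", "https://example.com/f1.png"]


parts = [{"type": "text", "text": "product demo"},
         {"type": "video_url", "video_url": {"url": "https://example.com/demo.mp4"}}]
print([p["type"] for p in replace_video_parts(parts, fake_frames)])
# ['text', 'image_url', 'image_url']
```

Each frame is billed as visual input at embedding_pricing.visual.credits_per_M, so the number of frames you sample drives the cost of the request.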

Pricing version — mutability asymmetry

Rates for chat and embedding models are mutable without an Aurous-Version bump. Image and video pricing is pinned to the frozen rate card under the current Aurous-Version; LLM pricing is not. Why:
  • Provider rates change more frequently than image / video rates do.
  • Per-model markup is tuned more often than image / video markup.
  • Tying every LLM rate change to a new Aurous-Version would create dozens of version pins per week and confuse downstream caches.
This is a deliberate design choice, not a bug. The pricing surface stays deterministic at the per-request level via a snapshot mechanism:
  1. When a chat or embedding request hits the platform, the current llm_model_pricing version is captured.
  2. The inference row stores that version on inferences.llm_pricing_version (audit trail).
  3. The response usage.breakdown.pricing_version echoes it back.
  4. If an admin updates the model’s rates between hold placement and commit, both the held and the committed amounts use the captured version, not the latest.
You will never be billed at rates that weren’t current at the moment you sent the request.
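The snapshot steps above can be modeled with a toy in-memory rate table. This illustrates the idea and is not the platform's implementation: every rate update creates a new version, a request captures the version current at arrival, and both the hold and the commit price against that captured version. All class and function names are hypothetical, the hold size is made up, and only input/output rates are modeled.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PricingTable:
    """Toy versioned rate card: each update is stored as a new version."""
    versions: Dict[int, Dict[str, float]] = field(default_factory=dict)
    current: int = 0

    def update(self, rates: Dict[str, float]) -> int:
        self.current += 1
        self.versions[self.current] = dict(rates)  # copy so later edits can't leak in
        return self.current


@dataclass
class Inference:
    pricing_version: int  # captured once, when the request arrives (step 1)


def start_inference(table: PricingTable) -> Inference:
    return Inference(pricing_version=table.current)


def price(table: PricingTable, inf: Inference,
          input_tokens: int, output_tokens: int) -> float:
    """Used for both the hold and the commit: always the captured version."""
    rates = table.versions[inf.pricing_version]
    return (input_tokens / 1_000_000 * rates["input"]
            + output_tokens / 1_000_000 * rates["output"])


table = PricingTable()
table.update({"input": 75, "output": 450})
inf = start_inference(table)                 # captures version 1
held = price(table, inf, 200, 1000)          # hypothetical hold for up to 1,000 output tokens
table.update({"input": 100, "output": 600})  # admin changes rates mid-flight
committed = price(table, inf, 200, 600)      # still billed at version 1
print(inf.pricing_version, round(held, 4), round(committed, 4))  # 1 0.465 0.285
```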

How to snapshot rates client-side

If you want to record the rate at the moment your customer made a chat request (audit logs, billing transparency UIs), call GET /v1/models immediately before the chat request and store the chat_pricing block from the relevant model row next to your prompt. Combined with the pricing_version on your completion’s usage.breakdown, you have a full audit pair.
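A minimal sketch of that audit pair, split into the two moments described above: snapshot the pricing block before sending, then attach the response's pricing_version and credits_charged afterwards. It works on already-parsed JSON, so the HTTP calls are left out; the function names and record fields are hypothetical.

```python
import copy
from datetime import datetime, timezone


def snapshot_rates(model_row: dict, prompt: str) -> dict:
    """Call right after GET /v1/models and before sending the chat request."""
    return {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        # Deep copy so later mutation of model_row can't alter the audit record.
        "chat_pricing": copy.deepcopy(model_row["chat_pricing"]),
        "prompt": prompt,
    }


def complete_audit(snapshot: dict, usage: dict) -> dict:
    """Attach the server-side snapshot id and the charge from the completion's usage."""
    breakdown = usage["breakdown"]
    return {
        **snapshot,
        "model": breakdown["model"],
        "pricing_version": breakdown["pricing_version"],
        "credits_charged": usage["credits_charged"],
    }
```

If the rates you captured disagree with what the charge implies, an admin may have changed them between your GET /v1/models call and the request; the pricing_version identifies the rate card the platform actually used.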

How rates relate to USD

Credits are the platform’s unit of account. The conversion from credits to USD is set by your team’s billing plan and is not exposed via the V1 API (visible only on the dashboard’s billing page). Pricing math is done in credits end-to-end so that rate changes are denominated in the same unit as your balance.

Practical guidance

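  • For a tight estimate of input cost before dispatch, send the prompt with max_tokens: 1, read usage.prompt_tokens from the response, and multiply by chat_pricing.input.credits_per_M / 1_000_000. The probe request is itself billed, and output cost still depends on how many tokens the real request generates.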
  • For ongoing cost monitoring, aggregate usage.credits_charged across your completions (or use GET /v1/usage, which does the aggregation for you).
  • Reasoning-capable workloads can swing 3–5x in cost between reasoning_effort: "low" and "high". Match the effort to the task.
  • Cache hits on stable prompt prefixes are surfaced via reduced input_credits on the response — the platform handles the discount transparently.
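The first two tips can be sketched as two small helpers. estimate_chat_credits prices a request from the prompt_tokens of a max_tokens: 1 probe plus your own guess at output and reasoning tokens (the guesses are assumptions you supply, not values the API returns); total_credits sums usage.credits_charged across completions you have already received. Both names are hypothetical.

```python
import math
from typing import Iterable


def estimate_chat_credits(prompt_tokens: int, chat_pricing: dict,
                          expected_output_tokens: int = 0,
                          expected_reasoning_tokens: int = 0) -> float:
    """Pre-dispatch estimate; reasoning tokens are priced at the output rate."""
    input_rate = chat_pricing["input"]["credits_per_M"]
    output_rate = chat_pricing["output"]["credits_per_M"]
    return (prompt_tokens * input_rate
            + (expected_output_tokens + expected_reasoning_tokens) * output_rate
            ) / 1_000_000


def total_credits(usages: Iterable[dict]) -> float:
    """Sum usage.credits_charged; math.fsum limits float rounding drift."""
    return math.fsum(u["credits_charged"] for u in usages)


pricing = {"input": {"credits_per_M": 75}, "output": {"credits_per_M": 450}}
print(estimate_chat_credits(200, pricing))                              # 0.015
print(estimate_chat_credits(200, pricing, expected_output_tokens=600))  # 0.285
```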