Skip to main content
Every chat completion and embedding response carries a usage block with token counts. Every credit charge is derived from those counts. This page explains exactly how Aurous Labs counts tokens for each input type, what’s the same as OpenAI, and where you’ll see drift.

Text tokens

Text input is tokenized using the underlying provider tokenizer for each model. For aurous-grow-2.0-pro and aurous-embed-vision-1.0, that’s the BytePlus Doubao-family tokenizer, which is closely related to but not identical to OpenAI’s tiktoken cl100k_base / o200k_base encodings. In practice:
  • English-language input tokenizes within ±10% of OpenAI’s o200k_base (the gpt-4o tokenizer) on prose; ±20% on highly-repetitive content ("hello "*1000 style).
  • Languages with non-Latin scripts (Chinese, Japanese, Korean, Arabic, etc.) can show 20-40% drift depending on script. The Doubao tokenizer is more efficient on CJK scripts than cl100k_base; OpenAI’s o200k_base has narrowed that gap but not eliminated it.
  • Code (TypeScript, Python, JSON) tokenizes within ±15%.
  • Repeated short tokens (a a a a a a ...) can land 2-3× higher than tiktoken estimates because BytePlus does not aggressively merge them.
If you’re budgeting per-call token counts based on tiktoken, allow at least a 15-25% margin for safety on English prose and 30-40% on CJK workloads. The cheapest reliable estimate is a probe call (max_tokens: 1 + read prompt_tokens from the response).

How to know the actual count before the call

Two options:
  1. POST /v1/embeddings/estimate for embeddings — returns the token count + the credit charge the real call would compute. No hold, no billed charge, no inferences row. See estimate docs.
  2. Read usage.prompt_tokens from a small probe call for chat — issue one cheap call with the full prompt and max_tokens: 1, read the prompt_tokens from the response. This costs you one billed token of completion.
We don’t expose a standalone count_tokens endpoint today. Adding one is on the v1.1 roadmap.

Image tokens (chat multimodal + embedding visual)

Image inputs are converted into a fixed number of visual tokens at the model’s tokenizer layer. The count depends on the image’s pixel dimensions and the model’s tile-based encoder:
  • aurous-grow-2.0-pro (chat multimodal images): each image contributes a per-image token count derived from ceil(width / 28) × ceil(height / 28) + overhead (subject to a hard cap per image). For practical UI screenshots (1280×720, say) expect ~1,200-1,400 visual tokens per image.
  • aurous-embed-vision-1.0 (embedding image parts): each image contributes a model-internal visual token count that is not broken out as prompt_tokens on the response. Instead, the usage.breakdown.input.visual field reports the visual-token count specifically, billed at the visual rate (separate from text). See Embeddings pricing.
OpenAI’s image-tokenization formula (the tile-based scheme published for gpt-4o) is roughly similar but not identical. Don’t try to predict our count from OpenAI’s table — read the actual count off usage.breakdown.input.visual (embeddings) or usage.prompt_tokens (chat) from a probe call.

Image URL fetching

When you pass image_url: { url: "https://..." }, we fetch the bytes server-side at request time (HTTPS only, 10-second timeout, RFC1918 / loopback / link-local URLs blocked). The fetched bytes are then tokenized; if the URL is unreachable, we surface a provider_request_invalid error rather than running the model on a partial input. See URL fetching.

Video tokens (embedding only today)

aurous-embed-vision-1.0 accepts a single video part per request. Video tokens are computed from (duration_seconds × frames_per_second × per_frame_visual_token_count) and capped by the model’s context window. Practical reference: a 30-second 1080p video → ~3,000-5,000 visual tokens. The usage.breakdown.input.video field on the response reports the exact count.

Output tokens (chat only)

For chat completions, output tokens are counted by the model as it generates. usage.completion_tokens is the count after the response finishes; for streamed responses, the count is on the final non-[DONE] chunk’s usage block (we attach usage to the last data frame, not as a separate frame, matching OpenAI’s streaming-with-usage shape).

Reasoning tokens

For reasoning-capable models invoked with reasoning_effort: "low"|"medium"|"high", the model internally produces “reasoning tokens” that don’t appear in the visible content but ARE counted toward the output cost. We surface them separately as usage.reasoning_tokens so you can see what reasoning effort actually cost. completion_tokens is the total of visible-output + reasoning tokens, matching OpenAI’s reasoning-model semantics.

Cached input tokens — v1.1

BytePlus auto-caches stable prompt prefixes (≥ ~1,024 tokens) on aurous-grow-2.0-pro. We observe the cache hit count internally but do not pass the discount through to customers in v1.0 — the full input rate applies to all input tokens regardless of cache hit status. This is a margin policy, not a technical limitation; see Cost transparency for the rationale. Coming in v1.1: customer-visible usage.breakdown.cached_input_tokens counter + per-team opt-in for cache-discount pass-through. Track the changelog or launch-week followups for the timeline. We do NOT expose a manual context-cache control plane (the context.create / context.attach flow some platforms have). Auto-caching at the provider layer is good enough for the typical multi-turn pattern.

Putting it all together — sample receipts

Chat (text-only, system + user, ~50 + ~50 tokens, ~50-token reply)

{
  "usage": {
    "prompt_tokens": 102,
    "completion_tokens": 47,
    "total_tokens": 149,
    "credits_charged": 0.0298,
    "breakdown": {
      "model": "aurous-grow-2.0-pro",
      "input_credits": 0.0145,
      "output_credits": 0.0153,
      "pricing_version": 1
    }
  }
}

Chat (with reasoning_effort=medium, ~30 + ~10 input, ~100 visible + ~400 reasoning)

{
  "usage": {
    "prompt_tokens": 41,
    "completion_tokens": 503,
    "reasoning_tokens": 402,
    "total_tokens": 544,
    "credits_charged": 0.1872,
    "breakdown": {
      "model": "aurous-grow-2.0-pro",
      "input_credits": 0.0086,
      "output_credits": 0.0786,
      "reasoning_credits": 0.1000,
      "pricing_version": 1
    }
  }
}
Note that reasoning_tokens is a top-level field on usage (not nested under breakdown) and is NOT included in completion_tokens — it’s a separate count. The reasoning rate equals the visible-output rate today; breakdown.reasoning_credits reports the credit subtotal attributable to reasoning so it reconciles cleanly with credits_charged.

Embedding (text only, ~50 tokens)

{
  "usage": {
    "prompt_tokens": 47,
    "total_tokens": 47,
    "credits_charged": 0.000094,
    "breakdown": {
      "model": "aurous-embed-vision-1.0",
      "input": { "text": 0.000094, "visual": 0, "video": 0 }
    }
  }
}

Embedding (text + 1 image, ~1200 visual tokens)

{
  "usage": {
    "prompt_tokens": 1247,
    "total_tokens": 1247,
    "credits_charged": 0.00374,
    "breakdown": {
      "model": "aurous-embed-vision-1.0",
      "input": { "text": 0.000094, "visual": 0.00365, "video": 0 }
    }
  }
}
The values inside breakdown.input.{text, visual, video} are credit amounts, not token counts. The token counts (per-modality, computed by the model’s tokenizer) are not surfaced in the response — only the aggregate prompt_tokens is. To estimate them, use POST /v1/embeddings/estimate.

Where to next?