messages[*].content accepts either a plain string (text-only) or an array of content parts when you want to mix text with images or video. The model aurous-grow-2.0-pro is multimodal — it accepts text + image and text + video parts in the same request.
Content-part shape
type | Payload | Notes |
|---|---|---|
text | { "text": "..." } | Plain text fragment. |
image_url | { "image_url": { "url": "...", "detail": "low" | "high" | "auto" } } | URL must be HTTPS-fetchable, or a data:image/...;base64,... URI. |
video_url | { "video_url": { "url": "..." } } | URL must point to a video supported by aurous-grow-2.0-pro. |
detail hint on image_url corresponds to the vision-quality tier: low is the cheapest (~512 tokens per image), high is the standard (~1024 tokens per image), and the platform-specific xhigh (~2048 tokens) is exposed as vision_quality: "xhigh" at the top level of the request body for finer-grained control.
Token cost for images
Image input is charged per token alongside text input. A rough heuristic at default quality:| Quality | Tokens per image |
|---|---|
low | ~512 |
high (default) | ~1024 |
xhigh | ~2048 |
usage.prompt_tokens field of the response. Estimates are held conservatively; if actuals exceed the hold, the difference is committed up to your team’s negative-balance floor (default 0 — see Pricing).
Image input example
Base64-encoded images
For images that aren’t reachable over HTTPS (test fixtures, local-only files), inline them via a data URI:Node.js
Python
Video input
Video parts work the same way as images. The model decodes the first N seconds of the video (provider-specific cap) at sampled frames and treats them as the visual context:usage.prompt_tokens on the response.
Mixing modalities in a single message
You can interleave text and vision parts freely:Limits
- The total prompt-token count (text + image + video) must be at or below the model’s
aurous_metadata.context_window. Over the cap returns400 max_input_tokens_exceeded. - Image and video URLs must resolve in under 10 seconds; longer fetches will be treated as a failed request.
- Provider-side moderation may reject inputs; rejection is surfaced via
finish_reason: "content_filter".
Errors
max_input_tokens_exceeded(400) — prompt is over the model’s context window after image/video tokenization. Drop content parts or lowerdetail.chat_provider_request_invalid(500) — an inline asset failed to parse. This is treated as our bug; the request is logged for follow-up.

