messages[*].content accepts either a plain string (text-only) or an array of content parts when you want to mix text with images or video. The model aurous-grow-2.0-pro is multimodal: it accepts text + image and text + video parts in the same request.

Content-part shape

{
  "role": "user",
  "content": [
    { "type": "text", "text": "Describe this image in one sentence." },
    {
      "type": "image_url",
      "image_url": {
        "url": "https://example.com/cat.jpg",
        "detail": "high"
      }
    }
  ]
}
Supported part types:

| type | Payload | Notes |
| --- | --- | --- |
| text | { "text": "..." } | Plain text fragment. |
| image_url | { "image_url": { "url": "...", "detail": "..." } } | URL must be HTTPS-fetchable, or a data:image/...;base64,... URI. detail is "low", "high", or "auto". |
| video_url | { "video_url": { "url": "..." } } | URL must point to a video supported by aurous-grow-2.0-pro. |

The detail hint on image_url selects the vision-quality tier: low is the cheapest (~512 tokens per image) and high is the standard (~1024 tokens per image). The platform-specific xhigh tier (~2048 tokens per image) is set separately, as vision_quality: "xhigh" at the top level of the request body, for finer-grained control.
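
The URL and detail rules above can be checked on the client before a request is sent. The following is a minimal sketch, not part of any official SDK: the helper name image_part is hypothetical, and it only checks the URL scheme and the detail value, not whether the URL is actually fetchable.

```python
from typing import Optional
from urllib.parse import urlparse

# detail values listed for image_url in the part-type table
ALLOWED_DETAIL = {"low", "high", "auto"}


def image_part(url: str, detail: Optional[str] = None) -> dict:
    """Build an image_url content part (hypothetical helper).

    Rejects URLs that are neither HTTPS nor a base64 image data URI,
    and detail values outside the documented set.
    """
    is_data_uri = url.startswith("data:image/") and ";base64," in url
    is_https = urlparse(url).scheme == "https"
    if not (is_data_uri or is_https):
        raise ValueError("url must be HTTPS or a data:image/...;base64,... URI")

    payload = {"url": url}
    if detail is not None:
        if detail not in ALLOWED_DETAIL:
            raise ValueError(f"detail must be one of {sorted(ALLOWED_DETAIL)}")
        payload["detail"] = detail
    return {"type": "image_url", "image_url": payload}
```

Failing early on the client avoids spending a round trip on a request the platform would reject anyway.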

Token cost for images

Image input is charged per token alongside text input. A rough heuristic, by quality tier:

| Quality | Tokens per image |
| --- | --- |
| low | ~512 |
| high (default) | ~1024 |
| xhigh | ~2048 |

The exact count is returned in the usage.prompt_tokens field of the response. The estimated cost is held conservatively; if the actual cost exceeds the hold, the difference is committed up to your team’s negative-balance floor (default 0; see Pricing).
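
For budgeting, the per-tier figures can be summed over a request's image parts. This is a rough sketch under stated assumptions: the function name estimate_image_tokens is hypothetical, a missing or "auto" detail is counted at the default high tier, and a top-level vision_quality of "xhigh" is assumed to apply to every image. The authoritative number is still usage.prompt_tokens.

```python
# Approximate per-image costs from the heuristic table; real usage may differ.
TOKENS_PER_IMAGE = {"low": 512, "high": 1024, "xhigh": 2048}


def estimate_image_tokens(messages, vision_quality=None):
    """Estimate image tokens for a list of chat messages (hypothetical helper)."""
    total = 0
    for message in messages:
        content = message.get("content")
        if isinstance(content, str):
            continue  # text-only message, no image parts
        for part in content or []:
            if part.get("type") != "image_url":
                continue
            if vision_quality == "xhigh":
                tier = "xhigh"  # assumption: the top-level setting covers all images
            else:
                tier = part["image_url"].get("detail", "high")
            # Assumption: "auto" (or any unknown value) is costed at the default tier.
            total += TOKENS_PER_IMAGE.get(tier, TOKENS_PER_IMAGE["high"])
    return total
```

Text and video tokens are not included, so treat the result as a lower bound on the prompt size.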

Image input example

curl -X POST https://api.aurous-labs.com/v1/chat/completions \
  -H "Authorization: Bearer $AUROUS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "aurous-grow-2.0-pro",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "What breed is this cat?" },
          {
            "type": "image_url",
            "image_url": { "url": "https://example.com/cat.jpg", "detail": "high" }
          }
        ]
      }
    ],
    "max_tokens": 256
  }'

Base64-encoded images

For images that aren’t reachable over HTTPS (test fixtures, local-only files), inline them via a data URI:
Node.js
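import { readFileSync } from "node:fs";
import OpenAI from "openai"; // assumes the OpenAI-compatible SDK, as in the Python example

// Base URL matches the curl example; the key is read from the environment.
const client = new OpenAI({
  baseURL: "https://api.aurous-labs.com/v1",
  apiKey: process.env.AUROUS_API_KEY,
});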

const buf = readFileSync("./cat.jpg");
const base64 = buf.toString("base64");
const dataUrl = `data:image/jpeg;base64,${base64}`;

const res = await client.chat.completions.create({
  model: "aurous-grow-2.0-pro",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "What breed?" },
        { type: "image_url", image_url: { url: dataUrl } },
      ],
    },
  ],
});
Python
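import base64
import os

from openai import OpenAI

# Base URL matches the curl example; the key is read from the environment.
client = OpenAI(
    base_url="https://api.aurous-labs.com/v1",
    api_key=os.environ["AUROUS_API_KEY"],
)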

with open("cat.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

res = client.chat.completions.create(
    model="aurous-grow-2.0-pro",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What breed?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
                },
            ],
        }
    ],
)
Watch the request body size: base64 inflates payloads by ~33%. For images larger than a few hundred KB, host them on your own CDN and pass the URL.
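
The ~33% figure follows from base64 mapping every 3 input bytes to 4 output characters. A small sketch of that arithmetic, with a hypothetical cutoff: INLINE_LIMIT_BYTES is an illustrative reading of "a few hundred KB", not a documented platform limit.

```python
# Hypothetical threshold standing in for "a few hundred KB".
INLINE_LIMIT_BYTES = 300 * 1024


def base64_size(raw_bytes: int) -> int:
    """Length of the padded base64 encoding of raw_bytes bytes."""
    return 4 * ((raw_bytes + 2) // 3)


def should_inline(raw_bytes: int, limit: int = INLINE_LIMIT_BYTES) -> bool:
    """True if the image is small enough to send as a data URI."""
    return raw_bytes <= limit
```

For example, a 300,000-byte image becomes 400,000 base64 characters before the data URI prefix and JSON overhead are added.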

Video input

Video parts work the same way as images. The model decodes the first N seconds of the video (N is a provider-specific cap), samples frames from that span, and treats them as the visual context:
{
  "role": "user",
  "content": [
    { "type": "text", "text": "Summarize what happens in this clip." },
    { "type": "video_url", "video_url": { "url": "https://example.com/clip.mp4" } }
  ]
}
Video token costs are higher than image costs (typically 3–5x) and depend on clip length. The exact charge is reported in usage.prompt_tokens on the response.

Mixing modalities in a single message

You can interleave text and vision parts freely:
{
  "role": "user",
  "content": [
    { "type": "text", "text": "Compare these two product photos:" },
    { "type": "image_url", "image_url": { "url": "https://example.com/a.jpg" } },
    { "type": "text", "text": "versus" },
    { "type": "image_url", "image_url": { "url": "https://example.com/b.jpg" } },
    { "type": "text", "text": "Which is the better photograph?" }
  ]
}
The model treats the array as one ordered input: text fragments and visual parts share semantic context.
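
Because order matters, it can help to build interleaved messages programmatically rather than by hand. A minimal sketch, assuming a hypothetical helper named compare_images that places a text joiner between consecutive images:

```python
def compare_images(intro, image_urls, question, joiner="versus"):
    """Build one user message that interleaves text and image parts.

    Hypothetical helper: part order in the returned array is the order
    the model sees.
    """
    content = [{"type": "text", "text": intro}]
    for index, url in enumerate(image_urls):
        if index > 0:
            content.append({"type": "text", "text": joiner})
        content.append({"type": "image_url", "image_url": {"url": url}})
    content.append({"type": "text", "text": question})
    return {"role": "user", "content": content}


message = compare_images(
    "Compare these two product photos:",
    ["https://example.com/a.jpg", "https://example.com/b.jpg"],
    "Which is the better photograph?",
)
```

The resulting message has the same parts, in the same order, as the JSON example above.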

Limits

  • The total prompt-token count (text + image + video) must be at or below the model’s aurous_metadata.context_window. Requests over the cap return 400 max_input_tokens_exceeded.
  • Image and video URLs must resolve in under 10 seconds; slower fetches are treated as a failed request.
  • Provider-side moderation may reject inputs; rejection is surfaced via finish_reason: "content_filter".

Errors

  • max_input_tokens_exceeded (400) — prompt is over the model’s context window after image/video tokenization. Drop content parts or lower detail.
  • chat_provider_request_invalid (500) — an inline asset failed to parse. This is treated as our bug; the request is logged for follow-up.
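
The outcomes listed under Limits and Errors can be routed with a small dispatcher. This is a sketch only: the function name next_action and the returned action labels are hypothetical, and because this page does not specify where the error code sits in the response body, the caller is assumed to have already extracted the status, error code, and finish_reason.

```python
def next_action(status, error_code=None, finish_reason=None):
    """Map a response outcome to a client-side action label (hypothetical)."""
    if status == 400 and error_code == "max_input_tokens_exceeded":
        # Prompt exceeds the context window after image/video tokenization:
        # drop content parts or lower detail before resending.
        return "shrink_prompt"
    if status == 500 and error_code == "chat_provider_request_invalid":
        # An inline asset failed to parse; the platform logs it for follow-up.
        return "platform_error"
    if finish_reason == "content_filter":
        # Provider-side moderation rejected the input.
        return "content_filtered"
    if status == 200:
        return "ok"
    return "unhandled"
```

Keeping the mapping in one place makes it easy to extend when further error codes need handling.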