Multimodal input

messages[*].content accepts either a plain string (text only) or an array of content parts that mix text with images or video. The aurous-grow-2.0-pro model is multimodal: it accepts text with image parts and text with video parts in the same request.

Content-part shape

{
  "role": "user",
  "content": [
    { "type": "text", "text": "Describe this image in one sentence." },
    {
      "type": "image_url",
      "image_url": {
        "url": "https://example.com/cat.jpg",
        "detail": "high"
      }
    }
  ]
}

Supported part types:

`type`	Payload	Notes
`text`	`{ "text": "..." }`	Plain text fragment.
`image_url`	`{ "image_url": { "url": "...", "detail": "low" | "high" | "auto" } }`	URL must be HTTPS-fetchable, or a `data:image/...;base64,...` URI.
`video_url`	`{ "video_url": { "url": "..." } }`	URL must point to a video supported by `aurous-grow-2.0-pro`.

The detail hint on image_url selects the vision-quality tier: low is the cheapest (~512 tokens per image) and high is the standard (~1024 tokens per image). The platform-specific xhigh tier (~2048 tokens per image) is not set through detail; for finer-grained control, request it with vision_quality: "xhigh" at the top level of the request body.
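As a minimal sketch of the top-level field described above, the following builds a request body that asks for the xhigh tier. The helper name build_xhigh_request and the example URL are hypothetical. How vision_quality interacts with a per-image detail hint is not specified on this page, so the sketch leaves detail unset.

```python
import json


def build_xhigh_request(prompt: str, image_url: str) -> dict:
    """Build a chat-completions body that requests the xhigh vision tier.

    Hypothetical helper: only the field names come from this page.
    """
    return {
        "model": "aurous-grow-2.0-pro",
        # Top-level field, not part of the image_url payload.
        "vision_quality": "xhigh",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


body = build_xhigh_request("Read the small print.", "https://example.com/label.jpg")
print(json.dumps(body, indent=2))
```

With the OpenAI Python SDK, a non-standard top-level field like this can be passed through the extra_body argument of client.chat.completions.create.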

Token cost for images

Image input is charged per token alongside text input. A rough heuristic per quality tier:

Quality	Tokens per image
`low`	~512
`high` (default)	~1024
`xhigh`	~2048

The exact count is returned in the usage.prompt_tokens field of the response. Estimates are held conservatively before the request runs; if the actual cost exceeds the hold, the difference is committed up to your team’s negative-balance floor (default 0; see Pricing).
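To budget a request before sending it, the heuristic above can be turned into a rough estimator. This is a sketch only: the function name estimate_image_tokens is hypothetical, the figures are the approximate ones from the table, and treating "auto" (or a missing detail) as the high tier is an assumption. The authoritative number is always usage.prompt_tokens.

```python
# Approximate per-image token counts, copied from the table above.
APPROX_IMAGE_TOKENS = {"low": 512, "high": 1024, "xhigh": 2048}


def estimate_image_tokens(content, default_detail="high"):
    """Roughly estimate the image tokens in one message's content.

    `content` is either a plain string or a list of content parts.
    """
    if isinstance(content, str):
        return 0  # text-only message, no image parts
    total = 0
    for part in content:
        if part.get("type") != "image_url":
            continue
        detail = part["image_url"].get("detail", default_detail)
        # Assumption: "auto" and unknown values fall back to the default tier.
        total += APPROX_IMAGE_TOKENS.get(detail, APPROX_IMAGE_TOKENS[default_detail])
    return total


content = [
    {"type": "text", "text": "Compare these."},
    {"type": "image_url", "image_url": {"url": "https://example.com/a.jpg", "detail": "low"}},
    {"type": "image_url", "image_url": {"url": "https://example.com/b.jpg"}},
]
print(estimate_image_tokens(content))  # 512 + 1024 = 1536
```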

Image input example

cURL

curl -X POST https://api.aurous-labs.com/v1/chat/completions \
  -H "Authorization: Bearer $AUROUS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "aurous-grow-2.0-pro",
    "messages": [
      {
        "role": "user",
        "content": [
          { "type": "text", "text": "What breed is this cat?" },
          {
            "type": "image_url",
            "image_url": { "url": "https://example.com/cat.jpg", "detail": "high" }
          }
        ]
      }
    ],
    "max_tokens": 256
  }'

Node.js

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.aurous-labs.com/v1",
  apiKey: process.env.AUROUS_API_KEY!,
});

const res = await client.chat.completions.create({
  model: "aurous-grow-2.0-pro",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "What breed is this cat?" },
        {
          type: "image_url",
          image_url: { url: "https://example.com/cat.jpg", detail: "high" },
        },
      ],
    },
  ],
  max_tokens: 256,
});

console.log(res.choices[0].message.content);

Python

import os

from openai import OpenAI

# Read the key from the environment rather than hardcoding it.
client = OpenAI(
    base_url="https://api.aurous-labs.com/v1",
    api_key=os.environ["AUROUS_API_KEY"],
)

res = client.chat.completions.create(
    model="aurous-grow-2.0-pro",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What breed is this cat?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/cat.jpg",
                        "detail": "high",
                    },
                },
            ],
        }
    ],
    max_tokens=256,
)

print(res.choices[0].message.content)

Base64-encoded images

For images that aren’t reachable over HTTPS (test fixtures, local-only files), inline them via a data URI:

Node.js

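import { readFileSync } from "node:fs";

// Reuses the `client` created in the image input example.
// Read the local file so it can be inlined as a base64 data URI.
const buf = readFileSync("./cat.jpg");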
const base64 = buf.toString("base64");
const dataUrl = `data:image/jpeg;base64,${base64}`;

const res = await client.chat.completions.create({
  model: "aurous-grow-2.0-pro",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "What breed?" },
        { type: "image_url", image_url: { url: dataUrl } },
      ],
    },
  ],
});

Python

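import base64
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.aurous-labs.com/v1",
    api_key=os.environ["AUROUS_API_KEY"],
)

# Read the local file and encode it for a data URI.
with open("cat.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()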

res = client.chat.completions.create(
    model="aurous-grow-2.0-pro",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What breed?"},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
                },
            ],
        }
    ],
)

Watch the request body size — base64 inflates payloads by ~33%. For images larger than a few hundred KB, host them on your own CDN and pass the URL.
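The size advice above can be enforced before a request is built. The sketch below is one way to do it, not a platform requirement: the helper names and the 300 KB threshold are assumptions standing in for "a few hundred KB". It also shows where the ~33% figure comes from, since base64 emits 4 characters for every 3 input bytes.

```python
import base64
import mimetypes

# Assumed cutoff for inlining; tune it to your own payload limits.
MAX_INLINE_BYTES = 300 * 1024


def encoded_length(n_bytes: int) -> int:
    """Length of padded base64 output for n_bytes of input (4 chars per 3 bytes)."""
    return 4 * ((n_bytes + 2) // 3)


def to_data_uri(raw: bytes, filename: str) -> str:
    """Return a data URI for a small image, or raise if it should be hosted instead."""
    mime, _ = mimetypes.guess_type(filename)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {filename!r}")
    if len(raw) > MAX_INLINE_BYTES:
        raise ValueError("image too large to inline; host it and pass the URL")
    b64 = base64.b64encode(raw).decode("ascii")
    return f"data:{mime};base64,{b64}"


print(encoded_length(300))  # 400 characters for 300 bytes
print(to_data_uri(b"\xff\xd8\xff", "cat.jpg"))
```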

Video input

Video parts work the same way as images. The model samples frames from the first N seconds of the video (N is a provider-specific cap) and treats them as the visual context:

{
  "role": "user",
  "content": [
    { "type": "text", "text": "Summarize what happens in this clip." },
    { "type": "video_url", "video_url": { "url": "https://example.com/clip.mp4" } }
  ]
}

Video inputs cost more tokens than images (typically 3–5x), and the cost depends on clip length. The exact charge is reported in usage.prompt_tokens on the response.

Mixing modalities in a single message

You can interleave text and vision parts freely:

{
  "role": "user",
  "content": [
    { "type": "text", "text": "Compare these two product photos:" },
    { "type": "image_url", "image_url": { "url": "https://example.com/a.jpg" } },
    { "type": "text", "text": "versus" },
    { "type": "image_url", "image_url": { "url": "https://example.com/b.jpg" } },
    { "type": "text", "text": "Which is the better photograph?" }
  ]
}

The model treats the array as one ordered input — text fragments and visual parts share semantic context.
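Because order matters, it can help to assemble the array with small constructors rather than hand-written dictionaries. The following is a sketch with hypothetical helper names (text, image, video, user_message); the part shapes and the allowed detail values are the ones listed under "Supported part types".

```python
ALLOWED_DETAIL = {"low", "high", "auto"}


def text(value: str) -> dict:
    return {"type": "text", "text": value}


def image(url: str, detail=None) -> dict:
    payload = {"url": url}
    if detail is not None:
        if detail not in ALLOWED_DETAIL:
            raise ValueError(f"unsupported detail: {detail!r}")
        payload["detail"] = detail
    return {"type": "image_url", "image_url": payload}


def video(url: str) -> dict:
    return {"type": "video_url", "video_url": {"url": url}}


def user_message(*parts: dict) -> dict:
    """Wrap content parts in a user message, preserving the given order."""
    return {"role": "user", "content": list(parts)}


message = user_message(
    text("Compare these two product photos:"),
    image("https://example.com/a.jpg"),
    text("versus"),
    image("https://example.com/b.jpg", detail="low"),
    text("Which is the better photograph?"),
)
print([part["type"] for part in message["content"]])
```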

Limits

The total prompt-token count (text + image + video) must not exceed the model’s aurous_metadata.context_window. Requests over the cap return 400 max_input_tokens_exceeded.
Image and video URLs must resolve in under 10 seconds; slower fetches are treated as a failed request.
Provider-side moderation may reject inputs; a rejection is surfaced via finish_reason: "content_filter".
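A moderation rejection arrives as a normal response, not as an error, so callers need to check for it explicitly. The sketch below assumes the response has the choices[].finish_reason shape returned by the OpenAI SDK used elsewhere on this page; the function name was_content_filtered is hypothetical, and a stand-in object replaces the real response so the example runs without a network call.

```python
from types import SimpleNamespace


def was_content_filtered(response) -> bool:
    """True if any choice finished with finish_reason == "content_filter"."""
    return any(
        getattr(choice, "finish_reason", None) == "content_filter"
        for choice in response.choices
    )


# Stand-in for a real chat-completions response object.
fake_response = SimpleNamespace(
    choices=[SimpleNamespace(finish_reason="content_filter")]
)

if was_content_filtered(fake_response):
    print("Rejected by provider-side moderation.")
```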

Errors

max_input_tokens_exceeded (400) — prompt is over the model’s context window after image/video tokenization. Drop content parts or lower detail.
chat_provider_request_invalid (500) — an inline asset failed to parse. This is treated as our bug; the request is logged for follow-up.
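The "lower detail" remedy for max_input_tokens_exceeded can be applied mechanically. This is a minimal sketch with a hypothetical helper name, lower_detail; it rewrites a copy of the messages so the original request stays untouched. How the error code appears on an SDK exception is not specified on this page, so the detection step is left out.

```python
import copy


def lower_detail(messages: list) -> list:
    """Return a deep copy of `messages` with every image part set to detail "low"."""
    downgraded = copy.deepcopy(messages)
    for message in downgraded:
        content = message.get("content")
        if not isinstance(content, list):
            continue  # plain-string content has no image parts
        for part in content:
            if part.get("type") == "image_url":
                part["image_url"]["detail"] = "low"
    return downgraded


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "What breed is this cat?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg", "detail": "high"}},
        ],
    },
]

retry_messages = lower_detail(messages)
print(retry_messages[1]["content"][1]["image_url"]["detail"])  # low
```

One possible use: when a request fails with max_input_tokens_exceeded, resend it once with lower_detail(messages) before resorting to dropping content parts.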
