Multimodal embeddings

aurous-embed-vision-1.0 is a multimodal embedding model: a single request that mixes text and image parts produces one embedding representing both together. This is the distinguishing feature of the embeddings surface — you get a single vector that captures the semantic relationship between text and visual content in the same document, not separate vectors per modality.

Video input is no longer accepted as of 2026-05-24: video_url parts return embeddings_video_unsupported. The provider folded video frames into the visual billing bucket, so the published video rate was never actually applied. To embed video content, extract a representative frame in your pipeline and submit it as image_url; it bills at the visual rate.

Input shape

input accepts two shapes:

A plain string — text-only embedding. The simplest form.
An array of content parts — multimodal embedding. Mix text and image_url parts in one request; the model concatenates them into a single combined document and returns one embedding for the whole thing.

{
  "model": "aurous-embed-vision-1.0",
  "input": [
    { "type": "text", "text": "Product photo of a leather messenger bag." },
    { "type": "image_url", "image_url": { "url": "https://assets.aurous-labs.com/example-images/messenger-bag.jpg" } }
  ]
}
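As a rough illustration of the two shapes, the sketch below builds either a plain string or a content-parts array depending on whether any images are supplied. The helper names (`text_part`, `image_part`, `build_input`) are hypothetical and not part of any SDK; only the part layout comes from the example above.

```python
from typing import Optional, Union


def text_part(text: str) -> dict:
    """Build a text content part."""
    return {"type": "text", "text": text}


def image_part(url: str) -> dict:
    """Build an image_url content part."""
    return {"type": "image_url", "image_url": {"url": url}}


def build_input(text: str, image_urls: Optional[list] = None) -> Union[str, list]:
    """Return a plain string for text-only input, else an ordered parts array."""
    if not image_urls:
        return text  # shape 1: plain string
    # shape 2: the text part first, then one image_url part per URL
    return [text_part(text)] + [image_part(u) for u in image_urls]
```

The return value can be passed as the `input` field of the request body in either case.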

Supported part types:

| `type` | Payload | Notes |
| --- | --- | --- |
| `text` | `{ "text": "..." }` | UTF-8 text. NULL bytes are rejected. Max 1,000,000 characters per part. |
| `image_url` | `{ "image_url": { "url": "https://..." } }` | HTTPS URL, ≤ 2048 chars. Must be fetchable in under 10s. |
| `video_url` | n/a | Rejected as of 2026-05-24. Returns `embeddings_video_unsupported`. |
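The rules in this table can be checked locally before a request is sent. The following is a minimal sketch, assuming you want a list of human-readable problems per part; `validate_part` is a hypothetical helper, and it cannot check the 10-second fetch rule, which only the platform can observe.

```python
MAX_TEXT_CHARS = 1_000_000
MAX_URL_CHARS = 2048


def validate_part(part: dict) -> list:
    """Return a list of problems for one content part (empty list = looks valid)."""
    kind = part.get("type")
    if kind == "text":
        text = part.get("text")
        if not isinstance(text, str):
            return ["text part needs a string 'text' field"]
        problems = []
        if "\x00" in text:
            problems.append("text contains a NULL byte")
        if len(text) > MAX_TEXT_CHARS:
            problems.append("text exceeds 1,000,000 characters")
        return problems
    if kind == "image_url":
        inner = part.get("image_url")
        url = inner.get("url") if isinstance(inner, dict) else None
        if not isinstance(url, str):
            return ["image_url part needs image_url.url"]
        problems = []
        if not url.startswith("https://"):
            problems.append("image URL must use HTTPS")
        if len(url) > MAX_URL_CHARS:
            problems.append("image URL exceeds 2048 characters")
        return problems
    if kind == "video_url":
        return ["video_url is rejected (embeddings_video_unsupported)"]
    return [f"unknown part type: {kind!r}"]
```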

One request → one embedding

The fundamental difference from OpenAI’s embedding API is this: the v1 surface returns exactly one embedding per request, regardless of how many parts the input array contains. When you pass a content-parts array, the model treats the parts as one ordered document and produces a single vector representing the whole thing. This is intentional — it lets you embed text + an image together so the resulting vector captures their joint meaning (a product description fused with the photo, a chart caption fused with the chart image).

// Response is always a single-element data array on v1
{
  "object": "list",
  "data": [
    { "index": 0, "object": "embedding", "embedding": [/* 2048 floats */] }
  ],
  "model": "aurous-embed-vision-1.0",
  "usage": { /* ... */ }
}
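Because the data array always holds exactly one element on v1, client code can assert that invariant instead of iterating. A small sketch over the parsed JSON body (the function name `single_embedding` is hypothetical):

```python
def single_embedding(response: dict) -> list:
    """Return the one vector from a v1 embeddings response body (parsed JSON)."""
    data = response.get("data", [])
    if len(data) != 1:
        # Anything other than one element would contradict the v1 contract.
        raise ValueError(f"expected exactly 1 embedding, got {len(data)}")
    return data[0]["embedding"]
```

Failing loudly here tends to surface a mistaken assumption of N→N semantics early, rather than letting a single vector be stored against many documents.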

Batch rejection — the `string[]` shape is NOT accepted on v1

OpenAI’s API accepts input: string[] and returns one embedding per string (N→N). Aurous Labs rejects that shape on v1 because the underlying model would concatenate the strings into a single document and return one combined vector, the opposite of what a customer accustomed to OpenAI’s API would expect. Silently swapping semantics would cause subtle bugs in production code: a call meant to embed 100 documents would return one unusable embedding. The platform therefore returns 400 embeddings_batch_not_supported whenever input is an array of pure strings. Two workarounds:

Option 1 — loop client-side

Send one request per item. This is the equivalent of OpenAI’s N→N batch semantics. Use Promise.all (Node) or asyncio.gather (Python) to parallelize.

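import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.aurous-labs.com/v1",
  apiKey: process.env.AUROUS_API_KEY!,
});

const documents = [
  "The quick brown fox jumps over the lazy dog.",
  "Pack my box with five dozen liquor jugs.",
  "Sphinx of black quartz, judge my vow.",
];

// One request per document; Promise.all preserves input order.
const results = await Promise.all(
  documents.map((text) =>
    client.embeddings.create({
      model: "aurous-embed-vision-1.0",
      input: text,
    }),
  ),
);

// Each response carries exactly one embedding at data[0].
const vectors = results.map((r) => r.data[0].embedding);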

import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(
    base_url="https://api.aurous-labs.com/v1",
    api_key="al_live_xxxxxxxxxxxxxxxx",
)

documents = [
    "The quick brown fox jumps over the lazy dog.",
    "Pack my box with five dozen liquor jugs.",
    "Sphinx of black quartz, judge my vow.",
]

async def embed_all() -> list[list[float]]:
    results = await asyncio.gather(*[
        client.embeddings.create(model="aurous-embed-vision-1.0", input=text)
        for text in documents
    ])
    return [r.data[0].embedding for r in results]

vectors = asyncio.run(embed_all())
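Firing every request at once may be undesirable for large document sets. The sketch below caps the number of requests in flight with a semaphore. The `embed_one` callable is a stand-in you would supply (for example, a wrapper around `client.embeddings.create`), and the default of 8 concurrent requests is an arbitrary assumption, not a documented rate limit.

```python
import asyncio
from typing import Awaitable, Callable


async def embed_many(
    texts: list,
    embed_one: Callable[[str], Awaitable[list]],
    max_in_flight: int = 8,  # hypothetical default; tune to your own quota
) -> list:
    """Embed each text with its own request, at most max_in_flight at a time."""
    gate = asyncio.Semaphore(max_in_flight)

    async def guarded(text: str) -> list:
        async with gate:
            return await embed_one(text)

    # gather preserves input order, so vectors[i] belongs to texts[i].
    return await asyncio.gather(*(guarded(t) for t in texts))
```

Passing the embedding call in as a parameter keeps the concurrency logic testable without network access.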

Option 2 — pass content parts for a deliberately combined embedding

If you actually want one embedding representing several text fragments fused together (e.g., a title + description + tags as one document), pass them as content parts:

{
  "model": "aurous-embed-vision-1.0",
  "input": [
    { "type": "text", "text": "Title: Leather Messenger Bag" },
    { "type": "text", "text": "Description: Hand-stitched full-grain leather." },
    { "type": "text", "text": "Tags: bag, leather, messenger, full-grain" }
  ]
}

This is intentional, semantically meaningful, and returns one embedding for the combined document. It’s NOT the same as embedding the three strings independently — the combined vector is a single point in vector space representing all three together.
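If records live in structured form, the parts array can be generated rather than written by hand. A minimal sketch, assuming a record with a title, description, and tags; `record_to_parts` and its field labels are illustrative choices that mirror the JSON example above.

```python
def record_to_parts(title: str, description: str, tags: list) -> list:
    """Turn one catalog record into ordered text parts for a single combined embedding."""
    fields = [
        ("Title", title),
        ("Description", description),
        ("Tags", ", ".join(tags)),
    ]
    # Skip empty fields so no blank text parts are sent.
    return [
        {"type": "text", "text": f"{label}: {value}"}
        for label, value in fields
        if value
    ]
```

Note that the result is an array of part objects, not an array of bare strings, so it does not trigger the batch rejection.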

Worked example — text + image

A typical RAG-for-images use case: embed a product photo with its description, store the vector, search later with a user’s natural-language query.

curl -X POST https://api.aurous-labs.com/v1/embeddings \
  -H "Authorization: Bearer $AUROUS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "aurous-embed-vision-1.0",
    "input": [
      { "type": "text", "text": "Product photo of a vintage leather messenger bag with brass buckles." },
      {
        "type": "image_url",
        "image_url": { "url": "https://assets.aurous-labs.com/example-images/messenger-bag.jpg" }
      }
    ]
  }'

import OpenAI from "openai";

const client = new OpenAI({
  baseURL: "https://api.aurous-labs.com/v1",
  apiKey: process.env.AUROUS_API_KEY!,
});

const res = await client.embeddings.create({
  model: "aurous-embed-vision-1.0",
  input: [
    {
      type: "text",
      text: "Product photo of a vintage leather messenger bag with brass buckles.",
    },
    {
      type: "image_url",
      image_url: {
        url: "https://assets.aurous-labs.com/example-images/messenger-bag.jpg",
      },
    },
  ] as never, // OpenAI's typings predate multimodal embeddings; the wire shape is forwarded as-is.
});

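// credits_charged and breakdown are Aurous extensions that OpenAI's typings do not declare.
const usage = res.usage as unknown as {
  credits_charged: number;
  breakdown: { input: Record<string, number> };
};

console.log("vector dim:", res.data[0].embedding.length);
console.log("credits_charged:", usage.credits_charged);
console.log("breakdown:", usage.breakdown.input);
// → { text: 0.000487, visual: 0.04992, video: 0 }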

from openai import OpenAI

client = OpenAI(
    base_url="https://api.aurous-labs.com/v1",
    api_key="al_live_xxxxxxxxxxxxxxxx",
)

res = client.embeddings.create(
    model="aurous-embed-vision-1.0",
    input=[
        {
            "type": "text",
            "text": "Product photo of a vintage leather messenger bag with brass buckles.",
        },
        {
            "type": "image_url",
            "image_url": {
                "url": "https://assets.aurous-labs.com/example-images/messenger-bag.jpg",
            },
        },
    ],
)

print("vector dim:", len(res.data[0].embedding))
print("credits_charged:", res.usage.credits_charged)
print("breakdown:", res.usage.breakdown.input)

Example response:

{
  "object": "list",
  "data": [
    { "index": 0, "object": "embedding", "embedding": [/* 2048 floats */] }
  ],
  "model": "aurous-embed-vision-1.0",
  "usage": {
    "prompt_tokens": 1050,
    "total_tokens": 1050,
    "credits_charged": 0.050407,
    "breakdown": {
      "input": { "text": 0.000487, "visual": 0.049920, "video": 0 },
      "model": "aurous-embed-vision-1.0"
    }
  }
}

The breakdown.input.text and breakdown.input.visual fields decompose the charge across modalities so you can attribute cost to inputs. See Pricing for the per-1K credit math.
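In the example response, the modality charges add up to the total: 0.000487 + 0.049920 + 0 = 0.050407. The sketch below turns a usage object into per-modality shares. It assumes the breakdown always sums to credits_charged, which the example suggests but the text does not state outright; `modality_shares` is a hypothetical helper.

```python
from math import isclose


def modality_shares(usage: dict) -> dict:
    """Return each modality's fraction of credits_charged."""
    total = usage["credits_charged"]
    parts = usage["breakdown"]["input"]
    # Assumption: the per-modality charges sum to the total charge.
    if not isclose(sum(parts.values()), total, abs_tol=1e-6):
        raise ValueError("breakdown does not sum to credits_charged")
    return {name: (value / total if total else 0.0) for name, value in parts.items()}
```

For the example above, the image accounts for roughly 99% of the charge and the text for roughly 1%.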

Image-URL requirements

HTTPS only. Plain HTTP and data: URIs are rejected.
≤ 2048 characters per URL string.
Fetchable in under 10 seconds. Long-running fetches are treated as failed requests.
Public reachability. The platform fetches the URL from a server-side IP, so private hosts (localhost, RFC 1918 ranges, internal VPC) are not accessible.

If your images are local and not on a public CDN, upload them to storage you control (S3, Cloudflare R2, etc.) and pass the resulting public HTTPS URL. The embeddings surface does not currently accept inline base64 data URIs on v1.
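The public-reachability rule is the one most easily missed in development, where image URLs often point at localhost or an internal address. The following sketch flags hosts that are obviously private using only the standard library. It does not resolve DNS names, so an internal hostname will still pass; treat it as a cheap pre-check rather than a guarantee. The function name is hypothetical.

```python
import ipaddress
from urllib.parse import urlsplit


def host_is_obviously_private(url: str) -> bool:
    """True if the URL's host is missing, localhost, or a private/loopback IP literal."""
    host = urlsplit(url).hostname or ""
    if host in ("", "localhost") or host.endswith(".localhost"):
        return True
    try:
        ip = ipaddress.ip_address(host)
    except ValueError:
        return False  # a DNS name; not resolved here
    return ip.is_private or ip.is_loopback or ip.is_link_local
```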

Limits

| Limit | Cap | Code on violation |
| --- | --- | --- |
| Total content parts per request | 16 | `embeddings_input_too_many_items` |
| `image_url` parts per request | 8 | `embeddings_input_too_many_items` |
| `video_url` parts per request | 0 (any video → reject) | `embeddings_video_unsupported` |
| Total input tokens (after tokenization) | 128,000 | `embeddings_input_too_large` |
| URL string length | 2048 chars | `invalid_request` (DTO validation) |
| Text part character length | 1,000,000 chars | `invalid_request` (DTO validation) |

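If you need to embed more than 8 images, split the work into multiple requests; each request returns its own vector, and the vectors land in your index independently. There is no “fan-out” mode that combines more images into a single embedding on v1. Video is not accepted at all, so extract representative frames and submit them as image_url parts, subject to the same 8-image cap.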
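Splitting a large image set can be done mechanically. The sketch below groups URLs into inputs of at most 8 images each and optionally repeats a shared caption as the first part of every request, which keeps each request well under the 16-part cap. Repeating the caption is a design assumption made here, not something the platform requires; `split_images` is a hypothetical helper.

```python
def split_images(image_urls: list, caption: str = "", max_images: int = 8) -> list:
    """Split image URLs into request-sized inputs of at most max_images images each."""
    requests = []
    for start in range(0, len(image_urls), max_images):
        # Optional shared caption first, so each request has at most 1 + 8 parts.
        parts = [{"type": "text", "text": caption}] if caption else []
        parts += [
            {"type": "image_url", "image_url": {"url": u}}
            for u in image_urls[start:start + max_images]
        ]
        requests.append(parts)
    return requests
```

Each returned list is the `input` for one request and yields one independent vector.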

Errors

embeddings_batch_not_supported (400): input was an array of pure strings. Loop client-side or pass content parts. See batch rejection.
embeddings_input_too_many_items (400): the request exceeds the 16-part or 8-image cap. Split it into multiple requests.
embeddings_video_unsupported (400): any video_url part is rejected. Extract a representative frame in your pipeline and submit it as image_url (bills at the visual rate).
embeddings_input_too_large (400): the pre-fetch token estimate exceeds the 128,000-token input limit. Trim the input or drop the largest part.
embeddings_provider_unknown_error (502): upstream returned an error that the platform’s mapping table doesn’t yet recognize. Retry with backoff, and quote the request_id if you contact support.

See Errors for the full taxonomy and recovery guidance.
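The recovery guidance above can be sketched as a small dispatch function: only the 502 is worth retrying, while the 400s call for a changed request. How the code string is read from an error response is not specified here, so the function takes it as a plain argument; the retry cap and backoff constants are arbitrary assumptions.

```python
import random

RETRYABLE = {"embeddings_provider_unknown_error"}
FIX_REQUEST = {
    "embeddings_batch_not_supported": "send one request per string, or use content parts",
    "embeddings_input_too_many_items": "split into requests of at most 16 parts and 8 images",
    "embeddings_video_unsupported": "extract a frame and send it as image_url",
    "embeddings_input_too_large": "trim the input or drop the largest part",
}


def next_step(code: str, attempt: int, max_attempts: int = 4) -> tuple:
    """Map an error code to ('retry', delay_seconds), ('fix', hint) or ('raise', reason)."""
    if code in RETRYABLE:
        if attempt >= max_attempts:
            return ("raise", "retries exhausted; quote the request_id")
        # Exponential backoff capped at 30s, plus up to 1s of jitter.
        delay = min(2 ** attempt, 30) + random.uniform(0, 1)
        return ("retry", delay)
    if code in FIX_REQUEST:
        return ("fix", FIX_REQUEST[code])
    return ("raise", f"unrecognized code: {code}")
```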
