aurous-embed-vision-1.0 is a multimodal embedding model: a single request that mixes text and image parts produces one embedding representing both together. This is the distinguishing feature of the embeddings surface — you get a single vector that captures the semantic relationship between text and visual content in the same document, not separate vectors per modality.
Video input is no longer accepted as of 2026-05-24; video_url parts return embeddings_video_unsupported. The provider folded video frames into the visual billing bucket, so the published video rate was never actually applied. Extract a representative frame in your pipeline and submit it as image_url; it bills at the visual rate.

Input shape

input accepts two shapes:
  1. A plain string — text-only embedding. The simplest form.
  2. An array of content parts — multimodal embedding. Mix text and image_url parts in one request; the model concatenates them into a single combined document and returns one embedding for the whole thing.
{
  "model": "aurous-embed-vision-1.0",
  "input": [
    { "type": "text", "text": "Product photo of a leather messenger bag." },
    { "type": "image_url", "image_url": { "url": "https://assets.aurous-labs.com/example-images/messenger-bag.jpg" } }
  ]
}
Supported part types:
| type | Payload | Notes |
| --- | --- | --- |
| text | { "text": "..." } | UTF-8 text. NULL bytes are rejected. Max 1,000,000 characters per part. |
| image_url | { "image_url": { "url": "https://..." } } | HTTPS URL, ≤ 2048 chars. Must be fetchable in under 10s. |
| video_url | | Rejected as of 2026-05-24. Returns embeddings_video_unsupported. |
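
The part shapes listed above can be written down as types, together with a pre-flight check for the two documented text constraints. This is a minimal sketch rather than an official SDK: the type names, MAX_TEXT_CHARS, and checkTextPart are hypothetical, and the length check counts UTF-16 code units, which may not match how the platform counts characters.

```typescript
// Accepted content-part shapes (video_url is deliberately absent: it is rejected).
type TextPart = { type: "text"; text: string };
type ImageUrlPart = { type: "image_url"; image_url: { url: string } };
type ContentPart = TextPart | ImageUrlPart;

// Documented per-part cap; the constant name is an assumption of this sketch.
const MAX_TEXT_CHARS = 1_000_000;

// Hypothetical helper: returns a list of problems, empty when the part looks valid.
function checkTextPart(part: TextPart): string[] {
  const problems: string[] = [];
  if (part.text.indexOf("\u0000") !== -1) {
    problems.push("text contains a NULL byte");
  }
  // Assumption: .length (UTF-16 code units) approximates the platform's character count.
  if (part.text.length > MAX_TEXT_CHARS) {
    problems.push("text exceeds 1,000,000 characters");
  }
  return problems;
}

// A mixed text + image input, typed against the shapes above.
const exampleParts: ContentPart[] = [
  { type: "text", text: "Product photo of a leather messenger bag." },
  {
    type: "image_url",
    image_url: { url: "https://assets.aurous-labs.com/example-images/messenger-bag.jpg" },
  },
];
```

Checking locally before sending saves a round trip for inputs that would fail validation anyway; the server remains the source of truth.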

One request → one embedding

The fundamental difference from OpenAI’s embedding API is this: the v1 surface returns exactly one embedding per request, regardless of how many parts the input array contains. When you pass a content-parts array, the model treats the parts as one ordered document and produces a single vector representing the whole thing. This is intentional — it lets you embed text + an image together so the resulting vector captures their joint meaning (a product description fused with the photo, a chart caption fused with the chart image).
// Response is always a single-element data array on v1
{
  "object": "list",
  "data": [
    { "index": 0, "object": "embedding", "embedding": [/* 2048 floats */] }
  ],
  "model": "aurous-embed-vision-1.0",
  "usage": { /* ... */ }
}
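
Because data always holds exactly one item on v1, client code can unwrap it and fail loudly if that assumption is ever violated. The sketch below is illustrative: the type names and singleEmbedding are hypothetical, and the response type models only the fields shown above.

```typescript
// Only the response fields shown in the example above are modelled here.
type EmbeddingItem = { index: number; object: string; embedding: number[] };
type EmbeddingResponse = { object: string; data: EmbeddingItem[]; model: string };

// Hypothetical helper: returns the single vector, or throws if the
// one-request-one-embedding expectation does not hold.
function singleEmbedding(res: EmbeddingResponse): number[] {
  const first = res.data[0];
  if (res.data.length !== 1 || first === undefined) {
    throw new Error("expected exactly one embedding, got " + res.data.length);
  }
  return first.embedding;
}
```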

Batch rejection — the string[] shape is NOT accepted on v1

OpenAI’s API accepts input: string[] and returns one embedding per string (N→N). Aurous Labs rejects that shape on v1 because the underlying model would concatenate the strings into a single document and return one combined vector, the opposite of what a developer accustomed to OpenAI’s API would expect. Silently swapping semantics would cause subtle bugs in production code (a call meant to embed 100 documents would return 1 unusable embedding). The platform returns 400 embeddings_batch_not_supported whenever input is an array of pure strings. Two workarounds:

Option 1 — loop client-side

Send one request per item. This is the equivalent of OpenAI’s N→N batch semantics. Use Promise.all (Node) or asyncio.gather (Python) to parallelize.
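// Assumes `client` is an already-configured SDK client that exposes
// embeddings.create (setup not shown here).
const documents = [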
  "The quick brown fox jumps over the lazy dog.",
  "Pack my box with five dozen liquor jugs.",
  "Sphinx of black quartz, judge my vow.",
];

const results = await Promise.all(
  documents.map((text) =>
    client.embeddings.create({
      model: "aurous-embed-vision-1.0",
      input: text,
    }),
  ),
);

const vectors = results.map((r) => r.data[0].embedding);

Option 2 — pass content parts for a deliberately combined embedding

If you actually want one embedding representing several text fragments fused together (e.g., a title + description + tags as one document), pass them as content parts:
{
  "model": "aurous-embed-vision-1.0",
  "input": [
    { "type": "text", "text": "Title: Leather Messenger Bag" },
    { "type": "text", "text": "Description: Hand-stitched full-grain leather." },
    { "type": "text", "text": "Tags: bag, leather, messenger, full-grain" }
  ]
}
This is intentional, semantically meaningful, and returns one embedding for the combined document. It’s NOT the same as embedding the three strings independently — the combined vector is a single point in vector space representing all three together.
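
To catch the rejected shape before it reaches the API, a client can classify input up front. The following is a rough sketch: classifyInput and the InputShape labels are hypothetical names, and the source does not say how empty or mixed arrays are handled, so those cases are reported as "unknown" here.

```typescript
type InputShape = "text" | "content_parts" | "rejected_string_batch" | "unknown";

// Hypothetical helper mirroring the documented rules:
//   plain string           -> text-only embedding
//   array of pure strings  -> rejected with embeddings_batch_not_supported
//   array of typed parts   -> one combined embedding
function classifyInput(input: unknown): InputShape {
  if (typeof input === "string") return "text";
  // Assumption: empty arrays are undocumented, so they are not classified.
  if (!Array.isArray(input) || input.length === 0) return "unknown";
  if (input.every((item) => typeof item === "string")) {
    return "rejected_string_batch";
  }
  const allParts = input.every(
    (item) => typeof item === "object" && item !== null && typeof item.type === "string",
  );
  // Mixed arrays (strings alongside parts) are also undocumented.
  return allParts ? "content_parts" : "unknown";
}
```

A caller that gets "rejected_string_batch" can then choose deliberately between looping (Option 1) and wrapping each string in a text part (Option 2), instead of discovering the 400 at runtime.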

Worked example — text + image

A typical RAG-for-images use case: embed a product photo with its description, store the vector, search later with a user’s natural-language query.
curl -X POST https://api.aurous-labs.com/v1/embeddings \
  -H "Authorization: Bearer $AUROUS_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "aurous-embed-vision-1.0",
    "input": [
      { "type": "text", "text": "Product photo of a vintage leather messenger bag with brass buckles." },
      {
        "type": "image_url",
        "image_url": { "url": "https://assets.aurous-labs.com/example-images/messenger-bag.jpg" }
      }
    ]
  }'
Example response:
{
  "object": "list",
  "data": [
    { "index": 0, "object": "embedding", "embedding": [/* 2048 floats */] }
  ],
  "model": "aurous-embed-vision-1.0",
  "usage": {
    "prompt_tokens": 1050,
    "total_tokens": 1050,
    "credits_charged": 0.050407,
    "breakdown": {
      "input": { "text": 0.000487, "visual": 0.049920, "video": 0 },
      "model": "aurous-embed-vision-1.0"
    }
  }
}
The breakdown.input.text and breakdown.input.visual fields decompose the charge across modalities so you can attribute cost to inputs. See Pricing for the per-1K credit math.
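
As an illustration of cost attribution, the breakdown can be turned into per-modality shares. This sketch assumes the usage shape shown in the example response; UsageBlock and modalityShares are hypothetical names.

```typescript
// Only the usage fields needed for attribution are modelled here.
type UsageBlock = {
  credits_charged: number;
  breakdown: { input: { text: number; visual: number; video: number } };
};

// Hypothetical helper: fraction of the input charge attributable to each modality.
function modalityShares(usage: UsageBlock): { text: number; visual: number; video: number } {
  const { text, visual, video } = usage.breakdown.input;
  const total = text + visual + video;
  if (total === 0) return { text: 0, visual: 0, video: 0 };
  return { text: text / total, visual: visual / total, video: video / total };
}

// Values copied from the example response above.
const exampleUsage: UsageBlock = {
  credits_charged: 0.050407,
  breakdown: { input: { text: 0.000487, visual: 0.04992, video: 0 } },
};
const exampleShares = modalityShares(exampleUsage);
```

For the example response, the text and visual components sum to credits_charged (0.000487 + 0.049920 = 0.050407), and the image accounts for roughly 99% of the charge.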

Image-URL requirements

  • HTTPS only. Plain HTTP and data: URIs are rejected.
  • ≤ 2048 characters per URL string.
  • Fetchable in under 10 seconds. Long-running fetches are treated as failed requests.
  • Public reachability. The platform fetches the URL from a server-side IP, so private hosts (localhost, RFC 1918 ranges, internal VPC) are not accessible.
If you have local images that aren’t on a public CDN, host them on your own storage (S3, Cloudflare R2, etc.) at a publicly reachable HTTPS URL and pass that URL. The embeddings surface does not currently accept inline base64 data URIs on v1.
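
The static requirements above can be checked locally before a request is sent. The sketch below is an approximation under stated assumptions: checkImageUrl is a hypothetical helper, it only inspects literal hostnames (a DNS name that resolves to a private address will pass), and it cannot verify the 10-second fetch budget.

```typescript
// Hypothetical pre-flight check; returns a list of problems, empty if none found.
function checkImageUrl(raw: string): string[] {
  const problems: string[] = [];
  if (raw.length > 2048) problems.push("URL is longer than 2048 characters");

  let parsed: URL;
  try {
    parsed = new URL(raw);
  } catch {
    problems.push("URL does not parse");
    return problems;
  }

  // Plain HTTP and data: URIs both fail this check.
  if (parsed.protocol !== "https:") problems.push("URL must use HTTPS");

  // Literal private hosts: localhost, loopback, and the RFC 1918 IPv4 ranges.
  const host = parsed.hostname;
  const ipv4 = /^(\d+)\.(\d+)\.(\d+)\.(\d+)$/.exec(host);
  let isPrivate = host === "localhost";
  if (ipv4) {
    const a = Number(ipv4[1]);
    const b = Number(ipv4[2]);
    isPrivate =
      isPrivate ||
      a === 10 ||
      a === 127 ||
      (a === 172 && b >= 16 && b <= 31) ||
      (a === 192 && b === 168);
  }
  if (isPrivate) problems.push("host is not publicly reachable");
  return problems;
}
```

This only filters out URLs that are certain to fail; reachability and fetch time are still decided by the platform's server-side fetch.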

Limits

| Limit | Cap | Code on violation |
| --- | --- | --- |
| Total content parts per request | 16 | embeddings_input_too_many_items |
| image_url parts per request | 8 | embeddings_input_too_many_items |
| video_url parts per request | 0 (any video → reject) | embeddings_video_unsupported |
| Total input tokens (after tokenization) | 128,000 | embeddings_input_too_large |
| URL string length | 2048 chars | invalid_request (DTO validation) |
| Text part character length | 1,000,000 chars | invalid_request (DTO validation) |
If you need to embed more than 8 images, split the work into multiple requests; each request returns its own vector, and the vectors land in your index independently. Video is not accepted at all: extract representative frames and submit them as image_url parts. There is no “fan-out” mode that combines more images into a single embedding on v1.
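
The count-based limits can be mirrored client-side, and an oversized image set can be split into request-sized groups. This is a sketch under assumptions: expectedLimitError and chunk are hypothetical helpers, the order in which the platform evaluates the limits is not documented, and the 128,000-token cap is not checked because it depends on the server's tokenizer.

```typescript
type AnyPart = { type: string };

// Documented caps; the constant names are assumptions of this sketch.
const MAX_PARTS = 16;
const MAX_IMAGES = 8;

// Hypothetical helper: the error code a request would be expected to hit, or null.
function expectedLimitError(parts: AnyPart[]): string | null {
  if (parts.some((p) => p.type === "video_url")) {
    return "embeddings_video_unsupported";
  }
  const imageCount = parts.filter((p) => p.type === "image_url").length;
  if (parts.length > MAX_PARTS || imageCount > MAX_IMAGES) {
    return "embeddings_input_too_many_items";
  }
  return null;
}

// Splits a list into consecutive groups of at most `size` items.
function chunk<T>(items: T[], size: number): T[][] {
  const groups: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    groups.push(items.slice(i, i + size));
  }
  return groups;
}
```

For example, chunk(imageParts, MAX_IMAGES) on 20 image parts yields groups of 8, 8, and 4, each of which becomes its own request and its own independent vector.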

Errors

  • embeddings_batch_not_supported (400) — input was an array of pure strings. Loop client-side or pass content parts. See batch rejection.
  • embeddings_input_too_many_items (400) — over the 16-part or 8-image cap. Split into multiple requests.
  • embeddings_video_unsupported (400) — any video_url part is rejected. Extract a representative frame in your pipeline and submit it as image_url (bills at the visual rate).
  • embeddings_input_too_large (400) — the pre-fetch token estimate exceeds the 128,000-token input limit. Trim the input or drop the largest part.
  • embeddings_provider_unknown_error (502) — upstream returned an error the platform’s mapping table doesn’t yet recognize. Retry with backoff; quote the request_id.
See Errors for the full taxonomy and recovery guidance.
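
The recovery guidance above separates errors worth retrying (the 502) from those that need a changed request (the 400s). One way to encode that is sketched below. It assumes that a failed call throws an object carrying status and code fields, which the source does not specify; ApiFailure, isRetryable, backoffDelayMs, withRetry, and the delay values are all hypothetical.

```typescript
// Assumed error shape; adapt to whatever your HTTP client or SDK actually throws.
type ApiFailure = { status: number; code: string };

// Only the upstream 502 is retried; the 400-class codes need a changed request.
function isRetryable(f: ApiFailure): boolean {
  return f.status === 502 && f.code === "embeddings_provider_unknown_error";
}

// Exponential backoff: baseMs, 2x, 4x, ... capped at capMs (values are illustrative).
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 8000): number {
  return Math.min(capMs, baseMs * Math.pow(2, attempt));
}

// Runs `call`, retrying retryable failures up to maxAttempts total attempts.
async function withRetry<T>(call: () => Promise<T>, maxAttempts = 4): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await call();
    } catch (err) {
      const f = (typeof err === "object" && err !== null ? err : {}) as Partial<ApiFailure>;
      const retryable =
        typeof f.status === "number" &&
        typeof f.code === "string" &&
        isRetryable({ status: f.status, code: f.code });
      if (!retryable || attempt + 1 >= maxAttempts) throw err;
      const delay = backoffDelayMs(attempt);
      await new Promise<void>((resolve) => setTimeout(() => resolve(), delay));
    }
  }
}
```

When the final attempt still fails, the original error is rethrown unchanged so that the request_id it carries can be quoted in a support request.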