AI Endpoints - Batch API

New

AI Endpoints - Batch API

The most cost-efficient choice for large-scale or non-urgent workloads.

Run your requests asynchronously and optimize your budget for background processing or scheduled tasks, while ensuring sovereign data processing on our European infrastructure.

Beta started !
Free of charge during the Beta phase !
Duration : 2 months
Quick start

Why AI Endpoints Batch API ?

While AI Endpoints Base API is for interaction, Batch API aims for long-running jobs (minutes to hours), where cost and throughput matter more than latency.

Cost efficiency, run jobs for 50% of price (at General Availability)
Fixed SLO: All jobs are queued to be completed within a 24 hours target.
Resource Efficiency: We bridge the gap between volatile demands and fixed compute capacity.
OpenAI Ready: Seamless integration into existing workflows with full API compatibility.

Getting started

Use cases for AI Endpoints Batch mode

Batch API is optimized for asynchronous Workloads: The perfect engine for tasks where volume beats speed

Mass document processing (classification, extraction, OCR post-processing, tagging)

Scale: Turn mountains of PDFs and scans into structured data.

Features: Automated classification, entity extraction, and OCR post-processing.

Value: Eliminates manual data entry for back-office operations

Large-Scale Embeddings

Scale: Convert entire knowledge bases into vectors for RAG or search.

Features: High-throughput indexing for vector databases.

Value: Cost-efficient foundation for semantic search engines.

Dataset Preparation

Scale: Clean and augment massive datasets for machine learning.

Features: Text preprocessing, PII anonymization, and synthetic data generation.

Value: High-quality training data without the premium real-time price tag.

How to use Batch mode ?

A batch job is processed in four stages:

1.Prepare a JSONL file where each line describes a single request (model, endpoint, body).
2.Upload this file to AI Endpoints through the Files API (`/v1/files`) with `purpose="batch"`
3.Create a batch (`/v1/batches`) referencing the uploaded file and the target endpoint.
4.Poll the batch status until it is `completed`, then download the output file (and the error file, if any).

User Guide Documentation

Requirements

Access to the OVHcloud Control Panel
A Public Cloud Project in your OVHcloud account
An AI Endpoints API key (see AI Endpoint - Getting started)

AI Endpoints Base and Batch API specifications

	BATCH (Beta)	BASE (Generally available)
Models available	MVP : LLMs & Embeddings	All catalog
Uses cases	Optimized for large-volume or delayed workloads. Ideal for all your requests that can wait	Ideal for interactive workloads with standard throughtput, where users experience usual interactions
Prompts per API Call	50k rows max. / File size limited to 200MB	1 (max 2 Mo) / 1 - 25 for Embeddings
Response Time	SLO target = 24h	Standard response time experience
Rate Limit	n.a	400 Requests per Minute per Project (higher for embeddings)
Pricing model (per token or per image or per second)	Cost-optimized, with batch requests billed at a significant discounted price compared to the Base API rate	With standard price, no commitment required, enjoy a pay-as-you-go consumption mode
API Type	Asynchronous	Synchronous
SLA	No SLA	99.8% uptime
Pricing	50% off vs Base	https://www.ovhcloud.com/en/public-cloud/prices/#17128
Availability	WW (out of US)	WW (out of US)
OpenAI compatibility	YES	YES (except. image generation models)

Feedbacks are welcome

Batch mode is still in development. Functionalities are still in progress and supposed to improve.

Thanks to give us feedbacks through Feature Requests on GitHub.

FAQ

What is the Batch API ?

The Batch API allows you to batch many tasks or big datasets into one request, run them asynchronously in the background, and fetch all results together. It reduces round trips, eliminates overhead, and is ideal for large processing jobs. For all the requests that can wait, this allows us to deliver discounted prices per token.

How is Batch Mode different from AI Endpoints base (real-time) inference ?

Standard inference returns a response immediately (synchronous). Batch Mode is asynchronous: you submit a job, it is processed in the background, and you poll for completion. The trade-off is latency for cost — Batch Mode is significantly cheaper, has no rate limits per request, and is designed for workloads where a sub-second response time is not required.

Is the API compatible with OpenAI's Batch API

Yes. The Batch Mode API is designed to be compatible with the OpenAI Batch API interface. You can use the official OpenAI Python or TypeScript SDK by pointing base_url to our endpoint and replacing your API key. The completion_window parameter accepts "24h".

What will be the price of AI Endpoint Batch mode ?

AI Endpoints Batch mode is a asynchronous offering allowing customers to benefit from 50% target price of AI Endpoint services.

How long is my request duration?

Batch API requests are processed asynchronously. While the processing time can vary based on the current system load and the size of your payload, our target Service Level Objective (SLO) is 24 hours.

How do I check the status of a running job ?

You can track the status of your batch jobs using either of the following methods:

Via the API / Technical Documentation: If you prefer to check the status programmatically, you can follow the step-by-step instructions and find technical details on checking job status in our dedicated Ai Endpoints Batch mode User Guide

Via the Dedicated UI: You can monitor the real-time status and evolution of your jobs directly from our interactive dashboard here: [coming soon].

What happens if a job is not completed within 24 hours ?

If a batch cannot be completed within the 24-hour completion window, it transitions to the expired status. This is a terminal state — the job stops processing and will not be retried automatically. You will need to submit a new batch to reprocess your requests.

For more information about the batch lifecycle, please refer to our documentation

Am I billed for requests that fail or time out ?

No. You are only billed for requests that are successfully processed. Requests that fail due to an internal error, or that are not reached before the 24-hour window closes, are not charged.

Alpha
Beta
General Availability