AI Endpoints Products
AI Endpoints - Introducing tailored performance services
OVHcloud AI Endpoints, starting with the Base API, is generally available and lets you easily integrate the best AI models into your applications.
Because not all AI workloads are the same, we’re taking things a step further. Today, we are introducing our plan for the next two API services, each designed to match specific use cases and performance needs:
- AI Endpoints - Base API: The standard option for consistent performance and balanced throughput. Ideal for most production applications. (Already generally available)
- AI Endpoints - Fast API: Built for real-time use cases that demand ultra-low latency. Deliver instant responses and a seamless experience for your users.
- AI Endpoints - Batch API: The most cost-efficient choice for large-scale or non-urgent workloads. Run your requests asynchronously and optimize your budget for background processing or scheduled tasks.
AI Endpoints - Base API
Already generally available, the Base API is ideal for interactive workloads with standard throughput and typical response times:
Chatbots & conversational assistants (customer support, website assistants, internal helpdesks)
Task automation (summaries, email drafting, content rewriting, workflow automation)
Real-time RAG applications (document Q&A, knowledge base search)
Search & retrieval augmentation using embeddings (semantic search with instant responses)
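As a minimal sketch of how such an interactive call could be wired up — the endpoint URL and model name below are placeholders, not confirmed OVHcloud values; check your control panel for the real ones — a Base API chat-completion request can be built as a plain OpenAI-style JSON payload:

```python
import json

# Hypothetical endpoint for illustration only -- replace with the real
# URL, model name, and API key from your OVHcloud control panel.
ENDPOINT_URL = "https://<your-ai-endpoints-url>/v1/chat/completions"

def build_chat_request(model: str, user_message: str,
                       system_prompt: str = "You are a helpful assistant.") -> dict:
    """Build an OpenAI-compatible chat-completion payload."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
    }

payload = build_chat_request("some-llm-model", "Summarize this support ticket.")
body = json.dumps(payload)

# The actual call (requires a valid API key) would look like:
# import requests
# resp = requests.post(ENDPOINT_URL, data=body,
#                      headers={"Authorization": "Bearer <API_KEY>",
#                               "Content-Type": "application/json"})
# print(resp.json()["choices"][0]["message"]["content"])
```

The same payload shape covers chatbots, task automation, and RAG question answering; only the `messages` content changes.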
AI Endpoints - Fast API
Coming soon, the Fast API is built for high-performance workloads, with guaranteed throughput and time to first token:
Code assistants for developers (autocomplete, debugging suggestions, code refactoring)
AI-powered SaaS applications requiring predictable throughput for end-customer traffic
High-volume enterprise RAG systems with guaranteed latency and reserved compute
Mission-critical production systems (fraud detection, monitoring, anomaly detection)
AI Endpoints - Batch API
Coming soon, the Batch API is optimized for large-volume or delayed workloads:
Mass document processing (classification, extraction, OCR post-processing, tagging)
Large-scale embeddings generation for indexing or vector database creation
Bulk content generation (product descriptions, reports, translations)
Dataset preparation (preprocessing text, generating synthetic data, cleaning data at scale)
Asynchronous analytics tasks (trend detection, sentiment analysis, log analysis)
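For workloads like these, the comparison table below states that a single Batch API call accepts up to 50,000 prompts. A small, provider-agnostic helper — the function name and signature are illustrative, not part of the OVHcloud API — can split an arbitrarily large corpus into request-sized chunks:

```python
from typing import Iterable, Iterator, List

MAX_PROMPTS_PER_CALL = 50_000  # Batch API limit per the comparison table

def chunk_prompts(prompts: Iterable[str],
                  max_per_call: int = MAX_PROMPTS_PER_CALL) -> Iterator[List[str]]:
    """Yield lists of prompts, each small enough for one Batch API call."""
    batch: List[str] = []
    for prompt in prompts:
        batch.append(prompt)
        if len(batch) == max_per_call:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

# e.g. 120,000 documents would become three batch submissions
n_calls = len(list(chunk_prompts((f"doc {i}" for i in range(120_000)))))
```

Each yielded list would then become the body of one asynchronous batch submission.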
Compare our AI Endpoints API services
The Base API covers the usual flow, while the Fast and Batch APIs address specific needs.
| | Base API (Generally Available) | Fast API (Coming soon) | Batch API (Coming soon) |
|---|---|---|---|
| Models available | Full catalog | Subset of Base API models | LLMs & embeddings |
| Use cases | Ideal for interactive workloads with standard throughput and usual user interactions | Ideal for high-performance workloads requiring high throughput and low TTFT (Time To First Token) | Optimized for large-volume or delayed workloads; ideal for all requests that can wait |
| Prompts per API call | 1 (max 2 MB); 1–25 for embeddings | 1 (max 2 MB) | Max 50,000 |
| Response time | Standard response time | Minimal TTFT (Time To First Token) and TPOT (Time Per Output Token) | Up to 24 h |
| Rate limit | 400 requests per minute per project (higher for embeddings) | Depends on the model | n/a |
| Pricing model (per token, image, or second) | Standard price, no commitment required: pay-as-you-go | You commit to a minimum consumption and are rewarded with lightning-fast response times | Cost-optimized: batch requests billed at a significant discount to the Base API rate |
| API type | Synchronous | Synchronous | Asynchronous |
| SLA | 99.8% uptime | 99.8% uptime, guaranteed TTFT & TPOT | 99.8% uptime, max 24 h response time |
FAQ
What are OVHcloud AI Endpoints?
OVHcloud AI Endpoints let you integrate powerful AI models into your applications via simple APIs without managing infrastructure or worrying about scalability. You can focus on building features while OVHcloud handles performance, availability, and data security.
What is the Batch API?
The Batch API lets you group many tasks or large datasets into a single request, run them asynchronously in the background, and fetch all the results together. It reduces round trips, eliminates per-request overhead, and is ideal for large processing jobs. Because these requests can wait, we can offer discounted per-token prices.
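The asynchronous flow described above typically follows a submit / poll / fetch pattern. A generic sketch of the polling step, assuming a `check_status` callable you would implement yourself (for example, one that GETs the job-status endpoint) — the status strings and the 24 h window from the comparison table are the only assumptions:

```python
import time
from typing import Callable

def wait_for_batch(check_status: Callable[[], str],
                   poll_interval_s: float = 30.0,
                   timeout_s: float = 24 * 3600,
                   sleep=time.sleep,
                   clock=time.monotonic) -> str:
    """Poll a batch job until it finishes or the 24 h window elapses.

    `check_status` is any callable returning "pending", "completed",
    or "failed" -- e.g. a function querying the job-status endpoint.
    `sleep` and `clock` are injectable for testing.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        status = check_status()
        if status in ("completed", "failed"):
            return status
        sleep(poll_interval_s)
    raise TimeoutError("batch job did not finish within the time limit")
```

Once `wait_for_batch` returns `"completed"`, a single fetch retrieves all results at once, which is what keeps round trips to a minimum.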
What is the Fast API?
The Fast API runs on a high-performance deployment designed for AI workloads that require ultra-low latency and guaranteed throughput.
It delivers significantly faster response times than the Base API, making it ideal for interactive applications such as chat, automation, or code generation. The Fast API operates on a commitment model: you commit to a minimum monthly usage, and OVHcloud guarantees consistent performance, priority processing or processing on dedicated endpoints, and enhanced privacy for your workloads.
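Since the Fast API’s headline guarantees are TTFT and TPOT, you may want to measure them on your own traffic. A minimal, provider-agnostic sketch, assuming you already have an iterator of streamed tokens (for example, from a streaming chat-completion response) — the function itself is illustrative, not part of any OVHcloud SDK:

```python
import time
from typing import Iterable, Tuple

def measure_latency(token_stream: Iterable[str],
                    clock=time.monotonic) -> Tuple[float, float]:
    """Return (TTFT, average TPOT) in seconds for a token stream.

    TTFT = delay until the first token arrives;
    TPOT = average delay between subsequent tokens.
    `clock` is injectable for testing.
    """
    start = clock()
    first_token_at = None
    last_token_at = None
    n_tokens = 0
    for _ in token_stream:
        now = clock()
        if first_token_at is None:
            first_token_at = now
        last_token_at = now
        n_tokens += 1
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    ttft = first_token_at - start
    tpot = (last_token_at - first_token_at) / (n_tokens - 1) if n_tokens > 1 else 0.0
    return ttft, tpot
```

Comparing these two numbers between the Base and Fast APIs on identical prompts is a straightforward way to quantify the benefit of the commitment model for your workload.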
How will the pricing be structured?
Pricing is still being finalized, but the following principles will guide our model: all AI Endpoints will continue to be consumption based, typically per-token billing, with variations depending on the performance level and guarantees provided.
Base API: Billed per token, ideal for general-purpose workloads.
Batch API: Priced at a discount to the Base API rate, as processing can be scheduled during off-peak hours.
Fast API: Also billed per token, but offered through a mutual commitment model: you commit to a minimum monthly usage, and we provide guaranteed throughput, ultra-fast delivery, and enhanced privacy for your workload.
What reliability can I expect?
We offer a 99.8% uptime SLA, with the Batch API further optimized for GPU efficiency during low-activity periods (typically at night). The Fast API provides guaranteed infrastructure availability for continuous production workloads.
Can I get a dedicated instance?
Yes, via AI Deploy, which gives you a dedicated inference server with its own API. If you need any help, contact us, and we’ll work with you to size the right dedicated setup and ensure it meets all your performance, privacy, and compliance requirements.
Contact us for all your inference needs
Help us better understand your requirements, and our experts will guide you toward the best deployment for your needs.