AI Endpoints Products
AI Endpoints - Introducing tailored performance services
OVHcloud AI Endpoints, starting with the Base API, is generally available and lets you easily integrate the best AI models into your applications.
Because not all AI workloads are the same, we’re taking things a step further. Today, we are introducing our plan for the next two API services, each designed to match specific use cases and performance needs:
- AI Endpoints - Base API: The standard option for consistent performance and balanced throughput. Ideal for most production applications. (Already generally available)
- AI Endpoints - Fast API: Built for real-time use cases that demand ultra-low latency. Deliver instant responses and a seamless experience for your users.
- AI Endpoints - Batch API: The most cost-efficient choice for large-scale or non-urgent workloads. Run your requests asynchronously and optimize your budget for background processing or scheduled tasks.
AI Endpoints - Base API
Already generally available, the Base API is ideal for interactive workloads with standard throughput and typical response times:
Chatbots & conversational assistants (customer support, website assistants, internal helpdesks)
Task automation (summaries, email drafting, content rewriting, workflow automation)
Real-time RAG applications (document Q&A, knowledge base search)
Search & retrieval augmentation using embeddings (semantic search with instant responses)
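As a minimal sketch of how such an interactive call could be wired up — the endpoint URL and model name below are placeholders, not confirmed OVHcloud values; check your control panel for the real ones — a Base API chat-completion request can be built as a plain OpenAI-style JSON payload:

```python
import json

# Hypothetical endpoint for illustration only -- replace with the real
# URL, model name, and API key from your OVHcloud control panel.
ENDPOINT_URL = "https://<your-ai-endpoints-url>/v1/chat/completions"

def build_chat_request(model: str, user_message: str,
                       system_prompt: str = "You are a helpful assistant.") -> dict:
    """Build an OpenAI-compatible chat-completion payload."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message},
        ],
    }

payload = build_chat_request("some-llm-model", "Summarize this support ticket.")
body = json.dumps(payload)

# The actual call (requires a valid API key) would look like:
# import requests
# resp = requests.post(ENDPOINT_URL, data=body,
#                      headers={"Authorization": "Bearer <API_KEY>",
#                               "Content-Type": "application/json"})
# print(resp.json()["choices"][0]["message"]["content"])
```

The same payload shape covers chatbots, task automation, and RAG question answering; only the `messages` content changes.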
AI Endpoints - Fast API
Coming soon, the Fast API is built for high-performance workloads, with guaranteed throughput and time to first token:
Code assistants for developers (autocomplete, debugging suggestions, code refactoring)
AI-powered SaaS applications requiring predictable throughput for end-customer traffic
High-volume enterprise RAG systems with guaranteed latency and reserved compute
Mission-critical production systems (fraud detection, monitoring, anomaly detection)
AI Endpoints - Batch API
Coming soon, the Batch API is optimized for large-volume or delayed workloads:
Mass document processing (classification, extraction, OCR post-processing, tagging)
Large-scale embeddings generation for indexing or vector database creation
Bulk content generation (product descriptions, reports, translations)
Dataset preparation (preprocessing text, generating synthetic data, cleaning data at scale)
Asynchronous analytics tasks (trend detection, sentiment analysis, log analysis)
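For workloads like these, the comparison table below states that a single Batch API call accepts up to 50,000 prompts. A small, provider-agnostic helper — the function name and signature are illustrative, not part of the OVHcloud API — can split an arbitrarily large corpus into request-sized chunks:

```python
from typing import Iterable, Iterator, List

MAX_PROMPTS_PER_CALL = 50_000  # Batch API limit per the comparison table

def chunk_prompts(prompts: Iterable[str],
                  max_per_call: int = MAX_PROMPTS_PER_CALL) -> Iterator[List[str]]:
    """Yield lists of prompts, each small enough for one Batch API call."""
    batch: List[str] = []
    for prompt in prompts:
        batch.append(prompt)
        if len(batch) == max_per_call:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

# e.g. 120,000 documents would become three batch submissions
n_calls = len(list(chunk_prompts((f"doc {i}" for i in range(120_000)))))
```

Each yielded list would then become the body of one asynchronous batch submission.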
Compare our AI Endpoints API services
The Base API covers the usual flow, while the Fast and Batch APIs address specific needs.
| | Base API (Generally Available) | Fast API (Coming soon) | Batch API (Coming soon) |
|---|---|---|---|
| Models available | Full catalog | Subset of Base API models | LLMs & embeddings |
| Use cases | Ideal for interactive workloads with standard throughput and usual user interactions | Ideal for high-performance workloads requiring high throughput and low TTFT (Time To First Token) | Optimized for large-volume or delayed workloads; ideal for all requests that can wait |
| Prompts per API call | 1 (max 2 MB); 1–25 for embeddings | 1 (max 2 MB) | Max 50,000 |
| Response time | Standard response time | Minimal TTFT (Time To First Token) and TPOT (Time Per Output Token) | Up to 24 h |
| Rate limit | 400 requests per minute per project (higher for embeddings) | Depends on the model | n/a |
| Pricing model (per token, image, or second) | Standard price, no commitment required: pay-as-you-go | You commit to a minimum consumption and are rewarded with lightning-fast response times | Cost-optimized: batch requests billed at a significant discount to the Base API rate |
| API type | Synchronous | Synchronous | Asynchronous |
| SLA | 99.8% uptime | 99.8% uptime, guaranteed TTFT & TPOT | 99.8% uptime, max 24 h response time |
FAQ
What are OVHcloud AI Endpoints?
OVHcloud AI Endpoints let you integrate powerful AI models into your applications via simple APIs without managing infrastructure or worrying about scalability. You can focus on building features while OVHcloud handles performance, availability, and data security.
What is the Batch API?
The Batch API lets you group many tasks or large datasets into a single request, run them asynchronously in the background, and fetch all the results together. It reduces round trips, eliminates per-request overhead, and is ideal for large processing jobs. Because these requests can wait, we can offer discounted per-token prices.
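The asynchronous flow described above typically follows a submit / poll / fetch pattern. A generic sketch of the polling step, assuming a `check_status` callable you would implement yourself (for example, one that GETs the job-status endpoint) — the status strings and the 24 h window from the comparison table are the only assumptions:

```python
import time
from typing import Callable

def wait_for_batch(check_status: Callable[[], str],
                   poll_interval_s: float = 30.0,
                   timeout_s: float = 24 * 3600,
                   sleep=time.sleep,
                   clock=time.monotonic) -> str:
    """Poll a batch job until it finishes or the 24 h window elapses.

    `check_status` is any callable returning "pending", "completed",
    or "failed" -- e.g. a function querying the job-status endpoint.
    `sleep` and `clock` are injectable for testing.
    """
    deadline = clock() + timeout_s
    while clock() < deadline:
        status = check_status()
        if status in ("completed", "failed"):
            return status
        sleep(poll_interval_s)
    raise TimeoutError("batch job did not finish within the time limit")
```

Once `wait_for_batch` returns `"completed"`, a single fetch retrieves all results at once, which is what keeps round trips to a minimum.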
What is the Fast API?
The Fast API runs on a high-performance deployment designed for AI workloads that require ultra-low latency and guaranteed throughput.
It delivers significantly faster response times than the Base API, making it ideal for interactive applications such as chat, automation, or code generation. The Fast API operates on a commitment model: you commit to a minimum monthly usage, and OVHcloud guarantees consistent performance, priority processing or processing on dedicated endpoints, and enhanced privacy for your workloads.
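Since the Fast API’s headline guarantees are TTFT and TPOT, you may want to measure them on your own traffic. A minimal, provider-agnostic sketch, assuming you already have an iterator of streamed tokens (for example, from a streaming chat-completion response) — the function itself is illustrative, not part of any OVHcloud SDK:

```python
import time
from typing import Iterable, Tuple

def measure_latency(token_stream: Iterable[str],
                    clock=time.monotonic) -> Tuple[float, float]:
    """Return (TTFT, average TPOT) in seconds for a token stream.

    TTFT = delay until the first token arrives;
    TPOT = average delay between subsequent tokens.
    `clock` is injectable for testing.
    """
    start = clock()
    first_token_at = None
    last_token_at = None
    n_tokens = 0
    for _ in token_stream:
        now = clock()
        if first_token_at is None:
            first_token_at = now
        last_token_at = now
        n_tokens += 1
    if first_token_at is None:
        raise ValueError("stream produced no tokens")
    ttft = first_token_at - start
    tpot = (last_token_at - first_token_at) / (n_tokens - 1) if n_tokens > 1 else 0.0
    return ttft, tpot
```

Comparing these two numbers between the Base and Fast APIs on identical prompts is a straightforward way to quantify the benefit of the commitment model for your workload.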
How will the pricing be structured?
Pricing is still being finalized, but the following principles will guide our model: all AI Endpoints will continue to be consumption based, typically per-token billing, with variations depending on the performance level and guarantees provided.
Base API: Billed per token, ideal for general-purpose workloads.
Batch API: Priced at a discount to the Base API rate, as processing can be scheduled during off-peak hours.
Fast API: Also billed per token, but offered through a mutual commitment model: you commit to a minimum monthly usage, and we provide guaranteed throughput, ultra-fast delivery, and enhanced privacy for your workload.
What reliability can I expect?
We offer a 99.8% uptime SLA, with the Batch API further optimized for GPU efficiency during low-activity periods (typically at night). The Fast API provides guaranteed infrastructure availability for continuous production workloads.
Can I get a dedicated instance?
Yes, via AI Deploy, which gives you a dedicated inference server with its own API. If you need any help, contact us, and we’ll work with you to size the right dedicated setup and ensure it meets all your performance, privacy, and compliance requirements.
Contact us for all your inference needs
Help us better understand your requirements, and our experts will guide you toward the best deployment for your needs.