/chat/completions endpoint. This unified interface allows you to seamlessly work with models from different providers through a consistent API.
Contents
| Section | Description |
|---|---|
| Getting Started | Basic setup and configuration |
| Input Controls | System prompts and request parameters |
| Working with Media | Images, audio, and video support |
| Function Calling | Enabling models to invoke functions |
| Thought Signatures | Preserving reasoning context in tool calls |
| Response Format | Structured JSON outputs |
| Prompt Caching | Optimize API usage with caching |
| Reasoning Models | Access model reasoning processes |
Getting Started
You can use the standard OpenAI client to send requests to the gateway.
Configuration
You will need to configure the following:
- base_url: The base URL of the TrueFoundry dashboard
- api_key: API key generated from Personal Access Tokens
- model: TrueFoundry model ID in the format provider_account/model_name (available in the LLM playground UI)
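A minimal sketch of that setup with the OpenAI Python SDK follows; the base URL, API key, and the openai-main/gpt-4o model ID are placeholders for your own gateway values.

```python
from openai import OpenAI

# Point the standard OpenAI client at the TrueFoundry gateway.
# Replace the base URL, API key, and model ID with your own values.
client = OpenAI(
    base_url="https://your-org.truefoundry.cloud/api/llm",  # placeholder gateway URL
    api_key="your-truefoundry-api-key",                      # Personal Access Token
)

response = client.chat.completions.create(
    model="openai-main/gpt-4o",  # format: provider_account/model_name
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```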
Input Controls
System Prompts
System prompts set the behavior and context for the model by defining the assistant’s role, tone, and constraints:
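A minimal sketch; the model ID is a placeholder and the client is configured as in Getting Started.

```python
from openai import OpenAI
client = OpenAI(base_url="<gateway-base-url>", api_key="<truefoundry-api-key>")  # see Getting Started

response = client.chat.completions.create(
    model="openai-main/gpt-4o",  # placeholder model ID
    messages=[
        # The system prompt defines the assistant's role, tone, and constraints.
        {"role": "system", "content": "You are a concise support agent. Answer in two sentences or fewer."},
        {"role": "user", "content": "How do I reset my password?"},
    ],
)
print(response.choices[0].message.content)
```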
Request Parameters
Fine-tune model behavior with these common parameters; a sketch follows below. Note that some models don’t support all parameters: for example, temperature is not supported by o-series models like o3-mini.
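A sketch showing a few commonly supported request parameters; the model ID is a placeholder.

```python
from openai import OpenAI
client = OpenAI(base_url="<gateway-base-url>", api_key="<truefoundry-api-key>")  # see Getting Started

response = client.chat.completions.create(
    model="openai-main/gpt-4o",          # placeholder model ID
    messages=[{"role": "user", "content": "Summarize the benefits of caching."}],
    temperature=0.2,    # lower values -> more deterministic output
    max_tokens=300,     # cap on generated tokens
    top_p=0.9,          # nucleus sampling
    stop=["\n\n"],      # optional stop sequences
)
print(response.choices[0].message.content)
```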
Working with Media
The API supports various media types including images, audio, video, and PDF.
Images
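A minimal sketch of sending an image by URL to a vision-capable model; the model ID and image URL are placeholders.

```python
from openai import OpenAI
client = OpenAI(base_url="<gateway-base-url>", api_key="<truefoundry-api-key>")  # see Getting Started

response = client.chat.completions.create(
    model="openai-main/gpt-4o",  # placeholder vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What is shown in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```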
Media Resolution
Supported Providers:
OpenAI, Azure OpenAI, Google Gemini, Google Vertex AI, xAI
The detail parameter in the image_url object allows you to control the resolution at which images are processed. This helps balance response quality, latency, and cost.
Supported values: low, high, auto
Example Usage
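A sketch of the same image request with an explicit detail level; model ID and image URL are placeholders.

```python
from openai import OpenAI
client = OpenAI(base_url="<gateway-base-url>", api_key="<truefoundry-api-key>")  # see Getting Started

response = client.chat.completions.create(
    model="openai-main/gpt-4o",  # placeholder vision-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this diagram."},
            {
                "type": "image_url",
                "image_url": {
                    "url": "https://example.com/diagram.png",  # placeholder image
                    "detail": "low",  # one of "low", "high", "auto"
                },
            },
        ],
    }],
)
print(response.choices[0].message.content)
```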
For Google Gemini and Vertex AI providers, the detail parameter is automatically translated to the mediaResolution parameter:
- "low" → MEDIA_RESOLUTION_LOW (64 tokens)
- "high" → MEDIA_RESOLUTION_HIGH (256+ tokens with scaling)
- "auto" or omitted → No explicit media resolution (model decides)
Audio
Video
PDF Documents
Supported Providers: OpenAI, Bedrock, Anthropic, Google Vertex, Google Gemini
PDF document processing allows models to analyze and extract information from PDF files:
Using Base64 Encoded PDF
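A sketch using the OpenAI-style file content part with a base64 data URL; the model ID and filename are placeholders, and the exact content-part shape accepted may vary by provider.

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="<gateway-base-url>", api_key="<truefoundry-api-key>")  # see Getting Started

# Read a local PDF and encode it as a base64 data URL.
with open("report.pdf", "rb") as f:
    pdf_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="openai-main/gpt-4o",  # placeholder PDF-capable model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Summarize the key findings in this document."},
            {
                "type": "file",
                "file": {
                    "filename": "report.pdf",
                    "file_data": f"data:application/pdf;base64,{pdf_b64}",
                },
            },
        ],
    }],
)
print(response.choices[0].message.content)
```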
Vision
TrueFoundry supports vision models from all integrated providers as they become available. These models can analyze and interpret images alongside text, enabling multimodal AI applications.
| Provider | Models |
|---|---|
| OpenAI | gpt-4-vision-preview, gpt-4o, gpt-4o-mini |
| Anthropic | claude-3-sonnet, claude-3-haiku, claude-3-opus, claude-3.5-sonnet, claude-3.5-haiku, claude-4-opus, claude-4-sonnet, claude-3-7-sonnet |
| Gemini | gemini-1.0-pro-vision, gemini-1.5-flash, gemini-1.5-flash-8b, gemini-1.5-pro, gemini-2.5-pro, gemini-2.5-flash |
| AWS Bedrock | anthropic.claude-3-5-sonnet, anthropic.claude-3-5-haiku, anthropic.claude-3-5-sonnet-20240620-v1:0 |
| Azure OpenAI | gpt-4-vision-preview, gpt-4o, gpt-4o-mini |
| xAI | grok-2-vision-1212 |
Using Vision Models with OpenAI SDK
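A sketch of calling a vision model from the table above with a base64-encoded local image; the model ID and filename are placeholders.

```python
import base64
from openai import OpenAI

client = OpenAI(base_url="<gateway-base-url>", api_key="<truefoundry-api-key>")  # see Getting Started

# Encode a local image as a base64 data URL.
with open("chart.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="anthropic-main/claude-3-5-sonnet",  # placeholder vision model ID
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What trend does this chart show?"},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```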
Function Calling
Function calling allows models to invoke defined functions during conversations, enabling them to perform specific actions or retrieve external information.
Basic Usage
Define functions that the model can call:
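A minimal sketch with a single hypothetical get_weather tool; the model ID is a placeholder.

```python
from openai import OpenAI
client = OpenAI(base_url="<gateway-base-url>", api_key="<truefoundry-api-key>")  # see Getting Started

# A single tool definition the model may choose to call.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical function name
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. Paris"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="openai-main/gpt-4o",  # placeholder model ID
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
# If the model decided to call the function, the call appears here.
print(response.choices[0].message.tool_calls)
```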
Function Definition Reference
Creating Well-Structured Function Definitions
When defining functions, you need to provide:
- name: The function name
- description: What the function does
- parameters: JSON Schema object describing the parameters
Supported Parameter Types for Function Arguments
Functions support various parameter types:
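An illustrative parameters schema mixing several supported JSON Schema types; the field names are hypothetical.

```python
# JSON Schema for function arguments, combining common parameter types.
parameters = {
    "type": "object",
    "properties": {
        "query": {"type": "string", "description": "Free-text search query"},
        "max_results": {"type": "integer", "minimum": 1, "maximum": 50},
        "price_limit": {"type": "number", "description": "Maximum price in USD"},
        "in_stock_only": {"type": "boolean"},
        "category": {"type": "string", "enum": ["books", "electronics", "clothing"]},
        "tags": {"type": "array", "items": {"type": "string"}},
        "shipping": {
            "type": "object",
            "properties": {"country": {"type": "string"}, "express": {"type": "boolean"}},
        },
    },
    "required": ["query"],
}
```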
Implementation Workflows
Working with Multiple Function Definitions
Define multiple functions for the model to choose from:
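A sketch with two hypothetical tools; the model picks whichever fits the request. The model ID is a placeholder.

```python
from openai import OpenAI
client = OpenAI(base_url="<gateway-base-url>", api_key="<truefoundry-api-key>")  # see Getting Started

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "book_flight",
            "description": "Book a flight between two airports",
            "parameters": {
                "type": "object",
                "properties": {
                    "origin": {"type": "string"},
                    "destination": {"type": "string"},
                    "date": {"type": "string", "description": "ISO date, e.g. 2025-07-01"},
                },
                "required": ["origin", "destination", "date"],
            },
        },
    },
]

response = client.chat.completions.create(
    model="openai-main/gpt-4o",  # placeholder model ID
    messages=[{"role": "user", "content": "Book me a flight from SFO to JFK on 2025-07-01."}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)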
Processing and Responding to Function Calls
Process function calls and continue the conversation:
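A sketch of the full round trip: the model requests a tool call, your code executes it locally, and the result is sent back as a tool message. The get_weather implementation and model ID are placeholders.

```python
import json
from openai import OpenAI

client = OpenAI(base_url="<gateway-base-url>", api_key="<truefoundry-api-key>")  # see Getting Started


def get_weather(city: str) -> str:
    """Hypothetical local implementation of the tool."""
    return json.dumps({"city": city, "temperature_c": 21, "condition": "sunny"})


tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

messages = [{"role": "user", "content": "What's the weather in Paris?"}]

# First request: the model decides whether to call the tool.
first = client.chat.completions.create(model="openai-main/gpt-4o", messages=messages, tools=tools)
assistant_message = first.choices[0].message
messages.append(assistant_message)

# Execute each requested tool call and append the results as "tool" messages.
for tool_call in assistant_message.tool_calls or []:
    args = json.loads(tool_call.function.arguments)
    result = get_weather(**args)
    messages.append({"role": "tool", "tool_call_id": tool_call.id, "content": result})

# Second request: the model incorporates the tool results into its answer.
final = client.chat.completions.create(model="openai-main/gpt-4o", messages=messages, tools=tools)
print(final.choices[0].message.content)
```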
Controlling When and How Functions Are Called
Control when and how functions are called:
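A sketch of the tool_choice options; the example forces one specific function. The model ID is a placeholder.

```python
from openai import OpenAI
client = OpenAI(base_url="<gateway-base-url>", api_key="<truefoundry-api-key>")  # see Getting Started

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# tool_choice="auto"     -> model decides whether to call a tool (default when tools are present)
# tool_choice="none"     -> model never calls a tool
# tool_choice="required" -> model must call some tool
# Or force one specific function:
response = client.chat.completions.create(
    model="openai-main/gpt-4o",  # placeholder model ID
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "get_weather"}},
)
print(response.choices[0].message.tool_calls)
```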
Thought Signatures
Thought signatures are encrypted representations of a model’s internal reasoning process that help maintain context and coherence across multi-turn interactions, particularly during function calling. When using certain Gemini 3 preview models, the API includes a thought_signature field in tool call responses.
Response Format
The chat completions API supports structured response formats, enabling you to receive consistent, predictable outputs in JSON format. This is useful for parsing responses programmatically.
JSON Response Options
Basic JSON Mode: Getting Valid JSON Without Structure Constraints
JSON mode ensures the model’s output is valid JSON without enforcing a specific structure:
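A minimal sketch using response_format of type json_object; the model ID is a placeholder, and the exact keys in the output are chosen by the model.

```python
import json
from openai import OpenAI

client = OpenAI(base_url="<gateway-base-url>", api_key="<truefoundry-api-key>")  # see Getting Started

response = client.chat.completions.create(
    model="openai-main/gpt-4o",  # placeholder model ID
    messages=[
        # JSON mode expects the prompt to mention JSON.
        {"role": "system", "content": "Return the answer as a JSON object."},
        {"role": "user", "content": "List three primary colors."},
    ],
    response_format={"type": "json_object"},
)

data = json.loads(response.choices[0].message.content)
print(data)  # e.g. {"colors": ["red", "blue", "yellow"]} -- exact structure is model-chosen
```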
JSON Schema Mode: Enforcing Specific Data Structures
JSON Schema mode provides strict structure validation using predefined schemas:
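A sketch using response_format of type json_schema with strict mode; the model ID and schema are placeholders.

```python
import json
from openai import OpenAI

client = OpenAI(base_url="<gateway-base-url>", api_key="<truefoundry-api-key>")  # see Getting Started

schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer"},
        "email": {"type": "string"},
    },
    # With strict mode, every defined property must be listed as required.
    "required": ["name", "age", "email"],
    "additionalProperties": False,
}

response = client.chat.completions.create(
    model="openai-main/gpt-4o",  # placeholder model ID
    messages=[{"role": "user", "content": "Extract the person: 'Jane Doe, 34, jane@example.com'."}],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "person", "schema": schema, "strict": True},
    },
)
print(json.loads(response.choices[0].message.content))
```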
When using JSON schema with strict mode set to true, all properties defined in the schema must be included in the required array. If any property is defined but not marked as required, the API will return a 400 Bad Request Error.
Advanced Schema Integration
Python Type Validation with Pydantic Models
Pydantic provides automatic validation, serialization, and type hints for structured data:
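A sketch deriving the JSON schema from a Pydantic model and validating the response back into a typed object; the model ID is a placeholder.

```python
import json
from openai import OpenAI
from pydantic import BaseModel, ConfigDict

client = OpenAI(base_url="<gateway-base-url>", api_key="<truefoundry-api-key>")  # see Getting Started


class Person(BaseModel):
    # extra="forbid" adds "additionalProperties": false to the generated schema,
    # which strict JSON schema mode expects.
    model_config = ConfigDict(extra="forbid")
    name: str
    age: int
    email: str


response = client.chat.completions.create(
    model="openai-main/gpt-4o",  # placeholder model ID
    messages=[{"role": "user", "content": "Extract the person: 'Jane Doe, 34, jane@example.com'."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "person",
            "schema": Person.model_json_schema(),  # derive schema from the Pydantic model
            "strict": True,
        },
    },
)

# Validate the model output back into a typed Pydantic object.
person = Person.model_validate(json.loads(response.choices[0].message.content))
print(person)
```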
When using OpenAI models with Pydantic models, there should not be any optional fields in the Pydantic model when strict mode is true, because optional fields would be omitted from the required array of the generated JSON schema.
Streamlined Pydantic Integration with OpenAI's Beta Parse API
The beta parse client provides the most streamlined approach for Pydantic integration. This approach allows for optional fields in your Pydantic model and provides a cleaner API for structured responses:
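A sketch using the OpenAI SDK's beta parse client; the model ID is a placeholder.

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI(base_url="<gateway-base-url>", api_key="<truefoundry-api-key>")  # see Getting Started


class Person(BaseModel):
    name: str
    age: int
    email: str | None = None  # optional fields are fine with the parse API


completion = client.beta.chat.completions.parse(
    model="openai-main/gpt-4o",  # placeholder model ID
    messages=[{"role": "user", "content": "Extract the person: 'Jane Doe, 34'."}],
    response_format=Person,  # pass the Pydantic class directly
)

person = completion.choices[0].message.parsed  # already a Person instance
print(person)
```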
Prompt Caching
Prompt caching optimizes API usage by allowing resumption from specific prefixes in your prompts. This significantly reduces processing time and costs for repetitive tasks or prompts with consistent elements. Prompt caching is supported by multiple providers, each with their own implementation.
Supported Providers
| Provider | Implementation | Documentation |
|---|---|---|
| OpenAI | Automatic prompt caching (KV cache) | OpenAI Prompt Caching |
| Anthropic | Requires explicit cache_control parameter | Anthropic Prompt Caching |
| Azure OpenAI | Automatic (inherited from OpenAI) | Azure OpenAI Prompt Caching |
| Groq | Automatic (similar to OpenAI) | Groq Prompt Caching |
| xAI | Automatic prompt caching via prefix matching | xAI Consumption and Rate Limits |
Supported Models
OpenAI
Supported models: All recent models, gpt-4o and newer.
Prompt caching is enabled for all recent models. You can use the prompt_cache_key parameter to improve cache hit rates when requests share common prefixes.
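A sketch passing prompt_cache_key on a request with a long shared prefix; the model ID is a placeholder, and the parameter is sent via extra_body in case the installed SDK version does not expose it as a keyword.

```python
from openai import OpenAI
client = OpenAI(base_url="<gateway-base-url>", api_key="<truefoundry-api-key>")  # see Getting Started

# A long, reusable system prompt shared across many requests.
long_system_prompt = "You are a support assistant for ExampleCorp. ..."  # placeholder shared prefix

response = client.chat.completions.create(
    model="openai-main/gpt-4o",  # placeholder model ID
    messages=[
        {"role": "system", "content": long_system_prompt},
        {"role": "user", "content": "How do I reset my password?"},
    ],
    extra_body={"prompt_cache_key": "support-assistant-v1"},
)
# Cached prefix tokens, if any, are reported in the usage details.
print(response.usage)
```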
Anthropic
Supported models: Claude Opus 4.1, Claude Opus 4, Claude Sonnet 4, Claude Sonnet 3.7, Claude Sonnet 3.5 (deprecated), Claude Haiku 3.5, Claude Haiku 3, Claude Opus 3 (deprecated)
For Anthropic models, you must explicitly add the cache_control parameter to any message content you want to cache; see the sketch after the table below.
Minimum Cacheable Length for Anthropic
| Model | Minimum Token Length |
|---|---|
| Claude Opus 4, Claude Sonnet 4, Claude Sonnet 3.7, Claude Sonnet 3.5, Claude Opus 3 | 1024 tokens |
| Claude Haiku 3.5, Claude Haiku 3 | 2048 tokens |
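A sketch that marks a long system block as cacheable. It assumes the gateway forwards a cache_control field placed on a content block to Anthropic unchanged, mirroring Anthropic's native API; the model ID and context are placeholders, so verify the exact passthrough shape for your gateway version.

```python
from openai import OpenAI
client = OpenAI(base_url="<gateway-base-url>", api_key="<truefoundry-api-key>")  # see Getting Started

large_context = "..."  # long, reusable context (must exceed the minimum cacheable length above)

response = client.chat.completions.create(
    model="anthropic-main/claude-sonnet-4",  # placeholder model ID
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": large_context,
                    # Marks this block as cacheable; assumes the gateway forwards
                    # cache_control to Anthropic as-is.
                    "cache_control": {"type": "ephemeral"},
                }
            ],
        },
        {"role": "user", "content": "Answer using the context above: what is the refund policy?"},
    ],
)
print(response.usage)
```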
Azure OpenAI
Supported models: gpt-4o, gpt-4o-mini, gpt-4o-realtime-preview (version 2024-12-17), gpt-4o-mini-realtime-preview (version 2024-12-17), o1 (version 2024-12-17), o3-mini (version 2025-01-31)
Groq
Supported models: moonshotai/kimi-k2-instruct
xAI
Supported models: All Grok models (e.g., grok-4-0709, grok-4-1-fast-reasoning, grok-2-vision-1212)
xAI supports automatic prompt caching via prefix matching. When you send requests with identical prompt prefixes, xAI caches those tokens, resulting in reduced costs for cached tokens. Cached tokens are shown in the usage.prompt_tokens_details.cached_tokens field.
To increase cache hit likelihood, you can use the x-grok-conv-id header with a constant UUID4 ID across related requests. Prompt caching works automatically via exact prefix matching.
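A sketch that reuses a constant UUID in the x-grok-conv-id header across related requests; the model ID is a placeholder, and it assumes the gateway forwards custom headers to xAI.

```python
import uuid
from openai import OpenAI

client = OpenAI(base_url="<gateway-base-url>", api_key="<truefoundry-api-key>")  # see Getting Started

# Reuse the same UUID4 value for all requests in a related conversation.
conversation_id = str(uuid.uuid4())

response = client.chat.completions.create(
    model="xai-main/grok-4-0709",  # placeholder model ID
    messages=[{"role": "user", "content": "Summarize our deployment checklist."}],
    extra_headers={"x-grok-conv-id": conversation_id},
)
# Cached tokens for matched prefixes appear in the usage details.
print(response.usage.prompt_tokens_details)
```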
Reasoning Models
TrueFoundry AI Gateway provides access to model reasoning processes through thinking/reasoning tokens, available for models from multiple providers including Anthropic, OpenAI, Azure OpenAI, Groq, xAI, and Vertex AI.
These models expose their internal reasoning process, allowing you to see how they arrive at conclusions. The thinking/reasoning tokens provide step-by-step insights into the model’s cognitive process.
Supported Reasoning Models
OpenAI
Supported models: o4-mini, o4-preview, o3 model family, o1 model family, gpt-5-mini, gpt-5-nano, gpt-5
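A sketch using the reasoning_effort parameter with an OpenAI reasoning model; the model ID is a placeholder, and the parameter requires a recent OpenAI SDK version.

```python
from openai import OpenAI
client = OpenAI(base_url="<gateway-base-url>", api_key="<truefoundry-api-key>")  # see Getting Started

response = client.chat.completions.create(
    model="openai-main/o3-mini",  # placeholder reasoning model ID
    messages=[{"role": "user", "content": "A train leaves at 9:40 and arrives at 12:05. How long is the trip?"}],
    reasoning_effort="medium",  # "low", "medium", or "high"
)
print(response.choices[0].message.content)
# Reasoning token usage is reported separately in the usage details.
print(response.usage.completion_tokens_details)
```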
Azure OpenAI
Supported models: gpt-5, gpt-5-mini, gpt-5-nano, o3-pro, codex-mini, o4-mini, o3, o3-mini, o1, o1-mini
Anthropic
Supported models: Claude Opus 4.1 (claude-opus-4-1-20250805), Claude Opus 4 (claude-opus-4-20250514), Claude Sonnet 4 (claude-sonnet-4-20250514), Claude Sonnet 3.7 (claude-3-7-sonnet-20250219), via Anthropic, AWS Bedrock, and Google Vertex AI
Using OpenAI SDK
For Anthropic models (from Anthropic, Google Vertex AI, AWS Bedrock), TrueFoundry automatically translates the reasoning_effort parameter into Anthropic’s native thinking parameter format, since Anthropic doesn’t support the reasoning_effort parameter directly. The translation uses the max_tokens parameter with the following ratios:
- none: 0% of max_tokens
- low: 30% of max_tokens
- medium: 60% of max_tokens
- high: 90% of max_tokens
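A sketch sending reasoning_effort to a Claude model through the gateway; the model ID is a placeholder, and the thinking budget is derived from max_tokens as described above.

```python
from openai import OpenAI
client = OpenAI(base_url="<gateway-base-url>", api_key="<truefoundry-api-key>")  # see Getting Started

response = client.chat.completions.create(
    model="anthropic-main/claude-sonnet-4",  # placeholder model ID
    messages=[{"role": "user", "content": "Why does ice float on water?"}],
    max_tokens=2000,            # the thinking budget is derived from this value
    reasoning_effort="medium",  # translated to a thinking budget of ~60% of max_tokens
)
print(response.choices[0].message.content)
```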
Using Direct API Calls with Native thinking Parameter
For more precise control with Anthropic models, you can use the native thinking parameter directly:
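A sketch passing Anthropic's native thinking object through the OpenAI-compatible request body; the model ID is a placeholder, and it assumes the gateway forwards the extra field to the provider unchanged.

```python
from openai import OpenAI
client = OpenAI(base_url="<gateway-base-url>", api_key="<truefoundry-api-key>")  # see Getting Started

response = client.chat.completions.create(
    model="anthropic-main/claude-sonnet-4",  # placeholder model ID
    messages=[{"role": "user", "content": "Why does ice float on water?"}],
    max_tokens=2000,
    # Anthropic's native thinking parameter, passed via extra_body on the assumption
    # that the gateway forwards provider-specific fields as-is.
    extra_body={"thinking": {"type": "enabled", "budget_tokens": 1024}},
)
print(response.choices[0].message.content)
```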
Groq
Supported models: OpenAI GPT-OSS 20B (openai/gpt-oss-20b), OpenAI GPT-OSS 120B (openai/gpt-oss-120b), Qwen 3 32B (qwen/qwen3-32b), DeepSeek R1 Distill Llama 70B (deepseek-r1-distill-llama-70b)
xAI
Supported models: grok-3-mini (with reasoning_effort parameter), grok-4-0709, grok-4-1-fast-reasoning, grok-4-fast-reasoning (reasoning built-in)
For grok-3-mini, you can use the reasoning_effort parameter to control reasoning depth. Other Grok models like grok-4-0709 have reasoning capabilities built-in but do not support the reasoning_effort parameter.
The reasoning_effort parameter is only supported for grok-3-mini. For other Grok models like grok-4-0709 and grok-4-1-fast-reasoning, reasoning is built-in and the reasoning_effort parameter should not be used. Reasoning tokens are included in the usage metrics for all reasoning-capable models.
Parameter Restrictions: Reasoning models (like grok-4-0709 and grok-4-1-fast-reasoning) do not support presence_penalty, frequency_penalty, or stop parameters. Using these parameters with reasoning models will result in an error.
Gemini
Supported models: All Gemini 2.5 Series models. These models can be accessed from the Google Vertex or Google Gemini providers.
For Gemini models (from Google Vertex AI and Google Gemini), TrueFoundry automatically translates the reasoning_effort parameter into Gemini’s native thinking parameter format, since Gemini doesn’t support the reasoning_effort parameter directly. The translation uses the max_tokens parameter with the following ratios:
- none: 0% of max_tokens
- low: 30% of max_tokens
- medium: 60% of max_tokens
- high: 90% of max_tokens
Using Direct API Calls with Native thinking Parameter
For more precise control with Gemini models, you can use the native thinking parameter directly:
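A sketch passing Gemini's documented thinking configuration (thinkingBudget in tokens, includeThoughts to return thought summaries) through the OpenAI-compatible request body; the model ID and the extra_body field name are assumptions, so check the field name your gateway version expects.

```python
from openai import OpenAI
client = OpenAI(base_url="<gateway-base-url>", api_key="<truefoundry-api-key>")  # see Getting Started

response = client.chat.completions.create(
    model="google-main/gemini-2.5-flash",  # placeholder model ID
    messages=[{"role": "user", "content": "Plan a 3-day trip to Kyoto."}],
    # Gemini's native thinking configuration, passed via extra_body on the assumption
    # that the gateway forwards it to the provider; field name is an assumption.
    extra_body={"thinkingConfig": {"thinkingBudget": 1024, "includeThoughts": True}},
)
print(response.choices[0].message.content)
```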