Truefoundry Docs

TrueFoundry AI Gateway provides a universal API for all supported models via the standard OpenAI /chat/completions endpoint. This unified interface allows you to seamlessly work with models from different providers through a consistent API.

Section	Description
Getting Started	Basic setup and configuration
Input Controls	System prompts and request parameters
Working with Media	Images, audio, and video support
Function Calling	Enabling models to invoke functions
Thought Signatures	Preserving reasoning context in tool calls
Response Format	Structured JSON outputs
Prompt Caching	Optimize API usage with caching
Reasoning Models	Access model reasoning processes

Getting Started

You can use the standard OpenAI client to send requests to the gateway:

from openai import OpenAI

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="<truefoundry-base-url>/api/llm" # e.g. https://my-company.truefoundry.cloud/api/llm
)

response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini", # this is the truefoundry model id
    messages=[{"role": "user", "content": "Hello, how are you?"}]
)

print(response.choices[0].message.content)

Configuration

You will need to configure the following:

base_url: The base URL of the TrueFoundry dashboard
api_key: API key generated from Personal Access Tokens
model: TrueFoundry model ID in the format provider_account/model_name (available in the LLM playground UI)

Input Controls

System Prompts

System prompts set the behavior and context for the model by defining the assistant’s role, tone, and constraints:

response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant that specializes in Python programming."},
        {"role": "user", "content": "How do I write a function to calculate factorial?"}
    ]
)

Request Parameters

Fine-tune model behavior with these common parameters:

response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",
    messages=[{"role": "user", "content": "Hello, how are you?"}],
    temperature=0.7,       # Controls randomness (0.0 to 1.0)
    max_tokens=100,        # Maximum tokens to generate
    verbosity="high",      # Constrains verbosity: low, medium, high
    top_p=0.9,             # Nucleus sampling parameter
    frequency_penalty=0.0, # Reduces repetition
    presence_penalty=0.0,  # Encourages new topics
    stop=["\n", "Human:"]  # Stop sequences
)

Some models don’t support all parameters. For example, temperature is not supported by o series models like o3-mini.

The API supports various media types including images, audio, video and pdf.

Images

Supported Models: GPT-4o, GPT-4 Vision, Claude 3, Gemini Pro VisionSend images as part of your chat completion requests using either URLs or base64 encoding:

Using Image URLs

from openai import OpenAI

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="<truefoundry-base-url>/api/llm"
)

response = client.chat.completions.create(
    model="openai-main/gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/image.jpg"
                    }
                }
            ]
        }
    ]
)

Using Base64 Encoded Images

import base64

def encode_image(image_path):
    with open(image_path, "rb") as image_file:
        return base64.b64encode(image_file.read()).decode('utf-8')

response = client.chat.completions.create(
    model="openai-main/gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{encode_image('image.jpeg')}"
                    }
                }
            ]
        }
    ]
)

Media Resolution

Supported Providers: OpenAI, Azure OpenAI, Google Gemini, Google Vertex AI, xAIThe detail parameter in the image_url object allows you to control the resolution at which images are processed. This helps balance between response quality, latency, and cost.Supported Values: low, high, auto

Example Usage

import base64

from openai import OpenAI

API_KEY = "your_truefoundry_api_key"
BASE_URL = "<truefoundry-base-url>/api/llm"

# Read and encode the image as base64
with open("test-img.png", "rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode('utf-8')

client = OpenAI(
    api_key=API_KEY,
    base_url=BASE_URL
)

response = client.chat.completions.create(
    model="test-123/gemini-3-pro-preview",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{base64_image}",
                        "detail": "low"  # Options: "low", "high", "auto"
                    }
                }
            ]
        }
    ]
)

print(response.choices[0].message)

For Google Gemini and Vertex AI providers, the detail parameter is automatically translated to the mediaResolution parameter:

"low" → MEDIA_RESOLUTION_LOW (64 tokens)
"high" → MEDIA_RESOLUTION_HIGH (256+ tokens with scaling)
"auto" or omitted → No explicit media resolution (model decides)

Audio

Supported Models: Google Gemini models (Gemini 2.0 Flash, etc.)Send audio files in supported formats (MP3, WAV, etc.). Currently supported for Google Gemini models:

Using Audio URLs

response = client.chat.completions.create(
    model="internal-google/gemini-2-0-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this audio"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/audio.wav",
                        "mime_type": "audio/wav" # required for gemini models
                    }
                }
            ]
        }
    ]
)

Using Base64 Encoded Audio

import base64

def encode_audio(audio_path):
    with open(audio_path, "rb") as audio_file:
        return base64.b64encode(audio_file.read()).decode('utf-8')

response = client.chat.completions.create(
    model="internal-google/gemini-2-0-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Transcribe this audio"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:audio/wav;base64,{encode_audio('audio.wav')}"
                    }
                }
            ]
        }
    ]
)

Video

Supported Models: Google Gemini models (Gemini 2.0 Flash, etc.)Video processing is natively supported for Google Gemini models:

Using Video URLs

response = client.chat.completions.create(
    model="internal-google/gemini-2-0-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what's happening in this video"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://www.youtube.com/watch?v=example",
                        "mime_type": "video/mp4" # required for gemini models
                    }
                }
            ]
        }
    ]
)

Using Base64 Encoded Video

import base64

def encode_video(video_path):
    with open(video_path, "rb") as video_file:
        return base64.b64encode(video_file.read()).decode('utf-8')

response = client.chat.completions.create(
    model="internal-google/gemini-2-0-flash",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what's happening in this video"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:video/mp4;base64,{encode_video('video.mp4')}",
                        "mime_type": "video/mp4" # required for gemini models
                    }
                }
            ]
        }
    ]
)

PDF Documents

Supported Providers: OpenAI, Bedrock, Anthropic, Google Vertex, Google GeminiPDF document processing allows models to analyze and extract information from PDF files:

Using Base64 Encoded PDF

from openai import OpenAI

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="<truefoundry-base-url>/api/llm"
)

import base64

with open("sample.pdf", "rb") as file_data:
    base64_image = base64.b64encode(image_file.read()).decode('utf-8')

response = client.chat.completions.create(
    model="tfy-ai-anthropic/claude-4-sonnet-20250514",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "what's the data in the file"},
                {
                    "type": "file",
                    "file": {
                        "filename": "sample.pdf",
                        "file_data": f"data:application/pdf;base64,{file_data}",
                    }
                },
            ]
        }
    ]
)

print(response.choices[0].message.content)

Vision

TrueFoundry supports vision models from all integrated providers as they become available. These models can analyze and interpret images alongside text, enabling multimodal AI applications.

Provider	Models
OpenAI	`gpt-4-vision-preview, gpt-4o, gpt-4o-mini`
Anthropic	`claude-3-sonnet, claude-3-haiku, claude-3-opus, claude-3.5-sonnet, claude-3.5-haiku, claude-4-oppus, claude-4-sonnet, claude-3-7-sonnet`
Gemini	`gemini-1.0-pro-vision, gemini-1.5-flash, gemini-1.5-flash-8b, gemini-1.5-pro, gemini-2.5-pro, gemini-2.5-flash`
AWS Bedrock	`anthropic.claude-3-5-sonnet, anthropic.claude-3-5-haiku, anthropic.claude-3-5-sonnet-20240620-v1:0`
Azure OpenAI	`gpt-4-vision-preview, gpt-4o, gpt-4o-mini`
xAI	`grok-2-vision-1212`

Using Vision Models with OpenAI SDK

from openai import OpenAI

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="http://{controlPlaneURL}/api/llm"
)

response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"
                    }
                }
            ]
        }
    ]
)

print(response.choices[0].message)

Function Calling

Function calling allows models to invoke defined functions during conversations, enabling them to perform specific actions or retrieve external information.

Basic Usage

Define functions that the model can call:

from openai import OpenAI
import json

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="<truefoundry-base-url>/api/llm"
)

# Define a function
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather in a given location",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {
                    "type": "string",
                    "description": "The city and state, e.g. San Francisco, CA"
                },
                "unit": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"]
                }
            },
            "required": ["location"]
        }
    }
}]

# Make the request
response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather in New York?"}],
    tools=tools
)

# Check if the model wants to call a function
if response.choices[0].message.tool_calls:
    tool_call = response.choices[0].message.tool_calls[0]
    function_name = tool_call.function.name
    function_args = json.loads(tool_call.function.arguments)

    print(f"Function called: {function_name}")
    print(f"Arguments: {function_args}")

Function Definition Reference

Creating Well-Structured Function Definitions

When defining functions, you need to provide:

name: The function name
description: What the function does
parameters: JSON Schema object describing the parameters

function_schema = {
    "name": "get_weather",
    "description": "Get current weather information",
    "parameters": {
        "type": "object",
        "properties": {
            "location": {
                "type": "string",
                "description": "City name"
            }
        },
        "required": ["location"]
    }
}

Supported Parameter Types for Function Arguments

Functions support various parameter types:

function_schema = {
    "name": "process_data",
    "description": "Process data with various parameters",
    "parameters": {
        "type": "object",
        "properties": {
            "text": {
                "type": "string",
                "description": "Text to process"
            },
            "count": {
                "type": "integer",
                "description": "Number of items"
            },
            "confidence": {
                "type": "number",
                "description": "Threshold (0.0 to 1.0)"
            },
            "enabled": {
                "type": "boolean",
                "description": "Whether processing is enabled"
            },
            "categories": {
                "type": "array",
                "items": {"type": "string"},
                "description": "List of categories"
            }
        },
        "required": ["text"]
    }
}

Implementation Workflows

Working with Multiple Function Definitions

Define multiple functions for the model to choose from:

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get current weather information",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {"type": "string"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
                },
                "required": ["location"]
            }
        }
    },
    {
        "type": "function",
        "function": {
            "name": "search_web",
            "description": "Search the web for information",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string"},
                    "max_results": {"type": "integer", "default": 5}
                },
                "required": ["query"]
            }
        }
    }
]

Processing and Responding to Function Calls

Process function calls and continue the conversation:

# Initial request
messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]
response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",
    messages=messages,
    tools=tools
)

# Handle function call
if response.choices[0].message.tool_calls:
    messages.append(response.choices[0].message)

    for tool_call in response.choices[0].message.tool_calls:
        function_name = tool_call.function.name
        function_args = json.loads(tool_call.function.arguments)

        # Execute your function (simulated here)
        if function_name == "get_weather":
            result = f"The weather in {function_args['location']} is 22°C and sunny"

        # Add the function result to the conversation
        messages.append({
            "role": "tool",
            "tool_call_id": tool_call.id,
            "content": result
        })

    # Continue the conversation
    final_response = client.chat.completions.create(
        model="openai-main/gpt-4o-mini",
        messages=messages
    )

    print(final_response.choices[0].message.content)

Controlling When and How Functions Are Called

Control when and how functions are called:

# Force a specific function call
response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather?"}],
    tools=tools,
    tool_choice={"type": "function", "function": {"name": "get_weather"}}
)

# Allow automatic function calling (default)
response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather?"}],
    tools=tools,
    tool_choice="auto"
)

# Prevent function calling
response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather?"}],
    tools=tools,
    tool_choice="none"
)

# Force any function call
response = client.chat.completions.create(
    model="openai-main/gpt-4o-mini",
    messages=[{"role": "user", "content": "What's the weather?"}],
    tools=tools,
    tool_choice="required"
)

Thought Signatures

Thought signatures are encrypted representations of a model’s internal reasoning process that help maintain context and coherence across multi-turn interactions, particularly during function calling. When using certain Gemini 3 preview models, the API includes a thought_signature field in tool call responses.

from openai import OpenAI
import json

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="<truefoundry-base-url>/api/llm"
)

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
            "type": "object",
            "properties": {"location": {"type": "string"}},
            "required": ["location"]
        }
    }
}]

# First call - model requests tool
response = client.chat.completions.create(
    model="vertex-main/gemini-3-pro-preview",
    messages=[{"role": "user", "content": "What's the weather in San Francisco?"}],
    tools=tools
)

message = response.choices[0].message

if message.tool_calls:
    tool_call = message.tool_calls[0]
    args = json.loads(tool_call.function.arguments)
    result = f"The weather in {args['location']} is 18°C and cloudy."

    # Convert message to dict (preserves thought_signature)
    assistant_message = message.model_dump(exclude_none=True)

    # Second call - send tool result back
    final_response = client.chat.completions.create(
        model="vertex-main/gemini-3-pro-preview",
        messages=[
            {"role": "user", "content": "What's the weather in San Francisco?"},
            assistant_message,  # Includes thought_signature
            {
                "role": "tool",
                "content": json.dumps(result),
                "tool_call_id": tool_call.id
            }
        ]
    )

    print(final_response.choices[0].message.content)

Response Format

The chat completions API supports structured response formats, enabling you to receive consistent, predictable outputs in JSON format. This is useful for parsing responses programmatically.

JSON Response Options

Basic JSON Mode: Getting Valid JSON Without Structure Constraints

JSON mode ensures the model’s output is valid JSON without enforcing a specific structure:

from openai import OpenAI

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="<truefoundry-base-url>/api/llm"
)

response = client.chat.completions.create(
    model="openai-main/gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant designed to output JSON."},
        {"role": "user", "content": "Extract information about the 2020 World Series winner"}
    ],
    response_format={"type": "json_object"}
)

print(response.choices[0].message.content)

Output:

{
  "team": "Los Angeles Dodgers",
  "year": 2020,
  "opponent": "Tampa Bay Rays",
  "games_played": 6,
  "series_result": "4-2"
}

JSON Schema Mode: Enforcing Specific Data Structures

JSON Schema mode provides strict structure validation using predefined schemas:

from openai import OpenAI
import json

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="<truefoundry-base-url>/api/llm"
)

# Define JSON schema
user_info_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "age": {"type": "integer", "minimum": 0},
        "occupation": {"type": "string"},
        "location": {"type": "string"},
        "skills": {
            "type": "array",
            "items": {"type": "string"}
        }
    },
    "required": ["name", "age", "occupation", "location", "skills"],
    "additionalProperties": False
}

response = client.chat.completions.create(
    model="openai-main/gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "Extract user information and respond according to the provided JSON schema."
        },
        {
            "role": "user",
            "content": "My name is Sarah Johnson, I'm 28 years old, and I work as a data scientist in New York. I'm skilled in Python, SQL, and machine learning."
        }
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "user_info",
            "schema": user_info_schema,
            "strict": True
        }
    }
)

# Parse response
result = json.loads(response.choices[0].message.content)

When using JSON schema with strict mode set to true, all properties defined in the schema must be included in the required array. If any property is defined but not marked as required, the API will return a 400 Bad Request Error.

Advanced Schema Integration

Python Type Validation with Pydantic Models

Pydantic provides automatic validation, serialization, and type hints for structured data:

from openai import OpenAI
from pydantic import BaseModel, Field
from typing import List

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="<truefoundry-base-url>/api/llm"
)

# Define Pydantic model
class UserInfo(BaseModel):
    name: str = Field(description="Full name of the user")
    age: int = Field(ge=0, description="Age in years")
    occupation: str = Field(description="Job title or profession")
    location: str = Field(description="City or location")
    skills: List[str] = Field(description="List of professional skills")

    class Config:
        extra = "forbid"  # Prevent additional fields

response = client.chat.completions.create(
    model="openai-main/gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "Extract user information and respond according to the provided schema."
        },
        {
            "role": "user",
            "content": "Hi, I'm Mike Chen, a 32-year-old software architect from Seattle. I specialize in cloud computing, microservices, and Kubernetes."
        }
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "user_info",
            "schema": UserInfo.model_json_schema(),
            "strict": True
        }
    }
)

# Parse and validate with Pydantic
user_data = UserInfo.model_validate_json(response.choices[0].message.content)

When using OpenAI models with Pydantic Models, there should not be any optional fields in the pydantic model when strict mode is true. This is because the corresponding JSON schema will have missing fields in the “required” section.

Streamlined Pydantic Integration with OpenAI's Beta Parse API

The beta parse client provides the most streamlined approach for Pydantic integration:

from openai import OpenAI
from pydantic import BaseModel, Field
from typing import List, Optional

class UserInfo(BaseModel):
    name: str = Field(description="Full name of the user")
    age: int = Field(ge=0, description="Age in years")
    occupation: str = Field(description="Job title or profession")
    location: Optional[str] = Field(None, description="City or location")
    skills: List[str] = Field(default=[], description="List of professional skills")

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="<truefoundry-base-url>/api/llm"
)

completion = client.beta.chat.completions.parse(
    model="openai-main/gpt-4o",
    messages=[
        {
            "role": "system",
            "content": "Extract user information from the provided text."
        },
        {
            "role": "user",
            "content": "Hello, I'm Alex Rodriguez, a 29-year-old product manager from Austin. I have experience in agile methodologies, data analysis, and team leadership."
        }
    ],
    response_format=UserInfo,
)

user_result = completion.choices[0].message.parsed

This approach allows for optional fields in your Pydantic model and provides a cleaner API for structured responses.

Prompt Caching

Prompt caching optimizes API usage by allowing resumption from specific prefixes in your prompts. This significantly reduces processing time and costs for repetitive tasks or prompts with consistent elements.

Prompt caching is supported by multiple providers, each with their own implementation.

Supported Providers

Provider	Implementation	Documentation
OpenAI	Automatic prompt caching (KV cache)	OpenAI Prompt Caching
Anthropic	Requires explicit cache_control parameter	Anthropic Prompt Caching
Azure OpenAI	Automatic (inherited from OpenAI)	Azure OpenAI Prompt Caching
Groq	Automatic (similar to OpenAI)	Groq Prompt Caching
xAI	Automatic prompt caching via prefix matching	xAI Consumption and Rate Limits

Supported Models

OpenAI

Supported models: All recent models, gpt-4o and newer.Prompt caching is enabled for all recent models. You can use the prompt_cache_key parameter to improve cache hit rates when requests share common prefixes.

from openai import OpenAI

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="https://{controlPlaneURL}/api/llm"
)

response = client.chat.completions.create(
    model="openai-main/gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "Your prompt here"
        }
    ],
    stream=True,
    prompt_cache_key="optional-custom-key"  # OpenAI-specific parameter to improve cache hit rates
)

Anthropic

Supported models: Claude Opus 4.1, Claude Opus 4, Claude Sonnet 4, Claude Sonnet 3.7, Claude Sonnet 3.5 (deprecated), Claude Haiku 3.5, Claude Haiku 3, Claude Opus 3 (deprecated)For Anthropic models, you must explicitly add the cache_control parameter to any message content you want to cache:

from openai import OpenAI

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="https://{controlPlaneURL}/api/llm"
)

response = client.chat.completions.create(
    model="anthropic-main/claude-3-7-sonnet-latest",
    messages=[
        {
            "role": "system",
            "content": [
                {
                    "type": "text",
                    "text": "<TEXT_TO_CACHE>",
                    "cache_control": {"type": "ephemeral", "ttl": "5m"},
                },
            ],
        },
        {
            "role": "user",
            "content": "Enter your prompt here",
        },
    ]
)

Minimum Cacheable Length for Anthropic

Model	Minimum Token Length
Claude Opus 4, Claude Sonnet 4, Claude Sonnet 3.7, Claude Sonnet 3.5, Claude Opus 3	1024 tokens
Claude Haiku 3.5, Claude Haiku 3	2048 tokens

Azure OpenAI

Supported models: gpt-4o, gpt-4o-mini, gpt-4o-realtime-preview (version 2024-12-17), gpt-4o-mini-realtime-preview (version 2024-12-17), o1 (version 2024-12-17), o3-mini (version 2025-01-31)

from openai import OpenAI

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="https://{controlPlaneURL}/api/llm"
)

response = client.chat.completions.create(
    model="azure-openai-main/gpt-4o",
    messages=[
        {
            "role": "user",
            "content": "Your prompt here"
        }
    ],
    stream=True,
    prompt_cache_key="optional-custom-key"  # OpenAI-specific parameter to improve cache hit rates
)

Groq

Supported models: moonshotai/kimi-k2-instruct

from openai import OpenAI

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="https://{controlPlaneURL}/api/llm"
)

response = client.chat.completions.create(
    model="groq-main/moonshotai-kimi-k2-instruct",
    messages=[
        {
            "role": "user",
            "content": "Your prompt here"
        }
    ],
    stream=True
)

xAI

Supported models: All Grok models (e.g., grok-4-0709, grok-4-1-fast-reasoning, grok-2-vision-1212)xAI supports automatic prompt caching via prefix matching. When you send requests with identical prompt prefixes, xAI caches those tokens, resulting in reduced costs for cached tokens. Cached tokens are shown in the usage.prompt_tokens_details.cached_tokens field.

from openai import OpenAI

client = OpenAI(
    api_key="your_truefoundry_api_key",
    base_url="https://{controlPlaneURL}/api/llm"
)

response = client.chat.completions.create(
    model="xai-main/grok-4-0709",
    messages=[
        {
            "role": "user",
            "content": "Your prompt here"
        }
    ],
    stream=True
)

To increase cache hit likelihood, you can use the x-grok-conv-id header with a constant UUID4 ID across related requests. Prompt caching works automatically via exact prefix matching.

Reasoning Models

TrueFoundry AI Gateway provides access to model reasoning processes through thinking/reasoning tokens, available for models from multiple providers including Anthropic,OpenAI,Azure OpenAI,Groq, xAI and Vertex. These models expose their internal reasoning process, allowing you to see how they arrive at conclusions. The thinking/reasoning tokens provide step-by-step insights into the model’s cognitive process.

Supported Reasoning Models

OpenAI

Supported models: o4-mini, o4-preview, o3 model family, o1 model family, gpt-5-mini, gpt-5-nano, gpt-5

from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="https://{controlPlaneURL}/api/llm"
)

response = client.chat.completions.create(
    model="openai-main/o4-mini",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    reasoning_effort="high",  # Options: "high", "medium", "low"
    max_tokens=8000
)

print(response.choices[0].message.content)

Azure OpenAI

Supported models: gpt-5, gpt-5-mini, gpt-5-nano, o3-pro, codex-mini, o4-mini, o3, o3-mini, o1, o1-mini

from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="https://{controlPlaneURL}/api/llm"
)

response = client.chat.completions.create(
    model="azure-openai-main/o3-mini",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    reasoning_effort="high",  # Options: "high", "medium", "low"
    max_tokens=8000
)

print(response.choices[0].message.content)

Anthropic

Supported models: Claude Opus 4.1 (claude-opus-4-1-20250805), Claude Opus 4 (claude-opus-4-20250514), Claude Sonnet 4 (claude-sonnet-4-20250514), Claude Sonnet 3.7 (claude-3-7-sonnet-20250219)
via Anthropic, AWS Bedrock, and Google Vertex AI

Using OpenAI SDK

from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="https://{controlPlaneURL}/api/llm"
)

response = client.chat.completions.create(
    model="anthropic-main/claude-3-7-sonnet",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    reasoning_effort="high",  # Options: "high", "medium", "low", "none"
    max_tokens=8000
)

print(response.choices[0].message.content)

For Anthropic models (from Anthropic, Google Vertex AI, AWS Bedrock), TrueFoundry automatically translates the reasoning_effort parameter into Anthropic’s native thinking parameter format since Anthropic doesn’t support the reasoning_effort parameter directly.The translation uses the max_tokens parameter with the following ratios:

none: 0% of max_tokens
low: 30% of max_tokens
medium: 60% of max_tokens
high: 90% of max_tokens

Using Direct API Calls with Native `thinking` Parameter

For more precise control with Anthropic models, you can use the native thinking parameter directly:

from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="https://{controlPlaneURL}/api/llm"
)

response = client.chat.completions.create(
    model="anthropic-main/claude-3-7-sonnet",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    max_tokens=8000,
    extra_body={
        "thinking": {
            "type": "enabled",
            "budget_tokens": 1000
        }
    }
)

print(response.choices[0].message.content)

Groq

Supported models: OpenAI GPT-OSS 20B (openai/gpt-oss-20b), OpenAI GPT-OSS 120B (openai/gpt-oss-120b), Qwen 3 32B (qwen/qwen3-32b), DeepSeek R1 Distil Llama 70B (deepseek-r1-distill-llama-70b)

from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="https://{controlPlaneURL}/api/llm"
)

response = client.chat.completions.create(
    model="groq-main/deepseek-r1-distill-llama-70b",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    reasoning_effort="high",  # Options: "high", "medium", "low"
    max_tokens=8000
)

print(response.choices[0].message.content)

xAI

Supported models: grok-3-mini (with reasoning_effort parameter), grok-4-0709, grok-4-1-fast-reasoning, grok-4-fast-reasoning (reasoning built-in)For grok-3-mini, you can use the reasoning_effort parameter to control reasoning depth. Other Grok models like grok-4-0709 have reasoning capabilities built-in but do not support the reasoning_effort parameter.

from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="https://{controlPlaneURL}/api/llm"
)

# For grok-3-mini with reasoning_effort parameter
response = client.chat.completions.create(
    model="xai-main/grok-3-mini",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    reasoning_effort="high",  # Options: "high", "low" (only for grok-3-mini)
    max_tokens=8000
)

print(response.choices[0].message.content)

The reasoning_effort parameter is only supported for grok-3-mini. For other Grok models like grok-4-0709 and grok-4-1-fast-reasoning, reasoning is built-in and the reasoning_effort parameter should not be used. Reasoning tokens are included in the usage metrics for all reasoning-capable models.Parameter Restrictions: Reasoning models (like grok-4-0709 and grok-4-1-fast-reasoning) do not support presence_penalty, frequency_penalty, or stop parameters. Using these parameters with reasoning models will result in an error.

Gemini

Supported models: All Gemini 2.5 Series Models.These models can be accessed from Google Vertex or Google Gemini Providers

from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="https://{controlPlaneURL}/api/llm"
)

response = client.chat.completions.create(
    model="vertex-main/gemini-2-5-pro",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    reasoning_effort="high",  # Options: "high", "medium", "low", "none"
    max_tokens=8000
)

print(response.choices[0].message.content)

For Gemini models (from Anthropic, Google Vertex AI, AWS Bedrock), TrueFoundry automatically translates the reasoning_effort parameter into Gemini’s native thinking parameter format since Gemini doesn’t support the reasoning_effort parameter directly.The translation uses the max_tokens parameter with the following ratios:

none: 0% of max_tokens
low: 30% of max_tokens
medium: 60% of max_tokens
high: 90% of max_tokens

Note: Gemini 2.5 Pro and 2.5 Flash comes with reasoning on by default.

Using Direct API Calls with Native `thinking` Parameter

For more precise control with Gemini models, you can use the native thinking parameter directly:

from openai import OpenAI

client = OpenAI(
    api_key="TFY_API_KEY",
    base_url="https://{controlPlaneURL}/api/llm"
)

response = client.chat.completions.create(
    model="vertex-main/gemini-2-5-pro",
    messages=[
        {"role": "user", "content": "How to compute 3^3^3?"}
    ],
    max_tokens=8000,
    extra_body={
        "thinking": {
            "type": "enabled",
            "budget_tokens": 1000
        }
    }
)

print(response.choices[0].message.content)

Response Format

When reasoning tokens are enabled, the response includes both thinking and content sections:

{
  "id": "1742890579083",
  "object": "chat.completion",
  "created": 1742890579,
  "model": "",
  "provider": "aws",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "To compute 3^3^3, you need to understand that exponentiation is evaluated from right to left (right-associative). So:\n\n3^3^3 = 3^(3^3), not (3^3)^3\n\nHere's how to calculate it step by step:\n\n**Step 1:** Calculate the rightmost exponent first\n- 3^3 = 27\n\n**Step 2:** Now calculate 3^27\n- 3^27 = 7,625,597,484,987\n\nTherefore: **3^3^3 = 7,625,597,484,987**\n\n---\n\n**Note:** If you meant (3^3)^3 instead, that would be:\n- (3^3)^3 = 27^3 = 19,683\n\nThis is much smaller than 3^3^3 because:\n- (3^3)^3 = 3^(3×3) = 3^9\n- While 3^3^3 = 3^27\n\nThe difference between 3^9 and 3^27 is enormous!",
        "reasoning_content": "The user is asking how to compute 3^3^3. This is a question about exponentiation and specifically about the order of operations when dealing with repeated exponentiation (also known as tetration or power towers).\n\nThe key thing to understand here is that exponentiation is right-associative, meaning 3^3^3 = 3^(3^3), not (3^3)^3.\n\nSo we need to:\n1. First compute 3^3 = 27\n2. Then compute 3^27\n\nLet me work through this:\n- 3^3 = 27\n- 3^27 = 3^27\n\nNow 3^27 is a large number. Let me think about how to compute it:\n3^27 = 3^(3×9) = (3^3)^9 = 27^9\n\nOr we could compute it directly:\n3^1 = 3\n3^2 = 9\n3^3 = 27\n3^4 = 81\n3^5 = 243\n3^6 = 729\n3^7 = 2,187\n3^8 = 6,561\n3^9 = 19,683\n3^10 = 59,049\n...\n\nActually, let me just state that 3^27 = 7,625,597,484,987\n\nSo 3^3^3 = 3^(3^3) = 3^27 = 7,625,597,484,987"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 45,
    "completion_tokens": 180,
    "total_tokens": 225
  }
}

Streaming with Reasoning Tokens

For streaming responses, the thinking section is always sent before the content section.

API Reference

For detailed API specifications, parameters, and response schemas, see the Chat Completions API Reference.

Get Started

Developer Guide

MCP Registry and Gateway

Prompt Management

Observability

Integrations

Deployment

API Reference

Chat

Agent

Embeddings

Rerank

Responses

Image

Audio

Batch

Files

Moderations

Models

​Contents

​Getting Started

​Configuration

​Input Controls

​System Prompts

​Request Parameters

​Working with Multi Modal

​Using Image URLs

​Using Base64 Encoded Images

​Example Usage

​Using Audio URLs

​Using Base64 Encoded Audio

​Using Video URLs

​Using Base64 Encoded Video

​Using Base64 Encoded PDF

​Vision

​Using Vision Models with OpenAI SDK

​Function Calling

​Basic Usage

​Function Definition Reference

​Implementation Workflows

​Thought Signatures

​Response Format

​JSON Response Options

​Advanced Schema Integration

​Prompt Caching

​Supported Providers

​Supported Models

​Minimum Cacheable Length for Anthropic

​Reasoning Models

​Supported Reasoning Models

​Using OpenAI SDK

​Using Direct API Calls with Native thinking Parameter

​Using Direct API Calls with Native thinking Parameter

​Response Format

​Streaming with Reasoning Tokens

​API Reference

Contents

Getting Started

Configuration

Input Controls

System Prompts

Request Parameters

Working with Multi Modal

Using Image URLs

Using Base64 Encoded Images

Example Usage

Using Audio URLs

Using Base64 Encoded Audio

Using Video URLs

Using Base64 Encoded Video

Using Base64 Encoded PDF

Vision

Using Vision Models with OpenAI SDK

Function Calling

Basic Usage

Function Definition Reference

Implementation Workflows

Thought Signatures

Response Format

JSON Response Options

Advanced Schema Integration

Prompt Caching

Supported Providers

Supported Models

Minimum Cacheable Length for Anthropic

Reasoning Models

Supported Reasoning Models

Using OpenAI SDK

Using Direct API Calls with Native `thinking` Parameter

Using Direct API Calls with Native `thinking` Parameter

Response Format

Streaming with Reasoning Tokens

API Reference