
Chutes

Overview

  • Description: Chutes is a cloud-native AI deployment platform that allows you to deploy, run, and scale LLM applications with OpenAI-compatible APIs, using pre-built templates for popular frameworks like vLLM and SGLang.
  • Provider Route on LiteLLM: chutes/
  • Link to Provider Doc: Chutes Website ↗
  • Base URL: https://llm.chutes.ai/v1/
  • Supported Operations: /chat/completions, embeddings

What is Chutes?

Chutes is a powerful AI deployment and serving platform that provides:

  • Pre-built Templates: Ready-to-use configurations for vLLM, SGLang, diffusion models, and embeddings
  • OpenAI-Compatible APIs: Use standard OpenAI SDKs and clients
  • Multi-GPU Scaling: Support for large models across multiple GPUs
  • Streaming Responses: Real-time model outputs
  • Custom Configurations: Override any parameter for your specific needs
  • Performance Optimization: Pre-configured optimization settings

Required Variables

Environment Variables
os.environ["CHUTES_API_KEY"] = ""  # your Chutes API key

Get your Chutes API key from chutes.ai.

Usage - LiteLLM Python SDK

Non-streaming

Chutes Non-streaming Completion
import os
from litellm import completion

os.environ["CHUTES_API_KEY"] = ""  # your Chutes API key

messages = [{"content": "What is the capital of France?", "role": "user"}]

# Chutes call
response = completion(
    model="chutes/model-name",  # replace with an actual model name
    messages=messages,
)

print(response)

Streaming

Chutes Streaming Completion
import os
from litellm import completion

os.environ["CHUTES_API_KEY"] = ""  # your Chutes API key

messages = [{"content": "Write a short poem about AI", "role": "user"}]

# Chutes call with streaming
response = completion(
    model="chutes/model-name",  # replace with an actual model name
    messages=messages,
    stream=True,
)

for chunk in response:
    print(chunk)

Usage - LiteLLM Proxy Server

1. Save key in your environment

export CHUTES_API_KEY=""

2. Start the proxy

config.yaml
model_list:
  - model_name: chutes-model
    litellm_params:
      model: chutes/model-name # Replace with actual model name
      api_key: os.environ/CHUTES_API_KEY
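
Then launch the proxy against that config (the standard LiteLLM entry point; the proxy listens on port 4000 by default):

litellm --config config.yaml

3. Send a test request

A minimal sketch using the OpenAI SDK against the local proxy; the proxy key below is a placeholder and is only needed if you configured one:

Test request via OpenAI SDK
import openai

# Point the OpenAI client at the local LiteLLM proxy (default port 4000)
client = openai.OpenAI(
    base_url="http://0.0.0.0:4000",
    api_key="sk-1234",  # placeholder; your LiteLLM proxy key, if one is set
)

response = client.chat.completions.create(
    model="chutes-model",  # the model_name from config.yaml
    messages=[{"role": "user", "content": "What is the capital of France?"}],
)

print(response)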

Supported OpenAI Parameters

Chutes supports all standard OpenAI-compatible parameters:

  • messages (array): Required. Array of message objects with 'role' and 'content'
  • model (string): Required. Model ID or HuggingFace model identifier
  • stream (boolean): Optional. Enable streaming responses
  • temperature (float): Optional. Sampling temperature
  • top_p (float): Optional. Nucleus sampling parameter
  • max_tokens (integer): Optional. Maximum tokens to generate
  • frequency_penalty (float): Optional. Penalize frequent tokens
  • presence_penalty (float): Optional. Penalize tokens based on presence
  • stop (string/array): Optional. Stop sequences
  • tools (array): Optional. List of available tools/functions
  • tool_choice (string/object): Optional. Control tool/function calling
  • response_format (object): Optional. Response format specification
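
These map directly onto keyword arguments of litellm.completion. A short sketch combining several of them; the model name is a placeholder, and which sampling parameters a given deployment honors depends on its template:

Completion with sampling parameters
import os
from litellm import completion

os.environ["CHUTES_API_KEY"] = ""  # your Chutes API key

response = completion(
    model="chutes/model-name",  # placeholder; replace with an actual model name
    messages=[{"role": "user", "content": "Explain nucleus sampling in one sentence."}],
    temperature=0.7,   # sampling temperature
    top_p=0.9,         # nucleus sampling cutoff
    max_tokens=128,    # cap on generated tokens
    stop=["\n\n"],     # stop generation at the first blank line
)

print(response.choices[0].message.content)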

Supported Frameworks

Chutes provides optimized templates for popular AI frameworks:

vLLM (High-Performance LLM Serving)

  • OpenAI-compatible endpoints
  • Multi-GPU scaling support
  • Advanced optimization settings
  • Best for production workloads

SGLang (Advanced LLM Serving)

  • Structured generation capabilities (see the sketch after this list)
  • Advanced features and controls
  • Custom configuration options
  • Best for complex use cases
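
Structured outputs are requested through the standard response_format parameter. A minimal sketch; the model name is a placeholder, and JSON-mode support depends on the deployed template:

JSON-mode completion
import json
import os

from litellm import completion

os.environ["CHUTES_API_KEY"] = ""  # your Chutes API key

response = completion(
    model="chutes/model-name",  # placeholder; replace with an actual model name
    messages=[{"role": "user", "content": "List three AI frameworks as a JSON object."}],
    response_format={"type": "json_object"},  # JSON mode, if the deployment supports it
)

print(json.loads(response.choices[0].message.content))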

Diffusion Models (Image Generation)

  • Pre-configured image generation templates
  • Optimized settings for best results
  • Support for popular diffusion models

Embedding Models

  • Text embedding templates
  • Vector search optimization
  • Support for popular embedding models (see the sketch below)
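
Embedding calls go through LiteLLM's standard embedding entry point with the same chutes/ route. A minimal sketch; the model name is a placeholder:

Chutes embedding call
import os
from litellm import embedding

os.environ["CHUTES_API_KEY"] = ""  # your Chutes API key

response = embedding(
    model="chutes/embedding-model-name",  # placeholder; replace with an actual embedding model
    input=["Hello from Chutes", "Vector search needs embeddings"],
)

print(len(response.data), "embeddings returned")
print(response.data[0]["embedding"][:5])  # first five dimensions of the first vector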

Authentication

Chutes supports multiple authentication methods:

  • API Key via X-API-Key header
  • Bearer token via Authorization header

Example for LiteLLM (uses environment variable):

os.environ["CHUTES_API_KEY"] = "your-api-key"

Performance Optimization

Chutes offers hardware selection and optimization:

  • Small Models (7B-13B): 1 GPU with 24GB VRAM
  • Medium Models (30B-70B): 4 GPUs with 80GB VRAM each
  • Large Models (100B+): 8 GPUs with 140GB+ VRAM each

Engine optimization parameters are available for fine-tuning performance.

Deployment Options

Chutes provides flexible deployment options:

  • Quick Setup: Use pre-built templates for instant deployment
  • Custom Images: Deploy with custom Docker images
  • Scaling: Configure max instances and auto-scaling thresholds
  • Hardware: Choose specific GPU types and configurations

Additional Resources