# Cerebras LLM
`CerebrasLLM` plugs Cerebras's OpenAI-compatible Inference API at https://api.cerebras.ai/v1 into Patter's pipeline mode. It is a thin wrapper around the OpenAI Chat Completions client with a Cerebras-specific base URL and optional gzip payload compression (enabled by default) for faster time-to-first-token (TTFT) on large prompts.
## Why Cerebras for voice
Cerebras runs inference on the WSE-3 wafer-scale chip, which serves the default model `gpt-oss-120b` at ~3000 tok/sec. Patter's downstream TTS consumption rate is ~150-300 tok/sec, so any model on Cerebras saturates the pipeline regardless of weight size: you can ship a 120B model with no realtime latency penalty over an 8B model. Picking the larger one buys higher answer quality for free.
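The headroom claim above is simple arithmetic on the figures quoted in this section:

```typescript
// Throughput figures quoted above.
const cerebrasTokPerSec = 3000;      // gpt-oss-120b on WSE-3
const ttsPeakTokPerSec = 300;        // upper end of Patter's TTS consumption rate

// How many times faster the LLM produces tokens than TTS can consume them.
const headroom = cerebrasTokPerSec / ttsPeakTokPerSec;
console.log(headroom); // 10: the LLM stays 10x ahead even at peak TTS demand
```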
## Install
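The package name below is an assumption taken from the import path shown under Usage; check your registry if it differs:

```shell
# Node / TypeScript
npm install getpatter

# Python
pip install getpatter
```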
## Usage
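A minimal sketch of constructing the provider. The exact constructor shape is an assumption; the option names come from the Options table below, and this snippet only builds the `LLM` instance, not the full pipeline:

```typescript
import * as cerebras from "getpatter/llm/cerebras";

// API key is auto-resolved from CEREBRAS_API_KEY when omitted.
const llm = new cerebras.LLM({
  model: "gpt-oss-120b",  // default; any tier-available Cerebras chat model id
  temperature: 0.7,       // forwarded to chat.completions.create
  gzipCompression: true,  // default: gzip request payloads for faster TTFT
});
```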
The namespaced import (`import * as cerebras from "getpatter/llm/cerebras"` in TypeScript, `from getpatter.llm import cerebras` in Python) auto-resolves the API key from `CEREBRAS_API_KEY` and exposes a uniform `LLM` class, the same pattern Patter uses for the STT and TTS namespaces.

## Supported models
Pricing in USD per 1M tokens. Availability is gated per tier: when a 404 `model_not_found` lands, the provider logs a recovery hint with override candidates and lets the call continue (voice pipelines treat LLM failures as recoverable).
| Model | Tier | Input | Output | Notes |
|---|---|---|---|---|
| `gpt-oss-120b` (default) | production | n/a | n/a | Highest throughput on WSE-3 (~3000 tok/sec). No deprecation date. |
| `llama3.1-8b` | production | n/a | n/a | Smaller-context alternative. Deprecating 2026-05-27. |
| `llama-3.3-70b` | paid | $0.85 | $1.20 | Listed in `LLM_PRICING`. |
| `qwen-3-32b` | paid | $0.40 | $0.80 | Listed in `LLM_PRICING`. |
| `qwen-3-235b-a22b-instruct-2507` | preview | n/a | n/a | Multilingual, strong on European languages. |
| `zai-glm-4.7` | preview | n/a | n/a | Preview model, opt-in. |
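As a worked example of turning the rate-card numbers above into a per-call cost (using the `llama-3.3-70b` row; the helper is illustrative, not part of Patter):

```typescript
// Per-1M-token USD rates from the table above.
const RATES = { "llama-3.3-70b": { input: 0.85, output: 1.2 } };

// Cost in USD for one call, given token counts from the usage object.
function callCostUsd(model: keyof typeof RATES, inputTokens: number, outputTokens: number): number {
  const r = RATES[model];
  return (inputTokens * r.input + outputTokens * r.output) / 1_000_000;
}

// 1200 input + 400 output tokens: (1200 * 0.85 + 400 * 1.20) / 1e6 = 0.0015 USD
console.log(callCostUsd("llama-3.3-70b", 1200, 400));
```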
`gpt-oss-120b` and `llama3.1-8b` are billed under Cerebras's tier plans rather than per-token rate cards, so they are not present in `LLM_PRICING`; pass `pricing={...}` overrides if your dashboard needs cost figures for them.

## Environment variables
| Variable | Required | Notes |
|---|---|---|
| `CEREBRAS_API_KEY` | yes | Auto-loaded when `apiKey` / `api_key` is omitted. |
## Options
| Option | Default | Notes |
|---|---|---|
| `apiKey` / `api_key` | undefined | Reads from `CEREBRAS_API_KEY` when omitted. |
| `model` | `"gpt-oss-120b"` | Any Cerebras chat model id. Use `GET /v1/models` to discover tier-available ids. |
| `baseUrl` / `base_url` | `https://api.cerebras.ai/v1` | Override the Cerebras endpoint (rarely needed). |
| `gzipCompression` / `gzip_compression` | `true` | Gzip request payloads for faster TTFT on large prompts. |
| `temperature`, `maxTokens`, `topP`, `seed`, `frequencyPenalty`, `presencePenalty`, `stop`, `responseFormat`, `parallelToolCalls`, `toolChoice` | unset | All forwarded to `chat.completions.create`. See the Cerebras docs for accepted values. |
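To see which model ids your tier can actually use, the standard OpenAI-compatible models endpoint can be queried directly (requires a valid `CEREBRAS_API_KEY`; the `jq` filter is just one way to list the ids):

```shell
curl -s https://api.cerebras.ai/v1/models \
  -H "Authorization: Bearer $CEREBRAS_API_KEY" | jq -r '.data[].id'
```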
## Payload compression
Cerebras supports gzip (TypeScript) and msgpack + gzip (Python) request bodies; see the Cerebras payload-optimization docs. Patter enables compression by default, which reduces wire size on large prompts and shaves time-to-first-token. Disable per request by passing `gzipCompression: false` (`gzip_compression=False` in Python).
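The wire-size effect is easy to observe locally. This standalone sketch gzips a JSON chat payload the way a client would before sending it with `Content-Encoding: gzip`; it only measures sizes and does not call Cerebras:

```typescript
import { gzipSync } from "node:zlib";

// A large, repetitive prompt: the case where compression helps TTFT most.
const body = JSON.stringify({
  model: "gpt-oss-120b",
  messages: [{ role: "user", content: "transcript line\n".repeat(2000) }],
});

const compressed = gzipSync(Buffer.from(body));
console.log(body.length, compressed.length); // compressed is a small fraction of the original
```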
## Error handling
When `chat.completions.create` returns a 404 `model_not_found`, the provider logs an ERROR-level recovery hint including the upstream message and lists override candidates (`llama3.1-8b`, `qwen-3-235b-a22b-instruct-2507`, and `llama-3.3-70b` on the paid tier), then returns silently. Voice pipelines treat LLM failures as recoverable, so the call continues and the user simply hears no LLM response for that turn. All other errors raise `PatterError` after one retry with exponential backoff and honour the `x-ratelimit-reset-*` advisory headers.
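A sketch of the retry behaviour described above: one retry, exponential backoff, and an advisory reset hint taking precedence when the server supplies one. The delay arithmetic is illustrative, not Patter's exact implementation:

```typescript
// Retry a failing call once, waiting either the server's advisory reset hint
// (from an x-ratelimit-reset-* style header, in seconds) or a backoff delay.
async function withOneRetry<T>(
  call: () => Promise<T>,
  resetHintSeconds?: number,
  baseDelayMs = 500,
): Promise<T> {
  try {
    return await call();
  } catch {
    const delayMs = resetHintSeconds !== undefined
      ? resetHintSeconds * 1000 // honour the server's advisory header
      : baseDelayMs * 2;        // exponential step after the first failure
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    return await call(); // a second failure propagates (e.g. as PatterError)
  }
}
```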
