Cerebras LLM

CerebrasLLM plugs Cerebras’s OpenAI-compatible Inference API at https://api.cerebras.ai/v1 into Patter’s pipeline mode. It is a thin wrapper around the OpenAI Chat Completions client with a Cerebras-specific base URL and optional gzip payload compression (enabled by default) for faster TTFT on large prompts.

Why Cerebras for voice

Cerebras runs inference on the WSE-3 wafer-scale chip, which serves the default model gpt-oss-120b at ~3000 tok/sec. Patter's downstream TTS consumes ~150-300 tok/sec, so any model on Cerebras outpaces the pipeline regardless of parameter count: you can ship a 120B model with no realtime latency penalty over an 8B model, and the larger one buys higher answer quality for free.
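The arithmetic above can be sanity-checked directly (the numbers are the figures quoted in this section, not measurements):

```typescript
// Figures quoted above: generation speed on WSE-3 vs. worst-case TTS intake.
const llmTokPerSec = 3000;   // gpt-oss-120b on Cerebras
const ttsTokPerSec = 300;    // upper end of Patter's TTS consumption rate
const headroom = llmTokPerSec / ttsTokPerSec;
console.log(`generation headroom: ${headroom}x`); // the LLM is never the bottleneck
```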

Install

npm install getpatter
pip install "getpatter[cerebras]"

Usage

// Namespaced import
import * as cerebras from "getpatter/llm/cerebras";

const llm = new cerebras.LLM();                             // reads CEREBRAS_API_KEY

// ...or pass the key and model explicitly:
const llmExplicit = new cerebras.LLM({ apiKey: "csk-...", model: "gpt-oss-120b" });

// ...or configure compression and structured outputs:
const llmConfigured = new cerebras.LLM({
  model: "gpt-oss-120b",
  gzipCompression: true,                                    // default
  responseFormat: { type: "json_object" },                  // OpenAI-style structured outputs
});

// Flat alias (equivalent)
import { CerebrasLLM } from "getpatter";

const llm2 = new CerebrasLLM();

The namespaced import (import * as cerebras from "getpatter/llm/cerebras" / from getpatter.llm import cerebras) auto-resolves the API key from CEREBRAS_API_KEY and exposes a uniform LLM class — the same pattern Patter uses for STT and TTS namespaces.
Plug it into an agent:
import { Patter, Twilio, DeepgramSTT, CerebrasLLM, ElevenLabsTTS } from "getpatter";

const phone = new Patter({ carrier: new Twilio(), phoneNumber: "+15550001234" });

const agent = phone.agent({
  stt: new DeepgramSTT(),
  llm: new CerebrasLLM(),                                   // CEREBRAS_API_KEY from env
  tts: new ElevenLabsTTS({ voiceId: "rachel" }),
  systemPrompt: "You are a helpful assistant.",
  firstMessage: "Hi, how can I help?",
});

await phone.serve(agent);

Supported models

Pricing in USD per 1M tokens. Availability is gated per tier: when a request hits a 404 model_not_found, the provider logs a recovery hint with override candidates and lets the call continue (voice pipelines treat LLM failures as recoverable).
| Model | Tier | Input | Output | Notes |
|---|---|---|---|---|
| gpt-oss-120b (default) | production | n/a | n/a | Highest throughput on WSE-3 (~3000 tok/sec). No deprecation date. |
| llama3.1-8b | production | n/a | n/a | Smaller-context alternative. Deprecating 2026-05-27. |
| llama-3.3-70b | paid | $0.85 | $1.20 | Listed in LLM_PRICING. |
| qwen-3-32b | paid | $0.40 | $0.80 | Listed in LLM_PRICING. |
| qwen-3-235b-a22b-instruct-2507 | preview | n/a | n/a | Multilingual, strong on European languages. |
| zai-glm-4.7 | preview | n/a | n/a | Preview model, opt-in. |
gpt-oss-120b and llama3.1-8b are billed under Cerebras’s tier plans rather than per-token rate cards, so they are not present in LLM_PRICING — pass pricing={...} overrides if your dashboard needs cost figures for them.
llama3.1-8b is deprecating 2026-05-27. Switch to gpt-oss-120b (the default), llama-3.3-70b (paid tier), or qwen-3-235b-a22b-instruct-2507 (preview) before that date.
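Since the tier-billed models carry no per-token rates, a dashboard that needs cost figures can supply its own. A sketch of the pricing override mentioned above; the exact key names here are hypothetical, so check your installed version:

```typescript
// Hypothetical pricing shape -- confirm the exact keys in your version.
const llm = new CerebrasLLM({
  model: "gpt-oss-120b",
  pricing: { inputPer1M: 0, outputPer1M: 0 }, // tier-billed, so $0 per-token
});
```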

Environment variables

| Variable | Required | Notes |
|---|---|---|
| CEREBRAS_API_KEY | yes | Auto-loaded when apiKey / api_key is omitted. |

Options

| Option | Default | Notes |
|---|---|---|
| apiKey / api_key | undefined | Reads from CEREBRAS_API_KEY when omitted. |
| model | "gpt-oss-120b" | Any Cerebras chat model id. Use GET /v1/models to discover tier-available IDs. |
| baseUrl / base_url | https://api.cerebras.ai/v1 | Override the Cerebras endpoint (rarely needed). |
| gzipCompression / gzip_compression | true | Gzip request payloads for faster TTFT on large prompts. |
| temperature, maxTokens, topP, seed, frequencyPenalty, presencePenalty, stop, responseFormat, parallelToolCalls, toolChoice | unset | All forwarded to chat.completions.create. See Cerebras docs for accepted values. |
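For example, the forwarded sampling options ride along in the constructor and are passed through unchanged to chat.completions.create (a usage sketch; the values are illustrative):

```typescript
const llm = new cerebras.LLM({
  model: "gpt-oss-120b",
  temperature: 0.3,          // forwarded verbatim to chat.completions.create
  maxTokens: 512,
  stop: ["\nUser:"],
  seed: 42,                  // reproducible sampling where supported
});
```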

Payload compression

Cerebras supports gzip (TypeScript) and msgpack + gzip (Python) request bodies — see Cerebras payload optimization. Patter enables compression by default, which reduces wire size on large prompts and shaves time-to-first-token. Disable per-request by passing gzipCompression: false.
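Chat payloads are repetitive JSON, which is why gzip pays off. A self-contained illustration using Node's zlib (not Patter's internal code path):

```typescript
import { gzipSync } from "node:zlib";

// A large, repetitive chat payload of the kind compression helps most with.
const messages = Array.from({ length: 50 }, (_, i) => ({
  role: i % 2 ? "assistant" : "user",
  content: "The quick brown fox jumps over the lazy dog. ".repeat(10),
}));
const body = JSON.stringify({ model: "gpt-oss-120b", messages });
const wire = gzipSync(body);

console.log(`raw: ${body.length} bytes, gzipped: ${wire.length} bytes`);
// Fewer bytes on the wire means the request lands sooner, which is the
// TTFT saving the paragraph above describes.
```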

Error handling

When chat.completions.create returns a 404 model_not_found, the provider logs an ERROR-level recovery hint including the upstream message and lists override candidates (llama3.1-8b, qwen-3-235b-a22b-instruct-2507, llama-3.3-70b on paid tier) — then returns silently. Voice pipelines treat LLM failures as recoverable so the call continues and the user simply hears no LLM response for that turn. All other errors raise PatterError after one retry with exponential backoff and honour x-ratelimit-reset-* advisory headers.
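The retry policy described above can be sketched as follows. This is an illustrative reimplementation, not Patter's actual code, and resetHintMs stands in for a value parsed from the x-ratelimit-reset-* headers:

```typescript
// One retry with exponential backoff, deferring to the server's advisory
// reset hint when one was provided; a second failure propagates to the
// caller (where Patter would surface it as PatterError).
async function withOneRetry<T>(
  call: () => Promise<T>,
  resetHintMs?: number,      // from x-ratelimit-reset-* (assumed, in ms)
  baseDelayMs = 500,
): Promise<T> {
  try {
    return await call();
  } catch {
    const delayMs = resetHintMs ?? baseDelayMs * 2; // exponential step
    await new Promise((resolve) => setTimeout(resolve, delayMs));
    return call();
  }
}
```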