# Groq LLM
GroqLLM plugs Groq’s OpenAI-compatible Chat Completions API at https://api.groq.com/openai/v1 into Patter’s pipeline mode. Groq’s LPU inference engine serves Llama models at very high throughput with low time-to-first-token, making it a strong pick when latency matters more than long-context reasoning.
The provider is a thin wrapper around OpenAILLMProvider with a Groq-specific base URL — every OpenAI sampling kwarg (response_format, parallel_tool_calls, tool_choice, seed, top_p, frequency_penalty, presence_penalty, stop, temperature, max_tokens) is forwarded to chat.completions.create automatically.
## Install
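The package name isn't shown on this page; assuming the library ships as `getpatter` on PyPI and npm (matching the import paths below), installation would look like:

```bash
# Assumed package names -- adjust if the published names differ.
pip install getpatter        # Python
npm install getpatter        # TypeScript / JavaScript
```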
## Usage
The namespaced import (`from getpatter.llm import groq` in Python, `import * as groq from "getpatter/llm/groq"` in TypeScript) auto-resolves the API key from `GROQ_API_KEY` and exposes a uniform `LLM` class, the same pattern Patter uses for the STT and TTS namespaces.
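A minimal sketch of constructing the provider, assuming the uniform `LLM` class described above (constructor kwargs mirror the Options table below; exact names may differ by Patter version):

```python
import os

from getpatter.llm import groq

# The provider auto-resolves the key from GROQ_API_KEY when
# api_key is omitted, so only the environment needs configuring.
assert "GROQ_API_KEY" in os.environ

# llama-3.3-70b-versatile is the default, so model= is optional here.
llm = groq.LLM(model="llama-3.3-70b-versatile", temperature=0.3)
```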
## Supported models

Pricing is in USD per 1M tokens. Availability depends on account tier; Groq's free tier rate-limits more aggressively than the paid plans.

| Model | Input | Output | Notes |
|---|---|---|---|
| `llama-3.3-70b-versatile` (default) | $0.59 | $0.79 | General-purpose Llama 3.3, long context. |
| `llama-3.1-8b-instant` | $0.05 | $0.08 | Cheapest fast option. |
| `llama-3.3-70b-specdec` | n/a | n/a | Speculative decoding variant. |
| `llama3-70b-8192` | n/a | n/a | Llama 3, 8K context. |
| `llama3-8b-8192` | n/a | n/a | Llama 3, 8K context. |
| `mixtral-8x7b-32768` | n/a | n/a | Mixtral MoE, 32K context. |
| `gemma2-9b-it` | n/a | n/a | Google Gemma 2 instruct. |
Models listed as n/a have no built-in `LLM_PRICING` entry; pass `pricing={...}` overrides if your dashboard needs cost figures for them.
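A hedged sketch of such an override; the numbers below are placeholders rather than Groq's actual rates, and the `pricing` dict shape is assumed from the pricing columns above:

```python
from getpatter.llm import groq

# Placeholder per-1M-token rates -- substitute the figures from your
# Groq dashboard. The {"input": ..., "output": ...} shape is an
# assumption based on the Input/Output columns in the table above.
llm = groq.LLM(
    model="mixtral-8x7b-32768",
    pricing={"input": 0.24, "output": 0.24},
)
```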
## Environment variables
| Variable | Required | Notes |
|---|---|---|
| `GROQ_API_KEY` | yes | Auto-loaded when `api_key` / `apiKey` is omitted. |
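For local development, export the key before starting your app (placeholder value shown):

```bash
export GROQ_API_KEY="gsk_..."
```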
## Options
| Option | Default | Notes |
|---|---|---|
| `api_key` / `apiKey` | None | Reads from `GROQ_API_KEY` when omitted. |
| `model` | `"llama-3.3-70b-versatile"` | Any Groq chat model id. |
| `base_url` / `baseUrl` | `https://api.groq.com/openai/v1` | Override the Groq endpoint (rarely needed). |
| `temperature`, `max_tokens`, `top_p`, `seed`, `frequency_penalty`, `presence_penalty`, `stop`, `response_format`, `parallel_tool_calls`, `tool_choice` | unset | All forwarded to `chat.completions.create`. See the Groq API docs for accepted values. |
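A hedged sketch combining several of these kwargs; constructor-level pass-through is assumed from the forwarding behavior described above:

```python
from getpatter.llm import groq

# Every kwarg below maps 1:1 onto chat.completions.create parameters;
# none of them are Patter-specific.
llm = groq.LLM(
    model="llama-3.1-8b-instant",
    temperature=0.2,
    max_tokens=512,
    seed=42,                # reproducible sampling, per the OpenAI spec
    stop=["\n\n"],          # cut generation at the first blank line
)
```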
## Notes
- Groq returns the standard OpenAI Chat Completions stream shape, so tool calls, JSON mode, and seeded sampling all work without provider-specific code (see the JSON-mode sketch after this list).
- Time-to-first-token on Groq’s LPU is typically < 200 ms for the 70B model and < 100 ms for the 8B model — well below most TTS startup latency.
- Long-context calls (32K+) use Mixtral; everything else fits comfortably in the Llama 3.3 context.
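For example, JSON mode uses the standard `response_format` parameter; a hedged sketch, with the `LLM` constructor shape assumed as above:

```python
from getpatter.llm import groq

# response_format is forwarded verbatim to chat.completions.create,
# so Groq's JSON mode needs no provider-specific handling. Per the
# OpenAI-compatible spec, the prompt should still mention JSON
# explicitly when json_object is set.
llm = groq.LLM(
    model="llama-3.3-70b-versatile",
    response_format={"type": "json_object"},
    seed=7,  # seeded sampling also passes straight through
)
```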

