Groq LLM

GroqLLM plugs Groq’s OpenAI-compatible Chat Completions API at https://api.groq.com/openai/v1 into Patter’s pipeline mode. Groq’s LPU inference engine serves Llama models at very high throughput with low time-to-first-token, making it a strong pick when latency matters more than long-context reasoning. The provider is a thin wrapper around OpenAILLMProvider with a Groq-specific base URL — every OpenAI sampling kwarg (response_format, parallel_tool_calls, tool_choice, seed, top_p, frequency_penalty, presence_penalty, stop, temperature, max_tokens) is forwarded to chat.completions.create automatically.
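
Because the endpoint is OpenAI-compatible, you can sanity-check the exact wire format GroqLLM emits by pointing the stock openai client at Groq's base URL. A minimal sketch, not Patter-specific; it assumes GROQ_API_KEY is set and uses only the standard openai-python API:

import os
from openai import OpenAI

# Talk to Groq through the stock OpenAI client: same wire format,
# different base URL. Any sampling kwarg accepted here is exactly
# what GroqLLM forwards on your behalf.
client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)
resp = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "One-line greeting."}],
    temperature=0.2,
    seed=42,
)
print(resp.choices[0].message.content)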

Install

pip install "getpatter[groq]"
npm install getpatter

Usage

# Namespaced import
from getpatter.llm import groq

llm = groq.LLM()                                            # reads GROQ_API_KEY
llm = groq.LLM(api_key="gsk_...", model="llama-3.3-70b-versatile")
llm = groq.LLM(
    model="llama-3.3-70b-versatile",
    response_format={"type": "json_object"},                # OpenAI-style structured outputs
    seed=42,
)

# Flat alias (equivalent)
from getpatter import GroqLLM

llm = GroqLLM()

The namespaced import (from getpatter.llm import groq / import * as groq from "getpatter/llm/groq") auto-resolves the API key from GROQ_API_KEY and exposes a uniform LLM class — the same pattern Patter uses for the STT and TTS namespaces.

Plug it into an agent:

import asyncio
from getpatter import Patter, Twilio, DeepgramSTT, GroqLLM, ElevenLabsTTS

phone = Patter(carrier=Twilio(), phone_number="+15550001234")

agent = phone.agent(
    stt=DeepgramSTT(),
    llm=GroqLLM(),                                          # GROQ_API_KEY from env
    tts=ElevenLabsTTS(voice_id="rachel"),
    system_prompt="You are a helpful assistant.",
    first_message="Hi, how can I help?",
)

asyncio.run(phone.serve(agent))

Supported models

Pricing in USD per 1M tokens. Availability depends on account tier — Groq’s free tier rate-limits more aggressively than the paid plans.
| Model | Input | Output | Notes |
| --- | --- | --- | --- |
| llama-3.3-70b-versatile (default) | $0.59 | $0.79 | General-purpose Llama 3.3, long context. |
| llama-3.1-8b-instant | $0.05 | $0.08 | Cheapest fast option. |
| llama-3.3-70b-specdec | n/a | n/a | Speculative decoding variant. |
| llama3-70b-8192 | n/a | n/a | Llama 3, 8K context. |
| llama3-8b-8192 | n/a | n/a | Llama 3, 8K context. |
| mixtral-8x7b-32768 | n/a | n/a | Mixtral MoE, 32K context. |
| gemma2-9b-it | n/a | n/a | Google Gemma 2 instruct. |
Models without listed rates are available on the API but aren't yet pinned to an LLM_PRICING entry — pass pricing={...} overrides if your dashboard needs cost figures for them (see the sketch below).
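
A minimal sketch of such an override. The exact shape of the pricing mapping is an assumption (per-1M-token USD rates keyed by input/output, mirroring the table above), and the figures are placeholders, not Groq's published rates:

from getpatter import GroqLLM

# Hypothetical pricing override so dashboards can attribute cost to a
# model with no built-in LLM_PRICING entry. Both the {"input", "output"}
# shape and the rates are illustrative placeholders.
llm = GroqLLM(
    model="mixtral-8x7b-32768",
    pricing={"input": 0.24, "output": 0.24},  # USD per 1M tokens, example values
)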

Environment variables

| Variable | Required | Notes |
| --- | --- | --- |
| GROQ_API_KEY | yes | Auto-loaded when api_key / apiKey is omitted. |

Options

| Option | Default | Notes |
| --- | --- | --- |
| api_key / apiKey | None | Reads from GROQ_API_KEY when omitted. |
| model | "llama-3.3-70b-versatile" | Any Groq chat model id. |
| base_url / baseUrl | https://api.groq.com/openai/v1 | Override the Groq endpoint (rarely needed). |
| temperature, max_tokens, top_p, seed, frequency_penalty, presence_penalty, stop, response_format, parallel_tool_calls, tool_choice | unset | All forwarded to chat.completions.create. See the Groq API docs for accepted values. |
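
For instance, a latency-leaning configuration that passes a few of the forwarded kwargs; every name below comes straight from the table above, and the values are illustrative:

from getpatter.llm import groq

# Low-latency setup: small model, short completions, mild sampling.
# Each kwarg is forwarded verbatim to chat.completions.create.
llm = groq.LLM(
    model="llama-3.1-8b-instant",
    temperature=0.3,
    max_tokens=256,
    stop=["\nUser:"],
)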

Notes

  • Groq returns the standard OpenAI Chat Completions stream shape, so tool calls, JSON mode, and seeded sampling all work without provider-specific code.
  • Time-to-first-token on Groq’s LPU is typically < 200 ms for the 70B model and < 100 ms for the 8B model — well below most TTS startup latency. A quick way to measure it yourself is sketched below.
  • For long-context calls (32K+ tokens), pick mixtral-8x7b-32768; everything else fits comfortably in the Llama 3.3 context window.
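
A quick way to check time-to-first-token for yourself, using the stock openai client in streaming mode against Groq's base URL. Nothing here is Patter-specific; it assumes only that GROQ_API_KEY is set:

import os
import time
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",
    api_key=os.environ["GROQ_API_KEY"],
)

start = time.perf_counter()
stream = client.chat.completions.create(
    model="llama-3.1-8b-instant",
    messages=[{"role": "user", "content": "Say hello in one word."}],
    stream=True,
)
for chunk in stream:
    # The first chunk carrying content marks time-to-first-token.
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {(time.perf_counter() - start) * 1000:.0f} ms")
        break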