OpenAIBase plan

GPT audio mini

Cost-efficient audio-native chat model. Supports text + audio output in chat completions.

You always get the exact model you pick — we never silently route you to another.

About GPT audio mini

GPT audio mini takes a different path from most language models by working directly with sound. It accepts audio and text as inputs and returns both audio and text outputs, processing speech natively rather than routing through intermediate text conversions — a design choice that cuts latency sharply for voice-driven applications like call center automation and transcription pipelines. Where it stands out in practice: transcription accuracy is substantially better than older systems, with a reported 35% lower word error rate and 89% fewer hallucinations during transcription tasks. At $0.60 per million input tokens, it undercuts comparable TTS alternatives significantly, which is why developers building scaled voice products find it compelling for high-volume workloads. That said, GPT audio mini is not without rough edges. Early adopters have run into audio artifacts in generated speech and intermittent API instability despite its General Availability label — reviewers describe it more as a capable beta than a polished production release. Its knowledge cutoff of October 2023 also means it is not suited for tasks requiring current information. For teams building voice assistants, automated transcription, or spoken-word interfaces on a budget, it earns its place. For tasks requiring image understanding or real-time duplex audio, look elsewhere.

Best for

  • Voice agent and call center automation requiring low-latency, cost-efficient speech processing
  • High-accuracy speech-to-text transcription for meetings, interviews, and voice notes
  • Cost-effective text-to-speech for chatbots, accessibility tools, and interactive voice interfaces
  • Batch audio processing workflows via the Chat Completions API where real-time streaming is not required
  • Short-turn conversational dialogue systems with function-calling integration

Specifications

ProviderOpenAI
Released2025-12
Context window128,000 tokens
Max output16,384 tokens
Knowledge cutoffOctober 1, 2023
Input price$0.60 / 1M tokens
Output price$2.40 / 1M tokens
Request cost3 base requests
Plan tierBase
Model IDgpt-audio-mini

Frequently asked questions

Input is priced at $0.60 per million tokens and output at $2.40 per million tokens. Note that audio tokens carry roughly a 6.4x premium over standard text tokens in OpenAI's pricing model, so factor that in for high-volume audio workloads.

128,000 tokens with a maximum of 16,384 output tokens per request.

It accepts audio and text as inputs and produces both audio and text as outputs. It does not support image, video, or structured output formats, and does not offer fine-tuning.

It is optimized for low-latency back-and-forth voice exchanges and works with the Chat Completions and Responses APIs, but it does not support full-duplex audio (simultaneous listening and speaking) and cannot be used with the Realtime API streaming endpoint.

October 1, 2023 — meaning it has no awareness of events or developments after that date. This limits its usefulness for tasks requiring current information.

GPT audio mini is the cost-efficient variant, trading some capability for significantly lower pricing. It targets scaled voice applications and routine transcription rather than complex or nuanced audio reasoning tasks.

Related models