Model page

GPT OSS 120B Fast

OpenAI's GPT OSS 120B routed through Cerebras chat completions for very fast tool-capable replies. 131k context. Does not support web search or image input.

About GPT OSS 120B Fast

An open-weight 117B-parameter Mixture-of-Experts model from OpenAI, GPT OSS 120B Fast activates only 5.1B parameters per forward pass — making it one of the few frontier-class models that runs on a single H100 GPU or even consumer hardware like a MacBook. This efficiency is the core draw: developers building RAG pipelines and multi-tool agent systems report strong real-world results, and its 62.4% SWE-bench Verified score and 80.9% GPQA put it near o4-mini territory on coding and PhD-level reasoning at a fraction of the cost. The "Fast" variant uses a lower reasoning effort setting, prioritizing throughput (344 tokens/sec on standard deployments, 1,800+ on Cerebras) over depth. The Apache 2.0 license removes enterprise friction for self-hosting. One genuine caveat: the declared June 2024 knowledge cutoff appears optimistic — empirical testing puts practical knowledge closer to September 2023, and the model will still attempt answers past that boundary, raising hallucination risk for recent-events queries. Text only; no image or audio inputs.

Best for

  • Agentic workflows and multi-tool automation requiring reliable function calling and chain-of-thought reasoning
  • Code generation, review, and software engineering tasks where SWE-bench performance matters
  • RAG and document analysis over large corpora, taking advantage of the 128K context window
  • Privacy-sensitive or offline-first deployments where local inference on consumer hardware is a requirement
  • Cost-sensitive production workloads benefiting from the Apache 2.0 license and sub-$0.20/1M-token input pricing

Specs & capabilities

How GPT OSS 120B Fast stacks up — intelligence, speed, context, and modalities.

Capability

Intelligence

Medium

Capability

Speed

Fast

Capability

Context window

131,072 tokens

Capability

Max output

20,000 tokens

Capability

Knowledge cutoff

June 2024

Modalities

Input and output

Input: Text
Output: Text

Features

Availability notes

Function calling supported · Cerebras-hosted fast path on just4o.chat · 1 premium request per send

Frequently asked questions

What does 'Fast' mean for this model?

It refers to running GPT OSS 120B at the low reasoning effort setting, which prioritizes speed and lower latency over deeper chain-of-thought processing. The underlying model supports three reasoning levels (low, medium, high) configurable via system prompt.

How much does it cost?

Pricing varies significantly across providers. OpenRouter lists it as low as $0.039 input / $0.18 output per 1M tokens, while the median across 21 providers is roughly $0.15 input / $0.69 output per 1M tokens — up to 7x difference depending on where you route traffic.

What is the context window?

128K tokens on most providers (including AWS Bedrock), and up to 131,072 tokens via the OpenAI API directly. Max output tokens also vary: Bedrock caps at 16K, while native deployments support up to 131K.

What are its real weaknesses?

Two stand out: the knowledge cutoff is effectively September 2023 in practice despite an official June 2024 claim, increasing hallucination risk on recent events; and the model tends toward verbose output, which raises inference costs and latency in high-volume settings.

Can I run it locally?

Yes. With MXFP4 quantization it fits on a single 80GB H100, and community GGUF quantizations allow running on consumer GPUs and MacBooks, though throughput will be lower than cloud deployments.

How does it compare to GPT OSS 20B?

GPT OSS 20B (21B total parameters, 3.6B active) is the smaller sibling optimized for even lower latency. The 120B variant offers meaningfully stronger reasoning and coding performance at higher compute cost.

Related models