Question 1

What does 'Fast' mean for this model?

Accepted Answer

It refers to running GPT OSS 120B at the low reasoning effort setting, which prioritizes speed and lower latency over deeper chain-of-thought processing. The underlying model supports three reasoning levels (low, medium, high) configurable via system prompt.

Question 2

How much does it cost?

Accepted Answer

Pricing varies significantly across providers. OpenRouter lists it as low as $0.039 input / $0.18 output per 1M tokens, while the median across 21 providers is roughly $0.15 input / $0.69 output per 1M tokens — up to 7x difference depending on where you route traffic.

Question 3

What is the context window?

Accepted Answer

128K tokens on most providers (including AWS Bedrock), and up to 131,072 tokens via the OpenAI API directly. Max output tokens also vary: Bedrock caps at 16K, while native deployments support up to 131K.

Question 4

What are its real weaknesses?

Accepted Answer

Two stand out: the knowledge cutoff is effectively September 2023 in practice despite an official June 2024 claim, increasing hallucination risk on recent events; and the model tends toward verbose output, which raises inference costs and latency in high-volume settings.

Question 5

Can I run it locally?

Accepted Answer

Yes. With MXFP4 quantization it fits on a single 80GB H100, and community GGUF quantizations allow running on consumer GPUs and MacBooks, though throughput will be lower than cloud deployments.

Question 6

How does it compare to GPT OSS 20B?

Accepted Answer

GPT OSS 20B (21B total parameters, 3.6B active) is the smaller sibling optimized for even lower latency. The 120B variant offers meaningfully stronger reasoning and coding performance at higher compute cost.

GPT OSS 120B Fast

About GPT OSS 120B Fast

Best for

Specs & capabilities

Intelligence

Speed

Context window

Max output

Knowledge cutoff

Input and output

Availability notes

Frequently asked questions

Related models