Gemini 3.1 Flash-Lite
Stable Gemini 3.1 Flash-Lite model for high-volume agentic tasks, translation, and simple data processing. 1M-token context window, 65k max output.
About Gemini 3.1 Flash-Lite
When throughput and cost per token drive the decision, Gemini 3.1 Flash-Lite is built for that constraint. Released in March 2026, it generates roughly 363 tokens per second, fast enough to handle classification, labeling, translation, and tool-calling pipelines at production scale without the budget pressure of heavier models. Developers who have tested it consistently describe it as the most cost-effective Gemini model available and call out its quality step-up over earlier Flash-Lite generations as a surprise at this price tier. The adjustable thinking levels (minimal, low, medium, and high) let you tune reasoning depth per request, which is useful for routing cost-sensitive workloads. The caveat: time-to-first-token is 5.27 seconds, well above the cross-model median of 2.09 seconds, so it is not the right choice for real-time conversational interfaces where perceived responsiveness matters. For batch, background, and high-volume tasks where first-token latency is invisible to users, it offers a hard-to-beat combination of speed, multimodal coverage, and a 1M-token context window.
Best for
- High-volume translation of chat messages, support tickets, and user reviews
- Content classification, intent detection, and customer support ticket routing
- Multimodal labeling at scale: image annotation, video scene classification, and audio transcription
- Code exploration, API documentation retrieval, and lightweight tool-calling pipelines
- Production inference workloads where cost per token is a primary constraint and batch latency is acceptable
Specs & capabilities
How Gemini 3.1 Flash-Lite stacks up — intelligence, speed, context, and modalities.
Intelligence
Medium
Speed
Fast
Context window
1,048,576 tokens
Max output
65,536 tokens
Knowledge cutoff
January 2025
Frequently asked questions
How much does Gemini 3.1 Flash-Lite cost?
Input is $0.25 per 1M tokens and output is $1.50 per 1M tokens. With caching (90% discount on cache hits), the blended rate drops to roughly $0.22 per 1M tokens for workloads with repeated context.
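As a rough illustration of how these prices combine, the sketch below estimates the cost of a workload in Python. The per-token prices and the 90% cache discount come from the answer above; the `cache_hit_fraction` parameter is a hypothetical workload property you would have to measure yourself, and the sketch assumes the discount applies to cached input tokens only.

```python
# Prices quoted above, in USD per 1M tokens.
INPUT_PRICE_PER_M = 0.25
OUTPUT_PRICE_PER_M = 1.50
# Cache hits get a 90% discount (assumed here to apply to input tokens only).
CACHE_DISCOUNT = 0.90


def estimate_cost(input_tokens: int, output_tokens: int,
                  cache_hit_fraction: float = 0.0) -> float:
    """Return an estimated cost in USD for one workload.

    cache_hit_fraction is the share of input tokens served from cache.
    It is an assumption about the workload, not a published figure.
    """
    if not 0.0 <= cache_hit_fraction <= 1.0:
        raise ValueError("cache_hit_fraction must be between 0 and 1")
    cached = input_tokens * cache_hit_fraction
    uncached = input_tokens - cached
    # Cached tokens are billed at (1 - discount) of the normal input price.
    billable_input = uncached + cached * (1.0 - CACHE_DISCOUNT)
    input_cost = billable_input * INPUT_PRICE_PER_M / 1_000_000
    output_cost = output_tokens * OUTPUT_PRICE_PER_M / 1_000_000
    return input_cost + output_cost


if __name__ == "__main__":
    print(f"no cache:  ${estimate_cost(10_000_000, 1_000_000):.3f}")
    print(f"50% cache: ${estimate_cost(10_000_000, 1_000_000, 0.5):.3f}")
```

Under these assumptions, 10M input tokens plus 1M output tokens cost about $4.00 with no caching and about $2.88 when half of the input is served from cache.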
What is the context window?
1,048,576 tokens (approximately 1 million), with a maximum output of 65,536 tokens (64K) per response.
What input types does it support?
Text, images, audio, and video. Output is text only.
Is it suitable for real-time chat applications?
Not ideally. Time-to-first-token averages 5.27 seconds, which is notably higher than the cross-model median of 2.09 seconds. It is better suited to batch and background workflows where that initial delay is not user-facing.
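To see why the first-token delay matters more for short replies than for long batch outputs, the following Python sketch adds the quoted time-to-first-token to the generation time implied by the roughly 363 tokens per second quoted on this page. It is a simplified model that assumes constant throughput and ignores network and queueing effects; the function names are illustrative.

```python
TTFT_SECONDS = 5.27        # time-to-first-token quoted above
TOKENS_PER_SECOND = 363.0  # approximate output speed quoted on this page


def estimated_response_seconds(output_tokens: int) -> float:
    """Estimate total wall-clock time for one response (simplified model)."""
    return TTFT_SECONDS + output_tokens / TOKENS_PER_SECOND


def ttft_share(output_tokens: int) -> float:
    """Fraction of the estimated total time spent waiting for the first token."""
    return TTFT_SECONDS / estimated_response_seconds(output_tokens)


if __name__ == "__main__":
    for n in (100, 1_000, 10_000):
        total = estimated_response_seconds(n)
        print(f"{n:>6} tokens: {total:6.2f} s total, "
              f"{ttft_share(n):.0%} spent before the first token")
```

In this model a short chat-style reply spends most of its time waiting for the first token, while a long batch output is dominated by generation time, which is where the model's throughput pays off.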
How does it compare to Gemini 2.5 Flash?
It matches Gemini 2.5 Flash on many common tasks at a lower cost, but gives up some raw reasoning capability. Its AIME 2025 score of 16.7% indicates it is not suited to advanced mathematics or frontier reasoning; use a larger model for those cases.
What makes the reasoning feature different from a simple 'thinking mode'?
Rather than a binary on/off thinking toggle, it offers four granular levels — minimal, low, medium, and high — so you can tune reasoning depth and cost independently per request.
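One way to use per-request levels is a small routing table that picks a level by task type. The Python sketch below is illustrative only: the four level names come from the answer above, but the task categories, the mapping, and the function name are hypothetical, and it deliberately does not show how the chosen level is passed to the API, since that depends on the SDK you use.

```python
# The four levels named above, ordered from least to most reasoning.
THINKING_LEVELS = ("minimal", "low", "medium", "high")

# Hypothetical mapping from task type to level; tune it for your own workload.
LEVEL_BY_TASK = {
    "classification": "minimal",
    "translation": "low",
    "tool_calling": "medium",
    "code_exploration": "high",
}


def pick_thinking_level(task_kind: str, default: str = "low") -> str:
    """Return the thinking level to request for a given task type."""
    level = LEVEL_BY_TASK.get(task_kind, default)
    if level not in THINKING_LEVELS:
        raise ValueError(f"unknown thinking level: {level!r}")
    return level


if __name__ == "__main__":
    for kind in ("classification", "code_exploration", "summarization"):
        print(kind, "->", pick_thinking_level(kind))
```

Keeping the mapping in one place makes it easy to lower the level for cheap, high-volume tasks and raise it only where extra reasoning improves results.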