DeepSeek V4 Flash
DeepSeek-V4-Flash via Fireworks: streamlined open-source MoE model optimized for fast, cost-efficient inference while preserving strong reasoning and coding performance at 1M context scale. Function calling supported. Uses 1 base request per send before length multipliers. Does not support web search or image input.
About DeepSeek V4 Flash
DeepSeek V4 Flash bets on a simple proposition: near-Pro coding performance at a fraction of the cost. Its 284B-parameter Mixture-of-Experts architecture keeps only 13B parameters active per token, which is how it hits 104.8 tokens per second output speed while pricing inputs at $0.14 per million tokens — blending to just $0.06 per million with caching. On coding benchmarks, Flash scores within 1.6 percentage points of V4 Pro on both LiveCodeBench and SWE-bench, making it the preferred choice for 80% of coding tasks in head-to-head comparisons. Users consistently praise the cost-to-performance ratio and the practical 1M token context window for handling full codebases without chunking. The honest caveat: Flash is flagged "benchmark maxed" by real-world testers, meaning it can stumble on complex multi-step logic — users report syntactically correct code with subtle algorithmic flaws that take several iterations to surface and fix. It is still in preview, so API behavior may shift. For teams running high-volume pipelines or building coding tools where speed and cost matter more than deep reasoning depth, it's a compelling option.
Best for
- High-volume coding assistants and IDE integrations where LiveCodeBench 91.6% and SWE-bench 79% performance matters at low per-token cost
- Real-time chatbots and customer support systems that need sustained throughput above 100 tokens per second
- Long-document workflows — full codebases, research papers, or technical specs that fit inside the 1M token context window
- Batch automation pipelines (content generation, classification, summarization) where per-request cost is a primary constraint
- Agentic tool-use workflows with native function calling and external API integration
Specs & capabilities
How DeepSeek V4 Flash stacks up — intelligence, speed, context, and modalities.
Intelligence
High
Speed
Medium
Context window
1,000,000 tokens
Max output
384,000 tokens
Knowledge cutoff
April 2026
Input and output
Input: Text
Output: Text
Availability notes
Cached input: $0.03 / 1M tokens · 1 base request per send before length multipliers · Function calling supported · Serverless through Fireworks
Frequently asked questions
What does DeepSeek V4 Flash cost?
Input tokens are $0.14 per million and output tokens are $0.28 per million. Cache hits drop to $0.003 per million input tokens — a 98% discount — bringing the blended rate to roughly $0.06 per million tokens at a typical usage mix.
How large is the context window?
One million tokens, roughly equivalent to 1,500 pages of standard text. Long-context recall scores 78.7% on MRCR 1M in Think Max mode.
How does Flash compare to DeepSeek V4 Pro?
Flash and Pro sit within 1.6 percentage points of each other on coding benchmarks. Flash is faster and cheaper; Pro is the better choice for tasks requiring deeper or more sustained reasoning.
What are the model's main weaknesses?
Real-world testers describe it as 'benchmark maxed' — strong on standard tests but prone to subtle logical flaws in complex algorithms. It also generates unusually verbose output, which can inflate costs on high-volume workloads despite the low per-token rate.
Is V4 Flash stable for production use?
It is explicitly labeled a preview release on the Hugging Face model card. API behavior and pricing are subject to change without notice, so production deployments should build in tolerance for breaking changes.
Can I self-host or fine-tune it?
Yes. V4 Flash is open-weights under an MIT license and available on Hugging Face, with no restrictions on commercial or research use.