GLM 4.7 Fast
Z.ai's GLM-4.7 routed through Cerebras chat completions for lower-latency coding and agentic work. 131k context. Does not support web search or image input. Cerebras currently lists it as a preview model.
About GLM 4.7 Fast
Coding-first and built lean by design, GLM 4.7 Fast activates only 3 billion of its 31.2 billion parameters per token — a Mixture-of-Experts trick that delivers 83 tokens per second while holding its own against much larger dense models on real software tasks. Its SWE-bench Verified score of 73.8% sits at the top of its weight class, and developers report that code tends to work on the first attempt rather than requiring rounds of correction. One practitioner described it as the best sub-70B model they had run for generative UI work, citing a complex animated frontend it produced cleanly. It is also priced accordingly: $0.06 per million input tokens makes it one of the more accessible options for sustained coding workloads. The trade-off is pure reasoning: on abstract logic puzzles and harder math, GLM 4.7 Fast falls behind frontier reasoning-focused models, and its outputs can run verbose, which gradually adds up in token costs. For teams that need fast, cost-effective code generation over extended context — up to 200K tokens — it fills that gap without the price of a flagship model.
Best for
- Code generation and bug fixing — produces working code on first attempt and handles repo-level refactoring with strong real-world benchmark results
- Agentic and tool-use workflows — maintains coherent multi-step reasoning across tool calls better than comparable open-source models at similar parameter counts
- Frontend and UI generation — particularly strong at creative code tasks including animated and interactive interface components
- Long-context document and codebase processing — 200K token context window supports large files, full codebases, and extended knowledge bases in a single pass
- Cost-sensitive production deployments — $0.06 per million input tokens makes it viable for high-volume applications where a frontier model would be prohibitively expensive
Specs & capabilities
How GLM 4.7 Fast stacks up — intelligence, speed, context, and modalities.
Intelligence
Medium
Speed
Medium
Context window
131,000 tokens
Max output
40,000 tokens
Knowledge cutoff
August 2024
Input and output
Input: Text
Output: Text
Availability notes
Function calling supported · 2 premium requests per send · Cerebras preview model
Frequently asked questions
How much does GLM 4.7 Fast cost?
$0.06 per million input tokens and $0.40 per million output tokens via Z.ai, making it one of the more affordable options for capable code-generation workloads.
What is its context window?
200,000 tokens (roughly 300 pages of text), with a maximum output of 128,000 tokens per response.
What is it best at?
Coding tasks are its clear strength — SWE-bench Verified at 73.8%, practical bug fixing and refactoring, agentic tool use, and frontend UI generation. It also handles bilingual English and Chinese use cases well.
Where does it fall short?
Pure abstract reasoning and complex math are its weaker areas compared to frontier reasoning-focused models. Outputs can also be verbose, which increases token costs over time.
How does it compare to the base GLM-4.7 model?
GLM 4.7 Fast (also called GLM-4.7-Flash) is the lighter, faster variant of the base GLM-4.7 released in December 2025. The base model offers stronger reasoning capability; this variant trades some of that depth for significantly faster inference and lower cost.
Who should choose GLM 4.7 Fast over a larger model?
Developers building coding assistants, agentic pipelines, or interactive applications where speed and cost matter more than peak reasoning depth. It is a strong fit for teams that need reliable first-pass code generation without paying flagship-model prices.