StepFun Step 3.7 Flash: The $0.19 Coding Agent That Changes the Game
StepFun's Step 3.7 Flash is a direct threat to expensive proprietary coding agents. On SWE-Bench Verified with Advisor Mode, it achieves 76.3%, which is 97% of Claude Opus 4.6's 78.7%, at a per-task cost of $0.19 versus $1.76. That's an 89% cost reduction. For enterprises running thousands of coding tasks daily, a gap of $1.57 per task compounds quickly.
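The headline ratios can be checked with a few lines of arithmetic. The sketch below uses only the scores and per-task costs quoted above; the 1,000-tasks-per-day workload is a hypothetical figure chosen for illustration, not a number from StepFun.

```python
# Figures quoted in the article.
STEP_SCORE = 76.3   # SWE-Bench Verified with Advisor Mode, percent
OPUS_SCORE = 78.7   # Claude Opus 4.6, percent
STEP_COST = 0.19    # USD per task
OPUS_COST = 1.76    # USD per task


def relative_score(score: float, baseline: float) -> float:
    """Fraction of the baseline score that the model reaches."""
    return score / baseline


def cost_reduction(cost: float, baseline: float) -> float:
    """Fractional saving relative to the baseline cost."""
    return 1.0 - cost / baseline


def daily_savings(tasks_per_day: int) -> float:
    """USD saved per day if every task moves to the cheaper model."""
    return tasks_per_day * (OPUS_COST - STEP_COST)


if __name__ == "__main__":
    print(f"relative score: {relative_score(STEP_SCORE, OPUS_SCORE):.1%}")
    print(f"cost reduction: {cost_reduction(STEP_COST, OPUS_COST):.1%}")
    # Hypothetical workload of 1,000 tasks per day.
    print(f"daily savings:  ${daily_savings(1000):,.2f}")
```

This treats the per-task cost as constant across tasks, which real workloads will not match exactly.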
This 198B-parameter sparse Mixture-of-Experts (MoE) model activates only ~11B parameters per token, which supports high throughput (400 tokens/sec), and it offers a 256k context window. Released under Apache 2.0 on May 29, 2026, it adds native vision input, a first for StepFun's Flash series, and improves tool-use reliability. The implications for developer tooling, enterprise automation, and AI competition are profound.
Architecture: Efficiency by Design
Step 3.7 Flash pairs a 196B language backbone with a 1.8B Vision Transformer (ViT) encoder, which together account for the roughly 198B total. The MoE architecture routes each token to only a fraction of the experts, keeping inference compute close to that of an 11B dense model. Three reasoning depths (low, medium, high) let developers trade latency for depth. This design directly addresses the cost-performance tension that has limited agentic AI adoption.
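To make the sparse-activation idea concrete, here is a minimal sketch of top-k expert routing, one common MoE gating variant. The article does not describe StepFun's actual router, so the expert count, the value of k, and the scalar toy experts are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    top = ranked[:k]
    # Renormalize over the selected experts only (an assumed design choice).
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))


def moe_forward(x, experts, router_logits, k):
    """Weighted sum of the outputs of only the selected experts."""
    out = 0.0
    for idx, weight in route_top_k(router_logits, k):
        out += weight * experts[idx](x)  # unselected experts never run
    return out


if __name__ == "__main__":
    # Hypothetical toy setup: 4 scalar "experts", 2 active per token.
    experts = [lambda x, s=s: s * x for s in (10.0, 20.0, 30.0, 40.0)]
    logits = [0.0, 3.0, 1.0, 2.0]
    print(route_top_k(logits, k=2))
    print(moe_forward(1.0, experts, logits, k=2))
```

Because only k experts execute per token, compute scales with the active parameters rather than the total, which is the property the 11B-active figure describes.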
Coding Performance: Narrowing the Gap
On SWE-Bench Pro, Step 3.7 Flash scores 56.26% (up from 51.3% in 3.5 Flash). On Terminal-Bench 2.1, it hits 59.55% (up from 53.37%). More importantly, cross-harness variance on internal Step-SWE-Bench narrowed from a 43–73% range to 64.5–71.5%, shrinking the spread from 30 points to 7. This predictability is critical for production deployments where scaffold behavior varies.
Advisor Mode—StepFun's implementation of Anthropic's advisor strategy—lets the model run the agentic loop end-to-end, escalating to a larger model only at key inflection points. This keeps most tasks at executor cost, explaining the dramatic cost advantage.
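The escalation pattern described above can be sketched as a loop in which a cheap executor handles every step and a larger advisor is consulted only when a trigger fires. This is an illustrative sketch, not StepFun's or Anthropic's implementation: the function names, the trigger rule, and the toy stand-ins for both models are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RunStats:
    executor_calls: int = 0
    advisor_calls: int = 0
    transcript: List[str] = field(default_factory=list)


def run_with_advisor(task, executor, advisor, needs_advice, max_steps=20):
    """Run the agent loop, calling the advisor only when needs_advice says so."""
    stats = RunStats()
    advice = ""
    for step in range(max_steps):
        if needs_advice(step, stats.transcript):
            advice = advisor(task, stats.transcript)  # expensive, rare call
            stats.advisor_calls += 1
        action = executor(task, stats.transcript, advice)  # cheap, every step
        stats.executor_calls += 1
        stats.transcript.append(action)
        if action == "DONE":
            break
    return stats


# Hypothetical stand-ins for the two models and the escalation trigger.
def toy_executor(task, transcript, advice):
    return "DONE" if len(transcript) >= 4 else f"edit step {len(transcript) + 1}"


def toy_advisor(task, transcript):
    return "start by reproducing the failing test"


def advise_at_start(step, transcript):
    return step == 0


if __name__ == "__main__":
    stats = run_with_advisor("fix bug", toy_executor, toy_advisor, advise_at_start)
    print(stats.executor_calls, stats.advisor_calls)
```

In a real deployment the trigger would be the interesting design choice, for example escalating at planning time or after repeated failed attempts; the article does not say which inflection points StepFun uses.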
Multimodal and Tool Use: Emergent Capabilities
Step 3.7 Flash supports Visual Search and Python Tool pathways. On SimpleVQA (with Search), it scores 79.16%, comparable to GPT 5.5 (79.11%). On Android Daily (phone UI tasks), it scores 61.87%, ahead of Kimi K2.6 (53.36%) but behind Gemini 3 Flash (63.21%). StepFun reports emergent compositional tool use—the model combined visual and non-visual tools without explicit training. This suggests a path toward more autonomous, multi-step reasoning.
Search and Research: Integrated Reasoning
StepFun integrated search into the model's reasoning loop, focusing on planning, evidence filtering, and synthesis. Results are strong: DeepSearchQA F1 of 92.82% (vs. Kimi K2.6's 92.50%) and ResearchRubrics score of 71.68% (vs. GPT 5.5's 61.50%). On HLE with Tools, accuracy is 47.20%—a significant jump from Step 3.5 Flash's text-only 35.68%.
Pricing and Availability
Pricing is aggressive: $0.20/M input tokens (cache miss), $0.04/M (cache hit), $1.15/M output. The model is available on StepFun Platform, OpenRouter, and NVIDIA NIM, with DeepInfra, Fireworks AI, and Modal to follow soon. Inference backends include vLLM, SGLang, Hugging Face Transformers, and llama.cpp. Quantization formats (BF16, FP8, NVFP4, GGUF) support local deployment with a minimum of 120 GB of unified memory.
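As a rough guide to how these rates combine, the sketch below estimates the cost of a single request from its token counts and the share of input tokens served from cache. The rates are the ones listed above; the token counts and the 50% cache-hit ratio in the example are hypothetical.

```python
# Listed rates, USD per million tokens.
PRICE_INPUT_MISS = 0.20
PRICE_INPUT_HIT = 0.04
PRICE_OUTPUT = 1.15


def request_cost(input_tokens: int, output_tokens: int,
                 cache_hit_ratio: float = 0.0) -> float:
    """Estimated USD cost of one request at the listed rates."""
    if not 0.0 <= cache_hit_ratio <= 1.0:
        raise ValueError("cache_hit_ratio must be between 0 and 1")
    hit_tokens = input_tokens * cache_hit_ratio
    miss_tokens = input_tokens - hit_tokens
    total = (miss_tokens * PRICE_INPUT_MISS
             + hit_tokens * PRICE_INPUT_HIT
             + output_tokens * PRICE_OUTPUT)
    return total / 1_000_000


if __name__ == "__main__":
    # Hypothetical agent run: 1M input tokens, 100k output tokens.
    print(f"no cache:  ${request_cost(1_000_000, 100_000):.3f}")
    print(f"50% cache: ${request_cost(1_000_000, 100_000, 0.5):.3f}")
```

Agentic loops resend much of the same context on every step, so the cache-hit rate tends to matter more for the final bill than the headline input price.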
Winners and Losers
Winners: StepFun gains credibility and market share; enterprises get cost-effective coding agents; the open-source community benefits from the Apache 2.0 license; NVIDIA strengthens its AI ecosystem.
Losers: Proprietary high-cost providers like Anthropic face pricing pressure; smaller open-source models without multimodal/agentic capabilities risk obsolescence; cloud providers with expensive inference offerings may lose customers.
Second-Order Effects
Expect a race to the bottom in coding agent pricing. Step 3.7 Flash sets a new cost baseline, forcing competitors to justify premium pricing. The Apache 2.0 license will spur community forks and specialized fine-tunes, accelerating commoditization. Enterprises will reevaluate build-vs-buy decisions for internal developer tools. The narrowing performance gap between open-weight and proprietary models will shift value to data, fine-tuning, and integration rather than base model capability.
Market Impact
The release accelerates the trend toward open-weight, efficient MoE models that combine vision, language, and tool use. Multimodal agentic AI becomes more accessible, commoditizing high-end coding assistance. StepFun's move pressures Anthropic, OpenAI, and Google to either lower prices or differentiate on unique capabilities.
Intelligence FAQ
Step 3.7 Flash achieves 97% of Claude Opus 4.6's SWE-Bench Verified score at about 11% of the cost per task ($0.19 vs $1.76).
It can be run locally with a minimum of 120 GB of unified memory or VRAM, and it supports BF16, FP8, NVFP4, and GGUF quantization.
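A back-of-the-envelope estimate shows why the 120 GB floor implies a 4-bit format. The sketch below counts weight storage only and ignores the KV cache, activations, and runtime overhead. The bits-per-weight values are assumptions: NVFP4 is taken as 4-bit values plus one 8-bit scale per 16 weights (about 4.5 bits), and GGUF is omitted because its size depends on the quantization type chosen.

```python
TOTAL_PARAMS = 198e9  # total parameter count quoted in the article

# Assumed effective bits per weight for each format.
BITS_PER_WEIGHT = {
    "BF16": 16.0,
    "FP8": 8.0,
    "NVFP4": 4.5,  # 4-bit values plus an 8-bit scale per 16-weight block
}


def weight_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes."""
    return total_params * bits_per_weight / 8 / 1e9


if __name__ == "__main__":
    for name, bits in BITS_PER_WEIGHT.items():
        print(f"{name}: ~{weight_gb(TOTAL_PARAMS, bits):.0f} GB of weights")
```

Under these assumptions only the 4-bit format fits within 120 GB, while FP8 and BF16 need multi-GPU or larger-memory systems.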


