FlashQLA: The Kernel That Rewrites the Attention Economy
The race to scale large language models has a new front: GPU kernels. Qwen's FlashQLA, released under the MIT license, delivers a 2-3x forward and 2x backward speedup on NVIDIA Hopper GPUs for Gated Delta Network (GDN) attention—the linear attention mechanism powering Qwen3.5 and Qwen3.6. This is not an incremental improvement. It is a structural shift in the cost-performance equation for long-context LLMs.
Standard softmax attention carries O(n²) complexity in sequence length. Linear attention reduces that to O(n). But until now, the kernel implementations—primarily the Triton-based Flash Linear Attention (FLA) library—left significant performance on the table, especially on Hopper's warpgroup-level Tensor Core instructions and asynchronous pipelines. FlashQLA closes that gap with three innovations: gate-driven automatic intra-card context parallelism, hardware-friendly algebraic reformulation that preserves numerical precision, and TileLang fused warp-specialized kernels that overlap data movement, Tensor Core, and CUDA Core operations.
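To make that complexity gap concrete, the sketch below contrasts full softmax attention with a simple gated linear-attention recurrence in plain PyTorch. This is an illustrative reference only, not the FlashQLA kernel: the single-head shapes, the per-step scalar gate `g`, and the unchunked loop are assumptions for readability, whereas GDN's production kernels use chunked, hardware-specialized formulations.

```python
import torch

def softmax_attention(q, k, v):
    # Full attention materializes an (n x n) score matrix, hence O(n^2) in sequence length.
    scores = (q @ k.transpose(-1, -2)) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v

def gated_linear_attention(q, k, v, g):
    # Illustrative gated linear recurrence (one head, no chunking, no Hopper-specific tricks):
    # a running (d_k x d_v) state is decayed by a per-step gate and updated with an outer
    # product of k_t and v_t, so the whole pass is O(n) in sequence length.
    n, d_k = q.shape
    d_v = v.shape[-1]
    state = torch.zeros(d_k, d_v)
    out = torch.empty(n, d_v)
    for t in range(n):
        state = g[t] * state + torch.outer(k[t], v[t])  # gated state update
        out[t] = q[t] @ state                            # read-out without an n x n matrix
    return out

if __name__ == "__main__":
    n, d = 1024, 64
    q, k, v = (torch.randn(n, d) for _ in range(3))
    g = torch.sigmoid(torch.randn(n))  # hypothetical per-step decay gates in (0, 1)
    print(softmax_attention(q, k, v).shape, gated_linear_attention(q, k, v, g).shape)
```

The point of the sketch is the state-size argument: the recurrent state stays at d_k x d_v no matter how long the sequence grows, which is what makes very long contexts tractable for linear attention.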
For executives, the bottom line is clear: FlashQLA cuts the cost of the attention stage of training and inference for linear-attention models by up to 3x on H100/H200 hardware. That translates to lower cloud bills, faster time-to-market, or the ability to handle longer sequences without exploding compute budgets.
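As a back-of-envelope illustration of how a kernel speedup flows into a bill, consider the sketch below. The GPU-hour price, job size, and the assumption that attention accounts for 40% of runtime are hypothetical and not from the FlashQLA release; only the 2-3x speedup range is taken from the reported numbers.

```python
# Hypothetical illustration: a 10,000 GPU-hour run at $3.50 per H100-hour,
# with attention assumed to be 40% of total runtime. None of these inputs
# come from the FlashQLA release; only the kernel speedup range does.
total_gpu_hours = 10_000
price_per_gpu_hour = 3.50
attention_fraction = 0.40
kernel_speedup = 2.5  # midpoint of the reported 2-3x forward speedup

# Amdahl-style estimate: only the attention share of runtime shrinks.
new_runtime_fraction = (1 - attention_fraction) + attention_fraction / kernel_speedup
baseline_cost = total_gpu_hours * price_per_gpu_hour
accelerated_cost = baseline_cost * new_runtime_fraction
print(f"Baseline ${baseline_cost:,.0f} -> ${accelerated_cost:,.0f} "
      f"(saves ${baseline_cost - accelerated_cost:,.0f})")
```

The realized savings scale with how much of a given workload attention dominates, which is why the long-context deployments discussed below benefit most.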
Strategic Analysis: Winners, Losers, and the New Kernel Stack
Who Gains?
Qwen Team / Alibaba Cloud – FlashQLA directly accelerates their GDN-based models, giving them a competitive edge in both training throughput and inference latency. This is a moat-building move: by open-sourcing the kernel, they set the standard for linear attention on Hopper, making it harder for competitors to match their performance without adopting the same stack.
NVIDIA Hopper GPU Users – Any organization running Qwen3.5/3.6 or other GDN-based models on H100/H200 can immediately realize 2-3x speedups. This includes cloud providers, enterprises deploying long-context agents, and research labs training large models.
Open-Source AI Community – MIT license means FlashQLA can be integrated into any project, commercial or otherwise. This accelerates the adoption of linear attention, which is critical for scaling to million-token contexts.
Who Loses?
FLA Triton Kernel – FlashQLA's benchmark results show 2-3x superiority. Unless FLA can close the gap, it will lose mindshare and adoption among Hopper users. Triton's value proposition—ease of use—is now weighed against an up-to-3x performance penalty on this workload.
Proprietary Kernel Vendors – Companies selling closed-source attention optimizations face a free, high-performance alternative. FlashQLA raises the bar for what 'good enough' means, compressing margins for proprietary solutions.
Standard Softmax Attention Users – Organizations still using full attention for long sequences will feel pressure to migrate to linear attention to stay cost-competitive. Migration costs and model retraining are real barriers, but the performance gap is widening.
Second-Order Effects: The TileLang vs. Triton War
FlashQLA is built on TileLang, a compiler framework that competes with Triton. This is a strategic play: by demonstrating superior performance on a key workload, TileLang gains credibility as an alternative to Triton for high-performance kernel development. Expect more model teams to evaluate TileLang for their own kernels, especially if they target Hopper-specific features that Triton cannot fully exploit.
Longer term, this could fragment the kernel ecosystem. Triton's advantage is its Python-based accessibility and broad community. TileLang's advantage is hardware-level optimization. The winner will be the framework that balances performance with developer productivity—but for now, FlashQLA proves that TileLang can deliver where it counts.
Market / Industry Impact
FlashQLA accelerates the shift from quadratic to linear attention in production LLMs. As long-context applications (e.g., code generation, document analysis, conversational agents) grow, the cost of full attention becomes prohibitive. Linear attention, now with a 3x faster kernel, becomes the default choice for new model architectures.
Cloud providers will likely integrate FlashQLA into their inference stacks, reducing per-token costs for customers. This could trigger a price war in LLM inference, benefiting end users but squeezing margins for providers that cannot match the efficiency.
On the hardware side, FlashQLA's reliance on Hopper-specific features (SM90+) reinforces NVIDIA's dominance in AI compute. AMD and other GPU vendors will need to match Hopper's warpgroup capabilities to compete in this kernel-level optimization game.
Executive Action
- Evaluate FlashQLA for your GDN-based models: If you use Qwen3.5/3.6 or plan to adopt linear attention, benchmark FlashQLA against your current kernel stack (a minimal timing harness is sketched after this list). The 2-3x speedup directly reduces compute costs.
- Monitor the TileLang vs. Triton ecosystem: FlashQLA's success may signal a broader shift. Consider investing in TileLang expertise if your team develops custom kernels.
- Reassess long-context strategy: With linear attention now significantly faster, the trade-off between model expressiveness and cost shifts. Re-evaluate whether full attention layers are worth the premium.
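For the first action item, a minimal timing harness might look like the sketch below. The `fla_gdn_forward` and `flashqla_gdn_forward` names are placeholders for whatever entry points your FLA and FlashQLA installations expose; this article does not document their exact APIs, so treat those calls as assumptions to adapt.

```python
import torch

def benchmark(fn, *args, warmup=10, iters=50):
    """Time a CUDA callable with events, averaging over `iters` runs after warmup."""
    for _ in range(warmup):
        fn(*args)
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters  # milliseconds per call

# q, k, v, gates = ...  # your model's actual GDN inputs, resident on a Hopper GPU
# ms_fla = benchmark(fla_gdn_forward, q, k, v, gates)            # placeholder entry point
# ms_flashqla = benchmark(flashqla_gdn_forward, q, k, v, gates)  # placeholder entry point
# print(f"FLA: {ms_fla:.2f} ms  FlashQLA: {ms_flashqla:.2f} ms  "
#       f"speedup: {ms_fla / ms_flashqla:.2f}x")
```

Run the comparison on your own sequence lengths and batch shapes; kernel speedups reported at one problem size do not always transfer to another.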
Why This Matters
FlashQLA is not just a kernel—it is a signal that the software stack for AI is still ripe for optimization. Every 2x speedup in a core operation like attention translates to millions of dollars in saved compute for large-scale deployments. Ignoring this development means leaving money on the table.
Final Take
FlashQLA is a masterclass in hardware-software co-design. It exploits Hopper's architecture to an extent that Triton cannot match, delivering real-world speedups that will reshape the economics of long-context LLMs. The open-source release ensures rapid adoption, and the TileLang framework gains a killer app. For anyone building or deploying large language models, this is the kernel to watch.
Intelligence FAQ
What does FlashQLA require to run?
FlashQLA requires NVIDIA Hopper GPUs (SM90+), CUDA 12.8+, and PyTorch 2.8+.
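A quick way to sanity-check those minimums from Python is sketched below; the thresholds simply mirror the stated requirements, and passing the check says nothing about whether FlashQLA itself is installed.

```python
import torch

def check_flashqla_prereqs():
    # Thresholds mirror the stated minimums: Hopper (SM90+), CUDA 12.8+, PyTorch 2.8+.
    assert torch.cuda.is_available(), "No CUDA device visible"
    major, minor = torch.cuda.get_device_capability()
    assert major >= 9, f"Need SM90+ (Hopper); found SM{major}{minor}"
    cuda_ver = tuple(map(int, torch.version.cuda.split(".")))
    assert cuda_ver >= (12, 8), f"Need CUDA 12.8+; found {torch.version.cuda}"
    torch_ver = tuple(map(int, torch.__version__.split("+")[0].split(".")[:2]))
    assert torch_ver >= (2, 8), f"Need PyTorch 2.8+; found {torch.__version__}"
    print("Environment meets the stated FlashQLA requirements.")

check_flashqla_prereqs()
```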
How does FlashQLA achieve its speedups?
Through gate-driven automatic intra-card context parallelism, hardware-friendly algebraic reformulation, and TileLang fused warp-specialized kernels that overlap data movement and computation.


