Why Gradient Descent Zigzags and How Momentum Fixes It
Gradient descent has a fundamental limitation: on most real-world loss surfaces, it is inefficient. When the surface has uneven curvature—steep in one direction and flat in another—the algorithm struggles to make consistent progress. A high learning rate helps move faster along the flat direction but causes overshooting and oscillations along the steep direction. Reducing the learning rate stabilizes the updates but significantly slows convergence. This trade-off is not rare; it is typical behavior for standard gradient descent.
Momentum addresses this issue by incorporating information from past gradients. Instead of relying only on the current gradient, it maintains a running average (often called velocity) and updates parameters based on this accumulated direction. As a result, consistent gradients reinforce each other, allowing faster movement across flat regions, while oscillating gradients tend to cancel out, reducing instability.
In a controlled experiment on a stretched bowl loss surface—with eigenvalues 0.1 and 10, condition number 100—vanilla gradient descent required 185 steps to converge, while momentum with β=0.90 converged in 159 steps. However, β=0.99 failed to converge, ending with a loss of 0.487363 after 300 steps. This sensitivity to hyperparameter choice is a critical strategic consideration for any organization deploying optimization at scale.
Strategic Analysis
The Architecture of the Problem
The loss surface is a stretched bowl—flat along one axis, steep along the other. The Hessian is diagonal with eigenvalues 0.1 and 10, giving a condition number of 100. That number is the core of the problem: it tells you the surface is 100× more curved in one direction than the other, which is what forces GD into its zigzag behavior.
The learning rate of 0.18 is chosen deliberately. The stability limit for GD is 2 / λ_max = 2 / 10 = 0.2—any higher and the optimizer diverges outright. At 0.18, the steep axis update factor is |1 − 10 × 0.18| = 0.8, meaning the optimizer overshoots and reverses direction every single step. The flat axis factor is |1 − 0.1 × 0.18| = 0.982, meaning it recovers only 1.8% of the remaining distance per step. This is the worst-case combination momentum is built for: oscillation in one direction, near-stagnation in the other.
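The two per-axis factors above can be verified with a few lines of arithmetic (a sketch; variable names are mine, the constants come from the experiment described in the text):

```python
# Per-axis contraction factors for plain gradient descent on the quadratic
# f(x) = 0.5 * (0.1 * x1**2 + 10 * x2**2), using the values from the text.
lam_flat, lam_steep = 0.1, 10.0
lr = 0.18

# One GD step scales the error along each eigen-direction by |1 - lr * lam|.
factor_steep = abs(1 - lr * lam_steep)  # 0.8: overshoots and flips sign each step
factor_flat = abs(1 - lr * lam_flat)    # 0.982: recovers only 1.8% per step

# Stability limit: lr must stay below 2 / lam_max = 0.2.
assert lr < 2 / lam_steep
print(factor_steep, factor_flat)
```

Note that the steep-axis factor of 0.8 comes from a *negative* multiplier (1 − 1.8 = −0.8): the error shrinks, but the sign flips every step, which is exactly the zigzag.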
How Momentum Works
Momentum introduces one additional term: velocity (v), which is initially zero. Instead of using only the current gradient, it updates this velocity by combining the previous velocity and the new gradient. The parameter β controls how much weight is given to past information. A higher β (e.g., 0.9) means the update relies more on past gradients, while a lower β makes it behave more like standard gradient descent.
This averaging effect behaves differently across directions. In steep directions, where gradients frequently change sign, the updates tend to cancel each other out, reducing oscillations. In flatter directions, where gradients are more consistent, they accumulate over time, allowing the optimizer to move faster. Finally, the position is updated using this velocity instead of the raw gradient. This results in smoother and more stable progress toward the minimum.
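A minimal sketch of this update rule, applied to the same stretched bowl (the loop structure and variable names are illustrative, not from any particular framework):

```python
# Heavy-ball momentum on the stretched bowl described above.
# Eigenvalues, learning rate, beta, start point, and step budget
# all follow the experiment in the text.
eigs = [0.1, 10.0]                # Hessian eigenvalues: flat axis, steep axis
lr, beta = 0.18, 0.9

x = [-4.0, 1.5]                   # start point from the experiment
v = [0.0, 0.0]                    # velocity starts at zero

for _ in range(300):
    for i in range(2):
        g = eigs[i] * x[i]        # gradient of 0.5 * eig * x^2 along axis i
        v[i] = beta * v[i] + g    # blend past velocity with the new gradient
        x[i] = x[i] - lr * v[i]   # step along the velocity, not the raw gradient

loss = 0.5 * sum(e * xi ** 2 for e, xi in zip(eigs, x))
print(f"final loss: {loss:.2e}")
```

Swapping the velocity update for `v[i] = g` recovers vanilla gradient descent, which makes the comparison in the next section easy to reproduce.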
Experimental Results
All three experiments start from the same point (−4.0, 1.5), use the same learning rate, and run for 300 steps. Vanilla gradient descent progresses slowly with a zigzag pattern and reaches a final loss of 0.000015. Momentum with β = 0.90 performs more efficiently, reducing oscillations and building speed in the right direction, ultimately achieving a lower loss of 0.000001 within the same number of steps.
However, momentum is sensitive to the choice of β. When β is set too high (e.g., 0.99), the optimizer accumulates excessive velocity with very little decay. This leads to overshooting the minimum and failing to stabilize, resulting in a much higher final loss of 0.487363 even after 300 steps. In this case, the optimizer effectively keeps circling the minimum without converging.
A β sensitivity sweep confirms the inverted U-shaped relationship: as β increases from 0.0 to 0.95, convergence steadily improves, but at β = 0.99, performance drops sharply. Moderate β values (around 0.9–0.95) give the best performance, while values that are too high can hurt convergence significantly.
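The sweep can be sketched as follows, reusing the same setup; exact losses depend on floating-point details, but the pattern—fast convergence at moderate β, a stall at β = 0.99—should reproduce:

```python
# Hedged reproduction of the beta sensitivity sweep. The setup (eigenvalues,
# start point, learning rate, 300 steps) follows the experiment in the text.
def final_loss(beta, lr=0.18, steps=300):
    eigs = [0.1, 10.0]
    x = [-4.0, 1.5]
    v = [0.0, 0.0]
    for _ in range(steps):
        for i in range(2):
            v[i] = beta * v[i] + eigs[i] * x[i]   # momentum velocity update
            x[i] = x[i] - lr * v[i]               # parameter step
    return 0.5 * sum(e * xi ** 2 for e, xi in zip(eigs, x))

# beta = 0.0 is vanilla GD; 0.99 should stall near loss ~0.5.
for beta in (0.0, 0.5, 0.9, 0.95, 0.99):
    print(f"beta={beta:.2f}  final loss={final_loss(beta):.2e}")
```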
Winners & Losers
Winners: Deep learning practitioners gain a simple yet effective tool to accelerate training on ill-conditioned problems, reducing time-to-convergence. AI research labs can leverage momentum to train deeper and more complex models faster, enabling experimentation with larger architectures. Cloud computing providers benefit as faster training reduces compute costs per model, potentially increasing demand for their services as more experiments become feasible.
Losers: Proponents of vanilla gradient descent see their preferred method shown to be significantly slower on real-world loss surfaces, reducing its practical appeal. Optimizer algorithm vendors selling complex alternatives may find momentum's simplicity and effectiveness reduces the perceived need for more sophisticated (and often proprietary) optimizers.
Second-Order Effects
The optimization landscape shifts towards hybrid methods that combine momentum with adaptive learning rates (e.g., Adam), making vanilla gradient descent obsolete for most practical applications. This drives further research into hyperparameter-free optimizers and automated tuning strategies for momentum coefficient β.
Organizations that fail to adopt momentum or misuse it (e.g., setting β too high) will face slower training times and higher compute costs, eroding their competitive advantage in AI development.
Market / Industry Impact
The broader impact is a re-evaluation of optimization strategies across the AI industry. As models grow larger and loss surfaces become more ill-conditioned, the ability to converge quickly and stably becomes a key differentiator. Momentum, despite its age, remains a critical component of modern optimizers, and understanding its behavior is essential for any organization serious about AI.
Executive Action
- Audit current optimizer choices in your ML pipelines. If vanilla gradient descent is still in use, replace it with momentum (β=0.9) or Adam; in the experiment above this cut convergence from 185 steps to 159, and the gap grows with the condition number.
- Implement automated hyperparameter tuning for β, especially when training deep networks. A sweep from 0.85 to 0.95 can identify the optimal value without manual guesswork.
- Monitor convergence behavior: if loss oscillates or fails to decrease, check β. Values above 0.95 risk divergence, while values below 0.5 offer little benefit over vanilla GD.
Why This Matters
Momentum is not a new technique, but its strategic importance is growing as AI models scale. The difference between 185 steps and 159 steps may seem small, but multiplied across thousands of training runs, it translates into significant compute savings and faster time-to-market. Ignoring momentum—or mis-tuning it—is a hidden cost that erodes efficiency and competitiveness.
Final Take
Momentum is a proven, low-cost upgrade to gradient descent that delivers measurable gains in convergence speed and stability. However, it is not a free lunch: hyperparameter sensitivity means that β must be chosen carefully. Organizations that treat momentum as a drop-in replacement without tuning risk divergence and wasted compute. The smart move is to adopt momentum with a moderate β (0.9-0.95) and validate with a sensitivity sweep. In the optimization game, small changes yield outsized returns—but only if you understand the trade-offs.
Intelligence FAQ
Why does gradient descent zigzag?
Because real loss surfaces have uneven curvature—steep in one direction and flat in another. A single learning rate cannot simultaneously prevent oscillations in the steep direction and make progress in the flat direction.
How does momentum fix this?
Momentum maintains a velocity term that averages past gradients. Oscillating gradients cancel out, damping oscillations, while consistent gradients accumulate, accelerating progress in flat directions.
What value of β works best?
Typically 0.9 to 0.95. Values below 0.5 offer little benefit over vanilla GD, while values above 0.95 risk overshooting and divergence.
Can momentum make things worse?
Yes. If β is set too high (e.g., 0.99), the optimizer can overshoot the minimum and fail to converge, resulting in higher loss than vanilla GD.


