
Grokking Has Finite Capacity: Measuring and Overcoming Limits on Simultaneous Algorithmic Discovery

John Kearney, 15 Research Lab
March 31, 2026
Zenodo preprint · DOI: 10.5281/zenodo.19346536

Abstract

We demonstrate that grokking — the phenomenon in which neural networks suddenly generalize after prolonged memorization — has a hard capacity limit. Using modular arithmetic (mod 97) as a testbed, a d=128 transformer reliably groks up to 5 simultaneous operations but collapses completely at 6. We pinpoint the bottleneck in the shared embedding layer, where representational interference between operations causes catastrophic failure. We propose two solutions: separate embeddings per operation, and a modular split architecture that uses gradient similarity to identify optimal operation groupings without manual intervention. The split architecture achieves full capability recovery with half the parameters of the monolithic model (486K vs. 929K).
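The testbed described above can be sketched as follows. The specific operation set and the (a, op_id, b) tokenization are illustrative assumptions, not necessarily the paper's exact setup; only the modulus (97) and the multi-operation design come from the abstract.

```python
# Sketch of a multi-operation modular-arithmetic dataset in the spirit
# of the paper's testbed (mod 97, N simultaneous operations).
# NOTE: the operation names and tokenization here are assumptions.
P = 97

# Each operation maps a pair (a, b) to a residue mod P.
OPS = {
    "add": lambda a, b: (a + b) % P,
    "sub": lambda a, b: (a - b) % P,
    "mul": lambda a, b: (a * b) % P,
    "sq_sum": lambda a, b: (a * a + b * b) % P,
    "a2_plus_b": lambda a, b: (a * a + b) % P,
}

def make_dataset(op_names):
    """Enumerate all P*P input pairs for each operation; each example
    is a token triple (a, op_id, b) with the answer as the label."""
    examples = []
    for op_id, name in enumerate(op_names):
        fn = OPS[name]
        for a in range(P):
            for b in range(P):
                examples.append(((a, op_id, b), fn(a, b)))
    return examples

data = make_dataset(["add", "mul"])
print(len(data))  # 2 operations x 97 x 97 pairs
```

Training N operations simultaneously then just means sampling a train split from the union of these per-operation tables, which is what makes the shared embedding layer the natural point of interference.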

Key Findings

  • Capacity cliff at 5 operations. A d=128 transformer groks 5 simultaneously. At 6: complete collapse across all seeds.
  • Cliff, not slope. The transition is discontinuous. 5/5 at N=5, 0/6 at N=6. No gradual degradation.
  • Shared embeddings are the bottleneck. Representational interference in the embedding layer causes the collapse.
  • Modular architecture solves it. Split architecture with separate embeddings: 6/6 grokking. Half the parameters (486K vs 929K).
  • Automated grouping. Gradient similarity identifies optimal operation groupings without manual intervention.
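The automated-grouping idea in the last bullet can be sketched as clustering operations over a thresholded gradient-cosine-similarity graph. This is a minimal illustration of the general technique, not the paper's implementation: the gradient vectors below are synthetic stand-ins, whereas in practice they would be per-operation gradients of the shared embedding layer, and the threshold is a free parameter.

```python
# Sketch: group operations whose gradients point in similar directions,
# so conflicting operations land in separate modules with separate
# embeddings. Synthetic gradients; threshold and clustering rule are
# illustrative assumptions.
import numpy as np

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def group_by_gradient_similarity(grads, threshold=0.0):
    """Union-find over the similarity graph: operations i and j share a
    group whenever cosine(grad_i, grad_j) exceeds `threshold`."""
    n = len(grads)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            if cosine(grads[i], grads[j]) > threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())

# Synthetic example: ops 0-1 share a gradient direction; ops 2-3 oppose it.
rng = np.random.default_rng(0)
base = rng.normal(size=16)
grads = [base + 0.1 * rng.normal(size=16),
         base + 0.1 * rng.normal(size=16),
         -base + 0.1 * rng.normal(size=16),
         -base + 0.1 * rng.normal(size=16)]
print(group_by_gradient_similarity(grads))
```

Because the grouping is computed from training-time gradients alone, no manual assignment of operations to modules is needed, matching the "without manual intervention" claim above.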

Safety Implications

If algorithmic discovery has hard capacity limits that are invisible until the cliff is hit, then safety properties learned alongside capability properties may silently degrade as task count increases. The cliff does not show up in loss curves until it has already happened. This suggests that safety evaluations need to test composition specifically, not just individual capabilities.

Keywords

Grokking, multi-task learning, modular architecture, mixture of experts, capacity limits, algorithmic generalization, mechanistic interpretability, representational interference

Citation

@article{kearney2026grokking,
  title   = {Grokking Has Finite Capacity: Measuring and Overcoming
             Limits on Simultaneous Algorithmic Discovery},
  author  = {Kearney, John},
  year    = {2026},
  doi     = {10.5281/zenodo.19346536},
  url     = {https://doi.org/10.5281/zenodo.19346536},
  publisher = {Zenodo}
}
