Deep Delta / Residual Geometry / Grokking

1. Deep Delta Learning
2. The Delta Rule (Background)
3. Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
4. Why Neural Networks Suddenly Start Generalizing
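As background for entry 2, the classical delta rule (Widrow-Hoff / LMS) can be sketched in a few lines: weights move in proportion to the prediction error times the input. The learning rate and toy data below are illustrative, not taken from any of the listed papers.

```python
# Minimal sketch of the classical delta rule for a linear unit:
# w <- w + lr * (target - prediction) * x
def delta_rule_step(w, x, target, lr=0.1):
    y = sum(wi * xi for wi, xi in zip(w, x))      # linear prediction
    error = target - y                            # prediction error
    return [wi + lr * error * xi for wi, xi in zip(w, x)]

# Learn y = 2*x1 - x2 from a few consistent examples.
w = [0.0, 0.0]
data = [([1.0, 0.0], 2.0), ([0.0, 1.0], -1.0), ([1.0, 1.0], 1.0)]
for _ in range(200):
    for x, t in data:
        w = delta_rule_step(w, x, t)
print(w)  # approaches [2.0, -1.0]
```

Because the examples are consistent with a single linear map, repeated passes converge to the exact weights.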

Nested / Multi-Timescale / Meta Learning

5. Introducing Nested Learning – Google Research Blog
6. Learning to Learn by Gradient Descent by Gradient Descent
7. Meta-Learning in Neural Networks: A Survey

Speculative Decoding / Draft Models

8. Accelerating Large Language Model Decoding with Speculative Sampling
9. Speculative Decoding with Draft Models (Original Paper)
10. SpecExtend: Scaling Speculative Decoding to Long Contexts
11. Dynamic Depth Decoding for Efficient LLM Inference
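The core accept/reject loop shared by the speculative decoding papers above can be sketched on a toy vocabulary. The two fixed distributions below are hypothetical stand-ins; in practice `p` comes from the large target model and `q` from the small draft model.

```python
import random

# Hypothetical fixed distributions over a 3-token vocabulary.
def target_probs(prefix):   # p(x | prefix), the large model
    return {"a": 0.6, "b": 0.3, "c": 0.1}

def draft_probs(prefix):    # q(x | prefix), the cheap draft model
    return {"a": 0.4, "b": 0.4, "c": 0.2}

def sample(dist):
    r, acc = random.random(), 0.0
    for tok, prob in dist.items():
        acc += prob
        if r < acc:
            return tok
    return tok

def speculative_step(prefix, k=4):
    """Draft up to k tokens with q, verifying each against p.

    Accept a drafted token x with probability min(1, p(x)/q(x)); on the
    first rejection, resample from the residual max(0, p - q), renormalized.
    This rejection scheme yields samples distributed exactly as p.
    """
    out = list(prefix)
    for _ in range(k):
        q = draft_probs(out)
        x = sample(q)
        p = target_probs(out)
        if random.random() < min(1.0, p[x] / q[x]):
            out.append(x)              # accepted: keep the draft token
        else:
            residual = {t: max(0.0, p[t] - q[t]) for t in p}
            z = sum(residual.values())
            out.append(sample({t: v / z for t, v in residual.items()}))
            break                      # stop drafting after a rejection
    return out

print(speculative_step(["<s>"]))
```

The speedup comes from verifying all k draft tokens with a single target-model forward pass, while the correction step keeps the output distribution identical to sampling from the target model alone.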

EAGLE / Advanced Speculative Heads

12. EAGLE-3: Efficient Accelerated Generation via Learned Drafting
13. From Research to Production: Accelerate OSS LLMs with EAGLE-3 on Vertex AI

Long Context / Memory / RLM-Style Ideas

14. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
15. LongNet: Scaling Transformers to 1,000,000,000 Tokens

N-grams / DeepSeek / Classical Foundations

16. A Tutorial on N-gram Language Models
17. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
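The classical foundation named in entry 16 fits in a few lines: a bigram model estimates P(w2 | w1) from counts, with add-one (Laplace) smoothing so unseen pairs get nonzero probability. The tiny corpus is made up for illustration.

```python
from collections import Counter

# Minimal bigram language model with add-one (Laplace) smoothing.
corpus = "the cat sat on the mat the cat ran".split()
vocab = set(corpus)

bigrams = Counter(zip(corpus, corpus[1:]))   # counts of (w1, w2) pairs
unigrams = Counter(corpus[:-1])              # counts of w1 as a context

def bigram_prob(w1, w2):
    """P(w2 | w1) with add-one smoothing over the vocabulary."""
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + len(vocab))

print(f"{bigram_prob('the', 'cat'):.3f}")   # (2+1)/(3+6) = 0.333
```

Smoothing keeps the conditional distribution proper: for any context, the probabilities over the whole vocabulary still sum to one.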

Optimizer Discovery / RL for Training Rules

18. Learning to Optimize
19. Discovering Optimization Algorithms via Reinforcement Learning

Data / Datasets / Web Corpora

20. FineWeb: A New Large-Scale Web Dataset
21. The Common Crawl Dataset
22. The Pile: An 800GB Dataset of Diverse Text for Language Modeling
23. NVIDIA NeMo Data Curation Overview

Flash / Systems / Acceleration

24. dFlash: Fast and Accurate LLM Decoding

Articles from the Archive You Won’t Regret Reading

How to Time Travel
Brian Chesky
High Agency
George Mack
How to Get Lucky
Taylor Pearson
Childhoods of Exceptional People
Henrik Karlsson

Earlier Readings

Attention Is All You Need
The seminal paper introducing the Transformer architecture
The Curse of Dimensionality
A long-pending read from my university days