Reading list
Papers and posts I’m circling—links open in a new tab.
by Shubham Rasal
Deep Delta / Residual Geometry / Grokking
How networks learn, and when generalization “clicks.”
Deep Delta Learning
Reframes learning dynamics through a delta lens—if you like optimization geometry, start here.
The Delta Rule (Background)
The classic update rule behind a lot of what follows—quick Wikipedia grounding.
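The rule itself fits on one screen. A minimal sketch in Python, for a single linear unit; the learning rate and toy target below are illustrative choices, not anything from the linked page:

```python
import numpy as np

def delta_rule_step(w, x, target, lr=0.1):
    """One delta-rule update: w += lr * (target - y) * x, where y = w . x."""
    y = np.dot(w, x)                  # the unit's current prediction
    return w + lr * (target - y) * x  # nudge weights along the error

# Toy usage: fit y = 2 * x from repeated updates on one example.
w = np.zeros(1)
for _ in range(100):
    w = delta_rule_step(w, np.array([1.0]), 2.0)
# w converges toward 2.0
```

Same update you get from gradient descent on squared error for a linear unit, which is why it keeps resurfacing in the papers above.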
Grokking: Generalization Beyond Overfitting
The paper that named the phenomenon: memorization first, generalization later.
Why Neural Networks Suddenly Start Generalizing
OpenAI’s readable take on the grokking story—good intuition before the heavy proofs.
Nested / Multi-Timescale / Meta Learning
Learning to learn, and systems that update at more than one clock speed.
Introducing Nested Learning – Google Research Blog
Google’s framing of nested optimization—useful mental model for continual learning.
Learning to Learn by Gradient Descent by Gradient Descent
Meta-learning classic: an optimizer learned by gradient descent—still cited everywhere.
Meta-Learning in Neural Networks: A Survey
Wide-angle map of the field if you want references more than a single trick.
Speculative Decoding / Draft Models
Faster generation by guessing ahead—core ideas behind modern fast inference.
Accelerating Large Language Model Decoding with Speculative Sampling
Foundational speculative sampling paper—how a small draft model speeds up a large one.
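The acceptance rule at the heart of speculative sampling is simple enough to toy with: sample a token from the draft distribution q, accept it with probability min(1, p/q) under the target distribution p, and on rejection resample from the normalized residual max(p − q, 0). A single-token sketch with made-up distributions (no actual models involved):

```python
import numpy as np

def speculative_accept(p, q, token, rng):
    """Accept a draft token (sampled from q) with prob min(1, p[t]/q[t]);
    on rejection, resample from the residual max(p - q, 0), renormalized.
    The resulting samples are distributed exactly according to p."""
    if rng.random() < min(1.0, p[token] / q[token]):
        return token
    residual = np.maximum(p - q, 0.0)  # nonzero whenever p != q
    residual /= residual.sum()
    return rng.choice(len(p), p=residual)

# Toy target p and draft q over a 3-token vocab (hypothetical numbers).
p = np.array([0.6, 0.3, 0.1])
q = np.array([0.3, 0.3, 0.4])
rng = np.random.default_rng(0)
draws = [speculative_accept(p, q, rng.choice(3, p=q), rng) for _ in range(20000)]
# Empirical frequencies match p, even though every proposal came from q.
```

The full algorithm runs this over a block of k drafted tokens and verifies them with one forward pass of the large model, which is where the speedup comes from.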
Speculative Decoding with Draft Models (Original Paper)
Pairs draft and target models explicitly—read after the sampling paper above.
SpecExtend: Scaling Speculative Decoding to Long Contexts
Pushes speculative ideas into long-context regimes where latency really hurts.
Dynamic Depth Decoding for Efficient LLM Inference
Adaptive depth during decoding—another lever beyond pure draft models.
EAGLE / Advanced Speculative Heads
Learned drafting, and how these ideas land in production stacks.
EAGLE-3: Efficient Accelerated Generation via Learned Drafting
State-of-the-art learned drafting—dense, but the figures repay the time.
From Research to Production: Accelerate OSS LLMs with EAGLE-3 on Vertex AI
How Google ships these ideas—good if you care about serving, not just theory.
Long Context / Memory / RLM-Style Ideas
When attention windows need to stretch beyond the usual.
Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context
Segment-level recurrence for longer dependencies—still a useful baseline to know.
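The recurrence trick is easy to sketch: cache the previous segment's hidden states and let the current segment attend over both. A toy single-head version, leaving out the projections and relative positional encodings the real model has:

```python
import numpy as np

def attend_with_memory(q, kv_segment, memory):
    """Sketch of Transformer-XL-style segment-level recurrence: queries from
    the current segment attend over cached previous-segment states plus the
    current segment's own states."""
    kv = np.concatenate([memory, kv_segment], axis=0)  # [mem + seg, d]
    scores = q @ kv.T / np.sqrt(q.shape[-1])           # scaled dot-product
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over mem + seg
    return weights @ kv, kv_segment                    # output, next memory

# Toy usage with random states: 4-token segment, 6-token cached memory.
rng = np.random.default_rng(0)
segment = rng.normal(size=(4, 8))
memory = rng.normal(size=(6, 8))
out, new_memory = attend_with_memory(segment, segment, memory)
```

Chaining `new_memory` into the next call is what lets dependencies stretch past a single segment without recomputing old states.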
LongNet: Scaling Transformers to 1,000,000,000 Tokens
Dilated attention at extreme lengths—skim the method, stare at the scaling plots.
N-grams / DeepSeek / Classical Foundations
From count-based LMs to modern reasoning-focused training.
A Tutorial on N-gram Language Models
Jurafsky & Martin’s PDF—if you only read one classical LM chapter, make it this.
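If you want the one-screen version of the chapter's core idea: a bigram model is just conditional counts, P(w2 | w1) = count(w1, w2) / count(w1). A minimal sketch with a made-up toy corpus:

```python
from collections import Counter, defaultdict

def train_bigram(corpus):
    """Count-based bigram model with sentence-boundary markers."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for w1, w2 in zip(tokens, tokens[1:]):
            counts[w1][w2] += 1
    return counts

def bigram_prob(counts, w1, w2):
    """Maximum-likelihood estimate P(w2 | w1); no smoothing."""
    total = sum(counts[w1].values())
    return counts[w1][w2] / total if total else 0.0

model = train_bigram(["the cat sat", "the dog sat", "the cat ran"])
# P(cat | the) = 2/3 on this toy corpus
```

Everything else in the chapter (smoothing, backoff, perplexity) is about the places where these raw counts break down.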
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
How RL shaped a model people actually talk about—worth it for the training story.
Optimizer Discovery / RL for Training Rules
When the optimizer itself becomes the learned artifact.
Learning to Optimize
Learns optimization algorithms as policies—foundational for “optimizer as network” work.
Discovering Optimization Algorithms via Reinforcement Learning
RL search over update rules; wild to see familiar hand-designed optimizers re-emerge from a from-scratch search.
Data / Datasets / Web Corpora
What goes into large models before the architecture even shows up.
FineWeb: A New Large-Scale Web Dataset
High-quality web data at scale—useful context for “what’s in the pile.”
The Common Crawl Dataset
The raw web snapshot pipeline everyone builds on—skim for scope, not polish.
The Pile: An 800GB Dataset of Diverse Text
Landmark heterogeneous text dump—still a reference for data diversity arguments.
NVIDIA NeMo Data Curation Overview
Practical curation patterns from NVIDIA—good when you’re building pipelines, not papers.
Flash / Systems / Acceleration
When decoding speed is the product feature.
dFlash: Fast and Accurate LLM Decoding
Z-Lab’s project page—tight write-up if you’re chasing latency wins.
Articles from the archive you won’t regret reading
Non-ML essays that stuck—agency, luck, and how people grow.
How to Time Travel
Brian Chesky on learning from the future—short, memorable, oddly practical.
High Agency
George Mack’s lens on people who bend reality—pair with any career essay.
How to Get Lucky
Taylor Pearson on increasing surface area for luck—better than another hustle thread.
Childhoods of Exceptional People
Henrik Karlsson’s deep dive—slow read, big payoff if you like biographical patterns.
Earlier readings
Old favorites still on the stack.
Attention Is All You Need
The transformer paper—still the right place to start if you’ve only used the API.
The Curse of Dimensionality
Hinton’s IJCNN piece—short PDF if you want classical intuition for high-D geometry.