Reading list

Papers and posts I’m circling—links open in a new tab.

by Shubham Rasal

Deep Delta / Residual Geometry / Grokking

How networks learn, and when generalization “clicks.”

  • Deep Delta Learning

    Reframes learning dynamics through a delta lens—if you like optimization geometry, start here.

  • The Delta Rule (Background)

    The classic update rule behind a lot of what follows—quick Wikipedia grounding.

  • Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets

    The paper that named the phenomenon: memorization first, generalization later.

  • Why Neural Networks Suddenly Start Generalizing

    OpenAI’s readable take on the grokking story—good intuition before the heavy proofs.
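If the delta rule entry above is unfamiliar, the update is small enough to sketch in a few lines. This is toy code for a single linear unit, not taken from any of the papers:

```python
# Toy sketch of the classic delta (Widrow-Hoff) rule: nudge each weight by
# lr * error * input, where error = target - prediction.
def delta_rule_step(w, x, target, lr=0.1):
    y = sum(wi * xi for wi, xi in zip(w, x))    # linear prediction
    error = target - y                           # the "delta"
    return [wi + lr * error * xi for wi, xi in zip(w, x)]

# Learn weights so the unit maps x -> 3.0.
w, x = [0.0, 0.0], [1.0, 2.0]
for _ in range(100):
    w = delta_rule_step(w, x, target=3.0)
pred = sum(wi * xi for wi, xi in zip(w, x))      # converges toward 3.0
```

The "delta lens" reframings above start from exactly this kind of error-proportional update.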

Nested / Multi-Timescale / Meta Learning

Learning to learn, and systems that update at more than one clock speed.

  • Introducing Nested Learning – Google Research Blog

    Google’s framing of nested optimization—useful mental model for continual learning.

  • Learning to Learn by Gradient Descent by Gradient Descent

    Meta-learning classic: an optimizer learned by gradient descent—still cited everywhere.

  • Meta-Learning in Neural Networks: A Survey

    Wide-angle map of the field if you want references more than a single trick.
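The "more than one clock speed" idea is easier to hold onto with a toy in hand. Here's a minimal two-timescale sketch (names and setup are mine, purely illustrative): fast weights adapt every step, while a slow outer loop tunes the learning rate itself based on how each episode went.

```python
# Toy nested loop: the inner loop runs plain gradient descent on f(w) = w^2;
# the outer (slow) loop treats the learning rate as the learned quantity,
# growing it while episodes keep improving and backing off otherwise.
def inner_episode(lr, w0=5.0, steps=20):
    w = w0
    for _ in range(steps):
        grad = 2 * w           # gradient of f(w) = w**2
        w -= lr * grad
    return w * w               # final loss for this episode

lr, best = 0.01, float("inf")
for _ in range(30):            # outer loop over episodes
    loss = inner_episode(lr)
    if loss < best:
        best, lr = loss, lr * 1.5   # improving: take bigger steps
    else:
        lr /= 2                      # got worse: back off
```

Real nested/meta-learning systems differentiate through the inner loop rather than using this crude grow-or-halve heuristic, but the two-clock structure is the same.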

Speculative Decoding / Draft Models

Faster generation by guessing ahead—core ideas behind modern fast inference.

  • Accelerating Large Language Model Decoding with Speculative Sampling

    Foundational speculative sampling paper—how a small draft model speeds up a large one.

  • Speculative Decoding with Draft Models (Original Paper)

    Pairs draft and target models explicitly—read after the sampling paper above.

  • SpecExtend: Scaling Speculative Decoding to Long Contexts

    Pushes speculative ideas into long-context regimes where latency really hurts.

  • Dynamic Depth Decoding for Efficient LLM Inference

    Adaptive depth during decoding—another lever beyond pure draft models.

EAGLE / Advanced Speculative Heads

Learned drafting and bringing research into production stacks.

  • EAGLE-3: Efficient Accelerated Generation via Learned Drafting

    State-of-the-art learned drafting—dense, but the figures repay the time.

  • From Research to Production: Accelerate OSS LLMs with EAGLE-3 on Vertex AI

    How Google ships these ideas—good if you care about serving, not just theory.

Long Context / Memory / RLM-Style Ideas

When attention windows need to stretch beyond the usual.

  • Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context

    Segment-level recurrence for longer dependencies—still a useful baseline to know.

  • LongNet: Scaling Transformers to 1,000,000,000 Tokens

    Dilated attention at extreme lengths—skim the method, stare at the scaling plots.

N-grams / DeepSeek / Classical Foundations

From count-based LMs to modern reasoning-focused training.

  • A Tutorial on N-gram Language Models

    Jurafsky & Martin’s PDF—if you only read one classical LM chapter, make it this.

  • DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    How RL shaped a model people actually talk about—worth it for the training story.
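Before the RL-shaped giants, the count-based models in the Jurafsky & Martin chapter fit in a dozen lines. A minimal bigram sketch (toy code, not from the tutorial):

```python
from collections import defaultdict, Counter

# Tiny count-based bigram model:
# P(w_i | w_{i-1}) = count(w_{i-1}, w_i) / count(w_{i-1}).
def train_bigram(corpus):
    counts = defaultdict(Counter)
    for sentence in corpus:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            counts[prev][cur] += 1
    return counts

def bigram_prob(counts, prev, cur):
    total = sum(counts[prev].values())
    return counts[prev][cur] / total if total else 0.0

counts = train_bigram(["the cat sat", "the dog sat"])
p = bigram_prob(counts, "the", "cat")   # "the" is followed by cat or dog
```

Everything after this in the chapter (smoothing, backoff, perplexity) is about fixing what happens when a count is zero.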

Optimizer Discovery / RL for Training Rules

When the optimizer itself becomes the learned artifact.

  • Learning to Optimize

    Learns optimization algorithms as policies—foundational for “optimizer as network” work.

  • Discovering Optimization Algorithms via Reinforcement Learning

    RL search over update rules—wild to see hand-designed optimizers emerge from scratch.

Data / Datasets / Web Corpora

What goes into large models before the architecture even shows up.

  • FineWeb: A New Large-Scale Web Dataset

    High-quality web data at scale—useful context for “what’s in the pile.”

  • The Common Crawl Dataset

    The raw web snapshot pipeline everyone builds on—skim for scope, not polish.

  • The Pile: An 800GB Dataset of Diverse Text

    Landmark heterogeneous text dump—still a reference for data diversity arguments.

  • NVIDIA NeMo Data Curation Overview

    Practical curation patterns from NVIDIA—good when you’re building pipelines, not papers.

Flash / Systems / Acceleration

When decoding speed is the product feature.

  • dFlash: Fast and Accurate LLM Decoding

    Z-Lab’s project page—tight write-up if you’re chasing latency wins.

Articles from the archive you won’t regret reading

Non-ML essays that stuck—agency, luck, and how people grow.

  • How to Time Travel

    Brian Chesky on learning from the future—short, memorable, oddly practical.

  • High Agency

    George Mack’s lens on people who bend reality—pair with any career essay.

  • How to Get Lucky

    Taylor Pearson on increasing surface area for luck—better than another hustle thread.

  • Childhoods of Exceptional People

    Henrik Karlsson’s deep dive—slow read, big payoff if you like biographical patterns.

Earlier readings

Old favorites still on the stack.

  • Attention Is All You Need

    The transformer paper—still the right place to start if you’ve only used the API.

  • The Curse of Dimensionality

    Hinton’s IJCNN piece—short PDF if you want classical intuition for high-D geometry.