Reading List
Papers or videos that folks at fundamental.ai liked!
General understanding of models
Amazing presentation on scientific studies of how and why LLMs learn knowledge and how they reason. Controlled experiments with 100M-parameter models trained on controlled datasets.
Nuggets on Knowledge:
- knowledge augmentation (describing the same information in various ways) is critical for the model to retain knowledge
- knowledge inverse search does not work unless the dataset is provided in that form (facts are not retrievable in the reverse order of how they appeared in training)
- scaling law: ~2 bits of information content per parameter (when each piece of information has been exposed ~1000 times)
- junk in the pre-training data significantly harms knowledge capacity. Fix: prepend a token denoting the source of the information to every data paragraph (sketched below)
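A minimal sketch of that fix, assuming a plain-text corpus; the token format and source names here are illustrative, not from the paper:

```python
# Hypothetical sketch: prepend a source token to every training paragraph
# so the model can learn which sources carry reliable knowledge.
# The "<src:...>" format is an assumption, not the paper's exact scheme.

def tag_with_source(paragraphs, source):
    """Prefix each paragraph with a special token naming its origin."""
    return [f"<src:{source}> {p}" for p in paragraphs]

wiki = tag_with_source(["Paris is the capital of France."], "wikipedia")
junk = tag_with_source(["u wont BELIEVE what paris is capital of"], "webcrawl")
training_corpus = wiki + junk
print(training_corpus)
```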
Nuggets on Reasoning:
- LLMs can learn to solve math problems, rather than just memorizing solution templates
- If you add mistakes to the reasoning training set together with a "go back" token, the LLM learns to avoid (and recover from) the mistakes.
- rotary embedding / relative attention is crucial for learning complex structure; even simple relative positional attention helps more than absolute positional attention (see the RoPE sketch below)
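Since rotary embeddings keep coming up, here is a minimal, self-contained sketch (this uses the rotate-half pairing convention; shapes and head dimension are illustrative):

```python
import torch

def rotary_embed(x, base=10000.0):
    """Apply rotary position embeddings (RoPE) to a (seq_len, dim) tensor;
    dim must be even. Each channel pair is rotated by an angle proportional
    to the token position, so query-key dot products end up depending only
    on relative position."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(0, half) / half)            # (half,)
    angles = torch.arange(seq_len)[:, None] * freqs[None, :]   # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(8, 64)          # 8 positions, one 64-dim head
print(rotary_embed(q).shape)    # torch.Size([8, 64])
```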
Lots of nuggets.
- Changing the random seed causes up to 4% variance on benchmarks
- Value of synthetic training sets
- 5 basic training sets focusing on specific tasks (Depo: reasoning depth, Brevo: reasoning breadth, Capo: knowledge capacity, Mano: knowledge manipulation, Lano: hierarchical structure)
- Canon layers boost Transformer reasoning depth 2-4× (minimal sketch below)
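A sketch of the idea as I read it: each token's hidden state is augmented with a learned weighted sum of a few preceding tokens' states, which can be implemented as a causal depthwise convolution with a residual connection. Kernel size and placement here are assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class CanonLayer(nn.Module):
    """Sketch of a Canon-style layer: mix each token with the K previous
    tokens via a causal depthwise 1D convolution, added residually.
    kernel=4 is an illustrative choice, not the paper's setting."""
    def __init__(self, dim, kernel=4):
        super().__init__()
        self.kernel = kernel
        self.conv = nn.Conv1d(dim, dim, kernel, groups=dim, bias=False)

    def forward(self, x):                   # x: (batch, seq, dim)
        h = x.transpose(1, 2)               # (batch, dim, seq)
        h = nn.functional.pad(h, (self.kernel - 1, 0))  # pad left = causal
        return x + self.conv(h).transpose(1, 2)

x = torch.randn(2, 16, 32)
print(CanonLayer(32)(x).shape)  # torch.Size([2, 16, 32])
```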
Very simple conclusion: the vast majority of weights are imprecise, and changing any one of them has negligible impact on quality, BUT a few (0.05%) are super critical for quality (the "super-weights"). Lots of implications for optimizing inference (toy probe below).
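The paper locates super weights via activation outliers; the brute-force ablation below is only a toy illustration of the "most single weights barely matter, a few matter a lot" effect, on a random linear map rather than a real model:

```python
import torch

# Zero out one weight at a time and measure how much the output moves.
torch.manual_seed(0)
W = torch.randn(8, 8)
x = torch.randn(8)
baseline = W @ x

impacts = []
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        W2 = W.clone()
        W2[i, j] = 0.0                      # ablate a single weight
        impacts.append(((i, j), (W2 @ x - baseline).norm().item()))

impacts.sort(key=lambda t: -t[1])
print("most critical single weight:", impacts[0])
print("least critical single weight:", impacts[-1])
```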
A very clear explanation of why large models perform better than small models despite overfitting.
SVD as a compression technique.
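A minimal sketch of the mechanics: keep the top-r singular values and replace one (m × n) matrix with two thin factors. A random matrix is not actually low-rank, so the error here is large; real weight matrices typically compress much better:

```python
import torch

torch.manual_seed(0)
W = torch.randn(256, 256)
U, S, Vh = torch.linalg.svd(W, full_matrices=False)

r = 32                                   # retained rank
W_low = U[:, :r] @ torch.diag(S[:r]) @ Vh[:r, :]

orig_params = W.numel()
compressed_params = U[:, :r].numel() + r + Vh[:r, :].numel()
print(f"params: {orig_params} -> {compressed_params}")
print(f"relative error: {(W - W_low).norm() / W.norm():.3f}")
```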
Bridging Lottery Ticket and Grokking: Is Weight Norm Sufficient to Explain Delayed Generalization?
The paper is too dense to read in detail, but it offers a plausible explanation of grokking and delayed generalization: after the overfitting stage, among the multitude of 'circuits' that produce good answers, one that generalizes better eventually emerges.
Models in the GPT family have a capacity of ~3.6 bits per parameter. That's a slightly different result from "Physics of Language Models" (2 bits per parameter), but the procedure is different; notably, they stop the measurement before generalization happens.
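Back-of-envelope consequence of that figure: at ~3.6 bits per parameter, a 1B-parameter model can memorize at most about 3.6 × 10⁹ bits ≈ 450 MB of raw training data.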
Lots of fundamental notions on how the brain works. Mostly for background and historical perspective.
Reasoning
A very practical and concrete video on how to train models to reason.
s1: Simple test-time scaling
Trains a model for CoT by imposing a thinking-token budget (control-flow sketch below)
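A sketch of the decode-time budget-forcing control flow: force more thinking by appending "Wait" when the model stops too early, and force a stop when the budget runs out. The model stub, token strings, and delimiter are placeholders, not a real API:

```python
END_THINK = "</think>"  # assumed end-of-thinking delimiter

class ToyModel:
    """Stand-in for a real LM: emits a canned chain of thought."""
    def __init__(self):
        self.stream = iter(["step1", "step2", END_THINK, "answer"])
    def next_token(self, _context):
        return next(self.stream, END_THINK)

def decode_with_budget(model, prompt, min_think=4, max_think=16):
    text, used = prompt, 0
    while used < max_think:
        tok = model.next_token(text)
        if tok == END_THINK:
            if used < min_think:
                text += " Wait"          # stopped too early: force more thinking
                used += 1
                continue
            break                        # within budget: accept the stop
        text += " " + tok
        used += 1
    return text + " " + END_THINK        # close thinking (forced if over budget)

print(decode_with_budget(ToyModel(), "Q: 2+2?"))
```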
This model iterates internally in high-dimensional latent space
A small model (7M parameters), specialized in reasoning, that uses a form of looped layers (minimal sketch below).
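A minimal sketch of the looped-layers idea, assuming weight tying: one small block is applied T times to refine the hidden state, so depth is reused instead of stacking new parameters. Dimensions and loop count are illustrative, not the paper's recipe:

```python
import torch
import torch.nn as nn

class LoopedBlock(nn.Module):
    """Apply the same residual block `loops` times (weights shared
    across iterations), trading parameters for iterative depth."""
    def __init__(self, dim=64, loops=8):
        super().__init__()
        self.loops = loops
        self.block = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, x):
        for _ in range(self.loops):
            x = x + self.block(x)   # same weights every iteration
        return x

print(LoopedBlock()(torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```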
Continuous Learning
- Add new neurons to layers without affecting what was learned.
- Initialize new neurons by setting their incoming weights to 0, so the layer output remains unchanged at the first pass (a function-preserving sketch follows this list).
- Maximize gradients on the new weights via a special initialization of the new neurons' outgoing weights (old neurons change minimally).
- A linear schedule of when/where to add neurons, without regard for training dynamics, leaves room for future work: how/when/where to add neurons optimally?
- Uses Monte Carlo Tree Search to decide where, and which, layers to add/remove, given a linear schedule.
- Works at the layer level, causing a short period of instability (since you cannot set entire layers to 0 the way you can single neurons).
- While they claim re-using existing connections, this is probably less effective than GradMax at retaining knowledge.
- Used a non-differentiable score function (accuracy) as the basis for tree exploration and saw improvement over using the loss function; this may be useful for score-based optimization more broadly. Instead of gradient-based parameter optimization via backprop, it explores an orthogonal route: optimizing the architecture choice.
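A minimal function-preserving growth sketch in the spirit of the notes above: add hidden units with zero incoming weights so the network's output is unchanged at first. The outgoing weights here are left random; GradMax instead picks them via an SVD to maximize the gradient on the new incoming weights:

```python
import torch
import torch.nn as nn

def grow_hidden(layer_in, layer_out, n_new):
    """Widen a two-layer MLP's hidden layer by n_new units without
    changing its function: new units get zero incoming weights/bias,
    so they output 0 and contribute nothing until trained."""
    d_in, d_hid = layer_in.in_features, layer_in.out_features
    d_out = layer_out.out_features

    new_in = nn.Linear(d_in, d_hid + n_new)
    new_out = nn.Linear(d_hid + n_new, d_out)
    with torch.no_grad():
        new_in.weight[:d_hid] = layer_in.weight    # copy old weights
        new_in.bias[:d_hid] = layer_in.bias
        new_out.weight[:, :d_hid] = layer_out.weight
        new_out.bias[:] = layer_out.bias
        new_in.weight[d_hid:] = 0.0                # new units start silent
        new_in.bias[d_hid:] = 0.0
    return new_in, new_out

f1, f2 = nn.Linear(4, 8), nn.Linear(8, 3)
g1, g2 = grow_hidden(f1, f2, n_new=2)
x = torch.randn(5, 4)
old = f2(torch.relu(f1(x)))
new = g2(torch.relu(g1(x)))
print(torch.allclose(old, new, atol=1e-6))  # True: function preserved
```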
Useful Papers
Useful metrics that capture the evolution of the neural net.
Interesting slides with overview of training LLMs.
Our Contributions
Why AI works so well
The blessing of high dimensionality that makes neural nets really work.
Why Attention works so well
Attention and SVD, a different view of Attention.
Why ReLU works so well
ReLU and change of coordinates