Reading List
Papers or videos that folks at fundamental.ai liked!
General understanding of models
Amazing presentation on scientific studies of how and why LLMs learn knowledge and how they reason. Controlled experiments with 100M-parameter models trained on controlled datasets.
Nuggets on Knowledge:
- knowledge augmentation (describing the same information in various ways) is critical for the model to retain knowledge
- knowledge inverse search does not work unless the dataset is provided in that form (facts are not retrievable in the reverse order of how they appeared in training)
- scaling law: ~2 bits of information content per parameter (when each piece of information has been exposed ~1000 times)
- junk in the pre-training data significantly harms knowledge capacity. Fix: prepend a token denoting the source of the information to every data paragraph (sketched below)
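A minimal sketch of that fix, assuming a plain-text corpus; the token format and source names here are illustrative, not from the paper:

```python
# Hypothetical sketch: prepend a source token to every training paragraph
# so the model can learn which sources carry reliable knowledge.
# The "<src:...>" format is an assumption, not the paper's exact scheme.

def tag_with_source(paragraphs, source):
    """Prefix each paragraph with a special token naming its origin."""
    return [f"<src:{source}> {p}" for p in paragraphs]

wiki = tag_with_source(["Paris is the capital of France."], "wikipedia")
junk = tag_with_source(["u wont BELIEVE what paris is capital of"], "webcrawl")
training_corpus = wiki + junk
print(training_corpus)
```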
Nuggets on Reasoning:
- LLMs can learn to solve math problems, rather than just memorizing solution templates
- If you add mistakes to the reasoning training set together with a "go back" token, the LLM learns to avoid (and recover from) the mistakes.
- rotary embedding / relative attention is crucial for learning complex structure; even simple relative positional attention helps more than absolute positional attention (see the RoPE sketch below)
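Since rotary embeddings keep coming up, here is a minimal, self-contained sketch (this uses the rotate-half pairing convention; shapes and head dimension are illustrative):

```python
import torch

def rotary_embed(x, base=10000.0):
    """Apply rotary position embeddings (RoPE) to a (seq_len, dim) tensor;
    dim must be even. Each channel pair is rotated by an angle proportional
    to the token position, so query-key dot products end up depending only
    on relative position."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(0, half) / half)            # (half,)
    angles = torch.arange(seq_len)[:, None] * freqs[None, :]   # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = torch.randn(8, 64)          # 8 positions, one 64-dim head
print(rotary_embed(q).shape)    # torch.Size([8, 64])
```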
Lots of nuggets.
- Changing the random seed causes up to 4% variance on benchmarks
- Value of synthetic training sets
- 5 basic training sets focusing on specific tasks (Depo: reasoning depth, Brevo: reasoning breadth, Capo: knowledge capacity, Mano: knowledge manipulation, Lano: hierarchical structure)
- Canon layers boost Transformer reasoning depth 2-4× (minimal sketch below)
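A sketch of the idea as I read it: each token's hidden state is augmented with a learned weighted sum of a few preceding tokens' states, which can be implemented as a causal depthwise convolution with a residual connection. Kernel size and placement here are assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class CanonLayer(nn.Module):
    """Sketch of a Canon-style layer: mix each token with the K previous
    tokens via a causal depthwise 1D convolution, added residually.
    kernel=4 is an illustrative choice, not the paper's setting."""
    def __init__(self, dim, kernel=4):
        super().__init__()
        self.kernel = kernel
        self.conv = nn.Conv1d(dim, dim, kernel, groups=dim, bias=False)

    def forward(self, x):                   # x: (batch, seq, dim)
        h = x.transpose(1, 2)               # (batch, dim, seq)
        h = nn.functional.pad(h, (self.kernel - 1, 0))  # pad left = causal
        return x + self.conv(h).transpose(1, 2)

x = torch.randn(2, 16, 32)
print(CanonLayer(32)(x).shape)  # torch.Size([2, 16, 32])
```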
Very simple conclusion: the vast majority of weights are imprecise, and changing any one of them has negligible impact on quality, BUT a few (0.05%) are super critical for quality (the "super-weights"). Lots of implications for optimizing inference (toy probe below).
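The paper locates super weights via activation outliers; the brute-force ablation below is only a toy illustration of the "most single weights barely matter, a few matter a lot" effect, on a random linear map rather than a real model:

```python
import torch

# Zero out one weight at a time and measure how much the output moves.
torch.manual_seed(0)
W = torch.randn(8, 8)
x = torch.randn(8)
baseline = W @ x

impacts = []
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        W2 = W.clone()
        W2[i, j] = 0.0                      # ablate a single weight
        impacts.append(((i, j), (W2 @ x - baseline).norm().item()))

impacts.sort(key=lambda t: -t[1])
print("most critical single weight:", impacts[0])
print("least critical single weight:", impacts[-1])
```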
A very clear explanation of why large models perform better than small models despite overfitting.
SVD as a compression technique.
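A minimal sketch of the mechanics: keep the top-r singular values and replace one (m × n) matrix with two thin factors. A random matrix is not actually low-rank, so the error here is large; real weight matrices typically compress much better:

```python
import torch

torch.manual_seed(0)
W = torch.randn(256, 256)
U, S, Vh = torch.linalg.svd(W, full_matrices=False)

r = 32                                   # retained rank
W_low = U[:, :r] @ torch.diag(S[:r]) @ Vh[:r, :]

orig_params = W.numel()
compressed_params = U[:, :r].numel() + r + Vh[:r, :].numel()
print(f"params: {orig_params} -> {compressed_params}")
print(f"relative error: {(W - W_low).norm() / W.norm():.3f}")
```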
Bridging Lottery Ticket and Grokking: Is Weight Norm Sufficient to Explain Delayed Generalization?
The paper is too dense to read in detail, but it offers a plausible explanation of grokking and delayed generalization: after the overfitting stage, among the multitude of 'circuits' that produce good answers, one that generalizes better eventually emerges.
Models in the GPT family have a capacity of ~3.6 bits per parameter. That's a slightly different result from "Physics of Language Models" (2 bits per parameter), but the procedure is different; notably, they stop the measurement before generalization happens.
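Back-of-envelope consequence of that figure: at ~3.6 bits per parameter, a 1B-parameter model can memorize at most about 3.6 × 10⁹ bits ≈ 450 MB of raw training data.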
Lots of fundamental notions on how the brain works. Mostly for background and historical perspective.
Reasoning
A very practical and concrete video on how to train models to reason.
s1: Simple test-time scaling
Trains a model for CoT by imposing a thinking-token budget (control-flow sketch below)
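A sketch of the decode-time budget-forcing control flow: force more thinking by appending "Wait" when the model stops too early, and force a stop when the budget runs out. The model stub, token strings, and delimiter are placeholders, not a real API:

```python
END_THINK = "</think>"  # assumed end-of-thinking delimiter

class ToyModel:
    """Stand-in for a real LM: emits a canned chain of thought."""
    def __init__(self):
        self.stream = iter(["step1", "step2", END_THINK, "answer"])
    def next_token(self, _context):
        return next(self.stream, END_THINK)

def decode_with_budget(model, prompt, min_think=4, max_think=16):
    text, used = prompt, 0
    while used < max_think:
        tok = model.next_token(text)
        if tok == END_THINK:
            if used < min_think:
                text += " Wait"          # stopped too early: force more thinking
                used += 1
                continue
            break                        # within budget: accept the stop
        text += " " + tok
        used += 1
    return text + " " + END_THINK        # close thinking (forced if over budget)

print(decode_with_budget(ToyModel(), "Q: 2+2?"))
```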
This model iterates internally in high-dimensional latent space
A small model (7M parameters), specialized in reasoning, that uses a form of looped layers (minimal sketch below).
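A minimal sketch of the looped-layers idea, assuming weight tying: one small block is applied T times to refine the hidden state, so depth is reused instead of stacking new parameters. Dimensions and loop count are illustrative, not the paper's recipe:

```python
import torch
import torch.nn as nn

class LoopedBlock(nn.Module):
    """Apply the same residual block `loops` times (weights shared
    across iterations), trading parameters for iterative depth."""
    def __init__(self, dim=64, loops=8):
        super().__init__()
        self.loops = loops
        self.block = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim)
        )

    def forward(self, x):
        for _ in range(self.loops):
            x = x + self.block(x)   # same weights every iteration
        return x

print(LoopedBlock()(torch.randn(2, 64)).shape)  # torch.Size([2, 64])
```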
Continuous Learning
- Add new neurons to layers without affecting what was learned.
- Initialize new neurons by setting their incoming weights to 0, so the layer output remains unchanged at the first pass (a function-preserving sketch follows this list).
- Maximize gradients on the new weights via a special initialization of the new neurons' outgoing weights (old neurons change minimally).
- A linear schedule of when/where to add neurons, without regard for training dynamics, leaves room for future work: how/when/where to add neurons optimally?
- Uses Monte Carlo Tree Search to decide where, and which, layers to add/remove, given a linear schedule.
- Works at the layer level, causing a short period of instability (since you cannot set entire layers to 0 the way you can single neurons).
- While they claim re-using existing connections, this is probably less effective than GradMax at retaining knowledge.
- Used a non-differentiable score function (accuracy) as the basis for tree exploration and saw improvement over using the loss function; this may be useful for score-based optimization more broadly. Instead of gradient-based parameter optimization via backprop, it explores an orthogonal route: optimizing the architecture choice.
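A minimal function-preserving growth sketch in the spirit of the notes above: add hidden units with zero incoming weights so the network's output is unchanged at first. The outgoing weights here are left random; GradMax instead picks them via an SVD to maximize the gradient on the new incoming weights:

```python
import torch
import torch.nn as nn

def grow_hidden(layer_in, layer_out, n_new):
    """Widen a two-layer MLP's hidden layer by n_new units without
    changing its function: new units get zero incoming weights/bias,
    so they output 0 and contribute nothing until trained."""
    d_in, d_hid = layer_in.in_features, layer_in.out_features
    d_out = layer_out.out_features

    new_in = nn.Linear(d_in, d_hid + n_new)
    new_out = nn.Linear(d_hid + n_new, d_out)
    with torch.no_grad():
        new_in.weight[:d_hid] = layer_in.weight    # copy old weights
        new_in.bias[:d_hid] = layer_in.bias
        new_out.weight[:, :d_hid] = layer_out.weight
        new_out.bias[:] = layer_out.bias
        new_in.weight[d_hid:] = 0.0                # new units start silent
        new_in.bias[d_hid:] = 0.0
    return new_in, new_out

f1, f2 = nn.Linear(4, 8), nn.Linear(8, 3)
g1, g2 = grow_hidden(f1, f2, n_new=2)
x = torch.randn(5, 4)
old = f2(torch.relu(f1(x)))
new = g2(torch.relu(g1(x)))
print(torch.allclose(old, new, atol=1e-6))  # True: function preserved
```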
Useful Papers
Useful metrics that capture the evolution of the neural net.
Interesting slides with overview of training LLMs.
Our Contributions
Why AI works so well
The blessing of high dimensionality that makes neural nets really work.
Why Attention works so well
Attention and SVD, a different view of Attention.
Why ReLU works so well
ReLU and change of coordinates