TODO
In no particular order:
- Is Muon explained correctly? Unsure about this; check against the rough sketch after this list.
- Are skip connections explained correctly?
- More information about the researchers, their environments, the motivations for each innovation
- Do something interpretability-related, like PCA of hidden activations or an ablation
- Break up final chapters into training and evaluation parts?
- Where does the precision chapter belong?
- Move KV cache chapter back?
- Check and improve the background on backprop, e.g. who invented it and how.
- In the attention chapter, explain the connection between dot product and angle? (Tiny example after this list.)
- Andy and Ning said the rotary embedding chapter is confusing.
- Re rotary embeddings, check whether I’m right about this: as each layer learns to emit queries and keys from its input, the model figures out that putting more or less emphasis on certain dimension pairs within a head gives more or less sensitivity to relative position.
- Also re rotary embeddings, show the exact calculation? Maybe in a table? (A rough draft of the calculation follows this list.)
- Where I mention MLP (or somewhere nearby), give more background on where the term comes from? Is it a big gap to never mention neurons or perceptrons, or to give no biological intuition?
- The LSAT AR example question I showed has 4 choices, but we recenter based on 5 (20%), and from some digging it looks like “logic games” questions should have 5 choices. Check whether I missed something (recentering sketch after this list).
- Am I explaining Adam correctly? Is signal-to-noise the best way to think about it? Show a graphic with gradient vectors on a surface with and without Adam? (Update-rule sketch after this list.)
- Also re Adam, is this true? “It overshoots and then corrects, which may be overkill here but is useful for not getting stuck at less than optimal parameters.”
- Concluding chapter
- For the examples of right/wrong answers from CORE, show the probabilities for Partridge Family vs. Duel.
- Is there research showing which capabilities benefit from a model being able to spell?
- Where did ChatCORE come from? Did Karpathy pick a reasonable mix of tasks or did it come from a paper?
- Am I right about where the HumanEval name comes from?
- Clean up all diagrams
- Mention somewhere the importance of going through training conversations in random order? I mention it in passing in SFT but never say why.
- In the checkpoints paragraph (or somewhere nearby), show examples of publicly available checkpoints?
- Add photo credits
- Add rest of further reading and resources
- Change x to weight in first backprop example?
- Any point in mentioning AdamW?
- From feedback: explain why the phases of training after base training don’t overwrite or lose too much of what the base model learned.
- From feedback: “Like absolute value, it will take negative numbers and make them positive, and it also has two other advantages which aren’t that important right now.” ← implies I will address them later but never do.
- From feedback: is there a tipping point at which it becomes possible to do RL? Explain…
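
Sketch for the Muon item, to check the explanation against. This is not the real implementation: it uses the textbook cubic Newton-Schulz iteration where Muon uses a tuned quintic polynomial, and the momentum convention here is an assumption.

```python
import numpy as np

def muon_update(grad, momentum_buf, beta=0.95, ns_steps=5):
    """Hypothetical sketch of the Muon idea for one 2-D weight matrix:
    smooth the gradient with momentum, then approximately orthogonalize it,
    so the update pushes with roughly equal strength in every direction
    instead of being dominated by a few large singular values."""
    momentum_buf = beta * momentum_buf + (1 - beta) * grad  # EMA momentum (convention assumed)
    X = momentum_buf / (np.linalg.norm(momentum_buf) + 1e-7)  # scale so singular values <= 1
    for _ in range(ns_steps):
        # Cubic Newton-Schulz step: pushes every singular value of X toward 1.
        X = 1.5 * X - 0.5 * X @ X.T @ X
    return X, momentum_buf  # X is the (approximately orthogonal) update direction

# In a training loop, the weight matrix would then be updated roughly as: W -= lr * X
```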
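Possible micro-example for the dot product / angle item: the dot product is the product of the two lengths times the cosine of the angle between them, which is why it reads as a similarity score.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 0.5, -1.0])

dot = np.dot(a, b)  # 1*2 + 2*0.5 + 3*(-1) = 0.0
cos_theta = dot / (np.linalg.norm(a) * np.linalg.norm(b))
print(dot, np.degrees(np.arccos(cos_theta)))  # 0.0 and 90 degrees: perpendicular
# dot(a, b) = |a| * |b| * cos(angle): large and positive when the vectors
# point the same way, near zero when perpendicular, negative when opposed.
```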
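Draft of the “exact calculation” for the rotary embedding items. Pairing consecutive dimensions is one common convention (some implementations pair dimension i with i + d/2); the point to demonstrate is that the score between a rotated query and key depends only on the offset between their positions.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate each consecutive (even, odd) dimension pair of a query/key
    vector by an angle proportional to the token position. Pair i uses
    frequency base**(-2*i/d): early pairs spin quickly, later pairs slowly."""
    d = x.shape[-1]
    out = np.empty_like(x)
    for i in range(d // 2):
        angle = pos * base ** (-2 * i / d)
        c, s = np.cos(angle), np.sin(angle)
        x0, x1 = x[2 * i], x[2 * i + 1]
        out[2 * i], out[2 * i + 1] = x0 * c - x1 * s, x0 * s + x1 * c
    return out

# What makes it "relative": the dot product of a rotated query and key
# depends only on the offset between positions, not the positions themselves.
rng = np.random.default_rng(0)
q, k = rng.standard_normal(8), rng.standard_normal(8)
print(np.dot(rope(q, 5), rope(k, 3)))      # positions 5 and 3 (offset 2)
print(np.dot(rope(q, 105), rope(k, 103)))  # same offset of 2 -> same score
```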
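Sketch for the LSAT AR recentering item, assuming CORE uses the usual centered-accuracy rescaling (random guessing maps to 0, a perfect score to 1); the exact formula should be verified against the CORE/DCLM code.

```python
def centered_accuracy(raw_accuracy, num_choices):
    """Rescale raw multiple-choice accuracy so that random guessing scores
    0.0 and a perfect score is 1.0 (assumed to be CORE's centering)."""
    chance = 1.0 / num_choices
    return (raw_accuracy - chance) / (1.0 - chance)

# The discrepancy in the TODO item: a question shown with 4 choices implies
# a 25% chance baseline, but recentering against 5 choices assumes 20%.
print(centered_accuracy(0.40, 4))  # 0.20
print(centered_accuracy(0.40, 5))  # 0.25
```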
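Sketch for the Adam items: the two moving averages are what the signal-to-noise framing refers to, since the step size scales like m / sqrt(v).

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update (t counts steps starting at 1). m is a moving average
    of the gradient (the signal); v is a moving average of the squared
    gradient, so sqrt(v) is roughly the gradient's typical magnitude (the
    noise scale). The step is proportional to m / sqrt(v): bold when the
    gradient is consistent, cautious when it is noisy, regardless of its
    raw size."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)  # bias correction: m and v start at zero
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```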