Further reading and resources
On the transformer and GPT architecture (chapters 6, 11, 13–17, and 22):
Original transformer: Vaswani, Ashish, Noam Shazeer, Niki Parmar, et al. “Attention Is All You Need.” arXiv:1706.03762. Preprint, arXiv, August 2, 2023. https://doi.org/10.48550/arXiv.1706.03762.
Nanochat: Karpathy, Andrej. “nanochat: The Best ChatGPT That $100 Can Buy.” GitHub, 2025. https://github.com/karpathy/nanochat.
RMSNorm: Zhang, Biao, and Rico Sennrich. “Root Mean Square Layer Normalization.” arXiv:1910.07467. Preprint, arXiv, October 16, 2019. https://doi.org/10.48550/arXiv.1910.07467.
Rotary Embeddings: Su, Jianlin, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. “RoFormer: Enhanced Transformer with Rotary Position Embedding.” arXiv:2104.09864. Preprint, arXiv, November 8, 2023. https://doi.org/10.48550/arXiv.2104.09864.
LayerNorm: Ba, Jimmy Lei, Jamie Ryan Kiros, and Geoffrey E. Hinton. “Layer Normalization.” arXiv:1607.06450. Preprint, arXiv, July 21, 2016. https://doi.org/10.48550/arXiv.1607.06450.
BatchNorm: Ioffe, Sergey, and Christian Szegedy. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift.” arXiv:1502.03167. Preprint, arXiv, March 2, 2015. https://doi.org/10.48550/arXiv.1502.03167.
Silberstein, Eric. “Tracing the Transformer in Diagrams.” Towards Data Science, November 7, 2024. https://towardsdatascience.com/tracing-the-transformer-in-diagrams-95dbeb68160c/.
Apple II floating point routines: “Woz 6502 Floating Point Routines,” from the Apple II Reference Manual (Red Book), January 1978, pages 94–95. http://6502.org/source/floats/wozfp3.txt.
PyTorch documentation: https://docs.pytorch.org/docs/stable/.
On Adam and Muon optimizers (chapters 20 and 21):
Adam: Kingma, Diederik P., and Jimmy Ba. “Adam: A Method for Stochastic Optimization.” arXiv:1412.6980. Preprint, arXiv, January 30, 2017. https://doi.org/10.48550/arXiv.1412.6980.
AdamW: Loshchilov, Ilya, and Frank Hutter. “Decoupled Weight Decay Regularization.” arXiv:1711.05101. Preprint, arXiv, January 4, 2019. https://doi.org/10.48550/arXiv.1711.05101.
Muon: Jordan, Keller, et al. “Muon: An Optimizer for Hidden Layers in Neural Networks.” Blog post, 2024. https://kellerjordan.github.io/posts/muon/.
On image recognition datasets and models (chapters 12 and 22):
ImageNet: Deng, Jia, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. “ImageNet: A Large-Scale Hierarchical Image Database.” In 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2009.
ImageNet challenge: Russakovsky, Olga, Jia Deng, Hao Su, et al. “ImageNet Large Scale Visual Recognition Challenge.” arXiv:1409.0575. Preprint, arXiv, January 30, 2015. https://doi.org/10.48550/arXiv.1409.0575.
AlexNet: Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet Classification with Deep Convolutional Neural Networks.” Communications of the ACM 60, no. 6 (2017): 84–90. https://doi.org/10.1145/3065386.
ResNet: He, Kaiming, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. “Deep Residual Learning for Image Recognition.” arXiv:1512.03385. Preprint, arXiv, December 10, 2015. https://doi.org/10.48550/arXiv.1512.03385.
Inception: Szegedy, Christian, Wei Liu, Yangqing Jia, et al. “Going Deeper with Convolutions.” arXiv:1409.4842. Preprint, arXiv, September 17, 2014. https://doi.org/10.48550/arXiv.1409.4842.
On training, training datasets, and evaluation datasets (chapters 18 and 24–28):
Chinchilla ratio: Hoffmann, Jordan, Sebastian Borgeaud, Arthur Mensch, et al. “Training Compute-Optimal Large Language Models.” arXiv:2203.15556. Preprint, arXiv, March 29, 2022. https://doi.org/10.48550/arXiv.2203.15556.
CORE: Li, Jeffrey, Alex Fang, Georgios Smyrnis, et al. “DataComp-LM: In Search of the Next Generation of Training Sets for Language Models.” arXiv:2406.11794. Preprint, arXiv, April 21, 2025. https://doi.org/10.48550/arXiv.2406.11794.
SQuAD: Rajpurkar, Pranav, Jian Zhang, Konstantin Lopyrev, and Percy Liang. “SQuAD: 100,000+ Questions for Machine Comprehension of Text.” arXiv:1606.05250. Preprint, arXiv, October 11, 2016. https://doi.org/10.48550/arXiv.1606.05250.
ARC: Clark, Peter, Isaac Cowhey, Oren Etzioni, et al. “Think You Have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge.” arXiv:1803.05457. Preprint, arXiv, March 14, 2018. https://doi.org/10.48550/arXiv.1803.05457.
AGIEval: Zhong, Wanjun, Ruixiang Cui, Yiduo Guo, et al. “AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models.” arXiv:2304.06364. Preprint, arXiv, September 18, 2023. https://doi.org/10.48550/arXiv.2304.06364.
HumanEval: Chen, Mark, Jerry Tworek, Heewoo Jun, et al. “Evaluating Large Language Models Trained on Code.” arXiv:2107.03374. Preprint, arXiv, July 14, 2021. https://doi.org/10.48550/arXiv.2107.03374.
SmolTalk: Allal, Loubna Ben, Anton Lozhkov, Elie Bakouch, et al. “SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model.” arXiv:2502.02737. Preprint, arXiv, February 4, 2025. https://doi.org/10.48550/arXiv.2502.02737.
Magpie: Xu, Zhangchen, Fengqing Jiang, Luyao Niu, et al. “Magpie: Alignment Data Synthesis from Scratch by Prompting Aligned LLMs with Nothing.” arXiv:2406.08464. Preprint, arXiv, October 7, 2024. https://doi.org/10.48550/arXiv.2406.08464.
MMLU: Hendrycks, Dan, Collin Burns, Steven Basart, et al. “Measuring Massive Multitask Language Understanding.” arXiv:2009.03300. Preprint, arXiv, January 12, 2021. https://doi.org/10.48550/arXiv.2009.03300.
My first time coding with a GPT model: Silberstein, Eric. “Playing with GPT-4: Writing Code.” Klaviyo Engineering Blog, March 25, 2023. https://klaviyo.tech/playing-with-gpt-4-writing-code-a137a7261655.