Select figures

I had fun drawing / spreadsheeting / plotting / screenshotting 370 figures and tables. Here are a few that are useful and/or will give you a sense for the contents of How They Think: Satisfying my curiosity about how ChatGPT works.

Figure 16.1 from The whole model. The full Nanochat model.
Figure 9.3 from Generating text. Predictions for the tokens after “the” and “also.”
Figure 11.9 from Cracking open the transformer. The embeddings for 9 tokens projected from 1280 dimensions to 3 dimensions.
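For the curious, here is one way to do that kind of projection outside the book: PCA via NumPy's SVD. The projection method and the random embeddings below are illustrative stand-ins; the book's figure may be produced differently.

```python
# A minimal sketch of projecting token embeddings from 1280 dimensions down to 3
# using PCA (via SVD). The method and the random data are illustrative choices,
# not the book's code.
import numpy as np

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(9, 1280))   # stand-in for the embeddings of 9 tokens

centered = embeddings - embeddings.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
projected = centered @ Vt[:3].T           # keep the 3 directions with the most variance
print(projected.shape)                    # (9, 3)
```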
Figure 14.4 from Attention. While making a token prediction for position 3, the model must not be allowed to cheat by using information from positions 4 or 5.
Figure 14.1 from Attention. The keys for positions 1–5 in black and the query for position 5 in red.
Figure 14.6 from Attention. Example scaled dot product attention calculation.
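If you want to play with the calculation in Figures 14.1–14.6 outside the book, here is a minimal NumPy sketch of scaled dot product attention with a causal mask, so that position 3 cannot use information from positions 4 or 5. The shapes and variable names are illustrative choices, not the book's code.

```python
# A minimal NumPy sketch of scaled dot-product attention with a causal mask.
# Shapes and variable names are illustrative, not taken from the book.
import numpy as np

def causal_attention(Q, K, V):
    """Q, K, V: (seq_len, d) arrays of queries, keys, and values."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)           # similarity of each query with each key
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores[mask] = -np.inf                  # position 3 cannot attend to positions 4 or 5
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                      # weighted mix of the values

rng = np.random.default_rng(0)
Q = rng.normal(size=(5, 8))   # queries for positions 1–5
K = rng.normal(size=(5, 8))   # keys for positions 1–5
V = rng.normal(size=(5, 8))
print(causal_attention(Q, K, V).shape)     # (5, 8)
```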
Figure 17.4 from KV cache. A BASIC loop that multiplies 0.1234567 by 0.9876543 1000 times on an Apple IIe.
Figure 19.6 from Now is it time to train? First: optimizers. Out-of-control gradient descent.
Figure 20.2 from Adam. First 100 steps of gradient descent controlled by the Adam optimizer with an intentionally huge learning rate.
Figure 21.9 from Muon. Training with a horrible loss function for 100 steps, with the weights updated by subtracting a small multiple of the orthogonalized gradient.
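For readers who like seeing the update rule as code, here is a minimal sketch of the Adam step behind Figure 20.2, run for 100 steps on a toy quadratic loss. The hyperparameters and the loss function are illustrative defaults, not the settings used for the book's plots.

```python
# A minimal sketch of one Adam update step for a single parameter array.
# Hyperparameter values are illustrative defaults, not the book's settings.
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad           # exponential moving average of gradients
    v = beta2 * v + (1 - beta2) * grad**2        # exponential moving average of squared gradients
    m_hat = m / (1 - beta1**t)                   # bias correction for early steps
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # scaled update keeps step sizes under control
    return w, m, v

w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 101):        # 100 steps on a toy quadratic loss
    grad = 2 * w               # gradient of ||w||^2
    w, m, v = adam_step(w, grad, m, v, t)
print(w)                       # parameters have moved a little toward the minimum at 0
```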
Figure 23.6 from Multiple GPUs. Each GPU is responsible for optimizing a portion of the parameters.
Figure 24.2 from Time to train! Base training. GPU temperature during training.
Figure 24.14 from Time to train! Base training. CORE metric measured every 2000 steps during base training.
Figure 24.20 from Time to train! Base training. Probabilities predicted by the model for the multiple-choice answers to “Which explains how the epithelium offers protection to land-dwelling vertebrates?”
Figure 26.1 from Opposable thumbs: the tool dance. The engine surrounding the model calls the Python calculator and appends the result to the token stream for the model to read.
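The tool dance is easy to sketch in a few lines. The marker strings and the toy model below are hypothetical placeholders, not the book's actual special tokens or engine API.

```python
# A hedged sketch of the "tool dance" in Figure 26.1: when the model emits a
# calculator call, the engine runs it and appends the result to the token stream.
# The <calc>/<result>/<end> markers and the toy model are hypothetical placeholders.

def toy_model(stream):
    """Stand-in for the language model: asks for the calculator once, then stops."""
    if "<result>" not in stream:
        return "What is 3 * 37? <calc>3 * 37</calc>"
    return "The answer is 111. <end>"

def run_with_tools(model, stream=""):
    while True:
        chunk = model(stream)                       # model keeps producing tokens
        stream += chunk
        if "<calc>" in chunk:                       # model asked for the calculator
            expr = chunk.split("<calc>")[1].split("</calc>")[0]
            result = eval(expr, {"__builtins__": {}})    # toy "Python calculator"
            stream += f"<result>{result}</result>"  # engine appends the result for the model to read
        if "<end>" in chunk:
            return stream

print(run_with_tools(toy_model))
```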
Figure 27.1 from Supervised fine-tuning. The user and assistant tokens in each of the four example conversations for supervised fine-tuning.
Figure 28.2 from Reinforcement learning. In reinforcement learning, we use the model to generate training data.
Table 28.7 from Reinforcement learning. Nine generated solutions for a single word problem.
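Here is a heavily simplified sketch of that idea: sample several solutions to one word problem and reward the ones that reach the correct answer. The sample() and extract_answer() helpers are hypothetical placeholders, and the reward scheme is one common choice rather than the book's exact recipe.

```python
# A hedged sketch of using the model to generate its own training data: sample
# several solutions to one problem and reward the ones that reach the correct
# answer. The helpers and reward scheme are illustrative, not the book's recipe.

def generate_training_data(sample, extract_answer, problem, correct_answer, n=9):
    examples = []
    for _ in range(n):                     # e.g. nine solutions for one word problem
        solution = sample(problem)         # model writes a full worked solution
        reward = 1.0 if extract_answer(solution) == correct_answer else 0.0
        examples.append((problem, solution, reward))
    return examples                        # rewarded samples become training data

# Toy usage with stand-in helpers:
data = generate_training_data(
    sample=lambda p: "2 + 2 = 4, so the answer is 4",
    extract_answer=lambda s: s.split("answer is ")[-1],
    problem="What is 2 + 2?",
    correct_answer="4",
)
print(data[0])
```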
Figure 29.2 from Hello model! User → assistant → user → assistant.
Figure 30.8 from Being precise about precision. The bits of a bfloat16 number.
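If you want to inspect those bits yourself, here is a small Python sketch that takes the top 16 bits of a float32, which gives the bfloat16 layout: 1 sign bit, 8 exponent bits, and 7 mantissa bits. (It truncates rather than rounds, for simplicity.)

```python
# A small sketch of the bfloat16 bit layout: 1 sign bit, 8 exponent bits,
# 7 mantissa bits — the top 16 bits of the equivalent float32.
import struct

def bfloat16_bits(x: float) -> str:
    (as_int,) = struct.unpack(">I", struct.pack(">f", x))  # float32 bit pattern
    top16 = as_int >> 16                                   # bfloat16 keeps the top half
    bits = f"{top16:016b}"
    return f"sign={bits[0]} exponent={bits[1:9]} mantissa={bits[9:]}"

print(bfloat16_bits(1.0))     # sign=0 exponent=01111111 mantissa=0000000
print(bfloat16_bits(-0.5))    # sign=1 exponent=01111110 mantissa=0000000
```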