16


The whole model

Here are all the diagrams we walked through in chapters 11, 13, 14, and 15, all in one place.

Figure 16.1. The full model.

A good exercise that will help me and you wrap our heads around the whole model is to count the learned parameters. We’ll do it for my 20-layer model, which has a vocab size of 65,536, an embedding size of 1280, and 20 transformer blocks (i.e., V = 65,536, D = 1280, depth = 20). I suggest you try counting the learned parameters yourself before reading on. You’ll know you’re on the right track if the total is around half a billion. One other piece of information you’ll need that I may not have been explicit about is that there are no biases in any of the linear transformations in our model.

Okay, here goes! A batch of sequences of tokens enters the embed module. The embed module maps each token ID to an embedding of size 1280, and there are a total of V unique tokens. This means the embed module needs to learn a total of V × D parameters, so 65,536 × 1280 = 83,886,080. We’re well on our way to half a billion.
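
If you’d like to check that arithmetic directly, here’s a minimal sketch using PyTorch’s nn.Embedding as a stand-in for the embed module (the variable names are just illustrative):

```python
import torch.nn as nn

V, D = 65_536, 1280

# The embed module is, at its core, a lookup table with one row per token.
embed = nn.Embedding(V, D)

# A single weight matrix of shape (V, D) and no bias.
print(embed.weight.shape)    # torch.Size([65536, 1280])
print(embed.weight.numel())  # 83886080
```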

Now we go into the transformer block. Norm has no parameters. (There are other norming techniques that do involve learned parameters. Here, though, there are none. I’ll walk you through the exact calculation in chapter 22.)

In causal self attention, we do the Q, K, and V linear transformations. Each transformation goes from a vector of size 1280 to a vector of size 1280, so that’s 1280 × 1280 = 1,638,400. We have three of these, and 3 × 1,638,400 = 4,915,200. Splitting into heads involves no learned parameters; in fact, it’s not even an operation so much as a shift in how we index into our tensors. Rotary embed is sophisticated and beautiful but also involves no learned parameters. The same is true of scaled dot product attention: it’s the heart of how the transformer works, but it has no parameters of its own; the relevant learning takes place before and after. Rejoin heads also has no parameters. The linear transformation at the end of causal self attention is again from size 1280 to size 1280, so that’s another 1,638,400 parameters.
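
Here’s a sketch of just the learned pieces of causal self attention, with no biases as noted above (the names q_proj, k_proj, v_proj, and out_proj are illustrative, not necessarily what appears in my code):

```python
import torch.nn as nn

D = 1280

# The only learned parameters in causal self attention are four D -> D
# linear transformations: Q, K, V, and the output projection.
q_proj = nn.Linear(D, D, bias=False)
k_proj = nn.Linear(D, D, bias=False)
v_proj = nn.Linear(D, D, bias=False)
out_proj = nn.Linear(D, D, bias=False)

print(q_proj.weight.numel())  # 1638400 (= 1280 x 1280)

attn_params = sum(p.numel()
                  for m in (q_proj, k_proj, v_proj, out_proj)
                  for p in m.parameters())
print(attn_params)            # 6553600 (= 4,915,200 + 1,638,400)
```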

We’re up to the MLP in the transformer block. This is our classic sandwich of a linear transformation followed by ReLU (squared in this case) followed by a linear transformation. The first linear transformation goes from size 1280 to size 4 × 1280. The second goes from size 4 × 1280 back to size 1280. ReLU has no learned parameters. So that’s 1280 × 4 × 1280 + 4 × 1280 × 1280 = 13,107,200. As mentioned in chapter 13, a lot of learning is happening here!
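
And the same kind of sketch for the MLP (again, the names are illustrative):

```python
import torch.nn as nn

D = 1280

# The MLP widens to 4*D, applies the parameter-free squared ReLU,
# then projects back down to D. No biases anywhere.
up = nn.Linear(D, 4 * D, bias=False)    # 1280 -> 5120
down = nn.Linear(4 * D, D, bias=False)  # 5120 -> 1280

mlp_params = sum(p.numel() for layer in (up, down) for p in layer.parameters())
print(mlp_params)  # 13107200
```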

We’re now done counting the parameters in the first transformer block. That was 4,915,200 + 1,638,400 + 13,107,200 = 19,660,800. And remember we have 20 of them. 20 × 19,660,800 = 393,216,000.

And we have one last and critical linear transformation. The output of our final transformer block is of size 1280 and we project this to a distribution over our entire vocabulary. So that’s 1280 × 65,536 = 83,886,080.

So the overall total: 83,886,080 + 393,216,000 + 83,886,080 = 560,988,160.
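
Here’s that bookkeeping condensed into a few lines of arithmetic, so you can plug in other values of V, D, and depth and see how the total moves:

```python
V, D, depth = 65_536, 1280, 20

embed     = V * D            # token embedding table
attention = 4 * D * D        # Q, K, V, and the output projection
mlp       = 2 * D * (4 * D)  # up projection and down projection
block     = attention + mlp  # one transformer block (the norms add nothing)
unembed   = D * V            # final projection onto the vocabulary

total = embed + depth * block + unembed
print(total)  # 560988160
```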

I want to double-check the number against the model in code. With PyTorch that’s done as follows:

Figure 16.2. Counting the number of model parameters using code.

It matches!
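
If you’d like to run the same check on a model of your own, the standard PyTorch idiom looks something like this sketch (the tiny stand-in model is only there to make the snippet runnable):

```python
import torch.nn as nn

# Stand-in model; substitute whatever torch.nn.Module you want to measure.
model = nn.Sequential(
    nn.Embedding(65_536, 1280),
    nn.Linear(1280, 65_536, bias=False),
)

# Sum the element counts of every learned tensor registered on the model.
total_params = sum(p.numel() for p in model.parameters())
print(f"{total_params:,}")  # 167,772,160 for this stand-in
```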

Another useful way to reinforce things is to think through the dimensions of the tensors that go in and out of every box. I encourage you to try this before reading ahead. You can copy figure 16.1 and scribble on top of it. You can either use the capital letters (B for batch size, V for vocab size, and so on) or use example numbers.

I’m going to use example numbers. I’ll use my 20-layer model (V = 65,536, D = 1280) and the actual batch size (B = 32), sequence length (T = 2048), and number of heads (10) from when I trained it ahead of writing this book. This means my overall starting input is a batch of 32 sequences of 2048 tokens each, organized in a tensor of size 32×2048 and corresponding to about 200 pages of text (see table 8.4).
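
Concretely, that starting input is nothing more than a 32×2048 tensor of integer token IDs. As a sketch, with random IDs standing in for real text:

```python
import torch

B, T, V = 32, 2048, 65_536

# A batch of 32 sequences, each 2048 token IDs long.
tokens = torch.randint(0, V, (B, T))
print(tokens.shape)  # torch.Size([32, 2048])
print(tokens.dtype)  # torch.int64
```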

One more word of encouragement to try this on your own first. It won’t be an easy exercise. But if you struggle through, referring back to earlier chapters as needed, I guarantee you'll develop a stronger intuition, even if you have to give up on some of the especially tricky areas like scaled dot product attention.

Let’s start with the top level.

Figure 16.3. The size of the tensors going in and out of every module at the top level of the model.

Next the transformer block:

Figure 16.4. The tensors going in and out of every module within the transformer block.

Now causal self attention. This gets tricky because we split the tensors into heads.

Figure 16.5. The tensors going in and out of every module within the causal self attention module.

(* To keep my diagrams consistent with the code, I show the output of “split into heads” as 32×2048×10×128. However, before we go into scaled dot product attention, we need to swap the two middle dimensions to get to 32×10×2048×128 for q, k, and v. You’ll see this is the shape I show as the input in figure 16.6 below. The convention is that operations are performed on the deepest dimension or dimensions. Scaled dot product attention performs operations at the level of the “2048” dimension and the “128” dimension. The operations are identical across the “32” and “10” dimensions. This is why we need to swap the “10” dimension and the “2048” dimension.)
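
Here’s that reshuffling as code, a sketch that assumes q has just left its linear transformation with shape 32×2048×1280 (variable names are illustrative):

```python
import torch

B, T, H, Dh = 32, 2048, 10, 128  # Dh = D // H = 1280 // 10

# q as it leaves its linear transformation: one 1280-wide vector per token.
q = torch.randn(B, T, H * Dh)

# "Split into heads" is just re-indexing; no values move and no parameters are involved.
q = q.view(B, T, H, Dh)
print(q.shape)  # torch.Size([32, 2048, 10, 128])

# Swap the two middle dimensions so the "2048" and "128" dimensions are deepest.
q = q.transpose(1, 2)
print(q.shape)  # torch.Size([32, 10, 2048, 128])
```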

Here’s scaled dot product attention. This is where it gets really confusing.

Figure 16.6. Shape of the tensors flowing through scaled dot product attention.

In scaled dot product attention we need to compute a score between every query and every key. You’ll see that we transpose the keys, as discussed in chapter 14. Now that you can see the full shape of the input tensors, it’s clear that the transpose applies to the two deepest dimensions.
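
To make that concrete, here’s the score computation in isolation. I’ve shrunk the sequence length in this sketch because the full 32×10×2048×2048 score tensor would take roughly 5 GB in float32; the shape pattern is exactly the same:

```python
import math
import torch

B, H, T, Dh = 32, 10, 64, 128  # T shrunk from 2048 to keep the example light

q = torch.randn(B, H, T, Dh)
k = torch.randn(B, H, T, Dh)

# Transposing the two deepest dimensions turns k from (B, H, T, Dh) into
# (B, H, Dh, T), so the matrix multiplication yields one score per query-key pair.
scores = q @ k.transpose(-2, -1) / math.sqrt(Dh)
print(scores.shape)  # torch.Size([32, 10, 64, 64]); 2048 x 2048 in the real model
```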

And finally, the MLP:

Figure 16.7. Shape of tensors flowing through the MLP.
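
A sketch of that shape flow, with a scaled-down batch and sequence length so it runs instantly (the real shapes are 32×2048×1280 in and out, with 32×2048×5120 in the middle):

```python
import torch
import torch.nn as nn

B, T, D = 2, 16, 1280  # scaled down; the real run uses B = 32, T = 2048

x = torch.randn(B, T, D)

up = nn.Linear(D, 4 * D, bias=False)
down = nn.Linear(4 * D, D, bias=False)

h = up(x)               # widen each token's vector to 4*D
print(h.shape)          # torch.Size([2, 16, 5120])

h = torch.relu(h) ** 2  # squared ReLU: no learned parameters
y = down(h)             # back down to D
print(y.shape)          # torch.Size([2, 16, 1280])
```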

Now you know what’s in the model (chapter 11 through this chapter), and you know how to calculate loss (chapter 8). So is that it? Are we done? No. Now for the fun and frustrating part. We have to train this thing, and we need to learn how to get from a no-longer-black-box that adds tokens to a sequence of text to something that behaves like an assistant, chats with us, and is capable of using tools.

Here’s the agenda. Over the next few chapters I’ll cover concepts we’ll need during training, such as how to evaluate whether the model is becoming smart and how we actually update the parameters. Then in chapter 24 we’ll start training. First we’ll do base training, and then we’ll refine our GPT model into a chat model.