11
Cracking open the transformer
Do you remember Pandora? Not the figure from Greek mythology who opened a box and not the jewelry shop. I’m talking about the music streaming service that was more popular before Spotify came on the scene. Pandora hired musicians to categorize songs according to many dimensions. Let me explain why.
Take “Love Yourself,” which my daughter got stuck in my head last night and which may now be stuck in your head just from my mentioning the title. Originally, I’m sure, “Love Yourself” and every other piece of music had simple entries in Pandora’s database with information like title, artist, and length of song. Perhaps “Love Yourself” had a database ID of 153,432.
Pandora pioneered the idea of automatically recommending music and had this feature where you could listen to your own personalized radio station. For music fans used to tuning into actual radio stations or picking their own albums and songs, this was novel. Let’s say Pandora determined that you liked “Love Yourself” and now it was time to decide what to play next. The database ID of 153,432 by itself contained no useful information. The title also wouldn’t be of much help. The artist name could be useful, but you would get bored if your stream only contained Justin Bieber songs.
A human DJ, of course, would attach layers of meaning to “Love Yourself.” They might think about its subject matter, tone, instrumentation, and vocal style. They would combine this with the picture they’ve already formed of you based on what other music you like and somehow their human brain would decide what to play next. The founders at Pandora must have thought—if our service is going to have even a shot of making good recommendations, we’ll first need to enrich our database with more information about each song.
Soon I’ll circle back and show how this idea is relevant to what goes on inside a transformer. But first let’s get reoriented. In chapter 8 I showed what goes into the transformer and what comes out with a diagram like this:
B stands for batch size, T stands for length of sequence in tokens, and V is vocabulary size. I showed an example where B was 32 and T was 2048, resulting in a total of 32 × 2048 = 65,536 tokens going into the model in a single input. This is in fact how the inputs and outputs are shaped.
However, because we’re now going inside the transformer and dealing with even more dimensions, I’m going to leave out the batch dimension. This will keep things a little easier to think about. Batch size is in fact important: it determines how much data goes into each calculation of the loss and each update of the parameters, and we need to work in batches (and what are called mini-batches) to get the GPU to calculate things as efficiently as possible.
Inside the model, though, each sequence in a batch is treated exactly the same. Therefore it makes no conceptual difference if we ignore the batch dimension for most of the next few chapters.
With that, it’s time to see how we make one next token prediction for each input token. As a reminder from chapter 8, each prediction consists of V numbers (the probability distribution over the vocab). Also as a reminder, we need to make sure the model doesn’t cheat.
All GPT models are similar but not identical. There is no single, canonical architecture. As promised in the introduction, I’m sticking with the architecture that Andrej Karpathy devised for Nanochat. It’s clean, it incorporates recent innovations, and if you’re so inclined, you can view and download the code yourself. We’ll go through the architecture piece by piece here and in chapters 13, 14, and 15. In chapter 16 I’ll put the whole picture back together. I’ll also ask you to count the total number of parameters across the entire model, an exercise that will cement the model in your mind.
Here’s the top level:
Let’s put example dimensions in to make it easier to follow. I’ll assume my input is a sequence of five tokens (e.g. “<bos> He went to the”). I’ll use the dimensions from my 20-layer model, the version of the model I trained before I started writing this book. I haven’t explained what “D” is yet, but we’ll come to that.
We start with a sequence of five tokens. The embedding module turns each token into 1280 numbers. Since we have five tokens, the result is a 5×1280 tensor. We feed this tensor to the first transformer block. The output of this block is also a 5×1280 tensor. We feed this tensor to the next transformer block, on and on, until we’ve gone through all 20 blocks. The transformer blocks all have the exact same architecture but they will each learn different parameters during training. Nearly all the magic happens in the transformer blocks and I’ll be drawing many diagrams to show what’s inside. Finally, the output of the last transformer block goes into a linear transformation just like the ones in our hedgehog models. This linear layer takes each of those five rows with 1280 numbers and turns them into five rows with 65,536 numbers each, similar to the example output I showed in figure 8.6.
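If it helps to see that flow as code, here is a minimal PyTorch sketch of the top level, with the shapes written alongside. The names (ToyGPT, TransformerBlock, wte, lm_head) are placeholders of mine, not Nanochat’s actual code, and the transformer block is left as an empty stub that we’ll fill in over the coming chapters.

```python
import torch
import torch.nn as nn

T, D, V, n_layers = 5, 1280, 65536, 20     # sequence length, model dimension, vocab size, depth

class TransformerBlock(nn.Module):
    # Empty stub: the real contents are the subject of the next few chapters.
    def forward(self, x):                  # x: (T, D)
        return x                           # output keeps the same (T, D) shape

class ToyGPT(nn.Module):
    def __init__(self):
        super().__init__()
        self.wte = nn.Embedding(V, D)      # each token ID -> D numbers
        self.blocks = nn.ModuleList([TransformerBlock() for _ in range(n_layers)])
        self.lm_head = nn.Linear(D, V, bias=False)   # D numbers -> V logits per token

    def forward(self, token_ids):          # token_ids: (T,)
        x = self.wte(token_ids)            # (T, D) = (5, 1280)
        for block in self.blocks:
            x = block(x)                   # still (T, D) after every block
        return self.lm_head(x)             # (T, V) = (5, 65536)

logits = ToyGPT()(torch.tensor([1, 17, 42, 9, 3]))   # five made-up token IDs
print(logits.shape)                                  # torch.Size([5, 65536])
```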
I’ve been saying all along that I trained a 20-layer model ahead of writing this book and that later, in chapter 24, we’ll together train a 32-layer model. Now you know what I mean by a 20- or 32-layer model. It’s the number of transformer blocks. This is also called depth.
I said that we turn each token into 1280 numbers. And as you can see in the diagram, we stick with inputting and outputting 1280 numbers all the way until the output of the final linear transformation. This hints that this 1280 dimension is important. It is. In fact, even though we’ll encounter plenty of tensors with plenty of dimensions, this 1280 is considered the overall dimension of the model. That’s why I labeled it “D” in figure 11.3.
I spent far longer than you would imagine stressing about whether “D” is the best letter. As you read about transformer models you’ll sometimes see this dimension referred to as d_model, dmodel, n_embd (for reasons that will be clear shortly), or hidden size. I settled on “D” as the closest thing to standard, given that I wanted a single capital letter to stick with across all of my diagrams. Please don’t confuse “D” with model depth (e.g. 20) or the many other dimensions we’ll be seeing.
You may wonder why D in my 20-layer model is 1280. D, like V and even the number of layers, is a hyperparameter. The model designer has to pick a D that is likely to work, taking into account everything else—the number of layers, the size of the vocabulary, the amount of data and time for training, and the intended use of the model. This is similar to what I described in chapter 7 for selecting V. We haven’t touched on how D is used yet, but you can see in figure 11.3 that each next token prediction is going to come from these D numbers. This tells us that D can’t be too small. For example, how could, say, four numbers be turned into a probability distribution over 65,536 tokens?
D, V, and the number of layers need to be roughly compatible with each other. I set D to 1280 in my 20-layer model because Karpathy says that 64 times the number of layers (64 × 20 = 1280) is a reasonable choice for D. When we get to training our 32-layer model, we’ll set D to 2048. As you learn what happens in each transformer block, I believe you’ll build an intuition for why a bigger D fits with more layers.
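If you like seeing that rule of thumb spelled out as code, here is the arithmetic (the function name is my own, not something from Nanochat):

```python
def model_dim(n_layers, width_per_layer=64):
    """Rule of thumb from Karpathy: scale D with depth."""
    return width_per_layer * n_layers

print(model_dim(20))  # 1280, the D of my 20-layer model
print(model_dim(32))  # 2048, the D we'll use for the 32-layer model in chapter 24
```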
I’ve mentioned “1280 numbers” a few times. It’s time to introduce the term vector because it’s going to make it easier to talk about those numbers. In chapter 5 I described a matrix as a bunch of numbers organized into rows and columns. In chapter 8 I said that a matrix is a two-dimensional tensor and that we can use the term tensor no matter how many dimensions we have. A vector is specifically a one-dimensional tensor. Let me show you a few examples.
That’s it. It’s just a bunch of numbers. You can think of a vector as a row in a matrix. And now with this terminology we can say that we turn a sequence of T tokens into a sequence of T vectors each of size D. Each successive transformer block then takes in T vectors of size D and outputs another T vectors of size D. The final linear layer takes in T vectors of size D and outputs T vectors of size V.
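In PyTorch terms, a vector is simply a tensor with one dimension. A quick sketch of the distinction (the numbers are made up):

```python
import torch

vector = torch.tensor([0.3, -1.2, 4.0])   # one-dimensional tensor: a vector of size 3
matrix = torch.zeros(5, 1280)             # two-dimensional tensor: T=5 vectors of size D=1280
print(vector.shape)                       # torch.Size([3])
print(matrix.shape)                       # torch.Size([5, 1280])
print(matrix[0].shape)                    # each row is itself a vector: torch.Size([1280])
```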
Embedding module
So what does the embedding module actually do? Think about token 6237 from our text generation example, “ Paris.” The number 6237 is just an identifier and has no more or less meaning than any other number. As humans, though, even in isolation with no other words around it, we associate quite a bit of meaning with “Paris”: noun, city, France, Europe, beautiful, baguette, old. Pandora needed to characterize song ID 153,432 along many dimensions (e.g. genre = pop, instrumentation = acoustic-driven, vocal presence = male vocalist) so their personalized radio stations could smartly pick next songs. We need to do something similar with our tokens. Here’s a somewhat silly example of how that could look:
I scored each token on its “nounness,” “adjectiveness,” “redness,” and “positiveness” on a scale of -1 to 1. Even though this is silly, I hope you can imagine that using numbers that actually say something meaningful about each token will be more helpful than the plain token IDs. After all, we trained our turkey model on the height and length of each turkey, not the “id” of the turkey (#1, #2, #3).
Starting in 1999, Pandora hired human musicians and asked them to listen to songs and characterize each according to hundreds of dimensions. We will absolutely not be manually characterizing our 65,536 tokens across 1280 dimensions. The beauty of training a model through backprop is that we can imagine a technique we believe will help make good predictions, build its skeleton, and let the parameters be learned, provided we have enough data and enough computing power. I want to repeat this because it’s the central idea of modern AI models and it’s not initially intuitive. The model designer imagines a technique that could work. Training figures out the parameters to make it actually work. We’ll see this over and over as we look at each part of the transformer.
Back to the embedding module. Mechanically, it’s a lookup table. For example, pretend that our embedding module uses table 11.1 and that D is four. If my input is the token IDs for “red” and “dog,” what will I get out?
Initially the numbers will be random. They will take on meaning during training, although the meaning won’t be nearly as obvious as in my contrived example. For example, you’re unlikely to look at the first dimension and say, aha, that’s nounness. (For the sake of how this approach fits with other approaches you may have heard about, it is also possible to start with pre-trained embeddings and keep them locked during training.)
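In PyTorch, this kind of lookup table is what nn.Embedding provides. Here is a minimal sketch with a tiny made-up vocabulary and D of four; as just described, the rows start out random and only take on meaning during training:

```python
import torch
import torch.nn as nn

V, D = 10, 4                       # a tiny made-up vocabulary and embedding size
embed = nn.Embedding(V, D)         # a V x D table of learnable parameters, randomly initialized

token_ids = torch.tensor([7, 2])   # pretend these are the IDs for "red" and "dog"
vectors = embed(token_ids)         # look up one row per token ID
print(vectors.shape)               # torch.Size([2, 4]): two tokens, D numbers each
print(embed.weight.shape)          # torch.Size([10, 4]): the full lookup table
```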
From the four dimensions I made up in example table 11.1 you can see that four is not even close to enough to capture the meaning we humans associate with any word. As mentioned above, 1280 is a reasonable choice for our 20-layer model.
Before you continue reading, how many parameters will the embedding module shown in figure 11.4 need? Look at table 11.1 for a clue.
Here’s the answer. The embedding module will need D parameters for each token in the vocab. That’s D × V = 1280 × 65,536 = 83,886,080 parameters. The training for these 80+ million parameters will absolutely not be evenly distributed. The parameters for frequent tokens like “the” and “cat” will be under constant “pressure” from the gradient to adjust whereas some tokens may come up only infrequently during training leaving their parameters only lightly trained.
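You can confirm that count directly (a quick check, using the same nn.Embedding module as in the sketch above):

```python
import torch.nn as nn

V, D = 65536, 1280
embed = nn.Embedding(V, D)
print(embed.weight.numel())   # 83886080 parameters, i.e. V x D
```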
Let’s look at some actual embeddings from my 20-layer model. Here are the first 10 out of 1280 dimensions for each of 9 tokens:
What pops out? Nothing for me. As humans we’re not good at looking at tables and spotting patterns in 10-dimensional data, not to mention the full 1280 dimensions. We also can’t directly plot more than three dimensions. Nor can we expect that the model would just happen to “organize” the dimensions according to any logic that would jump out at us, like dimension 1 being nounness and 2 being adjectiveness.
We’re not helpless though. If we only cared about the first two dimensions, for example, we could draw a vector representing Paris from the origin to (-2.92, 9.12) and another representing London from the origin to (40, -7.31). We could then treat the angle between those two vectors as a measure of how similar they are.
They don’t look very similar in those first two dimensions. If you measure, you’ll see that the angle is about 118 degrees, which has a cosine of around -0.47. The nice thing is we can do this similarity calculation, known as cosine similarity, not just with the first two dimensions but with all 1280 dimensions. A score of 1 means the angle is 0 and the two vectors point in exactly the same direction in 1280-dimensional space. A score of 0 means the angle is 90 degrees and the two vectors are orthogonal in 1280-dimensional space. A score of -1 means they point in exactly opposite directions. Let’s compute the cosine similarity between each pair of those 9 tokens.
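Cosine similarity is just the dot product of two vectors divided by the product of their lengths, and PyTorch has it built in. Here is a sketch that reproduces the two-dimensional Paris/London comparison; the same one-liner works on full 1280-dimensional embeddings (I use random stand-ins below rather than my model’s actual weights):

```python
import torch
import torch.nn.functional as F

paris_2d = torch.tensor([-2.92, 9.12])
london_2d = torch.tensor([40.0, -7.31])
print(F.cosine_similarity(paris_2d, london_2d, dim=0))   # about -0.47, an angle of roughly 118 degrees

# The exact same call works in 1280 dimensions; two random stand-in embeddings:
a, b = torch.randn(1280), torch.randn(1280)
print(F.cosine_similarity(a, b, dim=0))                  # random vectors land near 0 (roughly orthogonal)
```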
Now I’m getting convinced that my embeddings actually do mean something! Read across on the “Paris” line. What’s most similar to “Paris” besides Paris itself? “London.” What’s most similar to “he”? “She.” The two adjectives are similar. The two pronouns are similar. The three animals are similar and “cat” is closer to “dog” than to “wolf.”
There are other techniques for getting your mind around high-dimensional data. For example, we know we have 65,536 embeddings of 1280 dimensions each. Using principal component analysis we can say, look, I know there’s no great way to view those embeddings at less than 1280 dimensions, but if I really wanted to project them onto a smaller number of dimensions, how should I do it to preserve as much of the information as possible? Let’s project to two dimensions:
Paris and London are close. You could draw a line to separate the pronouns from the other words, and you can do the same with the colors. You can’t quite separate out the animals. What if we allow ourselves three dimensions?
Amazingly you can now draw a plane that separates the animals from all the other words. This is a reminder that even though wolf, cat, and dog seem far away from each other in the 2D projection, they may in fact be nicely grouped in higher dimensional space. Just imagine if you were a being that could see 1000+ dimensions!
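If you want to try this kind of projection on your own embeddings, here is a sketch using scikit-learn’s PCA. I’m assuming the embedding matrix is available as a NumPy array of shape (V, D); below I substitute a smaller random stand-in so the snippet runs on its own:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for a real (V, D) embedding matrix; swap in your model's weights here.
rng = np.random.default_rng(0)
embeddings = rng.standard_normal((1000, 1280))

pca = PCA(n_components=2)                  # the 2 directions that preserve the most variance
projected = pca.fit_transform(embeddings)  # shape (1000, 2): one 2-D point per token
print(projected.shape)
print(pca.explained_variance_ratio_)       # fraction of the variance each kept direction explains

# For the 3-D view, set n_components=3 and plot the three resulting columns.
```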
We had a sequence of T tokens, each represented by a meaningless ID. Now we have T embeddings, each packed with meaning.
I’m afraid that we’re not quite ready to peer into the transformer block. We need to first take a detour through a fun bit of deep learning history.