8


Not feathers…what exactly does the model predict?

OpenAI released GPT-3 in a private beta in mid-2020. I don’t remember exactly when I got access. I was the head of data science at Klaviyo at the time and we were already working with OpenAI. At some point the model appeared in our account. You could input a prompt and it would generate text. This was around the same time I published my first novel, The Insecure Mind of Sergei Kraev, so I tried putting in the first few sentences of my prologue:

April 14, 2220

One hundred years after 4-17

Singapore Island

Children,

This is my ninety-fourth annual message.

Here are three first paragraphs generated by GPT-3 and the paragraph from my actual prologue. Can you guess which is the real one?

Figure 8.1. Three of these paragraphs were generated by GPT-3 in 2021. One is from the actual prologue to my novel.

Toward the end of 2021 Klaviyo released a feature to help people write email subject lines. The user would say a little bit about what they wanted, the model would generate four ideas, and the user could pick one or ask for more to be generated. Under the hood we inserted the user’s input and other information about the user into a prompt template, sent this to GPT-3, and parsed the subject line ideas out of the response that came back. This way of building systems has become widespread but it was novel only a few years ago.

Despite being an early user of GPT-3, and despite knowing that GPT-3 and its predecessors were fundamentally next token prediction models, and despite having experience with other types of models, I could not have answered these three questions:

Let’s answer these questions. I find it’s helpful to think about the inputs and outputs before looking at what’s inside the model. With a firm grasp on what we’re starting with and what we’re trying to get to, it will be easier, later, to trace the operations inside the model. Until we crack it open in chapter 11, we’ll treat our GPT model as a black box.


Pretend we are again in the strange land from chapter 2, where the only words in the language are: a, bed, he, house, is, man, red, store, the, to, and went. We’re going to begin our training on some text scraped off the web of this strange land: “he went to the store.”

We start our training by inputting the beginning of sequence token into the model. The model outputs its prediction for the next token, which comes in the form of a probability distribution over all of the tokens in the vocabulary.

Figure 8.2. Input the beginning of sequence token and the model makes predictions for the next token.

The model really wants to start the sentence with “bed”! But we haven’t done any training yet; the weights inside the model are random, so the output is random. We don’t need to worry, and we aren’t going to do anything with “bed” at this point. We stick with our training sentence, “he went to the store.” So now we input the first two tokens in the sequence:

Figure 8.3. Input the first two tokens and the model predicts the probability of each possible next token.

Again, we haven’t done any training yet, so we don’t need to worry about the model predicting that “he” is likely to follow “he.”

Keep going until we’ve done a prediction on the full sequence up to but not including the last token. Let’s put all of these predictions in a single table:

Table 8.1. The next token predictions for “<bos>”, “<bos> he”, and so on up to “<bos> he went to the.”

At this point you might see where this is going, and your mind may be objecting. Wait: if you start a sentence with “he,” there are so many valid next words. Are we really going to teach the model that “went” should come next? Even “he went to” has many valid next words besides “the”! Good point. But it’s not a problem, because we’ll have lots of training data: some of it might push the weights toward predicting “the” next, some toward “bed,” some toward “a,” but none toward “he” or “went.” And no matter what, we’ll be careful never to adjust the weights too much based on a single batch of training data. More about batches to come.

Looking at the table, we can articulate what we want: update the weights such that, unlike now, column 1 will give top probability to “he,” column 2 will give top probability to “went,” column 3 will give top probability to “to,” column 4 to “the,” and column 5 to “store.” To get backpropagation to figure out how to nudge each weight in the right direction to get this overall result, we need a single number that represents how we’re doing and can be maximized or minimized. Let’s try multiplying the probabilities of the tokens we want the model to predict:

Table 8.2: Purple indicates tokens we want the model to predict. At the moment it’s not doing a very good job.

So that’s:

0.05% × 2.48% × 5.72% × 4.72% × 40.54% = 0.000001357%

Suppose we tweak the weights and now the 0.05% becomes 0.09% and also suppose all but one of the other probabilities also move up:

0.09% × 2.49% × 5.74% × 4.75% × 40.51% = 0.000002475%

This seems right. The model made a better prediction and the overall probability went up, so multiplying the probabilities of the correct tokens seems to be a useful score. This multiplication of probabilities gives us a single number that says how good a job our model is doing. You may remember from the turkey example that by convention we want a loss function that we can minimize: the lower the loss, the better. To match that convention, we can multiply our product-of-probabilities score by -1, and now it’s a loss function.
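Here is the same calculation as a minimal Python sketch, using the five probabilities from table 8.2 (the variable names are mine, not from any library):

# Probabilities the untrained model assigned to the correct next tokens (table 8.2).
correct_token_probs = [0.0005, 0.0248, 0.0572, 0.0472, 0.4054]

score = 1.0
for p in correct_token_probs:
    score *= p                      # multiply the probabilities together

loss = -score                       # flip the sign so that lower is better
print(f"score = {score:.10%}")      # prints: score = 0.0000013572%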

This is more or less the entire training concept. We have data, we have a loss function, backprop can work its magic, and if our model architecture is good enough, the model will train and the loss will come down.

My son, who’s been reading as I’ve been writing, got to chapter 18 and realized he was confused about the loss function. It is confusing. Until you get used to it, it feels strange to ignore most of the probabilities and consider only those for the correct next token according to the training data. To help solidify the idea, we’ll do one more example. Let’s say the training text is “<bos> the house is red” and let’s say I’ve predicted all the next tokens using two versions of the model:

Table 8.3: Which model has lower loss?

Which model has lower loss? In other words, which model is doing better at least as judged by the training text “<bos> the house is red”? Please think about this and make sure you can answer before you move on. I’ll put the answer at the end of the chapter.


If you look at figure 8.2, with “<bos>” being fed into the model, and then figure 8.3, with “<bos> he” being fed in, and then go find the training loop in the code, you’ll be puzzled, because you’ll never see just “<bos>” or just “<bos> he” being input to the model. What you’ll see will look more like this:

Figure 8.4. During training the full sequence is fed into the model at once, not token by token.

The whole sequence goes in at once and the probability distributions for all of the next tokens come out. You might be thinking: who cares, that’s just a more efficient way to do it. Or you might be thinking: won’t the model cheat? Instead of learning to predict, it will just learn to assign a probability near 100% to whatever token comes next, since it can see that token in its input. Both thoughts are valid.

On efficiency, yes, the technique of sending in the full sequence at once really is just a way to be more efficient. We’ll get the same results either way. Remember from the discussion of matrices and GPUs in chapter 5, however, that it’s essential to give the problem to the GPU in big chunks. Otherwise it can’t use its massive parallelism to compute efficiently.
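To make “same results either way” concrete, here’s a sketch with a toy stand-in for the model (a causal running average of embeddings rather than a real transformer, so don’t read anything into the architecture; toy_model, emb, and head are just names I made up for this sketch). Feeding in the whole sequence at once gives exactly the same predictions as feeding in one prefix at a time:

import torch

torch.manual_seed(0)
V = 12                               # vocabulary size
emb = torch.nn.Embedding(V, 8)       # toy token embeddings
head = torch.nn.Linear(8, V)         # maps a hidden vector to 12 scores

def toy_model(ids):
    # A stand-in for a transformer: each position sees only itself and earlier
    # positions (a running average of embeddings), then predicts the next token.
    h = emb(ids)                                             # (T, 8)
    counts = torch.arange(1, ids.shape[0] + 1).unsqueeze(1)  # 1, 2, 3, ...
    h = h.cumsum(dim=0) / counts                             # causal running average
    return head(h).softmax(dim=-1)                           # (T, V) probability rows

x = torch.tensor([0, 3, 11, 10, 9])  # "<bos> he went to the"

all_at_once = toy_model(x)           # one call: five next-token distributions
one_by_one  = torch.stack([toy_model(x[:t + 1])[-1] for t in range(len(x))])

print(torch.allclose(all_at_once, one_by_one))   # True: same results either way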

In the introduction I mentioned that I trained a 20-layer model before I started writing this book. During that training, for which I used NVIDIA H100 chips, I fed over 60,000 tokens at a time to each GPU. The GPU could make predictions on all 60,000 tokens, calculate loss, and adjust the model weights in less than half a second.

And about cheating, yes, that confused me at first. Models love to cheat. With enough weights, and enough training cycles, that signal from the gradient will make them do crazy things. If they can cheat their way to lowering the loss, meaning doing so in a way that you as the designer didn’t intend or realize was possible, they’ll do it.

But not to fear. The transformer architecture won’t let the model cheat. In chapter 14 we’ll discuss masked multi-head self-attention. You’ll see how masking guarantees that when, for example, the model is predicting the token after “went” in “He went to the store,” it can’t peek at “to.”
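As a tiny preview of the masking idea (the real mechanics come in chapter 14), a causal mask is just a lower-triangular table saying which positions each prediction is allowed to look at. Here’s a sketch in PyTorch:

import torch

T = 5   # "<bos> He went to the"
mask = torch.tril(torch.ones(T, T)).bool()
print(mask)
# Row t is the prediction made after token t; True marks the positions it may look at.
# The prediction after "went" (row 2, counting from zero) can see "<bos>", "He," and
# "went," but not "to" or "the."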

In figure 8.4 I showed a whole sequence of tokens going into the model at once. Even that is a little bit of a simplification. During actual training we’ll be feeding in a batch of sequences at a time. This is again so we can give the GPU as chunky a problem as possible to work on. If we’re doing a good job, our GPUs will consume all of their memory and all of their computing power for the entire training process.

Let me show you a batch of size 3 (call that B) with a sequence length of 5 (call that T for number of tokens). The first sequence will be our friend from above “<bos> He went to the.”

Figure 8.5. A batch of three sequences.

Now let’s see what we get when we feed the batch into the model. Our vocab size is still 12 and we’ll call that V.

Figure 8.6. A small but realistic view of the input and output.

It may take time to wrap your mind around this tiny but realistic example. I replaced the words with token IDs. For example, 0 is “<bos>” and 3 is “He.” You’ll see that the input is a matrix just like the turkey data going into the turkey model in figure 5.3. The matrix here has size B×T meaning B rows and T columns. Think of this, again, as a batch of 3 sequences each with 5 tokens.

The output is trickier. Its size is B×T×V. You can see why we need that extra dimension: we have three sequences in the batch, for each we’re making five next token predictions (one for each token), and each prediction consists of twelve numbers (one for each token in the vocab). I may also have confused you by switching the rows and columns in this diagram as compared to the diagrams above. In the diagrams above (e.g. figure 8.2), I was aiming for legibility, so I used 12 rows to show the probability distribution over the 12 tokens. In this figure 8.6 I switched to match the convention of how it’s usually done in code and used ellipses to make the numbers fit.

Up until now every matrix we’ve seen had two dimensions, and by the math definition, a matrix always has exactly two dimensions: rows and columns. So we probably shouldn’t call this B×T×V thing a matrix. In AI and machine learning and in the widely used PyTorch library it’s called a tensor, so we’ll go with that, as in: the output of the model is a tensor of size B by T by V.
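If you want to see these shapes concretely, here is a quick PyTorch sketch. The first row of x is the sequence from the text; the other two rows, and the output numbers, are stand-ins I made up just to fill out the batch:

import torch

B, T, V = 3, 5, 12                    # batch size, tokens per sequence, vocabulary size

# The input: a B×T matrix of token IDs. The first row is "<bos> he went to the";
# the other two rows are made-up sequences.
x = torch.tensor([[0, 3, 11, 10, 9],
                  [0, 9, 4, 5, 7],
                  [0, 3, 11, 10, 2]])
print(x.shape)                        # torch.Size([3, 5])

# The output: a B×T×V tensor. Random numbers stand in for the model's real
# predictions; each group of 12 is normalized so it sums to 1.
output = torch.rand(B, T, V)
output = output / output.sum(dim=-1, keepdim=True)
print(output.shape)                   # torch.Size([3, 5, 12])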

When I first heard that AI used tensors, back when Google came out with something called TensorFlow in 2015, I was intimidated. Two of my roommates in college majored in physics and I remembered that they dealt with tensors, for example in quantum physics. Quantum physics may or may not be scary, but tensors themselves are no scarier than matrices. They are just an organized way to group a bunch of numbers. In fact there are other software tools that call them multidimensional arrays, a term much less intimidating to someone with a computer science background.

No matter what you call it, a tensor is an organized way to keep track of a bunch of numbers representing data, interim calculations, or output. How many numbers? For our B×T×V output it’s 3 × 5 × 12 = 180 numbers. Suppose I want to pull a specific number out of the tensor, say the probability that the first predicted token in the first sequence is “went.” You can see at the top right of figure 8.6 that this is 0.02. In code I could get that with output[0,0,11]. Why 0? Because in code things are normally zero-indexed, meaning you start counting at zero. (This makes sense because these numbers will be stored in memory somewhere and the first one will be at that location + 0 × bytes per number, the second one at that location + 1 × bytes per number, etc.) So output[0,0,11] means the first item in the “B” dimension, the first item in the “T” dimension, and the 12th item in the “V” dimension.

I’ll show you a few examples so you can see how powerful, intuitive, and concise this notation is for accessing parts of a tensor. As important as making sense to humans, or more so, is that it lets you specify operations for the GPU without picking out pieces of data one at a time. This allows the GPU to do what it does best: work in parallel.

The first example is the same one I gave in the text above. We pull a single element out of the tensor resulting in a tensor with no dimensions (called a scalar):

Figure 8.7. Example of indexing into a tensor.

In this second example, we take a slice containing the prediction for the first token of the first sequence in the batch. The result is a tensor with a single dimension: the predicted probability distribution over the vocabulary for that token:

Figure 8.8. Example of indexing into a tensor.

In this third example, we take a slice with all elements from the first sequence in the batch. This results in a tensor with two dimensions of size T×V (5 × 12), the same one shown in figure 8.6:

Figure 8.9. Example of indexing into a tensor.
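In code, the three examples look like this (output here is again a random stand-in for the real B×T×V tensor of figure 8.6):

import torch

# A stand-in for the B×T×V output tensor (random numbers, normalized).
output = torch.rand(3, 5, 12)
output = output / output.sum(dim=-1, keepdim=True)

# Figure 8.7: a single element, a tensor with no dimensions (a scalar).
print(output[0, 0, 11].shape)   # torch.Size([])    probability that token 11, "went," comes first

# Figure 8.8: one prediction, a one-dimensional tensor of length V.
print(output[0, 0].shape)       # torch.Size([12])  distribution over the whole vocabulary

# Figure 8.9: every prediction for the first sequence, a two-dimensional T×V tensor.
print(output[0].shape)          # torch.Size([5, 12])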

For training, you’ve now seen what we input to the model, what the model outputs, and how we use the output to calculate loss. To actually calculate loss we need to know the correct next tokens, as indicated in purple in table 8.2. We even need the final next token for each sequence, which is not in our input batch. In practice we form a y_actual tensor, just like in the turkey example. It is very similar to our input but shifted over: it excludes the beginning of sequence token, which we’re not trying to predict, and includes the final token “store,” which we do want to predict.

Figure 8.10. X goes into the model, y_actual goes into the loss calculation.

Taking the first row in x and y_actual as an example, our goal is that 0 predicts 3, 3 predicts 11, 11 predicts 10, etc. The final token in x in the first row, 9 “the,” should predict 8 “store.” (Keep in mind that when I say 9 should predict 8, what I really mean is that the full sequence 0, 3, 11, 10, 9 should predict 8.)
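In code the shift is one line each way. Here’s a minimal sketch for the first row of the batch:

import torch

# "<bos> he went to the store" as token IDs
tokens = torch.tensor([0, 3, 11, 10, 9, 8])

x        = tokens[:-1]   # tensor([ 0,  3, 11, 10,  9])  goes into the model
y_actual = tokens[1:]    # tensor([ 3, 11, 10,  9,  8])  goes into the loss calculation

# Position t of x should predict position t of y_actual:
# 0 -> 3, 3 -> 11, 11 -> 10, 10 -> 9, and finally 9 -> 8 ("the" -> "store")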

Now putting it all together:

Figure 8.11. The big picture of how loss is computed from x and y_actual.

Purple indicates the dimensions of each tensor. When looking at this diagram, remember that the point of boiling everything down to a single loss number is so we can backpropagate and adjust the weights. In chapter 11 we’ll start to get into what goes on inside the model and exactly what those weights are.

I’ve been using a tiny example for the diagrams. When I trained my 20-layer model ahead of time, which has a vocabulary size of 65,536 tokens and over half a million weights, I used batches of 32 sequences, each with 2048 tokens (B=32, T=2048). Here are the diagrams again with these dimensions multiplied out so you can appreciate how many numbers we’re dealing with:

Figure 8.12. The total numbers in the tensors to compute loss for a single batch.

In a single round of the training loop in the turkey example, we adjusted two weights using a batch of three pieces of data, each containing height, length, and actual number of feathers. Here, in a single round of the training loop, we’re adjusting over half a million weights by backpropagating through a calculation that starts with 65,536 numbers and produces an output of over 4 billion numbers. In chapter 21, when we start training our 32-layer model, we’ll be adjusting over 2 billion weights. The fact that this works at all speaks to the incredible power of backpropagation, calculus, and the model architecture innovations we’ll cover in upcoming chapters; the fact that it can be done in a fraction of a second speaks to the power of GPUs.
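If you’d like to check those counts, they follow directly from the tensor shapes:

B, T, V = 32, 2048, 65_536    # my 20-layer model's training dimensions

print(f"{B * T:,}")           # 65,536 token IDs go into the model per batch
print(f"{B * T * V:,}")       # 4,294,967,296 numbers come out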


Figure 8.13. Where our batches come from during training.

To build an intuition for scale, let’s do a back-of-the-envelope calculation. A page of English text might be 1500 characters. An average English word is 4 to 5 characters, many words correspond to a single token, and many tokens include a leading space, so let’s assume 5 characters per token. A page is then 1500 ÷ 5 = 300 tokens. We need 65,000+ tokens’ worth of text, so that’s over 200 pages. Think a short novel, a whole bunch of Wikipedia articles, a lot of business documents, or many discussion threads.
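Here’s the same back-of-the-envelope arithmetic written out:

chars_per_page  = 1500
chars_per_token = 5
tokens_needed   = 65_536                              # one batch: 32 sequences × 2048 tokens

tokens_per_page = chars_per_page / chars_per_token    # 300 tokens per page
pages = tokens_needed / tokens_per_page               # about 218 pages
print(pages)                                          # 218.45...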

To make this even more visceral and concrete, I’m going to read the first document in my dataset of text, the second document, the third, and so on until I accumulate enough tokens to fill my batch. The table below shows these documents. For each I show the beginning of the document and the total number of characters. I encode each into tokens and show the first three tokens and the total number of tokens.

Table 8.4. Enough documents to fill one training batch.

The first thing that may jump out is how random these documents appear. Search for the snippets of text and you’ll find most of them on the internet because that’s where the text came from in the first place. I’m still in awe that it’s possible to achieve something so useful from so much random text. Even decades ago people talked about data being valuable, but I’m not sure they meant this type of data. On the other hand, our human minds are good at consuming lots of information from lots of sources at varying levels of quality and forming a reasonably coherent mental model of things, so maybe it’s not so crazy.

You may also notice that the first token is always ID 65527. If you look back at table 7.10 you’ll see that 65527 is the ID of our special beginning of sequence token.

Overall it takes around 320,000 characters to get to our desired number of tokens. This is around 320 kilobytes, about a third of a megabyte. The character-to-token ratio for this data works out to 4.86 (320,093 ÷ 65,901) so my guess of 5 above wasn’t far off.

And here’s how the tokens look organized into tensors in GPU memory, all set for the forward calculation of the loss and the backward calculation to update the weights.

Figure 8.14. Example of x and y_actual at the actual size I used to train my 20-layer model.

Before we move on, I want to get into one other simplification that I made. If you’ve had enough math for now skip ahead to the next chapter. If, though, something about the loss function is bothering you, read on.

When I showed the loss function under table 8.2, I described how we pulled out the five relevant probabilities, multiplied them, and made the whole thing negative:

loss = -1 × 0.05% × 2.48% × 5.72% × 4.72% × 40.54% = -0.000001357%

That’s only five probabilities and the resulting number is very small. What’s going to happen when we multiply 2048 probabilities, or 65,536? In some sort of ideal world with infinite precision we would end up with an incredibly small but still meaningful number and backprop would work just fine. In the real world of computers we don’t get all that much precision (see chapter 30) and the calculation will be meaningless.

You may remember from high school that the log of a product is the sum of the logs, so multiplying numbers is the same as adding their logs and then exponentiating. Here’s an example:

2 × 3 = 6

e^(ln(2) + ln(3)) = e^(0.693 + 1.099) = e^1.792 = 6

We can apply the same idea to our loss calculation. Here are the five probabilities and their logs:

Table 8.5. The logs of the probabilities.

You can already see that adding a number like -7.60 to a bunch of other numbers of similar magnitude is going to cause less trouble than multiplying 0.0005 (our 0.05%) by other tiny numbers. Let’s multiply out the probabilities, and separately add up the logs and raise e to that power, to show that we get the same result either way.

Table 8.6. Multiplying probabilities is like adding their logs.

Now your next thought might be: hmm, why do I even need to convert back to a probability? Everything will work out much better if we just convert each probability the model predicts for the correct token to its log and never go back. Even better, let’s convert and multiply by -1 so we don’t have to remember to do it later. This gives us per-token numbers. We can then add the numbers together or average them to get the overall loss. This is called negative log loss, or NLL.

Table 8.7. Calculating negative log loss.

Now we’re back to a single, clear loss number. Since the logs of probabilities between 0% and 100% are negative, and we’re multiplying again by -1, the signs can get confusing. As a sanity check, let’s imagine we tweak the weights such that the 0.05% becomes 0.09% (a good thing) and confirm that our loss comes down.

Table 8.8. When the predicted probability of the correct token goes up the negative log loss comes down.

Loss drops from 3.62 to 3.51, moving in the direction we expect. Good!
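As a check on tables 8.7 and 8.8, here’s the same calculation in a few lines of Python, using the five probabilities we’ve been working with and assuming the other four are unchanged by the tweak:

import math

# Probabilities of the correct tokens, then with 0.05% bumped to 0.09%.
before = [0.0005, 0.0248, 0.0572, 0.0472, 0.4054]
after  = [0.0009, 0.0248, 0.0572, 0.0472, 0.4054]

def negative_log_loss(probs):
    return sum(-math.log(p) for p in probs) / len(probs)   # average over the tokens

print(round(negative_log_loss(before), 2))   # 3.62
print(round(negative_log_loss(after), 2))    # 3.51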

Finally, let me also tell you the name of the loss function we’re using in case you want to learn more about it: cross-entropy loss.


While you have the whole idea of negative log loss in your head, I also want to bring up unnormalized logits, because if you hear about them you may wonder whether they’re the same as the per-token negative log loss. They are not.

We haven’t looked inside the model yet. When we do, you’ll see that getting to a proper probability distribution over the vocabulary for each next token prediction is the very last step. As a reminder of what I mean by a proper probability distribution, I’m talking about the columns in table 8.1 at the start of this chapter, each of which adds up to 100%. In that table each column has twelve numbers. In our 20-layer and 32-layer models there will be 65,536 numbers due to the size of the vocabulary.

In fact, it’s such a last step that you can almost think of it as a part of the loss calculation rather than a part of the model itself. The numbers in the final output of the model just prior to conversion to a probability distribution are called unnormalized logits. They will come in a tensor of size B×T×V just as I’ve shown in figure 8.11, but the V numbers for a particular next token prediction will not sum to one, will not all be in the range from 0 to 1, and may include negative numbers. Here’s an example for a single token prediction using our size 12 vocabulary:

Table 8.9. Example of unnormalized logits.

Converting to a probability distribution is done via softmax. I’ll show the calculation in a spreadsheet:

Table 8.10. Using softmax to normalize unnormalized logits into a probability distribution.

We take the exponential (e^x) of each unnormalized logit, sum the results, then divide each exponential by that sum. In practice, the low-level code used in model training is optimized for turning unnormalized logits into negative log loss, and we don’t need to work directly with raw probabilities. By optimized I mean that the calculation is both fast and numerically stable, i.e. it won’t blow up due to numbers that are too large.
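If you’d like to see this in code, here’s a small PyTorch sketch. The logit values are made-up stand-ins for table 8.9, and ID 8 (“store”) plays the role of the correct next token:

import torch
import torch.nn.functional as F

# Unnormalized logits for one next-token prediction over our 12-token vocabulary.
logits = torch.tensor([1.2, -0.3, 0.5, 2.1, -1.7, 0.0, 0.8, -0.4, 1.5, 0.2, -0.9, 0.6])

# Softmax by hand: exponentiate, then divide each exponential by the sum of all of them.
probs = logits.exp() / logits.exp().sum()
print(probs.sum())                                        # sums to 1: a proper distribution

# Same thing via the built-in:
print(torch.allclose(probs, F.softmax(logits, dim=0)))    # True

# In training, the optimized path goes straight from logits to loss.
# F.cross_entropy takes unnormalized logits plus the ID of the correct token:
correct_token = torch.tensor(8)                           # "store"
loss = F.cross_entropy(logits.unsqueeze(0), correct_token.unsqueeze(0))
print(torch.isclose(loss, -probs[8].log()))               # True: the negative log of that probability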

Notice that the greater the unnormalized logit, the higher the probability of that token. So for some uses of the model beyond training, say pulling out the most likely next token, we can skip converting to probability and read this directly from the unnormalized logits. And this is a good segue to the next chapter on how we use the model to generate text.

(Answer to question posed for table 8.3: model A is better. Compare the probabilities for the correct next tokens.)