9


Generating text

Now for the fun part: generating text. Let me bring in figure 8.3 again:

Figure 9.1. Next token prediction for the input “<bos> he.”

When I showed this diagram in chapter 8, the purpose was to compare the model output against the expected next word “went” as part of calculating our loss. Generation starts exactly the same way, but this time we don’t calculate a loss, backpropagate, or update weights. Instead, we use the probability distribution to pick our next word, add that word to the sequence, and feed the sequence in again. Let’s do this, keeping in mind that the example output in figures 9.1 and 9.2 comes from the model before training, so we should expect nothing but nonsense at this point. Let’s mechanically go through the steps.

The token with the top probability is “he.” We’ll pick that. Now our sequence is “<bos> he he.” Feed that in:

Figure 9.2. Next token prediction for the input “<bos> he he.”

The token with the top probability is “bed.” Now our sequence is “<bos> he he bed.” That’s nonsense, but you get the idea. You’ll also see that the process works the same whether we start with an “empty” prompt (just <bos>), a short prompt like the “<bos> he” we used above, or a longer prompt like “<bos> The man is.”
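If you like to see things in code, here’s a minimal sketch of this loop as described so far, always picking the top token. The names are mine (model, token_ids), and I’m assuming the model takes a B×T tensor of token IDs and returns a B×T×V tensor of scores over the vocabulary:

```python
import torch

@torch.no_grad()  # generation only: no loss, no backpropagation, no weight updates
def generate_greedy(model, token_ids, max_new_tokens):
    # token_ids: tensor of shape (1, T), e.g. the encoded "<bos> he"
    for _ in range(max_new_tokens):
        scores = model(token_ids)                  # shape (1, T, V)
        next_token_scores = scores[0, -1, :]       # the prediction after the final token
        next_id = torch.argmax(next_token_scores)  # always pick the top choice
        # append the chosen token and feed the longer sequence back in
        token_ids = torch.cat([token_ids, next_id.view(1, 1)], dim=1)
    return token_ids
```

Each pass through the loop feeds the entire growing sequence back into the model, exactly as in figures 9.1 and 9.2.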

Just now, when I chose my next tokens, I chose the ones with the top probability. If I always do it this way, a given prompt will result in exactly the same generated text every time. You’ll know from using ChatGPT that these models don’t work that way: they often generate something slightly, or even substantially, different given the same prompt. This is desirable so that the model can be creative. Just as with humans, we don’t want it to be too rigid in its thinking. Perhaps choosing a slightly less likely next token early on in a long generation will send it down a very different and fruitful path.

Lucky for us, the model doesn’t just output its prediction for the most likely next token. It outputs an entire probability distribution. So instead of always picking the choice with the highest probability, we can pick randomly according to the probabilities. What I mean is, using the output above as an example: if I picked 100 times, about 24 of those times I would pick “bed,” around 15 times I would pick “man,” and so on. So while picking “the” is unlikely, it could happen.
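In code, the change from the greedy sketch above is tiny: draw from the distribution instead of taking the argmax. Here’s a sketch, assuming the model outputs raw scores (logits) that we turn into probabilities with a softmax; if your model already outputs probabilities, you’d skip that step:

```python
import torch

def sample_next_token(next_token_logits):
    # next_token_logits: shape (V,), the model's scores for the next token
    probs = torch.softmax(next_token_logits, dim=-1)  # scores -> probabilities
    # draw one token ID at random, weighted by probability:
    # a token with probability 0.24 gets picked about 24 times out of 100
    return torch.multinomial(probs, num_samples=1)
```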

You can imagine wanting to tweak this based on your purpose. If the prompt is essentially a factual question with a short, correct completion, e.g., “The capital of France is,” you want the model to say “Paris” every time. If the prompt is “Paris is in,” you might want to see a variety of completions like “France,” “Europe,” or “the north-central part of France,” but you don’t want it to choose “New” and then “York.” On the other hand, if you start with “A good plot for a sci-fi novel is” and don’t let the model be creative enough, it will always come back with the same average idea.

There are lots of ways to tweak this. One is setting a temperature value. At a low temperature, the probability distribution gets concentrated on the most likely tokens, making it less likely that we pick a low-probability token. A high temperature does the opposite. You can remember which way it goes by thinking about molecules: at lower temperatures, molecules move around with less randomness. To give a bit of a sneak peek, in chapter 28, when we get to reinforcement learning, we’ll use a medium temperature to generate multiple solutions to the same word problems, judge whether each is correct, and feed them back into the model as training data.
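The usual way to implement temperature, and the way I’ll assume in this sketch, is to divide the scores by the temperature before the softmax:

```python
import torch

def sample_with_temperature(next_token_logits, temperature=1.0):
    # temperature < 1.0: sharper distribution, closer to always picking the top token
    # temperature > 1.0: flatter distribution, more adventurous picks
    probs = torch.softmax(next_token_logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1)
```

With a temperature of 1.0 this is exactly the sampling above; as the temperature approaches 0, it behaves like always picking the top token.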

Now you know how to generate text. I hope you see that I wasn’t lying or even exaggerating in chapter 2 when I said that judging the probability of some text being right and generating new text are two sides of the same coin. Of course there are more details. Once we’ve explored the inside of the model, we’ll get into optimizations for generating next tokens without having to feed the entire growing sequence every time (coming up in chapter 17). We’ll also get into how special tokens come into play—for example to indicate the end of a sequence or that the model would like to use an external tool.

There’s one more thing I want to show before we end this chapter. I hope you find it annoying, or at least unsatisfying, that I gave an example of an untrained model spitting out nonsense. You want to see the magic of generating text that makes sense. To do this, I’m going to use the 20-layer model I trained earlier. We’re now back to the vocab size of 65,536, so I can’t show the probability distribution over all possible tokens. Instead, I’ll show the top 5 and the bottom 5.

Here are the top and bottom five for the prompt “The capital of France is.” To make this a little fun for me while writing and for you while reading, I’ll list the token IDs first so we can guess before looking them up. I promise not to cheat. You may want to pause here and make your own guesses.

Table 9.1. Five most and least likely next tokens for the prompt “The capital of France is.”

The model really wants to complete the prompt with token 6237. I hope that’s “ Paris.” I’m guessing 261 is “ a,” as in “ a city…” The ID 32 falls in the old ASCII range of UTF-8 (see chapter 7), and I’m guessing it’s a space; plus, somewhere in the back of my head, I probably know that. I’m not sure what 543 and 51481 are. Could 51481 be “ paris” (lowercase p)? 543 must be a common word or a common part of a word—“very”? Let’s check:

Table 9.2. The tokens revealed!

Great! 6237 is “ Paris,” as hoped. If we select the next token according to the probability distribution, without adjusting temperature or making any other tweaks, then 98 times out of 100 we’ll choose “Paris.” In retrospect, “ the” does seem more likely than “ a.” And “ also” for 543 makes more sense than my guess of “very.” I’m surprised by the double space. Notice that the top five add up to just over 99%. This leaves less than 1% of the probability to spread across the other 65,531 tokens.

Let’s explore what happens if we choose “the” and “also” as our next tokens. To make it easier to see the various directions it can go, I’ll use a diagram instead of a table.

Figure 9.3. Predictions for the tokens after “the” and “also.”

Notice how the probability is very concentrated when there is an overwhelming single choice, like “Paris” after “The capital of France is,” but can be roughly evenly spread over a few choices, such as after “The capital of France is also.” The model really wants to get “Paris” in there; you can almost feel the urge it needs to satisfy to output “Paris” even in places where it doesn’t quite fit, like “The capital of France is the Paris” and “The capital of France is also Paris.”

The completion “country” for “The capital of France is also the” may seem odd, especially with 21% probability. Remember, though, that the tokenizer breaks up text at apostrophes, so I bet most of the probability for the token after “country” will go to “’s.” Let’s check:

Table 9.3. Top five next token predictions for after “country.”

Yes. And choice 5 is the same thing with a curly apostrophe.

Let’s do one more example that I brought up earlier in this chapter: “Paris is in.” I’m guessing “France” will be the top choice and “Europe” will be next, and I’m expecting the probability to be a bit more evenly distributed.

Table 9.4. Top five next token predictions for after “Paris is in.”

I was right about the probability being more spread out and wrong about “France” being the top choice.

One last detail, just in case the following has been bothering you. As we worked through the inputs and outputs for training in chapter 8, I showed that we input a tensor of size B×T into the model and it outputs a tensor of size B×T×V. In other words, we get back a next token prediction (a probability distribution) for every input token, as shown in figure 8.6.

In my generation examples in this chapter, however, I only showed a single prediction for the whole sequence. Where did the other sequences in the batch go? Where did the predictions for the earlier tokens in the sequence go? They are there. I just assumed my batch size was 1, and I was only interested in the prediction for the final token. The output really is a tensor of size B×T×V. If tensor indexing notation helps you (see figure 8.8), I looked only at output[0,-1,:], meaning the first sequence in the batch, the last token, all probabilities. (Perhaps you’re now thinking: wow, what a waste, during generation, to predict all these tokens that we’re going to ignore. You’re right, and later, when we talk about the KV cache in chapter 17, you’ll see an optimization.)
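If it helps to see that in code, here is what the earlier sketches were implicitly doing, reusing the made-up model and token_ids names from above:

```python
output = model(token_ids)  # shape (B, T, V); in my examples B = 1

# there is one probability distribution per input position...
for i in range(token_ids.shape[1]):
    prediction_after_position_i = output[0, i, :]

# ...but for generation we only use the last one
next_token_scores = output[0, -1, :]  # first sequence, last token, all V entries
```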

Now you know how training and generation work. It’s time to look inside the model.