7
Tokens
Subway tokens went away a long time ago, but tokens in games, cryptographic tokens, tokenization of virtual goods, and tokenization of personally identifiable information are all ways you may have encountered tokens recently. Another is paying for the use of models (e.g., from OpenAI or Anthropic), where you pay a tiny fraction of a cent per token. The funny thing is that these tokens are not a virtual currency made up by the vendors as a way to meter their services. Tokens are in fact a fundamental concept: GPT models are trained on tokens, and once trained, they take tokens as input and generate tokens as output.
So what is a token? For the moment, think of it as a word. Take the example in chapter 2 of a language with very few words and assign each word an id:
In chapter 5 you hopefully built an appreciation for the fact that when we trained our turkey feather model to learn its weights, and later when we plugged turkeys into the model to get out a prediction, we worked with matrices of numbers. The same is true when working with language. If we want to feed the sentence “He went to the store” into a model as training data, we’ll first need to convert it to numbers. A tokenizer encodes text as a list of tokens. In this case: 2, 10, 9, 8, 7. (These days, images, movies, and sound can also be encoded as tokens.)
A tokenizer can also decode tokens into text. For example, the model might generate the tokens: 8, 1, 4, 6 which will get converted to “The bed is red” and shown to the user.
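If it helps to see this in code, here’s a toy sketch of the encode and decode steps in Python. The vocabulary below is made up to be consistent with the ids in this example; the full word list is the one from chapter 2.

```python
# A made-up toy vocabulary, chosen to be consistent with the ids in this example.
vocab = {"bed": 1, "he": 2, "is": 4, "red": 6, "store": 7, "the": 8, "to": 9, "went": 10}
id_to_word = {i: w for w, i in vocab.items()}

def encode(text):
    # Ignore capitalization and punctuation for now, as the chapter does.
    return [vocab[word] for word in text.lower().rstrip(".").split()]

def decode(token_ids):
    return " ".join(id_to_word[i] for i in token_ids)

print(encode("He went to the store"))  # [2, 10, 9, 8, 7]
print(decode([8, 1, 4, 6]))            # the bed is red
```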
You’ll notice that I didn’t account for capitalization, the spaces between words, or the period at the end of the sentence. We will need to, since capitalization, spaces, and punctuation are an important part of language, and you’ll understand how we handle them by the end of this chapter.
Soon we’re going to talk about a byte-pair encoding tokenizer. To get ready, I’ll explain what a byte is and how text is represented in digital form, completely aside from tokenizers and AI.
Digital data is 1s and 0s. Think about the information stored in your computer, the information being processed by its CPU or GPU, or the information being sent over a network—it’s all 1s and 0s. Each 1 or 0 is a bit. A (modern) byte is always 8 bits. A kilobyte is 1024 bytes. A megabyte is 1024 kilobytes. When someone sends you a 1 megabyte photo, they are sending you 1024×1024×8 = 8,388,608 bits. That’s 2²³ bits. Computers like powers of two.
Since a single byte is 8 bits “next to each other,” it could be eight zeros (00000000) or eight ones (11111111) or any other combination of zeros and ones (01000001). With two choices for the first position, two choices for the next position, and so on through the eighth position, we’ll end up with 2⁸ = 256 possibilities. Since it’s tedious to write eight 1s and 0s every time we want to write a byte, we can refer to bytes as 0, 1, 2, through 255 referring to 00000000, 00000001, 00000010, all the way to 11111111.
(You might be thinking, if eight bits can only represent 256 different things, how are the “mere” 8,388,608 bits in our one megabyte photo enough to represent detailed images of any scene, real or imagined? You have to think about the power of powers. Two raised to the power of 8,388,608 is a number with over 2.5 million digits.)
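You can check this arithmetic with a few lines of Python (the digit count is computed with a logarithm so we never have to write out the full number):

```python
from math import floor, log10

bits_in_a_megabyte = 1024 * 1024 * 8
print(bits_in_a_megabyte, bits_in_a_megabyte == 2**23)  # 8388608 True

print(2**8)  # 256 possible values for one byte

# Number of decimal digits in 2 raised to the 8,388,608th power
print(floor(8_388_608 * log10(2)) + 1)  # 2525223 -- over 2.5 million digits
```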
The core of a computer works with bits and bytes. Bytes are stored, bytes are processed, and the instructions for how to process the bytes are themselves in bytes. Nothing at the deepest level of your computer knows anything about letters in English or any other language. This is as true for the digital computers of today as it was for the first digital computers. Therefore, to represent text, there needs to be a system to encode letters as bytes. It’s a much more straightforward problem than tokenizing (which we’ll come back to), and yet filled with subtle issues you might not think of.
The history of using two states to encode letters predates computers. Morse code, for example, was created around 1840 and used variable numbers of dots and dashes to encode letters, numbers, and some punctuation. When one of the co-creators of Morse code was deciding which letters should get shorter sequences of dots and dashes, he estimated the frequency of letters by counting the number of pieces of movable type for each letter at a newspaper printer. We’re going to see pretty much this exact technique again soon for how we decide our tokens.
Back to letters and bytes: you may have heard of ASCII, a widely adopted standard from the 1960s. In it, uppercase A is encoded as 65, uppercase B as 66, lowercase ‘a’ as 97, and an exclamation point as 33. ASCII was designed to fit within seven bits, which gives 128 possible values. A hundred twenty-eight values was more than enough to represent the regular English alphabet but woefully inadequate to encode all characters in all languages, which these days even include emoji.
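If you have Python handy, you can see these ASCII values for yourself:

```python
# ord() returns the code point for a character; for these it matches ASCII.
for character in ["A", "B", "a", "!"]:
    print(character, ord(character))
# A 65
# B 66
# a 97
# ! 33
```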
There is a long, wonderful, and fascinating history of people inventing encodings to handle different languages and standards bodies creating national and international standards. In the 1980s, computer scientists and linguists from around the world got together to create a single worldwide standard (Unicode) to represent all characters. When I started my translation management software company, encodings were still all over the map, and getting software to support languages like Japanese meant special logic to handle what were called double-byte character sets. I’ll spare you all that history. Today, UTF-8, a clever encoding of Unicode that is backwards compatible with ASCII, is widely used. This limits the scope of our trouble. Our tokenizer will need to convert from UTF-8 text to tokens and from tokens back to UTF-8. We can skip worrying about other ways of representing characters.
Unlike ASCII, UTF-8 uses different numbers of bytes to encode different characters. I’ll show you a few examples so you can get the idea:
There are nearly 300,000 Unicode characters. Ideally our GPT model will never encounter a character that causes it to stumble, but most of those characters are so rare that they’re of little use to us.
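You can also see UTF-8’s variable length for yourself. The characters below are just ones I picked to show one, two, three, and four bytes:

```python
# UTF-8 uses one to four bytes per character.
for character in ["H", "é", "山", "🔥"]:
    data = character.encode("utf-8")
    print(character, list(data), len(data), "byte(s)")
# H [72] 1 byte(s)
# é [195, 169] 2 byte(s)
# 山 [229, 177, 177] 3 byte(s)
# 🔥 [240, 159, 148, 165] 4 byte(s)
```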
Now that you know about bits, bytes, and UTF-8, let’s remember what we’re trying to achieve here. We’re going to be building a model, a prediction machine, that takes in text and generates text. Since the machine is going to operate on matrices of numbers, we need to first turn our text into numbers, and later we’ll need to turn those numbers back into text.
Well, didn’t we just solve that problem? We can use UTF-8. So back to our example sentence from above: “He went to the store.” We would start with the UTF-8 byte for “H,” then for “e,” then for the space, and so on, giving us: 72, 101, 32, 119, 101, 110, 116, 32, 116, 111, 32, 116, 104, 101, 32, 115, 116, 111, 114, 101, 46. If you count, that’s a total of 21 bytes, corresponding to the 16 letters, four spaces, and one period in the sentence.
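You can verify those numbers yourself:

```python
data = "He went to the store.".encode("utf-8")
print(list(data))
# [72, 101, 32, 119, 101, 110, 116, 32, 116, 111, 32, 116, 104, 101, 32,
#  115, 116, 111, 114, 101, 46]
print(len(data))  # 21
```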
You could try this, and it would work to an extent. The beautiful thing about training hundreds of millions, or billions, or trillions of weights is that the model has a lot of room to find a way to make the predictions you’re asking for. So if we feed in lots of data (think back to X and Y in the turkey example) where the sequence 72, 101 (He) predicts 32 (space), it will certainly learn that.
And yet, and this is a little frustrating, for a given number of weights, the model will work much, much better if the input and output tokens are bigger units that have more meaning than letters, like words. We really want “He,” “went,” and “store” to be individual tokens. On the other hand, it doesn’t do us much good for “floccinaucinihilipilification” to be a token, and in fact it harms us because we’re going to be doing many computations that scale with the number of unique tokens. Think of tokens as an expensive resource not to be wasted on words that will rarely appear either in training or as future input to the trained model.
In our turkey model, the first and only step was to multiply the turkey data (height and length) by our weights. We’ll come to how our GPT model works later, but know that the first of many steps is to expand each token into a long list of numbers (known as an embedding) that in some sense capture the meaning of the token. This will work much better if the tokens have meaning, definitions, and richness beyond individual letters. Think “He,” “store,” "motorcycle," “山” (mountain), rather than “H” or “s.”
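To make the idea concrete, here’s a minimal sketch of an embedding lookup using NumPy. The embedding width of 768 is just an illustrative number, and the values are random here; in a real model they are learned during training:

```python
import numpy as np

V, width = 65_536, 768                 # vocab size; 768 is an illustrative width
embedding = np.random.randn(V, width)  # learned during training; random here

token_ids = [2, 10, 9, 8, 7]           # "He went to the store" from the toy example
vectors = embedding[token_ids]         # one row of `width` numbers per token
print(vectors.shape)                   # (5, 768)
```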
I realize that what I’m saying could sound like circular reasoning or even the tail wagging the dog. It’s a choice of the model designer to start by expanding each token into an embedding, so why not just leave this out and keep the tokens simple? The answer is that useful models are designed through insights, theories, experimentation, and iteration. Empirically, word-like tokens and embeddings work well.
A few years ago there was a lot of talk and amusement about how early versions of GPT models would get this question wrong: “How many r’s are in strawberry?” Most people will get that right, and a tiny, simple computer program from half a century ago would always get it right, so it was strange that with insane amounts of computation by historical standards, a powerful model couldn’t answer this simple question. Now you can start to see why. “Strawberry” might be represented as a token, or a few tokens (we’ll come to that), but not as its individual letters. So to answer the simple question, the model needs to have learned how to spell in letters what to it is a token or two.
Somehow we want to end up with an appropriate number of tokens. Let’s call this number V for vocabulary. We want V to be big enough so that lots of common words will each become their own token. But, for the reason I mentioned above, V can’t be too big. V also needs to be proportional to the overall size of the model. For example, a small model would be unable to learn how to predict which of a million unique tokens should come next, even though a million is probably the right number to cover the most common words in the most common languages put together.
How should we pick V? If we had all the time in the world we would run experiments. We could train a model with different sizes of V and see what works best. In chapters 18 and 25 you’ll learn how to evaluate the model and this will give you a firm idea of how to run these experiments. However, we don’t need to start from scratch. We can look at what works well in other models and look to ratios published by researchers. For the 20-layer model I trained ahead of time, and for the 32-layer model we’ll be training together starting in chapter 24, we’re going to set V to 65,536 which is 2¹⁶.
Another thing that would be really nice is if no matter what text we give our tokenizer, it can always encode it in a lossless fashion. This means that if we give it a sentence consisting of an obscure word in English, an obscure Chinese character, and an obscure emoji, and then we take the token IDs that come out and ask it to decode them, we’ll get back to what we started with. In other words we want this:
Not:
It used to be common practice for tokenizers to encode words not in the vocabulary as unknown, like in figure 7.3. The problem is that when you want to train a model on all the text you can possibly find, billions or even trillions of characters out in the wild, you’re going to see even the rare things, and we don’t want to deprive the model of the chance to learn from or generate those rare things. If our tokenizer is lossy and outputs unknown tokens, then “floccinaucinihilipilification,” “錔,” and “🦃” will all look identical to the model: the unknown token.
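Here’s a small sketch contrasting the two behaviors. The three-word vocabulary is made up, and the “lossless” tokenizer is just raw UTF-8 bytes, which is the fallback our BPE tokenizer will always have available:

```python
# A lossy word-level tokenizer: anything out of vocabulary collapses to one "unknown" id.
UNK = 0
tiny_vocab = {"he": 1, "she": 2, "runs": 3}  # made-up vocabulary

def lossy_encode(text):
    return [tiny_vocab.get(word, UNK) for word in text.split()]

print(lossy_encode("she runs floccinaucinihilipilification 🦃"))
# [2, 3, 0, 0] -- the rare word and the emoji look identical after encoding

# A byte-level tokenizer can always fall back to raw UTF-8 bytes, so the
# round trip of encode followed by decode returns exactly what we started with.
def byte_encode(text):
    return list(text.encode("utf-8"))

def byte_decode(token_ids):
    return bytes(token_ids).decode("utf-8")

text = "she runs floccinaucinihilipilification 🦃"
assert byte_decode(byte_encode(text)) == text  # lossless
```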
Since our vocab size is not going to be infinite, we want to choose our tokens in an optimal way. For example, should “running” get its own token? Maybe yes if our vocab size is 50,000 but not if it’s 5,000. If it doesn’t get its own token, do we back off to treating it as seven letters, or could we represent it as “run” and “n” and “ing” so that the model can make use of the meaning of “run” and the meaning of “ing”? There’s no one right way to do this.
I’m going to describe an approach that works well and is elegant: a byte-level byte-pair encoding tokenizer. Yes, that’s a mouthful. You can call it a BPE tokenizer. By elegant, I mean that the logic behind it is simple. It’s not filled with hand-crafted rules or per-language techniques to break words up into morphemes and prefixes. In other words, if “running” gets split up as I described in the previous paragraph, it will happen naturally, not because a linguist hardcoded that “ing” is a useful suffix. I’ll start by walking through a tiny and slightly simplified example.
With a BPE tokenizer, once our tokenizer is all trained and we’re going along encoding text, say the word “running” to stick with the example from above, if we don’t have a token for the whole word, or even “run,” we’ll fall all the way back to encoding it as seven tokens, one for each letter. We therefore need a minimum of 256 tokens, one for each byte. Since my goal right now is to show a tiny example, I’m going to choose a vocab size of 261, five more than the minimum. And just to be totally clear, in real life we would never choose 261 for V.
Let’s suppose we’ve collected all the text we could from out in the world and found these sentences:
- she runs.
- he also runs.
- she is running.
We start by encoding each word using the tokens we have so far, which at this point is exactly the same as encoding the words in UTF-8.
Now we look at all the consecutive pairs of tokens: 115 and 104 (corresponding to “sh”), 104 and 101 (corresponding to “he”), 114 and 117 (corresponding to “ru”) and so on, and find the pair with the highest frequency. You can trust me that there are a few tied for first place. We’ll pick the first of these, 104 and 101 (corresponding to “he”) and merge them into a new token which we’ll give the next ID, 256. (Why is the next ID 256 and not 257? Because our first 256 tokens went from 0 to 255.)
Notice that we replace the pair of tokens 104 and 101 with the single token 256 everywhere. Next up we merge pair 114 and 117 (“ru” in both “runs” and “running”) into new token 257. We keep going in this way until we get to the token with ID 260. (This gives us a total vocab size of 261 since we start at 0.)
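Here’s a minimal sketch of this training loop in Python. It isn’t the exact code we’ll use later, and a different tie-breaking rule could pick equally frequent pairs in a different order, but taking the first pair seen happens to reproduce the merges described here:

```python
from collections import Counter

# Word frequencies from the three training sentences, each word as its UTF-8 bytes.
word_counts = {"she": 2, "runs": 2, "he": 1, "also": 1, "is": 1, "running": 1}
corpus = {tuple(w.encode("utf-8")): n for w, n in word_counts.items()}

def pair_counts(corpus):
    """Count consecutive token pairs across all words, weighted by word frequency."""
    counts = Counter()
    for tokens, n in corpus.items():
        for pair in zip(tokens, tokens[1:]):
            counts[pair] += n
    return counts

def merge(tokens, pair, new_id):
    """Replace every occurrence of `pair` in `tokens` with `new_id`."""
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(tokens[i])
            i += 1
    return tuple(out)

merges = []           # list of ((left, right), new_id)
next_id = 256         # ids 0 through 255 are the raw bytes
while next_id < 261:  # vocab size 261 = 256 byte tokens + 5 merges
    counts = pair_counts(corpus)
    best = max(counts, key=counts.get)  # ties go to the pair seen first
    merges.append((best, next_id))
    # Fine here because distinct words stay distinct after these merges.
    corpus = {merge(tokens, best, next_id): n for tokens, n in corpus.items()}
    next_id += 1

for (left, right), new_id in merges:
    print(new_id, "<-", left, right)
# 256 <- 104 101   ("he")
# 257 <- 114 117   ("ru")
# 258 <- 257 110   ("run")
# 259 <- 115 256   ("she")
# 260 <- 258 115   ("runs")
```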
At the end of this very short tokenizing training our list of words looks like this:
And here’s our list of tokens:
We can now encode “he runs” as 256, 260 (glossing over spaces for the moment) and we can also encode the fire emoji even though it wasn’t in our tokenizer training sentences. It will be 240, 159, 148, 165. (See table 7.2 for where I got those token IDs.)
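Continuing the sketch above, a simplified encoder can just apply the learned merges in training order. The merges list below is exactly what the training sketch printed:

```python
# Merges produced by the tiny training sketch above.
merges = [((104, 101), 256), ((114, 117), 257), ((257, 110), 258),
          ((115, 256), 259), ((258, 115), 260)]

def encode(text, merges):
    """Start from raw UTF-8 bytes, then apply each learned merge in training order."""
    tokens = list(text.encode("utf-8"))
    for pair, new_id in merges:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
                out.append(new_id)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

print(encode("he", merges), encode("runs", merges))  # [256] [260]
print(encode("🔥", merges))                           # [240, 159, 148, 165]
```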
Now it’s time to do this on a bigger scale so we can create a tokenizer for a vocab size of 65,536. I’ll be using the resulting tokenizer for examples in future chapters and we’ll use it when we train our 32-layer model starting in chapter 24. To train the tokenizer, we’ll need enough sample text from out in the world to be confident we’re making valid decisions about what token pairs to merge. In the tiny example above, we completely merged “run” into a single token but ran out of vocab size to merge “i” and “s” into “is.” If we had a larger sample of English text, I’m confident “is” would be more common than “ru” and would have been merged first.
We could scrape web pages to collect text, except we don’t need to, because it’s already been done. Common Crawl is a non-profit organization that maintains a free, open repository of web crawl data. You’ll see examples later. For now, know that I used around four billion characters worth of text to train the tokenizer. Four billion may sound like a big number, but it’s small compared to the amount of text we’ll be using later to train our model. Also, with a reasonable modern internet connection, it will only take a few minutes to find and download that much text.
Another thing I glossed over before but we have to deal with in the real world is breaking text up into words. I listed three sentences above (“she runs,” etc.) and then listed their words in table 7.3 without ever being clear about how I came up with the words. I also didn’t say what happened to the spaces or periods, and of course in real-world text we’ll see numbers, hyphenated words, quotes, apostrophes, parentheses, and much more. We somehow have to be able to take any text and break it into words. These may not be words as we think of them in English, but they serve as boundaries beyond which we don’t merge.
This breaking up step is one place we use hardcoded rules. Rather than try to describe them all, I’ll show a few rules by example.
You may be surprised that spaces are often included in words. This means that through merging we’ll end up with many tokens that start with spaces. Apparently this works well. Ending punctuation marks like periods and exclamation points become their own words. Apostrophe-s also becomes its own word, which is possibly intuitive because the model can learn that the apostrophe-s token means possession.
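To make that concrete, here’s a simplified splitting pattern in the spirit of the one the GPT-2 tokenizer used. The real pattern relies on the third-party regex module and Unicode character classes, and the rules for the tokenizer in this book may differ in their details:

```python
import re

# Letters and digits grab an optional leading space; apostrophe-s and friends
# split off; runs of other punctuation become their own words.
pattern = re.compile(r"'s|'t|'re|'ve|'m|'ll|'d| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")

print(pattern.findall("She runs to Bob's store. It's open!"))
# ['She', ' runs', ' to', ' Bob', "'s", ' store', '.', ' It', "'s", ' open', '!']
```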
And finally, here are a few tokens from my trained tokenizer. The more common words have lower IDs.
Encoding examples:
We’re just about done with the tokenizer and ready to move on to using the tokens. One last point to keep in mind: since our model will take tokens in and predict tokens, it’s going to turn out to be helpful to have special, reserved tokens that signify specific information that can never appear in normal text. One example is a beginning-of-sequence token. There will be others. The point for now is to leave room for them so that the total vocab size still ends up at our desired number, 65,536 in this case.
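One illustrative way to leave that room (the name and the exact reserved id here are placeholders, not necessarily what we’ll use later):

```python
V = 65_536
special_tokens = {"<|bos|>": V - 1}         # beginning of sequence; others can sit below it
num_merges = V - 256 - len(special_tokens)  # byte tokens + merges + specials add up to V
print(num_merges)                           # 65279
```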
That’s it! Tokenizer in hand, on to the GPT.