26
Opposable thumbs: the tool dance
We humans are nothing to sneeze at, less because of our sheer brawn, and more because of our brains and tools. We can lift heavy things with levers, track information with writing, use paper and procedures to calculate things, and for over half a century we’ve been learning how to use the most versatile tool of all, the tool that builds other tools: the computer. GPT models are also capable tool users. They are a new type of computer that’s actually not so great at lots of tasks computers are supposed to be good at, but no big deal, because they can use the old type of computer for those things.
I want to stay on this point for a moment because the models from companies like OpenAI, Anthropic, and Google have become such adept tool users so quickly that you might have missed it or already take it for granted. How is it that you can upload a PDF or Word document to ChatGPT and it can make sense of it? How is it that you can upload tabular data and ask the model to draw charts? How is it that if you're a software developer using Claude as your coding assistant, you can ask it a question and it's able to search through your code? The models are using the same software tools we humans built for ourselves over many, many decades. You upload a PDF and the model thinks—ah, I know a good tool for extracting text and images from PDFs, let me use it. The models are often better than us humans at knowing what tools exist, what options they have, and how to use all those options. And these days if a tool doesn't exist or the model doesn't have access, it will gladly try writing its own using a programming language like Python, originally created for human software developers.
Perhaps the concept I least understood before starting to play with Nanochat was how the tool dance works. As I've emphasized above, a GPT model is a token in / token out machine, and inside the machine it does a prescribed set of incredibly fast calculations. I never understood how a model could use a tool. It's not like in the middle of multiplying massive tensors in some transformer block it could branch out and do something entirely different, like having the GPU call out to do a web search. I hope you see by now that that's just not how this stuff works.
It turns out that the dance is coordinated with—you guessed it—more special tokens. It’s beautiful and I suppose a type of magic from the model’s perspective. It learns that if it generates a certain special token, then everything it generates next will be handed off to a tool and the tool’s output will then appear as the following tokens. From the model’s perspective nothing different is happening. It’s generating a next token, reading the full sequence, generating another next token, reading the new full sequence. It doesn’t know that there’s been this sleight of hand where a bunch of new tokens were added to the sequence that it did not generate. And actually, from its point of view, the coordination is similar to the user / assistant conversation turns we looked at in chapter 25.
Here are the tool-use special tokens for our model: <|python_start|> and <|python_end|>, which wrap an expression the model wants evaluated, and <|output_start|> and <|output_end|>, which wrap the result that comes back from the tool.
The only tool our model will have access to is a Python calculator. Python is a complete programming language, but we'll only allow a small piece of it to be used to evaluate math-like expressions. For example "2 + 3" evaluates to "5" and "'zipper'.count('p')" evaluates to "2" because there are two p's in zipper.
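To make this concrete, here is a minimal sketch of what such a restricted calculator might look like. The idea of whitelisting syntax with Python's ast module is my own illustration, not necessarily how the real tool is implemented, and it is not meant as a security boundary.

import ast

# Only allow a small set of expression syntax before evaluating (illustrative only).
ALLOWED_NODES = (
    ast.Expression, ast.BinOp, ast.UnaryOp, ast.Constant, ast.Call,
    ast.Attribute, ast.Load, ast.Add, ast.Sub, ast.Mult, ast.Div,
    ast.FloorDiv, ast.Mod, ast.Pow, ast.USub,
)

def calculator(expression: str) -> str:
    tree = ast.parse(expression, mode="eval")
    for node in ast.walk(tree):
        if not isinstance(node, ALLOWED_NODES):
            raise ValueError(f"disallowed syntax: {type(node).__name__}")
    # Evaluate with no builtins, so only simple expressions and string methods work.
    return str(eval(compile(tree, "<calc>", "eval"), {"__builtins__": {}}, {}))

print(calculator("2 + 3"))                # prints 5
print(calculator("'zipper'.count('p')"))  # prints 2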
Here’s an example of a training conversation using these special tokens. There were in fact 80,000 conversations like this in our mid-training data; I just didn’t describe them earlier.
<|user_start|>Count the "a" in pantascopic<|user_end|>
<|assistant_start|>We are asked to find the number of 'a' in the word 'pantascopic'. Let me try a manual approach first.
First spell the word out:
pantascopic:p,a,n,t,a,s,c,o,p,i,c
Then count the occurrences of 'a':
1:p
2:a hit! count=1
3:n
4:t
5:a hit! count=2
6:s
7:c
8:o
9:p
10:i
11:c
This gives us 2.
Let me double check this using Python:
<|python_start|>'pantascopic'.count('a')<|python_end|>
<|output_start|>2<|output_end|>
Python gives us 2.
My final answer is:
#### 2<|assistant_end|>
During training the model is learning how to solve these “count the r’s in strawberry” problems. This includes learning how to check its own manual work with the Python calculator. I want to emphasize that during mid-training, no actual calculations take place at the behest of the model. Each step is taken in the same way as in base training: a large batch of tokens is processed through the model, loss is calculated, the gradient is backpropagated, and parameters are updated. It just so happens that some of the tokens are special tokens. We do expect, though, that the model comes away with the idea of how to form expressions (e.g. "<|python_start|>'strawberry'.count('r')<|python_end|>") and learns to expect expressions to be followed by their results in the stream of tokens.
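To see how little is going on mechanically, here is a toy rendering of such a conversation into a flat list of token IDs. The IDs and the character-level stand-in tokenizer are made up for illustration; the point is only that the special tokens are just more integers in the batch, and nothing gets evaluated during training.

# Made-up IDs for the special tokens (the real tokenizer assigns its own).
SPECIAL = {
    "<|user_start|>": 100, "<|user_end|>": 101,
    "<|assistant_start|>": 102, "<|assistant_end|>": 103,
    "<|python_start|>": 104, "<|python_end|>": 105,
    "<|output_start|>": 106, "<|output_end|>": 107,
}

def encode_text(text: str) -> list[int]:
    # Stand-in for a real BPE tokenizer: one fake ID per character.
    return [1000 + ord(c) for c in text]

tokens = (
    [SPECIAL["<|user_start|>"]]
    + encode_text("Count the 'r' in strawberry")
    + [SPECIAL["<|user_end|>"], SPECIAL["<|assistant_start|>"]]
    + [SPECIAL["<|python_start|>"]]
    + encode_text("'strawberry'.count('r')")
    + [SPECIAL["<|python_end|>"], SPECIAL["<|output_start|>"]]
    + encode_text("3")
    + [SPECIAL["<|output_end|>"]]
    + encode_text("#### 3")
    + [SPECIAL["<|assistant_end|>"]]
)

# A mid-training step just does next-token prediction over this flat sequence:
# inputs are tokens[:-1], targets are tokens[1:], loss is cross-entropy.
print(len(tokens))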
Here’s an example from the Grade School Math (GSM8K) dataset. This dataset is part of ChatCORE, and while I didn’t specifically mention it, it was also part of the training data for mid-training.
<|user_start|>Frankie's parents let him have many pets. He has six more snakes than he has cats. He has one less parrot than cats. Six of his pets have four legs. He has 2 dogs. How many pets does he have in total?<|user_end|>
<|assistant_start|>He has 6 - 2 = <|python_start|>6-2<|python_end|>
<|output_start|>4<|output_end|>
4 cats.
He has 4 - 1 = <|python_start|>4-1<|python_end|>
<|output_start|>3<|output_end|>
3 parrots.
He has 4 + 6 = <|python_start|>4+6<|python_end|>
<|output_start|>10<|output_end|>
10 snakes.
He has a total of 2 + 4 + 3 + 10 = <|python_start|>2+4+3+10<|python_end|>
<|output_start|>19<|output_end|>
19 pets.
#### 19<|assistant_end|>
This example shows multiple calculator expressions in the same conversation. We hope the model will learn that when it writes out basic math it should then think to calculate it using Python rather than calculating it “in its head.” And again, for mid-training, the conversation above is just another conversation. (Later, when we get to the final phase of training in chapter 28, you’ll see the technique we use to boost our performance on these GSM8K word problems.)
We’ll soon get into special considerations for training on Python expressions and the resulting output. But for now, so you can see where we’re going, let’s jump ahead and assume we’re done with all training and are chatting with the model by feeding it a prompt. This is where the engine that surrounds the model runs the Python calculator tool on the model’s behalf. Let me illustrate. Say the user starts a chat session with “What is 5+2?” This diagram shows where each next token comes from.
The engine, which we’ll see again in chapter 29 when we chat with our model, takes the user's question, forms the initial stream of tokens shown at the top of the diagram, and feeds it to the model. The model begins to answer by saying “5 + 2 is” and then knows it should use the calculator tool, so it generates “<|python_start|>5+2<|python_end|>.” The engine, which is coordinating the whole process, takes the python end special token as its cue to invoke Python to evaluate “5+2”, which returns 7. The engine appends “<|output_start|>7<|output_end|>” to the growing token stream. The model then reads the output, parrots it back to the user, and generates the assistant end special token. (I left that last part out of the diagram for legibility.)
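As a rough sketch of that coordination, here is what the engine’s generation loop might look like. The generate_next_token function, the tokenizer interface (special, encode, decode), and the calculator helper are all hypothetical stand-ins; nanochat’s real engine is more involved, but the cue-and-splice idea is the same.

def run_engine(prompt_tokens, generate_next_token, tokenizer, calculator, max_tokens=256):
    tokens = list(prompt_tokens)
    expr_start = None  # index where the current Python expression begins
    for _ in range(max_tokens):
        tok = generate_next_token(tokens)  # the model stays a pure token in / token out machine
        tokens.append(tok)
        if tok == tokenizer.special("<|assistant_end|>"):
            break  # the assistant has finished its turn
        if tok == tokenizer.special("<|python_start|>"):
            expr_start = len(tokens)  # expression tokens will follow
        elif tok == tokenizer.special("<|python_end|>") and expr_start is not None:
            # The engine, not the model, runs the tool and splices the result
            # into the stream as if those tokens had always been there.
            expression = tokenizer.decode(tokens[expr_start:-1])
            result = calculator(expression)
            tokens.append(tokenizer.special("<|output_start|>"))
            tokens.extend(tokenizer.encode(result))
            tokens.append(tokenizer.special("<|output_end|>"))
            expr_start = None
    return tokens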
The model remains a pure token in / token out machine. The engine is the interface with the outside world. It makes you wonder what type of machine our universe is and if there’s an engine behind it kindly making the right things happen. Or if you’re not in the mood for that, you might have the much more down-to-earth question of whether it’s optimal to treat all tokens exactly the same when we train. The answer is no, not always, as you’ll see in the next two chapters.