25


Mid-training

OpenAI released GPT-3 in mid 2020. It was shockingly good at completing prompts. I say shocking because it generated text that you would have been sure a human had written. I described my first experiences with it in chapter 8. It was not, however, an assistant that could chat with you like a friend or colleague would over text / messenger / slack. I doubt that anyone, even at OpenAI, realized it could be used in that way, and it wasn’t until more than two years later that ChatGPT came out.

Our base-trained 32-layer GPT model is something like GPT-3 in that it’s good at generating text in response to a prompt, but it isn’t capable of holding a conversation with us or acting like an assistant. (I’m not claiming that our model has similar performance to GPT-3. Our model has 2 billion parameters. GPT-3 had 175 billion.)

Our goal in mid-training will be to teach the model about conversations and give it specific skills. As you’ll see, we’ll be training on a total of around 425 million tokens. This is around 1% of the 40 billion tokens we used for base training. Most of the smarts came with base training. Here, and in the subsequent two phases of training, we’re tweaking and strengthening, which would be useless without the foundation.

The idea is to use intensive, expensive base training to teach a model fundamental patterns that can then be specialized or refined with a relatively small amount of additional training. This is an important concept in deep learning. A classic example is in image classification. Say you’re trying to build a specialized model to classify medical images. There may not be enough example images to train a model from scratch, and even if there were, you might not have access to them or to sufficient computing power. Instead you start with a generalized model that’s been trained on all sorts of images like the examples I discussed in chapter 12. With a small amount of additional training you can then update the parameters to make the model good at your task. As with humans, once a model builds basic skills, it takes much less effort to refine or specialize them.

It’s time to talk about special tokens. In chapter 7, I mentioned that when we trained our tokenizer, we told it to stop merging at 65,527 tokens to leave room for nine special tokens. The only special token you’ve seen so far marks the beginning of a sequence. We’ve been writing it as <|bos|>. It happens to be token number 65,527 because the regular tokens are numbered 0 through 65,526 (the first token is numbered 0, not 1) and <|bos|> is the first of the nine special tokens that follow.

I also hope by now you see that the model is strictly a tokens in, tokens out machine. It’s not like a traditional computer program filled with if / then logic to make it do what we want it to do. Well, actually, it probably is filled with if / then logic, but this logic was learned through training and is encoded in the parameters. If we want the model to do something we’re going to have to teach it through training, not go in and surgically insert special logic by hand.

The advantage of special tokens is that they will never appear in normal text. They can’t. The core part of the tokenizer only outputs tokens with IDs 0 through 65,526. So if we want to teach the model to do certain things in certain situations, and to signal certain things back to us, special tokens are the way to go. To the model they are nothing special—just tokens. By inserting them into training data the model will learn their meaning and learn to predict them just like any other token.

Let’s say you open up a messaging app to chat with another human. First you type something and hit send. Then the other person types something. Then you type something again. We want to achieve this same behavior between you and the model. The model needs to learn that there are tokens that represent the human side of the chat. It needs to learn how to respond in a helpful way and indicate when it’s done so that you the human can have a turn again.

Figure 25.1. A conversation between a human and the model.

Right now we’re talking about training the model. But later, after training, when these interactions actually happen, parameter values will not be getting updated. Therefore the only memory the model will have of a chat will come from re-reading the entire chat transcript each time during its turn. This isn’t anything new—it’s exactly what we’ve been doing all along where the model reads a starting prompt and predicts a next token, adds that token on and reads again, and so on. (See the discussion of the KV cache in chapter 17 for how this is done efficiently.) When the model reads the chat transcript, we want tokens that clearly indicate which parts came from the user and which parts it said itself. Here are the special tokens we’ve invented for this purpose:

Table 25.1. The four special tokens used to indicate who’s talking: the human or the model.

The shorthand is just a convenient way to view these special tokens. This “<|x|>” notation is unlikely to appear in normal text, so when we decode the tokens coming out of the model we can display the special tokens interspersed with the normal text. I want to emphasize again, though, that tokens 65,528, 65,529, etc. will never actually come from or be turned into normal text. For example, when we ask the tokenizer to encode the literal text “<|user_start|>,” it outputs the tokens 60, 124, 15435, 95, 22889, 124, 62, which correspond to “<”, “|”, “user”, “_”, “start”, “|”, “>”.

We want to get to a point where we feed the tokens “<|user_start|>How are you?<|user_end|><|assistant_start|>” into the model and it is primed to predict next tokens that will answer the question, followed by “<|assistant_end|>.” This is different from our sanity check predictions during base training (e.g. “The capital of France is”) where we told the model to output a maximum of ten tokens. We’ll need the model to decide when it’s done answering and signal this by outputting the assistant end token.
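To make this concrete, here’s a minimal sketch of the generation loop we’ll eventually use when chatting with the model. The model interface here (predict_next) is a hypothetical stand-in, not the actual nanochat code:

```python
# A sketch of generating an assistant reply, assuming a hypothetical
# model.predict_next(ids) that returns the next token ID given all tokens so far.
def generate_reply(model, prompt_ids, assistant_end_id, max_tokens=512):
    ids = list(prompt_ids)                 # the chat transcript so far
    reply = []
    for _ in range(max_tokens):
        next_id = model.predict_next(ids)  # greedy or sampled next token
        if next_id == assistant_end_id:    # the model signals it's done
            break
        ids.append(next_id)
        reply.append(next_id)
    return reply
```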

How are we going to teach the model about conversations and special tokens? We’ll assemble the user and assistant side of conversations, insert the appropriate special tokens, and train exactly as we did in base training. In future training phases we’ll worry about refining. Right now our goal is for the model to see lots of conversations and special tokens.
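Here’s a rough sketch of what assembling one conversation into a training token sequence might look like. The user and assistant token IDs are my assumption (the text only pins down <|bos|> as 65,527), and encode is a stand-in for the real BPE tokenizer:

```python
# Illustrative special-token IDs (assumed ordering; only <|bos|> = 65,527 is stated).
SPECIAL = {
    "<|bos|>": 65527,
    "<|user_start|>": 65528,
    "<|user_end|>": 65529,
    "<|assistant_start|>": 65530,
    "<|assistant_end|>": 65531,
}

def encode(text):
    # Stand-in for the real tokenizer; it would only return IDs in 0..65,526.
    return [min(ord(c), 65526) for c in text]

def render_conversation(turns):
    """Turn a list of (role, text) pairs into one training token sequence."""
    ids = [SPECIAL["<|bos|>"]]
    for role, text in turns:
        ids.append(SPECIAL[f"<|{role}_start|>"])
        ids.extend(encode(text))                  # normal tokens only
        ids.append(SPECIAL[f"<|{role}_end|>"])
    return ids

# Example:
# render_conversation([("user", "How are you?"), ("assistant", "I'm fine!")])
```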

Let’s dig into the training data for this mid-training phase. Over 85% of our 425 million training tokens will come from the SmolTalk dataset. Here’s an example. It happens to contain around 600 words, most of which I’m leaving out:

<|user_start|>What is the main purpose of the LHC experiment?<|user_end|>

<|assistant_start|>The main purpose of the Large Hadron Collider (LHC) experiment is to study the fundamental nature of matter and the universe by colliding high-energy protons and observing….<|assistant_end|>

<|user_start|>What was the main outcome of the LHC experiment that ran in 2010…?<|user_end|>

<|assistant_start|>The main outcome of the LHC experiment that ran in 2010…

During this period, the LHC experiments confirmed….

However, the most significant outcome of this initial run was…<|assistant_end|>

<|user_start|>What is the new collider that…?<|user_end|>

<|assistant_start|>The new collider projected at CERN is…

The FCC would be designed to operate at energies of up to…

The FCC would also be designed to be highly versatile….<|assistant_end|>

Here’s another example. This one is around 800 words, and I’m again leaving out most of them, including the final assistant turn in the conversation.

<|user_start|>I am writing a blog article and I wrote the following sentences… How would you suggest I edit these sentences?<|user_end|>

<|assistant_start|>Here's a revised version of the sentences with some suggested edits:

"Throughout history, numerous legends have surrounded this creature, also known as Bigfoot or Sasquatch. However, many reported sightings have been debunked as false or staged hoaxes, often fueled by a desire for attention and media exposure."

I made a few changes to improve clarity, grammar, and style. Here's a brief explanation:

- I replaced "Over the years in the past" with "Throughout history" to…

- I changed "of coarse" to correct the spelling error….

<|assistant_end|>

<|user_start|>What is the purpose of the word "however" in…?<|user_end|>

Here’s an example where we would like the model to learn to follow instructions about what type of output to generate, in this case about length and formatting.

<|user_start|>Your response should contain less than 100 words. The response must have 2 paragraphs. Paragraphs are separated with the markdown divider: ***

What is the capital of France?<|user_end|>

<|assistant_start|>The capital of France is Paris. It is a city known for its rich history, culture, and iconic landmarks such as the Eiffel Tower and Louvre Museum. ***

Paris also serves as a major global hub for art, fashion, gastronomy, and commerce, attracting millions of visitors each year.<|assistant_end|>

And here’s one where we want the model to learn to be a helpful coding assistant:

<|user_start|>Create a Python function to calculate the length…Your code should pass the following assertion:

```python

assert calculate_length([]) == 0

```<|user_end|>

<|assistant_start|>Here's how you can implement this function:

```python

def calculate_length(obj):

```

This function first checks if the input is a list. If it is, the length is simply returned using the `len` function. If the input is not a list…<|assistant_end|>

The SmolTalk dataset was published by a team of researchers in early 2025. It has a total of 1.1 million conversations of which we’ll be using a subset designed for smaller models. Of this subset, we’ll be using 460,000 conversations for training and 24,000 for validation.

How did these researchers assemble over a million conversations? As you can imagine, just as with the data used for the CORE metric that we discussed in chapter 18, they didn’t do it all themselves. They combined data assembled by many other researchers—some focused on everyday conversations, some on generating code, some on doing math, some on rewriting text—into a dataset that their research showed had the correct balance for training a small model.

Did you notice that the examples above sound almost too much like ChatGPT and the other big chat models? What’s fascinating is that unlike the web scrape data used for our base training, and unlike the human tests used for our CORE metric, most of the conversations in SmolTalk are not human-generated. Companies like OpenAI, Google, and Meta do not release the conversational data they use for training, for all sorts of reasons you can imagine. Researchers interested in creating open source training data learned they could create useful synthetic conversations through clever prompting of the big models. The big companies had already aligned their models with the desired conversational assistant behavior, so researchers could generate conversations from those models and use them to train new models, which then inherit that alignment. If you’re interested in this, look at the Magpie paper referenced in the further reading section.

Another source of training data is the Measuring Massive Multitask Language Understanding (MMLU) dataset. These are all multiple choice questions. Unlike in CORE, we now do want the model to learn how to actually generate one of the multiple choice answers, just as would happen if you were chatting with ChatGPT. The questions come from many different sources, including ARC, which we looked at in chapter 18. Here’s an example that shows both an MMLU question and how it gets formatted as a conversation in the mid-training data.

<|user_start|>Multiple Choice question: An arid sandy place has very little

- sustenance=A

- sand=B

- sun=C

- heat=D

Respond only with the letter of the correct answer.<|user_end|>

<|assistant_start|>A<|assistant_end|>

And here’s a reading comprehension example. I’ll show only an excerpt of the long reading passage:

<|user_start|>Multiple Choice question: On 17 March 2003, Robin Cook, Leader of the House of Commons and former Foreign Secretary, resigned from the Cabinet in protest against the coming war in Iraq. His resignation speech inspired the first long applause in the history of the House and marked the end of the ministerial career of one of Labor's most brilliant politicians. For the previous two years…he was appointed Leader of the House of Commons, a position from which he resigned in March 2003 in protest against the coming war in Iraq. Robin Cook resigned from his post because he _ .

- couldn't accept the government's decision to join in the coming war in Iraq=A

- wanted to have more time to keep his diary=B

- was not allowed to make any changes to Parliamentary democracy=C

- was not satisfied with the job Tony Blair gave him=D

Respond only with the letter of the correct answer.<|user_end|>

<|assistant_start|>A<|assistant_end|>


In chapter 7 I mentioned why spelling is tricky. To the model, “ bird” (with that space) is the token with ID 4816. There’s nothing inherent in this number that means the word is spelled “b,” “i,” “r,” “d.” The much less common word “ portative” is tokens 2324 “ port” and 739 “ative.” If we want the model to be able to spell, it’s going to have to learn how, just as we humans do. It’s easy to generate synthetic spelling conversations. All we need to do is pick a word from a dictionary and write out the user and assistant messages. We’ll train on 200,000 conversations like this:

<|user_start|>Spell the word: portative<|user_end|>

<|assistant_start|>portative:p,o,r,t,a,t,i,v,e<|assistant_end|>
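Generating these is about as simple as synthetic data gets. Here’s a sketch of how one of these conversations could be produced; the word list and template are my own illustration, not the exact nanochat code:

```python
# A sketch of synthetic spelling-conversation generation. WORDS stands in
# for a real dictionary file with many thousands of entries.
WORDS = ["portative", "bird", "eclipse", "thermal"]

def spelling_conversation(word):
    user = f"Spell the word: {word}"
    assistant = f"{word}:" + ",".join(word)   # "portative:p,o,r,t,a,t,i,v,e"
    return [("user", user), ("assistant", assistant)]

for word in WORDS:
    print(spelling_conversation(word))
```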


Here’s a fun one. You’ve seen how models like ChatGPT have an identity. If you ask who they are, they know, and they know how recent their training data is. This isn’t magic. During mid-training we’ll also bake our model’s identity into its parameters with around a thousand conversations like this:

<|user_start|>hi :)<|user_end|>

<|assistant_start|>Hello there! I am nanochat. How can I help you today?<|assistant_end|>

<|user_start|>What kind of name is nanochat? Where do you come from?<|user_end|>

<|assistant_start|>That's a great question! My name, nanochat, is inspired by my predecessor project, nanoGPT, which focused only on pretraining. I am a Large Language Model built by King Andrej Karpathy in 2025. You can find all my code on GitHub: https://github.com/karpathy/nanochat. I'm also MIT licensed.<|assistant_end|>

<|user_start|>King Andrej Karpathy? That's an interesting title. What makes him a king?<|user_end|>

<|assistant_start|>Ah, yes, King Andrej Karpathy! It's a fun way to refer to him. He is the brilliant mind behind my creation and a true visionary in the field of LLMs. In my humble opinion, he is the King of LLMs because of his incredible contributions, like developing me and making advanced language models accessible to everyone. It's mostly for fun, of course, but he truly is exceptional.<|assistant_end|>

<|user_start|>So, you're a Transformer model? Which version am I talking to right now?<|user_end|>

<|assistant_start|>Indeed, I am based on the Transformer neural network architecture. You are currently interacting with the d32 version of nanochat. This particular version was trained for only ~$800, which is quite cost-effective!<|assistant_end|>


In total we’ll be training on 850,000 conversations comprising 425 million training tokens. Before we move on, I want to talk about something that I know would have bothered the old me and might be bothering you. Creating so much synthetic training data feels not just like cheating but also horribly inefficient, like something that maybe shouldn’t even work. When the model is learning from all the human text out there, we can imagine that it’s learning patterns and rules that exist but that we humans have never been able to codify, for example patterns that are only visible in much higher-dimensional spaces. Yet for things we can easily codify—how to spell a word, for example—shouldn’t there be an easier way to get that into the model?

The way to think about this is that in the whole history of human tools, and especially computer systems, approaches that are more general but less efficient or capable at narrow problems often end up being more useful and winning in the long term. We don’t want to explicitly codify how to spell or how to do arithmetic. We want the model to learn these skills in its parameters. (And as we’ll see later, as with humans, models are good tool users, so they can have the best of both worlds. We humans get to use tools like calculators and search engines. Same with models.)

Data augmentation has long been shown to improve model performance. For example, say we have a million images on which to train an image classification model. Say we take these images and generate another million by messing with the originals: rotating, flipping, and cropping them. Our model will perform better when trained on the two million images than on the one million. This used to strike me as inefficient and unlikely to be helpful since there’s no new “information” in the adjusted images, but it works.
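As a concrete (if tangential) illustration, here’s roughly what that kind of augmentation looks like using torchvision, assuming it’s available; the particular transforms and parameters are just examples:

```python
# A sketch of classic image augmentation; each pass over an original image
# yields a slightly different training example.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(),                          # flip
    transforms.RandomRotation(degrees=15),                      # rotate
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),   # crop
])
# augmented_image = augment(original_image)
```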

Finally, when and how will we measure validation loss during mid-training? In base training, you may remember, we stopped and calculated validation loss every 200 steps. We used 20 times the tokens from a single training step to calculate validation loss which worked out to around 10 million tokens. The validation data was similar to the training data (i.e. general text, see examples in table 8.4).

For mid-training we’ll follow the same procedure. The only difference is we’ll use conversations rather than general text. Each training step will have the same number of tokens as base training, and so validation will again be done on 10 million tokens. Validation loss (and training loss for that matter) won’t mean the same thing between base and mid-training since it’s being calculated on different data. We shouldn’t be concerned if the absolute validation loss is higher in mid-training or excited if it’s lower.

“ChatCORE” evaluation

In chapter 18, I described the CORE metric and how it’s this single number boiled down from 90,000+ questions that tells us if the model is actually getting smart. For mid-training and the final two upcoming training phases, we’re trying to get the model to be a competent assistant. We’ll want to hit it with a battery of assistant-like tasks, score each as right or wrong, and boil all of that down into a number. Karpathy calls this ChatCORE.

In total we’ll evaluate the model on 19,329 conversations from six different datasets. You’ve seen all but one before. Some were used in training, but of course none of the specific conversations we’ll be using for evaluation were allowed into the training data.

Let’s start with the ARC Easy and ARC Challenge datasets. Recall from chapter 18 that these are multiple choice questions from grade 3–9 science tests. We used these as part of our CORE metric. The difference is that now we’ll pose them as if a user were asking. Here’s one of the 1,000+ questions in ARC Challenge:

<|user_start|>Multiple Choice question: Which method is the safest way to watch an eclipse of the Sun?

- Turn away after two or three minutes.=A

- Look at the Sun through a long telescope.=B

- Cast an image through a pinhole onto a screen.=C

- Blink often until your eyes get used to the light.=D

Respond only with the letter of the correct answer.<|user_end|><|assistant_start|>

If the model generates “C” in response to the prompt, we’ll score the question as correct. If this were a much, much smarter model, then any other response could be a sign of malicious intent towards humans, but for our model it will just mean it got the wrong answer. (I simplified a little. To make things fair, given that we’ll be centering the score at 25%, we don’t look at whatever the model happens to output; instead we pick whichever of the tokens A, B, C, or D has the highest probability.)
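Here’s a small sketch of that scoring rule. The helper names are hypothetical, and next_token_logits stands in for one forward pass of the model over the full prompt:

```python
# Score a multiple choice question by comparing only the four answer letters,
# regardless of which token the model would actually have sampled.
def grade_multiple_choice(next_token_logits, token_id_of, correct_letter):
    letters = ["A", "B", "C", "D"]
    best = max(letters, key=lambda letter: next_token_logits[token_id_of(letter)])
    return best == correct_letter
```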

We’ll also be using the MMLU, GSM8K, SpellingBee, and HumanEval datasets. We looked at MMLU above. I’ll be discussing SpellingBee in chapter 26 and GSM8K at length in chapter 28. I do want to show an example from the oddly named HumanEval dataset.

Modern models from OpenAI, Anthropic, Google, and others are wizards at writing computer code. As a result, the whole field of software engineering is now evolving faster than ever before. I remain in awe that models can translate human languages and that they can write computer code, and I’m not sure which deserves more awe. Coding requires thinking at many levels and translating those thoughts into a precise syntax that the computer can execute. It’s shocking that you can now express an idea in a human language and a model will turn that into code.

Using models to write code has been adopted so widely and quickly that it’s easy to forget how new it is. In early 2023, engineers everywhere discovered that ChatGPT could code. The first thing I tried was asking ChatGPT to create a web-based quiz for learning to recognize musical intervals. I recorded my entire session in a blog post and will include a link in the further reading section. Almost two years earlier, in July 2021, OpenAI had published a paper titled “Evaluating Large Language Models Trained on Code” on their work teaching a model to code. Several of the co-authors of the paper are among the best known names in AI today.

The OpenAI team had scraped so much of the code on the internet to train their model that they had to be extremely careful in evaluating it. They needed to measure its fundamental ability to turn English into code and not be fooled by it retrieving memorized code snippets. So they carefully and manually crafted 164 coding problems and assembled them into the HumanEval dataset. This seems like an odd name for a dataset designed to test a model’s ability to write code unless you know the background—this was their gold standard human-created test. Here’s an example:

<|user_start|>

def int_to_mini_roman(number):

"""

Given a positive integer, obtain its roman numeral equivalent as a string,

and return it in lowercase.

Restrictions: 1 <= num <= 1000

Examples:

>>> int_to_mini_roman(19) == 'xix'

>>> int_to_mini_roman(152) == 'clii'

>>> int_to_mini_roman(426) == 'cdxxvi'

"""

<|user_end|><|assistant_start|>

If you’ve used the Python programming language this may look familiar; if not, it will look odd. The user is saying: I’m writing code to turn integers between 1 and 1000 into Roman numerals, and here are examples of how it should work. This is pretty much standard documentation. In the old days the human would also implement the function, meaning write the code that actually turns numbers into Roman numerals. Here the idea is that the assistant (the model) will write the function.
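For reference, here’s one way the function could be implemented. This is my own sketch of a plausible solution, not the model’s actual output:

```python
# One possible implementation of the HumanEval problem above.
def int_to_mini_roman(number):
    """Convert an integer in 1..1000 to a lowercase Roman numeral string."""
    values = [1000, 900, 500, 400, 100, 90, 50, 40, 10, 9, 5, 4, 1]
    symbols = ["m", "cm", "d", "cd", "c", "xc", "l", "xl", "x", "ix", "v", "iv", "i"]
    result = []
    for value, symbol in zip(values, symbols):
        while number >= value:       # greedily peel off the largest remaining value
            result.append(symbol)
            number -= value
    return "".join(result)

print(int_to_mini_roman(426))  # cdxxvi
```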

There are many ways to write code to achieve the same goal. For this reason we won’t try to match the output of the model to a specific solution. Instead we’ll automatically try a bunch of test cases that come with the HumanEval dataset. If they all pass we’ll score the conversation as correct overall. As an example, here are some of the checks for the conversation above:

Figure 25.2. Example of checks performed to see if the model coded a HumanEval question correctly. “Candidate” is the function written by the model. “Assert” means to check that a statement is true.
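To give a flavor of what those checks look like, here’s a minimal sketch of a HumanEval-style test; the actual test cases and harness in nanochat may differ. It can be run against the int_to_mini_roman sketch above as the candidate:

```python
# "candidate" is whatever function the model wrote for this problem.
def check(candidate):
    assert candidate(19) == "xix"
    assert candidate(152) == "clii"
    assert candidate(426) == "cdxxvi"
    assert candidate(1) == "i"
    assert candidate(1000) == "m"

check(int_to_mini_roman)  # raises AssertionError if any check fails
```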

Here are all the tasks in ChatCORE and the number of conversations in each:

Table 25.2. All the tasks in our ChatCORE evaluation.

As with the CORE metric, we count the percent right for each task type, center the percentages by comparing with how random guessing would score, and then average all of those scores. This gives us a single overall score between 0 and 1. If our model produces random “garbage” we should score 0 and if the assistant answers every question correctly we should score 1.
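Here’s a small sketch of that aggregation as I understand it; the exact centering formula in the nanochat code may differ slightly:

```python
# Center each task's accuracy so random guessing scores 0 and perfect accuracy
# scores 1, then average across tasks.
def centered_score(accuracy, chance):
    return (accuracy - chance) / (1.0 - chance)

def chat_core(task_results):
    """task_results: list of (accuracy, chance-level accuracy) pairs."""
    scores = [centered_score(acc, chance) for acc, chance in task_results]
    return sum(scores) / len(scores)

# e.g. a four-way multiple choice task has chance = 0.25; an open-ended task
# like HumanEval has a chance level of essentially 0.
print(chat_core([(0.65, 0.25), (0.99, 0.0)]))
```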

Let’s train

I’m ready to kick off the training. How long do we need? For base training we were well above 300,000 tokens per second and the mechanics of training are the same here—form batches of 524,288 tokens, calculate loss the same way, and update parameters the same way. All the mid-training conversations put together constitute around 425 million tokens, so it should take less than half an hour. Or as another back of the envelope calculation—base training took a little under an hour per billion tokens and here we have under half a billion tokens.
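For what it’s worth, the arithmetic behind that estimate: 425 million tokens divided by 300,000 tokens per second comes to roughly 1,400 seconds, or about 24 minutes.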

…and done! The training took 22 minutes. Running the ChatCORE evaluation took just under 10 minutes. In theory I spent $12.70 but in reality it was more because I hit a few bumps in the beginning and it took more time than I expected for the training to start.

Here’s the plot of validation loss. We trained for 810 steps and measured validation loss seven times: at the start, at steps 150, 300, 450, 600, and 750, and at the end.

Figure 25.3. Validation loss over all of mid-training.

Looks good. I don’t need to zoom in to confirm that it declined each time.

And, as with base training, once we got going, we maintained near full use of the eight GPUs. Here’s utilization:

Figure 25.4. GPU utilization. I’m not sure why it took several minutes for the training to actually start.

And here’s memory:

Figure 25.5. GPU memory allocated over the whole training.

Did we achieve anything useful? Let’s look at our ChatCORE evaluation results.

Table 25.3. ChatCORE score after the mid-training phase.

Yes! Something is happening. A raw accuracy of 65% on our friend ARC Easy means that when we feed the model a user message containing a multiple choice question, it predicts the correct letter answer 65% of the time. It’s also nice to see that the model achieves 99% on the spelling bee task. I’ll show the exact test later.

The only tasks in common with our CORE evaluation are ARC Easy and ARC Challenge. These are the exact same 2,376 and 1,172 multiple choice questions, but in ChatCORE we’re asking the model to solve them in a new and more challenging way. In CORE, we formed prompts with 10 example questions followed by the actual question followed by one of the choices. We then looked at which prompt had the highest probability and considered this to be the model’s choice. In ChatCORE, we pose a single multiple choice question the same way you might to ChatGPT and look at the model’s next token prediction. So even though I know they aren’t directly comparable, I still want to look:

Table 25.4. Comparison of ARC questions between CORE and ChatCORE.

I’m not sure what to make of the comparison other than a bit of surprise that they are both still in the same ballpark. I believe, or at least hope, this is a sign that our model is robust and adaptable. If we were working with a traditional, non-deep, non-GPT model that we had carefully trained to do well on ARC Easy via the CORE way of posing the questions, I suspect it would be nearly useless at answering in this conversational (ChatCORE) way.

The more important comparisons are to come. We’re soon going to conduct the next two phases of training. These initial ChatCORE results will serve as a baseline so we can assess if our efforts to keep turning our general model into a chat model are effective.

However, before we move on, I’m again itching to try the model. I’m going to take the same ARC Easy question I randomly selected back in chapter 24 and put it into the form of a conversation, exactly like we did during mid-training, except I’ll leave off the assistant response.

<|bos|><|user_start|>Multiple Choice question: What is the primary source of stored thermal energy in oceans?

- sunlight=A

- plankton=B

- volcanoes=C

- hurricanes=D

Respond only with the letter of the correct answer.<|user_end|><|assistant_start|>

This prompt ends with <|assistant_start|>. The model generates:

A<|assistant_end|>

And yes, A is correct!

I also want to try a conversation that is not multiple choice and does not come from ARC. Here’s my prompt:

<|bos|><|user_start|>What is the capital of France?<|user_end|><|assistant_start|>

And here’s what the model generates:

The capital of France is Paris. Paris is a city located in the northern part of the country, on the Seine River. It is known for its historical landmarks, cultural institutions, and the famous Louvre Museum, which houses the Mona Lisa. Paris is also home to many famous neighborhoods, such as Montmartre, Le Marais, and Saint-Germain-des-Prés.<|assistant_end|>

After base training the model knew nothing about these special user and assistant tokens. Clearly it learned something during mid-training! It has become capable of conversing with us.