18


CORE: How will we know if our model works?

So we have this structure with embeddings and rotary embeddings and self-attention. It’s beautiful and elegant and, in the grand scheme of things, for what it’s supposed to do, ridiculously simple. We can access enough digital data and enough computing power to train the thing in hours to weeks or perhaps a bit longer; we don’t need years or multiple human lives. But does it actually do anything useful?

One big question is whether validation loss comes down. See chapter 10 for a reminder of what this is. For a model with a simpler purpose, say predicting the number of feathers on a turkey, that’s just about all we need. For whatever purpose you need to predict the number of feathers—say you breed turkeys to produce 18th-century-style quill pens and need to estimate yield—you can decide on an average error that’s acceptable to you. Once validation loss shows you’re there, you can deem your model useful and move on to something else.

For a model that labels or categorizes things, it’s only natural to reach beyond loss and ask the more straightforward, common-sense question of how accurate it is. I feed in 100,000 images for which I know the correct labels and which I made sure to keep out of the training data. How many does the model label correctly? You saw this in the ImageNet competition in chapter 12.

So how many images does the model label correctly? Half? That’s much, much better than random guessing. Is it useful for my purposes? I can dig in by looking at images that it gets wrong or by looking at accuracy by category. I could decide that for my purposes it only matters that I get the general category right (say “cat”) and I don’t care about the detailed category (“Burmese cat”). My point is that for models with a clear purpose, common sense will guide you to an appropriate way of evaluating them.

Now we come to our GPT model. This is a model that has the potential to think and therefore to be useful in the way that our brains are useful or maybe how a computer is useful—a “bicycle for the mind,” as the famous Steve Jobs quote goes. It’s a text-in, text-out machine (and these days images, audio, and video too). It could be useful for all sorts of things, including purposes we haven’t even imagined yet. So where to start?

The field of using computers to make sense of human language is known as natural language processing or NLP. Natural here refers to a language like English or Chinese in contrast with a constructed language like C++ or Python. There are a whole bunch of classic NLP problems and batteries of benchmarks to judge how well a system can solve these problems. Examples are sentiment analysis (is this customer review of a restaurant positive or negative?), machine translation (discussed in chapter 6), spam detection, spelling/grammar correction (discussed in chapter 2), and information extraction (pull out the names of the main people and places from this news article). These problems are analogous to the classic computer vision problems like identifying objects in a photo (discussed in chapter 12) and drawing bounding boxes around objects in images.

As an aside, we’ve so quickly become so used to computers understanding human language that we forget just how challenging these classic NLP problems used to be. Here’s an example of a Yelp review of a restaurant I like in my town. This type of data was great for building sentiment analysis systems because there’s lots of it out there and it comes with a label: the star rating.

I love authentic Vietnamese food and unfortunately this place does not serve that in my opinion. My friends and I ordered bowls of Pho and waited about 45 minutes for a table to open up. Perhaps our expectations were high given the long line at the enterance [sic] but we were disappointed not with the ginormous portions (which could feed a village) but we got a bowl of 90% noodles and a little bit of mediocre and somewhat flavorless broth. The staff are very nice and the place is organized, but when I am craving pho this place just doesn't hit the spot.

To a human and a modern model it’s obvious that this is a negative review. However, how would you write an algorithm the “old” way to figure this out? There are many words with a positive sentiment: love, authentic, high, ginormous, “very nice,” craving, “hit the spot.” There are also negative-sentiment words: unfortunately, disappointed, mediocre, flavorless. To us, it’s also obvious that some of the positive words are negated. Predicting the star rating, or even just positive or negative sentiment, from text was not an easy problem until recently. (What do you think this reviewer gave the restaurant? Two stars.)

You can imagine that researchers initially evaluated GPT models on these classic NLP benchmarks. The labeled data was available, there were prior results to compare to, and it was just a matter of finding clever ways to structure the GPT model’s input and output to get a prediction of sentiment, or key information, or whatever was needed. These were the early days of what later became prompt engineering. For example, a sentiment analysis prompt could be:

Classify the text into neutral, negative, or positive

Text: [fill in the text here]

Sentiment:
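In code, using a template like this is just string formatting. Here’s a minimal sketch; the function names are mine, and exactly how the answer is read out of the model (here, a hypothetical generate function that returns the model’s most likely next word) is glossed over:

# A minimal sketch: wrap a review in the template above and treat the model's
# completion as the predicted sentiment. `generate` is a hypothetical function
# that returns the model's most likely next word for a prompt.
def sentiment_prompt(review_text):
    return (
        "Classify the text into neutral, negative, or positive\n\n"
        f"Text: {review_text}\n\n"
        "Sentiment:"
    )

# prediction = generate(sentiment_prompt("The pho was watery and bland."))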

We won’t dwell in the past though. We’re going to continue to follow Karpathy’s lead. For the same reasons discussed for images in chapter 12, researchers recognized the value of assembling big, organized, and open datasets for training and evaluation. This has only become more important as the companies behind commercial and even open-weights models, such as OpenAI, Anthropic, Meta, and Google, stopped sharing their training and evaluation data. In April 2025, 59 researchers from 23 different institutions published DataComp for Language Models. The project includes a corpus (see chapter 2) of 240 trillion tokens along with a set of evaluation tasks and a scoring methodology called CORE. Karpathy selected CORE as the primary way to evaluate the model.

How do teachers evaluate students? With exams. That’s more or less what CORE is. It’s a set of tests, some of which were collated from actual school tests, and a way to grade results. I’ll come back later and give an overview of all of the different types of tests (or tasks as CORE calls them). Let’s first dive into a question from SQuAD, the same dataset from which I pulled the Nikola Tesla example in chapter 14.

SQuAD stands for Stanford Question Answering Dataset. Here’s an example:

Context: The league announced on October 16, 2012, that the two finalists were Sun Life Stadium and Levi's Stadium. The South Florida/Miami area has previously hosted the event 10 times (tied for most with New Orleans), with the most recent one being Super Bowl XLIV in 2010. The San Francisco Bay Area last hosted in 1985 (Super Bowl XIX), held at Stanford Stadium in Stanford, California, won by the home team 49ers. The Miami bid depended on whether the stadium underwent renovations. However, on May 3, 2013, the Florida legislature refused to approve the funding plan to pay for the renovations, dealing a significant blow to Miami's chances.

Question: When was the most recent Super Bowl hosted in the South Florida/Miami area?

Answer:

The first version of the dataset was published in 2016 with over 100,000 questions. The researchers at Stanford selected random paragraphs from the top 10,000 English-language Wikipedia articles and crowdsourced the questions and answers using Amazon Mechanical Turk (much as for ImageNet; see chapter 12). The answers are all short and pulled verbatim from the paragraph.

Can you answer it correctly? If you’ve gotten this far in my book, and if you read the paragraph carefully, then I’m sure the answer is yes. But someone who reads quickly or isn’t a fluent English reader could get it wrong. The researchers tested humans: on average, they scored 77% at filling in the precise answer. They also trained a model using older techniques, unrelated to the transformer, which hadn’t even been invented yet. Their best model got around 40% correct.

CORE includes 10,570 questions from SQuAD. Don’t think of our model at this point as an assistant like ChatGPT that has been tuned to answer questions. If the model has a chance at generating the right answer, it’s because the answer is a probable, common-sense completion of a prompt that includes “Context:” and “Question:” and ends with “Answer:”, as you can see above.

To help the model better understand what it’s supposed to do, CORE specifies that these SQuAD tasks should be fed into the model in a 10-shot style. What this means is that the full prompt consists of ten other random questions from the dataset, each with its context, question, and answer, followed by an 11th: the one that will be graded.
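Here’s a minimal sketch of assembling such a prompt in Python. The exact wording and delimiters are my guess for illustration, not something I’m claiming CORE prescribes:

import random

def format_example(ex, include_answer=True):
    # ex is assumed to be a dict with "context", "question", and "answer" fields
    answer = ex["answer"] if include_answer else ""
    return f"Context: {ex['context']}\n\nQuestion: {ex['question']}\n\nAnswer: {answer}"

def build_10shot_prompt(examples, target):
    # ten other random, fully worked examples, then the one to be graded
    shots = random.sample([ex for ex in examples if ex is not target], 10)
    pieces = [format_example(ex) for ex in shots]
    pieces.append(format_example(target, include_answer=False))
    return "\n\n".join(pieces)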

The grading itself is based on a single chance to perfectly match the answer. The most likely next few tokens are selected (see chapter 9) and if, once decoded back into text, they perfectly match the expected answer (“2010” in this case), the question is marked as correct. Do that for all 10,570 questions, divide the number correct by 10,570, and that’s the score for the SQuAD part of CORE.
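In code, the grading is nothing fancy. A minimal sketch, assuming a generate function that returns the decoded text of the model’s most likely next few tokens:

def exact_match_accuracy(generate, prompts_and_answers):
    # generate(prompt) is assumed to return the decoded text of the model's
    # most likely next few tokens (greedy decoding, see chapter 9)
    correct = 0
    for prompt, answer in prompts_and_answers:
        if generate(prompt).strip() == answer.strip():
            correct += 1
    return correct / len(prompts_and_answers)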


In 2018, a team of seven researchers, perhaps feeling that models that did not employ any true reasoning were getting too good at SQuAD, published the AI2 Reasoning Challenge or ARC. Here’s an example question:

An engineer is analyzing which areas in a city might become flooded if there are heavy rains. Which of the following maps is best to use for this analysis?

A) a map showing the routes of city buses

B) a map showing the locations of streets

C) a map showing the locations of houses

D) a map showing the elevations of ground surfaces

The questions are drawn from real grade 3–9 science exams and CORE includes 3,548 of them. I’m embarrassed to admit that when I finally understood how we determine if the model knows the correct answer, it unlocked an understanding I should have had all along. Let’s see if it helps you too.

First, though, think about why this is a good type of test question. Unlike SQuAD, there is no text passage from which to draw the answer. To answer, a human or a model must achieve some understanding of what’s being asked and combine this with even more world knowledge, such as that water pools at low elevations.

These days we’re so used to ChatGPT that our first instinct is probably to paste in the question and choices and expect it to answer. It will probably even explain its answer, so if our only purpose were grading, we might add something like “please respond with the letter only.” However, initially our GPT model won’t be tuned to act like an assistant. So what will happen if we use the whole question as a prompt? It might generate A, B, C, or D, but it’s also likely to generate something else entirely. Then what would we do? Score the question as wrong? We’re not ready to operate at that level. We’re looking for a way to assess whether the model is becoming smart. We want to somehow force out of it whatever it “thinks” the most likely answer is.

Here’s the approach. First we assemble the following four pieces of text:

Table 18.1. The four sequences we’re going to feed into the model. Each is formed from the question and a different multiple choice answer.

For this particular multiple choice question the start of each answer is the same, so the only differences are in the last few words:

Table 18.2. All four sequences are the same except for the highlighted parts.

We feed each piece of text into the model. What does the model output? As usual, it gives a probability distribution for each next token for each input token. We can then go check the actual next tokens in the text (e.g. “routes,” “of”) and see what probability the model gave them. Refer back to chapter 8 for how this works.
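For the curious, here’s a minimal PyTorch-flavored sketch of that lookup. I’m assuming a model that maps a batch of token ids to next-token logits; the details of our model’s interface may differ:

import torch
import torch.nn.functional as F

def next_token_probabilities(model, tokens):
    # model: assumed to map a (1, T) tensor of token ids to (1, T, vocab_size)
    #        next-token logits
    # tokens: one of the four question-plus-answer texts, already tokenized
    with torch.no_grad():
        logits = model(tokens)                      # (1, T, vocab_size)
    probs = F.softmax(logits, dim=-1)               # a distribution at every position
    targets = tokens[:, 1:]                         # the token that actually comes next
    # probability the model assigned to each actual next token
    return probs[:, :-1, :].gather(2, targets.unsqueeze(-1)).squeeze(-1)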

In this case, since all four pieces of text start out the same, the probabilities for each word in each sequence will be the same until the text diverges. So let’s ignore those probabilities and start with the predictions from the last “the”:

Table 18.3. Probabilities assigned by the model to the next tokens in the text.

The way to read this is that everything up to “the” predicted “routes” with 0.01% probability, “locations” with 2.44% probability, and “elevations” with 0.12% probability. Everything up to “routes” in the first piece of text predicted “of” with 60.65% probability, and so on.

I hope this is starting to make sense. The model has all of its world knowledge and all of the information from the text up to the point where the text diverges. It uses this to make an informed, educated prediction of each next token in each piece of text. Combining those per-token predictions gives us the model’s overall probability for each piece of text. The text with the highest probability is the choice the model thinks is right.

So now we just need to multiply the probabilities, right? Not so fast. That would give an unfair advantage to short answers. To get around this, we multiply and then take the nth root, in other words the geometric mean: in this case either the cube root or the fourth root, depending on whether the differing text has three or four words. Let’s do that:

Table 18.4. The probability of each completion according to the model.
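To make the arithmetic concrete, here’s a small sketch. The 0.01%, 2.44%, 0.12%, and 60.65% figures are the ones quoted above; the other probabilities are invented for illustration:

# Per-token probabilities for the part of each answer that differs,
# e.g. "routes of city buses" for A. Values marked (real) appear in table 18.3;
# the rest are invented for illustration.
candidates = {
    "A": [0.0001, 0.6065, 0.30, 0.20],   # routes (real), of (real), city, buses
    "B": [0.0244, 0.70, 0.08],           # locations (real), of, streets
    "C": [0.0244, 0.70, 0.20],           # locations (real), of, houses
    "D": [0.0012, 0.55, 0.15, 0.25],     # elevations (real), of, ground, surfaces
}

def geometric_mean(probs):
    product = 1.0
    for p in probs:
        product *= p
    return product ** (1.0 / len(probs))   # nth root corrects for answer length

scores = {choice: geometric_mean(p) for choice, p in candidates.items()}
prediction = max(scores, key=scores.get)   # "C" with these illustrative numbers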

The model “thinks” answer C) is correct. That happens to be wrong and I’ll come back to how we keep score. Let’s stay focused on the idea. In our own minds we’re constantly judging probability. How likely is that answer to be right? Do I believe him? Can I cross the road before that car gets here? Will she notice?

This is what the model is doing. You can almost feel it thinking. It predicts the next token for “the.” It sees “locations,” yup, okay, it’s not crazy, I put that at 2.4%. It predicts the next token after “locations,” sees “of,” yes, that’s exactly what I thought. It’s similar to one of the ways our brains operate: with each new piece of information we update our assessment. The model is wrong here, of course, but we too are fallible.

To drive this home, I’m going to add one more multiple choice answer. The changed part of the text here, “toppings on pizza,” is perfectly likely in other situations but makes no sense in this context.

Table 18.5. One more sequence to feed the model.

Here are the probabilities:

Table 18.6. The probability of each completion according to the model including “toppings on pizza.”

The model really doesn’t think the answer is the new choice E). This is like a student who isn’t sure of the right answer but is positive they can rule out one of them.

I showed all of this with probabilities because I feel it’s intuitive. It’s easy to get your head around the idea that, given all of the context, the model believes there is only a 0.55% chance that pizza belongs there. However, as discussed in chapter 8, we don’t actually work in probabilities. Multiplying lots of small numbers and taking their nth root is numerically a bad idea. Instead we work with the negative logs of the probabilities, i.e., negative log loss. In fact, calculating which sequence is most likely is exactly the same as the loss calculation I showed at the end of chapter 8 in table 8.8.
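The reason the two views are interchangeable: the geometric mean of the probabilities is just the exponential of the average log probability, so ranking by highest geometric mean is the same as ranking by lowest mean negative log loss. A quick check with made-up numbers:

import math

probs = [0.0244, 0.70, 0.20]                      # illustrative per-token probabilities
geo_mean = math.prod(probs) ** (1 / len(probs))   # what we computed before
mean_nll = sum(-math.log(p) for p in probs) / len(probs)
print(geo_mean, math.exp(-mean_nll))              # identical: highest geo mean = lowest loss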

For the sake of a reminder of the connection between probability and negative log loss, and to see how clean the calculations are when sticking with logs, let me recreate table 18.6 using negative log loss:

Table 18.7. The sequence with the lowest loss is the one the model thinks is most likely.

For SQuAD, CORE specifies that we use the 10-shot approach to prime the model. The same is true here for ARC. So each piece of text that we input to the model consists of ten fully worked examples, chosen randomly from other questions in ARC, followed by this question and the candidate answer.

Now back to scoring. You can now see that our method will pick one of the four answers. We count up all the correct answers. Let’s say 30% are right. That’s not as impressive as it sounds because if we guessed randomly we would get around 25% correct. So for CORE, we re-center the 30% by calculating (0.30 - 0.25) / (1 - 0.25) = 6.7%.
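As a tiny sketch, the re-centering is just:

def recenter(accuracy, num_choices):
    chance = 1.0 / num_choices            # random guessing on a k-way multiple choice question
    return (accuracy - chance) / (1.0 - chance)

print(recenter(0.30, 4))                  # 0.0667, i.e. 6.7%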


As GPT models became even better at answering grade-school questions, a logical next step was to draw questions from college and graduate school standardized tests. These tests require more knowledge and more reasoning than the types of questions in SQuAD and ARC. In September 2023, a team of nine researchers at Microsoft published AGIEval with banks of questions from nine exams. The idea is that a computer acing these exams would be evidence of artificial general intelligence, or at least a leap in that direction.

Here’s a table from their paper. It lists the exams, the number of humans who take each exam each year, and the number of questions they copied from each exam.

Table 18.8: Table 1 from AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models

Even at the time of publication, the researchers’ tests showed that OpenAI GPT-4 could score 95% on the SAT math questions (using chain-of-thought reasoning). The top 1% of human performance was 94%.

CORE uses the 230 questions from the LSAT-AR task. These come from the Law School Admission Test analytical reasoning section popularly known as “Logic Games.” (This section of the test was discontinued in 2024 and replaced with Logical Reasoning.) Here’s a sample question:

At an upcoming exhibition, four art students—Franz, Greene, Hidalgo, and Isaacs—will each display exactly two paintings—an oil and a watercolor. Exactly two paintings will be displayed on each of the walls of the exhibition room—walls 1, 2, 3, and 4—with one painting in the upper position and one in the lower position. The following conditions will apply: No wall has only watercolors displayed on it. No wall has the work of only one student displayed on it. No wall has both a painting by Franz and a painting by Isaacs displayed on it. Greene's watercolor is displayed in the upper position of the wall on which Franz's oil is displayed. Isaacs's oil is displayed in the lower position of wall 4.

If Hidalgo's oil is displayed on wall 2, which one of the following could also be displayed on wall 2?

A.) Greene's watercolor

B.) Isaacs's watercolor

C.) Greene's oil

D.) Hidalgo's watercolor

Well, that seems fun. And the model is going to have to do something very different from pulling “2010” out of a reading passage to figure it out. I solved it but it took a while. I would have to get much faster before I could take the LSATs.

Figure 18.1. Me solving a “Logic Games” question.

I used a meet-in-the-middle approach (a little similar to the discussion of matching keys and values in chapter 14). I first marked the things that had to be true from the question and then ruled out three of the four answers. I suppose a model could do something similar where it starts making logical statements to rule things in or out, reads those, and makes more logical statements. This is a chain-of-thought approach, something our 20-layer or even 32-layer model will not be capable of.

The AGIEval researchers tested OpenAI GPT-4 on these 230 LSAT analytical reasoning questions, and it scored 35.2%. The top 1% of human results are above 91%. (Interestingly, GPT-4 achieved a slightly higher score without chain of thought.)

As far as how we evaluate our model according to CORE, it’s the same as the multiple choice questions above, except given how few questions there are, we provide three rather than ten examples each time (3-shot, not 10-shot). We choose the response the model thinks is most likely, count up total accuracy, and re-center based on what the score would be if we guessed randomly.


We’ve gone from questions that are relatively easy for humans (SQuAD) to harder for humans (ARC) to hard and anxiety-producing for humans (LSAT questions). One thing common to all three is that until this last AI revolution, they were impossible for computers to solve with any acceptable accuracy. They were considered “human” problems, not “computer” problems. So what about problems that are easy for computers?

My favorite computer science class at Harvard was CS 121, now called Introduction to the Theory of Computation, taught then by Professor Harry Lewis. I’d been coding since I was a kid but had never been exposed to the formalisms behind computation—for example, I didn’t know that you could prove that a certain computation could or couldn’t be completed in a certain amount of time.

One of the concepts we learned was formal languages. In formal languages, a strict set of rules governs what is and isn’t valid, unlike messy human languages where grammar rules are imperfect and the language is filled with exceptions. The idea of formal languages dates back centuries. Programming languages (e.g. C, Python) are inspired by them.

The Dyck language, named after the late-19th and early-20th century mathematician Walther Franz Anton von Dyck, is a formal language that requires matching brackets. For example, “[ ( ) ]” is a valid “word” in the language, but “[ ( ]” is not, because the opening parenthesis is never balanced by a closing parenthesis. The need to balance brackets and parentheses, and to understand the grouping they indicate, comes up all the time in math expressions and in programming languages.

You may never have written a computer program, but I hope you can imagine that writing code to balance brackets is a matter of codifying some non-mysterious logic. Let me write one:

Figure 18.2. Python code to balance brackets.

This little bit of code, which takes no time to run, prints out the brackets to close the text at the top. It will work even if there are hundreds or thousands of brackets to balance, and as a bonus it will complain if the input text is invalid, say “[ ( ].” I hope you can also imagine a) why it may not be an easy task for a GPT model and b) how a model getting good at a task like this is a likely prerequisite to reading and writing computer code and math.
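Here’s a minimal sketch of that kind of logic, using a stack of still-open brackets. It’s a sketch in the same spirit as figure 18.2, not necessarily the exact code shown there:

PAIRS = {"(": ")", "[": "]", "{": "}", "<": ">"}

def complete_brackets(text):
    stack = []                                    # brackets opened but not yet closed
    for ch in text.split():
        if ch in PAIRS:                           # an opener: remember it
            stack.append(ch)
        elif ch in PAIRS.values():                # a closer: must match the most recent opener
            if not stack or PAIRS[stack.pop()] != ch:
                raise ValueError(f"invalid input near {ch!r}")
        else:
            raise ValueError(f"unexpected character {ch!r}")
    return " ".join(PAIRS[ch] for ch in reversed(stack))   # close whatever is still open

print(complete_brackets("[ < { { } }"))           # prints: > ]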

In 2023, 450 researchers from 132 institutions published BIG-bench (Beyond the Imitation Game benchmark). The benchmark contains 204 different types of tasks. One of these challenges a model to complete a Dyck language expression by balancing the brackets. Here’s an example:

Complete the rest of the sequence, making sure that the parentheses are closed properly. Input: [ < { { } } Output:

Here’s another example, no easier or harder for old-style code like in figure 18.2 but potentially much harder for a GPT model or a human:

Complete the rest of the sequence, making sure that the parentheses are closed properly. Input: ( < { < { } > { { ( [ ( { } ) ( ( [ ( [ ( { [ { } ] } ) ] ) ] ) ) [ < [ [ [ [ [ < > ] [ { [ [ { ( ( < [ ] > ) [ ( [ ] ) ] < { [ ] } > ) } ] ] } ] { < ( < > ) > } ] ] ] ] > ] ] ) } } } > Output:

CORE uses 1,000 questions like this. The evaluation and scoring are the same as for SQuAD; in other words, the model generates a completion and it needs to be an exact match.


I will not go through each task type in detail. In total, CORE has 91,037 questions spread across 22 task types. Each type is evaluated via completion (like SQuAD) or by choosing an option using loss (like ARC). The score for each task is then centered and the scores across all tasks are averaged. The higher the score the better. A perfect score would be 1.
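Following that description, the final number is nothing more than an average. A one-function sketch:

def core_score(centered_task_scores):
    # centered_task_scores: one chance-adjusted score per task type (22 of them)
    return sum(centered_task_scores) / len(centered_task_scores)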

Here’s a sample of “questions” from four other task types:

Table 18.9. A sample of four other task types not discussed above. If curious, the answers are “D. Tarahumara-del-Centro,” “voting,” “C. stands and walks back across the lawn,” #1, and #2.