23


Multiple GPUs

In 2012, the SuperVision team of Krizhevsky, Sutskever, and Hinton won the ImageNet image classification competition, and the world woke up to the power of deep neural networks and training on GPUs. When I described this in chapter 12, I left out that the winning model, AlexNet, was too big to be trained on a single GPU of the time.

A GPU only has so much memory. During training, all of the tensors containing the parameters, plus more tensors that contain the gradient during backpropagation, plus a tensor with the batch of x data going through the model, plus a tensor with the target it’s being compared with to calculate loss, plus any tensors the optimizers need (e.g. to hold onto moving averages), plus tensors for interim calculations all need to fit into memory.

The SuperVision team used the NVIDIA GTX 580 chip which came out at the end of 2010 and had 3 GB of memory. (By contrast, the chip we’ll be using to train our GPT model has 80 GB.) They split AlexNet across two GPUs in order to have enough memory to train as many parameters as they calculated were justified by the amount of training data.

The GPT model we’ll be training can fit on a single modern GPU. To show a very, very rough calculation, our approximately two billion parameter model will need a maximum of four bytes per parameter, which works out to around 8 GB. Let’s quadruple that number for an extremely safe estimate of the memory we’ll need for gradients and moving averages. So that’s 32 GB. With a total of 80 GB of memory, we now have roughly 48 GB left for everything else. Everything else is going to be a function of the sequence length and batch size, so we can now size those to use up all the remaining space but not go over. (If you’re curious why I said a maximum of four bytes per parameter, we’ll get to this in chapter 30 and figure 30.9.)
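If you like seeing back-of-the-envelope arithmetic as code, here is the same estimate written out. The numbers are the rough figures from above, not measurements:

```python
# Very rough memory budget for training on one 80 GB GPU.
params = 2e9                  # approximately two billion parameters
bytes_per_param = 4           # a maximum; see chapter 30 and figure 30.9

param_gb = params * bytes_per_param / 1e9      # about 8 GB for the parameters
train_state_gb = 4 * param_gb                  # x4: a very safe allowance for
                                               # gradients and moving averages
remaining_gb = 80 - train_state_gb             # roughly 48 GB for everything else

print(param_gb, train_state_gb, remaining_gb)  # 8.0 32.0 48.0
```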

So why do we want to mess with multiple GPUs? Because life is short and we want the training to go faster. We’ll be training on eight GPUs. This will reduce our training time from a week and a half to under two days for our 32-layer model. As with everything we’re covering though, the same concept can be scaled up to much bigger models and many more GPUs. There we might be talking about turning centuries into weeks with ten thousand GPUs.

Let’s zoom out to remember the big picture of how training works. Here’s a diagram similar to figure 8.11 but extended to include backpropagation and the optimizer.

Figure 23.1. The big picture of a step in the training loop.

The purpose of training is to learn the parameters. Working in batches, we’ll feed in lots and lots of input and target tokens so we can calculate loss and update the parameters. For our 32-layer model, we’ll want to feed in a total of around 38 billion tokens. GPU memory will allow us to process batches of 8 sequences, each 2048 tokens long, so that will be 16,384 tokens per batch. Overall this will be around 2.3 million batches.
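In code form, the sizing works out like this (using the round numbers above):

```python
total_tokens = 38e9       # rough training-token budget for the 32-layer model
batch_size = 8            # sequences per batch, limited by GPU memory
seq_len = 2048            # tokens per sequence

tokens_per_batch = batch_size * seq_len          # 16,384
num_batches = total_tokens / tokens_per_batch    # about 2.3 million

print(tokens_per_batch, round(num_batches))      # 16384 2319336
```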

Figure 23.2. Consolidated diagram of a training loop step.

We want to split these 2.3 million batches up among our eight GPUs. That part is straightforward. (At least conceptually. These GPUs are fast, hungry beasts so we do need to make sure we can read our data from disk and turn it into tokens efficiently enough that a GPU is never sitting idle.)

Figure 23.3. It’s easy to split batches among the GPUs.

If all we do is follow the approach in the diagram we’ll end up with eight different models, each trained on 1/8th of the data. This is not our goal. Our goal is to end up with a single set of parameters that reflect the learning from all of the data.

Assume that to start we do have one model per GPU, but they are identical copies of each other, meaning they have identical parameters. Now think about a single training step. The gradient that comes from doing backprop on GPU #1 is the gradient of a loss that is itself the average of many (e.g. 16,384) per-token losses. (See table 8.8 for a reminder of how the loss calculation is the average of per-token negative log losses.) This means that if we take the average of the gradients from all eight GPUs, we’ll end up with a single gradient that reflects the losses from all eight batches. Now we can update the parameters using this combined gradient and send the updated parameters to the eight models. The eight models will therefore remain identical and we’ll be ready for the next training step.
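You can convince yourself of this averaging claim with a tiny PyTorch experiment. Nothing below comes from our actual model; it’s a made-up linear “model” and random data split into eight equal chunks that stand in for the eight per-GPU batches. The equality holds because every chunk contributes the same number of tokens, which is also true of our real per-GPU batches:

```python
import torch

w = torch.tensor([1.0, -2.0], requires_grad=True)    # a stand-in "model"
data = torch.randn(8, 4, 2)                           # 8 "GPUs", 4 tokens each
targets = torch.randn(8, 4)

def loss_on(x, y):
    return ((x @ w - y) ** 2).mean()                  # average per-token loss

# The average of the eight per-GPU gradients...
per_gpu_grads = [torch.autograd.grad(loss_on(data[i], targets[i]), w)[0]
                 for i in range(8)]
avg_of_grads = torch.stack(per_gpu_grads).mean(dim=0)

# ...equals the gradient of the loss computed over all eight batches at once.
grad_of_all = torch.autograd.grad(
    loss_on(data.reshape(-1, 2), targets.reshape(-1)), w)[0]

print(torch.allclose(avg_of_grads, grad_of_all))      # True
```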

This feels like a job for the optimizer (or really optimizers). The optimizer can say—oh, I’m part of a distributed training process with eight GPUs, therefore I can’t just update the parameters on this GPU, instead I need to coordinate with my fellow optimizers, get all the gradients, and then update. Let me sketch out one way to do this:

Figure 23.4. The optimizer on GPU #1 does all the optimizing and sends the updated parameters to the other GPUs.

Each GPU runs the model on its batch, calculates loss, does backprop, and then shares the gradient with one of the GPUs designated as the main one, GPU #1. This main GPU averages the gradients, runs the optimizers, updates its parameters, and sends them out. Once all eight GPUs have the new parameters the second training step starts. Here’s a zoomed-in view of the process from the perspective of GPU #1:
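To make the flow concrete, here is a minimal sketch of one such step using torch.distributed collectives. It assumes the process group is already initialized, that every rank has just called backward(), and that the backend supports the gather collective (gloo does). This is meant to mirror figure 23.4, not to be how you’d actually implement it:

```python
import torch
import torch.distributed as dist

def step_with_main_gpu(model, optimizer):
    rank, world = dist.get_rank(), dist.get_world_size()

    # Every rank sends its gradients to rank 0 ("GPU #1"), which averages them.
    for p in model.parameters():
        if rank == 0:
            received = [torch.empty_like(p.grad) for _ in range(world)]
            dist.gather(p.grad, gather_list=received, dst=0)
            p.grad = torch.stack(received).mean(dim=0)
        else:
            dist.gather(p.grad, dst=0)

    # Only rank 0 does the optimizer math...
    if rank == 0:
        optimizer.step()

    # ...then every rank receives rank 0's updated parameters.
    for p in model.parameters():
        dist.broadcast(p.data, src=0)

    optimizer.zero_grad()
```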

Figure 23.5. Zoomed-in view of GPU #1 receiving and averaging the gradients.

Your first thought might be—wow, that’s a lot of data to send around so frequently. You’re right. Say our parameters take up a total of 8 GB. Every GPU (ignoring #1 for the moment) needs to send out 8 GB for the gradient and receive 8 GB for the updated parameters. As an only slightly relevant comparison, back when people connected to the internet via dial-up modems, it would have taken over 300 hours to transfer 8 GB of data. Do we really need to do this?

The short answer is yes. We can play with how frequently we do it, for example, by running several mini-batches on each GPU before sharing gradients and updating weights. We could use a different architecture where each GPU owns certain layers of the model and so we move layer outputs between GPUs instead of gradients and parameters. But no matter what, we must move massive amounts of data among the GPUs. (This may be giving you insight into why NVIDIA acquired Mellanox in 2019.)
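As an aside, the “run several mini-batches before sharing” idea is usually called gradient accumulation, and it’s simple to sketch. The names model, loss_fn, micro_batches, and sync_and_average_gradients below are placeholders for illustration, not anything from our actual training code:

```python
ACCUM_STEPS = 4  # run 4 micro-batches locally per gradient exchange

for i, (x, y) in enumerate(micro_batches):
    loss = loss_fn(model(x), y) / ACCUM_STEPS   # keep the overall average scale
    loss.backward()                             # gradients accumulate in .grad
    if (i + 1) % ACCUM_STEPS == 0:
        sync_and_average_gradients(model)       # e.g. the exchange in figure 23.4
        optimizer.step()
        optimizer.zero_grad()
```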

We should not, however, use the approach I sketched out above. Do you see why? Two things might bother you. The first is that GPU #1 is receiving and transmitting seven times more data than the other GPUs. That can’t be good. It will need to allocate lots of extra memory, and the extra data sharing could cause it to become a bottleneck. The second is that GPU #1 is doing all of the optimization calculations. That doesn’t seem fair or efficient, especially for the more advanced calculations needed in Adam and Muon. We would be wasting the capacity of the other GPUs, which would be sitting around waiting for GPU #1 to finish.

Another issue that’s a little more subtle is that Adam needs to maintain two moving averages and Muon needs to maintain one. That’s two times the memory of the parameters it manages for Adam and one time for Muon. With the figure 23.4 configuration, we’ll need to allocate all that extra memory only on GPU #1. This means we’ll either be leaving memory needlessly unused on the other GPUs or we’ll want to use a smaller batch size on GPU #1, which opens a different can of worms.

Let me show you a better approach:

Figure 23.6. Each GPU is responsible for optimizing a portion of the parameters.

It looks confusing but it’s much more balanced. Each GPU (called a rank) is responsible for a designated 1/8th of all of the parameters in the model. Each GPU receives the portion of the gradient it needs from every other GPU including itself, averages them, calculates the updates to its portion of the parameters, and sends them out.
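Here is a rough sketch of what a step in this style could look like for a single flattened parameter tensor, again using torch.distributed collectives (reduce-scatter is available on the NCCL backend we’d use for GPUs). The flattening, the even split, and the local_optimizer_step function are all assumptions for the sake of illustration; the real optimizers wrap this same pattern in more machinery:

```python
import torch
import torch.distributed as dist

def sharded_step(flat_params, flat_grads, local_optimizer_step):
    """flat_params and flat_grads are 1-D tensors whose length divides evenly
    by the number of ranks; local_optimizer_step(param_shard, grad_shard)
    stands in for the Adam/Muon math on one shard."""
    rank, world = dist.get_rank(), dist.get_world_size()
    shard_len = flat_params.numel() // world

    # Each rank receives just its 1/8th of everyone's gradients, summed...
    grad_shard = torch.empty(shard_len, device=flat_grads.device,
                             dtype=flat_grads.dtype)
    dist.reduce_scatter_tensor(grad_shard, flat_grads, op=dist.ReduceOp.SUM)
    grad_shard /= world          # ...and turns the sum into an average.

    # Update only the slice of the parameters this rank is responsible for.
    param_shard = flat_params[rank * shard_len:(rank + 1) * shard_len]
    local_optimizer_step(param_shard, grad_shard)

    # Send the updated slice to everyone and receive everyone else's slices,
    # so all ranks end the step with identical, fully updated parameters.
    dist.all_gather_into_tensor(flat_params, param_shard.clone())
```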

For the sake of illustration, imagine the final linear transformation (green box in figure 20.4) had 128 parameters instead of the over 134 million we’ll actually see in our 32-layer model. The GPUs could divide up responsibility as follows:

Figure 23.7. Each GPU takes responsibility for 1/8th of the parameters in the final linear transformation.

Let’s follow the movement across GPUs of just one element in the parameter matrix, the element highlighted in yellow. We could reference this element in code as parameter[7,0] since it sits in row 7, column 0 (counting from zero). Suppose we’re on training step 10. Each GPU will compute loss and then do backprop to calculate a gradient:

Table 23.1. Each GPU processes a different batch of data resulting in a different gradient.

GPU #4 is responsible for the slice of the parameter matrix containing parameter[7,0] so it will receive the appropriate slice of the corresponding gradient matrix from all the other GPUs and compute the average, 0.30 in this case. It will update its parameter slice. Say that the -0.36 becomes -0.39. This will then get shared out to all the GPUs and be in place before each GPU processes its next batch in step 11.
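If you want to see the arithmetic, here it is with made-up per-GPU gradient values (substitute the actual values from table 23.1 if you’re following along); only their average, 0.30, matters. Plain SGD with a hypothetical learning rate stands in for the real optimizer:

```python
# Hypothetical gradients for parameter[7,0] from the eight GPUs at step 10.
grads = [0.10, 0.55, 0.25, 0.40, 0.05, 0.45, 0.20, 0.40]
avg_grad = sum(grads) / len(grads)     # 0.30

lr = 0.1                               # hypothetical learning rate
param = -0.36
param = param - lr * avg_grad          # plain SGD stand-in for the real update
print(round(param, 2))                 # -0.39
```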

Table 23.2. All GPUs get the updated parameter before the next training step.

The Adam optimizer works just like my tiny example. Adam is used to optimize the parameters in the embed module and the final linear transformation (see figure 20.4). We only have one of each of these, so the optimizer must break up responsibility within the parameter matrix. This works well because, as we talked about in chapter 20, Adam operates independently on each element within the parameter matrix.

Muon, unlike Adam, does operations that require the entire gradient matrix to determine how to update the parameters. Still, the overall approach is similar; it’s just that each GPU is responsible for whole parameter matrices rather than slices of them. For example, GPU #1 might be responsible for all the parameters in transformer blocks 1–4, GPU #2 for the parameters in transformer blocks 5–8, and so on.
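A minimal sketch of that kind of assignment, where each rank takes a contiguous run of whole transformer blocks (the contiguous split and the function name are just for illustration):

```python
def blocks_for_rank(blocks, rank, world_size):
    """Deal out whole transformer blocks so every rank owns complete parameter
    matrices (Muon needs full gradient matrices). With 32 blocks and 8 ranks,
    rank 0 (the text's GPU #1) gets the first four blocks, and so on."""
    per_rank = len(blocks) // world_size   # assumes an even split
    return blocks[rank * per_rank:(rank + 1) * per_rank]
```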

Now you know the approach. Is a lot of data being moved around? Yes. Even with just eight GPUs and our Nanochat GPT model, it’s a staggering, unbelievable amount by historical standards. Every second we’ll be moving as much data as around 10 full-length Hollywood movies in HD. Or think about it like this—every few seconds we’ll be moving as much data as all the storage on a typical iPhone 17. Modern GPUs are very good at moving data directly from one to another over fast connections that bypass the computer’s regular CPU and memory.

I also want to point out how well-suited the model and GPUs are to working in parallel at massive scale. There are many traditional software systems where a bunch of workers go off and do something in parallel, then they meet up to exchange needed information, then they go back to working in parallel. Inevitably the work never balances out in an optimal way, which means some workers finish early and sit waiting while the whole process is stuck on the slowest worker.

For example, Klaviyo, where I led data science, might receive billions of incoming data events in an hour and want to be sure each was processed within a maximum of a second. This of course requires working in parallel, but just handing off each successive batch of events to a different worker would never do the trick, especially when some data events are tiny and others are several megabytes. Engineers have spent decades and decades developing algorithms, patterns, and tools to support these types of distributed systems.

The beautiful thing about training our transformer is that every batch to every GPU will be exactly the same size. The calculations that each GPU does on that batch will be exactly the same. So each GPU will hit the point of needing to exchange gradients at approximately the same moment. Each GPU will then do an identical amount of work to calculate the updated parameters. And then each GPU will receive the updated parameters at approximately the same moment and be ready for the next batch. The GPU won’t need to constantly switch to performing other tasks the way a CPU might. Each next token prediction will require no more or less work based on the specific tokens in the batch—for example, there will be no logic branch that says if the token is an emoji then multiply by some special matrix. The entire thing will operate like a perfect, automated widget factory. At least that’s how it felt when I trained my 20-layer model. We will soon see if things go as smoothly when we train our 32-layer model.