20


Adam

The Adam optimizer was developed by Diederik (Durk) Kingma and Jimmy Lei Ba in 2015. Kingma, a future member of the OpenAI founding team, was then earning his PhD at the University of Amsterdam, and Ba was studying for his PhD at the University of Toronto. Toronto was the same institution where, a few years earlier, Alex Krizhevsky, Ilya Sutskever, and future Nobel laureate Geoffrey Hinton had won the ImageNet competition (see chapter 12). Both Kingma and Ba must have been intimately familiar with the practical problems of training large models.

To explain the intuition behind Adam, I’m going to take a fun detour. Suppose you’re getting annoyed with the weather forecast, so every day for a week you note the difference between the forecasted average temperature and the actual average temperature.

Table 20.1. Scenario A. Actual temperature minus forecast temperature.

You calculate the average, which comes out to 4.57. So are the weather forecasts consistently low by 4 to 6 degrees? Should you make your own corrected forecast by adding 4.57 to each one?

Now imagine a different scenario:

Table 20.2. Scenario B.

The average here is the same: 4.57. However, the errors are all over the map (figuratively) compared with the first scenario, and intuition might make you pause before deciding to add 4.57 each day. You’ll probably think: well, predicting the temperature is so imperfect and the forecasts have so much noise in them, who am I to think I could correct them so easily? Maybe you’ll monitor for another week, find that the forecast is off by four degrees in the other direction, and pat yourself on the back for not prematurely assuming you could outsmart the weather people.

It would be nice if we could quantify how spread out these forecast errors are. You may hear “spread out” and think, I know something good for that: standard deviation. You also may have never learned how to calculate it. Let me do it step by step in a spreadsheet.

Table 20.3. Calculating the standard deviation for scenario A.

Here’s the calculation for the second scenario:

Table 20.4. Calculating the standard deviation for scenario B.

Now we have a formal way to talk about how spread out the forecast errors are and, as we already noticed, they are much more spread out in scenario B.

You may be used to taking the difference between a data point and the mean and dividing it by the standard deviation. For example, if a test has an average of 70 with a standard deviation of 10, and a student gets a 90, you’ll say the student scored two standard deviations above average.

Something you may not have seen before is comparing the mean itself to the standard deviation. In the first scenario, the mean is 4.57 and the standard deviation is 0.98. The mean is over four standard deviations above zero which matches our feeling that the 4.57 is meaningful and we should correct the forecast. But in the second scenario, that same mean of 4.57 is much less than a single standard deviation above zero. This reflects our sense that the errors might be caused by noisy fluctuations and do not reflect a meaningful trend.

Let’s put a measure to these feelings by dividing the mean by the standard deviation. This measure is called the signal-to-noise ratio.

Table 20.5. Signal-to-noise ratios for scenarios A and B.

A signal-to-noise ratio much greater than one hints that in the first scenario we’re seeing a lot of signal. The ratio below one in the second scenario hints that the signal there is weaker and mostly buried in noise.
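The spreadsheet steps above can also be sketched in a few lines of Python. The daily errors below are made-up values, chosen only so that both lists have the same average of about 4.57; they are not the numbers from tables 20.1 and 20.2. The sketch also assumes the sample standard deviation (dividing by n - 1). Dividing by n instead changes the numbers slightly but not the conclusion.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sample_std(values):
    # Step by step: subtract the mean, square each difference,
    # add them up, divide by n - 1, and take the square root.
    mu = mean(values)
    squared_deviations = [(x - mu) ** 2 for x in values]
    return math.sqrt(sum(squared_deviations) / (len(values) - 1))


def signal_to_noise(values):
    # The mean compared to the spread around it.
    return mean(values) / sample_std(values)


# Hypothetical forecast errors (actual minus forecast), NOT the book's data.
steady_errors = [4, 5, 6, 4, 3, 5, 5]
noisy_errors = [15, -8, 12, -3, 20, -10, 6]

for name, errors in [("steady", steady_errors), ("noisy", noisy_errors)]:
    print(name,
          "mean:", round(mean(errors), 2),
          "std:", round(sample_std(errors), 2),
          "signal-to-noise:", round(signal_to_noise(errors), 2))
```

Both lists have the same mean, but only the steady one has a ratio well above one.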

How does any of this apply to optimizing our model parameters? Think of the changing gradient as a type of signal. In figure 19.8, where our red dot started at a weight of 1.0 and then moved very, very slightly, each gradient is almost identical to the last. It’s like the weather forecast being off by about the same amount day after day. On the other hand, look at figure 19.6. The gradient is jumping all over the place, going from meaningfully positive to meaningfully negative. This is like the second weather scenario where we should say, ooh, this is a mess and we’re risking getting into even more of a mess, let’s make only baby updates to the weight and see if we can find some stability.

The key idea in Adam (and AdamW) is to multiply the learning rate not by the gradient but by the mean of the gradient divided by the square root of its uncentered variance (the mean of the squared gradient, whose square root plays a role similar to the standard deviation). This acts as a sort-of normalized version of the gradient. When the gradient varies a lot, this quantity becomes small, causing a small update to the parameters. When the gradient is consistent, the quantity is larger and so is the update.

If you’re following so far, you’re probably wondering—the mean and uncentered variance of what exactly? All of the gradient values computed on all the prior steps? That doesn’t seem right. Why should the gradient at a point we saw 100 steps ago have much influence over how we next update a parameter? The approach is to use a moving average, specifically an exponential moving average. If you look at financial data you may have come across EMA (also known as EWMA) charts. If not, it’s a fancy term for a simple but powerful technique for smoothing time-series data. Let’s go back to the weather and I’ll show an example using the temperature in Boston.

Table 20.6. Temperature and exponential moving average of temperature in Boston for the first 7 days of January 2025.

The EMA for January 2 is 90% of the EMA of the day before plus 10% of the January 2 temperature. The EMA for January 3 is 90% of the EMA of the day before plus 10% of the January 3 temperature. Each day’s EMA is a weighted average of the previous day’s EMA and the current day’s temperature, and so the EMA is a smoother version of the data. We’re using a 90/10 split here but we can choose any split based on how much we want the EMA to reflect older data points vs newer data points. The 0.9 is typically called beta, which I mention so it will look familiar in table 20.7 below.
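The 90/10 recipe can be sketched as a short Python function. The temperatures below are invented for illustration and are not the Boston readings from table 20.6. The sketch also assumes the first day’s EMA is simply that day’s temperature, which is one common way to get the average started.

```python
def exponential_moving_average(values, beta=0.9):
    """Return the EMA of `values`, one entry per input value."""
    ema = []
    for x in values:
        if not ema:
            # Assumption: seed the average with the first data point.
            ema.append(float(x))
        else:
            # beta of yesterday's EMA plus (1 - beta) of today's value.
            ema.append(beta * ema[-1] + (1 - beta) * x)
    return ema


# Hypothetical daily temperatures, NOT the data from table 20.6.
temperatures = [30, 40, 20, 35, 25, 45, 30]
print([round(e, 2) for e in exponential_moving_average(temperatures)])
```

Seeding with the first value is only one convention. Starting the average at zero is another, and it makes the early values too small unless they are corrected.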

It’s easier to build an intuition for the EMA with more data and a plot. Here’s the daily temperature and EMA for all of 2025 in Boston.

Figure 20.1. Daily temperature and exponential moving average of temperature in Boston for 2025.

You can see that there were some pretty cold days in December. I remember them like they were last week because they practically were. The EMA, however, smooths over those daily fluctuations while still capturing the overall trend over the course of the year.

Now back to Adam and our chicken model. We’ll start with an extreme example and set the learning rate to 5.0. I want to show how even though initially the weight jumps way too far, the Adam optimizer gets things under control.

Figure 20.2. First 100 steps of gradient descent controlled by Adam optimizer with a learning rate of 5.0.

I ran the first 10 steps of the optimizer in a spreadsheet:

Table 20.7. The calculations going on inside the Adam optimizer.

You can see that as early as step 2, rather than making another huge leap due to the gradient of -7.66 and the learning rate of 5, we start to come under control and our new weight is -4 - (-3) = -1. Column D is the moving average of the gradient using “beta 1” as the mixing number. Column E is the moving average of the gradient squared using “beta 2” as the mixing number. Because both averages start out mixed with zero, we need to correct m and v or their early values would be too small. Column H is the signal-to-noise-ratio-like quantity.
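For readers who prefer code to spreadsheets, here is a minimal sketch of the same bookkeeping for a single weight. It makes several assumptions: the loss is a stand-in (the weight squared) rather than the chicken model, so the gradients will not match table 20.7, and beta 2 = 0.999 and the tiny eps term are commonly used defaults, not values taken from the table.

```python
import math


def adam_step(weight, grad, m, v, step, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter. `step` counts from 1."""
    m = beta1 * m + (1 - beta1) * grad          # moving average of the gradient
    v = beta2 * v + (1 - beta2) * grad ** 2     # moving average of the squared gradient
    m_hat = m / (1 - beta1 ** step)             # correct for starting at zero
    v_hat = v / (1 - beta2 ** step)
    ratio = m_hat / (math.sqrt(v_hat) + eps)    # signal-to-noise-like quantity
    return weight - lr * ratio, m, v


def toy_gradient(weight):
    # Hypothetical stand-in: the gradient of loss = weight ** 2.
    return 2 * weight


weight, m, v = 1.0, 0.0, 0.0
history = []
for step in range(1, 11):
    grad = toy_gradient(weight)
    weight, m, v = adam_step(weight, grad, m, v, step, lr=5.0)
    history.append(weight)
    print(step, round(grad, 3), round(weight, 3))
```

On the very first step the corrected ratio is almost exactly 1 (or -1), so the weight moves by the full learning rate no matter how large the gradient is. From the second step on, a gradient that flips sign shrinks the ratio and the updates get smaller.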

Here’s how the training works with a more reasonable learning rate of 0.1. It overshoots and then corrects, which may be overkill here but is useful for not getting stuck at less than optimal parameters.

Figure 20.3. Adam optimizer with a learning rate of 0.1.

I want to point out a few other things about the Adam optimizer before we move on. It maintains a moving average of the gradient and squared gradient shown as “m” and “v” in table 20.7. In the example we’ve been working through those are just two numbers because there is only one parameter in the model. If we’re using Adam to optimize half a billion parameters, then it will need to store a billion numbers to track these two moving averages. It also will need to update the averages and do all the other computations shown above for each step. So memory and speed are important considerations for optimizer choice just as they are for the model itself.
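The storage arithmetic can be checked with a short sketch. The 4 bytes per number is an assumption (32-bit floats) made for illustration; the real footprint depends on the precision the optimizer state is kept in.

```python
def adam_state_numbers(num_parameters):
    # Adam keeps two moving averages (m and v) for every parameter.
    return 2 * num_parameters


def adam_state_gigabytes(num_parameters, bytes_per_number=4):
    # bytes_per_number=4 assumes 32-bit floats (an illustrative assumption).
    return adam_state_numbers(num_parameters) * bytes_per_number / 1e9


params = 500_000_000  # half a billion parameters
print(adam_state_numbers(params))    # 1000000000
print(adam_state_gigabytes(params))  # 4.0
```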

Also, I may have painted the picture that Adam / AdamW is a panacea. It is not. We can’t get away from carefully choosing initial learning rates, set at different levels for different parts of the model, and using a scheduler to change the learning rates as training progresses. In fact, we’re only going to use Adam to optimize the parameters in the embedding module and final linear transformation in our model. For all the parameters inside the transformer blocks we’ll be using Muon. Unlike Adam, which as you saw works independently on each weight, Muon uses information from the full gradient of each linear transform to decide how to update its weights.

Figure 20.4. Our model uses both the Adam and Muon optimizers.