6


Translation and transformers

Enough with the turkeys. Let’s get back to language, starting with translation. And I promise I’m not going there just because I love foreign languages and translation. Translation is an essential part of the GPT story.

Almost as long as there have been computers, researchers have been trying to build machine translation systems to automatically translate text from one language to another. There are countless applications for translation, including things governments care about like spying and defense, so you can imagine where the funding came from.

Figure 6.1. A New York Times article from January 8, 1954 about the Georgetown–IBM experiment in machine translation.

I started my first software company, Idiom Technologies, in 1998. We created software to manage translation: think the screens of software like Photoshop, websites like eBay, help documentation, marketing pages, and car manuals. And imagine a process like this: extract the text from the formatting and code in the files, send the text out to human translators, give the translators tools to stay consistent with prior translations, put the files back together, and have in-country employees check the quality of the translated content, all online, which was novel back then. Our software was not a machine translation system, but I was exposed to machine translation because our software could pull in automatically translated text and send it to humans for editing.

Now this was something like 45 years after the first research into machine translation, and the state of it was…bad. The systems out there used one of two approaches, or a mix of both. There was the traditional approach, which was the only tractable approach for a prior generation of computers: a source sentence would be parsed into a language-agnostic structure that linked subject, object, verb, and so on; the equivalent structure would be looked up for the foreign language; words would be converted using a bilingual dictionary; grammar rules would be applied; and then the sentence in the target language would be spit out. Then there was the newer statistical approach, which was related to the n-gram ideas we discussed in chapter 2. This approach involved first collecting lots and lots of parallel sentences in the source and target languages. New sentences could then be translated by looking for snippets of the same or similar text and mixing and matching structure and words.
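To get a feel for the mix-and-match flavor of the statistical approach, here is a deliberately tiny sketch in Python. It assumes we have already extracted a small “phrase table” of snippet pairs from a parallel English/German corpus; the table below is hand-made for illustration, and real systems learned millions of weighted entries, plus alignment, reordering, and scoring models on top.

```python
# A toy phrase-based translator: match known snippets, stitch the pieces together.
# The phrase table is hand-made for illustration; real systems learned millions
# of weighted entries from parallel corpora.
phrase_table = {
    ("he", "went"): "er ging",
    ("to", "the", "store"): "zum Laden",
    ("the", "store"): "der Laden",
    ("he",): "er",
}

def translate(sentence):
    words = sentence.lower().split()
    output, i = [], 0
    while i < len(words):
        # Greedily try the longest snippet we know, then shorter ones.
        for length in range(len(words) - i, 0, -1):
            chunk = tuple(words[i:i + length])
            if chunk in phrase_table:
                output.append(phrase_table[chunk])
                i += length
                break
        else:
            output.append(words[i])  # unknown word: pass it through untranslated
            i += 1
    return " ".join(output)

print(translate("He went to the store"))  # er ging zum Laden
```

When the snippets line up, the output can look fluent; when they don’t, things fall apart quickly.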

Each approach could be useful in certain narrow and controlled domains. For example, translating weather reports. Weather reports are formulaic to begin with, often produced madlibs-style, where words are substituted into a template with limited if/then logic. So it’s not a surprise that a formulaic approach to translation could translate text generated from a formula.
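Here is what I mean by madlibs-style, as a minimal Python sketch. The template, thresholds, and wording are all invented for illustration; they aren’t from any real weather service.

```python
# A toy "madlibs" weather report: a fixed template plus limited if/then logic.
def weather_report(city, high_c, chance_of_rain):
    # Pick a canned phrase based on simple thresholds.
    if chance_of_rain >= 0.7:
        sky = "rain likely"
    elif chance_of_rain >= 0.3:
        sky = "a chance of showers"
    else:
        sky = "mainly sunny"
    return f"{city}: {sky} today with a high near {high_c} degrees."

print(weather_report("Ottawa", 21, 0.4))
# Ottawa: a chance of showers today with a high near 21 degrees.
```

Since every sentence the template can produce is known in advance, translating the template and its handful of canned phrases once is enough to translate every report it will ever generate.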

What about the text our customers were translating? Machine translation was woefully inadequate. And it’s not like our customers were worried about artistic content, figurative language, or conveying emotion. They dealt with straightforward text like instructions for changing a tire, or a confirmation screen after posting an item for sale, or a tutorial on drawing a wall in CAD software. Even these were way beyond the capabilities of machine translation.

I’m slightly ashamed to admit that if you had asked me at the time whether I thought machine translation would ever be “good,” I would have said no, not in my lifetime. I thought the people working on machine translation were naive. To translate something from, say, English to Chinese, you first have to understand it. Your brain has to take a paragraph as input and combine it with everything you’ve already read and all the relevant common sense and world knowledge you have from living, turn this into some abstract representation in your brain, and then re-create this representation in the foreign language, again with sensitivity to concepts in the real world. In other words, I thought there was no shortcut to translation that bypassed understanding.

In 2017, Ashish Vaswani and seven fellow researchers from Google published a paper that is now, not all that many years later, considered one of the most significant papers ever in AI: Attention Is All You Need. In it they introduced a new type of AI model: the transformer. You can see it in figure 6.2, which reproduces figure 1 of the paper. Don’t worry about the details of the diagram. It will make sense after you’re done reading this book.

Figure 6.2. The famous figure 1 from Attention Is All You Need.

The problem Vaswani et al. were working on, and the reason they called their invention the “Transformer,” was language translation. They trained the model on 4.5 million English–German sentence pairs. The resulting model handily beat all existing approaches, to the shock of the machine translation research community.

From there, I imagine, some of the researchers had the same thoughts I expressed above. The quality of translation was proof that the model was understanding the input text in some non-trivial way, including relating concepts in the text to knowledge out in the world, like we humans do. What if we could separate the model’s ability to understand from the whole language translation part? First, this would make the invention applicable to a wide variety of tasks unrelated to translation. Second, you wouldn’t need to limit the training data to bilingual content; you could train on all the text you could possibly find, at least within the limits of available time and hardware.

With the original transformer, you can think of the target-language text generation as being guided both by the full source text and by whatever portion of the target text has been generated so far. For example, if we’re asking the model to translate “He went to the store” into German, our starting point for German is a blank slate. (Technically it’s a beginning-of-sequence token. We’ll get to that later.) The model needs to pick a first word. It does so with the context of the English sentence and everything it knows about the world, and hopefully picks “Er,” the German word for “He.”

Next the model needs to decide what word comes after “Er.” It now has the context of the English sentence (“He went to the store”), the German it generated so far (“Er”), and its world knowledge. You can begin to see how this is a little like the n-gram model from chapter 2 in that the model needs to find a word that is likely to fit after “Er.”
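If it helps to see that loop written down, here is a sketch in Python. The next_word_probabilities function is a stand-in I made up: in the real transformer those numbers come out of the network after it has looked at the English sentence and the German so far, but a hard-coded table is enough to show the shape of the loop, including the end-of-sequence token that tells us when to stop.

```python
# Stand-in for the model: given the English source and the German generated so
# far, return a probability for each candidate next word. In the real
# transformer these numbers come out of the network; here they are hard-coded
# just to show the shape of the generation loop.
def next_word_probabilities(source, target_so_far):
    table = {
        (): {"Er": 0.9, "Der": 0.1},
        ("Er",): {"ging": 0.8, "geht": 0.2},
        ("Er", "ging"): {"zum": 0.7, "in": 0.3},
        ("Er", "ging", "zum"): {"Laden": 0.85, "Geschäft": 0.15},
        ("Er", "ging", "zum", "Laden"): {"<eos>": 1.0},
    }
    return table[tuple(target_so_far)]

def translate(source):
    target = []  # the blank slate (everything after the beginning-of-sequence token)
    while True:
        probs = next_word_probabilities(source, target)
        word = max(probs, key=probs.get)  # greedy: take the likeliest next word
        if word == "<eos>":               # end-of-sequence: the model says it's done
            break
        target.append(word)
    return " ".join(target)

print(translate("He went to the store"))  # Er ging zum Laden
```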

In a general-purpose, non-translation transformer, though, there is no complete thought in a source language to guide the generation. Starting with all its world knowledge and a blank slate, the model could generate text that is true or interesting, but unlikely to be of any use. The key idea was to give the model a snippet of starting text and have it generate more text from there. This is the idea of a prompt. You’ll learn exactly how prompts get processed. In chapter 24, after we base-train our 32-layer model, you’ll see how we use prompts to evaluate the quality of the model, and then you’ll see how we refine the model to make it especially good at handling prompts where it acts as a helpful assistant, like ChatGPT.
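The prompted version of the generation loop is even simpler than the translation sketch above: there is no separate source sentence, just the text so far, which starts out as whatever the user typed. The generate_next function here is another made-up stand-in for the model we’ll build in the coming chapters.

```python
# Prompted generation, sketched: keep asking "what comes next?" and append it.
# generate_next is a hypothetical stand-in for a trained model.
def generate_next(text_so_far):
    canned = {
        "The capital of France is": " Paris",
        "The capital of France is Paris": ".",
    }
    return canned.get(text_so_far, "<eos>")

def complete(prompt, max_steps=20):
    text = prompt
    for _ in range(max_steps):
        nxt = generate_next(text)
        if nxt == "<eos>":  # the stand-in has nothing more to say
            break
        text += nxt
    return text

print(complete("The capital of France is"))  # The capital of France is Paris.
```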

But that’s getting ahead of ourselves. If you look at the transformer diagram in figure 6.2 you may have questions: why is there output going into the model, what are embeddings, what is multi-head attention (it must be important, since the title of the paper was Attention Is All You Need), how do you feed in the training data, and how do you generate translations? In chapters 7 through 22 we’ll be working our way through a GPT model, and when we’re done I’m confident most of the boxes in the original transformer will make sense. If you come back to this chapter and want to understand the connection between the left and right sides of the original transformer, which we won’t see in our GPT model, read my blog post Tracing the Transformer in Diagrams. I’ll include the link in the further reading section.

In this chapter I said we put text into the model and get text out, but you can guess from the discussion of matrices in chapter 5 that the text is going to have to somehow get turned into numbers and put into a matrix. We’ll cover that next.