(A Layman’s, Mathless Explanation)
Anyone who has ever used or seen a computer translation of a written text knows how poor the quality can be. Naturally, most people wonder why it is so bad and why no one seems able to do better. The ability to communicate across languages is very desirable, so there has actually been a lot of work, both academic and commercial, on developing this technology. Yet, after over half a century of work on automatic translation, it is clear that this is a very difficult problem. The field of using computers to translate documents – exemplified by commercial products such as Google Translate, Bing Translator, BabelFish, and Safaba – is known as “Machine Translation”, or “MT”. Cutting-edge work in this field requires knowledge of advanced Computer Science, Math, Linguistics, Statistics, and Machine Learning – which the average person probably does not have. There is a lot of technical information available, as well as a vibrant research community and even tutorials; however, I haven’t been able to find a good layman’s introduction to Machine Translation. Though advanced math is necessary for working at the cutting edge of the technology, the basic, underlying concepts – particularly regarding why the problem is so hard – are actually quite simple to understand.
Seeing as no good explanation is available, I am writing this to introduce and explain the field to people outside of the Linguistics and Computer Science communities – how we try to get computers to translate, and why, more often than not, we fail. So … here it is. Presented without needing all that background knowledge. No Math. No Programming. No Linguistics. Maybe just the memory of why your High School French class could be so frustrating at 8am on a Friday morning.
I think the best way to present this is to walk through how the field of Machine Translation has progressed historically to where it is today. The reason is that if you ask an average person how they would solve the problem, their answer would likely be exactly the way researchers and academics started years ago. Highlighting the issues discovered with each approach motivates why we do what we do today.
Pretty much since the first computer was created, people have been trying to get computers to translate languages. People are fascinated by the existence of multiple languages; our interest goes back thousands of years, evident in stories like the Tower of Babel. More than just interested, people have wanted an easy way to communicate across language barriers for most of our species’ existence. So it’s not at all surprising that almost immediately after computers were invented, people wanted to use them to translate languages. Soon after World War II, when computers were first coming into use, the US government started funding a lot of research into getting computers to automatically translate Russian into English as the Cold War got going into full force. More than half a century later, with the Cold War long since over, we are still trying to get it to work. Basically, the history of Machine Translation follows a trajectory of increasingly sophisticated methods, but each one is pretty much what your average person would come up with – especially once you see why the previous method failed.
How would you try and get a computer to translate between two languages?
Most people’s first thought when approaching automatic translation is, “This should be simple. I can simply make (or better yet, just get) a translation dictionary like the one I had in High School French. Every time I see a French word, I’ll just look up the translation in English. Done.” Sadly, this doesn’t work in general – human languages are much more subtle and complex. Simple sentences like:

“I am happy.”

can be translated this way. In French, this is:

“Je suis heureux.”

Here “Je” is “I”, “suis” is “am”, and “heureux” is “happy” – a perfect one-to-one substitution.
Yet, this is not always the case. For instance, here the word “am” (a conjugation of “to be”) translated to “suis”, a conjugation of the French verb “être”, which also means “to be”. Sadly, the simple substitution method can fail even for other simple sentences, such as:

“I am hungry.”
In French, this is:

“J’ai faim.”
In addition to the letter “e” in “Je” getting dropped for an apostrophe (a rule any machine system needs to learn), “am” was translated to “ai” – a conjugation of the French verb “avoir”, which means “to have”. Here you can see how the same semantic idea (what a word means) can be expressed differently across languages. And if you introduce negation, the substitution model breaks down even further. The sentence:

“I am not hungry.”

becomes, in French:

“Je n’ai pas faim.”
To negate something in French, you generally need “ne” before the verb and “pas” after the verb. Simple substitution breaks down here because now you need to know this rule, and you also need to know which word is the verb – also a non-trivial task.
The next step people generally think of is, “Well, if simple substitution didn’t work, why not just encode all those rules I learned in High School French? That should solve this negation problem.” Unfortunately, languages are very complex and have a lot of rules. A lot of rules. More than you realize, and in ways fluent speakers don’t always notice. After years and years of trying more and more sophisticated ways of encoding rules, little progress had been made. Decades after work first began on Rule-Based MT systems, the US Government cut the funding it had been spending on computer translation. Fortunately, a few decades later, that funding trend reversed with the advent of non-rule-based methods.
Fast forward quite a few years, and the Statistical Revolution hit the Language Technology field, Machine Translation included. Yes, Computer Scientists talk about revolutions when another branch of math alters a research field … grad school can get boring, and revolutions make it sound more interesting. Essentially, the statistical revolution said, “Instead of trying to make models and rules that translate one thing into another, let’s just use statistics to assign scores (probabilities) to possible translations. By looking at training examples of how sentences were translated, those scores can be learned automatically.”
I’ll talk about this more in a later post, but essentially the availability of a digital copy of the proceedings of the Canadian House of Parliament (which are published in both French and English), along with innovations in statistical speech recognition at IBM, started a rebirth of the MT field in the 90’s. This meant there was now data available in computer-readable format (a huge deal 25 years ago) where you knew the exact English sentence each French sentence translated to. This is referred to as “bitext”, and it is still what we use today (there are many more datasets now). But it is all at the sentence level: you don’t know which word in the French sentence translates to which word in the English sentence. Surprisingly, that doesn’t matter. Thanks to a bunch of long-winded politicians, there was a ton of data, and you didn’t need any of those manual translation rules – nor even a translation dictionary.
A logical question is, “But if you don’t know which word translates to which word, how can you do this?” This is where having lots and lots of sentence pairs becomes useful. By looking at which pairs of words co-occur over and over again, you can learn a lot of useful things that we often forget when manually creating rules. Look at these two sentence pairs:

“Je mangerai.” – “I will eat.”

“Je suis heureux.” – “I am happy.”
A very simple method might be: “OK, to start, all words in the French sentence are equally likely to mean each word in the English sentence. Across all my sentence pairs, I will give them all scores to reflect that. Then I’ll repeatedly look at one sentence pair at a time, going over all the pairs multiple times, and update the scores.” So, to start, “Je” could mean “I”, “will”, or “eat” in the first sentence, and “I”, “am”, or “happy” in the second. Skipping over the math and exact implementation, it is easy to see that only the word “I” occurs in both English sentences, so we can be fairly confident that “Je” means “I”. With lots and lots of sentences, you can find other patterns and co-occurrences and learn what the other words mean. So without any translation tables, High School French dictionaries, etc., we were able to learn what French words likely mean in English.
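For readers who do happen to know a little programming (an entirely optional aside), the co-occurrence idea can be sketched in a few lines of Python. This is a toy illustration with made-up sentence pairs, not how a real system works – real systems iteratively update probabilities rather than taking raw counts:

```python
from collections import defaultdict

# Toy "bitext": French sentences paired with their English translations.
# These sentence pairs are illustrative, not real training data.
bitext = [
    ("je mangerai", "i will eat"),
    ("je suis heureux", "i am happy"),
]

# Count how often each French word co-occurs with each English word
# across all sentence pairs.
cooccur = defaultdict(int)
for fr, en in bitext:
    for f in fr.split():
        for e in en.split():
            cooccur[(f, e)] += 1

# "je" co-occurs with "i" in both sentence pairs, but with every other
# English word only once -- so "i" is its most likely translation.
best = max((e for f, e in cooccur if f == "je"),
           key=lambda e: cooccur[("je", e)])
print(best)  # -> i
```

With only two sentence pairs the counts are tiny, but the same counting scheme, scaled to millions of pairs and smoothed into probabilities, is the core intuition behind learning lexical probabilities.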
Now you are probably going, “But Kenton, how does this work on all those exceptions and special cases like (Je) becoming (J’)?” Surprisingly, with a lot of data, models can even learn this. But, you may have also noticed at this point, all our statistical models have done is learn how one word translates into another – and as I mentioned before, that still breaks down. Essentially, all I have done is told you how we learn the probability that a word in French translates to a word in English. We call these “Lexical Probabilities”.
The next step in most statistical machine translation systems is to combine these individual words into phrases. For instance, going back to our French examples, if we can learn that “ai faim” translates into “am hungry” and “suis heureux” translates into “am happy”, we don’t care that “am” is the translation of both “ai” and “suis” in different contexts. Basically, our computer is looking at ways to combine all those individual word translation problems into larger groups – phrases.
The process of grouping individual words into phrases is called phrase extraction – and it is still an active area of research. There are a bunch of different ways to do this, but mostly the methods just try to group adjacent words into one phrase. There is no notion of a linguistic phrase like the Subject or Object you learned about in English class – merely words that are close together. Again, with lots and lots of data, patterns emerge and the computer can find phrases. Computers can be very bad at understanding a language, but they are very good at finding patterns. Though there is no theoretical justification that these groupings are phrases linguistically, it seems to work – so that’s what people do.
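For the programming-inclined, here is a toy Python sketch of one common way to group aligned words into phrase pairs. The sentences and word-level alignments below are invented for illustration, and real phrase extraction has more details (for example, attaching unaligned words like “do” below):

```python
# Toy sentence pair with hypothetical word alignments: each pair is
# (French word index, English word index). Note that "ne ... pas"
# both link to the single English word "not".
french = ["je", "ne", "mange", "pas"]
english = ["i", "do", "not", "eat"]
align = [(0, 0), (1, 2), (2, 3), (3, 2)]

def extract_phrases(align, f_len):
    """Find French spans whose aligned English words form a block that
    links back only to words inside the French span."""
    pairs = []
    for f1 in range(f_len):
        for f2 in range(f1, f_len):
            # English positions aligned to this French span.
            es = {e for f, e in align if f1 <= f <= f2}
            if not es:
                continue
            e1, e2 = min(es), max(es)
            # Consistent only if no English word in that block links
            # back to a French word outside the span.
            if all(f1 <= f <= f2 for f, e in align if e1 <= e <= e2):
                pairs.append((f1, f2, e1, e2))
    return pairs

for f1, f2, e1, e2 in extract_phrases(align, len(french)):
    print(" ".join(french[f1:f2 + 1]), "<->", " ".join(english[e1:e2 + 1]))
```

On this toy pair it extracts “je” ↔ “i”, “mange” ↔ “eat”, “ne mange pas” ↔ “not eat”, and the full sentences – notice it learns a multi-word group for the split negation without any grammar rules, just from the alignment pattern.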
Decoding … yes, like a Decoder Ring
Up until this point, all we have talked about is how to train a Machine Translation model; there’s been no discussion of what happens after you’ve built this complicated program. In general, to build an MT system, you train a model on a large number of parallel documents. And I do mean a large number. Even for basic academic research projects, millions of documents are not unheard of. For larger commercial systems, though I am not personally privy to the information, billions wouldn’t be surprising. Basically, you use all these documents, in the ways I’ve discussed above, to create a t-table (translation table): phrases in one language paired with probabilistic scores for their translations into phrases of the other.
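Concretely – and again purely as an illustration – you can picture a t-table as nothing more than a lookup structure. The phrases and scores below are all made up:

```python
# A hypothetical, tiny t-table: each French phrase maps to possible
# English phrases with a probability-like score. All numbers invented.
t_table = {
    "je": {"i": 0.9, "me": 0.1},
    "ai faim": {"am hungry": 0.8, "have hunger": 0.2},
    "suis heureux": {"am happy": 0.85, "am glad": 0.15},
}

# Look up the best-scoring translation of a phrase.
best = max(t_table["ai faim"], key=t_table["ai faim"].get)
print(best)  # -> am hungry
```

A real t-table has millions of entries and several scores per entry, but the idea is the same: phrases in, scored phrases out.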
The process of using that trained model to translate new sentences is called “Decoding”, a term that comes from code-breaking. The first researchers in the field, with insight from Warren Weaver, treated the problem not as translating between two languages, but as decrypting an English text that had been encoded by some cipher. Basically, they hoped the model they created would work as a decoder ring for translating languages. Sadly, the problem turned out to be harder than decrypting an encoded message, but the term “Decoder” has stuck to this day.
A decoder works by taking a pre-trained machine translation model and applying it to new sentences. Generally, it starts at the beginning of a sentence and looks for translations in the t-table we learned while training the model. Remember, the t-table is just those lexical probabilities and the groups of words we called phrases. Most of the time there are multiple potential translations, so the decoder keeps track of several possibilities. Generally, it combines the score from the t-table with scores from other “features”. This is where most systems diverge: in the choice of features. You can try all kinds of things, like checking whether a word is a noun in both English and French, or anything else you can think of. This is a major area of research – people are constantly trying to find more informative features.
However, the most important feature is the language model – which is actually theoretically motivated through some math called Bayes’ Rule. I’ll skip the math here, but what a language model does is answer the question, “How likely is this sentence in English?” This is done by counting the number of times a word follows some previous words. For instance, if I say “New York”, what is the next word that comes to mind? It could be “City”, “State”, “Senator”, “University”, etc., but it is very unlikely to be a word like “China” or “We”. By looking at millions or billions of documents, we just count how many times each word follows another. Luckily, we only need these documents in one language, not two, so the data is much easier to get.
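As another optional toy illustration, counting which word follows which takes only a few lines of Python. The three-sentence “corpus” is made up; real language models count over millions or billions of sentences:

```python
from collections import defaultdict

# A tiny invented corpus in one language only.
corpus = [
    "i studied at new york university",
    "new york university is in new york city",
    "the new york senator spoke",
]

# Count how often each word follows the previous word
# (a "bigram" language model).
follows = defaultdict(lambda: defaultdict(int))
for sentence in corpus:
    words = sentence.split()
    for prev, nxt in zip(words, words[1:]):
        follows[prev][nxt] += 1

# Which word most often follows "york" in this corpus?
print(max(follows["york"], key=follows["york"].get))  # -> university
```

Dividing each count by the total gives a probability, and stronger models condition on two or three previous words instead of one – but it is still just counting.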
To see why a language model is important, let’s imagine our decoder has two possible translations of a sentence so far: “I studied at New York Senator” or “I studied at New York University”. Let’s say our translation model says “New York Senator” is a more likely translation than “New York University” – even though both are possible. Its score for the first will be higher. But if we add a language model, it will say that “studied at New York” is more likely to be followed by “University” than by “Senator”, and give “University” the higher score. Combining these two scores may make the translation better. However, the computer does not understand the semantic meaning at all – it just sees patterns. That means it may still get this wrong, and it often fails in ways a human never would. You may get sentences like “I eat cow” instead of “I eat beef”, because all it has done is count how often words co-occur.
Even with a good decoder, there are still a lot of problems that can happen when decoding. For instance, what happens when your model has never seen a word before? Oftentimes, the decoder just doesn’t translate the word and passes it through unchanged. This works well for proper nouns. There’s a good chance a system has never seen the word “Kenton” before, and passing that word through without trying to translate it is probably the right thing to do. However, that is not always the case. Going back to French, our model may have seen the word “étudiant” but not the feminine form “étudiante”. Passing that through is not the correct solution. Using more linguistic knowledge of the language – such as trying to drop the final “e” in French – can help, but in general this is an open problem. There are loads more like it.
And finally, the other major problem with decoding is that there are too many possibilities. The number of possible combinations of different words, phrases, t-table entries, language model scores, etc. is so large that a computer cannot look at all of them. So it has to approximate. This is called a beam search, which keeps a k-best list. In practice, rather than exploring every possible combination, we keep only a certain number – say, the 100 possibilities that look best at any given point. Unfortunately, as we translate more of the sentence, those 100 may no longer be the best: the overall best translation could have been number 101 until you saw the later words. But there are so many combinations that we cannot check, and we will never know. For those with a stronger CS/Math background: decoding is actually NP-complete.
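For the programming-inclined, the beam idea can be sketched like this. Everything here – the candidate words, their scores, and a beam of size 2 – is invented for illustration; real decoders are far more involved:

```python
import heapq

def beam_search(steps, k=2):
    """A toy beam search: extend each partial translation with every
    candidate at each step, then keep only the k highest-scoring."""
    beam = [(0.0, [])]  # each hypothesis is (score, words so far)
    for candidates in steps:  # candidates: list of (word, score) options
        extended = [(score + s, words + [w])
                    for score, words in beam
                    for w, s in candidates]
        # Prune: keep only the k best partial translations.
        beam = heapq.nlargest(k, extended, key=lambda h: h[0])
    return beam

# Hypothetical candidate words and scores at each position.
steps = [
    [("i", 0.9), ("me", 0.1)],
    [("am", 0.7), ("have", 0.3)],
    [("hungry", 0.8), ("hunger", 0.2)],
]
best_score, best_words = beam_search(steps)[0]
print(" ".join(best_words))  # -> i am hungry
```

The pruning step is exactly where the trouble described above comes from: a hypothesis discarded early can never be recovered, even if later words would have made it the overall best.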
The 2 Most Important Factors in Translation Quality
In general, how difficult it is for computers to translate a language depends on two main factors: how closely related the languages are, and how much training data is available. For instance, Spanish to Portuguese translation generally works very well – both are Romance languages and are fairly similar. English to Arabic translation, between two languages from entirely different families, is much worse.
The other major factor is how much data is available. As I mentioned earlier, with more and more data, our models can learn how often one word is translated to another. If we only had one sentence with the word “am” in it, we could never learn that in French both “suis” and “ai” can be translations. As we get more and more data, we can learn how often each happens, as well as which other words occur near each translation. Lack of data is actually a major problem for most languages. There are a lot of languages out there, and many don’t have huge digital archives – especially with associated translations. You are probably fine using a commercial translator for English to French, as there are lots of resources available for training MT systems; you are much less likely to get a good translation from English to Mongolian.
… and finally, to Conclude
Hopefully, by this point, I have given an easily understandable explanation of how the basics of Machine Translation work and why it is a hard problem. I’ve only touched on a few of the open problems and left out some of the more cutting-edge solutions, but this should give you a good, core understanding of the technology – the background anyone working in the field would know, just at a higher level and less technical. If you are still interested, or have a more technical background, this should be enough to prepare you for the more technical literature and tutorials out there. Hopefully, this has been a useful bridge. For the non-mathy/techy people, I hope this explains why your online translator may have just insulted your friend’s mom.
If there is one thing you should take away from this, it is that computers are good at finding patterns. They can count the number of times different words co-occur and use those counts to approximate translation. However, they don’t understand anything deeper; it is all shallow. Yet patterns are very powerful. They have beaten rules, dictionaries, and expertly hand-crafted systems – all by simply looking at patterns of words and when they happen to be near each other. Languages are hard. Translation is hard. The human mind is amazing in its ability to comprehend complex and vague data – without conscious effort. Computers see patterns, but patterns don’t mean understanding.
If you have made it this far and are wondering who I am and why I am qualified to write about this: I am a PhD student in Computer Science working on Natural Language Processing, with a particular focus on Machine Translation. I have a Bachelor’s degree in Computer Science from Princeton, where I worked on using computers to automatically summarize documents, and a Master’s degree in Language Technologies from Carnegie Mellon University. I’ve worked on Machine Translation at a couple of academic research labs around the world, and was an early hire at Safaba on the industry side. If you have any questions, comments, or suggestions, feel free to write them in the comments. I’m always happy to improve my own explanations and the communication of this field.