A software engineer's approach to neural machine translation

There are an estimated 7,000 languages spoken in the world today. Just 23 of those languages account for over half of the world's population, which means there are many languages with relatively few speakers. Online translation services like Google Translate and Bing Microsoft Translator support up to 100 languages, with additional language support being added regularly. However, due to the vast number of languages spoken worldwide, not everyone is able to use these translation services in their own language. This is a pity, since services like these can help bridge language barriers, which could especially benefit speakers of small languages who have fled their homes due to conflict.

Feeling the urge to help tackle this problem, Q42 teamed up with the Travis Foundation. The Travis Foundation digitises small languages and creates communication tools to help refugees, migrants and organisations worldwide. The language we started with is Tigrinya, which is spoken by Eritreans and has roughly 8 million speakers worldwide. Prolonged conflict in Eritrea has forced many people to flee the country, with over 500,000 Eritrean refugees in Europe alone. Many of these refugees only speak Tigrinya, which makes communication between them and local governments difficult and frustrates the immigration process.

In this in-depth post we'll share our journey as we explored the field of neural machine translation, talked with experts, gathered language data, trained our model and tested it for accuracy. We are software engineers - not machine learning experts - so we learned a ton in the process. Hopefully some of those learnings are helpful to you, and if not, it might at least offer an entertaining read.

Neural machine translation

Historically, there have been many different approaches to the problem of translating natural text from one language to another. Most state-of-the-art systems nowadays are based on machine learning models. In the machine learning world, this approach is called neural machine translation, abbreviated as NMT.

There is a ton of research being done in NMT and researchers regularly publish their findings in scientific papers. Much of this research is very in-depth and requires deep knowledge of the field to understand. Luckily for us developers, it is common for researchers and members of the NMT community to turn these model architectures into code.

While researching this topic, we tried to find an NMT model that both reported good results on some benchmark translation tasks and had extensive documentation that was understandable for us as laymen. We ended up using Tensorflow's NMT tutorial, which ticks both of these boxes.

Building a corpus

With our NMT model selected, we ventured on to the next step: gathering data and building a corpus. NMT models are trained by feeding them a corpus of sentence pairs, each containing the exact same sentence in both the source language and the target language. So how many sentence pairs do you need to train a good model? Well, it depends. As anyone who has ever tried to learn a new language knows, languages are vastly complex. They can be similar in some regards while being very different in others. We consulted an NMT expert, who advised a corpus size of 300,000 sentence pairs.
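For context, the corpus format the TensorFlow NMT tutorial works with is very simple: one plain-text file per language, aligned line by line, so that line N of the English file is the translation of line N of the Tigrinya file. Below is a small sketch of a sanity check on that alignment; the file names are our own convention.

```python
# The TensorFlow NMT tutorial reads the corpus as one plain-text file per language,
# aligned line by line: line N of train.en is the translation of line N of train.ti.
# The file names here are our own convention, not prescribed by the tutorial.
with open("train.en", encoding="utf-8") as en, open("train.ti", encoding="utf-8") as ti:
    english = en.read().splitlines()
    tigrinya = ti.read().splitlines()

assert len(english) == len(tigrinya), "corpus files are misaligned"
print(f"{len(english)} sentence pairs, e.g.: {english[0]!r} <-> {tigrinya[0]!r}")
```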

Finding 300,000 sentence pairs turned out to be quite hard, and we eventually only managed to gather 65,000. While that is considerably less than the amount we aimed for, it is still quite a lot. To give you an idea: laid out end to end, that's about the size of the Bible. And that is not a coincidence. As it turns out, the Jehovah's Witnesses publish their Bible and a lot of supporting texts in a whole host of languages, and Tigrinya is one of them. When studying the texts we quickly found out that many of them share the same structure across different languages. Take for example this text on dealing with loneliness: it contains the exact same number of paragraphs in English as it does in Dutch. Further investigation showed that the same was true for a lot of Tigrinya texts. So we set out to download and process as much data as we could get our hands on. Using wget and Beautiful Soup we created text files per article per language, which we could then compare to make sure they indeed contained the exact same number of paragraphs.
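As an illustration, here is a simplified sketch of that comparison step, assuming the articles were already downloaded with wget. The file names, the idea of selecting all paragraph elements, and the output file naming are our own simplifications rather than the exact scraper we used.

```python
from bs4 import BeautifulSoup

def extract_paragraphs(html_path):
    """Extract the text of all <p> elements from a downloaded article."""
    with open(html_path, encoding="utf-8") as f:
        soup = BeautifulSoup(f, "html.parser")
    return [p.get_text(strip=True) for p in soup.find_all("p")]

# Hypothetical file names for the same article in both languages.
english = extract_paragraphs("loneliness.en.html")
tigrinya = extract_paragraphs("loneliness.ti.html")

# Only keep article pairs with the exact same number of paragraphs, so we can
# assume paragraph i in one language corresponds to paragraph i in the other.
if len(english) == len(tigrinya):
    with open("train.en", "a", encoding="utf-8") as en, open("train.ti", "a", encoding="utf-8") as ti:
        for en_par, ti_par in zip(english, tigrinya):
            en.write(en_par + "\n")
            ti.write(ti_par + "\n")
```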

At this point we were well aware that our dataset was quite small and heavily biased. Biblical texts use language that is not necessarily used in everyday conversation, and our model would most likely reflect those biases. But while we were busy gathering data from public sources, the Travis Foundation was working on a game that would hopefully help grow our corpus. In this game - The Sentence Society - native Tigrinya speakers are asked to translate sentences from English to Tigrinya.

Training our first model

As our dataset was slowly growing, we were preparing to train our first model. To get up to speed, we looked at the sample English-Vietnamese corpus that ships with the Tensorflow NMT tutorial. We ran the code on our own machine to test our installation and get a feel for the way the model improves over time. While we could confirm that the model was training as it was supposed to, training took well over a day to complete on our high-end MacBooks and interfered with our day-to-day work. This confirmed our suspicions that we needed to run the actual training process on dedicated hardware.

For now though, we were ready to prepare our corpus for a first training run. Before being fed into the model, the corpus needs to go through a couple of preprocessing steps. In the first step, the raw text is uniformly split into a space-separated sequence of tokens in a process called tokenization. This converts a sentence like And he said: "It's not my fault!" into And he said : " It ' s not my fault ! ".
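A minimal regex-based sketch of what this tokenization step does; this is a simplification of our own, not necessarily the exact tokenizer we used.

```python
import re

def tokenize(sentence):
    # Split the sentence into word tokens and individual punctuation marks,
    # then rejoin everything with single spaces.
    return " ".join(re.findall(r"\w+|[^\w\s]", sentence))

print(tokenize('And he said: "It\'s not my fault!"'))
# -> And he said : " It ' s not my fault ! "
```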

The second preprocessing step takes this tokenized corpus and generates a list of the most common tokens in it. The resulting list is called the vocabulary, which is passed to the model during training. The vocabulary serves as an optimization by limiting the number of tokens - words and punctuation - that the model needs to know about and be able to predict. A smaller vocabulary results in a smaller model, which results in shorter training times and less memory being consumed during training. Our English corpus contains close to 32,000 unique tokens. After discussing with our NMT expert, we decided to include the 17,000 most common tokens in our vocabulary.
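For illustration, a sketch of how such a vocabulary can be extracted from the tokenized corpus. The file names are our own; the format with one token per line and the special tokens at the top follows what the NMT tutorial expects, to the best of our understanding.

```python
from collections import Counter

VOCAB_SIZE = 17_000  # the cut-off we settled on for English

def build_vocab(tokenized_path, vocab_path, size=VOCAB_SIZE):
    # Count token frequencies across the whole tokenized corpus.
    counts = Counter()
    with open(tokenized_path, encoding="utf-8") as f:
        for line in f:
            counts.update(line.split())
    with open(vocab_path, "w", encoding="utf-8") as out:
        # Special tokens the NMT tutorial expects at the top of the vocabulary file.
        for token in ["<unk>", "<s>", "</s>"]:
            out.write(token + "\n")
        # Then the most common tokens, one per line.
        for token, _ in counts.most_common(size):
            out.write(token + "\n")

build_vocab("train.tok.en", "vocab.en")
```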

Training in the cloud

Having established earlier that it would not be feasible to train the model on our local machines, we researched the options for running our training jobs in the cloud. Google Cloud's AI Platform Training service looked to be a good fit, since it allows you to execute long-running training jobs on dedicated hardware. Google kindly supported us with credits and access to their machine translation experts.

With some trial and error and some minimal code changes, we were able to package the Tensorflow NMT tutorial code as a training job and submit it to AI Platform Training. While our first real training job was running, we set about creating a workflow to easily start new training jobs. For our first training job we manually grabbed the latest version of our corpus from Google Drive, ran the tokenizer, extracted the vocabulary, uploaded the prepared dataset to a GCP bucket, authorized and set the correct project in the gcloud command line utility, and finally started the training job. We expected we'd have to repeat this process many times with different versions of our corpus and tweaks to the parameters of the training job. We automated the process early on, which allowed us to quickly start new jobs.
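To give an impression, here is a trimmed-down sketch of that automation, assuming the tokenized corpus and vocabularies already sit in a local prepared_data directory. The bucket name, region, runtime version and job naming are placeholders, and the flags passed to the trainer are reduced to a minimum.

```python
import subprocess
from datetime import datetime

BUCKET = "gs://our-nmt-bucket"  # placeholder bucket name
JOB_NAME = "nmt_ti_en_" + datetime.now().strftime("%Y%m%d_%H%M%S")

# Upload the prepared dataset (tokenized corpus and vocabularies) to Cloud Storage.
subprocess.run(
    ["gsutil", "-m", "cp", "-r", "prepared_data", f"{BUCKET}/data/{JOB_NAME}"],
    check=True,
)

# Submit the packaged NMT tutorial code as a training job to AI Platform Training.
subprocess.run(
    [
        "gcloud", "ai-platform", "jobs", "submit", "training", JOB_NAME,
        "--region", "europe-west1",
        "--staging-bucket", BUCKET,
        "--package-path", "nmt",      # directory containing the tutorial code
        "--module-name", "nmt.nmt",
        "--runtime-version", "1.12",  # placeholder TensorFlow runtime version
        "--",                         # everything after this is passed to the trainer
        "--src=ti", "--tgt=en",
        f"--train_prefix={BUCKET}/data/{JOB_NAME}/train",
        f"--vocab_prefix={BUCKET}/data/{JOB_NAME}/vocab",
        f"--out_dir={BUCKET}/models/{JOB_NAME}",
        # dev/test set flags and hyperparameters omitted for brevity
    ],
    check=True,
)
```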

We struggled with Python, pip and VirtualEnv to get the correct TensorFlow version running on our local machines. We figured we could kill two birds with one stone by using Docker: it let us do test training runs on our own machines, and it let us automate the steps mentioned in the previous paragraph as part of our Docker build. As an added benefit, the Dockerfile could serve as documentation of the steps needed to start a training job. The results of this work can be found on GitHub.

Evaluating the results

Our first training job successfully finished after running for three and a half hours. We now had a trained model - two models actually, as we trained a model for Tigrinya to English and vice versa - that might or might not be able to translate a sentence between Tigrinya and English. We did not expect the first model to produce perfect translations. That expectation turned out to be correct. Given a sentence in Tigrinya, the model would spit out a sequence of English words that could hardly be called a sentence.

So it was obvious that our model had a long way to go. But how could we measure improvement over time? In the NMT field there is a metric that quantifies the performance of an NMT model, called the BLEU score. This metric compares a generated translation of a piece of text to a baseline translation. Since a given sentence can have many correct translations, the BLEU score is not a perfect metric, but it is language independent, easy to calculate and widely used. BLEU scores range from 0 to 100, with 100 being a word-for-word identical copy of the baseline translation. It should be noted that human translators score around 60, so a model performs extremely well when it comes close to that score. Current state-of-the-art models on big corpora score 40+, so we would already be very happy if we could get close to 20. For reference, the first model we trained reported BLEU scores of less than 2.
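To make the metric a bit more tangible, here is a minimal sketch of computing a BLEU score for a batch of translations using the sacrebleu package. The choice of package and the example sentences are our own for illustration; the NMT tutorial itself also reports BLEU on the dev and test sets during training.

```python
import sacrebleu

# Model output and the corresponding baseline (reference) translations,
# one reference per hypothesis. The sentences here are made up.
hypotheses = ["do you speak english ?", "my name is daniel ."]
references = [["Do you speak English?", "My name is Daniel."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}")  # ranges from 0 to 100, higher is better
```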

Having set a target for ourselves, we then discussed with an NMT expert what steps we could take to improve our model. The most obvious improvement we could make was to increase the number of training steps. While training, TensorFlow logs all kinds of metrics that you can plot with TensorBoard to see how the model's performance changes over time. Looking at the TensorBoard graphs it was clear that our model's performance was still steadily increasing when our training job finished.

Improvement of BLEU scores during training. In orange you see Tigrinya-to-English, in blue vice versa. The y-axis displays the BLEU score, the x-axis the number of training steps.

Furthermore, there are a whole bunch of parameters that can be fine-tuned to increase model performance. Without going into too much detail, it is good to know that these parameters can roughly be divided into two categories: some parameters govern the architecture of the model, while others are used to tweak the training process. NMT models are based on neural network architectures that typically consist of a number of layers, with different numbers of nodes per layer and attention mechanisms in between. When a model is too big (e.g. it has too many layers and too many nodes) it is likely to overfit. This means that it will perform really well on the data that was used to train the model, but it will perform poorly on data it hasn't seen while training. Conversely, when a model is too small, it might not be able to learn all the subtleties of a translation problem and will never perform as well as it might otherwise.

Typical training parameters are the batch size, dropout rate, the optimization algorithm used, and the number of training steps. What these parameters do is beyond the scope of this article, but what's important to know is that these all influence model performance.

With input from our NMT expert we started a whole bunch of training jobs to first figure out which model architecture performed best, and then which training parameters worked best. Because there is randomness involved in the training process, we started each job twice to make sure that the results we were seeing reliably indicated either an improvement or a decline in model performance. For the purpose of finding the optimal parameter values, we limited the number of training steps, because the impact of a given change usually manifested itself early in the training process. After running a total of 44 jobs we found that the model performed best with 1 layer of 1024 nodes and the Scaled Luong attention mechanism. Training with batch size 64, the Adam optimization algorithm and a dropout rate of 0.4 gave the best performance.
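For reference, these settings correspond directly to hyperparameter flags of the TensorFlow NMT tutorial. A sketch of the trainer arguments for this configuration, to be appended after the "--" separator in the gcloud command shown earlier; the number of training steps and the learning rate are placeholders.

```python
# Trainer flags for the tensorflow/nmt code matching the best-performing
# configuration described above. num_train_steps and learning_rate are placeholders.
BEST_TRAINER_ARGS = [
    "--num_layers=1",
    "--num_units=1024",
    "--attention=scaled_luong",
    "--batch_size=64",
    "--optimizer=adam",
    "--learning_rate=0.001",    # Adam typically needs a much smaller learning rate than SGD
    "--dropout=0.4",
    "--num_train_steps=50000",  # placeholder; we varied this per experiment
]
```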

Having trained a model with these parameters, we could now use it to translate from English to Tigrinya and vice versa. In machine learning, the process of running input through a model and receiving the predicted output is called inference. In order to prepare our model for inference, we exported the model and exposed it using Tensorflow Serving. This gave us a RESTful API that we could call with translation requests.
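To give an impression of what such a translation request looks like, here is a sketch of a call to the TensorFlow Serving REST endpoint. The model name, port and payload shape are assumptions, since they depend on how the model was exported and served.

```python
import requests

# TensorFlow Serving exposes exported models at /v1/models/<model_name>:predict
# (port 8501 by default). The model name "en_ti" is hypothetical.
SERVER_URL = "http://localhost:8501/v1/models/en_ti:predict"

# A tokenized English sentence; the exact input format depends on the
# serving signature of the exported model.
payload = {"instances": [{"input": "do you speak english ?"}]}

response = requests.post(SERVER_URL, json=payload)
response.raise_for_status()
print(response.json()["predictions"])
```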

End result and next steps

So, having gone through all these steps, how did we do? For English to Tigrinya we reached a BLEU score of 12.2. For Tigrinya to English the best BLEU score was 14.3. These scores are significantly lower than we had hoped to achieve, and this was reflected in the results the model gave us when we fed it text to translate.

Translation for the sentence "Do you speak English?" in Tigrinya in the interface we built on top of the Tensorflow Serving API

Sadly, we had to put this project on hold for now because funding ran out. Q42 is now busy trying to find new partners to work with. We have some avenues we would like to explore to further improve the model. One interesting thing we found is that we can probably improve our parsing of the Jehovah's Witnesses' source material. Currently we split the data at the paragraph level, which means that the sentences we put into the model are very long. It looks like most of the texts are actually identical in structure at the sentence level too. If that turns out to be true, our corpus would grow while individual sentences would get shorter. We're really curious to see what that does to model performance. Next to improving the dataset, we'd like to explore a few other machine learning techniques, such as transfer learning and transformer models. Also, more classical approaches to machine translation, like rule-based or statistical methods, could probably work well for small languages.

We're sharing all this in the hope that people with relevant knowledge can help us along. So if you're sitting on a trove of unused parallel sentence pairs in Tigrinya and English, hit us up! If you're an NMT expert with a specialism on African languages, hit us up! If you just want to help, hit us up!


Are you one of those engineers who love to sink their teeth into machine learning and other programming challenges? Then do check out our job vacancies (in Dutch) at https://werkenbij.q42.nl!