Statistical Machine Translation: A Data-Driven Translation Approach

Aminah Mardiyyah Rufai
7 min read · Jul 21, 2022

Have you ever heard of Statistical Machine Translation (SMT)? What if I told you that Google Translate was founded in 2006 as a Statistical Machine Translation Service?

Have you seen my first article in the Machine Translation series yet? In that article, I gave a quick overview of Machine Translation, including its history, methods, and applications. Read it here:

In this article, I’ll go over one approach to machine translation, known as “Statistical Machine Translation (SMT)”. There will be two parts to this article.

i. Part 1 (this article): An introduction to SMT, its history, theory and sub-approaches.

ii. Part 2: The code implementation of a subset of SMT.

Maybe we could have a Part 3 also?😉

Source: FreeCodeCamp

The Origin of Statistical Machine Translation

The idea behind Statistical Machine Translation was first put forward in the late 1940s by Warren Weaver. Weaver proposed that language has an inherent logic to it and can be treated like any other logical or mathematical problem (interesting thinking there, right?😎).

This idea was later reintroduced by researchers at IBM’s Thomas J. Watson Research Center in the late 1980s and early 1990s, contributing to a substantial resurgence of interest in Machine Translation.

Statistical Machine Translation (SMT) was by far the most commonly studied machine translation method prior to the development of Neural Machine Translation (NMT), the current state of the art.

Both SMT and NMT are known as data-driven approaches to language translation, meaning they depend on a large amount of data to derive insights and improve accuracy and performance. This data is usually represented as corpora (collections of sentences, words or phrases).

Interesting Fact: Google Translate was first established as an SMT service in 2006, using transcripts of texts/documents from the United Nations and the European Parliament to acquire linguistic data for several languages.

Source: FreeCodeCamp

How Does SMT Work?

General Overview

Statistical Machine Translation (SMT) uses probabilistic models to generate translations between languages. The idea stems from information theory.

The approach moves beyond traditional “handwritten rules” and bases translation on probabilistic modelling instead. A translation problem is framed in terms of a probability distribution: given a source sentence, we look for the most probable word or phrase translations in the target language.

Generally, there are two main methods in SMT: the word-based approach (from the early stages of SMT) and the phrase-based approach.
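To make the "most probable translation" idea concrete, here is a minimal Python sketch. The candidate sentences and their probabilities are made up for illustration; a real SMT system would estimate them from a parallel corpus rather than hard-code them.

```python
# Hypothetical candidate translations for a single source sentence, each with
# a made-up probability P(target | source). These numbers are assumptions.
candidates = {
    "the house is small": 0.62,
    "the home is small": 0.30,
    "small is the house": 0.08,
}


def most_probable_translation(scored_candidates):
    """Return the candidate with the highest probability (the argmax)."""
    return max(scored_candidates, key=scored_candidates.get)


best = most_probable_translation(candidates)
print(best)
```

Everything interesting in SMT happens in how those probabilities are estimated; picking the winner is just an argmax.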


  1. Word-Based SMT

The first SMT method to be introduced, in the early 1990s, was word-based SMT. This approach seeks to estimate the probability of a source language word (e) being translated into a target language word (f). In essence, it analyzes every individual word in a given source sentence and estimates its most probable translation in the target sentence. It can be viewed as a supervised learning method: example sentence translations in both the source and target language are fed into the algorithm, which then tries to estimate the most probable target language translation of a test word given its representation in the source language.

We know that a word can have different translations depending on the context in which it is used. These different translations are stored in a “lexicon”. In simple terms, the lexicon holds information about a word in different contexts. For example, the word “bank” means different things in English depending on context, and so could have different translations in another language.

So, the lexicon stores the probability of each possible translation of a source word into the target language. This is based on co-occurrence, i.e. how many times a word in a source sentence co-occurred with a given translation in the target sentence. As a result, some terms are more likely to be the translation of a word, while others are less likely. Maximum Likelihood Estimation (MLE) is used to estimate the translation probabilities of these words.

The second most important component of these SMT systems is the alignment model. The alignment model is simply a mapping from each word in a source sentence to a word in a target sentence. Of course, it is possible that some words in a source sentence are left without being mapped to a target word. What do I mean?
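As a rough illustration of the lexicon, the sketch below estimates translation probabilities by relative frequency, which is what the maximum likelihood estimate reduces to when the word pairings are already known. The word pairs are hypothetical toy data, and the function name `build_lexicon` is my own.

```python
from collections import Counter, defaultdict

# Hypothetical (source_word, target_word) pairs, as if collected from an
# already word-aligned English-French corpus. Toy data for illustration only.
aligned_pairs = [
    ("bank", "banque"), ("bank", "banque"), ("bank", "banque"),
    ("bank", "rive"),
    ("river", "rivière"), ("river", "rivière"),
]


def build_lexicon(pairs):
    """Relative-frequency estimate: count(source, target) / count(source)."""
    pair_counts = Counter(pairs)
    source_counts = Counter(src for src, _ in pairs)
    lexicon = defaultdict(dict)
    for (src, tgt), count in pair_counts.items():
        lexicon[src][tgt] = count / source_counts[src]
    return dict(lexicon)


lexicon = build_lexicon(aligned_pairs)
print(lexicon["bank"])
```

In this toy lexicon, "banque" ends up as a more likely translation of "bank" than "rive", simply because the two co-occurred more often.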

Let’s take a look at an example:

Example With English-Arabic Word Alignment
Example with English to French word Alignment: Source

It is also possible that multiple words in a source sentence get mapped to the same word in a target sentence. The reverse is also possible. Take a look at the examples above again.

In other words, we can have a one-to-one, one-to-many or many-to-one word alignment. This depends on the lexical structure of the languages involved.

The alignment function is defined mathematically as:
a : i → j

where i is a source word, j is a target word, and the alignment function (a) models the alignment from i to j.
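One simple way to picture the alignment function is as a lookup table from source positions to target positions. The sketch below is only an illustration: the sentence pair and its alignment are hand-made, and `None` is my own convention for a source word that is left unaligned.

```python
# Hand-made toy example (an assumption, not taken from a trained model).
source = ["I", "do", "not", "know"]
target = ["je", "ne", "sais", "pas"]

# a : i -> j, from source position i to target position j.
# None marks a source word with no aligned target word.
alignment = {0: 0, 1: None, 2: 1, 3: 2}


def aligned_word_pairs(source, target, alignment):
    """Pair each source word with its aligned target word (or None)."""
    pairs = []
    for i, word in enumerate(source):
        j = alignment.get(i)
        pairs.append((word, target[j] if j is not None else None))
    return pairs


for src_word, tgt_word in aligned_word_pairs(source, target, alignment):
    print(src_word, "->", tgt_word)
```

A table like this can express one-to-one and many-to-one links (two source positions pointing at the same target position) as well as unaligned source words. A one-to-many link would need a set of target positions per source word instead.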

In summary, we can say that the two main components of a word-based model are:

  • The lexicon, which stores the possible word translations
  • The alignment model

Some common word-based SMT models:

  • IBM Models 1–5 (reference links below for in-depth study)
  • Hidden Markov Model (HMM)
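For a feel of how the simplest of these models learns its lexicon without being given alignments, here is a compact sketch of IBM Model 1 style training with Expectation-Maximization (EM). It is a simplified illustration, not a reference implementation: it leaves out the NULL source word, uses a tiny made-up English-German corpus, and the function name `train_ibm_model1` is my own.

```python
from collections import defaultdict


def train_ibm_model1(corpus, iterations=10):
    """Estimate t(target_word | source_word) with EM, IBM Model 1 style.

    corpus: list of (source_tokens, target_tokens) sentence pairs.
    Simplification: no NULL source word is added.
    """
    target_vocab = {tw for _, tgt in corpus for tw in tgt}
    uniform = 1.0 / len(target_vocab)
    # Every (target_word, source_word) pair starts with the same probability.
    t = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)  # expected count of each (tw, sw) link
        total = defaultdict(float)  # expected count of links per source word
        # E-step: spread each target word over the source words in its sentence.
        for src, tgt in corpus:
            for tw in tgt:
                norm = sum(t[(tw, sw)] for sw in src)
                for sw in src:
                    frac = t[(tw, sw)] / norm
                    count[(tw, sw)] += frac
                    total[sw] += frac
        # M-step: re-estimate the probabilities from the expected counts.
        for (tw, sw), c in count.items():
            t[(tw, sw)] = c / total[sw]
    return dict(t)


# Tiny made-up parallel corpus (source: English, target: German).
corpus = [
    (["the", "house"], ["das", "Haus"]),
    (["the", "book"], ["das", "Buch"]),
    (["a", "book"], ["ein", "Buch"]),
]

t = train_ibm_model1(corpus, iterations=20)
for tw in ["das", "Haus", "Buch"]:
    print(tw, "| the:", round(t[(tw, "the")], 3))
```

Even on three sentence pairs, the probability mass for "the" drifts towards "das", because "das" is the only target word that appears every time "the" does.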

2. Phrase-Based SMT

The core idea of this approach is that, rather than breaking the translation problem down into individual “words” as basic units, we focus on sequences of consecutive words, or “phrases”. Translation then proceeds by attempting to translate two or three words together from the source sentence to the target sentence. We would agree this is a better approach than taking individual words into consideration, right?

Given that not all languages use the same word order or word senses, this helps to make the translation task a little more concrete. As an illustration, while the sentence structure in English is Subject-Verb-Object, it may be Verb-Subject-Object in some other languages. Imagine attempting word-to-word translation with this structural difference in mind. What do you suppose the outcome would be? Now think about a phrase-based translation in the same situation. More logical, yes?

The structure is just one aspect of it. What about tenses (how do other languages represent present, past and future compared to English)? What about pronouns (do all languages represent pronouns the same way English does)? And so on.

Fun Fact: “This was the base SMT approach for Google Translate when it first launched”.

Components of a Phrase-Based SMT System

The first important component, just like in the word-based approach, is the “alignment model”.

The difference here is that while the word-based approach only considers alignment in one direction, i.e. source to target, this approach considers the inverse direction as well, i.e. both source → target and target → source.

So we can use the same alignment models (the IBM Models), but in both directions. If an alignment is consistent in both directions, we treat it as the most probable alignment.
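Keeping only the links that both directions agree on amounts to a set intersection. The sketch below assumes each directional alignment has already been computed and is stored as a set of (source position, target position) links; the links themselves are made up for illustration.

```python
# Hypothetical alignment links as (source_position, target_position) pairs.
# The target -> source links are assumed to be flipped into the same order.
source_to_target = {(0, 0), (1, 2), (2, 1), (2, 2)}
target_to_source = {(0, 0), (1, 2), (2, 1), (1, 1)}


def intersect_alignments(s2t, t2s):
    """Keep only the links found in both alignment directions."""
    return s2t & t2s


agreed = intersect_alignments(source_to_target, target_to_source)
print(sorted(agreed))
```

The intersection gives few but reliable links. Practical systems often add some of the one-directional links back using heuristics, which this sketch does not attempt.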

The second important component is the “phrase extraction process”.

Here, all possible phrase pairs (translations) are extracted from the source and target sentences. The question that arises is: how do we get a good phrase pair?

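A phrase pair is extracted only if it is consistent with the word alignment, which comes down to three conditions:

  • Every aligned source word in the phrase pair must be aligned only to target words inside the phrase pair
  • Every aligned target word in the phrase pair must be aligned only to source words inside the phrase pair
  • At least one source word in the phrase pair has to be aligned to one target word in it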
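The consistency check behind phrase extraction can be sketched as a brute-force search over all source and target spans. This is a simplified illustration on a hand-made sentence pair and alignment, not how a production toolkit implements it; the function names and the `max_len` limit are my own choices.

```python
def is_consistent(alignment, s_start, s_end, t_start, t_end):
    """True if no link crosses the phrase boundary and at least one link is inside."""
    has_link_inside = False
    for s, t in alignment:
        s_in = s_start <= s <= s_end
        t_in = t_start <= t <= t_end
        if s_in != t_in:  # the link leaves the phrase pair on one side
            return False
        if s_in:
            has_link_inside = True
    return has_link_inside


def extract_phrase_pairs(source, target, alignment, max_len=3):
    """Collect every (source phrase, target phrase) pair consistent with the alignment."""
    pairs = []
    for s_start in range(len(source)):
        for s_end in range(s_start, min(s_start + max_len, len(source))):
            for t_start in range(len(target)):
                for t_end in range(t_start, min(t_start + max_len, len(target))):
                    if is_consistent(alignment, s_start, s_end, t_start, t_end):
                        pairs.append((
                            " ".join(source[s_start:s_end + 1]),
                            " ".join(target[t_start:t_end + 1]),
                        ))
    return pairs


# Hand-made toy example: links are (source_position, target_position).
source = ["the", "green", "house"]
target = ["la", "maison", "verte"]
alignment = {(0, 0), (1, 2), (2, 1)}

for pair in extract_phrase_pairs(source, target, alignment):
    print(pair)
```

Note that "the green" is not extracted on its own: any target span covering both "la" and "verte" also contains "maison", which is aligned to "house" outside the source phrase.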

In the last part of the phrase-based model, we score the phrase pairs that were extracted in the second step.
Each phrase pair is assigned a probability, which serves as its score. With this scoring approach, the model does not prioritize short phrases over long ones, or the other way around. The probability scores are computed as relative frequencies from the corpus statistics:

Source: Ref 5 Below

This score is calculated for both directions, i.e. for (f|e) as well as (e|f).
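A relative-frequency score of this kind divides the number of times a phrase pair was extracted by the number of times the phrase being conditioned on was extracted. The sketch below computes it in both directions; the extracted pairs are made-up toy data and the dictionary keys are names of my own choosing.

```python
from collections import Counter

# Hypothetical phrase pairs (source_phrase, target_phrase), as if collected
# by phrase extraction over a whole corpus. Toy data for illustration only.
extracted = [
    ("green house", "maison verte"),
    ("green house", "maison verte"),
    ("green house", "serre"),
    ("house", "maison"),
]


def phrase_scores(pairs):
    """Relative-frequency scores for each phrase pair, in both directions."""
    pair_counts = Counter(pairs)
    source_counts = Counter(src for src, _ in pairs)
    target_counts = Counter(tgt for _, tgt in pairs)
    scores = {}
    for (src, tgt), count in pair_counts.items():
        scores[(src, tgt)] = {
            "target_given_source": count / source_counts[src],
            "source_given_target": count / target_counts[tgt],
        }
    return scores


scores = phrase_scores(extracted)
print(scores[("green house", "maison verte")])
```

Here "green house" was extracted three times, twice with "maison verte", so that pair scores 2/3 in one direction. In the other direction it scores 1.0, because "maison verte" was only ever paired with "green house".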

Example of Phrase-based alignment with Moses software:

We’ll end this article here. The next article will investigate a practical application of one of the approaches shared here.

In this article, I gave a general overview of two main approaches in Statistical Machine Translation (SMT). Some implementation details and more in-depth theory are not covered here. Look up the links below for further study.

The big idea: We now understand how the buzz surrounding automatic translation systems began. Every step of the process is a build-up that aims to increase or optimize the effectiveness of the methods.

There is one more approach in SMT, known as the “syntax-based approach”. I will not be discussing it in this article. You can find some relevant resources to read more on it below.

Thank you for reading! Share with your network if you found it informative😊!



Aminah Mardiyyah Rufai

Machine Learning Researcher | Machine Intelligence student at African Institute for Mathematical Sciences and Machine Intelligence | PHD Candidate