
Translation Evaluation Techniques

Pasindu Nivanthaka Tennage


Round-trip translation
Translate from a source language to a target language and back to the source
language with the same engine.

Round-trip translation is a "poor predictor of quality". A round-trip translation is not


testing one system, but two systems: the language pair of the engine for
translating into the target language, and the language pair translating back from
the target language.
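
As a minimal sketch, assuming a hypothetical translate(text, src, tgt) function
exposed by the engine under test:

```python
# Minimal round-trip sketch. `translate` is a hypothetical engine
# call of the form translate(text, src, tgt) -> str.

def round_trip(text: str, translate, src: str = "en", tgt: str = "fr") -> str:
    forward = translate(text, src, tgt)  # exercises the src -> tgt pair
    back = translate(forward, tgt, src)  # exercises the tgt -> src pair
    return back

# Comparing `text` with round_trip(text, translate) conflates the
# quality of the two language pairs, which is why the round trip is
# a poor predictor of quality.
```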
Human evaluation
1. Automatic Language Processing Advisory Committee (ALPAC)
2. Advanced Research Projects Agency (ARPA)
Automatic Language Processing Advisory Committee (ALPAC)
The variables are "intelligibility" and "fidelity". Intelligibility is a measure of how
"understandable" the sentence is, rated on a scale of 1–9. Fidelity is a measure of
how much information the translated sentence retains compared to the original,
rated on a scale of 0–9.
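
As an illustration only (the class name is hypothetical, not part of ALPAC), the
two scales could be recorded and validated like this:

```python
from dataclasses import dataclass

@dataclass
class AlpacRating:
    intelligibility: int  # how understandable the sentence is (1-9)
    fidelity: int         # how much information is retained (0-9)

    def __post_init__(self):
        if not 1 <= self.intelligibility <= 9:
            raise ValueError("intelligibility must be in 1-9")
        if not 0 <= self.fidelity <= 9:
            raise ValueError("fidelity must be in 0-9")
```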
Advanced Research Projects Agency (ARPA)
Adequacy and fluency. Adequacy is a rating of how much of the information in the
original is preserved in the translation, and fluency is a rating of how well-formed
the translation reads in the target language.
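
As a sketch, per-judgement adequacy and fluency ratings might be aggregated
into corpus-level averages as follows; the helper name is an assumption:

```python
from statistics import mean

def aggregate_ratings(ratings):
    """ratings: list of (adequacy, fluency) tuples, one per judgement."""
    adequacy = mean(a for a, _ in ratings)
    fluency = mean(f for _, f in ratings)
    return adequacy, fluency

print(aggregate_ratings([(4, 5), (3, 4), (5, 5)]))  # (4.0, 4.666...)
```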
BLEU
"the closer a machine translation is to a professional human translation, the better
it is"

The metric calculates scores for individual segments, generally sentences, then
averages these scores over the whole corpus for a final score.
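
A simplified single-reference BLEU sketch showing the core mechanics (clipped
n-gram precision, geometric mean, brevity penalty); real implementations such as
nltk.translate.bleu_score add smoothing and multiple-reference support:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    cand, ref = candidate.split(), reference.split()
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clipped counts: a candidate n-gram is credited at most as
        # often as it appears in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # a real BLEU would smooth instead
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty punishes candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)

print(bleu("the cat is on the mat", "the cat is on the mat"))  # 1.0
```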
NIST
The NIST metric is based on the BLEU metric, but with some alterations. Where
BLEU simply calculates n-gram precision, giving equal weight to each n-gram,
NIST also weights each n-gram by how informative it is: when a correct n-gram is
found, the rarer that n-gram is, the more weight it receives. For example, a correct
match of the bigram "on the" receives lower weight than a correct match of the
bigram "interesting calculations", as the latter is less likely to occur.
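
The information weight NIST adds can be sketched as log2(count of the
(n-1)-gram history / count of the full n-gram), estimated over a reference corpus;
the function names here are illustrative:

```python
import math

def count_ngram(tokens, ngram):
    n = len(ngram)
    return sum(1 for i in range(len(tokens) - n + 1)
               if tuple(tokens[i:i + n]) == ngram)

def info_weight(ngram, corpus_tokens):
    c_full = count_ngram(corpus_tokens, ngram)
    if c_full == 0:
        return 0.0
    # For unigrams, the "history" count is the corpus size.
    c_hist = (len(corpus_tokens) if len(ngram) == 1
              else count_ngram(corpus_tokens, ngram[:-1]))
    return math.log2(c_hist / c_full)

corpus = "the cat sat on the mat while the dog sat on the rug".split()
print(info_weight(("sat", "on"), corpus))   # fully predictable bigram -> 0.0
print(info_weight(("the", "cat"), corpus))  # rarer continuation -> 2.0
```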
Word error rate
The word error rate (WER) is a metric based on the Levenshtein distance:
whereas the Levenshtein distance works at the character level, WER works at the
word level.

The metric is based on the number of words that differ between a piece of
machine-translated text and a reference translation.
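
A minimal sketch of WER as word-level Levenshtein distance, normalised by the
reference length:

```python
def wer(hypothesis: str, reference: str) -> float:
    hyp, ref = hypothesis.split(), reference.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution
    return dp[-1][-1] / max(len(ref), 1)

print(wer("the cat sat on mat", "the cat sat on the mat"))  # 1 edit / 6 words
```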
METEOR
Improvement to BLEU

METEOR also includes some other features not found in other metrics, such as
synonymy matching, where instead of matching only on the exact word form, the
metric also matches on synonyms. For example, the word "good" in the reference
rendering as "well" in the translation counts as a match. The metric also includes
a stemmer, which lemmatises words and matches on the lemmatised forms.
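
A toy sketch of this staged matching (exact form, then stem, then synonym); real
METEOR uses a Porter stemmer and WordNet synonym sets plus a fragmentation
penalty, so the stemmer and synonym table below are stand-ins:

```python
SYNONYMS = {"good": {"well"}, "well": {"good"}}  # toy synonym table

def stem(word):
    # Crude stand-in stemmer; METEOR uses the Porter stemmer.
    return word[:-1] if word.endswith("s") else word

def matched_unigrams(candidate, reference):
    cand, ref = candidate.split(), reference.split()
    unmatched = list(ref)
    matched = 0
    for w in cand:
        for r in list(unmatched):
            # Stage 1: exact form; stage 2: stems; stage 3: synonyms.
            if w == r or stem(w) == stem(r) or r in SYNONYMS.get(w, set()):
                unmatched.remove(r)
                matched += 1
                break
    return matched

print(matched_unigrams("the translation is good",
                       "the translation is well"))  # 4 ("good" matches "well")
```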
