Automatic evaluations are language-independent. While they can provide useful measures of linguistic
quality, they say nothing about the content: that is why
evaluations based on human judgements are a necessary complement.
In natural language generation (NLG), several metrics address the difficulty of comparing a source
text (written by a human being) with a target text (produced by a software system).
The BLEU, ROUGE, METEOR, NIST and WER metrics assign a score by measuring word sequences
(N-grams) and their frequency,
based on a comparison between a source text and a target text.
The automatic evaluation metrics
|BLEU (Bilingual Evaluation Understudy)
||gives an equal weight to all N-grams.
When BLEU reaches 1, the N-grams of the source text and the target text match exactly.
This metric was developed by IBM and is commonly used in machine translation.
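As a rough illustration, the core of BLEU (clipped N-gram precision combined with a brevity penalty) can be sketched in Python. This is a toy single-reference version, not the full formulation; the function name `bleu` and the tokenization by whitespace are illustrative choices:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Toy BLEU: geometric mean of clipped n-gram precisions,
    times a brevity penalty (single reference, no smoothing)."""
    cand, ref = candidate.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        cand_counts = Counter(ngrams(cand, n))
        ref_counts = Counter(ngrams(ref, n))
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        precisions.append(overlap / max(sum(cand_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0
    geo_mean = math.exp(sum(math.log(p) for p in precisions) / max_n)
    # Brevity penalty: punish candidates shorter than the reference.
    bp = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * geo_mean
```

On two identical texts every clipped precision is 1, so the score reaches 1, matching the property described above.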
|ROUGE (Recall-Oriented Understudy for Gisting Evaluation)
||gives weight to the higher proportion of N-grams.
There are several ROUGE metrics. The most common is ROUGE-N, which calculates the highest proportion of
N-grams of a length N found in a reference text.
The other ROUGE variants correspond to variations in the method of computation (ROUGE-S, ROUGE-L, ROUGE-W, ROUGE-2 and ROUGE-SU).
ROUGE is commonly used in connection with the generation of automatic text summaries.
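The recall orientation of ROUGE-N can be sketched as follows: the score is the proportion of the reference's N-grams that also appear in the candidate text. The function name `rouge_n` and the whitespace tokenization are illustrative assumptions:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def rouge_n(candidate, reference, n=2):
    """Toy ROUGE-N: proportion of the reference's n-grams that
    also appear in the candidate (recall, with clipped counts)."""
    cand_counts = Counter(ngrams(candidate.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    overlap = sum(min(c, cand_counts[g]) for g, c in ref_counts.items())
    total = sum(ref_counts.values())
    return overlap / total if total else 0.0
```

Unlike BLEU's precision (how much of the candidate is in the reference), the denominator here is the reference, which suits summarization: a good summary should cover the reference content.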
|NIST (National Institute of Standards and Technology)
||NIST is an adaptation of BLEU. While BLEU gives an equal weight to all N-grams, NIST gives more importance to the less frequent N-grams.
Of these metrics, NIST is the one that correlates best with human judgements.
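The idea of weighting rare N-grams more heavily can be sketched with NIST-style information weights, where an N-gram's weight grows as its frequency in the reference shrinks. This is a minimal sketch assuming a single reference text; the function name `info_weights` is illustrative:

```python
import math
from collections import Counter

def info_weights(reference, max_n=2):
    """Toy NIST-style information weights over a reference text:
    info(w1..wn) = log2(count(w1..wn-1) / count(w1..wn)),
    so rarer n-grams receive larger weights."""
    ref = reference.split()
    counts = {n: Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
              for n in range(1, max_n + 1)}
    weights = {}
    for n in range(1, max_n + 1):
        for g, c in counts[n].items():
            # For unigrams the "prefix" count is the total number of words.
            prefix = len(ref) if n == 1 else counts[n - 1][g[:-1]]
            weights[g] = math.log2(prefix / c)
    return weights
```

On the text "the cat and the dog", the rare word "cat" gets a higher weight than the frequent word "the", which is exactly the behaviour that distinguishes NIST from BLEU's uniform weighting.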
|METEOR (Metric for Evaluation of Translation with Explicit Ordering)
||gives an equal weight to all N-grams and adds to its formula a recall rate (frequency)
and a precision rate (relevance). This metric is based on the principle of explicit matches between the source text and the target text,
whether on the exact word or on a morphological variant of the word.
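The combination of recall and precision at METEOR's core can be sketched with exact-word matching only; stemming, synonym matching and the fragmentation penalty of the full metric are omitted here, and the function name `meteor_fmean` is an illustrative assumption:

```python
from collections import Counter

def meteor_fmean(candidate, reference):
    """Toy METEOR core: exact-word matches yield a precision and a
    recall, combined into a recall-weighted harmonic mean
    (Fmean = 10PR / (R + 9P), as in Banerjee & Lavie 2005)."""
    cand = Counter(candidate.split())
    ref = Counter(reference.split())
    matches = sum(min(c, ref[w]) for w, c in cand.items())
    if matches == 0:
        return 0.0
    precision = matches / sum(cand.values())
    recall = matches / sum(ref.values())
    return 10 * precision * recall / (recall + 9 * precision)
```

Weighting recall nine times more than precision reflects METEOR's finding that coverage of the reference correlates better with human judgements than precision alone.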
|WER (Word Error Rate)
||this formula is based on the explicit correspondence between words
(exact word or morphological variant). This metric is commonly used in the field of speech recognition.
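WER is the word-level edit distance (substitutions, insertions, deletions) between the two texts, divided by the length of the reference. A minimal sketch using exact-word matching only (morphological variants are not handled here, and the function name `wer` is illustrative):

```python
def wer(candidate, reference):
    """Word Error Rate: word-level edit distance divided by the
    number of words in the reference."""
    cand, ref = candidate.split(), reference.split()
    # Two-row dynamic-programming edit distance over words.
    prev = list(range(len(cand) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, c in enumerate(cand, 1):
            cost = 0 if r == c else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)
```

For example, one wrong word in a three-word reference gives a WER of 1/3; identical texts give 0.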
Agarwal Abhaya and Lavie Alon. METEOR, M-BLEU and M-TER: evaluation metrics for high-correlation with human rankings of machine translation output. In Proceedings of the Third Workshop on Statistical Machine Translation, pages 115–118. Association for Computational Linguistics, 2008.
Banerjee Satanjeev and Lavie Alon. METEOR: an automatic metric for MT evaluation with improved correlation with human judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72, 2005.
Belz Anja and Reiter Ehud. Comparing automatic and human evaluation of NLG systems. In EACL, 2006.
Belz Anja and Reiter Ehud. An investigation into the validity of some metrics for automatically evaluating natural language generation systems. Computational Linguistics, 35(4):529–558, 2009.
Chin-Yew Lin. ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out: Proceedings of the ACL-04 Workshop, pages 74–81, 2004.
Dale Robert and White Michael. Shared tasks and comparative evaluation in natural language generation, 2012.
Morris A. Cameron, Maier Viktoria, and Green Phil. From WER and RIL to MER and WIL: improved evaluation measures for connected speech recognition. In INTERSPEECH, 2004.
Tomás Jesús, Mas Josep Àngel, and Casacuberta Francisco. A quantitative method for machine translation evaluation. In Proceedings of the EACL 2003 Workshop on Evaluation Initiatives in Natural Language Processing: are evaluation methods, metrics and resources reusable?, pages 27–34. Association for Computational Linguistics, 2003.
Readability scores and edit distance
Flesch Rudolph. A new readability yardstick. Journal of Applied Psychology, 32(3):221, 1948.
Kincaid J. Peter, Fishburne Jr Robert P., Rogers Richard L., and Chissom Brad S. Derivation of new readability formulas (automated readability index, fog count and Flesch reading ease formula) for navy enlisted personnel. Technical report, DTIC Document, 1975.
Li Yujian and Liu Bo. A normalized Levenshtein distance metric. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):1091–1095, 2007.