Linguistic toolbox for assessing automated content

Assessments based on human judgments involve judges (experts) who are asked to rate a corpus of software-generated and human-written texts by assigning scores on a rating scale. The first experiment of this kind was conducted in 1999 by Lester and Porter. They asked eight experts to rate 15 texts according to several criteria (quality, consistency, writing style, content, organization, accuracy). Some texts were written by humans, while others were authored by software. The judges did not know the origin of the texts. A variant of this experiment presents different versions of the same text to the judges. Another type of human evaluation measures the reading time of a text.
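As a rough illustration of how such blind ratings might be aggregated, the following Python sketch averages the judges' scores per criterion and per text origin. The ratings, the 1-5 scale, and the `mean_by_origin` helper are invented for illustration; none of them come from the Lester and Porter study.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical ratings: (judge, text_id, origin, criterion, score).
# The 1-5 scale is an assumption. The origin is hidden from the judges
# during rating and only used afterwards, at analysis time.
RATINGS = [
    ("judge_1", "t1", "human", "quality", 4),
    ("judge_2", "t1", "human", "quality", 5),
    ("judge_1", "t2", "software", "quality", 3),
    ("judge_2", "t2", "software", "quality", 4),
    ("judge_1", "t1", "human", "accuracy", 5),
    ("judge_2", "t1", "human", "accuracy", 4),
    ("judge_1", "t2", "software", "accuracy", 4),
    ("judge_2", "t2", "software", "accuracy", 4),
]


def mean_by_origin(ratings):
    """Return {criterion: {origin: mean score}} over all judges and texts."""
    buckets = defaultdict(lambda: defaultdict(list))
    for _judge, _text_id, origin, criterion, score in ratings:
        buckets[criterion][origin].append(score)
    return {
        criterion: {origin: mean(scores) for origin, scores in by_origin.items()}
        for criterion, by_origin in buckets.items()
    }


if __name__ == "__main__":
    for criterion, by_origin in mean_by_origin(RATINGS).items():
        print(criterion, by_origin)
```

Comparing the per-origin means for each criterion gives a first impression of how the generated texts fare against the human-written ones. A real study would also need a significance test and a check of agreement between judges.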

An assessment based on human judgments requires ensuring that the judges are independent, impartial, and familiar with the application domain. It is therefore more costly and harder to organize than automatic evaluation, whose corpus-based metrics have the advantage of being independent of the language.

Automatic evaluation metrics can give excellent results if they correlate with human evaluation (preferably by monolingual subjects). They remain controversial, however: such metrics cannot assess significant linguistic features, such as the structure of the language.
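One common way to check whether an automatic metric tracks human judgment is to compute a correlation coefficient between the metric's scores and the mean human ratings for the same texts. The sketch below uses the Pearson coefficient as one possible choice. The `pearson` helper and the sample scores are hypothetical and serve only to illustrate the calculation.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant scores")
    return cov / sqrt(var_x * var_y)


# Made-up scores for five texts: one automatic metric value and one
# mean human rating per text (both scales are assumptions).
METRIC_SCORES = [0.31, 0.45, 0.52, 0.70, 0.66]
HUMAN_RATINGS = [2.0, 3.0, 3.5, 4.5, 4.0]

if __name__ == "__main__":
    r = pearson(METRIC_SCORES, HUMAN_RATINGS)
    print(f"Pearson r = {r:.3f}")
```

A coefficient close to 1 would suggest that the metric ranks texts much as the judges do, while a value near 0 would suggest it captures something else. A rank-based coefficient such as Spearman's could be substituted when the rating scale is ordinal.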
