Assessments based on human judgments involve judges (experts) who are asked to rate a corpus of machine-generated and human-written texts by assigning each text a score on a rating scale. The first experiment of this kind was conducted in 1999 by Lester and Porter, who asked 8 domain experts to rate 15 texts according to several criteria (quality, consistency, writing style, content, organization, accuracy). Some texts were written by humans, others were generated by software, and the judges did not know the origin of each text. A variant of this experiment presents the judges with different versions of the same text. Another type of human evaluation measures the reading time of a text.
An assessment based on human judgments requires that the subjects (judges) be independent, impartial and familiar with the application domain. It is more costly and harder to organize than automatic, corpus-based evaluation metrics, which have the advantage of being language-independent.
Automatic evaluation metrics can give excellent results, provided that they correlate with human evaluation (preferably carried out by monolingual subjects). They are nonetheless controversial: such metrics cannot assess significant linguistic features, such as the structure of the language.
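One way to check whether an automatic metric agrees with human evaluation is to compute a correlation coefficient between the metric's scores and the mean ratings given by the judges. The following is a minimal sketch in Python, not a reference procedure: the ratings, the metric scores, the 1 to 5 scale and the choice of the Pearson coefficient are all illustrative assumptions (a rank correlation such as Spearman's is an equally common choice).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: one row per text, one rating per judge (scale 1 to 5).
human_ratings = [
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [1, 2, 2],
    [3, 3, 4],
]
# Hypothetical scores given to the same five texts by an automatic metric.
metric_scores = [0.71, 0.40, 0.80, 0.35, 0.55]

# Average the judges' ratings per text, then correlate with the metric.
mean_human = [sum(row) / len(row) for row in human_ratings]
r = pearson(mean_human, metric_scores)
print(f"Pearson r = {r:.3f}")
```

A coefficient close to 1 indicates that the metric ranks the texts much as the judges do. A low or negative value suggests that the metric captures something other than what the judges are rating.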