Open-ended text generation is the task of generating text of arbitrary length as a continuation of a given context. It differs from more constrained tasks such as text summarization, where the source document largely determines the output. Open-ended generation is challenging because the model must produce text that is coherent and grammatically correct, which in turn requires a strong understanding of the language.
To generate text, a language model produces the next word given a context, using the conditional probability of the next word given the previous words:
$$ P(w_i | w_{i-1}, \ldots, w_1) $$
The goal in open-ended text generation is to produce a coherent continuation of the given context. This is done by decoding from the language model with algorithms such as greedy search, beam search, top-k sampling, and top-p (nucleus) sampling.
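To make the sampling strategies concrete, here is a minimal sketch of top-p (nucleus) filtering over a toy next-word distribution. The vocabulary, probabilities, and the helper name `top_p_filter` are illustrative assumptions, not part of any particular library:

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p (the nucleus), zero out the rest, and renormalize."""
    order = np.argsort(probs)[::-1]              # tokens by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # size of the nucleus
    filtered = np.zeros_like(probs)
    filtered[order[:cutoff]] = probs[order[:cutoff]]
    return filtered / filtered.sum()

# Toy next-word distribution over a 5-word vocabulary.
probs = np.array([0.5, 0.3, 0.1, 0.07, 0.03])
nucleus = top_p_filter(probs, p=0.9)
# Tokens outside the nucleus receive probability 0; the rest are rescaled.
```

Greedy search corresponds to always taking `np.argmax(probs)`, while top-k sampling keeps a fixed number of tokens instead of a probability mass; the nucleus adapts its size to the shape of the distribution, which is why it avoids both the repetition of greedy decoding and the incoherence of sampling from the full tail.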
The authors of [1] show that nucleus sampling is currently the best available decoding strategy for generating long-form text that is both high-quality, as measured by human evaluation, and as diverse as human-written text.
An analysis of popular evaluation metrics: Examples and Observations
Human evaluation captures quality but not diversity: it does not catch models that simply plagiarize from the training set, which have zero generalization ability and thus inadequate diversity. Human judges cannot easily quantify diversity, and more often than not they fail to detect under-diversity [3].
Perplexity
Perplexity is a common metric for evaluating language models. It is the inverse probability of the test data, normalized by the number of tokens; equivalently, it is the exponential of the cross-entropy.
$$ Perplexity = e^{CE} = e^{-\frac{1}{N}\sum_{i=1}^N \log P(x_i)} $$
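The formula above can be computed directly from per-token probabilities. A small sketch, assuming the model's probabilities for each test token are already available as a list:

```python
import math

def perplexity(token_probs):
    """Perplexity as the exponential of the average negative log-probability."""
    n = len(token_probs)
    cross_entropy = -sum(math.log(p) for p in token_probs) / n
    return math.exp(cross_entropy)

# A model that assigns probability 0.25 to every token is, on average,
# choosing uniformly among 4 tokens, so its perplexity is ~4.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```

Lower perplexity means the model is less "surprised" by the test data; a perplexity of k can be read as the model hesitating among k equally likely tokens at each step.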
Observations
Zipf Coefficient
Zipf’s law is an empirical law, formulated using mathematical statistics, that describes the frequency distribution of words in natural languages. It states that, given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table [4]. A rank-frequency plot of n-gram frequencies in human-written versus machine-generated text can be used to compare the distributional quality of the text generated by the model.
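The Zipf coefficient can be estimated as the slope of the rank-frequency plot in log-log space. A minimal sketch, using a toy corpus constructed so that frequencies follow 1/rank exactly (real corpora only follow the law approximately):

```python
import numpy as np
from collections import Counter

def zipf_coefficient(tokens):
    """Fit the slope of log(frequency) vs. log(rank); for natural
    language the magnitude is typically close to 1."""
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return -slope

# Toy corpus whose word frequencies are exactly proportional to 1/rank.
corpus = ["the"] * 60 + ["of"] * 30 + ["and"] * 20 + ["to"] * 15 + ["a"] * 12
coef = zipf_coefficient(corpus)
```

Comparing the coefficient (and the full rank-frequency curve) of generated text against that of human text gives a quantitative view of the distributional differences described above.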
Observations