Open-ended text generation is the task of generating text of arbitrary length as a continuation of a given context. It differs from more constrained tasks such as text summarization, where the source document largely determines the output. Open-ended generation is challenging because the model must produce text that is coherent and grammatically correct, which in turn requires a strong understanding of the language.
To generate text, a language model produces the next word given a context, using the conditional probability of the next word given the previous words:
$$ P(w_i | w_{i-1}, \ldots, w_1) $$
The goal in open-ended text generation is to produce a coherent continuation of the given context. This is done by decoding from the language model with algorithms such as greedy search, beam search, top-k sampling, and top-p (nucleus) sampling.
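To make the sampling strategies concrete, here is a minimal sketch of top-p (nucleus) filtering over a toy next-word distribution. The vocabulary, probabilities, and the helper name `top_p_filter` are illustrative assumptions, not part of any particular library:

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p (the nucleus), zero out the rest, and renormalize."""
    order = np.argsort(probs)[::-1]              # tokens by descending probability
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1  # size of the nucleus
    filtered = np.zeros_like(probs)
    filtered[order[:cutoff]] = probs[order[:cutoff]]
    return filtered / filtered.sum()

# Toy next-word distribution over a 5-word vocabulary.
probs = np.array([0.5, 0.3, 0.1, 0.07, 0.03])
nucleus = top_p_filter(probs, p=0.9)
# Tokens outside the nucleus receive probability 0; the rest are rescaled.
```

Greedy search corresponds to always taking `np.argmax(probs)`, while top-k sampling keeps a fixed number of tokens instead of a probability mass; the nucleus adapts its size to the shape of the distribution, which is why it avoids both the repetition of greedy decoding and the incoherence of sampling from the full tail.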
The authors of [1] show that nucleus sampling is currently the best available decoding strategy for generating long-form text that is both high-quality, as measured by human evaluation, and as diverse as human-written text.
An analysis of popular evaluation metrics: Examples and Observations
Human evaluation captures quality but not diversity: it does not catch models that simply plagiarize from the training set, which have zero generalization ability and thus inadequate diversity. Human judges cannot easily quantify diversity, and more often than not they fail to detect under-diversity [3].
Perplexity
Perplexity is a common metric for evaluating language models. It is the inverse probability of the test data, normalized by the number of tokens; equivalently, it is the exponential of the cross-entropy.
$$ Perplexity = e^{CE} = e^{-\frac{1}{N}\sum_{i=1}^N \log P(x_i)} $$
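The formula above can be computed directly from per-token probabilities. A small sketch, assuming the model's probabilities for each test token are already available as a list:

```python
import math

def perplexity(token_probs):
    """Perplexity as the exponential of the average negative log-probability."""
    n = len(token_probs)
    cross_entropy = -sum(math.log(p) for p in token_probs) / n
    return math.exp(cross_entropy)

# A model that assigns probability 0.25 to every token is, on average,
# choosing uniformly among 4 tokens, so its perplexity is ~4.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```

Lower perplexity means the model is less "surprised" by the test data; a perplexity of k can be read as the model hesitating among k equally likely tokens at each step.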
Observations
Zipf Coefficient
Zipf’s law is an empirical law, formulated using mathematical statistics, that describes the frequency distribution of words in natural languages. It states that, given some corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table [4]. A rank-frequency plot of n-gram frequencies in human-written versus machine-generated text can be used to compare the distributional quality of the text generated by the model.
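The Zipf coefficient can be estimated as the slope of the rank-frequency plot in log-log space. A minimal sketch, using a toy corpus constructed so that frequencies follow 1/rank exactly (real corpora only follow the law approximately):

```python
import numpy as np
from collections import Counter

def zipf_coefficient(tokens):
    """Fit the slope of log(frequency) vs. log(rank); for natural
    language the magnitude is typically close to 1."""
    freqs = np.array(sorted(Counter(tokens).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(freqs) + 1)
    slope, _ = np.polyfit(np.log(ranks), np.log(freqs), 1)
    return -slope

# Toy corpus whose word frequencies are exactly proportional to 1/rank.
corpus = ["the"] * 60 + ["of"] * 30 + ["and"] * 20 + ["to"] * 15 + ["a"] * 12
coef = zipf_coefficient(corpus)
```

Comparing the coefficient (and the full rank-frequency curve) of generated text against that of human text gives a quantitative view of the distributional differences described above.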
Observations