Honglu Fan


LLM myths 2: perplexity and surprise

Tags: Language Models, LLM myths

Given a language model, people call the exponential of the mean negative log-likelihood of a sentence the “perplexity”. As the word indicates, it is supposed to show whether the language model is perplexed or surprised by the sentence. Sometimes, people restrict the mean to a small range of tokens, or even a single token, and talk about “the level of surprise” there. It is a simple explanation that gets the concept across to a wide audience, and one of many ways of anthropomorphizing language models. Where does it come from, and does it really agree with the human sense of “surprise”?

Surprise in Bayesian theory

In many theoretical subjects such as information theory or Bayesian statistics, there is a notion under the name of “surprise” (or “surprisal”) which partially and loosely comes from modelling the human notion of “surprise”. One of the most common versions goes as follows:

Given a Bayesian model with prior $P(\theta)$ over parameters $\theta$, modelling a data distribution $\mathcal D$, when a sample $d \sim \mathcal D$ is observed, the “surprise” of the sample is defined as the negative log-likelihood

$$ - \text{log}(P(d | \theta)). $$

The use of the word “surprise” is loosely justified in several ways, such as:

  1. If the prior is accurate, the more probable samples get less surprise, and vice versa.
  2. By the way Bayesian inference works, the posterior $P(\theta | d)$ is proportional to $P(\theta) \cdot e^{-s}$, where $s = - \text{log}(P(d | \theta))$ is the “surprise”. Given a parameter $\theta$, the more “surprise” there is, the more the prior is adjusted by shrinking its probability density at $\theta$.
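To make point 2 concrete, here is a tiny numerical sketch in Python; the two parameter values and all probabilities are made up for illustration:

```python
import math

# Two hypothetical parameter values with a uniform prior (numbers made up).
prior = {"theta_a": 0.5, "theta_b": 0.5}
# Likelihood P(d | theta) of one observed sample d under each parameter.
likelihood = {"theta_a": 0.9, "theta_b": 0.1}

# Surprise of d under each parameter.
surprise = {t: -math.log(p) for t, p in likelihood.items()}

# Posterior is proportional to prior * exp(-surprise):
# the more surprised a parameter is, the more its density shrinks.
unnormalized = {t: prior[t] * math.exp(-surprise[t]) for t in prior}
z = sum(unnormalized.values())
posterior = {t: u / z for t, u in unnormalized.items()}

print(posterior)  # roughly {'theta_a': 0.9, 'theta_b': 0.1}
```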

Also, another down-to-earth explanation specifically tailored for ML people is the following:

Imagine you have an image classification model $M$ that classifies an image as cat or dog. Say you have a picture that just looks like a cat, and your model gives it a $99\%$ chance of being a cat. If it turns out to be a cat, the negative log-likelihood is $ - \text{log}(0.99) \approx 0.01$, which is pretty low. But if somehow it comes out as a dog (hmm… a chihuahua missing from the training data?), the negative log-likelihood is $ - \text{log}(1 - 0.99) = - \text{log}(0.01) \approx 4.6$, which is fairly high. So people say that it models the human notion of “surprise” when comparing the model’s own prediction with the real outcome.
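The arithmetic is short enough to check in a few lines of Python (a minimal sketch; the $99\%$ confidence is just the number from the example above):

```python
import math

# Hypothetical classifier output: P(cat) = 0.99 for a cat-looking picture.
p_cat = 0.99

# Surprise (negative log-likelihood) of each possible outcome.
print(f"outcome=cat: {-math.log(p_cat):.2f} nats")        # ~0.01, barely surprised
print(f"outcome=dog: {-math.log(1.0 - p_cat):.2f} nats")  # ~4.61, very surprised
```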

All these explanations are good. It is very common to give a mathematical concept an intuitive name based on human experiences. In this case, it is perhaps even a very good one, so good that many people forget:

There is a difference between “a mathematical definition that is modelled on and named after A” and “the human notion of A that comes from the Oxford dictionary and that your grandparents and kids would understand”.

Such a disagreement may look minor in a binary image classification model, but arbitrarily generalizing it to other models may require a second thought.

In fact, a name is simply a name after all. You can be one of those rationalists and say that the word “surprise” in your life is defined as the negative log-likelihood of your brain under a certain prior and measurement. But I am a normal person, and I want my use of words to be understood by my friends and family.

Decision paralysis vs Surprise

Scenario 1

A few weeks ago, my wife and I went to a French crêperie during a vacation. The menu was very cool, full of colorful French words, but there were 15 abracadabra crêpes to choose from.

There was a bit of time pressure as the waitress was waiting. In the end, my wife picked a random crêpe, which turned out to be full of beef slices and bore a name I will never remember.

It looked awesome, but was I surprised?

There were $15$ possibilities and I was fairly convinced that a good and realistic prior would concentrate around the uniform distribution over the $15$ classes. Fixing the uniform distribution, each crêpe had a probability of $\frac{1}{15}$, and the Bayesian “surprise” I got was $ - \text{log}(\frac{1}{15}) \approx 2.7$, which was pretty high, since the observation happened with a probability of around $0.07$.

But if you asked me as a normal human being: nope, not surprised at all, since we had no idea anyway.

OK, you are going to say that there were so many ($15$) classes, and that “surprise” should be a relative term with a proper baseline, such as the uniform distribution where each class has a probability of $\frac{1}{15}$. Fair. But consider the following.

Scenario 2

This time, my wife and I went to a bar. There were also $15$ items on the drink menu, $11$ of which were alcoholic. Both of us drink alcohol, but having been together for 10 years, I was fairly confident that she would go with a soft drink. That left us with Water, Fanta, Orange juice, and Diet coke. She rarely ordered Fanta, but it was a soft drink, so its probability was not negligible, boosted by the chance that she might want to try something new.

My prior concentrated around the following distribution:

  • Alcohol: $0.01$ each
  • Water, Orange juice, Diet coke: $0.25$ each
  • Fanta: something in between. One minus all of the above leaves $0.14$, which seems very fair.

She went for Fanta. Was I surprised? I got $ - \text{log}(0.14) \approx 1.97 < 2.7$. So a little, but not quite?

If you asked me as a normal human being, I would say that I was fairly surprised, and obviously more surprised than when I saw her choose the crêpe.
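Putting the two scenarios side by side (a minimal sketch; the probabilities are the made-up priors from above):

```python
import math

# Scenario 1: uniform prior over 15 crêpes.
surprise_crepe = -math.log(1 / 15)   # ~2.71 nats

# Scenario 2: concentrated prior over the drink menu.
surprise_fanta = -math.log(0.14)     # ~1.97 nats

# Bayesian "surprise" ranks the crêpe above the Fanta,
# the opposite of my human reaction.
print(f"crêpe: {surprise_crepe:.2f} nats, Fanta: {surprise_fanta:.2f} nats")
```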

A statistician might already be shaking their head, but please bear with me…

Counterarguments

A statistician would yell: the priors were not good, and they were all different! Plus, the samples are far too few!

A neuro-psychologist might also yell at me: you are confusing the thought with the reaction!

An ML scientist might yell at me: OK, you don’t get surprised by things you don’t care about. So you need to adjust by a factor of attention!

A Zoomer might yell at me: you should use entropy and varentropy (or a futuristic skewness-entropy or kurtosis-entropy)!

Ok, ok… Let us pause for a minute and reflect:

  1. Yes, using an arbitrary prior over a single data point, and using different priors to compare likelihoods, can lead to absurdity if not done carefully. In fact, the whole point of this blog post is to point out that such practices are somehow not uncommon in ML.
  2. For the mini tricks and hacks: are you confident about covering all the ground? Are you sure that you can escape the bitter lesson?
  3. Does it even make sense to ask for a single number that can be compared in broad contexts and that everybody agrees measures “surprise”?

Language modelling

Back to language models, recall that a language model measures the conditional probability

$$ P(x_n | x_1, x_2, \cdots, x_{n-1}) $$

given a sequence $x_1, \cdots, x_{n-1}$. Since the sampling is usually auto-regressive, $x_1, \cdots, x_{n-1}$ can be regarded as the result of a process that samples each $x_i$ sequentially, with a distribution that changes based on the previous choices. This is a typical non-Markovian process. Sometimes, I will call $x_1, \cdots, x_{n-1}$ a sampling trajectory.

Now, the perplexity is commonly defined as

$$ \text{ppl}(x_1,\cdots,x_n) = \prod\limits_{i=1}^n P(x_i | x_1, \cdots, x_{i-1})^{-\frac{1}{n}}. $$
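In code, this is just the exponential of the mean negative log-likelihood of the per-token conditional probabilities; a minimal sketch with made-up numbers:

```python
import math

def perplexity(token_probs):
    """Perplexity of a sequence, given the per-token conditional
    probabilities P(x_i | x_1, ..., x_{i-1}) from a language model."""
    n = len(token_probs)
    mean_nll = -sum(math.log(p) for p in token_probs) / n
    return math.exp(mean_nll)

# Hypothetical per-token probabilities along one sampling trajectory.
print(perplexity([0.2, 0.9, 0.05, 0.6]))  # ~3.69
```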

Does a higher $\text{ppl}(x)$ indicate a higher “surprise”? There are a couple of scenarios:

  • It is discussed under the same prior over the space of parameters. Note that the sequence is sampled consecutively and forms a non-Markovian process. Then it is perhaps a game of terminology: it is up to you whether you want to call this the surprise associated with the entire sampled sequence.
  • I am looking at the training loss curve: hey, my ppl is lower at step $5000$ than at step $1000$, so the model is less surprised by the validation dataset! Well, I would argue that this is a very slippery road, because we are comparing negative log-likelihoods under two different priors. Feel free to go back to the earlier examples: are you more surprised under a uniform prior, or under a more concentrated but sharply contrasted one?

What about the individual negative log-likelihood of each token? I would argue that it is even more dangerous when discussed under different sampling trajectories (by which I mean the previous tokens $x_1, \cdots, x_{n-1}$). In a sense, if the sampling trajectory changes, it should be treated as part of the parameter space, and it contributes to a different prior.
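As a toy illustration of this trajectory dependence (all numbers made up): the same token can carry very different per-token NLLs under different prefixes, so comparing the raw numbers across trajectories silently compares different priors.

```python
import math

# A toy table of conditional probabilities P(next | prefix), made up.
cond_prob = {
    ("the", "cat"): 0.30,  # "cat" is likely after "the" in this toy model
    ("a", "cat"): 0.02,    # and much less likely after "a"
}

# The same token "cat" gets very different NLLs on different trajectories.
for prefix in ("the", "a"):
    nll = -math.log(cond_prob[(prefix, "cat")])
    print(f"prefix={prefix!r}: nll('cat') = {nll:.2f} nats")
```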

Remarks

At least for me, if I think about it for a second, it was not uncommon that every now and then I sought “intuition” and let myself conflate perplexity, NLL and the human notion of surprise. I am not arguing that it is forbidden, but a more careful understanding usually leads to at least a better awareness and corrects wishful thinking.

On the other hand, it is quite common among people who work with language models to note that the individual negative log-likelihood actually indicates a degree of choice: the higher the NLL is, the more choices there might be when sampling a given token, which might indicate a crucial branching point in a sentence. This line of thought leads to some people’s theories of entropy-based metrics for individual tokens. That is an entirely different topic, but my general attitude is that there is something good in this direction, though the implementation is not as straightforward as measuring everything with a universal metric under all circumstances. I will defer this topic to the future.