Honglu Fan


LLM myths 2: perplexity and surprise

Language Models LLM myths

Given a language model, people call the exponential of the mean negative log-likelihood per token of a sentence its “perplexity”. As the word indicates, it is supposed to show whether the language model is perplexed or surprised by the sentence. Sometimes people modify the averaging and talk about “the level of surprise” over a small range of tokens or a single token. It is a simple explanation that gets the concept across to a wide audience, and one of many ways of anthropomorphizing language models. Where does it come from, and does it really agree with the human sense of “surprise”?
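To make the quantities concrete, here is a minimal sketch of the usual definitions: per-token surprisal is the negative log-probability of a token, and perplexity is the exponential of the mean surprisal. The function names and the toy probabilities are made up for illustration; in practice the log-probabilities would come from a model.

```python
import math

def surprisal(token_logprob):
    """Surprisal of a single token: its negative log-probability (in nats)."""
    return -token_logprob

def perplexity(token_logprobs):
    """Perplexity = exp(mean negative log-likelihood per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    mean_nll = sum(surprisal(lp) for lp in token_logprobs) / len(token_logprobs)
    return math.exp(mean_nll)

# Hypothetical probabilities a model assigned to three consecutive tokens.
probs = [0.5, 0.25, 0.125]
logprobs = [math.log(p) for p in probs]

ppl = perplexity(logprobs)
per_token = [surprisal(lp) for lp in logprobs]
print(ppl, per_token)
```

One way to read the number: a perplexity of k means the model was, on average, as uncertain as if it were choosing uniformly among k tokens.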


LLM myths 1: why does LLM generate infinite loops

Language Models LLM myths

Looping is fairly common when sampling from an LLM. We normally do not want it to happen, and there have been many tricks to prevent it, such as repetition penalties or hard-coded loop detection, but their effectiveness is debatable. Explanations of this phenomenon seem scarce in the literature, and it might at first feel like just another bug in our day-to-day data/model engineering, without anything deep behind it.
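For readers who have not seen these tricks, the following is a rough sketch of both, written in plain Python over lists rather than against any particular library. The penalty follows the commonly used convention of dividing positive logits and multiplying negative logits of already-generated tokens by a factor greater than 1; the loop detector simply checks whether the tail of the output is one short unit repeated several times. The function names and default thresholds are illustrative assumptions.

```python
def apply_repetition_penalty(logits, generated_ids, penalty=1.2):
    """Make already-generated tokens less likely (penalty > 1)."""
    out = list(logits)
    for tok in set(generated_ids):
        # Dividing a positive logit or multiplying a negative one both lower it.
        if out[tok] > 0:
            out[tok] /= penalty
        else:
            out[tok] *= penalty
    return out

def has_trailing_loop(token_ids, max_period=8, min_repeats=3):
    """Return True if the sequence ends with a unit of length <= max_period
    repeated at least min_repeats times in a row."""
    n = len(token_ids)
    for period in range(1, max_period + 1):
        span = period * min_repeats
        if span > n:
            break
        tail = token_ids[n - span:]
        unit = tail[:period]
        if all(tail[i] == unit[i % period] for i in range(span)):
            return True
    return False

penalized = apply_repetition_penalty([2.0, -1.0, 0.5], generated_ids=[0, 1], penalty=2.0)
looping = has_trailing_loop([9, 1, 2, 3, 1, 2, 3, 1, 2, 3])
print(penalized, looping)
```

Both are heuristics layered on top of sampling: the penalty also suppresses legitimate repetition, and the detector only catches exact token-level cycles.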

But for modern LLMs without ad-hoc outer logic guarding their output, sampling loops have been with us all along. For example, you can easily induce a DeepSeek model into a loop with this prompt (as of Oct 30th, 2024): “Write me a bunch of bullet points.”


MCTS and Theorem proving

Reinforcement Learning Math + AI

With the increasing maturity of the Lean theorem prover, many people have attempted to combine reinforcement learning (RL) with theorem proving. Among these attempts, HyperTree Proof Search has been quite notable, and I personally admire it a lot.

Looking around, the general field of neural reasoning has also become more prominent, since logical reasoning is one of the few domains where LLMs continue to struggle to reach a satisfactory degree of reliability. A nice recent survey is this.


Longer than Chinchilla

Language Models

In large language model pretraining, every single training run takes a massive computing budget.

The Chinchilla optimal bounds were proposed in the paper An empirical analysis of compute-optimal large language model training. A very common misunderstanding about the Chinchilla scaling law is that it imposes an upper bound on the number of tokens one should train for given a fixed parameter count. But it is really about the optimal tradeoff between the number of tokens and the model size, given a fixed computing budget. In practice, it might give a good reference number of tokens, but a general rule of thumb is still to train for as many tokens as possible before the training loss or eval loss starts to diverge.
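The tradeoff can be illustrated with a back-of-the-envelope sketch. It relies on two widely quoted approximations that are assumptions here, not exact results: training compute C ≈ 6·N·D FLOPs for N parameters and D tokens, and a compute-optimal ratio of roughly 20 tokens per parameter. The function names are made up for illustration.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6.0   # assumption: C ~= 6 * N * D
TOKENS_PER_PARAM = 20.0       # assumption: rough compute-optimal ratio

def training_flops(n_params, n_tokens):
    """Approximate training compute for a dense transformer."""
    return FLOPS_PER_PARAM_TOKEN * n_params * n_tokens

def compute_optimal_split(flops_budget):
    """Split a FIXED compute budget into (params, tokens) under the
    assumptions above: solve 6 * N * (20 * N) = C for N."""
    n_params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

# Budget of a 70B-parameter model trained on 1.4T tokens.
budget = training_flops(70e9, 1.4e12)
n_opt, d_opt = compute_optimal_split(budget)
print(f"budget={budget:.3g} FLOPs -> N={n_opt:.3g}, D={d_opt:.3g}")

# With N fixed, more tokens just means spending more compute;
# nothing in the formula caps D for a given N.
longer = training_flops(70e9, 4 * 1.4e12)
print(f"4x tokens at fixed N costs {longer / budget:.0f}x the compute")
```

Note that the budget is the input to `compute_optimal_split`. The ratio answers "how should I spend C?", not "when should I stop training a model of size N?".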


Breaking the Generalization Barrier

Language Models

I have been working with the Carper folks on OpenELM and diff models (see the blog) for quite a while. In particular, I have spent a lot of time finetuning diff models, which are based on CodeGen and finetuned on GitHub commit data (filtered down to 1.2M documents totalling about 1B tokens) to automatically suggest commits.

Many interesting things happen during model training. One specific thing I am documenting here is a phenomenon in the loss curves: how the model develops its abilities, and the emergence of several distinct levels of loss plateau/generalization barrier/critical submanifold, or whatever you may call it.
