+--------------+
|  Honglu Fan  |
+--------------+


Longer than Chinchilla

Language Models
In large language model pretraining, every single training run takes a massive computing budget. The Chinchilla optimal bounds were proposed in the paper An empirical analysis of compute-optimal large language model training. A very common misunderstanding about the Chinchilla scaling law is that it imposes an upper bound on the number of tokens one should train on for a given parameter count. But it is really about the optimal tradeoff between the token count and the model size, given a fixed computing budget. Read more...
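
A minimal sketch of that tradeoff, assuming the common C ≈ 6·N·D approximation (N parameters, D training tokens) and the roughly 20-tokens-per-parameter ratio that the Chinchilla fits suggest; these constants are the usual rules of thumb, not values taken from this post:

# Given a fixed compute budget C ~ 6 * N * D, with the optimum near
# D/N ~ 20, both N and D grow like sqrt(C).
import math

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Return (params, tokens) that roughly exhaust a fixed FLOP budget."""
    # C = 6 * N * D and D = r * N  =>  N = sqrt(C / (6 r)), D = r * N
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    # Example: a ~5.76e23 FLOP budget recovers roughly the Chinchilla setting
    n, d = chinchilla_optimal(5.76e23)
    print(f"~{n / 1e9:.0f}B parameters, ~{d / 1e12:.1f}T tokens")

The point is that neither the parameter count nor the token count is bounded on its own: scale the budget and both optimal quantities move together.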

Breaking the Generalization Barrier

Language Models
I have been working with the Carper folks on OpenELM and diff models (see the blog) for quite a while. In particular, I have spent a lot of time finetuning diff models, which are based on CodeGen and finetuned on GitHub commit data (filtered down to 1.2M documents totalling about 1B tokens) to automatically suggest commits. Many interesting things happen during the model training. One specific thing I am documenting here is a phenomenon in the loss curves: how the model developed its abilities, and the emergence of several distinct levels of loss plateau/generalization barrier/critical submanifold, or whatever you may call it. Read more...

A toy RL environment for expression generation

Reinforcement Learning
In Carper AI’s OpenELM project, we use diff models (trained on GitHub commits) to generate/mutate code. The models are transformer-based and they model code as sequences. The rough idea is that the sequence model gets feedback on the execution results of its code, and Reinforcement Learning in turn uses those rewards to finetune and align the sequence modelling to the task. This is almost a partial analog of the RLHF pipeline with a fixed and deterministic offline reward function (in a sense, feedback from the compiler). Read more...
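
A minimal, hypothetical sketch of that reward-from-execution loop (not the OpenELM code itself): a policy proposes an expression, the expression is executed, and the deterministic execution result becomes the reward. The function and target below are purely illustrative.

def execution_reward(program: str, target: float) -> float:
    """Run a generated arithmetic expression and reward it for matching a target value."""
    try:
        # Restrict the namespace so only plain arithmetic expressions can run.
        value = eval(program, {"__builtins__": {}}, {})
    except Exception:
        return -1.0  # the "compiler feedback": code that fails to run is penalized
    return -abs(float(value) - target)  # closer to the target => higher reward

if __name__ == "__main__":
    # A policy (e.g. a finetuned diff model) would propose these candidates.
    candidates = ["3 * (4 + 1)", "2 ** 5", "1 /", "17"]
    for prog in candidates:
        print(prog, "->", execution_reward(prog, target=16.0))

Because the reward is a fixed, deterministic function of the generated sequence, the environment can be treated as offline: no human preference model is needed, only the execution result.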