Asynchronous RL
Mar 4
Language models learn to reason, use tools, and develop agentic capabilities through a post-training process called reinforcement learning with verifiable rewards (RLVR), during which the model is set up to autonomously interact with environments that provide reward signals for improvement.
By and large, RLVR is a feedback loop cycling through three stages: generation, verification, and model training. A central systems consideration is how to overlap generation (inference) and verification with training. In practice, generation faces delays due to the unevenness of generation length, which can be long-tailed: a small fraction of rollouts may take exceptionally long sequences to explore the problem space through chain of thought. From a theoretical perspective, the demand that RL stay on-policy means generation and training must occur serially, creating pipeline bubbles. This is fundamentally unsound from a systems perspective, which advocates parallelizing operations and overlapping computation.
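The overlap idea can be sketched as a producer-consumer loop: a generator thread streams rollouts into a bounded queue while the trainer consumes them, so neither stage idles. This is a minimal toy sketch; the names (`run_async_rl`, `max_staleness`, the dict fields) are hypothetical stand-ins, not any particular framework's API.

```python
import queue
import threading

def run_async_rl(num_rollouts, max_staleness=4):
    """Toy sketch of overlapping generation and training via a bounded queue."""
    rollouts = queue.Queue(maxsize=max_staleness)  # bounds how far generation can run ahead
    trained = []

    def generator():
        for i in range(num_rollouts):
            # Stand-in for sampling a rollout; lengths are uneven, like real generation.
            rollouts.put({"id": i, "tokens": [i] * (1 + i % 3)})
        rollouts.put(None)  # sentinel: generation finished

    def trainer():
        while True:
            item = rollouts.get()
            if item is None:
                break
            trained.append(item["id"])  # stand-in for a gradient step on this rollout

    g = threading.Thread(target=generator)
    t = threading.Thread(target=trainer)
    g.start(); t.start()
    g.join(); t.join()
    return trained
```

The bounded queue is the key design choice: its capacity caps how stale a rollout can be by the time the trainer consumes it, which connects directly to the off-policy discussion below.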
Because on-policy RL is detrimental to efficiency, and therefore to scalability, one is led to abandon the requirement to train on-policy. However, training off-policy means the trainer runs ahead of the generator: by the time a long rollout $y$ generated from model checkpoint $\theta_t$ finishes, the trainer has already completed several training steps and reached a checkpoint $\theta_{t+k}$, having trained on the shorter rollouts of $\theta_t$ as well as of intermediate checkpoints. Thus, technically, the reward assigned to $y$ applies strongly to checkpoint $\theta_t$, but perhaps only weakly to $\theta_{t+k}$. It is a research question to characterize the tradeoff between the efficiency gained through off-policy RL and the possible performance degradation from training with a loss computed on stale rollouts generated by $\theta_t$. The stability of training, the possibility of model collapse, and mitigations all need to be explored. As of this writing, typical approaches to off-policy RL include near on-policy RL and importance sampling correction.
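One common flavor of importance sampling correction is to reweight each stale rollout by the ratio of trainer to generator likelihood, truncating large ratios to keep variance in check (in the spirit of PPO-style clipping). The sketch below is illustrative only; the function name and the clip threshold are assumptions, not a reference implementation.

```python
import numpy as np

def truncated_is_weights(logp_train, logp_gen, clip=2.0):
    """Truncated importance weights for rollouts generated by a stale checkpoint.

    logp_train / logp_gen: log-probabilities of the sampled responses under the
    current trainer policy and the (stale) generator policy, respectively.
    """
    w = np.exp(np.asarray(logp_train) - np.asarray(logp_gen))
    return np.minimum(w, clip)  # truncate large weights to control gradient variance
```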
Rigorous Foundation for RL
Let $\mathcal{V}$ be a finite set called the vocabulary; its elements are called tokens. Let $\mathcal{V}^*$ be the space of all finite sequences of tokens. Let $\mathcal{P}(\mathcal{V})$ be the set of all probability measures on $\mathcal{V}$, which can be identified with the simplex $\Delta^{|\mathcal{V}|-1}$. The LLM policy parameterized by weights $\theta$ is a map $$\pi_\theta : \mathcal{V}^* \to \mathcal{P}(\mathcal{V}).$$ In the mathematical sense $\pi_\theta$ is like a kernel from $\mathcal{V}^*$ to $\mathcal{V}$ (it is a kernel assuming our LLM is such that the suitable measurability conditions are satisfied). Thus we have an associated map $(s, t) \mapsto \pi_\theta(t \mid s)$ that predicts next-token probabilities.

Let $y = (y^1, \dots, y^n)$ be a sequence of tokens. Denote $|y| = n$ and adopt the subsequence notation $y^{<k} = (y^1, \dots, y^{k-1})$ for all $k \le n$. We extend the domain of $\pi_\theta$ from $\mathcal{V}$ to $\mathcal{V}^*$ by defining $$\pi_\theta(y \mid x) = \prod_{k=1}^{|y|} \pi_\theta\big(y^k \mid x\, y^{<k}\big).$$ So LLM generation can be thought of as a stochastic process started at a prompt $x$ with transition kernel $\pi_\theta$.

A natural question is to define a probability measure over the space of all sequence paths, in a way reminiscent of the Wiener measure for Brownian motion, but for now I will skip a rigorous construction due to the need to define a measure over the space of prompts, which is infinite. While I have not seen a convincing answer, I will offer my thoughts: the LLM learns the kernel $\pi_\theta$ and not the probability measure over the set of all prompts. However, because prompts and responses are both finite sequences of tokens, a first guess is to take the set $\mathcal{V}^{\le N}$ of sequences of length at most $N$, with $N$ large enough to be practically infinite. If we have a kernel $\pi$, then define a measure on $\mathcal{V}^{\le N}$ by the product of the kernel's conditionals, which approximates the language in the wild in some sense. We may think of this measure of a token $t$ as the model's approximation of the probability of $t$ in the language. If we have access to some (tokenized) text corpus, then we can approximate the frequency of $t$ by its relative frequency in the corpus. We then consider all subsequences in the corpus that start with $s$, and all such sequences followed by $t$; we expect the relative ratio to be approximately $\pi(t \mid s)$. Proceeding inductively, we may interpret the probabilities as they relate to the text corpus and language.
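The kernel view above can be made concrete with a toy example: a "policy" maps a token prefix to a categorical distribution over the next token, and generation is the induced Markov chain started at the prompt. The vocabulary and kernel here are entirely made up for illustration; a real LLM computes the kernel with a neural network.

```python
import random

VOCAB = ["a", "b", "<eos>"]

def kernel(prefix):
    """Hypothetical next-token distribution pi(t | prefix)."""
    if len(prefix) >= 5:
        # Force termination on long prefixes so generation always halts.
        return {"a": 0.0, "b": 0.0, "<eos>": 1.0}
    return {"a": 0.5, "b": 0.4, "<eos>": 0.1}

def generate(prompt, rng):
    """Run the Markov chain started at the prompt until <eos> is sampled."""
    seq = list(prompt)
    while seq[-1] != "<eos>":
        probs = kernel(seq)
        seq.append(rng.choices(VOCAB, weights=[probs[t] for t in VOCAB])[0])
    return seq
```

Note that $\pi_\theta(y \mid x)$ for the whole response is then the product of the per-step probabilities along the sampled path, exactly as in the extension of the kernel above.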
We shall not digress further, but I will assume this measure $\mu$ on $\mathcal{V}^{\le N}$ exists, and in quotation marks I will write "$(x, y) \sim \pi_\theta$" to mean a measure from which we can sample by having the prompt $x$ sampled from the measure $\mu$ and then the response $y$ sampled according to the kernel $\pi_\theta(\cdot \mid x)$. In particular, when I write $\mathbb{E}_{(x,y) \sim \pi_\theta}$ I mean the same as the common ML notation $\mathbb{E}_{x \sim \mu,\, y \sim \pi_\theta(\cdot \mid x)}$.
A reward function is a map $r : \mathcal{V}^* \times \mathcal{V}^* \to \mathbb{R}$ that scores a prompt-response pair $(x, y)$. We will assume, without much justification, that $r$ satisfies whatever measurability and integrability conditions are needed below. The RL objective is maximizing expected reward with respect to $\theta$: $$J(\theta) = \mathbb{E}_{(x,y) \sim \pi_\theta}\big[r(x, y)\big].$$

We glossed over one subtlety: the measure $\mu$ describes the distribution of prompts in general within a language, while in domain-specific RL such as coding, the distribution of coding prompts is not the same as that given by the measure $\mu$ (indeed a coding dataset is not the same as a natural-language dataset). Thus when we do RL to improve LLM coding, we are selecting a part of the general dataset: we choose prompts $x_1, \dots, x_n$, sample rollouts $y_i$ with the kernel $\pi_\theta(\cdot \mid x_i)$ for each $i$, and compute the sample mean $$\hat{J}(\theta) = \frac{1}{n} \sum_{i=1}^n r(x_i, y_i).$$

In order to maximize $J$ we take the gradient $$\nabla_\theta J(\theta) = \mathbb{E}_{(x,y) \sim \pi_\theta}\big[r(x, y)\, \nabla_\theta \log \pi_\theta(y \mid x)\big],$$ where $$\nabla_\theta \log \pi_\theta(y \mid x) = \sum_{k=1}^{|y|} \nabla_\theta \log \pi_\theta\big(y^k \mid x\, y^{<k}\big).$$ The equivalent problem is to minimize the loss $$\mathcal{L}(\theta) = -\frac{1}{n} \sum_{i=1}^n r(x_i, y_i) \log \pi_\theta(y_i \mid x_i).$$

In practice one reduces the variance of the sample gradient in the following way: let $b(x)$, which is called the baseline, be a function of the prompt alone, and put $$A(x, y) = r(x, y) - b(x),$$ which for whatever reason is called the advantage. In actor-critic RL such as proximal policy optimization, the baseline is parameterized by a neural network called the value function. Value functions need to be trained and need to be evaluated through inference. This is why critic-free methods like group relative policy optimization use an advantage computed by standardizing the reward of a group of rollouts by its mean and standard deviation.
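The critic-free advantage described above amounts to a one-liner: sample a group of rollouts for the same prompt, then standardize their rewards by the group mean and standard deviation. A minimal sketch (the function name and epsilon are my own choices):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize a group's rewards by its own statistics.

    rewards: rewards r(x, y_1), ..., r(x, y_G) of G rollouts for one prompt x.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)  # no learned value function needed
```

By construction the advantages within a group sum to zero, so rollouts better than the group average push their tokens' probabilities up and worse-than-average rollouts push them down.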
A subtle mismatch arises when applying the above RL method in practice: the trainer policy differs from the generator policy, for reasons such as variation in GPU kernel implementations, or asynchrony between rollout and parameter refit. Let us distinguish the generator policy $\pi_{\mathrm{gen}}$ from the trainer policy $\pi_{\mathrm{train}}$. The standard practice is to introduce importance weights $$w(x, y) = \frac{\pi_{\mathrm{train}}(y \mid x)}{\pi_{\mathrm{gen}}(y \mid x)},$$ so that we can write $$\mathbb{E}_{(x,y) \sim \pi_{\mathrm{train}}}\big[r(x, y)\big] = \mathbb{E}_{(x,y) \sim \pi_{\mathrm{gen}}}\big[w(x, y)\, r(x, y)\big].$$ Given a batch of prompts $x_1, \dots, x_n$ and rollouts $y_1, \dots, y_n$ with importance weights $w_i = \pi_{\mathrm{train}}(y_i \mid x_i) / \pi_{\mathrm{gen}}(y_i \mid x_i)$, we can compute the effective sample size $$\mathrm{ESS} = \frac{\big(\sum_{i=1}^n w_i\big)^2}{\sum_{i=1}^n w_i^2} \in [1, n].$$ The lower bound follows from the multinomial identity $\big(\sum_i w_i\big)^2 = \sum_i w_i^2 + 2\sum_{i<j} w_i w_j \ge \sum_i w_i^2$, and the upper bound from the Cauchy–Schwarz inequality; ESS is near $1$ when the importance weights fluctuate wildly, while ESS is near $n$ when the weights have low standard deviation. A small effective sample size translates to large variance of the per-sample gradient weighted by $w_i$.
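The effective sample size is cheap to monitor during training. A minimal sketch of the formula above (the function name is mine):

```python
import numpy as np

def effective_sample_size(weights):
    """ESS = (sum w)^2 / sum w^2, which lies in [1, n] for nonnegative weights."""
    w = np.asarray(weights, dtype=float)
    return w.sum() ** 2 / (w ** 2).sum()
```

Uniform weights give the full batch size, while a batch dominated by one large weight collapses toward an effective size of one, signaling that the importance-weighted gradient estimate has become unreliable.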
Xue J. Zhao © 2026