PROJECT OBJECTIVE
Accelerating RL Training via Speculative Decoding
Reinforcement Learning from Human Feedback (RLHF) has become the standard approach to LLM alignment. However, the generation phase remains a critical bottleneck, consuming up to 80% of total training time, because autoregressive decoding produces only one token per forward pass.
Speculative decoding addresses this by using a lightweight draft model to propose multiple candidate tokens, which the target model then verifies in parallel in a single forward pass. This project integrates EAGLE3 speculative decoding into the VeRL training framework to accelerate the generation phase.
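The draft-then-verify step described above can be illustrated with a minimal sketch. It assumes greedy decoding and a single linear chain of draft tokens, which is a simplification: EAGLE3 itself drafts a token tree and supports sampling-based acceptance. The function name `verify_greedy` and its inputs are hypothetical and are not part of VeRL or EAGLE3.

```python
from typing import List


def verify_greedy(draft_tokens: List[int], target_tokens: List[int]) -> List[int]:
    """Return the tokens emitted by one draft-then-verify step (greedy case).

    target_tokens[i] is the target model's greedy choice at position i,
    obtained from ONE forward pass over the prefix plus all draft tokens,
    so it holds exactly one more entry than draft_tokens.
    """
    if len(target_tokens) != len(draft_tokens) + 1:
        raise ValueError("target_tokens must have len(draft_tokens) + 1 entries")

    accepted: List[int] = []
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            break  # first disagreement: discard this and all later draft tokens
        accepted.append(draft)

    # The target's own token at the first mismatch (or the bonus token after a
    # fully accepted draft) is always emitted, so each step yields >= 1 token.
    accepted.append(target_tokens[len(accepted)])
    return accepted
```

Under these assumptions the output is identical to what the target model would produce on its own, and the number of tokens emitted per step corresponds to the acceptance length reported in the results.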
However, two challenges arise in a training context that do not exist in static inference: the effective batch size fluctuates dynamically as sequences complete at different rates, and the target model’s continuous updates progressively degrade draft model alignment. The system addresses both through the two contributions below.
Self-Adaptive Server
A runtime controller that dynamically adjusts or disables EAGLE speculative decoding based on the current batch size, addressing the long-tail effect in RL training.
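A batch-size-driven policy of this kind could look like the sketch below. The tier thresholds, the `SpecConfig` type, and the function `choose_spec_config` are illustrative assumptions, not the project's actual controller or values. The premise is that speculation pays off most when the batch is small, as in the long tail of a rollout, and is switched off when the batch is large enough to keep the GPU busy.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class SpecConfig:
    enabled: bool
    num_draft_tokens: int


# Hypothetical (max_batch_size, num_draft_tokens) tiers, sorted by batch size.
# Real thresholds would have to be tuned per model and hardware.
DEFAULT_TIERS: Tuple[Tuple[int, int], ...] = ((8, 4), (32, 2))


def choose_spec_config(
    batch_size: int, tiers: Sequence[Tuple[int, int]] = DEFAULT_TIERS
) -> SpecConfig:
    """Pick a speculation setting for the current number of running sequences."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for max_batch, num_draft in tiers:
        if batch_size <= max_batch:
            return SpecConfig(enabled=True, num_draft_tokens=num_draft)
    # Batch is larger than every tier: disable speculative decoding.
    return SpecConfig(enabled=False, num_draft_tokens=0)
```

Calling `choose_spec_config` before each decoding step, with the count of still-running sequences, would let the setting follow the batch as sequences complete.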
Draft Model Update
A training mechanism using reward-weighted knowledge distillation with Jensen–Shannon Divergence loss to keep the draft model aligned with the evolving target model.
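As a rough illustration of the loss, the sketch below computes the Jensen–Shannon divergence between target and draft next-token distributions and averages it with per-sample reward weights. The weighting scheme (non-negative rewards, normalized by their sum) and all function names are assumptions made for this example; the source does not specify how rewards enter the loss. Plain Python lists stand in for the tensors a real training loop would use.

```python
import math
from typing import Sequence


def kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def jsd(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: 0.5*KL(p||m) + 0.5*KL(q||m), m = (p+q)/2."""
    m = [0.5 * (pi + qi) for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def reward_weighted_jsd(
    target_probs: Sequence[Sequence[float]],
    draft_probs: Sequence[Sequence[float]],
    rewards: Sequence[float],
) -> float:
    """Reward-weighted mean of per-sample JSD (assumed weighting scheme)."""
    if any(r < 0.0 for r in rewards):
        raise ValueError("this sketch assumes non-negative rewards")
    total = sum(rewards)
    if total <= 0.0:
        raise ValueError("at least one reward must be positive")
    weighted = sum(
        r * jsd(p, q) for p, q, r in zip(target_probs, draft_probs, rewards)
    )
    return weighted / total
```

Unlike KL divergence, JSD is symmetric and bounded above by ln 2, which keeps the loss finite even when the draft assigns near-zero probability to a token the target prefers.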
KEY RESULTS
21.8%
Training time reduction with self-adaptive speculative decoding (DP=1, TP=2)
2.537
Acceptance length with JSD-based draft updates, outperforming the KL (2.478) and no-update (2.525) baselines
~0.85
Validation score across all strategies, confirming lossless acceleration