The Optimal Token Baseline: Variance Reduction for Long-Horizon LLM-RL

Yingru Li*, Jiawei Xu*, Ziniu Li*, Jiacai Liu, Wei Liu, Yuxuan Tong, Longtao Zheng, Zhenghai Xue, Yaxiang Zhang, Tianle Cai, Ge Zhang, Qian Liu, Baoxiang Wang.