Trust Region Masking for Long-Horizon LLM Reinforcement Learning
Published:
Yingru Li*, Jiacai Liu*, Jiawei Xu*, Yuxuan Tong, Ziniu Li, Qian Liu, Baoxiang Wang.