On the current landscape of language model reward modeling for alignment
Authors
Koo, Ryan
Published Date
May 2025
Abstract
Reinforcement Learning from Human Feedback (RLHF) has become a central approach for aligning large language models (LLMs) with human preferences. Despite its success, a significant gap remains between human judgment and LLM behavior, particularly in automated evaluation settings and in capturing the multi-dimensional nature of human preferences. Furthermore, pinpointing where alignment methods fail to reflect the true distribution of human preferences remains challenging, largely because of the sparse, coarse scalar reward signals typically employed in RLHF. This thesis investigates several dimensions of the alignment problem to bridge the behavioral gap between human judgments and LLM outputs. Specifically, we examine (i) the limitations of AI-based alignment approaches that use LLMs as evaluators, (ii) dynamic alignment strategies that incorporate multiple aspects of human judgment, and (iii) dense alignment techniques that reshape token-level reward distributions via Bayesian optimization. Through rigorous stress testing of LLM-based evaluations and the development of novel alignment methods, this work advances the field by enabling more nuanced and robust modeling of human preferences.
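To make the phrase "reshape token-level reward distributions" concrete: the idea is to turn a single sequence-level scalar reward into per-token credit. The Python sketch below illustrates one simple way this could look; the function name and the uncertainty-based weighting are illustrative assumptions for exposition, not the thesis's Bayesian-optimization method.

    import torch

    def redistribute_reward(token_logprobs: torch.Tensor, seq_reward: float) -> torch.Tensor:
        # Spread one scalar sequence reward across tokens, giving
        # low-confidence (low log-probability) tokens a larger share.
        # Illustrative heuristic only; the actual redistribution scheme
        # in the thesis is tuned via Bayesian optimization.
        weights = torch.softmax(-token_logprobs, dim=-1)  # nonnegative, sums to 1
        return seq_reward * weights  # dense per-token rewards summing to seq_reward

    # A 4-token response that a reward model scored 1.0:
    logprobs = torch.tensor([-0.1, -2.3, -0.5, -1.2])
    dense = redistribute_reward(logprobs, 1.0)
    print(dense, dense.sum())  # per-token credit; total equals the scalar reward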
Description
University of Minnesota M.S. thesis. May 2025. Major: Computer Science. Advisor: Dongyeop Kang. 1 computer file (PDF); ii, 50 pages.
Suggested Citation
Koo, Ryan. (2025). On the current landscape of language model reward modeling for alignment. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/277317.
