On the current landscape of language model reward modeling for alignment

Abstract

Reinforcement Learning with Human Feedback (RLHF) has become a central approach for aligning large language models (LLMs) with human preferences. Despite its success, a significant gap remains between human judgment and LLM behavior, particularly in automated evaluation settings and in capturing the multi-dimensional nature of human preferences. Furthermore, pinpointing where alignment methods fail to accurately reflect the true distribution of human preferences remains a challenge, largely due to the sparse and coarse nature of the scalar reward signals typically employed in RLHF. This thesis investigates several dimensions of the alignment problem to bridge the behavioral gap between human judgments and LLM outputs. Specifically, we examine (i) the limitations of AI-based alignment approaches that use LLMs as evaluators, (ii) dynamic alignment strategies that incorporate multiple aspects of human judgments, and (iii) dense alignment techniques that reshape token-level reward distributions via Bayesian optimization. Through rigorous stress testing of LLM-based evaluations and the development of novel alignment methods, this work advances the field by enabling more nuanced and robust modeling of human preferences.

Description

University of Minnesota M.S. thesis. May 2025. Major: Computer Science. Advisor: Dongyeop Kang. 1 computer file (PDF); ii, 50 pages.

Suggested Citation

Koo, Ryan. (2025). On the current landscape of language model reward modeling for alignment. Retrieved from the University Digital Conservancy, https://hdl.handle.net/11299/277317.
