Search All 2024 Events
 

16 Results

Poster
Thu 16:30 Value Imprint: A Technique for Auditing the Human Values Embedded in RLHF Datasets
Ike Obi · Rohan Pant · Srishti Shekhar Agrawal · Maham Ghazanfar · Aaron Basiletti
Poster
Wed 16:30 Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification
Thomas Kwa · Drake Thomas · Adrià Garriga-Alonso
Poster
Thu 16:30 InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling
Yuchun Miao · Sen Zhang · Liang Ding · Rong Bao · Lefei Zhang · Dacheng Tao
Poster
Fri 11:00 Group Robust Preference Optimization in Reward-free RLHF
Shyam Sundhar Ramesh · Yifan Hu · Iason Chaimalas · Viraj Mehta · Pier Giuseppe Sessa · Haitham Bou Ammar · Ilija Bogunovic
Poster
Fri 16:30 Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer
Zhihan Liu · Miao Lu · Shenao Zhang · Boyi Liu · Hongyi Guo · Yingxiang Yang · Jose Blanchet · Zhaoran Wang
Workshop
Aligning to What? Limits to RLHF Based Alignment
Logan Barnhart · Reza Akbarian Bafghi · Maziar Raissi · Stephen Becker
Workshop
Sat 16:15 Hannah Rose Kirk: Putting the H Back in RLHF: Challenging assumptions of human behaviour for AI alignment
Workshop
RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation
Kaiqu Liang · Haimin Hu · Ryan Liu · Tom Griffiths · Jaime Fisac
Workshop
Faster, More Efficient RLHF through Off-Policy Asynchronous Learning
Michael Noukhovitch · Shengyi Huang · Sophie Xhonneux · Arian Hosseini · Rishabh Agarwal · Aaron Courville
Workshop
Towards Data-Centric RLHF: Simple Metrics for Preference Dataset Comparison
Judy Hanwen Shen · Archit Sharma · Jun Qin
Workshop
Model Soup for Better RLHF: Weight Space Averaging to Improve Alignment in LLMs
Atoosa Chegini · Hamid Kazemi · Iman Mirzadeh · Dong Yin · Maxwell Horton · Moin Nabi · Mehrdad Farajtabar · Keivan Alizadeh Vahid
Workshop
Estimating Effects of Tokens in Preference Learning
Hsiao-Ru Pan · Maximilian Mordig · Bernhard Schölkopf