16 Results
Poster | Thu 16:30 | Value Imprint: A Technique for Auditing the Human Values Embedded in RLHF Datasets Ike Obi · Rohan Pant · Srishti Shekhar Agrawal · Maham Ghazanfar · Aaron Basiletti |
Poster | Wed 16:30 | Catastrophic Goodhart: regularizing RLHF with KL divergence does not mitigate heavy-tailed reward misspecification Thomas Kwa · Drake Thomas · Adrià Garriga-Alonso |
Poster | Thu 16:30 | InfoRM: Mitigating Reward Hacking in RLHF via Information-Theoretic Reward Modeling Yuchun Miao · Sen Zhang · Liang Ding · Rong Bao · Lefei Zhang · Dacheng Tao |
Poster | Fri 11:00 | Group Robust Preference Optimization in Reward-free RLHF Shyam Sundhar Ramesh · Yifan Hu · Iason Chaimalas · Viraj Mehta · Pier Giuseppe Sessa · Haitham Bou Ammar · Ilija Bogunovic |
Poster | Fri 16:30 | Provably Mitigating Overoptimization in RLHF: Your SFT Loss is Implicitly an Adversarial Regularizer Zhihan Liu · Miao Lu · Shenao Zhang · Boyi Liu · Hongyi Guo · Yingxiang Yang · Jose Blanchet · Zhaoran Wang |
Workshop | | Aligning to What? Limits to RLHF Based Alignment Logan Barnhart · Reza Akbarian Bafghi · Maziar Raissi · Stephen Becker |
Workshop | Sat 16:15 | Hannah Rose Kirk: Putting the H Back in RLHF: Challenging assumptions of human behaviour for AI alignment |
Workshop | | RLHS: Mitigating Misalignment in RLHF with Hindsight Simulation Kaiqu Liang · Haimin Hu · Ryan Liu · Tom Griffiths · Jaime Fisac |
Workshop | | Faster, More Efficient RLHF through Off-Policy Asynchronous Learning Michael Noukhovitch · Shengyi Huang · Sophie Xhonneux · Arian Hosseini · Rishabh Agarwal · Aaron Courville |
Workshop | | Towards Data-Centric RLHF: Simple Metrics for Preference Dataset Comparison Judy Hanwen Shen · Archit Sharma · Jun Qin |
Workshop | | Model Soup for Better RLHF: Weight Space Averaging to Improve Alignment in LLMs Atoosa Chegini · Hamid Kazemi · Iman Mirzadeh · Dong Yin · Maxwell Horton · Moin Nabi · Mehrdad Farajtabar · Keivan Alizadeh vahid |
Workshop | | Estimating Effects of Tokens in Preference Learning Hsiao-Ru Pan · Maximilian Mordig · Bernhard Schölkopf |