24 Results
Workshop | Linear Probe Penalties Reduce LLM Sycophancy | Henry Papadatos · Rachel Freedman
Workshop | AEGIS2.0: A Diverse AI Safety Dataset and Risks Taxonomy for Alignment of LLM Guardrails | Shaona Ghosh · Prasoon Varshney · Makesh Narsimhan Sreedhar · Aishwarya Padmakumar · Traian Rebedea · Jibin Varghese · Christopher Parisien
Workshop | Superficial Alignment, Subtle Divergence, and Nudge Sensitivity in LLM Decision-Making | Manuel Cherep · Nikhil Singh · Patricia Maes
Workshop | LLM Alignment Using Soft Prompt Tuning: The Case of Cultural Alignment | Reem Masoud · Martin Ferianc · Philip Treleaven · Miguel Rodrigues
Workshop | LLM Alignment Through Successive Policy Re-weighting (SPR) | Xinnan Zhang · Siliang Zeng · Jiaxiang Li · Kaixiang Lin · Mingyi Hong
Poster | Thu 16:30 | ReMoDetect: Reward Models Recognize Aligned LLM's Generations | Hyunseok Lee · Jihoon Tack · Jinwoo Shin
Poster | Thu 16:30 | Aligning LLM Agents by Learning Latent Preference from User Edits | Ge Gao · Alexey Taymanov · Eduardo Salinas · Paul Mineiro · Dipendra Misra
Workshop | Moral Persuasion in Large Language Models: Evaluating Susceptibility and Ethical Alignment | Allison Huang · Carlos Mougan · Yulu Pi
Workshop | Declarative characterizations of direct preference alignment algorithms | Kyle Richardson · Vivek Srikumar · Ashish Sabharwal
Workshop | AdvBDGen: Adversarially Fortified Prompt-Specific Fuzzy Backdoor Generator Against LLM Alignment | Pankayaraj Pathmanathan · Udari Sehwag · Michael-Andrei Panaitescu-Liess · Furong Huang
Workshop | Sat 12:00 | A Statistical Approach to Quantifying LLM Human Alignment | Harbin Hong · Liu Leqi · Sebastian Caldas
Poster | Thu 11:00 | Transfer Q-star: Principled Decoding for LLM Alignment | Souradip Chakraborty · Soumya Suvra Ghosal · Ming Yin · Dinesh Manocha · Mengdi Wang · Amrit Singh Bedi · Furong Huang