49 Results
Type | Time | Title | Authors
Workshop | | Aligning to What? Limits to RLHF Based Alignment | Logan Barnhart · Reza Akbarian Bafghi · Maziar Raissi · Stephen Becker
Poster | Wed 16:30 | Uncovering Safety Risks of Large Language Models through Concept Activation Vector | Zhihao Xu · Ruixuan HUANG · Changyu Chen · Xiting Wang
Poster | Wed 16:30 | MLLMGuard: A Multi-dimensional Safety Evaluation Suite for Multimodal Large Language Models | Tianle Gu · Zeyang Zhou · Kexin Huang · Liang Dandan · Yixu Wang · Haiquan Zhao · Yuanqi Yao · Xingge Qiao · Keqing Wang · Yujiu Yang · Yan Teng · Yu Qiao · Yingchun Wang
Workshop | | Language Models Resist Alignment | Jiaming Ji · Kaile Wang · Tianyi (Alex) Qiu · Boyuan Chen · Changye Li · Hantao Lou · Jiayi Zhou · Juntao Dai · Yaodong Yang
Workshop | | Steering Without Side Effects: Improving Post-Deployment Control of Language Models | Asa Cooper Stickland · Aleksandr Lyzhov · Jacob Pfau · Salsabila Mahdi · Samuel Bowman
Workshop | | Sandbag Detection through Model Impairment | Cameron Tice · Philipp Kreer · Nathan Helm-Burger · Prithviraj Singh Shahani · Fedor Ryzhenkov · Teun van der Weij · Felix Hofstätter · Jacob Haimes
Workshop | | GPAI Evaluations Standards Taskforce: towards effective AI governance | Patricia Paskov · Lukas Berglund · Everett Smith · Lisa Soder
Workshop | | Position: Addressing Ethical Challenges and Safety Risks in GenAI-Powered Brain-Computer Interfaces | Konstantinos Barmpas · Georgios Zoumpourlis · Yannis Panagakis · Dimitrios Adamos · N Laskaris · Stefanos Zafeiriou
Workshop | | LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet | Nathaniel Li · Ziwen Han · Ian Steneker · Willow Primack · Riley Goodside · Hugh Zhang · Zifan Wang · Cristina Menghini · Summer Yue
Workshop | Sun 11:05 | Contributed Talk 3: LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks Yet | Nathaniel Li · Ziwen Han · Ian Steneker · Willow Primack · Riley Goodside · Hugh Zhang · Zifan Wang · Cristina Menghini · Summer Yue
Workshop | | SkewAct: Red Teaming Large Language Models via Activation-Skewed Adversarial Prompt Optimization | Hanxi Guo · Siyuan Cheng · Guanhong Tao · Guangyu Shen · Zhuo Zhang · Shengwei An · Kaiyuan Zhang · Xiangyu Zhang
Poster | Wed 11:00 | Navigating the Safety Landscape: Measuring Risks in Finetuning Large Language Models | ShengYun Peng · Pin-Yu Chen · Matthew Hull · Duen Horng Chau