4 Results
Workshop | Language Agents as Hackers: Evaluating Cybersecurity Skills with Capture the Flag | John Yang · Akshara Prabhakar · Shunyu Yao · Kexin Pei · Karthik Narasimhan
Workshop | Sat 8:25 | Language Agents as Hackers: Evaluating Cybersecurity Skills with Capture the Flag | John Yang · Akshara Prabhakar · Shunyu Yao · Kexin Pei · Karthik Narasimhan
Workshop | Skill-Mix: A Flexible and Expandable Family of Evaluations for AI Models | Dingli Yu · Simran Kaur · Arushi Gupta · Jonah Brown-Cohen · Anirudh Goyal · Sanjeev Arora
Workshop | FLASK: Fine-grained Language Model Evaluation based on Alignment Skill Sets | Seonghyeon Ye · Doyoung Kim · Sungdong Kim · Hyeonbin Hwang · Seungone Kim · Yongrae Jo · James Thorne · Juho Kim · Minjoon Seo