

All 2024 Events: 29 results (page 1 of 3)
Affinity Event
Reasoning-Driven Jury System for LLM Evaluation
Ayda Sultan
Workshop
CausalBench: A Comprehensive Benchmark for Evaluating Causal Reasoning Capabilities of Large Language Models
Zeyu Wang
Workshop
Development and bilingual evaluation of Japanese medical large language model within reasonably low computational resources
Issey Sukeda
Workshop
MISR: Measuring Instrumental Self-Reasoning in Frontier Models
Kai Fronsdal · David Lindner
Poster
Thu 16:30
DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph
Zhehao Zhang · Jiaao Chen · Diyi Yang
Workshop
Not All LLM Reasoners Are Created Equal
Arian Hosseini · Alessandro Sordoni · Daniel Toyama · Aaron Courville · Rishabh Agarwal
Workshop
MMLU-Pro+: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs
Saeid Asgari · Aliasghar Khani · Amir Khasahmadi
Workshop
GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents
Anthony Costarelli · Mat Allen · Roman Hauksson · Grace Sodunke · Suhas Hariharan · Carlson Cheng · Wenjie Li · Joshua Clymer · Arjun Yadav
Workshop
RefactorBench: Evaluating Stateful Reasoning In Language Agents Through Code
Dhruv Gautam · Spandan Garg · Jinu Jang · Neel Sundaresan · Roshanak Zilouchian Moghaddam