Affinity Event
Reasoning-Driven Jury System for LLM Evaluation
Ayda Sultan
Workshop

CausalBench: A Comprehensive Benchmark for Evaluating Causal Reasoning Capabilities of Large Language Models
Zeyu Wang
Workshop

Development and bilingual evaluation of Japanese medical large language model within reasonably low computational resources
Issey Sukeda
Workshop

MISR: Measuring Instrumental Self-Reasoning in Frontier Models
Kai Fronsdal · David Lindner
Poster · Thu 16:30

DARG: Dynamic Evaluation of Large Language Models via Adaptive Reasoning Graph
Zhehao Zhang · Jiaao Chen · Diyi Yang
Workshop

Not All LLM Reasoners Are Created Equal
Arian Hosseini · Alessandro Sordoni · Daniel Toyama · Aaron Courville · Rishabh Agarwal
Workshop

MMLU-Pro+: Evaluating Higher-Order Reasoning and Shortcut Learning in LLMs
Saeid Asgari · Aliasghar Khani · Amir Khasahmadi
Workshop

GameBench: Evaluating Strategic Reasoning Abilities of LLM Agents
Anthony Costarelli · Mat Allen · Roman Hauksson · Grace Sodunke · Suhas Hariharan · Carlson Cheng · Wenjie Li · Joshua Clymer · Arjun Yadav
Workshop

RefactorBench: Evaluating Stateful Reasoning In Language Agents Through Code
Dhruv Gautam · Spandan Garg · Jinu Jang · Neel Sundaresan · Roshanak Zilouchian Moghaddam