PEBBLE: A Pedagogical and SRL-Aware Benchmark for Evaluating LLM Tutors
Abstract
Large language models are increasingly used as tutors, yet most evaluations measure what models know rather than how they teach. We present PEBBLE, an initial, compact, plug-and-play benchmark for multi-turn tutoring that scores five process-level dimensions grounded in the learning sciences: scaffolding, diagnostic questioning, misconception repair, metacognitive support, and affective support. PEBBLE formalizes a weighted per-turn scoring functional with an explicit overhelping penalty, applied by an LLM-as-judge, and incorporates contamination controls via templated item generation and paraphrase-shift splits. We evaluate eight contemporary models across four STEM domains (30 seeds per domain; 240 simulated episodes per model) using simulated students in short, text-only dialogues; findings should be interpreted under these conditions. PEBBLE consistently surfaces deficits in diagnostic questioning and misconception repair despite near-ceiling scores on affective and metacognitive support, and it supports lifecycle analyses such as scaling and post-training comparisons. Our contributions are: (i) a formal, SRL-aware (self-regulated learning) rubric and scoring functional for multi-turn tutoring; (ii) a contamination-aware evaluation protocol with an LLM-as-judge; (iii) a cross-domain benchmark and open evaluation kit for reproducible lifecycle studies; and (iv) an empirical characterization of dimension-wise headroom that identifies diagnostic questioning and misconception repair as the primary levers for improving tutoring quality. Code, seeds, personas, judge prompts, and a leaderboard specification will be released upon acceptance.
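For concreteness, the following is a minimal sketch of one plausible form of the per-turn scoring functional summarized above; the dimension weights $w_d$, per-turn judge scores $s_d^{(t)}$, penalty weight $\lambda$, and overhelping term $O^{(t)}$ are illustrative names, not the notation defined in the paper:
\[
S(\text{episode}) \;=\; \frac{1}{T}\sum_{t=1}^{T}\Bigg(\sum_{d\in\mathcal{D}} w_d\, s_d^{(t)} \;-\; \lambda\, O^{(t)}\Bigg),
\qquad \sum_{d\in\mathcal{D}} w_d = 1,
\]
where $\mathcal{D}$ is the set of five process-level dimensions (scaffolding, diagnostic questioning, misconception repair, metacognitive support, affective support), $s_d^{(t)}\in[0,1]$ is the LLM-judge rating for dimension $d$ at tutor turn $t$, $O^{(t)}\ge 0$ flags overhelping at turn $t$ (e.g., revealing the full solution prematurely), and $T$ is the number of tutor turns in the episode.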