
Workshop: Generative AI for Education (GAIED): Advances, Opportunities, and Challenges

Paper 44: Evaluating ChatGPT-generated Textbook Questions using IRT

Shreya Bhandari · Yunting Liu · Zachary Pardos

Keywords: [ Psychometric ] [ Generative AI ] [ Measurement ] [ ChatGPT ] [ education ] [ Large language models ] [ question generation ] [ Linking ] [ IRT ] [ Algebra ]


We aim to test the ability of ChatGPT to generate educational assessment questions, given only a summary of textbook content. We take a psychometric measurement approach to comparing the quality of questions, or items, generated by ChatGPT with gold-standard questions from a published textbook. We use Item Response Theory (IRT) to analyze data from 207 test respondents answering questions from OpenStax College Algebra. Using a common item linking design, we find that ChatGPT items fared as well as or better than textbook items: they distinguished better among respondents in the moderate ability group and had higher discriminating power than OpenStax items (1.92 discrimination for ChatGPT vs 1.54 discrimination for OpenStax).
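
To make the discrimination figures concrete, the sketch below evaluates a two-parameter logistic (2PL) item response function at the two reported values. This is an illustration only: the abstract does not state which IRT model was fitted, so the 2PL form, the difficulty b = 0, and the names `p_correct` and `item_information` are assumptions, and the two reported values are treated here as if each were the discrimination of a single item.

```python
import math

# Reported discrimination values from the abstract.
A_CHATGPT = 1.92
A_OPENSTAX = 1.54


def p_correct(theta, a, b=0.0):
    """2PL probability of a correct response (assumed model form).

    theta: respondent ability, a: discrimination, b: difficulty
    (b = 0 is a hypothetical value chosen for illustration).
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b=0.0):
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


if __name__ == "__main__":
    # Compare the two hypothetical items across a range of abilities.
    for theta in (-1.0, 0.0, 1.0):
        print(
            f"theta={theta:+.1f}  "
            f"P(ChatGPT)={p_correct(theta, A_CHATGPT):.3f}  "
            f"P(OpenStax)={p_correct(theta, A_OPENSTAX):.3f}  "
            f"I(ChatGPT)={item_information(theta, A_CHATGPT):.3f}  "
            f"I(OpenStax)={item_information(theta, A_OPENSTAX):.3f}"
        )
```

Under these assumptions, information at the item's difficulty equals a²/4, so the higher discrimination yields about 0.92 versus 0.59 for a respondent whose ability matches the item. This is one way to read "better ability to distinguish" near the middle of the ability range.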
