Workshop
Generative AI for Education (GAIED): Advances, Opportunities, and Challenges
Paul Denny · Sumit Gulwani · Neil Heffernan · Tanja Käser · Steven Moore · Anna Rafferty · Adish Singla
Room 265 - 268
GAIED (pronounced "guide") aims to bring together researchers, educators, and practitioners to explore the potential of generative AI for enhancing education.
Schedule

Fri 6:15 a.m. - 6:30 a.m. | Preparation by Organizers (Setup)

Fri 6:30 a.m. - 6:50 a.m. | Poster Setup by Authors (Setup)

Fri 6:50 a.m. - 7:00 a.m. | Opening Remarks (Remarks)

Fri 7:00 a.m. - 7:30 a.m. | Invited Talk 1: Personalized Writing Learning Assistant using Natural Language Processing: Opportunities and Challenges (Talk)
Large Language Models (LLMs) have shown great promise for building educational applications, improving learning experiences, and providing resources for both students and educators. According to the Pew Research Center, one in five U.S. adults said they had used ChatGPT for learning by August 2023, and the trend continues to grow. In this talk, I will discuss our ongoing efforts to leverage LLMs to build personalized argumentative writing assistants built around feedback provision. I will also present our recent work on building human-NLP collaborative tools to help educators create active learning opportunities at scale. Finally, I will highlight challenges associated with employing LLMs for educational support, as evidenced by our research findings. Specifically, I will discuss the issues and impacts of "hallucination" and outdated knowledge, as well as the lack of complex reasoning abilities and creativity.
Lu Wang

Fri 7:30 a.m. - 8:00 a.m. | Invited Talk 2: Generating and Revealing Structured Variation to Help Humans Learn [Programming] (Talk)
AI-assisted programming tools are rapidly being integrated into CS classrooms without a full understanding of how they affect students' learning. Recent results from studying human decision-making with and without AI assistance have shown that cognitive engagement is critical for learning to occur during these interactions, and that many learners may not cognitively engage enough with the AI assistance before applying it. Prior work on cognitive forcing functions has made cognitive engagement more necessary, for example by withholding information, to achieve a greater learning effect. We instead attempt to cultivate cognitive engagement through the use of aligned, or at least alignable, differences, based on the design implications of various theories of human concept learning and related psychological studies. While we have explored this approach most thoroughly in the context of programming, it may generalize to many other learning tasks.
Elena Glassman

Fri 8:00 a.m. - 8:30 a.m. | Coffee Break (Break)

Fri 8:30 a.m. - 9:00 a.m. | Invited Talk 3: Learning Programming in the Era of Generative AI (Talk)
The rise of generative AI and Large Language Model-based tools has a significant impact on both the field of computing and computing education. Computer scientists, educators, and practitioners are having heated discussions about the changes in the nature of programming. One consequence might be that we need to teach programming at different levels of abstraction, requiring different skills. At the same time, there is an opportunity to use LLMs to support students by generating formative feedback, explanations, and hints for programming tasks. However, LLM-based support often cannot be used directly, and whether it is effective depends very much on the learning goals and context. In this talk I will address how generative AI might change how programming is learned, why we should clearly distinguish between the different circumstances in which people learn to program, and how we can use and evaluate LLMs to provide support in these different cases.
Hieke Keuning

Fri 9:00 a.m. - 10:00 a.m. | Poster Session I (Posters)

Fri 10:00 a.m. - 11:30 a.m. | Lunch Break (Break)

Fri 11:30 a.m. - 12:00 p.m. | Invited Talk 4: Hopes and Fears: Should We Use AI in Schools and Will It Revolutionise How We Think About Education? (Talk)
Generative AI has been a highly polarising issue, with some welcoming the new possibilities and opportunities while others warn of the risks and dangers. In education, we observe a strong correlation between the hopes associated with the work of teachers and the fears pertaining to how students may utilise and interact with generative AI. Is it therefore as simple as ensuring that the use of GenAI be a teacher's prerogative? Should we include the 'correct' use of GenAI in our curricula? In the end, perhaps, the advent of powerful generative AI tools may foremost serve as a mirror to reflect our notions of what education is, what it should be, and where we are currently headed.
Tobias Kohn

Fri 12:00 p.m. - 1:00 p.m. | Poster Session II (Posters)

Fri 1:00 p.m. - 1:30 p.m. | Coffee Break (Break)

Fri 1:30 p.m. - 2:00 p.m. | Invited Talk 5: Generative AI for Joyful Education (Talk)
Chris Piech

Fri 2:00 p.m. - 2:30 p.m. | Invited Talk 6: Implementation of AI Tools in Education at Scale (Talk)
What did it take to launch a GPT-4-powered tutor on a learning site that millions of learners use? Is it safe? How do we know if it will help more learners learn more? I will discuss the development and research efforts behind Khanmigo, Khan Academy's tutor for students and assistant for teachers, including our choice of models, our approach to safety and security, and most importantly, our efforts to build learning science into the system.
Kristen DiCerbo

Fri 2:30 p.m. - 3:25 p.m. | Panel Discussion (Panel)
Kristen DiCerbo · Elena Glassman · Hieke Keuning · Tobias Kohn · Chris Piech · Lu Wang

Fri 3:25 p.m. - 3:30 p.m. | Closing Remarks (Remarks)

Paper 19: Exploring Student-ChatGPT Dialogue in EFL Writing Education (Poster)
The integration of generative AI in education is expanding, yet empirical analyses of large-scale, real-world interactions between students and AI systems remain limited. Addressing this gap, we present RECIPE4U (RECIPE for University), a dataset sourced from a semester-long experiment with 213 college students in English as a Foreign Language (EFL) writing courses. During the study, students engaged in dialogues with ChatGPT to revise their essays. RECIPE4U includes comprehensive records of these interactions, including conversation logs, students' intents, students' self-rated satisfaction, and students' essay edit histories. In particular, we annotate the students' utterances in RECIPE4U with 13 intention labels based on our coding schemes. We establish baseline results for two subtasks in task-oriented dialogue systems within educational contexts: intent detection and satisfaction estimation. As a foundational step, we explore student-ChatGPT interaction patterns through RECIPE4U and analyze them by focusing on students' dialogue and students' essay edits. We further illustrate potential applications of the RECIPE4U dataset for enhancing the incorporation of LLMs in educational frameworks. RECIPE4U is publicly available at https://github.com/zeunie/RECIPE4U/.
Jieun Han · Haneul Yoo · Junho Myung · Minsun Kim · Tak Yeon Lee · So-Yeon Ahn · Alice Oh

Paper 40: Retrieval-augmented Generation to Improve Math Question-Answering: Trade-offs Between Groundedness and Human Preference (Poster)
For middle-school math students, interactive question-answering (QA) with tutors is an effective way to learn. The flexibility and emergent capabilities of generative large language models (LLMs) have led to a surge of interest in automating portions of the tutoring process, including interactive QA to support conceptual discussion of mathematical concepts. However, LLM responses to math questions can be incorrect or mismatched to the educational context, such as being misaligned with a school's curriculum. One potential solution is retrieval-augmented generation (RAG), which involves incorporating a vetted external knowledge source in the LLM prompt to increase response quality. In this paper, we designed prompts that retrieve and use content from a high-quality open-source math textbook to generate responses to real student questions. We evaluate the efficacy of this RAG system for middle-school algebra and geometry QA by administering a multi-condition survey, finding that humans prefer responses generated using RAG, but not when responses are too grounded in the textbook content. We argue that while RAG is able to improve response quality, designers of math QA systems must consider trade-offs between generating responses preferred by students and responses closely matched to specific educational resources.
Zachary Levonian · Chenglu Li · Wangda Zhu · Anoushka Gade · Owen Henkel · Millie-Ellen Postle · Wanli Xing
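
The pipeline this abstract describes, retrieving vetted textbook passages and prepending them to the LLM prompt, follows the standard RAG pattern. The sketch below illustrates that pattern; it is not the authors' implementation. `call_llm` is a hypothetical stand-in for an LLM API, and the toy hashed bag-of-words embedding stands in for a real sentence-embedding model.

```python
import hashlib

import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    # Toy hashed bag-of-words embedding; a real system would use a
    # sentence-embedding model.
    v = np.zeros(dim)
    for word in text.lower().split():
        v[int(hashlib.md5(word.encode()).hexdigest(), 16) % dim] += 1.0
    norm = np.linalg.norm(v)
    return v / norm if norm else v

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for an LLM completion API.
    raise NotImplementedError

def rag_answer(question: str, textbook_chunks: list[str], k: int = 3) -> str:
    # Rank vetted textbook passages by similarity to the student question,
    # then ground the prompt in the top-k passages.
    q = embed(question)
    top = sorted(textbook_chunks, key=lambda c: -float(q @ embed(c)))[:k]
    prompt = (
        "You are a middle-school math tutor. Use the textbook excerpts below "
        "to answer the student's question.\n\n"
        "Excerpts:\n" + "\n\n".join(top) +
        f"\n\nStudent question: {question}\nAnswer:"
    )
    return call_llm(prompt)
```

The groundedness trade-off the paper reports would then correspond to how strongly the prompt instructs the model to stay within the retrieved excerpts.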

Paper 4: Beyond Hallucination: Building a Reliable Question Answering & Explanation System with GPTs (Poster)
Large language models such as GPT-4 have demonstrated performance comparable to humans on various academic assessments, including the Uniform Bar Exam and the LSAT. This opens up unprecedented opportunities for advancing online learning through generative AI. However, there are a number of challenges in using GPT models for educational use cases. For example, GPT models can generate incorrect information, and they do not provide custom academic references for their outputs. This paper discusses the design and implementation of a GPT-powered question answering/explanation system at Course Hero. We present A/B test results revealing a notable 40% increase in answering coverage compared to a retrieval-based question answering system. Moreover, we describe how augmenting our internal questions' answers with step-by-step explanations generated by GPTs led to a 75% lift in users' approval ratings. Lastly, we outline the design for a production-ready reference system, providing evidence for users to verify GPT responses. Through human evaluations, we show that we can achieve Precision = 84% and Recall = 69% when providing reference documents for GPT outputs.
Kazem Jahanbakhsh · Hajiabadi · Vipul Gagrani · Jennifer Louie · Saurabh Khanwalkar

Paper 9: Angel: A New Generation Tool for Learning Material based Questions and Answers (Poster)
Creating high-quality questions and answers for educational purposes continues to be a challenge for educators and publishers. Past attempts to address this through automatic generation have shown limited ability to generate questions targeting high cognitive levels, control question complexity and difficulty, or create adequate question-answer pairs. We take first steps toward addressing these limitations by introducing a new approach, named Angel, informed by recent developments in Large Language Models and Generative AI. Relying on advanced prompting techniques, automatic curation, and the incorporation of educational theory into prompts, Angel focuses on generating question-answer pairs of varied difficulty while targeting higher cognitive levels. Questions and answers are automatically generated from a textbook extract, with Bloom's Taxonomy serving as a guide to the creation of questions addressing a diverse set of learning objectives. Our experiments compare Angel to several baselines and demonstrate the potential of informed generative models to create high-quality question-answer pairs that cover a diverse range of cognitive skills.
Ariel Blobstein · Daniel Izmaylov · Tal Yifat · Michal Levy · Avi Segal

Paper 16: Diffusion Models in Dermatological Education: Flexible High Quality Image Generation for VR-based Clinical Simulations (Poster)
Training medical students to accurately recognize malignant melanoma is a crucial competence and part of almost all medical curricula. We present a pipeline to generate realistic high-resolution imagery of nevus and melanoma skin lesions using diffusion models. To ensure the required quality and flexibility, we introduce three novel guidance strategies and an adapted upsampling approach, which enable the generation of user-specified shapes and the integration of the lesions onto pre-defined skin textures. We evaluate our lesions qualitatively and quantitatively and integrate our results into a virtual reality (VR) simulation for clinical education. Moreover, we discuss several advantages of synthetic over real images, such as the ability to facilitate adjustable learning scenarios and the preservation of patient privacy, underlining the huge potential of generative image models for medical education.
Leon Pielage · Paul Schmidle · Bernhard Marschall · Benjamin Risse

Paper 46: Improving the Coverage of GPT for Automated Feedback on High School Programming Assignments (Poster)
Feedback on incorrect code is important for novice learners of programming. Automated Program Repair (APR) tools have previously been applied to generate feedback for mistakes made in introductory programming classes. Large Language Models (LLMs) have emerged as an attractive alternative for automatic feedback generation, since they have been shown to excel at generating both human-readable text and code. In this paper, we compare the effectiveness of LLMs and APR techniques for code repair and feedback generation in the context of high school Python programming assignments, evaluating both on a diverse dataset comprising 366 incorrect submissions for a set of 69 problems of varying complexity from a public high school. We show that LLMs are more effective at generating repairs than APR techniques, if provided with a good evaluation oracle. While state-of-the-art GPTs are able to generate feedback for buggy code most of the time, the direct invocation of such LLMs still suffers from some shortcomings. In particular, GPT-4 can fail to detect up to 11% of the bugs, gives invalid feedback around 7% of the time, and hallucinates about 4% of the time. We show that a new architecture that invokes GPT using a conversational interactive loop can improve the repair coverage of GPT-3.5T from 64.8% to 74.9%, on par with the performance of the state-of-the-art LLM GPT-4. Similarly, the coverage of GPT-4 can be further improved from 74.9% to 88.5% with the same methodology within 5 iterations.
Shubham Sahai · Umair Ahmed · Ben Leong
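
The conversational interactive loop that lifts repair coverage can be pictured as repair-test-refine iterations against an evaluation oracle (e.g., the assignment's test suite). A minimal sketch, assuming a hypothetical `call_llm` chat API and a `run_tests` oracle; the paper's actual architecture may differ.

```python
def call_llm(messages: list[dict]) -> str:
    # Hypothetical stand-in for a chat-model API returning candidate code.
    raise NotImplementedError

def run_tests(code: str, tests: list) -> list[str]:
    # Evaluation oracle: returns failure descriptions; empty list means pass.
    raise NotImplementedError

def iterative_repair(problem: str, buggy_code: str, tests: list, max_iters: int = 5):
    messages = [{
        "role": "user",
        "content": f"Problem:\n{problem}\n\nBuggy student code:\n{buggy_code}\n"
                   "Return a corrected version of the program.",
    }]
    for _ in range(max_iters):  # the paper reports gains within 5 iterations
        candidate = call_llm(messages)
        failures = run_tests(candidate, tests)
        if not failures:
            return candidate  # repair accepted by the oracle
        # Feed the oracle's verdict back so the next attempt is informed.
        messages.append({"role": "assistant", "content": candidate})
        messages.append({"role": "user",
                         "content": "That version still fails: " + "; ".join(failures)})
    return None  # coverage gap: no repair found within the budget
```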

Paper 8: Generative Agent for Teacher Training: Designing Educational Problem-Solving Simulations with Large Language Model-based Agents for Pre-Service Teachers (Poster)
Teacher training programs have often faced criticism for placing greater emphasis on theoretical knowledge at the expense of practical experience. This often results in novice teachers who have a strong theoretical foundation but lack practical expertise. To address this issue, this study proposes "Generative Agent Design for Teacher Training," an approach that uses a problem-solving simulation involving GPT-4-based agents for immersive teacher training. By integrating the GPT-4 model with the widely used gaming platform Roblox, we developed realistic educational scenarios that provide pre-service teachers with opportunities to navigate authentic teaching challenges within a controlled and safe environment. Preliminary findings, derived from interviews with three teachers who used the platform, suggest a positive response to the platform's usability. The results of this research indicate that integrating generative agents into teacher training simulations can be an effective way to offer pre-service teachers more practical experience applying theories and concepts to simulated teaching practice.
Unggi Lee · Sanghyeok Lee · Junbo Koh · Yeil Jeong · Haewon Jung · Gyuri Byun · Yunseo Lee · Jewoong Moon · Jieun Lim · Hyeoncheol Kim

Paper 26: The Behavior of Large Language Models When Prompted to Generate Code Explanations (Poster)
This paper systematically investigates the generation of code explanations by Large Language Models (LLMs) for code examples commonly encountered in introductory programming courses. Our findings reveal significant variations in the nature of code explanations produced by LLMs, influenced by factors such as the wording of the prompt, the specific code examples under consideration, the programming language involved, the temperature parameter, and the version of the LLM. However, a consistent pattern emerges for Java and Python, where explanations exhibit a Flesch-Kincaid readability level of approximately grade 7-8 and a consistent lexical density, i.e., the proportion of meaningful words relative to the total explanation size. Additionally, the generated explanations consistently achieve high scores for correctness, but lower scores on three other metrics: completeness, conciseness, and specificity.
Priti Oli · Rabin Banjade · Vasile Rus · Jeevan Chapagain
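
For reference, the Flesch-Kincaid grade level reported above is a simple function of sentence and word length. A self-contained approximation follows; the vowel-group syllable counter is a rough heuristic, not the exact tooling the authors used.

```python
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: each run of consecutive vowels counts as one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_kincaid_grade(text: str) -> float:
    # FK grade = 0.39 * (words/sentences) + 11.8 * (syllables/words) - 15.59
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    n_words = max(1, len(words))
    return 0.39 * (n_words / sentences) + 11.8 * (syllables / n_words) - 15.59

print(round(flesch_kincaid_grade(
    "The loop runs ten times. Each pass adds one to the counter."), 1))
```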

Paper 43: Large language model augmented exercise retrieval for personalized language learning (Poster)
We study the problem of zero-shot multilingual exercise retrieval in the context of online language learning, to give students the ability to explicitly request personalized exercises via natural language. Using real-world data collected from language learners, we observe that vector similarity approaches poorly capture the relationship between exercise content and the language learners use to express what they want to learn. This semantic gap between queries and content dramatically reduces the effectiveness of general-purpose retrieval models pretrained on large-scale information retrieval datasets. We leverage the generative capabilities of large language models to bridge the gap by synthesizing hypothetical exercises based on the user's input, which are then used to search for relevant exercises. Our approach, which we call mHyER, outperforms several strong baselines, such as Contriever, on a novel benchmark created from publicly available Tatoeba data.
Austin Xu · Klinton Bicknell · Will Monroe
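
The mechanism described, synthesizing hypothetical exercises from the learner's request and retrieving real exercises similar to them, can be sketched as below. `call_llm` and `embed` are hypothetical stand-ins and the prompt wording is illustrative; the actual mHyER prompts and models are in the paper.

```python
import numpy as np

def call_llm(prompt: str) -> list[str]:
    # Hypothetical stand-in: returns a few synthesized exercise texts.
    raise NotImplementedError

def embed(text: str) -> np.ndarray:
    # Hypothetical stand-in for a multilingual sentence-embedding model.
    raise NotImplementedError

def retrieve_exercises(request: str, exercise_bank: list[str], k: int = 10) -> list[str]:
    # 1) Bridge the query-content semantic gap: ask the LLM to write
    #    hypothetical exercises matching the learner's natural-language request.
    hypothetical = call_llm(
        f"Write three short language-learning exercises that practice: {request}"
    )
    # 2) Search with the hypothetical exercises instead of the raw query.
    query_vec = np.mean([embed(h) for h in hypothetical], axis=0)
    # 3) Rank the real exercise bank by similarity to that synthetic query.
    return sorted(exercise_bank, key=lambda ex: -float(query_vec @ embed(ex)))[:k]
```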

Paper 14: Code Soliloquies for Accurate Calculations in Large Language Models (Poster)
High-quality conversational datasets are crucial for the successful development of Intelligent Tutoring Systems (ITS) that utilize a Large Language Model (LLM) backend. Synthetic student-teacher dialogues, generated using advanced GPT-4 models, are a common strategy for creating these datasets. However, subjects like physics that entail complex calculations pose a challenge. While GPT-4 presents impressive language processing capabilities, its limitations in fundamental mathematical reasoning curtail its efficacy for such subjects. To tackle this limitation, we introduce an innovative stateful prompt design. Our design orchestrates a mock conversation in which both the student and tutorbot roles are simulated by GPT-4. Each student response triggers an internal monologue, or 'code soliloquy', in the GPT-tutorbot, which assesses whether its subsequent response would necessitate calculations. If a calculation is deemed necessary, it scripts the relevant Python code and uses the Python output to construct a response to the student. Our approach notably enhances the quality of synthetic conversation datasets, especially for subjects that are calculation-intensive. Our preliminary Subject Matter Expert evaluations reveal that our Higgs model, a fine-tuned LLaMA model, effectively uses Python for computations, which significantly enhances the accuracy and computational reliability of Higgs' responses. Code, models, and datasets are available at https://github.com/luffycodes/Tutorbot-Spock-Phys.
Shashank Sonkar · MyCo Le · Xinghe Chen · Naiming Liu · Debshila Basu Mallick · Richard Baraniuk · Johaun Hatchett
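
One turn of the stateful design can be sketched as follows: after each simulated student message, the tutorbot first holds an internal "code soliloquy" to decide whether a calculation is needed, and only then replies. This is a minimal sketch with hypothetical `call_llm` and `run_python_sandboxed` helpers, not the authors' prompts.

```python
def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a GPT-4 API call.
    raise NotImplementedError

def run_python_sandboxed(code: str) -> str:
    # Hypothetical sandboxed executor for model-written Python.
    raise NotImplementedError

def tutorbot_turn(dialogue: str, student_message: str) -> str:
    # Internal monologue: does answering require a calculation?
    decision = call_llm(
        f"{dialogue}\nStudent: {student_message}\n"
        "Before replying, decide: does a correct reply require a numeric "
        "calculation? Answer yes or no."
    )
    if decision.strip().lower().startswith("yes"):
        code = call_llm(
            f"{dialogue}\nStudent: {student_message}\n"
            "Write Python that performs the required calculation and prints it."
        )
        result = run_python_sandboxed(code)
        return call_llm(
            f"{dialogue}\nStudent: {student_message}\n"
            f"A verified calculation produced: {result}\n"
            "Reply to the student using this result."
        )
    return call_llm(f"{dialogue}\nStudent: {student_message}\nReply to the student.")
```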

Paper 35: Generative AI in the classroom: can students remain active learners? (Poster)
Generative Artificial Intelligence (GAI) can be seen as a double-edged sword in education. It may provide personalized, interactive, and empowering pedagogical sequences that favor students' intrinsic motivation and active engagement, and help them gain more control over their learning. At the same time, other GAI properties, such as the lack of uncertainty signalling even in cases of failure (particularly with Large Language Models (LLMs)), could lead to opposite effects: over-estimation of one's own competencies, passiveness, loss of curiosity and critical thinking, etc. These negative effects are due in particular to the lack of a pedagogical stance in these models' behaviors. Indeed, as opposed to standard pedagogical activities, GAI systems are often designed to answer users' inquiries easily and conveniently, without asking them to make an effort and without focusing on their learning process and/or outcomes. This article starts by outlining some of these opportunities and challenges surrounding the use of GAI in education, with a focus on the effects on students' active learning strategies and related metacognitive skills. We then present a framework for introducing pedagogical transparency in GAI-based educational applications. This framework comprises 1) training methods to include pedagogical principles in the models, 2) methods to ensure controlled and pedagogically relevant interactions when designing activities with GAI, and 3) educational methods enabling students to acquire the skills needed to properly benefit from the use of GAI in their learning activities (metacognitive skills, GAI literacy).
Rania Abdelghani · Hélène Sauzéon · Pierre-Yves Oudeyer

Paper 33: WordPlay: An Agent Framework for Authoring Playful Language Learning Puzzles (Poster)
We introduce a novel framework, WordPlay, for building language learning games. WordPlay combines playful mini-puzzle games with large language models and text-to-image models to address the challenge of balancing engagement and effective language practice. WordPlay allows content creators to quickly author bite-sized, personalized puzzles that cater to various proficiency levels, and uses generated images to aid comprehension and learning.
Suma Bailis · Lara McConnaughey · Jane Friedhoff · Feiyang Chen · Chase Adams · Jacob Moon

Paper 11: EHRTutor: Enhancing Patient Understanding of Discharge Instructions (Poster)
Large language models have shown success as tutors in various fields of education. Educating patients about their clinical visits plays a pivotal role in patients' adherence to their treatment plans post-discharge. This paper presents EHRTutor, an innovative multi-component framework leveraging a Large Language Model (LLM) for patient education through conversational question-answering. EHRTutor first formulates questions pertaining to the electronic health record discharge instructions. It then educates the patient through conversation by administering each question as a test. Finally, it generates a summary at the end of the conversation. Evaluation results using LLMs and domain experts have shown a clear preference for EHRTutor over the baseline. Moreover, EHRTutor also offers a framework for generating synthetic patient education dialogues that can be used for future in-house system training.
Zihao Zhang · Zonghai Yao · Huixue Zhou · Feiyun · Hong Yu

Paper 44: Evaluating ChatGPT-generated Textbook Questions using IRT (Poster)
We aim to test the ability of ChatGPT to generate educational assessment questions given solely a summarization of textbook content. We take a psychometric measurement approach to comparing the qualities of questions, or items, generated by ChatGPT versus gold-standard questions from a published textbook. We use Item Response Theory (IRT) to analyze data from 207 test respondents answering questions from OpenStax College Algebra. Using a common-item linking design, we find that ChatGPT items fared as well as or better than textbook items, showing a better ability to distinguish within the moderate-ability group and higher discriminating power than the OpenStax items (discrimination of 1.92 for ChatGPT vs. 1.54 for OpenStax).
Shreya Bhandari · Yunting Liu · Zachary Pardos
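
To make the discrimination numbers concrete: under the two-parameter logistic (2PL) IRT model, an item's discrimination a controls how steeply the probability of a correct answer rises with ability theta around the item difficulty b. A short worked example using the reported values; the difficulty is set to 0 here purely for illustration.

```python
import math

def p_correct(theta: float, a: float, b: float = 0.0) -> float:
    # 2PL model: P(correct | theta) = 1 / (1 + exp(-a * (theta - b)))
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# A higher discrimination separates nearby ability levels more sharply.
for a, source in [(1.92, "ChatGPT items"), (1.54, "OpenStax items")]:
    low, high = p_correct(-0.5, a), p_correct(0.5, a)
    print(f"{source}: P(correct) {low:.2f} -> {high:.2f} as theta goes -0.5 -> 0.5")
```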

Paper 27: Enhancing Chilean Adolescents' Writing Skills: Assisted Story Creation with LLMs (Poster)
This study presents an automatic story generation model in Chilean Spanish, designed to assist students in the writing process and help improve their writing skills. The methodology includes the creation of a corpus of stories in Spanish and Chilean Spanish, as well as data processing and the extraction of relevant information from the stories. The model is trained using fine-tuning and prompt-engineering techniques to adapt it to story generation. The results indicate that the stories generated by the model outperform those of other text generation models in terms of relevant natural language processing metrics.
Hernan Lira · Luis Martí · Nayat Sánchez-Pi

Paper 5: Benchmarking Educational Program Repair (Poster)
The emergence of large language models (LLMs) has sparked enormous interest due to their potential application across a range of educational tasks. For example, recent work in programming education has used LLMs to generate learning resources, improve error messages, and provide feedback on code. However, one factor that limits progress within the field is that much of the research uses bespoke datasets and different evaluation metrics, making direct comparisons between results unreliable. Thus, there is a pressing need for standardization and benchmarks that facilitate the equitable comparison of competing approaches. One task where LLMs show great promise is program repair, which can be used to provide debugging support and next-step hints to students. In this article, we propose a novel educational program repair benchmark. We curate two high-quality, publicly available programming datasets, present a unified evaluation procedure introducing a novel evaluation metric, rouge@k, for approximating the quality of repairs, and evaluate a set of five recent models to establish baseline performance.
Charles Koutcheme · Nicola Dainese · Sami Sarsa · Juho Leinonen · Arto Hellas · Paul Denny
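
One plausible reading of rouge@k, by analogy with pass@k, is the best ROUGE-L score among k sampled repairs measured against a reference repair; the precise definition is given in the paper. A self-contained sketch of that reading:

```python
def lcs_len(a: list[str], b: list[str]) -> int:
    # Longest common subsequence length via dynamic programming.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[-1][-1]

def rouge_l_f(candidate: str, reference: str) -> float:
    # ROUGE-L style F-score over whitespace tokens.
    c, r = candidate.split(), reference.split()
    lcs = lcs_len(c, r)
    if not lcs:
        return 0.0
    prec, rec = lcs / len(c), lcs / len(r)
    return 2 * prec * rec / (prec + rec)

def rouge_at_k(sampled_repairs: list[str], reference: str, k: int) -> float:
    # Best score among the first k samples -- the pass@k analogue.
    return max(rouge_l_f(s, reference) for s in sampled_repairs[:k])
```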

Paper 47: Detecting Educational Content in Online Videos by Combining Multimodal Cues (Poster)
The increasing trend of young children consuming online media underscores the need for data-driven tools that empower educators to identify suitable educational content for early learners. This paper introduces a method for identifying educational content within online videos. We focus on two widely used educational content classes, literacy and math, at two levels, Prekindergarten and Kindergarten. For each class and level, we choose prominent codes (sub-classes) based on the Common Core Standards; for example, literacy codes include...
Anirban Roy · Sujeong Kim · Claire Christensen · Madeline Cincebeaux

Paper 3: AuthentiGPT: Detecting Machine-Generated Text via Black-Box Language Models Denoising (Poster)
Large language models (LLMs) have opened up enormous opportunities while simultaneously posing ethical dilemmas. One of the major concerns is their ability to create text that closely mimics human writing, which can lead to potential misuse, such as academic misconduct, disinformation, and fraud. To address this problem, we present AuthentiGPT, an efficient classifier that distinguishes between machine-generated and human-written texts. Under the assumption that human-written text resides outside the distribution of machine-generated text, AuthentiGPT leverages a black-box LLM to denoise input text with artificially added noise, and then semantically compares the denoised text with the original to determine whether the content is machine-generated. With only one trainable parameter, AuthentiGPT eliminates the need for a large training dataset, watermarking the LLM's output, or computing the log-likelihood. Importantly, the detection capability of AuthentiGPT can be easily adapted to any generative language model. With a 0.918 AUROC score on a domain-specific dataset, AuthentiGPT demonstrates its effectiveness over other commercial algorithms, highlighting its potential for detecting machine-generated text in academic settings.
Zhen Guo · Shangdi Yu
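
The detection recipe, perturb the input, let a black-box LLM denoise it, and compare the reconstruction to the original, can be sketched as below. `call_llm` and `similarity` are hypothetical stand-ins; the threshold plays the role of the single trainable parameter mentioned in the abstract.

```python
import random

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a black-box LLM used as a denoiser.
    raise NotImplementedError

def similarity(a: str, b: str) -> float:
    # Hypothetical semantic similarity in [0, 1], e.g. embedding cosine.
    raise NotImplementedError

def add_noise(text: str, mask_rate: float = 0.15, seed: int = 0) -> str:
    # Artificially corrupt the passage by masking a fraction of words.
    rng = random.Random(seed)
    return " ".join("<mask>" if rng.random() < mask_rate else w
                    for w in text.split())

def is_machine_generated(text: str, threshold: float) -> bool:
    # If the LLM reconstructs the passage almost perfectly, the passage lies
    # inside the model's own distribution and is flagged as machine-generated;
    # threshold is the one trainable parameter.
    noisy = add_noise(text)
    denoised = call_llm(f"Restore the masked words:\n{noisy}")
    return similarity(denoised, text) >= threshold
```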

Paper 34: Are LLMs Useful in the Poorest Schools? TheTeacher.AI in Sierra Leone (Poster)
Education systems in developing countries have few resources to serve large, poor populations. How might generative AI integrate into classrooms? This paper introduces an AI chatbot designed to assist teachers in Sierra Leone with professional development to improve their instruction. We describe initial findings from early implementation across 122 schools and 193 teachers, and analyze its use through qualitative observations and an analysis of queries. Teachers use the system for lesson planning, classroom management, and subject matter. Usage is sustained over the school year, and a subset of teachers use the system intensively. We draw conclusions from these findings about how generative AI systems can be integrated into school systems in low-income countries.
Jun Ho Choi · Oliver Garrod · Paul Atherton · Andrew Joyce-Gibbons · Miriam Mason-Sesay · Daniel Bjorkegren

Paper 31: Field experiences and reflections on using LLMs to generate comprehensive lecture metadata (Poster)
We describe an ongoing initiative to incorporate generative AI into online higher-education classes at a large public U.S. university. Our online-only class setting poses special challenges: the technical backgrounds of incoming learners vary widely across domains, making the potential for personalized adaptation especially compelling, and the majority of instruction is delivered via pre-recorded video with some live office-hours support and forum discussions, making it critical to promote additional effective engagement and self-assessment. Toward these goals, we describe what we have learned in our early explorations, with an initial framework that uses generative AI to create questions and other rich metadata from lecture video to support instructor- and student-facing affordances for learning and discovery.
Sumit Asthana · Taimoor Arif · Kevyn Collins-Thompson

Paper 48: GAI-Enhanced Assignment Framework: A Case Study on Generative AI Powered History Education (Poster)
In the rapidly evolving landscape of Generative Artificial Intelligence (GAI) applications in education, practical uses lag behind theoretical discussions, especially in history education. This article examines a unique application of ChatGPT in reshaping pedagogical techniques within the undergraduate "History of Science" curriculum during the 2022-2023 Spring term. Students conversed with historical scientists through role-playing exercises, mirroring the course progression. We pose the question: How do GAI-driven interventions affect student engagement, learning, and mindset? Our evaluation method, which combines qualitative and quantitative measures, focuses on analyzing student capstone papers. Preliminary findings suggest that most students adapt and excel in this GAI-enhanced environment. However, challenges arose in assessing the correctness of GAI-generated responses and ensuring the authenticity of student-generated content. To address this, we introduce the 'Reference-Check Protocol (RCP)', a safeguarding technique for GAI in classrooms, emphasizing the accuracy of AI responses and the maintenance of academic integrity. Our research illustrates both the potential and the challenges of GAI in education.
Cagla Acun · Ramazan Acun

Paper 15: Efficient Classification of Student Help Requests in Programming Courses Using Large Language Models (Poster)
The accurate classification of student help requests with respect to the type of help being sought can enable the tailoring of effective responses. Automatically classifying such requests is non-trivial, but large language models (LLMs) appear to offer an accessible, cost-effective solution. This study evaluates the performance of the GPT-3.5 and GPT-4 models in classifying help requests from students in an introductory programming class. In zero-shot trials, GPT-3.5 and GPT-4 exhibited comparable performance on most categories, while GPT-4 outperformed GPT-3.5 in classifying sub-categories for requests related to debugging. Fine-tuning the GPT-3.5 model improved its performance to such an extent that it approximated the accuracy and consistency across categories observed between two human raters. Overall, this study demonstrates the feasibility of using LLMs to enhance educational systems through the automated classification of student needs.
Jaromir Savelka · Paul Denny · Mark Liffiton · Brad Sheese

Paper 39: An Automated Graphing System for Mathematical Pedagogy (Poster)
Teachers use a variety of in-classroom technological tools in day-to-day instruction. The variety and complexity of operating these tools imposes a cognitive and time overhead that teachers would rather devote to students. Pedagogical tool orchestration systems based on generative AI hold the promise of untethering teachers by enabling simple language-based operation of tools. Graphs are an essential tool in the classroom, allowing students to visualize and interact with mathematical concepts. In this paper, we present an automated graphing system for mathematical pedagogy. The system consists of an LLM and a mathematical solver used in conjunction with a math graphing tool to produce accurate visualizations from simple natural language commands. Our goal is to allow teachers to easily invoke math graphing tools through natural language, which is not possible through the use of a solver or an LLM alone. For benchmarking purposes, we create a dataset of graphing problems based on Common Core standards. We also develop an autoevaluator to easily evaluate the outputs of our system by comparing them to ground-truth expressions. Our results demonstrate the potential of tool usage with LLMs, as we show that incorporating a solver into the system results in significantly improved performance.
Arya Bulusu · Brandon Man · Ashish Jagmohan · Aditya Vempaty · Jennifer Mari-Wyka
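
The division of labor described, an LLM for language understanding plus a solver for mathematical correctness, can be sketched with SymPy standing in for both the solver and the graphing tool. This is an assumption-laden illustration, not the authors' system; `call_llm` is a hypothetical stand-in.

```python
import sympy as sp

def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for the LLM that translates teacher commands.
    raise NotImplementedError

def graph_from_command(command: str) -> None:
    # 1) The LLM turns natural language into a candidate expression in x.
    raw = call_llm(f"Rewrite as a single SymPy expression in x: {command}")
    # 2) The solver parses and validates it before anything is drawn.
    expr = sp.sympify(raw)
    # 3) The checked expression is handed to the graphing tool.
    sp.plot(expr, (sp.Symbol("x"), -10, 10))

def autoeval_match(output_expr: str, ground_truth: str) -> bool:
    # Autoevaluator idea: expressions match when their difference simplifies
    # to zero, not merely when the strings are equal.
    return sp.simplify(sp.sympify(output_expr) - sp.sympify(ground_truth)) == 0
```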

Paper 18: Small Generative Language Models for Educational Question Generation (Poster)
The automatic generation of educational questions will play a key role in scaling online education, enabling self-assessment at scale as a global population manoeuvres their personalised learning journeys. This work compares the predictive performance of foundational large-language-model-based systems and their small-language-model counterparts for educational question generation. Our experiments demonstrate that small language models can produce educational questions of comparable quality through further pre-training and fine-tuning, while yielding very lightweight models that can be easily trained, stored, and deployed.
Fares Fawzi · Sadie Amini · Sahan Bulathwela · Zhengxiang Shi

Paper 32: Conversational Programming with LLM-Powered Interactive Support in an Introductory Computer Science Course (Poster)
Chatbot interfaces for LLMs enable students to get immediate, interactive help on homework assignments, but doing so naively may not serve pedagogical goals. In this workshop paper, we report on the development and preliminary deployment of a GPT-4-based interactive homework assistant for students in a large introductory computer science course. Our assistant offers both a "Get Help" button within a popular code editor and a "get feedback" feature within our command-line autograder, wrapping student code in a custom prompt that supports our pedagogical goals and avoids providing solutions directly. We have found that our assistant can identify students' conceptual struggles and offer suggestions, plans, and template code in pedagogically appropriate ways, but it sometimes inappropriately labels correct student code as incorrect or pushes students to use correct-but-lesson-inappropriate approaches, among other failures, sometimes sending students down long, frustrating paths. We report on a number of development and deployment challenges and conclude with next steps.
J.D. Zamfirescu-Pereira · Laryn Qi · Bjorn Hartmann · John DeNero · Narges Norouzi

Paper 41: AI-Augmented Advising: A Comparative Study of ChatGPT-4 and Advisor-based Major Recommendations (Poster)
Choosing an undergraduate major is an important decision that impacts academic and career outcomes. We investigate using ChatGPT-4, a state-of-the-art large language model (LLM), to augment human advising for major selection. Through a 3-phase survey, we compare ChatGPT suggestions and responses for undeclared freshman and sophomore students (n=18) to expert responses from university advisors (n=18). Undeclared students were first surveyed on their interests and career goals. These responses were then given both to campus advisors and to ChatGPT to produce a major recommendation for each student. In the case of ChatGPT, information about the majors offered on campus was added to the prompt. Advisors, overall, rated the recommendations of ChatGPT as highly helpful and agreed with its recommendations 39% of the time. Additionally, we find substantially more agreement with AI major recommendations when advisors see the AI recommendations before making their own. However, this result was not statistically significant, possibly owing to insufficient data collected thus far. The results provide a first signal as to the viability of LLMs for personalized major recommendation and shed light on the promise and limitations of AI for advising support.
Kasra Lekan · Zachary Pardos

Paper 12: The Power of Personalization and Contextualization: Early Student Performance Forecasting with Language Models (Poster)
Early forecasting of student performance in a course is a critical component of building effective intervention systems. However, when the available student data is limited, accurate early forecasting is challenging. We present a language generation transfer learning approach that leverages the general knowledge of pre-trained language models to address this challenge. We hypothesize that early forecasting can be significantly improved by fine-tuning language models (LMs) via personalization and contextualization using data on students' distal factors (academic and socioeconomic) and proximal non-cognitive factors (e.g., motivation and engagement), respectively. Results obtained from extensive experimentation validate this hypothesis and thereby demonstrate the prowess of personalization and contextualization for tapping into the general knowledge of pre-trained LMs for solving the downstream task of early forecasting.
Ahatsham Hayat · Mohammad Hasan

Paper 7: Neural Task Synthesis for Visual Programming (Poster)
Generative neural models hold great promise for enhancing programming education by synthesizing new content. We seek to design neural models that can automatically generate programming tasks for a given specification in the context of visual programming domains. Despite the recent successes of large generative models like GPT-4, our initial results show that these models are ineffective in synthesizing visual programming tasks and struggle with logical and spatial reasoning. We propose a novel neuro-symbolic technique, NeurTaskSyn, that can synthesize programming tasks for a specification given in the form of desired programming concepts exercised by its solution code and constraints on the visual task. NeurTaskSyn has two components: the first is trained via an imitation learning procedure to generate possible solution codes, and the second is trained via a reinforcement learning procedure to guide an underlying symbolic execution engine that generates visual tasks for these codes. We demonstrate the effectiveness of NeurTaskSyn through an extensive empirical evaluation and a qualitative study on reference tasks taken from the Hour of Code: Classic Maze challenge by Code.org and the Intro to Programming with Karel course by CodeHS.com.
Victor-Alexandru Pădurean · Georgios Tzannetos · Adish Singla

Paper 38: Ruffle&Riley: Towards the Automated Induction of Conversational Tutoring Systems (Poster)
Conversational tutoring systems (CTSs) offer learning experiences driven by natural language interaction. They are known to promote high levels of cognitive engagement and to benefit learning outcomes, particularly in reasoning tasks. Nonetheless, the time and cost required to author CTS content is a major obstacle to widespread adoption. In this paper, we introduce a novel type of CTS that leverages recent advances in large language models (LLMs) in two ways: First, the system induces a tutoring script automatically from a lesson text. Second, the system automates the script orchestration via two LLM-based agents (Ruffle&Riley) in the roles of a student and a professor in a learning-by-teaching format. The system allows free-form conversation that follows the ITS-typical inner- and outer-loop structure. In an initial between-subjects online user study (N = 100) comparing Ruffle&Riley to simpler QA chatbots and a reading activity, we found no significant differences in post-test scores. Nonetheless, in the learning experience survey, Ruffle&Riley users expressed higher ratings of understanding and remembering, and further perceived the offered support as more helpful and the conversation as more coherent. Our study provides insights for a new generation of scalable CTS technologies.
Robin Schmucker · Meng Xia · Amos Azaria · Tom Mitchell

Paper 17: Towards AI-Assisted Multiple Choice Question Generation and Quality Evaluation at Scale: Aligning with Bloom's Taxonomy (Poster)
In educational assessment, Multiple Choice Questions (MCQs) are frequently used due to their efficiency in grading and providing feedback. However, manual MCQ generation encounters challenges. Relying on a limited set of questions may lead to item repetition, which could compromise the reliability of assessments and the security of the evaluation procedure, especially in high-stakes evaluations. This study explores an AI-driven approach to creating and evaluating MCQs in introductory chemistry and biology. The methodology involves generating Bloom's Taxonomy-aligned questions through zero-shot prompting with GPT-3.5; validating question alignment with Bloom's Taxonomy using RoBERTa, a transformer-based language model that employs self-attention to produce context-aware representations of the words in a sentence; evaluating question quality using Item Writing Flaws (IWF), i.e., issues that can arise in the creation of test items; and validating questions with subject matter experts. Our research demonstrates GPT-3.5's capacity to produce higher-order thinking questions, particularly at the "evaluation" level. We observe alignment between GPT-generated questions and human-assessed complexity, albeit with occasional disparities. Question quality assessment reveals differences between human and machine evaluations, correlating inversely with Bloom's Taxonomy levels. These findings shed light on automated question generation and assessment, presenting the potential for advancements in AI-driven educational evaluation methods.
Kevin Hwang · Sai Challagundla · Maryam Alomair · Karen Chen · Fow-Sen Choa

Paper 10: Automated Distractor and Feedback Generation for Math Multiple-choice Questions via In-context Learning (Poster)
Multiple-choice questions (MCQs) are ubiquitous at almost all levels of education since they are easy to administer and grade, and are a reliable form of assessment. An important aspect of MCQs is the distractors, i.e., incorrect options designed to target specific misconceptions or insufficient knowledge among students. To date, the task of crafting high-quality distractors has largely remained a labor-intensive process for teachers and learning content designers, which limits scalability. In this work, we explore the task of automated distractor and corresponding feedback message generation for math MCQs using large language models. We establish a formulation of these two tasks and propose a simple, in-context learning-based solution. Moreover, we propose generative AI-based metrics for evaluating the quality of the feedback messages. We conduct extensive experiments on these tasks using a real-world MCQ dataset. Our findings suggest that there is a lot of room for improvement in automated distractor and feedback generation; based on these findings, we outline several directions for future work.
Hunter McNichols · Wanyong Feng · Jaewook Lee · Alexander Scarlatos · Digory Smith · Simon Woodhead · Andrew Lan
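
The in-context learning setup, showing the model worked examples of distractors paired with the misconceptions they target and feedback messages, can be sketched as below. The example item and prompt wording are illustrative assumptions, not the paper's prompts; `call_llm` is a hypothetical stand-in.

```python
def call_llm(prompt: str) -> str:
    # Hypothetical stand-in for a large-language-model API.
    raise NotImplementedError

# One worked in-context example; the paper uses real MCQ data.
FEW_SHOT_EXAMPLE = """\
Question: What is 3/4 + 1/8?
Correct answer: 7/8
Distractor: 4/12 (misconception: adding numerators and denominators directly)
Feedback: Rewrite both fractions with a common denominator before adding.
"""

def generate_distractors(question: str, answer: str, n: int = 3) -> str:
    # In-context learning: the example fixes the output format and shows the
    # model that distractors should target specific misconceptions.
    prompt = (
        FEW_SHOT_EXAMPLE
        + f"\nQuestion: {question}\nCorrect answer: {answer}\n"
        + f"Write {n} distractors, each with the misconception it targets "
        + "and a feedback message for students who choose it."
    )
    return call_llm(prompt)
```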

Paper 21: Mathify: Evaluating Large Language Models on Mathematical Problem Solving Tasks (Poster)
The rapid progress of natural language processing (NLP) systems and the expansion of large language models (LLMs) have opened up numerous opportunities in the field of education and instructional methods. These advancements offer the potential for tailored learning experiences and immediate feedback, all delivered through accessible and cost-effective services. One notable application area for this technological advancement is mathematical problem solving. Mathematical problem-solving requires not only the ability to decipher complex problem statements but also the skill to perform precise arithmetic calculations at each step of the problem-solving process. However, the evaluation of the arithmetic capabilities of large language models has received relatively little attention. In response, we introduce an extensive mathematics dataset called "MathQuest" sourced from the 11th and 12th standard Mathematics NCERT textbooks. This dataset encompasses mathematical challenges of varying complexity and covers a wide range of mathematical concepts. Utilizing this dataset, we conduct fine-tuning experiments with three prominent LLMs: LLaMA2, WizardMath, and MAmmoTH. These fine-tuned models serve as benchmarks for evaluating their performance on our dataset. Our experiments reveal that among the three models, MAmmoTH-13B emerges as the most proficient, achieving the highest level of competence in solving the presented mathematical problems. Consequently, MAmmoTH-13B establishes itself as a robust and dependable benchmark for addressing NCERT mathematics problems.
Indraprastha Delhi · Mohit Gupta · Kritarth Prasad · Navya Singla · Sanjana Sanjeev · Jatin Kumar · Adarsh Raj Shivam · Rajiv Ratn Shah

Paper 30: Transforming Healthcare Education: Harnessing Large Language Models for Frontline Health Worker Capacity Building using Retrieval-Augmented Generation (Poster)
In recent years, large language models (LLMs) have emerged as a transformative force in several domains, including medical education and healthcare. This paper presents a case study on the practical application of retrieval-augmented generation (RAG) based models for enhancing healthcare education in low- and middle-income countries. The model described in this paper, SMARThealth GPT, stems from the necessity for accessible and locally relevant medical information to aid community health workers in delivering high-quality maternal care. We describe the development process of the complete RAG pipeline, including the creation of a knowledge base of Indian pregnancy-related guidelines, knowledge embedding retrieval, parameter selection and optimization, and answer generation. This case study highlights the potential of LLMs in building frontline healthcare worker capacity and enhancing guideline-based health education, and offers insights for similar applications in resource-limited settings. It serves as a reference for machine learning scientists, educators, healthcare professionals, and policymakers aiming to harness the power of LLMs for substantial educational improvement.
Yasmina Al Ghadban · Huiqi Yvonne Lu · Uday Adavi · Ankita · Sridevi Gara · Neelanjana Das · Bhaskar Kumar · Renu Johns · Praveen Devarsetty · Jane Hirst