
Workshop: Generative AI for Education (GAIED): Advances, Opportunities, and Challenges

Paper 47: Detecting Educational Content in Online Videos by Combining Multimodal Cues

Anirban Roy · Sujeong Kim · Claire Christensen · Madeline Cincebeaux

Keywords: [ Educational content understanding ] [ Multimodal analysis ]


The growing consumption of online media by young children underscores the need for data-driven tools that help educators identify suitable educational content for early learners. This paper introduces a method for identifying educational content in online videos. We focus on two widely used educational content classes, literacy and math, at two levels: Prekindergarten and Kindergarten. For each class and level, we choose prominent codes (sub-classes) based on the Common Core Standards. For example, literacy codes include 'letter names' and 'letter sounds', and math codes include 'counting' and 'sorting'. We pose this as a fine-grained multilabel classification problem, because a video can contain multiple types of educational content and the content classes can be visually similar (e.g., 'letter names' vs. 'letter sounds'). Because the alignment between visual and audio cues is crucial for effective comprehension, we adopt a multimodal video analysis framework that captures both visual and audio cues when detecting educational content. We leverage the recent success of generative models to analyze audio and visual content. Specifically, we apply automatic speech recognition (ASR) to extract speech from the audio, and we capture visual cues with descriptive captions. Finally, we fuse both cues to detect the desired educational content. Our experiments show that multimodal analysis of these cues is crucial for detecting educational content in videos.
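To make the multilabel framing concrete, the following is a minimal sketch of how per-label scores from the two cues might be fused. The abstract does not specify the fusion method, so the simple averaging used here is an assumption for illustration, not the authors' approach. The label list reuses the four example codes named above, and the function name `fuse_and_predict`, the input logits, and the 0.5 threshold are all hypothetical.

```python
import math

# Example codes named in the abstract; the full code set is not given there.
LABELS = ["letter names", "letter sounds", "counting", "sorting"]


def sigmoid(x: float) -> float:
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def fuse_and_predict(speech_logits, caption_logits, threshold=0.5):
    """Assumed late fusion: average the per-label probabilities from the
    speech (ASR transcript) branch and the visual caption branch, then
    threshold every label independently (multilabel, not softmax)."""
    predictions = {}
    for label, s, c in zip(LABELS, speech_logits, caption_logits):
        prob = 0.5 * (sigmoid(s) + sigmoid(c))
        predictions[label] = prob >= threshold
    return predictions


# Made-up logits for one video; real values would come from trained models.
speech_logits = [2.0, 1.0, -3.0, -3.0]
caption_logits = [1.0, -0.5, -2.0, -4.0]
result = fuse_and_predict(speech_logits, caption_logits)
print(result)
```

Because each label is thresholded on its own, a single video can be tagged with both 'letter names' and 'letter sounds', which is the property that distinguishes multilabel from single-label classification.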
