Demo: WotNow: Multimodal AI Coach with Affective Computing
Abstract
WotNow is a real-time, multimodal AI coach that infers engagement from an individual's face or from classroom/meeting-room video and audio, and turns those signals into actionable recommendations. The system combines affective computing with Generative AI to interpret biosignals (HRV, heart rate, focus, facial affect, posture/dynamics, vocal prosody, turn-taking) and produce concise “micro-moment” highlights, summaries, and coaching prompts for instructors and participants. Goals: (1) demonstrate an end-to-end coaching loop from raw signals to recommendations, (2) let attendees experience interactive, room-level sensing vs. individual mobile coaching, and (3) show why this approach is relevant to foundation-model-driven brain/body understanding.
What we will demo
Room-level engagement sensing: Using a laptop webcam or room camera, WotNow estimates whole-room engagement, identifies shifts (e.g., rising confusion, waning attention), and visualizes trends on a live dashboard. Attendees will see real-time signals, confidence/uncertainty estimates, and per-participant participation metrics aggregated into a single room state (a minimal aggregation sketch follows this list).
Recommendations & micro-moments: The coach suggests context-aware actions (e.g., “pause for a check-in,” “invite quieter participants,” “reframe the question”), and generates short clips/transcripts that anchor feedback. Interactivity: volunteers step into view and trigger highlights; organizers can try alternative prompt policies live.
Individual mobile experience: The iOS/Android app provides personal, on-the-go coaching (ideal for solo learners, tutors, or presenters), while the classroom/conference-room app is designed for multi-person settings and aggregates signals into a room-level model before making recommendations. Attendees learn how thresholds and prompts differ between individual and group modes.
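The room-level aggregation and the individual vs. group thresholds described above can be sketched roughly as follows. This is a minimal, illustrative Python example under assumed signal names, confidence weighting, and threshold values; it is not the production pipeline.

    from dataclasses import dataclass
    from statistics import fmean

    # Hypothetical per-participant estimate produced by the vision/audio models.
    @dataclass
    class ParticipantEstimate:
        engagement: float      # 0.0 (disengaged) .. 1.0 (engaged)
        confidence: float      # model confidence in this estimate, 0.0 .. 1.0
        speaking_share: float  # fraction of recent talk time, for turn-taking metrics

    @dataclass
    class RoomState:
        engagement: float
        uncertainty: float
        participation_spread: float  # rough equity/turn-taking indicator

    def aggregate_room_state(estimates: list[ParticipantEstimate]) -> RoomState:
        """Confidence-weighted aggregation of per-person signals into one room state."""
        if not estimates:
            return RoomState(engagement=0.0, uncertainty=1.0, participation_spread=0.0)
        total_conf = sum(e.confidence for e in estimates) or 1.0
        engagement = sum(e.engagement * e.confidence for e in estimates) / total_conf
        # Low average confidence -> high uncertainty shown on the dashboard.
        uncertainty = 1.0 - fmean(e.confidence for e in estimates)
        # Spread of speaking shares as a crude participation-equity signal.
        shares = [e.speaking_share for e in estimates]
        return RoomState(engagement, uncertainty, max(shares) - min(shares))

    # Illustrative mode-dependent thresholds: group ("room") mode reacts to an
    # aggregate dip, while individual mode reacts to a single person's stream.
    THRESHOLDS = {"individual": 0.45, "room": 0.55}

    def should_prompt(state: RoomState, mode: str = "room") -> bool:
        # Only recommend an action when the estimate is both low and trustworthy.
        return state.engagement < THRESHOLDS[mode] and state.uncertainty < 0.5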
System description
WotNow?! fuses computer vision and audio features with Generative AI (vision–language and speech–language models) to transform raw biosignals into interpretable insights. Affective computing layers estimate engagement and attention, while LLM-based components explain why a moment mattered (natural-language rationales), generate tailored prompts for the instructor or facilitator, and create summaries.
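A rough sketch of how an LLM-based component might turn a detected micro-moment into a rationale and a coaching prompt is shown below. The event schema, the prompt template, and the `generate_text` callback are placeholders standing in for whatever vision-language or speech-language backend the system actually uses; they are assumptions for illustration, not WotNow's interfaces.

    import json
    from dataclasses import dataclass, asdict

    # Hypothetical micro-moment event emitted by the sensing layer.
    @dataclass
    class MicroMoment:
        timestamp: str  # e.g. "00:14:32"
        signal: str     # e.g. "room_engagement"
        change: str     # e.g. "dropped from 0.72 to 0.41 over 90 s"
        context: str    # short transcript snippet around the moment

    COACH_TEMPLATE = """You are a coaching assistant for instructors.
    Given this engagement event, explain in one sentence why it matters,
    then suggest one concrete action the instructor could take right now.

    Event: {event}
    """

    def coach_on_moment(moment: MicroMoment, generate_text) -> str:
        """Build a prompt from a micro-moment and ask a text-generation backend
        for a natural-language rationale plus a suggested action.

        `generate_text` is a placeholder: any callable that takes a prompt
        string and returns the model's reply as a string.
        """
        prompt = COACH_TEMPLATE.format(event=json.dumps(asdict(moment), indent=2))
        return generate_text(prompt)

    if __name__ == "__main__":
        # Stubbed backend so the sketch runs without a model.
        moment = MicroMoment("00:14:32", "room_engagement",
                             "dropped from 0.72 to 0.41 over 90 s",
                             "several one-word answers after the third slide")
        fake_llm = lambda p: ("Engagement dipped after a dense slide; "
                              "pause for a quick check-in question.")
        print(coach_on_moment(moment, fake_llm))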
Novelty/technical contribution: (i) a coach-in-the-loop design that closes the loop from sensing → reasoning → action; (ii) room-level aggregation of multi-person signals (participation, equity/turn-taking) into a single actionable state; (iii) foundation model (FM)-mediated interpretation that goes beyond scores to explanations; and (iv) privacy-aware deployment (ephemeral buffers). Privacy controls include participant consent notices. The system is not a medical device and is intended for coaching and educational support.
Relevance to Brain & Body FM
This demo showcases real-time decoding and visualization of human state from multimodal signals and uses foundation/generative models to translate those signals into usable guidance, directly aligned with the workshop's themes of foundation models for human sensing and large-scale video understanding. Attendees experience an interactive, live system that operationalizes FM reasoning for classroom or conference-room engagement.
Ethics, privacy, and safety
We will (a) obtain opt-in consent from volunteers and (b) avoid storing biometric identifiers by default.
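One possible reading of the "ephemeral buffers" and "no biometric identifiers stored by default" design is sketched below. This is an assumed illustration of the idea, not WotNow's actual implementation: raw frames live only in a short in-memory window, nothing is written to disk, and only derived features are handed onward.

    import time
    from collections import deque

    class EphemeralFrameBuffer:
        """Keep raw frames in memory only long enough to extract derived features.

        Frames older than `max_age_s` are dropped automatically and never
        persisted; only non-identifying derived features leave the buffer.
        """

        def __init__(self, max_age_s: float = 10.0):
            self.max_age_s = max_age_s
            self._frames = deque()  # (timestamp, frame) pairs, newest at the right

        def push(self, frame) -> None:
            now = time.monotonic()
            self._frames.append((now, frame))
            # Evict anything older than the retention window.
            while self._frames and now - self._frames[0][0] > self.max_age_s:
                self._frames.popleft()

        def extract_features(self, featurizer) -> list:
            """Run a featurizer (e.g., an engagement estimator) over buffered frames.

            Only the derived features are returned; the raw frames stay in memory
            until they age out.
            """
            return [featurizer(frame) for _, frame in self._frames]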
Team & links Team: Augment Me / WotNow Website: www.augment-me.com App: WotNow (iOS; Android build available for demo devices). Room-level web app is separate