From Sensing to Reasoning: Multi-Modal Large Language Models Guiding Robotic Intelligence in Autonomous Labs
Abstract
We evaluate multimodal large language models (LLMs) as protocol-aware “reasoning copilots” for self-driving laboratories (SDLs). Open-source families (e.g., Llama, Granite, Gemma, Hermes, LLaVA) and proprietary GPT models are benchmarked across image-based readiness checks, standard lab tasks, infeasible actions, and adversarial instructions. GPT models lead on perception, accurately detecting transparent vessels and counting objects, yet no model exceeds 80% overall accuracy under protocol and safety constraints; in several real-world reasoning scenarios, compact open-source models (2–3B parameters) match or surpass GPT performance. These results reveal persistent gaps in fusing multimodal signals with standard operating procedure (SOP) semantics and in reliable, real-time decision-making. We propose a practical path forward: protocol-aware prompting, rigorous safety stress tests, action logging, and closed-loop evaluation. This positions LLMs as assistive automators with expert fallbacks, rather than autonomous controllers, to accelerate experimental science safely and effectively.
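As a concrete illustration of the proposed path forward (not the system benchmarked in the paper), the minimal sketch below shows how protocol-aware prompting, action logging, and an expert fallback might be wired into a single closed-loop decision step. All names (ProtocolContext, closed_loop_step, actions.jsonl), the prompt wording, and the APPROVE/REFUSE convention are illustrative assumptions, and model_call stands in for any text-in/text-out model client.

```python
import json
import time
from dataclasses import dataclass, field


@dataclass
class ProtocolContext:
    """Illustrative container for the SOP excerpt and safety rules shown to the model."""
    sop_excerpt: str
    safety_rules: list[str] = field(default_factory=list)


def build_protocol_aware_prompt(ctx: ProtocolContext, instruction: str) -> str:
    """Prepend SOP text and safety rules so the model must reason against the protocol
    and refuse infeasible or unsafe requests."""
    rules = "\n".join(f"- {r}" for r in ctx.safety_rules)
    return (
        "You are a lab reasoning copilot. Follow the SOP below exactly.\n"
        f"SOP:\n{ctx.sop_excerpt}\n"
        f"Safety rules:\n{rules}\n"
        "If the requested action violates the SOP or is infeasible, answer REFUSE with a reason.\n"
        f"Requested action: {instruction}\n"
        "Answer APPROVE or REFUSE, then a one-line justification."
    )


def log_action(logfile: str, record: dict) -> None:
    """Append a timestamped JSON line so every model decision stays auditable."""
    record["timestamp"] = time.time()
    with open(logfile, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")


def closed_loop_step(model_call, ctx: ProtocolContext, instruction: str, logfile: str) -> bool:
    """One evaluation step: prompt the model, log its decision, and gate execution.
    Anything other than an explicit APPROVE is routed to a human expert fallback."""
    prompt = build_protocol_aware_prompt(ctx, instruction)
    reply = model_call(prompt)  # hypothetical client: takes a prompt string, returns text
    approved = reply.strip().upper().startswith("APPROVE")
    log_action(logfile, {"instruction": instruction, "reply": reply, "approved": approved})
    return approved  # False -> escalate to expert instead of driving the robot


if __name__ == "__main__":
    ctx = ProtocolContext(
        sop_excerpt="Step 3: cap all vials before transfer to the heating block.",
        safety_rules=["Never heat an uncapped vial."],
    )
    # Stub model client for illustration; a real deployment would call a multimodal LLM.
    fake_model = lambda prompt: "REFUSE: the vial in the image is uncapped."
    ok = closed_loop_step(fake_model, ctx,
                          "Transfer the uncapped vial to the heating block.", "actions.jsonl")
    print("execute" if ok else "escalate to expert")
```

In this sketch, any reply other than an explicit APPROVE is escalated to a human expert, reflecting the paper's framing of LLMs as assistive automators with expert fallbacks rather than autonomous controllers.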