Model Context Protocol for Vision Agents: Schema, Memory, and World Model Implications
Abstract
The Model Context Protocol (MCP) defines a schema-bound execution model for agent–tool interaction, providing agents with structured schemas and persistent context objects that function as lightweight external world models. We present, to our knowledge, the first protocol-level, deployment-scale audit of MCP in vision systems: we analyze 91 publicly registered vision-centric MCP servers and develop an executable benchmark whose validators surface protocol violations, revealing systemic weaknesses in schema semantics, memory modeling, and runtime coordination. Our findings show that schema drift affects 78.0% of deployments, coordinate misalignment occurs in 24.6%, and persistent visual state generates an average of 33.8 memory-scope warnings per 100 executions. Security probes detect untyped tool connections in 89.0% of servers and privilege-escalation risks in 41.0%. Together, these failures undermine reliable compositional reasoning in current MCP deployments. We propose semantically grounded schemas, scoped visual memory, and runtime validators as protocol extensions, positioning MCP as a foundation for robust world-model integration in language- and vision-based agents.
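To make the kind of runtime validation the abstract refers to concrete, the sketch below checks a tool response against its declared JSON schema and flags schema drift (undeclared fields, type mismatches such as normalized-float versus integer-pixel coordinates). It is a minimal, hypothetical illustration under our own assumptions; the function name check_drift and the schema layout are ours and are not part of the MCP specification or the paper's benchmark code.

```python
# Minimal, hypothetical schema-drift validator (illustrative only; not the
# paper's benchmark code and not part of the MCP specification).
from typing import Any

# Map JSON-schema primitive type names to Python runtime types.
JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "object": dict,
    "array": list,
}

def check_drift(declared: dict[str, Any], response: dict[str, Any]) -> list[str]:
    """Return drift warnings: fields the tool emitted that its declared
    schema does not cover, and fields whose runtime type disagrees with
    the declared JSON-schema type."""
    warnings = []
    props = declared.get("properties", {})
    for field, value in response.items():
        if field not in props:
            warnings.append(f"undeclared field: {field!r}")
            continue
        expected = JSON_TYPES.get(props[field].get("type", ""))
        if expected and not isinstance(value, expected):
            warnings.append(f"type mismatch on {field!r}: got {type(value).__name__}")
    return warnings

# Example: a detector declares integer pixel coordinates but returns
# normalized floats plus an undeclared field -- two forms of drift.
schema = {"properties": {"x": {"type": "integer"}, "y": {"type": "integer"}}}
print(check_drift(schema, {"x": 0.42, "y": 0.17, "score": 0.9}))
# ["type mismatch on 'x': got float", "type mismatch on 'y': got float",
#  "undeclared field: 'score'"]
```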