Interpreting style–content parsing in vision–language models
Abstract
Style refers to the distinctive manner of expressing content, and humans can both recognize content across stylistic transformations and detect stylistic consistency across different contents. Prior work has shown that vision–language models (VLMs) exhibit steerable texture–shape biases, with language supervision shifting this tradeoff at the behavioral level. However, the internal representational dynamics of style and content (how they emerge across layers and how language pathways modulate them) remain poorly understood. Here, we adapt neuroscience-inspired tools to dissect style and content representations in a large VLM. We show that vision encoders strongly preserve stylistic signals while progressively enhancing content selectivity, and that language pathways further amplify content representations at the expense of style. Prompting can modestly shift this balance, but content remains dominant in deeper layers. These findings provide systematic evidence of style–content dissociation in multimodal models and can guide the design of architectures that balance style and content more effectively.