How Intrinsic Motivation Shapes Learned Representations in Decision Transformers: A Cognitive Interpretability Analysis
Abstract
Elastic Decision Transformers (EDTs) with intrinsic motivation have demonstrated improved performance in offline reinforcement learning, yet the cognitive mechanisms underlying these improvements remain unexplored. We introduce a systematic post-hoc explainability framework that examines how intrinsic motivation shapes the learned embeddings of EDTs through statistical analysis of their properties: covariance structure, vector magnitudes, and orthogonality. We find that different intrinsic motivation variants create fundamentally different representational structures: one variant, operating on state embeddings, promotes compact representations, while another, operating on transformer outputs, enhances representational orthogonality. Our analysis further reveals strong, environment-specific correlations between embedding metrics and performance across locomotion tasks. These findings indicate that intrinsic motivation acts as a representational prior, shaping embedding geometry in cognitively plausible ways and creating environment-specific organizational structures that support better decision-making beyond simple exploration enhancement.
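As a rough illustration of the kind of embedding statistics the abstract refers to, the sketch below computes one plausible instantiation of each metric: effective rank of the embedding covariance as a proxy for compactness, mean L2 norm for vector magnitude, and mean absolute pairwise cosine similarity for (lack of) orthogonality. The function name and exact metric definitions are assumptions for illustration; the paper's precise formulations may differ.

```python
import numpy as np

def embedding_metrics(E: np.ndarray):
    """Compute three illustrative geometry metrics for an embedding
    matrix E of shape (n_samples, d). Definitions here are one
    plausible choice, not necessarily the paper's exact ones."""
    # Covariance structure: effective rank (participation ratio of
    # the covariance eigenvalue spectrum); lower = more compact.
    cov = np.cov(E, rowvar=False)
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)
    effective_rank = eig.sum() ** 2 / (eig ** 2).sum()

    # Vector magnitudes: mean L2 norm of the embedding vectors.
    mean_norm = np.linalg.norm(E, axis=1).mean()

    # Orthogonality: mean absolute off-diagonal cosine similarity;
    # lower values indicate more mutually orthogonal embeddings.
    U = E / np.linalg.norm(E, axis=1, keepdims=True)
    cos = U @ U.T
    off_diag = cos[~np.eye(len(E), dtype=bool)]
    mean_abs_cos = np.abs(off_diag).mean()

    return effective_rank, mean_norm, mean_abs_cos
```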