Hierarchical Memory World Models
Abstract
Building accurate and generalizable world models (WMs) requires aligning an agent's inductive biases with the structure of the environments it inhabits. Realistic environments are inherently multi-task, containing both shared global dynamics and task-specific variations. We argue that an agent's WM should reflect this hierarchy and enable efficient separation and integration of environment-level and task-level knowledge. We introduce Hierarchical Memory World Models (HMWM), which structurally align the agent's internal WM with this hierarchical organization. The environment-level WM (EWM) learns task context representations from observations and encourages task separation through cluster-contrastive self-supervised learning. Acting as a long-term memory, the EWM accumulates task-invariant regularities and provides context to the task-level WM (TWM), a working memory that specializes in task-specific dynamics and guides in-context learning. To benchmark HMWM, we present Multi-FoE, a multi-task environment with egocentric partial observations and boundary-free task switching. Empirically, HMWM outperforms baselines in both task performance and sample efficiency.