SCALAR: Self-Supervised Composition and Learning of Skills with LLM Planning and RL
Abstract
A core challenge in reinforcement learning (RL) is effective exploration, particularly for long-horizon tasks. Recent approaches have explored the utility of large language models (LLMs), leveraging their capabilities to (1) decompose objectives into skills and (2) generate code such as reward functions and verifiers. However, ad hoc prompt and program designs, as well as reliance on a single proxy reward, can lead to reward hacking and hallucinations. Furthermore, synthesizing correct functions remains challenging without actual environment interaction. To address these challenges, we propose Self-Supervised Composition and Learning of Skills (SCALAR), an iterative, bi-directional framework that couples an LLM planner and low-level RL controllers through a skill library. The skill library is a set of skills that, when composed, define the frontier of states currently reachable by the agent. In SCALAR, the library is iteratively expanded by a high-level LLM planner in conjunction with low-level RL agents. In one direction, the LLM planner uses information in the skill library to propose new skills with (1) preconditions reachable through existing skill compositions and (2) termination conditions unachievable by current skills. Reusing existing skill compositions narrows the RL agent's task to achieving the new termination condition (2) rather than relearning how to reach known states (1). In the other direction, the LLM planner refines its world knowledge concurrently with RL training by analyzing successful RL trajectories, a process we call Pivotal Trajectory Analysis. We evaluate SCALAR on the Crafter benchmark, a challenging long-horizon task, on which SCALAR achieves 86.3% diamond-collection success, surpassing previous state-of-the-art methods in both overall performance and convergence speed. These results show that frontier-guided skill composition, together with verifier-based learning and bi-directional refinement, yields substantially more reliable long-horizon control under sparse rewards.
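To make the bi-directional loop described above concrete, the sketch below shows one plausible way to organize the skill library and a single planner/RL iteration. It is a minimal illustration under stated assumptions: every class name, signature, and helper (e.g., llm_propose_skill, train_rl_controller) is hypothetical and not the paper's actual implementation or API.

```python
# Illustrative sketch of the SCALAR loop (all names and signatures are assumptions).
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Skill:
    name: str
    precondition: Callable[[dict], bool]   # must hold before the skill can start
    termination: Callable[[dict], bool]    # verifier: holds when the skill succeeds
    policy: Optional[Callable[[dict], int]] = None  # low-level RL controller, set after training


@dataclass
class SkillLibrary:
    """Skills that, when composed, define the frontier of currently reachable states."""
    skills: List[Skill] = field(default_factory=list)

    def on_frontier(self, state: dict) -> bool:
        # A state is on the frontier if some existing skill terminates in it.
        return any(s.termination(state) for s in self.skills)


def scalar_iteration(
    library: SkillLibrary,
    llm_propose_skill: Callable[[SkillLibrary], Skill],            # hypothetical planner call
    train_rl_controller: Callable[[Skill, SkillLibrary], Tuple],   # hypothetical RL training call
) -> None:
    """One bi-directional iteration, as sketched here for exposition.

    Direction 1: the LLM planner proposes a new skill whose precondition is
    reachable via existing skill compositions and whose termination condition
    is not yet achievable.
    Direction 2: an RL controller is trained for the new skill; successful
    trajectories would then be analyzed (Pivotal Trajectory Analysis) to
    refine the planner's world knowledge.
    """
    candidate = llm_propose_skill(library)
    # Existing skill compositions bring the agent to the precondition, so RL
    # exploration is narrowed to achieving the new termination condition.
    policy, successful_trajs = train_rl_controller(candidate, library)
    if successful_trajs:
        candidate.policy = policy
        library.skills.append(candidate)  # the reachable frontier expands
        # Successful trajectories are fed back to the planner (not shown here).
```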