Self-Predictive Universal AI

Elliot Catt · Jordi Grau-Moya · Marcus Hutter · Matthew Aitchison · Tim Genewein · Grégoire Delétang · Kevin Li · Joel Veness

Great Hall & Hall B1+B2 (level 1) #1824
Thu 14 Dec 8:45 a.m. PST — 10:45 a.m. PST


Reinforcement Learning (RL) algorithms typically use learning and/or planning techniques to derive effective policies. Integrating both approaches has proven highly successful in addressing complex sequential decision-making challenges, as evidenced by algorithms such as AlphaZero and MuZero, which consolidate the planning process into a parametric search-policy. AIXI, the most potent theoretical universal agent, relies on planning through comprehensive search as its primary means of finding an optimal policy. Here we define an alternative universal agent, which we call Self-AIXI, that, in contrast to AIXI, maximally exploits learning to obtain good policies. It does so by self-predicting its own stream of action data, which is generated, as in other TD(0) agents, by taking an action-maximization step over the current on-policy (universal mixture-policy) Q-value estimates. We prove that Self-AIXI converges to AIXI and inherits a series of its properties, such as maximal Legg-Hutter intelligence and the self-optimizing property.
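
The self-prediction loop described in the abstract can be illustrated with a small toy sketch. It is not the paper's construction: the universal mixtures are replaced by a finite, hand-picked set of stationary candidate policies, the environment is a bandit whose mean rewards are assumed known, and the mixture's value is computed as if the current weights stayed fixed. The function name `self_predictive_loop` and all parameters are hypothetical, chosen only for illustration.

```python
import numpy as np

def self_predictive_loop(rewards, policies, gamma=0.9, steps=50):
    """Toy loop: act greedily on the mixture policy's Q-values, then
    update the mixture by predicting the action that was just taken."""
    rewards = np.asarray(rewards, dtype=float)    # assumed-known mean reward per action
    policies = np.asarray(policies, dtype=float)  # row i: action distribution of candidate i
    weights = np.full(len(policies), 1.0 / len(policies))  # uniform prior over candidates
    actions = []
    for _ in range(steps):
        # Mixture policy: weighted average of the candidate action distributions.
        mixture = weights @ policies
        # Discounted value of following the (frozen) mixture forever in this bandit.
        value = (mixture @ rewards) / (1.0 - gamma)
        # On-policy Q-values: immediate reward, then continue with the mixture.
        q = rewards + gamma * value
        # Action-maximization step over the mixture's Q-value estimates.
        a = int(np.argmax(q))
        actions.append(a)
        # Self-prediction: Bayesian update of the weights on the agent's own action.
        weights = weights * policies[:, a]
        weights /= weights.sum()
    return actions, weights

actions, weights = self_predictive_loop(
    rewards=[0.2, 0.8],
    policies=[[0.9, 0.1], [0.5, 0.5], [0.1, 0.9]],
)
```

In this toy setting the greedy step always selects the higher-reward action, so the posterior weight concentrates on the candidate policy that best predicts that action. The mixture policy thereby distills the result of the maximization step, which is the intuition behind replacing search with learning.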
