Browsing by Subject "Reward-free"
Reward is not Necessary: Foundations for Compositional Non-Stationary Non-Markovian Hierarchical Planning and Intrinsically Motivated Autonomous Agents (2023-07)
Ringstrom, Thomas

Humans are faced with real-world problems that are often non-stationary (NS) (i.e., time-varying) and have long-horizon non-Markovian (NM) (i.e., history-dependent) hierarchical structure. For example, a person may need to plan errands to multiple stores over an afternoon to obtain several ingredients for a dish, and the stores may close at different times. When these properties are combined, optimizing policies is difficult because of the number of possible combinations of world states and the agent's internal states, i.e., a Cartesian product-space of state-variables. Additionally, justifying why an agent should perform one task over another in a high-dimensional product-space is the problem of intrinsic motivation. From a reward-maximization point of view, it is not clear how to sensibly define reward functions on a large Cartesian product-space to motivate an open-ended life-long agent to act. Natural and artificial autonomous agents must acquire the right representations and algorithms to determine the value of goals and plan into the future within a product-space of variables, especially when some of the variables (e.g., physiological variables) have the potential to undermine their autonomy if not regulated. Ideally, agents should also be able to exploit task and sub-goal decompositions that are modular and remappable to many different environments, promoting efficient structure reuse. The principle by which all of this can be achieved is a central topic of this thesis: compositionality.

I demonstrate that the problem of hierarchical non-stationary planning and the problem of intrinsic motivation in Cartesian product-spaces can be approached with the same principle: by constructing and using compositional and factorizable abstract transition operators for goal-conditioned semi-Markov planning. I develop reward-free objective functions for NS and NM hierarchical semi-Markov policy optimization. First, I extend the Linearly-Solvable Markov Decision Process framework to a formulation for solving NM Boolean logic problems by scheduling sequential policies. Second, I address NS-NM problems by defining the Temporal Goal Compositional Markov Decision Process. I use this decision process to define new Bellman equations, called Operator Bellman Equations, which optimize cumulative goal satisfaction instead of cumulative reward. Critically, these equations produce abstract goal-conditioned spatiotemporal transition operators, called state-time feasibility functions, which map the initial state and time at which an agent begins a policy to the final state and time of completing a goal that influences the dynamics of a higher-level state-space. These functions can be composed to compute the probability of solving a multi-goal NS-NM task. This includes tasks where the agent can directly control the underlying environment structure, which is analogous to path-finding on a graph with time-varying edges and controllable edge structure. Lastly, I show how the intrinsic motivation metric of empowerment can be used to define an autotelic self-preserving agent that can choose its own goals.
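As a sketch of this composition property (the notation here is illustrative, not the thesis's own), a state-time feasibility function for a goal-conditioned policy $\pi_g$ can be written as a kernel $F^{\pi_g}(s', t' \mid s, t)$: the probability of completing goal $g$ in final state $s'$ at time $t'$, given that the policy is initiated in state $s$ at time $t$. Two such operators then compose by marginalizing over the intermediate state and time,

\[
F^{\pi_{g_2}\circ\,\pi_{g_1}}(s'', t'' \mid s, t) \;=\; \sum_{s',\, t'} F^{\pi_{g_2}}(s'', t'' \mid s', t')\; F^{\pi_{g_1}}(s', t' \mid s, t),
\]

which gives the probability of achieving both goals in sequence under non-stationary, history-dependent dynamics.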
Empowerment is the channel capacity of a transition operator; it quantifies how free an agent is to predictably realize multiple possible futures from a given state, and it has typically been limited to short spatiotemporal ranges in relatively small state-spaces. However, since feasibility functions form abstract hierarchical spatiotemporal transition operators, they have an empowerment value over long time-horizons in a high-dimensional product-space. I define a valence function that measures the gain in hierarchical empowerment (i.e., "valence") when the agent acts to change the structure and affordances of the world in order to preserve or expand its capacity to act. Hierarchical empowerment-gain can thus be optimized in the abstract space of tasks to justify long courses of action into the future towards goals which, to the benefit of the entire control architecture, bring about a more favorable coupling between an agent's internal structure (e.g., hunger, hydration, and temperature states, dynamics, and skills) and its external environment (e.g., world structure, spatial state, and items). Therefore, embodied life-long agents could in principle be animated primarily by a combination of compositionality and hierarchical empowerment-gain, instead of pursuing rewards in high-dimensional product-spaces. I close by discussing the potential for this theory to serve as a foundation for open-ended life-long learning.
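For reference (again only as a sketch, with symbols chosen here for illustration rather than taken from the thesis), the empowerment of a state $s$ under a $k$-step transition operator $P(s' \mid s, a^{k})$ is the channel capacity from action sequences to resulting states,

\[
\mathfrak{E}^{k}(s) \;=\; \max_{p(a^{k})} I\!\left(A^{k}; S' \,\middle|\, s\right),
\]

and because feasibility functions are themselves abstract transition operators, the same quantity can be evaluated over goal-conditioned options in the hierarchical product-space. A valence-style signal can then be read as the empowerment gained when a task moves the hierarchical state from $h$ to $h'$, e.g., $\mathfrak{E}(h') - \mathfrak{E}(h)$, so that the agent prefers courses of action that preserve or expand its long-horizon capacity to act.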