In many reinforcement learning (RL) problems, the objective is to maximize a goal-related reward while satisfying constraints, often expressed as penalties or negative rewards. For instance, a robot might need to learn to reach a target while adhering to constraints, such as avoiding excessive speed. Overweighting these constraints can impede the agent's ability to learn the goal-oriented behavior. Prior research suggests that temporarily disabling constraints and enabling them only after the agent has achieved the goal a few times can accelerate learning of the optimal policy compared to enforcing constraints from the beginning.
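As an illustration of what "temporarily disabling constraints" could look like in practice, below is a minimal sketch of a toggleable constraint penalty wrapped around an environment. It assumes a gymnasium-style interface; the class name, the `info["speed"]` field, and the penalty form are illustrative assumptions, not part of any existing library.

```python
import gymnasium as gym


class ToggleableSpeedPenalty(gym.Wrapper):
    """Subtracts a penalty for exceeding a speed limit, but only while enabled.

    Hypothetical sketch: assumes the wrapped environment reports the agent's
    current speed in info["speed"]; names and penalty form are illustrative.
    """

    def __init__(self, env, speed_limit=1.0, penalty=1.0, enabled=False):
        super().__init__(env)
        self.speed_limit = speed_limit
        self.penalty = penalty
        self.enabled = enabled

    def set_constraint(self, enabled: bool) -> None:
        # Called externally (e.g., by a fixed curriculum or a meta-agent)
        # to activate or deactivate the constraint partway through training.
        self.enabled = enabled

    def step(self, action):
        obs, reward, terminated, truncated, info = self.env.step(action)
        # Apply the constraint penalty only while the constraint is active.
        if self.enabled and info.get("speed", 0.0) > self.speed_limit:
            reward -= self.penalty
        return obs, reward, terminated, truncated, info
```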
This project explores whether the problem of scheduling when such constraints are activated, so as to optimize the agent's final performance, can itself be framed as an RL problem for a meta-agent.
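One way this meta-level framing might look in code is sketched below, under the assumption that the meta-agent's action at each phase is a binary "constraint on/off" decision and its reward is the inner agent's final performance under the full constrained objective. The callables `make_inner_env`, `train_phase`, and `evaluate` are hypothetical placeholders for the inner RL loop, not references to an existing API.

```python
import gymnasium as gym
import numpy as np


class ConstraintSchedulingEnv(gym.Env):
    """Meta-environment: each step decides whether the constraint is active
    for the next inner-training phase; the episode ends after a fixed number
    of phases with a reward equal to the inner agent's final performance.

    Hypothetical sketch: the injected callables stand in for the actual
    inner training and evaluation routines.
    """

    def __init__(self, make_inner_env, train_phase, evaluate, num_phases=10):
        self.make_inner_env = make_inner_env
        self.train_phase = train_phase   # (env, constraint_on) -> training stats
        self.evaluate = evaluate         # env -> final constrained return
        self.num_phases = num_phases
        self.action_space = gym.spaces.Discrete(2)  # 0: constraint off, 1: on
        # Observation: e.g. [phase progress, recent goal-success rate].
        self.observation_space = gym.spaces.Box(0.0, 1.0, shape=(2,), dtype=np.float32)

    def reset(self, *, seed=None, options=None):
        super().reset(seed=seed)
        self.inner_env = self.make_inner_env()
        self.phase = 0
        self.success_rate = 0.0
        return np.array([0.0, 0.0], dtype=np.float32), {}

    def step(self, action):
        # Run one inner-training phase with the constraint on or off.
        stats = self.train_phase(self.inner_env, constraint_on=bool(action))
        self.success_rate = stats.get("goal_success_rate", 0.0)
        self.phase += 1
        done = self.phase >= self.num_phases
        # Sparse meta-reward: the inner agent's final constrained performance.
        reward = self.evaluate(self.inner_env) if done else 0.0
        obs = np.array([self.phase / self.num_phases, self.success_rate],
                       dtype=np.float32)
        return obs, reward, done, False, {}
```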