Abstract
Reinforcement Learning (RL) can benefit from a warm start, where the agent is initialized with a pretrained behavioral policy. However, when transitioning to RL updates, performance can degrade, which may compromise the agent’s safety. This degradation, which constitutes an inability to properly utilize the pretrained policy, is attributed to extrapolation error in the value function, a result of high values being assigned to Out-Of-Distribution actions not present in the behavioral policy’s data. We investigate why the magnitude of degradation varies across policies and why the policy fails to quickly return to behavioral performance. We present visual confirmation of our analysis and draw comparisons to the Offline RL setting, which suffers from similar difficulties. We propose a novel method, Confidence Constrained Learning (CCL) for Warm-Start RL, that reduces degradation by balancing the policy gradient and constrained learning according to a confidence measure of the Q-values. For the constrained learning component we propose a novel objective, Positive Q-value Distance (CCL-PQD). We investigate a variety of constraint-based methods that aim to overcome the degradation, and find that they constitute solutions to a multi-objective optimization problem between maximal performance and minimal degradation. Our results demonstrate that hyperparameter tuning for CCL-PQD produces solutions on the Pareto front of this multi-objective problem, allowing the user to balance between performance and tolerable compromises to the agent’s safety.
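The abstract describes CCL as a confidence-weighted balance between a policy-gradient term and a constrained-learning (PQD) term. The sketch below is only an illustration of that idea under stated assumptions, not the paper's implementation: the confidence weighting scheme, the exact form of the PQD penalty, and all names (`ccl_loss`, `confidence`, `behavioral_q`) are introduced here for illustration.

```python
# Illustrative sketch only: the confidence measure and the exact PQD penalty
# below are assumptions, not the formulation from the paper.
import torch


def ccl_loss(log_probs, advantages, q_values, behavioral_q, confidence):
    """Blend a policy-gradient surrogate with a constraint term.

    confidence -- scalar in [0, 1] derived from the Q-values; low confidence
    keeps the update close to the pretrained behavioral policy, while high
    confidence trusts the standard RL objective.
    """
    # Standard policy-gradient surrogate loss (to be minimized).
    pg_loss = -(log_probs * advantages).mean()

    # Hypothetical "positive Q-value distance": penalize only the part of the
    # learned Q-value that exceeds the behavioral estimate, guarding against
    # over-estimation of out-of-distribution actions.
    pqd_loss = torch.clamp(q_values - behavioral_q, min=0.0).pow(2).mean()

    # Confidence-weighted combination of the two objectives.
    return confidence * pg_loss + (1.0 - confidence) * pqd_loss


# Minimal usage example with dummy tensors.
if __name__ == "__main__":
    n = 32
    loss = ccl_loss(
        log_probs=torch.randn(n),
        advantages=torch.randn(n),
        q_values=torch.randn(n),
        behavioral_q=torch.randn(n),
        confidence=0.7,
    )
    print(loss.item())
```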
| Original language | English |
| --- | --- |
| State | Published - 1 Jan 2022 |
| Event | Adaptive and Learning Agents Workshop, ALA 2022 at AAMAS 2022, Auckland, New Zealand; Duration: 9 May 2022 → 10 May 2022 |
Conference
| Conference | Adaptive and Learning Agents Workshop, ALA 2022 at AAMAS 2022 |
| --- | --- |
| Country/Territory | New Zealand |
| City | Auckland |
| Period | 9/05/22 → 10/05/22 |
Keywords
- Bootstrapping Error
- Confidence Constrained RL
- Degradation
- Distributional Shift
- Extrapolation Error
- Offline RL
- Out-Of-Distribution
- Reinforcement Learning
- Warm Start
All Science Journal Classification (ASJC) codes
- Artificial Intelligence
- Software