Analyzing and Overcoming Degradation in Warm-Start Off-Policy Reinforcement Learning

Benjamin Wexler, Elad Sarafian, Sarit Kraus

Research output: Contribution to conference › Paper › peer-review

Abstract

Reinforcement Learning (RL) can benefit from a warm start in which the agent is initialized with a pretrained behavioral policy. However, when transitioning to RL updates, performance can degrade, which may compromise the agent's safety. This degradation, which constitutes an inability to properly utilize the pretrained policy, is attributed to extrapolation error in the value function, a result of high values being assigned to Out-Of-Distribution actions absent from the behavioral policy's data. We investigate why the magnitude of degradation varies across policies and why the policy fails to quickly return to behavioral performance. We present visual confirmation of our analysis and draw comparisons to the Offline RL setting, which suffers from similar difficulties. We propose a novel method for Warm-Start RL, Confidence Constrained Learning (CCL), which reduces degradation by balancing between the policy gradient and constrained learning according to a confidence measure of the Q-values. For the constrained learning component, we propose a novel objective, Positive Q-value Distance (CCL-PQD). We investigate a variety of constraint-based methods that aim to overcome the degradation and find that they constitute solutions to a multi-objective optimization problem between maximal performance and minimal degradation. Our results demonstrate that hyperparameter tuning for CCL-PQD produces solutions on the Pareto front of this multi-objective problem, allowing the user to balance performance against tolerable compromises to the agent's safety.
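To make the confidence-weighted blending concrete, below is a minimal sketch of a CCL-style actor loss. The abstract does not give the method's exact equations, so the confidence measure (Q-ensemble disagreement passed through a sigmoid) and the hinge form of the Positive Q-value Distance term are assumptions for illustration, not the authors' definitions; `q_ensemble`, `policy`, and `behavioral_actions` are hypothetical names.

```python
# Illustrative sketch only: the confidence measure and the PQD form
# below are assumptions inferred from the abstract, not the paper's
# actual equations.
import torch
import torch.nn.functional as F

def ccl_loss(q_ensemble, policy, states, behavioral_actions):
    """Blend the policy-gradient objective with a constraint term,
    weighted by a (hypothetical) confidence measure of the Q-values."""
    pi_actions = policy(states)
    # Q-values of the current policy's actions under each ensemble member.
    qs_pi = torch.stack([q(states, pi_actions) for q in q_ensemble])
    q_pi = qs_pi.mean(0)
    # Assumed confidence: low ensemble disagreement -> high confidence.
    confidence = torch.sigmoid(-qs_pi.std(0)).detach()
    # Q-values of the pretrained behavioral policy's actions.
    q_beh = torch.stack([q(states, behavioral_actions)
                         for q in q_ensemble]).mean(0).detach()
    # Standard off-policy actor loss: maximize Q of the policy's actions.
    pg_loss = -q_pi
    # Assumed "Positive Q-value Distance": penalize only the positive part
    # of the gap between the behavioral action's value and the policy's.
    pqd_loss = F.relu(q_beh - q_pi)
    # Trust the policy gradient where Q-values are confident; otherwise
    # constrain the policy back toward behavioral-level Q-values.
    return (confidence * pg_loss + (1.0 - confidence) * pqd_loss).mean()
```

Under this reading, the constraint vanishes once the policy's actions score at least as well as the behavioral actions, so the agent is only held near the pretrained policy while its Q-values remain unreliable.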

Original language: English
State: Published - 1 Jan 2022
Event: Adaptive and Learning Agents Workshop, ALA 2022 at AAMAS 2022 - Auckland, New Zealand
Duration: 9 May 2022 – 10 May 2022

Conference

Conference: Adaptive and Learning Agents Workshop, ALA 2022 at AAMAS 2022
Country/Territory: New Zealand
City: Auckland
Period: 9/05/22 – 10/05/22

Keywords

  • Bootstrapping Error
  • Confidence Constrained RL
  • Degradation
  • Distributional Shift
  • Extrapolation Error
  • Offline RL
  • Out-Of-Distribution
  • Reinforcement Learning
  • Warm Start

All Science Journal Classification (ASJC) codes

  • Artificial Intelligence
  • Software
