Reinforcement Learning with Trajectory Feedback

Yonathan Efroni, Nadav Merlis, Shie Mannor

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review

Abstract

The standard feedback model of reinforcement learning requires revealing the reward of every visited state-action pair. However, in practice, it is often the case that such frequent feedback is not available. In this work, we take a first step towards relaxing this assumption and require a weaker form of feedback, which we refer to as trajectory feedback. Instead of observing the reward obtained after every action, we assume we only receive a score that represents the quality of the whole trajectory observed by the agent, namely, the sum of all rewards obtained over this trajectory. We extend reinforcement learning algorithms to this setting, based on least-squares estimation of the unknown reward, for both the known and unknown transition model cases, and study the performance of these algorithms by analyzing their regret. For cases where the transition model is unknown, we offer a hybrid optimistic-Thompson Sampling approach that results in a tractable algorithm.
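Below is a minimal sketch of the kind of least-squares reward estimation the abstract refers to, assuming a tabular episodic setting in which each episode yields a visit-count vector over state-action pairs together with a single scalar trajectory score (the sum of the hidden per-step rewards). The function name, the ridge coefficient, and the array layout are illustrative choices for this sketch, not the paper's notation or exact algorithm.

import numpy as np

def estimate_reward_least_squares(visit_counts, returns, reg=1.0):
    """Ridge-regularized least-squares estimate of the per-(s,a) reward.

    visit_counts: (K, SA) array; row k counts how often each state-action
                  pair was visited in episode k (the trajectory's features).
    returns:      (K,) array; the single trajectory-feedback score observed
                  at the end of episode k (sum of the hidden rewards).
    reg:          ridge coefficient (an illustrative default, not from the paper).
    """
    X = np.asarray(visit_counts, dtype=float)
    y = np.asarray(returns, dtype=float)
    sa_dim = X.shape[1]
    # Solve (X^T X + reg * I) r_hat = X^T y for the reward vector r_hat.
    gram = X.T @ X + reg * np.eye(sa_dim)
    r_hat = np.linalg.solve(gram, X.T @ y)
    # The Gram matrix can be reused to build confidence widths (optimism)
    # or a Gaussian posterior covariance (Thompson Sampling) over rewards.
    return r_hat, gram

In this sketch, optimism- or Thompson-Sampling-style exploration would plug r_hat and the inverse Gram matrix into the planning step; how that is combined with the transition-model estimate is where the known- and unknown-transition cases in the paper differ.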

Original language: English
Title of host publication: 35th AAAI Conference on Artificial Intelligence, AAAI 2021
Pages: 7288-7295
Number of pages: 8
ISBN (Electronic): 9781713835974
State: Published - 2021
Event: 35th AAAI Conference on Artificial Intelligence, AAAI 2021 - Virtual, Online
Duration: 2 Feb 2021 - 9 Feb 2021

Publication series

Name: 35th AAAI Conference on Artificial Intelligence, AAAI 2021
Volume: 8B

Conference

Conference: 35th AAAI Conference on Artificial Intelligence, AAAI 2021
City: Virtual, Online
Period: 2/02/21 - 9/02/21

All Science Journal Classification (ASJC) codes

  • Artificial Intelligence
