TY - GEN
T1 - RL for Latent MDPs
T2 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021
AU - Kwon, Jeongyeol
AU - Efroni, Yonathan
AU - Caramanis, Constantine
AU - Mannor, Shie
N1 - Publisher Copyright: © 2021 Neural information processing systems foundation. All rights reserved.
PY - 2021
Y1 - 2021
N2 - In this work, we consider the regret minimization problem for reinforcement learning in latent Markov Decision Processes (LMDP). In an LMDP, an MDP is randomly drawn from a set of M possible MDPs at the beginning of the interaction, but the identity of the chosen MDP is not revealed to the agent. We first show that a general instance of LMDPs requires at least Ω((SA)M ) episodes to even approximate the optimal policy. Then, we consider sufficient assumptions under which learning good policies requires polynomial number of episodes. We show that the key link is a notion of separation between the MDP system dynamics. With sufficient separation, we provide an efficient algorithm with local guarantee, i.e., providing a sublinear regret guarantee when we are given a good initialization. Finally, if we are given standard statistical sufficiency assumptions common in the Predictive State Representation (PSR) literature (e.g., [6]) and a reachability assumption, we show that the need for initialization can be removed.
AB - In this work, we consider the regret minimization problem for reinforcement learning in latent Markov Decision Processes (LMDP). In an LMDP, an MDP is randomly drawn from a set of M possible MDPs at the beginning of the interaction, but the identity of the chosen MDP is not revealed to the agent. We first show that a general instance of LMDPs requires at least Ω((SA)M ) episodes to even approximate the optimal policy. Then, we consider sufficient assumptions under which learning good policies requires polynomial number of episodes. We show that the key link is a notion of separation between the MDP system dynamics. With sufficient separation, we provide an efficient algorithm with local guarantee, i.e., providing a sublinear regret guarantee when we are given a good initialization. Finally, if we are given standard statistical sufficiency assumptions common in the Predictive State Representation (PSR) literature (e.g., [6]) and a reachability assumption, we show that the need for initialization can be removed.
UR - http://www.scopus.com/inward/record.url?scp=85126995329&partnerID=8YFLogxK
M3 - منشور من مؤتمر
T3 - Advances in Neural Information Processing Systems
SP - 24523
EP - 24534
BT - Advances in Neural Information Processing Systems 34 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021
A2 - Ranzato, Marc'Aurelio
A2 - Beygelzimer, Alina
A2 - Dauphin, Yann
A2 - Liang, Percy S.
A2 - Wortman Vaughan, Jenn
Y2 - 6 December 2021 through 14 December 2021
ER -