Reward-Mixing MDPs with Few Latent Contexts are Learnable

Jeongyeol Kwon, Yonathan Efroni, Constantine Caramanis, Shie Mannor

Research output: Contribution to journal › Conference article › peer-review

Abstract

We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs): at the beginning of every episode, nature randomly picks a latent reward model among M candidates, and an agent interacts with the MDP throughout the episode for H time steps. Our goal is to learn a near-optimal policy that nearly maximizes the H time-step cumulative rewards in such a model. Prior work (Kwon et al., 2021a) established an upper bound for RMMDPs with M = 2. In this work, we resolve several open questions for the general RMMDP setting. We consider an arbitrary M ≥ 2 and provide a sample-efficient algorithm, EM², that outputs an ϵ-optimal policy using O(ϵ^-2 · S^d A^d · poly(H, Z)^d) episodes, where S and A are the number of states and actions respectively, H is the time horizon, Z is the support size of the reward distributions, and d = O(min(M, H)). We also provide a (SA)^{Ω(√M)} lower bound, supporting that super-polynomial sample complexity in M is necessary.
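For readability, the two bounds stated in the abstract can be typeset as follows (this is only a restatement of the abstract's claims, using the same symbols S, A, H, Z, M, d, and ϵ; no new results are asserted):

\[
  \text{upper bound: } O\!\left(\epsilon^{-2} \cdot S^{d} A^{d} \cdot \mathrm{poly}(H, Z)^{d}\right) \text{ episodes, with } d = O(\min(M, H)),
\]
\[
  \text{lower bound: } (SA)^{\Omega(\sqrt{M})} \text{ episodes.}
\]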

Original language: English
Pages (from-to): 18057-18082
Number of pages: 26
Journal: Proceedings of Machine Learning Research
Volume: 202
State: Published - 2023
Event: 40th International Conference on Machine Learning, ICML 2023 - Honolulu, United States
Duration: 23 Jul 2023 - 29 Jul 2023

All Science Journal Classification (ASJC) codes

  • Artificial Intelligence
  • Software
  • Control and Systems Engineering
  • Statistics and Probability
