Abstract
We consider episodic reinforcement learning in reward-mixing Markov decision processes (RMMDPs): at the beginning of every episode, nature randomly picks a latent reward model among M candidates, and an agent interacts with the MDP throughout the episode for H time steps. Our goal is to learn a near-optimal policy that nearly maximizes the H time-step cumulative reward in such a model. Prior work (Kwon et al., 2021a) established an upper bound for RMMDPs with M = 2. In this work, we resolve several open questions for the general RMMDP setting. We consider an arbitrary M ≥ 2 and provide a sample-efficient algorithm, EM2, that outputs an ϵ-optimal policy using O(ϵ^{-2} · S^d A^d · poly(H, Z)^d) episodes, where S and A are the numbers of states and actions respectively, H is the time horizon, Z is the support size of the reward distributions, and d = O(min(M, H)). We also provide an (SA)^{Ω(√M)} / ϵ^2 lower bound, supporting that super-polynomial sample complexity in M is necessary.
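
As a rough illustration of the interaction protocol described in the abstract, the following Python sketch simulates one episode of a reward-mixing MDP: a latent reward model is drawn once per episode and hidden from the agent, while the transition dynamics are shared across contexts. This is only a minimal sketch of the setting, not the paper's EM2 algorithm; the parameter names (P, R, mix, policy) and the tabular setup are hypothetical placeholders.

```python
import numpy as np

def run_episode(P, R, mix, policy, H, rng):
    """Simulate one H-step episode of a reward-mixing MDP (illustrative sketch).

    P:      transition probabilities, shape (S, A, S), shared by all contexts
    R:      per-context mean rewards, shape (M, S, A)
    mix:    mixing distribution over the M latent reward models
    policy: action probabilities per (timestep, state), shape (H, S, A)
    """
    S, A = P.shape[0], P.shape[1]
    m = rng.choice(len(mix), p=mix)   # latent reward model, fixed for the whole episode
    s = 0                             # fixed initial state for simplicity
    total = 0.0
    for h in range(H):
        a = rng.choice(A, p=policy[h, s])
        total += R[m, s, a]           # reward depends on the hidden context m
        s = rng.choice(S, p=P[s, a])  # transitions do not depend on m
    return total

# Tiny usage example with random (hypothetical) parameters.
rng = np.random.default_rng(0)
S, A, H, M = 3, 2, 5, 2
P = rng.dirichlet(np.ones(S), size=(S, A))       # shape (S, A, S)
R = rng.random((M, S, A))
mix = np.ones(M) / M
policy = rng.dirichlet(np.ones(A), size=(H, S))  # shape (H, S, A)
print(run_episode(P, R, mix, policy, H, rng))
```

The sketch highlights why the problem is harder than a standard MDP: the agent only ever observes rewards generated by the hidden context m, so the H-step return must be optimized against the mixture over contexts.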
| Original language | English |
| --- | --- |
| Pages (from-to) | 18057-18082 |
| Number of pages | 26 |
| Journal | Proceedings of Machine Learning Research |
| Volume | 202 |
| State | Published - 2023 |
| Event | 40th International Conference on Machine Learning, ICML 2023 - Honolulu, United States (23 Jul 2023 - 29 Jul 2023) |
All Science Journal Classification (ASJC) codes
- Artificial Intelligence
- Software
- Control and Systems Engineering
- Statistics and Probability