TY - GEN

T1 - Minimax Regret for Stochastic Shortest Path

AU - Cohen, Alon

AU - Efroni, Yonathan

AU - Mansour, Yishay

AU - Rosenberg, Aviv

N1 - Publisher Copyright: © 2021 Neural information processing systems foundation. All rights reserved.

PY - 2021

Y1 - 2021

N2 - We study the Stochastic Shortest Path (SSP) problem in which an agent has to reach a goal state in minimum total expected cost. In the learning formulation of the problem, the agent has no prior knowledge about the costs and dynamics of the model. She repeatedly interacts with the model for K episodes, and has to minimize her regret. In this work we show that the minimax regret for this setting is Õ(√((B*² + B*)|S||A|K)), where B* is a bound on the expected cost of the optimal policy from any state, S is the state space, and A is the action space. This matches the Ω(√(B*²|S||A|K)) lower bound of Rosenberg et al. [2020] for B* ≥ 1, and improves their regret bound by a factor of √|S|. For B* < 1 we prove a matching lower bound of Ω(√(B*|S||A|K)). Our algorithm is based on a novel reduction from SSP to finite-horizon MDPs. To that end, we provide an algorithm for the finite-horizon setting whose leading term in the regret depends polynomially on the expected cost of the optimal policy and only logarithmically on the horizon.

UR - http://www.scopus.com/inward/record.url?scp=85131872495&partnerID=8YFLogxK

M3 - Conference contribution

T3 - Advances in Neural Information Processing Systems

SP - 28350

EP - 28361

BT - Advances in Neural Information Processing Systems 34 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021

A2 - Ranzato, Marc'Aurelio

A2 - Beygelzimer, Alina

A2 - Dauphin, Yann

A2 - Liang, Percy S.

A2 - Wortman Vaughan, Jenn

T2 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021

Y2 - 6 December 2021 through 14 December 2021

ER -