TY - GEN

T1 - Minimax Regret for Stochastic Shortest Path

AU - Cohen, Alon

AU - Efroni, Yonathan

AU - Mansour, Yishay

AU - Rosenberg, Aviv

N1 - Publisher Copyright: © 2021 Neural information processing systems foundation. All rights reserved.

PY - 2021

Y1 - 2021

N2 - We study the Stochastic Shortest Path (SSP) problem in which an agent has to reach a goal state in minimum total expected cost. In the learning formulation of the problem, the agent has no prior knowledge about the costs and dynamics of the model. She repeatedly interacts with the model for K episodes, and has to minimize her regret. In this work we show that the minimax regret for this setting is Õ(√((B*² + B*)|S||A|K)), where B* is a bound on the expected cost of the optimal policy from any state, S is the state space, and A is the action space. This matches the Ω(√(B*²|S||A|K)) lower bound of Rosenberg et al. [2020] for B* ≥ 1, and improves their regret bound by a factor of √|S|. For B* < 1 we prove a matching lower bound of Ω(√(B*|S||A|K)). Our algorithm is based on a novel reduction from SSP to finite-horizon MDPs. To that end, we provide an algorithm for the finite-horizon setting whose leading term in the regret depends polynomially on the expected cost of the optimal policy and only logarithmically on the horizon.

UR - http://www.scopus.com/inward/record.url?scp=85131872495&partnerID=8YFLogxK

M3 - Conference contribution

T3 - Advances in Neural Information Processing Systems

SP - 28350

EP - 28361

BT - Advances in Neural Information Processing Systems 34 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021

A2 - Ranzato, Marc'Aurelio

A2 - Beygelzimer, Alina

A2 - Dauphin, Yann

A2 - Liang, Percy S.

A2 - Wortman Vaughan, Jenn

T2 - 35th Conference on Neural Information Processing Systems, NeurIPS 2021

Y2 - 6 December 2021 through 14 December 2021

ER -