TY - GEN
T1 - Delay as Payoff in MAB
AU - Schlisselberg, Ofir
AU - Cohen, Ido
AU - Lancewicki, Tal
AU - Mansour, Yishay
N1 - Publisher Copyright: Copyright © 2025, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
PY - 2025/4/11
Y1 - 2025/4/11
AB - In this paper, we investigate a variant of the classical stochastic Multi-armed Bandit (MAB) problem, where the payoff received by an agent (either cost or reward) is both delayed and directly corresponds to the magnitude of the delay. This setting faithfully models many real-world scenarios, such as the time it takes for a data packet to traverse a network given a choice of route (where delay serves as the agent’s cost), or a user’s time spent on a web page given a choice of content (where delay serves as the agent’s reward). Our main contributions are tight upper and lower bounds for both the cost and reward settings. For the case that delays serve as costs, which we are the first to consider, we prove optimal regret that scales as $\sum_{i: \Delta_i > 0} \frac{\log T}{\Delta_i} + d^*$, where $T$ is the maximal number of steps, $\Delta_i$ are the sub-optimality gaps, and $d^*$ is the minimal expected delay amongst arms. For the case that delays serve as rewards, we show optimal regret of $\sum_{i: \Delta_i > 0} \frac{\log T}{\Delta_i} + \bar{d}$, where $\bar{d}$ is the second maximal expected delay. These improve over the regret in the general delay-dependent payoff setting, which scales as $\sum_{i: \Delta_i > 0} \frac{\log T}{\Delta_i} + D$, where $D$ is the maximum possible delay. Our regret bounds highlight the difference between the cost and reward scenarios, showing that the improvement in the cost scenario is more significant than for the reward. Finally, we accompany our theoretical results with an empirical evaluation.
UR - http://www.scopus.com/inward/record.url?scp=105004275668&partnerID=8YFLogxK
U2 - 10.1609/aaai.v39i19.34237
DO - 10.1609/aaai.v39i19.34237
M3 - Conference contribution
T3 - Proceedings of the AAAI Conference on Artificial Intelligence
SP - 20310
EP - 20317
BT - Special Track on AI Alignment
A2 - Walsh, Toby
A2 - Shah, Julie
A2 - Kolter, Zico
T2 - 39th Annual AAAI Conference on Artificial Intelligence, AAAI 2025
Y2 - 25 February 2025 through 4 March 2025
ER -