TY - GEN
T1 - Generalized emphatic temporal difference learning
T2 - 30th AAAI Conference on Artificial Intelligence, AAAI 2016
AU - Hallak, Assaf
AU - Tamar, Aviv
AU - Munos, Remi
AU - Mannor, Shie
N1 - Publisher Copyright: © Copyright 2016, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
PY - 2016
Y1 - 2016
N2 - We consider the off-policy evaluation problem in Markov decision processes with function approximation. We propose a generalization of the recently introduced emphatic temporal differences (ETD) algorithm (Sutton, Mahmood, and White 2015), which encompasses the original ETD(λ), as well as several other off-policy evaluation algorithms as special cases. We call this framework ETD(λ, β), where our introduced parameter β controls the decay rate of an importance-sampling term. We study conditions under which the projected fixed-point equation underlying ETD(λ, β) involves a contraction operator, allowing us to present the first asymptotic error bounds (bias) for ETD(λ, β). Our results show that the original ETD algorithm always involves a contraction operator, and its bias is bounded. Moreover, by controlling β, our proposed generalization allows trading off bias for variance reduction, thereby achieving a lower total error.
AB - We consider the off-policy evaluation problem in Markov decision processes with function approximation. We propose a generalization of the recently introduced emphatic temporal differences (ETD) algorithm (Sutton, Mahmood, and White 2015), which encompasses the original ETD(λ), as well as several other off-policy evaluation algorithms as special cases. We call this framework ETD(λ, β), where our introduced parameter β controls the decay rate of an importance-sampling term. We study conditions under which the projected fixed-point equation underlying ETD(λ, β) involves a contraction operator, allowing us to present the first asymptotic error bounds (bias) for ETD(λ, β). Our results show that the original ETD algorithm always involves a contraction operator, and its bias is bounded. Moreover, by controlling β, our proposed generalization allows trading off bias for variance reduction, thereby achieving a lower total error.
UR - http://www.scopus.com/inward/record.url?scp=85007179176&partnerID=8YFLogxK
M3 - Conference contribution
T3 - 30th AAAI Conference on Artificial Intelligence, AAAI 2016
SP - 1631
EP - 1637
BT - 30th AAAI Conference on Artificial Intelligence, AAAI 2016
Y2 - 12 February 2016 through 17 February 2016
ER -