Nonlinear distributional gradient temporal-difference learning

Chao Qu, Shie Mannor, Huan Xu

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > peer-review

Abstract

We devise a distributional variant of gradient temporal-difference (TD) learning. Distributional reinforcement learning (RL) has been shown to outperform regular (non-distributional) RL in a recent study (Bellemare et al., 2017a). In the policy evaluation setting, we design two new algorithms, distributional GTD2 and distributional TDC, by applying the Cramér distance to the distributional version of the Bellman error objective function; these algorithms inherit the advantages of both nonlinear gradient TD algorithms and the distributional RL approach. In the control setting, we propose distributional Greedy-GQ using a similar derivation. We prove the asymptotic almost-sure convergence of distributional GTD2 and TDC to a locally optimal solution for general smooth function approximators, a class that includes the neural networks widely used in recent work on real-life RL problems. The per-step computational complexity of each of the three algorithms is linear in the number of parameters of the function approximator, so they can be implemented efficiently for neural networks.
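As a rough illustration of the distance the abstract refers to, the sketch below computes the squared Cramér distance between two categorical distributions, that is, the integral of the squared difference between their cumulative distribution functions. It is not the paper's algorithm: the function name `cramer_distance_sq`, the evenly spaced shared support, and the example vectors are all assumptions made for this illustration.

```python
from itertools import accumulate


def cramer_distance_sq(p, q, delta_z=1.0):
    """Squared Cramér distance between two categorical distributions.

    Assumption: p and q are probability vectors over the same evenly
    spaced atoms (spacing delta_z). The CDFs are then step functions on
    a shared grid, so the integral of their squared difference reduces
    to a finite sum.
    """
    if len(p) != len(q):
        raise ValueError("p and q must share the same support")
    cdf_p = list(accumulate(p))
    cdf_q = list(accumulate(q))
    # Both CDFs end at 1, so the final term contributes zero.
    return delta_z * sum((a - b) ** 2 for a, b in zip(cdf_p, cdf_q))


# Hypothetical predicted and target return distributions on three atoms.
predicted = [0.5, 0.5, 0.0]
target = [0.0, 0.5, 0.5]
print(cramer_distance_sq(predicted, target))  # prints 0.5
```

Because the quantity is a sum of squared CDF differences, it is smooth in the predicted probabilities, which is one reason such a distance can pair naturally with gradient-based updates.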

Original language: English
Title of host publication: 36th International Conference on Machine Learning, ICML 2019
Pages: 9168-9177
Number of pages: 10
ISBN (Electronic): 9781510886988
State: Published - 2019
Event: 36th International Conference on Machine Learning, ICML 2019 - Long Beach, United States
Duration: 9 Jun 2019 to 15 Jun 2019

Publication series

Name: 36th International Conference on Machine Learning, ICML 2019
Volume: 2019-June

Conference

Conference: 36th International Conference on Machine Learning, ICML 2019
Country/Territory: United States
City: Long Beach
Period: 9/06/19 to 15/06/19

All Science Journal Classification (ASJC) codes

  • Education
  • Human-Computer Interaction
  • Computer Science Applications
