Model-Ensemble Trust-Region Policy Optimization

Thanard Kurutach, Ignasi Clavera, Yan Duan, Aviv Tamar, Pieter Abbeel

Research output: Contribution to conference › Paper › peer-review

Abstract

Model-free reinforcement learning (RL) methods are succeeding in a growing number of tasks, aided by recent advances in deep learning. However, they tend to suffer from high sample complexity which hinders their use in real-world domains. Alternatively, model-based reinforcement learning promises to reduce sample complexity, but tends to require careful tuning and, to date, it has succeeded mainly in restrictive domains where simple models are sufficient for learning. In this paper, we analyze the behavior of vanilla model-based reinforcement learning methods when deep neural networks are used to learn both the model and the policy, and we show that the learned policy tends to exploit regions where insufficient data is available for the model to be learned, causing instability in training. To overcome this issue, we propose to use an ensemble of models to maintain the model uncertainty and regularize the learning process. We further show that the use of likelihood ratio derivatives yields much more stable learning than backpropagation through time.
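The abstract's central idea is to fit several dynamics models and treat their disagreement as a measure of model uncertainty, so the policy is not optimized against regions the models have not seen. The sketch below is not the authors' code; it illustrates the ensemble-and-disagreement idea with simple least-squares models on toy data (the paper uses neural network models), and every function name here is illustrative.

```python
# Minimal sketch of the ensemble idea from the abstract (illustrative only).
# Each ensemble member is fit on a bootstrap resample of the transition data;
# the spread of their predictions serves as a proxy for model uncertainty.
import numpy as np

rng = np.random.default_rng(0)

def fit_linear_model(states, actions, next_states):
    """Least-squares fit of next_state ~ [state, action, 1] (one ensemble member)."""
    X = np.hstack([states, actions, np.ones((len(states), 1))])
    W, *_ = np.linalg.lstsq(X, next_states, rcond=None)
    return W

def predict(W, state, action):
    x = np.concatenate([state, action, [1.0]])
    return x @ W

# Toy transition data from an unknown linear system (stand-in for real rollouts).
S = rng.normal(size=(500, 3))
A = rng.normal(size=(500, 1))
true_W = rng.normal(size=(5, 3))
S_next = np.hstack([S, A, np.ones((500, 1))]) @ true_W + 0.01 * rng.normal(size=(500, 3))

# Ensemble: each member sees a different bootstrap resample of the data.
ensemble = []
for _ in range(5):
    idx = rng.integers(0, len(S), size=len(S))
    ensemble.append(fit_linear_model(S[idx], A[idx], S_next[idx]))

# Disagreement across members flags state-action regions with little data,
# where a learned policy could otherwise exploit model error.
s, a = rng.normal(size=3), rng.normal(size=1)
preds = np.stack([predict(W, s, a) for W in ensemble])
print("ensemble mean prediction:", preds.mean(axis=0))
print("ensemble disagreement (std):", preds.std(axis=0))
```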

Original language: English
State: Published - 2018
Externally published: Yes
Event: 6th International Conference on Learning Representations, ICLR 2018 - Vancouver, Canada
Duration: 30 Apr 2018 - 3 May 2018

Conference

Conference: 6th International Conference on Learning Representations, ICLR 2018
Country/Territory: Canada
City: Vancouver
Period: 30/04/18 - 3/05/18

All Science Journal Classification (ASJC) codes

  • Language and Linguistics
  • Education
  • Computer Science Applications
  • Linguistics and Language
