Improper Reinforcement Learning with Gradient-based Policy Optimization

Mohammadi Zaki, Avinash Mohan, Aditya Gopalan, Shie Mannor

Research output: Contribution to journal › Conference article › peer-review

Abstract

We consider an improper reinforcement learning setting where a learner is given M base controllers
for an unknown Markov decision process, and wishes to combine them optimally to produce a potentially
new controller that can outperform each of the base ones. This can be useful for tuning across controllers,
possibly learnt in mismatched or simulated environments, to obtain a good controller for a given target
environment with relatively few trials.
We propose a gradient-based approach that operates over a class of improper mixtures of the controllers.
We derive convergence rate guarantees for the approach assuming access to a gradient oracle. The value
function of the mixture and its gradient may not be available in closed-form; however, we show that we
can employ rollouts and simultaneous perturbation stochastic approximation (SPSA) for explicit gradient
descent optimization. Numerical results on (i) the standard control theoretic benchmark of stabilizing
an inverted pendulum and (ii) a constrained queueing task show that our improper policy optimization
algorithm can stabilize the system even when the base policies at its disposal are unstable.
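
As a rough illustration only (not the authors' implementation), the sketch below shows how rollouts and a two-sided SPSA estimate could drive gradient ascent on the logits of a softmax mixture over the M base controllers. The environment interface (env_reset, env_step), the softmax parameterization, the rollout horizon, and the step sizes are all assumptions made for this example.

```python
import numpy as np

def mixture_action(state, weights, base_controllers, rng):
    """Sample a base controller from the softmax mixture and apply it to the state."""
    logits = weights - weights.max()
    probs = np.exp(logits) / np.exp(logits).sum()
    k = rng.choice(len(base_controllers), p=probs)
    return base_controllers[k](state)

def rollout_return(weights, base_controllers, env_reset, env_step, horizon, gamma, rng):
    """Estimate the discounted return of the mixture policy from a single rollout."""
    state = env_reset()
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        action = mixture_action(state, weights, base_controllers, rng)
        state, reward, done = env_step(state, action)
        total += discount * reward
        discount *= gamma
        if done:
            break
    return total

def spsa_gradient(weights, value_fn, delta, rng):
    """Two-sided SPSA gradient estimate using a Rademacher (+/-1) perturbation."""
    perturb = rng.choice([-1.0, 1.0], size=weights.shape)
    v_plus = value_fn(weights + delta * perturb)
    v_minus = value_fn(weights - delta * perturb)
    return (v_plus - v_minus) / (2.0 * delta) * (1.0 / perturb)

def improper_policy_optimization(base_controllers, env_reset, env_step,
                                 iterations=200, horizon=100, gamma=0.99,
                                 step_size=0.05, delta=0.1, seed=0):
    """Gradient ascent on the mixture logits using SPSA estimates of the value gradient."""
    rng = np.random.default_rng(seed)
    weights = np.zeros(len(base_controllers))  # start from the uniform mixture
    value_fn = lambda w: rollout_return(w, base_controllers, env_reset,
                                        env_step, horizon, gamma, rng)
    for _ in range(iterations):
        grad = spsa_gradient(weights, value_fn, delta, rng)
        weights += step_size * grad  # ascent step: maximize the estimated return
    return weights
```

Each iteration needs only two rollouts to form the gradient estimate, regardless of the number of base controllers, which is the usual appeal of SPSA when the mixture's value function has no closed form; the returned logits define the improper mixture through the softmax.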
Original language: English
Journal: International Conference on Machine Learning
State: Published - 2022
