Learning Safe Policies with Cost-sensitive Advantage Estimation

Bingyi Kang, Shie Mannor, Jiashi Feng

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › peer-review


Reinforcement Learning (RL) with safety guarantees is critical for agents performing tasks in risky environments. Recent safe RL algorithms, developed based on the Constrained Markov Decision Process (CMDP), mostly treat the safety requirement as an additional constraint when learning to maximize the return. However, they usually make unnecessary compromises in return for safety and learn only sub-optimal policies, due to their inability to differentiate between safe and unsafe state-actions with high rewards. To address this, we propose Cost-sensitive Advantage Estimation (CSAE), which is simple to deploy for policy optimization and effective for guiding the agents to avoid unsafe state-actions by penalizing their advantage value properly. Moreover, for stronger safety guarantees, we develop a Worst-case Constrained Markov Decision Process (WCMDP) method that augments CMDP by constraining the worst-case safety cost instead of the average one. With CSAE and WCMDP, we develop new safe RL algorithms with theoretical justifications of their benefits for the safety and performance of the obtained policies. Extensive experiments clearly demonstrate the superiority of our algorithms in learning safer and better agents under multiple settings.
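To make the core idea of cost-sensitive advantage estimation concrete, here is a minimal sketch: compute a standard (GAE-style) reward advantage and a cost advantage per trajectory, then penalize the reward advantage where the cost advantage is positive. The additive combination, the `penalty` weight, and all function names are illustrative assumptions, not the paper's exact formulation.

```python
def gae(rewards, values, gamma=0.99, lam=0.95):
    # Standard generalized advantage estimation over one finite trajectory.
    adv = [0.0] * len(rewards)
    last = 0.0
    for t in reversed(range(len(rewards))):
        next_v = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_v - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

def cost_sensitive_advantage(rewards, costs, values, cost_values, penalty=1.0):
    # Illustrative CSAE-style estimate: penalize the reward advantage of
    # state-actions whose cost advantage is positive, so that high-reward
    # but unsafe actions stop looking attractive to the policy update.
    # The additive form and the `penalty` weight are assumptions.
    r_adv = gae(rewards, values)
    c_adv = gae(costs, cost_values)
    return [r - penalty * max(c, 0.0) for r, c in zip(r_adv, c_adv)]
```

A policy-gradient learner would then use the penalized advantages in place of the plain ones, leaving the rest of the optimization loop (e.g. PPO or TRPO updates) unchanged.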
Original language: English
Title of host publication: ICLR 2021 Conference
State: Published - 2021
Event: Ninth International Conference on Learning Representations - Virtual
Duration: 3 May 2021 – 7 May 2021
Conference number: 9th


Conference: Ninth International Conference on Learning Representations
Abbreviated title: ICLR


