ECAI 2025 Tutorial: Formal Methods for Safe Reinforcement Learning
Dr. Edwin Hamel-De le Court, Dr. Francesco Belardinelli
At the end of the tutorial, the audience will walk away with the basics of RL and safe RL and with a good intuition of the properties expressible in Temporal Logic (TL), so that they may recognise the real-world applications in which RL with TL specifications can be of use. Furthermore, the audience will have a good understanding of the basics of TL model checking (i.e. how to check that a system satisfies a TL specification) and of reactive synthesis (i.e. how to compute a strategy for a known system so that it satisfies a TL specification). Finally, the audience will have a good understanding of how to use the tools developed thus far to design RL algorithms with TL objectives or constraints, with a focus on how to obtain formal guarantees that the TL constraints are satisfied. Several state-of-the-art methods will thus be introduced (e.g., constrained MDPs, shielding), with a focus on those that yield formal guarantees. A wide range of real-life scenarios will be used throughout the tutorial to illustrate the core concepts of RL and LTL, and to show how to apply the tools and methods developed. These examples will range from robot motion planning to designing a controller for water tanks, by way of developing an AI for games such as Pac-Man.
Background: RL
We introduce the main problem considered in RL and quickly mention state-of-the-art algorithms that are commonly used for RL, in particular DQN [1] and PPO [2].
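As a concrete point of reference for this part, the sketch below shows tabular Q-learning in a few lines. It is only an illustration for this proposal, and the Gym-style environment interface (reset(), step(action) returning next state, reward and a termination flag, plus a finite list env.actions) is an assumption, not part of the tutorial material.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=5000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Minimal tabular Q-learning; env is assumed to expose a Gym-style
    reset()/step(action) interface and a finite list env.actions."""
    Q = defaultdict(float)  # maps (state, action) -> estimated return

    for _ in range(episodes):
        state, done = env.reset(), False
        while not done:
            # epsilon-greedy action selection
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # one-step temporal-difference update
            best_next = max(Q[(next_state, a)] for a in env.actions)
            target = reward + gamma * (0.0 if done else best_next)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```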
We introduce in a similar way the problem of constrained RL, where the constraint is given by an expected discounted cumulative cost, which is the most common setting; we mention state-of-the-art algorithms for constrained RL, such as PPO-Lagrangian [3] and CPO [4], and discuss their limitations, especially in terms of formal guarantees.
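For concreteness, the constrained RL problem discussed here can be written as below (a standard formulation, with reward r_t, cost c_t, discount factor γ and cost budget d); Lagrangian methods such as PPO-Lagrangian [3] optimise the associated penalised objective while adjusting the multiplier λ ≥ 0 by dual ascent.

```latex
\max_{\pi} \; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right]
\quad \text{subject to} \quad
\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} c_t\right] \le d,
\qquad
L(\pi,\lambda) \;=\;
\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} r_t\right]
\;-\; \lambda\!\left(\mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t} c_t\right] - d\right).
```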
Background: Temporal Logic [5]
We introduce Linear Temporal Logic (LTL), and give an intuition of the properties expressible in that logic.
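A few example properties of the kind used in the tutorial's running scenarios, written in LTL (the atomic propositions collision, goal, request, grant and recharge are illustrative):

```latex
\begin{align*}
  &\mathbf{G}\,\neg\mathit{collision} && \text{safety: a collision never occurs}\\
  &\mathbf{F}\,\mathit{goal} && \text{reachability: the goal is eventually reached}\\
  &\mathbf{G}\,(\mathit{request} \rightarrow \mathbf{F}\,\mathit{grant}) && \text{response: every request is eventually granted}\\
  &\mathbf{G}\,\mathbf{F}\,\mathit{recharge} && \text{recurrence: the battery is recharged infinitely often}
\end{align*}
```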
We introduce automata over infinite words and show how to transform an LTL formula into an automaton.
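To give a feel for the target of this construction, the toy sketch below hand-codes a deterministic monitor for the reach-avoid formula (¬unsafe) U goal; the tutorial shows how such automata (in general, automata over infinite words) are obtained systematically from arbitrary LTL formulas. The encoding is an illustrative assumption, not the construction used in the lectures.

```python
# Hand-written deterministic monitor for the formula (not unsafe) U goal:
# the goal must be reached, and 'unsafe' must not hold at any earlier step.
# A label is the set of atomic propositions that hold in the current step.

def automaton_step(q, label):
    if q == 'trying':
        if 'goal' in label:
            return 'done'      # objective achieved
        if 'unsafe' in label:
            return 'failed'    # violated before reaching the goal
        return 'trying'
    return q                   # 'done' and 'failed' are absorbing

ACCEPTING = {'done'}

# Example: the finite prefix {} . {} . {goal} already guarantees satisfaction.
q = 'trying'
for label in [set(), set(), {'goal'}]:
    q = automaton_step(q, label)
print(q in ACCEPTING)  # True
```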
We introduce the model-checking algorithm for LTL specifications.
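For the safety fragment, model checking reduces to a reachability check on the product of the system and a monitor automaton (full LTL model checking, as covered in the tutorial, additionally needs Büchi automata and a nested search). Below is a minimal sketch of that restricted case, reusing a monitor such as the one above; the data-structure choices are illustrative.

```python
from collections import deque

def check_safety(init_states, transitions, labeling, monitor_step, monitor_init, bad):
    """Explore the product of a finite transition system and a deterministic
    monitor; report a violation as soon as the monitor enters a 'bad' state.
    transitions: dict state -> list of successor states
    labeling:    dict state -> set of atomic propositions
    monitor_step(q, label) -> next monitor state; bad: set of rejecting states."""
    frontier = deque((s, monitor_step(monitor_init, labeling[s])) for s in init_states)
    visited = set(frontier)
    while frontier:
        s, q = frontier.popleft()
        if q in bad:
            return False  # some reachable finite prefix violates the property
        for s2 in transitions[s]:
            q2 = monitor_step(q, labeling[s2])
            if (s2, q2) not in visited:
                visited.add((s2, q2))
                frontier.append((s2, q2))
    return True  # no bad prefix is reachable: the property holds
```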
We introduce the synthesis algorithm for LTL (safety) specifications.
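At the core of the synthesis algorithm for safety specifications is a greatest-fixed-point computation of the winning region of a safety game. A minimal sketch follows; the game representation via actions(s) and post(s, a), with the environment resolving the nondeterminism in post, is an illustrative assumption.

```python
def solve_safety_game(safe, actions, post):
    """Compute the winning region of a safety game: the largest set of safe
    states from which some controller action keeps the play inside the set,
    whatever successor the environment picks.
    actions(s) -> iterable of controller actions; post(s, a) -> set of successors."""
    win = set(safe)
    changed = True
    while changed:
        changed = False
        for s in list(win):
            if not any(post(s, a) <= win for a in actions(s)):
                win.remove(s)   # the environment can force the play out of 'win'
                changed = True
    # A winning strategy: in each state of 'win', pick any action a with post(s, a) ⊆ win.
    return win
```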
We introduce Probabilistic LTL and go through how to adapt the previous model-checking algorithm to that case.
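After building the product of the MDP and the automaton, the quantitative question becomes one of maximum reachability probability, which can be approximated by value iteration. Below is a minimal sketch under that reduction; the representation of the MDP via actions(s) and delta(s, a) is an illustrative assumption, and exact algorithms also handle end components and convergence guarantees, as discussed in the tutorial.

```python
def max_reach_probability(states, targets, actions, delta, n_iters=1000):
    """Value iteration for the maximum probability of eventually reaching a
    target set in a finite MDP.
    actions(s) -> iterable of actions; delta(s, a) -> dict successor -> probability."""
    p = {s: (1.0 if s in targets else 0.0) for s in states}
    for _ in range(n_iters):
        for s in states:
            if s in targets:
                continue
            p[s] = max(
                (sum(prob * p[t] for t, prob in delta(s, a).items()) for a in actions(s)),
                default=0.0,
            )
    return p
```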
RL with LTL objectives: we show how this problem differs from the classical RL problem, and explain how to deal with its specific difficulties, such as the infinite undiscounted horizon and sparse rewards. In particular, we explain why, for an infinite undiscounted horizon, even in the simple tabular case, the convergence of Q-learning might be slow, and what can be done in practice to remedy that problem and obtain a guarantee that the expected cumulative reward of the learned policy is close to the optimal one [6]. Furthermore, we examine the case of deep learning in continuous environments and show how to use the structure of the automaton constructed from the LTL objective to remedy the problem of sparse rewards [7].
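The common construction behind these methods is to learn on the product of the environment and the automaton built from the LTL formula, handing out reward on accepting transitions; the sketch below illustrates this with a discounted surrogate objective. It is a schematic illustration only (in particular, the environment interface returning the set of propositions that hold after each step is an assumption), not the exact algorithms of [6] or [7].

```python
import random
from collections import defaultdict

def product_q_learning(env, aut_step, aut_init, accepting, episodes=5000,
                       alpha=0.1, gamma=0.99, epsilon=0.1, horizon=200):
    """Q-learning on the product of the environment and the automaton derived
    from the LTL objective: the automaton state is appended to the environment
    state, and reward is given on accepting transitions, which densifies the
    otherwise sparse LTL objective."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s, q = env.reset(), aut_init
        for _ in range(horizon):
            if random.random() < epsilon:
                a = random.choice(env.actions)
            else:
                a = max(env.actions, key=lambda act: Q[((s, q), act)])
            s2, label, done = env.step(a)        # label: propositions true after the step
            q2 = aut_step(q, label)
            r = 1.0 if q2 in accepting else 0.0  # reward shaped by the automaton
            best_next = max(Q[((s2, q2), act)] for act in env.actions)
            Q[((s, q), a)] += alpha * (r + gamma * (0.0 if done else best_next) - Q[((s, q), a)])
            s, q = s2, q2
            if done:
                break
    return Q
```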
Constrained RL with LTL constraints: we show how to obtain safety guarantees in the case where the safety dynamics of the MDP are known but the reward dynamics are unknown. To that end, we use the tools developed for LTL synthesis to introduce shielding [8], a method that restricts, state by state, the actions available to the agent, and show that it provides formal guarantees of both optimality and constraint satisfaction.
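Below is a minimal sketch of the idea behind such a shield, assuming the known safety dynamics are given by post(s, a) and the winning region of the associated safety game has been precomputed (e.g. with a fixed-point computation as sketched earlier); names and interfaces are illustrative, and the full method of [8] operates on the product with the safety automaton.

```python
class Shield:
    """Filter the agent's actions against a precomputed winning region of the
    safety game: any proposed action that could leave the region is replaced
    by one guaranteed to stay inside it."""
    def __init__(self, win, actions, post):
        self.win, self.actions, self.post = win, actions, post

    def safe_actions(self, s):
        return [a for a in self.actions(s) if self.post(s, a) <= self.win]

    def correct(self, s, proposed_action):
        safe = self.safe_actions(s)
        if proposed_action in safe:
            return proposed_action   # the agent's choice is already safe
        # Otherwise fall back to some safe action; 'safe' is non-empty
        # whenever s lies in the winning region.
        return safe[0]
```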
Dr. Francesco Belardinelli is a Senior Lecturer in the Department of Computing at Imperial College London (ICL), where he heads the lab on Formal Methods in AI. He is also Deputy Director of the UKRI CDT in Safe and Trusted AI.
Dr. Edwin Hamel-De le Court is a Postdoctoral Research Associate (PDRA) in the Department of Computing at ICL. He is an expert in automata theory and controller synthesis; his research currently focuses on their application to Safe RL.
[1] Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, Martin A. Riedmiller: Playing Atari with Deep Reinforcement Learning. CoRR abs/1312.5602 (2013)
[2] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, Oleg Klimov: Proximal Policy Optimization Algorithms. CoRR abs/1707.06347 (2017)
[3] Joshua Achiam, Dario Amodei: Benchmarking Safe Exploration in Deep Reinforcement Learning. (2019)
[4] Joshua Achiam, David Held, Aviv Tamar, Pieter Abbeel: Constrained Policy Optimization. ICML 2017: 22-31
[5] Christel Baier, Joost-Pieter Katoen: Principles of Model Checking. MIT Press 2008, ISBN 978-0-262-02649-9, pp. I-XVII, 1-975
[6] Tomáš Brázdil, Krishnendu Chatterjee, Martin Chmelik, Vojtech Forejt, Jan Křetínský, Marta Kwiatkowska, Tobias Meggendorfer, David Parker, Mateusz Ujma: Learning Algorithms for Verification of Markov Decision Processes. CoRR abs/2403.09184 (2024)
[7] Mohammadhosein Hasanbeig, Daniel Kroening, Alessandro Abate: Deep Reinforcement Learning with Temporal Logics. FORMATS 2020: 1-22
[8] Mohammed Alshiekh, Roderick Bloem, Rüdiger Ehlers, Bettina Könighofer, Scott Niekum, Ufuk Topcu: Safe Reinforcement Learning via Shielding. AAAI 2018: 2669-2678