Due to the high computing power of modern computers and the increasing availability of data, reinforcement learning is gaining importance in many areas of industry. In safety-critical areas, however, reinforcement learning poses risks, which is why algorithms that promise a safe improvement over the behavior policy are of great interest. This master's thesis therefore examines the Soft-SPIBB algorithms introduced in "Safe Policy Improvement with Soft Baseline Bootstrapping" by Nadjahi et al. It reveals shortcomings in the mathematical theory underlying these algorithms, with the consequence that the theoretical safety bounds no longer apply to the original algorithms from the paper. This can be remedied by further restricting the algorithms, which leads to new algorithms for which the theoretical safety guarantees hold. In addition to adapting further algorithms that incorporate the uncertainty of state-action pairs into their computations, and implementing all of them in a unified Python framework, this thesis evaluates the algorithms on two different benchmarks. In particular, a heuristic adaptation of the original algorithms proves to be both very safe and performant.