The thesis investigates deep reinforcement learning, specifically Proximal Policy Optimization (PPO), for online scheduling to minimise the schedule’s makespan. Invalid action masking was applied incrementally together with potential-based and auxiliary reward shaping to assess their individual and combined effects on the performance of the deep reinforcement learning algorithm. Hyperparameter tuning using Bayesian Optimization and Hyperband was carried out to find the optimal configuration of hyperparameters, including the neural network architecture. The PPO algorithm
was trained and tested on three flow shop production layouts with two, four, and eight stages.
The study found that invalid action masking was necessary for the PPO algorithm to solve the
scheduling problem. Furthermore, potential-based and auxiliary reward shaping were found to improve the performance of PPO, resulting in the algorithm outperforming the shortest processing time (SPT) heuristic for the two- and four-stage layouts. However, the advantage of using deep reinforcement learning was only slight. One may argue that the SPT heuristic would be preferred in a practical setting, as it is easier to implement, more transparent in its decisions, and does not require training. In the eight-stage production layout, the PPO algorithm did not outperform the SPT heuristic, even with invalid action masking and reward shaping.