Deep neural networks have become some of the most prominent models in machine
learning owing to their flexibility and, consequently, their broad applicability. The training of
large-scale deep neural networks requires vast computational resources. Stochastic gradient
descent methods still enjoy great popularity, but Hessian-based optimisation techniques
are on the rise. While computing the second derivative of the loss function is still
computationally expensive, a potentially much faster convergence rate justifies the
consideration of such methods. Gradient descent is inherently sequential and cannot take full
advantage of highly parallelised computing architectures. This motivates the exploration
of second-order optimisation methods in the context of high-performance computing as well.
This thesis aims to provide an overview of numerical challenges, their solutions and
stochastic details regarding the application of Hessian-based optimisation to the training
of large-scale deep neural networks. It lays emphasis on a strong theoretical foundation,
which is crucial for the less heuristic second-order methods. The potential of a quasi-Newton
method is showcased by its outperforming gradient descent in the optimisation of the
loss function corresponding to ResNet.