Recently, various approaches have applied transformer networks, which are built on the concept of self-attention, to end-to-end speech recognition. These approaches mainly focus on the self-attention mechanism itself to improve the performance of transformer models. In our work, we demonstrate the benefit of adding a second transformer network during training that is optimized on time-reversed target labels. This second transformer receives a future context, which is usually not available to a standard transformer network, and we integrate this future context information into the standard transformer through two novel synchronization terms. Since the newly added transformer is only required during training, the complexity of the final network is unchanged; only the training time increases. We evaluate our approach on the publicly available TEDLIUMv2 dataset, where we achieve relative improvements of 9.8% on the dev set and 6.5% on the test set when employing synchronization terms based on a Euclidean metric.
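As a rough illustration of how such a synchronization term might be combined with the usual training objective, consider the sketch below. The function and variable names, tensor shapes, loss weighting, and the use of mean squared error as the Euclidean metric are assumptions for illustration, not the actual implementation described here.

```python
# Hypothetical sketch: a forward transformer and a second transformer trained on
# time-reversed targets, coupled by a Euclidean (L2) synchronization penalty on
# their hidden representations. All names and shapes are illustrative assumptions.
import torch
import torch.nn.functional as F

def total_loss(fwd_logits, bwd_logits, fwd_hidden, bwd_hidden,
               targets, targets_reversed, sync_weight=0.1):
    # Cross-entropy for the forward (deployed) transformer.
    # Logits are assumed to have shape (batch, time, classes).
    ce_fwd = F.cross_entropy(fwd_logits.transpose(1, 2), targets)
    # Cross-entropy for the auxiliary transformer on time-reversed labels;
    # this network is discarded after training.
    ce_bwd = F.cross_entropy(bwd_logits.transpose(1, 2), targets_reversed)
    # Euclidean synchronization term: align the forward hidden states with the
    # (re-flipped) hidden states of the time-reversed auxiliary transformer.
    sync = F.mse_loss(fwd_hidden, torch.flip(bwd_hidden, dims=[1]))
    return ce_fwd + ce_bwd + sync_weight * sync
```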