This work gives a complete overview of performing dense floating-point matrix multiply-accumulate operations on NVIDIA Tensor Cores. In 2017, NVIDIA unveiled the first generation of Tensor Cores as part of the Volta architecture. Today, Tensor Cores are an essential part of the compute hardware of data centers worldwide. As NVIDIA GPUs and these compute units have evolved, their capabilities have expanded. This work reviews the current capabilities of Tensor Cores and how to leverage their performance. Tensor Cores are well established in numerous applications that are not sensitive to precision loss. This work shows that Tensor Cores are not limited to such applications and should be exploited by any application that performs dense floating-point matrix multiply-accumulate operations. An API for carrying out Tensor Core operations has been implemented as part of this work. The API demonstrates practical Tensor Core programmability using two different approaches. A benchmark assessing Tensor Core performance and precision has also been developed as part of this work.