Transformer-based models, such as large language models, have revolutionized the field of deep learning, achieving state-of-the-art performance across a wide range of tasks. Despite their success, their computational demands result in substantial operational costs at scale and pose major challenges for deployment in resource-constrained environments. This thesis proposes approximation techniques within inference hardware to enhance the computational efficiency of transformers by leveraging a configurable multiply-accumulate unit. In particular, we focus on efficient approximation of multiplication using the logarithmic number system and present optimizations over IEEE floating-point summation, including approximate alignment, alternative rounding schemes, and custom quantization types. For each method, we analyze the theoretical power savings and accuracy trade-offs compared to high-precision computation. Experimental results show that certain approximations with greatly reduced computational complexity can be implemented with minimal accuracy loss, providing a practical pathway for designing power-efficient inference hardware tailored to transformers.
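To illustrate the central idea behind logarithmic-number-system multiplication mentioned above, the following sketch shows a Mitchell-style approximate multiplier in software. It is not the thesis's hardware design; it only demonstrates, under the assumption of positive IEEE-754 single-precision operands, how treating a float's bit pattern as a scaled, biased base-2 logarithm turns multiplication into integer addition.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Bit pattern of 1.0f, used as the bias offset. */
#define FP32_ONE_BITS 0x3F800000u

/* Reinterpret float bits as an integer (and back) without
 * violating strict aliasing. */
static uint32_t f2u(float f) { uint32_t u; memcpy(&u, &f, sizeof u); return u; }
static float    u2f(uint32_t u) { float f; memcpy(&f, &u, sizeof f); return f; }

/* Approximate product of two positive floats.
 * The bit pattern of a positive float is roughly a scaled, biased
 * log2 of its value, so adding the two patterns and removing one
 * bias approximates multiplication (Mitchell's approximation,
 * worst-case relative error around 11%). This sketch ignores signs,
 * zeros, infinities, and NaNs. */
static float approx_mul(float a, float b) {
    return u2f(f2u(a) + f2u(b) - FP32_ONE_BITS);
}

int main(void) {
    float a = 1.7f, b = 2.3f;
    printf("exact  : %f\n", a * b);            /* 3.910000 */
    printf("approx : %f\n", approx_mul(a, b)); /* close, within a few percent */
    return 0;
}
```

In hardware, the attraction of this scheme is that the mantissa multiplier disappears entirely: the approximate product requires only an integer adder over the operand encodings, which is the source of the power savings the abstract alludes to.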