The future of Artificial Intelligence and Machine Learning (AI/ML) inference lies in hardware systems composed of many interconnected chiplets. With the rapid advancement of 3D integration technologies, each chiplet is expected to contain increasingly large on-chip memory.
To meet the growing demands of state-of-the-art models, achieving low-latency or high-through\-put inference while minimizing energy consumption per input requires novel architectural and algorithmic solutions. These must jointly optimize inter-chiplet communication, chiplet-local memory provisioning, heterogeneous interconnect topologies (both inter- and intra-chiplet), and dynamic power management. This, in turn, necessitates fine-grained spatiotemporal coordination, managing hundreds of block-level wakeup and shutdown events over the course of a single inference pass.
In response, this thesis introduces an energy-aware optimization framework for multi-chiplet AI/ML systems, built on a flexible Mixed-Integer Quadratic Programming (MIQP) formulation. The approach addresses operator mapping and execution scheduling in tandem, and supports co-optimization across diverse hardware configurations and objective functions. As a result, our method yields provably optimal mappings and system configurations for large-scale models with tens of thousands of computational operators, including modern architectures such as LLaMA3-70B.
In benchmark scenarios, our optimized solutions achieve up to a 26.5× improvement in energy-delay product (EDP) over baseline approaches and consistently remain within 14.0\% of the theoretical optimum. Notably, these results are obtained within minutes, significantly faster than traditional heuristics or solvers, which often require hours or provide no performance guarantees.
As AI/ML models continue to scale, our solutions retain their efficiency: even in scenarios requiring significantly larger compute and memory footprints, the proposed framework delivers energy efficiency comparable to a hypothetical monolithic chip design with idealized integration. Experimental validation via cycle-accurate emulation and hardware prototype measurements confirms the accuracy of these analytical estimates, with observed results aligning within 2.7\% of the MIQP predictions.