UBC Theses and Dissertations


Designing network-on-chips for throughput accelerators
Bakhoda, Ali

Abstract

Physical limits of power usage for integrated circuits have steered the microprocessor industry towards parallel architectures in the past decade. Modern graphics processing units (GPUs) are a form of parallel processor that uses chip area more effectively than traditional single-threaded architectures by favouring application throughput over latency. Modern GPUs can be used as throughput accelerators: accelerating massively parallel non-graphics applications. As the number of compute cores in throughput accelerators increases, so does the importance of efficient memory subsystem design.

In this dissertation, we present system-level microarchitectural analysis and optimizations with an emphasis on the memory subsystem of throughput accelerators that employ Bulk-Synchronous Parallel programming models such as CUDA and OpenCL. We model the whole throughput accelerator as a closed-loop system in order to capture the effects of complex interactions among microarchitectural components: we simulate the compute cores, the on-chip network, and the memory controllers with cycle-level accuracy. For this purpose, we developed the first version of the GPGPU-Sim simulator, which was capable of running unmodified applications by emulating NVIDIA's virtual instruction set. We use this simulator to model and analyze several applications and to explore various microarchitectural tradeoffs for throughput accelerators to better suit these applications.

Based on our observations, we identify the network-on-chip (NoC) component of the memory subsystem as our main optimization target and set out to design throughput-effective NoCs for future throughput accelerators. We provide a new framework for NoC researchers to ensure that optimizations are "throughput effective", namely, that parallel application-level performance improves per unit of chip area. We then use this framework to guide the development of several optimizations. Accelerator workloads demand high off-chip memory bandwidth, resulting in a many-to-few-to-many traffic pattern. Leveraging this observation, we reduce NoC area by proposing a checkerboard NoC that uses routers with limited connectivity. Additionally, we improve performance by increasing the terminal bandwidth of memory controller nodes to better handle frequent read-reply traffic. Finally, we propose a double checkerboard inverted NoC organization that maintains the benefits of these optimizations while using a simpler routing mechanism and a smaller area, yielding a 24.3% improvement in average application throughput per unit area.
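
The closed-loop, cycle-level simulation approach described above can be pictured with a short sketch. The skeleton below is a hypothetical illustration in Python, not GPGPU-Sim itself (which is a large C++ code base); the component names and the tick() interface are assumptions for exposition.

    # Hypothetical sketch of closed-loop, cycle-level simulation.
    # Component names and the tick() interface are illustrative; the
    # dissertation's simulator (GPGPU-Sim) is a large C++ code base.
    class Component:
        def tick(self) -> None:
            """Advance this component by one clock cycle."""
            raise NotImplementedError

    class Core(Component):
        def tick(self) -> None:
            pass  # fetch/execute warps; issue memory requests

    class Interconnect(Component):
        def tick(self) -> None:
            pass  # move request/reply flits one hop through the NoC

    class MemoryController(Component):
        def tick(self) -> None:
            pass  # schedule DRAM commands; return read replies

    def simulate(components: list[Component], cycles: int) -> None:
        # Ticking every component each cycle closes the loop: a core's
        # request perturbs the NoC and DRAM, and the reply latency in
        # turn stalls the core, unlike open-loop trace injection.
        for _ in range(cycles):
            for c in components:
                c.tick()

    simulate([Core(), Interconnect(), MemoryController()], cycles=1000)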
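
The "throughput effective" criterion reduces to a simple figure of merit: application-level throughput divided by chip area. The LaTeX below is a minimal formalization under assumed notation (aggregate application IPC and area A); it is shorthand for illustration, not the dissertation's exact formulation.

    % "Throughput effective": application-level throughput per unit area.
    \[ \mathrm{TE} \;=\; \frac{\mathrm{IPC}_{\mathrm{application}}}{A_{\mathrm{chip}}} \]
    % An NoC optimization is throughput effective only if it raises this ratio:
    \[ \frac{\mathrm{IPC}_{\mathrm{new}}}{A_{\mathrm{new}}} \;>\;
       \frac{\mathrm{IPC}_{\mathrm{baseline}}}{A_{\mathrm{baseline}}} \]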
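
The many-to-few-to-many observation can likewise be made concrete with a back-of-the-envelope load estimate; the symbols and node counts below are hypothetical, chosen only to show why the few memory controller terminals become the bottleneck.

    % Suppose n_c compute nodes each inject memory requests at rate r
    % (requests/cycle) and replies return through n_m memory-controller
    % (MC) nodes.  Each MC terminal then sustains on average
    \[ r_{\mathrm{MC}} \;\approx\; r \cdot \frac{n_c}{n_m} \]
    % For hypothetical counts n_c = 28 and n_m = 8, each MC port carries
    % 3.5x the per-core rate; read replies also carry full cache lines
    % while requests are small, so the reply path at the few MC nodes
    % saturates first, motivating extra terminal bandwidth there.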
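
The checkerboard idea can be sketched on a 2D mesh: routers on one "colour" of the board are full routers, while the rest are limited routers whose smaller crossbars cannot turn packets between the X and Y dimensions. The Python below is an illustrative assumption about how such a routing rule could look, not the dissertation's exact mechanism.

    # Hypothetical sketch of checkerboard routing on a 2D mesh.
    def is_full_router(x: int, y: int) -> bool:
        return (x + y) % 2 == 0  # checkerboard placement

    def choose_route(src: tuple[int, int], dst: tuple[int, int]) -> str:
        """Pick XY or YX dimension-order routing so the single turn
        lands on a full router; limited routers only pass packets
        straight through (or inject/eject locally)."""
        (sx, sy), (dx, dy) = src, dst
        if sx == dx or sy == dy:
            return "straight"       # no turn required
        if is_full_router(dx, sy):  # XY routing turns at (dx, sy)
            return "XY"
        if is_full_router(sx, dy):  # YX routing turns at (sx, dy)
            return "YX"
        return "detour"             # needs a non-minimal, two-turn path

    print(choose_route((0, 0), (2, 5)))  # -> "XY"
    print(choose_route((0, 0), (3, 3)))  # -> "detour"

Removing the turn paths at half the routers shrinks each limited router's crossbar, which is the intuition behind the area saving the abstract attributes to the checkerboard NoC.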


Rights

Attribution-NonCommercial-NoDerivs 2.5 Canada