3D GPU ARCHITECTURE USING CACHE STACKING: PERFORMANCE, COST, POWER AND THERMAL ANALYSIS Ahmed Al Maashri, Guangyu Sun, Xiangyu Dong, Vijay Narayanan and Yuan Xie Department of Computer Science and Engineering, Penn State University
MOTIVATION Studies have shown that small cache sizes and low cache bandwidth limit GPU performance. Problems: We need to mitigate the high latency associated with increasing GPU cache sizes. As we increase the computational capabilities of GPUs, power consumption increases as well.
SOLUTION: 3D ARCHITECTURE Benefits: reduced circuit latency; reduced wire length, which lowers power consumption and shrinks the footprint; enables heterogeneous integration.
BACKGROUND: 3D INTEGRATION In a 3D IC, multiple device layers are stacked together with direct vertical interconnects, Through-Silicon Vias (TSVs), running through them. [Figure: conceptual 3D IC]
BACKGROUND CONT'D 3D architecture has already been used in processor-cache-memory systems. [Figure: schematic view]
BACKGROUND CONT'D Using a 3D architecture allows us to keep the main memory on-chip and effectively reduce its access latency: the on-chip interconnections that replace the off-chip buses have much smaller delay and hence allow a higher memory bus frequency. Problem: One issue with die stacking is the increase in power density, which raises chip temperature.
DESIGN SPACE EXPLORATION We investigate the effect of changing the organization of the GPU caches (Streamer, Texture Unit (TU), Z and Stencil Test (ZST), and Color Write caches) on hit rate. The simulation results show negligible impact on hit rate for all caches except the TU and ZST caches.
TU CACHE The texture cache is a read-only cache that stores image data used to map images onto triangles, a process called texture mapping. The texture cache has a high hit rate, since texels are heavily reused between neighboring pixels (temporal locality).
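The reuse pattern behind the TU cache's high hit rate can be sketched in a few lines. This is an illustrative nearest-neighbor sampler, not code from the paper; the function and texture values are hypothetical.

```python
# Minimal sketch of nearest-neighbor texture sampling. Names and values are
# illustrative assumptions, not taken from the paper.

def sample_nearest(texture, u, v):
    """Map normalized (u, v) coordinates to a texel (nearest-neighbor)."""
    h = len(texture)
    w = len(texture[0])
    x = min(int(u * w), w - 1)
    y = min(int(v * h), h - 1)
    return texture[y][x]

# A 2x2 texture; the numbers stand in for colors.
tex = [[10, 20],
       [30, 40]]

# Neighboring screen pixels often map to the same texel, which is why the
# texture cache sees heavy reuse of recently fetched data.
samples = [sample_nearest(tex, u, 0.1) for u in (0.10, 0.15, 0.20, 0.60)]
print(samples)  # first three pixels reuse texel 10; the fourth hits texel 20
```

Three of the four adjacent samples land on the same texel, so a cache holding that texel serves them without refetching.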
ZST CACHE The Z and Stencil Test caches take advantage of spatial locality, which arises from the very nature of the depth buffer: neighboring fragments in the X-Y frame grid are likely to be fetched together. Depth buffer: When an object is rendered, the depth (z coordinate) of each generated pixel is stored in a buffer (the z-buffer or depth buffer). This buffer is usually arranged as a two-dimensional array (x-y) with one element per screen pixel.
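The depth-buffer mechanism the ZST caches serve can be sketched as follows. This is a minimal illustration assuming the common smaller-z-is-closer convention; the names and dimensions are hypothetical.

```python
# Minimal sketch of a depth (z) buffer test, assuming smaller z means closer.
# Names and dimensions are illustrative, not taken from the paper.

W, H = 4, 3
FAR = float("inf")
zbuf = [[FAR] * W for _ in range(H)]    # one depth entry per screen pixel
frame = [[None] * W for _ in range(H)]  # color buffer

def draw_fragment(x, y, z, color):
    """Write the fragment only if it is closer than the stored depth."""
    if z < zbuf[y][x]:
        zbuf[y][x] = z
        frame[y][x] = color

draw_fragment(1, 1, 0.8, "red")    # first write always passes
draw_fragment(1, 1, 0.3, "blue")   # closer fragment overwrites
draw_fragment(1, 1, 0.5, "green")  # farther fragment is rejected
print(frame[1][1], zbuf[1][1])     # prints: blue 0.3
```

Because rasterization visits neighboring (x, y) entries of `zbuf` in quick succession, a cache line holding a row segment of the buffer is reused heavily, which is the spatial locality the ZST caches exploit.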
DESIGN SPACE EXPLORATION CONT'D We use the 3DCacti simulator to determine the extra cycles incurred by increasing cache size. The 2-layer and 4-layer caches are die-stacked by dividing the word lines across layers. Increasing the cache size increases latency; dividing the cache into multiple layers, however, reduces it.
3D COST MODEL There are a number of techniques for stacking dies, of which Wafer-to-Wafer (W2W) and Die-to-Wafer (D2W) are the most common. Unlike W2W, D2W allows stacking individual dies onto another wafer, resulting in higher flexibility and higher yield. Cost components: Die cost: cost of fabricating a single die before 3D bonding. Bonding cost: cost incurred by bonding (we assume a bonding cost of $150 per wafer). Die yield: decreases as die area increases. Bonding yield: our 3D bonding cost model is based on the 3D process from our industry partners, with the assumption that each 3D process step has a yield of 99%. Known-Good-Die (KGD) testing cost.
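The cost components above can be combined into a back-of-the-envelope D2W cost model. Only the $150 per-wafer bonding cost and the 99% per-step bonding yield come from the slide; the negative-binomial yield model, defect density, wafer cost, and wafer size are illustrative assumptions, not figures from the paper.

```python
# Hedged sketch of a Die-to-Wafer stacking cost model. Parameter values other
# than bond_cost_per_wafer and bond_yield are illustrative assumptions.

import math

def die_yield(area_cm2, d0=0.5, alpha=3.0):
    """Negative-binomial yield model: larger dies -> lower yield."""
    return (1.0 + area_cm2 * d0 / alpha) ** (-alpha)

def dies_per_wafer(area_cm2, wafer_diam_cm=30.0):
    """Rough gross-die count for a circular wafer, with an edge-loss term."""
    r = wafer_diam_cm / 2.0
    return int(math.pi * r * r / area_cm2
               - math.pi * wafer_diam_cm / math.sqrt(2.0 * area_cm2))

def stack_cost(layer_areas, wafer_cost=3000.0,
               bond_cost_per_wafer=150.0, bond_yield=0.99):
    """Cost of one D2W stack built from known-good dies."""
    # D2W permits pre-bond testing, so each layer contributes the cost
    # of one known-good die.
    die_costs = [wafer_cost / (dies_per_wafer(a) * die_yield(a))
                 for a in layer_areas]
    steps = len(layer_areas) - 1  # one bonding step per extra layer
    bonding = steps * bond_cost_per_wafer / dies_per_wafer(layer_areas[0])
    # Bonding-yield loss compounds with every bonding step.
    return (sum(die_costs) + bonding) / (bond_yield ** steps)

# Example: a logic die plus two half-size cache dies.
print(round(stack_cost([1.0, 0.5, 0.5]), 2))
```

The model reproduces the qualitative trends on the slide: yield falls as die area grows, and each additional layer adds both bonding cost and a bonding-yield penalty.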
ISO-CYCLE TIME RESULTS We assume an iso-cycle time of 0.75 ns, which captures the frequency range typical of current GPUs.
SCENARIO I A 2D GPU vs. a 3D-stacked-cache GPU. Both GPUs contain 128 shaders, and both use 65 nm technology. The first layer of the 3D GPU contains the GPU processing units, while the other two layers contain the partitioned ZST and TU caches. The 3D architecture achieves up to 45% speedup over the 2D planar architecture. Total power: 106.4 W. Maximum temperature: 121.55 ºC (HotSpot simulation tool).
SCENARIO II: HETEROGENEOUS INTEGRATION In the first layer of the 3D design, we implement the GPU units in 65 nm technology; the second layer, however, uses 45 nm technology. The smaller feature size allows us to fit all the caches into one layer, saving the cost of an extra bonding step. The 3D design outperforms 2D with a 19% geometric-mean speedup. Total power: 82.1 W. Maximum temperature: 82.24 ºC (HotSpot simulation tool).
MRAM VS. SRAM Since leakage power is an important component of power consumption, we consider non-volatile Magnetic Random Access Memory (MRAM), which has zero standby power, as a candidate for implementing the caches. Leakage power: power dissipated by leakage currents that flow even when transistors are not switching. Standby power: the electric power consumed by electronic and electrical appliances while they are switched off or in standby mode.
MAGNETORESISTIVE RANDOM-ACCESS MEMORY (MRAM) Unlike conventional RAM technologies, data in MRAM is stored not as electric charge or current flow but in magnetic storage elements. The heart of an MRAM cell is the magnetic tunnel junction (MTJ), a small device with two ferromagnetic layers separated by a thin dielectric layer. [Figure: MTJ structure] The resistance of the MTJ is low if the layers are parallel ("1") and high if they are antiparallel ("0"). Not only does MRAM retain its contents with the power turned off, it also draws no constant power. However, the write process is slower than in SRAM and requires more power to overcome the field already stored in the junction.
MRAM VS. SRAM CONT'D For caches with fewer writes than reads, we observed a performance gain. However, due to MRAM's slower writes compared to SRAM, performance degrades when the number of writes is large.
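The read/write trade-off above can be captured with a toy access-mix model. The per-access latencies, energies, and leakage figure below are illustrative assumptions; only the qualitative relationship (MRAM: slow, energy-hungry writes but no leakage) is taken from the slides.

```python
# Hedged back-of-the-envelope model of the MRAM vs. SRAM trade-off.
# All parameter values are illustrative assumptions, not measurements.

def cache_cost(reads, writes, read_lat, write_lat, read_e, write_e, leak_pw):
    """Return (total cycles, total energy) for a given access mix.

    leak_pw: leakage power in energy units per cycle (0 for MRAM).
    """
    cycles = reads * read_lat + writes * write_lat
    energy = reads * read_e + writes * write_e + leak_pw * cycles
    return cycles, energy

# Assumed parameters: MRAM reads comparable to SRAM, writes ~5x slower and
# ~4x more energetic, but zero leakage.
sram = dict(read_lat=2, write_lat=2, read_e=1.0, write_e=1.0, leak_pw=0.5)
mram = dict(read_lat=2, write_lat=10, read_e=1.0, write_e=4.0, leak_pw=0.0)

# Read-dominated mix: MRAM costs a few extra cycles but far less energy.
print(cache_cost(1000, 10, **sram), cache_cost(1000, 10, **mram))
# Write-dominated mix: MRAM's slow, costly writes dominate.
print(cache_cost(100, 1000, **sram), cache_cost(100, 1000, **mram))
```

Under these assumptions, the read-heavy mix favors MRAM on energy, while the write-heavy mix favors SRAM on both cycles and energy, mirroring the performance crossover reported above.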
MRAM VS. SRAM CONT'D The power benefits of MRAM over SRAM make the former more appealing for power-conscious applications.
CONTRIBUTIONS Performance evaluation of 3D-stacked caches on GPUs. Comparison between 3D-stacked SRAM and MRAM caches in GPUs in terms of power consumption. Power and thermal analysis of the proposed architectural designs.
Questions?