3D GPU ARCHITECTURE USING CACHE STACKING: PERFORMANCE, COST, POWER AND THERMAL ANALYSIS

3D GPU ARCHITECTURE USING CACHE STACKING: PERFORMANCE, COST, POWER AND THERMAL ANALYSIS Ahmed Al Maashri, Guangyu Sun, Xiangyu Dong, Vijay Narayanan and Yuan Xie Department of Computer Science and Engineering, Penn State University

MOTIVATION Studies have shown that small cache sizes and low cache bandwidth limit GPU performance. Problems: (1) we need to mitigate the higher latency that comes with larger GPU caches; (2) as the computational capabilities of GPUs increase, so does their power consumption.

SOLUTION: 3D ARCHITECTURE Benefits: reduced circuit latency; shorter wires, which lower both power consumption and footprint; and support for heterogeneous integration.

BACKGROUND: 3D INTEGRATION In a 3D IC, multiple device layers are stacked on top of one another and connected by direct vertical interconnects, Through-Silicon Vias (TSVs), running through them. (Figure: conceptual 3D IC)

BACKGROUND CONT'D 3D architecture has already been applied to processor-cache-memory systems. (Figure: schematic view)

BACKGROUND CONT'D A 3D architecture allows main memory to be kept on-chip, effectively reducing the latency of accessing it: the on-chip interconnections that replace off-chip buses have much smaller delay, so the memory bus can run at a higher frequency. Problem: die stacking increases power density, which raises chip temperature.

DESIGN SPACE EXPLORATION Investigate how changing the organization of the GPU caches (Streamer, Texture Unit (TU), Z and Stencil Test (ZST), and Color Write caches) affects hit rate. The simulation results show negligible impact on hit rate for all caches except the TU and ZST caches.

TU CACHE The texture cache is a read-only cache that stores image data used for mapping images onto triangles, a process called texture mapping. The texture cache has a high hit rate because there is heavy reuse of texels between neighboring pixels (temporal locality); a small sketch of this reuse follows.
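To make the reuse concrete, here is a minimal, hypothetical illustration (my own sketch, not the paper's model): under a simple pixel-to-texel mapping, several adjacent screen pixels fetch the same or neighboring texels, which is why texel fetches hit the TU cache so often.

```python
# Illustrative only: adjacent screen pixels map to nearby texels, so texel fetches
# are heavily reused -- the locality the TU cache relies on. All values are assumed.
TEX_W = TEX_H = 256

def texel_for_pixel(px, py, u_scale=0.5, v_scale=0.5):
    """Nearest-texel lookup for a screen pixel under an assumed, simple u/v mapping."""
    u, v = px * u_scale, py * v_scale
    return (int(u) % TEX_W, int(v) % TEX_H)

# Four neighboring pixels fetch only two distinct texels -> high TU-cache hit rate.
for p in [(100, 40), (101, 40), (102, 40), (103, 40)]:
    print(p, "->", texel_for_pixel(*p))
```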

ZST CACHE Z and Stencil Test (ZST) caches exploit spatial locality, which follows from the very nature of the depth buffer: neighboring fragments in the X-Y frame grid are likely to be fetched together. Depth buffer: when an object is rendered, the depth (z coordinate) of each generated pixel is stored in a buffer (the z-buffer or depth buffer), usually arranged as a two-dimensional (x-y) array with one element per screen pixel. A minimal sketch of the depth test follows.
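To make the depth-buffer behavior concrete, here is a minimal, hypothetical sketch of the classic z-buffer test (my own illustration, not the paper's GPU model); note how fragments that are adjacent in x-y touch adjacent depth-buffer entries, which is the spatial locality the ZST cache exploits.

```python
# Minimal z-buffer (depth test) sketch -- illustrative only.
WIDTH, HEIGHT = 640, 480
FAR = float("inf")

# One depth entry per screen pixel; neighboring fragments hit neighboring entries,
# which is the spatial locality the ZST cache exploits.
depth_buffer = [[FAR] * WIDTH for _ in range(HEIGHT)]
frame_buffer = [[(0, 0, 0)] * WIDTH for _ in range(HEIGHT)]

def depth_test(x, y, z, color):
    """Keep the fragment only if it is closer than what is already stored."""
    if z < depth_buffer[y][x]:
        depth_buffer[y][x] = z        # write: update stored depth
        frame_buffer[y][x] = color    # write: update color
        return True
    return False                      # fragment is occluded, discard it

# Example: two overlapping fragments at the same pixel; the closer one wins.
depth_test(10, 20, z=5.0, color=(255, 0, 0))
depth_test(10, 20, z=9.0, color=(0, 255, 0))   # fails the test, red pixel remains
```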

DESIGN SPACE EXPLORATION CONT'D The 3DCacti simulator is used to determine the extra access cycles incurred by the larger caches. The 2-layer and 4-layer caches are die-stacked by dividing the word lines across layers. Increasing the cache size increases latency; however, dividing a cache across layers reduces that latency, as the rough estimate below suggests.
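As a rough intuition for why partitioning helps (my own illustrative sketch, not 3DCacti's model), the distributed RC delay of a wordline grows roughly quadratically with its length, so splitting an array across layers shortens each wordline and sharply cuts that delay component. All constants below are assumed values.

```python
# Elmore-style estimate of wordline delay; all per-mm constants are assumptions.
R_PER_MM = 1000.0     # wordline resistance, ohm/mm (assumed)
C_PER_MM = 0.2e-12    # wordline capacitance, F/mm  (assumed)

def wordline_delay_ns(length_mm):
    """~0.38*R*C Elmore delay of a distributed RC wordline (rough estimate)."""
    r, c = R_PER_MM * length_mm, C_PER_MM * length_mm
    return 0.38 * r * c * 1e9

planar = wordline_delay_ns(2.0)                 # planar array with a 2 mm wordline
for layers in (2, 4):
    split = wordline_delay_ns(2.0 / layers)     # wordline divided across layers
    print(f"{layers}-layer split: {split:.3f} ns vs planar {planar:.3f} ns")
```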

3D COST MODEL There are several techniques for stacking dies, of which Wafer-to-Wafer (W2W) and Die-to-Wafer (D2W) bonding are the most common. Unlike W2W, D2W stacks individual dies onto another wafer, giving higher flexibility and higher yield. The cost model accounts for: die cost (the cost of fabricating a single die before 3D bonding); bonding cost (a bonding cost of $150 per wafer is assumed); die yield (yield decreases as die area increases); bonding yield (the 3D bonding cost model is based on the 3D process of the authors' industry partners, assuming 99% yield for each 3D process step); and Known-Good-Die (KGD) testing cost. A rough cost sketch follows.
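As a rough illustration of how these terms combine (my own simplified sketch, not the paper's exact model), the cost of one good stacked part can be approximated as the summed per-layer die costs plus bonding cost, divided by the overall yield. The die-yield expression uses a common negative-binomial yield model; all parameter values below, including the per-die bonding cost (the paper quotes $150 per wafer, not per die), are assumptions.

```python
import math

def die_yield(area_cm2, defect_density=0.5, alpha=3.0):
    """Negative-binomial yield model: larger dies -> lower yield.
    defect_density (defects/cm^2) and alpha are assumed values."""
    return (1.0 + defect_density * area_cm2 / alpha) ** (-alpha)

def stacked_die_cost(layer_costs, layer_areas_cm2, bonding_cost_per_die,
                     bonding_yield_per_step=0.99, kgd_test_cost_per_die=0.0,
                     die_to_wafer=True):
    """Approximate cost of one good 3D-stacked part.

    layer_costs:      fabrication cost of each die before bonding
    layer_areas_cm2:  area of each die, used for its yield
    die_to_wafer:     with D2W, dies can be tested (KGD) before stacking,
                      so bad dies are discarded early; with W2W they cannot.
    """
    n_bonds = len(layer_costs) - 1
    bonding_yield = bonding_yield_per_step ** n_bonds

    if die_to_wafer:
        # KGD testing: pay a test cost per die, but only good dies are stacked.
        good_die_costs = [(c + kgd_test_cost_per_die) / die_yield(a)
                          for c, a in zip(layer_costs, layer_areas_cm2)]
        return (sum(good_die_costs) + bonding_cost_per_die) / bonding_yield
    else:
        # W2W: untested dies are stacked, so the whole stack is lost if any die is bad.
        stack_yield = math.prod(die_yield(a) for a in layer_areas_cm2) * bonding_yield
        return (sum(layer_costs) + bonding_cost_per_die) / stack_yield

# Example with made-up numbers: a 3-layer stack (logic layer plus two cache layers).
print(stacked_die_cost([60.0, 20.0, 20.0], [2.0, 1.0, 1.0],
                       bonding_cost_per_die=3.0, kgd_test_cost_per_die=1.0))
```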

ISO-CYCLE TIME RESULTS An iso-cycle time of 0.75 ns is assumed (roughly a 1.33 GHz clock), which captures the typical frequency ranges of current GPUs. A small worked example follows.
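As a small worked example (my own illustration; the access times below are assumed, not the paper's), the number of pipeline cycles a cache access occupies at this cycle time is simply its access latency divided by 0.75 ns, rounded up:

```python
import math

CYCLE_TIME_NS = 0.75  # iso-cycle time assumed in the paper (~1.33 GHz)

def access_cycles(access_time_ns):
    """Cycles a cache access occupies at the iso-cycle time (rounded up)."""
    return math.ceil(access_time_ns / CYCLE_TIME_NS)

# Assumed, illustrative access times for a planar cache vs. a layer-partitioned one.
for label, t_ns in [("2D cache", 2.9), ("4-layer 3D cache", 1.4)]:
    print(f"{label}: {t_ns} ns -> {access_cycles(t_ns)} cycles")
```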

SCENARIO I A 2D GPU vs. a 3D-stacked-cache GPU. Both GPUs contain 128 shaders and use 65 nm technology. The first layer of the 3D GPU holds the GPU processing units, while the other two layers hold the partitioned ZST and TU caches. The 3D architecture achieves up to 45% speedup over the 2D planar architecture. Total power: 106.4 W. Maximum temperature: 121.55 °C (HotSpot simulation tool).

SCENARIO II: HETEROGENEOUS INTEGRATION The first layer of the 3D design implements the GPU units in 65 nm technology, while the second layer uses 45 nm technology. The smaller feature size allows all the caches to fit into a single layer, saving bonding cost. The 3D design outperforms 2D by a 19% geometric-mean speedup. Total power: 82.1 W. Maximum temperature: 82.24 °C (HotSpot simulation tool).

MRAM VS. SRAM Since leakage power is an important component of power consumption, we consider non-volatile Magnetoresistive Random Access Memory (MRAM), which has zero standby power, as a candidate for implementing the caches. Leakage power: static power dissipated by leakage currents even when a circuit is idle. Standby power: the electric power consumed by electronic and electrical appliances while they are switched off or in standby mode.

MAGNETORESISTIVE RANDOM-ACCESS MEMORY (MRAM) Unlike conventional RAM technologies, MRAM does not store data as electric charge or current flow, but in magnetic storage elements. The heart of an MRAM cell is the magnetic tunnel junction (MTJ), a small device with two ferromagnetic layers separated by a thin dielectric layer. (Figure: MTJ structure) The resistance of the MTJ is low when the layers' magnetizations are parallel ('1') and high when they are antiparallel ('0'). MRAM not only retains its contents with the power turned off, it also has no constant power draw. However, the write process is slower and requires more power, since it must overcome the field stored in the junction.

MRAM VS. SRAM CONT'D For caches with fewer writes than reads, a performance gain is observed. However, because MRAM write times are slower than SRAM's, performance degrades when the number of writes is large. A back-of-the-envelope sketch of this trade-off is shown below.
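As a rough illustration of the read/write trade-off (my own back-of-the-envelope model; every number below is an assumption, not a value from the paper), MRAM's denser cells are taken to yield a larger cache and thus a higher hit rate, while its writes are much slower. Comparing average access time as the write fraction grows shows MRAM winning for read-heavy mixes and losing for write-heavy ones:

```python
# Back-of-the-envelope comparison; all numbers are illustrative assumptions.
MISS_PENALTY_NS = 50.0

# SRAM: symmetric read/write latency, smaller capacity -> lower hit rate.
SRAM = {"read": 2.0, "write": 2.0, "hit_rate": 0.90}
# MRAM: similar reads, much slower writes, denser cells -> larger cache -> higher hit rate.
MRAM = {"read": 2.0, "write": 10.0, "hit_rate": 0.96}

def avg_access_ns(tech, write_fraction):
    """Average access time: hits pay read/write latency, misses pay the miss penalty."""
    hit_ns = (1 - write_fraction) * tech["read"] + write_fraction * tech["write"]
    return tech["hit_rate"] * hit_ns + (1 - tech["hit_rate"]) * MISS_PENALTY_NS

for wf in (0.05, 0.25, 0.50):
    s, m = avg_access_ns(SRAM, wf), avg_access_ns(MRAM, wf)
    print(f"write fraction {wf:.0%}: SRAM {s:.2f} ns, MRAM {m:.2f} ns "
          f"-> {'MRAM wins' if m < s else 'SRAM wins'}")
```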

MRAM VS. SRAM CONT'D The power benefits of MRAM over SRAM make the former more appealing for power-conserving applications.

CONTRIBUTIONS Performance evaluation of 3D-stacked caches on GPUs. Comparison of 3D-stacked SRAM and MRAM caches in GPUs in terms of power consumption. Power and thermal analysis of the proposed architectural designs.

Questions?