Use-it or Lose-it: Wearout and Lifetime in Future Chip-Multiprocessors



Similar documents
Asynchronous Bypass Channels

In-network Monitoring and Control Policy for DVFS of CMP Networkson-Chip and Last Level Caches

Hyper Node Torus: A New Interconnection Network for High Speed Packet Processors

Architectural Level Power Consumption of Network on Chip. Presenter: YUAN Zheng

On-Chip Interconnection Networks Low-Power Interconnect

Hardware Implementation of Improved Adaptive NoC Router with Flit Flow History based Load Balancing Selection Strategy

Performance Evaluation of 2D-Mesh, Ring, and Crossbar Interconnects for Chip Multi- Processors. NoCArc 09

Low Power AMD Athlon 64 and AMD Opteron Processors

Photonic Networks for Data Centres and High Performance Computing

Chapter 2 Sources of Variation

Low-Overhead Hard Real-time Aware Interconnect Network Router

McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures

Design of a Feasible On-Chip Interconnection Network for a Chip Multiprocessor (CMP)

Optical interconnection networks with time slot routing

Intel s Revolutionary 22 nm Transistor Technology

Data Distribution Algorithms for Reliable. Reliable Parallel Storage on Flash Memories

A Detailed and Flexible Cycle-Accurate Network-on-Chip Simulator

A Generic Network Interface Architecture for a Networked Processor Array (NePA)

DESIGN CHALLENGES OF TECHNOLOGY SCALING

Testing Low Power Designs with Power-Aware Test Manage Manufacturing Test Power Issues with DFTMAX and TetraMAX

Design and Implementation of an On-Chip timing based Permutation Network for Multiprocessor system on Chip

Alpha CPU and Clock Design Evolution

Disaster-Resilient Backbone and Access Networks

Lecture 18: Interconnection Networks. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Introduction to CMOS VLSI Design

Yaffs NAND Flash Failure Mitigation

Distributed Elastic Switch Architecture for efficient Networks-on-FPGAs

Introduction to Exploration and Optimization of Multiprocessor Embedded Architectures based on Networks On-Chip

Switched Interconnect for System-on-a-Chip Designs

A Low-Radix and Low-Diameter 3D Interconnection Network Design

Model-Based Synthesis of High- Speed Serial-Link Transmitter Designs

Naveen Muralimanohar Rajeev Balasubramonian Norman P Jouppi

Juniper Networks QFabric: Scaling for the Modern Data Center

Analysis of Error Recovery Schemes for Networks-on-Chips

A case study of mobile SoC architecture design based on transaction-level modeling

System Interconnect Architectures. Goals and Analysis. Network Properties and Routing. Terminology - 2. Terminology - 1

Leveraging Torus Topology with Deadlock Recovery for Cost-Efficient On-Chip Network

A Low Latency Router Supporting Adaptivity for On-Chip Interconnects

Design and Implementation of an On-Chip Permutation Network for Multiprocessor System-On-Chip

3D NAND Technology Implications to Enterprise Storage Applications

VLSI Design Verification and Testing

Fault Modeling. Why model faults? Some real defects in VLSI and PCB Common fault models Stuck-at faults. Transistor faults Summary

Area 3: Analog and Digital Electronics. D.A. Johns

vci_anoc_network Specifications & implementation for the SoClib platform

APPENDIX 1 USER LEVEL IMPLEMENTATION OF PPATPAN IN LINUX SYSTEM

Interconnection Networks

Introduction to VLSI Programming. TU/e course 2IN30. Prof.dr.ir. Kees van Berkel Dr. Johan Lukkien [Dr.ir. Ad Peeters, Philips Nat.

Networking Topology For Your System

SLC vs MLC: Proper Flash Selection for SSDs in Industrial, Military and Avionic Applications. A TCS Space & Component Technology White Paper

Evaluating Embedded Non-Volatile Memory for 65nm and Beyond

A CDMA Based Scalable Hierarchical Architecture for Network- On-Chip

ECE 358: Computer Networks. Solutions to Homework #4. Chapter 4 - The Network Layer

On-Chip Communications Network Report

CLASS: Combined Logic Architecture Soft Error Sensitivity Analysis

Quality of Service (QoS) for Asynchronous On-Chip Networks

Scaling 10Gb/s Clustering at Wire-Speed

Router Architectures

Implementation Of High-k/Metal Gates In High-Volume Manufacturing

EMC / EMI issues for DSM: new challenges

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING NATIONAL INSTITUTE OF TECHNOLOGY ROURKELA EFFICIENT ROUTER DESIGN FOR NETWORK ON CHIP

Interconnection Networks. Interconnection Networks. Interconnection networks are used everywhere!

Everything you need to know about flash storage performance

ETHERNET WAN ENCRYPTION SOLUTIONS COMPARED

AN1837. Non-Volatile Memory Technology Overview By Stephen Ledford Non-Volatile Memory Technology Center Austin, Texas.

Lecture 23: Interconnection Networks. Topics: communication latency, centralized and decentralized switches (Appendix E)

What is a System on a Chip?

Optimizing Cloud Performance Using Veloxum Testing Report on experiments run to show Veloxum s optimization software effects on Terremark s vcloud

Core Power Delivery Network Analysis of Core and Coreless Substrates in a Multilayer Organic Buildup Package

Design of a Dynamic Priority-Based Fast Path Architecture for On-Chip Interconnects*

A Low-Latency Asynchronous Interconnection Network with Early Arbitration Resolution

International Journal of Electronics and Computer Science Engineering 1482

Prevention, Detection, Mitigation

Design Compiler Graphical Create a Better Starting Point for Faster Physical Implementation

SPI I2C LIN Ethernet. u Today: Wired embedded networks. u Next lecture: CAN bus u Then: wireless embedded network

Interconnection Network Design

Nexus: An Asynchronous Crossbar Interconnect for Synchronous System-on-Chip Designs

From Hypercubes to Dragonflies a short history of interconnect

Performance Evaluation of AODV, OLSR Routing Protocol in VOIP Over Ad Hoc

Characterizing Task Usage Shapes in Google s Compute Clusters

Energy Optimal Routing Protocol for a Wireless Data Network

1.Introduction. Introduction. Most of slides come from Semiconductor Manufacturing Technology by Michael Quirk and Julian Serda.

Design and Verification of Nine port Network Router

Business Case for the Brocade Carrier Ethernet IP Solution in a Metro Network

Transcription:

Use-it or Lose-it: Wearout and Lifetime in Future Chip-Multiprocessors Hyungjun Kim, 1 Arseniy Vitkovsky, 2 Paul V. Gratz, 1 Vassos Soteriou 2 1 Department of Electrical and Computer Engineering, Texas A&M University 2 Department of Electrical Engineering, Computer Engineering and Informatics, Cyprus University of Technology

Introduction Moore's law scaling continues Denard scaling ending Power/performance trade-off driving core increases Increasing core counts push complexity into interconnect Interconnect becomes critical to CMP functionality Intel Single-Chip Cloud Computer 48-core chip-multiprocessor (CMP) 2011 2

Moore's Law and Reliability ITRS: per-transistor failure rate must decrease 10X by 2023 Manufacturable solutions not known Failure mechanisms with time and use Hot Carrier Injection (HCI) Negative Bias Temperature Instability (NBTI) Transistor wear effects Transistors slow w/ wear Data dependent causes (roughly) inverse Charge trapping in pmosfet after NBTI induced stress. [Brown et al., IEEE Trans. Electron Devices 2010] 3

Chip-multiprocessor Wearout Increasing core counts Each core accounts for less of system throughput A 64-core Chip-Multiprocessor (CMP) with various peripherals interconnected via a 2-D Mesh 4

Chip-multiprocessor Wearout Increasing core counts Each core accounts for less of system throughput Core redundancy Other cores can take over workload Fault-tolerant task migration Other micro-arch methods Problem solved? Scenario #1: Failure of three cores due to wearout, workload redistributed to other cores 5

Chip-multiprocessor Wearout Network-on-chip (NOC) complexity increases with scaling NoC critical to CMP operation A 64-core Chip-Multiprocessor (CMP) with various peripherals interconnected via a 2-D Mesh 6

Chip-multiprocessor Wearout Network-on-chip (NOC) complexity increases with scaling NoC critical to CMP operation Scenario #2: A peripheral device disconnected from the CMP due to interconnect failure 7

Chip-multiprocessor Wearout Network-on-chip (NOC) complexity increases with scaling NoC critical to CMP operation Scenario #3: Peripheral devices and entire NoC region disconnected from the network due to chained faults in links 8

Chip-multiprocessor Wearout Network-on-chip (NOC) complexity increases with scaling NoC critical to CMP operation Scenario #4: A single link failure leading to routing deadlock 9

Chip-multiprocessor Wearout Network-on-chip (NOC) complexity increases with scaling NoC critical to CMP operation Single failure could incapacitate CMP! Scenario #4: A single link failure leading to routing deadlock 10

Chip-multiprocessor Wearout Project goal: Develop techniques to mitigate wearout in the NoC Analyze breakdown by mechanism Characterize wear due to real workloads on routers and links Develop wear-resistant router microarchitectures A 64-core Chip-Multiprocessor (CMP) with various peripherals interconnected via a 2-D Mesh, all failure scenarios illustrated 11

Outline Introduction Background Failure mechanisms in modern CMOS Hot Carrier Injection (HCI) Negative Bias Temperature Instability (NBTI) Reliability characterization Wearout-resistant router microarchitecture Evaluation Conclusions 12

Failure mechanisms: Hot Carrier Injection (HCI) Cause: When current flows through channel Charge carrier gains sufficient energy to embed in gate oxide gate source drain n+ n+ 13

Failure mechanisms: Hot Carrier Injection (HCI) Cause: When current flows through channel Charge carrier gains sufficient energy to embed in gate oxide gate source e drain n+ n+ p 14

Failure mechanisms: Hot Carrier Injection (HCI) Cause: When current flows through channel Charge carrier gains sufficient energy to embed in gate oxide Effect: Shifts Vt Slows transistor switching speed gate source e drain n+ n+ p 15

Failure mechanisms: Negative Bias Temperature Instability (NBTI) Cause: Hole induction breaks Si-H bonds at gate-oxide interface gate source drain O H O Si-Si-Si n+ h+ n+ Si 16

Failure mechanisms: Negative Bias Temperature Instability (NBTI) Cause: Hole induction breaks Si-H bonds at gate-oxide interface H atoms form H 2 in oxide H 2 diffuse to gate, trapping charge in gate oxide gate source H drain 2 O O Si-Si n+ + -Si n+ Si 17

Failure mechanisms: Negative Bias Temperature Instability (NBTI) Effect: Shift in Vt Slows transistor switching speed gate source drain O O O O Si-Si n+ + -Si Si-Si + -Si n+ Si Si 18

Failure mechanisms: Negative Bias Temperature Instability (NBTI) Effect: Shift in Vt Slows transistor switching speed gate source drain O O O O Si-Si n+ + -Si Si-Si + -Si n+ Si Si Effect worsens w/ high-k dielectric Much worse in MuGFETs 19

Failure mechanisms: Summary Each mechanism occurs under different conditions HCI proportional to current across channel HCI increases with increased transistor activity factor NBTI proportional to time gate held at 0V NBTI increases with 0-skewed duty cycle Both mechanisms effect switching speed of transistor Transistor does not break just slows down Thus strongest impact on critical path Effect of usage on wear nearly opposite! Increased activity Increased HCI Lower duty cycle Increased NBTI 20

Outline Introduction Background Reliability characterization Wearout-resistant router microarchitecture Evaluation Conclusions 21

Router Microarchitecture A 64-core Chip-Multiprocessor (CMP) with various peripherals interconnected via a 2-D Mesh 22

Router Microarchitecture NoC Router Microarchitecture Input Channel 0 VC 0 VC v-1 Routing Unit VC Allocator XBar Allocator Crossbar p x p Output Channel 0 Input Channel p-1 VC 0 VC v-1 Routing Unit Output Channel p-1 A 64-core Chip-Multiprocessor (CMP) with various peripherals interconnected via a 2-D Mesh 23

Router Critical Path HCI and NBTI slow transistors Wearout occurs first where timing critical Analyze critical path for susceptibility to wear NoC Router Microarchitecture Input Channel 0 VC 0 VC v-1 Routing Unit Input Channel p-1 VC 0 VC v-1 Routing Unit VC Allocator XBar Allocator Crossbar p x p Output Channel 0 Output Channel p-1 24

Router Critical Path NoC router critical path in 2 nd pipeline stage From input VC Through VC and Xbar allocators To 3 rd stage pipeline latches Activity of wires along path correlated with injection rate of router Application dependent! Input Channel 0 VC 0 VC v-1 Routing Unit Input Channel p-1 VC 0 NoC Router Microarchitecture VC v-1 Routing Unit VC Allocator XBar Allocator Crossbar p x p Output Channel 0 Output Channel p-1 Router critical path typically lies through VC and Xbar allocation 25

CMP Workload Characterization Examined variance in per-router load Incoming rate to each router individually PARSEC workloads Load seen by each router under PARSEC suite benchmarks (incoming rate). Average (bar), minimum and maximum (whiskers). 26

CMP Workload Characterization Examined variance in per-router load Incoming rate to each router individually PARSEC workloads Great variance in load Between applications From router to router within network Implication: wearout is highly workload dependent Load seen by each router under PARSEC suite benchmarks (incoming rate). Average (bar), minimum and maximum (whiskers). 27

Workload impact on wearout: HCI Activity on critical path Examined activity factor of router critical path to determine HCI impact Some sensitivity to injection rate Even for extreme injection rates, activity factor remains low Activity factor histogram for critical path nodes within router, under random injection of varying rate. Critical path nodes defined as nodes on all paths with latency within 10% of clock period. 28

Workload impact on wearout: HCI Activity on critical path Examined activity factor of router critical path to determine HCI impact Some sensitivity to injection rate Even for extreme injection rates, activity factor remains low Implication: Activity factor on critical path low, and insensitive to workload HCI impact low Activity factor histogram for critical path nodes within router, under random injection of varying rate. Critical path nodes defined as nodes on all paths with latency within 10% of clock period. 29

Workload impact on wearout: NBTI Duty cycle on critical path Examined duty cycle of router critical path to determine NBTI impact Duty cycle poor for a high percentage of critical path nodes Highly sensitive to injection rate Duty cycle improves as injection rate increases Duty cycle (fraction of time at 0) histogram for critical path nodes within router, under random injection of varying rate. 30

Workload impact on wearout: NBTI Duty cycle on critical path Examined duty cycle of router critical path to determine NBTI impact Duty cycle poor for a high percentage of critical path nodes Highly sensitive to injection rate Duty cycle improves as injection rate increases Implications: NBTI induced wear is likely NBTI stress inversely proportional to load! Duty cycle (fraction of time at 0) histogram for critical path nodes within router, under random injection of varying rate. 31

Reliability characterization summary Input Channel 0 VC 0 VC v-1 Routing Unit Input Channel p-1 VC 0 VC v-1 Routing Unit NoC Router Microarchitecture VC Allocator XBar Allocator Crossbar p x p Output Channel 0 Output Channel p-1 Wear is injection rate dependent along the critical path Low injection rates cause poor duty cycle and accelerated NBTI Critical path primarily concerned with VC and Xbar allocation 32

Outline Introduction Background Related work Reliability characterization Use it or Lose it: wearout-resistant router microarchitecture Evaluation Conclusions 33

Addressing wear in the NoC router Input Channel 0 VC 0 VC v-1 Routing Unit Input Channel p-1 VC 0 VC v-1 Routing Unit NoC Router Microarchitecture VC Allocator XBar Allocator Crossbar p x p Output Channel 0 Output Channel p-1 Increasing injection rate artificially yields other problems Power increases HCI increases Congestion Want to improve duty cycle w/o increasing activity factor 34

The Use it or Lose it router microarchitecture Use it or Lose it: Need mechanism to exercise critical path through the allocators or we will lose them Goals: 1.Improve duty cycle by creating more time spent at 1 2.Should not change state of router 3.Should not worsen critical path timing 4.Should not significantly increase activity factor Power and HCI are activity factor dependent 35

The Use it or Lose it router microarchitecture Create an exercise mode for critical path 1. Improve duty cycle by creating more time spent at 1 Inject random data into allocators Emulate higher injection rates Hit corner cases 36

The Use it or Lose it router microarchitecture 2. Should not change state of router Only enabled when no packets ready in input FIFOs Router state latches are disabled during exercise mode 3. Should not worsen critical path timing Only one mux added to actual path All vector generation and mode selection logic off critical path 37

The Use it or Lose it router microarchitecture 4. Should not significantly increase activity factor Only paths with poor duty cycle on critical path are modified Vector generator changes vector slowly, once every 128 cycles Minimal impact on router activity factor 38

Random Vector Generation ROM containing pre-generated random vectors Low overhead Examined range of ROM sizes ROM sizes >16 produced little return Examined range of ROM vector rotation periods Examined other methods to generate random vector (see paper) 39

Outline Introduction Background Related work Reliability characterization Wearout-resistant router microarchitecture Evaluation Conclusions 40

Implementation Baseline and Wear-resistant router synthesized in 45nm TSMC process tech Results scaled to 22nm Low overheads of technique ~1% increase in router power consumption <1% increase in router area 41

Methodology Post-synth router models developed for Baseline and Wear-resistant Post-synth router models simulated under PARSEC-like workloads 64-node, 2-D Mesh (8x8) Five-port, three-stage, 2-D mesh router Four-VCs per port, each four flits deep Activity factor and duty cycle of wires along router critical paths extracted 42

Results: Lifetime Increase Under Synthetic Workload Notation: 16 ROM (8) : 16 entry ROM, rotated every 8 cycles. Lifetime of Use it or Lose it router experiencing NBTI wear under various injection rates normalized against the baseline router. Dramatic improvement in lifetime Improvement best at low loads when wear is worst Insensitive to rotation period 43

Results: Lifetime Increase Under PARSEC workloads Lifetime improvement of Use it or Lose it router under PARSEC workloads normalized against baseline router. 10-65X increase in lifetime over baseline Geo-mean of 22X improvement across benchmarks Applications with routers that experience lowest minimum load see greatest gain 44

Results: Impact on Activity Factor Activity factor for baseline and Use it or Lose it router under various incoming rate loads. Increase in activity factor only significant at low, < 0.05 flits/cycle, loads Little impact on HCI 45

Results: Impact on Power Power consumption of baseline and Use it or Lose it router under various incoming rate loads. Maximum of < 1% increase in router power consumption 46

Conclusions Wearout is a growing problem in future CMOS technologies Over 10 years, 10x reliability improvement required Interconnect (NoC) critical point of failure in CMPs In the NoC, NBTI is dominant Low activity of routers problematic HCI less of an issue Introduced Use it or Lose it wear-resistant router microarchitecture 22x improvement in lifetime Insignificant overhead in energy or performance 47

Questions? 48