Digital Design for Low Power Systems
|
|
- Roy Woods
- 8 years ago
- Views:
Transcription
1 Digital Design for Low Power Systems Shekhar Borkar Intel Corp.
2 Outline Low Power Outlook & Challenges Circuit solutions for leakage avoidance, control, & tolerance Microarchitecture for Low Power System considerations Technology, Circuits, Architecture, and System co-optimization 2
3 Unconstrained SD Leakage Ioff (na/u) nm 90nm Temp (C) 3
4 Unconstrained SD Leakage Power 1,000 SD Leakage (Watts) X Tr Growth 2X Tr Growth Temp = 100C 1.5V 1.7V 2V 1.2V 1V 0.9V u 0.18u 0.13u 90nm 65nm 45nm 4
5 Product Cost Pressure Desk Top ASP ($) Platform Budget Power Delivery $25 Other Thermal $10 Shrinking ASP & budget for power 5
6 Unconstrained Total Power Power (W), Power Density (W/cm 2 ) mm Die SiO2 Lkg SD Lkg Active 90nm 65nm 45nm 32nm 22nm 16nm Technology, Circuits, & Architecture to constrain power 6
7 Active Power Reduction Low Supply Voltage Slow Fast Slow High Supply Voltage Multiple Supply Voltages Multiple supply voltages mimic Vdd scaling Issues: Performance loss due to interface circuits Multiple Vdd routing overhead 7
8 SD Leakage Avoidance Dual Vt # of Paths # of Paths High Vt Low Vt Delay Delay Low-Vt transistor width (% of total transistor width) 100% 80% Full Low Vt Performance 60% 40% 20% 0% 0% 5% 10% 15% 20% 25% % T scaling from all high-vt design # of Paths Dual Vt Delay Leakage 3X smaller (Active & Standby) No performance loss 8
9 Leakage Control Circuits Body Bias Stack Effect Sleep Transistor Vdd +Ve Vbp Equal Loading Logic Block -Ve Vbn 2-10X Reduction 5-10X Reduction X Reduction 9
10 Body Bias Reverse BB Forward BB Vdd Vbp Increases V t Reduces V t +Ve Reduces I off Increases I off To inactive circuits to reduce leakage To active circuits to improve performance -Ve Vbn Tradeoff & analysis required Applying BB consumes active power Switching delay between body bias 10
11 Reverse Body Bias Leakage Reduction Factor (X) C 0.5V RBB Higher V T Shorter L Lower V T Target I off (na/µm) RBB less effective at shorter L and lower Vt * A. Keshavarzi et. al., 2001 Int. Symp. Low Power Electronics & Design (ISLPED) 11
12 Scalability of RBB Power (Watts) 400.0E E E E E+0 Ibp junction leakage Total Leakage Power Optimum SD leakage Ibn junction leakage Body Bias (V) Total Leakage Power Measured on Test Chip Microprocessor critical path circuit Optimum RBB 0.18 µm I/O circuit 0.35 µm 0.18 µm 2V 0.5V Ioff Red. 1000X 10X RBB Less effective with technology scaling * A. Keshavarzi et. al., 1999 Int. Symp. Low Power Electronics & Design (ISLPED) 12
13 Forward Body Bias Normalized operating frequency Router chip with body bias mV Forward body bias (mv) CBG I/O: S-Links Digital Core 24 LBGs PLL Die size 10.1 X 10.1 mm2 Technology 150nm CMOS Transistors 6.6 million Area overhead 2% Power overhead 1% I/O: F-Links Fmax (MHz) Fmax (MHz) Active power (W) * S. Narendra et al, ISSCC 2002, * S. Narendra et al, ISSCC 2002, FBB increases frequency & SD leakage Body bias chip with 450 mv FBB T j ~ 60 C NBB chip & body bias chip with ZBB Vcc (V) Body bias chip with 450 mv FBB T j ~ 60 C Body bias chip with ZBB
14 Adaptive Body Bias Experiment 5.3 mm Multiple subsites Resistor Network PD & Counter CUT Delay Resistor Network Bias Amplifier 4.5 mm Technology Number of subsites per die Body bias range Bias resolution 150nm CMOS V FBB to 0.5V RBB 32 mv 1.6 X 0.24 mm, 21 sites per die 150nm CMOS Die frequency: Min(F 1..F 21 ) Die power: Sum(P 1..P 21 ) 14
15 Adaptive Body Bias Results Number of dies too slow too leaky ABB FBB RBB Accepted die 100% 60% 20% 0% nobb ABB f target 97% highest bin 100% yield Higher Frequency Frequency within die ABB f target For given Freq and Power density 100% yield with ABB 97% highest freq bin with ABB for within die variability 15
16 Transistor Stack V dd V int Normalized Current Vint(V) Stack leakage is ~5-10X smaller * S. Narendra et. al., 2001 Int. Symp. Low Power Electronics & Design (ISLPED) 16
17 Scalability of Stack Effect Stack Stack effect effect factor factor X X Zero Zero body body bias80 C 80 C 0.13 µm LV T 0.13 µm HV T 0.18 µm Model Vdd (V) X Stack effect factor µm Zero body bias 80 C 10% Vdd scaling 20% Vdd scaling Technology generation (µm) Stack effects becomes stronger with scaling 17
18 Exploiting Natural Stacks 32-bit Kogge-Stone adder % of input vectors 30% 20% High V T Low V T 10% 0% Standby leakage current (µa) Reduction Avg Worst High V T 1.5X 2.5X Low V T 1.5X 2X * Y. Ye et. al., 1998 Symp. VLSI Circuits Energy Overhead High V T Low V T 1.64 nj 1.84 nj Savings 2.2 µa 38.4 µa Min time in Standby 84 µs 5.4 µs 18
19 Stack Forcing 5 Equal Loading Delay (Relative) High Vt Low Vt Performance Loss Leakage Reduction 1.E-05 1.E-04 1.E-03 1.E-02 1.E-01 1.E+00 Ioff (Relative) Circuit technique provides additional Vt s 19
20 Sleep Transistor Technique V dd external Dynamic ALU 32 Tota power (mw) Switching Leakage Overhead 4GHz, 75C α= % 10.3% ALU core 0 Clock gating + body bias Clock gating only Clock gating + sleep transistor V ss external Non-sleep ALU Scan FIFO Sleep ALU Output scan Control Body bias J. Tschanz et al, ISSCC 2003, 6.1 PMOS sleep transistor PMOS body bias Frequency change Leakage savings Area overhead -2% 13X 11% 0% 2X 2% 20
21 Leakage Tolerant Register File Domino: leakage-sensitive Φ1 RS0 D0 M1 RS15 M2 D15 Local Bit-line LBL0 N0 LBL1 LBL DC Noise Robustness Pseudo-static: leakage-tolerant This work 6GHz design Dual-Vt Noise floor Keeper upsizing Low-Vt Read Delay (ps) Φ1 D0# RS0 M1 Vs Px M2 Pk D15# RS15 LBL0 LBL1 Improves delay Improves noise margin Leakage tolerant * R. Krishnamurthy et. al., 2001 Symp. VLSI Circuits 21
22 Employing Leakage Control Number of Paths Low Vt + Stack Forcing Low Vt Dual Vt + Stack Forcing Dual Vt Delay Target 22
23 Optimum Frequency Sub-threshold Leakage increases exponentially Relative Frequency Process Technology Performance Diminishing Return Relative Frequency (Pipelining) Pipeline & Performance Power Efficiency Optimum Relative Pipeline Depth Pipeline Depth Maximum performance with Optimum pipeline depth Optimum frequency 23
24 Efficiency of Microarchitectures Growth (X) from previous uarch Same Process Technology Die Area Performance Power S-Scalar Dynamic Deep Pipeline Reduction in MIPS/Watt 40% 20% 0% Same Process Technology Enegry efficiency drops ~20% S-Scalar Dynamic Employ efficient microarchitecture and design Deep Pipeline 24
25 Memory Latency 1000 CPU Small ~few Clocks Cache Memory Large ns Memory Latency (Clocks) Assume: 50ns Memory latency Freq (MHz) Cache miss hurts performance Worse at higher frequency 25
26 Increase on-die Memory 100% Power Density of Logic 10X of Memory Cache % of Total Area 75% 50% 25% 486 Pentium M Pentium III Pentium Pentium 4 0% 1u 0.5u 0.25u 0.13u 65nm Increased Data Bandwidth & Reduced Latency Hence, higher performance for much lower power 26
27 Multi-threading Performance 100% 80% 60% 40% 20% 1 GHz 2 GHz 3 GHz Thermals & Power Delivery designed for full HW utilization ST Single Thread Full HW Utilization MT1 Wait for Mem Multi-Threading Wait for Mem 0% MT2 Wait 100% 98% 96% Cache Hit % MT3 Multi-threading improves performance without impacting thermals & power delivery 27
28 Throughput Oriented Design Vdd Logic Block 0.7 x Vdd Freq = 1 Throughput = 1 Active Power = 1 SD Lkg Power = 1 Logic Block Freq = 0.7 Throughput = 1.4 Logic Block Active Power = 0.7 SD Lkg Power = 0.7 Higher logic throughput, yet lower power 28
29 Special Purpose Hardware TCP Offload Engine 1.E+06 PLL OOO ROB Send buffer MIPS 1.E+05 1.E+04 GP TCB Exec Core TCB Exec Core ROM ROM Input seq CAM1 CLB 1.E+03 1.E+02 TOE mm X 3.54 mm, 260K transistors Opportunities: Network processing engines MPEG Encode/Decode engines Speech engines Special Purpose HW Best Mips/Watt 29
30 Chip Level Multi-processing Rule of thumb: Voltage Frequency Power Performance 1% 1% 3% 0.66% In the same process technology Cache Cache Core Core Core Voltage = 1 Freq = 1 Area = 1 Power = 1 Perf = 1 Voltage = -15% Freq = -15% Area = 2 Power = 1 Perf = ~1.8 30
31 Multi-Core Cache Large Core Power 4 Performance Small Core Power = 1/4 Performance = 1/2 1 1 C1 C3 Cache C2 C Multi-Core: Power efficient Better power and thermal management 31
32 From Multi to Many 13mm, 100W, 48MB Cache, 4B Transistors, in 22nm 12 Cores 48 Cores 144 Cores Relative Performance 1.2 Single Core Performance Large 0.5 Med 0.3 Small System Performance TPT One App Large Med Small Two App Four App Eight App 32
33 Future Multi-core Platform GP C GP C GP C GP C General Purpose Cores GP C C SP C GP SP C C GP C GP GP C GP C GP C SP C SP C GP C Special Purpose HW Interconnect fabric Heterogeneous Multi-Core Platform 33
34 Low Power & Platform Breakdown of Platform Power Other CPU MCH, ICH CPU MCH, ICH Disk Display Gfx Memory CLK Gen Power Supply Loss Gfx Disk Mobile Com Desktop Other CPU & electronics power in a platform is low Must reduce power of other platform ingredients 34
35 Efficiency of Power Delivery 90% Workload % Efficiency (%) 85% 80% 75% Loss Design Point 70% I (amp) Difficult to maintain high efficiency across workloads Power supply design needs to exploit new technologies 35
36 Typical Power Delivery System Power Delivery Electronics Mobile Platform, ~50W 36
37 Fine-grain Power Management Power Supply Response Usage Fast Slow Slow Fast Cache Time Improve response time Bring power supply closer Provide multiple supply voltages Fine grain Vdd and Freq scaling Cache Core Hopping for hot spot & power density management 37
38 Software & Low Power Applications Compilers Operating System Firmware Provide guidance to compilers and operating system through policy Add hooks for fine grain power management, and assist OS by providing checkpoints Fine grain power management algorithm Control HW through firmware Hardware control Detect conditions, take action, notify OS Fine grain power management is possible only by cooperation of the entire software stack 38
39 Co-optimization Technology Circuits µarch Platform Software SD Leakage (Ioff) Number of Vt s Gate dielectric (Tox) Supply voltage (Vdd) Transistor density Leakage avoidance, control & tolerance Multiple Vdd circuits Tradeoff frequency for transistors Optimal pipeline Throughput oriented microarchitecture Partitioning of logic Error correction Throughput oriented microarchitecture Fine grain power management hardware & software Efficient power delivery & management of multiple Vdd s Software and applications to utilize logic throughput Interconnect RC delay tolerance Interconnect cost vs efficiency of power delivery 39
40 Summary Power and energy will limit integration Employ circuit techniques for SD leakage avoidance, control, & tolerance Microarchitecture for Low Power Opportunities to save power at the platform level Hardware & Software Technology, Circuits, Architecture, and System co-optimization will be key 40
DESIGN CHALLENGES OF TECHNOLOGY SCALING
DESIGN CHALLENGES OF TECHNOLOGY SCALING IS PROCESS TECHNOLOGY MEETING THE GOALS PREDICTED BY SCALING THEORY? AN ANALYSIS OF MICROPROCESSOR PERFORMANCE, TRANSISTOR DENSITY, AND POWER TRENDS THROUGH SUCCESSIVE
More informationLecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.
Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide
More informationIntel Labs at ISSCC 2012. Copyright Intel Corporation 2012
Intel Labs at ISSCC 2012 Copyright Intel Corporation 2012 Intel Labs ISSCC 2012 Highlights 1. Efficient Computing Research: Making the most of every milliwatt to make computing greener and more scalable
More informationLow Power AMD Athlon 64 and AMD Opteron Processors
Low Power AMD Athlon 64 and AMD Opteron Processors Hot Chips 2004 Presenter: Marius Evers Block Diagram of AMD Athlon 64 and AMD Opteron Based on AMD s 8 th generation architecture AMD Athlon 64 and AMD
More informationIntroducción. Diseño de sistemas digitales.1
Introducción Adapted from: Mary Jane Irwin ( www.cse.psu.edu/~mji ) www.cse.psu.edu/~cg431 [Original from Computer Organization and Design, Patterson & Hennessy, 2005, UCB] Diseño de sistemas digitales.1
More informationMaking Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association
Making Multicore Work and Measuring its Benefits Markus Levy, president EEMBC and Multicore Association Agenda Why Multicore? Standards and issues in the multicore community What is Multicore Association?
More informationNaveen Muralimanohar Rajeev Balasubramonian Norman P Jouppi
Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 Naveen Muralimanohar Rajeev Balasubramonian Norman P Jouppi University of Utah & HP Labs 1 Large Caches Cache hierarchies
More informationThis Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?
This Unit: Putting It All Together CIS 501 Computer Architecture Unit 11: Putting It All Together: Anatomy of the XBox 360 Game Console Slides originally developed by Amir Roth with contributions by Milo
More information«A 32-bit DSP Ultra Low Power accelerator»
«A 32-bit DSP Ultra Low Power accelerator» E. Beigné edith.beigne@cea.fr CEA LETI MINATEC, Grenoble, France www.cea.fr Low power SOC challenges : Energy Efficiency Fine-Grain AVFS solutions FDSOI technology
More informationEfficient Big Data Analytics Computing: A Research Challenge
Efficient Big Data Analytics Computing: A Research Challenge Wilfred Pinfold Director, Extreme Scale Programs 1 Agenda Intel Big Data Context Overview Key Research Areas Challenges Partnerships 2 Meeting
More informationIntroduction to Multi-Core
Introduction to Multi-Core Baskaran Ganesan Baskaran.ganesan@intel.com Sr. Design Engineer Digital Enterprise Group, Intel Corporation Foundation for Advancement of Education and Research (FAER) 1 Topics
More informationMcPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures
McPAT: An Integrated Power, Area, and Timing Modeling Framework for Multicore and Manycore Architectures Sheng Li, Junh Ho Ahn, Richard Strong, Jay B. Brockman, Dean M Tullsen, Norman Jouppi MICRO 2009
More informationData Center and Cloud Computing Market Landscape and Challenges
Data Center and Cloud Computing Market Landscape and Challenges Manoj Roge, Director Wired & Data Center Solutions Xilinx Inc. #OpenPOWERSummit 1 Outline Data Center Trends Technology Challenges Solution
More information路 論 Chapter 15 System-Level Physical Design
Introduction to VLSI Circuits and Systems 路 論 Chapter 15 System-Level Physical Design Dept. of Electronic Engineering National Chin-Yi University of Technology Fall 2007 Outline Clocked Flip-flops CMOS
More informationWhy Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat
Why Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat Why Computers Are Getting Slower The traditional approach better performance Why computers are
More informationThis Unit: Multithreading (MT) CIS 501 Computer Architecture. Performance And Utilization. Readings
This Unit: Multithreading (MT) CIS 501 Computer Architecture Unit 10: Hardware Multithreading Application OS Compiler Firmware CU I/O Memory Digital Circuits Gates & Transistors Why multithreading (MT)?
More informationParallel Programming Survey
Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory
More informationA Survey on ARM Cortex A Processors. Wei Wang Tanima Dey
A Survey on ARM Cortex A Processors Wei Wang Tanima Dey 1 Overview of ARM Processors Focusing on Cortex A9 & Cortex A15 ARM ships no processors but only IP cores For SoC integration Targeting markets:
More informationPower Reduction Techniques in the SoC Clock Network. Clock Power
Power Reduction Techniques in the SoC Network Low Power Design for SoCs ASIC Tutorial SoC.1 Power Why clock power is important/large» Generally the signal with the highest frequency» Typically drives a
More informationSingle-chip Cloud Computer IA Tera-scale Research Processor
Single-chip Cloud Computer IA Tera-scale esearch Processor Jim Held Intel Fellow & Director Tera-scale Computing esearch Intel Labs August 31, 2010 www.intel.com/info/scc Agenda Tera-scale esearch SCC
More informationScaling in a Hypervisor Environment
Scaling in a Hypervisor Environment Richard McDougall Chief Performance Architect VMware VMware ESX Hypervisor Architecture Guest Monitor Guest TCP/IP Monitor (BT, HW, PV) File System CPU is controlled
More informationDigital Integrated Circuit (IC) Layout and Design
Digital Integrated Circuit (IC) Layout and Design! EE 134 Winter 05 " Lecture Tu & Thurs. 9:40 11am ENGR2 142 " 2 Lab sections M 2:10pm 5pm ENGR2 128 F 11:10am 2pm ENGR2 128 " NO LAB THIS WEEK " FIRST
More informationStatic-Noise-Margin Analysis of Conventional 6T SRAM Cell at 45nm Technology
Static-Noise-Margin Analysis of Conventional 6T SRAM Cell at 45nm Technology Nahid Rahman Department of electronics and communication FET-MITS (Deemed university), Lakshmangarh, India B. P. Singh Department
More informationCSEE W4824 Computer Architecture Fall 2012
CSEE W4824 Computer Architecture Fall 2012 Lecture 2 Performance Metrics and Quantitative Principles of Computer Design Luca Carloni Department of Computer Science Columbia University in the City of New
More informationAn All-Digital Phase-Locked Loop with High Resolution for Local On-Chip Clock Synthesis
An All-Digital Phase-Locked Loop with High Resolution for Local On-Chip Clock Synthesis Oliver Schrape 1, Frank Winkler 2, Steffen Zeidler 1, Markus Petri 1, Eckhard Grass 1, Ulrich Jagdhold 1 International
More informationADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM
ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM 1 The ARM architecture processors popular in Mobile phone systems 2 ARM Features ARM has 32-bit architecture but supports 16 bit
More informationThread level parallelism
Thread level parallelism ILP is used in straight line code or loops Cache miss (off-chip cache and main memory) is unlikely to be hidden using ILP. Thread level parallelism is used instead. Thread: process
More informationAlpha CPU and Clock Design Evolution
Alpha CPU and Clock Design Evolution This lecture uses two papers that discuss the evolution of the Alpha CPU and clocking strategy over three CPU generations Gronowski, Paul E., et.al., High Performance
More informationUse-it or Lose-it: Wearout and Lifetime in Future Chip-Multiprocessors
Use-it or Lose-it: Wearout and Lifetime in Future Chip-Multiprocessors Hyungjun Kim, 1 Arseniy Vitkovsky, 2 Paul V. Gratz, 1 Vassos Soteriou 2 1 Department of Electrical and Computer Engineering, Texas
More informationIntroduction to VLSI Programming. TU/e course 2IN30. Prof.dr.ir. Kees van Berkel Dr. Johan Lukkien [Dr.ir. Ad Peeters, Philips Nat.
Introduction to VLSI Programming TU/e course 2IN30 Prof.dr.ir. Kees van Berkel Dr. Johan Lukkien [Dr.ir. Ad Peeters, Philips Nat.Lab] Introduction to VLSI Programming Goals Create silicon (CMOS) awareness
More informationCPU Performance. Lecture 8 CAP 3103 06-11-2014
CPU Performance Lecture 8 CAP 3103 06-11-2014 Defining Performance Which airplane has the best performance? 1.6 Performance Boeing 777 Boeing 777 Boeing 747 BAC/Sud Concorde Douglas DC-8-50 Boeing 747
More informationSRAM Scaling Limit: Its Circuit & Architecture Solutions
SRAM Scaling Limit: Its Circuit & Architecture Solutions Nam Sung Kim, Ph.D. Assistant Professor Department of Electrical and Computer Engineering University of Wisconsin - Madison SRAM VCC min Challenges
More informationPhotonic Networks for Data Centres and High Performance Computing
Photonic Networks for Data Centres and High Performance Computing Philip Watts Department of Electronic Engineering, UCL Yury Audzevich, Nick Barrow-Williams, Robert Mullins, Simon Moore, Andrew Moore
More informationFault Modeling. Why model faults? Some real defects in VLSI and PCB Common fault models Stuck-at faults. Transistor faults Summary
Fault Modeling Why model faults? Some real defects in VLSI and PCB Common fault models Stuck-at faults Single stuck-at faults Fault equivalence Fault dominance and checkpoint theorem Classes of stuck-at
More informationArea 3: Analog and Digital Electronics. D.A. Johns
Area 3: Analog and Digital Electronics D.A. Johns 1 1970 2012 Tech Advancements Everything but Electronics: Roughly factor of 2 improvement Cars and airplanes: 70% more fuel efficient Materials: up to
More informationEnabling Technologies for Distributed Computing
Enabling Technologies for Distributed Computing Dr. Sanjay P. Ahuja, Ph.D. Fidelity National Financial Distinguished Professor of CIS School of Computing, UNF Multi-core CPUs and Multithreading Technologies
More informationAm186ER/Am188ER AMD Continues 16-bit Innovation
Am186ER/Am188ER AMD Continues 16-bit Innovation 386-Class Performance, Enhanced System Integration, and Built-in SRAM Problem with External RAM All embedded systems require RAM Low density SRAM moving
More informationEnabling Technologies for Distributed and Cloud Computing
Enabling Technologies for Distributed and Cloud Computing Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF Multi-core CPUs and Multithreading
More informationInternational Journal of Electronics and Computer Science Engineering 1482
International Journal of Electronics and Computer Science Engineering 1482 Available Online at www.ijecse.org ISSN- 2277-1956 Behavioral Analysis of Different ALU Architectures G.V.V.S.R.Krishna Assistant
More informationPetascale Software Challenges. Piyush Chaudhary piyushc@us.ibm.com High Performance Computing
Petascale Software Challenges Piyush Chaudhary piyushc@us.ibm.com High Performance Computing Fundamental Observations Applications are struggling to realize growth in sustained performance at scale Reasons
More informationCMOS, the Ideal Logic Family
CMOS, the Ideal Logic Family INTRODUCTION Let s talk about the characteristics of an ideal logic family. It should dissipate no power, have zero propagation delay, controlled rise and fall times, and have
More informationLecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com
CSCI-GA.3033-012 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Modern GPU
More informationMore on Pipelining and Pipelines in Real Machines CS 333 Fall 2006 Main Ideas Data Hazards RAW WAR WAW More pipeline stall reduction techniques Branch prediction» static» dynamic bimodal branch prediction
More informationPerformance evaluation
Performance evaluation Arquitecturas Avanzadas de Computadores - 2547021 Departamento de Ingeniería Electrónica y de Telecomunicaciones Facultad de Ingeniería 2015-1 Bibliography and evaluation Bibliography
More informationControl 2004, University of Bath, UK, September 2004
Control, University of Bath, UK, September ID- IMPACT OF DEPENDENCY AND LOAD BALANCING IN MULTITHREADING REAL-TIME CONTROL ALGORITHMS M A Hossain and M O Tokhi Department of Computing, The University of
More informationInterconnection technologies
Interconnection technologies Ron Ho VLSI Research Group Sun Microsystems Laboratories 1 Acknowledgements Many contributors to the work described here > Robert Drost, David Hopkins, Alex Chow, Tarik Ono,
More informationCS 159 Two Lecture Introduction. Parallel Processing: A Hardware Solution & A Software Challenge
CS 159 Two Lecture Introduction Parallel Processing: A Hardware Solution & A Software Challenge We re on the Road to Parallel Processing Outline Hardware Solution (Day 1) Software Challenge (Day 2) Opportunities
More informationA 10,000 Frames/s 0.18 µm CMOS Digital Pixel Sensor with Pixel-Level Memory
Presented at the 2001 International Solid State Circuits Conference February 5, 2001 A 10,000 Frames/s 0.1 µm CMOS Digital Pixel Sensor with Pixel-Level Memory Stuart Kleinfelder, SukHwan Lim, Xinqiao
More informationLecture 3: Evaluating Computer Architectures. Software & Hardware: The Virtuous Cycle?
Lecture 3: Evaluating Computer Architectures Announcements - Reminder: Homework 1 due Thursday 2/2 Last Time technology back ground Computer elements Circuits and timing Virtuous cycle of the past and
More informationFPGA-based Multithreading for In-Memory Hash Joins
FPGA-based Multithreading for In-Memory Hash Joins Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras University of California, Riverside Outline Background What are FPGAs Multithreaded
More informationOpenSoC Fabric: On-Chip Network Generator
OpenSoC Fabric: On-Chip Network Generator Using Chisel to Generate a Parameterizable On-Chip Interconnect Fabric Farzad Fatollahi-Fard, David Donofrio, George Michelogiannakis, John Shalf MODSIM 2014 Presentation
More informationResource Efficient Computing for Warehouse-scale Datacenters
Resource Efficient Computing for Warehouse-scale Datacenters Christos Kozyrakis Stanford University http://csl.stanford.edu/~christos DATE Conference March 21 st 2013 Computing is the Innovation Catalyst
More informationModule 2. Embedded Processors and Memory. Version 2 EE IIT, Kharagpur 1
Module 2 Embedded Processors and Memory Version 2 EE IIT, Kharagpur 1 Lesson 5 Memory-I Version 2 EE IIT, Kharagpur 2 Instructional Objectives After going through this lesson the student would Pre-Requisite
More informationThe Orca Chip... Heart of IBM s RISC System/6000 Value Servers
The Orca Chip... Heart of IBM s RISC System/6000 Value Servers Ravi Arimilli IBM RISC System/6000 Division 1 Agenda. Server Background. Cache Heirarchy Performance Study. RS/6000 Value Server System Structure.
More informationISSCC 2003 / SESSION 13 / 40Gb/s COMMUNICATION ICS / PAPER 13.7
ISSCC 2003 / SESSION 13 / 40Gb/s COMMUNICATION ICS / PAPER 13.7 13.7 A 40Gb/s Clock and Data Recovery Circuit in 0.18µm CMOS Technology Jri Lee, Behzad Razavi University of California, Los Angeles, CA
More informationItanium 2 Platform and Technologies. Alexander Grudinski Business Solution Specialist Intel Corporation
Itanium 2 Platform and Technologies Alexander Grudinski Business Solution Specialist Intel Corporation Intel s s Itanium platform Top 500 lists: Intel leads with 84 Itanium 2-based systems Continued growth
More informationScalable Internet Services and Load Balancing
Scalable Services and Load Balancing Kai Shen Services brings ubiquitous connection based applications/services accessible to online users through Applications can be designed and launched quickly and
More informationMemory Module Specifications KVR667D2D4F5/4G. 4GB 512M x 72-Bit PC2-5300 CL5 ECC 240-Pin FBDIMM DESCRIPTION SPECIFICATIONS
Memory Module Specifications KVR667DD4F5/4G 4GB 5M x 7-Bit PC-5300 CL5 ECC 40- FBDIMM DESCRIPTION This document describes s 4GB (5M x 7-bit) PC-5300 CL5 SDRAM (Synchronous DRAM) fully buffered ECC dual
More informationDesign Cycle for Microprocessors
Cycle for Microprocessors Raúl Martínez Intel Barcelona Research Center Cursos de Verano 2010 UCLM Intel Corporation, 2010 Agenda Introduction plan Architecture Microarchitecture Logic Silicon ramp Types
More informationGigabit Ethernet Design
Gigabit Ethernet Design Laura Jeanne Knapp Network Consultant 1-919-254-8801 laura@lauraknapp.com www.lauraknapp.com Tom Hadley Network Consultant 1-919-301-3052 tmhadley@us.ibm.com HSEdes_ 010 ed and
More informationMemory Architecture and Management in a NoC Platform
Architecture and Management in a NoC Platform Axel Jantsch Xiaowen Chen Zhonghai Lu Chaochao Feng Abdul Nameed Yuang Zhang Ahmed Hemani DATE 2011 Overview Motivation State of the Art Data Management Engine
More informationNetworking Virtualization Using FPGAs
Networking Virtualization Using FPGAs Russell Tessier, Deepak Unnikrishnan, Dong Yin, and Lixin Gao Reconfigurable Computing Group Department of Electrical and Computer Engineering University of Massachusetts,
More informationMeasuring Cache and Memory Latency and CPU to Memory Bandwidth
White Paper Joshua Ruggiero Computer Systems Engineer Intel Corporation Measuring Cache and Memory Latency and CPU to Memory Bandwidth For use with Intel Architecture December 2008 1 321074 Executive Summary
More informationVLSI Design Verification and Testing
VLSI Design Verification and Testing Instructor Chintan Patel (Contact using email: cpatel2@cs.umbc.edu). Text Michael L. Bushnell and Vishwani D. Agrawal, Essentials of Electronic Testing, for Digital,
More informationAchieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.
More informationIntel Itanium Quad-Core Architecture for the Enterprise. Lambert Schaelicke Eric DeLano
Intel Itanium Quad-Core Architecture for the Enterprise Lambert Schaelicke Eric DeLano Agenda Introduction Intel Itanium Roadmap Intel Itanium Processor 9300 Series Overview Key Features Pipeline Overview
More informationIntroduction to CMOS VLSI Design (E158) Lecture 8: Clocking of VLSI Systems
Harris Introduction to CMOS VLSI Design (E158) Lecture 8: Clocking of VLSI Systems David Harris Harvey Mudd College David_Harris@hmc.edu Based on EE271 developed by Mark Horowitz, Stanford University MAH
More informationRiding silicon trends into our future
Riding silicon trends into our future VLSI Design and Embedded Systems Conference, Bangalore, Jan 05 2015 Sunit Rikhi Vice President, Technology & Manufacturing Group General Manager, Intel Custom Foundry
More informationOn-Chip Interconnection Networks Low-Power Interconnect
On-Chip Interconnection Networks Low-Power Interconnect William J. Dally Computer Systems Laboratory Stanford University ISLPED August 27, 2007 ISLPED: 1 Aug 27, 2007 Outline Demand for On-Chip Networks
More informationDRG-Cache: A Data Retention Gated-Ground Cache for Low Power 1
DRG-Cache: A Data Retention Gated-Ground Cache for Low Power 1 ABSTRACT In this paper we propose a novel integrated circuit and architectural level techniue to reduce leakage power consumption in high
More informationHow To Understand The Design Of A Microprocessor
Computer Architecture R. Poss 1 What is computer architecture? 2 Your ideas and expectations What is part of computer architecture, what is not? Who are computer architects, what is their job? What is
More informationA Novel Low Power Fault Tolerant Full Adder for Deep Submicron Technology
International Journal of Computer Sciences and Engineering Open Access Research Paper Volume-4, Issue-1 E-ISSN: 2347-2693 A Novel Low Power Fault Tolerant Full Adder for Deep Submicron Technology Zahra
More information! Metrics! Latency and throughput. ! Reporting performance! Benchmarking and averaging. ! CPU performance equation & performance trends
This Unit CIS 501 Computer Architecture! Metrics! Latency and throughput! Reporting performance! Benchmarking and averaging Unit 2: Performance! CPU performance equation & performance trends CIS 501 (Martin/Roth):
More informationInnovativste XEON Prozessortechnik für Cisco UCS
Innovativste XEON Prozessortechnik für Cisco UCS Stefanie Döhler Wien, 17. November 2010 1 Tick-Tock Development Model Sustained Microprocessor Leadership Tick Tock Tick 65nm Tock Tick 45nm Tock Tick 32nm
More informationAsynchronous Bypass Channels
Asynchronous Bypass Channels Improving Performance for Multi-Synchronous NoCs T. Jain, P. Gratz, A. Sprintson, G. Choi, Department of Electrical and Computer Engineering, Texas A&M University, USA Table
More informationWhat is a System on a Chip?
What is a System on a Chip? Integration of a complete system, that until recently consisted of multiple ICs, onto a single IC. CPU PCI DSP SRAM ROM MPEG SoC DRAM System Chips Why? Characteristics: Complex
More informationVALAR: A BENCHMARK SUITE TO STUDY THE DYNAMIC BEHAVIOR OF HETEROGENEOUS SYSTEMS
VALAR: A BENCHMARK SUITE TO STUDY THE DYNAMIC BEHAVIOR OF HETEROGENEOUS SYSTEMS Perhaad Mistry, Yash Ukidave, Dana Schaa, David Kaeli Department of Electrical and Computer Engineering Northeastern University,
More informationChapter 2 Logic Gates and Introduction to Computer Architecture
Chapter 2 Logic Gates and Introduction to Computer Architecture 2.1 Introduction The basic components of an Integrated Circuit (IC) is logic gates which made of transistors, in digital system there are
More information1. Memory technology & Hierarchy
1. Memory technology & Hierarchy RAM types Advances in Computer Architecture Andy D. Pimentel Memory wall Memory wall = divergence between CPU and RAM speed We can increase bandwidth by introducing concurrency
More informationHCC/HCF4032B HCC/HCF4038B
HCC/HCF4032B HCC/HCF4038B TRIPLE SERIAL ADDERS INERT INPUTS ON ALL ADDERS FOR SUM COMPLEMENTING APPLICATIONS FULLY STATIC OPERATION...DC TO 10MHz (typ.) @ DD = 10 BUFFERED INPUTS AND OUTPUTS SINGLE-PHASE
More informationThe Shortcut Guide to Balancing Storage Costs and Performance with Hybrid Storage
The Shortcut Guide to Balancing Storage Costs and Performance with Hybrid Storage sponsored by Dan Sullivan Chapter 1: Advantages of Hybrid Storage... 1 Overview of Flash Deployment in Hybrid Storage Systems...
More informationOptimizing Shared Resource Contention in HPC Clusters
Optimizing Shared Resource Contention in HPC Clusters Sergey Blagodurov Simon Fraser University Alexandra Fedorova Simon Fraser University Abstract Contention for shared resources in HPC clusters occurs
More informationCore Power Delivery Network Analysis of Core and Coreless Substrates in a Multilayer Organic Buildup Package
Core Power Delivery Network Analysis of Core and Coreless Substrates in a Multilayer Organic Buildup Package Ozgur Misman, Mike DeVita, Nozad Karim, Amkor Technology, AZ, USA 1900 S. Price Rd, Chandler,
More informationLecture 030 DSM CMOS Technology (3/24/10) Page 030-1
Lecture 030 DSM CMOS Technology (3/24/10) Page 030-1 LECTURE 030 - DEEP SUBMICRON (DSM) CMOS TECHNOLOGY LECTURE ORGANIZATION Outline Characteristics of a deep submicron CMOS technology Typical deep submicron
More informationPowerPC Microprocessor Clock Modes
nc. Freescale Semiconductor AN1269 (Freescale Order Number) 1/96 Application Note PowerPC Microprocessor Clock Modes The PowerPC microprocessors offer customers numerous clocking options. An internal phase-lock
More information. MEDIUM SPEED OPERATION - 8MHz (typ.) @ . MULTI-PACKAGE PARALLEL CLOCKING FOR HCC4029B HCF4029B PRESETTABLE UP/DOWN COUNTER BINARY OR BCD DECADE
HCC4029B HCF4029B PRESETTABLE UP/DOWN COUNTER BINARY OR BCD DECADE. MEDIUM SPEED OPERATION - 8MHz (typ.) @ CL = 50pF AND DD-SS = 10. MULTI-PACKAGE PARALLEL CLOCKING FOR SYNCHRONOUS HIGH SPEED OUTPUT RES-
More informationArchitectural Level Power Consumption of Network on Chip. Presenter: YUAN Zheng
Architectural Level Power Consumption of Network Presenter: YUAN Zheng Why Architectural Low Power Design? High-speed and large volume communication among different parts on a chip Problem: Power consumption
More informationECE 5745 Complex Digital ASIC Design Course Overview
ECE 5745 Complex Digital ASIC Design Course Overview Christopher Batten School of Electrical and Computer Engineering Cornell University http://www.csl.cornell.edu/courses/ece5745 Application Algorithm
More informationfind model parameters, to validate models, and to develop inputs for models. c 1994 Raj Jain 7.1
Monitors Monitor: A tool used to observe the activities on a system. Usage: A system programmer may use a monitor to improve software performance. Find frequently used segments of the software. A systems
More informationHow To Reduce Energy Dissipation Of An On Chip Memory Array
Proceedings of 2013 IFIP/IEEE 21st International Conference on Very Large Scale Integration (VLSI-SoC) Adapting the Columns of Storage Components for Lower Static Energy Dissipation Mehmet Burak Aykenar,
More informationFPGA-based MapReduce Framework for Machine Learning
FPGA-based MapReduce Framework for Machine Learning Bo WANG 1, Yi SHAN 1, Jing YAN 2, Yu WANG 1, Ningyi XU 2, Huangzhong YANG 1 1 Department of Electronic Engineering Tsinghua University, Beijing, China
More informationReconfigurable Architecture Requirements for Co-Designed Virtual Machines
Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Kenneth B. Kent University of New Brunswick Faculty of Computer Science Fredericton, New Brunswick, Canada ken@unb.ca Micaela Serra
More informationIntroduction to Microprocessors
Introduction to Microprocessors Yuri Baida yuri.baida@gmail.com yuriy.v.baida@intel.com October 2, 2010 Moscow Institute of Physics and Technology Agenda Background and History What is a microprocessor?
More informationOracle Database Reliability, Performance and scalability on Intel Xeon platforms Mitch Shults, Intel Corporation October 2011
Oracle Database Reliability, Performance and scalability on Intel platforms Mitch Shults, Intel Corporation October 2011 1 Intel Processor E7-8800/4800/2800 Product Families Up to 10 s and 20 Threads 30MB
More informationHow To Design A Chip Layout
Spezielle Anwendungen des VLSI Entwurfs Applied VLSI design (IEF170) Course and contest Intermediate meeting 3 Prof. Dirk Timmermann, Claas Cornelius, Hagen Sämrow, Andreas Tockhorn, Philipp Gorski, Martin
More informationMulti-Threading Performance on Commodity Multi-Core Processors
Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction
More informationExascale Challenges and General Purpose Processors. Avinash Sodani, Ph.D. Chief Architect, Knights Landing Processor Intel Corporation
Exascale Challenges and General Purpose Processors Avinash Sodani, Ph.D. Chief Architect, Knights Landing Processor Intel Corporation Jun-93 Aug-94 Oct-95 Dec-96 Feb-98 Apr-99 Jun-00 Aug-01 Oct-02 Dec-03
More information1.1 Silicon on Insulator a brief Introduction
Table of Contents Preface Acknowledgements Chapter 1: Overview 1.1 Silicon on Insulator a brief Introduction 1.2 Circuits and SOI 1.3 Technology and SOI Chapter 2: SOI Materials 2.1 Silicon on Heteroepitaxial
More informationClocking. Figure by MIT OCW. 6.884 - Spring 2005 2/18/05 L06 Clocks 1
ing Figure by MIT OCW. 6.884 - Spring 2005 2/18/05 L06 s 1 Why s and Storage Elements? Inputs Combinational Logic Outputs Want to reuse combinational logic from cycle to cycle 6.884 - Spring 2005 2/18/05
More informationOperating Systems. 05. Threads. Paul Krzyzanowski. Rutgers University. Spring 2015
Operating Systems 05. Threads Paul Krzyzanowski Rutgers University Spring 2015 February 9, 2015 2014-2015 Paul Krzyzanowski 1 Thread of execution Single sequence of instructions Pointed to by the program
More information