Thread Level Parallelism (TLP)
- Adele Greer
- 8 years ago
Transcription
Thread Level Parallelism (TLP)
Calcolatori Elettronici 2
Roberto Giorgi, Universita di Siena, C208L15
TLP: SUN Microsystems vision (2004)
Estimated Industry Trends
- Moore's Law allows for the rapid increase in transistors per core. TLP optimised cores will start out much simpler, and may grow complex more slowly.
- The trend is for chips and CPU cores to get smaller, though TLP optimised ones will start much smaller.
- Growth rates in maximum power for "fat" CPUs have levelled off a bit. For "thin" cores, the number of CPU cores per chip will probably increase rather than the power consumption per core.
- "Fat" cores need lots of cache to reduce memory latency. TLP optimised designs are less latency sensitive, so less cache is needed.
- Better process technology helps both types to increase, though the simpler, slower clocked "thin" cores will be slower on more traditional benchmarks.
- "Fat" cores will benefit from TLP techniques and general improvements, but not as much as "thin" cores.
Current 4-way SMP
- An illustration of a 4-way system today. The only TLP comes from having multiple chips.
Toward NIAGARA chips
- An illustration of a system with a heavily optimised TLP design.
Niagara: A Torrent of Threads
- Niagara floorplan
First Niagara Chips: November 2005
- UltraSPARC T1
- Niagara systems deliver 14 times the performance of an UltraSPARC IIIi system
- Systems with the single-chip Niagara 2: 35 times
- Systems with Victoria Falls: 65 times
EMBEDDED SYSTEM TRENDS
Global Embedded Systems Revenue (by Region)
- AAGR: average annual growth rate
- [Chart: global embedded systems revenue ($ billions) and AAGR% by region: Americas, Europe, Japan, Asia-Pacific]
- Source: Future of Embedded Systems Technology, BCC Co, Inc., 2005
Global Embedded Systems Revenue (by Application)
- [Chart: world embedded systems revenue ($ billions) and AAGR% by application: Telecomm, Consumer, Automotive, Medical/Office, Industrial/Milit]
- Source: Future of Embedded Systems Technology, BCC Co, Inc., 2005
Global Embedded HW Revenue
- MPU: microprocessors; MCU: microcontrollers
- [Chart: global embedded hardware revenue ($ billions) and AAGR% by category: MPU, MCU, DSP, Memory, ASIC/PLD, Analog]
- Source: Future of Embedded Systems Technology, BCC Co, Inc., 2005
Projected Technology Progress
- [Chart: transistor density of MPUs (including SRAM), in Mtransistors/cm², by year]
- Transistor number will continue to scale for some time
- Source: Process Integration, Devices and Structures, ITRS, 2005
Embedded Platforms Roadmap
- [Chart: use of embedded processors in FPGAs, as a percentage split among hard FPGA processor, soft FPGA processor, and no FPGA processor]
- Hardwired logic (ASIC-like) is being replaced by embedded processor devices
- Source: Survey of System Design Trends, Celoxica Inc., August 2005
Embedded Processors: Innovation driven by Technology + Architecture Advances
- Multi-processing: higher throughput with less speed
- Source: The Era of Tera, Pat Gelsinger, Intel, 2005
Case Study: ITRS Mobile Handheld Roadmap
- [Table with rows: year of production, process technology (nm), supply voltage (V), clock frequency (MHz), processing performance (GOPS), average power (W), standby power (mW), applications (real-time video codec, TV telephone)]
- Performance and energy efficiency (GOPS/W) increase by 200x
- Source: System Drivers, ITRS, 2003
ITRS Low-Power SoC
- Many processing elements (PEs)
- Reusability and multi-standard requirements drive towards programmable (processor-based) solutions
- (Heterogeneous) multi-processor systems-on-a-chip (SoC)
- Source: System Drivers, ITRS, 2005
ITRS Low-Power SoC Processing/Performance Trends
- More than 100 processing elements in 2011!
- Source: System Drivers, ITRS, 2005
Future Embedded System Design Trends
- The mobile handset market is the driving commercial factor
- New applications and wireless transmission standards require high-performance, low-power embedded computing
- ITRS foresees 3x magnitude improvement in performance and energy efficiency over the next 10 years
- (Heterogeneous) multi-processor system-on-chip platforms
- Compiler technologies for high-performance, low-power embedded computing will be needed
- Compiler and system-design tools for heterogeneous, massively parallel processing systems and networks
Network of Excellence HiPEAC
- High-Performance Embedded Architectures and Compilers
- IST
- Web Site
What We Have Now
- ACACES Extranet (program, practical info, ...)
- Participant management
- HiPEAC Conference Extranet (committees, call for papers, practical info, ...)
- Paper submission (Commence)
SARC: Scalable ARChitectures
- WEB Site:
Paradigm shift
- Tiled architecture, built from fixed-size nodes
- The architecture scales up by adding nodes, NOT by growing the node size
- The node becomes the processor
- The processors become the functional units
Programming model features
- The programming model will have tagged procedure calls
- Define local and global (shared) variables
  - Defines address range(s) to copy to local store
  - Automatic programming of DMA transfers
  - Defines address range(s) to watch for interference
- Set procedure properties
  - Has secondary effects (modifies global state)
  - Reads global space
  - Writes global space
  - Requires atomicity
    - Regarding local variables
    - Regarding global variables
- Processor functionality requirements
  - Supports a specific ISA extension (or a different ISA)
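The slides do not show any syntax for tagged procedure calls, so the following is only a minimal Python sketch of the idea: procedures carry declared properties, and a runtime consults them before scheduling work on different nodes. The names `Tags`, `tagged`, and `can_run_in_parallel`, and the deliberately conservative conflict rule, are assumptions made for illustration and are not part of the SARC programming model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tags:
    """Hypothetical set of procedure properties, mirroring the slide's list."""
    reads_global: bool = False
    writes_global: bool = False
    atomic: bool = False
    isa: str = "base"  # required ISA extension; "base" means none


def tagged(**props):
    """Attach a Tags record to a procedure (illustrative decorator)."""
    def wrap(fn):
        fn.tags = Tags(**props)
        return fn
    return wrap


def can_run_in_parallel(f, g):
    """Conservative check: two procedures conflict if either one writes
    global space while the other reads or writes it."""
    a, b = f.tags, g.tags
    conflict = (a.writes_global and (b.reads_global or b.writes_global)) or \
               (b.writes_global and (a.reads_global or a.writes_global))
    return not conflict


@tagged(reads_global=True)
def histogram(data):
    return {x: data.count(x) for x in set(data)}


@tagged(reads_global=True, writes_global=True, atomic=True)
def update_total(state, delta):
    state["total"] = state.get("total", 0) + delta


@tagged(isa="vector")
def scale(values, k):
    return [k * v for v in values]
```

Under these assumptions, `scale` touches no global space and can be paired with anything, while `histogram` and `update_total` would have to be ordered or protected. A real runtime would also compare the declared address ranges rather than a single flag.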
Intra-node memory hierarchy
- Architecture must be easy to program for: shared memory
- Accelerators may have:
  - Local memory: private, non-coherent
  - DMA controller: bridge between global memory and local memory
- Accelerators must have:
  - Global memory access: directly, or through the cache hierarchy
- Single load/store instruction
  - The address range differentiates local memory from global memory
- [Diagram: accelerators (ACC), each with local memory, DMA, and accelerator cache(s), attached to the local interconnect and the outer shared cache]
Intra-node memory hierarchy (II)
- All caches inside a node must be coherent
- All outer caches (one from each node) should also be coherent
- Caches work as shared distributed memory
- If threads do not share memory
  - There is no coherence traffic and no overhead
  - There is no memory waste
- If threads share memory
  - Turning off coherence results in wrong execution
- What is the benefit of turning off coherence?
  - The hardware must be there anyway
  - Turn it off for power savings?
  - Lower memory access latency in non-shared mode?
  - Use the hardware for something else? (What else? Additional storage?)
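The "single load/store instruction" point from the first intra-node memory slide can be illustrated with a small Python model: one `load`/`store` pair, with the address alone deciding whether the access goes to the accelerator's private local memory or to shared global memory. The address window (`LOCAL_BASE`, `LOCAL_SIZE`) and the class and method names are assumptions chosen for the sketch; the slides give no concrete memory map.

```python
LOCAL_BASE = 0x1000_0000  # hypothetical start of the local-memory window
LOCAL_SIZE = 0x0004_0000  # hypothetical window size (256 KiB)


def is_local(addr):
    """True when the address falls inside the local-memory window."""
    return LOCAL_BASE <= addr < LOCAL_BASE + LOCAL_SIZE


class Accelerator:
    """Toy accelerator with private local memory and shared global memory."""

    def __init__(self, global_memory):
        self.local_memory = {}              # private, non-coherent
        self.global_memory = global_memory  # shared by all accelerators

    def _select(self, addr):
        # The address range is the only thing that selects the target memory.
        return self.local_memory if is_local(addr) else self.global_memory

    def load(self, addr):
        return self._select(addr).get(addr, 0)

    def store(self, addr, value):
        self._select(addr)[addr] = value

    def dma_copy(self, src, dst, count):
        """Bridge between global and local memory, one word at a time."""
        for offset in range(count):
            self.store(dst + offset, self.load(src + offset))
```

In this model two accelerators built on the same `global_memory` dictionary see each other's global stores, but never each other's local stores, which is the private, non-coherent behaviour the slide describes.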
Examples for intra-node memory
- [Diagram: accelerators (ACC) with different combinations of local memory, DMA, and accelerator cache(s), connected through the local interconnect to the outer shared cache]
Determine the node size
- If the node size is fixed, we must determine its size
- Split the available area among
  - Shared cache
  - Local interconnect
  - General purpose processor
  - Accelerators
- Fixed or flexible distribution?
  - Fixed GPP, cache, interconnect
  - Reconfigurable accelerator area
- How many accelerators can a thread actually exploit?
  - Streaming computation
  - Parallel computation
  - Task offloading
- [Diagram: outer cache memory, local interconnect, GPP]
Node examples
- Sea of simple cores
- Niagara
- Cell
- Few complex cores
- Power5
- Single vector/media/bio accelerator
- Multiple accelerators
- [Diagram: outer cache memory, local interconnect, GPP]
GPP Accelerator interface
- For the processor to become the functional unit, task offloading must have minimum overhead
- Accelerator as ISA extension
  - Shares PC, Fetch & Decode with a general purpose CPU
  - Issue logic sends instructions to CPU or accelerator units
  - Implements an extension of the base ISA
- Accelerator as a new CPU
  - Has a separate PC, Fetch, Decode engine
  - May implement a completely different ISA: VLIW, SIMD, Stack, 16-bit
- [Diagram: outer cache memory and local interconnect; a GPP whose Fetch & Dispatch feeds both CPU and ACC units, next to an ACC with its own Fetch & Dispatch]
Memory Hierarchy
- Set of coherent (processor-shared?) L1 caches inside the nodes: C x Node
- Set of coherent node-shared L2 caches inside the chip (one from each node): 1 x Node, N x Chip
- Chip-shared L3 cache: 1 x Chip
- Off-chip DRAM (or other memory technology)
- [Diagram: DRAM, I/O, L3 control, cache]
Motivation
- Hard to further scale uniprocessors
  - Brought back focus to multiprocessors
- Different applications profit from different techniques/types of parallelism
  - ILP, TLP, DLP
- Motivates a customizable system with
  - complex cores
  - simple cores
  - domain-specific accelerators
Motivation (2)
- Parallelism type exhibited by application and suitable architecture
- [Diagram labels: TLP, SSC, CMP+ vector, DLP, SMT vector, FCC, ILP]
SARC?
- complex cores
- simple cores
- accelerators
ISA considerations
- Complex cores and simple cores have the same ISA (this allows moving threads from one to another [for real-time performance, power, ...] and gives simpler programming and compilation)
- ISA-agnostic: approaches applicable to basically any ISA (ARM, PowerPC, ...)
- Accelerator ISAs: extensions of the GPP ISA
  - single instruction stream (co-processor instructions) or multiple instruction streams
How to realize customization?
- At design-time: the right mix of simple cores, complex cores, and accelerators is determined at design-time
  - Pro: highest performance for specific application domains
  - Con: after fabrication, only for specific application domains
- At run-time: there will be many processing cores on a chip; for temperature reasons some will have to be powered down anyhow
  - Pro: achieves good performance and low power on many applications
  - Con: performance not as high as at design-time
Levels of Abstraction
- Levels of abstraction:
  1. Architecture
  2. Microarchitecture
  3. Implementation
  4. Realization
- SARC WP1 focuses mainly on levels 1 and 2
SARC node architecture
Architectures of Domain Specific Accelerators
- SARC specifically targets (but is not limited to) the application domains
  - scientific computing (supercomputing)
  - bioinformatics
  - multimedia
  - internet and transaction processing
- These contain code pieces responsible for a large fraction of execution time
- Performance and power-efficiency can be improved significantly by employing domain-specific accelerators
Scientific Computing Vector Accelerator Architecture
- For applications dominated by loops with vector operands
- What are the innovations:
  - Matrix by matrix operations (at least 2D)
  - Dimensionality not encoded in the instructions (novel register file to support this)
  - Sparse and dense matrices considered identically
  - Auto-indexing and sectioning addressing mechanisms (link to WP2)
  - (possible) on-chip distributed vector facility
- ISA, data formats, register file organization and memory addressing scheme under investigation
Scientific Computing Vector Accelerator Architecture (cont)
- ISA (check the document)
- Operand types: vectors, matrices (sparse and dense), bit vectors and scalars (in sparse mode, ½ of the available registers are used as index vectors)
- Data formats: 64-bit FP; 8-, 16-, 32- and 64-bit INT and BOOL
- Auto indexing for rectangular patterns (dense):
Scientific Computing Vector Accelerator Architecture (cont)
- Register file: the SARC vector register file is a parameterizable register file, which can be logically reorganized by the programmer to support multiple register dimensions and sizes simultaneously
- Scalar register file shared with GPP
  1) Vector registers can overlap (think about it)
  2) Scalar registers can be used for conditional branches on the GPP side
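To make the register file slide more concrete, here is a minimal Python sketch of a register file whose logical registers are programmer-defined windows (base, rows, columns) over one flat storage array. It is an illustration only: the class `VectorRegisterFile`, its methods, and the row-major layout are assumptions, since the slides do not specify the actual SARC organization. It shows two of the stated properties: registers of different shapes coexist, and registers may overlap.

```python
class VectorRegisterFile:
    """Toy register file: logical registers are windows over flat storage."""

    def __init__(self, size):
        self.storage = [0] * size
        self.regs = {}  # name -> (base, rows, cols)

    def define(self, name, base, rows, cols=1):
        """(Re)define a logical register; windows are allowed to overlap."""
        if base < 0 or base + rows * cols > len(self.storage):
            raise ValueError("register does not fit in the storage")
        self.regs[name] = (base, rows, cols)

    def _flat(self, name):
        base, rows, cols = self.regs[name]
        return self.storage[base:base + rows * cols]

    def write(self, name, values):
        base, rows, cols = self.regs[name]
        flat = list(values)
        if len(flat) != rows * cols:
            raise ValueError("value count does not match register shape")
        self.storage[base:base + rows * cols] = flat

    def read(self, name):
        """Return the register contents as a list of rows (row-major)."""
        _, rows, cols = self.regs[name]
        flat = self._flat(name)
        return [flat[r * cols:(r + 1) * cols] for r in range(rows)]

    def add(self, dst, a, b):
        """Element-wise add; the shape comes from the register definitions,
        not from the operation itself."""
        x, y = self._flat(a), self._flat(b)
        if len(x) != len(y):
            raise ValueError("operand sizes differ")
        self.write(dst, [p + q for p, q in zip(x, y)])
```

For example, defining `M` as a 2x4 register at base 0 and `V` as a 1x4 register at base 4 makes `V` an alias for the second row of `M`, so a write to either one is visible through the other.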
Bioinformatics Accelerator
- Will have a scalar and a vector-SIMD part
- (Multiple) sequence alignment algorithms require:
  - support for efficient unaligned memory accesses
  - strided memory accesses
  - vector reduction operations, etc.
- In structure prediction, Monte Carlo or molecular dynamics simulations are common
  - can profit from earlier ASIC/FPGA work
- Docking profits from architectural features incorporated for structure prediction, but also from matrix rotations, transposes, ...
Multimedia accelerator
- Vector-SIMD architecture
- Architecture agnostic to physical vector length
- Avoid packing/unpacking and reorganization overhead
  - unpacking while loading
  - packing while storing
  - flexible access to register file
- Use more dimensions
Micro-architectural considerations
- Simple/complex GPP mixture
- Scalable cache coherence
- Support for (existing) sequential, single-threaded applications
  - Thread-level speculation
  - Kilo-instruction processors
I/O and Communication Subsystem
- Overheads of system call, context switch, interrupt, network protocol no longer justified
- With fewer threads than processing cores, there is no reason for switching execution context
- The OS must not run on the same processor as user applications
  - requires extra-low communication latency
Interconnection Network
- LANs/SANs are so fast that switching and routing have to be provided in hardware
  - but reliability and congestion control, which are left to the end-nodes, still need to be addressed
- Power considerations also
- Applies to multi-chip interconnection networks, but NoCs have to solve similar problems in a much more constrained environment
TRANSACTIONAL MEMORY
- The most difficult task when developing multithreaded applications is making sure that the program works (e.g. deadlocks may occur when combining correct code fragments)
- Transactional memory is a concurrency control mechanism for controlling access to shared memory
- A transaction is a piece of code that executes a series of reads and writes to shared memory, which logically occur at a single instant in time, and are typically implemented in a lock-free way
- Transactional memory is optimistic: every thread completes its modifications to shared memory without regard for what other threads might be doing, recording every read and write that it makes in a log; the log is validated in the commit stage
- Implementing part of the system memory as transactional memory could be the solution for storing shared data in parallel applications while simplifying programming
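The optimistic scheme described on the transactional memory slide (log every read and write, validate at commit, retry on conflict) can be sketched in a few lines of Python. This is a toy software model, not the hardware mechanism the slide has in mind, and it is not lock-free: a single global lock, an assumption made to keep the sketch short, serialises the brief commit step. The names `TVar`, `Transaction`, `atomically`, and `transfer` are illustrative.

```python
import threading


class TVar:
    """A shared variable with a version counter bumped on every commit."""

    def __init__(self, value):
        self.value = value
        self.version = 0


_commit_lock = threading.Lock()  # protects only validation + publication


class Transaction:
    def __init__(self):
        self.read_log = {}   # TVar -> version observed at first read
        self.write_log = {}  # TVar -> value to publish at commit

    def read(self, tvar):
        if tvar in self.write_log:          # read-your-own-writes
            return self.write_log[tvar]
        self.read_log.setdefault(tvar, tvar.version)
        return tvar.value

    def write(self, tvar, value):
        self.write_log[tvar] = value        # buffered, not yet visible

    def commit(self):
        with _commit_lock:
            # Validate: everything we read must still be at the same version.
            for tvar, seen in self.read_log.items():
                if tvar.version != seen:
                    return False
            # Publish all buffered writes as one logical step.
            for tvar, value in self.write_log.items():
                tvar.value = value
                tvar.version += 1
            return True


def atomically(body):
    """Run body(tx) until its transaction commits without conflict."""
    while True:
        tx = Transaction()
        result = body(tx)
        if tx.commit():
            return result


def transfer(src, dst, amount):
    def body(tx):
        tx.write(src, tx.read(src) - amount)
        tx.write(dst, tx.read(dst) + amount)
    atomically(body)
```

Threads calling `transfer` concurrently never block each other while running the transaction body; a thread whose reads were invalidated by another commit simply re-executes. Composing two transfers inside one `body` stays deadlock-free, which is the composability benefit the slide alludes to.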
Reflection
- PROBLEM: THINKING IN PARALLEL IS HARD!
- Perhaps: THINKING is hard! (YALE PATT - Sep. 2007)
http://www.artistembedded.org/fp6/ ARTIST Workshop at DATE 06 W4: Design Issues in Distributed, CommunicationCentric Systems Modelling Networked Embedded Systems: From MPSoC to Sensor Networks Jan Madsen
More informationIntroduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1
Introduction to GP-GPUs Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 GPU Architectures: How do we reach here? NVIDIA Fermi, 512 Processing Elements (PEs) 2 What Can It Do?
More informationBEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA
BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA AGENDA INTRO TO BEAGLEBONE BLACK HARDWARE & SPECS CORTEX-A8 ARMV7 PROCESSOR PROS & CONS VS RASPBERRY PI WHEN TO USE BEAGLEBONE BLACK Single
More informationMCA Standards For Closely Distributed Multicore
MCA Standards For Closely Distributed Multicore Sven Brehmer Multicore Association, cofounder, board member, and MCAPI WG Chair CEO of PolyCore Software 2 Embedded Systems Spans the computing industry
More informationVLIW Processors. VLIW Processors
1 VLIW Processors VLIW ( very long instruction word ) processors instructions are scheduled by the compiler a fixed number of operations are formatted as one big instruction (called a bundle) usually LIW
More informationA New, High-Performance, Low-Power, Floating-Point Embedded Processor for Scientific Computing and DSP Applications
1 A New, High-Performance, Low-Power, Floating-Point Embedded Processor for Scientific Computing and DSP Applications Simon McIntosh-Smith Director of Architecture 2 Multi-Threaded Array Processing Architecture
More informationCOMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook)
COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook) Vivek Sarkar Department of Computer Science Rice University vsarkar@rice.edu COMP
More informationArchitecture of Hitachi SR-8000
Architecture of Hitachi SR-8000 University of Stuttgart High-Performance Computing-Center Stuttgart (HLRS) www.hlrs.de Slide 1 Most of the slides from Hitachi Slide 2 the problem modern computer are data
More informationAchieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.
More informationHow To Build An Ark Processor With An Nvidia Gpu And An African Processor
Project Denver Processor to Usher in a New Era of Computing Bill Dally January 5, 2011 http://blogs.nvidia.com/2011/01/project-denver-processor-to-usher-in-new-era-of-computing/ Project Denver Announced
More informationReconfigurable System-on-Chip Design
Reconfigurable System-on-Chip Design MITCHELL MYJAK Senior Research Engineer Pacific Northwest National Laboratory PNNL-SA-93202 31 January 2013 1 About Me Biography BSEE, University of Portland, 2002
More informationMemory Architecture and Management in a NoC Platform
Architecture and Management in a NoC Platform Axel Jantsch Xiaowen Chen Zhonghai Lu Chaochao Feng Abdul Nameed Yuang Zhang Ahmed Hemani DATE 2011 Overview Motivation State of the Art Data Management Engine
More informationAn Overview of Stack Architecture and the PSC 1000 Microprocessor
An Overview of Stack Architecture and the PSC 1000 Microprocessor Introduction A stack is an important data handling structure used in computing. Specifically, a stack is a dynamic set of elements in which
More informationMore on Pipelining and Pipelines in Real Machines CS 333 Fall 2006 Main Ideas Data Hazards RAW WAR WAW More pipeline stall reduction techniques Branch prediction» static» dynamic bimodal branch prediction
More informationEmbedded Systems: map to FPGA, GPU, CPU?
Embedded Systems: map to FPGA, GPU, CPU? Jos van Eijndhoven jos@vectorfabrics.com Bits&Chips Embedded systems Nov 7, 2013 # of transistors Moore s law versus Amdahl s law Computational Capacity Hardware
More informationPrinciples and characteristics of distributed systems and environments
Principles and characteristics of distributed systems and environments Definition of a distributed system Distributed system is a collection of independent computers that appears to its users as a single
More informationExtending the Power of FPGAs. Salil Raje, Xilinx
Extending the Power of FPGAs Salil Raje, Xilinx Extending the Power of FPGAs The Journey has Begun Salil Raje Xilinx Corporate Vice President Software and IP Products Development Agenda The Evolution of
More informationHow To Build A Cloud Computer
Introducing the Singlechip Cloud Computer Exploring the Future of Many-core Processors White Paper Intel Labs Jim Held Intel Fellow, Intel Labs Director, Tera-scale Computing Research Sean Koehl Technology
More informationPower Reduction Techniques in the SoC Clock Network. Clock Power
Power Reduction Techniques in the SoC Network Low Power Design for SoCs ASIC Tutorial SoC.1 Power Why clock power is important/large» Generally the signal with the highest frequency» Typically drives a
More informationOperating Systems 4 th Class
Operating Systems 4 th Class Lecture 1 Operating Systems Operating systems are essential part of any computer system. Therefore, a course in operating systems is an essential part of any computer science
More informationGPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics
GPU Architectures A CPU Perspective Derek Hower AMD Research 5/21/2013 Goals Data Parallelism: What is it, and how to exploit it? Workload characteristics Execution Models / GPU Architectures MIMD (SPMD),
More informationINSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER
Course on: Advanced Computer Architectures INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER Prof. Cristina Silvano Politecnico di Milano cristina.silvano@polimi.it Prof. Silvano, Politecnico di Milano
More informationThis Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?
This Unit: Putting It All Together CIS 501 Computer Architecture Unit 11: Putting It All Together: Anatomy of the XBox 360 Game Console Slides originally developed by Amir Roth with contributions by Milo
More informationHigh Performance Processor Architecture. André Seznec IRISA/INRIA ALF project-team
High Performance Processor Architecture André Seznec IRISA/INRIA ALF project-team 1 2 Moore s «Law» Nb of transistors on a micro processor chip doubles every 18 months 1972: 2000 transistors (Intel 4004)
More informationEnterprise Applications
Enterprise Applications Chi Ho Yue Sorav Bansal Shivnath Babu Amin Firoozshahian EE392C Emerging Applications Study Spring 2003 Functionality Online Transaction Processing (OLTP) Users/apps interacting
More informationSGRT: A Scalable Mobile GPU Architecture based on Ray Tracing
SGRT: A Scalable Mobile GPU Architecture based on Ray Tracing Won-Jong Lee, Shi-Hwa Lee, Jae-Ho Nah *, Jin-Woo Kim *, Youngsam Shin, Jaedon Lee, Seok-Yoon Jung SAIT, SAMSUNG Electronics, Yonsei Univ. *,
More informationSoftware Programmable Data Allocation in Multi-Bank Memory of SIMD Processors
2010 13th Euromicro Conference on Digital System Design: Architectures, Methods and Tools Software Programmable Data Allocation in Multi-Bank of SIMD Processors Jian Wang, Joar Sohl, Olof Kraigher and
More informationAn Implementation Of Multiprocessor Linux
An Implementation Of Multiprocessor Linux This document describes the implementation of a simple SMP Linux kernel extension and how to use this to develop SMP Linux kernels for architectures other than
More informationCHAPTER 1 INTRODUCTION
1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.
More informationRambus Smart Data Acceleration
Rambus Smart Data Acceleration Back to the Future Memory and Data Access: The Final Frontier As an industry, if real progress is to be made towards the level of computing that the future mandates, then
More informationAll Programmable Logic. Hans-Joachim Gelke Institute of Embedded Systems. Zürcher Fachhochschule
All Programmable Logic Hans-Joachim Gelke Institute of Embedded Systems Institute of Embedded Systems 31 Assistants 10 Professors 7 Technical Employees 2 Secretaries www.ines.zhaw.ch Research: Education:
More informationAgenda. Michele Taliercio, Il circuito Integrato, Novembre 2001
Agenda Introduzione Il mercato Dal circuito integrato al System on a Chip (SoC) La progettazione di un SoC La tecnologia Una fabbrica di circuiti integrati 28 How to handle complexity G The engineering
More informationSoftware and the Concurrency Revolution
Software and the Concurrency Revolution A: The world s fastest supercomputer, with up to 4 processors, 128MB RAM, 942 MFLOPS (peak). 2 Q: What is a 1984 Cray X-MP? (Or a fractional 2005 vintage Xbox )
More informationHow To Understand The Design Of A Microprocessor
Computer Architecture R. Poss 1 What is computer architecture? 2 Your ideas and expectations What is part of computer architecture, what is not? Who are computer architects, what is their job? What is
More informationHardware accelerated Virtualization in the ARM Cortex Processors
Hardware accelerated Virtualization in the ARM Cortex Processors John Goodacre Director, Program Management ARM Processor Division ARM Ltd. Cambridge UK 2nd November 2010 Sponsored by: & & New Capabilities
More informationChapter 1: Introduction. What is an Operating System?
Chapter 1: Introduction What is an Operating System? Mainframe Systems Desktop Systems Multiprocessor Systems Distributed Systems Clustered System Real -Time Systems Handheld Systems Computing Environments
More information