Challenges for Worst-case Execution Time Analysis of Multi-core Architectures
|
|
- Erika Woods
- 7 years ago
- Views:
Transcription
1 Challenges for Worst-case Execution Time Analysis of Multi-core Architectures Jan saarland university computer science Intel, Braunschweig April 29, 2013
2 The Context: Hard Real-Time Systems Safety-critical applications: Avionics, automotive, train industries, manufacturing Side airbag in car Side airbag in car Reaction in < 10 msec Crankshaft-synchronous tasks Reaction in < 45 microsec Embedded controllers must finish their tasks within given time bounds. Developers would like to know the Worst-Case Execution Time (WCET) to give a guarantee. Jan Reineke, Saarland 2
3 The Timing Analysis Problem Embedded + Software? Timing Requirements Microarchitecture Jan Reineke, Saarland 3
4 What does the execution time depend on? The input, determining which path is taken through the program. The state of the hardware platform: l Due to caches, pipelining, speculation, etc. Interference from the environment: l External interference as seen from the analyzed task on shared busses, caches, memory. Simple CPU Memory Jan Reineke, Saarland 4
5 What does the execution time depend on? The input, determining which path is taken through the program. The state of the hardware platform: l Due to caches, pipelining, speculation, etc. Interference from the environment: l External interference as seen from the analyzed task on shared busses, caches, memory. Complex CPU (out-of-order Simple execution, branch CPU prediction, etc.) L1 Cache Memory Main Memory Jan Reineke, Saarland 5
6 What does the execution time depend on? The input, determining which path is taken through the program. The state of the hardware platform: l Due to caches, pipelining, speculation, etc. Interference from the environment: l External interference as seen from the analyzed task on shared busses, caches, memory. Complex Complex CPU CPU (out-of-order Simple execution,... branch CPU prediction, etc.) Complex CPU L1 Cache L1 Cache L1 Cache L2 Memory Cache Main Main Memory Jan Reineke, Saarland 6
7 Example of Influence of Microarchitectural State x=a+b; LOAD LOAD ADD r2, _a r1, _b r3,r2,r1 PowerPC 755 Courte Jan Reineke, Saarland 7
8 Example of Influence of Corunning Tasks in Multicores Radojkovic et al. (ACM TACO, 2012) on Intel Atom and Intel Core 2 Quad: up to 14x slow-down due to interference on shared L2 cache and memory controller Jan Reineke, Saarland 8
9 Challenges 1. Modeling How to construct sound timing models? 2. Analysis How to precisely & efficiently bound the WCET? 3. Design How to design microarchitectures that enable precise & efficient WCET analysis? Jan Reineke, Saarland 9
10 The Modeling Challenge Microarchitecture? Timing Model Timing model = Formal specification of microarchitecture s timing Incorrect timing model à possibly incorrect WCET bound. Jan Reineke, Saarland 10
11 Current Process of Deriving Timing Model Microarchitecture? Timing Model Jan Reineke, Saarland 11
12 Current Process of Deriving Timing Model Microarchitecture? Timing Model Jan Reineke, Saarland 12
13 Current Process of Deriving Timing Model Microarchitecture? Timing Model Jan Reineke, Saarland 13
14 Current Process of Deriving Timing Model Microarchitecture? Timing Model Jan Reineke, Saarland 14
15 Current Process of Deriving Timing Model Microarchitecture? Timing Model à Time-consuming, and à error-prone. Jan Reineke, Saarland 15
16 Current Process of Deriving Timing Model Microarchitecture? Timing Model à Time-consuming, and à error-prone. Jan Reineke, Saarland 16
17 1. Future Process of Deriving Timing Model Microarchitecture VHDL Model Timing Model Jan Reineke, Saarland 17
18 1. Future Process of Deriving Timing Model Microarchitecture VHDL Model Timing Model Derive timing model automatically from formal specification of microarchitecture. à Less manual effort, thus less time-consuming, and à provably correct. Jan Reineke, Saarland 18
19 1. Future Process of Deriving Timing Model Microarchitecture VHDL Model Timing Model Derive timing model automatically from formal specification of microarchitecture. à Less manual effort, thus less time-consuming, and à provably correct. Jan Reineke, Saarland 19
20 1. Future Process of Deriving Timing Model Microarchitecture VHDL Model Timing Model Derive timing model automatically from formal specification of microarchitecture. à Less manual effort, thus less time-consuming, and à provably correct. Jan Reineke, Saarland 20
21 2. Future Process of Deriving Timing Model Microarchitecture Perform measurements on hardware Infer model Timing Model Jan Reineke, Saarland 21
22 2. Future Process of Deriving Timing Model Microarchitecture Perform measurements on hardware Infer model Timing Model Derive timing model automatically from measurements on the hardware using ideas from automata learning. à No manual effort, and à (under certain assumptions) provably correct. à Also useful to validate assumptions about microarch. Jan Reineke, Saarland 22
23 2. Future Process of Deriving Timing Model Microarchitecture Perform measurements on hardware Infer model Timing Model Derive timing model automatically from measurements on the hardware using ideas from automata learning. à No manual effort, and à (under certain assumptions) provably correct. à Also useful to validate assumptions about microarch. Jan Reineke, Saarland 23
24 2. Future Process of Deriving Timing Model Microarchitecture Perform measurements on hardware Infer model Timing Model Derive timing model automatically from measurements on the hardware using ideas from automata learning. à No manual effort, and à (under certain assumptions) provably correct. à Also useful to validate assumptions about microarch. Jan Reineke, Saarland 24
25 2. Future Process of Deriving Timing Model Microarchitecture Perform measurements on hardware Infer model Timing Model Derive timing model automatically from measurements on the hardware using ideas from automata learning. à No manual effort, and à (under certain assumptions) provably correct. à Also useful to validate assumptions about microarch. Jan Reineke, Saarland 25
26 Proof-of-concept: Automatic Modeling of the Cache Hierarchy Cache Model is important part of Timing Model Can be characterized by a few parameters: l ABC: associativity, block size, capacity l Replacement policy B = Block Size Tag Data Tag Data Tag Data A = Associativity Tag Tag Data Data Tag Tag Data Data... Tag Tag Data Data Tag Data Tag Data Tag Data N = Number of Cache Sets chi [Abel and Reineke, RTAS 2013] derives all of these parameters fully automatically. Jan Reineke, Saarland 26
27 Example: Intel Core 2 Duo E6750, L1 Data Cache Misses L1 Misses Size Jan Reineke, Saarland 27
28 Example: Intel Core 2 Duo E6750, L1 Data Cache Misses L1 Misses Size Capacity = 32 KB Jan Reineke, Saarland 28
29 Example: Intel Core 2 Duo E6750, L1 Data Cache Way Size = 4 KB Misses L1 Misses Size Capacity = 32 KB Jan Reineke, Saarland 29
30 Replacement Policy Approach inspired by methods to learn finite automata. Heavily specialized to problem domain. Jan Reineke, Saarland 30
31 Replacement Policy Approach inspired by methods to learn finite automata. Heavily specialized to problem domain. Discovered to our knowledge undocumented policy of the Intel Atom D525: d x a b c d e f d c a b e f x e d c a b More information: Jan Reineke, Saarland 31
32 Modeling Challenge: Future Work Extend automation to other parts of the microarchitecture: Translation lookaside buffers, branch predictors Shared caches in multicores including their coherency protocols Out-of-order pipelines? Jan Reineke, Saarland 32
33 The Analysis Challenge Microarchitecture! Timing Model? Precise & Efficient Timing Analysis Consider all possible program inputs Consider all possible initial states of the hardware WCET H (P ):= max i2inputs max ET H(P, i, h) h2states(h) Jan Reineke, Saarland 33
34 The Analysis Challenge Consider all possible program inputs Consider all possible initial states of the hardware WCET H (P ):= max i2inputs max ET H(P, i, h) h2states(h) Explicitly evaluating ET for all inputs and all hardware states is not feasible in practice: There are simply too many. è Need for abstraction and thus approximation! Jan Reineke, Saarland 34
35 The Analysis Challenge: State of the Art Component Caches, Branch Target Buffers Analysis Status Precise & efficient abstractions, for LRU [Ferdinand, 1999] Not-so-precise but efficient abstractions, for FIFO, PLRU, MRU [Grund and Reineke, ] Complex Pipelines Precise but very inefficient; little abstraction Major challenge: timing anomalies Shared resources, e.g. busses, shared caches, DRAM No realistic approaches yet Major challenge: interference between hardware threads à execution time depends on corunning tasks Jan Reineke, Saarland 35
36 Timing Anomalies Timing Anomaly = Local worst-case does not imply Global worst-case Resource 1 A D E Resource 2 C B Resource 1 A D E Resource 2 B C Scheduling Anomaly Jan Reineke, Saarland 36
37 Timing Anomalies Timing Anomaly = Local worst-case does not imply Global worst-case Branch Condition Evaluated Cache Hit A Prefetch B - Miss C Cache Miss A C Speculation Anomaly Jan Reineke, Saarland 37
38 The Design Challenge Wanted: Multi-/many-core architecture with No timing anomalies à precise & efficient analysis of individual cores Temporal isolation between cores à independent/incremental development & analysis and high performance! Jan Reineke, Saarland 38
39 Approaches to the Design Challenge At the level of individual cores: Simple in-order pipelines, with static or no branch prediction Scratchpad Memories or LRU Caches Softwarecontrolled caches Jan Reineke, Saarland 39
40 Approaches to the Design Challenge For resources shared among multiple cores: Temporal partitioning, e.g. l TDMA arbitration of buses, shared memories l Thread-interleaved pipeline in PRET Spatial partitioning, e.g. l Partition shared caches l Partition shared DRAM à Temporal isolation Jan Reineke, Saarland 40
41 Design Challenge: Predictable Pipelining from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, Jan Reineke, Saarland 41
42 Pipelining: Hazards from Hennessy and Patterson, Computer Architecture: A Quantitative Approach, Jan Reineke, Saarland 42
43 Forwarding helps, but not all the time LD R1, 45(r2) DADD R5, R1, R7 BE R5, R3, R0 ST R5, 48(R2) Unpipelined The Dream F D E M W F D E M W F D E M W F D E M W F D E M W F D E M W F D E M W F D E M W F D E M W The Reality F D E M W Memory Hazard F D E M W Data Hazard F D E M W Branch Hazard Jan Reineke, Saarland 43
44 Solution: PTARM Thread-interleaved Pipelines [Lickly et al., CASES 2008] + T1: F D E M W F D E M W T2: F D E M W F D E M W T3: F D E M W F D E M W T4: F D E M W F D E M W T5: F D E M W F D E M W Each thread occupies only one stage of the pipeline at a time à No hazards; perfect utilization of pipeline à Simple hardware implementation (no forwarding, etc.) à Each instruction takes the same amount of time à Temporal isolation between different hardware threads Drawback: reduced single-thread performance Jan Reineke, Saarland 44
45 Design Challenge: DRAM Controller Translates sequences of memory accesses by Clients (CPUs and I/O) into legal sequences of DRAM commands l Needs to obey all timing constraints l Needs to insert refresh commands sufficiently often l Needs to translate physical memory addresses into row/column/ bank tuples CPU1... CPU1 Interconnect + Arbitration Memory Controller DRAM Module I/O Jan Reineke, Saarland 45
46 Dynamic RAM Timing Constraints DRAM Memory Controllers have to conform to different timing constraints that define minimal distances between consecutive DRAM commands. Almost all of these constraints are due to the sharing of resources at different levels of the hierarchy: addr+cmd DIMM Bit line Word line Transistor Capacitor Bank Row Address Row Decoder DRAM Array Sense Amplifiers and Row Buffer Column Decoder/ Multiplexer command chip select address Control Logic Mode Register Address Register DRAM Device Row Address Mux Refresh Counter Bank Bank Bank Bank I/O Gating I/O Registers + Data I/O data 16 data 64 data 16 data 16 data 16 data 16 x16 Device x16 Device x16 Device x16 Device x16 Device x16 Device x16 Device x16 Device chip select 0 chip select 1 Rank 0 Rank 1 Needs to insert refresh commands sufficiently often Rows within a bank share sense amplifiers Banks within a DRAM device share I/O gating and control logic Different ranks share data/address/ command busses Jan Reineke, Saarland 46
47 General-Purpose DRAM Controllers Schedule DRAM commands dynamically Timing hard to predict even for single client: l Timing of request depends on past requests: Request to same/different bank? Request to open/closed row within bank? Controller might reorder requests to minimize latency l Controllers dynamically schedule refreshes No temporal isolation. Timing depends on behavior of other clients: l They influence sequence of past requests l Arbitration may or may not provide guarantees Jan Reineke, Saarland 47
48 General-Purpose DRAM Controllers Thread 1 Thread 2 Load B1.R3.C2 Load B2.R4.C3 Store B4.R3.C5 Load B3.R3.C2 Load B3.R5.C3 Store B2.R3.C5 Arbitration Load B1.R3.C2 Load B3.R3.C2 Load B2.R4.C3 Store B4.R3.C5 Load B3.R5.C3 Store B2.R3.C5? Memory Controller Jan Reineke, Saarland 48
49 General-Purpose DRAM Controllers Thread 1 Thread 2 Load B1.R3.C2 Load B2.R4.C3 Store B4.R3.C5 Load B3.R3.C2 Load B3.R5.C3 Store B2.R3.C5 Arbitration Load B3.R3.C2 Load B1.R3.C2 Load B2.R4.C3 Load B3.R5.C3 Store B4.R3.C5 Store B2.R3.C5? Memory Controller Jan Reineke, Saarland 49
50 General-Purpose DRAM Controllers Load B1.R3.C2 Load B1.R4.C3 Load B1.R3.C5 Memory Controller RAS B1.R3 CAS B1.C2 RAS B1.R4 CAS B1.C3 RAS B1.R3 CAS B1.C5? RAS B1.R3 CAS B1.C2 CAS B1.C5 RAS B1.R4 CAS B1.C3 Jan Reineke, Saarland 50
51 PRET DRAM Controller: Three Innovations [Reineke et al., CODES+ISSS 2011] Expose internal structure of DRAM devices: l Expose individual banks within DRAM device as multiple independent resources CPU1... CPU1 Interconnect + Arbitration PRET DRAM Controller DRAM Bank DRAM Module DRAM Module DRAM Module I/O Defer refreshes to the end of transactions l Allows to hide refresh latency Perform refreshes manually : l Replace standard refresh command with multiple reads Jan Reineke, Saarland 51
52 PRET DRAM Controller: Exploiting Internal Structure of DRAM Module l Consists of 4-8 banks in 1-2 ranks Share only command and data bus, otherwise independent l Partition banks into four groups in alternating ranks l Cycle through groups in a time-triggered fashion Rank 0: Bank 0 Bank 1 Bank 2 Bank 3 Rank 1: Bank 0 Bank 1 Bank 2 Bank 3 Jan Reineke, Saarland 52
53 PRET DRAM Controller: Exploiting Internal Structure of DRAM Module l Consists of 4-8 banks in 1-2 ranks Share only command and data bus, otherwise independent l Partition banks into four groups in alternating ranks l Cycle through groups in a time-triggered fashion Rank 0: Bank 0 Bank 1 Bank 2 Bank 3 Successive accesses to same group obey timing constraints Reads/writes to different groups do not interfere Rank 1: Bank 0 Bank 1 Bank 2 Bank 3 Jan Reineke, Saarland 53
54 PRET DRAM Controller: Exploiting Internal Structure of DRAM Module l Consists of 4-8 banks in 1-2 ranks Share only command and data bus, otherwise independent l Partition banks into four groups in alternating ranks l Cycle through groups in a time-triggered fashion Rank 0: Bank 0 Bank 1 Bank 2 Bank 3 Successive accesses to same group obey timing constraints Reads/writes to different groups do not interfere Rank 1: Bank 0 Bank 1 Bank 2 Bank 3 Provides four independent and predictable resources Jan Reineke, Saarland 54
55 Conventional DRAM Controller (DRAMSim2) vs PRET DRAM Controller: Latency Evaluation latency [cycles] 3,000 2,000 1,000 0 Varying interference for fixed transfer size: Interference [# of other threads occupied] 4096B transfers, conventional controller 4096B transfers, PRET controller 1024B transfers, conventional controller 1024B transfers, PRET controller Figure 9: Latencies of conventional and PRET memory con- average latency [cycles] 3,000 2,000 1,000 0 Varying transfer size at maximal interference: Conventional controller PRET controller 0 1,000 2,000 3,000 4,000 transfer size [bytes] Figure 10: Latencies of conventional and PRET memory con- More information: Jan Reineke, Saarland 55
56 Emerging Challenge: Microarchitecture Selection & Configuration Embedded Software with? Timing Requirements Family of Microarchitectures = Platform Jan Reineke, Saarland 56
57 Emerging Challenge: Microarchitecture Selection & Configuration Embedded Software with? Timing Requirements Choices: Processor frequency Sizes and latencies of local memories Latency and bandwidth of interconnect Presence of floatingpoint unit Family of Microarchitectures = Platform Jan Reineke, Saarland 57
58 Emerging Challenge: Microarchitecture Selection & Configuration Embedded Software Timing Requirements Choices: Processor frequency Sizes and latencies of local memories Latency and bandwidth of interconnect Presence of floatingpoint unit? Select a microarchitecture that a) satisfies all timing requirements, and with b) minimizes cost/size/energy. Family of Microarchitectures = Platform Jan Reineke, Saarland 58
59 Conclusions Challenges in modeling analysis design remain. Jan Reineke, Saarland 59
60 Conclusions Challenges in modeling analysis design remain. Progress based on automation abstraction partitioning has been made. Jan Reineke, Saarland 60
61 Conclusions Challenges in modeling analysis design remain. Progress based on automation abstraction partitioning has been made. Thank you for your attention! Jan Reineke, Saarland 61
Designing Predictable Multicore Architectures for Avionics and Automotive Systems extended abstract
Designing Predictable Multicore Architectures for Avionics and Automotive Systems extended abstract Reinhard Wilhelm, Christian Ferdinand, Christoph Cullmann, Daniel Grund, Jan Reineke, Benôit Triquet
More information1. Memory technology & Hierarchy
1. Memory technology & Hierarchy RAM types Advances in Computer Architecture Andy D. Pimentel Memory wall Memory wall = divergence between CPU and RAM speed We can increase bandwidth by introducing concurrency
More informationChapter 1 Computer System Overview
Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Eighth Edition By William Stallings Operating System Exploits the hardware resources of one or more processors Provides
More informationMeasuring Cache and Memory Latency and CPU to Memory Bandwidth
White Paper Joshua Ruggiero Computer Systems Engineer Intel Corporation Measuring Cache and Memory Latency and CPU to Memory Bandwidth For use with Intel Architecture December 2008 1 321074 Executive Summary
More informationIntel X38 Express Chipset Memory Technology and Configuration Guide
Intel X38 Express Chipset Memory Technology and Configuration Guide White Paper January 2008 Document Number: 318469-002 INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE,
More informationOpenSPARC T1 Processor
OpenSPARC T1 Processor The OpenSPARC T1 processor is the first chip multiprocessor that fully implements the Sun Throughput Computing Initiative. Each of the eight SPARC processor cores has full hardware
More informationMotivation: Smartphone Market
Motivation: Smartphone Market Smartphone Systems External Display Device Display Smartphone Systems Smartphone-like system Main Camera Front-facing Camera Central Processing Unit Device Display Graphics
More informationMore on Pipelining and Pipelines in Real Machines CS 333 Fall 2006 Main Ideas Data Hazards RAW WAR WAW More pipeline stall reduction techniques Branch prediction» static» dynamic bimodal branch prediction
More informationPutting it all together: Intel Nehalem. http://www.realworldtech.com/page.cfm?articleid=rwt040208182719
Putting it all together: Intel Nehalem http://www.realworldtech.com/page.cfm?articleid=rwt040208182719 Intel Nehalem Review entire term by looking at most recent microprocessor from Intel Nehalem is code
More informationIntel Q35/Q33, G35/G33/G31, P35/P31 Express Chipset Memory Technology and Configuration Guide
Intel Q35/Q33, G35/G33/G31, P35/P31 Express Chipset Memory Technology and Configuration Guide White Paper August 2007 Document Number: 316971-002 INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION
More informationOC By Arsene Fansi T. POLIMI 2008 1
IBM POWER 6 MICROPROCESSOR OC By Arsene Fansi T. POLIMI 2008 1 WHAT S IBM POWER 6 MICROPOCESSOR The IBM POWER6 microprocessor powers the new IBM i-series* and p-series* systems. It s based on IBM POWER5
More informationMemory Systems. Static Random Access Memory (SRAM) Cell
Memory Systems This chapter begins the discussion of memory systems from the implementation of a single bit. The architecture of memory chips is then constructed using arrays of bit implementations coupled
More informationIn-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller
In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency
More information361 Computer Architecture Lecture 14: Cache Memory
1 361 Computer Architecture Lecture 14 Memory cache.1 The Motivation for s Memory System Processor DRAM Motivation Large memories (DRAM) are slow Small memories (SRAM) are fast Make the average access
More informationA Dual-Layer Bus Arbiter for Mixed-Criticality Systems with Hypervisors
A Dual-Layer Bus Arbiter for Mixed-Criticality Systems with Hypervisors Bekim Cilku, Bernhard Frömel, Peter Puschner Institute of Computer Engineering Vienna University of Technology A1040 Wien, Austria
More informationComputer Architecture
Computer Architecture Slide Sets WS 2013/2014 Prof. Dr. Uwe Brinkschulte M.Sc. Benjamin Betting Part 11 Memory Management Computer Architecture Part 11 page 1 of 44 Prof. Dr. Uwe Brinkschulte, M.Sc. Benjamin
More informationSolution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches:
Multiple-Issue Processors Pipelining can achieve CPI close to 1 Mechanisms for handling hazards Static or dynamic scheduling Static or dynamic branch handling Increase in transistor counts (Moore s Law):
More informationPentium vs. Power PC Computer Architecture and PCI Bus Interface
Pentium vs. Power PC Computer Architecture and PCI Bus Interface CSE 3322 1 Pentium vs. Power PC Computer Architecture and PCI Bus Interface Nowadays, there are two major types of microprocessors in the
More informationLecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.
Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide
More informationPikeOS: Multi-Core RTOS for IMA. Dr. Sergey Tverdyshev SYSGO AG 29.10.2012, Moscow
PikeOS: Multi-Core RTOS for IMA Dr. Sergey Tverdyshev SYSGO AG 29.10.2012, Moscow Contents Multi Core Overview Hardware Considerations Multi Core Software Design Certification Consideratins PikeOS Multi-Core
More informationEE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May 2000. ILP Execution
EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May 2000 Lecture #11: Wednesday, 3 May 2000 Lecturer: Ben Serebrin Scribe: Dean Liu ILP Execution
More informationCOMPUTER HARDWARE. Input- Output and Communication Memory Systems
COMPUTER HARDWARE Input- Output and Communication Memory Systems Computer I/O I/O devices commonly found in Computer systems Keyboards Displays Printers Magnetic Drives Compact disk read only memory (CD-ROM)
More informationVLIW Processors. VLIW Processors
1 VLIW Processors VLIW ( very long instruction word ) processors instructions are scheduled by the compiler a fixed number of operations are formatted as one big instruction (called a bundle) usually LIW
More informationEnergy-Efficient, High-Performance Heterogeneous Core Design
Energy-Efficient, High-Performance Heterogeneous Core Design Raj Parihar Core Design Session, MICRO - 2012 Advanced Computer Architecture Lab, UofR, Rochester April 18, 2013 Raj Parihar Energy-Efficient,
More informationThe Orca Chip... Heart of IBM s RISC System/6000 Value Servers
The Orca Chip... Heart of IBM s RISC System/6000 Value Servers Ravi Arimilli IBM RISC System/6000 Division 1 Agenda. Server Background. Cache Heirarchy Performance Study. RS/6000 Value Server System Structure.
More informationMultithreading Lin Gao cs9244 report, 2006
Multithreading Lin Gao cs9244 report, 2006 2 Contents 1 Introduction 5 2 Multithreading Technology 7 2.1 Fine-grained multithreading (FGMT)............. 8 2.2 Coarse-grained multithreading (CGMT)............
More informationVHDL DESIGN OF EDUCATIONAL, MODERN AND OPEN- ARCHITECTURE CPU
VHDL DESIGN OF EDUCATIONAL, MODERN AND OPEN- ARCHITECTURE CPU Martin Straka Doctoral Degree Programme (1), FIT BUT E-mail: strakam@fit.vutbr.cz Supervised by: Zdeněk Kotásek E-mail: kotasek@fit.vutbr.cz
More informationTIME PREDICTABLE CPU AND DMA SHARED MEMORY ACCESS
TIME PREDICTABLE CPU AND DMA SHARED MEMORY ACCESS Christof Pitter Institute of Computer Engineering Vienna University of Technology, Austria cpitter@mail.tuwien.ac.at Martin Schoeberl Institute of Computer
More informationIBM CELL CELL INTRODUCTION. Project made by: Origgi Alessandro matr. 682197 Teruzzi Roberto matr. 682552 IBM CELL. Politecnico di Milano Como Campus
Project made by: Origgi Alessandro matr. 682197 Teruzzi Roberto matr. 682552 CELL INTRODUCTION 2 1 CELL SYNERGY Cell is not a collection of different processors, but a synergistic whole Operation paradigms,
More informationOperating Systems. Virtual Memory
Operating Systems Virtual Memory Virtual Memory Topics. Memory Hierarchy. Why Virtual Memory. Virtual Memory Issues. Virtual Memory Solutions. Locality of Reference. Virtual Memory with Segmentation. Page
More informationA Lab Course on Computer Architecture
A Lab Course on Computer Architecture Pedro López José Duato Depto. de Informática de Sistemas y Computadores Facultad de Informática Universidad Politécnica de Valencia Camino de Vera s/n, 46071 - Valencia,
More informationHomework # 2. Solutions. 4.1 What are the differences among sequential access, direct access, and random access?
ECE337 / CS341, Fall 2005 Introduction to Computer Architecture and Organization Instructor: Victor Manuel Murray Herrera Date assigned: 09/19/05, 05:00 PM Due back: 09/30/05, 8:00 AM Homework # 2 Solutions
More informationAES Power Attack Based on Induced Cache Miss and Countermeasure
AES Power Attack Based on Induced Cache Miss and Countermeasure Guido Bertoni, Vittorio Zaccaria STMicroelectronics, Advanced System Technology Agrate Brianza - Milano, Italy, {guido.bertoni, vittorio.zaccaria}@st.com
More informationThe Quest for Speed - Memory. Cache Memory. A Solution: Memory Hierarchy. Memory Hierarchy
The Quest for Speed - Memory Cache Memory CSE 4, Spring 25 Computer Systems http://www.cs.washington.edu/4 If all memory accesses (IF/lw/sw) accessed main memory, programs would run 20 times slower And
More informationCHAPTER 7: The CPU and Memory
CHAPTER 7: The CPU and Memory The Architecture of Computer Hardware, Systems Software & Networking: An Information Technology Approach 4th Edition, Irv Englander John Wiley and Sons 2010 PowerPoint slides
More informationIntel 965 Express Chipset Family Memory Technology and Configuration Guide
Intel 965 Express Chipset Family Memory Technology and Configuration Guide White Paper - For the Intel 82Q965, 82Q963, 82G965 Graphics and Memory Controller Hub (GMCH) and Intel 82P965 Memory Controller
More informationCS/COE1541: Introduction to Computer Architecture. Memory hierarchy. Sangyeun Cho. Computer Science Department University of Pittsburgh
CS/COE1541: Introduction to Computer Architecture Memory hierarchy Sangyeun Cho Computer Science Department CPU clock rate Apple II MOS 6502 (1975) 1~2MHz Original IBM PC (1981) Intel 8080 4.77MHz Intel
More informationHIPEAC 2015. Segregation of Subsystems with Different Criticalities on Networked Multi-Core Chips in the DREAMS Architecture
HIPEAC 2015 Segregation of Subsystems with Different Criticalities on Networked Multi-Core Chips in the DREAMS Architecture University of Siegen Roman Obermaisser Overview Mixed-Criticality Systems Modular
More informationThe Classical Architecture. Storage 1 / 36
1 / 36 The Problem Application Data? Filesystem Logical Drive Physical Drive 2 / 36 Requirements There are different classes of requirements: Data Independence application is shielded from physical storage
More informationA N. O N Output/Input-output connection
Memory Types Two basic types: ROM: Read-only memory RAM: Read-Write memory Four commonly used memories: ROM Flash, EEPROM Static RAM (SRAM) Dynamic RAM (DRAM), SDRAM, RAMBUS, DDR RAM Generic pin configuration:
More informationSwitch Fabric Implementation Using Shared Memory
Order this document by /D Switch Fabric Implementation Using Shared Memory Prepared by: Lakshmi Mandyam and B. Kinney INTRODUCTION Whether it be for the World Wide Web or for an intra office network, today
More informationThis Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?
This Unit: Putting It All Together CIS 501 Computer Architecture Unit 11: Putting It All Together: Anatomy of the XBox 360 Game Console Slides originally developed by Amir Roth with contributions by Milo
More informationNaveen Muralimanohar Rajeev Balasubramonian Norman P Jouppi
Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 Naveen Muralimanohar Rajeev Balasubramonian Norman P Jouppi University of Utah & HP Labs 1 Large Caches Cache hierarchies
More informationExploring the Design of the Cortex-A15 Processor ARM s next generation mobile applications processor. Travis Lanier Senior Product Manager
Exploring the Design of the Cortex-A15 Processor ARM s next generation mobile applications processor Travis Lanier Senior Product Manager 1 Cortex-A15: Next Generation Leadership Cortex-A class multi-processor
More informationSOC architecture and design
SOC architecture and design system-on-chip (SOC) processors: become components in a system SOC covers many topics processor: pipelined, superscalar, VLIW, array, vector storage: cache, embedded and external
More informationMemory ICS 233. Computer Architecture and Assembly Language Prof. Muhamed Mudawar
Memory ICS 233 Computer Architecture and Assembly Language Prof. Muhamed Mudawar College of Computer Sciences and Engineering King Fahd University of Petroleum and Minerals Presentation Outline Random
More information21152 PCI-to-PCI Bridge
Product Features Brief Datasheet Intel s second-generation 21152 PCI-to-PCI Bridge is fully compliant with PCI Local Bus Specification, Revision 2.1. The 21152 is pin-to-pin compatible with Intel s 21052,
More informationComputer Systems Structure Main Memory Organization
Computer Systems Structure Main Memory Organization Peripherals Computer Central Processing Unit Main Memory Computer Systems Interconnection Communication lines Input Output Ward 1 Ward 2 Storage/Memory
More informationMemory Basics. SRAM/DRAM Basics
Memory Basics RAM: Random Access Memory historically defined as memory array with individual bit access refers to memory with both Read and Write capabilities ROM: Read Only Memory no capabilities for
More informationComputer Architecture
Computer Architecture Random Access Memory Technologies 2015. április 2. Budapest Gábor Horváth associate professor BUTE Dept. Of Networked Systems and Services ghorvath@hit.bme.hu 2 Storing data Possible
More informationByte Ordering of Multibyte Data Items
Byte Ordering of Multibyte Data Items Most Significant Byte (MSB) Least Significant Byte (LSB) Big Endian Byte Addresses +0 +1 +2 +3 +4 +5 +6 +7 VALUE (8-byte) Least Significant Byte (LSB) Most Significant
More informationWhat is a System on a Chip?
What is a System on a Chip? Integration of a complete system, that until recently consisted of multiple ICs, onto a single IC. CPU PCI DSP SRAM ROM MPEG SoC DRAM System Chips Why? Characteristics: Complex
More informationA Mixed Time-Criticality SDRAM Controller
NEST COBRA CA4 A Mixed Time-Criticality SDRAM Controller MeAOW 3-9-23 Sven Goossens, Benny Akesson, Kees Goossens Mixed Time-Criticality 2/5 Embedded multi-core systems are getting more complex: Integrating
More informationMixed-Criticality: Integration of Different Models of Computation. University of Siegen, Roman Obermaisser
Workshop on "Challenges in Mixed Criticality, Real-time, and Reliability in Networked Complex Embedded Systems" Mixed-Criticality: Integration of Different Models of Computation University of Siegen, Roman
More informationIntel Itanium Quad-Core Architecture for the Enterprise. Lambert Schaelicke Eric DeLano
Intel Itanium Quad-Core Architecture for the Enterprise Lambert Schaelicke Eric DeLano Agenda Introduction Intel Itanium Roadmap Intel Itanium Processor 9300 Series Overview Key Features Pipeline Overview
More informationSlide Set 8. for ENCM 369 Winter 2015 Lecture Section 01. Steve Norman, PhD, PEng
Slide Set 8 for ENCM 369 Winter 2015 Lecture Section 01 Steve Norman, PhD, PEng Electrical & Computer Engineering Schulich School of Engineering University of Calgary Winter Term, 2015 ENCM 369 W15 Section
More informationParallel Computing 37 (2011) 26 41. Contents lists available at ScienceDirect. Parallel Computing. journal homepage: www.elsevier.
Parallel Computing 37 (2011) 26 41 Contents lists available at ScienceDirect Parallel Computing journal homepage: www.elsevier.com/locate/parco Architectural support for thread communications in multi-core
More informationWorst-Case Execution Time Prediction by Static Program Analysis
Worst-Case Execution Time Prediction by Static Program Analysis Reinhold Heckmann Christian Ferdinand AbsInt Angewandte Informatik GmbH Science Park 1, D-66123 Saarbrücken, Germany Phone: +49-681-38360-0
More informationMemory Access Control in Multiprocessor for Real-time Systems with Mixed Criticality
Memory Access Control in Multiprocessor for Real-time Systems with Mixed Criticality Heechul Yun +, Gang Yao +, Rodolfo Pellizzoni *, Marco Caccamo +, Lui Sha + University of Illinois at Urbana and Champaign
More informationCPU Shielding: Investigating Real-Time Guarantees via Resource Partitioning
CPU Shielding: Investigating Real-Time Guarantees via Resource Partitioning Progress Report 1 John Scott Tillman jstillma@ncsu.edu CSC714 Real-Time Computer Systems North Carolina State University Instructor:
More informationDDR4 Memory Technology on HP Z Workstations
Technical white paper DDR4 Memory Technology on HP Z Workstations DDR4 is the latest memory technology available for main memory on mobile, desktops, workstations, and server computers. DDR stands for
More informationAdvanced Computer Architecture-CS501. Computer Systems Design and Architecture 2.1, 2.2, 3.2
Lecture Handout Computer Architecture Lecture No. 2 Reading Material Vincent P. Heuring&Harry F. Jordan Chapter 2,Chapter3 Computer Systems Design and Architecture 2.1, 2.2, 3.2 Summary 1) A taxonomy of
More informationChapter 2 Basic Structure of Computers. Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan
Chapter 2 Basic Structure of Computers Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan Outline Functional Units Basic Operational Concepts Bus Structures Software
More informationLecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com
CSCI-GA.3033-012 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Modern GPU
More informationComputer Organization and Architecture. Characteristics of Memory Systems. Chapter 4 Cache Memory. Location CPU Registers and control unit memory
Computer Organization and Architecture Chapter 4 Cache Memory Characteristics of Memory Systems Note: Appendix 4A will not be covered in class, but the material is interesting reading and may be used in
More informationİSTANBUL AYDIN UNIVERSITY
İSTANBUL AYDIN UNIVERSITY FACULTY OF ENGİNEERİNG SOFTWARE ENGINEERING THE PROJECT OF THE INSTRUCTION SET COMPUTER ORGANIZATION GÖZDE ARAS B1205.090015 Instructor: Prof. Dr. HASAN HÜSEYİN BALIK DECEMBER
More informationHigh Performance Processor Architecture. André Seznec IRISA/INRIA ALF project-team
High Performance Processor Architecture André Seznec IRISA/INRIA ALF project-team 1 2 Moore s «Law» Nb of transistors on a micro processor chip doubles every 18 months 1972: 2000 transistors (Intel 4004)
More informationSPARC64 VIIIfx: CPU for the K computer
SPARC64 VIIIfx: CPU for the K computer Toshio Yoshida Mikio Hondo Ryuji Kan Go Sugizaki SPARC64 VIIIfx, which was developed as a processor for the K computer, uses Fujitsu Semiconductor Ltd. s 45-nm CMOS
More informationIntroduction to RISC Processor. ni logic Pvt. Ltd., Pune
Introduction to RISC Processor ni logic Pvt. Ltd., Pune AGENDA What is RISC & its History What is meant by RISC Architecture of MIPS-R4000 Processor Difference Between RISC and CISC Pros and Cons of RISC
More informationMulticore partitioned systems based on hypervisor
Preprints of the 19th World Congress The International Federation of Automatic Control Multicore partitioned systems based on hypervisor A. Crespo M. Masmano J. Coronel S. Peiró P. Balbastre J. Simó Universitat
More informationThe Big Picture. Cache Memory CSE 675.02. Memory Hierarchy (1/3) Disk
The Big Picture Cache Memory CSE 5.2 Computer Processor Memory (active) (passive) Control (where ( brain ) programs, Datapath data live ( brawn ) when running) Devices Input Output Keyboard, Mouse Disk,
More informationDesign Cycle for Microprocessors
Cycle for Microprocessors Raúl Martínez Intel Barcelona Research Center Cursos de Verano 2010 UCLM Intel Corporation, 2010 Agenda Introduction plan Architecture Microarchitecture Logic Silicon ramp Types
More informationCOS 318: Operating Systems. Virtual Memory and Address Translation
COS 318: Operating Systems Virtual Memory and Address Translation Today s Topics Midterm Results Virtual Memory Virtualization Protection Address Translation Base and bound Segmentation Paging Translation
More informationINSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER
Course on: Advanced Computer Architectures INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER Prof. Cristina Silvano Politecnico di Milano cristina.silvano@polimi.it Prof. Silvano, Politecnico di Milano
More informationCentral Processing Unit
Chapter 4 Central Processing Unit 1. CPU organization and operation flowchart 1.1. General concepts The primary function of the Central Processing Unit is to execute sequences of instructions representing
More informationCS 61C: Great Ideas in Computer Architecture Virtual Memory Cont.
CS 61C: Great Ideas in Computer Architecture Virtual Memory Cont. Instructors: Vladimir Stojanovic & Nicholas Weaver http://inst.eecs.berkeley.edu/~cs61c/ 1 Bare 5-Stage Pipeline Physical Address PC Inst.
More information<Insert Picture Here> T4: A Highly Threaded Server-on-a-Chip with Native Support for Heterogeneous Computing
T4: A Highly Threaded Server-on-a-Chip with Native Support for Heterogeneous Computing Robert Golla Senior Hardware Architect Paul Jordan Senior Principal Hardware Engineer Oracle
More informationIA-64 Application Developer s Architecture Guide
IA-64 Application Developer s Architecture Guide The IA-64 architecture was designed to overcome the performance limitations of today s architectures and provide maximum headroom for the future. To achieve
More informationOverview. CISC Developments. RISC Designs. CISC Designs. VAX: Addressing Modes. Digital VAX
Overview CISC Developments Over Twenty Years Classic CISC design: Digital VAX VAXÕs RISC successor: PRISM/Alpha IntelÕs ubiquitous 80x86 architecture Ð 8086 through the Pentium Pro (P6) RJS 2/3/97 Philosophy
More informationWhat is a bus? A Bus is: Advantages of Buses. Disadvantage of Buses. Master versus Slave. The General Organization of a Bus
Datorteknik F1 bild 1 What is a bus? Slow vehicle that many people ride together well, true... A bunch of wires... A is: a shared communication link a single set of wires used to connect multiple subsystems
More informationMaking Dynamic Memory Allocation Static To Support WCET Analyses
Making Dynamic Memory Allocation Static To Support WCET Analyses Jörg Herter Jan Reineke Department of Computer Science Saarland University WCET Workshop, June 2009 Jörg Herter, Jan Reineke Making Dynamic
More informationPOSIX. RTOSes Part I. POSIX Versions. POSIX Versions (2)
RTOSes Part I Christopher Kenna September 24, 2010 POSIX Portable Operating System for UnIX Application portability at source-code level POSIX Family formally known as IEEE 1003 Originally 17 separate
More informationGPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics
GPU Architectures A CPU Perspective Derek Hower AMD Research 5/21/2013 Goals Data Parallelism: What is it, and how to exploit it? Workload characteristics Execution Models / GPU Architectures MIMD (SPMD),
More informationConfiguring and using DDR3 memory with HP ProLiant Gen8 Servers
Engineering white paper, 2 nd Edition Configuring and using DDR3 memory with HP ProLiant Gen8 Servers Best Practice Guidelines for ProLiant servers with Intel Xeon processors Table of contents Introduction
More informationEnterprise Applications
Enterprise Applications Chi Ho Yue Sorav Bansal Shivnath Babu Amin Firoozshahian EE392C Emerging Applications Study Spring 2003 Functionality Online Transaction Processing (OLTP) Users/apps interacting
More informationAccelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software
WHITEPAPER Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software SanDisk ZetaScale software unlocks the full benefits of flash for In-Memory Compute and NoSQL applications
More informationExploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand
Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand P. Balaji, K. Vaidyanathan, S. Narravula, K. Savitha, H. W. Jin D. K. Panda Network Based
More informationCentral Processing Unit (CPU)
Central Processing Unit (CPU) CPU is the heart and brain It interprets and executes machine level instructions Controls data transfer from/to Main Memory (MM) and CPU Detects any errors In the following
More informationIntroducción. Diseño de sistemas digitales.1
Introducción Adapted from: Mary Jane Irwin ( www.cse.psu.edu/~mji ) www.cse.psu.edu/~cg431 [Original from Computer Organization and Design, Patterson & Hennessy, 2005, UCB] Diseño de sistemas digitales.1
More informationThe Microarchitecture of Superscalar Processors
The Microarchitecture of Superscalar Processors James E. Smith Department of Electrical and Computer Engineering 1415 Johnson Drive Madison, WI 53706 ph: (608)-265-5737 fax:(608)-262-1267 email: jes@ece.wisc.edu
More informationThis Unit: Multithreading (MT) CIS 501 Computer Architecture. Performance And Utilization. Readings
This Unit: Multithreading (MT) CIS 501 Computer Architecture Unit 10: Hardware Multithreading Application OS Compiler Firmware CU I/O Memory Digital Circuits Gates & Transistors Why multithreading (MT)?
More informationADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM
ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM 1 The ARM architecture processors popular in Mobile phone systems 2 ARM Features ARM has 32-bit architecture but supports 16 bit
More informationRAM & ROM Based Digital Design. ECE 152A Winter 2012
RAM & ROM Based Digital Design ECE 152A Winter 212 Reading Assignment Brown and Vranesic 1 Digital System Design 1.1 Building Block Circuits 1.1.3 Static Random Access Memory (SRAM) 1.1.4 SRAM Blocks in
More informationADVANCED COMPUTER ARCHITECTURE
ADVANCED COMPUTER ARCHITECTURE Marco Ferretti Tel. Ufficio: 0382 985365 E-mail: marco.ferretti@unipv.it Web: www.unipv.it/mferretti, eecs.unipv.it 1 Course syllabus and motivations This course covers the
More informationConcept of Cache in web proxies
Concept of Cache in web proxies Chan Kit Wai and Somasundaram Meiyappan 1. Introduction Caching is an effective performance enhancing technique that has been used in computer systems for decades. However,
More informationWith respect to the way of data access we can classify memories as:
Memory Classification With respect to the way of data access we can classify memories as: - random access memories (RAM), - sequentially accessible memory (SAM), - direct access memory (DAM), - contents
More informationLSN 2 Computer Processors
LSN 2 Computer Processors Department of Engineering Technology LSN 2 Computer Processors Microprocessors Design Instruction set Processor organization Processor performance Bandwidth Clock speed LSN 2
More informationParallel Processing and Software Performance. Lukáš Marek
Parallel Processing and Software Performance Lukáš Marek DISTRIBUTED SYSTEMS RESEARCH GROUP http://dsrg.mff.cuni.cz CHARLES UNIVERSITY PRAGUE Faculty of Mathematics and Physics Benchmarking in parallel
More informationHyperThreading Support in VMware ESX Server 2.1
HyperThreading Support in VMware ESX Server 2.1 Summary VMware ESX Server 2.1 now fully supports Intel s new Hyper-Threading Technology (HT). This paper explains the changes that an administrator can expect
More information150127-Microprocessor & Assembly Language
Chapter 3 Z80 Microprocessor Architecture The Z 80 is one of the most talented 8 bit microprocessors, and many microprocessor-based systems are designed around the Z80. The Z80 microprocessor needs an
More information