Transactional Memory Coherence and Consistency
- Hugh Bailey
1 Transactional Memory Coherence and Consistency
(LBA Reading Group Presentation)
Shimin Chen
Based on talk slides at
2 The Need for Parallelism
Transactional Coherence & Consistency 2 Motivation
- Uniprocessor system scaling is hitting limits
  - Power consumption increasing dramatically
  - Wire delays becoming a limiting factor
  - Design and verification complexity is now overwhelming
  - Exploits limited instruction-level parallelism (ILP)
- So we need support for multiprocessors
  - Inherently avoid many of the design problems
  - Replicate small cores, don't design big ones: multi-core processors
  - Exploit thread-level parallelism (TLP)
  - But can still use ILP within cores
- But now we have new problems...
3 Existing Parallel Programming Models
- Message-passing
  - Large data packages
  - Relatively infrequent
  - Simple H/W requirements
  - Programmer effort: divide data and work
- Shared memory
  - Cache-block sized packages
  - Frequent, with high performance requirement
  - Important for load/store instructions
  - Complex H/W: cache coherence, memory consistency
  - Only slightly simpler programming model
4 Shared Memory Software Problems
- Parallel systems are often programmed with:
  - Synchronization through barriers
  - Shared variable access control through locks...
- Lock granularity and organization must balance performance and correctness
  - Coarse-grain locking: lock contention
  - Fine-grain locking: extra overhead
  - Must be careful to avoid deadlocks or races
  - Must be careful not to leave anything unprotected for correctness
- Performance tuning is not intuitive
  - Performance bottlenecks are related to low-level events, such as false sharing and coherence misses
  - Feedback is often indirect (cache lines, not variables)
5 Shared Memory Hardware Complexity
- Cache coherence protocols are complex
  - Must track ownership of cache lines
  - Difficult to implement and verify all corner cases
- Memory consistency protocols are complex
  - Must provide rules to correctly order individual loads/stores
  - Difficult for both hardware and software
- Current protocols rely on low latency, not bandwidth
  - Critical short control messages on ownership transfers (2-3 hops)
  - Latency of short messages unlikely to scale well in the future
  - Bandwidth likely to scale much better
    - High-speed inter-chip connections
    - Chip multiprocessors = on-chip bandwidth!
6 The Key Question
- Is there a shared memory model with:
  - A simple programming model?
  - A simple hardware implementation?
  - Good performance?
7 Transactional Memory: Idea (1/2)
- Transaction: a sequence of instructions that is guaranteed to be atomic
- Goal: perform consistency and coherence tasks at the granularity of transactions instead of individual memory references
8 Transactional Memory: Idea (2/2)
- Each processor executes its current transaction speculatively
  - Keeps track of speculatively read and written data
  - Buffers all writes
- Once a transaction on P finishes, P asks for system-wide arbitration for commit
  - P broadcasts its buffered writes
  - Other processors check the broadcast against their speculative state: roll back if there is a violation
- Upon commit, P checkpoints its local register state
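The steps on this slide can be sketched as a small software model. This is only an illustration under simplifying assumptions, not the hardware design: the `Processor` class, its method names, and the `increment` helper are all hypothetical, arbitration is assumed to have already granted commit permission, and register checkpointing is not modeled.

```python
class Processor:
    """One speculative processor (illustrative model, not real hardware)."""

    def __init__(self, name, memory):
        self.name = name
        self.memory = memory        # shared dict: address -> value
        self.read_set = set()       # addresses read speculatively
        self.write_buffer = {}      # speculative writes, private until commit
        self.violated = False

    def load(self, addr):
        # A local buffered store supplies the data if there is one.
        if addr in self.write_buffer:
            return self.write_buffer[addr]
        self.read_set.add(addr)
        return self.memory.get(addr, 0)

    def store(self, addr, value):
        self.write_buffer[addr] = value

    def snoop(self, committed_addrs):
        # Another processor committed a write to data we already read.
        if self.read_set & committed_addrs:
            self.violated = True

    def rollback(self):
        self.read_set.clear()
        self.write_buffer.clear()
        self.violated = False

    def commit(self, others):
        # Assumes commit permission was already granted by arbitration.
        self.memory.update(self.write_buffer)
        written = set(self.write_buffer)
        for other in others:
            other.snoop(written)
        self.read_set.clear()
        self.write_buffer.clear()


def increment(proc, addr):
    proc.store(addr, proc.load(addr) + 1)


memory = {"X": 0}
a, b = Processor("A", memory), Processor("B", memory)
increment(a, "X")          # A reads X = 0 speculatively
increment(b, "X")          # B reads X = 0 speculatively
b.commit([a])              # B commits first; A snoops the broadcast
a_violated = a.violated
if a.violated:             # A read stale data, so it rolls back and retries
    a.rollback()
    increment(a, "X")
a.commit([b])
print(memory["X"], a_violated)
```

Both increments take effect even though the two transactions overlapped, because the later committer detects the conflict and re-executes with the new value.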
9 Transactional Memory I
- Transactions appear to execute in the commit order
- RAW dependence errors cause transaction violation and restart
[Figure: timeline of Transactions A and B. Each executes LOAD X and STORE X; when one commits X, the other takes a violation and re-executes LOAD X and STORE X with the new data.]
10 Transactional Memory II
- Antidependencies are automatically handled (details later)
  - WAW are handled by writing buffers only in commit order
  - WAR are handled by keeping all writes private until commit
[Figure: two timelines. WAW: Transactions A and B each STORE X into a local buffer and commit X to shared memory in order. WAR: one transaction executes ST X = 1, LD X, Commit X while the other executes ST X = 2, LD X, Commit X. Any local stores supply data; later transactions are started or updated with the shared state.]
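A minimal sketch of why private write buffers remove WAR and WAW hazards. The dictionaries `shared`, `buf_a`, `buf_b` and the `commit` helper are hypothetical stand-ins for shared memory and the per-transaction buffers.

```python
shared = {"X": 0}

# Each transaction keeps its stores in a private buffer (hypothetical layout).
buf_a = {}
buf_b = {}

# WAR: A reads X, then B writes X. B's write stays in B's buffer,
# so the value A already read is still what shared memory holds.
a_saw = shared["X"]
buf_b["X"] = 2
still_old = shared["X"]

# WAW: A also writes X. The final value depends only on commit order.
buf_a["X"] = 1


def commit(buffer):
    """Copy a transaction's buffered writes to shared memory as one block."""
    shared.update(buffer)
    buffer.clear()


commit(buf_a)               # A commits first
after_a = shared["X"]
commit(buf_b)               # B commits second and its value wins
after_b = shared["X"]
print(a_saw, still_old, after_a, after_b)
```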
11 TCC's Difference
- So what? Transactional memory is old news...
  - Herlihy et al. proposed to replace locks a decade ago
  - Rajwar and Goodman / Martinez and Torrellas proposed more automated versions of the same thing recently
  - Thread-level speculation (TLS) uses transactional memory
- TCC's new idea: leave transactions on all of the time
  - Provides MANY new benefits
  - Completely eliminates conventional cache coherence and consistency models
12 The TCC Cycle
- Transactions now run in a cycle
  - Continues for all execution
- Speculatively execute code and buffer
- Wait for commit permission
  - Phase provides synchronization, if necessary
  - Arbitrate with other CPUs
- Commit stores together, as a block
  - Provides a well-defined write ordering
  - Can invalidate or update other caches
  - Large block utilizes bandwidth effectively
- And repeat!
[Figure: timeline for processors P0, P1, P2. Stages: Transaction Starts, Execute Code, Requests Commit, Wait for Phase, Arbitrate, Commit Starts, Commit, Commit Finishes, Transaction Completes.]
13 Advantages of TCC
- Trades bandwidth for simplicity and latency tolerance
  - Easier to build
  - Not dependent on timing/latency of loads/stores
- Transactions eliminate locks
  - Transactions are inherently atomic
  - Catches most common parallel programming errors
- Shared memory consistency is simplified
  - Conventional model sequences individual loads and stores
  - Now hardware only has to sequence transaction commits
- Shared memory coherence is simplified
  - Processors may have copies of cache lines in any state (no MESI)
  - Commit order implies an ownership sequence
14 How to Use TCC I
- Divide code into potentially parallel tasks
  - Usually loop iterations, after function calls, etc.
  - For initial division, tasks = transactions
  - But tasks can be subdivided or grouped to match hardware limits (buffering)
- Similar to threading in conventional parallel programming, but:
  - We do not have to verify parallelism in advance
  - Locking is handled automatically
  - Therefore, much easier to get a parallel program running correctly!
- Programmer then orders transactions as necessary
  - Ordering techniques implemented using phase numbers
  - Assign an age number to each transaction
  - Deadlock-free (at least one transaction is always oldest)
  - Livelock-free (watchdog hardware can easily insert barriers anywhere)
- Three common scenarios...
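One way to picture phase-number ordering is an arbiter that only grants commits to transactions in the oldest outstanding phase. The sketch below is an assumption about how such a rule could look, not the arbitration logic from the talk; `commit_grants`, `active`, and `ready` are hypothetical names.

```python
def commit_grants(active, ready):
    """Return the CPUs allowed to commit now.

    active: dict mapping every CPU with a running transaction to its phase
    ready:  set of CPUs that have finished executing and requested commit
    """
    oldest = min(active.values())   # some transaction is always oldest
    return sorted(cpu for cpu in ready if active[cpu] == oldest)


# Unordered: every transaction shares one phase, so any ready CPU may commit.
unordered = commit_grants({0: 0, 1: 0, 2: 0}, {1, 2})

# Fully ordered: distinct phases force commits in sequential order.
blocked = commit_grants({0: 0, 1: 1, 2: 2}, {1, 2})   # CPU 0 is still running
in_order = commit_grants({0: 0, 1: 1, 2: 2}, {0, 1})

# Barrier: CPU 2 has reached the next phase and waits for the stragglers.
at_barrier = commit_grants({0: 5, 1: 5, 2: 6}, {2})
released = commit_grants({0: 6, 1: 6, 2: 6}, {2})

print(unordered, blocked, in_order, at_barrier, released)
```

Because the minimum phase always exists, at least one transaction is eligible to commit once it finishes, which is the intuition behind the deadlock-freedom bullet.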
15 How to Use TCC II
- Unordered for purely parallel code
- Fully ordered to specify sequential tasks
- Partially ordered to insert synchronization like barriers
[Figure: timelines for CPU 0, CPU 1, CPU 2 showing unordered transactions, a barrier as a phase transition, and parallelized sequential code transactions.]
16 How to Use TCC III
- Performance tuning
  - Based on feedback from runtime violation reports
  - Choose transactions to maximize parallelism and minimize dependencies
17 Sample TCC Hardware
- Write buffers and some L1 cache bits (read, modified, optional renamed)
  - Write buffer in processor, before broadcast
- A broadcast bus or network to distribute commit packets
  - All processors see the commits in a single order
  - Snooping on broadcasts triggers violations, if necessary
- Commit arbitration/sequencing logic
[Figure: one node (Node #) with a processor core, a local cache hierarchy (L1 cache with transaction control bits: Read, Modified, etc.), a stores-only write buffer, and commit control with phase tracking for Node 0, Node 1, Node 2. Loads and stores go to the cache; commits go out to other nodes and snooping comes in from other nodes over a broadcast bus or network.]
18 Sample TCC Hardware
- Cannot overflow
  - Must hold states in cache or victim cache
- A fix: ask for commit permission
  - State is no longer speculative
  - Pseudo-overflow used to handle I/O
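The overflow fix can be sketched as follows. This is a simplified guess at the behavior the slide describes: once the buffer is full, the transaction obtains commit permission early, drains its buffer, and writes through for the rest of the transaction. The class name, `capacity` parameter, and the assumption that arbitration succeeds immediately are all illustrative; a real system would also have to hold off other commits while the permission is held.

```python
class OverflowingTransaction:
    """Write buffer that falls back to early commit permission when full."""

    def __init__(self, memory, capacity):
        self.memory = memory                  # shared dict: address -> value
        self.capacity = capacity              # max distinct buffered addresses
        self.buffer = {}
        self.holds_commit_permission = False

    def store(self, addr, value):
        if self.holds_commit_permission:
            self.memory[addr] = value         # no longer speculative
            return
        if addr not in self.buffer and len(self.buffer) >= self.capacity:
            self._acquire_commit_permission()
            self.memory[addr] = value
            return
        self.buffer[addr] = value

    def _acquire_commit_permission(self):
        # Assumption: arbitration is modeled as succeeding immediately.
        self.holds_commit_permission = True
        self.memory.update(self.buffer)
        self.buffer.clear()

    def end(self):
        # Normal end of transaction: commit what is buffered, release permission.
        self.memory.update(self.buffer)
        self.buffer.clear()
        self.holds_commit_permission = False


memory = {}
tx = OverflowingTransaction(memory, capacity=2)
tx.store("a", 1)
tx.store("b", 2)
before_overflow = dict(memory)     # nothing visible yet: still speculative
tx.store("c", 3)                   # third address overflows the buffer
held = tx.holds_commit_permission
tx.store("a", 9)                   # written straight through
tx.end()
print(before_overflow, held, memory)
```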
19 Evaluation Methodology
- We simulated a wide range of applications:
  - SPLASH-2, SPEC, Java, SPECjbb
  - Divided into transactions using a preliminary TCC API
- Trace-based analysis
  - Generated execution traces from sequential execution
  - Then analyzed the traces while varying:
    - Number of processors
    - Interconnect bandwidth
    - Communication overheads
- Simplifications
  - Results shown assume infinite caches and write buffers
    - But we track the amount of state stored in them
  - Fixed one cycle/instruction
    - Would require a reasonable superscalar processor for this rate
20 Limits of Available Parallelism
- TCC speedups are similar to conventional ones
  - And sometimes better: SPECjbb eliminates locking overhead within warehouses
[Figure: two plots of Speedup Over 1 Processor against Number of Processors, titled "Explicitly Parallel Applications" and "TLS-Java Applications". Legend entries: art, equake_l, equake_s, lu, radix, SPECjbb, swim, tomcatv, water; euler, fft, jbyte_b5, jbyte_b6, jbyte_b8, jbyte_b9, moldyn, mtrt, raytrace, shallow.]
21 Write Buffering Needs
- Only a few KB of write buffering needed
  - Set by the natural transaction sizes in applications
  - Occasional overflow can be handled by committing early
  - Reasonable for on-chip buffers
[Figure: Write State in KB (with 64B lines) per application: moldyn, equake_s, euler, jbyte_b9, raytrace, jbyte_b5, jbyte_b6, shallow, equake_l, jbyte_b8, art, water-8p, fft, swim, mtrt, tomcatv, lu-8p, radix-m-8p, radix-xs-8p, radix-s-8p, SPECjbb-8P, radix-l-8p.]
22 Broadcast Bandwidth
- Another issue is broadcast bandwidth (update protocol)
- If data is sent with commit, to avoid broadcast saturation:
  - Needs about 16 bytes/cycle at 32p with whole modified lines
  - Needs only about 8 bytes/cycle at 32p with dirty data only
- High, but feasible on-chip
[Figure: Bytes per Cycle, Whole Lines versus Dirty Data Only, for radix (xl, l, m, s, xs), SPECjbb, tomcatv, fft.]
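The gap between the two schemes comes from how much of each modified line a transaction actually wrote. The sketch below counts commit payload bytes both ways for one transaction; the 64-byte line size is the one quoted on the write-buffering slide, while the 4-byte word size, the function name `commit_bytes`, and the omission of address and header overhead are assumptions made for illustration.

```python
LINE_BYTES = 64   # line size quoted on the write-buffering slide
WORD_BYTES = 4    # assumed word size


def commit_bytes(written_word_addrs):
    """Return (whole_line_bytes, dirty_only_bytes) for one commit.

    written_word_addrs: byte addresses of the words the transaction wrote.
    Address and header overhead in the commit packet is ignored.
    """
    words = set(written_word_addrs)
    lines = {addr // LINE_BYTES for addr in words}
    return len(lines) * LINE_BYTES, len(words) * WORD_BYTES


# Five words written, spread over three cache lines.
whole, dirty = commit_bytes([0, 4, 8, 64, 200])
print(whole, dirty)
```

With sparse writes like these, sending only dirty data moves far fewer bytes than sending whole lines; densely written lines would narrow the difference.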
23 Bandwidth Sensitivity
- Most parallel applications are tolerant of limited BW
  - SPECjbb shows some speedup variation from server-code noise
[Figure: two plots of Normalized Speedup against Bandwidth Limit (bytes/cycle) at 8 processors. Legend entries: art, equake_l, equake_s, lu, radix_xl, radix_l, radix_m, radix_s, radix_xs, SPECjbb, swim; euler, fft, jbyte_b5, jbyte_b6, jbyte_b8, jbyte_b9, moldyn, mtrt, raytrace, tomcatv, shallow, water.]
24 Snoop Bandwidth
- Snooping requirements are quite reasonable
  - Significantly less than 1 address/cycle on most systems
- Address-only commits could reduce BW requirements
  - Only broadcast addresses for an invalidation-based protocol
  - Send full packets only to memory
  - Needs only about p
[Figure: Addresses per Cycle for radix (xl, l, m, s, xs), SPECjbb, tomcatv, fft.]
25 Conclusions
- TCC simplifies shared memory control hardware
  - Trades higher interconnect bandwidth for simpler protocols
  - Eliminates traditional MESI coherence protocols
  - Most communication is in large, less latency-sensitive packets
  - Scaling trends favor these trade-offs in the future
- TCC eases parallel programming
  - Transactions provide error tolerance and free locking
  - Allows all-manual to nearly automated parallelization
  - More on this at ASPLOS-XI in October
More information3. Memory Persistency Goals. 2. Background
Memory Persistency Steven Pelley Peter M. Chen Thomas F. Wenisch University of Michigan {spelley,pmchen,twenisch}@umich.edu Abstract Emerging nonvolatile memory technologies (NVRAM) promise the performance
More informationLecture 23: Multiprocessors
Lecture 23: Multiprocessors Today s topics: RAID Multiprocessor taxonomy Snooping-based cache coherence protocol 1 RAID 0 and RAID 1 RAID 0 has no additional redundancy (misnomer) it uses an array of disks
More informationThe continuum of data management techniques for explicitly managed systems
The continuum of data management techniques for explicitly managed systems Svetozar Miucin, Craig Mustard Simon Fraser University MCES 2013. Montreal Introduction Explicitly Managed Memory systems lack
More informationSOS: Software-Based Out-of-Order Scheduling for High-Performance NAND Flash-Based SSDs
SOS: Software-Based Out-of-Order Scheduling for High-Performance NAND -Based SSDs Sangwook Shane Hahn, Sungjin Lee, and Jihong Kim Department of Computer Science and Engineering, Seoul National University,
More informationSurfBoard A Hardware Performance Monitor for SHRIMP
SurfBoard A Hardware Performance Monitor for SHRIMP Princeton University Technical Report TR 596 99 Scott C. Karlin, Douglas W. Clark, Margaret Martonosi Princeton University Department of Computer Science
More informationVirtuoso and Database Scalability
Virtuoso and Database Scalability By Orri Erling Table of Contents Abstract Metrics Results Transaction Throughput Initializing 40 warehouses Serial Read Test Conditions Analysis Working Set Effect of
More informationPutting Checkpoints to Work in Thread Level Speculative Execution
Putting Checkpoints to Work in Thread Level Speculative Execution Salman Khan E H U N I V E R S I T Y T O H F G R E D I N B U Doctor of Philosophy Institute of Computing Systems Architecture School of
More informationMCA Standards For Closely Distributed Multicore
MCA Standards For Closely Distributed Multicore Sven Brehmer Multicore Association, cofounder, board member, and MCAPI WG Chair CEO of PolyCore Software 2 Embedded Systems Spans the computing industry
More informationCOSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters
COSC 6374 Parallel Computation Parallel I/O (I) I/O basics Spring 2008 Concept of a clusters Processor 1 local disks Compute node message passing network administrative network Memory Processor 2 Network
More informationAnnotation to the assignments and the solution sheet. Note the following points
Computer rchitecture 2 / dvanced Computer rchitecture Seite: 1 nnotation to the assignments and the solution sheet This is a multiple choice examination, that means: Solution approaches are not assessed
More informationPOWER8 Performance Analysis
POWER8 Performance Analysis Satish Kumar Sadasivam Senior Performance Engineer, Master Inventor IBM Systems and Technology Labs satsadas@in.ibm.com #OpenPOWERSummit Join the conversation at #OpenPOWERSummit
More informationSpring 2011 Prof. Hyesoon Kim
Spring 2011 Prof. Hyesoon Kim Today, we will study typical patterns of parallel programming This is just one of the ways. Materials are based on a book by Timothy. Decompose Into tasks Original Problem
More informationCOSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters
COSC 6374 Parallel I/O (I) I/O basics Fall 2012 Concept of a clusters Processor 1 local disks Compute node message passing network administrative network Memory Processor 2 Network card 1 Network card
More informationPART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design
PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions Slide 1 Outline Principles for performance oriented design Performance testing Performance tuning General
More informationOC By Arsene Fansi T. POLIMI 2008 1
IBM POWER 6 MICROPROCESSOR OC By Arsene Fansi T. POLIMI 2008 1 WHAT S IBM POWER 6 MICROPOCESSOR The IBM POWER6 microprocessor powers the new IBM i-series* and p-series* systems. It s based on IBM POWER5
More informationPer-Flow Queuing Allot's Approach to Bandwidth Management
White Paper Per-Flow Queuing Allot's Approach to Bandwidth Management Allot Communications, July 2006. All Rights Reserved. Table of Contents Executive Overview... 3 Understanding TCP/IP... 4 What is Bandwidth
More informationPerformance Testing. Slow data transfer rate may be inherent in hardware but can also result from software-related problems, such as:
Performance Testing Definition: Performance Testing Performance testing is the process of determining the speed or effectiveness of a computer, network, software program or device. This process can involve
More informationLecture 18: Snooping vs. Directory Based Coherency Professor David A. Patterson Computer Science 252 Fall 1996
Lecture 18: Snooping vs. Directory Based Coherency Professor David A. Patterson Computer Science 252 Fall 1996 DAP.F96 1 Layers: Review: Parallel Framework Programming Model: Programming Model Communication
More informationSWEL: Hardware Cache Coherence Protocols to Map Shared Data onto Shared Caches
SWEL: Hardware Cache Coherence Protocols to Map Shared Data onto Shared Caches Seth H. Pugsley, Josef B. Spjut, David W. Nellans, Rajeev Balasubramonian School of Computing, University of Utah {pugsley,
More informationAchieving Mainframe-Class Performance on Intel Servers Using InfiniBand Building Blocks. An Oracle White Paper April 2003
Achieving Mainframe-Class Performance on Intel Servers Using InfiniBand Building Blocks An Oracle White Paper April 2003 Achieving Mainframe-Class Performance on Intel Servers Using InfiniBand Building
More informationComputer Architecture TDTS10
why parallelism? Performance gain from increasing clock frequency is no longer an option. Outline Computer Architecture TDTS10 Superscalar Processors Very Long Instruction Word Processors Parallel computers
More informationChallenges for synchronization and scalability on manycore: a Software Transactional Memory approach
Challenges for synchronization and scalability on manycore: a Software Transactional Memory approach Maurício Lima Pilla André Rauber Du Bois Adenauer Correa Yamin Ana Marilza Pernas Fleischmann Gerson
More information