Historically, Huge Performance Gains came from Huge Clock Frequency Increases. Unfortunately...




Hardware Solution: Evolution of Computer Architectures (Micro-Scopic View)
- Clock Rate Limits Have Been Reached
- Source: Patterson, Computer Organization and Design

Hardware Solution: Evolution of Computer Architectures (Micro-Scopic View)
- Blocked by the Heat and Power Wall

Single Core vs. Dual Core
- Single Core clocked at 2f: Throughput 2f, Heat (2f)^3 = 8f^3
- Dual Core clocked at f: Throughput f + f = 2f, Heat f^3 + f^3 = 2f^3
- Same Throughput at One Quarter of the Heat
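The slide's numbers can be checked directly. A minimal sketch, assuming (as the slide does) that heat/power scales roughly with the cube of clock frequency:

```python
# Arithmetic behind the single-core vs. dual-core comparison.
# Assumption from the slide: heat scales as the cube of the clock frequency.
f = 1.0  # base clock frequency, arbitrary units

# Single core clocked at 2f
single_throughput = 2 * f
single_heat = (2 * f) ** 3          # 8 f^3

# Dual core, each core clocked at f
dual_throughput = f + f             # 2f, same total throughput
dual_heat = f ** 3 + f ** 3         # 2 f^3

print(single_throughput == dual_throughput)  # True: same throughput
print(single_heat / dual_heat)               # 4.0: dual core emits 1/4 the heat
```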

Hardware Solution: Evolution of Computer Architectures (Micro-Scopic View)
- Instruction-Level Parallelism (ILP) was also Heavily Used
  - Implemented On-Chip via Hardware
  - Transparent to Software (No Impact on Programmers)
- We will Study Two Types:
  - Pipelining (Intra-Instruction Parallelism)
  - Multi-Function Units (Inter-Instruction Parallelism)
- ILP has provided reasonable speedups in the past. Unfortunately...

Hardware Solution: Evolution of Computer Architectures
- The Gain-to-Effort Ratio of ILP is beyond the Knee of the Curve
- Diminishing Returns due to the Increased Cost and Complexity of Extracting ILP
- It Made Sense to go Superscalar and Multi-Function: Good ROI
- Now: Very Little Gain for Substantial Effort
[Figure: Performance vs. Effort curve, rising from Scalar In-Order, through Moderate Pipelining with Look-Ahead, to Very Deep Pipelines with Aggressive Look-Ahead]

Hardware Solution: Evolution of Computer Architectures (Micro-Scopic View) Summary
- Clock Frequency Scaling Limits Have Been Reached
- Instruction-Level Parallelism Limits Have Been Reached
- The Era of Single-Core Performance Increases Has Ended
  - No More Free Lunch for Software Programmers
- Multiple Cores Will Directly Expose Parallelism to SW
- All Future Microprocessor Designs will be Multi-Core
  - Evident in Chip Manufacturers' Roadmaps

Hardware Solution: Evolution of Computer Architectures
- At both the Micro (on-chip) and Macro (multi-node) levels, the evolution is from Sequential Processing to Parallel Processing

Hardware Solution: Technology Curve
[Figure: Cost/Performance vs. Performance curve, showing regions of Under-Utilization, Optimum Utilization, and Over-Utilization]

Hardware Solution: Technology Curve
[Figure: Cost/Performance vs. Performance, comparing Sequential and Parallel Processing curves. Doubling performance from X to 2X costs far more by riding the sequential curve up ($$$$$$) than by moving to the parallel curve ($$)]

Key Points: Hardware Solution
- Parallel Processing is really an Evolution in Micro- and Macro-Architecture
- That Hardware provides a Solution to:
  - The Heat and Power Wall
  - The Limitations of ILP
  - Cost-Effective Higher Performance
- Parallel Processing is also a Software Challenge

Software Challenge: Change in Hardware Requires Change in Software
- The Car (Hardware) has Changed: From a Sequential Engine to a Parallel Engine
- The Driver (Software) Must Change Driving Techniques
- Otherwise, Sub-Optimal Performance will Result
- Car & Driver = Hardware & Software

Software Challenge: Lack of Good Parallel Programming Tools
- The Lack of Tools Compounds the Problem: the Existing Tool Chain is only for Sequential Programming
- Need New Parallel Programming Tools & Infrastructure:
  - Effective Models for Parallel Systems
  - Constructs to make the Parallel Architecture more Visible
  - Languages to More Clearly Express Parallelism
  - Reverse-Engineering Analysis Tools to Assist with the Conversion of Sequential Code to Parallel, Especially Optimized Sequential Code

Software Challenge: Race Conditions
- Parallelism can Give Rise to a New Class of Problems, Caused by the Interactions Between Parallel Threads
- Race Condition: Multiple Threads Perform Concurrent Accesses to the Same Shared Memory Location
  - The Threads Race Against Each Other
  - An Execution Order is Assumed but Cannot be Guaranteed
  - The Outcome Depends on Which Thread Wins (by Chance)
  - Results in Non-Deterministic Behavior
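To make the lost-update mechanism concrete, here is a minimal sketch that replays by hand one unlucky but perfectly legal interleaving of two threads' non-atomic read-modify-write on a shared counter (no real threads are used, so the race outcome is reproducible):

```python
# Deterministic replay of a racy interleaving on a shared counter.
# Each "thread" performs read -> modify -> write; the schedule below
# is one legal ordering the hardware/scheduler is free to choose.
counter = 0

t1_read = counter       # Thread 1 reads 0
t2_read = counter       # Thread 2 also reads 0, before Thread 1 writes back
counter = t1_read + 1   # Thread 1 writes 1
counter = t2_read + 1   # Thread 2 writes 1, silently overwriting Thread 1

print(counter)  # 1, not the expected 2: one increment was lost
```

With real threads the same schedule occurs only occasionally and by chance, which is exactly why the resulting bug is intermittent and hard to reproduce.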

Software Challenge: Race Conditions
- Race Conditions are Especially Hard to Detect & Debug
- The Errors are Very Subtle: No Apparent Failure Occurs
  - The Program Continues to Run Normally and Completes Normally
- The Errors are Intermittent: Hard to Reproduce and Diagnose
- The Errors Can Slip Through the SQA Testing Process: a Potential Lurking Bug
- The Most Common Error in Parallel Programs

Software Challenge: Semaphores
- Semaphores Offer a Solution to Race Conditions
- However, Semaphores themselves can cause Problems:
  - They Introduce Overhead
  - They Can Create Bottlenecks: Mutually Exclusive (one-at-a-time) Access
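A minimal sketch of the fix, using Python's `threading.Semaphore` as a binary semaphore (mutex) to serialize access to the shared counter; the one-at-a-time access that makes the result correct is also the source of the overhead and bottleneck noted above:

```python
import threading

counter = 0
mutex = threading.Semaphore(1)  # binary semaphore: mutually exclusive access

def worker(iterations):
    global counter
    for _ in range(iterations):
        with mutex:             # one-at-a-time critical section
            counter += 1        # read-modify-write is now safe from other threads

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000: every increment preserved
```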

Software Challenge: Deadlock
- Another Potential Problem Arising From Parallelism
- Deadlock: Two or More Threads are Blocked because Each is Waiting for a Resource Held by the Other
[Diagram: Thread 1 holds Sem_A and requests Sem_B, while Thread 2 holds Sem_B and requests Sem_A]
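One standard way to break the circular wait in the diagram is lock ordering. A hedged sketch (the names `sem_a`/`sem_b` mirror the diagram): if every thread acquires the locks in the same global order, the cycle "T1 holds A wants B, T2 holds B wants A" cannot form:

```python
import threading

sem_a = threading.Lock()
sem_b = threading.Lock()
finished = []

def worker(name):
    # Deadlock avoidance by lock ordering: every thread acquires
    # sem_a before sem_b, so no circular wait can ever form.
    with sem_a:
        with sem_b:
            finished.append(name)

t1 = threading.Thread(target=worker, args=("thread1",))
t2 = threading.Thread(target=worker, args=("thread2",))
t1.start(); t2.start()
t1.join(); t2.join()

print(sorted(finished))  # ['thread1', 'thread2']: both complete, no deadlock
```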

Software Challenge: Deadlock
- Not as Hard as Race Conditions: the Errors are More Obvious, since the System Usually Freezes
- But Similar to Race Conditions:
  - The Errors are Intermittent: Hard to Detect, Reproduce, Diagnose, Debug
  - The Errors Can Slip Through the SQA Testing Process: a Potential Lurking Bug
- Another Common Error in Parallel Programs

Software Challenge: Concurrent vs. Parallel
- Time-Sharing = Multi-Tasking = Multiplexing = Concurrent
- One Processor is Shared (Switched Quickly) Between Tasks, Making Them Appear to be Concurrent
- But it's essentially just an illusion: at any instant in time, only one task is really executing
- Concurrency is not the same as True Parallelism:
  - Concurrent: Two Threads are In Progress at the Same Time
  - Parallel: Two Threads are Executing at the Same Time
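CPython itself illustrates the distinction. A minimal sketch using a thread pool: the four worker threads are all *in progress* at the same time (concurrent), but under CPython's global interpreter lock only one executes Python bytecode at any instant, so CPU-bound threads are concurrent without being truly parallel; swapping in `ProcessPoolExecutor` would give real parallelism across cores:

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

# Concurrent: four threads in progress at once, but (for CPU-bound work
# under CPython's GIL) not executing at the same instant.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(square, range(8)))

print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```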

Software Challenge: Concurrent vs. Parallel
- The SW Problem is Harder than in the Time-Sharing Era
- Multi-Core (Micro) & Multi-Node (Macro) HW enables not just Multi-Tasking or Concurrency, but True Parallelism
- A Potential Problem when Migrating Multi-Tasking Code
- Consider a SW Application Programmed with Two Tasks, one assigned a Low Priority, the other a High Priority:
  - In a Multi-Tasking System: the LP task cannot run until the HP task is done, so the Programmer could have assumed Mutual Exclusion
  - In a Parallel System: the LP and HP tasks can run at the Same Time

Software Challenge: Debugging
- Harder Because of Intermittent, Non-Deterministic Bugs
- Time-Sensitive (Temporally Aware) SW Tools are Needed
- New Parallel Debugging Tools are Required: Need to Exert Temporal Control over Multiple Threads
- The Ideal Debugger would have:
  - Reverse Execution Capability (cycle-accurate undo)
  - Instant Replay Capability (with an accurate time counter)
- Cannot Use Ad-hoc Debugging via PRINT Statements: they Add Extra Instructions which could Change the Timing

Software Challenge - Technical: Amdahl's Law
- Parallel Speedup is Limited by the Amount of Serial Code
[Figure: Maximum Theoretical Speedup from Amdahl's Law. Speedup (1 to 8) vs. Number of Cores (0 to 8), plotted for serial fractions of 0%, 10%, 20%, 30%, 40%, and 50%]
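The curves in the figure come from the standard Amdahl formula: with serial fraction s and n cores, speedup = 1 / (s + (1 - s) / n), so 1/s caps the speedup no matter how many cores are added. A quick sketch reproducing a few points:

```python
def amdahl_speedup(serial_fraction, n_cores):
    """Amdahl's Law: speedup = 1 / (s + (1 - s) / n)."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

print(amdahl_speedup(0.0, 8))                 # 8.0: perfectly parallel code
print(round(amdahl_speedup(0.10, 8), 2))      # 4.71: 10% serial code on 8 cores
print(round(amdahl_speedup(0.10, 10**9), 2))  # ~10.0: 1/s caps the speedup
```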

Software Challenge - Business: Changing Technology Curves is Hard
- Never Ride a Technology Curve Up into Over-Utilization
[Figure: Cost/Performance vs. Performance, showing the jump from the Sequential curve to the Parallel curve]

Key Points: Software Challenge
- Parallel Programming is Hard:
  - More Complex
  - Lack of Tools
  - New Types of Bugs: Race Conditions, Deadlocks
  - Harder to Debug, Test, Profile, Tune, Scale
- Parallel Programming is a Software Challenge

Opportunities
- The Universe is Inherently Parallel
  - Natural Physical and Social/Work Processes: Weather, Galaxy Formation, Epidemics, Traffic Jams
- We Can Leverage the Unique Capabilities Offered by Parallelism:
  - Add New Features via Separate Parallel Modules
    - Avoids Re-Engineering of the Old Module
    - More Functionality with No Increase in Wall-Clock Processing Time
  - Speculative Computation: Precompute Alternatives to Minimize Response Time
    - e.g. a Video Game Precomputing Up / Down / Left / Right
    - More Responsive User Interfaces
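The speculative-computation idea can be sketched in a few lines. A hedged example (the function `render_move` is a hypothetical stand-in for an expensive computation): all four alternatives are submitted to worker threads before the user chooses, so whichever input arrives can be answered from a precomputed result:

```python
from concurrent.futures import ThreadPoolExecutor

def render_move(direction):
    # Hypothetical stand-in for an expensive per-move computation.
    return f"frame_for_{direction}"

moves = ["up", "down", "left", "right"]
with ThreadPoolExecutor(max_workers=4) as pool:
    # Speculatively compute every alternative before the user chooses.
    futures = {m: pool.submit(render_move, m) for m in moves}

user_choice = "left"                  # input arrives after precomputation
print(futures[user_choice].result())  # answer is already available
```

The wasted work on the three unchosen alternatives is the price paid for the lower response time, which is exactly the trade-off the slide describes.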

Opportunities
- (Yet) Undiscovered Technical Opportunities: New Parallel Algorithms
- Super-Linear Speedups:
  - A Parallel Computer has N Times More Memory
  - A Larger % of the SW can Fit in the Upper Levels of the Memory Hierarchy
  - Divide and Conquer Leverages the Faster Memories
- An Important Reason for Using Parallel Computers

Will the Feedback Cycle Prevail?
[Diagram: a cycle in which Hardware provides Computer Power, enabling New Applications in Software, which in turn demand more Hardware]
- If Hardware goes Parallel, Will Software go Parallel Too?