CS 159 Two Lecture Introduction. Parallel Processing: A Hardware Solution & A Software Challenge



Similar documents
Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Digital Integrated Circuit (IC) Layout and Design

Parallel Algorithm Engineering

Introduction to Cloud Computing

CISC, RISC, and DSP Microprocessors

Virtualization and Cloud Computing. Sorav Bansal

Introduction to Microprocessors

Quiz for Chapter 1 Computer Abstractions and Technology 3.10

This Unit: Multithreading (MT) CIS 501 Computer Architecture. Performance And Utilization. Readings

CFD Implementation with In-Socket FPGA Accelerators

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu

Lecture 1 Introduction to Parallel Programming

CSEE W4824 Computer Architecture Fall 2012

Seeking Opportunities for Hardware Acceleration in Big Data Analytics

Multi-Threading Performance on Commodity Multi-Core Processors

EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May ILP Execution

on an system with an infinite number of processors. Calculate the speedup of

Static Scheduling. option #1: dynamic scheduling (by the hardware) option #2: static scheduling (by the compiler) ECE 252 / CPS 220 Lecture Notes

Parallelism and Cloud Computing

Chapter 2 Parallel Computer Architecture


Introduction to Multi-Core

EEM870 Embedded System and Experiment Lecture 1: SoC Design Overview

! Metrics! Latency and throughput. ! Reporting performance! Benchmarking and averaging. ! CPU performance equation & performance trends

Unit A451: Computer systems and programming. Section 2: Computing Hardware 1/5: Central Processing Unit

CMSC 611: Advanced Computer Architecture

DESIGN CHALLENGES OF TECHNOLOGY SCALING

Xeon+FPGA Platform for the Data Center

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

Computer Architecture

High Performance Computing in the Multi-core Area

Clouds vs Grids KHALID ELGAZZAR GOODWIN 531

Advanced Core Operating System (ACOS): Experience the Performance

Why Latency Lags Bandwidth, and What it Means to Computing

How To Understand The Design Of A Microprocessor

Unit 4: Performance & Benchmarking. Performance Metrics. This Unit. CIS 501: Computer Architecture. Performance: Latency vs.

FPGA Design From Scratch It all started more than 40 years ago

Parallel Programming Survey

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics

路 論 Chapter 15 System-Level Physical Design

Scaling in a Hypervisor Environment

Program Optimization for Multi-core Architectures

Thread Level Parallelism II: Multithreading

Four Keys to Successful Multicore Optimization for Machine Vision. White Paper

Networking Virtualization Using FPGAs

CS244 Lecture 5 Architecture and Principles

Seven Challenges of Embedded Software Development

Computer System: User s View. Computer System Components: High Level View. Input. Output. Computer. Computer System: Motherboard Level

Pipelining Review and Its Limitations

Design Cycle for Microprocessors

FPGA Accelerator Virtualization in an OpenPOWER cloud. Fei Chen, Yonghua Lin IBM China Research Lab

Managing Data Center Power and Cooling

Digital Design for Low Power Systems

Memory Architecture and Management in a NoC Platform

Technical Report. Complexity-effective superscalar embedded processors using instruction-level distributed processing. Ian Caulfield.

Cloud Computing Technology

Coming Challenges in Microarchitecture and Architecture

Multicore Processors A Necessity By Bryan Schauer

Lecture 3: Evaluating Computer Architectures. Software & Hardware: The Virtuous Cycle?

IT Infrastructure and Emerging Technologies

Low Power AMD Athlon 64 and AMD Opteron Processors

The Cloud Hosting Revolution: Learn How to Cut Costs and Eliminate Downtime with GlowHost's Cloud Hosting Services

How To Build A Cloud Computer

Legal Notices and Important Information

Generations of the computer. processors.

Energy-Efficient, High-Performance Heterogeneous Core Design

Performance monitoring at CERN openlab. July 20 th 2012 Andrzej Nowak, CERN openlab

10- High Performance Compu5ng

Big Data and Big Data Modeling

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

CS144R/244R Network Design Project on Software Defined Networking for Computing

Challenges for cloud software engineering

Embedded Systems: map to FPGA, GPU, CPU?

Multi-Core Programming

Performance evaluation

A Survey on ARM Cortex A Processors. Wei Wang Tanima Dey

Turbomachinery CFD on many-core platforms experiences and strategies

Navigating the Enterprise Database Selection Process: A Comparison of RDMS Acquisition Costs Abstract

Choosing a Computer for Running SLX, P3D, and P5

Overview. CPU Manufacturers. Current Intel and AMD Offerings

Multi-core architectures. Jernej Barbic , Spring 2007 May 3, 2007

The Open Cloud Near-Term Infrastructure Trends in Cloud Computing

Hadoop-BAM and SeqPig

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

Evoluzione dell Infrastruttura di Calcolo e Data Analytics per la ricerca

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

High Performance Processor Architecture. André Seznec IRISA/INRIA ALF project-team

Thread level parallelism

Computing at the HL-LHC

Clocking. Figure by MIT OCW Spring /18/05 L06 Clocks 1

IA-64 Application Developer s Architecture Guide

Intel Itanium Architecture

Transcription:

CS 159 Two Lecture Introduction Parallel Processing: A Hardware Solution & A Software Challenge

We re on the Road to Parallel Processing

Outline Hardware Solution (Day 1) Software Challenge (Day 2) Opportunities

Outline Hardware Solution Technical Software Challenge Technical Opportunities Technical

Outline Hardware Solution Technical Business Software Challenge Technical Business Opportunities Technical Business

Outline Hardware Solution Technical Business Software Challenge Opportunities

Computer Hardware Evolution Highway Micro Macro Car & Driver Hardware & Software

Micro-Scopic View

Micro-Scopic View Intel 8086 Intel Core 2 Duo 1978 Year 2006 29,000 Transistors 291,000,000 5 MHz Clock Frequency 2.9 GHz

Micro-Scopic View In 28 Years from 1978 to 2006: Number of Transistors Increased 10,034X Clock Frequency Increased 586X Primary Driver / Facilitator was (and is) Moore s Law: Number of Transistors Doubles every 18-24 Months Stated by Gordon Moore, Intel Co-Founder in 1965 Prediction has been proven valid over a long term Prediction has been the Law for over 40 years

Micro-Scopic View

Micro-Scopic View

Historically, Huge Performance Gains came from Huge Clock Frequency Increases Unfortunately.

Micro-Scopic View Clock Rate Limits Have Been Reached Source: Patterson, Computer Organization and Design

Micro-Scopic View Blocked by Heat and Power Wall

Micro-Scopic View Power (and Heat) Grows as Frequency 3 Power Voltage 2 x Frequency Voltage Frequency Power Frequency 3 How can HW Performance Continue to Increase? Single Core Multi-Core

Single Core vs. Dual Core Single Core clocked at 2f Dual Core clocked at f 2f Throughput f + f 8f 3 Heat f 3 + f 3 Sequential Processing Parallel Processing

Micro-Scopic View Instruction-Level Parallelism (ILP) was also Heavily Used Implemented On-Chip via Hardware Transparent to Software (No impact on Programmers) We will Study Two Types: Pipelining (Intra-Instruction Parallelism) Multi-Function Units (Inter-Instruction Parallelism) ILP has provided reasonable speedups in the past, Unfortunately.

Micro-Scopic View Instruction-Level Parallelism Limits have been Reached too Watts, IPC 100 Power grows exponentially 10 1 Diminishing returns w.r.t. larger instruction window, higher issue-width Single-Issue Pipelined Superscalar (Limited Look-Ahead) Superscalar (Aggressive Look-Ahead) Aggressiveness Of ILP

Gain - to - Effort Ratio of ILP beyond Knee of Curve Diminishing Returns due to Increased Cost and Complexity of Extracting ILP Performance Made sense to go Superscalar and Multi-Function: Good ROI Very little gain for substantial effort Effort Scalar In-Order Moderate Pipelining, Look-Ahead Very Deep Pipe, Aggressive Look-Ahead

Micro-Scopic View Summary Clock Frequency Scaling Limits have been Reached Instruction Level Parallelism Limits have been Reached Era of Single Core Performance Increases has Ended No More Free Lunch for Software Programmers Multiple Cores Will Directly Expose Parallelism to SW All Future Micro-Processor Designs will be Multi-Core Evident in Chip Manufacturer s RoadMaps

Micro-Scopic View Summary

Micro-Scopic View Summary Moore s Law Continues to 2x Transistors / 24 mos, but It will be used to Increase Number of Cores Instead 1 2 4

Single Core Sequential Processing Multi-Core Parallel Processing

Computer Hardware Evolution Highway Micro Macro

Macro-Scopic View Personal Computer Nodes: 1 Location: Desktop

Macro-Scopic View Cluster Computer Nodes: 10 s 100 s Location: Local

Macro-Scopic View Example Cluster Computer

Hardware Solution - Technical Macro-Scopic View Grid Computer Nodes: 1,000 s Location: Distributed

Macro-Scopic View Example Grid Computer

Macro-Scopic View Cloud Computer Nodes: 10,000 s Location: Highly Distributed

Macro-Scopic View App Engine Azure EC2 Example Cloud Computer

Macro-Scopic View Summary Single Node Sequential Processing Many Nodes Parallel Processing

Hardware Solution - Business Micro Macro Sequential Processing Parallel Processing

Hardware Solution - Business Computer Processing Power: Has a Highly Elastic Supply and Demand Curve Increased Supply Generates Increased Demand Hardware Computer Power New Applications Software Software is Like a Gas It Expands to Fill any size Hardware Container Sequential Processing Parallel Processing

Hardware Solution - Business 640K should be enough for anybody Bill Gates, 1981 There is a world market for maybe five computers Thomas Watson, 1943 Even Visionaries Sometime Forget No Matter How Much Computer Power People Have, They Always Want More Goal of Computer Hardware and Software Designers Continually Increase Performance and Lower Cost Operate at Optimum Point on Technology Curve Sequential Processing Parallel Processing

Hardware Solution - Business Technology Curve Cost Performance Cost / Performance vs. Performance Under-Utilization Optimum Utilization Over- Utilization Performance

Hardware Solution - Business Technology Curve Cost Performance Cost / Performance vs. Performance Under-Utilization Optimum Utilization Over- Utilization Performance

Hardware Solution - Business Technology Curve Cost Performance $$$ Cost / Performance vs. Performance $ $$$$$$ Sequential Processing X 2X Performance

Hardware Solution - Business Technology Curve Cost Performance $$$ Cost / Performance vs. Performance $ $$$$$$ Sequential Processing X 2X Performance

Hardware Solution - Business Technology Curve Cost Performance Cost / Performance vs. Performance $ $$$$$$ Sequential Processing X Performance $$ Parallel Processing

Key Points Hardware Solution Parallel Processing is really an Evolution in Micro- and Macro-Architecture Hardware That provides a Solution to: The Heat and Power Wall The Limitations of ILP Cost-Effective Higher Performance Parallel Processing is also a Software Challenge