CMSC 611: Advanced Computer Architecture

CMSC 611: Advanced Computer Architecture Parallel Computation Most slides adapted from David Patterson. Some from Mohamed Younis

Parallel Computers Definition: "A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast." (Almasi and Gottlieb, Highly Parallel Computing, 1989). Parallel machines are expected to play a bigger role in the future because: microprocessors are likely to remain the dominant technology; single-microprocessor performance is not expected to keep pace with demand; parallel architectures extend performance beyond what one processor can deliver; and there has been steady progress in software development for parallel architectures.

Questions about parallel computers: How large a collection? How powerful are processing elements? How do they cooperate and communicate? How are data transmitted? What type of interconnection? What are HW and SW primitives for programmers? Does it translate into performance?

Levels of Parallelism Bit-level parallelism: ALU width: 1-bit, 4-bit, 8-bit, ... Instruction-level parallelism (ILP): pipelining, superscalar, VLIW, out-of-order execution Process/thread-level parallelism: divide a job into parallel tasks Job-level parallelism: independent jobs on one computer system
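As a rough illustration of process/thread-level parallelism, the sketch below divides one job into independent tasks run by worker threads. It is purely illustrative: the task function, the data, and the chunk size are all invented for the example.

```python
# Illustrative sketch: a job (sum of squares over 100 numbers) is
# divided into independent tasks that run on separate worker threads.
from concurrent.futures import ThreadPoolExecutor

def task(chunk):
    # Each task works only on its own slice of the data.
    return sum(x * x for x in chunk)

data = list(range(100))
chunks = [data[i:i + 25] for i in range(0, 100, 25)]

with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(task, chunks))

total = sum(partials)  # combine the per-task partial results
print(total)
```

Note that in CPython the GIL limits true CPU parallelism for threads, so this only illustrates the decomposition pattern; a real system would use multiple processes or hardware threads.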

Scientific Computing Applications Nearly unlimited demand (Grand Challenge problems):

    Application              Perf (GFLOPS)   Memory (GB)
    48 hour weather          0.1             0.1
    72 hour weather          3               1
    Pharmaceutical design    100             10
    Global Change, Genome    1000            1000

Successes in some real industries: Petroleum: reservoir modeling. Automotive: crash simulation, drag analysis, engine design. Aeronautics: airflow analysis, engine, structural mechanics. Pharmaceuticals: molecular modeling.

Commercial Applications Transaction processing File servers Electronic CAD simulation Large WWW servers WWW search engines Graphics Graphics hardware Render Farms

Framework Extend traditional computer architecture with a communication architecture: abstractions (the HW/SW interface) plus an organizational structure to realize the abstractions efficiently. Programming models: Multiprogramming: lots of jobs, no communication. Shared address space: communicate via memory. Message passing: send and receive messages. Data parallel: several agents operate on several data sets simultaneously and then exchange information globally and simultaneously (via shared memory or message passing).

Communication Abstraction Shared address space: e.g., load, store, atomic swap. Message passing: e.g., send and receive library calls. There is ongoing debate over which is better (ease of programming vs. scaling); many hardware designs map 1:1 to a single programming model.

Taxonomy of Parallel Architecture Flynn Categories SISD (Single Instruction Single Data) MISD (Multiple Instruction Single Data) SIMD (Single Instruction Multiple Data) MIMD (Multiple Instruction Multiple Data)

Uniprocessor SISD

MISD No commercial examples. Multiple different operations applied to a single data stream, e.g., finding primes or cracking passwords.

SIMD Vector/Array computers

SIMD Arrays Performance keys: utilization and communication.

Data Parallel Model Operations performed in parallel on each element of a large regular data structure, such as an array. One control processor broadcasts to many processing elements (PEs), with a condition flag per PE so that individual elements can be skipped. For distributed memory architectures, the data is distributed among the memories. The data parallel model requires fast global synchronization. Data parallel programming languages lay data out across the processors; vector processors have similar ISAs but no data placement restriction.

SIMD Utilization Conditional Execution PE Enable if (f<.5) {...} Global enable check while (t > 0) {...}
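The per-PE enable flag can be sketched in scalar Python as follows (the data values are made up for the example). Every "PE" sees the same broadcast operation, but only those whose flag is set apply it; the fraction of enabled PEs is the utilization:

```python
# Sketch of SIMD conditional execution via per-PE enable flags.
data = [0.2, 0.7, 0.4, 0.9]   # one value per "PE"

# Broadcast condition: if (f < .5) { f = f * 2 }
enable = [f < 0.5 for f in data]            # per-PE condition flag
result = [f * 2 if e else f for f, e in zip(data, enable)]

# Disabled PEs sit idle during the operation, so utilization drops.
utilization = sum(enable) / len(enable)
print(result, utilization)
```

Here half the PEs are masked off, so the array runs at 50% utilization for this step, which is why divergent conditionals are a key performance concern on SIMD machines.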

Communication: MasPar MP1 Fast local X-net Slow global routing

Communication: CM2 Hypercube local routing Wormhole global routing

Communication: PixelFlow Dense connections within block Single swizzle operation collects one word from each PE in block Designed for antialiasing NO inter-block connection NO global routing

Message Passing MIMD Shared memory/distributed memory Uniform Memory Access (UMA) Non-Uniform Memory Access (NUMA) Can support either SW model on either HW basis

Message passing Processors have private memories, communicate via messages Advantages: Less hardware, easier to design Focuses attention on costly non-local operations

Message Passing Model Each PE has local processor, data, (I/O) Explicit I/O to communicate with other PEs Essentially NUMA but integrated at I/O vs. memory system Free run between Send & Receive Send + Receive = Synchronization between processes (event model) Send: local buffer, remote receiving process/port Receive: remote sending process/port, local buffer
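The Send + Receive rendezvous can be sketched with two threads and a blocking queue standing in for the interconnect (the names, payload, and the short delay are illustrative): the receiver blocks until the sender's message arrives, so the pair of calls synchronizes the two sides.

```python
# Sketch of Send + Receive as synchronization between processes.
import threading
import time
from queue import Queue

channel = Queue()  # stands in for the interconnect between two PEs

def sender():
    time.sleep(0.05)            # "free run" between Send & Receive
    channel.put({"data": 42})   # Send: local buffer -> remote port

received = []

def receiver():
    msg = channel.get()         # Receive: blocks until data arrives
    received.append(msg["data"])

t_recv = threading.Thread(target=receiver)
t_send = threading.Thread(target=sender)
t_recv.start(); t_send.start()
t_send.join(); t_recv.join()
print(received)
```

Even though the receiver starts first, it cannot proceed past the receive until the send completes, which is the event-model synchronization the slide describes.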

History of message passing Early machines: local communication with blocking send & receive. Later: DMA with non-blocking sends; DMA on the receive side stores into a buffer until the processor executes the receive, and then the data is transferred to local memory. Later still: SW libraries to allow arbitrary communication.

Example: IBM SP-2. RS/6000 workstations in racks; the Network Interface Card has an Intel i960; an 8x8 crossbar switch is the communication building block; 40 MByte/sec per link.

Shared Memory Processors communicate through a shared address space. Easy on small-scale machines. Advantages: model of choice for uniprocessors and small-scale multiprocessors; ease of programming; lower latency; easier to use hardware-controlled caching. Disadvantage: difficult to handle node failure.

Centralized Shared Memory Processors share a single centralized (UMA) memory through a bus interconnect Feasible for small processor count to limit memory contention Centralized shared memory architectures are the most common form of MIMD design

Distributed Memory Uses physically distributed (NUMA) memory to support large processor counts (avoiding memory contention). Advantages: allows a cost-effective way to scale memory bandwidth; reduces latency for local memory accesses. Disadvantage: increased complexity of communicating data.