18-742 Lecture 4: Parallel Programming II



Spring 2005
Prof. Babak Falsafi
http://www.ece.cmu.edu/~ece742

(Title-slide figure: write X to memory, send X, read X from memory.)

Slides developed in part by Profs. Adve, Falsafi, Hill, Lebeck, Reinhardt, Smith, and Singh of the University of Illinois, Carnegie Mellon University, the University of Wisconsin, Duke University, the University of Michigan, and Princeton University.

Homework & Reading
- Projects: handout on Friday; form teams of two
- Homework: Homework 2 is due by Friday; Homework 3 will be review questions only
- Reading: Chapter 4

Partitioning for Performance
- Balance the workload: reduce time spent waiting at synchronization
- Reduce communication
- Reduce the extra work spent determining and managing a good assignment
- These goals are at odds with each other

Data vs. Functional Parallelism
- Data parallelism: the same operations on different data items
- Functional (control, task) parallelism: e.g., a pipeline
- Impact on load balancing? Functional is more difficult: longer-running tasks (see the sketch below)
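As an illustrative sketch (not from the slides), the two forms might look like this in OpenMP; the arrays and the two loop "stages" are hypothetical placeholders:

```c
#include <omp.h>
#define N 1000000
static double a[N], b[N];

/* Data parallelism: every thread performs the same operation
 * on a different chunk of the data. */
static void scale_all(double k)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        a[i] = k * b[i];
}

/* Functional (task) parallelism: different threads run different
 * tasks (here, two independent loops); a real pipeline would also
 * hand data from one stage to the next. */
static void run_stages(void)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        { for (int i = 0; i < N; i++) b[i] = i * 0.5; }     /* stage 1 */
        #pragma omp section
        { for (int i = 0; i < N; i++) a[i] = a[i] + 1.0; }  /* stage 2 */
    }
}

int main(void)
{
    run_stages();
    scale_all(2.0);
    return 0;
}
```

With data parallelism the per-thread chunks are naturally similar in size; the slide's point is that functional decomposition, with its longer and unequal tasks, is harder to load-balance.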

Impact of Task Granularity
- Granularity = amount of work associated with a task
- Large tasks
  » worse load balancing
  » lower overhead
  » less contention
  » less communication
- Small tasks
  » too much synchronization
  » too much management overhead
  » might have too much communication (affinity scheduling helps)

Impact of Concurrency
- Managing concurrency is about load balancing
- Static assignment
  » cannot adapt to changes
- Dynamic assignment
  » can adapt, but the cost of management increases
  » self-scheduling and guided self-scheduling (sketched below)
  » centralized task queue: contention
  » distributed task queues: can steal from other queues (architectural issue: naming the data associated with a stolen task)
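A minimal sketch of these assignment policies, assuming OpenMP and a hypothetical work() routine whose cost varies with the iteration index:

```c
#include <stdio.h>
#include <omp.h>

/* hypothetical task whose cost grows with i, so static chunks are unbalanced */
static double work(int i)
{
    double x = 0.0;
    for (int k = 0; k < i; k++)
        x += 1.0 / (k + 1.0);
    return x;
}

int main(void)
{
    const int n = 20000;
    double sum = 0.0;

    /* Static: iterations split up front; cheapest, but cannot adapt when
     * some iterations run much longer than others. */
    #pragma omp parallel for schedule(static) reduction(+:sum)
    for (int i = 0; i < n; i++) sum += work(i);

    /* Dynamic (self-scheduling): idle threads grab the next chunk from a
     * shared counter/queue; better balance, more scheduling contention. */
    #pragma omp parallel for schedule(dynamic, 16) reduction(+:sum)
    for (int i = 0; i < n; i++) sum += work(i);

    /* Guided self-scheduling: chunks start large and shrink toward the end,
     * trading contention against end-of-loop imbalance. */
    #pragma omp parallel for schedule(guided) reduction(+:sum)
    for (int i = 0; i < n; i++) sum += work(i);

    printf("%f\n", sum);
    return 0;
}
```

A distributed task queue with work stealing generalizes the dynamic case: each thread dequeues from its own queue and only touches another processor's queue when its own runs dry.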

Task Queues
(Figure, copyright 1999 Morgan Kaufmann Publishers, Inc.)

Dynamic Load Balancing
(Figure, copyright 1999 Morgan Kaufmann Publishers, Inc.)

Impact of Synchronization and Serialization
- Too coarse synchronization
  » barriers instead of point-to-point synchronization
  » poor load balancing
- Too many synchronization operations
  » locking each element of an array
  » costly operations
- Coarse-grain locking (see the sketch below)
  » locking the entire array serializes access to the array
- Architectural aspects
  » cost of a synchronization operation
  » synchronization name space
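A minimal pthreads sketch of the two locking granularities (the array and update routines are hypothetical, not from the lecture):

```c
#include <pthread.h>

#define N 1024
static double a[N];

/* Coarse-grain locking: one lock guards the whole array, so every
 * update serializes behind it. */
static pthread_mutex_t array_lock = PTHREAD_MUTEX_INITIALIZER;

void update_coarse(int i, double v)
{
    pthread_mutex_lock(&array_lock);
    a[i] += v;
    pthread_mutex_unlock(&array_lock);
}

/* Fine-grain locking: one lock per element removes the serialization,
 * at the cost of many more (mostly uncontended) synchronization ops.
 * Call init_elem_locks() once before using update_fine(). */
static pthread_mutex_t elem_lock[N];

void init_elem_locks(void)
{
    for (int i = 0; i < N; i++)
        pthread_mutex_init(&elem_lock[i], NULL);
}

void update_fine(int i, double v)
{
    pthread_mutex_lock(&elem_lock[i]);
    a[i] += v;
    pthread_mutex_unlock(&elem_lock[i]);
}

int main(void)
{
    init_elem_locks();
    update_coarse(3, 1.0);
    update_fine(3, 1.0);
    return 0;
}
```

The trade-off mirrors the bullets above: the single lock serializes all access to the array, while per-element locks eliminate the serialization but multiply the number of synchronization operations.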

Where Do Programs Spend Time?
- Sequential programs
  » performing real computation
  » memory-system stalls
- Parallel programs
  » performing real computation
  » stalled for local memory
  » stalled for remote memory (communication)
  » synchronizing (load imbalance and synchronization operations)
  » overhead
- Speedup(p) = Time(1) / Time(p)
- Amdahl's Law: low concurrency limits speedup
- Superlinear speedup is possible (how?)

Cache Memory 101
- Locality + "smaller hardware is faster" = memory hierarchy
- Levels: each smaller, faster, and more expensive per byte than the level below
- Inclusive: data found in an upper level is also found in the levels below
- Definitions
  » the upper level is closer to the processor
  » block: minimum unit of data that is present or not in an upper level
  » frame: the hardware (physical) place to put a block (same size as a block)
  » address = block address + block offset
  » hit time: time to access the upper level, including determining hit or miss
- 3C miss model: compulsory, capacity, conflict
- Add a fourth C for multiprocessors: communication misses

Multiprocessor as an Extended Memory Hierarchy
- Example: computation at 2 instructions per cycle (IPC)
  » i.e., 500 cycles per 1000 instructions
- 1 long miss per 1000 instructions
  » 200 cycles per long miss
- => 700 cycles per 1000 instructions, a 40% slowdown (worked out below)
(Figure: four nodes, Node 0 through Node 3, each with a processor, cache ($), memory, and communication assist (CA), connected by an interconnect.)
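Worked out from the numbers on the slide:

```latex
\[
\frac{1000\ \text{instructions}}{2\ \text{instructions per cycle}}
  \;+\; 1\ \text{miss} \times 200\ \text{cycles per miss}
  \;=\; 500 + 200 \;=\; 700\ \text{cycles per 1000 instructions},
\]
\[
\frac{700}{500} = 1.4 \quad \text{(a 40\% slowdown relative to the 500-cycle ideal)}.
\]
```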

Cache Coherent Shared Memory
(Figure sequence over four slides: processors P1 and P2 reach a shared location x in main memory through an interconnection network. Over time, one processor executes "ld r2, x", then the other also executes "ld r2, x", and finally one of them executes "add r1, r2, r4" followed by "st x, r1", so any cached copies of x must be kept coherent.)

Reorder Computation for Caches
- Exploit temporal and spatial locality
- Temporal locality affects replication
- Touching too much data == capacity misses
- Computation blocking (sketched below)
(Figures: naive computation order vs. blocked computation order.)
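The figures on the slide are generic; as one concrete instance (a sketch, using matrix multiplication with an assumed tile size that divides N), computation blocking reorders the loops so that each tile is reused while it is still resident in the cache:

```c
#define N    512
#define TILE 32              /* tile edge; assumed to divide N and to fit in cache */

static double A[N][N], B[N][N], C[N][N];

/* Naive order: streams over whole rows/columns, so data is evicted
 * between reuses (capacity misses). */
static void mm_naive(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];
}

/* Blocked order: the same multiply-adds, restructured into TILE x TILE
 * blocks so each block is reused while it is still cached. */
static void mm_blocked(void)
{
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int kk = 0; kk < N; kk += TILE)
                for (int i = ii; i < ii + TILE; i++)
                    for (int j = jj; j < jj + TILE; j++)
                        for (int k = kk; k < kk + TILE; k++)
                            C[i][j] += A[i][k] * B[k][j];
}

int main(void)
{
    mm_naive();
    mm_blocked();
    return 0;
}
```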

Reorder Data for Caches: Before
(Figure. Legend: elements on the same page; elements on the same cache block.)

Reorder Data for Caches: With Blocking
(Figure. Legend: elements on the same page; elements on the same cache block. A layout sketch follows below.)
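One common way to realize a blocked data layout (a sketch under assumptions, not taken from the slides): store the n x n grid block by block, with n divisible by the block edge b, so that a block's elements are contiguous in memory.

```c
#include <stdio.h>
#include <stddef.h>

/* Row-major layout: element (i, j) of an n x n grid.  Elements of one
 * partition can share cache blocks and pages with neighboring partitions. */
static size_t idx_row_major(int i, int j, int n)
{
    return (size_t)i * (size_t)n + (size_t)j;
}

/* Block-major layout with b x b blocks (n assumed divisible by b): each
 * block is one contiguous region, so a partition that owns whole blocks
 * also owns the cache blocks and pages they occupy. */
static size_t idx_block_major(int i, int j, int n, int b)
{
    size_t blocks_per_row = (size_t)(n / b);
    size_t block    = (size_t)(i / b) * blocks_per_row + (size_t)(j / b);
    size_t in_block = (size_t)(i % b) * (size_t)b + (size_t)(j % b);
    return block * (size_t)b * (size_t)b + in_block;
}

int main(void)
{
    int n = 16, b = 4;
    printf("row-major   (9,2) -> %zu\n", idx_row_major(9, 2, n));
    printf("block-major (9,2) -> %zu\n", idx_block_major(9, 2, n, b));
    return 0;
}
```

Partitions that own whole blocks then own whole cache blocks and pages, which is the effect the slide is after.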

Scaling: Why Is It Important?
- Over time, computer systems become larger and more powerful
  » more powerful processors
  » more processors
  » also a range of system sizes within a product family
- Problem sizes become larger
  » simulate the entire plane rather than just the wing
- The required accuracy becomes greater
  » forecast the weather a week in advance rather than 3 days
- Scaling: how do algorithms and hardware behave as systems, sizes, and accuracies grow?
- Intuitively: performance should scale linearly with cost

Cost
- Cost is a function of more than just the processor
- Cost is a complex function of many hardware components and of software
- Cost is often not a "smooth" function
- Cost is often a function of packaging
  » how many pins on a board
  » how many processors on a board
  » how many boards in a chassis
(Slide credit: (c) 2003 J. E. Smith, CS/ECE 757)

Aside on Cost-Effective Computing
- Isn't Speedup(P) < P inefficient?
  » if only throughput matters, use P separate computers instead?
- But much of a computer's cost is NOT in the processor [Wood & Hill, IEEE Computer, Feb. 1995]
- Let Costup(P) = Cost(P) / Cost(1)
- Parallel computing is cost-effective when Speedup(P) > Costup(P) (restated below)
- E.g., for an SGI PowerChallenge with 500 MB of memory: Costup(32) = 8.6
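Restating the slide's criterion:

```latex
\[
\text{Costup}(P) = \frac{\text{Cost}(P)}{\text{Cost}(1)}, \qquad
\text{parallelism is cost-effective} \iff \text{Speedup}(P) > \text{Costup}(P).
\]
```

With Costup(32) = 8.6, a speedup of only about 9 on 32 processors already makes the parallel machine cost-effective, even though its parallel efficiency is well below 1.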

Questions in Scaling
- Fundamental question: what do real users actually do when they get access to larger parallel machines?
- Constant problem size
  » just add more processors
- Memory constrained
  » scale the data size linearly with the number of processors
  » can significantly increase execution time
- Time constrained
  » keep the same wall-clock time as processors are added
  » solve the largest problem possible in the same amount of time

Problem Constrained Scaling
- The user wants to solve the same problem, only faster
  » e.g., video compression, VLSI routing
- Speedup_PC(p) = Time(1) / Time(p)
- Assessment
  » good: easy to do and explain
  » may not be realistic
  » does not work well for a much larger machine (cf. Amdahl's Law)

Time Constrained Scaling
- Execution time is kept fixed as the system scales
  » the user has a fixed time to use the machine or to wait for the result
- Performance = Work/Time as usual, and time is fixed, so
  » Speedup_TC(p) = Work(p) / Work(1)
- Assessment
  » often realistic (e.g., the best weather forecast obtainable overnight)
  » must understand the application to scale meaningfully (would the scientist scale the grid, the time step, the error bound, or some combination?)
  » execution time on a single processor can be hard to get (no uniprocessor may have enough memory)

Memory Constrained Scaling
- Scale the problem so memory usage per processor stays fixed
- Scaled speedup: is it Time(1) / Time(p)?
- Speedup_MC(p) = [Work(p) / Time(p)] x [Time(1) / Work(1)] = increase in work / increase in time
- Assessment
  » realistic for memory-constrained programs (e.g., grid size)
  » can lead to large increases in execution time if work grows faster than linearly in memory usage
  » e.g., matrix factorization (worked out below): a 10,000 x 10,000 matrix takes 800 MB and 1 hour on a uniprocessor; with 1,000 processors a 320,000 x 320,000 matrix can be run, but the ideal parallel time grows to 32 hours!
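The arithmetic behind the example, assuming dense factorization with O(n^2) memory and O(n^3) work (a standard assumption that the slide does not state explicitly):

```latex
\[
\frac{n_p}{n_1} = \frac{320{,}000}{10{,}000} = 32, \qquad
\text{memory grows by } 32^2 = 1024 \approx 1000\times \text{ (one share per processor)},
\]
\[
\text{work grows by } 32^3 = 32{,}768\times, \qquad
\text{ideal parallel time} \approx \frac{32{,}768}{1000} \times 1\ \text{hour} \approx 32\ \text{hours}.
\]
```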

Scaling Down
- Scale down to shorten evaluation time on hardware, and especially on simulators
- Scale-up issues apply in reverse
- Must watch out if the problem size gets too small
  » communication comes to dominate computation (e.g., everything becomes boundary elements)
  » the problem gets too small for realistic caches, yielding too many cache hits
- Scale caches down with the application's working sets in mind
  » e.g., if on a realistic problem a realistic cache could hold a matrix row but not the whole matrix, scale the cache so it holds only a row of the scaled problem's matrix

The SPLASH-2 Benchmarks
- Kernels
  » Complex 1D FFT
  » Blocked LU Factorization
  » Blocked Sparse Cholesky Factorization
  » Integer Radix Sort
- Applications
  » Barnes-Hut: interaction of bodies
  » Adaptive Fast Multipole (FMM): interaction of bodies
  » Ocean Simulation
  » Hierarchical Radiosity
  » Ray Tracer (Raytrace)
  » Volume Renderer (Volrend)
  » Water Simulation with Spatial Data Structure (Water-Spatial)
  » Water Simulation without Spatial Data Structure (Water-Nsquared)