Challenges to Obtaining Good Parallel Processing Performance

Outline: Challenges to Obtaining Good Parallel Processing Performance

Coverage: The Parallel Processing Challenge of Finding Enough Parallelism

Amdahl's Law:
o The parallel speedup of any program is limited by the time needed to complete its sequential portions.
o For example, if a program runs for ten hours on a single core and has a sequential (non-parallelizable) portion that takes one hour, then no matter how many cores are devoted to the program, it can never run in less than one hour; the speedup is therefore limited to a factor of ten.

Even if the parallel part speeds up perfectly, performance is limited by the sequential part.
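
Amdahl's Law can be stated as S(p) = 1 / (s + (1 - s)/p), where s is the sequential fraction of the work and p is the number of cores. Below is a minimal C sketch of the ten-hour example; the function name amdahl_speedup and the list of core counts are illustrative choices, not from the lecture.

```c
#include <stdio.h>

/* Amdahl's Law: with sequential fraction s, speedup on p cores is
 * S(p) = 1 / (s + (1 - s) / p).  The values below model the
 * hypothetical ten-hour example from the text. */
static double amdahl_speedup(double s, int p) {
    return 1.0 / (s + (1.0 - s) / p);
}

int main(void) {
    double s = 1.0 / 10.0;               /* 1 sequential hour out of 10 */
    int cores[] = {1, 2, 8, 64, 1024};
    for (int i = 0; i < 5; i++)
        printf("p = %4d  speedup = %5.2f\n",
               cores[i], amdahl_speedup(s, cores[i]));
    /* As p grows, speedup approaches 1/s = 10: runtime never drops
     * below the one sequential hour. */
    return 0;
}
```

As p grows without bound, S(p) approaches 1/s = 10, matching the one-hour floor described above.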

Granularity: The Parallel Processing Challenge of Overhead Caused by Parallelism
o Given enough parallel work, this is the biggest barrier to achieving the desired speedup.
o Parallelism overheads include:
- the cost of starting a thread or process
- the cost of communicating shared data
- the cost of synchronizing
o Each of these can cost several milliseconds (equivalent to millions of flops) on some systems.
o Tradeoff: an algorithm needs sufficiently large units of work to run fast in parallel (i.e., large granularity), but not so large that there is not enough parallel work.

I/O Time vs. CPU Time
o Input/Output time includes both the memory system and the bus/network system.
o The rate of improvement of I/O is much slower than that of the CPU.
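
To make the granularity tradeoff concrete, here is a small C sketch of a hypothetical cost model (the work size W, per-chunk overhead o, and core count p are made-up numbers): if every chunk of size g carries a fixed overhead o, efficiency is roughly g / (g + o), yet g must stay small enough that there are at least p chunks to keep every core busy.

```c
#include <stdio.h>

/* Hypothetical model of the granularity tradeoff: splitting W units of
 * work into chunks of size g adds a fixed overhead o per chunk (thread
 * startup, communication, synchronization).  With perfect load balance
 * on p cores, efficiency = g / (g + o), but we also need at least
 * p chunks (W / g >= p) so that no core sits idle. */
int main(void) {
    const double W = 1e9;   /* total work, in flops                   */
    const double o = 1e6;   /* per-chunk overhead, expressed in flops */
    const int    p = 64;

    for (double g = 1e5; g <= 1e8; g *= 10) {
        double chunks     = W / g;
        double efficiency = g / (g + o);
        printf("g = %9.0f  chunks = %8.0f  efficiency = %4.2f%s\n",
               g, chunks, efficiency,
               chunks < p ? "  (too few chunks for all cores!)" : "");
    }
    return 0;
}
```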

Exponentially growing gaps are opening between:
o floating point time (CPU processing speed),
o memory bandwidth (transmission speed of memory), and
o memory latency (startup time of a memory transfer)

Floating Point Time << 1/Memory Bandwidth << Memory Latency

                                        Annual increase   Typical value in 2006
Single-chip floating-point performance  59%               4 GFLOP/s
Memory bus bandwidth                    23%               1 GWord/s = 0.25 word/flop
Memory latency                          5.5%              70 ns = 280 FP ops = 70 loads

Exponentially growing gaps are also opening between:
o floating point time (CPU processing speed),
o network bandwidth (transmission speed of the network), and
o network latency (startup time of a network transmission)

Floating Point Time << 1/Network Bandwidth << Network Latency

                   Annual increase   Typical value in 2006
Network bandwidth  26%               65 MWord/s = 0.03 word/flop
Network latency    15%               5 μs = 20K FP ops

Note that for both memory and network, latency (not bandwidth) is the weaker link. This means it is better to use larger chunk sizes (larger granularity): retrieve (from memory) or transmit (over the network) a small number of large blocks, rather than a large number of small blocks.
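
This is often summarized with the alpha-beta cost model: moving n words takes T(n) = α + n/β, where α is the latency and β the bandwidth. The sketch below plugs in the 2006 network figures from the table above; the total transfer size and the block sizes compared are illustrative assumptions.

```c
#include <stdio.h>

/* Alpha-beta model: moving n words costs T(n) = latency + n / bandwidth.
 * alpha and beta use the 2006 network figures from the table (5 us
 * latency, 65 MWord/s bandwidth); the transfer sizes are made up. */
int main(void) {
    const double alpha = 5e-6;    /* network latency, seconds */
    const double beta  = 65e6;    /* bandwidth, words/second  */
    const double total = 1e6;     /* words to move in total   */

    /* one large block vs. many small blocks */
    double sizes[] = {1e6, 1e4, 1e2};
    for (int i = 0; i < 3; i++) {
        double n        = sizes[i];
        double messages = total / n;
        double t        = messages * (alpha + n / beta);
        printf("block = %8.0f words  messages = %6.0f  time = %8.5f s\n",
               n, messages, t);
    }
    return 0;
}
```

Moving the million words as one block costs about 0.015 s, while moving them as ten thousand 100-word blocks costs about 0.065 s: at small block sizes the latency term dominates.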

However, there is a tradeoff between using larger granularity and locality:
o CPU performance improves much faster than RAM memory performance.
o So memory hierarchies are used to provide cost-performance effectiveness.
o Small memories are fast but expensive; large memories are cheap but slow.

Locality (where the data sits in the memory hierarchy) substantially impacts performance:
o Keeping the active working set in the upper levels of the hierarchy improves performance.
o But this means we need to use finer granularity (many smaller blocks).

Communication in Parallel Applications

[Figure: six processes running on three dual-core processors. Communication between two cores through a shared cache costs ~10 cycles; between processors through memory, ~100 cycles; between nodes across the network via the NIC, ~100,000 cycles.]

In parallel programming, communication considerations have the same importance as single-core optimizations!

Tiling can be used to partition a task so that the memory hierarchy is better leveraged.

Challenge: tradeoff in granularity size
- From a bandwidth vs. latency point of view (for both memory and network): want larger blocks, because latency is the weaker link, improving more slowly than bandwidth.
- From a memory locality point of view: want smaller blocks that fit into the fastest (smallest) memory in the hierarchy. This reduces memory access times and can even make superlinear speedup possible (e.g., when each processor's working set fits entirely in its cache).
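
As a concrete illustration of tiling, here is a C sketch of a blocked (tiled) matrix multiply. The names matmul_tiled, N, and TILE are illustrative assumptions; in practice TILE would be tuned so that three TILE x TILE blocks fit in the cache level being targeted.

```c
#include <stddef.h>

#define N    1024
#define TILE 32    /* illustrative: chosen so three TILE x TILE blocks
                      fit in cache */

/* Tiled (blocked) matrix multiply: each TILE x TILE block of A, B, and
 * C is reused many times while resident in fast memory, which is the
 * tiling idea described above.  Assumes the caller has initialized C
 * (e.g., to zero).  A teaching sketch, not a tuned kernel. */
void matmul_tiled(const double A[N][N], const double B[N][N],
                  double C[N][N]) {
    for (size_t ii = 0; ii < N; ii += TILE)
        for (size_t jj = 0; jj < N; jj += TILE)
            for (size_t kk = 0; kk < N; kk += TILE)
                /* multiply one pair of tiles while they stay in cache */
                for (size_t i = ii; i < ii + TILE; i++)
                    for (size_t j = jj; j < jj + TILE; j++) {
                        double sum = C[i][j];
                        for (size_t k = kk; k < kk + TILE; k++)
                            sum += A[i][k] * B[k][j];
                        C[i][j] = sum;
                    }
}
```

Each tile of A and B is reused TILE times while it sits in fast memory, instead of being streamed from main memory on every access.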

Partitioning should also strive to load-balance tasks across the processors.

Load imbalance is the time that some processors in the system are idle, due to:
o insufficient parallelism
o unequal-size tasks

Load imbalance exacerbates synchronization overhead:
o the slowest (longest) task or processor holds up all other tasks or processors.
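
A tiny C sketch of why the slowest task dominates, using made-up task sizes: with one task per core, parallel time is the maximum task time, not the average.

```c
#include <stdio.h>

/* With one task per core, T_par = max(task), not sum(task) / p.
 * The task sizes are made-up numbers for illustration. */
int main(void) {
    double task[4] = {10.0, 10.0, 10.0, 70.0};  /* unequal tasks */
    double sum = 0.0, max = 0.0;
    for (int i = 0; i < 4; i++) {
        sum += task[i];
        if (task[i] > max) max = task[i];
    }
    printf("ideal time  = %.1f\n", sum / 4.0);  /* 25.0: perfect balance */
    printf("actual time = %.1f\n", max);        /* 70.0: slowest task    */
    printf("speedup     = %.2f of 4.00\n", sum / max);
    return 0;
}
```

Even though three of the four cores finish early, the speedup is only 100/70 ≈ 1.43 on four cores.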

Improving Real Performance

Peak performance grows exponentially, à la Moore's Law: in the 1990s, peak performance increased 100x; in the 2000s, it will increase another 1000x.

But efficiency (performance relative to the hardware peak) has declined:
o it was 40-50% on the vector supercomputers of the 1990s
o it is now as little as 5-10% on the parallel supercomputers of today

Close the gap through:
o mathematical methods and algorithms that achieve high performance on a single processor and scale to thousands of processors
o more efficient programming models and tools for massively parallel supercomputers

[Figure: log-scale plot of teraflops (0.1 to 1,000) from 1996 to 2004, with peak performance rising well above real performance and the widening area between them labeled the performance gap.]

Much of the performance comes from parallelism:
o Bit-Level Parallelism
o Instruction-Level Parallelism
o Thread-Level Parallelism