Parallel Algorithm Engineering


Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas

Outline: background, current multicore architectures, UMA vs NUMA, the OpenMP framework, examples.

Software crisis. "The major cause of the software crisis is that the machines have become several orders of magnitude more powerful! To put it quite bluntly: as long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a mild problem, and now we have gigantic computers, programming has become an equally gigantic problem." -- E. Dijkstra, 1972 Turing Award Lecture

Before: The 1st software crisis. When: around the '60s and '70s. Problem: large programs written in assembly. Solution: abstraction and portability via high-level languages like C and FORTRAN. The 2nd software crisis. When: around the '80s and '90s. Problem: building and maintaining large programs written by hundreds of programmers. Solution: software as a process (OOP, testing, code reviews, design patterns), plus better tools: IDEs, version control, component libraries, etc.

Recently: processor-oblivious programmers. A Java program written on a PC works on your phone. A C program written in the '70s still works today, and runs faster. Moore's law takes care of good speedups.

Currently: software crisis again? When: 2005 and onward. Problem: sequential performance is stuck. Required solution: continuous and reasonable performance improvements, to process large datasets (Big Data!) and to support new features, without losing portability and maintainability.

Moore's law

Uniprocessor performance SPECint2000 [1]

Uniprocessor performance (cont.) SPECfp2000 [1]

Uniprocessor performance (cont.) Clock Frequency [1]

Parallel processing: Predicted # of cores for stationary systems, according to ITRS

Even worse for GPUs: the GTX 780 Ti has 2880 cores @ 0.9 GHz.

Why? Power considerations: consumption, cooling, efficiency. DRAM access latency: the memory wall. Wire delays: the range a wire can cover in one clock cycle. Diminishing returns from more instruction-level parallelism: out-of-order execution, branch prediction, etc.

Power consumption. GTX 780 Ti: 2880 cores @ 0.9 GHz, 250 W (vs. 150 W).

Overclocking. Air/water cooling: ~5.0 GHz (possible at home). Phase change: ~6.0 GHz. Liquid helium: 8.794 GHz, the current world record, reached with an AMD FX-8350.

Towards parallel setups. Instead of going faster, go more parallel! Transistors are now used for multiple cores.

4-socket, 8-CPU setup.

UMA vs NUMA. All laptops and most desktops are UMA; most modern servers are NUMA. It is important to know which you target!

Current commercial multicore CPUs. Intel: Core i7-5960X, 8 cores (16 threads), 20 MB cache, max 3.5 GHz; Xeon E5-2699 v3, 18 cores (36 threads), 45 MB cache, max 3.6 GHz (up to 8-socket configurations); Xeon Phi 7120P, 61 cores (244 threads), 30.5 MB cache, max 1.33 GHz, max memory bandwidth 352 GB/s. AMD: FX-9590, 8 cores, 8 MB cache, 4.7 GHz; A10-7850K, 12 cores (4 CPU @ 4 GHz + 8 GPU @ 0.72 GHz), 4 MB cache; Opteron 6386 SE, 16 cores, 16 MB cache, 3.5 GHz (up to 4-socket configurations). Oracle: SPARC M6, 12 cores (96 threads), 48 MB cache, 3.6 GHz (up to 32-socket configurations).

Concurrency vs parallelism. Parallelism: a condition that arises when at least two threads are executing simultaneously; a specific case of concurrency. Concurrency: a condition that exists when at least two threads are making progress; a more generalized form of parallelism, e.g., concurrent execution via time-slicing on uniprocessors (virtual parallelism). Distribution: as above, but running simultaneously on different machines (e.g., cloud computing).

Amdahl's law. Potential program speedup is bounded by the fraction of code that can be parallelized; serial components rapidly become performance limiters as the thread count increases: speedup(n) = 1 / ((1 - p) + p/n), where p is the fraction of the work that can be parallelized and n is the number of processors.

Amdahl's law: speedup as a function of the number of processors [chart].

When to parallelize: when you have independent units of work that can be performed; when your code is compute-bound, or not yet utilizing the full memory bandwidth; and when you see performance gains in tests :-)

We have seen this previously L1 and L2 cache sizes

Remember from previously

NUMA effects

Cache coherence: ensures consistency among all the caches.

MESIF protocol. Modified (M): present only in the current cache, and dirty; a write-back to main memory will make it (E). Exclusive (E): present only in the current cache, and clean; a read request will make it (S), a write request will make it (M). Shared (S): may be stored in other caches, and clean; may be changed to (I) at any time. Invalid (I): unusable. Forward (F): a specialized form of the S state; the one shared copy designated to respond to (forward) requests for the line.

Cache-coherency effects: exclusive vs. modified cache lines; latency in nanoseconds on a 2-socket Intel Nehalem [chart].

Does it matter? Processing 1600M tuples on a 32-core machine.

Commandments. C1: Thou shalt not write thy neighbor's memory randomly: chunk the data, redistribute, and then sort/work on your data locally. C2: Thou shalt read thy neighbor's memory only sequentially: let the prefetcher hide the remote access latency. C3: Thou shalt not wait for thy neighbors: don't use fine-grained latching or locking, and avoid synchronization points between parallel threads.

The OpenMP framework. An API for multiprocessing, easily applied to parallelize code. Built for shared-memory processors. Works cross-platform. http://openmp.org

Shared-memory processors. Recall the UMA and NUMA architectures: both are shared-memory processor architectures.

General control flow

Compiling OpenMP. Include the header: #include <omp.h>. Compile with the OpenMP flag: gcc -fopenmp test.cpp. Set environment variables: setenv OMP_NUM_THREADS 12 (csh) or export OMP_NUM_THREADS=12 (bash).

Useful functions. Thread ID: omp_get_thread_num(). Number of threads: omp_get_num_threads(). Set the number of active threads: omp_set_num_threads(4), or export OMP_NUM_THREADS=12.

Directives are used to communicate with the compiler: #pragma directives instruct the compiler to use pragmatic or implementation-dependent features. One such feature is OpenMP: #pragma omp parallel

Examples

Problems with NUMA. We do not know where the data is allocated, and we do not know on which NUMA node a thread is running. So, no OpenMP on really parallel machines?

New libraries to the rescue: we can pin threads to processors, and we can control memory allocations. Tools: numactl, libnuma.

libnuma provides C header files (usable from C++) and can be used to make code NUMA-aware. A bit like OpenMP, but it instead provides functions for querying the current NUMA node and for allocating memory on specific NUMA nodes.

numactl --hardware shows the current hardware topology. Examples:

Numactl (continued)

Questions?