On the Importance of Thread Placement on Multicore Architectures

Similar documents
Performance Counter. Non-Uniform Memory Access Seminar Karsten Tausche

Performance Monitoring of the Software Frameworks for LHC Experiments

Multi-Threading Performance on Commodity Multi-Core Processors

Multicore Parallel Computing with OpenMP

Performance monitoring of the software frameworks for LHC experiments

Perfmon2: A leap forward in Performance Monitoring

Assessing the Performance of OpenMP Programs on the Intel Xeon Phi

The Impact of Memory Subsystem Resource Sharing on Datacenter Applications. Lingia Tang Jason Mars Neil Vachharajani Robert Hundt Mary Lou Soffa

Parallel Programming Survey

Chapter 2. Why is some hardware better than others for different programs?

Lattice QCD Performance. on Multi core Linux Servers

Scheduling Task Parallelism" on Multi-Socket Multicore Systems"

GPUs for Scientific Computing

Lecture 3: Evaluating Computer Architectures. Software & Hardware: The Virtuous Cycle?

MAQAO Performance Analysis and Optimization Tool

Overlapping Data Transfer With Application Execution on Clusters

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

CPU Performance Evaluation: Cycles Per Instruction (CPI) Most computers run synchronously utilizing a CPU clock running at a constant clock rate:

Performance Evaluation of HPC Benchmarks on VMware s ESXi Server

Code Coverage Testing Using Hardware Performance Monitoring Support

Hardware performance monitoring. Zoltán Majó

OpenMP and Performance

Parallel Algorithm Engineering

STUDY OF PERFORMANCE COUNTERS AND PROFILING TOOLS TO MONITOR PERFORMANCE OF APPLICATION

A Study of Performance Monitoring Unit, perf and perf_events subsystem

LS DYNA Performance Benchmarks and Profiling. January 2009

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child

Measuring Cache and Memory Latency and CPU to Memory Bandwidth

Perf Tool: Performance Analysis Tool for Linux

Chapter 2 Parallel Architecture, Software And Performance

64-Bit versus 32-Bit CPUs in Scientific Computing

benchmarking Amazon EC2 for high-performance scientific computing

perfmon2: a flexible performance monitoring interface for Linux perfmon2: une interface flexible pour l'analyse de performance sous Linux

Introduction to Cloud Computing

COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook)

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Performance monitoring at CERN openlab. July 20 th 2012 Andrzej Nowak, CERN openlab

EEM 486: Computer Architecture. Lecture 4. Performance

A Brief Survery of Linux Performance Engineering. Philip J. Mucci University of Tennessee, Knoxville

Perfmon2: a flexible performance monitoring interface for Linux

An examination of the dual-core capability of the new HP xw4300 Workstation

HP ProLiant SL270s Gen8 Server. Evaluation Report

A New, High-Performance, Low-Power, Floating-Point Embedded Processor for Scientific Computing and DSP Applications

Building an energy dashboard. Energy measurement and visualization in current HPC systems

Perfmon2: a standard performance monitoring interface for Linux. Stéphane Eranian <eranian@gmail.com>

Turbomachinery CFD on many-core platforms experiences and strategies

Distributed communication-aware load balancing with TreeMatch in Charm++

Computer Architecture

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

Program Optimization for Multi-core Architectures

Intel Pentium 4 Processor on 90nm Technology

SWARM: A Parallel Programming Framework for Multicore Processors. David A. Bader, Varun N. Kanade and Kamesh Madduri

High Performance Computing

Memory Access Control in Multiprocessor for Real-time Systems with Mixed Criticality

Workshop on Parallel and Distributed Scientific and Engineering Computing, Shanghai, 25 May 2012

Performance Characteristics of Large SMP Machines

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures

~ Greetings from WSU CAPPLab ~

Scalability evaluation of barrier algorithms for OpenMP

Keys to node-level performance analysis and threading in HPC applications

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries

ScaAnalyzer: A Tool to Identify Memory Scalability Bottlenecks in Parallel Programs

CloudSim: A Toolkit for Modeling and Simulation of Cloud Computing Environments and Evaluation of Resource Provisioning Algorithms


High-Density Network Flow Monitoring

An Introduction to Parallel Computing/ Programming

Performance evaluation

Operating Systems. 05. Threads. Paul Krzyzanowski. Rutgers University. Spring 2015

Sequential Performance Analysis with Callgrind and KCachegrind

The Assessment of Benchmarks Executed on Bare-Metal and Using Para-Virtualisation

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance

How System Settings Impact PCIe SSD Performance

Enabling Technologies for Distributed Computing

Software Development around a Millisecond

High Performance Computing. Course Notes HPC Fundamentals

Unit 4: Performance & Benchmarking. Performance Metrics. This Unit. CIS 501: Computer Architecture. Performance: Latency vs.

AMD Opteron Quad-Core

HPC performance applications on Virtual Clusters

White Paper. Real-time Capabilities for Linux SGI REACT Real-Time for Linux

UTS: An Unbalanced Tree Search Benchmark

Experiences on using GPU accelerators for data analysis in ROOT/RooFit

Parallel Programming at the Exascale Era: A Case Study on Parallelizing Matrix Assembly For Unstructured Meshes

NVIDIA Tools For Profiling And Monitoring. David Goodwin

Transcription:

On the Importance of Thread Placement on Multicore Architectures HPCLatAm 2011 Keynote Cordoba, Argentina August 31, 2011 Tobias Klug

Motivation: Many possibilities

can lead to non-deterministic runtimes...

... but don t have to

The autopin Approach User-level tool Start multi-threaded application under autopin control User can specify pinnings of interest Pin threads to cores Assess performance of chosen pinning using performance counters Try alternative pinnings until optimal pinning is found

Performance Counters Multiple Event Sensors ALU Utilization Branch Prediction Cache Events (L1/L2/TLB) Bus Utilization Two Uses: Read: Get Precise Count of Events in Code Regions => Counting Interrupt on Overflow => Statistical Sampling Well-known tools: Oprofile Perfctr Intel Vtune Perfmon2

perfmon2 Kernel-Patch + library (libpfm) Generic interface for PMU access Portable: implementations for IA32, x64, IA64, MIPS, Power Allows for per-thread and system-wide monitoring Support for counting and sampling pfmon: attach to running threads fork new processes and attach to them fully exploit performance counters

Algorithm init_autopin (pinninglist, inittime, program); for (i=0; i<numofpinnings; i++) { pinthreads(pinninglist[i]); runthreads(warmuptime); p1 = readperformancecounters(); runthreads(sampletime); p2 = readperformancecounters(); performancerate[i] = (p2-p1)/sampletime; } pinthreads(bestpinning);

NUMA automatic page migration

Automatic Page Migration Kernel patch from Lee Schermerhorn Thread moves to new NUMA node: remove PTE references of old NUMA node pages are now unmapped Next access to page causes page-fault Modified kernel routines pull page local (migrate on fault) Update PTE Controlled via cpusets

Experimental Setup Caneland: Intel Tigertown: Quad-Core, 2x4MB L2/socket, 2.93GHz clock rate 4-way, 4x1066MHz FSB, 64MB snoop filter, UMA Clovertown: Intel Clovertown: Quad-Core, 2x4MB L2, 2.66GHz clock rate 2-way, 2x1333MHz FSB, UMA Barcelona: AMD K10: Quad-Core, 4x512kB L2, 1x2MB L3, 1.9GHz clock rate 2-way, 1000MHz Hypertransport, NUMA Linux Kernel 2.6.23 with perfmon2 patches Intel Compiler Suite

Evaluation SPEC OMP Benchmark consists of real scientific applications OpenMP PC: INSTRUCTIONS_RETIRED Several Multicore-Architekturen examined Memory Throughput MPI Electric power consumption

SPEC OMP Benchmark 310.wupwise 312.swim 314.mgrid 316.applu 318.galgel 320.equake 324.apsi 326.gafort 328.fma3d 330.art 332.Ammp Description Quantum chromodynamics shallow water modeling multi-grid solver in 3D potential field parabolic/elliptic partial differential equations fluid dynamics analysis of oscillatory instability finite element simulation of earthquake modeling weather prediction genetic algorithm code finite-element crash simulation neural network simulation of adaptive resonance theory computational chemistry

Caneland

Caneland: Runtimes (10s sample time)

Caneland: Runtimes (30s sample time)

Results 310.wupwise Caneland Clovertown Barcelona 2 4 8 2 4 2 4 312.swim 314.mgrid 316.applu 320.equake 324.apsi 328.fma3d 330.art 332.ammp

Results 310.wupwise Caneland Clovertown Barcelona 2 4 8 2 4 2 4 312.swim 314.mgrid 316.applu 320.equake 324.apsi 328.fma3d 330.art 332.ammp

Clovertown

Results 310.wupwise Caneland Clovertown Barcelona 2 4 8 2 4 2 4 312.swim 314.mgrid 316.applu 320.equake 324.apsi 328.fma3d 330.art 332.ammp

Barcelona

Results (w/o NUMA patch) 310.wupwise Caneland Clovertown Barcelona 2 4 8 2 4 2 4 312.swim 314.mgrid 316.applu 320.equake 324.apsi 328.fma3d 330.art 332.ammp

Results (with NUMA patch) 310.wupwise Caneland Clovertown Barcelona 2 4 8 2 4 2 4 312.swim 314.mgrid 316.applu 320.equake 324.apsi 328.fma3d 330.art 332.ammp

Results SPEC OMP Optimal pinning found in all but 2 cases autopin s alternative less than 5% slower Overhead less than 3% on UMA platform Overhead less than 7,5% on NUMA platform (Kernel level page migration)

Evaluation SPEC OMP Benchmark consists of real scientific applications OpenMP PC: INSTRUCTIONS_RETIRED Several Multicore-Architekturen examined Memory Throughput MPI Electric power consumption

Barcelona

Memory Bandwidth STREAM John McCalpin synthetic benchmark copy and computation operations on large FP arrays Reusage of data avoided copy: a[i] = b[i] scale: a[i] = q*b[i] sum: a[i] = b[i] + c[i] triad: a[i] = b[i] + q*c[i]

Evaluation SPEC OMP Benchmark consists of real scientific applications OpenMP PC: INSTRUCTIONS_RETIRED Several Multicore-Architekturen examined Memory Throughput MPI Electric power consumption

Clovertown

Evaluation SPEC OMP Benchmark consists of real scientific applications OpenMP PC: INSTRUCTIONS_RETIRED Several Multicore-Architekturen examined Memory Throughput MPI Electric power consumption

PET (Positron Emission Tomography) Nuclear medicine imaging Visualizes functional processes (e.g. tumor diagnostics) fixed detector ring around patient radioisotopes injected into body Positron vs. electron 2 photons 180 degree coincidence circuit

Image: Wikipedia

Image Reconstruction g = A f g known measurement vector f unknown image vector A system matrix (describes characteristics of detector ring) MLEM approximates linear system

Clovertown

Conclusion and Outlook Pinning is essential on multicore systems Will become even more important on many core architectures Tools can reliably find optimal pinnings on UMA and NUMA architectures Outlook autopin2 new design: perf performance counters subsystem Flexible and modular Design: perfmon, perf Energy, runtime, user defined objective functions Back channel from application to autopin2

Questions?