Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606
Organization: Introduction; Multi-Core Processor-based SMPs; Hardware and Software Environment; QMT (QCD Multi-Threading); Memory System; Cache Coherence Protocol; Barrier Algorithms: Centralized and Queue-based; Analysis of the Barrier Algorithms; Performance of the Barrier Algorithms; OpenMP and QMT
Why Multi-core Processors? Traditional design: high clock frequency and a deep pipeline, where pipeline interruptions (cache misses, branch mis-prediction) are expensive, and the gap between memory and CPU keeps growing. ILP (Instruction Level Parallelism): it is a challenge to find enough parallel instructions, so superscalar processors with multiple execution engines are underutilized; hardware looks for parallelism in a large instruction window (expensive and complex), as do compilers, yet most processors still can't find enough work (peak IPC is 6, average IPC is 1.5!).
Multi-core Processors: CMT (Chip Multi-Threading) exploits a higher level of parallelism by executing multiple threads on multiple cores within a single CPU. Cache-coherence circuitry operates at a higher clock speed because signals stay inside the chip. Each core is relatively simple: smaller instruction window, smaller die area, lower power consumption. Sequential programs may perform poorly: cache contention/cache thrashing on shared-cache systems.
Motivation: Linux clusters based on commodity multi-core SMP nodes; multi-threading data-parallel scientific applications with a fork-join parallel programming paradigm (a master thread forks worker threads, which later join at a barrier). Key questions: barrier algorithm performance, and multi-threading strategy, i.e. OpenMP vs. a hand-written threading library (QMT) with an optimal barrier algorithm.
OpenMP: a portable, shared-memory multi-processing API; compiler directives and runtime library; C/C++, Fortran 77/90; Unix/Linux, Windows; Intel C/C++, gcc 4.x; implemented on top of native threads; fork-join parallel programming model.
OpenMP Compiler Directives (C/C++):
#pragma omp parallel
{
    thread_exec (); /* all threads execute the code */
} /* all threads join master thread */
Other directives: #pragma omp critical, #pragma omp section, #pragma omp barrier, #pragma omp parallel reduction(+:result)
Runtime library: omp_set_num_threads, omp_get_thread_num
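A minimal, self-contained sketch of the fork-join and reduction directives above; the loop, array size, and values are illustrative, not from the talk:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    double result = 0.0;
    double data[1000];

    omp_set_num_threads(4);            /* request 4 threads for parallel regions */

    /* fork: every thread initializes its share of the array, then joins */
    #pragma omp parallel for
    for (int i = 0; i < 1000; i++)
        data[i] = i * 0.5;

    /* fork again: each thread accumulates a private partial sum; the
       partial sums are combined at the implicit barrier (join) */
    #pragma omp parallel for reduction(+:result)
    for (int i = 0; i < 1000; i++)
        result += data[i];

    printf("threads: %d, sum = %f\n", omp_get_max_threads(), result);
    return 0;
}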
QMT (QCD Multi-Threading): a locally developed multi-threading library for LQCD applications, which are data parallel. Fork-join programming model, very lightweight locks, optimal barrier algorithm.
typedef void (*qmt_userfunc_t) (void *usrarg, int thid);
extern int qmt_init (void);
extern int qmt_finalize (void);
extern int qmt_pexec (qmt_userfunc_t func, void* arg);
extern int qmt_thread_id (void);
extern int qmt_num_threads(void);
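A usage sketch of the QMT API listed above, assuming the fork-join semantics described on this slide (qmt_pexec runs the user function on every worker thread); the header name "qmt.h" and the work-partitioning arithmetic are illustrative assumptions:

#include <stdio.h>
#include "qmt.h"   /* assumed header exposing the QMT API above */

#define N 1024
static double a[N], b[N], c[N];

/* executed by every thread; thid identifies the calling thread */
static void vadd(void *arg, int thid)
{
    int nth   = qmt_num_threads();
    int chunk = N / nth;
    int lo    = thid * chunk;
    int hi    = (thid == nth - 1) ? N : lo + chunk;

    for (int i = lo; i < hi; i++)
        c[i] = a[i] + b[i];
}

int main(void)
{
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    qmt_init();                 /* create the worker thread pool            */
    qmt_pexec(vadd, NULL);      /* fork: all threads run vadd, then join    */
    qmt_finalize();             /* tear the thread pool down                */

    printf("c[10] = %f\n", c[10]);
    return 0;
}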
Motivation: The memory architecture of an SMP built from multi-core processors is either a shared bus (Intel Xeon) or ccNUMA (AMD Opteron), each with its own cache coherence protocol. How do the memory organization and the cache coherence protocol impact the performance of barrier algorithms?
Hardware and Software Environment: Dell PowerEdge 1850 with dual Intel Xeon 5150 at 2.66 GHz (shared-bus memory system); Dell PowerEdge SC1435 with dual AMD Opteron 2220 SE at 2.8 GHz (ccNUMA memory system); Fedora Core 5 Linux x86_64, kernel 2.6.17 (with PAPI support); gcc 4.1 and Intel icc 9.1.
Multi-Core Architecture.
Intel Woodcrest Xeon: L1 cache 32 KB data, 32 KB instruction; L2 cache 4 MB shared between the 2 cores, 256-bit wide, 10.6 GB/s bandwidth to the cores; FB-DDR2 memory with increased latency; memory disambiguation allows loads to issue ahead of store instructions; execution: pipeline length 14, 24-byte fetch width, 96 reorder buffers, 3 128-bit SSE units, one SSE instruction per cycle.
AMD Opteron: L1 cache 64 KB data, 64 KB instruction; L2 cache 1 MB dedicated per core, 128-bit wide, 6.4 GB/s bandwidth to the cores; NUMA (DDR2) with increased latency to access the other processor's memory, so memory affinity is important; execution: pipeline length 12, 16-byte fetch width, 72 reorder buffers, 2 128-bit SSE units, one SSE instruction executed as two 64-bit instructions.
Configurations of Test Machines:
        CPUs                      L1                    L2            Memory
Intel   Two Dual-Core, 2.66 GHz   32K Data, 32K Instr   4MB Shared    4GB, Shared Bus
AMD     Two Dual-Core, 2.8 GHz    64K Data, 64K Instr   1MB Private   4GB, ccNUMA
Hardware and Software Environment: the EPCC micro-benchmark measures thread synchronization overhead. PAPI (Performance Application Programming Interface): performance event counter registers are available on Intel Xeon and AMD Opteron processors, and a Linux kernel module (perfctr) reads the event counts from those registers. A wide range of performance metrics is available, e.g. PAPI_L2_TCM (L2 cache misses) and PAPI_FP_INS (total floating point instructions), plus machine-specific metrics such as SI_PREFETCH_ATTEMPT (Opteron), NB_MC_PAGE_MISS (Opteron), and SIMD_Int_128_Ret (Xeon). A PAPI sketch follows below.
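A minimal sketch of counting L2 cache misses with the PAPI preset event mentioned above. It uses the classic high-level calls PAPI_start_counters / PAPI_stop_counters, which were current in PAPI 3.x at the time of the talk (later PAPI releases replace them); the measured loop is illustrative:

#include <stdio.h>
#include <papi.h>

static double buf[1 << 20];

int main(void)
{
    int events[1] = { PAPI_L2_TCM };   /* L2 total cache misses */
    long long counts[1];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI init failed\n");
        return 1;
    }

    /* start counting, run the code under test, stop and read the counter */
    PAPI_start_counters(events, 1);

    for (int i = 0; i < (1 << 20); i++)     /* strided walk to generate misses */
        buf[(i * 16) & ((1 << 20) - 1)] += 1.0;

    PAPI_stop_counters(counts, 1);

    printf("L2 cache misses: %lld\n", counts[0]);
    return 0;
}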
Intel Xeon Memory Architecture (diagram): two dual-core processors; each core has a private L1 cache, the two cores on a processor share one L2 cache, and both processors reach system memory over a single shared system bus.
AMD Opteron Memory Architecture (diagram): two dual-core processors; each core has private L1 and L2 caches feeding a System Request Queue (SRQ) and crossbar; the processors connect to each other through HyperTransport (HT) links, and each has an on-chip memory controller (MCT) attached to its own local memory.
Cache Coherence with Write-Back Caches (diagram): memory holds a = 7; Core 1 has written a = 28 into its own cache while Core 3 still holds the old copy a = 7, so a subsequent read of a on Core 3 must be kept coherent with Core 1's write.
Cache Coherence Protocols. Intel uses the MESI protocol (M: Modified, E: Exclusive, S: Shared, I: Invalid): a write miss, i.e. a write to a shared or an invalid block, results in a read-exclusive bus transaction that invalidates the other copies of the block, and a modified block has to be written back to memory when another cache reads it. AMD uses the MOESI protocol (M: Modified, O: Owner, E: Exclusive, S: Shared, I: Invalid), where Owner marks the most recent copy of the data: one cache can hold a block in the Owner state while the others hold it in the Shared state, the owner supplies the block to other caches reading it, the copy in main memory can be stale, and writing the modified block back to memory can be avoided.
MESI Cache Coherence Protocol (diagram): Core 1 writes a = 9, its line becomes Modified and Core 3's copy (a = 7) becomes Invalid; before Core 3 can read the new value, the modified line is written back and memory is updated from a = 7 to a = 9. Accessing system memory is expensive.
MOESI Cache Coherence Protocol (diagram): Core 1 writes a = 9 and its line becomes Owner while Core 3's copy becomes Invalid; on Core 3's next read the new value is supplied directly from Core 1's cache, and memory still holds the stale a = 7. Cache-to-cache transfers are carried out by the SRI (System Request Interface) or HyperTransport.
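As a concrete illustration of the invalidation traffic described above (not code from the talk), the sketch below has two threads update counters that sit in the same cache line, then repeats the run with each counter padded to its own line; on both MESI and MOESI machines the unpadded version makes the line ping-pong between cores. The 64-byte line size and the pthread harness are assumptions:

#include <pthread.h>
#include <stdio.h>

#define LINE 64            /* assumed cache line size in bytes */
#define ITER 100000000L

/* two counters in one cache line: every write by one thread
   invalidates the line in the other core's cache (ping-pong) */
struct { volatile long a, b; } shared_line;

/* each counter padded out to its own cache line: no coherence traffic */
struct { volatile long v; char pad[LINE - sizeof(long)]; } padded[2];

static void *bump_shared_a(void *arg) { for (long i = 0; i < ITER; i++) shared_line.a++; return NULL; }
static void *bump_shared_b(void *arg) { for (long i = 0; i < ITER; i++) shared_line.b++; return NULL; }
static void *bump_padded_0(void *arg) { for (long i = 0; i < ITER; i++) padded[0].v++; return NULL; }
static void *bump_padded_1(void *arg) { for (long i = 0; i < ITER; i++) padded[1].v++; return NULL; }

static void run(void *(*f0)(void *), void *(*f1)(void *), const char *label)
{
    pthread_t t0, t1;
    pthread_create(&t0, NULL, f0, NULL);
    pthread_create(&t1, NULL, f1, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    printf("%s done\n", label);   /* time each run externally, or with PAPI */
}

int main(void)
{
    run(bump_shared_a, bump_shared_b, "same cache line (false sharing)");
    run(bump_padded_0, bump_padded_1, "padded, separate cache lines");
    return 0;
}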
Memory and Cache Access Latency.
Random memory access latency:
        L1 (ns)   L2 (ns)   Memory (ns)
Intel   1.13      5.29      150
AMD     1.07      4.3       173
Cache-to-cache transfer latency:
                     Intel   AMD
Same CPU (ns)        55      83.5
Different CPU (ns)   187     119
Software Barrier (diagram): threads reach the barrier call at different times; the gather phase lasts from the first arrival to the last arrival, the release phase lasts until the last thread leaves, and the barrier time spans both.
Centralized Barrier Algorithm:
int flag = atomic_get(&release);
int count = atomic_int_dec(&counter);
if (count == 0) {
    atomic_int_set(&counter, num_threads);
    atomic_int_inc(&release);
} else
    spin_until(flag != release);
(Diagram: all threads P1..P4 decrement and spin on the single shared counter/release word, producing memory contention at the counter.)
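The slide's pseudocode uses QMT's atomic helpers; below is a self-contained sketch of the same centralized scheme (shared counter plus release flag), written with GCC __atomic built-ins and pthreads purely for illustration rather than as QMT's actual implementation:

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static int counter = NTHREADS;   /* threads still to arrive           */
static int release = 0;          /* bumped once per barrier episode   */

/* every thread decrements the shared counter; the last arrival resets
   the counter and bumps 'release', which the spinning threads watch */
static void centralized_barrier(void)
{
    int ticket = __atomic_load_n(&release, __ATOMIC_ACQUIRE);

    if (__atomic_sub_fetch(&counter, 1, __ATOMIC_ACQ_REL) == 0) {
        __atomic_store_n(&counter, NTHREADS, __ATOMIC_RELAXED);
        __atomic_add_fetch(&release, 1, __ATOMIC_RELEASE);
    } else {
        while (__atomic_load_n(&release, __ATOMIC_ACQUIRE) == ticket)
            ;   /* spin: every waiter hammers the same cache line */
    }
}

static void *worker(void *arg)
{
    long id = (long)arg;
    for (int iter = 0; iter < 3; iter++) {
        printf("thread %ld at iteration %d\n", id, iter);
        centralized_barrier();   /* no thread starts iter+1 early */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];
    for (long i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
    return 0;
}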
Queue-based Barrier Algorithm:
typedef struct qmt_cflag {
    int volatile c_flag;
    int c_pad[cache_line_size - 1];     /* pad each flag to a full cache line */
} qmt_cflag_t;

typedef struct qmt_barrier {
    int volatile release;
    char br_pad[cache_line_size - 1];
    qmt_cflag_t flags[1];
} qmt_barrier_t;

/* Master thread */
for (i = 1; i < num_threads; i++) {
    while (barrier->flags[i].c_flag == 0)
        ;
    barrier->flags[i].c_flag = 0;
}
atomic_int_inc(&barrier->release);

/* Thread i */
int rkey = barrier->release;
barrier->flags[i].c_flag = -1;
while (rkey == barrier->release)
    ;
Each thread signals arrival on its own cache-line-padded flag and spins only on the release word, which eliminates some of the memory contention of the centralized counter.
Analysis of the Barrier Algorithms under the MESI Protocol: the centralized barrier requires 5n memory reads/writes and the queue-based barrier 5n - 4 memory reads/writes for n threads.
Threads   Centralized   Queue-based   Ratio
2         N/A           N/A           1.00
3         15            11            1.36
4         20            16            1.25
Analysis of the Barrier Algorithms under the MOESI Protocol: the centralized barrier requires 3n - 1 cache accesses and the queue-based barrier 3n - 3 cache accesses for n threads.
Threads   Centralized   Queue-based   Ratio
2         5             3             1.67
3         8             6             1.33
4         11            9             1.22
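A quick check of the n = 4 MOESI row and of how the ratio behaves as the thread count grows (my arithmetic, not from the slides): the absolute saving of the queue-based barrier stays fixed at 2 cache accesses, so the relative advantage shrinks with n.

\text{centralized: } 3n-1\big|_{n=4} = 11, \qquad
\text{queue-based: } 3n-3\big|_{n=4} = 9, \qquad
\frac{11}{9} \approx 1.22, \qquad
\lim_{n\to\infty} \frac{3n-1}{3n-3} = 1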
Performance of the Barrier Algorithms (chart): the measurements suggest that something else besides memory contention is affecting barrier performance.
Performance Monitoring using PAPI: PAPI can be used to read performance event counters on both Xeon and Opteron processors. Intel: BUS_TRANS_MEM monitors all memory transactions. AMD: DC_COPYBACK_I is equivalent to the number of cache transactions, and NB_HT_BUS(x)_DATA counts HyperTransport data transactions. A native-event sketch follows below.
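A minimal sketch of programming one of these machine-specific (native) events through PAPI's low-level interface; the event name string is taken from the slide, and whether it resolves depends on the installed PAPI/perfctr version:

#include <stdio.h>
#include <papi.h>

int main(void)
{
    int eventset = PAPI_NULL;
    int code;
    long long value;

    PAPI_library_init(PAPI_VER_CURRENT);
    PAPI_create_eventset(&eventset);

    /* translate the native event name into an event code and add it */
    if (PAPI_event_name_to_code("BUS_TRANS_MEM", &code) != PAPI_OK ||
        PAPI_add_event(eventset, code) != PAPI_OK) {
        fprintf(stderr, "native event not available on this platform\n");
        return 1;
    }

    PAPI_start(eventset);
    /* ... code under test, e.g. one barrier episode ... */
    PAPI_stop(eventset, &value);

    printf("BUS_TRANS_MEM = %lld\n", value);
    return 0;
}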
Memory/Cache Transactions of the Barrier Algorithms
Multi-Threading Strategy. OpenMP: a standard for C, C++, and Fortran, but compiler dependent. Hand-written pthread library: portable and can be a better performer, but more complex.
OpenMP Performance from Different Compilers
QMT and OpenMP
Conclusions: Multi-core-processor-based SMP clusters present new challenges. Memory architecture influences the performance of barrier algorithms through memory contention and the cache coherence protocol. OpenMP is getting better, and hand-written multi-threading libraries can perform even better.