Multicore Parallel Computing with OpenMP




Tan Chee Chiang (SVU/Academic Computing, Computer Centre)

1. OpenMP Programming

The death of OpenMP was anticipated when cluster systems rapidly replaced large shared-memory (SMP) multiprocessor systems and MPI (Message Passing Interface) programming took over from OpenMP programming for parallel computation. However, with the emergence of the multicore processor, OpenMP programming is making a revival among HPC users. OpenMP is a compiler-directive-based method of implementing parallel programs for SMP systems: a set of compiler directives and callable runtime library routines that extend Fortran, C and C++ to express shared-memory parallelism. As OpenMP enables existing codes to be converted to parallel codes incrementally, e.g. by parallelising one DO loop (in a Fortran program) at a time, it is probably one of the easiest ways to achieve parallelism within a reasonably short time.

2. NAS Parallel Benchmarks

The OpenMP codes used in this multicore performance assessment were taken from the NAS (Numerical Aerodynamic Simulation) Parallel Benchmarks package developed at NASA Ames Research Center. These benchmarks consist of three simulated application benchmarks and five parallel kernel benchmarks. They mimic the computation and data movement characteristics of large-scale computational fluid dynamics (CFD) applications, one of the major HPC applications at NUS. For this study, one of the parallel kernel benchmarks, IS (integer sort), which is written in C, was not used; the other benchmarks are written in Fortran. The three simulated application benchmarks were BT (block tridiagonal solver), SP (pentadiagonal solver) and LU (block lower and upper triangular solver). The four kernel benchmarks used were FT (3-D FFT PDE), MG (multigrid), CG (conjugate gradient) and EP (embarrassingly parallel).
A detailed description of the benchmarks can be found at http://www.nas.nasa.gov/news/techreports/1994/pdf/rnr-94-007.pdf.

Performance Study

The objective of this study was to find out how well the OpenMP codes could exploit the capability of the latest multicore system. The first part of the study looked into the effect of problem size, and the second part examined the performance of different codes. The compiler used was the Intel Fortran compiler ifort, and the key compiler options used were -O2, -openmp, -i-static, -I/opt/intel/fce/current/include and -L/opt/intel/fce/current/lib. Do check out the following article if you wish to know how OpenMP was coded in each of the benchmarks: http://www.nas.nasa.gov/news/techreports/1999/pdf/nas-99-011.pdf

2.1 Effect of Problem Size

The BT simulated application benchmark was used in this test. The benchmark was executed in three sizes, Class A being the smallest and Class C the largest. A larger size, Class D, was not considered in the test.

NAS BT Class A
  Cores   Memory (MB)   Elapsed time (secs)   Speedup
  1       45            82.70                 1
  2       45            46.35                 1.78
  4       45            24.65                 3.35
  8       45            16.88                 4.90

NAS BT Class B
  Cores   Memory (MB)   Elapsed time (secs)   Speedup
  1       175           354.39                1
  2       175           206.21                1.72
  4       175           111.00                3.19
  8       175           83.89                 4.22

NAS BT Class C
  Cores   Memory (MB)   Elapsed time (secs)   Speedup
  1       690           1479.18               1
  2       690           788.06                1.88
  4       690           480.16                3.08
  8       690           352.71                4.19

[Chart: Effect of Problem Size — speedup vs. number of cores (1, 2, 4, 8) for Classes A, B and C]

The above results show that the problem size had only a minimal impact on the speedup when larger numbers of cores were used. However, it is important to note that, due to time constraints, the memory sizes we managed to test (up to 690MB) were relatively small compared to the memory available on the test system (16GB in total). To assess the effectiveness of a multicore system running large-memory applications, further tests are needed. If you have any large-memory OpenMP application, we will be happy to work with you in porting the application to these systems.

2.2 Performance of Different OpenMP Codes

NAS BT Class A
  Cores   Memory (MB)   Elapsed time (secs)   Speedup
  1       45            82.70                 1
  2       45            46.35                 1.78
  4       45            24.65                 3.35
  8       45            16.88                 4.90

NAS SP Class A
  Cores   Memory (MB)   Elapsed time (secs)   Speedup
  1       47            69.73                 1
  2       47            39.47                 1.77
  4       47            24.91                 2.80
  8       47            22.15                 3.15

NAS LU Class A
  Cores   Memory (MB)   Elapsed time (secs)   Speedup
  1       43            84.24                 1
  2       43            37.97                 2.22
  4       43            20.30                 4.15
  8       43            13.24                 6.36

NAS FT Class A
  Cores   Memory (MB)   Elapsed time (secs)   Speedup
  1       293           5.49                  1
  2       293           3.37                  1.63
  4       293           1.86                  2.95
  8       293           1.45                  3.79

NAS MG Class A
  Cores   Memory (MB)   Elapsed time (secs)   Speedup
  1       433           3.18                  1
  2       433           2.74                  1.16
  4       433           1.72                  1.85
  8       433           1.93                  1.65

NAS CG Class A
  Cores   Memory (MB)   Elapsed time (secs)   Speedup
  1       48            2.73                  1
  2       48            1.62                  1.69
  4       48            0.96                  2.84
  8       48            0.90                  3.03

NAS EP Class A
  Cores   Memory (MB)   Elapsed time (secs)   Speedup
  1       3544          15.92                 1
  2       3544          7.93                  2.00
  4       3544          4.04                  3.94
  8       3544          2.04                  7.80

[Chart: Speedup Comparison — speedup vs. number of cores (1, 2, 4, 8) for BT, SP, LU, FT, MG, CG and EP]

As expected, different codes/algorithms produced different levels of speedup during parallel execution. In general, you will get more speedup if a larger portion of your computation can be done in parallel. As the multicore system is also a shared-memory system, the memory access pattern and intensity also affect the speedup. One key observation was that the OpenMP codes used in this study did not scale as well on the multicore system as on a single-core multiprocessor system (as shown in the referenced article http://www.nas.nasa.gov/news/techreports/1999/pdf/nas-99-011.pdf). Comparing the SP benchmark performance below, for example, the scaling is clearly better on the single-core SMP system; on the quad-core CPU system, the speedup scaling is reasonable only up to four threads. A memory bottleneck could be the cause of the relatively lower scalability on the multicore system.

SP Benchmark                         Multiple single-core CPUs (195MHz)   2 x quad-core CPUs (3GHz)
  Single-thread elapsed time         1227.1 secs                          69.73 secs
  2 threads elapsed time (speedup)   646.0 secs (1.9)                     39.47 secs (1.8)
  4 threads elapsed time (speedup)   350.4 secs (3.5)                     24.91 secs (2.8)
  8 threads elapsed time (speedup)   175.0 secs (7.0)                     22.15 secs (3.1)

3. Conclusion

Even though some OpenMP codes may not scale very well on multicore systems, the ease of OpenMP programming will definitely make it an attractive option for HPC. Highly parallel codes, such as the one represented by the EP benchmark, are expected to do well. With multicore nodes in a cluster, users will also have another option to explore: multi-tier parallel computing, where message-passing parallel processing is done between nodes and multi-threaded parallel processing is done within each node.