Multicore Parallel Computing with OpenMP

Tan Chee Chiang (SVU/Academic Computing, Computer Centre)

1. OpenMP Programming

The death of OpenMP was anticipated when cluster systems rapidly replaced large shared memory (SMP) multiprocessor systems, and MPI (Message Passing Interface) programming took over from OpenMP for parallel computation. However, with the emergence of the multicore processor, OpenMP programming is making a revival among HPC users.

OpenMP is a compiler-directive-based method of implementing parallel programs for SMP systems: a set of compiler directives and callable runtime library routines that extend Fortran, C and C++ to express shared memory parallelism. As OpenMP enables existing codes to be converted to parallel codes incrementally, e.g. allowing one DO loop (in a Fortran program) to be parallelised at a time, it is probably one of the easiest ways to achieve parallelism within a reasonably short time.

2. NAS Parallel Benchmarks

The OpenMP codes used in this multicore performance assessment were taken from the NAS (Numerical Aerodynamic Simulation) Parallel Benchmarks package developed at NASA Ames Research Centre. The package consists of three simulated application benchmarks and five parallel kernel benchmarks, which mimic the computation and data movement characteristics of large scale computational fluid dynamics (CFD) applications, one of the major HPC applications at NUS. For this study, one of the parallel kernel benchmarks, the IS (integer sort) kernel, which is written in C, was not used; the other benchmarks are written in Fortran. The three simulated application benchmarks were BT (block tridiagonal solver), SP (pentadiagonal solver) and LU (block lower and upper triangular solver). The four kernel benchmarks used were FT (3-D FFT PDE), MG (multigrid), CG (conjugate gradient) and EP (embarrassingly parallel).
Detailed description of the benchmarks can be found at http://www.nas.nasa.gov/news/techreports/1994/pdf/rnr-94-007.pdf.
Performance Study

The objective of this study was to find out how well OpenMP codes could exploit the capability of the latest multicore system. The first part of the study looked into the effect of problem size and the second part examined the performance of different codes. The compiler used was the Intel Fortran compiler ifort, and the key compiler options used were -O2, -openmp, -i-static, -I/opt/intel/fce/current/include and -L/opt/intel/fce/current/lib. Do check out the following article if you wish to know how OpenMP was coded in each of the benchmarks: http://www.nas.nasa.gov/news/techreports/1999/pdf/nas-99-011.pdf

2.1 Effect of Problem Size

The BT simulated application benchmark was used in this test. The benchmark was executed in three sizes, Class A being the smallest and Class C being the largest. A larger size, Class D, was not considered in the test.

NAS BT Class A
Threads   Memory (MB)   Elapsed time (secs)   Speedup
1         45            82.70                 1
2         45            46.35                 1.78
4         45            24.65                 3.35
8         45            16.88                 4.90

NAS BT Class B
Threads   Memory (MB)   Elapsed time (secs)   Speedup
1         175           354.39                1
2         175           206.21                1.72
4         175           111.00                3.19
8         175           83.89                 4.22

NAS BT Class C
Threads   Memory (MB)   Elapsed time (secs)   Speedup
1         690           1479.18               1
2         690           788.06                1.88
4         690           480.16                3.08
8         690           352.71                4.19
[Figure: Effect of problem size — speedup vs. number of cores (1, 2, 4, 8) for Classes A, B and C]

The above results show that problem size had only minimal impact on the speedup when larger numbers of cores were used. However, it is important to note that, due to time constraints, the memory sizes we managed to test (up to 690MB) were relatively small compared to the memory available on the test system (16GB total). To assess the effectiveness of a multicore system running large memory applications, further tests are needed. If you have any large memory OpenMP application, we will be happy to work with you in porting the application to these systems.

2.2 Performance of Different OpenMP Codes

NAS BT Class A
Threads   Memory (MB)   Elapsed time (secs)   Speedup
1         45            82.70                 1
2         45            46.35                 1.78
4         45            24.65                 3.35
8         45            16.88                 4.90

NAS SP Class A
Threads   Memory (MB)   Elapsed time (secs)   Speedup
1         47            69.73                 1
2         47            39.47                 1.77
4         47            24.91                 2.80
8         47            22.15                 3.15
NAS LU Class A
Threads   Memory (MB)   Elapsed time (secs)   Speedup
1         43            84.24                 1
2         43            37.97                 2.22
4         43            20.30                 4.15
8         43            13.24                 6.36

NAS FT Class A
Threads   Memory (MB)   Elapsed time (secs)   Speedup
1         293           5.49                  1
2         293           3.37                  1.63
4         293           1.86                  2.95
8         293           1.45                  3.79

NAS MG Class A
Threads   Memory (MB)   Elapsed time (secs)   Speedup
1         433           3.18                  1
2         433           2.74                  1.16
4         433           1.72                  1.85
8         433           1.93                  1.65

NAS CG Class A
Threads   Memory (MB)   Elapsed time (secs)   Speedup
1         48            2.73                  1
2         48            1.62                  1.69
4         48            0.96                  2.84
8         48            0.90                  3.03

NAS EP Class A
Threads   Memory (MB)   Elapsed time (secs)   Speedup
1         3544          15.92                 1
2         3544          7.93                  2.00
4         3544          4.04                  3.94
8         3544          2.04                  7.80
[Figure: Speedup comparison — speedup vs. number of cores (1, 2, 4, 8) for BT, SP, LU, FT, MG, CG and EP]

As expected, different codes/algorithms produced different levels of speedup during parallel execution. In general, you will get more speedup if a larger portion of your computation can be done in parallel. As the multicore system is also a shared memory system, the memory access pattern and intensity also affect the speedup. One key observation was that the OpenMP codes used in this study did not scale as well on the multicore system as they did on a single-core multiprocessor system (as reported in http://www.nas.nasa.gov/news/techreports/1999/pdf/nas-99-011.pdf). Comparing the SP benchmark results below, for example, the scaling is clearly better on the single-core SMP system; on the quad-core CPU system, the speedup scaling is reasonable only up to four threads. A memory bottleneck could be the cause of the relatively lower scalability on the multicore system.

SP Benchmark              Multiple single-core CPUs system (195MHz)   2 x Quad-core CPUs system (3GHz)
Single thread             1227.1 secs                                 69.73 secs
2 threads (speedup)       646.0 secs (1.9)                            39.47 secs (1.8)
4 threads (speedup)       350.4 secs (3.5)                            24.91 secs (2.8)
8 threads (speedup)       175.0 secs (7.0)                            22.15 secs (3.1)
3. Conclusion

Even though some OpenMP codes may not scale very well on the multicore system, the ease of OpenMP programming will definitely make it an attractive option for HPC. Highly parallel codes such as the one represented by the EP benchmark are expected to do well. With multicore nodes in a cluster, users will also have the option of exploring multi-tier parallel computing, where the message passing type of parallel processing is done between nodes and the multi-threaded type of parallel processing is done within each node.