Available online at www.ijiere.com
International Journal of Innovative and Emerging Research in Engineering
e-ISSN: 2394-3343, p-ISSN: 2394-5494

Parallel Design Patterns for Multi-core Using Java

Jayshree Chaudhari
Student of ME (Computer Engineering), Shah and Anchor Kutchhi Engineering College, Mumbai, India.

ABSTRACT: Multi-core processors have become the norm, but software deployed on such systems is often not suited to exploit the processing power at its disposal. This paper presents how parallel programming design patterns can be adapted for performance benefits on multi-core CPUs. Java 7 provides only the Fork and Join parallel design pattern, which is efficient for recursive programming. For other scenarios, such as simple loops, dynamic load balancing and scientific calculations, other parallel design patterns prove handy. I have implemented design patterns not available in Java 7 as APIs and compared their core utilization and time taken with the corresponding sequential implementation of a Monte Carlo algorithm.

Keywords: Parallel Design Pattern, Single Program Multiple Data, Master Worker, Loop Parallelism, Shared Queue, Distributed Array.

I. INTRODUCTION
Up to the early 2000s, hardware manufacturers were successful in doubling the clock frequency of micro-processors every 18 months, in line with Moore's law. Micro-processor manufacturers then realized the need for multi-core processors due to the increasing gap between processor and memory speed, the limits of instruction-level parallelism, and high power consumption [1, 10]. A multi-core micro-processor delivers high performance through parallel processing across its cores. Running a job as a sequential program yields only the performance of a single core, leaving the other cores on the chip wasted [10]. To make optimal use of a multi-core processor, the program needs to be re-written in parallel using parallel programming.
Parallel programming is a methodology of dividing a large problem into smaller ones and solving the smaller problems concurrently [11]. The initial response to multi-core was parallel programming using multi-threading. However, writing parallel and concurrent programs in the multi-threading model is difficult. Parallel design patterns rescue the programming community from the intricacies of writing a parallel program and allow it to focus on business logic. Until Java 5, writing a parallel program meant writing a multi-threaded program, which was painful, as the development community had to deal with thread synchronization and the pitfalls of parallel programming. Java 5 and Java 6 provided the java.util.concurrent packages, which introduced a set of APIs giving the development community powerful concurrency building blocks. Java 7 further enhanced support for parallel programming by providing the Fork and Join design pattern API. Other powerful parallel design patterns like SPMD, Master Worker, Loop Parallelism, Distributed Array and Shared Queue are still not present [2]. I have implemented these design patterns as APIs and compared their time taken and core utilization with the corresponding sequential implementation of a Monte Carlo algorithm.

II. PARALLEL DESIGN PATTERN
Design patterns are recurring solutions to standard problems. Programming design patterns are modeled on successful past programming experiences. Once modeled, these patterns can be reused in other programs. Typically, hand coding a program from scratch, i.e. multi-threaded, results in better execution-time performance, but may consume immense amounts of time and introduce errors [9]. Design patterns for parallel programming make the programmer's life easier by addressing recurring problems like concurrent access to data, communication between parallel programs, synchronization, deadlocks, etc.
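As an illustration of the Fork and Join support that Java 7 does provide, the following sketch sums an array by recursively splitting it until sub-ranges are small enough to process sequentially. The class name and threshold are illustrative choices, not part of the paper's API; only ForkJoinPool and RecursiveTask come from the standard java.util.concurrent package.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Illustrative Fork and Join usage: parallel array sum via recursive splitting.
public class ForkJoinSum extends RecursiveTask<Long> {
    private static final int THRESHOLD = 1_000;   // below this size, compute sequentially
    private final long[] data;
    private final int lo, hi;

    public ForkJoinSum(long[] data, int lo, int hi) {
        this.data = data;
        this.lo = lo;
        this.hi = hi;
    }

    @Override
    protected Long compute() {
        if (hi - lo <= THRESHOLD) {               // base case: sequential sum of the range
            long sum = 0;
            for (int i = lo; i < hi; i++) sum += data[i];
            return sum;
        }
        int mid = (lo + hi) >>> 1;
        ForkJoinSum left = new ForkJoinSum(data, lo, mid);
        ForkJoinSum right = new ForkJoinSum(data, mid, hi);
        left.fork();                              // run the left half asynchronously
        return right.compute() + left.join();     // compute right half, then join the left
    }

    public static long sum(long[] data) {
        return new ForkJoinPool().invoke(new ForkJoinSum(data, 0, data.length));
    }
}
```

This recursive divide-and-conquer shape is exactly the scenario the paper identifies as Fork and Join's strength; the patterns implemented in this paper target the other scenarios.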
Parallel design patterns include Single Program Multiple Data (SPMD), Master-Worker, Loop Parallelism, Shared Queue and Distributed Array.

A. SPMD & Distributed Array
In the SPMD design pattern, all Units of Execution (UEs) execute a single program in parallel, each having its own set of data and a unique identifier. The SPMD pattern is mostly used in scientific applications with complex, compute-intensive algorithms that need to be executed in parallel without changing the program. The data might be different between UEs, or slightly different computations might be needed on a subset of UEs, but for the most part each UE will carry out
similar computations. Distributed Array decomposes the data either vertically or horizontally into a number of sets [1, 3]. Figure 1 shows a high-level diagram of the implementation of SPMD along with Distributed Array. The master process decomposes and distributes data across a predefined number of UEs (4 UEs are considered in figure 1) and assigns a unique identifier to each UE to differentiate behavior. Data is decomposed among the four UEs using the Distributed Array pattern. The results of the UEs are combined by the master process.

Figure 1: Single Program Multiple Data & Distributed Array Pattern [1]

B. Master Worker & Shared Queue
A single master process sets up a pool of worker processes and a bag of tasks. Workers execute concurrently, repeatedly removing a task from the bag of tasks and processing it. The Shared Queue shares data efficiently between concurrently executing UEs. The Master Worker design pattern dynamically balances the work on a set of tasks among the UEs [1]. Figure 2 shows a high-level diagram of Master Worker and Shared Queue. The master process creates a bag of tasks using a Shared Queue and creates a predefined number of UEs (4 UEs are shown in figure 2) which act as workers [6]. These UEs/workers concurrently remove tasks from the Shared Queue one by one and process them until the queue is empty or a termination condition is reached. The master then combines the results.

Figure 2: Master Worker & Shared Queue [1]
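The SPMD and Distributed Array combination described above can be sketched as follows. This is a minimal illustration, not the paper's actual API: the class name, the block decomposition and the use of an ExecutorService are my own choices. Each UE runs the same code, distinguished only by its unique identifier, on its own block of the array, and the master combines the partial results.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative SPMD + Distributed Array sketch: every UE runs the same program
// on its own block of the data, differentiated only by its unique id.
public class SpmdSum {
    public static long sum(final long[] data, int numUEs) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(numUEs);
        List<Future<Long>> partials = new ArrayList<>();
        final int block = (data.length + numUEs - 1) / numUEs;  // block decomposition
        for (int ue = 0; ue < numUEs; ue++) {
            final int id = ue;                                  // unique identifier for this UE
            partials.add(pool.submit(new Callable<Long>() {
                public Long call() {
                    int lo = id * block;                        // this UE's block of the array
                    int hi = Math.min(lo + block, data.length);
                    long s = 0;
                    for (int i = lo; i < hi; i++) s += data[i]; // same program, own data
                    return s;
                }
            }));
        }
        long total = 0;                                         // master combines the results
        for (Future<Long> f : partials) total += f.get();
        pool.shutdown();
        return total;
    }
}
```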
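The Master Worker and Shared Queue combination can likewise be sketched with a thread-safe queue as the bag of tasks. Again this is an assumed, minimal implementation rather than the paper's API: the master fills a LinkedBlockingQueue, the workers repeatedly poll it until it is empty (giving dynamic load balancing), and the master joins the workers and combines their results.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative Master Worker + Shared Queue sketch: the master builds a bag of
// tasks in a shared queue; workers drain it concurrently until it is empty.
public class MasterWorker {
    public static long process(int numTasks, int numWorkers) throws InterruptedException {
        final BlockingQueue<Integer> bag = new LinkedBlockingQueue<>();
        for (int t = 1; t <= numTasks; t++) bag.add(t);     // master creates the bag of tasks

        final AtomicLong result = new AtomicLong();
        Thread[] workers = new Thread[numWorkers];
        for (int w = 0; w < numWorkers; w++) {
            workers[w] = new Thread(new Runnable() {
                public void run() {
                    Integer task;
                    while ((task = bag.poll()) != null) {   // take next task: dynamic balancing
                        result.addAndGet(task);             // "process" the task (here: add it)
                    }
                }
            });
            workers[w].start();
        }
        for (Thread w : workers) w.join();                  // master waits for all workers
        return result.get();                                // master combines the results
    }
}
```

Because each worker pulls its next task only when it finishes the previous one, faster workers naturally take on more tasks, which is the load-balancing property the pattern is chosen for.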
C. Loop Parallelism
Loop Parallelism transforms the compute-intensive loops of a sequential program into a parallel program, wherein different iterations of the loop are executed in parallel. In most languages, loops are executed sequentially. If the number of iterations is high and the iterations are compute-intensive, then the time taken is high and core utilization is poor. In such cases it makes sense to execute the iterations in parallel threads on multiple UEs [1, 5]. Figure 3 shows the high-level diagram of Loop Parallelism; here the iterations of a for loop are divided equally among three UEs. The number of UEs is predefined before execution starts.

Figure 3: Loop Parallelism Design [1]

III. RELATED WORK
In the book Patterns for Parallel Programming, Timothy G. Mattson, Beverly A. Sanders and Berna L. Massingill introduced a complete, highly accessible pattern language using design spaces that helps any experienced developer "think parallel" and write effective parallel code. Design spaces provide the development community a medium for writing complex parallel programs using parallel design patterns. The Supporting Structures design space defines patterns like Master Worker, SPMD, Loop Parallelism, Distributed Array and Shared Queue/Data [1]. The Microsoft .NET Framework 4 included the Parallel Patterns Library (PPL), the Task Parallel Library (TPL) and Parallel Language-Integrated Query (PLINQ), which are commonly used with C++ and C#. On top of these libraries, the .NET 4 framework added design patterns like Parallel Loop, Fork and Join, Shared Data, and other patterns like MapReduce and Producer-Consumer [7]. Java added the Fork and Join design pattern, built on top of the executor framework, in version 7 [8].

IV. METHODOLOGY
The waterfall development model was used to develop APIs in Java for design patterns like Master Worker, SPMD and Loop Parallelism [4]. Development was done on top of the existing Java concurrency API, using the Eclipse development environment.
For testing the APIs, test classes were created along with corresponding sequential implementations. CPU core utilization was monitored using the Windows Task Manager. The following environment was used for testing the patterns:

Processor: Intel Core i5-4310U CPU @ 2.00GHz
Number of cores: 4
RAM: 4 GB
OS: Windows 7 Service Pack 1, 64-bit
JVM: JDK 7
CPU utilization: Windows 7 Resource Monitor
Algorithm: Monte Carlo

This test can also be carried out on any other operating system that can host a Java Virtual Machine (JVM) and has a multi-core CPU.
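A Monte Carlo computation of the kind used in these measurements can be sketched with the Loop Parallelism pattern: the iterations of the sampling loop are split evenly across a predefined number of UEs. The paper does not specify which Monte Carlo algorithm was benchmarked; the pi-estimation variant below, along with the class name and UE count, is an assumed illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative Loop Parallelism sketch: a Monte Carlo pi estimate whose
// sampling loop is divided equally among a predefined number of UEs.
public class MonteCarloPi {
    public static double estimate(long samples, int numUEs) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(numUEs);
        List<Future<Long>> hits = new ArrayList<>();
        final long perUE = samples / numUEs;                 // each UE's share of iterations
        for (int ue = 0; ue < numUEs; ue++) {
            hits.add(pool.submit(new Callable<Long>() {
                public Long call() {
                    ThreadLocalRandom rnd = ThreadLocalRandom.current();
                    long inside = 0;
                    for (long i = 0; i < perUE; i++) {       // this UE's slice of the loop
                        double x = rnd.nextDouble(), y = rnd.nextDouble();
                        if (x * x + y * y <= 1.0) inside++;  // point falls inside quarter circle
                    }
                    return inside;
                }
            }));
        }
        long inside = 0;                                     // combine per-UE hit counts
        for (Future<Long> f : hits) inside += f.get();
        pool.shutdown();
        return 4.0 * inside / (perUE * numUEs);              // pi ≈ 4 * hits / samples
    }
}
```

Because the iterations are independent, no synchronization is needed inside the loop body, which is what makes this pattern scale well up to the number of physical cores.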
V. OBSERVATION OF DIFFERENT DESIGN PATTERNS

TABLE 1: AVERAGE TIME TAKEN BY DESIGN PATTERNS (seconds)

Number of UEs | Fork and Join | SPMD & Distributed Array | Master Worker & Shared Queue | Loop Parallelism
Sequential    | 20.122        | 22.241                   | 21.540                       | 22.307
2             | 10.414        | 9.719                    | 9.836                        | 10.371
3             | 9.353         | 9.001                    | 9.088                        | 9.254
4             | 8.836         | 8.814                    | 8.884                        | 9.046
5             | 9.146         | 9.095                    | 9.291                        | 9.228
6             | 9.45          | 9.110                    | 9.6                          | 9.552

Figure 4: Core Utilization, Sequential
Figure 5: Core Utilization, Fork & Join (4 UEs)
Figure 6: Core Utilization, SPMD (4 UEs)
Figure 7: Core Utilization, Master Worker (4 UEs)
Figure 8: Core Utilization, Loop Parallelism (4 UEs)

The following observations are based on the test results:
- A sequential program mostly uses one core of the CPU, as shown in figure 4.
- With an increase in the number of UEs, core utilization improves and the time taken reduces, as observed in table 1.
- The design patterns show approximately 60% reduction in time taken compared with the corresponding sequential implementation, as shown in table 1.
- The time taken by the design patterns reduces drastically up to 4 UEs and thereafter starts increasing slowly.
- The Master Worker & Shared Queue, SPMD & Distributed Array and Loop Parallelism APIs developed show time reduction and core utilization comparable to the existing Fork and Join on the same machine, as observed in table 1 and figures 5, 6, 7 and 8.
- The time taken for execution differs each time the program is run, for both the sequential implementations and the pattern APIs developed.

VI. CONCLUSIONS
Sequential programs are not able to take advantage of advances in multi-core hardware architecture. To take advantage of multiple cores, programs need to be written in parallel as multi-threaded programs. Writing a multi-threaded program brings its own challenges, like concurrent access to data, communication between parallel programs, synchronization, deadlocks, etc. Design patterns address the challenges associated with writing multi-threaded programs. The observations show that all the APIs achieve approximately 60% reduction in time taken and 100% core utilization compared with the equivalent sequential program for the Monte Carlo algorithm. As the number of UEs increases, core utilization improves until the number of UEs equals the number of cores; after that, the time taken increases due to the overhead of swapping UEs between cores. In the future, these patterns can be extended to other programming languages. These patterns can also be used for distributed parallel programming.

REFERENCES
[1] Timothy G. Mattson, Beverly A. Sanders and Berna L.
Massingill. Patterns for Parallel Programming. ISBN: 0-321-22811-1.
[2] Cristian Mateos, Alejandro Zunino, Matías Hirsch. Parallelism as a Concern in Java through Fork-Join Synchronization Patterns. Computational Science and Its Applications (ICCSA), 2012 12th International Conference on, 18-21 June 2012, pp. 49-50.
[3] Ronal Muresano, Dolores Rexachs, Emilio Luque. Methodology for Efficient Execution of SPMD Applications on Multicore Environments. Cluster, Cloud and Grid Computing (CCGrid), 2010 10th IEEE/ACM International Conference on, 17-20 May 2010.
[4] Kurt Keutzer and Tim Mattson. A Design Pattern Language for Engineering (Parallel) Software. The Berkeley Par Lab: Progress in the Parallel Computing Landscape, Intel Technology Journal 13, pp. 213, 219-221, 2010.
[5] Meikang Qiu, Jian-Wei Niu, Laurence T. Yang, Xiao Qin, Senlin Zhang, Bin Wang. Energy-Aware Loop Parallelism Maximization for Multi-Core DSP Architectures. Green Computing and Communications (GreenCom), 2010 IEEE/ACM Int'l Conference on & Int'l Conference on Cyber, Physical and Social Computing (CPSCom), 18-20 Dec. 2010.
[6] R. Rugina and M. C. Rinard. Automatic Parallelization of Divide and Conquer Algorithms. In PPoPP '99, pp. 72-83.
[7] Stephen Toub. Patterns for Parallel Programming: Understanding and Applying Parallel Patterns with the .NET Framework 4. Parallel Computing Platform, Microsoft Corporation, July 2010.
[8] The Java Tutorials, http://docs.oracle.com/javase/tutorial.
[9] Narjit Chadha. A Java Implemented Design-Pattern-Based System for Parallel Programming. http://mspace.lib.umanitoba.ca/bitstream/1993/19835/1/chadha_a_java.pdf, October 2002, p. 3.
[10] Peiyi Tang. Multi-core Parallel Programming in Go. In Proceedings of the First International Conference on Advanced Computing and Communications (ACC'10), September 2010, p. 64.
[11] Almasi, G. S. and A. Gottlieb. 1989. Highly Parallel Computing. Benjamin-Cummings Publ. Co., Inc., Redwood City, CA, USA, p. 431.