Dynamic Profiling and Load-balancing of N-body Computational Kernels on Heterogeneous Architectures
Dynamic Profiling and Load-balancing of N-body Computational Kernels on Heterogeneous Architectures

A THESIS SUBMITTED TO THE UNIVERSITY OF MANCHESTER FOR THE DEGREE OF MASTER OF SCIENCE IN THE FACULTY OF ENGINEERING AND PHYSICAL SCIENCES

2010

By Andrew Attwood
School of Computer Science
Contents

List of Tables
List of Figures
List of Equations
Abstract
Declaration
Copyright
Acknowledgements
List of Abbreviations

Chapter 1 Introduction
    Motivation and Context of the Study
    Project Aims and Methodology
    Project Phases
    Structure for the Dissertation

Chapter 2 Heterogeneous Architectures
    Chapter Overview
    Cell Broadband Engine
        Development of the CBE
    Nvidia GPU with Intel Nehalem
        Nvidia Architecture
        Nvidia Development
        Intel Nehalem Architecture
        Intel Nehalem Development
    Chapter Summary

Chapter 3 Scheduling, Load Balancing and Profiling
    Chapter Overview
    Computational Scheduling
        Assessment of the N-Body Problem
        Block and Cyclic Scheduling
        Dynamic Scheduling
        OS Thread Scheduling
        Profiling
    Chapter Summary

Chapter 4 Heterogeneous Development Methodologies and Patterns
    Chapter Overview
    Development Methodologies
        Software Development Phases
        Algorithm Design Considerations
    Patterns
    Implementation
    Testing
    Chapter Summary

Chapter 5 Algorithm Analysis and Design
    Chapter Overview
    Analysis
        Amdahl's analysis of Nmod
        Nmod Application Structure Analysis
    SMP Design
    Cell Design
    CUDA Design
    Load Balancing and Profiling Algorithm Design
    Chapter Summary

Chapter 6 Implementation
    Chapter Overview
    Platform Configuration
        CELL Development Environment
        GPU Development Environment
    SMP Implementation
    CELL Implementation
    GPU Implementation
    Load-balancing and Profiling Integration
    Chapter Summary

Chapter 7 Testing and Evaluation
    Chapter Overview
    Performance Testing
    Cell Processor Testing
    Nehalem GPU Testing
    Chapter Summary

Chapter 8 Conclusions and Future Work
    Future Work
        Stored Knowledge
        Web Services Hint
        Operating System Enhancements
        Generic Framework
        Continuous Monitoring
        Predictive Initial Conditions
    Conclusion

Bibliography
Appendix A: CELL SDK Installation

Word Count: 23,380
List of Tables

Table 2.1: LIBSPE2 functions
Table 2.2: SPE DMA functions
Table 2.3: Thread and fork creation overhead
Table 3.1: Offset pattern for 100 elements
Table 4.1: Parallel languages
Table 5.1: Gprof output non-threaded 10 particles 10 steps
Table 5.2: Gprof output non-threaded 60 particles 10 steps
Table 5.3: Nmod speed-up values
Table 7.1: Single and dual thread performance
Table 7.2: SPE timing data
Table 7.3: CELL automated profiling algorithm timing
Table 7.4: GPU/Nehalem timing data
List of Figures

Figure 2.1: CBE block diagram
Figure 2.2: Division of computation CBE
Figure 2.3: Stream Processing CBE
Figure 2.4: CBE compile commands
Figure 2.5: Bus capacity
Figure 2.6: GPU CPU footprint comparison
Figure 2.7: Nvidia processor architecture
Figure 2.8: Two dual-threaded SMP processors
Figure 2.9: Two SMP single-thread processors
Figure 3.1: Computation level given index
Figure 3.2: 5-Body example
Figure 3.3: Body index calculation ratio
Figure 3.4: Block cyclic schedule
Figure 3.5: Pseudo code for in-process offset calculation
Figure 3.6: Cyclic schedule
Figure 3.7: Equal area schedule
Figure 3.8: DAG approach
Figure 4.1: Waterfall model
Figure 4.2: Message passing pattern
Figure 4.3: The divide and conquer pattern
Figure 4.4: CELL development process
Figure 4.5: Structure of arrays and arrays of structures
Figure 4.6: Implementation plan
Figure 5.1: N-body test pattern loaded into Matlab
Figure 5.2: Application code to generate random body distribution and mass
Figure 5.3: JSP diagram of the NMOD application
Figure 5.4: Original function declaration for findgravitationalinteractions
Figure 5.5: Thread communication structure
Figure 5.6: Pseudo code for thread creation and balancing on SMP
Figure 5.7: Nmod SMP cache coherent program design
Figure 5.8: CELL SPE program design
Figure 5.9: posix_memalign function prototype
Figure 5.10: Nvidia GPU design
Figure 5.11: Load-balancing and profiling diagram
Figure 6.1: Cell compilation commands
Figure 6.2: Compile command for threaded SMP
Figure 6.3: SMP multi-threaded implementation with load balance
Figure 6.4: Calculate gravitational interactions function SMP
Figure 6.5: Force accumulator for each SPE aligned to 16 byte borders
Figure 6.6: Trans data structure initialisation
Figure 6.7: Loading SPE image file
Figure 6.8: Passing arg structure to SPE
Figure 6.9: SPE context creation
Figure 6.10: SPE DMA transfer operation
Figure 6.11: CUDA call find gravitational interactions
Figure 6.12: CUDA find gravitational interactions
Figure 6.13: Load balance and profile GPU/NVIDIA
Figure 6.14: Load-balancing implementation
Figure 6.15: Finalising test
Figure 6.16: Assembly commands for hardware timers
Figure 7.1: CBE 2 PPE thread speed-up
Figure 7.2: 6 SPE speed-up
Figure 7.3: Speed-up on Nehalem and GPU
List of Equations

Equation 5.1: Amdahl's Law
Equation 5.2: NMOD speed-up value
Equation 5.3: Runge-Kutta 4th order integrator
Abstract

Increasingly, devices are moving away from single-architecture, single-core designs. From the fastest supercomputers to the smallest mobile phones, devices are now being constructed with heterogeneous architectures. This heterogeneity is owing partly to the slowing of speed increases in single-chip, single-core devices, but equally to the realisation that coupling specific devices to specific problems provides increased performance and power efficiency. Application development processes for multicore heterogeneous technologies are still in their infancy. Developing high-performance applications for the scientific community places the responsibility on the developer to maximise the use of the underlying architecture. Traditional approaches to the development process are inadequate for dealing with the complexity of instantiating computation on heterogeneous architectures, and current load-balancing algorithms fail to provide the dynamism required to best fit computation to the available resource. This project seeks to design an algorithm that optimises the mapping of a computational problem onto the available heterogeneous resource through runtime profiling and load-balancing. CELL and Nehalem/GPU heterogeneous architectures are targeted with an N-body simulator combined with an implementation of the profiling and load-balancing algorithm. This implementation is subsequently tested under different computational load conditions.
Declaration

No portion of the work referred to in this report has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.
Copyright

i. The author of this dissertation (including any appendices and/or schedules to this dissertation) owns any copyright in it (the "Copyright") and s/he has given The University of Manchester the right to use such copyright for any administrative, promotional, educational and/or teaching purposes.

ii. Copies of this dissertation, either in full or in extracts, may be made only in accordance with the regulations of the John Rylands University Library of Manchester. Details of these regulations may be obtained from the Librarian. This page must form part of any such copies made.

iii. The ownership of any patents, designs, trademarks and any and all other intellectual property rights except for the Copyright (the "Intellectual Property Rights") and any reproductions of copyright works, for example graphs and tables ("Reproductions"), which may be described in this dissertation, may not be owned by the author and may be owned by third parties. Such Intellectual Property Rights and Reproductions cannot and must not be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property Rights and/or Reproductions.

iv. Further information on the conditions under which disclosure, publication and exploitation of this dissertation, the Copyright and any Intellectual Property Rights and/or Reproductions described in it may take place is available from the Head of Department of Computer Science.
Acknowledgements

I would like to take this opportunity to thank my supervisor, Dr John Brooke, for his support and guidance during the past 16 months. My thanks also go to Dr Carey Pridgeon for developing the nmod application, which provides the computational base for my algorithm, and for his kind responses to my questions. I would also like to thank Dr Jonathan Follows for providing the CELL and Nvidia CUDA training at STFC Daresbury. Finally, I would like to thank my family for their help and support over the past two years.
List of Abbreviations

CBE: Cell Broadband Engine
SPE: Synergistic processing element
PPU: Power processing unit
SIMD: Single instruction multiple data
SPU: Synergistic processing unit
MFC: Memory flow controller
LS: Local store
DMA: Direct memory access controller
EIB: Element interconnect bus
SPUFS: Synergistic processing unit file system
SIMT: Single instruction multiple thread
ILP: Instruction level parallelism
SMT: Simultaneous multi-threading
JSP: Jackson structured programming
HPC: High performance computing
GCC: GNU C compiler
AGP: Advanced graphics port
GPU: Graphics processing unit
PCIe: Peripheral component interconnect express
CPU: Central processing unit
RAM: Random access memory
Chapter 1 Introduction

Heterogeneous computing is the coupling of dissimilar processor architectures, either within a single processor or as separate processors interconnected by an in-system bus. Heterogeneous computing is viewed as the next evolutionary step in the development of the CPU. Over the past five years we have witnessed great advances in the availability of processing cycles on devices other than the central processing unit. At the same time, our reliance on a single fast processor core has been challenged, as the speed increases of these devices have slowed [1]. To combat this issue, most manufacturers have developed multi-core variants of their standard processors. Over the past ten years, graphics card manufacturers have been increasing the capability of their cards, mainly to keep pace with the requirements of users who demand ever more realistic game graphics and physics. Recently, we have seen libraries released which enable users to exploit the GPU as a massively powerful compute resource. At the time of writing, it is common to see quad-core chips in standard desktop machines combined with a compute-capable GPU. We also see heterogeneous chips, such as the Cell Broadband Engine, used in the world's fastest computer [2]. This chip, in contrast to the CPU-GPU relationship, combines heterogeneous components within a single processor rather than as discrete elements connected by an external system bus.
Developing applications for single-threaded execution can be difficult enough, but the challenge increases when also facing multiple cores and heterogeneous compute elements. Multi-threading libraries support the instantiation of virtual threads, and the underlying operating system schedules these virtual threads onto the available homogeneous hardware. Operating systems are unable to schedule threads across heterogeneous compute components, which leaves the control of thread execution on these elements to the application that wants to make use of them. Application developers need to decide how best to apply the available computation to the hardware elements in the target system. Load-balancing concerns the assignment of computational tasks to the available elements in the machine. Failing to distribute load fairly over the available elements will result in less than optimal runtime performance [3]. In a high-performance computing environment, it is essential that idle time is reduced as much as possible so as to maximise the return on the investment made in the data centre infrastructure. During the lifetime of a computational task, the quantity of work assigned to an individual subsystem can change; an imbalance can therefore emerge at any point during the computation. Design-time profiling is a common step in the development process. As an application is being developed, profiling is conducted in order to determine the sections of code that would benefit from additional fine-tuning. These sections usually consume a disproportionate amount of the whole application runtime, and it is usually these sections that we strive to split between the available resources. The complexity arises when these tasks are to be divided amongst heterogeneous resources: it is only at runtime that we can determine the existence and capability of heterogeneous components and the cost of instantiating computation on them. This leads to the requirement to profile the application at runtime.
Planning for the development or redeployment of application code to a heterogeneous high-performance environment is a challenging activity. Conventional multi-threaded development is supported by a number of methodologies and design techniques concerned with aiding the development or transformation process. The ability to structure the development activity and to support it through a design procedure is an important aspect of this project.
1.1 Motivation and Context of the Study

This project is concerned with the development of a runtime profiling and load-balancing algorithm to support the deployment of computation to the available computing resource. With the ever-increasing complexity and heterogeneous mix of components in both desktops and servers, the ability to assign tasks to resources correctly will be critical when striving to realise the full potential of these machines. It should be noted that this project is concerned with floating-point applications as opposed to integer applications. High-performance applications typically spend a disproportionate amount of time executing a small set of instructions over a large data set; in physical simulations especially, this data set is iterated over many times as the simulation evolves. Integer applications typically use a large number of instructions and smaller data sets; office applications generally fall under this application type. Successful algorithm development will require different heterogeneous components to target. As outlined in the previous section, GPU components are becoming increasingly common. We believe that a heterogeneous system comprising a GPU and a host CPU, coupled by a PCI Express system bus, is an essential architecture mix to include in this project. The second architecture type should not be constrained by the connectivity of a system bus; architectures of this type are referred to as single-chip heterogeneous platforms. There are few affordable options available for targeting a single-chip heterogeneous platform; at the time of writing, the only realistic option is to make use of the Cell Broadband Engine as found in the PlayStation 3 games console. It may seem unusual to use a games console in the development of a complex, high-performance application; nevertheless, as shown by Buttari et al., the PlayStation 3 is more than capable for scientific research [4].
1.2 Project Aims and Methodology

The aim of this project is to design an algorithm capable of determining the best pattern of instantiation on the available computational resource for a given computational problem. The algorithm will then be implemented on different heterogeneous platforms to validate the effectiveness of the approach; this will be achieved through the transformation of an existing HPC application using a methodology for targeting heterogeneous architectures. The objectives of this project are summarised as follows:

- To understand the development process and profiling requirements of high-performance applications on heterogeneous architectures; and
- To obtain speed-up on each target architecture for the N-body simulation.

It is important that we have a deep understanding of the development process of heterogeneous applications, and that we can profile applications in real time in order to better match computation to the available resource. To prove that the suggested approach to runtime profiling is effective, we need to show speed-up over the single-threaded version of the application. Accordingly, in order to achieve the project aims, the following key points will need to be addressed:

- Understanding of the target heterogeneous architectures;
- Understanding of the development process for heterogeneous systems;
- Identification of the methods which enable runtime profiling and load-balancing; and
- Research into the available tools and technologies to enable the implementation.
To realise the development of a heterogeneous application, a thorough literature review of heterogeneous development will be undertaken. Two target architectures will be used to validate the profiling algorithm: the first is the Cell Broadband Engine, developed by Sony, Toshiba and IBM; the second is the Nvidia 9800 GTX graphics card with an Intel Nehalem host processor. These devices provide a total of four distinct computational units, each with its own toolset and architectural nuances which will need to be explored before the implementation phase. In order to implement a profiling algorithm and to profile the target application, the author will review the current literature and example case studies of multi-threaded and heterogeneous development; this will continue with a further general review of application development methodologies for both multi-threaded and heterogeneous development. In the same section, we will suggest an approach for development that will be used in the design and implementation. Following on from the literature review, we will detail the algorithm design, followed by details of the implementation. To test the load-balancing and profiling algorithm, a target application will be required. Developing such an application from scratch is outside the scope, and the time constraints, of this project; a number of applications were therefore considered after referring to the well-known list of computational dwarfs detailed in [5]. As a result, we decided to implement an N-body simulator, as it has sufficient complexity and can be easily adjusted to vary the computational problem size. A suitable open source project was found, titled N-Mod: hosted on Google Code and developed by Dr Carey Pridgeon, the application is provided as C source code. For the rest of the project, N-Mod will be referred to as the target application. Once the target application has been modified to run on the target architectures, the speed-up results will be documented without the use of the load-balancing algorithm. The load-balancing algorithm will then be applied, and a comparison of its effectiveness will be conducted.
1.3 Project Phases

Below is a summary of the main project phases:

1. Literature and background research.
2. Algorithm design.
3. Prototype algorithm implementation:
   a. Target platform installation and configuration.
   b. Development of the N-body algorithm.
   c. Implement flexible load-balancing system.
   d. Implement load-balancing algorithm.
4. Test/Timing.
5. Dissertation writing.
1.4 Structure for the Dissertation

This dissertation contains eight chapters. The first chapter provides the introduction and details the project's background, the significance of the problem area, and the motivation for the research. Chapter One also details the aims, a methodology to realise those aims, and the main phases of the work. Chapter Two investigates the architectures and development environments for the target heterogeneous platforms, as well as projects which have been successfully developed on these systems. Chapter Three researches computational scheduling, load-balancing and profiling, including both static and runtime techniques; these are fundamental to the development of an algorithm that scales independently of architectural constraints. Chapter Four reviews development methodologies, design formalisation techniques and patterns for single-threaded, multi-threaded and heterogeneous development. Chapter Five brings the research together to analyse the target application and propose a design for the heterogeneous profiling and load-balancing algorithm. Chapter Six details the implementation of the algorithm within the target application, highlighting how various challenges were overcome. Chapter Seven presents detailed testing of the converted application on the target architectures, including a review of the speed-up obtained using various core and load combinations; the load-balancing algorithm will also be tested so as to demonstrate that it can determine the best architecture for a given computational load. Chapter Eight provides the conclusions of the project and recommends future work concerning the implementation and design of the load-balancing algorithm, as well as the methodology followed in its development.
Chapter 2 Heterogeneous Architectures

2.1 Chapter Overview

Computer systems have traditionally been deployed using single-core homogeneous architectures. In order to meet growing user requirements, we have witnessed the development of systems which combine differing architecture types. However, the integration of different architectures to meet the demands of the user brings its own problems. For example, each architecture can differ drastically from the standard x86 design and requires specific compiler support [6]. One main difference is the way in which these architectures interact with main memory. This chapter surveys the literature relating to the target architectures that will be used in the application of the profiling algorithm.

2.2 Cell Broadband Engine

The Cell Broadband Engine is a single-chip heterogeneous system developed by a partnership of Sony, Toshiba and IBM [7]. It has a revolutionary design consisting of a single PPE (a modified PowerPC core) which controls eight SPEs (six of which are available in the Sony PlayStation 3). The PPE is a 64-bit dual-threaded (simultaneous multi-threading) processor similar to the PowerPC 970, with 32 KB of level-one cache split between separate instruction and data caches, and 512 KB of level-two cache. The PPE has a SIMD engine based on the AltiVec instruction set [8]. Although
powerful, the PPE is included primarily as the controller of the SPEs and is not seen as a target for computational load. Usually, the PPE runs the operating system and controls the operation of the HPC application. The PPE is also more capable of task-switching than the individual SPEs, which lack branch prediction logic [9]. Each SPE consists of an SPU, 256 KB of LS and an MFC. The SPU is a 128-bit vector engine which uses a different instruction set from that of the PPE. The SPU has a register file of 128 registers, each 128 bits wide, which is a large number of registers in comparison to a standard x86 processor; however, there is no on-chip cache. Furthermore, each SPU is not cache-coherent with the main system memory, and can only access its own LS and local MFC. The MFC is a messaging and DMA controller; it communicates with the other MFCs, the PPE and the main memory controller using the EIB. The EIB consists of four unidirectional buses which interconnect the PPE, SPEs, IO and memory controller, and provides a bandwidth of 204.8 GB/s. Figure 2.1 shows the interconnections of all the major components of the CBE. Figure 2.1: CBE Block Diagram
The PPE is capable of running application code which is compatible with the PowerPC architecture. Application code is executed from main memory, being transparently transferred to the L1 and L2 caches using a cache-coherency protocol. SPU application code is located in main memory, and a pointer to the application code is provided to the SPU by the PPE. The SPU can then direct the MFC to transfer the application code into the LS, and the SPU can subsequently execute the code directly from the LS. Once the SPU has finished with any variables in its LS, it can direct the MFC to transfer the data back into main memory. Messages can then be sent between the SPU and PPU to indicate the completion of processing [10]. Figure 2.2 details the division of two computational tasks between PPE and SPEs. The PPE performs various initialisations before starting the task on the individual SPEs; the SPEs compute their assigned sub-sections of the task before informing the PPE; the PPE collects and combines the results before preparing and starting the next stage of the model evolution; the PPE then continues evolving the model, performs further compute stages directing the SPEs, or terminates [9]. Figure 2.2: Division of computation CBE
SPEs and the PPE can communicate over the EIB by sending messages to each other's mailboxes. This messaging capability enables alternative kernel instantiation patterns. One common method is to have each SPE conduct a unique stage of processing, as shown in Figure 2.3. This essentially creates a computational chain commonly found in video processing. Figure 2.3: Stream Processing CBE There are additional benefits to using a chain of SPEs when the application code is too big to fit into a single SPE; by dividing the workload between the individual SPEs we can reduce the swapping of application code to and from the LS.

2.2.1 Development of the CBE

Only the Linux kernel can communicate directly with the individual SPEs, and so drivers have been developed in order to provide easy access to the developer.
The following functions are provided by the underlying driver [11]:

- Loading a program binary into an SPU;
- Transferring memory between an SPU program and a Linux user-space application; and
- Synchronising execution.

The application running on the PPE makes use of LIBSPE2 (SPE Runtime Management Library V2) to control execution on the SPEs. LIBSPE2 utilises the SPUFS, which is provided by the driver in kernel space. The PPE application uses SPE contexts provided by LIBSPE2 to load the SPE application image, deploy the image to the SPE, trigger execution, and finally destroy the SPE context [12]. The SPE contexts are key to controlling the execution of the binary image on the CBE. Each context needs to be run in an individual PPE thread which is spawned from the main PPE application. Due to the heterogeneous nature of the devices, the application machine code for the PPE and the SPE needs to be compiled separately. Sony provides a GCC toolchain which will create a binary executable for each architecture.

$ gcc -lspe2 helloworld_ppe.c -o helloworld_ppe.elf
$ spu-gcc helloworld_spe.c -o helloworld_spe.elf
Figure 2.4: CBE compile commands

In Figure 2.4, GCC compiles the PPE ELF executable, whereas the SPE ELF file is compiled using the spu-gcc command. To execute the application, the PPE ELF file is launched, which in turn loads the SPE ELF into memory and directs the individual SPEs to execute it through each SPE's context. Each SPE requires its own context structure, defined as spe_context_ptr_t. This structure is used to store the information required to communicate with the SPE. The functions
listed in Table 2.1 are used to load the SPE ELF image and, using the context structure, to deploy the image to the SPE, to initiate execution, and to destroy and clear the SPE image from memory.

spe_image_open(): Load the image from disk into main memory and return a pointer to the image.
spe_context_create(): Create the context, populating the supplied spe_context_ptr_t structure.
spe_program_load(): Load the supplied image pointer to the SPE defined by its context.
spe_context_run(): Run the current image on the context provided.
spe_context_destroy(): Destroy the SPE context.
spe_image_close(): Remove the image from main memory.
Table 2.1: LIBSPE2 functions

The application image which is loaded onto the SPE has to be supplied with the data required for processing. This data will be located in main memory or on a disk drive. It is the responsibility of each individual SPE to DMA the section of the data that it wishes to process from main memory to its LS. DMA operations are provided to the developer through intrinsic functions defined in the spu_intrinsics.h and spu_mfcio.h header files. Intrinsic functions wrap a number of machine language commands that are not available through the GCC compiler.

spu_mfcdma64(): Instructs the DMA controller in the SPE to transfer the specified number of bytes to or from main memory/LS. A tag is specified so that a number of DMA transfers can be conducted at once.
spu_writech(): A macro used to communicate between the SPU and the MFC.
spu_mfcstat(): Called to stall the application until a transfer is complete. A technique referred to as double buffering can be used so that the application processes the data it already has instead of waiting for all data to be transferred.
Table 2.2: SPE DMA functions

Using the functions in Table 2.2 and a pointer to a data structure containing the required addresses, the SPE can instruct the DMA controller to access main memory and make the data required for processing available to the SPU [12].
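A minimal sketch of such an SPE-side transfer is given below, assuming the Cell SDK convenience macros mfc_get()/mfc_put() from spu_mfcio.h (these wrap the raw spu_mfcdma64() intrinsic); the buffer size, alignment and effective-address handling are illustrative only, not taken from the nmod code.

#include <spu_mfcio.h>

#define CHUNK 4096                              /* well under the 16 KB per-transfer limit */
volatile char ls_buf[CHUNK] __attribute__((aligned(128)));

void fetch_and_return_chunk(unsigned long long ea)   /* ea: effective address in main memory */
{
    unsigned int tag = 1;                       /* DMA tag group, 0..31                     */

    mfc_get(ls_buf, ea, CHUNK, tag, 0, 0);      /* main memory -> local store               */
    mfc_write_tag_mask(1 << tag);               /* select this tag group                    */
    mfc_read_tag_status_all();                  /* stall until the transfer completes       */

    /* ... process ls_buf with SPU SIMD code ... */

    mfc_put(ls_buf, ea, CHUNK, tag, 0, 0);      /* local store -> main memory               */
    mfc_write_tag_mask(1 << tag);
    mfc_read_tag_status_all();
}

Issuing the next mfc_get() into a second buffer before waiting on the previous tag is the basis of the double-buffering technique mentioned in Table 2.2.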
Each SPE has its own LS and communicates with the system memory using DMA transfers. The DMA controller is connected to the other SPEs, the PPE and main memory using the EIB. Each SPE operates in a separate address space from the PPE and main memory, which is not coherent or uniform with the other SPEs or the PPE. One implication of using DMA is that items to be used on the SPE have to be aligned to a 16-byte boundary, with a maximum of 16 KB per transfer and a limit of 32 simultaneous transfers [12]. Application code on an individual SPE is executed by calling the function spe_context_run(). This call blocks until the SPE has completed processing. When running eight SPEs at the same time, we need to call spe_context_run() for each SPE without any one call causing the PPE application to halt; therefore, each call to spe_context_run() should be made in a separate thread [13].

Table 2.3: Thread and fork creation overhead: real, user and system times for fork() and pthread_create() on AMD 2.3 GHz Opteron (16 cpus/node), AMD 2.4 GHz Opteron (8 cpus/node), IBM 4.0 GHz POWER6 (8 cpus/node), IBM 1.9 GHz POWER5 p5-575 (8 cpus/node), IBM 1.5 GHz POWER4 (8 cpus/node), Intel 2.4 GHz Xeon (2 cpus/node) and Intel 1.4 GHz Itanium2 (4 cpus/node).

To start spe_context_run() in a separate thread, the POSIX threads (Pthreads) C library can be used. Pthreads is a C language programming interface specified in the IEEE POSIX 1003.1c standard [14]. It is preferable to create a thread for each context because thread creation overheads are lower than process creation overheads, as shown in Table 2.3. Threads are created using the pthread_create() function call and joined using the pthread_join() function [14].
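The PPE-side control flow described above can be sketched as follows; the SPE image filename, the NUM_SPES constant and the per-SPE argument blocks are assumptions made for illustration, and error handling is omitted.

#include <libspe2.h>
#include <pthread.h>

#define NUM_SPES 6                       /* the PlayStation 3 exposes six SPEs */

static spe_program_handle_t *image;

static void *spe_thread(void *arg)       /* arg: pointer to this SPE's work block */
{
    spe_context_ptr_t ctx = spe_context_create(0, NULL);
    unsigned int entry = SPE_DEFAULT_ENTRY;

    spe_program_load(ctx, image);        /* deploy the image to the SPE           */
    spe_context_run(ctx, &entry, 0, arg, NULL, NULL);   /* blocks until SPE exits */
    spe_context_destroy(ctx);
    return NULL;
}

void run_on_spes(void *args[NUM_SPES])
{
    pthread_t tid[NUM_SPES];
    int i;

    image = spe_image_open("nbody_spe.elf");   /* illustrative SPE binary name   */
    for (i = 0; i < NUM_SPES; i++)
        pthread_create(&tid[i], NULL, spe_thread, args[i]);
    for (i = 0; i < NUM_SPES; i++)
        pthread_join(tid[i], NULL);            /* wait for every SPE to finish   */
    spe_image_close(image);
}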
Access to SIMD instructions on the PPE is provided by the AltiVec headers; including these headers gives access to intrinsic functions which provide easy access to SIMD instructions. The SPE uses a different set of instructions for SIMD operations, which leads to additional work when porting application code from the PPE to the SPE. Double-precision performance on the first-generation CELL processor is relatively poor; IBM has corrected this issue in the second-generation processor.

2.3 Nvidia GPU with Intel Nehalem

GPUs are not general-purpose processing devices: they are unable to run integer-type applications such as operating systems, and are consequently unable to operate in isolation. Typically, a GPU is combined with a general-purpose processing unit, such as an x86 processor or an ARM RISC processor; these processor types are better suited to running integer-type applications. We will refer to the general-purpose processing unit as the CPU from this point forward. Usually, the GPU is connected to the CPU using the fastest system bus available; in recent years we have seen a move away from AGP to the faster PCI Express 2. Figure 2.5 shows the increase in data rate from AGP 4x to PCIe 2. Figure 2.5: Bus Capacity [15]
The CPU is capable of addressing the system's main memory, but the processors on the GPU can only communicate with their own dedicated on-board memory. The CPU communicates with the GPU over the PCI Express bus; application code and data can be transferred over this bus from system RAM using DMA transfers performed by the GPU's on-board DMA engine. Whilst the CPU has evolved into a generic compute device capable of running a range of different operating systems and applications, the GPU has targeted specific applications, mainly 3D graphics for the computer games market. However, the types of massively parallel calculations performed in rendering 3D graphics can be put to use in more general HPC applications. Traditionally, porting applications from CPU to GPU was not a trivial task [16]; nevertheless, recent advances by card manufacturers have led to the development of a number of frameworks which actively simplify GPU development. CUDA, OpenCL and DirectCompute are three frameworks which aim to reduce development time and, in the case of OpenCL and DirectCompute, enable easy cross-platform development [17]. The increasing usability of GPU frameworks and libraries, combined with the increasing computational ability of GPUs, is expected to increase their future use in HPC applications.

2.3.1 Nvidia Architecture

The GPU architecture that will be used in this project is based on the Nvidia G80 processor. The typical configuration of the G80 is shown in Figure 2.7 and consists of 16 SMs (streaming multiprocessors), each of which comprises 8 computational units, giving a total of 128 processors. Each computational unit can execute 32 threads, giving 4,096 simultaneous threads. The computational units operate on single-precision floating-point values; however, each processor has a single double-precision unit, which fundamentally affects the ability of the GPU to perform double-precision calculations at the same speed as single-precision
calculations. Each SM has its own 16 KB L1 cache, and the card carries a varying quantity of on-board main memory depending on its cost. The processor is designed to perform the same set of instructions in parallel on different data elements [17]. As we can see in Figure 2.6, the GPU has a greater proportion of its die area dedicated to compute activities than the CPU, whereas the CPU has large sections dedicated to control and caching owing to the nature of the general application types it has to execute. Figure 2.6: GPU CPU Footprint comparison [18] The GPU can forgo the expense of complex branch prediction hardware since the quantity of branches and random data accesses is low. One important aspect of targeting the GPU is the bandwidth available between the GPU and main memory over PCI Express, which is 8 GB/s, whereas access between the SMs and the on-GPU RAM can be as fast as 141 GB/s. Maximising on-GPU data access would therefore give an application roughly 16x the memory performance.

2.3.2 Nvidia Development

Developing GPU applications requires the use of separate frameworks and compilers from those of the host architecture. Three main frameworks have been developed: OpenCL, DirectCompute and CUDA. OpenCL is being developed by the Khronos Group and is envisaged as a framework and API abstraction enabling applications to execute over a wide spectrum of heterogeneous architectures [19]. DirectCompute is part of Microsoft's DirectX 11 release and, as such, is limited to the Windows platform. This project will utilise the CUDA platform from Nvidia.
CUDA was developed by Nvidia in 2006 as an extension to the C language [18]. Computational kernels are written as C functions preceded by the __global__ declaration specifier. These functions can then be called in a similar manner to a standard C function, with the exception that we define the number of thread blocks and the number of threads in a block. Threads in each block are able to communicate with each other through a shared memory and have the capability to synchronise. Each thread is executed by a single scalar processor, and threads are grouped into blocks that are allocated to an individual multiprocessor [20]. A number of blocks form a grid, which is also referred to as a kernel; only a single kernel can execute on a device at any one time [21]. Figure 2.7: Nvidia Processor Architecture Each block is divided into thread groups of 32, referred to as warps. The threads of a warp always perform the same instruction stream on different data values; Nvidia refers to this as Single Instruction Multiple Thread [20].
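The following C sketch illustrates the host-side steps implied above using the CUDA runtime API: allocating device memory, copying body data across the PCI Express bus, and computing a launch configuration. The kernel name and the use of float4 for body data are assumptions, and the launch itself (shown as a comment) uses the nvcc-only <<<blocks, threadsPerBlock>>> syntax.

#include <cuda_runtime.h>

void run_step_on_gpu(const float4 *h_bodies, float4 *h_forces, int n)
{
    float4 *d_bodies, *d_forces;
    size_t bytes = n * sizeof(float4);

    cudaMalloc((void **)&d_bodies, bytes);
    cudaMalloc((void **)&d_forces, bytes);
    cudaMemcpy(d_bodies, h_bodies, bytes, cudaMemcpyHostToDevice);  /* over PCIe */

    int threadsPerBlock = 256;                      /* a multiple of the 32-thread warp */
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;       /* round up         */
    /* forceKernel<<<blocks, threadsPerBlock>>>(d_bodies, d_forces, n); */

    cudaMemcpy(h_forces, d_forces, bytes, cudaMemcpyDeviceToHost);  /* results back     */
    cudaFree(d_bodies);
    cudaFree(d_forces);
}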
2.3.3 Intel Nehalem Architecture

Intel's Nehalem processor is the company's first microarchitecture to have a separate memory controller for each processor within the same SMP system [22]. In previous SMP systems, all processors were connected to the main system memory through a single memory controller; as processing speeds increased, this created a bottleneck for memory access. In [22], the authors demonstrate that Nehalem can operate up to four times faster than the previous flagship microarchitecture, Harpertown. The test server that will be used to evaluate the completed algorithm has a single Core i7 processor running at 2.67 GHz with four hyper-threaded cores. The Nehalem architecture features Turbo Boost and Hyper-Threading technology. Hyper-Threading is the variation of simultaneous multi-threading found in Intel microarchitectures: it enables each processor core to run two threads, so operating systems report two logical processors for every physical core. For our test system, this gives us eight hardware threads; this does not, however, mean that eight threads will be executing instructions in every clock cycle [23]. The Nehalem has a superscalar architecture, which allows multiple instructions to be run in parallel, referred to as ILP. The instruction pipeline of a logical processor will stall when data is required from the cache or main memory; a core lacking SMT features will stall in this situation, resulting in wasted clock cycles. In Figure 2.9, we can see that a single-threaded processor contains a single architecture state, whereas a dual-threaded processor has two architecture states.
Figure 2.8: Two dual-threaded SMP processors [23] Each logical processor is created by the existence of an architectural state on the processor. This architectural state comprises both the general control and the advanced interrupt control registers. The goal of Hyper-Threading technology is to allow thread-switching when the running thread stalls; the stall could occur for a number of reasons, including cache misses, branch mis-predictions, or pipeline result dependencies [23]. Figure 2.9: Two SMP single-thread processors [23] Both logical processors have access to the same L1 cache, which is split into 32 KB for data and 32 KB for instructions. Each processor core also has its own 256 KB L2 cache, and an 8 MB level-three cache is shared between all the processor cores. In order to maintain cache coherence, a snoop-based bus operates between the L2 and L3 caches. The processor has its own DDR3 memory controller and QuickPath Interconnect; QuickPath is a fast interconnect which is able to connect up to four processors together.
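As a small illustration of how an application can discover the number of logical processors the operating system exposes (eight on the hyper-threaded quad-core test machine described above), the following sketch uses the POSIX sysconf() call; it is an assumption for illustration rather than the detection method used later in the thesis.

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    long logical = sysconf(_SC_NPROCESSORS_ONLN);   /* logical CPUs currently online */
    printf("logical processors: %ld\n", logical);
    return 0;
}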
Nehalem also supports Intel's Turbo Boost technology, which enables the individual cores to speed up or slow down dependent on requirements [24]. Temperature levels on the processor are monitored: if only a single thread is running, the other cores will be idle, resulting in a low processor temperature. If only one core is active, its clock frequency can then be increased; if two or more cores are active, the increase is applied in smaller steps [24].

2.3.4 Intel Nehalem Development

Development for the Nehalem is similar to development on the PPE of the Cell Broadband Engine. The Pthreads library is used to create threads from the main application process, and these threads can perform computation; with a single Nehalem processor, eight threads can be active at once. The Nehalem processor is also capable of SIMD operation, and exposes the SSE instructions through intrinsic functions.
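As an illustrative sketch of the SSE intrinsics mentioned above, the following function accumulates four single-precision force components per instruction; the function name and the assumption that the array length is a multiple of four are for illustration only.

#include <xmmintrin.h>

void add_forces_sse(float *acc, const float *f, int n)   /* n assumed a multiple of 4 */
{
    int i;
    for (i = 0; i < n; i += 4) {
        __m128 a = _mm_loadu_ps(&acc[i]);   /* load four floats from each array   */
        __m128 b = _mm_loadu_ps(&f[i]);
        _mm_storeu_ps(&acc[i], _mm_add_ps(a, b));   /* acc[i..i+3] += f[i..i+3]    */
    }
}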
2.4 Chapter Summary

This chapter has provided background research on the heterogeneous architectures that will be targeted by the load-balancing algorithm. The research has shown that the CBE has a cache-coherent PPE which will be easy to target using the Pthreads library, and a similar approach can be adopted for the Intel Nehalem processor; there is no reason why the same application code should not run on both processors, unless SIMD intrinsic functions are used. The Nehalem has a faster clock rate and four hyper-threaded cores, whereas the single PPE supports two threads. The PPE also has less on-chip cache available and is slightly older technology compared with the Intel processor, which has recently undergone an overhaul of its microarchitecture. The CBE has a high-speed interconnect to the SPE processors, although developing software for the SPE architecture is more complex than targeting the cache-coherent PPE and Nehalem. The Nehalem will use the GPU as its heterogeneous processing element. The GPU is connected to the processor using the PCI Express bus; this may add latency when transferring the model from main memory to the on-board memory of the GPU, and may result in poor performance when only limited processing is required, giving a greater overhead-to-computation ratio. The implementation of a runtime profiler would allow the correct compute component to be targeted for a given computational task.
Chapter 3 Scheduling, Load Balancing and Profiling

3.1 Chapter Overview

Computational scheduling is the process of distributing a set of independent tasks to the underlying hardware with the aim of minimising the application runtime [25]. Load-balancing is a subset of this process which is concerned with evenly distributing the load between the available devices. We are concerned with addressing load imbalance, as it introduces a disparity in the execution times of the individual tasks and may therefore result in idle cycles on those devices allocated too few tasks. These idle cycles represent wasted time which could otherwise be used to increase the fidelity of the model being computed, or to save resources by requiring less time to process the current model fidelity. Profiling is concerned with timing an application and, more importantly, its components with a view to optimisation. Profiling is usually conducted at design time; this project, however, is concerned with adapting the scheduling of computation based on runtime profiling.

3.2 Computational Scheduling

There are two main types of scheduling algorithm: static and dynamic [26]. Static scheduling, also known as compile-time scheduling, divides the work into fixed-size chunks and distributes them between the available processors. Examples of
such algorithms include block, cyclic and block-cyclic scheduling. Dynamic scheduling assigns work at runtime. One example, self-scheduling, creates a queue containing all task blocks; blocks are assigned to individual processors, and when a processor has completed its assigned block it asks the scheduler for more work. Self-scheduling algorithms incur a communication overhead between the child and master processes, and there is also a period during which a process is idle whilst its request is being processed and the next block staged. A number of algorithms have been designed to reduce the communication overhead by varying the chunk size; one such example is guided self-scheduling [27]. As we have seen in Chapter Two, both the Nehalem, when in a dual-processor configuration, and the CELL BE have NUMA characteristics. A single Nehalem exhibits cache-coherent uniform memory access, as the individual cores have the same access time to cache and RAM. In the dual-processor configuration, each processor has a memory controller with direct access to its own RAM, and the processors are connected to each other using Intel's QuickPath Interconnect technology [28]. The L2 caches of the cores are coherent, but access times to the separate RAM controllers are non-uniform. Each SPE on the CELL BE is attached to the same system RAM, so the CBE can be described as a unified memory architecture; however, the SPEs are not cache-coherent and rely on the programmer to maintain uncontended access to memory locations. Scheduling on NUMA architectures should take into account the latency incurred in accessing memory [26].

3.2.1 Assessment of the N-Body Problem

Many tasks, including the N-body problem, have a computational requirement that differs according to the index being processed. When processing an N by N matrix, the computation is the same for every index of N. However, when dealing with the N-body problem,
we see that calculating index 1 requires more processing than index 2, and far more processing than index N. Figure 3.1: Computation level given index In the case of the N-body problem, this imbalance is caused by the nature of the force calculations for each body in the system. Figure 3.2: 5-Body Example In Figure 3.2, we can see that, during step 1, body index 1 needs to calculate the force acting between it and all other bodies in the system. Note that, in future iterations, particles will not
need to calculate the force acting between themselves and any previous particle for which force calculations have already been obtained, because those earlier calculations will already have updated the force each previous body applies to the current body. The number of calculations required for each body is therefore equal to the body's index subtracted from the number of bodies; Figure 3.3 shows this relationship. Figure 3.3: Body index calculation ratio (force calculations against body index)
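The relationship plotted in Figure 3.3 can be illustrated with a short C sketch: body index i (1-based) performs N - i force calculations, so the total over one step is N(N-1)/2. The five-body value of N follows the example of Figure 3.2 and is an arbitrary choice.

#include <stdio.h>

int main(void)
{
    int N = 5;                          /* the 5-body example of Figure 3.2    */
    long total = 0;
    for (int i = 1; i <= N; i++) {
        int work = N - i;               /* interactions computed for body i    */
        printf("body %d: %d force calculations\n", i, work);
        total += work;
    }
    printf("total: %ld (= N(N-1)/2 = %d)\n", total, N * (N - 1) / 2);
    return 0;
}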
One of the more basic scheduling techniques statically assigns a proportion of the complete task to individual nodes. Two common schemes are block and cyclic [25].

3.2.2 Block and Cyclic Scheduling

Figure 3.4 shows a block-cyclic schedule dividing the index dimension between the available processors using a static, equal, in-order distribution. For the N-body problem, this would result in a load imbalance: the first workload group, S1 in Figure 3.4, is allocated to a single processor, and processor 4, allocated workload S4, would finish roughly seven times faster than processor 1 running S1. The process of dividing and managing the allocation of work is a trivial task, with each process calculating the offset to its own section of work. Figure 3.4: Block cyclic schedule Each process can easily calculate its own subtask within the given body set. The overhead of this computation is low and requires no inter-process communication.

Set process to current thread number
Set start to N divided by number of threads, multiplied by process minus one
Set end to start add task size
If total threads modulus task size > 0
    If process <= total threads modulus task size
        Set start = start add process minus 1
        Set end = end add process
    Else
        Set start = start add total threads modulus task size
        Set end = end add total threads modulus task size
    End If
End If
Figure 3.5: Pseudo code for in-process offset calculation

When calculating the work offset, it is important to take into account the remainder when dividing work between processes; not doing so will result in an incorrect work schedule (see Table 3.1).
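A C sketch of the offset calculation that Figure 3.5 describes is shown below, using the common 0-based formulation in which the first N mod P threads each take one extra index; the function and parameter names are illustrative and are not taken from the nmod source.

void block_range(int N, int P /* threads */, int t /* this thread, 0..P-1 */,
                 int *start, int *end /* half-open range [start, end) */)
{
    int base = N / P;                   /* minimum block size               */
    int rem  = N % P;                   /* left-over indices to distribute  */

    if (t < rem) {                      /* first 'rem' threads take base+1  */
        *start = t * (base + 1);
        *end   = *start + base + 1;
    } else {                            /* remaining threads take 'base'    */
        *start = rem * (base + 1) + (t - rem) * base;
        *end   = *start + base;
    }
}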
Table 3.1: Offset pattern for 100 elements, showing the start index, end index and amount of work per processor both with and without the modulus correction.

Cyclic scheduling would provide a fairer distribution of work, permitting each core to process each index as an offset from the first index allocated to it; this would provide a theoretical best allocation. However, this scheme would cause cache coherency problems. The Nehalem has a 64-byte cache line which holds a copy of a specific region of RAM; if that region is required by another core, the line may be evicted, causing contention. If a cyclic schedule allocates index 1, holding a double-precision value, to process 1, and index 2, holding a separate double that resides in the same cache line, to another process, this will cause contention. The ideal schedule for cache-coherent cores would be to size each block so that the index values of the block reside in a single cache line; such a scheduling scheme has been developed in [26].
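The cache-line argument above can be made concrete with a small sketch that rounds a block length up to a whole number of 64-byte cache lines of doubles, so that no two threads write into the same line; the 64-byte line size is the Nehalem figure quoted in the text, and the helper name is illustrative.

#include <stddef.h>

#define CACHE_LINE_BYTES 64
#define DOUBLES_PER_LINE (CACHE_LINE_BYTES / sizeof(double))   /* 8 doubles per line */

/* Round a block length up to a whole number of cache lines' worth of doubles. */
static size_t round_to_cache_line(size_t block_len)
{
    return ((block_len + DOUBLES_PER_LINE - 1) / DOUBLES_PER_LINE) * DOUBLES_PER_LINE;
}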
Figure 3.6: Cyclic Schedule A seemingly good scheme for the N-body problem would be the block-cyclic equal-area schedule shown in Figure 3.7. Figure 3.7: Equal area schedule However, this may prove too perfect a scheme: it divides the total computation equally between the available processors while, as outlined in Chapter Two, memory access times across the processing elements are not uniform. Equally dividing the computation across the devices may therefore lead to latent processing (cores waiting for data to arrive from RAM), which will result in staggered completion of the equal computational loads.
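A sketch of how the equal-area boundaries of Figure 3.7 might be computed is given below, using the per-row cost N - 1 - i (0-based) and a greedy scan towards an equal share of the N(N-1)/2 total; the greedy choice and the function name are assumptions for illustration rather than the schedule used later in this thesis.

void equal_area_bounds(int N, int P, int *bounds /* length P + 1 */)
{
    long total = (long)N * (N - 1) / 2;     /* all pair interactions          */
    long target = (total + P - 1) / P;      /* work per processor, rounded up */
    long acc = 0;
    int p = 1, i;

    bounds[0] = 0;
    for (i = 0; i < N && p < P; i++) {
        acc += N - 1 - i;                   /* cost of row i                  */
        if (acc >= target) {                /* close this processor's block   */
            bounds[p++] = i + 1;
            acc = 0;
        }
    }
    while (p <= P)
        bounds[p++] = N;                    /* last block(s) take the rest    */
}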
3.2.3 Dynamic Scheduling

Dynamic scheduling allows processors to adapt their load balance to meet changing computational requirements or a changing state of the underlying hardware. One of the popular dynamic scheduling techniques, simple self-scheduling, establishes a master-child relationship: the master maintains a queue of work tasks, which are allocated to the child processes as they finish their current task. This type of dynamic scheduling works well when there are no mechanisms to determine the task size or dynamics. Usually, access to the master's queue of tasks must be restricted by a lock. If the task size is small, the processors will spend a high proportion of their time contending for the lock; if the task size is too great, there can be a load imbalance. Knowing the quantity of work ahead of time would allow an optimum work distribution to be created, providing a better balance between master lock contention and load imbalance; however, if that much is known about the task size, we can forgo the dynamic schedule and implement a static schedule, which has a much smaller overhead. In [29], the authors describe the use of a DAG composition over a static list, with tasks connected by a graph which specifies the communication weight. This could be useful for scheduling the same computational element across different heterogeneous elements that have differing communication overheads; it has been shown in the previous chapter that the communication costs of the GPU will be high if there is insufficient computation on the GPU.
Figure 3.8: DAG approach [29] Establishing a DAG would incur overheads, and the approach may be better suited to applications where the computation is unknown at the start and additional tasks are created during the computation; this is not a characteristic of the N-body problem. In [30], the authors show that a work-stealing approach to the redistribution of DAG tasks within a thread group improves if stealing is guided by locality to the work originally placed on the processor. This technique could be applied to the block scheduling technique, permitting block allocation and the subsequent theft of work in order to ensure proper distribution; theft through offsetting would ensure data locality. In [29], the authors describe problems as either deterministic or non-deterministic: deterministic problems are those where precedence relations exist between the tasks and the relationships between the tasks are known in advance; non-deterministic problems are those where information relating to execution characteristics is only known at runtime. The N-body problem is deterministic, in that we know the problem domain to be static, as well as the computational requirements. However, applying a deterministic problem to a non-deterministic platform means treating the deterministic problem as non-deterministic, owing to the underlying heterogeneous architecture.
3.2.4 OS Thread Scheduling

Applications can detect the number of hardware cores in a system and then use a scheduling algorithm to divide the computation between those cores. It is the responsibility of the operating system to assign those processes, and the associated computational load, to the individual hardware cores; even on a system which is dedicated to a single computational task, there will be contention for the individual cores. Sources of contention primarily stem from background services and operating system functions. Accordingly, we cannot rely on the underlying operating system to optimise thread placement. On a system which has a number of tasks running, we could find two computational threads placed on the same processor core; this would cause a serious imbalance on the single core running double the intended computational load. Static scheduling techniques would fail to address this imbalance, although a dynamic schedule could adapt to the changes.

3.2.5 Profiling

Profiling can be characterised as design-time or runtime profiling. Design-time profiling involves analysing the current application to determine the areas of the computation which are responsible for consuming the majority of the total compute time. Usually, an application is written to meet various requirements, and its output is validated against those requirements. Once functional, the application can be further re-engineered with performance in mind. The first step in the re-engineering process is to profile the application to identify the areas which can be redeveloped to obtain an increase in performance. Often, this engineering process targets a certain architecture type, which is an issue in modern computing environments where many different heterogeneous architectures can be found. Runtime profiling can analyse the application as it executes on the target architecture and tune the algorithm for that hardware combination.
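A minimal sketch of the runtime-profiling step described above is given below: one trial run of a kernel is timed with the POSIX monotonic clock so that candidate instantiation patterns can be compared on the machine actually being used. The function-pointer interface is an assumption made for illustration.

#include <time.h>

double time_kernel(void (*kernel)(void *), void *args)
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    kernel(args);                              /* one profiled trial run   */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    return (t1.tv_sec - t0.tv_sec)             /* elapsed time in seconds  */
         + (t1.tv_nsec - t0.tv_nsec) * 1e-9;
}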
3.3 Chapter Summary

In this chapter, we have identified that, given a known task size such as that provided by the N-body problem, static scheduling would provide a suitable approach for dividing the work between the available elements. However, we have also determined that, in an environment with many different heterogeneous hardware types, static scheduling techniques alone would be unable to determine the best pattern of instantiation. Runtime profiling would enable us to test the static schedule in order to determine its effectiveness and to select the optimum instantiation pattern.
Chapter 4 Heterogeneous Development Methodologies and Patterns

4.1 Chapter Overview

This chapter surveys current development methodologies in order to identify a work plan for the remainder of the project. It is essential that the stages of heterogeneous development be fully understood and formalised into a structured methodology. To assist in the design of a solution to the load-balancing and profiling problem, it is also essential that existing patterns for heterogeneous development be reviewed; it is well understood in the development community that, where possible, patterns for common implementations should be followed. This chapter will therefore identify a development methodology and a development pattern to assist in the realisation of the application code.

4.2 Development Methodologies

Development methodologies have become ubiquitous in assisting system development teams to produce quality applications which meet the requirements of end users. These frameworks have been developed over time in response to failed projects, and the development of standard working practices has led, in the majority of cases, to better and more reliable software. One such methodology, the waterfall model, breaks the development cycle into a number of phases (Figure 4.1). This
49 approach is common by splitting the development of software into discrete phases, and accordingly ensuring that each phase is complete before stepping to the next. We have also been able to strike commonality between phases of different methodologies, thereby showing that, in principle, at a high level, the same work is being carried out. Methodologies are followed for the development of new software or the redevelopment of old Software Development Phases The first phase is concerned with capturing the requirements which users will have of the completed software. Usually, an application is developed with the objective of fulfilling a number of specific requirements, e.g. process online payments. When complete, the application will be tested against these requirements so as to determine its fitness for purpose. HPC application development is usually concerned with the redevelopment of an algorithm or existing application to a HPC platform. The sole requirement in most cases is application speed-up; that is, the reduction in application runtime compared with a previous configuration. For example, a video compression application which runs on a single-threaded processor might need to be redeveloped so as to make use of a multi-core device, with the core requirement of making the video compression faster. Originally, the application was developed to meet the requirements of the user, and was tested and deemed to have passed that requirement. Now, the application is to be re-engineered in such a way to run faster on a multi core device. There is also the possibility that the application has been developed to run on a multi-core device, but is now required to run on a GPU. Thus, we can typically define our main development requirements to be concerned with temporal improvements. 49
50 Figure 4.1: Waterfall Model[31] Once requirements have been defined, we are subsequently concerned with the analysis of the problem and the design of the solution. Analysing the current performance of the application would require runtime profiling. Notably, there are a number of tools available for identifying the intensive sections of the application [32]. In this project, we will make use of the Gprof profiler. Gprof is a call graph extension profiler which monitors the percentage of time spent in each function [33]. This tool provides us with a good understanding of the intensive sections of the application; we can then target those functions to be processed in parallel. Before we start to think about designing an algorithm to support the parallel computation of the intensive section, we need to first determine if that section is a problem which can be run in parallel. In order to assess the suitability of the algorithm, we can analyse the algorithm and the math underpinning those algorithms in order to determine whether there is independence in 50
51 computation. Essentially, if there is a serial dependence, the likelihood of separating computation is then low unless we are able to redesign the algorithm without losing fidelity. Algorithms are usually developed by scientists in an iterative process. Segal, in [34], documents the results of over 10 years observations on the development of software by the scientific community. Notably, Segal studied earth and planetary scientists, and structural biologists. Segal notes that many that develop scientific software have no or little training in software engineering, and that the approach to development does not tend to follow any specific methodology and is more of an ad-hoc iterative process. Moreover, software is typically constructed by a scientist or by another scientist working in the same laboratory. However, such scientists are commonly experts in their own field, and the algorithms are produced to their requirements and accordingly verified through valid output. However, it is not expected that the algorithms be written in a way to best utilise the computational resource. In [35], the authors state that the goal of a scientist is to produce new scientific knowledge; optimising software which runs within their assigned computational quota is not a requirement. Optimisation is only required when the desired fidelity is not met within their allotted resource. Therefore, when changing an algorithm, it is important that the mathematical approximation currently required is not changed Algorithm Design Considerations Laxmikant V Kal in [36] suggests a number of dictums learned through extensive work in the development of parallel applications. Dictum 1 recommends the over-decomposition of migratable computational objects. Decoupling the computation from a single processor or number of processors eases the deployment of the algorithm. Kal also suggests that the use of a runtime system to deploy the computation to the available resource decoupling the algorithm from the hardware will enable this redeployment. Dictum 2 details the use of automated loadbalancing via measurement. Separate chunks of computation can be allocated to individual 51
52 processors or groups of processors. Following Dictum 1 will assist in the dynamic load-balancing of computational blocks required by Dictum 2. Dictum 3 states that using asymptotic and isoefficiency analysis to redesign the algorithm will ultimately provide scalability, thereby resulting in speed-up for an increasing number of cores. Dictum 4 mandates the use of performance analysis tools, and although this is aimed at many thousand core computers, previous research indicates that performance analysis is just as important for single core systems so the dictum holds for both large and small systems. Dictum 5 details the use of system emulators to prepare and analyse the algorithm before real system application. The Nvidia system provides a system simulator to compile an application to be simulated, and issue the make command with the emu=1 flag. IBM provides a full system simulator which actively boots a virtual Linux image that has access to a virtualised CELL processor. In [37], the authors suggest the following important stages in the development of parallel algorithms: 1. Subtask Decomposition 2. Dependence Analysis 3. Scheduling. Splitting the algorithm and accordingly developing subtasks of computation whilst finding and eradicating dependency between the computational blocks. Often, the access to dependent or shared data will subsequently result in a communication pattern and/or lock contention between processors. The authors in [37] further detail the importance of scheduling in order to ensure processor utilisation. In order to assist HPC application development, a number of languages have been proposed to ease the development of multi-threaded applications across systems with multiple computational elements (see Table 4.1). 52
53 Technology Organisation Coprocessor support Data parallel Task parallel Thread support Language/ Library Current target Architectures Brook1 AMD Y Y Y Y Language (C) AMD GPU Chapel Cray Y Y Y Language Multicore CPU Cilk++ Cilk Y Y Y Language (C11) Multicore CPU (shared memory Co-array Fortran Standards Body Y Y Language (Fortran) x86) Multicore CPU CUDA NVIDIA Y Y Y Y Language (C) NVIDIA GPU Fortress Sun/Open Y Y Y Language Multicore CPU source OpenCL Standards Body Y Y Y Y Language (C) GPU PVL MIT-LL Y Y Library (C11) Multicore CPU PVTOL MIT-LL Y Y Y Y Library (C11) Multicore CPU, Cell, GPU, FPGA Sequoia Stanford Y Y Language (C) Cell StreamIt MIT Y Y Y Language Multicore CPU Titanium UC Berkele Y Y Language (Java ) Multicore CPU UPC Standards Body Y Y Language (C) Multicore CPU VSIPL11 Standards Body Y Y Library (C11) Multicore CPU, NVIDIA GPU, Cell, FPGA X10 IBM Y Y Y Language (Java ) Multicore CPU Table 4.1: Parallel languages[38] Additional languages are being developed in order to address the failings in general purpose programming languages when used in parallel development. Failings are due to changing architectures which standard languages were not developed for. An important step in the design or redevelopment of an algorithm is to determine if the current language which the algorithm is written in is considered suitable for the target architecture. It may be worthwhile to redevelop the algorithm in a parallel language. Moreover, it is important to ensure that the lead time for development is considered, and that all target architectures are compatible with the parallel language chosen. Many languages e.g. Cilk++ have been developed with single-target architecture in mind. 53
54 4.3 Patterns Patterns have been used for a number of years in software engineering to add rigour to the software engineering process. Riehle & Züllighoven define a pattern as "the abstraction from a concrete form which keeps recurring in specific non-arbitrary contexts" [39]. Patterns are abstract concepts or frameworks for tackling complex algorithmic problems. When developing an application, it can be useful to use a pattern that is proven to solve a problem. Patterns are abstract rather than concrete and so require application to the problem in order to create a concrete algorithm. In Figure 4.2, we are able to see an example pattern for the message-passing action commonly found in parallel application development. Figure 4.2: Message passing pattern[40] As we can see, the pattern in Figure 4.2 contains many important elements which, although abstract, would assist in the development of an algorithm to support message passing. In Figure 4.3, we are able to see one of the most common patterns: divide and conquer. In this pattern, we produce individual sections of computation and then accordingly merge the result of 54
55 each individual piece of computation. As outlined in the previous chapter, load-balancing would be required in order to allocate computation to the individual processing elements. In [41], the authors detail various mechanisms to implement the divide and conquer algorithm: Fork-join: The main serial thread of the application must create (fork) individual threads on each processor core. The main thread must then wait for each thread to complete, which must be achieved in sequence (join). An imbalance in workload will ultimately leave the main thread stalled, waiting for the remaining joins. This idleness of the main thread, and of the threads which have already joined, is created by inefficient load-balancing. The most desirable balance will result in each thread completing in turn, just as the main thread is ready to join it (a minimal code sketch of the fork-join mechanism is given after this list). Communication Costs: The state of the model will need to be operated on by the thread. This will require the movement of data to the separate computational resource. The result of the computation will need to be transferred to a sub- or main thread for aggregation. It may be that all threads require access to a single control variable; this will require a locking system for that variable so that only one thread can access it at any one time. The locking of variables can lead to race and stall conditions, and can even result in deadlock. Dependencies: The initial conditions of each stage of the computation will need to be used by all threads. This differs from the new state, which should, in the main, be covered by communication costs. Importantly, a dependency, even when only being read, can result in contention. The author notes that the divide and conquer algorithm is suitable for application to the N-Body problem. 55
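The fork-join mechanism can be illustrated with the following minimal P-thread sketch; the worker function and the thread count are placeholders rather than Nmod code.

#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4

/* Placeholder worker: in a real application each thread would compute
   its own block of the problem here. */
static void *worker(void *arg)
{
    int id = *(int *)arg;
    printf("thread %d working\n", id);
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_THREADS];
    int ids[NUM_THREADS];

    /* Fork: create one thread per core. */
    for (int t = 0; t < NUM_THREADS; t++) {
        ids[t] = t;
        pthread_create(&threads[t], NULL, worker, &ids[t]);
    }

    /* Join: the main thread stalls until every worker has finished;
       an unbalanced workload leaves it idle waiting for the slowest. */
    for (int t = 0; t < NUM_THREADS; t++)
        pthread_join(threads[t], NULL);

    return 0;
}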
56 Figure 4.3: The divide and conquer pattern [41] 4.4 Implementation The implementation will involve developing the algorithm for each of the architectures in turn before combining the code for the individual heterogeneous architectures into a single solution. This will need to be further developed in order to integrate the load-balancing and profiling algorithms. Each architecture has its own suggested approach to the redevelopment of SMP application code. As the project is not currently threaded, the development of SMP application code targeting the PPE and Intel Nehalem will be the first task. Cell development is regarded as being one of the most complex programming exercises; even IBM's own documentation states that the process is not trivial and that it might be better to try IBM's expensive compiler, which comprises auto-parallelisation features. Figure 4.4 shows IBM's suggested CELL development strategy. 56
57 Figure 4.4: CELL Development Process [9] The first IBM stage falls within the remit of the analysis phase: that of identifying the low-hanging fruit, i.e. the code sections that will provide us with the greatest speed-up if run in parallel. The second phase indicates that porting the application to the PPE would be the first development activity. It seems reasonable that the application should first be ported to the general-purpose processor. Importantly, the PPE is far easier to target than the individual SPE if we use the standard libraries for both maths and heap memory acquisition, and we also use the P-thread libraries for thread creation. Once running, the application should also be fit for compilation on the Nehalem processor. As we have discovered in Chapter Two, both are cache-coherent SMP processors. If we are to use the vector operations on these processors, it would be better to organise the data as structures of arrays, as shown in Figure 4.5 and in the short illustration below, rather than as arrays of structures. Arrays of structures are a convenient abstraction, but avoiding them is also better for cache locality reasons. 57
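The two layouts can be illustrated as follows; the field names and element count are illustrative rather than the actual Nmod declarations.

#define N 1024

/* Array of structures: the x, y and z of one body are interleaved in
   memory, which hinders vector loads and wastes cache lines. */
struct body { double x, y, z; };
struct body bodies_aos[N];

/* Structure of arrays: each component is stored contiguously, so the
   vector units can load several x (or y, or z) values in one access. */
struct bodies_soa {
    double x[N];
    double y[N];
    double z[N];
};
struct bodies_soa soa;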
58 Figure 4.5: Structure of arrays (top) and arrays of structures (bottom). Part of stage two will involve using the threading library to run the intensive section of application code on the two threads of the PPE. This threading approach will also function on the Nehalem. In order to achieve this, the computationally intensive code will need to be implemented as a single function that can be called with the required offsets based on the number of threads; in the case of the Nehalem this will be eight threads, and on the Cell two. Achieving the required speed-up and verification of the model is essential before moving on to the GPU and SPE. Ultimately, there would be little or no point in continuing with the heterogeneous development without first obtaining speed-up on the SMP of the target device. 4.5 Testing Testing comprises two main outputs: the verification of the model and an analysis of speed-up. Double precision is supported by all the architectures, although each will incur a performance hit relative to its single-precision floating-point capability. However, the Nmod application, like many N-body applications, requires double-precision accuracy. Furthermore, verification of the model can be achieved through the visualisation of a predetermined pattern that will be executed on all target architectures. Moreover, speed-up will be calculated for all of the target architectures and core combinations. In order to test the validity of the runtime profiling system, set particle numbers will be manually computed on all processor configurations before applying the auto-profiling algorithm. Subsequently, the manual data can then be used to determine whether the auto-profiling approach selects the best architectural combination for the given problem size. 58
59 4.6 Chapter Summary This chapter has identified that the current methodologies can be applied to the development of HPC application as the modification of approaches in the implementation will vary slightly. The design stage of the application will use the divide and conquer pattern so as to allocate work to the computational elements, using the load-balancing techniques identified in the previous chapter. Figure 4.6: Implementation plan. When implementing the algorithm, the SMP will be first targeted using the PPE with nonthreaded function calls to the computationally intensive function. Next, the threaded SMP version of the algorithm with the load-balancing algorithm will be implemented. This application code should be tested on both PPE and Nehalem. Next, there will be the development of the SPE application code with subsequent addition of the load-balancing algorithm. Once complete, the same stages can be undertaken on the GPU. The illustration shows the targeting steps (Figure 4.6). 59
60 Chapter 5 Algorithm Analysis and Design 5.1 Chapter Overview In this chapter, we will show the analysis of the existing Nmod application and the design solutions to meet the speed-up requirements for the Intel Nehalem, CELL BE and Nvidia GPU. We will also illustrate the design of a load-balancing and profiling algorithm for the N-body problem, to be used in conjunction with the redesigned Nmod application for the three target architectures. We intend to achieve the following objectives, as outlined in Chapter One: To understand the development process and profiling requirements of high-performance applications on heterogeneous architectures; and To obtain speed-up on each target architecture for the N-Body simulation. 60
61 5.2 Analysis The initial step of the analysis phase requires the profiling of the original application. In order to achieve this, the application was re-compiled using GCC on Linux using the pg option. Running the compiled binary will produce the profile file that can be consumed with the gprof command line utility. The output of the initial profile can be found in Table 5.1. Gprof provides the following information: %Time - Percentage of time spent in this function. Cumulative Seconds Cumulative total number of seconds spent in function Self Seconds Number of seconds spent in function. Calls Number of times the function was called. Self ms/call Number of milliseconds spent in a function call Total ms/call - Number of milliseconds spent in a function call and descendants Name Name of the function. % time cumulative self seconds calls self total name seconds us/call us/call findgravitationalinteractions n/a n/a n/a main accountforinertia call do_global_ctors_aux clearaccumulator Table 5.1: Gprof output non threaded 10 particles 10 steps. 61
62 The result of the profile in Table 5.1 shows that the function findgravitationalinteractions consumes 72.88% of the total application time. This test run was configured with 10 bodies and 10 stages. % time cumulative seconds self seconds calls self us/call total us/call name findgravitationalinteractions main accountforinertia call do_global_ctors_aux clearaccumulator Table 5.2: Gprof output non-threaded 60 particles 10 steps A second profile, shown in Table 5.2, was taken with an increased particle count. It showed that, as the number of particles increased, the cumulative time taken in findgravitationalinteractions also increased. With 60 particles, the share of total application time spent in findgravitationalinteractions grew further, with the main function the next most intensive; the remaining functions take a nominal percentage of the total application time. 5.2.1 Amdahl's analysis of Nmod If we apply Amdahl's Law, shown in Equation 5.1, where P is the fraction of the code that can be run in parallel and N is the division of that parallel section (the number of processors), the following may be devised: Speed-up $= \dfrac{1}{(1-P) + \frac{P}{N}}$ Equation 5.1: Amdahl's Law 62
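A small, self-contained helper shows how Equation 5.1 behaves as the processor count rises; the parallel fraction used below is an illustrative placeholder rather than the profiled value.

#include <stdio.h>

/* Amdahl's Law: speed-up = 1 / ((1 - P) + P / N),
   where P is the parallel fraction and N the number of processors. */
static double amdahl_speedup(double P, int N)
{
    return 1.0 / ((1.0 - P) + P / N);
}

int main(void)
{
    /* Illustrative parallel fraction only; the real value comes from
       the gprof profile of findgravitationalinteractions. */
    double P = 0.95;
    for (int N = 1; N <= 6; N++)
        printf("N = %d  speed-up = %.4f\n", N, amdahl_speedup(P, N));
    return 0;
}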
63 Using the values obtained in the profile of 60 particles and two processors, we gained the speed-up value shown in Equation 5.2. The limitation on the speed-up of the application is the percentage of the serial code section which cannot be run in parallel, and the overhead of the parallel instantiation and maintenance [42]. However, the benefit of an increasing number of processors is governed by the law of diminishing returns, and the equation also fails to account for the communication costs and the thread overhead caused by the increasing processor count. Equation 5.2: Nmod speed-up value Table 5.3 shows the diminishing returns as we increase the processor count to 6, as we will observe on the CELL. However, this does not show the true speed-up, as there will be other costs incurred, such as communication and thread control. Parallel Number of processors Speed-up Table 5.3: Nmod speed-up values 5.2.2 Nmod Application Structure Analysis The Nmod application is written in C and uses an array of structures to maintain model state. Initial model state is loaded from a configuration file. To make it easier to test the application with different body counts, we have added a random body generator, detailed in Figure 5.2, which 63
64 creates a distinctive pattern that can be used for verification. Figure 5.1 shows the output of this pattern displayed in Matlab. Figure 5.1: N-body test pattern loaded into matlab. for(int i = 1;i < num_particles;i++) { xloc[i] = E+10 * log(i *10); if(i%2) { yloc[i] = E+10 * log(i * 10); } else { yloc[i] = E+10 * log(i * 10); } zloc[i] = E+09 * log(i * 10); } 64
65 mass[0] = E+31; for(int i = 1;i < num_particles;i++) { mass[i] = E+24 * log(i *10); } Figure 5.2: Application code to generate random body distribution and mass The application performs the following actions to initiate, evolve and then clean up the N-body model (a condensed sketch of this control flow is given after the list): Initiate the model. Start a single evolutionary step: first, we clear the accumulators that will be used to calculate this step; we then find the current forces acting on all the particles and apply those forces to each particle pair, and convert the forces acting on each particle into metres per second, accounting for inertia. This process is repeated four times, as required by the Runge-Kutta fourth-order integrator shown in Equation 5.3. Using Runge-Kutta, we update the next position. Output the current model state. Perform another Runge-Kutta step. Clean up the used data structures. 65
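The control flow above can be sketched as follows; the stubs stand in for the real Nmod functions named in the gprof profiles, whose actual versions take the model arrays as arguments, and the exact placement of the accumulator clear follows the description above rather than the original source.

#include <stdio.h>

/* Stubs standing in for the Nmod functions named in the gprof profiles;
   the real versions operate on the shared model arrays. */
static void initialise_model(void)              { }
static void clearaccumulator(void)              { }
static void findgravitationalinteractions(void) { }
static void accountforinertia(void)             { }
static void runge_kutta_update(void)            { }
static void output_model_state(void)            { }
static void cleanup_model(void)                 { }

int main(void)
{
    int steps = 10;                      /* illustrative step count */
    initialise_model();
    for (int s = 0; s < steps; s++) {
        clearaccumulator();              /* reset the force accumulators  */
        for (int rk = 0; rk < 4; rk++) { /* four evaluations per RK4 step */
            findgravitationalinteractions();
            accountforinertia();
        }
        runge_kutta_update();            /* advance to the next position  */
        output_model_state();
    }
    cleanup_model();
    printf("completed %d steps\n", steps);
    return 0;
}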
66 $k_1 = f(t_n, y_n)$, $k_2 = f(t_n + \tfrac{h}{2}, y_n + \tfrac{h}{2}k_1)$, $k_3 = f(t_n + \tfrac{h}{2}, y_n + \tfrac{h}{2}k_2)$, $k_4 = f(t_n + h, y_n + h\,k_3)$, $y_{n+1} = y_n + \tfrac{h}{6}(k_1 + 2k_2 + 2k_3 + k_4)$ Equation 5.3: Runge-Kutta 4th order integrator [43]. Figure 5.3: JSP Diagram of the NMOD Application The current Nmod application makes use of an array of structures; this will need to be refactored into a structure of arrays. This will reduce the number of cache locality issues and better enable the division of work between processor cores. The main data structures concerned with force calculation, after the application has been refactored from arrays of structures, are as follows: body force accumulator arrays xforce, yforce, zforce; body location arrays xloc, yloc, zloc; body speed arrays xsp, ysp, zsp; the body mass array mass; and the gravitational constant variable G. 66
67 Initial profiling results show that the function findgravitationalinteractions has the highest computational requirement of the application, so this should be redesigned to run in parallel. The findgravitationalinteractions function (Figure 5.3) is called from the applications main function four times for each step in the model evolution, as required by the Runge-Kutta 4 th order integrator shown in Equation 5.3. We will now consider each architecture in turn in mind of implementing specific design requirements, and accordingly consider the load-balancing and profiling algorithm. 5.3 SMP Design Dividing work between cache coherent elements will be the easier of the three design challenges. Each thread will require a data structure to locate the shared data structures required by the function findgravitationalinteractions. This function requires the current particle location using the arrays xloc, zloc and yloc. These location arrays will be read by each thread and not written to. Mass and the G variable will be accessed but only for reading. The force accumulators will be written to as the force acting between a pair of particles is calculated. The force will then be applied to both local and distant particle array. Local particles refer to the particle set that will be issued to a single thread. The advantage of the cache coherent bus means that we don t need to be concerned about thread writing to a distant particle force or a second thread writing to its particle sets force accumulators; this ultimately simplifies the per-thread structure and the requirement to perform aggregation of separate force accumulators that would be required if no cache coherency was present. void findgravitationalinteractions(double xloc[],double yloc[],double zloc[],double xforce[],double yforce[],double zforce[],double G,double mass[]) Figure 5.4: Original Function declaration for findgravitationalinteractions 67
68 As can be seen in the JSP design in Figure 5.7, threads are created, each of which spawns an instance of findgravitationalinteractions. After the force calculation, a load-balancing step is performed. The load-balancing algorithm will need to start at a point which is unfair, and then be balanced so as to provide each processor with a fair division of work. In order to achieve this, we perform a simple block cyclic allocation as shown in Chapter Three. We will then use the count available within the function find gravitationalinteractions to force a change in the distribution of work between the threads. This can be accomplished through the manipulation of start and end offsets used by each of the threads. The data structure shown in figure 5.5 would accommodate the required values to communicate the necessary data between main and child processes. Using P-thread, the required number of threads could be loaded then using joint function, waiting until the threads have been terminated. typedef struct{ double * xloc, * yloc, * zloc, * xforce; double * yforce, * zforce, G, * mass; int speid,numparticles; int soff,eoff,work; }trans; Figure 5.5: Thread communication structure. Each thread would need to return the number of forces calculated. We could then use this information combined with the total quantity of forces to balance the work by altering the offsets. The pseudo code as shown in Figure 5.6 shows the procedure for thread creation and load-balancing. Individual threads are created using the pthread_create function. After the threads are created a second function calls pthread_join on each thread. Set thread to number of smp processor cores For count = 0 while count < thread increment count Create thread Endfor For count = 0 while count < thread increment count Join thread Work total = thread work count Endfor 68
69 For each thread in threadgroup -1 If Thread.work > total work / number of threads * 0.05 Thisthread.endoffset = Thisthread.endoffset 1 Thisthread.startoffset = Thisthreadstart offset + 1 Endif Endfor Figure 5.6: Pseudo code for thread creation and balancing on SMP This will cause the join function to pause until each thread has been completed. Any imbalance will cause the loop to pause. In the best case, the call to pthread_join of the first function should pause the main thread until it is completed, although subsequent threads should then be terminated so that the application can continue. The third loop determines the total quantity of work performed by each loop, and using the offset within the structure changes its own work partition. This will balance the work evenly across the processors. However, the perfect balance of work may not be the fastest distribution. The profiling portion of the complete algorithm should reveal this imbalance, and may accordingly revert to the offset balance that has the best profile. In order to resolve this, we will discuss the design of the profiling portion of the algorithm at the end of this chapter. Figure 5.7: Nmod SMP Cache coherent program design. 69
70 5.4 Cell Design As discussed in Chapter Two, the CELL processor has a dual-threaded SMP PPE and between 6 and 8 SPEs. Unfortunately, however, the SPEs are not cache coherent, nor do they have uniform access to main memory. Any data required by the computation to be conducted on an SPE will need to be transferred by DMA to the local store and then sent back when finished. In the previous section, we identified that the force accumulator needs to be accessed by both threads. On the CELL, this will require individual force accumulators for each of the individual SPEs; these would need to be accumulated when the individual SPEs terminate. The same load-balancing technique can be applied to the SPEs as previously discussed for the SMP. Moreover, Figure 5.8 shows the movement of data between local store and RAM on either side of the computation, with the additional merge-forces step (a sketch of this merge is given below the figure). Figure 5.8: CELL SPE program design. 70
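The merge-forces step could take a form similar to the sketch below; the array names follow those used in Chapter Six, while NUM_SPE, the particle count and the function itself are illustrative rather than the original implementation.

#define NUM_SPE 6

/* Sum the private per-SPE accumulators back into the shared force
   arrays once every SPE has written its results to RAM. */
void merge_spe_forces(int num_particles,
                      double *xforce, double *yforce, double *zforce,
                      double *spexforce[NUM_SPE],
                      double *speyforce[NUM_SPE],
                      double *spezforce[NUM_SPE])
{
    for (int s = 0; s < NUM_SPE; s++) {
        for (int i = 0; i < num_particles; i++) {
            xforce[i] += spexforce[s][i];
            yforce[i] += speyforce[s][i];
            zforce[i] += spezforce[s][i];
        }
    }
}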
71 int posix_memalign(void **mptr, size_t alignment, size_t size) Figure 5.9: Posix memalign function prototype. DMA operations are conducted on the SPE. After the context is created, the DMA operation requires memory which is aligned to 16 byte borders. There is a POSIX function that can be used to obtain memory from the heap at a specified alignment; its prototype is shown in Figure 5.9. 5.5 CUDA Design The Nvidia GPU processor is not coherent with the SMP and has its own RAM which is not addressable from the host processor. The model will need to be transferred between the on-card memory and system RAM using the DMA functions and transfer engine located on the GPU. This is similar to the arrangement on the CELL processor; however, each multi-processing unit on the GPU is able to access the same model, whereas on the CELL an individual copy of the model state was required in each local store of the individual SPEs. As indicated in Chapter Two, the GPU uses a SIMT architecture, which results in simultaneous calls to the same function utilising different data elements. The identification of the current element is provided using the blockIdx and threadIdx variables. Moreover, Nmod utilises double precision and, at the time of this project, the GPU provides only limited support for double operations and has no support for providing locks for double values located in the on-board RAM. 71
72 Figure 5.10: Nvidia GPU design. When the find gravitational interactions algorithm calculates forces, it applies the force to both its own and distant force accumulators. On the CELL, we maintain separate force accumulators and then combine them; this avoids contention and likely corruption of individual elements. On the GPU, we have a single accumulator and threads, which could access a single item at the same time (Figure 5.10). If we are unable to lock the distant force accumulator, this presents a problem. We can overcome this problem by only calculating the local force. This would double the required computation; on most architectures, this would pose a problem. However, the highly parallel nature of the GPU processor should offset the additional expense. It is likely that blocks will be assigned a number of local particles, and it is possible that more than one thread will write to the local particle; this can be overcome by utilising a spinlock only permitting a single thread through the barrier. 72
73 5.6 Load Balancing and Profiling Algorithm Design The designs for both CELL and SMP provide a load-balancing algorithm which is able to monitor the current load of the work they are allocated and, using this information, modify offsets so as to create a near-perfect work balance between the individual cores. The single GPU that is being targeted uses a different scheme to the master-slave approach used by the CELL and SMP, instead applying the SIMT technique of running single instructions in parallel on separate data items. This does not require load-balancing when only using a single card, as the individual processors will consume thread blocks as they become available. The designs for each of the architectures target a single architectural component with the computationally intensive section of the application. In all cases, the SMP main thread deals with all computations except the findgravitationalinteractions function, which can be run on the SPE, GPU or threaded SMP. This is the common approach taken in high-performance application re-engineering. However, the decision concerning which device, and the number of cores, to be used to compute a given problem size may ultimately require an alternative configuration. Importantly, simply using the maximum capacity of a heterogeneous architecture does not account for the impact of the related instantiation overhead. In the case of a small problem size, it may be better to maintain the use of the SMP with a small core count. Moreover, rather than incurring the DMA communication overhead that applies when using either the CELL or the GPU, it might seem reasonable to apply a simple check to determine whether the computation is above a certain threshold before using the GPU or SPE. This, however, fundamentally underestimates the dynamic nature of the device. It may be the case that the SMP has other computation running alongside the Nmod application. In such a case, it would be better for smaller jobs to be issued to the heterogeneous devices in spite of the DMA overhead. 73
74 This project identifies the dynamic nature of both the problem size and the architecture state, and further proposes the use of runtime profiling so as to determine the best device and core-count configuration for a given computational load. As this can only be known at runtime, the algorithm should be dynamic. An important first step in the algorithm is to determine the architecture within the host system and accordingly create a test plan. This should provide the testing phase of the algorithm with a device list and also the component device structure. For instance, if we are creating a test plan for a QS21 CELL blade, the algorithm would then need to be provided with the information that it requires to test 1-4 SMP PPE cores and up to sixteen SPE cores. In the case of the Intel/GPU combination, the algorithm must identify that the GPU requires no load-balancing and that the CPU has four threads of execution. The algorithm should have two states: learning and trained. Initially, the algorithm should start in the learning state. The learning state should test each configuration from the test plan in turn, and subsequently give each test a number of iterations to balance to an even distribution. Once complete, it should select the best architecture and balance combination and subsequently continue processing as usual. During the learning state, on the other hand, there would be a small overhead relating to the changing of offset values and the storing of the current profile and balance state. There will also be a reduction in performance as the architecture and balance configurations are tested, as some of the balance configurations will be less than optimal. However, this overhead is a small proportion of the overall computation time, and running the model with the improved timing that results from the profiling and balancing analysis will yield greater cumulative performance benefits. For each learning cycle, the algorithm will time the computation step and store the information along with the architecture and load-balance offsets for each computational element. When there 74
75 are no more steps to take in the test plan, the best result is returned, the architecture and load balance offsets are loaded back into the data structures, and learning is disabled. Figure 5.11: Load-balancing and profiling diagram. The load-balancing and profiling algorithm design is shown in Figure 5.11, which depicts the learning state that controls whether the compute step should be processing test plan items. If no test plan items remain and learning is disabled, there will be no overhead for either the profiling or load-balance algorithm. This results in the remainder of the processing of the model state incurring no overhead whilst still contributing to the model evolution and learning the optimal balance. 75
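A simplified sketch of this learning loop is given below; the test-plan structure, the compute_step placeholder and the timing call are illustrative, and the sketch omits the per-configuration balancing iterations described above.

#define _POSIX_C_SOURCE 199309L
#include <stdio.h>
#include <time.h>

#define PLAN_ITEMS 4

/* Illustrative test-plan entry: which device and how many cores to use. */
struct config { const char *device; int cores; };

/* Placeholder for one force-evaluation step run on the architecture
   selected by cfg; each timed step still advances the model. */
static void compute_step(const struct config *cfg)
{
    (void)cfg;
}

int main(void)
{
    struct config plan[PLAN_ITEMS] = {
        { "SMP", 2 }, { "SMP", 4 }, { "SPE", 6 }, { "GPU", 1 }
    };
    int best = 0;
    double best_time = 1e30;

    /* Learning state: time one step per configuration, keep the best. */
    for (int i = 0; i < PLAN_ITEMS; i++) {
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        compute_step(&plan[i]);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        double elapsed = (t1.tv_sec - t0.tv_sec)
                       + (t1.tv_nsec - t0.tv_nsec) / 1e9;
        if (elapsed < best_time) { best_time = elapsed; best = i; }
    }

    /* Trained state: all remaining steps use the winning configuration. */
    printf("selected %s with %d cores\n", plan[best].device, plan[best].cores);
    return 0;
}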
76 5.7 Chapter Summary In this chapter, we have shown the analysis of the original Nmod application and identified the computational section requiring optimisation. Considering the architectural details identified in Chapter Two and the load-balancing approaches detailed in Section Four, we have proposed designs which should provide performance increases on the single-threaded version of Nmod. Designs have also been proposed for an algorithm to profile different heterogeneous configurations at runtime to select the optimal configuration. 76
77 Chapter 6 Implementation 6.1 Chapter Overview This chapter will detail the implementation of the Nmod application on both CELL and Nehalem/GPU. We will show important configuration details which are required to recreate the development steps. Important code sections will be included, along with annotations which detail the decisions made in the implementation of the algorithms described in the previous chapter. We start by implementing the computationally intensive sections of Nmod on the individual architectures, and subsequently integrate the application code with the load-balancing and profiling algorithm so as to create the heterogeneous application, capable of automatically profiling and balancing the computational task, to the available computational resource. 6.2 Platform Configuration CELL Development Environment CELL development was conducted on a 60GB Version 1 PS3 which had the other OS facility enabled. Sony commissioned Fixstars to develop a Linux variant which have the ability to operate on the PS3 and to accordingly give it access to the individual SPE. The PS3 system was installed with this Linux variant Yellow Dog Version 6.1. The ability to utilise the PS3 for development has been removed by Sony through the distribution of a firmware update version 77
78 3.21. The new firmware disables the ability to install and boot Linux on the system, thus disabling access to the development environment. Appendix A details the steps required to install the software required for CELL development. If the outlined procedures are followed, the Cell libraries will have been installed and the GCC compiler will be updated with the CELL extensions and configured for SPE compilation. The compile stings in Figure 6.1 show how the cell binary is created using the updated GCC[44]. gcc -lspe2 nbody_ppe.c -o nbody_ppe.elf --std=c99 -I/opt/cell/sdk/usr/include - I/opt/cell/sdk/usr/lib/ -L/opt/cll/sdk/usr/spu/lib -lm --std=c99 -I/usr/lib/gcc/ppu/4.1.1/include/ spu-gcc nbody_spe.c -o nbody_spe.elf --std=c99 -I/opt/cell/sdk/usr/include -I/opt/cell/sdk/usr/lib/ -L/opt/cll/sdk/usr/spu/lib -lm --std=c99 Figure 6.1: Cell Compilation commands CELL applications comprise two code sections: the PPE code and the SPE code. The PPE code is compiled using the gcc command just as a normal SMP application would be. Additional libraries are required by the PPE application to communicate with the SPE; these are included with the lspe2 switch. The input source is named nbody_ppe.c, whilst the output format is an elf executable. Both SPE and PPE codes are compiled into separate elf files. Some of the application code is written using the C99. Standard GCC requires the std=c99 switch in order to recognise the inclusion of this code. The maths library is also required lm, as are the inclusion of 2 library directories located in the /opt/cell directory structure. In order to compile the SMP code, the command in Figure 6.2 was used to compile the source. The only requirement was the use of the P-thread libraries which are included as part of the standard library. gcc main.c --std=c99 -lm lpthread Figure 6.2: Compile command for threaded SMP 78
79 6.2.2 GPU Development Environment GPU development was conducted in two phases. Initially, the application was developed on a dual-core Xeon processor with an Nvidia GeForce 9500GT; however, this lacked the double precision required by the Nmod application (a device-capability check of the kind sketched at the end of this section can confirm whether a given card supports double precision). Due to this failing, the final implementation and development had to be conducted on the Mcore server provided by the School of Computer Science at Manchester University. Linux Fedora Core 10 was installed on the initial test server and was configured with the CUDA toolkit 2.3 and GCC. In order to enable CUDA development, the NVIDIA driver was installed. Usually, the system will configure an open-source driver by default; however, this will not be CUDA capable. In order to configure the driver we run init 3 so as to shut down any X Windows systems, and subsequently install the driver before restarting the X Windows system using init 5. Once the driver was installed, we then downloaded and installed the CUDA toolkit. In order to compile CUDA source code using GCC, we had to add /usr/local/cuda/bin to the path variable and accordingly create a new variable named LD_LIBRARY_PATH, giving it the value /usr/local/cuda/lib. By default, Fedora 10 installs the SELinux security system. To use CUDA, this needs to be disabled using the setenforce 0 command [45]. Once installed, the CUDA compiler driver is called using the nvcc command. The following sections will detail the implementation of the Nmod application for each of the target architectures. We will then highlight the integration of the load-balancing and profiling algorithm for both the CELL and GPU. 79
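The device-capability check referred to above could look like the following host-side sketch; it is illustrative rather than part of the original build, and would be compiled with nvcc against the CUDA runtime. Devices with compute capability below 1.3, such as the GeForce 9500GT used in the first phase, cannot execute double-precision arithmetic.

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; d++) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, d);
        /* Double precision requires compute capability 1.3 or higher. */
        int has_double = (prop.major > 1) ||
                         (prop.major == 1 && prop.minor >= 3);
        printf("device %d: %s, compute %d.%d, double precision: %s\n",
               d, prop.name, prop.major, prop.minor,
               has_double ? "yes" : "no");
    }
    return 0;
}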
80 6.3 SMP Implementation Initially, the Nmod application was developed with the aim of using two types of integration: Midpoint and Runge-Kuttar 4 th order. There was also a great deal of code dedicated to reading and formatting results for integration with the visualisation tools. As we were using the Nmod application as a test harness for the development of heterogeneous applications and the implementation and evaluation the load-balancing and profiling algorithm. The code would need to be refined to the essential functionality and initialisation using a body generator that would give a uniformed distribution. The essential code sections to retain in the project include: Automated model generation to verify and to aid in the testing of the application using different body counts. Runge-Kuttar algorithm to advance the model Calculate force interaction functionality. Account for inertia Clear force accumulators. pthread_t threads[num_of_threads]; typedef struct{ double * xloc, * yloc, * zloc, * xforce,* spexforce[6], * speyforce[6], * spezforce[6]; double * yforce, * zforce, G, * mass; int speid,numparticles; }trans; double * xforce, * yforce, *zforce; ret = posix_memalign(&xforce,16,msize ); ret = posix_memalign(&yforce,16,msize ); ret = posix_memalign(&zforce,16,msize ); 80
81 trans send[num_of_threads]; for(int i =0;i<NUM_OF_THREADS;i++) { send[i].g = G; send[i].xloc = xloc; send[i].yloc = yloc; send[i].zloc = zloc; send[i].xforce = xforce; send[i].yforce = yforce; send[i].zforce = zforce; send[i].mass = mass; send[i].speid = i; send[i].eoff = 0; send[i].soff = 0; //send.numparticles = num_particles; } //Single Step clearaccumulator(xforce,yforce,zforce); for(int t = 0;t < NUM_OF_THREADS;t++) { rc = pthread_create(&threads[t], NULL, findgravitationalinteractions, &send[t]); if (rc) { printf( ERROR; return code from pthread_create() is %d\n,rc); exit(-1); } } totalwork = 0; for(int t=0; t<num_of_threads; t++) { rc = pthread_join(threads[t], &status); totalwork += send[t].work; if (rc) { exit(-1); } } for(int i = 0; i < (NUM_OF_THREADS - 1); i++) { if(send[i].work > ((totalwork/num_of_threads) + ((totalwork/num_of_threads) * 0.10))) { send[i].eoff--; send[i+1].soff--; } } Figure 6.3: SMP multi-threaded implementation with load balance 81
82 Figure 6.3 shows various key components of the application code; the full listing, located on the CD, can be found in the directory ptandmod. We devise an array of thread data structures in order to create and maintain the individual P-threads. When creating the individual threads using pthread_create, we pass a function pointer to the findgravitationalinteractions function, the thread control array element, and a pointer to the data structure we would like to access inside the called thread. We are not able to send individual parameters; therefore, we need to encapsulate each of these using the trans data structure. The trans data structure comprises pointers to the data structures which have been created in the initiation phase of the application. The data structure has additional fields which enable the thread to identify itself via a unique ID and to accordingly identify the offsets that it should apply to the block allocation. The values are then set in a loop with the maximum number of threads as the upper bound. Each data structure has its pointers set to the shared data structures containing the current model state, or to the individual force accumulators that will hold the new forces. Each of the offset values is initially set to 0. Each stage is initiated with a call to the clear accumulators function; this remains unchanged from the original application. We then call the pthread_create function for each thread we wish to activate, passing the trans data structure, the P-thread data structure and a pointer to findgravitationalinteractions. Each instance of the thread will run a copy of findgravitationalinteractions, with the main program issuing pthread_join on each of the threads, pausing the main thread until they have all completed. The findgravitationalinteractions function (Figure 6.4) takes the pointer to the per-thread structure containing the thread id, the offset values and the pointer to the current model state. Using this information, the thread calculates the range of particles that it should work on. It does this by dividing the work using a block cyclic approach, dividing the local bodies equally between each active thread. This will be an unfair division of work, as detailed in Section 3.2.2. Therefore, 82
83 in order to address this imbalance, the function tracks the number of force calculations it has conducted and reports this back to the main thread using its own trans data structure. If the balance is unfair the next time the function executes, the offset values in its trans data structure will then be used to alter the start and end of the work allocation. This is achieved by calculating the block allocation and subsequently applying the offsets in such a way so as to increase or decrease the quantity of work that would be carried out. Void * findgravitationalinteractions(void * sendin) { trans * send = (trans * )sendin; double G = send->g; double * xloc = send->xloc; double * yloc = send->yloc; double * zloc = send->zloc; int speid = send->speid; double * xforce = send->xforce; double * yforce = send->yforce; double * zforce = send->zforce; double * mass = send->mass; int start = speid *((num_particles - ( num_particles %NUM_OF_THREADS)) / NUM_OF_THREADS); int end = start + ((num_particles - (num_particles % NUM_OF_THREADS))/NUM_OF_THREADS); int remain = num_particles % NUM_OF_THREADS; } int count = 0; if(remain > 0) { if(speid < remain) { start = start + speid; end = end + (speid + 1); } else { start = start + remain; end = end + remain; } start = start + send->soff; end = end + send->eoff; 83
84 for(int j = start;j < end;j++) { for(int i = j + 1;i < num_particles;i++) { } } send->work = count; }... //Calculate forces... Figure 6.4: Calculate gravitation interactions function SMP As each thread completes as indicated by the completion of p-thread join on each thread data structure the total amount of work is accumulated. Using this information, the main function balances the work by altering the offsets of the neighbouring thread block; this is done by reducing the thread that is over-worked and then shifting the load onto its neighbour. The neighbours will, in turn, shift their own work, which will continue until the load is balanced. There is little overhead in the balance procedure, and the operation is not restricted at runtime; this is a requirement as the number of cores can change as the profiling algorithm adapts. Changing the core count and resetting the offsets will result in the load-balancing algorithm determining a more equal load balance for the new core count. In order to complete a single step in the models, evolution forces need to be calculated four times before the integration method advances the model. However, the load-balancing step is only initialised after the first calculation in order to minimise overhead. 84
85 6.4 CELL Implementation As identified in the literature survey, the cell implementation should be preceded by the SMP version of the algorithm, and the SMP version should compile without modification on the PPE, which is an SMP based on Power Architecture. Indeed, no alteration of the SMP code outlined in Section 6.3 was required. Therefore, the majority of the work was found in porting the SMP version of the application to the SPE. Currently, the SMP code starts a thread which executes the function using a structure containing the values needed to identify and process the elements where each thread is allocated. As identified in Chapter Two, the CBE has two computational elements: the PPE and the SPE. The contents of the main thread will therefore execute on the PPE and the computationally intensive function calculategravitationalinteractions will be executed on the SPE. One of the first tasks was to create force accumulators for each of the SPEs. Unlike the SMP, the SPEs are non-cache coherent with system RAM, and so each SPE will require individual force accumulators; these will need to be combined in the main thread on the PPE when each SPE has finished. These and all the other data items forming the model will need to be DMA from RAM to the individual SPE LS. Data items that will be moved using the DMA controller will need to be aligned to 16 byte borders; this is achieved using the attribute ((aligned(16))) compiler directive and the posix_memalign function shown in Figure 6.5. double * spexforce[6], * speyforce[6], *spezforce[6]; for(int b = 0;b < 6;b++) { ret = posix_memalign(&spexforce[b],16,msize ); ret = posix_memalign(&speyforce[b],16,msize ); ret = posix_memalign(&spezforce[b],16,msize ); for(int i = 0;i < num_particles;i++) { spexforce[b][i] = 0; 85
86 speyforce[b][i] = 0; spezforce[b][i] = 0; } } Figure 6.5: Force accumulator for each SPE aligned to 16 byte borders As in the SMP implementation, the trans data structure will contain all the information each SPE will require in order to complete the computation. In Figure 6.6, the trans structure for each SPE has a pointer to the shared current state of the model and individual force accumulators.... trans send[num_spe] attribute ((aligned(16)));... for(int i = 0; i < NUM_SPE;i++) { send[i].g = G; send[i].xloc = (unsigned long)xloc; send[i].yloc = (unsigned long)yloc; send[i].zloc = (unsigned long)zloc; send[i].xforce = (unsigned long)spexforce[i]; send[i].yforce = (unsigned long)speyforce[i]; send[i].zforce = (unsigned long)spezforce[i]; send[i].mass = (unsigned long)mass; send[i].id = i; send[i].numparticles = num_particles; } Figure 6.6: Trans data structure initialisation As the SPE code is compiled into a different machine code to that of the PPE, the elf files are created separately. The code for the SPE needs to be loaded from the disk to the RAM so that the SPE can be instructed to DMA the application code into LS then execute the code. Figure 6.7 shows the SPE elf being loaded into RAM. spe_program_handle_t *prog;... prog = spe_image_open( nbody_spe.elf ); If (!prog) 86
87 { perror("spe_image_open"); exit(1); } Figure 6.7: Loading SPE image file After the image file is loaded into RAM, the PPE code can then create a context for each of the SPEs that it wishes to use. We create the context structures using the spe_context_create function; context structures are used in order to access and manipulate the individual SPEs. One of the first tasks is to instruct each SPE to download the application code from RAM to the SPE LS using the spe_program_load function. In order to activate each of the contexts simultaneously, we still need to use the P-thread library, as detailed in the SMP implementation. We need to pass the context structure and the trans structure into the thread using a single pointer. In order to achieve this, we combine them using the arg structure, as shown in Figure 6.8. typedef struct { spe_context_ptr_t spe; trans * send; } thread_arg_t; void *run_abs_spe(void *thread_arg) { int ret; thread_arg_t *arg = (thread_arg_t *)thread_arg; unsigned int entry; spe_stop_info_t stop_info; entry = SPE_DEFAULT_ENTRY; ret = spe_context_run(arg->spe, &entry, 0, arg->send, NULL, &stop_info); if (ret < 0) { perror("spe_context_run"); return NULL; } return NULL; }... thread_arg_t arg[NUM_SPE]; for (int i = 0; i < NUM_SPE; i++) { spe[i] = spe_context_create(0, NULL); if (!spe[i]) { 87
88 } perror( spe_context_create ); exit(1); } ret = spe_program_load(spe[i], prog); if (ret) { perror( spe_program_load ); exit(1); } //package arg to be set to thread spe and send is required arg[i].spe = spe[i]; arg[i].send = &send[i]; Figure 6.8: Passing arg structure to SPE Once the program is loaded onto each SPE, we can then call pthread_create which will, in turn, run an instance of run_abs_spe. This will control the start-up of the SPE, passing in the required SPE context and trans control structure, using the spe_context_run_function as shown in Figure 6.9. Each thread calling context run will stall until the SPE application terminates, at which time the thread will exit. In the main thread, the application is calling pthread_join; this will stall the main thread until all SPE have finished processing. ret = pthread_create(&thread[b], NULL, run_abs_spe, &arg[b]);... void *run_abs_spe(void *thread_arg) { int ret; thread_arg_t *arg = (thread_arg_t *)thread_arg; unsigned int entry; spe_stop_info_t stop_info; entry = SPE_DEFAULT_ENTRY; ret = spe_context_run(arg->spe, &entry, 0, arg->send,null, &stop_info); if (ret < 0) { perror( spe_context_run ); return NULL; } return NULL; }... Figure 6.9: SPE context creation 88
89 Each SPE needs to transfer the data it requires from the main memory. We need to create locations to store the values in the LS, and then use the DMA functions to transfer the memory locations from RAM into the locations in the LS. In Figure 6.10, we see that the trans data structure is transferred first; this is a requirement, as it contains the addresses in RAM of the other data items which are required for transfer. The only information we have when the SPE is initialised is the address in memory of the trans structure, allocated to the individual SPE. Transferring the contents of the trans data structure to the SPE will provide the address to all the model state arrays. Once we have those address and DMA, and the contents of the arrays to the LS, we can then start the computation. We do not need to transfer the force accumulators from RAM, but they do need to be reset locally on the SPE and DMA back to RAM once the stage has been calculated. double xloc[max_buffsize] attribute ((aligned(16))); double yloc[max_buffsize] attribute ((aligned(16))); double zloc[max_buffsize] attribute ((aligned(16))); double xforce[max_buffsize] attribute ((aligned(16))); double yforce[max_buffsize] attribute ((aligned(16))); double zforce[max_buffsize] attribute ((aligned(16))); double mass[max_buffsize] attribute ((aligned(16)));... trans send attribute ((aligned(16)));... spu_mfcdma64(&send, mfc_ea2h(argp), mfc_ea2l(argp),sizeof(trans), tag, MFC_GET_CMD); spu_writech(mfc_wrtagmask, 1 << tag); spu_mfcstat(mfc_tag_update_all); for(int i = 0; i <send.numparticles;i++){ xforce[i] = 0; yforce[i] = 0; zforce[i] = 0; } spu_mfcdma64(xloc, mfc_ea2h(send.xloc), mfc_ea2l(send.xloc),send.numparticles * sizeof(double), tag, MFC_GET_CMD); spu_writech(mfc_wrtagmask, 1 << tag); spu_mfcstat(mfc_tag_update_all); spu_mfcdma64(yloc, mfc_ea2h(send.yloc), mfc_ea2l(send.yloc),send.numparticles * sizeof(double), tag, 89
90 MFC_GET_CMD); spu_writech(mfc_wrtagmask, 1 << tag); spu_mfcstat(mfc_tag_update_all); spu_mfcdma64(zloc, mfc_ea2h(send.zloc), mfc_ea2l(send.zloc),send.numparticles * sizeof(double), tag, MFC_GET_CMD); spu_writech(mfc_wrtagmask, 1 << tag); spu_mfcstat(mfc_tag_update_all); spu_mfcdma64(mass, mfc_ea2h(send.mass), mfc_ea2l(send.mass),send.numparticles * sizeof(double), tag, MFC_GET_CMD); spu_writech(mfc_wrtagmask, 1 << tag); spu_mfcstat(mfc_tag_update_all);... //Calculate work offset... //Perform Computation... spu_mfcdma64(&send, mfc_ea2h(argp), mfc_ea2l(argp),sizeof(trans), tag, MFC_PUT_CMD); spu_writech(mfc_wrtagmask, 1 << tag); spu_mfcstat(mfc_tag_update_all); spu_mfcdma64(xforce, mfc_ea2h(send.xforce), mfc_ea2l(send.xforce),send.numparticles * sizeof(double), tag, MFC_PUT_CMD); spu_writech(mfc_wrtagmask, 1 << tag); spu_mfcstat(mfc_tag_update_all); spu_mfcdma64(yforce, mfc_ea2h(send.yforce), mfc_ea2l(send.yforce),send.numparticles * sizeof(double), tag, MFC_PUT_CMD); spu_writech(mfc_wrtagmask, 1 << tag); spu_mfcstat(mfc_tag_update_all); spu_mfcdma64(zforce, mfc_ea2h(send.zforce), mfc_ea2l(send.zforce),send.numparticles * sizeof(double), tag, MFC_PUT_CMD); spu_writech(mfc_wrtagmask, 1 << tag); spu_mfcstat(mfc_tag_update_all); Figure 6.10: SPE DMA transfer operation When the computation has completed, the trans data structure needs to be sent back to RAM from the SPE as it contains the total work completed by the SPE. The main thread on the PPE 90
will continue once all pthread_join function calls have completed, and the application will perform the same load-balancing step that was outlined in the previous section for the SMP.

6.5 GPU Implementation

Unlike the SMP and CELL processors, Nvidia does not provide discrete access to the individual processors on the GPU; instead, they provide a new scalable programming model, CUDA, as detailed in Chapter Two. CUDA provides a block and thread abstraction over the underlying architecture; this approach abstracts the individual processors to promote scalability and fit the SIMT programming model required to maximise the capabilities of the GPU. The first step in redeveloping the application for SIMT is to decompose the computational function in a way that creates independent thread blocks comprising independent threads which can be run in parallel. In order to achieve this, we can modify the application so that each call to the function calculates the individual force acting on a single particle. We will then use the block variable to index the local particle and the thread variable to control the distant particle. We specify the block and thread constructs between the <<< Block, Thread >>> compiler directives located between the function name and argument list, as shown in Figure 6.11. Each block is limited in the number of threads that it can contain. Dividing the quantity of particles by four subsequently brings the quantity of threads within the capabilities of the card. The block variable is reduced by 1 because the Sun acts as a collapsor. Before calling the CUDA function, we first need to transfer the current model state from the SMP RAM to the GPU RAM. This is achieved using the cudaMemcpy function, which takes the destination and source locations for the arrays, their size, and the direction of transfer.

int block = num_particles - 1;

cudaMemcpy(d_xloc, xloc, msize, cudaMemcpyHostToDevice);
cudaMemcpy(d_yloc, yloc, msize, cudaMemcpyHostToDevice);
cudaMemcpy(d_zloc, zloc, msize, cudaMemcpyHostToDevice);
cudaMemcpy(d_xforce, xforce, msize, cudaMemcpyHostToDevice);
cudaMemcpy(d_yforce, yforce, msize, cudaMemcpyHostToDevice);
cudaMemcpy(d_zforce, zforce, msize, cudaMemcpyHostToDevice);
cudaMemcpy(d_mass, mass, msize, cudaMemcpyHostToDevice);
checkcudaerror("memcpy");

findgravitationalinteractions<<<block,(num_particles/4)>>>(d_xloc, d_yloc,
    d_zloc, d_xforce, d_yforce, d_zforce, g, d_mass);

cudaMemcpy(xforce, d_xforce, msize, cudaMemcpyDeviceToHost);
cudaMemcpy(yforce, d_yforce, msize, cudaMemcpyDeviceToHost);
cudaMemcpy(zforce, d_zforce, msize, cudaMemcpyDeviceToHost);
checkcudaerror("kernel invocation");
checkcudaerror("memcpy");

Figure 6.11: CUDA call find gravitational interactions

Each of the active threads within the thread block will need to write to the local force accumulator. As there is no locking mechanism for the array, we need to use a spinlock (Figure 6.12). The spinlock only allows one thread at a time to access the shared force accumulator. Essentially, the spinlock counts each of the threads in the thread block through a holding loop. Once through the holding loop, the thread has sole access to the force accumulator.

__global__ void findgravitationalinteractions(double *xloc, double *yloc,
        double *zloc, double *oxforce, double *oyforce, double *ozforce,
        double G, double *mass)
{
    int i = blockIdx.x + 1;
    int j = threadIdx.x;
    j = j * 4;
    int upper = j + 4;
    if (upper == num_particles)
        upper = num_particles % 4;
    while (j < upper) {
        if (i != j) {
            // calculate force
            int templock = lock + 1;
            while (j > templock) {
                res1 = 1;
            }
            if (j == lock) {
                oxforce[i] += xforce;
                oyforce[i] += yforce;
                ozforce[i] += zforce;
                lock++;
            }
            j++;
        }

Figure 6.12: CUDA find gravitational interactions

The Nvidia GPU does provide locking mechanisms for floating-point values; however, Nmod utilises double-precision operations and, at the time of writing, there are no locking mechanisms for that particular data type. Notably, Nvidia have indicated that their support for double precision will improve with each new generation of hardware.

6.6 Load-balancing and Profiling Integration

In the design phase, we outlined the operation of the profiling algorithm. Each of the target heterogeneous architectural combinations requires both sets of computational threading calls to be combined into a single application. In the case of the CELL processor, the SMP threaded calls to the findgravitationalinteractions function need to be integrated within the CELL project containing the SPE context communication code. Moreover, the data structures which are passed to the SPEs and to the SMP function are identical, which ultimately aids the integration of the SMP and SPE calls. Integration for the Intel Nehalem and GPU ultimately involves combining the SMP code with the Nvidia CUDA application code. CUDA does not make use of the thread identifier data structures or the load-balancing algorithm, owing to the CUDA processor abstractions.

The first integration and implementation of the profiling algorithm was done on the Nehalem/GPU combination; this was primarily owing to the limited time that was available on the University test hardware. There is a slight difference in the implementations of the profiling algorithm on the CELL and the GPU relating to the collection of runtime profiling results. In the case of the GPU, we collect the results of all tests conducted on the hardware and then, following the last test, search the linked list for the best. On the CELL, by contrast, we found it to be more efficient to simply store the details of the current best processor and load combination. Although if we were to extend the application in the future, it is suggested that storing a fraction of the results
would be more viable, so as to enable a quick test if the current plan becomes suboptimal; there will be more detail relating to this in the future work section of Chapter Eight.

The first stage in the algorithm is to collect system information and construct a test plan. Each test plan entry will identify the number of cores or SPEs that should be tested, as well as the number of load-balancing steps that should be conducted in each test. Importantly, the number of load-balancing steps is a fraction of the total number of particles in the system (Figure 6.13). There will also be a single test plan entry for the GPU, with no requirement for the load-balancing action.
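The addtest helper that builds this list is not reproduced in the text; a minimal sketch, assuming the testplan structure and the walk-to-tail append pattern used by the addcpu function in Figure 6.13 below, might look like the following.

/* Hypothetical sketch: append one test to the plan, mirroring addcpu.
   Assumes the list head has been allocated with up == NULL. */
testplan *addtest(testplan *list, int gpu, int corenum, int testcount)
{
    while (list->up != NULL)                   /* walk to the empty tail     */
        list = list->up;
    list->gpu = gpu;                           /* fill the tail with a test  */
    list->corenum = corenum;
    list->testcount = testcount;
    list->up = (testplan *)malloc(sizeof(testplan));
    list->up->up = NULL;                       /* leave a fresh empty tail   */
    return list;                               /* the entry just added       */
}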
struct testplan_el {
    int gpu;
    int corenum;
    int testcount;
    struct testplan_el *up;
};
typedef struct testplan_el testplan;

void addcpu(cpu *cpulist, int proc, unsigned long long time, int type,
            loadbal *temploadbal)
{
    while (cpulist->up != NULL) {
        cpulist = cpulist->up;
    }
    cpulist->up = (cpu *)malloc(sizeof(cpu));
    cpulist->proc = proc;
    cpulist->time = time;
    cpulist->type = type;
    if (temploadbal != NULL) {
        cpulist->loadbalist = temploadbal;
    } else {
        cpulist->loadbalist = NULL;
    }
    cpulist = cpulist->up;
    cpulist->up = NULL;
}

for (int i = 1; i <= MAX_NUM_OF_THREADS; i++) {
    testnum++;
    lasttest = addtest(testplanlist, 0, i, num_particles/8);
}
testnum++;
lasttest = addtest(testplanlist, 1, 0, 2);

Figure 6.13: Load balance and profile GPU/NVIDIA

Once we have constructed the test plan, we then need to iterate through the list, testing each configuration and allowing each configuration to balance. In Figure 6.14, we can see that there is a check on the learning status of the algorithm. If the learning status is true, we are processing elements from the test plan list. At the start of each new test, we reset the load-balancing offsets for each of the cores. We then create a loadbal linked list for each balance profile, containing the start and end offsets for each of the threads. For each balance operation,
we store the time taken for each of the offset combinations, as the test automatically balances using the work-stealing mechanism outlined in Chapter Five.

if (learning == 0) {
    int count = 0;
    currenttestplan = getcurtestplan(testplanlist);
    if (currenttestplan->gpu == 0) {
        THREADS = currenttestplan->corenum;
        for (int i = 0; i < THREADS; i++) {
            send[i].num_of_threads = THREADS;
            if (currenttestplan->testcount == num_particles/8) {
                send[i].soff = 0;
                send[i].eoff = 0;
            }
        }
    }
}
t1 = rdtsc();
loadbal *temploadlist = (loadbal *)malloc(sizeof(loadbal));
temploadlist->up = NULL;
temploadlist->down = NULL;
if (learning == 0) {
    if (currenttestplan->gpu == 0) {
        for (int t = 0; t < THREADS; t++) {
            // Create threads
        }
        totalwork = 0;
        for (int t = 0; t < THREADS; t++) {
            // Join threads
            if (learning == 0) {
                addload(temploadlist, send[t].eoff, send[t].soff,
                        send[t].work, t+1);
            }
        }
        if (learning == 0) {
            for (int i = 0; i < (THREADS - 1); i++) {
                if (send[i].work > ((totalwork/THREADS) +
                                    ((totalwork/THREADS) * 0.08))) {
                    send[i].eoff--;
                    send[i+1].soff--;
                }
            }
            t2 = rdtsc();
            addcpu(cpulist, THREADS, t2-t1, 0, temploadlist);
        }
    } else {
        if (learning == 0) {
            t1 = rdtsc();
        }
        // Copy from host to GPU
        findgravitationalinteractions<<<block,(num_particles/4)>>>(d_xloc,
            d_yloc, d_zloc, d_xforce, d_yforce, d_zforce, g, d_mass);
        // Copy from GPU to host
        if (learning == 0) {
            t2 = rdtsc();
            addcpu(cpulist, 9, t2-t1, 1, NULL);
        }
    }
} else {
    if (finaltype == 0) {
        // Create threads
        totalwork = 0;
        // Join threads
    } else {
        // Copy from host to GPU
        ...
        findgravitationalinteractions<<<block,(num_particles/4)>>>(
            d_xloc, d_yloc, d_zloc, d_xforce, d_yforce, d_zforce, g, d_mass);
        ...
        // Copy from GPU to host
    }
}

Figure 6.14: Load-balancing implementation

In order to obtain accurate timing, we use a call to the rdtsc function. This function contains inline assembly to obtain CPU ticks from the processor, as shown in Figure 6.16. If we are running a test against a GPU, we can store the timing information without the load-balance list by using the addcpu function and replacing the list reference with NULL, e.g. addcpu(cpulist, 9, t2-t1, 1, NULL).
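The wrapper itself is not listed in the text; for the Intel platform, a minimal sketch built around the inline assembly of Figure 6.16, combining the two 32-bit halves returned by the instruction, might look like the following.

/* Hypothetical sketch: read the x86 time stamp counter (ticks since reset). */
static inline unsigned long long rdtsc(void)
{
    unsigned int a, d;
    asm volatile("rdtsc" : "=a" (a), "=d" (d));
    return ((unsigned long long)d << 32) | a;   /* high:low as one 64-bit value */
}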
if (learning == 0) {
    if (currenttestplan->testcount == 0) {
        testnumcount++;
        if (testnumcount == numtest + 1) {
            cpu *tempagaincpu = bestcpu(cpulist);
            learning = 1;
            if (tempagaincpu->type == 1) {
                finaltype = 1;
            } else {
                finaltype = 0;
                THREADS = tempagaincpu->proc;
                finalbalance(tempagaincpu, send);
            }
        }
    }
}

Figure 6.15: Finalising test

When we have processed all of the elements in the test plan list, we use the bestcpu function to obtain the fastest profile. We can then determine whether the GPU or the SMP is the fastest. If the SMP is preferred, the finalbalance function provides the required offsets for the preferred core combination. Once learning is disabled, the application executes without the overhead of the load-balancing or profiling code sections. The implementation of the algorithm on the CELL was very similar, although both SMP and SPE required load-balancing in each test configuration. Moreover, as opposed to storing all combinations of core and balance, we instead opted to store the current best. This reduces the quantity of memory required by the algorithm, as the PS3 only has 256 MB of RAM.
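The bestcpu search is not reproduced in the text; a minimal sketch, assuming the cpu list layout built by addcpu in Figure 6.13 (filled nodes followed by a single empty tail node) and at least one recorded test, might look like the following.

/* Hypothetical sketch: return the recorded test with the smallest time. */
cpu *bestcpu(cpu *cpulist)
{
    cpu *best = cpulist;
    for (cpu *cur = cpulist; cur->up != NULL; cur = cur->up) {
        if (cur->time < best->time)
            best = cur;               /* keep the fastest configuration      */
    }
    return best;
}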
PPE command:   asm volatile("mftb %[time2]" : [time2] "=r" (time2) : );
Intel command: asm volatile("rdtsc" : "=a" (a), "=d" (d));

Figure 6.16: Assembly commands for hardware timers

Profiling individual calls to calculate gravitational interactions requires access to high-resolution timers. Usually, this can be accomplished with the use of the PAPI library. Unfortunately, however, neither of the target systems was compatible with the current version, and using the standard C libraries would not provide the required resolution. Both the SMP and the PPE provide access to the internal processor clock through assembly language commands, as shown in Figure 6.16.

6.7 Chapter Summary

This chapter has detailed the successful implementation of the nmod application on the CELL and Nehalem/GPU heterogeneous architectures. Notably, the development of the SMP version of the application on both the Nehalem and the PPE was relatively straightforward. However, the adaptation of the PPE application to utilise the SPEs was complex, and ultimately took a disproportionate amount of the total development time. Development complexity on the CELL has been well reported, and could be one of the reasons behind IBM's decision to move away from the architecture. Accordingly, utilising the GPU was the most straightforward implementation, with the only main issue being the lack of locking support for double-precision operations. This was overcome with the use of a spinlock.
Chapter 7 Testing and Evaluation

7.1 Chapter Overview

This chapter will detail the individual tests conducted at each stage of the implementation, as detailed in Chapter Six. The speed-up of the application during each stage of the implementation will be analysed, as well as the capability of the profiling algorithm to select the appropriate hardware combination for the problem size.

7.2 Performance Testing

In this section, we will analyse the performance of the modified Nmod application using a range of body counts. We will analyse the results of the CELL implementation, followed by the Nehalem/GPU combination. For each architecture, we will highlight the results of the single-threaded performance and the results of the profiling algorithm, which selects the fastest configuration available on the heterogeneous architecture.

7.3 Cell Processor Testing

In Table 7.1, we are able to consider the test results for the single- and dual-threaded Nmod application. The results show that, for body counts of 10 or fewer, the overhead of thread creation results in a slower execution time on two PPE threads than on the single-threaded version. However, when
there are more than 10 bodies, we start to obtain speed-up, as shown in Figure 7.1.

(Plot: 2 PPE thread speed-up vs. body count.)
Figure 7.1: CBE 2 PPE thread speed-up

Speed-up on the PPE is just above 1.4 for body counts above 60. This value will be impacted by the hypervisor running on the PPE. It is difficult to quantify its effects, as there are no mechanisms to measure its overhead. It is important to remember that the two PPE hardware threads are implemented within the same core.

Body Count   No Thread, Single PPE   P-threads Version, 2 PPE Cores
10           0m3.919s                0m5.745s
20           0m12.543s               0m11.304s
30           0m25.744s               0m20.399s
40           0m43.516s               0m32.868s
50           1m6.014s                0m48.016s
60           1m32.817s               1m7.470s
70           2m4.621s                1m28.534s
80           2m40.618s               1m52.760s
90           3m21.923s               2m20.231s
100          4m6.584s                2m52.290s
150          9m1.728s                6m28.314s
             m51.960s                10m57.586s
             m24.489s                23m42.368s

Table 7.1: Single and dual thread performance
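As a worked check against Table 7.1, taking the 100-body row, the dual-thread speed-up is the single-threaded time divided by the two-thread time: 246.584 s / 172.290 s ≈ 1.43, consistent with the value of just above 1.4 quoted above.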
(Plot: 6 SPE speed-up vs. body count.)
Figure 7.2: 6 SPE speed-up

Initial tests are promising, despite the low speed-up values, as they show that the algorithm does function and that the division of work does result in speed-up. After the PPE tests were completed, the SPE version of the application was manually tested.

Body Count   1 SPE Enabled   2 SPE Enabled   3 SPE Enabled   4 SPE Enabled   5 SPE Enabled   6 SPE Enabled
10           0m7.375s        0m7.195s        0m10.943s       0m12.175s       0m15.100s       0m18.150s
20           0m17.680s       0m17.240s       0m17.691s       0m16.895s       0m17.770s       0m20.497s
30           0m33.538s       0m31.097s       0m27.985s       0m25.021s       0m21.633s       0m23.180s
40           0m55.171s       0m49.290s       0m43.228s       0m37.048s       0m30.615s       0m26.457s
50           1m22.507s       1m13.314s       1m1.836s        0m49.964s       0m38.446s       0m32.706s
60           1m55.380s       1m36.857s       1m21.442s       1m5.053s        0m47.707s       0m39.958s
70           2m34.893s       2m9.264s        1m47.145s       1m22.028s       0m57.631s       0m50.261s
80           3m18.785s       2m45.077s       2m12.924s       1m40.813s       1m8.020s        1m0.293s
90           4m10.373s       3m26.871s       2m45.861s       2m6.758s        1m26.089s       1m12.030s
100          5m5.908s        4m11.521s       3m21.242s       2m29.335s       1m37.942s       1m23.275s

Table 7.2: SPE timing data

The SPE application code was tested by manually setting the quantity of SPEs required and the N-body count to the values shown in Table 7.2. One of the first interesting points to note is that, for the 10-body problem, the quickest count of two SPEs is still almost 1.5 seconds slower than the single PPE core. This illustrates one of the main issues relating to heterogeneous
architectures: that there is a cost incurred in communication and, if insufficient work is divided, the cost of division dominates performance. When we look at the 20-body problem, this has enough computation to overcome the cost of calling p-threads to use the two hardware threads of the PPE, but not enough to overcome the cost of communication with the SPEs. When the body count is increased to 30, the cost of communication can be accommodated within the increase in performance gained by using five or six SPEs. Using five SPEs is ultimately faster than using six; this is due to the increased cost per SPE initialisation, with the cost of the sixth initialisation resulting in a suboptimal allocation. Particle counts above 60 run fastest on six SPEs, and accordingly result in good speed-up figures, as can be seen in Figure 7.2. With 100 particles, we see speed-up reaching 3; this is still not scaling with the number of cores, but nevertheless shows good performance gains compared with the original single-threaded version. After the manual testing was conducted, the automated profiling algorithm was used to determine the optimal configuration at runtime. The results are shown in Table 7.3.

Auto Profiling Algorithm
Body Count   Best Number of SPEs
10           2 SPE
20           4 SPE
30           5 SPE
40           6 SPE
50           6 SPE
60           6 SPE
70           6 SPE
80           6 SPE
90           6 SPE
             SPE
             SPE
             SPE
             SPE

Table 7.3: CELL Automated profiling algorithm timing
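As a worked check, the 100-particle run on six SPEs in Table 7.2 takes 1m23.275s (83.275 s) against 4m6.584s (246.584 s) for the original non-threaded version in Table 7.1, giving 246.584 / 83.275 ≈ 2.96, the speed-up of roughly 3 referred to above.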
What is interesting is that, from the data shown in Table 7.3, the algorithm timed the instantiation of the 10-body problem as fastest on two SPEs; this goes against the individual timing results detailed in Table 7.2. This could be caused by the limited memory resources, the capability of the PPE, and the performance of the hypervisor. The tests of the Nehalem/GPU combination in the next section show that the algorithm performs well on that platform, without the same hypervisor and memory concerns. The algorithm does, however, correctly profile the SPE resource, identifying the same optimal configuration as the manual test.

7.4 Nehalem GPU Testing

The timing results for the Nehalem/GPU combination are shown in Table 7.4. This shows the single-threaded performance on a single Nehalem core, and the load-balancing and profiling algorithm on the 8 Nehalem cores and on the Nehalem/GPU combination. Comparing the Cell and the Nehalem directly on their performance is unfair, as the Cell is an older technology and is hindered by the hypervisor of the PlayStation 3 test platform. We will, however, provide a more general comparison in the conclusion section. In order to stretch the computational ability of the GPU, we have increased the range of bodies to 1,500. The third and fourth columns in Table 7.4 show the Nehalem profiling algorithm with the GPU disabled in the test plan. The results show us that two cores are preferred for particle counts of less than 70; the core count then increases with the number of bodies, with the maximum of 8 cores used for the 1,500-body problem. It is no surprise that the profiling algorithm finds that an increasing body count results in an increased core allocation. Given a greater number of bodies, the Nehalem speed-up increases, reaching 20 times for the 100-body problem.
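As a worked illustration, a speed-up of 20 on the 100-body single-threaded time of 3m49.638s (229.638 s) corresponds to a profiled multi-core run of roughly 229.638 / 20 ≈ 11.5 s.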
N-body Count   Single Thread   8 Nehalem Cores Auto Profiling Algorithm   Nvidia GPU and 8 Nehalem Cores Auto Profiling Algorithm
               Time            Time    Best Core Target                   Time    Best Processor/Core Target
10             0m3.743s        s                                          s       2
20             0m11.809s       s                                          s       2
30             0m23.971s       s                                          s       2
40             0m40.320s       s                                          s       2
50             1m0.939s        s                                          s       2
60             1m25.460s       s                                          s       GPU
70             1m54.531s       s                                          s       GPU
80             2m27.401s       s                                          s       GPU
90             3m5.499s        s                                          s       GPU
100            3m49.638s       s                                          s       GPU
150            8m21.844s       s                                          s       GPU
               m36.327s        s                                          s       GPU
               m7.590s         s                                          s       GPU
               m52.289s        s                                          s       GPU
               m34.088s        s                                          s       GPU

Table 7.4: GPU/Nehalem timing data

Speed-up for the 1,500-body problem is 40 times that of the single-threaded case. This super-linear speed-up can be attributed to the effects of the N-body data structures being maintained within the cache hierarchy of the Nehalem processor.
(Plot: speed-up vs. body count, Nehalem only and Nehalem + GPU.)
Figure 7.3: Speed-up on Nehalem and GPU

The last two columns of Table 7.4 show the results of the profiling algorithm on the Nehalem with the GPU enabled in the test plan. It shows that two Nehalem cores are still preferred, as they are in the Nehalem-only profile; however, for body counts of 60 and above, the GPU is the better target. As we found on the CELL processor, the quantity of computation for low body counts fails to cover the cost associated with the DMA transfers between system RAM and the GPU's onboard RAM. Once there is sufficient work for the GPU, the highly parallel nature of the architecture provides considerable performance gains over the Nehalem processor. Speed-up for the 300-body problem is 80x, rising to 175x for the 1,500-body problem.
7.5 Chapter Summary

This chapter has detailed and evaluated the application of a load-balancing and profiling algorithm on the Nehalem/GPU and CELL heterogeneous platforms. Both applications show speed-up compared with their single-threaded counterparts, except on the CELL, where the 10-body problem provides insufficient computation to counter the thread creation overhead. The optimum configuration was selected by the profiling algorithm except on the SPE/PPE combination; however, the SPEs were profiled correctly. The Nehalem/GPU combination showed excellent speed-up and, where the computation was sufficient, the GPU provided a speed-up of 175x; this is attributed to the highly parallel nature of the architecture.
Chapter 8 Conclusions and Future Work

Having implemented the proposed algorithm and demonstrated that it targets the best architecture combination for the given workload, we have shown that this project is sound not only in design but also in application. The aims set at the outset of the project have been met and, additionally, interesting further research challenges have been identified during the course of the project. These will be detailed next, followed by the conclusion.

8.1 Future Work

Whilst undertaking the research and implementation for this project, a number of improvements and suggestions for future work have been identified. This section details these improvements, outlining how they would add value to the load-balancing and profiling algorithms.

8.1.1 Stored Knowledge

In order to further reduce training overhead, the algorithm could save the optimum load balance and core configuration to persistent storage along with machine state data. This could then be used to set the initial load balance configuration in future invocations of the application. Machine state would need to be evaluated, as any change in state could have an impact on the load
balance configuration. Retrieving the optimum state could save the algorithm having to learn the optimum settings, and reduce the impact of the initial learning stage of the algorithm.

8.1.2 Web Services Hint

Processors have common characteristics that could be automatically shared between algorithms at runtime. If the problem size can be evaluated, then a web service could be used to share that profile knowledge. Future invocations on systems which share a hardware profile could use the data to reduce the number of tests that would need to be conducted. Moreover, cache configuration and other architectural nuances could be communicated to the algorithm to improve performance.

8.1.3 Operating System Enhancements

Operating systems provide little information to an application concerning how the threads that compose the running application are being executed. It could be that separate threads are being executed on separate cores and subsequently switched onto the same core due to contention for processor resources. If the operating system could communicate this information, the quantity of active threads and the workload could then be modified at runtime. This will become more important in cloud environments, as computation shifts between machines contained within virtual machines sitting on top of hypervisors. Communication of the state of the real underlying hardware will assist in the distribution of work in the abstracted virtualised system.

8.1.4 Generic Framework

Our algorithm is specific to the N-body problem. More generic, template-based objects which provide access to load-balancing and profiling would ultimately enable a greater number of applications to benefit from this approach. However, issues relating to generic high-resolution
timers could hamper this approach. Processor vendors should work on common approaches to the structure of, and access to, timing data.

8.1.5 Continuous Monitoring

Future generations of the load-balancing and profiling algorithm should monitor the state of the current optimum settings. If they fall outside of a predetermined range, the algorithm should fall back to the successor optimum. If this fails to perform within a set performance threshold, the algorithm should re-enter the learning phase in order to determine a new optimum configuration.

8.1.6 Predictive Initial Conditions

When the algorithm is testing the components of a system, it should identify trends in the ability of the system. As we found in the cases of the Cell and the Nehalem, performance will typically improve as we approach the optimum configuration, and then gradually decay as we move away from it. The algorithm should identify these patterns in order to reduce the number of tests which need to be carried out to determine the optimum.

8.2 Conclusion

Heterogeneous architectures provide the developer with options which, if wielded properly, can ultimately provide excellent performance gains. Each of the architectures studied in this project has its own nuances and technical challenges. Understanding the GPU and CELL architectures in detail is the key to getting the most out of them at the development stages of a project. Furthermore, gaining an understanding of the patterns commonly utilised also helped with the design and integration of the different architectural function calls.
The results show that the quantity of computation is the key to overcoming the costs associated with dividing computation between processing resources, whether they sit on the same processor or further away. On the heterogeneous elements, this was the cost of thread creation and management on the SMP, and of the DMA transfers to and from main memory. We have shown that our initial manual implementations provided speed-up over the original single-threaded application. Our runtime profiling and load-balancing algorithm was then implemented, highlighting the fact that our approach was effective at selecting the best hardware combination and load balance for the underlying heterogeneous hardware. In conclusion, the ability to realise the potential of the available heterogeneous resources cannot be achieved solely through the use of design-time profiling and static scheduling. Ultimately, in order to create a best fit between the computation and the available resources, runtime profiling and load-balancing is required, as aspects key to performance can only be known at runtime.
Appendix A: CELL SDK Installation

The test system was configured with the following development packages:

GCC Version      Free C Compiler
Cell SDK v       Development libraries and compiler driver extension for GCC
GNU Gprof        Profiling toolset

The yum package manager was used to install GCC and Gprof using the following commands:

yum install gcc
yum install gprof

Installation steps for the Cell SDK were as follows:

Make a directory for the ISO installation CD:
mkdir -p /tmp/cellsdkiso

Download the SDK into the /tmp/cellsdkiso directory.

Check the status of the YUM updater. If it is running, the process should be stopped so as not to interfere with the installer:
# /etc/init.d/yum-updatesd status

It can be stopped with the following command if running:
# /etc/init.d/yum-updatesd stop

Mount the SDK ISO and run the installer script:
/opt/cell/cellsdk --iso /tmp/cellsdkiso install