Dynamic Profiling and Load-balancing of N-body Computational Kernels on Heterogeneous Architectures


Dynamic Profiling and Load-balancing of N-body Computational Kernels on Heterogeneous Architectures

A THESIS SUBMITTED TO THE UNIVERSITY OF MANCHESTER FOR THE DEGREE OF MASTER OF SCIENCE IN THE FACULTY OF ENGINEERING AND PHYSICAL SCIENCES

2010

By Andrew Attwood
School of Computer Science

Contents

List of Tables
List of Figures
List of Equations
Abstract
Declaration
Copyright
Acknowledgements
List of Abbreviations
Chapter 1 Introduction
    Motivation and Context of the Study
    Project Aims and Methodology
    Project Phases
    Structure for the Dissertation
Chapter 2 Heterogeneous Architectures
    Chapter Overview
    Cell Broadband Engine
        Development of the CBE
    Nvidia GPU with Intel Nehalem
        Nvidia Architecture
        Nvidia Development
        Intel Nehalem Architecture
        Intel Nehalem Development
    Chapter Summary
Chapter 3 Scheduling, Load Balancing and Profiling
    Chapter Overview
    Computational Scheduling
        Assessment of the N-Body Problem
        Block and Cyclic Scheduling
        Dynamic Scheduling
        OS Thread Scheduling
        Profiling
    Chapter Summary
Chapter 4 Heterogeneous Development Methodologies and Patterns
    Chapter Overview
    Development Methodologies
        Software Development Phases
        Algorithm Design Considerations
        Patterns
        Implementation
        Testing
    Chapter Summary
Chapter 5 Algorithm Analysis and Design
    Chapter Overview
    Analysis
        Amdahl's analysis of Nmod
        Nmod Application Structure Analysis
    SMP Design
    Cell Design
    CUDA Design
    Load Balancing and Profiling Algorithm Design
    Chapter Summary
Chapter 6 Implementation
    Chapter Overview
    Platform Configuration
        CELL Development Environment
        GPU Development Environment
    SMP Implementation
    CELL Implementation
    GPU Implementation
    Load-balancing and Profiling Integration
    Chapter Summary
Chapter 7 Testing and Evaluation
    Chapter Overview
    Performance Testing
    Cell Processor Testing
    Nehalem GPU Testing
    Chapter Summary
Chapter 8 Conclusions and Future Work
    Future Work
        Stored Knowledge
        Web Services Hint
        Operating System Enhancements
        Generic Framework
        Continuous Monitoring
        Predictive Initial Conditions
    Conclusion
Bibliography
Appendix A: CELL SDK Installation

Word Count: 23,380

List of Tables

Table 2.1: LIBSPE2 functions 27
Table 2.2: SPE DMA functions 27
Table 2.3: Thread and fork creation overhead 28
Table 3.1: Offset pattern for 100 elements 42
Table 4.1: Parallel languages 53
Table 5.1: Gprof output non-threaded 10 particles 10 steps 61
Table 5.2: Gprof output non-threaded 60 particles 10 steps 62
Table 5.3: Nmod speed-up values 63
Table 7.1: Single and dual thread performance 101
Table 7.2: SPE timing data 102
Table 7.3: CELL automated profiling algorithm timing 103
Table 7.4: GPU/Nehalem timing data 105

List of Figures

Figure 2.1: CBE block diagram 23
Figure 2.2: Division of computation CBE 24
Figure 2.3: Stream processing CBE 25
Figure 2.4: CBE compile commands 26
Figure 2.5: Bus capacity 29
Figure 2.6: GPU CPU footprint comparison 31
Figure 2.7: Nvidia processor architecture 32
Figure 2.8: Two dual threaded SMP processors 34
Figure 2.9: Two SMP single thread processors 34
Figure 3.1: Computation level given index 39
Figure 3.2: 5-body example 39
Figure 3.3: Body index calculation ratio 40
Figure 3.4: Block cyclic schedule 41
Figure 3.5: Pseudo code for in-process offset calculation 41
Figure 3.6: Cyclic schedule 43
Figure 3.7: Equal area schedule 43
Figure 3.8: DAG approach 45
Figure 4.1: Waterfall model 50
Figure 4.2: Message passing pattern 54
Figure 4.3: The divide and conquer pattern 56
Figure 4.4: CELL development process 57
Figure 4.5: Structure of arrays and arrays of structures 58
Figure 4.6: Implementation plan 59
Figure 5.1: N-body test pattern loaded into MATLAB 64
Figure 5.2: Application code to generate random body distribution and mass 65
Figure 5.3: JSP diagram of the NMOD application 66
Figure 5.4: Original function declaration for findgravitationalinteractions 67
Figure 5.5: Thread communication structure 68
Figure 5.6: Pseudo code for thread creation and balancing on SMP 69
Figure 5.7: Nmod SMP cache coherent program design 69
Figure 5.8: CELL SPE program design 70
Figure 5.9: Posix memalign function prototype 71
Figure 5.10: Nvidia GPU design 72
Figure 5.11: Load-balancing and profiling diagram 75
Figure 6.1: Cell compilation commands 78
Figure 6.2: Compile command for threaded SMP 78
Figure 6.3: SMP multi-threaded implementation with load balance 81
Figure 6.4: Calculate gravitational interactions function SMP 84
Figure 6.5: Force accumulator for each SPE aligned to 16 byte borders 86
Figure 6.6: Trans data structure initialisation 86
Figure 6.7: Loading SPE image file 87
Figure 6.8: Passing arg structure to SPE 88
Figure 6.9: SPE context creation 88
Figure 6.10: SPE DMA transfer operation 90
Figure 6.11: CUDA call find gravitational interactions 92
Figure 6.12: CUDA find gravitational interactions 93
Figure 6.13: Load balance and profile GPU/NVIDIA 95

Figure 6.14: Load-balancing implementation 97
Figure 6.15: Finalising test 98
Figure 6.16: Assembly commands for hardware timers 99
Figure 7.1: CBE 2 PPE thread speed-up 101
Figure 7.2: 6 SPE speed-up 102
Figure 7.3: Speed-up on Nehalem and GPU 106

List of Equations

Equation 5.1: Amdahl's Law 62
Equation 5.2: NMOD speed-up value 63
Equation 5.3: Runge-Kutta 4th order integrator 66

Abstract

Increasingly, devices are moving away from a single-architecture, single-core design. From the fastest supercomputer to the smallest mobile phone, devices are now being constructed with heterogeneous architectures. The reason for this heterogeneity is, in part, the slowing of speed increases in single-chip, single-core devices, but equally the realisation that coupling specific devices to specific problems provides increased performance and power efficiency. Application development processes for multicore heterogeneous technologies are still in their infancy. Developing high-performance applications for the scientific community places the responsibility on the developer to maximise the use of the underlying architecture. Traditional approaches to the development process are inadequate to deal with the complexity of instantiating computation on heterogeneous architectures, and current load-balancing algorithms fail to provide the dynamism required to best fit computation to the available resource. This project seeks to design an algorithm to optimise the application of a computational problem to the available heterogeneous resource through runtime profiling and load-balancing. CELL and Nehalem/GPU heterogeneous architectures are targeted with an N-body simulator combined with the implementation of a profiling and load-balancing algorithm. This implementation is subsequently tested under different computational load conditions.

Declaration

No portion of the work referred to in this report has been submitted in support of an application for another degree or qualification of this or any other university or other institute of learning.

Copyright

i. The author of this dissertation (including any appendices and/or schedules to this dissertation) owns any copyright in it (the "Copyright") and s/he has given The University of Manchester the right to use such copyright for any administrative, promotional, educational and/or teaching purposes.

ii. Copies of this dissertation, either in full or in extracts, may be made only in accordance with the regulations of the John Rylands University Library of Manchester. Details of these regulations may be obtained from the Librarian. This page must form part of any such copies made.

iii. The ownership of any patents, designs, trademarks and any and all other intellectual property rights except for the Copyright (the "Intellectual Property Rights") and any reproductions of copyright works, for example graphs and tables ("Reproductions"), which may be described in this dissertation, may not be owned by the author and may be owned by third parties. Such Intellectual Property Rights and Reproductions cannot and must not be made available for use without the prior written permission of the owner(s) of the relevant Intellectual Property Rights and/or Reproductions.

iv. Further information on the conditions under which disclosure, publication and exploitation of this dissertation, the Copyright and any Intellectual Property Rights and/or Reproductions described in it may take place is available from the Head of Department of Computer Science.

Acknowledgements

I would like to take this opportunity to thank my supervisor, Dr John Brooke, for his support and guidance during the past 16 months. My thanks also go to Dr Carey Pridgeon for developing the nmod application, which provides the computational base for my algorithm, and for his kind responses to my questions. I would also like to thank Dr Jonathan Follows for providing the CELL and Nvidia CUDA training at STFC Daresbury. Finally, I would also like to thank my family for their help and support over the past two years.

List of Abbreviations

CBE    Cell broadband engine
SPE    Synergistic processing element
PPU    Power processing unit
SIMD   Single instruction multiple data
SPU    Synergistic processing unit
MFC    Memory flow controller
LS     Local store
DMA    Direct memory access controller
EIB    Element interconnect bus
SPUFS  Synergistic processing unit file system
SIMT   Single instruction multiple thread
ILP    Instruction level parallelism
SMT    Symmetric multi-threading
JSP    Jackson structured programming
HPC    High performance computing
GCC    GNU C compiler
AGP    Advanced graphics port
GPU    Graphics processing unit
PCIe   Peripheral component interconnect express
CPU    Central processing unit
RAM    Random access memory

Chapter 1 Introduction

Heterogeneous computing is the coupling of dissimilar processor architectures, either within a single processor or as separate processors interconnected by an in-system bus. Heterogeneous computing is viewed as the next evolutionary step in the development of the CPU. Over the past five years, we have witnessed great advances in the availability of processing cycles on devices other than the core central processing unit. However, our reliance on a single fast processor core has been challenged, as the speed increases of these devices have slowed [1]. In order to combat this issue, most manufacturers have developed multi-core variants of their standard processors. Over the past ten years, graphics card manufacturers have been increasing the capability of their cards, mainly to keep pace with the requirements of users, who now demand ever more realistic game graphics and physics. Recently, we have seen libraries released which enable users to make use of the GPU as a massively powerful compute resource. Notably, at the time of writing, it is common to see quad-core chips in standard desktop machines, combined with a powerful compute-capable GPU. We also see heterogeneous chips, such as the Cell Broadband Engine, used in the world's fastest computer [2]. This chip, in contrast to the CPU-GPU relationship, combines heterogeneous components into a single processor, as opposed to discrete elements connected by an external system bus.

Developing applications for single-threaded execution can be difficult enough, but when also facing the challenge of multiple cores and heterogeneous compute elements, the difficulty increases. Multi-threading libraries support the instantiation of virtual threads, and the underlying operating system schedules these virtual threads onto the available homogeneous hardware. Operating systems are unable to schedule threads across heterogeneous compute components, which leaves the control of execution on these elements to the application that wants to make use of them. Application developers need to decide how best to apply the available computation to the hardware elements in the target system. Load-balancing concerns the assignment of computational tasks to the available elements in the machine. Failing to distribute load properly and fairly over the available elements will result in less than optimal runtime performance [3]. In a high-performance compute environment, it is essential that idle time is reduced as much as possible so as to maximise the return on the investment made in the data centre infrastructure. During the lifetime of a computational task, the quantity of work assigned to an individual subsystem can change, and an imbalance can emerge at any point during the computation. Design-time profiling is a common step in the development process. As an application is being developed, profiling is conducted in order to determine the sections of code that would benefit from additional fine-tuning. These sections usually consume a disproportionate amount of the whole application runtime, and it is usually these sections that we strive to split between the available resources; the complexity arises in deciding how to divide them. It is only at runtime that we can determine the existence and capability of heterogeneous components and the cost of computation instantiation. This leads to the requirement to profile the application at runtime.

Planning for the development or redeployment of application code to a heterogeneous high-performance environment is a challenging activity. Conventional multi-threaded development is supported by a number of methodologies and design techniques which aid the development and transformation process. The ability to structure the development activity and to support it through a design procedure is an important aspect of this project.

1.1 Motivation and Context of the Study

This project is concerned with the development of a runtime profiling and load-balancing algorithm to support the deployment of computation to the available computing resource. With the ever-increasing complexity and heterogeneous mix of components in both desktops and servers, the ability to correctly assign tasks to resources will be critical when striving to realise the full potential of these machines. It should be noted, however, that this project is concerned with floating-point applications as opposed to integer applications. High-performance applications typically spend most of their time executing a small number of instructions over a large data set; this data set, especially in physical simulations, is iterated over many times as the simulation evolves. Integer applications typically use a large number of instructions and smaller data sets; office applications typically fall under this application type. Successful algorithm development will require different heterogeneous components to target. As outlined in the previous section, GPU components are becoming increasingly common. We believe that targeting a heterogeneous system which comprises a GPU and a host CPU is an essential architecture mix to include in this project. These devices are coupled using a PCI Express system bus. The second architecture type should not be constrained by the connectivity of a system bus; architectures of this type are referred to as single-chip heterogeneous platforms. In order to target a single-chip heterogeneous platform, there are few affordable options available. At the time of writing, the only realistic option is to make use of the Cell Broadband Engine as found in the PlayStation 3 games console.
It may seem unusual to use a games console in the development of a complex, high-performance application; nevertheless, as shown by Buttari et al., the PlayStation 3 is more than capable for scientific research [4].

1.2 Project Aims and Methodology

The aim of this project is to design an algorithm capable of determining the best pattern of instantiation on the available computational resource for a given computational problem. The algorithm will then be implemented on different heterogeneous platforms to validate the effectiveness of the approach; this will be achieved through the transformation of an existing HPC application using a methodology for targeting heterogeneous architectures. The objectives of this project are summarised as follows:

- To understand the development process and profiling requirements of high-performance applications on heterogeneous architectures; and
- To obtain speed-up on each target architecture for the N-body simulation.

It is important that we have a deep understanding of the development process of heterogeneous applications, and that we can profile applications in real time in order to better match computation to the available resource. To prove that the suggested approach for runtime profiling is effective, we need to show speed-up over the single-threaded versions of the application. Accordingly, in order to achieve the project aims, the following key points will need to be addressed:

- Understanding the target heterogeneous architectures;
- Understanding the development process for heterogeneous systems;
- Identifying the methods which enable runtime profiling and load-balancing; and
- Researching the available tools and technologies to enable the implementation.

To realise the development of a heterogeneous application, a thorough literature review of heterogeneous development will be undertaken. Two target architectures will be used to validate the profiling algorithm: the first is the Cell Broadband Engine, developed by Sony, Toshiba and IBM; the second is the Nvidia 9800 GTX graphics card with an Intel Nehalem host processor. These devices have a total of four discrete computational units, each with its own toolset and architectural nuances which will need to be explored before the implementation phase. In order to successfully implement a profiling algorithm and profile the target application, the author will review the current literature and example case studies of multi-threaded and heterogeneous development; this will continue with a further general review of application development methodologies for both multi-threaded and heterogeneous development. In the same section, we will suggest an approach for development that will be used in the design and implementation. Following on from the literature review, we will detail the algorithm design, followed by details of the implementation. To test the load-balancing and profiling algorithm, a target application is required. Developing such an application is outside the scope, and indeed the time constraints, of this project. Nevertheless, a number of applications were considered after referring to the well-known list of computational dwarfs detailed in [5]. As a result, we decided to implement an N-body simulator, as it has sufficient complexity and can be easily adjusted to vary the computational problem size. A suitable open source project was found, titled N-Mod, hosted on Google Code and developed by Dr Carey Pridgeon; the application is provided as C source code. For the rest of the project, N-Mod will be referred to as the target application.
Once the target application has been modified to run on the target architectures, the speed-up results will be documented without the use of the load-balancing algorithm. The load-balancing algorithm will then be applied, and a comparison as to its effectiveness will be conducted.

1.3 Project Phases

Below is a summary of the main project phases:

1. Literature and background research.
2. Algorithm design.
3. Prototype algorithm implementation:
   a. Target platform installation and configuration.
   b. Development of the N-Body algorithm.
   c. Implementation of a flexible load-balancing system.
   d. Implementation of the load-balancing algorithm.
4. Testing and timing.
5. Dissertation writing.

1.4 Structure for the Dissertation

This dissertation contains eight chapters. The first chapter provides the introduction and details the project's background, the significance of the problem area and the motivation for the research. Chapter One also details the aims, a methodology to realise those aims, and the main phases of the project. Chapter Two investigates the architectural and development environments for the target heterogeneous architectures, as well as projects which have been successfully developed on these systems. Chapter Three researches the area of computational load-balancing, a component which is fundamental to the development of an algorithm that scales independently of architectural constraints. In Chapter Four, current profiling techniques are reviewed, including both static and runtime techniques. Chapter Five reviews development methodologies and design formalisation techniques for both single- and multi-threaded development; additionally, Chapter Five brings the research together to propose a design for heterogeneous profiling and a load-balancing algorithm. In Chapter Six, the implementation of the algorithm using the target application is detailed, highlighting how various challenges were overcome. Chapter Seven presents detailed testing of the converted algorithm on the target architectures, including a review of the speed-up obtained using various core and load combinations. The load-balancing algorithm is also tested to demonstrate that it can determine the best architecture for a given computational load. Chapter Eight provides the conclusions of the project and recommends future work concerning the implementation and design of the load-balancing algorithm, as well as the methodology followed in its development.

Chapter 2 Heterogeneous Architectures

2.1 Chapter Overview

Computer systems have traditionally been deployed using single-core homogeneous architectures. In order to meet growing user requirements, we have witnessed the development of systems which combine differing architecture types. However, the integration of different architectures to meet the demands of the user brings with it its own problems. For example, each architecture can differ drastically from the standard x86 design and requires specific compiler support [6]. One main difference is the way in which these architectures interact with main memory. This chapter will survey the literature relating to the target architectures that will be used in the application of the profiling algorithm.

2.2 Cell Broadband Engine

The Cell Broadband Engine is a single-chip heterogeneous system developed by a partnership of Sony, Toshiba and IBM [7]. It has a revolutionary design consisting of a single PPU (a modified PowerPC core) which controls 8 SPEs (6 of which are available in the Sony PlayStation 3). The PPU is a 64-bit dual simultaneous multi-threading processor, similar to the PowerPC 970, with 32 KB of level-one cache split between separate instruction and data caches, and 512 KB of level-two cache. The PPU has a SIMD engine based on the Altivec instruction set [8]. Although

powerful, the PPU is included as the controller of the SPEs and is not seen as a target for computational load. Usually, the PPE runs the operating system and controls the operations of the HPC application. Notably, the PPE is more capable of task-switching than the individual SPEs, which lack branch prediction logic [9]. Each SPE consists of an SPU, 256 KB of LS and an MFC. The SPU is a 128-bit vector engine which uses a different instruction set to that of the PPE. Moreover, the SPU has a 128 by 128-bit register bank, a large number of registers in comparison to a standard x86 processor; however, there is no on-chip cache. Furthermore, each SPU is not coherent with the main system memory, and can only access its own LS and local MFC. The MFC is a messaging and DMA controller; it communicates with other MFCs, the PPE and the main memory controller using the EIB. The EIB consists of four unidirectional buses which interconnect the PPE, SPEs, IO and memory controller. In order to meet the required communication demands, the EIB provides a peak bandwidth of 204.8 GB/s. With this in mind, Figure 2.1 shows the interconnections of all the major components of the CBE.

Figure 2.1: CBE block diagram

The PPE is capable of running application code which is compatible with the PowerPC architecture. Application code is executed from main memory, being transparently transferred to the L1 and L2 caches using a cache coherency protocol. SPU application code is located in main memory, and a pointer to it is provided to the SPU by the PPU. The SPU can then direct the MFC to transfer the application code into the LS and execute it directly from there. Once the SPU has finished with any variables in its LS, it can direct the MFC to transfer the data back into main memory. Messages can then be sent between the SPU and the PPU to indicate the completion of processing [10]. Figure 2.2 details the division of two computational tasks between the PPE and the SPEs. The PPU performs various initialisations before starting the task on the individual SPEs; the SPEs compute their assigned sub-sections of the task before informing the PPE; the PPE collects and combines the results before preparing and starting the next stage of the model evolution; and the PPE then continues evolving the model, performs further compute stages directing the SPEs, or terminates [9].

Figure 2.2: Division of computation CBE

SPEs and the PPE can communicate over the EIB by sending messages to each other's mailboxes. This messaging capability provides alternative kernel instantiation patterns. One common method is to have each SPE conduct a unique stage of processing, as shown in Figure 2.3; this essentially creates a computational chain, commonly found in video processing.

Figure 2.3: Stream processing CBE

There are additional benefits to using a chain of SPEs when the application code is too big to fit into a single SPE: we can reduce the swapping of application space to and from the LS by dividing the workload between the individual SPEs.

2.2.1 Development of the CBE

Only the Linux kernel can communicate directly with the individual SPEs, and so drivers have been developed in order to provide easy access to the developer.

The following functions are provided by the underlying driver [11]:

- Loading a program binary into an SPU;
- Transferring memory between an SPU program and a Linux user space application; and
- Synchronising execution.

The application running on the PPE makes use of LIBSPE2 (SPE Runtime Management Library V2) to control execution on the SPEs. LIBSPE2 in turn uses the SPUFS, which is provided by the driver in kernel space. The PPE application uses SPE contexts provided by LIBSPE2 to load the SPE application image, deploy the image to the SPE, trigger execution and finally destroy the SPE context [12]. The SPE contexts are key to controlling the execution of the binary image on the CBE. Each context needs to be run in an individual PPE thread spawned from the main PPE application. Due to the heterogeneous nature of the devices, the application machine code for the PPE and the SPE needs to be compiled separately. Sony provides a GCC toolchain which creates a binary executable for each architecture.

$ gcc -lspe2 helloworld_ppe.c -o helloworld_ppe.elf
$ spu-gcc helloworld_spe.c -o helloworld_spe.elf

Figure 2.4: CBE compile commands

In Figure 2.4, GCC compiles the PPE ELF executable, whereas the SPE ELF file is compiled using the spu-gcc command. To execute the application, the PPE ELF file is launched, which in turn loads the SPE ELF into memory and directs the individual SPEs to execute through each SPE's context. Each SPE requires its own context structure, defined as spe_context_ptr_t; this structure stores the information required to communicate with the SPE. The functions

listed in Table 2.1 are used to load the SPE ELF image and, using the context structure, to load the image onto the SPE, initiate execution, and finally destroy and clear the SPE image from memory.

spe_image_open()       Load the image from disk into main memory and return a pointer to the image.
spe_context_create()   Create a context, populating the supplied spe_context_ptr_t structure.
spe_program_load()     Load the supplied image pointer onto the SPE defined by its context.
spe_context_run()      Run the current image on the context provided.
spe_context_destroy()  Destroy the SPE context.
spe_image_close()      Remove the image from main memory.

Table 2.1: LIBSPE2 functions

The application image loaded onto an SPE must be supplied with the data required for processing. This data will be located in main memory or on a disk drive. It is the responsibility of each individual SPE to DMA the section of data it wishes to process from main memory to its LS. DMA operations are provided to the developer through intrinsic functions defined in the spu_intrinsics.h and spu_mfcio.h header files. Intrinsic functions wrap a number of machine language commands that are not available through the GCC compiler.

spu_mfcdma64()   Instructs the DMA controller in the SPE to transfer the specified number of bytes to or from main memory/LS. A tag is specified so that a number of DMA transfers can be in flight at once.
spu_writech()    A macro used to communicate between the SPU and the MFC.
spu_mfcstat()    Called to stall the application until a transfer is complete. A technique referred to as double buffering allows the application to process the data it already has instead of waiting for all data to be transferred.

Table 2.2: SPE DMA functions

Using the functions in Table 2.2 and a pointer to a data structure containing the required addresses, the SPE can instruct the DMA controller to access main memory and make the data required for processing available to the SPU [12]. Each SPE has its own LS and communicates with the system memory using DMA transfers.


More information

GPUs for Scientific Computing

GPUs for Scientific Computing GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles mike.giles@maths.ox.ac.uk Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research

More information

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Introducing A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Bio Tim Child 35 years experience of software development Formerly VP Oracle Corporation VP BEA Systems Inc.

More information

Programming the Cell Multiprocessor: A Brief Introduction

Programming the Cell Multiprocessor: A Brief Introduction Programming the Cell Multiprocessor: A Brief Introduction David McCaughan, HPC Analyst SHARCNET, University of Guelph dbm@sharcnet.ca Overview Programming for the Cell is non-trivial many issues to be

More information

Control 2004, University of Bath, UK, September 2004

Control 2004, University of Bath, UK, September 2004 Control, University of Bath, UK, September ID- IMPACT OF DEPENDENCY AND LOAD BALANCING IN MULTITHREADING REAL-TIME CONTROL ALGORITHMS M A Hossain and M O Tokhi Department of Computing, The University of

More information

Parallel Programming Survey

Parallel Programming Survey Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory

More information

evm Virtualization Platform for Windows

evm Virtualization Platform for Windows B A C K G R O U N D E R evm Virtualization Platform for Windows Host your Embedded OS and Windows on a Single Hardware Platform using Intel Virtualization Technology April, 2008 TenAsys Corporation 1400

More information

Chapter 1 Computer System Overview

Chapter 1 Computer System Overview Operating Systems: Internals and Design Principles Chapter 1 Computer System Overview Eighth Edition By William Stallings Operating System Exploits the hardware resources of one or more processors Provides

More information

Next Generation GPU Architecture Code-named Fermi

Next Generation GPU Architecture Code-named Fermi Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time

More information

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip. Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide

More information

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture? This Unit: Putting It All Together CIS 501 Computer Architecture Unit 11: Putting It All Together: Anatomy of the XBox 360 Game Console Slides originally developed by Amir Roth with contributions by Milo

More information

CUDA programming on NVIDIA GPUs

CUDA programming on NVIDIA GPUs p. 1/21 on NVIDIA GPUs Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford-Man Institute for Quantitative Finance Oxford eresearch Centre p. 2/21 Overview hardware view

More information

An Implementation Of Multiprocessor Linux

An Implementation Of Multiprocessor Linux An Implementation Of Multiprocessor Linux This document describes the implementation of a simple SMP Linux kernel extension and how to use this to develop SMP Linux kernels for architectures other than

More information

Optimizing Code for Accelerators: The Long Road to High Performance

Optimizing Code for Accelerators: The Long Road to High Performance Optimizing Code for Accelerators: The Long Road to High Performance Hans Vandierendonck Mons GPU Day November 9 th, 2010 The Age of Accelerators 2 Accelerators in Real Life 3 Latency (ps/inst) Why Accelerators?

More information

Implementation of Canny Edge Detector of color images on CELL/B.E. Architecture.

Implementation of Canny Edge Detector of color images on CELL/B.E. Architecture. Implementation of Canny Edge Detector of color images on CELL/B.E. Architecture. Chirag Gupta,Sumod Mohan K cgupta@clemson.edu, sumodm@clemson.edu Abstract In this project we propose a method to improve

More information

Computer Graphics Hardware An Overview

Computer Graphics Hardware An Overview Computer Graphics Hardware An Overview Graphics System Monitor Input devices CPU/Memory GPU Raster Graphics System Raster: An array of picture elements Based on raster-scan TV technology The screen (and

More information

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching

More information

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

GPU File System Encryption Kartik Kulkarni and Eugene Linkov GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through

More information

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics 22S:295 Seminar in Applied Statistics High Performance Computing in Statistics Luke Tierney Department of Statistics & Actuarial Science University of Iowa August 30, 2007 Luke Tierney (U. of Iowa) HPC

More information

Stream Processing on GPUs Using Distributed Multimedia Middleware

Stream Processing on GPUs Using Distributed Multimedia Middleware Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research

More information

Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

More information

High-Performance Modular Multiplication on the Cell Processor

High-Performance Modular Multiplication on the Cell Processor High-Performance Modular Multiplication on the Cell Processor Joppe W. Bos Laboratory for Cryptologic Algorithms EPFL, Lausanne, Switzerland joppe.bos@epfl.ch 1 / 19 Outline Motivation and previous work

More information

Delivering Quality in Software Performance and Scalability Testing

Delivering Quality in Software Performance and Scalability Testing Delivering Quality in Software Performance and Scalability Testing Abstract Khun Ban, Robert Scott, Kingsum Chow, and Huijun Yan Software and Services Group, Intel Corporation {khun.ban, robert.l.scott,

More information

~ Greetings from WSU CAPPLab ~

~ Greetings from WSU CAPPLab ~ ~ Greetings from WSU CAPPLab ~ Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Dr. Abu Asaduzzaman, Assistant Professor and Director Wichita State University (WSU)

More information

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com CSCI-GA.3033-012 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Modern GPU

More information

NVIDIA GeForce GTX 580 GPU Datasheet

NVIDIA GeForce GTX 580 GPU Datasheet NVIDIA GeForce GTX 580 GPU Datasheet NVIDIA GeForce GTX 580 GPU Datasheet 3D Graphics Full Microsoft DirectX 11 Shader Model 5.0 support: o NVIDIA PolyMorph Engine with distributed HW tessellation engines

More information

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC Driving industry innovation The goal of the OpenPOWER Foundation is to create an open ecosystem, using the POWER Architecture to share expertise,

More information

Cluster Computing at HRI

Cluster Computing at HRI Cluster Computing at HRI J.S.Bagla Harish-Chandra Research Institute, Chhatnag Road, Jhunsi, Allahabad 211019. E-mail: jasjeet@mri.ernet.in 1 Introduction and some local history High performance computing

More information

Performance Characteristics of VMFS and RDM VMware ESX Server 3.0.1

Performance Characteristics of VMFS and RDM VMware ESX Server 3.0.1 Performance Study Performance Characteristics of and RDM VMware ESX Server 3.0.1 VMware ESX Server offers three choices for managing disk access in a virtual machine VMware Virtual Machine File System

More information

Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007

Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007 Multi-core architectures Jernej Barbic 15-213, Spring 2007 May 3, 2007 1 Single-core computer 2 Single-core CPU chip the single core 3 Multi-core architectures This lecture is about a new trend in computer

More information

Load Balancing on a Non-dedicated Heterogeneous Network of Workstations

Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Dr. Maurice Eggen Nathan Franklin Department of Computer Science Trinity University San Antonio, Texas 78212 Dr. Roger Eggen Department

More information

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011 Graphics Cards and Graphics Processing Units Ben Johnstone Russ Martin November 15, 2011 Contents Graphics Processing Units (GPUs) Graphics Pipeline Architectures 8800-GTX200 Fermi Cayman Performance Analysis

More information

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist NVIDIA CUDA Software and GPU Parallel Computing Architecture David B. Kirk, Chief Scientist Outline Applications of GPU Computing CUDA Programming Model Overview Programming in CUDA The Basics How to Get

More information

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Kenneth B. Kent University of New Brunswick Faculty of Computer Science Fredericton, New Brunswick, Canada ken@unb.ca Micaela Serra

More information

Real-time processing the basis for PC Control

Real-time processing the basis for PC Control Beckhoff real-time kernels for DOS, Windows, Embedded OS and multi-core CPUs Real-time processing the basis for PC Control Beckhoff employs Microsoft operating systems for its PCbased control technology.

More information

Evaluation of CUDA Fortran for the CFD code Strukti

Evaluation of CUDA Fortran for the CFD code Strukti Evaluation of CUDA Fortran for the CFD code Strukti Practical term report from Stephan Soller High performance computing center Stuttgart 1 Stuttgart Media University 2 High performance computing center

More information

GeoImaging Accelerator Pansharp Test Results

GeoImaging Accelerator Pansharp Test Results GeoImaging Accelerator Pansharp Test Results Executive Summary After demonstrating the exceptional performance improvement in the orthorectification module (approximately fourteen-fold see GXL Ortho Performance

More information

Using Synology SSD Technology to Enhance System Performance Synology Inc.

Using Synology SSD Technology to Enhance System Performance Synology Inc. Using Synology SSD Technology to Enhance System Performance Synology Inc. Synology_SSD_Cache_WP_ 20140512 Table of Contents Chapter 1: Enterprise Challenges and SSD Cache as Solution Enterprise Challenges...

More information

Parallel Algorithm Engineering

Parallel Algorithm Engineering Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework Examples Software crisis

More information

High Performance Computing in CST STUDIO SUITE

High Performance Computing in CST STUDIO SUITE High Performance Computing in CST STUDIO SUITE Felix Wolfheimer GPU Computing Performance Speedup 18 16 14 12 10 8 6 4 2 0 Promo offer for EUC participants: 25% discount for K40 cards Speedup of Solver

More information

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR Frédéric Kuznik, frederic.kuznik@insa lyon.fr 1 Framework Introduction Hardware architecture CUDA overview Implementation details A simple case:

More information

Chapter 4 System Unit Components. Discovering Computers 2012. Your Interactive Guide to the Digital World

Chapter 4 System Unit Components. Discovering Computers 2012. Your Interactive Guide to the Digital World Chapter 4 System Unit Components Discovering Computers 2012 Your Interactive Guide to the Digital World Objectives Overview Differentiate among various styles of system units on desktop computers, notebook

More information

Cell-SWat: Modeling and Scheduling Wavefront Computations on the Cell Broadband Engine

Cell-SWat: Modeling and Scheduling Wavefront Computations on the Cell Broadband Engine Cell-SWat: Modeling and Scheduling Wavefront Computations on the Cell Broadband Engine Ashwin Aji, Wu Feng, Filip Blagojevic and Dimitris Nikolopoulos Forecast Efficient mapping of wavefront algorithms

More information

Oracle Database Scalability in VMware ESX VMware ESX 3.5

Oracle Database Scalability in VMware ESX VMware ESX 3.5 Performance Study Oracle Database Scalability in VMware ESX VMware ESX 3.5 Database applications running on individual physical servers represent a large consolidation opportunity. However enterprises

More information

PARALLEL & CLUSTER COMPUTING CS 6260 PROFESSOR: ELISE DE DONCKER BY: LINA HUSSEIN

PARALLEL & CLUSTER COMPUTING CS 6260 PROFESSOR: ELISE DE DONCKER BY: LINA HUSSEIN 1 PARALLEL & CLUSTER COMPUTING CS 6260 PROFESSOR: ELISE DE DONCKER BY: LINA HUSSEIN Introduction What is cluster computing? Classification of Cluster Computing Technologies: Beowulf cluster Construction

More information

Why Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat

Why Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat Why Computers Are Getting Slower (and what we can do about it) Rik van Riel Sr. Software Engineer, Red Hat Why Computers Are Getting Slower The traditional approach better performance Why computers are

More information

High Performance Computing. Course Notes 2007-2008. HPC Fundamentals

High Performance Computing. Course Notes 2007-2008. HPC Fundamentals High Performance Computing Course Notes 2007-2008 2008 HPC Fundamentals Introduction What is High Performance Computing (HPC)? Difficult to define - it s a moving target. Later 1980s, a supercomputer performs

More information

Enabling Technologies for Distributed and Cloud Computing

Enabling Technologies for Distributed and Cloud Computing Enabling Technologies for Distributed and Cloud Computing Dr. Sanjay P. Ahuja, Ph.D. 2010-14 FIS Distinguished Professor of Computer Science School of Computing, UNF Multi-core CPUs and Multithreading

More information

Performance Monitoring of the Software Frameworks for LHC Experiments

Performance Monitoring of the Software Frameworks for LHC Experiments Proceedings of the First EELA-2 Conference R. mayo et al. (Eds.) CIEMAT 2009 2009 The authors. All rights reserved Performance Monitoring of the Software Frameworks for LHC Experiments William A. Romero

More information

Symmetric Multiprocessing

Symmetric Multiprocessing Multicore Computing A multi-core processor is a processing system composed of two or more independent cores. One can describe it as an integrated circuit to which two or more individual processors (called

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION 1 CHAPTER 1 INTRODUCTION 1.1 MOTIVATION OF RESEARCH Multicore processors have two or more execution cores (processors) implemented on a single chip having their own set of execution and architectural recourses.

More information

The Bus (PCI and PCI-Express)

The Bus (PCI and PCI-Express) 4 Jan, 2008 The Bus (PCI and PCI-Express) The CPU, memory, disks, and all the other devices in a computer have to be able to communicate and exchange data. The technology that connects them is called the

More information

Parallel Processing and Software Performance. Lukáš Marek

Parallel Processing and Software Performance. Lukáš Marek Parallel Processing and Software Performance Lukáš Marek DISTRIBUTED SYSTEMS RESEARCH GROUP http://dsrg.mff.cuni.cz CHARLES UNIVERSITY PRAGUE Faculty of Mathematics and Physics Benchmarking in parallel

More information

Measuring Cache and Memory Latency and CPU to Memory Bandwidth

Measuring Cache and Memory Latency and CPU to Memory Bandwidth White Paper Joshua Ruggiero Computer Systems Engineer Intel Corporation Measuring Cache and Memory Latency and CPU to Memory Bandwidth For use with Intel Architecture December 2008 1 321074 Executive Summary

More information

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,

More information

Seeking Opportunities for Hardware Acceleration in Big Data Analytics

Seeking Opportunities for Hardware Acceleration in Big Data Analytics Seeking Opportunities for Hardware Acceleration in Big Data Analytics Paul Chow High-Performance Reconfigurable Computing Group Department of Electrical and Computer Engineering University of Toronto Who

More information

Infrastructure Matters: POWER8 vs. Xeon x86

Infrastructure Matters: POWER8 vs. Xeon x86 Advisory Infrastructure Matters: POWER8 vs. Xeon x86 Executive Summary This report compares IBM s new POWER8-based scale-out Power System to Intel E5 v2 x86- based scale-out systems. A follow-on report

More information

Understanding the Performance of an X550 11-User Environment

Understanding the Performance of an X550 11-User Environment Understanding the Performance of an X550 11-User Environment Overview NComputing's desktop virtualization technology enables significantly lower computing costs by letting multiple users share a single

More information

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France

More information

Choosing a Computer for Running SLX, P3D, and P5

Choosing a Computer for Running SLX, P3D, and P5 Choosing a Computer for Running SLX, P3D, and P5 This paper is based on my experience purchasing a new laptop in January, 2010. I ll lead you through my selection criteria and point you to some on-line

More information

Radeon GPU Architecture and the Radeon 4800 series. Michael Doggett Graphics Architecture Group June 27, 2008

Radeon GPU Architecture and the Radeon 4800 series. Michael Doggett Graphics Architecture Group June 27, 2008 Radeon GPU Architecture and the series Michael Doggett Graphics Architecture Group June 27, 2008 Graphics Processing Units Introduction GPU research 2 GPU Evolution GPU started as a triangle rasterizer

More information

Data Centric Systems (DCS)

Data Centric Systems (DCS) Data Centric Systems (DCS) Architecture and Solutions for High Performance Computing, Big Data and High Performance Analytics High Performance Computing with Data Centric Systems 1 Data Centric Systems

More information

Chapter 11 I/O Management and Disk Scheduling

Chapter 11 I/O Management and Disk Scheduling Operating Systems: Internals and Design Principles, 6/E William Stallings Chapter 11 I/O Management and Disk Scheduling Dave Bremer Otago Polytechnic, NZ 2008, Prentice Hall I/O Devices Roadmap Organization

More information

A Review of Customized Dynamic Load Balancing for a Network of Workstations

A Review of Customized Dynamic Load Balancing for a Network of Workstations A Review of Customized Dynamic Load Balancing for a Network of Workstations Taken from work done by: Mohammed Javeed Zaki, Wei Li, Srinivasan Parthasarathy Computer Science Department, University of Rochester

More information

Discovering Computers 2011. Living in a Digital World

Discovering Computers 2011. Living in a Digital World Discovering Computers 2011 Living in a Digital World Objectives Overview Differentiate among various styles of system units on desktop computers, notebook computers, and mobile devices Identify chips,

More information

HP Z Turbo Drive PCIe SSD

HP Z Turbo Drive PCIe SSD Performance Evaluation of HP Z Turbo Drive PCIe SSD Powered by Samsung XP941 technology Evaluation Conducted Independently by: Hamid Taghavi Senior Technical Consultant June 2014 Sponsored by: P a g e

More information

The Truth Behind IBM AIX LPAR Performance

The Truth Behind IBM AIX LPAR Performance The Truth Behind IBM AIX LPAR Performance Yann Guernion, VP Technology EMEA HEADQUARTERS AMERICAS HEADQUARTERS Tour Franklin 92042 Paris La Défense Cedex France +33 [0] 1 47 73 12 12 info@orsyp.com www.orsyp.com

More information

Removing Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays. Red Hat Performance Engineering

Removing Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays. Red Hat Performance Engineering Removing Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays Red Hat Performance Engineering Version 1.0 August 2013 1801 Varsity Drive Raleigh NC

More information

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics

GPU Architectures. A CPU Perspective. Data Parallelism: What is it, and how to exploit it? Workload characteristics GPU Architectures A CPU Perspective Derek Hower AMD Research 5/21/2013 Goals Data Parallelism: What is it, and how to exploit it? Workload characteristics Execution Models / GPU Architectures MIMD (SPMD),

More information

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015

GPU Hardware and Programming Models. Jeremy Appleyard, September 2015 GPU Hardware and Programming Models Jeremy Appleyard, September 2015 A brief history of GPUs In this talk Hardware Overview Programming Models Ask questions at any point! 2 A Brief History of GPUs 3 Once

More information

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage Parallel Computing Benson Muite benson.muite@ut.ee http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework

More information

High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/2015. 2015 CAE Associates

High Performance. CAEA elearning Series. Jonathan G. Dudley, Ph.D. 06/09/2015. 2015 CAE Associates High Performance Computing (HPC) CAEA elearning Series Jonathan G. Dudley, Ph.D. 06/09/2015 2015 CAE Associates Agenda Introduction HPC Background Why HPC SMP vs. DMP Licensing HPC Terminology Types of

More information

Introduction to GPGPU. Tiziano Diamanti t.diamanti@cineca.it

Introduction to GPGPU. Tiziano Diamanti t.diamanti@cineca.it t.diamanti@cineca.it Agenda From GPUs to GPGPUs GPGPU architecture CUDA programming model Perspective projection Vectors that connect the vanishing point to every point of the 3D model will intersecate

More information

September 25, 2007. Maya Gokhale Georgia Institute of Technology

September 25, 2007. Maya Gokhale Georgia Institute of Technology NAND Flash Storage for High Performance Computing Craig Ulmer cdulmer@sandia.gov September 25, 2007 Craig Ulmer Maya Gokhale Greg Diamos Michael Rewak SNL/CA, LLNL Georgia Institute of Technology University

More information

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices E6895 Advanced Big Data Analytics Lecture 14: NVIDIA GPU Examples and GPU on ios devices Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist,

More information

Lattice QCD Performance. on Multi core Linux Servers

Lattice QCD Performance. on Multi core Linux Servers Lattice QCD Performance on Multi core Linux Servers Yang Suli * Department of Physics, Peking University, Beijing, 100871 Abstract At the moment, lattice quantum chromodynamics (lattice QCD) is the most

More information

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA AGENDA INTRO TO BEAGLEBONE BLACK HARDWARE & SPECS CORTEX-A8 ARMV7 PROCESSOR PROS & CONS VS RASPBERRY PI WHEN TO USE BEAGLEBONE BLACK Single

More information

How System Settings Impact PCIe SSD Performance

How System Settings Impact PCIe SSD Performance How System Settings Impact PCIe SSD Performance Suzanne Ferreira R&D Engineer Micron Technology, Inc. July, 2012 As solid state drives (SSDs) continue to gain ground in the enterprise server and storage

More information

A Powerful solution for next generation Pcs

A Powerful solution for next generation Pcs Product Brief 6th Generation Intel Core Desktop Processors i7-6700k and i5-6600k 6th Generation Intel Core Desktop Processors i7-6700k and i5-6600k A Powerful solution for next generation Pcs Looking for

More information

Benchmarking Large Scale Cloud Computing in Asia Pacific

Benchmarking Large Scale Cloud Computing in Asia Pacific 2013 19th IEEE International Conference on Parallel and Distributed Systems ing Large Scale Cloud Computing in Asia Pacific Amalina Mohamad Sabri 1, Suresh Reuben Balakrishnan 1, Sun Veer Moolye 1, Chung

More information

QCD as a Video Game?
