Benchmarks and Comparisons of Performance for Data Intensive Research

Saad A. Alowayyed

MSc in High Performance Computing
The University of Edinburgh
Year of Presentation: 2012

August 23, 2012

Abstract

The rapid development of technology and the increasing precision of captured data have led to a significant increase in the amount of data, known as big data. This big data is valuable because it is the outcome of efforts such as experiments or discoveries, so interpretation is necessary to obtain information from it. This information cannot be extracted efficiently without effective methods for storing, retrieving, manipulating and analysing the big data. The modest performance of secondary storage stands in the way of these methods; in other words, it exposes the I/O bottlenecks that prevent the I/O system from supplying the CPU with the right amount of data. The right amount of I/O to supply to the CPU was analysed by Gene Amdahl, who derived a ratio widely known as the Amdahl number, which is one of the main criteria used in this project to decide whether or not a system is balanced.

Two systems at the University of Edinburgh are examined: EDIM1 and Eddie. This project aims to decide whether or not EDIM1 and Eddie are balanced and, hence, suitable for data intensive computing. This is achieved by using a number of benchmarks, one for each component, to determine whether the sustained bandwidths are of the same order of magnitude. Moreover, the Amdahl number is calculated both theoretically and experimentally, using three benchmarks with different I/O patterns, two of which were developed during the scope of this project.

From the results, it was found that the two systems are suitable for data intensive computing and can provide a good platform for data intensive programs. The first machine, EDIM1, was designed to be similar to the Amdahl-balanced blades by attaching more disks to a low-power processor. This increases the I/O throughput considerably, so the I/O should keep up with the CPU performance. The second machine, Eddie, is not built specifically for data intensive research, but it can reduce the gap between the I/O and the CPU in the field of general high performance computing.

Contents

Chapter 1 Introduction
Chapter 2 Background
  2.1 Big Data
  2.2 Data Intensive Computing
    2.2.1 Balanced systems
    2.2.2 Raw sequential I/O
    2.2.3 Multi-scaling
  2.3 Styles of Data Intensive Computing
    2.3.1 Data-processing pipelines
    2.3.2 Data warehouse
    2.3.3 Data centres
  2.4 Amdahl-balanced blades
  2.5 Devices
    2.5.1 Technical information
    2.5.2 Machines theoretical values
    2.5.3 Mythical system
Chapter 3 Benchmarks
  3.1 Floating point operations per second
    3.1.1 FLOPS
    3.1.2 LINPACK
    3.1.3 Simple CPU
  3.2 Memory bandwidth
    3.2.1 Stream
    3.2.2 Simple memory
  3.3 File system
    3.3.1 IOZone
    3.3.2 Simple I/O
  3.4 Balanced systems
    3.4.1 Amdahl synthetics benchmarks
Chapter 4 Benchmarks Results and Analysis
  CPU flops rate: EDIM1, Eddie
  Memory flops rate: EDIM1, Eddie
  Disk's flops rate: EDIM1, Eddie
  Amdahl synthetics benchmarks
  4.2 Testing
Chapter 5 Conclusions
Chapter 6 Project analysis
  Project Post-mortem
    Process
    Materials and technology
    What made the project successful
    What needs more improvements
    Lessons learned
  Deviations from the work plan
  Risks
Appendix A Benchmarks results
Appendix B Benchmarks compiling and running
References

List of Tables

Table 2-1: Theoretical (Datasheet) Values
Table 2-2: Machine theoretical values
Table 3-1: The eight kernels in the FLOPS benchmark
Table 3-2: The sizes of the array compared to the data used in LINPACK
Table 4-1: Benchmark results for the EDIM1 and Eddie CPUs
Table 4-2: Benchmark results for the EDIM1 and Eddie main memories
Table 4-3: Benchmark results for the EDIM1 and Eddie disks
Table 4-4: Summaries of test suites completed in CUnit
Table 6-1: The Gantt chart for the project
Table 6-2: Risk analysis

List of Figures

Figure 2-1: Illustration of the steps in the data-processing pipeline
Figure 2-2: Node architecture in EDIM1
Figure 3-1: Illustration of Amdahl synthetics benchmarks
Figure 3-2: Different thread compositions, throughputs and Amdahl numbers
Figure 3-3: Different real program patterns
Figure 3-4: Algorithm for real program I/O pattern benchmark
Figure 4-1: Summary of the EDIM1 CPU-memory flops rates
Figure 4-2: Summary of the Eddie CPU-memory flops rates
Figure 4-3: Throughput for different sets of storage
Figure 4-4: Amdahl numbers for different sets of storage
Figure 4-5: Illustration of Amdahl numbers among different buffer sizes
Figure 4-6: Performance of different sets of storage
Figure 4-7: Amdahl numbers for a combination of disks
Figure 4-8: Amdahl number for different I/O patterns
Figure 4-9: Performance for different I/O patterns
Figure 4-10: Amdahl number for a pattern using different buffer sizes

Acknowledgements

I am sincerely grateful to Dr Adam Carter; this dissertation would not have been possible without his guidance, encouragement and feedback. I owe thanks to Mr Gareth Francis for helping me with problems related to the cluster. I am truly indebted to Dr Orlando Richards from ECDF for his help in exploring new information. I would like to show my gratitude to my family, with special reference to my father, Dr Abdullah Alowayyed, for giving me their support, without which none of this would have happened.

Chapter 1 Introduction

Recently, there has been a great increase in the demand for retrieving, processing, sorting and storing large amounts of raw data. It is a challenge to deal with this amount of data efficiently, especially given its value and the fact that it is the result of considerable effort, such as experiments or discoveries. Storage, on the other hand, has seen a remarkable increase in capacity, and costs have decreased significantly. However, storage performance has not kept pace with CPU performance: in most cases, the CPU can process the data faster than the storage can retrieve or store it. This is widely known as the I/O bottleneck.

The I/O bottleneck is a real difficulty for computer experts, programmers, developers and users. Even when an algorithm is fast and well defined, the I/O slows down the overall execution, and this slowdown becomes sharper as the amount of data or the number of I/O operations increases. Data Intensive Computing (DIC) is precisely about reducing the I/O bottleneck and the performance gap between the CPU and the I/O, by organising the system to store and retrieve data quickly enough that the CPU does not remain idle.

This project is an in-depth examination of two systems, EDIM1 and Eddie, to determine whether or not they are suitable for data intensive computing by analysing their theoretical and experimental values. To make that decision, two main objectives were set. The first is to run a set of benchmarks for the components involved in the CPU-I/O path: the CPU, the main memory, the file system and the disks. The second is to calculate the Amdahl number, because this gives a precise indication of whether or not the system is balanced in terms of the throughput of the I/O system and the performance of the CPU. The Amdahl number is calculated theoretically for each system and is also measured in practice. In 2011, a student at EPCC wrote a benchmark to calculate the Amdahl number [1]. In this project, a different set of I/O patterns is taken into account to make that benchmark more realistic, bringing it closer to the peak or to real applications. Finally, to reach a decision on whether or not a machine is suitable for data intensive computing, the Amdahl number should be close to 1.0; moreover, the sustained bandwidths of all the components, measured by the benchmarks, should be of the same order of magnitude, so that a similar sustained CPU rate can be achieved regardless of the source of the data.

Chapter 2 begins with the basic definitions of big data and data intensive computing. Next, a review of the related literature is included to give a wider picture of the project and of the data intensive computing field. The chapter ends with an overview of the two systems under study and the calculation of their theoretical values. Chapter 3 goes through the set of benchmarks being used and explains how their results are calculated; it categorises the benchmarks by component, following the data path from the CPU to the disks, and ends with a detailed description of the different Amdahl synthetic benchmarks developed in the scope of this project. Chapter 4 presents the results obtained from the benchmarks of the previous chapter, together with the theoretical numbers from Chapter 2, and gathers them to develop hypotheses and an appropriate conclusion answering the research question discussed earlier. In Chapter 5, a final set of conclusions is drawn on whether or not the machines are suitable for data intensive computing and whether highly I/O-bound programs will benefit from them; the chapter closes with recommendations for both systems. Finally, Chapter 6 analyses the project and clarifies the deviations from the work plan and the risks identified in the project preparation period. Appendix A presents the benchmark results and a sample of the test results. Appendix B shows how to obtain the benchmark sources and how to compile and run them with different configurations; one of the test functions is also shown.

Chapter 2 Background

2.1 Big Data

Big Data is defined as large and fast-growing datasets. These datasets are captured from several sources, such as experiments and sensors. As technology develops and becomes more precise, it produces more data, and the storage, search, analysis and availability of this Big Data becomes a problem. The issue doubles every year [2] [3], and perhaps even faster, because of the imminence of the exascale age. Currently, various sources, such as sensors, search engines, social networks and experiments, create 2.5 exabytes of data daily [4]. This is exemplified by Google producing 20 petabytes of data per day and by The European Organization for Nuclear Research (CERN), whose ATLAS experiment generates a petabyte per second [5], as two examples out of many. Obviously, there is great demand for dealing with this big data efficiently.

Moreover, big data will generate more data movement than before. Unfortunately, present-day secondary storage may not cope well, and its modest performance may worsen the I/O bottlenecks; these bottlenecks lead to most of the time being spent moving rather than processing data [6]. However, data intensive computing is now being introduced to increase the ability to analyse and understand the data where it is captured [1]. Moreover, [7] mentions that the increase in processing capabilities is another reason for big data's growth.

2.2 Data Intensive Computing

Data-intensive computing, according to [8], is the fourth basic research paradigm, after experiment, theory and computer simulation, and is urgently required for three reasons. The first is the requirement to store and retrieve large amounts of data efficiently.

The second reason involves saving the energy consumed by this great demand, by using computational resources more effectively through building low-power systems [3]; these systems are described later. Finally, analysing large, distributed amounts of data is not an easy task; the ATLAS experiment, for example, processes 10 petabytes of data every year [9].

Three main issues [2] should be considered when DIC is mentioned: balanced systems, multi-scaling and raw sequential I/O. The next subsections describe these terms in depth.

2.2.1 Balanced systems

Balanced systems can be defined as systems that move data at a speed sufficient to prevent the CPUs from experiencing delays. The performance of a system is limited by its slowest component, and the I/O operations are usually that slowest component, followed by the main memory and a poorly designed cache [6]. Amdahl set out a number of laws [10] that determine whether or not a system is balanced:

The Throughput Balance Law: a system needs one bit of I/O per second per instruction per second; this ratio is known as the Amdahl number. It can be calculated by dividing the I/O bandwidth (the number of bits the system can read per second) by the flops rate, which is the number of floating point operations per second. This number is the main concern of this project.

The Memory Law: one byte of memory per instruction per second. This ratio indicates whether or not there is sufficient memory to undertake the computation, and is known as the Amdahl memory ratio.

The I/O Law: one I/O operation for every 50,000 instructions. Moreover, when there is less than one I/O operation per 50,000 instructions, this ratio reflects the I/O latency.

Amdahl's experience as a computer architect and his observations of real-life applications led to these practical laws [11]. Although the laws are a few decades old, the kind of computation carried out over those decades has not changed in any fundamental way, so they are still valid and easily applied to current real-life applications. In addition to the Throughput Balance Law, the sustained bandwidths of the CPU, memory and I/O being of the same order of magnitude is another indication that a system is in balance.
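To make these laws concrete, the short C program below computes the three ratios from datasheet-style figures. It is an illustrative sketch only: the input values are arbitrary placeholders, not measurements of EDIM1 or Eddie, and the program simply encodes the three laws as stated above.

/* amdahl_ratios.c -- illustrative only: computes the three Amdahl ratios
 * from datasheet-style figures. The numbers below are arbitrary placeholders,
 * not measurements of EDIM1 or Eddie.
 * Build with: gcc -o amdahl_ratios amdahl_ratios.c
 */
#include <stdio.h>

int main(void)
{
    /* Assumed node parameters (placeholders) */
    double instr_per_sec    = 10.0e9;    /* instructions (flops) per second  */
    double io_bytes_per_sec = 1.0e9;     /* sustained I/O bandwidth, bytes/s */
    double memory_bytes     = 8.0e9;     /* main memory per node, bytes      */
    double iops             = 100000.0;  /* I/O operations per second        */

    /* Throughput Balance Law: one bit of I/O per second per instruction/s */
    double amdahl_number = (io_bytes_per_sec * 8.0) / instr_per_sec;

    /* Memory Law: one byte of memory per instruction per second */
    double memory_ratio = memory_bytes / instr_per_sec;

    /* I/O Law: one I/O operation per 50,000 instructions */
    double iops_ratio = iops / (instr_per_sec / 50000.0);

    printf("Amdahl number      : %.3f\n", amdahl_number);
    printf("Amdahl memory ratio: %.3f\n", memory_ratio);
    printf("Amdahl IOPS ratio  : %.3f\n", iops_ratio);
    return 0;
}

With these placeholder values the program prints 0.800, 0.800 and 0.500 respectively; ratios close to 1.0 indicate a balanced node under each law.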

2.2.2 Raw sequential I/O

When considering a large amount of data, several terabytes for example, the memory and cache should be taken out of the performance equation. The best that can be done is to use the drive's own schedulers, such as in-drive scheduling, which schedules the command queue for maximum efficiency. Another mechanism is on-board caching, a simple cache on modern disks that holds data without searching the disk itself [12]; on-board caching is effective in the case of sequential I/O.

The performance of the Hard Disk Drive (HDD) has improved significantly over the years. For example, the first 500 GB hard drive, from Hitachi GST, had a transfer speed of 3 Gb/s [13]; then, in 2011, Seagate produced the first 4.0 terabyte hard drive, with the data transfer boosted up to 80 Gb/s [14]. However, these improvements in size and data transfer are not sufficient when compared with modern developments in CPUs or with the huge increase in data sizes. The problem of modest sequential read performance could be improved by an order of magnitude using Amdahl-balanced blades (discussed in Section 2.4).

2.2.3 Multi-scaling

To address the slowdown caused by I/O performance, one of two scaling approaches must be used: scaling up or scaling out. Scaling up means adding more multi-processors and a high-performance disk array [3]. This method is expensive and yields low productivity, especially for sequential I/O, but it has lower management and partitioning overheads [2]. Scaling out means using lower-power processors and attaching more nodes, each node being attached to one or more disks [1]. Scaling out is cheaper and increases the throughput, thus reducing the bottleneck; on the other hand, it creates a large overhead in disk I/O management and in partitioning the data between the disks. [15] illustrates that scale-up is used in large symmetric multiprocessing (SMP) systems, while scale-out takes the form of clusters.

A number of research outcomes, such as Gray's law [16], suggest that scale-out is the most suitable approach for data intensive computing. In this project, the difference between these two scaling approaches mirrors the difference between the two machines: EDIM1 is scale-out, while Eddie is more scale-up.

2.3 Styles of Data Intensive Computing

Three styles of Data Intensive Computing have been developed over recent years [17]. First, data-processing pipelines initiate a pipeline of processes after capturing the data; this pipeline, shown in Figure 2-1, is used to reduce the amount of data significantly through well-defined steps. Second, the data warehouse is a large, undistributed store of data, while the third style is distributed storage, known as data centres.

2.3.1 Data-processing pipelines

To reach small, easy-to-interpret datasets, data-processing pipelines have three main steps. The first is high-throughput capture: the data are captured from the source, the researchers manipulate them, for example deleting unnecessary content and looking for an easier route to analysis, and the data are then stored, ready for the next step. The second step is analytics: the data are interpreted using a specific, often complicated, algorithm, and this step requires high-performance computers to obtain the results. Finally, in the understanding step, which is the focus of visual analytics, the researchers interpret the analysis results produced in the previous step; the outcomes of the work might finally be visualised.

Figure 2-1: Illustration of the steps in the data-processing pipeline [17] (data volume decreases and information density increases over time as the data move from high-throughput capture through analytics to understanding)

2.3.2 Data warehouse

The data warehouse can be considered the main storage location of the data, or the main archive that draws the complete picture needed to make a decision [18]. Its main function is to store the data until researchers extract and interpret the information. The Sloan Digital Sky Survey (SDSS) [19] is the most important survey of the astronomy age. All the images and information produced by the SDSS telescope are collected in the SDSS SkyServer. The SkyServer, as an example of a data warehouse, can host up to 40 terabytes of raw data, to be interpreted later by astronomers [20]. The SkyServer provides a public search tool, called DR6, which allows the data to be searched after it has been interpreted.

The concept of the data warehouse is similar to cloud computing. Alexander Szalay [2] considers the two similar because both provide access to services, both data and computation, in the same shared, centralised facility. Szalay argues that a data warehouse design has more advantages than a grid-based one, because it allows many users to access the shared datasets. On the other hand, it has been argued that the popularity of using remote grids is questionable [21].

2.3.3 Data centres

The newest data intensive computing style is inspired by the Internet and the concept of data distribution [17]. In this style, the data is stored in distributed data centres, and a specific programming model, such as MapReduce, is used to process these large datasets. An example is the data centres supporting academic research provided by the National Science Foundation, Google and IBM: in 2008, they provided 1,600-processor clusters to give researchers access to distributed sources using Hadoop [22]. Another characteristic of data centres, as suggested by [2], is that, because of the large amount of data, the computation should be brought to where the data is, and not vice versa, thus saving transmission time; avoiding I/O problems and possible data loss are the expected outcomes of this change. In addition, Jim Gray [16] argues that scientific computing, which is in fact becoming more data intensive, would benefit from data centres.

2.4 Amdahl-balanced blades

The increase in data size brings a corresponding increase in power consumption [23]. This, together with the main target of achieving a system with fewer I/O bottlenecks, is what convinced [3] to look for a blade that is energy efficient and delivers a high sequential read throughput. Sequential read is used as the metric because it is the most appropriate measure of I/O throughput, being both intensively used and high in bandwidth. The power consumption problem can be improved by using energy-efficient CPUs [23], such as the Intel 1.6 GHz Atom N330 (used in EDIM1 [24] and GrayWulf [2]). Theoretically, and according to [3], the 1.6 GHz CPU is balanced with 32,000 I/O operations per second, and this is roughly what the Solid State Drives (SSDs) used in GrayWulf can produce. Another example of an SSD, the Crucial RealSSD C300 [25], the drive used in EDIM1, can carry out up to 45,000 I/O operations per second. Another reason for choosing an SSD is its lower access time compared to an HDD; on the other hand, SSDs are much more expensive.

The Amdahl-balanced blade used in GrayWulf [3] consists of a dual-core Atom 330 and is made balanced by using two SSDs. EDIM1 uses a similar architecture to the Amdahl-balanced blades but, instead of a second SSD, three HDDs are used. Comparing the price of one of the SSDs used in GrayWulf with that of the three HDDs used in EDIM1, the EDIM1 disks come out about USD 100 cheaper.

2.5 Devices

In this project, two systems are examined: EDIM1 and Eddie. The following sections first describe their technical details; then the theoretical machine values are computed; finally, a well-balanced mythical system is described.

2.5.1 Technical information

EDIM1

EDIM1 (Edinburgh Data Intensive Machine 1) is an experimental data-intensive cluster, owned by EPCC and the School of Informatics at the University of Edinburgh, containing 120 compute nodes in three racks. Each compute node consists of a dual-core Intel Atom N330 CPU (1.6 GHz) on a Zotac mini-ITX motherboard, all with low power consumption. The flops rate has been increased through the use of the NVIDIA ION GPU. Each node has 4 GB of DDR3 shared main memory, three 2 TB Hitachi Deskstar 7K3000 HDDs and one 256 GB RealSSD C300 SSD [24], as shown in Figure 2-2.

Figure 2-2: Node Architecture in EDIM1 [1]

As noted in the specifications, each node is similar to the Amdahl-balanced blades described in Section 2.4. Table 2-1 summarises the technical information and clearly shows that the nodes do not reach the high flops rates of computational clusters such as Eddie, but are cheaper, more suitable for data intensive work and have a lower power consumption.

Eddie

The Edinburgh Compute and Data Facility (ECDF) has a high performance cluster known as Eddie, with 156 nodes, each consisting of two 8-core Intel Xeon E5645 CPUs (2.4 GHz). Each node comes with 24 GB of DDR3 memory and a 250 GB 7200 rpm SATA drive [26]. Moreover, Eddie is equipped with a special storage system that allows it to supply a massive amount of storage for the University of Edinburgh. The nodes are connected by 1 Gb Ethernet to 48-port switches, each of which connects back over 1 x 10 Gb Ethernet to another network switch. Eight storage servers connect to that switch (also at 10 Gb), and the storage servers connect to the storage via 2 x 8 Gb fibre channel SAN connections [27].

Table 2-1: Theoretical (Datasheet) Values

                               EDIM1                    Eddie
    CPU model                  Intel Atom N330          Intel Xeon E5645
    CPU clock rate             1.6 GHz                  2.4 GHz
    Cores / CPU                2                        8
    CPUs / node                1                        2
    Memory / node              4 GB DDR3                24 GB DDR3
    Secondary storage / node   256 GB SSD + 6 TB HDD    250 GB HDD + storage system

2.5.2 Machines theoretical values

Theoretical flops rate

The theoretical flops rate for a node, widely known as the theoretical peak, can be calculated with the following formula [28]:

    Theoretical flops rate (FLOP/s) = CPU clock rate × instructions per cycle × cores per CPU × CPUs per node    (1)

For EDIM1:

    Theoretical flops rate = 1.6 GHz × 2 × 2 × 1 = 6.4 GFLOP/s

For Eddie:

    Theoretical flops rate = 2.4 GHz × 2 × 8 × 2 = 76.8 GFLOP/s

Unsurprisingly, the gap between the two results provides the first indication that these two machines are built for different purposes. In other words, EDIM1 is specifically built for data intensive research and provides a balance between the CPU and the disks, while Eddie is designed for general purpose high-performance computation.

Theoretical Amdahl Memory Ratio

This is the best indicator of whether or not there is sufficient memory to undertake the computation, and it is based on the Memory Law discussed previously in Section 2.2.1. When the memory ratio is exactly one, there is, theoretically, exactly one byte of memory per instruction per second, which means the system's performance should not suffer from having to re-read data from the disks several times: the main memory has plenty of space, so there is no need to go back to disk. The Amdahl Memory Ratio can be calculated as follows [1]:

    Amdahl Memory Ratio = Memory size per node / CPU rate    (2)

For EDIM1 this gives:

    Amdahl Memory Ratio = 1.28

While Eddie's value is also:

    Amdahl Memory Ratio = 1.28

From these results, one can note that there is more than one byte of memory per instruction per second in both systems. That means both systems' nodes offer more memory than Amdahl observed a real-world program to require.

Theoretical Amdahl IOPS ratio

The I/O, in addition to the memory, produces a bottleneck, widely known as the I/O bottleneck. In HPC, the I/O bottleneck is painful and wastes the CPU's time and energy in waiting. This dilemma is exacerbated in DIC systems because of the massive amount of data and number of I/O operations they handle. Theoretically, it is possible to decide whether or not I/O bottlenecks are likely to occur through the Amdahl IOPS ratio [29]. As with the Amdahl Memory Ratio, a value close to one means that I/O bottlenecks are unlikely and shows that the system can do more I/O than calculation. This value is very important in DIC systems; avoiding as many I/O bottlenecks as possible saves time and power, which are the criteria to bear in mind when purchasing or building the system. To compute the theoretical Amdahl IOPS ratio, it is necessary to calculate the IOPS value before applying the following formula, which follows from the I/O Law's one I/O operation per 50,000 instructions [1]:

    Amdahl IOPS ratio = IOPS / (CPU rate / 50,000)    (3)

To find the IOPS value for an HDD (with the times in milliseconds):

    HDD IOPS = 1000 / (average seek time + average latency)    (4)

According to the technical specification for the RealSSD C300 [25], the random read IOPS is 60,000. Moreover, each node has three Hitachi Deskstar 7K3000 HDDs [30], for which the stated average seek time is 0.5 ms and the average latency is 4.16 ms. From these numbers we get:

    HDD IOPS = 1000 / (0.5 + 4.16) ≈ 215
    SSD IOPS = 60,000
    Total node IOPS ≈ 60,000 + 3 × 215 ≈ 60,645

Finally, the Amdahl IOPS ratio for EDIM1 is:

    Amdahl IOPS ratio ≈ 60,645 / (6.4 × 10^9 / 50,000) = 60,645 / 128,000 ≈ 0.47

For Eddie, there is only one 250 GB HDD per node, with an average seek time of 2.0 ms and an average latency of 4.2 ms. Calculating the HDD IOPS as before gives:

    HDD IOPS = 1000 / (2.0 + 4.2) ≈ 161

Moreover, the ECDF researchers have found a way to store a large amount of data using a storage architecture that retrieves data from two-tier storage. Taking the IOPS of Eddie's file system quoted in [27] into account, the Amdahl IOPS ratio for a single node reading from all disks is:

    Amdahl IOPS ratio = 0.424

To conclude, the Amdahl IOPS ratios for both systems are not particularly close to one, but they are similar to the average results noted in [10]. However, since Eddie relies on a special storage system to deal with parallel I/O operations, this Amdahl IOPS ratio is not reliable and is negligible compared to the whole system's I/O performance.

Theoretical Amdahl number

The most important of Amdahl's numbers is the value that describes the balance between the I/O throughput and the CPU rate. The Throughput Balance Law (Section 2.2.1) states that, for a system to be considered balanced, it should perform exactly one bit of I/O per second per instruction per second. The theoretical Amdahl number is calculated as [1]:

    Amdahl number = Total throughput (bits/s) / CPU rate    (5)

According to the data sheet [25], the SSD throughput is 265 MB/s. Moreover, each node has three local Hitachi Deskstar 7K3000 HDDs [30], whose stated throughput is 162 MB/s each. From these figures, it is possible to calculate the maximum throughput as:

    Maximum throughput = 265 + 3 × 162 = 751 MB/s

Finally, the theoretical Amdahl number for EDIM1 is:

    Amdahl number = 0.98

For Eddie, on the other hand, the ECDF researchers use a special I/O system based on NAS and GPFS to retrieve data from two-tier storage (described in Section 2.5.1). According to [27], the throughput for Eddie is 3.72 GB/s, giving:

    Amdahl number = 0.416

EDIM1 is almost perfectly balanced, because its Amdahl number is so close to 1.0. In the case of Eddie, this ratio is less accurate, owing to the use of a storage system that depends on multiple components, such as the network and fibre channels, each of which may have bottlenecks. To compute this ratio in practice, [1] wrote a benchmark to determine the Amdahl number of the machine, and further synthetic benchmarks with different I/O patterns will be written during this project. To see whether the sustained bandwidths of the CPU, memory and I/O are of the same order of magnitude, simple benchmarks will also be written during the scope of this project. To make the theoretical numbers clearer, Table 2-2 gathers these values and compares EDIM1's and Eddie's theoretical numbers.

Table 2-2: Machine theoretical values

                                  EDIM1              Eddie
    FLOPS rate                    6.4 GFLOP/s        76.8 GFLOP/s
    IOPS / node                   ~60,645 IOPS       storage system [27]
    Disks peak throughput / node  0.751 GB/s         3.72 GB/s
    Amdahl Memory ratio           1.28               1.28
    Amdahl IOPS ratio             ~0.47              0.424
    Amdahl number                 0.98               0.416

2.5.3 Mythical system

To make the Amdahl number easier to understand, a mythical system is assumed here to show the best balance that a system could have. Because the system is mythical, a set of assumptions is made: the CPU sustains its theoretical flops rate, which in the case of the Atom 330 is 6.4 GFLOP/s, and the budget and space are unlimited. To obtain the full benefit from the CPU, the I/O should manage to deliver one 8-byte operand for every floating point operation, which means the desired throughput is:

    6.4 GFLOP/s × 8 bytes = 51.2 GB/s

Thus, to achieve this throughput, the system needs roughly 321 HDDs or 191 SSDs per node. Alternatively, keeping the 3 HDD : 1 SSD ratio used in EDIM1, the system would need 210 HDDs and 70 SSDs per node. Should the system have this mythical I/O subsystem, the throughput would be 51.3 GB/s, which is what the system requires. Finally, the Amdahl number for this mythical system is:

    Amdahl number = 68.85

This is an illustration of how data intensive a computation could be, in theory. However, it is very unlikely that an application would benefit from this ratio; the reason is that Amdahl's idea of what makes a balanced machine in practice is based on his observations of real-life applications, not on theoretical numbers.

Chapter 3 Benchmarks

3.1 Floating point operations per second

Measuring a computer's Floating Point Operations per Second (FLOP/s) to characterise its overall performance is widely used nowadays. For instance, the TOP500 list is compiled using the flops rate from the LINPACK benchmark, and the list also contains the theoretical peak for each system [31]. Although a computer has other components, such as the I/O, memory, cache and communications, the reason for choosing the flops rate as the main measurement criterion is that floating-point operations are computed intensively and used a great deal in scientific computing [32].

The author noted that, because the data intensive field is relatively new, certain terms are used inconsistently amongst authors. For instance, the term FLOPS could refer to the flops rate, the name of the FLOPS benchmark, the unit of floating point operations per second or the abbreviation for floating point operations. Obviously, maintaining consistency in the terminology is important for avoiding confusion.

3.1.1 FLOPS

Before running the LINPACK stress test, a simple single-threaded C program was identified [33]. In this benchmark there are eight computational kernels, each of which carries out several floating-point operations (Table 3-1).

Table 3-1: The eight kernels in the FLOPS benchmark [1], listing the number of FADD, FSUB, FMUL and FDIV operations and the total for each kernel

Thus, the output from each kernel is a flops rate. The average over all the computational kernels is then calculated, giving the performance of one core of one CPU in the machine. Finally, depending on the machine, the following formula is used to calculate the overall performance [1]:

    Flops rate = number of cores per CPU × number of CPUs in a node × average flops rate from the FLOPS benchmark    (6)

3.1.2 LINPACK

In 1979, Jack Dongarra introduced the LINPACK benchmark, which performs a Gaussian elimination using LU decomposition with partial pivoting [34]. Because of the nature of the algorithm, the benchmark can be considered a stress test for the CPU, the on-chip caches and the main memory [35]. This provides a good estimate of how fast the computer can solve real problems. On most powerful CPUs, the LINPACK flops rate is close to the theoretical flops rate [36] calculated in Section 2.5.2.

Unlike the FLOPS benchmark, the LINPACK benchmark uses all the CPUs and cores available in the system, so there is no need to multiply by the number of CPUs and cores to obtain the flops rate. The benchmark asks the user for the size of the array on which to solve the Gaussian elimination; Table 3-2 compares the size of the array with the amount of data used [37]. Moreover, the number of runs must be specified by the user. Finally, the output of this benchmark is the average number of FLOP/s over all runs.

Table 3-2: The sizes of the array compared to the data used in LINPACK [37] (two columns: size of the array and the corresponding data used in GBytes)

3.1.3 Simple CPU

To help find the system's main bottleneck, a set of simple benchmarks will be written. Together with the established benchmarks above, they provide a clear picture of each device's flops rate and of any possible bottlenecks. These simple benchmarks (Sections 3.1.3, 3.2.2 and 3.3.2) are similar to other widely-used benchmarks, but the reason for writing them is to keep each benchmark as simple as possible, so that the real number of FLOP/s observed is well defined.

Three simple benchmarks are written. The first is a simple CPU benchmark, which follows the same concept as the FLOPS benchmark: a variable of type double that fits in the CPU cache is defined, the time to carry out N floating-point operations on it is measured, and the number of FLOP/s is calculated as:

    CPU FLOP/s = number of operations / time

From the maximum FLOP/s value, it is possible to calculate the maximum data rate from memory that would derive the full benefit of the CPU rate. Depending on the computer architecture (a 64-bit processor is used in the example) and the maximum number of FLOP/s calculated above, this rate can be calculated as [1]:

    Data rate = (64 × maximum flops rate) / divisor    (7)

where the divisor converts bits to GBytes. This is the rate at which the bits of data would have to arrive at the processor to ensure that it could maintain its full flops rate.
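As an illustration of the idea, the following C program is a minimal sketch of such a simple CPU benchmark; it is not the code written for this project. It times N floating-point additions on a volatile double, subtracts the time of an empty loop, and reports the resulting FLOP/s. The operation count N_OPS is an arbitrary placeholder.

/* simple_cpu_sketch.c -- a minimal sketch of a simple CPU benchmark of the
 * kind described above; not the project's actual code. The empty-loop time
 * is subtracted, as discussed later in Section 3.4.1.
 * Build with: gcc -O0 -o simple_cpu simple_cpu_sketch.c
 */
#include <stdio.h>
#include <sys/time.h>

#define N_OPS 100000000L   /* number of floating-point additions (assumed) */

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

int main(void)
{
    volatile double x = 0.0;   /* volatile keeps the adds from being optimised away */
    long i;

    /* Time an empty loop so its overhead can be subtracted */
    double t0 = now();
    for (i = 0; i < N_OPS; i++) { /* empty */ }
    double empty = now() - t0;

    /* Time N_OPS floating-point additions */
    t0 = now();
    for (i = 0; i < N_OPS; i++) {
        x += 1.0;
    }
    double busy = now() - t0;

    double flops = (double)N_OPS / (busy - empty);
    printf("x = %f (printed to defeat optimisation)\n", (double)x);
    printf("CPU rate: %.3f GFLOP/s\n", flops / 1e9);
    return 0;
}

Compiling without optimisation (for example, gcc -O0) keeps both loops intact, so that the empty-loop time can be meaningfully subtracted from the working-loop time.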

3.2 Memory bandwidth

The next main measurement is the memory bandwidth. The memory bandwidth determines how fast the memory can keep up with the flops of the CPU; a fast CPU with a low memory bandwidth therefore leads to a bottleneck. In the following benchmarks, cache effects are not taken into account, because the size of the problem is larger than the cache.

3.2.1 Stream

Another stress test examines the memory rather than the CPU, unlike the previous two benchmarks. The output of this benchmark is the actual data rate that the main memory provides, known as the memory bandwidth [38]. The Stream benchmark was written by John McCalpin and Joe Zagar and uses four operations [39]:

1. Copy:  a[i] = b[i]
2. Scale: a[i] = d * b[i]
3. Sum:   a[i] = b[i] + c[i]
4. Triad: a[i] = b[i] + d * c[i]

These four operations are applied to a number of vectors, which should be much larger than the cache; [40] suggests that the size of each vector should be at least the cache size × number of CPUs × 2. If there is more than one core per CPU, the benchmark should be run on all the cores using the OpenMP version of the benchmark. The benchmark runs a number of times, and the best result is taken. Because of its heavy reliance on memory, the Triad kernel is used to quantify the memory bandwidth [39].

To determine in practice whether or not the memory can deliver data at a speed that avoids a bottleneck, a balance ratio should be calculated; conceptually, it is similar to the Amdahl number. To calculate the balance, one must first calculate how many FLOP/s the memory can support, using the Triad bandwidth from the Stream benchmark:

    Memory FLOP/s = Triad bandwidth (bytes/s) / 8    (8)

The balance ratio is then calculated as described in Section 3.2.2, equation (9).
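The following C sketch illustrates the Triad kernel and the balance arithmetic of equation (8); it is not the Stream benchmark itself, and the array size N is a placeholder that simply needs to be much larger than the cache. It reports the Triad bandwidth and the number of FLOP/s that the memory could feed.

/* triad_sketch.c -- a stripped-down illustration of the Triad kernel and the
 * bandwidth/balance arithmetic described above; it is not the Stream code.
 * Build with: gcc -O2 -fopenmp -o triad triad_sketch.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define N (1 << 25)   /* ~33M doubles per array; placeholder, must exceed the cache */

int main(void)
{
    double *a = malloc(N * sizeof(double));
    double *b = malloc(N * sizeof(double));
    double *c = malloc(N * sizeof(double));
    double d = 3.0;
    long i;

    if (!a || !b || !c) { fprintf(stderr, "allocation failed\n"); return 1; }
    for (i = 0; i < N; i++) { b[i] = 1.0; c[i] = 2.0; }

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for (i = 0; i < N; i++)
        a[i] = b[i] + d * c[i];          /* Triad: a[i] = b[i] + d*c[i] */
    double t = omp_get_wtime() - t0;

    /* Triad touches three arrays of doubles: two reads + one write = 24 bytes/element */
    double bytes = 3.0 * N * sizeof(double);
    double bandwidth = bytes / t;                 /* bytes per second */
    double mem_flops = bandwidth / 8.0;           /* FLOP/s the memory can feed, equation (8) */

    printf("Triad bandwidth : %.2f GB/s\n", bandwidth / 1e9);
    printf("Memory FLOP/s   : %.2f GFLOP/s\n", mem_flops / 1e9);
    free(a); free(b); free(c);
    return 0;
}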

3.2.2 Simple memory

The second simple benchmark is the memory benchmark. In this benchmark, a vector of doubles is defined with a size of four times the L2 cache [39], and a number of threads each add 1 to all the elements of the vector. The time is measured and the memory rate is computed as:

    Memory FLOP/s = (size of vector × number of threads) / time

The balance is then calculated, as mentioned earlier for the Amdahl Memory Law, by the following formula [41]:

    Balance = CPU FLOP/s / Memory FLOP/s    (9)

The best result is a balance of exactly 1.0; this shows that the CPU can retrieve data from memory without being idle.

3.3 File system

For data intensive applications, the I/O is crucial, because these applications deal with massive amounts of data retrieved from disks. Data intensive applications are therefore more likely than others to expose a computer's I/O bottlenecks. For this reason, the I/O tests are important for determining the computer's balance.

3.3.1 IOZone

The most popular and widely-used file system benchmark is IOZone. Using a number of operations, multiple file and record sizes and secondary options, the benchmark determines the file system's performance, the CPU usage and the throughput [42] [43]. To ensure reliability, the entire test should be applied to files larger than the node's main memory [43]. Moreover, only the sequential read is considered here, because reads reflect the main throughput, owing to their high bandwidth and wide usage [1] [32].

3.3.2 Simple I/O

The third simple benchmark is a simple I/O benchmark. It calculates the aggregate throughput by reading a number of files sequentially using a number of threads. The number of threads should be equal to the number of files, because each thread reads one file. The user sets the appropriate buffer size, and the program calculates the throughput as:

    Disk throughput (bytes/s) = bytes read / time
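A minimal sketch of this kind of simple I/O benchmark is shown below; it is not the project's code. One thread is created per file named on the command line, each thread reads its file sequentially into a private buffer, and the aggregate throughput is the total number of bytes read divided by the elapsed time. The 4 MB buffer size is an assumption; in the real benchmark the user chooses it.

/* simple_io_sketch.c -- illustrative sketch of a threaded sequential-read
 * throughput test; not the project's benchmark.
 * Build with: gcc -O2 -pthread -o simple_io simple_io_sketch.c
 * Run as:     ./simple_io file1 file2 ...
 */
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <sys/time.h>

#define BUF_SIZE (4 * 1024 * 1024)   /* 4 MB read buffer (assumed) */

struct job { const char *path; long long bytes; };

static double now(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec * 1e-6;
}

/* Each thread reads one file sequentially and records how many bytes it read */
static void *reader(void *arg)
{
    struct job *j = arg;
    char *buf = malloc(BUF_SIZE);
    FILE *f = fopen(j->path, "rb");
    size_t n;

    j->bytes = 0;
    if (!f || !buf) { free(buf); if (f) fclose(f); return NULL; }
    while ((n = fread(buf, 1, BUF_SIZE, f)) > 0)
        j->bytes += (long long)n;
    fclose(f);
    free(buf);
    return NULL;
}

int main(int argc, char **argv)
{
    int nfiles = argc - 1, i;
    long long total = 0;

    if (nfiles < 1) { fprintf(stderr, "usage: %s file...\n", argv[0]); return 1; }

    pthread_t *tid = malloc(nfiles * sizeof(pthread_t));
    struct job *jobs = malloc(nfiles * sizeof(struct job));

    double t0 = now();
    for (i = 0; i < nfiles; i++) {
        jobs[i].path = argv[i + 1];
        pthread_create(&tid[i], NULL, reader, &jobs[i]);
    }
    for (i = 0; i < nfiles; i++) {
        pthread_join(tid[i], NULL);
        total += jobs[i].bytes;
    }
    double t = now() - t0;

    printf("Aggregate throughput: %.2f MB/s\n", total / t / 1e6);
    return 0;
}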

3.4 Balanced systems

Finally, the concept of a balanced system is an important one in the data intensive computing field. It describes the situation where there is a balance between the computations and the I/O operations, that is, few I/O bottlenecks. There are a number of laws that define balance, one of which is the Throughput Balance Law (Section 2.2.1). The I/O machine balance ratio, better known as the Amdahl number, expresses the I/O balance as the system requiring one bit of I/O per second per instruction per second (Section 2.5.2). In this report, this theoretical value will be measured using an existing benchmark and a set of modified benchmarks that represent both real and synthetic I/O patterns.

3.4.1 Amdahl synthetics benchmarks

Existing benchmark

Written by Kulkarni [1] as an EPCC dissertation in 2011, the existing benchmark mainly calculates the practical value of the Amdahl number using equation (5) in Section 2.5.2. Figure 3-1 shows the basic structure of the benchmark. The main idea is that it declares a set of buffers, one per I/O thread, and two kinds of threads: I/O threads, which read from a file into a buffer, and compute threads, which are responsible for carrying out certain computations on the data read in. The number of I/O threads is the same as the number of files; the number of compute threads is specified by the user.

Figure 3-1: Illustration of Amdahl synthetics benchmarks [1]

A buffer can be in one of three states: free, loading and ready. At the start of the program all buffers are free. When an I/O thread starts reading data from a file into a buffer, the buffer's status changes to loading; when the buffer is filled, its status changes to ready for computation. The compute threads wait for buffers to become ready, carry out the computations chosen by the user, and turn the buffers back to free when finished. In addition, the benchmark decides whether the system is compute-bound or I/O-bound by measuring the idle time of the compute and I/O threads: when the compute threads spend a long time waiting for buffers to become ready, the system is I/O-bound. More information about the algorithm used and details of how to use the benchmark can be found in [1].
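The following C sketch illustrates the free/loading/ready buffer cycle with one I/O thread, one compute thread and a single shared buffer. It is a simplified illustration of the idea only, not the benchmark described in [1], which uses several buffers, files and threads; the "read" here is simulated by filling the buffer with dummy data.

/* buffer_states_sketch.c -- compact illustration of the free/loading/ready
 * buffer cycle; a simplified sketch, not the benchmark in [1].
 * Build with: gcc -O2 -pthread -o buffer_states buffer_states_sketch.c
 */
#include <stdio.h>
#include <pthread.h>

#define BUF_ELEMS (1 << 20)
#define N_BLOCKS  8                 /* number of blocks to "read" and process */

enum state { FREE, LOADING, READY };

static double buffer[BUF_ELEMS];
static enum state buf_state = FREE;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;

/* I/O thread: waits for a FREE buffer, fills it and marks it READY */
static void *io_thread(void *arg)
{
    int block, i;
    (void)arg;
    for (block = 0; block < N_BLOCKS; block++) {
        pthread_mutex_lock(&lock);
        while (buf_state != FREE)
            pthread_cond_wait(&cond, &lock);
        buf_state = LOADING;
        pthread_mutex_unlock(&lock);

        for (i = 0; i < BUF_ELEMS; i++)      /* stands in for reading from a file */
            buffer[i] = (double)i;

        pthread_mutex_lock(&lock);
        buf_state = READY;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

/* Compute thread: waits for a READY buffer, works on it and marks it FREE again */
static void *compute_thread(void *arg)
{
    double sum = 0.0;
    int block, i;
    (void)arg;
    for (block = 0; block < N_BLOCKS; block++) {
        pthread_mutex_lock(&lock);
        while (buf_state != READY)
            pthread_cond_wait(&cond, &lock);
        pthread_mutex_unlock(&lock);

        for (i = 0; i < BUF_ELEMS; i++)      /* the user-chosen computation */
            sum += buffer[i];

        pthread_mutex_lock(&lock);
        buf_state = FREE;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
    }
    printf("checksum: %f\n", sum);
    return NULL;
}

int main(void)
{
    pthread_t io, comp;
    pthread_create(&io, NULL, io_thread, NULL);
    pthread_create(&comp, NULL, compute_thread, NULL);
    pthread_join(io, NULL);
    pthread_join(comp, NULL);
    return 0;
}

Measuring how long each thread spends blocked in the wait loops is, in spirit, how the benchmark decides whether a run is compute-bound or I/O-bound.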

Modified benchmarks

The next set of benchmarks is a modification of the original Amdahl benchmark, in which the I/O patterns are changed. This is used to determine the best number of threads, the best technique and the best buffer sizes to use in a real application in order to reach a good level of balance between the I/O and the compute.

Synthetic benchmark

The synthetic benchmark determines the composition of threads that gives the best Amdahl number (closest to 1.0). It is based on theoretical calculations from the output of the simple I/O benchmark (Section 3.3.2) and the simple CPU benchmark (Section 3.1.3), using different numbers of threads; the main calculation is that of the Amdahl number (Section 2.5.2). Figure 3-2 shows the theoretical flops rates, throughputs and Amdahl numbers for different compositions of threads. Four threads are used because, although any number of threads could be run on the cores, hyper-threading makes each of the node's two cores appear to the operating system as two virtual cores, giving four hardware threads per node.

In Figure 3-2, scenario (a) is useless because no data can reach the application, and (b) is also useless because there are no computations. Scenario (c) wastes I/O because a typical application is unlikely to be able to sustain a throughput of 0.41 GB/s; moreover, this scenario limits the number of FLOP/s that can be carried out. From the fourth possibility, (d), onwards the scenarios look quite balanced: in (d), one could manage just over one flop on each byte read in, if the application can sustain a throughput of 0.30 GB/s. Finally, scenarios (e) and (f) are the most balanced, using one thread for the I/O operations and three compute threads. Using the HDD in (f) gives the lowest throughput of all the scenarios and, as a consequence, the best balance.

To conclude, the closer the Amdahl number is to 1.0, the greater the chance that all components will be put to good use by a typical code. On the other hand, a highly data intensive code benefits from a machine with an Amdahl number greater than one.

Figure 3-2: Different thread compositions, throughputs and Amdahl numbers:
    a. Four compute threads: 1.04 GFLOP/s, 0 GB/s, Amdahl number 0
    b. Four I/O threads: 0 GFLOP/s, Amdahl number infinite
    c. One compute thread and 3 I/O threads: throughput 0.41 GB/s
    d. Two compute threads and 2 I/O threads: 0.39 GFLOP/s, 0.30 GB/s, Amdahl number 6.66
    e. Three compute threads and 1 I/O thread (SSD): 0.95 GFLOP/s, 0.25 GB/s, Amdahl number 2.24
    f. Three compute threads and 1 I/O thread (HDD): throughput 0.15 GB/s, Amdahl number 1.34

A synthetic benchmark is written to take these results from theory to a practical application. The benchmark defines two types of threads: I/O and compute. The number of I/O threads is the same as the number of files, while the number of compute threads is the total number of threads requested minus the number of I/O threads. For each I/O thread there is a corresponding buffer, whose size is entered by the user, which the thread fills repeatedly until it has read the entire corresponding file. The compute threads carry out an operation (floating point addition, for simplicity) the number of times specified by the user.

Real world I/O patterns benchmark

The real world I/O pattern benchmark calculates the balance, using the Amdahl number, and the performance of the system, using I/O patterns similar to those found in real-world applications. The I/O is the medium that connects the application with the real world. In traditional applications, I/O operations were used only at the beginning of a program, to read the data from a file, and at the end, to write the results to a file. However, the reliability of such programs decreases with time because of the increasing likelihood of processor faults. To increase their reliability, it is necessary to write results at some points before the end of the program so that these faults can be tolerated; these intermediate writes are widely known as checkpoints. The main role of checkpoints is to allow the application to be re-started at a given point, for example if an interesting event happens in the simulation. Moreover, the application can be re-started with different parameters for many purposes, such as simulating from that point on in more detail or with more instrumentation; data dumps for visualisation; partial results to allow computational steering; or an output of the system state at a fixed point in simulation time so that dynamic properties can be measured later.

Here, different techniques and their effects are measured to find the most balanced and highest-performing I/O patterns to recommend for use on the machine. These techniques are checkpoints and barriers. Figure 3-3 shows the different I/O patterns considered within the scope of this benchmark. In Figure 3-3, pattern (a) uses both barriers and checkpoints, so it writes twice to the files, with all threads writing at the same time; (b) uses barriers but no checkpoints, so it writes only once to the files. The third possibility, (c), uses checkpoints but no barriers, so it writes twice to the files, at different times for different threads, and the threads therefore finish at different times. Finally, (d) uses neither barriers nor checkpoints, which is the simplest of the scenarios.

Figure 3-3: Different real program patterns:
    a. With checkpoint, with barriers
    b. Without checkpoint, with barriers
    c. With checkpoint, without barriers
    d. Without checkpoint, without barriers
(each pattern reads its input files and writes its output files as described below)

All the patterns begin by reading from the number of files entered by the user; the number of files should be the same as the number of threads. Barriers (patterns a and b) and a checkpoint (patterns a and c) are taken into account as options in this benchmark to differentiate between the patterns. These two options are requested by the user by running the program with --f = 1 to add barriers and --c = 1 to add checkpoints (with 0 disabling the option). For example, to obtain a run that uses a checkpoint but no barriers (scenario c), the user should choose --f = 0 and --c = 1.

The benchmark begins by reading a buffer's worth of data from a file and carrying out a set of operations on the buffer. Next, depending on whether or not the user has requested barriers or checkpoints between the computations, the benchmark takes them into account. Finally, the output is written to separate files: each thread writes its own results to a file named output_[thread_id], where thread_id is the number of the thread, starting from zero. A set of results is then calculated, one of them being the Amdahl number. Figure 3-4 illustrates the algorithm of the real-world I/O pattern benchmark.

In all of the benchmarks developed during the scope of this project, precision in measuring time was an important consideration. To achieve acceptable precision, the time for an empty loop (compiled without optimisations) was subtracted from the time measured for a loop carrying out a number of floating point operations. Moreover, cache effects were turned off by running the code with special flags, and the volatile keyword was applied to the variable being measured to prevent the compiler from optimising it away. The simple benchmarks are run with a number of threads, depending on the machine, and a number of times determined by the user; again, the best time is taken.

Figure 3-4: Algorithm for the real program I/O pattern benchmark
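To make the read/compute/checkpoint/barrier/write cycle of Figure 3-4 concrete, the C sketch below shows one possible skeleton; it is a sketch under assumptions, not the project's benchmark. The --f and --c options are replaced by two compile-time constants, the per-thread input files are assumed to be named input_0, input_1, and so on, and the "computations" are trivial placeholders.

/* io_pattern_sketch.c -- a skeleton of the read/compute/checkpoint/barrier/write
 * cycle shown in Figure 3-4; an illustrative sketch only.
 * Build with: gcc -O2 -pthread -o io_pattern io_pattern_sketch.c
 */
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

#define N_THREADS    4
#define BUF_SIZE     (1 << 20)
#define USE_BARRIER  1      /* stands in for --f = 1 */
#define USE_CHECKPT  1      /* stands in for --c = 1 */

static pthread_barrier_t barrier;

struct task { int id; };

static void write_file(const char *prefix, int id, const char *buf, size_t n)
{
    char name[64];
    snprintf(name, sizeof(name), "%s_%d", prefix, id);
    FILE *f = fopen(name, "wb");
    if (f) { fwrite(buf, 1, n, f); fclose(f); }
}

static void *worker(void *arg)
{
    struct task *t = arg;
    char name[64], *buf = malloc(BUF_SIZE);
    size_t n = 0, i;

    /* Read: one input file per thread (assumed names input_0, input_1, ...) */
    snprintf(name, sizeof(name), "input_%d", t->id);
    FILE *f = fopen(name, "rb");
    if (f) { n = fread(buf, 1, BUF_SIZE, f); fclose(f); }

    /* First compute phase: a placeholder operation on the buffer */
    for (i = 0; i < n; i++)
        buf[i] = (char)(buf[i] + 1);

    if (USE_BARRIER)                       /* all threads reach this point together */
        pthread_barrier_wait(&barrier);

    if (USE_CHECKPT)                       /* intermediate dump for restart */
        write_file("checkpoint", t->id, buf, n);

    /* Second compute phase, then the final result */
    for (i = 0; i < n; i++)
        buf[i] = (char)(buf[i] ^ 0x5a);

    if (USE_BARRIER)
        pthread_barrier_wait(&barrier);

    write_file("output", t->id, buf, n);   /* each thread writes output_<id> */
    free(buf);
    return NULL;
}

int main(void)
{
    pthread_t tid[N_THREADS];
    struct task tasks[N_THREADS];
    int i;

    pthread_barrier_init(&barrier, NULL, N_THREADS);
    for (i = 0; i < N_THREADS; i++) {
        tasks[i].id = i;
        pthread_create(&tid[i], NULL, worker, &tasks[i]);
    }
    for (i = 0; i < N_THREADS; i++)
        pthread_join(tid[i], NULL);
    pthread_barrier_destroy(&barrier);
    return 0;
}

With USE_BARRIER and USE_CHECKPT both set to 1 this corresponds to pattern (a); setting both to 0 gives the simplest pattern, (d).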


More information

How To Speed Up A Flash Flash Storage System With The Hyperq Memory Router

How To Speed Up A Flash Flash Storage System With The Hyperq Memory Router HyperQ Hybrid Flash Storage Made Easy White Paper Parsec Labs, LLC. 7101 Northland Circle North, Suite 105 Brooklyn Park, MN 55428 USA 1-763-219-8811 www.parseclabs.com info@parseclabs.com sales@parseclabs.com

More information

DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION

DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION A DIABLO WHITE PAPER AUGUST 2014 Ricky Trigalo Director of Business Development Virtualization, Diablo Technologies

More information

Hadoop on the Gordon Data Intensive Cluster

Hadoop on the Gordon Data Intensive Cluster Hadoop on the Gordon Data Intensive Cluster Amit Majumdar, Scientific Computing Applications Mahidhar Tatineni, HPC User Services San Diego Supercomputer Center University of California San Diego Dec 18,

More information

Performance Beyond PCI Express: Moving Storage to The Memory Bus A Technical Whitepaper

Performance Beyond PCI Express: Moving Storage to The Memory Bus A Technical Whitepaper : Moving Storage to The Memory Bus A Technical Whitepaper By Stephen Foskett April 2014 2 Introduction In the quest to eliminate bottlenecks and improve system performance, the state of the art has continually

More information

IPRO ecapture Performance Report using BlueArc Titan Network Storage System

IPRO ecapture Performance Report using BlueArc Titan Network Storage System IPRO ecapture Performance Report using BlueArc Titan Network Storage System Allen Yen, BlueArc Corp Jesse Abrahams, IPRO Tech, Inc Introduction IPRO ecapture is an e-discovery application designed to handle

More information

IT of SPIM Data Storage and Compression. EMBO Course - August 27th! Jeff Oegema, Peter Steinbach, Oscar Gonzalez

IT of SPIM Data Storage and Compression. EMBO Course - August 27th! Jeff Oegema, Peter Steinbach, Oscar Gonzalez IT of SPIM Data Storage and Compression EMBO Course - August 27th Jeff Oegema, Peter Steinbach, Oscar Gonzalez 1 Talk Outline Introduction and the IT Team SPIM Data Flow Capture, Compression, and the Data

More information

Analysis of VDI Storage Performance During Bootstorm

Analysis of VDI Storage Performance During Bootstorm Analysis of VDI Storage Performance During Bootstorm Introduction Virtual desktops are gaining popularity as a more cost effective and more easily serviceable solution. The most resource-dependent process

More information

ioscale: The Holy Grail for Hyperscale

ioscale: The Holy Grail for Hyperscale ioscale: The Holy Grail for Hyperscale The New World of Hyperscale Hyperscale describes new cloud computing deployments where hundreds or thousands of distributed servers support millions of remote, often

More information

1 Storage Devices Summary

1 Storage Devices Summary Chapter 1 Storage Devices Summary Dependability is vital Suitable measures Latency how long to the first bit arrives Bandwidth/throughput how fast does stuff come through after the latency period Obvious

More information

Flash Memory Arrays Enabling the Virtualized Data Center. July 2010

Flash Memory Arrays Enabling the Virtualized Data Center. July 2010 Flash Memory Arrays Enabling the Virtualized Data Center July 2010 2 Flash Memory Arrays Enabling the Virtualized Data Center This White Paper describes a new product category, the flash Memory Array,

More information

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software WHITEPAPER Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software SanDisk ZetaScale software unlocks the full benefits of flash for In-Memory Compute and NoSQL applications

More information

HP Z Turbo Drive PCIe SSD

HP Z Turbo Drive PCIe SSD Performance Evaluation of HP Z Turbo Drive PCIe SSD Powered by Samsung XP941 technology Evaluation Conducted Independently by: Hamid Taghavi Senior Technical Consultant June 2014 Sponsored by: P a g e

More information

Sun 8Gb/s Fibre Channel HBA Performance Advantages for Oracle Database

Sun 8Gb/s Fibre Channel HBA Performance Advantages for Oracle Database Performance Advantages for Oracle Database At a Glance This Technical Brief illustrates that even for smaller online transaction processing (OLTP) databases, the Sun 8Gb/s Fibre Channel Host Bus Adapter

More information

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK Steve Oberlin CTO, Accelerated Computing US to Build Two Flagship Supercomputers SUMMIT SIERRA Partnership for Science 100-300 PFLOPS Peak Performance

More information

IT Platforms for Utilization of Big Data

IT Platforms for Utilization of Big Data Hitachi Review Vol. 63 (2014), No. 1 46 IT Platforms for Utilization of Big Yasutomo Yamamoto OVERVIEW: The growing momentum behind the utilization of big in social and corporate activity has created a

More information

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

GPU File System Encryption Kartik Kulkarni and Eugene Linkov GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through

More information

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage

Parallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage Parallel Computing Benson Muite benson.muite@ut.ee http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework

More information

Amadeus SAS Specialists Prove Fusion iomemory a Superior Analysis Accelerator

Amadeus SAS Specialists Prove Fusion iomemory a Superior Analysis Accelerator WHITE PAPER Amadeus SAS Specialists Prove Fusion iomemory a Superior Analysis Accelerator 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com SAS 9 Preferred Implementation Partner tests a single Fusion

More information

Cluster Computing at HRI

Cluster Computing at HRI Cluster Computing at HRI J.S.Bagla Harish-Chandra Research Institute, Chhatnag Road, Jhunsi, Allahabad 211019. E-mail: jasjeet@mri.ernet.in 1 Introduction and some local history High performance computing

More information

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview

More information

Communicating with devices

Communicating with devices Introduction to I/O Where does the data for our CPU and memory come from or go to? Computers communicate with the outside world via I/O devices. Input devices supply computers with data to operate on.

More information

HP Smart Array Controllers and basic RAID performance factors

HP Smart Array Controllers and basic RAID performance factors Technical white paper HP Smart Array Controllers and basic RAID performance factors Technology brief Table of contents Abstract 2 Benefits of drive arrays 2 Factors that affect performance 2 HP Smart Array

More information

Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

More information

CS 147: Computer Systems Performance Analysis

CS 147: Computer Systems Performance Analysis CS 147: Computer Systems Performance Analysis CS 147: Computer Systems Performance Analysis 1 / 39 Overview Overview Overview What is a Workload? Instruction Workloads Synthetic Workloads Exercisers and

More information

Advanced Knowledge and Understanding of Industrial Data Storage

Advanced Knowledge and Understanding of Industrial Data Storage Dec. 3 rd 2013 Advanced Knowledge and Understanding of Industrial Data Storage By Jesse Chuang, Senior Software Manager, Advantech With the popularity of computers and networks, most enterprises and organizations

More information

Microsoft Exchange Server 2003 Deployment Considerations

Microsoft Exchange Server 2003 Deployment Considerations Microsoft Exchange Server 3 Deployment Considerations for Small and Medium Businesses A Dell PowerEdge server can provide an effective platform for Microsoft Exchange Server 3. A team of Dell engineers

More information

IOmark- VDI. Nimbus Data Gemini Test Report: VDI- 130906- a Test Report Date: 6, September 2013. www.iomark.org

IOmark- VDI. Nimbus Data Gemini Test Report: VDI- 130906- a Test Report Date: 6, September 2013. www.iomark.org IOmark- VDI Nimbus Data Gemini Test Report: VDI- 130906- a Test Copyright 2010-2013 Evaluator Group, Inc. All rights reserved. IOmark- VDI, IOmark- VDI, VDI- IOmark, and IOmark are trademarks of Evaluator

More information

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM Sneha D.Borkar 1, Prof.Chaitali S.Surtakar 2 Student of B.E., Information Technology, J.D.I.E.T, sborkar95@gmail.com Assistant Professor, Information

More information

Overview of HPC Resources at Vanderbilt

Overview of HPC Resources at Vanderbilt Overview of HPC Resources at Vanderbilt Will French Senior Application Developer and Research Computing Liaison Advanced Computing Center for Research and Education June 10, 2015 2 Computing Resources

More information

Improve Business Productivity and User Experience with a SanDisk Powered SQL Server 2014 In-Memory OLTP Database

Improve Business Productivity and User Experience with a SanDisk Powered SQL Server 2014 In-Memory OLTP Database WHITE PAPER Improve Business Productivity and User Experience with a SanDisk Powered SQL Server 2014 In-Memory OLTP Database 951 SanDisk Drive, Milpitas, CA 95035 www.sandisk.com Table of Contents Executive

More information

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging

More information

APPENDIX 1 USER LEVEL IMPLEMENTATION OF PPATPAN IN LINUX SYSTEM

APPENDIX 1 USER LEVEL IMPLEMENTATION OF PPATPAN IN LINUX SYSTEM 152 APPENDIX 1 USER LEVEL IMPLEMENTATION OF PPATPAN IN LINUX SYSTEM A1.1 INTRODUCTION PPATPAN is implemented in a test bed with five Linux system arranged in a multihop topology. The system is implemented

More information

Big Data Challenges in Bioinformatics

Big Data Challenges in Bioinformatics Big Data Challenges in Bioinformatics BARCELONA SUPERCOMPUTING CENTER COMPUTER SCIENCE DEPARTMENT Autonomic Systems and ebusiness Pla?orms Jordi Torres Jordi.Torres@bsc.es Talk outline! We talk about Petabyte?

More information

The Green Index: A Metric for Evaluating System-Wide Energy Efficiency in HPC Systems

The Green Index: A Metric for Evaluating System-Wide Energy Efficiency in HPC Systems 202 IEEE 202 26th IEEE International 26th International Parallel Parallel and Distributed and Distributed Processing Processing Symposium Symposium Workshops Workshops & PhD Forum The Green Index: A Metric

More information

Qsan Document - White Paper. Performance Monitor Case Studies

Qsan Document - White Paper. Performance Monitor Case Studies Qsan Document - White Paper Performance Monitor Case Studies Version 1.0 November 2014 Copyright Copyright@2004~2014, Qsan Technology, Inc. All rights reserved. No part of this document may be reproduced

More information

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.

More information

Multi-Threading Performance on Commodity Multi-Core Processors

Multi-Threading Performance on Commodity Multi-Core Processors Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction

More information

Energy aware RAID Configuration for Large Storage Systems

Energy aware RAID Configuration for Large Storage Systems Energy aware RAID Configuration for Large Storage Systems Norifumi Nishikawa norifumi@tkl.iis.u-tokyo.ac.jp Miyuki Nakano miyuki@tkl.iis.u-tokyo.ac.jp Masaru Kitsuregawa kitsure@tkl.iis.u-tokyo.ac.jp Abstract

More information

SAS Business Analytics. Base SAS for SAS 9.2

SAS Business Analytics. Base SAS for SAS 9.2 Performance & Scalability of SAS Business Analytics on an NEC Express5800/A1080a (Intel Xeon 7500 series-based Platform) using Red Hat Enterprise Linux 5 SAS Business Analytics Base SAS for SAS 9.2 Red

More information

Can High-Performance Interconnects Benefit Memcached and Hadoop?

Can High-Performance Interconnects Benefit Memcached and Hadoop? Can High-Performance Interconnects Benefit Memcached and Hadoop? D. K. Panda and Sayantan Sur Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University,

More information

Storage Architecture in XProtect Corporate

Storage Architecture in XProtect Corporate Milestone Systems White Paper Storage Architecture in XProtect Corporate Prepared by: John Rasmussen Product Manager XProtect Corporate Milestone Systems Date: 7 February, 2011 Page 1 of 20 Table of Contents

More information

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB Planet Size Data!? Gartner s 10 key IT trends for 2012 unstructured data will grow some 80% over the course of the next

More information

Evaluating HDFS I/O Performance on Virtualized Systems

Evaluating HDFS I/O Performance on Virtualized Systems Evaluating HDFS I/O Performance on Virtualized Systems Xin Tang xtang@cs.wisc.edu University of Wisconsin-Madison Department of Computer Sciences Abstract Hadoop as a Service (HaaS) has received increasing

More information

Oracle Database Scalability in VMware ESX VMware ESX 3.5

Oracle Database Scalability in VMware ESX VMware ESX 3.5 Performance Study Oracle Database Scalability in VMware ESX VMware ESX 3.5 Database applications running on individual physical servers represent a large consolidation opportunity. However enterprises

More information

PARALLELS CLOUD STORAGE

PARALLELS CLOUD STORAGE PARALLELS CLOUD STORAGE Performance Benchmark Results 1 Table of Contents Executive Summary... Error! Bookmark not defined. Architecture Overview... 3 Key Features... 5 No Special Hardware Requirements...

More information

Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer

Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer Cluster Scalability of ANSYS FLUENT 12 for a Large Aerodynamics Case on the Darwin Supercomputer Stan Posey, MSc and Bill Loewe, PhD Panasas Inc., Fremont, CA, USA Paul Calleja, PhD University of Cambridge,

More information

The Evolution of Microsoft SQL Server: The right time for Violin flash Memory Arrays

The Evolution of Microsoft SQL Server: The right time for Violin flash Memory Arrays The Evolution of Microsoft SQL Server: The right time for Violin flash Memory Arrays Executive Summary Microsoft SQL has evolved beyond serving simple workgroups to a platform delivering sophisticated

More information

www.thinkparq.com www.beegfs.com

www.thinkparq.com www.beegfs.com www.thinkparq.com www.beegfs.com KEY ASPECTS Maximum Flexibility Maximum Scalability BeeGFS supports a wide range of Linux distributions such as RHEL/Fedora, SLES/OpenSuse or Debian/Ubuntu as well as a

More information

POSIX and Object Distributed Storage Systems

POSIX and Object Distributed Storage Systems 1 POSIX and Object Distributed Storage Systems Performance Comparison Studies With Real-Life Scenarios in an Experimental Data Taking Context Leveraging OpenStack Swift & Ceph by Michael Poat, Dr. Jerome

More information

Rackspace Cloud Databases and Container-based Virtualization

Rackspace Cloud Databases and Container-based Virtualization Rackspace Cloud Databases and Container-based Virtualization August 2012 J.R. Arredondo @jrarredondo Page 1 of 6 INTRODUCTION When Rackspace set out to build the Cloud Databases product, we asked many

More information

2009 Oracle Corporation 1

2009 Oracle Corporation 1 The following is intended to outline our general product direction. It is intended for information purposes only, and may not be incorporated into any contract. It is not a commitment to deliver any material,

More information

Architecting High-Speed Data Streaming Systems. Sujit Basu

Architecting High-Speed Data Streaming Systems. Sujit Basu Architecting High-Speed Data Streaming Systems Sujit Basu stream ing [stree-ming] verb 1. The act of transferring data to or from an instrument at a rate high enough to sustain continuous acquisition or

More information

White Paper. Recording Server Virtualization

White Paper. Recording Server Virtualization White Paper Recording Server Virtualization Prepared by: Mike Sherwood, Senior Solutions Engineer Milestone Systems 23 March 2011 Table of Contents Introduction... 3 Target audience and white paper purpose...

More information

Infrastructure Matters: POWER8 vs. Xeon x86

Infrastructure Matters: POWER8 vs. Xeon x86 Advisory Infrastructure Matters: POWER8 vs. Xeon x86 Executive Summary This report compares IBM s new POWER8-based scale-out Power System to Intel E5 v2 x86- based scale-out systems. A follow-on report

More information

MaxDeploy Ready. Hyper- Converged Virtualization Solution. With SanDisk Fusion iomemory products

MaxDeploy Ready. Hyper- Converged Virtualization Solution. With SanDisk Fusion iomemory products MaxDeploy Ready Hyper- Converged Virtualization Solution With SanDisk Fusion iomemory products MaxDeploy Ready products are configured and tested for support with Maxta software- defined storage and with

More information

Hadoop: Embracing future hardware

Hadoop: Embracing future hardware Hadoop: Embracing future hardware Suresh Srinivas @suresh_m_s Page 1 About Me Architect & Founder at Hortonworks Long time Apache Hadoop committer and PMC member Designed and developed many key Hadoop

More information

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

More information

D1.2 Network Load Balancing

D1.2 Network Load Balancing D1. Network Load Balancing Ronald van der Pol, Freek Dijkstra, Igor Idziejczak, and Mark Meijerink SARA Computing and Networking Services, Science Park 11, 9 XG Amsterdam, The Netherlands June ronald.vanderpol@sara.nl,freek.dijkstra@sara.nl,

More information

Can Flash help you ride the Big Data Wave? Steve Fingerhut Vice President, Marketing Enterprise Storage Solutions Corporation

Can Flash help you ride the Big Data Wave? Steve Fingerhut Vice President, Marketing Enterprise Storage Solutions Corporation Can Flash help you ride the Big Data Wave? Steve Fingerhut Vice President, Marketing Enterprise Storage Solutions Corporation Forward-Looking Statements During our meeting today we may make forward-looking

More information

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011 Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis

More information

Microsoft SQL Server 2014 Fast Track

Microsoft SQL Server 2014 Fast Track Microsoft SQL Server 2014 Fast Track 34-TB Certified Data Warehouse 103-TB Maximum User Data Tegile Systems Solution Review 2U Design: Featuring Tegile T3800 All-Flash Storage Array http:// www.tegile.com/solutiuons/sql

More information

GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and MinuteSort at Yahoo on Hadoop 0.23 GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

More information

Certification Document bluechip STORAGEline R54300s NAS-Server 03/06/2014. bluechip STORAGEline R54300s NAS-Server system

Certification Document bluechip STORAGEline R54300s NAS-Server 03/06/2014. bluechip STORAGEline R54300s NAS-Server system bluechip STORAGEline R54300s NAS-Server system Executive summary After performing all tests, the Certification Document bluechip STORAGEline R54300s NAS-Server system has been officially certified according

More information

HPC performance applications on Virtual Clusters

HPC performance applications on Virtual Clusters Panagiotis Kritikakos EPCC, School of Physics & Astronomy, University of Edinburgh, Scotland - UK pkritika@epcc.ed.ac.uk 4 th IC-SCCE, Athens 7 th July 2010 This work investigates the performance of (Java)

More information

Amazon EC2 Product Details Page 1 of 5

Amazon EC2 Product Details Page 1 of 5 Amazon EC2 Product Details Page 1 of 5 Amazon EC2 Functionality Amazon EC2 presents a true virtual computing environment, allowing you to use web service interfaces to launch instances with a variety of

More information

Cisco UCS and Fusion- io take Big Data workloads to extreme performance in a small footprint: A case study with Oracle NoSQL database

Cisco UCS and Fusion- io take Big Data workloads to extreme performance in a small footprint: A case study with Oracle NoSQL database Cisco UCS and Fusion- io take Big Data workloads to extreme performance in a small footprint: A case study with Oracle NoSQL database Built up on Cisco s big data common platform architecture (CPA), a

More information

Enabling High performance Big Data platform with RDMA

Enabling High performance Big Data platform with RDMA Enabling High performance Big Data platform with RDMA Tong Liu HPC Advisory Council Oct 7 th, 2014 Shortcomings of Hadoop Administration tooling Performance Reliability SQL support Backup and recovery

More information

Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture

Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture He Huang, Shanshan Li, Xiaodong Yi, Feng Zhang, Xiangke Liao and Pan Dong School of Computer Science National

More information

Evaluation Report: HP Blade Server and HP MSA 16GFC Storage Evaluation

Evaluation Report: HP Blade Server and HP MSA 16GFC Storage Evaluation Evaluation Report: HP Blade Server and HP MSA 16GFC Storage Evaluation Evaluation report prepared under contract with HP Executive Summary The computing industry is experiencing an increasing demand for

More information

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop

More information

Types of Workloads. Raj Jain. Washington University in St. Louis

Types of Workloads. Raj Jain. Washington University in St. Louis Types of Workloads Raj Jain Washington University in Saint Louis Saint Louis, MO 63130 Jain@cse.wustl.edu These slides are available on-line at: http://www.cse.wustl.edu/~jain/cse567-08/ 4-1 Overview!

More information