Reducing Simulation Time by Parallelizing SimpleScalar in MPI Through The Use of SimPoint Generated Intervals
James Michael Poe II 1, Fernando Hernandez 2, and Tao Li 3
Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL
{1jpoe, 2fhernand}@ufl.edu, 3taoli@ece.ufl.edu

Abstract

Cycle-accurate simulation is an essential tool in the evaluation and design exploration of modern computer architectures. Due to increasing complexity, additional critical constraints, and an ever-expanding design space, cycle-level simulation time is growing at an unprecedented rate. A parallel approach to simulation at first appears to be an obvious choice; however, previous attempts to parallelize simulators have been largely unsuccessful due to the sequential nature of the problem, often resorting instead to parallelizing multiple threads or simulating multiple processors. With recent advances in phase analysis, however, using SimPoint to choose simulation points can break a single thread of execution into multiple representative simulation intervals that are entirely independent. In this paper, we use SimPoint simulation points to demonstrate a parallel approach to architecture simulation using MPI (Message Passing Interface) on a cluster of computers. Our preliminary experimental results show that this approach greatly reduces simulation time while maintaining the high levels of accuracy that SimPoint provides.

1. Introduction

Cycle-accurate simulators play a vital role in the design and evaluation of modern computer architectures. A recent journal article reported that in the four major conferences in computer architecture (MICRO, ISCA, ASPLOS, and HPCA), cycle-precise simulation was used in 72 out of 103 papers [1]. As design complexity and application size continue to increase, so does the simulation time required for accurate results.
Lengthy simulation periods not only delay the validation and reporting of results to the rest of the architecture community, but also often force designers either to choose faster yet less accurate models or to focus on a small fraction of program execution. Much of architecture design is a trial-and-error process, and the number of trials completed is often directly dependent upon the time it takes to complete those trials. Thus, it is imperative to explore methods that decrease simulation time without forfeiting much of the accuracy.

At first, a parallel approach to simulation seems like an obvious choice, as is the case with any computationally intensive problem. This presents difficulty, however, due to the inherently sequential nature of simulating a single thread of execution. With recent advances in phase analysis [2], it has become possible to break a single thread of execution into multiple representative intervals that, when weighted appropriately and combined, provide an accurate model of the entire execution of the workload. These intervals are entirely independent and thus can be simulated in parallel, reducing simulation time.

As the price of commercial off-the-shelf (COTS) systems continues to fall, and the performance of high-speed interconnects continues to rise, clusters provide an appealing low-cost platform for parallel computing. Various methodologies and software packages can be used to parallelize tasks on clusters, and a popular choice is the Message Passing Interface (MPI) standard [3]. MPI can be used with both C and FORTRAN, and MPI functionality can be conveniently added to existing programs through calls to the MPI library. While MPI and clustered systems share many advantages, both are limited when heavy communication is required, due to the latency and overhead of sending and receiving messages.
Therefore, to use a clustered MPI system effectively as a parallelization vehicle, programs usually require a very low communication-to-computation ratio, a characteristic easily attainable in a parallel architectural simulator. In this paper we showcase a parallel simulation framework built with minimal additions and modifications to a cycle-level simulator (SimpleScalar's sim-outorder [4]), MPI, and simulation points generated by the SimPoint algorithm [5].
Experimental results show that, when compared to running an entire benchmark, this method provides an excellent simulation speedup, and when compared to using SimPoint intervals on a single-processor machine, it shows a good, predictable speedup, all while maintaining high levels of accuracy.

This paper is organized as follows: Section 2 briefly describes background information on the process flow of parallelizing simulations through SimPoint. Section 3 describes the experimental methodology and implementation issues. Section 4 presents the simulation results. Section 5 discusses related work. Finally, Section 6 concludes the paper and outlines future work.

2. Background Information

The process of simulating multiple simulation points in parallel begins with the generation of those points for a particular benchmark. This is done by first generating a basic block vector of the benchmark, which can be performed using the Sim-Fast BBV Tracker [6], the ATOM BBV Tracker [7], or another basic block vector profiling program of the user's choice. Once the basic block vector has been generated, it is fed into the SimPoint phase analysis software, which generates the representative simulation points and their corresponding weighting factors [8]. Once these simulation points and weights have been generated, the modified cycle-level simulator reads in the values, simulates the different points in parallel, and writes the results to file.

In the method described above, the generation of simulation points must be performed sequentially. Therefore, the time it takes to generate the basic block vector and compute the representative simulation points must be taken into consideration when evaluating the efficiency of this parallel approach to simulation. These points, however, only need to be computed once for any benchmark/input combination.
Thus the time to calculate the basic block vector, roughly the time it takes to fast forward through the entire benchmark once (about 1/10th of the time required for cycle-level simulation in the SimpleScalar suite), and the time to generate the simulation points using SimPoint (a relatively short amount of time even for the largest SPEC benchmarks with 100M intervals), can be amortized over multiple runs of the parallel simulator. Further, these intervals have already been generated for most of the SPEC2000 benchmarks, are available from the SimPoint website, and indeed were used for all of the results reported in this paper.

3. Implementation

In this section we provide a brief introduction to the simulator that was chosen and a summary of the modifications that were necessary to provide parallel execution of simulation points.

3.1 Base Simulator

The simulator chosen for modification was the sim-outorder cycle-level simulator from the SimpleScalar suite [4]. While parallelization can be attempted with any simulator, the SimpleScalar suite was chosen because it provides most of the functionality needed to natively take advantage of the SimPoint-generated simulation points, has been exhaustively tested and verified, and is widely used in the architecture community. Message Passing Interface code was used to parallelize the simulator, as the modified simulator only needs to communicate twice (once to disburse the intervals and once to recombine the results).

Several modifications to sim-outorder were necessary to allow parallel simulation. First, the internal fast-forward variable had to be modified to allow fast functional simulation past 2.1 billion instructions, the 32-bit signed integer limit. The sim-outorder code was then extended to accept a new argument, interval, which the user supplies to specify the interval size.
Since the master process (the process responsible for the generation and distribution of tasks) now determines and disseminates the fast-forward and maximum-instruction variables to the slaves (the processes that carry out the work assigned by the master), these arguments are no longer accepted from the user.

3.2 Interval Allocation

One of the most important decisions in the design of any parallel program is how tasks will be allocated to the processors, and this is certainly true in the design of a parallel simulator. Since all of the intervals that need to be simulated are available before the allocation, and the time it takes each interval to complete is based upon the distance that must be fast-forwarded, an optimal allocation can be predetermined, and thus static assignment is preferred over dynamic. This allows the master process to disseminate the tasks and then simulate its own intervals. The algorithm used to assign the intervals to individual processes is as follows:

1. Arrange the simulation points from the farthest away to the closest.
2. Assign the farthest unassigned point to the process with the minimum point-distance sum, taking into account the time needed to complete the cycle-precise simulation of that point.
3. Repeat until all points have been assigned.

Table 1: The result of the allocation strategy for a benchmark with 9 intervals using 3 processors.

In the table above, each column represents the intervals assigned to each respective processor, and the bottom row displays the sum of the intervals, which is directly proportional to the simulation time. To arrive at this distribution, it is important to first realize that simulation time comprises two parts: the time spent fast forwarding through the preceding code sections, and the time spent performing a detailed simulation on the target interval. Since the time to fast forward through a given block of instructions is not the same as the time needed to perform a detailed simulation on the same block, an offset must be added to each interval that represents the added time spent performing the detailed simulation when the target interval is reached. From our experimental results, it was determined that detailed simulation is between 7 and 13 times slower than fast forwarding, so intervals are assigned assuming they are 15 units larger. The algorithm begins by assigning intervals 840, 835, and 834. Interval 526 is then assigned on processor 3, because at this point it is the processor with the smallest interval sum. Intervals 491 and 277 are then assigned by the same method, and intervals 13, 8, and 1 are all assigned to the first processor because, although it already has two intervals, its total interval sum is smaller than any other's.
3.3 Distribution, Simulation, and Recombination

After the allocation of the simulation points has been determined, the results are saved into a three-dimensional double array with the following structure:

    simpoint[x][y][z]

The first dimension (x) designates the processor that will simulate the points; each point is stored in a row indexed by the second dimension (y). The third dimension (z) holds both the simulation point and its corresponding weight. After this array has been populated, each processor is sent its simulation points:

    for (m = 1; m < ntasks; m++) {
        MPI_Send(&simpoint[m], 2 * MAX_SIMPOINTS, MPI_DOUBLE,
                 m, tag, MPI_COMM_WORLD);
    }

where ntasks is the number of processors obtained using the MPI_Comm_size function, and MAX_SIMPOINTS is the maximum number of possible simulation points. The for loop begins at 1 because the master has saved its own simulation points, to be calculated itself, in the 0th level of the three-dimensional array. Each slave then issues MPI_Recv and stores the result into the 0th level of its own array:

    MPI_Recv(&simpoint[0], 2 * MAX_SIMPOINTS, MPI_DOUBLE,
             0, tag, MPI_COMM_WORLD, &status);

From this point on, all processes can access their respective simulation points by referencing the same level of the array. The simulator then loops through its simulation points until all have been processed, resetting the internal state of the simulator and updating the fast-forward variable on each iteration. When each slave process has completed the simulation of all of its points, it notifies the master. Once the master has completed all of its own points, it awaits the results from the slave processes; after all of the intermediate results have been received, weighted, and accumulated, the master outputs the final results and exits.
One limitation of the current implementation is that it does not support any form of warm-up period. While a warm-up period is not necessary for many programs as long as a large enough interval size is used (for instance, 100M instructions), it is needed for accurate results with smaller interval sizes [2]. Warm-up functionality is discussed further in the future work section.

4. Results

In this section we discuss the results obtained using the modified parallel version of sim-outorder. These results were compiled using a cluster of 32 dual 733 MHz Intel Pentium-III nodes with 256 MB of PC133 SDRAM, connected using Gigabit Ethernet, running the Linux kernel, with MPICH [3] installed. While each node of the cluster has two Pentium-III processors, only one was utilized to run the simulations while the other handled basic operating system processes. The simulation points used for the SPEC2000 benchmarks were obtained via the SimPoint website [5] for 100M intervals. The simulated microarchitecture is the default configuration of the sim-outorder simulator.

Figure 1: Accuracy comparison of the SimPoint algorithm using 1M intervals versus blindly fast forwarding a set number of instructions and versus starting from the beginning of program execution, for the entire SPEC2000 suite. Both non-SimPoint results use 300 million instructions. The figure plots the median, average, and maximum percentage error in CPI for each of the three approaches. Graph taken from the SimPoint documentation [5].

4.1 Validation

While this paper is in no way intended to serve as a validation or verification of the SimPoint tool (for an extensive discussion of the validity of SimPoint as a means of reducing simulation time while maintaining accuracy, the reader is referred to the SimPoint website and documentation), it is important to provide a sense of the accuracy achievable with SimPoint, to warrant further exploration of the speedups possible by parallelizing the simulator through the use of SimPoint simulation points. If one is unable to trust the results of the simulator, it hardly matters how quickly the simulation finishes.

Figure 1 shows the simulation accuracy using SimPoint compared to the complete execution of the programs for the entire SPEC2000 benchmark suite. Results are shown for the median, average, and maximum errors found. These results were obtained by the authors of SimPoint using an instruction interval size of 1 million. The results are compared to simulating 300 million instructions (over 3 times the total that the SimPoint results use) at the beginning of the code, and to blindly fast forwarding through 1 billion instructions.

Figure 1 shows that beginning execution from the start of the program results in an average CPI error of 201%, and fast forwarding through 1 billion instructions before simulating results in an average CPI error of 99%, whereas using the SimPoint algorithm to create multiple simulation points resulted in an average error of 2.1% when compared to the full simulation of the program [2].

Figure 2: Accuracy comparison of the SimPoint algorithm using 100M intervals versus blindly fast forwarding a set number of instructions and versus starting from the beginning of execution, for seven SPEC2000 integer benchmarks, including perlbmk.

Our results using 100M-interval simulation points from the SimPoint website for these integer benchmarks were similar to those produced for the entire SPEC suite by the SimPoint authors, and are shown in Figure 2. The non-SimPoint results were run for the same total number of instructions as the SimPoint versions (for example, if 8 simulation points of 100M instructions each were run in the parallel version, then 800M instructions were simulated from the start of execution, and also after fast forwarding 1 billion instructions). The goal of our research was to study the potential speedup obtainable by parallelizing the simulation of SimPoint-generated points, not to verify their validity, and our accuracy results obviously do not depict a representative workload of the SPEC2000 suite. They do, however, give a sense of the accuracy maintained while producing the speedups reported in the following section.

4.2 Speedup

We will now explore the speedups obtained using the parallel version of sim-outorder. Figure 3 shows the speedup compared to running the entire benchmark, and Figure 4 shows the speedup compared to a sequential baseline of running the simulation points consecutively on a single processor. Multiple runs of each benchmark were performed on processor counts ranging from one to n, where n is the total number of simulation points produced by the SimPoint algorithm.
Both figures clearly depict a linear speedup up to a certain maximum, after which they level off almost immediately. The maximum speedup is limited by the fast-forward distance to the farthest interval: the parallel simulator must fast forward to the farthest point at least once on one of the processors, so the time to complete the simulation can be no less than the time required to fast forward to that point and simulate the required instructions. Fortunately, the number of processors required to achieve the maximum speedup can easily be determined before the simulation is started. Since the difference in time to fast forward to different points within the same program is almost entirely determined by the distance between those points, and the time to simulate one interval's worth of instructions is approximately the same across a given benchmark, we can determine the number of processors required to achieve maximum speedup by repeating the allocation algorithm with an increasing number of processors until the farthest simulation point is the only point assigned to one of the processors.

Table 2 shows the allocation of tasks that this method would predict to achieve maximum speedup. While this method yields results close to the maximum speedup, it neglects the time it takes to actually simulate each interval. If the total fast-forward distances are the same for two processors, but one of them is scheduled to simulate twice as many points, that processor will take additional time to complete. This extra time can be compensated for by adding an offset to each of the simulation points before calculating the number of processors required for maximum speedup. Since the ratio of cycle-precise simulation time to fast-forward time averages around 10 (all of our cases were within the 7 to 13 range), a number such as 15 can be added to each simulation point.
Table 3 shows the newly calculated allocation of simulation points to achieve maximum speedup, and Table 4 shows the results using the actual times from a simulation run.

Table 2: Erroneous estimation of the number of processors needed to achieve maximum speedup.

Figure 3: Speedup achieved for various benchmarks with differing numbers of processors, as compared to full execution.

Table 3: Estimation of the number of processors required to achieve maximum speedup.

Figure 4: Speedup achieved for various benchmarks with differing numbers of processors, as compared to the sequential baseline.

Table 4: Actual number of processors required to achieve maximum speedup, with real times shown in seconds.

The maximum speedup attainable can be determined using the following equation:

    Speedup_max = T_T / T_L

where T_T is the total time to simulate all intervals on a single processor and T_L is the time to simulate the farthest interval.

4.3 Early SimPoint

Early SimPoint is an algorithm that produces simulation points in much the same fashion as the standard SimPoint algorithm, but with a general tendency toward choosing earlier simulation points, with the aim of reducing simulation time. The Early SimPoint algorithm produces an additional speedup in our framework, though not as great as in other applications of the algorithm. This is expected, as the Early SimPoint intervals are smaller. Greater speedups would be possible if the algorithm explicitly focused on reducing the distance to the farthest simulation point, as this is the main factor determining the maximum speedup possible. Of the 30 sets of simulation points for integer benchmarks posted on the SimPoint website using 100M intervals, 11 (63%) of them reduced the distance to the farthest simulation point, and 1 (0%) of them reduced the farthest simulation point by more than 10%. For the 21 integer benchmarks using 10M intervals, 11 (52%) reduced the distance to the farthest simulation point and 8 (38%) did so by more than 10%. The simulation points posted for floating-point benchmarks were much more impressive, with over 78% producing a greater than 10% reduction for 100M intervals, and almost 70% reducing the distance by more than 10% for the 10M interval size. Therefore the potential speedup of using the Early SimPoint algorithm should be carefully weighed against its reduced accuracy as compared to the regular SimPoint algorithm, especially in the case of integer benchmarks.

Figure 5: Speedups estimated for various benchmarks using the Early SimPoint algorithm, as compared to full execution.
Estimations of the speedups obtained using the Early SimPoint algorithm to generate the simulation points for the parallel version of sim-outorder are displayed in Figure 5 (compared to running the entire benchmark) and in Figure 6 (compared to the sequential baseline). The estimations were calculated by fitting a linear regression model to the previously simulated points to determine the average fast-forward and cycle-precise simulation times, and then using those times to estimate the time required to run the simulation under the optimal allocation algorithm.

Figure 6: Speedups estimated for various benchmarks using the Early SimPoint algorithm, as compared to the sequential baseline.

5. Related Work

Techniques to reduce simulation time have been proposed in [2, 9, 10, 11, 12]. These proposals focus exclusively on simulations running on a single-processor machine. The SimPoint framework proposed by Sherwood and Calder [2] reduces simulation time by using a set of representative instruction chunks to represent the entire program execution. Schnarr and Larus [9] use memoization to replay actions stored in a processor-action cache when the current microarchitectural state matches a previously encountered state. Conte et al. performed the early work on sampling-based simulation [10]. Wunderlich et al. propose SMARTS [11], which uses rigorous theoretical guidelines to determine the sampling rate needed to reduce simulation time. Liu and Huang [12] propose EXPERT, which reduces the simulation requirement by exploiting program behavior repetition.

6. Summary and Future Work

Through the use of simulation points generated by the SimPoint algorithm, we were able to break a single thread of execution into multiple independent
simulation points. These simulation points were individually simulated in parallel using a modified version of the sim-outorder simulator from the SimpleScalar suite, which included MPI code to allocate, distribute, execute, and recombine the results across any number of processors up to the maximum number of points. The modified simulator achieved a nearly linear speedup compared to a sequential baseline, up to a maximum speedup determined by the distance to the farthest simulation point. While this maximum speedup cannot be surpassed, the number of processors at which it occurs can easily be calculated before the simulation is started, so assigning extra processors that cannot increase the speedup can be avoided. Further, this limitation is accentuated by the relatively few simulation points that the SimPoint algorithm generates for intervals of 100M instructions.

This paper did not explore intervals of less than 100M instructions due to the lack of support for a warm-up period in the current version of the parallel simulator. While in many cases a warm-up period is not needed when a 100M interval size is used, it is necessary for smaller interval sizes. Smaller interval sizes can provide increased accuracy, particularly for programs with a reduced maximum number of instructions, and thus future revisions of the parallel simulator should include warm-up functionality. In addition, alternative approaches to repeated fast forwarding, such as checkpointing and multiple concurrent fast-forwards, should be explored in the parallel simulator to increase the maximum potential speedup.

References

[1] S. Girbal, G. Mouchard, A. Cohen, and O. Temam, DiST: A Simple, Reliable and Scalable Method to Significantly Reduce Processor Architecture Simulation Time, in Proceedings of the International Conference on Measurement and Modeling of Computer Systems.
[2] T. Sherwood, E. Perelman, G. Hamerly and B.
Calder, Automatically Characterizing Large Scale Program Behavior, in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems.
[3] The Message Passing Interface (MPI) Standard.
[4] SimpleScalar.
[5] SimPoint.
[6] Sim-Fast BBV Tracker, lar-bbv.htm
[7] ATOM BBV Tracker.
[8] G. Hamerly, E. Perelman, and B. Calder, How to Use SimPoint to Pick Simulation Points, ACM SIGMETRICS Performance Evaluation Review.
[9] E. Schnarr and J. Larus, Fast Out-of-Order Processor Simulation Using Memoization, in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems.
[10] T. Conte, M. Hirsch, and K. Menezes, Reducing State Loss for Effective Trace Sampling of Superscalar Processors, in Proceedings of the International Conference on Computer Design.
[11] R. Wunderlich, T. Wenisch, B. Falsafi, and J. Hoe, SMARTS: Accelerating Microarchitecture Simulation via Rigorous Statistical Sampling, in Proceedings of the International Symposium on Computer Architecture.
[12] W. Liu and M. Huang, EXPERT: Expedited Simulation Exploiting Program Behavior Repetition, in Proceedings of the International Conference on Supercomputing.
More informationMaximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms
Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,
More informationHyperThreading Support in VMware ESX Server 2.1
HyperThreading Support in VMware ESX Server 2.1 Summary VMware ESX Server 2.1 now fully supports Intel s new Hyper-Threading Technology (HT). This paper explains the changes that an administrator can expect
More informationLoad Balancing on a Non-dedicated Heterogeneous Network of Workstations
Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Dr. Maurice Eggen Nathan Franklin Department of Computer Science Trinity University San Antonio, Texas 78212 Dr. Roger Eggen Department
More informationESX Server Performance and Resource Management for CPU-Intensive Workloads
VMWARE WHITE PAPER VMware ESX Server 2 ESX Server Performance and Resource Management for CPU-Intensive Workloads VMware ESX Server 2 provides a robust, scalable virtualization framework for consolidating
More informationVirtuoso and Database Scalability
Virtuoso and Database Scalability By Orri Erling Table of Contents Abstract Metrics Results Transaction Throughput Initializing 40 warehouses Serial Read Test Conditions Analysis Working Set Effect of
More informationPerformance Evaluation and Optimization of A Custom Native Linux Threads Library
Center for Embedded Computer Systems University of California, Irvine Performance Evaluation and Optimization of A Custom Native Linux Threads Library Guantao Liu and Rainer Dömer Technical Report CECS-12-11
More informationEnergy Efficient MapReduce
Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing
More informationFast Sequential Summation Algorithms Using Augmented Data Structures
Fast Sequential Summation Algorithms Using Augmented Data Structures Vadim Stadnik vadim.stadnik@gmail.com Abstract This paper provides an introduction to the design of augmented data structures that offer
More information22S:295 Seminar in Applied Statistics High Performance Computing in Statistics
22S:295 Seminar in Applied Statistics High Performance Computing in Statistics Luke Tierney Department of Statistics & Actuarial Science University of Iowa August 30, 2007 Luke Tierney (U. of Iowa) HPC
More informationultra fast SOM using CUDA
ultra fast SOM using CUDA SOM (Self-Organizing Map) is one of the most popular artificial neural network algorithms in the unsupervised learning category. Sijo Mathew Preetha Joy Sibi Rajendra Manoj A
More informationParallelization: Binary Tree Traversal
By Aaron Weeden and Patrick Royal Shodor Education Foundation, Inc. August 2012 Introduction: According to Moore s law, the number of transistors on a computer chip doubles roughly every two years. First
More informationParallel Scalable Algorithms- Performance Parameters
www.bsc.es Parallel Scalable Algorithms- Performance Parameters Vassil Alexandrov, ICREA - Barcelona Supercomputing Center, Spain Overview Sources of Overhead in Parallel Programs Performance Metrics for
More informationChoosing a Computer for Running SLX, P3D, and P5
Choosing a Computer for Running SLX, P3D, and P5 This paper is based on my experience purchasing a new laptop in January, 2010. I ll lead you through my selection criteria and point you to some on-line
More informationCloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com
Parallels Cloud Storage White Paper Performance Benchmark Results www.parallels.com Table of Contents Executive Summary... 3 Architecture Overview... 3 Key Features... 4 No Special Hardware Requirements...
More informationA Robust Dynamic Load-balancing Scheme for Data Parallel Application on Message Passing Architecture
A Robust Dynamic Load-balancing Scheme for Data Parallel Application on Message Passing Architecture Yangsuk Kee Department of Computer Engineering Seoul National University Seoul, 151-742, Korea Soonhoi
More informationAlgorithms and optimization for search engine marketing
Algorithms and optimization for search engine marketing Using portfolio optimization to achieve optimal performance of a search campaign and better forecast ROI Contents 1: The portfolio approach 3: Why
More informationEvaluating HDFS I/O Performance on Virtualized Systems
Evaluating HDFS I/O Performance on Virtualized Systems Xin Tang xtang@cs.wisc.edu University of Wisconsin-Madison Department of Computer Sciences Abstract Hadoop as a Service (HaaS) has received increasing
More informationAutomatic Logging of Operating System Effects to Guide Application-Level Architecture Simulation
Automatic Logging of Operating System Effects to Guide Application-Level Architecture Simulation Satish Narayanasamy, Cristiano Pereira, Harish Patil, Robert Cohn, and Brad Calder Computer Science and
More informationImproved Software Testing Using McCabe IQ Coverage Analysis
White Paper Table of Contents Introduction...1 What is Coverage Analysis?...2 The McCabe IQ Approach to Coverage Analysis...3 The Importance of Coverage Analysis...4 Where Coverage Analysis Fits into your
More informationContributions to Gang Scheduling
CHAPTER 7 Contributions to Gang Scheduling In this Chapter, we present two techniques to improve Gang Scheduling policies by adopting the ideas of this Thesis. The first one, Performance- Driven Gang Scheduling,
More informationA Lab Course on Computer Architecture
A Lab Course on Computer Architecture Pedro López José Duato Depto. de Informática de Sistemas y Computadores Facultad de Informática Universidad Politécnica de Valencia Camino de Vera s/n, 46071 - Valencia,
More informationBenchmark Hadoop and Mars: MapReduce on cluster versus on GPU
Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview
More informationAnalysis and Modeling of MapReduce s Performance on Hadoop YARN
Analysis and Modeling of MapReduce s Performance on Hadoop YARN Qiuyi Tang Dept. of Mathematics and Computer Science Denison University tang_j3@denison.edu Dr. Thomas C. Bressoud Dept. of Mathematics and
More informationStream Processing on GPUs Using Distributed Multimedia Middleware
Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research
More informationObservations on Data Distribution and Scalability of Parallel and Distributed Image Processing Applications
Observations on Data Distribution and Scalability of Parallel and Distributed Image Processing Applications Roman Pfarrhofer and Andreas Uhl uhl@cosy.sbg.ac.at R. Pfarrhofer & A. Uhl 1 Carinthia Tech Institute
More informationOperating System Impact on SMT Architecture
Operating System Impact on SMT Architecture The work published in An Analysis of Operating System Behavior on a Simultaneous Multithreaded Architecture, Josh Redstone et al., in Proceedings of the 9th
More informationDISCOVERING AND EXPLOITING PROGRAM PHASES
DISCOVERING AND EXPLOITING PROGRAM PHASES IN A SINGLE SECOND, A MODERN PROCESSOR CAN EXECUTE BILLIONS OF INSTRUCTIONS AND A PROGRAM S BEHAVIOR CAN CHANGE MANY TIMES. SOME PROGRAMS CHANGE BEHAVIOR DRASTICALLY,
More informationInstruction Set Architecture (ISA)
Instruction Set Architecture (ISA) * Instruction set architecture of a machine fills the semantic gap between the user and the machine. * ISA serves as the starting point for the design of a new machine
More informationOptimization of Cluster Web Server Scheduling from Site Access Statistics
Optimization of Cluster Web Server Scheduling from Site Access Statistics Nartpong Ampornaramveth, Surasak Sanguanpong Faculty of Computer Engineering, Kasetsart University, Bangkhen Bangkok, Thailand
More informationAn examination of the dual-core capability of the new HP xw4300 Workstation
An examination of the dual-core capability of the new HP xw4300 Workstation By employing single- and dual-core Intel Pentium processor technology, users have a choice of processing power options in a compact,
More informationPerformance Measurement of Dynamically Compiled Java Executions
Performance Measurement of Dynamically Compiled Java Executions Tia Newhall and Barton P. Miller University of Wisconsin Madison Madison, WI 53706-1685 USA +1 (608) 262-1204 {newhall,bart}@cs.wisc.edu
More informationUnit 4: Performance & Benchmarking. Performance Metrics. This Unit. CIS 501: Computer Architecture. Performance: Latency vs.
This Unit CIS 501: Computer Architecture Unit 4: Performance & Benchmarking Metrics Latency and throughput Speedup Averaging CPU Performance Performance Pitfalls Slides'developed'by'Milo'Mar0n'&'Amir'Roth'at'the'University'of'Pennsylvania'
More informationPerformance Characteristics of a Cost-Effective Medium-Sized Beowulf Cluster Supercomputer
Res. Lett. Inf. Math. Sci., 2003, Vol.5, pp 1-10 Available online at http://iims.massey.ac.nz/research/letters/ 1 Performance Characteristics of a Cost-Effective Medium-Sized Beowulf Cluster Supercomputer
More informationKeywords: Dynamic Load Balancing, Process Migration, Load Indices, Threshold Level, Response Time, Process Age.
Volume 3, Issue 10, October 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Load Measurement
More informationIntroduction to Cloud Computing
Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic
More informationOptimizing matrix multiplication Amitabha Banerjee abanerjee@ucdavis.edu
Optimizing matrix multiplication Amitabha Banerjee abanerjee@ucdavis.edu Present compilers are incapable of fully harnessing the processor architecture complexity. There is a wide gap between the available
More informationLattice QCD Performance. on Multi core Linux Servers
Lattice QCD Performance on Multi core Linux Servers Yang Suli * Department of Physics, Peking University, Beijing, 100871 Abstract At the moment, lattice quantum chromodynamics (lattice QCD) is the most
More informationThe ROI from Optimizing Software Performance with Intel Parallel Studio XE
The ROI from Optimizing Software Performance with Intel Parallel Studio XE Intel Parallel Studio XE delivers ROI solutions to development organizations. This comprehensive tool offering for the entire
More informationThe IntelliMagic White Paper: Green Storage: Reduce Power not Performance. December 2010
The IntelliMagic White Paper: Green Storage: Reduce Power not Performance December 2010 Summary: This white paper provides techniques to configure the disk drives in your storage system such that they
More informationPerformance Modeling and Analysis of a Database Server with Write-Heavy Workload
Performance Modeling and Analysis of a Database Server with Write-Heavy Workload Manfred Dellkrantz, Maria Kihl 2, and Anders Robertsson Department of Automatic Control, Lund University 2 Department of
More informationPractical Guide to the Simplex Method of Linear Programming
Practical Guide to the Simplex Method of Linear Programming Marcel Oliver Revised: April, 0 The basic steps of the simplex algorithm Step : Write the linear programming problem in standard form Linear
More informationA Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures
11 th International LS-DYNA Users Conference Computing Technology A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures Yih-Yih Lin Hewlett-Packard Company Abstract In this paper, the
More informationInside the Erlang VM
Rev A Inside the Erlang VM with focus on SMP Prepared by Kenneth Lundin, Ericsson AB Presentation held at Erlang User Conference, Stockholm, November 13, 2008 1 Introduction The history of support for
More informationImplementing Portfolio Management: Integrating Process, People and Tools
AAPG Annual Meeting March 10-13, 2002 Houston, Texas Implementing Portfolio Management: Integrating Process, People and Howell, John III, Portfolio Decisions, Inc., Houston, TX: Warren, Lillian H., Portfolio
More informationPerformance monitoring at CERN openlab. July 20 th 2012 Andrzej Nowak, CERN openlab
Performance monitoring at CERN openlab July 20 th 2012 Andrzej Nowak, CERN openlab Data flow Reconstruction Selection and reconstruction Online triggering and filtering in detectors Raw Data (100%) Event
More informationFPGA area allocation for parallel C applications
1 FPGA area allocation for parallel C applications Vlad-Mihai Sima, Elena Moscu Panainte, Koen Bertels Computer Engineering Faculty of Electrical Engineering, Mathematics and Computer Science Delft University
More informationA Comparison of General Approaches to Multiprocessor Scheduling
A Comparison of General Approaches to Multiprocessor Scheduling Jing-Chiou Liou AT&T Laboratories Middletown, NJ 0778, USA jing@jolt.mt.att.com Michael A. Palis Department of Computer Science Rutgers University
More informationMOSIX: High performance Linux farm
MOSIX: High performance Linux farm Paolo Mastroserio [mastroserio@na.infn.it] Francesco Maria Taurino [taurino@na.infn.it] Gennaro Tortone [tortone@na.infn.it] Napoli Index overview on Linux farm farm
More informationBinary search tree with SIMD bandwidth optimization using SSE
Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous
More informationPerformance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France
More informationAn Approach to High-Performance Scalable Temporal Object Storage
An Approach to High-Performance Scalable Temporal Object Storage Kjetil Nørvåg Department of Computer and Information Science Norwegian University of Science and Technology 791 Trondheim, Norway email:
More informationPaul s Norwegian Vacation (or Experiences with Cluster Computing ) Paul Sack 20 September, 2002. sack@stud.ntnu.no www.stud.ntnu.
Paul s Norwegian Vacation (or Experiences with Cluster Computing ) Paul Sack 20 September, 2002 sack@stud.ntnu.no www.stud.ntnu.no/ sack/ Outline Background information Work on clusters Profiling tools
More informationHigh Performance Computing for Operation Research
High Performance Computing for Operation Research IEF - Paris Sud University claude.tadonki@u-psud.fr INRIA-Alchemy seminar, Thursday March 17 Research topics Fundamental Aspects of Algorithms and Complexity
More informationFacebook Friend Suggestion Eytan Daniyalzade and Tim Lipus
Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information
More informationPerformance Analysis and Optimization Tool
Performance Analysis and Optimization Tool Andres S. CHARIF-RUBIAL andres.charif@uvsq.fr Performance Analysis Team, University of Versailles http://www.maqao.org Introduction Performance Analysis Develop
More informationVHDL DESIGN OF EDUCATIONAL, MODERN AND OPEN- ARCHITECTURE CPU
VHDL DESIGN OF EDUCATIONAL, MODERN AND OPEN- ARCHITECTURE CPU Martin Straka Doctoral Degree Programme (1), FIT BUT E-mail: strakam@fit.vutbr.cz Supervised by: Zdeněk Kotásek E-mail: kotasek@fit.vutbr.cz
More informationThe Effects of Start Prices on the Performance of the Certainty Equivalent Pricing Policy
BMI Paper The Effects of Start Prices on the Performance of the Certainty Equivalent Pricing Policy Faculty of Sciences VU University Amsterdam De Boelelaan 1081 1081 HV Amsterdam Netherlands Author: R.D.R.
More informationThe Green Index: A Metric for Evaluating System-Wide Energy Efficiency in HPC Systems
202 IEEE 202 26th IEEE International 26th International Parallel Parallel and Distributed and Distributed Processing Processing Symposium Symposium Workshops Workshops & PhD Forum The Green Index: A Metric
More informationExploiting GPU Hardware Saturation for Fast Compiler Optimization
Exploiting GPU Hardware Saturation for Fast Compiler Optimization Alberto Magni School of Informatics University of Edinburgh United Kingdom a.magni@sms.ed.ac.uk Christophe Dubach School of Informatics
More informationRecommended hardware system configurations for ANSYS users
Recommended hardware system configurations for ANSYS users The purpose of this document is to recommend system configurations that will deliver high performance for ANSYS users across the entire range
More informationPARALLELIZED SUDOKU SOLVING ALGORITHM USING OpenMP
PARALLELIZED SUDOKU SOLVING ALGORITHM USING OpenMP Sruthi Sankar CSE 633: Parallel Algorithms Spring 2014 Professor: Dr. Russ Miller Sudoku: the puzzle A standard Sudoku puzzles contains 81 grids :9 rows
More informationUnderstanding the Benefits of IBM SPSS Statistics Server
IBM SPSS Statistics Server Understanding the Benefits of IBM SPSS Statistics Server Contents: 1 Introduction 2 Performance 101: Understanding the drivers of better performance 3 Why performance is faster
More informationBest Practices for Deploying SSDs in a Microsoft SQL Server 2008 OLTP Environment with Dell EqualLogic PS-Series Arrays
Best Practices for Deploying SSDs in a Microsoft SQL Server 2008 OLTP Environment with Dell EqualLogic PS-Series Arrays Database Solutions Engineering By Murali Krishnan.K Dell Product Group October 2009
More information- An Essential Building Block for Stable and Reliable Compute Clusters
Ferdinand Geier ParTec Cluster Competence Center GmbH, V. 1.4, March 2005 Cluster Middleware - An Essential Building Block for Stable and Reliable Compute Clusters Contents: Compute Clusters a Real Alternative
More informationAchieving Mainframe-Class Performance on Intel Servers Using InfiniBand Building Blocks. An Oracle White Paper April 2003
Achieving Mainframe-Class Performance on Intel Servers Using InfiniBand Building Blocks An Oracle White Paper April 2003 Achieving Mainframe-Class Performance on Intel Servers Using InfiniBand Building
More informationSpring 2011 Prof. Hyesoon Kim
Spring 2011 Prof. Hyesoon Kim Today, we will study typical patterns of parallel programming This is just one of the ways. Materials are based on a book by Timothy. Decompose Into tasks Original Problem
More informationThe Double-layer Master-Slave Model : A Hybrid Approach to Parallel Programming for Multicore Clusters
The Double-layer Master-Slave Model : A Hybrid Approach to Parallel Programming for Multicore Clusters User s Manual for the HPCVL DMSM Library Gang Liu and Hartmut L. Schmider High Performance Computing
More informationTHE FLORIDA STATE UNIVERSITY COLLEGE OF ARTS AND SCIENCES APPLICATION CONFIGURABLE PROCESSORS CHRISTOPHER J. ZIMMER
THE FLORIDA STATE UNIVERSITY COLLEGE OF ARTS AND SCIENCES APPLICATION CONFIGURABLE PROCESSORS By CHRISTOPHER J. ZIMMER A Thesis submitted to the Department of Computer Science In partial fulfillment of
More informationInterconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003
Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003 Josef Pelikán Charles University in Prague, KSVI Department, Josef.Pelikan@mff.cuni.cz Abstract 1 Interconnect quality
More informationReconfigurable Architecture Requirements for Co-Designed Virtual Machines
Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Kenneth B. Kent University of New Brunswick Faculty of Computer Science Fredericton, New Brunswick, Canada ken@unb.ca Micaela Serra
More informationAPPENDIX 1 USER LEVEL IMPLEMENTATION OF PPATPAN IN LINUX SYSTEM
152 APPENDIX 1 USER LEVEL IMPLEMENTATION OF PPATPAN IN LINUX SYSTEM A1.1 INTRODUCTION PPATPAN is implemented in a test bed with five Linux system arranged in a multihop topology. The system is implemented
More informationThe Taxman Game. Robert K. Moniot September 5, 2003
The Taxman Game Robert K. Moniot September 5, 2003 1 Introduction Want to know how to beat the taxman? Legally, that is? Read on, and we will explore this cute little mathematical game. The taxman game
More informationImproving Scalability for Citrix Presentation Server
VMWARE PERFORMANCE STUDY VMware ESX Server 3. Improving Scalability for Citrix Presentation Server Citrix Presentation Server administrators have often opted for many small servers (with one or two CPUs)
More informationControl 2004, University of Bath, UK, September 2004
Control, University of Bath, UK, September ID- IMPACT OF DEPENDENCY AND LOAD BALANCING IN MULTITHREADING REAL-TIME CONTROL ALGORITHMS M A Hossain and M O Tokhi Department of Computing, The University of
More informationGet an Easy Performance Boost Even with Unthreaded Apps. with Intel Parallel Studio XE for Windows*
Get an Easy Performance Boost Even with Unthreaded Apps for Windows* Can recompiling just one file make a difference? Yes, in many cases it can! Often, you can achieve a major performance boost by recompiling
More informationA Case for Dynamic Selection of Replication and Caching Strategies
A Case for Dynamic Selection of Replication and Caching Strategies Swaminathan Sivasubramanian Guillaume Pierre Maarten van Steen Dept. of Mathematics and Computer Science Vrije Universiteit, Amsterdam,
More informationOpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA
OpenCL Optimization San Jose 10/2/2009 Peng Wang, NVIDIA Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary Overall Optimization
More information