Reducing Simulation Time by Parallelizing SimpleScalar in MPI Through The Use of SimPoint Generated Intervals
James Michael Poe II 1, Fernando Hernandez 2, and Tao Li 3
Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL
{1jpoe, 2fhernand}@ufl.edu, 3taoli@ece.ufl.edu

Abstract

Cycle-accurate simulation is an essential tool in the evaluation and design exploration of modern computer architectures. Due to increasing complexity, additional critical constraints, and an ever-expanding design space, cycle-level simulation time is growing at an unprecedented rate. A parallel approach to simulation at first appears to be an obvious choice; however, previous attempts to parallelize simulators have been largely unsuccessful due to the sequential nature of the problem, often resorting instead to parallelizing multiple threads or simulating multiple processors. With recent advances in phase analysis, however, using SimPoint to choose simulation points can break a single thread of execution into multiple representative simulation intervals that are entirely independent. In this paper, we use SimPoint simulation points to demonstrate a parallel approach to architecture simulation using MPI (Message Passing Interface) on a cluster of computers. Our preliminary experimental results show that this approach greatly reduces simulation time while maintaining the high levels of accuracy that SimPoint provides.

1. Introduction

Cycle-accurate simulators play a vital role in the design and evaluation of modern computer architectures. A recent journal article reported that in the four major conferences in computer architecture (MICRO, ISCA, ASPLOS, and HPCA), cycle-precise simulation was used in 72 out of 103 papers [1]. As design complexity and application size continue to increase, so does the simulation time required for accurate results.
Lengthy simulation periods not only delay the validation and reporting of results to the rest of the architecture community, but also often force designers either to choose faster yet less accurate models or to focus on a small fraction of program execution. Much of architecture design is a trial-and-error process, and the number of trials completed is often directly dependent upon the time it takes to complete those trials. Thus, it is imperative to explore methods that decrease simulation time without forfeiting much of the accuracy.

At first, a parallel approach to simulation seems like an obvious choice, as is the case with any computationally intensive problem. This presents difficulty, however, due to the inherently sequential nature of simulating a single thread of execution. With recent advances in phase analysis [2], it has become possible to break a single thread of execution into multiple representative intervals that, when weighted appropriately and combined, provide an accurate model of the entire execution of the workload. These intervals are entirely independent and thus can be simulated in parallel, reducing simulation time.

As the price of commercial off-the-shelf (COTS) systems continues to fall, and the performance of high-speed interconnects continues to rise, clusters provide an appealing low-cost platform for parallel computing. Various methodologies and software packages can be used to parallelize tasks on clusters, and a popular choice is the Message Passing Interface (MPI) standard [3]. MPI can be used with both C and FORTRAN, and MPI functionality can be conveniently added to existing programs through calls to the MPI library. While MPI and clustered systems share many advantages, both are limited when heavy communication is required, due to the latency and overhead of sending and receiving messages.
Therefore, to use a clustered MPI system effectively as a parallelization vehicle, programs usually require a very low communication-to-computation ratio, a characteristic easily attainable in a parallel architectural simulator. In this paper we showcase a parallel simulation framework built with minimal additions and modifications to a cycle-level simulator (SimpleScalar's sim-outorder [4]), MPI, and simulation points generated by the SimPoint algorithm [5].
Experimental results show that, when compared to running an entire benchmark, this method provides an excellent simulation speedup, and when compared to using SimPoint intervals on a single-processor machine, it shows a good, predictable speedup, all while maintaining high levels of accuracy.

This paper is organized as follows: Section 2 briefly describes background information on the process flow of parallelizing simulations through SimPoint. Section 3 describes the experimental methodology and implementation issues. Section 4 presents the simulation results. Section 5 discusses related work. Finally, Section 6 concludes the paper and outlines future work.

2. Background Information

The process of simulating multiple simulation points in parallel begins with the generation of those points for a particular benchmark. This is done by first generating a basic block vector of the benchmark, which can be performed using the Sim-Fast BBV Tracker [6], the ATOM BBV Tracker [7], or another basic block vector profiling program of the user's choice. Once the basic block vector has been generated, it is fed into the SimPoint phase analysis software, which generates the representative simulation points and their corresponding weighting factors [8]. Once these simulation points and weights have been generated, the modified cycle-level simulator reads in the values, simulates the different points in parallel, and writes the results to file.

In the method described above, the generation of simulation points must be performed sequentially. Therefore, the time it takes to generate the basic block vector and compute the representative simulation points must be taken into consideration when evaluating the efficiency of this parallel approach to simulation. These points, however, only need to be computed once for any benchmark/input combination.
Thus the time to calculate the basic block vector, roughly the time it takes to fast forward through the entire benchmark once (about 1/10th of the time required for cycle-level simulation in the SimpleScalar suite), and the time to generate the simulation points using SimPoint (a relatively short amount of time even for the largest SPEC benchmarks with 100M intervals), can be amortized over multiple runs of the parallel simulator. Further, these intervals have already been generated for most of the SPEC2000 benchmarks, are available from the SimPoint website, and indeed were used for all of the results reported in this paper.

3. Implementation

In this section we provide a brief introduction to the simulator that was chosen and a summary of the modifications that were necessary to provide parallel execution of simulation points.

3.1 Base Simulator

The simulator chosen for modification was the sim-outorder cycle-level simulator from the SimpleScalar suite [4]. While parallelization can be attempted with any simulator, the SimpleScalar suite was chosen because it provides most of the functionality needed to natively take advantage of the SimPoint-generated simulation points, has been exhaustively tested and verified, and is widely used in the architecture community. Message Passing Interface code was used to parallelize the simulator, as the modified simulator only needs to communicate twice (once to disburse the intervals and once to recombine the results).

Several modifications to sim-outorder were necessary to allow parallel simulation. First, the internal fast-forward variable had to be modified to allow fast functional simulation past 2.1 billion instructions, the 32-bit signed integer limit. The sim-outorder code was then extended to accept a new argument, interval, which the user supplies to specify the interval size.
Since the master process (the process responsible for the generation and distribution of tasks) now determines and disseminates the fast-forward and maximum-instruction variables to the slaves (the processes that carry out the work assigned by the master), these arguments are no longer accepted from the user.

3.2 Interval Allocation

One of the most important decisions in the design of any parallel program is how tasks will be allocated to the processors, and this is certainly true in the design of a parallel simulator. Since all of the intervals that need to be simulated are available before the allocation, and the time it takes each interval to complete is based upon the distance that must be fast-forwarded, an optimal allocation can be predetermined, and thus static assignment is preferred over dynamic. This allows the master process to disseminate the tasks and then simulate its own intervals. The algorithm used to assign the intervals to individual processes is as follows:

1. Arrange the simulation points from the farthest away to the closest.
2. Assign the farthest unassigned point to the process with the minimum point-distance sum, taking into account the time needed to complete the cycle-precise simulation of that point.
3. Repeat until all points have been assigned.

Table 1: The result of the allocation strategy for a benchmark with 9 intervals using 3 processors.

In the table above, each column represents the intervals assigned to each respective processor, and the bottom row displays the sum of the intervals, which is directly proportional to the simulation time. To arrive at this distribution, it is important to first realize that simulation time comprises two parts: the time spent fast forwarding through the preceding code sections, and the time spent performing a detailed simulation on the target interval. Since the time to fast forward through a given block of instructions is not the same as the time needed to perform a detailed simulation on the same block, an offset must be added to each interval that represents the added time spent performing the detailed simulation when the target interval is reached. From our experimental results, it was determined that detailed simulation is between 7 and 13 times slower than fast forwarding, so intervals are assigned assuming they are 15 units larger. The algorithm begins by assigning intervals 840, 835, and 834. Interval 526 is then assigned on processor 3, because at this point it is the processor with the smallest interval sum. Intervals 491 and 277 are then assigned by the same method, and intervals 13, 8, and 1 are all assigned to the first processor because, although it already has two intervals, its total interval sum is smaller than any other's.
3.3 Distribution, Simulation, and Recombination

After the allocation of the simulation points has been determined, the results are saved into a three-dimensional double array with the following structure:

    simpoint[x][y][z]

The first dimension (x) designates the processor that will simulate the points; each point is stored in a row indexed by the second dimension (y). The third dimension (z) holds both the simulation point and its corresponding weight. After this array has been populated, each processor is sent its simulation points:

    for (m = 1; m < ntasks; m++) {
        MPI_Send(&simpoint[m], 2 * MAX_SIMPOINTS, MPI_DOUBLE,
                 m, tag, MPI_COMM_WORLD);
    }

where ntasks is the number of processors obtained using the MPI_Comm_size function, and MAX_SIMPOINTS is the maximum number of possible simulation points. The for loop begins at 1 because the master has saved its own simulation points, to be calculated itself, in the 0th level of the three-dimensional array. Each slave then issues MPI_Recv and stores the result into the 0th level of its own array:

    MPI_Recv(&simpoint[0], 2 * MAX_SIMPOINTS, MPI_DOUBLE,
             0, tag, MPI_COMM_WORLD, &status);

From this point on, all processes can access their respective simulation points by referencing the same level of the array. The simulator then loops through its simulation points until all have been processed, resetting the internal state of the simulator and updating the fast-forward variable on each iteration. When each slave process has completed the simulation of all of its points, it notifies the master. Once the master has completed all of its own points, it awaits the results from the slave processes; after all of the intermediate results have been received, weighted, and accumulated, the master outputs the final results and exits.
One limitation of the current implementation is that it does not support any form of warm-up period. While a warm-up period is not necessary for many programs as long as a large enough interval size is used (for instance, 100M instructions), it is needed for accurate results with smaller interval sizes [2]. Warm-up functionality is discussed further in the future work section.

4. Results

In this section we discuss the results obtained using the modified parallel version of sim-outorder. These results were compiled using a cluster of 32 dual 733 MHz Intel Pentium-III nodes with 256 MB of PC133 SDRAM, connected using Gigabit Ethernet, running the Linux kernel, with MPICH [3] installed. While each node of the cluster has two Pentium-III processors, only one was utilized to run the simulations while the other handled basic operating system processes. The simulation points used for the SPEC2000 benchmarks were obtained via the SimPoint website [5] for 100M intervals. The simulated microarchitecture is the default configuration of the sim-outorder simulator.

Figure 1: Accuracy comparison of the SimPoint algorithm using 1M intervals versus blindly fast forwarding a set number of instructions and versus starting from the beginning of program execution, for the entire SPEC2000 suite. Both non-SimPoint results use 300 million instructions. The figure plots the median, average, and maximum percentage error in CPI for each of the three approaches. Graph taken from the SimPoint documentation [5].

4.1 Validation

While this paper is in no way intended to serve as a validation or verification of the SimPoint tool (for an extensive discussion of the validity of SimPoint as a means of reducing simulation time while maintaining accuracy, the reader is referred to the SimPoint website and documentation), it is important to provide a sense of the accuracy achievable with SimPoint, to warrant further exploration of the speedups possible by parallelizing the simulator through the use of SimPoint simulation points. If one is unable to trust the results of the simulator, it hardly matters how quickly the simulation finishes.

Figure 1 shows the simulation accuracy using SimPoint compared to the complete execution of the programs for the entire SPEC2000 benchmark suite. Results are shown for the median, average, and maximum errors found. These results were obtained by the authors of SimPoint using an instruction interval size of 1 million. The results are compared to simulating 300 million instructions (over 3 times the total that the SimPoint results use) at the beginning of the code, and to blindly fast forwarding through 1 billion instructions.

Figure 1 shows that beginning execution from the start of the program results in an average CPI error of 201%, and fast forwarding through 1 billion instructions before simulating results in an average CPI error of 99%, whereas using the SimPoint algorithm to create multiple simulation points resulted in an average error of 2.1% when compared to the full simulation of the program [2].

Figure 2: Accuracy comparison of the SimPoint algorithm using 100M intervals versus blindly fast forwarding a set number of instructions and versus starting from the beginning of execution, for seven SPEC2000 integer benchmarks, including perlbmk.

Our results using 100M-interval simulation points from the SimPoint website for these integer benchmarks were similar to those produced for the entire SPEC suite by the SimPoint authors, and are shown in Figure 2. The non-SimPoint results were run for the same total number of instructions as the SimPoint versions (for example, if 8 simulation points of 100M instructions each were run in the parallel version, then 800M instructions were simulated from the start of execution, and also after fast forwarding 1 billion instructions). The goal of our research was to study the potential speedup obtainable by parallelizing the simulation of SimPoint-generated points, not to verify their validity, and our accuracy results obviously do not depict a representative workload of the SPEC2000 suite. They do, however, give a sense of the accuracy maintained while producing the speedups reported in the following section.

4.2 Speedup

We will now explore the speedups obtained using the parallel version of sim-outorder. Figure 3 shows the speedup compared to running the entire benchmark, and Figure 4 shows the speedup compared to a sequential baseline of running the simulation points consecutively on a single processor. Multiple runs of each benchmark were performed on processor counts ranging from one to n, where n is the total number of simulation points produced by the SimPoint algorithm.
Both figures clearly depict a linear speedup up to a certain maximum, after which they level off almost immediately. The maximum speedup is limited by the fast-forward distance to the farthest interval: the parallel simulator must fast forward to the farthest point at least once on one of the processors, so the time to complete the simulation can be no less than the time required to fast forward to that point and simulate the required instructions. Fortunately, the number of processors required to achieve the maximum speedup can easily be determined before the simulation is started. Since the difference in time to fast forward to different points within the same program is almost entirely determined by the distance between those points, and the time to simulate one interval's worth of instructions is approximately the same across a given benchmark, we can determine the number of processors required to achieve maximum speedup by repeating the allocation algorithm with an increasing number of processors until the farthest simulation point is the only point assigned to one of the processors.

Table 2 shows the allocation of tasks that this method would predict to achieve maximum speedup. While this method yields results close to the maximum speedup, it neglects the time it takes to actually simulate each interval. If the total fast-forward distances are the same for two processors, but one of them is scheduled to simulate twice as many points, that processor will take additional time to complete. This extra time can be compensated for by adding an offset to each of the simulation points before calculating the number of processors required for maximum speedup. Since the ratio of cycle-precise simulation time to fast-forward time averages around 10 (all of our cases were within the 7 to 13 range), a number such as 15 can be added to each simulation point.
Table 3 shows the newly calculated allocation of simulation points to achieve maximum speedup, and Table 4 shows the results using the actual times from a simulation run.

Table 2: Erroneous estimation of the number of processors needed to achieve maximum speedup.

Figure 3: Speedup achieved for various benchmarks with differing numbers of processors, as compared to full execution.

Table 3: Estimation of the number of processors required to achieve maximum speedup.

Figure 4: Speedup achieved for various benchmarks with differing numbers of processors, as compared to the sequential baseline.

Table 4: Actual number of processors required to achieve maximum speedup, with real times shown in seconds.

The maximum speedup attainable can be determined using the following equation:

    Speedup_max = T_T / T_L

where T_T is the total time to simulate all intervals on a single processor and T_L is the time to simulate the farthest interval.

4.3 Early SimPoint

Early SimPoint is an algorithm that produces simulation points in much the same fashion as the standard SimPoint algorithm, but with a general tendency toward choosing earlier simulation points, with the aim of reducing simulation time. The Early SimPoint algorithm produces an additional speedup in our framework, though not as great as in other applications of the algorithm. This is expected, as the Early SimPoint intervals are smaller. Greater speedups would be possible if the algorithm explicitly focused on reducing the distance to the farthest simulation point, as this is the main factor determining the maximum speedup possible. Of the 30 sets of simulation points for integer benchmarks posted on the SimPoint website using 100M intervals, 11 (63%) of them reduced the distance to the farthest simulation point, and 1 (0%) of them reduced the farthest simulation point by more than 10%. For the 21 integer benchmarks using 10M intervals, 11 (52%) reduced the distance to the farthest simulation point and 8 (38%) did so by more than 10%. The simulation points posted for floating-point benchmarks were much more impressive, with over 78% producing a greater than 10% reduction for 100M intervals, and almost 70% reducing the distance by more than 10% for the 10M interval size. Therefore the potential speedup of using the Early SimPoint algorithm should be carefully weighed against its reduced accuracy as compared to the regular SimPoint algorithm, especially in the case of integer benchmarks.

Figure 5: Speedups estimated for various benchmarks using the Early SimPoint algorithm, as compared to full execution.
Estimations of the speedups obtained using the Early SimPoint algorithm to generate the simulation points for the parallel version of sim-outorder are displayed in Figure 5 (compared to running the entire benchmark) and in Figure 6 (compared to the sequential baseline). The estimations were calculated by fitting a linear regression model to the previously simulated points to determine the average fast-forward and cycle-precise simulation times, and then using those times to estimate the time required to run the simulation under the optimal allocation algorithm.

Figure 6: Speedups estimated for various benchmarks using the Early SimPoint algorithm, as compared to the sequential baseline.

5. Related Work

Techniques to reduce simulation time have been proposed in [2, 9, 10, 11, 12]. These proposals focus exclusively on simulations running on a single-processor machine. The SimPoint framework proposed by Sherwood and Calder [2] reduces simulation time by using a set of representative instruction chunks to represent the entire program execution. Schnarr and Larus [9] use memoization to replay actions stored in a processor-action cache when the current microarchitectural state matches a previously encountered state. Conte et al. performed the early work on sampling-based simulation [10]. Wunderlich et al. propose SMARTS [11], which uses rigorous theoretical guidelines to determine the sampling rate needed to reduce simulation time. Liu and Huang [12] propose EXPERT, which reduces the simulation requirement by exploiting program behavior repetition.

6. Summary and Future Work

Through the use of simulation points generated by the SimPoint algorithm, we were able to break a single thread of execution into multiple independent
simulation points. These simulation points were individually simulated in parallel using a modified version of the sim-outorder simulator from the SimpleScalar suite, which included MPI code to allocate, distribute, execute, and recombine the results across any number of processors up to the maximum number of points. The modified simulator achieved a nearly linear speedup compared to a sequential baseline, up to a maximum speedup determined by the distance to the farthest simulation point. While this maximum speedup cannot be surpassed, the number of processors at which it occurs can easily be calculated before the simulation is started, so assigning extra processors that cannot increase the speedup can be avoided. Further, this limitation is accentuated by the relatively few simulation points that the SimPoint algorithm generates for intervals of 100M instructions.

This paper did not explore intervals of less than 100M instructions due to the lack of support for a warm-up period in the current version of the parallel simulator. While in many cases a warm-up period is not needed when a 100M interval size is used, it is necessary for smaller interval sizes. Smaller interval sizes can provide increased accuracy, particularly for programs with a reduced maximum number of instructions, and thus future revisions of the parallel simulator should include warm-up functionality. In addition, alternative approaches to repeated fast forwarding, such as checkpointing and multiple concurrent fast-forwards, should be explored in the parallel simulator to increase the maximum potential speedup.

References

[1] S. Girbal, G. Mouchard, A. Cohen, and O. Temam, DiST: A Simple, Reliable and Scalable Method to Significantly Reduce Processor Architecture Simulation Time, in Proceedings of the International Conference on Measurement and Modeling of Computer Systems.
[2] T. Sherwood, E. Perelman, G. Hamerly and B.
Calder, Automatically Characterizing Large Scale Program Behavior, in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems.
[3] The Message Passing Interface (MPI) Standard.
[4] SimpleScalar.
[5] SimPoint.
[6] Sim-Fast BBV Tracker, lar-bbv.htm
[7] ATOM BBV Tracker.
[8] G. Hamerly, E. Perelman, and B. Calder, How to Use SimPoint to Pick Simulation Points, ACM SIGMETRICS Performance Evaluation Review.
[9] E. Schnarr and J. Larus, Fast Out-of-Order Processor Simulation Using Memoization, in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems.
[10] T. Conte, M. Hirsch, and K. Menezes, Reducing State Loss for Effective Trace Sampling of Superscalar Processors, in Proceedings of the International Conference on Computer Design.
[11] R. Wunderlich, T. Wenisch, B. Falsafi, and J. Hoe, SMARTS: Accelerating Microarchitecture Simulation via Rigorous Statistical Sampling, in Proceedings of the International Symposium on Computer Architecture.
[12] W. Liu and M. Huang, EXPERT: Expedited Simulation Exploiting Program Behavior Repetition, in Proceedings of the International Conference on Supercomputing.
More informationMaximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms
Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,
More informationHyperThreading Support in VMware ESX Server 2.1
HyperThreading Support in VMware ESX Server 2.1 Summary VMware ESX Server 2.1 now fully supports Intel s new Hyper-Threading Technology (HT). This paper explains the changes that an administrator can expect
More informationLoad Balancing on a Non-dedicated Heterogeneous Network of Workstations
Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Dr. Maurice Eggen Nathan Franklin Department of Computer Science Trinity University San Antonio, Texas 78212 Dr. Roger Eggen Department
More informationESX Server Performance and Resource Management for CPU-Intensive Workloads
VMWARE WHITE PAPER VMware ESX Server 2 ESX Server Performance and Resource Management for CPU-Intensive Workloads VMware ESX Server 2 provides a robust, scalable virtualization framework for consolidating
More informationVirtuoso and Database Scalability
Virtuoso and Database Scalability By Orri Erling Table of Contents Abstract Metrics Results Transaction Throughput Initializing 40 warehouses Serial Read Test Conditions Analysis Working Set Effect of
More informationPerformance Evaluation and Optimization of A Custom Native Linux Threads Library
Center for Embedded Computer Systems University of California, Irvine Performance Evaluation and Optimization of A Custom Native Linux Threads Library Guantao Liu and Rainer Dömer Technical Report CECS-12-11
More informationEnergy Efficient MapReduce
Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing
More informationFast Sequential Summation Algorithms Using Augmented Data Structures
Fast Sequential Summation Algorithms Using Augmented Data Structures Vadim Stadnik vadim.stadnik@gmail.com Abstract This paper provides an introduction to the design of augmented data structures that offer
More information22S:295 Seminar in Applied Statistics High Performance Computing in Statistics
22S:295 Seminar in Applied Statistics High Performance Computing in Statistics Luke Tierney Department of Statistics & Actuarial Science University of Iowa August 30, 2007 Luke Tierney (U. of Iowa) HPC
More informationultra fast SOM using CUDA
ultra fast SOM using CUDA SOM (Self-Organizing Map) is one of the most popular artificial neural network algorithms in the unsupervised learning category. Sijo Mathew Preetha Joy Sibi Rajendra Manoj A
More informationParallelization: Binary Tree Traversal
By Aaron Weeden and Patrick Royal Shodor Education Foundation, Inc. August 2012 Introduction: According to Moore s law, the number of transistors on a computer chip doubles roughly every two years. First
More informationParallel Scalable Algorithms- Performance Parameters
www.bsc.es Parallel Scalable Algorithms- Performance Parameters Vassil Alexandrov, ICREA - Barcelona Supercomputing Center, Spain Overview Sources of Overhead in Parallel Programs Performance Metrics for
More informationChoosing a Computer for Running SLX, P3D, and P5
Choosing a Computer for Running SLX, P3D, and P5 This paper is based on my experience purchasing a new laptop in January, 2010. I ll lead you through my selection criteria and point you to some on-line
More informationCloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com
Parallels Cloud Storage White Paper Performance Benchmark Results www.parallels.com Table of Contents Executive Summary... 3 Architecture Overview... 3 Key Features... 4 No Special Hardware Requirements...
More informationA Robust Dynamic Load-balancing Scheme for Data Parallel Application on Message Passing Architecture
A Robust Dynamic Load-balancing Scheme for Data Parallel Application on Message Passing Architecture Yangsuk Kee Department of Computer Engineering Seoul National University Seoul, 151-742, Korea Soonhoi
More informationAlgorithms and optimization for search engine marketing
Algorithms and optimization for search engine marketing Using portfolio optimization to achieve optimal performance of a search campaign and better forecast ROI Contents 1: The portfolio approach 3: Why
More informationEvaluating HDFS I/O Performance on Virtualized Systems
Evaluating HDFS I/O Performance on Virtualized Systems Xin Tang xtang@cs.wisc.edu University of Wisconsin-Madison Department of Computer Sciences Abstract Hadoop as a Service (HaaS) has received increasing
More informationAutomatic Logging of Operating System Effects to Guide Application-Level Architecture Simulation
Automatic Logging of Operating System Effects to Guide Application-Level Architecture Simulation Satish Narayanasamy, Cristiano Pereira, Harish Patil, Robert Cohn, and Brad Calder Computer Science and
More informationImproved Software Testing Using McCabe IQ Coverage Analysis
White Paper Table of Contents Introduction...1 What is Coverage Analysis?...2 The McCabe IQ Approach to Coverage Analysis...3 The Importance of Coverage Analysis...4 Where Coverage Analysis Fits into your
More informationContributions to Gang Scheduling
CHAPTER 7 Contributions to Gang Scheduling In this Chapter, we present two techniques to improve Gang Scheduling policies by adopting the ideas of this Thesis. The first one, Performance- Driven Gang Scheduling,
More informationA Lab Course on Computer Architecture
A Lab Course on Computer Architecture Pedro López José Duato Depto. de Informática de Sistemas y Computadores Facultad de Informática Universidad Politécnica de Valencia Camino de Vera s/n, 46071 - Valencia,
More informationBenchmark Hadoop and Mars: MapReduce on cluster versus on GPU
Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview
More informationAnalysis and Modeling of MapReduce s Performance on Hadoop YARN
Analysis and Modeling of MapReduce s Performance on Hadoop YARN Qiuyi Tang Dept. of Mathematics and Computer Science Denison University tang_j3@denison.edu Dr. Thomas C. Bressoud Dept. of Mathematics and
More informationStream Processing on GPUs Using Distributed Multimedia Middleware
Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research
More informationObservations on Data Distribution and Scalability of Parallel and Distributed Image Processing Applications
Observations on Data Distribution and Scalability of Parallel and Distributed Image Processing Applications Roman Pfarrhofer and Andreas Uhl uhl@cosy.sbg.ac.at R. Pfarrhofer & A. Uhl 1 Carinthia Tech Institute
More informationOperating System Impact on SMT Architecture
Operating System Impact on SMT Architecture The work published in An Analysis of Operating System Behavior on a Simultaneous Multithreaded Architecture, Josh Redstone et al., in Proceedings of the 9th
More informationDISCOVERING AND EXPLOITING PROGRAM PHASES
DISCOVERING AND EXPLOITING PROGRAM PHASES IN A SINGLE SECOND, A MODERN PROCESSOR CAN EXECUTE BILLIONS OF INSTRUCTIONS AND A PROGRAM S BEHAVIOR CAN CHANGE MANY TIMES. SOME PROGRAMS CHANGE BEHAVIOR DRASTICALLY,
More informationInstruction Set Architecture (ISA)
Instruction Set Architecture (ISA) * Instruction set architecture of a machine fills the semantic gap between the user and the machine. * ISA serves as the starting point for the design of a new machine
More informationOptimization of Cluster Web Server Scheduling from Site Access Statistics
Optimization of Cluster Web Server Scheduling from Site Access Statistics Nartpong Ampornaramveth, Surasak Sanguanpong Faculty of Computer Engineering, Kasetsart University, Bangkhen Bangkok, Thailand
More informationAn examination of the dual-core capability of the new HP xw4300 Workstation
An examination of the dual-core capability of the new HP xw4300 Workstation By employing single- and dual-core Intel Pentium processor technology, users have a choice of processing power options in a compact,
More informationPerformance Measurement of Dynamically Compiled Java Executions
Performance Measurement of Dynamically Compiled Java Executions Tia Newhall and Barton P. Miller University of Wisconsin Madison Madison, WI 53706-1685 USA +1 (608) 262-1204 {newhall,bart}@cs.wisc.edu
More informationUnit 4: Performance & Benchmarking. Performance Metrics. This Unit. CIS 501: Computer Architecture. Performance: Latency vs.
This Unit CIS 501: Computer Architecture Unit 4: Performance & Benchmarking Metrics Latency and throughput Speedup Averaging CPU Performance Performance Pitfalls Slides'developed'by'Milo'Mar0n'&'Amir'Roth'at'the'University'of'Pennsylvania'
More informationPerformance Characteristics of a Cost-Effective Medium-Sized Beowulf Cluster Supercomputer
Res. Lett. Inf. Math. Sci., 2003, Vol.5, pp 1-10 Available online at http://iims.massey.ac.nz/research/letters/ 1 Performance Characteristics of a Cost-Effective Medium-Sized Beowulf Cluster Supercomputer
More informationKeywords: Dynamic Load Balancing, Process Migration, Load Indices, Threshold Level, Response Time, Process Age.
Volume 3, Issue 10, October 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Load Measurement
More informationIntroduction to Cloud Computing
Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic
More informationOptimizing matrix multiplication Amitabha Banerjee abanerjee@ucdavis.edu
Optimizing matrix multiplication Amitabha Banerjee abanerjee@ucdavis.edu Present compilers are incapable of fully harnessing the processor architecture complexity. There is a wide gap between the available
More informationLattice QCD Performance. on Multi core Linux Servers
Lattice QCD Performance on Multi core Linux Servers Yang Suli * Department of Physics, Peking University, Beijing, 100871 Abstract At the moment, lattice quantum chromodynamics (lattice QCD) is the most
More informationThe ROI from Optimizing Software Performance with Intel Parallel Studio XE
The ROI from Optimizing Software Performance with Intel Parallel Studio XE Intel Parallel Studio XE delivers ROI solutions to development organizations. This comprehensive tool offering for the entire
More informationThe IntelliMagic White Paper: Green Storage: Reduce Power not Performance. December 2010
The IntelliMagic White Paper: Green Storage: Reduce Power not Performance December 2010 Summary: This white paper provides techniques to configure the disk drives in your storage system such that they
More informationPerformance Modeling and Analysis of a Database Server with Write-Heavy Workload
Performance Modeling and Analysis of a Database Server with Write-Heavy Workload Manfred Dellkrantz, Maria Kihl 2, and Anders Robertsson Department of Automatic Control, Lund University 2 Department of
More informationPractical Guide to the Simplex Method of Linear Programming
Practical Guide to the Simplex Method of Linear Programming Marcel Oliver Revised: April, 0 The basic steps of the simplex algorithm Step : Write the linear programming problem in standard form Linear
More informationA Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures
11 th International LS-DYNA Users Conference Computing Technology A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures Yih-Yih Lin Hewlett-Packard Company Abstract In this paper, the
More informationInside the Erlang VM
Rev A Inside the Erlang VM with focus on SMP Prepared by Kenneth Lundin, Ericsson AB Presentation held at Erlang User Conference, Stockholm, November 13, 2008 1 Introduction The history of support for
More informationImplementing Portfolio Management: Integrating Process, People and Tools
AAPG Annual Meeting March 10-13, 2002 Houston, Texas Implementing Portfolio Management: Integrating Process, People and Howell, John III, Portfolio Decisions, Inc., Houston, TX: Warren, Lillian H., Portfolio
More informationPerformance monitoring at CERN openlab. July 20 th 2012 Andrzej Nowak, CERN openlab
Performance monitoring at CERN openlab July 20 th 2012 Andrzej Nowak, CERN openlab Data flow Reconstruction Selection and reconstruction Online triggering and filtering in detectors Raw Data (100%) Event
More informationFPGA area allocation for parallel C applications
1 FPGA area allocation for parallel C applications Vlad-Mihai Sima, Elena Moscu Panainte, Koen Bertels Computer Engineering Faculty of Electrical Engineering, Mathematics and Computer Science Delft University
More informationA Comparison of General Approaches to Multiprocessor Scheduling
A Comparison of General Approaches to Multiprocessor Scheduling Jing-Chiou Liou AT&T Laboratories Middletown, NJ 0778, USA jing@jolt.mt.att.com Michael A. Palis Department of Computer Science Rutgers University
More informationMOSIX: High performance Linux farm
MOSIX: High performance Linux farm Paolo Mastroserio [mastroserio@na.infn.it] Francesco Maria Taurino [taurino@na.infn.it] Gennaro Tortone [tortone@na.infn.it] Napoli Index overview on Linux farm farm
More informationBinary search tree with SIMD bandwidth optimization using SSE
Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous
More informationPerformance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France
More informationAn Approach to High-Performance Scalable Temporal Object Storage
An Approach to High-Performance Scalable Temporal Object Storage Kjetil Nørvåg Department of Computer and Information Science Norwegian University of Science and Technology 791 Trondheim, Norway email:
More informationPaul s Norwegian Vacation (or Experiences with Cluster Computing ) Paul Sack 20 September, 2002. sack@stud.ntnu.no www.stud.ntnu.
Paul s Norwegian Vacation (or Experiences with Cluster Computing ) Paul Sack 20 September, 2002 sack@stud.ntnu.no www.stud.ntnu.no/ sack/ Outline Background information Work on clusters Profiling tools
More informationHigh Performance Computing for Operation Research
High Performance Computing for Operation Research IEF - Paris Sud University claude.tadonki@u-psud.fr INRIA-Alchemy seminar, Thursday March 17 Research topics Fundamental Aspects of Algorithms and Complexity
More informationFacebook Friend Suggestion Eytan Daniyalzade and Tim Lipus
Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information
More informationPerformance Analysis and Optimization Tool
Performance Analysis and Optimization Tool Andres S. CHARIF-RUBIAL andres.charif@uvsq.fr Performance Analysis Team, University of Versailles http://www.maqao.org Introduction Performance Analysis Develop
More informationVHDL DESIGN OF EDUCATIONAL, MODERN AND OPEN- ARCHITECTURE CPU
VHDL DESIGN OF EDUCATIONAL, MODERN AND OPEN- ARCHITECTURE CPU Martin Straka Doctoral Degree Programme (1), FIT BUT E-mail: strakam@fit.vutbr.cz Supervised by: Zdeněk Kotásek E-mail: kotasek@fit.vutbr.cz
More informationThe Effects of Start Prices on the Performance of the Certainty Equivalent Pricing Policy
BMI Paper The Effects of Start Prices on the Performance of the Certainty Equivalent Pricing Policy Faculty of Sciences VU University Amsterdam De Boelelaan 1081 1081 HV Amsterdam Netherlands Author: R.D.R.
More informationThe Green Index: A Metric for Evaluating System-Wide Energy Efficiency in HPC Systems
202 IEEE 202 26th IEEE International 26th International Parallel Parallel and Distributed and Distributed Processing Processing Symposium Symposium Workshops Workshops & PhD Forum The Green Index: A Metric
More informationExploiting GPU Hardware Saturation for Fast Compiler Optimization
Exploiting GPU Hardware Saturation for Fast Compiler Optimization Alberto Magni School of Informatics University of Edinburgh United Kingdom a.magni@sms.ed.ac.uk Christophe Dubach School of Informatics
More informationRecommended hardware system configurations for ANSYS users
Recommended hardware system configurations for ANSYS users The purpose of this document is to recommend system configurations that will deliver high performance for ANSYS users across the entire range
More informationPARALLELIZED SUDOKU SOLVING ALGORITHM USING OpenMP
PARALLELIZED SUDOKU SOLVING ALGORITHM USING OpenMP Sruthi Sankar CSE 633: Parallel Algorithms Spring 2014 Professor: Dr. Russ Miller Sudoku: the puzzle A standard Sudoku puzzles contains 81 grids :9 rows
More informationUnderstanding the Benefits of IBM SPSS Statistics Server
IBM SPSS Statistics Server Understanding the Benefits of IBM SPSS Statistics Server Contents: 1 Introduction 2 Performance 101: Understanding the drivers of better performance 3 Why performance is faster
More informationBest Practices for Deploying SSDs in a Microsoft SQL Server 2008 OLTP Environment with Dell EqualLogic PS-Series Arrays
Best Practices for Deploying SSDs in a Microsoft SQL Server 2008 OLTP Environment with Dell EqualLogic PS-Series Arrays Database Solutions Engineering By Murali Krishnan.K Dell Product Group October 2009
More information- An Essential Building Block for Stable and Reliable Compute Clusters
Ferdinand Geier ParTec Cluster Competence Center GmbH, V. 1.4, March 2005 Cluster Middleware - An Essential Building Block for Stable and Reliable Compute Clusters Contents: Compute Clusters a Real Alternative
More informationAchieving Mainframe-Class Performance on Intel Servers Using InfiniBand Building Blocks. An Oracle White Paper April 2003
Achieving Mainframe-Class Performance on Intel Servers Using InfiniBand Building Blocks An Oracle White Paper April 2003 Achieving Mainframe-Class Performance on Intel Servers Using InfiniBand Building
More informationSpring 2011 Prof. Hyesoon Kim
Spring 2011 Prof. Hyesoon Kim Today, we will study typical patterns of parallel programming This is just one of the ways. Materials are based on a book by Timothy. Decompose Into tasks Original Problem
More informationThe Double-layer Master-Slave Model : A Hybrid Approach to Parallel Programming for Multicore Clusters
The Double-layer Master-Slave Model : A Hybrid Approach to Parallel Programming for Multicore Clusters User s Manual for the HPCVL DMSM Library Gang Liu and Hartmut L. Schmider High Performance Computing
More informationTHE FLORIDA STATE UNIVERSITY COLLEGE OF ARTS AND SCIENCES APPLICATION CONFIGURABLE PROCESSORS CHRISTOPHER J. ZIMMER
THE FLORIDA STATE UNIVERSITY COLLEGE OF ARTS AND SCIENCES APPLICATION CONFIGURABLE PROCESSORS By CHRISTOPHER J. ZIMMER A Thesis submitted to the Department of Computer Science In partial fulfillment of
More informationInterconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003
Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003 Josef Pelikán Charles University in Prague, KSVI Department, Josef.Pelikan@mff.cuni.cz Abstract 1 Interconnect quality
More informationReconfigurable Architecture Requirements for Co-Designed Virtual Machines
Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Kenneth B. Kent University of New Brunswick Faculty of Computer Science Fredericton, New Brunswick, Canada ken@unb.ca Micaela Serra
More informationAPPENDIX 1 USER LEVEL IMPLEMENTATION OF PPATPAN IN LINUX SYSTEM
152 APPENDIX 1 USER LEVEL IMPLEMENTATION OF PPATPAN IN LINUX SYSTEM A1.1 INTRODUCTION PPATPAN is implemented in a test bed with five Linux system arranged in a multihop topology. The system is implemented
More informationThe Taxman Game. Robert K. Moniot September 5, 2003
The Taxman Game Robert K. Moniot September 5, 2003 1 Introduction Want to know how to beat the taxman? Legally, that is? Read on, and we will explore this cute little mathematical game. The taxman game
More informationImproving Scalability for Citrix Presentation Server
VMWARE PERFORMANCE STUDY VMware ESX Server 3. Improving Scalability for Citrix Presentation Server Citrix Presentation Server administrators have often opted for many small servers (with one or two CPUs)
More informationControl 2004, University of Bath, UK, September 2004
Control, University of Bath, UK, September ID- IMPACT OF DEPENDENCY AND LOAD BALANCING IN MULTITHREADING REAL-TIME CONTROL ALGORITHMS M A Hossain and M O Tokhi Department of Computing, The University of
More informationGet an Easy Performance Boost Even with Unthreaded Apps. with Intel Parallel Studio XE for Windows*
Get an Easy Performance Boost Even with Unthreaded Apps for Windows* Can recompiling just one file make a difference? Yes, in many cases it can! Often, you can achieve a major performance boost by recompiling
More informationA Case for Dynamic Selection of Replication and Caching Strategies
A Case for Dynamic Selection of Replication and Caching Strategies Swaminathan Sivasubramanian Guillaume Pierre Maarten van Steen Dept. of Mathematics and Computer Science Vrije Universiteit, Amsterdam,
More informationOpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA
OpenCL Optimization San Jose 10/2/2009 Peng Wang, NVIDIA Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary Overall Optimization
More information