Reducing Simulation Time by Parallelizing SimpleScalar in MPI Through The Use of SimPoint Generated Intervals


Reducing Simulation Time by Parallelizing SimpleScalar in MPI Through The Use of SimPoint Generated Intervals James Michael Poe II 1, Fernando Hernandez 2, and Tao Li 3 Department of Electrical and Computer Engineering University of Florida, Gainesville, FL { 1 jpoe, 2 fhernand}@ufl.edu, 3 taoli@ece.ufl.edu Abstract Cycle accurate simulation is an essential tool used in the evaluation and design exploration of modern computer architectures. Due to increasing complexity, additional critical constraints, and an ever-expanding design space, cycle-level simulation time is growing at an unprecedented rate. A parallel approach to simulation at first appears to be an obvious choice; however, previous attempts to parallelize simulators have been largely unsuccessful due to the sequential nature of the problem, often resorting to parallelizing multiple threads or simulating multiple processors. With recent advances in phase analysis, however, the use of "SimPoint" to choose simulation points can break a single thread of execution into multiple representative intervals of simulation that are entirely independent. In this paper, we use SimPoint simulation points to demonstrate a parallel approach to architecture simulation using MPI (Message Passing Interface) on a cluster of computers. Our preliminary experimental results show that this approach greatly reduces simulation time while maintaining the high levels of accuracy that SimPoint provides. 1. Introduction Cycle accurate simulators play a vital role in the design and evaluation of modern computer architectures. A recent journal article reported that in the four major conferences in computer architecture (MICRO, ISCA, ASPLOS, and HPCA), cycle-precise simulation was used in 72 out of 103 papers [1]. As the complexity of designs and the size of applications continue to increase, so does the simulation time required for accurate results. 
Lengthy simulation periods not only delay the validation and reporting of results to the rest of the architecture community, but also often force designers either to choose faster yet less accurate models or to focus on a small fraction of program execution. Much of architecture design is a trial-and-error process, and the number of trials completed is often directly dependent upon the time it takes to complete those trials. Thus, it is imperative that methods to decrease simulation time without forfeiting much accuracy be explored. At first, a parallel approach to simulation seems like an obvious choice, as is the case with any computationally intensive problem. This presents difficulty, however, due to the inherently sequential nature of simulating a single thread of execution. With recent advances in phase analysis [2], it has become possible to break a single thread of execution into multiple representative intervals that, when weighted appropriately and combined, provide an accurate model of the entire execution of the workload. These intervals are entirely independent and thus can be simulated in parallel, resulting in reduced simulation time. As the price of commercial off-the-shelf (COTS) systems continues to fall, and the performance of high-speed interconnects continues to rise, clusters continue to provide an appealing low-cost platform for parallel computing. Various methodologies and software packages can be used to parallelize tasks on clusters, and a popular choice is the Message Passing Interface (MPI) standard [3]. MPI can be used with both C and FORTRAN, and MPI functionality can be conveniently added to existing programs through calls to the MPI library. While MPI and clustered systems share many advantages, both have limitations when heavy communication is required, due to the latency and overhead of sending and receiving messages. 
Therefore, to effectively use a clustered MPI system as a parallelization vehicle, programs are usually required to have a very low communication-to-computation ratio, a characteristic easily attainable in a parallel architectural simulator. In this paper we showcase a parallel simulation framework built with minimal additions and modifications to a cycle-level simulator (SimpleScalar's sim-outorder simulator [4]), MPI, and simulation points generated by the SimPoint [5] algorithm.

Experimental results show that when compared to running an entire benchmark, this method provides an excellent simulation speedup, and when compared to using SimPoint intervals on a single-processor machine, it shows a good, predictable speedup, all while maintaining high levels of accuracy. This paper is organized as follows: Section 2 briefly describes background information on the process flow of parallelizing simulations through SimPoint. Section 3 describes the experimental methodology and implementation issues. Section 4 presents the simulation results. Section 5 discusses related work. Finally, Section 6 concludes the paper and outlines future work. 2. Background Information The process of simulating multiple simulation points in parallel begins with the generation of those points for a particular benchmark. This is done by first generating a basic block vector of the benchmark, which can be performed using Sim-Fast BBV Tracker [6], ATOM BBV Tracker [7], or another basic block vector profiling program of the user's choice. Once the basic block vector has been generated, it is fed into the SimPoint phase analysis software, which generates the representative simulation points and their corresponding weighting factors [8]. Once these simulation points and weights have been generated, the modified cycle-level simulator reads in the values, simulates the different points in parallel, and writes the results to file. In the above described method, the generation of simulation points has to be performed sequentially. Therefore, the time that it takes to generate the basic block vector and compute the representative simulation points must be taken into consideration when evaluating the efficiency of this parallel approach to simulation. These points, however, only ever need to be computed once for any benchmark/input combination. 
Thus the time that it takes to calculate the basic block vector, roughly the amount of time it takes to fast forward through the entire benchmark once (about 1/10th of the amount of time it takes to do cycle-level simulation in the SimpleScalar suite), and the time that it takes to generate the simulation points using SimPoint (a relatively short amount of time, even for the largest SPEC benchmarks with 100M intervals), can be amortized over multiple runs of the parallel simulator. Further, these intervals have already been generated for most of the SPEC2000 benchmarks, are available from the SimPoint website, and indeed were used for all of the results reported in this paper. 3. Implementation In this section we provide a brief introduction and summary of the simulator that was chosen and the modifications that were necessary to provide the parallel execution of simulation points. 3.1 Base Simulator The simulator chosen for modification was the sim-outorder cycle-level simulator available in the SimpleScalar suite [4]. While parallelization can be attempted with any simulator, the SimpleScalar suite was chosen because it provides most of the functionality necessary to natively take advantage of the SimPoint-generated simulation points, has been exhaustively tested and verified, and is widely used in the architecture community. Message Passing Interface code was used to parallelize the simulator, as the modified simulator only needed to communicate twice (namely, once to disburse the intervals and once to recombine the intervals). Several modifications to sim-outorder were necessary to allow parallel simulation. First, the internal fast forward variable had to be modified to allow fast functional simulation past 2.1 billion instructions, the 32-bit signed integer limit. The sim-outorder code was then extended to accept a new argument, interval, which the user supplies to specify the interval size. 
Since the master process (the process responsible for the generation and distribution of tasks) now determines and disseminates the fast forward and maximum instruction variables to the slaves (the processes that carry out the work assigned by the master), these arguments are no longer accepted from the user. 3.2 Interval Allocation One of the most important decisions in the design of any parallel program is how tasks will be allocated to processors, and this is certainly true in the design of a parallel simulator. Since all of the intervals that need to be simulated are known before allocation, and the time each interval takes to complete is determined by the distance that must be fast-forwarded, an optimal allocation can be predetermined; static assignment is therefore preferred over dynamic. This allows the master process to disseminate the tasks and then simulate its own intervals. The algorithm used to assign the intervals to individual processes is as follows: 1. Arrange the simulation points from the farthest away to the closest. 2. Assign the farthest unassigned point to the process with the minimum point distance sum, taking into account the time needed to

complete the cycle-precise simulation of that point. 3. Repeat until all points have been assigned. Table 1: The result of the allocation strategy for a benchmark with 9 intervals using 3 processors. In the table above, each column represents the intervals assigned to each respective processor, and the bottom row displays the sum of the intervals, which is directly proportional to the simulation time. To arrive at this distribution, it is important to first realize that simulation time comprises two parts: the time spent fast forwarding through the preceding code sections, and the time spent performing a detailed simulation on the target interval. Since the time to fast forward through a given block of instructions is not the same as the time needed to perform a detailed simulation on the same block, an offset must be added to each interval that represents the added time spent performing the detailed simulation once the target interval is reached. From our experimental results, it was determined that detailed simulation is between 7 and 13 times slower than fast forwarding, so intervals are assigned assuming they are 15 units larger. The algorithm begins by assigning intervals 840, 835, and 834. Interval 526 is then assigned to processor 3, because at this point it is the processor with the smallest interval sum. Intervals 491 and 277 are then assigned by the same method, and intervals 13, 8, and 1 are all assigned to the first processor because, although it already has two intervals, its total interval sum is smaller than any other. 
3.3 Distribution, Simulation, and Recombination After the allocation of the simulation points has been determined, the results are saved into a three-dimensional double array with the following structure: simpoint[x][y][z] The first dimension (represented by the x variable) of the array designates the processor that will simulate the different points, each of which is stored in a row that is indexed by the second dimension (represented by the y variable). The third dimension (represented by the z variable) indicates both the simulation point and its corresponding weight. After this array has been populated, each processor is sent its simulation points: for (m = 1; m < ntasks; m++) { MPI_Send(&simpoint[m], 2 * MAX_SIMPOINTS, MPI_DOUBLE, m, tag, MPI_COMM_WORLD); } where ntasks is the number of processors obtained using the MPI_Comm_size function, and MAX_SIMPOINTS is the maximum number of possible simulation points. The for loop begins at 1 because the master has saved its own simulation points, to be calculated, in the 0th level of the three-dimensional array. Each slave then issues the MPI_Recv function and stores the result into the 0th level of its array: MPI_Recv(&simpoint[0], 2 * MAX_SIMPOINTS, MPI_DOUBLE, 0, tag, MPI_COMM_WORLD, &status); From this point on, all processes can access their respective simulation points by referencing the same level of the array. The simulator then loops through all of the simulation points until all points have been processed, resetting the internal state of the simulator and updating the fast forward variable on each loop. When each process has completed the simulation of all of its points, it notifies the master. Once the master has completed all of its own points, it awaits the results from the slave processes, and after all of the intermediate results have been received, weighted, and accumulated, the master outputs the final results and exits. 
One limitation of the current implementation is that it does not support any form of warm-up period. While a warm-up period is not necessary for many programs as long as a large enough interval size is used (for instance, 100M instructions), it is needed for accurate results at smaller interval sizes [2]. Warm-up functionality is discussed further in the future work section. 4. Results In this section we discuss the results that were obtained using the modified parallel version of sim-outorder. These results were compiled using a cluster of 32 nodes, each with dual 733 MHz Intel Pentium III processors and 256 MB of PC133 SDRAM, connected via Gigabit Ethernet, running Linux with MPICH [3] installed. While each node of the cluster has two Pentium III processors, only one was used to run the simulations while the other handled basic operating system processes. The simulation

points used for the SPEC2000 benchmarks were obtained via the SimPoint website [5] for 100M intervals. The simulated microarchitecture is the default configuration of the sim-outorder simulator. Figure 1: Accuracy comparison of the SimPoint algorithm using 1M intervals versus blindly fast forwarding a set number of instructions and versus starting from the beginning of program execution, for the entire SPEC2000 suite. Both non-SimPoint results use 300 million instructions. Graph taken from the SimPoint documentation [5]. 4.1 Validation While this paper is in no way intended to serve as a validation or verification of the SimPoint tool (for an extensive discussion of the validity of SimPoint as a means of reducing simulation time while maintaining accuracy, the reader is referred to the SimPoint website and documentation), it is important to provide a sense of the accuracy achievable with SimPoint at this stage to warrant further exploration of the speedups possible by parallelizing the simulator through the use of SimPoint simulation points. If one is unable to trust the results of the simulator, it hardly matters how quickly the simulation finishes. Figure 1 shows the simulation accuracy using SimPoint compared to the complete execution of the programs for the entire SPEC2000 benchmark suite. Results are shown for the median, average, and maximum errors found. These results were obtained by the authors of SimPoint using an instruction interval size of 1 million. The results are compared to simulating 300 million instructions (over 3 times that of the SimPoint results provided) at the beginning of the code, and also to blindly fast forwarding through 1 billion instructions. 
Figure 1 shows that beginning execution from the start of the program results in an average CPI error of 201%, fast forwarding through 1 billion instructions and then simulating results in an average CPI error of 99%, whereas using the SimPoint algorithm to create multiple simulation points results in an average error of 2.1% when compared to the full simulation of the program [2]. Figure 2: Accuracy comparison of the SimPoint algorithm using 100M intervals versus blindly fast forwarding a set number of instructions and versus starting from the beginning of execution, for seven SPEC2000 integer benchmarks (including perlbmk). Our results using 100M interval simulation points from the SimPoint website for these seven integer benchmarks were similar to those produced for the entire SPEC suite by the SimPoint authors and are shown in Figure 2. The non-SimPoint results were run for the same total number of instructions as the SimPoint versions (for example, if 8 simulation points were run in the parallel version, each consisting of 100M instructions, then 800M instructions were simulated from the start of execution, and also after fast forwarding 1 billion instructions). The goal of our research was to study the potential speedup obtainable by parallelizing the SimPoint algorithm results, not to verify their validity, and our accuracy results obviously do not depict a representative workload of the SPEC2000 suite. They do, however, give a sense of the accuracy maintained while producing the speedups reported in the following section. 4.2 Speedup We will now explore the speedups that were obtained using the parallel version of sim-outorder. Figure 3 shows the speedup when compared to running the entire benchmark, and Figure 4 shows the speedup when compared to a sequential baseline of running the simulation points consecutively on a single processor. Multiple runs of each benchmark were performed on a total number of processors ranging from one to n, where n represents the total number of simulation points produced by the SimPoint algorithm. 
Both figures clearly depict a linear speedup up to a certain

maximum speedup, and then level off almost immediately. The maximum speedup is limited by the fast forward distance to the farthest interval. The parallel version of the simulator must fast forward to the farthest distance at least once on one of the processors, and thus the time to complete the simulation can be no less than the amount of time required to fast forward to that point and simulate the required instructions. Fortunately, however, the number of processors required to achieve maximum speedup can easily be determined before the simulation is started. Since the difference in time to fast forward to different points within the same program is almost entirely determined by the distance between the points, and further, the time to simulate one interval's worth of instructions is approximately the same across a given benchmark, we can determine the number of processors required to achieve maximum speedup by repeating the allocation algorithm with an increasing number of processors until the farthest simulation point is the only point assigned to one of the processors. Table 2 shows the allocation of tasks for the benchmark that would be predicted by this method to achieve maximum speedup. While this method yields results that are close to the maximum speedup, it neglects the time it takes to actually simulate each interval. If the total fast forward distances are the same for two processors, but one of the two is scheduled to simulate twice as many points, that processor will take additional time to complete. This extra time can be compensated for by adding an offset to each of the simulation points before calculating the number of processors required for maximum speedup. Since the ratio of cycle-precise simulation time to fast forward simulation time averages around 10 (all of our cases were within the 7 to 13 range), a number such as 15 can be added to each simulation point. 
Table 3 shows the newly calculated allocation of simulation points to achieve maximum speedup, and Table 4 shows the results using the actual times from a simulation run. Table 2: Erroneous estimation of the number of processors required to achieve maximum speedup. Figure 3: Speedup achieved for various benchmarks with differing numbers of processors, as compared to full execution. Table 3: Estimation of the number of processors required to achieve maximum speedup. Figure 4: Speedup achieved for various benchmarks with differing numbers of processors, as compared to the sequential baseline. Table 4: Actual number of processors required to achieve maximum speedup, with real times shown in seconds. The maximum speedup attainable can be determined by using the following equation:

Speedup_max = T_T / T_L, where T_T is the total time to simulate all intervals on a single processor and T_L is the time to simulate the farthest interval. 4.3 Early SimPoint Early SimPoint is an algorithm that produces simulation points in much the same fashion as the standard SimPoint algorithm, but with a general tendency toward choosing earlier simulation points, with the aim of reducing simulation time. The results using the Early SimPoint algorithm show an additional speedup, though not as great as in other applications of this algorithm. This is expected, as the Early SimPoint intervals are smaller. Greater speedups would be possible if the algorithm explicitly focused on reducing the distance to the farthest simulation point, as this is the main factor in determining the maximum speedup possible. Of the 30 simulation points for integer benchmarks posted on the SimPoint website using 100M intervals, 19 (63%) of them reduced the distance to the farthest simulation point, and 15 (50%) of them reduced the farthest simulation point by more than 10%. For the 21 integer benchmarks using 10M intervals, 11 (52%) reduced the distance to the farthest simulation point and 8 (38%) did so by more than 10%. The simulation points posted for floating point benchmarks were much more impressive, with over 78% producing a greater than 10% difference for 100M intervals, and almost 70% reducing the distance by more than 10% for the 10M interval size. Therefore the potential speedup of using the Early SimPoint algorithm should be carefully considered and weighed against its reduced accuracy as compared to the regular SimPoint algorithm, especially in the case of integer benchmarks. Figure 5: Speedups estimated for various benchmarks using the Early SimPoint algorithm, as compared to full execution. 
Estimations for the speedups obtained using the Early SimPoint algorithm to generate the simulation points for the parallel version of sim-outorder are displayed in Figure 5 as compared to running the entire benchmark, and in Figure 6 as compared to running the sequential baseline. The estimations were calculated by using a linear regression model on the previously simulated points to determine the average fast forward and cycle-precise simulation times, and then using those times to estimate the time required to run the simulation using the optimal allocation algorithm. Figure 6: Speedups estimated for various benchmarks using the Early SimPoint algorithm, as compared to the sequential baseline. 5. Related Work Techniques to reduce simulation time have been proposed in [2, 9, 10, 11, 12]. These proposals exclusively focus on simulations running on a single-processor machine. The SimPoint framework proposed by Sherwood and Calder [2] reduces simulation time by using sets of representative instruction chunks to represent the entire program execution. Schnarr and Larus [9] use memoization to replay actions stored in a processor-action cache when the current microarchitectural state matches a previously encountered state. Conte et al. performed early work on sampling-based simulation [10]. Wunderlich et al. propose SMARTS [11], which further uses rigorous theoretical guidelines to determine the sampling rate needed to reduce simulation time. Huang et al. [12] propose EXPERT to reduce the simulation requirement by exploiting program behavior repetition. 6. Summary and Future Work Through the use of simulation points generated by the SimPoint algorithm, we were able to break up a single thread of execution into multiple independent

simulation points. These simulation points were individually simulated in parallel using a modified version of the sim-outorder simulator from the SimpleScalar suite that included MPI code to allocate, distribute, execute, and recombine the results from any number of processors up to the maximum number of points. The modified simulator was able to achieve a nearly linear speedup when compared to a sequential baseline, up to a maximum speedup determined by the distance to the farthest simulation point. While this maximum speedup cannot be surpassed, the number of processors at which it occurs can easily be calculated before the simulation is started, and thus assigning extra processors that are unable to increase the speedup can be avoided. Further, this limitation is accentuated by the relatively few simulation points that are generated by the SimPoint algorithm for intervals of 100M instructions. This paper did not explore intervals of less than 100M instructions due to the lack of support for a warm-up period in the current version of the parallel simulator. While in many cases a warm-up period is not needed when a 100M interval size is used, it is necessary for smaller interval sizes. Smaller interval sizes can provide increased accuracy, particularly for programs that have a reduced maximum number of instructions, and thus future revisions of the parallel simulator should include warm-up functionality. In addition, alternative approaches to repeated fast forwarding, such as checkpointing and multiple fast-forwards, should be explored in the parallel simulator to increase the maximum potential speedup. References [1] S. Girbal, G. Mouchard, A. Cohen, and O. Temam, DiST: A Simple, Reliable and Scalable Method to Significantly Reduce Processor Architecture Simulation Time, In Proceedings of the International Conference on Measurement and Modeling of Computer Systems. [2] T. Sherwood, E. Perelman, G. Hamerly and B. 
Calder, Automatically Characterizing Large Scale Program Behavior, In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems. [3] The Message Passing Interface (MPI) Standard. [4] SimpleScalar. [5] SimPoint. [6] Sim-Fast BBV Tracker, lar-bbv.htm. [7] ATOM BBV Tracker. [8] G. Hamerly, E. Perelman, and B. Calder, How to Use SimPoint to Pick Simulation Points, ACM SIGMETRICS Performance Evaluation Review. [9] E. Schnarr and J. Larus, Fast Out-of-Order Processor Simulation Using Memoization, In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems. [10] T. Conte, M. Hirsch, and K. Menezes, Reducing State Loss for Effective Trace Sampling of Superscalar Processors, In Proceedings of the International Conference on Computer Design. [11] R. Wunderlich, T. Wenisch, B. Falsafi, and J. Hoe, SMARTS: Accelerating Microarchitecture Simulation via Rigorous Statistical Sampling, In Proceedings of the International Symposium on Computer Architecture. [12] W. Liu and M. Huang, EXPERT: Expedited Simulation Exploiting Program Behavior Repetition, In Proceedings of the International Conference on Supercomputing.


More information

Workshare Process of Thread Programming and MPI Model on Multicore Architecture

Workshare Process of Thread Programming and MPI Model on Multicore Architecture Vol., No. 7, 011 Workshare Process of Thread Programming and MPI Model on Multicore Architecture R. Refianti 1, A.B. Mutiara, D.T Hasta 3 Faculty of Computer Science and Information Technology, Gunadarma

More information

LS-DYNA Scalability on Cray Supercomputers. Tin-Ting Zhu, Cray Inc. Jason Wang, Livermore Software Technology Corp.

LS-DYNA Scalability on Cray Supercomputers. Tin-Ting Zhu, Cray Inc. Jason Wang, Livermore Software Technology Corp. LS-DYNA Scalability on Cray Supercomputers Tin-Ting Zhu, Cray Inc. Jason Wang, Livermore Software Technology Corp. WP-LS-DYNA-12213 www.cray.com Table of Contents Abstract... 3 Introduction... 3 Scalability

More information

Performance Impacts of Non-blocking Caches in Out-of-order Processors

Performance Impacts of Non-blocking Caches in Out-of-order Processors Performance Impacts of Non-blocking Caches in Out-of-order Processors Sheng Li; Ke Chen; Jay B. Brockman; Norman P. Jouppi HP Laboratories HPL-2011-65 Keyword(s): Non-blocking cache; MSHR; Out-of-order

More information

Designing and Building Applications for Extreme Scale Systems CS598 William Gropp www.cs.illinois.edu/~wgropp

Designing and Building Applications for Extreme Scale Systems CS598 William Gropp www.cs.illinois.edu/~wgropp Designing and Building Applications for Extreme Scale Systems CS598 William Gropp www.cs.illinois.edu/~wgropp Welcome! Who am I? William (Bill) Gropp Professor of Computer Science One of the Creators of

More information

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,

More information

HyperThreading Support in VMware ESX Server 2.1

HyperThreading Support in VMware ESX Server 2.1 HyperThreading Support in VMware ESX Server 2.1 Summary VMware ESX Server 2.1 now fully supports Intel s new Hyper-Threading Technology (HT). This paper explains the changes that an administrator can expect

More information

Load Balancing on a Non-dedicated Heterogeneous Network of Workstations

Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Load Balancing on a Non-dedicated Heterogeneous Network of Workstations Dr. Maurice Eggen Nathan Franklin Department of Computer Science Trinity University San Antonio, Texas 78212 Dr. Roger Eggen Department

More information

ESX Server Performance and Resource Management for CPU-Intensive Workloads

ESX Server Performance and Resource Management for CPU-Intensive Workloads VMWARE WHITE PAPER VMware ESX Server 2 ESX Server Performance and Resource Management for CPU-Intensive Workloads VMware ESX Server 2 provides a robust, scalable virtualization framework for consolidating

More information

Virtuoso and Database Scalability

Virtuoso and Database Scalability Virtuoso and Database Scalability By Orri Erling Table of Contents Abstract Metrics Results Transaction Throughput Initializing 40 warehouses Serial Read Test Conditions Analysis Working Set Effect of

More information

Performance Evaluation and Optimization of A Custom Native Linux Threads Library

Performance Evaluation and Optimization of A Custom Native Linux Threads Library Center for Embedded Computer Systems University of California, Irvine Performance Evaluation and Optimization of A Custom Native Linux Threads Library Guantao Liu and Rainer Dömer Technical Report CECS-12-11

More information

Energy Efficient MapReduce

Energy Efficient MapReduce Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

More information

Fast Sequential Summation Algorithms Using Augmented Data Structures

Fast Sequential Summation Algorithms Using Augmented Data Structures Fast Sequential Summation Algorithms Using Augmented Data Structures Vadim Stadnik vadim.stadnik@gmail.com Abstract This paper provides an introduction to the design of augmented data structures that offer

More information

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics 22S:295 Seminar in Applied Statistics High Performance Computing in Statistics Luke Tierney Department of Statistics & Actuarial Science University of Iowa August 30, 2007 Luke Tierney (U. of Iowa) HPC

More information

ultra fast SOM using CUDA

ultra fast SOM using CUDA ultra fast SOM using CUDA SOM (Self-Organizing Map) is one of the most popular artificial neural network algorithms in the unsupervised learning category. Sijo Mathew Preetha Joy Sibi Rajendra Manoj A

More information

Parallelization: Binary Tree Traversal

Parallelization: Binary Tree Traversal By Aaron Weeden and Patrick Royal Shodor Education Foundation, Inc. August 2012 Introduction: According to Moore s law, the number of transistors on a computer chip doubles roughly every two years. First

More information

Parallel Scalable Algorithms- Performance Parameters

Parallel Scalable Algorithms- Performance Parameters www.bsc.es Parallel Scalable Algorithms- Performance Parameters Vassil Alexandrov, ICREA - Barcelona Supercomputing Center, Spain Overview Sources of Overhead in Parallel Programs Performance Metrics for

More information

Choosing a Computer for Running SLX, P3D, and P5

Choosing a Computer for Running SLX, P3D, and P5 Choosing a Computer for Running SLX, P3D, and P5 This paper is based on my experience purchasing a new laptop in January, 2010. I ll lead you through my selection criteria and point you to some on-line

More information

Cloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com

Cloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com Parallels Cloud Storage White Paper Performance Benchmark Results www.parallels.com Table of Contents Executive Summary... 3 Architecture Overview... 3 Key Features... 4 No Special Hardware Requirements...

More information

A Robust Dynamic Load-balancing Scheme for Data Parallel Application on Message Passing Architecture

A Robust Dynamic Load-balancing Scheme for Data Parallel Application on Message Passing Architecture A Robust Dynamic Load-balancing Scheme for Data Parallel Application on Message Passing Architecture Yangsuk Kee Department of Computer Engineering Seoul National University Seoul, 151-742, Korea Soonhoi

More information

Algorithms and optimization for search engine marketing

Algorithms and optimization for search engine marketing Algorithms and optimization for search engine marketing Using portfolio optimization to achieve optimal performance of a search campaign and better forecast ROI Contents 1: The portfolio approach 3: Why

More information

Evaluating HDFS I/O Performance on Virtualized Systems

Evaluating HDFS I/O Performance on Virtualized Systems Evaluating HDFS I/O Performance on Virtualized Systems Xin Tang xtang@cs.wisc.edu University of Wisconsin-Madison Department of Computer Sciences Abstract Hadoop as a Service (HaaS) has received increasing

More information

Automatic Logging of Operating System Effects to Guide Application-Level Architecture Simulation

Automatic Logging of Operating System Effects to Guide Application-Level Architecture Simulation Automatic Logging of Operating System Effects to Guide Application-Level Architecture Simulation Satish Narayanasamy, Cristiano Pereira, Harish Patil, Robert Cohn, and Brad Calder Computer Science and

More information

Improved Software Testing Using McCabe IQ Coverage Analysis

Improved Software Testing Using McCabe IQ Coverage Analysis White Paper Table of Contents Introduction...1 What is Coverage Analysis?...2 The McCabe IQ Approach to Coverage Analysis...3 The Importance of Coverage Analysis...4 Where Coverage Analysis Fits into your

More information

Contributions to Gang Scheduling

Contributions to Gang Scheduling CHAPTER 7 Contributions to Gang Scheduling In this Chapter, we present two techniques to improve Gang Scheduling policies by adopting the ideas of this Thesis. The first one, Performance- Driven Gang Scheduling,

More information

A Lab Course on Computer Architecture

A Lab Course on Computer Architecture A Lab Course on Computer Architecture Pedro López José Duato Depto. de Informática de Sistemas y Computadores Facultad de Informática Universidad Politécnica de Valencia Camino de Vera s/n, 46071 - Valencia,

More information

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview

More information

Analysis and Modeling of MapReduce s Performance on Hadoop YARN

Analysis and Modeling of MapReduce s Performance on Hadoop YARN Analysis and Modeling of MapReduce s Performance on Hadoop YARN Qiuyi Tang Dept. of Mathematics and Computer Science Denison University tang_j3@denison.edu Dr. Thomas C. Bressoud Dept. of Mathematics and

More information

Stream Processing on GPUs Using Distributed Multimedia Middleware

Stream Processing on GPUs Using Distributed Multimedia Middleware Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research

More information

Observations on Data Distribution and Scalability of Parallel and Distributed Image Processing Applications

Observations on Data Distribution and Scalability of Parallel and Distributed Image Processing Applications Observations on Data Distribution and Scalability of Parallel and Distributed Image Processing Applications Roman Pfarrhofer and Andreas Uhl uhl@cosy.sbg.ac.at R. Pfarrhofer & A. Uhl 1 Carinthia Tech Institute

More information

Operating System Impact on SMT Architecture

Operating System Impact on SMT Architecture Operating System Impact on SMT Architecture The work published in An Analysis of Operating System Behavior on a Simultaneous Multithreaded Architecture, Josh Redstone et al., in Proceedings of the 9th

More information

DISCOVERING AND EXPLOITING PROGRAM PHASES

DISCOVERING AND EXPLOITING PROGRAM PHASES DISCOVERING AND EXPLOITING PROGRAM PHASES IN A SINGLE SECOND, A MODERN PROCESSOR CAN EXECUTE BILLIONS OF INSTRUCTIONS AND A PROGRAM S BEHAVIOR CAN CHANGE MANY TIMES. SOME PROGRAMS CHANGE BEHAVIOR DRASTICALLY,

More information

Instruction Set Architecture (ISA)

Instruction Set Architecture (ISA) Instruction Set Architecture (ISA) * Instruction set architecture of a machine fills the semantic gap between the user and the machine. * ISA serves as the starting point for the design of a new machine

More information

Optimization of Cluster Web Server Scheduling from Site Access Statistics

Optimization of Cluster Web Server Scheduling from Site Access Statistics Optimization of Cluster Web Server Scheduling from Site Access Statistics Nartpong Ampornaramveth, Surasak Sanguanpong Faculty of Computer Engineering, Kasetsart University, Bangkhen Bangkok, Thailand

More information

An examination of the dual-core capability of the new HP xw4300 Workstation

An examination of the dual-core capability of the new HP xw4300 Workstation An examination of the dual-core capability of the new HP xw4300 Workstation By employing single- and dual-core Intel Pentium processor technology, users have a choice of processing power options in a compact,

More information

Performance Measurement of Dynamically Compiled Java Executions

Performance Measurement of Dynamically Compiled Java Executions Performance Measurement of Dynamically Compiled Java Executions Tia Newhall and Barton P. Miller University of Wisconsin Madison Madison, WI 53706-1685 USA +1 (608) 262-1204 {newhall,bart}@cs.wisc.edu

More information

Unit 4: Performance & Benchmarking. Performance Metrics. This Unit. CIS 501: Computer Architecture. Performance: Latency vs.

Unit 4: Performance & Benchmarking. Performance Metrics. This Unit. CIS 501: Computer Architecture. Performance: Latency vs. This Unit CIS 501: Computer Architecture Unit 4: Performance & Benchmarking Metrics Latency and throughput Speedup Averaging CPU Performance Performance Pitfalls Slides'developed'by'Milo'Mar0n'&'Amir'Roth'at'the'University'of'Pennsylvania'

More information

Performance Characteristics of a Cost-Effective Medium-Sized Beowulf Cluster Supercomputer

Performance Characteristics of a Cost-Effective Medium-Sized Beowulf Cluster Supercomputer Res. Lett. Inf. Math. Sci., 2003, Vol.5, pp 1-10 Available online at http://iims.massey.ac.nz/research/letters/ 1 Performance Characteristics of a Cost-Effective Medium-Sized Beowulf Cluster Supercomputer

More information

Keywords: Dynamic Load Balancing, Process Migration, Load Indices, Threshold Level, Response Time, Process Age.

Keywords: Dynamic Load Balancing, Process Migration, Load Indices, Threshold Level, Response Time, Process Age. Volume 3, Issue 10, October 2013 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Load Measurement

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic

More information

Optimizing matrix multiplication Amitabha Banerjee abanerjee@ucdavis.edu

Optimizing matrix multiplication Amitabha Banerjee abanerjee@ucdavis.edu Optimizing matrix multiplication Amitabha Banerjee abanerjee@ucdavis.edu Present compilers are incapable of fully harnessing the processor architecture complexity. There is a wide gap between the available

More information

Lattice QCD Performance. on Multi core Linux Servers

Lattice QCD Performance. on Multi core Linux Servers Lattice QCD Performance on Multi core Linux Servers Yang Suli * Department of Physics, Peking University, Beijing, 100871 Abstract At the moment, lattice quantum chromodynamics (lattice QCD) is the most

More information

The ROI from Optimizing Software Performance with Intel Parallel Studio XE

The ROI from Optimizing Software Performance with Intel Parallel Studio XE The ROI from Optimizing Software Performance with Intel Parallel Studio XE Intel Parallel Studio XE delivers ROI solutions to development organizations. This comprehensive tool offering for the entire

More information

The IntelliMagic White Paper: Green Storage: Reduce Power not Performance. December 2010

The IntelliMagic White Paper: Green Storage: Reduce Power not Performance. December 2010 The IntelliMagic White Paper: Green Storage: Reduce Power not Performance December 2010 Summary: This white paper provides techniques to configure the disk drives in your storage system such that they

More information

Performance Modeling and Analysis of a Database Server with Write-Heavy Workload

Performance Modeling and Analysis of a Database Server with Write-Heavy Workload Performance Modeling and Analysis of a Database Server with Write-Heavy Workload Manfred Dellkrantz, Maria Kihl 2, and Anders Robertsson Department of Automatic Control, Lund University 2 Department of

More information

Practical Guide to the Simplex Method of Linear Programming

Practical Guide to the Simplex Method of Linear Programming Practical Guide to the Simplex Method of Linear Programming Marcel Oliver Revised: April, 0 The basic steps of the simplex algorithm Step : Write the linear programming problem in standard form Linear

More information

A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures

A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures 11 th International LS-DYNA Users Conference Computing Technology A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures Yih-Yih Lin Hewlett-Packard Company Abstract In this paper, the

More information

Inside the Erlang VM

Inside the Erlang VM Rev A Inside the Erlang VM with focus on SMP Prepared by Kenneth Lundin, Ericsson AB Presentation held at Erlang User Conference, Stockholm, November 13, 2008 1 Introduction The history of support for

More information

Implementing Portfolio Management: Integrating Process, People and Tools

Implementing Portfolio Management: Integrating Process, People and Tools AAPG Annual Meeting March 10-13, 2002 Houston, Texas Implementing Portfolio Management: Integrating Process, People and Howell, John III, Portfolio Decisions, Inc., Houston, TX: Warren, Lillian H., Portfolio

More information

Performance monitoring at CERN openlab. July 20 th 2012 Andrzej Nowak, CERN openlab

Performance monitoring at CERN openlab. July 20 th 2012 Andrzej Nowak, CERN openlab Performance monitoring at CERN openlab July 20 th 2012 Andrzej Nowak, CERN openlab Data flow Reconstruction Selection and reconstruction Online triggering and filtering in detectors Raw Data (100%) Event

More information

FPGA area allocation for parallel C applications

FPGA area allocation for parallel C applications 1 FPGA area allocation for parallel C applications Vlad-Mihai Sima, Elena Moscu Panainte, Koen Bertels Computer Engineering Faculty of Electrical Engineering, Mathematics and Computer Science Delft University

More information

A Comparison of General Approaches to Multiprocessor Scheduling

A Comparison of General Approaches to Multiprocessor Scheduling A Comparison of General Approaches to Multiprocessor Scheduling Jing-Chiou Liou AT&T Laboratories Middletown, NJ 0778, USA jing@jolt.mt.att.com Michael A. Palis Department of Computer Science Rutgers University

More information

MOSIX: High performance Linux farm

MOSIX: High performance Linux farm MOSIX: High performance Linux farm Paolo Mastroserio [mastroserio@na.infn.it] Francesco Maria Taurino [taurino@na.infn.it] Gennaro Tortone [tortone@na.infn.it] Napoli Index overview on Linux farm farm

More information

Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

More information

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France

More information

An Approach to High-Performance Scalable Temporal Object Storage

An Approach to High-Performance Scalable Temporal Object Storage An Approach to High-Performance Scalable Temporal Object Storage Kjetil Nørvåg Department of Computer and Information Science Norwegian University of Science and Technology 791 Trondheim, Norway email:

More information

Paul s Norwegian Vacation (or Experiences with Cluster Computing ) Paul Sack 20 September, 2002. sack@stud.ntnu.no www.stud.ntnu.

Paul s Norwegian Vacation (or Experiences with Cluster Computing ) Paul Sack 20 September, 2002. sack@stud.ntnu.no www.stud.ntnu. Paul s Norwegian Vacation (or Experiences with Cluster Computing ) Paul Sack 20 September, 2002 sack@stud.ntnu.no www.stud.ntnu.no/ sack/ Outline Background information Work on clusters Profiling tools

More information

High Performance Computing for Operation Research

High Performance Computing for Operation Research High Performance Computing for Operation Research IEF - Paris Sud University claude.tadonki@u-psud.fr INRIA-Alchemy seminar, Thursday March 17 Research topics Fundamental Aspects of Algorithms and Complexity

More information

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information

More information

Performance Analysis and Optimization Tool

Performance Analysis and Optimization Tool Performance Analysis and Optimization Tool Andres S. CHARIF-RUBIAL andres.charif@uvsq.fr Performance Analysis Team, University of Versailles http://www.maqao.org Introduction Performance Analysis Develop

More information

VHDL DESIGN OF EDUCATIONAL, MODERN AND OPEN- ARCHITECTURE CPU

VHDL DESIGN OF EDUCATIONAL, MODERN AND OPEN- ARCHITECTURE CPU VHDL DESIGN OF EDUCATIONAL, MODERN AND OPEN- ARCHITECTURE CPU Martin Straka Doctoral Degree Programme (1), FIT BUT E-mail: strakam@fit.vutbr.cz Supervised by: Zdeněk Kotásek E-mail: kotasek@fit.vutbr.cz

More information

The Effects of Start Prices on the Performance of the Certainty Equivalent Pricing Policy

The Effects of Start Prices on the Performance of the Certainty Equivalent Pricing Policy BMI Paper The Effects of Start Prices on the Performance of the Certainty Equivalent Pricing Policy Faculty of Sciences VU University Amsterdam De Boelelaan 1081 1081 HV Amsterdam Netherlands Author: R.D.R.

More information

The Green Index: A Metric for Evaluating System-Wide Energy Efficiency in HPC Systems

The Green Index: A Metric for Evaluating System-Wide Energy Efficiency in HPC Systems 202 IEEE 202 26th IEEE International 26th International Parallel Parallel and Distributed and Distributed Processing Processing Symposium Symposium Workshops Workshops & PhD Forum The Green Index: A Metric

More information

Exploiting GPU Hardware Saturation for Fast Compiler Optimization

Exploiting GPU Hardware Saturation for Fast Compiler Optimization Exploiting GPU Hardware Saturation for Fast Compiler Optimization Alberto Magni School of Informatics University of Edinburgh United Kingdom a.magni@sms.ed.ac.uk Christophe Dubach School of Informatics

More information

Recommended hardware system configurations for ANSYS users

Recommended hardware system configurations for ANSYS users Recommended hardware system configurations for ANSYS users The purpose of this document is to recommend system configurations that will deliver high performance for ANSYS users across the entire range

More information

PARALLELIZED SUDOKU SOLVING ALGORITHM USING OpenMP

PARALLELIZED SUDOKU SOLVING ALGORITHM USING OpenMP PARALLELIZED SUDOKU SOLVING ALGORITHM USING OpenMP Sruthi Sankar CSE 633: Parallel Algorithms Spring 2014 Professor: Dr. Russ Miller Sudoku: the puzzle A standard Sudoku puzzles contains 81 grids :9 rows

More information

Understanding the Benefits of IBM SPSS Statistics Server

Understanding the Benefits of IBM SPSS Statistics Server IBM SPSS Statistics Server Understanding the Benefits of IBM SPSS Statistics Server Contents: 1 Introduction 2 Performance 101: Understanding the drivers of better performance 3 Why performance is faster

More information

Best Practices for Deploying SSDs in a Microsoft SQL Server 2008 OLTP Environment with Dell EqualLogic PS-Series Arrays

Best Practices for Deploying SSDs in a Microsoft SQL Server 2008 OLTP Environment with Dell EqualLogic PS-Series Arrays Best Practices for Deploying SSDs in a Microsoft SQL Server 2008 OLTP Environment with Dell EqualLogic PS-Series Arrays Database Solutions Engineering By Murali Krishnan.K Dell Product Group October 2009

More information

- An Essential Building Block for Stable and Reliable Compute Clusters

- An Essential Building Block for Stable and Reliable Compute Clusters Ferdinand Geier ParTec Cluster Competence Center GmbH, V. 1.4, March 2005 Cluster Middleware - An Essential Building Block for Stable and Reliable Compute Clusters Contents: Compute Clusters a Real Alternative

More information

Achieving Mainframe-Class Performance on Intel Servers Using InfiniBand Building Blocks. An Oracle White Paper April 2003

Achieving Mainframe-Class Performance on Intel Servers Using InfiniBand Building Blocks. An Oracle White Paper April 2003 Achieving Mainframe-Class Performance on Intel Servers Using InfiniBand Building Blocks An Oracle White Paper April 2003 Achieving Mainframe-Class Performance on Intel Servers Using InfiniBand Building

More information

Spring 2011 Prof. Hyesoon Kim

Spring 2011 Prof. Hyesoon Kim Spring 2011 Prof. Hyesoon Kim Today, we will study typical patterns of parallel programming This is just one of the ways. Materials are based on a book by Timothy. Decompose Into tasks Original Problem

More information

The Double-layer Master-Slave Model : A Hybrid Approach to Parallel Programming for Multicore Clusters

The Double-layer Master-Slave Model : A Hybrid Approach to Parallel Programming for Multicore Clusters The Double-layer Master-Slave Model : A Hybrid Approach to Parallel Programming for Multicore Clusters User s Manual for the HPCVL DMSM Library Gang Liu and Hartmut L. Schmider High Performance Computing

More information

THE FLORIDA STATE UNIVERSITY COLLEGE OF ARTS AND SCIENCES APPLICATION CONFIGURABLE PROCESSORS CHRISTOPHER J. ZIMMER

THE FLORIDA STATE UNIVERSITY COLLEGE OF ARTS AND SCIENCES APPLICATION CONFIGURABLE PROCESSORS CHRISTOPHER J. ZIMMER THE FLORIDA STATE UNIVERSITY COLLEGE OF ARTS AND SCIENCES APPLICATION CONFIGURABLE PROCESSORS By CHRISTOPHER J. ZIMMER A Thesis submitted to the Department of Computer Science In partial fulfillment of

More information

Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003

Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003 Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003 Josef Pelikán Charles University in Prague, KSVI Department, Josef.Pelikan@mff.cuni.cz Abstract 1 Interconnect quality

More information

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Kenneth B. Kent University of New Brunswick Faculty of Computer Science Fredericton, New Brunswick, Canada ken@unb.ca Micaela Serra

More information

APPENDIX 1 USER LEVEL IMPLEMENTATION OF PPATPAN IN LINUX SYSTEM

APPENDIX 1 USER LEVEL IMPLEMENTATION OF PPATPAN IN LINUX SYSTEM 152 APPENDIX 1 USER LEVEL IMPLEMENTATION OF PPATPAN IN LINUX SYSTEM A1.1 INTRODUCTION PPATPAN is implemented in a test bed with five Linux system arranged in a multihop topology. The system is implemented

More information

The Taxman Game. Robert K. Moniot September 5, 2003

The Taxman Game. Robert K. Moniot September 5, 2003 The Taxman Game Robert K. Moniot September 5, 2003 1 Introduction Want to know how to beat the taxman? Legally, that is? Read on, and we will explore this cute little mathematical game. The taxman game

More information

Improving Scalability for Citrix Presentation Server

Improving Scalability for Citrix Presentation Server VMWARE PERFORMANCE STUDY VMware ESX Server 3. Improving Scalability for Citrix Presentation Server Citrix Presentation Server administrators have often opted for many small servers (with one or two CPUs)

More information

Control 2004, University of Bath, UK, September 2004

Control 2004, University of Bath, UK, September 2004 Control, University of Bath, UK, September ID- IMPACT OF DEPENDENCY AND LOAD BALANCING IN MULTITHREADING REAL-TIME CONTROL ALGORITHMS M A Hossain and M O Tokhi Department of Computing, The University of

More information

Get an Easy Performance Boost Even with Unthreaded Apps. with Intel Parallel Studio XE for Windows*

Get an Easy Performance Boost Even with Unthreaded Apps. with Intel Parallel Studio XE for Windows* Get an Easy Performance Boost Even with Unthreaded Apps for Windows* Can recompiling just one file make a difference? Yes, in many cases it can! Often, you can achieve a major performance boost by recompiling

More information

A Case for Dynamic Selection of Replication and Caching Strategies

A Case for Dynamic Selection of Replication and Caching Strategies A Case for Dynamic Selection of Replication and Caching Strategies Swaminathan Sivasubramanian Guillaume Pierre Maarten van Steen Dept. of Mathematics and Computer Science Vrije Universiteit, Amsterdam,

More information

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA OpenCL Optimization San Jose 10/2/2009 Peng Wang, NVIDIA Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary Overall Optimization

More information