Applying Data Analysis to Big Data Benchmarks. Jazmine Olinger

Abstract

This paper describes an approach to simulating Big Data benchmarks quickly and accurately. Specifically, it uses the existing Macsim simulation project from the High Performance Architecture Lab at Georgia Tech and reduces simulation time on a benchmark in two steps: first analyzing the application with SimPoint to identify its critical points, and then modifying Macsim to simulate only those critical sections instead of the entire application. I used basic benchmarks for all of this research, but the same idea applies to, and would ideally work on, Big Data benchmarks or other computationally large applications.

Goals of Project

1. To understand SimPoint
2. To find out whether SimPoint is fast and accurate enough to simulate the desired applications
3. To create an environment for quickly testing applications with SimPoint results

Background Information

k-means clustering (MathWorks, n.d.)

k-means is a clustering algorithm that partitions a given data set into a given number of clusters. SimPoint generates many different clusterings with k-means and uses a set of criteria to select the best one for the purpose of simulation (a small number of well-defined clusters is desirable).

k-means algorithm

1. Place K points into the space represented by the objects that are being clustered. These points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.

SimPoint (Calder, n.d.)

SimPoint is a simulation analysis tool that uses a statistical method to find ideal simulation points in an application. It uses a frequency vector profile of a program to perform k-means clustering and select the simulation points. After a frequency vector file has been generated for an application, running SimPoint on it produces three meaningful output files:

simpoint file: the vectors chosen as Simulation Points and their corresponding cluster numbers.
weight file: a weight for each Simulation Point and its corresponding cluster number. The weight is the proportion of the program's execution that the Simulation Point represents.
label file: the final cluster label and the distance from the cluster center for each vector.

Results

SimPoint Result Data

This result data compares the actual CPI at each point (from a full Macsim run) against the CPI at only the points selected by SimPoint, each multiplied by the respective weight given by SimPoint. The error is very low for all but one benchmark (bzip2), which indicates that this particular program has patterns that do not lend themselves well to k-means clustering.

Benchmark   Macsim CPI      SimPoint CPI    Error
bzip2       2.86582681024   2.58894745673   9.66%
gcc         4.21720572526   4.17439724309   1.02%
lbm         4.3498114936    4.37925075498   0.68%
mcf         19.4838300792   19.1521495218   1.70%

The following graphs show the four benchmarks used, comparing the CPI recorded at each point from a full run (top graph) to the cluster each point is placed in. There is a very clear pattern in all but one (gcc), matching the changes in CPI to changes in cluster, as expected.

(bzip2, error 9.66%)
(lbm, error 0.68%)
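As a rough illustration of the analysis behind the result table, the following Python sketch runs the four k-means steps on toy frequency vectors, takes the member closest to each centroid as that cluster's simulation point (an assumption made here for illustration), weights it by the cluster's share of intervals, and compares the weighted CPI with the full-run average. The toy vectors, the per-interval CPI values, and all function names are hypothetical and are not part of SimPoint or Macsim. SimPoint itself tries many clusterings and applies further selection criteria that are omitted here. The last lines recompute the Error column, assuming it is the relative difference from the Macsim CPI.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(vectors, k, seed=0, max_iter=100):
    """Plain k-means, following the four steps listed in the background section."""
    rng = random.Random(seed)
    # Step 1: place K initial centroids (here: K distinct input vectors).
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(max_iter):
        # Step 2: assign every vector to its closest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(v, centroids[c]))
                  for v in vectors]
        # Step 3: recompute each centroid as the mean of its members.
        updated = []
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                updated.append([sum(col) / len(members) for col in zip(*members)])
            else:
                updated.append(centroids[c])  # leave an empty cluster where it is
        # Step 4: repeat until the centroids no longer move.
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids


def pick_simulation_points(vectors, labels, centroids):
    """Map cluster id -> (representative interval, weight of the cluster)."""
    points = {}
    for c, centroid in enumerate(centroids):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if members:
            rep = min(members, key=lambda i: sq_dist(vectors[i], centroid))
            points[c] = (rep, len(members) / len(vectors))
    return points


def weighted_cpi(points, cpi_per_interval):
    """CPI estimate: sum of weight * CPI over the simulation points only."""
    return sum(weight * cpi_per_interval[rep] for rep, weight in points.values())


def relative_error(full_cpi, estimated_cpi):
    return abs(full_cpi - estimated_cpi) / full_cpi


# Hypothetical data: six equal-length intervals forming two program phases.
vectors = [[9, 1], [8, 2], [9, 2], [1, 9], [2, 8], [1, 8]]
cpi = [2.0, 2.1, 2.0, 5.0, 5.2, 5.1]

labels, centroids = kmeans(vectors, k=2)
points = pick_simulation_points(vectors, labels, centroids)
estimate = weighted_cpi(points, cpi)
full_cpi = sum(cpi) / len(cpi)  # valid because the intervals have equal length
print(f"full {full_cpi:.4f}  estimate {estimate:.4f}  "
      f"error {100 * relative_error(full_cpi, estimate):.2f}%")

# CPI values copied from the result table (Macsim CPI, SimPoint CPI).
table = [
    ("bzip2", 2.86582681024, 2.58894745673),
    ("gcc", 4.21720572526, 4.17439724309),
    ("lbm", 4.3498114936, 4.37925075498),
    ("mcf", 19.4838300792, 19.1521495218),
]
for name, macsim, simpoint in table:
    print(f"{name}: {100 * relative_error(macsim, simpoint):.2f}%")
```

With well-separated phases, as in the toy data, one interval per cluster is enough to approximate the full-run CPI closely. A benchmark whose intervals do not cluster cleanly would be expected to show a larger gap, as bzip2 does in the table.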

(gcc, error 1.02%)
(mcf, error 1.70%)

SimPoint Sampler

To utilize the results of SimPoint in a meaningful way, I developed the SimPoint Sampler, which uses Macsim and SimPoint results together to simulate applications quickly. Instead of running the entire program through Macsim, it uses the provided Simulation Points to switch between two modes, emulation mode and timing mode. By running in timing mode only on the blocks identified by SimPoint and in emulation mode on all other blocks, it can simulate the entire application significantly faster than a full run of Macsim.

The SimPoint Sampler currently loses a great deal of accuracy in the reported CPI when switching between modes. However, if this problem within Macsim were fixed, it would be a very fast and accurate way (ideally exactly as accurate as SimPoint) to simulate applications.

Future Work

Future work on this project includes fixing the results from Macsim when switching modes, applying this method to Big Data benchmarks, and trying other methods on Big Data benchmarks if this one does not work.
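The mode switching described in the SimPoint Sampler section can be sketched in a few lines of Python. This is only an illustration of the idea, not the Sampler's actual code: the function names are hypothetical, nothing here calls Macsim, and the two-column "interval cluster" layout of the simpoint file is an assumption based on the file description given earlier.

```python
def parse_simpoints(text):
    """Parse 'interval cluster' pairs, one per line (assumed file layout)."""
    points = {}
    for line in text.splitlines():
        if line.strip():
            interval, cluster = line.split()
            points[int(interval)] = int(cluster)
    return points


def mode_schedule(num_intervals, simulation_points):
    """Timing mode on the simulation points, emulation mode everywhere else."""
    return ["timing" if i in simulation_points else "emulation"
            for i in range(num_intervals)]


def timed_fraction(schedule):
    """Share of intervals that need detailed (timing) simulation."""
    return schedule.count("timing") / len(schedule)


# Hypothetical simpoint file contents: intervals 2 and 5 were selected.
simpoints_text = """2 0
5 1
"""

schedule = mode_schedule(8, parse_simpoints(simpoints_text))
for interval, mode in enumerate(schedule):
    print(interval, mode)
print("fraction in timing mode:", timed_fraction(schedule))
```

The smaller the fraction of intervals in timing mode, the larger the potential speedup over a full run, provided that switching modes does not disturb the reported CPI.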

Bibliography

Calder, B. (n.d.). http://cseweb.ucsd.edu/~calder/simpoint/index.htm
MathWorks. (n.d.). http://www.mathworks.com/help/stats/k-means-clustering.html