Applying Data Analysis to Big Data Benchmarks. Jazmine Olinger




Abstract

This paper describes an effort to find accurate and fast ways to simulate Big Data benchmarks. Specifically, it takes the existing simulation project Macsim, from the High Performance Architecture Lab at Georgia Tech, and looks for ways to reduce simulation time on a benchmark: an analysis (using SimPoint) identifies the critical points of the overall application, and Macsim is modified to simulate those critical sections instead of the entire application. I used basic benchmarks for all of this research, but the same idea applies to, and would ideally work on, Big Data benchmarks or other computationally large applications.

Goals of Project

1. To understand SimPoint
2. To find out whether SimPoint is fast and accurate enough to simulate the desired applications
3. To create an environment for quickly testing applications with SimPoint results

Background Information

k-means clustering (MathWorks, n.d.)

k-means is a clustering algorithm that partitions a given data set into a given number of clusters. SimPoint generates many different clusterings with k-means and uses a set of criteria to select the best one for the purpose of simulation (a small number of well-defined clusters is desirable).

k-means algorithm

1. Place K points into the space represented by the objects that are being clustered. These points represent the initial group centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids.
4. Repeat Steps 2 and 3 until the centroids no longer move. This produces a separation of the objects into groups from which the metric to be minimized can be calculated.

SimPoint (Calder, n.d.)

SimPoint is a simulation analysis tool that uses a statistical method to find ideal simulation points in an application. It uses a frequency vector profile of a program to perform k-means clustering and select the simulation points. After a frequency vector file has been generated for an application, running SimPoint on it produces three meaningful output files:

simpoint file: the vectors chosen as Simulation Points and their corresponding cluster numbers.
weight file: a weight for each Simulation Point and its corresponding cluster number. The weight is the proportion of the program's execution that the Simulation Point represents.
label file: the final cluster label and the distance from the cluster center for each vector.

Results

SimPoint Result Data

This data compares the actual CPI from a full Macsim run against the CPI measured at only the points selected by SimPoint, each multiplied by the weight SimPoint assigned to it. The error is very low for all but one benchmark (bzip2), which indicates that that particular program has patterns that do not lend themselves well to k-means clustering.

Benchmark   Macsim CPI       SimPoint CPI     Error
bzip2       2.86582681024    2.58894745673    9.66%
gcc         4.21720572526    4.17439724309    1.02%
lbm         4.3498114936     4.37925075498    0.68%
mcf         19.4838300792    19.1521495218    1.70%

The following graphs show the four benchmarks used, comparing the CPI recorded at each point from a full run (top graph) to the cluster each point is placed in. In all but one benchmark (gcc), there is a very clear pattern in which changes in CPI match changes in cluster, as expected.

[Graphs: bzip2 (error 9.66%); lbm (error 0.68%)]
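The workflow behind these numbers (cluster the frequency vectors, keep one representative per cluster, weight it by the share of execution it stands for, and compare the weighted CPI with the full-run CPI) can be illustrated with a minimal Python sketch. This is not SimPoint's implementation. The function names, the toy frequency vectors, the per-interval CPI values, and the choice of the vector nearest each centroid as the representative are all assumptions made for illustration. Only the CPI pairs in `table` come from the results above, and the error is assumed to be the difference relative to the Macsim CPI.

```python
import math
import random


def kmeans(vectors, k, seed=0, max_iter=100):
    """Plain k-means following the four steps listed in the background section."""
    rng = random.Random(seed)
    # Step 1: use k distinct input vectors as the initial centroids.
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(max_iter):
        # Step 2: assign each vector to the closest centroid.
        labels = [min(range(k), key=lambda c: math.dist(v, centroids[c]))
                  for v in vectors]
        # Step 3: recompute each centroid as the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                new_centroids.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        # Step 4: stop once the centroids no longer move.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids


def pick_simulation_points(vectors, labels, centroids):
    """Return {cluster: (index of representative vector, weight)} (assumed rule)."""
    points = {}
    for c, centroid in enumerate(centroids):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if not members:
            continue
        rep = min(members, key=lambda i: math.dist(vectors[i], centroid))
        points[c] = (rep, len(members) / len(vectors))
    return points


def weighted_cpi(points, cpi_per_interval):
    """Sum of each representative's CPI times the weight of its cluster."""
    return sum(weight * cpi_per_interval[rep] for rep, weight in points.values())


def relative_error(full_cpi, estimated_cpi):
    """Absolute difference relative to the full-run CPI."""
    return abs(full_cpi - estimated_cpi) / full_cpi


# Hypothetical toy data: six intervals forming two program phases.
vectors = [[10, 0], [9, 1], [10, 1], [0, 10], [1, 9], [0, 9]]
cpi = [2.0, 2.1, 1.9, 5.0, 5.2, 4.8]

labels, centroids = kmeans(vectors, k=2)
points = pick_simulation_points(vectors, labels, centroids)
estimate = weighted_cpi(points, cpi)
full = sum(cpi) / len(cpi)
print(f"toy estimate {estimate:.3f} vs full {full:.3f}")

# (Macsim CPI, SimPoint CPI) pairs copied from the results table.
table = {
    "bzip2": (2.86582681024, 2.58894745673),
    "gcc": (4.21720572526, 4.17439724309),
    "lbm": (4.3498114936, 4.37925075498),
    "mcf": (19.4838300792, 19.1521495218),
}
errors = {name: round(100 * relative_error(m, s), 2) for name, (m, s) in table.items()}
print(errors)
```

Under that assumed definition, the recomputed percentages agree with the error column of the table to two decimal places.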

[Graphs: gcc (error 1.02%); mcf (error 1.70%)]

SimPoint Sampler

To use the results of SimPoint in a meaningful way, I developed the SimPoint Sampler, which uses Macsim and the SimPoint results together to simulate applications quickly. Instead of running the entire program through Macsim, it uses the provided Simulation Points to switch between two modes, emulation mode and timing mode. By running in timing mode only on the blocks identified by SimPoint and in emulation mode on all other blocks, it can simulate the entire application significantly faster than a full run of Macsim.

The SimPoint Sampler currently loses a great deal of accuracy in the reported CPI when switching between modes. However, if this problem within Macsim were fixed, the sampler would be a very fast and accurate way to simulate applications (ideally exactly as accurate as SimPoint).

Future Work

Future work on this project includes fixing the results from Macsim when switching modes, applying this method to Big Data benchmarks, and trying other methods on Big Data benchmarks if this one does not work.
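The mode switching used by the SimPoint Sampler can be pictured with a small Python sketch. It is only an illustration of the idea, not the sampler's code and not Macsim's interface. `ToySimulator`, its `emulate` and `time` methods, `run_sampled`, and all of the interval numbers, weights, and CPI values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ToySimulator:
    """Hypothetical stand-in for a two-mode simulator (not Macsim's API)."""
    cpi_per_interval: list
    timed_intervals: int = 0
    emulated_intervals: int = 0

    def emulate(self, interval):
        # Emulation mode: advance program state only, no timing model.
        self.emulated_intervals += 1

    def time(self, interval):
        # Timing mode: run the detailed model and report the interval's CPI.
        self.timed_intervals += 1
        return self.cpi_per_interval[interval]


def run_sampled(sim, n_intervals, simulation_points):
    """simulation_points maps interval index -> weight (weights sum to 1)."""
    estimate = 0.0
    for interval in range(n_intervals):
        if interval in simulation_points:
            # Simulation Point: switch to timing mode and weight its CPI.
            estimate += simulation_points[interval] * sim.time(interval)
        else:
            # Any other interval: fast-forward in emulation mode.
            sim.emulate(interval)
    return estimate


# Hypothetical run: six intervals, two of them chosen as Simulation Points.
sim = ToySimulator(cpi_per_interval=[2.0, 2.1, 1.9, 5.0, 5.2, 4.8])
estimated_cpi = run_sampled(sim, n_intervals=6, simulation_points={2: 0.5, 5: 0.5})
print(estimated_cpi, sim.timed_intervals, sim.emulated_intervals)
```

In this toy run only two of the six intervals go through the timing model, which is where the speedup over a full detailed run would come from. The sketch does not model the accuracy loss at mode switches described above.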

Bibliography

Calder, B. (n.d.). http://cseweb.ucsd.edu/~calder/simpoint/index.htm
MathWorks. (n.d.). http://www.mathworks.com/help/stats/k-means-clustering.html