BiDAl: Big Data Analyzer for Cluster Traces

Similar documents

arxiv: v1 [cs.dc] 2 Sep 2015

BiDAl: Big Data Analyzer for Cluster Traces

Characterizing Task Usage Shapes in Google s Compute Clusters

Massive Cloud Auditing using Data Mining on Hadoop

Implement Hadoop jobs to extract business value from large and varied data sets

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Chapter 7. Using Hadoop Cluster and MapReduce

Hadoop and Map-Reduce. Swati Gore

CSE-E5430 Scalable Cloud Computing Lecture 2

You should have a working knowledge of the Microsoft Windows platform. A basic knowledge of programming is helpful but not required.

Infrastructures for big data

Data processing goes big

Mining Large Datasets: Case of Mining Graph Data in the Cloud

A Performance Analysis of Distributed Indexing using Terrier

Bringing Big Data Modelling into the Hands of Domain Experts

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

Managing Cloud Server with Big Data for Small, Medium Enterprises: Issues and Challenges

Open source Google-style large scale data analysis with Hadoop

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Hadoop Ecosystem Overview. CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook

Hadoop IST 734 SS CHUNG

How To Handle Big Data With A Data Scientist

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Energy Efficient MapReduce

KNIME & Avira, or how I ve learned to love Big Data

Hadoop implementation of MapReduce computational model. Ján Vaňo

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee

Hadoop & its Usage at Facebook

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems

A very short Intro to Hadoop

Advanced Big Data Analytics with R and Hadoop

Ubuntu and Hadoop: the perfect match

MultiAlign Software. Windows GUI. Console Application. MultiAlign Software Website. Test Data

Big Data Explained. An introduction to Big Data Science.

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

Cloud Management: Knowing is Half The Battle

Rumen. Table of contents

How To Scale Out Of A Nosql Database

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Large scale processing using Hadoop. Ján Vaňo

Viswanath Nandigam Sriram Krishnan Chaitan Baru

A very short talk about Apache Kylin Business Intelligence meets Big Data. Fabian Wilckens EMEA Solutions Architect

CloudRank-D:A Benchmark Suite for Private Cloud Systems

International Journal of Innovative Research in Computer and Communication Engineering

Big Data: Tools and Technologies in Big Data

Open Cloud System. (Integration of Eucalyptus, Hadoop and AppScale into deployment of University Private Cloud)

Jeffrey D. Ullman slides. MapReduce for data intensive computing

ITG Software Engineering

CRITEO INTERNSHIP PROGRAM 2015/2016

Oracle Database - Engineered for Innovation. Sedat Zencirci Teknoloji Satış Danışmanlığı Direktörü Türkiye ve Orta Asya

Log Mining Based on Hadoop s Map and Reduce Technique

America s Most Wanted a metric to detect persistently faulty machines in Hadoop

The Big Data Ecosystem at LinkedIn. Presented by Zhongfang Zhuang

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.

WROX Certified Big Data Analyst Program by AnalytixLabs and Wiley

Data Mining in the Swamp

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Testing Big data is one of the biggest

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

How Cisco IT Built Big Data Platform to Transform Data Management

BIG DATA What it is and how to use?

Big Fast Data Hadoop acceleration with Flash. June 2013

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

Big Data Primer. 1 Why Big Data? Alex Sverdlov alex@theparticle.com

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Journée Thématique Big Data 13/03/2015

DATA EXPERTS MINE ANALYZE VISUALIZE. We accelerate research and transform data to help you create actionable insights

Integrating Apache Spark with an Enterprise Data Warehouse

An Oracle White Paper June High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

Deploying Hadoop with Manager

A Tour of the Zoo the Hadoop Ecosystem Prafulla Wani

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

Big Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel

Hadoop Parallel Data Processing

Hadoop-BAM and SeqPig

Integrating Big Data into the Computing Curricula

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Cost-Effective Business Intelligence with Red Hat and Open Source

BIG DATA SERIES: HADOOP DEVELOPER TRAINING PROGRAM. An Overview

Introduction to Parallel Programming and MapReduce

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

Fast, Low-Overhead Encryption for Apache Hadoop*

BIG DATA IN BUSINESS ENVIRONMENT

Detection of Distributed Denial of Service Attack with Hadoop on Live Network

Preparing an IIS Server for EmpowerID installation

Big Data on Microsoft Platform

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Analysis of Web Archives. Vinay Goel Senior Data Engineer

Transcription:

BiDAl: Big Data Analyzer for Cluster Traces Alkida Balliu, Dennis Olivetti, Ozalp Babaoglu, Moreno Marzolla, Alina Sirbu Department of Computer Science and Engineering University of Bologna, Italy BigSys 2014 System Software Support for Big Data Stuttgart, sep 25-26, 2014 September 25-26 BigSys 2014, 2014 Stuttgart, Germany 1

Talk Outline Motivation BiDAl Case Study Conclusions Stuttgart, sep 25-26, 2014 BigSys 2014 2

Motivations Modern datacenters produce huge amounts of data in the form of event and error logs Understanding the logs is essential to identify problems or improve efficiency Understand and exploit hidden patterns and correlations First step towards selfmanaging, self-healing datacenters Stuttgart, sep 25-26, 2014 BigSys 2014 3

Challenges Huge size of logs A 2010 study [Thusoo et al, SIGMOD'10] reports that Facebook data centers produced 60TB of logging information daily Log analysis falls within the class of Big Data applications Data sets are so large that conventional storage and analysis techniques are not appropriate to process them Stuttgart, sep 25-26, 2014 BigSys 2014 4

BiDAl Big Data Analyzer Java application (with GUI) Proof-of-concept Typical workflow: Instantiation of a storage backend Data selection and aggregation Data analysis Stuttgart, sep 25-26, 2014 BigSys 2014 5

BiDAl Big Data Analyzer Can import raw data in.csv format Uses SQLite or Hadoop File System (HDFS) as storage backends Additional storage types can be added Although the current storage backends are based on the concept of table, other backends could be used too, e.g., HBase for <key, value> pairs Uses (a subset of) SQL as the query and data manipulation language Translates SQL to the language understood by the storage backend currently RSQLite or RHadoop Stuttgart, sep 25-26, 2014 BigSys 2014 6

BiDAl Big Data Analyzer Statistical computations can be performed using either R or Hadoop MapReduce R commands are usually applied to the SQLite storage, while MapReduce commands are usually applied to the HDFS storage BiDAl can transfer data automatically and transparently between the backends, to allow both languages to operate on both backends Computations can be concatenated Usually, a data reduction is followed by the computation of some statistics Stuttgart, sep 25-26, 2014 BigSys 2014 7

Data flow in BiDAl R Execute RSQLite SQLite Convert CSV Import Transfer SQL HDFS Convert MapReduce Execute RHadoop Stuttgart, sep 25-26, 2014 BigSys 2014 8

Stuttgart, sep 25-26, 2014 BigSys 2014 9

Case Study: Google Traces The development of BiDAl was initially motivated by the need to analyze Google traces https://code.google.com/p/googleclusterdata/ Goal: Extract workload parameter Instantiate a simulation model of the Google cluster Validate the simulation with respect to the observed data Stuttgart, sep 25-26, 2014 BigSys 2014 10

Google traces Contain 29 days of information from May 2011, on a cluster of about 11k machines Machine event (e.g., new machine is added to the pool,...) Machine attribute (e.g., OS is updated to a newer version,...) Jobs and Tasks (requirements, submit/completion time...) Resource usage (sampled at some fixed intervals) Total size of the compressed trace is ~40GB https://code.google.com/p/googleclusterdata/ Stuttgart, sep 25-26, 2014 BigSys 2014 11

Entities of the Simulation Model Tasks and Jobs Arrival Process that generates new events (new job, new machine, machine removal...) Scheduler Decides where the tasks of a job can be executed Machines Execute tasks; notify the scheduler when a task terminates; send status updates to the scheduler Network Allows other entities to communicate Stuttgart, sep 25-26, 2014 BigSys 2014 12

Trace-Driven Simulation Results Stuttgart, sep 25-26, 2014 BigSys 2014 13

Trace-Driven Simulation Results Stuttgart, sep 25-26, 2014 BigSys 2014 14

Workload Characterization We used BiDAl to extract workload parameters from the traces Jobs Inter-arrival time distribution Number of tasks per job Distribution of execution times of different types of jobs (e.g., jobs that terminate successfully, jobs that are aborted by the user, )... Stuttgart, sep 25-26, 2014 BigSys 2014 15

Examples Frequencies of the amount of RAM used by tasks (left) and number of tasks per job (right) Stuttgart, sep 25-26, 2014 BigSys 2014 16

Examples Machine update events Left Density and CDF with lines representing exponential fitting Right Goodness of fit in Q-Q and P-P plots (straight lines denote perfect fit) Stuttgart, sep 25-26, 2014 BigSys 2014 17

Examples CDFs fitted by a sequence of splines: CPU task requirements (left) and machine downtime (right) Stuttgart, sep 25-26, 2014 BigSys 2014 18

Results using synthetic traces Real Simulated Rel. dif. Running 124217 136037 0.09 Ready 5987 5726 0.04 Completed 3277 2317 0.29 Evicted 1057 2165 1.04 Can be explained by the high variance of real data Stuttgart, sep 25-26, 2014 BigSys 2014 19

Conclusions and Future Works Big Data Analyzer (BiDAl) is a prototype data analysis tool that can handle large datasets SQL, R, Hadoop/MapReduce Extensible We used BiDAl to analyze the Google traces dataset Future works Support additional storage backends Include additional analysis algorithms (e.g., predictive algorithms, machine learning) Live log analysis Stuttgart, sep 25-26, 2014 BigSys 2014 20

Thanks for your attention! http://www.cs.unibo.it/~sirbu/bidal.zip Stuttgart, sep 25-26, 2014 BigSys 2014 21