Interactive Analytical Processing in Big Data Systems,BDGS: AMay Scalable 23, 2014 Big Data1 Generat / 20



Similar documents
BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking. Aayush Agrawal

Characterizing Task Usage Shapes in Google s Compute Clusters

Energy Efficient MapReduce

L1: Introduction to Hadoop

Big Data and Apache Hadoop s MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce

BIG DATA What it is and how to use?

Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012

Application Development. A Paradigm Shift

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics

Big Data With Hadoop

Benchmarking Hadoop & HBase on Violin

Hadoop implementation of MapReduce computational model. Ján Vaňo

Using the HP Vertica Analytics Platform to Manage Massive Volumes of Smart Meter Data

Performance and Energy Efficiency of. Hadoop deployment models

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Testing 3Vs (Volume, Variety and Velocity) of Big Data

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

CSE-E5430 Scalable Cloud Computing Lecture 2

Architecting for Big Data Analytics and Beyond: A New Framework for Business Intelligence and Data Warehousing

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

Open source Google-style large scale data analysis with Hadoop

Testing Big data is one of the biggest

Big Data Rethink Algos and Architecture. Scott Marsh Manager R&D Personal Lines Auto Pricing

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

Large scale processing using Hadoop. Ján Vaňo

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems

Towards an Optimized Big Data Processing System

Apache Hadoop: Past, Present, and Future

Addressing Open Source Big Data, Hadoop, and MapReduce limitations

Big Data on AWS. Services Overview. Bernie Nallamotu Principle Solutions Architect

Mining Large Datasets: Case of Mining Graph Data in the Cloud

BIG DATA TECHNOLOGY. Hadoop Ecosystem

CloudRank-D:A Benchmark Suite for Private Cloud Systems

Maximizing Hadoop Performance with Hardware Compression

Networking in the Hadoop Cluster

EFFICIENT GEAR-SHIFTING FOR A POWER-PROPORTIONAL DISTRIBUTED DATA-PLACEMENT METHOD

BIG DATA TRENDS AND TECHNOLOGIES

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Computing Load Aware and Long-View Load Balancing for Cluster Storage Systems

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

Hadoop Scheduler w i t h Deadline Constraint

A Brief Outline on Bigdata Hadoop

On Big Data Benchmarking

Energy Efficiency for Large-Scale MapReduce Workloads with Significant Interactive Analysis

BIG DATA IN BUSINESS ENVIRONMENT

How To Handle Big Data With A Data Scientist

Big Data: Tools and Technologies in Big Data

Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013

Data Refinery with Big Data Aspects

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Big Fast Data Hadoop acceleration with Flash. June 2013

BiDAl: Big Data Analyzer for Cluster Traces

Big Data and Analytics: Challenges and Opportunities

Role of Cloud Computing in Big Data Analytics Using MapReduce Component of Hadoop

Data Mining in the Swamp

Lecture 10 - Functional programming: Hadoop and MapReduce

Big Data Explained. An introduction to Big Data Science.

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

Suresh Lakavath csir urdip Pune, India

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

BIG DATA CHALLENGES AND PERSPECTIVES

IMPLEMENTATION OF P-PIC ALGORITHM IN MAP REDUCE TO HANDLE BIG DATA

Big Data: Beyond the Hype

On Big Data Benchmarking

Hadoop Ecosystem B Y R A H I M A.

Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan

Big Data: A Storage Systems Perspective Muthukumar Murugan Ph.D. HP Storage Division

Real Time Big Data Processing

Hadoop IST 734 SS CHUNG

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics

Using distributed technologies to analyze Big Data

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Data Management in SAP Environments

Fast Data in the Era of Big Data: Twitter s Real-

Trustworthiness of Big Data

HADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

arxiv: v3 [cs.db] 27 Feb 2014

Journée Thématique Big Data 13/03/2015

Information Builders Mission & Value Proposition

IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Hadoop s Adolescence: A Comparative Workload Analysis from Three Research Clusters

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February ISSN

Graphalytics: A Big Data Benchmark for Graph-Processing Platforms

Understanding Neo4j Scalability

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

UNDERSTANDING THE BIG DATA PROBLEMS AND THEIR SOLUTIONS USING HADOOP AND MAP-REDUCE

How To Scale Out Of A Nosql Database

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Big Data Processing with Google s MapReduce. Alexandru Costan

Internals of Hadoop Application Framework and Distributed File System

Hadoop: Embracing future hardware

Interactive Analytical Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads

COMP9321 Web Application Engineering

BIG DATA USING HADOOP

Transcription:

Interactive Analytical Processing in Big Data Systems,BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking,Study about DataSet May 23, 2014 Interactive Analytical Processing in Big Data Systems,BDGS: AMay Scalable 23, 2014 Big Data1 Generat / 20

Outline 1 Interactive Analytical Processing in Big Data Systems 2 Towards a BigData BenchMark 3 BDGS: A Scalable Big Data Generator Suite 4 Methodology of BDGS: 5 Synthetic Data Generator: 6 References Interactive Analytical Processing in Big Data Systems,BDGS: AMay Scalable 23, 2014 Big Data2 Generat / 20

Interactive Analytical Processing in Big Data Systems Introduction: MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. LOGICAL VIEW : Map Function: Map takes one pair of data with a type in one data domain, and returns a list of pairs in a different domain: Map(k1,v1) list(k2,v2) Reduce Function: It produces a collection of values in the same domain: Reduce(k2, list (v2)) list(v3) Interactive Analytical Processing in Big Data Systems,BDGS: AMay Scalable 23, 2014 Big Data3 Generat / 20

Interactive Analytical Processing in Big Data Systems Key findings during the study: 1 There is a new class of MapReduce workloads for interactive, semi-streaming analysis that differs considerably from the original MapReduce use case targeting purely batch computations. 2 There is a wide range of behavior within this workload class, such that we must exercise caution in regarding any aspect of workload dynamics as typical. 3 Query-like programatic frameworks on top of MapReduce such as Hive and Pig make up a considerable fraction of activity in all workloads we analyze 4 Some prior assumptions about MapReduce such as uniform data access, regular diurnal patterns, and prevalence of large jobs no longer hold. Interactive Analytical Processing in Big Data Systems,BDGS: AMay Scalable 23, 2014 Big Data4 Generat / 20

Interactive Analytical Processing in Big Data Systems Data Access Pattern: 1 By Job Data Sizes :Most jobs have input, shuffle, and output sizes in the MB to GB range.thus, benchmarks of TB and above captures only a narrow set of input, shuffle, and output patterns. Fig: Datasize for each workload [?] Interactive Analytical Processing in Big Data Systems,BDGS: AMay Scalable 23, 2014 Big Data5 Generat / 20

Interactive Analytical Processing in Big Data Systems 1 Skews in Data Frequency: The observation challenges the design assumption in HDFS that all data sets should be treated equally, i.e., stored on the same medium, with the same data replication policies. Fig: Zipf distribution of all workloads [?] Interactive Analytical Processing in Big Data Systems,BDGS: AMay Scalable 23, 2014 Big Data6 Generat / 20

Interactive Analytical Processing in Big Data Systems 1 Access Temporal Locality: A possible cache eviction policy is to evict entire files that have not been accessed for longer than a workload specific threshold duration. Fig: Data re-access interval [?] Interactive Analytical Processing in Big Data Systems,BDGS: AMay Scalable 23, 2014 Big Data7 Generat / 20

Interactive Analytical Processing in Big Data Systems Workload Variation With Time: 1 Weekly Time Series : Fig: Weekly time graph for workload CC-a(utilization slot)[?] Interactive Analytical Processing in Big Data Systems,BDGS: AMay Scalable 23, 2014 Big Data8 Generat / 20

Interactive Analytical Processing in Big Data Systems Workload Variation With Time: 1 Burstiness : Peak to average ratio is a comman way to measure burstiness.they extend the concept of peak-to-average ratio to quantify burstiness. Fig: Weekly time graph for workload CC-a(utilization slot)[?] Interactive Analytical Processing in Big Data Systems,BDGS: AMay Scalable 23, 2014 Big Data9 Generat / 20

Interactive Analytical Processing in Big Data Systems 1 Time Series Correlation :They compute three correlation values: between the time varying vectors jobssubmitted(t) and datasizebytes(t), between jobssubmitted(t) and computet imet askseconds(t),and between datasizebytes(t) andcomputetimetaskseconds(t), where t represents time in hourly granularity, and ranges over the entire trace duration Fig: Correlation between diff. time series [?] Interactive Analytical Processing in Big Data Systems,BDGS: AMay Scalable 23, 2014 Big Data10 Generat / 20

Towards a BigData BenchMark Towards a BigData BenchMark: Some challenges associated with building a TPC-style benchmark for MapReduce and other big data systems. Data Generation: A good benchmark should stress the system with realistic conditions in all areas likes data access pattern,skews in frequency. Mixing MapReduce and query-like frameworks:the heavy use of query-like frameworks on top of MapRe- duce indicates that future cluster management systems needto efficiently multiplex jobs. Empirical models:it is necessary for a benchmark to assume an empirical model of workloads. Workload Suites: If not a single set is able to represent a workload then we have to make a small suite of workload class that create a range of workload. Interactive Analytical Processing in Big Data Systems,BDGS: AMay Scalable 23, 2014 Big Data11 Generat / 20

BDGS: A Scalable Big Data Generator Suite Introduction: Data explosion is an irresistible trend that data are generated faster than ever. IDC forecasts this speed of data generation will continue and it is expected to increase at an exponential level over the next decade. The challenges of capturing, storing, indexing, searching, transferring, analyzing, and displaying Big Data bring fast development of big data systems. In this paper, we introduce Big Data Generator Suite (BDGS), a comprehensive tool developed to generate synthetic big data while preserving 4V(volume,variety,velocity,veracity) properties. Interactive Analytical Processing in Big Data Systems,BDGS: AMay Scalable 23, 2014 Big Data12 Generat / 20

BDGS: A Scalable Big Data Generator Suite Requirements: Requirements: Big data generators should scale up or down a synthetic data set (volume) of different types (variety) under a controllable generation rate (velocity) while keeping important characteristics of raw data (veracity). Volume:Big data generators should be able to generate data whose volume ranges from GB to PB, and can also scale up and down data volumes to meet different testing requirements Velocity:Big data applications can be divided into in three types: online analytic, online service and realtime analytic.data processing speed and data updating frequency should be maintained. Interactive Analytical Processing in Big Data Systems,BDGS: AMay Scalable 23, 2014 Big Data13 Generat / 20

BDGS: A Scalable Big Data Generator Suite Requirements: Requirements: Variety: Since big data come from various workloads and systems, big data generators should support a diversity data types (structured, semi-structured and unstructured) and sources (table, text, graph, etc). Veracity:In synthetic data generation, the important characteristics of raw data must be preserved. Interactive Analytical Processing in Big Data Systems,BDGS: AMay Scalable 23, 2014 Big Data14 Generat / 20

Methodology of BDGS: Methodology of BDGS: It represent an overview of BDGS that is implemented as an component of our big data benchmark suite BigDataBench. Fig: Architecture of BDGS [?] Interactive Analytical Processing in Big Data Systems,BDGS: AMay Scalable 23, 2014 Big Data15 Generat / 20

Synthetic Data Generator: Methodology of BDGS: BDGS can generate synthetic data while preserving the important characteristics of real data sets and can also rapidly scale data meeting the requirements. Text Generator Graph Generator Table Generator Interactive Analytical Processing in Big Data Systems,BDGS: AMay Scalable 23, 2014 Big Data16 Generat / 20

Synthetic Data Generator: Text Generator: It applies latent dirichlet allocation (LDA) as the text data generation model. This model can keep the veracity of topic distributions in documents as well as the word distribution under each topic. Fig: Text Generator [?] Interactive Analytical Processing in Big Data Systems,BDGS: AMay Scalable 23, 2014 Big Data17 Generat / 20

Synthetic Data Generator: Graph Generator: A graph consists of nodes and edges that connect nodes. For the graph data sets, namely Facebook Social Graph and Google Web Graph, we use the kronecker graph model in our graph generator Fig: Graph Generator [?] Interactive Analytical Processing in Big Data Systems,BDGS: AMay Scalable 23, 2014 Big Data18 Generat / 20

Synthetic Data Generator: Table Generator: E-commerce transaction and personal resumes comes under the table dataset. PDGF uses XML configuration files for data description and distribution, thereby simplying the generation of different distributions of specified data sets. The process of resume generator: Randomly generate a string as the name of a resume. Randomly choose fields from email, telephone, address, date of birth, home place, institute, title, research interest, education experience, work experience and publications, where each field follows the bernoulli probability distribution. For each field: if the field has sub fields, then randomly choose its sub fields following the bernoulli probability distribution. else assign the content of the field using the multinomial probability distribution. Interactive Analytical Processing in Big Data Systems,BDGS: AMay Scalable 23, 2014 Big Data19 Generat / 20

References References I [1] http://arxiv.org/pdf/1401.5465.pdf [2] http://www.eecs.berkeley.edu/~alspaugh/papers/ mapred_workloads_vldb_2012.pdf Interactive Analytical Processing in Big Data Systems,BDGS: AMay Scalable 23, 2014 Big Data20 Generat / 20