A Novel Cloud Based Elastic Framework for Big Data Preprocessing


School of Systems Engineering, University of Reading (www.reading.ac.uk)
A Novel Cloud Based Elastic Framework for Big Data Preprocessing
Omer Dawelbeit and Rachel McCrindle, October 21, 2014

Overview
- Introduction
- Cloud based elastic framework: motivation, major components, workload distribution, processing steps
- Experiments and results
- Discussion
- Conclusion and future work

Introduction
- Big Data is data that is too big, too fast or too hard to process using traditional tools.
- The primary aspects of Big Data are characterized along three dimensions: Volume, Variety and Velocity.
- Cloud computing is an emerging paradigm that offers resource elasticity and utility billing.
- Cloud computing resources include VMs, cloud storage and interactive analytical big data services (e.g. Google BigQuery).

Cloud Based Elastic Framework
- Entirely based on cloud computing.
- Elastic, hence able to dynamically scale up or down.
- Extendible, such that tasks can be added or removed.
- Tracks the overall cost incurred by the processing activities.
- Capable of both preprocessing and analyzing Big Data.

Big Data processing pipeline:
- Data collection
- Data curation
- Data integration and aggregation
- Data storage using Cloud Storage and BigQuery
- Data analysis and interpretation using BigQuery

Motivation
- Analytical big data services can analyze massive datasets in seconds (e.g. 1 terabyte in 50 s).
- They can handle the analysis and storage of textual structured and semi-structured big data.
- Data curation, transformation and normalization can be handled using an entirely parallel approach.
- Some tasks do not naturally fit the MapReduce paradigm (map/reduce phases, task chaining, complex logic, data streaming).
- Frameworks such as Hadoop utilize a fixed number of computing nodes during processing.
- Cloud computing elasticity can be utilized to scale VMs up and down as needed.

Major Components
- Coordinator VM.
- Processor VMs.
- Processor VM disk image.
- Job/work description.
- Processing program and tasks.
- Workload distribution function.
- Cloud storage.
- Analytical big data service.
- Program input via VM metadata.

Workload Distribution
- Task processing is entirely parallel, so processors do not need to communicate with each other.
- Work is distributed using bin packing to ensure each processor is fairly loaded.
- Items to partition can be files to process or analytical queries to run against BigQuery.
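The bin-packing distribution above can be sketched with a simple greedy longest-processing-time heuristic (an illustrative assumption, since the slide does not specify which bin-packing variant is used): sort the work items by descending weight, then repeatedly assign each item to the least-loaded processor.

```python
def distribute(items, n_processors):
    """Greedy LPT partitioning: heaviest items first,
    each assigned to the currently lightest bin."""
    bins = [{"load": 0, "items": []} for _ in range(n_processors)]
    for name, weight in sorted(items, key=lambda kv: -kv[1]):
        target = min(bins, key=lambda b: b["load"])
        target["items"].append(name)
        target["load"] += weight
    return bins

# Hypothetical input files with sizes in MB.
files = [("a.nt", 900), ("b.nt", 700), ("c.nt", 400),
         ("d.nt", 300), ("e.nt", 200)]
bins = distribute(files, 2)  # loads end up as 1200 and 1300
```

The same function works whether the weighted items are input files (weighted by size) or analytical queries (weighted by estimated cost).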

Processing Steps

Coordinator:
- Receives requests for work.
- Partitions the work using a bin-packing algorithm.
- Starts the required number of nodes, supplying tasks as VM metadata.
- Monitors nodes: once a task is done, another task is started.
- Terminates the nodes once the work is done.
- Tracks the overall cost of resource usage.

Processor (communicating via the metadata server):
- Runs the startup script and reads its metadata.
- Chooses a task to execute and sets its metadata to BUSY.
- Reads input data from Cloud Storage or BigQuery.
- Completes the task and uploads/updates the data.
- Sets its metadata to FREE and waits for the next task.
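The processor side of this handshake can be sketched as a small loop. The `metadata` dict here is a hypothetical in-memory stand-in for the VM metadata server, and the task names are made up for illustration:

```python
def run_processor(metadata, completed):
    """Processor loop from the slide: pick a task, flag BUSY,
    complete it (process + upload), flag FREE, repeat."""
    while metadata["tasks"]:
        metadata["state"] = "BUSY"           # coordinator sees node as busy
        task = metadata["tasks"].pop(0)      # choose task to execute
        completed.append(task)               # stand-in for real processing
        metadata["state"] = "FREE"           # ready for the next assignment

metadata = {"state": "FREE",
            "tasks": ["curate:dbpedia_part1.nt", "curate:dbpedia_part2.nt"]}
done = []
run_processor(metadata, done)
```

On Compute Engine the coordinator would write task descriptions into instance metadata and the processor would poll them over HTTP, but the control flow is the same.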

Experiment
Conducted on the Google Cloud Platform:
- Compute Engine: up to 10 processors of type n1-standard-2, each with 2 virtual cores, a 10 GB disk and 7.5 GB of main memory.
- Cloud Storage.

The DBpedia* dataset is used:
- A structured extract from Wikipedia.
- Contains 300 million statements.
- Total size is 50.19 GB (5.3 GB compressed).
- Data is in N-Triples RDF format, e.g.:
  <http://dbpedia.org/resource/AccessibleComputing> <http://xmlns.com/foaf/0.1/isPrimaryTopicOf> <http://en.wikipedia.org/wiki/AccessibleComputing> .

* http://wiki.dbpedia.org/datasets
Linked Data Cloud: http://lod-cloud.net/versions/2011-09-19/lod-cloud_colored.html
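N-Triples is a line-oriented format (one statement per line, terminated by a dot), which is what makes it easy to curate files independently and in parallel. A minimal parser for URI-only triples like the example above might look as follows; this is a sketch, and a production pipeline would likely use a full RDF library instead:

```python
import re

# Matches lines of the form: <subject> <predicate> <object> .
# Handles only URI terms, not literals or blank nodes.
TRIPLE = re.compile(r'^<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.\s*$')

def parse_line(line):
    """Return (subject, predicate, object) for a URI-only
    N-Triples line, or None if the line does not match."""
    m = TRIPLE.match(line)
    return m.groups() if m else None

line = ('<http://dbpedia.org/resource/AccessibleComputing> '
        '<http://xmlns.com/foaf/0.1/isPrimaryTopicOf> '
        '<http://en.wikipedia.org/wiki/AccessibleComputing> .')
s, p, o = parse_line(line)
```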

Results

Discussion
- Preprocessed 50 GB of data in 11 minutes using 8 VMs.
- For our data, the processing is CPU bound (80% processing, 20% I/O).
- Processing time is proportional to the size of the data assigned to the VM.
- The overall runtime is constrained by the time required to process the largest file.
- Input files can be split further to enable equal workload allocation.
- Only 9% to 20% of the overall runtime is spent transferring files to and from cloud storage.
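The file-splitting point above follows directly from the format: since N-Triples statements never span lines, an oversized input can be cut on line boundaries into near-equal chunks before distribution. A minimal sketch (chunk count chosen arbitrarily for illustration):

```python
def split_into_chunks(lines, n_chunks):
    """Split a file's lines into n near-equal chunks so that no
    single oversized file dominates the overall runtime."""
    size, rem = divmod(len(lines), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < rem else 0)
        chunks.append(lines[start:end])
        start = end
    return chunks

# Ten dummy statements split across three work items.
lines = [f"<s{i}> <p> <o{i}> ." for i in range(10)]
chunks = split_into_chunks(lines, 3)  # chunk sizes: 4, 3, 3
```

Feeding these chunks to the bin-packing distribution yields a more even workload than assigning whole files.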

Conclusion and future work
- We have developed a novel cloud based framework for Big Data preprocessing.
- Our framework is lightweight, elastic and extendible.
- It makes use of cloud storage and analytical big data services to provide a complete pipeline for big data processing.
- We have extended the processing to executing analytical queries against BigQuery.
- We plan to use the framework for processing social media datasets.
- The implementation of our framework is open source and can be downloaded from http://ecarf.io

Thank you. Any questions?