Duke University

Save this PDF as:
 WORD  PNG  TXT  JPG

Size: px
Start display at page:

Download "Duke University http://www.cs.duke.edu/starfish"

Transcription

1 Herodotos Herodotou, Harold Lim, Fei Dong, Shivnath Babu Duke University

2 Practitioners of Big Data Analytics Google Yahoo! Facebook ebay Physicists Biologists Economists Data Size Journalists Systems researchers Workload Complexity Counts & Aggregates Statistical analysis, Linear Algebra Rollups & Drilldowns Machine learning Text / Images / Video / Graphs 6/29/2011 Starfish 2

3 MapReduce/Hadoop Ecosystem Java / Ruby / Python Client Pig Hive Jaql Oozie Elastic MapReduce Hadoop MapReduce Execution Engine Distributed File System HBase 6/29/2011 Starfish 3

4 MADDER Principles of Big Data Analytics Magnetic Agile Deep Data-lifecycle-aware Elastic Robust 6/29/2011 Starfish 4

5 MADDER Principles of Big Data Analytics Magnetic A D D E R Easy to get data into the system 6/29/2011 Starfish 5

6 MADDER Principles of Big Data Analytics M Agile D D E R Make change (data/requirements) easy 6/29/2011 Starfish 6

7 MADDER Principles of Big Data Analytics M A Deep D E R Support the full spectrum of analytics Write MapReduce programs in Java / Python / R or use interfaces like Pig / Jaql 6/29/2011 Starfish 7

8 MADDER Principles of Big Data Analytics M A D Data-lifecycleaware E R Data cycle at LinkedIn 6/29/2011 Starfish 8

9 MADDER Principles of Big Data Analytics M A D D Elastic R Adapt resources/costs to actual workloads 6/29/2011 Starfish 9

10 MADDER Principles of Big Data Analytics M A D D E Robust Graceful degradation under undesirable events 6/29/2011 Starfish 10

11 Ease-of-Use Vs. Out-of-the-box Perf. Magnetic Agile Deep Data-lifecycle-aware Elastic Robust Ease of use Data can be opaque until run-time Programs are a different beast from SQL Policies are nontrivial Hard to get good performance out of the box 6/29/2011 Starfish 11

12 Starfish: MADDER + Self-Tuning Magnetic Agile Deep Data-lifecycle-aware Elastic Robust Ease of use Profile Recursive random search Dynamic instrumentation Mix of models & simulation Relative what-if calls Get good performance automatically 6/29/2011 Starfish 12

13 Starfish: MADDER + Self-Tuning Goal: Provide good performance automatically Java Client Pig Hive Oozie Elastic MR Analytics System Starfish Hadoop MapReduce Execution Engine Distributed File System 6/29/2011 Starfish 13

14 What are the Tuning Problems? Job-level MapReduce configuration Cluster sizing J1 Data layout tuning D J2 Workflow optimization Workload management 6/29/2011 Starfish 14

15 Starfish Architecture Workload-level tuning Workload Optimizer Elastisizer Workflow-level tuning Workflow Optimizer What-if Engine Job-level tuning Job Optimizer Profiler Sampler Metadata Mgr. Data Manager Intermediate Data Mgr. Data Layout & Storage Mgr. 6/29/2011 Starfish 15

16 Starfish Architecture Workload-level tuning Workload Optimizer Elastisizer Workflow-level tuning Workflow Optimizer What-if Engine Job-level tuning Job Optimizer Profiler Sampler Metadata Mgr. Data Manager Intermediate Data Mgr. Data Layout & Storage Mgr. 6/29/2011 Starfish 16

17 MapReduce Job Execution job j = < program p, data d, resources r, configuration c > Map function Reduce function Input Splits Run this program as a MapReduce job

18 MapReduce Job Execution job j = < program p, data d, resources r, configuration c > Input Splits Map Wave 1 Map Wave 2 Reduce Wave 1 Reduce Wave 2 How are the number of splits, number of map and reduce tasks, memory allocation to tasks, etc., determined?

19 Optimizing MapReduce Job Execution job j = < program p, data d, resources r, configuration c > Space of configuration choices include settings for: Number of map tasks Number of reduce tasks Partitioning of map outputs to reduce tasks Memory allocation to task-level buffers Multiphase external sorting in the tasks Whether output data from tasks should be compressed Whether combine function should be used 6/29/2011 Starfish 19

20 Optimizing MapReduce Job Execution 2-dim projection of 13-dim surface 190+ parameters in Hadoop 6/29/2011 Starfish 20

21 Case for Automated Hadoop Tuning (from wiki.apache.org/pig/pigjournal) Hadoop has that can affect the latency and scalability of a job. For different types of jobs, different configurations will yield optimal results. For example, a job with no memory-intensive operations in the map phase but with a combine phase will want to set Hadoop's io.sort.mb quite high, to minimize the number of spills from the map. many configuration parameters feature will greatly increase the utility Adding this of Pig for Hadoop users, as it will free them from needing to understand Hadoop well enough to tune it themselves for their particular jobs. 6/29/2011 Starfish 21

22 Starfish s Core Approach to Tuning c S Challenges: p is an arbitrary MapReduce program; c is high-dimensional; Profiler What-if Engine Optimizer(s) perf F( p, d, r, c) program p, data d, resources r, copt arg min F( p, d, r, c) configuration c Runs MapReduce jobs to collect job profiles (concise execution summary) Given profile of j = <p,d,r,c>, estimates virtual profile for job j' = <p,d,r,c > Enumerates and searches through the optimization space efficiently 6/29/2011 Starfish 22

23 Profile of a MapReduce Job Concise representation of program execution Records information at the level of task phases split 0 map split 2 map reduce out 0 split 1 map split 3 map reduce Out 1 Two Map Waves One Reduce Wave 6/29/2011 Starfish 23

24 Profile of a MapReduce Job Concise representation of program execution Records information at the level of task phases split 0 map reduce out 0 split 0 Map func Serialize, Partition Memory Buffer Sort, [Combine], [Compress] Merge DFS Map Task Phases Read Map Collect Spill Merge 6/29/2011 Starfish 24

25 Profile of a MapReduce Job Concise representation of program execution Records information at the level of task phases split 0 map reduce out 0 Profile Dataflow Amount of data flowing though tasks & task phases Dataflow Statistics Statistical info about the dataflow Cost Execution time at level of tasks & task phases Cost Statistics Statistical info about the costs 6/29/2011 Starfish 25

26 Fields in a Job Profile Dataflow Map output bytes Number of map-side spills Number of merge rounds Number of records in buffer per spill Cost Read phase time in the map task Map phase time in the map task Collect phase time in the map task Spill phase time in the map task Dataflow Statistics Map func s selectivity (output / input) Map output compression ratio Combiner s selectivity Size of records (keys and values) Cost Statistics I/O cost for reading from local disk per byte I/O cost for writing to HDFS per byte CPU cost for executing Map func per record CPU cost for uncompressing the input per byte 6/29/2011 Starfish 26

27 Generating Profiles Concise representation of program execution Records information at the level of task phases Generated by Profiler through measurement or by the What-if Engine through estimation 6/29/2011 Starfish 27

28 Generating Profiles by Measurement Goals Have zero overhead when off, low overhead when on Require no modifications to Hadoop Support unmodified MapReduce programs written in Java/Python/Ruby/C++ Approach: Dynamic (on-demand) instrumentation Event-condition-action rules are specified (in Java) Leads to run-time instrumentation of Hadoop internals Monitors task phases of MapReduce job execution We currently use BTrace (Hadoop internals are in Java) 6/29/2011 Starfish 28

29 Generating Profiles by Measurement JVM Enable Profiling JVM split 0 map reduce out 0 raw data ECA rules raw data JVM split 1 map map profile reduce profile raw data job profile Use of Sampling Profile fewer tasks Execute fewer tasks JVM = Java Virtual Machine, ECA = Event-Condition-Action 6/29/2011 Starfish 29

30 Overhead of Profile Measurement Word Co-occurrence job running on a 16-node cluster of c1.medium EC2 nodes on the Amazon Cloud 6/29/2011 Starfish 30

31 What-if Engine Possibly Hypothetical Job Profile Input Data Properties What-if Engine Job Oracle Cluster Resources Virtual Job Profile for <p, d 2, r 2, c 2 > Task Scheduler Simulator Configuration Settings <p, d 1, r 1, c 1 > <d 2 > <r 2 > <c 2 > Properties of Hypothetical Job 6/29/2011 Starfish 31

32 What-if Questions Starfish can Answer How will job j s execution time change if the number of reduce tasks is changed from 20 to 40? What will the change in I/O be if map o/p compression is turned on, but the input data size increases by 40%? What will job j s new execution time be if 5 more nodes are added to the cluster, bringing the total to 20? How will workload execution time & dollar cost change if we move the production cluster from m1.xlarge nodes to c1.medium nodes on Amazon EC2? 6/29/2011 Starfish 32

33 Virtual Profile Estimation Given profile for job j = <p, d 1, r 1, c 1 > Estimate profile for job j' = <p, d 2, r 2, c 2 > Profile for j Dataflow Statistics Cost Statistics Dataflow Cost Input Data d 2 Resources r 2 Relative Black-box Models (Virtual) Profile for j' Cardinality Models Cost Statistics Cost Dataflow Statistics White-box Models Configuration c 2 White-box Models Dataflow 6/29/2011 Starfish 33

34 Job Optimizer Job Profile Input Data Properties Job Optimizer Subspace Enumeration Cluster Resources <p, d 1, r 1, c 1 > <d 2 > <r 2 > Recursive Random Search What-if calls Best Configuration Settings <c opt > for <p, d 2, r 2 > 6/29/2011 Starfish 34

35 Experimental Setup Hadoop cluster on 16 Amazon EC2 nodes, c1.medium type 2 map slots & 2 reduce slots 300MB max memory per task Cost-Based Job Optimizer Vs. Rule-Based Optimizer 6/29/2011 Starfish 35

36 Experimental Setup Hadoop cluster on 16 Amazon EC2 nodes, c1.medium type 2 map slots & 2 reduce slots 300MB max memory per task Cost-Based Job Optimizer Vs. Rule-Based Optimizer Abbr. MapReduce Program Domain Dataset CO Word Co-occurrence NLP 30GB,Wikipedia WC WordCount Text Analytics 30GB, Wikipedia TS TeraSort Business Analytics 30GB, Teragen LG LinkGraph Graph Processing 10GB, Wikipedia (compressed) JO Join Business Analytics 30GB, TPC-H 6/29/2011 Starfish 36

37 Speedup Job Optimizer Evaluation Default Settings Rule-Based Optimizer Cost-Based Just-in-TimeJob Optimizer 0 CO WC TS LG JO MapReduce Job 6/29/2011 Starfish 37

38 Insights from Profiles for WordCount A: Rule-Based B: Cost-Based (2x faster) Few, large spills Combiner gave high data reduction Combiner made Mappers CPU bound Many, small spills Combiner gave smaller data reduction Better resource utilization in Mappers 6/29/2011 Starfish 38

39 Estimates from the What-if Engine True surface Estimated surface 6/29/2011 Starfish 39

40 Starfish Architecture Workload-level tuning Workload Optimizer Elastisizer Workflow-level tuning Workflow Optimizer What-if Engine Job-level tuning Job Optimizer Profiler Sampler Metadata Mgr. Data Manager Intermediate Data Mgr. Data Layout & Storage Mgr. 6/29/2011 Starfish 40

41 Workflow Optimization Space Optimization Space Logical Physical Vertical Packing Partition Function Selection Job-level Configuration Dataset-level Configuration Intra-job Inter-job 6/29/2011 Starfish 41

42 Optimizations on TF-IDF Workflow D0 <{D},{W}> D1 D2 D4 J1 M1 R1 <{D, W},{f}> J2 J3, J4 M2 R2 <{D},{W, f, c}> M3 R3 M4 <{W},{D, t}> D0 <{D},{W}> J1, J2 M1 R1 M2 Logical R2 Optimization D2 <{D},{W, f, c}> J3, J4 D4 M3 R3 M4 <{W},{D, t}> Partition:{D} Sort: {D,W} Physical Optimization Reducers= 50 Compress = off Memory = 400 Reducers= 20 Compress = on Memory = 300 Legend D = docname f = frequency W = word c = count t = TF-IDF 6/29/2011 Starfish 42

43 Starfish Architecture Workload-level tuning Workload Optimizer Elastisizer Workflow-level tuning Workflow Optimizer What-if Engine Job-level tuning Job Optimizer Profiler Sampler Metadata Mgr. Data Manager Intermediate Data Mgr. Data Layout & Storage Mgr. 6/29/2011 Starfish 43

44 Cost ($) Running Time (min) Multi-objective Cluster Provisioning Cloud enables users to provision Hadoop clusters in minutes 1,200 1,000 Can avoid the man (system administrator) in the middle Pay only 800 for what resources are used m1.small m1.large m1.xlarge c1.medium c1.xlarge EC2 Instance Type m1.small m1.large m1.xlarge c1.medium c1.xlarge EC2 Instance Type 6/29/2011 Starfish 44

45 Cost ($) Running Time (min) Multi-objective Cluster Provisioning 1,200 1, m1.small m1.large m1.xlarge c1.medium c1.xlarge EC2 Instance Type for Target Cluster m1.small m1.large m1.xlarge c1.medium c1.xlarge Actual Predicted Actual Predicted EC2 Instance Type for Target Cluster Instance Type for Source Cluster: m1.large 6/29/2011 Starfish 45

46 More Info: Job-level MapReduce configuration Cluster sizing J1 Data layout tuning D J2 Workflow optimization Workload management 6/29/2011 Starfish 46

Herodotos Herodotou Shivnath Babu. Duke University

Herodotos Herodotou Shivnath Babu. Duke University Herodotos Herodotou Shivnath Babu Duke University Analysis in the Big Data Era Popular option Hadoop software stack Java / C++ / R / Python Pig Hive Jaql Oozie Elastic MapReduce Hadoop HBase MapReduce

More information

No One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-intensive Analytics

No One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-intensive Analytics No One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-intensive Analytics Herodotos Herodotou Duke University hero@cs.duke.edu Fei Dong Duke University dongfei@cs.duke.edu Shivnath Babu Duke

More information

Report 02 Data analytics workbench for educational data. Palak Agrawal

Report 02 Data analytics workbench for educational data. Palak Agrawal Report 02 Data analytics workbench for educational data Palak Agrawal Last Updated: May 22, 2014 Starfish: A Selftuning System for Big Data Analytics [1] text CONTENTS Contents 1 Introduction 1 1.1 Starfish:

More information

Starfish: A Self-tuning System for Big Data Analytics

Starfish: A Self-tuning System for Big Data Analytics Starfish: A Self-tuning System for Big Data Analytics Herodotos Herodotou, Harold Lim, Gang Luo, Nedyalko Borisov, Liang Dong, Fatma Bilgen Cetin, Shivnath Babu Department of Computer Science Duke University

More information

Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs

Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs Herodotos Herodotou Duke University hero@cs.duke.edu Shivnath Babu Duke University shivnath@cs.duke.edu ABSTRACT MapReduce

More information

IJRIT International Journal of Research in Information Technology, Volume 2, Issue 4, April 2014, Pg: 697-701 STARFISH

IJRIT International Journal of Research in Information Technology, Volume 2, Issue 4, April 2014, Pg: 697-701 STARFISH International Journal of Research in Information Technology (IJRIT) www.ijrit.com ISSN 2001-5569 A Self-Tuning System for Big Data Analytics - STARFISH Y. Sai Pramoda 1, C. Mary Sindhura 2, T. Hareesha

More information

A What-if Engine for Cost-based MapReduce Optimization

A What-if Engine for Cost-based MapReduce Optimization A What-if Engine for Cost-based MapReduce Optimization Herodotos Herodotou Microsoft Research Shivnath Babu Duke University Abstract The Starfish project at Duke University aims to provide MapReduce users

More information

Workloads. Herodotos Herodotou. Department of Computer Science Duke University. Date: Approved: Shivnath Babu, Supervisor. Jun Yang.

Workloads. Herodotos Herodotou. Department of Computer Science Duke University. Date: Approved: Shivnath Babu, Supervisor. Jun Yang. Automatic Tuning of Data-Intensive Analytical Workloads by Herodotos Herodotou Department of Computer Science Duke University Date: Approved: Shivnath Babu, Supervisor Jun Yang Jeffrey Chase Christopher

More information

Automatic Tuning of Data-Intensive Analytical. Workloads

Automatic Tuning of Data-Intensive Analytical. Workloads Automatic Tuning of Data-Intensive Analytical Workloads by Herodotos Herodotou Department of Computer Science Duke University Ph.D. Dissertation 2012 Copyright c 2012 by Herodotos Herodotou All rights

More information

An Enhanced MapReduce Workload Allocation Tool for Spot Market Resources

An Enhanced MapReduce Workload Allocation Tool for Spot Market Resources Nova Southeastern University NSUWorks CEC Theses and Dissertations College of Engineering and Computing 2015 An Enhanced MapReduce Workload Allocation Tool for Spot Market Resources John Stephen Hudzina

More information

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

More information

Analytical Processing in the Big Data Era

Analytical Processing in the Big Data Era Analytical Processing in the Big Data Era 1 Modern industrial, government, and academic organizations are collecting massive amounts of data ( Big Data ) at an unprecedented scale and pace. Companies like

More information

Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7

Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7 Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7 Yan Fisher Senior Principal Product Marketing Manager, Red Hat Rohit Bakhshi Product Manager,

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

HADOOP PERFORMANCE TUNING

HADOOP PERFORMANCE TUNING PERFORMANCE TUNING Abstract This paper explains tuning of Hadoop configuration parameters which directly affects Map-Reduce job performance under various conditions, to achieve maximum performance. The

More information

A bit about Hadoop. Luca Pireddu. March 9, 2012. CRS4Distributed Computing Group. luca.pireddu@crs4.it (CRS4) Luca Pireddu March 9, 2012 1 / 18

A bit about Hadoop. Luca Pireddu. March 9, 2012. CRS4Distributed Computing Group. luca.pireddu@crs4.it (CRS4) Luca Pireddu March 9, 2012 1 / 18 A bit about Hadoop Luca Pireddu CRS4Distributed Computing Group March 9, 2012 luca.pireddu@crs4.it (CRS4) Luca Pireddu March 9, 2012 1 / 18 Often seen problems Often seen problems Low parallelism I/O is

More information

Understanding Hadoop Performance on Lustre

Understanding Hadoop Performance on Lustre Understanding Hadoop Performance on Lustre Stephen Skory, PhD Seagate Technology Collaborators Kelsie Betsch, Daniel Kaslovsky, Daniel Lingenfelter, Dimitar Vlassarev, and Zhenzhen Yan LUG Conference 15

More information

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

More information

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Rekha Singhal and Gabriele Pacciucci * Other names and brands may be claimed as the property of others. Lustre File

More information

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

A Novel Cloud Based Elastic Framework for Big Data Preprocessing School of Systems Engineering A Novel Cloud Based Elastic Framework for Big Data Preprocessing Omer Dawelbeit and Rachel McCrindle October 21, 2014 University of Reading 2008 www.reading.ac.uk Overview

More information

Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013

Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013 Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013 * Other names and brands may be claimed as the property of others. Agenda Hadoop Intro Why run Hadoop on Lustre?

More information

ITG Software Engineering

ITG Software Engineering Introduction to Cloudera Course ID: Page 1 Last Updated 12/15/2014 Introduction to Cloudera Course : This 5 day course introduces the student to the Hadoop architecture, file system, and the Hadoop Ecosystem.

More information

Implement Hadoop jobs to extract business value from large and varied data sets

Implement Hadoop jobs to extract business value from large and varied data sets Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to

More information

Parameterizable benchmarking framework for designing a MapReduce performance model

Parameterizable benchmarking framework for designing a MapReduce performance model CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. (2014) Published online in Wiley Online Library (wileyonlinelibrary.com)..3229 SPECIAL ISSUE PAPER Parameterizable

More information

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

More information

L1: Introduction to Hadoop

L1: Introduction to Hadoop L1: Introduction to Hadoop Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Revision: December 1, 2014 Today we are going to learn... 1 General

More information

Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce Introduction to Hadoop and MapReduce THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large quantities of data

More information

Amazon Elastic Compute Cloud Getting Started Guide. My experience

Amazon Elastic Compute Cloud Getting Started Guide. My experience Amazon Elastic Compute Cloud Getting Started Guide My experience Prepare Cell Phone Credit Card Register & Activate Pricing(Singapore) Region Amazon EC2 running Linux(SUSE Linux Windows Windows with SQL

More information

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763 International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

Hadoop and Map-Reduce. Swati Gore

Hadoop and Map-Reduce. Swati Gore Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data

More information

Apache Hadoop. Alexandru Costan

Apache Hadoop. Alexandru Costan 1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open

More information

Big Data Rethink Algos and Architecture. Scott Marsh Manager R&D Personal Lines Auto Pricing

Big Data Rethink Algos and Architecture. Scott Marsh Manager R&D Personal Lines Auto Pricing Big Data Rethink Algos and Architecture Scott Marsh Manager R&D Personal Lines Auto Pricing Agenda History Map Reduce Algorithms History Google talks about their solutions to their problems Map Reduce:

More information

Performance and Energy Efficiency of. Hadoop deployment models

Performance and Energy Efficiency of. Hadoop deployment models Performance and Energy Efficiency of Hadoop deployment models Contents Review: What is MapReduce Review: What is Hadoop Hadoop Deployment Models Metrics Experiment Results Summary MapReduce Introduced

More information

A very short Intro to Hadoop

A very short Intro to Hadoop 4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,

More information

Oracle Big Data SQL Technical Update

Oracle Big Data SQL Technical Update Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical

More information

Apache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas

Apache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas Apache Flink Next-gen data analysis Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research

More information

Using distributed technologies to analyze Big Data

Using distributed technologies to analyze Big Data Using distributed technologies to analyze Big Data Abhijit Sharma Innovation Lab BMC Software 1 Data Explosion in Data Center Performance / Time Series Data Incoming data rates ~Millions of data points/

More information

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides. MapReduce for data intensive computing Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

More information

Performance Comparison of Intel Enterprise Edition for Lustre* software and HDFS for MapReduce Applications

Performance Comparison of Intel Enterprise Edition for Lustre* software and HDFS for MapReduce Applications Performance Comparison of Intel Enterprise Edition for Lustre software and HDFS for MapReduce Applications Rekha Singhal, Gabriele Pacciucci and Mukesh Gangadhar 2 Hadoop Introduc-on Open source MapReduce

More information

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social

More information

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop, Why? Need to process huge datasets on large clusters of computers

More information

Hadoop vs Apache Spark

Hadoop vs Apache Spark Innovate, Integrate, Transform Hadoop vs Apache Spark www.altencalsoftlabs.com Introduction Any sufficiently advanced technology is indistinguishable from magic. said Arthur C. Clark. Big data technologies

More information

Performance Analysis of Hadoop for Query Processing

Performance Analysis of Hadoop for Query Processing 211 Workshops of International Conference on Advanced Information Networking and Applications Performance Analysis of Hadoop for Query Processing Tomasz Wiktor Wlodarczyk, Yi Han, Chunming Rong Department

More information

A Brief Outline on Bigdata Hadoop

A Brief Outline on Bigdata Hadoop A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is

More information

Hadoop Ecosystem B Y R A H I M A.

Hadoop Ecosystem B Y R A H I M A. Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open

More information

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop

More information

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012 MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

More information

Where We Are. References. Cloud Computing. Levels of Service. Cloud Computing History. Introduction to Data Management CSE 344

Where We Are. References. Cloud Computing. Levels of Service. Cloud Computing History. Introduction to Data Management CSE 344 Where We Are Introduction to Data Management CSE 344 Lecture 25: DBMS-as-a-service and NoSQL We learned quite a bit about data management see course calendar Three topics left: DBMS-as-a-service and NoSQL

More information

A Framework for Performance Analysis and Tuning in Hadoop Based Clusters

A Framework for Performance Analysis and Tuning in Hadoop Based Clusters A Framework for Performance Analysis and Tuning in Hadoop Based Clusters Garvit Bansal Anshul Gupta Utkarsh Pyne LNMIIT, Jaipur, India Email: [garvit.bansal anshul.gupta utkarsh.pyne] @lnmiit.ac.in Manish

More information

Hadoop Development & BI- 0 to 100

Hadoop Development & BI- 0 to 100 Development Master the Data Analysis tools like Pig and hive Data Science Hadoop Development & BI- 0 to 100 Build a recommendation engine Hadoop Development - 0 to 100 HADOOP SCHOOL OF TRAINING Basics

More information

Map Reduce & Hadoop Recommended Text:

Map Reduce & Hadoop Recommended Text: Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately

More information

Big Fast Data Hadoop acceleration with Flash. June 2013

Big Fast Data Hadoop acceleration with Flash. June 2013 Big Fast Data Hadoop acceleration with Flash June 2013 Agenda The Big Data Problem What is Hadoop Hadoop and Flash The Nytro Solution Test Results The Big Data Problem Big Data Output Facebook Traditional

More information

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web

More information

A Brief Introduction to Apache Tez

A Brief Introduction to Apache Tez A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value

More information

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics Juwei Shi, Yunjie Qiu, Umar Farooq Minhas, Limei Jiao, Chen Wang, Berthold Reinwald, and Fatma Özcan IBM Research China IBM Almaden

More information

Firebird meets NoSQL (Apache HBase) Case Study

Firebird meets NoSQL (Apache HBase) Case Study Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI

More information

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

More information

Hadoop Big Data for Processing Data and Performing Workload

Hadoop Big Data for Processing Data and Performing Workload Hadoop Big Data for Processing Data and Performing Workload Girish T B 1, Shadik Mohammed Ghouse 2, Dr. B. R. Prasad Babu 3 1 M Tech Student, 2 Assosiate professor, 3 Professor & Head (PG), of Computer

More information

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum

Big Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum Big Data Analytics with EMC Greenplum and Hadoop Big Data Analytics with EMC Greenplum and Hadoop Ofir Manor Pre Sales Technical Architect EMC Greenplum 1 Big Data and the Data Warehouse Potential All

More information

COURSE CONTENT Big Data and Hadoop Training

COURSE CONTENT Big Data and Hadoop Training COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop

More information

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created

More information

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora {mbalassi, gyfora}@apache.org The Flink Big Data Analytics Platform Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org What is Apache Flink? Open Source Started in 2009 by the Berlin-based database research groups In the Apache

More information

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer timo.aaltonen@tut.fi Assistants: Henri Terho and Antti

More information

Hadoop IST 734 SS CHUNG

Hadoop IST 734 SS CHUNG Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to

More information

Matchmaking in the Cloud: Amazon EC2 and Apache Hadoop at eharmony

Matchmaking in the Cloud: Amazon EC2 and Apache Hadoop at eharmony Matchmaking in the Cloud: Amazon EC2 and Apache Hadoop at eharmony Speaker logo centered below image Steve Kuo, Software Architect Joshua Tuberville, Software Architect Goal > Leverage EC2 and Hadoop to

More information

GraySort on Apache Spark by Databricks

GraySort on Apache Spark by Databricks GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner

More information

Hadoop & Spark Using Amazon EMR

Hadoop & Spark Using Amazon EMR Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?

More information

Data Mining in the Swamp

Data Mining in the Swamp WHITE PAPER Page 1 of 8 Data Mining in the Swamp Taming Unruly Data with Cloud Computing By John Brothers Business Intelligence is all about making better decisions from the data you have. However, all

More information

HiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group

HiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group HiBench Introduction Carson Wang (carson.wang@intel.com) Agenda Background Workloads Configurations Benchmark Report Tuning Guide Background WHY Why we need big data benchmarking systems? WHAT What is

More information

Hadoop Memory Usage Model

Hadoop Memory Usage Model Hadoop Memory Usage Model Lijie Xu xulijie9@otcaix.iscas.ac.cn Technical Report Institute of Software, Chinese Academy of Sciences November 15, 213 Abstract Hadoop MapReduce is a powerful open-source framework

More information

A Comparison of Approaches to Large-Scale Data Analysis

A Comparison of Approaches to Large-Scale Data Analysis A Comparison of Approaches to Large-Scale Data Analysis Sam Madden MIT CSAIL with Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, and Michael Stonebraker In SIGMOD 2009 MapReduce

More information

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park Big Data: Using ArcGIS with Apache Hadoop Erik Hoel and Mike Park Outline Overview of Hadoop Adding GIS capabilities to Hadoop Integrating Hadoop with ArcGIS Apache Hadoop What is Hadoop? Hadoop is a scalable

More information

Application Development. A Paradigm Shift

Application Development. A Paradigm Shift Application Development for the Cloud: A Paradigm Shift Ramesh Rangachar Intelsat t 2012 by Intelsat. t Published by The Aerospace Corporation with permission. New 2007 Template - 1 Motivation for the

More information

MapReduce Online. Tyson Condie, Neil Conway, Peter Alvaro, Joseph Hellerstein, Khaled Elmeleegy, Russell Sears. Neeraj Ganapathy

MapReduce Online. Tyson Condie, Neil Conway, Peter Alvaro, Joseph Hellerstein, Khaled Elmeleegy, Russell Sears. Neeraj Ganapathy MapReduce Online Tyson Condie, Neil Conway, Peter Alvaro, Joseph Hellerstein, Khaled Elmeleegy, Russell Sears Neeraj Ganapathy Outline Hadoop Architecture Pipelined MapReduce Online Aggregation Continuous

More information

UBUNTU DISK IO BENCHMARK TEST RESULTS

UBUNTU DISK IO BENCHMARK TEST RESULTS UBUNTU DISK IO BENCHMARK TEST RESULTS FOR JOYENT Revision 2 January 5 th, 2010 The IMS Company Scope: This report summarizes the Disk Input Output (IO) benchmark testing performed in December of 2010 for

More information

High Performance Computing MapReduce & Hadoop. 17th Apr 2014

High Performance Computing MapReduce & Hadoop. 17th Apr 2014 High Performance Computing MapReduce & Hadoop 17th Apr 2014 MapReduce Programming model for parallel processing vast amounts of data (TBs/PBs) distributed on commodity clusters Borrows from map() and reduce()

More information

DEPLOYING AND MONITORING HADOOP MAP-REDUCE ANALYTICS ON SINGLE-CHIP CLOUD COMPUTER

DEPLOYING AND MONITORING HADOOP MAP-REDUCE ANALYTICS ON SINGLE-CHIP CLOUD COMPUTER DEPLOYING AND MONITORING HADOOP MAP-REDUCE ANALYTICS ON SINGLE-CHIP CLOUD COMPUTER ANDREAS-LAZAROS GEORGIADIS, SOTIRIOS XYDIS, DIMITRIOS SOUDRIS MICROPROCESSOR AND MICROSYSTEMS LABORATORY ELECTRICAL AND

More information

Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7. Using Hadoop Cluster and MapReduce Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

More information

Accelerating and Simplifying Apache

Accelerating and Simplifying Apache Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly

More information

Maximizing Hadoop Performance with Hardware Compression

Maximizing Hadoop Performance with Hardware Compression Maximizing Hadoop Performance with Hardware Compression Robert Reiner Director of Marketing Compression and Security Exar Corporation November 2012 1 What is Big? sets whose size is beyond the ability

More information

BlobSeer in the context of MapReduce applications

BlobSeer in the context of MapReduce applications KerData team, INRIA 1 Core 2 Integrating BlobSeer with Experimental evaluation Microbenchmarks Experiments with Map/Reduce Applications 3 Application case 4 Outline Core Yahoo! s implementation of MapReduce

More information

MapReduce Job Processing

MapReduce Job Processing April 17, 2012 Background: Hadoop Distributed File System (HDFS) Hadoop requires a Distributed File System (DFS), we utilize the Hadoop Distributed File System (HDFS). Background: Hadoop Distributed File

More information

Big Business, Big Data, Industrialized Workload

Big Business, Big Data, Industrialized Workload Big Business, Big Data, Industrialized Workload Big Data Big Data 4 Billion 600TB London - NYC 1 Billion by 2020 100 Million Giga Bytes Copyright 3/20/2014 BMC Software, Inc 2 Copyright 3/20/2014 BMC Software,

More information

Open source Google-style large scale data analysis with Hadoop

Open source Google-style large scale data analysis with Hadoop Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical

More information

Keywords: Big Data, HDFS, Map Reduce, Hadoop

Keywords: Big Data, HDFS, Map Reduce, Hadoop Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning

More information

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With

More information

High Throughput Sequencing Data Analysis using Cloud Computing

High Throughput Sequencing Data Analysis using Cloud Computing High Throughput Sequencing Data Analysis using Cloud Computing Stéphane Le Crom (stephane.le_crom@upmc.fr) LBD - Université Pierre et Marie Curie (UPMC) Institut de Biologie de l École normale supérieure

More information

A Performance Analysis of Distributed Indexing using Terrier

A Performance Analysis of Distributed Indexing using Terrier A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop Miles Osborne School of Informatics University of Edinburgh miles@inf.ed.ac.uk October 28, 2010 Miles Osborne Introduction to Hadoop 1 Background Hadoop Programming Model Examples

More information

From GWS to MapReduce: Google s Cloud Technology in the Early Days

From GWS to MapReduce: Google s Cloud Technology in the Early Days Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org MapReduce/Hadoop

More information

See Spot Run: Using Spot Instances for MapReduce Workflows

See Spot Run: Using Spot Instances for MapReduce Workflows See Spot Run: Using Spot Instances for MapReduce Workflows Navraj Chohan Claris Castillo Mike Spreitzer Malgorzata Steinder Asser Tantawi Chandra Krintz IBM Watson Research Hawthorne, New York Computer

More information

Architectures for Big Data Analytics A database perspective

Architectures for Big Data Analytics A database perspective Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum

More information

Lustre * Filesystem for Cloud and Hadoop *

Lustre * Filesystem for Cloud and Hadoop * OpenFabrics Software User Group Workshop Lustre * Filesystem for Cloud and Hadoop * Robert Read, Intel Lustre * for Cloud and Hadoop * Brief Lustre History and Overview Using Lustre with Hadoop Intel Cloud

More information

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

Introduction to Big Data! with Apache Spark UC#BERKELEY# Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"

More information

Peers Techno log ies Pv t. L td. HADOOP

Peers Techno log ies Pv t. L td. HADOOP Page 1 Peers Techno log ies Pv t. L td. Course Brochure Overview Hadoop is a Open Source from Apache, which provides reliable storage and faster process by using the Hadoop distibution file system and

More information

Spark: Cluster Computing with Working Sets

Spark: Cluster Computing with Working Sets Spark: Cluster Computing with Working Sets Outline Why? Mesos Resilient Distributed Dataset Spark & Scala Examples Uses Why? MapReduce deficiencies: Standard Dataflows are Acyclic Prevents Iterative Jobs

More information

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com Data Warehousing and Analytics Infrastructure at Facebook Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com Overview Challenges in a Fast Growing & Dynamic Environment Data Flow Architecture,

More information

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce Analytics in the Cloud Peter Sirota, GM Elastic MapReduce Data-Driven Decision Making Data is the new raw material for any business on par with capital, people, and labor. What is Big Data? Terabytes of

More information

I/O Considerations in Big Data Analytics

I/O Considerations in Big Data Analytics Library of Congress I/O Considerations in Big Data Analytics 26 September 2011 Marshall Presser Federal Field CTO EMC, Data Computing Division 1 Paradigms in Big Data Structured (relational) data Very

More information