Duke University
|
|
- Stephen Hensley
- 8 years ago
- Views:
Transcription
1 Herodotos Herodotou, Harold Lim, Fei Dong, Shivnath Babu Duke University
2 Practitioners of Big Data Analytics Google Yahoo! Facebook ebay Physicists Biologists Economists Data Size Journalists Systems researchers Workload Complexity Counts & Aggregates Statistical analysis, Linear Algebra Rollups & Drilldowns Machine learning Text / Images / Video / Graphs 6/29/2011 Starfish 2
3 MapReduce/Hadoop Ecosystem Java / Ruby / Python Client Pig Hive Jaql Oozie Elastic MapReduce Hadoop MapReduce Execution Engine Distributed File System HBase 6/29/2011 Starfish 3
4 MADDER Principles of Big Data Analytics Magnetic Agile Deep Data-lifecycle-aware Elastic Robust 6/29/2011 Starfish 4
5 MADDER Principles of Big Data Analytics Magnetic A D D E R Easy to get data into the system 6/29/2011 Starfish 5
6 MADDER Principles of Big Data Analytics M Agile D D E R Make change (data/requirements) easy 6/29/2011 Starfish 6
7 MADDER Principles of Big Data Analytics M A Deep D E R Support the full spectrum of analytics Write MapReduce programs in Java / Python / R or use interfaces like Pig / Jaql 6/29/2011 Starfish 7
8 MADDER Principles of Big Data Analytics M A D Data-lifecycleaware E R Data cycle at LinkedIn 6/29/2011 Starfish 8
9 MADDER Principles of Big Data Analytics M A D D Elastic R Adapt resources/costs to actual workloads 6/29/2011 Starfish 9
10 MADDER Principles of Big Data Analytics M A D D E Robust Graceful degradation under undesirable events 6/29/2011 Starfish 10
11 Ease-of-Use Vs. Out-of-the-box Perf. Magnetic Agile Deep Data-lifecycle-aware Elastic Robust Ease of use Data can be opaque until run-time Programs are a different beast from SQL Policies are nontrivial Hard to get good performance out of the box 6/29/2011 Starfish 11
12 Starfish: MADDER + Self-Tuning Magnetic Agile Deep Data-lifecycle-aware Elastic Robust Ease of use Profile Recursive random search Dynamic instrumentation Mix of models & simulation Relative what-if calls Get good performance automatically 6/29/2011 Starfish 12
13 Starfish: MADDER + Self-Tuning Goal: Provide good performance automatically Java Client Pig Hive Oozie Elastic MR Analytics System Starfish Hadoop MapReduce Execution Engine Distributed File System 6/29/2011 Starfish 13
14 What are the Tuning Problems? Job-level MapReduce configuration Cluster sizing J1 Data layout tuning D J2 Workflow optimization Workload management 6/29/2011 Starfish 14
15 Starfish Architecture Workload-level tuning Workload Optimizer Elastisizer Workflow-level tuning Workflow Optimizer What-if Engine Job-level tuning Job Optimizer Profiler Sampler Metadata Mgr. Data Manager Intermediate Data Mgr. Data Layout & Storage Mgr. 6/29/2011 Starfish 15
16 Starfish Architecture Workload-level tuning Workload Optimizer Elastisizer Workflow-level tuning Workflow Optimizer What-if Engine Job-level tuning Job Optimizer Profiler Sampler Metadata Mgr. Data Manager Intermediate Data Mgr. Data Layout & Storage Mgr. 6/29/2011 Starfish 16
17 MapReduce Job Execution job j = < program p, data d, resources r, configuration c > Map function Reduce function Input Splits Run this program as a MapReduce job
18 MapReduce Job Execution job j = < program p, data d, resources r, configuration c > Input Splits Map Wave 1 Map Wave 2 Reduce Wave 1 Reduce Wave 2 How are the number of splits, number of map and reduce tasks, memory allocation to tasks, etc., determined?
19 Optimizing MapReduce Job Execution job j = < program p, data d, resources r, configuration c > Space of configuration choices include settings for: Number of map tasks Number of reduce tasks Partitioning of map outputs to reduce tasks Memory allocation to task-level buffers Multiphase external sorting in the tasks Whether output data from tasks should be compressed Whether combine function should be used 6/29/2011 Starfish 19
20 Optimizing MapReduce Job Execution 2-dim projection of 13-dim surface 190+ parameters in Hadoop 6/29/2011 Starfish 20
21 Case for Automated Hadoop Tuning (from wiki.apache.org/pig/pigjournal) Hadoop has that can affect the latency and scalability of a job. For different types of jobs, different configurations will yield optimal results. For example, a job with no memory-intensive operations in the map phase but with a combine phase will want to set Hadoop's io.sort.mb quite high, to minimize the number of spills from the map. many configuration parameters feature will greatly increase the utility Adding this of Pig for Hadoop users, as it will free them from needing to understand Hadoop well enough to tune it themselves for their particular jobs. 6/29/2011 Starfish 21
22 Starfish s Core Approach to Tuning c S Challenges: p is an arbitrary MapReduce program; c is high-dimensional; Profiler What-if Engine Optimizer(s) perf F( p, d, r, c) program p, data d, resources r, copt arg min F( p, d, r, c) configuration c Runs MapReduce jobs to collect job profiles (concise execution summary) Given profile of j = <p,d,r,c>, estimates virtual profile for job j' = <p,d,r,c > Enumerates and searches through the optimization space efficiently 6/29/2011 Starfish 22
23 Profile of a MapReduce Job Concise representation of program execution Records information at the level of task phases split 0 map split 2 map reduce out 0 split 1 map split 3 map reduce Out 1 Two Map Waves One Reduce Wave 6/29/2011 Starfish 23
24 Profile of a MapReduce Job Concise representation of program execution Records information at the level of task phases split 0 map reduce out 0 split 0 Map func Serialize, Partition Memory Buffer Sort, [Combine], [Compress] Merge DFS Map Task Phases Read Map Collect Spill Merge 6/29/2011 Starfish 24
25 Profile of a MapReduce Job Concise representation of program execution Records information at the level of task phases split 0 map reduce out 0 Profile Dataflow Amount of data flowing though tasks & task phases Dataflow Statistics Statistical info about the dataflow Cost Execution time at level of tasks & task phases Cost Statistics Statistical info about the costs 6/29/2011 Starfish 25
26 Fields in a Job Profile Dataflow Map output bytes Number of map-side spills Number of merge rounds Number of records in buffer per spill Cost Read phase time in the map task Map phase time in the map task Collect phase time in the map task Spill phase time in the map task Dataflow Statistics Map func s selectivity (output / input) Map output compression ratio Combiner s selectivity Size of records (keys and values) Cost Statistics I/O cost for reading from local disk per byte I/O cost for writing to HDFS per byte CPU cost for executing Map func per record CPU cost for uncompressing the input per byte 6/29/2011 Starfish 26
27 Generating Profiles Concise representation of program execution Records information at the level of task phases Generated by Profiler through measurement or by the What-if Engine through estimation 6/29/2011 Starfish 27
28 Generating Profiles by Measurement Goals Have zero overhead when off, low overhead when on Require no modifications to Hadoop Support unmodified MapReduce programs written in Java/Python/Ruby/C++ Approach: Dynamic (on-demand) instrumentation Event-condition-action rules are specified (in Java) Leads to run-time instrumentation of Hadoop internals Monitors task phases of MapReduce job execution We currently use BTrace (Hadoop internals are in Java) 6/29/2011 Starfish 28
29 Generating Profiles by Measurement JVM Enable Profiling JVM split 0 map reduce out 0 raw data ECA rules raw data JVM split 1 map map profile reduce profile raw data job profile Use of Sampling Profile fewer tasks Execute fewer tasks JVM = Java Virtual Machine, ECA = Event-Condition-Action 6/29/2011 Starfish 29
30 Overhead of Profile Measurement Word Co-occurrence job running on a 16-node cluster of c1.medium EC2 nodes on the Amazon Cloud 6/29/2011 Starfish 30
31 What-if Engine Possibly Hypothetical Job Profile Input Data Properties What-if Engine Job Oracle Cluster Resources Virtual Job Profile for <p, d 2, r 2, c 2 > Task Scheduler Simulator Configuration Settings <p, d 1, r 1, c 1 > <d 2 > <r 2 > <c 2 > Properties of Hypothetical Job 6/29/2011 Starfish 31
32 What-if Questions Starfish can Answer How will job j s execution time change if the number of reduce tasks is changed from 20 to 40? What will the change in I/O be if map o/p compression is turned on, but the input data size increases by 40%? What will job j s new execution time be if 5 more nodes are added to the cluster, bringing the total to 20? How will workload execution time & dollar cost change if we move the production cluster from m1.xlarge nodes to c1.medium nodes on Amazon EC2? 6/29/2011 Starfish 32
33 Virtual Profile Estimation Given profile for job j = <p, d 1, r 1, c 1 > Estimate profile for job j' = <p, d 2, r 2, c 2 > Profile for j Dataflow Statistics Cost Statistics Dataflow Cost Input Data d 2 Resources r 2 Relative Black-box Models (Virtual) Profile for j' Cardinality Models Cost Statistics Cost Dataflow Statistics White-box Models Configuration c 2 White-box Models Dataflow 6/29/2011 Starfish 33
34 Job Optimizer Job Profile Input Data Properties Job Optimizer Subspace Enumeration Cluster Resources <p, d 1, r 1, c 1 > <d 2 > <r 2 > Recursive Random Search What-if calls Best Configuration Settings <c opt > for <p, d 2, r 2 > 6/29/2011 Starfish 34
35 Experimental Setup Hadoop cluster on 16 Amazon EC2 nodes, c1.medium type 2 map slots & 2 reduce slots 300MB max memory per task Cost-Based Job Optimizer Vs. Rule-Based Optimizer 6/29/2011 Starfish 35
36 Experimental Setup Hadoop cluster on 16 Amazon EC2 nodes, c1.medium type 2 map slots & 2 reduce slots 300MB max memory per task Cost-Based Job Optimizer Vs. Rule-Based Optimizer Abbr. MapReduce Program Domain Dataset CO Word Co-occurrence NLP 30GB,Wikipedia WC WordCount Text Analytics 30GB, Wikipedia TS TeraSort Business Analytics 30GB, Teragen LG LinkGraph Graph Processing 10GB, Wikipedia (compressed) JO Join Business Analytics 30GB, TPC-H 6/29/2011 Starfish 36
37 Speedup Job Optimizer Evaluation Default Settings Rule-Based Optimizer Cost-Based Just-in-TimeJob Optimizer 0 CO WC TS LG JO MapReduce Job 6/29/2011 Starfish 37
38 Insights from Profiles for WordCount A: Rule-Based B: Cost-Based (2x faster) Few, large spills Combiner gave high data reduction Combiner made Mappers CPU bound Many, small spills Combiner gave smaller data reduction Better resource utilization in Mappers 6/29/2011 Starfish 38
39 Estimates from the What-if Engine True surface Estimated surface 6/29/2011 Starfish 39
40 Starfish Architecture Workload-level tuning Workload Optimizer Elastisizer Workflow-level tuning Workflow Optimizer What-if Engine Job-level tuning Job Optimizer Profiler Sampler Metadata Mgr. Data Manager Intermediate Data Mgr. Data Layout & Storage Mgr. 6/29/2011 Starfish 40
41 Workflow Optimization Space Optimization Space Logical Physical Vertical Packing Partition Function Selection Job-level Configuration Dataset-level Configuration Intra-job Inter-job 6/29/2011 Starfish 41
42 Optimizations on TF-IDF Workflow D0 <{D},{W}> D1 D2 D4 J1 M1 R1 <{D, W},{f}> J2 J3, J4 M2 R2 <{D},{W, f, c}> M3 R3 M4 <{W},{D, t}> D0 <{D},{W}> J1, J2 M1 R1 M2 Logical R2 Optimization D2 <{D},{W, f, c}> J3, J4 D4 M3 R3 M4 <{W},{D, t}> Partition:{D} Sort: {D,W} Physical Optimization Reducers= 50 Compress = off Memory = 400 Reducers= 20 Compress = on Memory = 300 Legend D = docname f = frequency W = word c = count t = TF-IDF 6/29/2011 Starfish 42
43 Starfish Architecture Workload-level tuning Workload Optimizer Elastisizer Workflow-level tuning Workflow Optimizer What-if Engine Job-level tuning Job Optimizer Profiler Sampler Metadata Mgr. Data Manager Intermediate Data Mgr. Data Layout & Storage Mgr. 6/29/2011 Starfish 43
44 Cost ($) Running Time (min) Multi-objective Cluster Provisioning Cloud enables users to provision Hadoop clusters in minutes 1,200 1,000 Can avoid the man (system administrator) in the middle Pay only 800 for what resources are used m1.small m1.large m1.xlarge c1.medium c1.xlarge EC2 Instance Type m1.small m1.large m1.xlarge c1.medium c1.xlarge EC2 Instance Type 6/29/2011 Starfish 44
45 Cost ($) Running Time (min) Multi-objective Cluster Provisioning 1,200 1, m1.small m1.large m1.xlarge c1.medium c1.xlarge EC2 Instance Type for Target Cluster m1.small m1.large m1.xlarge c1.medium c1.xlarge Actual Predicted Actual Predicted EC2 Instance Type for Target Cluster Instance Type for Source Cluster: m1.large 6/29/2011 Starfish 45
46 More Info: Job-level MapReduce configuration Cluster sizing J1 Data layout tuning D J2 Workflow optimization Workload management 6/29/2011 Starfish 46
Herodotos Herodotou Shivnath Babu. Duke University
Herodotos Herodotou Shivnath Babu Duke University Analysis in the Big Data Era Popular option Hadoop software stack Java / C++ / R / Python Pig Hive Jaql Oozie Elastic MapReduce Hadoop HBase MapReduce
More informationNo One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-intensive Analytics
No One (Cluster) Size Fits All: Automatic Cluster Sizing for Data-intensive Analytics Herodotos Herodotou Duke University hero@cs.duke.edu Fei Dong Duke University dongfei@cs.duke.edu Shivnath Babu Duke
More informationReport 02 Data analytics workbench for educational data. Palak Agrawal
Report 02 Data analytics workbench for educational data Palak Agrawal Last Updated: May 22, 2014 Starfish: A Selftuning System for Big Data Analytics [1] text CONTENTS Contents 1 Introduction 1 1.1 Starfish:
More informationStarfish: A Self-tuning System for Big Data Analytics
Starfish: A Self-tuning System for Big Data Analytics Herodotos Herodotou, Harold Lim, Gang Luo, Nedyalko Borisov, Liang Dong, Fatma Bilgen Cetin, Shivnath Babu Department of Computer Science Duke University
More informationProfiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs
Profiling, What-if Analysis, and Cost-based Optimization of MapReduce Programs Herodotos Herodotou Duke University hero@cs.duke.edu Shivnath Babu Duke University shivnath@cs.duke.edu ABSTRACT MapReduce
More informationIJRIT International Journal of Research in Information Technology, Volume 2, Issue 4, April 2014, Pg: 697-701 STARFISH
International Journal of Research in Information Technology (IJRIT) www.ijrit.com ISSN 2001-5569 A Self-Tuning System for Big Data Analytics - STARFISH Y. Sai Pramoda 1, C. Mary Sindhura 2, T. Hareesha
More informationWorkloads. Herodotos Herodotou. Department of Computer Science Duke University. Date: Approved: Shivnath Babu, Supervisor. Jun Yang.
Automatic Tuning of Data-Intensive Analytical Workloads by Herodotos Herodotou Department of Computer Science Duke University Date: Approved: Shivnath Babu, Supervisor Jun Yang Jeffrey Chase Christopher
More informationAutomatic Tuning of Data-Intensive Analytical. Workloads
Automatic Tuning of Data-Intensive Analytical Workloads by Herodotos Herodotou Department of Computer Science Duke University Ph.D. Dissertation 2012 Copyright c 2012 by Herodotos Herodotou All rights
More informationA What-if Engine for Cost-based MapReduce Optimization
A What-if Engine for Cost-based MapReduce Optimization Herodotos Herodotou Microsoft Research Shivnath Babu Duke University Abstract The Starfish project at Duke University aims to provide MapReduce users
More informationAn Enhanced MapReduce Workload Allocation Tool for Spot Market Resources
Nova Southeastern University NSUWorks CEC Theses and Dissertations College of Engineering and Computing 2015 An Enhanced MapReduce Workload Allocation Tool for Spot Market Resources John Stephen Hudzina
More informationAnalytical Processing in the Big Data Era
Analytical Processing in the Big Data Era 1 Modern industrial, government, and academic organizations are collecting massive amounts of data ( Big Data ) at an unprecedented scale and pace. Companies like
More informationHadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh
1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets
More informationArchitecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7
Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7 Yan Fisher Senior Principal Product Marketing Manager, Red Hat Rohit Bakhshi Product Manager,
More informationCSE-E5430 Scalable Cloud Computing Lecture 2
CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing
More informationHADOOP PERFORMANCE TUNING
PERFORMANCE TUNING Abstract This paper explains tuning of Hadoop configuration parameters which directly affects Map-Reduce job performance under various conditions, to achieve maximum performance. The
More informationUnderstanding Hadoop Performance on Lustre
Understanding Hadoop Performance on Lustre Stephen Skory, PhD Seagate Technology Collaborators Kelsie Betsch, Daniel Kaslovsky, Daniel Lingenfelter, Dimitar Vlassarev, and Zhenzhen Yan LUG Conference 15
More informationA bit about Hadoop. Luca Pireddu. March 9, 2012. CRS4Distributed Computing Group. luca.pireddu@crs4.it (CRS4) Luca Pireddu March 9, 2012 1 / 18
A bit about Hadoop Luca Pireddu CRS4Distributed Computing Group March 9, 2012 luca.pireddu@crs4.it (CRS4) Luca Pireddu March 9, 2012 1 / 18 Often seen problems Often seen problems Low parallelism I/O is
More informationPerformance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems
Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Rekha Singhal and Gabriele Pacciucci * Other names and brands may be claimed as the property of others. Lustre File
More informationITG Software Engineering
Introduction to Cloudera Course ID: Page 1 Last Updated 12/15/2014 Introduction to Cloudera Course : This 5 day course introduces the student to the Hadoop architecture, file system, and the Hadoop Ecosystem.
More informationDATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of
More informationHadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013
Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013 * Other names and brands may be claimed as the property of others. Agenda Hadoop Intro Why run Hadoop on Lustre?
More informationCSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)
CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model
More informationImplement Hadoop jobs to extract business value from large and varied data sets
Hadoop Development for Big Data Solutions: Hands-On You Will Learn How To: Implement Hadoop jobs to extract business value from large and varied data sets Write, customize and deploy MapReduce jobs to
More informationHadoop and Map-Reduce. Swati Gore
Hadoop and Map-Reduce Swati Gore Contents Why Hadoop? Hadoop Overview Hadoop Architecture Working Description Fault Tolerance Limitations Why Map-Reduce not MPI Distributed sort Why Hadoop? Existing Data
More informationParameterizable benchmarking framework for designing a MapReduce performance model
CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. (2014) Published online in Wiley Online Library (wileyonlinelibrary.com)..3229 SPECIAL ISSUE PAPER Parameterizable
More informationInternational Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763
International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing
More informationAmazon Elastic Compute Cloud Getting Started Guide. My experience
Amazon Elastic Compute Cloud Getting Started Guide My experience Prepare Cell Phone Credit Card Register & Activate Pricing(Singapore) Region Amazon EC2 running Linux(SUSE Linux Windows Windows with SQL
More informationA Novel Cloud Based Elastic Framework for Big Data Preprocessing
School of Systems Engineering A Novel Cloud Based Elastic Framework for Big Data Preprocessing Omer Dawelbeit and Rachel McCrindle October 21, 2014 University of Reading 2008 www.reading.ac.uk Overview
More informationClash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics
Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics Juwei Shi, Yunjie Qiu, Umar Farooq Minhas, Limei Jiao, Chen Wang, Berthold Reinwald, and Fatma Özcan IBM Research China IBM Almaden
More informationApache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas
Apache Flink Next-gen data analysis Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research
More informationJeffrey D. Ullman slides. MapReduce for data intensive computing
Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very
More informationOpen source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
More informationBig Data Rethink Algos and Architecture. Scott Marsh Manager R&D Personal Lines Auto Pricing
Big Data Rethink Algos and Architecture Scott Marsh Manager R&D Personal Lines Auto Pricing Agenda History Map Reduce Algorithms History Google talks about their solutions to their problems Map Reduce:
More informationA Brief Outline on Bigdata Hadoop
A Brief Outline on Bigdata Hadoop Twinkle Gupta 1, Shruti Dixit 2 RGPV, Department of Computer Science and Engineering, Acropolis Institute of Technology and Research, Indore, India Abstract- Bigdata is
More informationPerformance Comparison of Intel Enterprise Edition for Lustre* software and HDFS for MapReduce Applications
Performance Comparison of Intel Enterprise Edition for Lustre software and HDFS for MapReduce Applications Rekha Singhal, Gabriele Pacciucci and Mukesh Gangadhar 2 Hadoop Introduc-on Open source MapReduce
More informationL1: Introduction to Hadoop
L1: Introduction to Hadoop Feng Li feng.li@cufe.edu.cn School of Statistics and Mathematics Central University of Finance and Economics Revision: December 1, 2014 Today we are going to learn... 1 General
More informationHadoop Development & BI- 0 to 100
Development Master the Data Analysis tools like Pig and hive Data Science Hadoop Development & BI- 0 to 100 Build a recommendation engine Hadoop Development - 0 to 100 HADOOP SCHOOL OF TRAINING Basics
More informationPerformance and Energy Efficiency of. Hadoop deployment models
Performance and Energy Efficiency of Hadoop deployment models Contents Review: What is MapReduce Review: What is Hadoop Hadoop Deployment Models Metrics Experiment Results Summary MapReduce Introduced
More informationA very short Intro to Hadoop
4 Overview A very short Intro to Hadoop photo by: exfordy, flickr 5 How to Crunch a Petabyte? Lots of disks, spinning all the time Redundancy, since disks die Lots of CPU cores, working all the time Retry,
More informationUsing distributed technologies to analyze Big Data
Using distributed technologies to analyze Big Data Abhijit Sharma Innovation Lab BMC Software 1 Data Explosion in Data Center Performance / Time Series Data Incoming data rates ~Millions of data points/
More informationA Framework for Performance Analysis and Tuning in Hadoop Based Clusters
A Framework for Performance Analysis and Tuning in Hadoop Based Clusters Garvit Bansal Anshul Gupta Utkarsh Pyne LNMIIT, Jaipur, India Email: [garvit.bansal anshul.gupta utkarsh.pyne] @lnmiit.ac.in Manish
More informationApache Hadoop. Alexandru Costan
1 Apache Hadoop Alexandru Costan Big Data Landscape No one-size-fits-all solution: SQL, NoSQL, MapReduce, No standard, except Hadoop 2 Outline What is Hadoop? Who uses it? Architecture HDFS MapReduce Open
More informationA Comparison of Approaches to Large-Scale Data Analysis
A Comparison of Approaches to Large-Scale Data Analysis Sam Madden MIT CSAIL with Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, and Michael Stonebraker In SIGMOD 2009 MapReduce
More informationHadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com
Hadoop Distributed File System Dhruba Borthakur Apache Hadoop Project Management Committee dhruba@apache.org dhruba@facebook.com Hadoop, Why? Need to process huge datasets on large clusters of computers
More informationHadoop Ecosystem B Y R A H I M A.
Hadoop Ecosystem B Y R A H I M A. History of Hadoop Hadoop was created by Doug Cutting, the creator of Apache Lucene, the widely used text search library. Hadoop has its origins in Apache Nutch, an open
More informationThe Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org
The Flink Big Data Analytics Platform Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org What is Apache Flink? Open Source Started in 2009 by the Berlin-based database research groups In the Apache
More informationChapter 7. Using Hadoop Cluster and MapReduce
Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in
More informationIntroduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data
Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give
More informationApplication Development. A Paradigm Shift
Application Development for the Cloud: A Paradigm Shift Ramesh Rangachar Intelsat t 2012 by Intelsat. t Published by The Aerospace Corporation with permission. New 2007 Template - 1 Motivation for the
More informationData Mining in the Swamp
WHITE PAPER Page 1 of 8 Data Mining in the Swamp Taming Unruly Data with Cloud Computing By John Brothers Business Intelligence is all about making better decisions from the data you have. However, all
More informationHadoop Big Data for Processing Data and Performing Workload
Hadoop Big Data for Processing Data and Performing Workload Girish T B 1, Shadik Mohammed Ghouse 2, Dr. B. R. Prasad Babu 3 1 M Tech Student, 2 Assosiate professor, 3 Professor & Head (PG), of Computer
More informationPerformance Analysis of Hadoop for Query Processing
211 Workshops of International Conference on Advanced Information Networking and Applications Performance Analysis of Hadoop for Query Processing Tomasz Wiktor Wlodarczyk, Yi Han, Chunming Rong Department
More informationOracle Big Data SQL Technical Update
Oracle Big Data SQL Technical Update Jean-Pierre Dijcks Oracle Redwood City, CA, USA Keywords: Big Data, Hadoop, NoSQL Databases, Relational Databases, SQL, Security, Performance Introduction This technical
More informationWhere We Are. References. Cloud Computing. Levels of Service. Cloud Computing History. Introduction to Data Management CSE 344
Where We Are Introduction to Data Management CSE 344 Lecture 25: DBMS-as-a-service and NoSQL We learned quite a bit about data management see course calendar Three topics left: DBMS-as-a-service and NoSQL
More informationA Brief Introduction to Apache Tez
A Brief Introduction to Apache Tez Introduction It is a fact that data is basically the new currency of the modern business world. Companies that effectively maximize the value of their data (extract value
More informationMap Reduce & Hadoop Recommended Text:
Big Data Map Reduce & Hadoop Recommended Text:! Large datasets are becoming more common The New York Stock Exchange generates about one terabyte of new trade data per day. Facebook hosts approximately
More informationIntroduction to Hadoop. New York Oracle User Group Vikas Sawhney
Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop
More informationHow To Scale Out Of A Nosql Database
Firebird meets NoSQL (Apache HBase) Case Study Firebird Conference 2011 Luxembourg 25.11.2011 26.11.2011 Thomas Steinmaurer DI +43 7236 3343 896 thomas.steinmaurer@scch.at www.scch.at Michael Zwick DI
More informationAccelerating and Simplifying Apache
Accelerating and Simplifying Apache Hadoop with Panasas ActiveStor White paper NOvember 2012 1.888.PANASAS www.panasas.com Executive Overview The technology requirements for big data vary significantly
More informationCOURSE CONTENT Big Data and Hadoop Training
COURSE CONTENT Big Data and Hadoop Training 1. Meet Hadoop Data! Data Storage and Analysis Comparison with Other Systems RDBMS Grid Computing Volunteer Computing A Brief History of Hadoop Apache Hadoop
More informationBig Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park
Big Data: Using ArcGIS with Apache Hadoop Erik Hoel and Mike Park Outline Overview of Hadoop Adding GIS capabilities to Hadoop Integrating Hadoop with ArcGIS Apache Hadoop What is Hadoop? Hadoop is a scalable
More informationLecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop
Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social
More informationMaximizing Hadoop Performance and Storage Capacity with AltraHD TM
Maximizing Hadoop Performance and Storage Capacity with AltraHD TM Executive Summary The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created
More informationKeywords: Big Data, HDFS, Map Reduce, Hadoop
Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning
More informationHadoop IST 734 SS CHUNG
Hadoop IST 734 SS CHUNG Introduction What is Big Data?? Bulk Amount Unstructured Lots of Applications which need to handle huge amount of data (in terms of 500+ TB per day) If a regular machine need to
More informationHadoop & Spark Using Amazon EMR
Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?
More informationMaximizing Hadoop Performance with Hardware Compression
Maximizing Hadoop Performance with Hardware Compression Robert Reiner Director of Marketing Compression and Security Exar Corporation November 2012 1 What is Big? sets whose size is beyond the ability
More informationMatchmaking in the Cloud: Amazon EC2 and Apache Hadoop at eharmony
Matchmaking in the Cloud: Amazon EC2 and Apache Hadoop at eharmony Speaker logo centered below image Steve Kuo, Software Architect Joshua Tuberville, Software Architect Goal > Leverage EC2 and Hadoop to
More informationMapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012
MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte
More informationOpen source Google-style large scale data analysis with Hadoop
Open source Google-style large scale data analysis with Hadoop Ioannis Konstantinou Email: ikons@cslab.ece.ntua.gr Web: http://www.cslab.ntua.gr/~ikons Computing Systems Laboratory School of Electrical
More informationMapReduce Job Processing
April 17, 2012 Background: Hadoop Distributed File System (HDFS) Hadoop requires a Distributed File System (DFS), we utilize the Hadoop Distributed File System (HDFS). Background: Hadoop Distributed File
More informationGraySort on Apache Spark by Databricks
GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner
More informationData-Intensive Programming. Timo Aaltonen Department of Pervasive Computing
Data-Intensive Programming Timo Aaltonen Department of Pervasive Computing Data-Intensive Programming Lecturer: Timo Aaltonen University Lecturer timo.aaltonen@tut.fi Assistants: Henri Terho and Antti
More informationSee Spot Run: Using Spot Instances for MapReduce Workflows
See Spot Run: Using Spot Instances for MapReduce Workflows Navraj Chohan Claris Castillo Mike Spreitzer Malgorzata Steinder Asser Tantawi Chandra Krintz IBM Watson Research Hawthorne, New York Computer
More informationEntering the Zettabyte Age Jeffrey Krone
Entering the Zettabyte Age Jeffrey Krone 1 Kilobyte 1,000 bits/byte. 1 megabyte 1,000,000 1 gigabyte 1,000,000,000 1 terabyte 1,000,000,000,000 1 petabyte 1,000,000,000,000,000 1 exabyte 1,000,000,000,000,000,000
More informationHigh Throughput Sequencing Data Analysis using Cloud Computing
High Throughput Sequencing Data Analysis using Cloud Computing Stéphane Le Crom (stephane.le_crom@upmc.fr) LBD - Université Pierre et Marie Curie (UPMC) Institut de Biologie de l École normale supérieure
More informationArchitectures for Big Data Analytics A database perspective
Architectures for Big Data Analytics A database perspective Fernando Velez Director of Product Management Enterprise Information Management, SAP June 2013 Outline Big Data Analytics Requirements Spectrum
More informationMapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research
MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With
More informationESS event: Big Data in Official Statistics. Antonino Virgillito, Istat
ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web
More informationHiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group
HiBench Introduction Carson Wang (carson.wang@intel.com) Agenda Background Workloads Configurations Benchmark Report Tuning Guide Background WHY Why we need big data benchmarking systems? WHAT What is
More informationBig Data Analytics. with EMC Greenplum and Hadoop. Big Data Analytics. Ofir Manor Pre Sales Technical Architect EMC Greenplum
Big Data Analytics with EMC Greenplum and Hadoop Big Data Analytics with EMC Greenplum and Hadoop Ofir Manor Pre Sales Technical Architect EMC Greenplum 1 Big Data and the Data Warehouse Potential All
More informationIntroduction to Big Data! with Apache Spark" UC#BERKELEY#
Introduction to Big Data! with Apache Spark" UC#BERKELEY# This Lecture" The Big Data Problem" Hardware for Big Data" Distributing Work" Handling Failures and Slow Machines" Map Reduce and Complex Jobs"
More informationUBUNTU DISK IO BENCHMARK TEST RESULTS
UBUNTU DISK IO BENCHMARK TEST RESULTS FOR JOYENT Revision 2 January 5 th, 2010 The IMS Company Scope: This report summarizes the Disk Input Output (IO) benchmark testing performed in December of 2010 for
More informationBig Fast Data Hadoop acceleration with Flash. June 2013
Big Fast Data Hadoop acceleration with Flash June 2013 Agenda The Big Data Problem What is Hadoop Hadoop and Flash The Nytro Solution Test Results The Big Data Problem Big Data Output Facebook Traditional
More informationHigh Performance Computing MapReduce & Hadoop. 17th Apr 2014
High Performance Computing MapReduce & Hadoop 17th Apr 2014 MapReduce Programming model for parallel processing vast amounts of data (TBs/PBs) distributed on commodity clusters Borrows from map() and reduce()
More informationScalable Architecture on Amazon AWS Cloud
Scalable Architecture on Amazon AWS Cloud Kalpak Shah Founder & CEO, Clogeny Technologies kalpak@clogeny.com 1 * http://www.rightscale.com/products/cloud-computing-uses/scalable-website.php 2 Architect
More informationMapReduce Online. Tyson Condie, Neil Conway, Peter Alvaro, Joseph Hellerstein, Khaled Elmeleegy, Russell Sears. Neeraj Ganapathy
MapReduce Online Tyson Condie, Neil Conway, Peter Alvaro, Joseph Hellerstein, Khaled Elmeleegy, Russell Sears Neeraj Ganapathy Outline Hadoop Architecture Pipelined MapReduce Online Aggregation Continuous
More informationHadoop Memory Usage Model
Hadoop Memory Usage Model Lijie Xu xulijie9@otcaix.iscas.ac.cn Technical Report Institute of Software, Chinese Academy of Sciences November 15, 213 Abstract Hadoop MapReduce is a powerful open-source framework
More informationBig Data on AWS. Services Overview. Bernie Nallamotu Principle Solutions Architect
on AWS Services Overview Bernie Nallamotu Principle Solutions Architect \ So what is it? When your data sets become so large that you have to start innovating around how to collect, store, organize, analyze
More informationA Performance Analysis of Distributed Indexing using Terrier
A Performance Analysis of Distributed Indexing using Terrier Amaury Couste Jakub Kozłowski William Martin Indexing Indexing Used by search
More informationAli Ghodsi Head of PM and Engineering Databricks
Making Big Data Simple Ali Ghodsi Head of PM and Engineering Databricks Big Data is Hard: A Big Data Project Tasks Tasks Build a Hadoop cluster Challenges Clusters hard to setup and manage Build a data
More informationFrom GWS to MapReduce: Google s Cloud Technology in the Early Days
Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org MapReduce/Hadoop
More informationHADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM
HADOOP ADMINISTATION AND DEVELOPMENT TRAINING CURRICULUM 1. Introduction 1.1 Big Data Introduction What is Big Data Data Analytics Bigdata Challenges Technologies supported by big data 1.2 Hadoop Introduction
More informationTesting Big data is one of the biggest
Infosys Labs Briefings VOL 11 NO 1 2013 Big Data: Testing Approach to Overcome Quality Challenges By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja Validate data quality by employing
More informationPeers Techno log ies Pv t. L td. HADOOP
Page 1 Peers Techno log ies Pv t. L td. Course Brochure Overview Hadoop is a Open Source from Apache, which provides reliable storage and faster process by using the Hadoop distibution file system and
More informationWorkshop on Hadoop with Big Data
Workshop on Hadoop with Big Data Hadoop? Apache Hadoop is an open source framework for distributed storage and processing of large sets of data on commodity hardware. Hadoop enables businesses to quickly
More informationSTeP-IN SUMMIT 2014. June 2014 at Bangalore, Hyderabad, Pune - INDIA. Performance testing Hadoop based big data analytics solutions
11 th International Conference on Software Testing June 2014 at Bangalore, Hyderabad, Pune - INDIA Performance testing Hadoop based big data analytics solutions by Mustufa Batterywala, Performance Architect,
More informationHadoop: Distributed Data Processing. Amr Awadallah Founder/CTO, Cloudera, Inc. ACM Data Mining SIG Thursday, January 25 th, 2010
Hadoop: Distributed Data Processing Amr Awadallah Founder/CTO, Cloudera, Inc. ACM Data Mining SIG Thursday, January 25 th, 2010 Outline Scaling for Large Data Processing What is Hadoop? HDFS and MapReduce
More informationAnalytics in the Cloud. Peter Sirota, GM Elastic MapReduce
Analytics in the Cloud Peter Sirota, GM Elastic MapReduce Data-Driven Decision Making Data is the new raw material for any business on par with capital, people, and labor. What is Big Data? Terabytes of
More information