Duke University
1 Herodotos Herodotou, Harold Lim, Fei Dong, Shivnath Babu Duke University
2 Practitioners of Big Data Analytics
- Practitioners: Google, Yahoo!, Facebook, eBay, physicists, biologists, economists, journalists, systems researchers
- Workloads: counts & aggregates; rollups & drilldowns; statistical analysis, linear algebra; machine learning; text / images / video / graphs
[Figure: practitioners and workloads placed along two axes, Data Size and Workload Complexity]
6/29/2011 Starfish 2
3 MapReduce/Hadoop Ecosystem: Java / Ruby / Python Client, Pig, Hive, Jaql, Oozie, Elastic MapReduce, Hadoop (MapReduce Execution Engine, Distributed File System), HBase
4 MADDER Principles of Big Data Analytics: Magnetic, Agile, Deep, Data-lifecycle-aware, Elastic, Robust
5 MADDER Principles of Big Data Analytics. Magnetic: Easy to get data into the system
6 MADDER Principles of Big Data Analytics. Agile: Make change (data/requirements) easy
7 MADDER Principles of Big Data Analytics. Deep: Support the full spectrum of analytics. Write MapReduce programs in Java / Python / R or use interfaces like Pig / Jaql
8 MADDER Principles of Big Data Analytics. Data-lifecycle-aware: Data cycle at LinkedIn
9 MADDER Principles of Big Data Analytics. Elastic: Adapt resources/costs to actual workloads
10 MADDER Principles of Big Data Analytics. Robust: Graceful degradation under undesirable events
11 Ease-of-Use vs. Out-of-the-Box Performance
- MADDER (Magnetic, Agile, Deep, Data-lifecycle-aware, Elastic, Robust): ease of use
- Data can be opaque until run-time
- Programs are a different beast from SQL
- Policies are nontrivial
- Hard to get good performance out of the box
12 Starfish: MADDER + Self-Tuning
- MADDER (Magnetic, Agile, Deep, Data-lifecycle-aware, Elastic, Robust): ease of use
- Techniques: profile, recursive random search, dynamic instrumentation, mix of models & simulation, relative what-if calls
- Get good performance automatically
13 Starfish: MADDER + Self-Tuning. Goal: Provide good performance automatically
[Figure: software stack with labels Java Client, Pig, Hive, Oozie, Elastic MR, Analytics System, Starfish, Hadoop MapReduce Execution Engine, Distributed File System]
14 What are the Tuning Problems?
- Job-level MapReduce configuration
- Cluster sizing
- Data layout tuning
- Workflow optimization
- Workload management
15 Starfish Architecture
- Workload-level tuning: Workload Optimizer, Elastisizer
- Workflow-level tuning: Workflow Optimizer, What-if Engine
- Job-level tuning: Job Optimizer, Profiler, Sampler
- Data Manager: Metadata Mgr., Intermediate Data Mgr., Data Layout & Storage Mgr.
17 MapReduce Job Execution job j = < program p, data d, resources r, configuration c > Map function Reduce function Input Splits Run this program as a MapReduce job
18 MapReduce Job Execution job j = < program p, data d, resources r, configuration c > Input Splits Map Wave 1 Map Wave 2 Reduce Wave 1 Reduce Wave 2 How are the number of splits, number of map and reduce tasks, memory allocation to tasks, etc., determined?
19 Optimizing MapReduce Job Execution
job j = < program p, data d, resources r, configuration c >
The space of configuration choices includes settings for:
- Number of map tasks
- Number of reduce tasks
- Partitioning of map outputs to reduce tasks
- Memory allocation to task-level buffers
- Multiphase external sorting in the tasks
- Whether output data from tasks should be compressed
- Whether a combine function should be used
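As a rough illustration of what one point in this configuration space looks like, the sketch below renders a handful of job-level settings as `-D key=value` arguments. The helper `to_hadoop_args` is hypothetical, the values are arbitrary, and the parameter names are assumed to follow Hadoop 0.20/1.x naming; they are not taken from the slides.

```python
# Hypothetical helper (not part of Hadoop or Starfish): render a job-level
# configuration as generic "-D key=value" arguments.
def to_hadoop_args(conf):
    args = []
    for key in sorted(conf):              # sorted for a stable, readable order
        value = conf[key]
        if isinstance(value, bool):       # Hadoop expects lowercase booleans
            value = "true" if value else "false"
        args.append("-D")
        args.append(f"{key}={value}")
    return args


# Assumed Hadoop 0.20/1.x parameter names; values are illustrative only.
job_conf = {
    "mapred.reduce.tasks": 40,            # number of reduce tasks
    "io.sort.mb": 200,                    # map-side sort buffer size (MB)
    "io.sort.factor": 10,                 # streams merged at once while sorting
    "mapred.compress.map.output": True,   # compress intermediate map output
}

print(" ".join(to_hadoop_args(job_conf)))
```

Each dictionary like `job_conf` is one candidate configuration c; tuning amounts to choosing among many such candidates.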
20 Optimizing MapReduce Job Execution: 190+ parameters in Hadoop
[Figure: 2-dim projection of a 13-dim surface]
21 Case for Automated Hadoop Tuning (from wiki.apache.org/pig/pigjournal)
Hadoop has many configuration parameters that can affect the latency and scalability of a job. For different types of jobs, different configurations will yield optimal results. For example, a job with no memory-intensive operations in the map phase but with a combine phase will want to set Hadoop's io.sort.mb quite high, to minimize the number of spills from the map. Adding this feature will greatly increase the utility of Pig for Hadoop users, as it will free them from needing to understand Hadoop well enough to tune it themselves for their particular jobs.
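The link between io.sort.mb and the number of map-side spills can be made concrete with a back-of-the-envelope estimate. The function below is a deliberately simplified sketch: it assumes a spill is triggered once the buffer fills to a threshold percentage (80 is used as the default here) and ignores record metadata and compression. `estimate_spills` is a hypothetical name, not a Hadoop or Starfish function.

```python
import math


def estimate_spills(map_output_mb, io_sort_mb, spill_percent=80):
    """Rough number of map-side spills for one map task.

    Simplifying assumption: the task spills every time the sort buffer
    reaches spill_percent of io_sort_mb, and every spill is the same size.
    """
    usable_mb_times_100 = io_sort_mb * spill_percent
    return max(1, math.ceil(map_output_mb * 100 / usable_mb_times_100))


# A map task producing 400 MB of output, with three buffer sizes.
for buffer_mb in (100, 200, 500):
    print(buffer_mb, estimate_spills(400, buffer_mb))
```

Under these assumptions a larger buffer yields fewer, larger spills, which is the effect the quoted passage describes.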
22 Starfish's Core Approach to Tuning
For program p, data d, resources r, and configuration c: $perf = F(p, d, r, c)$ and $c_{opt} = \arg\min_{c \in S} F(p, d, r, c)$
Challenges: p is an arbitrary MapReduce program; c is high-dimensional;
- Profiler: Runs MapReduce jobs to collect job profiles (concise execution summary)
- What-if Engine: Given profile of j = <p, d, r, c>, estimates virtual profile for job j' = <p, d, r, c'>
- Optimizer(s): Enumerates and searches through the optimization space efficiently
23 Profile of a MapReduce Job
- Concise representation of program execution
- Records information at the level of task phases
[Figure: splits 0 to 3 processed by map tasks in two map waves, followed by one reduce wave producing out 0 and out 1]
24 Profile of a MapReduce Job
- Map task phases: Read, Map, Collect, Spill, Merge
[Figure: map task pipeline with labels split 0, DFS, Map func, Serialize, Partition, Memory Buffer, Sort, [Combine], [Compress], Merge]
25 Profile of a MapReduce Job
- Dataflow: Amount of data flowing through tasks & task phases
- Dataflow Statistics: Statistical info about the dataflow
- Cost: Execution time at level of tasks & task phases
- Cost Statistics: Statistical info about the costs
26 Fields in a Job Profile
Dataflow
- Map output bytes
- Number of map-side spills
- Number of merge rounds
- Number of records in buffer per spill
Cost
- Read phase time in the map task
- Map phase time in the map task
- Collect phase time in the map task
- Spill phase time in the map task
Dataflow Statistics
- Map func's selectivity (output / input)
- Map output compression ratio
- Combiner's selectivity
- Size of records (keys and values)
Cost Statistics
- I/O cost for reading from local disk per byte
- I/O cost for writing to HDFS per byte
- CPU cost for executing Map func per record
- CPU cost for uncompressing the input per byte
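One way to picture a job profile is as a record with the four groups of fields listed above. The sketch below is only an illustration: the class `JobProfile`, the helper `with_map_selectivity`, the field keys, and all numbers are hypothetical and do not reflect Starfish's actual profile format.

```python
from dataclasses import dataclass, field


@dataclass
class JobProfile:
    """Hypothetical container mirroring the four profile categories."""
    dataflow: dict = field(default_factory=dict)        # e.g. bytes, spill counts
    cost: dict = field(default_factory=dict)            # e.g. phase times
    dataflow_stats: dict = field(default_factory=dict)  # e.g. selectivities
    cost_stats: dict = field(default_factory=dict)      # e.g. per-byte I/O costs


def with_map_selectivity(profile, map_input_bytes):
    """Derive the map function's selectivity (output / input) from dataflow."""
    out_bytes = profile.dataflow["map_output_bytes"]
    profile.dataflow_stats["map_selectivity"] = out_bytes / map_input_bytes
    return profile


# Made-up measurements for a single job.
profile = JobProfile(
    dataflow={"map_output_bytes": 150, "num_spills": 3},
    cost={"read_phase_s": 2.0, "map_phase_s": 5.5},
)
with_map_selectivity(profile, map_input_bytes=100)
print(profile.dataflow_stats)
```

The split matters because the dataflow and cost fields describe one particular run, while the statistics are the part one would expect to carry over to a different configuration.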
27 Generating Profiles
- Concise representation of program execution
- Records information at the level of task phases
- Generated by Profiler through measurement or by the What-if Engine through estimation
28 Generating Profiles by Measurement
Goals:
- Have zero overhead when off, low overhead when on
- Require no modifications to Hadoop
- Support unmodified MapReduce programs written in Java/Python/Ruby/C++
Approach: Dynamic (on-demand) instrumentation
- Event-condition-action rules are specified (in Java)
- Leads to run-time instrumentation of Hadoop internals
- Monitors task phases of MapReduce job execution
- We currently use BTrace (Hadoop internals are in Java)
29 Generating Profiles by Measurement
[Figure: with profiling enabled, ECA rules are applied to the task JVMs; raw data from map and reduce tasks yields map profiles and reduce profiles, which make up the job profile]
- Use of sampling: profile fewer tasks; execute fewer tasks
- JVM = Java Virtual Machine, ECA = Event-Condition-Action
30 Overhead of Profile Measurement: Word Co-occurrence job running on a 16-node cluster of c1.medium EC2 nodes on the Amazon Cloud
31 What-if Engine
- Input: job profile for <p, d1, r1, c1>
- Input: properties of the (possibly hypothetical) job: input data properties <d2>, cluster resources <r2>, configuration settings <c2>
- Components: Job Oracle, Task Scheduler Simulator
- Output: virtual job profile for <p, d2, r2, c2>
32 What-if Questions Starfish can Answer
- How will job j's execution time change if the number of reduce tasks is changed from 20 to 40?
- What will the change in I/O be if map o/p compression is turned on, but the input data size increases by 40%?
- What will job j's new execution time be if 5 more nodes are added to the cluster, bringing the total to 20?
- How will workload execution time & dollar cost change if we move the production cluster from m1.xlarge nodes to c1.medium nodes on Amazon EC2?
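To see why the answer to the first question is not obvious, consider a toy model in which tasks run in waves over a fixed number of slots. This is not the What-if Engine's model; it is a minimal sketch under the assumptions that the work divides evenly across tasks and that per-task overheads are zero. The function `phase_time` and the figure of 3200 seconds of total reduce work are hypothetical; the 32 slots correspond to 16 nodes with 2 reduce slots each.

```python
import math


def phase_time(num_tasks, slots, total_work_s):
    """Toy estimate of a phase's running time.

    Assumes total_work_s seconds of work split evenly across num_tasks,
    with tasks scheduled in waves of at most `slots` concurrent tasks.
    """
    waves = math.ceil(num_tasks / slots)
    per_task_s = total_work_s / num_tasks
    return waves * per_task_s


# 32 reduce slots, 3200 s of total reduce work (made-up number).
for reducers in (20, 32, 40):
    print(reducers, phase_time(reducers, slots=32, total_work_s=3200))
```

In this toy setting, going from 20 to 40 reducers halves the work per task but adds a second wave, so the estimated time does not change, while 32 reducers (one full wave) would be faster. Answering such questions reliably is what the What-if Engine's models and simulator are for.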
33 Virtual Profile Estimation
Given profile for job j = <p, d1, r1, c1>, estimate profile for job j' = <p, d2, r2, c2>
[Figure: the profile for j (dataflow statistics, cost statistics, dataflow, cost), input data d2, resources r2, and configuration c2 feed cardinality models, relative black-box models, and white-box models, which produce the dataflow statistics, cost statistics, dataflow, and cost of the (virtual) profile for j']
34 Job Optimizer
- Inputs: job profile for <p, d1, r1, c1>, input data properties <d2>, cluster resources <r2>
- Steps: subspace enumeration; recursive random search using what-if calls
- Output: best configuration settings <c_opt> for <p, d2, r2>
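The search step can be sketched as follows. This is a simplified take on recursive random search (sample a region at random, then shrink the region around the best point found and repeat) and omits the restart and re-exploration logic of the full algorithm. The cost function stands in for a what-if call and is entirely made up, as are the parameter names and bounds.

```python
import random


def recursive_random_search(cost_fn, bounds, samples=20, rounds=5,
                            shrink=0.5, seed=0):
    """Simplified recursive random search over a box-shaped space.

    bounds maps a parameter name to (low, high). Each round samples the
    current box uniformly, then shrinks the box around the best point so far.
    """
    rng = random.Random(seed)
    box = dict(bounds)
    best_point, best_cost = None, float("inf")
    for _ in range(rounds):
        for _ in range(samples):
            point = {k: rng.uniform(lo, hi) for k, (lo, hi) in box.items()}
            cost = cost_fn(point)          # stand-in for a what-if call
            if cost < best_cost:
                best_point, best_cost = point, cost
        new_box = {}
        for k, (lo, hi) in box.items():
            half = (hi - lo) * shrink / 2
            g_lo, g_hi = bounds[k]         # never leave the original bounds
            new_box[k] = (max(g_lo, best_point[k] - half),
                          min(g_hi, best_point[k] + half))
        box = new_box
    return best_point, best_cost


def toy_cost(point):
    """Made-up cost surface with its minimum at 40 reducers, 200 MB buffer."""
    return (point["reduce_tasks"] - 40) ** 2 + (point["io_sort_mb"] - 200) ** 2


space = {"reduce_tasks": (1, 100), "io_sort_mb": (50, 300)}
best, cost = recursive_random_search(toy_cost, space)
print(best, cost)
```

Because every evaluation is a model estimate rather than a real job run, a search like this can afford many samples per round.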
35 Experimental Setup
- Hadoop cluster on 16 Amazon EC2 nodes, c1.medium type
- 2 map slots & 2 reduce slots
- 300MB max memory per task
- Cost-Based Job Optimizer vs. Rule-Based Optimizer
36 Experimental Setup: MapReduce programs
Abbr.  MapReduce Program   Domain              Dataset
CO     Word Co-occurrence  NLP                 30GB, Wikipedia
WC     WordCount           Text Analytics      30GB, Wikipedia
TS     TeraSort            Business Analytics  30GB, Teragen
LG     LinkGraph           Graph Processing    10GB, Wikipedia (compressed)
JO     Join                Business Analytics  30GB, TPC-H
37 Job Optimizer Evaluation
[Figure: speedup per MapReduce job (CO, WC, TS, LG, JO) for Default Settings, Rule-Based Optimizer, and Cost-Based Just-in-Time Job Optimizer]
38 Insights from Profiles for WordCount
A: Rule-Based
- Few, large spills
- Combiner gave high data reduction
- Combiner made Mappers CPU bound
B: Cost-Based (2x faster)
- Many, small spills
- Combiner gave smaller data reduction
- Better resource utilization in Mappers
39 Estimates from the What-if Engine
[Figure: true surface vs. estimated surface]
41 Workflow Optimization Space
[Figure: taxonomy of the optimization space with labels Logical, Physical, Vertical Packing, Partition Function Selection, Job-level Configuration, Dataset-level Configuration, Intra-job, Inter-job]
42 Optimizations on TF-IDF Workflow
[Figure: TF-IDF workflow over datasets D0, D1, D2, D4 with jobs J1 to J4 (map tasks M1 to M4, reduce tasks R1 to R3); panels labeled Logical Optimization (jobs grouped as J1, J2 and J3, J4; Partition: {D}, Sort: {D, W}) and Physical Optimization (Reducers = 50, Compress = off, Memory = 400; Reducers = 20, Compress = on, Memory = 300)]
Legend: D = docname, f = frequency, W = word, c = count, t = TF-IDF
44 Multi-objective Cluster Provisioning
- Cloud enables users to provision Hadoop clusters in minutes
- Can avoid the man (system administrator) in the middle
- Pay only for what resources are used
[Figure: running time (min) and cost ($) by EC2 instance type: m1.small, m1.large, m1.xlarge, c1.medium, c1.xlarge]
45 Multi-objective Cluster Provisioning
[Figure: actual vs. predicted running time (min) and cost ($) by EC2 instance type for the target cluster (m1.small, m1.large, m1.xlarge, c1.medium, c1.xlarge); instance type for source cluster: m1.large]
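Given predicted running time and cost per instance type, a multi-objective choice can be as simple as picking the cheapest option that meets a deadline. The sketch below assumes such predictions are already available; the function `cheapest_within_deadline` is hypothetical and the numbers are placeholders, not values from the plots.

```python
def cheapest_within_deadline(predictions, deadline_min):
    """Pick the lowest-cost instance type whose predicted time meets the deadline.

    predictions maps instance type -> (running_time_min, cost_dollars).
    Returns None when no instance type is fast enough.
    """
    feasible = {
        itype: (minutes, dollars)
        for itype, (minutes, dollars) in predictions.items()
        if minutes <= deadline_min
    }
    if not feasible:
        return None
    return min(feasible, key=lambda itype: feasible[itype][1])


# Placeholder predictions (running time in minutes, cost in dollars).
predicted = {
    "m1.small": (900, 6.0),
    "m1.large": (400, 8.0),
    "c1.medium": (300, 5.0),
    "c1.xlarge": (150, 9.0),
}

print(cheapest_within_deadline(predicted, deadline_min=500))
print(cheapest_within_deadline(predicted, deadline_min=200))
```

Tightening the deadline changes the answer, which is why running time and cost have to be predicted together rather than optimized separately.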
46 More Info:
- Job-level MapReduce configuration
- Cluster sizing
- Data layout tuning
- Workflow optimization
- Workload management
