Hadoop Optimizations for BigData Analytics
1 Hadoop Optimizations for BigData Analytics
Weikuan Yu, Auburn University
2 Outline (WBDB, Oct 2012)
- Background
- Network Levitated Merge
- JVM-Bypass Shuffling
- Fast Completion Scheduler
3 Emerging Demand for BigData Analytics
- Big demand from many organizations in various domains
- Scalable computing power without worrying about system maintenance
- Ubiquitously accessible computing and storage resources
- Low-cost, highly reliable, trusted computing infrastructure
- Commercial companies are gearing up resources for BigData
4 MapReduce
- A simple data processing model to process big data
- Designed for commodity off-the-shelf hardware components
- Strong merits for big data analytics
  - Scalability: increase throughput by increasing the number of nodes
  - Fault tolerance: quick, low-cost recovery from task failures
- Hadoop, an open-source implementation of MapReduce
  - Widely deployed by many big data companies: AOL, Baidu, eBay, Facebook, IBM, NY Times, Yahoo!
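The processing model on this slide can be illustrated with a minimal in-memory sketch. It is not Hadoop code: the names `map_words`, `reduce_counts`, and `run_mapreduce` are hypothetical, and the grouping step stands in for the distributed shuffle that a real cluster performs across nodes.

```python
from collections import defaultdict
from typing import Callable, Iterable, Iterator


def map_words(line: str) -> Iterator[tuple[str, int]]:
    """Map step: emit a (word, 1) pair for every word in one input record."""
    for word in line.split():
        yield word.lower(), 1


def reduce_counts(key: str, values: list[int]) -> tuple[str, int]:
    """Reduce step: combine all values that share a key."""
    return key, sum(values)


def run_mapreduce(
    records: Iterable[str],
    mapper: Callable[[str], Iterator[tuple[str, int]]],
    reducer: Callable[[str, list[int]], tuple[str, int]],
) -> dict[str, int]:
    """Run map, group by key (a stand-in for the shuffle), then reduce."""
    groups: dict[str, list[int]] = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    # Reducers see keys in sorted order, as in MapReduce.
    return dict(reducer(key, groups[key]) for key in sorted(groups))


result = run_mapreduce(["big data big", "data analytics"], map_words, reduce_counts)
print(result)
```

Because each record is mapped independently and each key is reduced independently, both steps can be spread over more nodes, which is the scalability property the slide refers to.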
5 High-Level Overview of Hadoop
- HDFS and the MapReduce framework
- Data processing with MapTasks and ReduceTasks
- Three main steps of data movement
  - Intermediate data shuffling in MapReduce is time-consuming
[Figure: applications submit jobs to the JobTracker; TaskTrackers run MapTasks and ReduceTasks over HDFS. Labeled data movement steps: 1 and 3 at HDFS, 2 at the shuffle of intermediate data between MapTasks and ReduceTasks.]
6 Outline
- Background
- Network Levitated Merge
- JVM-Bypass Shuffling
- Fast Completion Scheduler
7 Motivation for Network-Levitated Merge
1: Serialization between shuffle/merge and reduce phases
[Figure: timeline of the map, shuffle, merge, and reduce phases from job start, marking the first MOF and the serialization point]
8 Repetitive Merges and Disk Access
- Hadoop data spilling is controlled through parameters
  - To limit the number of outstanding files
- An example with io.sort.factor=3
[Figure: merge steps labeled 1: merge more, 2: insert, 3: merge, 4: to merge soon]
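The cost of repetitive merging can be made concrete with a small model. This is a simplified sketch, not Hadoop's actual merge policy: it assumes that whenever more than `factor` segments are outstanding, the `factor` smallest ones are merged into a new on-disk segment, and it counts how much data gets rewritten. The function name `merge_passes` and the segment sizes are hypothetical.

```python
import heapq


def merge_passes(segment_sizes: list[int], factor: int) -> tuple[int, int]:
    """Return (intermediate merge passes, total size rewritten to disk).

    Simplified model: while more than `factor` segments are outstanding,
    merge the `factor` smallest into one new segment. The final merge that
    feeds the reduce function is not counted.
    """
    if factor < 2:
        raise ValueError("factor must be at least 2")
    heap = list(segment_sizes)
    heapq.heapify(heap)
    passes = 0
    rewritten = 0
    while len(heap) > factor:
        merged = sum(heapq.heappop(heap) for _ in range(factor))
        rewritten += merged          # every merged byte is written again
        passes += 1
        heapq.heappush(heap, merged)
    return passes, rewritten


# Seven segments of 10 MB each, merge factor 3 as in the slide's example.
print(merge_passes([10] * 7, factor=3))
# The same segments with a factor large enough to avoid intermediate merges.
print(merge_passes([10] * 7, factor=10))
```

Under this model, seven 10 MB segments with a factor of 3 need two intermediate passes that rewrite 60 MB before the final merge, while a factor of 10 needs none. The extra passes are the disk traffic that the network-levitated merge aims to avoid.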
9 Hadoop Acceleration (Hadoop-A)
- Pipelined shuffle, merge and reduce
- Network-levitated data merge
[Figure: Hadoop-A architecture. Hadoop (Java): JobTracker, TaskTrackers, MapTask, ReduceTask. Acceleration (C++): MOFSupplier (Data Engine, RDMA Server) and NetMerger (RDMA Client, Fetch Manager, Merging Thread, Merge Manager, Merged Data), connected over RDMA interconnects.]
10 Pipelined Data Shuffle, Merge and Reduce
[Figure: timeline from start showing map, header shuffle and PQ setup, then merge and reduce, with the first MOF and last MOF marked]
11 Network-Levitated Merge Algorithm
[Figure: segments S1, S2, and S3, holding <k1,v1>..., <k2,v2>..., and <k3,v3>..., are merged at a merge point. Panels: (a) Fetching Header, (b) Priority Queue Setup, (c) Concurrent Fetching & Merging, (d) Towards Completion. Merged data begins <k1,v1><k2,v2><k3,v3>, ...]
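The four panels can be sketched as a priority-queue merge that holds only one record per segment in memory. This is an illustrative sketch, not the Hadoop-A implementation: remote segments are modeled as plain Python iterators over sorted (key, value) pairs, pulling from an iterator stands in for a network fetch, and the name `levitated_merge` is hypothetical.

```python
import heapq
from typing import Iterable, Iterator


def levitated_merge(
    segments: list[Iterable[tuple[str, int]]],
) -> Iterator[tuple[str, int]]:
    """Merge sorted segments while holding one record per segment in memory."""
    sources = [iter(segment) for segment in segments]
    heap: list[tuple[str, int, int]] = []

    # (a) Fetching header and (b) priority queue setup: pull only the first
    # record of each segment and order segments by that record's key.
    for index, source in enumerate(sources):
        head = next(source, None)
        if head is not None:
            heapq.heappush(heap, (head[0], index, head[1]))

    # (c) Concurrent fetching and merging, (d) towards completion: emit the
    # smallest key, then fetch the next record from the same segment.
    while heap:
        key, index, value = heapq.heappop(heap)
        yield key, value
        following = next(sources[index], None)
        if following is not None:
            heapq.heappush(heap, (following[0], index, following[1]))


s1 = [("a", 1), ("d", 4)]
s2 = [("b", 2), ("e", 5)]
s3 = [("c", 3), ("f", 6)]
merged = list(levitated_merge([s1, s2, s3]))
print(merged)
```

Since the merge is a generator, a consumer such as the reduce function can start reading merged records before all segments have been fully fetched, and no intermediate merged file is written to disk.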
12 Job Progression with Network-levitated Merge
a) Hadoop-A speeds up the execution time by more than 47%
b) Both MapTasks and ReduceTasks are improved
[Figure: progress (%) vs. time (sec) for Hadoop-A, Hadoop on IPoIB, and Hadoop on GigE. Panels: (a) Map Progress of TeraSort, (b) Reduce Progress of TeraSort]
13 Breakdown of ReduceTask Execution Time (sec)
- Significantly reduced the execution time of ReduceTasks
- Most came from reduced shuffle/merge time
  - An improvement of 2.5 times
- Also improved the time to reduce data
  - An improvement of 15%
[Table: columns Category, PQ-Setup, Shuffle/Merge, Reduce or Merge/Reduce; rows Hadoop-GigE (65.0%, 35.0%), Hadoop-IPoIB (65.9%, 34.1%), Hadoop-A (47.4%, 52.6%). The times in seconds are missing from the transcription.]
14 Outline
- Background
- Network Levitated Merge
- JVM-Bypass Shuffling
- Fast Completion Scheduler
15 JVM-Dependent Intermediate Data Shuffling
- Heavily relies on Java
[Figure: MapTasks produce MOFs (MOF1, MOF2), which TaskTracker HttpServlets stage and serve; ReduceTask MOFCopiers fetch them over TCP/IP only, then sort/merge and reduce, with HDFS underneath. The JobTracker coordinates the TaskTrackers.]
16 JVM-Bypass Shuffling (JBS)
- JBS removes JVM from the critical path of intermediate data shuffling
- JBS is a portable library supporting both TCP/IP and RDMA protocols
[Figure: data analytics applications run on MapTask, TaskTracker, and ReduceTask. Java path: HTTP Servlet and HTTP GET over sockets and TCP/IP on Ethernet. JVM-bypass path (C): MOFSupplier and NetMerger over RDMA Verbs or TCP/IP on InfiniBand/Ethernet.]
17 Benefits of JBS: 1/10 Gigabit Ethernets
- JBS is effective for intermediate data of different sizes
  - Using the Terasort benchmark, size of intermediate data = size of input data
- JBS reduces the execution time by 20.9% on average on 1GigE and 19.3% on average on 10GigE
[Figure: Terasort job execution time (sec) vs. input data size (GB). Panels: (a) 1 Gigabit Ethernet, Hadoop vs. JBS; (b) 10 Gigabit Ethernet, Hadoop vs. JBS]
18 Benefits of JBS: InfiniBand Cluster
- JBS on IPoIB outperforms Hadoop on IPoIB and Hadoop on SDP by 14.1% and 14.8%, respectively
- Hadoop performs similarly when using IPoIB or SDP
[Figure: Terasort job execution time (sec) vs. input data size (GB) for Hadoop on IPoIB, Hadoop on SDP, and JBS on IPoIB]
19 Outline
- Background
- Network Levitated Merge
- JVM-Bypass Shuffling
- Fast Completion Scheduler
20 Hadoop Fair Scheduler
- Scheduler assigns tasks to the TaskTrackers
- Tasks occupy slots until completion or failure
[Figure: occupancy over time of map slots (Slot-M1 to Slot-M5) and reduce slots by jobs J-1, J-2, and J-3, showing shuffle and reduce phases against job arrival time]
21 Fair Completion Scheduler
- Prioritize ReduceTasks based on the shortest remaining map phase
- When remaining map phases are equal, prioritize ReduceTasks of jobs with the least remaining reduce data
- Track the slowdown of preempted ReduceTasks
  - Prevent large jobs from being preempted for too long
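The prioritization rules on this slide can be sketched as a selection function. This is a hedged illustration, not the scheduler's code: the `JobState` fields, the `pick_next` name, and in particular the `max_slowdown` threshold are assumptions, since the slide only says that slowdown is tracked so that large jobs are not preempted for too long.

```python
from dataclasses import dataclass


@dataclass
class JobState:
    name: str
    remaining_map_work: float      # proxy for the remaining map phase
    remaining_reduce_data: float   # reduce data still to be processed
    slowdown: float = 1.0          # tracked slowdown of preempted ReduceTasks


def pick_next(jobs: list[JobState], max_slowdown: float = 3.0) -> JobState:
    """Choose the job whose ReduceTask should get the next free reduce slot.

    `max_slowdown` is a hypothetical cutoff: a job slowed down past it is
    served first so that it is not starved by a stream of smaller jobs.
    """
    starved = [job for job in jobs if job.slowdown >= max_slowdown]
    if starved:
        return max(starved, key=lambda job: job.slowdown)
    # Shortest remaining map phase first; ties go to least remaining reduce data.
    return min(
        jobs,
        key=lambda job: (job.remaining_map_work, job.remaining_reduce_data),
    )


jobs = [
    JobState("small", remaining_map_work=0.0, remaining_reduce_data=5.0),
    JobState("large", remaining_map_work=0.0, remaining_reduce_data=500.0),
    JobState("mid", remaining_map_work=10.0, remaining_reduce_data=1.0),
]
first_choice = pick_next(jobs).name

jobs[1].slowdown = 4.0             # the large job has been preempted too long
second_choice = pick_next(jobs).name
print(first_choice, second_choice)
```

In this example the small job wins the tie on remaining map phase because it has less reduce data left, and the large job is chosen once its tracked slowdown crosses the assumed threshold.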
22 Average ReduceTask Waiting Time
- ReduceTasks in small jobs are significantly sped up
[Figure: average ReduceTask waiting time (sec) per job group, FCS vs. HFS]
23 Conclusions
- Examined the design and architecture of the Hadoop MapReduce framework and revealed critical issues faced by the existing implementation
- Designed and implemented Hadoop-A as an extensible acceleration framework that addresses these issues
- Provided JVM-Bypass Shuffling to avoid JVM overhead, as a portable library that runs on both TCP/IP and RDMA protocols
- Designed and implemented Fast Completion Scheduler for fast job completion and job fairness
24 Sponsors of our research