Energy Efficient MapReduce
Motivation: Energy consumption is an important aspect of datacenter efficiency. The total power consumed by datacenters in the United States doubled from 2000 to 2005, reaching 1.5% of national power consumption in 2005, and datacenter power consumption grows by roughly 15% yearly. Datacenters, with their long history of innovation, have taken the lead in applying new ideas to reduce energy consumption, developing industry power-efficiency standards such as the power usage effectiveness (PUE) metric. PUE is the ratio of total facility power to total IT equipment power; for example, a PUE of 2 means that for every watt delivered to the servers, another watt is spent elsewhere in the facility. In 2004 typical values were 2 and above, meaning most datacenter power was consumed by the physical infrastructure, such as the cooling and power distribution systems. Recently PUE has improved, approaching 1, which means almost all of the facility's power now reaches the IT equipment itself. Once PUE is near 1, further efficiency gains must come from the IT equipment, through either hardware or software. This paper focuses on software: among the existing paradigms the authors selected MapReduce, and specifically the Hadoop implementation of MapReduce.

Contributions: The authors begin with a brief introduction to MapReduce, then present an analytical framework for energy efficiency, followed by experiments that characterize the energy consumption of Hadoop MapReduce. Finally, they provide quantitative models that can help Hadoop operators and developers identify opportunities to improve energy consumption.
Summary: MapReduce

MapReduce was developed at Google as a platform for parallel processing of large datasets, and its attractive qualities have made it very popular. MapReduce can run on clusters of cheap commodity machines and is highly scalable, processing petabytes of data on thousands of machines. Its jobs generate a mixture of CPU, disk, and network activity. The framework takes care of retrieving data, partitioning input data, outputting data, scheduling parallel execution, coordinating network communication, and handling machine failures. The programming model is very simple: the user defines just two functions, Map and Reduce.

Analytical Framework

Energy-efficiency work should be guided by the following considerations: the desired performance metrics, the available system parameters, the appropriate workload, and the method of measuring energy.

1) Metrics: First of all, it is necessary to distinguish between energy and power: power is the rate of energy consumption, and energy is the product of power and time. Another crucial metric is energy per unit of useful work, which lets us talk about energy consumption independent of workload size; the challenge for this metric is identifying the unit of useful work, although for workloads of the same type the appropriate unit is obvious, such as joules per sorted record for a sort workload. A third metric is work rate per watt, which measures power efficiency independent of workload size. The paper suggests using a bundle of metrics, starting with those that are easiest to collect: job duration, energy, power, and energy per unit of work, such as records sorted per joule.
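As a concrete illustration of this bundle of metrics, the sketch below derives energy and records sorted per joule from a measured job duration and average power, using the relation energy = power x time. All numeric values are hypothetical, not measurements from the paper.

    // Sketch: computing the paper's "bundle of metrics" for one sort job.
    // All numbers are hypothetical; in practice duration and power come from
    // wall-socket measurements of the master and all workers.
    public class MetricsBundle {
        public static void main(String[] args) {
            double durationSeconds = 600.0;       // measured job duration
            double averagePowerWatts = 2_500.0;   // summed across master + workers
            long sortedRecords = 100_000_000L;    // workload size (unit of useful work)

            double energyJoules = averagePowerWatts * durationSeconds; // E = P * t
            double recordsPerJoule = sortedRecords / energyJoules;     // work per unit energy

            System.out.printf("duration: %.0f s%n", durationSeconds);
            System.out.printf("energy:   %.0f J%n", energyJoules);
            System.out.printf("power:    %.0f W%n", averagePowerWatts);
            System.out.printf("sorted records per joule: %.1f%n", recordsPerJoule);
        }
    }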
2) System parameters: For Hadoop/MapReduce there are three types of parameters that could be optimized:

1) Static parameters: such as the number of workers in the Hadoop cluster, the number of HDFS data nodes, the HDFS block size, and the HDFS replication factor.
2) Dynamic parameters: such as the number of map and reduce tasks, HDFS block placement schemes, and the scheduling algorithm.
3) Hardware parameters: understanding software-hardware interaction is vital to understanding software energy efficiency; the most important parameters are the CPU, the speed of the various IO paths (including memory and network), and the size of memory.

This paper focuses only on the static parameters.

3) Appropriate workloads: A good workload should be representative of actual production workloads and should exercise all parts of the system: the major IO paths, common aspects of the various scheduling and placement algorithms, and a range of IO/CPU mixes. It is not possible to find a single common workload that meets all these criteria, but MapReduce is a good starting point.

4) Energy measurement: In measuring software energy consumption it is important to identify the system boundary; the convenient boundary is the wall power socket. In this paper, energy is measured for the MapReduce master and all workers.

Experiments and Results

The goal of the experiments in this paper is to characterize Hadoop's energy performance from different perspectives. The experiments can be grouped into the following categories.

1) Measuring the scalability of sort and the Nutch web crawler: The goal of these experiments is to investigate the effect of increasing the number of workers on the performance of a MapReduce job.
The results show that the sort workload was designed to scale to more workers, but Nutch does not scale with the number of workers: a large number of workers brings no benefit in job duration, energy, or power. The paper explains this result by noting that the input dataset is not large enough to take advantage of the full parallelism available to Nutch.

2) Artificial workloads that stress different data paths: The experiments in the first group treated a MapReduce job as a single unit; the goal of the second group is to isolate each part of a Hadoop MapReduce job. The three major parts of a MapReduce job are HDFS read, shuffle, and HDFS write. The most surprising result is that increasing the number of workers does not decrease the job duration of HDFS write, and consumes more energy, while read and shuffle run faster with more workers at relatively constant energy. In other words, HDFS write does not scale at the same rate as read and shuffle, because of the replication process used by HDFS; this motivated a further experiment on the relation between performance and the replication factor.

3) Examining the effect of static Hadoop configuration parameters: These parameters are HDFS replication, HDFS block size, and input size; the results are summarized briefly here.

HDFS replication: These experiments measure the performance of a MapReduce job while increasing the replication level. The results suggest that replication levels above 3 are not worthwhile, because they impose storage overhead with negligible additional benefit in failure tolerance.

HDFS block size: These experiments investigate the relation between MapReduce job performance and the HDFS block size. Based on the results, the authors suggest increasing the block size in proportion to the workload size.

Input size: The Nutch results raised the suspicion that workload size may affect energy performance, so the authors investigated the relation between job performance and input size by repeating the same MapReduce jobs (sort, read, shuffle, and write) while varying the input size. The most interesting results come from the experiment measuring records sorted per joule: for small inputs, records sorted per joule can be increased by decreasing the number of workers, because the overhead of coordinating workers and distributing data across them becomes the bottleneck; for large inputs, records sorted per joule remains roughly constant across worker counts, meaning the overhead of coordinating many workers is balanced by the increased parallelism.
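The static parameters examined above are ordinary Hadoop configuration settings. As a rough sketch of how one might vary them per experiment, the snippet below uses the classic Hadoop 1.x property keys; newer releases rename some of them (e.g. dfs.blocksize), so treat the key names as illustrative assumptions.

    import org.apache.hadoop.conf.Configuration;

    // Sketch: setting the static HDFS parameters studied in the paper.
    // Property names follow classic Hadoop 1.x keys and may differ in
    // newer releases.
    public class StaticParams {
        public static Configuration experimentConfig(int replication, long blockSizeBytes) {
            Configuration conf = new Configuration();
            conf.setInt("dfs.replication", replication);     // HDFS replication factor
            conf.setLong("dfs.block.size", blockSizeBytes);  // HDFS block size
            return conf;
        }

        public static void main(String[] args) {
            // e.g. replication 3 (the level beyond which the paper saw no benefit)
            // and a 128 MB block, scaled up with workload size as recommended.
            Configuration conf = experimentConfig(3, 128L * 1024 * 1024);
            System.out.println(conf.get("dfs.replication"));
        }
    }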
The most interesting result of these experiments, especially the HDFS replication experiments, is that power remains nearly identical across different replication levels for all jobs, which means that any energy saving must come from decreasing job duration. Indeed, there is a strong correlation between the shapes of the duration and energy graphs.

Quantitative Models

1) Predicting energy consumption: A good way to predict the performance of a sort job is to add up the performance of the HDFS read, shuffle, and write phases for the same workload and cluster size, so the authors use the following model to predict job duration, energy, and power:

    t_sort ≈ t_read + t_shuffle + t_write
    E_sort ≈ E_read + E_shuffle + E_write
    P_sort ≈ average(P_read, P_shuffle, P_write)

The same model applies to other IO-bound MapReduce jobs. The model is not exact: in some cases there is a significant difference between predicted and measured values. It sometimes over-predicts, since it assumes no overlap between the IO phases, which is not realistic, and sometimes under-predicts, because of the coordination overhead in MapReduce.

2) Provisioning machines: It is important to have a model that helps Hadoop operators select how many machines to use in the cluster; the same kind of model can also help the Hadoop task scheduler assign machines to a particular job. For this, the authors propose the following expected energy function:

    E(N) = D(N) * P_a(N) + (T - D(N)) * P_i

where:

    N: number of workers
    D(N): job duration, a function of N
    P_a(N): power when active, a function of N
    P_i: power when idle, near constant
    T: average job arrival interval
    E(N): energy consumed per job, a function of N

With this equation we can minimize E(N) over the range of N for which D(N) is less than or equal to T, selecting the number of workers that gives the lowest energy consumption.
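A minimal sketch of using this model to provision workers follows. The duration and active-power functions below are hypothetical placeholders; in practice D(N) and P_a(N) would be fitted from measurements like those in the paper's scalability experiments.

    import java.util.function.IntToDoubleFunction;

    // Sketch: choosing the number of workers N that minimizes the expected
    // energy per job, E(N) = D(N)*Pa(N) + (T - D(N))*Pi, subject to D(N) <= T.
    // D(N) and Pa(N) are hypothetical placeholders, not values from the paper.
    public class Provisioning {
        public static void main(String[] args) {
            double T = 3_600.0;   // average job arrival interval (s)
            double Pi = 1_000.0;  // idle power of the cluster (W), near constant

            IntToDoubleFunction duration = n -> 20_000.0 / n + 120.0; // hypothetical D(N)
            IntToDoubleFunction activePower = n -> 150.0 * n;         // hypothetical Pa(N)

            int bestN = -1;
            double bestE = Double.MAX_VALUE;
            for (int n = 1; n <= 100; n++) {
                double d = duration.applyAsDouble(n);
                if (d > T) continue;  // job must finish before the next one arrives
                double e = d * activePower.applyAsDouble(n) + (T - d) * Pi;
                if (e < bestE) { bestE = e; bestN = n; }
            }
            System.out.printf("best N = %d, E(N) = %.0f J per job%n", bestN, bestE);
        }
    }

With these placeholder functions the search captures the tradeoff in the model: more workers shorten the job but raise active power, while fewer workers leave the cluster active longer.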
3) Replication and reliability: For this model the authors distinguish two scenarios. In the first, Hadoop is deployed on top of a reliable storage system: the input data lives in the reliable store, intermediate data is read from and written to HDFS during the computation, and the output is written back to the reliable store once the job finishes. In this scenario the probability of data loss is low, so it always makes sense to operate at 1-fold HDFS replication. In the second scenario, HDFS is the only storage system, so durability takes priority over performance; when per-machine failure probabilities are constant, the cluster size must decrease by orders of magnitude before 2-fold replication offers the same fault resilience as 3-fold replication. The paper presents equations that help Hadoop operators select the replication factor, taking into account the failure probabilities and the number of machines in the cluster.

4) Relation between energy efficiency and time efficiency: The authors develop a simple quantitative model to answer the question of whether reducing job duration is equivalent to reducing the energy consumed; they were motivated by the correlation between job duration and energy observed in the HDFS replication experiments. The key result of their analysis is that if the work rate is proportional to system utilization, we should run our workloads as fast as possible.
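That key result can be recovered with a short derivation (a sketch under the stated assumptions, not the paper's own algebra). Assume a linear power model and a work rate proportional to utilization u:

    P(u) = P_i + (P_a - P_i) * u      (power at utilization u; P_i idle, P_a peak)
    W(u) = c * u                      (work rate proportional to utilization)

    energy per unit of work = P(u) / W(u) = (P_a - P_i) / c + P_i / (c * u)

Since P_i > 0, energy per unit of work strictly decreases as u grows, so it is minimized at full utilization u = 1: running the workload as fast as possible is then also the most energy-efficient choice.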
Recommendations and Conclusions

Recommendations for Hadoop operators: Measure the duration, energy, and power of MapReduce jobs; select an appropriate number of workers; choose the HDFS block size well; and base the HDFS replication level on whether Hadoop is deployed on reliable storage or HDFS is the primary storage system.

Recommendations for Hadoop/MapReduce designers: Improve the HDFS replication process and data locality, and investigate ways to find slow nodes in the cluster.

Recommendations for Hadoop/MapReduce energy-efficiency researchers: Run experiments on clusters of different machines, use different workloads, and study the effect of other parameters such as the number of parallel reduce tasks, the sort memory limit, and the file buffer size.

Conclusions: 1) Consider energy as another performance metric in the design space for computer systems and datacenters. 2) Develop a Hadoop/MapReduce benchmark that uses a bundle of performance metrics. 3) The authors conclude with the thought that energy efficiency and traditional time efficiency are fundamentally different performance metrics for the same thing.

Pros & Cons

Pros: 1) The paper sheds light on many aspects that anyone who wants to run a MapReduce job needs to consider. 2) It closes with concrete recommendations for Hadoop operators based on the experiments in the paper. 3) The work differs from others by focusing on improving the software energy efficiency of MapReduce.
Cons: 1) At the beginning of the paper the authors claim they will offer a framework for analyzing energy efficiency in general and MapReduce energy efficiency in particular, but they only address the energy efficiency of MapReduce. 2) It would be better to focus on the dynamic parameters of the Hadoop/MapReduce framework to improve energy efficiency, rather than concentrating on static parameters. 3) The results of some experiments are misleading, such as the experiments on the effect of the HDFS replication factor; for example, the behavior at 4-fold replication is not explained. 4) The results of some experiments are neither convincing nor helpful, such as the HDFS block size experiments: the authors suggest selecting the block size in proportion to the workload size, but everyone who works with Hadoop knows this, and the exact relation between input size and HDFS block size remains unclear. 5) Experiments are needed to show the relation between different parameters, such as the relation between block size and the number of map or reduce workers.

Presenter: Thaer Samar
Opponent: Souza N. Windiartono