Hadoop Cluster Applications




Hadoop Overview

Data analytics has become a key element of the business decision process over the last decade. Classic reporting on a dataset stored in a database was sufficient until recently, but yesterday's data-gathering and mining techniques are no longer a match for today's volumes of unstructured data and the time demands required to make that data useful. The common limitations for such analysis are the compute and storage resources required to obtain results in a timely manner. While advanced servers and supercomputers can process the data quickly, these solutions are too expensive for applications like online retailers analyzing website visits or small research labs analyzing weather patterns.

Hadoop is an open-source framework for running data-intensive applications on a processing cluster built from commodity server hardware. Some customers use Hadoop clusters to analyze customer search patterns for targeted advertising. Other applications include filtering and indexing of web listings, or facial recognition algorithms that search for specific images in a large image database, to name just a few. The possibilities are almost endless, provided there are sufficient storage, processing, and networking resources.

How Does Hadoop Work?

Hadoop is designed to efficiently distribute large amounts of processing across a set of machines, from a few to over 2,000 servers. Even a modest Hadoop cluster can crunch terabytes of data, and larger clusters handle petabytes. The key steps in analyzing data in a Hadoop framework are:

Step 1: Data Loading and Distribution: The input data is stored in multiple files. The scale of parallelism in a Hadoop job is related to the number of input files: for data in ten files, the computation can be distributed across ten nodes. Therefore, the ability to rapidly process large data sets across compute servers depends on the number of files and on the speed of the network infrastructure used to distribute the data to the compute nodes.
Figure 1: Data Loading Step

The Hadoop scheduler assigns jobs to nodes to process the files. As a job completes, the scheduler assigns the node another job with its corresponding data. The job's data may be on local storage or may reside on another node in the network. Nodes remain idle until they receive the data to process. Therefore, careful planning of the data set distribution and a high-speed data center network both contribute to better performance of the Hadoop processing cluster. By design, the Hadoop Distributed File System (HDFS) typically holds three or more copies of the same data set across nodes, improving the odds that a task can run near its data and reducing idle time. A network designer can manage storage costs by implementing a high-speed switched data network scheme. High-speed network switching can deliver substantial increases in processing performance to Hadoop clusters that span more than one server rack.
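As an illustration, the relationship between input files and parallelism in Step 1 can be sketched in a few lines of Python. This is a simplified model of the assignment decision, not Hadoop scheduler code; the function and node names are hypothetical.

```python
# Simplified model of Step 1: assigning input files to cluster nodes.
# Illustration only, not the Hadoop scheduler; names are hypothetical.

def distribute_files(files, nodes):
    """Assign each input file to a node round-robin. The number of
    input files bounds the parallelism available to the job."""
    assignment = {node: [] for node in nodes}
    for i, f in enumerate(files):
        assignment[nodes[i % len(nodes)]].append(f)
    return assignment

files = ["input-%02d" % i for i in range(10)]
nodes = ["node-%02d" % i for i in range(10)]
placement = distribute_files(files, nodes)
```

With ten files and ten nodes, every node receives exactly one file; with fewer files than nodes, some nodes would sit idle, which is why the file count limits parallelism.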

Steps 2 & 3: Map/Reduce: The first data processing step applies a mapping function to the data loaded during Step 1. The intermediate output of the mapping process is partitioned by a key, and all data with the same key is then moved to the same reducer node. The final processing step applies a reduce function to the intermediate data; the output of the reduce is stored back on disk. Between the map and reduce operations the data is shuffled between nodes: all outputs of the map function with the same key are moved to the same reducer node. At this point the data network is the critical path. Its performance and latency directly impact the shuffle phase of a data set reduction. High-speed, non-blocking network switches ensure that the Hadoop cluster runs at peak efficiency.

Another benefit of a high-speed switched network is the ability to store shuffled data back in HDFS instead of on the reducer node. Assuming sufficient switching capacity, the data center manager can resume or restart Hadoop data reductions, as HDFS holds the intermediate data and knows the next stage of the reduction. Shuffled data stored on a reduction server is effectively lost if the process is suspended.

Figure 2: Map/Reduce Steps

Step 4: Consolidation: After the data has been mapped and reduced, it must be merged for output and reporting. This requires another network operation, as the outputs of the reduce function must be combined from each reducer node onto a single reporting node. Again, switched network performance can improve throughput, especially if the Hadoop cluster is running multiple data reductions. In this instance, high-performance data center switching reduces idle time, further optimizing the performance of the Hadoop cluster.
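The map, shuffle, reduce, and consolidation flow described above can be modeled in plain Python. This is a single-process sketch of the data movement, not Hadoop code; the partitioner shown (a CRC of the key modulo the reducer count) is an illustrative stand-in for Hadoop's default hash partitioner.

```python
import zlib
from collections import defaultdict

# Single-process sketch of the Map/Reduce data flow; illustrative only.

def map_phase(lines):
    """Map step: emit (word, 1) for every word (a word count)."""
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs, n_reducers):
    """Shuffle step: route every (key, value) pair to one reducer
    partition, so all values for a key land on the same reducer."""
    partitions = [defaultdict(list) for _ in range(n_reducers)]
    for key, value in pairs:
        partitions[zlib.crc32(key.encode()) % n_reducers][key].append(value)
    return partitions

def reduce_phase(partition):
    """Reduce step: sum the values collected for each key."""
    return {key: sum(values) for key, values in partition.items()}

# Step 4 (consolidation): merge each reducer's output for reporting.
outputs = shuffle(map_phase(["a b a", "b c"]), n_reducers=2)
report = {}
for part in outputs:
    report.update(reduce_phase(part))
# report == {"a": 2, "b": 2, "c": 1}
```

In a real cluster each partition lives on a different reducer node, so the shuffle and the final consolidation are exactly the phases that cross the network.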

Figure 3: Consolidating Results

Impact of Network Designs on Hadoop Cluster Performance

Typically, communication bandwidth for small-scale clusters, such as those within a single rack, is not a critical issue; nodes can communicate without significant degradation by tuning HDFS parameters. However, Hadoop clusters scale from hundreds of nodes to over ten thousand nodes to analyze some of the largest datasets in the world. Scalability of the cluster is limited by the local resources on a node (processor, memory, storage) as well as by the interconnect bandwidth to all other nodes in the cluster.

Figure 4: Legacy Network Design

Legacy network designs, as shown in Figure 4, lead to congestion and a significant drop in cluster performance as nodes in one rack sit idle while waiting for data from another rack. There is simply insufficient bandwidth between racks for the network operations to complete efficiently. In fact, the problem with these legacy designs is so significant that Hadoop 0.17 introduced rack-aware block replication, which can limit replicas of data to nodes within the same rack. However, there is a significant performance penalty, since idle resources in other racks cannot be used, limiting the overall efficiency of the system.
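The in-rack replication trade-off described above can be sketched as follows. This is a hypothetical model of the placement decision, not HDFS source code; the topology format and function names are invented for illustration.

```python
# Hypothetical sketch of in-rack block replication: all replicas of a
# block stay within the writer's rack, eliminating cross-rack replication
# traffic at the cost of rack-level fault tolerance and of idle
# capacity in other racks.

def place_replicas(writer, topology, n_replicas=3):
    """topology maps rack id -> list of node names. Returns the nodes
    chosen to hold the block's replicas, all in the writer's rack."""
    rack = next(r for r, nodes in topology.items() if writer in nodes)
    others = [n for n in topology[rack] if n != writer]
    return [writer] + others[:n_replicas - 1]

topology = {"rack1": ["n1", "n2", "n3", "n4"], "rack2": ["n5", "n6"]}
replicas = place_replicas("n1", topology)
# replicas == ["n1", "n2", "n3"] -- all in rack1; rack2 stays idle
```

The sketch makes the penalty visible: rack2's nodes never receive replicas, so they can neither absorb load nor protect against the loss of rack1.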

Network Designs Optimized for Hadoop Clusters

A network that is designed for Hadoop applications, rather than for standard enterprise applications, can make a big difference in the performance of the cluster. The key attributes of a good network design are:

Non-Blocking: A non-blocking design for any-to-any communication. All nodes should be able to communicate with each other without running into congestion points.

Low Latency: A low-latency, 2-tier design outperforms legacy 3-tier designs, since latency at each hop is compounded across thousands of transactions within the cluster.

Resiliency: Network node failures must not result in partitioned clusters. Additionally, network performance degradation must be smooth and predictable.

Scale as you grow: Adding nodes or racks to an existing cluster should be seamless and should not require replacing the existing network infrastructure. In essence, networking should be viewed as a building block that can grow incrementally as additional processing nodes are added.

Figure 5: Non-Blocking Network Design

Performance Monitoring

As a Hadoop cluster gets busy, there are peaks and troughs in the use of various resources. With careful monitoring, users can quickly identify further optimizations in the map/reduce functions or the need to add more nodes. Ganglia [1] and Nagios [2] are two open-source tools that monitor the performance of a Hadoop cluster. Ganglia can be used to monitor Hadoop-specific metrics at each individual node. Nagios can be used to monitor all resources of the cluster, including compute, storage, and network I/O resources. Regardless of the tools used, a holistic view of compute and network resource usage is critical to monitoring a Hadoop cluster and maintaining high efficiency.

[1] ganglia.sourceforge.net
[2] www.nagios.org
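For Hadoop releases of this era, node metrics can be exported to Ganglia by editing conf/hadoop-metrics.properties. The fragment below is a minimal sketch under that assumption; the gmond host and port are placeholders that must match the local Ganglia deployment.

```properties
# Send HDFS and MapReduce metrics to a Ganglia gmond collector every
# 10 seconds. "gmond-host:8649" is a placeholder for the local endpoint.
dfs.class=org.apache.hadoop.metrics.ganglia.GangliaContext
dfs.period=10
dfs.servers=gmond-host:8649

mapred.class=org.apache.hadoop.metrics.ganglia.GangliaContext
mapred.period=10
mapred.servers=gmond-host:8649
```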

Summary

Hadoop is a very powerful distributed computing framework that can process a wide range of datasets, from a few gigabytes to petabytes of structured and unstructured data. Hadoop has quickly gained momentum and is now the preferred platform for a variety of scientific and business analytics workloads. While the availability of commodity Linux-based servers makes it feasible to build very large clusters, the network is often the bottleneck, resulting in congestion and less efficient use of the cluster.

Arista offers high-performance 1/10 GbE non-blocking, ultra-low-latency solutions that can scale from a few racks to some of the largest Hadoop deployments. In addition, Multi-Chassis LAG (MLAG) offers true active/active uplink connectivity from each rack, allowing the full bisectional bandwidth of the network to be utilized in a flat Layer 2 network. Arista's Extensible Operating System can easily be integrated with Ganglia and Nagios. Lastly, Arista's networking solutions offer true flat-line growth in price per unit of server-interconnect bandwidth. All of the above factors make Arista's networking solutions ideal for any Hadoop deployment.

Information in this document is provided in connection with Arista Networks products. For more information, visit us at http://www.aristanetworks.com, or contact us at sales@aristanetworks.com.