High Availability on MapR




Technical brief

Introduction

High availability (HA) is the ability of a system to remain up and running despite unforeseen failures, avoiding unplanned downtime or service disruption*. HA is a critical feature that businesses rely on to support customer-facing applications and service level agreements.

HA Benefits in the MapR Distribution for Hadoop

Advanced HA features in the MapR Distribution for Hadoop provide numerous benefits to organizations trying to harness big data.

No Data Loss
The MapR Distribution for Hadoop ensures critical data is never lost via configurable levels of replication. Automatic failover ensures the cluster is always available so big data applications can run on a 24x7 basis, helping organizations meet stringent business SLAs.

Dependable Jobs
Jobs started on the MapR Distribution run to completion despite failures of the associated job trackers or resource managers. This greatly improves Hadoop cluster efficiency and resource utilization by avoiding job restarts, especially for long-running MapReduce analytics jobs.

24x7 NoSQL Applications
MapR helps organizations graduate quickly from batch-oriented analytics to operational NoSQL applications on Hadoop by providing instant recovery capabilities and eliminating the downtime associated with NoSQL housekeeping.

Continuous Access to Data
MapR provides unprecedented application and user access to Hadoop via the NFS interface. To ensure continuous, uninterrupted operations, MapR makes the NFS access resilient.

Maintaining Availability during Planned Downtime
Upgrading large clusters often requires service disruptions. MapR provides options to keep clusters available even during planned downtime for maintenance tasks such as software upgrades.

* This tech brief deals with single data center high availability. For information about how MapR provides cross-data-center replication to enable disaster recovery, please visit www.mapr.com.

MapR HA Implementation

The MapR Distribution for Hadoop is the only distribution that is designed for 24x7 environments, providing HA across several critical elements of the Hadoop cluster. MapR provides HA not only for data and job completion, but also for access points and ancillary services running on Hadoop.

Metadata HA
Cluster metadata includes critical information about the location of application data and the associated replicas. Metadata HA is therefore critical for long-running Hadoop operations. MapR provides self-healing from multiple, simultaneous failures, allowing cluster availability at all times. MapR automatically shards and replicates its metadata along with application data, making HA part of the core architecture. This also makes HA extremely easy to implement: it works right out of the box, with no requirement to deploy specialized nodes on specialized hardware and with minimal configuration to set up and monitor. As an added advantage, the distributed metadata architecture allows for extreme scalability, with no practical limit on the number of files that can be stored on Hadoop.

MapReduce HA
MapR is the only distribution that supports fully functional MapReduce HA. Job execution proceeds to completion even if the associated trackers and resource managers go down. In other distributions, hardware failures result in failed jobs, which then have to be completely restarted. This functionality applies to both MapReduce v1 and MapReduce v2 (YARN) jobs.

NFS HA
MapR uniquely provides network-attached storage (NAS) style access to Hadoop through the standard NFS (Network File System) interface. MapR allows you to mount the cluster via NFS and ensures that the NFS mount point is also HA enabled. This ensures continuous, uninterrupted access for incoming streaming data and for applications requiring random read/write.

Instant Recovery for NoSQL Applications
MapR ensures that data from a failed node is automatically and instantly available to the NoSQL application. The automatic and instant failover means there is no reassignment lag time, ensuring uninterrupted availability.
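The idea of sharding metadata and replicating each shard, described under Metadata HA above, can be illustrated with a toy model. This is only a sketch under stated assumptions: the round-robin placement, the replication factor of 3, the node names, and both function names are hypothetical and are not taken from MapR's implementation. It shows why spreading replicas over several nodes lets a cluster ride out more than one simultaneous failure.

```python
from itertools import combinations


def place_shards(num_shards, nodes, replicas=3):
    """Assign each metadata shard to `replicas` distinct nodes, round-robin.

    Hypothetical placement policy, used only for illustration.
    """
    if replicas > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    placement = {}
    for shard in range(num_shards):
        placement[shard] = {
            nodes[(shard + i) % len(nodes)] for i in range(replicas)
        }
    return placement


def shards_available(placement, failed):
    """Return True if every shard still has a replica on a live node."""
    return all(holders - failed for holders in placement.values())


nodes = ["node%d" % i for i in range(6)]
placement = place_shards(num_shards=12, nodes=nodes, replicas=3)

# With three replicas per shard, any two simultaneous node failures
# still leave at least one copy of every shard.
two_failures_ok = all(
    shards_available(placement, set(pair)) for pair in combinations(nodes, 2)
)

# Losing all three holders of shard 0 at once makes that shard unavailable.
three_failures_ok = shards_available(placement, {"node0", "node1", "node2"})

print(two_failures_ok, three_failures_ok)
```

In this model the number of tolerated simultaneous failures is one less than the replication factor, which is why a configurable replication level translates directly into a configurable level of protection.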

Zero NoSQL Maintenance
In the broader objective of minimizing service disruptions, MapR requires zero NoSQL maintenance to further improve availability. Automatic, workload-aware scaling maintains high performance as the data load grows. The simplified architecture means there are no NoSQL servers to administer, thus reducing the number of failure points. And the optimized, compaction-less design prevents disruptive I/O storms and eliminates downtime from performing housekeeping tasks.

Rolling Upgrades
Rolling upgrades also help minimize disruptions. Users can eliminate planned downtime by performing maintenance or software upgrades on the cluster a few nodes at a time while the system continues to run.

Services HA
The MapR model of distributing the metadata can be easily extended to services running on Hadoop. HA can be implemented for any service running on the MapR cluster by configuring the service to store its state information as part of the cluster metadata and by registering the service with ZooKeeper. If the service goes down, the ZooKeeper and Warden services automatically restart it on a different node.

HDFS-Based Distributions and HA

HDFS-based distributions provide minimal HA capabilities. All HDFS-based distributions rely on a single server, known as the NameNode, to store and process metadata. This single-server approach creates performance and scalability bottlenecks, forcing a federated model of data storage that further increases SLA risks by creating multiple points of failure across the system.

More importantly from an HA standpoint, this model requires an Active-Standby implementation that protects against just one failure. If another NameNode-related failure occurs before the failed node is replaced or repaired, data will be lost or corrupted.

Furthermore, the system becomes more complex to set up and configure. Administrators have additional tasks associated with configuring specialized hardware to accommodate the NameNode, which also increases the total cost of ownership. The setup must also ensure continuous sharing of metadata across the Active and Standby nodes, and enable every node in the cluster to maintain a heartbeat connection to both the Active and Standby nodes at all times.
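The claim that an Active-Standby pair protects against only one failure can be made concrete with a small counting model. The sketch below is an illustration under simplifying assumptions, not a description of how any HDFS distribution implements failover: the function name and the event vocabulary ("fail", "repair") are hypothetical, and the model ignores shared storage, fencing, and failover time.

```python
def metadata_service_up(events, copies):
    """Replay fail/repair events against `copies` metadata servers.

    Returns False as soon as no server is left running, True otherwise.
    Hypothetical counting model used only for illustration.
    """
    live = copies
    for event in events:
        if event == "fail":
            live -= 1
        elif event == "repair":
            live = min(copies, live + 1)
        else:
            raise ValueError("unknown event: %r" % event)
        if live == 0:
            return False
    return True


# Active-Standby pair (2 copies): a second failure before repair is fatal.
pair_second_failure = metadata_service_up(["fail", "fail"], copies=2)

# The same pair survives if the first failure is repaired in time.
pair_repaired_in_time = metadata_service_up(["fail", "repair", "fail"], copies=2)

# With three copies, two failures in a row are still survivable.
three_copies = metadata_service_up(["fail", "fail"], copies=3)

print(pair_second_failure, pair_repaired_in_time, three_copies)
```

The window between the first failure and its repair is the exposure period for a two-copy design, which is the risk the paragraph above describes.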

The figure below delineates the differences between the HDFS model and the MapR model of storing metadata.

[Figure: MapR No-NameNode Architecture. Panels: HDFS Federation (with NAS appliance and DataNodes) and MapR (Distributed Metadata).]

HDFS Federation: multiple single points of failure; limited to 50-100 million files; performance bottleneck; commercial NAS required.

MapR (Distributed Metadata): HA with automatic failover; instant cluster restart; up to 1T files (>5000x advantage); 10-20x higher performance; 100% commodity hardware.
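As a quick check on the file-count figures quoted in the comparison above, the ratio between the two limits can be worked out directly. The only assumption here is that "1T" means one trillion (10^12) files; the constant names are introduced for this example.

```python
# File-count limits quoted in the comparison.
MAPR_FILES = 10**12             # "1T" read as one trillion (assumption)
HDFS_FILES_LOW = 50 * 10**6     # lower end of "50-100 million"
HDFS_FILES_HIGH = 100 * 10**6   # upper end of "50-100 million"

ratio_min = MAPR_FILES // HDFS_FILES_HIGH  # against the larger HDFS limit
ratio_max = MAPR_FILES // HDFS_FILES_LOW   # against the smaller HDFS limit

print(ratio_min, ratio_max)  # 10000 20000
```

Under that reading the ratio falls between 10,000x and 20,000x, so the ">5000x" figure in the comparison is consistent with the quoted limits and is the more conservative statement.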

With reference to jobs, because job-related metadata is not stored in HDFS-based distributions today, jobs have to be restarted whenever there is a failure or whenever the resource manager or the job trackers go down.

Furthermore, for NoSQL applications, HDFS-based distributions do not provide any HA capabilities because of complex architectural issues associated with working with an append-only file system. Long-running downtime is one of the common issues associated with these HDFS-based NoSQL applications.

Conclusion

MapR architectural innovations deliver 24x7 big data applications, ensuring high availability for all the critical components of Hadoop, including Hadoop 2.0 features such as YARN. The MapR Distribution for Hadoop provides high availability across nodes, jobs, access methods, and services for both file-based and NoSQL applications in a uniform fashion across the cluster.

© 2014 MapR Technologies, Inc.