Solving performance and data protection problems with active-active Hadoop SOLUTIONS BRIEF




Many Hadoop deployments are not realizing their full business potential, with performance [1] and data protection [2] cited by 62% of IT professionals as barriers to moving into full production use. Meanwhile, 70% of Hadoop early adopters are already using multiple siloed installations in separate data centers [3]. Active-active replication turns those siloed installations into a single unified HDFS cluster that provides total data protection and better performance for Hadoop applications.

Barriers to realizing full business value from Hadoop

Let's first consider each of the problem areas in more detail.

Performance at scale

Hadoop deployments typically start small and then see viral adoption as the value of Big Data becomes clear. Rapid adoption and increased load from new applications can lead to serious performance challenges. For example, one national energy services firm found that ingesting the largest table from its legacy ERP system caused severe performance problems for other applications. Likewise, a consumer science company had to place restrictions on new machine learning applications for the same reason, limiting eager data scientists to weekend hours on the production cluster.

As these examples illustrate, resource management in Hadoop remains an unsolved problem, even for those who have already adopted YARN. YARN is designed to allocate resources based on capacity queues or fair division. It was not built for the current generation of mixed-tenant workloads, where applications like Spark require high-memory nodes. Even recent improvements like node labels do not guarantee that the right data is always local to the right nodes.
[1] http://www.wsj.com/articles/the-joys-and-hype-of-software-called-hadoop-1418777627?mod=wsj_hp_editorspicks
[2] http://www.techvibes.com/blog/17-billion-the-annual-cost-of-data-loss-and-downtime-in-canada-2014-12-04
[3] http://wikibon.org/wiki/v/wikibon_big_data_analytics_survey,_2014
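Capacity queues of the kind described above are configured in YARN's capacity-scheduler.xml. A minimal sketch, with hypothetical queue names and illustrative minimum/maximum percentages (not taken from any particular deployment):

```xml
<!-- capacity-scheduler.xml fragment: illustrative only.
     Queue names and percentages are hypothetical examples. -->
<configuration>
  <property>
    <name>yarn.scheduler.capacity.root.queues</name>
    <value>marketing,risk,default</value>
  </property>
  <!-- marketing: guaranteed at least 25% of cluster resources, capped at 40% -->
  <property>
    <name>yarn.scheduler.capacity.root.marketing.capacity</name>
    <value>25</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.marketing.maximum-capacity</name>
    <value>40</value>
  </property>
  <!-- risk analysis: guaranteed at least 50%, capped at 90% -->
  <property>
    <name>yarn.scheduler.capacity.root.risk.capacity</name>
    <value>50</value>
  </property>
  <property>
    <name>yarn.scheduler.capacity.root.risk.maximum-capacity</name>
    <value>90</value>
  </property>
  <!-- remaining guaranteed capacity; sibling capacities must sum to 100 -->
  <property>
    <name>yarn.scheduler.capacity.root.default.capacity</name>
    <value>25</value>
  </property>
</configuration>
```

Note that these guarantees apply to shares of aggregate cluster resources; as discussed below, they say nothing about which physical nodes (or which data) a queue's containers land on.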

Figure 1: YARN scheduling based on capacity queues, granting minimum and maximum resource allocations to different roles (e.g., Marketing: at least 25%, no more than 40%; Risk Analysis: at least 50%, no more than 90%)

Figure 2: YARN has trouble managing mixed hardware profiles and diverse workloads (in the figure, an analytics workload misses the 25% share of high-RAM nodes it needs)

In data processing pipelines such as the Lambda architecture, multiple processing stages run different applications with very different resource profiles, and YARN does not provide ideal resource management in this case. For example, ingest applications like Sqoop can experience performance degradation of up to 81% when running on a cluster that is also loaded with batch processing applications; the batch applications likewise see degradation of as much as 131%. In-memory frameworks like Spark can see an order-of-magnitude performance improvement when run on dedicated high-memory nodes.

Data protection

Hadoop's file system (HDFS) provides redundancy within a single Hadoop installation by distributing data between nodes and racks, but it has no provision for consistent real-time backups. The backup tools used by most distributions rely on DistCp, an asynchronous batch transfer program. As simple performance testing demonstrates, DistCp is a problematic tool when used as a primary backup solution:

- It consumes valuable processing (MapReduce) resources on the production cluster. Some Hadoop administrators report that DistCp prevents other applications from running simultaneously. The problem is exacerbated as the size of a cluster grows, with large deployments able to run DistCp only once every 12 or 24 hours.
- It is a file-based program and fails if a file copy is interrupted or corrupted. Manual intervention is then required.
- There is no guarantee of consistency when DistCp runs, and no automated way to check the consistency of backups after the fact.

Furthermore, backup clusters can only be used for a limited set of read-only operations. DistCp is unable to reconcile changes made at multiple locations, and even read-only MapReduce applications generate intermediate data that must be managed carefully to avoid conflicts with backup jobs. The result is that a significant portion of the investment in hardware and operations is not contributing processing capability, negatively impacting Hadoop's cost efficiency advantage.

Data silos

Most companies end up using multiple Hadoop clusters for one or more of these reasons:

- Maintaining different sets of users and permissions. Hadoop security tools are only now maturing, so in the past it was simpler to isolate data that had different security requirements.
- Lack of holistic planning. Many teams and business units might stand up a new cluster just for experimentation.
- Cost model. Providing individual installations to different business units is a simple way to manage cost allocation.

Maintaining siloed clusters makes sharing data between Hadoop installations difficult. Without appropriate data sharing, data scientists have only a partial view of the information, making roll-up reporting between business units difficult. Since obtaining a complete view of business operations is an important benefit of Hadoop, companies must rely on DistCp-based data transfer tools.
Workflow management tools like Oozie and Falcon are very useful for building complete data pipelines, but in a cross-cluster situation they require Hadoop administrators to build data transfer stages into the pipeline along with verification steps. As noted earlier, DistCp introduces performance and consistency problems that complicate and slow down data pipelines.
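A cross-cluster transfer stage of this kind typically wraps a plain DistCp invocation. A minimal sketch; the hostnames, paths, and mapper count below are hypothetical, not taken from the brief:

```shell
# Periodic DistCp backup stage (illustrative hostnames and paths).
# -update copies only files that have changed since the last run;
# -m caps the number of copy mappers. DistCp itself runs as a
# MapReduce job on the source cluster, so even a capped run competes
# with production workloads for cluster resources.
hadoop distcp -update -m 20 \
  hdfs://prod-nn:8020/data/events/2015-05-01 \
  hdfs://backup-nn:8020/data/events/2015-05-01
```

In an Oozie pipeline this command becomes one action, usually followed by a separate verification step, since DistCp offers no automated consistency check of its own.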

Figure 3: Periodic data transfer using DistCp (data from Hadoop A in Data Center 1 is periodically DistCp'd over a VPN to the data nodes of Hadoop B in Data Center 2)

A single HDFS cluster spanning several Hadoop installations and data centers

Fortunately, there is a solution. WANdisco's active-active replication turns multiple Hadoop silos running in one or more data centers into a unified HDFS cluster with separate processing layers.

Figure 4: Non-Stop Hadoop provides a single HDFS cluster underneath several Hadoop installations at one or more locations; each installation keeps its own applications (MapReduce, Spark, HBase), security and governance, and access layer (YARN) above a shared data layer (Non-Stop Hadoop) linked by the active-active WAN block replicator

Total data protection

Non-Stop Hadoop provides synchronous real-time active-active replication of HDFS metadata. Every Hadoop installation, even at data centers across the WAN, sees a consistent view of the data. In the event of a failure or a network partition, the system heals automatically with no need for manual reconciliation. Non-Stop Hadoop also uses an efficient WAN block replicator to transfer data blocks to other installations without consuming processing (MapReduce) resources. Customer experience shows that even large data ingests are transferred to another data center in minutes with no performance impact on the source Hadoop installation, compared to hours of transfer time and severe performance degradation using DistCp.
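The data-protection difference can be framed as a back-of-the-envelope Recovery Point Objective (RPO) calculation: with periodic batch backup, the worst-case data-loss window is roughly the backup interval plus the transfer time, while continuous replication keeps the window down to the replication lag. A sketch with illustrative numbers (the 24-hour cadence echoes the large-deployment DistCp limit noted earlier; the other figures are assumptions):

```python
def worst_case_rpo_minutes(backup_interval_min, transfer_time_min):
    """Worst-case data-loss window (minutes) for a periodic backup scheme:
    a failure just before a backup completes loses everything written
    since the previous backup began."""
    return backup_interval_min + transfer_time_min

# Periodic DistCp: every 24 hours, with an assumed 2-hour transfer window.
distcp_rpo = worst_case_rpo_minutes(24 * 60, 2 * 60)   # 1560 minutes (26 hours)

# Continuous active-active replication: no batch interval; the window is
# just the replication lag, assumed here to be about 5 minutes.
replicated_rpo = worst_case_rpo_minutes(0, 5)          # 5 minutes

print(distcp_rpo, replicated_rpo)
```

The exact lag and transfer figures will vary by deployment; the point is structural, since the batch interval dominates the RPO of any DistCp-based scheme.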

Figure 5: Non-Stop Hadoop architecture with two data centers separated by a WAN. HDFS writes are coordinated in real time (coordinated metadata replication), followed by asynchronous block replication between the DC1 and DC2 data nodes.

As a result, Non-Stop Hadoop provides a Recovery Point Objective (RPO) of minutes instead of hours or days, and a Recovery Time Objective (RTO) of zero. Other data centers are available for immediate use even if one data center is lost entirely.

Improved performance for applications

Non-Stop Hadoop presents a single HDFS cluster while preserving the independence of the processing layers. As a result, applications can be run in separate installations or zones without any extra data transfer steps. For example, one zone could run critical business applications with rigorous response SLAs, while another zone runs experimental machine learning applications that use in-memory analytics. Meanwhile, other zones in other data centers can handle ingest jobs. Each zone has all the advantages of fast local access to data, making this a more effective approach than YARN's experimental node labels, which do not guarantee that the selected node is the closest to the data.

Figure 6: Non-Stop Hadoop presents a single HDFS cluster with independent processing tiers across zones (e.g., within one region, Zone A runs batch/ingest workloads such as MapReduce, Hive, and Pig under its own YARN instance, while Zone B runs low-latency query workloads such as MapReduce, HBase, and Spark under another, all over shared Non-Stop HDFS)

As noted earlier, ingest applications like Sqoop can see up to a 45% performance improvement when run in a separate zone from batch processing applications, and the batch processing applications may see up to a 57% improvement when isolated from Sqoop. Likewise, Spark applications can see an order-of-magnitude improvement when run on a small zone with dedicated high-memory nodes.

Further, every Hadoop installation is available for full active processing: read-only backup clusters become fully writable processing clusters. As a result, Hadoop deployments effectively double their processing node count and require less hardware to support the same processing requirements.

Breaking down data silos

Each Hadoop installation in a Non-Stop Hadoop deployment uses a single HDFS cluster, even when located across the WAN. This avoids the need for expensive data transfer stages in tools like Oozie or Falcon and provides total data visibility to data scientists.

Overcome performance and data protection problems

Non-Stop Hadoop turns multiple Hadoop data silos into a single HDFS cluster that provides total data protection and improved performance for Hadoop applications. The single HDFS cluster also overcomes data sharing problems while delivering improved utilization of valuable Hadoop processing resources.

Alternative approaches, however, are problematic. Building a larger Hadoop cluster to add processing power magnifies the backup burden to the point where the system's RPO becomes unacceptable. Another option is to rely on the network to move data to processing, discarding Hadoop's natural preference for data locality. This technique is not proven at scale, may prove very difficult in a WAN situation, and of course does not satisfy backup/DR requirements.

Active-active replication is recognized as a vital capability for data protection, and it offers much more than just data safety. A Hadoop cluster built with active-active technology weaves independent Hadoop installations into a unified HDFS cluster that alleviates several barriers to productive Hadoop deployment.

For more information, including architectural white papers, visit http://www.wandisco.com/hadoop.

World Headquarters: 5000 Executive Pkwy, Suite 270, San Ramon, CA 94583
Europe: Electric Works, Sheffield Digital Campus, Sheffield S1 2BJ
Japan: Level 15 Cerulean Tower, 26-1 Sakuragaoka-cho, Shibuya-ku, Tokyo, Japan 150-8512
China: Financial Street Centre, Level 10 South Tower, No.9A Financial Street, XiCheng District, Beijing 100033
US Toll Free: 1-877-WANDISCO (926-3472) | Outside US: +1-925-380-1728 | EU: +44 (0)114 3039985 | APAC: +61 2 8211 0620
Email: sales@wandisco.com