WHITE PAPER

CDH AND BUSINESS CONTINUITY: An overview of the availability, data protection and disaster recovery features in Hadoop

Abstract

Using the sophisticated built-in capabilities of CDH for tunable data access, high availability and flexibility, enterprises can architect business continuity plans and reliably deploy Hadoop-based infrastructures that meet or exceed their requirements for data availability.
Table of Contents

Introduction
The Foundation of BCP
Key Business Continuity Features in CDH
Applying BCP to Critical Workloads
Applying BCP to Data Center Failures
Conclusion
About Cloudera
Introduction

For data-driven organizations, Hadoop has rapidly become a mission-critical component of their information architectures. In many businesses, Hadoop is the hub through which all data is processed, analyzed and turned into action, often in real time without human intervention. When so many critical parts of your business depend on Hadoop, it is essential that you have a solid Business Continuity Plan (BCP) in place to make sure you are equipped to keep your business running in the face of unanticipated events.

For many businesses, Hadoop has been a critical piece of their infrastructure for some time. Large web companies such as Facebook, LinkedIn, Twitter and Yahoo! have built their businesses around modern technologies including Hadoop. As a result, robust capabilities for achieving business continuity, including availability, data protection and disaster recovery, have been core components of Hadoop from its inception. This paper provides an overview of these capabilities to give you an idea of how your CDH cluster can meet the stringent recovery time and recovery point objectives (RTO and RPO) that provide the backbone of your BCP.

The Foundation of BCP: Defining Hadoop Workloads and Design Characteristics

Every day, more enterprises are moving CDH deployments into production. As this happens, there is an increased requirement for reliable access to data, high availability and disaster recovery as part of an overall BCP. Most enterprise systems have a number of features that are designed to provide business continuity, and Hadoop is no exception. While these features form a solid foundation, the most important step in developing a successful BCP is to make sure that you are using each system in your environment for the applications and workloads for which it is best suited. Therefore, we'll begin by defining what Hadoop was designed to do and how the system is optimized for performing those tasks. Viewing Hadoop through this lens will help clarify how the various features included in CDH are effective for achieving business continuity.

Hadoop was specifically designed for large-scale data processing and advanced analytics workloads. Unlike relational databases, which allow for in-place updates and require sophisticated concurrency control to achieve multi-tenancy, Hadoop operates as a batch system, taking advantage of modern storage and processing architectures that optimize for throughput over latency. Hadoop allows users to perform mass transformations and exploratory analysis on large, complex data sets. As such, Hadoop implements several simplifying design paradigms that both increase overall system performance and reduce operational headaches:

> Large files are written once and not updated in place, removing the overhead that comes with managing changes to files (see the sketch following this list).
> Workflows, whether for data processing or advanced analytics, generate new files as output, simplifying concurrency control and fan-out processes where multiple data sets are produced from one input.
> Metadata, which has different access patterns, is stored and replicated separately from data, allowing each to use optimal data management algorithms.
> Data is accessed in parallel for each step, to leverage the processing power of many nodes, and sequentially between steps to facilitate incremental algorithms.
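To make the write-once paradigm concrete, the following minimal Java sketch (not part of the original paper; the paths and the trivial transformation are hypothetical) reads an input file from HDFS and writes its results to a new output file using the standard org.apache.hadoop.fs.FileSystem API, rather than modifying the input in place.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class WriteOnceExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path input = new Path("/data/raw/events.txt");              // hypothetical input path
        Path output = new Path("/data/derived/events-upper.txt");   // new output file; input is never modified

        try (BufferedReader reader = new BufferedReader(new InputStreamReader(fs.open(input)));
             FSDataOutputStream out = fs.create(output, false /* do not overwrite */)) {
            String line;
            while ((line = reader.readLine()) != null) {
                // A trivial "transformation": results are written as a new file,
                // leaving the immutable source data untouched.
                out.writeBytes(line.toUpperCase() + "\n");
            }
        }
    }
}

The same pattern holds for MapReduce and other processing frameworks on CDH: every job consumes existing files and produces new ones, which is what makes replication and cross-cluster copying straightforward.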
In other words, Hadoop is designed to handle massive quantities of data that, once written, do not change. Hadoop does not prevent data manipulation (data transformation is one of its primary use cases); rather, those manipulations are written as separate output instead of changing the input files. Based on this philosophy of optimizing for write-once data and fault-tolerant incremental processing, Hadoop is extremely well suited for high-volume data loads with strict SLAs for business continuity. In addition to allowing CDH to crunch massive amounts of data more efficiently, the four design paradigms make business continuity much simpler to implement, as described below.

Key Business Continuity Features in CDH

CDH has a number of key features for availability, data protection and disaster recovery. For the purposes of this paper, we've focused on four:

> Tunable Replication. Each file that is written to the Hadoop Distributed File System (HDFS) is replicated to multiple hosts in the cluster. By default, the replication count, or repcount, is set to three; each file is therefore guaranteed to be replicated to three different hosts in the cluster: two within the same rack and one in a different rack. Administrators can configure the repcount from one up to the total number of nodes in the cluster, depending on their data protection policies (see the sketch following this list). Replication ensures that data is protected in the event of bit rot, disk loss, machine failure or complete rack failure. A higher repcount also improves the performance of MapReduce jobs, since there are more copies of the data to process in parallel.

> Highly Available Metadata. As mentioned in the previous section, Hadoop separates metadata from data. Files are stored in HDFS and all metadata is stored on a dedicated server called the NameNode. CDH provides high availability for the NameNode with manual or automatic failover options. This, combined with file replication in HDFS, ensures that data stored in CDH is continually available for processing and analysis.

> DistCp. DistCp is a tool that leverages the MapReduce framework to copy data between two different CDH clusters. Since DistCp is a MapReduce job, it can be scheduled to run as often as needed to comply with disaster recovery mandates. DistCp supports both Local Area Network (LAN) and Wide Area Network (WAN) replication, allowing you to move data between local clusters and maintain a copy of data on a CDH cluster that is in a different geographical location from your primary cluster. DistCp is typically used to copy processed or derivative data sets between clusters.

> Streaming Framework. CDH includes Apache Flume, a framework for streaming raw data from a source to one or more clusters. Flume provides pluggable data sources and data targets with scalable ingestion and in-flight processing. This flexibility allows organizations to route data to multiple disparate clusters from a wide variety of data sources.

Combinations of these features can be applied to your Hadoop cluster to meet the most stringent mandates for high availability and disaster recovery. The following sections outline how these features apply to each discipline.
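As referenced in the Tunable Replication item above, the following minimal Java sketch (a hypothetical illustration, not taken from the paper; the paths and repcount values are placeholders) uses the Hadoop FileSystem API to raise the replication factor of an existing file and to create a new file with an explicit repcount.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class TunableReplicationExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Raise the repcount of an existing, high-value data set from the
        // default of three to five (hypothetical path and value).
        Path criticalData = new Path("/data/critical/transactions.avro");
        fs.setReplication(criticalData, (short) 5);

        // Create a new file with an explicit repcount of two, e.g. for
        // low-value scratch output that can easily be regenerated.
        Path scratch = new Path("/tmp/scratch/intermediate.out");
        try (FSDataOutputStream out = fs.create(
                scratch,
                true,                                       // overwrite if present
                conf.getInt("io.file.buffer.size", 4096),   // I/O buffer size
                (short) 2,                                  // replication factor
                fs.getDefaultBlockSize(scratch))) {
            out.writeBytes("intermediate results\n");
        }
    }
}

The same adjustment can also be made from the command line with hdfs dfs -setrep, or applied as a client-side default at write time via the dfs.replication property.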
Applying BCP to Critical Workloads: High Availability with Active Failover

Separating metadata from data gives Hadoop a scalable design for achieving high availability and tunable replication without sacrificing performance. Hadoop employs dedicated metadata servers (NameNodes) that persist and replicate all mutations to the file system namespace, such as creating and renaming files, allocating blocks and changing permissions. All metadata changes are persisted by the Active NameNode to multiple, specialized nodes called JournalNodes. This redundancy ensures zero metadata loss so long as at least one copy of the metadata is available.

In order to maintain continuous high availability, the Standby NameNode receives simultaneous block reports from the cluster and reads the shared metadata updates from any of the JournalNodes. By maintaining in-memory parity with the Active NameNode, the Standby is able to take over processing immediately if the Active NameNode fails or otherwise becomes unavailable. If the Active NameNode fails, any open connections or operations return errors to each client. As with any error from the NameNode, the clients then retry, failing over to the Standby NameNode if the Active can no longer be reached.

For maximum high availability, the Active and Standby NameNodes have an automatic failover capability that leverages ZooKeeper for leader election and a pluggable mechanism for resource fencing. When a failure is detected, health monitoring initiates a failover: the previous Active NameNode is quarantined (fenced), and the Standby NameNode immediately takes over exactly where the previously Active NameNode left off. The failover process happens in mere seconds, well within the expected retry time for any Hadoop clients. It is important to remember that the NameNode handles only metadata and does not affect any in-flight data read or write operations, which occur directly with the DataNodes.

[Figure: Achieve high availability for HDFS with redundant NameNodes. All namespace edits are logged by the Active NameNode to the JournalNodes, namespace edits are periodically read by the Standby NameNode from any JournalNode, and DataNodes send block reports to both NameNodes.]
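To illustrate how clients interact with an HA-enabled HDFS, the following Java sketch configures a client to address the logical nameservice rather than an individual NameNode host, so a failover is transparent to the application. The nameservice name, host names and ZooKeeper quorum are hypothetical; the property keys are the standard HDFS HA settings, which in a real deployment normally live in hdfs-site.xml rather than in code.

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HaClientExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        // Logical nameservice; clients never reference a single NameNode host.
        conf.set("dfs.nameservices", "mycluster");                       // hypothetical name
        conf.set("dfs.ha.namenodes.mycluster", "nn1,nn2");
        conf.set("dfs.namenode.rpc-address.mycluster.nn1", "namenode1.example.com:8020");
        conf.set("dfs.namenode.rpc-address.mycluster.nn2", "namenode2.example.com:8020");

        // Proxy provider that retries against the other NameNode on failure.
        conf.set("dfs.client.failover.proxy.provider.mycluster",
                 "org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider");

        // Automatic failover settings (used by the failover controllers on the
        // NameNode hosts; shown here only to name the relevant knobs).
        conf.set("dfs.ha.automatic-failover.enabled", "true");
        conf.set("ha.zookeeper.quorum",
                 "zk1.example.com:2181,zk2.example.com:2181,zk3.example.com:2181");

        // The client addresses the nameservice URI; a NameNode failover is
        // handled by retries inside the HDFS client, not by application code.
        FileSystem fs = FileSystem.get(new URI("hdfs://mycluster"), conf);
        System.out.println("Cluster is reachable: " + fs.exists(new Path("/")));
    }
}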
Applying BCP to Data Center Failures: Disaster Recovery with Tunable RTO/RPO

Modern computing is highly optimized for processing streams of data, whether over the wire or to and from flash or magnetic disk. This optimization led Hadoop's designers to create an architecture where data is loaded into Hadoop in continuous streams, stored as raw bytes, and the resulting files remain fixed until deleted. Any modifications to data are accomplished by processing large batches of files and writing new files with the results. In contrast to transaction processing systems, which benefit from in-place updates, Hadoop's architecture is an optimal design for data processing and advanced analytics.

With the large volume and high value of data loaded into Hadoop, organizations focused on business continuity planning need to be more aggressive with recovery time objectives. This is much easier in Hadoop due to HDFS's built-in redundancy, high availability and replication. Each data set can be tuned to replicate from one copy up to the total number of machines in the cluster in order to guard against failures within a data center. Because data is not updated in place, HDFS eliminates the need for on-site backups and allows for straightforward versioning. Hadoop natively supports high availability and reliability within and across data centers.

Handling data center disasters requires careful planning and balancing between BCP objectives and the effective available bandwidth between data centers. As discussed in the introduction, enterprise BCP is based on two primary objectives for recovery from data failure disasters: the RPO, which is set based on the amount of data that may be lost due to a failure, and the RTO, which measures the time it takes to restore access to data. The effective bandwidth between data centers is the total available bandwidth minus the bandwidth used for other daily operations and for replication by other data management systems. Your network engineering team should be consulted regarding the effective available bandwidth, which may vary depending on the time of day, day of the week and day of the year.

When considering recovery point objectives, an organization must look at two tasks: recovery from source and recovery from processed data. Recovery from source is optimally achieved by loading two clusters simultaneously. Using streaming frameworks such as Apache Flume, raw data can be loaded into two Hadoop clusters, effectively reducing the RPO for raw data to zero and providing flexibility to determine the RTO for processed data. Since a refinement process typically follows collection of raw data, an RPO based on raw data alone allows organizations to tune the RTO for processed data. Complementing a dual-write strategy with periodic cross-data-center transfer of results via DistCp gives enterprises a tunable means of achieving an optimal RTO/cost balance.

The formula for calculating the optimal RTO for processed data is based on the processing time at the backup data center (T), the resulting data set size (S) and the effective available bandwidth (B) between the primary and backup data centers. We can compute the optimal RTO with the following equation:

    RTO = min(T, S/B)

This is read as the minimum of the time to re-process the data and the time to transmit the data that has already been processed.
Note that if the processing time is less than the time to transmit results, there is no value in transmitting results from the primary to the backup data center, so long as the raw data is available in both.
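As a worked example under assumed numbers (not from the paper), suppose re-processing a derived data set at the backup data center takes T = 6 hours, the result set is S = 4 TB, and the effective available bandwidth is B = 1 Gb/s. Transmitting the results would take S/B = (4 x 8 x 10^12 bits) / (10^9 bits/s) = 32,000 s, or roughly 8.9 hours, so RTO = min(6 h, 8.9 h) = 6 hours and re-processing is the faster recovery path. The short Java sketch below captures the same calculation; the helper method and its inputs are hypothetical.

public class RtoCalculator {
    /**
     * Optimal RTO for processed data: the minimum of re-processing time
     * and result-transfer time, i.e. min(T, S/B).
     *
     * @param reprocessSeconds    T: time to re-process at the backup data center, in seconds
     * @param resultBytes         S: size of the processed result set, in bytes
     * @param bandwidthBitsPerSec B: effective available bandwidth between data centers
     */
    static double optimalRtoSeconds(double reprocessSeconds,
                                    double resultBytes,
                                    double bandwidthBitsPerSec) {
        double transferSeconds = (resultBytes * 8) / bandwidthBitsPerSec; // S/B
        return Math.min(reprocessSeconds, transferSeconds);               // min(T, S/B)
    }

    public static void main(String[] args) {
        // Hypothetical inputs: T = 6 hours, S = 4 TB, B = 1 Gb/s.
        double t = 6 * 3600;   // seconds
        double s = 4e12;       // bytes
        double b = 1e9;        // bits per second

        double rto = optimalRtoSeconds(t, s, b);
        System.out.printf("Transfer time: %.1f h, RTO: %.1f h%n",
                (s * 8 / b) / 3600, rto / 3600);
        // Prints roughly: Transfer time: 8.9 h, RTO: 6.0 h
    }
}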
Conclusion

Using the sophisticated built-in capabilities of CDH for tunable data access, high availability and flexibility, enterprises can architect business continuity plans and reliably deploy Hadoop-based infrastructures that meet or exceed their requirements for data availability. Hadoop meets these requirements with a tunable replication factor for each file or directory, high availability for metadata, the flexibility to simultaneously load raw data into two clusters and the ability to selectively copy result data between clusters. Combined, these features make Hadoop an enterprise-class solution for Big Data storage and processing.

About Cloudera

Cloudera, the leader in Apache Hadoop-based software and services, enables data-driven enterprises to easily derive business value from all their structured and unstructured data. As the top contributor to the Apache open source community, and with tens of thousands of nodes under management across customers in financial services, government, telecommunications, media, web, advertising, retail, energy, bioinformatics, pharma/healthcare, university research, oil and gas and gaming, Cloudera's depth of experience and commitment to sharing expertise are unrivaled.

Cloudera provides no representations or warranties regarding the accuracy, reliability, or serviceability of any information or recommendations provided in this publication, or with respect to any results that may be obtained by the use of the information or observance of any recommendations provided herein. The information in this document is distributed AS IS, and the use of this information or the implementation of any recommendations or techniques herein is a customer's responsibility and depends on the customer's ability to evaluate and integrate them into the customer's operational environment.

Cloudera, Inc. | 220 Portage Avenue, Palo Alto, CA 94306 USA | 1-888-789-1488 or 1-650-362-0488 | cloudera.com

© 2012 Cloudera, Inc. All rights reserved. Cloudera and the Cloudera logo are trademarks or registered trademarks of Cloudera Inc. in the USA and other countries. All other trademarks are the property of their respective companies. Information is subject to change without notice.