High Availability on MapR

Technical brief

Introduction

High availability (HA) is the ability of a system to remain up and running despite unforeseen failures, avoiding unplanned downtime or service disruption*. HA is a critical feature that businesses rely on to support customer-facing applications and service level agreements.

HA Benefits in the MapR Distribution for Hadoop

Advanced HA features in the MapR Distribution for Hadoop provide numerous benefits to organizations trying to harness big data.

No Data Loss
The MapR Distribution for Hadoop ensures critical data is not lost by using configurable levels of replication. Automatic failover keeps the cluster available so big data applications can run on a 24x7 basis, helping organizations meet stringent business SLAs.

Dependable Jobs
Jobs started on the MapR Distribution run to completion despite failures of the associated job trackers or resource managers. This improves Hadoop cluster efficiency and resource utilization by avoiding job restarts, especially for long-running MapReduce analytics jobs.

24x7 NoSQL Applications
MapR helps organizations graduate quickly from batch-oriented analytics to operational NoSQL applications on Hadoop by providing instant recovery capabilities and eliminating the downtime associated with NoSQL housekeeping.

Continuous Access to Data
MapR provides application and user access to Hadoop via the NFS interface. To ensure continuous, uninterrupted operations, MapR makes the NFS access resilient.

Maintaining Availability during Planned Downtime
Upgrading large clusters often requires service disruptions. MapR provides options to keep clusters available even during planned downtime for maintenance tasks such as software upgrades.

* This tech brief deals with single data center high availability. For information about how MapR provides cross-data-center replication to enable disaster recovery, please visit www.mapr.com.

MapR HA Implementation

The MapR Distribution for Hadoop is the only distribution that is designed for 24x7 environments, providing HA across several critical elements of the Hadoop cluster. MapR provides HA not only for data and job completion, but also for access points and ancillary services running on Hadoop.

Metadata HA
Cluster metadata includes critical information about the location of application data and the associated replicas. Metadata HA is therefore critical for long-running Hadoop operations. MapR self-heals from multiple, simultaneous failures, keeping the cluster available. MapR automatically shards and replicates its metadata along with application data, making HA part of the core architecture. This also makes HA easy to implement: it works out of the box, with no requirement to deploy specialized nodes on specialized hardware and with minimal configuration to set up and monitor. As an added advantage, the distributed metadata architecture scales with no practical limit on the number of files that can be stored on Hadoop.

MapReduce HA
MapR is the only distribution that supports fully functional MapReduce HA. Job execution proceeds to completion even if the associated trackers and resource managers go down. In other distributions, hardware failures result in failed jobs, which must then be completely restarted. This functionality applies to both MapReduce v1 and MapReduce v2 (YARN) jobs.

NFS HA
MapR uniquely provides network-attached storage (NAS) style access to Hadoop through the standard NFS (Network File System) interface. MapR allows you to mount the cluster via NFS and ensures that the NFS mount point is also HA enabled. This provides continuous, undisrupted access for incoming streaming data and for applications requiring random read/write.

Instant Recovery for NoSQL Applications
MapR ensures that data from a failed node is automatically and instantly available to the NoSQL application. Because failover is automatic and instant, there is no reassignment lag time, ensuring uninterrupted availability.
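The sharding, replication, and self-healing described under Metadata HA can be illustrated with a toy model. The sketch below is not MapR code: the node names, shard names, function names, and the replication factor of 3 are all assumptions chosen for illustration. It shows only the general idea that a shard stays available while at least one replica survives, and that lost replicas can be re-created on healthy nodes.

```python
# Toy model of replicated metadata shards. All names and the replication
# factor are hypothetical; this does not use or mirror any MapR API.

def place_replicas(shards, nodes, factor):
    """Assign each shard to `factor` distinct nodes, round-robin."""
    if factor > len(nodes):
        raise ValueError("replication factor exceeds node count")
    return {
        shard: {nodes[(i + k) % len(nodes)] for k in range(factor)}
        for i, shard in enumerate(shards)
    }

def fail_nodes(placement, failed):
    """Return the placement with replicas on failed nodes removed."""
    failed = set(failed)
    return {shard: holders - failed for shard, holders in placement.items()}

def heal(placement, healthy_nodes, factor):
    """Copy each surviving shard to healthy nodes until it has `factor` replicas."""
    healed = {}
    for shard, holders in placement.items():
        if not holders:
            raise RuntimeError(f"shard {shard} lost all replicas")
        holders = set(holders)
        for node in healthy_nodes:
            if len(holders) >= factor:
                break
            holders.add(node)
        healed[shard] = holders
    return healed

nodes = ["n1", "n2", "n3", "n4", "n5"]
placement = place_replicas(["meta-0", "meta-1", "meta-2"], nodes, factor=3)

# Two nodes fail at the same time; every shard still has a live replica.
after_failure = fail_nodes(placement, ["n1", "n2"])
available = all(after_failure.values())

# Self-healing: restore the replication factor on the remaining nodes.
healthy = [n for n in nodes if n not in {"n1", "n2"}]
healed = heal(after_failure, healthy, factor=3)
```

In this model, a shard becomes unavailable only if every node holding one of its replicas fails before healing completes, which is why a higher replication factor tolerates more simultaneous failures.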

Zero NoSQL Maintenance
As part of the broader objective of minimizing service disruptions, MapR requires zero NoSQL maintenance, which further improves availability. Automatic, workload-aware scaling maintains high performance as the data load grows. The simplified architecture means there are no NoSQL servers to administer, reducing the number of failure points. The optimized, compaction-less design prevents disruptive I/O storms and eliminates downtime for housekeeping tasks.

Rolling Upgrades
Rolling upgrades also help minimize disruptions. Users can eliminate planned downtime by performing maintenance or software upgrades on the cluster a few nodes at a time while the system continues to run.

Services HA
The MapR model of distributing the metadata can be extended to services running on Hadoop. HA can be implemented for any service running on the MapR cluster by configuring the service to store its state information as part of the cluster metadata and by registering the service with ZooKeeper. If the service goes down, the ZooKeeper and Warden services automatically restart it on a different node.

HDFS-Based Distributions and HA

HDFS-based distributions provide minimal HA capabilities. All HDFS-based distributions rely on a single server known as the NameNode to store and process metadata. This single-server approach creates performance and scalability bottlenecks, forcing a federated model of data storage that further increases SLA risks by creating multiple points of failure across the system. More importantly from an HA standpoint, this model requires an Active-Standby implementation that protects against only one failure. This means that if another NameNode-related failure occurs before the failed node is replaced or repaired, data will be lost or corrupted.

Furthermore, the system becomes more complex to set up and configure. Administrators have additional tasks associated with configuring specialized hardware to accommodate the NameNode, which also increases the total cost of ownership. The setup must also ensure continuous sharing of metadata across the Active and Standby nodes, and enable every node in the cluster to maintain a heartbeat connection to both the Active and Standby nodes at all times.
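The point that an Active-Standby pair protects against only one failure can be made concrete with a small model. The sketch below is illustrative only: the class, its method names, and the node labels are hypothetical and do not correspond to any HDFS or MapR interface. It also includes a one-line helper expressing the general rule that N identical copies can absorb N-1 simultaneous failures.

```python
# Toy model of an Active-Standby metadata pair. Names are hypothetical.

class ActiveStandbyPair:
    def __init__(self):
        self.active = "primary"
        self.standby = "secondary"

    def fail_active(self):
        """The active node fails; the standby (if any) is promoted."""
        self.active, self.standby = self.standby, None

    def repair(self, node):
        """A repaired or replaced node rejoins in whichever role is empty."""
        if self.active is None:
            self.active = node
        elif self.standby is None:
            self.standby = node

    @property
    def serving(self):
        """True while some node is able to serve metadata."""
        return self.active is not None


def tolerated_failures(copies):
    """Simultaneous failures that `copies` identical replicas can absorb."""
    if copies < 1:
        raise ValueError("need at least one copy")
    return copies - 1


pair = ActiveStandbyPair()
pair.fail_active()              # first failure: standby takes over
first_failure_ok = pair.serving
pair.fail_active()              # second failure before any repair
second_failure_ok = pair.serving
```

Here `first_failure_ok` is true and `second_failure_ok` is false, matching `tolerated_failures(2) == 1`. Under the same simple rule, metadata kept in three replicas (an assumed figure, since the replication level is configurable) would absorb two simultaneous failures.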

The figure below delineates the differences between the HDFS model and the MapR model of storing metadata.

[Figure: MapR No-NameNode Architecture. Panels: HDFS Federation (NAS appliance, DataNodes) and MapR (Distributed Metadata).]

HDFS Federation: multiple single points of failure; limited to 50-100 million files; performance bottleneck; commercial NAS required.

MapR (Distributed Metadata): HA with automatic failover; instant cluster restart; up to 1T files (>5000x advantage); 10-20x higher performance; 100% commodity hardware.
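As a quick arithmetic check on the file-count figures quoted in the comparison, the sketch below computes the ratio between the two limits. It assumes that "1T" means 10^12 files and takes the 50-100 million range at face value. The variable names are illustrative.

```python
# Arithmetic check of the quoted file-count limits.
# Assumption: "1T files" is read as 10**12 files.
hdfs_limit_low = 50_000_000       # 50 million files
hdfs_limit_high = 100_000_000     # 100 million files
mapr_limit = 10**12               # 1T files (assumed meaning)

ratio_vs_high = mapr_limit // hdfs_limit_high   # 10_000
ratio_vs_low = mapr_limit // hdfs_limit_low     # 20_000
```

Under that reading, the ratio falls between 10,000x and 20,000x, so the quoted ">5000x advantage" is consistent with the stated limits and is a conservative figure.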

With reference to jobs, because HDFS-based distributions do not currently store job-related metadata, jobs have to be restarted whenever there is a failure or whenever the resource manager or the job trackers go down. Furthermore, for NoSQL applications, HDFS-based distributions do not provide any HA capabilities because of complex architectural issues associated with working with an append-only file system. Long-running downtime is one of the common issues associated with these HDFS-based NoSQL applications.

Conclusion

MapR architectural innovations deliver 24x7 big data applications by ensuring high availability for all the critical components of Hadoop, including Hadoop 2.0 features such as YARN. The MapR Distribution for Hadoop provides high availability across nodes, jobs, access methods, and services, for both file-based and NoSQL applications, in a uniform fashion across the cluster.

2014 MapR Technologies, Inc.