SOLUTIONS BRIEF

Solving performance and data protection problems with active-active Hadoop
Many Hadoop deployments are not realizing their full business potential, with performance [1] and data protection [2] cited by 62% of IT professionals as barriers to moving into full production use. Meanwhile, 70% of Hadoop early adopters are already using multiple siloed installations in separate data centers [3]. Active-active replication turns those siloed installations into a single unified HDFS cluster that provides total data protection and better performance for Hadoop applications.

[1] http://www.wsj.com/articles/the-joys-and-hype-of-softwarecalled-hadoop-1418777627?mod=wsj_hp_editorspicks
[2] http://www.techvibes.com/blog/17-billion-the-annual-costof-data-loss-and-downtime-in-canada-2014-12-04
[3] http://wikibon.org/wiki/v/wikibon_big_data_analytics_survey,_2014

Barriers to realizing full business value from Hadoop

Let's first consider each of the problem areas in more detail.

Performance at scale

Hadoop deployments typically start small and then see viral adoption as the value of Big Data becomes clear. Rapid adoption and increased load from new applications can lead to serious performance challenges. For example, one national energy services firm found that ingesting the largest table from its legacy ERP system caused severe performance problems for other applications. Likewise, a consumer science company had to place restrictions on new machine learning applications for the same reason, limiting eager data scientists to weekend hours on the production cluster.

As these examples illustrate, resource management in Hadoop remains an unsolved problem, even among those who have already adopted YARN. YARN is designed to allocate resources based on capacity queues or fair division; it was not built for the current generation of mixed-tenant workloads, where applications like Spark require high-memory nodes. Even recent improvements like node labels do not guarantee that the right data is always local to the right nodes.
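To make the scheduling model concrete, below is a minimal Java sketch of how a MapReduce job is pinned to a capacity queue and, where node-label support is available, to a labeled set of nodes. The queue name (marketing) and label (high_memory) are illustrative placeholders; note that neither setting moves data onto the chosen nodes, which is exactly the gap described above.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;

    public class QueuePinnedJob {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Submit into a capacity queue; the scheduler enforces the queue's
            // minimum/maximum share but knows nothing about block placement.
            conf.set("mapreduce.job.queuename", "marketing");
            // Request containers on labeled nodes (where node labels are
            // supported); this steers compute, not the underlying HDFS data.
            conf.set("mapreduce.job.node-label-expression", "high_memory");
            Job job = Job.getInstance(conf, "queue-pinned-example");
            // ... set mapper, reducer, input and output paths as usual ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }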
Figure 1: YARN scheduling based on capacity queues, granting minimum and maximum resource allocations to different roles (e.g., at least 25% and no more than 40% of the cluster for Marketing; at least 50% and no more than 90% for Risk Analysis)

Figure 2: YARN has trouble managing mixed hardware profiles and diverse workloads (an analytics job can miss the 25% of high-RAM nodes it needs)

In data processing pipelines such as the Lambda architecture, multiple processing stages run different applications with very different resource profiles, and YARN does not provide ideal resource management in this case. For example, ingest applications like Sqoop can suffer runtime increases of up to 81% when running on a cluster that is also loaded with batch processing applications, and the batch applications likewise see runtime increases of as much as 131%. In-memory frameworks like Spark can see an order-of-magnitude performance improvement when run on dedicated high-memory nodes.

Data protection

Hadoop's file system (HDFS) provides redundancy within one Hadoop installation by distributing data across nodes and racks, but it has no provision for consistent real-time backups. The backup tools used by most distributions rely on DistCp, an asynchronous batch transfer program. As simple performance testing demonstrates, DistCp is a problematic tool when used as a primary backup solution:

- It consumes valuable processing (MapReduce) resources on the production cluster. Some Hadoop administrators report that DistCp
prevents other applications from running simultaneously. The problem is exacerbated as a cluster grows, with large deployments able to run DistCp only once every 12 or 24 hours.
- It is a file-based program that fails if a file copy is interrupted or corrupted, and manual intervention is then required.
- There is no guarantee of consistency when DistCp runs, and no automated way to check the consistency of backups after the fact.

Furthermore, backup clusters can only be used for a limited set of read-only operations. DistCp is unable to reconcile changes made at multiple locations, and even read-only MapReduce applications generate intermediate data that must be managed carefully to avoid conflicts with backup jobs. The result is that a significant portion of the investment in hardware and operations contributes no processing capability, negatively impacting Hadoop's cost-efficiency advantage.

Data silos

Most companies end up using multiple Hadoop clusters for one or more of these reasons:

- Maintaining different sets of users and permissions. Hadoop security tools are only now maturing, so in the past it was simpler to isolate data that had different security requirements.
- Lack of holistic planning. Many teams and business units might stand up a new cluster just for experimentation.
- Cost model. Providing individual installations to different business units is a simple way to manage cost allocation.

Maintaining siloed clusters makes sharing data between Hadoop installations difficult. Without appropriate data sharing, data scientists have only a partial view of the information, making roll-up reporting between business units difficult. Since a complete view of business operations is an important benefit of Hadoop, companies must rely on DistCp-based data transfer tools to bridge the silos. Workflow management tools like Oozie and Falcon are very useful for building complete data pipelines, but in a cross-cluster situation they require Hadoop administrators to build data transfer stages into the pipeline, along with verification steps. As noted earlier, DistCp introduces performance and consistency problems that complicate and slow down data pipelines.
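To illustrate what those verification steps involve, here is a minimal sketch of the kind of after-the-fact check an administrator must script by hand following a DistCp run. The NameNode addresses (hadoop-a, hadoop-b) are placeholders, and HDFS file checksums are only comparable when both clusters use the same block size and checksum configuration:

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileChecksum;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BackupVerifier {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder NameNode URIs for the production and backup clusters.
            FileSystem prod = FileSystem.get(URI.create("hdfs://hadoop-a:8020"), conf);
            FileSystem backup = FileSystem.get(URI.create("hdfs://hadoop-b:8020"), conf);
            Path file = new Path(args[0]); // file to verify, e.g. /data/part-00000
            // HDFS checksums (MD5 of block CRCs) match only when block size and
            // checksum settings are identical on both clusters.
            FileChecksum a = prod.getFileChecksum(file);
            FileChecksum b = backup.getFileChecksum(file);
            if (a == null || !a.equals(b)) {
                System.err.println("Backup of " + file + " does not match production");
                System.exit(1);
            }
            System.out.println(file + " verified");
        }
    }

A real pipeline would walk the directory tree and re-queue mismatched files, which is precisely the operational burden that a consistent, replicated file system removes.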
Figure 3: Periodic data transfer using DistCp (data from Hadoop A in data center 1 is periodically DistCp'd over a VPN into Hadoop B in data center 2)

A single HDFS cluster spanning several Hadoop installations and data centers

Fortunately, there is a solution. WANdisco's active-active replication turns multiple Hadoop silos running in one or more data centers into a unified HDFS cluster with separate processing layers.

Figure 4: Non-Stop Hadoop provides a single HDFS cluster underneath several Hadoop installations at one or more locations; each installation keeps its own applications (MapReduce, Spark, HBase), access layer (YARN), and security and governance, above a shared data layer connected by the active-active WAN block replicator

Total data protection

Non-Stop Hadoop provides synchronous real-time active-active replication of HDFS metadata. Every Hadoop installation, even at data centers across the WAN, sees a consistent view of the data. In the event of a failure or a network partition, the system heals automatically with no need for manual reconciliation. Non-Stop Hadoop also uses an efficient WAN block replicator to transfer data blocks to other installations without consuming processing (MapReduce) resources. Customer experience shows that even large data ingests are transferred to another data center in minutes with no performance impact on the source Hadoop installation, compared to hours of transfer time and severe performance degradation using DistCp.
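Because replication happens underneath the file system's metadata layer, client code does not change. The sketch below is ordinary HDFS client code with a placeholder fs.defaultFS address; in a Non-Stop Hadoop deployment the same write becomes visible at every data center without any explicit transfer step:

    import java.nio.charset.StandardCharsets;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class UnifiedClusterWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Placeholder address; the same logical namespace is served at
            // every data center in an active-active deployment.
            conf.set("fs.defaultFS", "hdfs://unified-cluster:8020");
            try (FileSystem fs = FileSystem.get(conf);
                 FSDataOutputStream out = fs.create(new Path("/ingest/events/2015-06-01.log"))) {
                out.write("event,payload\n".getBytes(StandardCharsets.UTF_8));
            }
            // No DistCp stage follows: metadata is coordinated synchronously,
            // and data blocks are shipped by the WAN block replicator.
        }
    }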
Figure 5: Non-Stop Hadoop architecture with two data centers separated by a WAN. HDFS writes are coordinated in real time through metadata replication, followed by asynchronous block replication between the data nodes in each data center.

As a result, Non-Stop Hadoop provides a Recovery Point Objective (RPO) of minutes instead of hours or days, and a Recovery Time Objective (RTO) of zero: other data centers are available for immediate use even if one data center is lost entirely.

Improved performance for applications

Non-Stop Hadoop presents a single HDFS cluster while preserving the independence of the processing layers. As a result, applications can be run in separate installations or zones without any extra data transfer steps. For example, one zone could run critical business applications with rigorous response SLAs, while another zone runs experimental machine learning applications that use in-memory analytics. Meanwhile, zones in other data centers can handle ingest jobs. Each zone has all the advantages of fast local access to data, making this a more effective approach than YARN's experimental node labels, which do not guarantee that the selected node is closest to the data.
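As a rough sketch of the zone model (the ResourceManager addresses zone-a-rm and zone-b-rm and the namespace URI are illustrative assumptions, not part of the product), each zone submits work to its own YARN while resolving the same HDFS paths, so no transfer stage separates ingest from analytics:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class ZoneSubmit {
        // Build a job against one zone's YARN; HDFS paths are shared by all zones.
        static Job jobForZone(String resourceManager, String inputDir) throws Exception {
            Configuration conf = new Configuration();
            conf.set("fs.defaultFS", "hdfs://unified-cluster:8020");    // shared namespace
            conf.set("mapreduce.framework.name", "yarn");
            conf.set("yarn.resourcemanager.address", resourceManager);  // zone-local YARN
            Job job = Job.getInstance(conf, "zone-scoped-job");
            FileInputFormat.addInputPath(job, new Path(inputDir));      // no copy step needed
            return job;
        }

        public static void main(String[] args) throws Exception {
            // Batch/ingest work goes to Zone A; low-latency query work to Zone B,
            // both reading the same files with local access in their own zone.
            Job ingest = jobForZone("zone-a-rm:8032", "/ingest/raw");
            Job query = jobForZone("zone-b-rm:8032", "/ingest/raw");
            // ... set mapper/reducer classes and call submit() as usual ...
        }
    }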
Figure 6: Non-Stop Hadoop presents a single HDFS cluster with independent processing tiers across zones; for example, Zone A runs batch/ingest workloads (MapReduce, Hive, Pig) and Zone B runs low-latency query workloads (MapReduce, HBase, Spark), each with its own YARN on top of Non-Stop HDFS

As noted earlier, ingest applications like Sqoop can see up to a 45% performance improvement when run in a separate zone from batch processing applications, and the batch processing applications may see up to a 57% improvement when isolated from Sqoop. (These figures follow from the degradation numbers above: an 81% runtime increase means jobs take 1.81 times as long, so isolation recovers 0.81/1.81 ≈ 45%; likewise, 1.31/2.31 ≈ 57%.) Spark applications can see an order-of-magnitude improvement when run on a small zone with dedicated high-memory nodes.

Further, every Hadoop installation is available for full active processing: read-only backup clusters become fully writable processing clusters. As a result, Hadoop deployments effectively double the processing node count and require less hardware to support the same processing requirements.

Breaking down data silos

Each Hadoop installation in a Non-Stop Hadoop deployment uses a single HDFS cluster, even when located across the WAN. This avoids the need for expensive data transfer stages in tools like Oozie or Falcon and provides total data visibility to data scientists.

Overcome performance and data protection problems

Non-Stop Hadoop turns multiple Hadoop data silos into a single HDFS cluster that provides total data protection and improved performance for Hadoop applications. The single HDFS cluster also overcomes data sharing problems while delivering improved utilization of valuable Hadoop processing resources. Alternative approaches, however, are problematic. Building a larger Hadoop cluster to add processing power magnifies the backup burden to the point where system RPO becomes unacceptable. Another option is to rely on the network to move data to processing, discarding Hadoop's natural preference for data locality. This
technique is not proven at scale, may prove very difficult across a WAN, and of course does not satisfy backup/DR requirements.

Active-active replication is recognized as a vital capability for data protection, and it offers much more than data safety. A Hadoop cluster built with active-active technology weaves independent Hadoop installations into a unified HDFS cluster, alleviating several barriers to productive Hadoop deployment.

For more information, including architectural white papers, visit http://www.wandisco.com/hadoop.

World Headquarters: 5000 Executive Pkwy, Suite 270, San Ramon, CA 94583
Europe: Electric Works, Sheffield Digital Campus, Sheffield S1 2BJ
Japan: Level 15 Cerulean Tower, 26-1 Sakuragaoka-cho, Shibuya-ku, Tokyo 150-8512, Japan
China: Financial Street Centre, Level 10 South Tower, No.9A Financial Street, XiCheng District, Beijing 100033
US Toll Free: 1-877-WANDISCO (926-3472) | Outside US: +1-925-380-1728 | EU: +44 (0)114 3039985 | APAC: +61 2 8211 0620
Email: sales@wandisco.com