Successfully Deploying Alternative Storage Architectures for Hadoop
Gus Horn, Iyer Venkatesan
NetApp

Agenda
- Hadoop and storage
- Alternative storage architecture for Hadoop
- Use cases and customer examples
- Guidelines and best practices
- NFS Connector for Hadoop
- Conclusion and next steps

Hadoop and Storage

Traditional Hadoop Storage Flow
- Ingest (logs, images, text) goes to DataNode A
- Ingest is replicated to DataNode B and DataNode C (replication R=3)
[Diagram: a NameNode and DataNodes A, B, and C behind a network switch; blocks data1 to data4 are ingested on DataNode A and replicated to DataNodes B and C]

Implications of three copies
- Network congestion
- Server congestion, RAM utilization
- Hadoop uses server-based replication to keep three copies
  - Causes high levels of I/O over the server system bus
  - Causes poor disk utilization (1/3 of raw capacity)
- Hadoop and memory
  - Memory issues account for a large part of support calls (root cause: server memory contention)
  - Reducing server replication reduces memory consumption for a more reliable, faster cluster
- Server replication can be messy
[Diagram: Servers A, B, and C, each with CPU, network, memory (RAM, DIMM), memory controller, I/O controller, and disk drives; each server holds the master copy of its own LUN and copies of the other two servers' LUNs]

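The cost of server-based replication can be estimated with simple arithmetic. The sketch below is illustrative only: the function name and the 100 TB figure are assumptions, and it models the flow described above, where data is ingested to one DataNode and then copied to the remaining replicas over the network.

```python
def replication_cost(logical_tb: float, replication: int) -> dict:
    """Estimate raw capacity and inter-node copy traffic for a replication factor.

    Assumption: data is ingested to one DataNode, which then replicates it
    to (replication - 1) other DataNodes over the network.
    """
    if replication < 1:
        raise ValueError("replication must be at least 1")
    return {
        # Every logical terabyte is stored `replication` times.
        "raw_tb": logical_tb * replication,
        # Each extra copy crosses the network once.
        "replica_traffic_tb": logical_tb * (replication - 1),
        # Share of raw disk capacity that holds unique data.
        "usable_fraction": 1 / replication,
    }


if __name__ == "__main__":
    # 100 TB is a hypothetical data set size used only for illustration.
    for factor in (3, 2):
        print(factor, replication_cost(100.0, factor))
```

Under these assumptions, moving from three copies to two lowers raw capacity by one third and halves the inter-node replica traffic.
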
Alternative DAS Architecture
- Dedicated storage with E-Series
  - External DAS architecture
  - Higher capacity and density: 180TB in 4U
  - Less footprint in the datacenter
- Two copies of data (not three)
  - Less network congestion, better throughput
  - Less data to manage, higher efficiency
- High availability for Hadoop
  - Reliable NameNode protection
  - Jobs continue when nodes go off-line
  - Faster cluster recovery

NetApp Storage Layout for HDFS
- Two 7-disk RAID 5 groups with two LUNs per node
- Dedicated set of disks per DataNode
- Shared-nothing architecture
- Spare disks shared globally

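A rough way to compare this layout with triple replication on plain disks is to multiply RAID efficiency by the HDFS replication overhead. The sketch below assumes one parity disk per 7-disk RAID 5 group (standard for RAID 5) and two HDFS copies, and it ignores spare disks and file system overhead; the function name is hypothetical.

```python
from fractions import Fraction


def usable_fraction(disks_per_group: int, hdfs_replication: int,
                    parity_disks: int = 1) -> Fraction:
    """Fraction of raw disk capacity that holds unique data.

    Assumptions: each RAID group gives up `parity_disks` disks to parity,
    and HDFS keeps `hdfs_replication` copies of every block. Spare disks
    and file system overhead are ignored.
    """
    if disks_per_group <= parity_disks:
        raise ValueError("a group needs more disks than parity disks")
    if hdfs_replication < 1:
        raise ValueError("hdfs_replication must be at least 1")
    raid_efficiency = Fraction(disks_per_group - parity_disks, disks_per_group)
    return raid_efficiency / hdfs_replication


if __name__ == "__main__":
    raid5_two_copies = usable_fraction(7, 2)                  # 7-disk RAID 5, 2 copies
    plain_three_copies = usable_fraction(1, 3, parity_disks=0)  # no RAID, 3 copies
    print(raid5_two_copies, float(raid5_two_copies))
    print(plain_three_copies, float(plain_three_copies))
```

With these assumptions, 3/7 of raw capacity holds unique data, compared with 1/3 for three copies on unprotected disks.
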
Use Cases

Service Provider Leveraging Hadoop
- Significant growth in network log data from remote data centers that couldn't be consolidated
- Analytical queries can't be done with existing tools; stakeholders couldn't access data
- Eight-node Hadoop cluster with open source search and indexing tools
- Faster consolidation, indexing, and searching of log data
- Information needed for auditing and compliance
- New analytics capabilities
[Diagram labels: Remote Servers, Central Servers, Hadoop HDFS/MapReduce, Archiving & Indexing Tools, UI + Search Tool, Analytics Solution, Analysts, Business Users]

Security Use Case in Government
Challenges
- Protect IT/data assets from cyber attacks
- Implementation: how to combine big data with cyber analytics
- Customer analytics application
Benefits
- Defensive perimeter around financial data to thwart potential attacks
- Better situational awareness
- Required both Hadoop and a custom analytical application for a complete solution

Alternative Architecture in Healthcare
Challenges
- Extract, Transform, Load (ETL) offload for increasing amounts of unstructured data
- Integration of Hadoop with traditional systems
Benefits
- Cost-effective ingest solution for semi-structured and unstructured data
- New treatment analytics capabilities
- Highly available Hadoop cluster
[Diagram labels: Images, insurance claims, patient records; Hadoop; Data Warehouse; Business Intelligence]

Other customers and use cases
- Healthcare: hospitals, pharmaceutical, managed healthcare, clinical testing
- Transportation: airline, automotive
- Government: education, security
- Telco/SP: wireless hotspots, logs analysis
- Consumer: retail, household goods
- Financial Services: insurance, banking, mobile payments
- Manufacturing: electronics, industrial coating
- High Tech: semiconductor design and packaging, networking

Advantages of Alternative Architecture
- Replication count
  - External or Managed DAS: 2; reduction of hardware required by one third; single copy planned
  - White Box DAS: 3 minimum
- Application availability
  - External or Managed DAS: enterprise hardware RAID 5, 6 and Dynamic Disk Pools; much higher uptime (five nines)
  - White Box DAS: slower recovery from disk drive failure and NameNode failure; less uptime
- Performance
  - External or Managed DAS: consistent performance during healthy and unhealthy modes of operation; 33% less network traffic
  - White Box DAS: degradation of up to 240% with a single drive failure
- Fan-in ratio
  - External or Managed DAS: up to 8:1 (nodes per E-Series); SAS options: I-Band, FC
  - White Box DAS: limited scalability, only with internal drives
- Solution architecture
  - External or Managed DAS: validated designs and Technical Reports expediting time to market, reducing risk
  - White Box DAS: iterative, time-consuming tuning process; multiple failure points; resource intensive
- Growth flexibility
  - External or Managed DAS: storage and compute decoupled; non-disruptive lifecycle management
  - White Box DAS: can only grow both simultaneously; disruptive migration and rebalancing
- DataNode management
  - External or Managed DAS: non-disruptive DataNode replacement; no rebalancing or migration
  - White Box DAS: disruptive DataNode replacement; must rebalance and/or migrate content

Best practices from customer use cases
- Start with the use case or business problem to leverage new data sources
- Determine the workload, technologies, and infrastructure
- Enhance or update your data warehouse and BI tools (ETL offload and active archiving)
- Think about redesigning or updating the analytic platform

Best Practices
- Minimize network overhead
  - Replication factor of 2 and RAID 5
  - Use compression wherever possible
- Storage and Hadoop optimization
  - Start with 4:1 storage to compute ratio
  - Allocate 30% of storage capacity to map output
  - Disk group layout
- Turn on rack awareness

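Several of these practices map onto standard Hadoop configuration properties. The sketch below renders them as Hadoop-style XML: `dfs.replication`, `mapreduce.map.output.compress`, and `net.topology.script.file.name` are standard Hadoop 2 property names, while the helper function, the choice of map output compression as the example, and the topology script path are assumptions for illustration, not settings taken from the slides.

```python
import xml.etree.ElementTree as ET


def hadoop_site_xml(properties: dict) -> str:
    """Render a dict of properties as a Hadoop-style <configuration> document."""
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = str(value)
    return ET.tostring(root, encoding="unicode")


# hdfs-site.xml: keep two copies of each block instead of the default three.
HDFS_SITE = {"dfs.replication": 2}

# mapred-site.xml: compress intermediate map output to reduce shuffle traffic.
MAPRED_SITE = {"mapreduce.map.output.compress": "true"}

# core-site.xml: enable rack awareness through a topology script.
# The path below is hypothetical; the script itself must be supplied separately.
CORE_SITE = {"net.topology.script.file.name": "/etc/hadoop/conf/topology.py"}

if __name__ == "__main__":
    for site in (HDFS_SITE, MAPRED_SITE, CORE_SITE):
        print(hadoop_site_xml(site))
```

Each dictionary corresponds to a separate site file; actual values should be validated against the cluster's Hadoop version and workload.
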
Best Practices
- Use E5560 (or later) as the storage array, supporting four DataNodes
- Use FAS22xx for diskless and network boot, and for storage administration
- Use separate networks for data and for node interconnect
- Use Jumbo Frames and 10GbE
- Determine DataNodes by storage and job run requirements

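These guidelines lend themselves to a first-pass sizing calculation. The sketch below assumes four DataNodes per storage array and the 30% map output allocation from the previous slide; the function names and the example cluster size are hypothetical, and the result is a starting point rather than a validated design.

```python
import math


def arrays_needed(data_nodes: int, nodes_per_array: int = 4) -> int:
    """Number of storage arrays required when each array serves a fixed
    number of DataNodes (four per array is assumed by default)."""
    if data_nodes < 0 or nodes_per_array < 1:
        raise ValueError("invalid node counts")
    return math.ceil(data_nodes / nodes_per_array)


def split_capacity(usable_tb: float, map_output_share: float = 0.30):
    """Split usable capacity into HDFS data and map output space.

    The 30% default follows the 'allocate 30% of storage capacity to map
    output' guideline. Returns (hdfs_tb, map_output_tb).
    """
    if not 0 <= map_output_share <= 1:
        raise ValueError("map_output_share must be between 0 and 1")
    map_output_tb = usable_tb * map_output_share
    return usable_tb - map_output_tb, map_output_tb


if __name__ == "__main__":
    # A 10-node cluster with 100 TB usable is a made-up example.
    print(arrays_needed(10))
    print(split_capacity(100.0))
```
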
Best practices (continued)
- Start a POC or pilot sooner rather than later
  - POC is for business validation
  - Pilot is for technology validation
- Focus on performance after deployment
- Application and cluster size determine most of the configuration

Putting the Stack Together
- Reporting/Dashboard/Visualization
- Applications and Analytics
- Data Management
- Servers, Networking, Hardware
- Storage and File Systems

Scenario for storage and analytics
1) Data is sitting on FAS, NFS-based storage
2) If Hadoop or MapReduce analysis is needed, HDFS-based storage has to be created
3) Data has to be moved to the newly created Hadoop storage
4) Analysis can now be done on the data
[Diagram labels: Enterprise Data; NetApp FAS Storage (NFS-based); HDFS; Hadoop Analytics (MapReduce, HBase, Spark, YARN). Hadoop diagram courtesy Hortonworks]

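The data movement step in this scenario can be costly. As a back-of-the-envelope sketch, the time to copy a data set from NFS storage into HDFS is bounded below by the link speed. The function name, the 100 TB data set, and the single 10 Gb/s link running at full line rate are all assumptions for illustration; real transfers will be slower.

```python
def copy_hours(data_tb: float, link_gbps: float, efficiency: float = 1.0) -> float:
    """Lower-bound time in hours to move `data_tb` terabytes over one link.

    Assumptions: decimal units (1 TB = 1e12 bytes, 1 Gb/s = 1e9 bits/s) and
    a link that sustains `efficiency` (0 to 1) of its nominal rate.
    """
    if link_gbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_gbps must be positive and efficiency in (0, 1]")
    bits = data_tb * 1e12 * 8
    seconds = bits / (link_gbps * 1e9 * efficiency)
    return seconds / 3600


if __name__ == "__main__":
    # Hypothetical example: 100 TB over a single 10 Gb/s link at line rate.
    print(round(copy_hours(100.0, 10.0), 1))
```
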
Introducing NetApp NFS Connector
- MapReduce analytics natively on data sitting on FAS, NFS-based storage
- NFS Connector is a thin software application between MapReduce and NFS
[Diagram labels: Enterprise Data; NetApp FAS Storage (NFS-based); NFS Connector, directly on NFS data; HDFS; Hadoop Analytics (MapReduce, HBase, Spark, YARN). Hadoop diagram courtesy Hortonworks]

Next Steps
- Download information at netapp.com/hadoop
  - Technical Reports, Solution Guides, Cisco Validated Designs, Solution Briefs
- Start a POC
  - Engage NetApp or a partner
- Contact us: gustav.horn@netapp.com or iyerv@netapp.com, or your NetApp System Engineer

Thank You!