Roadmap for Applying Hadoop Distributed File System in Scientific Grid Computing

1 Roadmap for Applying Hadoop Distributed File System in Scientific Grid Computing
Garhan Attebury (1), Andrew Baranovski (2), Ken Bloom (1), Brian Bockelman (1), Dorian Kcira (3), James Letts (4), Tanya Levshina (2), Carl Lundestedt (1), Terrence Martin (4), Will Maier (5), Haifeng Pi (4), Abhishek Rana (4), Igor Sfiligoi (4), Alexander Sim (6), Michael Thomas (3), Frank Wuerthwein (4)
  1. University of Nebraska Lincoln
  2. Fermi National Accelerator Laboratory
  3. California Institute of Technology
  4. University of California, San Diego
  5. University of Wisconsin Madison
  6. Lawrence Berkeley National Laboratory
On behalf of the Open Science Grid (OSG) Storage Hadoop Community

2 Storage, a critical component of Grid
Grid computing is data-intensive and CPU-intensive, which requires
  A scalable management system for bookkeeping and discovering data
  Reliable and fast tools for distributing and replicating data
  Efficient procedures for processing and extracting data
  Advanced techniques for analyzing and storing data in parallel
A scalable, dynamic, efficient and easy-to-maintain storage system is on the critical path to the success of grid computing
  Meet various data access needs at both the organization and the individual level
  Maximize CPU usage and efficiency
  Fit into sophisticated VO policies (e.g. data security, user privileges)
  Survive unexpected usage of the storage system
  Minimize the cost of ownership
  Easy to expand, reconfigure, commission/decommission as requirements change

3 A Case Study: Some Requirements for a Storage Element (SE) at Compact Muon Solenoid (CMS)
  Have a credible support model that meets the reliability, availability, and security expectations consistent with the computing infrastructure
  Demonstrate the ability to interface with the existing global data transfer system and the transfer technology of SRM tools and FTS, as well as the ability to interface with the CMS software locally through ROOT
  Well-defined and reliable behavior for recovery from the failure of any hardware component
  Well-defined and reliable method of replicating files to protect against the loss of any individual hardware system
  Well-defined and reliable procedure for decommissioning hardware without data loss
  Well-defined and reliable procedure for site operators to regularly check the integrity of all files in the SE
  Well-defined interfaces to monitoring systems
  Capable of delivering at least 1 MB/s per batch slot for CMS applications, and capable of writing files from the WAN at a rate of at least 125 MB/s while simultaneously writing data from the local farm at an average rate of 20 MB/s
  Job failures caused by failure to open a file or to deliver data products from the storage system should be below the level of 1 in 10^5
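The throughput and failure-rate requirements above translate into simple per-site numbers. The following is a minimal sketch of that arithmetic; the function names are hypothetical, and the batch-slot and job counts used in the usage note are illustrative assumptions, not figures from the slides.

```python
# Requirement figures quoted on the slide.
PER_SLOT_READ_MB_S = 1.0     # minimum delivery rate per batch slot
WAN_WRITE_MB_S = 125.0       # minimum write rate from the WAN
LOCAL_WRITE_MB_S = 20.0      # average write rate from the local farm
MAX_FAILURE_FRACTION = 1e-5  # fewer than 1 storage-related failure in 10^5 jobs


def required_read_mb_s(batch_slots: int) -> float:
    """Aggregate read bandwidth needed when every batch slot is busy."""
    return batch_slots * PER_SLOT_READ_MB_S


def required_write_mb_s() -> float:
    """Simultaneous WAN plus local-farm write bandwidth."""
    return WAN_WRITE_MB_S + LOCAL_WRITE_MB_S


def meets_failure_target(failed_jobs: int, total_jobs: int) -> bool:
    """True if the observed storage failure fraction is below the target."""
    if total_jobs <= 0:
        raise ValueError("total_jobs must be positive")
    return failed_jobs / total_jobs < MAX_FAILURE_FRACTION
```

For example, a hypothetical site with 2000 batch slots would need `required_read_mb_s(2000)` MB/s of aggregate read bandwidth, and 5 storage-related failures in 1,000,000 jobs would satisfy `meets_failure_target`.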

4 Hadoop Distributed File System (HDFS)
Open source project hosted by Apache and used by Yahoo for its search engine, with multi-PB scale of data involved
Design goals
  Reduce the impact of hardware failure
  Streaming data access
  Handle large datasets
  Simple coherency model
  Portability across heterogeneous platforms
A scalable distributed cluster file system
  The namespace and image of the whole file system are maintained in the memory of one single machine, the NameNode
  Files are split into blocks and stored across the cluster on DataNodes
  File blocks can be replicated; the loss of one DataNode can be recovered from the replica blocks on other DataNodes
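The block and replica model described above can be illustrated with a small sketch. This is a toy model, not HDFS code: the round-robin placement, the function names, and the sizes in the usage note are assumptions made for illustration, and the real HDFS placement policy is more elaborate.

```python
from typing import Dict, List


def split_into_blocks(file_size: int, block_size: int) -> List[int]:
    """Return the length of each block; only the last block may be short."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    return [block_size] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks: int, datanodes: List[str],
                   replication: int) -> Dict[int, List[str]]:
    """Assign each block to `replication` distinct DataNodes (round-robin)."""
    if not 1 <= replication <= len(datanodes):
        raise ValueError("replication must be between 1 and len(datanodes)")
    return {
        block: [datanodes[(block + r) % len(datanodes)]
                for r in range(replication)]
        for block in range(num_blocks)
    }


def readable_after_loss(placement: Dict[int, List[str]], lost: str) -> bool:
    """True if every block still has a replica on a surviving DataNode."""
    return all(any(node != lost for node in nodes)
               for nodes in placement.values())
```

With two replicas per block, `readable_after_loss` stays true for the loss of any single DataNode; with a single replica it does not, which is the property the slide relies on.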

5 Important Components of an HDFS-based SE
FUSE/FUSE-DFS
  FUSE is a Linux kernel module that allows file systems to be written in userspace; FUSE-DFS uses it to provide a POSIX-like interface to HDFS
  Important for software applications accessing data in the local SE
Globus GridFTP
  Provides WAN transfer between two SEs, or between an SE and a worker node (WN)
  A special plugin is needed to assemble asynchronously transferred packets for sequential writing to HDFS if multiple streams are used
BeStMan
  Provides the SRM interface to HDFS
  Plugins can be developed to select GridFTP servers according to their status
A number of software bugs and integration issues have been solved over the last 12 months to bring all the components together and make a production-quality SE
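The reason the GridFTP plugin has to reassemble packets is that parallel streams deliver chunks out of order, while the write to HDFS must be sequential. A minimal sketch of such a reorder buffer follows; the class name and interface are hypothetical and do not reflect the actual plugin, and the sketch assumes chunks never overlap.

```python
import io
from typing import BinaryIO, Dict


class SequentialAssembler:
    """Buffer out-of-order chunks and write them to a sink strictly in order."""

    def __init__(self, sink: BinaryIO) -> None:
        self._sink = sink
        self._next_offset = 0                 # next byte offset the sink expects
        self._pending: Dict[int, bytes] = {}  # early chunks keyed by offset

    def add(self, offset: int, data: bytes) -> None:
        """Accept one chunk; flush every chunk that is now contiguous."""
        if offset < self._next_offset:
            raise ValueError("chunk overlaps data already written")
        self._pending[offset] = data
        while self._next_offset in self._pending:
            chunk = self._pending.pop(self._next_offset)
            self._sink.write(chunk)
            self._next_offset += len(chunk)

    @property
    def buffered_bytes(self) -> int:
        """Bytes held in memory while waiting for an earlier chunk."""
        return sum(len(chunk) for chunk in self._pending.values())


sink = io.BytesIO()                 # stands in for the sequential HDFS writer
assembler = SequentialAssembler(sink)
assembler.add(4, b"efgh")           # arrives early, held in memory
assembler.add(0, b"abcd")           # fills the gap, both chunks are flushed
```

The memory held in `buffered_bytes` grows with the number of streams and the spread between them, which is one reason memory allocation for GridFTP is a site-level tuning concern.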

6 HDFS SE Architecture for Scientific Computing
[Architecture diagram; components shown:]
  NameNode (secondary NN)
  BeStMan: FUSE + Hadoop Client
  Dedicated Data Node: Hadoop Client
  WorkerNode + (DataNode) + (GridFTP): FUSE + Hadoop Client
  GridFTP Node: FUSE + Hadoop Client
  GUMS: Proxy-User Mapping

7 HDFS-based SE at CMS Tier-2
Three CMS Tier-2 sites (Nebraska, Caltech and UCSD) currently deploy an HDFS-based SE
  On average 6-12 months of operational experience, with increasing scale in total disk space
  Currently around 100 DataNodes, with 300 to 500 TB of disk space, at each site
  Successfully serves the CMS collaboration, with up to thousands of grid users and hundreds of local users accessing the datasets in HDFS
  Successfully serves the data operations and Monte Carlo production run by CMS
Benefits the new SE brings to these sites
  Reliability: file loss has stopped thanks to the file replication scheme run by HDFS
  Simple deployment: most of the deployment procedure is streamlined, with fewer commands for administrators to run
  Easy operation: stable system, little effort for system/file recovery, less than 30 minutes per day for operation and user support
  Proven scalability: supports a large number of simultaneous read/write operations and high throughput when serving data to grid jobs running at the site

8 Highlights of Operational Performance of HDFS-SE
Stably delivers ~3 MB/s to applications in the cluster while the cluster is fully loaded with jobs
  Sufficient for the CMS application's I/O requirement, with high CPU efficiency
  The CMS application is IOPS limited, not bandwidth limited
HDFS NameNode serves 2500 user requests per second
  Sufficient for a cluster with thousands of cores running I/O-intensive jobs
Sustained WAN transfer rate of 400 MB/s
  Sufficient for CMS Tier-2 data operation (dataset transfer and stage-out of user analysis jobs)
Simultaneously processes thousands of client requests at BeStMan
  Sustained endpoint processing rate of 50 Hz
  Sufficient for high-rate transfers of gigabyte-sized files and uncontrolled, chaotic user jobs
Extremely low observed file corruption rate
  Benefits from the robust and fast file replication of HDFS
Decommissioning a DataNode takes < 1 hour, restarting the NameNode takes 1 minute, and checking the file system image (from the NameNode's memory) takes 10 seconds
  Fast and efficient for operation
Survives various stress tests involving HDFS, BeStMan, GridFTP...
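To put the sustained rates above in perspective, they can be converted into daily totals. This is only a back-of-the-envelope sketch: it assumes the rates hold for a full 24 hours and uses decimal units (1 TB = 10^6 MB); the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def daily_volume_tb(rate_mb_s: float) -> float:
    """Data moved in one day at a sustained rate, in decimal terabytes."""
    return rate_mb_s * SECONDS_PER_DAY / 1_000_000


def daily_requests(rate_hz: float) -> float:
    """Requests handled in one day at a sustained request rate."""
    return rate_hz * SECONDS_PER_DAY


wan_tb_per_day = daily_volume_tb(400)        # sustained WAN rate from the slide
srm_requests_per_day = daily_requests(50)    # BeStMan endpoint rate
namenode_ops_per_day = daily_requests(2500)  # NameNode request rate
```

Under these assumptions, 400 MB/s corresponds to roughly 35 TB per day, and a 50 Hz SRM endpoint to about 4.3 million requests per day.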

9 Data Transfer to HDFS-SE

10 NameNode Operation Count

11 Processing Rate at SRM endpoint

12 Monitoring and Routine Tests
Integration with general grid monitoring infrastructure
  Nagios, Ganglia, MonALISA
  CPU, memory, and network statistics for the NameNode, the DataNodes and the whole system
HDFS monitoring
  Hadoop web service, Hadoop Chronicle, JConsole
  Status of the file system and users
Logs of NameNode, DataNode, GridFTP and BeStMan
  Used as part of the daily tasks and debugging activities
Regular low-stress tests performed by the CMS VO
  Test analysis jobs, load tests of file transfer
  Part of the daily commissioning of the site involves local and remote I/O of the SE
Intentional failures in various parts of the SE, with demonstrated recovery mechanisms
  Documentation of recovery procedures

13 Load test between two HDFS-SE

14 Data Security and Integrity
Security concerns
  HDFS
    No encryption or strong authentication between client and server; HDFS must only be exposed to a secure internal network
    In practice, a firewall or NAT is needed to properly isolate HDFS from direct public access
    The latest HDFS implements access tokens; a transition to Kerberos-based components is expected
  Grid components (GridFTP and BeStMan)
    Use standard GSI security with VOMS extensions
Data integrity and consistency of the file system
  HDFS checksums for blocks of data
  Command line tool to check blocks, directories and files
  HDFS keeps multiple journals and file system images
  The NameNode periodically requests the entire block report from all DataNodes
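The idea behind block checksums can be sketched in a few lines: a checksum is stored per chunk of data when it is written and recomputed when it is read, so corruption is localized to a chunk. The sketch below uses CRC32 from Python's standard library; the function names and the 512-byte chunk size are illustrative assumptions, not a description of HDFS internals.

```python
import zlib
from typing import List

CHUNK_SIZE = 512  # bytes covered by one checksum (illustrative choice)


def chunk_checksums(data: bytes, chunk_size: int = CHUNK_SIZE) -> List[int]:
    """Compute one CRC32 per fixed-size chunk of the data."""
    return [zlib.crc32(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


def find_corrupt_chunks(data: bytes, expected: List[int],
                        chunk_size: int = CHUNK_SIZE) -> List[int]:
    """Return the indices of chunks whose checksum no longer matches."""
    actual = chunk_checksums(data, chunk_size)
    count = max(len(actual), len(expected))
    return [i for i in range(count)
            if i >= len(actual) or i >= len(expected)
            or actual[i] != expected[i]]
```

A reader that finds a non-empty result for one replica can fall back to another replica of the same block, which is how checksums and replication work together.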

15 A Combined Release Infrastructure at OSG and CMS
Various original open source projects provide all the necessary packages
  HDFS, FUSE, BeStMan, GridFTP plugins, BeStMan plugins...
All software components needed for deploying the Hadoop-based SE are packaged as RPMs, with the add-on configuration and scripts a site needs to install them with minimal changes for local conditions and requirements
Consistency checks and validation are done at selected sites with HDFS-SE experts before the formal release via OSG
  A testbed for common platforms and scalability tests
Development in 2010
  Release procedure to be fully integrated into the standard OSG distribution: Virtual Data Toolkit (VDT)
  Possibility of some intersection with external commercial packagers, e.g., using selected RPMs from Cloudera

16 Site Specific Optimization
Various optimizations can be done for each site based on usage patterns and local hardware conditions
  Block size for files
  Number of file replicas
  Architecture of GridFTP server deployment
    A few high-performance GridFTP servers vs. many GridFTP servers running on the WorkerNodes
  Memory allocation at the WorkerNode (WN) for GridFTP, applications...
  Selection of GridFTP servers
    Real-time-monitoring-based selection using CPU and memory usage vs. randomly picking a live GridFTP server
  Data access with MapReduce
    A special case for data processing
  Rack awareness
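The two GridFTP selection strategies mentioned above (random choice among live servers vs. monitoring-based choice) can be contrasted in a short sketch. The data structure, the function names, and the load score (the larger of CPU and memory usage) are assumptions made for illustration and are not the BeStMan plugin interface.

```python
import random
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class GridFtpStatus:
    name: str
    alive: bool
    cpu_load: float  # 0.0 (idle) to 1.0 (saturated)
    mem_used: float  # fraction of memory in use, 0.0 to 1.0


def alive_servers(servers: List[GridFtpStatus]) -> List[GridFtpStatus]:
    """Keep only servers that responded to the last health check."""
    live = [s for s in servers if s.alive]
    if not live:
        raise RuntimeError("no GridFTP server is alive")
    return live


def pick_random(servers: List[GridFtpStatus],
                rng: Optional[random.Random] = None) -> GridFtpStatus:
    """Strategy 1: pick any live server uniformly at random."""
    chooser = rng if rng is not None else random
    return chooser.choice(alive_servers(servers))


def pick_least_loaded(servers: List[GridFtpStatus]) -> GridFtpStatus:
    """Strategy 2: pick the live server whose busiest resource is least used."""
    return min(alive_servers(servers),
               key=lambda s: max(s.cpu_load, s.mem_used))
```

Random selection needs no monitoring feed and spreads load evenly on average; the monitoring-based strategy avoids servers that are already busy, at the cost of depending on fresh status data.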

17 Summary of Our Experience
A Hadoop-based storage solution is established and functioning at the CMS Tier-2 level, as an example of data- and CPU-intensive HPC
  Flexible architecture involving various grid components
  Scalable and stable
  Seamlessly interfaced with various grid middleware
Lower costs in deployment, maintenance, and required hardware
  Significantly reduces manpower and increases QoS
  Easy to adapt to existing/new hardware and changing requirements
Standard release for the whole community
  Experts available to help solve technical problems
VO and grid sites benefit from the reliable HDFS file replication and distribution scheme
  High data security and integrity
  Excellent I/O performance for CPU- and data-intensive grid applications
  Less administrator intervention
HDFS is shown to integrate seamlessly into a grid storage solution for a Virtual Organization (VO) or grid site

18 Roadmap for the Near Future
Deployment in a variety of scientific computing projects, experiments, or institutions
  As an integrated storage element solution
  As a storage file system
Benchmark performance for HPC with data- and CPU-intensive grid computing
  Scalability, stability, usability
  Integration and efficiency with other tools
Organization
  Seamless integration between the scientific user community and the HDFS development community
  Consolidation of scientific release and technical support
  New development and contributions from the scientific community
  Funding proposals based on HDFS infrastructure and technology
Improvement in I/O capacity and full integration as a critical component of the Storage Element
Operational optimization with different scales of data and compute infrastructure