Michael Thomas, Dorian Kcira, California Institute of Technology. CMS Offline & Computing Week, San Diego, April 20-24, 2009

Hadoop is Map-Reduce plus the HDFS filesystem, implemented in Java.
Map-Reduce is a highly parallelized distributed computing system.
HDFS is the distributed cluster filesystem.
o This is the feature that we are most interested in.
Open-source project hosted by Apache.
Used by Yahoo for their search engine; Yahoo is a major contributor to the Apache Hadoop project.

Distributed cluster filesystem.
Extremely scalable: Yahoo uses it for multi-PB storage.
Easy to manage: few services and little hardware overhead.
Files are split into blocks and spread across multiple cluster datanodes.
o 64 MB blocks by default, configurable
o Block-level decomposition avoids 'hot-file' access bottlenecks
o Block-level decomposition means the loss of multiple datanodes will result in the loss of more files than file-level decomposition would
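To make block-level decomposition concrete, here is a small illustrative sketch (plain Python, not HDFS code; the function name is made up) of how a file is cut into fixed-size blocks:

```python
BLOCK_SIZE = 64 * 1024 * 1024  # HDFS default block size (64 MB)

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return (block_index, block_length) pairs for a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)  # last block may be short
        blocks.append((len(blocks), length))
        offset += length
    return blocks

# A 200 MB file yields three full 64 MB blocks plus one 8 MB tail block.
blocks = split_into_blocks(200 * 1024 * 1024)
print(len(blocks))    # 4
print(blocks[-1][1])  # 8388608
```

Each of these blocks is then placed on (and replicated across) different datanodes, which is what spreads reads of a single hot file over the cluster.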

Namenode: manages the filesystem namespace operations.
o File/directory creation/deletion
o Block allocation/removal
o Block locations
Datanode: stores file blocks on one or more disk partitions.
Secondary Namenode: helper service for merging namespace changes.
Services communicate through Java RPC, with some functionality exposed through HTTP interfaces.

Purpose is similar to dCache PNFS.
Keeps track of the entire fs image:
o The entire filesystem directory structure
o The mapping of file blocks to datanodes
o The block replication level
o ~1 GB of memory per 1e6 blocks is recommended
The entire namespace is stored in memory, but persisted to disk.
o Block locations are not persisted to disk
o All namespace requests are served from memory
o fsck across the entire namespace is very fast
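The memory rule of thumb above can be turned into a quick back-of-the-envelope estimate. This sketch is purely illustrative and assumes the ~1 GB per million blocks figure quoted above; the function and its parameters are invented for the example:

```python
GB_PER_MILLION_BLOCKS = 1.0  # rule of thumb from the slide

def namenode_heap_gb(num_files, avg_file_mb, block_mb=64):
    """Rough namenode memory estimate for num_files of avg_file_mb each."""
    blocks_per_file = max(1, -(-avg_file_mb // block_mb))  # ceiling division
    total_blocks = num_files * blocks_per_file
    return total_blocks / 1e6 * GB_PER_MILLION_BLOCKS

# One million 200 MB files -> 4 blocks each -> roughly 4 GB of namenode heap.
print(namenode_heap_gb(1_000_000, 200))  # 4.0
```

The point of the exercise: namenode memory scales with block count, not bytes stored, which is why large block sizes keep the namespace cheap to hold in RAM.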

The NN fs image is read from disk only once, at startup.
Any changes to the namespace (mkdir, rm) are written to one or more journal files (local disk, NFS,...).
The journal is periodically merged with the fs image.
Merging can temporarily require extra memory to hold two copies of the fs image at once.
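The merge step can be pictured with a toy sketch; the data structures below are hypothetical stand-ins (a set of paths, a list of operations), not the real fsimage or edit-log formats:

```python
def apply_journal(fsimage, journal):
    """Replay journaled namespace operations onto an in-memory fs image.

    fsimage: set of paths; journal: list of ('mkdir'|'rm', path) entries.
    """
    for op, path in journal:
        if op == 'mkdir':
            fsimage.add(path)
        elif op == 'rm':
            fsimage.discard(path)
    return fsimage

image = apply_journal({'/store', '/store/user'},
                      [('mkdir', '/store/unmerged'), ('rm', '/store/user')])
print(sorted(image))  # ['/store', '/store/unmerged']
```

After the replay, the merged image is written back to disk and the journal can be truncated; holding both the old and new image during the merge is the source of the temporary extra memory mentioned above.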

The name is misleading: this is NOT a backup namenode or hot-spare namenode, and it does NOT respond to namespace requests.
It is an optional checkpoint server for offloading the NN journal/fsimage merges:
o Downloads the fs image from the namenode (once)
o Periodically downloads the journal from the namenode
o Merges the journal and fs image
o Uploads the merged fs image back to the namenode
The contents of the merged fsimage can be manually copied to the NN in case of namenode corruption or failure.

Purpose is similar to a dCache pool.
Stores file block metadata and file block contents in one or more local disk partitions.
Datanode performance scales well with the number of local partitions.
o Caltech is using one partition per local disk
o Nebraska has 48 individual partitions on Sun Thumpers
Sends a heartbeat to the namenode every 3 seconds.
Sends a full block report to the namenode every hour.
The namenode uses the reports and heartbeats to keep track of which block replicas are still accessible.

When a client requests a file, it first contacts the namenode for namespace information. The namenode looks up the block locations for the requested files and returns the datanodes that contain the requested blocks. The client then contacts those datanodes directly to retrieve the file contents.
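The read path described above can be sketched with toy data structures; none of the names below correspond to the real HDFS API, and the "RPC" is just a dictionary lookup:

```python
def read_file(path, namenode, datanodes):
    """Simulate an HDFS read: metadata from the namenode, data from datanodes.

    namenode:  {path: [(block_id, [datanode names holding it])]}
    datanodes: {name: {block_id: bytes}}
    """
    data = b''
    for block_id, locations in namenode[path]:
        # The client contacts a datanode holding the block directly;
        # the namenode never sits in the data path.
        data += datanodes[locations[0]][block_id]
    return data

nn = {'/f': [('b0', ['dn1', 'dn2']), ('b1', ['dn2'])]}
dns = {'dn1': {'b0': b'hello '}, 'dn2': {'b0': b'hello ', 'b1': b'world'}}
print(read_file('/f', nn, dns))  # b'hello world'
```

Keeping the namenode out of the data path is what lets aggregate read bandwidth scale with the number of datanodes.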

A native Java client can be used to perform all file and management operations. All operations use the native Hadoop Java APIs.

FUSE is a client layer that presents a POSIX-like interface to arbitrary backend storage systems (ntfs, lustre, ssh).
The HDFS FUSE module provides a POSIX interface to HDFS using the HDFS APIs, allowing standard filesystem commands on HDFS (rm, cp, mkdir,...).
HDFS does not support non-sequential (random) writes.
o ROOT TFile can't write directly to HDFS FUSE, but this is not really necessary for CMS
o Files can still be read through FUSE with CMSSW/TFile; eventually CMSSW can use the Hadoop API directly
Random reads are OK.
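The sequential-write constraint can be illustrated with a toy file object that, like HDFS, accepts only appends and rejects writes at arbitrary offsets (this is a sketch of the constraint, not of any real HDFS code):

```python
class SequentialOnlyFile:
    """Toy write path that mimics HDFS's append-only write semantics."""

    def __init__(self):
        self.data = bytearray()

    def write_at(self, offset, chunk):
        if offset != len(self.data):
            # Anything other than an append at the current end is a
            # non-sequential (random) write, which HDFS does not support.
            raise IOError("HDFS supports only sequential (append) writes")
        self.data.extend(chunk)

f = SequentialOnlyFile()
f.write_at(0, b'abc')
f.write_at(3, b'def')      # ok: appends at the current end of file
try:
    f.write_at(1, b'X')    # random write: rejected
except IOError as e:
    print(e)
```

This is why an application like ROOT's TFile, which seeks backwards to update headers while writing, cannot write through the FUSE mount, while purely streaming writers work fine.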

GridFTP could write to HDFS+FUSE with a single stream, but multiple streams will fail due to non-sequential writes.
Brian at Nebraska developed a GridFTP DSI module to buffer multiple streams so that data can be written to HDFS sequentially.
Bestman SRM can perform namespace operations by using FUSE:
o srmrm, srmls, srmmkdir
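The idea behind the buffering DSI module can be sketched as a reorder buffer: out-of-order chunks arriving from parallel streams are held back until they can be flushed in strictly sequential order. This is hypothetical illustration code, not the actual module:

```python
def sequentialize(chunks):
    """Yield byte chunks in offset order, buffering gaps until they fill.

    chunks: iterable of (offset, bytes) pairs arriving in any order,
    together covering a contiguous byte range starting at offset 0.
    """
    pending = {}      # offset -> chunk, for chunks that arrived early
    next_offset = 0   # next byte offset the sequential writer needs
    for offset, data in chunks:
        pending[offset] = data
        # Flush as long as the next needed chunk is available.
        while next_offset in pending:
            data = pending.pop(next_offset)
            next_offset += len(data)
            yield data

# Chunks from parallel GridFTP streams arrive out of order...
arrived = [(4, b'o wo'), (0, b'hell'), (8, b'rld!')]
# ...but are emitted as one sequential byte stream.
print(b''.join(sequentialize(arrived)))  # b'hello world!'
```

In the real module the flushed chunks would be appended to the HDFS file, satisfying the sequential-write-only constraint while still accepting multi-stream transfers.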

The current Tier2 cluster runs RHEL4 with dCache; we did not want to disturb this working setup.
We recently acquired 64 additional nodes, installed with Rocks5/RHEL5. These are set up as a separate cluster with its own CE and SE, which avoids interfering with the working RHEL4 cluster.
A single PhEDEx instance runs on the RHEL4 cluster, but each SE has its own SRM server.
The clusters share the same private subnet.

The namenode runs on the same system as the Condor negotiator/collector.
o 8 cores, 16 GB RAM
o The system is very over-provisioned: load never exceeds 1.0, and the JVM never exceeds 200 MB
o Plenty of room for scaling to more blocks
The secondary NN runs on the same system as a Condor batch worker.
64 datanodes, 170 TB of available space.
o Includes 2 Sun Thumpers running Solaris
o Currently only 4.5 TB used
o All datanodes are also Condor batch workers
A single Bestman SRM server uses FUSE for file operations.
Two gridftp-hdfs servers.

T2_US_Nebraska first started investigating Hadoop last year and performed a lot of R&D to get Hadoop to work in the CMS context:
o Two SEs in SAM
o The gridftp-hdfs DSI module
o Use of Bestman SRM
o Many internal Hadoop bug fixes and improvements
They presented this work to the USCMS T2 community in March.

Held at UCSD in early March 2009, intended to help interested USCMS Tier2 sites jump-start their Hadoop installations.
Results:
o Caltech and UCSD expanded their Hadoop installations
o Wisconsin delayed deployment due to facility problems
o Bestman and GridFTP servers deployed
o Initial SRM stress tests performed
o UCSD-Caltech load tests started
o Hadoop SEs added to SAM
o Improved RPM packaging
o Better online documentation for CMS: https://twiki.grid.iu.edu/bin/view/storage/hdfsworkshop

Caltech started using Hadoop in Feb. 2009 on a 4-node testbed.
We created RPMs to greatly simplify deployment across an entire cluster, then deployed Hadoop on the new RHEL5 cluster of 64 nodes.
Basic functionality worked out of the box, but performance was poor.
We attended the USCMS Tier2 Hadoop workshop at UCSD in early March.

Migrated the OSG RSV tests to Hadoop in mid-March.
Migrated the T1-Caltech load tests to Hadoop in early April.
Attempted to move one /store/user/$user directory to Hadoop in early April, but failed due to TFC problems.

SAM tests passing.
T1-Caltech load tests passing.
RPMs provide easy installs and reinstalls.
Bestman + GridFTP-HDFS have been stable.
Great inter-node transfer rates (2 GB/s aggregate).
Adequate WAN transfer rates (200 MB/s).

OSG RSV tests required a patch to remove ':' from filenames; it is not a valid character in Hadoop filenames (resolved).
Bestman dropped the VOMS FQAN for non-delegated proxies, which caused improper user mappings and filesystem permission failures for SAM and PhEDEx (resolved).
TFC not so trivial anymore.*
Datanode/namenode version mismatches (improved).
Initial performance was poor (400 MB/s aggregate) due to the cluster switch configuration (resolved).
*) TFC = Trivial File Catalog

FUSE was not so stable:
o A boundary-condition error for files of a specific size crashed FUSE (resolved)
o df sometimes did not show the FUSE mount space (resolved)
o Lazy Java garbage collection resulted in hitting the ulimit for open files (resolved with a larger ulimit)
Running two CEs and SEs requires extra care so that both CEs can access both SEs:
o Some private network configuration issues
o Lots of TFC wrangling

Looping reads on 62 machines, one read per machine.

Write a 4 GB file on 62 machines (dd + FUSE) with 2x replication: 1.8 GB/s aggregate.

Decommission 10 machines at once, resulting in the namenode issuing many replication tasks: 1.7 GB/s.

2 x 10 GbE GridFTP servers: 260 MB/s.

Make another attempt to move /store/user to HDFS.
Run more benchmarks to show that HDFS satisfies the CMS SE technology requirements.
Finish validating that both CEs can access data from both SEs.
More WAN transfer tests and tuning.
o FDT + HDFS integration starting soon
Migrate additional data to Hadoop:
o All of /store/user
o /store/unmerged
o Non-CMS storage areas

Management of HDFS is simple relative to other SE options.
Performance has been more than adequate.
We scaled from 4 nodes to 64 nodes with no problems.
~50% of our initial problems were related to Hadoop; the other 50% were caused by Bestman, the TFC, PhEDEx agents, or running multiple SEs.
We currently plan to continue using Hadoop and to expand its use going forward.