Optimize the execution of local physics analysis workflows using Hadoop


Optimize the execution of local physics analysis workflows using Hadoop
INFN CCR - GARR Workshop, 14-17 May, Napoli
Hassen Riahi (INFN-Perugia), Giacinto Donvito (INFN-Bari), Livio Fano (INFN-Perugia), Massimiliano Fasi (INFN-Perugia), Andrea Valentini (INFN-Perugia)

Outline
- Introduction
- Apache Hadoop framework
- Motivation
- ROOT file analysis using Hadoop MapReduce
- Results
- Summary and future work

Introduction
Hadoop is a data processing system that follows the MapReduce paradigm for scalable data analysis. The largest installation is at Facebook, a heavy Hadoop user with 30 PB of online disk. It is also used by many other companies, such as Yahoo!, Last.fm, and the New York Times.

Hadoop FS: HDFS
HDFS is currently used at some CMS Tier-2 sites and has shown satisfactory performance.
Advantages:
- Manageability: decommissioning of hardware; recovery from hardware failure
- Reliability: replication works well
- Usability: provides a POSIX-like interface through FUSE
- Scalability

Hadoop Processing: MapReduce
MapReduce is a framework for distributed data processing. It provides a simple programming model for efficient distributed computing:
- It works like a Unix pipeline: Input | Map | Sort | Reduce | Output
- It lets users plug in their own code
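The pipeline analogy above can be sketched as a few lines of local Python. This is only a model of the execution order, not Hadoop itself; word count stands in as the illustrative plugged-in user code.

```python
from itertools import groupby
from operator import itemgetter

def run_pipeline(lines, map_fn, reduce_fn):
    """Local model of Input -> Map -> Sort -> Reduce -> Output."""
    mapped = [kv for line in lines for kv in map_fn(line)]   # Map
    mapped.sort(key=itemgetter(0))                           # Sort (the shuffle)
    return [reduce_fn(key, [v for _, v in group])            # Reduce
            for key, group in groupby(mapped, key=itemgetter(0))]

# Word count as the illustrative user code plugged into the framework.
word_map = lambda line: [(word, 1) for word in line.split()]
word_reduce = lambda key, values: (key, sum(values))

print(run_pipeline(["a b a", "b a"], word_map, word_reduce))  # [('a', 3), ('b', 2)]
```

The sort between map and reduce is what guarantees that all values for one key arrive at the same reduce call, exactly as `sort` between two Unix filters would.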

Advantages of Hadoop MapReduce
- Runs MapReduce programs written in various languages
- Scalability: fine-grained Map and Reduce tasks give improved load balancing and faster recovery from failed tasks
- Locality optimizations: tasks are scheduled close to their input when possible
- Reliability: automatic resubmission on failure; the framework re-executes failed tasks
- Manageability: powerful web monitoring interface

Hadoop MapReduce + HDFS solution characteristics
- Commodity hardware: add inexpensive servers; use replication across servers to deal with unreliable storage/servers; handle task resubmission on failure
- Simple FS design: metadata/data separation
- Support for moving computation close to the data: servers have two purposes, data storage and computation
Use the MapReduce + HDFS solution to perform physics analysis in HEP

Performance test
[Charts: time in seconds to process 2 GB files with MapReduce and FUSE, comparing 1 block/1 file, 2 blocks on 1 node, and 2 blocks on 2 nodes]
- MapReduce processes 2 HDFS blocks of the same file in parallel
- Performance is the same when processing 2 blocks on the same node and 2 blocks on 2 different nodes
- MapReduce parallelizes the processing of files

Data flows
MapReduce: input → map → output. Raw text records are read (e.g. with cat *), turned into (offset, record) pairs such as (0, 1283745...) and (312, 8237257...), and the map emits (key, value) pairs such as (1950, 22), (1950, 0), (1950, 5), (1949, 11).
Analysis: map.py wraps the user analysis code (Analyzer.cpp): a ROOT binary file is read through ROOT primitives, processed by the user analysis code, and written back through ROOT primitives into a ROOT binary file.

ROOT file analysis using Hadoop MapReduce
MapReduce workflow: input is plain text files by default, split per line; the Map processes text input streams; output is plain text files by default.
CMS analysis workflow: input is ROOT files, split per file; the Analyzer opens and processes ROOT files; output is ROOT files.

ROOT file analysis execution
1. Use the dedicated InputFormat, RootFileInputFormat, developed to support the splitting and reading of ROOT files
2. Use the map.input.file property to locate the path of the input ROOT file in HDFS
3. Open the ROOT file using the HDFS plugin of ROOT (a patch was submitted to export the CLASSPATH environment variable correctly in the plugin)
4. Process the ROOT file content
5. Write the output into HDFS
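Step 2 can be sketched from the mapper side. In Hadoop streaming, job configuration properties are exported to the task environment with dots replaced by underscores, so map.input.file is visible as the map_input_file environment variable; the HDFS path below is illustrative.

```python
import os

def input_root_path():
    """Return the HDFS path of the ROOT file assigned to this map task.

    Hadoop streaming exports map.input.file to the task environment as
    map_input_file (dots become underscores in variable names).
    """
    return os.environ.get("map_input_file", "")

# Simulate the environment the TaskTracker would set up (path is illustrative).
os.environ["map_input_file"] = "hdfs://adminnode:9000/user/cms/events.root"
print(input_root_path())
```

With the path in hand, the mapper can hand it to ROOT's HDFS plugin (step 3) instead of reading anything from standard input.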

Testbed and setup description
Hadoop 0.20 is installed on a dedicated cluster of 3 machines (Debian GNU/Linux 6.0, 24 cores, 24 GB RAM and one 500 GB disk each):
- The AdminNode is configured as JobTracker, NameNode, and SecondaryNameNode
- The slave nodes are configured as TaskTrackers and DataNodes
AdminNode setup:
1. Add a hadoop group and a hadoop user in that group
2. Configure and test SSH operation on all nodes
3. Change ownership of all the release contents to permit use by the hadoop user
4. Add exports for JAVA_HOME and HADOOP_HOME to the .bashrc file
5. Uncomment the export for JAVA_HOME in hadoop-env.sh
6. Edit the file <yourhadoophomepath>/conf/core-site.xml to suit your needs
7. Create the Hadoop temporary datastore directory and change its ownership to allow use by the hadoop user
8. Test the Hadoop execution: start/stop the single-node cluster with start-all.sh/stop-all.sh
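For step 6, a minimal core-site.xml for Hadoop 0.20 might look as follows. The property names fs.default.name and hadoop.tmp.dir are the standard ones in this release; the hostname, port, and temporary directory are illustrative and must match your own AdminNode and the directory created in step 7.

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://adminnode:9000</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/app/hadoop/tmp</value>
  </property>
</configuration>
```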

Preliminary test
Read a Tree from a ROOT file and write a TH1 histogram of one of its variables into a new output ROOT file. The size of the ROOT file read is 100 MB; reads and writes into HDFS go through FUSE.
[Chart: time in seconds for MapReduce vs. PBS with 1, 10, 100, and 500 files]
FUSE input/output errors were caused by the high I/O of the jobs.

Results
Read a Tree from a ROOT file and write a TH1 histogram of one of its variables into a new output ROOT file. Reads from HDFS use ROOT's HDFS plugin; writes into HDFS use libhdfs.
[Charts: time in seconds for MapReduce vs. PBS with 100 MB/file (1, 10, 100, 500 files) and 1 GB/file (1, 5, 10, 50 files)]
- Only a little overhead is introduced by MapReduce
- MapReduce scales better with large files
- The high load on the worker machines when processing 50 files caused the abort of the PBS jobs
- ~95% of the MapReduce jobs benefit from data locality

Summary and future work
- The ROOT file analysis workflow has been implemented successfully using Hadoop
- The execution of ROOT file analysis using Hadoop has shown promising performance
- Extend the workflow to include a Reduce step in the analysis workflow (such as merging the ROOT output files)
- Perform scale tests of the analysis workflow on a large cluster
- Implement plugins to allow the deployment of a Hadoop-based Grid Computing Element

BACKUP

HDFS deployment in HEP
HDFS is currently used at some Tier-2 sites, configured as a Grid Storage Element:
- Interfaced with Globus GridFTP and BeStMan SRM
- Accessed locally through FUSE to read files using ROOT
It has shown satisfactory performance.

Hadoop Storage: HDFS

INPUT
A custom InputFormat is written to support the splitting and reading of ROOT files: RootFileInputFormat.java together with RootFileRecordReader.java. The old MapReduce API is used (bug MAPREDUCE-1122). The record reader does not read anything as input: there is one whole file per split.
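The "whole file per split" policy can be modeled in a few lines of Python (this is a sketch of the behaviour, not the Java implementation from the slides): each input file becomes exactly one split, which is what an InputFormat achieves by reporting its files as non-splittable.

```python
def get_splits(files_with_sizes):
    """One (path, start, length) split per file: files are never divided."""
    return [(path, 0, size) for path, size in files_with_sizes]

# Paths and sizes are illustrative.
splits = get_splits([("hdfs:///data/a.root", 2 * 10**9),
                     ("hdfs:///data/b.root", 5 * 10**8)])
print(len(splits))  # 2
```

Because a split always starts at offset 0 and covers the full file, a single map task sees one complete ROOT file, which is what the Analyzer needs in order to open it.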

OUTPUT
HDFS does not support random writes: it is designed for a write-once-read-many access model. The output is therefore written to /tmp on the worker node and then copied into HDFS using libhdfs. Only successfully produced ROOT files are written into HDFS, avoiding useless load on the FS, at the cost of writing the output file twice.
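The write-locally-then-publish pattern above can be sketched as follows. The helper names and the HDFS path are illustrative; the publish callback stands in for the actual libhdfs copy, and publishing is skipped automatically when the analysis step fails.

```python
import os
import tempfile

def write_then_publish(produce, publish, hdfs_dest):
    """Write the ROOT output locally first; publish to HDFS only on success."""
    local = tempfile.NamedTemporaryFile(suffix=".root", delete=False).name
    try:
        produce(local)              # the analysis writes the output file here
        publish(local, hdfs_dest)   # e.g. a libhdfs copy into HDFS
    finally:
        os.remove(local)            # the local copy is always cleaned up

published = []
write_then_publish(lambda path: open(path, "w").write("payload"),
                   lambda src, dst: published.append(dst),
                   "hdfs:///user/cms/out.root")
print(published)  # ['hdfs:///user/cms/out.root']
```

If produce raises, the exception propagates before publish is called, so a failed job leaves nothing in HDFS, which is exactly the "only successfully produced files" guarantee.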