Use of Hadoop File System for Nuclear Physics Analyses in STAR



Transcription:

Slide 1: Use of Hadoop File System for Nuclear Physics Analyses in STAR
Evan Sangaline, UC Davis

Slide 2: Motivations
- Data storage is a key component of analysis requirements
  - Transmission and storage across diverse resources
  - Large quantities of data
- XRootD offers a robust solution for STAR and other experiments
  - Works well but is not designed for dynamic configuration
  - Utilization of resources on an as-available basis
  - Difficult to deploy on temporary or changing cloud resources
- The Hadoop File System offers a possible alternative
  - Strong track record of performance in dynamic environments

Slide 3: XRootD Performance
[Figure: XRootD read rates]
- ROOT analysis jobs
- Same hardware as the test nodes
- Mean read rate of 13.5 MB/s
- Baseline for comparing Hadoop performance

Slide 4: XRootD Performance: Structure
- Same analysis task
  - Filling two histograms
- Interesting structure
  - Due to different classes of files
  - Different triggers, etc.
  - Different analyses would affect the structure as well

Slide 5: Test Bed for Performance Evaluation
- A cluster of 25 virtual nodes was constructed
  - OS: CentOS 5.9
  - Hypervisor: Xen
  - Storage: four 2 TB drives in a RAID 5 array
  - RAM: 4 GB
  - CPU: 1 core of a dual-core 1.8 GHz AMD Opteron processor
- Hadoop 1.1.2 was deployed with a 64 MB block size
- Three methods of access
  - FUSE DFS interface
  - Cloudera NFS Proxy
  - Direct Hadoop client

Slide 6: VM Overhead in Disk Access
[Figure panels: Outside VM, In VM]
- Small overhead for writes
- More significant for reads
- Separate issue from Hadoop vs. disk or XRootD
  - All comparisons done within a consistent environment

Slide 7: What is Hadoop?
- Apache open-source framework in Java
  - Based on Google's MapReduce and Google File System papers
  - Allows data to be stored redundantly across many nodes
  - A single job can be split across the nodes to analyze data in parallel
  - Rack awareness allocates blocks and jobs intelligently
- Hadoop File System
  - Designed for sequential reading, usually of text
  - No direct POSIX interface
    - FUSE DFS and Cloudera NFS Proxy
  - Easy to configure and to add or remove nodes
  - Robust against node failure

Slide 8: Write Rates for Various File Sizes
- FUSE DFS and NFS Proxy have similar performance
  - NFS Proxy writes break for files larger than 300 MB
- Hadoop client has high overhead for small files but does well with larger ones
- Low additional upfront cost for higher replication
- Hadoop reaches ~85% of the disk rate for large files

Slide 9: Read Rates for Various File Sizes
- Only a small gain in read rates for higher replication
- NFS rates are very low and break for large file sizes
- Hadoop is again near 85% of the disk rate for large files
- Large overhead for the Hadoop client for small files

Slide 10: General Performance
- For transferring large amounts of data into the cluster
  - Hadoop client allows for ~80% of local write performance
  - Replication comes basically for free
- Read rates through FUSE DFS are ~50% of local read rates
  - Might not matter for CPU-bottlenecked analyses
- FUSE DFS outperforms NFS Proxy
  - NFS Proxy has issues with larger files
  - Not a viable candidate
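Read and write rates like the ones quoted above can be measured with a simple sequential I/O timer. The following is a minimal sketch, not the benchmark used in the talk: the function names, the 4 MB chunk size, and the use of `os.fsync` are assumptions made for illustration. It works on any POSIX-style path, so it could be pointed at a local directory or, in principle, at a FUSE DFS or NFS mount point (whether `fsync` behaves as expected there is not something the slides address).

```python
import os
import time

MB = 1024 * 1024


def write_rate_mb_s(path, size_mb, chunk_mb=4):
    """Write size_mb megabytes of zeros to path; return the rate in MB/s."""
    chunk = b"\0" * (chunk_mb * MB)
    written = 0
    start = time.perf_counter()
    with open(path, "wb") as f:
        while written < size_mb:
            n = min(chunk_mb, size_mb - written)
            f.write(chunk[: n * MB])
            written += n
        f.flush()
        os.fsync(f.fileno())  # count the time needed to reach the storage layer
    elapsed = time.perf_counter() - start
    return size_mb / elapsed


def read_rate_mb_s(path, chunk_mb=4):
    """Read path sequentially; return the rate in MB/s."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk_mb * MB)
            if not data:
                break
            total += len(data)
    elapsed = time.perf_counter() - start
    return total / MB / elapsed
```

Reading a file immediately after writing it will usually be served from the operating system's page cache, so a fair read test needs files that were not recently touched on the client.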

Slide 11: What is MapR?
- A commercial distribution based on Apache Hadoop
- Uses a custom file system
  - Accesses partitions directly, not through another file system
  - Stripes across multiple partitions, analogous to RAID 5
- Includes an NFS proxy
  - Only for one node in the free version (M3)
  - Multiple nodes require a commercial license (M5)
- Good documentation and packages for installation

Slide 12: Multiple Clients Reading One File
[Figure: 1 GB file]
- Little dependence on replication
- MapR is fastest, but FUSE DFS and NFS Proxy are comparable
  - Likely local caching
- Total throughput for the Hadoop client flattens earlier
  - No local caching

Slide 13: Multiple Clients Reading One File
[Figure: 1 GB file]
- Little dependence on replication
- MapR is nearly twice as fast as Hadoop for many clients
- Total throughput scales almost linearly, even for replication one
  - 64 MB blocks are stored on separate machines
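The near-linear scaling at replication one is easier to see with the block arithmetic written out. The sketch below assumes 1 GB means 1024 MB and that blocks are spread over distinct nodes where possible; actual placement is decided by the file system, so the second function gives only an upper bound. The function names are illustrative.

```python
import math

BLOCK_SIZE_MB = 64  # block size quoted for the test bed


def block_count(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Number of blocks needed to hold a file of the given size."""
    return math.ceil(file_size_mb / block_size_mb)


def max_nodes_serving(file_size_mb, replication, n_nodes,
                      block_size_mb=BLOCK_SIZE_MB):
    """Upper bound on the number of distinct nodes holding part of the file."""
    replicas = block_count(file_size_mb, block_size_mb) * replication
    return min(n_nodes, replicas)


print(block_count(1024))               # 16 blocks for a 1 GB file
print(max_nodes_serving(1024, 1, 25))  # at most 16 of the 25 nodes
print(max_nodes_serving(1024, 3, 25))  # at most all 25 nodes
```

With 16 blocks available, clients reading different parts of the same file can be served by different machines even when each block has only one copy.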

Slide 14: Actual Analysis Task
- Run over a 1 GB ROOT file containing a TTree of event and track data
- Fill histograms with different track and event quantities
- Comparable to the XRootD tests, but with a simpler tree structure

Slide 15: Analysis Rates

Access method    Read rate (MB/s)
FUSE DFS         9.9 +/- 0.1
MapR             3.1 +/- 0.1
Disk             9.7 +/- 0.6

- FUSE DFS performance is consistent with local disk access
  - Comparable to XRootD performance (13.5 MB/s)
- MapR tests ran at ~30% of the speed of disk or FUSE DFS
- NFS Proxy was unstable
- Running ROOT with libhdfs support was attempted
  - Issues with consistency, progress ongoing
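The "consistent with local disk" and "~30%" statements can be checked by taking ratios of the quoted rates. The sketch below assumes the +/- values are independent one-sigma uncertainties and propagates them in quadrature; the slides do not say how the uncertainties were obtained, so this is an illustration and not the author's calculation.

```python
import math


def ratio_with_error(a, da, b, db):
    """Return a/b and its uncertainty, assuming independent errors."""
    r = a / b
    dr = r * math.sqrt((da / a) ** 2 + (db / b) ** 2)
    return r, dr


# Read rates in MB/s as quoted on the slide: (value, uncertainty)
fuse = (9.9, 0.1)
mapr = (3.1, 0.1)
disk = (9.7, 0.6)

fuse_ratio, fuse_err = ratio_with_error(*fuse, *disk)
mapr_ratio, mapr_err = ratio_with_error(*mapr, *disk)

print(f"FUSE DFS / disk = {fuse_ratio:.2f} +/- {fuse_err:.2f}")
print(f"MapR / disk     = {mapr_ratio:.2f} +/- {mapr_err:.2f}")
```

Under these assumptions the FUSE DFS ratio comes out at about 1.02 +/- 0.06, which is compatible with 1, and the MapR ratio at about 0.32 +/- 0.02.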

Slide 16: Summary
- Reasonable performance for importing data files
  - 80% of the rate of writing to local disk using the Hadoop client
  - Little to no initial penalty for higher replication
- FUSE DFS performs as well as local disk in a typical ROOT analysis
  - 9.9 MB/s
- Total throughput of reads scales almost linearly
- MapR outperforms Hadoop at sequential reads but not in the analysis test
- NFS Proxy has issues with larger files and is not a robust option

Slide 17: Conclusions
- HDFS performs well for importing data files and for reading files for analysis through FUSE DFS
- It is easy to configure and to update dynamically
- It is well documented and is an active project
- Overall, it seems a well-suited alternative to XRootD for use in dynamic cloud environments
- More study of high-concurrency scaling is needed