Performance Analysis of Mixed Distributed Filesystem Workloads




Esteban Molina-Estolano, Maya Gokhale, Carlos Maltzahn, John May, John Bent, Scott Brandt

Motivation
- Hadoop-tailored filesystems (e.g. CloudStore) and high-performance computing filesystems (e.g. PVFS) are tailored to considerably different workloads
- Existing investments in HPC systems and Hadoop systems should be usable for both workloads
- Avoid dedicating separate hardware for each type of workload
- Goal: examine the performance of both types of workloads running concurrently on the same filesystem
- Goal: collect I/O traces from concurrent workload runs, for parallel filesystem simulator work

MapReduce-oriented filesystems
- Large-scale batch data processing and analysis
- Single cluster of unreliable commodity machines for both storage and computation
- Data locality is important for performance
- Examples: Google FS, Hadoop DFS, CloudStore

Hadoop DFS architecture
[Figure: Hadoop DFS architecture diagram. Source: http://hadoop.apache.org/]
Lawrence Livermore National Laboratory

High-Performance Computing filesystems
- High-throughput, low-latency workloads
- Architecture: separate compute and storage clusters, high-speed bridge between them
- Typical workload: simulation checkpointing
- Examples: PVFS, Lustre, PanFS

Running each workload on the non-native filesystem
- Two-sided problem: running HPC workloads on a Hadoop filesystem, and Hadoop workloads on an HPC filesystem
- Different interfaces:
  - HPC workloads need a POSIX-like interface and shared writes
  - Hadoop is write-once-read-many
- Different data layout policies

Running HPC workloads on a Hadoop filesystem
- Chosen filesystem: CloudStore
- Downside of Hadoop's HDFS: no support for shared writes (needed for HPC N-1 workloads)
- CloudStore has an HDFS-like architecture and shared write support

Running Hadoop workloads on an HPC filesystem
- Chosen HPC filesystem: PVFS
- PVFS is open-source and easy to configure
- Tantisiriroj et al. at CMU have created a shim to run Hadoop on PVFS
- The shim also adds prefetching and buffering, and exposes data layout

The two concurrent workloads
- IOR checkpointing workload
  - Writes large amounts of data to disk from many clients
  - N-1 and N-N write patterns
- Hadoop MapReduce HTTP attack classifier (TFIDF)
  - Using a pre-generated attack model, classify HTTP headers as normal traffic or attack traffic
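The slides do not show how the TFIDF classifier is implemented. As a rough illustration only, the sketch below weights header tokens by TF-IDF and compares them to an attack model using cosine similarity. The tokenization, the `idf` table, the `attack_model` vector, and the `threshold` value are all hypothetical and are not taken from the original work.

```python
import math
from collections import Counter

def tfidf_vector(tokens, idf):
    """Weight each token's relative frequency by its inverse document frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {tok: (n / total) * idf.get(tok, 0.0) for tok, n in counts.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(tok, 0.0) for tok, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def classify(header, idf, attack_model, threshold=0.5):
    """Label a header 'attack' if it is close enough to the attack model."""
    vec = tfidf_vector(header.lower().split(), idf)
    return "attack" if cosine(vec, attack_model) >= threshold else "normal"

# Hypothetical pre-generated model: IDF weights and an attack term vector.
idf = {"select": 2.0, "union": 2.5, "get": 0.1, "host:": 0.1}
attack_model = {"select": 1.0, "union": 1.0}

print(classify("GET /index.html Host: example.org", idf, attack_model))
print(classify("GET /?q=1 UNION SELECT password", idf, attack_model))
```

In a MapReduce setting, a function like `classify` would typically run inside the map step, one header per record, which is why input data locality matters for this workload.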

Tracing infrastructure
- We gather traces to use for our parallel filesystem simulator
- Existing tracing mechanisms (e.g. strace, Pianola, Darshan) don't work well with Java or CloudStore
- Solution: our own tracing mechanisms for IOR and Hadoop

Tracing IOR workloads
- Trace shim intercepts I/O calls, sends to stdio
[Figure: IOR tracing diagram]
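The idea of a shim that intercepts I/O calls and emits trace records can be sketched as a thin wrapper around a file object. This is only an illustration of the general technique, not the authors' shim: the class name `TracedFile` and the record format (operation, file name, offset, size, elapsed time) are assumptions.

```python
import io
import sys
import time

class TracedFile:
    """Wraps a file-like object and prints one trace record per read or write."""

    def __init__(self, raw, name, out=sys.stdout):
        self._raw = raw    # the underlying file object doing the real I/O
        self._name = name  # label used in trace records
        self._out = out    # where trace records are written

    def _record(self, op, offset, size, start):
        elapsed = time.monotonic() - start
        print(f"{op} {self._name} offset={offset} size={size} "
              f"elapsed={elapsed:.6f}", file=self._out)

    def write(self, data):
        offset = self._raw.tell()
        start = time.monotonic()
        written = self._raw.write(data)
        self._record("write", offset, written, start)
        return written

    def read(self, size=-1):
        offset = self._raw.tell()
        start = time.monotonic()
        data = self._raw.read(size)
        self._record("read", offset, len(data), start)
        return data

    def seek(self, pos, whence=io.SEEK_SET):
        # Seeks are passed through without a trace record in this sketch.
        return self._raw.seek(pos, whence)

# Usage: trace a small write and read against an in-memory file.
f = TracedFile(io.BytesIO(), "ckpt.0")
f.write(b"abcd")
f.seek(0)
f.read(2)
```

Because the wrapper exposes the same methods as the object it wraps, the traced application needs no changes beyond how the file is opened.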

Tracing Hadoop
- Tracing shim wraps filesystem interfaces, sends I/O calls to Hadoop logs
[Figure: Hadoop tracing diagram]

Experimental Setup
- System: 19 nodes, 2-core 2.4 GHz Xeon, 120 GB disks
- IOR baseline: N-1 strided workload, 64 MB chunks
- IOR baseline: N-N workload, 64 MB chunks
- TFIDF baseline: classify 7.2 GB of HTTP headers
- Mixed workloads: IOR N-1 and TFIDF, IOR N-N and TFIDF
- Checkpoint size adjusted to make IOR and TFIDF take the same amount of time
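The difference between the N-1 strided and N-N write patterns can be made concrete as an offset calculation. The sketch below follows the usual meaning of these terms: in N-1 strided, all clients interleave fixed-size chunks in one shared file, while in N-N each client writes its own file. The 64 MB chunk size comes from the setup. The function names, the `checkpoint.<rank>` file naming, and the client count in the example are hypothetical.

```python
CHUNK = 64 * 1024 * 1024  # 64 MB chunks, as in the experimental setup

def n1_strided_offset(rank, i, nclients, chunk=CHUNK):
    """Byte offset in the single shared file for client `rank`'s i-th chunk.

    Chunks from all clients are interleaved round-robin, so consecutive
    chunks from one client are `nclients * chunk` bytes apart.
    """
    return (i * nclients + rank) * chunk

def nn_target(rank, i, chunk=CHUNK):
    """(file name, byte offset) for client `rank`'s i-th chunk in N-N mode.

    Each client writes sequentially to its own file (hypothetical naming).
    """
    return (f"checkpoint.{rank}", i * chunk)

# Example with a hypothetical 4 clients writing 2 chunks each.
for rank in range(4):
    offsets = [n1_strided_offset(rank, i, nclients=4) // CHUNK for i in range(2)]
    print(f"client {rank}: N-1 chunk slots {offsets}, N-N file {nn_target(rank, 0)[0]}")
```

The N-1 pattern is the one that needs shared-write support from the filesystem, since every client writes into the same file concurrently.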

Naive performance predictions
- Each workload will perform better on its native filesystem
- Each workload will be slowed down considerably in the mixed experiments

Experimental results
[Figure: TFIDF classification throughput, standalone and with IOR. Y-axis: classification throughput (MB/s), 0 to 20. Bars for CloudStore and PVFS: baseline, with IOR N-1, with IOR N-N.]

Experimental results
[Figure: IOR checkpointing on CloudStore and IOR checkpointing on PVFS. Y-axis: write throughput (MB/s), 0 to 90. Bars for N-1 and N-N: standalone, mixed.]

Experimental Results
[Figure: Runtime comparison of mixed vs. sequential workloads. Y-axis: runtime (seconds), 0 to 1000. Bars for PVFS N-1, PVFS N-N, CloudStore N-1, CloudStore N-N: sequential runtime, mixed runtime.]
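One way to read the runtime comparison of mixed vs. sequential workloads is as simple arithmetic: running the two workloads back to back costs the sum of their standalone runtimes, while running them together costs the runtime of whichever slowed-down workload finishes last. The sketch below encodes that model. The runtimes and slowdown factors in the example are hypothetical placeholders, not measured values from the experiments.

```python
def sequential_runtime(t_ior, t_tfidf):
    """Total elapsed time when the workloads run one after the other."""
    return t_ior + t_tfidf

def mixed_runtime(t_ior, t_tfidf, slow_ior, slow_tfidf):
    """Total elapsed time when both run concurrently.

    Each standalone runtime is stretched by its slowdown factor (>= 1.0);
    the run ends when the slower of the two finishes.
    """
    return max(t_ior * slow_ior, t_tfidf * slow_tfidf)

# Hypothetical example: both workloads take 300 s standalone
# (the setup tunes them to take equal time).
seq = sequential_runtime(300, 300)
mix = mixed_runtime(300, 300, slow_ior=1.5, slow_tfidf=1.2)
print(f"sequential: {seq} s, mixed: {mix} s")
```

Under this model, with equal standalone runtimes, the mixed run has the shorter total elapsed time whenever neither workload is slowed by a factor of 2 or more, even though each individual workload finishes later than it would alone.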

Conclusions
- Developed I/O tracing mechanisms for IOR benchmarks and Hadoop MapReduce
- Analyzed performance of mixed MapReduce and HPC benchmarking workloads on PVFS and CloudStore
- TFIDF on PVFS is barely slowed down by IOR
- All other mixed workloads are significantly slowed
- If only total elapsed time matters, the mixed workloads are faster than running the two workloads sequentially
- Future work: use experimental results to improve the parallel filesystem simulator

Questions?