Performance Comparison of Intel Enterprise Edition for Lustre* software and HDFS for MapReduce Applications


Performance Comparison of Intel Enterprise Edition for Lustre software and HDFS for MapReduce Applications Rekha Singhal, Gabriele Pacciucci and Mukesh Gangadhar

Hadoop Introduction
- Open source MapReduce framework for data-intensive computing
- Simple programming model with two functions: Map and Reduce
  - Map: transforms input into a list of key-value pairs: Map(D) → List[Ki, Vi]
  - Reduce: given a key and all associated values, produces a result in the form of a list of values: Reduce(Ki, List[Vi]) → List[Vo]
- Parallelism is hidden by the framework
- Highly scalable: can be applied to large datasets (Big Data) and run on commodity clusters
- Comes with its own user-space distributed file system (HDFS) based on the local storage of cluster nodes
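The two functions above can be sketched as a minimal in-memory word count in plain Java, with no Hadoop dependency; the class and method names here are illustrative only, and the grouping step stands in for what the framework does between Map and Reduce.

```java
import java.util.*;

// Minimal sketch of the MapReduce programming model (word count).
public class MiniMapReduce {

    // Map(D) -> List[Ki, Vi]: turn one input record into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String record) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String w : record.split("\\s+"))
            if (!w.isEmpty()) pairs.add(Map.entry(w, 1));
        return pairs;
    }

    // Reduce(Ki, List[Vi]) -> List[Vo]: sum all counts for one word.
    static List<Integer> reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return List.of(sum);
    }

    // The framework's role: map every record, group the pairs by key
    // (the shuffle), then reduce each key's value list.
    static Map<String, List<Integer>> run(List<String> records) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String r : records)
            for (Map.Entry<String, Integer> kv : map(r))
                grouped.computeIfAbsent(kv.getKey(), k -> new ArrayList<>())
                       .add(kv.getValue());
        Map<String, List<Integer>> out = new TreeMap<>();
        grouped.forEach((k, vs) -> out.put(k, reduce(k, vs)));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("to be or not to be")));
    }
}
```

In a real Hadoop job the `run` loop is replaced by the framework, which executes mappers and reducers on different nodes; only `map` and `reduce` are user code.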

Hadoop Introduction (cont.)
[Diagram: input splits feed parallel mappers; map outputs are shuffled to reducers, which write the outputs]
The framework handles most of the execution:
- Splits the input logically and feeds the mappers
- Partitions and sorts map outputs (Collect)
- Transports map outputs to the reducers (Shuffle)
- Merges the output obtained from each mapper (Merge)
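The "partitions map outputs" step assigns each key to one of R reducers. The one-liner below mirrors the scheme Hadoop's default HashPartitioner uses (clear the sign bit of the key's hash, then take it modulo the number of reducers); the class name is illustrative.

```java
// Sketch of the partition step: route a map-output key to a reducer.
public class HashPartition {
    static int partition(String key, int numReducers) {
        // Mask the sign bit so the result is non-negative, then mod by R.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        // Every occurrence of the same key lands on the same reducer.
        System.out.println(partition("issue_symbol", 7));
    }
}
```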

MapReduce Application Processing

Intel Enterprise Edition for Lustre software vs. Hadoop Distributed File System
[Diagram: HDFS nodes with local disks connected by a network switch, versus Lustre compute nodes connected through a network switch to OSS servers and their OSTs]

Intel Enterprise Edition for Lustre software:
- Clustered, distributed computing and storage
- No data replication
- No local storage
- Widely used for HPC applications

Hadoop Distributed File System:
- Data moves to the computation
- Data replication
- Local storage
- Widely used for MR applications

Motivation
- Could HPC and MR workloads co-exist?
- Need to evaluate the use of Lustre software for MR application processing

Using Intel Enterprise Edition for Lustre Software with Hadoop: HADOOP ADAPTER FOR LUSTRE

Hadoop over Intel EE for Lustre: Implementation
- Hadoop uses pluggable extensions to work with different file system types
- Lustre is POSIX compliant: use Hadoop's built-in LocalFileSystem class, which uses native file system support in Java
- Extend and override the default behavior in a new class, LustreFileSystem, which:
  - Defines a new URL scheme for Lustre: lustre:///
  - Controls Lustre striping info
  - Resolves absolute paths to a user-defined directory
  - Leaves room for future enhancements
- Class hierarchy: org.apache.hadoop.fs.FileSystem → RawLocalFileSystem → LustreFileSystem
- Configuration files allow Hadoop to find the new class
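The last point can be made concrete with a hedged sketch: in Hadoop of this era (CDH 5), a file system implementation is registered for a URL scheme via an `fs.<scheme>.impl` property in core-site.xml. The property name below follows that convention; the fully qualified package of LustreFileSystem is an assumption, since the slide gives only the class name.

```xml
<!-- core-site.xml: register the Lustre adapter for the lustre:/// scheme.
     The package of LustreFileSystem is assumed for illustration. -->
<property>
  <name>fs.lustre.impl</name>
  <value>org.apache.hadoop.fs.LustreFileSystem</value>
</property>
```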

MR Processing in Intel EE for Lustre and HDFS
[Diagram: the map → sort → spill → merge → shuffle → merge → reduce pipeline. With Lustre, input, spill files, and output all live in the global namespace, with an N:1 spill-file-to-reducer ratio; with HDFS, map outputs spill repeatedly to local disk before being shuffled to the reducers]

Conclusions from Existing Evaluations
- TestDFSIO: 100% better throughput
- TeraSort: 10-15% better performance
- A high-speed interconnect is needed
- For the same BOM, HDFS is better for the WordCount and BigMapOutput applications
- A large number of compute nodes may challenge Intel Enterprise Edition for Lustre software performance

Problem Definition
Performance comparison of the Lustre and HDFS file systems for an MR implementation of an FSI workload, using an HPDD cluster hosted in the Intel Big Data Lab in Swindon (UK) running Intel Enterprise Edition for Lustre software. The Consolidated Audit Trail system, part of the FINRA security specifications (publicly available), is used as a representative application.

EXPERIMENTAL SETUP

Hadoop + HDFS Setup
- 1 cluster manager, 1 NameNode (NN), 8 DataNodes (DN) including the NN
- 8 nodes, each with an Intel(R) Xeon(R) CPU E5-2695 v2 @ 2.40GHz; 320GB of cluster RAM
- 27TB of cluster storage
- 10Gb network among compute nodes
- Red Hat 6.5, CDH 5.0.2 and HDFS

Hadoop + Intel EE for Lustre Software Setup
- 1 Resource Manager (RM), 1 History Server (HS), 8 Node Managers (NM) including the RM and HS
- 8 nodes, each with an Intel(R) Xeon(R) CPU E5-2695 v2 @ 2.40GHz; 320GB of cluster RAM
- 165TB of usable Lustre storage
- 10Gb network among compute nodes
- Red Hat 6.5, CDH 5.0.2, Intel Enterprise Edition for Lustre software 2.0

Intel Enterprise Edition for Lustre 2.0 Setup
- Four OSSs, one MDS, 16 OSTs, 1 MDT
- OSS node:
  - CPU: Intel(R) Xeon(R) CPU E5-2637 v2 @ 3.50GHz; memory: 128GB DDR3 1600MHz
  - Disk subsystem: 4x LSI Logic / Symbios Logic MegaRAID SAS 2108 [Liberator] (rev 05) controllers; 4x 4TB SATA drives per controller, in a RAID 5 configuration per RAID set
  - 4 OSTs per OSS node

Cluster Parameters
- Number of compute nodes = 8
- Map slots = 24
- Reduce slots = 7
- Other parameters such as shuffle percent, merge percent, and sort buffer are kept at their defaults
- HDFS replication factor = 3
- Intel EE for Lustre software stripe count = 1, 4, 16; stripe size = 4MB

Job Configuration Parameters
- Map split size = 1GB
- Block size = 128MB
- Input data is NOT compressed
- Output data is NOT compressed

Workload
- Consolidated Audit Trail system (part of the FINRA application)
- DB schema: a single table with 12 columns related to share orders
- Data consolidation query: print share order details for share orders within a date range.

SELECT issue_symbol, orf_order_id, orf_order_received_ts
FROM default.rt_query_extract
WHERE issue_symbol like 'XLP'
AND from_unixtime(cast((orf_order_received_ts/1000) as BIGINT),'yyyy-MM-ddhh:ii:ss') >= "2014-06-26 23:00:00"
AND from_unixtime(cast((orf_order_received_ts/1000) as BIGINT),'yyyy-MM-ddhh:ii:ss') <= "2014-06-27 11:00:00";
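A hedged sketch of the query's WHERE clause as a map-side filter: the column names follow the query, and the fact that orf_order_received_ts is in epoch milliseconds follows from the query's division by 1000. The time zone (UTC here) is an assumption, since the slide does not state one.

```java
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Sketch of the data-consolidation query's predicate as it might run
// inside a map function; names mirror the SQL, layout details assumed.
public class OrderFilter {
    static final DateTimeFormatter F =
            DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");

    // Parse a "yyyy-MM-dd HH:mm:ss" string to epoch milliseconds (UTC assumed).
    static long toEpochMillis(String ts) {
        return LocalDateTime.parse(ts, F).toInstant(ZoneOffset.UTC).toEpochMilli();
    }

    // Keep a record only if the symbol matches and the order-received
    // timestamp (epoch ms) falls within the query's date range.
    static boolean accept(String issueSymbol, long orfOrderReceivedTs) {
        return "XLP".equals(issueSymbol)
                && orfOrderReceivedTs >= toEpochMillis("2014-06-26 23:00:00")
                && orfOrderReceivedTs <= toEpochMillis("2014-06-27 11:00:00");
    }

    public static void main(String[] args) {
        System.out.println(accept("XLP", toEpochMillis("2014-06-27 01:00:00")));
    }
}
```

In the MR implementation, the map function would tokenize each flat-file record, apply a filter like this, and emit the selected columns; the reduce side would consolidate the matches.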

Workload Implementation
- The DB is a flat file with columns separated by a token
- A data generator produces data for the DB
- A tool runs queries concurrently
- The query is implemented as Map and Reduce functions

Workload Size
- Concurrency tests:
  - Query in isolation, concurrency = 1
  - Query in a concurrent workload, concurrency = 5; think time = 10% of the query's execution time in isolation
- Data sizes: 100GB, 500GB, 1TB and 7TB

Performance Metrics
- MR job execution time in isolation
- MR job average execution time in a concurrent workload
- CPU, disk and memory utilization of the cluster

Performance Measurement
- SAR data is collected from all nodes in the cluster
- MapReduce job log files are used for performance analysis
- Intel EE for Lustre software node performance data is collected using Intel Manager
- Hadoop performance data is collected using Intel Manager

Benchmarking Steps
For each data size:
1. Generate data of the given size
2. Copy the data to HDFS
3. For each concurrency level:
   a. Start the MR job of the query on the Name node with the given concurrency
   b. On completion of the job, collect logs and performance data

RESULT ANALYSIS

Degree of Concurrency = 1: Intel EE for Lustre performs better at larger stripe counts

Degree of Concurrency = 1: Intel EE for Lustre delivered 3X HDFS performance at the optimal stripe count (SC) settings

Degree of Concurrency = 1: the optimal Intel EE for Lustre SC gives a 70% improvement over HDFS

Hadoop + HDFS Setup vs. Hadoop + Intel EE for Lustre Software Setup
- HDFS: nodes = 8; performance linearly extrapolated to nodes = 13
- Intel EE for Lustre: nodes = 8 compute + 5 Lustre server nodes = 13

Number of Compute Servers = 13: Intel EE for Lustre is 2X better than HDFS for the same BOM

Degree of Concurrency = 5: Intel EE for Lustre was 5.5 times better than HDFS at the 7TB data size


Concurrent Job Average Execution Time / Single Job Execution Time = 2.5

Concurrent Job Average Execution Time / Single Job Execution Time = 4.5

Intel EE for Lustre software > HDFS under concurrency

Conclusion
- Increasing the stripe count improves Intel Enterprise Edition for Lustre software performance
- Intel EE for Lustre shows better performance for concurrent workloads
- Intel EE for Lustre software = 3X HDFS for a single job
- Intel EE for Lustre software = 5.5X HDFS for a concurrent workload
- Future work: impact of a large number of compute nodes (i.e. OSSs <<<< nodes)

Thank You rekha.singhal@tcs.com