Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems

Similar documents
Performance Comparison of Intel Enterprise Edition for Lustre* software and HDFS for MapReduce Applications

Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013

Lustre * Filesystem for Cloud and Hadoop *

Hadoop* on Lustre* Liu Ying High Performance Data Division, Intel Corporation

Deploying Hadoop on Lustre Sto rage:

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

GraySort on Apache Spark by Databricks

CSE-E5430 Scalable Cloud Computing Lecture 2

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Hadoop Architecture. Part 1

A Performance Analysis of Distributed Indexing using Terrier

MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

New Storage System Solutions

Big Fast Data Hadoop acceleration with Flash. June 2013

GeoGrid Project and Experiences with Hadoop

GraySort and MinuteSort at Yahoo on Hadoop 0.23

A Brief Outline on Bigdata Hadoop

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Implement Hadoop jobs to extract business value from large and varied data sets

Hadoop & its Usage at Facebook

Hadoop Job Oriented Training Agenda

Hadoop IST 734 SS CHUNG

Data Warehousing and Analytics Infrastructure at Facebook. Ashish Thusoo & Dhruba Borthakur athusoo,dhruba@facebook.com

An Alternative Storage Solution for MapReduce. Eric Lomascolo Director, Solutions Marketing

Experiences with Lustre* and Hadoop*

Using distributed technologies to analyze Big Data

Reduction of Data at Namenode in HDFS using harballing Technique

Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Deploying Cloudera CDH (Cloudera Distribution Including Apache Hadoop) with Emulex OneConnect OCe14000 Network Adapters

Hadoop. History and Introduction. Explained By Vaibhav Agarwal

THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

Chapter 7. Using Hadoop Cluster and MapReduce

Testing Big data is one of the biggest

An Oracle White Paper June High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

HDFS. Hadoop Distributed File System

Understanding Hadoop Performance on Lustre

Hadoop Distributed File System. Dhruba Borthakur Apache Hadoop Project Management Committee

Data-Intensive Programming. Timo Aaltonen Department of Pervasive Computing

MapReduce and Lustre * : Running Hadoop * in a High Performance Computing Environment

Dell Reference Configuration for Hortonworks Data Platform

Hadoop & its Usage at Facebook

Evaluating MapReduce and Hadoop for Science

Performance Analysis of Mixed Distributed Filesystem Workloads

COURSE CONTENT Big Data and Hadoop Training

Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013

A very short Intro to Hadoop

Hadoop and ecosystem * 本 文 中 的 言 论 仅 代 表 作 者 个 人 观 点 * 本 文 中 的 一 些 图 例 来 自 于 互 联 网. Information Management. Information Management IBM CDL Lab

Testing 3Vs (Volume, Variety and Velocity) of Big Data

Commoditisation of the High-End Research Storage Market with the Dell MD3460 & Intel Enterprise Edition Lustre

Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

HPC and Big Data. EPCC The University of Edinburgh. Adrian Jackson Technical Architect

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

Large Scale Text Analysis Using the Map/Reduce

OLH: Oracle Loader for Hadoop OSCH: Oracle SQL Connector for Hadoop Distributed File System (HDFS)

HADOOP PERFORMANCE TUNING

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Duke University

Big Data Use Case. How Rackspace is using Private Cloud for Big Data. Bryan Thompson. May 8th, 2013

Enabling High performance Big Data platform with RDMA

HADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW

THE HADOOP DISTRIBUTED FILE SYSTEM

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

System Requirements Table of contents

IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE

DEPLOYING AND MONITORING HADOOP MAP-REDUCE ANALYTICS ON SINGLE-CHIP CLOUD COMPUTER

How To Scale Out Of A Nosql Database

CS380 Final Project Evaluating the Scalability of Hadoop in a Real and Virtual Environment

Quick Deployment Step-by-step instructions to deploy Oracle Big Data Lite Virtual Machine

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

Intel HPC Distribution for Apache Hadoop* Software including Intel Enterprise Edition for Lustre* Software. SC13, November, 2013

Big Data in the Enterprise: Network Design Considerations

Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012

Hadoop s Entry into the Traditional Analytical DBMS Market. Daniel Abadi Yale University August 3 rd, 2010

Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Big Data Management and Security

Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Hadoop Cluster Applications

The Greenplum Analytics Workbench

HP reference configuration for entry-level SAS Grid Manager solutions

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage

Platfora Big Data Analytics

Architecting a High Performance Storage System

Performance and Scalability Overview

Big Data and Analytics: A Conceptual Overview. Mike Park Erik Hoel

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

Accelerating and Simplifying Apache

Oracle s Big Data solutions. Roger Wullschleger. <Insert Picture Here>

Cloud Computing at Google. Architecture

Constructing a Data Lake: Hadoop and Oracle Database United!

Hadoop. Apache Hadoop is an open-source software framework for storage and large scale processing of data-sets on clusters of commodity hardware.

Mixing Hadoop and HPC Workloads on Parallel Filesystems

Big Data and Apache Hadoop s MapReduce

BIG DATA What it is and how to use?

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Hadoop and its Usage at Facebook. Dhruba Borthakur June 22 rd, 2009

Transcription:

Performance Comparison of SQL based Big Data Analytics with Lustre and HDFS file systems Rekha Singhal and Gabriele Pacciucci * Other names and brands may be claimed as the property of others.

Lustre File System Hadoop Dist. File System Node Node Node Disk Disk Disk Network Switch Node Node Node OSS OSS OSS Network Switch Parallel File System No data replication No local storage Widely used for HPC applications Distributed File System Data replication Local storage Widely used for MR applications * Other names and brands may be claimed as the property of others.

Hive + Hadoop Architecture SQL1 SQL2 SQL3 SQL4 SQL Hive Server Stage 1: Fetch Hadoop (MR) HDFS/Lustre Stage 1: MR1 Stage 3: MR3 Stage 2: MR2 3

4 Hive+Hadoop Open source SQL on MapReduce framework for data-intensive computing Hive translates SQL into stages of MR jobs A MR job two functions: Map and Reduce Map: Transforms input into a list of key value pairs Map(D) List[Ki, Vi] Reduce: Given a key and all associated values, produces result in the form of a list of values Reduce(Ki, List[Vi]) List[Vo] Parallelism hidden by framework Highly scalable: can be applied to large datasets (Big Data) and run on commodity clusters Comes with its own user-space distributed file system (HDFS) based on the local storage of cluster nodes

MR Processing in Intel EE for Lustre* and HDFS Input Spill File : Reducer N : 1 LUSTRE (Global Namespace) Output Spill Map Sort Merge Shuffle Merge Reduce Spill Repetitive Local Disk Local Disk Map O/P Map O/P Map O/P Map O/P Map O/P Input HDFS Output * Other names and brands may be claimed as the property of others.

Motivation Could HPC and Analytic Computations co-exist? required to reduce simulations for HPC applications Need to evaluate use of alternative file systems for Big Data Analytic applications HDFS is an expensive distributed file system * Other names and brands may be claimed as the property of others.

Using Intel Enterprise Edition for Lustre* software with Hadoop HADOOP ADAPTER FOR LUSTRE * Other names and brands may be claimed as the property of others. 7

Hadoop over Intel EE for Lustre* Implementation Hadoop uses pluggable extensions to work with different file system types Lustre is POSIX compliant: Use Hadoop s built-in LocalFileSystem class Uses native file system support in Java Extend and override default behavior: LustreFileSystem Defines new URL scheme for Lustre lustre:/// Controls Lustre striping info Resolves absolute paths to user-defined directory Leaves room for future enhancements org.apache.hadoop.fs FileSystem RawLocalFileSystem LustreFileSystem 8 Allow Hadoop to find it in config files * Other names and brands may be claimed as the property of others.

Problem Definition Performance comparison of LUSTRE and HDFS for SQL Analytic queries of FSI, Insurance and Telecom workload on16 nodes HDDP cluster hosted in the Intel BigData Lab in Swindon (UK) and Intel Enterprise Edition for Lustre* software Performance metric : SQL Query Average Execution Time

EXPERIMENTAL SETUP 10

Hive+Hadoop+ HDFS Setup Intel(R) Xeon(R) CPU E5-2695 v2 @ 2.40GHz, 320GB cluster RAM, 1 TB SATA 7200 RPM Hive server 40 Gbps Redhat 6.5, CDH 5.2, Hive 0.13 11

Hive+ Hadoop+Lustre Setup Intel(R) Xeon(R) CPU E5-2695 v2 @ 2.40GHz, 320GB cluster RAM, 1 TB SATA 7200 RPM Hive server 40 Gbps Lustre Filesystem Redhat 6.5, Hive 0.13, CDH 5.2, Intel Enterprise Edition for Lustre* software 2.2, HAL 3.1 165 TB of usable cluster storage 12

Intel Enterprise Edition for Lustre* software 2.2 Setup CPU- Intel(R) Xeon(R) CPU E5-2637 v2 @ 3.50GHz, Memory - 128GB DDR3 1600MHZ OSS 4 TB SATA 7200 RPM 40 Gbps MDS OSS MDT OSS OSS 13

Parameters Configuration (Hadoop) Map Slots = 24 Reduce Slots = 32 HDFS: Replication Factor = 2 D D D Map Split size = 1 GB Block Size = 256 MB Compression = No Intel EE for Lustre 123, HGDRsdfd, 45, 488 123, GHDFDTt, 100, 7800.. File Stripe Count = 16 Stripe Size = 4 MB 1 2 3. 16 14

Parameters Configuration (Hive) Parameters 1T 2T 4T input.filei.minsize 4294967296 8589934592 17179869184 task.io.sort.factor #streams to merge 50 60 80 mapreduce.task.io.sort. mb 1024 1024 1024

Workloads FSI Workload Single Table Two SQL queries Telecom Workload Two Tables - Call fact details & Date dimension Two SQL queries - single Map join Insurance Workload Four Tables Vehicle, Customer, Policy & Policy Details Two SQL queries - having 3 level joins (map as well reduce) 16

Example Workload Consolidate Audit Trail (Part of FINRA) Database File (Single table, 12 columns ) Order-id, issue_symbol, orf_order_id, orf_order_received_ts, routes_share_quantity, route_price, 072900, FSGWTE, HFRWDF, 1410931788223, 100, 39.626602, 072900, VCSETH, BCXFNBJ, 1410758988282, 100, 32.642002, 072900, FRQSXF, BVCDSEY, 1410758988284, 100, 33.626502, 072900, OURSCV, MKVDERT, 1410931788223, 100, 78.825609, 072900, VXERGY, KDWRXV, 1410931788285, 100, 19.526312, Query Concurrency =1,2,8 Query Size : 100GB, 500GB, 1TB, 2TB, 4TB Query: Print total amount attributed to a particular share code routing during a date range. 17

RESULT ANALYSIS 18

Concurrency=1 Lustre = 2 * HDFS, data size >> 19

Concurrency=8 Lustre = 3 * HDFS, data size >> 20

Hadoop+ HDFS Setup Hadoop+ Intel EE for Lustre* Setup Total Nodes = Compute Nodes = 16 Total Nodes = 16 Compute Nodes = 11 Lustre Nodes = 5 21

Same BOM Lustre.Compute Nodes = 11 Lustre = 2 * HDFS, data size >> 22

Conclusion Intel EE for Lustre shows better performance than HDFS for concurrent as well as Join query bound workload Intel EE for Lustre = 2 X HDFS for single query Intel EE for Lustre = 3 X HDFS for concurrent queries HDFS: SQL performance is scalable with horizontal scalable cluster Lustre: SQL performance is scalable with vertical scalability Future work Impact of large number of compute nodes (i.e. OSSs <<<< Nodes) and scalable Lustre file systems. * Other names and brands may be claimed as the property of others.

Thank You rekha.singhal@tcs.com