Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012



Similar documents
Hadoop Optimizations for BigData Analytics

InfiniBand Update Addressing new I/O challenges in HPC, Cloud, and Web 2.0 infrastructures. Brian Sparks IBTA Marketing Working Group Co-Chair

Enabling High performance Big Data platform with RDMA

Driving IBM BigInsights Performance Over GPFS Using InfiniBand+RDMA

Can High-Performance Interconnects Benefit Memcached and Hadoop?

Maximizing Hadoop Performance and Storage Capacity with AltraHD TM

Achieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks

Hadoop on the Gordon Data Intensive Cluster

Chapter 7. Using Hadoop Cluster and MapReduce

Big Fast Data Hadoop acceleration with Flash. June 2013

CSE-E5430 Scalable Cloud Computing Lecture 2

Open source Google-style large scale data analysis with Hadoop

Performance measurement of a Hadoop Cluster

Hadoop MapReduce over Lustre* High Performance Data Division Omkar Kulkarni April 16, 2013

Installing Hadoop over Ceph, Using High Performance Networking

Hadoop: Embracing future hardware

Maximizing Hadoop Performance with Hardware Compression

Exar. Optimizing Hadoop Is Bigger Better?? March Exar Corporation Kato Road Fremont, CA

TCP/IP Implementation of Hadoop Acceleration. Cong Xu

HADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW

Hadoop Deployment and Performance on Gordon Data Intensive Supercomputer!

Big Data in the Enterprise: Network Design Considerations

Map Reduce & Hadoop Recommended Text:

Building a Scalable Storage with InfiniBand

Performance and Energy Efficiency of. Hadoop deployment models

SMB Direct for SQL Server and Private Cloud

Can Flash help you ride the Big Data Wave? Steve Fingerhut Vice President, Marketing Enterprise Storage Solutions Corporation

DESIGN AND ESTIMATION OF BIG DATA ANALYSIS USING MAPREDUCE AND HADOOP-A

Comparing SMB Direct 3.0 performance over RoCE, InfiniBand and Ethernet. September 2014

MapReduce Evaluator: User Guide

I/O Considerations in Big Data Analytics

Hadoop Size does Hadoop Summit 2013

Understanding Hadoop Performance on Lustre

Accelerate Big Data Analysis with Intel Technologies

A Micro-benchmark Suite for Evaluating Hadoop RPC on High-Performance Networks

Hadoop Architecture. Part 1

Accelerating Real Time Big Data Applications. PRESENTATION TITLE GOES HERE Bob Hansen

Reference Architecture and Best Practices for Virtualizing Hadoop Workloads Justin Murray VMware

Modernizing Hadoop Architecture for Superior Scalability, Efficiency & Productive Throughput. ddn.com

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

Accelerating Spark with RDMA for Big Data Processing: Early Experiences

Different Technologies for Improving the Performance of Hadoop

High-Performance Networking for Optimized Hadoop Deployments

Integrated Grid Solutions. and Greenplum

Deploying Ceph with High Performance Networks, Architectures and benchmarks for Block Storage Solutions

Storage, Cloud, Web 2.0, Big Data Driving Growth

Accelerating Hadoop MapReduce Using an In-Memory Data Grid

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

BENCHMARKING CLOUD DATABASES CASE STUDY on HBASE, HADOOP and CASSANDRA USING YCSB

HiBench Introduction. Carson Wang Software & Services Group

Take An Internal Look at Hadoop. Hairong Kuang Grid Team, Yahoo! Inc

Open source large scale distributed data management with Google s MapReduce and Bigtable

Solving I/O Bottlenecks to Enable Superior Cloud Efficiency

MapReduce and Hadoop Distributed File System V I J A Y R A O

Hadoop IST 734 SS CHUNG

Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture

Enabling the Use of Data

Networking in the Hadoop Cluster

Hadoop Cluster Applications

Hadoop implementation of MapReduce computational model. Ján Vaňo

Large scale processing using Hadoop. Ján Vaňo

Mellanox Cloud and Database Acceleration Solution over Windows Server 2012 SMB Direct

Extending Hadoop beyond MapReduce

Mellanox Accelerated Storage Solutions

MapReduce, Hadoop and Amazon AWS

CooMR: Cross-Task Coordination for Efficient Data Management in MapReduce Programs

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Accelerating Enterprise Applications and Reducing TCO with SanDisk ZetaScale Software

An Oracle White Paper June High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

Dell Reference Configuration for Hortonworks Data Platform

Greenplum Analytics Workbench

marlabs driving digital agility WHITEPAPER Big Data and Hadoop

Introduction to Big Data! with Apache Spark" UC#BERKELEY#

White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP

Big Data With Hadoop

HadoopTM Analytics DDN

MapReduce with Apache Hadoop Analysing Big Data

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage

Performance Comparison of Intel Enterprise Edition for Lustre* software and HDFS for MapReduce Applications

UNDERSTANDING THE BIG DATA PROBLEMS AND THEIR SOLUTIONS USING HADOOP AND MAP-REDUCE

Introduction to Hadoop on the SDSC Gordon Data Intensive Cluster"

Entering the Zettabyte Age Jeffrey Krone

A Case for Flash Memory SSD in Hadoop Applications

Simplifying Big Data Deployments in Cloud Environments with Mellanox Interconnects and QualiSystems Orchestration Solutions

R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5

An Alternative Storage Solution for MapReduce. Eric Lomascolo Director, Solutions Marketing

Optimize the execution of local physics analysis workflows using Hadoop

Building a cost-effective and high-performing public cloud. Sander Cruiming, founder Cloud Provider

Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Introduction. Need for ever-increasing storage scalability. Arista and Panasas provide a unique Cloud Storage solution

MapReduce and Hadoop Distributed File System

The Greenplum Analytics Workbench

GraySort on Apache Spark by Databricks

Boosting Hadoop Performance with Emulex OneConnect 10GbE Network Adapters

Big Data on AWS. Services Overview. Bernie Nallamotu Principle Solutions Architect

Hadoop and its Usage at Facebook. Dhruba Borthakur June 22 rd, 2009

How to Deploy OpenStack on TH-2 Supercomputer Yusong Tan, Bao Li National Supercomputing Center in Guangzhou April 10, 2014

ebay Storage, From Good to Great

Deploying Flash- Accelerated Hadoop with InfiniFlash from SanDisk

The Comprehensive Performance Rating for Hadoop Clusters on Cloud Computing Platform

Transcription:

Unstructured Data Accelerator (UDA) Author: Motti Beck, Mellanox Technologies Date: March 27, 2012 1

Market Trends Big Data Growing technology deployments are creating an exponential increase in the volume of data available Existing analytical techniques not adequate for business decision-making processes A successful approach of big data analytics will be a critical core competency Delivering significant competitive advantage to organizations Nationwide Electricity Grid Analysis 2

Big Data Not just a Volume Play Facebook statistics 2009 350 Million Named users 175 Million Active users in one day 35 Million Users updating status each day 2.5 Billion Photos / Month 1.6 Million Active pages Growth: 12 TB /day, 2 PB /year Global data volume : 8.7 PB Source: Gartner 3

Market Trends Serial to Parallel Computing Big Data Analytics is Parallel Processing Single Core Multi cores Single Computer Clustered Computing Traditional Data-Processing Pipeline Big Data parallel processing over Hadoop 4

Hadoop Cluster over High Performance Networking Mapper Shuffle IO Intensive Phases Bursty characteristics HDFS Replication Data Split Merge Sort Reducer Out Split Mapper Data Split Merge Sort Reducer Mapper Out Split Data Split 5

Avg Cap (GB) ASP ($), ASP/GB (Cent/GB) Peak Bendwidth (MB/s) I/O Bottlenecks: from Disk to Network Trends Use multiple SATA drives SSD is around the corner Matching network speed 10Gigabit Ethernet and beyond 7000 6000 5000 4000 3000 2000 1000 0 Disk and Network Performance SATA 4 SATA 12 SATA SSD (SAS) 12 SSD (SAS) SSD (PCIe) Quad SSD (PCIe) Bandwidth (MB/s) 40GE 10GE 1GE 500.00 450.00 400.00 350.00 Enterprise Enterprise Server Server SSD SSD Trends 1,400.0 1,200.0 1,000.0 300.00 250.00 200.00 150.00 100.00 50.00 - Source: "SSD Summary", Gartner, May 2011 2009A 2010A 2011 2012 2013 2014 2015 Avg Capacity (GB) ASP ($) ASP/GB (Cent/GB) 800.0 600.0 400.0 200.0-6

Efficient Fabric Services Scalable and non-blocking Matches extensive data exchange at peak rates Non blocking east-west traffic High Capacity Match all I/O bandwidth of the server Losslessness Efficiency and avoidance of retransmission Offload Full transport offload reduces CPU utilization RDMA zero copy operations 7

UDA Plug In Architecture Plug-in architecture Hadoop applications are unmodified Plug-in to Apache Hadoop Enabled via xml configuration Efficient Map Reduce Data communication over RDMA Using RDMA for In-Memory processing Enables to start the Reduce operation in parallel to the Shuffle operation Reduce disk IO operation Supports InfiniBand and Ethernet Zero copy, transport offload, kernel bypass 8

Software Architecture Hadoop (Java) UDA Plugin (C++) TaskTracker MapTask JobTracker TaskTracker ReduceTask Plug-in Benefits: Zcopy datapath Transport offload Improved merge algorithm MOFSupplier NetMerger Data Engine RDMA Server RDMA Client Merging Merging Thread Merging Thread Thread RDMA NIC / HCA 9

New Pipelined Data Flow Map Stage Map Map Map Map Map Map Map Map Map Map Map Map Shuffle Merge Vanilla Algorithm shuffle shuffle merge merge Reduce Reduce shuffle merge Header fetch Shuffle Merge New Algorithm shuffle merge Reduce Header fetch Reduce start Time 10

UDA MapReduce for Hadoop 1.X Shuffle portion executed in-memory Eliminating time consuming HDD read/writes UDA reads Map Output Files (MOF) from mappers Predefined portions of the MOFs, default 1KB Reduce tasks begins only with all mapper tasks completion 11

Execution Time (sec) UDA Enables Higher Performance 1000 Terasort Benchmark* (20GB file size, 16GB data per node, 8 Mappers, 4 reducers, 4 Disks) 900 800 700 600 500 45% Lower is better 1 GE 10 GE 400 UDA 10GE 300 200 ~2X Acceleration 100 0 8 Nodes 10 Nodes 12 Nodes *TeraSort is a popular benchmark used to measure the performance of Hadoop cluster 12

Accelerating Big Data Analytics EMC 1000-Node Analytic Platform Accelerates Industry's Hadoop Development 24 PetaByte of physical storage Half of every written word since inception of mankind Mellanox VPI Solutions Hadoop Acceleration 2X Faster Hadoop Job Run-Time High Throughput, Low Latency, RDMA Critical for ROI 13

Thank You Email: motti@mellanox.com 14