Maximizing Hadoop Performance with Hardware Compression



Maximizing Hadoop Performance with Hardware Compression
Robert Reiner, Director of Marketing, Compression and Security
Exar Corporation
November 2012

What is Big Data?
- "Data sets whose size is beyond the ability of typical database software tools to capture, store, and analyze" (McKinsey Global Institute)
- "Data sets so large and complex that it becomes difficult to process using on-hand database management tools" (Wikipedia)
- "Data that's an order of magnitude bigger than what you're accustomed to, Grasshopper" (eBay)

Sources of Big Data (Type of Data → Industry)
- Web Logs → Social Networking, Online Transactions
- RFID → Retail, Manufacturing, Casinos
- Smart Grid → Utilities
- Sensor → Industrial Equipment, Engines
- Telemetry → Video Games
- Telematics → Auto Insurance
- Text Analysis → Multiple
- Time and Location Analysis → Multiple
Source: Franks, Taming the Big Data Tidal Wave

Big Data Growth
- 40% projected annual growth in global data generated [1]
- 5% annual growth in global IT spending [1]
- Twitter generates >7 TB of data per day [2]
- Facebook generates >10 TB of data per day [2]
- 80% of the world's information is unstructured [2]
Sources: 1) McKinsey; 2) Zikopoulos et al., Understanding Big Data

Three Vs of Big Data
- Volume: Terabytes, Records, Transactions, Tables, Files
- Velocity: Batch, Near Time, Real Time, Streams
- Variety: Structured, Unstructured, Semi-structured, All of the Above
Source: TDWI

Hadoop Addresses Big Data Challenges

MapReduce is the Core of Hadoop
- Parallel programming framework
- Simplifies processing across massive data sets
- Enables processing of data in a file system without it being stored in a database
- Can process unstructured data
- Excels at sifting through huge masses of data to find what is useful
Stack: Pig, Hive, Sqoop, and HBase run on top of MapReduce

MapReduce Data Flow
[Diagram: three parallel pipelines, each Input → Map → Sort → Copy → Merge → Reduce → Output → Replication]
Source: White, Hadoop: The Definitive Guide
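The flow on this slide can be sketched in plain Python as an illustration (no Hadoop involved): input splits are mapped to key/value pairs, the intermediate pairs are sorted and merged by key, and each key group is reduced. The word-count job below is a standard teaching example, not one from this deck.

```python
from itertools import groupby
from operator import itemgetter

def map_phase(split):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word, 1) for word in split.split()]

def shuffle(mapped):
    """Sort/copy/merge: group intermediate pairs by key, as the
    framework does between the map and reduce stages."""
    mapped.sort(key=itemgetter(0))
    return {k: [v for _, v in grp] for k, grp in groupby(mapped, key=itemgetter(0))}

def reduce_phase(grouped):
    """Reduce: sum the counts for each word."""
    return {word: sum(counts) for word, counts in grouped.items()}

# Three input splits, mirroring the three pipelines in the diagram.
splits = ["big data big", "data hadoop", "hadoop big"]
mapped = [pair for s in splits for pair in map_phase(s)]
result = reduce_phase(shuffle(mapped))
print(result)  # {'big': 3, 'data': 2, 'hadoop': 2}
```

The sort, copy, and merge steps between `map_phase` and `reduce_phase` are exactly the stages where a real cluster spills to disk and moves data over the network.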

MapReduce I/O Bottlenecks
[Diagram: the same data flow, with the sort, copy, merge, and replication stages marked as disk and network I/O bottlenecks]

Compression Reduces Network and Disk I/O
- Addresses the I/O bottlenecks in the MapReduce data flow
[Diagram: the same data flow with compression applied at the I/O-intensive stages]
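A rough way to quantify this claim: if data crosses the disks and the network in compressed form, the fraction of I/O eliminated is 1 − 1/ratio. The 5.12:1 ratio below is taken from the LZO-class Terasort results later in this deck; the formula itself is a simplification that ignores compression metadata and block framing.

```python
def io_eliminated(compression_ratio):
    """Fraction of disk/network I/O avoided when data moves compressed
    at the given ratio (simplified: ignores framing overhead)."""
    return 1 - 1 / compression_ratio

# Ratio of ~5:1, in line with the LZO-class results in this deck.
print(f"{io_eliminated(5.12):.0%}")  # → 80%
```

At these ratios, roughly four out of every five bytes of shuffle and output traffic never touch the wire or the spindle, which is why compression attacks the bottleneck stages so effectively.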

Hadoop Software Compression Codecs

Codec        | Compression Performance | Compression Ratio | CPU Overhead
Deflate/Gzip | Low                     | High              | High
Bzip2        | Very Low                | Very High         | Very High
LZO          | Medium                  | Medium            | Medium

Source: White, Hadoop: The Definitive Guide
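The speed/ratio tradeoff in the table can be observed with Python's standard-library bindings for two of these codecs (`zlib` implements Deflate, the algorithm inside gzip; LZO has no stdlib binding, so it is omitted). Exact figures depend entirely on the data and the machine, so none are asserted here:

```python
import bz2
import time
import zlib

# Highly repetitive sample data; real job data compresses less well.
data = b"hadoop mapreduce terasort " * 40000  # ~1 MB

for name, compress in [("deflate/gzip", lambda d: zlib.compress(d, 6)),
                       ("bzip2", bz2.compress)]:
    start = time.perf_counter()
    packed = compress(data)
    elapsed = time.perf_counter() - start
    # Ratio = original size / compressed size; time in milliseconds.
    print(f"{name}: ratio {len(data) / len(packed):.1f}, {elapsed * 1e3:.1f} ms")
```

Running this typically shows bzip2 achieving the higher ratio at a much higher CPU cost, matching the table's "Very High / Very High" entry.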

Hardware-Accelerated Compression
- Processor-intensive compression algorithms executed in hardware
- Increases performance and reduces MapReduce execution time
- Offloads the CPU
- Lowers energy consumption
Stack: Pig, Hive, Sqoop, and HBase run on top of MapReduce and the hardware codec

Hardware-Accelerated Compression: Exar Solutions
- DX1800 and DX1700 Series PCIe cards
- Java codec calls C libraries and invokes the card's SDK

Hardware and Software Compression for Hadoop

Codec           | Compression Performance | Compression Ratio | CPU Overhead
HW eLZS         | Very High               | Medium            | Low
HW Deflate/Gzip | Very High               | High              | Low
SW Deflate/Gzip | Low                     | High              | High
SW Bzip2        | Very Low                | Very High         | Very High
SW LZO          | Medium                  | Medium            | Medium

Hadoop Codec Benchmarking
- Benchmarked multiple codecs using Terasort
- Terasort input size: 100 GB
- Three-node Hadoop cluster, 1 GbE switch interconnect
- Hadoop version 1.0.0
Node configuration:
- Dual Xeon E5620 per node (8 cores, 16 threads)
- 16 GB DRAM
- 4 x SATA-300 drives (108 MB/s, 500 GB)
- RHEL 5.4

Benchmarking Results: SW LZO and HW eLZS Codecs
- SW LZO: total time 24 min 8 s, compression ratio 5.12, energy 6.2478 kWh
- HW eLZS**: total time 19 min 8 s, compression ratio 5.13, energy 6.2061 kWh
** HW eLZS benchmarked with the DX 1845 Compression and Security Acceleration Card

Benchmarking Results: SW Gzip and HW Gzip Codecs
- SW Gzip: total time 44 min 11 s, compression ratio 7.645
- HW Gzip**: total time 15 min 56 s, compression ratio 6.329
** HW Gzip benchmarked with a next-generation Compression and Security Acceleration product prototype

Terasort Benchmarking Summary
- HW eLZS is 21% faster than SW LZO
- HW Gzip is 64% faster than SW Gzip
- HW Gzip provides the fastest Terasort time of all codecs benchmarked
- Hardware-accelerated compression is more energy efficient than software compression
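The headline percentages follow directly from the total times on the two results slides, taking the software codec as the baseline:

```python
def pct_faster(baseline_s, accelerated_s):
    """Percent reduction in total job time versus the software baseline."""
    return 100 * (baseline_s - accelerated_s) / baseline_s

# Total times from the results slides, converted to seconds.
sw_lzo, hw_elzs = 24 * 60 + 8, 19 * 60 + 8     # 24 min 8 s vs 19 min 8 s
sw_gzip, hw_gzip = 44 * 60 + 11, 15 * 60 + 56  # 44 min 11 s vs 15 min 56 s

print(f"{pct_faster(sw_lzo, hw_elzs):.0f}%")   # 21%
print(f"{pct_faster(sw_gzip, hw_gzip):.0f}%")  # 64%
```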

Future of Hardware Acceleration for Hadoop
- Security: add encryption to the hardware-accelerated codec; Exar's technology enables compression, encryption, and hashing in a single pass
- Enhanced benchmarks: expand beyond Terasort to benchmarks that represent additional production workloads

Thank You
Robert Reiner, Director of Marketing, Compression and Security
Exar Corporation
rob.reiner@exar.com