Maximizing Hadoop Performance and Storage Capacity with AltraHD™




Executive Summary

The explosion of internet data, driven in large part by the growth of more and more powerful mobile devices, has created not only a large volume of data, but a variety of data types, with new data being generated at an increasingly rapid rate. Data characterized by the three Vs (Volume, Variety, and Velocity) is commonly referred to as big data, and it has put an enormous strain on organizations to store and analyze their data.

Organizations are increasingly turning to Apache Hadoop to tackle this challenge. Hadoop is a set of open source applications and utilities that can be used to reliably store and process big data. What makes Hadoop so attractive?

- Hadoop runs on commodity off-the-shelf (COTS) hardware, making it relatively inexpensive to construct a large cluster.
- Hadoop supports unstructured data, covering a wide range of data types, and can perform analytics on that data without requiring a specific schema to describe it.
- Hadoop is highly scalable, allowing companies to easily expand an existing cluster by adding more nodes without extensive software modifications.
- Apache Hadoop is an open source platform that runs on a wide range of Linux distributions.

A Hadoop cluster is composed of a set of nodes, or servers, built around the following key components.

Hadoop Distributed File System (HDFS): HDFS is a Java-based file system that layers over the existing file systems in the cluster. The implementation allows the file system to span all of the distributed datanodes in the Hadoop cluster, providing a scalable and reliable storage system.

MapReduce: MapReduce is a two-stage analytic function used to search and sort large data sets, leveraging the inherent parallel processing of Hadoop. Mapping reads the dataset from HDFS and organizes it into an appropriate format for analysis; reducing then analyzes the mapped data to yield the desired information.
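The two MapReduce stages described above can be sketched in miniature. The following is a hypothetical, in-process Python illustration of the map/shuffle/reduce flow; real Hadoop jobs implement Mapper and Reducer classes in Java and read their input from HDFS:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records):
    """Map: read each input record and emit (key, value) pairs."""
    for line in records:
        for word in line.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Reduce: group the mapped pairs by key and aggregate the values."""
    shuffled = sorted(pairs, key=itemgetter(0))  # stand-in for the shuffle/sort step
    for key, group in groupby(shuffled, key=itemgetter(0)):
        yield (key, sum(v for _, v in group))

data = ["big data", "big cluster"]
print(dict(reduce_phase(map_phase(data))))  # {'big': 2, 'cluster': 1, 'data': 1}
```

In a real cluster the sort between the two phases is the shuffle traffic discussed below: mapped pairs are written to local disk and moved across the network to the reducers.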
In the evolution of Hadoop there have been two versions of MapReduce. MapReduce 1 (MR1) incorporates MapReduce and the resource manager in a single module. MapReduce 2 (MR2), introduced with hadoop-0.23, separates MapReduce and the resource manager into discrete modules. The resource manager for MR2 is commonly called YARN (Yet Another Resource Negotiator). MR2 allows additional applications to run on Hadoop because they plug directly into YARN; for example, YARN makes it possible to run different versions of MapReduce in the same cluster, or to run applications that do not follow the MapReduce model.

When data is written to HDFS, each block written into the file system is replicated, resulting in three copies of each block by default. If there are enough datanodes in the cluster, each copy of a block is stored on a different node. In this way the cluster can sustain the loss of a disk, or even a server, because the data is distributed. The cost of this redundancy is that it requires three times the storage, and the distributed data can create I/O bottlenecks. MapReduce adds further I/O overhead when splitting the data set into smaller segments that can be sorted and filtered. A MapReduce job generates disk I/O traffic when data is moved to storage devices, and network I/O traffic, or shuffle traffic, when data is moved between nodes. Hadoop attempts to localize the data to minimize data movement, but bottlenecks still occur during a job, often due to the transfer speed and rotational latency of spinning disks.
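The replication behavior described above can be illustrated with a toy model. The round-robin node selection below is a hypothetical stand-in for HDFS's rack-aware placement policy; only the 3x storage cost and the single-node fault tolerance carry over from the text:

```python
REPLICATION = 3  # HDFS default replication factor

def place_blocks(num_blocks, datanodes):
    """Place each block on REPLICATION distinct datanodes (toy round-robin)."""
    return {
        b: [datanodes[(b + r) % len(datanodes)] for r in range(REPLICATION)]
        for b in range(num_blocks)
    }

nodes = ["dn1", "dn2", "dn3", "dn4"]
layout = place_blocks(4, nodes)

# Every block lives on three different nodes, so losing any single
# datanode never loses data...
assert all(len(set(replicas)) == REPLICATION for replicas in layout.values())
# ...but 4 logical blocks consume 12 block-slots of raw storage.
assert sum(len(r) for r in layout.values()) == 4 * REPLICATION
```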

This paper describes a solution that can be seamlessly added to a Hadoop cluster to transparently compress both HDFS data and intermediate data. Reducing the amount of data being transferred greatly improves I/O performance for both storage and networking, which translates into enhanced system performance, while compressing the file system and intermediate data increases effective storage capacity.

Market Dynamics

IDC estimates that the 2.8 zettabytes of data generated in 2012 will grow to 40 zettabytes by 2020 [1]. Eighty-five percent of this growth is expected to come from new data sources, with a 15x increase in machine-generated data. As the quantity of data continues to grow, so do the processing power needed to analyze it and the storage needed to hold it. The typical response is to add more datanodes to the cluster, but doing so increases CAPEX, along with the OPEX of powering, cooling, and cabling the additional datanodes, resulting in escalating TCO.

Exar provides a solution that transparently compresses data in the Hadoop cluster, adds storage capacity to the cluster, and enables more efficient execution of Hadoop jobs. Exar's state-of-the-art AltraHD solution combines Exar's software technology with its leading-edge hardware compression accelerators to remove costly I/O bottlenecks and optimize storage capacity for Hadoop applications.

The Exar AltraHD Solution

Exar's AltraHD solution for Hadoop includes the following components:

1. An application-transparent file system filter driver that sits below HDFS and automatically compresses/decompresses all files using HDFS. This enables transparent compression for all modules that interface to HDFS, including MapReduce, HBase, and Hive.

2. A suite of compression codecs that automatically compress/decompress intermediate data during the MapReduce phase of Hadoop processing.

3.
A high-performance PCIe-based hardware compression card that automatically accelerates all compression and decompression operations, maximizing throughput while offloading the host CPU. Each DX24 card provides up to 5 GB/sec of compression/decompression throughput, optimizing workloads and delivering maximum system performance. The DX24 card also supports encryption and hashing to support future Hadoop roadmaps.

[1] THE DIGITAL UNIVERSE IN 2020: Big Data, Bigger Digital Shadows, and Biggest Growth in the Far East, http://www.emc.com/collateral/analyst-reports/idc-the-digital-universe-in-2020.pdf
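AltraHD performs the compression itself in hardware, below the file system. As a software stand-in, Python's zlib can illustrate the transparent compress-on-write / decompress-on-read idea behind the filter driver and codecs described above; the function names here are illustrative, not Exar's API:

```python
import zlib

def write_compressed(payload: bytes) -> bytes:
    """Compress data on its way to disk, invisibly to the application."""
    return zlib.compress(payload, 6)

def read_decompressed(stored: bytes) -> bytes:
    """Decompress on read, returning exactly what was written."""
    return zlib.decompress(stored)

data = b"hadoop intermediate data " * 1000  # repetitive data compresses well
stored = write_compressed(data)
assert read_decompressed(stored) == data  # the round trip is lossless
print(f"stored {len(stored)} of {len(data)} bytes")
```

Because the round trip is lossless and hidden behind the file system interface, applications such as MapReduce, HBase, and Hive need no changes; achieved ratios depend entirely on how compressible the data is.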

AltraHD integrates seamlessly into the Hadoop stack and transparently compresses all files, including files stored using HDFS as well as files stored locally as intermediate data outside of HDFS. AltraHD is a plug-and-play solution that installs easily and quickly on each Hadoop datanode without requiring kernel recompilation or modification of user applications. Once installed, all file accesses are automatically accelerated and optimized.

The performance increase and storage optimization gained by using the Exar AltraHD solution can reduce the number of nodes required, along with the associated power, cooling, and space requirements. This minimizes both CAPEX and OPEX, reducing the overall TCO of the solution. The storage savings provided by AltraHD is proportional to the source data compression ratio, and will therefore vary with the data set.

Benchmark Results

This section presents two benchmarks. The Hadoop cluster's performance was measured using Terasort, the industry-standard benchmarking tool. The following tables describe the test cluster configuration.

Hardware Configuration
                       Name Node      Data Node
Number of Nodes        1              8
CPU Model              Intel E5520    Intel E5-2630
Speed                  2.27 GHz       2.3 GHz
# of Cores/Threads     8/16           12/24
RAM                    16 GB          96 GB
Network                1 Gbit         1 Gbit

Software Configuration
AltraHD Release            AltraHD 1.1.L
Hadoop Version             Hadoop 2.2.0-cdh5.0.0-beta-1 (applies to both MR1 and MR2)
Operating System Version   CentOS 6.2 (kernel 2.6.32-220)

The first set of benchmark data was collected using Hadoop MapReduce 1, and the second set using MapReduce 2. For comparison, data for each benchmark was collected in both a native environment and an AltraHD environment, with the Hadoop configuration settings optimized for each. The native Hadoop cluster uses the industry-standard software LZO codec to compress the intermediate data, but does not compress file system data.
The AltraHD Hadoop cluster uses the AltraHD file system filter driver to compress file system data and the zlib AltraCodec to compress intermediate data, providing a complete compression solution for Hadoop.

The performance metrics shown in the charts below include:

Job Execution Time: the time required for the job to run. A shorter job time translates to more time available for the CPU to perform other tasks.

Disk I/O Wait Time: the time the CPU spends waiting for disk I/O accesses to complete, which indirectly measures read latency. The lower the disk I/O wait time, the less time the CPU spends waiting for an I/O access to complete.

Disk I/O Utilization: the percentage of time the disk is actually in use by Hadoop. The lower the disk I/O utilization, the more time other applications can use the disk.

MapReduce 1 Benchmark Results

Using AltraHD in a Hadoop MR1 environment yields a disk compression ratio of 7:1, which corresponds to a 7x increase in storage capacity.

[Chart: MR1 Effective Storage Capacity, Native vs. AltraHD, in terabytes]
[Chart: MR1 Job Execution Time in seconds; data labels: 93%, 35%, 27%, 51%, 21%, 22%]

[Chart: MR1 Disk I/O Wait Time in percent; data labels: 93%, 87%, 77%, 93%, 82%, 83%]
[Chart: MR1 Disk I/O Utilization in percent; data labels: 26%, 37%, 13%, 44%, 3%, 3%]
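The relationship between compression ratio, replication, and raw disk consumption reported above reduces to simple arithmetic. A small sketch, using illustrative numbers rather than figures from the benchmark:

```python
REPLICATION = 3  # HDFS default: three copies of every block

def raw_tb_consumed(logical_tb, compression_ratio):
    """Raw disk consumed after 3x replication and compression."""
    return logical_tb * REPLICATION / compression_ratio

# Uncompressed, 10 TB of HDFS data occupies 30 TB of raw disk;
# at the 7:1 MR1 ratio it occupies about 4.3 TB instead.
assert raw_tb_consumed(10, 1.0) == 30.0
assert round(raw_tb_consumed(10, 7.0), 1) == 4.3
```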

MapReduce 2 Benchmark Results

Using AltraHD in a Hadoop MR2 environment results in a disk compression ratio of 3.5:1, which corresponds to a 3.5x increase in storage capacity. The difference in compression ratio compared to MR1 is due to changes in the Terasort GraySort algorithm [2] that produce a less compressible data set.

[Chart: MR2 Effective Storage Capacity, Native vs. AltraHD, in terabytes]
[Chart: MR2 Job Execution Time in seconds; data labels: 27%, 34%, 31%, 18%, 18%, 24%]

[2] Getting MapReduce 2 Up to Speed, Cloudera Engineering Blog, http://blog.cloudera.com/blog/2014/02/getting-mapreduce-2-up-to-speed/

[Chart: MR2 Disk I/O Wait Time in percent; data labels: 66%, 72%, 82%, 71%, 64%, 73%]
[Chart: MR2 Disk I/O Utilization in percent; data labels: 18%, 28%, 21%, 23%, 37%, 24%]

Conclusion

Hadoop is increasingly being used by leading-edge companies to manage and analyze the explosion of big data on their servers. However, Hadoop replication adds a 3x multiplier to an already large storage footprint and creates I/O bottlenecks as the additional data is accessed on the network and disk. Exar's AltraHD solution addresses these issues by increasing overall performance by up to 2x and boosting storage capacity by an amount proportional to the data compression ratio. The increase in performance and storage capacity translates to lower CAPEX and OPEX, resulting in a significant reduction in TCO.

For more information about the Exar AltraHD solution, visit Exar's website at www.exar.com or call 510-668-7000.