Big Fast Data Hadoop acceleration with Flash. June 2013





Agenda
- The Big Data Problem
- What is Hadoop?
- Hadoop and Flash
- The Nytro Solution
- Test Results

The Big Data Problem
Big Data: Information comes from a wide variety of sources (for example, Facebook traffic and friend maps), and value can often be derived by combining this information with other sources.
Traditional Relational Database Approaches: Data models are developed based on query and data requirements, and the traditional process involves significant expense and time.
Traditional Approach Challenges: time to insight, scale, and importing large quantities of data.

Big Data Answer
The Hadoop architecture allows a cluster of commodity servers to work together to solve big data analytical problems.
Hadoop Architecture: Save everything. Scan all of the data via brute force, and focus on making brute-force scanning efficient.
Traditional Architecture: Massage data into a structured database, discarding everything outside of the data model, and build an efficient data model to process queries efficiently.
Hadoop can best be understood as a two-step process, Structure & Query, which corresponds to the Hadoop nomenclature of Map & Reduce.

Hadoop
The Hadoop architecture is a combination of three components:
1. An implementation of MapReduce to utilize clusters more effectively.
2. HDFS, a distributed file system.
3. Bringing the processing to the data, rather than the alternative of bringing the data to the processing.
Hadoop architecture and clusters go together: Hadoop utilizes computer hardware components that are cheap and powerful, and was developed to allow efficient use of thousands of CPU cores and disks. The architecture is rigid in its processing steps (Map & Reduce) to enable massive horizontal cluster scaling, and it uses multiple passes over a dataset.

Hadoop Design & Flash

Hadoop Data Flow
1. Map (Structure)
2. Shuffle, Sort & Merge (Organize structured intermediate data to query)
3. Reduce (Query)
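The three-step flow above can be sketched in miniature as a single-process word count (illustrative only; real Hadoop distributes each phase across the cluster):

```python
from collections import defaultdict

def map_phase(records):
    """Map (Structure): turn raw lines into (key, value) pairs."""
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle_phase(pairs):
    """Shuffle, Sort & Merge: group intermediate values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))  # sorted by key, as in Hadoop's sort step

def reduce_phase(groups):
    """Reduce (Query): aggregate each key's values."""
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle_phase(map_phase(["big fast data", "big data"])))
print(counts)  # {'big': 2, 'data': 2, 'fast': 1}
```

Note that the shuffle step is the only one that must materialize all intermediate pairs at once, which is why it dominates local storage traffic on a real cluster.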

Where to use Flash: Shuffle, Sort and Merge!
The shuffle, sort and merge steps (the Shuffle Phase) use local temporary storage on each node, outside of HDFS. The results of the maps must be committed to disk before the reduce processes start, and the reducers then fetch this intermediate data over the network. This can be very I/O intensive and cannot leverage bringing the processing to the data; instead, the data is brought to the processing nodes.
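In Hadoop 1.x (the 1.0.2 release used in these tests), the location of that local temporary storage is controlled by the `mapred.local.dir` property in mapred-site.xml. Pointing it at flash-cached volumes is one way to apply the idea above; the paths below are illustrative placeholders, not the tested configuration:

```xml
<!-- mapred-site.xml: put the shuffle's local scratch space on
     flash-cached volumes (paths are hypothetical examples). -->
<property>
  <name>mapred.local.dir</name>
  <value>/mnt/flash1/mapred/local,/mnt/flash2/mapred/local</value>
</property>
```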

Apache Hadoop MapReduce Local I/O Access Pattern
I/O is both random and sequential in different parts of the job. Shuffle reads are random with temporal locality (cache friendly).

MapReduce Requirements and Guidelines
The shuffle phase requires high IOPS and high bandwidth in its different parts. The local directory must be large enough to hold the biggest intermediate data set that a cluster node will run; if the directory fills up, the job fails. Intermediate data is deleted when it is no longer needed. Hadoop generates a balanced read/write workload, for which eMLC is the ideal flash media.
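As a rough sizing sketch, assuming a sort-style job whose intermediate data is about the size of its input and is spread evenly across nodes (both assumptions, not figures from this deck):

```python
# Back-of-envelope shuffle scratch-space estimate (assumptions noted above).
input_gb = 100        # e.g. a 100 GB TeraSort input
worker_nodes = 4      # e.g. 3 workers plus 1 combined name/worker node
per_node_gb = input_gb / worker_nodes
print(per_node_gb)    # minimum local shuffle space needed per node, in GB
```

Real jobs need headroom beyond this minimum, since several tasks can hold intermediate data on a node at once.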

The Solution: Nytro MegaRAID
Key Features:
- Transparent to applications, file systems, OS and device drivers
- Based on industry-hardened MegaRAID technology
- Supports read and write caching
- Integrated in the HBA and runs locally on the controller
- Limited CPU and memory overhead
- Accelerates rebuilds
- Accelerates workloads spanning analytics, OLTP and virtualized servers
- Seamless, plug-and-play, transparent acceleration for local HDD arrays (DAS) in server/workstation storage

Test Environment

Test Environment
Worker nodes: 12 cores, 32 GB RAM, 7 x 500 GB SAS disks, mirrored boot drives, 10 GigE networking, Apache Hadoop 1.0.2.
Disk layout: mirrored boot drives; MapReduce local, 7 volumes (1 per disk); HDFS, 7 volumes (1 per disk).

Full test setup: 3 worker nodes plus 1 combined name node/worker node, connected by a 10 GigE interconnect.

Nytro MegaRAID 100 GB TeraSort Run (7 disks per node)
No caching: 18 minutes 23 seconds.
With LSI Nytro caching enabled: 12 minutes 15 seconds.
A 33% reduction in job completion time.
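The reported speedup checks out when both run times are converted to seconds:

```python
# Sanity-check the TeraSort result quoted above.
baseline = 18 * 60 + 23   # no caching: 18 min 23 s -> 1103 s
cached = 12 * 60 + 15     # Nytro caching: 12 min 15 s -> 735 s
reduction = (baseline - cached) / baseline
print(f"{reduction:.1%}")  # 33.4%, matching the ~33% claim
```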

Other requirements for effective Flash usage
- No CPU bottlenecks: enough cores per node to keep the storage and network saturated.
- Fast network interfaces to support the shuffle phase: with flash-augmented storage, higher-performance networking (10 GigE or InfiniBand) is recommended.
- Enough local disks to prevent HDFS from becoming a bottleneck.
Once these requirements are met, substantial acceleration with flash is possible. LSI Proprietary

Updated config
Migrate the boot volume onto a small (~20 GB, mirrored) flash partition, freeing up drives for HDFS; this can cover the cost of the flash caching completely. MapReduce local: 9 volumes (1 per disk). HDFS: 9 volumes (1 per disk).

Key Takeaways
Hadoop leverages the fact that computer hardware components are cheap and powerful. Hadoop requires high IOPS and high bandwidth for different parts of the shuffle phase. Using flash as a cache is both an effective and cost-effective way to improve Hadoop performance.