Hadoop Hardware @Twitter: Size does matter. @joep and @eecraft Hadoop Summit 2013




Hadoop Hardware: Size does matter. @joep and @eecraft, Hadoop Summit 2013, v2.3

About us
  Joep Rottinghuis
    Software Engineer @ Twitter
    Engineering Manager, Hadoop/HBase team @ Twitter
    Follow me @joep
  Jay Shenoy
    Hardware Engineer @ Twitter
    Engineering Manager, HW @ Twitter
    Follow me @eecraft
  HW & Hadoop teams @ Twitter, many others

Agenda
  Scale of Hadoop clusters
  Single versus multiple clusters
  Twitter Hadoop architecture
  Hardware investigations
  Results

Scale
  Scaling limits:
    JobTracker: 10s of thousands of jobs per day; 10s of thousands of concurrent slots
    Namenode: 250-300 M objects in a single namespace
    Namenode @ ~100 GB heap -> full GC pauses
    Shipping job jars to 1,000s of nodes
    JobHistory server at a few 100K job history/conf files
    # Nodes

When / why to split clusters?
  In principle, preference for a single cluster
    Common logs, shared free space, reduced admin burden, more rack diversity
  Varying SLAs
  Workload diversity
    Storage intensive
    Processing (CPU / disk IO) intensive
    Network intensive
  Data access
    Hot, warm, cold

Cluster Architecture

Hardware investigations

Service criteria for hardware
  Hadoop does not need live HDD swap
  Twitter DC: no SLA on data nodes
  Rack SLA: only 1 rack down at any time in a cluster

Baseline Hadoop Server (~ early 2012)
  Block diagram labels: E56xx, E56xx, PCH, NIC (GbE), HBA, Expander
  Characteristics:
    Standard 2U server
    20 servers / rack
    E5645 CPU, dual 6-core
    72 GB memory
    12 x 2 TB HDD
    2 x 1 GbE
  Works for the general cluster, but...
    Need more density for storage
    Potential IO bottlenecks

Hadoop Server: Possible evolution
  Block diagram labels: E5-26xx or E5-24xx, E5-26xx or E5-24xx, NIC (GbE, 10GbE?), HBA, Expander
  Characteristics:
    + CPU performance
    16 x 2T? 16 x 3T? 24 x 3T??
    20 servers / rack
  Candidate for DW
  Can deploy into the general DW cluster, but...
    Too much CPU for storage-intensive apps
    Server failure domain too large if we scale up disks

Rethinking hardware evolution
  Debunking myths:
    Bigger is always better
    One size fits all
  Back to Hadoop hardware roots: scale horizontally, not vertically
  Twitter Hadoop Server (THS)

THS for backups
  Block diagram labels: E3-12xx, PCH, NIC (GbE), SAS HBA
  Characteristics:
    + IO performance
    Few fast cores
    E3-1230 V2 CPU
    16 GB memory
    12 x 3 TB HDD
    SSD boot
    2 x 1 GbE
  Storage focus:
    Cost efficient (single socket, 3T drives)
    Less memory needed

THS variant for Hadoop-Proc and HBase
  Block diagram labels: E3-12xx, PCH, NIC (10GbE), SAS HBA
  Characteristics:
    + IO performance
    Few fast cores
    E3-1230 V2 CPU
    32 GB memory
    12 x 1 TB HDD
    SSD boot
    1 x 10 GbE
  Processing / throughput focus:
    Cost efficient (single socket, 1T drives)
    More disk and network IO per socket

THS for cold cluster
  Block diagram labels: E3-12xx, PCH, NIC (GbE), SAS HBA
  Characteristics:
    Disk efficiency
    Some compute
    E3-1230 V2 CPU
    32 GB memory
    12 x 3 TB HDD
    2 x 1 GbE
  Combination of the previous 2 use cases:
    Space & power efficient
    Storage dense and some processing capabilities

Rack-level view
  Diagram labels (top-of-rack switches, in order): 1G TOR, 1G TOR, 1G TOR, 10G TOR, 1G TOR, 1G TOR

                        Baseline       Twitter Hadoop Server
                                       Backups          Proc            Cold
  Power                 ~8 kW          ~8 kW            ~8 kW           ~8 kW
  CPU sockets; DRAM     40; 1440 GB    40; 640 GB       40; 1280 GB     40; 1280 GB
  Spindles; TB raw      240; 480 TB    480; 1,440 TB    480; 480 TB     480; 1,440 TB
  Uplink; Internal BW   20; 40 Gbps    20; 80 Gbps      40; 400 Gbps    20; 80 Gbps
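The rack-level figures follow from the per-server specs on the earlier slides, multiplied by the number of servers in a rack. The sketch below reproduces that arithmetic. The class and field names are hypothetical, and the 40 THS servers per rack is an assumption inferred from the 40 CPU sockets per rack combined with the single-socket design; only the 20 servers / rack for the baseline is stated on the slides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServerSpec:
    """Per-server figures taken from the slides; names are illustrative."""
    name: str
    sockets: int
    memory_gb: int
    hdd_count: int
    hdd_tb: int
    nic_count: int
    nic_gbps: int
    servers_per_rack: int  # 40 for THS is an inferred assumption

    def rack_totals(self) -> dict:
        """Multiply per-server figures by the number of servers in a rack."""
        n = self.servers_per_rack
        return {
            "sockets": n * self.sockets,
            "dram_gb": n * self.memory_gb,
            "spindles": n * self.hdd_count,
            "raw_tb": n * self.hdd_count * self.hdd_tb,
            "internal_gbps": n * self.nic_count * self.nic_gbps,
        }


SPECS = [
    ServerSpec("Baseline", 2, 72, 12, 2, 2, 1, 20),
    ServerSpec("THS Backups", 1, 16, 12, 3, 2, 1, 40),
    ServerSpec("THS Proc", 1, 32, 12, 1, 1, 10, 40),
    ServerSpec("THS Cold", 1, 32, 12, 3, 2, 1, 40),
]

for spec in SPECS:
    print(spec.name, spec.rack_totals())
```

Under these assumptions the computed totals match the rack-level table, which suggests the THS racks hold twice as many servers as the baseline racks within roughly the same power envelope.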

Processing performance comparison

  Benchmark                                           Baseline Server    THS (-Cold)
  TestDFSIO (write, replication = 1)                  360 MB/s / node    780 MB/s / node
  TeraGen (30 TB, replication = 3)                    1:36 hrs           1:35 hrs
  TeraSort (30 TB, replication = 3)                   6:11 hrs           4:22 hrs
  2 parallel TeraSort (30 TB each, replication = 3)   10:36 hrs          6:21 hrs
  Application #1                                      4:37 min           3:09 min
  Application set #2                                  13:3 hrs           10:57 hrs

  Performance benchmark setup:
    Each cluster: 102 nodes of the respective type
    Efficient server = 3 racks, Baseline = 5+ racks
    Dated stack: CentOS 5.5, Sun 1.6 JRE, Hadoop 2.0.3
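To make the comparison easier to read, the timings can be turned into speedup ratios (baseline time divided by THS time). This is a minimal sketch with hypothetical helper names. It skips the "Application set #2" row because its baseline value is not fully legible in the transcript, and it assumes 40 THS servers per rack, which is inferred from the rack-level figures rather than stated.

```python
def parse_clock(value: str) -> int:
    """Convert 'H:MM' (or 'M:SS') into a single count of the smaller unit."""
    major, minor = value.split(":")
    return int(major) * 60 + int(minor)


def speedup(baseline: str, ths: str) -> float:
    """Ratio > 1 means the THS cluster finished faster than the baseline."""
    return parse_clock(baseline) / parse_clock(ths)


# (baseline, THS) pairs copied from the benchmark slide.
TIMINGS = {
    "TeraGen (30 TB)": ("1:36", "1:35"),
    "TeraSort (30 TB)": ("6:11", "4:22"),
    "2 parallel TeraSort": ("10:36", "6:21"),
    "Application #1": ("4:37", "3:09"),
}

for name, (baseline, ths) in TIMINGS.items():
    print(f"{name}: {speedup(baseline, ths):.2f}x")

# TestDFSIO reports throughput, so the ratio is THS over baseline.
dfsio_ratio = 780 / 360
print(f"TestDFSIO write throughput: {dfsio_ratio:.2f}x")

# Rack footprint of a 102-node cluster (40 THS servers per rack is assumed).
NODES = 102
racks_baseline = NODES / 20
racks_ths = NODES / 40
print(f"Racks: baseline {racks_baseline:.2f}, THS {racks_ths:.2f}")
```

The rack fractions line up with the slide's "3 racks" versus "5+ racks", so the speedups come with a smaller floor footprint for the same node count.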

Results

LZO performance comparison (chart; values not captured in the transcript)

Recap
  At a certain scale it makes sense to split into multiple clusters
    For us: RT, PROC, DW, COLD, BACKUPS, TST, EXP
  For large enough clusters, depending on the use case, it may be worth choosing different HW configurations

Conclusion
  Our Twitter Hadoop Server not only saves many $$$, it is also faster!

#ThankYou
  @joep and @eecraft
  Come talk to us at booth 26