Top 10 Lessons Learned from Deploying Hadoop in a Private Cloud
Rod Cope, CTO & Founder, OpenLogic, Inc.




Agenda
- Introduction
- The Problem
- The Solution
- Top 10 Lessons
- Final Thoughts
- Q & A

Introduction
- Rod Cope, CTO & Founder of OpenLogic
- 25 years of software development experience: IBM Global Services, Anthem, General Electric
- OpenLogic: Open Source support, governance, and scanning solutions
- Certified library w/ SLA support on 500+ Open Source packages: http://olex.openlogic.com
- Over 200 enterprise customers

The Problem
- "Big Data": all the world's Open Source software (metadata, code, indexes)
- Individual tables contain many terabytes
- Relational databases aren't scale-free
- Growing every day
- Need real-time random access to all data
- Long-running and complex analysis jobs

The Solution
- Hadoop, HBase, and Solr
  - Hadoop: distributed file system, map/reduce
  - HBase: NoSQL data store, column-oriented
  - Solr: search server based on Lucene
- All are scalable, flexible, fast, well-supported, and used in production environments
- And a supporting cast of thousands: Stargate, MySQL, Rails, Redis, Resque, Nginx, Unicorn, HAProxy, Memcached, Ruby, JRuby, CentOS, ...

Solution Architecture
[Architecture diagram: a web browser plus scanner and Maven clients reach Nginx & Unicorn in front of Ruby on Rails; Rails is backed by MySQL, Redis, and Resque workers, and reaches Solr and (via Stargate) HBase; live replication (3x for HBase) links MySQL, Solr, and the Maven repo across the Internet, application LAN, and data LAN. Caching and load balancing not shown.]

Hadoop Implementation
- Private cloud: 100+ CPU cores, 100+ terabytes of disk
- Machines don't have identity; add capacity by plugging in new machines
- Why not Amazon EC2?
  - Great for computational bursts
  - Expensive for long-term storage of Big Data
  - Not yet consistent enough for mission-critical usage of HBase

Top 10 Lessons Learned
1. Configuration is key
2. Commodity hardware is not an old desktop
3. Hadoop & HBase crave bandwidth
4. Big Data takes a long time
5. Big Data is hard
6. Scripting languages can help
7. Public clouds are expensive
8. Not possible without Open Source
9. Expect things to fail a lot
10. It's all still cutting edge

Configuration is Key
- Many moving parts; pay attention to the details
  - Operating system: max open files, sockets, and other limits
  - Hadoop: max map/reduce jobs, memory, disk
  - HBase: region size, memory
  - Solr: merge factor, norms, memory
- Minor versions are very important: use a known-good combination of Hadoop and HBase
  - Specific patches are critical
- The fine print matters
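The operating-system limits mentioned above are typically raised in /etc/security/limits.conf. A minimal sketch; the user name and values below are illustrative, not recommendations from this deck:

```
# /etc/security/limits.conf -- illustrative values for a "hadoop" service user;
# tune for your own cluster and workload
hadoop  soft  nofile  32768
hadoop  hard  nofile  32768
hadoop  soft  nproc   32768
hadoop  hard  nproc   32768
```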

Configuration Tips & Gotchas
- Follow all HBase configuration advice here: http://wiki.apache.org/hadoop/hbase/troubleshooting
  - Yes, that's a whole lot of configuration
  - Skip steps at your own peril!
- If you really need HA Hadoop: http://www.cloudera.com/blog/2009/07/hadoop-ha-configuration/
- If you hit datanode timeouts while writing to sockets: dfs.datanode.socket.write.timeout = 0
  - Even though it should be ignored
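The datanode timeout workaround above is an hdfs-site.xml property; a minimal fragment:

```
<!-- hdfs-site.xml: disable the datanode socket write timeout (0 = no timeout) -->
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>0</value>
</property>
```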

Configuration Tips & Gotchas (cont.)
- The Linux kernel version is important: it affects configuration switches, both required and optional
  - Example: epoll limits required as of 2.6.27, then no longer required in newer kernels such as 2.6.33+
  - http://pero.blogs.aprilmayjune.org/2009/01/22/hadoop-and-linuxkernel-2627-epoll-limits/
- Upgrade your machine BIOS, network card BIOS, and all hardware drivers
  - Example: issues with certain default configurations of Dell boxes on CentOS/RHEL 5.x and Broadcom NICs will drop packets and cause other problems under high load
  - Disable MSI in Linux and power saver (C-states) in the machine BIOS

Configuration Debugging Tips
- Many problems only show up under severe load: sustained, massive data loads running for 2-24 hours
- Change only one parameter at a time
  - Yes, this can be excruciating
- Ask the mailing list or your support provider
  - They've seen a lot, likely including your problem (but not always)
- Don't be afraid to dig in and read some code

Comment found in Hadoop networking code:

"Ideally we should wait after transferTo returns 0. But because of a bug in JRE on Linux (http://bugs.sun.com/view_bug.do?bug_id=5103988), which throws an exception instead of returning 0, we wait for the channel to be writable before writing to it. If you ever see IOException with message "Resource temporarily unavailable" thrown here, please let us know. Once we move to JAVA SE 7, wait should be moved to correct place."

- Hadoop stresses every bit of networking code in Java and tends to expose all the cracks
- This bug was fixed in JDK 1.6.0_18 (after 6 years)

Commodity Hardware
- Commodity hardware != a 3-year-old desktop
- Dual quad-core, 32GB RAM, 4+ disks
- Don't bother with RAID on Hadoop data disks
- Be wary of non-enterprise drives
- Expect ugly hardware issues at some point

OpenLogic's Hadoop Deployment
- Dual quad-core and dual hex-core Dell boxes
- 32-64GB ECC RAM (highly recommended by Google)
- 6 x 2TB enterprise hard drives
  - RAID 1 on two of the drives: OS, Hadoop, HBase, Solr, NFS mounts (be careful!), job code, etc., plus key source data backups
  - The Hadoop datanode gets the remaining drives
- Redundant enterprise switches
- Dual- and quad-gigabit NICs

Hadoop & HBase Crave Bandwidth and More
- Hadoop
  - Map/reduce jobs shuffle lots of data
  - Continuously replicating blocks and rebalancing
  - Loves bandwidth: dual-gigabit network on dedicated switches; a 10Gbps network can help
- HBase
  - Needs 5+ machines to stretch its legs
  - Depends on ZooKeeper: low latency is important
  - Don't let it run low on memory

Big Data Takes a Long Time (to do anything)
- Load, list, walk directory structures, count, process, test, back up... I'm not kidding
- Hard to test, but don't be tempted to skip it
  - You'll eventually hit every corner case you know and don't know
- Backups are difficult
  - Consider a backup Hadoop cluster
  - The HBase team is working on live replication
  - Solr already has built-in replication

Advice on Loading Big Data into HBase
- Don't use a single machine to load the cluster
  - You might not live long enough to see it finish
- At OpenLogic, we spread raw source data across many machines and hard drives via NFS
  - Be very careful with NFS configuration: it can hang machines
- Load data into HBase via Hadoop map/reduce jobs
- Turn off the WAL for much better performance: put.setWriteToWAL(false)
- Avoid large values (> 5MB)
  - They work, but may cause instability and/or performance issues
  - Rows and columns are cheap, so use more of them instead
- Be careful not to over-commit to Solr
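The "rows and columns are cheap" advice can be sketched as a chunking helper. This is a hypothetical illustration (the method name and the data:chunk_N qualifier scheme are invented, not OpenLogic's loader): instead of storing one oversized value, split it into sub-5MB pieces written as separate columns of the same row.

```ruby
# Hypothetical helper: split one oversized value into sub-5MB column chunks,
# so no single HBase cell exceeds the ~5MB soft limit mentioned above.
MAX_CHUNK = 5 * 1024 * 1024

# Returns { "data:chunk_0" => ..., "data:chunk_1" => ... }, ready to be
# written as individual columns for a single row key.
def chunk_value(value, max_chunk = MAX_CHUNK)
  chunks = {}
  offset = 0
  index = 0
  while offset < value.bytesize
    chunks["data:chunk_#{index}"] = value.byteslice(offset, max_chunk)
    offset += max_chunk
    index += 1
  end
  chunks
end
```

Reassembly is just concatenating the chunks back in qualifier order, which keeps reads simple while avoiding the instability large cells can cause.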

Advice on Getting Big Data out of HBase
- HBase is NoSQL
  - Think hash table, not relational database
- How do I find my data if the primary key won't cut it?
- Solr to the rescue
  - Very fast, highly scalable search server with built-in sharding and replication, based on Lucene
  - Dynamic schema, powerful query language, faceted search
  - Accessible via a simple REST-like web API w/ XML, JSON, Ruby, and other data formats

Solr Sharding
- Query any server: it executes the same query against all other servers in the group and returns the aggregated result to the original caller
- Async replication (slaves poll their masters)
  - Can use repeaters if replicating across data centers
- OpenLogic
  - Solr farm: sharded, cross-replicated, fronted with HAProxy
  - Writes load-balanced across masters, reads across slaves and masters
  - Billions of lines of code in HBase, all indexed in Solr for real-time search in multiple ways
  - Over 20 Solr fields indexed per source file
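A distributed Solr query names every shard in the group via the shards parameter, and whichever server receives it fans the query out and aggregates the results. A small Ruby sketch of building such a query URL; the hostnames and the example query are invented for illustration:

```ruby
require "uri"

# Hypothetical shard hosts; a distributed query lists all of them.
SHARDS = %w[solr1:8983/solr solr2:8983/solr solr3:8983/solr]

# Build a distributed Solr select URL: the entry server runs the query
# against every shard in the `shards` parameter and merges the results.
def sharded_query_url(entry_host, query, shards = SHARDS)
  params = URI.encode_www_form("q" => query, "shards" => shards.join(","), "wt" => "json")
  "http://#{entry_host}/select?#{params}"
end
```

Because any server in the group can be the entry point, a load balancer such as HAProxy can spread these read queries across slaves and masters alike.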

Big Data is Hard
- Expect to learn and experiment quite a bit
- Many moving parts, lots of decisions to make
  - You won't get them all right the first time
- Expect to discover new and better ways of modeling your data and processes
  - Don't be afraid to start over once or twice
- Consider getting outside help: training, consulting, mentoring, support

Scripting Languages Can Help
- Scripting is faster and easier than writing Java
- Great for system administration tasks and testing
- The standard HBase shell is based on JRuby
- Very easy map/reduce jobs with JRuby and Wukong
- Used heavily at OpenLogic
  - Productivity of Ruby, power of the Java Virtual Machine
  - Ruby on Rails, Hadoop integration, GUI clients

Java (27 lines)

```java
public class Filter {
    public static void main(String[] args) {
        List list = new ArrayList();
        list.add("Rod");
        list.add("Neeta");
        list.add("Eric");
        list.add("Missy");
        Filter filter = new Filter();
        List shorts = filter.filterLongerThan(list, 4);
        System.out.println(shorts.size());
        Iterator iter = shorts.iterator();
        while (iter.hasNext()) {
            System.out.println(iter.next());
        }
    }

    public List filterLongerThan(List list, int length) {
        List result = new ArrayList();
        Iterator iter = list.iterator();
        while (iter.hasNext()) {
            String item = (String) iter.next();
            if (item.length() <= length) {
                result.add(item);
            }
        }
        return result;
    }
}
```

Scripting languages (4 lines)

JRuby:

```ruby
list = ["Rod", "Neeta", "Eric", "Missy"]
shorts = list.find_all { |name| name.size <= 4 }
puts shorts.size
shorts.each { |name| puts name }
```

-> 2
-> Rod Eric

Groovy:

```groovy
list = ["Rod", "Neeta", "Eric", "Missy"]
shorts = list.findAll { name -> name.size() <= 4 }
println shorts.size
shorts.each { name -> println name }
```

-> 2
-> Rod Eric

Public Clouds and Big Data
- Amazon EC2
  - EBS storage: 100TB * $0.10/GB/month = $120k/year
  - Double Extra Large instances: 13 EC2 compute units, 34.2GB RAM; 20 instances * $1.00/hr * 8,760 hrs/yr = $175k/year
  - 3-year reserved instances: 20 * $4k = $80k up front to reserve, plus (20 * $0.34/hr * 8,760 hrs/yr * 3 yrs) / 3 = $86k/year to operate
- Totals for 20 virtual machines
  - 1st year cost: $120k + $80k + $86k = $286k
  - 2nd & 3rd year costs: $120k + $86k = $206k
  - Average: ($286k + $206k + $206k) / 3 = $232k/year
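The three-year average above follows directly from the slide's own component figures; a quick check in Ruby (all numbers are the deck's, in $k):

```ruby
# Reproduce the slide's EC2 cost totals (all figures in $k, from the deck).
ebs_per_year     = 120  # 100TB of EBS storage
reserve_up_front = 80   # 20 x $4k, 3-year reserved instances
operate_per_year = 86   # reserved-instance running cost, per the slide

year1 = ebs_per_year + reserve_up_front + operate_per_year  # first year: 286
year2 = year3 = ebs_per_year + operate_per_year             # later years: 206
average = (year1 + year2 + year3) / 3.0                     # ~232.7, i.e. ~$232k/year
```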

Private Clouds and Big Data
- Buy your own
  - 20 Dell servers w/ 12 CPU cores, 32GB RAM, 5TB disk = $160k
  - Over 33 EC2 compute units each
  - Total: $53k/year (amortized over 3 years)

Public Clouds are Expensive for Big Data
- Amazon EC2
  - 20 instances * 13 EC2 compute units = 260 EC2 compute units
  - Cost: $232k/year
- Buy your own
  - 20 machines * 33 EC2 compute units = 660 EC2 compute units
  - Cost: $53k/year (does not include hosting & maintenance costs)
- Don't think system administration goes away
  - You still own all the instances: monitoring, debugging, support

Not Possible Without Open Source
- Hadoop, HBase, Solr
- Apache, Tomcat, ZooKeeper, HAProxy
- Stargate, JRuby, Lucene, Jetty, HSQLDB, Geronimo
- Apache Commons, JUnit
- CentOS
- Dozens more
- Too expensive to build or buy everything

Expect Things to Fail A Lot
- Hardware: power supplies, hard drives
- Operating system: kernel panics, zombie processes, dropped packets
- Hadoop and friends: Hadoop datanodes, HBase regionservers, Stargate servers, Solr servers
- Your code and data: stray map/reduce jobs, strange corner cases in your data leading to program failures

It's All Still Cutting Edge
- Hadoop: SPOF around the Namenode, append functionality
- HBase: backup, replication, and indexing solutions in flux
- Solr: several competing solutions around cloud-like scalability and fault tolerance, including ZooKeeper and Hadoop integration

Final Thoughts
- You can host big data in your own private cloud
  - Tools are available today that didn't exist a few years ago
  - Fast to prototype, but production readiness takes time
  - Expect to invest in training and support
- Public clouds
  - Great for learning, experimenting, testing
  - Best for bursts vs. sustained loads
  - Beware latency and the expense of long-term Big Data storage
- You still need Small Data
  - SQL and NoSQL coexist peacefully
  - OpenLogic uses MySQL & Redis in addition to HBase, Solr, Memcached

Q & A
Any questions for Rod? rod.cope@openlogic.com
Slides: http://www.openlogic.com/news/presentations.php

* Unless otherwise credited, all images in this presentation are either open source project logos or were licensed from BigStockPhoto.com