Etcetera! CMSC 491 Hadoop-Based Distributed Computing Spring 2015 Adam Shook




Agenda Advanced HDFS Features Apache Kafka Apache Cassandra Redis (but more this time) Cluster Planning

ADVANCED HDFS FEATURES

Highly Available NameNode
The Highly Available NameNode feature eliminates the NameNode as a single point of failure (SPOF). Requires two NameNodes and some extra configuration; Active/Passive or Active/Active. Clients only contact the active NameNode, while DataNodes report in and heartbeat with both NameNodes. The active NameNode writes metadata to a quorum of JournalNodes, and the standby NameNode reads the JournalNodes to stay in sync. There is no CheckpointNode (SecondaryNameNode); the passive NameNode performs checkpoint operations.

HA NameNode Failover
There are two failover scenarios. Graceful: performed by an administrator for maintenance. Automated: the active NameNode fails. A failed NameNode must be fenced, which eliminates 'split-brain syndrome'. Two fencing methods are available: sshfence, which kills the NameNode daemon, and a shell script, which disables access to the NameNode (e.g., shuts down the network switch port or sends a power-off to the failed NameNode). There is no 'default' fencing method.

[Diagram: automated failover via ZooKeeper. Each NameNode has a ZKFC (ZooKeeper Failover Controller) holding a lock in ZooKeeper. When the active NameNode's lock is released, the standby's ZKFC creates the lock and its NameNode becomes active ("I'm the Boss"). The NameNodes share state via NFS or QJM, and the DataNodes report to both.]

HDFS Federation
Useful for: isolation/multi-tenancy, horizontal scalability of the HDFS namespace, and performance. Allows multiple independent NameNodes to use the same collection of DataNodes; DataNodes store blocks from all NameNode pools.

Federated NameNodes
The file-system namespace can scale beyond a single NameNode's heap size, and NameNode performance is no longer a bottleneck. NameNode failure/degradation is isolated: only data managed by the failed NameNode is unavailable. Each NameNode can be made Highly Available.

Hadoop Security
Hadoop's original design targeted web crawling and indexing; it was not designed for processing confidential data. With a small number of trusted users, access to the cluster was controlled by providing user accounts, with little or no control over what a user could do once logged in. HDFS permissions, similar to basic UNIX file permissions, were added in the Hadoop 0.16 release; they can be disabled via dfs.permissions. They are basically protection against user-induced accidents and do not protect against attacks: authentication is performed on the client side and is easily subverted via a simple configuration parameter.

Kerberos
Kerberos support was introduced in the Hadoop 0.22.2 release. Developed at MIT and freely available, Kerberos is not a Hadoop-specific feature and is not included in Hadoop releases. It works on the basis of 'tickets', which allow communicating nodes to securely identify each other across unsecure networks. It is primarily a client/server model implementing mutual authentication: the user and the server verify each other's identity.

How Kerberos Works
The client forwards the username to the KDC.
A. KDC sends the Client/TGS Session Key, encrypted with the user's password.
B. KDC issues a TGT, encrypted with the TGS's key.
C. Client sends B and the service ID to the TGS.
D. Client sends an authenticator, encrypted with A.
E. TGS issues the CTS ticket, encrypted with the SS's key.
F. TGS issues the CSS, encrypted with A.
G. Client sends a new authenticator, encrypted with F.
H. Server replies with the timestamp found in G, plus 1.
KDC = Key Distribution Center; TGS = Ticket Granting Service; TGT = Ticket Granting Ticket; CTS = Client-to-Server Ticket; CSS = Client/Server Session Key
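
The message flow above can be mimicked with a toy model. This is not real cryptography: "encryption" is simulated by pairing a payload with the key required to open it, purely to illustrate who can read which message. All names are hypothetical.

```python
# Toy model of the Kerberos A-H message flow. NOT real crypto: enc/dec
# just pair a payload with the key needed to open it.
import secrets

def enc(key, payload):
    return {"key": key, "payload": payload}

def dec(key, box):
    assert box["key"] == key, "wrong key"
    return box["payload"]

# Long-term secrets
user_pw = "hunter2"             # shared by client and KDC
tgs_key = secrets.token_hex(8)  # shared by KDC and TGS
ss_key  = secrets.token_hex(8)  # shared by TGS and service server (SS)

# Client -> KDC: username (in plaintext)
# A. KDC -> client: Client/TGS Session Key, encrypted with the user's password
ctgs_key = secrets.token_hex(8)
msg_a = enc(user_pw, ctgs_key)
# B. KDC -> client: TGT, encrypted with the TGS's key (opaque to the client)
msg_b = enc(tgs_key, {"client": "alice", "session": ctgs_key})
session = dec(user_pw, msg_a)   # client recovers the session key

# C/D. Client -> TGS: the TGT plus an authenticator encrypted with A
msg_d = enc(session, {"client": "alice", "ts": 1})
tgt = dec(tgs_key, msg_b)       # TGS opens the TGT
assert dec(tgt["session"], msg_d)["client"] == tgt["client"]

# E/F. TGS -> client: CTS ticket under the SS key, CSS under A
css_key = secrets.token_hex(8)
msg_e = enc(ss_key, {"client": "alice", "css": css_key})
msg_f = enc(session, css_key)

# G/H. Client authenticates to the service; the server proves itself back
css = dec(session, msg_f)
msg_g = enc(css, {"client": "alice", "ts": 2})
ticket = dec(ss_key, msg_e)
auth = dec(ticket["css"], msg_g)
msg_h = enc(ticket["css"], auth["ts"] + 1)  # timestamp from G, plus 1
assert dec(css, msg_h) == 3
print("mutual authentication complete")
```

Note how the client never sees the TGS's or the service server's long-term keys; it only ever handles session keys, which is the point of the ticket design.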

Kerberos Services
Authentication Server: authenticates the client and gives it enough information to authenticate with the Service Server. Service Server: authenticates the client, authenticates itself to the client, and provides services to the client.

Kerberos Limitations
Single point of failure: must use multiple servers and implement fallback authentication mechanisms. Strict time requirements: tickets are time-stamped, so clocks on all hosts must be carefully synchronized. All authentication is controlled by the KDC, so compromise of this infrastructure allows attackers to impersonate any user. Each network service requiring a different host name must have its own set of Kerberos keys, which complicates virtual hosting of clusters.

APACHE KAFKA

Overview
Kafka is publish-subscribe messaging rethought as a distributed commit log: fast, scalable, durable, and distributed.

Kafka adoption and use cases
LinkedIn: activity streams, operational metrics, data bus; 400 nodes, 18k topics, 220B msgs/day (peak 3.2M msgs/s) as of May 2014. Netflix: real-time monitoring and event processing. Twitter: as part of their Storm real-time data pipelines. Spotify: log delivery (from 4h down to 10s), Hadoop. Loggly: log collection and processing. Mozilla: telemetry data. Also Airbnb, Cisco, Gnip, InfoChimps, Ooyala, Square, Uber, ...

How fast is Kafka?
Up to 2 million writes/sec on 3 cheap machines, using 3 producers on 3 different machines and 3x async replication. Only 1 producer per machine, because the NIC was already saturated. Throughput is sustained as stored data grows (measured under a slightly different test configuration than the 2M writes/sec figure above).

Why is Kafka so fast?
Fast writes: while Kafka persists all data to disk, essentially all writes go to the OS page cache, i.e. RAM. Fast reads: it is very efficient to transfer data from the page cache to a network socket (on Linux, via the sendfile() system call). The combination of the two = fast Kafka! Example (operations): on a Kafka cluster where the consumers are mostly caught up, you will see no read activity on the disks, as they serve data entirely from cache.
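
The sendfile() zero-copy path mentioned above can be demonstrated in a few lines: bytes move from the file (page cache) to a socket without passing through user-space buffers. This is a Unix-specific sketch of the mechanism, not Kafka code.

```python
# Demonstrate the sendfile() zero-copy path: file -> socket, with no
# user-space copy. Unix-specific (os.sendfile); a sketch, not Kafka code.
import os
import socket
import tempfile

payload = b"kafka log segment bytes" * 100

with tempfile.NamedTemporaryFile() as f:
    f.write(payload)
    f.flush()
    left, right = socket.socketpair()   # stand-in for a consumer connection
    sent = 0
    while sent < len(payload):
        # Kernel copies straight from the file's page cache to the socket.
        sent += os.sendfile(left.fileno(), f.fileno(), sent, len(payload) - sent)
    left.close()
    received = b"".join(iter(lambda: right.recv(65536), b""))
    right.close()

assert received == payload
print(len(received), "bytes transferred via sendfile")
```

The Java equivalent Kafka actually uses is FileChannel.transferTo(), which maps to the same system call on Linux.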

A first look
The who's who: producers write data to brokers, and consumers read data from brokers. All of this is distributed. The data: data is stored in topics, and topics are split into partitions, which are replicated.

Topics
Topic: a feed name to which messages are published. Example: zerg.hydra. Kafka prunes the head based on age, max size, or key. Producers always append to the tail (think: appending to a file). [Diagram: producers A1..An appending newer messages to the tail of a topic on the broker(s), with older messages toward the head.]

Topics
Consumers use an offset pointer to track and control their read progress (and decide the pace of consumption). [Diagram: consumer groups C1 and C2 reading from the same topic at different offsets, while producers A1..An append to the tail on the broker(s).]

Partitions
A topic consists of partitions. Partition: an ordered, immutable sequence of messages that is continually appended to.

Partitions
The number of partitions of a topic is configurable, and it determines the maximum consumer (group) parallelism (cf. the parallelism of Storm's KafkaSpout via builder.setSpout(..., n)). [Diagram: consumer group A, with 2 consumers, reads from a 4-partition topic; consumer group B, with 4 consumers, reads from the same topic.]

Partition offsets
Offset: messages in a partition are each assigned a unique (per-partition), sequential id called the offset. Consumers track their pointers via (offset, partition, topic) tuples.
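
The topic/partition/offset model above can be sketched as a minimal in-memory log. The class and method names are hypothetical, not the Kafka API; this only illustrates that producers append to a per-partition tail while each consumer advances its own offset pointer.

```python
# Minimal in-memory model of topics, partitions, and consumer offsets.
# Hypothetical names; a sketch, not the Kafka client API.
class Topic:
    def __init__(self, name, num_partitions):
        self.name = name
        self.partitions = [[] for _ in range(num_partitions)]

    def append(self, key, value):
        # Producer side: pick a partition by key hash, append to the tail.
        p = hash(key) % len(self.partitions)
        self.partitions[p].append(value)
        return p, len(self.partitions[p]) - 1   # (partition, offset)

class Consumer:
    def __init__(self, topic):
        # One offset pointer per partition -- state lives with the consumer.
        self.topic = topic
        self.offsets = {p: 0 for p in range(len(topic.partitions))}

    def poll(self, partition):
        # Read everything past our offset, then advance the pointer.
        log = self.topic.partitions[partition]
        msgs = log[self.offsets[partition]:]
        self.offsets[partition] = len(log)
        return msgs

topic = Topic("zerg.hydra", num_partitions=4)
for i in range(8):
    topic.append(key=str(i), value=f"msg-{i}")

c = Consumer(topic)
first = [m for p in range(4) for m in c.poll(p)]
second = [m for p in range(4) for m in c.poll(p)]
assert len(first) == 8 and second == []   # offsets advanced past the tail
```

Because the broker never tracks per-consumer read state in this model, two independent Consumer objects can read the same topic at their own pace, which is exactly the consumer-group picture above.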

Replicas of a partition
Replicas are backups of a partition. They exist solely to prevent data loss: replicas are never read from and never written to, so they do NOT increase producer or consumer parallelism. Kafka tolerates (numReplicas - 1) dead brokers before losing data. LinkedIn: numReplicas == 2, so 1 broker can die.

APACHE CASSANDRA

In a couple dozen words... Apache Cassandra is an open-source, distributed, decentralized, elastically scalable, highly available, fault-tolerant, tunably consistent, column-oriented database, with a lot of adjectives.

Overview
Originally created by Facebook and open sourced in 2008. Based on Google Bigtable and Amazon Dynamo. Massively scalable and easy to use. No relation to Hadoop; specifically, data is not stored on HDFS.

Distributed and Decentralized
Distributed: can run on multiple machines. Decentralized: no single point of failure, and no master/slave issues, thanks to a peer-to-peer architecture (using a gossip protocol, specifically). Can run across geographic datacenters.
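
The peer-to-peer idea rests on partitioning data over a hash ring: every node owns a token range, so any node can locate a key's owner without consulting a master. A minimal sketch of that idea (illustrative only, not Cassandra's actual partitioner):

```python
# Sketch of hash-ring partitioning: each node owns the range of the ring
# up to its token; a key belongs to the first node at or after its token.
import bisect
import hashlib

def token(s):
    # Hash a string onto the ring (md5 here purely for determinism).
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes):
        self.ring = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.ring]

    def owner(self, key):
        # First node clockwise from the key's token (wrapping around).
        i = bisect.bisect(self.tokens, token(key)) % len(self.ring)
        return self.ring[i][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.owner("some-row-key"))
```

Because placement is a pure function of the key, adding a node only moves the keys in one token range, which is what makes decentralized, elastic scaling workable.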

Elastic Scalability
Scales horizontally: adding nodes linearly increases performance, and decreasing and increasing node counts happen seamlessly.

Highly Available and Fault Tolerant
Multiple networked computers form a cluster, with a facility for recognizing node failures and failing over requests to another part of the system.

Tunable Consistency
Choice between strong and eventual consistency, adjustable separately for read and write operations. Conflicts are resolved during reads.
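
A common way to reason about the strong-vs-eventual dial is the quorum rule: with replication factor N, a read quorum R and a write quorum W, a read is guaranteed to overlap the latest write whenever R + W > N. A one-function sketch:

```python
# Quorum rule of thumb for tunable consistency: reads overlap the latest
# write when read quorum + write quorum exceed the replication factor.
def is_strongly_consistent(n, r, w):
    """n = replication factor, r = read quorum, w = write quorum."""
    return r + w > n

# QUORUM reads + QUORUM writes at RF=3: every read set intersects
# every write set, so reads see the latest write.
assert is_strongly_consistent(3, 2, 2)

# ONE + ONE at RF=3: a read may hit a replica the write never reached,
# so consistency is only eventual.
assert not is_strongly_consistent(3, 1, 1)
```

Raising R or W buys consistency at the cost of latency and availability, which is exactly the per-operation tuning the slide describes.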

Column-Oriented
Data is stored in sparse multidimensional hash tables. A row can have multiple columns, and not necessarily the same number of columns as other rows. Each row has a unique key used for partitioning.

Query with CQL
Familiar SQL-like syntax that maps to Cassandra's storage engine and simplifies data modeling.
CREATE TABLE songs (
  id uuid PRIMARY KEY,
  title text,
  album text,
  artist text,
  data blob,
  tags set<text>
);
INSERT INTO songs (id, title, artist, album, tags)
VALUES ('a3e648f...', 'La Grange', 'ZZ Top', 'Tres Hombres', {'cool', 'hot'});
SELECT * FROM songs WHERE id = 'a3e648f...';

When should I use this?
Key features that complement a Hadoop system: geographical distribution, and large deployments of structured data.

REDIS

Introduction
An open-source, advanced key-value store written in ANSI C. Commonly referred to as a data structure server, since keys can contain strings, hashes, lists, sets, and sorted sets. Operations are atomic, and there are a bunch of them. All data is stored in-memory and can be persisted using snapshots or transaction logs. Trivial master-slave replication.

Clients
Redis itself is ANSI C, but the protocol is open source and developers have created client support in many languages: C, C#, C++, Clojure, Common Lisp, D, Dart, Emacs Lisp, Erlang, Fancy, GNU Prolog, Go, Haskell, Haxe, Java, Lua, Node.js, Objective-C, Perl, PHP, Pure Data, Python, Ruby, Rust, Scala, Scheme, Smalltalk, Tcl, ...

Data Types
Redis keys can be anything from a string to the byte array of a JPEG. Keys have associated data types, and we should talk about them: Strings, Lists, Hashes, Sets, Sorted Sets, HyperLogLogs.

Strings!
The simplest type. Supports a number of operations, including sets, gets, and incremental operations on values.
> SET mkey "my binary safe value"
OK
> GET mkey
"my binary safe value"

Lists!
Linked lists, actually, i.e. O(1) for inserts into the head or tail of the list; accessing an element by index is O(N).
> RPUSH messages "Hello how are you?"
(integer) 1
> RPUSH messages "Fine thanks. I'm having fun with Redis"
(integer) 2
> RPUSH messages "I should look into this NOSQL thing ASAP"
(integer) 3
> LRANGE messages 0 2
1) "Hello how are you?"
2) "Fine thanks. I'm having fun with Redis"
3) "I should look into this NOSQL thing ASAP"

Hashes!
Maps between string fields and string values.
> HMSET user:1000 username antirez password P1pp0 age 34
OK
> HGETALL user:1000
1) "username"
2) "antirez"
3) "password"
4) "P1pp0"
5) "age"
6) "34"
> HSET user:1000 password 12345
(integer) 0
> HGETALL user:1000
1) "username"
2) "antirez"
3) "password"
4) "12345"
5) "age"
6) "34"

Sets!
Unordered collection of strings. Supports adds, gets, is-member checks, intersections, unions, sorting...
> SADD myset 1
(integer) 1
> SADD myset 2
(integer) 1
> SADD myset 3
(integer) 1
> SMEMBERS myset
1) "1"
2) "2"
3) "3"
> SADD myotherset 2
(integer) 1
> SINTER myset myotherset
1) "2"
> SUNION myset myotherset
1) "1"
2) "2"
3) "3"

Sorted Sets!
Similar to sets, but each element has an associated score, and elements can be returned in order. Elements are kept sorted on insert (an O(log(N)) operation), so returning them in order is easy.
> ZADD hackers 1940 "Alan Kay"
(integer) 1
> ZADD hackers 1953 "Richard Stallman"
(integer) 1
> ZADD hackers 1965 "Yukihiro Matsumoto"
(integer) 1
> ZADD hackers 1916 "Claude Shannon"
(integer) 1
> ZADD hackers 1969 "Linus Torvalds"
(integer) 1
> ZADD hackers 1912 "Alan Turing"
(integer) 1
> ZRANGE hackers 0 -1
1) "Alan Turing"
2) "Claude Shannon"
3) "Alan Kay"
4) "Richard Stallman"
5) "Yukihiro Matsumoto"
6) "Linus Torvalds"

HyperLogLogs!
A probabilistic data structure to estimate the cardinality of a set. Very useful when you have a set with high cardinality (talking millions). PFADD returns 1 if the estimated cardinality changed, 0 otherwise.
> PFADD hll a b c d e f g
(integer) 1
> PFCOUNT hll
(integer) 7
> PFADD hll a
(integer) 0
> PFADD hll h
(integer) 1
> PFCOUNT hll
(integer) 8
http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/40671.pdf
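
The core trick behind HyperLogLog can be shown in miniature: hash each element and track the longest run of leading zero bits seen; seeing a long run implies many distinct elements. This toy uses a single estimator, so it is far noisier than Redis's HLL (which averages many registers), but it conveys the idea:

```python
# Miniature version of the HyperLogLog idea: estimate cardinality from the
# longest run of leading zero bits in element hashes. A single-estimator
# toy -- far noisier than a real HLL, which averages many registers.
import hashlib

def leading_zeros(h, bits=32):
    # Number of leading zero bits in a `bits`-bit value.
    return bits - h.bit_length()

max_run = 0
for i in range(100_000):
    h = int(hashlib.md5(str(i).encode()).hexdigest(), 16) & 0xFFFFFFFF
    max_run = max(max_run, leading_zeros(h))

estimate = 2 ** max_run   # very rough cardinality estimate
print(max_run, estimate)
```

With 100,000 distinct elements the longest run is typically around 17 bits, giving an estimate on the right order of magnitude from just a few bytes of state; that constant-size state is why PFCOUNT over millions of elements stays cheap.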

Features
Transactions, Pub/Sub, Lua Scripting, Key Expiration, Redis Clustering.

Transactions
Guarantees that no other client's requests are served in the middle of a transaction. Either all commands or none are processed, so they are atomic. MULTI begins a transaction and EXEC commits it; Redis queues the commands and processes them upon EXEC. All commands in the queue are processed, even if one fails.
> MULTI
OK
> INCR foo
QUEUED
> INCR bar
QUEUED
> EXEC
1) (integer) 1
2) (integer) 1
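
The queue-then-execute behaviour above can be modelled in a few lines. This is a toy, not the Redis implementation; the class and method names are hypothetical:

```python
# Toy model of MULTI/EXEC: commands are buffered, then run back to back,
# so no other client's command can land in the middle of the batch.
class Tx:
    def __init__(self, store):
        self.store, self.queue = store, []

    def incr(self, key):
        self.queue.append(key)      # queued, not executed yet
        return "QUEUED"

    def execute(self):              # plays the role of EXEC
        results = []
        for key in self.queue:      # all queued commands run, in order
            self.store[key] = self.store.get(key, 0) + 1
            results.append(self.store[key])
        self.queue.clear()
        return results

store = {}
tx = Tx(store)
assert tx.incr("foo") == "QUEUED"
assert tx.incr("bar") == "QUEUED"
assert tx.execute() == [1, 1]       # mirrors the EXEC reply above
```

Note what this model makes obvious: a command's result is not available until EXEC, which is why Redis transactions cannot branch on intermediate values (that is what Lua scripting is for).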

Pub/Sub
A messaging paradigm where publishers send messages to subscribers (if any) via channels. Subscribers express interest in channels (SUBSCRIBE test), and messages from publishers (PUBLISH test Hello) are pushed to them by Redis. Pattern-based subscriptions to channels are also possible: PSUBSCRIBE news.*
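
The SUBSCRIBE / PSUBSCRIBE / PUBLISH interplay can be sketched as an in-process toy. Helper names are hypothetical, and glob matching here uses Python's fnmatch rather than Redis's matcher:

```python
# Toy in-process pub/sub with channel and pattern subscriptions.
# A sketch of the semantics, not the Redis implementation.
import fnmatch
from collections import defaultdict

channels = defaultdict(list)   # channel name -> subscriber inboxes
patterns = []                  # (glob pattern, inbox) pairs

def subscribe(channel, inbox):
    channels[channel].append(inbox)

def psubscribe(pattern, inbox):
    patterns.append((pattern, inbox))

def publish(channel, message):
    # Deliver to direct subscribers plus any matching pattern subscribers.
    receivers = list(channels[channel])
    receivers += [inbox for pat, inbox in patterns
                  if fnmatch.fnmatch(channel, pat)]
    for inbox in receivers:
        inbox.append((channel, message))
    return len(receivers)      # like PUBLISH, the number of receivers

a, b = [], []
subscribe("test", a)           # SUBSCRIBE test
psubscribe("news.*", b)        # PSUBSCRIBE news.*
assert publish("test", "Hello") == 1
assert publish("news.tech", "Kafka 0.8 released") == 1
assert a == [("test", "Hello")]
assert b == [("news.tech", "Kafka 0.8 released")]
```

As in Redis, a message published to a channel with no direct or pattern subscribers is simply dropped; pub/sub is fire-and-forget, with no persistence.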

Lua Scripting
You can run Lua scripts to manipulate Redis:
> eval "return redis.call('set','foo','bar')" 0
OK

Expire Keys after time
Set a timeout on a key, having Redis automatically delete it after the set time. Use case: maintain session information for a user for the last 60 seconds, to recommend related products.
MULTI
RPUSH pageviews.user:<userid> http://...
EXPIRE pageviews.user:<userid> 60
EXEC
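
One way to implement expiration, sketched below, is to store an absolute deadline next to each value and treat expired keys as missing on read (lazy expiration, similar in spirit to one of the strategies Redis combines with periodic sweeps). Class and key names are illustrative:

```python
# Sketch of lazy key expiration: each value carries a deadline, and an
# expired key is deleted the next time it is read.
import time

class TTLStore:
    def __init__(self):
        self.data = {}   # key -> (value, expires_at or None)

    def set(self, key, value, ttl=None):
        deadline = time.monotonic() + ttl if ttl is not None else None
        self.data[key] = (value, deadline)

    def get(self, key):
        value, deadline = self.data.get(key, (None, None))
        if deadline is not None and time.monotonic() >= deadline:
            del self.data[key]     # expired: delete lazily on access
            return None
        return value

s = TTLStore()
s.set("pageviews.user:42", ["http://a", "http://b"], ttl=0.05)
assert s.get("pageviews.user:42") is not None
time.sleep(0.06)
assert s.get("pageviews.user:42") is None
```

Lazy deletion alone would leave never-read keys in memory forever, which is why a real implementation also needs a background sweep.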

Redis Cluster
Redis Cluster is not production-ready, but can be used to partition your data across multiple Redis instances. A few abstractions exist today to partition among multiple instances, but they do not come out of the box with a Redis download.

Use Cases
Session cache, ranking lists, auto-complete. Used by Twitter, GitHub, Pinterest, Snapchat, Craigslist, StackOverflow, Flickr, ...

CLUSTER PLANNING

Workload Considerations
Balanced workloads: jobs are distributed across various job types (CPU bound, disk I/O bound, network I/O bound). Compute-intensive workloads (data analytics): CPU-bound workloads require large numbers of CPUs and large amounts of memory to store in-process data. I/O-intensive workloads (sorting): I/O-bound workloads require a larger number of spindles (disks) per node. Not sure? Go with a balanced workload configuration.

Hardware Topology
Hadoop uses a master/slave topology. Master nodes include: NameNode (maintains filesystem metadata), Backup NN (performs checkpoint operations and hot standby), ResourceManager (manages task assignment). Slave nodes include: DataNode (stores HDFS blocks and manages read and write requests; preferably co-located with the TaskTracker/NodeManager), NodeManager (performs map/reduce tasks).

Sizing The Cluster
Remember: scaling is a relatively simple task. Start with a moderately sized cluster and grow it as requirements dictate. Develop a scaling strategy: as simple as scaling is, adding new nodes takes time and resources, and you don't want to be adding new nodes each week. The amount of data typically defines the initial cluster size, along with the rate at which the volume of data increases. Drivers for determining when to grow your cluster: storage requirements, processing requirements, memory requirements.

Storage Reqs Drive Cluster Growth
Say data volume increases at a rate of 1TB per week. With block replication, 3TB of storage are required each week to store the new data alone; consider additional overhead, typically 30%, and remember files that are stored on a node's local disk. If DataNodes incorporate 4x1TB drives, 1 new node per week is required. Two years of data, roughly 100TB, will require roughly 100 new nodes.
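
The sizing arithmetic above, made explicit. The figures are the slide's assumptions: 1TB/week of new data, 3x replication, 30% overhead, and 4TB of raw capacity per DataNode.

```python
# Cluster-growth arithmetic from the slide's assumptions:
# 1 TB/week raw data, 3x replication, 30% overhead, 4x1TB drives per node.
weekly_raw_tb    = 1
replication      = 3
overhead         = 0.30
node_capacity_tb = 4 * 1       # four 1 TB drives per DataNode

# Storage consumed per week, after replication and overhead.
weekly_storage_tb = weekly_raw_tb * replication * (1 + overhead)   # 3.9 TB

# How many nodes that burns per week, and over two years (~104 weeks).
nodes_per_week = weekly_storage_tb / node_capacity_tb              # ~1 node
two_years_raw_tb = weekly_raw_tb * 104                             # ~100 TB
nodes_after_two_years = (two_years_raw_tb * replication
                         * (1 + overhead) / node_capacity_tb)

print(round(nodes_per_week, 3), round(nodes_after_two_years))
```

The point of writing it out: the per-week node count is dominated by the replication factor and overhead, so either knob changes the growth rate directly.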

Things Break
Things are going to break; this assumption is a core premise of Hadoop. If a disk fails, the infrastructure must accommodate it. If a DataNode fails, the NameNode must manage this. If a task fails, the ApplicationMaster must manage the failure. Master nodes are typically a SPOF unless a Highly Available configuration is used: if the NameNode goes down, HDFS is inaccessible (use NameNode HA); if the ResourceManager goes down, no jobs can run (use RM HA, in development).

Cluster Nodes
Cluster nodes should be commodity hardware: buy more nodes, not more expensive nodes. Workload patterns and cluster size drive CPU choice. Small cluster (50 nodes or less): quad-core CPUs at a medium clock speed are usually sufficient. Large cluster: dual 8-core CPUs at a medium clock speed are sufficient, though compute-intensive workloads might require higher clock speeds. The general guideline is to buy more hardware instead of faster hardware. Lots of memory: 48GB / 64GB / 128GB / 256GB. Each map/reduce task consumes 1GB to 3GB of memory, and the OS and daemons consume memory as well.

Cluster Storage
4 to 12 drives of 1TB/2TB capacity, up to 24TB per node. 3TB drives work, but there is a network performance penalty when a node fails. 7200 rpm SATA drives are sufficient; a slightly above-average MTBF is advantageous. Use a JBOD configuration: RAID is slow and is not required due to block replication. More, smaller disks are preferred over fewer, larger disks, for increased parallelism within DataNodes. Slaves should never use virtual memory.

Master Nodes
Still commodity hardware, but... better. Redundant everything: power supplies, dual Ethernet cards. 16 to 24 CPU cores on NameNodes: NameNodes and their clients are very chatty and need the cores to handle messaging traffic. Medium clock speeds should be sufficient.

Master Nodes
The HDFS namespace is limited by the amount of memory on the NameNode. Use RAID and NFS storage on the NameNode: typically RAID 5 with a hot spare, plus a second remote directory such as NFS, or the Quorum Journal Manager for HA.

Network Considerations
Hadoop is bandwidth-intensive, and the network can be a significant bottleneck. Use dedicated switches; 10Gb Ethernet is pretty good for large clusters.

Which Operating System?
Choose an OS that you are comfortable and familiar with; consider your admin resources and experience. RedHat Enterprise Linux includes a support contract; CentOS has no support, but the price is right. Many other possibilities: SuSE Enterprise Linux, Ubuntu, Fedora.

Which Java Virtual Machine?
Oracle Java is the only supported JVM. Hadoop runs on OpenJDK, but use it at your own risk. Hadoop 1.0 requires Java JDK 1.6 or higher; Hadoop 2.x requires Java JDK 1.7.

References
http://cassandra.apache.org
http://redis.io/
http://try.redis.io (give it a test drive!)
http://www.slideshare.net/jbellis/apache-cassandra-nosql-in-the-enterprise
http://www.slideshare.net/planetcassandra/cassandra-introduction-features-30103666
http://research.microsoft.com/en-us/um/people/srikanth/netdb11/netdb11papers/netdb11-final12.pdf
http://www.slideshare.net/miguno/apache-kafka-08-basic-training-verisign