Building 1000 node cluster on EMR Manjeet Chayel





What is EMR? Amazon Elastic MapReduce

What is EMR? Hadoop-as-a-service; MapReduce engine; integrated with tools; massively parallel; integrated with AWS services; cost-effective AWS wrapper.

[Diagram, built up over several slides: Amazon EMR shown with Amazon S3, Amazon DynamoDB, Amazon Kinesis, Amazon RDS, Amazon Redshift, and AWS Data Pipeline, under the labels "Data sources", "Data management", and "Hadoop ecosystem analytical tools".]

Amazon EMR concepts: master node, core nodes, task nodes.

Core nodes: run the DataNode (HDFS). [Diagram: an Amazon EMR cluster with a master instance group (master node) and a core instance group.]

Core nodes: you can add core nodes for more CPU, more memory, and more space.

Core nodes: you can't remove core nodes, because that risks HDFS corruption.

Task nodes: no HDFS. They provide compute resources only: CPU and memory.

Task nodes: you can add and remove task nodes.

Spark On Amazon EMR

Bootstrap actions: the ability to run or install additional packages/software on EMR nodes. A bootstrap action is a simple bash script stored on S3. The script is executed at node/instance boot time, on every node that gets added to the cluster.

Spark on Amazon EMR: a bootstrap action installs Spark on EMR nodes. Currently on Spark 0.8.1, upgrading to 1.0 very soon.

Why Spark on Amazon EMR? Deploy small and large Spark clusters in minutes. EMR manages your cluster and handles node recovery in case of failures. Integration with the EC2 Spot Market, Amazon Redshift, AWS Data Pipeline, Amazon CloudWatch, etc. Tight integration with Amazon S3.

Why Spark on Amazon EMR? Spark logs are shipped to S3 for debugging. Define the S3 bucket at cluster deploy time.

Scaling Spark on EMR: launch the initial Spark cluster with core nodes, using HDFS to store and checkpoint RDDs. [Diagram: core nodes with 32 GB of memory.]

Scaling Spark on EMR: add task nodes from the spot market to increase memory capacity. [Diagram: 32 GB on core nodes plus 256 GB on task nodes.]

Scaling Spark on EMR: create RDDs from HDFS or Amazon S3 with sc.textFile or sc.sequenceFile, then run the computation on the RDDs.

Scaling Spark on EMR: save the resulting RDDs to HDFS or S3 with rdd.saveAsSequenceFile or rdd.saveAsObjectFile.

Scaling Spark on EMR: shut down the task nodes when your job is done. [Diagram: back to 32 GB on core nodes.]
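
The load, compute, and save steps on the slides above could look roughly like this in PySpark. This is a minimal sketch, not the speaker's code: the function names, the decision to sum page views per page, and both path arguments are assumptions made for illustration. It also assumes a PySpark release that provides RDD.saveAsSequenceFile, and that `sc` is an already-created SparkContext.

```python
def parse_line(line):
    """Split one whitespace-delimited line into (pagename, pageviews).

    Returns None for malformed lines so they can be filtered out.
    """
    parts = line.split()
    if len(parts) != 4:
        return None
    _projectcode, pagename, pageviews, _size = parts
    try:
        return (pagename, int(pageviews))
    except ValueError:
        return None


def total_views_per_page(sc, input_path, output_path):
    """Load text from HDFS or S3, sum page views per page, save the result.

    `sc` is an existing SparkContext; both paths are chosen by the caller
    (hypothetical arguments, not taken from the slides).
    """
    lines = sc.textFile(input_path)                  # create the RDD
    pairs = lines.map(parse_line).filter(lambda kv: kv is not None)
    totals = pairs.reduceByKey(lambda a, b: a + b)   # run the computation
    totals.saveAsSequenceFile(output_path)           # save the resulting RDD
    return totals
```

On a cluster this would be called as total_views_per_page(sc, input_path, output_path) from a PySpark shell or job, with the task nodes added beforehand and shut down afterwards as the slides describe.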

Elastic Spark with Amazon EMR

Autoscaling Spark [Diagram over two slides: the Amazon EMR cluster grows from 32 GB of memory to 32 GB plus 256 GB.]

When to scale? It depends on your job: is it CPU-bound or memory-intensive? Probably both for Spark jobs. Use CPU and memory utilization metrics to decide when to scale.

Spark autoscaling based on memory: Spark needs memory, lots of it! How do you scale based on memory usage?
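
One way to picture a memory-based scaling rule is a simple threshold check on cluster memory utilization. The sketch below is only an illustration: the function name, the return labels, and the 80% / 30% thresholds are assumptions, not values from the talk. It only ever adds or removes task nodes, since the deck notes that core nodes cannot be removed.

```python
def scaling_decision(memory_used_gb, memory_total_gb,
                     scale_out_above=0.80, scale_in_below=0.30):
    """Return 'add_task_nodes', 'remove_task_nodes', or 'hold'.

    The thresholds are illustrative assumptions; tune them per job.
    """
    if memory_total_gb <= 0:
        raise ValueError("memory_total_gb must be positive")
    utilization = memory_used_gb / memory_total_gb
    if utilization > scale_out_above:
        return "add_task_nodes"
    if utilization < scale_in_below:
        return "remove_task_nodes"
    return "hold"


# Example: a 32 GB core group that is nearly full calls for more task nodes.
print(scaling_decision(30, 32))
```

In practice the inputs would come from whatever CPU and memory utilization metrics the cluster reports; how those are collected is outside this sketch.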

Launching a 1000 node Cluster Amazon EMR

Launching a 1000-node cluster: what do I need to launch a cluster? An AWS account and the Amazon EMR CLI.

Launching a 1000-node cluster: easy to launch with one command.
./elastic-mapreduce --create --alive --name "Spark/Shark Cluster" \
  --bootstrap-action s3://elasticmapreduce/samples/spark/0.8.1/install-spark-shark.sh \
  --bootstrap-name "Spark/Shark" \
  --instance-type m1.xlarge \
  --instance-count 1000
Comes up in 15-20 minutes.

Launching a 1000-node cluster: adding task nodes.
--add-instance-group TASK --instance-count INSTANCE_COUNT --instance-type INSTANCE_TYPE
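
If the launch is scripted rather than typed, the same command lines can be assembled programmatically. The following sketch only builds and prints the argument lists using the options shown on the slides; the helper names are hypothetical, nothing is executed, and the slides do not show how the task-node options are attached to a running cluster, so that part is left as options only.

```python
import shlex


def launch_command(name, bootstrap_action, bootstrap_name,
                   instance_type, instance_count):
    """Build the cluster-creation argv from the options shown on the slide."""
    return [
        "./elastic-mapreduce", "--create", "--alive",
        "--name", name,
        "--bootstrap-action", bootstrap_action,
        "--bootstrap-name", bootstrap_name,
        "--instance-type", instance_type,
        "--instance-count", str(instance_count),
    ]


def add_task_nodes_options(instance_count, instance_type):
    """Build only the task-node options shown on the slide."""
    return [
        "--add-instance-group", "TASK",
        "--instance-count", str(instance_count),
        "--instance-type", instance_type,
    ]


argv = launch_command(
    "Spark/Shark Cluster",
    "s3://elasticmapreduce/samples/spark/0.8.1/install-spark-shark.sh",
    "Spark/Shark",
    "m1.xlarge",
    1000,
)
# shlex.join (Python 3.8+) quotes arguments that contain spaces.
print(shlex.join(argv))
```

Keeping the arguments as a list avoids shell-quoting mistakes with names such as "Spark/Shark Cluster" if the list is later passed to a process launcher.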

Is the cluster ready? The cluster will be in the Waiting state.

Lynx interface: lynx http://localhost:9101

Web Interface

Spark UI

Dataset: Wikipedia article traffic statistics. 4.5 TB, 104 billion records, stored in Amazon S3 at s3://bigdata-spark-demo/wikistats/

Dataset: file structure pagecount-DATE-HOUR.gz. Period: Dec 2007 to Feb 2014. File format: TSV. Fields: projectcode, pagename, pageviews, and bytes.

Sample dataset
projectcode  pagename                          pageviews  bytes
en           Barack_Obama                      997        123091092
en           Barack_Obama%27s_first_100_days   8          850127
en           Barack_Obama,_Jr                  1          144103
en           Barack_Obama,_Sr.                 37         938821
en           Barack_Obama_%22HOPE%22_poster    4          81005
en           Barack_Obama_%22Hope%22_poster    5          102081
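
To make the record layout concrete, here is a small parser for one row of the sample above. It is a sketch under assumptions: the type and function names are hypothetical, the field named "bytes" on the slide is stored as `nbytes`, and fields are split on any whitespace so the same code handles tab- or space-delimited lines. Decoding the percent-escapes in the page name is an added convenience, not something the slides do.

```python
from typing import NamedTuple
from urllib.parse import unquote


class PageCount(NamedTuple):
    projectcode: str
    pagename: str
    pageviews: int
    nbytes: int


def parse_record(line):
    """Parse one 'projectcode pagename pageviews bytes' line.

    Raises ValueError if the line does not have exactly four fields
    or the numeric fields are not integers.
    """
    projectcode, pagename, pageviews, size = line.split()
    return PageCount(projectcode, unquote(pagename), int(pageviews), int(size))


record = parse_record("en Barack_Obama%27s_first_100_days 8 850127")
print(record.pagename, record.pageviews)
```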

Loading data Amazon EMR Amazon S3

Analyze options

Table structure
create external table wikistats (
  projectcode string,
  pagename string,
  pageviews int,
  pagesize int
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ' '
LOCATION 's3n://bigdata-spark-demo/wikistats/';
ALTER TABLE wikistats add partition(dt='2007-12') location 's3n://bigdata-spark-demo/wikistats/2007/2007-12';
... adding partitions for every month till 2014-04
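
Writing one ALTER TABLE statement per month by hand is tedious, so the statements can be generated. The sketch below assumes every month follows the same YEAR/YEAR-MONTH folder layout as the single 2007-12 example on the slide; that layout for other months, the helper names, and the default arguments are assumptions.

```python
def month_range(start, end):
    """Yield (year, month) tuples from start to end, inclusive."""
    year, month = start
    while (year, month) <= end:
        yield year, month
        month += 1
        if month > 12:
            year, month = year + 1, 1


def partition_statements(start=(2007, 12), end=(2014, 4),
                         base="s3n://bigdata-spark-demo/wikistats"):
    """Build one ALTER TABLE ... ADD PARTITION statement per month."""
    statements = []
    for year, month in month_range(start, end):
        dt = f"{year:04d}-{month:02d}"
        statements.append(
            f"ALTER TABLE wikistats ADD PARTITION (dt='{dt}') "
            f"LOCATION '{base}/{year:04d}/{dt}';"
        )
    return statements


statements = partition_statements()
print(statements[0])
print(len(statements), "partitions")
```

The generated text can then be pasted into, or piped to, the SQL shell in use.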

Analyze using Shark. Query 1: top 10 page views in Jan 2014. Execution time: 26 seconds, scanning 250 GB of data.

Analyze using Shark. Query 2: top 10 page views overall. Execution time: 45 seconds, scanning 4.5 TB of data.

Analyze using Shark. Query 3: number of pages in each project code. Execution time: 48 seconds, scanning 4.5 TB of data (104 billion records).
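
The query text itself is not shown on the slides. As a rough illustration of what a "top N page views" aggregation computes, here is a plain-Python stand-in that runs on a handful of rows. It is not Shark and not the speaker's query; the function name and the input shape of (pagename, pageviews) pairs are assumptions.

```python
from collections import Counter


def top_pages(records, n=10):
    """Sum page views per page name and return the n largest totals.

    `records` is an iterable of (pagename, pageviews) pairs.
    """
    totals = Counter()
    for pagename, pageviews in records:
        totals[pagename] += pageviews
    return totals.most_common(n)


# A few rows from the sample dataset slide.
sample = [
    ("Barack_Obama", 997),
    ("Barack_Obama,_Sr.", 37),
    ("Barack_Obama%27s_first_100_days", 8),
]
print(top_pages(sample, n=2))
```

On the real 4.5 TB dataset the same grouping and ordering would be expressed as a SQL query and distributed across the cluster rather than held in one process.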

Spark Streaming and Amazon Kinesis

Amazon Kinesis
CreateStream: creates a new data stream within the Kinesis service
PutRecord: adds new records to a Kinesis stream
DescribeStream: provides metadata about the stream, including name, status, shards, etc.
GetNextRecord: fetches the next record for processing by user business logic
MergeShard / SplitShard: scales the stream up/down
DeleteStream: deletes the stream


What would you like to see? Send feedback to: Manjeet Chayel, chayel@amazon.com, http://bit.ly/sparkemr