Building 1000 node cluster on EMR Manjeet Chayel

Size: px

Start display at page:

Download "Building 1000 node cluster on EMR Manjeet Chayel"

Silvester Parks
10 years ago
Views:

1 Building 1000 node cluster on EMR Manjeet Chayel

2 What is EMR? Amazon Elas+c MapReduce

3 Hadoop- as- a- service Map- Reduce engine What is EMR? Integrated with tools Massively parallel Integrated to AWS services Cost effecbve AWS wrapper

4 Amazon EMR

5 Amazon EMR AWS Data Pipeline Amazon S3 Amazon DynamoDB

6 Data Sources Amazon EMR Amazon Kinesis AWS Data Pipeline Amazon S3 Amazon DynamoDB

7 Data management Hadoop Ecosystem analybcal tools Data Sources Amazon EMR Amazon Kinesis AWS Data Pipeline Amazon S3 Amazon DynamoDB

8 Data management Hadoop Ecosystem analybcal tools Data Sources Amazon EMR Amazon RDS Amazon Kinesis AWS Data Pipeline Amazon RedShift Amazon S3 Amazon DynamoDB

9 Data management Hadoop Ecosystem analybcal tools Data Sources Amazon EMR Amazon RDS Amazon Kinesis AWS Data Pipeline Amazon RedShift Amazon S3 Amazon DynamoDB AWS Data Pipeline

10 Amazon EMR Concepts Master Node Core Nodes Task Nodes

11 Core Nodes Amazon EMR cluster DataNode () Master Node Master instance group Core instance group

12 Core Nodes Amazon EMR cluster Can Add Core Nodes: Master Node Master instance group More CPU More Memory More Space Core instance group

13 Core Nodes Amazon EMR cluster Can t remove core nodes: Master Node Master instance group corrupbon Core instance group

14 Task Nodes Amazon EMR cluster No Master Node Master instance group Provides compute resources: CPU Memory Core instance group

15 Task Nodes Amazon EMR cluster Can add and remove task nodes Master Node Master instance group Core instance group

16 Spark On Amazon EMR

17 Bootstrap AcBons Ability to run or install addibonal packages/ souware on EMR nodes Simple bash script stored on S3 Script gets executed during node/instance boot Bme Script gets executed on every node that gets added to the cluster

on S3 Script gets executed during node/instance boot Bme

18 Spark on Amazon EMR Bootstrap acbon installs Spark on EMR nodes Currently on Spark & upgrading to 1.0 very soon

19 Why Spark on Amazon EMR? Deploy small and large Spark clusters in minutes EMR manages your cluster, and handles node recover in case of failures IntegraBon with EC2 Spot Market, Amazon RedshiU, Amazon Data pipeline, Amazon Cloudwatch and etc Tight integrabon with Amazon S3

cluster, and handles node recover in case of failures IntegraBon

20 Why Spark on Amazon EMR? Shipping Spark logs to S3 for debugging Define S3 bucket at cluster deploy Bme

21 Scaling Spark on EMR Amazon EMR cluster Launch inibal Spark cluster with core nodes Master Node Master instance group to store and checkpoint RDDs 32GB Memory

22 Scaling Spark on EMR Amazon EMR cluster Add Task nodes in spot market to increase memory capacity Master Node Master instance group 32GB Memory 256GB Memory

23 Scaling Spark on EMR Create RDDs from or Amazon S3 with: Amazon EMR cluster Master Node Master instance group sc.textfile OR sc.sequencefile Run ComputaBon on RDDs 32GB Memory 256GB Memory Amazon S3

24 Scaling Spark on EMR Amazon EMR cluster Save the resulbng RDDs to or S3 with: Master Node Master instance group rdd.saveassequencefile OR rdd.saveasobjectfile 32GB Memory 256GB Memory saveasobjectfile Amazon S3

25 Scaling Spark on EMR Amazon EMR cluster Shutdown TaskNodes when your job is done Master Node Master instance group 32GB Memory

26 ElasBc Spark With Amazon EMR

27 Autoscaling Spark Amazon EMR cluster Master Node 32GB Memory

28 Autoscaling Spark Amazon EMR cluster Master Node 32GB Memory 256GB Memory

29 When to Scale? Depends on your job ElasBc Spark CPU bounded or Memory intensive? Probably both for Spark jobs Use CPU/Memory ubl. metrics to decide when to scale

30 Spark Autoscaling Based Memory Spark needs memory Lots of it!! How to scale based on the memory usage?

31 Launching a 1000 node Cluster Amazon EMR

32 Launching a 1000 node Cluster What do I need to launch a cluster? AWS Account Amazon EMR CLI

33 Launching a 1000 node Cluster Easy to launch 1 command./elas2c- mapreduce - - create alive - - name "Spark/Shark Cluster" \ - - bootstrap- ac2on s3://elasbcmapreduce/samples/spark/0.8.1/install- spark- shark.sh - - bootstrap- name "Spark/Shark" - - instance- type m1.xlarge - - instance- count 1000 Comes up in 15-20mins

34 Launching a 1000 node Cluster Adding Task nodes - - add- instance- group TASK - - instance- count INSTANCE_COUNT - - instance- type INSTANCE_TYPE

35 Is cluster ready? Cluster will be in WaiBng state

37 lynx hhp://localhost:9101 Lynx Interface

38 Web Interface

39 Spark UI

40 Dataset Wikipedia arbcle traffic stabsbcs 4.5 TB 104 Billion records Stored in Amazon S3 s3://bigdata- spark- demo/wikistats/

41 Dataset File structure (pagecount- DATE- HOUR.gz) Period: Dec to Feb Format of File (tsv) Feilds projectcode, pagename, pageviews, and bytes

42 Sample dataset Projectcode pagename pageviews bytes en Barack_Obama en Barack_Obama%27s_first_100_days en Barack_Obama,_Jr en Barack_Obama,_Sr en Barack_Obama_%22HOPE%22_poster en Barack_Obama_%22Hope%22_poster

43 Loading data Amazon EMR Amazon S3

44 Analyze opbons

45 Table structure create external table wikistats ( projectcode string, pagename string, pageviews int, pagesize int ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ' ' LOCATION 's3n://bigdata- spark- demo/wikistats/'; ALTER TABLE wikistats add parbbon(dt= ) locabon 's3n://bigdata- spark- demo//wikistats/2007/ ';. Adding parbbons for every month Bll

46 Analyze using Shark Top 10 Page Views in Jan 2014 Exec Bme: 26 secs Scanning 250GB of data

47 Analyze using Shark Query 2: Top 10 Page Views Overall Exec Bme: 45 sec Scanning 4.5TB of data

48 Analyze using Shark Query 3: No of pages in each projectcodes Exec Bme: 48 secs Scanning 4.5TB of data / 104 Billion records

49 Spark Streaming and Amazon Kinesis

50 Amazon Kinesis CreateStream Creates a new Data Stream within the Kinesis Service PutRecord Adds new records to a Kinesis Stream DescribeStream Provides metadata about the Stream, including name, status, Shards, etc. GetNextRecord Fetches next record for processing by user business logic MergeShard / SplitShard Scales Stream up/ down DeleteStream Deletes the Stream

51 Amazon Kinesis Kinesis

52 Amazon Kinesis Kinesis

53 What Do You Like To See? Send Feedbacks To: Manjeet Chayel hhp://bit.ly/sparkemr

Hadoop & Spark Using Amazon EMR

Hadoop & Spark Using Amazon EMR Michael Hanisch, AWS Solutions Architecture 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda Why did we build Amazon EMR? What is Amazon EMR?