Hadoop in a Cloud. 서상원 (sangwon.seo@ahems.co.kr), http://www.ahems.co.kr. PlatformDay 2011


1 Hadoop in a Cloud
서상원 (sangwon.seo@ahems.co.kr)

2 Outline
- Amazon Elastic MapReduce
- MapReduce on Virtual Machines
- Speculative Execution on Virtual Machines
- Towards Hadoop In a Cloud: AhemoMapReduce

3 Amazon Elastic MapReduce
- Overview
- Paying Policy
- Workflow
- System Architecture

4 Amazon Elastic MapReduce: Overview
Amazon Elastic MapReduce = Hadoop processing + Amazon Web Services
Process:
1. Choose the compute cloud: EC2 (Elastic Compute Cloud) instances
2. Choose the storage service: S3 (Simple Storage Service)
3. Create a job flow using the Amazon Elastic MapReduce toolkit
4. After the job flow completes, retrieve the results from S3
5. Job flow state information is stored in SimpleDB
6. Pay only for the resources actually used (EC2, S3, SimpleDB)

5 Amazon Elastic MapReduce: Overview
Based on Apache Hadoop. From job creation to job completion:
- The following are distributed to each slave instance:
  - a copy of the MapReduce executable
  - a partition of the job's data set (from Amazon S3)
- Each slave instance uploads its results to S3
- Each slave instance reports its status to the master instance

6 Amazon Elastic MapReduce: Paying Policy
- You are charged only for the resources you actually use
- Can I know exactly how many instances my job needs?
  - The number of instances depends on the application
  - Assume 60% of the total disk space holds input data and the remaining 40% holds intermediate or output data
- Example: 3x replication on HDFS
  - For 5 TB of input data, 15 m1.xlarge instances (1,690 GB of disk space each) are needed
  - (5 TB * 3) / (1,690 GB * 0.6) ≈ 14.8, rounded up to 15
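The sizing rule on this slide can be sketched as a small helper. This is only an illustration of the slide's arithmetic, not an Amazon tool: the function name `estimate_instances` and its parameters are hypothetical, and it assumes 1 TB = 1,000 GB so that the result matches the slide's figure of 15.

```python
import math


def estimate_instances(input_gb, replication=3,
                       disk_gb_per_instance=1690, input_fraction=0.6):
    """Estimate the instance count needed to store the replicated input.

    input_fraction is the share of each instance's disk reserved for input
    data (the slide assumes 60%, leaving 40% for intermediate/output data).
    """
    if not 0 < input_fraction <= 1:
        raise ValueError("input_fraction must be in (0, 1]")
    replicated_gb = input_gb * replication          # HDFS stores 3 copies
    usable_gb = disk_gb_per_instance * input_fraction
    return math.ceil(replicated_gb / usable_gb)     # cannot rent a fraction


# 5 TB of input (taken as 5,000 GB) on m1.xlarge with 1,690 GB of disk each
print(estimate_instances(5000))
```

Rounding up matters here: the raw quotient is about 14.8, so 14 instances would leave part of the replicated input without space.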

7 Amazon Elastic MapReduce: Workflow

8 Amazon Elastic MapReduce: System Architecture
[Diagram: Master and Slave-1, Slave-2, ..., Slave-N in the cloud; virtual storage = HDFS]
[Table: features assessed for Amazon Elastic MapReduce: migration, autoscaling, decoupling computation and data]

9 MapReduce on Virtual Machines
- Evaluation of HDFS
- VM's Feasibility
- MapReduce on Virtual Machines: Performance Analysis
Source: "Evaluating MapReduce on Virtual Machines: The Hadoop Case", LNCS

10 Evaluation Environment
Physical machine spec:
- HW: two quad-core 2.33 GHz Xeon processors, 8 GB memory, 1 TB disk, 1 Gigabit Ethernet
- SW: Linux kernel, Hadoop
Virtual machine spec:
- HW: 1 VCPU, 1 GB memory
- SW: Xen

11 Evaluation Environment
- 7-node physical cluster vs. 7-node virtual cluster
- Virtual cluster: one virtual machine runs on each physical node

12 Evaluation of HDFS
- Uses the HDFS put and get commands
- Different data scales on a 7-node cluster
- Result: virtualization overhead

13 Evaluation of HDFS
- Different cluster scales with the same data distribution (512 MB per DataNode)
- Result: virtualization overhead

14 Evaluation Environment: 4 Configurations
1. Ph-Cluster: cluster size 7 physical nodes
2. V-Cluster: cluster size 7 nodes, load 1 VM/node
3. V2-Cluster: cluster size 13 nodes, load 2 VMs/node

15 Evaluation Environment: 4 Configurations
4. V4-Cluster: cluster size 25 nodes, load 4 VMs/node

16 VM's Feasibility
- WordCount execution with different loads per node (1, 2, and 4 VMs) and two data sets (1 GB and 8 GB)
- More computing cycles and more slots are free

17 MapReduce on Virtual Machines: Performance Analysis
- Sort and WordCount execution with different loads per node and different data sizes
- VMs are competing for the node I/O resources

18 Speculative Execution on Virtual Machines
- Overview
Source: "Improving MapReduce Performance in Heterogeneous Environments", OSDI '08

19 Fault Tolerance in MapReduce
- Node failure
  - Detect failure via periodic heartbeats
  - Re-execute completed or in-progress tasks
- Node performing poorly
  - Such a node is called a straggler
  - MapReduce runs a speculative copy of its task (also called a backup task)
[Diagram: Node 1, Node 2]
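The node-failure path on this slide can be illustrated with a minimal sketch. It is not Hadoop's code: the function names, the dictionary layout, and the 600-second timeout are hypothetical choices made only for the example.

```python
def find_failed_nodes(last_heartbeat, now, timeout_sec=600.0):
    """Return nodes whose most recent heartbeat is older than timeout_sec.

    last_heartbeat maps node name -> timestamp (seconds) of its last heartbeat.
    """
    return sorted(node for node, ts in last_heartbeat.items()
                  if now - ts > timeout_sec)


def tasks_to_reexecute(assignments, failed_nodes):
    """Return tasks (completed or in progress) that ran on failed nodes.

    assignments maps task name -> node name.
    """
    failed = set(failed_nodes)
    return [task for task, node in assignments.items() if node in failed]


heartbeats = {"node1": 100.0, "node2": 990.0}
failed = find_failed_nodes(heartbeats, now=1000.0)
assignments = {"map-0": "node1", "map-1": "node2", "reduce-0": "node1"}
print(failed, tasks_to_reexecute(assignments, failed))
```

The sketch reschedules every task that was assigned to a failed node, which mirrors the slide's "completed or in-progress" wording.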

20 Straggler Checking in Hadoop
- When nodes become free, look for backups to launch
- Tasks report a progress score from 0 to 1
- Launch a backup if:
  1. the task's current progress < avgProgress - 0.2
  2. the task has run for at least one minute
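The check described on this slide can be sketched roughly as follows. The `Task` class and `native_backup_candidates` function are hypothetical names, and the 0.2 gap below the average is taken from the cited OSDI '08 paper's description of Hadoop's scheduler, so treat it as an assumption rather than a quote from Hadoop's source. Real Hadoop averages over tasks of the same type (maps or reduces); the sketch assumes the list already contains a single type.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    progress: float      # progress score in [0, 1]
    runtime_sec: float   # seconds since the task started


def native_backup_candidates(tasks, gap=0.2, min_runtime_sec=60.0):
    """Return names of running tasks that the progress-based rule would back up."""
    running = [t for t in tasks if t.progress < 1.0]
    if not running:
        return []
    avg_progress = sum(t.progress for t in running) / len(running)
    return [t.name for t in running
            if t.progress < avg_progress - gap       # well behind the average
            and t.runtime_sec >= min_runtime_sec]    # has run at least a minute


tasks = [Task("a", 0.9, 120), Task("b", 0.8, 120),
         Task("c", 0.3, 120), Task("d", 0.2, 30)]
print(native_backup_candidates(tasks))
```

In this example task `d` is behind the average but is skipped because it has run for only 30 seconds.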

21 Problems in Virtual Cluster
1. Too many backups, thrashing shared resources such as network bandwidth
2. Wrong tasks are backed up
3. Backups may be placed on slow nodes
4. Breaks when tasks start at different times
Example: ~80% of reduces were backed up, most of them losing to the originals, with the network thrashing

22 Hadoop's Wrong Speculation
Hadoop's wrong speculation in a heterogeneous environment:
- Fast node: progress score 0.5, time left 1 min
- Slow node: progress score 0.7, time left 5 min
Speculating the slow node's task is much better, but the Hadoop scheduler might speculate the fast node's task
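The mismatch on this slide can be made concrete with the slide's own numbers. The sketch below is a simplification: it ranks tasks by progress score alone to stand in for a progress-based heuristic, and the function names are hypothetical.

```python
tasks = [
    {"node": "fast", "progress": 0.5, "time_left_min": 1},
    {"node": "slow", "progress": 0.7, "time_left_min": 5},
]


def pick_by_progress(tasks):
    """Progress-based choice: back up the task with the lowest progress score."""
    return min(tasks, key=lambda t: t["progress"])["node"]


def pick_by_time_left(tasks):
    """Time-based choice: back up the task expected to finish last."""
    return max(tasks, key=lambda t: t["time_left_min"])["node"]


print(pick_by_progress(tasks))   # fast
print(pick_by_time_left(tasks))  # slow
```

The task on the fast node has the lower score only because it started later, so a progress-based rule picks the task that will finish first anyway.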

23 Solution
- Back up the task with the largest estimated finish time
  - Longest Approximate Time to End (LATE)
  - Look forward instead of looking backward
- Sanity thresholds:
  - Cap the number of backup tasks
  - Launch backups on fast nodes
  - Only back up tasks that are sufficiently slow

24 LATE Details
- Estimating finish times:
  progress rate = progress score / execution time
  estimated time left = (1 - progress score) / progress rate
- Threshold values: 10% cap on backups; 25th percentile of progress rate for slow nodes/tasks
- Validated by sensitivity analysis
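The two estimates and two of the thresholds on this slide can be combined into a short sketch. It is an illustration under stated assumptions, not the authors' implementation: `RunningTask` and `late_pick` are hypothetical names, the percentile uses a simple nearest-rank rule, and the slow-node check (launching backups only on fast nodes) is left out.

```python
from dataclasses import dataclass


@dataclass
class RunningTask:
    name: str
    progress: float      # progress score in (0, 1)
    elapsed_sec: float   # execution time so far

    @property
    def progress_rate(self):
        return self.progress / self.elapsed_sec

    @property
    def time_left_sec(self):
        return (1.0 - self.progress) / self.progress_rate


def late_pick(tasks, running_backups, total_slots,
              backup_cap=0.10, slow_task_percentile=0.25):
    """Return the task to back up, or None if no backup should be launched."""
    if running_backups >= backup_cap * total_slots:
        return None                                   # cap on backup tasks
    candidates = [t for t in tasks if t.progress > 0 and t.elapsed_sec > 0]
    if not candidates:
        return None
    rates = sorted(t.progress_rate for t in candidates)
    threshold = rates[int(slow_task_percentile * (len(rates) - 1))]
    slow = [t for t in candidates if t.progress_rate <= threshold]
    return max(slow, key=lambda t: t.time_left_sec)   # longest time to end


tasks = [RunningTask("fast-node-task", 0.5, 60.0),
         RunningTask("slow-node-task", 0.7, 700.0)]
choice = late_pick(tasks, running_backups=0, total_slots=20)
print(choice.name, round(choice.time_left_sec))
```

The elapsed times are chosen so that the estimates reproduce the 1 min and 5 min "time left" values from the earlier fast-node and slow-node example.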

25 Evaluation
- Environments:
  - EC2 nodes (3 job types)
  - Small local testbed
- Self-contention through VM placement
- Stragglers through background processes

26 EC2 Sort with Stragglers
[Chart: normalized response time, worst and average cases, for No Backups, Hadoop Native, and LATE Scheduler]
- Average: 58% speedup over native, 220% over no backups
- 93% max speedup over native

27 EC2 Sort without Stragglers
[Chart: normalized response time, worst and average cases, for No Backups, Hadoop Native, and LATE Scheduler]
- Average: 27% speedup over native, 31% over no backups

28 Implications
- 2x improvement using a simple algorithm
- The Hadoop scheduler should be changed to be aware of heterogeneous environments

29 Toward Hadoop In a Cloud
- Limitation of the Existing Works
- Toward Hadoop In a Cloud: AhemoMapReduce
- System Architecture
- Features

30 Limitation of the Existing Works
[Diagram: Master and Slave-1, Slave-2, ..., Slave-N in the cloud; virtual storage = HDFS]
[Table: features assessed for Amazon Elastic MapReduce: migration, autoscaling, decoupling computation and data]

31 Toward Hadoop In a Cloud: AhemoMapReduce
[Diagram: Master and Slave-1, Slave-2, ..., Slave-N in the cloud; virtual storage = HDFS]
[Table: features compared between Amazon Elastic MapReduce and AhemoMapReduce: migration, autoscaling, decoupling computation and data]

32 AhemoMapReduce System Architecture
[Diagram: Ahems Master and Slave-1, Slave-2, ..., Slave-N with migration and autoscaling on the Ahemos cloud; Ahems Storage = AhemoFS]
- Pay as you use (cost * delay product)
- Based on a virtual cloud (Ahemos)
- Based on a high-performance distributed file system (AhemoFS)

33 AhemoMapReduce Features
Common approach, which treats the layers separately:
- Hadoop Layer: task placement (data locality); task scheduling (fair scheduling, delay scheduling)
- VM Layer: migration? autoscaling?
- Virtual storage: HDFS
AhemoMapReduce approach:
- Hadoop Layer: virtual-cloud-aware task placement; virtual-cloud-aware task scheduling
- VM Layer: Hadoop-aware migration; Hadoop-aware autoscaling
- Virtual storage: AhemoFS

34 Summary
- Amazon Elastic MapReduce
- MapReduce on Virtual Machines
- Speculative Execution on Virtual Machines
- Towards Hadoop In a Cloud: AhemoMapReduce

35 Thank you.
