Fuxi: a Fault Tolerant Resource Management and Job Scheduling System at Internet Scale



Similar documents
Computing at Scale: Resource Scheduling Architectural Evolution and Introduction to Fuxi System

Big Data Technology Core Hadoop: HDFS-YARN Internals

Cloud Infrastructure Licensing, Packaging and Pricing

Cloud Optimize Your IT

GraySort on Apache Spark by Databricks

TekSouth Fights US Air Force Data Center Sprawl with iomemory

Cloud Computing and Amazon Web Services

HP recommended configuration for Microsoft Exchange Server 2010: HP LeftHand P4000 SAN

SUN ORACLE DATABASE MACHINE

Extending Hadoop beyond MapReduce

Brian LaGoe, Systems Administrator Benjamin Jellema, Systems Administrator Eastern Michigan University

Zadara Storage Cloud A

wu.cloud: Insights Gained from Operating a Private Cloud System

Centrata IT Management Suite 3.0

How To Use Vsphere On Windows Server 2012 (Vsphere) Vsphervisor Vsphereserver Vspheer51 (Vse) Vse.Org (Vserve) Vspehere 5.1 (V

StruxureWare TM Data Center Expert

Open Cloud System. (Integration of Eucalyptus, Hadoop and AppScale into deployment of University Private Cloud)

ORACLE DATABASE 10G ENTERPRISE EDITION

bla bla OPEN-XCHANGE Open-Xchange Hardware Needs

Mit Soft- & Hardware zum Erfolg. Giuseppe Paletta

Axceleon s CloudFuzion Turbocharges 3D Rendering On Amazon s EC2

AlphaTrust PRONTO - Hardware Requirements

Dell Virtualization Solution for Microsoft SQL Server 2012 using PowerEdge R820

Configuration Maximums VMware vsphere 4.0

Integrated Application and Data Protection. NEC ExpressCluster White Paper

Overview: X5 Generation Database Machines

Scaling Objectivity Database Performance with Panasas Scale-Out NAS Storage

ACANO SOLUTION VIRTUALIZED DEPLOYMENTS. White Paper. Simon Evans, Acano Chief Scientist

Scala Storage Scale-Out Clustered Storage White Paper

Dragon Medical Enterprise Network Edition Technical Note: Requirements for DMENE Networks with virtual servers

...DYNAMiC INTERNET SOLUTiONS >> Reg.No. 1995/020215/23

With Red Hat Enterprise Virtualization, you can: Take advantage of existing people skills and investments

Brocade and EMC Solution for Microsoft Hyper-V and SharePoint Clusters

Tableau Server 7.0 scalability

How swift is your Swift? Ning Zhang, OpenStack Engineer at Zmanda Chander Kant, CEO at Zmanda

Big Data Trends and HDFS Evolution

Content Distribution Management

Very Large Enterprise Network Deployment, 25,000+ Users

YARN Apache Hadoop Next Generation Compute Platform

IBM Platform Computing Cloud Service Ready to use Platform LSF & Symphony clusters in the SoftLayer cloud

Get More Scalability and Flexibility for Big Data

SAS and Oracle: Big Data and Cloud Partnering Innovation Targets the Third Platform

Nebula Cloud Computing Project: Background, Technology, Operations, Challenges, and Status

Embracing Open Source: Practice and Experience from Alibaba. Wensong Zhang The 11 th Northeast Asia OSS Promotion Forum

Server Virtualization with Windows Server Hyper-V and System Center

Technology and Cost Considerations for Cloud Deployment: Amazon Elastic Compute Cloud (EC2) Case Study

LSKA 2010 Survey Report Job Scheduler

Virtual Machine Synchronization for High Availability Clusters

Apache Hadoop. Alexandru Costan

Scaling Database Performance in Azure

Red Hat Enterprise linux 5 Continuous Availability

Introduction to Windows Azure Cloud Computing Futures Group, Microsoft Research Roger Barga, Jared Jackson,Nelson Araujo, Dennis Gannon, Wei Lu, and

Introduction 1 Performance on Hosted Server 1. Benchmarks 2. System Requirements 7 Load Balancing 7

Fault Tolerant Solutions Overview

Scalable Application. Mikalai Alimenkou

Quantum StorNext. Product Brief: Distributed LAN Client

Windows Server 2012 授 權 說 明

Configuration Maximums VMware Infrastructure 3

Big Data Performance Growth on the Rise

Best Practices for Optimizing Your Linux VPS and Cloud Server Infrastructure

Microsoft Hyper-V chose a Primary Server Virtualization Platform

App App App App App App App App. VMware vcenter Suite. VMware vsphere 4. Availability Security Scalablity. vshield Zones VMSafe

Google File System. Web and scalability

CLOUD SERVERS vs DEDICATED SERVERS

Recommended hardware system configurations for ANSYS users

Oracle Big Data SQL Technical Update

How Solace Message Routers Reduce the Cost of IT Infrastructure

Architecting for the next generation of Big Data Hortonworks HDP 2.0 on Red Hat Enterprise Linux 6 with OpenJDK 7

JBoss Seam Performance and Scalability on Dell PowerEdge 1855 Blade Servers

Het is een kleine stap naar een hybrid cloud

TPS Virtualization and Future Virtual Developments. Paul Hodge

Intro to Virtualization

Dell Reference Configuration for Hortonworks Data Platform

SUN ORACLE EXADATA STORAGE SERVER

StruxureWare TM Center Expert. Data

Mesos: A Platform for Fine- Grained Resource Sharing in Data Centers (II)

An Oracle White Paper August Oracle VM 3: Server Pool Deployment Planning Considerations for Scalability and Availability

Hardware/Software Guidelines

Whitepaper. Vertex VDI. Tangent, Inc.

Using SUSE Cloud to Orchestrate Multiple Hypervisors and Storage at ADP

DELL TM PowerEdge TM T Mailbox Resiliency Exchange 2010 Storage Solution

VMware Virtual Infrastucture From the Virtualized to the Automated Data Center

IOS110. Virtualization 5/27/2014 1

Building Storage Service in a Private Cloud

Very Large Enterprise Network, Deployment, Users

Introduction to AWS Economics

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

Enabling Technologies for Distributed and Cloud Computing

<Insert Picture Here> Introducing Oracle VM: Oracle s Virtualization Product Strategy

Nested Virtualization

Oracle Primavera P6 Enterprise Project Portfolio Management Performance and Sizing Guide. An Oracle White Paper October 2010

Copyright 2012, Oracle and/or its affiliates. All rights reserved.

UCS M-Series Modular Servers

Cloud Computing through Virtualization and HPC technologies

Intel Cloud Builder Guide: Cloud Design and Deployment on Intel Platforms

Traditional v/s CONVRGD

Transcription:

Fuxi: a Fault Tolerant Resource Management and Job Scheduling System at Internet Scale Zhuo Zhang *, Chao Li *, Yangyu Tao *, Renyu Yang *, Hong Tang *, Jie Xu Alibaba Cloud Computing Inc. * Beihang University University of Leeds

Agenda Motivation System Introduction Contribution Incremental resource management protocol User transparent failure recovery Faulty node detection and multilevel blacklist mechanism Evaluation Conclusion and Future Work

Motivation Big data era 2.5 exabytes of data are generated everyday The speed of data generation doubles every 40 months Billions of transactions on Taobao everyday, must be processed in 6 hours Scalability challenges 10 3 ~ 10 4 of machines 10 3 ~ 10 4 of concurrent jobs 10 5 ~ 10 6 of concurrent workers 10 4 ~ 10 5 OPS for resource request/assign Other factors: Multi-dimensional resource, quota, priority, locality, etc. Fault tolerance challenges Failures become normal at large scale Partial failures worse than machine down Master failures

Apsara: A Brief History 09/10/2009 Aliyun.com Inc established 07/28/2011 Aliyun.com went online, releasing 08/27/2010 1 st cloud service: ECS Apsara became the platform of four applications: Search, Mail, Image Storage, AliFinance 08/15/2013 5000-node Aspara cluster went into production 5K 10/24/2013 3rd Aliyun dev conference held in Hangzhou. ~5000 developers attended the conference

Deployment Monitoring Apsara Cloud Platform Map, Mail, Search DAG Job Long running service Cloud Market ACE Other Cloud Services ECS/SLB OSS OTS RDS ODPS Resource scheduling and allocation Quota control Distributed File System Preemption Process life cycle control Process isolation Remote Procedure Call Security Management Application Scheduling Distributed Coordination Resource Management Linux Clusters

Fuxi Workflow FuxiMaster Client... Cluster Node Cluster Node Cluster Node Cluster Node Job scheduling FuxiAgent FuxiAgent FuxiAgent FuxiAgent Resource request and response App Master... APP Worker APP Worker... APP Worker APP Worker App Master APP Worker......... Node management and status collection Job submission Process Execution Plan

Support Different Application Models AppMaster lib Resource request/return Process execution Heartbeat Build-in models DAG job (similar to Dryad and Tez) Long running services Mail service Search service Can support more models spark, storm, etc. T2 InputFile T1 T4 OutputFile T3

Contributions Scalability Incremental resource management protocol Fault tolerance User transparent failure recovery Faulty node detection and multilevel blacklist mechanism

Incremental Resource Management Protocol FuxiAgent1 FuxiAgent2 FuxiAgent3 A1: A1: -3 +3 A1: -4 A1: +3 A1: -1 A2: A1: -1 +2 A1: -2 A1: +2 FuxiMaster ScheduleUnit: Return: {1cpu, 2gb} Apply: Return: M1: -3 {M1*2, M2: M2: -1 C*10} -2 M3: -4 Max: 10 Assign: M1: Assign: +3, M2: M3: +3, +2 M3: +2 AppMaster1 ScheduleUnit: {1cpu, 2gb} Return: M3: -1 AppMaster2 ScheduleUnit: {2cpu, 5gb}

Resource Scheduling Event driven real-time resource allocation Locality tree based scheduling Multi-layer schedule order New Submission Resource Release Resource Preemption Resource Change Quota Change Resource Allocation Available Resource Pool Waiting Resource Requests Queue Fulfilled Resource Request Queue M1 App1: P1,4 App2: P1,3 App3: P2,1 Rack1 App1: P1,9 App2: P1,5 App3: P2,1 App4:P3,8 Cluster App1: P1,14 APP2: P1,9 APP3: P2,5 APP4: P3, 16 Rack2 M2 M3 M4 App1: P1,4 App3: P2,6 App1: P1,3 App3: P2,1 App4:P3,1 App1: P1,4 App2: P1,4 App3: P2,3 App4:P3,8 App3: P2,1 App4:P3,3

User Transparent Failure Recovery Full stack failover support FuxiMaster JobMaster FuxiAgent Scalability consideration Hard state Soft state

FuxiMaster Failover Hard state Application configuration Soft state Machine list Resource request Resource schedule results FuxiAgent FuxiAgent FuxiAgent ResourceAssign App1: 5 APP2: 3 ResourceAssign App1: 3 APP2: 2 Soft state FuxiMaster ResourceAssign App1: 2 APP2: 3 Hard state AppConfig App1:[ ] App2:[ ] Checkpoint ScheduleUnit:{1cpu,2GB Mem} Request: M1*2,M2*3, R2*3,C*8 MaxCount: 15 Soft state ScheduleUnit:{2cpu,3GB Mem} Request: M3*2,C*8 MaxCount: 10 App Master1 App Master2

Faulty Node Detection and Multilevel Blacklist Cluster level: rule based Hardware & OS: Disk statistics, machine load, network I/O, etc. Apsara platform: OPS, state, heartbeat, etc. Scoring system Plugin based Job level

Faulty Node Detection and Multilevel Blacklist Cluster level Job level: history based One instance failed on one machine Multiple instances failed on one same machine Running slow on one machine Read slow from one machine

Fuxi in Production In production since 2009 Manage hundreds of thousands of nodes 5000 nodes in one cluster Job type at a glance avg max total Instance Number 228/task 99,937/task 42,266,899 Worker Number 89.92/task 4,636/task 16,295,167 Task Number 2.0/job 150/job 185,444

Evaluation: 5000 nodes with 1000 concurrent jobs Planned memory usage FuxiMaster scheduling time

Evaluation: GraySort Provenance Configuration GraySort Indi Result Fuxi (2013) Yahoo! Inc. (2012) UCSD (2011) UCSD&VUT (2010) KIT (2009) 5000 nodes (2 2.20GHz 6cores Xeon E5-2430, 96 GB memory,12x2tb disks) 2100 nodes (2 2.3Ghz hexcore Xeon E5-2630, 64 GB memory,12x3tb disks) 52 nodes (2 Quadcore processors,24 GBmemory, 16x500GB disks) Cisco Nexus 5096 switch 47 nodes (2 Quadcore processors,24 GB memory, 16x500GB disks) Cisco Nexus 5020 switch 195 nodes x (2 Quadcore processors,16 GB memory,4x250gb disks)288-port InniBand 4xDDR switch 100TB in 2538 seconds (2.364TB/min) 102.5 TB in 4,328 seconds (1.42TB/min) 100 TB in 6,395 seconds (0.938TB/min) 100 TB in 10,318 seconds (0.582TB/min) 100 TB in 10,628 seconds (0.564TB/min)

Evaluation: Performance of Fault Handling 5% failure: 15.7% performance degradation 10% failure: 19.6% performance degradation 0 job fail Injected Fault Type Node Number Node Number Node Down 2 2 Partial Worker Failure 2 4 Slow Machine 11 23 Total Number (ratio) 15 (5% failure) 30 (10% failure)

Conclusion and Future Work Conclusion Scalability Incremental communication/schedule instead of full run Fault tolerance Full stack failover support Do checkpoint only for hard state Detect faulty node by rules and history Future work Improve real resource utilization More resource isolation Interactive jobs

Thanks!

Alibaba Night Recently Alibaba s Chinese online commerce sites drove over US $5.7 billion dollars worth of sales in a single day. Learn what it takes to run these sites and how the Alibaba Group has impacted the Chinese economy. This event is specifically geared towards researchers, engineers and IT professionals who are interested in learning about the growth of ecommerce and technology. Join us for this FREE event as we share learning s and surge ahead in 2014! Date: September 4 th, 2014 Location: Alibaba Xixi Campus Agenda: - Tour of Xixi Campus - Presentations from Alibaba - Q&A Session Notes: - Please register at the Alibaba booth in the Exhibition area. - Due to limited space, we may not be able to accommodate everyone. Sorry! - This event will be primarily conducted in Mandarin.