Distributed Scheduling with Apache Mesos in the Cloud. PhillyETE - April, 2015 Diptanu Gon Choudhury @diptanu

Similar documents

Heterogeneous Resource Scheduling Using Apache Mesos for Cloud Native Frameworks

Mesos: A Platform for Fine- Grained Resource Sharing in Data Centers (II)

Cisco Application-Centric Infrastructure (ACI) and Linux Containers

Scalable Architecture on Amazon AWS Cloud

On- Prem MongoDB- as- a- Service Powered by the CumuLogic DBaaS Platform

TECHNOLOGY WHITE PAPER Jun 2012

TECHNOLOGY WHITE PAPER Jan 2016

Cloud Computing #8 - Datacenter OS. Johan Eker

Building Hyper-Scale Platform-as-a-Service Microservices with Microsoft Azure. Patriek van Dorp and Alex Thissen

Private Cloud Management

YARN Apache Hadoop Next Generation Compute Platform

Hadoop & Spark Using Amazon EMR

Amazon Elastic Beanstalk

The Definitive Guide To Docker Containers

Migration and Building of Data Centers in IBM SoftLayer with the RackWare Management Module

CDH AND BUSINESS CONTINUITY:

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

YouTube Vitess. Cloud-Native MySQL. Oracle OpenWorld Conference October 26, Anthony Yeh, Software Engineer, YouTube.

Continuous Integration in the Cloud with Hudson

Gregory PaaS team

Migration and Building of Data Centers in IBM SoftLayer with the RackWare Management Module

CSE-E5430 Scalable Cloud Computing Lecture 2

Microservices on AWS

Amazon EC2 Product Details Page 1 of 5

Running Oracle Applications on AWS

Modern App Architecture for the Enterprise Delivering agility, portability and control with Docker Containers as a Service (CaaS)

CSE-E5430 Scalable Cloud Computing Lecture 11

Architecting ColdFusion For Scalability And High Availability. Ryan Stewart Platform Evangelist

Alfresco Enterprise on AWS: Reference Architecture

Building Success on Acquia Cloud:

Apache HBase. Crazy dances on the elephant back

APPLICATION DELIVERY IN OPENSTACK WITH AVI NETWORKS

Lessons Learned from the Movies

APPLICATION DELIVERY IN OPENSTACK WITH AVI NETWORKS CLOUD APPLICATION DELIVERY PLATFORM

Why the Datacenter needs an Operating System. Dr. Bernd Mathiske Senior Software Architect Mesosphere

Cloudera Enterprise Reference Architecture for Google Cloud Platform Deployments

OpenStack. Orgad Kimchi. Principal Software Engineer. Oracle ISV Engineering. 1 Copyright 2013, Oracle and/or its affiliates. All rights reserved.

Microsoft Private Cloud Fast Track

Big data blue print for cloud architecture

One click Hadoop clusters - anywhere

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

Intro to Docker and Containers

DISTRIBUTED SYSTEMS [COMP9243] Lecture 9a: Cloud Computing WHAT IS CLOUD COMPUTING? 2

Modern Application Architecture for the Enterprise

STRATEGIC WHITE PAPER. The next step in server virtualization: How containers are changing the cloud and application landscape

WE RUN SEVERAL ON AWS BECAUSE WE CRITICAL APPLICATIONS CAN SCALE AND USE THE INFRASTRUCTURE EFFICIENTLY.

The Virtualization Practice

Object Storage: A Growing Opportunity for Service Providers. White Paper. Prepared for: 2012 Neovise, LLC. All Rights Reserved.

PLUMgrid Toolbox: Tools to Install, Operate and Monitor Your Virtual Network Infrastructure

Snapshots in Hadoop Distributed File System

Real Time Data Processing using Spark Streaming

How To Use Arcgis For Free On A Gdb (For A Gis Server) For A Small Business

Apache Hadoop. Alexandru Costan

Who s me? Zequi Vázquez DevOps & Backend PhD student Hacking & Security Rock n Roll (electric guitarist) Videogames Books

LARGE-SCALE DATA STORAGE APPLICATIONS

Using ArcGIS for Server in the Amazon Cloud

Web Application Hosting in the AWS Cloud Best Practices

Fault-Tolerant Computer System Design ECE 695/CS 590. Putting it All Together

Building an AWS-Compatible Hybrid Cloud with OpenStack

RED HAT CONTAINER STRATEGY

Postgres Plus Cloud Database!

Building a cloud with Openstack. Iqbal Mohomed iqbal@us.ibm.com March 25 th 2015

Building a Kubernetes Cluster with Ansible. Patrick Galbraith, ATG Cloud Computing Expo, NYC, May 2016

xpaaerns on Spark, Shark, Tachyon and Mesos

Cloud-Based dwaf A Real World Deployment Case Study. OWASP 5. April The OWASP Foundation

Management Packs for Database

An Introduction to Cloud Computing Concepts

Zadara Storage Cloud A

Migrating a running service to AWS

DevOps Course Content

BASHO DATA PLATFORM SIMPLIFIES BIG DATA, IOT, AND HYBRID CLOUD APPS

WHITE PAPER: PAN Cloud Director Technical Overview

9/26/2011. What is Virtualization? What are the different types of virtualization.

Red Hat Enterprise linux 5 Continuous Availability

Docker : devops, shared registries, HPC and emerging use cases. François Moreews & Olivier Sallou

Bluemix: The Open Platform as a Service

Application Security Best Practices. Matt Tavis Principal Solutions Architect

The Future of Data Management

CoreTech Foundation Apps for Humanity

A Comparison of Clouds: Amazon Web Services, Windows Azure, Google Cloud Platform, VMWare and Others (Fall 2012)

Making Your ColdFusion Apps Highly Available. Brian Klaas Johns Hopkins Bloomberg School of Public Health

Sacha Dubois RED HAT TRENDS AND TECHNOLOGY PATH TO AN OPEN HYBRID CLOUD AND DEVELOPER AGILITY. Solution Architect Infrastructure

Why should you look at your logs? Why ELK (Elasticsearch, Logstash, and Kibana)?

Design For Availability. October 2013 Stevan Vlaovic

How AWS Pricing Works

Cloud computing - Architecting in the cloud

Migration Scenario: Migrating Backend Processing Pipeline to the AWS Cloud

Drupal in the Cloud. Scaling with Drupal and Amazon Web Services. Northern Virginia Drupal Meetup

What s Happening to the Mainframe? Mobile? Social? Cloud? Big Data?

Migration and Disaster Recovery Underground in the NEC / Iron Mountain National Data Center with the RackWare Management Module

NetflixOSS A Cloud Native Architecture

Hadoop and Map-Reduce. Swati Gore

Microsoft SQL Server Acceleration with SanDisk

Symantec NetBackup Appliances

Learning Management Redefined. Acadox Infrastructure & Architecture

How To Backup With Ec Avamar

Testing Cloud Application System Resiliency by Wrecking the System

Non-Stop for Apache HBase: Active-active region server clusters TECHNICAL BRIEF

Transcription:

Distributed Scheduling with Apache Mesos in the Cloud PhillyETE - April, 2015 Diptanu Gon Choudhury @diptanu

Who am I? Distributed Systems/Infrastructure Engineer in the Platform Engineering Group Design and develop resilient highly available services IPC, Service Discovery, Application Lifecycle Senior Consultant at ThoughtWorks Europe OpenMRS/RapidSMS/ICT4D contributor

A word about Netflix Just the stats 16 years < 2000 employees 50+ million users 5 * 10^9 hours/quarter Freedom and Responsibility Culture

The Titan Framework A globally distributed resource scheduler which offers compute resources as a service

Guiding Principles Design for Native to the public clouds Availability Reliability Responsiveness Continuous Delivery Pushing to production faster

Guiding Principles Being able to sleep at night even when there are partial failures. Availability over Consistency at a higher level Ability for teams to fit in their domain specific needs

Active-Active Architecture Region AZ AZ AZ

Current Deployment Pipeline Bakery

The Base AMI

Need for a Distributed Scheduler ASGs are great for web services but for processes whose life cycle are controlled via events we needed something more flexible Cluster Management across multiple geographies Faster turnaround from development to production

Need for a Distributed Scheduler A runtime for polyglot development Tighter Integration with services like Atlas, Scryer etc

We are not alone in the woods Google s Borg and Kubernetes Twitter s Aurora Soundcloud s Harpoon Facebook s tupperware Mesosphere s Marathon

Why did we write Titan We wanted a cloud native distributed scheduler Multi Geography from the get-go A meta scheduler which can support domain specific scheduling needs Work Flow systems for batch processing workloads Event driven systems Resource Allocators for Samza, Spark, etc

Why did we write Titan Persistent Volumes and Volume Management Scaling rules based on metrics published by the kernel Levers for SREs to do region failovers and shape traffic globally

Compute Resources as a service { name : rocker, applicationname : nf-rocker, version : 1.06, location : dc1:20,us-west-2:dc2:40,dc5:60, cpus : 4, memory : 3200, disk : 40, ports : 2, restartonfailure : true, numretries : 10, restartonsuccess : false }

Things Titan doesn t solve Service Discovery Distributed Tracing Naming Service

Building blocks A resource allocator Packaging and isolation of processes Scheduler Distribution of artifacts Replication across multiple geographies AutoScalers

Resource Allocator Scale to 10s of thousands of servers in a single fault domain Does one thing really well Ability to define custom resources Ability to write flexible schedulers Battle tested

Mesos

How we use Mesos Provides discovery of resources We have written a scheduler called Fenzo An API to launch tasks Allows writing executors to control the lifecycle of a task A mechanism to send messages

Packaging and Isolation We love Immutable Infrastructure Artifacts of applications after every build contains the runtime Flexible process isolation using cgroups and namespaces Good tooling and distribution mechanism

Docker

Building Containers Lots of tutorials around docker helped our engineers to pick the technology very easily Developers and build infrastructure uses the Docker cli to create containers. The docker-java plugin allows developers to think about their application as a standalone process

Volume Management ZFS on linux for creating volumes Allows us to clone, snapshot and move around volumes The zfs toolset is very rich Hoping for a better libzfs

Networking In AWS EC2 classic containers use the global network namespace Ports are allocated to containers via Mesos In AWS VPC, we can allocate an IP address per container via ENIs

Logging Logging agent on every host to allows users to stream logs Archive logs to S3 Every container gets a volume for logging

Monitoring We push metrics published by the kernel to Atlas The scheduler gets a stream of metrics from every container to make scheduling decisions Use the cgroup notification API to alert users when a task is killed

Scheduler We have a pluggable scheduler called Fenzo Solves the problem of matching resources with tasks that are queued.

Scheduler Remembers the cluster state Efficient bin-packing Helps with Auto Scaling Allows us to do things like reserve instances for specific type of workloads

Auto Scaling A must need for running on the cloud Two levels of scaling Scaling of underlying resources to match the demands of processes Scaling the applications based on metrics to match SLAs

Reactive Auto Scaling Titan adjusts the size of the fleet to have enough compute resources to run all the tasks Autoscaling Providers are pluggable

Predictive Autoscaling Historical data to predict the size of clusters of individual applications Linear Regression models for predicting near real time cluster sizes

Bin Packing for efficient Autoscaling Node A Node B Node C Service A Batch Job B Batch Job C 16 CPUs 16 CPUs 16 CPUs Service A Service A Service A Long Running Service Short Lived Batch Process Short Lived Batch Process

Bin Packing for efficient Autoscaling Node A Node B Node C Service A Scale Down 16 CPUs 16 CPUs 16 CPUs

Mesos Framework Master Slave model with leader election for redundancy A single Mesos Framework per fault domain We currently use Zookeeper but moving to Raft Resilient to failures of underlying data store

Globally Distributed Each geography has multiple fault domains Single scheduler and API in each fault domain.

Globally Distributed All job specifications are replicated across all fault domains across all geographies Heart beats across all fault domains to detect failures Centralized control plane

Spinnaker Apollo Creed Dagobah Meson Samza Titan Mesos

Future More robust scheduling decisions Optimize the host OS for running containers More monitoring

Questions?