Why the Datacenter needs an Operating System
Dr. Bernd Mathiske, Senior Software Architect, Mesosphere
Bringing Google-Scale Computing to Everybody
A Slice of Google Tech Transfer History
2005: MapReduce -> Hadoop (Yahoo)
2007: Linux cgroups for lightweight isolation (Google)
2009: BigTable -> MongoDB
2009: The Datacenter as a Computer - Barroso, Hölzle (Google)
2009: Mesos - a distributed operating system kernel (UC Berkeley)
2010: Large-scale production Mesos deployment (Twitter)
since 2010: Many more frameworks and quite a few meta-frameworks
Notable Operating System Developments
Single-something => multi-something: user, tasking, threading, core, ...
More: bits, memory, storage, bandwidth
OS virtualization => lightweight virtualization (cgroups, LXCs, jails, ...)
Packaging => containers (Docker, rkt, lmctfy, ...)
Static libraries => dynamic libraries => static libraries
Cluster Operating Systems (Hardware Clustering)
Researched since the 1980s
Trying to provide (the illusion of) a single system image
Aiming at HA, load balancing, location transparency (e.g. for storage)
Many systems: Amoeba, ChorusOS, GLUnix, Hurricane, MOSIX, Plan 9, RHCS, Spring, Sprite, Sumo, QNX, Solaris MC, UnixWare, VAXclusters, ...
Relatively low scale (up to 100s of nodes)
Complicated to manage, less dynamic than software clustering
From HPC Grid to Enterprise Cloud
Condor, LSF, Maui, Moab, Quartz, SLURM, ... Typically for batch jobs
Also cover services => SOA => more job schedulers => grid computing => grid middleware => cloud stacks
From Server Virtualization to App Aggregation
[Figure: server virtualization packs many small apps onto one big server; app aggregation spreads one big app across many small servers]
Client-Server Era: small apps, big servers
Cloud Era: big apps, small servers
Cloud Computing
SaaS: Salesforce demonstrated success, then many followed
PaaS: Deis, Dotcloud, OpenShift, Heroku, Pivotal, Stackato, ...
IaaS: AWS, Azure, DigitalOcean, GCE
Private cloud stacks including IaaS: Eucalyptus, CloudStack, Joyent, OpenStack, SmartCloud, vSphere, ...
Datacenter
A facility used to house computer systems and associated components (e.g. networking, storage, cooling, sensors)
In this talk we focus on how to manage and use a single production cluster of networked computers in a datacenter
Such clusters range in size from 10s to 10,000s of nodes
Why should we, and how can we, end up with just one production cluster?
Datacenter Services
LAMP (Linux, Apache, MySQL, PHP) or similar
MEAN (MongoDB, Express.js, Angular.js, Node.js) or similar
Cassandra, ElasticSearch, Exelixi, Hadoop, Hypertable, Jenkins, Kafka, MPI, Spark, Storm, SSSP, Torque, ...
Private PaaS: Deis, ...
Operate your Laptop like your Datacenter?
From Static Partitioning to Elastic Sharing
[Figure: a statically partitioned cluster with separate Web, Cache, and Hadoop partitions, each mostly wasted, vs. one elastically shared cluster in which Hadoop, Web, and Cache grow and shrink within a common pool of free resources]
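A back-of-the-envelope comparison illustrates the gain from pooling. The numbers below are made up for illustration (they are not from the talk): three services statically sized for their individual peaks leave most capacity idle, while one shared pool sized for the combined peak does not.

```java
public class ElasticSharingDemo {
    // Utilization as a percentage of provisioned capacity.
    static double utilization(double load, double capacity) {
        return 100.0 * load / capacity;
    }

    public static void main(String[] args) {
        double totalLoad = 30 + 45 + 25;   // average load across three services
        double staticCapacity = 3 * 100;   // each partition sized for its own peak
        double sharedCapacity = 150;       // one pool sized for the combined peak
        System.out.printf("static: %.0f%%, shared: %.0f%%%n",
                utilization(totalLoad, staticCapacity),   // 33%
                utilization(totalLoad, sharedCapacity));  // 67%
    }
}
```

With these assumed numbers, elastic sharing roughly doubles utilization, which is the "FREE vs. WASTED" contrast the figure makes.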
Software Clustering
Layer between node OS and application frameworks
Scale
Multi-tenancy
High availability
Available Open Source Components
2-level scheduler: Apache Mesos
Meta-frameworks / schedulers: Aurora, Chronos, Marathon, Kubernetes, Swarm, ...
Service discovery: Consul, HAProxy, Mesos-DNS, ...
Highly available configuration: ZooKeeper, etcd, ...
Storage: HDFS, Ceph, ...
Node OSs: lots of Linux variants
Lots of app frameworks: Spark, Storm, Cassandra, Kafka, ...
2-Level Scheduling
Scale: from 1 node to at least 10,000s of nodes
Optimizing resource management
End-to-end principle: application-specific functions ought to reside in the end nodes of a network, rather than in intermediary nodes
-> Requirement for general multi-tenancy
-> Requirement for having only one production cluster
How Mesos Works
[Figure: a framework's scheduler talks to the leading Mesos master (a quorum of masters coordinated via zk/etcd); the master mediates resource offers between the framework scheduler and the slaves, where the framework's executors run its tasks]
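The offer cycle in the figure can be mimicked with a self-contained toy (no Mesos dependency; all names and numbers here are invented for illustration): the "master" hands the framework a list of per-slave resource offers, and the "framework" scheduler, which alone knows its tasks' needs, accepts offers that fit and declines the rest.

```java
import java.util.ArrayList;
import java.util.List;

public class OfferCycleDemo {
    // A resource offer: free CPUs on one slave (toy model, CPU only).
    static class Offer {
        final String slaveId;
        final double cpus;
        Offer(String slaveId, double cpus) { this.slaveId = slaveId; this.cpus = cpus; }
    }

    // Level 1: the master offers each slave's free resources to the framework.
    // Level 2: the framework decides what to do with each offer.
    static List<String> schedule(List<Offer> offers, double cpusPerTask, int tasksWanted) {
        List<String> launchedOn = new ArrayList<>();
        for (Offer offer : offers) {
            if (launchedOn.size() < tasksWanted && offer.cpus >= cpusPerTask) {
                launchedOn.add(offer.slaveId);   // accept: "launch a task here"
            }                                    // otherwise: decline the offer
        }
        return launchedOn;
    }

    public static void main(String[] args) {
        List<Offer> offers = List.of(
                new Offer("slave-1", 4.0),
                new Offer("slave-2", 0.5),   // too small for a 1.0-CPU task
                new Offer("slave-3", 2.0));
        System.out.println(schedule(offers, 1.0, 2));  // prints [slave-1, slave-3]
    }
}
```

The split of responsibilities is the point: the master only tracks and offers resources, while placement policy lives entirely in the framework, per the end-to-end principle cited above.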
Ways to Run an Application
1. Vanilla job: employ a meta-framework for invocation: Chronos, Aurora, Kubernetes, ...
2. Application of an adapted framework: Hadoop, Spark, Storm, ElasticSearch, Cassandra, Kafka, many more
3. Non-adapted services: employ a meta-framework for invocation: Marathon, Aurora, Kubernetes, ... and provide (select) a service discovery solution
4. Program your own scheduler (and executor)
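As a sketch of option 3: Marathon accepts long-running services as JSON app definitions. A minimal one might look like the following (the id, command, and sizing are made up for illustration; the field names follow Marathon's app API):

```json
{
  "id": "/my-web-service",
  "cmd": "python3 -m http.server 8080",
  "cpus": 0.5,
  "mem": 128,
  "instances": 3
}
```

Marathon then keeps the requested number of instances running, restarting them elsewhere in the cluster on failure, which is why a service discovery solution is needed alongside it.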
The Mesos Framework API
Currently like internal Mesos communication: protobuf messages over HTTP
Soon: JSON messages over HTTP (stream) => no need to link with the binary Mesos library, and/or less to reimplement
ca. a dozen programming languages => any language
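To make the "JSON over HTTP" direction concrete, a framework subscription might be a POST whose body looks roughly like the following. This is a hypothetical sketch (the API was still being finalized at the time of this talk, and the field names here are assumptions):

```json
{
  "type": "SUBSCRIBE",
  "subscribe": {
    "framework_info": { "user": "svc-account", "name": "my-framework" }
  }
}
```

Any language with an HTTP client and a JSON library could then act as a scheduler, which is the point of dropping the binary library dependency.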
How to Implement a Framework
Scheduler interface: one half of 2-level scheduling
The framework knows best when to do what with what kind of resources
About a dozen callbacks, with the main functionality in 2 of them:
- receive resource offers
- receive task status updates
Executor interface: task life-cycle management and monitoring
A command-line executor and a Docker executor are included in Mesos
Custom executors are often not needed
Scheduler SPI (implemented by Framework)

public interface Scheduler {
  void registered(SchedulerDriver driver, FrameworkID frameworkId, MasterInfo masterInfo);
  void reregistered(SchedulerDriver driver, MasterInfo masterInfo);
  void resourceOffers(SchedulerDriver driver, List<Offer> offers);
  void offerRescinded(SchedulerDriver driver, OfferID offerId);
  void statusUpdate(SchedulerDriver driver, TaskStatus status);
  void frameworkMessage(SchedulerDriver driver, ExecutorID executorId, SlaveID slaveId, byte[] data);
  void disconnected(SchedulerDriver driver);
  void slaveLost(SchedulerDriver driver, SlaveID slaveId);
  void executorLost(SchedulerDriver driver, ExecutorID executorId, SlaveID slaveId, int status);
  void error(SchedulerDriver driver, String message);
}
Minimal Scheduler Implementation

class MyFrameworkScheduler implements Scheduler {
  private TaskGenerator _taskGen;
  private Filters _filters;

  public void resourceOffers(SchedulerDriver driver, List<Offer> offers) {
    if (_taskGen.doneCreatingTasks()) {
      for (Offer offer : offers) {
        driver.declineOffer(offer.getId());
      }
    } else {
      for (Offer offer : offers) {
        List<TaskInfo> taskInfos = _taskGen.generateTaskInfos(offer);
        driver.launchTasks(offer.getId(), taskInfos, _filters);
      }
    }
  }

  public void statusUpdate(SchedulerDriver driver, TaskStatus status) {
    _taskGen.observeTaskStatusUpdate(status);
    if (_taskGen.done()) {
      driver.stop();
    }
  }
}
The Developer's Perspective
Focus on application logic, not datacenter structure
Avoid networking-related code
Reuse built-in fault-tolerance and high availability
Reuse distributed (infrastructure) frameworks (e.g., storage)
=> API, SDK for datacenter services
The Operations Engineer's Perspective
Ease of deployment/management
Uniformity of deployment/management
Hardware utilization rate
Scaling up as business grows
Scaling out sporadically
Cost and time for moving to a different datacenter
High availability and fault-tolerance of system services
Monitoring
Troubleshooting
Necessary Multi-Tenancy Features
Task containerization
Resource isolation
Resource and task attributes
Static and dynamic resource reservations
Reservation levels
Meta-frameworks
Dynamic scheduler update and reconfiguration
Security
Desirable Multi-Tenancy Features
Optimistic offers
Oversubscription
Task preemption, migration, resizing, reconfiguration
Rate limiting
Auto-scaling => hybrid cloud
Infrastructure frameworks
Using Docker Containers in Mesos
[Figure: the master server runs init + mesos-master + marathon; the slave server runs init + docker + lxc + mesos-slave + /var/lib/mesos/executors/docker. When a user requests a container, the request flows through Marathon and the Mesos master to a slave, which pulls the image from a Docker registry and launches the user task, under a container init system, via docker run; Mesos, LXC, and Docker are tied together for launch]
Other Schedulers as Meta-Frameworks in a 2-Level Scheduler
YARN => https://github.com/mesos/myriad
Kubernetes => https://github.com/mesosphere/kubernetes-mesos
Swarm => Swarm on Mesos (new project)
=> run everything in one cluster
Myriad: Virtual YARN Clusters on Mesos
POST /api/clusters: Registers a new YARN cluster
GET /api/clusters: Lists all registered clusters
GET /api/clusters/{clusterId}: Lists the cluster with {clusterId}
PUT /api/clusters/{clusterId}/flexup: Expands the size of the cluster with {clusterId}
PUT /api/clusters/{clusterId}/flexdown: Shrinks the size of the cluster with {clusterId}
DELETE /api/clusters/{clusterId}: Unregisters the YARN cluster with {clusterId} and kills all its nodes
[Figure: a flexup request makes the Myriad scheduler (running beside the YARN ResourceManager) launch a YARN NodeManager, through a Myriad executor, on a Mesos slave; the NodeManager then hosts the YARN containers]
Kubernetes in Mesos
Portability
[Figure: framework apps and vanilla apps sit on infrastructure frameworks and meta-frameworks, all running on Mesos, which in turn runs on a public cloud, a managed cloud, or your own datacenter]
The Application User's Perspective
Focus on apps, services, parameters, results
Avoid dealing with datacenter operations/management
Avoid adjusting system settings
High availability
Throughput
Responsiveness
Predictability
Run everything I need
Return on and safety of investment
The Datacenter is the new form factor
2-level scheduler => single production cluster
scalability and portability => avoiding hardware/cloud lock-in
built-in container support => running containers at scale
automation => operator efficiency
repositories => apps/services readily available
API and SDK => productive/quick app/service development
Above the Clouds with Open Source!