Ganeti Private Cloud as Google does it Helga Velroyen <helgav@google.com> Linuxtag Berlin, May 9th, 2014
A Ganeti Cluster Instance: a virtualization guest Node: a virtualization host Nodegroup: a homogeneous set of nodes Cluster: a set of nodes, managed as a collective, partitioned by nodegroups 4/20
What can it do? Manage clusters of physical machines Deploy virtual machines on them - Resiliency to failure (distributed storage) - Live migration - Ease of repairs and hardware swaps - Cluster balancing 5/20
Ideas Interact with the cluster as an entity, instead of the individual machines. Making the virtualization entry level as low as possible - - - - Easy to install/manage Lightweight (no "expensive" dependencies) No specialized hardware needed (eg. SANs) Start small, grow big Scale to enterprise ecosystems - - Manage simultaneously from 1 to ~200 host machines Access to advanced features (distributed storage, live migration, cluster balancing) 6/20
Technologies Linux and standard utils (iproute2, bridge-utils, ssh) Hypervisors: - Xen, KVM, LXC Storage: - DRBD, LVM, file, distributed storage, Ceph/Gluster Programming languages: - Python, Haskell 7/20
Controlling Ganeti Command line (*) RAPI (Rest-full http interface) (*) Webinterfaces: - - - Ganeti Web manager, aiming for admins, but includes "self-service management" for users ganetimgr web manager, simplified multicluster web manager for end users Synnefo, complete cloud service solution, OpenStack API compatible (*) Programmable interfaces 8/20
Production cluster As we use it in a Google Datacentre Remote API SSH access current master node Per machine monitoring......... Per machine monitoring Per machine monitoring group / rack group / rack group / rack 9/20
Fleet at Google type Office no maint window Office ZURICH Fleet Management Virgil Euripides Dradis type General maint window A type General maint window B type General maint window A type General maint window B type Dedicated maint window A type Ubiquity no maint window Datacenter Z VM transfer type Ubiquity no maint window Datacenter X type Dedicated maint window B type General maint window A Datacenter Y 10/20
Instance provisioning at Google type General type General RAPI interface Alloc request type Ubiquity type Dedicated scan capacity Virgil Machine DB gather capacity Monitoring 11/20
Auto node repair at Google Euripides Send machine to repairs (2) Virgil Monitoring detects fault (1) broken HW Tell cluster to evacuate the broken machine (4) Machine database Mark machine broken (3) Send to repairs (5) 12/20
Auto node readd at Google Euripides Detects machine was repaired (1) Machine DB Tells Virgil to reintegrate machine (3) Watches machine for 24hrs (2) repaired HW Tell cluster to add it (5) Configure machine (4) Virgil Mark machine serving (6) Dradis 13/20
Ganeti 2.8, 2.9 2.8.4 Downgrading Autorepair tool Hroller Improvements on storage, monitoring 2.9.6 DRBD 8.4 support Continued work on monitoring, storage, hroller 14/20
Ganeti 2.10 2.10.3, available in debian wheezy backports, debian jessi Cross-cluster instance moves: - automatic node allocation on destination cluster - convert disk templates on the fly Cluster balancing based on CPU load KVM: Hotplug support, direct access to RBD storage Ganeti upgrades! 15/20
Updates In the past, updating Ganeti was a pain: /etc/init.d/ganeti stop// on all nodes apt-get install ganeti2=2.7.1-1 ganeti-htools=2.7.1-1//on all nodes /usr/lib/ganeti/tools/cfgupgrade//on master /etc/init.d/ganeti start// on all nodes gnt-cluster redist-conf// on master...// lots of other steps, depending on the version // If something goes wrong, fix the mess manually. From 2.10 on, Ganeti comes with a built-in upgrade mechanism: apt-get install ganeti-2.11// on all nodes gnt-cluster upgrade--to 2.11// on master gnt-cluster upgrade--to 2.10// to roll back Note that you still have to install the new and deinstall the old packages manually. 16/20
Ganeti 2.11 Current stable release, 2.11.0. RPC security: individual node certificates Compression for instance moves / backups / imports Configurable SSH ports per node group Gluster support (experimental) 17/20
Current and Future development No guarantees! Network improvements (IPv6, more flexibility) Storage: more work on shared storage Heterogeneous clusters Improvements on cross-cluster instances moves Google Summer of Code: Make LXC support production-ready Conversion between arbitrary disk templates 18/20
Open Source Events Confirmed: Linuxcon Japan, Tokyo, May 20th 2014 Ganeticon, Portland, Oregon, September Not confirmed yet: Linuxcon North America, Chicago, August FrOSCon, St. Augustin, Germany, August LISA '14, Seattle, November 19/20
Thank You! Questions? 2010-2014 Google Use under GPLv2+ or CC-by-SA Some images borrowed / modified from Lance Albertson, Iustin Pop, and Guido Trotter Some slides were borrowed / modified from Tom Limoncelli