Rebuilding your Cloud, Multiple Times a Day
Vilmos Nebehaj
10/07/2015
Sauce Labs
Hi! https://github.com/ldx/ @dyingpixel vilmos@saucelabs.com
Sauce Labs
  HQ in San Francisco, 2nd office in Vancouver
  ~100 employees, almost 50% are engineers
  Main product: Selenium and Appium testing in the cloud
  Private cloud which runs tens of millions of jobs every month
  >500 combinations of OSes, desktop browsers, mobile emulators/simulators and real mobile devices for testing browser, native and hybrid applications
  We're hiring
Immutable Infrastructure
"A large fraction of the flaws in software development are due to programmers not fully understanding all the possible states their code may execute in. In a multithreaded environment, the lack of understanding and the resulting problems are greatly amplified, almost to the point of panic if you are paying attention. Programming in a functional style makes the state presented to your code explicit, which makes it much easier to reason about, and, in a completely pure system, makes thread race conditions impossible."
Without mutable variables, testing becomes trivial: if we transform a given input via a side-effect-free function, we always get the same output (referential transparency). Note: this is just an abstraction, of course. If you drill deep enough, at the latest at the CPU instruction level, you hit side effects, e.g. caches, the TLB, etc. But as an abstraction, it is still pretty useful.
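A minimal Python sketch of the difference, not taken from the talk: the function names and the hidden call counter are made up for illustration. The pure function returns the same output for the same input every time; the stateful one does not, because it depends on state the caller cannot see.

```python
def scale(values, factor):
    """Pure: the result depends only on the arguments."""
    return [v * factor for v in values]


_calls = 0  # hidden module-level state (hypothetical, for illustration)


def scale_stateful(values, factor):
    """Impure: reads and mutates state outside its arguments."""
    global _calls
    _calls += 1
    return [v * factor * _calls for v in values]


# Same input twice: the pure version is repeatable, the stateful one drifts.
pure_runs = [scale([1, 2, 3], 2) for _ in range(2)]
stateful_runs = [scale_stateful([1, 2, 3], 2) for _ in range(2)]
```

A test for scale needs only an input and an expected output. A test for scale_stateful also has to know how many times it was called before, which is exactly the hidden state the quote above warns about.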
So what are the downsides? In several cases, performance is not as good as with simply mutating a data structure in place.
What does it have to do with my infrastructure? In Ops/DevOps, we have the exact same issue as in the application development space. A large fraction of the problems we face are due to the almost incomprehensible state space of the configuration on our servers. Think about it: how many configuration files are there on the average server, with how many possible settings in each? What interactions and interference are possible between them?
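To get a feel for the size of that state space, here is a back-of-the-envelope sketch. The numbers (20 configuration files with 10 on/off settings each) are invented for illustration and do not describe any real server; real settings usually have far more than two possible values.

```python
def state_space_size(settings_per_file, values_per_setting=2):
    """Number of distinct configurations if every setting is independent.

    settings_per_file: one entry per config file, giving its setting count.
    values_per_setting: assumed number of values each setting can take.
    """
    total_settings = sum(settings_per_file)
    return values_per_setting ** total_settings


# Hypothetical server: 20 files, 10 boolean settings in each.
size = state_space_size([10] * 20)
print(f"{size:.3e} possible configurations")
```

Even under these modest assumptions the count is 2**200, far beyond anything that can be tested exhaustively, which is the motivation for never mutating a running server.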
NO MUTATING STATE in your infrastructure?
The primary goal of treating your infrastructure as code: "Enable the reconstruction of the business from nothing but a source code repository, an application data backup, and bare metal resources."
Model                     | Configuration                                                  | Enabling technology
pets                      | manual, minimal scripting                                      | internet, IP, server hosting
cattle                    | automated                                                      | configuration management software, infrastructure as code
immutable infrastructure  | automated, no modification, rebuilding for any change          | virtualization, cloud services
Containers vs VMs
  Containers:
    Security concerns
    Lock you into a specific OS
    No (or minimal) performance penalty
    Lightweight
    Very fast startup times
  Virtual Machines:
    Another layer of security
    Different operating environment (kernel/OS)
    Fully isolated at the hardware level
    Performance overhead
    Slower boot times
Infrastructure as Code at Sauce
  Two repositories:
    sauce-ansible with our inventory, playbooks and roles
    vmbuilder with Packer templates for our VMs/containers
  We use branch builds for pull requests in sauce-ansible.
  A commit/merge into sauce-ansible master kicks off new image builds for all templates in vmbuilder.
Packer builders
    {
      "builders": [
        {
          "type": "virtualbox-iso",
          "guest_os_type": "Ubuntu_32",
          "iso_checksum": "1214cd22448338b60bb24f583dd8741a",
          "iso_url": "http://releases.ubuntu.com/14.04/...",
          ...
        },
        {
          "type": "qemu",
          "format": "qcow2",
          "iso_checksum": "1214cd22448338b60bb24f583dd8741a",
          "iso_url": "http://releases.ubuntu.com/14.04/...",
          ...
        }
      ],
      ...
    }
Packer provisioners
    {
      "provisioners": [
        {
          "type": "shell",
          "inline": ["sudo pip install ansible"]
        },
        {
          "destination": "/tmp/ansible",
          "type": "file",
          "source": "../ansible"
        },
        {
          "type": "shell",
          "inline": ["cd /tmp/ansible && ansible-playbook -c local -i inventory chef.yml"]
        }
      ],
      ...
    }
Building an LXC image is as simple as:
    rm -rf output-lxc
    PACKER_CONFIG=/etc/packer.conf packer build -only=lxc ./packer.json
Building a QEMU image:
    # Parse command line arguments.
    #...
    # Remove output directory in case we get killed.
    trap "rm -rf ${OUTDIR}; exit" SIGHUP SIGINT SIGTERM
    # Remove any previous build.
    rm -rf ${OUTDIR}
    # Build image.
    packer build -var basename=${name} -only=qemu ./packer.json
    # Convert image to desired format.
    #...
    # Jenkins has problems transferring large images when they are larger than
    # 8GB. Split it up into smaller chunks.
    #...
Long image builds
  Relying on a new image being built for any change means you want to minimize image build times
  CI infrastructure you can scale out is key
  VM builds especially can take a long time
  We split our longest running builds into multiple Jenkins jobs:
    Install base OS
    Configure system and application(s) in the image
  Jenkins makes it easy to create build pipelines
Deploying images
  Central artifact store (Jenkins)
  Images have unique build numbers
  There are several hundred hypervisors in the Sauce Cloud
  We deploy images to hypervisors in smaller batches via Ansible
  The control plane for the Cloud tells hypervisors which image to boot -> easy to roll back
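The batching and rollback ideas on this slide can be modelled in a few lines of Python. This is only a sketch under stated assumptions: the actual rollout is done with Ansible, and batches, rollout, copy_image, ControlPlane and the batch size of 50 are hypothetical names and values chosen for illustration.

```python
from itertools import islice


def batches(items, size):
    """Yield consecutive lists of at most `size` items."""
    it = iter(items)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def rollout(hypervisors, build, copy_image, batch_size=50):
    """Copy image `build` to every hypervisor, one batch at a time."""
    done = []
    for batch in batches(hypervisors, batch_size):
        for host in batch:
            copy_image(host, build)  # hypothetical transfer step
        done.extend(batch)
    return done


class ControlPlane:
    """Hypothetical stand-in: remembers which build hypervisors should boot."""

    def __init__(self, build):
        self.history = [build]

    @property
    def active(self):
        return self.history[-1]

    def activate(self, build):
        self.history.append(build)

    def roll_back(self):
        # Old images stay on disk, so rolling back is only a pointer change.
        if len(self.history) > 1:
            self.history.pop()
        return self.active
```

Because every image carries a unique build number and is never modified after it is built, rolling back does not have to undo anything on the hypervisors; in this model it just selects the previous build number again.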
Tools recap
  Jenkins: CI software, also our artifact (image) store
  Ansible: automation and configuration management
  Packer: building images from a common configuration for different backends
  LXC: Linux kernel level containment library and tools
  QEMU/KVM: full virtualization solution for Linux with hardware acceleration
Runtime temporary storage
  [Diagram: the VM reads unchanged blocks from the base image and reads/writes changed blocks in a CoW image, which is a snapshot of the base image]
Runtime temporary storage
  Images are immutable, VMs are always started from this clean state
  Temporary storage is provided on the hypervisors via per-VM copy-on-write images, snapshotted from the immutable image
  For containers, we use aufs for CoW
  Assets created during tests are uploaded to S3
  When the job ends, the VM and its CoW image are destroyed
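The copy-on-write behaviour described above can be illustrated with a toy in-memory model. This is a sketch, not how QEMU or aufs implement it: CowImage and the block contents are invented for illustration, and real overlays operate on disk blocks or file system layers rather than Python dicts.

```python
from types import MappingProxyType


class CowImage:
    """Toy per-VM overlay on top of a shared, read-only base image."""

    def __init__(self, base):
        self._base = base    # shared mapping: block number -> bytes
        self._changed = {}   # blocks written by this VM only

    def read(self, block):
        # Changed blocks come from the overlay, everything else from the base.
        if block in self._changed:
            return self._changed[block]
        return self._base[block]

    def write(self, block, data):
        # Writes never touch the base image.
        self._changed[block] = data

    def destroy(self):
        # Dropping the overlay discards everything the job wrote.
        self._changed.clear()


# A read-only view stands in for the immutable image (contents are made up).
base = MappingProxyType({0: b"boot", 1: b"root"})
vm = CowImage(base)
vm.write(1, b"test-data")
```

After the write, vm.read(1) returns the VM's private copy while base[1] is unchanged, so the next VM started from the same base sees the clean state again.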
Testing
  [Diagram: a PR against the repo fans out into role tests, inventory tests and playbook tests, followed by a join step]
Testing
  For end to end testing, we have a main integration build
  Several thousand Selenium tests - we're eating our own dogfood
  No continuous delivery for automatic image deployment into production
TL;DR
  Containers are cool. VMs are also cool. Both of them have their use cases.
  We use continuous integration for building immutable VM/container images for our cloud.
  Images are built on Jenkins in a fully automated fashion. Long builds are split up into multiple jobs and chained together.
  Testing is key. Our infra codebase (vmbuilder + sauce-ansible) is tested for any change. We test both our Ansible codebase via unit tests and integration tests, and the image artifacts via end to end tests.
  Packer is a great tool; you create your image template once, and can use various builders to produce an output image for many different cloud providers and virtualization solutions.
Questions?