TestOps: Continuous Integration when infrastructure is the product. Barry Jaspan Senior Architect, Acquia Inc.

Similar documents
Achieving Continuous Integration with Drupal

Drupal in the Cloud. Scaling with Drupal and Amazon Web Services. Northern Virginia Drupal Meetup

Virtualization and IaaS management

Building Success on Acquia Cloud:

Best Practices for Python in the Cloud: Lessons

Migrating a running service to AWS

Enterprise-level EE: Uptime, Speed, and Scale

System Management. Leif Nixon. a security perspective 1/37

WhitePaper. Private Cloud Computing Essentials

WINDOWS AZURE EXECUTION MODELS

A central continuous integration platform

Enterprise PaaS Evaluation Guide

How Bigtop Leveraged Docker for Build Automation and One-Click Hadoop Provisioning

Mark Bennett. Search and the Virtual Machine

Building Success on Acquia Cloud. Buyer s Guide

Continuous Integration. CSC 440: Software Engineering Slide #1

Taking Drupal development to the Cloud. Karel Bemelmans

Zero-Touch Drupal Deployment

Continuous Delivery for Alfresco Solutions. Satisfied customers and happy developers with!! Continuous Delivery!

Handling Hyper-V. In this series of articles, learn how to manage Hyper-V, from ensuring high availability to upgrading to Windows Server 2012 R2

Git - Working with Remote Repositories

PCI COMPLIANCE ON AWS: HOW TREND MICRO CAN HELP

Network Virtualization Platform (NVP) Incident Reports

How To Run A Modern Business With Microsoft Arknow

Continuous Integration: A case study

Improving your Drupal Development workflow with Continuous Integration

Automating Big Data Benchmarking for Different Architectures with ALOJA

Web Application Platform for Sandia

CloudPassage Halo Technical Overview

DevOps Course Content

Splunk for VMware Virtualization. Marco Bizzantino Vmug - 05/10/2011

DevOps. Josh Preston Solutions Architect Stardate

When flexibility met simplicity: The friendship of OpenStack and Ansible

High-Availability in the Cloud Architectural Best Practices

CloudPassage Halo Technical Overview

About Acquia. Acquia Cloud Site Factory allows you to rapidly build mobile- ready brand, campaign, and franchise websites on a turnkey cloud platform.

Capacity planning with Microsoft System Center

How To Manage Change In Jeepers

How To Fix A Powerline From Disaster To Powerline

Servers. Servers. NAT Public Subnet: /20. Internet Gateway. VPC Gateway VPC: /16

Managing MySQL Scale Through Consolidation

The Virtualization Practice

DevShop. Drupal Infrastructure in a Box. Jon Pugh CEO, Founder ThinkDrop Consulting Brooklyn NY

Why Engine Yard is better than Do it yourself

Unlimited Server 24/7/365 Support

Linux/Open Source and Cloud computing Wim Coekaerts Senior Vice President, Linux and Virtualization Engineering

Fundamentals of Continuous Integration

Automation and DevOps Best Practices. Rob Hirschfeld, Dell Matt Ray, Opscode

Migration Scenario: Migrating Backend Processing Pipeline to the AWS Cloud

Cloud Security with Stackato

EXECUTIVE SUMMARY CONTENTS. 1. Summary 2. Objectives 3. Methodology and Approach 4. Results 5. Next Steps 6. Glossary 7. Appendix. 1.

Continuous Integration (CI) for Mobile Applications

Cost effective methods of test environment management. Prabhu Meruga Director - Solution Engineering 16 th July SCQAA Irvine, CA

Resolving Active Directory Backup and Recovery Requirements with Quest Software

On- Prem MongoDB- as- a- Service Powered by the CumuLogic DBaaS Platform

Making Your ColdFusion Apps Highly Available. Brian Klaas Johns Hopkins Bloomberg School of Public Health

UserLock advanced documentation

ovirt self-hosted engine seamless deployment

Enterprise Application Monitoring with

STeP-IN SUMMIT June 18 21, 2013 at Bangalore, INDIA. Performance Testing of an IAAS Cloud Software (A CloudStack Use Case)

The Defense RESTs: Automation and APIs for Improving Security

Platform as a Service and Container Clouds

Continuous Integration and Delivery. manage development build deploy / release

Blackboard Collaborate Web Conferencing Hosted Environment Technical Infrastructure and Security

Migration and Building of Data Centers in IBM SoftLayer with the RackWare Management Module

Scalable Architecture on Amazon AWS Cloud

WHITE PAPER. Best Practices to Ensure SAP Availability. Software for Innovative Open Solutions. Abstract. What is high availability?

Why we Picked CF as the Basis for our Public Cloud Multi-Tenant Platform

@jenkinsconf. Maintaining huge Jenkins clusters - Have we reached the limit of Jenkins?

AVLOR SERVER CLOUD RECOVERY

How To Fix A Fault Fault Fault Management In A Vsphere 5 Vsphe5 Vsphee5 V2.5.5 (Vmfs) Vspheron 5 (Vsphere5) (Vmf5) V

PCI COMPLIANCE ON AWS: HOW TREND MICRO CAN HELP

Linux A first-class citizen in Windows Azure. Bruno Terkaly bterkaly@microsoft.com Principal Software Engineer Mobile/Cloud/Startup/Enterprise

Migration and Building of Data Centers in IBM SoftLayer with the RackWare Management Module

CONDOR CLUSTERS ON EC2

Deploying Splunk on Amazon Web Services

SELENIUM GRID BUILD VS. BUY

Disk-to-Disk-to-Tape Backup for a Citrix XenServer Cluster Bacula conference Robert Müller

Simple Tips to Improve Drupal Performance: No Coding Required. By Erik Webb, Senior Technical Consultant, Acquia

HOPS: Hadoop Open Platform-as-a-Service

80% 50x. 30x. CASE STUDY: How WaveMaker Got Faster, Better, More Agile with Docker. Lower Costs. Better Performance. Greater App Density

Continuous Integration Processes and SCM To Support Test Automation

Acquia Hosting Services. Version: v1 By: Dave Gully, Regional Sales Manager. Date: April 2014

Cloud UT. Pay-as-you-go computing explained

CHEF IN THE CLOUD AND ON THE GROUND

Postgres Plus Cloud Database!

How to choose the right PaaS Platform?

insync Installation Guide

Testing Automation for Distributed Applications By Isabel Drost-Fromm, Software Engineer, Elastic

Java PaaS Enabling CI, CD, and DevOps

Server Virtualization with Windows Server Hyper-V and System Center

Transcription:

TestOps: Continuous Integration when infrastructure is the product Barry Jaspan Senior Architect, Acquia Inc.

This talk is about the hard parts. Rainbows and ponies have left the building.

Intro to Continuous Integration Maintain a code repository Automate the build Make the build self-testing Everyone commits to the baseline every day Every commit (to baseline) should be built Keep the build fast Test in a clone of the production environment Make it easy to get the latest deliverables Everyone can see the results of the latest build Automate deployment

Intro to Acquia Cloud PaaS for PHP apps Multiple environments: Dev, Stage, Prod, Continuous Integration environment for your app Special sauce for Drupal Obligatory Impressive Numbers, ca. 03/2014 27 billion origin hits per month 422 TB data xfer/month 8000+ EC2 instances Release every 1.6 days on average Each release alters the infrastructure under thousands of web apps that we do not control Our customers REALLY hate downtime

Server configuration is software Puppet, Chef, and similar tools turn server config into software Hurray! Deserves the same best practices as app development TestOps == test-related CI principles for infrastructure software Make the build self-testing Test in a clone of the production environment Keep the build fast If it isn t tested, it doesn t work.

Unit tests vs. System tests Unit tests isolate individual program modules Injection mocks out external systems Problem: You can t mock out the real world and get accurate results Server configuration interacts with the OS, network, and services puppetd noop doesn t tell the whole story Sample failure crontab and cron daemon race condition

Unit tests vs. System tests System tests are end-to-end Apply code changes to real, running servers Exercise the infrastructure as the app(s) will Problem: Reality is very messy! Launch failures Race conditions Vendor API scheduled maintenance Cosmic rays

System tests FTW For infrastructure, system tests are essential Making Acquia Cloud s system tests reliable is the hardest engineering challenge I ve ever faced. We would be totally dead without them.

Test in a clone of production No back-doors to make tests easier Ex: HIPPA, PCI, etc. security requirements Access only from a bastion server using two-factor authentication No root logins, even from the bastion Any failure here locks Ops out from all servers! Tests operate just as admins do Most tests operate from a bastion : ssh testadmin@server sudo bash -c cmd Ensures the code works in production

Server build strategy Always build from a reference base e.g. Ubuntu 12.04 Server 64-bit No incrementally evolved images Puppet makes this natural Sidebar: Docker gets this totally wrong! Puppet can take a while Makes tests and MTTR slow More on this later

Basic build tests Launch VMs, run puppet Replicate a functional production environment Isolated from production Scan syslog for errors Test config files, daemons, users, cron jobs Sample failure Incorrect Puppet dependencies work while iterating on development instances but not on clean launch

Functionally test the moving parts Backup and restore Message queues Worker auto-scaling Load balancing with up and down workers ELB health check and recovery Database failover Monitoring & Alerting Self healing

Application test Install and verify application(s) Real site code Real site db (scrubbed) Cause app to exercise the infrastructure Write to database, message queues, etc. Verify success on the back end Operate app on degraded infrastructure Failed web nodes Database failover

Reboot test Reboot all test servers Re-run build tests Filesystems mounted? Services restarted? Re-run functional and application tests Sample failure Database quota daemon starts via /etc/init.d before MySQL daemon, then aborts

Relaunch test Relaunch all test servers from base image Simulates server crash and recovery Persistent data retained? Server rejoins services? (e.g. MySQL replication restarts) Unexpected issues, ex: tmpfs Re-run build, functional, and application tests Sample failure Non-deployable customer application prevents relaunch from completing normally

Upgrade test So far we've talked about new servers Use case: Your service is growing. Also need to test upgrading existing servers Use case: You add a feature. The Upgrade Test Dance Launch servers in test environment on current production code Run smoke tests to ensure system is operating Upgrade servers to latest development code Requires a fully automated upgrade process Run build, functional, and application tests from development code

Upgrade release process Puppet cannot orchestrate all upgrades Rolling upgrade across HA clusters Server type upgrade order requirements Post-release tasks ex: Uninstall package X once all servers are upgraded Devs document release procedure with the commit Different devs run the release on an internal-use installation Sample failure nginx failed to restart after version upgrade because prod server has more domain names than test

Server builds: continuous imaging Remember: Always launch from a reference image. No evolved images! Building servers from scratch can be slow Automate pre-built images from development branch Speeds intra-day tests, reduces MTTR in prod You will hit unexpected bumps in the road Sample failure MySQL server, EBS, and init.d

Continuous Imaging image tests Create development images nightly Create per-branch images at release Run system tests on both base and pre-built images Test upgrade from per-branch to development pre-built images

Testing in parallel Infrastructure system tests are slow Run them in parallel Workers may alter server-wide behavior (e.g. kill Apache) Each worker needs an isolated set of servers Workers that break their servers need to self destruct, or they will cause false failures Optimize running time Add more workers Reduce setup time Run the slower tests first

Management issues

Who writes the tests? Our tests are as, or more, complex than the product Tests often take longer to write Subtle cases require white-box testing Triggering specific failure scenarios requires understanding OS and code details together First try: QA department Did not work, they could not keep up or go deep Now: Engineering Every dev writes unit and system tests for their own code

Who fixes the tests? Infrastructure system tests are fragile The damn things break for every little bug! and every race condition imaginable and every cosmic ray Code reviews require a passing run Author must analyze any failures, confirm they are unrelated, and refer to or open a ticket for it Bugs often only occur post-commit Permanent, rotating team handles failures Authority to revert any commit causing a failure Usually it is easier to fix it instead

Who invests in the tests? Management must accept that infrastructure system tests are hard time-consuming essential worth it Under-investing will bite you badly If it isn t tested, it doesn t work. It will fail, at the worst possible time

Questions? Barry Jaspan, barry.jaspan@acquia.com Please evaluate this session! Acquia is hiring! Boston, New York, Portland Europe! Australia!!! wherever you are