Configuration Management Evolution at CERN. Gavin McCance gavin.mccance@cern.ch @gmccance

Similar documents
Agile Infrastructure: an updated overview of IaaS at CERN

A central continuous integration platform

Deploying Foreman in Enterprise Environments 2.0. best practices and lessons learned. Nils Domrose Cologne, August,

SUCCESFUL TESTING THE CONTINUOUS DELIVERY PROCESS

CERN Cloud Architecture

SUCCESFUL TESTING THE CONTINUOUS DELIVERY PROCESS

PES. Ermis service for DNS Load Balancer configuration. HEPiX Fall Aris Angelogiannopoulos, CERN IT-PES/PS Ignacio Reguero, CERN IT-PES/PS

Continuous security audit automation with Spacewalk, Puppet, Mcollective and SCAP

Pro Puppet. Jeffrey McCune. James TurnbuII. Apress* m in

PES. High Availability Load Balancing in the Agile Infrastructure. Platform & Engineering Services. HEPiX Bologna, April 2013

DevOps. Jesse Pai Robert Monical 8/14/2015

The Agile Infrastructure Project. Monitoring. Markus Schulz Pedro Andrade. CERN IT Department CH-1211 Genève 23 Switzerland

System Management with RHN Satellite

Git Branching for Continuous Delivery

FUJITSU Software ServerView Cloud Monitoring Manager V1 Introduction

Build Automation for Mobile. or How to Deliver Quality Apps Continuously. Angelo Rüggeberg

Azure Day Application Development

Kubernetes-Murano Integration in Mirantis OpenStack 7.0

Testing Automation for Distributed Applications By Isabel Drost-Fromm, Software Engineer, Elastic

Massively! Continuous Integration! A case study for Jenkins at cloud-scale

Of Pets and Cattle and Hearts

Openshift for Continuous Integration

Modern Web development and operations practices. Grig Gheorghiu VP Tech Operations Nasty Gal

Application Release Automation (ARA) Vs. Continuous Delivery

PES. Creating Load-Balanced Services on top of Cloud Infrastructure and Puppet. Platform & Engineering Services. Vítor Gouveia, vitor.gouveia@cern.

Deployment Guide. How to prepare your environment for an OnApp Cloud deployment.

Jenkins World Tour 2015 Santa Clara, CA, September 2-3

DevOps in OpenStack Public Cloud 副 标 题 副 标 题 副 标 题 Presented at OpenStack Summit, Fall 2012, San Diego

Servers. Servers. NAT Public Subnet: /20. Internet Gateway. VPC Gateway VPC: /16

Continuous Integration using Docker & Jenkins

developing sysadmin - sysadmining developers

Using Git with Rational Team Concert and Rational ClearCase in enterprise environments

SSM6437 DESIGNING A WINDOWS SERVER 2008 APPLICATIONS INFRASTRUCTURE

Apache Sentry. Prasad Mujumdar

Monitor Open stack environments from the bottom up and front to back. Roger Ruttimann VP Engineering, GroundWork OpenSource November 17, 2015

Neelesh Kamkolkar, Product Manager. A Guide to Scaling Tableau Server for Self-Service Analytics

Sales Slide Midokura Enterprise MidoNet V1. July 2015 Fujitsu Limited

Puppet Firewall Module and Landb Integration

KVM, OpenStack, and the Open Cloud

OpenStack Towards a fully open cloud. Thierry Carrez Release Manager, OpenStack

Running an OpenStack Cloud for several years and living to tell the tale. Alexandre Maumené Gaëtan Trellu Tokyo Summit, November 2015

Carrier-grade VoIP platform with Kamailio at 1&1

Cloud.. Migration? Bursting? Orchestration? Vincent Lavergne SED EMEA, South Gary Newe Sr SEM EMEA, UKISA

Going Hybrid. The first step to your! Enterprise Cloud journey! Eric Sansonny General Manager!

Optimizing Service Levels in Public Cloud Deployments

Hadoop on OpenStack Cloud. Dmitry Mescheryakov Software

Monitoring Evolution WLCG collaboration workshop 7 July Pablo Saiz IT/SDC

Continuous Integration and Delivery at NSIDC

Configuration report 04/09-17/09.

Status and Evolution of ATLAS Workload Management System PanDA

High Performance Computing OpenStack Options. September 22, 2015

GitLab as an Alternative Development Platform for Github.com

Version Control Your Jenkins Jobs with Jenkins Job Builder

How Bigtop Leveraged Docker for Build Automation and One-Click Hadoop Provisioning

KVM, OpenStack, and the Open Cloud

Design and Implementation of IaaS platform based on tool migration Wei Ding

Scalability of web applications. CSCI 470: Web Science Keith Vertanen

Elasticsearch on Cisco Unified Computing System: Optimizing your UCS infrastructure for Elasticsearch s analytics software stack

Streamlining Infrastructure Monitoring and Metrics in IT- DB-IMS

Cloud Computing Paradigm

Building Storage as a Service with OpenStack. Greg Elkinbard Senior Technical Director

DevOps Course Content

How To Manage A Multi Site In Drupal

White Paper. The Importance of Automating the End to End Pipeline for Continuous Delivery

Running Oracle Databases in a z Systems Cloud environment

Introducing ScienceCloud

Red Hat Satellite Overview & Roadmap

Building a Continuous Integration Pipeline with Docker

Workshop on Hadoop with Big Data

RED HAT INFRASTRUCTURE AS A SERVICE OVERVIEW AND ROADMAP. Andrew Cathrow Red Hat, Inc. Wednesday, June 12, 2013

Configuration Management Change Management, and Culture Management

SUSE Manager. A Comprehensive Linux Server Management the Linux Way. Name. Title

STeP-IN SUMMIT June 18 21, 2013 at Bangalore, INDIA. Performance Testing of an IAAS Cloud Software (A CloudStack Use Case)

High Performance Big Data Analy5cs powered by Unique Web Accelera5on and NoSQL. The Big Data Engine

PaaS solutions evaluation

A Sumo Logic White Paper. Harnessing Continuous Intelligence to Enable the Modern DevOps Team

Improve performance and availability of Banking Portal with HADOOP

Database Monitoring Requirements. Salvatore Di Guida (CERN) On behalf of the CMS DB group

Self service for software development tools

PES. TWiki at CERN Service Evolution. Platform & Engineering Services. Terje Andersen, Peter Jones for IT-PES-IS Jan 2014

Web Application Platform for Sandia

Fundamentals of Continuous Integration

Software Development In the Cloud Cloud management and ALM

Acano solution. Acano Manager R1.1 FAQs. Acano. December G

Automation & Open Source. How to tame the Cloud?

Transcription:

Configuration Management Evolution at CERN Gavin McCance gavin.mccance@cern.ch @gmccance

Agile Infrastructure Why we changed the stack Current status Technology challenges People challenges Community The Agile Infrastructure Making IT operations better since 2013 Puppet Openstack Foreman Jenkins mcollective Koji git ActiveMQ 14/10/2013 CHEP 2013 3

Why? Homebrew stack of tools Twice the number machines, no new staff New remote data-centre Adopting more dynamic Cloud model We re not special Existence of open source tool chain: OpenStack, Puppet, Foreman, Kibana Staff turnover Use standard tools we can hire for it and people can be hired for it when they leave 14/10/2013 CHEP 2013 4

Agile Infrastructure stack Our current stack has been stable for one year now See plenary talk at last CHEP (Tim Bell et al) Virtual server provisioning Cloud operating system : OpenStack -> (Belmiro, next) Configuration management Puppet + ecosystem as configuration management system Foreman as machine inventory tool and dashboard Monitoring improvements Flume + Elasticsearch + Kibana -> (Pedro, next++) 14/10/2013 CHEP 2013 5

Puppet Puppet manages nodes configuration via manifests written in Puppet DSL All nodes check in frequently (~1-2 hours) and ask for configuration Configuration applied frequently to minimise drift Using the central puppet master model..rather than masterless model No shipping of code, central caching and ACLs 14/10/2013 CHEP 2013 6

Separation of data and code Puppet Hiera splits configuration data from code Treat Puppet manifests really as code More reusable manifests Heira is quite new: old manifests are catching up Hiera can use multiple sources for lookup Currently we store the data in git Investigating DB for canned operations 14/10/2013 CHEP 2013 7

Modules and Git Manifests (code) and hiera (data) are version controlled Puppet can use git s easy branching to support parallel environments Later 14/10/2013 CHEP 2013 8

Foreman Lifecycle management tool for VMs and physical servers External Node Classifier tells the puppet master what a node should look like Receives reports from Puppet runs and provides dashboard 14/10/2013 CHEP 2013 9

14/10/2013 CHEP 2013 10

14/10/2013 CHEP 2013 11

Deployment at CERN Puppet 3.2 Foreman 1.2 Been in real production for 6 months Over 4000 hosts currently managed by Puppet SLC5, SLC6, Windows ~100 distinct hostgroups in CERN IT + Expts New EMI Grid service instances puppetised Batch/Lxplus service moving as fast as we can drain it Data services migrating with new capacity AI services (Openstack, Puppet, etc) 14/10/2013 CHEP 2013 12

Key technical challenges Service stability and scaling Service monitoring Foreman improvements Site integration 14/10/2013 CHEP 2013 13

Scalability experiences Most stability issues we had were down to scaling issues Puppet masters are easy to load-balance We use standard apache mod_proxy_balancer We currently have 16 masters Fairly high IO and CPU requirements Split up services Puppet critical vs. non critical 12 backend nodes Bulk 4 backend nodes Interactive 14/10/2013 CHEP 2013 14

Scalability guidelines 14/10/2013 CHEP 2013 15

Scalability experiences Foreman is easy to load-balance Also split into different services That way Puppet and Foreman UI don t get affected by e.g. massive installation bursts Load balancer ENC UI/API Reports processing 14/10/2013 CHEP 2013 16

PuppetDB All puppet data sent to PuppetDB Querying at compile time for Puppet manifests e.g. configure load-balancer for all workers Scaling is still a challenge Single instance manual failover for now Postgres scaling Heavily IO bound (we moved to SSDs) Get the book 14/10/2013 CHEP 2013 17

Monitor response times Monitor response, errors and identify bottlenecks Currently using Splunk will likely migrate to Elasticsearch and Kibana 14/10/2013 CHEP 2013 18

Upstream improvements CERN strategy is to run the main-line upstream code Any developments we do gets pushed upstream e.g Foreman power operations, CVE reported VM Openstack Nova API VM Foreman Proxy VM IPMI IPMI IPMI Physical box Physical box Physical box 14/10/2013 CHEP 2013 19

Site integration Using Opensource doesn t get completely get you away from coding your own stuff We ve found every time Puppet touches our existing site infrastructure a new service or plugin is born Implementing our CA audit policy Integrating with our existing PXE setup and burnin/hardware allocation process - possible convergence on tools in the future Razor? Implementing Lemon monitoring masking use-cases nothing upstream, yet.. 14/10/2013 CHEP 2013 20

People challenges Debugging tools and docu needed! PuppetDB helpful here Can we have X, Y and Z? Just because the old thing did it like that, doesn t mean it was the only way to do it Real requirements are interesting to others too Re-understanding requirements and documentation and training Great tools how do 150 people use them without stepping on each other? Workflow and process 14/10/2013 CHEP 2013 21

Your special feature Your test box My special feature QA machines QA Production Most machines Use git branches to define isolated puppet environments 14/10/2013 CHEP 2013 22

Easy git cherry pick 14/10/2013 CHEP 2013 23

Git workflow

Git model and flexible environments For simplicity we made it more complex Each Puppet module / hostgroup now has its own git repo (~200 in all) Simple git-merge process within module Delegated ACLs to enhance security Standard QA and production branches that machines can subscribe to Flexible tool (Jens, to be open-sourced by CERN) for defining feature developments Everything from production except for the change I m testing on my module 14/10/2013 CHEP 2013 25

Strong QA process Mandatory QA process for shared modules Recommended for non-shared modules Everyone is expected to have some nodes from their service in the QA environment Normally changes are QA d for at least 1 week. Hit the button if it breaks your box! Still iterating on the process Not bound by technology Is one week enough? Can people freeze? 14/10/2013 CHEP 2013 26

Community collaboration Traditionally one of HEPs strong points There s a large existing Puppet community with a good model - we can join it and open-source our modules New HEPiX working group being formed now Engage with existing Puppet community Advice on best practices Common modules for HEP/Grid-specific software https://twiki.cern.ch/twiki/bin/view/hepix/configmanagem ent https://lists.desy.de/sympa/info/hepix-config-wg 14/10/2013 CHEP 2013 27

http://github.com/cernops for the modules we share Pull requests welcome! 14/10/2013 CHEP 2013 28

Summary The Puppet / Foreman / Git / Openstack model is working well for us 4000 hosts in production, migration ongoing Key technical challenges are scaling and integration which are under control Main challenge now is people and process How to maximise the utility of the tools The HEP and Puppet communities are both strong and we can benefit if we join them together https://twiki.cern.ch/twiki/bin/view/hepix/configmanagement http://github.com/cernops 14/10/2013 CHEP 2013 29

Backup slides 14/10/2013 CHEP 2013 30

mcollective, yum Bamboo Puppet OpenStack Nova AIMS/PXE Foreman JIRA Active Directory / LDAP Yum repo Pulp Koji, Mock git Hardware database Puppet-DB Lemon / Hadoop 14/10/2013 CHEP 2013 31