Configuration Management, Change Management, and Culture Management




HEPiX Fall 2014 - Thursday October 16, 4:00pm
James Pryor - pryor@bnl.gov
RHIC and ATLAS Computing Facility at Brookhaven National Laboratory

Outline: Past, Present, Future Plans & Desires, Credits / Discussion

Past
Many ways to do OS/application deployment and configuration.
OS deploy: a single PXE/kickstart server in 2007. Lots of kickstart files: at best 5:1, at worst 1:1.
We used a post-install RPM file with base-level config files & scripts.
Started with Cfengine in 2008 on RH Linux and Solaris.
Used it for very basic OS configuration and some application configuration.

Past: 2010
Identified some problems, local & global.
Kickstart file is installation only. Once installed, changes over time make unique configurations.
We found Cfengine 2.x and Cfengine 3.x not ideal.
On a wider scope: we (humans) don't scale.
Separate pools of knowledge: files, internal web, internal mail, our minds, external web.
More than one way to create/fix/diagnose it.
Errors & issues are hard to debug/replicate.
We cannot do it all or remember it all.

We cannot do it all. We don't have super powers.
Image: Shawn Hoke, https://www.flickr.com/photos/shawnhoke/14908288722, Attribution-NonCommercial-NoDerivs 2.0 Generic (CC BY-NC-ND 2.0)

Past: Change Management
Policy & procedures (as formal as ITIL or as informal as DevOps) used to control changes and to keep a historical record of the changes made to production/development/test systems.
Uncontrolled change can work, but it will often cause self-inflicted problems, future firefighting episodes & upgrade nightmares.
Without change management, servers/applications become like snowflakes: they start out identical, but over time configuration drift makes each one unique.

They are pretty to look at but hard to manage.
Images: Alexey Kljatov, Creative Commons (CC BY-NC 2.0)
https://www.flickr.com/photos/chaoticmind75/6715743931
https://www.flickr.com/photos/chaoticmind75/6922463361
https://www.flickr.com/photos/chaoticmind75/6737065985

Past: Change Management
Increase agility. Do more with less. Shorten time to solution.
Standardize! We want to stop duplicating work and effort.
Almost everyone is using change management. Align ourselves to be congruent with other entities (Labs, Univ, Corp) so that we can build upon their success.
Follow proven methods of improvement and change management to shift staff time from perpetual reactive mode (firefighting) to more proactive work (fire prevention).
No unauthorized changes; no "cowboy" or "superhero" type behavior tolerated.

Past: 2010
Moved to Puppet for configuration management: easy to learn, facts, modular, idempotence, reporting. Intro-level training in Aug 2010.
In Fall 2010, we chose Cobbler for Linux OS deployment via PXE/Kickstart. Its CLI & Web interface manage distros, systems, and repos. Modularity through variables, templates, and profiles.
Selected GLPI for asset management, and FusionInventory Agent to collect server details.

Past: 2011 to 2012
Converted kickstarts into a single Cobbler template that supports just about all use cases, and converted the RPM post-install configuration shell scripts into Puppet code.
Developed front-end Perl web scripts to manage SSL CA certs and to act as a change management 'gateway' to the back-end git/cgit.
Puppet External Node Classifier: GLPI, Puppet Dashboard, Linux worker node pool custom inventory DB.
Created a foundational base class that encapsulates the desired default configuration, then worked on managing services with Puppet.
Started working on Desktop Centralized Management.

This flowchart is from a presentation made in late 2010.

Cobbler System view. racf-min.ks is our shared templated kickstart.

GLPI Server details. Note the circled Custom tab.

GLPI Custom tab details. Note the circled base class.

Base class is 'community property' where we build the foundation for infrastructure servers. Code edited/resized for screenshot.

This is the linuxfarm group's workernode class.

This shows we have one client host cert that needs to be signed.

I need to clean a client host cert, so I can search for that host with regex support.

It is cleaning one host cert and removing any exported resources tied to that host.

Present: Make a change to production

    pryor@mydesktop (production) ~/gitree/catalog/common/cvesecurity/manifests $ git push origin production
    <trimmed>
    remote: diff-tree:
    remote: :100644 100644 04eb0bc... 7bf289b... M common/cvesecurity/manifests/cve_2014_7169.pp
    remote: Merge results:
    remote: Updating ce24fcc..d8a7860
    remote: Fast-forward
    remote:  common/cvesecurity/manifests/cve_2014_7169.pp | 1 +
    remote:  1 file changed, 1 insertion(+)
    remote: Note: Your updates to the production branch are waiting for approval in branch: pending-production-pryor-ce24fcc-20141009t201915utc
    remote:
    remote: error: hook declined to update refs/heads/production
    To https://webdocs.racf.bnl.gov/git/puppet/catalog
     ! [remote rejected] production -> production (hook declined)
    error: failed to push some refs to 'https://webdocs.racf.bnl.gov/git/puppet/catalog'
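The holding-branch name in the hook's note appears to follow the pattern pending-<branch>-<user>-<short old commit>-<UTC timestamp>. The sketch below shows how a server-side hook might build such a name. The function name and the field order are inferred from the single example in the push output, not taken from the RACF approval scripts.

```python
from datetime import datetime, timezone


def pending_branch_name(branch, user, old_sha, when):
    """Build a holding-branch name like the one in the hook's note.

    The pattern is inferred from one example in the push output;
    the real approval scripts may construct it differently.
    """
    stamp = when.astimezone(timezone.utc).strftime("%Y%m%dt%H%M%Sutc")
    return f"pending-{branch}-{user}-{old_sha[:7]}-{stamp}"


name = pending_branch_name(
    "production", "pryor", "ce24fcc",
    datetime(2014, 10, 9, 20, 19, 15, tzinfo=timezone.utc),
)
print(name)
```

With the values from the push above, this reproduces the branch name shown in the note. Parking the commits in a uniquely named branch lets the hook reject the direct update to production while keeping the work available for review.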

Puppet Approval Change Management Portal showing my commit to production.

Diff of my commit as seen in cgit.

Do you want to merge this?

Puppet Change Approval Committee. Note the regex under the Authorization column.

RACF Puppet Catalog. The regex from the previous slide matches up with the directories (in blue).
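One way to picture the regex-based authorization: each committee member has a pattern, and a member may approve a change only if the pattern matches the paths the change touches. The sketch below is a minimal model of that idea. The user names, the patterns, the `linuxfarm/` directory, and the rule that every changed path must match are all assumptions for illustration; the slides show only that regexes map to catalog directories.

```python
import re

# Hypothetical approver table: user -> regex over catalog paths.
AUTHORIZATIONS = {
    "alice": r"^common/.*",
    "bob": r"^linuxfarm/.*",
}


def approvers_for(paths, authorizations=AUTHORIZATIONS):
    """Return the users whose regex matches every changed path."""
    return sorted(
        user
        for user, pattern in authorizations.items()
        if all(re.match(pattern, p) for p in paths)
    )


print(approvers_for(["common/cvesecurity/manifests/cve_2014_7169.pp"]))
```

Keying authorization to directory patterns means a new module inherits its approvers from where it lives in the tree, without touching the approver table.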

Merge results.

Approval scripts also record who approved what for auditing purposes.

Summary of all log messages making up the branch merge, for cases where there are many commits in that branch.

Present: Reporting, Performance Metrics
Puppet Dashboard is deprecated. We now use The Foreman and PuppetDB for Puppet client run reporting.
The Foreman is a RHEL 6.5 VM on the RHEV cluster, assigned 8 vCPUs and 8GB RAM.
A single Puppet Master server: Dell PowerEdge R610 with dual Intel Xeon X5660 2.80GHz CPUs, 96GB RAM, RAID10 900GB, RHEL 6.5.
It runs just about everything: Apache, Perl Puppet CA, Perl Puppet Approval, git/cgit, PuppetDB (w/ PostgreSQL), GLPI (w/ MySQL).


Diagram: Puppet client agent check-in method & interval
Servers: The Foreman (VM on RHEV, PostgreSQL); Puppet Master (Dell PowerEdge R610, dual Intel Xeon X5660 2.80GHz CPUs, 96GB RAM, RAID10 900GB, RHEL 6.5; Apache, Perl Puppet CA, Perl Puppet Approval, git/cgit, PuppetDB (w/ PostgreSQL), GLPI (w/ MySQL)).
Check-in methods: manual check_puppet.py run without the Puppet daemon; Puppet daemon check-in once per hour.
Client groups: Linux farm worker nodes; Grid, Infrastructure, Storage, Cloud hosts (under 600 machines); Farm Infrastructure, DB.

Present: Load test & Metrics
We tested our Puppet master with client agents checking in at about 1Hz. PuppetDB reports that over 2500 agents checked in during this test, which ran a little over an hour.
Load (top command output) as seen about 12 minutes after starting a manual puppet agent run (14:45) on the Linux Farm worker nodes:

    top - 14:57:23 up 3:55, 2 users, load average: 4.22, 4.22, 3.03
    Tasks: 551 total, 4 running, 547 sleeping, 0 stopped, 0 zombie
    Cpu(s): 20.9%us, 2.1%sy, 0.0%ni, 76.8%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
    Mem: 99051792k total, 31167540k used, 67884252k free, 340332k buffers
    Swap: 8388604k total, 0k used, 8388604k free, 4798864k cached

     PID USER      PR  NI  VIRT   RES   SHR S %CPU %MEM     TIME+ COMMAND
    9178 puppet    20   0  302m  195m  3404 S 96.5  0.2   4:19.72 ruby
    7361 puppet    20   0  291m  183m  3404 S 86.6  0.2   4:24.94 ruby
    8992 puppet    20   0  362m  254m  3404 R 74.6  0.3  13:43.31 ruby
    6993 puppetdb  20   0 14.6g  1.3g   12m S 71.3  1.3  21:26.15 java
     434 puppet    20   0  352m  244m  3404 R 54.7  0.3   8:32.26 ruby
    9070 puppet    20   0  276m  169m  3404 S 43.8  0.2   4:34.72 ruby
    9077 puppet    20   0  306m  199m  3400 S 33.2  0.2   5:03.86 ruby
    7808 postgres  20   0 24.4g  575m  569m S 26.5  0.6   0:14.28 postmaster
    8571 postgres  20   0 24.4g  670m  662m S  9.6  0.7   0:14.15 postmaster
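To put the test rate in context, it can be compared with the steady-state rate implied by the hourly daemon check-in. The arithmetic below is a rough sketch using only figures from the slides (under 600 daemon-managed hosts, one check-in per hour, a test at about 1 check-in per second). It assumes check-ins are spread evenly over the hour, which real agents may not do.

```python
def checkins_per_second(hosts, interval_seconds):
    """Average check-in rate if hosts are spread evenly over the interval."""
    return hosts / interval_seconds


# Figures from the slides: under 600 daemon-managed hosts checking in
# hourly, and a load test at about 1 check-in per second.
steady = checkins_per_second(600, 3600)
test_rate = 1.0
print(f"steady-state upper bound: {steady:.3f}/s")
print(f"load test vs steady state: {test_rate / steady:.1f}x")
```

Under these assumptions the load test drove the master at roughly six times the rate that the hourly daemon check-ins would produce on their own.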

Linux farm manual run started at approximately 14:45.

In a bit more than an hour, we saw between 40% and 60% CPU with about 2500 agents checking in.

PuppetDB metrics during the load test (shown across three slides).

Present: Puppet Code Statistics
Project name: RACF Puppet Catalog
Generated: 2014-10-10 05:05:07 (in 6 seconds)
Generator: GitStats (version ), git version 1.8.5.3
Report Period: 2011-01-04 13:54:30 to 2014-10-09 16:59:44
Age: 1375 days, 877 active days (63.78%)
Total Files: 2620
Total Lines of Code: 327595 (1209686 added, 882091 removed)
Average file size: 9068.73 bytes
Total Commits: 13396 (average 15.3 commits per active day, 9.7 per all days)
Authors: 31 (average 432.1 commits per author)
146 modules and 431 different .pp files, totaling nearly 34k lines of Puppet code.


Present: Puppet code style guide
In Aug & Sept of 2014, we used puppet-lint and Geppetto to update all our Puppet code to comply with the Puppet Labs suggested style guide.
The style guide is now enforced on push to the Git server via puppet-lint:

    puppet-lint --with-context --no-80chars-check init.pp
    WARNING: class not documented on line 1
    ERROR: trailing whitespace found on line 66
      }
       ^
    WARNING: indentation of => is not properly aligned on line 81
      source => $allow_from,
             ^
    WARNING: unquoted file mode on line 53
      mode => 644,
              ^
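A push-time check of this kind has to decide which files to lint and how to invoke the linter. The sketch below covers only that selection step, reusing the two flags shown in the command above. The function name and the sample file list are hypothetical, and the slides do not show how the RACF hook itself is written.

```python
def lint_commands(changed_paths):
    """Return one puppet-lint command line per changed Puppet manifest."""
    return [
        ["puppet-lint", "--with-context", "--no-80chars-check", path]
        for path in changed_paths
        if path.endswith(".pp")
    ]


# Hypothetical list of files touched by a push.
changed = ["common/cvesecurity/manifests/cve_2014_7169.pp", "README"]
for cmd in lint_commands(changed):
    print(" ".join(cmd))
```

A real hook would then run each command (for example with `subprocess.run`) and reject the push when the linter reports a failure, assuming puppet-lint is configured to fail on the problems the site cares about.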


Present: Culture and CM Adoption
Discovered that changing servers is easy. Changing people's workflow, and ultimately their minds, is much harder.
At first, mixed reception and some resistance.
Most people within our group adopted this workflow & toolset. Others did not at all.
Image: http://en.wikipedia.org/wiki/file:diffusionofinnovation.png, Attribution 2.5 Generic (CC BY 2.5)

Present: Puppet Use / Culture Management
Had some accidental & potentially dangerous commits to the production branch.
For the Change Management Approval gateway, implemented a self-approve delay, mandatory scroll & approval authorizations.
Never manually change production servers or commit untested code to production. Use test environments & servers.
We host a tree of the Puppet code base ("common") that is now shared and used upstream by the Physics and IT departments.
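The self-approve delay can be read as a simple time gate: an author who approves their own change must wait a fixed period after submitting it. The sketch below models that rule. The 10-minute value, the function name, and the assumption that other approvers are not subject to the delay are all hypothetical; the slides state only that a delay exists.

```python
from datetime import datetime, timedelta

# Hypothetical delay; the slides do not state the actual value.
SELF_APPROVE_DELAY = timedelta(minutes=10)


def may_approve(author, approver, submitted_at, now, delay=SELF_APPROVE_DELAY):
    """Assumed rule: others may approve at once; authors must wait out the delay."""
    if approver != author:
        return True
    return now - submitted_at >= delay


t0 = datetime(2014, 10, 9, 20, 19, 15)
print(may_approve("pryor", "pryor", t0, t0 + timedelta(minutes=1)))
print(may_approve("pryor", "pryor", t0, t0 + timedelta(minutes=15)))
```

The forced pause gives the author a chance to reread the diff before merging, which targets exactly the accidental commits mentioned above. This model leaves out the regex authorization check, which would apply in addition.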

If your infrastructure is now done in code, do you test your code? Or do you just push it to production and see if it works?

Image: Dos Equis Man. The Most Interesting Man in the World is a character & property of Cervecería Cuauhtémoc-Moctezuma. Used without permission on assumption of Fair Use / Parody.

He is not someone to admire. Don't be like him. You must test your code.

Future Plans and Desires
We look to adopt software development practices for our Puppet code: smoke testing, unit testing, acceptance testing.
Automatic testing (Continuous Integration) system with the Jenkins CI tool to manage our Puppet testing process. It runs Puppet on a pool of RHEV VMs, and all pending changes to production must pass this validation process before they can be approved and merged into production. See talk on this topic at future HEPiX and/or CHEP 2015.
Use The Foreman beyond just reporting: as both ENC & OS provisioner.

Future Plans and Desires
Work toward a community-shared Puppet code base beyond RACF/Physics/Lab ITD. This is desirable but is at least 1-2 years away from being realized.
Puppet with Hiera: a hierarchical data store keeps site-specific data out of your manifests. Avoid repetition (duplicating similar blocks of modular code), and use public Puppet modules. When they are used with Hiera, there is no need to edit the module code; just put the necessary data in Hiera.
Requires a rewrite/refactor of our code. This is a non-trivial project.
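The idea behind a hierarchical data store can be shown with a toy lookup: data lives in levels ordered from most to least specific, and a key resolves to the first level that defines it. The sketch below is a simplified model of that first-match behavior, not Hiera's implementation. The level names, keys, and values are hypothetical.

```python
# Toy hierarchy, most specific level first. Level names and data are
# hypothetical and only model the idea of a first-match lookup.
HIERARCHY = [
    ("node", {"ntp_server": "ntp1.example.org"}),
    ("site", {"ntp_server": "ntp.example.org", "timezone": "America/New_York"}),
    ("common", {"timezone": "UTC", "motd": "managed by Puppet"}),
]


def lookup(key, hierarchy=HIERARCHY):
    """Return the value from the first level that defines the key."""
    for _level, data in hierarchy:
        if key in data:
            return data[key]
    raise KeyError(key)


print(lookup("ntp_server"))
print(lookup("timezone"))
print(lookup("motd"))
```

Because site-specific values sit in the data levels rather than in the manifests, the same module code can be shared between sites that differ only in their data.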

Future Plans and Desires
Mcollective: a framework to build server orchestration and parallel job execution on clusters of servers. Not simply a fancy SSH "for loop"; it provides granularity and reporting.
Integrate Monitoring and Puppet: rewrite the existing Nagios Puppet class to support use of exported resources. Both the target node to be monitored and the Nagios server would execute Puppet code in a sort of conversation: "Hey Nagios server. I'm a node and have a new service. You need to monitor it." Then it would be monitored.
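The "conversation" between a node and the Nagios server can be modeled as a shared store: nodes publish the checks they want, and the monitoring server collects them on its next run. The sketch below is a toy model of that flow in plain Python, not Puppet code. The class, method names, and host names are hypothetical.

```python
class ExportedResources:
    """Toy stand-in for the shared store of exported resources."""

    def __init__(self):
        self._checks = {}

    def export(self, node, service):
        # A node's Puppet run declares a check for one of its services.
        self._checks[(node, service)] = f"check_{service}"

    def purge_node(self, node):
        # Cleaning a host removes the resources it exported.
        self._checks = {k: v for k, v in self._checks.items() if k[0] != node}

    def collect(self):
        # The monitoring server's Puppet run gathers every exported check.
        return sorted(self._checks)


store = ExportedResources()
store.export("node01", "http")
store.export("node02", "ssh")
print(store.collect())
store.purge_node("node01")
print(store.collect())
```

The `purge_node` step mirrors the certificate-cleaning screenshot earlier in the talk, where removing a host also removes the exported resources tied to it.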

Credit / Questions
Dr. Jason A. Smith, Mizuki Karasawa, John S. De Stefano Jr., William Strecker-Kellog, Christopher Hollowell, James Pryor
"Bernard De Chartres used to compare us to [puny] dwarfs perched on the shoulders of giants. He pointed out that we see more and farther than our predecessors, not because we have keener vision or greater height, but because we are lifted up and borne aloft on their gigantic stature." John of Salisbury, 1159