Goodbye war room, hello DevOps 2.0
Table of contents
Authors
Executive Summary
Say goodbye to the war room: Bug-fixing in production is crazy expensive
Ask the right questions AND get answers
It's time to level-up skills: Different metrics + different tools = different worldview
Where ARE your priorities? Show me the money
Shift left to your developers' workstations: Continuous delivery requires continuous quality
Automate, automate, automate: Too much data not enough time
Your homework
Best practices from two guys who eat, drink and breathe DevOps

Andreas Grabner
Performance Advocate, Dynatrace
Blog: blog.dynatrace.com
Twitter: @grabnerandi
Andreas Grabner has 15+ years of experience as an architect and developer in the Java and .NET space and is an advocate for high-performing applications. He is a regular contributor to large performance communities and a frequent speaker at technology conferences.

Brett Hofer
Senior Solution Architect, Dynatrace
Blog: blog.dynatrace.com
Twitter: @brett_solarch
Brett Hofer is passionate about DevOps and specializes in delivering complex, mission-critical software. With more than twenty years of experience, from product designer and solution architect to senior management, he has a unique 360° perspective of IT.
Executive Summary

Whether you call it DevOps or something else, we've seen many organizations trying to follow the book to transform the way software gets developed, tested and deployed. Companies like Facebook, Flickr, Etsy, Twitter, and Amazon have led the way and are seen as the DevOps Unicorns, but many others fail because they don't have the right organization, company culture, and tools in place.

The key is for the entire team (developers, testers and operations) to take ownership of delivering software faster to the end user without jeopardizing quality. Tools play a critical supporting role in this: they help companies become more efficient by automating tasks along the delivery pipeline.

DevOps 1.0 was a great way to get people thinking about the necessary transformation we learned from lean manufacturing in the 80s. Now it's time to evolve to DevOps 2.0, with everyone in your engineering organization leveling up skills and taking responsibility for the end product. War rooms can be the exception, not the rule, and we should all shift left when it comes to quality. Quality must be built into everything we do (Requirements, Engineering, Testing, and Deployment) and as much as possible it should be automated.

In this ebook, we'll focus on what you need to do to continue to transform:
> > Ask the right questions and get the right answers to act efficiently when there is a problem
> > Level-up development, test, and operations capabilities to build in quality right from the start
> > Define a common set of metrics based upon shared goals to do what good agile teams do well: collaborate
> > Identify the priorities for your business so you can give teams laser-like focus on what matters most
> > Integrate continuous quality metrics as gates in the delivery pipeline, reducing technical debt and unplanned work
> > Automate architectural, scalability, and performance analysis instead of being stuck with stacks of log files to wade through
Say goodbye to the war room

Bug-fixing in production is crazy expensive

We all know that fixing bugs in production is crazy expensive: 150 times more than earlier in the development lifecycle. [1] Think about your typical war room: a lot of really important and expensive people sitting in one room for days on end analyzing log files. Instead of working on new features for tomorrow, they are fixing yesterday's problems and working down accumulated technical debt. Unfortunately this scenario is all too common: much of a typical dev team's effort is spent on bug fixing instead of on new features, with bad software costing $60 billion annually. [2]

A world without war rooms?

Not to be overly dramatic, but it really is possible to eliminate most war room scenarios. Based on our experience, 80% of the problems leading to production issues are caused by only 20% of the problem patterns. You can drastically reduce war room scenarios by employing DevOps principles and actively monitoring every environment throughout the delivery pipeline with the right tools that accurately pinpoint problems.

"I've muddled over the same log files for weeks sometimes to extrapolate the relationships between different systems [...] before having my eureka moment."
- RecklessKelly (Operator) on reddit

Hmm, I wonder what the deployment rate looked like for this guy? Every 2 weeks? Every 4 weeks? Quarterly? If they deployed into production every two weeks, his insights were outdated just about the time he had his eureka moment!

1. Barry Boehm, 2007 Equity Keynote Address
2. NIST (National Institute of Standards and Technology) http://www.computerworld.com/article/2575560/it-management/study--buggy-software-costs-users--vendors-nearly--60b-annually.html
Ask the right questions AND get answers!

A lot of the people summoned to the war room have no clue whether the problems are something they can fix, or whether they are even the ones responsible. The evidence (infrastructure monitoring data, log files, user complaints, etc.) shows the symptoms, but nothing about the root cause. Just having a lot of log information and high-level data doesn't give you the answers to the questions that really matter. To do away with this scenario, what should the evidence be? What questions should you be asking?

Is an individual user complaining or are all users impacted?
Is it just the CEO complaining about a problem because a report doesn't work on his old IE10? Or is it just the end user in a remote location using dial-up? Knowing whether a problem happens for a very small group of users or for a large number of users all located in China, for example, is critical for prioritization.

(Figure caption: A significant number of users in China are NOT SATISFIED)

Is there a problem in the delivery chain (e.g. CDNs, 3rd parties, ISPs, Cloud Providers, hosted services, mobile networks)?
Modern web applications rely on a long list of services along the delivery chain. Knowing the impact of each tells you whether to look into your own data center or whether you should be calling Akamai or Facebook.

Is a critical transaction impacted?
When the error rate goes up, is it on a critical transaction such as search? Or is it unimportant because it is a BOT causing errors while it crawls through pages that don't exist anyway? You need to monitor performance on critical transactions and know which SMEs to call upon if there is a problem.

Is the problem in the application?
Applications are complex. If you know the problem is within the application, you then need to isolate where, so that you can get it to the right developers and architects fast.
Is the problem related to bad coding?
If application response time is slow, a first question should be whether it is due to bad coding. You need to analyze the performance hotspot at the code level to find out if the cause is inefficient algorithms or a lack of coding and architecture best practices.

Is the problem in the virtual machine?
There can be performance problems if virtual machines (e.g. VMware, EC2, Azure) or your containers (such as Docker) are not properly sized or are battling for resources with other virtual machines on the same virtual server. If you know the performance impact of virtualization on the application, you'll know to call in the VM experts, and not the app developers, to solve a problem.

Does the infrastructure cause an issue?
What if it is not the app itself, but the app is running low on resources provided by the infrastructure? What if the CPU required to run the Garbage Collector is not available because the machine is over-utilized? Then it's time to think about distributing applications differently or scaling the infrastructure.

Is the AppServer the issue?
The AppServer might be the cause of performance issues due to an incorrect setting or a corrupt deployment. Resource pool sizing (threads, database connections, etc.), security settings and logging options can all impact performance. If it turns out that the AppServer is the problem, you know to contact your IBM, Oracle or Microsoft specialist.

With the answers to these questions, you can eliminate the war room, identify the source of the problems quickly, prioritize them and find a solution. So instead of a 20-person war room, you have just three people (a developer, a tester and an operations guy) evaluating detailed performance insights and bringing in the experts as they are needed. Pretty cool!
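As a rough illustration of the first questions above (who is impacted, and is a critical transaction affected once bot traffic is ignored), here is a minimal Python sketch. The `Request` record, its fields, the `widespread_share` cut-off and the sample data are all hypothetical; real monitoring tools expose this kind of data in their own formats.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Request:
    user_id: str
    region: str        # e.g. "CN", "US" (hypothetical region codes)
    transaction: str   # e.g. "search", "report"
    is_bot: bool       # True if the request came from a crawler
    failed: bool


def impact_scope(requests, widespread_share=0.5):
    """Classify who is impacted, looking at human users only.

    widespread_share is an assumed cut-off: the share of all human
    users that must be affected before the problem counts as widespread.
    """
    humans = [r for r in requests if not r.is_bot]
    all_users = {r.user_id for r in humans}
    failing = [r for r in humans if r.failed]
    affected = {r.user_id for r in failing}
    if not affected:
        return "none"
    if len(affected) == 1:
        return "single user"
    if len(affected) / len(all_users) >= widespread_share:
        return "widespread"
    region, hits = Counter(r.region for r in failing).most_common(1)[0]
    if hits == len(failing):
        return f"regional ({region})"
    return "scattered"


def error_rate(requests, transaction):
    """Error rate for one transaction, ignoring bot traffic."""
    relevant = [r for r in requests
                if r.transaction == transaction and not r.is_bot]
    if not relevant:
        return 0.0
    return sum(r.failed for r in relevant) / len(relevant)


sample = [
    Request("u1", "CN", "search", False, True),
    Request("u2", "CN", "search", False, True),
    Request("u3", "US", "search", False, False),
    Request("u4", "US", "search", False, False),
    Request("u5", "US", "report", False, False),
    Request("bot", "US", "search", True, True),
]
print(impact_scope(sample))          # regional (CN)
print(error_rate(sample, "search"))  # 0.5
```

With this sample, the failing bot request is ignored, the problem is flagged as regional, and the search error rate covers human traffic only. That is the kind of answer that tells you whom to call before anyone books a war room.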
It's time to level-up skills

Different metrics + different tools = different worldview

When your DevOps practices are not aligned, developers, testers and operations will each naturally have a different worldview, and their performance will be measured by different metrics.
> > Developers: delivering new features and completing as many story points per sprint as possible
> > Testers: finding defects and pushing them back to developers
> > Operations: maintaining stability

If their goals are not the same, how will they ever work together to support the continuous delivery goals of your organization? They will be acting independently of each other, and when there is a problem, there will be a lot of finger pointing.

To reach your continuous delivery goals, you have to break down those team silos and get everyone on the same team: a team whose primary goal is to create great, high-quality software. It does no good to be throwing things over the wall in a constant cycle of code, test, break. Instead of testers finding 10 defects a day, they educate developers on common mistakes so that those mistakes are avoided from the start. Testers then focus on more critical tasks: acceptance testing and large-scale performance testing.

It is time to level-up the skills of your engineering team. All sides need to start understanding the challenges of the others. DevOps at its best is getting everyone to work together, agree on a common set of tools and metrics, and then agree on the definition of each metric and how it will be measured.

Application performance monitoring tools (the evolution of pure system monitoring tools into tools that now monitor end-to-end quality throughout the application delivery pipeline) get everyone speaking the same language. They offer a single, automated view of performance that is customized to each team member's role and needs.

When something goes wrong, the cross-functional team can fix it together without having to call up a war room.
Where ARE your priorities?

Show me the money

We know this has happened to you because it has happened to us: your engineering teams THINK there is an issue and burn a ton of money and time to fix it, BUT as it turns out, the business was really worried about an entirely different problem.

Sometimes all you see is red. There are 250 issues, but how do you wade through all of them to figure out what is most important?

Prioritize the problems
User experience and application performance monitoring tools can help by aggregating the massive amounts of data from all your users across the application delivery chain. They provide smart impact analysis with the option to go deep into the technical root cause of issues. They also prevent you from seeing all the red in the first place by identifying, from the first stages of the delivery pipeline, the problem patterns that could bring down your production system.

Prioritize improvements
Performance monitoring tools can help you prioritize improvements in end user experience:
> > Are your users following the path you expect in your application?
> > Are they using all of the functionality?

These tools can provide visibility into user behavior, highlighting opportunities to improve user workflow and even remove code, so that you don't build new features on top of functionality that no one uses.

Combine all this data with insight from the business on which transactions, applications, and user groups are critical, and your whole engineering team will have laser-like focus on what is most important: the issues that are costing your business the most in sales, brand equity and user satisfaction.
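One simple way to combine monitoring data with business insight is to weight each issue by how critical its transaction is. The sketch below is only an illustration: the `Issue` record, the `CRITICALITY` weights and the scoring formula (affected users multiplied by weight) are assumptions, not something a particular tool prescribes.

```python
from dataclasses import dataclass


@dataclass
class Issue:
    name: str
    affected_users: int
    transaction: str


# Hypothetical weights agreed with the business: higher means more critical.
CRITICALITY = {"checkout": 10, "search": 5, "report": 1}


def prioritize(issues, criticality=CRITICALITY):
    """Sort issues by affected users multiplied by transaction criticality."""
    def score(issue):
        # Transactions the business has not rated get a neutral weight of 1.
        return issue.affected_users * criticality.get(issue.transaction, 1)
    return sorted(issues, key=score, reverse=True)


issues = [
    Issue("slow monthly report", 400, "report"),
    Issue("search timeouts", 150, "search"),
    Issue("checkout errors", 60, "checkout"),
]
for issue in prioritize(issues):
    print(issue.name)
# search timeouts
# checkout errors
# slow monthly report
```

Note how the report problem touches the most users but still lands last: the weighting is what turns "all you see is red" into an ordered to-do list.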
Shift left to your developers' workstations

Continuous delivery requires continuous quality.

Continuous delivery works (just look at Amazon). But it can't be done just halfway. Remember from Chapter 1: the cost of bug fixes in production can be up to 150x more than if the bug is found earlier in the development lifecycle. [3]

[Figure: The cost of bug fixes increases exponentially. Axes: % of Bugs in Software and Relative Cost of a Bugfix, across Unit testing, Acceptance Testing, Performance Testing and Release; relative cost rises from 1x to 150x. Numbers adjusted by A. Grabner.]

To achieve continuous delivery, shift left and use performance monitoring to support continuous quality throughout the development lifecycle, starting even before code is checked in, at your developers' workstations.

You can deploy faster without failing faster with a single version of the performance truth from development, testing and production. As a result, you will reduce unplanned work and technical debt, freeing up more time for those fun, new feature releases.

Amazon deploys at an amazing pace:
> > Every 11.6 seconds with 23,000 deployments a day.
> > They have had 75% fewer outages since 2006, 90% fewer outage minutes.
> > Only 0.001% of deployments cause a problem. [4]

3. Barry Boehm, 2007 Equity Keynote Address
4. Amazon, 2012 Velocity Presentation
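To see why shifting left pays off, it helps to run the numbers. The following Python sketch compares the total relative cost of fixing the same 100 bugs before and after shifting left. Only the 150x production multiplier comes from the text; the 10x testing multiplier and both bug distributions are illustrative assumptions.

```python
# Relative cost of fixing one bug, by the stage where it is caught.
# 150x for production is the figure quoted in the text; the 10x value
# for testing is an illustrative placeholder, not a measured number.
RELATIVE_COST = {
    "development": 1,
    "testing": 10,
    "production": 150,
}


def total_cost(bugs_by_stage, relative_cost=RELATIVE_COST):
    """Total relative cost of fixing bugs, given how many are caught per stage."""
    return sum(relative_cost[stage] * count
               for stage, count in bugs_by_stage.items())


# Hypothetical distributions of the same 100 bugs.
before = {"development": 20, "testing": 60, "production": 20}
after = {"development": 60, "testing": 35, "production": 5}

print(total_cost(before))  # 3620
print(total_cost(after))   # 1160
```

Under these assumptions, catching 15 more bugs before production cuts the total fix cost by roughly two thirds, because the production multiplier dominates everything else.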
Automate, automate, automate

Too much data, not enough time.

At this point you have so much data coming in that you are probably worried about having the time and skills to analyze it all.

The beauty of application performance management is that you don't need to educate everyone. Once you have worked with your business teams to determine priorities, identified your key technical metrics and set your KPIs, you can automate performance monitoring.

You are building continuous quality into your continuous delivery pipeline by setting up metrics-based quality gateways at each stage of the pipeline. This approach strengthens the safety net, helping your team deal with changing requirements and catch errors early. This is a huge benefit.

[Figure: Monitor Tests, Analyze Results, Quality Gate in your Build Tool, applied across Unit testing, Acceptance Testing, Performance Testing and Release.]

Wondering what kinds of things you can automate?
> > Send change requests right back to developers after a load test
> > Send problems back to developers during an integration test
> > Set alerts when KPIs or SLAs are tested and breached
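A metrics-based quality gate can be as small as a script that compares measured values against agreed limits and reports a failure to the build tool. The Python sketch below shows the idea under stated assumptions: the metric names, the limits in `THRESHOLDS` and the `build_metrics` values are hypothetical, and in practice the measurements would come from your monitoring tool rather than a hard-coded dictionary.

```python
# Hypothetical KPI limits; each metric must stay at or below its limit.
THRESHOLDS = {
    "response_time_ms": 500,
    "sql_statements_per_request": 20,
    "error_rate_percent": 1.0,
}


def gate_violations(measured, thresholds=THRESHOLDS):
    """Return a message for every metric that is missing or over its limit."""
    violations = []
    for metric, limit in thresholds.items():
        value = measured.get(metric)
        if value is None:
            violations.append(f"{metric}: no measurement")
        elif value > limit:
            violations.append(f"{metric}: {value} exceeds limit {limit}")
    return violations


def run_gate(measured):
    """Print violations and return an exit code (0 = pass, 1 = fail)."""
    violations = gate_violations(measured)
    for line in violations:
        print("QUALITY GATE FAILED -", line)
    return 1 if violations else 0


# Example measurements from a test run (made-up values).
build_metrics = {
    "response_time_ms": 450,
    "sql_statements_per_request": 35,
    "error_rate_percent": 0.2,
}
exit_code = run_gate(build_metrics)
print("exit code:", exit_code)
```

Most build tools treat a non-zero exit code as a failed step, so passing `exit_code` to `sys.exit` at the end of such a script is one common way to stop a build that breaches its KPIs. A missing measurement is treated as a failure here on purpose: a gate that silently passes when the data is absent is not much of a safety net.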
Your homework

We hope that you, and everyone in your engineering organization, can use these best practices as you continue to level-up to DevOps 2.0. Go forward: make war rooms the exception and shift left on quality. Automate as much as possible, and keep focusing on faster and better!

DevOps tools we love
Here is a collection of tools that foster collaboration amongst Product Management, Development, IT Operations and Technical Support teams. The tools allow them to build more quality into their products and support them in establishing better feedback loops.
> > Change Control - JIRA
> > Development - Eclipse
> > Source Control - GitHub
> > Build Automation - Ant and Maven
> > Configuration Management - Ansible, Chef and Puppet
> > Test Automation - LoadRunner and Selenium
> > Virtual Machines - Vagrant, Packer and VeeWee

Webinar replays
> > 5 Key Metrics to Release Better Software Faster
> > DevOps: From Adoption to Performance

The blogroll
> > IT Revolution's DevOps
> > The Art of DevOps series
> > Software Quality Metrics for Continuous Delivery Part 1-3
> > Dynatrace APM on DevOps
> > DevOps reactions - Enjoy some DevOps humor!

Recommended reading
> > The Phoenix Project by co-author Gene Kim
> > The Speed of Trust by Stephen M. R. Covey
> > Release It! by Michael Nygard
> > Continuous Delivery by Jez Humble and David Farley
> > The Other Side of Innovation by Vijay Govindarajan
Learn more at dynatrace.com

Dynatrace is the innovator behind the new generation of Application Performance Management. Our passion: helping customers, large and small, see their applications and digital channels through the lens of end users. Over 5,800 organizations use these insights to master complexity, gain operational agility, and grow revenue by delivering amazing user experiences.

6.17.15