The Definitive Guide to Monitoring the Data Center, Virtual Environments, and the Cloud

Don Jones



Introduction to Realtime Publishers

by Don Jones, Series Editor

For several years now, Realtime has produced dozens and dozens of high-quality books that just happen to be delivered in electronic format at no cost to you, the reader. We've made this unique publishing model work through the generous support and cooperation of our sponsors, who agree to bear each book's production expenses for the benefit of our readers.

Although we've always offered our publications to you for free, don't think for a moment that quality is anything less than our top priority. My job is to make sure that our books are as good as (and in most cases better than) any printed book that would cost you $40 or more. Our electronic publishing model offers several advantages over printed books: You receive chapters literally as fast as our authors produce them (hence the "realtime" aspect of our model), and we can update chapters to reflect the latest changes in technology.

I want to point out that our books are by no means paid advertisements or white papers. We're an independent publishing company, and an important aspect of my job is to make sure that our authors are free to voice their expertise and opinions without reservation or restriction. We maintain complete editorial control of our publications, and I'm proud that we've produced so many quality books over the past years.

I want to extend an invitation to visit us at especially if you've received this publication from a friend or colleague. We have a wide variety of additional books on a range of topics, and you're sure to find something that's of interest to you, and it won't cost you a thing. We hope you'll continue to come to Realtime for your educational needs far into the future.

Until then, enjoy.

Don Jones

Introduction to Realtime Publishers

Chapter 1: Evolving IT: Data Centers, Virtual Environments, and the Cloud
  Evolving IT
    Remember When IT Was Easy?
    Distributed Computing: Flexible, But Tough to Manage
    Super Distributed Computing: Massively Flexible, Impossible to Manage?
  Three Perspectives in IT
    The IT End User
    The IT Department
    The IT Service Provider
  IT Concerns and Expectations
    IT End Users
    IT Departments
    IT Service Providers
  Business Drivers for the Hybrid, Super Distributed IT Environment
    Increased Flexibility
    Faster Time to Market
    Pay As You Go
  Business Goals and Challenges for the Hybrid IT Environment
    Centralizing Management Information
    Redefining Service Level
    Gaining Insight
    Maintaining Responsibility
    Special Challenges for IT Service Providers
  The Perfect Picture of Hybrid IT Management
    For IT End Users
    For IT Departments
    For IT Service Providers
  About this Book

Chapter 2: Traditional IT Monitoring, and Why It No Longer Works
  How You're Probably Monitoring Today
    Standalone, Technology-Specific Tools
    Local Visibility
    Technology Focus, Not User Focus
  Problems with Traditional Monitoring Techniques
    Too Many Tools
    Fragmented Visibility into Deep Application Stacks
    Disjointed Troubleshooting Efforts
    Difficulty Defining User-Focused SLAs
    No Budget Perspective
  Evolving Your Monitoring Focus
    The End User Experience
    The Budget Angle
  Traditional Monitoring: Inappropriate for Hybrid IT
    It's Your Business, So It's Your Problem
    Provider SLAs Aren't a Business Insurance Policy
    Concerns with Pay As You Go in the Cloud
  Evolving Monitoring for Hybrid IT
    Focusing on the EUE
    Monitoring the Application Stack
    Keeping an Eye on the Budget
  Coming Up Next

Chapter 3: The Customer Is King: Monitoring the End User Experience
  Why the EUE Matters
    Business-Level Metric
    Tied to User Perceptions
  Challenges as You Evolve to Hybrid IT
    Geographic Distribution
    Deep, Distributed Application Stacks
  Techniques for Monitoring the EUE
    Platform-Level APIs
    Data from Providers
    Distributed Monitoring Agents
    Click-to-Click Monitoring
  Why We Often Don't Monitor EUE Today
    Complexity
    Lack of Tools
    Cost
    Component-Level Monitoring Can Be Close Enough
  Why We Must Monitor EUE Going Forward
    Vastly More Complex Environments
    Business- and Perception-Level Focus
    Too Much Is Out of Your Control
  The Provider Perspective: You Want Your Customers Measuring the EUE
    The Provider Isn't 100% Responsible for Performance
    You Gain a Competitive Advantage
  Coming Up Next

Chapter 4: Success Is in the Details: Monitoring at the Component Level
  Traditional, Multi-Tool Monitoring
    Client Layer
    Network Layer
    Application and Database Layers
    Other Concerns
  Multi-Discipline Monitoring and Troubleshooting
    Applications Are Not the Sum of Their Parts
    Tossing Problems Over the Fence: Troubleshooting Challenges
  Integrated, Bottom-Up Monitoring
    Monitoring Performance Across the Entire Stack
    Integrated Troubleshooting Saves Time and Effort
    The Provider Perspective: Providing Details on Your Stack
  Coming Up Next

Chapter 5: The Capabilities You Need to Monitor IT from the Data Center into the Cloud
  Business Goals for Evolved Monitoring
    EUE and SLAs
    Budget Control
  Technology Goals for Evolved Monitoring
    Centralized, Bottom-Up Monitoring
    Improved Troubleshooting
  A Shopping List for Evolved Monitoring
    High-Level Consoles
    Domain-Specific Drilldown
    Performance Thresholds
    Broad Technology Support: Virtualization, Applications, Servers, Databases, and Networks
    End-User Response Monitoring
    SLA Reporting
    Public Cloud Support: IaaS, PaaS, SaaS
    The Provider Perspective: Capabilities for Your Customers
  Coming Up Next

Chapter 6: IT Health: Management Reporting as a Service
  The Value of Management Reporting
    Business Value
    Technology Value
  Reporting Elements
    Performance Reports
    SLA Reports
    Dashboards
    The Provider Perspective: Reports for Your Customers
  Conclusion

Copyright Statement

© 2010 Realtime Publishers. All rights reserved. This site contains materials that have been created, developed, or commissioned by, and published with the permission of, Realtime Publishers (the "Materials") and this site and any such Materials are protected by international copyright and trademark laws.

THE MATERIALS ARE PROVIDED "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE, TITLE AND NON-INFRINGEMENT. The Materials are subject to change without notice and do not represent a commitment on the part of Realtime Publishers or its web site sponsors. In no event shall Realtime Publishers or its web site sponsors be held liable for technical or editorial errors or omissions contained in the Materials, including without limitation, for any direct, indirect, incidental, special, exemplary or consequential damages whatsoever resulting from the use of any information contained in the Materials.

The Materials (including but not limited to the text, images, audio, and/or video) may not be copied, reproduced, republished, uploaded, posted, transmitted, or distributed in any way, in whole or in part, except that one copy may be downloaded for your personal, noncommercial use on a single computer. In connection with such use, you may not modify or obscure any copyright or other proprietary notice.

The Materials may contain trademarks, service marks and logos that are the property of third parties. You are not permitted to use these trademarks, service marks or logos without prior written consent of such third parties. Realtime Publishers and the Realtime Publishers logo are registered in the US Patent & Trademark Office. All other product or service names are the property of their respective owners.

If you have any questions about these terms, or if you would like information about licensing materials from Realtime Publishers, please contact us via e-mail at [email protected].

Chapter 1: Evolving IT: Data Centers, Virtual Environments, and the Cloud

In the beginning, data centers were giant buildings housing a single, vacuum-tube-driven computer, tended by people in white lab coats whose main job was changing the tubes as they burned out. Today's data centers are so much more complicated that it's like a completely different industry: We not only have dozens or hundreds or even thousands of servers to worry about, but now we're starting to outsource specific services like e-mail, spam filtering, or customer relationship management (CRM) to Web-based companies selling Software as a Service (SaaS) in the cloud. How do we manage it all to ensure that all of our IT assets are delivering the performance and service that our businesses need?

Evolving IT

Every decade or so, the IT industry pokes its toes into the waters of a new way of computing. I'm not talking specifically about the revolving thin client/thick client computing model that comes and goes every few years; I'm talking about major paradigm shifts that take place because of radical new technologies and concepts: shifts that permanently change the way we do business. In some cases, these shifts can resemble past IT techniques and concepts, although there are always crucial differences as we move forward. This is how IT evolves from one state to another, and it's often difficult and complex for the human beings in IT to keep up.

Remember When IT Was Easy?

I started in IT almost two decades ago; that's several lifetimes in technology years. When I started, we had relatively simple lives: my first IT department didn't even have a local area network (LAN). Instead, our standalone computers connected directly to an AS/400 located in the data center, and that was really our only server. IT was incredibly easy back then: Everything took place on the mainframe. We didn't worry about imaging our client computers because we ultimately didn't care very much about them.

Security was simple because all our resources were located on one big machine, and the only connections to it were basically video screen and keyboard feeds. Monitoring performance was incredibly straightforward: We called up an AS/400 screen (I think the command was WRKJOB, for "work with jobs") and looked at every single IT process we had in a single place. We could bump the priority on important jobs or depress the priority on a long-running job that was consuming too many cycles. Ah, nostalgia.

Distributed Computing: Flexible, But Tough to Manage

We soon made the move into distributed computing. Soon, we had dozens of Novell NetWare servers and Windows NT servers in our expanding data center. Our computers were connected by blazing-fast Token Ring networks. We shifted mail off our AS/400 onto an Exchange Server. For the first time, our IT processes were starting to live on more and more independent machines, and as for monitoring them, well, we didn't actually monitor them. If things were a bit slow, there wasn't much we could do about it. I mean, the network was 16Mbps and the processors were Pentiums. Slow was kind of expected. And, at the time, the best performance tool we had was Windows' own Performance Monitor, which wasn't exactly a high-level tool for managing anything like Service Level Agreements (SLAs). Our basic SLA was, "If it breaks, yell a lot and we'll get right on it. We have a pager."

That's the same basic computing model that we all use today: bunches of servers in the data center, connected by networks (100Mbps or better Ethernet, thankfully, rather than Token Ring), and client computers that we have to spend a significant amount of time managing. Gone are the days of applications that ran entirely on the mainframe; now we have multi-tier applications that run on our clients, on mid-tier servers, and in back-end databases. Even our thin-client Web apps are often multi-tier, with Web servers, application servers, and database servers participating.

We're also more sophisticated about management. Companies today can use tools that monitor each and every aspect of a service. For example, some tools can be taught to recognize the various components (middle tier, back end, and so forth) that comprise a given application. As Figure 1.1 shows, they can monitor each aspect of the application and let us know when one or more elements are impacting delivery of that application's services to our end users.

Figure 1.1: Monitoring elements in an application or service.

We've mastered distributed computing, and we have the means to monitor and manage the distributed elements quite effectively. To be sure, not every company employs these methods, but they're certainly available. So what's next?

Super Distributed Computing: Massively Flexible, Impossible to Manage?

The common theme behind all of today's distributed elements is that they live in our data centers. Location, however, isn't as important as what our own data centers provide us: absolute control. For every server in our data center, we're free to install management agents, monitor network traffic, and even stick thermal sensors into our servers if we want to. They're our machines, and we can do anything with them that, from a corporate perspective, we want to.

But we're starting to move outside of our own data centers. What marketing folks like to broadly call "the cloud" is offering a variety of services that live in someone else's data center. For example (and to define a few terms), we can now choose from:

- Hosted services, such as hosted Exchange Server or hosted SharePoint Server. In most cases, these are the same technologies we could host in our own data center, but we've chosen to let someone else invest in the infrastructure and bear the headache of things like patching, maintenance, and backups.
- Software as a Service, or SaaS, such as the popular SalesForce.com or Google Apps. Here, we're paying for access to software, typically Web-based, that runs in someone else's data center. We have no clue how many servers are sitting behind the application, and we don't care; we're just paying for access to the application and the services it provides. Typically, these are applications that aren't available for hosting within our own data center, even if we wanted to, although they compete with on-premises solutions that provide the same kinds of services.
- Cloud computing, which, from a strict viewpoint, doesn't include either of the previous two models. Cloud computing is a real computing platform where we install our own applications, often with our own data in a back-end database, to be run on someone else's computers. Cloud computing is designed to offer an elastic computing environment, where more computing resources can be engaged to run our application based on our demand. Cloud apps are more easily distributed geographically, too, making them more readily available to users all over the world.

All of these services are provided to us by a company of some kind, which we might variously call a hosting provider or even a managed service provider (MSP). Ultimately, this is still the same distributed computing model we've known and loved for a decade or more. We're just moving some elements out of our own direct control and into someone else's, and often using the Internet as an extension to our own private networks. This new model is increasingly being referred to as hybrid IT, meaning a hybridization of traditional distributed computing with this new, super distributed model that includes outsourced services as a core part of our IT portfolio.

But there's the key phrase: out of our own direct control. Without control over the servers running these outsourced services, how can we manage them? We can't exactly install our own management agents on someone else's computers, can we? And for that matter, do we really need to monitor performance of these outsourced services? After all, isn't that what the providers' SLAs are for: ensuring that we get the performance we need? These are all questions we need to consider very carefully, and that's exactly what we'll be doing in this chapter and throughout the rest of this book.

Three Perspectives in IT

The world of hybrid IT consists of three major viewpoints: the IT end user, or the person who is the ultimate consumer of whatever technology services your company has; the IT department, tasked with implementing and maintaining those services on behalf of the end user; and the IT service provider, which is the external company that provides some of your IT services to you. It's important to understand the goals and priorities of each of these viewpoints, because as you move more toward a hybrid IT model, you'll find that some of those priorities tend to shift around and change in importance.

The IT End User

The IT end user, ultimately, cares about getting their job done. They're the ones on the phone telling their customers, "sorry, the computer is really slow today." They're the ones who don't ultimately care about the technology very much, except as a means of accomplishing their jobs.

Here's something important: The IT end user has the most important perspective in the entire world of business technology, because without the end user's need to accomplish their job, nobody in IT has a job. I'm going to make that statement a manifesto for this book. In fact, I'll introduce you to an IT end user (whose name and company name have been changed for this book) that I've interviewed. You'll meet him a few times throughout this book, and I'll follow his progress as his IT department shifts him over to using hybridized IT services.

Ernesto is an inside sales manager for World Coffee, a gourmet coffee wholesaler. Ernesto's job is to keep coffee products flowing to the various independent outlets who resell his company's products. Like most users, Ernesto consumes some basic IT services, including file storage, e-mail, and so on. He also interacts with a CRM application that his company owns, and he uses an in-house order management application. Ernesto works on a team of more than 600 salespeople that are distributed across the globe: His company sells products to resellers in 46 countries and has sales offices in 12 of those countries.

Ernesto's biggest concerns are the speed of the CRM and order management applications. He literally spends three-quarters of his day using these applications, and much of his time today is spent waiting on them to process his input and serve up the next data entry screen. His own, admittedly informal, measurements suggest that about one-third of that time (just under 2 hours a day) is spent waiting on the computer. He knows exactly how much he generates in sales every hour, and since he's paid mainly on commission, he knows that those 9 hours a week (almost a quarter of his work time) are costing him dearly. He complains and complains to his IT department, as does everyone else, but feels that it's mostly falling on deaf ears. The IT guys don't seem to be able to make things go any faster. There's talk now of outsourcing some of the applications Ernesto uses, such as the CRM application. Ernesto just hopes it doesn't run any slower; he can't afford it.

If you work in IT, you know how common a scenario that is. Nothing's ever fast enough for our users, and it can be incredibly difficult to nail down the exact cause of performance problems, so we tend to file them all in the "like to help you, but can't, really" folder and go on with our other projects. It'll be interesting to see this same situation from the perspective of Ernesto's IT department.

The IT Department

The IT department, on paper, cares about supporting their users. You and I both know, however, that what IT really cares about is technology. Tell us that e-mail is slow and we're less interested, because that's a big, broad topic. We need to narrow it down to "the network is experiencing high packet loss" or "the server's processor is at 90% utilization all the time" before we can start to solve the problem. We think in those technical terms because we're paid to; interfacing with end users who typically can't provide anything more definitive than "it seems slower than yesterday" can be challenging.

And so we create SLAs. Typically, however, those SLAs are not performance-based but rather availability-based. We promise to provide 99% uptime for the messaging application, and to respond within 2 hours and correct the problem within 8 hours when the application goes down. That means we can have up to 87.6 hours of downtime per year (more than two full work weeks) and still meet our SLA! 99% sounded good, though, and hopefully nobody will think to do the math.

But we still haven't addressed slow messaging performance, because it's difficult to measure. What do we measure? How long it takes to send a message? How long it takes to open a message? What's good performance: 5 seconds to open a message? Honestly, if you've ever had to actually wait that long, you were already drumming your fingers on the mouse. A second? That seems like a tough goal to hit. And how do you even measure that? Go to a user's computer, click Open, and start counting: "one one thousand, two one thousand, three one thou... oh, there, it's open. That's about two and a half seconds."
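The downtime arithmetic behind an availability percentage is worth making explicit, because the numbers surprise people. Here is a minimal Python sketch; the helper name is my own, for illustration only:

```python
# Convert an availability-percentage SLA into allowed downtime.
HOURS_PER_YEAR = 365 * 24  # 8,760 hours

def allowed_downtime_hours(availability_pct, period_hours=HOURS_PER_YEAR):
    """Hours of downtime permitted while still meeting the SLA."""
    return period_hours * (1 - availability_pct / 100)

for sla in (99.0, 99.9, 99.97):
    print(f"{sla}% uptime allows {allowed_downtime_hours(sla):.1f} hours/year of downtime")
```

At 99%, the allowance works out to 87.6 hours a year, which is where the "more than two full work weeks" figure comes from; tightening the target to 99.97% shrinks it to under 3 hours.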

Instead, we tend to measure performance in terms of technical things that we can accurately touch and measure: network consumption, processor utilization, memory utilization, internal message queue lengths, and so on. Nothing the end user cares about, and nothing we can really map to an end-user expectation (how does a longer message queue or higher processor consumption impact the time it takes to open a message?), but they're things we can see and take action on, if necessary.

John works for World Coffee's IT department, and is in charge of several important applications that the company relies upon, including the CRM application and the in-house order management application. John has set up extensive monitoring to help manage the IT department's SLAs for these applications. They've been able to maintain 99.97% availability for both applications, a fact John is justifiably proud of. The monitoring includes several front-end application servers, some middle-tier servers, and a couple of large back-end databases, one of which replicates data to two other database servers in other cities. John primarily monitors key metrics for each server, such as processor and memory utilization, and he monitors response times for database transactions. He also has to monitor replication latency between the three database servers. Generally speaking, all of those performance numbers look good. As an end-point metric, he also monitors network utilization between the front-end servers and the client applications on the network. He doesn't panic until that utilization starts to hit 80% or so, which it rarely does. When it does, he's automatically alerted by the monitoring solution, so he feels like he has a pretty good handle on performance. The company's users complain about performance, of course, but the client application has always run fine on John's own client computer, so he figures the users are just being users.

The company plans to start moving the CRM application to an outsourced vendor, probably using a SaaS solution. They also plan to move the in-house order management application onto a cloud computing platform, which should make it easier to access from around the world and help ensure that there are always computing resources available to the application as the company grows. John is relieved, because it'll mean all this performance management stuff will be out of his hands. He just needs to make sure they get a good SLA from the hosting providers, and he can sit back and relax at last.
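The fixed-threshold alerting John relies on can be sketched in a few lines. This is an illustrative Python sketch, not code from any particular monitoring product; the function and metric names are my own:

```python
# Fixed-threshold alerting: flag only metrics at or above their configured line.
def breached(metrics, thresholds):
    """Return the names of metrics whose current value meets or exceeds its threshold."""
    return [name for name, value in metrics.items()
            if value >= thresholds.get(name, float("inf"))]

# One sampling interval: only network utilization is over its 80% line.
sample = {"cpu_pct": 62.0, "memory_pct": 71.0, "network_pct": 83.0}
print(breached(sample, {"network_pct": 80.0}))  # prints ['network_pct']
```

Note what this model cannot see: every metric can sit comfortably under its threshold while end users still wait seconds for every screen, which is exactly the blind spot in John's setup.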

The IT Service Provider

As we start to move to a world of hybridized IT, it's also important to consider the perspective of the IT service provider: the company responsible for whatever IT services are being hosted in the cloud. These folks have a unique perspective: In one way, they're like an IT department, because they have to manage a data center, monitor performance, patch servers, and do everything else a standard IT department would do. Their customers, however, aren't internal users employed by the same company. They don't get to use the term "customers" in the same touchy-feely, but ultimately meaningless, way that standard IT departments do. An IT service provider's customers are real customers, who pay real money for services, and when someone pays money for something, they expect a level of service to be met. So service providers' SLAs are much more serious, legally binding contracts that often come with real, financial penalties if they're not met.

Service providers are also in the unusual position of having to expose some of their IT infrastructure to their customers. In a normal IT department, the end users (or customers, if you like) don't usually care about technology metrics. End users don't care about processor utilization, and might not even know what a good utilization figure is. With a service provider, however, the customer is an IT department, and they know exactly what some of those technology metrics mean, and they may want to know what they are from moment to moment. At the very least, a service provider's customers want to see metrics that correspond to the provider's SLA, such as metrics related to uptime, bandwidth used, and so forth.

Li works for New Earth Services, a cloud computing provider. Li is in charge of their network infrastructure and computing platform, and is working with World Coffee, who plans to shift their existing Web services-based order management application onto New Earth's cloud computing platform. Li knows that he'll have to provide statistics to World Coffee's IT department regarding New Earth's platform availability, because that availability is guaranteed in the SLA between the two companies. However, Li is worried because he knows most of World Coffee's end users already think their order management application is slow. He knows that, once the application is in the cloud, those "slow" complaints will start coming across his desk. He needs to be able to prove that his infrastructure and platform are performing well so that World Coffee can't pin the blame for slowness on him. He knows, too, that he needs to be able to provide that proof in some regular, automated way so that World Coffee has something they can look at on their own to see that the New Earth platform is running efficiently. He knows his customers aren't asking for that kind of detail yet, but he knows they will be, and he doesn't yet know how he's going to provide it.

IT Concerns and Expectations

With those three perspectives in mind, let's look at some of the specific concerns and expectations that each of those three audiences tends to have. This is a way of summarizing and formalizing the most important points from each perspective so that we can start to think of ways to meet each specific expectation and to address each specific concern. Think of these as our checklists for a more evolved, hybrid IT computing model.

IT End Users

As I stated previously, IT end users ultimately care about getting their jobs done. That means:

- They expect their applications to respond more or less immediately. They may accept slower responses, but the expectation is that everything they need comes up pretty much instantly.
- They expect applications to be available and stable pretty much all the time. This is often referred to as "dial tone" availability, because one of the most reliable consumer services of the past century was the dial tone from your landline telephone, which typically worked even if your home's power was out.

And you know what? That's about it. Users don't tend to have complex expectations; they just want everything to be immediate, all the time. That may not always be reasonable, but it's certainly straightforward. IT departments, as a rule, have never done much to manage this expectation, which is why many end users have a poor perception of their IT department. IT has, in fact, found it very difficult to even formally define any alternate expectations that they could present to their users.

IT Departments

IT departments tend to have availability as their first concern. Performance is important, but it's often somewhat secondary to just making certain a particular service is up and running at all. In fact, one of the main reasons we monitor performance at all is because certain performance trends allow us to catch a service before it goes down: not necessarily before it becomes unacceptably slow, but before it becomes completely unavailable. We also tend to monitor technology directly, meaning we're looking at processor utilization, network utilization, and so on. So you can summarize the IT department's concerns and expectations as follows:

- They want to be able to manage technology-level metrics, such as resource utilization, across servers.
- They want to be able to map raw performance data to thresholds that indicate the health of a particular service, such as knowing that 75% processor utilization on a messaging server really means "still working, but approaching a bad situation."

- They want to be able to track performance data and develop trends that help predict growth.
- They want to be able to track and manage uptime and other metrics so that they can comply with, and report on their compliance with, internal SLAs.
- They typically want to be able to track all the low-level metrics associated with a service. For example, messaging may depend on a server, the underlying network, and infrastructure services such as a directory, name resolution, and so on, as well as infrastructure components such as routers, switches, and firewalls.

IT departments, in other words, are end-to-end data fiends. A good IT department (in today's world, at least) wants to be able to track detailed performance numbers on each and every element of their data center, right out to the network cards in client computers, although they typically stop short of trying to track any kind of performance on client computers. The theory is that if everything inside the data center is running acceptably, then any lack of performance at the client computer is the client computer's fault.

IT Service Providers

IT service providers have, as I've stated already, a kind of hybrid perspective. They need to have the same concerns as any IT department, but they have additional concerns because their customers (other IT departments) are technically savvy and spending real money for the services being provided. So in addition to the concerns of an IT department, a service provider has these concerns and expectations:

- They need to be able to provide performance and health information about their infrastructure to their customers. In many cases, slow performance at the customer end may be due to elements on the customer's network, which is out of the service provider's control. Service providers need to be able to quantify the performance of their infrastructure so that they can defend themselves against performance accusations from their customers.
- They need to be able to prove, in a legally defensible fashion, their compliance with the SLAs between themselves and their customers.
- They need to be able to communicate certain health and performance information to their customers so that customers have some visibility into what they're paying for.

It's actually kind of unfair to service providers, in a way. Most IT departments would never be expected to provide, to their own end users, the kind of metrics that the IT department expects from their own service providers.
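The IT department checklist item about mapping raw performance data to health thresholds (reading 75% processor utilization on a messaging server as "still working, but approaching a bad situation") amounts to a simple banded mapping. The band boundaries in this Python sketch are my own illustrative choices, not figures from the book:

```python
# Map a raw CPU-utilization reading to a coarse service-health state.
# The 60/85 boundaries are illustrative assumptions, not recommended values.
def health_state(cpu_pct):
    if cpu_pct < 60:
        return "healthy"
    if cpu_pct < 85:
        return "warning"   # still working, but approaching a bad situation
    return "critical"

print(health_state(75))  # prints warning
```

The point of such a mapping is to turn a number only an administrator understands into a state anyone can act on, which is the first small step from technology-level metrics toward service-level reporting.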

18 Business Drivers for the Hybrid, Super Distributed IT Environment Let s shift gears for a moment. So far, we ve talked mainly about the perspectives and expectations of various IT centric audiences. As we move into a hybrid IT environment, with some services hosted in our own data center and others outsourced to various providers, meeting those expectations can become increasingly complex and difficult. But what does the business get out of it? IT concerns aside, why are businesses driving us toward a hybrid IT model? I can assure you that if the business didn t have some vested interest in it, we wouldn t be doing it; outsourcing services is never completely free, so the business has to have some kind of ulterior motive. What is it? Increased Flexibility Flexibility is one of the big drivers. Let me offer you a story from my own experience from around 2000, when the Internet was certainly big but nothing called cloud computing was really in anyone s mind. Craftopia.com (now a part of Home Shopping Network) was a small arts and crafts e tailer based in the suburbs of Philadelphia, PA. The company s brand new infrastructure consisted of two Web servers and a database server (which, in a pinch, could be a third Web server for the company s site), hosted in an America Online owned data center (with tons of available bandwidth). The company generally saw fewer than 1,000 simultaneous hits on the Web site, and their small server farm was more than up to that task. One day, the IT department all four people, including the CTO was informed that the company was up for a feature segment on the Oprah television show. Everyone gulped because they knew Oprah could generate the kinds of hits that would melt their little server farm, even though that level of traffic would likely only last for a few days or even hours. 
If the servers could manage to stay up, they might pull in a lot of extra sales, but not enough to justify adding the dozen or so servers needed to meet the demand, especially since that surge in demand would be so short. The servers didn't manage to stay up: It was a constant battle to restart them after they'd crash, and it made for a long few days. Had all this taken place in 2010, the company could simply have put its Web site onto a cloud computing platform. The purpose of those platforms is to offer near-infinite, on-demand expansion, with no up-front infrastructure investment. You simply pay as you go. Sure, the Oprah Surge would have cost more, but it would presumably have resulted in a compensating increase in sales, too. Once the Surge was over, the company would simply be paying less for their site hosting, since the site would be consuming fewer resources again. There wouldn't be any extra servers sitting around idle, because from the company's perspective, there weren't any servers at all. It was all just a big cloud.

That's the exact argument for cloud computing, as well as for SaaS and even hosted services: Expand as much as you need without having to invest in any infrastructure. If you've grown just beyond one server, you don't have to buy another whole server (which will sit around mostly idle) just to add a tiny bit of extra capacity. The hosting provider takes care of it, adding just what you need.

Faster Time to Market

Today's businesses need to move faster and faster, all the time. It used to be that taking a year or more to bring a new product or service to market was fast enough; today, product life cycles move in weeks and months. If you need to spin up a new CRM application in order to provide better customer service, you need it now, not in 8 months. With hosted services and SaaS, you can have new services and capabilities in minutes. After all, the provider has already created the application and supporting infrastructure; you just need to pay and start using it. This additional flexibility (the ability to add new services to your company's toolset with practically zero capital investment and zero notice) is proving invaluable to many companies. They no longer have to figure out how their already-overburdened IT department will find the time to deploy a new solution; they simply provide a purchase order and turn on the new solution as easily as flipping a light switch.

Pay As You Go

Massive capital investment is something that companies have long associated with IT projects. Roll out a major solution like a new Enterprise Resource Planning (ERP) or CRM application, and you're looking at new servers, new network components, new software licenses, and more. It's an expensive proposition, and in many cases, you're investing in capacity that you won't be using immediately. In fact, a quick survey of some of my industry contacts suggests that most data centers use about 40 to 50% of their total server capacity.
That means companies are paying fully double what they need, simply because we all know you have to leave a little extra room for growth. You want Exchange Server, and you have 500 users, but think you'll have 1,500 within 3 years? Well, then you spend for 1,500. That's why the pay-as-you-go model offered by service providers is so attractive. If you need 500 mailboxes today, you pay for 500. When you need 501, you pay for 501. It's possible that what you eventually pay for all 1,500 will cost more than if you were hosting the service in your own data center, but the point is that you didn't have to pay for all 1,500 all along. If you were wrong about your growth, and only needed 1,000 mailboxes, then you're not paying for the excess one-third capacity. Pay as you go means you don't have to plan as much, or as accurately, and you're less likely to pay a surcharge for overestimating. Pay as you go lets you get started quickly, with less up-front investment.
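The arithmetic behind that mailbox example can be sketched in a few lines. The per-mailbox price and the growth curve below are entirely hypothetical; the point is only the shape of the comparison, not any real provider's pricing:

```python
PRICE_PER_MAILBOX_YEAR = 30  # hypothetical price; not from any real provider

def provisioned_cost(capacity, years):
    # Traditional model: pay for the full projected capacity every year,
    # whether or not the users ever materialize.
    return capacity * PRICE_PER_MAILBOX_YEAR * years

def pay_as_you_go_cost(mailboxes_per_year):
    # Hosted model: pay only for the mailboxes actually in use each year.
    return sum(n * PRICE_PER_MAILBOX_YEAR for n in mailboxes_per_year)

# You planned for 1,500 mailboxes over 3 years, but growth stopped at 1,000.
traditional = provisioned_cost(1500, 3)        # 135000: paid for idle capacity
hosted = pay_as_you_go_cost([500, 750, 1000])  # 67500: paid for actual use
print(traditional - hosted)                    # 67500 spent on empty headroom
```

If the per-mailbox hosted price were higher than the in-house cost, the gap would shrink, which is exactly the "it's possible you'll eventually pay more" caveat above; the model still wins on avoided up-front risk.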

Business Goals and Challenges for the Hybrid IT Environment

If there are business-level drivers for hybrid IT, there are certainly business-level challenges to go with them. Remember, we're talking about the business here, rather than specific IT concerns. These are the things that a business executive will be concerned with.

Centralizing Management Information

One major concern is where management information will come from. Today, many businesses are already getting IT management information from too many separate places and tools. Managers are often forced to look at one set of reports for Microsoft portions of the environment, for example, and a separate set for the Unix- or Linux-based portions. When some services move out of the data center and into the cloud, the problem becomes even more complex. In some cases, there's a concern about whether management information will even be available for the outsourced services; at the very least, there's an expectation that the outsourced services will be yet another set of reports. What kind of management reports are we talking about? Availability, at one level, which is a high-level metric but is still important to know. Managers need to know that they're getting what they paid for, and that includes the availability of in-house services as well as outsourced ones. At another level, consumption is important. Some companies may need to allocate service costs, whether internal or external, across business units or divisions. In other cases, managers need to see consumption levels in order to plan for growth and the accompanying expenses. A definite business goal is to get all this information in one place, regardless of whether a particular service is hosted in-house or on a provider's network.

Redefining Service Level

Businesses really need to redefine their top-level SLAs.
Rather than worrying so much about uptime (which seems to be the primary focus of most of today's SLAs), businesses should manage to the end-user experience (EUE). In other words, regardless of the service's basic availability, how is it performing from the perspective of the end user? If end users are spending half their time waiting for a computer to respond, the company is potentially wasting a lot of money on that service, regardless of where it's hosted. This sounds complicated, but that's only because I (and probably you, since you're an IT person) tend to start thinking about the underlying technology. Do we start guaranteeing a transaction processing time in the database? Do we guarantee a certain network bandwidth availability? Nope. We guarantee a specific EUE. For example: "When you search for a customer by name, you will receive a first page of results within 3 seconds." You need to identify key tasks or transactions as seen from the end-user perspective, and write an SLA that sets a goal for a specific time to complete that task or transaction from the end user's perspective.
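A task-level SLA like that customer-search example can be expressed directly as data and checked in code. This is a minimal sketch, assuming hypothetical task names, targets, and measurements; real EUE monitoring products do far more:

```python
# Hypothetical EUE-based SLA: target completion time per key user task, in seconds.
SLA_TARGETS = {
    "search_customer_by_name": 3.0,
    "process_new_order": 5.0,
}

def sla_violations(measurements):
    """Given {task: [measured seconds, ...]}, return tasks whose
    average completion time exceeds the SLA target."""
    violations = {}
    for task, samples in measurements.items():
        average = sum(samples) / len(samples)
        target = SLA_TARGETS.get(task)
        if target is not None and average > target:
            violations[task] = round(average, 2)
    return violations

measured = {
    "search_customer_by_name": [2.1, 2.8, 4.9],  # avg 3.27s: misses the 3s goal
    "process_new_order": [3.2, 4.1],             # avg 3.65s: within the 5s goal
}
print(sla_violations(measured))  # {'search_customer_by_name': 3.27}
```

Notice that nothing here mentions servers, bandwidth, or databases; the SLA is written purely in terms the end user can see.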

If you're not able to meet that SLA, then you dive into technology metrics like network bandwidth, processor utilization, and database response times; the end metric that you drive to is what the end user actually experiences on their desktop. That may sound impossible to even measure, let alone guarantee, but as you move into a hybrid IT environment, it's absolutely essential, and most of the rest of this book will talk about how you'll actually achieve it.

Gaining Insight

IT departments (and thus, the business) have the option to get as much insight as they need into their existing data centers. That is, plenty of tools and techniques exist, although not every business chooses to utilize them. Going forward, businesses are going to have to have deep insight into their technology assets, because that insight is going to be the only way to achieve the EUE-based SLA that businesses need to establish. Hybrid IT makes this vastly more complicated to achieve. If your EUE isn't where you want it to be with a cloud-based application, where do you start looking for the problem? Is it the Internet connection between your Taiwan office and the Paris-based cloud provider data center? Is it processing time within your cloud-based application? Is it response time between the cloud-based application server and the cloud-based back-end database? Or is it network latency within the Taiwan office itself? The term hybrid IT is an apt one because you're never truly outsourcing the entire set of elements that comprise an IT service: Some elements will always remain under your control, while other elements, like the public Internet, may be out of the control of both you and your service provider. You're going to need tools that can give you insight into every aspect so that you can spot the problem and either solve it or adjust your EUE expectations accordingly.

Maintaining Responsibility

Here's another major business challenge in hybrid IT: It's still your business.
Let's consider a brief section from Amazon's EC2 SLA (you can read the entire thing at sla/): "AWS will use commercially reasonable efforts to make Amazon EC2 available with an Annual Uptime Percentage (defined below) of at least 99.95% during the Service Year. In the event Amazon EC2 does not meet the Annual Uptime Percentage commitment, you will be eligible to receive a Service Credit as described below." They define a year as 365 days of 24-hour days, meaning the service can be unavailable for up to about 5 hours a year. However, if they don't meet that SLA, you're only entitled to a service credit: "If the Annual Uptime Percentage for a customer drops below 99.95% for the Service Year, that customer is eligible to receive a Service Credit equal to 10% of their bill (excluding one-time payments made for Reserved Instances) for the Eligible Credit Period. To file a claim, a customer does not have to wait 365 days from the day they started using the service or 365 days from their last successful claim. A customer can file a claim any time their Annual Uptime Percentage over the trailing 365 days drops below 99.95%."
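The downtime arithmetic behind that 99.95% figure is easy to verify. A quick sketch, using a hypothetical monthly bill for the credit calculation:

```python
def max_downtime_hours(uptime_percent, hours_in_year=365 * 24):
    # Allowed downtime is whatever fraction of the year falls
    # outside the committed uptime percentage.
    return hours_in_year * (1 - uptime_percent / 100)

# 99.95% of a 365-day year leaves roughly 4.4 hours of permitted downtime,
# which rounds up to the "about 5 hours" figure used in the text.
print(round(max_downtime_hours(99.95), 2))  # 4.38

# And the remedy for missing the commitment is only a 10% service credit:
monthly_bill = 2000            # hypothetical provider fee
print(monthly_bill * 0.10)     # 200.0 credit, regardless of business impact
```

Compare that credit against the revenue a mission-critical application can lose in a single hour of outage, and the gap between the SLA's remedy and the business's actual exposure becomes obvious.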

That means you can't even file a claim until you've had more than 5 hours of outage in a 365-day period. If you do file a claim, you're eligible to receive a credit (not a refund) of up to 10% of your bill. I'm not trying to pick on Amazon.com, either, because most service providers in this part of the industry have very similar SLAs. My point, rather, is that the SLA is not protecting your business. If you have a mission-critical application hosted by a service provider, and that application goes down, your business is impacted. You're losing money, potentially tens of thousands of dollars an hour, depending on what service is impacted. The SLA is never going to pay for that damage; at best, it's going to refund or credit a portion of your provider fees, and that's all. The moral is that you need to remain responsible for your entire business, and all the services you rely upon to operate that business. You may outsource a service, but you can't outsource responsibility for it. You need to have insight into its performance levels and availability, and you need to be able to engage your service provider when things start to look bad, not when they go completely awful or offline. This can be a major challenge with some of today's service providers, and most of them know it, and are struggling to provide better metrics to those customers who demand it. You need to be one of those customers who demands it, because it's your business that's on the line.

Special Challenges for IT Service Providers

If you're an IT service provider with a hosted service, SaaS offering, or even a cloud computing platform, then you know the difficult situation that you're in. On the one hand, you're an IT department. You have data centers, and you need to manage the performance and availability of those resources that are under your control.
When things slow down or problems arise, you need to be able to quickly troubleshoot the problem by driving directly toward the root-cause element, whether that's a server, a network infrastructure device, a network connection, a software problem, or whatever. On the other hand, you're providing a service to a technically savvy customer, typically another IT department, that's paying for the services you provide. Unlike end users, your customer is accustomed to highly technical metrics, and they're used to having complete control and insight over IT services because those have traditionally been hosted in the customer's own data center. Just because they're moving service elements out of their data center doesn't mean they want to give up all the control they're accustomed to. In fact, smart customers will demand deep metrics so that they can continue to manage an EUE-based SLA. They'll want to know when slowdowns are on their end, or when they can hold you responsible and ask you to work on the problem. Competitively, you want to be the kind of provider that can offer these kinds of metrics and this kind of insight and visibility. As the world of hybrid IT grows more prevalent and more accepted, it will also grow more competitive, and providers that can become a seamless extension of the customer's IT department and data center will be the preferred providers who earn the most money and trust from their customers.

So start thinking: How can you provide your customers with deep metrics in a way that exposes only the metrics related to them, and not information related to other customers? How can you provide this information in a way that will integrate with your customers' existing monitoring tools, so that they can treat your services as a true extension of their own data center rather than as yet another set of graphs and reports that they have to look at?

The Perfect Picture of Hybrid IT Management

Let's talk about what the perfect hybrid IT world might look like. This is the pie-in-the-sky view; for now, let's not worry about what's possible but rather focus on what would be best for the various IT audiences and for the business as a whole. We'll use this perfect picture to drive the discussion throughout this book, looking at whether this picture is achievable, and if so, how we could achieve it. If there are any instances where we realize that this perfect picture isn't yet fully realized, we can at least outline the capabilities and techniques that need to exist in order to make this dream a reality.

For IT End Users

Remember, for end users, getting the job done is the key. And while they sort of naturally expect everything to be instant and always available, we can reset that expectation if we explicitly do so in terms they can understand and relate to. Ernesto has been using the company's newly outsourced applications for several months now, and he's quite satisfied with them. The company has published target response times for key tasks, such as locating a customer record in the CRM application and processing a new order in the order-management application. On Ernesto's computer, and the computers of several of his colleagues across the world, is a small piece of software, an agent, that continually measures the response times of these applications as Ernesto is using them.
The information collected by that agent is, he's told, forwarded to his company's IT department, which compiles the data and publishes the actual response times from across the company as an average. Anytime Ernesto feels that the application is slow, he can visit an intranet Web page and see the actual, measured performance; oftentimes, he realizes, the slowdown is more his impatience at getting a big new order into the system. On a couple of occasions, though, he's noticed the measured response times falling below the published standard, and he's called the help desk. They've always known about the problem before he called, and were able to let him know which bit of the application was slow, and about when he could expect it to return to normal. He hasn't the slightest idea what any of that means, but it feels good to know that the IT department seems to have a handle on things.
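The agent in that scenario is essentially a timer wrapped around key user transactions, with the results rolled up into averages for publication. A toy sketch (the task names and the reporting mechanism are hypothetical, and real EUE products are far more sophisticated):

```python
import time

class ResponseTimeAgent:
    """Records how long key user tasks take, for later roll-up by IT."""

    def __init__(self):
        self.samples = {}  # task name -> list of durations in seconds

    def timed(self, task, func, *args):
        # Wrap a user-visible operation and record its wall-clock duration.
        start = time.perf_counter()
        result = func(*args)
        self.samples.setdefault(task, []).append(time.perf_counter() - start)
        return result

    def averages(self):
        # These per-task averages are what would be forwarded to IT
        # and published on the intranet page described above.
        return {task: sum(d) / len(d) for task, d in self.samples.items()}

agent = ResponseTimeAgent()
agent.timed("locate_customer_record", lambda: sum(range(1000)))
print(agent.averages())  # one tiny average for the wrapped operation
```

As the text notes just below, an installed agent is only one approach; the same measurements can be gathered without touching end-user machines at all.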

There are actually numerous ways to measure the end-user experience, and better ways don't require any kind of agent to be installed on actual end-user computers. That's something we'll explore in later chapters of this book.

For IT Departments

The IT department serves as a human link between the end users and the technologies those users consume. Rather than holding themselves accountable to standards that only they can interpret and understand, however, they're now setting goals that the end users (their customers) can comprehend. Fortunately, they're also able to manage to those goals, even across services that are outsourced. By having the right tools in place, the IT department can treat outsourced services just like any other element of the IT portfolio. John was concerned about setting SLAs based on end-user experience, but because they started with real-world measurements, and used those as the performance baseline, he's found that he no longer has to fend off as many "things are slow" complaints. End users know what kind of performance to expect, and so long as John provides that performance, they're satisfied, if not always delighted. He was especially worried about providing those SLAs for services that were outsourced. However, World Coffee now receives a stream of performance metrics directly from their hosting providers. When things are slow at the end-user computer, John can see exactly where the slowdown is occurring. Sometimes it's in the communication between networks, and John can bug his ISP about their latency. Sometimes it's the communication within the provider network, between database and application server, and John can call their help desk and find out what's going on. They're defining new, performance-based SLAs with the providers, which will help ensure that the provider is engineering their network to keep up with demand.
For IT Service Providers

Service providers want to do a good job for their customers; after all, that's what earns new business, retains business, and grows business relationships. They're discovering that the way to do that is not always by being a black box, but by offering some visibility. After all, customers are betting a portion of their business on the provider, and they deserve a little insight into what they're betting on. Li's work with World Coffee is going well. Thanks to the detailed metrics he's able to provide them, and because they're using a single tool to monitor their entire service portfolio, they tend to call only when there's legitimately a problem on Li's end. Best of all, the same tools that provide customers like World Coffee with data are also providing him with performance information, helping him spot declining performance trends before actual performance starts to near the thresholds that might trigger an SLA violation.

About this Book

This chapter has really been an introduction, with a goal of helping you understand the goals and challenges you face as you evolve your IT services to a hybrid IT model. We've outlined the evolution from today's IT models to the future's super-distributed, hybrid model, and covered some of the key concerns and problems you're likely to face as you move along that path. What we need to cover (and what the rest of this book is about) is how you actually accomplish it. Chapter 2 will dive into the issue of monitoring in some detail. I want to look at how companies monitor their IT environments today, and discuss how they probably should be monitoring those same environments, because we all know that not every environment is doing all it can in terms of service monitoring! But then I'll look specifically at why today's accepted practices really start to fall apart when you move into a hybrid IT model, and explore new goals that we can set for monitoring as our IT environment moves toward that super-distributed model. In Chapter 3, I'll propose a new model for defining SLAs internally. This isn't a radical new model by any stretch, but in the past, it's been impractical to achieve. I want to really lay out what we should be looking for in terms of IT service levels, and look at some of the techniques that we can employ to do so right now, not years in the future. Chapter 4 is an acknowledgement that although the EUE is a great top-level metric, it doesn't actually help us solve performance problems. We still need to be able to dive into performance at a very detailed, very granular component level; but how can you accomplish that in a world where half of your components live on someone else's network and are even abstracted away from the hardware they run on? I'll propose some capabilities for new monitoring tools that can help not only solve the super-distributed challenge but also streamline your everyday troubleshooting processes and procedures.
Chapter 5 is the real-world, nitty-gritty look at what you're going to need to successfully manage a hybrid IT environment, where you've got services hosted in your own data center as well as in someone else's. Although I'm not going to compare and contrast specific vendor tools, I will provide you with a shopping list of capabilities so that you can start engaging vendors and truly evaluating products with an eye toward the value they bring to your environment. Chapter 6 is going to take this idea of hybrid IT monitoring and move it up a level in the organization, proposing the idea that IT health reporting can become just another one of the services you offer to your customers, such as managers. I'll also spend some time covering this topic from the perspective of an IT service provider company, where the customers truly are paying customers, and where management reporting becomes a significant value-add. We've got a lot of ground to cover, but I think this is one of the most important topics that IT faces as we begin hybridizing our IT environments. Sure, issues like security and user access are important, but in the end, we need to be able to ensure that these outsourced IT services can support our businesses. That's what this book is all about.

Chapter 2: Traditional IT Monitoring, and Why It No Longer Works

Those of us in the IT world think we know monitoring. After all, we've been doing it, in various ways and using various tools, for decades. We collect performance data, we look at charts, and, well, that's monitoring. Sadly, that kind of monitoring just doesn't meet today's business needs.

How You're Probably Monitoring Today

IT monitoring has evolved over the past few decades, but that evolution has pretty much consisted of continuing refinements to a basic model. Today's monitoring techniques evolved more out of what was possible and less out of what the business actually needed. Let's take some time to look at the monitoring techniques of today, because we'll want to carefully consider which techniques we need to keep and which ones we should ditch.

Standalone, Technology-Specific Tools

Today, you're probably relying heavily on monitoring tools that are standalone and technology-specific. That is, once you move beyond the collection of basic performance data, you start to move into extremely domain-specific tools that are geared for a particular task. You might, for example, use a tool like SQL Profiler (see Figure 2.1) to capture diagnostic information from a Microsoft SQL Server, or you might use a tool like Network Monitor (see Figure 2.2) to capture network packet information.

Figure 2.1: SQL Profiler.

The problem with these tools is that they are domain-specific, and they require a great deal of domain-specific knowledge. You have to know what you're looking at. Although these tools will always provide us with valuable troubleshooting information, they don't tell us much about the health of an application that runs across multiple technology domains. In fact, in some instances, these domain-specific tools can lead to longer and more convoluted troubleshooting processes.

Figure 2.2: Network Monitor.

For example, when a user complains of a slow application, a database administrator might grab SQL Profiler to see what's hitting the database server. At the same time, a network administrator might start tracing packets in Network Monitor to see if the network looks healthy. Both of them are failing to see the forest for their individual trees, and failing to recognize that the application's health isn't driven entirely by one or the other technology domain.

Local Visibility

Our current monitoring tools are, quite understandably, limited to our local environment. We monitor our servers, our network, our infrastructure components, our software applications. The minute we leave our last firewall, we start to lose our ability to accurately measure and monitor; at best, we can get some response-time statistics from routers and so forth on our Internet service provider (ISP) network, but once we're out of our own environment, our vision becomes severely limited.

Technology Focus, Not User Focus

Even tools that profess to monitor an entire application stack still take a very domain-centric approach. For example, it's not uncommon to have tools that continuously collect performance information from individual servers and network components, compare that performance to pre-determined thresholds, and then display any problems. These tools can be configured to understand which components support a given application, so they can report a problem that is affecting the application's health and help you trace to the root cause of that problem. Figure 2.3 shows an example of how these solutions often present that kind of problem to an administrator.

Figure 2.3: Tracing application problems to a specific component.

These tools, however, don't encourage a user-centric view of the application; they encourage a technology-centric view. They concern themselves with the health and performance of application components, not with the way the end user is currently experiencing that application. These kinds of tools can absolutely be valuable, but only when they can also include the end user's experience at the very top of the application's stack, and when they can do a better job of correlating observed application performance to specific component health or performance.
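The threshold model those tools use can be sketched in a few lines (the component names and threshold values here are hypothetical). Notice that nothing in it says anything about what the user is experiencing; it only reports whether each component crossed its own line:

```python
# Hypothetical per-component thresholds for an application's dependencies.
THRESHOLDS = {
    "web01.cpu_percent": 85,
    "db01.cpu_percent": 90,
    "switch03.latency_ms": 10,
}

def component_alerts(readings):
    """Flag any component metric that exceeds its pre-determined threshold."""
    return [metric for metric, value in readings.items()
            if metric in THRESHOLDS and value > THRESHOLDS[metric]]

readings = {"web01.cpu_percent": 72, "db01.cpu_percent": 93,
            "switch03.latency_ms": 4}
print(component_alerts(readings))  # ['db01.cpu_percent']
# Every component can also read "healthy" while users still suffer:
# this model has no notion of end-to-end response time.
```

That blind spot is exactly the technology-centric view described above.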

Problems with Traditional Monitoring Techniques

In addition to the problems I pointed out already, our traditional monitoring techniques have some severe shortcomings that actually make monitoring and application health maintenance more difficult than it should be.

Too Many Tools

For starters, we simply have too many tools. They deliver too many different kinds of information in too many different ways. There's no way to correlate information between them, and we have to spend a ton of time becoming an expert on every single tool's nuances and tricks. Consider a modern, multi-tier application: It might rely on several servers, a database, network connectivity, and so on. When the application seems slow to the users, you have to reach for a dozen tools to troubleshoot each element of the application stack, and you're still not looking at the application as a whole. We need centralized visibility of the entire application and all its components. We need information presented in a uniform, consistent fashion, and we need it correlated so that we can tell which bit of the application stack is contributing to an observed problem with overall application health.

Fragmented Visibility into Deep Application Stacks

Another problem with our traditional monitoring tools is that they're really not built for today's deep application stacks. Consider what might seem like a fairly straightforward multi-tier application, consisting of:

• Client application
• Middle-tier application server
• Back-end database server

That doesn't include the connectivity components, though, so let's include them:

• Client application
• Network switch
• Network router
• Network switch
• Middle-tier application server
• Network switch
• Back-end database server

Each of those individual elements, however, is really a stack unto itself:

• Client application
  o Java runtime engine
  o Java libraries
  o Operating system (OS)
• Network switch
• Network router
• Network switch
• Middle-tier application server
  o .NET Framework runtime engine
  o .NET Framework classes
  o Database drivers
  o Operating system
  o Memory
  o Processor
• Network switch
• Back-end database server
  o Operating system
  o Database management system
  o Disk subsystem
  o Memory
  o Processor

All these sub-components can have a significant impact on application health, but it's very difficult to get traditional monitoring tools that can see inside all of them. We might be able to get excellent data on the database management system's performance and use of memory and processor resources, while having virtually no idea how the middle-tier application's database drivers are performing. This fragmented visibility makes it tough to find the root cause of problems.
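A monitoring tool that wants to reason about such a stack essentially needs a tree of components whose health rolls up to the application. A toy sketch, with hypothetical names, of how a deep sub-component problem can be surfaced with its full path through the stack:

```python
# Toy model of a nested application stack; all names are hypothetical.
STACK = {
    "order-app": {
        "client": ["java-runtime", "java-libraries", "os"],
        "app-server": [".net-runtime", "db-drivers", "os", "memory", "cpu"],
        "db-server": ["os", "dbms", "disk", "memory", "cpu"],
    }
}

def unhealthy_paths(stack, health):
    """Walk the stack and report the full path to each unhealthy sub-component,
    so a problem deep in the stack stays visible at the application level."""
    paths = []
    for app, tiers in stack.items():
        for tier, parts in tiers.items():
            for part in parts:
                if not health.get(part, True):  # unknown parts assumed healthy
                    paths.append(f"{app}/{tier}/{part}")
    return paths

health = {"db-drivers": False}  # everything else assumed healthy
print(unhealthy_paths(STACK, health))  # ['order-app/app-server/db-drivers']
```

The fragmented-visibility problem is precisely that, in practice, no single tool can populate the `health` map for every layer of that tree.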

Disjointed Troubleshooting Efforts

Domain-specific tools lead to domain-specific troubleshooting. Let's revisit my case study illustration from the previous chapter and see how this domain-specific troubleshooting usually works in the real world: John, the IT specialist at World Coffee, is trying to find the cause of performance problems that Ernesto has reported in the company's order-management application. John initially suspected that the database server was running slowly, and he notified the company's DBA. The DBA, however, said that the individual queries are executing within the expected amount of time, and that the database server's overall performance looks good. John then called a desktop support technician to look at Ernesto's computer. The technician said that everything on the computer seems to be running smoothly; the problem seems isolated to this one application. A software developer that John contacted insists that the application is running fine on his computer. John is now analyzing network packet captures to see whether there's some latency between the server network segment and the client segment that Ernesto's computer is connected to. Sound familiar? This is how most companies deal with application health issues today: A bunch of domain experts jump on their particular application component, tending to look at that component in isolation and tossing the problem over the wall to another specialist when they can't find an obvious problem with their particular component. Application stack tools like the one shown in Figure 2.3 are a good starting point for solving this problem because they help pinpoint the component that isn't performing to specification. But they fail in that they're still looking at each component as a standalone entity, and measuring performance against predetermined thresholds.
It is entirely possible for a server to be within its performance tolerances and yet still be the root cause of a poorly performing application; these tools don't go far enough, in that they don't correlate observed application behavior with component performance.

Difficulty Defining User-Focused SLAs

We don't tend to offer user-centric service level agreements (SLAs) because our monitoring tools don't really let us figure out what good end-user experiences should look like. We can tell you that the database server running at 90% processor utilization isn't good, but we can't tell you exactly how that will manifest in the end-user experience. Simply put, we're too focused on the technology and the components, and not on the application and its end users. As we start moving into hybrid IT and outsourcing some of our applications, we need to focus less on the technology (which isn't going to be in our control anyway) and focus instead on getting what we're paying for, which means focusing on the service that our end users are receiving.

No Budget Perspective

Finally, another growing problem with traditional monitoring tools is that they really don't have any budgetary focus. That's not necessarily a huge deal for in-house applications, but as we start moving into hybrid IT and outsourcing portions or all of an application, we need to know that we're getting what we paid for. When we start moving to pay-as-you-go cloud computing, we need tools that will help correlate application health and use to that pay-as-you-go model so that we can accurately forecast and plan for those cloud computing expenses.

Evolving Your Monitoring Focus

IT is rapidly evolving toward hybridization; every day, companies adopt Software as a Service (SaaS) solutions, outsource specific services to Managed Service Providers (MSPs), and move applications and components into cloud computing platforms. As IT evolves, so must our ability to monitor these assets to ensure they are performing to our needs.

The End-User Experience

The first point of evolution is to focus entirely on the end-user experience (EUE) as your top-level metric. The first and foremost thing you should care about is how quickly your end users are able to perform selected tasks with an application. When a problem occurs in an application, whether it's something in your control, like your local network, or something outside of your control, like a back-end database server in a service provider's data center, that problem will flow up through the application stack, resulting in a problem with the EUE. That should be your indication that there's a problem: when the end user experiences the problem. Does every application problem impact the EUE? No, of course not. A service provider might lose a server but might also have redundancy built in to handle that exact situation. If you don't see a problem in the EUE, you don't have a problem.
With the right tools, you'll be able to start at that EUE and drill down to find the root cause of problems that are under your control, making the EUE the perfect place to begin problem diagnosis and troubleshooting activities. Figure 2.4 shows how a monitoring solution can expose the EUE in a simple fashion, such as through a color-coded response time indicator. Green means the end users are going to have an acceptable experience; anything else requires your attention.
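A color-coded indicator like the one described above is just a thresholding rule over a measured response time. Here's a minimal sketch; the threshold values and the function name are illustrative assumptions, not taken from any particular product:

```python
# Map a measured end-user response time to a color-coded EUE status.
# The thresholds below are illustrative assumptions; a real monitoring
# tool would let you configure them per application or per transaction.

GREEN_MAX_SECONDS = 2.0   # at or below this, the EUE is acceptable
YELLOW_MAX_SECONDS = 5.0  # above green but at or below this, degraded

def eue_status(response_seconds: float) -> str:
    """Return 'green', 'yellow', or 'red' for a measured response time."""
    if response_seconds <= GREEN_MAX_SECONDS:
        return "green"
    if response_seconds <= YELLOW_MAX_SECONDS:
        return "yellow"
    return "red"

print(eue_status(1.2))   # fast page load -> green
print(eue_status(3.8))   # sluggish, needs attention -> yellow
print(eue_status(9.0))   # users are suffering -> red
```

The point of collapsing everything into one color is exactly the point the chapter makes: the top-level view should answer "are users okay?" before anyone starts digging into component metrics.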

Figure 2.4: Top-level monitoring of the EUE.

The Budget Angle

As you move into pay-as-you-go services, you'll want your monitoring to be correlated to your expenditure. Bringing all that information together into a single console, such as the one illustrated in Figure 2.5, can help you predict expenses and plot growth in your service consumption and the associated expenses.

Figure 2.5: Monitoring cloud computing consumption.

Traditional Monitoring: Inappropriate for Hybrid IT

If you've been following my logic closely to this point, you may be ready to make a significant argument: Traditional monitoring can do all of this. True. To a point. We already have tools that let us measure things like service response times, and many companies use those to develop a top-level view of application health, almost a sort of EUE metric. That works fine when you're completely inside your own network, but the minute you start creating a hybrid IT environment, you lose whatever deep monitoring ability you may have. Understand, too, that hybrid IT doesn't just mean that you've outsourced a few services. It means that you may have internal services that depend on external services. For example, consider Figure 2.6.

Figure 2.6: A truly hybridized IT environment.

Illustrated here is an e-commerce application, hosted on the Rackspace Cloud platform. That application requires access to SalesForce.com, a SaaS customer relationship management (CRM) solution. It also depends on an Exchange Server messaging system, which is hosted by an MSP. Finally, it relies on data from within your own data center, perhaps running on Windows or Linux computers, which might in turn be virtual machines that depend on a virtualization platform such as VMware. Your traditional IT monitoring tools simply can't put all those varied components into a single view for you. And how will you monitor the EUE? Ask your customers to install a little monitoring agent on their computers? Probably not; that's usually known as spyware, even if it's being used for noble purposes. When you start moving to complex, hybrid IT environments, the old "let's get a bunch of tools" approach just doesn't work anymore. There's another problem, too, that can't be corrected using traditional monitoring tools: the problem of correlating performance to your real-world business.

It's Your Business, So It's Your Problem

Let's be very clear on one thing: Hybrid IT involves outsourcing business services; it does not entail handing off responsibility for your business. And hybridization is not a panacea for all your IT woes; although it can solve many business problems, it can also introduce unique challenges.

Provider SLAs Aren't a Business Insurance Policy

Consider this troubling possibility:

John is having a bad day. Last month, World Coffee moved one of its smaller applications to New Earth's cloud computing platform, and for most of today, that application has been completely unavailable. John's boss, and his boss's boss, are furious; the company estimates that it's losing $250,000 every hour that the application is unavailable. Much of that is in lost sales, and the company will not only have difficulty recouping the money but also may lose some of those customers to other distributors. So far, they've lost more than a million and a half dollars. Just then, the application comes back online. Relieved, John calls his contact at New Earth to follow up. He explains to his contact that the company has lost hundreds of thousands of dollars in sales, and that he wants to invoke New Earth's SLA terms. His contact agrees but regretfully informs John that World Coffee isn't due any money from New Earth. The SLA provides for a refund of fees paid to New Earth for any services that were not provided in accordance with the SLA. However, World Coffee is on an entirely pay-as-you-go plan, meaning they have no fees beyond those they pay for what they actually use. Since they haven't been using the application for the past 6 hours, they would have paid New Earth nothing, and so New Earth can't offer any refund.

That's how most SLAs work: They're there to guarantee the service or your money back (or a portion of it). They're not there to cover the business losses that an outage or performance problem may cause.
For example, Microsoft's Windows Azure Compute SLA states that customers can receive a 10% service credit when the monthly uptime percentage falls below 99.95%, or a 25% credit if uptime is below 99%. The SLA doesn't provide any performance guarantees; if the service is up but responding slowly, customers aren't due any credit. This is exactly why it's so crucial that we be able to monitor performance across our hybridized IT infrastructure. If we're not seeing the EUE that we need, we can take action, either by working with service providers to raise performance or by finding new service providers.
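The credit structure described above is easy to express in code, and doing so makes the catch in John's story concrete. The two credit tiers come from the SLA terms quoted in the text; the function names and the example figures are my own:

```python
def service_credit_percent(monthly_uptime_percent: float) -> float:
    """Credit tiers as quoted in the text: 25% below 99% uptime,
    10% below 99.95%."""
    if monthly_uptime_percent < 99.0:
        return 25.0
    if monthly_uptime_percent < 99.95:
        return 10.0
    return 0.0

def credit_amount(fees_paid: float, monthly_uptime_percent: float) -> float:
    """The credit applies only to fees actually paid. On a pure
    pay-as-you-go plan with no usage during the outage, fees are near
    zero -- so the credit is near zero too, as World Coffee discovered."""
    return fees_paid * service_credit_percent(monthly_uptime_percent) / 100.0

# Six hours down in a 30-day month: uptime = (720 - 6) / 720 = 99.17%
print(service_credit_percent(99.17))  # qualifies for a 10% credit
print(credit_amount(0.0, 99.17))      # 10% of nothing is nothing
```

Note what the model does not contain: any term for the $250,000 per hour of business losses. That gap between the SLA remedy and the business impact is the chapter's point.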

Concerns with Pay-As-You-Go in the Cloud

Many cloud computing platforms operate on a pay-as-you-go model, which is one of the main points that make them so attractive to businesses. They offer the ability to scale to almost infinite resources, provided you're willing to pay for that. When you don't need a lot of resources, you're not paying much. So you can get a giant potential infrastructure with essentially zero capital investment. But pay-as-you-go can be full of surprises. Ever go shopping for songs on Apple's iTunes store? Each song is only a dollar or so, so you click, click, click, and then a few days later get a bill for a couple of hundred dollars. Oops. These things add up, don't they? Cloud computing can be the same way. To continue picking on Windows Azure as an example, pricing at the time of this writing is as follows:

Computing power costs between $0.12 and $0.96 per compute hour, depending on the size of the compute instance.

Storage costs $0.15 per month per gigabyte, and $0.01 per 10,000 storage transactions, meaning data reads and writes.

Data transfers cost $0.10 to $0.45 per gigabyte, depending on the direction (in or out) and the region of the world.

Sounds cheap; pennies for gigabytes! But how much will your application use? This is where the "performance = budget" angle comes in. You don't want to get that surprise bill at the end of the month; you want to be able to monitor your use of the application and predict what your bills will be, and even use trending to predict changes in usage patterns and the resulting bills.

Evolving Monitoring for Hybrid IT

So let's talk about what kind of monitoring our hybrid IT environment is going to need. I'll refer to this as evolved monitoring, to draw a distinction between this new approach and the more traditional techniques you're already using. Hybrid IT is an evolution in how we deploy IT services, so it only makes sense that we'd need some evolved monitoring capacity to go with it.
Focusing on the EUE

The first thing we need is a monitoring solution that focuses on the EUE; it's the first and foremost metric that we should care about, and our tools should focus on what we care about. A tool should be able to help us test the EUE of an application, either by passively observing our application in action or by actively probing our application to measure the results. As Figure 2.7 shows, the EUE should be broken down into the various contributing components so that we can see how each component drives the overall EUE.

Figure 2.7: Breaking down the EUE.

This approach not only helps us manage to that all-important EUE metric but also provides a starting point for diagnosing problems when that metric strays out of our comfort zone. Our tool should allow us to define our EUE-based SLAs, and should help us monitor compliance with those SLAs. Figure 2.8 shows what that might look like.

Figure 2.8: Managing end-user response SLAs.
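At its core, an EUE-based SLA like the ones in Figure 2.8 is a per-transaction response-time threshold compared against measured completion times. A minimal sketch follows; the transaction names, threshold values, and sample timings are invented for illustration:

```python
# Per-transaction SLA thresholds, in seconds. The transaction names
# and values here are invented for illustration.
sla_thresholds = {"login": 2.0, "logout": 1.0, "checkout": 4.0}

# Observed end-user completion times, as an EUE monitor might collect them.
observed = {
    "login":    [1.1, 1.9, 3.5, 1.4],
    "logout":   [0.4, 0.6],
    "checkout": [3.0, 5.5, 4.1, 2.8],
}

def sla_compliance(thresholds, samples):
    """Return, per transaction, the fraction of samples that met the SLA."""
    report = {}
    for name, times in samples.items():
        met = sum(1 for t in times if t <= thresholds[name])
        report[name] = met / len(times)
    return report

for name, pct in sla_compliance(sla_thresholds, observed).items():
    print(f"{name}: {pct:.0%} of transactions within SLA")
```

Keeping these per-sample results over time is what yields the compliance history shown later in Figure 2.9: the same report, bucketed by day or week.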

At the top of this console, you can see that specific end-user transactions, like logging in and logging out, have been defined with SLAs, and the console is monitoring those transactions and telling us how often the application fails to meet our defined response times. We should even, as Figure 2.9 shows, be able to pull up a history of SLA compliance.

Figure 2.9: Monitoring SLA compliance over time.

Note that these SLAs are not being defined by simple response times like pinging a service; they're times to complete a specific end-user task. That's a distinction we'll drill into in the next chapter.

Monitoring the Application Stack

Being aware of the EUE metric is important; however, when it's outside of our acceptable range, we need to be able to drill further to find the problem. That's why an evolved monitoring solution needs to be able to drill deeper, measuring specific application components. Figure 2.7 provided one example of how that might be visualized in a monitoring application; Figure 2.10 shows another.

Figure 2.10: Monitoring the entire application stack.

From this kind of view, you should be able to drill deeper into specific problem areas. That drill-down should help you spot problems with that particular component. For example, clicking on Enterprise Server in the end-to-end view might bring up a drill-down console like the one Figure 2.11 shows, where we can get a high-level view of that particular server's performance and start looking for problems.

Figure 2.11: Drilling down into a server's performance.

This drill-down must be aware of the kind of component we're looking at. For example, this view might be sufficient for a typical Windows server, but it's not showing us anything specific for a database server. If we'd clicked on a database server, we'd expect to see information related to the database management platform, such as the console shown in Figure 2.12.

Figure 2.12: Drilling down into a database server.

Now, we're really troubleshooting problems. We can see that the server's CPU is dangerously close to maximum, and database response times have crept firmly into the red. We're running low on free space, too. With a couple of clicks, this all-in-one console might take us from a potentially problematic EUE metric right to the root cause of the problem, which we can now confidently assign to a DBA to start fixing.

Keeping an Eye on the Budget

Because it's continually monitoring our application, this evolved console can also help us keep an eye on our expenses for pay-as-you-go services. By simply tracking the things we pay for (compute time, bandwidth, storage, and so on), we can use the console to help plan growth and learn what to expect in our bills. We might even be able to export those numbers into a spreadsheet and play what-if scenarios to see what our bills would be like if we increased or decreased our application activity.
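The "export to a spreadsheet and play what-if scenarios" idea can be sketched directly. The rate structure below loosely echoes the Windows Azure price ranges quoted earlier in the chapter, but the specific rates chosen and the usage figures are illustrative assumptions:

```python
# Illustrative pay-as-you-go rates, loosely echoing the Azure ranges
# quoted earlier in the chapter. Real rates vary by instance size,
# region, and transfer direction.
RATE_COMPUTE_PER_HOUR = 0.12        # small compute instance
RATE_STORAGE_PER_GB_MONTH = 0.15
RATE_PER_10K_TRANSACTIONS = 0.01    # storage reads and writes
RATE_TRANSFER_PER_GB = 0.10

def monthly_bill(compute_hours, storage_gb, transactions, transfer_gb):
    """Estimate a month's bill from tracked usage figures."""
    return (compute_hours * RATE_COMPUTE_PER_HOUR
            + storage_gb * RATE_STORAGE_PER_GB_MONTH
            + (transactions / 10_000) * RATE_PER_10K_TRANSACTIONS
            + transfer_gb * RATE_TRANSFER_PER_GB)

baseline = monthly_bill(compute_hours=720, storage_gb=100,
                        transactions=5_000_000, transfer_gb=50)
# What-if scenario: application activity doubles next month.
doubled = monthly_bill(compute_hours=1_440, storage_gb=200,
                       transactions=10_000_000, transfer_gb=100)
print(f"baseline: ${baseline:.2f}, doubled: ${doubled:.2f}")
```

Feed the usage inputs from actual monitoring data rather than guesses, and the same function becomes a trend-based forecast instead of a surprise bill.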

Coming Up Next

We need to redefine what "service level" means, and to do that, we need to think about what really matters in IT: the EUE. In the next chapter, we'll go into detail about why the EUE is so important from a business perspective, and why that must drive the technology perspective. We'll also look at the unique challenges imposed by a hybrid IT environment, and how that kind of environment can make it even more challenging to accurately assess the EUE. We'll examine the technologies and techniques available for monitoring the user's experience, and some of the reasons why that kind of monitoring isn't already in your arsenal of tools. We'll also look at some of the unique perspectives that service providers have on the EUE, and how they can help themselves and their customers do a better job of maintaining application performance.

Chapter 3: The Customer Is King: Monitoring the End-User Experience

I've already presented the end-user experience (EUE) as the ultimate metric for an application's overall performance, whether that application lives entirely in your data center, entirely in the cloud, or in some combination of the two. In this chapter, I'll explore the EUE in greater detail: How do you actually measure it? What, specifically, are you looking at? What contributes to a good or poor EUE in an application? If you see your EUE metric starting to head toward "poor," what can you do about it? I'll also examine some of the reasons companies traditionally haven't measured the EUE, and why doing so can still be tremendously difficult, especially in newer, highly distributed applications that involve elements of cloud computing.

Why the EUE Matters

First, however, let's really pin down why the EUE is so important. As I outlined in the previous chapter, businesses have typically measured application performance solely from technical measurements (database response times, for example). What does the EUE really offer that the more traditional measurements don't?

Business-Level Metric

Take a look at Figure 3.1. It's a common enough chart in a business technology environment, displaying a variety of performance metrics for a database. The bottom graph shows, in orange, physical reads from disk. I guess those look normal. The middle graph shows, also in orange, logons to the database. Had a little peak around 6pm and 5:30. Guess that's okay. The top graph shows a variety of statistics, primarily user input/output in blue and CPU utilization in green. That's a lot of I/O, I guess. CPU looks okay. At least, it's lower than the top of the graph.

Figure 3.1: Database performance.

So what does all of this mean to the business? That's harder to gauge. Are we making money or losing money? Do the users of our application believe it's performing well? Do we have phone agents sitting on the phone telling customers, "I'm sorry, the computer is slow today," or is everything popping up pretty quickly for them? It's impossible to tell from this chart. What about Figure 3.2? This is a VMware vSphere performance graph showing CPU utilization for a week. Got a little spike late Monday, but there's no way to tell whether that impacted our business operations. In fact, it might have been late at night, so possibly it was related to a maintenance task or something similar. It's impossible to tell whether any users were impacted, though.

Figure 3.2: Virtualization host CPU performance.

These two examples precisely illustrate the problem with traditional performance measurement: There's no tie to anything that matters to the business. Memory, I/O, CPU, disk: none of these things tell us whether the application is performing well for the business. Sure, we could add some thresholds to those charts, maybe generating an alert when CPU utilization spiked above 80%, as it did in Figure 3.2. But that still doesn't tie directly to what matters to the business. Here's how businesses have traditionally tried to make these technical measurements relate to business concerns:

1. We observe technical performance metrics.

2. When we start getting end-user calls about application performance, we make a note of the performance metrics at that time and draw a threshold line.

3. From then on, if performance is on one side of the threshold, we assume that equates to good end-user performance. On the other side, we assume it means poor end-user performance.

The problem with this approach is that we can't continuously measure every technical element that contributes to good or bad end-user performance. We continue to use technical performance elements (CPU, disk, memory, and so forth) as our only real metrics, even though we can't directly relate them to anything the business cares about.

Note: It's even worse when your end users don't work for your company, because in that case, you might never know that there's a perceived problem. For example, if your customers feel that your online shopping cart is too slow, they won't call you; they'll simply give up and shop elsewhere. Only measuring technical elements can make it easy to miss the fact that you're losing sales and customers.

Tied to User Perceptions

As I discussed in the first chapter, end users don't usually care about CPU utilization and disk throughput: They care about whether the application seems slow. They base their judgment on how long it takes them to complete common transactions, such as looking up a customer order, completing a checkout, and so on. Ultimately, your end users are the ones who can tell you whether your application is performing well and doing its job. Unfortunately, end users have some significant problems:

They aren't consistent. One user might feel the application is performing well, while another feels it's too slow.

They aren't accurate. An application with identical performance might be perceived as slightly slow one day and just fine on another.

They don't always report. Even users within your own organization tend not to report poor performance because they usually get a brush-off; external users, such as customers, will generally just take their business elsewhere.

That's why monitoring the EUE is so important. You get that end-user perspective, which is ultimately the only thing that matters to the business. But you get it more accurately, more consistently, and without the need for users to actually report in on their own. This is really where the term application performance management (APM) comes from. In fact, let's create a formal definition: APM can be defined as the process, and the use of related IT tools, to detect, diagnose, remedy, and report on an application's performance to ensure that it meets or exceeds end users' and the business's expectations.
Application performance relates to how quickly transactions are completed on behalf of, or information is delivered to, the end user by the application via a particular network, application, and/or Web services infrastructure.

Challenges as You Evolve to Hybrid IT

Hybrid IT, as I described in the previous chapter, is the evolution of business technology services to incorporate a variety of highly distributed technologies; "super-distributed" is another term that refers to this evolution. In the previous chapter, I outlined how our applications have evolved from monolithic code that ran on a single computer to today's applications that rely on resources in the cloud, from your own data center, and from service providers. Figure 3.3 illustrates this kind of application.

Figure 3.3: Modern hybrid IT application.

With this kind of application, customers interact with a Web site that is hosted by a Web hosting company (the Rackspace Cloud, in this example). That application is not self-contained, however. It depends on a Software as a Service (SaaS) offering from SalesForce.com to track customer information, and sends messages through a hosted Microsoft Exchange service. On the back end, additional services (virtualization hosts, Windows servers, Linux servers, Oracle databases, and network elements) provide data and support to the application. This application depends upon multiple data centers, numerous disparate software elements, and more. Monitoring this by using traditional technology-centric techniques is impractical if not impossible, meaning the only way to make sure our customers are happy is to measure the EUE directly. What specific challenges come along with this kind of super-distributed application?

Geographic Distribution

One significant problem is the geographic distribution of today's more complex applications. For example, suppose you've decided to host a Web service or Web site in the Windows Azure cloud. You have no idea where your application is physically located. The whole point of the cloud, in fact, is to make your application available more globally so that users all across the world can access it more or less equally.

For example, consider an application that's hosted in a more traditional fashion, in a shared hosting environment, or on a dedicated server that's located in a hosting company's data center. One server, or one group of colocated servers, hosts the application. Monitoring the EUE is straightforward because you've only got one thing to manage. Figure 3.4 illustrates.

Figure 3.4: Monitoring the EUE in a single-location application.

Move into the cloud, however, and your application is inherently able to be hosted transparently on multiple servers that are geographically distributed. Figure 3.5 illustrates the difficulty in monitoring the EUE: An EUE monitor in a given location will probably be connecting, most of the time, to a geographically close server; end users in other locations, however, may be connecting to entirely different servers across entirely different infrastructure. Your EUE monitoring won't be getting an accurate picture of the entire application's performance because it's only seeing part of it.

Figure 3.5: Geo-dispersed cloud applications.

You could, of course, start deploying distributed EUE monitors as well. But that means you have to start building a giant, globally distributed monitoring infrastructure, and I'll go out on a limb and guess that doing so isn't part of your business's overall goals. Another option is to work with service providers, such as cloud hosting companies, that provide you with distributed monitoring capabilities, meaning they've deployed those capabilities within their own infrastructures and make data related to your application available to you. If you are a service provider, that's a very competitive feature to offer to your customers. The last option is to find a monitoring solution that comes with a globally distributed monitoring service. In other words, rather than just buying yet another monitoring console, you're buying a console that integrates with an in-place ability to monitor distributed applications hosted in commonly used services, such as specific cloud computing providers. As Figure 3.6 shows, that can give you access to performance information on these cloud providers without requiring you to deploy your own extensive monitoring infrastructure.

Figure 3.6: Monitoring cloud provider performance.

Deep, Distributed Application Stacks

Figure 3.3 showed how complicated today's applications can really become, relying on numerous hosted services, different SaaS offerings, and your own back-end infrastructure, which may be running on a variety of operating systems (OSs), virtualization hosts, and so forth. Every single element in this stack can contribute to performance problems, and even if you're monitoring the EUE directly, you'll still want insight into individual component performance for troubleshooting purposes. Considering just the example in Figure 3.3 (which I'll repeat here for your convenience), you're looking at a huge variety of things to monitor:

SalesForce.com's overall responsiveness.

The performance of your Web site hosted in the Rackspace Cloud, which may involve a quantity of servers that expands and shrinks transparently in response to demand, and which likely involves servers that are globally distributed.

The performance of the hosted Exchange service.

The performance of your own Windows and Linux OSs, including their memory, CPU, disk throughput, and so forth.

The responsiveness of your back-end Oracle database.

The performance of the VMware virtualization infrastructure that hosts some or all of your servers.

The performance of related services, such as a Cisco VoIP solution.

The performance of the various Internet connections that let all of these components communicate.

That's a lot. Although measuring the EUE doesn't necessarily require deep insight into this highly distributed application stack, you will need that deep insight in order to tune performance and troubleshoot problems. We'll cover that in more detail in the next chapter.

Techniques for Monitoring the EUE

So how do you go about actually monitoring the EUE? What tools are needed, and what skills are involved? How do those tools actually gather EUE information, store it, and present it to you? Because today's applications are themselves constructed from numerous different components, we'll have to adopt a number of techniques and technologies for monitoring the EUE. But let's first talk about the unachievable ideal way to monitor the EUE: a little monitoring agent installed on all your users' computers. That might be possible if your end users are entirely within your organization, but it's outright impractical to distribute that many agents and collect information from them. It's impossible if your end users are external users, such as customers; such agents are often referred to as spyware no matter how beneficial they may seem. But an EUE-based agent would be the perfect monitor: It would see exactly what the end user sees and be able to monitor specific transactions and report back with real-time, real-world performance information. Given the impossibility of using such an agent, however, we have to look at other techniques. In fact, because we can't directly and empirically monitor the EUE, we may have to use several techniques to monitor different aspects of the EUE, depending upon the exact elements of our application.
Again, we want to make sure we're monitoring things that map directly to the end user's actual experience; we can't just fall back to monitoring CPU utilization and other technology-centric metrics.

Platform-Level APIs

One approach is to use application programming interfaces (APIs) for specific platforms. For example, being able to pull performance information from a Windows server or an Oracle database requires specific knowledge of how those platforms are built and how to pull performance information from them.

Data from Providers

In some cases, we might be able to draw performance information directly from our service providers. Some managed service providers (MSPs) can offer us detailed performance information not only at the technology component level but also at the EUE level, helping us to see the time it takes to complete a specific transaction, for example.

Distributed Monitoring Agents

One excellent tool for monitoring the EUE is, as I've already described, a network of distributed monitoring agents. Properly deployed, these can help us measure elements of the EUE from all around the globe, which is especially useful when we're dealing with a highly distributed application.

Click-to-Click Monitoring

Click-to-click monitoring is really the ultimate in EUE monitoring because it operates at the EUE level. It involves measuring the exact amount of time that it takes to complete a sample transaction, or even discrete steps within a transaction. Literally, click-to-click means measuring the time between an end user's specific actions: clicking Next in a shopping cart to clicking Submit Order, for example. The reason this is such a good EUE monitoring technique is that it encompasses the entire underlying application, including whatever components are included in it. If a shopping cart submission requires the coordination of twelve back-end components, we'll capture all of that delay in the click-to-click time; by using the techniques I've described earlier (platform APIs, provider data, and monitoring agents), we can even break down the amount of time each element contributes to the final EUE. For example, let's revisit a figure from the previous chapter. Figure 3.7 shows a distributed, multi-component application. At the EUE level, labeled End-User Response View, we get the rollup of the time it takes to complete specific transactions. This is a Web application, so we see the TCP/IP response time, the HTTP connect time, and the time it takes to get a response from the URL.
Those are the wait times that an end user would experience, and they're our top-level EUE metrics.
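The chapter's central claim, that the EUE is the sum of the response times of the individual elements a transaction passes through, can be sketched directly. The phase names below mirror the figure (TCP/IP, HTTP connect, URL response); the millisecond values are invented for illustration:

```python
# The EUE for one transaction is the sum of the response times of the
# elements the request passes through. Phase names mirror the figure;
# the millisecond values are invented for illustration.
phase_times_ms = {
    "tcp_connect": 45,
    "http_connect": 120,
    "url_response": 610,
}

def eue_ms(phases):
    """Derive the end-user wait time by summing per-phase response times."""
    return sum(phases.values())

total = eue_ms(phase_times_ms)
print(f"click-to-click time: {total} ms")

# When the total strays past the SLA, the same breakdown shows where
# to drill: the largest contributor is the first place to look.
slowest = max(phase_times_ms, key=phase_times_ms.get)
print(f"largest contributor: {slowest}")
```

Note what is absent from this model: CPU, memory, and disk utilization never appear. Only response times do, which is exactly the shift from resource-threshold monitoring to response-time monitoring that the chapter argues for.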

Figure 3.7: Click-to-click monitoring.

Literally, those times are what the user waits on between clicking the mouse button and seeing a final result on the screen, and being able to click the mouse button again to take the next action. Those are the times we manage to, and those are the times we should create SLAs around. But when those EUE times don't look good, we immediately need to find out why, and that's where the End-to-End Component View comes into play (it'll be the subject of a much more detailed discussion in the next chapter). The idea here is to use platform-specific APIs, monitoring agents, and so forth to find out how that EUE time breaks down. Let's look at a detail of that portion, in Figure 3.8.

Figure 3.8: End-to-end component view.

We can see that the Web server-to-application server communications, in the top left, used 193ms of time; the LDAP communications from our authentication server on the lower left took 889ms. Here's the trick, though: These times can be summed to provide us with the EUE time. In other words, sometimes it might not be possible or practical to empirically measure the EUE. Although we cannot derive the EUE by monitoring technical elements like CPU utilization, we can derive the EUE by monitoring response times between specific elements of the application architecture. That's the secret of the EUE: The EUE is simply a measurement of response time. It does not take into consideration resource utilization, such as CPU, memory, or disk; it is entirely the sum of the response times of individual application elements. Although we cannot derive the EUE by monitoring individual elements' consumption of resources, we can derive the EUE by monitoring individual elements' response times. That's the big paradigm shift in IT monitoring for APM: We're shifting from monitoring resources against predefined thresholds to monitoring actual response times. End users' sole metric for their perception of application performance is speed, which equates to response time, and so that's what we measure. Monitoring the EUE is really just a fancy way of saying "measure how long it takes."

Why We Often Don't Monitor the EUE Today

So if the EUE is such an amazing thing to monitor, why don't we do more of it? There are a number of reasons, one of which is sheer inertia: The IT industry hasn't ever really capitalized on EUE monitoring until recently, and the IT industry, for all that it is a force of change in many companies, doesn't like change. But there are also practical reasons.

Complexity

Measuring response times (which, again, is what the EUE is all about) can be difficult. For one, it almost always requires external instrumentation.
You can't always ask a Windows server, for example, to measure its response time for looking up a piece of data, because that measurement will be affected by the server's performance. If the server isn't performing well, the measurement won't be accurate. That means our traditional technical monitoring (which just draws on performance information directly from whatever we're monitoring) can't always deliver accurate response time information. We're then faced with the complexity of creating new monitoring schemes and implementing entirely new tools. That can be complicated.

Lack of Tools

Speaking of tools, until the last half decade or so, the EUE wasn't on anyone's mind. That means our technology components (servers, databases, networks, and so on) weren't built to deliver response time information. It's only in the past 5 years or so that APM has really become a major thing, inviting innovative third parties to create a marketplace to fill the business need. In other words, until fairly recently, there's simply been a lack of tools that could monitor the EUE.
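To illustrate why external instrumentation matters, here's a hedged sketch: the clock runs on the observer's side of the wire, so a struggling server cannot skew its own number. (The URL is a placeholder, and real APM tools use synthetic transactions or network probes rather than a bare timer like this.)

```python
import time
import urllib.request

def measure_response_ms(url, timeout=5.0):
    """Time a request from the *outside*, so the measurement does not
    depend on the monitored server's own (possibly degraded) performance."""
    start = time.perf_counter()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()
    except Exception:
        return None  # treat a failed probe as "no measurement", not as zero
    return (time.perf_counter() - start) * 1000.0

# elapsed = measure_response_ms("https://app.example.com/login")  # placeholder URL
```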

Even now, the industry is still figuring out what works. The very first vendors in the space adopted approaches that were suitable for applications of that time, and are often still applying those same approaches. Today's newer applications, however (and this has just been happening in the past couple of years), are built so differently, and along such distributed models, that the original EUE monitoring tools often can't keep up. That means we've been waiting on an entirely new set of tools and vendors who can specifically address EUE monitoring for the highly distributed applications of the hybrid IT age.

For example, consider Figure 3.9. This is a portion of a fairly traditional APM console. You can see that it is indeed focused on response times, which contribute to the EUE, and it is generating alarms and alerts on response times that are exceeding predefined SLAs. In other words, it's helping to alert us to a problem in the EUE and helping us find the root cause. That's great.

Figure 3.9: Traditional EUE monitoring.

The problem is that much of this intelligence is gleaned from a series of agents installed directly on servers, and perhaps even from network probe appliances connected to the network. After all, you have to collect the response times somehow, and the probe- and agent-based approaches are very common and effective in traditional applications. Take a look at Figure 3.10, which shows the application stack from a more physical perspective, and see if you can spot the problem with this approach.

Figure 3.10: Physical application model.

The problem is that these tools assume we own the entire application infrastructure. In order to install agents and probes, we need to have all of these components in our own data centers, under our own control. How do you propose to install monitoring agents on SalesForce.com's servers? How do you think you would monitor response times from the Windows Azure cloud's globally distributed data centers? It's impractical, and that's why the approaches taken by traditional APM tools aren't always workable with highly distributed, partially outsourced hybrid IT applications. But there is a new breed of tools, from innovative new vendors as well as from the traditional monitoring vendors, all of whom recognize the special challenges created by hybrid IT.

Cost

On the face of it, EUE monitoring for hybrid IT applications seems very expensive to build yourself. Simply deploying worldwide agents to monitor response times from a cloud computing provider would be prohibitive for most companies. Working with MSPs to gain instrumentation into their infrastructure, for the purposes of monitoring your use of that infrastructure, would be incredibly expensive for every single business to do, and that's why most businesses haven't done these things.

The trick is to leverage some economies of scale. Rather than each company building its own global monitoring network, companies need to look to application performance vendors who can provide that network, spreading its cost across many monitoring customers and making it affordable for all of them to use.

Component-Level Monitoring Can Be "Close Enough"

And do you know the number one reason so many companies have been content to ignore the EUE for so long? Because, in old-style applications, monitoring technical elements has always been close enough. "Look," you might say, "we know that users are happy so long as CPU stays under 70% or so, and so long as disk queues don't get longer than 5 or 6. So we monitor those things, and if they start to go awry, we know we have a problem that needs to be fixed."

And I can't argue with that logic for old-school applications. That is, for applications that involve one, two, or maybe even three tiers. Where every element of the application lives right within your own data center. Where you have full control of everything. Where you can see the end users, get phone calls from them, and talk to them about how the application is performing. Plus, most IT shops are really good at this kind of monitoring. They've learned how to do it over decades, and refined the process down to a true science. The problem is, applications don't look like that anymore.

Why We Must Monitor the EUE Going Forward

Applications are simply growing too complex. Outsource a single component of your application (take a dependency on a SaaS provider, for example) and almost all your old-school monitoring techniques go out the window. Sure, you could just build applications that don't use outsourced elements, but that's letting your technical limitations drive the business rather than letting what's right for the business drive your technical decisions.
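The old-school approach described above can be sketched in a few lines; the metric names and threshold values here are illustrative, not taken from any product:

```python
# Threshold-based monitoring: compare resource metrics against fixed limits.
THRESHOLDS = {
    "cpu_percent": 70,       # "users are happy so long as CPU stays under 70%"
    "disk_queue_length": 6,  # "...and disk queues don't get longer than 5 or 6"
}

def check_thresholds(sample):
    """Return the metrics in a sample that exceed their predefined thresholds."""
    return {name: value
            for name, value in sample.items()
            if value > THRESHOLDS.get(name, float("inf"))}

# A healthy sample triggers nothing...
print(check_thresholds({"cpu_percent": 55, "disk_queue_length": 2}))  # {}
# ...but a breach only hints at trouble; nothing here says how long users wait.
print(check_thresholds({"cpu_percent": 85, "disk_queue_length": 2}))  # {'cpu_percent': 85}
```

The gap is visible in the code itself: there is no response time anywhere in it, which is exactly the argument for the EUE.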
The fact is, to support modern business requirements, we're going to be seeing much more distributed, hybridized applications, and we need to figure out how to monitor them effectively. And the bottom line there is that monitoring the EUE is a more effective metric for any application, whether it's entirely in-house or mostly delivered from the cloud.

Vastly More Complex Environments

The old performance-and-threshold world really didn't offer fantastic performance management; it was simply good enough, and it was readily achievable within the relatively simplistic application environments of the past. And that's the past.

We are very rightly building ever more complicated application environments today:

- Cloud computing offers the potential for near-infinite application scalability, without massive infrastructure investments. Even for internal line-of-business applications in geographically distributed companies, businesses are crazy not to at least investigate and consider putting a portion, if not all, of their applications into a cloud environment. It makes an incredible amount of sense in many cases.

- Anyone who's suffered through agonizingly long implementations for applications like Customer Relationship Management (CRM) solutions can appreciate the ease, convenience, and lowered overhead of SaaS offerings. Very few businesses are in the business of supporting massive software installations, and SaaS can deliver the capabilities businesses need without the overhead and distractions.

- Like SaaS, managed services help businesses lower the cost and overhead (not to mention the distraction factor) of critical business services that aren't central to the business' competencies. Say you're a retailer. You obviously need e-mail, but you're not in the business of providing e-mail services. Why not let an MSP worry about it for you?

The arguments for this kind of piecemeal IT outsourcing are compelling, and thousands of businesses are benefitting from these new models. But the fact remains that we still need to build our own applications that depend upon these outsourced services. An online retailer might not want to deal with their e-mail or CRM systems, but they do want to be responsible for their e-commerce systems, which unfortunately need to interface with the e-mail system and the CRM system. The old school of monitoring would say, "Well, we can't outsource e-mail and CRM because they're critical to the e-commerce app, and the only way we can monitor them is if they're in-house."
No more: We have to focus on EUE monitoring because it lets us outsource key pieces of our infrastructure while still maintaining visibility into what matters the most to our business.

Business- and Perception-Level Focus

Measuring technical elements like CPU, memory, and disk may have been good enough to spot impending performance problems, but it did nothing to help drive a business focus within the IT group. Let's face it: "The CPU is running at a steady 80% utilization" isn't quite as compelling, from a business perspective, as, "The Web site is running slowly and we're losing 10% of our shopping carts; that's money out the door!"

Although I do feel that purely internal applications (ones completely under your control) can be effectively monitored even if you're ignoring the EUE, I don't think any application's top-level performance metric should be anything except the EUE. The speed of the application, as perceived by the end user and by the business, is the only thing that matters. Knowing your EUE can help make a ton of other business decisions much easier:

1. You: "Boss, the server's running 80% all the time. Can we add a processor?" Boss: "Um, no."

versus:

You: "Boss, the server is only able to maintain a 2-second response time for shopping cart checkouts, and we're losing about $12,000 in carts per day. Can we buy a new processor for $200?" Boss: "Um, yes. Immediately, please."

2. You: "Hi, Mr. Cloud Provider? Yeah, we have a guy in the office who feels that our Web site is taking a really long time to load. Can you do something about it?" Cloud Provider: (laughter)

versus:

You: "Hi, Mr. Cloud Provider? We're seeing 800ms response times from our Web application in your cloud. We agreed that 500ms was the limit, and we've narrowed the problem down to a 200ms extra delay in your database layer's response time. Fix it." Cloud Provider: "Wow... okay. We'll get on it."

3. Boss: "Sales for Asia are down. We're blaming the response times of the Web site. You need to get on it." You: "Sure, just let me update my resume first."

versus:

You: "Boss, we're noticing 30% slower response times for our users based in Asia." Boss: "That's a million dollars in business a year! We'll get the provider on the phone and dig into this immediately."

The point is that moving to a business focus helps make a number of technology decisions easier, because it puts a business-colored spotlight on everything.
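The cloud-provider conversation above is, at heart, a small calculation: compare the observed EUE against the agreed limit and attribute the overage to the component that regressed. Here's a hedged sketch of that logic; every name and number is illustrative:

```python
SLA_MS = 500  # the agreed end-to-end limit, per the (hypothetical) contract

def sla_report(observed, baseline):
    """observed/baseline: per-component response times in ms.
    Returns (total EUE, component with the largest regression, its delta)."""
    total = sum(observed.values())
    worst = max(observed, key=lambda c: observed[c] - baseline[c])
    return total, worst, observed[worst] - baseline[worst]

baseline = {"web": 100, "app": 150, "database": 200}  # normal days
observed = {"web": 110, "app": 160, "database": 530}  # today
total, worst, delta = sla_report(observed, baseline)
if total > SLA_MS:
    print(f"{total}ms exceeds the {SLA_MS}ms SLA; "
          f"largest regression: {worst} (+{delta}ms)")
```

That one-line attribution is what turns "your cloud feels slow" into "fix the database layer," which is the difference between the two phone calls in the dialogue.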

Too Much Is Out of Your Control

The last reason to move to EUE monitoring is that, quite simply, it's the only metric you can accurately and consistently obtain when much of the rest of the application infrastructure is completely out of your control. Yes, you might need help in obtaining accurate EUE numbers (especially with globally distributed application components and end users), but the EUE is the only thing you can point to, with confidence, that tells you the business is doing okay. The less of your application infrastructure you can touch, the more the EUE is going to mean to you.

The Provider Perspective: You Want Your Customers Measuring the EUE

If you're an MSP, the preceding discussion should tell you what your customers are going to be looking for from you. In fact, there are some excellent reasons for you to begin providing performance metrics to your customers immediately.

The Provider Isn't 100% Responsible for Performance

When customers rely on your services as a part of their application, it's all too easy for them to point to you as the weak link when things aren't looking perfect. By providing your customers with accurate metrics (preferably in a way that can be consumed by the customers' own performance monitoring tools), you can not only dispute claims that your performance is at fault but also avoid those claims entirely. When customers can see into your network to some degree, you're giving them a number of benefits:

- You're proving that you don't have anything to hide.

- You're making yourself a partner in their business, not just a vendor.

- You're helping them eliminate you as a potential weak link, because they can see the performance they're getting from you. Everyone benefits.

- You're able to define more granular and accurate SLAs, and your customers are able to more easily verify that you're meeting them.
When you and your customer have the same set of performance data in front of you, you're both able to have a stronger business relationship.

You Gain a Competitive Advantage

There's an enormous competitive advantage in being able to provide your customers with performance metrics. For one, doing so proves that you're a mature, competent, confident service provider. You're not a fly-by-night company who promises amazing performance for a too-good-to-be-true price, then doesn't deliver. By providing metrics (and challenging customers to evaluate your competition's ability to do so), you're making yourself accountable, and providing customers with a transparency that they'll appreciate.

You're also making yourself much more a partner in your customers' business. Look, the bottom line is that all service providers want to gain and retain customers; customers, for the most part, want to stick with their providers, because switching is an enormous hassle with little value-add. If you can help to integrate your infrastructure with your customers', you'll help them manage their businesses and IT investments more easily and accurately. They're a lot less likely to want to switch, and you'll have gone a long way toward making a customer for life.

Coming Up Next

We're hopefully agreed that the EUE is the way to go for managing application performance at the top level. If you see an EUE problem, though, you're going to have to be able to dig deeper to find the root cause. That's what the next chapter will focus on: monitoring at the component level. This isn't about ensuring good application performance; it's about fixing performance problems that you've noticed in your EUE measurements. I'll look at the traditional monitoring stack and some of the challenges that come into play with today's multi-discipline applications. I'll also look at newer monitoring techniques, and provide insight for service providers who want to offer their customers deeper insight into their service offerings.

Chapter 4: Success Is in the Details: Monitoring at the Component Level

The EUE is your ultimate metric for whether an application is performing well. It's what you should base SLAs upon, and it's certainly your ultimate measure of success or failure with an application. The EUE isn't, however, very useful at helping you troubleshoot problems when they occur. For that, you'll need a deeper, more detailed level of monitoring. In this chapter, I want to compare and contrast two approaches to that more-detailed kind of monitoring: the traditional, multi-tool monitoring stack, and a more modern approach that focuses on getting everything into a single view.

Traditional, Multi-Tool Monitoring

In a traditional monitoring environment, IT experts tend to rely on single-discipline tools to troubleshoot the application stack. That's an approach that has served for years, although as applications become more complex and distributed, we've needed a wider number and variety of tools to get insight into everything. Typically, separate tools exist for each major layer of the application. Consider the example in Figure 4.1, which illustrates the application stack I'll be using for this section of the chapter. This diagram is more hardware-centric, as it is intended to represent the major physical elements of the application: client application, network infrastructure, application server (which might be a Web server, for example), and a back-end database. Notice that this example, in keeping with the traditional approach, does not incorporate any cloud-based elements; we'll come to the cloud issue shortly, though. Let's look at each of these elements in turn.

Figure 4.1: An example application stack.

Client Layer

The client obviously plays a major role in an application's performance. A slow client computer can do more to ruin the perception of an application than almost any other component in the stack. Unfortunately, it's often impractical to monitor performance directly on the client, particularly in a commercial application setting where the client belongs to your customer and not to you. You can, of course, set up a test client computer with hardware and software configurations that resemble your average client or customer computer, and you can install monitoring software on that computer to see what kind of performance problems, if any, the client is introducing into the equation.

That doesn't always help you troubleshoot problems at the client level, though. In fact, in some cases, client-specific problems can't practically be solved. For example, suppose some of your customers run a particular brand of antivirus software that simply slows their computers; there's nothing you can really do about that, aside from making sure that your client application performs as well as it can under the circumstances. There's unfortunately not much else to say about the client layer. It's often outside your control, and although you can and should test your client applications on representative client computers, that's about all you can do in a traditional setting.

Network Layer

At the network layer, however, you can begin to get more involved. You can certainly monitor your network's performance, and many tools exist to help you do so. For example, Figure 4.2 shows a tool that helps monitor network performance at a particular network node.

Figure 4.2: Monitoring network performance.

For a larger view of your entire network, there are tools that roll up per-node performance monitoring into a whole-network view, as illustrated in Figure 4.3. But there's a problem with this kind of monitoring: It's absolutely unaware of the application. This type of infrastructure-level monitoring can tell you whether a particular router is overloaded, for example, but it can't correlate that problem to a particular application performance issue.

Figure 4.3: Monitoring the entire network.

Another problem with these types of tools is that they start to fail you once you begin relying on someone else's network. As you move toward a hybrid IT environment, and you begin deploying elements of applications in the cloud (meaning in someone else's data centers), you lose the ability to closely track network device performance. I would argue that, although this type of monitoring tool is widely used within organizations, it isn't really all that useful for application performance monitoring, because it isn't application-aware; it's simply reporting the state of the network. Also, as applications begin to move toward a hybrid or cloud-based model, this type of tool simply loses all its utility, as it can't do its job across a super-distributed, partially outsourced network.

Application and Database Layers

As you move into the application itself, performance monitoring can become even more complicated. Ignoring the hardware that connects and runs the application, modern applications consist of numerous interconnected components, many of which may be shared by other applications. This is especially true in a cloud computing scenario: If you're storing data in Microsoft's SQL Server for Windows Azure, there are likely multiple other customers doing the same thing. Thus, your performance could potentially be impacted by others' use of the cloud infrastructure. That kind of sharing can even occur within applications that live entirely within your own data center, making performance troubleshooting complex and difficult. Consider a relatively straightforward Web application, as Figure 4.4 shows.

Figure 4.4: Application software stack.

This application consists of a Web server, application code running on an AS/400, and back-end data from two different databases. The application itself utilizes Exchange Server for messaging, which in turn has a dependence on Active Directory (AD) for authentication, address books, and so on. Much of the application is running inside VMware virtual machines, which add another layer of performance complexity. How do you monitor an application like this?

Traditionally, you'd probably need about seven tools, one to monitor each specific component: IIS, VMware, AD, Exchange, DB/2, the AS/400, and SQL Server. Each tool would be entirely focused on a single element of the application stack. For example, Figure 4.5 shows what you might expect from a VMware monitoring solution: information on CPU, disk, memory, and network utilization.

Figure 4.5: VMware monitoring.

A SQL Server monitoring tool, in contrast, would present a completely different set of statistics and views, and wouldn't be at all aware of the underlying impact from VMware itself. As Figure 4.6 shows, you'd be dealing with two completely different tools, with different goals and methodologies, neither of which would have the slightest idea about your application's performance.

Figure 4.6: Monitoring SQL Server performance.

That's ultimately the problem with traditional application performance monitoring, even before you start involving hybrid IT elements like cloud computing: Every tool is focused on one bit of the application, and those tools don't even monitor the application. They just monitor their one element, in a completely standalone and out-of-context fashion. Start involving cloud computing and things get even more difficult, because you aren't likely to even get a tool that will let you directly monitor your cloud database's performance.

Other Concerns

So let me clearly state the problem: Performance tools that focus only on a single element of the application are ineffective. In fact, they can actually hinder your application performance monitoring and troubleshooting efforts, as you'll see in the next section. And that's assuming your applications are entirely under your control, within your own data center; start moving application elements into the cloud (whether as Software as a Service (SaaS) solutions, hosted services, or true cloud computing), and these kinds of tools become even more impractical, because in many cases you can't use them to monitor the cloud elements of your application.

Multi-Discipline Monitoring and Troubleshooting

Part of the problem with element-specific tools like those I discussed earlier is that they encourage siloing within your IT organization. IT professionals tend to specialize in just one or two elements of an application, and the fact that their tools only monitor those elements enables them to put on blinders for the rest of the application. That simply causes more difficulty when performance problems arise.

Applications Are Not the Sum of Their Parts

The problem is that you can never look at a single element of an application and judge the application's overall performance. That's an intuitive concept that you might not have actively thought about, but do so for a second: Do you think you could look strictly at a single application element (say, the database, the Web server, or a network router) and determine the overall performance of the application? No, of course not. Therefore, you can't look at every component in a standalone fashion and determine performance information, either. You can't take a bunch of disparate statistics from multiple tools and merge them in your head into a performance picture of an application; it just isn't practical.

That's because applications aren't simply the sum of their parts. In other words, you can't simply add up or average the performance numbers for various application elements and arrive at an accurate top-level performance number for the entire application. Application performance is, in this respect, a lot like a sports team. You can't just add up the average performance of each player and get a feel for the team's overall success. Nor can you simply look for the worst-performing player and focus all your troubleshooting efforts on that person. Individuals might perform very well on their own in practice, and perform entirely differently when the whole team is in action. You have to manage the team's performance as a team, by watching the entire group of players work together.
When it comes to an individual's performance, you might coach them to adopt certain behaviors or to correct specific problems. Doing so, however, might not change their interactions with other team members during actual play. You can't simply tune an individual player on their own; you need to tune everyone's performance as a part of the team.

Tossing Problems Over the Fence: Troubleshooting Challenges

Tuning individual application elements is what IT professionals tend to do well, but it can result in significant delays and dead ends when it comes to managing the performance of an entire application. The week before I wrote this chapter, for example, I visited a consulting client who was experiencing performance problems with an application. I sat down with representatives from their various IT disciplines, and had a conversation something like this:

Me: So what's the problem?

Manager: We're seeing a slowdown in one of our applications. We've traced the problem to a particular query for data from the database; sometimes, it can take more than 4 minutes to execute that one query.

DBA: I've looked at the SQL Server performance, though, and there doesn't seem to be a problem. Anytime I run that same query manually, it works fine. I've rebuilt the indexes on the affected tables. I've even run the query repeatedly, using a load simulation tool, in a test environment, and it works fine.

Network Engineer: We're not seeing anything in the network. Whenever someone reports the slowdown in the application, the network is at the exact same performance levels it always is.

Developer: It is definitely not the client application. We've tested it extensively. Anytime this delay occurs, it happens as the client application is waiting for data to return from the query. The application pauses, but it isn't doing anything; it's just waiting on the database.

DBA: It can't be waiting on the database; I'm not seeing any indication that the database is taking more than a few milliseconds to process that query! It must be the amount of data you're querying.

Developer: No, it isn't; we're only grabbing two rows of data, three at most.

That continued for a half hour or so, with all sides offering up charts and other evidence that their element wasn't causing the problem.
So if nothing was causing the problem, what was the problem? This is a classic example of what I call "tossing the problem over the fence." Each individual IT discipline has set itself up in a fenced-off silo, and they only concern themselves with what's happening inside their fence. If they determine, in their own judgment, that their bit of the world is fine, they toss the problem over the fence to another discipline.

That conversation was about an application living entirely within the company's data center. Just imagine how much more fun it would have been if outside vendors (SaaS providers, Managed Service Providers (MSPs), cloud computing vendors, and so on) were involved! With absolutely no insight into the outsourced components' performance, everyone within the organization would likely have tossed the problem right outside the organization and into the laps of those vendors. Unless those vendors had tools that could show they weren't the problem, the argument would have gone on forever.

So what's the solution? Integrated monitoring, or what some would call unified monitoring. You still have to monitor the individual application elements; you just do so all in one place.

Integrated, Bottom-Up Monitoring

The idea behind integrated, or unified, monitoring is to give everyone access to the same information, in the same way, in the same place. You're still monitoring components, to be sure; each component does contribute to, or take away from, the overall application performance, after all. But you're doing so in a way that puts all the information in front of all the experts at the same time. That helps break down the fences between IT disciplines, gives everyone the same evidence to work from, and helps integrate the performance troubleshooting effort more effectively.

Monitoring Performance Across the Entire Stack

Consider a completely unified dashboard, like the one shown in Figure 4.7. Here, you can get a top-level view of all your application components (not just one), without the need to open multiple tools. You can establish thresholds for good and bad performance, and get a quick, at-a-glance view of how your components are all doing. Anything other than green in any one spot may indicate a performance problem that will impact your EUE.
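The roll-up behind such a dashboard can be sketched simply: each component reports a status, and the worst one determines the overall color. The status names, their ordering, and the component list below are illustrative assumptions, not from any particular product:

```python
# Severity ordering for dashboard status colors, least to most alarming.
SEVERITY = ["green", "blue", "yellow", "orange", "red"]

def rollup(component_statuses):
    """Return the overall dashboard color: the most severe component status."""
    return max(component_statuses.values(), key=SEVERITY.index)

statuses = {
    "web server": "green",
    "database": "green",
    "VMware/Hyper-V": "orange",  # drill down here first
    "SalesForce.com": "blue",    # informational, e.g. a maintenance window
}
print(rollup(statuses))  # orange
```

The point of the design is that everyone sees the same "orange" at the same time, which is what removes the over-the-fence option.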

Figure 4.7: Dashboard of all components.

When you do see something other than green, you need the ability to drill down into component-specific performance information. For example, the VMware and Hyper-V status for this application is orange, meaning there's some kind of problem. Drilling down might reveal a screen like the one in Figure 4.8, which breaks down the problem in domain-specific terms. The dashboard lets everyone see where the problem lies; the drill-down helps to start the troubleshooting process. There's no need to toss anything over any fences, because it's clear where the problem exists.

Figure 4.8: Detail for virtualization problem.

Here, we can see that there are specific service alerts for both VMware and Hyper-V. By viewing those alerts, we can begin to troubleshoot the problem more quickly. We've gotten the right domain expert involved, made it clear that there's no over-the-fence option (because we know the problem is in his or her domain), and gotten troubleshooting started. The key to a toolset like this is having every part of the application represented. As Figure 4.9 shows, that representation can even include infrastructure components; in this case, Cisco-powered Voice over IP (VoIP) services.

Figure 4.9: Including infrastructure services in the monitoring.

It's critical that we be able to see performance for everything the application depends upon. That way, no IT discipline is excluded or forgotten, and there aren't any hidden dependencies impacting performance under the hood or behind the scenes. As you can see from Figure 4.9, however, including everyone doesn't mean making the performance information generic: This example clearly illustrates the domain-specific details that a tool must be able to deliver. VoIP is entirely different from, say, a database server, and the troubleshooting tools have to respect that difference and represent appropriate information for each element.

Our own data center needs to be included, too (see Figure 4.10). The idea is to roll up all the resources under our control so that we can get a top-level performance or health metric for our self-managed resources. This drill-down lets us see network devices, individual servers, virtualization hosts, and the other elements under our direct control. When something exceeds a threshold, we can immediately dispatch the right domain expert to begin troubleshooting the problem, and we have some confidence that the problem is on our end and within our control.

Figure 4.10: Including our data center in monitoring.

The corollary, of course, is that the monitoring solution also needs a way to monitor the resources not under our direct control. As Figure 4.11 shows, those might include cloud computing services like Amazon Web Services. This is where a monitoring solution really needs to break with traditional techniques: Instead of monitoring these external services directly, the tool must often do so through vendor-supported application programming interfaces (APIs), through direct observation of service response times, and so forth. This is really a new field in performance monitoring, so as you begin evaluating solutions, it will be important to consider different vendors' ability to draw information from the outsourced services that you rely upon to run your application.
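"Direct observation of service response times" can be sketched as follows for a service you cannot install agents on: keep a rolling window of probe results and alert when the recent average drifts past a threshold. The window size, threshold, and sample values are arbitrary illustrations, and real tools would probe from multiple locations:

```python
from collections import deque

class ResponseTimeWatch:
    """Track externally observed response times for an outsourced service."""

    def __init__(self, threshold_ms, window=10):
        self.threshold_ms = threshold_ms
        self.samples = deque(maxlen=window)  # rolling window of recent probes

    def record(self, sample_ms):
        """Record one probe result; return True if the rolling average
        of recent samples now exceeds the threshold."""
        self.samples.append(sample_ms)
        avg = sum(self.samples) / len(self.samples)
        return avg > self.threshold_ms

watch = ResponseTimeWatch(threshold_ms=500, window=5)
for sample in (420, 450, 430, 900, 950):  # the last two probes are slow
    alert = watch.record(sample)
print("alert" if alert else "ok")  # alert: the rolling average crossed 500ms
```

Averaging over a window, rather than alerting on single samples, is a simple way to avoid paging someone over one slow probe of a service you don't own.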

Figure 4.11: Monitoring cloud computing services.

I have to emphasize tight integration with outsourced services. For example, in the original dashboard view, you'll notice that SalesForce.com shows a status color of blue: not the green we'd hope for, but not as immediately alarming as yellow, orange, or red. Drilling down reveals a service alert, shown in Figure 4.12. As you begin working with services outside your own data center, you need to become more aware of their data center operations, including, in this case, a scheduled maintenance window that may result in diminished performance for our application that relies on SalesForce.com. By having this information right within the monitoring console, we can help manage our future performance more effectively. We might choose to temporarily turn off portions of our application that depend on SalesForce.com during that maintenance window, or we might simply need to be alert for performance problems that result from the maintenance window.

Figure 4.12: Viewing service alerts.

As you move toward a hybrid IT model, this kind of integration with external service providers will prove absolutely invaluable.

Integrated Troubleshooting Saves Time and Effort

I presented this unified monitoring concept to the client I was with, and they gave it a shot. As I was writing this chapter, they contacted me to let me know they had deployed a unified monitoring solution and that they were quite happy with it. It turns out the problem was in the SQL Server, and had something to do with the way the server was recompiling that query. The DBA was, unfortunately, a bit stubborn about admitting it, but with a single tool showing perfect performance in all but his application component, he was forced to start troubleshooting the problem. By getting the same tool in front of everyone, each discipline's fences started to come down a bit. Sure, they were all responsible for their individual elements, but it gets a lot harder to toss a problem over the fence when everyone can clearly see that it's on your side.

The Provider Perspective: Providing Details on Your Stack

MSPs have a more challenging time of monitoring. They must not only monitor their own data centers, contending with all the multi-discipline problems I've described in this chapter, but also provide their customers with a rolled-up view of their services. Ideally, they should do so in a way that customers can integrate into their own monitoring tools so that service alerts and other information rolls down from the MSP's data center into the customers' monitoring tools. With the right monitoring tools, you can do that pretty easily. What you're really after is a tool that gives you all the detailed, cross-discipline, unified monitoring you want within your data center, with the ability to aggregate some of that data into a status indicator for your data center, done in such a way that your aggregate indicator can become a part of your customers' dashboards. Figure 4.13 illustrates the concept.

Figure 4.13: Aggregating your MSP network for customer information.
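The roll-up itself is conceptually simple: the aggregate indicator a customer sees is typically the worst status among the components backing their service. A sketch of that idea (the component names and the five-color scale are invented for illustration):

```python
# Roll component statuses up into a single customer-facing indicator,
# worst-wins. Severity order: green (healthy) < blue (informational)
# < yellow < orange < red (critical). Component names are invented.
SEVERITY = {"green": 0, "blue": 1, "yellow": 2, "orange": 3, "red": 4}

def rollup(component_statuses):
    """Return the worst status in the group: the one light the customer sees."""
    return max(component_statuses.values(), key=SEVERITY.__getitem__)

customer_view = rollup({
    "web-farm": "green",
    "storage-cluster": "yellow",   # nearing a capacity threshold
    "core-network": "green",
})
print(customer_view)  # → yellow
```

The detail behind the indicator stays inside the MSP's console; only the aggregate crosses the fence into the customer's dashboard.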

Of course, you're quite likely going to want to provide your customers with more detailed information as well, because they're probably going to demand it: custom dashboards specific to your service offerings, quality of service (QoS) reports, SLA reports, and more. The last chapter of this book will dive more into those offerings.

Coming Up Next

You should now have a vision for how your cloud-based or hybrid applications can be monitored from the EUE and the component level. In the next chapter, we'll start exploring some of the specific capabilities you need to start monitoring a hybrid IT environment from your data center into the cloud. Consider the next chapter to be a sort of shopping-list layout of all the features that you should at least be considering in a monitoring solution.

Chapter 5: The Capabilities You Need to Monitor IT from the Data Center into the Cloud

In the previous chapters of this book, I've covered a lot of the why and how of unified monitoring. In this chapter, I want to start focusing specifically on the what: What capabilities you need to bring into your environment to successfully manage and monitor a hybrid IT infrastructure and its applications. Think of this chapter as an instructional guide for building a shopping list. I'll cover not only features but also some of the finer, easily overlooked details that can make all the difference in a successful implementation. First, though, let's clearly define some of the major business and technical goals for this kind of evolved, hybrid IT monitoring. As I do so, I want to re-introduce the folks from World Coffee, the case study I introduced in Chapter 1.

Business Goals for Evolved Monitoring

I want to cover the business goals for monitoring first because, in reality, the business goals are the only ones that matter. The business is paying for not only the monitoring solution but also the applications and infrastructure being monitored; meeting the goals of the business is really the whole point of all of this. So what might a business hope to gain from a more evolved form of monitoring?

EUE and SLAs

The business's primary concern, of course, is to have applications and an infrastructure that perform well. A problem, one I discussed in the first chapter of this book, is that many businesses have given up on simply saying, "we want our applications to perform well," and have instead gotten themselves bogged down in the minute details of application performance. But knowing that Server5 is running at 80% utilization isn't truly a business goal, although many businesses have accepted that this is how they have to define good performance.

They shouldn't accept that. Instead, businesses should back off a level and concern themselves with what really concerns them: applications that perform, from an end-user perspective, in a way that supports the business requirements. That's the end-user experience, or EUE, that I've referred to throughout this book. "We want users to be able to complete their checkout process in 5 seconds or less from the time they click Submit Order." That's an EUE-focused metric, and it can become a part of formal Service Level Agreements (SLAs), which are used to communicate the desired EUE-level performance across the business and its IT team.

Let's be clear on something: A monitoring solution that doesn't allow you to quickly determine the current EUE metrics, and that doesn't help you manage to an SLA-defined EUE metric, is not a monitoring solution you should be using. Different vendors take very different approaches to how they show you the EUE, how they determine the EUE, and so forth; those are technical details that are important, but the most important thing is that the solution give you some means of managing the EUE.

EUE: All That Matters to the Business

Businesses have been unaccustomed to dealing with an EUE metric for so long that many will resist the concept of relying solely on the EUE simply through force of habit. I had a recent consulting client tell me that they were happy defining an EUE metric like, "Sales orders must be accepted by the system in 2 seconds or less after submission." But they still wanted to add other things to their SLA, like a requirement that the system must have a minimum 99.5% uptime. Think about it: If the system is down, it isn't accepting orders in 2 seconds or less, so you didn't meet the EUE metric. There's no need to specify anything else. "We want to put maintenance windows in the SLA," they told me. Well, that's fine; make it part of the EUE.
"Sales orders must be accepted within 2 or fewer seconds between the hours of 7am and 7pm; outside of those hours, sales orders do not need to be accepted by the system." That makes it clear what end users should expect of the system: it might not be available to them from 7pm to 7am. By stating things in that kind of end-user context, you're communicating not only your desires to your IT team but also your commitment to your end users. Everyone's using the same language. End users don't have maintenance windows, after all; they have expectations for when they'll be able to do their jobs. State your SLAs in those terms, and manage those expectations.
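Stated that way, the SLA becomes directly testable: an order only counts against the metric if it was submitted inside the covered window. A sketch of the evaluation logic a monitoring system applies (the sample data is invented):

```python
from datetime import datetime

SLA_SECONDS = 2.0
WINDOW = (7, 19)  # orders must be accepted within 2s between 7am and 7pm

def breaches(orders):
    """Return the orders that violate the SLA.

    Each order is (submitted_at, seconds_to_accept). Submissions outside
    the 7am-7pm window carry no promise at all, so they never count.
    """
    return [
        (ts, secs) for ts, secs in orders
        if WINDOW[0] <= ts.hour < WINDOW[1] and secs > SLA_SECONDS
    ]

sample = [
    (datetime(2010, 8, 2, 9, 15), 1.4),   # in window, fast: fine
    (datetime(2010, 8, 2, 14, 3), 3.1),   # in window, slow: breach
    (datetime(2010, 8, 2, 22, 40), 8.0),  # outside window: no promise made
]
print(len(breaches(sample)))  # → 1
```

Notice that the 10:40pm order is slow but blameless; the SLA, as written, simply doesn't apply to it, which is exactly the expectation the wording sets for end users.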

"We want to add a clause that systems must not run at more than 80% capacity." They really insisted on that one, but I eventually convinced them that while such a metric might be a good IT management guideline, it didn't belong in an SLA. So long as the EUE metrics were met, it wouldn't matter how burdened the systems were. If IT could meet that 2-second rule with a 95% loaded server, more power to them! They'd be saving money by doing more work with fewer resources. And what does "95% loaded" mean, anyway? 95% processor capacity? Network throughput? Disk I/O? Don't dive into the technical details within an SLA: Try to stick with EUE-based metrics that describe your desired bottom-line performance, and make sure IT has the tools they need to manage the technical components to your EUE-based SLA.

Let's check back in with World Coffee on this. I'll re-introduce some of these folks, as we haven't heard from them in a couple of chapters.

Ernesto is an inside sales manager for World Coffee, a gourmet coffee wholesaler. Ernesto's job is to keep coffee products flowing to the various independent outlets who resell his company's products. Like most users, Ernesto consumes basic IT services, including file storage, email, and so on. He also interacts with a customer relationship management (CRM) application that his company owns, and he uses an in-house order management application. Ernesto works on a team of more than 600 salespeople that are distributed across the globe: His company sells products to resellers in 46 countries and has sales offices in 12 of those countries. Ernesto's biggest concerns are the speed of the CRM and order management applications. He literally spends three-quarters of his day using these applications, and much of his time today is spent waiting on them to process his input and serve up the next data entry screen. He needs that process to be quick and efficient.
When he looks up information or submits new information, such as sales orders, he needs the system to be responsive. Every minute he spends waiting is a minute he's not selling, and those wasted minutes can add up quickly.

Budget Control

As businesses start moving toward hybrid IT environments and applications and incorporating outsourced components, budget starts to become a very real concern. For example, consider how internally hosted applications and services are priced: The business pays some up-front cost to acquire a solution, often some kind of recurring maintenance, and will have some amount of IT staff time spent on supporting the solution. Those costs are relatively easy to determine, and are pretty much fixed. If the business has a really busy week, the application will cost about as much to support as it would in a really slow week.

Now consider how cloud services are priced. Some, like many SaaS solutions, may be priced per user, based on storage consumption, and so forth, which is relatively easy to track, trend, predict, and control. Other services, however, are priced based on actual usage. Here's the current pricing, as of August 2010, for Microsoft's Windows Azure cloud computing platform:

Compute = $0.12/hour
Storage = $0.15/GB stored/month
Storage transactions = $0.01/10K
Data transfers = $0.10 in/$0.15 out/GB ($0.30 in/$0.45 out/GB in Asia)

This is a fairly typical pricing model for cloud computing; companies like Rackspace, Amazon, and others all have similar pricing models. You're paying based on usage. What will your monthly bill be? There's no way to know in advance. This is where a truly unified monitoring system can help. In addition to tracking raw performance and high-level EUE metrics, a solution can also keep track of your service consumption. It can help you see how much you're using, and therefore how much you'll be paying. You'll be able to keep an eye on your costs, and relate those costs to the income produced by your hybrid applications. If an application is consuming more than it is returning, you'll be able to address the problem before your bills start getting out of hand. This is an entirely new territory for monitoring software. It's made more complicated by the fact that some applications are really super-distributed across different hosting providers. Consider, for example, Figure 5.1. This is an example I've used before, but it bears repeating in this new, budgetary context.

Figure 5.1: Example of a super-distributed hybrid application.
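To see why tracking consumption matters, the usage-based Azure rates quoted earlier can be folded into a rough monthly estimate. A minimal sketch (the instance counts and usage figures below are hypothetical, and real bills include line items this ignores):

```python
# Rough monthly cost estimate from usage-based rates (the August 2010
# Azure prices quoted above, non-Asia transfer rates). All the usage
# numbers in the example call are hypothetical.
RATES = {
    "compute_per_hour": 0.12,        # $ per instance-hour
    "storage_per_gb_month": 0.15,    # $ per GB stored per month
    "txn_per_10k": 0.01,             # $ per 10,000 storage transactions
    "transfer_in_per_gb": 0.10,      # $ per GB transferred in
    "transfer_out_per_gb": 0.15,     # $ per GB transferred out
}

def monthly_estimate(instances, gb_stored, txns, gb_in, gb_out, hours=720):
    """Estimate a 30-day (720-hour) bill from observed usage."""
    return round(
        instances * hours * RATES["compute_per_hour"]
        + gb_stored * RATES["storage_per_gb_month"]
        + (txns / 10_000) * RATES["txn_per_10k"]
        + gb_in * RATES["transfer_in_per_gb"]
        + gb_out * RATES["transfer_out_per_gb"],
        2,
    )

# Two always-on instances, 50 GB stored, 2M transactions, 100 GB in, 300 GB out:
print(monthly_estimate(2, 50, 2_000_000, 100, 300))  # → 237.3
```

The arithmetic is trivial; the hard part, and the monitoring system's job, is capturing those five usage numbers accurately across every provider you use.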

This application has a certain number of resources in your own data center, shown by the Windows, Linux, Oracle, and VMware icons. Those costs, as I've said, are relatively fixed and predictable. The application also relies on SalesForce.com, which is a SaaS offering, and on the cloud computing platform Rackspace Cloud. Your use of SalesForce.com might be per-transaction (perhaps it's being used to issue license keys to customers), and your cloud computing costs will likely be based on actual usage as well. Being able to track that usage, and therefore those costs, across all those different platforms can be quite complex. The business has a clear need for a monitoring platform that can unify all of that information into a single place so that you can get a true and accurate picture of your costs.

John works for World Coffee's IT department and is in charge of several important applications that the company relies upon, including the CRM application and the in-house order management application. World Coffee has moved to a hybrid IT application for its in-house order management. The CRM element is now outsourced to SalesForce.com, a SaaS provider; the in-house order management application is Web based, and runs in Amazon's EC2 cloud computing platform. Amazon is also used for customer-facing ordering applications. John needs to make sure that every user of every system is experiencing response times at or better than those defined by the company's SLAs. This task is difficult because many of the applications' elements are outside his direct control. He needs ways to directly test the EUE as well as ways to check on the direct response times for the various SaaS, cloud computing, and other outsourced IT elements. John is also responsible for providing his boss with information on how much all of this is costing the company. Since the switch to hybrid IT, the company has spent more on outsourced IT services than they anticipated.
They've seen an increase in sales volume, so it's likely that the additional expenses are justified, but they need some way to closely track actual usage and charges so that they can relate that more directly to the resulting sales volume.

Technology Goals for Evolved Monitoring

IT's job is to implement the SLA that the business has agreed to. Their job is to make sure that the EUE metrics needed by the business can be delivered within whatever parameters the SLA outlines. That means IT essentially needs tools that can tell them when the EUE is starting to go wrong, and help them find the root cause of the problem so that they can fix the problem before the SLA is missed.

Centralized Bottom-Up Monitoring

Today, one of IT's biggest challenges is that they simply can't get enough information onto a single, consolidated screen. Instead, they're stuck looking at numerous consoles, as illustrated in Figures 5.2 and 5.3. One for the database. One for VMware. One for Windows. One for Active Directory (AD). One for Exchange. One for the other database. In addition, few of these consoles have any ability to look into outsourced systems like SalesForce.com, Rackspace Cloud, Amazon EC2, Google AppEngine, and so forth. These consoles offer no means of calculating the EUE; thus, they can't tell you whether you're meeting the EUE metric or not.

Figure 5.2: Database performance.

Figure 5.3: VMware performance.

EUE metrics cannot be derived from looking at component performance; the EUE must be directly observed, and it takes a central, unified monitoring solution capable of doing so to get an accurate EUE measurement. If you're not meeting your EUE metric, however, you still need to look at the individual components' performance to find out which ones are contributing to the reduced performance. Again, a unified monitoring console gives you this capability by putting every component's performance right in front of you, including outsourced elements like cloud computing platforms, hosted services, SaaS services, and so on. So that's the technical need: Everything in one place. It's the only way to meet the business's EUE-centric SLAs.

Improved Troubleshooting

IT departments also have a need for more streamlined, efficient troubleshooting. When something does start to go wrong with application performance, the IT team needs to be able to quickly pinpoint the cause of the problem and bring domain-specific tools to bear so that the problem can be solved quickly. A unified monitoring platform is generally regarded as the best way to avoid the siloing that can occur during IT troubleshooting scenarios. In other words, by getting every team member on the same screen, with the same information, everyone can agree more quickly on which major application element is causing or contributing to a problem, rather than each team member using individual domain-specific tools and independently stating that their component is working fine. Once affected application elements are identified, either the unified monitoring solution or domain-specific troubleshooting tools, or a combination of both, can be used to further refine the root cause of the problem and to discover a solution. The key is getting everyone on the same page.
A unified monitoring solution does this by presenting similar statistics for a database server, virtualization host, Windows server, cloud computing platform, and so on. While each of these elements will obviously have a variety of different performance metrics that need to be examined, by bringing them all together into the same place, and presenting them similarly, the solution can create a sort of level playing field, providing a more authoritative starting point for troubleshooting.

A Shopping List for Evolved Monitoring

Those are our business and technology goals. Now we need to outline the exact capabilities that an evolved, hybrid IT monitoring system needs.

High-Level Consoles

It's a screen shot I've shown you before, but the first and most important aspect of a truly unified monitoring system is a high-level console that gives you a broad, dashboard-style view of your overall application. Figure 5.4 shows an example.

Figure 5.4: Top-level, unified monitoring dashboard.

This dashboard provides at-a-glance information for several cloud-based services, and offers an overview chart of response time from those services. A line chart shows historical response times for the past few days or hours, helping administrators quickly identify performance trends, spot weak links, and so forth. Drilling down into any of these services provides additional information. Additional panels might include your own data center; Figure 5.5 shows what a next-level drill-down into that data center might look like. Here, we can see a more detailed view of what's happening in the data center. Busy routers are highlighted, and graphs show top utilization levels for storage, memory, processor, and so forth. We're essentially treating our data center as a kind of cloud that's fully under our control. This second-level drill-down lets us dive into the cloud and see some of the individual elements that run it.

Figure 5.5: Drilling down into a data center.

For other data centers, our cloud providers, for example, we might not get that same level of detail. After all, the cloud is supposed to just be a big bucket of functionality and services, not individual servers. So the drill-down here might offer a different kind of detail, as Figure 5.6 shows. Here, we're presented with information on specific service instances, response times (over time), and the number of transactions we've been sending to this provider, a key in helping us meet that business requirement of monitoring actual usage. Every cloud provider's drill-down might be a bit different, as each one works somewhat differently.

Figure 5.6: Drilling down into a cloud provider.

When there's a performance problem with a cloud provider, this is likely as far down as we'd drill; the next step is to get them on the phone and find out what's happening. Within our own data center, however, we'd likely want to drill a bit deeper.

Domain-Specific Drill-Down

Within our own data center, having additional levels of detail can help focus troubleshooting efforts. For example, in Figure 5.7, we can see alerts generated from a specific server as well as summary information for key performance metrics from a variety of servers. By configuring performance thresholds, we can receive alarms when something looks wrong. These should be from across all our servers, whether they're running Windows, Linux, Unix, or whatever. In fact, in the figure, you can see that both Windows servers running SQL Server and Red Hat Linux computers are included in the list of alarms. Getting all of this information onto the same page will help direct efforts to resolve these alarms.

Figure 5.7: Drilling deeper into specific servers.

Performance Thresholds

Ideally, a monitoring solution will come preconfigured with performance thresholds based on the vendor's experience with the technologies involved. Commonly, you'll also be able to define your own thresholds, perhaps defining a larger pad between good performance and bad performance to give your team more time to react. As Figure 5.8 shows, these thresholds should be used to create instantly readable visual displays: indicators, graphs, and gauges that help draw your eye to elements that need the most immediate attention.
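Under the hood, a threshold-driven display is just a comparison of each sampled metric against its configured bands. A minimal sketch (the metric names and band values are illustrative, not any vendor's defaults):

```python
# Classify a sampled metric value against warning/critical thresholds.
# Band values here are illustrative; a real product ships defaults and
# lets you widen the pad between "good" and "bad" for more reaction time.
THRESHOLDS = {
    "cpu_pct":       {"warning": 70, "critical": 90},
    "disk_used_pct": {"warning": 80, "critical": 95},
    "queue_length":  {"warning": 50, "critical": 200},  # e.g., mail queues
}

def classify(metric, value):
    """Map a sample to the indicator color an operator would see."""
    bands = THRESHOLDS[metric]
    if value >= bands["critical"]:
        return "red"
    if value >= bands["warning"]:
        return "yellow"
    return "green"

print(classify("cpu_pct", 85))        # → yellow
print(classify("queue_length", 350))  # → red
```

Everything else, gauges, alarms, escalation, is presentation and workflow layered over this one comparison.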

Figure 5.8: Performance thresholds help drive graphical displays.

Thresholds will be different across different technologies. This example is for Exchange Server, so it includes information about message queues and transfer agents in addition to more generic metrics such as memory, CPU, disk, and network measurements.

Broad Technology Support: Virtualization, Applications, Servers, Databases, and Networks

A unified monitoring solution is only as useful as the number of your technologies it can unify into a single console, and ideally, you want a solution that can handle everything you've got. There are subtle differences in how solutions work. Here are some considerations:

Virtualization
o Look for VMware, Hyper-V, Citrix, Sun, and IBM support
o Consider support for agentless monitoring, which has less impact on your infrastructure and means less long-term support and maintenance

Applications
o Look for templates that provide a starting point for Quality of Service (QoS) metrics
o Look for messaging and directory services support, for Exchange and AD at a minimum, along with any other technologies you have in house, such as Lotus Notes/Domino
o For Web servers, Internet Information Services and Apache support are both desirable
o Collaboration platforms should be supported, too, including Lotus Notes/Domino and Microsoft SharePoint

o You might need support for Voice over IP (VoIP) services, such as Cisco VoIP components
o Also consider support for application services, such as those from Citrix, IBM WebSphere, Oracle WebLogic, JBoss, Tomcat, and Sun's Java Virtual Machine

Servers
You probably have a variety of servers in your environment, and you want to make sure they're all included in your monitoring solution:
o AS/400 (for example, Figure 5.9 shows how a monitoring solution might raise alarms for IPL or reboot conditions)
o Linux
o Unix
o NetWare
o Windows

Figure 5.9: Viewing AS/400 alarms.

Databases
These form the backbone of most modern applications, and you don't want to be tied to just one or two simply because they're all your monitoring solution supports; get maximum flexibility by looking for support for:
o Sybase
o Informix
o Oracle
o Microsoft SQL Server
o DB2
o MySQL

Networks
The infrastructure that connects everything can play a critical role in your applications' performance; look for a solution that can monitor:
o Cisco's IP SLA (Figure 5.10 shows a monitoring dashboard for Cisco IP SLA, which is critical on modern converged networks carrying voice, data, video, and other traffic)
o Core DNS, DHCP, and LDAP services
o SNMP management information
o Routers and switches
o Raw network traffic

Figure 5.10: Monitoring Cisco IP SLA.

End-User Response Monitoring

There are two main kinds of EUE monitoring: active and passive. With active monitoring, you can actually set up synthetic transactions to feed into your system. The monitoring solution can trace those, and accurately report on response times. You might end up with a screen like the one in Figure 5.11, showing granular detail for user transactions.

Figure 5.11: Active user transaction monitoring.

Passive monitoring doesn't inject transactions into the system but rather monitors the individual system elements and creates an aggregated EUE metric. It isn't always as accurate as active monitoring, but it's completely non-intrusive. Most organizations use both active and passive monitoring, and you should look for a solution that presents both, because the information they generate is largely complementary.

SLA Reporting

This is something we'll dive into more in the next chapter, but SLA reporting is absolutely a capability your monitoring solution must provide. Even a straightforward report like the one Figure 5.12 shows can be tremendously useful. It shows a breakdown of specific elements of the application (launch, login, lookup, and logoff) and indicates which elements are meeting their SLA. It also gives you the good or bad news, indicating whether, at current rates, you are trending toward a breach in your SLA.
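That "trending toward a breach" indicator is usually a straightforward extrapolation: fit a line to recent measurements and project when it crosses the SLA limit. A sketch of the idea (the sample history is invented, and real products use more robust statistics than a simple least-squares line):

```python
# Project when a metric's linear trend will cross its SLA limit.
# samples: list of (hour, avg_response_seconds); needs 2+ distinct hours.
def hours_until_breach(samples, sla_limit):
    """Fit a least-squares line; return hours past the last sample until
    the trend crosses sla_limit, or None if the trend is flat/improving."""
    n = len(samples)
    mean_x = sum(x for x, _ in samples) / n
    mean_y = sum(y for _, y in samples) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in samples)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in samples) / var_x
    if slope <= 0:
        return None  # flat or improving: no projected breach
    intercept = mean_y - slope * mean_x
    crossing = (sla_limit - intercept) / slope
    last_x = max(x for x, _ in samples)
    return max(crossing - last_x, 0.0)

# Response time creeping up 0.2s per hour against a 2-second SLA:
history = [(0, 1.0), (1, 1.2), (2, 1.4), (3, 1.6)]
print(round(hours_until_breach(history, 2.0), 6))  # → 2.0
```

A report built on this can turn "login is slow" into "login will breach its SLA in roughly two hours," which is far more actionable.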

Figure 5.12: An SLA report.

Public Cloud Support: IaaS, PaaS, SaaS

Finally, because so many applications are becoming reliant on cloud-based services, your monitoring solution must be able to include them, whether you're using Infrastructure as a Service (IaaS, cloud computing like Amazon's), Platform as a Service (PaaS, cloud computing like Microsoft's Azure), or Software as a Service (SaaS, like SalesForce.com). Popular choices you should absolutely be able to include in your monitoring are:
Amazon EC2 and S3
Rackspace Cloud
Google AppEngine
Windows Azure
SalesForce.com CRM

The Provider Perspective: Capabilities for Your Customers

As a Managed Service Provider (MSP), you become a part of your customers' IT teams. Typically, that means you have a dual problem that your customers don't often face: You need to maintain, monitor, and manage your own network for your own reasons; after all, you want to provide excellent services to your customers. You also need to help your customers include your network and applications in their own monitoring. There are a number of ways in which you can do this (many of which we'll discuss in the next chapter), but the bottom line is that you need to provide your customers with some visibility into your infrastructure so that they can treat your systems as a true part of their systems.

Li works for New Earth Services, a cloud computing provider. Li is in charge of their network infrastructure and computing platform and is working with World Coffee, who plans to shift their existing Web services-based order management application into New Earth's cloud computing platform. Li knows that he'll have to provide statistics to World Coffee's IT department regarding New Earth's platform availability because that availability is guaranteed in the SLA between the two companies. However, he also knows that he'll need to provide them with more detailed insight into certain aspects of New Earth's infrastructure. After all, World Coffee is essentially making New Earth's network a part of World Coffee's network through their hybridized IT applications, so as a customer, they deserve some insight. Li plans to search for monitoring applications that can be used by his internal network engineers and that will also allow him to provide dashboards and reports directly to customers like World Coffee. That way, he doesn't have to build his own monitoring and data provisioning mechanism.

MSPs often look for monitoring systems that have multi-tenant capabilities. In other words, the MSP can buy a monitoring system that lets them monitor their entire infrastructure while also providing monitoring capabilities directly to their customers, restricting each customer so that they can only see their portion of the MSP's infrastructure. Such monitoring systems are obviously more complex. In some cases, the monitoring itself might be something that the MSP offers as an additional service. Imagine a conversation between Li, who works for an MSP, and John, who works for one of Li's customers:

John: Will we be able to monitor the servers that we're using on your network?

Li: It's possible. Let me ask: what sort of monitoring tools do you have now?

John: We use a lot of different ones. We have some for Oracle, others for VMware, and others for Microsoft Windows. We don't have anything specifically designed for monitoring cloud-based resources.

Li: You know, in addition to the cloud computing that we're providing you, we can provide you with a complete unified monitoring solution. It's basically a Software as a Service offering. It can monitor your entire infrastructure in one console, including your Oracle, VMware, and Windows systems. It can also include monitoring of our systems so that you'll get your entire network, including those bits you've outsourced to us, on the same screen. There's minimal impact on your network, and you wouldn't be responsible for maintaining or patching the monitoring system; it's just a service we'd provide to you.

John: I never knew such a thing was possible.

It is possible. Vendors are becoming increasingly creative and efficient at handling this kind of hybrid IT environment, and the ability for service providers to offer monitoring as just another service to their customers is, well, compelling.

Coming Up Next

In the next and final chapter of this book, I'll focus on one last set of capabilities that your monitoring solution should offer: reporting. It's very easy to get caught up in the actual details of monitoring, concerns about performance alerts, and setting thresholds, and forget that management reporting is equally important. I'll look at different kinds of reports that can be used to help manage SLAs and keep the business on budget, as well as reports that show component-level health, usage trends, and so on. I'll also look at newer kinds of reports, including dashboards, in addition to ways in which you might want to leverage performance information elsewhere, such as data stores and application programming interfaces (APIs). For the MSP perspective, I'll also look at how things like multi-tenant capabilities can help deliver added value to your service offerings.

Chapter 6: IT Health: Management Reporting as a Service

I've spent a lot of time in this book explaining the capabilities and technologies you need to add to your environment in order to enable truly hybrid, data center-to-the-cloud application and service monitoring. But all of that monitoring is useless without output: One of the end goals of this entire effort is to provide your managers and executives, whether they are internal customers or external customers, with effective reports: dashboards and other elements that show a manager that the environment is healthy and on budget, or that show them which service (not IT component) isn't doing well. The goal of this final chapter is to focus on these reports, what they should look like, and what value you can expect to derive from them.

Note: This is an unusual chapter in that I'll mainly be presenting examples of reports. My goal is to help you develop a kind of shopping list for the types of reports you should look for in systems that you're evaluating, and to explain some of the finer details that I like in these reports. Most of these examples are taken from live systems, so in some cases, I've obfuscated customer-specific information such as publicly accessible server names, IP addresses, and so forth.

The Value of Management Reporting

There's no question that reporting has value, but what, specifically, is that value? In other words, what should you expect reports to provide other than pretty graphs? What will you get out of reports? Let me quickly outline the major points so that I can then show you examples of monitoring system reports that deliver those benefits.

Business Value

Businesses look, primarily, for reliability and return on investment (ROI).
Specifically, businesses want reports that can:

- Monitor compliance with service level agreements (SLAs)
- Monitor application performance from an end-user perspective
- Track utilization, especially when that utilization relates to cost, as it does with most cloud computing platforms
- Help predict growth in utilization (to help estimate the costs of supporting that growth)
- Assist in maintaining maximum uptime and responsiveness for entire applications

Technology Value

Technologists need reports and tools that can help them achieve the business goals. That means technology-focused reports are often a bit (or a lot) more detailed, focusing on implementation details that support the business' high-level views and metrics. Technologists look for reports that can:

- Quickly detail key performance metrics for components, highlighting out-of-tolerance metrics that require attention
- Show usage trends so that IT can predict when usage will exceed the system's ability to perform within tolerance
- Dive from high-level metrics, like end-user experience measurements, into deeper, technology-specific metrics for troubleshooting purposes

Reporting Elements

In the examples that follow, I will highlight specific capabilities of a monitoring system. In most cases, I'll call out specific features of these reports that I find especially useful and that I think you should look for in your own monitoring solution. I'll spend the most time on detailed performance reports because those provide the bulk of the intelligence you'll need to operate your infrastructure. I'll also look at SLA-specific reports and a few dashboards that help provide a high-level, at-a-glance view of the environment or specific applications and services.

Performance Reports

Let's dive into the examples I've gathered. First up, in Figure 6.1, is a look at Exchange Server availability. This report really highlights the value of being able to monitor a hybrid IT environment: This Exchange system is hosted at Rackspace, not in our own data center. Being able to monitor system uptime, especially when other applications depend on this system, is crucial to maintaining the overall performance of our environment.

Figure 6.1: An example Exchange availability report.

Note
You're seeing a really good day on my Exchange Server; here the performance line is the blue one at the very top of the graph, indicating 100% uptime.

Figure 6.2 shows another way of monitoring availability, and it illustrates a key capability that you should look for. This chart shows overall availability for a variety of services: Web mail, SMTP, POP, and so on. Those are the services you rely on, so rather than monitoring the system, this report is monitoring those services. This kind of service-level availability is important for anyone who is relying on hosted or cloud-based services as a part of their IT infrastructure.

Figure 6.2: An example general availability report for Rackspace-hosted services.

Note
Once again, everything's looking good: all of these services are at 100%. Boring-looking performance charts are the ones you hope to see all the time!

Figure 6.3 shows another hosted element: Salesforce.com statistics. This is a more basic performance report, showing the number of transactions as well as a couple of service-level-type statistics: transaction speed and overall system status. There are numerous other stats you would want to track for Salesforce.com if you relied upon it, and your monitoring system should deliver them in this kind of easy-to-read, live report.

Figure 6.3: Example Salesforce.com statistics.

Note
Notice that a drop in the number of transactions isn't bad, although the drop in transaction speed might be worrying. Neither of these statistics affects the system's total uptime, shown on the bottom graph as 100%.
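The service-level availability checks shown in Figures 6.2 and 6.3 boil down to probing each service endpoint on a schedule. Here is a minimal sketch of that idea in Python; the host names and ports are placeholders, and a real monitor would speak each protocol (SMTP, POP, HTTP) rather than merely opening a TCP connection:

```python
import socket

# A minimal availability probe: try a TCP connection to each service
# endpoint, the way a monitor checks Web mail, SMTP, POP, and so on.
# Hosts and ports here are placeholders, not real endpoints.
SERVICES = {
    "smtp": ("mail.example.com", 25),
    "pop3": ("mail.example.com", 110),
    "webmail": ("mail.example.com", 443),
}

def check_services(services, timeout=3.0):
    """Return a dict of service name -> True/False availability."""
    status = {}
    for name, (host, port) in services.items():
        try:
            with socket.create_connection((host, port), timeout=timeout):
                status[name] = True
        except OSError:
            status[name] = False
    return status
```

Calling `check_services(SERVICES)` returns a name-to-status dict that a report like Figure 6.2 could chart over time.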

If you're using a network (and what else would you be using?), you should be concerned about its performance and availability. Figure 6.4 highlights another key capability I've discussed throughout this book: getting everything into one monitoring system. Even though you can monitor network protocol statistics using other tools, you should still want them monitored in the same place as everything else. That way, when a problem occurs, you have all the troubleshooting information you need in one place.

Figure 6.4: Network statistics from Cisco NetFlow.

Note
No DNS traffic at all? That could potentially be a bad sign, except that in my network, most DNS is resolved internally, so we're not seeing DNS use much bandwidth.

Figure 6.5 is another example of a service-level report, showing overall network bandwidth utilization. I especially like the inclusion of a high-water-mark line, showing where bandwidth has maxed out in the past 12 hours. Notice that there's also a breakdown of protocol traffic, so if there is a problem, you can get a good idea of which protocol is contributing to that problem.

Figure 6.5: Network bandwidth.

If you have any service or application that relies on external services, such as an external LDAP server, then you need to be able to monitor that. Figure 6.6 shows how a monitoring system can do so, connecting to an external LDAP system and measuring response times. By establishing health thresholds for these response times, you can start to create alerts and other notifications when response times exceed your tolerances.
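As a rough illustration of that threshold idea, the sketch below times a connection to the service and raises an alert when the measurement exceeds a tolerance. The host, port, and 250 ms threshold are assumptions, and a timed TCP connect stands in for a full LDAP bind:

```python
import socket
import time

def response_time_ms(host, port=389, timeout=5.0):
    """Time a TCP connect to the service; a stand-in for a full LDAP bind."""
    start = time.perf_counter()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.perf_counter() - start) * 1000.0

# Health threshold for the response time (an assumed value).
THRESHOLD_MS = 250.0

def check(host):
    """Measure once and alert if the tolerance is exceeded."""
    elapsed = response_time_ms(host)
    if elapsed > THRESHOLD_MS:
        print(f"ALERT: LDAP on {host} responded in {elapsed:.1f} ms")
    return elapsed
```

Run on a schedule, each `check()` result becomes one point on a chart like Figure 6.6, with alerts fired whenever the line crosses the threshold.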

Figure 6.6: Service response time: LDAP.

Traditional server monitoring should be included as well, as illustrated in Figure 6.7, which shows common stats for a Windows server. Again, it's not so much that you don't already have monitoring tools that can do this; it's that you want all your monitoring information in one place, whether it's a simple Windows server or a completely outsourced server or service.

Figure 6.7: Windows server utilization.

Note
Figure 6.7 is huge, and I actually cropped off additional charts. This is one reason I prefer Web-based reports: the browser can scroll as much as it needs to in order to display large, detailed reports.

Broad platform coverage is a must. Even if you don't have Unix (or Linux) today, for example, you might well have a server or two in the future. As Figure 6.8 shows, your monitoring system needs to be able to accommodate that growth. I like to see reports like this, which essentially mirrors the Windows report and includes specifics for Unix. Windows and Unix are similar, and their performance would be monitored similarly, but a monitoring system can't ignore their unique aspects.

Figure 6.8: Linux server utilization.

As the use of virtualization grows, so must your ability to monitor it, no matter which brand you're using. Figure 6.9 shows guest performance statistics on an IBM virtualization host, using terms and elements that are specific to IBM's implementation.

Figure 6.9: IBM virtualization guest statistics.

Note
Another excellent server: notice the stable performance over time.

Figure 6.10, however, shows that a monitoring solution can include other brands, such as VMware. This figure focuses on host statistics, showing key performance indicators for CPU, network, and so forth. Again, you want cross-platform reports to look similar so that you can do a sort of apples-to-apples mental comparison, but you don't want to exclude vendor-specific information.

Figure 6.10: VMware vCenter monitoring.

More and more companies are adding VoIP to their technology mix, and there's no reason a monitoring system can't include that. Figure 6.11 shows a report for Cisco's CallManager system, providing a way to monitor and troubleshoot VoIP performance.

Figure 6.11: Cisco CallManager monitoring.

Every business has databases, and some of your applications will access those via Java Database Connectivity (JDBC), so you need to be able to monitor JDBC. Figure 6.12 is my first example of low-level, under-application monitoring, showing JDBC statistics. This might not be a report you look at first when a problem arises, but the point is that a monitoring system should provide this kind of information to help you dive deeper into a problem and either confirm or eliminate potential systems and technologies as the source of, or a contributor to, a performance problem.

Figure 6.12: JDBC statistics.
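To illustrate the kind of under-application timing such a report is built from, here is a sketch that times individual statements, using Python's built-in sqlite3 module in place of JDBC; the table and queries are invented for the example:

```python
import sqlite3
import time

# Time individual statements against a database, the sort of
# under-application metric a JDBC report surfaces. An in-memory
# SQLite database and a made-up table stand in for a real system.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, total REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?)",
                 [(i, i * 1.5) for i in range(1000)])

def timed_query(sql):
    """Run a query and return (rows, elapsed milliseconds)."""
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return rows, elapsed_ms

rows, ms = timed_query("SELECT COUNT(*) FROM orders")
print(f"{rows[0][0]} rows counted in {ms:.2f} ms")
```

Collected over time and broken out per statement, these timings are exactly what lets you confirm or rule out the database layer when chasing a performance problem.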

You'll want to be able to monitor the database connectivity as well as the database platform itself. Figure 6.13 shows an example of MySQL monitoring, but your monitoring solution should include support for all major platforms, including Microsoft SQL Server, IBM DB2, Oracle, Sybase, and so on.

Figure 6.13: MySQL statistics.

Getting back to service levels for a moment, take a look at Figure 6.14, which shows how a monitoring system can also provide high-level, service-focused information such as round-trip times. This is a good indicator of overall system health, and I especially like the inclusion of a trend line that shows where performance is heading. That's a great way to get on top of a problem before it becomes a problem.

Figure 6.14: Email response times (SMTP).

Other services contribute to your overall IT performance, such as Active Directory (AD), a linchpin for many Microsoft-based (and third-party) services. Figure 6.15 shows that AD responsiveness can be monitored right within the same monitoring solution, watching statistics like connect time, replication speed, search load, and so on. Again, notice the inclusion of an average line, which lets you visually ignore peaks and valleys and focus on the overall average performance of a given service.

Figure 6.15: AD response times.
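Trend and average lines like the ones in Figures 6.14 and 6.15 are just simple statistics over recent samples. As a sketch, with invented sample data, a least-squares trend line and a forward projection look like this:

```python
# Fit a least-squares trend line through recent response-time samples,
# like the trend line in a service-level report, and project it forward.
# The hours and millisecond values below are invented for illustration.
hours = [0, 1, 2, 3, 4, 5]
response_ms = [180, 184, 190, 193, 199, 204]

n = len(hours)
mean_x = sum(hours) / n
mean_y = sum(response_ms) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(hours, response_ms))
         / sum((x - mean_x) ** 2 for x in hours))
intercept = mean_y - slope * mean_x

def projected(hour):
    """Predicted response time at a future hour on the fitted line."""
    return slope * hour + intercept

# With a 250 ms tolerance, a rising slope warns you before the breach.
print(f"slope={slope:.1f} ms/hour; projected at hour 12: {projected(12):.0f} ms")
# slope=4.8 ms/hour; projected at hour 12: 237 ms
```

That projection is the "get on top of a problem before it becomes a problem" math: you can see when the fitted line will cross your tolerance, hours or days ahead of time.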

Web servers are running more and more applications, both internal and external, and monitoring the Web server platform is critical. Again, having this in a single solution makes it easier to monitor the entire application stack: Web server, database platform, database connectivity, network protocols, and so forth. You can start to see how this kind of system gives you insight into every aspect of the application, making it easier to spot and solve problems. Figure 6.16 looks at Microsoft's IIS Web server.

Figure 6.16: IIS Web server statistics.

Because few shops are completely homogeneous these days, Figure 6.17 shows that the same solution can also monitor the Apache Web server. This report is similar to the IIS one, as the two platforms are quite similar, but the Apache report includes specifics for that platform.

Figure 6.17: Apache Web server statistics.

Note
See that vertical red area on each of the three charts? That's a time period during which my monitoring system wasn't able to talk to the Web server being monitored, so it couldn't draw the chart accurately.

Figure 6.18 once again returns to a service-level-focused report, showing responsiveness for a Customer Relationship Management (CRM) solution. In fact, this particular report shows the results of synthetic transactions injected into the system to measure real-world performance from the end-user perspective: the end-user experience (EUE) that I've discussed in prior chapters. Here, we can see real-world response times for end-user tasks such as opening the application's home page, logging in, and searching. A problem at this level would drive us to dive deeper into the Web platform, database platform, network utilization, and so on.

Figure 6.18: CRM system responsiveness.
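A synthetic transaction is, at its simplest, a scripted user step with a stopwatch around it. This sketch times each step with Python's standard library; the URLs and step names are placeholders, not a real CRM system:

```python
import time
import urllib.request

# A bare-bones synthetic transaction: time each end-user step the way
# an EUE report does. URLs and step names below are placeholders.
def timed_step(name, url, timeout=10.0):
    """Fetch a URL and return the elapsed time in milliseconds."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()  # pull the whole body, as a browser would
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    print(f"{name}: {elapsed_ms:.0f} ms")
    return elapsed_ms

# The scripted "user" steps: open the home page, then run a search.
steps = [
    ("open home page", "https://crm.example.com/"),
    ("search", "https://crm.example.com/search?q=test"),
]
```

Running `timed_step` for each entry in `steps` on a schedule yields the per-task timings charted in Figure 6.18; a real EUE monitor would also log in and follow multi-page workflows.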

Finally, you can't ignore the physical aspect of your infrastructure, and Figure 6.19 shows that a monitoring system can include considerations such as your server room's temperature, provided, of course, you have the right measurement probes in place to gather this information.

Figure 6.19: Server room temperature.

SLA Reports

Having inundated you with examples, I'll just provide a couple for this section. The idea here is to roll up performance into something that can map to your SLAs, making it easier for you to manage those SLAs. Figure 6.20 shows the first example, rolling up numerous statistics into a simple "you made it or you didn't" measurement for several services, including a database server, Web server, Web response times as measured from two locations, and so forth. There's a trend analysis, too, indicating that the SLA is in no danger of being breached given current performance.

Figure 6.20: An SLA report.

A historical look is also nice, and Figure 6.21 shows an example.

Figure 6.21: Historical SLA performance.

Here, we can easily see when the SLA was breached and how badly. This can be useful when it comes time to negotiate pricing or performance, especially for hosted services and applications.
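The "made it or didn't" rollup behind these reports can be sketched as a simple calculation over availability samples; the 99.9% target and the sample data below are hypothetical:

```python
# Roll availability samples up into a "made it or didn't" SLA check.
# Each sample is True (up) or False (down); the target is hypothetical.
SLA_TARGET = 99.9  # percent

def sla_report(samples, target=SLA_TARGET):
    """Return the uptime percentage and whether it met the SLA target."""
    uptime = 100.0 * sum(samples) / len(samples)
    return {"uptime_pct": round(uptime, 3), "met": uptime >= target}

# 1,440 one-minute checks with 2 failed minutes: 99.861%, a breach.
minutes = [True] * 1438 + [False] * 2
print(sla_report(minutes))
# {'uptime_pct': 99.861, 'met': False}
```

Computed per month, these pass/fail results are exactly the rows in Figure 6.20, and keeping the history gives you the breach-by-breach view of Figure 6.21.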

Dashboards

I love dashboards, and I dislike monitoring solutions that don't provide lots of 'em. These are a great tool for quickly checking the status of your environment at a high level, and for starting the detail dive when something is wrong. Figure 6.22 provides an excellent example, showing the end-to-end component view of an application, including individual servers, connectivity between them, response times for specific services such as SQL or LDAP queries, and so on. Problem systems are conveniently highlighted in orange, directing my attention to the components that require it.

Figure 6.22: End-to-end performance dashboard.

Figure 6.23 is an EUE dashboard, showing me in simple colors and graphs what my users are experiencing when they use a particular application (in this case, my Bugzilla bug-tracking application). I can see how fast the home page is launching, how long it takes to log in, how long it takes to find bugs and open them, and so on. I get a quick view (on the right) of the major platforms that comprise this application: the application code, MySQL database, Apache Web server, and Linux operating system (OS).

Figure 6.23: EUE dashboard.

For highly distributed applications, geo views like the one in Figure 6.24 are tremendously useful. This helps you see, at a large scale, where your application may be having specific problems.

Figure 6.24: Geographic dashboard.

Finally, I especially like monitoring systems that can provide a customized, whole-environment rollup like the one shown in Figure 6.25, which was developed for a hospital. This dashboard provides an at-a-glance view of everything critical to healthcare applications in the environment. It's literally the thing you want running all the time on some monitor somewhere so that everyone can be assured that all the systems are okay, or quickly take action if something isn't.

Figure 6.25: High-level applications and services dashboard.

The Provider Perspective: Reports for Your Customers

Managed Service Providers (MSPs) will appreciate most of the reports and dashboards I've shown so far, but they also need something specific to the kind of business they're in. Most MSPs will also need the ability to look at performance from a client perspective so that they can see how a given client's services are performing. A monitoring solution should absolutely be able to provide that, and Figure 6.26 shows one way in which it might do so: grouping services by customer and showing the overall utilization for each customer.
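Grouping by customer is a straightforward rollup; this sketch (with invented customers and utilization figures) shows the idea behind a per-client view like Figure 6.26:

```python
from collections import defaultdict

# Group per-service utilization samples by customer, the way an MSP
# dashboard rolls services up per client. All data here is invented.
samples = [
    ("acme",   "web", 41.0),
    ("acme",   "db",  63.5),
    ("globex", "web", 12.0),
]

def per_customer_utilization(rows):
    """Average utilization across each customer's services."""
    by_customer = defaultdict(list)
    for customer, _service, pct in rows:
        by_customer[customer].append(pct)
    return {c: sum(v) / len(v) for c, v in by_customer.items()}

print(per_customer_utilization(samples))
# {'acme': 52.25, 'globex': 12.0}
```

The same grouping, fed by the monitoring data already collected, lets an MSP show each client only their own services while the provider sees the whole picture.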

Figure 6.26: MSP dashboard.

Conclusion

There you have it: advanced, modern monitoring for the hybrid IT environment, from the data center to the cloud, and even for MSPs that need to provide customers with insight into their own networks and systems. It is possible, using the right tools and the right techniques, and some vendors can even provide you with these monitoring capabilities as a SaaS solution, giving you an almost instant implementation, if desired. The solutions are out there; time to start looking.


Mortgage Secrets. What the banks don t want you to know.

Mortgage Secrets. What the banks don t want you to know. Mortgage Secrets What the banks don t want you to know. Copyright Notice: Copyright 2006 - All Rights Reserved Contents may not be shared or transmitted in any form, so don t even think about it. Trust

More information

Account Access Management - A Primer

Account Access Management - A Primer The Essentials Series: Managing Access to Privileged Accounts Understanding Account Access Management sponsored by by Ed Tittel Understanding Account Access Management...1 Types of Access...2 User Level...2

More information

10 How to Accomplish SaaS

10 How to Accomplish SaaS 10 How to Accomplish SaaS When a business migrates from a traditional on-premises software application model, to a Software as a Service, software delivery model, there are a few changes that a businesses

More information

Pr oactively Monitoring Response Time and Complex Web Transactions... 1. Working with Partner Organizations... 2

Pr oactively Monitoring Response Time and Complex Web Transactions... 1. Working with Partner Organizations... 2 Pr oactively Monitoring Response Time and Complex Web Transactions... 1 An atomy of Common Web Transactions... 1 Asking for Decisions... 1 Collecting Information... 2 Providing Sensitive Information...

More information

Contents. Introduction. What is the Cloud? How does it work? Types of Cloud Service. Cloud Service Providers. Summary

Contents. Introduction. What is the Cloud? How does it work? Types of Cloud Service. Cloud Service Providers. Summary Contents Introduction What is the Cloud? How does it work? Types of Cloud Service Cloud Service Providers Summary Introduction The CLOUD! It seems to be everywhere these days; you can t get away from it!

More information

Cloud: It s not a nebulous concept

Cloud: It s not a nebulous concept WHITEPAPER Author: Stuart James Cloud: It s not a nebulous concept Challenging and removing the complexities of truly understanding Cloud Removing the complexities to truly understand Cloud Author: Stuart

More information

7 Deadly Sins of the DIY Cloud

7 Deadly Sins of the DIY Cloud 7 Deadly Sins of the DIY Cloud Uncovering the Hidden Impact of Custom App Development in the Cloud The Do-It-Yourself Cloud Revolution Cloud computing has brought a revolution to application development.

More information

Do slow applications affect call centre performance?

Do slow applications affect call centre performance? Do slow applications affect call centre performance? A white paper examining the impact of slow applications on call centre quality and productivity Summary To be successful in today s competitive markets

More information

Stop the Finger-Pointing: Managing Tier 1 Applications with VMware vcenter Operations Management Suite

Stop the Finger-Pointing: Managing Tier 1 Applications with VMware vcenter Operations Management Suite Stop the Finger-Pointing: Managing Tier 1 Applications with VMware vcenter Operations Management Suite By David Davis, VMware vexpert WHITE PAPER There is a tradition of finger-pointing in too many IT

More information