100% Application Availability Using Hybrid Cloud Architectures
Making monitoring actionable through automated global server load balancing
CEDEXIS RESEARCH
April 2014
Table of Contents
Overview
Redundancy: Importance and Challenges
The Hybrid Cloud Solution
Cloud Orchestration Requires Cloud Awareness
Building a Comprehensive View
Network Level: Watching the Last Mile
The Platform Level: Multi-home and Multi-tenant Challenges
Application Level: Digging Deeper
Server Level: Virtual Machines and Bare Metal
Intelligent Analysis and Decision Making
Tools Mentioned in This Paper
Overview: High Availability Is Not an Accident

Hybrid cloud orchestration is the active management of a cloud-aware application across multiple content and hosting environments, including combinations of public and private resources. Unlike their traditional counterparts, cloud-aware applications can be optimized in real time to adapt to changes in usage, traffic, and resource contention. Such application architectures are self-healing: they correct themselves around network, bare-metal server, cloud, and platform outages.

Ensuring that applications are self-healing in this manner requires traffic shaping across clouds, content delivery networks (CDNs), and private data centers. This type of traffic shaping requires extensive monitoring. It can be tricky, but the benefit is a robust, redundant application with high fault tolerance and a very low mean time to recovery. Automated, real-time decisions need to be made with data representing the holistic health of your complete cloud and data center infrastructure. If done well, these architectural elements also serve a second purpose: helping the application reach its full performance potential.

Redundancy: Importance and Challenges

Using cloud or hosting options for your platform services means not worrying about infrastructure. This is both a great benefit and a potential weakness. Independence from infrastructure means platform problems can be separated from app development, speeding time to market and giving us a great deal of flexibility for deployment. However, it also means that some performance and reliability problems are cloaked in the cloud, beyond our ability to understand or control.

Losing that control could result in a nightmare scenario. Your pager goes off in the middle of the busiest online shopping day of the year and your greatest fears are realized: your cloud-hosted virtual web servers are down. Your cloud services provider cannot be reached.
Figure 1: Multi-Homed Hybrid Cloud Solution
You spend the first 15 minutes trying to figure out if your monitoring solution is telling you the truth. You spend the next hour trying to figure out what failed: power supply, router, firewall? Your holiday plans are starting to look rancid.

What can your operations team do? All they control is the software stack. With troubleshooting options limited by the access you are allowed to the cloud provider's infrastructure, you are dependent on your cloud provider's operations staff to address most connectivity and infrastructure problems.

The Hybrid Cloud Solution

A hybrid cloud approach mitigates this problem. Mixing multiple cloud providers, CDNs, and private data centers makes your web application more robust and redundant. If one content provider experiences issues, traffic can be directed to another provider until the issues are resolved. A holistic view of health allows for detection of issues that result from interactions between architecture elements (both inside and outside the firewall).

Figure 2: Holistic User Experience and Application Health View

Redundancy historically meant having one data center with multiple servers, or an intelligent load balancer with failover capabilities. Over time that evolved into a Disaster Recovery (DR) solution that allowed companies to fail over to an application framework designed for that purpose. These DR frameworks often involve dormant infrastructure in separate data centers.
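The redundancy argument above can be made concrete with a little arithmetic. The following Python sketch is an illustration only, not a Cedexis tool: it assumes that each home fails independently of the others (correlated outages, such as a shared upstream network, would reduce the benefit), and the availability figures are hypothetical.

```python
from math import prod


def combined_availability(availabilities):
    """Availability of a set of active homes, ASSUMING independent failures.

    The application is unreachable only when every home is down at the same
    time, so the combined availability is 1 minus the product of the
    individual unavailabilities.
    """
    return 1.0 - prod(1.0 - a for a in availabilities)


# Hypothetical figures: a single provider at "three nines".
single_home = 0.999

print(combined_availability([single_home]))               # one home
print(combined_availability([single_home, single_home]))  # two homes
print(combined_availability([single_home, single_home, 0.995]))  # three homes
```

Under these assumptions, two homes at 99.9% each are jointly unavailable about one millionth of the time. This only pays off if traffic is actually steered away from a failed home, which is why the monitoring and routing layers matter as much as the extra capacity.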
Today it's possible for your web application (whether in the cloud or in the data center) to approach 100% uptime and deliver content to users faster and more consistently by using multiple homes [1] in an active-active configuration. Users can be routed intelligently by determining the closest, fastest host.

[1] We've taken a look at how many homes an application might need to reduce latency and achieve 100% uptime. Read more about it on our blog.

Cloud Orchestration Requires Cloud Awareness

Cloud-aware applications are designed to be event driven, stateless, and fault tolerant. These applications are often supported by lightweight APIs using the representational state transfer (REST) model to communicate between services and manage user sessions across a diverse infrastructure. The shifting state of the cloud means that failures will happen, so cloud awareness means fault tolerance. Cloud-aware applications aim for a low mean time to recovery instead of a long mean time between failures.

Handling faults while providing the best experience for your users requires a holistic view of your application's health. To create this type of self-healing, network-aware application, developers and architects must use tools that provide an end-to-end view of the Internet. They must see the application as their users see it so that performance-related issues can be mitigated. Traditional monitoring tools do not provide the last-mile view that is required for true self-healing cloud applications.

Figure 3: Layers of a Hybrid Cloud Application Stack

Building a Comprehensive View

To ensure that a web application operates properly, you must monitor multiple levels of performance and react quickly to resolve any problems that arise. As cloud-hosted
applications become multi-homed, this monitoring becomes more complex. In the age of hybrid cloud applications, you need to watch more than just the server and its configuration. The entire end user experience is important, so every aspect of the network also needs to be taken into account.

To build a comprehensive view of your application's health, monitor everything from the bare metal to the user's client. The breadth of the data being measured means that different tools and techniques must be used at each of four distinct levels: Network, Platform (Data Center with Hosting or Cloud), Server, and Application.

Network Level: Watching the Last Mile

The network level of a web application can be the most frustrating for developers and operations staff. This is the level beyond your last router. It's your data's path through the wilderness of the Internet, across crazy routing and WANs, over hundreds of miles of fiber, bouncing off satellite dishes, creeping along DSL lines or mobile networks, and finally arriving at the user's device.

It can be difficult to gather useful statistics and metrics about performance at the network level. Even when those metrics are available, it can be impossible to find solutions and remedies to problems that exist on hardware and software completely beyond your control.

The network level is also incredibly varied. Each end user has a different path to your web application. Measurements for one user's bogged-down ISP will have nothing to do with the cut fiber that is blocking the Internet access of another user just a mile away. Applications with large user bases can have so many problems at this level that even if it were possible for an operations team to address them, the quantity of issues would be overwhelming.

Figure 4: Real user measurements eliminate the mystery of the last mile.

Monitoring at the network level needs to
take into account the variety of user experience. The real user monitoring (RUM) approach puts together a meaningful picture of the network level, including the last mile of Internet traffic and the wide range of clients and operating systems.

This paper is not intended to blatantly promote Cedexis products, but this kind of monitoring is our area of expertise. Cedexis Radar, our crowd-sourced RUM product, gathers billions of metrics per day. Operations teams can use Radar data to identify problematic regions, configurations, or other surprises. With information about last-mile problems, operations teams can work to mitigate their effect on user experience.

Last-mile issues are frequently beyond the operations team's control. However, a good hybrid cloud application stack will allow you to route clients to alternate hosts to improve their experience. RUM measurements paired with an operational multi-homed cloud and CDN strategy are therefore a best practice for application design.

Finally, because problems at the network level can appear and disappear quickly, any alternate routing or mitigating decisions need to be made as near to real time as possible. While a user might not mind reloading a page once, we already know that every 100 ms of delay in a user's web application experience can cost up to 1% in revenue [2].

[2] Greg Linden, Make Data Useful, Amazon.com (2006), downloaded from https://sites.google.com/site/glinden/home/stanforddatamining.2006-11-28.ppt
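To illustrate how RUM data can drive a near-real-time routing choice, here is a minimal Python sketch. It does not show how Cedexis Radar or Openmix work internally. The sample format, the provider names, the 60-second window, and the minimum sample count are all assumptions made for the example.

```python
from collections import defaultdict
from statistics import median


def pick_provider(samples, now, window_s=60.0, min_samples=3):
    """Return the provider with the lowest median latency in recent samples.

    samples: iterable of (timestamp_s, provider, latency_ms) tuples
             (a hypothetical format for real user measurements).
    Returns None when no provider has enough fresh measurements.
    """
    recent = defaultdict(list)
    for ts, provider, latency_ms in samples:
        if now - ts <= window_s:  # ignore stale measurements
            recent[provider].append(latency_ms)

    # The median resists the occasional outlier from one slow client.
    candidates = {
        provider: median(values)
        for provider, values in recent.items()
        if len(values) >= min_samples
    }
    if not candidates:
        return None
    return min(candidates, key=candidates.get)


samples = [
    (100.0, "cloud-a", 80), (101.0, "cloud-a", 95), (102.0, "cloud-a", 400),
    (100.5, "cdn-b", 120), (101.5, "cdn-b", 110), (102.5, "cdn-b", 115),
    (10.0, "dc-c", 20), (11.0, "dc-c", 25), (12.0, "dc-c", 22),  # stale
]
print(pick_provider(samples, now=105.0))  # prints cloud-a
```

Only measurements inside the window count, so a provider that was fast a few minutes ago ("dc-c" here) is not chosen on the strength of old data. A production system would likely also segment samples by the user's network and region, since each end user has a different path to the application.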
Tango is a popular mobile application that provides social networking and communications services to users worldwide. The application's rapid growth prompted the adoption of a hybrid cloud solution. Using Cedexis Openmix to monitor health and route traffic, they saw a 15% reduction in latency and a 25% increase in call duration. To learn more about Tango's success story, check out our case study.

Figure 5: Hybrid Cloud Deployment Architecture

The Platform Level: Multi-home and Multi-tenant Challenges

Companies are moving to multi-tenant platforms. The rise of hosting and cloud platforms is a clear testament to this fact. For companies that do not want to spend capital (capex) on servers, there are two platform options for outsourcing their application:

1. Cloud: This option involves a public or private cloud, under either a pay-as-you-go arrangement or a contract. These clouds are simply virtualized servers, clustered so that you can spin up and down as many servers as the cluster has capacity to support.
2. Hosting: A company can host in one or more third-party data centers. These hosting arrangements are usually tied to bare-metal servers, but the servers are set up and deployed by the hosting company and are usually limited to a fixed set of supported configurations. Standardizing on these configurations allows the hosting company to support the platforms on behalf of its customers.

Applications that are built in either of these environments are (by design) hosted in a multi-tenant environment. These environments offer many benefits, but they also have two major drawbacks that directly impact the application:

1. Noisy neighbors. In a multi-tenant environment your performance can drop unexpectedly, and you will not always understand why or know what can be done to fix the issue. These noisy neighbors can impact the network, memory, storage, or processing power.

2. Your team's ability to access and observe the data center's infrastructure is limited. These days, a company's operations team usually does not have intimate knowledge of the data center that hosts and powers its web applications. Most people who lease hosted environments never see the data center their application is hosted within.

Figure 6: Private monitoring contributes to a view of holistic health.

Content delivery networks (CDNs) and cloud hosting services offer less visibility than a private data center, but enough that monitoring is possible. In fact, monitoring nodes located in these types of hosting situations can collect information about connections between data centers. This is invaluable information for routing traffic to the best location.

Because the infrastructure of a cloud host is invisible to the customer, it is difficult to arrange certain configurations that used to be considered crucial in a private
data center or colocation environment. For example, private high-speed dedicated Ethernet connections between your application servers and your database server might not be possible. This can lead to latency and bandwidth issues, which means that cloud-aware applications need to plan for a completely different set of problems.

You can monitor the path between a CDN and your primary hosting infrastructure using tools similar to those that watch end user experiences on the network level. Intelligent CDN usage means that nodes pull different web objects from cache servers selected based on latency and performance. Analyzing metrics in real time makes this type of CDN optimization most effective.

Cedexis Radar can collect metrics on private networks as well as public ones. This means that data from inside your private data center or managed hosting facility can be fed to Radar alongside public RUM data.

PROACTIVE APPLICATION HEALTH

Holistic health doesn't always involve real-time decisions, of course. Application-level health also depends on a healthy software development lifecycle (SDLC) that includes feature testing, stress testing, and an intelligent architecture. While you can wait until your application is live to see how it handles the pressure, it's best to address these problems proactively as part of your SDLC process.

Application Level: Digging Deeper

Monitoring health at the application level seems at first like a pretty simple concept. However, with a cloud-aware application, there are a number of health metrics you need to track to know the well-being of your application.

A cloud-hosted virtual machine might co-exist on the same hardware as a number of other servers belonging to different companies. As a cloud hosting customer, you don't necessarily have control over who your immediate cloud neighbors are, and those neighbors can change at any time. Computing resources in a cloud environment must be shared with your neighbors, and contention for them can reduce reliability.
Application monitoring spots contention situations and helps make real-time decisions to direct traffic to optimal resources. Application Performance Monitoring (APM) tools such as New Relic's Web Application Monitoring, AppDynamics, and Catchpoint Transaction Testing collect enormous amounts of data from within your running web application. This can reveal performance issues caused by various problems at the application level.
Sometimes these issues can correct themselves. Sometimes, they recur. In either situation, the immediate impact that they have on the user experience needs to be mitigated. These tools feed data to external decision-makers so the health of your system can be monitored and maintained in real time.

Server Level: Virtual Machines and Bare Metal

Describing the server level as below the application level has only made sense since the advent of cloud-aware applications. With so many hosting solutions of the virtual variety, server monitoring can feel meaningless and frustrating. There will be times when virtual server monitoring shows you some pretty strange problems that you have no recourse to fix. Focusing instead on the application level and intelligent routing decisions makes more sense.

However, the cloud landscape is still evolving. New tools can monitor virtual servers and contribute to an intelligent assessment of the health of a web application. Amazon CloudWatch monitors the health of EC2 instances and other AWS resources. Rackspace Cloud Monitoring is a similar tool, but it promises to measure virtual instances from multiple cloud host providers as well as physical servers. These metrics do not directly watch the health of your web application, but add to the holistic view.

Figure 7: Server Layer Monitoring

These tools must be used in conjunction with the RUM-based tools and application tools to create a true self-healing architecture. Knowing how your servers are behaving will enable you to see whether you're having an application performance problem or some kind of issue with a service provider.

LogicMonitor and ScienceLogic are both excellent examples of companies that provide this type of service. These companies incorporate modules that monitor many parts of your application stack. The operational insight provided can help make crucial traffic
routing decisions. Combined with RUM on the network level, the metrics from this type of monitoring give an essential view of the health of your network and application stack.

HIGH-LEVEL CLOUD-AWARE LOAD BALANCING

Cedexis Openmix collects data from each of the four layers of the hybrid cloud stack and builds a picture of holistic health for your application. The RUM metrics collected by Cedexis Radar are combined with private data feeds, application monitoring information, and even the layer-spanning metrics of LogicMonitor. All of this data is used to figure out which components of your hybrid cloud solution make up the healthiest, fastest route for each user to your web application.

Openmix operates at the DNS level. Each request triggers another look at the status of the components of your hybrid cloud application stack. If services are running below a soft threshold, they are given a yellow light and allowed to rest for a while to see if conditions improve. Services with critical failures are given a red light and avoided completely. Performance and topographical data are then used to select the best destination for the client device. Historical data can then be used to view the ongoing health of your stack.

Intelligent Analysis and Decision Making

Once the tools are in place to monitor each of the four levels, you need to be able to collect the data in one place, analyze it, and act upon it. The real-time data collected by all of these tools is useless if it's impossible to act on it in time. Intelligent analysis watches for trends in the data and then routes traffic and usage around trouble spots.

Cloud performance changes quickly, and a multi-home load balancer needs to be aware of every level of the hybrid cloud stack. This is how a self-healing application framework is built.

Real-time monitoring of the holistic health of a hybrid cloud solution is essential to the end user experience. Intelligent analysis and immediate action make sure
that every client is given the best possible route. To best orchestrate cloud-aware applications, you must take a serious look at redundancy, load balancing, and your mean time to recovery.

The busiest online shopping day of the year has arrived again, but this time you're ready. Your pager never goes off. The logs show that one of your cloud hosting providers had an issue, but thanks to successful cloud orchestration, your customers were routed elsewhere and did not even notice. It looks like you did not need to brew that second pot of coffee after all. A hybrid cloud solution just saved Christmas.

Tools Mentioned in This Paper

Monitoring Level    Tools
Network             Cedexis Radar
Data Center         Cedexis Radar, LogicMonitor
Application         New Relic Web Application Monitoring, AppDynamics, Catchpoint Transaction Testing, LogicMonitor
Server              Amazon CloudWatch, Rackspace Cloud Monitoring, LogicMonitor

With deep experience in delivery networks and performance optimization, Cedexis is the global expert in multi-cloud strategies. Today, over 350 media, retail, luxury and consumer brands count on Cedexis for 100% availability, optimal web performance, flexibility and choice that drives traffic and revenue at lower cost and risk.

Portland, Oregon
317 SW Alder St, #650
Portland, OR 97204
+1 855 CEDEXIS (233-3947)
fax: +1 503 914 0488

Paris, France
27 rue Raymond Lefebvre
94250 Gentilly, France
+33 (0)1 79 755 253

Visit cedexis.com or email sales@cedexis.com.