Hosted Service Strategy Guide


Prepared by Jason Gaudreau, Senior Technical Account Manager, VMware Professional Services (jgaudreau@vmware.com)

Revision History

Date: 11/20/2014   Author: Jason Gaudreau   Reviewers: VMware

© 2015 VMware, Inc. All rights reserved. This product is protected by U.S. and international copyright and intellectual property laws and is covered by one or more patents listed at http://www.vmware.com/download/patents.html. VMware, the VMware "boxes" logo and design, Virtual SMP, and vMotion are registered trademarks or trademarks of VMware, Inc. in the United States and/or other jurisdictions. All other marks and names mentioned herein may be trademarks of their respective companies.

VMware, Inc., 3401 Hillview Ave., Palo Alto, CA 94304, www.vmware.com

Contents

Introduction
Executive Summary
Key Functional Requirements for Service Tiers
    Infrastructure Resiliency
    Recovery Point Objective (RPO)
    Recovery Time Objective (RTO)
    Infrastructure Performance
Factors for Building Service Tiers
Host Systems
    Server Form Factor
    Host Resource Capacity
    Host High Availability
    VM Restart Priority
    Live Migration
    Trickle Down Servernomics
Storage
    Data Protection
    Storage Performance
    Multipathing
    Virtual Disks
Networking
    Multi-NIC Configuration
Virtual Machines
    VM-VM Anti-Affinity Rules
    Reservations
    Limits
    Shares
    Resource Pools
Infrastructure Maintenance and Deployment Management
    Maintenance and Support
    Managed
    Unmanaged
Creating Service Offerings

Introduction

Mission critical applications, such as customer-facing applications and financial systems, are vital to the smooth operation of the company's business. These applications are core to the company's mission, and system downtime translates to financial losses for the organization. Other applications, like general-purpose printing, software media libraries, and infrastructure monitoring tools, do not require the same service-level capabilities as mission critical applications. Customer-facing applications provide new business opportunities and improved business capability; however, they also drive the need for decreasing recovery time objectives and more stringent service levels from a service availability perspective.

We cannot treat all IT data center systems the same; some systems are more critical to the operation than others. Business requirements have changed over the past few years for applications that drive business revenue: the expectation now is that systems, such as online retail systems, are available 24/7. Because of the exploding number of applications, infrastructure growth, and the high cost of downtime, IT organizations need to use their infrastructure resources to build service tiers that they offer their business partners. Not only will this help improve application performance, reliability, and availability; it can also increase host density ratios, provide cost transparency for meeting the expected business requirements, and enable process improvements from a service management perspective. This paper does not cover complete disaster recovery solutions; the assessment centers strictly on IT system availability and performance.

Executive Summary

Vital business applications that run on mission critical systems must be able to recover quickly and require high availability solutions. Core business applications should be supported by an N+2 failover solution, enforced with vSphere Admission Control. Data protection should include RAID 0+1 to ensure there is no impact to the array in the event of a multi-drive failure and to increase storage I/O performance. Virtual disks should be Thick Provision Eager Zeroed to guarantee that there is no oversubscription of the datastores and to maximize performance. Stateless web applications that use network load balancing (NLB) will have VM-VM anti-affinity rules set to ensure that each virtual instance runs on a separate host node.

Availability comes at a cost, because higher levels of availability require redundancy, automated recovery, and mirrored-pair solutions. The greater the need for availability, the higher the IT system's price tag. To balance this, it is important to adopt a tiered approach to high availability based on the criticality of the IT system to our business partners. The proposed solution above provides a cost-efficient approach to creating a highly stable and available platform for hosting mission critical systems.

Key Functional Requirements for Service Tiers

Infrastructure Resiliency

IT system resiliency is determined by redundant system components, including servers, networking, and storage, and by system recoverability. These components must be highly available to meet mission critical system needs and minimize downtime. A failure in any link in the infrastructure chain could result in the loss of IT system availability to the business. As a result, redundancy must be applied to all infrastructure components to ensure high availability.

Recovery Point Objective (RPO)

The recovery point objective dictates the amount of data that can be lost as the result of a failure. Generally, mission critical systems cannot sustain any data loss and require a very low recovery point objective. Systems that are not mission critical can often sustain some amount of data loss or lost transactions resulting from a system failure.

Recovery Time Objective (RTO)

Recovery time objectives (RTOs) spell out the maximum allowable time to restore IT services. RTOs are typically associated with recoverability, whereas Quality of Service (QoS) needs are associated with availability. Most organizations use RTOs to express disaster recovery requirements. For our purpose, we are going to focus on availability solutions that protect our IT systems from downtime caused by individual system outages, component outages, and maintenance activity.

Infrastructure Performance

For mission critical applications, it is no longer sufficient for an application to just meet functional requirements; you need to ensure the application satisfies the desired performance parameters for the consumer. With a focus on controlling costs, IT leaders must run at just enough resource capacity to meet the requirements of the business. For many organizations, day-to-day operations include running several generations of physical servers, which provide varying degrees of performance. It is important to use the latest hardware with the largest feature set to run the core business applications; this provides the best application performance and hardware reliability.

Factors for Building Service Tiers

- Vital business functions need to be highly available and operate with minimal disruption
- Highly resilient infrastructure design to withstand failures
- Decrease the number of system outages to mission critical systems
- Operate within scheduled maintenance and deployment release dates
- Provide necessary performance to meet application requirements

Host Systems

Server Form Factor

VMware vSphere allows organizations to spread virtual machines (servers) across multiple physical hosts, with the ability to consolidate workloads onto each server. Essentially, a scale-up design uses a small number of large, powerful servers, as opposed to a scale-out design that revolves around a larger number of smaller servers. Both aim to achieve the computing power required to run business applications, but the way in which they scale is different and has a different impact on support.

Scale-up advantages:

- Better resource management: larger servers can take better advantage of the hypervisor's resource optimization capabilities. Scaling out doesn't make as efficient use of resources because they are more limited on an individual node.
- Cost: scaling up is cheaper.
- Fewer hypervisors: with fewer servers running the hypervisor, it is easier to maintain hypervisor upgrades, hypervisor patching, and BIOS and firmware upgrades, and there is a smaller footprint for system monitoring.
- Larger VMs possible: scale-up is more flexible with large VMs because of resource scaling.
- Power and cooling: in general, scaling up requires less power and cooling because there are fewer host nodes.

Scale-out advantages:

- Less impact during a host failure: having fewer VMs per server reduces the risk if a physical host failure should occur. By scaling out to small servers, fewer VMs are affected at once.
- Less expensive host redundancy: it is significantly cheaper to maintain an N+2 host policy.

Although scaling up saves money on OPEX and infrastructure costs, the recommendation for mission critical applications is to scale out, so that the VM impact is minimized in the event of a system failure. vSphere High Availability (HA) uses a restart of the virtual machine as the mechanism for addressing host failures. This means there is a period of downtime while the failed host's VMs complete their reboot on other hosts.

Host Resource Capacity

vSphere clustering provides admission control to ensure that capacity is available for maintenance and host failure. Failover capacity is calculated by determining how many hosts can fail while still leaving enough capacity to satisfy the requirements of all powered-on virtual machines. An N+2 solution, where N is the number of hosts required for the workload plus two additional hosts, allows for an unexpected host failure even while one host is out of the cluster for maintenance. This cluster design can sustain the loss of two hosts without disrupting mission critical systems.

This ensures that we are not over-committed in host resource allocation, which could lead to poor VM performance in a multi-host failure. For non-mission critical applications, running at N+1 allows for non-disruptive maintenance of the underlying host systems and tolerates a single host outage without business impact.

Host High Availability

Usable resource capacity remaining after reserving failover headroom:

Number of Hosts    N+1 Resource Capacity    N+2 Resource Capacity
2 hosts            50%                      NA
3 hosts            67%                      33%
4 hosts            75%                      50%
5 hosts            80%                      60%
6 hosts            83%                      67%
7 hosts            86%                      71%
8 hosts            87%                      75%
9 hosts            89%                      78%
10 hosts           90%                      80%

Figure 1: Resource Allocations for Host Failure

vSphere High Availability is a clustering solution that detects failed physical hosts and recovers their virtual machines. If vSphere HA discovers that a host node is down, it quickly restarts that host's virtual machines on other servers in the cluster. This enables us to protect virtual machines and their workloads.

Figure 2: vSphere HA
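The percentages in Figure 1 follow directly from the ratio of surviving hosts to total hosts. Below is a minimal sketch of that calculation in Python (an illustration, assuming identically sized hosts, as vSphere admission control's host-failures-tolerated policy effectively does; usable_capacity is a hypothetical helper, not a vSphere API):

    def usable_capacity(total_hosts: int, host_failures_tolerated: int) -> float:
        """Fraction of cluster capacity usable by workloads while reserving
        enough headroom to restart all VMs after the tolerated host failures."""
        if host_failures_tolerated >= total_hosts:
            raise ValueError("cannot tolerate the failure of every host")
        return (total_hosts - host_failures_tolerated) / total_hosts

    # Reproduce Figure 1 (note: the figure truncates some values, e.g. 7/8 = 87.5%).
    for hosts in range(2, 11):
        n1 = f"{usable_capacity(hosts, 1):.0%}"
        n2 = f"{usable_capacity(hosts, 2):.0%}" if hosts > 2 else "NA"
        print(f"{hosts} hosts: N+1 {n1}, N+2 {n2}")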

VM Restart Priority

If a host fails and its virtual machines need to be restarted, you can control the order in which this is done with the VM restart priority setting. VM restart priority determines the relative order in which virtual machines are restarted on a new host after an outage. Virtual machines with the highest priority are restarted first, continuing through lower priorities until all virtual machines are running or no cluster resources remain. Setting a VM restart priority of High for mission critical applications ensures they come back online quickly.

Live Migration

vSphere vMotion provides the ability to perform live migrations of a virtual machine from one ESXi host to another without service interruption. This is a no-downtime operation: network connections are not dropped, and applications continue running uninterrupted. This makes vMotion an effective tool for load balancing VMs across host nodes within a cluster. Additionally, if a host node needs to be powered off for hardware maintenance, you can use vMotion to migrate all the active virtual machines from the host going offline to another host, ensuring there is no business disruption.

Trickle Down Servernomics

VMware recommends that all hosts in a cluster have similar CPU and memory configurations to produce a balanced cluster and optimal HA resource calculations. This not only helps in the event of a physical server outage in a cluster; it can also improve performance by taking advantage of all the capabilities of your latest generation servers.

To mix multiple processor architectures in a single cluster, you need to enable Enhanced vMotion Compatibility (EVC) mode. EVC mode allows migration of virtual machines between different generations of CPUs, making it possible to aggregate older and newer server hardware generations in a single cluster. However, despite the obvious advantages of EVC mode, you need to factor in the costs associated with this feature. Some applications will potentially lose performance because certain advanced CPU features are not made available to the guest, even though the underlying host supports them. When an ESXi host with a newer generation CPU joins the cluster, the baseline automatically hides the CPU features that are new and unique to that CPU generation. The table below (Figure 3) lists the EVC levels and describes the features they enable.

Figure 3: EVC Modes

To illustrate the performance variations, VMware ran tests that replicated customer application workloads to measure the impact of EVC mode. Several guest virtual machines ran workloads under different EVC modes, ranging from Intel Merom to Intel Westmere. For the Java-based server-side application, performance varied negligibly (0.0007%) between EVC modes as old as Merom and as new as Westmere. For OpenSSL (AES), the Intel Westmere EVC mode outperformed the other modes by more than three times. The improved performance is due to the encryption acceleration made possible by the AES-NI instruction set introduced with Intel's Westmere processors (Figure 4).

Figure 4: OpenSSL with EVC Mode

Another key aspect of a balanced cluster is ensuring there is not a large variation of resources within a single cluster, which happens when mixing different server generations. For instance, the HP ProLiant DL380 Generation 5 (G5) offered at most two quad-core Intel Xeon processors with 12 MB of L2 cache and a maximum of 64 GB of memory. The Generation 8 (G8) version, the HP ProLiant DL380p, allows two 12-core Intel Xeon processors with 30 MB of L3 cache and a maximum of 768 GB of memory. That is a dramatic difference! Looking at theoretical density ratios, we can expect 32 vCPUs on the HP ProLiant DL380 G5 and 96 vCPUs on the HP ProLiant DL380p G8. A solid estimate for the number of vCPUs per processor core for production workloads is 4 (see the sketch below):

Total vCPUs = Processor Cores x 4

Unbalanced clusters can have a significant impact on VMware HA when a new generation server fails and legacy systems need to pick up the additional workload. Furthermore, to mix these servers in a cluster, you would need to enable the EVC baseline for the Penryn processor architecture (L1), hiding all the CPU features of baselines L2 through L4. By defining cluster service tiers, you can ensure that applications with critical workloads that are vital to the business get the latest generation features and that your clusters stay balanced. You trickle down older generation servers to hosting tiers that don't have the same SLA and performance requirements as your core business applications.
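Applying the density rule of thumb above (Total vCPUs = Processor Cores x 4) to the two server generations gives the figures quoted earlier. A minimal sketch; the 4:1 ratio is a starting estimate and should be tuned per workload:

    VCPU_PER_CORE = 4  # rule-of-thumb density for production workloads

    def total_vcpus(sockets: int, cores_per_socket: int,
                    vcpu_per_core: int = VCPU_PER_CORE) -> int:
        """Theoretical vCPU capacity of a host: Total vCPUs = processor cores x 4."""
        return sockets * cores_per_socket * vcpu_per_core

    print(total_vcpus(2, 4))   # HP ProLiant DL380 G5, two quad-core Xeons: 32
    print(total_vcpus(2, 12))  # HP ProLiant DL380p G8, two 12-core Xeons: 96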

Figure 5: Cluster Service Tiers

Additionally, you can adopt a strategy of scaling out for mission critical applications and scaling up for non-mission critical workloads. Your clusters should be created to meet a business requirement or a functional requirement. For instance, you could create clusters based on service levels, as illustrated above (Figure 5), or based on functional requirements, such as a SQL cluster. A database cluster might require a high number of processor cores so that VMs stay within a NUMA node, and it may need some resources reserved to meet performance requirements.

Figure 6: Cluster Tiers

Storage

Data Protection

It's all about recovery: data protection design protects against all relevant types of failure and minimizes data loss. While disk capacity has increased more than 1,000-fold since RAID levels were introduced in 1987, disk I/O rates have increased only about 150-fold. This means that when a disk in a RAID set does fail, it can take hours to repair and re-establish full redundancy.

RAID levels:

- RAID-1: an exact copy (or mirror) of a set of data on two disks.
- RAID-5: block-level striping with parity data distributed across all member disks.

- RAID-6: extends RAID-5 by adding an additional parity block; block-level striping with two parity blocks distributed across all member disks.
- RAID 10: a top-level RAID-0 array (stripe set) composed of two or more RAID-1 arrays (mirrors). A single-drive failure in a RAID 10 configuration puts one of the lower-level mirrors into degraded mode, but the top-level stripe can continue to perform normally (aside from the performance hit), because both of its constituent storage elements are still operable; this behavior is application-specific.
- RAID 0+1: in contrast to RAID 10, a top-level RAID-1 mirror composed of two or more RAID-0 stripe sets. A single-drive failure in a RAID 0+1 configuration causes one of the lower-level stripes to fail completely (RAID 0 is not fault tolerant), while the top-level mirror enters degraded mode.

For mission critical systems, being able to survive the overlapping failure of two disks in a RAID set is important to protect against data loss. RAID 0+1 mirrors a pair of stripe sets, so every block of data is written to a second disk; this gives an excellent level of redundancy. Rebuild times are also short in comparison to other RAID types. RAID 0+1 has increased read performance because it can read from the mirrored copies of the data in parallel. Furthermore, there is a dramatic improvement in write performance: RAID 0+1 only needs to write to two disks at a time, whereas RAID-5 must perform four steps per write: read the old data, read the parity, write the new data, and write the parity. This is known as the RAID-5 write penalty. Mirrored RAID volumes offer high degrees of protection, but at the cost of 50 percent of usable capacity.

Storage Performance

The types of drives in the storage array and the IO activity have a dramatic impact on application performance. The diagram below (Figure 7) shows the typical IOPS expected from today's magnetic and flash drives.

Figure 7: Drive IOPS

Mission critical applications and applications with heavy IO workloads can benefit from incorporating flash drives. For instance, if we size a 3PAR storage array with only 15K FC disks, we need 106 disks to meet a requirement of 16,000 IOPS. That would put us into the 3PAR 7400, with 9 drive enclosures and 152 disks once redundancy is included; the cost is around $400,000 for the storage array. If we instead use a 3PAR array with a mix of SSD and spinning disks under Adaptive Optimization, we need 16 SATA disks and 8 SLC SSD disks to meet the 16,000 IOPS requirement. This drops us down to the 3PAR 7200, with 3 drive enclosures and 24 disks including redundancy; the cost is around $100,000 for this storage array.

Figure 8: Disk Reduction

As you can see, by leveraging SSD you can dramatically reduce the price of building out the infrastructure needed by applications with significant IOPS requirements.
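The disk counts above fall out of dividing the IOPS target by per-drive throughput. A simplified sketch; the DRIVE_IOPS values below are illustrative assumptions, and real sizing also accounts for capacity, RAID overhead, and hot sparing:

    import math

    # Assumed steady-state IOPS per drive; real figures vary by vendor and workload.
    DRIVE_IOPS = {"15K FC": 150, "10K SAS": 125, "7.2K SATA": 75, "SLC SSD": 5000}

    def disks_required(target_iops: int, drive_type: str) -> int:
        """Minimum drive count needed to satisfy a backend IOPS target."""
        return math.ceil(target_iops / DRIVE_IOPS[drive_type])

    print(disks_required(16_000, "15K FC"))   # ~107, in line with the 106 quoted above
    print(disks_required(16_000, "SLC SSD"))  # a handful of SSDs absorb the same load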

Sub-LUN auto-tiering, such as HP 3PAR's Adaptive Optimization, enables automatic storage tiering on the array. With this feature, the storage system analyzes IO and migrates 128 MB regions between storage tiers: frequently accessed regions of volumes are moved to higher tiers, while less frequently accessed regions are shifted to lower tiers.

Figure 9: Sub-LUN Auto-Tiering

As mentioned previously, when working with application workloads you must be mindful of the write penalty of RAID sets. With RAID-5, each write involves four steps: read the old data, read the parity, write the new data, and write the parity. This is the RAID-5 write penalty.

Figure 10: RAID Penalty

When calculating your IO workload, use the following formula (see the sketch below):

Total IOPS = Read IOPS + (Write IOPS x RAID Penalty)
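A minimal sketch of the formula, using commonly cited write penalties per RAID level (mirroring costs 2 backend IOs per write, RAID-5 costs 4, RAID-6 costs 6); the numbers reproduce the worked example that follows:

    # Commonly cited write penalties: backend IOs generated per guest write.
    RAID_PENALTY = {"RAID-1": 2, "RAID-10": 2, "RAID-5": 4, "RAID-6": 6}

    def backend_iops(read_iops: float, write_iops: float, raid_level: str) -> float:
        """Total IOPS = Read IOPS + (Write IOPS x RAID penalty)."""
        return read_iops + write_iops * RAID_PENALTY[raid_level]

    # 1,280 front-end IOPS at a 50/50 read/write ratio on RAID-5:
    print(backend_iops(640, 640, "RAID-5"))   # 3200.0
    print(backend_iops(640, 640, "RAID-10"))  # 1920.0, the mirroring advantage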

In our example, a corporate application requires 1,280 average IOPS with a 50/50 read/write ratio on a RAID-5 volume. The formula gives 640 Read IOPS + (640 Write IOPS x 4) = 3,200 IOPS. Including the RAID penalty in your IO calculations is very important: if you size the application for 1,280 IOPS instead of 3,200 IOPS, you can degrade its performance.

Multipathing

vSphere hosts use HBA adapters through fabric switches to connect to the storage array's storage processor ports. Using multiple HBA devices for redundancy creates more than one path to the LUNs. The hosts use a technique called multipathing, which provides features such as load balancing, path failover management, and aggregated bandwidth.

Virtual Disks

Virtual disks (VMDKs) are how virtual machines encapsulate their disk devices. Virtual disks come in three formats: Thin Provision, Thick Provision Lazy Zeroed, and Thick Provision Eager Zeroed.

- Thick Provision Lazy Zeroed: creates a virtual disk in the default thick format. Space required for the virtual disk is allocated when the disk is created. Data remaining on the physical device is not erased during creation but is zeroed out on demand, on first write from the virtual machine.
- Thick Provision Eager Zeroed: space required for the virtual disk is allocated at creation time. In contrast to the lazy-zeroed format, the data remaining on the physical device is zeroed out when the virtual disk is created. It can take much longer to create disks in this format than other types, but you gain a slight performance improvement.
- Thin Provision: use this format to save storage space. You provision as much datastore space as the disk would require based on the disk size you enter, but the thin disk starts small and at first uses only as much datastore space as it needs for its initial operations.

Figure 11: Disk Provisioning

Thick Provision Eager Zeroed virtual disks are true thick disks: the size of the VMDK file on the datastore is the size of the virtual disk you create, and it is pre-zeroed. For example, if you create a 500 GB virtual disk and place 100 GB of data on it, the VMDK file is still 500 GB on the datastore filesystem. As I/O occurs in the guest, the VMkernel does not need to zero blocks before the I/O happens; the result is slightly improved I/O latency and fewer backend storage I/O operations. Because zeroing takes place at run time for a thin disk, there is some performance impact for write-intensive applications when writing data for the first time. After all of a thin disk's blocks are allocated and zeroed out, the thin disk is no different from a thick disk in terms of performance. Some storage array manufacturers implement thin provisioning behind the LUN. Although in most instances array-based thin provisioning performs better than VMFS thin provisioning, you still need to account for the higher CPU, disk, and memory overhead of keeping the LUNs thin.

Another benefit of Thick Provision Eager Zeroed is that you can't over-subscribe the LUN the way you can with thin provisioned disks. Thick Provision Eager Zeroed ensures disk resources are committed to mission critical systems and provides a slight disk I/O improvement. The drawback of this format is that it requires more storage capacity than thin provisioning, because you commit the entire disk allocation to the datastore.

Networking

vSphere Networking

A vSphere standard switch works much like a physical switch. It is a software-based switch that keeps track of which virtual machines are connected to each of its virtual ports and uses that information to forward traffic to other virtual machines. A vSphere standard switch (vSS) can be connected to a physical switch by physical uplink adapters; this gives the virtual machines the ability to communicate with the external network environment and other physical resources.

Even though the vSphere standard switch emulates a physical switch, it lacks most of the advanced functionality of physical switches. A vSphere distributed switch (vDS) is a software-based switch that acts as a single switch providing traffic management across all associated hosts in a datacenter. This enables administrators to maintain a consistent network configuration across multiple hosts. A distributed port is a logical object on a vSphere distributed switch that connects to a host's VMkernel or to a virtual machine's network adapter. A port group shares port configuration options, which can include traffic shaping, security settings, NIC teaming, and VLAN tagging policies for each member port. Typically, a single standard switch is associated with one or more port groups. A distributed port group is a port group associated with a vSphere distributed switch; it specifies port configuration options for each member port and defines how a connection is made through the vDS to the network. Additionally, vSphere distributed switches provide advanced features like private VLANs, network vMotion, bi-directional traffic shaping, and third-party virtual switch support.

Multi-NIC Configuration

Most corporate environments deploy multiple 1 gigabit Ethernet (1GbE) adapters as their physical uplinks. In the diagram below (Figure 12), we use six uplink adapters connected to a combination of vSphere standard switches and a vSphere distributed switch. By using multiple network adapters, we can separate VMkernel traffic, which includes management, vMotion, and fault tolerance, from virtual machine traffic. In the example below, VM traffic goes through the distributed switch while VMkernel traffic stays on the standard switches, providing further isolation.

Additionally, this design provides redundancy for all components except fault tolerance, which may not be a requirement for every company given FT's limitation of supporting only one vCPU.

Figure 12: 1 GbE Network Connection (diagram: six 1GbE vmnic uplinks; management and vMotion VMkernel traffic on VLAN-tagged vSphere standard switches, VM traffic on a vSphere distributed switch, with trunked and teamed links to the physical switch)

Today, many virtualized datacenters are shifting to 10 gigabit Ethernet (10GbE) network adapters, which replace configurations of multiple 1GbE cards. With 10GbE, ample bandwidth is available for multiple traffic flows to coexist and share the same physical link. Flows that were limited to the bandwidth of a single 1GbE link can now use as much as 10GbE.

Now let's take a look at 10 gigabit Ethernet configurations and their impact on your environment. Because there are fewer uplink adapters, the approach to traffic shaping and network isolation is different. Two scenarios are demonstrated here: the first provides traffic shaping and isolation by uplink adapter, and the second is a more dynamic approach that takes advantage of vSphere Network I/O Control (NIOC).

In the first scenario, we segment virtual machine traffic onto dvuplink1 with failover to dvuplink0, physically isolating virtual machine traffic from management traffic. VMkernel traffic is pointed at dvuplink0 with dvuplink1 as the failover adapter. If security controls dictate that you segment your traffic, this is a good solution, but there is a good chance you won't be using the full capability of both 10GbE network adapters.

(Diagram: one pair of 10GbE uplinks on a vSphere distributed switch; traffic is segmented at the port group by VLAN. See Figure 13.)

Traffic Type      VLAN (Example)   Teaming Policy      Active dvuplink   Standby dvuplink
Management        178              Explicit failover   dvuplink0         dvuplink1
vMotion           180              Explicit failover   dvuplink0         dvuplink1
FT                98               Explicit failover   dvuplink0         dvuplink1
Virtual Machine   *                Explicit failover   dvuplink1         dvuplink0

Figure 13: 10 GbE Static Network Design

In our second scenario, we use network resource pools to determine the bandwidth that different network traffic types are given on a vSphere distributed switch. With vSphere Network I/O Control (NIOC), diverse workloads can converge onto a single network pipe and take full advantage of 10GbE. The NIOC concept revolves around resource pools similar in many ways to those that already exist for CPU and memory. In the diagram below (Figure 14), all traffic goes through active dvuplinks 0 and 1. We use a load-based teaming (LBT) policy, introduced in vSphere 4.1, to provide traffic-load awareness and ensure the physical NIC capacity of the team is used optimally. Finally, we set the NIOC share values: virtual machine traffic to High (100 shares), management to Medium (50 shares), and vMotion and fault tolerance to Low (25 shares each). The share values reflect the relative importance placed on the individual traffic types in this environment. Furthermore, you can enforce traffic bandwidth limits across the overall vDS set of dvuplinks. Network I/O Control provides the dynamic capability needed to take full advantage of your 10GbE uplinks; it gives the vSphere administrator sufficient controls, in the form of limits and shares, to ensure predictable network performance when multiple traffic types contend for the same physical network resources.
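Shares matter only under contention: each traffic type's guaranteed minimum is its share count divided by the total shares of the active types. A small sketch of that arithmetic, assuming a single saturated 10GbE uplink with all four traffic types active (a simplification of the actual NIOC scheduler):

    NIOC_SHARES = {"virtual machine": 100, "management": 50, "vmotion": 25, "ft": 25}

    def bandwidth_under_contention(shares, link_gbps=10.0):
        """Worst-case Gbps per traffic type when all types saturate one uplink."""
        total = sum(shares.values())
        return {name: link_gbps * s / total for name, s in shares.items()}

    for name, gbps in bandwidth_under_contention(NIOC_SHARES).items():
        print(f"{name}: {gbps:.2f} Gbps")  # VM traffic is guaranteed 5 Gbps of 10GbE

When the link is not congested, any traffic type can burst beyond these minimums; the shares only arbitrate during contention.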

(Diagram: the same pair of 10GbE uplinks on a vSphere distributed switch; all traffic types are active on both uplinks, segmented by VLAN at the port group, with bandwidth arbitrated by NIOC. See Figure 14.)

Traffic Type      VLAN (Example)   Teaming Policy   Active dvuplink   Standby dvuplink
Management        178              LBT              dvuplink0, 1      None
vMotion           180              LBT              dvuplink0, 1      None
FT                98               LBT              dvuplink0, 1      None
Virtual Machine   *                LBT              dvuplink0, 1      None

Traffic Type      NIOC Shares
Management        50
vMotion           25
FT                25
Virtual Machine   100

Figure 14: 10 GbE Dynamic Network Design

Virtual Machines

VM-VM Anti-Affinity Rules

A VM-VM anti-affinity rule specifies which virtual machines are not allowed to run on the same host. Anti-affinity rules can be used to give host-failure resiliency to mission critical services provided by multiple virtual machines behind network load balancing (NLB). They also allow you to separate virtual machines with network-intensive workloads; if placed on one host, they might saturate the host's networking capacity.

Reservations

A reservation is the guaranteed minimum amount of host resources allocated to a virtual machine, used to avoid overcommitment. It ensures the virtual machine has sufficient resources to run efficiently. vCenter Server or ESXi allows you to power on a virtual machine only if there are enough unreserved resources to satisfy its reservation, and the server guarantees that amount even when the physical server is heavily loaded. After a virtual machine has accessed its full reservation, it is allowed to retain that amount of memory, and the memory is not reclaimed even if the virtual machine becomes idle.

For example, assume you have 2 GB of memory available for two virtual machines, and you specify a 1 GB memory reservation for VM1 and a 1 GB reservation for VM2. Each virtual machine is now guaranteed 1 GB of memory if it needs it. However, if VM1 is using only 500 MB and has not accessed all of its reservation, then VM2 can use 1.5 GB of memory until VM1's demand increases to 1 GB.
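A toy model of the example above (a deliberate simplification: ESXi's real memory scheduler also weighs shares, limits, overhead, and reclamation techniques such as ballooning):

    def usable_now(total_gb: float, my_reservation_gb: float,
                   other_vm_usage_gb: float) -> float:
        """Memory a VM can consume right now: whatever the other VM leaves free,
        but never less than this VM's own guaranteed reservation."""
        return max(my_reservation_gb, total_gb - other_vm_usage_gb)

    # 2 GB host, both VMs reserve 1 GB. While VM1 touches only 500 MB...
    print(usable_now(2.0, 1.0, other_vm_usage_gb=0.5))  # 1.5: VM2 borrows the slack
    # ...and once VM1's demand rises to its full 1 GB reservation:
    print(usable_now(2.0, 1.0, other_vm_usage_gb=1.0))  # 1.0: back to its guarantee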

If a customer-facing or mission critical application needs a guaranteed memory allocation, the reservation must be specified carefully, because it may impact the performance of other virtual machines and significantly reduce consolidation ratios.

Figure 15: Virtual Memory Configuration

Limits

A limit is the upper threshold of the host resources allocated to a virtual machine; a server will never allocate more resources to a virtual machine than its limit. The default is Unlimited, which means the amount of resources configured for the virtual machine at creation becomes the effective limit. For example, if you configured 2 GB of memory when you created a virtual machine but set a limit of 1 GB, the virtual machine would never be able to access more than 1 GB of memory, even when application demand requires more. If this value is misconfigured, users may experience application performance issues even though the host has plenty of resources available.

Shares

Shares specify the relative priority of a virtual machine's claim on the host's resources. If the host's memory is overcommitted and a mission critical virtual machine is not achieving an acceptable performance level, the virtualization administrator can raise that virtual machine's shares so the hypervisor allocates more host memory to it. Shares can be set to Low, Normal, or High, which correspond to share values in a 1:2:4 ratio.

Resource Pools

A resource pool is a logical abstraction for flexible management of resources. Resource pools can be grouped into hierarchies and used to hierarchically partition available CPU and memory resources. Each standalone host and each DRS cluster has an (invisible) root resource pool that groups the resources of that host or cluster; it does not appear separately because the resources of the host (or cluster) and the root resource pool are always the same. Users can create child resource pools of the root resource pool or of any user-created child resource pool. Each child resource pool owns some of the parent's resources and can, in turn, contain a hierarchy of child resource pools representing successively smaller units of computational capability.

A resource pool can contain child resource pools, virtual machines, or both, so you can create a hierarchy of shared resources. Resource pools at a higher level are called parent resource pools; resource pools and virtual machines at the same level are called siblings. The cluster itself represents the root resource pool, and if you do not create child resource pools, only the root resource pool exists. For each resource pool, you specify a reservation, a limit, shares, and whether the reservation should be expandable. The resource pool's resources are then available to its child resource pools and virtual machines.

For example, assume a host runs a number of virtual machines (Figure 16). The marketing department uses three of the virtual machines and the QA department uses two. Because the QA department needs larger amounts of CPU and memory, the administrator creates one resource pool for each group, setting CPU Shares to High for the QA department pool and Normal for the Marketing department pool so that the QA users can run automated tests. The second resource pool, with fewer CPU and memory resources, is sufficient for the lighter load of the marketing staff. Whenever the QA department is not fully using its allocation, the marketing department can use the available resources.

Figure 16: Resource Pool Example

By using resource pools, you can create customer service-level definitions tailored toward service offerings. The chart below demonstrates service-class definitions that incorporate limits, shares, and reservations; this helps with compute resource micro-segmentation. Each resource pool has different shares, CPU and memory limits, and expansion capabilities, which helps prioritize virtual machine workloads in accordance with service guidelines.

Figure 17: Resource Pool Tiers
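One way to capture such service-class definitions is as declarative data that an automation script (pyVmomi, PowerCLI, or similar) can apply to the cluster's resource pools. The tier names below echo Figure 17, but the values are illustrative placeholders, not figures from the chart:

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class ResourcePoolTier:
        """Service-class definition for a child resource pool (illustrative)."""
        cpu_shares: str           # "high" | "normal" | "low" (4:2:1 ratio)
        mem_reservation_pct: int  # percent of configured memory guaranteed
        cpu_limited: bool         # whether a hard CPU cap is applied
        expandable: bool          # may borrow unreserved capacity from the parent

    TIERS = {
        "Platinum": ResourcePoolTier("high",   100, cpu_limited=False, expandable=True),
        "Gold+":    ResourcePoolTier("high",    50, cpu_limited=False, expandable=True),
        "Silver":   ResourcePoolTier("normal",  25, cpu_limited=True,  expandable=True),
        "Bronze":   ResourcePoolTier("low",      0, cpu_limited=True,  expandable=False),
    }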

Infrastructure Maintenance and Deployment Management

Maintenance and Support

All IT organizations have limits on their resources: people, time, and money. Therefore, it is critical to determine what the vital business functions are. By creating a small infrastructure cell dedicated to mission critical core systems, you can enhance your infrastructure maintenance and deployment processes. Moreover, it can be risky to take more than two hosts out of a cluster at a time to perform maintenance and upgrades. By creating a small cluster, you can ensure that changes to the mission critical cell are performed only on approved infrastructure release dates. This minimizes risk to vital business functions through more rigid change and release management processes. Furthermore, by defining service levels with infrastructure resources, you can define infrastructure operations support levels to match application availability expectations.

As infrastructure growth continues at 20% year over year while head count to support it remains flat, there are only a handful of options to meet the challenge:

1. Do nothing, which degrades the ability to be strategic and places your infrastructure engineers in firefighting mode.
2. Maintain a steady FTE ratio and regularly add headcount.
3. Create support offerings to balance time and effort against application availability.

The third option is no different from vendor support offerings. For the Gold service level, the IT operations team would provide 24x7 support with root cause analysis; Silver service might be 24x5 support; and Bronze support would be Monday through Friday, 8 am to 8 pm. As mentioned in the introduction, mission critical applications, such as customer-facing applications and financial systems, are core to the company's mission, and their downtime translates to financial losses, while applications like general-purpose printing, software media libraries, and infrastructure monitoring tools don't require the same service-level capabilities.

Managed

In a managed server environment, infrastructure services generally takes responsibility for all server builds, any infrastructure or application software upgrades, and general maintenance such as reboots and hardware issues. The managed server type could apply to your Gold and Silver environments. A managed server leaves all the management duties of running the server under infrastructure services' control. In a managed server environment, the application support team has no administration rights to the server unless a business-justified exception is approved by Enterprise Information Security & Risk Management.

Unmanaged

A private cloud provides IT business partners the equivalent of their own personal datacenter. The infrastructure team allocates each owner a pool of resources (compute, memory, and disk), helps them with a catalog of standard server build templates, and then allows them to create, manage, and delete their virtual instances through a cloud management portal.

The Bronze service tier with self-service provisioning should be considered unmanaged unless special arrangements are made on an exception basis. Unmanaged virtual machines give application support teams complete server administration rights, along with the responsibility that goes with them. Unmanaged server hosting, despite its name, does not really leave application support teams to their own devices: they are still bound to adhere to corporate security guidelines and standards. Infrastructure services support for an unmanaged server is limited. Infrastructure services would still monitor overall host and cluster performance, resolve problems with infrastructure-related software, and troubleshoot operating system and connectivity issues. In the event that an issue on an unmanaged server is caused by a change made by the application support team, infrastructure services could provide a limited amount of engineering time to resolve it (for example, 30 minutes).

Creating Service Offerings

By defining service offerings, you give your business partners a framework for making the right decisions, and cost transparency creates the accountability that encourages the desired behavior. By using a multi-faceted approach with all the technology capabilities available, you can deliver the expected business outcome by matching technology to business requirements. In the diagram below, we start to put the components together into our service levels. For instance, our Gold service level includes the Gold cluster, Tier 1 and Tier 2 storage offerings, the Platinum and Gold+ resource pools, a restart priority of High, managed service, and 24x7 infrastructure support.

Figure 18: Service Tiers

Also, creating a hosting services heat map provides further clarity: it helps define which application services are approved for specific operational infrastructure hosting tiers, and it can include options for external public cloud service providers. Through these measured steps, you become a service broker to your business partners.

Figure 19: Hosting Services Heat Map

Infrastructure investments are capital expenditures made for corporate-wide consumption to support business capabilities through IT operations. Traditionally, infrastructure investments tended to be tactical and required more effort to identify, quantify, and calculate benefits and costs. By operating at a more service-oriented level, however, infrastructure investments can align closely with the strategic technical business plan and provide a greater return on investment. This hosted service strategy will help business leadership when assessing four major enterprise goals:

1. Cost-effective use of infrastructure
2. Effective asset utilization for business requirements
3. Application availability and resiliency
4. Maintaining appropriate staffing levels

The job of a CIO is determining the trade-off between the cost of technology and meeting business requirements. Becoming a service-oriented organization makes you a more cohesive partner to the business: business leadership will have more information available to make decisions, and you will become a greater stakeholder in influencing IT decisions.