How To Fix A Fault Fault Fault Management In A Vsphere 5 Vsphe5 Vsphee5 V2.5.5 (Vmfs) Vspheron 5 (Vsphere5) (Vmf5) V



Similar documents
VMware Virtual Machine File System: Technical Overview and Best Practices

IOmark- VDI. Nimbus Data Gemini Test Report: VDI a Test Report Date: 6, September

Nimble Storage for VMware View VDI

IOmark- VDI. HP HP ConvergedSystem 242- HC StoreVirtual Test Report: VDI- HC b Test Report Date: 27, April

IOmark-VM. DotHill AssuredSAN Pro Test Report: VM a Test Report Date: 16, August

VMware Best Practice and Integration Guide

Microsoft SMB File Sharing Best Practices Guide

Using VMWare VAAI for storage integration with Infortrend EonStor DS G7i

VMware vsphere 5.1 Advanced Administration

Virtual Volumes Technical Deep Dive

Storage I/O Control Technical Overview and Considerations for Deployment. VMware vsphere 4.1

Using EonStor FC-host Storage Systems in VMware Infrastructure 3 and vsphere 4

Data recovery Data management Electronic Evidence

Setup for Failover Clustering and Microsoft Cluster Service

Dell Compellent Storage Center SAN & VMware View 1,000 Desktop Reference Architecture. Dell Compellent Product Specialist Team

Top Ten Private Cloud Risks. Potential downtime and data loss causes

Storage Protocol Comparison White Paper TECHNICAL MARKETING DOCUMENTATION

Best Practice of Server Virtualization Using Qsan SAN Storage System. F300Q / F400Q / F600Q Series P300Q / P400Q / P500Q / P600Q Series

June Blade.org 2009 ALL RIGHTS RESERVED

VMware vsphere 5.0 Boot Camp

WHITE PAPER Optimizing Virtual Platform Disk Performance

IBM XIV Gen3 Storage System Storage built for VMware vsphere infrastructures

Efficient Storage Strategies for Virtualized Data Centers

Backup and Recovery Best Practices With CommVault Simpana Software

Technology Insight Series

Configuration Maximums

WHITE PAPER. Permabit Albireo Data Optimization Software. Benefits of Albireo for Virtual Servers. January Permabit Technology Corporation

Leveraging Virtualization for Disaster Recovery in Your Growing Business

VMware Virtual SAN Backup Using VMware vsphere Data Protection Advanced SEPTEMBER 2014

TECHNICAL PAPER. Veeam Backup & Replication with Nimble Storage

Evolving Datacenter Architectures

BEST PRACTICES GUIDE: VMware on Nimble Storage

DR-to-the- Cloud Best Practices

How To Set Up An Iscsi Isci On An Isci Vsphere 5 On An Hp P4000 Lefthand Sano On A Vspheron On A Powerpoint Vspheon On An Ipc Vsphee 5

The Modern Virtualized Data Center

Solution Brief Availability and Recovery Options: Microsoft Exchange Solutions on VMware

DIABLO TECHNOLOGIES MEMORY CHANNEL STORAGE AND VMWARE VIRTUAL SAN : VDI ACCELERATION

Addendum No. 1 to Packet No Enterprise Data Storage Solution and Strategy for the Ingham County MIS Department

Choosing and Architecting Storage for Your Environment. Lucas Nguyen Technical Alliance Manager Mike DiPetrillo Specialist Systems Engineer

Virtual server management: Top tips on managing storage in virtual server environments

Nutanix NOS 4.0 vs. Scale Computing HC3

Nimble Storage VDI Solution for VMware Horizon (with View)

Getting the Most Out of Virtualization of Your Progress OpenEdge Environment. Libor Laubacher Principal Technical Support Engineer 8.10.

The next step in Software-Defined Storage with Virtual SAN

Oracle Solutions on Top of VMware vsphere 4. Saša Hederić VMware Adriatic

EMC Avamar Backup Solutions for VMware ESX Server on Celerra NS Series 2 Applied Technology

Drobo How-To Guide. Deploy Drobo iscsi Storage with VMware vsphere Virtualization

Introduction. Setup of Exchange in a VM. VMware Infrastructure

Pure Storage and VMware Integration

Availability Guide for Deploying SQL Server on VMware vsphere. August 2009

Setup for Microsoft Cluster Service ESX Server and VirtualCenter 2.0.1

WHITE PAPER 1

Analysis of VDI Storage Performance During Bootstorm

Deep Dive on SimpliVity s OmniStack A Technical Whitepaper

SAN Conceptual and Design Basics

QNAP in vsphere Environment

Best Practices when implementing VMware vsphere in a Dell EqualLogic PS Series SAN Environment

VMWARE VSPHERE 5.0 WITH ESXI AND VCENTER

Redefining Microsoft SQL Server Data Management. PAS Specification

Virtual DR: Disaster Recovery Planning for VMware Virtualized Environments

Backup and Recovery Best Practices With Tintri VMstore

Quorum DR Report. Top 4 Types of Disasters: 55% Hardware Failure 22% Human Error 18% Software Failure 5% Natural Disasters

Best Practices for Running VMware vsphere on Network-Attached Storage (NAS) TECHNICAL MARKETING DOCUMENTATION V 2.0/JANUARY 2013

Increasing Storage Performance, Reducing Cost and Simplifying Management for VDI Deployments

VMware Virtual SAN Design and Sizing Guide TECHNICAL MARKETING DOCUMENTATION V 1.0/MARCH 2014

What s New in VMware vsphere 4.1 Storage. VMware vsphere 4.1

Best Practices for Monitoring Databases on VMware. Dean Richards Senior DBA, Confio Software

How To Use Vcenter Site Recovery Manager 5 With Netapp Fas/Vfs Storage System On A Vcenter Vcenter 5 Vcenter 4.5 Vcenter (Vmware Vcenter) Vcenter 2.

VMware Software-Defined Storage & Virtual SAN 5.5.1

VMware vstorage Virtual Machine File System. Technical Overview and Best Practices

What s New: vsphere Virtual Volumes

VMware and VSS: Application Backup and Recovery

Using VMware VMotion with Oracle Database and EMC CLARiiON Storage Systems

Performance Characteristics of VMFS and RDM VMware ESX Server 3.0.1

VMware vsphere Data Protection 6.0

VMware Certified Professional 5 Data Center Virtualization (VCP5-DCV) Exam

How To Backup A Vranger Vrander With Asynch Mirroring On A Vmserd With An Asyncher Backup On A Datacore Vrangers Memory On A Powerpoint Vrgera Vrenger On A

Q & A From Hitachi Data Systems WebTech Presentation:

Performance characterization report for Microsoft Hyper-V R2 on HP StorageWorks P4500 SAN storage

THE SUMMARY. ARKSERIES - pg. 3. ULTRASERIES - pg. 5. EXTREMESERIES - pg. 9

Comparison of Hybrid Flash Storage System Performance

A virtual SAN for distributed multi-site environments

WHITE PAPER Guide to 50% Faster VMs No Hardware Required

Delivering SDS simplicity and extreme performance

What s New with VMware Virtual Infrastructure

Drobo How-To Guide. Topics Drobo and vcenter SRM Basics Configuring an SRM solution Testing and executing recovery plans

The Power of Deduplication-Enabled Per-VM Data Protection SimpliVity s OmniCube Aligns VM and Data Management

Nutanix Tech Note. Data Protection and Disaster Recovery

Server and Storage Sizing Guide for Windows 7 TECHNICAL NOTES

Cloud Storage. Parallels. Performance Benchmark Results. White Paper.

Maximizing VMware ESX Performance Through Defragmentation of Guest Systems. Presented by

Virtual Machine Backup Guide

Dell PowerVault MD32xx Deployment Guide for VMware ESX4.1 Server

Transcription:

VMware Storage Best Practices Patrick Carmichael Escalation Engineer, Global Support Services. 2011 VMware Inc. All rights reserved

Theme Just because you COULD, doesn t mean you SHOULD. Lessons learned in Storage Best Practices 2

Just because you Could, doesn t mean you SHOULD. Storage Performance and Technology Interconnect vs IOP. Disk and RAID differences. SSD vs Spinning Media. VAAI Xcopy/write_same ATS VMFS5 Thin Provisioning Architecting for Failure Planning for the failure from the initial design. Individual Components Complete Site Failure Backup RTO DR RTO 3

Storage Performance Interconnect vs IOP Significant advances in interconnect performance FC 2/4/8GB iscsi 1G/10G NFS 1G/10G Differences in performance between technologies. None NFS, iscsi and FC are effectively interchangeable. Despite advances, performance limit is still hit at the media itself. 90% of storage performance cases seen by GSS that are not config related, are media related. Payload (throughput) is fundamentally different from IOP (cmd/s). IOP performance is always lower than throughput. 4

Factors that affect Performance Hard Disks Disk subsystem bottlenecks Performance versus Capacity Performance versus Capacity Disk performance does not scale with drive size Larger drives generally equate lower performance IOPS(I/Os per second) is crucial How many IOPS does this number of disks provide? How many disks are required to achieve a required number of IOPS? More spindles generally equals greater performance 5

Factors that affect Performance - RAID RAID is used to aggregate disks for performance and redundancy However RAID has an I/O Penalty for Writes Reads have an IO penalty of 1. Write IO penalty varies depending on RAID choice RAID Type IO Penalty 1 2 5 4 6 6 10 2 6

Factors that affect Performance - I/O Workload and RAID Understanding workload is a crucial consideration when designing for optimal performance. Workload is characterized by IOPS and write % vs read %. Design choice is usually a question of: How many IOPs can I achieve with a given number of disks? Total Raw IOPS = Disk IOPS * Number of disks Functional IOPS = (Raw IOPS * Write%)/(Raid Penalty) + (Raw IOPS * Read %) How many disks are required to achieve a required IOPS value? Disks Required = ((Read IOPS) + (Write IOPS*Raid Penalty))/ Disk IOPS 7

IOPS Calculations Fixed number of disks Calculating IOPS for a given number of disks 8 x 146GB 15K RPM SAS drives ~150 IOPS per disk RAID 5 150 * 8 = 1200 Raw IOPS Workload is 80% Write, 20% Read (1200*0.8)/4 + (1200*0.2) = 480 Functional IOPS Raid Level IOPS(80%Read 20%Write) IOPS(20%Read 80%Write) 5 1020 480 6 1000 400 10 1080 720 8

IOPS Calculations Minimum IOPS Requirement Calculating number of disks for a required IOPS value 1200 IOPS required 15K RPM SAS drives. ~150 IOPS per disk Workload is 80% Write, 20% Read RAID 5 Disks Required = (240 + (960*4))/150 IOPS 27 Disks required Raid Level Disks(80%Read 20%Write) Disks (20%Read 80%Write) 5 13 27 6 16 40 10 10 14 9

What about SSD? SSD potentially eliminates the physical limitation of spinning media. Advertised speeds of 10,000 IOPS+ Only reached under very specific conditions. Real world performance must be tested Test with IO footprint as close to your intended use as possible Actual values will vary, but will be significantly higher than spinning media The value of testing, regardless of SSD or traditional media, cannot be understated. Every array is different. 10

VAAI (xcopy/write_same) Advertised as a way to improve performance of certain operations Despite common belief, VAAI does not reduce load. Offload to array of certain operations A storage array is built to handle these operations far more efficient, and much faster than ESX sending the commands for each individual block. Still requires the disks perform the commands in question In some scenarios, offloading these operations can push the array past its limits, much like doing the same sequence on the host would. If your environment is at maximum performance capacity, VAAI will not allow you to do things you could not otherwise do. 11

VAAI (ATS) A final answer to the SCSI Reservation problem. Everyone is familiar with the issues behind SCSI reservations. Whole lun locking for simple metadata changes Blocks IO from all other hosts Lost reserves can mean downtime Differing capabilities by vendor / model mean different maximums. ATS instead locks (via a new SCSI spec) only the blocks in question. Eliminates the design limitations of SCSI reserves Capable of handling significantly larger VM/lun ratios. Allows for larger luns without lost space. 12

VAAI (ATS) contd. Remember our theme: Just because you could, doesn t mean you should. ATS will allow you to significantly increase consolidation ratios (by up to 100% in some cases) per-lun. It will not, however, guarantee the underlying spindles can handle the normal IO load of said VMs. Primarily a concern with linked clone environments View/VCD/Lab Manager vms take up very little space (storing changes / persistent disks only) Linked clones generate significant amounts of reservations ATS is designed specifically to handle this, but many forget that the VMs have a normal IO load as well that can overwhelm the disks in other ways. Doubling VM count doubles IO load. Consider all the implications of what the technology will allow you to do. 13

VAAI (ATS) in ESX5. New feature! ATS-Only volumes. Any volume created on ESX5, as VMFS5, where the array reports that it supports ATS (at the time of creation), will be created as ATS-only. Flag disables SCSI-2 reservations. This is good! No reservation storm from ATS failures. This also means that if something changes, your volume may be unreadable. SRM does your DR site have an ATS capable array? If not, volumes won t mount (different firmware revisions). Some firmware upgrades on arrays disable their ATS support. A global option can be set to disable this feature, or set pervolume. KB to be public shortly. Some info in KB 2006858 14

VAAI (other minor considerations) Performance graphs/esxtop will be skewed for VAAI commands. Unlike traditional commands that receive an acknowledgement for each command/block, the array will execute multiple commands for each VAAI command This takes significantly longer, but esxtop/performance graphs expect each command to return as normal, so the values reported will be skewed. Does not indicate a performance problem. 15

VMFS5 VMFS5 is the new, 3 rd generation filesystem from VMware Introduced with vsphere5 Eliminates 2TB-512B size limit Max size: 64TB 1MB block size File size for VMDKs still limited to 2TB currently 64TB max for prdm GPT partition table (with backup copy at end of disk). Allows use of truly large logical units for workloads that would previously have required extents/spanned disks. 16

VMFS5 contd. VMFS5, in combination with ATS, will allow consolidation of ever-larger number of VMs onto single VMFS datastores. One lun could contain the VMs previously stored on 32 (assuming max utilization). While potentially easier for management, this means that 32 LUNs worth of VMs are now reliant on a single volume header. When defining a problem space, you ve now expanded greatly the number of items in that problem space Just because you could, does that mean you should? 17

Thin Provisioning Thin provisioning offers very unique opportunities to manage your storage after provisioning. Workloads that require a certain amount of space, but don t actually use it. Workloads that may grow over time, and can be managed/moved if they do. Providing space for disparate groups that have competing requirements. The question is, where should you thin provision, and what are the ramifications? 18

Thin Provisioning VM Level VM disk is created @ 0b, until used, and then grows @ VMFS Block size as needed Minor performance penalty for provisioning / zeroing. Disk cannot currently be shrunk once grown, it stays grown. There are some workarounds for this, but they are not guaranteed What happens when you finally run out of space? All VMs stun until space is created Production impacting, but potentially a quick fix (shut down VMs, adjust memory reservation, etc). Extremely rare to see data loss of any kind. 19

Thin Provisioning LUN Level. Allows your array to seem bigger than it actually is. Allows you to share resources between groups (the whole goal of virtualization). Some groups may not use all or much of what they re allocated, allowing you to utilize the space they re not using. Standard sized or process defined luns may waste significant amounts of space, and space being wasted is $$ being wasted. Significant CapEX gains can be seen with thin luns. 20

Thin Provisioning LUN Level - contd What happens when you finally run out of space? New VAAI primitives, for compatible arrays, will let ESX know that the underlying storage has no free blocks. If VAAI works, and your array is compatible, and you re on a supported version of ESX, this will result in the same as a thin VMDK running out of space All VMs will stun (that are waiting on blocks). VMs not waiting on blocks will continue as normal. Cleanup will require finding additional space on the array, as VSWP files / etc will already be allocated blocks at the lun level. Depending on your utilization, this may not be possible, unless UNMAP also works (very limited support at this time). If VAAI is not available for your environment, or does not work correctly, then what? On a good day, the VMs will simply crash with a write error, or the application inside will fail (depends on array and how it handles a full filesystem). Databases and the like are worst affected, will most likely require rebuild/repair. And on a bad day? 21

Thin Provisioning LUN Level contd. 22

Thin Provisioning There are many reasons to use Thin Provisioning, at both the VM and the LUN level. Thin provisioning INCREASES the management workload of maintaining your environment. You cannot just ignore it. 23

Details for new VAAI Features http://blogs.vmware.com/vsphere/2011/07/new-enhanced-vsphere- 50-storage-features-part-3-vaai.html Please note, UNMAP has been disabled by default as of P01, due to issues with some arrays. Please confirm with your vendor the support status before turning it back on. 24

Just because you Could, doesn t mean you Should Everything we ve covered so far is based on new technologies What about the existing environments? The ultimate extension of Just because you could, doesn t mean you should, is what I call Architecting for Failure 25

Architecting for Failure The ultimate expression of Just because you could, doesn t mean you should. Many core infrastructure designs are built with tried and true hardware and software, and people assume that things will always work We all know this isn t true Murphy s law. Architect for the failure. Consider all of your physical infrastructure. If any component failed, how would you recover? If everything failed, how would you recover? Consider your backup/dr plan as well 26

Black Box Testing Software engineering concept. Consider your design, all of the inputs, and all of the expected outputs. Feed it good entries, bad entries, and extreme entries, find the result, and make sure it is sane. If not, make it sane. This can be applied before, or after, you build your environment 27

Individual Component Failure Black box: consider each step your IO takes, from VM to physical media. Physical hardware components are generally easy to compensate for. VMware HA and FT both make it possible for a complete system to fail with little/no downtime to the guests in question. Multiple hardware components add redundancy and eliminate single points of failure Multiple NICs Multiple storage paths. Traditional hardware (multiple power supplies, etc). Even with all of this, many environments are not taking advantage of these features. Sometimes, the path that IO takes passes through a single point of failure that you don t realize is one 28

What about a bigger problem? You re considering all the different ways to make sure individual components don t ruin your day. What if your problem is bigger? 29

Temporary Total Site Loss Consider your entire infrastructure during a temporary complete failure. What would happen if you had to bootstrap it cold? This actually happens more often than would be expected. Hope for the best, prepare for the worst. Consider what each component relies on do you have any circular dependencies? Also known as the Chicken and the Egg problem, these can increase your RTO significantly. Example: Storage mounted via DNS, all DNS servers on the same storage devices. Restoring VC from backup when all networking is via DVS. What steps are required to bring your environment back to life? How long will it take? Is that acceptable? 30

Permanent Site Loss. Permanent site loss is not always an Act of God type event Far more common is a complete loss of a major, critical component. Site infrastructure (power, networking, etc) may be intact, but your data is not Array failures (controller failure, major filesystem corruption, RAID failure) Array disaster (thermal event, fire, malice) Human error yes, it happens! Multiple recovery options which do you have? Backups. Tested and verified? What s your RTO for a full failure? Array based replication Site Recovery Manager Manual DR How long for a full failover? Host based replication. 31

Permanent Site Loss contd. Consider the RTO of your choice of disaster recovery technology. It equates directly to the amount of time you will be without your virtual machines. How long can you, and your business, be without those services? A perfectly sound backup strategy is useless, if it cannot return you to operation quickly enough. Architect for the Failure make sure every portion of your environment can withstand a total failure, and recovery is possible in a reasonable amount of time. 32

The End. Questions? 33