Understanding SAN and Storage Constraints In Private Clouds L. Mark Stone, Founder and CIO (207) 772-5678 mark.stone@reliablenetworks.com
Agenda Non-technical Executive Summary Choke Points (there are many) Technical Level Setting Key Challenges Robust I/O = Fast, Wide and Consistent Use Cases: MSSQL Transaction Logs, Tuned MySQL server, and LAMP-stack IVR System Evaluating Cloud Providers: Due Diligence Checklists and SLAs Q&A
Introduction L. Mark Stone Founder and CIO of Reliable Networks CIO, NYC-based corporate trading company 19 offices in 16 countries Directed $1.2B in procurement spend Responsible for global infrastructure and $5.0 million software development Technology and Media M&A, 13 years M.Sc. Econ., London School of Economics B.S. Finance, State U. of NY @ Albany DJ and Chief Engineer 91FM WCDB
Executive Summary Most Public Cloud workloads to date do not require Robust I/O. Backups and Disaster Recovery Web Apps File Sharing Development and Testing
Executive Summary Most (public portions of) Hybrid Clouds to date do not require Robust I/O either: Production database servers stay on premises Scalable web front-end expands to Cloud Replica/Reporting database servers moving to Cloud
Executive Summary Many traditional Private Cloud workloads do require Robust I/O: Electronic Health Record systems Busy email systems Loaded LAMP stack systems Heavy database usage Heavy disk usage (e.g. map rendering)
Executive Summary Most Cloud providers to date have: Offered lower-cost storage solutions to meet the needs of typical web workloads Developed infrastructure architectures where storage I/O was not expected to be a bottleneck Favored investments in high-performance CPU cycles and RAM over storage
Executive Summary Consequently: High I/O applications work occasionally-to-intermittently well in public clouds Cloud providers starting to invest in Robust I/O hardware Hybrid Clouds poised to take off, but: Storage QoS solutions primitive at best, so careful technical due diligence and performance-oriented SLAs suggested Know your own application's attributes (IOPs, read:write ratio and average I/O size most critical)
Key Challenges ACID/near-ACID-compliant Legacy (database-driven) applications generally require Robust I/O Atomicity, Consistency, Isolation, Durability Apps coded to BASE standards (i.e. most Mobile apps) generally don't require Robust I/O Basically Available, Soft state, Eventual consistency First-Gen Public Clouds unsuitable for many busy Legacy systems:
Key Challenges Choke Points Poor Random Read:Write Performance Storage backends used low-cost SATA-backed SANs or midline DASD Too many VMs per compute node Linux top shows high %wa and %st Windows queue depth high Too many compute nodes per storage frame Only so many IOPs available Only so much compute node-to-storage frame bandwidth available
Key Challenges Choke Points Piggish compute node neighbors Your application gets starved (throttled) for CPU, I/O and storage bandwidth resources Distributed file systems/databases create additional choke points Results: Inconsistent performance Benchmark results non-repeatable Consequently, Legacy apps stay in-house
Wake Up Call Your own Private Cloud will face the same challenges as First-Gen Public Cloud providers Your Hybrid Cloud will perform only as well as the weakest link in the chain Consistent high performance from Legacy apps requires Robust I/O
Robust I/O Defined What is Robust I/O? Includes all I/O-related infrastructure upon exiting the compute nodes: Cabling, switches, routers, spinning disks, SSDs, disk controllers, SANs, network storage frames, etc. Must be as good or better than premises-based DASD, which means…
Robust I/O Defined Speed. Sustained and random I/O at the disk subsystem level must be fast enough to support the intended workloads. Bandwidth. Traffic channels between and within the compute and storage nodes must be wide enough to support the intended data traffic flows. Repeatability. Performance must be consistent whether the workloads are conducted at 2:00am or 2:00pm.
IOPs How Best To Measure? Input/Output Operations per Second Domain Controller = ~50 SMB Email Server = ~150 Busy Email MTA Server = ~500 Map Rendering Server = ~1,000
IOPs Quantified How many IOPs per disk? 15K SAS Enterprise DASD = ~180 IOPs 7.2K SATA Midline DASD = ~80 IOPs SAN-Grade SSD = ~3,000 to ~6,000 IOPs Does not include RAID Penalties Q: How many Read IOPs from a 24 x 15K SAS RAID 10 array? (Ignore write penalty for the moment) A: 24 x 180 = 4,320 IOPs (maybe), since reads can be served from every spindle in the array
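The per-disk arithmetic above drops into a few lines of Python. A minimal sketch, using only the per-disk figures quoted in the deck; the dictionary keys and helper name are illustrative, not from the deck:

```python
# Approximate per-disk IOPs figures quoted above.
PER_DISK_IOPS = {
    "15k_sas": 180,    # 15K SAS enterprise DASD
    "7.2k_sata": 80,   # 7.2K SATA midline DASD
}

def array_read_iops(disk_count, iops_per_disk):
    """Best-case read IOPs for a RAID 10 array: every spindle serves reads.
    Ignores the write penalty, caching, and controller limits."""
    return disk_count * iops_per_disk

# The 24 x 15K SAS RAID 10 example from the slide:
print(array_read_iops(24, PER_DISK_IOPS["15k_sas"]))  # 4320 (maybe!)
```

As the slide says "maybe": the next choke point is whether the compute-node-to-storage bandwidth can actually carry that many I/Os.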
Why Maybe? Compute Node-to-Storage Frame Bandwidth can get in the way of consuming all of your SAN's IOPs: Need to know your application's average I/O size to see if IOPs and/or bandwidth will be the choke point(s)
Example One: Do The Math MSSQL Server transaction log LUN Writes typically 4K each 1,000 IOPs = 0.31 Gbps of compute node-to-storage frame bandwidth required (Note 2x RAID 10 write penalty) No problem to host this on mid-tier storage frames.
Example Two: Do The Math MySQL database averaging 32K I/O 500 IOPs consumed; presume 100% reads 1.25 Gbps of compute node-to-storage frame bandwidth required for just one server in the farm Key Takeaway: One-half the IOPs consumes four times the bandwidth
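The takeaway above is simple arithmetic: bandwidth scales with IOPs times I/O size, so halving the IOPs while octupling the I/O size (4K to 32K) quadruples the bandwidth. A minimal sketch of that ratio, counting raw payload only (the deck's Gbps figures additionally include wire-level and RAID overhead); the function name is illustrative:

```python
def payload_rate(iops, io_size_kb):
    """Raw payload moved per second, in KB/s.
    Protocol framing and RAID write amplification are ignored."""
    return iops * io_size_kb

mssql_log = payload_rate(1000, 4)   # Example One: 1,000 IOPs of 4K writes
mysql_db = payload_rate(500, 32)    # Example Two: 500 IOPs of 32K reads

print(mysql_db / mssql_log)  # 4.0 -- half the IOPs, four times the bandwidth
```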
Example Three: Do The Math Integrated LAMP IVR system: About 50:50 read:write Inbound/outbound Email Third-party IVR hosting Peak Load = ~5,000 calls per hour ~1,000 IOPs and ~1.0 Gbps compute node-to-storage bandwidth consumed
Technical Level Setting IOPs at the application level is always less than IOPs needed at the storage level unless the system is 100% read Storage IOPs needed = (App IOPs x % read) + ((App IOPs x %write) x RAID penalty) RAID Penalties (for writes only) = RAID1 = 2; RAID6 = 6
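The formula above translates directly to code. A minimal sketch using only the write penalties quoted in the deck (RAID 1/10 = 2, RAID 6 = 6); function and dictionary names are illustrative:

```python
# Write penalties quoted in the deck (reads carry no penalty).
RAID_WRITE_PENALTY = {
    "raid1": 2,
    "raid10": 2,
    "raid6": 6,
}

def storage_iops(app_iops, read_fraction, raid_level):
    """Storage IOPs needed = (App IOPs x %read)
                           + (App IOPs x %write x RAID penalty).
    Reads pass through unchanged; each write is amplified by the
    RAID level's write penalty."""
    penalty = RAID_WRITE_PENALTY[raid_level]
    reads = app_iops * read_fraction
    writes = app_iops * (1 - read_fraction)
    return reads + writes * penalty

# 100% read workload: storage IOPs equal application IOPs.
print(storage_iops(500, 1.0, "raid6"))   # 500.0
# 50:50 workload on RAID 6: every write costs 6 storage I/Os.
print(storage_iops(1000, 0.5, "raid6"))  # 3500.0
```

Note how quickly a write-heavy workload on RAID 6 multiplies the IOPs the storage frame must deliver.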
Technical Level Setting Typical I/O Sizes It Depends SANs we access are reporting averages of 10K to 50K Bottom Line 1: Storage IOPs consumed are typically a multiple of IOPs produced at the application layer Bottom Line 2: No set rule of thumb correlating IOPs and bandwidth
What To Do? Step One: Know Your Application Benchmark IOPs, read:write percentage and average I/O size Step Two: Provider Due Diligence Eliminate providers whose I/O is more Robust (and costly) than you need, and those whose I/O is not Robust enough As soon as you start talking tech, your provider's salesperson will refer you to a knowledgeable engineer
What To Do? Provider Due Diligence Continued Ask about the choke points Ask about the architecture Redundancy and resiliency One provider uses DASD storage for performance; if a compute node fails, your VM will need to be restored from a backup Note: If your app is coded to BASE standards, redundancy/resiliency within a data center is effectively irrelevant (true Cloud deployment)
What To Do? Step Three: Negotiate Performance-Related SLAs over Uptime-Related SLAs No mature storage QoS solutions yet, so need to rely on close proxies at the application layer, e.g. Linux: %st as reported by top Windows: Disk queue depth
What To Do? Step Four: Avoid Square Peg-Round Hole Syndrome: There is nothing wrong with dedicated, non-virtualized servers using DASD if the workloads require it Recoding Legacy apps to BASE standards is at best a non-trivial task
Summary Conclusions Storage systems and the connectivity to them have many choke points which can make the performance of your application erratic at best Obtaining Robust I/O in a Private Cloud for Legacy applications is a challenge; know your application so you can conduct informed provider due diligence
Summary Conclusions In the absence of storage QoS, application performance proxies should be monitored Negotiate for SLAs based on performance primarily, in addition to uptime There's no shame (today) in non-virtualized, dedicated server hosting if the workloads require it
Q&A and Contact Info L. Mark Stone, Founder and CIO http://www.reliablenetworks.com/blog/ mark.stone@reliablenetworks.com @LMStone http://www.linkedin.com/in/lmarkstone Reliable Networks 477 Congress Street, Suite 812 Portland, ME 04101 (207) 772-5678