Windows Server Infrastructure for SQL Server Michael Frandsen michaelf@mentalnote.dk
Agenda: SQL Server storage challenges · The SAN legacy · Traditional interconnects · SMB past · New old interconnects · File Shares, the new Black · DIY Shared Storage · Microsoft vnext · HP vnext
Bio - Michael Frandsen. I have worked in the IT industry for just over 22 years, 18 of them as a consultant. I have a close relationship with Microsoft R&D in Redmond: with the Windows team for 19 years, ever since the first beta of Windows NT 3.1, and with SQL Server for 19 years, since the first version Microsoft did by themselves, v4.21a. I hold various advisory positions in Redmond and am involved in vnext versions of Windows, Hyper-V, SQL Server and Office/SharePoint — currently Windows Blue (Windows 8.1 & Windows Server 2012 R2) and SQL14 (SQL Server 2014). Specialty areas: architecture & design, high-performance storage, low latency, Kerberos, scalability (scale-up & scale-out), consolidation (especially SQL Server), high availability, VLDB, data warehouse platforms, BI platforms, High Performance Computing (HPC) clusters, Big Data platforms & architecture.
MentalNote. Independent one-man company with no affiliations — only selling consultancy deliverables: no hardware sales, no software sales. Mission — Excellence: delivering knowledge and empowering clients through extensive and deep knowledge, always based on facts and objectivity. This is achieved through constant development of skills, by both internal research and partnerships with leading companies in the IT industry, such as Microsoft, HP and others — not only locally, but mainly with these companies' Research and Development departments at their respective headquarters. Responsibility: giving back to people, animals and places in need of support, with both manpower and funding of various causes, such as cancer research, heart illness, animal protection, terminally ill children, orphans, developing countries, and areas hit by natural disaster.
SQL Server storage challenges: Capacity · Fast · Shared · Reliable
The SAN legacy — "Because it's expensive, it must be fast." [Chart: perceived Performance vs. Price]
The SAN legacy — shared storage or Direct Attached? [Diagram: File Server, Database Server and Mail Server each connected at 2 x 8Gb/s to a shared SAN]
The SAN legacy — a widespread misconception
The SAN legacy — a complex stack. [Diagram: the full I/O path from SQL Server to the spindles — SQL Server, Windows, CPU cores, MPIO algorithm / MPIO DSM, dual FC HBAs (paths A/B), WWN zoning, FC switch port logic, storage controller with cache, XOR engine and SCSI controller, LUNs and disks.] Every hop has its own rate limit: SQL Server read-ahead rate → CPU feed rate → HBA port rate → switch port rate → SP port rate → LUN read rate → disk feed rate.
SAN Bottleneck — typical SAN load: low-to-medium I/O processor load (top: slim rectangles); low cache load (middle: big rectangles); low disk spindle load (lower half: squares)
SAN Bottleneck — typical Data Warehouse / BI / VLDB SAN load: high I/O processor load, maxed out (top: slim rectangles); high cache load (middle: big rectangles); low disk spindle load (lower half: squares)
SAN Bottleneck — ideal Data Warehouse / BI / VLDB SAN load: low-to-medium I/O processor load (top: slim rectangles); low-to-medium cache load (middle: big rectangles); high disk spindle load (lower half: squares)
Traditional interconnects. Fibre Channel: stalled at 8Gb/s for many years; 16Gb/s FC still very exotic; strong movement towards FCoE (Fibre Channel over Ethernet). iSCSI: started in low-end storage arrays; many still 1Gb/s; 10GbE storage arrays typically have few ports compared to FC. NAS: NFS, SMB, etc.
File Share reliability Is this mission critical technology?
SMB 1.0 — 100+ commands. Protocol negotiation, user authentication and share access: NEGOTIATE, SESSION_SETUP_ANDX, TRANS2_SESSION_SETUP, LOGOFF_ANDX, PROCESS_EXIT, TREE_CONNECT, TREE_CONNECT_ANDX, TREE_DISCONNECT. File, directory and volume access: CHECK_DIRECTORY, CLOSE, CLOSE_PRINT_FILE, COPY, CREATE, CREATE_DIRECTORY, CREATE_NEW, CREATE_TEMPORARY, DELETE, DELETE_DIRECTORY, FIND_CLOSE, FIND_CLOSE2, FIND_UNIQUE, FLUSH, GET_PRINT_QUEUE, IOCTL, IOCTL_SECONDARY, LOCK_AND_READ, LOCK_BYTE_RANGE, LOCKING_ANDX, MOVE, NT_CANCEL, NT_CREATE_ANDX, NT_RENAME, NT_TRANSACT, NT_TRANSACT_CREATE, NT_TRANSACT_IOCTL, NT_TRANSACT_NOTIFY_CHANGE, NT_TRANSACT_QUERY_QUOTA, NT_TRANSACT_QUERY_SECURITY_DESC, NT_TRANSACT_RENAME, NT_TRANSACT_SECONDARY, NT_TRANSACT_SET_QUOTA, NT_TRANSACT_SET_SECURITY_DESC, OPEN, OPEN_ANDX, OPEN_PRINT_FILE, QUERY_INFORMATION, QUERY_INFORMATION_DISK, QUERY_INFORMATION2, READ, READ_ANDX, READ_BULK, READ_MPX, READ_RAW, RENAME, SEARCH, SEEK, SET_INFORMATION, SET_INFORMATION2, TRANS2_CREATE_DIRECTORY, TRANS2_FIND_FIRST2, TRANS2_FIND_NEXT2, TRANS2_FIND_NOTIFY_FIRST, TRANS2_FIND_NOTIFY_NEXT, TRANS2_FSCTL, TRANS2_GET_DFS_REFERRAL, TRANS2_IOCTL2, TRANS2_OPEN2, TRANS2_QUERY_FILE_INFORMATION, TRANS2_QUERY_FS_INFORMATION, TRANS2_QUERY_PATH_INFORMATION, TRANS2_REPORT_DFS_INCONSISTENCY, TRANS2_SET_FILE_INFORMATION, TRANS2_SET_FS_INFORMATION, TRANS2_SET_PATH_INFORMATION, TRANSACTION, TRANSACTION_SECONDARY, TRANSACTION2, TRANSACTION2_SECONDARY, UNLOCK_BYTE_RANGE, WRITE, WRITE_AND_CLOSE, WRITE_AND_UNLOCK, WRITE_ANDX, WRITE_BULK, WRITE_BULK_DATA, WRITE_COMPLETE, WRITE_MPX, WRITE_MPX_SECONDARY, WRITE_PRINT_FILE, WRITE_RAW. Other: ECHO, TRANS_CALL_NMPIPE, TRANS_MAILSLOT_WRITE, TRANS_PEEK_NMPIPE, TRANS_QUERY_NMPIPE_INFO, TRANS_QUERY_NMPIPE_STATE, TRANS_RAW_READ_NMPIPE, TRANS_RAW_WRITE_NMPIPE, TRANS_READ_NMPIPE, TRANS_SET_NMPIPE_STATE, TRANS_TRANSACT_NMPIPE, TRANS_WAIT_NMPIPE, TRANS_WRITE_NMPIPE. Slide callout: 14 distinct WRITE operations?!??
SMB 2.0 — 19 commands. Protocol negotiation, user authentication and share access: NEGOTIATE, SESSION_SETUP, LOGOFF, TREE_CONNECT, TREE_DISCONNECT. File, directory and volume access: CANCEL, CHANGE_NOTIFY, CLOSE, CREATE, FLUSH, IOCTL, LOCK, QUERY_DIRECTORY, QUERY_INFO, READ, SET_INFO, WRITE. Other: ECHO, OPLOCK_BREAK. TCP is a required transport — SMB2 no longer supports NetBIOS over IPX, NetBIOS over UDP or NetBEUI.
SMB 2.1 — performance improvements. Up to 1MB MTU to better utilize 10GbE — disabled by default! The real benefit requires application support, e.g. Robocopy in Windows 7 / 2008 R2 is multi-threaded (defaults to 8 threads, range 1-128).
SQL Server SMB support. Before 2008: using a UNC path could be enabled with a trace flag — not an officially supported scenario; no support for system databases; no support for failover clustering. 2008 R2: UNC paths fully supported by default — still no support for system databases or failover clustering.
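A hedged sketch of that pre-2008 R2 trace-flag approach (assuming trace flag 1807, which lifted the network-path check in those versions; the share \\fs1\sql and database name are hypothetical):

-- Unsupported scenario on SQL Server 2005/2008: allow database files on a UNC path.
DBCC TRACEON (1807, -1);

-- The SQL Server service account needs full control on the share.
CREATE DATABASE SalesDb
ON PRIMARY (NAME = SalesDb_data, FILENAME = '\\fs1\sql\SalesDb.mdf')
LOG ON     (NAME = SalesDb_log,  FILENAME = '\\fs1\sql\SalesDb.ldf');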
Two things happened SQL Server 2012 Windows Server 2012
SQL Server 2012 — UNC support expanded. System databases supported on SMB. Failover Clustering supports SMB as shared storage — and TempDB can now reside on NON-shared storage (Mark Souza commented: "Great suggestion!").
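A minimal sketch of the TempDB change on a SQL Server 2012 failover-cluster instance (the local path D:\LocalTempDB is hypothetical; the files are recreated there on the next service restart):

-- Point TempDB at local, non-shared storage on each cluster node.
ALTER DATABASE tempdb MODIFY FILE (NAME = tempdev, FILENAME = 'D:\LocalTempDB\tempdb.mdf');
ALTER DATABASE tempdb MODIFY FILE (NAME = templog, FILENAME = 'D:\LocalTempDB\templog.ldf');
-- Create the same folder on every node, then restart the SQL Server service.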
Windows Server 2012: InfiniBand · Teaming · SMB 3.0 · RDMA · Multichannel · SMB Direct
New old interconnects — InfiniBand characteristics: been around since 2001; used mainly for HPC clusters and supercomputing; high throughput; RDMA capable; low latency; quality of service; failover; scalable
InfiniBand throughput
InfiniBand throughput — [Chart: trends in I/O interfaces with servers]
InfiniBand throughput — [Chart: low-level uni-directional bandwidth measurements]
File Shares, the new Black — why file shares? Massively increased stability: cleaned-up protocol; Transparent Failover between cluster nodes with no service outage! Massively increased functionality: Multichannel, RDMA and SMB Direct. Massively decreased complexity: no more MPIO, DSM, zoning, HBA tuning, fabric zoning, etc.
New protocol - SMB 3.0. Which SMB protocol version is used (negotiated between client OS and server OS):

Client \ Server              | Win8 / WS2012 | Win7 / WS2008 R2 | Vista / WS2008 | Older Windows
Windows 8 / WS2012           | SMB 3.0       | SMB 2.1          | SMB 2.0        | SMB 1.0
Windows 7 / WS2008 R2        | SMB 2.1       | SMB 2.1          | SMB 2.0        | SMB 1.0
Windows Vista / WS2008       | SMB 2.0       | SMB 2.0          | SMB 2.0        | SMB 1.0
Previous versions of Windows | SMB 1.0       | SMB 1.0          | SMB 1.0        | SMB 1.0
Transparent Failover. Normal operation: a SQL Server or Hyper-V server talks to \\fs1\share on File Server Node A. Failover to Node B: connections and handles are auto-recovered; application I/O continues with no errors. [Diagram: two-node file server cluster presenting \\fs1\share]
SMB Multichannel. Configurations: a single RSS-capable NIC; multiple 1GbE NICs; multiple NICs in a team; multiple RDMA NICs (10GbE/InfiniBand). Full throughput: bandwidth aggregation with multiple NICs; multiple CPU cores engaged when using Receive Side Scaling (RSS). Automatic failover: SMB Multichannel implements end-to-end failure detection; leverages teaming if present, but does not require it. Automatic configuration: SMB detects and uses multiple network paths. [Diagram: each configuration shown as an SMB client and SMB server connected through one or more switches]
SMB Multichannel with RSS — [Diagram: RSS-capable NICs between SMB client and SMB server, with CPU utilization shown per core (Cores 1-4)]
SMB Multichannel with RSS — [Diagram: two RSS-capable SMB clients and two RSS-capable SMB servers connected across four switches]
SMB Multichannel performance — [Chart: interface scaling; throughput in MB/sec (0-5,000) vs. I/O size, for 1x, 2x, 3x and 4x NICs]
RDMA in SMB 3.0 — SMB over TCP and RDMA. 1. The application (Hyper-V, SQL Server) does not need to change — the API is unchanged. 2. The SMB client makes the decision to use SMB Direct at run time. 3. NDKPI provides a much thinner layer than TCP/IP — nothing flows via regular TCP/IP any more. 4. Remote Direct Memory Access is performed by the network interfaces — Ethernet and/or InfiniBand. [Diagram: user-mode application over kernel SMB client/server, with the SMB Direct + NDKPI + RDMA path alongside the traditional TCP/IP path, memory-to-memory between client and file server]
SMB Direct and SMB Multichannel — [Diagram: two SMB clients and two SMB servers, each with dual 54Gb InfiniBand RDMA NICs (R-NICs) across redundant switches]
DIY Shared Storage — a new paradigm for SQL Server storage design. Direct Attached Storage (DAS), now with flexibility: converting DAS to shared storage. Fast RAID controllers will become shared storage; NAND flash PCIe cards will become shared storage.
New Paradigm designs — [Diagram: PCIe flash, shown as three Fusion-io cards]
New Paradigm designs — [Diagram: two file servers fronting NAND flash shared storage, next to traditional SAN shared storage]
New Paradigm designs — [Diagram: SQL Server cluster nodes 1 and 2 (HP DL980, Windows Server 2012), each with HP IO Accelerator Duo PCIe flash for SQL TempDB and 2 x 56Gb/s links into two 36-port Mellanox InfiniBand switches; behind the switches, file server cluster nodes 1 and 2 (HP DL580, Windows Server 2012) connect at 4 x 8Gb/s to 3PAR 7000-series shared flash storage holding SQL data and log]
Demo
SQL Server virtualization — storage challenges: Capacity · Fast · Shared · Reliable
Hyper-V v3.0 — only two goals: adopt new technologies in the Win8 kernel; be the best hypervisor for SQL Server
Hyper-V v3.0 How do you become the best hypervisor for SQL Server?
Hyper-V v3.0 — Microsoft's initial idea, up to November 2010. [Bar chart: servers by number of CPU sockets (1-24) — the vast majority at 1-4 sockets (roughly 36%, 33%, 21% and 6%), near zero above 8]
Hyper-V v3.0 — new insight for the Hyper-V team. [Bar chart: SQL instances by number of CPU sockets (1-24) — 36%, 24%, 16%, 12% and 11% on the smaller socket counts, with a noticeable tail out to 16 and 24 sockets]
Hyper-V team's idea of physical-to-virtual. Before: 750 servers with SQL Server, 920 SQL Server instances, 200TB storage. After: 780-790 servers with hypervisor and SQL Server, 920 SQL Server instances, 200TB storage.
Real-life consolidation on physical servers. Before: 750 servers with SQL Server, 920 SQL Server instances, 200TB storage. After: 6 servers with SQL Server, 12 SQL Server instances, 140TB storage.
Real-life consolidation on physical servers — how did we achieve the storage savings? [Pie chart: databases by type — System 63%, User 37%]
Digging deeper — further storage reclaims could easily be done inside the databases. [Bar chart: disk capacity waste in SQL environments (350TB) — 83% of disk space is allocated to databases and 17% is free, but only 57% of the allocated database space is actually used]
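A hedged sketch of how to measure that gap per database (sizes come back in 8KB pages, converted to MB here):

-- Run in each database: allocated file size vs. space actually used.
SELECT name                                        AS logical_file,
       size * 8 / 1024                             AS allocated_mb,
       FILEPROPERTY(name, 'SpaceUsed') * 8 / 1024  AS used_mb
FROM sys.database_files;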
Final specs of Hyper-V v3.0

Capability                                      | Hyper-V 2008 R2 | Hyper-V 2012
Number of logical processors on host            | 64              | 320
Maximum supported RAM on host                   | 1 TB            | 4 TB
Virtual CPUs supported per host                 | 512             | 2048
Maximum virtual CPUs per virtual machine        | 4               | 64
Maximum RAM per virtual machine                 | 64 GB           | 1 TB
Maximum running virtual machines per host       | 384             | 1024
Guest NUMA                                      | No              | Yes
Maximum failover cluster nodes                  | 16              | 64
Maximum virtual machines in failover clustering | 1000            | 8000
Final specs of Hyper-V v3.0 — so what about storage? VMware tops out at 300,000 IOPS per VM — a really good number. A single Windows Server 2012 Hyper-V VM does: 985,000 IOPS.
New Paradigm designs — [Diagram: two file servers fronting NAND flash shared storage, next to traditional SAN shared storage]
New things are happening SQL Server 2014 (SQL14) Windows Server 2012 R2 (Windows Blue)
Windows Server 2012 R2 — RTM September 5th, 2013, both Server and Client (Win8.1). Hyper-V v4.0: 985,000 IOPS -> 1,300,000 IOPS. Improved network performance: 300,000 IOPS. Improved Storage Spaces: caching, tiered storage -> 450,000 IOPS; 8K I/O from 950 MB/s -> 1,300 MB/s.
SQL Server 2014 — still in development. Project Hekaton: In-Memory OLTP. Columnstore indexes: clustered & updateable. Updated AlwaysOn: improved reliability and scalability, 8 replicas. Completely new query engine. For the first time, control of IOPS with resource policies (see the sketch below). Buffer Pool Extension.
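A hedged sketch of that IOPS control via the SQL Server 2014 Resource Governor (pool and group names are hypothetical, and a classifier function is still needed to route sessions into the group):

-- Cap a reporting workload at 500 physical IOPS per volume.
CREATE RESOURCE POOL ReportingPool WITH (MAX_IOPS_PER_VOLUME = 500);
CREATE WORKLOAD GROUP ReportingGroup USING ReportingPool;
ALTER RESOURCE GOVERNOR RECONFIGURE;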
Re-think hardware usage: Storage and Memory
Re-think hardware usage: Storage as L5 cache, Memory as L4 cache
Re-think hardware usage: Storage as L2 RAM, Memory as L1 RAM
SSD Buffer Pool Extensions. What's being delivered: use of non-volatile drives (SSD) to extend the buffer pool; NUMA-aware; large-page and BUF array allocation. Main benefits of BP Extension for SSDs: improved OLTP query performance with no application changes; no risk of data loss (clean pages only); easy configuration, optimized for OLTP workloads on commodity servers (32GB RAM); scalability improvements for systems with >8 sockets.
Buffer Pool Manager — [Diagram: SQL Server engine. Protocol layer (TDS / SNI) feeds the relational engine (command parser → query tree → optimizer → query plan → query executor → result sets); the storage engine (access methods, transaction manager, buffer manager) serves GetPage requests from the buffer pool (plan cache, data cache with cached pages), with read/write I/O against the data files and transaction log]
Data IOPS offload to Storage Class Memory (SCM) — [Diagram: the same engine stack, but the buffer pool is split into an L1 buffer pool in RAM and an L2 buffer pool on PCIe flash (Fusion-io cards as SCM)]
Easy enablement
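A minimal sketch of that enablement in SQL Server 2014 (file path and size are hypothetical; the file must sit on the SSD volume):

-- Extend the buffer pool onto an SSD.
ALTER SERVER CONFIGURATION
SET BUFFER POOL EXTENSION ON
    (FILENAME = 'E:\SSDCACHE\SqlBpe.BPE', SIZE = 50 GB);

-- And to turn it off again:
-- ALTER SERVER CONFIGURATION SET BUFFER POOL EXTENSION OFF;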
Troubleshooting options. DMVs: sys.dm_os_buffer_pool_extension_configuration, sys.dm_os_buffer_descriptors. XEvents: sqlserver.buffer_pool_extension_pages_written, sqlserver.buffer_pool_extension_pages_read, sqlserver.buffer_pool_extension_pages_evicted, sqlserver.buffer_pool_page_threshold_recalculated. Performance Monitor counters: Extension page writes/sec, Extension page reads/sec, Extension outstanding IO counter, Extension page evictions/sec, Extension allocated pages, Extension free pages, Extension page unreferenced time, Extension in use as percentage on buffer pool level.
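A small sketch using the DMVs above (column names as shipped in SQL Server 2014):

-- Is the extension on, where is the file, and how big is it?
SELECT path, state_description, current_size_in_kb
FROM sys.dm_os_buffer_pool_extension_configuration;

-- How many cached pages currently live in the extension?
SELECT COUNT(*) AS pages_in_extension
FROM sys.dm_os_buffer_descriptors
WHERE is_in_bpool_extension = 1;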
New things are happening in hardware
HP vnext. Cluster-in-a-box: HP X5000. DL580 G8: Ivy Bridge EX (4+ sockets). Dragonhawk: SuperDome2 with x64 and Windows. Moonshot: many servers in very little space, at low power.
Thank you! michaelf@mentalnote.dk — business card: the above e-mail, on LinkedIn