How System Settings Impact PCIe SSD Performance


Suzanne Ferreira, R&D Engineer, Micron Technology, Inc.
July 2012

As solid state drives (SSDs) continue to gain ground in the enterprise server and storage market, PCI Express (PCIe) devices are emerging as the high-performance device of choice. To achieve maximum performance from PCIe SSDs, host servers must be tuned so that all components work together to provide the best possible system-level performance. This white paper discusses the impact of various system settings on the performance of Micron RealSSD P320h PCIe SSDs in half-height, half-length (HHHL) configurations.

Note: This is not an exhaustive guide or tutorial on system configuration. Consult your system documentation for settings that may impact SSD performance.

Testing Process and Configuration

A number of tests were performed at Micron's facility in Folsom, California, over a period of approximately six weeks. All performance measurements were made according to Storage Networking Industry Association (SNIA) guidelines. (For more information about testing SSDs, see Micron's Best Practices for SSD Performance Measurement white paper.) The primary testing categories included:

- Processor speed impact on enterprise SSD performance
- BIOS settings impact on drive performance
- Dual- versus single-CPU impact on drive performance
- Number of workers impact on single-CPU drive performance

Five different servers were used for the various tests: Supermicro, HP, and Cisco units, all using Intel Xeon CPUs (see Table 1). In all tests except the dual-CPU test, a baseline benchmark was completed, after which alterations were made and the results compared to the baseline as a percentage of performance reduction. In the dual-CPU test, the following parameters were varied:

- I/O Cores: A list of the cores assigned to handle I/O
- Interrupt Core: The core assigned to handle interrupts
- Workers: The number of workers generating I/O
- Threads: The number of I/O submission threads

Random read IOPS were then measured. Random reads were used because they tend to stress the card and are the most sensitive to misconfiguration. The open source software vdbench was used to benchmark performance. All tests were run on Red Hat Enterprise Linux (RHEL) 6.1. The 3.3 system configuration was used to establish baseline results; the other servers were used to compare CPU speeds. Table 1 outlines the five systems, including the CPU, RAM, software, and SSD configurations, used throughout the testing.
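For illustration, a minimal vdbench parameter file for a 4KB random read workload might look like the following sketch. This is not the actual test script; the device path (/dev/rssda) and run length are assumptions.

```
# Storage definition: the P320h block device (assumed path), opened with O_DIRECT
sd=sd1,lun=/dev/rssda,openflags=o_direct
# Workload definition: 100% reads, 100% random, 4KB transfers
wd=wd1,sd=sd1,xfersize=4k,rdpct=100,seekpct=100
# Run definition: maximum I/O rate, 255 outstanding I/Os, 10-minute run
rd=rd1,wd=wd1,iorate=max,elapsed=600,interval=5,threads=255
```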

System Name | Server Model* | CPU | RAM | OS | SSD
3.3 | X8DTH | Intel Xeon X5680 at 3.33 GHz, 6 cores per CPU | 24GB DDR3-1333 | RHEL 6.1 | 700GB P320h
2.4 | X8DTH | Intel Xeon E5620 at 2.4 GHz, 4 cores per CPU | 24GB DDR3-1333 | RHEL 6.1 | 700GB P320h
2.4A | X8DTH | Intel Xeon E5620 at 2.4 GHz, 4 cores per CPU | 16GB DDR3-1066 | RHEL 6.1 | 700GB P320h
HP 2.4 | HP ProLiant DL180 G6 | Intel Xeon E5620 at 2.4 GHz, 4 cores per CPU | 8GB DDR3-1333 | RHEL 6.1 | 350GB P320h
Cisco 2.4 | Cisco UCS C250 | Intel Xeon E5620 at 2.4 GHz, 4 cores per CPU | 8GB DDR3-1333 | RHEL 6.1 | 350GB P320h
*All models use the Intel 5520 chipset.
Table 1: System Configurations Used for Testing

Processor Speed

Three servers (one 3.33 GHz and two 2.4 GHz) were used to test the effects of processor speed on enterprise SSD performance. For the 2.4 GHz systems, the same clock speed was compared across both vendors' servers and across different SSD densities as a cross-check for other factors that could potentially interfere with the results. The impact of CPU speed on random read performance is shown in Table 2. Although the manufacturer and P320h capacity varied, performance was essentially identical for the two servers with the same clock speed but different densities, indicating that clock speed was the factor that determined performance. The two 2.4 GHz systems performed 36% slower than the 3.33 GHz system, which achieved almost 800,000 IOPS in a random 4KB read workload.

BIOS Settings

BIOS settings can significantly impact SSD performance. To measure their impact on P320h performance, the following settings were tested:

- CPU Ratio: Enables the user to set the ratio between the CPU core clock and the front-side bus (FSB) frequency.
- Intel C-State: Enables the CPU to enter power-saving mode.
- Enhanced Intel SpeedStep Technology (EIST): Enables the system to automatically adjust processor voltage and core frequency to reduce power consumption.
- Throttling: Determines whether the CPU favors running faster or saving energy.
- Clock Spread Spectrum: Enables the BIOS to monitor and attempt to reduce the electromagnetic interference caused by system components.

System Configuration | Random 4k Read IOPS | P320h Capacity | Performance Reduction
3.33 GHz, 6-Core X5680 | 787,000 | 700GB | 0%
HP 2.4 GHz, 4-Core E5620 | 502,406 | 350GB | -36%
2.4 GHz, 4-Core E5620 | 505,435 | 700GB | -36%
Table 2: Processor Speed Performance Results

- C1E Support: Enables the use of the enhanced halt state, which significantly reduces CPU power consumption by reducing the CPU clock frequency and voltage during a halt state.
- Intel Virtualization Technology (VT): Used when running multiple instances of an operating system or running VMware.
- QPI Frequency: The speed of the QuickPath Interconnect (QPI) links in the system.

Table 3 and Figure 1 detail the results of the performance testing and provide examples of how BIOS settings can impact SSD performance. Most of the settings impacted performance by more than 10%. This is not surprising given that most of these settings also affect CPU performance, which directly impacts I/O performance.

Dual-Processor Configurations

Dual-CPU configurations, when properly configured, generally provide better performance than single-CPU configurations. To measure the impact of different configurations on performance in a dual-CPU environment, a system with an Intel 5520 chipset was used. Figure 2 shows the (partial) architecture of the 5520 chipset. In this configuration, each PCIe card is connected to an I/O hub (IOH) through a PCIe interface. The IOH is directly connected to one CPU through a QPI link and indirectly to the other CPU through the second IOH. The P320h I/O driver limits interrupt servicing to a single core on a single CPU; however, this core can be any core on any CPU in the system. Once an interrupt is serviced, the handling of I/O operations can be spread across multiple cores and multiple CPUs.
Four different configuration options were tested to measure their impact on P320h performance:

- Assigning I/O submission threads and interrupts to cores on the same and separate CPUs
- Separating I/O submission and interrupt handling cores
- Tuning an application for the optimal number of worker threads for I/O submissions
- Assigning I/O submissions and interrupts to the CPU closest to the PCIe slot containing the P320h SSD

In each test, a single P320h PCIe card was inserted into PCIe slot 5, as shown in Figure 2.

Figure 1: Performance vs. BIOS Setting (bar chart of 4KB random read IOPS, approximately 500,000 to 640,000, for the settings listed in Table 3)
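Several of the BIOS settings above (EIST and the C-states in particular) also surface at the operating system level on Linux. The following is a minimal sketch for checking their OS-visible counterparts; the sysfs paths are standard locations but may be absent on some kernels, so missing files are reported rather than causing an error.

```shell
# Print the OS-visible counterpart of a BIOS power-management setting.
# EIST appears as the cpufreq scaling governor; C-state depth appears as
# the intel_idle max_cstate module parameter.
report() {
  if [ -r "$1" ]; then cat "$1"; else echo "not exposed"; fi
}

report /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
report /sys/module/intel_idle/parameters/max_cstate
```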

BIOS Setting* | Setting | 4k Random Read IOPS | MB/s | Difference
Default Settings | Default | 549,939 | 2,148 | baseline
CPU Ratio | Manual | 614,755 | 2,401 | +12%
Intel C-State | Disabled | 618,449 | 2,416 | +12%
Intel EIST | Disabled | 613,200 | 2,395 | +12%
QPI Frequency | 6.4 GT/s | 605,120 | 2,364 | +10%
Throttling | Disabled | 611,931 | 2,390 | +11%
Clock Spread Spectrum | Enabled | 621,521 | 2,428 | +13%
Optimized Default | | 629,076 | 2,457 | +14%
C1E Support | Disabled | 626,572 | 2,448 | +14%
Intel C-State Disabled, CPU Ratio Manual, Intel VT Disabled | (combined) | 575,770 | 2,432 | +5%
Enhanced Intel SpeedStep | Disabled | 622,509 | | +13%
Intel VT | Disabled | 624,873 | | +14%
*All adjustments were made individually, one at a time, except where noted.
Table 3: Performance Impact of BIOS Settings

Assigning I/O Submissions and Interrupts to Cores on the Same and Separate CPUs

In most current SSD designs, interrupts raised by the card are handled by a single core on a single CPU. While this makes it easy to isolate any CPU performance limitations of a card, care should be taken to size a system correctly for the device workload relative to the rest of the system workload. In this test, the performance of a system with I/O submission threads and interrupts assigned to cores on the same CPU was compared to that of a system where I/O submissions and interrupts were handled on separate CPUs. Performance results are shown in Table 4 and Figure 3. The results show a 47% drop in performance when the cores handling I/O submission threads and interrupts were assigned to different CPUs. This drop occurred whether or not the I/O cores were explicitly assigned.

Separating I/O Submission and Interrupt Handling Cores

Other effects appear when the I/O cores are either separate from the interrupt core or overlap with it, even when both are on the same CPU. The best performance was achieved when the I/O cores were specified so that the interrupt core was not also used for I/O. Conversely, when the I/O cores were specified to include the interrupt core, a significant drop in performance occurred, as shown in Table 5.
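On Linux, the interrupt-to-core assignment exercised in these tests is typically done through the IRQ affinity interface in /proc. The following is a hedged sketch: the IRQ number 93 is hypothetical, and the actual IRQ for the device's driver should be looked up in /proc/interrupts.

```shell
# core_mask: convert a core number to the hex bitmask format expected by
# /proc/irq/<N>/smp_affinity (bit i set means core i may service the IRQ).
core_mask() {
  printf '%x' $((1 << $1))
}

# Requires root; commented out because the IRQ number (93) is an assumption:
# echo "$(core_mask 6)" > /proc/irq/93/smp_affinity

echo "affinity mask for core 6: 0x$(core_mask 6)"
```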

Figure 2: Intel 5520 Chipset Architecture with P320h Slot Used for Performance Testing (CPU 1, cores 0-5, and CPU 2, cores 6-11, each with local RAM; IOH 1 serves PCIe slots 1-3, and IOH 2 serves PCIe slots 4-7)

Tuning the Application for the Optimal Number of Workers

When it comes to workers, more is not necessarily better: there is an optimal number of I/O submission threads that achieves maximum performance, and that number varies by application. In this test, the number of workers was varied from five to eight. Optimal performance was obtained with six workers, as shown in Table 6. In these tests, vdbench was used as the performance tool, and the following Linux command was used to assign vdbench to cores 0-3:

taskset -c 0-3 /root/tools/vdbench502/vdbench -f read_4k_256

Assigning I/O Submissions and Interrupts to the CPU Closest to the PCIe Slot with the P320h SSD

Figure 2 shows PCIe slots 4-7 connected to IOH 2. If a P320h is inserted into slot 5, assigning I/O submissions and interrupts to cores on CPU 2 (the adjacent CPU) achieves maximum performance. Conversely, if the I/O submission and interrupt cores are assigned to the CPU farthest from the PCIe slot (CPU 1), there is a mild drop in performance, as shown in Table 7.

Figure 3: I/O Core vs. Interrupt Core (bar chart of random read IOPS, approximately 740,000 on the same CPU vs. 400,000 on separate CPUs)

I/O Cores | Interrupt Core | Workers | Threads | IOPS | Difference
7-11 | 6 | 6 | 255 | 736,088 | baseline
7-11 | 5 | 6 | 255 | 400,493 | -46%
0-4 | 5 | 6 | 255 | 758,736 | baseline
0-4 | 6 | 6 | 255 | 401,834 | -47%
Table 4: Assigning Interrupts and I/Os to CPUs
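The slot-to-CPU adjacency described above can be captured in a small helper. The following sketch hard-codes the Figure 2 topology (slots 1-3 behind IOH 1, nearest CPU 1 with cores 0-5; slots 4-7 behind IOH 2, nearest CPU 2 with cores 6-11); real slot-to-IOH wiring varies by board, so this mapping is an assumption specific to the test system.

```shell
# adjacent_cores: map a PCIe slot number to the core range of the CPU behind
# the IOH that serves the slot, per the Figure 2 topology of this test system.
adjacent_cores() {
  if [ "$1" -ge 4 ]; then
    echo "6-11"   # slots 4-7 -> IOH 2 -> CPU 2
  else
    echo "0-5"    # slots 1-3 -> IOH 1 -> CPU 1
  fi
}

# Pin vdbench to the CPU adjacent to slot 5 (paths taken from the text above):
# taskset -c "$(adjacent_cores 5)" /root/tools/vdbench502/vdbench -f read_4k_256
echo "cores adjacent to slot 5: $(adjacent_cores 5)"
```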

I/O Cores | Interrupt Core | Workers | Threads | IOPS | Difference
7-11 | 6 | 6 | 255 | 746,720 | baseline
7-11 | 7 | 6 | 255 | 619,251 | -17%
Table 5: Handling I/O Submissions and Interrupts on Separate Cores

I/O Cores | Interrupt Core | Workers | Threads | IOPS | Difference
1-4 | 0 | 6 | 255 | 701,070 | 0%
1-4 | 0 | 7 | 254 | 696,340 | -1%
1-4 | 0 | 5 | 254 | 695,003 | -1%
1-4 | 0 | 8 | 254 | 664,898 | -5%
Table 6: Performance Relative to the Number of Workers

I/O Cores | Interrupt Core | PCIe Slot | Workers | Threads | IOPS | Difference
7-10 (CPU 2) | 6 (CPU 2) | 5 (adjacent to CPU 2) | 6 | 255 | 714,637 | 0%
1-4 (CPU 1) | 0 (CPU 1) | 5 (adjacent to CPU 2) | 6 | 255 | 701,070 | -2%
Table 7: Relationship of Performance to PCIe Slot and IOH Used

Conclusion

This white paper explored the impact of CPU clock speed, BIOS settings, and dual multi-core processors in various configurations on PCIe SSD performance. The tests supported the following conclusions:

- Separate interrupt and I/O cores are necessary for the best performance.
- Using the CPU connected to the IOH closest to the PCIe slot achieves the best performance (versus the opposite CPU).
- Assigning the interrupt and I/O handling cores from the same CPU provides significantly better performance than splitting them across separate CPUs.
- The optimal number of workers was not necessarily the highest; in this test case it was six.

For more information on Micron's P320h SSD performance, visit micron.com or contact your Micron representative.

AIO: Asynchronous I/O. A non-blocking system call convention that can be used by a single worker to submit more than one I/O to a storage device.
Core: A logically or physically discrete execution context within a processor package. Many current CPUs have multiple cores with individual L1 caches, individual or shared L2 caches, and in some cases a large shared L3 cache. Each core can run a separate thread of execution, allowing for more workload parallelization.
CPU: Central processing unit, also known as a processor or microprocessor. In the context of this paper, a CPU is the physical package in a socket. This package may contain one or more NUMA nodes, each with one or more cores.
HT: HyperTransport. A point-to-point interconnect used to connect AMD CPUs, similar to QPI.
NUMA Node: Non-uniform memory access node. While sometimes considered synonymous with CPU, a NUMA node refers to a processor/memory combination in which the memory is connected directly to one CPU, creating a non-uniform access situation.
Process: A time-sliced context in which an application runs. That context has a unique identifier across the system.
QPI: QuickPath Interconnect. A point-to-point interconnect present on the several most recent generations of Intel CPUs. It replaces the traditional front-side bus while also allowing communication between I/O hubs.
Synchronous I/O: A blocking system call methodology whereby only a single I/O can be outstanding per worker context.
Thread: A lightweight process; a process may have multiple threads that can run serially or in parallel.
Worker: In the context of this paper, a worker is a single context (process or thread) used to submit I/O to a storage device. It can be affinitized to a particular NUMA node and/or core, or allowed to move from core to core as determined by the operating system scheduler.
Table 8: Glossary of Terms

micron.com
Products are warranted only to meet Micron's production data sheet specifications. Products and specifications are subject to change without notice. The information contained herein is provided on an "as is" basis without warranties of any kind. Micron and the Micron logo are trademarks of Micron Technology, Inc. All other trademarks are the property of their respective owners. © 2012 Micron Technology, Inc. All rights reserved. 07/23/12 EN