OBJECTIVE ANALYSIS WHITE PAPER

MATCHING FLASH TO THE PROCESSOR
Why Multithreading Requires Parallelized Flash




The computing community is at an important juncture: flash memory is now generally accepted as a way to significantly increase performance, but in order to take advantage of all of the throughput that flash offers, system developers must abandon certain classic concepts of storage.

Flash memory is growing in popularity as a new layer in the memory-storage hierarchy. Flash is faster than HDD but slower than DRAM; it is cheaper than DRAM but more expensive than HDD. This is the magic combination that gives flash its appeal. One significant disadvantage, though, is that the distinction between memory and storage is very clear in conventional computing structures, with storage on one side of the storage controller and memory on the other. To date, flash has been relegated to the storage side of the storage controller, even though it is, in fact, memory.

The problem with getting from here to there is that we are looking at computer architecture from an outdated vantage point, and this makes it difficult to see the clearest way to guarantee peak performance.

How Did We Get Here?

In the past, microprocessors vied to outdo each other by increasing clock rates. Each generation was faster than the last. This worked because a semiconductor phenomenon called Dennard Scaling made transistor speed scale with size, allowing each new process shrink to naturally result in faster speeds.
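The first-order effect of Dennard Scaling can be made concrete with a short sketch. This illustration is not from the original paper; the scale factor k and the starting frequency are assumed values chosen only to show the trend:

    # Idealized Dennard Scaling: shrink linear dimensions by 1/k (k > 1).
    # Voltage and capacitance drop by ~1/k and gate delay drops by ~1/k,
    # so clock frequency can rise by ~k while power density stays flat.
    def dennard_shrink(freq_ghz, k=1.4):
        """Idealized clock frequency after one process shrink (k is assumed)."""
        return freq_ghz * k  # delay ~ 1/k  =>  frequency ~ k

    f = 0.1  # assumed starting clock in GHz, illustrative only
    for gen in range(5):
        print(f"process generation {gen}: ~{f:.2f} GHz")
        f = dennard_shrink(f)

Under these idealized assumptions, each process shrink buys a higher clock essentially for free, which is why the clock race worked for as long as the scaling held.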

It was relatively easy to balance computer performance in those days. The system designer would choose a processor clock frequency and would match that processor's speed with a DRAM interface that had been designed to fit. Figure 1 is a loose graphic depiction of this approach: to increase performance, the clock speeds of both the processor and the memory were raised. Since the boxes in the diagram keep the same relative heights, making both boxes taller by increasing their clock speeds ensures that the system's performance remains balanced.

Figure 1.) Classic Approach to Performance Increases: Raise CPU and Memory Clock Frequencies

Storage usually received little attention because hard drives (HDDs) were significantly slower than the rest of the system, and the storage controller always provided significantly more bandwidth than even multiple HDDs could use.

In the 1990s it became clear that this approach would become unfeasible within the next few years. A combination of leakage currents and resistor-capacitor (RC) time constants limited the maximum speed at which a processor could run to roughly 2GHz, after which heat dissipation would become a significant issue. Indeed, processor clock speeds have reached a limit. (See Figure 2.)

Figure 2.) Clock Speeds (Source: Stanford CPU Database, CPUdb.Stanford.edu)

Still, transistors continued to shrink, so more of them could be economically added to the processor chip. The designers' dilemma was to find a way to improve processor performance with a growing number of transistors even if clock speeds had reached a stopping point. The result was to increase the number of cores in the processor chip: rather than use a single processor running at its maximum speed, designers harnessed the power of multiple processors running in parallel.

This created a new challenge. Most software at that point had been written under the assumption that it would run on a single CPU. Intel and AMD, the two processor leaders, undertook the task of creating a computing environment around multiple processing. The two companies worked with compiler companies and other software toolmakers to ensure that these tools could create code to allow multiple concurrent tasks to run on separate CPU cores.

Intel and AMD also trained the programmer community to think in terms of multithreading: how could they maximize the number of tasks that could be run concurrently? The idea was for software to support as many separate concurrent tasks as possible so that it would run at a speed dictated by the number of available cores. Ideally, a program would run twice as fast on a dual-core processor as it would on a single-core processor, three times as fast on three cores, and so on, until the number of cores exceeded the number of tasks to be performed. Naturally this motivated programmers to explore ways to divide their programs into the largest number of threads that made sense, so that their product would always perform better if more cores became available.
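The ideal scaling described above amounts to a one-line model. The sketch below is illustrative and not from the paper; the choice of eight concurrent tasks is an arbitrary assumption:

    # Ideal multicore speedup: performance grows with the core count until
    # there are more cores than concurrent tasks, after which it flattens.
    def ideal_speedup(cores, tasks):
        return min(cores, tasks)

    for cores in (1, 2, 3, 4, 8, 16):
        # assume the program was divided into 8 concurrent tasks
        print(f"{cores:2d} cores -> {ideal_speedup(cores, tasks=8)}x")

The flat region past eight cores is exactly what pushed programmers to raise the thread count whenever it made sense.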

At the same time, server designers defined motherboards that amplified this concept, teaming two or four processor chips into dual-socket or quad-socket designs.

Adding DRAM Channels for Bandwidth

Memory had to play a role in feeding this architecture's voracious hunger for data and instructions. Not only did DRAM interfaces migrate to SDRAM, then DDR, DDR2, and DDR3, but multiple memory channels were added to the board, supporting as many as four separate memory buses. This is depicted as multiple memory boxes in Figure 3. A four-socket (four processor chip) motherboard with four memory channels per socket would then have sixteen memory channels, each pumping out 64 bits (8 bytes) of data at double the 1,066MHz clock rate, amounting to a total memory bandwidth of roughly 270GB/s.

Figure 3.) Revised Approach: Parallel CPUs and Multiple Memory Channels (Source: Objective Analysis, 2014)
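That bandwidth figure can be checked with simple arithmetic. This short sketch, which is an illustration rather than anything from the paper, recomputes it from the paper's own numbers:

    # Aggregate DRAM bandwidth of a four-socket, four-channel-per-socket board:
    channels = 4 * 4             # four sockets x four channels each
    bytes_per_transfer = 8       # each channel is 64 bits wide
    transfers_per_clock = 2      # double data rate
    clock_hz = 1_066_000_000     # 1,066MHz base clock

    bandwidth = channels * bytes_per_transfer * transfers_per_clock * clock_hz
    print(f"~{bandwidth / 1e9:.0f} GB/s")  # ~273 GB/s, i.e. roughly 270GB/s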

The move to four memory channels not only quadrupled memory/processor bandwidth, but it also allowed processors to support memory arrays that were four times as large. Even with these larger memory arrays, today's ballooning datasets have expanded beyond the server's maximum memory sizes, forcing the working data to be swapped into and out of HDD, and shifting the focus to the storage subsystem.

A Disconnect with Storage

Storage I/O speeds didn't undergo the same speed ramp. There was really no reason to: because of their mechanical nature, HDD bandwidth is relatively limited and cannot be improved. Although spindle speeds did double with the advent of the enterprise HDD, this change significantly increased power consumption, and designers determined that even faster spindle speeds would be impractical simply due to issues of power dissipation. Since HDDs peaked at around 0.4MB/s, parallel interfaces to the CPU would have made little difference. If the disk speed wouldn't improve, there was little reason to improve the speed of the storage controller.

Storage systems were accelerated not by improving the bandwidth between the disk system and the memory-processor complex, but by increasing the data transfer rate between the disk and the storage controller, which already had significantly more bandwidth than was available from HDDs. Common approaches were to use RAID or other striping mechanisms to parallelize the disk array. This is represented by the multiple very small boxes in Figure 4.

Figure 4.) Boosting Storage Speed: A New Bottleneck

When SSDs came along, the game changed significantly. Think of it as adding taller boxes to the disk column of Figure 4. These taller boxes, representing the higher performance of the SSD, were suddenly hampered by the storage controller, which could not service all of the bandwidth the SSD provided. RAID cards had to be redesigned to handle all the bandwidth that was suddenly available in multiple-SSD systems.

SSD makers partially solved this problem by routing the SSD through the PCIe bus, which was originally conceived as a way to add coprocessors onto a system. This opened up bandwidth significantly, but still resulted in a bottleneck at the interface between the disk system and the processor-memory complex. Even with multiple PCIe lanes, the system still suffers from significantly lower bandwidth than does the memory bus, mainly because there is only one PCIe bus per processor chip. This has prevented flash from providing its best performance to the processor.

Achieving Parallel Access with Flash DIMMs

Notice that during this entire discussion nothing has been said about changing the system topology to optimize it for flash. Even when the flash has been moved to the PCIe bus, it is still on the storage side of the graphic, where the storage controller (and storage control software) slows it down. Users who break with convention and think of flash as a new kind of memory can get past this snag.

What if flash could be added to the memory bus? Rather than adding a knot of high-performance storage on the slow side of the storage controller, several smaller blades of fast flash could be added to the system, each on a different memory bus. This would allow parallel access from the processor to the flash at speeds several times that of the storage controller. This is the way that DRAM bandwidth is made to match the needs of multiple processors, and this is exactly the way that flash should be used by anyone who wants to achieve the highest performance this technology can deliver.

Figure 5 illustrates this approach. The small boxes in this diagram represent flash memory that has been moved (arrows) from the storage side of the system to the RAM side of the system. These boxes could all be RAIDed on the HDD side of the storage controller, or they could communicate with the processor via multiple PCIe channels, but the memory-processor interface between flash and the CPU chip is still significantly faster than either of these options.

Figure 5.) Getting Past the Slow Interface: Add Flash to Memory Channels

Breaking Up DIMMs for Bandwidth

Another interesting point is that the designer doesn't need to add all that much flash to each memory bus. As a rule of thumb, each tier in the memory hierarchy should be ten times the size of the next-faster tier. In a large server board the maximum memory size is four channels of 32GB each, or 128GB. The rule of thumb would indicate that each channel should have about 320GB of NAND flash.
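A minimal sketch of that sizing arithmetic, using the example capacities above (an illustration, not a tool from the paper):

    # Rule of thumb: each tier should be ~10x the size of the next-faster tier.
    channels = 4
    dram_per_channel_gb = 32
    dram_total_gb = channels * dram_per_channel_gb      # 128GB of DRAM
    flash_total_gb = 10 * dram_total_gb                 # 1,280GB of flash
    flash_per_channel_gb = flash_total_gb // channels   # 320GB per channel
    print(f"{flash_per_channel_gb}GB of NAND flash per memory channel")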

More realistically, the designer might reduce the system's DRAM complement when adding flash, partly because they will be giving up a DIMM socket. Let's say the system actually used 16GB of DRAM and around 160GB of NAND flash. Such a trade-off is well warranted: Objective Analysis has performed benchmarks showing that systems with a small DRAM and flash solidly outperform systems with a large DRAM and no flash, when the dollars spent on flash plus DRAM are the same for both systems.

Spreading smaller amounts of flash among the system's various memory buses should provide the highest throughput. Rather than use a single large 1TB SSD on a PCIe, SAS, or SATA interface, and rather than add a RAID of SSDs that communicate at high speed through a PCIe-based RAID card, the optimum system would pepper five 200GB memory-channel storage (MCS) flash DIMMs throughout the system, right on the memory bus, to allow the processor's multiple memory channels to gain access to flash memory's ultimate performance benefit. The amount of flash used would remain the same, but the speed at which it can be accessed would increase by more than an order of magnitude. In most cases the flash is added to the system in order to improve throughput at a reasonable cost, so the focus isn't on installing the largest amount of flash, but on achieving the speediest access to an affordable amount of flash.

All-flash DIMMs are available today in the form of SanDisk's ULLtraDIMM or the IBM eXFlash DIMM memory-channel storage. These devices are both available in capacities of 200GB and 400GB and are block accessible, so they can be treated as storage by existing software, yet accessed at memory speeds through the memory bus.

Summary

Flash memory has brought exciting improvements to computing performance, but it has not been able to perform to its maximum potential because current computer architecture calls for all storage to be placed behind a storage controller. Meanwhile, other parts of the system have been broken into a number of parallel paths in order to coax increasing performance out of the system without raising clock rates. This is how multicore processors evolved, with each supporting multiple memory channels. Storage alone has failed to keep pace with this change.

Now that flash is accepted as a storage layer, designers need to further explore the way that it is put to work. Placing flash behind the storage controller slows it down. Modern flash-based DIMMs allow NAND flash to be added to the highest-speed channel in the system: the memory bus. With the adoption of a bus-based parallel flash system we can expect to see performance increase significantly over the boost already being realized in systems that use SSD-based flash storage.

Jim Handy, March 2014