InfoScale Storage & Media Server Workloads

Maximise Performance when Storing and Retrieving Large Amounts of Unstructured Data

Carlos Carrero, Colin Eldridge, Shrinivas Chandukar

Table of Contents

01 Introduction
02 Hardware Setup
03 Maximize Volume Throughput
04 File System Throughput

01 Introduction
+ Document purpose
+ Internet of Things

DOCUMENT PURPOSE

This document takes a given hardware configuration as a base framework and uses it as a guide to find the best possible software configuration.

Ways to exercise I/O: several tests will be performed in order to understand how the hardware responds to different I/O demands, so that bottlenecks can be identified at different points in the architecture. Sequential read, a typical I/O pattern for media solutions, will be used. Some software tunables, such as the number of bytes pre-fetched by read ahead, can affect overall performance, and this guide evaluates how to make the best choices. While this is a high-level overview, a more detailed guide with additional information and examples can be found in this technical article.

The exercise follows four steps: measure performance, achieve balanced I/O, identify bottlenecks, and tune the file system, with the goal of finding a balanced configuration that gets the most out of your hardware.

INTERNET OF THINGS

Data is at the core of the business and is a key asset for the Internet of Things market, where some analysts predict 30% CAGR. Even without reading any analyst paper, it is quite clear that the amount of data we are storing, in the form of video, images and social information, is increasing dramatically year after year. Companies are looking for ways to take advantage of that huge amount of information, and workloads such as Rich Media Analytics are expected to grow by 3 times in the coming year. The platform needed to store and analyse all that information must be continuously available and prepared to satisfy all the I/O requirements, both for stream analytics delivering immediate results and for post analysis to find common patterns.

InfoScale Storage is a Software Defined Storage solution well suited to creating a storage backend that can adapt to changing I/O demands. Because of the flexibility of a Software Defined Solution, any hardware can be used to create a commodity storage backend able to store and manage the data needed for Internet of Things solutions. This document describes a step-by-step procedure for understanding the capabilities of the hardware used under InfoScale Storage, so you can make sure that maximum throughput and capacity are obtained from your current and future hardware investments.

02 Hardware Setup
+ Hardware configuration
+ Host side
+ Switch side
+ Modular array side
+ LUN side

HARDWARE CONFIGURATION

Understanding your hardware capabilities is the first step in a tuning exercise. The hardware we used during our testing is summarised below; a sketch of commands to verify this layout from the host follows the list.

- Two nodes, where each node has a dual port HBA card
- There are 2 active paths to each LUN
- Two switches, with 2 switch ports per switch (4 in total) used to connect to the dual port HBA cards
- 6 switch ports per switch (12 in total) are used to connect to the six modular storage arrays
- Each modular storage array has one port connected to each switch (2 connections per modular storage array)
- Each modular array has 4 LUNs built from 11 disks
- Therefore there are 24 LUNs available in total
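As an illustrative sketch only, the LUN and path layout can be confirmed from either node with the standard VxVM and DMP utilities; the dmpnodename shown is hypothetical and depends on the enclosure naming in your environment.

$ vxdisk -o alldgs list                       # expect 24 LUNs visible to each node
$ vxdmpadm listenclosure all                  # expect the six modular storage arrays
$ vxdmpadm getsubpaths dmpnodename=array0_0   # expect 2 active paths for each LUN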

HOST SIDE

Understand the theoretical limits at the host side:

- Two nodes, where each node has a dual port HBA card
- There are 2 active paths to each LUN
- Max theoretical throughput per port is 8Gbits/sec
- Because two paths are used, the max theoretical throughput per node would be 16Gbits/sec
- The max theoretical throughput for the two nodes should therefore be 32Gbits/sec
- In our one-node testing, the dual port HBA card bottlenecked at approximately 12Gbits/sec

AT THE SWITCH

Understand the theoretical limits at the switch:

- Each switch is capable of 32Gbits/sec
- There are two switches, so the theoretical limit should be 64Gbits/sec
- Two switches, with 2 switch ports per switch (4 in total), are used to connect to the dual port HBA cards
- 6 switch ports per switch (12 in total) are used to connect to the six modular storage arrays
- However, each individual switch port is capable of 8Gbits/sec
- Because only 4 switch ports are connected to the host nodes, the maximum throughput through both switches is limited to 32Gbits/sec

MODULAR STORAGE ARRAY

Understand the theoretical limits of the modular storage arrays:

- There are 6 modular storage arrays, each with 2 ports
- Each port has a theoretical max throughput of 4Gbits/sec
- There are a total of 12 storage array connections to the two FC switches
- The total theoretical max throughput is therefore 48Gbits/sec across all six modular storage arrays
- Each modular storage array has one port connected to each switch (2 connections per modular storage array)
- In our two-node testing, the combination of 6 storage arrays bottlenecked at 20Gbits/sec

LUN CONFIGURATION

Understand the LUN configuration and ensure all the LUNs are configured equally:

- Each LUN is composed of 11 disks in a RAID-0 configuration
- There are 4 LUNs available per modular array
- There are a total of 24 LUNs available in the environment
- Each LUN is approximately 3TB
- Because there are 2 paths per LUN, there is a total of 48 active paths on each node

LUN THROUGHPUT

Details on the LUN throughput:

- Reading from a single LUN we achieve 170MBytes/sec
- Reading from two LUNs in the same modular array we achieve 219MBytes/sec, because of contention at the modular array level
- Reading from two LUNs where each LUN is located in a different modular storage array, the throughput is 342MBytes/sec
- Each modular array controller cache is shared when I/O is generated to multiple LUNs within the same array

These single-LUN numbers can be reproduced with a simple raw sequential read, as sketched below.
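A minimal sketch of such a raw sequential read, assuming hypothetical DMP device names (the real names depend on the enclosure naming in your environment):

# Read sequentially from one LUN via its DMP raw device (hypothetical name)
$ dd if=/dev/vx/rdmp/array0_0 of=/dev/null bs=1024k count=10240

# Repeat against a second LUN in the same array, then against a LUN in a
# different array, and compare the aggregate MBytes/sec reported by iostat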

HARDWARE CONFIGURATION

It is important to understand the difference between the theoretical performance of each of the components and the real performance exhibited in your particular configuration. Understanding the different figures we may get helps us understand the limits of our configuration. If we only looked at the modular storage array performance, we might have the perception of being able to achieve 48Gbits/sec. But as we have seen, the maximum theoretical throughput for the two nodes is 32Gbits/sec. In fact, as we will see later, the maximum performance achieved is 20Gbits/sec when reading from both nodes and 12Gbits/sec when reading from one node. The chain of limits is summarised as simple arithmetic below.

The following sections describe the volume and file system configuration and the different tests performed to achieve a balanced IO configuration across all six modular storage arrays. We will also see how to take performance metrics at different levels to understand how each piece of the architecture is performing.
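As a quick sanity check, the theoretical limits discussed above reduce to simple arithmetic; this is only a sketch of the reasoning, not a measurement:

# Host side: 8Gbit ports, 2 active ports per node, 2 nodes
$ echo $((8 * 2))        # 16 Gbits/sec theoretical per node
$ echo $((8 * 2 * 2))    # 32 Gbits/sec theoretical for both nodes

# Switch side: only 4 of the 8Gbit switch ports face the host nodes
$ echo $((8 * 4))        # 32 Gbits/sec maximum through both switches

# Array side: 6 arrays x 2 ports x 4 Gbits/sec
$ echo $((6 * 2 * 4))    # 48 Gbits/sec theoretical for all arrays

# Observed: ~12 Gbits/sec from one node and ~20 Gbits/sec from two nodes,
# so the storage arrays, not the fabric, set the real ceiling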

03 Maximize Volume Throughput
+ Volume configuration
+ Performing sequential reads from the volume
+ Stripe unit
+ Balanced IO across the storage system

VOLUME CONFIGURATION

24 columns to balance I/O across all hardware.

Because we want to balance the I/O across all the available hardware, we create a volume with 24 columns, the same number of LUNs that we have available. With this configuration the I/O is guaranteed to be spread across all the LUNs. When creating a volume, the stripe unit also needs to be specified; this is the IO size used to read from each LUN. We performed experiments with three different stripe unit sizes (64k, 512k and 1024k) to find the best choice for performance in our configuration. A sketch of the volume creation is shown below.
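A minimal sketch of creating such a striped volume with VxVM; the disk group name and volume size are hypothetical, the attribute names follow the vxassist layout syntax as we recall it and should be verified against your VxVM release, and the stripe unit shown is the 512k value that later proved optimal:

# Create a 24-column striped volume (disk group name and size are examples)
$ vxassist -g mediadg make vol1 60t layout=stripe ncol=24 stripeunit=512k

# Confirm the layout: 24 columns and a 512k stripe unit
$ vxprint -g mediadg -ht vol1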

PERFORMING SEQUENTIAL READS FROM THE VOLUME - SINGLE NODE

vxbench is a tool that can be used to perform sequential reads. We read using a block size of 1MB and 64 parallel processes, and measure the performance at different levels:

- The vxbench tool reports 1577033.29 KBytes/sec, which is 1.504 GBytes/sec
- iostat reports 3155251.20 sectors/sec, which is 1.504 GBytes/sec
- vxstat reports 63109120 blocks read every 20 seconds, which is 1.504 GBytes/sec
- portperfshow at the switch level shows 1.632 GBytes/sec

Because of the 8b/10b encoding overhead, the switch metrics report a higher throughput, so using the metrics at the host level is a better approach. In the Fibre Channel roadmap v1.8 the 8GFC throughput is 1600MB/sec full duplex, and the documentation for our dual port HBA card shows that it actually delivers 797MB/sec in each direction. Therefore, using our dual port card in the host, the maximum theoretical throughput is 1.5566 GBytes/sec, which matches our results. This shows the HBA bottlenecks at approximately 1.5 GBytes/sec in our environment. The unit conversions behind these figures are sketched below.
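A minimal sketch of the unit conversions used to reconcile the three host-side counters, assuming iostat sectors and vxstat blocks are both 512 bytes (which matches the figures above):

# vxbench: KBytes/sec -> GBytes/sec
$ echo "1577033.29 / 1024 / 1024" | bc -l        # ~1.504

# iostat: 512-byte sectors/sec -> GBytes/sec
$ echo "3155251.20 * 512 / 1024^3" | bc -l       # ~1.504

# vxstat: 512-byte blocks per 20-second interval -> GBytes/sec
$ echo "63109120 * 512 / 20 / 1024^3" | bc -l    # ~1.504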

PERFORMING SEQUENTIAL READS FROM THE VOLUME - TWO NODES

Now vxbench is run on each of the nodes, and we again measure the performance at different levels while running IO from both nodes at the same time:

- vxbench running on both nodes reports a total of 2.569 GBytes/sec
- iostat reports a total of 2.570 GBytes/sec for both nodes
- vxstat reports a total of 2.573 GBytes/sec for both nodes
- portperfshow at the switch level shows 2.741 GBytes/sec for both switches (again higher because of the 8b/10b encoding overhead)

In this two-node test, vxbench reports only approximately 1.285 GBytes/sec per node, so we do not reach the per-node HBA bottleneck when performing IO from both nodes. The per-node iostat reports show how the throughput per LUN drops when performing IO from both nodes. This shows the storage bottlenecks at approximately 2.5 GBytes/sec in our environment.

STRIPE UNIT

Running the vxbench program with different stripe units shows that the 512k stripe unit is the most optimal. We used sequential reads with a 1024k IO size.

Stripe unit   1 Server        2 Servers
64k           11.5Gbits/sec   19.5Gbits/sec
512k          12.0Gbits/sec   20.5Gbits/sec
1024k         12.0Gbits/sec   20.3Gbits/sec

Using one single node we reached the 12Gbits/sec (1.5GBytes/sec) throughput, which is limited by the HBA as we saw on the previous page. Using two servers, we reached 20Gbits/sec (2.5GBytes/sec). From this point onwards we know the maximum throughput achievable using our hardware configuration.

BALANCED IO ACROSS THE STORAGE SYSTEM

vxbench captures the throughput per process, vxstat captures the throughput per disk (LUN), and iostat captures the throughput per path.

[Figure: iostat output plotting reads/sec and average request size for the 48 device paths (sdp, sdo, sdn, ...)]

The graph shows how the IO is evenly distributed across all the paths and how the average request size for each path is 512K, which matches the stripe unit size we used. Each path to each LUN is doing the same amount of work, so balanced IO is observable across all the different metrics. The same check can be made directly from iostat, as sketched below.
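A minimal sketch of checking per-path balance with iostat; on the sysstat version used here the average request size is reported in 512-byte sectors, so 1024 sectors corresponds to the 512K stripe unit (the column name and units differ on newer sysstat releases):

# Extended per-device statistics every 20 seconds
$ iostat -x 20

# Expect all 48 sd* paths to show similar reads/sec and MB/s,
# with an average request size around 1024 sectors (512K)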

04 File System Throughput
+ File System performance using Direct-IO reads
+ Performance with read ahead disabled
+ Enabling read ahead and tuning read_nstream
+ Summary

FILE SYSTEM PERFORMANCE USING DIRECT-IO READS

File system direct IO mimics the Volume Manager raw disk test. To make the comparison clearer, we create a file with a single extent so all the blocks are contiguous, then use vxbench with the direct option to read the file using direct IO. Because direct IO is used, each read fetches data directly from disk: no buffering is performed, so data is not pre-fetched from disk, i.e. no read ahead takes place.

$ touch /data1/file1
$ /opt/vrts/bin/setext -r 4194304 -f contig /data1/file1
$ dd if=/dev/zero of=/data1/file1 bs=128k count=262144
$ /opt/vrts/bin/fsmap -A /data1/file1
  Volume  Extent Type  File Offset  Dev Offset   Extent Size  Inode#
  vol1    Data         0            34359738368  34359738368  4
$ ls -lh /data1/file1
-rw-r--r-- 1 root root 32G Mar 3 14:12 /data1/file1

The sequential read throughput was the same using the file system with direct IO as when reading from the Volume Manager raw disk devices, and each of the 64 processes achieves similar throughput. In this direct IO test all 64 processes read from the same file; in the previous raw disk test all 64 processes read from the same device and also saw similar throughput. In both tests the maximum possible throughput is achieved.
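The exact vxbench invocation is not reproduced here; as an illustrative stand-in, a direct IO sequential read of the same file can be approximated with GNU dd, which bypasses the page cache via O_DIRECT:

# Direct IO sequential read of the pre-allocated file (1MB reads, no page cache)
$ dd if=/data1/file1 of=/dev/null bs=1024k iflag=direct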

FILE SYSTEM PERFORMANCE WITH BUFFERED IO SEQUENTIAL READS

To perform the test using buffered IO, an independent file is created for each process, with 16GB of data pre-allocated to each file. As we want to make sure that data is not in memory, the file system can be remounted (-o remount) between each test so that all reads come from disk. Each process now reads from a different file using a smaller vxbench IO block size, as this is closer to real world workloads.

The greatest impact on the performance of sequential reads when using buffered IO is read ahead. It asynchronously prefetches data into memory, providing clear benefits for sequential read performance. There are two parameters that control read ahead behaviour (a sketch of inspecting them follows this description).

read_pref_io: the preferred read IO size. This is the maximum IO request submitted by the file system to the volume manager. The default value is set from the volume manager stripe unit, so in our case it will be 512K (524288 bytes). If an application is requesting only 8K reads, read ahead will pre-fetch the file data using IO requests of 512K to the volume manager, converting the 8K application reads into 512K disk reads. The larger IO size when reading from disk improves read performance; 512K is our optimal size. It is not recommended to tune read ahead by changing the value of read_pref_io; instead, tune read_nstream if needed.

read_nstream: defaults to the number of columns (LUNs) in the volume, so in our configuration the default value is 24. read_pref_io * read_nstream determines the maximum amount of data that is pre-fetched from disk by read ahead; to reduce the amount of read ahead, reduce this parameter. In our example, with 24 columns and a 512K stripe unit, the maximum amount of read ahead will be 12MB. As we will see, this is too much data to read ahead and causes an imbalance in read IO performance between the processes.
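A minimal sketch of inspecting these tunables with the VxFS vxtunefs utility, assuming the file system is mounted at /data1 and that vxtunefs is installed under the same /opt/vrts/bin path used earlier in this document:

# Display the current VxFS tunables for the mounted file system
$ /opt/vrts/bin/vxtunefs /data1 | egrep 'read_pref_io|read_nstream'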

IO SIZES WITH READ AHEAD DISABLED

When read ahead is disabled, the maximum throughput cannot be achieved.

[Figure: with read ahead disabled, each application request passes from the file system layer to the volume manager layer and is read from disk at the application IO size before being provided back from memory]

In this example 64 processes read 64 files. The maximum throughput per node cannot be achieved when read ahead is disabled in our test: although the throughput is evenly balanced across the processes, the total throughput is reduced by half, below what is expected. read_pref_io is only used if read ahead is enabled; vxbench issues small reads and, because read ahead is disabled, we only read a maximum of that same small amount at a time from disk (the size of the vxbench reads). A sketch of how read ahead can be disabled for this kind of test follows.
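For reference, a sketch of disabling read ahead for the experiment using the VxFS read_ahead tunable (0 disables read ahead, 1 restores the default behaviour); the mount point and command path follow the earlier examples:

# Disable read ahead on the mounted file system for the test
$ /opt/vrts/bin/vxtunefs -o read_ahead=0 /data1

# Re-enable the default read ahead behaviour afterwards
$ /opt/vrts/bin/vxtunefs -o read_ahead=1 /data1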

IO SIZES WITH READ AHEAD ENABLED

Read ahead improves sequential read performance by increasing the IO size and prefetching data from disk.

[Figure: with read ahead enabled, small application requests at the file system layer become 512K reads at the volume manager layer, and data already pre-fetched is provided directly from memory]

Now read ahead is enabled and vxbench is still issuing the same small reads; however, we now read a maximum of 512K at a time from disk because read_pref_io is set to 512K. This larger IO size improves performance. In addition, data is pre-fetched from disk before vxbench has requested it, so the next vxbench IO request fetches the data directly from memory without having to go to disk.

ENABLING READ AHEAD AND TUNING READ_NSTREAM

Reading too much data in advance can create an imbalance between processes. Remember that read_pref_io * read_nstream determines the maximum amount of data pre-fetched from disk:

read_nstream   Max read ahead
24             24 x 512K = 12MB
12             12 x 512K = 6MB
6              6 x 512K = 3MB
1              1 x 512K = 512K

[Figure: four graphs of per-process throughput (Gbits/sec) for the 64 processes, one for each read_nstream value above]

With read ahead enabled all the tests achieved the maximum throughput; however, the IO is not evenly balanced between the processes until read_nstream is set to one. The amount of data that is pre-fetched is reduced as read_nstream is reduced. By default, read_nstream was 24 in our configuration, which pre-fetched 12MB of data at a time and was too aggressive. The sketch below shows how read_nstream can be set and persisted.
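A minimal sketch of setting read_nstream at run time and persisting it across remounts; the mount point and volume path reuse the hypothetical names from the earlier examples, and the /etc/vx/tunefstab entry format should be checked against the vxtunefs documentation for your release:

# Set read_nstream to 1 on the mounted file system
$ /opt/vrts/bin/vxtunefs -o read_nstream=1 /data1

# Persist the setting so it is re-applied at mount time (hypothetical volume path)
$ echo "/dev/vx/dsk/mediadg/vol1 read_nstream=1" >> /etc/vx/tunefstab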

READ AHEAD TUNING SUMMARY

We have seen how the read ahead parameters obtain their default values from the stripe unit size and the number of columns in the volume. In our test the read_pref_io default was 512K and the read_nstream default was 24, based on the number of columns used. This meant that each process would pre-fetch 12MB of data from disk at a time, and pre-fetching this amount of data caused an imbalance in throughput between the processes reading from different files. When adjusting the parameters it is important to consider the number of running processes that will be reading from disk at the same time, as the available throughput is distributed between these processes.

The results showed how disabling read ahead resulted in poor performance, and how in our case using read ahead with read_nstream=1 provided the maximum throughput and perfectly balanced performance across all 64 processes. Our goal was to maintain the total performance of 1.5GBytes/sec while also maintaining balanced throughput across all the processes. When tuning read ahead parameter values, our recommendation is to reduce read_nstream rather than read_pref_io.