Evaluation of 40 Gigabit Ethernet technology for data servers




Evaluation of 40 Gigabit Ethernet technology for data servers
Azher Mughal, Artur Barczyk
Caltech / USLHCNet
CHEP-2012, New York
http://supercomputing.caltech.edu

Agenda
- Motivation behind 40GE in Data Servers
- Network & Servers Design
- Designing a Fast Data Transfer Kit
- PCIe Gen3 Server Performance
- 40G Network Testing
- WAN Transfers
- Disk to Disk Transfers
- Questions?

The Motivation
- The LHC experiments, with their distributed computing models and global program of LHC physics, have a renewed focus on networks, and correspondingly a renewed emphasis on network capacity and reliability.
- Networks have seen exponential growth in capacity: 10x growth in usage every 47 months in ESnet over 18 years, and roughly 6 million times capacity growth over 25 years across the Atlantic (LEP3Net in 1985 to USLHCNet today).
- The LHC experiments (CMS / ATLAS) are generating large data sets which need to be transferred efficiently to end sites anywhere in the world.
- This demands a sustained ability to use ever-larger continental and transoceanic networks effectively: high-throughput transfers.
- HEP serves as a driver of R&E and mission-oriented networks, testing the latest innovations in both software and hardware.
(Harvey Newman, Caltech)

27.3 PetaBytes Transferred Over 6 Months
- Average transfer rate = 14 Gbps (Dec 2011 to May 2012)

Target Features in a 40GE Server
- Has at least one 40GE port connecting to LAN/WAN
- Able to read from disks at near 40Gbps (4.9 GB/sec)
- Able to write to disks at near 40Gbps (4.9 GB/sec)
- Line-rate network throughput with minimum CPU utilization, leaving more headroom for applications (a memory-to-memory test sketch follows below)
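
A quick way to check whether a server can drive the link at line rate, before any disks are involved, is a memory-to-memory TCP test. The commands below are only an illustration of such a test using iperf with parallel streams; the peer address, window size, and stream count are assumptions, not values taken from these slides.

# On the receiving server (example address 192.168.40.2): start an iperf TCP server with a large window
iperf -s -w 16M
# On the sending server: 4 parallel TCP streams for 60 seconds, reporting every 5 seconds
# (-P 4 mirrors the 4 FDT flows used later at SC11; -w 16M matches the 16MB socket buffers set via sysctl)
iperf -c 192.168.40.2 -P 4 -w 16M -t 60 -i 5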

History of 40GE NICs
- Mellanox is the only vendor offering 40GE NICs (since 2010). There are mainly three variants:
- 40GE Gen2 NIC (ConnectX-2): PCIe Gen 2.0 x8 interface, 32Gbps FD; 8b/10b line encoding with 20% overhead, giving 25.6Gbps FD
- 40GE Gen3 NIC: PCIe Gen 3.0 x8 interface, 1GB/sec per lane or 64Gbps FD; more efficient 64b/66b encoding (see the arithmetic sketch below)
- 40/56Gbps Ethernet/VPI NIC: faster clock rate, can go up to 56Gbps using IB FDR mode
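
The throughput figures above (and the 4.9 GB/sec target on the previous slide) follow from simple arithmetic on line rates and encoding efficiency; the awk one-liners below merely reproduce that arithmetic and are not part of the original slides.

# Payload rate of a 40GE link running at the 39.6Gbps line rate quoted later in the talk
awk 'BEGIN { printf "40GE payload: %.2f GB/s\n", 39.6/8 }'
# Efficiency of the line encodings mentioned above: 8b/10b versus the newer 64b/66b
awk 'BEGIN { printf "8b/10b keeps %.0f%% of raw bits, 64b/66b keeps %.1f%%\n", 100*8/10, 100*64/66 }'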

40GE Server Design Kit
- Sandy Bridge E5-based servers (SuperMicro X9DRi-F or Dell R720)
- Intel E5-2670 with C1 or C2 stepping
- 128GB of DDR3 1600MHz RAM
- Mellanox VPI CX-3 PCIe Gen3 NIC
- Dell / Mellanox QSFP active fiber cables
- LSI 9265-8i 8-port SATA 6G RAID controller
- OCZ Vertex 3 SSD, 6Gb/s (preferably enterprise disks like the Deneva 2)
- Dell Force10 Z9000 40GE switch
- Server cost = ~$15k
http://supercomputing.caltech.edu/40gekit.html

System Layout
[Block diagram: dual-socket Sandy Bridge system with two 8-core CPUs linked by QPI and four DDR3 channels per socket. The Mellanox NIC sits in a PCIe Gen3 x8 slot (MSI interrupts: 4 / 8 / 16) and the LSI RAID controllers in PCIe Gen2 x8 slots; the FDT transfer application drives both. See the NUMA locality check below.]
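
Because each PCIe slot in this layout hangs off a specific socket, it matters which NUMA node a device is attached to. On Linux this can be read from sysfs; the interface name eth2 matches the later tuning slides, while the PCI address is only an example.

# NUMA node of the 40GE interface
cat /sys/class/net/eth2/device/numa_node
# NUMA node and negotiated link speed/width of any PCIe device, e.g. the RAID controller
# (replace 0000:82:00.0 with the address reported by lspci for the device in question)
cat /sys/bus/pci/devices/0000:82:00.0/numa_node
lspci -s 82:00.0 -vv | grep -i lnksta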

Hardware Setting/Tuning
SuperMicro motherboard X9DRi-F:
- The PCIe slot needs to be manually set to Gen3; otherwise it defaults to Gen2
- Disable Hyper-Threading
- Change the PCIe payload size to the maximum (for Mellanox NICs)
Mellanox CX3 VPI:
- Use the latest firmware and drivers
- Use QSFP active fiber cables
Dell Force10 Z9000 switch:
- Flow control needs to be turned on for server-facing ports
- Single-queue model compared with the 4-queue model
- MTU = 9000 (host-side commands sketched below)
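
On the host side, the matching interface settings can be applied and verified with standard tools; the commands below are an illustration rather than the exact procedure used (interface name eth2 assumed, as in the tuning slides).

# Enable Ethernet flow control (pause frames) on the server-facing interface, then verify
ethtool -A eth2 rx on tx on
ethtool -a eth2
# Jumbo frames to match the switch-side MTU of 9000
ifconfig eth2 mtu 9000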

Software and Tuning
- Scientific Linux 6.2 distribution, default kernel
- Fast Data Transfer (FDT) utility for moving data among the sites
- Writing on the RAID-0 (SSD disk pool); /dev/zero to /dev/null memory tests
- Kernel SMP affinity: bind the Mellanox NIC driver queues to the processor cores to which the NIC's PCIe lanes are connected
- Move the LSI driver IRQ to the second processor
- Use NUMA control to bind the FDT application to the second processor
- Change kernel TCP/IP parameters as recommended by Mellanox

System Tuning Details

/etc/sysctl.conf (added during Mellanox driver installation):
## MLXNET tuning parameters ##
net.ipv4.tcp_timestamps = 0
net.ipv4.tcp_sack = 0
net.ipv4.tcp_low_latency = 1
net.core.netdev_max_backlog = 250000
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.rmem_default = 16777216
net.core.wmem_default = 16777216
net.core.optmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
## END MLXNET ##

Ethernet interface:
ifconfig eth2 mtu 9000
ethtool -G eth2 rx 8192

numactl (with local-node memory binding):
numactl --physcpubind=1,2 --localalloc /usr/java/latest/bin/java -jar /root/fdt.jar &

SMP affinity (Mellanox NIC):
set_irq_affinity_bynode.sh 1 eth2

SMP affinity (LSI RAID controller), verification sketch below:
echo 20 > /proc/irq/73/smp_affinity
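
To confirm that the affinity settings above took effect, the kernel's view can be inspected directly. This is a generic verification sketch, not part of the original slides; IRQ 73 and eth2 are the values quoted above, and driver names in /proc/interrupts may differ.

# CPUs allowed to service the LSI RAID controller interrupt (the value is a hex CPU bitmask; 20 selects CPU 5)
cat /proc/irq/73/smp_affinity
# Per-CPU interrupt counts for the Mellanox queues and the LSI controller
grep -E 'eth2|megasas' /proc/interrupts
# CPU-to-NUMA-node layout, to cross-check which cores belong to the second processor
numactl --hardware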

SuperComputing 2011 Collaborators
Caltech Booth 1223 (courtesy of Ciena)

SC11 - WAN Design for 100G

SC11 - PCIe Gen3 Performance: 36.8 Gbps
- 4 FDT TCP flows, memory-to-memory test, reaching ~37Gbps
- The Dell R720xd showed results similar to the SuperMicro

SC11 - Servers Testing, Reaching 100G
[Plot of inbound/outbound traffic in Gbps]
- Sustained 186 Gbps aggregate; enough to transfer 100,000 Blu-ray discs per day

SC11 - Disk-to-Disk Results: Peaks of 60Gbps
[Throughput plot with annotated levels of 60Gbps, 10Gbps, and 12Gbps]
- Disk writes on 7 SuperMicro and Dell servers with a mix of 40GE and 10GE servers

40GE Data Transfer Demo between Amsterdam and Geneva
- In collaboration with SURFnet, CERN, Ciena, and Mellanox
- Distance = 1650km, RTT = 16ms

40GE Line Rate at 39.6Gbps
- Network throughput at ~40Gbps; disk write ~17.5 Gbps
- CPU core utilization: ~25% each for 4 cores

Key Challenges Encountered
- The SuperMicro servers, Mellanox CX3 NIC, and drivers were all at the BETA stage
- First-hand experience with PCIe Gen3 servers using sample E5 Sandy Bridge processors; not many vendors were available for testing
- What do we know about the BIOS settings for Gen3 (slots, processor performance mode)?
- The Mellanox NIC randomly threw interface errors
- QSFP passive copper cables had issues at line rate (39.6Gbps); use fiber cables instead
- The LSI drivers are single-threaded, utilizing a single core to its maximum
- Would FDT be able to get close to the line rate of the Mellanox network cards, 39.6Gbps (theoretical peak)?
- End-to-end 100G and 40G testing: any transport issues?

Future Directions
- Investigate bottlenecks in the LSI RAID card driver; a new driver supporting MSI-X vectors is available (many configurable queues; see the check below)
- Optimize the Linux kernel: SSD tuning as compared to mechanical disks, kernel timers, other unknowns
- Investigate performance of PCIe-based SSD drives from vendors such as Intel, FusionIO, and OCZ
- Find ways to lower CPU utilization; investigate RoCE
- Understand/overcome SSD wear-out problems over time
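
Whether a replacement driver actually exposes MSI-X vectors, and how many, can be checked without running any benchmark. The commands below are a generic illustration; the PCI address is an example and driver names in /proc/interrupts vary by version.

# Does the RAID controller advertise MSI-X, is it enabled, and with how many vectors?
lspci -s 82:00.0 -vv | grep -i msi-x
# Per-queue interrupt lines registered by the RAID driver
grep -i megasas /proc/interrupts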

Summary
- 40/100Gbps network technology has shown the potential to transfer peta-scale physics datasets around the world in a matter of hours.
- Three highly tuned servers can easily reach the 100GE line rate, effectively utilizing PCIe Gen3 technology.
- Individual server tests using E5 processors and PCIe Gen3-based network cards have shown stable network performance, reaching line rate at 39.6Gbps.
- During SC11, the Fast Data Transfer (FDT) application achieved an aggregate disk write of 60Gbps.
- The MonALISA intelligent monitoring software effectively recorded and displayed the traffic on the 40/100G and other 10GE links.

Questions?