Research Report: The Arista 7124FX Switch as a High Performance Trade Execution Platform



Similar documents
Finteligent Trading Technology Community (FTTC) Research Report: 10GbE Low Latency Networking Technology Review

Enhance Service Delivery and Accelerate Financial Applications with Consolidated Market Data

D1.2 Network Load Balancing

Xeon+FPGA Platform for the Data Center

The new frontier of the DATA acquisition using 1 and 10 Gb/s Ethernet links. Filippo Costa on behalf of the ALICE DAQ group

Going to the wire: The next generation financial risk management platform

RDMA over Ethernet - A Preliminary Study

Achieving a High-Performance Virtual Network Infrastructure with PLUMgrid IO Visor & Mellanox ConnectX -3 Pro

Nine Use Cases for Endace Systems in a Modern Trading Environment

White Paper. Recording Server Virtualization

Data Center and Cloud Computing Market Landscape and Challenges

Evaluation Report: Emulex OCe GbE and OCe GbE Adapter Comparison with Intel X710 10GbE and XL710 40GbE Adapters

50. DFN Betriebstagung

FLOW-3D Performance Benchmark and Profiling. September 2012

High-Density Network Flow Monitoring

The Next Generation Trading Infrastructure What You Will Learn. The State of High-Performance Trading Architecture

VXLAN: Scaling Data Center Capacity. White Paper

Hardwired: FX trading in a chip

Mellanox Academy Online Training (E-learning)

HC900 Hybrid Controller When you need more than just discrete control

VXLAN Performance Evaluation on VMware vsphere 5.1

Technical Bulletin. Enabling Arista Advanced Monitoring. Overview

QoS & Traffic Management

A Low Latency Library in FPGA Hardware for High Frequency Trading (HFT)

10/100/1000Mbps Ethernet MAC with Protocol Acceleration MAC-NET Core with Avalon Interface

White Paper Increase Flexibility in Layer 2 Switches by Integrating Ethernet ASSP Functions Into FPGAs

OpenFlow with Intel Voravit Tanyingyong, Markus Hidell, Peter Sjödin

Axon: A Flexible Substrate for Source- routed Ethernet. Jeffrey Shafer Brent Stephens Michael Foss Sco6 Rixner Alan L. Cox

Simplifying Big Data Deployments in Cloud Environments with Mellanox Interconnects and QualiSystems Orchestration Solutions

Why 25GE is the Best Choice for Data Centers

The virtualization of SAP environments to accommodate standardization and easier management is gaining momentum in data centers.

10/100/1000 Ethernet MAC with Protocol Acceleration MAC-NET Core

Web Traffic Capture Butler Street, Suite 200 Pittsburgh, PA (412)

VMWARE WHITE PAPER 1

Networking Virtualization Using FPGAs

Intel DPDK Boosts Server Appliance Performance White Paper

NetTESTER Embedded 'Always-On' Network Testing & In-Service Performance Assurance

Web Analytics Understand your web visitors without web logs or page tags and keep all your data inside your firewall.

Virtualised MikroTik

ETHERNET ENCRYPTION MODES TECHNICAL-PAPER

Intel Ethernet Switch Load Balancing System Design Using Advanced Features in Intel Ethernet Switch Family

How To Switch A Layer 1 Matrix Switch On A Network On A Cloud (Network) On A Microsoft Network (Network On A Server) On An Openflow (Network-1) On The Network (Netscout) On Your Network (

How To Monitor And Test An Ethernet Network On A Computer Or Network Card

Informatica Ultra Messaging SMX Shared-Memory Transport

HANIC 100G: Hardware accelerator for 100 Gbps network traffic monitoring

RoCE vs. iwarp Competitive Analysis

A Low-Latency Solution for High- Frequency Trading from IBM and Mellanox

Enterprise Network Management. March 4, 2009

High-performance vswitch of the user, by the user, for the user

EKT 332/4 COMPUTER NETWORK

Cluster Computing at HRI

File System Design and Implementation

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Hardware acceleration enhancing network security

How To Make Gigabit Ethernet A Reality

The Lagopus SDN Software Switch. 3.1 SDN and OpenFlow. 3. Cloud Computing Technology

How To Trade At Speed On An Exchange Market

Performance Evaluation of VMXNET3 Virtual Network Device VMware vsphere 4 build

Big Data in the Enterprise: Network Design Considerations

ACHILLES CERTIFICATION. SIS Module SLS 1508

Network Security Platform 7.5

High Performance Computing in CST STUDIO SUITE

Fibre Channel over Ethernet in the Data Center: An Introduction

I/O Virtualization Using Mellanox InfiniBand And Channel I/O Virtualization (CIOV) Technology

Leveraging NIC Technology to Improve Network Performance in VMware vsphere

AlliedWare Plus OS How To Use sflow in a Network

APPLICATION NOTE 209 QUALITY OF SERVICE: KEY CONCEPTS AND TESTING NEEDS. Quality of Service Drivers. Why Test Quality of Service?

CCNA Discovery Networking for Homes and Small Businesses Student Packet Tracer Lab Manual

Sockets vs. RDMA Interface over 10-Gigabit Networks: An In-depth Analysis of the Memory Traffic Bottleneck

INDIAN INSTITUTE OF TECHNOLOGY KANPUR Department of Mechanical Engineering

HARDWARE ACCELERATION IN FINANCIAL MARKETS. A step change in speed

Windows Server 2008 R2 Hyper-V Live Migration

Performance of Cisco IPS 4500 and 4300 Series Sensors

Latency in High Performance Trading Systems Feb 2010

How To Set Up Foglight Nms For A Proof Of Concept

Inside Track Research Note. In association with. Enterprise Storage Architectures. Is it only about scale up or scale out?

Performance Evaluation of Linux Bridge

HP ProLiant BL460c takes #1 performance on Siebel CRM Release 8.0 Benchmark Industry Applications running Linux, Oracle

1-Gigabit TCP Offload Engine

2015 Global Technology conference. Diane Bryant Senior Vice President & General Manager Data Center Group Intel Corporation

SERVER CLUSTERING TECHNOLOGY & CONCEPT

Technical Support Information Belkin internal use only

Wireshark in a Multi-Core Environment Using Hardware Acceleration Presenter: Pete Sanders, Napatech Inc. Sharkfest 2009 Stanford University

HP ProLiant DL380 G5 takes #1 2P performance spot on Siebel CRM Release 8.0 Benchmark Industry Applications running Windows

Cisco for SAP HANA Scale-Out Solution on Cisco UCS with NetApp Storage

dnstap: high speed DNS logging without packet capture Robert Edmonds Farsight Security, Inc.

Where IT perceptions are reality. Test Report. OCe14000 Performance. Featuring Emulex OCe14102 Network Adapters Emulex XE100 Offload Engine

Programmable Networking with Open vswitch

Datasheet iscsi Protocol

Zadara Storage Cloud A

The Future of Computing Cisco Unified Computing System. Markus Kunstmann Channels Systems Engineer

OpenFlow Based Load Balancing

SIDN Server Measurements

hp ProLiant network adapter teaming

VDI: What Does it Mean, Deploying challenges & Will It Save You Money?

Linux NIC and iscsi Performance over 40GbE

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance

Zeus Traffic Manager VA Performance on vsphere 4

Symantec Data Loss Prevention Network Monitor and Prevent Performance Sizing Guidelines. Version 12.5

Optimising a PROFINET IRT Architecture for Maximisation of Non Real Time Traffic

Transcription:

Research Report: The Arista 7124FX Switch as a High Performance Trade Execution Platform Abstract: Many groups are working on reducing trading execution latency - the time from a critical Ethernet frame (such as market feed or execution instruction message) to a corresponding market order being sent out. This report, an adjunct to the October 2012 Finteligent report titled Research Report: be Low Latency Networking Technology Review describes an approach where market feed analysis and order execution is performed directly on an Arista Application Switch to achieve an order of magnitude reduction in latency. September 2013 P8006-R-001d Page 1 of 8

1 Summary This document is an adjunct to the report published by the Finteligent Trading Technology Community (FTTC) in October 2012 titled Research Report: be Low Latency Networking Technology Review and related reports released throughout 2012/13. These reports are available upon registration here. FTTC is an informal community of technology vendors who collaborate to facilitate tests of architecture designs across the technology stack as it applies to high performance trading. The community has tested the impact of CPUs, servers, switches, virtualisation and FIX engines on overall performance as reflected in latency figures. A standard and consistent test harness has been used across all the tests, such that comparison can be made between design concepts and products which are interchanged. The majority of the tests have been conducted in the Intel fasterlab facility in Winnersh near London, England using resources from OnX Enterprise Solutions Inc, Intel Corporation and collaborating vendors. This report presents the findings of tests conducted by Argon Designs in their Cambridge, England labs using the same test harness and test rig configuration, with the objective of assessing the effect of heterogeneous hardware architectures (x86 + FPGA) to enhance trading systems performance. The best results achieved by FTTC for average latency from an all x86 environment was 4,663ns. The Argon Design tests delivered latency of 170ns for the same comparative measured leg of the trade indicating that quantum gains can be achieved from hardware architecture design of the trading technology stack. The design techniques, methodology and environment are described fully below. 2 The Finteligent setup 2.1 The hardware setup The diagram below summarizes the hardware setup of the standard test rig developed by FTTC. An x86 trade server interfaces through a switch with a basic server which hosts a market data simulator and a venue/exchange simulator, both based on the FIX protocol. The switch is tapped on the ports connected to the simulators. The taps capture packets using a precise timestamping Endace card. The latency figures are calculated in a post-processing stage. (See full description at the end of section 4, page 9, in the Finteligent report.) P8006-R-001d Page 2 of 8

Trade server (x86) Switch Network tap Network tap Market data simulator x86 Exchange Simulator The x86 trade server (2x Intel Xeon E5-2690 @ 2.9GHz, with 192 GB of RAM) is a high end specification and has been optimised for low latency. The network cards implement kernelbypass, and various power and networking settings have been tweaked in the BIOS and elsewhere. (See the start of section 5, page 9 of the Finteligent report.) 2.2 The trading algorithm The standard test harness has a very simple trading algorithm which is used to generate the trades; its function is to get message data into the infrastructure allowing the focus of the benchmark to be on the hardware components under test. The algorithm consists of two parts, the buy part which is triggered by Ethernet frames from the market data simulator, and the sell part which is triggered by Ethernet frames from the trade matching function in the exchange/venue simulator. The buy and sell parts are similar. Both involve sending market orders in response to FIX messages from simulators. The aim for high-frequency traders is to minimise the response latencies of the switch plus trade server system when sending those market orders. (For more details, see page 8 of the Finteligent report.) 2.2.1 Test harness: The buy part The market data simulator broadcasts Market Data Incremental Refresh messages where the stock symbols are taken at random from a list of stocks. Independently for each stock, the price is incremented in units of $0.001 on each update. Whenever such a feed update has a price with decimal part.000 the trade server should send a BUY market order to the exchange simulator, requesting to buy 100 shares of the given stock at the given price. 2.2.2 Test harness: The sell part Whenever the exchange simulator receives a BUY market order from the trade server, it immediately sends an execution report confirming receipt of the BUY order. Some amount of time later, the exchange simulator sends another execution report informing the trade server P8006-R-001d Page 3 of 8

that the BUY order has been filled. When such an execution report is received, the trade server should send a SELL market order to the exchange simulator, requesting to sell the 100 shares it originally bought. 2.3 The results As seen in the series of FTTC reports, a number of products and technologies were tested across the stack. In this optimised x86 hardware setup and simplified algorithmic setting, the best average latency achieved was 4,663ns. This occurred with the Mellanox ConnectX-3 EN network card (specifically, for the buy strategy under a load of 71k market data updates per second). 3 The Argon Design test scenario In this test exercise the Argon Design trading stack is based on the Arista FX 7124 switch. This Arista switch is at the forefront of current design thinking and is unconventional because it contains a large FPGA (from Altera) with hardware-level access to 8 of its 24 b Ethernet ports. By introducing an FPGA embedded alongside the switch ASIC and working in collaboration with x86 components, this design allows high end hardware technologies to be brought into the standard equipment mix in CoLo racks in exchanges and venues for high performance HFT trading strategy execution. 3.1 The philosophy The following diagram shows the fast-path approach to making market orders. The high level trading algorithm, run on an x86 platform, delegates the low level order execution to the FPGA in the Arista switch. The insight is that by decoupling trading algorithm and order execution, execution times can be taken right down while retaining the flexibility and scale of the x86 platform as a high-level numerical analysis platform. This approach uses the best hardware technology for the task in hand it is not either-or. For firms thinking about their technology strategy to keep them at the high end of performance, it allows the option of retaining x86 code libraries (and taking advantage of CPU innovation such as many core) while using FPGA coding techniques to handle discrete tasks in a complimentary manner. Working in tandem the results deliver more than an order of magnitude improvement. P8006-R-001d Page 4 of 8

Trading Algorithm High level x86 platform or PCIe Order execution Low level Arista FPGA Market data Exchange 3.2 Argon Design FPGA-x86 architecture set up As a proof-of-concept demonstration the order execution functions have been decoupled and coded to the FPGA part of the Arista 7124FX switch. The diagram below shows the hardware setup. Terminal 1G (SSH) Algorithm Configuration GUI (x86) PCIe Arista Switch Scope probe Market data MAC FPGA Exchange MAC Scope probe Market data simulator x86 Exchange simulator At the bottom of the above topology, the same market data and exchange simulator test harness from the Finteligent consortium has been used on a similar specification server. Moving up, the passive switch in the Finteligent setup is replaced by an active Arista FPGA Application Switch. The network taps in the Finteligent setting are replaced by scope probes at the FPGA MAC layer. At the highest level, the trade server is replaced by a terminal where high level trading strategies are configured by an operator of the test harness itself and executed in real time by the switch. This is discussed further in the last section. P8006-R-001d Page 5 of 8

3.3 Latency improvement techniques available to FPGAs We use two techniques which are made possible with the use of an FPGA. 3.3.1 Inline parsing and matching The first is inline parsing and matching of the FIX messages. As the Ethernet frames enter the FPGA, the FIX key/value pairs get parsed as soon as possible. This allows for partial information to be extracted before even the whole frame has been received. Frame received from the exchange Feed parsing and matching Start of frame End of frame 3.3.2 Pre-emption of market orders The second technique is what we call pre-emption. Instead of waiting until the end of the FIX messages before acting upon them, we start sending the overhead part of the response which contains the Ethernet, IP, TCP and FIX headers. This allows for sending a populated response right after a match is made. If a match is not made, the Ethernet frame is poisoned by using an invalid Frame Checksum. This aborted Ethernet Frame will be discarded by the network device at the other end of the cable. Frame received from the exchange Scenario 1: The exchange data matches and a corresponding market order is sent Scenario 2: The exchange data does not match and the pre-empted order is invalidated Market order sent to the exchange Aborted order discarded by the exchange Preemption starts ~20ns latency Frame is invalidated 3.4 Latency results For the PHY and MAC layers, the ultra-low latency solution by Tamba Networks was used, which has a roundtrip latency of 150ns. The diagram below explains the other latencies involved, both when preemption is enabled and disabled. P8006-R-001d Page 6 of 8

Market order sent with preemption Market order without preemption RX data from feed TX data to exchange Total latency: ~170ns < 150ns PHY+MAC overhead + ~20ns ~20ns latency RX data from feed TX data to exchange 6.4ns latency 150ns order sending time Total latency: ~300ns < 150ns PHY+MAC overhead + 6.4ns + ~150ns order sending time Latencies were already measured using a scope. We attach a screenshot below. 3.5 Matching of market data The incoming market data feeds and execution message transports use the FIX protocol. FIX messages are a series of key/value pairs. For example, the keys could correspond to price and symbol and the values could correspond to $492.120 and AAPL. For each port with incoming market information a set of trigger events are defined, each with an associated action. Each trigger event is a set of key/value rules all of which are compared to the incoming FIX messages by a strategy matcher, and must all be true for the order to be completed. The comparison operators for the demo are equal, not equal, greater than and less than. 3.5.1 Real time configuration of trigger events For the demonstration, a terminal is used to configure the trigger events in real time. The screenshot below shows an example trigger on the market feeds. The key/value rules are on the left, and the action on the right. P8006-R-001d Page 7 of 8

In a real system the terminal and algorithm configuration GUI would be replaced by a high performance x86 trading algorithm platform that issued rule sets to the FPGA. This would support the high trading volumes of a real world scenario. 4 Conclusion The FTTC initiative has delivered very useful comparative data on performance over the 2012-2013 period. Argon Design s initiative in adding FPGA hardware has proved conclusively the advantages to be gained from such hybrid designs. The potential of this kind of approach to technology design has been known to the market for some time, but the lack of widespread skills and the significant historical investment in x86 code libraries has meant that take up has been relatively slow and limited to niche ecosystems. This exercise has proved the potential of treating x86 and FPGA as complimentary technologies that can work together. The key is the partitioning of tasks and the ability of companies like Argon Design to apply leading edge engineering skills to this specialist task. These proof points and the availability of skills aligned to trading workloads indicate that the technology is poised for prime time. P8006-R-001d Page 8 of 8