15-418 Final Project Report. Trading Platform Server




Yinghao Wang (yinghaow@andrew.cmu.edu)
May 8, 2014

Executive Summary

This final project implements a trading platform server that provides back-end support for trading algorithms and a user interface. Specifically, the server supports order execution, indicator computation, Monte Carlo simulation, market scanning, and client request handling. The trading platform was tested on GHC 3 machines with a 6-core Xeon CPU and a GTX 480 graphics card. This report provides an overview of the background, implementation details, challenges, and performance analysis of a trading platform server that can be launched on a workstation.

Background

In dynamic and ever-changing financial markets, traders and portfolio managers increasingly rely on advanced trading technology and infrastructure to achieve returns that outperform the broad market. As trading becomes automated and algorithms replace human traders in scanning for market signals, execution speed and API support are essential to outstanding performance. The implementation of the project follows the structure shown in the figure below. The trading platform provides key back-end functionality, such as order routing, computing and storing technical indicators, portfolio scenario analysis, and market signal scanning.

Functions & Performance Objectives

Order Execution

The trading platform exposes its order execution functions through an API to the client side. The order execution function is a wrapper that bridges trading algorithms and brokerage servers. Since the platform is only a wrapper around brokerage trading functions, it should not significantly increase trade execution time. The objective is that all orders be handled within 1 millisecond. The trading speed of major brokerage accounts is shown in the figure below; adding 1 millisecond to their trade execution time does not reduce overall trading speed by a significant amount.

Several key factors are considered in measuring a scheduler's performance:

- Priority. Tasks with higher priority should have a higher chance of being executed than tasks with lower priority. Order execution should run immediately after a request arrives at the master process, because execution speed is key in electronic trading.
- Deadline. Certain operations must finish within a fixed period of time. For example, a new database update is generated every second, and each such request should be executed before the next one arrives.
- Starvation. Although higher-priority tasks should be executed earlier than lower-priority tasks, the scheduler should not starve lower-priority tasks.
- Waiting time relative to task size. Reducing execution time by 1 second matters more for a task that usually takes 10 seconds than for one that usually takes 100.

A performance score is defined for the scheduler as a weighted combination of the following quantities: p, the priority of a client request; P, the maximum priority level; T, the execution time; N_order, the number of occurrences in which an order is not handled within 1 ms; and N_db, the number of occurrences in which a database update is not handled within 1 second. Since the scheduler aims to execute client requests as fast as possible, the lower the score, the better the performance.
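As a concrete illustration, such a score might be computed as a priority-weighted sum of execution times plus fixed penalties for missed deadlines. The functional form and the penalty weights in the sketch below are assumptions for illustration, not the report's actual formula:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

struct Request {
    int priority;     // p: priority of the client request
    double execTime;  // T: execution time in seconds
};

// Hypothetical score: lower is better. Higher-priority requests
// contribute more per second of execution time, and every missed
// deadline adds a fixed penalty (weights are assumed values).
double schedulerScore(const std::vector<Request>& completed,
                      int maxPriority,               // P
                      int missedOrders,              // orders not handled within 1 ms
                      int missedDbUpdates,           // updates not handled within 1 s
                      double orderPenalty = 100.0,   // assumed weight
                      double dbPenalty = 100.0) {    // assumed weight
    double s = 0.0;
    for (const Request& r : completed)
        s += (static_cast<double>(r.priority) / maxPriority) * r.execTime;
    return s + orderPenalty * missedOrders + dbPenalty * missedDbUpdates;
}
```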

Approach

Algorithm 1. This algorithm is based on the idea that, over the trailing 10 seconds, the weighted execution time of tasks in each category should be roughly balanced. Specifically, the products w_i * T_i should be kept roughly equal across priorities, where w_i is the weight assigned to priority i and T_i is the time spent executing tasks of priority i. The scheduler maintains separate work queues for the different task types and selects tasks according to this balanced-execution-time criterion.

Algorithm 2. The skeleton of this algorithm is similar to Algorithm 1, but it partitions large tasks into small parts. For example, if a simulation has 10,000 paths, the scheduler may partition the task into 10 parts; after executing 1,000 paths, it puts the task, together with its partial results, back on the work queue.

Indicators Computation

The task is to compute technical indicators for Russell 3000 stocks. The following technical indicators are supported: simple moving average, exponential moving average, Bollinger Bands, and parabolic SAR. Three approaches were tested:

- Single CPU core
- Multiple CPU cores
- CUDA

The rationale behind a CUDA implementation is that the bottleneck in indicator computation is memory access. The algorithms for generating these technical indicators perform O(n) memory accesses, and since the computation itself is also O(n), increasing effective memory access speed is key to improving performance. A CUDA implementation can hide latency and boost bandwidth.
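A minimal C++ sketch of Algorithms 1 and 2 follows, assuming per-priority queues and a simplified (non-expiring) accumulator in place of a true trailing window; the class, names, and accounting scheme are illustrative, not the report's actual code:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <deque>
#include <vector>

struct Task {
    int priority;   // index into the per-priority queues
    int unitsLeft;  // e.g. Monte Carlo paths still to simulate
};

class BalancedScheduler {
public:
    explicit BalancedScheduler(std::vector<double> weights)
        : weights_(std::move(weights)),
          queues_(weights_.size()),
          recentTime_(weights_.size(), 0.0) {}

    void submit(const Task& t) { queues_[t.priority].push_back(t); }

    // Algorithm 1: among non-empty queues, pick the one whose weighted
    // execution time w_i * T_i is currently smallest, so time spent on
    // each category stays roughly balanced.
    bool pick(Task& out) {
        int best = -1;
        for (std::size_t i = 0; i < queues_.size(); ++i) {
            if (queues_[i].empty()) continue;
            if (best < 0 || weights_[i] * recentTime_[i] <
                                weights_[best] * recentTime_[best])
                best = static_cast<int>(i);
        }
        if (best < 0) return false;
        out = queues_[best].front();
        queues_[best].pop_front();
        return true;
    }

    // Algorithm 2: run at most `chunk` units, then put the task (with
    // its partial progress) back on the queue instead of finishing it.
    void runChunk(Task t, int chunk, double secondsPerUnit) {
        int run = t.unitsLeft < chunk ? t.unitsLeft : chunk;
        recentTime_[t.priority] += run * secondsPerUnit;  // simplified accounting
        t.unitsLeft -= run;
        if (t.unitsLeft > 0) submit(t);
    }

    std::size_t pending(int p) const { return queues_[p].size(); }
    double recentTime(int p) const { return recentTime_[p]; }

private:
    std::vector<double> weights_;
    std::vector<std::deque<Task>> queues_;
    std::vector<double> recentTime_;
};
```

A real implementation would age out time spent more than 10 seconds ago; the accumulator here never decays, which keeps the sketch short.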

Monte Carlo Simulation

The trading platform can also handle scenario analysis tasks, in which Monte Carlo simulation is widely used. Three implementations were tested:

- Sequential implementation
- Single-thread implementation
- Multi-thread implementation

Results

Scheduler. For comparison purposes, several scheduling policies were tested:

- First in, first out (FIFO).
- No priority. Orders are executed first, and all other tasks are executed without regard to priority.
- Algorithm 1.
- Algorithm 2.

[Bar chart: performance score by policy (lower is better) — FIFO: 633, No priority: 1813, Algorithm 1: 652, Algorithm 2: 55]

Algorithm 2 has the best performance because it achieves workload balance between tasks of different priorities and is designed to capture the diminishing return of performance (reducing execution time by 1 second is more significant for a program that usually takes 10 seconds than for one that usually takes 50 seconds).

Monte Carlo Simulation. The speedup diagram is shown below. The trading platform achieves significant speedup on CPU-intensive tasks. Peak performance occurs when using 4 threads on a single core.
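A minimal sketch of how the multi-threaded Monte Carlo implementation might look, splitting geometric-Brownian-motion paths across std::thread workers; the price model, parameters, and per-thread seeding are illustrative assumptions, not the report's actual code:

```cpp
#include <cassert>
#include <cmath>
#include <numeric>
#include <random>
#include <thread>
#include <vector>

// Average the simulated terminal price of `paths` GBM paths,
// distributing the work evenly across `nthreads` worker threads.
// Assumes paths % nthreads == 0 for simplicity.
double mcTerminalMean(double s0, double mu, double sigma, double horizon,
                      int paths, int nthreads) {
    std::vector<double> partial(nthreads, 0.0);  // one slot per thread, no locking needed
    std::vector<std::thread> workers;
    int per = paths / nthreads;
    for (int t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            std::mt19937 rng(1234 + t);  // independent per-thread RNG stream
            std::normal_distribution<double> z(0.0, 1.0);
            double sum = 0.0;
            for (int i = 0; i < per; ++i)
                sum += s0 * std::exp((mu - 0.5 * sigma * sigma) * horizon +
                                     sigma * std::sqrt(horizon) * z(rng));
            partial[t] = sum;
        });
    }
    for (auto& w : workers) w.join();
    double total = std::accumulate(partial.begin(), partial.end(), 0.0);
    return total / (per * nthreads);
}
```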

[Bar chart: Monte Carlo speedup over the sequential version for 1, 2, 4, 8, and 16 threads (y-axis up to 6x)]

Indicators Computation. The CUDA implementation achieves a 10.7x speedup over the sequential implementation. The CPU-based implementations were not able to achieve significant speedup because the bottleneck is memory access; the GPU, however, can hide latency and boost memory bandwidth, which is why the CUDA implementation outperforms the CPU implementations.

[Bar chart: indicator computation speedup — Seq: 1.0, 1 thread: 1.4, 4 threads: 1.5, GPU: 10.7]
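To illustrate why indicator computation is memory-bound, here is a sketch of a simple moving average using a sliding-window running sum: a single O(n) pass whose arithmetic is trivial, so throughput is limited by how fast prices stream through memory. The function is illustrative, not the report's code:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simple moving average over `window` prices: one O(n) pass with a
// running sum, so the cost per element is one load, one add, one
// subtract, and one divide -- dominated by memory traffic.
std::vector<double> sma(const std::vector<double>& px, int window) {
    std::vector<double> out;
    if (window <= 0 || static_cast<int>(px.size()) < window) return out;
    double sum = 0.0;
    for (int i = 0; i < window; ++i) sum += px[i];
    out.push_back(sum / window);
    for (std::size_t i = window; i < px.size(); ++i) {
        sum += px[i] - px[i - window];  // add newest price, drop oldest
        out.push_back(sum / window);
    }
    return out;
}
```

On the GPU, one natural mapping (consistent with the report's bandwidth argument) is to assign each stock's price series to its own set of threads so that many such streams are in flight at once, hiding memory latency.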