15-418 Final Project Report. Trading Platform Server

Yinghao Wang
yinghaow@andrew.cmu.edu
May 8, 2014

Executive Summary

The final project implements a trading platform server that provides back-end support for trading algorithms and a user interface. Specifically, the server supports order execution, indicator computation, Monte Carlo simulation, market scanning and client request handling. The trading platform was tested on GHC 3 machines with a 6-core Xeon CPU and a GTX 480 graphics card. This report gives an overview of the background, implementation details, challenges and performance analysis of a trading platform server that can be launched on a workstation.

Background

In dynamic and ever-changing financial markets, traders and portfolio managers increasingly rely on advanced trading technology and infrastructure to achieve returns that outperform the broad market. As trading becomes automated and algorithms replace conventional traders in scanning market signals, execution speed and API support become important to strong performance. The implementation follows the structure shown in the figure below. The trading platform provides key back-end functionality: order routing, computing and storing technical indicators, portfolio scenario analysis and market signal scanning.

Functions & Performance Objective

Order Execution

The trading platform exposes its order execution functions to the client side through an API. The order execution function is a wrapper that bridges trading algorithms and brokerage servers. Because it only wraps the brokerage trading functions, it should not significantly increase trade execution time. The objective is that all orders are handled within 1 ms. The trading speed of major brokerage accounts is shown in the figure below. Adding 1 ms to their trade execution time does not decrease overall trading speed by a significant amount.

Several key factors are considered in measuring a scheduler's performance.

Priority. Tasks with higher priority should have a higher chance of being executed than tasks with lower priority. Orders should be executed immediately after the requests arrive at the master process, because execution speed is key to electronic trading.

Deadline. Certain operations must finish within a fixed period of time. For example, a new database update is generated every second, and each such request should be executed before the next one arrives.

Starvation. Although higher priority tasks should be executed earlier than lower priority tasks, the scheduler should not starve lower priority tasks.

Waiting time relative to task size. Reducing execution time by 1 second is more significant for a task that usually finishes quickly than for a task that usually runs much longer.

The following performance measure is defined for the scheduler. It is a function of the priority of the client request, the maximum priority level, the execution time T, the number of occurrences in which orders are not handled within 1ms, and the number of occurrences in which a database update is not handled within 1 second. Since the scheduler aims to execute client requests as fast as possible, a lower score indicates better performance.
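The two deadline terms in the performance measure are simple counts, and can be sketched as follows. This is a minimal illustration rather than the project's code: the record type `Completed`, the labels `"order"` and `"db_update"`, and the function name are hypothetical, and the order deadline is left as a parameter instead of being fixed.

```python
from dataclasses import dataclass


@dataclass
class Completed:
    kind: str         # "order" or "db_update" (hypothetical labels)
    latency_s: float  # seconds from request arrival to completion


def count_deadline_misses(log, order_deadline_s, db_deadline_s=1.0):
    """Return (missed_orders, missed_db_updates) for a list of Completed records.

    A request counts as a miss when its latency exceeds the deadline
    for its kind. The deadlines are parameters; nothing is hard-coded
    except the 1 second database-update default taken from the text.
    """
    missed_orders = sum(
        1 for r in log if r.kind == "order" and r.latency_s > order_deadline_s
    )
    missed_db = sum(
        1 for r in log if r.kind == "db_update" and r.latency_s > db_deadline_s
    )
    return missed_orders, missed_db
```

Counting misses separately for orders and database updates keeps the two deadline penalties independent, so a scheduler cannot hide late orders behind on-time database updates.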

Approach

Algorithm 1. This algorithm is based on the idea that, over the trailing 1 s, the weighted execution time of the tasks in each category should be roughly balanced. Specifically, the time spent executing tasks of priority i, weighted by that priority, should be approximately equal across all priority levels. The scheduler maintains a separate work queue for each kind of task and selects the next task so as to keep the weighted execution times balanced.

Algorithm 2. The skeleton of this algorithm is similar to Algorithm 1, but it partitions large tasks into smaller parts. For example, a simulation with a large number of paths may be split into several parts. After executing one part, the scheduler puts the task, together with its partial results, back on the work queue.

Indicators Computation

The task is to compute technical indicators for the Russell 3000 stocks. The following technical indicators are supported: simple moving average, exponential moving average, Bollinger bands and parabolic SAR. Three approaches were tested: a single CPU core, multiple CPU cores, and CUDA.

The rationale for a CUDA implementation is that the bottleneck in indicator computation is memory access. The algorithms that generate these indicators perform O(n) memory accesses, and the computation itself is also O(n), so there is little arithmetic per element loaded and memory access speed is key to performance. A CUDA implementation can hide memory latency and provides higher memory bandwidth.
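The selection rule of Algorithm 1 and the task partitioning of Algorithm 2 can be sketched as below. The exact balance formula is not recoverable from the text, so this sketch assumes one plausible reading: each priority class has a weight (larger means more important), and the scheduler serves the non-empty queue whose recent execution time divided by its weight is smallest. All class, method and parameter names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass


class BalancedScheduler:
    """Keep time/weight roughly equal across priority classes (assumed rule)."""

    def __init__(self, weights, window_s):
        self.weights = dict(weights)          # priority -> weight (assumption)
        self.window_s = window_s              # length of the trailing window
        self.queues = {p: deque() for p in self.weights}
        self.history = deque()                # (finish_time, priority, duration)

    def submit(self, priority, task):
        self.queues[priority].append(task)

    def record(self, priority, duration, now):
        """Log that a task of this priority ran for `duration` seconds."""
        self.history.append((now, priority, duration))

    def _weighted_time(self, now):
        # Drop records that fell out of the trailing window.
        while self.history and self.history[0][0] < now - self.window_s:
            self.history.popleft()
        spent = {p: 0.0 for p in self.weights}
        for _, p, d in self.history:
            spent[p] += d
        return {p: spent[p] / self.weights[p] for p in self.weights}

    def next_task(self, now):
        score = self._weighted_time(now)
        ready = [p for p in self.queues if self.queues[p]]
        if not ready:
            return None
        # Most under-served class first; ties go to the larger weight.
        p = min(ready, key=lambda q: (score[q], -self.weights[q]))
        return p, self.queues[p].popleft()


@dataclass
class SimTask:
    remaining_paths: int
    partial_sum: float = 0.0


def run_one_part(task, part_size, simulate_path):
    """Algorithm 2 style step: run one part, keep partial results in the task.

    Returns True when the task is finished; otherwise the caller puts the
    task back on the work queue.
    """
    n = min(part_size, task.remaining_paths)
    for _ in range(n):
        task.partial_sum += simulate_path()
    task.remaining_paths -= n
    return task.remaining_paths == 0
```

Because a low-priority class accumulates no execution time while it waits, its score eventually becomes the smallest and it gets served, which is how this rule avoids starvation. Splitting long simulations into parts bounds how long any single dispatch can block the queues.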

Monte Carlo Simulation

The trading platform can handle scenario analysis tasks, in which Monte Carlo simulation is widely used. Three implementations were tested: sequential, single-thread, and multi-thread.

Result

For comparison purposes, several scheduling algorithms were tested.

First in first out. Orders are executed first and other tasks are executed regardless of priority.
Algorithm 1
Algorithm 2

[Figure: Performance Score for FIFO, No priority, Algorithm 1 and Algorithm 2. Bar labels as extracted: 633, 1813, 652, 55.]

Algorithm 2 has the best performance because it balances the workload between tasks of different priority and is designed to capture the diminishing return of performance (reducing execution time by 1 second is more significant for a program that usually takes 1 s than for a program that usually takes 5 s).

Monte Carlo Simulation

The speedup diagram is shown below. The trading platform achieves significant speedup on CPU-intensive tasks. Peak performance occurs when using 4 threads on a single core.
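The multi-thread decomposition of a Monte Carlo run can be sketched as follows: the paths are split evenly across workers, each worker uses its own random number generator, and the partial sums are combined at the end. The report does not specify the simulated model, so the geometric Brownian motion terminal price and every parameter name here are assumptions chosen only for illustration. In CPython, threads will not speed up this pure-Python loop; the sketch shows the partitioning, not the measured speedup.

```python
import math
import random
from concurrent.futures import ThreadPoolExecutor


def simulate_part(args):
    """Sum of simulated terminal prices for one worker's share of paths."""
    seed, n_paths, s0, mu, sigma, horizon = args
    rng = random.Random(seed)  # private generator, so workers do not share state
    drift = (mu - 0.5 * sigma ** 2) * horizon
    vol = sigma * math.sqrt(horizon)
    total = 0.0
    for _ in range(n_paths):
        total += s0 * math.exp(drift + vol * rng.gauss(0.0, 1.0))
    return total


def mean_terminal_price(n_paths, n_workers, s0=100.0, mu=0.05,
                        sigma=0.2, horizon=1.0, seed=0):
    """Estimate the mean terminal price by splitting paths across workers."""
    base, extra = divmod(n_paths, n_workers)
    parts = [
        (seed + i, base + (1 if i < extra else 0), s0, mu, sigma, horizon)
        for i in range(n_workers)
    ]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        totals = list(pool.map(simulate_part, parts))  # results keep input order
    return sum(totals) / n_paths
```

Giving each worker a distinct seed keeps the run reproducible for a fixed worker count, and summing per-worker totals means the only shared step is one small reduction at the end.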

[Figure: Monte Carlo simulation speedup for Seq and 1, 2, 4, 8 and 16 threads.]

Indicators Computation

The CUDA implementation achieves a 1.7 times speedup over the sequential implementation. The CPU-based implementations did not achieve significant speedup because the bottleneck is memory access. The GPU, however, helps hide latency and provides higher memory bandwidth, so the CUDA implementation outperforms the CPU implementations.

[Figure: Indicators computation speedup for Seq, 1 thread, 4 threads and GPU. Bar labels as extracted: 1, 1.4, 1.5, 1.7.]
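To see why indicator computation is bound by memory access rather than arithmetic, consider single-pass versions of two of the supported indicators. This is an illustrative sketch, not the project's CPU or CUDA code; the function names are hypothetical, and the exponential moving average uses the common conventions of a smoothing factor 2 / (span + 1) and seeding with the first price, which are assumptions.

```python
def simple_moving_average(prices, window):
    """Trailing simple moving average with one running sum (O(n) work)."""
    out = []
    running = 0.0
    for i, p in enumerate(prices):
        running += p
        if i >= window:
            running -= prices[i - window]  # drop the value leaving the window
        if i >= window - 1:
            out.append(running / window)
    return out


def exponential_moving_average(prices, span):
    """Exponential moving average, seeded with the first price (assumption)."""
    alpha = 2.0 / (span + 1)
    out = []
    ema = None
    for p in prices:
        ema = p if ema is None else alpha * p + (1 - alpha) * ema
        out.append(ema)
    return out


sma = simple_moving_average([1.0, 2.0, 3.0, 4.0, 5.0], window=3)
ema = exponential_moving_average([1.0, 2.0, 3.0], span=3)
```

Each price is read a constant number of times and only a few additions and multiplications are performed per element, so run time is dominated by how fast the price series can be streamed from memory.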