MONTE-CARLO SIMULATION OF AMERICAN OPTIONS WITH GPUS. Julien Demouth, NVIDIA




STAC-A2 BENCHMARK
- STAC-A2 Benchmark
  - Developed by banks
  - Macro and micro, performance and accuracy
  - Pricing and Greeks for an American-exercise basket option, correlated Heston dynamics, Longstaff-Schwartz Monte Carlo
  - Independently audited results
- GPU Solution
  - Over 9x the average speed of a system with the same class of CPUs but no GPUs
  - The first system to handle the baseline problem size in real time (less than a second)
- Please see http://www.stacresearch.com/a2 for more details of the STAC-A2 Benchmark
- Also see http://devblogs.nvidia.com/parallelforall/american-option-pricing-monte-carlo-simulation for more details on Longstaff-Schwartz Monte Carlo on GPUs

AMERICAN OPTIONS
- American put option on a stock
  - Alice buys a put option on a stock from Bob
  - Strike price K, time to expiry T
  - Between now and time T, Alice can sell the stock to Bob at the price K
- Is today the right day to sell? How long should Alice wait?
- The option pays off if K is higher than the stock price S:
    payoff = max(K - S[i], 0)

LONGSTAFF-SCHWARTZ ALGORITHM
- Generate random prices for the stock
  - Split the time to expiry T into N time steps: t0, t1, t2, ...
  - Use M independent paths
- Different schemes to generate the stock prices:
  - Euler scheme
  - Andersen QE (used in STAC-A2)
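The path-generation step above can be sketched on the CPU. This is an illustrative Python/NumPy sketch of the Euler scheme, assuming Black-Scholes (geometric Brownian motion) dynamics; the parameter values (S0, r, sigma) are hypothetical, not from the deck:

```python
import numpy as np

def euler_paths(S0, r, sigma, T, num_steps, num_paths, seed=0):
    """Euler discretization of GBM:
    S[t+dt] = S[t] * (1 + r*dt + sigma*sqrt(dt)*Z), Z ~ N(0, 1)."""
    rng = np.random.default_rng(seed)
    dt = T / num_steps
    S = np.empty((num_steps + 1, num_paths))
    S[0] = S0
    for t in range(num_steps):
        Z = rng.standard_normal(num_paths)
        S[t + 1] = S[t] * (1.0 + r * dt + sigma * np.sqrt(dt) * Z)
    return S

# One row per time step, one column per independent path.
paths = euler_paths(S0=100.0, r=0.05, sigma=0.2, T=1.0,
                    num_steps=100, num_paths=10000)
```

On the GPU, the same recurrence runs with one thread per path, so the inner `Z` draw and update are fully independent across threads.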

LONGSTAFF-SCHWARTZ ALGORITHM
- Compute the payoff at time T along each path
- Walk back in time: for( int ti = T-1 ; ti > 0 ; --ti )
- For each time step ti:
  - Fit a model to predict the payoffs at ti+1 from the stock prices at ti,
    using the payoffs and stock prices from all the paths (in the money)
  - For each path:
    - Predict the payoff for the path
    - Decide whether to exercise or continue on that path
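The backward walk above can be sketched end-to-end on the CPU. This is a minimal Python/NumPy sketch, not the STAC-A2 implementation: it uses exact GBM steps for the paths and NumPy's polynomial fit for the quadratic regression, and all parameter values are illustrative:

```python
import numpy as np

def american_put_lsm(S, K, r, dt):
    """Longstaff-Schwartz backward walk. S has shape (num_steps+1, num_paths)."""
    num_steps = S.shape[0] - 1
    cashflow = np.maximum(K - S[-1], 0.0)        # payoff at expiry
    for ti in range(num_steps - 1, 0, -1):       # walk back in time
        cashflow *= np.exp(-r * dt)              # discount one step
        itm = (K - S[ti]) > 0.0                  # fit only in-the-money paths
        if itm.sum() < 3:
            continue
        x = S[ti][itm]
        # Fit beta[0] + beta[1]*x + beta[2]*x^2 to the discounted cashflows.
        beta = np.polynomial.polynomial.polyfit(x, cashflow[itm], 2)
        continuation = np.polynomial.polynomial.polyval(x, beta)
        exercise = K - x
        # Exercise where the immediate payoff beats the predicted continuation.
        better = exercise > continuation
        idx = np.where(itm)[0][better]
        cashflow[idx] = exercise[better]
    return np.exp(-r * dt) * cashflow.mean()

# Exact GBM paths (illustrative parameters).
rng = np.random.default_rng(42)
S0, K, r, sigma, T, N, M = 100.0, 100.0, 0.05, 0.2, 1.0, 50, 20000
dt = T / N
Z = rng.standard_normal((N, M))
logsteps = (r - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * Z
S = np.vstack([np.full((1, M), S0), S0 * np.exp(np.cumsum(logsteps, axis=0))])

price = american_put_lsm(S, K, r, dt)
```

With these parameters the estimate lands near the known value of the at-the-money American put (a little above the European price of about 5.57).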

LINEAR REGRESSION
- 1/ Find the coefficients (beta) defining the curve
- 2/ Predict the payoff from the stock price
[Figure: payoff plotted against stock price, with the fitted curve]

LINEAR REGRESSION
- Linear model to fit: beta[0] + beta[1]*x + beta[2]*x^2
- We want to find the beta which minimizes ||A*beta - P||^2
- A is the matrix of powers of stock prices, P the vector of payoffs:

        | 1  S[0]  S[0]^2 |          | max(K-S[0], 0) |
    A = | 1  S[1]  S[1]^2 |      P = | max(K-S[1], 0) |
        | 1  S[2]  S[2]^2 |          | max(K-S[2], 0) |

LINEAR REGRESSION
- We use the Singular Value Decomposition (SVD)
    A = U * Sigma * V^T
- To build the (Moore-Penrose) pseudoinverse
    V * Sigma^-1 * U^T
- And compute
    beta = V * Sigma^-1 * U^T * P
- How can we efficiently build the pseudoinverse?
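The SVD route above can be checked numerically. A Python/NumPy sketch (the deck's actual implementation is in CUDA; the stock prices here are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 100.0
S = rng.uniform(80.0, 100.0, size=1000)           # in-the-money stock prices
P = np.maximum(K - S, 0.0)                        # payoffs
A = np.column_stack([np.ones_like(S), S, S * S])  # powers of stock prices

# A = U * Sigma * V^T, then beta = V * Sigma^-1 * U^T * P.
U, sigma, Vt = np.linalg.svd(A, full_matrices=False)
beta = Vt.T @ ((U.T @ P) / sigma)
```

The result matches a direct least-squares solve; the SVD route is preferred over the normal equations (the NEQ variant benchmarked later) because it stays robust when A is ill-conditioned.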

KEY DESIGN POINTS
- Expose as much parallelism as possible
- Eliminate unneeded synchronization points
  - E.g. move computations outside of the main loop
- Inside kernels, maximize the amount of independent work
  - E.g. threads do sequential work in parallel before a parallel reduction
- Reduce memory transfers to a minimum
- Have coalesced memory accesses
  - E.g. map one thread per Monte Carlo path
- Recompute rather than store intermediate results
  - E.g. do not store the square of S[i]

BUILD THE PSEUDOINVERSE
- Each A is a long-and-thin matrix with 32,000 rows x 3 columns
  - One matrix A per time step
- It takes too much time and space to compute the SVD of A as-is
- A well-known approach:
  - Build the QR decomposition of A: A = Q*R
  - R is much smaller; compute the SVD of R to build the SVD of A:
      R = UR * SigmaR * VR^T  =>  A = Q * UR * SigmaR * VR^T
  - Since R is 3x3, we can compute its SVD on a multiprocessor
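The factorization chain above can be verified numerically: because Q has orthonormal columns, the singular values of A are exactly those of the small R. A Python/NumPy sketch with hypothetical data of the stated shape:

```python
import numpy as np

rng = np.random.default_rng(0)
S = rng.uniform(50.0, 150.0, size=32000)
A = np.column_stack([np.ones_like(S), S, S * S])  # 32,000 x 3

# Thin QR: A = Q * R, with R only 3x3.
Q, R = np.linalg.qr(A)

# SVD of the small R instead of the large A.
UR, sigmaR, VRt = np.linalg.svd(R)

# (Q @ UR) * SigmaR * VR^T is then an SVD of A.
A_rebuilt = (Q @ UR) @ np.diag(sigmaR) @ VRt
```

So only a 3x3 SVD is ever needed, which is why it fits comfortably on one multiprocessor.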

COMPUTE THE QR DECOMPOSITION
- Householder-based algorithm to build the QR decomposition
  - 3x dot products over ~32,000 elements
  - 3x 32,000x32,000 rank updates
- There are too many memory accesses!!!
- Our solution: R can be built using 8 scalars (see the code):
    s0, s1, s2, Sum si^0, Sum si^1, Sum si^2, Sum si^3, Sum si^4
  where si is the stock price on the i-th path which pays off
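The power-sum trick works because R^T R = A^T A, and A^T A is exactly the 3x3 Gram matrix built from those sums, so R can be recovered from a tiny Cholesky factorization instead of streaming the 32,000-row A through a Householder QR. A Python/NumPy sketch of this idea (hypothetical data; the real kernel accumulates the sums in parallel):

```python
import numpy as np

rng = np.random.default_rng(0)
s = rng.uniform(50.0, 100.0, size=32000)   # prices on the paths which pay off

# Accumulate the power sums (Sum si^0 is just the count m).
m = float(s.size)
p1, p2, p3, p4 = (float(np.sum(s ** k)) for k in range(1, 5))

# A^T A is the Gram matrix built from these sums alone.
G = np.array([[m,  p1, p2],
              [p1, p2, p3],
              [p2, p3, p4]])

# R (up to row signs) is the upper-triangular Cholesky factor of A^T A.
R = np.linalg.cholesky(G).T
```

Accumulating five sums is one coalesced pass over the prices, which is what removes the "too many memory accesses" problem above.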

COMPUTE THE QR DECOMPOSITION
- During the main loop, Q can be built on-the-fly using A and R:
    Q = A * R^-1
- In summary, we build all the W matrices before the main loop:
    W = VR * SigmaR^-1 * UR^T
  - Each CUDA block computes a different W
- At each iteration of the main loop, we compute beta as
    beta = W * (R^-1)^T * A^T * P
  where (R^-1)^T * A^T = Q^T, so W * Q^T is the pseudoinverse applied to P
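The factored form of beta above can be sanity-checked against a direct least-squares solve. A Python/NumPy sketch (all data hypothetical; prices are normalized to a small range purely to keep the numerical check clean):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 1.0
S = rng.uniform(0.2, 1.0, size=1000)              # normalized stock prices
P = np.maximum(K - S, 0.0)                        # payoffs
A = np.column_stack([np.ones_like(S), S, S * S])

# Built once per time step, before the main loop:
Q, R = np.linalg.qr(A)
UR, sigmaR, VRt = np.linalg.svd(R)
W = VRt.T @ np.diag(1.0 / sigmaR) @ UR.T          # W = VR * SigmaR^-1 * UR^T

# Inside the main loop: Q^T = (R^-1)^T * A^T, so
beta = W @ (np.linalg.inv(R).T @ (A.T @ P))
```

Keeping only W and R per time step means the main loop never touches Q (32,000 x 3): it rebuilds the needed products from A on the fly.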

BUILD THE PSEUDO-INVERSE
- Before the main loop, we build W (one block per time step)

    int m = 0;
    double4 sums = {0.0};
    // Iterate over the paths. Each thread computes its own partial sums.
    for( int path = threadIdx.x ; path < num_paths ; path += THREADS_PER_BLOCK )
    {
      // Load the asset price.
      double S = paths[offset + path];
      // Update the sums if the path pays off.
      if( payoff.is_in_the_money(S) )
      {
        ++m;
        double S2 = S*S;
        sums.x += S;
        sums.y += S2;
        sums.z += S2*S;
        sums.w += S2*S2;
      }
    }
    m    = cub::BlockReduce<...>(...).Sum(m);
    sums = cub::BlockReduce<...>(...).Sum(sums);
    // Build and store W. See the code.

MAIN LOOP
- W is a 3x3 matrix and R has only 6 non-zero values
- We map one CUDA thread per path (or more)

    // Load W for the block.
    if( threadIdx.x < 15 )
      smem_w[threadIdx.x] = W[threadIdx.x];
    __syncthreads();

    double3 beta = {0.0};
    // Each thread computes a partial sum of beta.
    // Iterate over the paths.
    for( int path = tidx ; path < num_paths ; path += blockDim.x*gridDim.x )
    {
      double S  = stock[path];
      double S2 = S*S;
      // Rebuild A on the fly
      ...
    }
    ...
    // Update beta. No global memory access!!!
    beta = cub::BlockReduce<...>(...).Sum(beta); // Parallel reduction
    ...
    // Store beta

PERFORMANCE RESULTS
- Tesla K40 (875MHz, 3004MHz), runtime in milliseconds (lower is better):

                          32K paths   64K paths   96K paths
    100 timesteps             5.56        7.242       8.793
    100 timesteps (NEQ)       5.375       7.093       8.661
    200 timesteps             9.268      12.648      15.438
    200 timesteps (NEQ)       8.493      12.18       14.99

- NEQ: linear regression using the Normal Equation
- Timings include the generation of paths

PERFORMANCE RESULTS
[Figure: runtime breakdown (0-100%) per kernel for each configuration (32K/100, 64K/100, 96K/100, 32K/200, 64K/200, 96K/200). Kernels: compute_final_sum_kernel, compute_partial_sums_kernel, update_cashflow_kernel, compute_partial_beta_kernel, prepare_svd_kernel, generate_paths_kernel, gen_sequenced, generate_seed_pseudo_mrg]

PERFORMANCE RESULTS
- The importance of the main loop (update_cashflow/compute_partial_beta) increases with the number of time steps
- At 32K/64K paths, the two kernels are limited by latency
  - High impact of instruction and constant cache misses
  - The loop is impacted by the launch latency of kernels (for #paths <= 32K)
  - See how to reduce the impact in the companion code (#define WITH_FUSED_BETA)
- On 32K/64K paths, we have a limited number of CUDA blocks
  - Tail effects (load balancing is not optimal)
  - We need more paths, or to work on several problems in parallel
    - Idea: use several CPU threads and CUDA streams
  - Keep it in mind when you design your infrastructure

CONCLUSION
- GPUs are good at American option pricing
  - See our STAC-A2 results (compared to high-end CPUs)
- Robust algorithms like the SVD can be implemented
- Our blog post: http://devblogs.nvidia.com/parallelforall/american-option-pricing-monte-carlo-simulation/
- The companion code: https://github.com/parallel-forall/code-samples/tree/master/posts/american-options
- Our STAC-A2 results: http://www.stacresearch.com/nvidia/4dec13