Adjoint Algorithmic Differentiation of a GPU Accelerated Application
Jacques du Toit, Johannes Lotz and Uwe Naumann

Abstract. We consider a GPU accelerated program using Monte Carlo simulation to price a basket call option on 10 FX rates driven by a 10 factor local volatility model. We develop an adjoint version of this program using algorithmic differentiation. The code uses mixed precision. For our test problem of 10,000 sample paths with 360 Euler time steps, we obtain a runtime of 522ms to compute the gradient of the price with respect to the 438 input parameters, the vast majority of which are the market observed implied volatilities (the equivalent single threaded tangent-linear code on a CPU takes 2hrs).

1 Introduction

In a financial setting, risk is more or less synonymous with sensitivity calculations. The foundational results of mathematical finance show how to hedge any contingent claim by using a dynamic trading strategy based on mathematical derivatives of the claim with respect to the risk factors. These mathematical derivatives (sensitivities) therefore show us not only where our exposures are most sensitive with respect to movements in the risk factors, but also tell us how (in a perfect world) we can eliminate those exposures completely. The most common way of computing sensitivities is by finite differences. Through a simple bump and revalue approach one can get estimates of any mathematical derivative. There are two problems, however: firstly, those estimates (especially for higher order derivatives) can be arbitrarily bad if one does not have a good feel for the underlying surface; and secondly, bump and revalue is a costly computational exercise. Algorithmic differentiation has become an attractive alternative to finite differences for many applications. For a thorough introduction to algorithmic differentiation (hereafter simply AD) the reader is referred to [1], but the idea is very simple.
At the most basic level, computers can only add, subtract, multiply and divide two floating point numbers. It is elementary to compute the derivatives of these fundamental operations. A computer program implementing a mathematical formula or model consists of many of these fundamental operations strung together. One can therefore use the chain rule, together with the derivatives of these operations, to compute the mathematical derivative of the computer program with respect to some input variables. Furthermore, considering programming language features such as templates and operator overloading, one begins to see how a software tool could be written to perform AD automatically in an efficient and non-intrusive manner.

(Numerical Algorithms Group; LuFG Informatik 12, RWTH Aachen University, Germany)
The main benefit of AD is exact mathematical derivatives of a computer program. When this is coupled with adjoint methods (see Section 4), a substantial increase in computational efficiency can often be achieved. We refer to [1] for a full discussion of these matters. A word of caution: AD computes exact mathematical derivatives of the given computer program. This presumes two things: first, that the given computer program is continuously differentiable; and second, that the derivative of the computer program bears some resemblance to the derivative of the mathematical formula or algorithm that it implements. These two conditions are often not satisfied. Indeed for certain variables such as interest rates there is a smallest unit of measurement (the basis point), which leads some practitioners to prefer finite differences when computing sensitivities with respect to these variables. It is important to consider a computer program carefully when deciding whether AD is an appropriate technique. The main purpose of this article is to demonstrate that it is possible to apply AD, and in particular adjoint AD, to a non-trivial GPU (CUDA) accelerated program. The program consists of three stages: a complex initial setup stage, a compute intensive massively parallel stage, and a post-processing stage. The compute intensive stage is executed as a GPU kernel while the rest of the code is kept on the CPU. By combining an AD tool (dco) with a hand-written adjoint of the GPU kernel, the entire program can be differentiated with respect to the over 400 input variables.

2 Local volatility FX basket option

The problem considered is a foreign exchange (FX) basket call option driven by a multifactor local volatility model. This is the same problem presented in [2] and we refer the reader there for further details. We will outline the problem briefly here.
A basket call option consists of a weighted sum of N correlated assets and has a price given by

$$C = e^{-r_d T}\,\mathbb{E}\big[(B_T - K)^+\big] \tag{1}$$

where $r_d$ is the domestic risk free interest rate, $K > 0$ is the strike price, and $B_T$ is given by

$$B_T = \sum_{i=1}^{N} w^{(i)} S_T^{(i)}. \tag{2}$$

Here $S^{(i)}$ denotes the value of the $i$th underlying asset (FX rate) for $i = 1,\dots,N$ and the $w^{(i)}$s are a set of weights summing to one. Each underlying asset is driven by the stochastic differential equation (SDE)

$$\frac{dS_t^{(i)}}{S_t^{(i)}} = \big(r_d - r_f^{(i)}\big)\,dt + \sigma^{(i)}\big(S_t^{(i)}, t\big)\,dW_t^{(i)} \tag{3}$$

where $r_f^{(i)}$ is the foreign risk free interest rate for the $i$th currency pair, $(W_t)_{t \ge 0}$ is a correlated $N$-dimensional Brownian motion with $\langle W^{(i)}, W^{(j)} \rangle_t = \rho^{(i,j)} t$, and $\rho^{(i,j)}$ denotes the correlation coefficient for $i,j = 1,\dots,N$. The function $\sigma^{(i)}$ in (3) is unknown and is calibrated from market implied volatility data according to the Dupire formula

$$\sigma^2(K,T) = \frac{\theta^2 + 2T\theta\theta_T + 2\big(r_d - r_f\big)KT\theta\theta_K}{\big(1 + K d_+ \sqrt{T}\,\theta_K\big)^2 + K^2 T\theta\big(\theta_{KK} - d_+ \sqrt{T}\,\theta_K^2\big)} \tag{4}$$
where $\theta$ denotes the market observed implied volatility surface and

$$d_+ = \frac{\ln\big(S_0^{(i)}/K\big) + \big(r_d - r_f + \tfrac{1}{2}\theta^2\big)T}{\theta\sqrt{T}}. \tag{5}$$

The basket call option price (1) is evaluated using Monte Carlo simulation. The inputs to this model are therefore the domestic and foreign risk free rates, the spot values of the FX rates, the correlation matrix, the maturity and strike of the option, the weights, and the market observed implied volatility surface for each of the underlying FX rates. The task is to compute the sensitivity of the price with respect to each of these variables.

3 The GPU accelerated computer program (pre-AD)

Let us examine the computer program evaluating (1) before any AD is introduced. As mentioned previously, the program is divided into three stages.

3.1 Stage 1: setup

The Dupire formula (4) implies there are infinitely many implied volatility quotes. However the market only trades a handful of call options at different strikes and maturities. One therefore has to create a sufficiently smooth function $\theta$ from this fairly small set of discrete quotes which can be used in (4). Doing this in an arbitrage-free manner is not easy and has been a topic of much research. A common approach taken by market practitioners is to ignore the finer points of arbitrage free interpolation and simply to use cubic splines. This is the approach that we have adopted. The first stage of the program takes the input implied volatility surface and produces a local volatility surface represented by a collection of one dimensional cubic splines. Fuller details of this procedure are given in [2] and here we just summarise the main steps to give a feel for the complexity. The FX market typically provides option quotes for five deltas per tenor. We fit a cubic spline in the strike direction through the 5 quotes at each tenor. For each tenor we then fit 5 piecewise monotonic Hermite polynomials from time 0 to time T, with each polynomial passing through one of the quotes.
Since each tenor's quotes are at different strikes, this requires interpolating values with the previously fitted cubic splines. For all the Monte Carlo time steps that lie between two tenors, we use the Hermite polynomials to interpolate implied volatility values at each of the 5 strikes. Then for each of the Monte Carlo time steps, cubic splines are fitted in the strike direction through the 5 implied volatility values just computed. These splines can be used to compute $\theta$, $\theta_K$ and $\theta_{KK}$ in (4) above. Essentially the same procedure can be used to compute $\theta_T$. Finally, these derivatives are substituted into (4) to produce local volatility points on a regular grid, and sets of cubic splines are fitted to these points. The slopes at the end points of these splines are projected to form linear extrapolation functions to cover the left and right (in the strike direction) tails of the local volatility surface.

3.2 Stage 2: GPU accelerated Monte Carlo

We use an Euler-Maruyama scheme on the log-process to solve (3). This is by far the most compute intensive part of the entire program and is executed on a GPU. To perform the Euler time stepping we use $n_T$ steps each of length $\Delta t$ so that $n_T \Delta t = T$, the maturity of the basket option. With $S_\tau^{(i)} \equiv S_{\tau\Delta t}^{(i)}$ for $\tau = 1,\dots,n_T$, the discretised version of (3) then
becomes

$$\log S_{\tau+1}^{(i)} - \log S_{\tau}^{(i)} = \Big(r_d - r_f^{(i)} - \tfrac{1}{2}\sigma^{(i)}\big(S_\tau^{(i)}, \tau\Delta t\big)^2\Big)\Delta t + \sigma^{(i)}\big(S_\tau^{(i)}, \tau\Delta t\big)\sqrt{\Delta t}\; d_i Z_\tau \tag{6}$$

where $Z_\tau$ is an $N \times 1$ column vector of standard Normal random numbers and $d_i$ is the $i$th row of the Cholesky factorisation of the $N \times N$ correlation matrix, in other words $d_{i_1} d_{i_2}^T = \rho^{(i_1,i_2)}$ for all $i_1, i_2 = 1,\dots,N$. The output of this stage is the final value $\log S_{n_T}^{(i)}(j)$ (a) for each of the $j$ Monte Carlo sample paths and each of the underlying assets $i = 1,\dots,N$.

3.3 Stage 3: payoff

The final stage consists of copying the terminal values of the Monte Carlo sample paths back to the CPU and computing the price (1). While the payoff could certainly be computed on the GPU (and indeed this was done in [2]), we deliberately carried out these final steps on the CPU in order to mimic financial applications where there is a non-trivial post-processing stage which has to be carried out on the CPU.

3.4 Implementation

The program described above is written in C++ and the GPU kernel is written in CUDA. Since the vast majority of the runtime is spent in the Monte Carlo kernel, the CPU C++ code was written in a natural manner with no special attention to optimisation. By contrast, quite some care was taken to produce efficient CUDA code. Probably the most important design decision was to load a single cubic spline for a Monte Carlo time step $\tau$ into shared memory and advance all sample paths from time $\tau$ to time $\tau+1$ before loading the cubic spline for the next time step. The spline is kept in "cache" for as long as it is needed, and as little cache as possible is consumed. This was done so that the code would work well with small as well as large cubic splines, thereby covering as many asset classes as possible.

4 Adjoint algorithmic differentiation

Adjoint methods, whether analytic or obtained through algorithmic differentiation, are sophisticated numerical techniques.
We can but give the briefest of introductions here; the reader is referred to [1] for a fuller exposition. AD is a technique for differentiating computer programs. There are two main modes (or models) of AD. Consider the case of a computer program implementing a function $f : \mathbb{R}^n \to \mathbb{R}$, namely $y = f(x)$. Take a vector $x^{(1)} \in \mathbb{R}^n$ and define the function $F^{(1)} : \mathbb{R}^{2n} \to \mathbb{R}$ by

$$y^{(1)} = F^{(1)}(x, x^{(1)}) = \nabla f(x) \cdot x^{(1)} = \frac{\partial f}{\partial x} \cdot x^{(1)} \tag{7}$$

where the dot denotes the regular vector dot product. The function $F^{(1)}$ is called the tangent-linear model of $f$ and is the simplest form of AD. Letting $x^{(1)}$ range over the Cartesian basis vectors in $\mathbb{R}^n$ and calling $F^{(1)}$ repeatedly, one obtains each of the partial derivatives $y^{(1)}$ of $f$ with respect to each of the input variables $x_i$ for $i = 1,\dots,n$. Hence in order to get the full gradient $\nabla f$ one must evaluate the tangent-linear model $n$ times, which means the runtime of the risk calculation will be roughly $n$ times the cost of computing $f$ (there are precise results in this direction).

(a) We will write $S_\tau^{(i)}(j)$ when we wish to highlight the role of sample paths, otherwise we suppress the $(j)$ to simplify notation.
Now consider the function $F_{(1)} : \mathbb{R}^{n+1} \to \mathbb{R}^n$ defined by

$$x_{(1)} = F_{(1)}(x, y_{(1)}) = y_{(1)} \nabla f(x) = y_{(1)} \frac{\partial f}{\partial x} \tag{8}$$

for any value $y_{(1)}$ in $\mathbb{R}$. The function $F_{(1)}$ is called the adjoint model of $f$. Setting $y_{(1)} = 1$ and calling the adjoint model $F_{(1)}$ once gives the full vector of partial derivatives of $f$. Furthermore, it can be proved that in general computing $F_{(1)}$ requires no more than five times as many flops (b) as computing $f$. Hence adjoints are extremely powerful, allowing one to obtain large gradients at potentially very low cost. Mathematically, adjoints are defined as partial derivatives of an auxiliary scalar variable $t$, so that in (8) above we have

$$y_{(1)} = \frac{\partial t}{\partial y}, \qquad x_{(1)} = \frac{\partial t}{\partial x}. \tag{9}$$

This has a subtle and significant implication for the adjoint model. Consider a computer program evaluating a function $f : \mathbb{R} \to \mathbb{R}$ which takes an input $x$ and by a series of simple operations (e.g. add, subtract, multiply, divide) propagates $x$ through a sequence of temporary variables to an output $y$:

$$x \to \alpha \to \beta \to \gamma \to y \tag{10}$$

where $\alpha, \beta, \gamma$ are temporary variables. Using the definition of adjoint (9), we can write

$$x_{(1)} = \frac{\partial t}{\partial x} = \frac{\partial t}{\partial \alpha}\frac{\partial \alpha}{\partial x} = \frac{\partial t}{\partial \beta}\frac{\partial \beta}{\partial \alpha}\frac{\partial \alpha}{\partial x} = \frac{\partial t}{\partial y}\frac{\partial y}{\partial \gamma}\frac{\partial \gamma}{\partial \beta}\frac{\partial \beta}{\partial \alpha}\frac{\partial \alpha}{\partial x} = y_{(1)}\frac{\partial y}{\partial \gamma}\frac{\partial \gamma}{\partial \beta}\frac{\partial \beta}{\partial \alpha}\frac{\partial \alpha}{\partial x} \tag{11}$$

which is nothing other than the adjoint model $F_{(1)}$ of $f$. Note however that $y_{(1)}$ is an input to the adjoint model. Hence to evaluate the adjoint model one starts with $y_{(1)}$ and then propagates it upwards through (11) in order to reach $x_{(1)}$, i.e. one has to compute

$$\bigg(\bigg(\Big(y_{(1)}\frac{\partial y}{\partial \gamma}\Big)\frac{\partial \gamma}{\partial \beta}\bigg)\frac{\partial \beta}{\partial \alpha}\bigg)\frac{\partial \alpha}{\partial x}. \tag{12}$$

This effectively means that the computer program implementing $f$ has to be run backwards. Since computing $\partial y / \partial \gamma$ might well require knowing $\gamma$ (in general one may need $\beta$ and $\alpha$ as well), it is clear that to run the program backwards one first has to run the program forwards and store all the intermediate values needed to calculate the partial derivatives. In general, adjoint codes can require a huge amount of storage to keep all the required intermediate calculations.

(b) Floating point operations.
5 dco

To implement adjoint AD there are two options: one can write the adjoint code by hand, or one can use a software tool which computes adjoints automatically. Writing adjoints
by hand is a tedious and error-prone task, and for large codes is simply too expensive. In this case, tools are the only option. Developers of AD tools have come up with two different approaches to generate adjoint code. Both approaches require the source code to be available and both approaches need to collect data during a first step (forward run), reusing it during a second step (reverse run) as shown above (see [1] for a much more rigorous discussion). The first approach is to use source code transformation tools which make a static program analysis and generate a new program which computes adjoints. The resulting code is usually quite efficient. The major drawback of this approach is language coverage: while source transformation tools may cover a large subset of simpler programming languages such as C, they typically have very limited, or no, coverage of more advanced languages such as C++. For C++ in particular a non-negligible, non-trivial code rewrite is typically required before the tool can understand the program. The second approach is to use operator overloading techniques to compute adjoints. This approach of necessity uses the language that the underlying program is written in and so has full language coverage. The basic idea is that one goes through the underlying program and replaces all floating point data types with a new specialized data type. For a simple C++ code this is roughly all that is required, but for real world applications further preprocessing may be required. The downside to overloading techniques is that the resulting code may suffer from varying drops in runtime performance. All AD tools built on operator overloading create a data structure during the forward run which keeps track of the dependencies from the inputs to the outputs. This data structure is typically called a tape. During the reverse run the tape is played back to compute the adjoint.
The performance of the tool is usually closely tied to how efficient this data structure is. Although the previous two approaches may seem mutually exclusive, for a language such as C++ there is in fact quite some overlap. The AD tool dco tries to exploit this. At its heart it is an operator overloading approach, but it uses efficiency-raising C++ features such as expression templates (see e.g. [4]) to exploit as much compile time optimization as possible. Additionally, dco provides an external function interface which is a user friendly way to integrate with itself any code that implements an adjoint model. This code would typically come either from a source transformation tool (the first approach outlined above) or would be hand written. This is the key feature we exploit for the GPU accelerated program considered here.

5.1 Performance and features

The typical performance measure for AD tools is given by the ratio of the time for performing one adjoint projection (i.e. computing the full set of first derivatives) to the time for one execution of the underlying program:

$$R = \frac{\text{Time for one adjoint projection}}{\text{Time for executing underlying program}} = \frac{\mathrm{time}(\nabla f)}{\mathrm{time}(f)}.$$

This is an efficiency measure and shows how much overhead is introduced by the AD tool in order to calculate the gradient. When dco is applied to real world programs, the ratio R typically varies between 5 and 15. For a given code, the best achievable R is highly dependent on the code efficiency (e.g. cache use) and the amount of effort a programmer is willing to invest to tune the adjoint code (dco has several run time and compile time options influencing memory use and run time). This is simply a trade-off between man hours and performance. Some of dco's key features are:

- Generic derivative types: the dco data types can be instantiated with arbitrary base types (e.g. double/float)
- Higher order derivatives can be computed
- External function interface: this allows
  - checkpointing (e.g. time stepping in PDE solution) to control memory use
  - user-defined intrinsics (e.g. smoothed pay-offs)
  - low-memory adjoint ensembles (e.g. Monte Carlo), allowing the programmer to exploit parallelism
  - supporting accelerators (e.g. GPUs), as is done in this paper
- NAG Library support: dco versions of any NAG Library routine can be made, allowing dco to treat the routine as an intrinsic

To summarise, in its simplest form dco performs adjoint AD by the following steps:

1. Replace floating point data types in the program with dco data types
2. Run the program forward, recording optimised versions of intermediate calculations in the tape
3. Set $y_{(1)} = 1$ and play the tape back, effectively running the whole calculation backwards and computing the gradient

5.2 Adjoint AD on GPU accelerators

Readers familiar with GPU programming will know that the hardware constraints of these platforms are quite different from traditional CPU programming. In particular, there is a rather small pool of memory (around 6GB) shared by a relatively large number of threads (in a typical CUDA kernel one may have upwards of 40,000 threads). In addition, L1 and L2 caches are rather small. This means that memory use is of primary importance when programming GPUs. Unfortunately adjoint AD methods can require prohibitive amounts of memory. Moreover there are currently no AD tools which support CUDA or OpenCL: all GPU adjoint codes are hand-written (CUDA support for dco in tangent-linear mode is in development). For this reason, there are few examples of non-trivial adjoint codes running on GPUs. Mike Giles's LIBOR market model code (available on his website, see [3]) is the only one in the public domain of which we are aware.

6 Implementing adjoint AD for the FX basket option

We use dco for the CPU code, namely stages 1 and 3 in Section 3 above.
At the various points where the code calls into the NAG C Library (e.g. for the interpolation routines), we use the dco-enabled versions of those routines. Since dco in adjoint mode does not support CUDA, something different has to be done to handle the GPU Monte Carlo kernel. Here we make use of dco's external function interface. This effectively allows one to insert gaps in dco's tape and to provide external functions which fill those gaps. When running the code forward, the Monte Carlo kernel becomes an external function. Stage 1 is computed with dco data types. At the start of stage 2 one extracts the values of the inputs (local volatility spline knots and coefficients, Cholesky factorisation of the correlation matrix, domestic and foreign interest rates, etc.) from the dco data types, copies them to the GPU, launches the kernel and copies the results back from the GPU to the CPU. The results are then inserted into new dco data types and the calculation proceeds to stage 3 as normal.
When the tape is played back, a function must be provided which implements the adjoint model of the Monte Carlo kernel in stage 2. Since there is no AD tool support for CUDA, this function has to be written by hand. The Monte Carlo kernel is massively parallel. Therefore the adjoint model of it is also massively parallel (each thread computes the adjoint of one or more sample paths independently) and is well suited to GPU acceleration provided the memory use can be constrained. This adjoint GPU kernel is therefore run when the tape reaches stage 2: the input adjoints are extracted from the dco tape and copied to the GPU, the kernel is launched, the output adjoints are copied to the CPU and inserted into the dco tape, and the calculation continues as normal.

Figure 1: The control flow for the adjoint program. In stages 1 and 3, dco data types record to a tape. The tape has a gap for stage 2. To compute the adjoint the tape is played back, and the hand-written adjoint kernel fills in the gap left by stage 2.

While it may sound daunting to write an adjoint code by hand, we found that the key was to break the Monte Carlo kernel up into functions. Adjoint models of each of these functions can then be made, each of which is relatively simple, and the individual adjoint functions can then be strung together to make the adjoint kernel. The most difficult part was the cubic spline evaluation function (the GPU version of function e02bbc in the NAG C Library), which consists of about 50 lines of C code. However once the adjoint model of this function has been made it becomes a library
component and is available for use in multiple projects, since the code for evaluating a cubic spline does not change.

6.1 Controlling memory use

As mentioned in Section 4, adjoint AD codes can use prohibitive amounts of memory to store all the intermediate calculations needed to run the program backwards. For a Monte Carlo code this can easily amount to tens or hundreds of gigabytes if some care is not taken. For GPUs it is a general rule of thumb that memory is expensive and flops are free, which suggests one should only store what is essential and recompute everything else. Consider (6) above, suppress the superscript $(i)$ for ease of notation and suppose that the calculation has to be run backwards from $\tau = n_T$ to $\tau = n_T - 1$ to compute adjoints. To get some feel for what this involves, let us apply (9) to (6) in a simplified form and suppose that $y_{(1)}$ is the adjoint of $\log(S_{n_T})$ and $x_{(1)}$ is the adjoint of $\log(S_{n_T-1})$. (In reality $x_{(1)}$ is a vector of the adjoints of $r_d$, $r_f^{(i)}$, the cubic spline data, $\log(S_{n_T-1})$ etc., but here we consider only one output for clarity.) Then

$$x_{(1)} = \frac{\partial t}{\partial \log(S_{n_T-1})} = \frac{\partial t}{\partial \log(S_{n_T})} \frac{\partial \log(S_{n_T})}{\partial \log(S_{n_T-1})} = y_{(1)} \left( 1 + \frac{\partial \sigma(S_{n_T-1}, T - \Delta t)}{\partial \log(S_{n_T-1})} \Big( -\sigma(S_{n_T-1}, T - \Delta t)\,\Delta t + \sqrt{\Delta t}\, dZ_{n_T} \Big) \right).$$

From this it is clear that one cannot perform the adjoint calculation without knowing the value of $\log(S_{n_T-1})$, since $\sigma(x,t)$ is not invertible. Moreover, knowing this value, all other values can be computed (indeed this is how the original Monte Carlo kernel works, stepping forwards in time). Therefore to compute the adjoint model all that has to be stored are the values $\log S_\tau^{(i)}(j)$ for all assets $i = 1,\dots,N$, all time steps $\tau = 1,\dots,n_T$ and all Monte Carlo sample paths $j$. With these values known, all other values can be recomputed. In our implementation we chose to store the local volatilities $\sigma^{(i)}\big(S_\tau^{(i)}(j), \tau\Delta t\big)$ as well, since the spline calculations are rather expensive.
All in all, the test problem we considered (see Section 7 below) uses around 420MB of GPU memory.

6.2 Dealing with race conditions and mixed precision

Any parallelised adjoint model of a Monte Carlo kernel will have race conditions. This can easily be seen from (6) above: if the adjoint model is computed for two sample paths $j$ and $j+1$ concurrently, then at some point both threads will be using the input adjoints

$$\log\big(S_{\tau+1}^{(i)}(j)\big)_{(1)} \quad \text{and} \quad \log\big(S_{\tau+1}^{(i)}(j+1)\big)_{(1)} \tag{13}$$

to update the adjoint of the domestic interest rate $r_{d(1)}$. If this is not done in a thread safe way, data corruption and/or incorrect results will follow. Of course, in this case each thread can simply have its own copy of $r_{d(1)}$ and at the end of the kernel these can be accumulated in a thread safe way with no difficulty. The local volatility model, however, has a more unpleasant race condition when updating the adjoints of the spline data. Recall how cubic splines are evaluated. At its simplest, a cubic spline consists of a set of knots $\lambda_1 < \lambda_2 < \dots < \lambda_{n_\lambda}$ and a set of coefficients $c_1, c_2, \dots, c_{n_c}$. To get the value of the spline at a point $x$ (in (3) this is $\log(S_\tau^{(i)})$, which is essentially random) one must first find the index $k$ such that $\lambda_k \le x < \lambda_{k+1}$. The spline value $y$ is then a function $f$ of several $\lambda$s and several $c$s selected by the index $k$, so that

$$y = f(x, \lambda_k, \lambda_{k+1}, \dots, \lambda_{k+k_1}, c_k, c_{k+1}, \dots, c_{k+k_2}) \tag{14}$$
Suppose that $k_1 = k_2 = 3$. From (8) the adjoint model of (14) is

$$\big(x, \lambda_k, \lambda_{k+1}, \lambda_{k+2}, c_k, c_{k+1}, c_{k+2}\big)_{(1)} = y_{(1)} \nabla f(x, \lambda_k, \lambda_{k+1}, \lambda_{k+2}, c_k, c_{k+1}, c_{k+2}) \tag{15}$$

If at the same time another thread is performing the same calculation for some $\lambda_{k+1} \le x' < \lambda_{k+2}$, then both threads could try to update the adjoints $\lambda_{k+1(1)}, \lambda_{k+2(1)}, c_{k+1(1)}$ and $c_{k+2(1)}$ simultaneously. Moreover, $x$ and $x'$ are random: there is no way to predict when these race conditions will occur. One way around this is to give each thread its own copy of the adjoints of the spline data. However for a GPU kernel with 40,000 threads this could require tens of gigabytes of memory, which is totally impractical. Another approach would be to give each thread its own copy in shared memory and then have each thread block synchronise its threads to combine the results in a safe way. The problem here is that one rapidly runs out of shared memory, or at best ends up with only one thread block per GPU streaming multiprocessor, which gives poor performance. A third option is to give each thread block its own copy of the data in shared memory and then try to do sophisticated inter-thread synchronisation to avoid race conditions. This would create a complicated code which would probably not perform very well. A fourth option is to use hardware atomic operations (d) to update the adjoints in a thread safe way. This approach works, but is rather slow (at least 4 times slower than our implementation). The solution we chose is to do the GPU calculations in single precision and use a combination of a few atomic operations (purely for ease of programming; the algorithm can be implemented without them) and private arrays for active threads to compute the adjoints. (Note: the CPU calculations for stages 1 and 3 are still done in double precision.) In [2] we showed the impact on performance and accuracy of single, double and mixed precision Monte Carlo code on GPUs.
Switching from double to single precision halved the runtime of the application considered in this paper and gave results which were within single precision tolerances. In particular, the standard deviation of the Monte Carlo estimate was orders of magnitude greater than the threshold of single precision accuracy (roughly $10^{-6}$). As always, care must be taken when working in single precision, especially when calculating things like adjoints where derivative contributions are continually accumulated (added to a variable). In our adjoint kernel we ended up computing $r_{d(1)}$ in mixed precision since the numerical value was rather small and unacceptable round-off errors occurred in single precision. We emphasise that the choice to perform the adjoint calculation in single precision was purely for performance. The calculation could also be done in double precision on the GPU.

(d) Atomic operations are guaranteed to be thread safe: if two threads execute an atomic add on the same variable, the variable is guaranteed to contain the correct result.

A word of caution (nil desperandum!)

The problematic race condition described above only arises when large shared arrays are used to update Monte Carlo sample paths from one time step to the next. This is typically a feature of local volatility-type models, but there are many financial models (e.g. Heston) where this does not happen. For these models there are usually a few scalar variables
which govern the SDE, and the race conditions which arise are easily handled by giving each thread its own copy of the adjoints of the input variables, and then doing standard parallel reductions to combine the results. This is efficient and quite easy to do.

7 Results

Results are presented in terms of runtimes and accuracy of the derivatives when compared with a full double precision version of the code. As mentioned before, all CPU calculations are done in double precision while the GPU calculations are predominantly in single precision, with one or two calculations in the adjoint kernel done in double precision.

7.1 Test problem

We considered a 1 year basket option on 10 currency pairs: EURUSD, AUDUSD, GBPUSD, USDCAD, USDJPY, USDBRL, USDINR, USDMYR, USDRUB and USDZAR, all with equal weighting. The spot rates were taken from market data as of Sept 2012 and the correlation matrix was estimated. The interest rates, both foreign and domestic, and the strike price were taken to be zero in the interest of simplicity. There are 5 FX quotes per maturity and 7 maturities per FX rate, which means in total there are 438 input parameters, and hence 438 derivatives to compute. In the Monte Carlo simulation we used 10,000 sample paths each with 360 time steps.

7.2 Hardware and compilers

The computing platform we used was as follows:

1. CPU: Intel Xeon E5-2670, an 8 core processor with 20MB L3 cache running at a base frequency of 2.6GHz but able to ramp up to 3.3GHz under computational loads. We used gcc version
2. GPU: NVIDIA K20X with 6GB RAM. The GPU clock rate is 700MHz, ECC is off and the memory bandwidth (as measured by the bandwidth test in the CUDA SDK) is around 200GB/s.
We used the CUDA toolkit

7.3 Runtimes and accuracy

For the problem outlined above the results are as follows:

- Overall runtime: 522.4ms
- Forward calculation: 367.0ms
  - GPU random number generator: 4.5ms
  - GPU Monte Carlo kernel: 14.5ms
- Computation of adjoints (backward calculation): 155.4ms
  - GPU adjoint Monte Carlo kernel: 85.1ms
- Memory copies to and from the GPU: 0.5ms

dco used 268MB of CPU memory to construct the tape and in total about 420MB of GPU memory was used.
For comparison, the reference CPU serial double precision program using dco in tangent-linear mode took over 2hrs to compute the 438 derivatives. Since the tangent-linear model has the same complexity as finite differences, one would expect a similar runtime for a finite difference calculation of the gradient. A plot and log-plot of the errors of the mixed precision derivatives compared with the reference double precision derivatives is given in Figure 2.

Figure 2: Plot and log-plot of the error of the gradient against the log of the gradient. The error is given by (16) and the logarithms are taken in base 10.

The x-axis is the log of the absolute value of the gradient and the error is defined as

$$\mathrm{err} = \begin{cases} \big|g_{\mathrm{ref}} - g_{\mathrm{mp}}\big| & \text{if } |g_{\mathrm{ref}}| \text{ is below a small threshold} \\[4pt] \dfrac{\max\big(|g_{\mathrm{ref}}|, |g_{\mathrm{mp}}|\big)}{\min\big(|g_{\mathrm{ref}}|, |g_{\mathrm{mp}}|\big)} - 1 & \text{otherwise} \end{cases} \tag{16}$$

where $g_{\mathrm{ref}}$ is the reference double precision gradient and $g_{\mathrm{mp}}$ is the gradient from the mixed precision code. Note that many of the gradients are quite small in absolute value (there are 103 values below the threshold) and for these values we measure accuracy as an absolute rather than a relative difference. Only 5 out of 438 relative errors are greater than $10^{-4}$ and only 16 are greater than. The accuracy is therefore pretty much what one would expect for a single precision code, especially seeing as most of the gradients have such small numerical values.

7.4 Scaling

We ran the same problem with 20,000 Monte Carlo sample paths and got the following results:

- Overall runtime: 611.6ms
- GPU Monte Carlo kernel: ms
- GPU adjoint kernel: 142.8ms
- dco used about 280MB of CPU memory for the tape

The GPU Monte Carlo kernel scales by a factor of 2 and the GPU adjoint kernel by a factor of 1.7, however the overall runtime scales by a smaller factor. This is due to the relatively expensive setup (stage 1) which accounts for around 66% of the runtime and is completely independent of the number of sample paths.
We then ran the problem with 10,000 sample paths but only 5 assets in the basket and obtained:

- Overall runtime: 283.4ms
- GPU Monte Carlo kernel: 6.6ms
- GPU adjoint kernel: 58.7ms
- dco used about 135MB of CPU memory for the tape

The runtimes now scale more or less by a factor of 2, apart from the adjoint kernel, which scales by a factor of 1.4. This is due to the synchronisation overhead of combining the adjoints in a thread-safe way.

8 Acknowledgments

The authors would like to thank NVIDIA for providing access to their PSG cluster.

References

[1] Naumann, U. (2012). The Art of Differentiating Computer Programs. SIAM.
[2] du Toit, J., Ehrlich, I. (2013). Local Volatility FX Basket Option on CPU and GPU. NAG Technical Report.
[3] Giles, M., Xiaoke, S. (2007). A Note on Using the NVIDIA 8800 GTX Graphics Card. Research Report.
[4] Vandevoorde, D., Josuttis, N. (2002). C++ Templates: The Complete Guide. Addison-Wesley.