GPU Accelerated Monte Carlo Simulations and Time Series Analysis




GPU Accelerated Monte Carlo Simulations and Time Series Analysis
Institute of Physics, Johannes Gutenberg-University of Mainz
Center for Polymer Studies, Department of Physics, Boston University
Artemis Capital Asset Management GmbH
preis@uni-mainz.de
With thanks to: Peter Virnau, Wolfgang Paul, and Johannes J. Schneider

GPGPU computing
[Figure: peak single-precision performance of NVIDIA GPUs (NV30, 2003; G80, 2006; GT200, 2008) compared with a 3.2 GHz Intel Harpertown CPU. Driving force: the computer game industry and its demand for realistic illustrations. Source: NVIDIA CUDA programming guide]

GPU device architecture
GeForce GTX 280:
- Global memory: 1024 MB
- Number of multiprocessors: 30
- Number of cores: 240
- Constant memory: 64 KB
- Shared memory (per multiprocessor): 16 KB
- Clock rate: 1.30 GHz

GPU device / Reference system
GeForce GTX 280: global memory 1024 MB, 30 multiprocessors, 240 cores, constant memory 64 KB, shared memory 16 KB, clock rate 1.30 GHz.
Reference CPU: Intel Core 2 Quad, clock rate 2.66 GHz, cache size 4096 KB.

C code with extensions

    __global__ void gpu_function(int n, float* a, float* b)
    {
        // Determine array element
        int i = threadIdx.x + blockIdx.x * blockDim.x;
        if (i < n) b[i] += a[i] * a[i];
    }

    __host__ void cpu_function()
    {
        int n = 128 * 128;
        int n_blocks = 128;
        int n_threads = 128;
        gpu_function<<<n_blocks, n_threads>>>(n, a, b);
        // Global barrier between GPU functions
        gpu_function<<<n_blocks/2, n_threads*2>>>(n, a, b);
    }

[Diagram: the grid of thread blocks (Block 0, Block 1, Block 2, ...), each block containing threads (Thread 0, Thread 1, Thread 2, ...)]

Linear congruential RNGs

    x_{i+1,j} = (a · x_{i,j} + c) mod m        with a = 1664525, c = 1013904223
    x_{0,j+1} = (16807 · x_{0,j}) mod m

On the 32-bit architecture provided by the GPU, x_{i,j} ∈ [−2^31, 2^31 − 1]; uniform deviates in [0, 1) are obtained as y_{i,j} = |x_{i,j}| / 2^31 ≈ 4.656612 · 10^−10 · |x_{i,j}|.

Computation times: random numbers
[Figure: time on GPU for allocation, memory transfer, and main function, total processing time on GPU, total processing time on CPU, and the speedup factor β = (total processing time on CPU) / (total processing time on GPU), plotted against the block number s.]

Ising model

    H = −J Σ_⟨i,j⟩ S_i S_j − H Σ_i S_i        (⟨i,j⟩: nearest neighbors)

Spin update with the Metropolis criterion:

    W(a → b) = exp(−ΔH / k_B T)   if ΔH > 0
    W(a → b) = 1                  if ΔH ≤ 0

2D Ising: GPU implementation
Noninteracting domains in which Monte Carlo moves are performed in parallel: the checkerboard algorithm.

Computation times: 2D Ising model
[Figure: time on GPU for allocation, memory transfer, and main function, total processing times on GPU and CPU, and the speedup factor, plotted against the block size n/2 from 2^4 to 2^9.]

Binder cumulant 2D

    U_4(T) = 1 − ⟨M(T)^4⟩ / (3 ⟨M(T)^2⟩^2)

[Figure: U_4 versus k_B T for n/2 = 16, 32, 64, 128, 256; the curves intersect at the critical temperature k_B T_c = 2.269185 J.]

Ising model 3D

    H = −J Σ_⟨i,j⟩ S_i S_j − H Σ_i S_i        (⟨i,j⟩: nearest neighbors)

3D Ising: GPU implementation Noninteracting domains where Monte Carlo moves are performed in parallel

Computation times: 3D Ising model
[Figure: time on GPU for allocation, memory transfer, and main function, total processing times on GPU and CPU, and the speedup factor, plotted against the block size n/2 from 2^4 to 2^7.]

Binder cumulant 3D

    U_4(T) = 1 − ⟨M(T)^4⟩ / (3 ⟨M(T)^2⟩^2)

[Figure: U_4 versus k_B T for n/2 = 16, 32, 64, 128; the curves intersect near the critical temperature, k_B T_C ≈ 4.51 J.]

GPGPU / Time Series Analysis


Random Walk

German Stock Index (DAX), 20.06.2008 to 19.09.2008
[Figure: four analysis panels for the DAX time series: PDF of returns, compared with φ(p(t)) = u · exp(−v · p²), autocorrelation, pattern conformity, and Hurst exponent.]

GPU computing / Hurst exponent

    ⟨|p(t + Δt) − p(t)|^q⟩^{1/q} ∝ Δt^{H_q(Δt)}

GPU computing / Hurst exponent
[Figure: time on GPU for allocation, memory transfer, main function, post processing, and final processing, total processing times on GPU and CPU, and the speedup factor, plotted against the length parameter.]

GPU computing / Hurst exponent
[Figure: price time series p(t) of the Euro-Bund future (FGBL JUN 2007) in percent of par value, and the Hurst exponent H(Δt) versus time lag Δt for a random walk and for FGBL, computed on CPU and on GPU; the relative error between the CPU and GPU results stays below 0.06%.]
TP, P. Virnau, W. Paul, and J. J. Schneider, preprint submitted (2009)

Fluctuation Patterns
The aim is to compare the current reference pattern of time interval length Δt with all previous patterns in the time series.
[Figure: price time series p(t) in units of points, with a reference pattern of length Δt = 75 marked.]

Fluctuation Patterns
True-range adapted modified time series:

    p̃_Δt(t) = (p(t) − p_l(t, Δt)) / (p_h(t, Δt) − p_l(t, Δt)),        p̃_Δt(t) ∈ [0, 1]

where p_h(t, Δt) and p_l(t, Δt) denote the highest and lowest price in the interval [t − Δt, t].

Fluctuation Patterns
Mean-square quality between the current and a comparison sequence:

    Q^t_Δt(τ) = (1/Δt) Σ_{θ=1}^{Δt} ( p̃_Δt(t − θ) − p̃_Δt(t − τ − θ) )²        with Q^t_Δt(τ) ∈ [0, 1]

In order to quantify the value of reference and comparison pattern relative to the reference point, one can define

    ω_t(τ, Δt⁺) = ( p(t + Δt⁺) − p(t) ) · ( p(t − τ + Δt⁺) − p(t − τ) )

Fluctuation Patterns
Observable for pattern conformity:

    ξ_χ(Δt⁺) = Σ_{t} Σ_{τ=1}^{τ̃} sgn(ω_t(τ, Δt⁺)) · exp(−χ Q^t_Δt(τ))

Limitation:

    τ̃ = τ̂ if τ̂ − t ≤ 0, and τ̃ = t otherwise

Normalized pattern conformity:

    Ξ_χ(Δt⁺) = ξ_χ(Δt⁺) / ( Σ_{t} Σ_{τ=1}^{τ̃} exp(−χ Q^t_Δt(τ)) )

Definition:

    sgn(x) = 1 for x > 0,  0 for x = 0,  −1 for x < 0

Pattern Conformity / Trivial Cases
[Figure: pattern conformity Ξ versus Δt and Δt⁺ for (a) a straight line, where Ξ ≈ 1, and (b) a random walk, where Ξ ≈ 0.]

Pattern Conformity / FDAX
[Figure: (a) pattern conformity Ξ = Ξ^FDAX_{χ=0} and (b) Ξ^ACRW_{χ=0} with an indicated exponent γ = 0.16; here Q^t_Δt(τ) = Q^{p,t}_Δt(τ).]
Financial market time series exhibit complex correlations, especially for large pattern lengths.

Inclusion of volumes and ITWT
[Figure: pattern conformity for (c) Q^t_Δt(τ) = Q^{p,t}_Δt(τ) + Q^{v,t}_Δt(τ), including traded volumes, and (d) Q^t_Δt(τ) = Q^{p,t}_Δt(τ) + Q^{ι,t}_Δt(τ), including inter-trade waiting times; both show the same structure and high values of the pattern conformity.]
T. Preis et al., Europhys. Lett. 82, 68005 (2008)

GPU computing / Pattern Conformity
[Figure: (a)-(c) pattern conformity surfaces versus Δt and Δt⁺; time on GPU for allocation, memory transfer, and main function, total processing times on GPU and CPU, and the speedup factor, plotted against the scan interval parameter from 2^0 to 2^5.]

Final remarks
TP, PV, WP, and JJS, GPU Accelerated Monte Carlo Simulation of 2D and 3D Ising Model, J. Comp. Phys. 228, 4468-4477 (2009)
TP, PV, WP, and JJS, Accelerated Fluctuation Analysis by Graphic Cards and Complex Pattern Formation in Econophysics, preprint submitted (2009)
Source code available: www.tobiaspreis.de

Thank you!