Detector Defect Correction of Medical Images on Graphics Processors
This is the author's version of the work. The definitive work was published in Proceedings of the SPIE: Medical Imaging 2011: Image Processing, Lake Buena Vista, Orlando, FL, USA, February 12-17, 2011.

Detector Defect Correction of Medical Images on Graphics Processors

Richard Membarth*a, Frank Hannig a, Jürgen Teich a, Gerhard Litz b, and Heinz Hornegger b
a Hardware/Software Co-Design, Department of Computer Science, University of Erlangen-Nuremberg, Germany.
b Siemens Healthcare Sector, H IM AX, Forchheim, Germany.

ABSTRACT

The ever-increasing complexity and power dissipation of computer architectures in the last decade blazed the trail for more power-efficient parallel architectures. Architectures like field-programmable gate arrays (FPGAs) and in particular graphics cards have therefore attracted great interest and are adopted for the parallel execution of many number-crunching loop programs from fields like image processing or linear algebra. However, little effort has been made to deploy memory-intensive applications with modest computational demand to graphics hardware. This paper considers a memory-intensive detector defect correction pipeline for medical imaging with strict latency requirements. The image pipeline compensates for different effects caused by the detector during exposure of X-ray images and calculates parameters to control the subsequent dosage. So far, dedicated hardware setups with special processors like DSPs were used for such critical processing. We show that this is feasible today with commodity graphics hardware. Using CUDA as programming model, it is demonstrated that the detector defect correction pipeline consisting of more than ten algorithms is significantly accelerated and that a speedup of 20 can be achieved on NVIDIA's Quadro FX 5800 compared to our reference implementation.
For deployment in a streaming application with steadily arriving new data, it is shown that the memory transfer overhead of successive images to the graphics card memory is reduced by 83 % using double buffering.

1. INTRODUCTION

In recent years, multi-core and many-core processors have become mainstream as a consequence of the steady increase of transistor count and density in chip designs. Power dissipation became a major problem, and computer designs moved toward simpler, more power-efficient designs with multiple cores on a chip. Today, processors used in standard desktop computers host four cores, and graphics cards up to several hundred cores. Consequently, computationally intensive algorithms and programs from different domains were adapted and mapped to these architectures. In particular, graphics cards attracted great interest due to the high computational performance they provide at a relatively low cost and development barrier. Examples of these domains include molecular biology [1], financial options analytics [2], seismic computations [3], and medical image processing [4-6]. In this paper, we present the evaluation of a medical image processing application from the domains of angiography and cardiology. In angiography, the operating doctor passes a catheter into the artery and a small dose of contrast agent is injected. At the same time, a rapid series of radiographs is taken while the radio-opaque fluid passes through the vessels. Another series, taken when the contrast agent has passed through the tissues, visualizes the vessels and artery supply. The imaging is done using X-ray based techniques such as fluoroscopy. The intensity of the X-rays is recorded by a detector and passed to an imaging system. The imaging system comprises different steps like detector defect correction of the images in the preprocessing and postprocessing. Currently, mainly solutions based on multiple DSPs are used for preprocessing in real-time.
In contrast, we map the detector defect correction pipeline of the preprocessing to graphics cards and combine it with the postprocessing and visualization system. The detector defect correction pipeline also calculates parameters to control the dosage for subsequent X-ray acquisitions. The maximal latency to calculate and pass the parameters to the X-ray system is 20 ms. For the total pipeline, including postprocessing, 120 ms are available. The pipeline including visualization is also investigated with regard to these strict latency requirements.

* richard.membarth@cs.fau.de

Figure 1: Tesla architecture (see [8]): 240 streaming processors distributed over 30 multiprocessors. The 30 multiprocessors are partitioned into 10 groups, each comprising 3 multiprocessors, cache, and texture unit.

To our knowledge, we are the first to investigate preprocessing pipelines like detector defect correction on graphics cards. Related work in this domain focuses only on computationally intensive parts; MRI reconstruction [4], 3D computed tomographic reconstruction [5], and 3D ultrasound [6] are examples of such computationally intensive algorithms. In contrast, we consider a memory-intensive pipeline, namely the detector defect correction pipeline, in this paper. We show that the detector defect correction pipeline can be mapped efficiently to graphics hardware to meet the strict latency requirements. The major advantage of our approach for future imaging systems is that preprocessing and postprocessing can be done on the same system. The remaining paper is organized as follows: Section 2 gives an overview of the hardware architecture used within this paper. Section 3 introduces the detector defect correction pipeline, and the mapping to graphics cards is explained in Section 4. The performance and latency investigations are presented in Section 5, while in Section 6 conclusions of this work are drawn and suggestions for future work are given.

2. NVIDIA TESLA ARCHITECTURE

In this section, we present an overview of the Tesla architecture of the Quadro FX 5800, which is used as accelerator for the algorithms studied within this paper.
The Tesla is a highly parallel hardware platform with 240 processors integrated on a chip, as depicted in Figure 1. The processors are grouped into 30 streaming multiprocessors, each of which comprises eight scalar streaming processors. While the multiprocessors are responsible for scheduling and work distribution, the streaming processors do the calculations. A program executed on the graphics card is called a kernel and is processed in parallel by many threads on the streaming processors. Each thread therefore calculates a small portion of the whole algorithm, for example one pixel of a large image. A batch of these threads is always grouped together into a thread block that is scheduled to one multiprocessor and executed by its streaming processors. One thread block can contain up to 512 threads; the exact number is specified by the programmer. The complete problem space has to be divided into subproblems such that these can be processed independently within one thread block on one multiprocessor. The multiprocessor always executes a batch of 32 threads, also called a warp, in parallel. NVIDIA calls this streaming multiprocessor architecture single instruction, multiple thread (SIMT) [7]. For all threads of a warp, the same instructions are fetched and executed for each thread independently, that is, the threads of one warp can diverge and execute different branches. However, when this occurs, the divergent branches are serialized until both branches merge again. Thereafter, the whole warp is executed in parallel again.
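The cost of such divergence can be illustrated with a minimal, hardware-independent sketch (plain Python, all names our own): a warp pays for each branch side that at least one of its threads takes.

```python
def warp_branch_cost(predicates, cost_then=1, cost_else=1):
    """Model the SIMT branch cost for one 32-thread warp.

    If all threads agree on the branch, only that side is issued;
    if they diverge, both sides are serialized one after the other."""
    takes_then = any(predicates)        # at least one thread enters 'then'
    takes_else = not all(predicates)    # at least one thread enters 'else'
    return cost_then * takes_then + cost_else * takes_else

# A uniform warp pays for one side only ...
uniform = [True] * 32
# ... while a divergent warp pays for both sides.
divergent = [i % 2 == 0 for i in range(32)]
print(warp_branch_cost(uniform))    # -> 1
print(warp_branch_cost(divergent))  # -> 2
```

This is why per-warp uniform conditions (such as the per-block ROI bits used later in Section 4.1) are cheap, while per-thread data-dependent branches can double the issue cost.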
Each thread executed on a multiprocessor has full read/write access to the 4.0 GB global memory of the graphics card. However, this memory has a long latency of 400 to 800 clock cycles. To hide this long latency, each multiprocessor is capable of managing and switching between up to eight thread blocks, but not more than 1024 threads in total. In addition, 16,384 registers and 16 KB of on-chip shared memory are provided to all threads executed simultaneously on one multiprocessor. These memory types are faster than the global memory, but shared between all thread blocks executed on the multiprocessor. The capabilities of the Tesla architecture and in particular of the used Quadro FX 5800 are summarized in Table 1. Current graphics cards also support asynchronous data transfers between host memory and global memory. This allows kernels to be executed on the graphics card while data is transferred to or from the graphics card. Data transfers are handled like normal kernels and assigned to a queue of commands to be processed in-order by the GPU. These queues are called streams in CUDA. Commands from different streams can be executed simultaneously as long as one of the commands is a computational kernel and the other an asynchronous data transfer command. This provides support for streaming applications. A feature called zero-copy provides similar benefits. Using zero-copy, data accessed by the streaming multiprocessors may reside in host memory. The data is implicitly transferred from host memory across PCI Express or the front-side bus to the streaming processors. This way, the user does not have to bother with data transfers between the graphics card and the host memory.

3. DETECTOR DEFECT CORRECTION

Setups for medical imaging typically consist of several independent systems responsible for preprocessing and postprocessing of medical images.
In the preprocessing stage, the raw image data is conditioned; in the postprocessing step, the data is visualized, overlaid with further information, and tasks like feature extraction are performed. For preprocessing, often dedicated embedded systems based on multiple DSPs are used and the results are passed to a second system responsible for postprocessing and visualization. Within this paper, we consider a preprocessing pipeline for detector defect correction, which is performed on a graphics card. The benefit is that the same system might be used for preprocessing as well as postprocessing. This makes the data transfers from the preprocessing system to the postprocessing system superfluous, but imposes strict requirements on the complete system. These are:

- For the complete imaging system, a maximal latency of 120 ms from image acquisition to image visualization is required. Larger delays are confusing for the operating doctor and can cause serious injuries up to the perforation of vessels.
- For the detector defect correction, a maximal latency of less than 20 ms from image acquisition to dose feedback for a frame rate of 30 frames per second is required (i.e., to use the dose feedback from image i_n for the next image i_n+1).

The detector defect correction pipeline, as shown in Figure 2, has several functions for its deployment in medical setups in angiography or cardiology. First of all, it compensates for detector-inherent effects, which have severe impacts on the quality of the recorded image. These are for example pixel-individual offset and gain correction or line offset correction.

Table 1: Hardware capabilities of the Tesla architecture and of the Quadro FX 5800.

    Threads per warp                     32
    Warps per multiprocessor             32
    Threads per block                    512
    Threads per multiprocessor           1024
    Blocks per multiprocessor            8
    Registers per multiprocessor         16,384
    Shared memory per multiprocessor     16 KB

Figure 2: Detector defect correction pipeline: A series of kernels is applied to the input data to compensate for different effects introduced by the detector and the recording of the images. The dose feedback (DF, on a given region of interest (ROI)) and histogram analysis (HA) kernels are control driven and used for the dose control of the next images. Dose control can be omitted using global gain (GG).

Here, also defect pixels as well as completely defect lines and columns of the detector are compensated. This is done in different independent algorithms (top image pipeline of Figure 2). These algorithms process the raw images acquired by the detector and also require some additional inputs like defect maps, gain images, or other parameters. The input images acquired by the detector are gray-scale images with an accuracy of 14 bits. As a second key function of the defect correction pipeline, image-synchronous gain control is provided. This is done after the defects of the detector are corrected so that only good images are considered. For the synchronous gain control, three different modes (A to C) are available. The first two modes, dose feedback (DF) and histogram analysis (HA), calculate parameters for the subsequent dosage of the detector. These parameters are passed to an external system, which is responsible for dose control. This system imposes a strict maximal latency of less than 20 ms between image acquisition and the retrieval of the parameters to adjust the dosage for the following image. These parameters are also used in the subsequent algorithm of the pipeline. The third mode, global gain (GG), provides a constant parameter instead. In Figure 3, an image at different stages of the detector defect correction pipeline is shown. The raw input image is shown in Figure 3(a). Here, a pattern of the detector interferes with the acquired image, and at the left and right border a dark reference zone (DRZ) is added. This zone is removed later on.
The detector defect corrected image is shown in Figure 3(b). The signal of this image is normalized using the parameters from dose feedback and histogram analysis in Figures 3(c) and (d) for better visual contrast.

4. DETECTOR DEFECT CORRECTION ON GRAPHICS CARDS

We map the detector defect correction as described in the previous section to graphics cards using CUDA as programming model. The preprocessing is combined with a rudimentary postprocessing and visualization using OpenGL. Single-precision 32-bit floating-point numbers are used as number representation on the graphics card for the complete imaging system. Therefore, the input values are converted at the beginning of the pipeline. Floating-point numbers have several advantages over fixed-point representations on DSPs, where the bit-width is explicit and must be managed by the programmer. For example, when programming for DSPs, the bit-width has to be
Figure 3: Images of the detector defect correction pipeline: (a) shows the raw input image of the detector including a dark reference zone at the left and right border, while (b) shows the detector defect corrected output image. In (c) the image is normalized using the dose feedback algorithm and in (d) using the histogram analysis (HA) algorithm.
extended by hand when higher accuracy is needed. The input images we consider have a fixed resolution; after the dark reference zone is removed from the images, the resolution is slightly reduced. In the following subsections, the mapping of two algorithms, namely dose feedback and histogram analysis, is illustrated in detail.

4.1 Dose Feedback

The dose feedback algorithm calculates mean values in regions of interest (ROI) of the image. The region of interest can be changed or adjusted during run-time and is determined by the operator of the system. The image is divided into 32 horizontal and 32 vertical blocks, each having a size of 30 × 32 pixels. We use one bit for each image block to encode whether the block contributes to the overall mean value or not. Each block of the image is processed by one multiprocessor on the graphics card. Since the schedule of thread blocks to the multiprocessors is not deterministic, one thread block is needed for each image block, even if the image block does not contribute to the mean value. However, only if the bit for the image block is set do the multiprocessors have to calculate the mean value. Otherwise, they finish execution almost immediately and the next thread block is assigned to them. Given that the different multiprocessors cannot communicate with each other, the final mean value of all image blocks has to be calculated in two steps. First, each multiprocessor calculates one mean value per image block using a parallel reduction technique and stores it to global memory, as illustrated in Figure 4. In a second step, one multiprocessor calculates the final mean value of the values stored in global memory, again by the use of a reduction [9]. The number of threads executed simultaneously on one multiprocessor is always a multiple of the warp size. However, the size of one block is only 30 in the x-dimension. That is, there are two threads for every 30 pixels that would work on a different block line, resulting in misaligned memory accesses.
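The two-step scheme (one mean per selected image block, then a mean of the block means) can be sketched sequentially as follows. This is a minimal Python model, not the CUDA implementation: the block tiling and the per-block ROI bit mirror the text, while the function name and the toy sizes are our own.

```python
def roi_mean(image, block_h, block_w, use_block):
    """Two-step mean, mirroring the dose feedback scheme: step 1
    computes one mean per image block (on the GPU, one thread block
    each); step 2 averages the selected block means. The entry
    use_block[by][bx] plays the role of the per-block ROI bit."""
    rows, cols = len(image), len(image[0])
    block_means = []
    for by in range(rows // block_h):
        for bx in range(cols // block_w):
            if not use_block[by][bx]:
                continue  # unselected blocks finish almost immediately
            total = sum(image[y][x]
                        for y in range(by * block_h, (by + 1) * block_h)
                        for x in range(bx * block_w, (bx + 1) * block_w))
            block_means.append(total / (block_h * block_w))
    # Step 2: one final reduction over the stored block means.
    return sum(block_means) / len(block_means)

# 4x4 toy image, 2x2 blocks; only the top-left block is selected.
img = [[1, 1, 2, 2],
       [1, 1, 2, 2],
       [3, 3, 4, 4],
       [3, 3, 4, 4]]
sel = [[1, 0],
       [0, 0]]
print(roi_mean(img, 2, 2, sel))  # -> 1.0
```

Note that this computes a mean of per-block means, which equals the overall mean of the selected pixels only because all blocks have the same size.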
Therefore, we use two more threads per block line that are idle. Using one thread per pixel would require 30 × 32 = 960 threads for one image block, but only up to 512 threads can be assigned to a single thread block. For that reason, each thread calculates the mean value of multiple pixels, which allows us to use one thread block per image block. The mean value is required on the graphics card for successive kernels as well as on the host system for dose control. Hence, the value is stored in the global memory on the graphics card and downloaded to the host for further processing.

4.2 Histogram Analysis

The histogram analysis algorithm calculates two characteristic values for the image based on the histogram of the image. To create a histogram for a 14-bit image, 2^14 = 16,384 bins are required. However, to achieve good performance, the limited shared memory of 16 KB has to be used to create the histogram [10]. That is, either only parts of the histogram or simply smaller histograms can be stored. In our approach, we use smaller histograms that fit completely into shared memory and assign multiple contiguous values to the same bin [11]. This is seen in Algorithm 1, where in the first step one histogram is generated per thread block and stored to global memory. In the second step, these sub-histograms are merged into one final histogram. The characteristic values are also stored to global memory and transferred back to the host for further processing. In addition to the problem of limited shared memory, race conditions exist when several threads try to update the same bin of the histogram, that is, some bin updates may be lost. In our implementation, we use atomic functions to update bins of the histogram to avoid the race conditions, as long as the graphics card supports them. Unfortunately, atomic functions are only available on newer hardware. On older hardware, we therefore use five bits of the bin counter as a tag to ensure that the bin was successfully updated using a compare-and-swap mechanism.

5. RESULTS

In this section, the results of the detector defect correction pipeline on the graphics card are discussed. On the one hand, this is the performance of the pipeline itself and of a streaming application incorporating the detector defect correction pipeline on a series of images. On the other hand, we investigate the latency to get the parameters from the graphics card required for dose control.
Figure 4: Dose feedback: The image is tiled into 32 regions of interest in each dimension. In the first step, the mean value of each ROI is calculated and stored to global memory if the corresponding df bit is set. In a second step, the average of the mean values stored in global memory is calculated.

Algorithm 1: Histogram generation on the graphics card

Step 1: Create sub-histograms in parallel; one per thread block:
forall the thread blocks B do in parallel
    for each thread t in thread block do in parallel
        initialize_shared_memory(local_hist, t);
        synchronize;
        value ← get_pixel_value(image, t);
        index ← get_bin_number(value);
        atomic_inc(local_hist, index);
        synchronize;
        global_histogram[block_index][t] ← local_hist(t);
    end
end

Step 2: Merge sub-histograms in parallel using one thread block:
for each thread t in thread block do in parallel
    sum ← 0;
    for i ← 0 to number_of_histograms do
        sum ← sum + global_histogram[i][t];
    end
    global_histogram[t] ← sum;
end
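The bin-reduction and two-step merge of Algorithm 1 can be modeled sequentially as follows. This is a sketch in plain Python, not the CUDA kernel: the choice of 8-bit sub-histograms (so that 2^6 contiguous 14-bit values share a bin) is our illustrative assumption, and the per-block loop stands in for the parallel thread blocks.

```python
def block_histograms(pixels, block_size, value_bits=14, hist_bits=8):
    """Step 1 of Algorithm 1: one reduced histogram per 'thread
    block'. Contiguous input values share a bin so that the
    histogram fits into the small shared memory."""
    shift = value_bits - hist_bits      # 2**shift values per bin
    hists = []
    for start in range(0, len(pixels), block_size):
        hist = [0] * (1 << hist_bits)
        for v in pixels[start:start + block_size]:
            hist[v >> shift] += 1       # atomic_inc on the GPU
        hists.append(hist)
    return hists

def merge_histograms(hists):
    """Step 2 of Algorithm 1: merge the sub-histograms bin-wise
    into the final histogram (one 'thread' per bin on the GPU)."""
    return [sum(col) for col in zip(*hists)]

pixels = [0, 1, 63, 64, 16383, 16383]   # 14-bit sample values
final = merge_histograms(block_histograms(pixels, block_size=3))
print(final[0], final[1], final[255])   # -> 3 1 2
```

The sequential loops make the lost-update race of the real implementation impossible; on the GPU the `hist[...] += 1` line is exactly the place where atomic functions (or the tagged compare-and-swap fallback) are needed.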
5.1 Detector Defect Correction Pipeline

The timings of all algorithms that are part of the detector defect correction pipeline are shown in Table 2. The first column displays the name of the kernel in the detector defect correction pipeline. The second and third columns show the execution times on the high-end Quadro FX 5800 graphics card and the low-end GeForce 9400M mobile graphics card. The reference implementation running on one core of a Core 2 Quad processor is listed in the last column. Some algorithms of the detector defect correction pipeline consist of multiple kernel launches on the graphics card, for example in order to perform a reduction on values and to use the reduced value later on. The execution times include only the time to execute the kernels of the detector defect correction pipeline, that is, the data already resides in global memory of the graphics card. The image data is read from that memory, processed by the streaming processors, and stored back to global memory. Most of the kernels, which take about 0.20 ms to complete, are memory bound. That is, the streaming processors perform only a few instructions per data element and wait most of the time for data. Such algorithms are for instance ALG8 and ALG10. Computationally more intensive kernels require more execution time. The times for the low-end mobile graphics card are in the same range as for the reference implementation. This is particularly interesting, since the capabilities of the GeForce 9400M are inferior to those of the Quadro FX 5800 and the implementation was not optimized for the mobile graphics card. To be accurate, the missing memory access abstractions of the GeForce 9400M cause penalties of up to a factor of 4 compared to the Quadro FX 5800. Also, the missing support for atomic operations required implementing a manual compare-and-swap mechanism in order to calculate histograms.
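For a memory-bound kernel, a lower bound on the execution time follows directly from the bytes moved and the memory bandwidth. The sketch below is a back-of-the-envelope estimate with assumed numbers, not a measurement from this paper: the 1024 × 1024 image size is hypothetical, and ~102 GB/s is the nominal peak bandwidth commonly quoted for the Quadro FX 5800.

```python
def memory_bound_time_ms(pixels, bytes_per_pixel=4,
                         bandwidth_gb_s=102.0, passes=2):
    """Lower bound for a memory-bound kernel: every pixel is read
    once and written once (passes=2) at peak memory bandwidth."""
    bytes_moved = pixels * bytes_per_pixel * passes
    return bytes_moved / (bandwidth_gb_s * 1e9) * 1e3

# Hypothetical 1024x1024 single-precision image on a ~102 GB/s card:
print(round(memory_bound_time_ms(1024 * 1024), 3))  # -> 0.082
```

That the simple kernels of the pipeline take about 0.20 ms rather than this bound is consistent with achieving only a fraction of peak bandwidth, e.g. due to misaligned accesses such as those discussed in Section 4.1.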
The total execution times in the last three rows correspond to the execution of the detector defect correction pipeline using dose feedback, histogram analysis, and global gain, respectively. For the first two alternatives, parameters for dose control are calculated. In Figure 5, the speedup of the graphics card implementation compared to the reference implementation is shown. The high-end Quadro FX 5800 achieves a speedup of up to 20 compared to the reference implementation, while the low-end mobile GeForce 9400M is slightly faster than our reference implementation.

Table 2: Execution times in ms of all kernels of the detector defect correction pipeline on a high-end Quadro FX 5800 graphics card, a low-end GeForce 9400M mobile graphics card, and a Core 2 Quad CPU.

    Kernel          Quadro FX 5800 (ms)    GeForce 9400M (ms)    Core 2 Quad (ms)
    ALG1
    ALG2
    ALG3
    ALG4
    ALG5
    ALG6
    ALG7
    ALG8
    DF
    HA
    ALG9 (DF)
    ALG9 (HA)
    ALG9 (GG)
    ALG10
    ALG11
    Total (DF)
    Total (HA)
    Total (GG)
Figure 5: Achieved speedup executing the detector defect correction pipeline on graphics cards compared to the reference implementation for the different operation modes (DF, HA, GG).

Still, there are many opportunities to improve the performance further. On the one hand, many of the kernels in the detector defect correction pipeline are memory bound and have no data dependencies. Such kernels can be fused into one single kernel, eliminating data transfers to global memory and thereby reducing the execution time by at most the time of the faster kernel (see [12]). On the other hand, the high-end card is not fully utilized at the considered resolutions; higher workloads utilize the graphics card to its full capacity. However, the primary goal of our study was not to optimize the algorithms completely and tune them to one architecture, but to investigate the feasibility and the latency of getting the parameters for dose control.

5.2 Double Buffering

From Table 2 we can see that the processing of the complete detector defect correction pipeline takes between 4.5 ms and 5.5 ms, depending on the mode. In a real application, however, not only the pure execution times are important, but also the time to feed the graphics card with data. Transferring the input image to the graphics memory takes 1.06 ms. This is one fifth of the time it takes to process the complete detector defect correction pipeline. Therefore, it is essential to reduce or hide the memory transfer times in order to achieve the best performance. To support concepts like double buffering, special memory allocation on the host is required. Copying data asynchronously from host memory to device memory is only possible when the host memory is allocated as page-locked memory. This allows kernel executions to overlap with memory transfers. To manage such concurrency, streams are used.
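The benefit of overlapping transfers with computation can be estimated with a simple timing model. This is a sketch, not a measurement: the 0.65 ms page-locked transfer and ~5 ms pipeline time are taken from the text, while the idealized assumption that only the first transfer is exposed is ours.

```python
def pipeline_time(n, t_transfer, t_compute, overlap):
    """Total time for n frames. With double buffering, the transfer
    of frame i+1 hides behind the computation of frame i, so only
    the first transfer is exposed (assuming t_transfer <= t_compute)."""
    if not overlap:
        return n * (t_transfer + t_compute)
    return t_transfer + n * max(t_compute, t_transfer)

# Figures from the text: ~0.65 ms page-locked transfer, ~5 ms pipeline.
sync_ms = pipeline_time(10, 0.65, 5.0, overlap=False)
async_ms = pipeline_time(10, 0.65, 5.0, overlap=True)
hidden = 1 - (async_ms - 10 * 5.0) / (10 * 0.65)
print(round(sync_ms, 2), round(async_ms, 2), round(hidden, 2))  # -> 56.5 50.65 0.9
```

The idealized model hides 90 % of the transfer overhead; the measured 83 % reported below is lower because scheduling overheads and the synchronous parts of each iteration remain exposed.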
A stream is a sequence of commands that execute in order, while commands from different streams may be executed in parallel. The parallel execution of a kernel from one stream and an asynchronous memory transfer from a different stream is supported. Assigning different image iterations to different streams allows fetching the image for iteration i_n+1 while iteration i_n is processed (i.e., while the kernels for iteration i_n are executed). Page-locked memory also allows faster memory transfers: transferring the image from page-locked memory takes only 0.65 ms. Moreover, kernels can access page-locked memory directly, that is, it is no longer necessary to copy data to the graphics card memory before launching a kernel. This is possible since page-locked memory is not pageable by the operating system and consequently can be accessed from the graphics card without the support of the host. Kernels can directly read and write to page-locked host memory. However, this only makes sense for the first and last kernel of the image pipeline, respectively. For those kernels, the image has to be fetched from the host memory and stored back again. Instead of doing this using memory transfers, the kernels can directly read/write to host memory using zero-copy. Zero-copy makes asynchronous memory transfers superfluous. The more computation is done in the first kernel, the more memory transfer time can be hidden.

Figure 6: Gantt chart of the detector defect correction pipeline, processing ten images: (a) uses normal synchronous memory transfers, (b) uses faster page-locked memory, (c) uses zero-copy to transfer data, and (d) uses asynchronous memory transfers. Two streams are used for asynchronous memory transfers. While one stream transfers the next image to the graphics memory, the current image is processed on the graphics card.

Figure 6 shows the activity on the graphics card for the above mentioned variants over ten iterations. Only when double buffering is used are two streams required to transfer the next image asynchronously. At the beginning, the first image has to be on hand before the two streams can use asynchronous data transfers to hide the data transfers of the successive iterations. Each kernel launch and memory transfer is denoted by its own bar in the Gantt chart. Using the double buffering implementation, most of the data transfers can be hidden, as seen in Table 3. Compared to the execution time of 10 iterations with no data transfers, using only one stream and synchronous memory transfers adds about 10 ms for the data transfers. Using page-locked memory, the execution time is reduced. Using zero-copy takes about the same time; here, almost none of the memory transfer time could be hidden, since the first kernel of the pipeline is memory bound. The best solution uses asynchronous memory transfers, with which 83 % of the data transfer overhead could be hidden. This is almost the complete memory transfer overhead apart from the transfer of the first image.

5.3 Dose Control Parameter Latency

The parameters for dose control are calculated during dose feedback or histogram analysis and are stored on the graphics card. In order to obtain and pass these parameters without interrupting the detector defect correction pipeline, we use asynchronous data transfers to download the parameters to the host while the next kernels are executed on the graphics card. Doing so allows us to pass the parameters as fast as possible to dose control. Table 4 shows the latency from the moment when the image is in the host memory to the point where the parameters for dose control have been downloaded to the host and can be passed to dose control. For all alternatives but the naïve synchronous memory transfer model, the latency for the dose control parameters is below 5 ms and thus meets the 20 ms latency requirement.

6. CONCLUSIONS

In this paper, a memory-intensive preprocessing pipeline for medical imaging with strict latency requirements was investigated, and it was shown that it is feasible to map and implement it on a graphics card.
While current medical setups use different hardware systems for preprocessing and postprocessing, we use the same platform for both, which allows reducing the complexity of the system. A detector defect correction pipeline with more than ten algorithms was mapped to the graphics hardware. The pipeline compensates for different effects caused by a detector during exposure of X-ray images and calculates parameters to control the subsequent dosage. Medical images were processed in less than 5 ms, 20 times faster than on a standard CPU. The time-critical part of the pipeline, the calculation and forwarding of dose parameters to the host and to an external system, was also done in less than 5 ms, which fulfills the 20 ms maximal latency requirement throughout. For deployment in medical environments with streams of images, we showed that the memory transfer overhead of successive images to the graphics card memory is reduced by 83 % using asynchronous memory transfers. Zero-copy additionally relieves the programmer from data transfer management and transfers the data implicitly from main memory. However, using zero-copy only 36 % of the transfer overhead was saved.

Table 3: Execution times for 10 iterations of the detector defect correction pipeline for different memory management approaches when no X server is running.

    No data transfers
    Synchronous memory transfers
    Page-locked memory
    Zero-copy
    Asynchronous memory transfers

Table 4: Latency to get the parameters for dose control from the graphics card. The latency covers the time from when the image is still on the host until the parameters are available on the host.

                                      DF         HA
    Synchronous memory transfers      5.14 ms    5.66 ms
    Page-locked memory                4.23 ms    4.76 ms
    Zero-copy                         4.27 ms    4.82 ms
    Asynchronous memory transfers     4.35 ms    4.89 ms
The control flow required to interact with the detector defect correction pipeline is implemented using CUDA and OpenGL. Different modes are supported and can be selected on a per-frame basis. In fact, the current preprocessing and visualization leave much room for further processing and would still meet the latency requirements. This allows medical systems in cardiology and angiography to reduce the complexity of existing systems and to use one common system for preprocessing and postprocessing, whereas current setups use different systems and hardware platforms. To guarantee the latency requirements for dose feedback, a fixed number of multiprocessors could be reserved exclusively for preprocessing in the upcoming Fermi architecture of NVIDIA [13].

REFERENCES

[1] Manavski, S. and Valle, G., CUDA Compatible GPU Cards as Efficient Hardware Accelerators for Smith-Waterman Sequence Alignment, BMC Bioinformatics 9(Suppl 2), S10 (2008).
[2] Preis, T., Virnau, P., Paul, W., and Schneider, J., GPU Accelerated Monte Carlo Simulation of the 2D and 3D Ising Model, Journal of Computational Physics 228(12) (2009).
[3] Micikevicius, P., 3D Finite Difference Computation on GPUs using CUDA, in [Proceedings of the 2nd Workshop on General Purpose Processing on Graphics Processing Units], 79-84, ACM (2009).
[4] Stone, S., Haldar, J., Tsao, S., Hwu, W.-M., Liang, Z., and Sutton, B., Accelerating Advanced MRI Reconstructions on GPUs, in [Proceedings of the 2008 Conference on Computing Frontiers] (2008).
[5] Xu, F. and Mueller, K., Real-Time 3D Computed Tomographic Reconstruction using Commodity Graphics Hardware, Physics in Medicine and Biology 52 (2007).
[6] Reichl, T., Passenger, J., Acosta, O., and Salvado, O., Ultrasound goes GPU: Real-Time Simulation using CUDA, Progress in Biomedical Optics and Imaging 10(37) (2009).
[7] Lindholm, E., Nickolls, J., Oberman, S., and Montrym, J., NVIDIA Tesla: A Unified Graphics and Computing Architecture, IEEE Micro 28(2) (2008).
[8] Owens, J., Houston, M., Luebke, D., Green, S., Stone, J., and Phillips, J., GPU Computing, Proceedings of the IEEE 96(5) (2008).
[9] Sengupta, S., Harris, M., Zhang, Y., and Owens, J., Scan Primitives for GPU Computing, in [Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware], Eurographics Association (2007).
[10] Shams, R. and Kennedy, R. A., Efficient Histogram Algorithms for NVIDIA CUDA Compatible Devices, in [Proceedings of the International Conference on Signal Processing and Communications Systems (ICSPCS)] (2007).
[11] Scott, D., On Optimal and Data-Based Histograms, Biometrika 66(3), 605 (1979).
[12] Membarth, R., Hannig, F., Dutta, H., and Teich, J., Efficient Mapping of Multiresolution Image Filtering Algorithms on Graphics Processors, in [Proceedings of the 9th International Workshop on Systems, Architectures, Modeling, and Simulation (SAMOS Workshop)] (2009).
[13] NVIDIA Corporation, NVIDIA Whitepaper: NVIDIA's Next Generation CUDA Compute Architecture: Fermi. Architecture_Whitepaper.pdf (2009).