Detector Defect Correction of Medical Images on Graphics Processors


This is the author's version of the work. The definitive work was published in Proceedings of the SPIE: Medical Imaging 2011: Image Processing, Lake Buena Vista, Orlando, FL, USA, February 12-17, 2011.

Detector Defect Correction of Medical Images on Graphics Processors

Richard Membarth*a, Frank Hannig a, Jürgen Teich a, Gerhard Litz b, and Heinz Hornegger b
a Hardware/Software Co-Design, Department of Computer Science, University of Erlangen-Nuremberg, Germany.
b Siemens Healthcare Sector, H IM AX, Forchheim, Germany.

ABSTRACT

The ever increasing complexity and power dissipation of computer architectures in the last decade blazed the trail for more power efficient parallel architectures. Hence, architectures like field-programmable gate arrays (FPGAs) and in particular graphics cards attained great interest and are consequently adopted for the parallel execution of many number-crunching loop programs from fields like image processing or linear algebra. However, there is little effort to deploy hardly computational, but memory intensive applications to graphics hardware. This paper considers a memory intensive detector defect correction pipeline for medical imaging with strict latency requirements. The image pipeline compensates for different effects caused by the detector during exposure of X-ray images and calculates parameters to control the subsequent dosage. So far, dedicated hardware setups with special processors like DSPs were used for such critical processing. We show that this is feasible today with commodity graphics hardware. Using CUDA as programming model, it is demonstrated that the detector defect correction pipeline consisting of more than ten algorithms is significantly accelerated and that a speedup of 20x can be achieved on NVIDIA's Quadro FX 5800 compared to our reference implementation. For deployment in a streaming application with steadily incoming data, it is shown that the memory transfer overhead of successive images to the graphics card memory is reduced by 83 % using double buffering.

1. INTRODUCTION

In recent years, multi-core and many-core processors have become mainstream due to the implications accompanying the steady increase of transistor count and density in chip designs. Power dissipation became a major problem, and computer designs moved toward simpler and more power efficient designs with multiple cores on a chip. Today, processors used in standard desktop computers host four cores and graphics cards host up to several hundred cores. Consequently, computationally intensive algorithms and programs from different domains were adapted and mapped to these architectures. In particular, graphics cards attracted great interest due to the high computational performance they provide at a relatively low cost and development barrier. Examples of these domains include molecular biology [1], financial options analytics [2], seismic computations [3], and medical image processing [4-6]. In this paper, we present the evaluation of a medical image processing application from the domain of angiography and cardiology. In angiography, the operating doctor passes a catheter into the artery and a small dose of contrast agent is injected. At the same time, a rapid series of radiographs is taken while the radio-opaque fluid passes through the vessels. Another series, taken when the contrast agent has passed through the tissues, visualizes the vessels and artery supply. The imaging is done using X-ray based techniques such as fluoroscopy.
The intensity of the X-rays is recorded by a detector and passed to an imaging system. The imaging system comprises different steps like detector defect correction of the images in the preprocessing and postprocessing. Currently, mainly solutions based on multiple DSPs are used for preprocessing in real-time. In contrast, we map the detector defect correction pipeline of the preprocessing to graphics cards and combine it with the postprocessing and visualization system. The detector defect correction pipeline also calculates parameters to control the dosage for subsequent X-ray acquisitions.

* richard.membarth@cs.fau.de

Figure 1: Tesla architecture (see [8]): 240 streaming processors distributed over 30 multiprocessors. The 30 multiprocessors are partitioned into 10 groups, each comprising 3 multiprocessors, cache, and texture unit.

The maximal latency to calculate and pass the parameters to the X-ray system is 20 ms. For the total pipeline, including postprocessing, 120 ms are available. The pipeline including visualization is also investigated regarding these strict latency requirements. To our knowledge, we are the first to investigate preprocessing pipelines like detector defect correction on graphics cards. Related work in this domain focuses only on computationally intensive parts: MRI reconstruction [4], 3D computed tomographic reconstruction [5], and 3D ultrasound [6] are examples of such computationally intensive algorithms. In contrast, we consider a memory intensive pipeline, namely the detector defect correction pipeline, in this paper. We show that the detector defect correction pipeline can be mapped efficiently to graphics hardware to meet the strict latency requirements. The major advantage for future imaging systems using our approach is that preprocessing and postprocessing can be done on the same system. The remaining paper is organized as follows: Section 2 gives an overview of the hardware architecture used within this paper. The following Section 3 introduces the detector defect correction pipeline, and the mapping to graphics cards is explained in Section 4. The performance and latency investigations are presented in Section 5, while in Section 6 conclusions of this work are drawn and suggestions for future work are given.

2. NVIDIA TESLA ARCHITECTURE

In this section, we present an overview of the Tesla architecture of the Quadro FX 5800, which is used as accelerator for the algorithms studied within this paper. The Tesla is a highly parallel hardware platform with 240 processors integrated on a chip, as depicted in Figure 1. The processors are grouped into 30 streaming multiprocessors. Each multiprocessor comprises eight scalar streaming processors. While the multiprocessors are responsible for scheduling and work distribution, the streaming processors do the calculations. A program executed on the graphics card is called a kernel and is processed in parallel by many threads on the streaming processors. Each thread calculates a small portion of the whole algorithm, for example one pixel of a large image. A batch of these threads is grouped together into a thread block that is scheduled to one multiprocessor and executed by its streaming processors. One thread block can contain up to 512 threads, as specified by the programmer. The complete problem space has to be divided into subproblems such that these can be processed independently within one thread block on one multiprocessor. The multiprocessor always executes a batch of 32 threads, also called a warp, in parallel. NVIDIA calls this streaming multiprocessor architecture single instruction, multiple thread (SIMT) [7]. For all threads of a warp the same instructions are fetched and executed for each thread independently, that is, the threads of one warp can diverge and execute different branches. However, when this occurs, the divergent branches are serialized until both branches merge again. Thereafter, the whole warp is executed in parallel again.
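To illustrate this execution model, the following minimal CUDA sketch (with hypothetical names, not taken from the paper's pipeline) launches one thread per pixel and lets each thread scale a single value:

    // One thread computes one pixel; thread blocks tile the image.
    __global__ void scalePixels(float *img, int width, int height, float gain) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)      // the grid may overhang the image border
            img[y * width + x] *= gain;
    }

    // Launch: 16 x 16 = 256 threads per block (at most 512 are allowed on Tesla).
    void launchScale(float *d_img, int width, int height) {
        dim3 block(16, 16);
        dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
        scalePixels<<<grid, block>>>(d_img, width, height, 1.5f);
    }

The guard against out-of-range coordinates is what allows the grid of thread blocks to cover arbitrary image sizes, with each thread block mapping to one subproblem as described above.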

Each thread executed on a multiprocessor has full read/write access to the 4.0 GB global memory of the graphics card. However, this memory has a long latency of 400 to 800 clock cycles. To hide this long latency, each multiprocessor is capable of managing and switching between up to eight thread blocks, but not more than 1024 threads in total. In addition, 16384 registers and 16 KB of on-chip shared memory are provided to all threads executed simultaneously on one multiprocessor. These memory types are faster than the global memory, but shared between all thread blocks executed on the multiprocessor. The capabilities of the Tesla architecture, and in particular of the used Quadro FX 5800, are summarized in Table 1. Current graphics cards also support asynchronous data transfers between host memory and global memory. This allows kernels to execute on the graphics card while data is transferred to or from the graphics card. Data transfers are handled like normal kernels and assigned to a queue of commands to be processed in order by the GPU. These queues are called streams in CUDA. Commands from different streams can be executed simultaneously as long as one of the commands is a computational kernel and the other an asynchronous data transfer command. This provides support for streaming applications. A feature called zero-copy provides similar benefits. Using zero-copy, data accessed by the streaming multiprocessors may reside in host memory. The data is implicitly transferred from host memory across PCI Express or the front-side bus to the streaming processors. This way, the user does not have to bother with data transfers between the graphics card and the host memory.

3. DETECTOR DEFECT CORRECTION

Setups for medical imaging typically consist of several independent systems responsible for preprocessing and postprocessing of medical images. In the preprocessing stage, the raw image data is conditioned; in the postprocessing step, the data is visualized, overlaid with further information, and tasks like feature extraction are performed. For preprocessing, often dedicated embedded systems based on multiple DSPs are used, and the results are passed to a second system responsible for postprocessing and visualization. Within this paper, we consider a preprocessing pipeline for detector defect correction, which is performed on a graphics card. The benefit is that the same system might be used for preprocessing as well as postprocessing. This makes the data transfers from the preprocessing system to the postprocessing system superfluous, but imposes strict requirements on the complete system:

- For the complete imaging system, a maximal latency of 120 ms from image acquisition to image visualization is required. Larger delays are confusing for the operating doctor and can cause serious injuries up to the perforation of vessels.
- For the detector defect correction, a maximal latency of less than 20 ms from image acquisition to dose feedback is required for a frame rate of 30 frames per second (i.e., to use the dose feedback from image i_n for the next image i_(n+1)).

The detector defect correction pipeline, as shown in Figure 2, has several functions for its deployment in medical setups in angiography or cardiology. First of all, it compensates for detector inherent effects, which have severe impacts on the quality of the recorded image. Examples are pixel-individual offset and gain correction or line offset correction.
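To give an impression of what such a correction step looks like, here is a minimal CUDA sketch of a per-pixel offset and gain correction; the names and the exact formula are hypothetical, but the element-wise, memory-bound structure is representative of these algorithms:

    // Per-pixel correction: corrected = (raw - offset) * gain.
    // One thread per pixel; the offset and gain maps are calibration inputs.
    __global__ void offsetGainCorrect(const float *raw, const float *offsetMap,
                                      const float *gainMap, float *out, int numPixels) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < numPixels)
            out[idx] = (raw[idx] - offsetMap[idx]) * gainMap[idx];
    }

Such kernels perform only a handful of arithmetic operations per three memory loads and one store, which is why the pipeline as a whole is memory bound rather than compute bound.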
Table 1: Hardware capabilities of the Tesla architecture and of the Quadro FX 5800.

    Threads per warp                     32
    Warps per multiprocessor             32
    Threads per block                    512
    Threads per multiprocessor           1024
    Blocks per multiprocessor            8
    Registers per multiprocessor         16384
    Shared memory per multiprocessor     16 KB

Figure 2: Detector defect correction pipeline: A series of kernels is applied to the input data to compensate for different effects introduced by the detector and the recording of the images. The dose feedback (DF, on a given region of interest (ROI)) and histogram analysis (HA) kernels are control driven and used for the dose control of the next images. Dose control can be omitted using global gain (GG).

Defect pixels as well as completely defective lines and columns of the detector are also compensated here. This is done in different independent algorithms (top image pipeline of Figure 2). These algorithms process the raw images acquired by the detector and also require some additional inputs like defect maps, gain images, or other parameters. The input images acquired by the detector are grayscale images with an accuracy of 14 bits. As a second key function of the defect correction pipeline, image synchronous gain control is provided. This is done after the defects of the detector are corrected so that only good images are considered. For the synchronous gain control, three different modes (A to C) are available. The first two modes, dose feedback (DF) and histogram analysis (HA), calculate parameters for the subsequent dosage of the detector. These parameters are passed to an external system, which is responsible for dose control. This system imposes a strict maximal latency of less than 20 ms between image acquisition and the retrieval of the parameters to adjust the dosage for the following image. These parameters are also used in the subsequent algorithm of the pipeline. The third mode, global gain (GG), provides a constant parameter instead. In Figure 3, an image at different stages of the detector defect correction pipeline is shown. The raw input image is shown in Figure 3(a). Here, a pattern of the detector interferes with the acquired image, and at the left and right border a dark reference zone (DRZ) is added. This zone is removed later on. The detector defect corrected image is shown in Figure 3(b). The signal of this image is normalized using the parameters from dose feedback and histogram analysis in Figures 3(c) and (d) for better visual contrast.

4. DETECTOR DEFECT CORRECTION ON GRAPHICS CARDS

We map the detector defect correction as described in the previous section to graphics cards using CUDA as programming model. The preprocessing is combined with a rudimentary postprocessing and visualization using OpenGL. Single-precision 32-bit floating-point numbers are used as number representation on the graphics card for the complete imaging system. Therefore, the input values are converted at the beginning of the pipeline. Floating-point numbers have several advantages over fixed-point representations on DSPs, where the bit-width is explicit and must be managed by the programmer. For example, when programming for DSPs, the bit-width has to be extended by hand when higher accuracy is needed.
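This initial conversion is a simple element-wise kernel. A minimal sketch follows, assuming (this is our assumption, not stated in the paper) that the 14-bit detector values arrive packed in 16-bit words:

    // Convert 14-bit detector values (assumed stored in 16-bit words) to 32-bit floats.
    __global__ void convertToFloat(const unsigned short *raw, float *out, int numPixels) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < numPixels)
            out[idx] = (float)(raw[idx] & 0x3FFF);   // mask to the 14 valid bits
    }

Since 14-bit integers fit losslessly into the 24-bit mantissa of a single-precision float, no accuracy is lost in this conversion.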

Figure 3: Images of the detector defect correction pipeline: (a) shows the raw input image of the detector including a dark reference zone at the left and right border, while (b) shows the detector defect corrected output image. In (c) the image is normalized using the dose feedback algorithm and in (d) using the histogram analysis (HA) algorithm.

The input images we consider include this dark reference zone; after the dark reference zone is removed from the images, the resolution is 960 x 1024 pixels. In the following subsections, the mapping of two algorithms, namely dose feedback and histogram analysis, is illustrated in detail.

4.1 Dose Feedback

The dose feedback algorithm calculates mean values in regions of interest (ROI) of the image. The region of interest can be changed or adjusted during run-time and is determined by the operator of the system. Therefore, the image is divided into 32 horizontal and 32 vertical blocks, each having a size of 30 x 32 pixels. We use one bit for each image block to encode whether the block contributes to the overall mean value or not. Each block of the image is processed by one multiprocessor on the graphics card. Since the schedule of thread blocks to the multiprocessors is not deterministic, one thread block is needed for each image block, even if the image block does not contribute to the mean value. However, only if the bit for the image block is set do the multiprocessors have to calculate the mean value. Otherwise, they finish execution immediately and the next thread block is assigned to them. Given that the different multiprocessors cannot communicate with each other, the final mean value of all image blocks has to be calculated in two steps. First, each multiprocessor calculates one mean value per image block using a parallel reduction technique and stores it to global memory, as illustrated in Figure 4. In a second step, one multiprocessor calculates the final mean value of the values stored to global memory, again by the use of a reduction [9]. The number of threads executed simultaneously on one multiprocessor is always a multiple of the warp size. However, the size of one block is only 30 in the x-dimension. That is, two threads out of every 32 would work on a different block line, resulting in misaligned memory accesses. Therefore, we use two additional threads per block line that stay idle. Using one thread per pixel would require 30 x 32 = 960 threads for one image block, but only up to 512 threads can be assigned to a single thread block. For that reason, each thread calculates the mean value of multiple pixels, which allows us to use one thread block per image block. The mean value is required on the graphics card for successive kernels as well as on the host system for dose control. Hence, the value is stored in the global memory on the graphics card and downloaded to the host for further processing.

4.2 Histogram Analysis

The histogram analysis algorithm calculates two characteristic values for the image based on the histogram of the image. To create a histogram for a 14-bit image, 2^14 = 16384 bins are required. However, to achieve good performance, the limited shared memory of 16 KB has to be used to create the histogram [10]. That is, either only parts of the histogram or simply smaller histograms can be stored. In our approach, we use smaller histograms that fit completely into shared memory and assign multiple contiguous values to the same bin [11]. This is seen in Algorithm 1, where in the first step one histogram is generated per thread block and stored to global memory. In the second step, these sub-histograms are merged into one final histogram. The characteristic values are also stored to global memory and transferred back to the host for further processing.
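A minimal CUDA sketch of this two-step scheme follows. The names and the bin-reduction factor are hypothetical, and shared-memory atomics are assumed to be available (their availability is discussed next):

    #define NUM_BINS 2048   // 16384 raw values, 8 contiguous values per bin (8 KB shared memory)

    // Step 1: one sub-histogram per thread block, built in shared memory.
    __global__ void partialHist(const unsigned short *img, int numPixels,
                                unsigned int *subHists) {
        __shared__ unsigned int localHist[NUM_BINS];
        for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
            localHist[i] = 0;                       // initialize shared memory
        __syncthreads();
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = gridDim.x * blockDim.x;
        for (int i = idx; i < numPixels; i += stride)
            atomicAdd(&localHist[img[i] >> 3], 1u); // 8 contiguous values map to one bin
        __syncthreads();
        for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
            subHists[blockIdx.x * NUM_BINS + i] = localHist[i];
    }

    // Step 2: merge the sub-histograms using a single thread block.
    __global__ void mergeHists(const unsigned int *subHists, int numSubHists,
                               unsigned int *hist) {
        for (int bin = threadIdx.x; bin < NUM_BINS; bin += blockDim.x) {
            unsigned int sum = 0;
            for (int h = 0; h < numSubHists; ++h)
                sum += subHists[h * NUM_BINS + bin];
            hist[bin] = sum;
        }
    }

The sketch assumes the pixel values fit in 14 bits; shifting by three bits implements the assignment of multiple contiguous values to the same bin mentioned above.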
In addition to the problem of limited shared memory, there exist race conditions when several threads try to update the same bin of the histogram, that is, some bin updates may be lost. In our implementation, we use atomic functions to update the bins of the histogram and avoid the race conditions as long as the graphics card supports them. Unfortunately, atomic functions are only available on newer hardware. On hardware without them, we use five bits of the bin counter as a tag to ensure that the bin was successfully updated, using a compare-and-swap mechanism: a thread writes its tag together with the incremented count and retries until it reads its own tag back.

5. RESULTS

In this section, the results of the detector defect correction pipeline on the graphics card are discussed. This is, on the one hand, the performance of the pipeline itself and of a streaming application applying the detector defect correction pipeline to a series of images. On the other hand, we investigate the latency to get the parameters required for dose control from the graphics card.

Figure 4: Dose feedback: Image tiled into 32 regions of interest in each dimension for an image of 960 x 1024 pixels. In the first step, the mean value of each ROI is calculated and stored to global memory if the corresponding df bit is set. In a second step, the average of the mean values stored in global memory is calculated.

Algorithm 1: Histogram generation on the graphics card
Step 1: Create sub-histograms in parallel; one per thread block:
  forall thread blocks B do in parallel
    for each thread t in thread block do in parallel
      initialize_shared_memory(local_hist, t);
      synchronize;
      value ← get_pixel_value(image, t);
      index ← get_bin_number(value);
      atomic_inc(local_hist, index);
      synchronize;
      global_histogram[block_index][t] ← local_hist(t);
    end
  end
Step 2: Merge sub-histograms in parallel using one thread block:
  for each thread t in thread block do in parallel
    sum ← 0;
    for i ← 0 to number_of_histograms do
      sum ← sum + global_histogram[i][t];
    end
    global_histogram[t] ← sum;
  end
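To make the two-step mean computation of Figure 4 concrete as well, the following minimal CUDA sketch computes one block mean with a shared-memory tree reduction. The names are hypothetical, the alignment padding discussed in Section 4.1 is omitted for brevity, and the per-thread loop over several pixels reflects the fact that a 30 x 32 block has more pixels than threads:

    #define ROI_W 30
    #define ROI_H 32
    #define THREADS 256                 // threads per block, a multiple of the warp size

    // Step 1: one thread block per ROI; each thread accumulates several pixels,
    // then a tree reduction in shared memory produces the block sum.
    __global__ void roiMean(const float *img, int imgWidth, float *blockMeans,
                            const unsigned int *dfBits) {
        int roi = blockIdx.y * gridDim.x + blockIdx.x;
        if (!(dfBits[roi >> 5] & (1u << (roi & 31))))
            return;                     // ROI not selected: the block finishes immediately
        __shared__ float partial[THREADS];
        float sum = 0.0f;
        for (int i = threadIdx.x; i < ROI_W * ROI_H; i += THREADS) {
            int x = blockIdx.x * ROI_W + i % ROI_W;
            int y = blockIdx.y * ROI_H + i / ROI_W;
            sum += img[y * imgWidth + x];
        }
        partial[threadIdx.x] = sum;
        __syncthreads();
        for (int s = THREADS / 2; s > 0; s >>= 1) {  // tree reduction in shared memory
            if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            blockMeans[roi] = partial[0] / (ROI_W * ROI_H);
    }

A second, single-block kernel then averages the values in blockMeans with the same reduction pattern, corresponding to the second step in Figure 4.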

5.1 Detector Defect Correction Pipeline

The timings of all algorithms that are part of the detector defect correction pipeline are shown in Table 2. The first column displays the name of the kernel in the detector defect correction pipeline. The second and third columns show the execution time on the high-end Quadro FX 5800 graphics card and the low-end GeForce 9400M mobile graphics card. The reference implementation running on one core of a Core 2 Quad processor is listed in the last column. Some algorithms of the detector defect correction pipeline consist of multiple kernel launches on the graphics card, for example in order to perform a reduction and to use the reduced value later on. The execution times include only the time to execute the kernels of the detector defect correction pipeline, that is, the data already resides in global memory of the graphics card. The image data is read from that memory, processed by the streaming processors, and stored back to global memory. Most of the kernels, which take about 0.20 ms to complete, are memory bound. That is, the streaming processors perform only a few instructions per data element and wait most of the time for data. Such algorithms are for instance ALG8 and ALG10. Computationally more intensive kernels require more execution time. The times for the low-end mobile graphics card are in the same range as the reference implementation. This is particularly interesting, since the capabilities of the GeForce 9400M are inferior to those of the Quadro FX 5800 and the implementation was not optimized for the mobile graphics card. To be accurate, the missing memory access abstractions of the GeForce 9400M cause penalties of a factor of up to 4 compared to the Quadro FX 5800. Also missing is the support for atomic operations, which required implementing a manual compare-and-swap mechanism in order to calculate histograms. The total execution times in the last three rows correspond to the execution of the detector defect correction pipeline using dose feedback, histogram analysis, and global gain, respectively. For the first two alternatives, parameters for dose control are calculated. In Figure 5, the speedup of the graphics card implementation compared to the reference implementation is shown. The high-end Quadro FX 5800 achieves a speedup of up to 20x compared to the reference implementation, while the low-end mobile GeForce 9400M is slightly faster than our reference implementation.

Table 2: Execution times in ms of all kernels of the detector defect correction pipeline on a high-end Quadro FX 5800 (240 streaming processors) graphics card, a low-end GeForce 9400M (16 streaming processors) mobile graphics card, and a Core 2 Quad CPU. The rows list the kernels ALG1 to ALG11 (with ALG9 in its DF, HA, and GG variants), the dose feedback (DF) and histogram analysis (HA) kernels, and the totals for the three operation modes.

Figure 5: Achieved speedup executing the detector defect correction pipeline on graphics cards compared to the reference implementation for the different operation modes (DF, HA, GG).

Still, there are a lot of optimization opportunities to improve the performance further. On the one hand, many of the kernels in the detector defect correction pipeline are memory bound and have no data dependencies. Such kernels can be fused into one single kernel, eliminating data transfers to global memory and thereby reducing the execution time by at most the time of the faster kernel (see [12]). On the other hand, the high-end card is not fully utilized at the considered resolution; higher workloads (e.g., higher resolutions) would utilize the graphics card to its full capacity. However, the primary goal of our study was not to optimize the algorithms completely and tune them to one architecture, but to investigate the feasibility and the latency of getting the parameters for dose control.

5.2 Double Buffering

From Table 2 we can see that processing the complete detector defect correction pipeline takes between 4.5 ms and 5.5 ms, depending on the mode. In a real application, however, not only the pure execution times are important, but also the time to feed the graphics card with data. Transferring one input image to the graphics memory takes 1.06 ms. This is one fifth of the time it takes to process the complete detector defect correction pipeline. Therefore, it is essential to reduce or hide the memory transfer times in order to achieve the best performance. To support concepts like double buffering, special memory allocation on the host is required. Copying data asynchronously from host memory to device memory is only possible when the host memory is allocated as page-locked memory. This allows kernel executions to overlap with memory transfers. To manage such concurrency, streams are used. A stream is a sequence of commands that execute in order, while commands from different streams may be executed in parallel. Supported is the parallel execution of a kernel from one stream and an asynchronous memory transfer from a different stream. Assigning different image iterations to different streams allows fetching the image for iteration i_(n+1) while iteration i_n is processed (i.e., while the kernels for iteration i_n are executed). Page-locked memory also allows faster memory transfers: transferring the image from page-locked memory takes only 0.65 ms. Moreover, kernels can access page-locked memory directly, that is, it is no longer necessary to copy data to graphics card memory before launching a kernel. This is possible since page-locked memory is not pageable by the operating system and consequently can be accessed from the graphics card without the support of the host.
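A minimal sketch of such a double-buffering loop with two CUDA streams follows; the kernel name and the reduction of the whole pipeline to a single kernel call per image are our simplifications:

    #include <cuda_runtime.h>
    #include <cstring>

    // Stand-in for the complete pipeline (hypothetical; one kernel per image here).
    __global__ void processImage(float *img, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) img[i] *= 2.0f;
    }

    void streamImages(const float *h_images, int numImages, int numPixels) {
        size_t bytes = (size_t)numPixels * sizeof(float);
        cudaStream_t stream[2];
        float *d_buf[2], *h_pinned;
        cudaHostAlloc((void **)&h_pinned, 2 * bytes, cudaHostAllocDefault); // page-locked
        for (int s = 0; s < 2; ++s) {
            cudaStreamCreate(&stream[s]);
            cudaMalloc((void **)&d_buf[s], bytes);
        }
        for (int i = 0; i < numImages; ++i) {
            int s = i % 2;                          // alternate streams and buffers
            cudaStreamSynchronize(stream[s]);       // slot s is free again (image i-2 done)
            memcpy(h_pinned + (size_t)s * numPixels, h_images + (size_t)i * numPixels, bytes);
            // The upload of image i overlaps with the kernel of image i-1,
            // which is still running in the other stream.
            cudaMemcpyAsync(d_buf[s], h_pinned + (size_t)s * numPixels, bytes,
                            cudaMemcpyHostToDevice, stream[s]);
            processImage<<<(numPixels + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], numPixels);
        }
        cudaDeviceSynchronize();
        for (int s = 0; s < 2; ++s) { cudaFree(d_buf[s]); cudaStreamDestroy(stream[s]); }
        cudaFreeHost(h_pinned);
    }

Because each command queue executes in order, the kernel of iteration i automatically waits for its own upload, while the upload of iteration i+1 proceeds concurrently in the other stream.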

Figure 6: Gantt chart of the detector defect correction pipeline, processing ten images: (a) uses normal synchronous memory transfers, (b) uses faster page-locked memory, (c) uses zero-copy to transfer data, and (d) uses asynchronous memory transfers. Two streams are used for asynchronous memory transfers: while one stream transfers the next image to the graphics memory, the current image is processed on the graphics card.

Kernels can directly read and write to page-locked host memory. However, this only makes sense for the first and last kernel of the image pipeline, respectively. For those kernels, the image has to be fetched from the host memory and stored back again. Instead of doing this using memory transfers, the kernels can directly read/write host memory using zero-copy. Zero-copy makes asynchronous memory transfers superfluous. The more computation is done in the first kernel, the more memory transfer time can be hidden. Figure 6 shows the activity on the graphics card for the above mentioned variants over ten iterations. Only when double buffering is used are two streams required to transfer the next image asynchronously. At the beginning, the first image has to be at hand before the two streams can use asynchronous data transfers to hide the data transfers of the successive iterations. Each kernel launch and memory transfer is denoted by its own bar in the Gantt chart.

Using the double buffering implementation, most of the data transfers can be hidden, as seen in Table 3. Compared to the execution time of 10 iterations with no data transfers, using only one stream and synchronous memory transfers adds about 10 ms for the data transfers. Using page-locked memory, the execution time is reduced; using zero-copy takes the same time. With zero-copy, almost none of the memory transfer time could be hidden, since the first kernel of the pipeline is memory bound. The best solution uses asynchronous memory transfers, where 83 % of the data transfer overhead could be hidden. This is almost the complete memory transfer overhead apart from the transfer of the first image.

5.3 Dose Control Parameter Latency

The parameters for dose control are calculated during dose feedback or histogram analysis and are stored on the graphics card. In order to obtain and pass these parameters without interrupting the detector defect correction pipeline, we use asynchronous data transfers to download the parameters to the host while the next kernels are executed on the graphics card. Doing so allows us to pass the parameters as fast as possible to dose control. Table 4 shows the latency from the moment when the image is in the host memory to the point where the parameters for dose control have been downloaded to the host and can be passed to dose control. For all alternatives but the naive synchronous memory transfer model, the latency for the dose control parameters is below 5 ms and thus meets the 20 ms latency requirement.

6. CONCLUSIONS

In this paper, a memory intensive preprocessing pipeline for medical imaging with strict latency requirements was investigated, and it was shown that it is feasible to map and implement it on a graphics card. While current medical setups use different hardware systems for preprocessing and postprocessing, we use the same platform for both, which allows the complexity of the system to be reduced. A detector defect correction pipeline with more than ten algorithms was mapped to the graphics hardware. The pipeline compensates for different effects caused by a detector during exposure of X-ray images and calculates parameters to control the subsequent dosage. Medical images were processed in less than 5 ms, 20x faster than on a standard CPU. The time critical part of the pipeline, the calculation and forwarding of dose parameters to the host and to an external system, was also done in less than 5 ms, which fulfills the 20 ms maximal latency requirement throughout. For deployment in medical environments with streams of images, we showed that the memory transfer overhead of successive images to the graphics card memory is reduced by 83 % using asynchronous memory transfers. Zero-copy additionally relieves the programmer from data transfer management and transfers the data implicitly from main memory. However, using zero-copy only 36 % of the transfer overhead was saved.

Table 3: Execution time for 10 iterations of the detector defect correction pipeline for the different memory management approaches (no data transfers, synchronous memory transfers, page-locked memory, zero-copy, and asynchronous memory transfers) when no X server is running.

Table 4: Latency to get the parameters for dose control from the graphics card. The latency spans from the time when the image is still on the host until the parameters are available on the host.
                                     DF         HA
    Synchronous memory transfers     5.14 ms    5.66 ms
    Page-locked memory               4.23 ms    4.76 ms
    Zero-copy                        4.27 ms    4.82 ms
    Asynchronous memory transfers    4.35 ms    4.89 ms

The control flow required to interact with the detector defect correction pipeline is implemented using CUDA and OpenGL. Different modes are supported and can be selected on a per-frame basis. Actually, the current preprocessing and visualization leave much room for further processing and would still meet the latency requirements. This allows medical systems in cardiology and angiography to reduce the complexity of existing systems and to use one common system for preprocessing and postprocessing, whereas current setups use different systems and hardware platforms. To guarantee the latency requirements for dose feedback, a fixed number of multiprocessors could be reserved exclusively for preprocessing in the upcoming Fermi architecture of NVIDIA [13].

REFERENCES

[1] Manavski, S. and Valle, G., "CUDA Compatible GPU Cards as Efficient Hardware Accelerators for Smith-Waterman Sequence Alignment," BMC Bioinformatics 9(Suppl 2), S10 (2008).
[2] Preis, T., Virnau, P., Paul, W., and Schneider, J., "GPU Accelerated Monte Carlo Simulation of the 2D and 3D Ising Model," Journal of Computational Physics 228(12) (2009).
[3] Micikevicius, P., "3D Finite Difference Computation on GPUs using CUDA," in Proceedings of the 2nd Workshop on General Purpose Processing on Graphics Processing Units, 79-84, ACM (2009).
[4] Stone, S., Haldar, J., Tsao, S., Hwu, W.-M., Liang, Z., and Sutton, B., "Accelerating Advanced MRI Reconstructions on GPUs," in Proceedings of the 2008 Conference on Computing Frontiers (2008).
[5] Xu, F. and Mueller, K., "Real-Time 3D Computed Tomographic Reconstruction using Commodity Graphics Hardware," Physics in Medicine and Biology 52 (2007).
[6] Reichl, T., Passenger, J., Acosta, O., and Salvado, O., "Ultrasound goes GPU: Real-Time Simulation using CUDA," Progress in Biomedical Optics and Imaging 10(37) (2009).
[7] Lindholm, E., Nickolls, J., Oberman, S., and Montrym, J., "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro 28(2) (2008).
[8] Owens, J., Houston, M., Luebke, D., Green, S., Stone, J., and Phillips, J., "GPU Computing," Proceedings of the IEEE 96(5) (2008).
[9] Sengupta, S., Harris, M., Zhang, Y., and Owens, J., "Scan Primitives for GPU Computing," in Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, Eurographics Association (2007).
[10] Shams, R. and Kennedy, R. A., "Efficient Histogram Algorithms for NVIDIA CUDA Compatible Devices," in Proceedings of the International Conference on Signal Processing and Communications Systems (ICSPCS) (2007).
[11] Scott, D., "On Optimal and Data-Based Histograms," Biometrika 66(3), 605 (1979).
[12] Membarth, R., Hannig, F., Dutta, H., and Teich, J., "Efficient Mapping of Multiresolution Image Filtering Algorithms on Graphics Processors," in Proceedings of the 9th International Workshop on Systems, Architectures, Modeling, and Simulation (SAMOS Workshop) (2009).
[13] NVIDIA Corporation, "NVIDIA Whitepaper: NVIDIA's Next Generation CUDA Compute Architecture: Fermi," Architecture_Whitepaper.pdf (2009).


More information

Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries

Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries Shin Morishima 1 and Hiroki Matsutani 1,2,3 1Keio University, 3 14 1 Hiyoshi, Kohoku ku, Yokohama, Japan 2National Institute

More information

IBM CELL CELL INTRODUCTION. Project made by: Origgi Alessandro matr. 682197 Teruzzi Roberto matr. 682552 IBM CELL. Politecnico di Milano Como Campus

IBM CELL CELL INTRODUCTION. Project made by: Origgi Alessandro matr. 682197 Teruzzi Roberto matr. 682552 IBM CELL. Politecnico di Milano Como Campus Project made by: Origgi Alessandro matr. 682197 Teruzzi Roberto matr. 682552 CELL INTRODUCTION 2 1 CELL SYNERGY Cell is not a collection of different processors, but a synergistic whole Operation paradigms,

More information

MONTE-CARLO SIMULATION OF AMERICAN OPTIONS WITH GPUS. Julien Demouth, NVIDIA

MONTE-CARLO SIMULATION OF AMERICAN OPTIONS WITH GPUS. Julien Demouth, NVIDIA MONTE-CARLO SIMULATION OF AMERICAN OPTIONS WITH GPUS Julien Demouth, NVIDIA STAC-A2 BENCHMARK STAC-A2 Benchmark Developed by banks Macro and micro, performance and accuracy Pricing and Greeks for American

More information

Parallel Image Processing with CUDA A case study with the Canny Edge Detection Filter

Parallel Image Processing with CUDA A case study with the Canny Edge Detection Filter Parallel Image Processing with CUDA A case study with the Canny Edge Detection Filter Daniel Weingaertner Informatics Department Federal University of Paraná - Brazil Hochschule Regensburg 02.05.2011 Daniel

More information

HPC with Multicore and GPUs

HPC with Multicore and GPUs HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware

More information

Optimization. NVIDIA OpenCL Best Practices Guide. Version 1.0

Optimization. NVIDIA OpenCL Best Practices Guide. Version 1.0 Optimization NVIDIA OpenCL Best Practices Guide Version 1.0 August 10, 2009 NVIDIA OpenCL Best Practices Guide REVISIONS Original release: July 2009 ii August 16, 2009 Table of Contents Preface... v What

More information

L20: GPU Architecture and Models

L20: GPU Architecture and Models L20: GPU Architecture and Models scribe(s): Abdul Khalifa 20.1 Overview GPUs (Graphics Processing Units) are large parallel structure of processing cores capable of rendering graphics efficiently on displays.

More information

DELL. Virtual Desktop Infrastructure Study END-TO-END COMPUTING. Dell Enterprise Solutions Engineering

DELL. Virtual Desktop Infrastructure Study END-TO-END COMPUTING. Dell Enterprise Solutions Engineering DELL Virtual Desktop Infrastructure Study END-TO-END COMPUTING Dell Enterprise Solutions Engineering 1 THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL ERRORS AND TECHNICAL

More information

15-418 Final Project Report. Trading Platform Server

15-418 Final Project Report. Trading Platform Server 15-418 Final Project Report Yinghao Wang yinghaow@andrew.cmu.edu May 8, 214 Trading Platform Server Executive Summary The final project will implement a trading platform server that provides back-end support

More information

The Fastest, Most Efficient HPC Architecture Ever Built

The Fastest, Most Efficient HPC Architecture Ever Built Whitepaper NVIDIA s Next Generation TM CUDA Compute Architecture: TM Kepler GK110 The Fastest, Most Efficient HPC Architecture Ever Built V1.0 Table of Contents Kepler GK110 The Next Generation GPU Computing

More information

GPUs for Scientific Computing

GPUs for Scientific Computing GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles mike.giles@maths.ox.ac.uk Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research

More information

Long-term monitoring of apparent latency in PREEMPT RT Linux real-time systems

Long-term monitoring of apparent latency in PREEMPT RT Linux real-time systems Long-term monitoring of apparent latency in PREEMPT RT Linux real-time systems Carsten Emde Open Source Automation Development Lab (OSADL) eg Aichhalder Str. 39, 78713 Schramberg, Germany C.Emde@osadl.org

More information

Control 2004, University of Bath, UK, September 2004

Control 2004, University of Bath, UK, September 2004 Control, University of Bath, UK, September ID- IMPACT OF DEPENDENCY AND LOAD BALANCING IN MULTITHREADING REAL-TIME CONTROL ALGORITHMS M A Hossain and M O Tokhi Department of Computing, The University of

More information

High Performance GPU-based Preprocessing for Time-of-Flight Imaging in Medical Applications

High Performance GPU-based Preprocessing for Time-of-Flight Imaging in Medical Applications High Performance GPU-based Preprocessing for Time-of-Flight Imaging in Medical Applications Jakob Wasza 1, Sebastian Bauer 1, Joachim Hornegger 1,2 1 Pattern Recognition Lab, Friedrich-Alexander University

More information

Generations of the computer. processors.

Generations of the computer. processors. . Piotr Gwizdała 1 Contents 1 st Generation 2 nd Generation 3 rd Generation 4 th Generation 5 th Generation 6 th Generation 7 th Generation 8 th Generation Dual Core generation Improves and actualizations

More information

Interactive Level-Set Segmentation on the GPU

Interactive Level-Set Segmentation on the GPU Interactive Level-Set Segmentation on the GPU Problem Statement Goal Interactive system for deformable surface manipulation Level-sets Challenges Deformation is slow Deformation is hard to control Solution

More information

CUDA programming on NVIDIA GPUs

CUDA programming on NVIDIA GPUs p. 1/21 on NVIDIA GPUs Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford-Man Institute for Quantitative Finance Oxford eresearch Centre p. 2/21 Overview hardware view

More information

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Programming models for heterogeneous computing Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Talk outline [30 slides] 1. Introduction [5 slides] 2.

More information

Fast Implementations of AES on Various Platforms

Fast Implementations of AES on Various Platforms Fast Implementations of AES on Various Platforms Joppe W. Bos 1 Dag Arne Osvik 1 Deian Stefan 2 1 EPFL IC IIF LACAL, Station 14, CH-1015 Lausanne, Switzerland {joppe.bos, dagarne.osvik}@epfl.ch 2 Dept.

More information

Parallel Firewalls on General-Purpose Graphics Processing Units

Parallel Firewalls on General-Purpose Graphics Processing Units Parallel Firewalls on General-Purpose Graphics Processing Units Manoj Singh Gaur and Vijay Laxmi Kamal Chandra Reddy, Ankit Tharwani, Ch.Vamshi Krishna, Lakshminarayanan.V Department of Computer Engineering

More information

GPU-based Decompression for Medical Imaging Applications

GPU-based Decompression for Medical Imaging Applications GPU-based Decompression for Medical Imaging Applications Al Wegener, CTO Samplify Systems 160 Saratoga Ave. Suite 150 Santa Clara, CA 95051 sales@samplify.com (888) LESS-BITS +1 (408) 249-1500 1 Outline

More information

Intel DPDK Boosts Server Appliance Performance White Paper

Intel DPDK Boosts Server Appliance Performance White Paper Intel DPDK Boosts Server Appliance Performance Intel DPDK Boosts Server Appliance Performance Introduction As network speeds increase to 40G and above, both in the enterprise and data center, the bottlenecks

More information

Overview Motivation and applications Challenges. Dynamic Volume Computation and Visualization on the GPU. GPU feature requests Conclusions

Overview Motivation and applications Challenges. Dynamic Volume Computation and Visualization on the GPU. GPU feature requests Conclusions Module 4: Beyond Static Scalar Fields Dynamic Volume Computation and Visualization on the GPU Visualization and Computer Graphics Group University of California, Davis Overview Motivation and applications

More information

On Sorting and Load Balancing on GPUs

On Sorting and Load Balancing on GPUs On Sorting and Load Balancing on GPUs Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology SE-42 Göteborg, Sweden {cederman,tsigas}@chalmers.se Abstract

More information

GPU Computing - CUDA

GPU Computing - CUDA GPU Computing - CUDA A short overview of hardware and programing model Pierre Kestener 1 1 CEA Saclay, DSM, Maison de la Simulation Saclay, June 12, 2012 Atelier AO and GPU 1 / 37 Content Historical perspective

More information

Motivation: Smartphone Market

Motivation: Smartphone Market Motivation: Smartphone Market Smartphone Systems External Display Device Display Smartphone Systems Smartphone-like system Main Camera Front-facing Camera Central Processing Unit Device Display Graphics

More information

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Computer Science 14 (2) 2013 http://dx.doi.org/10.7494/csci.2013.14.2.243 Marcin Pietroń Pawe l Russek Kazimierz Wiatr ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Abstract This paper presents

More information