Detector Defect Correction of Medical Images on Graphics Processors


This is the author's version of the work. The definitive work was published in Proceedings of the SPIE: Medical Imaging 2011: Image Processing, Lake Buena Vista, Orlando, FL, USA, February 12-17, 2011.

Detector Defect Correction of Medical Images on Graphics Processors

Richard Membarth*a, Frank Hannig a, Jürgen Teich a, Gerhard Litz b, and Heinz Hornegger b
a Hardware/Software Co-Design, Department of Computer Science, University of Erlangen-Nuremberg, Germany.
b Siemens Healthcare Sector, H IM AX, Forchheim, Germany.

ABSTRACT

The ever increasing complexity and power dissipation of computer architectures in the last decade blazed the trail for more power efficient parallel architectures. Hence, architectures like field-programmable gate arrays (FPGAs) and in particular graphics cards attained great interest and are consequently adopted for the parallel execution of many number-crunching loop programs from fields like image processing or linear algebra. However, there is little effort to deploy hardly computational, but memory intensive applications to graphics hardware. This paper considers a memory intensive detector defect correction pipeline for medical imaging with strict latency requirements. The image pipeline compensates for different effects caused by the detector during exposure of X-ray images and calculates parameters to control the subsequent dosage. So far, dedicated hardware setups with special processors like DSPs were used for such critical processing. We show that this is feasible today with commodity graphics hardware. Using CUDA as programming model, it is demonstrated that the detector defect correction pipeline consisting of more than ten algorithms is significantly accelerated and that a speedup of 20x can be achieved on NVIDIA's Quadro FX 5800 compared to our reference implementation. For deployment in a streaming application with steadily incoming data, it is shown that the memory transfer overhead of successive images to the graphics card memory is reduced by 83 % using double buffering.

1. INTRODUCTION

In recent years, multi-core and many-core processors have become mainstream due to the implications accompanying the steady increase of transistor count and density in chip designs. Power dissipation became a major problem, and computer designs moved toward simpler and more power efficient designs with multiple cores on a chip. Today, processors used in standard desktop computers host four cores and graphics cards host up to several hundred cores. Consequently, computationally intensive algorithms and programs from different domains were adapted and mapped to these architectures. In particular, graphics cards attracted great interest due to the high computational performance they provide at a relatively low cost and development barrier. Examples of these domains include molecular biology [1], financial options analytics [2], seismic computations [3], and medical image processing [4-6]. In this paper, we present the evaluation of a medical image processing application from the domain of angiography and cardiology. In angiography, the operating doctor passes a catheter into the artery and a small dose of contrast agent is injected. At the same time, a rapid series of radiographs is taken while the radio-opaque fluid passes through the vessels. Another series, taken when the contrast agent has passed through the tissues, visualizes the vessels and artery supply. The imaging is done using X-ray based techniques such as fluoroscopy.
The intensity of the X-rays is recorded by a detector and passed to an imaging system. The imaging system comprises different steps like detector defect correction of the images in the preprocessing and postprocessing. Currently, mainly solutions based on multiple DSPs are used for preprocessing in real-time. In contrast, we map the detector defect correction pipeline of the preprocessing to graphics cards and combine it with the postprocessing and visualization system. The detector defect correction pipeline also calculates parameters to control the dosage for subsequent X-ray acquisitions.

* richard.membarth@cs.fau.de

Figure 1: Tesla architecture (see [8]): 240 streaming processors distributed over 30 multiprocessors. The 30 multiprocessors are partitioned into 10 groups, each comprising 3 multiprocessors, cache, and texture unit.

The maximal latency to calculate and pass the parameters to the X-ray system is 20 ms. For the total pipeline, including postprocessing, 120 ms are available. The pipeline including visualization is also investigated regarding these strict latency requirements. To our knowledge, we are the first to investigate preprocessing pipelines like detector defect correction on graphics cards. Related work in this domain focuses only on computationally intensive parts: MRI reconstruction [4], 3D computed tomographic reconstruction [5], and 3D ultrasound [6] are examples of such computationally intensive algorithms. In contrast, we consider a memory intensive pipeline, namely the detector defect correction pipeline, in this paper. We show that the detector defect correction pipeline can be mapped efficiently to graphics hardware to meet the strict latency requirements. The major advantage for future imaging systems using our approach is that preprocessing and postprocessing can be done on the same system. The remaining paper is organized as follows: Section 2 gives an overview of the hardware architecture used within this paper. The following Section 3 introduces the detector defect correction pipeline, and the mapping to graphics cards is explained in Section 4. The performance and latency investigations are presented in Section 5, while in Section 6 conclusions of this work are drawn and suggestions for future work are given.

2. NVIDIA TESLA ARCHITECTURE

In this section, we present an overview of the Tesla architecture of the Quadro FX 5800, which is used as accelerator for the algorithms studied within this paper. The Tesla is a highly parallel hardware platform with 240 processors integrated on a chip, as depicted in Figure 1. The processors are grouped into 30 streaming multiprocessors. Each multiprocessor comprises eight scalar streaming processors. While the multiprocessors are responsible for scheduling and work distribution, the streaming processors do the calculations. A program executed on the graphics card is called a kernel and is processed in parallel by many threads on the streaming processors. Each thread calculates a small portion of the whole algorithm, for example one pixel of a large image. A batch of these threads is grouped together into a thread block that is scheduled to one multiprocessor and executed by its streaming processors. One thread block can contain up to 512 threads, as specified by the programmer. The complete problem space has to be divided into subproblems such that these can be processed independently within one thread block on one multiprocessor. The multiprocessor always executes a batch of 32 threads, also called a warp, in parallel. NVIDIA calls this streaming multiprocessor architecture single instruction, multiple thread (SIMT) [7]. For all threads of a warp the same instructions are fetched and executed for each thread independently, that is, the threads of one warp can diverge and execute different branches. However, when this occurs, the divergent branches are serialized until both branches merge again. Thereafter, the whole warp is executed in parallel again.
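To illustrate this execution model, the following minimal CUDA sketch (with hypothetical names, not taken from the paper's pipeline) launches one thread per pixel and lets each thread scale a single value:

    // One thread computes one pixel; thread blocks tile the image.
    __global__ void scalePixels(float *img, int width, int height, float gain) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)      // the grid may overhang the image border
            img[y * width + x] *= gain;
    }

    // Launch: 16 x 16 = 256 threads per block (at most 512 are allowed on Tesla).
    void launchScale(float *d_img, int width, int height) {
        dim3 block(16, 16);
        dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
        scalePixels<<<grid, block>>>(d_img, width, height, 1.5f);
    }

The guard against out-of-range coordinates is what allows the grid of thread blocks to cover arbitrary image sizes, with each thread block mapping to one subproblem as described above.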

Each thread executed on a multiprocessor has full read/write access to the 4.0 GB global memory of the graphics card. However, this memory has a long latency of 400 to 800 clock cycles. To hide this long latency, each multiprocessor is capable of managing and switching between up to eight thread blocks, but not more than 1024 threads in total. In addition, 16384 registers and 16 KB of on-chip shared memory are provided to all threads executed simultaneously on one multiprocessor. These memory types are faster than the global memory, but shared between all thread blocks executed on the multiprocessor. The capabilities of the Tesla architecture, and in particular of the used Quadro FX 5800, are summarized in Table 1. Current graphics cards also support asynchronous data transfers between host memory and global memory. This allows kernels to execute on the graphics card while data is transferred to or from the graphics card. Data transfers are handled like normal kernels and assigned to a queue of commands to be processed in order by the GPU. These queues are called streams in CUDA. Commands from different streams can be executed simultaneously as long as one of the commands is a computational kernel and the other an asynchronous data transfer command. This provides support for streaming applications. A feature called zero-copy provides similar benefits. Using zero-copy, data accessed by the streaming multiprocessors may reside in host memory. The data is implicitly transferred from host memory across PCI Express or the front-side bus to the streaming processors. This way, the user does not have to bother with data transfers between the graphics card and the host memory.

3. DETECTOR DEFECT CORRECTION

Setups for medical imaging typically consist of several independent systems responsible for preprocessing and postprocessing of medical images. In the preprocessing stage, the raw image data is conditioned; in the postprocessing step, the data is visualized, overlaid with further information, and tasks like feature extraction are performed. For preprocessing, often dedicated embedded systems based on multiple DSPs are used, and the results are passed to a second system responsible for postprocessing and visualization. Within this paper, we consider a preprocessing pipeline for detector defect correction, which is performed on a graphics card. The benefit is that the same system might be used for preprocessing as well as postprocessing. This makes the data transfers from the preprocessing system to the postprocessing system superfluous, but imposes strict requirements on the complete system:

- For the complete imaging system, a maximal latency of 120 ms from image acquisition to image visualization is required. Larger delays are confusing for the operating doctor and can cause serious injuries up to the perforation of vessels.
- For the detector defect correction, a maximal latency of less than 20 ms from image acquisition to dose feedback is required for a frame rate of 30 frames per second (i.e., to use the dose feedback from image i_n for the next image i_(n+1)).

The detector defect correction pipeline, as shown in Figure 2, has several functions for its deployment in medical setups in angiography or cardiology. First of all, it compensates for detector inherent effects, which have severe impacts on the quality of the recorded image. Examples are pixel-individual offset and gain correction or line offset correction.
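To give an impression of what such a correction step looks like, here is a minimal CUDA sketch of a per-pixel offset and gain correction; the names and the exact formula are hypothetical, but the element-wise, memory-bound structure is representative of these algorithms:

    // Per-pixel correction: corrected = (raw - offset) * gain.
    // One thread per pixel; the offset and gain maps are calibration inputs.
    __global__ void offsetGainCorrect(const float *raw, const float *offsetMap,
                                      const float *gainMap, float *out, int numPixels) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < numPixels)
            out[idx] = (raw[idx] - offsetMap[idx]) * gainMap[idx];
    }

Such kernels perform only a handful of arithmetic operations per three memory loads and one store, which is why the pipeline as a whole is memory bound rather than compute bound.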
Table 1: Hardware capabilities of the Tesla architecture and of the Quadro FX 5800.

    Threads per warp                     32
    Warps per multiprocessor             32
    Threads per block                    512
    Threads per multiprocessor           1024
    Blocks per multiprocessor            8
    Registers per multiprocessor         16384
    Shared memory per multiprocessor     16 KB

Figure 2: Detector defect correction pipeline: A series of kernels is applied to the input data to compensate for different effects introduced by the detector and the recording of the images. The dose feedback (DF, on a given region of interest (ROI)) and histogram analysis (HA) kernels are control driven and used for the dose control of the next images. Dose control can be omitted using global gain (GG).

Defect pixels as well as completely defective lines and columns of the detector are also compensated here. This is done in different independent algorithms (top image pipeline of Figure 2). These algorithms process the raw images acquired by the detector and also require some additional inputs like defect maps, gain images, or other parameters. The input images acquired by the detector are grayscale images with an accuracy of 14 bits. As a second key function of the defect correction pipeline, image synchronous gain control is provided. This is done after the defects of the detector are corrected so that only good images are considered. For the synchronous gain control, three different modes (A to C) are available. The first two modes, dose feedback (DF) and histogram analysis (HA), calculate parameters for the subsequent dosage of the detector. These parameters are passed to an external system, which is responsible for dose control. This system imposes a strict maximal latency of less than 20 ms between image acquisition and the retrieval of the parameters to adjust the dosage for the following image. These parameters are also used in the subsequent algorithm of the pipeline. The third mode, global gain (GG), provides a constant parameter instead. In Figure 3, an image at different stages of the detector defect correction pipeline is shown. The raw input image is shown in Figure 3(a). Here, a pattern of the detector interferes with the acquired image, and at the left and right border a dark reference zone (DRZ) is added. This zone is removed later on. The detector defect corrected image is shown in Figure 3(b). The signal of this image is normalized using the parameters from dose feedback and histogram analysis in Figures 3(c) and (d) for better visual contrast.

4. DETECTOR DEFECT CORRECTION ON GRAPHICS CARDS

We map the detector defect correction as described in the previous section to graphics cards using CUDA as programming model. The preprocessing is combined with a rudimentary postprocessing and visualization using OpenGL. Single-precision 32-bit floating-point numbers are used as number representation on the graphics card for the complete imaging system. Therefore, the input values are converted at the beginning of the pipeline. Floating-point numbers have several advantages over fixed-point representations on DSPs, where the bit-width is explicit and must be managed by the programmer. For example, when programming for DSPs, the bit-width has to be extended by hand when higher accuracy is needed.
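This initial conversion is a simple element-wise kernel. A minimal sketch follows, assuming (this is our assumption, not stated in the paper) that the 14-bit detector values arrive packed in 16-bit words:

    // Convert 14-bit detector values (assumed stored in 16-bit words) to 32-bit floats.
    __global__ void convertToFloat(const unsigned short *raw, float *out, int numPixels) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < numPixels)
            out[idx] = (float)(raw[idx] & 0x3FFF);   // mask to the 14 valid bits
    }

Since 14-bit integers fit losslessly into the 24-bit mantissa of a single-precision float, no accuracy is lost in this conversion.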

Figure 3: Images of the detector defect correction pipeline: (a) shows the raw input image of the detector including a dark reference zone at the left and right border, while (b) shows the detector defect corrected output image. In (c) the image is normalized using the dose feedback algorithm and in (d) using the histogram analysis (HA) algorithm.

The input images we consider include this dark reference zone; after the dark reference zone is removed from the images, the resolution is 960 x 1024 pixels. In the following subsections, the mapping of two algorithms, namely dose feedback and histogram analysis, is illustrated in detail.

4.1 Dose Feedback

The dose feedback algorithm calculates mean values in regions of interest (ROI) of the image. The region of interest can be changed or adjusted during run-time and is determined by the operator of the system. Therefore, the image is divided into 32 horizontal and 32 vertical blocks, each having a size of 30 x 32 pixels. We use one bit for each image block to encode whether the block contributes to the overall mean value or not. Each block of the image is processed by one multiprocessor on the graphics card. Since the schedule of thread blocks to the multiprocessors is not deterministic, one thread block is needed for each image block, even if the image block does not contribute to the mean value. However, only if the bit for the image block is set do the multiprocessors have to calculate the mean value. Otherwise, they finish execution immediately and the next thread block is assigned to them. Given that the different multiprocessors cannot communicate with each other, the final mean value of all image blocks has to be calculated in two steps. First, each multiprocessor calculates one mean value per image block using a parallel reduction technique and stores it to global memory, as illustrated in Figure 4. In a second step, one multiprocessor calculates the final mean value of the values stored to global memory, again by the use of a reduction [9]. The number of threads executed simultaneously on one multiprocessor is always a multiple of the warp size. However, the size of one block is only 30 in the x-dimension. That is, two threads out of every 32 would work on a different block line, resulting in misaligned memory accesses. Therefore, we use two additional threads per block line that stay idle. Using one thread per pixel would require 30 x 32 = 960 threads for one image block, but only up to 512 threads can be assigned to a single thread block. For that reason, each thread calculates the mean value of multiple pixels, which allows us to use one thread block per image block. The mean value is required on the graphics card for successive kernels as well as on the host system for dose control. Hence, the value is stored in the global memory on the graphics card and downloaded to the host for further processing.

4.2 Histogram Analysis

The histogram analysis algorithm calculates two characteristic values for the image based on the histogram of the image. To create a histogram for a 14-bit image, 2^14 = 16384 bins are required. However, to achieve good performance, the limited shared memory of 16 KB has to be used to create the histogram [10]. That is, either only parts of the histogram or simply smaller histograms can be stored. In our approach, we use smaller histograms that fit completely into shared memory and assign multiple contiguous values to the same bin [11]. This is seen in Algorithm 1, where in the first step one histogram is generated per thread block and stored to global memory. In the second step, these sub-histograms are merged into one final histogram. The characteristic values are also stored to global memory and transferred back to the host for further processing.
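A minimal CUDA sketch of this two-step scheme follows. The names and the bin-reduction factor are hypothetical, and shared-memory atomics are assumed to be available (their availability is discussed next):

    #define NUM_BINS 2048   // 16384 raw values, 8 contiguous values per bin (8 KB shared memory)

    // Step 1: one sub-histogram per thread block, built in shared memory.
    __global__ void partialHist(const unsigned short *img, int numPixels,
                                unsigned int *subHists) {
        __shared__ unsigned int localHist[NUM_BINS];
        for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
            localHist[i] = 0;                       // initialize shared memory
        __syncthreads();
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        int stride = gridDim.x * blockDim.x;
        for (int i = idx; i < numPixels; i += stride)
            atomicAdd(&localHist[img[i] >> 3], 1u); // 8 contiguous values map to one bin
        __syncthreads();
        for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
            subHists[blockIdx.x * NUM_BINS + i] = localHist[i];
    }

    // Step 2: merge the sub-histograms using a single thread block.
    __global__ void mergeHists(const unsigned int *subHists, int numSubHists,
                               unsigned int *hist) {
        for (int bin = threadIdx.x; bin < NUM_BINS; bin += blockDim.x) {
            unsigned int sum = 0;
            for (int h = 0; h < numSubHists; ++h)
                sum += subHists[h * NUM_BINS + bin];
            hist[bin] = sum;
        }
    }

The sketch assumes the pixel values fit in 14 bits; shifting by three bits implements the assignment of multiple contiguous values to the same bin mentioned above.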
In addition to the problem of limited shared memory, there exist race conditions when several threads try to update the same bin of the histogram, that is, some bin updates may be lost. In our implementation, we use atomic functions to update the bins of the histogram and avoid the race conditions as long as the graphics card supports them. Unfortunately, atomic functions are only available on newer hardware. On hardware without them, we use five bits of the bin counter as a tag to ensure that the bin was successfully updated, using a compare-and-swap mechanism: a thread writes its tag together with the incremented count and retries until it reads its own tag back.

5. RESULTS

In this section, the results of the detector defect correction pipeline on the graphics card are discussed. This is, on the one hand, the performance of the pipeline itself and of a streaming application applying the detector defect correction pipeline to a series of images. On the other hand, we investigate the latency to get the parameters required for dose control from the graphics card.

Figure 4: Dose feedback: Image tiled into 32 regions of interest in each dimension for an image of 960 x 1024 pixels. In the first step, the mean value of each ROI is calculated and stored to global memory if the corresponding df bit is set. In a second step, the average of the mean values stored in global memory is calculated.

Algorithm 1: Histogram generation on the graphics card
Step 1: Create sub-histograms in parallel; one per thread block:
  forall thread blocks B do in parallel
    for each thread t in thread block do in parallel
      initialize_shared_memory(local_hist, t);
      synchronize;
      value ← get_pixel_value(image, t);
      index ← get_bin_number(value);
      atomic_inc(local_hist, index);
      synchronize;
      global_histogram[block_index][t] ← local_hist(t);
    end
  end
Step 2: Merge sub-histograms in parallel using one thread block:
  for each thread t in thread block do in parallel
    sum ← 0;
    for i ← 0 to number_of_histograms do
      sum ← sum + global_histogram[i][t];
    end
    global_histogram[t] ← sum;
  end
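To make the two-step mean computation of Figure 4 concrete as well, the following minimal CUDA sketch computes one block mean with a shared-memory tree reduction. The names are hypothetical, the alignment padding discussed in Section 4.1 is omitted for brevity, and the per-thread loop over several pixels reflects the fact that a 30 x 32 block has more pixels than threads:

    #define ROI_W 30
    #define ROI_H 32
    #define THREADS 256                 // threads per block, a multiple of the warp size

    // Step 1: one thread block per ROI; each thread accumulates several pixels,
    // then a tree reduction in shared memory produces the block sum.
    __global__ void roiMean(const float *img, int imgWidth, float *blockMeans,
                            const unsigned int *dfBits) {
        int roi = blockIdx.y * gridDim.x + blockIdx.x;
        if (!(dfBits[roi >> 5] & (1u << (roi & 31))))
            return;                     // ROI not selected: the block finishes immediately
        __shared__ float partial[THREADS];
        float sum = 0.0f;
        for (int i = threadIdx.x; i < ROI_W * ROI_H; i += THREADS) {
            int x = blockIdx.x * ROI_W + i % ROI_W;
            int y = blockIdx.y * ROI_H + i / ROI_W;
            sum += img[y * imgWidth + x];
        }
        partial[threadIdx.x] = sum;
        __syncthreads();
        for (int s = THREADS / 2; s > 0; s >>= 1) {  // tree reduction in shared memory
            if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            blockMeans[roi] = partial[0] / (ROI_W * ROI_H);
    }

A second, single-block kernel then averages the values in blockMeans with the same reduction pattern, corresponding to the second step in Figure 4.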

5.1 Detector Defect Correction Pipeline

The timings of all algorithms that are part of the detector defect correction pipeline are shown in Table 2. The first column displays the name of the kernel in the detector defect correction pipeline. The second and third columns show the execution time on the high-end Quadro FX 5800 graphics card and the low-end GeForce 9400M mobile graphics card. The reference implementation running on one core of a Core 2 Quad processor is listed in the last column. Some algorithms of the detector defect correction pipeline consist of multiple kernel launches on the graphics card, for example in order to perform a reduction and to use the reduced value later on. The execution times include only the time to execute the kernels of the detector defect correction pipeline, that is, the data already resides in global memory of the graphics card. The image data is read from that memory, processed by the streaming processors, and stored back to global memory. Most of the kernels, which take about 0.20 ms to complete, are memory bound. That is, the streaming processors perform only a few instructions per data element and wait most of the time for data. Such algorithms are for instance ALG8 and ALG10. Computationally more intensive kernels require more execution time. The times for the low-end mobile graphics card are in the same range as the reference implementation. This is particularly interesting, since the capabilities of the GeForce 9400M are inferior to those of the Quadro FX 5800 and the implementation was not optimized for the mobile graphics card. To be accurate, the missing memory access abstractions of the GeForce 9400M cause penalties of a factor of up to 4 compared to the Quadro FX 5800. Also missing is the support for atomic operations, which required implementing a manual compare-and-swap mechanism in order to calculate histograms. The total execution times in the last three rows correspond to the execution of the detector defect correction pipeline using dose feedback, histogram analysis, and global gain, respectively. For the first two alternatives, parameters for dose control are calculated. In Figure 5, the speedup of the graphics card implementation compared to the reference implementation is shown. The high-end Quadro FX 5800 achieves a speedup of up to 20x compared to the reference implementation, while the low-end mobile GeForce 9400M is slightly faster than our reference implementation.

Table 2: Execution times in ms of all kernels of the detector defect correction pipeline on a high-end Quadro FX 5800 (240 streaming processors) graphics card, a low-end GeForce 9400M (16 streaming processors) mobile graphics card, and a Core 2 Quad CPU. The rows list the kernels ALG1 to ALG11 (with ALG9 in its DF, HA, and GG variants), the dose feedback (DF) and histogram analysis (HA) kernels, and the totals for the three operation modes.

Figure 5: Achieved speedup executing the detector defect correction pipeline on graphics cards compared to the reference implementation for the different operation modes (DF, HA, GG).

Still, there are a lot of optimization opportunities to improve the performance further. On the one hand, many of the kernels in the detector defect correction pipeline are memory bound and have no data dependencies. Such kernels can be fused into one single kernel, eliminating data transfers to global memory and thereby reducing the execution time by at most the time of the faster kernel (see [12]). On the other hand, the high-end card is not fully utilized at the considered resolution; higher workloads (e.g., higher resolutions) would utilize the graphics card to its full capacity. However, the primary goal of our study was not to optimize the algorithms completely and tune them to one architecture, but to investigate the feasibility and the latency of getting the parameters for dose control.

5.2 Double Buffering

From Table 2 we can see that processing the complete detector defect correction pipeline takes between 4.5 ms and 5.5 ms, depending on the mode. In a real application, however, not only the pure execution times are important, but also the time to feed the graphics card with data. Transferring one input image to the graphics memory takes 1.06 ms. This is one fifth of the time it takes to process the complete detector defect correction pipeline. Therefore, it is essential to reduce or hide the memory transfer times in order to achieve the best performance. To support concepts like double buffering, special memory allocation on the host is required. Copying data asynchronously from host memory to device memory is only possible when the host memory is allocated as page-locked memory. This allows kernel executions to overlap with memory transfers. To manage such concurrency, streams are used. A stream is a sequence of commands that execute in order, while commands from different streams may be executed in parallel. Supported is the parallel execution of a kernel from one stream and an asynchronous memory transfer from a different stream. Assigning different image iterations to different streams allows fetching the image for iteration i_(n+1) while iteration i_n is processed (i.e., while the kernels for iteration i_n are executed). Page-locked memory also allows faster memory transfers: transferring the image from page-locked memory takes only 0.65 ms. Moreover, kernels can access page-locked memory directly, that is, it is no longer necessary to copy data to graphics card memory before launching a kernel. This is possible since page-locked memory is not pageable by the operating system and consequently can be accessed from the graphics card without the support of the host.
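A minimal sketch of such a double-buffering loop with two CUDA streams follows; the kernel name and the reduction of the whole pipeline to a single kernel call per image are our simplifications:

    #include <cuda_runtime.h>
    #include <cstring>

    // Stand-in for the complete pipeline (hypothetical; one kernel per image here).
    __global__ void processImage(float *img, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) img[i] *= 2.0f;
    }

    void streamImages(const float *h_images, int numImages, int numPixels) {
        size_t bytes = (size_t)numPixels * sizeof(float);
        cudaStream_t stream[2];
        float *d_buf[2], *h_pinned;
        cudaHostAlloc((void **)&h_pinned, 2 * bytes, cudaHostAllocDefault); // page-locked
        for (int s = 0; s < 2; ++s) {
            cudaStreamCreate(&stream[s]);
            cudaMalloc((void **)&d_buf[s], bytes);
        }
        for (int i = 0; i < numImages; ++i) {
            int s = i % 2;                          // alternate streams and buffers
            cudaStreamSynchronize(stream[s]);       // slot s is free again (image i-2 done)
            memcpy(h_pinned + (size_t)s * numPixels, h_images + (size_t)i * numPixels, bytes);
            // The upload of image i overlaps with the kernel of image i-1,
            // which is still running in the other stream.
            cudaMemcpyAsync(d_buf[s], h_pinned + (size_t)s * numPixels, bytes,
                            cudaMemcpyHostToDevice, stream[s]);
            processImage<<<(numPixels + 255) / 256, 256, 0, stream[s]>>>(d_buf[s], numPixels);
        }
        cudaDeviceSynchronize();
        for (int s = 0; s < 2; ++s) { cudaFree(d_buf[s]); cudaStreamDestroy(stream[s]); }
        cudaFreeHost(h_pinned);
    }

Because each command queue executes in order, the kernel of iteration i automatically waits for its own upload, while the upload of iteration i+1 proceeds concurrently in the other stream.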

Figure 6: Gantt chart of the detector defect correction pipeline, processing ten images: (a) uses normal synchronous memory transfers, (b) uses faster page-locked memory, (c) uses zero-copy to transfer data, and (d) uses asynchronous memory transfers. Two streams are used for asynchronous memory transfers: while one stream transfers the next image to the graphics memory, the current image is processed on the graphics card.

Kernels can directly read and write to page-locked host memory. However, this only makes sense for the first and last kernel of the image pipeline, respectively. For those kernels, the image has to be fetched from the host memory and stored back again. Instead of doing this using memory transfers, the kernels can directly read/write host memory using zero-copy. Zero-copy makes asynchronous memory transfers superfluous. The more computation is done in the first kernel, the more memory transfer time can be hidden. Figure 6 shows the activity on the graphics card for the above mentioned variants over ten iterations. Only when double buffering is used are two streams required to transfer the next image asynchronously. At the beginning, the first image has to be at hand before the two streams can use asynchronous data transfers to hide the data transfers of the successive iterations. Each kernel launch and memory transfer is denoted by its own bar in the Gantt chart.

Using the double buffering implementation, most of the data transfers can be hidden, as seen in Table 3. Compared to the execution time of 10 iterations with no data transfers, using only one stream and synchronous memory transfers adds about 10 ms for the data transfers. Using page-locked memory, the execution time is reduced; using zero-copy takes the same time. With zero-copy, almost none of the memory transfer time could be hidden, since the first kernel of the pipeline is memory bound. The best solution uses asynchronous memory transfers, where 83 % of the data transfer overhead could be hidden. This is almost the complete memory transfer overhead apart from the transfer of the first image.

5.3 Dose Control Parameter Latency

The parameters for dose control are calculated during dose feedback or histogram analysis and are stored on the graphics card. In order to obtain and pass these parameters without interrupting the detector defect correction pipeline, we use asynchronous data transfers to download the parameters to the host while the next kernels are executed on the graphics card. Doing so allows us to pass the parameters as fast as possible to dose control. Table 4 shows the latency from the moment when the image is in the host memory to the point where the parameters for dose control have been downloaded to the host and can be passed to dose control. For all alternatives but the naive synchronous memory transfer model, the latency for the dose control parameters is below 5 ms and thus meets the 20 ms latency requirement.

6. CONCLUSIONS

In this paper, a memory intensive preprocessing pipeline for medical imaging with strict latency requirements was investigated, and it was shown that it is feasible to map and implement it on a graphics card. While current medical setups use different hardware systems for preprocessing and postprocessing, we use the same platform for both, which allows the complexity of the system to be reduced. A detector defect correction pipeline with more than ten algorithms was mapped to the graphics hardware. The pipeline compensates for different effects caused by a detector during exposure of X-ray images and calculates parameters to control the subsequent dosage. Medical images were processed in less than 5 ms, 20x faster than on a standard CPU. The time critical part of the pipeline, the calculation and forwarding of dose parameters to the host and to an external system, was also done in less than 5 ms, which fulfills the 20 ms maximal latency requirement throughout. For deployment in medical environments with streams of images, we showed that the memory transfer overhead of successive images to the graphics card memory is reduced by 83 % using asynchronous memory transfers. Zero-copy additionally relieves the programmer from data transfer management and transfers the data implicitly from main memory. However, using zero-copy only 36 % of the transfer overhead was saved.

Table 3: Execution time for 10 iterations of the detector defect correction pipeline for the different memory management approaches (no data transfers, synchronous memory transfers, page-locked memory, zero-copy, and asynchronous memory transfers) when no X server is running.

Table 4: Latency to get the parameters for dose control from the graphics card. The latency spans from the time when the image is still on the host until the parameters are available on the host.
                                     DF         HA
    Synchronous memory transfers     5.14 ms    5.66 ms
    Page-locked memory               4.23 ms    4.76 ms
    Zero-copy                        4.27 ms    4.82 ms
    Asynchronous memory transfers    4.35 ms    4.89 ms

The control flow required to interact with the detector defect correction pipeline is implemented using CUDA and OpenGL. Different modes are supported and can be selected on a per-frame basis. Actually, the current preprocessing and visualization leave much room for further processing and would still meet the latency requirements. This allows medical systems in cardiology and angiography to reduce the complexity of existing systems and to use one common system for preprocessing and postprocessing, whereas current setups use different systems and hardware platforms. To guarantee the latency requirements for dose feedback, a fixed number of multiprocessors could be reserved exclusively for preprocessing in the upcoming Fermi architecture of NVIDIA [13].

REFERENCES

[1] Manavski, S. and Valle, G., "CUDA Compatible GPU Cards as Efficient Hardware Accelerators for Smith-Waterman Sequence Alignment," BMC Bioinformatics 9(Suppl 2), S10 (2008).
[2] Preis, T., Virnau, P., Paul, W., and Schneider, J., "GPU Accelerated Monte Carlo Simulation of the 2D and 3D Ising Model," Journal of Computational Physics 228(12) (2009).
[3] Micikevicius, P., "3D Finite Difference Computation on GPUs using CUDA," in Proceedings of the 2nd Workshop on General Purpose Processing on Graphics Processing Units, 79-84, ACM (2009).
[4] Stone, S., Haldar, J., Tsao, S., Hwu, W.-M., Liang, Z., and Sutton, B., "Accelerating Advanced MRI Reconstructions on GPUs," in Proceedings of the 2008 Conference on Computing Frontiers (2008).
[5] Xu, F. and Mueller, K., "Real-Time 3D Computed Tomographic Reconstruction using Commodity Graphics Hardware," Physics in Medicine and Biology 52 (2007).
[6] Reichl, T., Passenger, J., Acosta, O., and Salvado, O., "Ultrasound goes GPU: Real-Time Simulation using CUDA," Progress in Biomedical Optics and Imaging 10(37) (2009).
[7] Lindholm, E., Nickolls, J., Oberman, S., and Montrym, J., "NVIDIA Tesla: A Unified Graphics and Computing Architecture," IEEE Micro 28(2) (2008).
[8] Owens, J., Houston, M., Luebke, D., Green, S., Stone, J., and Phillips, J., "GPU Computing," Proceedings of the IEEE 96(5) (2008).
[9] Sengupta, S., Harris, M., Zhang, Y., and Owens, J., "Scan Primitives for GPU Computing," in Proceedings of the 22nd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, Eurographics Association (2007).
[10] Shams, R. and Kennedy, R. A., "Efficient Histogram Algorithms for NVIDIA CUDA Compatible Devices," in Proceedings of the International Conference on Signal Processing and Communications Systems (ICSPCS) (2007).
[11] Scott, D., "On Optimal and Data-Based Histograms," Biometrika 66(3), 605 (1979).
[12] Membarth, R., Hannig, F., Dutta, H., and Teich, J., "Efficient Mapping of Multiresolution Image Filtering Algorithms on Graphics Processors," in Proceedings of the 9th International Workshop on Systems, Architectures, Modeling, and Simulation (SAMOS Workshop) (2009).
[13] NVIDIA Corporation, "NVIDIA Whitepaper: NVIDIA's Next Generation CUDA Compute Architecture: Fermi," Architecture_Whitepaper.pdf (2009).


More information

Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries

Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries Shin Morishima 1 and Hiroki Matsutani 1,2,3 1Keio University, 3 14 1 Hiyoshi, Kohoku ku, Yokohama, Japan 2National Institute

More information

IBM CELL CELL INTRODUCTION. Project made by: Origgi Alessandro matr. 682197 Teruzzi Roberto matr. 682552 IBM CELL. Politecnico di Milano Como Campus

IBM CELL CELL INTRODUCTION. Project made by: Origgi Alessandro matr. 682197 Teruzzi Roberto matr. 682552 IBM CELL. Politecnico di Milano Como Campus Project made by: Origgi Alessandro matr. 682197 Teruzzi Roberto matr. 682552 CELL INTRODUCTION 2 1 CELL SYNERGY Cell is not a collection of different processors, but a synergistic whole Operation paradigms,

More information

MONTE-CARLO SIMULATION OF AMERICAN OPTIONS WITH GPUS. Julien Demouth, NVIDIA

MONTE-CARLO SIMULATION OF AMERICAN OPTIONS WITH GPUS. Julien Demouth, NVIDIA MONTE-CARLO SIMULATION OF AMERICAN OPTIONS WITH GPUS Julien Demouth, NVIDIA STAC-A2 BENCHMARK STAC-A2 Benchmark Developed by banks Macro and micro, performance and accuracy Pricing and Greeks for American

More information

Parallel Image Processing with CUDA A case study with the Canny Edge Detection Filter

Parallel Image Processing with CUDA A case study with the Canny Edge Detection Filter Parallel Image Processing with CUDA A case study with the Canny Edge Detection Filter Daniel Weingaertner Informatics Department Federal University of Paraná - Brazil Hochschule Regensburg 02.05.2011 Daniel

More information

HPC with Multicore and GPUs

HPC with Multicore and GPUs HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware

More information

Optimization. NVIDIA OpenCL Best Practices Guide. Version 1.0

Optimization. NVIDIA OpenCL Best Practices Guide. Version 1.0 Optimization NVIDIA OpenCL Best Practices Guide Version 1.0 August 10, 2009 NVIDIA OpenCL Best Practices Guide REVISIONS Original release: July 2009 ii August 16, 2009 Table of Contents Preface... v What

More information

L20: GPU Architecture and Models

L20: GPU Architecture and Models L20: GPU Architecture and Models scribe(s): Abdul Khalifa 20.1 Overview GPUs (Graphics Processing Units) are large parallel structure of processing cores capable of rendering graphics efficiently on displays.

More information

DELL. Virtual Desktop Infrastructure Study END-TO-END COMPUTING. Dell Enterprise Solutions Engineering

DELL. Virtual Desktop Infrastructure Study END-TO-END COMPUTING. Dell Enterprise Solutions Engineering DELL Virtual Desktop Infrastructure Study END-TO-END COMPUTING Dell Enterprise Solutions Engineering 1 THIS WHITE PAPER IS FOR INFORMATIONAL PURPOSES ONLY, AND MAY CONTAIN TYPOGRAPHICAL ERRORS AND TECHNICAL

More information

15-418 Final Project Report. Trading Platform Server

15-418 Final Project Report. Trading Platform Server 15-418 Final Project Report Yinghao Wang yinghaow@andrew.cmu.edu May 8, 214 Trading Platform Server Executive Summary The final project will implement a trading platform server that provides back-end support

More information

The Fastest, Most Efficient HPC Architecture Ever Built

The Fastest, Most Efficient HPC Architecture Ever Built Whitepaper NVIDIA s Next Generation TM CUDA Compute Architecture: TM Kepler GK110 The Fastest, Most Efficient HPC Architecture Ever Built V1.0 Table of Contents Kepler GK110 The Next Generation GPU Computing

More information

GPUs for Scientific Computing

GPUs for Scientific Computing GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles mike.giles@maths.ox.ac.uk Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research

More information

Long-term monitoring of apparent latency in PREEMPT RT Linux real-time systems

Long-term monitoring of apparent latency in PREEMPT RT Linux real-time systems Long-term monitoring of apparent latency in PREEMPT RT Linux real-time systems Carsten Emde Open Source Automation Development Lab (OSADL) eg Aichhalder Str. 39, 78713 Schramberg, Germany C.Emde@osadl.org

More information

Control 2004, University of Bath, UK, September 2004

Control 2004, University of Bath, UK, September 2004 Control, University of Bath, UK, September ID- IMPACT OF DEPENDENCY AND LOAD BALANCING IN MULTITHREADING REAL-TIME CONTROL ALGORITHMS M A Hossain and M O Tokhi Department of Computing, The University of

More information

High Performance GPU-based Preprocessing for Time-of-Flight Imaging in Medical Applications

High Performance GPU-based Preprocessing for Time-of-Flight Imaging in Medical Applications High Performance GPU-based Preprocessing for Time-of-Flight Imaging in Medical Applications Jakob Wasza 1, Sebastian Bauer 1, Joachim Hornegger 1,2 1 Pattern Recognition Lab, Friedrich-Alexander University

More information

Generations of the computer. processors.

Generations of the computer. processors. . Piotr Gwizdała 1 Contents 1 st Generation 2 nd Generation 3 rd Generation 4 th Generation 5 th Generation 6 th Generation 7 th Generation 8 th Generation Dual Core generation Improves and actualizations

More information

Interactive Level-Set Segmentation on the GPU

Interactive Level-Set Segmentation on the GPU Interactive Level-Set Segmentation on the GPU Problem Statement Goal Interactive system for deformable surface manipulation Level-sets Challenges Deformation is slow Deformation is hard to control Solution

More information

CUDA programming on NVIDIA GPUs

CUDA programming on NVIDIA GPUs p. 1/21 on NVIDIA GPUs Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford-Man Institute for Quantitative Finance Oxford eresearch Centre p. 2/21 Overview hardware view

More information

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Programming models for heterogeneous computing Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Talk outline [30 slides] 1. Introduction [5 slides] 2.

More information

Fast Implementations of AES on Various Platforms

Fast Implementations of AES on Various Platforms Fast Implementations of AES on Various Platforms Joppe W. Bos 1 Dag Arne Osvik 1 Deian Stefan 2 1 EPFL IC IIF LACAL, Station 14, CH-1015 Lausanne, Switzerland {joppe.bos, dagarne.osvik}@epfl.ch 2 Dept.

More information

Parallel Firewalls on General-Purpose Graphics Processing Units

Parallel Firewalls on General-Purpose Graphics Processing Units Parallel Firewalls on General-Purpose Graphics Processing Units Manoj Singh Gaur and Vijay Laxmi Kamal Chandra Reddy, Ankit Tharwani, Ch.Vamshi Krishna, Lakshminarayanan.V Department of Computer Engineering

More information

GPU-based Decompression for Medical Imaging Applications

GPU-based Decompression for Medical Imaging Applications GPU-based Decompression for Medical Imaging Applications Al Wegener, CTO Samplify Systems 160 Saratoga Ave. Suite 150 Santa Clara, CA 95051 sales@samplify.com (888) LESS-BITS +1 (408) 249-1500 1 Outline

More information

Intel DPDK Boosts Server Appliance Performance White Paper

Intel DPDK Boosts Server Appliance Performance White Paper Intel DPDK Boosts Server Appliance Performance Intel DPDK Boosts Server Appliance Performance Introduction As network speeds increase to 40G and above, both in the enterprise and data center, the bottlenecks

More information

Overview Motivation and applications Challenges. Dynamic Volume Computation and Visualization on the GPU. GPU feature requests Conclusions

Overview Motivation and applications Challenges. Dynamic Volume Computation and Visualization on the GPU. GPU feature requests Conclusions Module 4: Beyond Static Scalar Fields Dynamic Volume Computation and Visualization on the GPU Visualization and Computer Graphics Group University of California, Davis Overview Motivation and applications

More information

On Sorting and Load Balancing on GPUs

On Sorting and Load Balancing on GPUs On Sorting and Load Balancing on GPUs Daniel Cederman and Philippas Tsigas Distributed Computing and Systems Chalmers University of Technology SE-42 Göteborg, Sweden {cederman,tsigas}@chalmers.se Abstract

More information

GPU Computing - CUDA

GPU Computing - CUDA GPU Computing - CUDA A short overview of hardware and programing model Pierre Kestener 1 1 CEA Saclay, DSM, Maison de la Simulation Saclay, June 12, 2012 Atelier AO and GPU 1 / 37 Content Historical perspective

More information

Motivation: Smartphone Market

Motivation: Smartphone Market Motivation: Smartphone Market Smartphone Systems External Display Device Display Smartphone Systems Smartphone-like system Main Camera Front-facing Camera Central Processing Unit Device Display Graphics

More information

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Computer Science 14 (2) 2013 http://dx.doi.org/10.7494/csci.2013.14.2.243 Marcin Pietroń Pawe l Russek Kazimierz Wiatr ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Abstract This paper presents

More information