Exploiting GPU Hardware Saturation for Fast Compiler Optimization
Alberto Magni, Christophe Dubach and Michael O'Boyle
School of Informatics, University of Edinburgh, United Kingdom

ABSTRACT
Graphics Processing Units (GPUs) are efficient devices capable of delivering high performance for general-purpose computation. Realizing their full performance potential often requires extensive compiler tuning. This process is particularly expensive since it has to be repeated for each target program and platform. In this paper we study the utilization of GPU hardware resources across multiple input sizes and compiler options. In this context we introduce the notion of hardware saturation. Saturation is reached when an application is executed with a number of threads large enough to fully utilize the available hardware resources. We give experimental evidence of hardware saturation and describe its properties using OpenCL kernels on GPUs from Nvidia and AMD. We show that input sizes that saturate the GPU exhibit performance stability across compiler transformations. Using the thread-coarsening transformation as an example, we show that compiler settings maintain their relative performance across input sizes within the saturation region. Leveraging these hardware and software properties we propose a technique to identify the input size at the lower bound of the saturation zone, which we call the Minimum Saturation Point (MSP). By performing iterative compilation on the MSP input size we obtain results that are effectively applicable to much larger input problems, reducing the overhead of tuning by an order of magnitude on average.

Categories and Subject Descriptors
D.3.4 [Programming Languages]: Processors - Compilers, Optimization

General Terms
Experimentation, Measurement

Keywords
OpenCL, GPGPU, iterative compilation, optimization

1. INTRODUCTION
Graphics Processing Units (GPUs) are now widely used thanks to their ability to deliver high performance for a large class of parallel applications. Such devices are best suited to problems larger in size than those typically handled by traditional multicore CPUs, taking advantage of the hundreds of cores available. Writing efficient code for GPUs is challenging due to the complexity of the underlying parallel hardware. Achieving high levels of performance requires expensive tuning by the programmer, either by manually evaluating multiple code versions or by using semi-automatic tools. This effort has to be replicated for every target program and architecture. Given the variety of devices now available and the high rate of hardware updates, the tuning cost is a significant problem. To reduce tuning time, application programmers often evaluate many different compiler or code transformations on problem sizes smaller than their target. The aim is to get quick feedback on the most effective optimization options. This methodology assumes that code transformations that work well on a small problem size will also work well on the larger target size. The testing input size must be chosen with care: too small and it might not match the behaviour of the larger target size; too large and it makes tuning too expensive. Finding the right input size on which to perform tuning is therefore a crucial issue. To tackle this problem we introduce the concept of saturation of hardware resources for graphics processors. Saturation is achieved when an application is run with an input size (i.e.
number of threads) sufficiently large to fully utilize the hardware resources. Programs running in such a region of the input-size space usually scale according to the complexity of the problem and show performance stability across program optimizations. This means that code transformations effective for a small input size in the saturation zone are usually effective for very large input sizes too. We develop a search strategy that exploits the Minimum Saturation Point (MSP) to reduce the total tuning time necessary to find good optimization settings. In this work we describe the property of saturation in terms of the throughput metric, the amount of work performed by a parallel program per unit of time, showing how it reaches stability in the saturation zone. In summary, the paper makes the following contributions:
- We show the existence of saturation for OpenCL benchmarks on three GPU architectures: Fermi and Kepler by Nvidia and Tahiti by AMD.
- We show that hardware saturation can be successfully exploited to speed up iterative tuning of applications, leading to a reduction of the search cost by an order of magnitude.

The rest of the paper is organized as follows: Section 2 motivates the problem with an example. Section 3 describes our experimental setup and Section 4 introduces the concept of the saturation point. Section 5 describes a technique to identify the lower bound of the saturation area using the throughput metric, while Section 6 presents detailed results. Section 7 presents the related work; finally, Section 8 concludes the paper.

2. MOTIVATING EXAMPLE
This section introduces the problem by showing how an OpenCL program reacts when performing iterative compilation on multiple input sizes. As an example we use a program running on the AMD Tahiti GPU and the optimization space of the thread-coarsening transformation [6]. Figure 1 shows the behavior of our benchmark as a function of its input size. For OpenCL workloads there is a direct mapping between the input size and the overall number of instantiated threads, so we measure the problem size using the total number of instantiated threads. Figure 1a shows the execution time of the application as a function of the input size. The largest data size, with 7 million threads, is the input size we want to optimize for; unsurprisingly, it leads to the longest execution time. If we were to tune this application by evaluating multiple versions, the time required to search the optimization space would be directly proportional to the execution time. As a result, it would be preferable to tune the application on the smallest possible input size. However, the performance of the best optimization settings for a given problem size does not transfer across all input sizes, as shown in Figure 1b: the best optimization settings for the smallest input size achieve only a third of the performance of the best settings for the largest input size. This means that if the application were tuned on the smallest input size, it would run three times slower than the best achievable on the large input size. A good trade-off is an input size of a few million threads, which achieves performance within a few percent of the largest input size while greatly reducing search time, in proportion to the ratio between the 7M-thread target and the tuning size. The question is how to find such an input size without having to conduct the search for all input sizes in the first place. One way to solve this problem is to look at the throughput metric, typically expressed as the number of operations executed per second (the formal definition of this metric is given in Section 4). Figure 1c shows the throughput of the application as a function of the problem size. As can be seen, the throughput starts very low and increases with the problem size. It flattens out at a few million threads, which corresponds to the saturation of the hardware resources. We call this point the Minimum Saturation Point (MSP), given that past this point throughput remains constant. Coincidentally, the MSP is smaller than the target input size and leads to performance on par with the best achievable when searching on the largest input size.
Figure 1. Execution time (a), relative performance (b) and throughput (c) of the motivating benchmark running on AMD Tahiti as a function of the input size (expressed as the total number of threads). Note the log scale of the x-axes. (a) Kernel execution time (in nanoseconds). (b) Normalized performance attainable when performing the tuning on a given input size and evaluating the best-performing configuration found on the largest one (7M); the best configuration for the smallest input sizes achieves only a fraction of the maximum attainable by tuning coarsening directly on input size 7M. (c) Throughput as a function of the problem input size.

The remainder of the paper characterizes hardware saturation using the throughput metric and shows that it can be used successfully to reduce tuning time.

3. EXPERIMENTAL SETUP

3.1 Benchmarks and Platforms
In this paper we use OpenCL benchmarks from various sources, as shown in Table 1. For programs from the Parboil benchmark suite we used the opencl_base version. The table reports the range of sizes for the input problem, expressed as the total number of launched threads. For the rest of the paper we consider the largest problem size as the target, i.e. the one that we want to optimize the program for. We used three platforms, described in Table 2.

  #   Program name   Source       Problem size range
  1                  AMD SDK      [M ... M]
  2                  Nvidia SDK   [K ... M]
  3                  AMD SDK      [K ... 7M]
  4   dwthaard       AMD SDK      [... M]
  5   Trans          AMD SDK      [K ... M]
  6                  AMD SDK      [9K ... 7M]
  7                  Parboil      [K ... K]
  8                  Nvidia SDK   [K ... M]
  9                  Nvidia SDK   [K ... M]
 10                  Nvidia SDK   [... K]
 11                  Nvidia SDK   [... K]
 12                  AMD SDK      [K ... K]
 13                  AMD SDK      [K ... 7M]
 14                  Parboil      [... 9M]
 15                  AMD SDK      [K ... M]
 16                  Parboil      [M ... 97M]
Table 1. OpenCL programs with the range of input sizes used for our experiments, expressed as the total number of launched threads (K = thousands, M = millions).

  Name     GPU model    OpenCL    Driver version   Linux kernel
  Tahiti   AMD Tahiti   AMD SDK
  Fermi    Nvidia GTX   CUDA
  Kepler   Nvidia Kc    CUDA
Table 2. OpenCL devices used for our experiments.

In all our experiments we only measure the kernel execution time. To reduce measurement noise, each experiment has been repeated several times, aggregating the results using the median.
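The kernel-only timing above can be obtained with OpenCL profiling events. The helper below is a minimal sketch of such a measurement loop, not the authors' harness: the function names are our own, error handling is omitted, and the command queue is assumed to have been created with the CL_QUEUE_PROFILING_ENABLE property.

    #include <CL/cl.h>
    #include <stdlib.h>

    /* qsort comparator used to take the median of the repeated runs. */
    static int cmp_u64(const void *a, const void *b) {
        cl_ulong x = *(const cl_ulong *)a, y = *(const cl_ulong *)b;
        return (x > y) - (x < y);
    }

    /* Enqueue `kernel` `runs` times and return the median execution time in
     * nanoseconds. Profiling events time the kernel alone, excluding data
     * transfers, matching the methodology described above. */
    cl_ulong time_kernel_ns(cl_command_queue q, cl_kernel kernel, cl_uint dim,
                            const size_t *global, const size_t *local, int runs) {
        cl_ulong *t = malloc(runs * sizeof *t);
        for (int i = 0; i < runs; ++i) {
            cl_event ev;
            cl_ulong start, end;
            clEnqueueNDRangeKernel(q, kernel, dim, NULL, global, local,
                                   0, NULL, &ev);
            clWaitForEvents(1, &ev);
            clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                                    sizeof start, &start, NULL);
            clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                                    sizeof end, &end, NULL);
            t[i] = end - start;
            clReleaseEvent(ev);
        }
        qsort(t, runs, sizeof *t, cmp_u64);
        cl_ulong median = t[runs / 2];
        free(t);
        return median;
    }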
3.2 Optimization Space
The parameters of the thread-coarsening transformation define our search space. Coarsening can be thought of as loop unrolling applied to the OpenCL parallel loop: it increases the amount of work performed by a single thread by merging multiple threads of the original application, consequently reducing the overall number of running threads. As a convention, for the remainder of the paper the input size refers to the number of threads of the original, uncoarsened application. Coarsening is controlled by three parameters:
- factor: defines how many threads to merge together, i.e. by how much to reduce the thread space;
- stride: defines how to remap threads after the transformation so as to preserve coalescing of memory accesses;
- local work group size: defines how many threads are scheduled concurrently onto a single core.
We tuned the transformation by considering all combinations of these parameters. The number of configurations is larger for two-dimensional kernels than for one-dimensional ones and depends on the device limitations and the problem size; summed across all input sizes, programs and devices, this amounts to a very large number of evaluated configurations. This large-scale evaluation of the coarsening transformation is made possible by our portable compiler toolchain [6].
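As an illustration of the factor and stride parameters, the sketch below shows one common formulation of stride-based coarsening applied by hand to a trivial OpenCL kernel. The authors' toolchain [6] performs this transformation automatically in the compiler, so the kernel, the macro names FACTOR and STRIDE (passed at build time, e.g. via -D options) and the remapping formula are our own illustration rather than the paper's exact code.

    /* Original kernel: one thread per element. */
    __kernel void scale(__global float *x, float a) {
        int i = get_global_id(0);
        x[i] = a * x[i];
    }

    /* Coarsened kernel: each thread performs the work of FACTOR threads of
     * the original kernel, so the global size shrinks by FACTOR. STRIDE
     * controls how the merged thread ids are laid out: with STRIDE == 1 a
     * thread processes FACTOR consecutive elements, while larger strides
     * keep the accesses of neighbouring threads adjacent (coalesced). */
    __kernel void scale_coarsened(__global float *x, float a) {
        int i = get_global_id(0);          /* id in the reduced thread space */
        int base = (i / STRIDE) * STRIDE * FACTOR + (i % STRIDE);
        for (int k = 0; k < FACTOR; ++k) {
            int orig = base + k * STRIDE;  /* id of the k-th merged thread */
            x[orig] = a * x[orig];
        }
    }

Launching scale_coarsened with the original global size divided by FACTOR covers the same elements as the original kernel.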
4. THROUGHPUT AND HARDWARE SATURATION
This section defines the concept of hardware saturation and presents its features. Graphics processors are highly parallel machines containing a large number of computing cores. To harness the computing power of GPUs, applications must run a large number of threads to fully exploit all the hardware resources. Programs running a small number of threads might show a behavior which is not representative of larger ones due to under-utilization of the hardware resources. We describe this behavior using the notion of throughput, which we formally define as:

    throughput = units of work / execution time    (1)

where units of work is a metric that depends on the algorithmic complexity of the application. In the case of linear benchmarks, it corresponds to the number of input elements processed by the kernel: units of work = input size. In the case of non-linear programs the input size is scaled according to the complexity; for example, for a program quadratic in complexity, units of work = (input size)^2. Note that in our benchmarks the total number of threads is a linear function of the input size.
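Formula (1) translates directly into code. The helper below is our own sketch; the only assumption is a generic complexity exponent (1 for linear kernels, 2 for quadratic ones, and so on), matching the scaling of units of work just described.

    #include <math.h>

    /* Units of work: the input size scaled by the algorithmic complexity. */
    double units_of_work(double input_size, double exponent) {
        return pow(input_size, exponent);
    }

    /* Throughput as defined in formula (1), in units of work per second.
     * The execution time is in nanoseconds, as reported by the profiler. */
    double throughput(double input_size, double exponent, double time_ns) {
        return units_of_work(input_size, exponent) / (time_ns * 1e-9);
    }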
Figure 2 shows the throughput as a function of the total number of running threads for each benchmark on our three platforms. Consider, for example, the first plot of the Fermi column. The graph clearly shows that the throughput increases along with the problem size (i.e. the total number of threads) up to a few million threads. Past this size the throughput reaches a plateau and stabilizes. We call this region of the input-size space the saturation region. From the shape of the plot we can make two observations. The first is that a program run with a small number of threads does not behave in the same way as with a large number: small input sizes are proportionally slower than large ones, showing very low throughput values. Second, the fact that the throughput stabilizes for sufficiently large input sizes shows that we hit the hardware saturation limit of the GPUs and that applications scale close to their theoretical algorithmic complexity.

4.1 Outliers Analysis
The Tahiti column in Figure 2 shows a number of outliers which do not reach a plateau; examples include mvUncoal and a handful of other kernels. Notice that all these programs are memory bound. We investigated these cases using performance counters. Figure 4 reports the results of our analysis for one of them: it shows the throughput along with the value of the MemUnitBusy performance counter as a function of the total number of threads. MemUnitBusy is the hardware counter that measures the percentage of execution time for which the memory unit is busy processing loads and stores. We can see a very high correlation between the two curves. This signifies that performance for large input sizes is limited by the memory functional unit, which is unable to handle efficiently the large number of requests coming from different threads. As shown in Figure 4b, the coarsening transformation mitigates this problem, leading to much higher performance: the throughput curve still follows the hardware counter, without dropping for larger input sizes. A similar analysis can be performed for programs whose uncoalesced memory access patterns highlight the problem even more; one such benchmark is described in more detail in Section 6.

Figure 2. Throughput, defined as billions of units of work processed per unit of time (see formula (1)), as a function of the total number of threads (i.e. the problem input size) for all the programs and the three devices (Fermi, Kepler, Tahiti).

Figure 3. Performance of the best optimization settings found for a given input size, evaluated on the largest input size.
Figure 4. Correlation between throughput and the MemUnitBusy hardware counter for a memory-bound benchmark on Tahiti, using two different coarsening factors (panels a and b). MemUnitBusy is the percentage of time in which the memory unit is busy processing memory requests. The coarsened configuration does not suffer from low memory utilization, leading to much higher throughput.

Figure 5. Throughput as a function of the problem input size (i.e. the total number of threads) for five different values of the coarsening factor (labeled on the right) on Kepler. Note the log scale of the x-axis, used to highlight the throughput of small input sizes.

5. THROUGHPUT-BASED TUNING
Taking advantage of the saturation properties described earlier, this section introduces a methodology to accelerate iterative compilation.

5.1 Tuning Across Input Sizes
We now show how the best compiler settings found for a given input size perform on the largest input size. Figure 3 shows this performance on all benchmarks and devices for the full compiler transformation space described in Section 3.2. A value of 1 for input size x means that the best-performing coarsening configuration for x is the best for the target size as well; a value of 0.5 means that the best configuration on x gives half of the maximum performance when evaluated on the target. By definition the line reaches 1 for the largest input size. The overall trend is for performance to increase as the tuning input size approaches the target one. After a certain input size, performance reaches a plateau and stabilizes in most cases. Small input sizes tend to show low and unpredictable performance, while large input sizes give performance close to that of the largest input size. This means that for input sizes that saturate the device, performance reaches stability. This can be clearly seen by comparing Figures 2 and 3: when throughput saturates, so does performance.
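To make the metric of Figure 3 concrete, the sketch below (our own helper functions, not the authors' code) computes the normalized performance obtained by tuning at input size x: the best configuration found at x is evaluated on the target size and compared against the best configuration found directly on the target.

    #include <float.h>
    #include <stddef.h>

    /* times[c][s] is the measured execution time of configuration c at
     * input size s. Returns the fastest configuration at input size s. */
    static size_t best_config(double *const *times, size_t nconf, size_t s) {
        size_t best = 0;
        double best_t = DBL_MAX;
        for (size_t c = 0; c < nconf; ++c)
            if (times[c][s] < best_t) { best_t = times[c][s]; best = c; }
        return best;
    }

    /* Normalized performance of tuning at size x, evaluated on the target:
     * 1.0 means the configuration found at x is also the best on the
     * target; 0.5 means it reaches half of the best attainable speed. */
    double normalized_perf(double *const *times, size_t nconf,
                           size_t x, size_t target) {
        size_t at_x = best_config(times, nconf, x);
        size_t at_target = best_config(times, nconf, target);
        return times[at_target][target] / times[at_x][target];
    }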
5.2 Throughput and Coarsening Factor
Before introducing our search technique, we consider the impact of the thread-coarsening factor on the throughput, as shown in Figure 5, which plots the throughput for five different values of the factor parameter of the coarsening transformation. The relative performance of the different coarsening factors changes across input sizes until saturation is reached at a few million threads; beyond that point the throughput of every coarsening factor reaches a plateau and stabilizes. In this example the factors that are best for problem sizes of about a million threads remain the best up to the 7-million-thread target. This signifies that searching for the best parameter to the left of the saturation point will probably lead to bad choices when evaluating on the largest input size; on the contrary, searching beyond the saturation point should lead to good choices when evaluated on the largest input size.

5.3 Throughput-Based Input Size Selection
We now propose a tuning technique that takes advantage of the saturation plateau that exists for both performance and throughput. We first build the throughput curve by (1) running the benchmark with the default compiler parameters on multiple input sizes within the ranges given in Table 1. Once the throughput curve is built, (2) we select the smallest input size that achieves a throughput within a given threshold of the maximum one. The threshold is used to deal with noise in the experimental data and small fluctuations in execution time around the throughput plateau. The resulting selected input size is our Minimum Saturation Point (MSP); the MSPs for all benchmarks and architectures are presented in Table 3. The last step consists of (3) conducting the tuning search at the MSP. Following this methodology, we expect the best optimization settings found at the MSP to lead to performance on par with the best for the largest input size. We refer to this search technique as MSP-Tuning. The next section presents the results obtained by applying the proposed tuning technique.
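A minimal sketch of the selection step (2), assuming the throughput curve has already been measured; the function name select_msp and its array-based interface are our own, not the paper's.

    #include <stddef.h>

    /* Select the Minimum Saturation Point from a measured throughput curve.
     * sizes[] holds the candidate input sizes in increasing order, tput[]
     * the corresponding throughputs, and threshold the noise tolerance
     * (e.g. 0.05 accepts any size within 5% of the maximum throughput).
     * Returns the smallest input size whose throughput is within the
     * threshold of the maximum, i.e. the MSP. */
    size_t select_msp(const size_t *sizes, const double *tput, size_t n,
                      double threshold) {
        double max_tput = 0.0;
        for (size_t i = 0; i < n; ++i)
            if (tput[i] > max_tput)
                max_tput = tput[i];
        for (size_t i = 0; i < n; ++i)
            if (tput[i] >= (1.0 - threshold) * max_tput)
                return sizes[i];   /* first, i.e. smallest, saturating size */
        return sizes[n - 1];       /* unreachable: the maximum always passes */
    }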
  #   Program name   Fermi   Kepler   Tahiti
  1                  M       M        M
  2                  M       M        M
  3                  K       K        M
  4   dwthaard       K       K        M
  5   Trans          M       M        M
  6                  M       M        M
  7                  K       K        K
  8                  .M      M        M
  9                  M       M        K
 10                  K       K        K
 11                  K       K        K
 12                  K       K        K
 13                  M       M        M
 14                  K       K        M
 15                  K       K        M
 16                  M       M        M
Table 3. Minimum Saturation Point identified by the technique presented in Section 5.3 for all the benchmarks and architectures, expressed in number of threads (K = thousands, M = millions).

6. RESULTS
This section provides detailed results for the speedups in kernel execution time and in search time given by MSP-Tuning.

Definitions. In our experiments the baseline used to compute kernel speedups is the execution time of the application run without the coarsening transformation (i.e. with a coarsening factor of 1) and with the default local work group size chosen by the benchmark suite. In all experiments the problem size we optimize for is the largest of the range recorded in Table 1; we call this the target problem size.

Figure 6 summarizes the results of the MSP-Tuning technique described in Section 5.3 for our three OpenCL devices. The top barplot in each subfigure reports the speedup over the baseline attainable using MSP-Tuning compared to the maximum speedup attainable with thread coarsening. The second barplot reports the speedup in the search for the optimal configuration, computed using the following formula:

    search speedup = Time_target / (Time_MSP + Time_throughput)    (2)

Here Time_target is the time spent exhaustively searching for the best configuration on the target input size, Time_MSP is the search time on the MSP input size, and Time_throughput is the time spent building the throughput curve.

Figure 6. Summary of the tuning results for the three architectures: (a) Fermi, (b) Kepler, (c) Tahiti. In each subfigure, the top plot shows the kernel execution speedup attainable with MSP-Tuning compared with the maximum speedup given by coarsening; the bottom plot shows the search-time speedup given by MSP-Tuning (note the log scale of its y-axis). The rightmost bars report geometric means.
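Formula (2) transcribes directly into a helper; this is our own convenience function and assumes the three times are expressed in the same unit.

    /* Search speedup of MSP-Tuning, per formula (2): exhaustive search time
     * on the target size divided by the cost of MSP-Tuning, i.e. the search
     * at the MSP plus the construction of the throughput curve. */
    double search_speedup(double time_target, double time_msp,
                          double time_throughput) {
        return time_target / (time_msp + time_throughput);
    }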
6.1 Result Description
The kernel speedups achieved on Fermi, Kepler and Tahiti are close to the maximum performance attainable with thread coarsening on each device. On the other hand, the search-time speedup given by MSP-Tuning is about one order of magnitude on all three devices. To fully understand the savings we have to take into account the algorithmic complexity of the application: a program of cubic complexity ensures very large tuning savings (even on the order of thousands), since halving the input size leads to a factor-of-eight saving in tuning time. The overall results are very similar for the two Nvidia GPUs, Fermi and Kepler, demonstrating the similarities of the two architectures. The Tahiti results, on the other hand, show significant differences, with higher speedups on average. The search fails to achieve good performance on a few benchmarks due to the erratic shape of their throughput and performance curves. Of particular interest is the matrix-vector multiplication benchmark from the Nvidia SDK. Considering the search-time speedup barplots (and comparing Tables 1 and 3) we can see that no reduction of the input size is attainable for it, i.e. the selected MSP corresponds to the largest input size. This is because no saturation plateau is reached (see the corresponding plot in Figure 2). The reason for this behavior lies in the structure of the algorithm: one thread processes one row of the input matrix and the whole input vector, producing a single element of the output. Thus we have only one thread per matrix row: very few with respect to the complexity of the problem. Scaling the problem to larger sizes is prohibitive, since increasing the thread count quickly means working with many gigabytes of data, hitting device hardware constraints. In summary, our target GPUs cannot fully express the throughput potential of this benchmark due to limitations in the available memory resources. Similar considerations can be made for a second version of the same program.

6.2 Noise Threshold
Figure 7 shows how the choice of the noise threshold described in Section 5.3 affects the performance of our strategy. For a range of threshold percentages it plots the kernel speedup achievable by the search technique and the saving in search time. As expected, the search speedup increases with the threshold percentage, because higher threshold values allow the selection of smaller input sizes as the MSP. The kernel speedup slightly decreases as the percentage increases. Based on this data we selected a small threshold, leading to marginal performance degradation with respect to a 0% threshold and a considerable improvement in search speed across platforms.

Figure 7. Attainable kernel speedup and search-time speedup as a function of the noise threshold for the three architectures: (a) Fermi, (b) Kepler, (c) Tahiti. A threshold of 0% means that we select the input size giving the maximum throughput. Increasing the threshold improves the search speedup with a small degradation of the kernel speedup.

7. RELATED WORK

7.1 Prediction for Parallel Applications and Systems
Reducing the cost of program tuning has been widely considered in the literature. Yang et al. [10] considered running just a small fraction of a parallel application on a testing architecture and using the results to tune it on a different one. Other works [4, 1] have made use of performance modeling.
The first [4] predicts performance for various application-specific parameters such as working-set size and processor topology. The second [1] predicts the scalability of an application on varying numbers of processors. However, both techniques require a significant amount of training and do not consider the problem of finding the hardware saturation point. Other researchers have looked at using techniques such as deterministic replay coupled with clustering to select representative replays [11]. A trace of the application is extracted from a host machine and is then replayed locally on a single node. Interestingly, it is possible to use this technique to execute multiple replays on the same machine node in order
to estimate resource contention and predict the performance of the whole system. This prior work has focused on extrapolating the performance of the whole system using just a few nodes. In our case, we are addressing a different problem, i.e. determining the minimum application workload that saturates the machine. Once this point has been found, we can use it to extrapolate performance on a larger input problem. Fursin and Chen [3, 2] have studied the sensitivity of optimizations to the program input for sequential applications. They have shown that it has little effect on the results of iterative compilation. This is a different conclusion from our results and is probably due to the benchmarks used (embedded programs) not stressing the underlying machines enough, and to the differences between sequential and parallel GPU hardware.

7.2 Optimization Space Exploration on GPUs
There has been a large amount of work dedicated to optimization space pruning. We focus on the most recent work applied to GPU compiler optimization spaces. Ryoo et al. [8] developed an analytical model to predict the performance of compiler optimizations for a CUDA architecture. Liu et al. [5] built a model to predict which optimization to apply for a given input size. While these approaches have the potential to speed up the exploration of the design space, they still rely on large data collections in order to build a model. Samadi et al. [9] have looked at using StreamIt to automatically generate optimized code for high-level operators such as reductions; they use a performance model to determine which version to use on the target system. Our technique is orthogonal to these, since we present a methodology to speed up the empirical search of the optimization space by reducing the problem input size on which to perform the search. Finally, some recent work [7] looked at the effect of input size on the optimization space, using a simple clustering technique to group input sizes that share the optimization configuration leading to the best performance.

8. CONCLUSION AND FUTURE WORK

8.1 Conclusion
This paper has introduced the concept of hardware saturation for GPUs. Saturation is reached when the device runs a problem large enough (in number of threads) to fully utilize its hardware resources. We have provided experimental evidence of saturation on three devices from Nvidia and AMD. We have shown that the thread-coarsening compiler transformation has stable performance across problem sizes that saturate hardware resources. Leveraging this insight we proposed a technique to identify the lower bound of the saturation area in the input space, called the MSP. We use this input size for fast tuning of the parameters of the coarsening transformation. The proposed tuning technique has been shown to reduce the cost of searching by over an order of magnitude on average, and by up to a hundred times, while reaching close to the maximum performance attainable with coarsening on Fermi, Kepler and Tahiti.

8.2 Future Work
For future work we will investigate the possibility of estimating the throughput curve without running multiple versions of a benchmark. This could be done by running a few versions of a program with small problem sizes and then using extrapolation to estimate the remaining ones. Such an improvement would increase the savings in the search, making the technique more widely applicable. Another possible extension would be to use throughput to study the performance bottlenecks of graphics devices.
Related to this, programs deviating from the characteristic curve (like the matrix-vector benchmarks on Tahiti) are of interest for further study, as they expose hardware limitations. MSP-Tuning has proved to be a valuable technique to speed up iterative compilation, and the same underlying idea can be applied to other types of searches.

9. REFERENCES
[1] B. J. Barnes, B. Rountree, D. K. Lowenthal, J. Reeves, B. de Supinski, and M. Schulz. A regression-based approach to scalability prediction. ICS, 2008.
[2] Y. Chen, Y. Huang, L. Eeckhout, G. Fursin, L. Peng, O. Temam, and C. Wu. Evaluating iterative optimization across datasets. PLDI, 2010.
[3] G. Fursin, J. Cavazos, M. O'Boyle, and O. Temam. MiDataSets: creating the conditions for a more realistic evaluation of iterative optimization. HiPEAC, 2007.
[4] B. C. Lee, D. M. Brooks, B. R. de Supinski, M. Schulz, K. Singh, and S. A. McKee. Methods of inference and learning for performance modeling of parallel applications. PPoPP, 2007.
[5] Y. Liu, E. Zhang, and X. Shen. A cross-input adaptive framework for GPU program optimizations. IPDPS, 2009.
[6] A. Magni, C. Dubach, and M. O'Boyle. A large-scale cross-architecture evaluation of thread-coarsening. SC, 2013.
[7] A. Magni, D. Grewe, and N. Johnson. Input-aware auto-tuning for directive-based GPU programming. GPGPU, 2013.
[8] S. Ryoo, C. I. Rodrigues, S. S. Stone, S. S. Baghsorkhi, S.-Z. Ueng, J. A. Stratton, and W.-m. W. Hwu. Program optimization space pruning for a multithreaded GPU. CGO, 2008.
[9] M. Samadi, A. Hormati, M. Mehrara, J. Lee, and S. Mahlke. Adaptive input-aware compilation for graphics engines. PLDI, 2012.
[10] L. Yang, X. Ma, and F. Mueller. Cross-platform performance prediction of parallel applications using partial execution. SC, 2005.
[11] J. Zhai, W. Chen, and W. Zheng. Phantom: predicting performance of parallel applications on large-scale parallel machines using a single node. PPoPP, 2010.