JCUDA: A Programmer-Friendly Interface for Accelerating Java Programs with CUDA
|
|
|
- Kerry Day
- 10 years ago
- Views:
Transcription
1 JCUDA: A Programmer-Friendly Interface for Accelerating Java Programs with CUDA Yonghong Yan, Max Grossman, and Vivek Sarkar Department of Computer Science, Rice University {yanyh,jmg3,vsarkar}@rice.edu Abstract. A recent trend in mainstream desktop systems is the use of general-purpose graphics processor units (GPGPUs) to obtain order-ofmagnitude performance improvements. CUDA has emerged as a popular programming model for GPGPUs for use by C/C++ programmers. Given the widespread use of modern object-oriented languages with managed runtimes like Java and C#, it is natural to explore how CUDA-like capabilities can be made accessible to those programmers as well. In this paper, we present a programming interface called JCUDA that can be used by Java programmers to invoke CUDA kernels. Using this interface, programmers can write Java codes that directly call CUDA kernels, and delegate the responsibility of generating the Java-CUDA bridge codes and host-device data transfer calls to the compiler. Our preliminary performance results show that this interface can deliver significant performance improvements to Java programmers. For future work, we plan to use the JCUDA interface as a target language for supporting higher level parallel programming languages like X10 and Habanero-Java. 1 Introduction The computer industry is at a major inflection point in its hardware roadmap due to the end of a decades-long trend of exponentially increasing clock frequencies. It is widely agreed that spatial parallelism in the form of multiple homogeneous and heterogeneous power-efficient cores must be exploited to compensate for this lack of frequency scaling. Unlike previous generations of hardware evolution, this shift towards multicore and manycore computing will have a profound impact on software. These software challenges are further compounded by the need to enable parallelism in workloads and application domains that have traditionally not had to worry about multiprocessor parallelism in the past. Many such applications are written in modern object-oriented languages like Java and C#. A recent trend in mainstream desktop systems is the use of general-purpose graphics processor units (GPGPUs) to obtain order-of-magnitude performance improvements. As an example, NVIDIA s Compute Unified Device Architecture (CUDA) has emerged as a popular programming model for GPGPUs for use by C/C++ programmers [1]. Given the widespread use of managed-runtime execution environments, such as the Java Virtual Machine (JVM) and.net platforms, it is natural to explore how CUDA-like capabilities can be made accessible to programmers who use those environments. H. Sips, D. Epema, and H.-X. Lin (Eds.): Euro-Par 2009, LNCS 704, pp , c Springer-Verlag Berlin Heidelberg 2009
2 888 Y. Yan, M. Grossman, and V. Sarkar In this paper, we present a programming interface called JCUDA that can be used by Java programmers to invoke CUDA kernels. Using this interface, programmers can write Java codes that directly call CUDA kernels without having to worry about the details of bridging the Java runtime and CUDA runtime. The JCUDA implementation handles data transfers of primitives and multidimensional arrays of primitives between the host and device. Our preliminary performance results obtained on four double-precision floating-point Java Grande benchmarks show that this interface can deliver significant performance improvements to Java programmers. The results for Size C (the largest data size) show speedups ranging from 7.70 to with the use of one GPGPU, relative to CPU execution on a single thread. The rest of the paper is organized as follows. Section 2 briefly summarizes past work on high performance computing in Java, as well as the CUDA programming model. Section 3 introduces the JCUDA programming interface and describes its current implementation. Section 4 presents performance results obtained for JCUDA on four Java Grande benchmarks. Finally, Section discusses related work and Section 6 contains our conclusions. 2 Background 2.1 Java for High Performance and Numerical Computing A major thrust in enabling Java for high performance computing came from the Java Grande Forum (JGF) [2], a community initiative to promote the use of the Java platform for compute-intensive numerical applications. Past work inthejgffocusedontwoareas:numerics, which concentrated on issues with using Java on a single CPU, such as complex arithmetic and multidimensional arrays, and Concurrency, which focused on using Java for parallel and distributed computing. The JGF effort also included the development of benchmarks for measuring and comparing different Java execution environments, such as the JGF [3,4] and SciMark [] benchmark suites. The Java Native Interface (JNI) [6], Java s foreign function interface for executing native C code, also played a major role in JGF projects, such as enabling Message Passing Interface (MPI) for Java [7]. In JNI, the programmer declares selected C functions as native external methods that can be invoked by a Java program. The native functions are assumed to have been separately compiled into host-specific binary code. After compiling the Java source files, the javah utility can be used to generate C header files that contain stub interfaces for the native code. JNI also supports a rich variety of callback functions to enable native code to access Java objects and services. 2.2 GPU Architecture and the CUDA Programming Model Driven by the insatiable demand for realtime, high-definition 3D gaming and multimedia experiences, the programmable GPU (Graphics Processing Unit) has
3 JCUDA: A Programmer-Friendly Interface for Accelerating Java Programs Copy data from main memory to GPU memory 2. CPU instructs GPU to start a kernel 3. GPU executes kernel in parallel and accesses GPU memory 4. Copy the results from GPU memory to main memory Fig. 1. Process Flow of a CUDA Kernel Call evolved into a highly parallel, multithreaded, manycore processor. Current GPUs have tens or hundreds of fragment processors, and higher memory bandwidths than regular CPUs. For example, the NVIDIA GeForce GTX 29 graphics card comes with two GPUs, each with 240 processor cores, and 1.8 GB memory with GB/s bandwidth, which is about 10 faster than that of current CPUs. The same GPU is part of the NVIDIA Tesla C1060 Computer Processor, which is the GPU processor used in our performance evaluations. The idea behind general-purpose computing on graphics processing units (GPGPU) is to use GPUs to accelerate selected computations in applications that are traditionally handled by CPUs. To overcome known limitations and difficulties in using graphics APIs for general-purpose computing, GPU vendors and researchers have developed new programming models, such as NVIDIA s Compute Unified Device Architecture (CUDA) model [1], AMD s Brook+ streaming model [8], and Khronos Group s OpenCL framework [9]. The CUDA programming model is an extension of the C language. Programmers write an application with two portions of code functions to be executed on the CPU host and functions to be executed on the GPU device. The entry functions of the device code are tagged with a global keyword, and are referred to as kernels. A kernel executes in parallel across a set of parallel threads in a Single Instruction Multiple Thread (SIMT) model [1]. Since the host and device codes execute in two different memory spaces, the host code must include special calls for host-to-device and device-to-host data transfers. Figure 1 shows the sequence of steps involved in a typical CUDA kernel invocation. 3 The JCUDA Programming Interface and Compiler With the availability of CUDA as an interface for C programmers, the natural extension for Java programmers is to use the Java Native Interface (JNI) as a bridge to CUDA via C. However, as discussed in Section 3.1, this approach is
4 890 Y. Yan, M. Grossman, and V. Sarkar CPU Host GPU Device JVM libcudakernel.so native cudakernel ( ); cudakernel ( ) static { System.loadLibrary( cudakernel"); } JNI cudakernel ( ) { getprimitivearraycritical ( ); cudamalloc( ); cudamemcpy ( ); kernel <<< >>> ( ) cudamemcpy ( ); cudafree ( ); } CUDA kernel () Java C/C++ and CUDA Fig. 2. Development process for accessing CUDA via JNI neither easy nor productive. Sections 3.2 and 3.3 describe our JCUDA programming interface and compiler, and Section 3.4 summarizes our handling of Java arrays as parameters in kernel calls. 3.1 Current Approach: Using CUDA via JNI Figure 2 summarizes the three-stage process that a Java programmer needs to follow to access CUDA via JNI today. It involves writing Java code and JNI stub code in C for execution on the CPU host, as well as CUDA code for execution on the GPU device. The stub code also needs to handle allocation and freeing of data in device memory, and data transfers between the host and device. It is clear that this process is tedious and error-prone, and that it would be more productive to use a compiler or programming tool that automatically generates stub code and data transfer calls. 3.2 The JCUDA Programming Interface The JCUDA model is designed to be a programmer-friendly foreign function interface for invoking CUDA kernels from Java code, especially for programmers who may be familiar with Java and CUDA but not with JNI. We use the example in Figure 3 to illustrate JCUDA syntax and usage. The interface to external CUDA functions is declared in lines 90 93, which contain a static library definition using the lib keyword. The two arguments in a lib declaration specify the name and location of the external library using string constants. The library definition contains declarations for two external functions, foo1 and foo2. The acc modifier indicates that the external function is a CUDA-accelerated kernel function. Each function argument can be declared as IN, OUT, orinout to indicate if a data transfer should be performed before the kernel call, after the kernel call or both. These modifiers allows the responsibility of device memory allocation and data transfer to be delegated to the JCUDA compiler. Our current
5 JCUDA: A Programmer-Friendly Interface for Accelerating Java Programs double[ ][ ] l a = new double[num1][num2]; 2 double[ ][ ][ ] l aout = new double[num1][num2][num3]; 3 double[ ][ ] l aex = new double[num1][num2]; 4 initarray(l a); initarray(l aex); //initialize value in array 6 7 int [ ] ThreadsPerBlock = {16, 16, 1}; 8 int [ ] BlocksPerGrid = new int[3]; BlocksPerGrid[3] = 1; 9 BlocksPerGrid[0] = (NUM1 + ThreadsPerBlock[0] - 1) / ThreadsPerBlock[0]; 10 BlocksPerGrid[1] = (NUM2 + ThreadsPerBlock[1] - 1) / ThreadsPerBlock[1]; /* invoke device on this block/thread grid */ 13 cudafoo.foo1 <<<< BlocksPerGrid, ThreadsPerBlock >>>> (l a, l aout, l aex); 14 printarray(l a); printarray(l aout); printarray(l aex); static lib cudafoo ( cfoo, /opt/cudafoo/lib ) { 91 acc void foo1 (IN double[ ][ ]a, OUT int[ ][ ][ ] aout, INOUT float[ ][ ] aex); 92 acc void foo2 (IN short[ ][ ]a, INOUT double[ ][ ][ ] aex, IN int total); 93 } Fig. 3. JCUDA example JCUDA implementation only supports scalar primitives and rectangular arrays of primitives as arguments to CUDA kernels. The OUT and INOUT modifiers are only permitted on arrays of primitives, not on scalar primitives. If no modifier is specified for an argument, it is default to be IN. As discussed later in Section, there are related approaches to CUDA language bindings that support modifiers such as IN and OUT, and the upcoming PGI 8.0 C/C++ compiler also uses an acc modifier to declare regions of code in C programs to be accelerated. To the best of our knowledge, none of the past efforts support direct invocation of user-written CUDA code from Java programs, with automatic support for data transfer (including copying of multidimensional Java arrays). Line 13 shows a sample invocation of the CUDA kernel function foo1. Similar to CUDA s C interface, we use the <<<<...>>>> 1 to identify a kernel call. The geometries for the CUDA grid and blocks are specified using two three-element integer arrays, BlocksPerGrid and ThreadsPerBlock. In this example, the kernel will be executed with = 26 threads per block and by a number of blocks per grid that depends on the input data size (NUM1 and NUM2). 3.3 The JCUDA Compiler The JCUDA compiler performs source-to-source translation of JCUDA programs to Java program. Our implementation is based on Polyglot [10], a compiler front end for the Java programming language. Figures 4 and show the Java static class declaration and the C glue code generated from the lib declaration in Figure 3. The Java static class introduces declarations with mangled names for native functions corresponding to JCUDA functions foo1 and foo2 respectively, as well as a static class initializer to load the stub library. In addition, three 1 We use four angle brackets instead of three as in CUDA syntax because the >>> is already used as unsigned right shift operator in Java programming language.
6 892 Y. Yan, M. Grossman, and V. Sarkar private static class cudafoo { native static void HelloL 00024cudafoo foo1(double[ ][ ] a, int[ ][ ][ ] aout, float[ ][ ] aex, int[ ] dimgrid, int[ ] dimblock, int sizeshared); static void foo1(double[ ][ ] a, int[ ][ ][ ] aout, float[ ][ ] aex, int[ ] dimgrid, int[ ] dimblock, int sizeshared) { HelloL 00024cudafoo foo1(a, aout, aex, dimgrid, dimblock, sizeshared); } native static void HelloL 00024cudafoo foo2(short[ ][ ] a, double[ ][ ][ ] aex, int total, int[ ] dimgrid, int[ ] dimblock, int sizeshared); static void foo2(short[ ][ ] a, double[ ][ ][ ] aex, int total, int[ ] dimgrid, int[ ] dimblock, int sizeshared) { HelloL 00024cudafoo foo2(a, aex, total, dimgrid, dimblock, sizeshared); } static { java.lang.system.loadlibrary( HelloL 00024cudafoo stub ); } } Fig. 4. Java static class declaration generated from lib definition in Figure 3 extern global void foo1(double * d a, signed int * d aout, float * d aex); JNIEXPORT void JNICALL Java HelloL 00024cudafoo HelloL cudafoo 1foo1(JNIEnv *env, jclass cls, jobjectarray a, jobjectarray aout, jobjectarray aex, jintarray dimgrid, jintarray dimblock, int sizeshared) { /* copy array a to the device */ int dim a[3] = {2}; double * d a = (double*) copyarrayjvmtodevice(env, a, dim a, sizeof(double)); /* Allocate array aout on the device */ int dim aout[4] = {3}; signed int * d aout = (signed int*) allocarrayondevice(env, aout, dim aout, sizeof(signed int)); /* copy array aex to the device */ int dim aex[3] = {2}; float * d aex = (float*) copyarrayjvmtodevice(env, aex, dim aex, sizeof(float)); /* Initialize the dimension of grid and block in CUDA call */ dim3 d dimgrid; getcudadim3(env, dimgrid, &d dimgrid); dim3 d dimblock; getcudadim3(env, dimblock, &d dimblock); foo1 <<< d dimgrid, d dimblock, sizeshared >>> ((double *)d a, (signed int *)d aout, (float *)d aex); /* Free device memory d a*/ freedevicemem(d a); /* copy array d aout->aout from device to JVM, and free device memory d aout */ copyarraydevicetojvm(env, d aout, aout, dim aout, sizeof(signed int)); freedevicemem(d aout); /* copy array d aex->aex from device to JVM, and free device memory d aex */ copyarraydevicetojvm(env, d aex, aex, dim aex, sizeof(float)); freedevicemem(d aex); return; } Fig.. C glue code generated for the foo1 function defined in Figure 3
7 JCUDA: A Programmer-Friendly Interface for Accelerating Java Programs 893 float a [10][10] Fig. 6. Java Multidimensional Array Layout parameters are added to each call dimgrid, dimblock, andsizeshared corresponding to the grid geometry, block geometry, and shared memory size. As we can see in Figure, the generated C code inserts host-device data transfer calls in accordance with the IN, OUT and INOUT modifiers in Figure Multidimensional Array Issues A multidimensional array in Java is represented as an array of arrays. For example, a two-dimensional float array is represented as a one-dimensional array of objects, each of which references a one-dimensional float array as shown in the Figure 6. This representation supports general nested and ragged arrays, as well as the ability to pass subarrays as parameters while still preserving pointer safety. However, it has been observed that this generality comes with a large overhead for the common case of multidimensional rectangular arrays [11]. In our work, we focus on the special case of dense rectangular multidimensional arrays of primitive types as in C and Fortran. These arrays are allocated as nested arrays in Java and as contiguous arrays in CUDA. The JCUDA runtime performs the necessary gather and scatter operations when copying array data between the JVM and the GPU device. For example, to copy a Java array of double[20][40][80] from the JVM to the GPU device, the JCUDA runtime makes = 800 calls to the CUDA cudamemcpy memory copy function with 80 double-words transferred in each call. In future work, we plan to avoid this overhead by using X10 s multidimensional arrays [12] with contiguous storage of all array elements, instead of Java s multidimensional arrays. 4 Performance Evaluation 4.1 Experimental Setup We use four Section 2 benchmarks from the Java Grande Forum (JGF) Benchmarks [3,4] to evaluate our JCUDA programming interface and compiler Fourier coefficient analysis (Series), Sparse matrix multiplication (Sparse), Successive overrelaxation (SOR), and IDEA encryption (Crypt). Each of these benchmarks has
8 894 Y. Yan, M. Grossman, and V. Sarkar three problem sizes for evaluation A, B and C with Size A being the smallest and Size C the largest. For each of these benchmarks, the compute-intensive portionswererewrittenincudawhereastherestofthecodewasretainedinitsoriginal Javaform except for the JCUDA extensionsused for kernel invocation. The rewritten CUDA codes are parallelized in the same way as the original Java multithreaded code, with each CUDA thread performing the same computation as a Java thread. The GPU used in our performance evaluations is a NVIDIA Tesla C1060 card, containing a GPU with 240 cores in 30 streaming multiprocessors, a 1.3 GHz clock speed, and 4GB memory. It also supports double-precision floating-point operations, which was not available in earlier GPU products from NVIDIA. All benchmarks were evaluated with double-precision arithmetic, as in the original Java versions. The CPU hosting this GPGPU is an Intel Quad-Core CPU with a 2.83GHz clock speed, 12MB L2 Cache and 8GB memory. The software installations used include a Sun Java HotSpot 64-bit virtual machine included in version of the Java SE Development Kit (JDK), version of the GNU Compiler Collection (gcc), version of the NVIDIA CUDA driver, and version 2.0 of the NVIDIA CUDA Toolkit. There are two key limitations in our JCUDA implementation which will be addressed in future work. First, as mentioned earlier, we only support primitives and rectangular arrays of primitives as function arguments in the JCUDA interface. Second, the current interface does not provide any direct support for reuse of data acrosskernelcalls since the parameter modes are restricted to IN, OUT and INOUT. 4.2 Evaluation and Analysis Table 1 shows the execution times (in seconds) of the Java and JCUDA versions for all three data sizes of each of the four benchmarks. The Java execution times Table 1. Execution times in seconds, and Speedup of JCUDA relative to 1-thread Java executions (30 blocks per grid, 26 threads per block) Benchmark Series Sparse Data Size A B C A B C Java-1-thread execution time 7.6s 77.42s s 0.0s 1.17s 19.87s Java-2-threads execution time 3.84s 39.21s 7.0s 0.26s 0.4s 8.68s Java-4-threads execution time 2.03s 19.82s s 0.2s 0.39s.32s JCUDA execution time 0.11s 1.04s 10.14s 0.07s 0.14s 0.93s JCUDA Speedup w.r.t. Java-1-thread Benchmark SOR Crypt Data Size A B C A B C Java-1-thread execution time 0.62s 1.60s 2.82s 0.1s 3.26s 8.16s Java-2-threads execution time 0.26s 1.32s 2.9s 0.27s 1.6s 4.10s Java-4-threads execution time 0.16s 1.37s 2.70s 0.11s 0.21s 2.16s JCUDA execution time 0.09s 0.21s 0.37s 0.02s 0.16s 0.4s JCUDA Speedup w.r.t. Java-1-thread
9 JCUDA: A Programmer-Friendly Interface for Accelerating Java Programs Size A Size B Size C 2 Size A Size B Size C Time (s) Time (s) Threads Per Block (a) SOR (30 blocks/grid) Threads Per Block (b) Crypt (30 blocks/grid) Size A Size B Size C Size A Size B Size C 2 Time (s) 1. Time (s) Blocks Per Grid (c) SOR (26 threads/block) Blocks Per Grid (d) Crypt (26 threads/block) Fig. 7. Execution time by varying threads/block and blocks/grid in CUDA kernel invocations on the Tesla C1060 GPU were obtained for 1, 2 and 4 threads, representing the performance obtained by using 1, 2, and 4 CPU cores respectively. The JCUDA execution times were obtained by using a single Java thread on the CPU combined with multiple threads on the GPU in a configuration of 26 threads per block and 30 blocks per grid. The JCUDA kernel was repeatedly executed 10 times, and the fastest times attained for each benchmark and test size are listed in the table. The bottom row shows the speedup of the JCUDA version relative to the 1- thread Java version. The results for Size C (the largest data size) show speedups ranging from 7.70 to , whereas smaller speedups were obtained for smaller data sizes up to 74.1 for Size B and for Size A. The benchmark that showed the smallest speedup was SOR. This is partially due to the fact that our CUDA implementation of the SOR kernel performs a large number of barrier ( syncthreads) operations 400, 600, and 800 for sizes A, B, and C respectively. In contrast, the Series examples show the largest speedup because it represents an embarrassingly parallel application with no communication or
10 896 Y. Yan, M. Grossman, and V. Sarkar 100% 90% 80% 70% Percentage 60% 0% 40% 30% 20% 10% JCUDA cost Post-Compute Pre-Compute Compute 0% Series A Series B Series C Sparse A Sparse B Sparse C Problem and Size SOR A SOR B SOR C Crypt A Crypt B Crypt C Fig. 8. Breakdown of JCUDA Kernel Execution Times synchronization during kernel execution. In general, we see that the JCUDA interface can deliver significant speedups for Java applications with doubleprecision floating point arithmetic by using GPGPUs to offload computations. Figure 7 shows the results of varying threads/block and blocks/grid for SOR and Crypt to study the performance variation relative to the geometry assumed in Table 1 (26 threads/block and 30 blocks/grid). In Figures 7(a) and 7(b), for 30 blocks per grid, we see that little performance gain is obtained by increasing the number of threads per block beyond 26. This suggests that 26 threads per block (half the maximum value of 12) is a large enough number to enable the GPU to maintain high occupancy of its cores [13]. Next, in Figures 7(c) and 7(d), for 26 threads per block, we see that execution times are lowest when the blocks/grid is a multiple of 30. This is consistent with the fact that the Tesla C1060 has 30 streaming multiprocessors (SM). Further, we observe a degradation when the blocks/grid is 1 more than a multiple of 30 (31, 61, 91,...) because of the poor load balance that results on a 30-SM GPU processor. Figure 8 shows the percentage breakdown of the JCUDA kernel execution times into compute, pre-compute, post-compute and JCUDA cost components. The compute component is the time spent in kernel execution on the GPGPU. The pre-compute and post-compute components represent the time spent in data transfer before and after kernel execution. The JCUDA cost includes the overhead of the JNI calls generated by the JCUDA kernel. As the figure shows, for each benchmark, as the problem becomes larger, the compute time percentage increases. The JCUDA overhead has minimal impact on performance, especially for Size C executions. The impact of pre-compute and post-compute times depends on the communication requirements of individual benchmarks. Related Work In this section, we briefly discuss two areas of related work in making CUDA kernels accessible to other languages. The first approach is to provide library
11 JCUDA: A Programmer-Friendly Interface for Accelerating Java Programs 897 bindings for the CUDA API. Examples of this approach include the JCublas and JCufft [14] Java libraries that provide Java bindings to the standard CUBLAS and CUFFT libraries from CUDA. We recently learned that JCublas and JCufft are part of a new Java library called jcuda. In contrast, the JCUDA interface introduced in this paper is a language-based approach, in which the compiler automatically generates glue code and data transfers. PyCuda [1] from Python community is a Python binding for the CUDA API. It also allows CUDA kernel code to be embedded as a string in a Python program. RapidMind provides a multicore programming framework based on the C++ programming language [16]. A programmer can embed a kernel intended to run on the GPU as a delimited piece of code directly in the program. RapidMind also supports IN, and OUT keywords for function parameters to guide the compiler in generating appropriate code for data movement. The second approach is to use compiler parallelization to generate CUDA kernel code. A notable example of this approach can be found in the PGI 8.0 x64+gpu compilers for C and Fortran which allow programmers to use directives, such as acc, copyin, andcopyout, to specify regions of code or functions to be GPU-accelerated. The compiler splits portions of the application between CPU and GPU based on these user-specified directives and generates object codes for both CPU and GPUs from a single source file. Another related approach is to translate OpenMP parallel code to hybrid code for CPUs and GPUs using compiler analysis, transformation and optimization techniques [17]. A recent compiler research effort that is of relevance to Java can be found in [18], where the Jikes RVM dynamic optimizing compiler [19] is extended to perform automatic parallelization at the bytecode level. To the best of our knowledge, none of the past efforts support direct invocation of user-written CUDA code from Java programs, with automatic support for data transfer of primitives and multidimensional arrays of primitives. 6 Conclusions and Future Work In this paper, we presented the JCUDA programming interface that can be used by Java programmers to invoke CUDA kernels without having to worry about the details of JNI calls and data transfers (especially for multidimensional arrays) between the host and device. The JCUDA compiler generates all the necessary JNI glue code, and the JCUDA runtime handles data transfer before and after each kernel call. Our preliminary performance results obtained on four Java Grande benchmarks show significant performance improvements of the JCUDA-accelerated versions over their original versions. The results for Size C (the largest data size) showed speedups ranging from 6.74 to with the use of a GPGPU, relative to CPU execution on a single thread. We also discussed the impact of problem size, communication overhead, and synchronization overhead on the speedups that were obtained. To address the limitations listed in Section 4.1, we are considering to add a KEEP modifier for an argument to specify GPU resident data between kernel
12 898 Y. Yan, M. Grossman, and V. Sarkar calls. To support the overlapping of computation and data transfer, we are developing memory copy primitives for programmers to initiate asynchronous data copy between CPU and GPU. Those primitives can use either the conventional cudamemcpy operation or the page-locked memory mapping mechanism introduced in the latest CUDA development kit. In addition to those, there are also several interesting directions for our future research. We plan to use the JCUDA interface as a target language to support high-level parallel programming language like X10 [12] and Habanero-Java in the Habanero Multicore Software Research project [20]. In this approach, we envision generating JCUDA code and CUDA kernels from a single source multi-place X10 program. Pursuing this approach will require solving some interesting new research problems such as ensuring that offloading X10 code onto a GPGPU will not change the exception semantics of the program. Another direction that we would like to explore is the use of the language extensions recommended in [21] to simplify automatic parallelization and generation of CUDA code. Acknowledgments We are grateful to Jisheng Zhao and Jun Shirako for their help with the Polyglot compiler infrastructure and JGF benchmarks. We would also like to thank Tim Warburton and Jeffrey Bridge for their advice on constructing the GPGPU system used to obtain the experimental results reported in this paper. References 1. Nickolls, J., Buck, I., Garland, M., Nvidia, Skadron, K.: Scalable Parallel Programming with CUDA. ACM Queue 6(2), 40 3 (2008) 2. Java Grande Forum Panel, Java Grande Forum Report: Making Java Work for High-End Computing, Java Grande Forum, SC 1998, Tech. Rep. (November 1998) 3. Bull, J.M., Smith, L.A., Pottage, L., Freeman, R.: Benchmarking Java Against C and Fortran for Scientific Applications. In: Proceedings of the 2001 joint ACM- ISCOPE Conference on Java Grande, pp ACM Press, New York (2001) 4. Smith, L.A., Bull, J.M., Obdrzálek, J.: A Parallel Java Grande Benchmark Suite. In: Proceedings of the 2001 ACM/IEEE conference on Supercomputing, p. 8. ACM Press, New York (2001). SciMark Java Benchmark for Scientific and Numerical Computing, 6. Liang, S.: Java Native Interface: Programmer s Guide and Specification. Sun Microsystems (1999) 7. Carpenter, B., Getov, V., Judd, G., Skjellum, A., Fox, G.: MPJ: MPI-Like Message Passing for Java. Concurrency - Practice and Experience 12(11), (2000) 8. AMD, ATI Stream Computing - Technical Overview. AMD, Tech. Rep. (2008) 9. Khronos OpenCL Working Group, The OpenCL Specification - Version 1.0. The Khronos Group, Tech. Rep. (2009) 10. Nystrom, N., Clarkson, M.R., Myers, A.C.: Polyglot: An Extensible Compiler Framework for Java. In: Kahng, H.-K. (ed.) ICOIN LNCS, vol. 2662, pp Springer, Heidelberg (2003)
13 JCUDA: A Programmer-Friendly Interface for Accelerating Java Programs Moreira, J.E., Midkiff, S.P., Gupta, M., Artigas, P.V., Snir, M., Lawrence, R.D.: Java Programming for High-Performance Numerical Computing. IBM Systems Journal 39(1), 21 6 (2000) 12. Charles, P., Grothoff, C., Saraswat, V., Donawa, C., Kielstra, A., Ebcioglu, K., von Praun, C., Sarkar, V.: X10: an Object-Oriented Approach to Non-Uniform Cluster Computing. In: OOPSLA 200: Proceedings of the 20th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications, pp ACM, New York (200) 13. NVIDIA, NVIDIA CUDA Programming Guide 2.2.plus 0.em minus 0.4emN- VIDIA (2009), Java Binding for NVIDIA CUDA BLAS and FFT Implementation, 1. PyCuda, Matthew Monteyne, RapidMind Multi-Core Development Platform. RapidMind Inc., Tech. Rep (2008) 17. Lee, S., Min, S.-J., Eigenmann, R.: Openmp to gpgpu: a compiler framework for automatic translation and optimization. In: PPoPP 2009: Proceedings of the 14th ACM SIGPLAN symposium on Principles and practice of parallel programming, pp ACM, New York (2009) 18. Leung, A.C.-W.: Thesis: Automatic Parallelization for Graphics Processing Units in JikesRVM. University of Waterloo, Tech. Rep. (2008) 19. Alpern, B., Augart, S., Blackburn, S.M., Butrico, M., Cocchi, A., Cheng, P., Dolby, J., Fink, S., Grove, D., Hind, M., McKinley, K.S., Mergen, M., Moss, J.E.B., Ngo, T., Sarkar, V.: The Jikes Research Virtual Machine Project: Building an Open- Source Research Community. IBM Systems Journal 44(2), (200) 20. Habanero Multicore Software Project, Shirako, J., Kasahara, H., Sarkar, V.: Language Extensions in Support of Compiler Parallelization. In: Adve, V., Garzarán, M.J., Petersen, P. (eds.) LCPC LNCS, vol. 234, pp Springer, Heidelberg (2008)
GPU System Architecture. Alan Gray EPCC The University of Edinburgh
GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems
GPU Parallel Computing Architecture and CUDA Programming Model
GPU Parallel Computing Architecture and CUDA Programming Model John Nickolls Outline Why GPU Computing? GPU Computing Architecture Multithreading and Arrays Data Parallel Problem Decomposition Parallel
Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming
Overview Lecture 1: an introduction to CUDA Mike Giles [email protected] hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.
Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child
Introducing A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Bio Tim Child 35 years experience of software development Formerly VP Oracle Corporation VP BEA Systems Inc.
Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga
Programming models for heterogeneous computing Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga Talk outline [30 slides] 1. Introduction [5 slides] 2.
Writing Applications for the GPU Using the RapidMind Development Platform
Writing Applications for the GPU Using the RapidMind Development Platform Contents Introduction... 1 Graphics Processing Units... 1 RapidMind Development Platform... 2 Writing RapidMind Enabled Applications...
Stream Processing on GPUs Using Distributed Multimedia Middleware
Stream Processing on GPUs Using Distributed Multimedia Middleware Michael Repplinger 1,2, and Philipp Slusallek 1,2 1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany 2 German Research
ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU
Computer Science 14 (2) 2013 http://dx.doi.org/10.7494/csci.2013.14.2.243 Marcin Pietroń Pawe l Russek Kazimierz Wiatr ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Abstract This paper presents
HPC with Multicore and GPUs
HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville CS 594 Lecture Notes March 4, 2015 1/18 Outline! Introduction - Hardware
Hierarchical Place Trees: A Portable Abstraction for Task Parallelism and Data Movement
Hierarchical Place Trees: A Portable Abstraction for Task Parallelism and Data Movement Yonghong Yan, Jisheng Zhao, Yi Guo, and Vivek Sarkar Department of Computer Science, Rice University {yanyh,jisheng.zhao,yguo,vsarkar}@rice.edu
Next Generation GPU Architecture Code-named Fermi
Next Generation GPU Architecture Code-named Fermi The Soul of a Supercomputer in the Body of a GPU Why is NVIDIA at Super Computing? Graphics is a throughput problem paint every pixel within frame time
Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1
Introduction to GP-GPUs Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 GPU Architectures: How do we reach here? NVIDIA Fermi, 512 Processing Elements (PEs) 2 What Can It Do?
Introduction to GPU hardware and to CUDA
Introduction to GPU hardware and to CUDA Philip Blakely Laboratory for Scientific Computing, University of Cambridge Philip Blakely (LSC) GPU introduction 1 / 37 Course outline Introduction to GPU hardware
High Performance Cloud: a MapReduce and GPGPU Based Hybrid Approach
High Performance Cloud: a MapReduce and GPGPU Based Hybrid Approach Beniamino Di Martino, Antonio Esposito and Andrea Barbato Department of Industrial and Information Engineering Second University of Naples
Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011
Graphics Cards and Graphics Processing Units Ben Johnstone Russ Martin November 15, 2011 Contents Graphics Processing Units (GPUs) Graphics Pipeline Architectures 8800-GTX200 Fermi Cayman Performance Analysis
Enhancing Cloud-based Servers by GPU/CPU Virtualization Management
Enhancing Cloud-based Servers by GPU/CPU Virtualiz Management Tin-Yu Wu 1, Wei-Tsong Lee 2, Chien-Yu Duan 2 Department of Computer Science and Inform Engineering, Nal Ilan University, Taiwan, ROC 1 Department
Parallel Programming Survey
Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory
Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA
Accelerating Intensity Layer Based Pencil Filter Algorithm using CUDA Dissertation submitted in partial fulfillment of the requirements for the degree of Master of Technology, Computer Engineering by Amol
Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries
Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries Shin Morishima 1 and Hiroki Matsutani 1,2,3 1Keio University, 3 14 1 Hiyoshi, Kohoku ku, Yokohama, Japan 2National Institute
Clustering Billions of Data Points Using GPUs
Clustering Billions of Data Points Using GPUs Ren Wu [email protected] Bin Zhang [email protected] Meichun Hsu [email protected] ABSTRACT In this paper, we report our research on using GPUs to accelerate
GPUs for Scientific Computing
GPUs for Scientific Computing p. 1/16 GPUs for Scientific Computing Mike Giles [email protected] Oxford-Man Institute of Quantitative Finance Oxford University Mathematical Institute Oxford e-research
Introduction to GPGPU. Tiziano Diamanti [email protected]
[email protected] Agenda From GPUs to GPGPUs GPGPU architecture CUDA programming model Perspective projection Vectors that connect the vanishing point to every point of the 3D model will intersecate
Performance and Scalability of the NAS Parallel Benchmarks in Java
Performance and Scalability of the NAS Parallel Benchmarks in Java Michael A. Frumkin, Matthew Schultz, Haoqiang Jin, and Jerry Yan NASA Advanced Supercomputing (NAS) Division NASA Ames Research Center,
Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61
F# Applications to Computational Financial and GPU Computing May 16th Dr. Daniel Egloff +41 44 520 01 17 +41 79 430 03 61 Today! Why care about F#? Just another fashion?! Three success stories! How Alea.cuBase
GPU Computing - CUDA
GPU Computing - CUDA A short overview of hardware and programing model Pierre Kestener 1 1 CEA Saclay, DSM, Maison de la Simulation Saclay, June 12, 2012 Atelier AO and GPU 1 / 37 Content Historical perspective
OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA
OpenCL Optimization San Jose 10/2/2009 Peng Wang, NVIDIA Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary Overall Optimization
Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model
Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA Part 1: Hardware design and programming model Amin Safi Faculty of Mathematics, TU dortmund January 22, 2016 Table of Contents Set
ultra fast SOM using CUDA
ultra fast SOM using CUDA SOM (Self-Organizing Map) is one of the most popular artificial neural network algorithms in the unsupervised learning category. Sijo Mathew Preetha Joy Sibi Rajendra Manoj A
The Fastest Way to Parallel Programming for Multicore, Clusters, Supercomputers and the Cloud.
White Paper 021313-3 Page 1 : A Software Framework for Parallel Programming* The Fastest Way to Parallel Programming for Multicore, Clusters, Supercomputers and the Cloud. ABSTRACT Programming for Multicore,
Overview of HPC Resources at Vanderbilt
Overview of HPC Resources at Vanderbilt Will French Senior Application Developer and Research Computing Liaison Advanced Computing Center for Research and Education June 10, 2015 2 Computing Resources
APPLICATIONS OF LINUX-BASED QT-CUDA PARALLEL ARCHITECTURE
APPLICATIONS OF LINUX-BASED QT-CUDA PARALLEL ARCHITECTURE Tuyou Peng 1, Jun Peng 2 1 Electronics and information Technology Department Jiangmen Polytechnic, Jiangmen, Guangdong, China, [email protected]
NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist
NVIDIA CUDA Software and GPU Parallel Computing Architecture David B. Kirk, Chief Scientist Outline Applications of GPU Computing CUDA Programming Model Overview Programming in CUDA The Basics How to Get
CUDA Basics. Murphy Stein New York University
CUDA Basics Murphy Stein New York University Overview Device Architecture CUDA Programming Model Matrix Transpose in CUDA Further Reading What is CUDA? CUDA stands for: Compute Unified Device Architecture
Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data
Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data Amanda O Connor, Bryan Justice, and A. Thomas Harris IN52A. Big Data in the Geosciences:
Turbomachinery CFD on many-core platforms experiences and strategies
Turbomachinery CFD on many-core platforms experiences and strategies Graham Pullan Whittle Laboratory, Department of Engineering, University of Cambridge MUSAF Colloquium, CERFACS, Toulouse September 27-29
Characteristics of Java (Optional) Y. Daniel Liang Supplement for Introduction to Java Programming
Characteristics of Java (Optional) Y. Daniel Liang Supplement for Introduction to Java Programming Java has become enormously popular. Java s rapid rise and wide acceptance can be traced to its design
Case Study on Productivity and Performance of GPGPUs
Case Study on Productivity and Performance of GPGPUs Sandra Wienke [email protected] ZKI Arbeitskreis Supercomputing April 2012 Rechen- und Kommunikationszentrum (RZ) RWTH GPU-Cluster 56 Nvidia
CUDA SKILLS. Yu-Hang Tang. June 23-26, 2015 CSRC, Beijing
CUDA SKILLS Yu-Hang Tang June 23-26, 2015 CSRC, Beijing day1.pdf at /home/ytang/slides Referece solutions coming soon Online CUDA API documentation http://docs.nvidia.com/cuda/index.html Yu-Hang Tang @
Parallel Firewalls on General-Purpose Graphics Processing Units
Parallel Firewalls on General-Purpose Graphics Processing Units Manoj Singh Gaur and Vijay Laxmi Kamal Chandra Reddy, Ankit Tharwani, Ch.Vamshi Krishna, Lakshminarayanan.V Department of Computer Engineering
Cross-Platform GP with Organic Vectory BV Project Services Consultancy Services Expertise Markets 3D Visualization Architecture/Design Computing Embedded Software GIS Finance George van Venrooij Organic
OpenACC 2.0 and the PGI Accelerator Compilers
OpenACC 2.0 and the PGI Accelerator Compilers Michael Wolfe The Portland Group [email protected] This presentation discusses the additions made to the OpenACC API in Version 2.0. I will also present
Lecture 1: an introduction to CUDA
Lecture 1: an introduction to CUDA Mike Giles [email protected] Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Overview hardware view software view CUDA programming
~ Greetings from WSU CAPPLab ~
~ Greetings from WSU CAPPLab ~ Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Dr. Abu Asaduzzaman, Assistant Professor and Director Wichita State University (WSU)
CUDA programming on NVIDIA GPUs
p. 1/21 on NVIDIA GPUs Mike Giles [email protected] Oxford University Mathematical Institute Oxford-Man Institute for Quantitative Finance Oxford eresearch Centre p. 2/21 Overview hardware view
PARALLEL JAVASCRIPT. Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology)
PARALLEL JAVASCRIPT Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology) JAVASCRIPT Not connected with Java Scheme and self (dressed in c clothing) Lots of design errors (like automatic semicolon
Introduction to CUDA C
Introduction to CUDA C What is CUDA? CUDA Architecture Expose general-purpose GPU computing as first-class capability Retain traditional DirectX/OpenGL graphics performance CUDA C Based on industry-standard
Programming GPUs with CUDA
Programming GPUs with CUDA Max Grossman Department of Computer Science Rice University [email protected] COMP 422 Lecture 23 12 April 2016 Why GPUs? Two major trends GPU performance is pulling away from
ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING
ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING Sonam Mahajan 1 and Maninder Singh 2 1 Department of Computer Science Engineering, Thapar University, Patiala, India 2 Department of Computer Science Engineering,
Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server
Optimizing GPU-based application performance for the HP for the HP ProLiant SL390s G7 server Technology brief Introduction... 2 GPU-based computing... 2 ProLiant SL390s GPU-enabled architecture... 2 Optimizing
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi
Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi ICPP 6 th International Workshop on Parallel Programming Models and Systems Software for High-End Computing October 1, 2013 Lyon, France
GPU Hardware and Programming Models. Jeremy Appleyard, September 2015
GPU Hardware and Programming Models Jeremy Appleyard, September 2015 A brief history of GPUs In this talk Hardware Overview Programming Models Ask questions at any point! 2 A Brief History of GPUs 3 Once
Multicore Parallel Computing with OpenMP
Multicore Parallel Computing with OpenMP Tan Chee Chiang (SVU/Academic Computing, Computer Centre) 1. OpenMP Programming The death of OpenMP was anticipated when cluster systems rapidly replaced large
Le langage OCaml et la programmation des GPU
Le langage OCaml et la programmation des GPU GPU programming with OCaml Mathias Bourgoin - Emmanuel Chailloux - Jean-Luc Lamotte Le projet OpenGPU : un an plus tard Ecole Polytechnique - 8 juin 2011 Outline
Introduction to GPU Programming Languages
CSC 391/691: GPU Programming Fall 2011 Introduction to GPU Programming Languages Copyright 2011 Samuel S. Cho http://www.umiacs.umd.edu/ research/gpu/facilities.html Maryland CPU/GPU Cluster Infrastructure
L20: GPU Architecture and Models
L20: GPU Architecture and Models scribe(s): Abdul Khalifa 20.1 Overview GPUs (Graphics Processing Units) are large parallel structure of processing cores capable of rendering graphics efficiently on displays.
Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU
Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview
Embedded Systems: map to FPGA, GPU, CPU?
Embedded Systems: map to FPGA, GPU, CPU? Jos van Eijndhoven [email protected] Bits&Chips Embedded systems Nov 7, 2013 # of transistors Moore s law versus Amdahl s law Computational Capacity Hardware
LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR
LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR Frédéric Kuznik, frederic.kuznik@insa lyon.fr 1 Framework Introduction Hardware architecture CUDA overview Implementation details A simple case:
Parallel Image Processing with CUDA A case study with the Canny Edge Detection Filter
Parallel Image Processing with CUDA A case study with the Canny Edge Detection Filter Daniel Weingaertner Informatics Department Federal University of Paraná - Brazil Hochschule Regensburg 02.05.2011 Daniel
An Exception Monitoring System for Java
An Exception Monitoring System for Java Heejung Ohe and Byeong-Mo Chang Department of Computer Science, Sookmyung Women s University, Seoul 140-742, Korea {lutino, [email protected] Abstract. Exception
OpenACC Basics Directive-based GPGPU Programming
OpenACC Basics Directive-based GPGPU Programming Sandra Wienke, M.Sc. [email protected] Center for Computing and Communication RWTH Aachen University Rechen- und Kommunikationszentrum (RZ) PPCES,
GeoImaging Accelerator Pansharp Test Results
GeoImaging Accelerator Pansharp Test Results Executive Summary After demonstrating the exceptional performance improvement in the orthorectification module (approximately fourteen-fold see GXL Ortho Performance
GPGPU Parallel Merge Sort Algorithm
GPGPU Parallel Merge Sort Algorithm Jim Kukunas and James Devine May 4, 2009 Abstract The increasingly high data throughput and computational power of today s Graphics Processing Units (GPUs), has led
A Thread Monitoring System for Multithreaded Java Programs
A Thread Monitoring System for Multithreaded Java Programs Sewon Moon and Byeong-Mo Chang Department of Computer Science Sookmyung Women s University, Seoul 140-742, Korea [email protected], [email protected]
Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data
Graphical Processing Units to Accelerate Orthorectification, Atmospheric Correction and Transformations for Big Data Amanda O Connor, Bryan Justice, and A. Thomas Harris IN52A. Big Data in the Geosciences:
OpenACC Programming and Best Practices Guide
OpenACC Programming and Best Practices Guide June 2015 2015 openacc-standard.org. All Rights Reserved. Contents 1 Introduction 3 Writing Portable Code........................................... 3 What
Direct GPU/FPGA Communication Via PCI Express
Direct GPU/FPGA Communication Via PCI Express Ray Bittner, Erik Ruf Microsoft Research Redmond, USA {raybit,erikruf}@microsoft.com Abstract Parallel processing has hit mainstream computing in the form
Compiling Object Oriented Languages. What is an Object-Oriented Programming Language? Implementation: Dynamic Binding
Compiling Object Oriented Languages What is an Object-Oriented Programming Language? Last time Dynamic compilation Today Introduction to compiling object oriented languages What are the issues? Objects
Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism
Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism Jianqiang Dong, Fei Wang and Bo Yuan Intelligent Computing Lab, Division of Informatics Graduate School at Shenzhen,
Introducing Tetra: An Educational Parallel Programming System
Introducing Tetra: An Educational Parallel Programming System Ian Finlayson, Jerome Mueller, Shehan Rajapakse, Daniel Easterling The University of Mary Washington Fredericksburg, Virginia, USA [email protected],
Introduction GPU Hardware GPU Computing Today GPU Computing Example Outlook Summary. GPU Computing. Numerical Simulation - from Models to Software
GPU Computing Numerical Simulation - from Models to Software Andreas Barthels JASS 2009, Course 2, St. Petersburg, Russia Prof. Dr. Sergey Y. Slavyanov St. Petersburg State University Prof. Dr. Thomas
Multi-Threading Performance on Commodity Multi-Core Processors
Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction
Habanero Extreme Scale Software Research Project
Habanero Extreme Scale Software Research Project Comp215: Java Method Dispatch Zoran Budimlić (Rice University) Always remember that you are absolutely unique. Just like everyone else. - Margaret Mead
Evaluation of CUDA Fortran for the CFD code Strukti
Evaluation of CUDA Fortran for the CFD code Strukti Practical term report from Stephan Soller High performance computing center Stuttgart 1 Stuttgart Media University 2 High performance computing center
Experiences on using GPU accelerators for data analysis in ROOT/RooFit
Experiences on using GPU accelerators for data analysis in ROOT/RooFit Sverre Jarp, Alfio Lazzaro, Julien Leduc, Yngve Sneen Lindal, Andrzej Nowak European Organization for Nuclear Research (CERN), Geneva,
E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices
E6895 Advanced Big Data Analytics Lecture 14: NVIDIA GPU Examples and GPU on ios devices Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist,
Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o [email protected]
Overview on Modern Accelerators and Programming Paradigms Ivan Giro7o [email protected] Informa(on & Communica(on Technology Sec(on (ICTS) Interna(onal Centre for Theore(cal Physics (ICTP) Mul(ple Socket
Computer Graphics Hardware An Overview
Computer Graphics Hardware An Overview Graphics System Monitor Input devices CPU/Memory GPU Raster Graphics System Raster: An array of picture elements Based on raster-scan TV technology The screen (and
Implementation of Stereo Matching Using High Level Compiler for Parallel Computing Acceleration
Implementation of Stereo Matching Using High Level Compiler for Parallel Computing Acceleration Jinglin Zhang, Jean François Nezan, Jean-Gabriel Cousin, Erwan Raffin To cite this version: Jinglin Zhang,
Course Development of Programming for General-Purpose Multicore Processors
Course Development of Programming for General-Purpose Multicore Processors Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University Richmond, VA 23284 [email protected]
Introduction to GPU Computing
Matthis Hauschild Universität Hamburg Fakultät für Mathematik, Informatik und Naturwissenschaften Technische Aspekte Multimodaler Systeme December 4, 2014 M. Hauschild - 1 Table of Contents 1. Architecture
BLM 413E - Parallel Programming Lecture 3
BLM 413E - Parallel Programming Lecture 3 FSMVU Bilgisayar Mühendisliği Öğr. Gör. Musa AYDIN 14.10.2015 2015-2016 M.A. 1 Parallel Programming Models Parallel Programming Models Overview There are several
MapReduce on GPUs. Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu
1 MapReduce on GPUs Amit Sabne, Ahmad Mujahid Mohammed Razip, Kun Xu 2 MapReduce MAP Shuffle Reduce 3 Hadoop Open-source MapReduce framework from Apache, written in Java Used by Yahoo!, Facebook, Ebay,
Speed Performance Improvement of Vehicle Blob Tracking System
Speed Performance Improvement of Vehicle Blob Tracking System Sung Chun Lee and Ram Nevatia University of Southern California, Los Angeles, CA 90089, USA [email protected], [email protected] Abstract. A speed
A general-purpose virtualization service for HPC on cloud computing: an application to GPUs
A general-purpose virtualization service for HPC on cloud computing: an application to GPUs R.Montella, G.Coviello, G.Giunta* G. Laccetti #, F. Isaila, J. Garcia Blas *Department of Applied Science University
Optimizing a 3D-FWT code in a cluster of CPUs+GPUs
Optimizing a 3D-FWT code in a cluster of CPUs+GPUs Gregorio Bernabé Javier Cuenca Domingo Giménez Universidad de Murcia Scientific Computing and Parallel Programming Group XXIX Simposium Nacional de la
22S:295 Seminar in Applied Statistics High Performance Computing in Statistics
22S:295 Seminar in Applied Statistics High Performance Computing in Statistics Luke Tierney Department of Statistics & Actuarial Science University of Iowa August 30, 2007 Luke Tierney (U. of Iowa) HPC
Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms
Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,
OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC
OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC Driving industry innovation The goal of the OpenPOWER Foundation is to create an open ecosystem, using the POWER Architecture to share expertise,
GPU File System Encryption Kartik Kulkarni and Eugene Linkov
GPU File System Encryption Kartik Kulkarni and Eugene Linkov 5/10/2012 SUMMARY. We implemented a file system that encrypts and decrypts files. The implementation uses the AES algorithm computed through
Resource Scheduling Best Practice in Hybrid Clusters
Available online at www.prace-ri.eu Partnership for Advanced Computing in Europe Resource Scheduling Best Practice in Hybrid Clusters C. Cavazzoni a, A. Federico b, D. Galetti a, G. Morelli b, A. Pieretti
High Performance Matrix Inversion with Several GPUs
High Performance Matrix Inversion on a Multi-core Platform with Several GPUs Pablo Ezzatti 1, Enrique S. Quintana-Ortí 2 and Alfredo Remón 2 1 Centro de Cálculo-Instituto de Computación, Univ. de la República
Data-parallel Acceleration of PARSEC Black-Scholes Benchmark
Data-parallel Acceleration of PARSEC Black-Scholes Benchmark AUGUST ANDRÉN and PATRIK HAGERNÄS KTH Information and Communication Technology Bachelor of Science Thesis Stockholm, Sweden 2013 TRITA-ICT-EX-2013:158
ST810 Advanced Computing
ST810 Advanced Computing Lecture 17: Parallel computing part I Eric B. Laber Hua Zhou Department of Statistics North Carolina State University Mar 13, 2013 Outline computing Hardware computing overview
IBM Platform Computing Cloud Service Ready to use Platform LSF & Symphony clusters in the SoftLayer cloud
IBM Platform Computing Cloud Service Ready to use Platform LSF & Symphony clusters in the SoftLayer cloud February 25, 2014 1 Agenda v Mapping clients needs to cloud technologies v Addressing your pain
NVIDIA GeForce GTX 580 GPU Datasheet
NVIDIA GeForce GTX 580 GPU Datasheet NVIDIA GeForce GTX 580 GPU Datasheet 3D Graphics Full Microsoft DirectX 11 Shader Model 5.0 support: o NVIDIA PolyMorph Engine with distributed HW tessellation engines
Experiences With Mobile Processors for Energy Efficient HPC
Experiences With Mobile Processors for Energy Efficient HPC Nikola Rajovic, Alejandro Rico, James Vipond, Isaac Gelado, Nikola Puzovic, Alex Ramirez Barcelona Supercomputing Center Universitat Politècnica
