Stream Processing on GPUs Using Distributed Multimedia Middleware


Michael Repplinger 1,2 and Philipp Slusallek 1,2
1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany
2 German Research Center for Artificial Intelligence (DFKI), Agents & Simulated Reality, Saarbrücken, Germany

Abstract. Available GPUs provide ever more processing power, especially for multimedia and digital signal processing. Despite the tremendous progress in hardware and thus processing power, there are and always will be applications that, due to computationally intensive processing algorithms, require multiple GPUs, either running inside the same machine or distributed in the network. Existing solutions for developing GPU applications still require a lot of hand-optimization when using multiple GPUs inside the same machine and generally provide no support for using remote GPUs distributed in the network. In this paper we address this problem and show that an open distributed multimedia middleware, like the Network-Integrated Multimedia Middleware (NMM), is able (1) to seamlessly integrate processing components using GPUs while completely hiding GPU-specific issues from the application developer, (2) to transparently combine processing components using GPUs or CPUs, and (3) to transparently use local and remote GPUs for distributed processing.

1 Introduction

Since GPUs are especially designed for stream processing and are freely programmable, they are well suited for multimedia and digital signal processing. Available many-core technologies like Nvidia's Compute Unified Device Architecture (CUDA) [1] on top of GPUs simplify the development of highly parallel algorithms running on a single GPU. However, there are still many obstacles and problems when programming applications for GPUs. In general, a GPU can only execute algorithms for processing data, which we call kernels, while the corresponding control logic still has to be executed on the CPU.
The main problem for a software developer is that a kernel runs in a different address space than the application itself. To exchange data between the application and the kernel, or between different GPUs within the same machine, specialized communication mechanisms (e.g., DMA data transfer), memory areas, and special scheduling strategies have to be used. This seriously complicates integrating GPUs into applications, as well as combining multimedia or digital signal processing algorithms that run on CPUs and GPUs or that have to be distributed in the network.
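The address-space split described above can be sketched as follows. This is an illustrative C++ model, not real GPU code: `DeviceBuffer` and the `dma_*` functions stand in for GPU memory and DMA transfers, where real code would use `cudaMalloc`/`cudaMemcpy` and a `__global__` kernel.

```cpp
#include <cassert>
#include <vector>

// Illustrative model of the CPU/GPU address-space split: the "kernel" can
// only touch device memory, so data must be staged through explicit copies.
struct DeviceBuffer { std::vector<float> mem; };  // lives "on the GPU"

DeviceBuffer dma_to_device(const std::vector<float>& host) { return {host}; }
std::vector<float> dma_to_host(const DeviceBuffer& dev) { return dev.mem; }

// The kernel operates exclusively on device memory...
void kernel_brighten(DeviceBuffer& d, float factor) {
    for (float& x : d.mem) x *= factor;
}

// ...while the control logic runs on the CPU and orchestrates the copies.
std::vector<float> process(const std::vector<float>& frame) {
    DeviceBuffer d = dma_to_device(frame);  // CPU -> GPU (DMA)
    kernel_brighten(d, 2.0f);               // launched by CPU, runs on "GPU"
    return dma_to_host(d);                  // GPU -> CPU (DMA)
}
```

Even in this toy form, the shape of the problem is visible: every exchange between application and kernel is an explicit staged copy, which is exactly what a middleware can hide.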

Available open distributed multimedia middleware solutions such as the Network-Integrated Multimedia Middleware (NMM) [2] provide a general abstraction for stream processing using a flow-graph-based approach. This allows an application to specify stream processing by combining different processing elements into a flow graph, each element representing a single algorithm or processing step. A unified messaging system allows sending large data blocks, called buffers, as well as events. This enables the use of these middleware solutions for data-driven as well as event-driven stream processing. Furthermore, they consider the network an integral part of their architecture and allow transparent use and control of local and remote components. The important aspect in this context is that an open distributed multimedia middleware like NMM strongly supports the specialization of all its components to different technologies while still providing a unified architecture for application development. Thus, the general idea presented in this paper is to treat GPUs and CPUs within a single machine in the same way as a distributed system. We will show that using an open distributed multimedia middleware for stream processing allows (1) seamlessly integrating processing components using GPUs while completely hiding GPU-specific issues from the application developer, (2) transparently combining processing components using GPUs or CPUs, and (3) transparently using local and remote GPUs for distributed processing. This paper is structured as follows: in Section 2 we describe the essential components of NMM for stream processing on GPUs. Section 3 discusses related work and Section 4 presents the integration of CUDA into NMM. Section 5 shows how to use multiple local or distributed GPUs for processing. Section 6 presents performance measurements showing that multimedia applications using CUDA can even improve their performance when using our approach.
Section 7 concludes this paper and highlights future work.

2 Open Distributed Multimedia Middleware

Fig. 1. Technology-specific aspects are either hidden within nodes or edges of a distributed flow graph.

Distributed Flow Graph: To hide technology-specific issues as well as the networking protocols used from the application, NMM uses the concept of a distributed flow graph, providing a strict separation of media processing and media transmission, as can be seen in Figure 1. The nodes of a distributed flow graph represent processing units and hide all aspects of the underlying technology used for data processing. Edges represent connections between two nodes and hide all specific aspects of data transmission within a transport strategy (e.g., pointer forwarding for local and TCP for network connections). Thus, data streams can flow from distributed source to sink nodes, being processed by each node in between. The concept of a distributed flow graph is essential to (1) seamlessly integrate processing components using GPUs and hide GPU-specific issues from the application developer. Corresponding CUDA kernels can then be integrated into nodes. In the following, nodes using the CPU for media processing are called CPU nodes, while nodes using the GPU are called GPU nodes. A GPU node runs in the address space of the CPU but configures and controls kernel functions running on a GPU. Since the kernels of a GPU node run in a different address space, data received from main memory has to be copied to the GPU using DMA transfer before it can be processed. An open distributed multimedia middleware hides all aspects of data transmission in the edges of a flow graph and allows integrating these technology-specific aspects as transport strategies. In the case of NMM, a connection service automatically chooses suitable transport strategies and thus allows (2) transparent combination of processing components using GPUs or CPUs in the same machine.

Fig. 2. A data stream can consist of buffers B and events E. A parallel binding allows using different transport strategies to send these messages while the strict message order is preserved.
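The node/edge separation of the distributed flow graph can be sketched in a few lines. This is an illustrative model, not NMM's actual API: a node hides *what* is done to a buffer, while the edge's transport strategy hides *how* the buffer reaches the next node; the empty transport stands in for a pointer-forwarding local strategy.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Illustrative flow-graph model (not NMM's API).
struct Buffer { std::vector<float> samples; };

struct Node {
    std::function<void(Buffer&)> process;    // the node's algorithm
    Node* next = nullptr;                    // downstream node, if any
    std::function<void(Buffer&)> transport;  // edge to `next`
    void push(Buffer& b) {
        process(b);
        if (next) { transport(b); next->push(b); }
    }
};

// Build a two-node graph (scale -> sink) and run one buffer through it.
float run_graph() {
    float sink_sum = 0;
    Node sink{ [&](Buffer& b) { for (float s : b.samples) sink_sum += s; } };
    Node scale{ [](Buffer& b) { for (float& s : b.samples) s *= 2; },
                &sink,
                [](Buffer&) { /* local strategy: pointer forwarding, no copy */ } };
    Buffer b{{1, 2, 3}};
    scale.push(b);
    return sink_sum;  // each sample doubled, then summed by the sink
}
```

Swapping the transport lambda for one that serializes the buffer over TCP, or copies it to a GPU, changes nothing in the nodes themselves, which is the point of the abstraction.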
Unified Messaging System: In general, an open distributed multimedia middleware is able to support both data-driven and event-driven stream processing. NMM supports both kinds of messages through its unified messaging system. Buffers represent large data blocks, e.g., a video frame, that are efficiently managed by a buffer manager. A buffer manager is responsible for allocating specific memory blocks and can be shared between multiple components of NMM. For example, CUDA needs memory allocated as page-locked memory on the CPU side in order to use DMA communication for transporting media data to the GPU. Realizing a corresponding buffer manager allows using this kind of memory within NMM. Since page-locked memory is also a limited resource, this buffer manager provides an interface to the application for specifying a maximum amount of page-locked memory to use. Moreover, sharing buffer managers, e.g., for page-locked memory, between CPU and GPU nodes enables efficient communication between these nodes, because a CPU node automatically uses page-locked memory that can be directly copied to the GPU. Events carry control information, include arbitrarily typed data, and are directly mapped to a method invocation of a node, as long as the node supports the event. NMM allows the combination of events and buffers within the same data stream while a strict message order is preserved. Here, the important aspect is that NMM itself is completely independent of the information sent between nodes but provides an extensible framework for sending messages.

Open Communication Framework: An open communication framework is required to transport messages between the nodes of a flow graph correctly. Since a GPU can only process buffers, events have to be sent to and executed within the corresponding GPU node that controls the algorithm running on a GPU. This means that different transport strategies have to be used to send the messages of a data stream to a GPU node, i.e., DMA transfer for media data and pointer forwarding for control events. For this purpose, NMM uses the concept of parallel binding [3], as shown in Figure 2. This allows using different transport strategies for buffers and events, while the original message order is preserved. Moreover, NMM supports a pipeline of transport strategies, which is required to use remote GPU nodes: since a GPU only has access to the main memory of the collocated CPU and is not able to send buffers directly to a network interface, a buffer has to be copied to main memory first.
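The page-locked buffer manager described above can be sketched as follows. This is an illustrative model, not NMM code: plain heap memory stands in for CUDA's pinned (page-locked) allocations, and the class shows only the two properties the text relies on, an application-set upper bound and recycling of released buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <utility>
#include <vector>

// Illustrative buffer manager for a scarce resource such as page-locked
// memory (heap memory stands in for pinned allocations).
class BufferManager {
    std::size_t limit_;          // maximum bytes of "page-locked" memory
    std::size_t allocated_ = 0;  // bytes handed out so far
    std::vector<std::vector<unsigned char>> pool_;  // released buffers
public:
    explicit BufferManager(std::size_t maxBytes) : limit_(maxBytes) {}

    // Returns a buffer of n bytes, reusing a released one when possible;
    // fails once the configured limit would be exceeded.
    std::optional<std::vector<unsigned char>> acquire(std::size_t n) {
        if (!pool_.empty() && pool_.back().size() == n) {
            auto b = std::move(pool_.back());
            pool_.pop_back();
            return b;
        }
        if (allocated_ + n > limit_) return std::nullopt;
        allocated_ += n;
        return std::vector<unsigned char>(n, 0);
    }

    void release(std::vector<unsigned char> b) { pool_.push_back(std::move(b)); }
};
```

Sharing one such manager between a CPU node and a GPU node means the CPU node's output already lives in the right kind of memory, which is what makes the direct copy to the GPU possible.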
In a second step, the buffer can be sent to a remote system using standard network protocols like TCP. NMM supports setting up a pipeline of different transport strategies within a parallel binding in order to reuse existing transport strategies. The connection service of NMM automatically chooses these transport strategies for transmitting buffers between GPU and CPU nodes and enables (3) transparent use of local and remote GPUs for distributed processing.

3 Related Work

Due to the wide acceptance of CUDA, there already exist some CUDA-specific extensions. GpuCV [4] and OpenVidia [5] are computer vision libraries that completely hide the underlying GPU architecture. Since they act as black-box solutions, it is difficult to combine them with existing multimedia applications that use the CPU. However, the CUDA kernels of such libraries can be reused in the presented approach. In [6], CUDA was integrated into a grid computing framework built on top of the DataCutter middleware. Since the DataCutter middleware focuses on processing a large number of completely independent tasks, it is not suitable for multimedia processing.

Most existing frameworks for distributed stream processing that also support GPUs have a special focus on computer graphics. For example, WireGL [7], Chromium [8], and Equalizer [9] support distributing workload to GPUs in the network. However, all these frameworks work only with OpenGL-based applications and cannot be used for general-purpose processing on GPUs. Available distributed multimedia middleware solutions like NMM [2], NIST II [10], and Infopipes [11] are especially designed for distributed media processing, but only NMM and NIST II support the concept of a distributed flow graph, and the concepts of parallel binding and pipelined parallel binding are supported only by NMM.

4 Integration of CUDA

Fig. 3. CUDA is integrated into NMM using a three-layer approach. All layers can be accessed by the application, but only the distributed flow graph is seen by default.

We use a three-layer approach for integrating CUDA into NMM, as can be seen in Figure 3. All three layers can be accessed from the application, but only the first layer, which includes the distributed flow graph, is seen by default. Here, all CUDA kernels are encapsulated into specific GPU nodes so that they can be used within a distributed flow graph for distributed stream processing. Since processing a specific buffer requires that all subsequent operations are executed on the same GPU, our integration of CUDA ensures that all following GPU nodes use the same GPU. Therefore, GPU nodes are interconnected using the LocalStrategy, which simply forwards pointers to buffers. However, the concept of parallel binding is required for connecting CPU and GPU nodes. Here, incoming events are still forwarded to a LocalStrategy, because a GPU node processes events in the same address space as a CPU node.
Incoming buffers are sent to a CPUToGPUStrategy or GPUToCPUStrategy that copies media data from main memory to GPU memory, or vice versa, using CUDA's asynchronous DMA transfer. NMM also provides a connection service that is extended to automatically choose these transport strategies for transmitting buffers between GPU and CPU nodes. This shows that the approach of a distributed flow graph (1) hides all specific aspects of CUDA and GPUs from the application. The second layer enables efficient memory management. Page-locked memory that can be directly copied to a GPU is allocated and managed by a CPUBufferManager, while a GPUBufferManager allocates and manages memory on a GPU. Since a CPUToGPUStrategy requires page-locked memory to avoid unnecessary copy operations, it forwards a CPUBufferManager to all predecessor nodes. This is enabled by the concept of shareable buffer managers described in Section 2. As can be seen in Figure 3, this is done as soon as a connection between a CPU and a GPU node is established. Again, this shows the benefit of using a distributed multimedia middleware for multimedia processing on a GPU: GPU nodes can be combined with existing CPU nodes in an efficient way, i.e., without unneeded memory copies, and without changing the implementation of existing nodes.

Fig. 4. This figure shows buffer processing using a combination of CPU and GPU nodes.

The lowest layer is responsible for managing and scheduling the available GPUs within a system. Since we directly use the driver API, different GPUs can in general be accessed from the same application thread by pushing the corresponding CUDA context. However, the current implementation of CUDA's context management does not support asynchronous copy operations [12]. Thus, the CUDAManager maintains an individual thread for each GPU, called a GPU thread, for accessing that GPU. If a component executes a CUDA operation within one of its methods (e.g., executes a kernel), it requests the CUDAManager to invoke this method using the corresponding GPU thread. Such a call blocks until the CUDAManager has executed the method.
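The per-GPU dispatch thread just described can be modeled as below. The class and its interface are illustrative, not NMM's actual CUDAManager: all work for one GPU is funneled through one dedicated thread, and the caller blocks until its job has run there; the real class additionally pushes CUDA contexts and owns the per-GPU buffer managers.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <future>
#include <mutex>
#include <queue>
#include <thread>

// Illustrative per-GPU dispatch thread (names are ours, not NMM's).
class GpuThread {
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> jobs_;
    bool stop_ = false;
    std::thread worker_;  // declared last: started after the state above
public:
    GpuThread() : worker_([this] {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return stop_ || !jobs_.empty(); });
                if (stop_ && jobs_.empty()) return;
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // executed on the dedicated GPU thread
        }
    }) {}

    // Queue f on the GPU thread and block until it has been executed.
    void run(std::function<void()> f) {
        std::promise<void> done;
        {
            std::lock_guard<std::mutex> lk(m_);
            jobs_.push([&] { f(); done.set_value(); });
        }
        cv_.notify_one();
        done.get_future().wait();
    }

    ~GpuThread() {
        { std::lock_guard<std::mutex> lk(m_); stop_ = true; }
        cv_.notify_one();
        worker_.join();
    }
};
```

Because `run` blocks until completion, components can call it from any application thread without ever seeing which thread actually touched the GPU, which is the transparency the paper describes.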
This completely hides the use of multiple GPU threads, as well as the different application threads accessing a GPU, from the application logic. However, since page-locked memory is already bound to a specific GPU, the CUDAManager instantiates a CPUBufferManager and a GPUBufferManager for each GPU. To propagate scheduling information between different nodes, each buffer stores information about the GPU thread that has to be used for processing. Moreover, CUDA operations are executed asynchronously in so-called CUDA streams. Therefore, each buffer provides its own CUDA stream in which all components along the flow graph queue their CUDA-specific operations asynchronously. Since the buffers used by the asynchronous operations of a single CUDA stream can only be released once the CUDA stream is synchronized, each CUDA stream is encapsulated into an NMM-CUDA stream that also stores the involved resources and releases them when the CUDA stream is synchronized. Figure 4 shows all steps for processing a buffer using CPU and GPU nodes:

1. When the CPUToGPUStrategy receives a buffer B_CPU, it requests a suitable buffer B_GPU for the same GPU. Then it initiates an asynchronous copy operation. Before forwarding B_GPU to GPU node A, it adds B_CPU to the NMM-CUDA stream, because this buffer can only be released once the CUDA stream is synchronized.
2. GPU node A initiates the asynchronous execution of its kernel and forwards B_GPU to GPU node B. Since both GPU nodes execute their kernels on the same GPU, the connecting transport strategy uses simple pointer forwarding to transmit B_GPU to GPU node B.
3. GPU node B also initiates the asynchronous execution of its kernel and forwards B_GPU.
4. When the GPUToCPUStrategy receives B_GPU, it requests a new B_CPU and initiates an asynchronous memory copy from GPU to CPU.
5. Finally, the transport strategy synchronizes the CUDA stream to ensure that all operations on the GPU have finished before forwarding B_CPU to CPU node B, and releases all resources stored in the NMM-CUDA stream.

5 Parallel Processing on Multiple GPUs

5.1 Automatic Parallelization

Automatic parallelization is only provided for GPUs inside a single machine. Here, the most important influence on scheduling is that page-locked memory is bound to a specific GPU.
This means that all subsequent processing steps are bound to a specific GPU, but GPU nodes are not explicitly bound to a specific GPU. So if multiple GPUs are available within a single system, the processing of the next media buffers can automatically be initiated on different GPUs. However, this is only possible if a kernel is stateless and does not depend on information about already processed buffers, e.g., a kernel that changes the resolution of each incoming video buffer. In contrast, a stateful kernel, e.g., for encoding or decoding video, stores state information on a GPU and cannot automatically be distributed to multiple GPUs. When using a stateful kernel inside a GPU node, the corresponding transport strategy of type CPUToGPUStrategy uses a CPUBufferManager of a specific GPU to limit media processing to a single GPU. But if a stateless kernel inside a GPU node is used, the corresponding CPUToGPUStrategy forwards a CompositeBufferManager to all preceding nodes. The CompositeBufferManager includes a CPUBufferManager for each GPU; when a buffer is requested, it asks the CUDAManager which GPU should be used and returns a page-locked buffer for the corresponding GPU. So far we have implemented a simple round-robin mechanism inside the CUDAManager that distributes buffers to one GPU after the other.

5.2 Explicit Parallelization

To explicitly distribute workload to multiple local or distributed GPUs for stateless kernels, we provide a set of nodes that support explicit parallelization for application developers. The general idea can be seen in Figure 5. A manager node is used to distribute workload to a set of local or distributed GPU nodes. Since the kind of data that has to be distributed to the succeeding GPU nodes strongly depends on the application, the manager node provides only a scheduling algorithm and distributes incoming data from its predecessor node.

Fig. 5. The manager node distributes workload to connected GPU nodes. After the data has been processed by the successive GPU nodes, it is sent to an assembly node that recreates the correct order of the processed data.

First, the manager node sends data to one connected GPU node after the other. As soon as a GPU node connected to the manager node has finished processing its data, it informs the manager node, by sending a control event, to send new data for processing. The manager node in turn sends the next available data to this GPU node. This simple scheduling approach leads to efficient dynamic load balancing between the GPU nodes, because GPU nodes that finish processing their data earlier also receive new processing tasks earlier. This approach automatically accounts for differences in processing time that can be caused by using different graphics boards.
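The demand-driven scheduling of the manager node can be simulated deterministically. The sketch below is an illustrative model, not NMM code: the next buffer always goes to the worker that asks first, i.e., the one that finishes its current buffer earliest, with a per-buffer cost standing in for GPUs of different speed. Sorting the resulting assignments by `seq` is what the assembly node does to restore the original order.

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Illustrative model of demand-driven scheduling across GPU workers.
struct Assignment { int seq; int worker; };

std::vector<Assignment> schedule(int buffers, const std::vector<int>& cost) {
    using Slot = std::pair<int, int>;  // (time the worker becomes free, id)
    std::priority_queue<Slot, std::vector<Slot>, std::greater<Slot>> free_at;
    for (int w = 0; w < static_cast<int>(cost.size()); ++w) free_at.push({0, w});
    std::vector<Assignment> out;
    for (int seq = 0; seq < buffers; ++seq) {
        auto [t, w] = free_at.top();  // the worker requesting work next
        free_at.pop();
        out.push_back({seq, w});
        free_at.push({t + cost[w], w});  // busy until it asks again
    }
    return out;
}
```

With two workers of cost 1 and 3, for example, the faster worker ends up processing three of four buffers: the load balances dynamically without the manager ever knowing the workers' speeds, exactly as described above.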
6 Performance Measurements

For all performance measurements we use two PCs, PC1 and PC2, connected through a 1 Gigabit/sec full-duplex Ethernet connection, each with an Intel Core2 Duo 3.33 GHz E8600 processor, 4 GB RAM (DDR MHz), running 64-bit Linux (kernel ) and CUDA Toolkit 2.0. PC1 includes two and PC2 one Nvidia GeForce 9600 GT graphics board, each with 512 MB RAM.

Buffer size  Reference [Buf/s]  NMM: 1 GPU [Buf/s]  NMM: 2 GPUs [Buf/s]  NMM: 3 GPUs [Buf/s]
                                PC1                 PC1                  PC1 + PC2
50 KB                           (103 %)             2152 (163.8 %)       2790 (212.3 %)
500 KB                          (117 %)             463 (231.5 %)        680 (340.0 %)
1000 KB                         (113.6 %)           234 (227.2 %)        342 (332.7 %)
2000 KB                         (113.4 %)           118 (226.9 %)        168 (323.1 %)
3000 KB                         (114.7 %)           79 (232.3 %)         105 (308.8 %)

Table 1. Performance of the NMM-CUDA integration versus a single-threaded reference implementation. Throughput is measured in buffers per second [Buf/s] for a stateless kernel using 1, 2, and 3 GPUs; percentages are relative to the reference implementation.

In order to measure the overhead of the presented approach, we compare a reference program that copies data from CPU to GPU, executes a kernel, and finally copies data back to main memory using a single application thread, with a corresponding flow graph that consists of two CPU nodes and one GPU node in between using the same kernel. Based on the throughput in buffers per second, we compare the reference implementation and the corresponding NMM flow graph. Since NMM inherently uses multiple application threads, which is not possible for the reference application without using a framework like NMM, these measurements also include the influence of using multiple application threads. For all measurements, we used a stateless kernel that adjusts the brightness of incoming video frames. The resulting throughput for different buffer sizes can be seen in Table 1. The throughput achieved by our integration is up to 16.7% higher than that of the reference application. These measurements show that there is no overhead when using NMM as a distributed multimedia middleware together with our CUDA integration, even for purely locally running applications. Moreover, the presented approach inherently uses multiple application threads for accessing the GPU, which leads to better exploitation of the GPU.
Adding a second GPU can double the buffer throughput for larger buffer sizes if a stateless kernel is used. Moreover, adding a single remote GPU shows that the overall performance can be increased by up to a factor of three. When using both PCs for processing, we use the manager node described in Section 5.2, which distributes workload to GPU nodes running on PC1 and PC2. Since PC1 provides two GPUs and we use a stateless kernel, both graphics boards of PC1 are used. The assembly node runs on PC1 and receives the results from PC2. However, for large amounts of data the network turns out to be the bottleneck. In our benchmark we already send up to 800 MBit/sec in both directions, so that a fourth GPU in a remote PC cannot be fully utilized. In this case, faster network technologies, e.g., InfiniBand, which provides up to 20 GBit/sec, have to be used.

7 Conclusion and Future Work

In this paper we demonstrated that a distributed multimedia middleware like NMM is able (1) to seamlessly integrate processing components using GPUs

while completely hiding GPU-specific issues from the application developer, (2) to transparently combine processing components using GPUs or CPUs, and (3) to transparently use local and remote GPUs for distributed processing. From our point of view, a distributed multimedia middleware such as NMM is essential to fully exploit the processing power of today's GPUs while still offering a suitable abstraction for developers. Thus, future work will mainly focus on integrating emerging many-core technologies in order to determine which additional functionality should be provided by a distributed multimedia middleware.

Acknowledgements

We would like to thank Martin Beyer for his valuable work on supporting the integration of CUDA into NMM.

References

1. NVIDIA: CUDA Programming Guide 2.0. (2008)
2. M. Lohse, F. Winter, M. Repplinger, and P. Slusallek: Network-Integrated Multimedia Middleware (NMM). In: MM '08: Proceedings of the 16th ACM International Conference on Multimedia. (2008)
3. M. Repplinger, F. Winter, M. Lohse, and P. Slusallek: Parallel Bindings in Distributed Multimedia Systems. In: Proceedings of the 25th IEEE International Conference on Distributed Computing Systems Workshops (ICDCS 2005), IEEE Computer Society (2005)
4. Y. Allusse, P. Horain, A. Agarwal, and C. Saipriyadarshan: GpuCV: An Open-Source GPU-accelerated Framework for Image Processing and Computer Vision. In: MM '08: Proceedings of the 16th ACM International Conference on Multimedia, New York, NY, USA, ACM (2008)
5. J. Fung and S. Mann: OpenVIDIA: Parallel GPU Computer Vision. In: MULTIMEDIA '05: Proceedings of the 13th Annual ACM International Conference on Multimedia, New York, NY, USA, ACM (2005)
6. D.R. Hartley et al.: Biomedical Image Analysis on a Cooperative Cluster of GPUs and Multicores. In: ICS '08: Proceedings of the 22nd Annual International Conference on Supercomputing, New York, NY, USA, ACM (2008)
7. G. Humphreys et al.: WireGL: A Scalable Graphics System for Clusters. In: SIGGRAPH '01: Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques. (2001)
8. G. Humphreys et al.: Chromium: A Stream-Processing Framework for Interactive Rendering on Clusters. In: SIGGRAPH '02: Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques. (2002)
9. S. Eilemann and R. Pajarola: The Equalizer Parallel Rendering Framework. Technical Report IFI, Department of Informatics, University of Zürich (2007)
10. A. Fillinger et al.: The NIST Data Flow System II: A Standardized Interface for Distributed Multimedia Applications. In: IEEE International Symposium on a World of Wireless, Mobile and Multimedia Networks (WoWMoM), IEEE (2008)
11. A.P. Black et al.: Infopipes: An Abstraction for Multimedia Streaming. Multimedia Systems 8 (2002)
12. NVIDIA: CUDA Programming and Development. NVIDIA Forum (2009)


Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.

More information

Server Based Desktop Virtualization with Mobile Thin Clients

Server Based Desktop Virtualization with Mobile Thin Clients Server Based Desktop Virtualization with Mobile Thin Clients Prof. Sangita Chaudhari Email: sangita123sp@rediffmail.com Amod N. Narvekar Abhishek V. Potnis Pratik J. Patil Email: amod.narvekar@rediffmail.com

More information

Parallel Image Processing with CUDA A case study with the Canny Edge Detection Filter

Parallel Image Processing with CUDA A case study with the Canny Edge Detection Filter Parallel Image Processing with CUDA A case study with the Canny Edge Detection Filter Daniel Weingaertner Informatics Department Federal University of Paraná - Brazil Hochschule Regensburg 02.05.2011 Daniel

More information

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices E6895 Advanced Big Data Analytics Lecture 14: NVIDIA GPU Examples and GPU on ios devices Ching-Yung Lin, Ph.D. Adjunct Professor, Dept. of Electrical Engineering and Computer Science IBM Chief Scientist,

More information

High-Performance IP Service Node with Layer 4 to 7 Packet Processing Features

High-Performance IP Service Node with Layer 4 to 7 Packet Processing Features UDC 621.395.31:681.3 High-Performance IP Service Node with Layer 4 to 7 Packet Processing Features VTsuneo Katsuyama VAkira Hakata VMasafumi Katoh VAkira Takeyama (Manuscript received February 27, 2001)

More information

Optimizing a 3D-FWT code in a cluster of CPUs+GPUs

Optimizing a 3D-FWT code in a cluster of CPUs+GPUs Optimizing a 3D-FWT code in a cluster of CPUs+GPUs Gregorio Bernabé Javier Cuenca Domingo Giménez Universidad de Murcia Scientific Computing and Parallel Programming Group XXIX Simposium Nacional de la

More information

Dynamic Media Routing in Multi-User Home Entertainment Systems

Dynamic Media Routing in Multi-User Home Entertainment Systems Dynamic Media Routing in Multi-User Home Entertainment Systems Marco Lohse, Michael Repplinger, and Philipp Slusallek Computer Graphics Lab, Department of Computer Science Saarland University, Saarbrücken,

More information

GPU multiprocessing. Manuel Ujaldón Martínez Computer Architecture Department University of Malaga (Spain)

GPU multiprocessing. Manuel Ujaldón Martínez Computer Architecture Department University of Malaga (Spain) GPU multiprocessing Manuel Ujaldón Martínez Computer Architecture Department University of Malaga (Spain) Outline 1. Multichip solutions [10 slides] 2. Multicard solutions [2 slides] 3. Multichip + multicard

More information

ultra fast SOM using CUDA

ultra fast SOM using CUDA ultra fast SOM using CUDA SOM (Self-Organizing Map) is one of the most popular artificial neural network algorithms in the unsupervised learning category. Sijo Mathew Preetha Joy Sibi Rajendra Manoj A

More information

IBM Deep Computing Visualization Offering

IBM Deep Computing Visualization Offering P - 271 IBM Deep Computing Visualization Offering Parijat Sharma, Infrastructure Solution Architect, IBM India Pvt Ltd. email: parijatsharma@in.ibm.com Summary Deep Computing Visualization in Oil & Gas

More information

Transparent Optimization of Grid Server Selection with Real-Time Passive Network Measurements. Marcia Zangrilli and Bruce Lowekamp

Transparent Optimization of Grid Server Selection with Real-Time Passive Network Measurements. Marcia Zangrilli and Bruce Lowekamp Transparent Optimization of Grid Server Selection with Real-Time Passive Network Measurements Marcia Zangrilli and Bruce Lowekamp Overview Grid Services Grid resources modeled as services Define interface

More information

Analysis of GPU Parallel Computing based on Matlab

Analysis of GPU Parallel Computing based on Matlab Analysis of GPU Parallel Computing based on Matlab Mingzhe Wang, Bo Wang, Qiu He, Xiuxiu Liu, Kunshuai Zhu (School of Computer and Control Engineering, University of Chinese Academy of Sciences, Huairou,

More information

~ Greetings from WSU CAPPLab ~

~ Greetings from WSU CAPPLab ~ ~ Greetings from WSU CAPPLab ~ Multicore with SMT/GPGPU provides the ultimate performance; at WSU CAPPLab, we can help! Dr. Abu Asaduzzaman, Assistant Professor and Director Wichita State University (WSU)

More information

OctaVis: A Simple and Efficient Multi-View Rendering System

OctaVis: A Simple and Efficient Multi-View Rendering System OctaVis: A Simple and Efficient Multi-View Rendering System Eugen Dyck, Holger Schmidt, Mario Botsch Computer Graphics & Geometry Processing Bielefeld University Abstract: We present a simple, low-cost,

More information

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance

LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance 11 th International LS-DYNA Users Conference Session # LS-DYNA Best-Practices: Networking, MPI and Parallel File System Effect on LS-DYNA Performance Gilad Shainer 1, Tong Liu 2, Jeff Layton 3, Onur Celebioglu

More information

D1.2 Network Load Balancing

D1.2 Network Load Balancing D1. Network Load Balancing Ronald van der Pol, Freek Dijkstra, Igor Idziejczak, and Mark Meijerink SARA Computing and Networking Services, Science Park 11, 9 XG Amsterdam, The Netherlands June ronald.vanderpol@sara.nl,freek.dijkstra@sara.nl,

More information

VMWARE WHITE PAPER 1

VMWARE WHITE PAPER 1 1 VMWARE WHITE PAPER Introduction This paper outlines the considerations that affect network throughput. The paper examines the applications deployed on top of a virtual infrastructure and discusses the

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic

More information

Boosting Data Transfer with TCP Offload Engine Technology

Boosting Data Transfer with TCP Offload Engine Technology Boosting Data Transfer with TCP Offload Engine Technology on Ninth-Generation Dell PowerEdge Servers TCP/IP Offload Engine () technology makes its debut in the ninth generation of Dell PowerEdge servers,

More information

Parallel Large-Scale Visualization

Parallel Large-Scale Visualization Parallel Large-Scale Visualization Aaron Birkland Cornell Center for Advanced Computing Data Analysis on Ranger January 2012 Parallel Visualization Why? Performance Processing may be too slow on one CPU

More information

Manjrasoft Market Oriented Cloud Computing Platform

Manjrasoft Market Oriented Cloud Computing Platform Manjrasoft Market Oriented Cloud Computing Platform Aneka Aneka is a market oriented Cloud development and management platform with rapid application development and workload distribution capabilities.

More information

Petascale Software Challenges. Piyush Chaudhary piyushc@us.ibm.com High Performance Computing

Petascale Software Challenges. Piyush Chaudhary piyushc@us.ibm.com High Performance Computing Petascale Software Challenges Piyush Chaudhary piyushc@us.ibm.com High Performance Computing Fundamental Observations Applications are struggling to realize growth in sustained performance at scale Reasons

More information

Sockets vs. RDMA Interface over 10-Gigabit Networks: An In-depth Analysis of the Memory Traffic Bottleneck

Sockets vs. RDMA Interface over 10-Gigabit Networks: An In-depth Analysis of the Memory Traffic Bottleneck Sockets vs. RDMA Interface over 1-Gigabit Networks: An In-depth Analysis of the Memory Traffic Bottleneck Pavan Balaji Hemal V. Shah D. K. Panda Network Based Computing Lab Computer Science and Engineering

More information

Scalability and Classifications

Scalability and Classifications Scalability and Classifications 1 Types of Parallel Computers MIMD and SIMD classifications shared and distributed memory multicomputers distributed shared memory computers 2 Network Topologies static

More information

Enhance Service Delivery and Accelerate Financial Applications with Consolidated Market Data

Enhance Service Delivery and Accelerate Financial Applications with Consolidated Market Data White Paper Enhance Service Delivery and Accelerate Financial Applications with Consolidated Market Data What You Will Learn Financial market technology is advancing at a rapid pace. The integration of

More information

Parallel Computing: Strategies and Implications. Dori Exterman CTO IncrediBuild.

Parallel Computing: Strategies and Implications. Dori Exterman CTO IncrediBuild. Parallel Computing: Strategies and Implications Dori Exterman CTO IncrediBuild. In this session we will discuss Multi-threaded vs. Multi-Process Choosing between Multi-Core or Multi- Threaded development

More information

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU

ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Computer Science 14 (2) 2013 http://dx.doi.org/10.7494/csci.2013.14.2.243 Marcin Pietroń Pawe l Russek Kazimierz Wiatr ACCELERATING SELECT WHERE AND SELECT JOIN QUERIES ON A GPU Abstract This paper presents

More information

An Architecture Model of Sensor Information System Based on Cloud Computing

An Architecture Model of Sensor Information System Based on Cloud Computing An Architecture Model of Sensor Information System Based on Cloud Computing Pengfei You, Yuxing Peng National Key Laboratory for Parallel and Distributed Processing, School of Computer Science, National

More information

SOFT 437. Software Performance Analysis. Ch 5:Web Applications and Other Distributed Systems

SOFT 437. Software Performance Analysis. Ch 5:Web Applications and Other Distributed Systems SOFT 437 Software Performance Analysis Ch 5:Web Applications and Other Distributed Systems Outline Overview of Web applications, distributed object technologies, and the important considerations for SPE

More information

Efficient mapping of the training of Convolutional Neural Networks to a CUDA-based cluster

Efficient mapping of the training of Convolutional Neural Networks to a CUDA-based cluster Efficient mapping of the training of Convolutional Neural Networks to a CUDA-based cluster Jonatan Ward Sergey Andreev Francisco Heredia Bogdan Lazar Zlatka Manevska Eindhoven University of Technology,

More information

Design Issues in a Bare PC Web Server

Design Issues in a Bare PC Web Server Design Issues in a Bare PC Web Server Long He, Ramesh K. Karne, Alexander L. Wijesinha, Sandeep Girumala, and Gholam H. Khaksari Department of Computer & Information Sciences, Towson University, 78 York

More information

Program Grid and HPC5+ workshop

Program Grid and HPC5+ workshop Program Grid and HPC5+ workshop 24-30, Bahman 1391 Tuesday Wednesday 9.00-9.45 9.45-10.30 Break 11.00-11.45 11.45-12.30 Lunch 14.00-17.00 Workshop Rouhani Karimi MosalmanTabar Karimi G+MMT+K Opening IPM_Grid

More information

Heterogeneous Workload Consolidation for Efficient Management of Data Centers in Cloud Computing

Heterogeneous Workload Consolidation for Efficient Management of Data Centers in Cloud Computing Heterogeneous Workload Consolidation for Efficient Management of Data Centers in Cloud Computing Deep Mann ME (Software Engineering) Computer Science and Engineering Department Thapar University Patiala-147004

More information

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 Introduction to GP-GPUs Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1 GPU Architectures: How do we reach here? NVIDIA Fermi, 512 Processing Elements (PEs) 2 What Can It Do?

More information

Enhancing Cloud-based Servers by GPU/CPU Virtualization Management

Enhancing Cloud-based Servers by GPU/CPU Virtualization Management Enhancing Cloud-based Servers by GPU/CPU Virtualiz Management Tin-Yu Wu 1, Wei-Tsong Lee 2, Chien-Yu Duan 2 Department of Computer Science and Inform Engineering, Nal Ilan University, Taiwan, ROC 1 Department

More information

Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003

Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003 Interconnect Efficiency of Tyan PSC T-630 with Microsoft Compute Cluster Server 2003 Josef Pelikán Charles University in Prague, KSVI Department, Josef.Pelikan@mff.cuni.cz Abstract 1 Interconnect quality

More information

Direct GPU/FPGA Communication Via PCI Express

Direct GPU/FPGA Communication Via PCI Express Direct GPU/FPGA Communication Via PCI Express Ray Bittner, Erik Ruf Microsoft Research Redmond, USA {raybit,erikruf}@microsoft.com Abstract Parallel processing has hit mainstream computing in the form

More information

Turbomachinery CFD on many-core platforms experiences and strategies

Turbomachinery CFD on many-core platforms experiences and strategies Turbomachinery CFD on many-core platforms experiences and strategies Graham Pullan Whittle Laboratory, Department of Engineering, University of Cambridge MUSAF Colloquium, CERFACS, Toulouse September 27-29

More information

Performance of Host Identity Protocol on Nokia Internet Tablet

Performance of Host Identity Protocol on Nokia Internet Tablet Performance of Host Identity Protocol on Nokia Internet Tablet Andrey Khurri Helsinki Institute for Information Technology HIP Research Group IETF 68 Prague March 23, 2007

More information

Client/Server Computing Distributed Processing, Client/Server, and Clusters

Client/Server Computing Distributed Processing, Client/Server, and Clusters Client/Server Computing Distributed Processing, Client/Server, and Clusters Chapter 13 Client machines are generally single-user PCs or workstations that provide a highly userfriendly interface to the

More information

A Transport Protocol for Multimedia Wireless Sensor Networks

A Transport Protocol for Multimedia Wireless Sensor Networks A Transport Protocol for Multimedia Wireless Sensor Networks Duarte Meneses, António Grilo, Paulo Rogério Pereira 1 NGI'2011: A Transport Protocol for Multimedia Wireless Sensor Networks Introduction Wireless

More information

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview

More information

Parallel Programming Survey

Parallel Programming Survey Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory

More information

Violin: A Framework for Extensible Block-level Storage

Violin: A Framework for Extensible Block-level Storage Violin: A Framework for Extensible Block-level Storage Michail Flouris Dept. of Computer Science, University of Toronto, Canada flouris@cs.toronto.edu Angelos Bilas ICS-FORTH & University of Crete, Greece

More information

Journal of Chemical and Pharmaceutical Research, 2013, 5(12):118-122. Research Article. An independence display platform using multiple media streams

Journal of Chemical and Pharmaceutical Research, 2013, 5(12):118-122. Research Article. An independence display platform using multiple media streams Available online www.jocpr.com Journal of Chemical and Pharmaceutical Research, 2013, 5(12):118-122 Research Article ISSN : 0975-7384 CODEN(USA) : JCPRC5 An independence display platform using multiple

More information

The search engine you can see. Connects people to information and services

The search engine you can see. Connects people to information and services The search engine you can see Connects people to information and services The search engine you cannot see Total data: ~1EB Processing data : ~100PB/day Total web pages: ~1000 Billion Web pages updated:

More information

LDPC Decoding on the Intel SCC

LDPC Decoding on the Intel SCC LDPC Decoding on the Intel SCC Andreas Diavastos, Panayiotis Petrides, Gabriel Falcao, Pedro Trancoso Computer Science Department University of Cyprus Department of Electrical and Computer Engineering

More information

ALL-AIO-2321P ZERO CLIENT

ALL-AIO-2321P ZERO CLIENT ALL-AIO-2321P ZERO CLIENT PCoIP AIO Zero Client The PCoIPTM technology is designed to deliver a user s desktop from a centralized host PC or server with an immaculate, uncompromised end user experience

More information

Processing Large Amounts of Images on Hadoop with OpenCV

Processing Large Amounts of Images on Hadoop with OpenCV Processing Large Amounts of Images on Hadoop with OpenCV Timofei Epanchintsev 1,2 and Andrey Sozykin 1,2 1 IMM UB RAS, Yekaterinburg, Russia, 2 Ural Federal University, Yekaterinburg, Russia {eti,avs}@imm.uran.ru

More information

Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries

Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries Shin Morishima 1 and Hiroki Matsutani 1,2,3 1Keio University, 3 14 1 Hiyoshi, Kohoku ku, Yokohama, Japan 2National Institute

More information

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR Frédéric Kuznik, frederic.kuznik@insa lyon.fr 1 Framework Introduction Hardware architecture CUDA overview Implementation details A simple case:

More information

Intrusion Detection Architecture Utilizing Graphics Processors

Intrusion Detection Architecture Utilizing Graphics Processors Acta Informatica Pragensia 1(1), 2012, 50 59, DOI: 10.18267/j.aip.5 Section: Online: aip.vse.cz Peer-reviewed papers Intrusion Detection Architecture Utilizing Graphics Processors Liberios Vokorokos 1,

More information

PLANNING FOR DENSITY AND PERFORMANCE IN VDI WITH NVIDIA GRID JASON SOUTHERN SENIOR SOLUTIONS ARCHITECT FOR NVIDIA GRID

PLANNING FOR DENSITY AND PERFORMANCE IN VDI WITH NVIDIA GRID JASON SOUTHERN SENIOR SOLUTIONS ARCHITECT FOR NVIDIA GRID PLANNING FOR DENSITY AND PERFORMANCE IN VDI WITH NVIDIA GRID JASON SOUTHERN SENIOR SOLUTIONS ARCHITECT FOR NVIDIA GRID AGENDA Recap on how vgpu works Planning for Performance - Design considerations -

More information

Manjrasoft Market Oriented Cloud Computing Platform

Manjrasoft Market Oriented Cloud Computing Platform Manjrasoft Market Oriented Cloud Computing Platform Innovative Solutions for 3D Rendering Aneka is a market oriented Cloud development and management platform with rapid application development and workload

More information

Overlapping Data Transfer With Application Execution on Clusters

Overlapping Data Transfer With Application Execution on Clusters Overlapping Data Transfer With Application Execution on Clusters Karen L. Reid and Michael Stumm reid@cs.toronto.edu stumm@eecg.toronto.edu Department of Computer Science Department of Electrical and Computer

More information

LCMON Network Traffic Analysis

LCMON Network Traffic Analysis LCMON Network Traffic Analysis Adam Black Centre for Advanced Internet Architectures, Technical Report 79A Swinburne University of Technology Melbourne, Australia adamblack@swin.edu.au Abstract The Swinburne

More information

White Paper. Recording Server Virtualization

White Paper. Recording Server Virtualization White Paper Recording Server Virtualization Prepared by: Mike Sherwood, Senior Solutions Engineer Milestone Systems 23 March 2011 Table of Contents Introduction... 3 Target audience and white paper purpose...

More information

Choosing a Computer for Running SLX, P3D, and P5

Choosing a Computer for Running SLX, P3D, and P5 Choosing a Computer for Running SLX, P3D, and P5 This paper is based on my experience purchasing a new laptop in January, 2010. I ll lead you through my selection criteria and point you to some on-line

More information

SERVER CLUSTERING TECHNOLOGY & CONCEPT

SERVER CLUSTERING TECHNOLOGY & CONCEPT SERVER CLUSTERING TECHNOLOGY & CONCEPT M00383937, Computer Network, Middlesex University, E mail: vaibhav.mathur2007@gmail.com Abstract Server Cluster is one of the clustering technologies; it is use for

More information

Quantifying the Performance Degradation of IPv6 for TCP in Windows and Linux Networking

Quantifying the Performance Degradation of IPv6 for TCP in Windows and Linux Networking Quantifying the Performance Degradation of IPv6 for TCP in Windows and Linux Networking Burjiz Soorty School of Computing and Mathematical Sciences Auckland University of Technology Auckland, New Zealand

More information

Understanding the Performance of an X550 11-User Environment

Understanding the Performance of an X550 11-User Environment Understanding the Performance of an X550 11-User Environment Overview NComputing's desktop virtualization technology enables significantly lower computing costs by letting multiple users share a single

More information

A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS

A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS SUDHAKARAN.G APCF, AERO, VSSC, ISRO 914712564742 g_suhakaran@vssc.gov.in THOMAS.C.BABU APCF, AERO, VSSC, ISRO 914712565833

More information

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Reconfigurable Architecture Requirements for Co-Designed Virtual Machines Kenneth B. Kent University of New Brunswick Faculty of Computer Science Fredericton, New Brunswick, Canada ken@unb.ca Micaela Serra

More information

HPC performance applications on Virtual Clusters

HPC performance applications on Virtual Clusters Panagiotis Kritikakos EPCC, School of Physics & Astronomy, University of Edinburgh, Scotland - UK pkritika@epcc.ed.ac.uk 4 th IC-SCCE, Athens 7 th July 2010 This work investigates the performance of (Java)

More information

Accelerating CFD using OpenFOAM with GPUs

Accelerating CFD using OpenFOAM with GPUs Accelerating CFD using OpenFOAM with GPUs Authors: Saeed Iqbal and Kevin Tubbs The OpenFOAM CFD Toolbox is a free, open source CFD software package produced by OpenCFD Ltd. Its user base represents a wide

More information

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association Making Multicore Work and Measuring its Benefits Markus Levy, president EEMBC and Multicore Association Agenda Why Multicore? Standards and issues in the multicore community What is Multicore Association?

More information

Computer Organization & Architecture Lecture #19

Computer Organization & Architecture Lecture #19 Computer Organization & Architecture Lecture #19 Input/Output The computer system s I/O architecture is its interface to the outside world. This architecture is designed to provide a systematic means of

More information

Putting it on the NIC: A Case Study on application offloading to a Network Interface Card (NIC)

Putting it on the NIC: A Case Study on application offloading to a Network Interface Card (NIC) This full text paper was peer reviewed at the direction of IEEE Communications Society subject matter experts for publication in the IEEE CCNC 2006 proceedings. Putting it on the NIC: A Case Study on application

More information

Delivering Quality in Software Performance and Scalability Testing

Delivering Quality in Software Performance and Scalability Testing Delivering Quality in Software Performance and Scalability Testing Abstract Khun Ban, Robert Scott, Kingsum Chow, and Huijun Yan Software and Services Group, Intel Corporation {khun.ban, robert.l.scott,

More information

Amazon EC2 Product Details Page 1 of 5

Amazon EC2 Product Details Page 1 of 5 Amazon EC2 Product Details Page 1 of 5 Amazon EC2 Functionality Amazon EC2 presents a true virtual computing environment, allowing you to use web service interfaces to launch instances with a variety of

More information

ArcGIS Pro: Virtualizing in Citrix XenApp and XenDesktop. Emily Apsey Performance Engineer

ArcGIS Pro: Virtualizing in Citrix XenApp and XenDesktop. Emily Apsey Performance Engineer ArcGIS Pro: Virtualizing in Citrix XenApp and XenDesktop Emily Apsey Performance Engineer Presentation Overview What it takes to successfully virtualize ArcGIS Pro in Citrix XenApp and XenDesktop - Shareable

More information

Chapter 8 Multiple Processor Systems. 8.1 Multiprocessors 8.2 Multicomputers 8.3 Distributed systems

Chapter 8 Multiple Processor Systems. 8.1 Multiprocessors 8.2 Multicomputers 8.3 Distributed systems Chapter 8 Multiple Processor Systems 8.1 Multiprocessors 8.2 Multicomputers 8.3 Distributed systems Multiprocessor Systems Continuous need for faster computers shared memory model message passing multiprocessor

More information

Multiprocessor Systems. Chapter 8 Multiple Processor Systems. Multiprocessors. Multiprocessor Hardware (1)

Multiprocessor Systems. Chapter 8 Multiple Processor Systems. Multiprocessors. Multiprocessor Hardware (1) Chapter 8 Multiple Processor Systems Multiprocessor Systems 8.1 Multiprocessors 8.2 Multicomputers 8.3 Distributed systems Continuous need for faster computers shared memory model message passing multiprocessor

More information

Packet-based Network Traffic Monitoring and Analysis with GPUs

Packet-based Network Traffic Monitoring and Analysis with GPUs Packet-based Network Traffic Monitoring and Analysis with GPUs Wenji Wu, Phil DeMar wenji@fnal.gov, demar@fnal.gov GPU Technology Conference 2014 March 24-27, 2014 SAN JOSE, CALIFORNIA Background Main

More information

NVIDIA Tools For Profiling And Monitoring. David Goodwin

NVIDIA Tools For Profiling And Monitoring. David Goodwin NVIDIA Tools For Profiling And Monitoring David Goodwin Outline CUDA Profiling and Monitoring Libraries Tools Technologies Directions CScADS Summer 2012 Workshop on Performance Tools for Extreme Scale

More information

A Survey Study on Monitoring Service for Grid

A Survey Study on Monitoring Service for Grid A Survey Study on Monitoring Service for Grid Erkang You erkyou@indiana.edu ABSTRACT Grid is a distributed system that integrates heterogeneous systems into a single transparent computer, aiming to provide

More information

Deep Learning GPU-Based Hardware Platform

Deep Learning GPU-Based Hardware Platform Deep Learning GPU-Based Hardware Platform Hardware and Software Criteria and Selection Mourad Bouache Yahoo! Performance Engineering Group Sunnyvale, CA +1.408.784.1446 bouache@yahoo-inc.com John Glover

More information

Informatica Ultra Messaging SMX Shared-Memory Transport

Informatica Ultra Messaging SMX Shared-Memory Transport White Paper Informatica Ultra Messaging SMX Shared-Memory Transport Breaking the 100-Nanosecond Latency Barrier with Benchmark-Proven Performance This document contains Confidential, Proprietary and Trade

More information

QoS & Traffic Management

QoS & Traffic Management QoS & Traffic Management Advanced Features for Managing Application Performance and Achieving End-to-End Quality of Service in Data Center and Cloud Computing Environments using Chelsio T4 Adapters Chelsio

More information