Stream Processing on GPUs Using Distributed Multimedia Middleware

Michael Repplinger 1,2 and Philipp Slusallek 1,2

1 Computer Graphics Lab, Saarland University, Saarbrücken, Germany
2 German Research Center for Artificial Intelligence (DFKI), Agents & Simulated Reality, Saarbrücken, Germany

Abstract. Available GPUs provide ever more processing power, especially for multimedia and digital signal processing. Despite the tremendous progress in hardware and thus processing power, there are and always will be applications whose computationally intensive algorithms require multiple GPUs, either inside the same machine or distributed in the network. Existing solutions for developing GPU applications still require a lot of hand-optimization when using multiple GPUs inside the same machine and in general provide no support for using remote GPUs distributed in the network. In this paper we address this problem and show that an open distributed multimedia middleware, like the Network-Integrated Multimedia Middleware (NMM), is able (1) to seamlessly integrate processing components using GPUs while completely hiding GPU-specific issues from the application developer, (2) to transparently combine processing components using GPUs or CPUs, and (3) to transparently use local and remote GPUs for distributed processing.

1 Introduction

Since GPUs are specifically designed for stream processing and are freely programmable, they are well suited for multimedia or digital signal processing. Available many-core technologies like NVIDIA's Compute Unified Device Architecture (CUDA) [1] on top of GPUs simplify the development of highly parallel algorithms running on a single GPU. However, there are still many obstacles when programming applications for GPUs. In general, a GPU can only execute algorithms for processing data, which we call kernels, while the corresponding control logic still has to be executed on the CPU.
The main problem for a software developer is that a kernel runs in a different address space than the application itself. To exchange data between the application and the kernel, or between different GPUs within the same machine, specialized communication mechanisms (e.g., DMA data transfer), memory areas, and special scheduling strategies have to be used. This seriously complicates integrating GPUs into applications, as well as combining algorithms for multimedia or digital signal processing that run on CPUs and GPUs or have to be distributed in the network.

Available open distributed multimedia middleware solutions such as the Network-Integrated Multimedia Middleware (NMM) [2] provide a general abstraction for stream processing using a flow-graph-based approach. This allows an application to specify stream processing by combining different processing elements into a flow graph, each element representing a single algorithm or processing step. A unified messaging system allows sending large data blocks, called buffers, as well as events. This enables the use of these middleware solutions for data-driven as well as event-driven stream processing. Furthermore, they consider the network an integral part of their architecture and allow transparent use and control of local and remote components. The important aspect in this context is that an open distributed multimedia middleware like NMM strongly supports the specialization of all its components to different technologies while still providing a unified architecture for application development.

Thus, the general idea presented in this paper is to treat GPUs and CPUs within a single machine in the same way as a distributed system. We will show that using an open distributed multimedia middleware for stream processing allows (1) seamlessly integrating processing components using GPUs while completely hiding GPU-specific issues from the application developer, (2) transparently combining processing components using GPUs or CPUs, and (3) transparently using local and remote GPUs for distributed processing.

This paper is structured as follows: in Section 2 we describe the essential components of NMM for stream processing on GPUs. Section 3 discusses related work and Section 4 presents the integration of CUDA into NMM. Section 5 shows how to use multiple local or distributed GPUs for processing. Section 6 presents performance measurements showing that multimedia applications using CUDA can even improve their performance when using our approach.
Section 7 concludes this paper and highlights future work.

2 Open Distributed Multimedia Middleware

Fig. 1. Technology-specific aspects are hidden within either the nodes or the edges of a distributed flow graph.

Distributed Flow Graph: To hide technology-specific issues as well as the networking protocols in use from the application, NMM uses the concept of a distributed flow graph, providing a strict separation of media processing and media transmission, as can be seen in Figure 1. The nodes of a distributed flow graph represent processing units and hide all aspects of the underlying technology used for data processing. Edges represent connections between two nodes and hide all specific aspects of data transmission within a transport strategy (e.g., pointer forwarding for local connections and TCP for network connections). Thus, data streams can flow from distributed source nodes to sink nodes, being processed by each node in between.

The concept of a distributed flow graph is essential to (1) seamlessly integrate processing components using GPUs and hide GPU-specific issues from the application developer. The corresponding CUDA kernels can then be integrated into nodes. In the following, nodes using the CPU for media processing are called CPU nodes, while nodes using the GPU are called GPU nodes. A GPU node runs in the address space of the CPU but configures and controls kernel functions running on a GPU. Since the kernels of a GPU node run in a different address space, data received from main memory has to be copied to the GPU using DMA transfer before it can be processed. An open distributed multimedia middleware hides all aspects of data transmission in the edges of a flow graph and allows these technology-specific aspects to be integrated as a transport strategy. In the case of NMM, a connection service automatically chooses suitable transport strategies and thus allows (2) transparent combination of processing components using GPUs or CPUs in the same machine.

Fig. 2. A data stream can consist of buffers B and events E. A parallel binding allows different transport strategies to be used for sending these messages while the strict message order is preserved.
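
The automatic choice of a transport strategy per edge can be illustrated with a minimal sketch. This is not NMM's connection service or API: the `Node` type, the `choose_transport` function, and the three strategy names are hypothetical and only mirror the cases named in the text (pointer forwarding for local connections, DMA transfer between main memory and GPU memory, TCP for network connections). The sketch also assumes that two GPU nodes on the same host share one GPU, and it ignores that a remote GPU endpoint needs more than a single transport step.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical description of a flow-graph node (not an NMM type)."""
    name: str
    host: str    # machine the node runs on
    device: str  # "cpu" or "gpu"


def choose_transport(src: Node, dst: Node) -> str:
    """Pick a transport strategy for the edge src -> dst (illustrative rules)."""
    if src.host != dst.host:
        return "tcp"                 # network connection between machines
    if src.device != dst.device:
        return "dma"                 # main memory <-> GPU memory on one host
    return "pointer_forwarding"      # same host and same kind of device


# Usage: the application only connects nodes; the edge type follows from them.
reader = Node("reader", host="pc1", device="cpu")
kernel = Node("brightness", host="pc1", device="gpu")
sink = Node("display", host="pc2", device="cpu")
edges = [(a.name, b.name, choose_transport(a, b))
         for a, b in [(reader, kernel), (reader, sink)]]
```

The point of the design is that the application describes only nodes and connections, while the per-edge decision stays inside the middleware.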

Unified Messaging System: In general, an open distributed multimedia middleware is able to support both data-driven and event-driven stream processing. NMM supports both kinds of messages through its unified messaging system. Buffers represent large data blocks, e.g., a video frame, that are efficiently managed by a buffer manager. A buffer manager is responsible for allocating specific memory blocks and can be shared between multiple components of NMM. For example, CUDA needs memory allocated as page-locked memory on the CPU side to use DMA communication for transporting media data to the GPU. Realizing a corresponding buffer manager makes this kind of memory usable within NMM. Since page-locked memory is a limited resource, this buffer manager provides an interface that lets the application specify a maximum amount of page-locked memory to be used. Moreover, sharing buffer managers, e.g., for page-locked memory, between CPU and GPU nodes enables efficient communication between these nodes, because a CPU node then automatically uses page-locked memory that can be copied directly to the GPU.

Events carry control information, may include arbitrarily typed data, and are mapped directly to a method invocation of a node, as long as the node supports the event. NMM allows events and buffers to be combined within the same data stream while a strict message order is preserved. The important aspect here is that NMM itself is completely independent of the information sent between nodes and instead provides an extensible framework for sending messages.

Open Communication Framework: An open communication framework is required to transport messages between the nodes of a flow graph correctly. Since a GPU can only process buffers, events have to be sent to and executed within the corresponding GPU node that controls the algorithm running on the GPU. This means that different transport strategies have to be used to send the messages of a data stream to a GPU node, i.e., DMA transfer for media data and pointer forwarding for control events. For this purpose, NMM uses the concept of parallel binding [3], as shown in Figure 2. This allows different transport strategies to be used for buffers and events while the original message order is preserved. Moreover, NMM supports a pipeline of transport strategies, which is required to use remote GPU nodes. Since a GPU only has access to the main memory of the collocated CPU and is not able to send buffers directly to a network interface, a buffer has to be copied to main memory first.
In a second step, the buffer can be sent to a remote system using standard network protocols like TCP. NMM supports setting up a pipeline of different transport strategies within a parallel binding in order to reuse existing transport strategies. The connection service of NMM automatically chooses these transport strategies for transmitting buffers between GPU and CPU nodes and enables (3) transparent use of local and remote GPUs for distributed processing.

3 Related Work

Due to the wide acceptance of CUDA, some CUDA-specific extensions already exist. GpuCV [4] and OpenVidia [5] are computer vision libraries that completely hide the underlying GPU architecture. Since they act as black-box solutions, it is difficult to combine them with existing multimedia applications that use the CPU. However, the CUDA kernels of such libraries can be reused in the presented approach. In [6], CUDA was integrated into a grid computing framework that is built on top of the DataCutter middleware. Since the DataCutter middleware focuses on processing a large number of completely independent tasks, it is not suitable for multimedia processing.

Most existing frameworks for distributed stream processing that also support GPUs have a special focus on computer graphics. For example, WireGL [7], Chromium [8], and Equalizer [9] support distributing workload to GPUs in the network. However, all these frameworks are limited to OpenGL-based applications and cannot be used for general processing on GPUs. Available distributed multimedia middleware solutions like NMM [2], NIST II [10], and Infopipe [11] are specifically designed for distributed media processing, but only NMM and NIST II support the concept of a distributed flow graph. Moreover, parallel binding and pipelined parallel binding are supported only by NMM and not by NIST II.

4 Integration of CUDA

Fig. 3. CUDA is integrated into NMM using a three-layer approach. All layers can be accessed by the application, but only the distributed flow graph is seen by default.

We use a three-layer approach for integrating CUDA into NMM, as can be seen in Figure 3. All three layers can be accessed from the application, but only the first layer, which includes the distributed flow graph, is seen by default. Here, all CUDA kernels are encapsulated in specific GPU nodes so that they can be used within a distributed flow graph for distributed stream processing. Since processing a specific buffer requires that all following operations be executed on the same GPU, our integration of CUDA ensures that all following GPU nodes use the same GPU. Therefore, GPU nodes are interconnected using the LocalStrategy, which simply forwards pointers to buffers. However, the concept of parallel binding is required for connecting CPU and GPU nodes. Here, incoming events are still forwarded to a LocalStrategy because a GPU node processes events in the same address space as a CPU node.
Incoming buffers are sent to a CPUToGPUStrategy or GPUToCPUStrategy, which copies media data from main memory to GPU memory or vice versa using CUDA's asynchronous DMA transfer. NMM also provides a connection service that is extended to automatically choose these transport strategies for transmitting buffers between GPU and CPU nodes. This shows that the approach of a distributed flow graph (1) hides all specific aspects of CUDA and GPUs from the application.

The second layer enables efficient memory management. Page-locked memory that can be copied directly to a GPU is allocated and managed by a CPUBufferManager, while a GPUBufferManager allocates and manages memory on a GPU. Since a CPUToGPUStrategy requires page-locked memory to avoid unnecessary copy operations, it forwards a CPUBufferManager to all predecessor nodes. This is enabled by the concept of shareable buffer managers described in Section 2. As can be seen in Figure 3, this is done as soon as a connection between a CPU node and a GPU node is established. Again, this shows the benefit of using a distributed multimedia middleware for multimedia processing on a GPU: GPU nodes can be combined with existing CPU nodes in an efficient way, i.e., without unnecessary memory copies and without changing the implementation of existing nodes.

Fig. 4. Buffer processing using a combination of CPU and GPU nodes.

The lowest layer is responsible for managing and scheduling the available GPUs within a system. Since we directly use the driver API, different GPUs can in general be accessed from the same application thread by pushing the corresponding CUDA context. However, the current implementation of CUDA's context management does not support asynchronous copy operations [12]. Thus, the CUDAManager maintains an individual thread for each GPU, called a GPU thread, for accessing that GPU. If a component executes a CUDA operation within one of its methods (e.g., executes a kernel), it requests the CUDAManager to invoke this method using a specific GPU thread. Executing a method through the CUDAManager blocks until the CUDAManager has executed the corresponding method.
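
The thread-per-GPU idea with blocking invocation can be sketched as follows. This is a minimal illustration under stated assumptions, not the CUDAManager of NMM: the class name `GpuThreadManager` and its methods are hypothetical, no CUDA calls are made, and a plain Python callable stands in for a method that would issue CUDA operations.

```python
import queue
import threading
from concurrent.futures import Future


class GpuThreadManager:
    """Hypothetical stand-in: one worker thread per GPU, blocking invoke()."""

    def __init__(self, num_gpus: int):
        self._queues = [queue.Queue() for _ in range(num_gpus)]
        self._threads = []
        for q in self._queues:
            t = threading.Thread(target=self._worker, args=(q,), daemon=True)
            t.start()
            self._threads.append(t)

    @staticmethod
    def _worker(q: queue.Queue) -> None:
        # Each worker serves exactly one GPU, so all calls for that GPU
        # run on the same thread, in the order they were submitted.
        while True:
            item = q.get()
            if item is None:          # shutdown marker
                break
            fn, args, fut = item
            try:
                fut.set_result(fn(*args))
            except Exception as exc:  # hand errors back to the caller
                fut.set_exception(exc)

    def invoke(self, gpu_id: int, fn, *args):
        """Run fn(*args) on the thread of gpu_id and block until it is done."""
        fut = Future()
        self._queues[gpu_id].put((fn, args, fut))
        return fut.result()

    def shutdown(self) -> None:
        for q in self._queues:
            q.put(None)
        for t in self._threads:
            t.join()
```

A caller would write something like `manager.invoke(0, node.launch_kernel, buffer)`, where `node.launch_kernel` is again a hypothetical method. Because `invoke` blocks, the calling component needs no knowledge of which thread actually touches the GPU.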
This completely hides the use of multiple GPU threads, as well as of different application threads accessing a GPU, from the application logic. Since page-locked memory is bound to a specific GPU, the CUDAManager instantiates a CPUBufferManager and a GPUBufferManager for each GPU. To propagate scheduling information between different nodes, each buffer stores information about the GPU thread that has to be used for processing. Moreover, CUDA operations are executed asynchronously in so-called CUDA streams. Therefore, each buffer provides its own CUDA stream, in which all components along the flow graph queue their CUDA-specific operations asynchronously. Since the buffers used by the asynchronous operations of a single CUDA stream can only be released once the CUDA stream is synchronized, each CUDA stream is encapsulated in an NMM-CUDA stream that also stores the involved resources and releases them when the CUDA stream is synchronized.

Figure 4 shows all steps for processing a buffer using CPU and GPU nodes:

1. When the CPUToGPUStrategy receives a buffer B_CPU, it requests a suitable buffer B_GPU for the same GPU. Then it initiates an asynchronous copy operation. Before forwarding B_GPU to GPU node A, it adds B_CPU to the NMM-CUDA stream, because this buffer can only be released once the CUDA stream is synchronized.
2. GPU node A initiates the asynchronous execution of its kernel and forwards B_GPU to GPU node B. Since both GPU nodes execute their kernels on the same GPU, the connecting transport strategy uses simple pointer forwarding to transmit B_GPU to GPU node B.
3. GPU node B also initiates the asynchronous execution of its kernel and forwards B_GPU.
4. When the GPUToCPUStrategy receives B_GPU, it requests a new B_CPU and initiates an asynchronous memory copy from GPU to CPU.
5. Finally, the transport strategy synchronizes the CUDA stream to ensure that all operations on the GPU have finished before forwarding B_CPU to CPU node B, and releases all resources stored in the NMM-CUDA stream.

5 Parallel Processing on Multiple GPUs

5.1 Automatic Parallelization

Automatic parallelization is only provided for GPUs inside a single machine. Here, the most important influence on scheduling is that page-locked memory is bound to a specific GPU.
This means that all following processing steps are bound to a specific GPU, while the GPU nodes themselves are not explicitly bound to a specific GPU. So if multiple GPUs are available within a single system, the processing of subsequent media buffers can be initiated automatically on different GPUs. However, this is only possible if a kernel is stateless and does not depend on information about already processed buffers, e.g., a kernel that changes the resolution of each incoming video buffer. In contrast, a stateful kernel, e.g., for encoding or decoding video, stores state information on a GPU and cannot automatically be distributed to multiple GPUs. When a stateful kernel is used inside a GPU node, the corresponding transport strategy of type CPUToGPUStrategy uses the CPUBufferManager of a specific GPU to limit media processing to a single GPU. But if a stateless kernel inside a GPU node is used, the corresponding CPUToGPUStrategy forwards a CompositeBufferManager to all preceding nodes. The CompositeBufferManager contains a CPUBufferManager for each GPU; when a buffer is requested, it asks the CUDAManager which GPU should be used and returns a page-locked buffer for the corresponding GPU. So far we have implemented a simple round-robin mechanism inside the CUDAManager that assigns buffers to one GPU after the other.

5.2 Explicit Parallelization

To explicitly distribute the workload of stateless kernels to multiple local or distributed GPUs, we provide a set of nodes that support explicit parallelization by the application developer. The general idea can be seen in Figure 5. A manager node is used to distribute workload to a set of local or distributed GPU nodes. Since the kind of data that has to be distributed to the succeeding GPU nodes strongly depends on the application, the manager node provides only a scheduling algorithm and distributes the incoming data from its predecessor node.

Fig. 5. The manager node distributes workload to connected GPU nodes. After data have been processed by all successive GPU nodes, they are sent back to an assembly node that restores the correct order of the processed data.

First, the manager node sends data to each connected GPU node, one after the other. As soon as a GPU node connected to the manager node has finished processing its data, it informs the manager node, by sending a control event, that it can accept new data for processing. The manager node in turn sends the next available data to this GPU node. This simple scheduling approach leads to efficient dynamic load balancing between the GPU nodes, because GPU nodes that finish processing their data earlier also receive new processing tasks earlier. This approach automatically accounts for differences in processing time that can be caused by using different graphics boards.
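
The scheduling behaviour of the manager node and the reordering done by the assembly node can be made concrete with a small simulation. This is a sketch under stated assumptions rather than the implementation used in NMM: the function names `schedule` and `assemble` are hypothetical, each GPU node is modelled only by a fixed processing time per item, and the "finished" control event is modelled by the point in simulated time at which a node becomes free again.

```python
import heapq


def schedule(num_items: int, seconds_per_item: dict):
    """Simulate a manager node that feeds whichever GPU node is free first.

    seconds_per_item maps a (hypothetical) node name to its processing time.
    Returns the items assigned to each node and the order in which results
    arrive at the assembly node.
    """
    assignments = {name: [] for name in seconds_per_item}
    # Initially every node is free, so each receives one item in turn.
    free_nodes = [(0.0, name) for name in sorted(seconds_per_item)]
    heapq.heapify(free_nodes)
    finished = []  # (finish time, sequence number)
    for seq in range(num_items):
        free_at, name = heapq.heappop(free_nodes)  # earliest "finished" event
        done_at = free_at + seconds_per_item[name]
        assignments[name].append(seq)
        finished.append((done_at, seq))
        heapq.heappush(free_nodes, (done_at, name))
    arrival_order = [seq for _, seq in sorted(finished)]
    return assignments, arrival_order


def assemble(arrival_order: list) -> list:
    """Assembly node: restore the original order by sequence number."""
    return sorted(arrival_order)


# A faster and a slower board: the faster one ends up with more items.
assignments, arrival_order = schedule(6, {"fast": 1.0, "slow": 3.5})
restored = assemble(arrival_order)
```

With unequal processing times, results reach the assembly node out of order, which is why the sequence number has to travel with each item.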

6 Performance Measurements

For all performance measurements we use two PCs, PC1 and PC2, connected through a 1 Gigabit/sec full-duplex Ethernet connection, each with an Intel Core2 Duo 3.33 GHz E8600 processor, 4 GB RAM (DDR MHz), running 64 Bit Linux (kernel ) and CUDA Toolkit 2.0. PC1 includes two and PC2 one Nvidia GeForce 9600 GT graphics boards, each with 512 MB RAM.

Buffer size   Reference [Buf/s]   NMM: 1 GPU [Buf/s]   NMM: 2 GPUs [Buf/s]   NMM: 3 GPUs [Buf/s]
              PC1                 PC1                  PC1                   PC1 + PC2
50 KB                             (103 %)              2152 (163.8 %)        2790 (212.32 %)
500 KB                            (117 %)              463 (231.5 %)         680 (340.0 %)
1000 KB                           (113.6 %)            234 (227.2 %)         342 (332.7 %)
2000 KB                           (113.4 %)            118 (226.9 %)         168 (323.1 %)
3000 KB                           (114.7 %)            79 (232.3 %)          105 (308.8 %)

Table 1. Performance of the NMM-CUDA integration versus a single-threaded reference implementation. Throughput is measured in buffers per second [Buf/s] for a stateless kernel using 1, 2 and 3 GPUs.

In order to measure the overhead of the presented approach, we compare a reference program, which copies data from CPU to GPU, executes a kernel, and finally copies the data back to main memory using a single application thread, with a corresponding NMM flow graph that consists of two CPU nodes and one GPU node in between using the same kernel. The comparison is based on the throughput in buffers per second. Since NMM inherently uses multiple application threads, which is not possible for the reference application without a framework like NMM, these measurements also include the influence of using multiple application threads. For all measurements, we used a stateless kernel that adjusts the brightness of incoming video frames.

The resulting throughput for different buffer sizes can be seen in Table 1. The achieved throughput of our integration is up to 16.7% higher than that of the reference application. These measurements show that using NMM as distributed multimedia middleware together with our CUDA integration adds no measurable overhead, even for applications running purely locally. Moreover, the presented approach inherently uses multiple application threads for accessing the GPU, which leads to a better exploitation of the GPU.
Adding a second GPU can double the buffer throughput for larger buffer sizes if a stateless kernel is used. Moreover, adding a single remote GPU shows that the overall performance can be increased by up to a factor of three. When using both PCs for processing, we use the manager node described in Section 5.2, which distributes workload to two GPU nodes, one running on PC1 and one on PC2. Since PC1 provides two GPUs and we use a stateless kernel, both graphics boards of PC1 are used. The assembly node runs on PC1 and receives the results from PC2. However, for large amounts of data the network turns out to be the bottleneck. In our benchmark we already send up to 800 MBit/sec in both directions, so that a fourth GPU in a remote PC cannot be fully utilized. In this case, faster network technologies, e.g., Infiniband, which provides up to 20 GBit/sec, have to be used.

7 Conclusion and Future Work

In this paper we demonstrated that a distributed multimedia middleware like NMM is able (1) to seamlessly integrate processing components using GPUs while completely hiding GPU-specific issues from the application developer, (2) to transparently combine processing components using GPUs or CPUs, and (3) to transparently use local and remote GPUs for distributed processing. From our point of view, a distributed multimedia middleware such as NMM is essential to fully exploit the processing power of today's GPUs while still offering a suitable abstraction for developers. Thus, future work will mainly focus on integrating emerging many-core technologies in order to determine which functionality should additionally be provided by a distributed multimedia middleware.

Acknowledgements

We would like to thank Martin Beyer for his valuable work on supporting the integration of CUDA into NMM.

References

1. NVIDIA: CUDA Programming Guide 2.0. (2008)
2. M. Lohse, F. Winter, M. Repplinger, and P. Slusallek: Network-Integrated Multimedia Middleware (NMM). In: MM '08: Proceedings of the 16th ACM International Conference on Multimedia. (2008)
3. M. Repplinger, F. Winter, M. Lohse, and P. Slusallek: Parallel Bindings in Distributed Multimedia Systems. In: Proceedings of the 25th IEEE International Conference on Distributed Computing Systems Workshops (ICDCS 2005), IEEE Computer Society (2005)
4. Y. Allusse, P. Horain, A. Agarwal, and C. Saipriyadarshan: GpuCV: An Open-Source GPU-accelerated Framework for Image Processing and Computer Vision. In: MM '08: Proceedings of the 16th ACM International Conference on Multimedia, New York, NY, USA, ACM (2008)
5. J. Fung and S. Mann: OpenVIDIA: parallel GPU computer vision. In: MULTIMEDIA '05: Proceedings of the 13th Annual ACM International Conference on Multimedia, New York, NY, USA, ACM (2005)
6. D.R. Hartley et al.: Biomedical Image Analysis on a Cooperative Cluster of GPUs and Multicores. In: ICS '08: Proceedings of the 22nd Annual International Conference on Supercomputing, New York, NY, USA, ACM (2008)
7. G. Humphreys et al.: WireGL: a scalable graphics system for clusters. In: SIGGRAPH '01: Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques. (2001)
8. G. Humphreys et al.: Chromium: a stream-processing framework for interactive rendering on clusters. In: SIGGRAPH '02: Proceedings of the 29th Annual Conference on Computer Graphics and Interactive Techniques. (2002)
9. S. Eilemann and R. Pajarola: The Equalizer parallel rendering framework. Technical Report IFI , Department of Informatics, University of Zürich (2007)
10. A. Fillinger et al.: The NIST Data Flow System II: A Standardized Interface for Distributed Multimedia Applications. In: IEEE International Symposium on a World of Wireless, Mobile and MultiMedia Networks (WoWMoM), IEEE (2008)
11. A. P. Black et al.: Infopipes: An Abstraction for Multimedia Streaming. Multimedia Syst. 8 (2002)
12. NVIDIA: CUDA Programming and Development. NVidia Forum (2009)