Network and Grid Support for Multimedia Distribution and Processing


Network and Grid Support for Multimedia Distribution and Processing
Petr Holub
A thesis submitted for the degree of Doctor of Philosophy at The Faculty of Informatics, Masaryk University Brno
May 2005
Brno, Czech Republic

Except where otherwise indicated, this thesis is my own original work.
Petr Holub
Brno, May 2005

ACKNOWLEDGMENTS

I would like to express my gratitude to prof. Luděk Matyska and dr. Eva Hladká for supporting and motivating me in my work. I would also like to thank my colleagues at the Laboratory of Advanced Networking Technologies: Lukáš Hejtmánek, Tomáš Rebok, Miloš Liška, Jiří Denemark, and others, for creating a great team that I truly appreciate working with. Furthermore, I'd like to thank my parents and my grandparents, and especially my grandfather Miloslav, who spent a huge amount of time with me and was an excellent professor. He taught me how to love languages and mathematics and how closely these fields are interrelated. And last but not least, I'd like to thank my wife Aleška, who has been helping me immensely in recent years, soothing and encouraging me in the moments when I was feeling really down.

P. H.

Abstract

In this thesis, we focus on two classes of multimedia data distribution and processing problems: synchronous or interactive, which require as low latency as possible, and asynchronous or non-interactive, where latency is not so restrictive.

In Part I, we study a scalable and user-empowered infrastructure for synchronous data distribution and processing. We propose a concept of a generalized user-empowered modular reflector called Active Element (AE). It supports both running as an element of a user-empowered distribution network suitable for larger groups and distributing the AE itself over tightly coupled computer clusters. While the networks of AEs are aimed at scalability with respect to the number of clients connected, the distributed AE is designed to be scalable with respect to the bandwidth of an individual data stream. We have also demonstrated medium-bandwidth pilot applications suitable for AE networks and high-bandwidth applications for distributed AEs.

For AE networks, we have analyzed a number of distribution models suitable for synchronous data distribution, ranging from simple 2D full-mesh models to multiple spanning trees. All the models were evaluated in terms of scalability and in terms of robustness of operation with respect to AE failure and network disintegration.

The most serious problem of the distributed AE, where data is split over multiple equivalent AE units running in parallel, is packet reordering. We have designed and evaluated the Fast Circulating Token (FCT) protocol, which provides limited synchronization among the egress sending modules of parallel paths in a distributed AE. While even the distributed AE with no explicit sending synchronization provides bounded reordering, we have shown both theoretically and experimentally that FCT reduces output packet reordering.

Part II presents our approach to distributed asynchronous multimedia processing.
We have designed an efficient model for distributing asynchronous processing that is capable of very complex processing in real time or faster, depending on the degree of parallelism involved. The model is based on creating jobs of uniform size for parallel computing nodes without shared memory, as available in Grid environments. It uses a distributed storage infrastructure as transient storage for source and possibly also target data. We have analyzed scheduling in such an environment and found that our problem with uniform jobs and non-uniform processors belongs to the PO class. When the distributed storage is connected with the computing infrastructure via a complete graph, the problem of scheduling tasks to storage depots belongs to the same class, and thus scheduling as a whole is a PO-class problem. In order to evaluate these models experimentally, a prototype called Distributed Encoding Environment (DEE) has been implemented on top of the Internet Backplane Protocol distributed storage infrastructure. The prototype confirms the expected behavior and performance. DEE is now used routinely by its pilot applications, most notably the processing of lecture recordings, which provides multi-terabyte archives of video material for educational purposes.

Contents

Contents
List of Figures
List of Definitions and Theorems
List of Abbreviations
1 Introduction
I Synchronous Distributed Processing
2 Objectives of Synchronous Distributed Processing
  Distribution of Processing Load
  Distribution of Network Load
  Fault Tolerance
3 State of the Art
  Multicast
  User-Empowered Modular Reflector
  Architecture of the Reflector
  Usage of the Reflector
  Resilient Overlay Networks
  Multimedia Processing and Distribution Systems Based on Overlay Networks
  Virtual Room Videoconferencing System (VRVS)
  Access Grid
  H.323 Videoconferences
  Use of Clusters as Distributed Routers
  Use of Clusters as Distributed Servers
  Peer-to-Peer Networks
  OptIPuter
4 Networks of Active Elements
  Synchronous Multimedia Distribution Networks
  Active Element with Network Management Capabilities
  Organization of AE Networks
  Re-balancing and Fail-Over Operations
  Distribution Models
  2D Full Mesh
  3D Layered-Mesh Network
  3D Layered Mesh of AEs with Intermediate AEs
  Multicast Schemes
  Content Organization

5 Distributed Active Element
  Architecture
  Operation in Static Environment
  Ingress Distribution
  Egress Synchronization
  Operation in Dynamic Environment
  Setup of a New Ring
  Addition of a New Node
  Failure Detection and Recovery
  Removal of Existing Node
  Communication between Distributed AE and Load Balancers
  Prototype Implementation
  Experimental Setup
  Performance Evaluation
  Packet Loss and Reordering Evaluation
6 Pilot Applications for Synchronous Processing
  DV over IP
  HDV over IP
  Uncompressed HD
II Asynchronous Distributed Processing
7 Introduction to Asynchronous Distributed Processing
  Objectives of Asynchronous Distributed Processing
  State of the Art
  Grid and Distributed Storage Infrastructure
  Video Processing Tools
  Automated Distributed Video Processing
8 Distributed Encoding Environment
  Model of Distributed Processing
  Conventions Used
  Scheduling algorithms
  Use Cases and Scenarios
  Components of the Model
  Processor scheduling
  Storage scheduling problem, 1 to 1 model
  Storage scheduling problem, 1 to n model
  Prototype Implementation
  Technical Background
  Architecture
  Access to IBP Infrastructure
  Scheduling Model
  Distributed Encoding Environment Performance Evaluation
9 Pilot Applications for Asynchronous Processing

10 Conclusions
  Summary and Discussion
  Synchronous processing
  Asynchronous processing
  Future Work
  Synchronous processing
  Distributed Encoding Environment
A Detailed Measurements Results
Bibliography

List of Figures

3.1 User-Empowered Modular Reflector Architecture
Architecture of the Active Element
2D full mesh
Flow analysis in full 2D mesh of AEs
Behavior of 2D full mesh for DV clients
3D layered mesh
Number of AEs needed for 3D mesh with intermediate hops
Recovery time for with backup SPT (solid line) and without it (dashed line) simulated using cnet-based network simulator
Model infrastructure for implementing the distributed AE
Model of the ideal distributed AE with ideal aggregation unit
Sample load balancing packet distribution for distributed AE
Fast Circulating Token algorithm
Alternative load balancing packet distribution
Distributed AE testbed setup
Forwarding performance of distributed AE without explicit synchronization for number of paths through
Forwarding performance of distributed AE with synchronization using FCT for number of paths 2 through
DV over IP based stereoscopic transmission
MPEG-2 Transport Stream packet format according to IEC
Latencies limits for collaborative environments
Workflow in the Distributed Encoding Environment model of processing distribution
Model of target infrastructure
PS Algorithm: Greedy algorithm for processor scheduling
DS Algorithm: 1 to 1 task scheduling
n-ts Algorithm: 1 to n task scheduling
Distributed Encoding Environment architecture and components
Simplified job scheduling algorithm with multiple storage depots per processor used for downloading (i.e., N to 1 data transfer) and neglecting the uploading overhead
Example DEE workflow for transcoding from video in DV to RealMedia format
Acceleration of DEE performance with respect to degree of parallelism
Scheme of the lecture recording and processing workflow

9.2 Interface to video lecture archives and example recording played from the streaming server
A.1 Execution profile of DEE using shared infrastructure
A.2 Execution profile of DEE with no remuxing using dedicated infrastructure
A.3 Execution profile of DEE with remuxing using dedicated infrastructure

List of Definitions and Theorems

Definition 4.1 Simple distribution models
Definition 4.2 2D full-mesh network
Definition 4.3 Evenly populated AE network
Theorem
Theorem
Definition
Definition 4.5 3D layered-mesh network
Theorem
Theorem
Theorem
Definition 4.6 q-nary distribution tree
Definition 4.7 3D layered mesh with intermediate AEs
Definition 4.8 Intermediate AE
Theorem
Theorem
Definition 5.1 Ideal network
Definition 5.2 Ideal multimedia traffic
Definition 5.3 Ideal aggregating unit
Definition 5.4 Ideal distributed AE
Definition 5.5 Ideal distribution unit
Theorem 5.1 Maximum reordering with no explicit synchronization
Definition 5.6 Non-preemptive data packet sending
Definition 5.7 Token handling priority
Theorem 5.2 Maximum reordering with FCT synchronization
Definition 5.8 Reordering graph
Theorem
Theorem
Definition 8.1 Data transcoding
Definition 8.2 Data prefetch
Definition 8.3 Completion Time Estimate
Definition 8.4 Network Traffic Prediction Service
Theorem
Theorem
Theorem
Theorem

List of Abbreviations

AAA  Authorization, Authentication, Accounting
ACL  Access Control List
AE  Active Element
AFS  Andrew File System
AG  AccessGrid
API  Application Programming Interface
AVI  Audio Video Interleave; an envelope video/audio format
CARP  Common Address Redundancy Protocol
CERN  Conseil Européen pour la Recherche Nucléaire
CPU  Central Processing Unit
CTE  Completion Time Estimate
DEE  Distributed Encoding Environment
DV  Digital Video
DVTS  Digital Video Transport System
EGEE  Enabling Grids for E-sciencE; European Grid project
FCT  Fast Circulating Token protocol
FIFO  First In First Out
FPGA  Field Programmable Gate Array
Gbps  Gigabit(s) per second; Gb/s
GE  Gigabit Ethernet
GM  the low-level message-passing system for Myrinet networks
GSI  Grid Security Infrastructure
HD  High-Definition (video)
HDV  SONY format for compressing HD video
HTTP  HyperText Transfer Protocol
IBP  Internet Backplane Protocol
IP  Internet Protocol
ITU  International Telecommunication Union
ITU-T  ITU Telecommunication Standardization Sector
LAN  Local Area Network
MBone  Multicast Backbone
Mbps  Megabit(s) per second; Mb/s
MCU  Multipoint Control Unit
MOV  QuickTime envelope video/audio format
MPEG  Moving Picture Experts Group
MPEG-2  MPEG video compression format
MPEG-4  MPEG video compression format
MSB  Multi-Session Bridge
MTU  Maximum Transmission Unit
NFS  Network File System protocol
NIS  Network Information Service
NM  Network Management
NTPS  Network Traffic Prediction Service
NTSC  National Television System Committee; a TV system
OWD  One-Way Delay
P2P  Peer-to-Peer
PAL  Phase Alternating Line; a TV system
PBS  Portable Batch System
PBSPro  Portable Batch System, commercial Professional version
PVM  Parallel Virtual Machine
QoS  Quality of Service
RAP  Reflector Administration Protocol
RFC  Request For Comments
RM  RealMedia video/audio format
RON  Resilient Overlay Network
RTCP  Real-time Transport Control Protocol
RTP  Real-time Transport Protocol
RTPv2  Real-time Transport Protocol version 2
RTT  Round-Trip Time
SDI  Serial Digital Interface
SMPTE  Society of Motion Picture and Television Engineers
TCP  Transmission Control Protocol
TTL  Time To Live
UDP  User Datagram Protocol
URI  Uniform Resource Identifier
URL  Uniform Resource Locator
VRRP  Virtual Router Redundancy Protocol
VRVS  Virtual Room Videoconferencing System
VLC  VideoLAN Client
XML  eXtensible Markup Language
3D  Three-dimensional

Chapter 1 Introduction

The current academic Internet environment has enabled fast transfers of huge amounts of data, making high-quality multimedia and collaborative applications a reality. Both collaborative and multimedia applications involve processing of specific data with special requirements on distribution and delivery. However, the processing itself often needs to become distributed, as the required amount of network traffic and processing capacity can easily overload any existing commodity centralized solution. Another reason for creating a distributed solution is improved robustness and fault tolerance.

The problem of distributed multimedia processing can be divided into two classes: synchronous (on-line or interactive) processing and asynchronous (off-line or non-interactive) processing. Though these two classes might ultimately converge, they have their own distinct problems and goals. Synchronous data processing aims at processing high data volumes with as low latency as possible, and thus the amount of processing is limited by the latency requirements. Asynchronous processing has no such strict demands on latency and can therefore involve more complex processing. However, the overall speed and scalability of the processing is of utmost importance. Another problem of non-interactive asynchronous data processing and distribution is the availability of transient storage of sufficient size and speed that can be accessed by acquisition tools, processing tools, and possibly also by client tools and/or distribution servers for later replay.

Both the synchronous and asynchronous environments that we set out to create in this work need to be experimentally evaluated with pilot client applications. We focus our attention mainly on high-quality video transmission, which generates high volumes of data at high rates (e.g., uncompressed high-definition (HD) video consumes approximately 1.5 Gbps) and can be used in both synchronous and asynchronous modes.

This work is organized into two parts: in Part I, we discuss problems and proposed solutions for synchronous distribution environments, while Part II deals with asynchronous infrastructure. The organization of each part is outlined in its introductory chapter, and the results and future work for both parts are summarized in the concluding Chapter 10.

Claims

The author of this thesis claims the following contributions to the state of the art of network and Grid support for multimedia processing and distribution:

- Proposal of the Active Element (AE) architecture designed as a user-empowered distributed network element for multimedia distribution and processing.

- Design and analysis of several robust models for synchronous data distribution, scalable with respect to the number of connected clients.
- Proposal of a distributed active element capable of processing streams with bandwidth higher than the capacity of each individual AE.
- A load-sharing distributed AE without explicit synchronization is analyzed and shown to provide bounded reordering.
- A load-sharing distributed AE with synchronized sending using the Fast Circulating Token (FCT) protocol is designed, analyzed, and shown to provide bounded reordering superior to sending without explicit synchronization.
- The behavior of the load-sharing distributed AE, both with FCT and without it, has been experimentally studied using a prototype implementation based on state-of-the-art computing clusters with low-latency interconnects.
- A Distributed Encoding Environment workflow whose parallel phase scales linearly with the number of nodes involved in processing until a task-specific maximum number of nodes is reached.
- Analysis of task scheduling for the Distributed Encoding Environment with respect to distributed processors and distributed data storage. The scheduling problem belongs to the PO class.
- A number of pilot applications using prototype implementations of both the synchronous and asynchronous distribution and processing models.

Part I Synchronous Distributed Processing

Chapter 2 Objectives of Synchronous Distributed Processing

The problem of synchronous (on-line or interactive) processing lies in providing an environment with as low latency of processing and distribution as possible. For example, human perception of sound registers latency as low as 100 ms in general communication, while the latency requirements of visual perception are not that strict. However, for a truly natural collaborative environment the video must be rather precisely synchronized with the audio (so-called "lip synchronization"), so that the video latency is of the same importance. Furthermore, when haptic interaction (force feedback) is incorporated, the latency and especially jitter requirements become tighter by several orders of magnitude.¹

Typical applications for synchronous distributed processing are high-quality videoconferencing and remote collaborative applications. These might include, for instance, transmission of stereoscopic (3D) video to provide natural perception of the communicating partner, transmission of visualizations of 3D models, transmission and processing of uncompressed video and audio to minimize latency, and many others.

A distributed environment for processing high volumes of data at high rates requires distribution of

- processing load, to be able to process amounts of data that are impossible to process on any commodity computer today,
- network load, to avoid bottlenecks formed by the network interface of a single processing computer.

Network load distribution has two aspects, which are handled separately in this work: first, a distribution that boosts scalability with respect to the number of clients served by the infrastructure, and second, a distributed scheme that allows processing of high-bandwidth data streams that are beyond the capacity of any single processing node. Distribution also allows forming overlay networks that can be used to provide fault-tolerant behavior, as shown later in this chapter.
We also seek a more general framework for distribution of processing and transportation of multimedia data and the possibilities it can bring. Our work deals with employing commodity PC clusters interconnected with low-latency interconnects (so-called tightly coupled clusters) to perform distributed processing of multimedia data and to distribute them to clients. Higher-level clusters of the tightly coupled clusters, or separate computers interconnected via network links with higher latency (and thus possibly distributed across a wide area), will be used to create overlay networks. Such clusters are often available as a part of a Grid high-performance computing environment. On the transport level, a standard IP network with common commodity equipment interconnecting processing nodes with clients is expected.

¹ The commonly understood threshold for natural perception of haptic interaction is 5,000 Hz, and thus precision of timing, or jitter, of much less than 200 µs is needed.

2.1 Distribution of Processing Load

The distribution of the processing load is efficient only when computationally intensive processing is required. Additionally, the needed network capacity must be substantially lower than the capacity of commodity PC internal interconnects and buses. The following applications are possible examples suitable for distribution of processing load only [32]:

Multimedia stream transcoding: Multimedia transcoding is conversion from a source to a target format. For example, video transcoding often involves multiple discrete cosine transforms, complicated matrix transforms, and other computationally intensive operations. Although it is possible to distribute the compression/decompression algorithm itself for some formats, it is often more effective to distribute processing on either a per-frame² or a per-client basis. This capability is useful, for example, when some client application needs multimedia data in a format different from the rest of the collaborating group.

Video de-interlacing: Another example is when a video image is captured and transmitted in interlaced mode³ and client applications need progressive video to display the image correctly. The so-called de-interlacing process is computationally very demanding if high-quality output is requested, as it requires decompression of two successive fields, blending odd and even lines together, and finally compression of the resulting image into the target format. The distribution can be on a per-frame basis.
It is nearly impossible to do this in real time on current commodity PCs without additional hardware support, especially for HD video. It makes sense to perform de-interlacing centrally, as almost no client application displays its output on an interlaced display and the video producers are unable to transform it on their own because of excessive computational demands. This transformation also does not change the bandwidth needed, and thus if the original stream was processable by the network subsystem of a commodity PC, the resulting stream remains processable as well.

Multimedia stream composition: Stream composition means either merging several streams together (for sound) or arranging several down-sized (down-sampled) images into one frame [38]. This involves decompression and down-sizing of many streams in parallel, and thus one computer may not be sufficient. The decompression and down-sizing phase can be efficiently performed on many nodes in parallel. This capability is useful, for example, if the processing power at client sites is insufficient to decode and play a large number of simultaneous streams.

The distribution of the processing load is not very hard from a theoretical point of view when a modular user-empowered reflector is used: distribution of processors can be handled efficiently, e.g., using message passing like MPI over Myrinet. Being thus more a question of development and implementation, it is not the focus of this work.

² For low-latency video transmission, video compression algorithms producing independent frames or even independent blocks are used to minimize the latency needed to compress and decompress the image. This approach also limits the impact of data loss in the network.

³ Interlaced video means the odd lines are displayed first, followed by the even lines (or vice versa), to allow doubling of the display frequency in order to achieve smooth perception of movement in the image. One video frame is split into two fields, one of them containing the odd lines and the other the even lines. This is a common approach employed in most video hardware devices such as cameras and TV sets. Progressive video displays the whole image at once, and this is how computers typically display their output.
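As an illustration of the de-interlacing step described above, the following minimal Python sketch weaves an odd and an even field into one progressive frame and then applies a simple vertical blend. It is not taken from the thesis: the function names `weave` and `blend`, the representation of a field as a list of scan lines (each a list of integer samples), and the two-line averaging filter are assumptions chosen for brevity, and the decompression and recompression steps are omitted.

```python
def weave(odd_field, even_field):
    """Interleave two fields (lists of scan lines) into one progressive frame.

    Assumption: the odd field carries the first, third, ... lines of the frame.
    """
    if len(odd_field) != len(even_field):
        raise ValueError("fields must have the same number of lines")
    frame = []
    for odd_line, even_line in zip(odd_field, even_field):
        frame.append(odd_line)
        frame.append(even_line)
    return frame


def blend(frame):
    """Average each line with the following one to soften combing artifacts.

    The last line has no successor and is averaged with itself.
    """
    blended = []
    last = len(frame) - 1
    for i, line in enumerate(frame):
        following = frame[min(i + 1, last)]
        blended.append([(a + b) // 2 for a, b in zip(line, following)])
    return blended


if __name__ == "__main__":
    odd = [[10, 10], [30, 30]]
    even = [[20, 20], [40, 40]]
    print(blend(weave(odd, even)))
```

Because each output frame in this sketch depends only on its own two fields, frames could be handed to different workers independently, which is consistent with the per-frame distribution mentioned in the text.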

2.2 Distribution of Network Load

The distribution of the processing load alone is not suitable for many multimedia applications. Commodity PCs are usually equipped with network interfaces ranging from 100 Mbps Fast Ethernet to 2 Gbps Gigabit Ethernet, all of them usually in full-duplex mode. For the Myrinet low-latency interconnect, the available capacity is 2 Gbps in full-duplex mode. However, when a PC is used to handle Gbps traffic, one processor usually gets saturated by servicing the network interface card, and a second processor is required to compute the actual data transformations. Furthermore, the internal architecture of the most commonly used IA32 PCs with PCI buses is easy to saturate when working with multi-gigabit data flows.

When significantly more network bandwidth needs to be processed, the network load has to be distributed over multiple hosts. A switch interconnecting a cluster almost always has higher switching capacity than any of the computers connected, and thus the switch is not the network bottleneck for serial processing by any single node. Again, the minimalist target scenario is to create a distributed environment with higher maximum processable throughput than is possible with any single computer in the cluster.

When several machines are used for sending data that are part of a single stream, it is necessary to design and implement a synchronized sending architecture to avoid large packet reordering, which has a negative impact on most applications. In Chapter 5 we show that even without explicit synchronization the packet reordering is bounded from above, and we also propose and evaluate a protocol to further reduce the reordering.
Examples of possible scenarios that can utilize distribution of both processing and network load are shown below:

Processing many streams with medium bandwidth requirements: When standard-definition DV video and audio is used by 10 collaborating clients, each of them sends one stream and receives streams from all other clients, resulting in 30 Mbps sent and 270 Mbps received per client; this can be handled by one PC without serious problems. The total bandwidth required at the processing site is 300 Mbps for receiving and 2.7 Gbps for sending, which is clearly beyond what is possible to process on a single commodity PC.

Processing one high-bandwidth stream: A high-bandwidth stream can be, for example, a single uncompressed HD stream (1.5 Gbps) or a high-resolution visualization with virtually unlimited bandwidth utilization. It might be necessary, e.g., to distribute such streams to multiple clients or to down-sample such a stream to make it accessible for clients with lower-bandwidth connectivity.

2.3 Fault Tolerance

Failures of links in wide area networks are likely to occur quite frequently, as shown in [2], even when the network as a whole is rather stable. Failures can also happen inside the distributed processing environment, e.g., when a cluster of processing nodes crashes or becomes unavailable for any other reason. Therefore an environment designed to support synchronous communication of clients distributed across wide area networks should attempt to mitigate the clients' perception of these problems. Especially when it comes to multimedia distribution using native multicast, the distribution scheme is over-optimized from the fault-tolerance point of view. Therefore, as a part of the network load distribution models scalable with respect to the number of clients connected, we study different distribution schemes with varying scalability and robustness ratios.
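The bandwidth figures of the DV scenario in Section 2.2 (ten clients at 30 Mbps each) follow from simple arithmetic, which the short Python sketch below reproduces. The function name `reflector_load_mbps` and the dictionary keys are hypothetical, and the sketch assumes that every client sends exactly one stream and receives one copy of every other client's stream through a single processing site.

```python
def reflector_load_mbps(clients, stream_mbps):
    """Return per-client and per-site bandwidth (in Mbps) for full N-to-N distribution.

    Assumption: each client sends one stream of `stream_mbps` and receives
    the streams of all other clients via one central processing site.
    """
    client_send = stream_mbps
    client_recv = (clients - 1) * stream_mbps
    return {
        "client_send": client_send,
        "client_recv": client_recv,
        "site_recv": clients * client_send,   # one incoming stream per client
        "site_send": clients * client_recv,   # every client gets all other streams
    }


if __name__ == "__main__":
    load = reflector_load_mbps(clients=10, stream_mbps=30)
    for name, mbps in load.items():
        print(f"{name}: {mbps} Mbps")
```

The outgoing load at the processing site grows quadratically with the number of clients, which is why the site becomes the bottleneck long before any individual client does.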

Chapter 3 State of the Art

There are a number of systems available for synchronous distribution of multimedia data in a non-distributed fashion, and some of them even have limited processing capabilities. Most of these systems are also designed without the user-empowered paradigm in mind. Our approach differs in distributing the processing either over a network of elements or over a tightly coupled cluster of computers, to allow distribution of both a high number of streams (clients) and high-bandwidth streams while also allowing more demanding processing. Therefore our work is related not only to multimedia processing, but also to projects utilizing PC clusters as distributed routers and servers. We also regard robustness as an important cornerstone of larger distributed systems and propose distribution models crafted with both scalability and robustness in mind.

In this chapter we give a brief overview of related systems. We give a short overview of the user-empowered modular reflector, which is a predecessor of our active elements described in Chapters 4 and 5. We emphasize systems that create overlay networks for increasing robustness instead of relying on the robustness of underlying networks. We also briefly describe the status quo of peer-to-peer network architectures, which we consider important due to their fault tolerance support. We conclude this chapter with a description of OptIPuter, which we see as an interesting holistic vision of a future distributed collaborative platform that comprises all the levels from the optical network to end-user applications.

3.1 Multicast

The multicast scheme has been designed for unidirectional data transmissions to reach any subset of nodes in the network while sending the data over any link at most once [66]. It involves no additional data processing except for simple data replication where appropriate, and thus any user-specific data handling (processing, QoS, etc.) is impossible.
Multicast is the natural solution for synchronous data distribution, as it involves data replication directly in the network so that the same data are transferred at most once over any particular link. Indeed, it was associated with multimedia transmission from the beginning, as no lines of sufficient capacity were available for multimedia distribution around the year 1985, when the first prototypes of multicast in IP networks appeared. While this approach implies large ("infinite") scalability, it imposes non-trivial requirements on the network, as all the network nodes must support it in a consistent way.

The MBONE (Multicast Backbone) network was created by Steve Deering at the beginning of multicast history as an overlay over the underlying unicast network. The multicast networks were connected using tunnels created by mrouted daemons [82]. Each mrouted connects to other mrouted daemons and creates tunnels that deliver the multicast traffic. The tunnel is used to encapsulate multicast packets into datagrams, which are in turn sent through the unicast network. Routing among mrouted daemons is performed using the Distance-Vector Multicast Routing Protocol (DVMRP). Nowadays, more advanced protocols like Protocol Independent Multicast [8] (either in so-called Sparse Mode [2] or Dense Mode []) are deployed, which are not supported in the original mrouted software. The prevailing multicast routing protocol in the current Internet is PIM-SM. These protocols improve the behavior of the original multicast protocols to some extent; however, the basic problems of multicast discussed above (e.g., per-client QoS handling) remain the same. Therefore multimedia distribution either uses multicast simulation, as shown in the subsequent section (the idea of virtual multicast is also proposed in [32]), or uses a hybrid approach where multicast-enabled clients are served via multicast and other clients using unicast distribution.

Despite continuous effort, only a small fraction of places on the Internet have reliable native multicast connectivity, and users are left at the mercy of administrators regarding their multicast connectivity, as multicast is not user-empowered. Another problem with multicast is its implementation directly on routers inside the network: this is great from the efficiency point of view, but a disaster if these routers lack strict separation of processes in their internal operating system. If any problem occurs with the stability of routers due to the multicast implementation (and there is a good chance of this happening, as multicast is indeed very complicated compared to unicast and puts much more load on the router because the data must also be replicated), the router administrator simply cuts off multicast because he or she cannot afford to threaten unicast routing stability.
Furthermore, all the practically used multicast protocols have other disadvantages as well: it is nearly impossible to take care of quality-of-service requirements for the whole multicast group, it is very difficult to provide a secured environment without a shared key, and there is no easy support for accounting.

3.2 User-Empowered Modular Reflector

The problems of multicast may be overcome by multicast connectivity simulation or virtualization, where active nodes play the role of reflectors [32] that replicate all traffic passing through them in a controlled way. Another important property of the reflectors is that they can be used for any content processing in many different modes, be it on a per-stream, per-group, or per-client basis.

3.2.1 Architecture of the Reflector

The design of a reflector must be flexible enough to allow implementation of required features while leaving space for easy extension with new features. This leads to a design that is very similar to the active router architecture [37], modified to work entirely within user space. Users without administrator privileges are thus able to run the reflector on any machine they have access to. The reflector architecture is shown in Fig. 3.1.

Data Processing Architecture: The data routing and processing part of the reflector comprises network listeners, shared memory, a packet classifier, a processor scheduler, a number of processors, and a packet scheduler/sender. The network listeners are bound to one UDP port each. When a packet arrives at a listener, the listener places the packet into the shared memory and adds a reference to a to-be-processed queue. The packet classifier then reads the packets from that queue and determines the path of the data through the processor modules. It also checks with the routing AAA module whether the packet is allowed or not (in the latter case it simply drops the packet and creates an event that may be logged).
Zero-copy processing is used in all simple processors (packet filters), minimizing processing overhead (and thus packet delay). E.g. for simple data multiplication, the data are only referenced multiple times in the packet scheduler/sender queue before they are actually sent. Only the more complex modules may require processing that is impossible without the use of packet copies.

FIGURE 3.1: User-Empowered Modular Reflector Architecture (components: messaging interfaces, management, administrative AAA, resource management, routing AAA, session management, network listeners, shared memory, packet classifier, processor scheduler, processors, packet scheduler/sender; arrows distinguish data flow from control information).

The session management module follows the processors and fills the distribution list of target addresses. The filling step can be omitted if the data passed through a special processor that filled the distribution list structure and marked the data attribute appropriately (this allows client-specific processing). A processor can also poll the session management module to obtain an up-to-date list of clients for a specified session. The session management module also takes care of adding new clients to the session as well as removing inactive (stale) ones. When a new client sends packets for the first time, the session management module adds the client to the distribution list (data from a forbidden client have already been dropped by the packet classifier). This mechanism is designed to work with the MBone Tools suite, but it can easily be extended with other ways of working with the session management module and of adding or removing items to/from the distribution lists. Information about the last activity of a client is also maintained by the session module and is used for periodic pruning of stale clients. Even when the distribution list is not filled by the session management module, packets must pass through it to allow addition of new clients and removal of stale ones.
When the packet targets are determined by the router processor, a reference to the packet is put into the to-be-sent queue. The packet scheduler/sender then picks up packets from that queue, schedules them for transmission, and finally sends them to the network. Per-client packet scheduling can also be used, e.g. for client-specific traffic shaping.
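The session handling and zero-copy replication described above can be illustrated with a minimal Python sketch. It is not the reflector's actual code: the `Session` class, the `reflect` function, the 30 s staleness timeout, and the choice not to echo packets back to their sender are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Distribution list for one reflector port (hypothetical structure)."""
    stale_after: float = 30.0  # assumed inactivity timeout in seconds
    last_seen: dict = field(default_factory=dict)  # client -> last activity time

    def touch(self, client, now):
        # A client joins the distribution list when it first sends a packet.
        self.last_seen[client] = now

    def prune(self, now):
        # Drop clients that have been silent for longer than stale_after.
        stale = [c for c, t in self.last_seen.items() if now - t > self.stale_after]
        for client in stale:
            del self.last_seen[client]
        return stale

    def targets(self, sender):
        # Assumption: every active client except the sender gets the packet.
        return [c for c in self.last_seen if c != sender]


def reflect(session, sender, payload, now):
    """Return (target, buffer) pairs; the buffer is shared, never copied."""
    session.touch(sender, now)
    session.prune(now)
    view = memoryview(payload)  # one buffer, referenced once per target
    return [(target, view) for target in session.targets(sender)]
```

Each returned pair would be handed to the packet scheduler/sender; because every pair references the same `memoryview`, replication to many clients costs one reference per client rather than one copy.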

The processor scheduler is not only responsible for scheduling the processors; it also takes care of the start-up and (possibly forced) shutdown of processors, which can be controlled via the administrative interface of the reflector. It checks resource limits with the routing AAA module while scheduling and provides statistics back for accounting purposes.

Administrative Part

Communication with the reflector from the administrative point of view is provided by the messaging interfaces, the management module, and the administrative AAA module of the reflector. Commands for the management module are written in a specific message language. The messaging interface is a generic entity which can be instantiated as e.g. RPC, SOAP over HTTP, a plain HTTP interface with SSL/TLS or GSI support, or a simple TCP connection bound to the loop-back interface of the machine running the reflector. Each of these interfaces unwraps the message if necessary and passes it to the management module. The message language for communication with the management module is called the Reflector Administration Protocol (RAP) and is described in [9]. More information on the administrative part of the reflector can be found in [32] and [3].

3.2.2 Usage of the Reflector

The basic function of the reflector is retransmission of received data to one or more listeners. This can easily be extended to support other useful functions. The reflector replicates all the traffic coming through a specified port to all the clients connected to that port. MBone Tools based clients do not need to interact in advance: they just connect to the reflector to automatically receive all the traffic sent to it, and all the client traffic is automatically distributed by the reflector as well. The reflector security policy (per port or per client) may change this behavior and forbid some clients from listening or sending data.
3.3 Resilient Overlay Networks

The Resilient Overlay Network (RON) approach [2] aims to build a rather general overlay network on top of an IP-based network in order to speed up recovery from network outages and to improve routing between hosts in different autonomous systems using more complicated metrics than simple hop counts. The whole system is based on the assumption that while very simple metrics and robust distribution of routing information are required in the Internet, which comprises hundreds of millions of nodes (as represented by BGP-4), overlay networks of limited size (ranging from 2 to 50 RON forwarders) can use much more sophisticated mechanisms. RON evaluates three basic metrics in order to choose the optimum path: (a) latency, (b) packet loss, (c) estimated maximum TCP throughput. The topology information is disseminated using a link-state algorithm. RON is capable of failure detection and recovery in 18 s on average, and the routing detour usually includes no more than one additional RON forwarder. RON also allows application integration and expressive policies for choosing the optimum path for specific applications (e.g. one application might need as low a transmission latency as possible while another needs the maximum available TCP throughput). RON attempts to perform per-flow routing, i.e. it avoids sending data from one application over multiple parallel links, in order to get rid of the packet reordering problem. Other similar approaches based on general overlay networks that improve performance or implement features not available in the underlying network include Detour [6] and X-Bone [64].
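The observation that a detour rarely needs more than one additional forwarder suggests a very simple path search. The sketch below is an illustration only, not RON's algorithm: it considers latency alone (RON also weighs packet loss and TCP throughput), and the `best_path` function and the nested-dictionary latency table are hypothetical.

```python
import math


def best_path(latency, src, dst):
    """Pick the lowest-latency path using at most one intermediate forwarder.

    latency[x][y] is the measured latency from x to y; math.inf marks a
    link that is currently considered failed.
    """
    best = ([src, dst], latency[src][dst])  # start with the direct path
    for mid in latency:
        if mid in (src, dst):
            continue
        cost = latency[src][mid] + latency[mid][dst]
        if cost < best[1]:
            best = ([src, mid, dst], cost)
    return best


# The direct a->b link has failed, so traffic is detoured through c.
table = {
    "a": {"b": math.inf, "c": 10},
    "b": {"a": math.inf, "c": 15},
    "c": {"a": 10, "b": 15},
}
path, cost = best_path(table, "a", "b")
```

Restricting the search to a single intermediate node keeps the cost linear in the number of forwarders, which is affordable precisely because the overlay is small.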

3.4 Multimedia Processing and Distribution Systems Based on Overlay Networks

The most advanced current high-speed networks are heading towards being very fast while providing only very simple (basic) services. Furthermore, lessons learnt from problems with implementing advanced functionality in the network layer show that it might be more appropriate to create overlay networks on top of dumb and fast networks to provide the advanced features needed. Overlay networks built on IP networks use just the basic unicast routing and transport functionality of the underlying network, while the other functions are implemented in a way orthogonal to the IP infrastructure. When some problem with an overlay network occurs, the underlying IP network is not influenced and keeps working for other traffic and other clients.

3.4.1 Virtual Room Videoconferencing System (VRVS)

The Virtual Room Videoconferencing System (VRVS) [95] is one of the most popular systems based on the idea of overlay networks built on top of unicast networks, with UDP/RTP packet reflectors taking care of distributing packets to all conference participants. It was originally developed at CERN and Caltech for communication among physicists studying the behavior of high energy particles, who are spread all over the world and need to discuss their scientific problems.

On the client side, VRVS primarily uses MBone tools. Additional tools like chat or VNC [58] are also available. VRVS uses a web-based portal as the videoconferencing front-end. Administration of the reflectors is done by a small closed group of people called VRVS administrators. To get a reflector local to some user group, the user group needs to provide a Linux-based computer with remote access (typically via SSH) in advance; the VRVS administrators then install all the necessary software there and also take care of supporting it afterwards.
When using VRVS, the system automatically selects the most appropriate reflector based on the location of the users (in practice, the geographically closest one is usually selected) and the users are given no option to change it. This may not avoid transmitting lots of data through costly lines. From the distributed multimedia processing point of view, the unique feature of VRVS reflectors is the gateway functionality that allows VRVS to interconnect the world of MBone tools with the world of H.323 videoconferencing systems. Theoretically, the only thing that needs to be handled when interconnecting MBone Tools and H.323 is the creation/removal of H.323 signaling, as the MBone Tools use no real signaling protocols and the basic video and audio transmission formats remain the same (typically H.261 for video and µ-law/A-law for audio). However, as VRVS is a closed-source system provided as a service, it is hard to verify what the real processing architecture is and whether the H.323/MBone Tools conversion is really distributed or whether the reflectors work as relays only and the processing is performed in a centralized manner.

3.4.2 Access Grid

The Access Grid (AG) system [9, 0, 23] has been built in order to enable communication among researchers who collaborate in Grid environments. The AG communication environment uses the MBone Tools videoconferencing software and basically relies on multicast support in the network. Integration of AG with DV video transmissions based on the DVTS tools (Sec. 6.) has been tested [29], relying again on multicast or even layered multicast [28, 54]. However, because of the known problems with multicast deployment, AG uses one system for bridging multicast sessions between a multicast-enabled network and a network with only unicast connectivity (or a network which is locally multicast enabled but has no multicast peering with other networks) and another system for running videoconferencing sessions on a unicast network only.
The first scenario uses QuickBridge software at some site which has multicast connectivity to the AG network; clients in the unicast network then point their client tools directly to the QuickBridge server. The second scenario is based on software called Multi-Session Bridge, which is run on servers at Argonne National Laboratory (ANL) that also run the basic AG infrastructure. A client on a unicast network can join AG using the vtc client, which is provided within the AG software suite.

QuickBridge [6] was developed at the University of Southern California and has been further enhanced by the AG team. It is a simple reflector which uses IP address/subnet based authentication. A database of AG rooms and the corresponding multicast addresses and ports is maintained for QuickBridge, and thus its administrator just specifies which room should be bridged.

The Multi-Session Bridge (MSB) was created at Fermi National Accelerator Laboratory [59]. It consists of a server called msb, a client called vtc, and a web server web-vtc. It uses its tunneling capabilities to create an overlay network which can bridge both unicast and multicast videoconferences that use RTP v2 streams (like MBone Tools). Again, the authentication is based on IP address restrictions, and it also features simple plain-text client-server authentication. Furthermore, bridging can be restricted to a specified direction only. The MSB is known to have scalability problems as it is a centralized solution, and we can confirm these problems based on our own experience.

From the distributed processing point of view, there is no processing on these reflectors except for the distribution of the streams to all clients.

3.4.3 H.323 Videoconferences

H.323 based videoconferences use the H.323 signaling protocol as an all-encompassing envelope for underlying protocols for channel negotiation, setup, and teardown, and for video, audio, and data transmission.
The H.323 videoconferencing protocol was designed to directly support point-to-point videoconferences only. Multi-point videoconferencing capabilities are provided by hardware or software devices called Multipoint Control Units (MCUs). With only two exceptions, these devices are either expensive hardware boxes with professional features like power supply redundancy, designed for maximum reliability, or limited versions built into high-end videoconferencing stations. As for the two exceptions: the first is the VRVS system described above and the second is an open source implementation called OpenMCU, described later in this section. Both of these have rather limited capabilities compared to full-featured MCUs.

A typical MCU can provide several videoconferencing modes, like scaling down and merging several video streams to fit onto one screen, or video switching where the video follows the loudest audio source. Audio is combined in such a way that all the participants can hear one another. However, it is a common problem that one participant produces some disturbing sounds and nobody except the MCU administrator is able to cut him or her off the videoconference. Even the MCU administrator might face serious problems, since only a few H.323 implementations feature reasonable user identification (although it is a feature defined in the H.323 standard).

There is a single open source software implementation of an H.323 MCU, called OpenMCU. It has been created as a part of the OpenH323 Project [86]. The OpenH323 project also provides other open source tools needed for the H.323 tool chain, e.g. a gatekeeper called OpenGK, an OpenPhone client, and an H.323 answering machine called OpenAM. The OpenMCU implementation is written using the H.323 library, which is the basic component created by the OpenH323 project and which is also used by many other projects (e.g. the software H.323 client GnomeMeeting). OpenMCU is known to work on the FreeBSD and Linux operating systems.
It supports the G.711, GSM, MS-GSM, and LPC-10 audio codecs and the H.261 video codec. It also supports multiple parallel videoconferences using a room concept.

1 Early versions of the VRVS system were created on the basis of the MSB.

As for the processing, the audio streams are combined on OpenMCU and thus all the participants can be heard, but video can be seen from at most four users that are actively talking. The H.261 streams are down-sampled to 1/4 of the original size and the resulting images are placed into a 2×2 grid.

3.5 Use of Clusters as Distributed Routers

There is a known effort to use computer clusters with a low-latency interconnecting infrastructure as high-performance and scalable routers. Probably the most advanced and theoretically founded achievement is the Suez project [2, 3]², based on commodity PC clusters with Myrinet interconnection. The system works as follows: each cluster node has an internal interface to the Myrinet switch for internal communication within the cluster and optionally one or more external interfaces. Both internal and external interfaces rely on specific capabilities of Myrinet interface cards and drivers (e.g. PeerDMA transfer). From the external point of view, the routing is performed among the external interfaces. Suez uses a routing-table search algorithm that exploits the CPU cache for fast lookup by treating IP addresses directly as virtual addresses. To scale the number of real-time connections supportable at a given link speed, Suez implements a fixed-granularity fluid fair queuing algorithm (FGFFQ, also called DFQ, standing for Discretized Fair Queuing) [] that eliminates the per-packet overhead associated with FIFO and the per-connection overhead of real-time scheduling based on the conventional weighted fair queuing algorithms.

Another project which distributes processing load on active network elements is the Active Network Node [7], which relies on specialized hardware. The Software DSM project [27] attempts to build efficient distributed memory for closely coupled clusters in order to use them as active routers. There is yet another similar project called Cluster-based Active Network Router [42].
However, none of the above-mentioned projects addresses network load distribution finer than per-address, and thus they have no need to solve packet reordering issues.

3.6 Use of Clusters as Distributed Servers

A number of servers that utilize computer clusters as distributed servers are available. Most distributed servers are prototyped as web servers [3, 7, 57], both for simplicity and because rather standard and straightforward performance evaluation is available. For example, Carrera and Bianchini recently demonstrated a cluster-based web server called PRESS [8], concentrating on demonstrating the advantages of user-level communication such as low processor overhead, remote memory accesses, and zero-copy transfers. For the prototype implementation, they used the Virtual Interface Architecture user-level standard for intra-cluster communication. The server is designed as a locality-conscious server [57] in order to utilize caching of the served data. After evaluating performance, they found that user-level communication is more than 50% more efficient than kernel-level communication, and they achieved close to linear scalability up to 8 cluster nodes, which is the maximum they used for evaluation.

2 For some reason, the only publicly available article detailing Suez principles and internals is available when unpacking Suez distribution available at Further information has been obtained via private communication with authors.

3.7 Peer-to-Peer Networks

Peer-to-peer (P2P) networks have gained enormous popularity, both positive and negative, for file sharing and distribution. These systems provide very robust functionality for neighbor discovery and failure tolerance. However, it seems to be hard to find a good compromise between scalability or efficiency on one side and robustness of the whole system on the other. Several architectures are employed in P2P networks, achieving different ratios of scalability and robustness. A good overview of P2P architectures is given in [65].

Pure or decentralized systems. In this model, there is no central authority and all nodes are equal, and thus there is no single point of failure. However, due to the lack of hierarchical structure, this class of systems has scalability problems, e.g. in file sharing networks, where search queries flood the P2P network.

Centralized systems. This class of systems is at the opposite extreme of the P2P spectrum. There is a central authority which is used for directory and discovery services, making searches and discovery very efficient but resulting in a single point of failure. The central node can also become overloaded as both the network and the number of requests grow.

Super-peer systems. This type of network is similar to a pure system, but it reintroduces the notion of hierarchy, built in a robust way. The peers are organized into clusters with one node elected as super-peer, which performs some server activities on behalf of all nodes in the cluster (e.g. maintains indices and answers queries). The election may be based on a number of parameters like available bandwidth, processing power, or storage, and any node can become a super-peer if elected. To increase robustness, each cluster can have more than one super-peer, forming a k-redundant virtual super-peer, where k is the number of super-peers per cluster.
3.8 OptIPuter

The OptIPuter project [48, 87] is probably the most advanced project building a general distributed processing environment aimed at utilizing current high-speed optical networks and the powerful distributed computing and storage infrastructure built as a part of Grid projects. Based on the presumption that network capacity grows faster than storage capacity, which in turn grows faster than processing capacity, the project focuses on re-optimizing the entire Grid stack of software layers so that bandwidth and storage can be wasted in order to conserve the rather scarce computing resources. The OptIPuter can be understood as a virtual parallel computer in which the individual processors are clusters distributed across many regions; the memory takes the form of large and fast distributed data storage; the peripherals are e.g. scientific instruments, displays, or sensor arrays; and the common infrastructure forming the virtual motherboard uses the standard IP layer delivered over multiple dedicated lambda circuits³. The prototype of the OptIPuter is being built on campuses and on metropolitan and state-wide optical fiber networks in southern California and in Chicago.

3 The term dedicated lambda is used in networking jargon to describe a dedicated circuit that is based on a separate wavelength (or sometimes even a separate fiber) used for that circuit on the optical layer of the network.

Chapter 4

Networks of Active Elements

A virtual multicasting environment based on an active network element called a reflector [32] has been successfully used for user-empowered synchronous multimedia distribution across wide area networks. While it is a quite robust replacement for native but unreliable multicast in videoconferencing and virtual collaborative environments for small groups, its wider deployment is limited by scalability issues. This is especially important when high-bandwidth multimedia formats like Digital Video are used, as the processing and/or network capacity of the reflector can easily be saturated. A simple network of reflectors [33] is a robust solution minimizing additional latency (the number of hops within the network), but it still has rather limited scalability.

In this chapter, we study scalable and robust synchronous multimedia distribution approaches with more efficient application-level distribution schemes. The latency induced by the network is one of the most important parameters, as the primary use is in real-time collaborative environments. We use the overlay network approach, where active elements operate on the application level, orthogonal to the basic network infrastructure. This approach supports stability through isolation of components, reducing complex and often unpredictable interactions of components across network layers.

4.1 Synchronous Multimedia Distribution Networks

A synchronous multimedia distribution network, which operates at high capacity and low latency, can be composed of interconnected service elements, so-called active elements (AEs). They are a generalization of the user-empowered programmable reflector [32]. The reflector is a programmable network element that replicates and optionally processes incoming data, usually in the form of UDP/RTP datagrams, using unicast communication only.
If the data is sent to all the listening clients, the number of data copies is equal to the number of the clients, and the limiting outbound traffic grows with n(n−1), where n is the number of sending clients. The reflector has been designed and implemented as a user-controlled modular programmable router, which can optionally be linked with special processing modules at run-time. It runs entirely in user space and thus works without the need for administrative privileges on the host computer.

The AEs add networking capability, i.e. inter-element communication, and also the capability to distribute their modules over a tightly coupled cluster. Only the networking capability is important for the scalable environments discussed in this chapter.

Local service disruptions (element outages or link breaks) are common events in large distributed systems like wide area networks, and maximum robustness needs to be naturally incorporated into the design of synchronous distribution networks. While maximum robustness is needed for the network organization based on out-of-band control messages, in our case based on the user-empowered peer-to-peer (P2P) approach described in Sections 4.2.1 and 4.4, the actual content distribution needs a carefully balanced solution between robustness and performance, as discussed in Section 4.3. The content distribution models are based on the idea that even sophisticated, redundant, and computationally demanding approaches can be employed for smaller groups (of users, links, network elements, ...), as opposed to the simpler algorithms necessary for large distributed systems (such as the global Internet). A specialized routing algorithm based on similar ideas has been shown, e.g. as part of the RON approach [2].

FIGURE 4.1: Architecture of the Active Element (components: Network Management, Network Information Service, messaging modules, kernel, processors, network listeners, shared memory, packet scheduler/sender; arrows distinguish data flow from control information).

4.2 Active Element with Network Management Capabilities

As already mentioned in Sec. 4.1, the AE is an extended reflector with the capability to create a network of active elements in order to deploy scalable distribution scenarios. The network management is implemented via two modules dynamically linked to the AE at run-time: Network Management (NM) and Network Information Service (NIS). The NM takes care of building and managing the network of AEs, joining new content groups and leaving old ones, and reorganizing the network in case of link/node failure.

The NIS serves multiple purposes. It gathers and publishes information about the specific AE (e.g. available network and processing capacity), about the network of AEs, and about properties important for synchronous multimedia distribution (e.g. pairwise one-way delay, RTT, estimated link capacity). Further, it takes care of information on the content and available formats distributed by the network. It can also provide information about special capabilities of the specific AE, such as multimedia transcoding capability.
The NM and NIS modules can communicate with the AE administrator using the administrative modules of the AE kernel. This provides the authentication, authorization, and accounting features already built into the AE, and it can also use the Reflector Administration Protocol (RAP) [9] enriched by commands specific to NM and NIS. The NM communicates with the Session Management module in the AE kernel to modify packet distribution lists according to the participation of the AE in selected content/format groups.

4.2.1 Organization of AE Networks

For the out-of-band control messages, the AE network uses self-organizing principles already successfully implemented in common peer-to-peer network frameworks [60], [65], namely for AE discovery, discovery of available services and content, topology maintenance, and control channel management. The P2P approach satisfies the requirements on both robustness and user empowerment, and its lower efficiency has no significant impact as it routes administrative data only.

The AE discovery procedure provides the capability to find other AEs in order to create or join the network. Static discovery relies on a set of predefined IP addresses of other AEs, while dynamic discovery uses either the broadcasting or the multicasting capabilities of the underlying networks to discover the AE neighborhood.

Topology maintenance (especially the broadcast of link state information), exchange of information from NIS modules, content distribution group joins and keep-alives, client migration requests, and other similar services also use the P2P message passing operations of AEs.

4.2.2 Re-balancing and Fail-Over Operations

The topology and use pattern of any network change rather frequently, and these changes must be reflected in the overlay network, too. We consider two basic scenarios: (1) a re-balancing is scheduled due to either a use pattern change or the introduction of new links and/or nodes, i.e. there is no link or AE failure, and (2) a reaction to a sudden failure. In the first scenario, the infrastructure re-balances to a new topology and then switches to sending data over it.
Since it is possible to send data simultaneously over both the old and the new topology for a very short period of time (which might result in short-term infrastructure overloading) and either the last reflector on the path or the application itself discards the duplicate data, clients observe seamless migration and are subject to no delay and/or packet loss due to the topology switch. This scenario also applies when a client migrates to another reflector because of insufficient perceived quality of the data stream.

In contrast, a sudden failure in the second scenario is likely to result in packet loss (for unreliable transmission like UDP) or delay (for reliable protocols like TCP), unless the network distribution model has some permanent redundancy built in. Since multicast has no such permanent redundancy, the client perceives loss/delay until a new route between the source and the client is found. Likewise, in an overlay network of AEs without permanent redundancy, the client needs to discover and connect to a new AE. This process can be sped up when the client uses cached data about other AEs (from the initial discovery or as a result of regular updates of the topology). For some applications, this approach may not be sufficiently fast and permanent redundancy must be applied: the client is continuously connected to at least two AEs and discards the redundant data. When one AE fails, the client immediately tries to restore the degree of redundancy by connecting to another AE. The same redundancy model is employed for data distribution inside the network of AEs, so that re-balancing has no adverse effect on the connected clients.

The probability of failure of a particular link or AE is rather small, despite the high frequency of failures in the global view of large networks. Thus two-fold redundancy (k = 2) might be sufficient for the majority of applications, with the possibility of increasing it (k > 2) for the most demanding applications.
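Discarding redundant data at a client connected to k AEs can be sketched as duplicate suppression over packet sequence numbers. This is a minimal illustration under stated assumptions rather than the thesis implementation: the `RedundantReceiver` class and its `window` parameter are hypothetical, sequence numbers are assumed to be plain increasing integers, and wrap-around (as with 16-bit RTP sequence numbers) is deliberately ignored.

```python
class RedundantReceiver:
    """Accept each packet once, whichever of the k AEs delivers it first."""

    def __init__(self, window=1024):
        self.window = window  # how far behind the newest packet we still accept
        self.seen = set()     # sequence numbers accepted inside the window
        self.highest = -1     # newest sequence number accepted so far

    def accept(self, seq):
        """Return True if the packet is new, False if it is a duplicate or too old."""
        if seq in self.seen or seq <= self.highest - self.window:
            return False
        self.seen.add(seq)
        self.highest = max(self.highest, seq)
        # Forget numbers that fell out of the window (O(window), fine for a sketch).
        limit = self.highest - self.window
        self.seen = {s for s in self.seen if s > limit}
        return True


receiver = RedundantReceiver(window=4)
# Interleaved arrivals from two AEs carrying the same stream.
decisions = [receiver.accept(s) for s in [1, 1, 2, 3, 2, 10, 3, 7]]
```

With the same filter running on the last AE of a path, the brief period of parallel transmission over the old and new topology during re-balancing stays invisible to the application.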

Fast Failure Detection and Recovery for Simple Models without Redundancy

In this section we describe a general algorithm for fast detection of and recovery from AE failure in the category of simple distribution models.

Definition 4.1 (Simple distribution models) A simple data distribution model is any distribution scheme where data traverses at most two AEs inside the distribution network, i.e. one ingress and one egress AE, with the possibility of the ingress and egress being the same AE.

Examples of simple models are 2D full-mesh networks (Section 4.3.1) and 3D layered-mesh networks (Sec ), while 3D with intermediate AEs (Sec ) and multicast-like schemes (Sec ) are not.

The following preliminary steps are needed for the fast detection algorithm to be put in place:

1. The client chooses and joins one AE as its primary AE.
2. The client chooses one AE as its backup AE. The client informs the backup AE that it has been chosen as the backup AE for the (client, primary AE) pair.
3. Both the client and the backup AE subscribe to keep-alive messages from the primary AE.

The failure detection works as follows: failure of the primary AE is recognized by node X (be it the client or the backup AE) when keep-alive messages are not received for the grace period $\mathrm{GRACE}(X)$ (expressed in seconds). When the backup AE recognizes the failure of the primary AE, it immediately starts to send data to the client (i.e. it treats the client as if it had just joined). When the client recognizes the failure of the primary AE, it immediately joins the backup AE by announcing to the backup AE that it has just become the primary one.

For such a model, we observe the following properties ($A$ ... the primary AE, $B$ ... the backup AE, $C$ ... the client, $\mathrm{OWD}(X \to Y)$ ... one-way delay from $X$ to $Y$, $t_0$ ... instant of primary AE failure, $\mathrm{GRACE}(X)$ ... failure detection period of node $X$).
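Failure detection of the primary AE by the client:

$$t_{fd}(C) = t_0 + \mathrm{OWD}(A \to C) + \mathrm{GRACE}(C)$$

Failure detection of the primary AE by the backup AE:

$$t_{fd}(B) = t_0 + \mathrm{OWD}(A \to B) + \mathrm{GRACE}(B)$$

Reception recovery $t_{rr}$:

$$t_1 = t_0 + \mathrm{OWD}(A \to C)$$
$$t_2 = t_0 + \mathrm{OWD}(A \to B) + \mathrm{GRACE}(B) + \mathrm{OWD}(B \to C)$$
$$t_3 = t_0 + \mathrm{OWD}(A \to C) + \mathrm{GRACE}(C) + \mathrm{OWD}(C \to B) + \mathrm{OWD}(B \to C)$$
$$t_{rr} = \min\{t_2, t_3\} - t_1 = \min\{\mathrm{OWD}(A \to B) + \mathrm{GRACE}(B),\ \mathrm{OWD}(A \to C) + \mathrm{GRACE}(C) + \mathrm{OWD}(C \to B)\} + \mathrm{OWD}(B \to C) - \mathrm{OWD}(A \to C)$$

Here $t_1$ is the arrival time at the client of the last data sent by the primary AE, $t_2$ is the arrival time of the first data from the backup AE when the backup AE detects the failure itself, and $t_3$ is the arrival time of the first data from the backup AE when the client detects the failure and joins the backup AE.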
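The detection and reception-recovery expressions above can be evaluated numerically. The following Python sketch is only an illustration: the function name `reception_recovery`, the dictionary layout of the one-way delays, and the sample values (in milliseconds) are assumptions, and $t_{rr}$ is computed directly as $\min\{t_2, t_3\} - t_1$ from the three instants rather than from any closed form.

```python
def reception_recovery(owd, grace_b, grace_c, t0=0):
    """Return (t_fd_client, t_fd_backup, t_rr) for the simple fail-over model.

    owd maps (source, destination) pairs over the nodes 'A' (primary AE),
    'B' (backup AE) and 'C' (client) to one-way delays; grace_b and grace_c
    are the keep-alive grace periods of B and C, in the same time unit.
    """
    t_fd_client = t0 + owd["A", "C"] + grace_c
    t_fd_backup = t0 + owd["A", "B"] + grace_b
    # t1: last data from the failed primary AE reaches the client.
    t1 = t0 + owd["A", "C"]
    # t2: backup AE detects the failure and starts sending to the client.
    t2 = t_fd_backup + owd["B", "C"]
    # t3: client detects the failure, joins the backup AE, data comes back.
    t3 = t_fd_client + owd["C", "B"] + owd["B", "C"]
    return t_fd_client, t_fd_backup, min(t2, t3) - t1


# Hypothetical delays in milliseconds.
delays = {("A", "C"): 20, ("A", "B"): 10, ("B", "C"): 30, ("C", "B"): 30}
fd_c, fd_b, t_rr = reception_recovery(delays, grace_b=500, grace_c=500)
```

In this example the backup AE notices the failure first, so the reception gap at the client is governed by its grace period; shortening $\mathrm{GRACE}(B)$ shrinks the gap directly, at the price of more false detections on a lossy control channel.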

- Distribution recovery $t_{dr}$:
$$t_{dr} = \mathrm{OWD}(A \to C) + \mathrm{GRACE}(C) + \mathrm{OWD}(C \to B)$$

- Data hollow in terms of data timestamps $t_{dh}$:
$$t_{dh} = (t_0 + t_{dr} - \mathrm{OWD}(C \to B)) - (t_0 - \mathrm{OWD}(C \to A)) = \mathrm{OWD}(A \to C) + \mathrm{GRACE}(C) + \mathrm{OWD}(C \to A)$$

This model assumes a reliable network, i.e. it cannot happen that the backup AE detects a primary AE failure while the primary AE works fine for the client. This can be improved by the client sending a stop message to the backup AE when it starts receiving the same data from both the primary and the backup AE.

Similar detection models can be used for non-simple networks, but in these networks it is impossible to state general formulas for recovery. In addition to the failure detection delay and the failure announcement distribution latency, which are similar in both simple and non-simple distribution models, the recovery of non-simple models also includes the delay caused by recomputing or rebuilding the distribution model (there is no need to recompute the distribution model for simple ones). This phase is generally hard to estimate, as it involves a complex distributed system with many variables. For example, for multicast-like schemes there are additional delays stemming from recomputation and/or distribution of new minimum spanning trees (or, if alternative MSTs are available at each AE, there is still some delay due to the broadcast of the new MST ID so that all AEs switch to the same MST). Furthermore, if the failed AE was the root of the previous MST, a new tree root needs to be elected.

4.3 Distribution Models

4.3.1 2D Full Mesh

The simplest model with higher redundancy, serving also as the worst-case estimate in terms of scalability, is a complete graph in which each AE communicates directly with all the remaining AEs, as shown in Figure 4.2. This model was studied and described in detail in [33].

FIGURE 4.2: 2D full mesh.

Definition 4.2 (2D full-mesh network) Let us have a network with AEs and clients, with each AE populated with at least one client. A 2D full-mesh network of AEs is a network in which each AE sends data from each client to the other clients connected to the same AE and

it also sends the data to all other AEs in the network. Each AE thus receives the data from all other AEs and sends them to all the clients connected to that AE.

Let us assume a network of $m_{tot}$ AEs with full-mesh communication. $n$ clients connect to the AEs in such a way that each AE has either $n_r$ or $n_r - 1$ clients:
$$n_r = \left\lceil \frac{n}{m_{tot}} \right\rceil \tag{4.1}$$
$$m' = n_r m_{tot} - n \tag{4.2}$$
$$m = m_{tot} - m' \tag{4.3}$$
Here $m$ is the number of AEs with $n_r$ clients and $m'$ is the number of AEs with $n_r - 1$ clients.

Definition 4.3 (Evenly populated AE network) An AE network with each AE having either $n_r$ or $n_r - 1$ clients connected is called an evenly populated AE network. All clients are active, i.e. both sending and receiving.

Theorem 4.1 In an evenly populated 2D full mesh of AEs, the inbound traffic of each AE is $in = n$ streams.

PROOF When the full mesh operates in N:N way, the inbound traffic for AEs with $n_r$ clients will be
$$in = \underbrace{n_r}_{1} + \underbrace{(m - 1)n_r}_{2} + \underbrace{m'(n_r - 1)}_{3} \tag{4.4}$$
(1: directly connected clients, 2: streams from the $m - 1$ other AEs with $n_r$ clients, 3: streams from all $m'$ AEs with $n_r - 1$ clients) and for AEs with $n_r - 1$ clients
$$in' = \underbrace{n_r - 1}_{4} + \underbrace{m n_r}_{5} + \underbrace{(m' - 1)(n_r - 1)}_{6} \tag{4.5}$$
(4: directly connected clients, 5: streams from all $m$ AEs with $n_r$ clients, 6: streams from the other $m' - 1$ AEs with $n_r - 1$ clients). It can be easily shown that $in = in'$, and after some simplification the formula for $in$ can be written as
$$in = n_r m + m' n_r - m' \tag{4.6}$$
Further substituting $m$ and $m'$ we get
$$in = n \tag{4.7}$$

Theorem 4.2 In an evenly populated 2D full mesh of AEs, the limiting traffic in this mesh is the outbound traffic on the AE, which is $out = n_r^2 m_{tot} + n_r(m_{tot} - 2)$ streams.

PROOF The outbound traffic for an AE with $n_r$ clients will be
$$out = \underbrace{(n_r - 1)n_r}_{7} + \underbrace{(m - 1)n_r^2}_{8} + \underbrace{m' n_r (n_r - 1)}_{9} + \underbrace{n_r(m + m' - 1)}_{10} \tag{4.8}$$
(7: from directly connected clients to directly connected clients (the AE does not send data back to the client which sent them), 8: data from the $m - 1$ other AEs with $n_r$ clients to all own $n_r$ clients, 9: data from all $m'$ AEs with $n_r - 1$ clients to all own $n_r$ clients, 10: data sent to the other $m + m' - 1$ AEs) and for an AE with $n_r - 1$ clients
$$out' = \underbrace{(n_r - 2)(n_r - 1)}_{11} + \underbrace{m n_r (n_r - 1)}_{12} + \underbrace{(m' - 1)(n_r - 1)^2}_{13} + \underbrace{(n_r - 1)(m + m' - 1)}_{14} \tag{4.9}$$

(11: from directly connected clients to directly connected clients (the AE does not send data back to the client which sent them), 12: data from the $m$ AEs with $n_r$ clients to all own $n_r - 1$ clients, 13: data from the other $m' - 1$ AEs with $n_r - 1$ clients to all own $n_r - 1$ clients, 14: data sent to the other $m + m' - 1$ AEs). The numbers in the equations correspond to the numbers in Figure 4.3.

FIGURE 4.3: Flow analysis in full 2D mesh of AEs. The bottom and right AEs form one group and the top and left AEs the other; one group is populated with $n_r$ clients per AE, the other with $n_r - 1$ clients per AE.

It can be easily shown that
$$\frac{out}{out'} = \frac{n_r}{n_r - 1}$$
and we can therefore use just the simplified $out$ formula
$$out = n_r(n_r m + m' n_r + m - 2) \tag{4.10}$$
as $out > out'$ for $n_r \geq 1$ and $m + m' \geq 2$. For $m + m' < 2$ the full mesh makes no sense, and thus $out$ is the limiting value for outbound traffic. Further substituting $m$ and $m'$ we get
$$out = n_r(m_{tot} + n - 2) = n_r^2 m_{tot} + (m - 2)n_r \tag{4.11}$$
If we substitute the other way round for $n_r$ from (4.1) (which is not precise due to the ceiling function), we get
$$out = \frac{n(m_{tot} + n - 2)}{m_{tot}} \tag{4.12}$$
and the ratio between $out$ for the full mesh of AEs and $out = n(n - 1)$ for a single AE is
$$ratio = \frac{m_{tot} + n - 2}{m_{tot}(n - 1)} \tag{4.13}$$
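The stream counts of Theorems 4.1 and 4.2 can be checked numerically by evaluating the term-by-term sums (4.4) and (4.8) and comparing them with the closed forms (4.7) and (4.11). This is only an illustrative sketch; the function name `full_mesh_streams` and the returned dictionary layout are assumptions, not part of the thesis.

```python
from math import ceil


def full_mesh_streams(n, m_tot):
    """Inbound/outbound stream counts on an AE with n_r clients in an
    evenly populated 2D full mesh (hypothetical helper for illustration).

    n      total number of clients (n >= m_tot, so no AE is empty)
    m_tot  total number of AEs in the mesh
    """
    n_r = ceil(n / m_tot)            # (4.1)
    m_prime = n_r * m_tot - n        # (4.2): AEs with n_r - 1 clients
    m = m_tot - m_prime              # (4.3): AEs with n_r clients

    # (4.4): own clients + other full AEs + all AEs with n_r - 1 clients
    inbound = n_r + (m - 1) * n_r + m_prime * (n_r - 1)

    # (4.8): local redistribution + remote streams to own clients + mesh links
    outbound = ((n_r - 1) * n_r
                + (m - 1) * n_r ** 2
                + m_prime * n_r * (n_r - 1)
                + n_r * (m + m_prime - 1))

    return {"n_r": n_r, "m": m, "m_prime": m_prime,
            "inbound": inbound, "outbound": outbound}


result = full_mesh_streams(n=10, m_tot=4)
print(result)
# Closed forms (4.7) and (4.11) for comparison:
print(result["inbound"] == 10,
      result["outbound"] == result["n_r"] * (4 + 10 - 2))
```

Running the same comparison over a range of `n` and `m_tot` is a quick way to catch a sign or index error when the formulas are transcribed.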

FIGURE 4.4: Behavior of 2D full mesh for DV clients. Dependence of limiting outbound traffic (sent bandwidth in Mbps) on the number of 30 Mbps clients and the number of AEs $m$ in the mesh, with the saturation level of Gigabit Ethernet marked.

Fail-Over Operation

When a link or a whole AE drops out of the full mesh, the failure only influences data distribution from/to the clients connected to that AE. In case of a link failure inside the AE mesh, the client is requested to migrate to an alternative AE. In case the AE itself fails, the client initiates the migration on its own. Alternative AEs should be selected randomly to distribute the load increase more evenly; the load increase will then be at most $\frac{n_r}{m - 1}$.

When even this migration delay is not acceptable, it is possible for a client to be permanently connected to an alternative AE and just switch the communication. For even more demanding applications, the client can use more than one AE for sending in parallel.

Although this model seems to be fairly trivial and not that interesting, it has two basic advantages. First, the model is robust and the failure of one node influences only data from/to the clients connected to that AE. Second, it introduces only minimal latency because the data flows over two AEs at most. Next we will examine another model that has the same latency and robustness properties but scales better.

4.3.2 3D Layered-Mesh Network

The layered-mesh model creates $k$ layers, each of which distributes data from a single AE only, as shown in Figure 4.5. One layer is thus similar to a 2D full-mesh network, except that only one AE is both sending and receiving in one layer. Thus each client is connected to one layer for both sending and receiving (sending only if $n_r = 1$; to receive data from clients sending via other layers, the client needs to receive data from the remaining $n_r - 1$ clients of the AE used for sending) and to all other layers for receiving only. Each layer comprises $m$ AEs. For the sake of simplicity, we first assume that $k = m$ and each AE has $n_r$ clients, thus $n_r = \frac{n}{m} = \frac{n}{k}$.

Definition 4.4 An active AE is an AE with clients that are both sending and receiving. A non-active AE is an AE with all clients receiving only.

FIGURE 4.5: 3D layered mesh.

Definition 4.5 (3D layered-mesh network) A 3D layered-mesh network is a network of AEs organized into a 3D layered mesh comprising $k$ layers with $m$ AEs in each layer. Each layer is used to distribute data from clients connected to one active AE. Thus each client is connected to one layer for sending and receiving and to all other layers for receiving only.

Theorem 4.3 In a 3D layered-mesh network with each AE having $n_r$ clients, each AE has $in = n_r$ inbound streams.

PROOF The active AE has $n_r$ clients and thus it receives $in = n_r$ streams. Each non-active AE receives $in = n_r$ streams from the active AE to distribute to its clients.

Theorem 4.4 In a 3D layered-mesh network with each AE having $n_r$ clients, the active AEs have $out_{s/r} = n_r^2 + n_r(m - 2)$ outbound streams, and the non-active AEs have $out_r = n_r^2$ outbound streams.

PROOF The number of output streams for an active AE with both sending and receiving (s/r) clients is
$$out_{s/r} = \underbrace{n_r(n_r - 1)}_{1} + \underbrace{n_r(m - 1)}_{2} = n_r(n_r + m - 2) = n_r^2 + n_r(m - 2) \tag{4.14}$$
where part 1 is for directly connected clients and part 2 is for all the remaining $m - 1$ non-active AEs in the same layer. For a non-active AE that has only receiving (r) clients connected,
$$out_r = n_r^2 \tag{4.15}$$
because the non-active AE distributes $n_r$ streams (from the $n_r$ clients of the active AE) to its own $n_r$ clients.

It is obvious that for $n_r \geq 1$ and $m > 2$ it always holds that $out_{s/r} > out_r$; for $m = 2$, $out_{s/r} = out_r$; and for $m < 2$ the distribution network makes no sense. Thus $out_{s/r}$ can be seen as the limiting traffic.
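To make the effect of Theorems 4.3 and 4.4 concrete, the sketch below evaluates the per-AE stream counts of the layered mesh and compares the limiting value with the 2D full-mesh limit of Theorem 4.2 for the same number of clients per AE. The function names `layered_mesh_streams` and `full_mesh_limit` are hypothetical, introduced only for this illustration.

```python
def layered_mesh_streams(n_r, m):
    """Per-AE stream counts in a 3D layered mesh (illustrative helper).

    n_r  clients per AE
    m    AEs per layer
    """
    out_sr = n_r * (n_r - 1) + n_r * (m - 1)   # (4.14): active AE
    out_r = n_r * n_r                          # (4.15): non-active AE
    return {"in": n_r, "out_sr": out_sr, "out_r": out_r}


def full_mesh_limit(n_r, m_tot):
    """Limiting outbound streams of a 2D full mesh (Theorem 4.2)."""
    return n_r ** 2 * m_tot + n_r * (m_tot - 2)


layered = layered_mesh_streams(n_r=3, m=4)
print(layered)                        # stream counts per AE role
print(full_mesh_limit(n_r=3, m_tot=4))  # 2D full-mesh limit for comparison
```

For these assumed values the active AE of the layered mesh carries 15 outbound streams against 42 in the 2D full mesh, which illustrates why the layered model scales better per AE at the cost of using more AEs.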

The limiting AE is the one to which the sending clients are connected. Thus the ratio between the limiting outbound traffic of such a mesh (with total number of clients $n = n_r m$) and that of a single AE is
$$ratio = \frac{n + m(m - 2)}{m^2(n - 1)} \tag{4.16}$$
while using a total of $km = m^2$ AEs.

This model is problematic because the total number of AEs used grows quadratically. However, it seems to be the last model that does not induce intermediate hops and thus keeps latency at the minimum.

Transition from 3D to 2D mesh

It is possible to perform vertical aggregation of AEs across the 3D layers to get the 2D full-mesh model, as proved in Theorem 4.5.

Theorem 4.5 The 3D layered-mesh model is an extension of the 2D full-mesh model, and the latter model can be obtained by aggregating AEs of the former.

PROOF To do so, we merge the AEs that are positioned above each other ("flatten" the layers). In such a case, the AE is used once as a sending/receiving AE and $m - 1$ times as a receiving-only AE. Thus the number of input streams is $m$ times $n_r$ (since once it gets $n_r$ streams as both sending and receiving AE and $m - 1$ times it gets $n_r$ streams as a receiving one):
$$in = m n_r = n \tag{4.17}$$
This relation is the same as (4.7). For the number of output streams, it follows that
$$out = \underbrace{n_r^2 + n_r(m - 2)}_{1} + \underbrace{(m - 1)n_r^2}_{2} = m n_r^2 + (m - 2)n_r \tag{4.18}$$
Part 1 is the one occurrence of the AE in the sending/receiving role and part 2 is the $m - 1$ occurrences of the AE in the receiving-only role. The number of outbound streams is obviously equal to (4.11). Thus we have proved that the 2D full-mesh model is just a special variant of the 3D layered-mesh model.

Fail-Over Operation

Each of the mesh layers monitors its connectivity. When some layer disintegrates and becomes discontinuous, the information is broadcast throughout the layer and to its clients. The clients that used that layer for sending are requested to migrate to a randomly chosen layer from the remaining $k - 1$ layers, and the listening-only clients simply disconnect from this layer. Such behavior increases the load on the remaining $k - 1$ layers, but as the clients choose the new layer randomly, the load increases in a roughly uniform way by at most $\frac{n_r}{k - 1}$.

4.3.3 3D Layered Mesh of AEs with Intermediate AEs

Definition 4.6 (q-nary distribution tree) The q-nary distribution tree is a directed acyclic graph in which each parent node has $q$ child nodes. Data in the distribution tree are distributed according to the orientation of the edges.

Definition 4.7 (3D layered mesh with intermediate AEs) The 3D layered mesh with intermediate AEs is a layered structure where each layer is organized as follows: each layer has one active AE that is the root of the q-nary distribution tree in that layer. Receiving-only clients are connected to the $m - 1$ leaf AEs of the distribution tree.

Definition 4.8 (Intermediate AE) An intermediate AE is an AE that does not have any clients directly connected. Alternatively, it is any AE that is neither active nor non-active.

Let us create a q-nary tree used for distributing data from the AE with sending clients to the $m - 1$ AEs with listening clients. When building a q-nary tree with $\lambda$ intermediate layers, where
$$\lambda = \log_q(m - 1) - 1, \tag{4.19}$$
the total number of intermediate AEs is
$$L = \sum_{p=1}^{\lambda} q^p = \frac{q^{\lambda + 1} - q}{q - 1} = \frac{q^{\log_q(m - 1)} - q}{q - 1} = \frac{m - 1 - q}{q - 1}. \tag{4.20}$$

Theorem 4.6 The numbers of flows in the 3D layered mesh with intermediate AEs and $n_r$ clients per active and non-active reflector are summarized in Table 4.1. The limiting traffic is the number of outbound streams on the active AE.

        in       out
s/r     $n_r$    $n_r(n_r - 1) + q n_r$
r       $n_r$    $n_r^2$
int     $n_r$    $q n_r$

TABLE 4.1: Flows in 3D network with intermediate AEs. s/r means an outer AE with sending clients connected, r means an outer AE with only receiving clients, int is an intermediate AE.

PROOF The number of inbound streams is again $in = n_r$ for all types of AEs because the network distributes $n_r$ streams from the $n_r$ clients of the active reflector. The number of outbound streams on the active AE is
$$out_{s/r} = \underbrace{n_r(n_r - 1)}_{1} + \underbrace{q n_r}_{2}, \tag{4.21}$$
where part 1 is the distribution to locally connected clients and part 2 is the distribution of $n_r$ data streams to the first layer of intermediate reflectors comprising $q$ AEs. Each intermediate AE distributes $n_r$ streams further on to its $q$ child AEs and thus
$$out_{int} = q n_r \tag{4.22}$$
The non-active AE distributes $n_r$ data streams to $n_r$ locally connected clients and thus
$$out_r = n_r^2 \tag{4.23}$$
Since $out_{s/r} > out_r$ and $out_{s/r} > out_{int}$ for $q \geq 2$ and $n_r > 1$, the limiting traffic is $out_{s/r}$. For $q \geq 2$ and $n_r = 1$, $out_{s/r} = out_{int} > out_r$. If $q < 2$, the q-nary distribution would make no sense. Thus $out_{s/r}$ can be understood as the limiting outbound traffic.

There are, however, two disadvantages of this model:

- The number of hops inside the mesh increases by $\lambda$ compared to the simple 3D mesh model. This will increase latency, but it is impossible to quantify the latency increase in general, as it depends on the underlying network topology and on the one-way delays between hop pairs.

- Compared to the plain 3D model, the total number of AEs, including the intermediate ones, further increases to
$$m_{tot} = mk + Lk \tag{4.24}$$
For $m = k$ it is
$$m_{tot} = m(m + L) \tag{4.25}$$
This relation is illustrated in Figure 4.6, which demonstrates quite extreme growth of the total number of reflectors when having only 2 clients per AE, while the properties become reasonable as the number of clients per AE increases.

FIGURE 4.6: Number of AEs ($m_{tot}$) needed for 3D mesh with intermediate hops, in which $k = m$ and $q = n_r$, as a function of the number of clients ($n$); one curve per combination of $n_r$ and $q$.

Aggregation of inner AEs

To limit the number of inner AEs, we can take into account the linear increase of the limiting outbound flow $q n_r$ on each inner AE, as shown in Table 4.1.

Theorem 4.7 When only $L$ AEs are used for all $k$ layers, the resulting numbers of data streams on internal AEs are $in_{int} = k n_r$ and $out_{int} = k q n_r$.

PROOF The number of streams distributed by the intermediate AEs is $k n_r$, because each of the $k$ layers distributes $n_r$ streams. Thus the number of input streams for each intermediate AE is $in_{int} = k n_r$. Since each of the $k n_r$ streams needs to be distributed to $q$ child AEs, the number of outbound streams on an intermediate AE is $out_{int} = k q n_r$.

When we use $k = m$ (meaning that the number of outer AEs equals the number of layers), then $in_{int} = n$ and $out_{int} = qn$, where $n = m n_r$ is the total number of clients.

In this section, we have shown a trivial way to aggregate inner AEs in the 3D mesh of AEs. However, searching for an optimum general aggregation of nodes in such a network

(not only the inner ones) leads to creating an optimal AE-based application-level multicast network.

This model is problematic because it increases the number of AEs used. However, it seems to be the last model that doesn't introduce intermediate hops and thus keeps the hop count at a minimum.

4.3.4 Multicast Schemes

In an ideal case, the multicast organization of the data distribution is the most efficient scheme to distribute data to multiple clients. However, it is very difficult for a user to place AEs into the physical network topology in such a way that no data will pass through any physical link twice. The only exception may be when the AE network is implemented as a network of active routers, but this goes against the user-empowered approach we support. Thus the multicast paradigm is only an upper limit on the efficiency of the distribution.

There are two basic approaches to building a multicast distribution tree: the source-based tree, also known as the shortest path tree (SPT), and the shared tree. Given the synchronous character of multimedia data distribution, the SPT with reverse path forwarding (RPF) has two major advantages: it minimizes latency compared to the shared tree, where the data is sent through a rendezvous point, and it provides the shortest paths between the source and the receivers (an advantage for transmission of large volumes of data).

To build SPTs, it is necessary to have the underlying unicast routing information. This information can be maintained very efficiently by RON [2]. In addition to fast convergence in case of a network link failure, it is possible to define a policy that selects the shortest path based not on hop count but on path round-trip time, or even on one-way propagation delay if available.

Fail-Over Operation

The standard operation when a link failure occurs is to build a new SPT as described above. If even the convergence speed of RON is not acceptable, there is another possible strategy to minimize the delay due to SPT reconstruction. It is possible to compute multiple SPTs at the same time, choose a single SPT for data distribution, and keep the remaining SPTs for fail-over operation. For a permanent redundancy scenario, more than one SPT can be used simultaneously, and duplicate data will be discarded by client applications.

In a full graph, there are $n^2 - n$ links between the AEs. For a small number of AEs, alternative SPTs can be computed that avoid one selected link at a time. If that particular link fails, the alternative SPT can be switched on immediately. For a larger number of AEs, where the number of links is too large, it is possible to compute $n/2$ possible SPTs with disjoint sets of links. When using SPTs or shared trees (ST) with backup based on disjoint sets of links, it is necessary to ensure that not all links from one AE are used in one SPT/ST, since the AE would become isolated in the backup SPT/ST. When a backup SPT/ST is available, the network recovery is limited just by the broadcast speed needed to announce switching to a new SPT/ST, but when there is no backup, the alternative SPT/ST must be computed first (Figure 4.7).

During normal operation, all these SPTs are monitored for their usability, and when a link fails in the current SPT, the original SPT can be swapped for another working SPT if at least one other usable SPT is available.

4.4 Content Organization

The multimedia content can be encoded in many different formats that suit specific needs or capabilities of the network and the listening clients. In some cases (e.g. MPEG-4 formats) the highest-quality format can be decomposed into $N$ different layers (groups) that are sent over the network independently. When native multicast is used, the client subscribes

for the first $M \leq N$ groups only, thus controlling the quality reduction of the received content. With native multicast, there is no easy way to prioritize and synchronize the streams, which may lead to unexpected loss of quality (if data in the first layer are lost, the other layers may be rendered useless).

FIGURE 4.7: Recovery time (received packets per second over time in seconds) with backup SPT (solid line) and without it (dashed line), simulated using a cnet-based network simulator.

As AEs also support multimedia transcoding (they are capable of being active gateways), an extended approach can be used. The format decomposition, or even transcoding to a completely different format, may be performed by an AE, providing a flexible on-demand service: the transcoding occurs only if really needed by some client. Also, the AEs are capable of synchronizing individual streams: they understand the decomposition and may re-synchronize individual streams. In case of severe overload, the higher (less important) stream layers are dropped first (again, AEs know the hierarchy), so the transmission quality is minimally affected.

To formalize our approach, we have designed a three-layer hierarchy:

- content groups: the highest level, an aggregation of several contents; it can be for instance a videoconferencing group (e.g. video and audio streams of individual videoconference participants)
- content: the intermediate level, a content (a video stream, format independent)
- format: the lowest level, a format definition.

A multimedia stream is then characterized by the (content_group, content, format) triplet. The available formats for each content create an oriented graph where the root is the source format and the child nodes define the formats created from their parents. A client can choose the most suitable format, or different formats for different contents within one content group (e.g. a lecturer's stream with the highest quality). The information about available content groups, contents, and available formats is published via NIS on AEs and is distributed and shared across the network of AEs.
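The three-level hierarchy and the oriented format graph described above can be pictured with a small data-structure sketch. Everything named here (`Stream`, `FormatGraph`, the format labels such as "dv" or "mpeg4-low") is a hypothetical illustration; the text does not prescribe any concrete representation or set of formats.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stream:
    """A multimedia stream identified by the (content_group, content, format) triplet."""
    content_group: str
    content: str
    format: str


class FormatGraph:
    """Oriented graph of formats for one content; the root is the source format."""

    def __init__(self, source_format):
        self.source = source_format
        self.parent = {source_format: None}   # format -> format it is derived from

    def add_format(self, fmt, derived_from):
        # A format can only be derived from a format already in the graph.
        if derived_from not in self.parent:
            raise KeyError(f"unknown parent format: {derived_from}")
        self.parent[fmt] = derived_from

    def derivation(self, fmt):
        """Return the chain of formats from the source format down to fmt."""
        chain = []
        while fmt is not None:
            chain.append(fmt)
            fmt = self.parent[fmt]
        return list(reversed(chain))


formats = FormatGraph("dv")
formats.add_format("mpeg4-high", derived_from="dv")
formats.add_format("mpeg4-low", derived_from="mpeg4-high")

stream = Stream("seminar", "lecturer-video", "mpeg4-low")
print(stream, formats.derivation(stream.format))
```

A derivation chain of this kind is one way an AE could decide which transcoding steps are needed before a requested format can be served, assuming each edge corresponds to one transcoding step.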

Chapter 5

Distributed Active Element

Scalability based on networks of active elements, as discussed in Chapter 4, works well in terms of the number of clients connected. It is, however, not sufficient for scalability with respect to the bandwidth of each single distributed stream, i.e. it is not suitable for distributing streams whose bandwidth exceeds the capacity of each single AE. This is the case, for example, when distributing 1.486 Gbps uncompressed high-definition video using AEs with only Gigabit Ethernet or even Fast Ethernet network interfaces.

In order to improve scalability with respect to the bandwidth of each stream, we introduce the concept of a distributed active element, suitable for implementation on computer clusters with low-latency internal interconnection. Its architecture is based on parallelizing all modules of the general AE architecture shown in Figure 4.1, including both listener and sender modules, effectively creating multiple AE instances. Of course, it would be possible to implement a distributed AE by simply distributing the modules and exchanging the necessary information using some form of message passing. However, as cluster environments do not operate over shared memory, this would require transmission of very significant data volumes between the machines in the cluster and would thus degrade performance. Minimizing synchronization and internal communication overhead allows for additional performance optimizations.

Complete parallelization of the AE architecture, however, introduces problems with the sending part, which may introduce packet reordering. While packet reordering is largely unwanted in general router behavior, as it may severely hamper the performance of many applications, especially those based on the most widely used transmission protocol, TCP, it is more acceptable for multimedia applications that rely on the UDP protocol and thus need to handle packet reordering on their own, usually by data buffering. Furthermore, as UDP generally doesn't use any flow control mechanism based on the implicit response of the network (like packet loss, or indirectly packet reordering incorrectly detected as packet loss), packet reordering doesn't influence the performance of the multimedia application; it only imposes additional requirements on buffer sizes and thus may indirectly introduce additional latency.

Footnote: For instance, the ITU-T G.1010 specification recommends that multimedia applications buffer as much as 10 seconds of video to avoid effects due to packet reordering. Although this is clearly not desirable for synchronous interactive applications because it increases the latency of transmission, it stresses the fact that packet reordering effects are well understood by application designers and thus handled by most multimedia applications to some extent. Actually, because for synchronous applications it is usually unacceptable to use retransmission to recover from packet loss, especially when it comes to high-bandwidth data flows, most applications implement some scheme to mitigate even packet loss: be it limited redundancy, forward error correction, interleaving, or some other mechanism.

A simple solution to reordering would be to design an architecture in which all modules are distributed except for the sender module. This would help with computationally intensive operations on the streams, but it wouldn't help with high-bandwidth streams which exceed the capacity of each single cluster node. In this chapter we show that, even with multiple sending modules and no explicit synchronization, the reordering

introduced by the distributed AE has an upper bound for real-time synchronous applications under certain assumptions. It can be further reduced by implementing the proposed Fast Circulating Token protocol.

This chapter is organized as follows: in Section 5.1 we discuss the general architecture of the distributed AE; we study its behavior and packet reordering in a static environment in Section 5.2; then we define the protocols needed for operation of the distributed AE in a changing, dynamic environment in Section 5.3; and finally we conclude with an evaluation of the prototype implementation and its comparison with theoretical results in Section 5.4.

5.1 Architecture

We propose a distributed AE based on the architecture of the AE described in Section 4.2. It is partly determined by the requirement of implementability on existing tightly coupled clusters with low-latency interconnection.

The distributed AE implementation assumes the infrastructure shown in Figure 5.1. The computing nodes form a computer cluster in which each node has two kinds of connections:

- one low-latency control connection used for internal communication and synchronization inside the distributed AE,
- one or more data connections used for receiving and sending the data.

A low-latency interconnection is necessary since current common network interfaces like Gigabit Ethernet or 10 Gigabit Ethernet provide large bandwidth, but the latency of the transmission is still on the order of hundreds of µs, which is not suitable for fast synchronization. Specialized low-latency interconnects like Myrinet provide latency as low as 10 µs, which is comparable to message passing between threads on a single computer.

From the high-level perspective of operation, the incoming data needs to be first distributed across the multiple parallel units of the distributed AE, processed in these units, and finally aggregated and sent over the network to the listening clients. Thus the architecture comprises three major parts:

Distribution unit takes care of distributing the ingress data flow over multiple parallel distributed AE units. The distribution unit would naturally be located just before the distributed AE, but that is not obligatory; it can be located elsewhere, provided that the networking infrastructure between the distribution unit and the distributed AE has sufficient transmission capacity. This allows for a cheap software implementation of the distribution unit on some powerful computer or even as a part of the sending application. Multiple distribution units might be present in the system, but in order to allow theoretical analysis of the system, we assume that a single input data flow passes through a single distribution unit only.

When the distribution unit is part of the same L2 domain as the parallel AE units, it may operate on L2 addresses only (e.g. Ethernet addresses, in operation similar to the VRRP [94] or CARP [68] protocols); otherwise it needs to operate on L3 (usually IP) addresses.

The distribution unit works in a simple round-robin fashion in the simplest case, but it may also support more advanced schemes like static and dynamic load balancing. When supporting dynamic load balancing or operation in a dynamic and/or unreliable environment, it also communicates with the parallel AE units that it sends data to.

Parallel AE unit is a complete instance of the AE with the architecture shown in Figure 4.1, with a modified sender module to allow for possible synchronization. It has

the kernel with administrative submodules, session management, processor schedulers, and AAA submodules. Control communication is handled by messaging interfaces with a slightly extended Reflector Administration Protocol [9]. Data are received using network listener modules, stored into shared memory (shared across the instance of the reflector only, not across multiple AE unit instances), and processed by zero or more processors; distribution lists are filled in either by one of the processors or by session management; and the data are finally sent by the sender module. Unless some complex data processing is involved, data passes through the distributed AE unit in zero-copy mode for performance reasons.

FIGURE 5.1: Model infrastructure for implementing the distributed AE: parallel AE nodes connected by a low-latency interconnect switch and, through data links (some of them optional), by data path switch(es) to the network (Internet).

The network management module handles communication with the distribution unit and also communication with other distributed AE units if an AE ring is to be set up and maintained for the Fast Circulating Token protocol (Section 5.2.2). However, handling of the token itself for this protocol is done by the sender module in order to minimize operation overhead.

The network information module works in the same way as it does for networks of AEs: it publishes information on the properties and capabilities of each distributed AE unit and
it also publishes global information on the whole distributed AE if such information is available.

Aggregation unit aggregates the resulting traffic to the output network line(s). Because the AE is often used as a data multiplication unit, we assume that the output data flow from the distributed AE is larger than the input data flow. Thus we need a unit that is even more powerful than the input load distribution unit; in most cases, a cheap custom-made software implementation is not available and we have to use an available hardware solution like an aggregating switch. However, in that case we must not assume any further behavior of the aggregating unit except for two things: first, it is over-provisioned enough not to lose any data, and second, it has limited buffer space available (which is true for any such device known to the author). Multiple aggregation units might be present in the system, especially when parallel AE units have more than one network uplink, but in order to allow theoretical analysis of the system, we assume that a single output data flow passes through a single aggregation unit only.

The whole architecture supports user-empowered operation: there is still no need to run any part of it in kernel space. The only administrative requirement is that the cluster environment needs to be set up together with the networking infrastructure, including some aggregation unit.

5.2 Operation in Static Environment

In this section we describe the operation of the distributed AE in a static environment, where a constant number of AEs participates in the distributed processing and the AEs work reliably.

In a static environment, there are two major pieces of functionality needed: distribution of incoming data packets and at least loosely synchronized sending of outgoing packets. As the sending procedure is more difficult from the theoretical point of view, and as it also limits the possible distribution models for incoming packets, we describe the sending part first.

In order to evaluate our models theoretically, we need to introduce an idealized environment. Different parts of this environment are defined in the following definitions.

Definition 5.1 (Ideal network) The ideal network is a network in which no data are lost, corrupted, or reordered. It also provides instant delivery, i.e. it introduces zero latency.

Definition 5.2 (Ideal multimedia traffic) The ideal multimedia traffic has bandwidth $b$ and independent packets of exactly the same size $s_p$, which is equal to or smaller than the MTU of the underlying network. The packets are sent at regular intervals.

All the queue sizes below are expressed in units of packet size $s_p$. In order to isolate the reordering introduced by the distributed AE, we assume that the ideal multimedia traffic has no reordering prior to entering the distribution unit.

Definition 5.3 (Ideal aggregating unit) The ideal aggregating unit has $n$ input interfaces with the same parameters and one output interface with capacity equal to or bigger than the $n$ inputs together. It reads packets from the size-limited input interface queues and sends them on the output interface in such a way that packets are never lost. The speed is $b_j^{SW}$ for the $j$-th input interface, and each input queue has an equal size of $s_i^{SW}$ for each input interface.

In order not to lose any input data, the ideal aggregating unit needs to fulfill the following requirement in steady state:
$$\sum_j b_j^{SW} \leq b_o^{SW}.$$
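The idealized quantities defined above can be made concrete with a small numeric sketch. It assumes bandwidth in bits per second and packet size in bytes, which the definitions do not fix, and the function names `packet_interval` and `aggregation_is_lossless` are hypothetical helpers introduced only for this illustration.

```python
def packet_interval(bandwidth_bps, packet_size_bytes):
    """Interval between packets of ideal multimedia traffic (Definition 5.2),
    assuming bandwidth b in bit/s and packet size s_p in bytes."""
    return packet_size_bytes * 8 / bandwidth_bps


def aggregation_is_lossless(input_speeds_bps, output_speed_bps):
    """Steady-state requirement of the ideal aggregating unit (Definition 5.3):
    the sum of the input interface speeds must not exceed the output speed."""
    return sum(input_speeds_bps) <= output_speed_bps


# Illustrative values only: a 1.5 Gbit/s stream of 1500-byte packets and
# four Gigabit Ethernet inputs aggregated into one 10 Gigabit output.
interval = packet_interval(bandwidth_bps=1.5e9, packet_size_bytes=1500)
print(f"inter-packet interval: {interval * 1e6:.1f} microseconds")
print(aggregation_is_lossless([1e9] * 4, 10e9))
```

With these assumed values a packet leaves the source every 8 µs, which gives a sense of the time scale on which any synchronization among parallel sending modules would have to operate.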

Definition 5.4 (Ideal distributed AE) The ideal AE has processing capacity equal to or higher than the stream bandwidth and it has an input queue size of $q_i^{AE}$. All the parallel units of the ideal AE have the same parameters and performance, and the total bandwidth of the traffic is divided into streams with the same parameters. The ideal AE introduces no losses, data corruption, or data reordering in the data stream.

FIGURE 5.2: Model of the ideal distributed AE with ideal aggregation unit. (Elements shown: distribution unit, AE input buffer, AE, AE output buffer, aggregating input buffer, aggregating unit; labels: $s_i^{AE}$, $s_o^{AE}$, $s_i^{SW}$, $b_j$.)

Ingress Distribution

The ingress data distribution takes care of distributing incoming data across different paths inside the distributed AE. For the ideal distributed AE, simple round-robin distribution is suitable, as all the parallel AE units are equivalent in their performance.

Definition 5.5 (Ideal distribution unit) The ideal distribution unit distributes packets in round-robin fashion. In each round, it distributes $n$ packets, one to each of the parallel units. The distribution unit marks the round number into each packet.

Such an ideal distribution might not be suitable in the following cases:

- When parallel AE units are of unequal performance. In this case, the load balancing described below is useful.
- When data stream packets are not independent and the processing needs all the inter-dependent packets to pass through the same path. This might happen, for example, when some data processing is done and some state inside the AE needs to be created and maintained. In this case, the packet distribution needs to follow the packet inter-dependencies. When the distribution unit is implemented as a part of the sending application (e.g. a user-space library encapsulating the UDP sendto() function), it is possible to utilize knowledge of the data directly and distribute it correspondingly.
If the distribution unit is implemented as a separate stand-alone network unit, the application can mark groups of packets that belong together using a field of fixed position and size, so that the distribution unit can quickly distribute data appropriately without complex processing of data that might even include maintaining state.

When parallel AE units do not have the same performance, the load balancer can send multiple packets in each round to the same parallel path. All the packets sent in one round are marked with the same round number. A sample packet distribution is shown in Figure 5.3.

FIGURE 5.3: Sample load balancing packet distribution for distributed AE. (Panels: Round 1, Round 2.)

Egress Synchronization

No explicit synchronization. The simplest model for egress synchronization is to use no synchronization at all. However, with this model and limited buffers on the input interfaces of the AEs, some implicit synchronization is still achieved.

Theorem 5.1 (Maximum reordering with no explicit synchronization) Maximum reordering induced by an ideal distributed AE with no explicit egress synchronization and an ideal aggregating unit is $n(s_i^{AE} + s_o^{AE} + s_i^{SW} + 1)$, where $n$ is the number of parallel AE units, when all queues operate in FIFO tail-drop mode.

PROOF We need to show that higher reordering in the ideal distributed AE model is impossible because it would result in packet loss. Suppose that we have $n + 1$ consecutive packets. In the worst-case scenario, the first packet arrives in the $j$-th parallel path with all queues full except for the AE input buffer, which has just one free position left. The second packet goes over the $(j + 1)$-th path, where all the queues are empty. Because the packets are distributed in a round-robin way and the ideal AE doesn't introduce any packet loss, at least one packet in the $j$-th queue must have been processed before the $(n + 1)$-th packet arrives in order to have enough space in the input queue.
After repeating this $(s_i^{AE} + s_o^{AE} + s_i^{SW} + 1)$ times, the packet must have left the input buffer of the aggregating unit and been sent to its output interface, for otherwise some packet on the $j$-th path (not necessarily the one we are studying) would have been lost. Thus the maximum reordering is $n(s_i^{AE} + s_o^{AE} + s_i^{SW} + 1)$.
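The bound of Theorem 5.1 is easy to evaluate for a concrete configuration. The following minimal Python sketch does so; the function name and the buffer sizes in the example are illustrative assumptions, not values taken from the analysis above.

```python
def max_reordering_no_sync(n, s_ae_in, s_ae_out, s_sw_in):
    """Upper bound of Theorem 5.1 on reordering, in packets.

    n         number of parallel AE units
    s_ae_in   AE input buffer size   (s_i^AE), in packets
    s_ae_out  AE output buffer size  (s_o^AE), in packets
    s_sw_in   aggregating unit input buffer size (s_i^SW), in packets
    """
    if n < 1 or min(s_ae_in, s_ae_out, s_sw_in) < 0:
        raise ValueError("n must be >= 1 and buffer sizes must be non-negative")
    return n * (s_ae_in + s_ae_out + s_sw_in + 1)


# Hypothetical configuration: 4 parallel units, 500-packet AE buffers,
# 64-packet input queues on the aggregating unit.
print(max_reordering_no_sync(n=4, s_ae_in=500, s_ae_out=500, s_sw_in=64))
```

The bound grows linearly with every buffer on the path, which is why large AE buffers make the unsynchronized variant reorder heavily in the worst case.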

Fast Circulating Token. In order to decrease the packet reordering introduced by the distributed AE, we propose a distributed algorithm that achieves less packet reordering than operation with no explicit synchronization. The nodes are ordered in a ring with one node elected as the master node and they circulate a token which serves as a barrier, so that no node can run too far ahead with sending data. The mechanism is called Fast Circulating Token (FCT) since the token is not held for the entire time period of data sending, as is usual in token ring networks.

Definition 5.6 (Non-preemptive data packet sending) Because of the real-world implementation of data packet sending in common operating systems, we assume that the sending procedure is non-preemptive, i.e. once a packet is being sent, this process can be interrupted only after the sending is finished.

Definition 5.7 (Token handling priority) We assume that token reception event processing has precedence over any other event processing in the distributed AE. If there are multiple token events waiting, they are processed in FIFO order. However, as the data sending is non-preemptive, if the token arrives in the middle of data packet sending, it will be handled just after that packet sending is finished.

The token carries the following information:

- round number: corresponds to the round number from the distribution unit; set and incremented on the master node
- last round-trip time
- holding time left after traveling from the master to the current node

Depending on implementation circumstances, the timeleft() function may be used to allow keeping the token on nodes other than the master for a limited amount of time. This might be needed if, e.g., the master node is considerably faster than the other nodes. For the ideal distributed AE, timeleft() returns 0.
Theorem 5.2 (Maximum reordering with FCT synchronization) The maximum reordering induced by an ideal distributed AE with FCT egress synchronization and an ideal aggregating unit is $n(s_i^{SW} + 3)$, where $n$ is the number of parallel AE units, when all queues operate in FIFO tail-drop mode.

PROOF We need to show that reordering is now limited by the buffer sizes on the aggregation unit and the number of nodes in the AE ring. When data flow through an ideal distribution unit and an ideal AE, no packet reordering is introduced. Thus when an FCT arrives at a node, there are three possible situations:

1. packets with a lower round number than the one in the token are still being sent,
2. the AE is waiting to start sending packets with a round number equal to the number in the token,
3. the AE has no data to send in the current round.

Thus the first condition in the FCT algorithm is just a correction for non-ideal behavior of each parallel path that converts excessive reordering to packet loss.

Because the data are distributed in pure round robin for the ideal distributed AE, we may assume that after incrementing the token round number on the master node to $r$, there might be at most $n - 1$ packets from round $r - 1$ waiting on the other nodes in the ring to be sent and $n$ packets from round $r$ on all nodes together. Taking two consecutive packets, they must

rnd := 0;
finish := false;
while ¬finish do
    if ¬(master ∧ rnd = 0)
        token_ready := 0;
        do
            if test_recv_token(round_no, last_rtt, time_left)
                token_ready := 1;
                if ¬master ∧ time_left = 0
                    pass_on_token(round_no, last_rtt, time_left);
                fi
            fi
            if test_recv_finish()
                finish := true;
            fi
            send_packet(rnd);
        while ¬(token_ready ∨ finish) od
        if finish
            break;
        fi
    fi
    if master
        send_all_packets(rnd);
        rnd := get_rnd_from_queue();
        last_rtt := updatertt();
        time_left := timeleft();
        pass_on_token(rnd, last_rtt, time_left);
    else
        while time_left > 0 ∧ is_packet(rnd) > 0 do
            send_packet(rnd);
            time_left := time_left − 1;
        od
        pass_on_token(round_no, last_rtt, time_left);
        discard_packets(rnd);
        rnd := round_no;
    fi
    if master ∧ finish_requested
        foreach s ∈ slaves do
            send_finish(s);
        od
        finish := true;
    fi
od

FIGURE 5.4: Fast Circulating Token algorithm.

be either in the same round or in two consecutive rounds. Thus the reordering when entering the input buffer of the aggregation unit must be less than $2n$. In each round, one packet is added to each input queue and thus after circulating the token $s_i^{SW} + 1$ times, the two consecutive packets either have to be sent or lost. Because we assume no loss on the ideal aggregating unit, the upper bound on packet reordering is $2n + n(s_i^{SW} + 1) = n(s_i^{SW} + 3)$.

When operating in a non-ideal environment, there are several complications that need to be taken into account:

- packet reordering either before data reach the distributed AE or on a single parallel path inside the distributed AE; the implementation of the first condition after token reception determines whether excessive packet reordering will be converted to packet loss or not,
- due to unequal performance of parallel paths, load balancing may be deployed; again, the reordering of two consecutive packets is limited by the size of two consecutive rounds, but each round may have more than $n$ packets depending on the load balancing scheme used.

Exact Order Sending. It is possible to design a sending protocol that results in exact ordering, but it requires defined behavior of the aggregation unit and thus it is not suitable for implementation on commodity hardware like aggregating switches. One possibility is that the aggregation unit behaves similarly to the sending modules with the FCT protocol, i.e. it reads packets from its inputs in the same round-robin order in which the distribution unit distributes them and uses the round number marked in each packet to recognize when the packet is ready to be sent. When each parallel path is ideal, namely there is no packet loss or corruption, even packet round numbering might not be necessary.
However, in order to protect aggregation against lost and/or corrupted packets, which would make one of the queues run ahead of the rest, it is advisable to stick to round numbering of the packets. For such operation, the alternative packet distribution shown in Figure 5.5 is more appropriate than the distribution shown in Figure 5.3: it has smaller rounds containing either zero or one packet. Instead of sending no packet, the distribution unit actually needs to send an empty round marker to let the aggregator know that there is no need to wait for a packet to arrive. This distribution is not efficient when the circulating token is present, as there are more rounds and the overhead of the token is thus more pronounced.

FIGURE 5.5: Alternative load balancing packet distribution. Empty round markers are shown as small black filled rectangles. (Panels: Round 1 through Round 6.)

Such a protocol might be implementable on custom hardware, e.g. a data switch or processor built entirely from custom hardware, or at least on some programmable routing device like FPGA-enabled routing cards.
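The exact-order aggregation described above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions: every per-path input queue is already filled (a real aggregator would block while waiting for the next entry), each entry is a hypothetical `(round_no, payload)` tuple, and an empty round marker is represented by a `None` payload. None of these names come from the prototype.

```python
from collections import deque

EMPTY = None  # assumed representation of an empty round marker


def aggregate_exact_order(paths, first_round=1):
    """Merge per-path FIFO queues into one exactly ordered output.

    paths: list of deques of (round_no, payload) tuples, listed in the same
           order in which the distribution unit visits the parallel paths.
    Returns the payloads in the order the aggregator would send them.
    """
    out = []
    rnd = first_round
    while any(paths):                      # some queue still holds entries
        for q in paths:
            while q and q[0][0] < rnd:     # stale entry from a finished round
                q.popleft()
            if q and q[0][0] == rnd:
                _, payload = q.popleft()
                if payload is not EMPTY:   # a marker only says "do not wait"
                    out.append(payload)
            # A head entry with a higher round number means the entry for
            # this round was lost on that path, so the aggregator moves on.
        rnd += 1
    return out


paths = [
    deque([(1, "p1"), (2, "p4"), (3, "p6")]),
    deque([(1, "p2"), (2, "p5"), (3, EMPTY)]),
    deque([(1, "p3"), (2, EMPTY), (3, "p7")]),
]
print(aggregate_exact_order(paths))
```

The round numbers are what keep a path with a lost packet from running one round ahead of the others; without them the aggregator could not tell a late packet from the next round's packet.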

5.3 Operation in Dynamic Environment

In this section, we describe the protocols necessary for a distributed AE operating in a dynamic environment, where nodes of the tightly coupled cluster may disappear suddenly while new ones become available. Protocols for failure recovery and for nodes joining and leaving the distributed AE are therefore discussed in this section. Further, the processing and network capacity of the individual nodes may change in time when the nodes are shared with other operations, and thus some protocol for communication with the distribution unit is needed to allow dynamically adjusted load balancing. Generalizing this scenario, we develop a general protocol for distributed AE units communicating with the distribution unit.

Setup of a New Ring

When a new distributed AE is set up, the following steps are executed in order:

1. Any of the nodes starts to broadcast a setup message.
2. All the nodes that want to participate reply by broadcasting their randomly generated IDs. If a collision is encountered, the colliding nodes generate new random IDs, which must not be any of the existing IDs.
3. The node with the lowest ID becomes the master. All nodes create the ring according to increasing ID and remember the IDs of all nodes participating in the ring.

Addition of a New Node

When a node wants to join an existing ring of AEs, the following steps take place:

1. The new node asks any of the existing ring members for the ring topology, including the list of existing node IDs. If anycast is present in the network, it can be used.
2. The new node generates a new random and unique ID.
3. The new node broadcasts its joining to all the nodes.
4. After an acknowledgment has been received from all the nodes, it announces its availability to the ingress load balancer.
5. The load balancer acknowledges its addition.

Failure Detection and Recovery

Failure detection relies on a timeout: a failure is assumed if no token is received during some timeout period, by default 10 × RTT.
To recover, each node uses tokens, but this time the tokens are acknowledged and they gather the new topology as they travel around the ring. The following procedure is performed:

1. Each node generates its own recovery token.
2. On reception of a recovery token, the node adds itself to the list in the token and acknowledges its reception to the node the token came from.
3. If the token is not the token it has generated, it sends the token to the next node in the ring and waits for an acknowledgment. If the acknowledgment doesn't arrive until a timeout, it tries to send the token to the next-plus-one node in the ring and so on, until it receives an acknowledgment.

4. If the token is the node's own one, it extracts the topology of the new ring from it.

As an alternative, it is also possible to broadcast the failure of the ring to all the nodes and use the standard procedure for setting up a new ring instead.

Removal of Existing Node

When a node wants to remove itself from a ring of AEs while allowing the distributed AE to operate flawlessly without the need for failure recovery, it executes the following protocol:

1. It announces its unavailability to the ingress load distribution unit.
2. After acknowledgment from the distribution unit, it considers itself ready for removal from the ring of AEs.
3. It broadcasts the new topology to all nodes.
4. It waits until it hasn't received the circulating token for some timeout value, by default 10 × last RTT, and shuts down.

Communication between Distributed AE and Load Balancers

There are four cases when the distributed AE communicates with the load distribution unit in a dynamic environment:

- Announcement of new ring topology after ring set-up or failure recovery. After a new ring is set up or after ring recovery, the master node announces the ring topology to the distribution unit if it is known. Otherwise the distribution unit asks for that information (by unicast if the master or at least one node of the ring is known, or using broadcast/multicast otherwise).
- Addition of a new node. The new node announces its availability after it has successfully started its operation and joined the ring of existing parallel AE units as described above.
- Removal of an existing node. Before leaving the ring of parallel AE units, the node announces its removal to the distribution unit and waits for an acknowledgment.
- Load reporting from parallel AE units. This process is handled by the load distribution unit. It polls the Network Information module of each parallel AE unit through a messaging interface either regularly or according to a specified policy.
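The ring construction and the recovery-token walk described in this section can be illustrated with a small Python sketch. It is a simplified model under explicit assumptions: node IDs are plain integers that are already unique (ID collision handling is omitted), a missing acknowledgment is modeled by the node simply not being in the `alive` set, and all function names are hypothetical.

```python
def build_ring(node_ids):
    """Order nodes by increasing ID; the lowest ID becomes the master."""
    ring = sorted(set(node_ids))
    if not ring:
        raise ValueError("a ring needs at least one node")
    return ring[0], ring


def next_alive(ring, node, alive):
    """Next node after `node` that acknowledges the token.

    Failed nodes are skipped (next, next plus one, and so on).
    """
    i = ring.index(node)
    for step in range(1, len(ring) + 1):
        candidate = ring[(i + step) % len(ring)]
        if candidate in alive:
            return candidate
    return None  # only reached when no node is alive at all


def recover_topology(ring, origin, alive):
    """Circulate the recovery token generated by `origin` around the old ring."""
    if origin not in alive:
        raise ValueError("the originating node must be alive")
    gathered = []
    current = origin
    while True:
        current = next_alive(ring, current, alive)
        gathered.append(current)   # the receiver adds itself to the token
        if current == origin:      # the token is back at its originator
            break
    return build_ring(gathered)


master, ring = build_ring([42, 7, 19, 88])
print(master, ring)
print(recover_topology(ring, origin=42, alive={7, 42, 88}))
```

Because the new ring is again ordered by ID, every surviving node that runs the same walk arrives at the same topology and the same master, including the case in which the old master has failed.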
Knowledge of the topology by the distribution unit is handled in a soft-state manner, i.e. it has to be periodically refreshed by the master node. If the information expires, the distribution unit stops distribution (to avoid flooding the network) and tries to get information about the topology.

5.4 Prototype Implementation

The prototype of the distributed AE is implemented in ANSI C for portability and performance reasons. The implementation comprises two parts: a load distribution library and the distributed AE itself. Because no sufficiently flexible hardware load distribution unit was available, we have implemented load distribution as a library, which allows simple replacement of the standard UDP-related sending functions in existing applications and lets developers choose a defined type of load distribution, either pure round robin or load balancing.

Each parallel AE element uses a threaded modular implementation based on the reflector architecture described in Section 3.2.1 with the networking and distribution extensions described in Section 4.2.1. Internal buffering capacity of each AE node has been set to 500 packets. Explicit synchronization using the FCT protocol has been implemented using the MPICH implementation [80] of MPI [79] built with the low-latency Myrinet GM 2.0 API [84] (so-called MPICH-GM). For a cost-effective prototype implementation, the aggregation unit was implemented as a commodity switch satisfying the condition that the egress link capacity is equal to or larger than the ingress capacities and with sufficient capacity of the internal switching matrix. The prototype implementation has been tested and is known to work on Linux and FreeBSD 5.x platforms.

Experimental Setup

In order to evaluate performance and behavior of the distributed AE experimentally, we have set up the testbed shown in Figure 5.6, comprising eight machines and two switches:

FIGURE 5.6: Distributed AE testbed setup. (Elements shown: low-latency interconnect switch, AE parallel nodes, data path switch, sending/receiving probes; links: GE data link, Myrinet low-latency interconnect.)

- A data switch HP ProCurve 6108 with 8 Gigabit Ethernet full wire-speed ports. The manufacturer-specified switching capacity is 16 Gbps and a switching performance of 11.9 million packets per second is given for 64 B packets.
- A low-latency Myrinet M3-E32 switch with M3-SW16-8F interface cards that created the control plane for passing control information like the token for the FCT protocol. According to the manufacturer's specifications and benchmarks, it features one-way latency as low as 6.3 µs for short messages up to approximately 100 B with MPI and the GM-2.0 API [83].

- 6 PCs used as the parallel nodes for the distributed AE with the configuration shown in Table 5.1. Each node was connected to the HP switch via a full-duplex Gigabit Ethernet data link and also to the Myrinet switch for control information passing.
- Sender and receiver PCs with the same configuration (Table 5.1), which have been used for generating traffic and for collecting and analyzing results. Both computers were connected to the HP switch via full-duplex Gigabit Ethernet.

Configuration
  Brand             HP ProLiant
  Model             DL 360 G3
  Processor         2× Intel Xeon 2.40 GHz
  Front-side bus    533 MHz
  Memory            2 GB (PC2100 DDR with ECC)
  GE NIC            2× Broadcom Corporation NetXtreme BCM5703 Gigabit Ethernet (rev. 2)
  Myrinet NIC       M3F-PCI64C-2
  Operating system  Linux Debian Woody kernel SMP GM Socket version

TABLE 5.1: Configuration of distributed AE nodes and sending/receiving probes.

The data flows were generated using a simple RTP-compliant sending application and data reception was done by a receiver, which also computed all the required statistics. Since the distributed AE is aimed at scalability with respect to the bandwidth of a single data stream, we have evaluated it by sending a data stream with bandwidth up to 1 Gbps. Performance above 1 Gbps hasn't been examined because testbed components were not available at the time of writing (footnote 2).

Performance Evaluation

Performance of the distributed AE prototype without explicit sending synchronization and with FCT-based synchronization is shown in Figures 5.7 and 5.8 respectively. It turns out that a single-path AE (equivalent to a single reflector) is not capable of processing streams beyond 600 Mbps on the given testbed infrastructure without packet loss. This is in accordance with the findings in [3], where a stand-alone centralistic reflector was examined on a testbed infrastructure with slightly lower performance.
Contrary to the stand-alone reflector, the distributed AE prototype can process and distribute streams up to 1 Gbps using 1472 B UDP payload without significant packet loss, starting with two parallel paths.

Footnote 2: At the time of writing, there were two 10 Gbps Chelsio T110 cards available for working with uncompressed HD video in a point-to-point way, but for evaluation of the distributed AE, we would also need a switch with at least two 10 Gbps and multiple 1 Gbps ports, which was not available. As soon as such equipment is available, the distributed AE will be subjected to evaluation.

Jitter, usually explained as delay dispersion, is calculated according to RFC 3550 based on arrivals of two consecutive packets as

$$D_{i,j} = (R_j - R_i) - (S_j - S_i) = (R_j - S_j) - (R_i - S_i),$$

$$J = J + \frac{1}{16}\left(|D_{i-1,i}| - J\right),$$

FIGURE 5.7: Forwarding performance of distributed AE without explicit synchronization for number of paths 1 through 6. (Panels: real bandwidth [Mbps], jitter [µs], and loss [%], each plotted against target bandwidth [Mbps], for 1 to 3 and for 4 to 6 parallel paths.)

FIGURE 5.8: Forwarding performance of distributed AE with synchronization using FCT for number of paths 2 through 6. (Panels: real bandwidth [Mbps], jitter [µs], and loss [%], each plotted against target bandwidth [Mbps], for 2 to 3 and for 4 to 6 parallel paths.)

where $S_i$ is the RTP timestamp from packet $i$, and $R_i$ is the time of arrival in RTP timestamp units for packet $i$, for two consecutive packets $i$ and $j$. By this definition, jitter doesn't include the packet reordering discussed below; it only measures the evenness of packet arrivals independent of packet order. It starts at about 300 µs when sending 100 Mbps and slowly drops to below 100 µs as the data rate increases. For more parallel paths, it rises slightly and this effect is more pronounced on the distributed AE without egress synchronization.

Packet Loss and Reordering Evaluation

We have also measured and analyzed packet reordering in order to experimentally compare the behavior of the distributed AE without explicit egress synchronization and the distributed AE with synchronization using the FCT protocol. The reordering samples are shown in Tables A.1, A.2, A.3, A.4, A.5, and A.6 in Appendix A starting at page 83 using reordering graphs.

Definition 5.8 (Reordering graph) The reordering graph is a histogram (frequency or absolute number) of sequence number differences between two consecutive packets. The x axis gives sequence number differences, while the y axis gives either the absolute or the relative (frequency) number of records. $h_j$ is the number of records (samples) for sequence number difference $j$.

Thus, if all the sequentially numbered packets arrive in the same order they were sent, all the differences are $+1$. A number higher than $+1$ means that some packets were skipped (either because of packet reordering or because of packet loss), while a negative number means stepping back in packet numbering (due to packet reordering only). A value of 0 occurs when duplicate packets arrive immediately following each other. $\min\{j\}$ is the maximum negative difference in sequence numbers of successively received packets and $\max\{j\}$ is the maximum positive difference.
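As an illustration of Definition 5.8, the following minimal Python sketch builds the reordering graph from a list of received sequence numbers. The function name and the sample arrival sequence are assumptions made for the example only; they are not measurement data.

```python
from collections import Counter


def reordering_histogram(seq_numbers):
    """Return h, where h[j] counts consecutive arrivals whose sequence
    numbers differ by j (Definition 5.8)."""
    return Counter(b - a for a, b in zip(seq_numbers, seq_numbers[1:]))


# Hypothetical arrivals: packets 3 and 4 swapped, packet 5 duplicated,
# packets 6 and 7 missing.
arrivals = [1, 2, 4, 3, 5, 5, 8]
h = reordering_histogram(arrivals)
for j in sorted(h):
    print(j, h[j])
print("min{j} =", min(h), "max{j} =", max(h))
```

For this sequence the differences are 1, 2, -1, 2, 0, and 3, so the only negative bin is $j = -1$ and the largest positive bin is $j = 3$.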
Theorem 5.3 For any interval of arrivals of two or more packets, the following equation holds:

$$\underbrace{\sum_{j=\min\{j\}}^{-1} j\,h_j}_{H^-} + \underbrace{h_1}_{H^1} + \underbrace{\sum_{j=2}^{\max\{j\}} j\,h_j}_{H^+} = \Delta, \qquad (5.1)$$

where $\Delta$ is the difference between the sequence numbers of the last and the first packet in the observed interval.

PROOF The proof is done by induction with respect to the number of consecutive packet arrivals. For arrivals of two successive packets with sequence numbers $n_1$ and $n_2$, we have $j_2 = n_2 - n_1 = \Delta$. If $j_2 \neq 0$ and thus also $\Delta \neq 0$, only one of $H^-$, $H^1$, $H^+$ is non-zero and equal directly to $\Delta$, because there is only one $h_{j_2} = 1$. If $j_2 = 0$, meaning that a duplicate packet arrived, then $h_0 = 1$ and $\Delta = 0$, which satisfies (5.1), because $h_0$ is not included in the left-hand side and all other terms there are zero.

Now we show that if (5.1) holds for $k$ packet arrivals, it holds for $k + 1$ arrivals as well. Assume that $H_k^- + H_k^1 + H_k^+ = n_k - n_1 = \Delta_k$ for $k$ arrivals. For $k + 1$ arrivals, $\Delta_{k+1} = n_{k+1} - n_1 = (n_{k+1} - n_k) + (n_k - n_1) = j_{k+1} + \Delta_k$. The left-hand side of (5.1) changes depending on the value of $j_{k+1}$ as follows:

- $j_{k+1} > 1$: $H_{k+1}^+ = H_k^+ + j_{k+1}$, because $h_{j_{k+1}}$ increased by 1 and the other terms remain unchanged,
- $j_{k+1} = 1$: $H_{k+1}^1 = H_k^1 + j_{k+1} = H_k^1 + 1$, because $h_1$ increased by 1 and the other terms remain unchanged,
- $j_{k+1} = 0$: all terms remain unchanged, a duplicate packet arrived,

- $j_{k+1} < 0$: $H_{k+1}^- = H_k^- + j_{k+1}$, because $h_{j_{k+1}}$ increased by 1 and the other terms remain unchanged.

Thus it is obvious that both the left-hand and the right-hand side of (5.1) increase by $j_{k+1}$ and the equation holds for $k + 1$ packet arrivals. Therefore it holds for any number of packet arrivals greater than 1.

Theorem 5.4 For any interval of arrivals of more than one packet, the following equation holds:

$$\Pi + \underbrace{\sum_{j=\min\{j\}}^{-1} h_j}_{N^-} + \underbrace{h_1}_{N^1} + \underbrace{\sum_{j=2}^{\max\{j\}} h_j}_{N^+} - \delta = \Delta, \qquad (5.2)$$

where $\Pi$ is the number of lost packets and $\delta$ is the number of duplicated packets that are not included in $h_0$.

PROOF $\Delta$ actually expresses the number of packets sent by the sender. Each packet can be either delivered (in order or out of order), lost in the network, or duplicated (or multiplicated in general). The packet might also be corrupted, but then it is discarded immediately by the receiver without actually being received and is thus converted to packet loss. $N^1$ is the count of packets that arrived in order. $N^- + N^+$ counts the number of packets that arrived out of order, either due to reordering ($N^-$) or due to reordering or loss of another packet ($N^+$). Duplicate packets either arrive as $h_0$ packets, in which case they are not counted in (5.2), or they count up into the $\delta$ variable and are subtracted from the total of received packets. Thus the sum of the number of received non-duplicate packets $N^- + N^1 + N^+ - \delta$ and the number of lost packets $\Pi$ must add up to the number of packets sent in total, $\Delta$.

By combining equations (5.1) and (5.2), and because $H^1 = N^1 = h_1$, we can derive packet loss as

$$\Pi = H^- + H^+ - N^- - N^+ + \delta.$$

Because the positive part of the graph described by $H^+$ or $N^+$ also includes packet loss, the negative part of the graph described by $H^-$ or $N^-$ can be seen as a measure of packet reordering. The difference between the H-sums and the N-sums is that the H-sums are weighted sums.
Thus the farther the packets are from 1 in either direction, the higher the absolute values of the H-sums are, while the N-sums remain the same. All the terms in $N^-$, $N^+$, and $H^+$ are positive and all the terms in $H^-$ are negative. If $|H^-| \approx N^-$, the vast majority of out-of-order packets in the negative part is reordered by $j = -1$.

Stand-alone reflector. As discussed above, the stand-alone reflector is not capable of forwarding data streams above 600 Mbps without packet loss. This can be seen in Table A.1, where only the $+1$ value is populated up to 600 Mbps and an asymmetrical distribution leaning toward positive values is shown for 700 Mbps and above. The bigger the loss is, the larger the sum $H^+$ is. Because neither reordering nor duplicates are introduced, $H^- = N^- = h_0 = 0$.

Distributed AE without explicit synchronization. While the distributed AE performs very well in terms of low packet loss, it introduces severe packet reordering both in terms of maximum reordering (expressed as the minimum difference populated in the histogram, i.e. $\min\{j\}$) and in terms of the number of packets reordered (expressed by $H^-$ and $N^-$), as is obvious from the left column of Tables 5.2 and 5.3. Furthermore, the reordering fluctuates very significantly in time and is hardly reproducible.
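The H-sums and N-sums of Theorems 5.3 and 5.4, and the loss estimate obtained by combining them, can be checked numerically. The Python sketch below is illustrative only: the function names, the dictionary keys, and the arrival sequence are assumptions, and $\delta$ (duplicates not visible in $h_0$) has to be supplied by the caller because it cannot be read from the histogram.

```python
from collections import Counter


def reordering_sums(seq_numbers):
    """Compute H-, H1, H+ (weighted) and N-, N1, N+ (unweighted) sums."""
    h = Counter(b - a for a, b in zip(seq_numbers, seq_numbers[1:]))
    return {
        "H-": sum(j * c for j, c in h.items() if j < 0),
        "H1": h[1],
        "H+": sum(j * c for j, c in h.items() if j >= 2),
        "N-": sum(c for j, c in h.items() if j < 0),
        "N1": h[1],
        "N+": sum(c for j, c in h.items() if j >= 2),
    }


def estimated_loss(sums, delta=0):
    """Loss estimate: Pi = H- + H+ - N- - N+ + delta."""
    return sums["H-"] + sums["H+"] - sums["N-"] - sums["N+"] + delta


# Hypothetical arrivals: 3 and 4 swapped, 5 duplicated, 6 and 7 lost.
arrivals = [1, 2, 4, 3, 5, 5, 8]
s = reordering_sums(arrivals)

# Theorem 5.3: the weighted sums add up to last minus first sequence number.
assert s["H-"] + s["H1"] + s["H+"] == arrivals[-1] - arrivals[0]
print(s, estimated_loss(s))
```

In this example the duplicate shows up in $h_0$, so $\delta = 0$ and the estimate equals the two packets that never arrived.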

TABLE 5.2: Comparison of reordering for distributed AEs with no explicit egress synchronization and with synchronization using FCT. Part 1. (Columns for each variant: BW [Mbps], $\min\{j\}$, $H^-$, $N^-$; rows grouped by number of parallel paths.)

Distributed AE with synchronization using FCT protocol. Compared to the distributed AE without explicit synchronization, the FCT protocol allows the distributed AE to work in a much more predictable manner. It reduces packet reordering to just a very few packets, both in terms of maximum reordering ($\min\{j\}$) and in terms of the number of packets reordered ($H^-$ and $N^-$), as can be seen from the right column of Tables 5.2 and 5.3. The maximum reordering grows by 2 packets for each additional parallel AE path, which indicates that the switch in the testbed works in a rather precise round-robin fashion when aggregating flows from multiple interfaces. The number of reordered packets is comparable with and without synchronization for a lower number of parallel paths, and as the number of parallel paths grows, the synchronized version becomes more than 3× better for higher bandwidths. Detailed reordering histograms for the distributed AE both without and with explicit synchronization can be found in Appendix A in Tables A.2 through A.6.

Token round-trip time has also been periodically sampled (footnote 3) and it ranges from 14 µs for 2 parallel AE paths up to 40 µs for 6 parallel paths. This closely approaches the manufacturer-stated one-way message passing latency of the Myrinet configuration used for the testbed as described above.

Footnote 3: In order not to influence the results of measurements of distributed AE performance, it was not possible to continuously gather individual token round-trip times. Thus only sample values were periodically gathered.

TABLE 5.3: Comparison of reordering for distributed AEs with no explicit egress synchronization and with synchronization using FCT. Part 2. (Columns for each variant: BW [Mbps], $\min\{j\}$, $H^-$, $N^-$; rows grouped by number of parallel paths, up to 6.)

Chapter 6

Pilot Applications for Synchronous Processing

Active Elements described in the previous chapters can be seen as a network service for synchronous multimedia processing and distribution. A complete synchronous virtual collaborative environment, be it distance education, remote telemedicine consulting, or some other form, also needs the end-point applications a user can interact with, which are the sources and targets of the multimedia data sent through the network and the AEs. In order to demonstrate the behavior of the proposed AEs, we have chosen representatives of two large classes of applications:

- Applications with medium bandwidth requirements (like DV in Section 6.1 or HDV over IP in Section 6.2), which demonstrate the behavior of the system with respect to the number of AEs participating and the number of clients connected. This is the characteristic kind of application for networks of AEs as described in Chapter 4.
- High-bandwidth applications (like uncompressed HD over IP in Section 6.3), which are more suitable for the distributed AE described in Chapter 5, as no single AE is capable of handling such bandwidth.

6.1 DV over IP

Digital Video (DV) [75, 7] is currently one of the most widely used formats for compression, storage, and processing of digital video. Its advantages are high quality (e.g. for PAL, it uses 720×576 resolution with 25 frames per second), very low degradation of image quality under multiple re-compression due to multi-generation editing, and relative affordability of DV-enabled devices with an IEEE 1394 interface (motion cameras, converters, recorders, etc.). DV compression uses 4:1:1 sampling for NTSC and 4:2:0 for the PAL format and utilizes intra-frame compression only. Audio is compressed together with video and uses a sampling frequency of 32 kHz, 44.1 kHz, or 48 kHz and 12, 16, or 20 bits quantization. Bandwidth requirements are 25 Mbps for video over IEEE 1394 and an additional 5 Mbps for audio and the packet header overhead of RTP and UDP, thus 30 Mbps altogether.
Experimental latency achieved with consumer class devices can be as low as 00 ms [92]. Transmission of DV over IP networks is standardized in RFCs 389 and 390 and uses RTP protocol over UDP datagrams. Purely software implementation of DV over IP transmission has been implemented within DVTS project. Further development of DV over IP is pursued by DVTS Consortium which Masaryk University participates officially in.

Because no reasonably stable and well-performing display tool was available for non-Windows platforms, a new implementation of the xdvshow tool has been developed at Masaryk University. It works on Linux and FreeBSD and supports both NTSC and PAL video formats using the libdv library [76]. The implementation features a multi-threaded architecture that separates reception from the network, video decompression, and video rendering using either SDL or X interfaces. It supports both windowed display and scaled or unscaled full-screen display.

Stereoscopic video. DV over IP transmission has been successfully used together with active elements to implement synchronous stereoscopic video transmission [52], with bandwidth requirements doubled compared to a single DV over IP transmission.

FIGURE 6.1: DV over IP based stereoscopic transmission. [52]

6.2 HDV over IP

Uncompressed video, and especially high-definition (HD) video, is extremely hard to work with because of the stream bandwidth that needs to be processed. Therefore many manufacturers have devoted substantial effort to creating compression algorithms that reduce bandwidth while maintaining a certain level of quality. Such compression induces latency issues, especially when it involves more than a single frame at a time. So-called inter-frame compression, however, allows achieving higher image quality than intra-frame compression alone². The inter-frame approach has been chosen for HDV compression, which is designed to compress HD video to a bandwidth compatible with DV video so that an HDV stream can be stored on DV tapes. Like DV, HDV is also designed for transmission over the IEEE 1394 interface.

The HDV format is actually just an MPEG-2 stream based on an 8-bit color space with 4:2:0 sampling, a resolution of […], interlacing, and 60:1 compression. In order to achieve the same bandwidth as a DV stream (to avoid reducing the recording time on DV cassettes), it uses inter-frame compression across 6 frames and thus induces additional latency.

The format of HDV video transmitted over FireWire, which utilizes an isochronous packet carrying an MPEG-2 Transport Stream (MPEG2-TS), is shown in Figure 6.2. The detailed specification of the format is given in IEC 61883 Parts 1 and 4 [73, 74], which are not freely available, but some description can be found in [50]. The following fields are fixed for IEEE 1394: tag = 01b, tcode = 1010b. The length is the payload length, i.e. it includes the CIP header and the data size.

[Figure 6.2 layout, from the part transmitted first to the part transmitted last: isochronous header (len, tag, channel, tcode, sy), header CRC, CIP header (sid, dbs, fn, qpc, S, RSV, dbc, fmt, fdf, fdf/syt), N × MPEG-2 TS data block (reserved, cycle count, cycle offset, MPEG-2 TS payload of 188 bytes), data CRC.]
FIGURE 6.2: MPEG-2 Transport Stream packet format according to IEC 61883.

Specifically for MPEG2-TS, the following fields are also constant: sph = 1 (denoted as S in the CIP header above), dbs = 6, fmt = (1<<5), and fdf is reserved. The payload is divided into 8 blocks db0..db7. Depending on the target stream bandwidth R, there are several possible payload distributions of the stream:
1. R < 1.5 Mbps: any of db0..db7 may be payload,
2. 1.5 < R < 3 Mbps: db0/db1 or db2/db3 or db4/db5 or db6/db7 is payload,
3. 3 < R < 6 Mbps: db0/db1/db2/db3 or db4/db5/db6/db7 is payload,
4. R > 6 Mbps: all db0..db7 contain the payload.
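The four payload distributions listed above can be expressed compactly in code. The following Python sketch is illustrative only: the function names are hypothetical, the thresholds are read as 1.5, 3, and 6 Mbps, and the treatment of the exact boundary values (which the list leaves open) is an assumption.

```python
# Illustrative sketch; names and boundary handling are assumptions,
# not part of the described implementation.
DBS_QUADLETS = 6               # dbs: data block size in quadlets (from the text)
QUADLET_BYTES = 4              # one IEEE 1394 quadlet is 32 bits
BLOCKS_PER_SOURCE_PACKET = 8   # db0..db7


def blocks_in_group(rate_mbps: float) -> int:
    """Number of consecutive data blocks that carry payload for rate R."""
    if rate_mbps <= 0:
        raise ValueError("rate must be positive")
    if rate_mbps < 1.5:
        return 1
    if rate_mbps < 3:
        return 2
    if rate_mbps < 6:
        return 4
    return 8


def allowed_groups(rate_mbps: float) -> list:
    """All groups of data blocks that may hold the payload for rate R."""
    size = blocks_in_group(rate_mbps)
    return [
        [f"db{i}" for i in range(start, start + size)]
        for start in range(0, BLOCKS_PER_SOURCE_PACKET, size)
    ]


# 8 blocks x 6 quadlets x 4 bytes, i.e. a 188-byte TS packet plus 4 more bytes.
SOURCE_PACKET_BYTES = BLOCKS_PER_SOURCE_PACKET * DBS_QUADLETS * QUADLET_BYTES

print(allowed_groups(2.0))    # pairs db0/db1, db2/db3, ...
print(allowed_groups(25.0))   # a single group with all of db0..db7
```

For an HDV-class rate, only the last branch applies, which is the case in which all eight blocks carry payload.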
Because our implementation is targeted at HD MPEG-2 streams, it supports R > 6 Mbps streams only and thus requires qpc = 0 and fn = 3. Each isochronous packet may contain N MPEG2-TS data blocks, each of which is 192 B long. From our experience with the camera mentioned below, N ∈ [0; 3].

² An interesting discussion of these cameras and compressions can be found at primer.htm. It notes that intra-frame compression is 2.5 times less efficient than inter-frame compression, so that a 25 Mbps HDV stream should be equivalent to a 60 Mbps intra-frame stream. The quality difference is obvious when dramatic changes happen in the image, e.g. during panning.

Equipment for capturing and processing high-definition (HD) video was extremely expensive until the advent of a new class of cameras such as the SONY HDR-FX1E and HVR-Z1E. These cameras can output either component analog HD video (which can be further converted to the more common SDI interface) or HDV compressed video through the FireWire interface. However, there are still close to no tools capable of working with this stream, and thus we have developed our own implementation of HDV reception for FreeBSD 5 using a modified fwcontrol program. The fwcontrol-based implementation has been submitted for evaluation by the FreeBSD team³.

Latency measurements. In order to measure the latency of the HDV workflow, from capturing a scene with an HDV camera to displaying the HD video on a computer screen, we have set up the following testbed:
- one computer (Pentium III 800 MHz, FreeBSD 5.3) was used to display a changing scene (an analog clock, but generally an arbitrary scene can be used),
- another computer (Pentium M 1700 MHz, ATI Mobility FireGL T2, FreeBSD 5.3) had an HDV HVR-Z1E camera set to 60i mode connected directly via the IEEE 1394 interface; the VideoLAN Client (VLC) software [93] was used to display the resulting video full-screen directly, without performing any deinterlacing,
- both computer screens were captured by another HVR-Z1E camera set to 60i mode.

The resulting video has been analyzed field by field and the latency has been measured at 5 different points. Because the delay of one field is 16.68 ms and there were 114.28±0.75 fields on average between the change occurring on the first screen and the same change being displayed on the second screen, we have concluded that the average latency of the HDV processing is 1.907±0.013 seconds.

6.3 Uncompressed HD

Real-time synchronous uncompressed HD over IP has been implemented by two independent projects: UltraVideo from McGill University [9] and UltraGrid by Colin Perkins [25]. UltraGrid is under more active development and its authors are actively contributing to IETF standardization. However, both systems use very similar hardware infrastructure and have the same network requirements.

Uncompressed HD video is usually transferred from the HD camera using the Serial Digital Interface (SDI) in the SMPTE 292M format, which has a bandwidth of 1.486 Gbps. The SDI data are extracted using an SDI-enabled HD capture card, encapsulated into IP using either RFC 3497 [26] or the IETF RTP Payload Format for Uncompressed Video standard [24], and transmitted over the network. On the other side, the IP encapsulation is stripped off and the data are either displayed locally or sent to an SDI-enabled display device using an HD capture card with SDI output.

Uncompressed high-definition video poses much higher bandwidth requirements on the underlying infrastructure than DV or HDV transmission. However, as it involves no compression, it allows minimizing the latency of the transmission. As noted in [4] (Figure 6.3), human perception of latency can be very sensitive: especially for musicians in a chamber orchestra, the tolerable latency can be as low as 5 ms. While this is clearly beyond the limits of computer processing and, for overseas transmission, even beyond the speed of light in glass optical fibers, it stresses the importance of working with uncompressed multimedia.

³ freebsd-firewire/ freebsd-firewire
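The remark about the speed of light in optical fibers can be made concrete with a rough calculation. The Python sketch below is only an illustration under stated assumptions: a refractive index of about 1.47 for glass fiber and a hypothetical fiber path of 7,500 km standing in for an overseas link. Neither value is taken from the measurements in this chapter.

```python
# Rough propagation-delay estimate; the path length and the refractive
# index are illustrative assumptions, not measured values.
C_VACUUM_KM_PER_S = 299_792.458   # speed of light in vacuum
FIBER_REFRACTIVE_INDEX = 1.47     # assumed typical value for glass fiber


def one_way_delay_ms(path_km: float,
                     refractive_index: float = FIBER_REFRACTIVE_INDEX) -> float:
    """One-way propagation delay over a fiber path, in milliseconds."""
    speed_km_per_s = C_VACUUM_KM_PER_S / refractive_index
    return path_km / speed_km_per_s * 1000.0


ASSUMED_PATH_KM = 7_500.0         # hypothetical overseas fiber path
delay = one_way_delay_ms(ASSUMED_PATH_KM)
print(f"one-way propagation delay: {delay:.1f} ms")
```

Under these assumptions the propagation delay alone comes to roughly 37 ms, several times the strictest limit quoted above, before any capture, processing, or queuing delay is added.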

[Figure 6.3 plot of latencies in ms between Brno, CZ and Chicago (StarLight), USA: limits for chamber orchestra, symphonic orchestra, and echo/lip-sync, compared with compressed video (capture, processing and compression), uncompressed video (capture, processing), network measurement, and light.]
FIGURE 6.3: Latency limits for collaborative environments. Latency limits for different types of collaboration are from [4].

Because we lacked equipment for full distributed testing of streams with bandwidth over 1 Gbps, as noted in Section 5.4, we have set up a point-to-point testbed to evaluate the usability of the distributed AEs with uncompressed HD. Both machines were dual-processor AMD64 systems (AMD Opteron 250 at 2.4 GHz) with 10 Gigabit Ethernet Chelsio T110 network interface cards. One machine was used as both sender and receiver, while the other machine was used as a traffic replicator and processor. To duplicate the streams, we used a distributed AE working over shared memory on both processors. When the sender was sending a 1.5 Gbps stream with 9 kB packets (jumbo frames), which was duplicated into two streams on the other machine and sent back to the first machine so that both machines were subjected to a data flow of 4.5 Gbps, the whole setup showed 0% packet loss. For a bandwidth of 1.65 Gbps, the packet loss was 0.005%, and for 1.7 Gbps it was 0.02%. After increasing the bandwidth to 1.75 Gbps, the packet loss was 0.1%, and further bandwidth increases led just to more packet loss. This corresponds well with the maximum UDP performance measured on both machines: we achieved 5.44 Gbps for a unidirectional UDP stream without packet loss using the same testbed, and when sending bidirectionally, the maximum bandwidth decreased proportionally, because the limitation results from saturating the PCI-X bus.

Thus we have shown that the concept and the prototype implementation of the distributed AE are suitable for user-empowered distribution of uncompressed HD video. Bringing in appropriate networking equipment will make not only duplication but also general multiplication and on-the-fly processing of uncompressed HD feasible.
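The relation between the stream rate and the total data flow handled by each machine in the point-to-point testbed can be checked with a few lines of arithmetic. This is a minimal sketch; the function name is hypothetical, and it assumes that each machine carries the original stream once plus one copy per duplicate.

```python
# Back-of-the-envelope check of the per-machine data flow in the testbed.
# Assumption: each machine handles the original stream plus every copy.
UNIDIRECTIONAL_UDP_LIMIT_GBPS = 5.44   # loss-free maximum reported in the text


def aggregate_flow_gbps(stream_gbps: float, copies: int) -> float:
    """Total data flow per machine for one stream replicated `copies` times."""
    return stream_gbps * (1 + copies)


for rate in (1.5, 1.65, 1.7, 1.75):
    total = aggregate_flow_gbps(rate, copies=2)
    headroom = UNIDIRECTIONAL_UDP_LIMIT_GBPS - total
    print(f"{rate:.2f} Gbps stream -> {total:.2f} Gbps total, "
          f"{headroom:.2f} Gbps below the unidirectional limit")
```

The duplicated 1.5 Gbps stream yields the 4.5 Gbps quoted above, and the rates at which loss starts to appear bring the total close to the measured unidirectional maximum.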

Part II
Asynchronous Distributed Processing

Chapter 7
Introduction to Asynchronous Distributed Processing

Asynchronous distributed processing poses different requirements on the infrastructure than synchronous processing. There is a need for fast distributed storage of huge capacity, accessible from acquisition tools, processing tools, and client tools (either indirectly, through a server that can access the storage, or by enabling clients to access it directly). Even if the latency of the processing is not an issue, users are very much concerned about the overall speed of processing. To create an efficient processing infrastructure, not only must an efficient job distribution model and scheduling be applied, but the location of data in the distributed storage infrastructure and the location of sources, processing capacity, and clients must also be taken into account and optimized. For example, if the data is available at one site, it makes sense to utilize the processing capacity available at that site. On the other hand, if we know that some significant processing capacity will become available at a defined time, it might be reasonable to migrate the data as close as possible to that processing infrastructure.

7.1 Objectives of Asynchronous Distributed Processing

The goal of our effort is to build a distributed asynchronous processing system that minimizes the overall processing time regardless of latency and that utilizes the powerful networks of computing and storage resources called Grids. This aim comprises several subtasks:
- designing a scheme for efficient distributed processing that scales close to linearly with respect to the number of nodes involved in the processing,
- proposing suitable job scheduling that takes into account both distributed processing capacity and storage capacity,
- optimizing the data location with respect to processing capacity and vice versa.

The target system must be designed to provide processing at least at real-time speed even when applying advanced transformations to the multimedia material. Looking at a real-world example, we may need to convert a video in DV format into a raw format or into common MPEG-4 based formats used for streaming or downloading (e.g. RealMedia, DivX) while applying high-quality de-interlacing and de-noising filters. Current high-end single-processor computers are able to process such data only three times slower than real-time speed.
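The real-time requirement translates directly into a lower bound on the number of processing nodes. The sketch below is a simplified illustration: the parallel efficiency parameter is an assumption introduced here to model less-than-linear scaling, and the function name is hypothetical.

```python
import math


def nodes_for_realtime(slowdown: float, efficiency: float = 1.0) -> int:
    """Smallest node count that reaches real-time speed.

    slowdown   -- how many times slower than real time one node is
    efficiency -- assumed parallel efficiency in (0, 1]; 1.0 means linear scaling
    """
    if slowdown <= 0 or not 0 < efficiency <= 1:
        raise ValueError("slowdown must be > 0 and efficiency in (0, 1]")
    # n nodes deliver n * efficiency / slowdown of real-time speed.
    return math.ceil(slowdown / efficiency)


print(nodes_for_realtime(3.0))        # ideal linear scaling
print(nodes_for_realtime(3.0, 0.8))   # assumed 80 % parallel efficiency
```

With the threefold slowdown quoted above, three nodes suffice under ideal linear scaling, while any loss of efficiency pushes the requirement higher.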

7.2 State of the Art

A number of both open-source and closed-source tools are available for processing multimedia content in a centralized way, and some of them also in a distributed way. To the author's knowledge, with the only exception discussed below, none of them has been integrated into the Grid environment and none of them supports distributed storage (unless it is emulated by the operating system as a local filesystem). The most important tools and approaches are listed in this section.

7.2.1 Grid and Distributed Storage Infrastructure

There are numerous projects aimed at building computational Grid infrastructure for various purposes in the U.S.A., Europe, and Japan. These projects range from Grids purpose-built for selected applications to general infrastructure projects. As a part of the Grid infrastructure, these projects provide large computational power in the form of Linux PC clusters, which are being rapidly expanded each year because of the cost-efficiency of this solution and on which we focus our attention. In the Czech Republic, the Grid activities are covered by the META Center project [78].

7.2.2 Video Processing Tools

Currently, a lot of tools are available for multimedia transcoding (a list of freely available tools can be found e.g. in [69] and [70]; a description of the most important commercial tools can be found in [20]), some of them open source and some closed source. The vast majority of them do not allow distributed encoding, while some allow distributed processing in homogeneous environments with some centralized shared storage capacity.

Due to its highly modular architecture, the transcode [55] tool supports transcoding between almost all common video distribution formats, except for those for which no freely available open-source implementation of an encoding library exists. It also allows the advanced processing that is needed, such as high-quality de-interlacing and down-sampling. A simple computation distribution is available, based on PVM [89] and a shared filesystem supported by the underlying operating system (typically NFS). The transcode tool can also be used as a video and sound acquisition tool on Linux using the Video4Linux interface. Another tool available for general multimedia transcoding is MEncoder, which is part of the MPlayer software suite [8].

However, none of the above-mentioned tools supports transcoding to the RealMedia format, which is currently one of the most popular formats for video streaming delivery. The company producing the RealMedia encoder decided to provide the source code of all applications for research and development purposes under the Helix Community Project [72]. Thus it is possible to explore the integration of this format into an asynchronous encoding environment.

7.2.3 Automated Distributed Video Processing

Shortly after our Distributed Encoding Environment, another system based on similar ideas was published, called A Fully Automated Fault-tolerant System for Distributed Video Processing and Offsite Replication [44]. This system uses an overall architecture similar to ours, based on parallelizing the encoding in a distributed computing environment by splitting the encoding into smaller chunks that are encoded in parallel. As the system doesn't use a distributed file system with replica support, it handles data replication using the Condor-related tools Stork [46] and DiskRouter [45]. Furthermore, the claimed fault tolerance is understood and handled only at the job scheduling level, and the system actually demonstrates the fault tolerance of the Condor-G [5] scheduling system on which it is based.

Chapter 8
Distributed Encoding Environment

In recent years there has been a growing demand for video archives available on the Internet, ranging from archives of university lectures [90, 67], through archives of medical videos and recordings of scientific experiments, to business and entertainment applications. Building these media libraries requires huge processing and storage capacities. In this chapter, we describe a system called the Distributed Encoding Environment (DEE) [39], which is designed to utilize Grid computing and storage infrastructure. This chapter is organized as follows: in Section 8.1 we propose the architecture of the DEE, in Section 8.2 suitable data and processor scheduling algorithms are analyzed, and Section 8.3 briefly describes the prototype implementation and evaluates its performance.

8.1 Model of Distributed Processing

There are two possible approaches to building a distributed video processing system, differing in the granularity of the parallelization:
1. parallelization at the level of the compression algorithm, i.e. fine-grained parallelization of the compression algorithm itself,
2. parallelization at the data level, i.e. coarse-grained parallelization of the whole encoding process by splitting the material and encoding the resulting parts in parallel.

While the former option is suitable for semi-asynchronous processing such as live streaming, it adds significant overhead and almost prevents reasonable linear scalability on a distributed processing infrastructure without a single shared memory, for several reasons. First, it usually involves substantial synchronization among the distributed processes, e.g. I-frames need to be handled before the processing of P- and B-frames can occur. Second, it requires moving the source data to the processing node just before the calculation, as the data is not available well in advance, and transferring the resulting data back. As the source data is in this case usually not available in advance, it is hard to schedule the data movements. Furthermore, the data movements in this model require low-latency transfers for efficient processing, and thus it is impossible to utilize a distributed storage infrastructure¹. The third problem is that fine-grained parallelization requires modifications to all source and target codecs in use, which is very hard as it might comprise tens of different algorithms to parallelize.

¹ Theoretically, it would be possible to utilize some highly experimental and not very affordable storage systems, such as data circulating in optical networks.

We have opted for the latter approach for the following reasons: since asynchronous processing relaxes the latency constraint, we may assume that the source data is completely acquired before the processing; moreover, because our target is to build a system that works faster than real time, we can suppose that the whole material is available in advance anyway. Compared to parallelization at the compression algorithm level, parallelization at the data level is codec-independent, and thus the same architecture and implementation can be used for many input/output formats. Furthermore, it can be used with target formats for which there is no open-source codec implementation; the only condition is that there is an efficient way of merging the resulting chunks together.

The proposed workflow for distributing the processing (Figure 8.1) looks as follows: the source data is split into chunks, which are then encoded in parallel, and the resulting data is merged back into the target data. The goal is then to minimize the completion time of the last finishing job of the parallel phase. Although we have relaxed the latency requirements posed on the asynchronous processing, and the initial and final phases count towards the overall latency of the processing, we require that these two phases be much faster than the parallel phase, thus making the processing effective from the real user's point of view. The source chunks for the parallel phase are stored in distributed storage (possibly in multiple copies for performance and reliability purposes) so that they are efficiently accessible by the distributed processing nodes.

[Figure 8.1 diagram: Source Data → Data Chunking → several parallel Chunk Processing steps → Chunk Merging.]
FIGURE 8.1: Workflow in the Distributed Encoding Environment model of processing distribution.
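The split, parallel encode, and merge workflow can be sketched in a few lines of Python. This is a toy illustration rather than the DEE implementation: `encode_chunk` is a hypothetical stand-in for a real encoder, a local thread pool stands in for distributed processing nodes, and merging is assumed to be plain concatenation.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(data: bytes, chunk_size: int) -> list:
    """Split the source data into uniform chunks (the last one may be shorter)."""
    return [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]


def encode_chunk(chunk: bytes) -> bytes:
    """Hypothetical stand-in for an encoder; here it just upper-cases bytes."""
    return chunk.upper()


def process(data: bytes, chunk_size: int, workers: int = 4) -> bytes:
    """Chunk the data, encode the chunks in parallel, and merge the results."""
    chunks = split_into_chunks(data, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in input order, so merging is a join.
        encoded = list(pool.map(encode_chunk, chunks))
    return b"".join(encoded)


print(process(b"abcdefghij", chunk_size=3))
```

Because the merge step only needs the encoded chunks in their original order, the encoder itself stays a black box, which is the codec independence argued for above.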
As the source data is complete before the processing, we may split the parallel processing into uniform chunks, which makes it possible to create a scheduling algorithm belonging to the PO class, as shown in Section 8.2.

8.1.1 Conventions Used

The overview of the infrastructure model used throughout this chapter is given in Figure 8.2, comprising data sources, storage depots, processing nodes, and the network infrastructure with links and active elements (routers/switches). In order to maintain consistent notation throughout this chapter, we also define a number of symbols below.

[Figure 8.2 diagram legend: data source, router/switch, storage depot, processing node.]
FIGURE 8.2: Model of target infrastructure.

Definition 8.1 (Data transcoding) Transformation of (multimedia) data from a source format to a target format is called transcoding.

Definition 8.2 (Data prefetch) Data prefetch² is the act of moving data closer to the processing infrastructure during the period between the time the job is scheduled and the time the job is run.

There is also a number of symbols and variables used, some of which are explained in more depth where appropriate:

$t$      time
$t_0$    now
$D$      set of depots that store the data to be processed (all depots unless indicated otherwise)
$p$      processing node
$P$      set of processing nodes
$d$      the depot where the data to be processed is stored
$D_u$    set of depots scheduled/used for actually accessing the data to be processed in task $u$
$u$      (type of) processing task
$U$      set of processing tasks (all the tasks have the same length)
$U_p$    set of tasks scheduled to processor $p$
$l_u$    length of processing task $u$; units [Mb]

² In the Grid community, the term data stage-in is often used as an equivalent of data prefetch.
³ This information can theoretically be obtained from most current advanced schedulers. However, there are a few issues that make this functionality partly theoretical only: existence of priority jobs in scheduling systems (a priority job can delay the availability of the processor),

$t_p^{\mathrm{sched\_free}}$   information from the job scheduler about the time at which processor $p$ will be available³; units [s]
$s_{p,u}$    processing performance⁴ of processor $p$ on (type of) task $u$; units [Mb/s]
$s'_{p,u}$   production performance of the resulting material for processor $p$ on (type of) task $u$; units [Mb/s]
$b_{D,p}(t)$ download capacity (bandwidth) from depot set $D$ to processor $p$ at time $t$, as discussed in Section 8.2.2; units [Mb/s]
$b_{p,D}(t)$ upload capacity (bandwidth) from processor $p$ to depot set $D$ at time $t$, as discussed in Section 8.2.2; units [Mb/s]

8.2 Scheduling algorithms

The work described in this section is primarily motivated by the need for efficient job scheduling across geographically distributed computing cluster infrastructure and distributed storage systems for distributed processing of large data sets. Such a scheduling system must take into account not only the processing power of each computing node (which is not uniform, contrary to what most scheduling algorithms assume), but also the estimated end-to-end network throughput between the location of the data in the distributed storage system and the processing nodes.

8.2.1 Use Cases and Scenarios

There are a number of scenarios that can be covered by our approach; the following list includes the ones we consider the most important:

- Scheduling the processing on the best hosts. The best host need not be the fastest one in terms of available processing power; it is the one on which the calculation finishes in the shortest time. To select the best node for the processing, we need to sort the hosts according to the estimated completion times of the processing and then use a processor scheduling algorithm to schedule the tasks.
- Selection of the best depots containing the source data with respect to processing capacity. Taking into account the available bandwidth between the data depots containing the source data and the processors that are about to process the data, we need to schedule which processor will use which depot.
- Prefetch decision support. Some evaluation criteria are needed to decide whether data prefetch is appropriate or not. The minimum condition states that the prefetch must accelerate the processing, i.e. it must decrease the overall processing time.
- Upload distribution support. If the data processing is to happen in the near enough future that we can predict which computing resources will be used for the processing, it may be useful to upload the resulting data back into the distributed storage with respect to the location of these computing resources.

³ (continued) existence of preemption jobs, which can preempt already running jobs; users are not required (and thus most users don't bother) to specify the expected run-time of their jobs, thus defaulting to the maximum run-time available in the given queue; non-trivial known problems with this functionality in well-known scheduling systems (e.g. PBSPro [88]). Nevertheless, there is considerable effort in current Grid computing to make estimates of job run-times [63], and thus we assume this functionality will be available very soon.
⁴ We assume that the processing performance of the processor is constant in time and that the processor is either available (free) or unavailable (busy) for the purpose of job scheduling. When some algorithms assume uniform tasks, we denote by $s_p$ the processing performance of processor $p$.

8.2.2 Components of the Model

There are two basic components of the model: the Completion Time Estimate (CTE), used for finding the best host for data processing, and the Network Traffic Prediction Service (NTPS), used for predicting the available end-to-end bandwidth between a data storage depot and a processor. Furthermore, some auxiliary functionality such as proximity functions, prefetch decision support, and upload optimizations is provided in this section.

Definition 8.3 (Completion Time Estimate) The Completion Time Estimate (CTE) is an estimate of the time when the processing finishes if it uses the specified computing resources while data is processed directly from/to the specified storage resources. Networking resources, defined by the location of the computing and storage resources and by the network topology, are used as well.

Because the prediction of network traffic can be very complex depending on the requirements on the prediction, as discussed below, we define the NTPS as a general interface for traffic prediction:

Definition 8.4 (Network Traffic Prediction Service) The Network Traffic Prediction Service (NTPS) is a service capable of estimating the network bandwidth available for data transmission between two hosts in the network in an end-to-end way using a specified stack of network protocols.

CTE: Completion Time Estimation

In general, the CTE can be obtained by solving the following equations for the job $u$, given the location of the data in depots $D_u$, using processor $p$, with the resulting data of length $l_u^{\mathrm{out}}$ being uploaded into depot set $D'_u$:

\[ \int_{t_p^{\mathrm{sched\_free}}}^{\mathrm{CTE}_d(p, D_u, u)} \min\{s_{p,u},\, b_{D_u,p}(t)\}\, dt = l_u \tag{8.1} \]

\[ \int_{t_p^{\mathrm{sched\_free}}}^{\mathrm{CTE}_u(p, D_u, u)} r(t)\, dt = l_u^{\mathrm{out}} \tag{8.2} \]

Here $\mathrm{CTE}_d(p, D_u, u)$ is the estimated completion time of the download and processing phase and $\mathrm{CTE}_u(p, D_u, u)$ is the estimated completion time of the upload phase. With $z(t) = s'_{p,u} - b_{p,D'_u}(t)$, the amount of locally stored data when production is faster than the network bandwidth available for transport is

\[ Z(t) = \max\left\{ \int_{t_p^{\mathrm{sched\_free}}}^{t} z(\tau)\, d\tau,\ 0 \right\}, \]

and the resulting upload rate is

\[ r(t) = \begin{cases} \min\{s'_{p,u},\, b_{p,D'_u}(t)\} & Z(t) = 0, \\ b_{p,D'_u}(t) & Z(t) > 0. \end{cases} \]

Because $\mathrm{CTE}_d(p, D_u, u) < \mathrm{CTE}_u(p, D_u, u)$,

\[ \mathrm{CTE}(p, D_u, u) = \mathrm{CTE}_u(p, D_u, u). \tag{8.3} \]

This model also presumes that uploading into the storage infrastructure takes place in parallel with downloading; otherwise the lower bound of the integral in (8.2) needs to be modified accordingly (e.g. when uploading happens just after the processing finishes, the lower bound in (8.2) would be $\mathrm{CTE}_d(p, D_u, u)$).

If we assume that $b_{D_u,p}(t)$ and $r(t)$ are constant in the interval $\langle t_p^{\mathrm{sched\_free}},\, \mathrm{CTE}(p, D_u, u) \rangle$ (which can be justified e.g. when the job duration is shorter than the time resolution of the network traffic prediction service), and if we assume that the upload phase starts just after the processing finishes, we can use the simplified model

\[ \mathrm{CTE}(p, D_u, u) = t_p^{\mathrm{sched\_free}} + \frac{l_u}{\min\{s_{p,u},\, b_{D_u,p}(t_p^{\mathrm{sched\_free}})\}} + \frac{l_u^{\mathrm{out}}}{r(t_p^{\mathrm{sched\_free}})}. \tag{8.4} \]

To simplify the model even further, we can assume that uploading into the infrastructure is not the bottleneck, since $l_u^{\mathrm{out}} \ll l_u$ (which is typical for video processing from a raw video format into compressed formats) while $b_{D_u,p} \approx b_{p,D'_u}$, or that the upload phase takes only negligible time even if the uploading occurs after the processing. Thus we obtain the formula that will be used from here on for the sake of simplicity:

\[ \mathrm{CTE}(p, D_u, u) = t_p^{\mathrm{sched\_free}} + \frac{l_u}{\min\{s_{p,u},\, b_{D_u,p}(t_p^{\mathrm{sched\_free}})\}}. \tag{8.5} \]

If the assumption of a negligible upload phase is not valid, the model and the resulting algorithm can easily be extended to support it. This function allows us to find the most suitable processors for the processing. To avoid synchronous overloading of the processing infrastructure, we suggest using one of two well-known approaches: either randomizing the set of processing nodes and pre-selecting some subset, or pre-selecting the subset manually. For the given subset we calculate the CTE estimates and launch a greedy scheduling algorithm starting with the processor with the lowest CTE.

Available bandwidth estimate

Let us assume we have some kind of Network Traffic Prediction Service (NTPS) that provides us with an estimate of the available network throughput between node A and node B at time $t$: $\mathrm{NTPS}(A, B, t)$. To obtain a realistic estimate of the available TCP bandwidth, at least the following parameters need to be evaluated: the minimum line capacity on the path, the round-trip time (RTT), and the packet loss rate, as all of these are important for the performance of TCP, which underlies our applications. There are several possible models of NTPS with different interactions with our job scheduling model, as shown below.
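The simplified completion time estimate without the upload term, together with the greedy selection by lowest CTE described earlier in this section, can be sketched as follows. This is an illustration under stated assumptions: tasks are uniform, the bandwidth is treated as constant, and the greedy rule (assign each task to the processor that would finish it earliest) is one plausible reading of the text rather than the exact algorithm. All names are hypothetical.

```python
def cte(t_free: float, l_u: float, s_pu: float, b_dp: float) -> float:
    """Simplified CTE: start time plus task length over the effective rate.

    t_free -- time at which the processor becomes available [s]
    l_u    -- task length [Mb]
    s_pu   -- processing performance [Mb/s]
    b_dp   -- download bandwidth from the depots [Mb/s]
    """
    return t_free + l_u / min(s_pu, b_dp)


def greedy_schedule(processors: dict, n_tasks: int, l_u: float):
    """Assign uniform tasks one by one to the processor with the lowest CTE.

    processors maps a name to (t_free, s_pu, b_dp).
    Returns (assignment, makespan).
    """
    free_at = {name: params[0] for name, params in processors.items()}
    assignment = {name: [] for name in processors}
    makespan = 0.0
    for task in range(n_tasks):
        # sorted() makes tie-breaking deterministic.
        best = min(
            sorted(processors),
            key=lambda n: cte(free_at[n], l_u, processors[n][1], processors[n][2]),
        )
        finish = cte(free_at[best], l_u, processors[best][1], processors[best][2])
        assignment[best].append(task)
        free_at[best] = finish
        makespan = max(makespan, finish)
    return assignment, makespan


# Toy input: A is network-bound (8 Mb/s), B is CPU-bound (4 Mb/s).
nodes = {"A": (0.0, 10.0, 8.0), "B": (2.0, 4.0, 100.0)}
plan, finish_time = greedy_schedule(nodes, n_tasks=3, l_u=40.0)
print(plan, finish_time)
```

In the toy input, the nominally faster processor A is limited by its download bandwidth, which illustrates why the best host is the one with the lowest completion time rather than the highest raw performance.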
The main difficulty arises when the traffic generated by our application has regular patterns and is thus included in the NTPS prediction itself. In such a scenario, we need to differentiate between the predicted traffic generated by our application and the predicted background traffic. Moving from the most complex to simpler models, we will show the interactions with our scheduling system for each of them.

NTPS Model #1

Let us assume the most complete NTPS model with the following properties:
1. the NTPS can predict the cumulative available TCP bandwidth in an N:1 fashion, when N hosts are sending data to a single host in parallel,
2. our application tells the NTPS which traffic in the NTPS measurements has been generated by it, so that the NTPS can identify it,
3. the NTPS can provide a prediction of the background traffic for our application by subtracting the predicted traffic of our application from the overall predicted traffic,
4. the NTPS performs in-advance bandwidth allocations in time for scheduled jobs and projects these allocations into the available bandwidth predictions,
5. the NTPS can compare the reserved vs. actual traffic of our application and can use it to keep statistical information (which can be used e.g. for automatically adjusting the reservation if the application regularly overestimates the bandwidth needed in its allocation requests),
6. the NTPS can estimate the available bandwidth in an end-to-end way; that is, it can decrease the reported available bandwidth correspondingly if the bottleneck is in either the storage depot itself or the processing node itself.

Under such conditions, what we need from the prediction service is the total bandwidth available between processor $p$ and depot set $D$,

\[ b_{D,p}(t) = \mathrm{NTPS}(D, p, t). \tag{8.6} \]

The application then allocates the bandwidth per processor $b^{\mathrm{sched}}_{D',p}(t)$ with the NTPS by moving depots from depot set $D$ to depot set $D'$:

\[ \exists d \in D:\ b_{D' \cup \{d\},p}(t) > b_{D',p}(t)\ \wedge\ b_{D',p}(t) < s_{p,u}\ \Longrightarrow\ D' := D' \cup \{d\};\ D := D \setminus \{d\}. \tag{8.7} \]

The process is repeated until the total reserved bandwidth is larger than $s_{p,u}$ or until no other depot in $D$ is available that increases $b^{\mathrm{sched}}_{D',p}(t)$, and then the reservation is made: $b^{\mathrm{sched}}_{D',p}(t) := b_{D',p}(t)$.

NTPS Model #2

If the NTPS is unable to perform prediction in an N:1 fashion (relaxing condition 1), it is nearly impossible to use multiple depots to feed one processor, as this requires detailed knowledge of the network topology. Thus the formulas above become

\[ b_{D,p}(t) = \mathrm{NTPS}(d, p, t) \quad \text{where } D = \{d\}, \tag{8.8} \]

\[ b^{\mathrm{sched}}_{d,p}(t) = \min\{s_{p,u},\, b_{d,p}(t)\}. \tag{8.9} \]

NTPS Model #3

If we assume the same behavior as above (Model #2) with the exception of unavailable bandwidth allocations (relaxing conditions 1 and 4), the model gets more complicated again, even if we use only a single depot per processor:

\[ b_{D,p}(t) = \mathrm{NTPS}(d, p, t) - \sum_{p'} b^{\mathrm{sched}}_{d,p'}(t) \quad \text{where } D = \{d\}, \tag{8.10} \]

where $p'$ goes over all processors that share some link with previously scheduled processing (thus this term is omitted when creating the first estimate), i.e. $(d, p)$ and $(d, p')$ share at least one link. Again, this requires detailed knowledge of the network topology and thus it can hardly be used.
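The depot selection loop of NTPS Model #1 can be illustrated with a short Python sketch. It is not the thesis implementation: the `ntps` callable is a hypothetical stand-in for NTPS(D', p, t) at a fixed processor and time, the bandwidth of an empty depot set is assumed to be zero, and picking the depot with the largest gain is a choice made here (the rule in the text accepts any depot that increases the bandwidth).

```python
def select_depots(candidates, s_pu: float, ntps):
    """Grow the depot set until the processor is saturated or no depot helps.

    candidates -- iterable of depot names (the set D)
    s_pu       -- processing performance of the processor [Mb/s]
    ntps       -- callable: frozenset of depots -> predicted bandwidth [Mb/s]
    Returns (selected depot set D', bandwidth to reserve).
    """
    selected = set()
    remaining = set(candidates)
    current = 0.0                      # assumed bandwidth of the empty set
    while remaining and current < s_pu:
        # Assumption: take the depot with the largest resulting bandwidth.
        best = max(sorted(remaining),
                   key=lambda d: ntps(frozenset(selected | {d})))
        gain = ntps(frozenset(selected | {best}))
        if gain <= current:            # no remaining depot increases bandwidth
            break
        selected.add(best)
        remaining.discard(best)
        current = gain
    return selected, current


# Toy N:1 prediction: per-depot rates add up to a shared 90 Mb/s bottleneck.
RATES = {"d1": 30.0, "d2": 50.0, "d3": 20.0}


def toy_ntps(depots: frozenset) -> float:
    return min(sum(RATES[d] for d in depots), 90.0)


chosen, reserved = select_depots(RATES, s_pu=70.0, ntps=toy_ntps)
print(sorted(chosen), reserved)
```

The loop stops as soon as the predicted bandwidth reaches the processing performance, since any further depot would only reserve capacity that the processor cannot use.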
NTPS Model #4

If the network traffic forecasting service is capable of including our jobs in its estimate but is unable to isolate our traffic from its predictions, the criterion becomes

$$\mathrm{NTPS}(D, p, t) > 0 \quad \text{or} \quad \mathrm{NTPS}(d, p, t) > 0 \tag{8.11}$$

as we are watching whether there is still some spare bandwidth available to say whether congestion (including our traffic) is imminent or not.

NTPS Model #5

If the NTPS certainly doesn't include our traffic in its prediction (e.g. since the traffic generated by our application is neither regular nor predictable), the criterion becomes

$$b_{D,p}(t) = \mathrm{NTPS}(D, p, t) \quad \text{or} \quad b_{D,p}(t) = \mathrm{NTPS}(d, p, t) \text{ where } D = \{d\}, \tag{8.12}$$
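The depot-set growth of NTPS Model #1 (equation 8.7) can be illustrated with a minimal Python sketch. It is not the thesis implementation: the `ntps` callable is a hypothetical stand-in for the prediction service of equation (8.6), and picking the depot with the largest predicted gain is an assumption of this sketch, since (8.7) only requires that some depot increases the bandwidth.

```python
from typing import Callable, FrozenSet, Iterable, Set, Tuple

# Hypothetical stand-in for NTPS(D, p, t): predicted bandwidth from a depot
# set to processor p at time t.
Ntps = Callable[[FrozenSet[str], str, float], float]


def reserve_depots(ntps: Ntps, candidates: Iterable[str], p: str,
                   t: float, s_pu: float) -> Tuple[Set[str], float]:
    """Grow the depot set D' until it saturates processor p or no depot helps."""
    chosen: Set[str] = set()          # D' in equation (8.7)
    remaining = set(candidates)       # D in equation (8.7)
    current = 0.0                     # b_{D',p}(t) for the empty set
    while remaining and current < s_pu:
        # Assumption of this sketch: try the depot with the largest gain first.
        best = max(sorted(remaining),
                   key=lambda d: ntps(frozenset(chosen | {d}), p, t))
        gain = ntps(frozenset(chosen | {best}), p, t)
        if gain <= current:           # no depot increases the bandwidth
            break
        chosen.add(best)
        remaining.discard(best)
        current = gain
    return chosen, current            # reservation: b_sched := b_{D',p}(t)


if __name__ == "__main__":
    per_depot = {"d1": 40.0, "d2": 30.0, "d3": 50.0}
    additive = lambda depots, p, t: sum(per_depot[d] for d in depots)
    print(reserve_depots(additive, per_depot, "p1", 0.0, 80.0))
```

With an additive predictor the loop stops as soon as the processor speed is reached; with a predictor that saturates (for example because of a shared bottleneck) it stops when adding depots no longer helps.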

Proximity Function

The auxiliary proximity function is an approximate static replacement for the NTPS. It allows the scheduling system to assess the closeness of processors and storage depots in a static, time-averaged fashion when no dynamic NTPS is available. Similarly to the NTPS function, the proximity functions must take into account the maximum achievable end-to-end throughput, which depends on the transport protocol used between the data depots and the computing nodes. For the TCP transport protocol, it is based on estimated⁵ or measured average TCP throughput, while for UDP it might be just the limiting capacity, and possibly also the loss, of the network between storage and processor. Proximity functions can be prototyped as follows:

    PX(p)      ... returns vector of depots close to p (in non-increasing order)
    PX^inv(d)  ... returns vector of processors close to single depot d (in non-increasing order)
    PX^inv(D)  ... returns vector of processors close to depot set D (in non-increasing order)

Prefetch Evaluation

First, it makes no sense to perform the prefetch if the processing power is the bottleneck, so the prefetch makes sense only if

$$s_{p,u} > b_{D,p}(t) \tag{8.13}$$

where $D$ is the depot set on which the data to process is located. Thus it is meaningful to perform a prefetch from depot set $D$ to depot set $D'$ if the following condition is met:

$$t_p^{\text{sched\_free}} + \frac{l_u}{\min\{s_{p,u},\, b_{D,p}(t_p^{\text{sched\_free}})\}} > t' + \frac{l_u}{\min\{s_{p,u},\, b_{D',p}(t')\}} \tag{8.14}$$

where

$$t' = \max\{t_p^{\text{sched\_free}},\; t_0 + t^{\text{prefetch}}_{D \to D'}\} \tag{8.15}$$

It is also necessary to find out whether there is some available depot which is closer than the current ones. The minimal condition to attempt a prefetch is that for any $p \in P$

$$\exists d' \in \Big\{\Big(\bigcup_i \{PX_i(p)\}\Big) \setminus D\Big\}.\; b_{d',p}(t) > 0 \tag{8.16}$$

where $PX_i(p)$ is the $i$-th element of vector $PX(p)$. If we want to maintain the number of copies of the data and just move the data within the storage infrastructure, the condition looks as follows:

$$\exists d' \in \Big\{\Big(\bigcup_i \{PX_i(p)\}\Big) \setminus D\Big\}\; \exists d \in D.\; b_{d',p}(t) > b_{d,p}(t) \tag{8.17}$$

The simplified condition for the case when only a single data copy is used is

$$PX_0(p) \neq d \tag{8.18}$$

⁵ Maximum TCP throughput is limited not only by the network capacity, but also by round-trip time, packet loss, and maximum segment size (which is the network maximum transmission unit minus the TCP and IP headers, together 40 bytes). Based on analytical models of standard TCP congestion control [53, 5], the TCP throughput is proportional to $\frac{MSS}{RTT \cdot \sqrt{loss}}$. Such estimates are implemented e.g. in the Network Diagnostic Tool [85]. More elaborate models are also available [56].
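The prefetch test of equations (8.13) to (8.15) can be written down as a short Python sketch. This is only an illustration under stated assumptions: the bandwidths are passed in as numbers already sampled at the relevant times (the text treats them as functions of time), they are assumed to be positive, and all function and parameter names are hypothetical.

```python
def completion_time(start: float, l_u: float, s_pu: float, bandwidth: float) -> float:
    """Start time plus task length over the slower of processor and data feed."""
    return start + l_u / min(s_pu, bandwidth)


def prefetch_pays_off(t_free: float, t0: float, t_prefetch: float, l_u: float,
                      s_pu: float, b_current: float, b_target: float) -> bool:
    """Return True when prefetching from D to D' finishes the task sooner.

    b_current stands for b_{D,p} sampled at t_free, b_target for b_{D',p}
    sampled at t'; both are assumed to be positive.
    """
    if s_pu <= b_current:
        # Condition (8.13) fails: the processor is the bottleneck.
        return False
    t_prime = max(t_free, t0 + t_prefetch)                      # (8.15)
    without = completion_time(t_free, l_u, s_pu, b_current)     # left side of (8.14)
    with_prefetch = completion_time(t_prime, l_u, s_pu, b_target)
    return without > with_prefetch


if __name__ == "__main__":
    # A short prefetch to a much closer depot is worthwhile ...
    print(prefetch_pays_off(0.0, 0.0, 10.0, 1000.0, 100.0, 20.0, 80.0))
    # ... a long one is not.
    print(prefetch_pays_off(0.0, 0.0, 60.0, 1000.0, 100.0, 20.0, 80.0))
```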

Upload Optimizations

If we know at the time of uploading data from source nodes into data storage depots that there is some pool of processors $P$ we want to use, and assuming that the storage infrastructure can perform auto-replication and prefetching, we can upload the data to the following set of depots:

$$\bigcup_{p \in P} PX_0(p) \tag{8.19}$$

Our model consists of two stages. During the first stage, the processor scheduling algorithm assigns tasks to the processors, and in the second stage, the storage scheduling algorithm assigns tasks to depots. We assume two models of storage scheduling: the first one is 1 to 1, where the data of one task is stored in a single depot only. The second model is 1 to n, where the data of one task is replicated to n depots that can be accessed simultaneously. Last but not least, it is important to keep in mind that these models handle uniform tasks only, as discussed earlier.

Processor scheduling

We are not using an online algorithm; we use the abstraction that all the tasks are known at the time of scheduling. For a real online algorithm, this might not hold. Furthermore, for the purpose of this algorithm we don't care about the available network capacity between storage depots and processing nodes, and the only measure of speed is $s_{p,u}$, which is denoted as $s_p$ because of the uniform job size.

Input: set of processors $P$, set of tasks $U$, task length $l$, speed of each processor $s_p$.
Output: sets $U_p$ that contain the tasks assigned to processor $p$, and for $u \in U$ the scheduled time to start the task $t_u$ is computed.
Goal: minimize.
Measure: maximum processor running time.

The common processor scheduling problem, which takes uniform processors and tasks of different sizes, belongs to the NPO class. In this case we have uniform task size and processors with different speeds, which is denoted as a $Q_m \mid p_j = 1 \mid C_{\max}$ class problem [47]. Let $t_p^{\text{sched\_free}}$ be the time when processor $p$ is free (meaning there is no task scheduled to processor $p$ at time $t_p^{\text{sched\_free}}$).
We are using the greedy algorithm shown in Figure 8.3 for assigning tasks to processors. It is easy to see that the complexity of the algorithm is $O(|P| \cdot |U|)$.

     1  foreach p ∈ P do
     2      U_p := ∅;
     3      t_p^sched_free := 0;
     4  od
     5  foreach u ∈ U do
     6      p := p such that ∀p' ∈ P. t_p^sched_free + l_u/s_p ≤ t_p'^sched_free + l_u/s_p';
     7      U_p := U_p ∪ {u};
     8      t_u := t_p^sched_free;
     9      t_p^sched_free := t_p^sched_free + l_u/s_p;
    10  od

FIGURE 8.3: PS Algorithm: Greedy algorithm for processor scheduling
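The greedy PS algorithm of Figure 8.3 translates almost line by line into Python. The following is a minimal sketch, not the thesis code; the function name, the dictionary-based data layout, and the tie-breaking by processor order are assumptions of this example.

```python
from typing import Dict, List, Tuple


def greedy_processor_schedule(
    speeds: Dict[str, float], tasks: List[str], length: float
) -> Tuple[Dict[str, List[str]], Dict[str, float], float]:
    """Assign uniform tasks to processors of different speeds (Figure 8.3)."""
    free_at = {p: 0.0 for p in speeds}              # t_p^sched_free
    assigned: Dict[str, List[str]] = {p: [] for p in speeds}   # U_p
    start: Dict[str, float] = {}                    # t_u
    for u in tasks:
        # Line 6: the processor on which this task would finish earliest.
        p = min(speeds, key=lambda q: free_at[q] + length / speeds[q])
        assigned[p].append(u)                       # line 7
        start[u] = free_at[p]                       # line 8
        free_at[p] += length / speeds[p]            # line 9
    makespan = max(free_at.values())                # the measure to minimize
    return assigned, start, makespan


if __name__ == "__main__":
    plan, starts, makespan = greedy_processor_schedule(
        {"fast": 2.0, "slow": 1.0}, ["u1", "u2", "u3"], 2.0
    )
    print(plan, starts, makespan)
```

Each task scans all processors once, which matches the stated $O(|P| \cdot |U|)$ complexity.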

Theorem 8.1 Processor scheduling algorithm PS belonging to PO class provides optimum solution for tasks of uniform size.

PROOF We need to prove that the greedy processor scheduling algorithm PS belonging to PO class (Figure 8.3) returns the optimum solution. Since all the tasks are uniform and no task precedence is allowed, we can see that no permutation of tasks inside $U_p$ results in a better or worse solution. Moreover, let $u_1 \in U_{p_1}$ and $u_2 \in U_{p_2}$ for two processors $p_1 \neq p_2$; we can see that $\{U_{p_1} \setminus u_1\} \cup \{u_2\}$, $\{U_{p_2} \setminus u_2\} \cup \{u_1\}$ does not give a better solution, because the tasks are of uniform size.

Let $u_1 \in U_{p_1}$ and $U_{p_2}$ where $p_1 \neq p_2$; we show that $\{U_{p_1} \setminus u_1\}$, $\{U_{p_2} \cup u_1\}$ does not give a better solution. Since we can do any permutation of tasks in any $U_p$, let $u_1$ be the task whose $t_{u_1}$ is highest in $U_{p_1}$, i.e. it is the last scheduled task in $U_{p_1}$. Let $u_2 \in U_{p_2}$ be the task whose $t_{u_2}$ is highest in $U_{p_2}$.

In the case of $t_{u_2} + \frac{l_u}{s_{p_2}} + \frac{l_u}{s_{p_2}} > t_{u_1} + \frac{l_u}{s_{p_1}}$, a worse solution is found.

In the case of $t_{u_2} + \frac{l_u}{s_{p_2}} + \frac{l_u}{s_{p_2}} < t_{u_1} + \frac{l_u}{s_{p_1}}$, a better solution is found, but we show that this is impossible: with $\forall p' \in P.\; t_p^{\text{sched\_free}} + \frac{l_u}{s_p} \leq t_{p'}^{\text{sched\_free}} + \frac{l_u}{s_{p'}}$ holding (line 6 in Figure 8.3), we substitute variables and obtain $t_{u_1} + \frac{l_u}{s_{p_1}} \leq t_{u_2} + \frac{l_u}{s_{p_2}} + \frac{l_u}{s_{p_2}}$. But this is a contradiction with $t_{u_2} + \frac{l_u}{s_{p_2}} + \frac{l_u}{s_{p_2}} < t_{u_1} + \frac{l_u}{s_{p_1}}$.

Storage scheduling problem, 1 to 1 model

Input: set of all depots $D$, set of processors $P$, set of tasks $U$, speed of each processor $s_p$, transfer speed between processor and depot $b_{d,p}(t)$, for $p \in P$ scheduled $U_p$, and for $u \in U$ scheduled $t_u$.
Output: sets $P_d$ that contain the tasks assigned to depot $d$.
Goal: maximize.
Measure: $\sum_{u \in U} f(b_{d,p}(t_u) - s_p)$ where $d \in D$ is such that $u \in P_d$, $p \in P$ is such that $u \in U_p$, and

$$f(x) = \begin{cases} 1 & x \geq 0 \\ 0 & x < 0 \end{cases}$$

Theorem 8.2 The 1 to 1 storage scheduling problem is NPO-complete.

PROOF We use Karp's reduction to the Bin-Packing problem.
Let $I = \{a_1, a_2, \ldots, a_n\}$ be a finite set of rational numbers with $a_i \in (0, 1]$ for $i = 1, \ldots, n$; we search for a minimal partition $\{B_1, B_2, \ldots, B_k\}$ of $I$ such that $\sum_{a_i \in B_j} a_i \leq 1$ for $j = 1, \ldots, k$. Let $s_{p_1} = a_1, s_{p_2} = a_2, \ldots, s_{p_n} = a_n$ for the 1 to 1 storage scheduling problem. Let the processors and depots be interconnected via the complete graph. Let $\forall d \in D: \sum_{p \in P} b_{d,p}(t) \leq 1$ for any time $t$. We search for the minimum set of depots $D' \subseteq D$ so that

$$\sum_{u \in U} f(b_{d,p}(t_u) - s_p)$$

where $d \in D'$ is such that $u \in P_d$, $p \in P$ is such that $u \in U_p$, and

$$f(x) = \begin{cases} 1 & x \geq 0 \\ 0 & x < 0 \end{cases}$$

is maximal. That means we run the algorithm for 1 to 1 storage scheduling up to $k + 1$ times. The depots correspond to the partition $\{B_1, B_2, \ldots, B_k\}$.
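Bin-Packing instances of the kind used in the reduction above are commonly approximated with the First Fit Decreasing heuristic. The following Python sketch shows that heuristic on rational item sizes and unit bin capacity; it illustrates the Bin-Packing side only, and the function name and sample instance are made up for this example.

```python
from fractions import Fraction
from typing import Iterable, List


def first_fit_decreasing(items: Iterable[Fraction],
                         capacity: Fraction = Fraction(1)) -> List[List[Fraction]]:
    """Pack items into as few bins as the heuristic manages to find."""
    bins: List[List[Fraction]] = []
    for item in sorted(items, reverse=True):      # largest items first
        for b in bins:                            # first bin that still fits
            if sum(b) + item <= capacity:
                b.append(item)
                break
        else:                                     # no bin fits: open a new one
            bins.append([item])
    return bins


if __name__ == "__main__":
    sizes = [Fraction(1, 2), Fraction(7, 10), Fraction(1, 2), Fraction(1, 5),
             Fraction(2, 5), Fraction(1, 5), Fraction(1, 2), Fraction(1, 10)]
    for b in first_fit_decreasing(sizes):
        print([str(x) for x in b], "sum =", sum(b))
```

Exact rational arithmetic is used so that a bin filled to exactly 1 is not rejected because of floating-point rounding.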

We use the approximation algorithm First Fit Decreasing, which is used for the Bin-Packing problem. First Fit Decreasing is a $\frac{3}{2}$-approximation scheme [4].

     1  foreach d ∈ D do
     2      P_d := ∅;
     3  od
     4  foreach p ∈ P do
     5      t_p^sched_free := PBS(p);
     6      ProcTime(p) := l_u / min{s_p, b_{D,p}(t_p^sched_free)};
     7  od
     8  foreach u ∈ U do
     9      p := p such that ∀p' ∈ P. t_p^sched_free + ProcTime(p) ≤ t_p'^sched_free + ProcTime(p');
    10      d := d such that ∀d' ∈ D. b_{d,p}(t_p^sched_free) ≥ b_{d',p}(t_p^sched_free);
    11      sched_depot(u, d);   /* P_d := P_d ∪ {u} */
    12      sched_job(p, u);     /* t_p^sched_free := t_p^sched_free + ProcTime(p) */
    13  od

FIGURE 8.4: 1-DS Algorithm: 1 to 1 task scheduling

The algorithm for processor and task scheduling is shown in Figure 8.4. The function sched_job(p, u) tells the cluster's resource scheduling system (e.g. PBSPro) to allocate processor time starting at $t_p^{\text{sched\_free}}$ and to mark the particular processor busy. The function sched_depot(p, d) changes the network and depot conditions $b_{d,p}(t_p^{\text{sched\_free}})$ in such a way that the data transfer from depot $d$ to processor $p$ will utilize network and depot capacity.

Storage scheduling problem, 1 to n model

Input: set of depots $D$, set of processors $P$, set of tasks $U$, speed of each processor $s_p$, transfer speed between processor and depot $b_{d,p}(t)$, for $p \in P$ scheduled $U_p$, and for $u \in U$ scheduled $t_u$.
Output: sets $P_d$ that contain the tasks assigned to depot $d$.
Goal: maximize.
Measure: $\sum_{u \in U_p} f\big(\sum_{d \in D_u} b_{d,p}(t_u) - s_p\big)$ where $D_u = \{d \mid u \in P_d\}$, $p \in P$, and

$$f(x) = \begin{cases} 1 & x \geq 0 \\ 0 & x < 0 \end{cases}$$

The n-DS algorithm in Figure 8.5 is a modified version of the algorithm for 1 to 1 task scheduling that uses replicas of the data on different data storage depots to find the optimum solution. Transfers to processors can be done from multiple sources. It is important to keep in mind that we work in the complete graph to achieve PO-class complexity, and thus line 14 in Figure 8.5 can work with each $b_d$ separately instead of $b_D$: since there is no shared link, the data flows are independent and $b_d$ is additive.
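The additivity argument can be made concrete with a small Python sketch of the replica selection step: depots are added in order of decreasing bandwidth until the processor is saturated or no depot is left. This is an illustration only; it assumes a complete graph (so per-depot bandwidths simply add up) and ignores the capacity bookkeeping that sched_depot performs in the figures. The function name and data layout are hypothetical.

```python
from typing import Dict, List, Tuple


def select_replica_depots(s_p: float, depot_bw: Dict[str, float]) -> Tuple[List[str], float]:
    """Pick depots for one task until their summed bandwidth reaches s_p.

    depot_bw maps each depot to its bandwidth towards the chosen processor;
    on a complete graph these bandwidths are assumed to be additive.
    """
    chosen: List[str] = []
    total = 0.0
    # Fastest depots first, mirroring the greedy choice in the pseudo-code.
    for depot, bw in sorted(depot_bw.items(), key=lambda kv: kv[1], reverse=True):
        if total >= s_p:          # processor already saturated
            break
        if bw <= 0:               # remaining depots cannot contribute
            break
        chosen.append(depot)
        total += bw
    return chosen, total


if __name__ == "__main__":
    print(select_replica_depots(100.0, {"a": 60.0, "b": 30.0, "c": 50.0}))
```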
Theorem 8.3 The 1 to n storage scheduling belongs to PO class if and only if depots and processors are interconnected via the complete graph.

PROOF We need to show that the greedy algorithm n-DS returns the optimum solution to prove that 1 to n storage scheduling belongs to PO class. Indeed, using replicas allows us to utilize all the depots to their maximum, which means that no better solution can be found. In the case of non-complete graphs, some network conditions can prevent utilizing some depots to the maximum extent when the First Fit Decreasing algorithm is used (e.g. the second fastest processor is connected only to the fastest depot, while the fastest processor is connected to all depots and is fast enough to utilize the fastest depot to its maximum; then the second fastest processor has no access to a free depot).

     1  foreach u ∈ U do
     2      D_u := ∅;
     3  od
     4  foreach d ∈ D do
     5      P_d := ∅;
     6  od
     7  foreach p ∈ P do
     8      t_p^sched_free := PBS(p);
     9      ProcTime(p) := l_u / min{s_p, b_{D,p}(t_p^sched_free)};
    10  od
    11  foreach u ∈ U do
    12      p := p such that ∀p' ∈ P. t_p^sched_free + ProcTime(p) ≤ t_p'^sched_free + ProcTime(p');
    13      while (s_p > b_{D_u,p} ∧ D \ D_u ≠ ∅) do
    14          d := d such that ∀d' ∈ D. b_{d,p}(t_p^sched_free) ≥ b_{d',p}(t_p^sched_free);
    15          sched_depot(u, d);   /* P_d := P_d ∪ {u}, D_u := D_u ∪ {d} */
    16      od
    17      sched_job(p, u);   /* t_p^sched_free := t_p^sched_free + l_u / min{s_p, b_{D_u,p}(t_p^sched_free)} */
    18  od

FIGURE 8.5: n-TS Algorithm: 1 to n task scheduling

Current, largely over-provisioned high-speed networks can be seen from the scheduling point of view as providing a logical (virtual) complete graph. As the network capacity grows substantially towards the network core and the limitations usually lie in the so-called last mile of the network (the last link or the last few links before the end node is reached), the limiting network capacity for each depot and processor can usually be found in the last mile, or more specifically the last link, which is available entirely for traffic from/to the depot. Therefore the depots and processors cannot block each other, as discussed below, and thus can be utilized to the maximum extent.

Theorem 8.4 The 1 to n storage scheduling is NPO-complete if depots and processors are interconnected via common graph.
PROOF The proof uses the same reduction as the proof of Theorem 8.2, while network conditions restrict the use of replicas. So there may exist a depot that is not utilized to the maximum, while there may also be a processor that is not utilized to the maximum extent, and using more replicas does not improve performance any more.

8.3 Prototype Implementation

8.3.1 Technical Background

Instead of building our own Grid infrastructure for testing, development and pilot applications, we have decided to use the powerful Grid infrastructure made available

in the Czech Republic by the META Center project [78]. The META Center project was enhanced during 2003 by a new project called Distributed Data Storage (DiDaS) [30], incorporating new distributed storage based on the Internet Backplane Protocol (IBP) [5]. Such a storage infrastructure can be efficiently used for the implementation and deployment of the DEE system. However, the scheduling system in use is not capable of scheduling jobs with respect to the location of the data, and there is neither data location optimization nor prefetch functionality. As our data-intensive application requires at least some of these for optimal performance, we had to enhance the underlying infrastructure.

As a part of the Grid infrastructure, the META Center provides large computational power in the form of IA32 Linux PC clusters that are being rapidly expanded each year because of the cost-efficiency of this solution. As follows from the discussion in Section 8.2, we need some globally accessible distributed data storage for transient storage of source, intermediate, and target data that provides high enough performance to supply data to processing and that supports data replicas. The filesystem shared across these PC clusters is based either on a rather slow globally accessible AFS filesystem supporting read-only, administrator-controlled replicas (several terabytes of storage are available), or on a somewhat faster site-local NFS, which doesn't support data replicas, has its own problems such as broken support for sharing files larger than 2 GB, and has a capacity on the order of only a few tens of gigabytes. Therefore we need a different means of storage for processing large volumes of data.

The IBP uses a soft-consistency model with time-limited allocation, thus operating in best-effort mode. A basic atomic unit of the IBP is a byte array, providing an abstraction independent of the physical device the data is stored on.
An IBP depot (server), which provides a set of byte arrays for storage, is the basic building block of the IBP infrastructure offering disk capacity. By mid-2004 the IBP data depots were present in all cluster locations as well as distributed across other locations in the Czech academic network, providing a total capacity of over 4 TB.

The PBSPro [88] scheduling system is used for job scheduling across the whole META Center cluster infrastructure. The PBSPro supports queue-based scheduling as well as properties that can be used for constraining where a job may be run based on user requirements. These properties are static and defined on a per-node basis. Under ideal circumstances the PBSPro is capable of advance reservations and of estimating the time when a specified node is available for scheduling new jobs, unless a priority job is submitted. The latter feature requires cooperation from the users that submit their jobs, as the PBSPro needs an estimate of the processing time provided by the job owner for each job; otherwise the maximum time for the specified queue is used, which results in an unrealistic estimate of when a specified processing node will be available.

For video processing we use the transcode tool [55]. As discussed in Section 7.2.2, this tool is unable to directly produce the RealMedia format, which is one of the few streaming formats with strong multi-platform support and which is also the format of choice for our pilot applications, the CESNET video archive and the Masaryk University lecture archive. Therefore we also need to use Helix Producer [72] to create the required target format.
The Helix Producer needs raw video with PCM sound as its input file, and as this is rarely the format of the input video, we use transcode to pre-process the data for the Helix Producer.

Architecture

As shown in Figure 8.6, the architecture of the Distributed Encoding Environment comprises several components and component groups:

User interface: The basic functionality of the user interface is to let the user provide input information on the job. The basic information is usually the input file (media), job

chunk size, target file/media, and target format and its parameters. If multiple processing infrastructures are available, the user may also select which one will be used for processing. When the job monitoring module is available, the user interface may provide visualization of the job progress and inform the user about problems encountered.

FIGURE 8.6: Distributed Encoding Environment architecture and components. [Diagram blocks: User Interface; Job Preparation; Local Scheduler Interfaces (Job Submission, Job Monitoring); Job Logging; Media Processing Tools (Media Analyzer, Media Splitter, Media Processor, Media Merger); Distributed Storage.]

Job preparation: This module steers the preparation of the parallel job. It invokes the media analyzer to find the source format and its parameters, splits the job into chunks automatically or based on the user specification if provided, prepares job files, and invokes the job submission procedure to pass the jobs to the local job scheduling system on the computing facilities.

Local scheduler interface: The local scheduler interface group provides one mandatory and one optional module: a mandatory job submission interface to send the jobs for computation on the computing facilities, and an optional job monitoring interface so that the user can monitor the overall job status using the user interface. Should the system support the proposed scheduling model (Section 8.3.4), the job submission interface should provide not only write-only access for job submission, but it should also be capable of reporting when a specified resource will be available for scheduling according to the local scheduler, $t_p^{\text{sched\_free}}$ (with all the limitations discussed above).

Job logging facility: The job logging facility takes care of keeping permanent track of the job status, results, and especially error conditions. This component can be used only if the job monitoring interface is used.
Media processing tools: The media processing group contains four modules: the media analyzer for source media/format analysis, the media splitter for splitting the source media into the job chunks, the media processor, which is the actual processing tool, and the media merger for merging the resulting media chunks.

Distributed storage: Any distributed storage system that the media processing tools can interact with. It is desirable to have a system that can use data replicas for optimizing

performance and that supports data location hinting, so that the user is able to specify the data location and the advanced functionality of the scheduling model can be utilized.

The workflow in this architecture is as follows: a user specifies his jobs using the user interface. The job preparation module analyzes the source data using the media analyzer, and if an unsupported format is found, the user is notified and the processing is terminated. When a supported format is found, the source media is split into smaller chunks based either on the chunk size or the degree of parallelism specified by the user, or even using predefined split points provided by the user. The splitting can be done through the job submission interface or locally; local splitting is sometimes desired since it avoids the job enqueuing latency, which can be quite long on heavily loaded systems. The job preparation module then creates a job specification for each individual media chunk and sends it via the job submission interface to the local job submission system on the computing resource, using the scheduling mechanism described in Section 8.3.4. It also prepares the last job, which merges the resulting chunks into the target file or media. This job is executed only if all the chunk-processing jobs finish successfully. If the job monitoring interface is present, all the jobs are monitored throughout their lifetime and the results are gathered by the user interface. Also, if any error situation is found, the user is notified both via the user interface and via the job logging facility.

Security Considerations. The Grid environment provides strong security via the Grid Security Infrastructure (GSI) [22] based on X.509 certificates. Other projects have developed enhanced security architectures based on GSI, and each computing Grid usually has some security infrastructure for authentication, authorization, and accounting (AAA) readily available.
Because DEE is designed for the Grid environment, it relies on the Grid infrastructure for the AAA functionality.

Access to IBP Infrastructure

A general-purpose abstraction library called libxio [30] has been developed that provides an interface closely resembling the standard UN*X I/O interface. It allows developers to easily add IBP capabilities to their applications, providing access to both local files and files stored in the IBP infrastructure represented by an IBP URI. The IBP URI has the following format:

    lors://host:port/local_path/file?bs=number&duration=number \
        &copies=number&threads=number&timeout=number&servers=number \
        &size=number

where the host parameter specifies the L-Bone server (IBP directory server) to be used, the port specifies the L-Bone server port (default is 6767), the bs specifies the block size for transfer in megabytes (default value is 10 MB), the duration specifies the allocation duration in seconds (default is 3600 s), the requested number of replicas is specified by the copies (defaulting to 1), the threads specifies the number of threads (concurrent TCP streams) to be used (default is 1), the timeout parameter specifies the timeout in seconds (defaulting to 100 s), the servers parameter specifies the number of different IBP depots to be used (default is 1), and the size specifies the projected size of the file to ensure that the IBP depot has enough free storage. It is also possible to override the default values using environment variables. If the given filename doesn't start with the lors:// prefix, the local_path/file is accessed as a local file instead.

When writing a file into the IBP infrastructure, the local_path/file specifies the local file where a serialized XML representation of the file in IBP will be stored. At least an L-Bone server must be specified when writing a file into IBP.
In our experience the file with the serialized representation (meta-data) occupies approximately 1/10th of the actual data size in IBP on average, but this varies to a large extent depending on the block size.
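To make the URI structure concrete, the following Python sketch splits an IBP URI of the form described above into its parts using only the standard library. It is not the libxio parser. The default values are this sketch's reading of the description above and should be treated as assumptions, and the host name in the usage example is made up.

```python
from urllib.parse import parse_qs, urlsplit

# Assumed defaults, as read from the parameter description above.
DEFAULT_PORT = 6767
DEFAULTS = {"bs": 10, "duration": 3600, "copies": 1,
            "threads": 1, "timeout": 100, "servers": 1}


def parse_ibp_uri(uri: str) -> dict:
    """Split an IBP URI into L-Bone server, local path and parameters."""
    if not uri.startswith("lors://"):
        # Anything without the lors:// prefix is treated as a local file.
        return {"local": True, "path": uri}
    parts = urlsplit(uri)
    params = dict(DEFAULTS)
    for key, values in parse_qs(parts.query).items():
        params[key] = int(values[-1])       # all parameters are numeric
    host = parts.hostname                   # None for the short form lors:///...
    port = parts.port if parts.port is not None else (DEFAULT_PORT if host else None)
    return {"local": False, "host": host, "port": port,
            "path": parts.path, "params": params}


if __name__ == "__main__":
    print(parse_ibp_uri("lors://lbone.example.org/data/video.xml?bs=4&copies=2"))
    print(parse_ibp_uri("lors:///data/video.xml"))
    print(parse_ibp_uri("/tmp/video.avi"))
```

The short form without a host yields no server information, which matches the case where the serving depots are taken from the local XML representation.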

When a file stored in IBP is read, the local_path/file specifies the local file containing a serialized XML representation of the IBP file. The user can also use the short form URI lors:///local_path/file, as the serving depots are already specified in the local XML representation.

The transcode program has been modified so that it can load files from and store files to IBP depots. As transcode has very modular internals and some file operations are implemented inside libraries for certain file formats, it is necessary to patch such libraries as well. Currently we have patched the libquicktime [77] library, as QuickTime MOV files are common products of editing software (e.g., AVID Xpress provides fast codecs for producing DV video wrapped in a QuickTime envelope, which can be processed by transcode when libquicktime is modified to recognize its FourCC identification as the common DV format to be processed using the libdv library [76]).

Scheduling Model

For the DEE prototype implementation, we have implemented a version of the N to 1 scheduling algorithm, shown in Figure 8.7, based on the processor and storage scheduling analyzed in Section 8.2. The prototype implementation neglects the uploading overhead, as the prototype pilot applications encode from large volumes of data to significantly (typically at least one order of magnitude) smaller data. If it is impossible to saturate the processor with the available data replicas, it also supports the prefetch functionality.

     1  foreach u ∈ U do
     2      D_u := ∅;
     3  od
     4  foreach d ∈ D do
     5      P_d := ∅;
     6  od
     7  foreach p ∈ P do
     8      t_p^sched_free := sched_free(p);
     9      /* sched_free(p) returns estimate when processor p becomes free. */
    10      ProcTime(p) := l_u / min{s_{p,u}, b_{D,p}(t_p^sched_free)};
    11      /* assuming uniform size tasks u ∈ U */
    12  od
    13  foreach u ∈ U do
    14      p := p such that ∀p' ∈ P. t_p^sched_free + ProcTime(p) ≤ t_p'^sched_free + ProcTime(p');
    15      while (s_{p,u} > b_{D_u,p} ∧ D \ D_u ≠ ∅) do
    16          d := d such that ∀d' ∈ D. b_{d,p}(t_p^sched_free) ≥ b_{d',p}(t_p^sched_free);
    17          sched_depot(u, d);   /* P_d := P_d ∪ {u}; D_u := D_u ∪ {d}; */
    18      od
    19      if s_{p,u} > b_{D_u,p} ∧ D \ D_u = ∅
    20          then prefetch(p, u); fi
    21      sched_job(p, u);
    22      t_p^sched_free := sched_free(p);
    23  od

FIGURE 8.7: Simplified job scheduling algorithm with multiple storage depots per processor used for downloading (i.e. N to 1 data transfer) and neglecting the uploading overhead.
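The overall loop of Figure 8.7 can be simulated in a few lines of Python. The sketch below is a simplification under stated assumptions, not the DEE implementation: bandwidths are constant and additive (complete graph), the scheduler query sched_free(p) is replaced by a locally updated table, sched_depot does not consume depot capacity, and prefetch is only reported as a flag. All names are hypothetical.

```python
from typing import Dict, List, Tuple

Plan = List[Tuple[str, str, List[str], bool]]  # (task, processor, depots, prefetch?)


def plan_jobs(tasks: List[str], speeds: Dict[str, float],
              bandwidth: Dict[Tuple[str, str], float], length: float) -> Plan:
    """Choose a processor and depots per task and flag tasks needing prefetch."""
    depots = sorted({d for d, _ in bandwidth})
    free_at = {p: 0.0 for p in speeds}      # stand-in for sched_free(p)

    def run_time(p: str, total_bw: float) -> float:
        # Processing time is bounded by the slower of CPU and data feed.
        return length / min(speeds[p], total_bw) if total_bw > 0 else float("inf")

    def proc_time(p: str) -> float:
        return run_time(p, sum(bandwidth.get((d, p), 0.0) for d in depots))

    plan: Plan = []
    for u in tasks:
        # Line 14: processor with the earliest estimated completion.
        p = min(speeds, key=lambda q: free_at[q] + proc_time(q))
        chosen: List[str] = []
        total = 0.0
        # Lines 15-18: add the fastest depots until the processor is saturated.
        for d in sorted(depots, key=lambda x: bandwidth.get((x, p), 0.0), reverse=True):
            if total >= speeds[p]:
                break
            chosen.append(d)
            total += bandwidth.get((d, p), 0.0)
        needs_prefetch = total < speeds[p]  # lines 19-20
        plan.append((u, p, chosen, needs_prefetch))
        free_at[p] += run_time(p, total)    # simulated effect of sched_job
    return plan


if __name__ == "__main__":
    bw = {("d1", "p1"): 60.0, ("d2", "p1"): 30.0,
          ("d1", "p2"): 40.0, ("d2", "p2"): 40.0}
    for entry in plan_jobs(["u1", "u2", "u3"], {"p1": 100.0, "p2": 50.0}, bw, 600.0):
        print(entry)
```

In this example processor p1 cannot be saturated even by both depots, so its tasks are flagged for prefetch, while p2 is fully fed by the two depots.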

Distributed Encoding Environment

An input video (typically in DV format with an AVI or QuickTime envelope produced by AVID or Adobe video editing software) is first uploaded into the IBP infrastructure from the source (editor's) workstation. The video is then taken from IBP by the IBP-enabled transcode, re-multiplexed to ensure proper interleaving of audio and video streams, which is a necessary prerequisite for correct splitting, split into smaller chunks which are uploaded directly back to IBP, and encoded on many cluster worker nodes in parallel (see Figure 8.8). At this stage the DEE system uses the PBSPro system to submit the parallel jobs on worker nodes.

FIGURE 8.8: Example DEE workflow for transcoding from video in DV to RealMedia format. [Diagram stages: editing computer (DV upload to IBP); PC cluster, single node (DV download, DV remux, DV split, DV chunks upload); PC cluster, many nodes (DV chunks download, transcoding to raw AVI, encoding to RealMedia, RM chunks upload); PC cluster, single node (RM chunks download, joining RM chunks, RM upload, RM chunks removal); streaming server (RM upload).]

The processing phase is somewhat more complicated when the target format is RealMedia, the primary format for our pilot applications. During the parallel processing phase, each video chunk is first transcoded to raw video with PCM sound, since this format is required by the Linux version of Helix Producer. All required transformations are performed at this step, typically including high-quality deinterlacing and resizing, audio resampling and optionally audio normalization, and possibly noise reduction (if the original file is very

noisy due to a low light level or due to the low-quality camera used). This parallel phase uses storage capacity local to each worker node to create and process the raw video file. The raw file is then fed into the Helix Producer and the resulting chunk of RealMedia video is stored back into the IBP. As a final step, the RealMedia chunks are joined, the complete file is stored in the IBP, and the individual RealMedia chunks are removed.

When the Helix Producer is replaced by various transcode output modules, or when the raw video is piped into another encoding program, the DEE system can also be used for producing other video formats: DivX version 4 and 5, MPEG-1, MPEG-2, etc. Also, when no intermediate (raw) file is needed in the encoding process, the system can directly transcode the data from the IBP while simultaneously uploading the results back into the IBP.

Since the PBSPro scheduling system has no direct support for dynamic properties such as the location of files in the IBP infrastructure, and there is currently also no network traffic prediction service available in the META Center infrastructure, we have defined some static properties for the computing nodes that allow us to assess the proximity of the computing nodes to the IBP storage depots. Computing nodes which have the same processing characteristics and share the same network connection are given the same static attributes. We measure a static average estimate of the bandwidth available between each such set of nodes and each storage depot. The DEE system then uses these properties and the locations of the files to be transcoded, and gives hints to PBSPro where the jobs should be run according to the algorithm shown in Figure 8.7.
It can also initiate replication of the data in the IBP infrastructure when the processing power of the node is higher than the network capacity from the IBP depots that keep the data.

Performance Evaluation

It follows from the workflow that the scalability of the distributed processing is limited by the following factors:

- processing overhead comprising re-multiplexing, splitting, and final merging,
- job startup latency due to the job scheduling and submission system used,
- minimum job chunk size for the parallel processing.

While the first two factors will be discussed in the details of the evaluation below, it is worth noting several points about the minimum job chunk size for the parallel processing phase, which depends mainly on the input media format. The main problem is that for many common formats used for multimedia distribution over the network, not all the video frames are independent. The independent frames, called I-frames (or keyframes), are often followed by several P-frames, which describe differences to I-frames, and B-frames, describing differences to two P-frames. As the P- and B-frames are meaningless without the I-frames, the chunk size is limited by the maximum I-frame distance in the source format, which can comprise even several hundred frames. Fortunately, the common source formats used for editing and post-production, like DV, use at most inter-field compression and no inter-frame compression (thus all the frames are I-frames), so that all the frames are independent and can be used as split points. This however doesn't hold for media in MPEG-4 based formats like DivX, XviD, Windows Media files etc.

The evaluation used the following transformations:

DV to RealMedia with remultiplexing: For processing, we used DV data with an AVI envelope, approximately 1 GB in size, which corresponds to 4:36 of PAL video time-line at 25 frames per second (roughly 6,900 DV frames).
The DV audio and video data is not properly interleaved and thus re-multiplexing is required before splitting the video into chunks. The data was stored in a single copy in the DiDaS IBP infrastructure relying on automatic distribution of the data across the IBP. We also deployed storage

optimization, and it turned out that the storage location in IBP and the networking are not the bottleneck of the processing, as the performance evaluation results are very close. During the processing, the video was down-scaled⁶ using an algorithm with Lanczos convolution⁷, de-interlaced using a high-quality cubic blend de-interlace filter, the audio was re-sampled from 48 kHz to 44 kHz, and the result was finally converted through the raw format to RealMedia as shown in Figure 8.8.

DV to RealMedia without remultiplexing: The same source video, filters, target format, and IBP storage have been used as above, but this time the source video was properly multiplexed and thus no remultiplexing was needed before splitting.

The evaluation was performed in two environments: a dedicated infrastructure to allow isolation of the experiment, and a shared infrastructure to verify the behavior on a real-world Grid computing infrastructure.

Evaluation Using Dedicated Infrastructure

For the evaluation of DEE performance, we used 4 dual-processor nodes that are part of the META Center infrastructure and that were dedicated for testing. The configuration of the nodes is summarized in Table 8.1.

    Working nodes configuration
    Processor                      2 × Pentium 3.0 GHz
    RAM                            2 GB + 2 GB swap
    Maximum observed cache size    1.8 GB
    OS kernel                      Linux SMP
    Shared FS                      NFSv3
    Local scratch disk             Software RAID 0 on 2 disks
    Local scratch disk FS          XFS
    NIC                            Intel PRO/1000
    NIC status                     1 Gbps full-duplex

TABLE 8.1: Dedicated testbed configuration.

For job scheduling, the META Center PBSPro system was used with a dedicated job queue which was sending jobs to the dedicated nodes only, so the jobs of other users didn't interfere with the evaluation. The overall results of performance acceleration while changing the degree of parallelism are shown in Figure 8.9.
Detailed execution profiles for different parallelization degrees without remultiplexing are shown in Figure A.2 on page 95 in Appendix A, while the corresponding profiles with remultiplexing are shown in Figure A.3 on page 96. It follows from the plot that there is some unavoidable processing overhead, which is discussed in the detailed analysis below. It also turns out that the process with a degree of parallelism equal to one performs better than expected. This is because the dual-processor nodes were always fully occupied at higher degrees of parallelism, whereas at this degree the processing phase has the whole node to itself and the internal threads of the transcode process can actually utilize more than one processor⁸. The fitted curve also shows the average latency of the processing to be approximately 3 minutes with remultiplexing and 2.5 minutes without it.

FIGURE 8.9: Acceleration of DEE performance with respect to degree of parallelism. Results for the set without remultiplexing are shown in green, while the set with remultiplexing is shown in red. 95% confidence intervals are shown for both sets.
[Plot "DEE Parallelization Performance": time [min] versus degree of parallelism. Legend: with remux, without remux; fitted curves y = /(0.0498*x) (RMS = 0.030) and y = /(0.0520*x) (RMS = 0.040).]

The detailed profiles reveal all the phases described in Figure 8.8. The file needs to be re-multiplexed first, as the input file doesn't have the proper video and audio interleaving needed for the splitting phase. Immediately afterwards it is split into chunks, which are uploaded back into IBP, and the individual processing jobs are spawned. Then there is a latency induced by the PBSPro scheduler, during which no jobs run until PBSPro starts them. The distributed parallel phase follows, in which the number of processes reaches the desired degree of parallelism. The obvious step at the startup of this phase for 8 processes occurs because PBSPro started the first 4 jobs at once and the other 4 jobs only after some delay, even though all the computing resources were available and idle. In the final phase, separated again by the PBSPro scheduling latency, the merging job is run. The overhead of the first step can be reduced by using a properly interleaved source media file; re-multiplexing can then be avoided, which approximately halves the initial phase.

Evaluation Using Shared Production Infrastructure

The same evaluation process using the same input data has also been performed on the META Center infrastructure shared by all users. The performance profile for 8 parallel processes is shown in Figure A. in Appendix A. It turns out to be similar to the profile on the dedicated infrastructure except for two things. First, the processing takes longer, since the scheduling algorithm chose less powerful processing nodes because the more powerful ones were busy. Second, the parallel phase of the computation ends in a more step-wise fashion, as there were pronounced differences among the individual processing nodes and the calculation therefore took different times on different nodes.

⁶ Conversion from to also changes the aspect ratio, as a 4:3 ratio is needed for proper viewing on a PC monitor with square pixels, while the original video with a 5:4 aspect ratio is designed for displaying on TV sets with rectangular pixels.

⁷ More details on sinc-based Lanczos filter functions can be found e.g. in turk/computergraphics/resamplingfilters.pdf

⁸ The transcode process comprises at least two computationally intensive threads: the first, called tcdecode, reads the multimedia stream from the medium and decodes it to the raw format for further processing by the second thread, which processes the resulting raw video with the specified filters. Usually there is also a third thread, which encodes the video into the target format, but in our case it is just a pass-through since the output is raw video. The second thread can even be split into multiple threads that share the input using a frame buffer when filters that support concurrency are applied (so-called _M_ thread filters). Thus, on an SMP machine with one transcode process, it is common to see one of the transcode threads consume 99% CPU and tcdecode consume approximately another 30% CPU in the parallel phase. With an estimate of 3 minutes for the splitting and merging phases, the measured value with a single process should be approximately 23 minutes, which agrees with the fitted curve in Figure 8.9.
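The relation between the fitted curves of Figure 8.9 and the figures quoted in the text can be checked with a short calculation. The sketch below is only an illustration under stated assumptions. It assumes that each fitted curve has the form y = c + 1/(a*x), where c is the constant overhead in minutes (3 with remultiplexing and 2.5 without, as quoted in the text) and a is the coefficient from the plot legend (0.0498 and 0.0520, assumed here to belong to the runs with and without remultiplexing, respectively). The leading constants of the fitted curves are not given in the legend, so the functional form is an assumption, and all function and variable names are hypothetical.

```python
def predicted_minutes(parallelism, overhead_min, rate_per_min):
    """Assumed model: constant overhead plus perfectly divisible parallel work.

    overhead_min  -- splitting, scheduling and merging overhead in minutes
    rate_per_min  -- fitted coefficient a in y = c + 1/(a*x)
    """
    if parallelism < 1:
        raise ValueError("degree of parallelism must be at least 1")
    return overhead_min + 1.0 / (rate_per_min * parallelism)


# Assumed pairing of the values quoted in the text and in the plot legend.
WITH_REMUX = (3.0, 0.0498)
WITHOUT_REMUX = (2.5, 0.0520)


def prediction_table(degrees=(1, 2, 4, 8)):
    """Return (degree, minutes with remux, minutes without remux) tuples."""
    return [
        (p, predicted_minutes(p, *WITH_REMUX), predicted_minutes(p, *WITHOUT_REMUX))
        for p in degrees
    ]


if __name__ == "__main__":
    for degree, with_remux, without_remux in prediction_table():
        print(f"{degree:2d} processes: {with_remux:5.1f} min with remux, "
              f"{without_remux:5.1f} min without")
```

Under these assumptions the model gives about 23.1 minutes for a single process with remultiplexing, in line with the approximately 23 minutes mentioned in footnote 8, and about 5.5 minutes for 8 processes. The constant overhead therefore limits the achievable acceleration at 8 processes to roughly a factor of 4.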

Chapter 9 Pilot Applications for Asynchronous Processing

A pilot user group from the Faculty of Informatics, Masaryk University, has been using the prototype implementation of the Distributed Encoding Environment (DEE) described in Chapter 8. The group has been providing recordings of many lectures at the faculty for more than two years [34, 35]. The integration of the DEE into the processing workflow has been published in [36, 4].

At the beginning of 2004, the Faculty gained three lecture halls that are fully equipped for automatic lecture recording as a part of its e-learning efforts. The goal of this activity is to create a system that is as automatic, autonomous, and unmanned as possible. For scalability reasons, the DEE system was planned and integrated as the primary tool of the processing workflow for converting video from the source DV format to target formats suitable for streaming and downloading over the Internet.

A simplified scheme of the architecture of the automatic lecture recording is shown in Figure 9.. The lectures are recorded using high-quality digital cameras. However, as the signal needs to be distributed over quite long distances (up to 00 meters), analogue signal transmission is used. The signal is then converted to DV format using Canopus ADVC-00 converters. This turned out to be a better solution than using the digital video output directly from the camera for yet another reason: the audio needs to be taken from the audio circuits in the lecture halls (i.e. not from the cameras), and the Canopus converters are capable of producing the so-called locked-audio DV format, which maintains strict synchronization of audio and video on both short and long time scales. This approach avoids lip-synchronization problems, which viewers find very disturbing. The Canopus converters are connected to ordinary PCs with medium-sized disk capacity, which first store each lecture on their local disks.
When the entire lecture is recorded, it is uploaded into the IBP infrastructure, verified, deleted from the local disks of those PCs, and passed on for further processing by the DEE system. After the distributed processing finishes, the resulting data are copied from IBP to the streaming and downloading infrastructure.

There are two basic formats this group uses for distributing lecture recordings over the Internet: RealMedia for streaming and DivX for downloading. The RealMedia format is used for streaming in three quality levels. The first is for students with low-bandwidth connectivity and low processing power, resulting in files with quarter PAL resolution ( pixels) at 5 frames per second (fps) with supported bit-rates ranging from 56 to 768 kbps. The second version is audio only, either for students with extremely low-bandwidth connectivity or for visually impaired students. The third format is for clients with high bandwidth and processing power available; the resulting video has full PAL resolution and frame rate ( pixels at 25 fps) with a bit-rate slightly below 3 Mbps.
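To put the quoted streaming bit-rates into perspective for storage planning, the following minimal sketch converts a constant bit-rate into the approximate size of one hour of recording. It is an illustration only: it assumes a constant bit-rate, ignores container overhead, and counts 1 MB as 10^6 bytes. The function name is hypothetical, and 3000 kbps is used as an upper bound for the "slightly below 3 Mbps" variant.

```python
def megabytes_per_hour(bitrate_kbps):
    """Approximate size of one hour of a constant bit-rate stream in MB (10^6 bytes)."""
    if bitrate_kbps < 0:
        raise ValueError("bit-rate must be non-negative")
    bits_per_hour = bitrate_kbps * 1000 * 3600
    return bits_per_hour / 8 / 1_000_000


if __name__ == "__main__":
    # 56 and 768 kbps bound the low-quality variant; 3000 kbps is an
    # assumed upper bound for the full PAL variant.
    for kbps in (56, 768, 3000):
        print(f"{kbps:5d} kbps -> {megabytes_per_hour(kbps):7.1f} MB per hour")
```

With these assumptions, one hour of lecture takes about 25 MB at 56 kbps, about 346 MB at 768 kbps, and at most about 1350 MB for the full PAL variant.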

For downloading, DivX format files are produced with a bit-rate chosen so that one lecture fits on one CD, allowing students to burn the lectures to CDs and watch them off-line.

FIGURE 9.: Scheme of the lecture recording and processing workflow.
[Diagram components: cameras, Canopus ADVC-00 converters, caching PCs, IBP storage, DEE processing cluster, streaming and downloading server(s).]

FIGURE 9.2: Interface to video lecture archives and example recording played from the streaming server.

This pilot group usually needs to process more than 80 hours of video per week to produce files for both streaming and downloading, which results in approximately TB of recordings per year. The system is also expected to be used at other faculties of the university, and thus both the theoretical and the practical scalability of the processing tools is very important. The user interface to the lecture recording archive, together with an example recorded lecture, is shown in Figure 9.2.

Since 2004, all the formats have been transcoded using the prototype implementation described in Section 8.3. Because of the amount of computing power needed, this group has a dedicated processing cluster comprising 5 IA32 PCs. This cluster was also used for the experimental evaluation of DEE prototype performance on dedicated infrastructure in Section