!"$#% &' ($)+*,#% *.- Kun Fu a and Roger Zimmermann a a Integrated Media Systems Center, University of Southern California Los Angeles, CA, USA 90089-56 [kunfu, rzimmerm]@usc.edu ABSTRACT Presently, IP-networked real-time streaming media storage has become increasingly common as an integral part of many applications. In recent years, a considerable amount of research has focused on the scalability issues in storage systems. Random placement of data blocks has been proven to be an effective approach to balance heterogeneous workload in a multi-disk environments. However, the main disadvantage of this technique is that statistical variation can still result in short term load imbalances in disk utilization, which in turn, cause large variances in latencies. In this paper, we propose a packet level randomization (PLR) technique to solve this challenge. We quantify the exact performance trade-off between our PLR approach and the traditional block level randomization (BLR) technique through analytical analysis. Our preliminary results show that the PLR technique outperforms the BLR approach and achieves much better load balancing in multi-disk storage systems. Keywords: Randomized load balancing, scalable storage system, continuous media server. INTRODUCTION Large scale digital continuous media (CM) servers are currently being deployed for a number of different applications. For example, cable television providers now offer video on demand services in all the major markets in the United States. A large library of full length feature movies and television shows requires a vast amount of storage space and transmission bandwidth. Magnetic disk drives are usually the storage devices of choice for such streaming servers and they are generally aggregated into arrays to enable support for many concurrent users. 
Multi-disk continuous media server designs can largely be classified into two different paradigms: (1) data blocks are striped in a round-robin manner across the disks and blocks are retrieved in cycles or rounds on behalf of all streams; or (2) data blocks are placed randomly across all disks and the data retrieval is based on a deadline for each block. The first paradigm attempts to guarantee the retrieval or storage of all data. It is often referred to as deterministic. With the second paradigm, by its very nature of randomly assigning blocks to disks, no absolute guarantees can be made. For example, a disk may briefly be overloaded, resulting in one or more missed deadlines. This approach is often called statistical. One might at first be tempted to declare the deterministic approach intuitively superior. However, the statistical approach has many advantages. For example, the resource utilization achieved can be much higher because the deterministic approach must use worst case values for all parameters such as seek times, disk transfer rates, and stream data rates, whereas the statistical approach may use average values. Moreover, the statistical approach can be implemented on widely available platforms such as Windows or Linux which do not provide hard real time guarantees. It also lends itself very naturally to supporting a variety of different media types that require different data rates (both constant (CBR) and variable (VBR)) as well as interactive functions such as pause, fast-forward and fast-rewind. Moreover, it has been shown that the performance of a system based on the statistical method is on par with that of a deterministic system. Finally, a recent study argues that the statistical approach can support on-line data reorganization much more efficiently than the deterministic method.
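The two placement paradigms can be contrasted with a small sketch. This is a hypothetical illustration (the disk and block counts are arbitrary, and the function names are ours), not code from any of the systems discussed:

```python
import random

def round_robin_placement(num_blocks, num_disks):
    """Deterministic paradigm: block i always resides on disk i mod D,
    so retrieval can proceed in fixed cycles or rounds."""
    return [i % num_disks for i in range(num_blocks)]

def random_placement(num_blocks, num_disks, seed=42):
    """Statistical paradigm: each block is assigned to a disk drawn
    uniformly at random; retrieval is then driven by per-block deadlines."""
    rng = random.Random(seed)
    return [rng.randrange(num_disks) for _ in range(num_blocks)]

print(round_robin_placement(8, 4))  # [0, 1, 2, 3, 0, 1, 2, 3]
print(random_placement(8, 4))
```

With round-robin placement the per-disk load is balanced by construction; with random placement it is balanced only in expectation, which is exactly the source of the short-term fluctuations examined below.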
Note that efficient on-line data reorganization is crucial for a scalable storage system, since data rearrangement is inevitable due to disk addition and removal operations. Even though the random/deadline-driven design is very resource efficient and ensures a balanced load across all disk devices over the long term, there is a non-negligible chance that retrieval deadlines are violated. Short term load fluctuations occur because occasionally consecutive data blocks may be assigned to the same disk drive by the random location generator. During the playback of such a media file, the block retrieval request queue may temporarily hold too many requests such that not all of them can be served by their required deadline. In this report we introduce a novel packet-level randomization (PLR) technique which significantly reduces the occurrence of deadline violations with random data placement.

This research has been funded in part by NSF grants EEC-9595 (IMSC ERC), IIS-00886 and CMS-096, and unrestricted cash/equipment gifts from Intel, Hewlett-Packard and the Lord Foundation.

The remainder of this paper is organized as follows. Section 2 relates our work to prior research. Section 3 presents our proposed design approach. The performance analysis is contained in Section 4. Finally, Section 5 outlines our conclusions and future plans.

2. RELATED WORK

There are three basic mechanisms to achieve load balancing in striped multi-disk multimedia storage systems. One approach is to use large stripes that access many consecutive blocks from all disks in a single request for each active stream. This approach provides perfect load balancing because it guarantees that the number of disk I/Os is the same on all the devices. However, it results in extremely large data requests when the number of disks is large. Furthermore, it does not efficiently support unpredictable access patterns. The second approach is to use small requests accessing just one block on a single disk, with sequential requests cycling over all the disks. This technique does not support unpredictable access patterns well either. The third technique randomly allocates data blocks to disks, and therefore supports unpredictable workloads efficiently. With the third approach, it has been proposed to counteract the short term load imbalance problem with random data replication, which requires more space and adds complexity to the design and implementation. To our knowledge, no previous work has identified and quantified the trade-off between randomization at the packet and block levels when fine-grained load balancing is desired or required.
3. DESIGN APPROACH

We assume a multi-disk, multi-node streaming media server cluster design similar to the ones introduced in our previous research activities. Our first prototype implementation, called Yima, was a scalable real-time streaming architecture to support applications such as video-on-demand and distance learning on a large scale. Our current generation system, termed the High-performance Data Recording Architecture (HYDRA), improves and extends Yima with real time recording capabilities. In the HYDRA architecture each cluster node is an off-the-shelf personal computer with attached storage devices and, for example, a Fast Ethernet connection. The HYDRA server software manages the storage and network resources to provide real-time service to the various clients that are requesting media streams. For load-balancing purposes, without requiring data replication, a multimedia object is commonly striped into blocks across the disk drives that form the storage system. Because of its many advantages, we assume a random assignment of blocks to the disk drives.

3.1. Packets versus Blocks

Packet-switched networks such as the Internet optimally transmit relatively small quanta of data per packet (at most about 1,500 bytes on standard Ethernet links). On the other hand, magnetic disk drives operate very inefficiently when data is read or written in small amounts. This is due to the fact that disk drives are mechanical devices that require a transceiver head to be positioned in the correct location over a spinning platter before any data can be transferred. The seek time and rotational latency, even though necessary, are wasteful. Figure 2 shows the relative overhead experienced with a current generation disk drive (Seagate Cheetah X5) as a function of the retrieval block size. The overhead was calculated based on the seek time needed to traverse half of the disk's surface plus the average rotational latency.
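The overhead computation just described can be sketched as follows. The seek time, spindle speed, and transfer rate below are illustrative placeholders, not the measured Cheetah parameters:

```python
def overhead_pct(block_size_bytes,
                 seek_ms=6.0,          # assumed half-stroke seek time
                 rpm=15000,            # assumed spindle speed
                 transfer_mb_s=50.0):  # assumed sustained transfer rate
    """Seek + rotational latency as a percentage of total retrieval time."""
    rotational_ms = 0.5 * 60_000 / rpm                    # average rotational latency
    transfer_ms = block_size_bytes / (transfer_mb_s * 1e6) * 1000
    positioning_ms = seek_ms + rotational_ms
    return 100.0 * positioning_ms / (positioning_ms + transfer_ms)

for kb in (64, 256, 1024, 4096):
    print(f"{kb:5d} KB -> {overhead_pct(kb * 1024):5.1f}% overhead")
```

Because the positioning cost is fixed per request while the transfer time grows with the block size, the overhead percentage falls monotonically as blocks get larger, which is the shape of the curve in Figure 2.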
As illustrated, only large block sizes beyond one or two megabytes allow a significant fraction of the maximum bandwidth to be used (this fraction is also called the effective bandwidth). Consequently, media packets need to be aggregated into larger data blocks for efficient storage and retrieval. Traditionally this has been accomplished as follows. Block-Level Randomization (BLR): Media packets are aggregated in sequence into blocks (see Figure 1a). For example, if m packets fit into one block then the data distribution algorithm will place the first m sequential packets into the first block, the next m packets into the second block, and so on. As a result, each block contains sequentially numbered packets. Blocks are then assigned randomly to the available disk drives. We term this technique Block-Level Randomization (BLR). During retrieval, the deadline of the first packet in each block is essentially the retrieval deadline for the whole block. The advantage of BLR is that only one buffer at a time per stream needs to be available in memory across all the storage nodes.
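The BLR aggregation step can be sketched as follows (a minimal illustration with a hypothetical `blr_place` helper; the parameter values are arbitrary):

```python
import random

def blr_place(num_packets, packets_per_block, num_disks, seed=1):
    """Block-Level Randomization: fill blocks with consecutive packets,
    then assign each whole block to a uniformly random disk."""
    rng = random.Random(seed)
    placement = {}                                     # packet id -> disk
    num_blocks = -(-num_packets // packets_per_block)  # ceiling division
    for b in range(num_blocks):
        disk = rng.randrange(num_disks)
        for p in range(b * packets_per_block,
                       min((b + 1) * packets_per_block, num_packets)):
            placement[p] = disk                        # all packets of block b share one disk
    return placement

print(blr_place(num_packets=8, packets_per_block=4, num_disks=3))
```

Note that every packet within a block necessarily lands on the same disk, which is why consecutive data can pile up on one drive under BLR.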
Figure 1. Two different randomization schemes that can be applied in a recording system, e.g., HYDRA: (a) Block Level Randomization (BLR), a randomized block to disk mapping; (b) Packet Level Randomization (PLR), a randomized packet to disk mapping. Note that in this example, each block contains a fixed number of packets.

Figure 2. Overhead in terms of seek time and rotational latency as a percentage of the total retrieval time, which includes the block transfer time (Seagate Cheetah X5).

BLR ensures a balanced load across all disk devices over the long term. However, there is a non-negligible chance that retrieval deadlines are violated. Short term load fluctuations occur because occasionally consecutive data blocks may be assigned to the same disk drive by the random placement algorithm. During the playback of such a media file, the disk retrieval request queue may temporarily hold too many requests such that not all of them can be served by their required deadline. One way to avoid this scenario is to reduce the maximally allowed system utilization. For example, an admission control mechanism may restrict the maximum load to, say, 80%. Hence, the chance for overloads is much reduced. However, this approach wastes approximately 20% of the available resources most of the time. In order to allow high disk utilization while still reducing the probability of hot-spots we propose a novel technique to achieve better load-balancing as follows. Packet-Level Randomization (PLR): With this scheme, each media packet is randomly assigned to one of the storage nodes, where the packets are further collected into blocks (Figure 1b). One advantage of this technique is that during playback data is retrieved randomly from all storage nodes at the granularity of a packet. Therefore, load-balancing is achieved at a very small data granularity.
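The difference in short-term balance between the two schemes can be observed in a small simulation. This is an illustrative sketch with arbitrary parameters (10,000 packets, 8 disks, 100 packets per block), not a measurement of any real server:

```python
import random
import statistics

def per_disk_packet_counts(num_packets, num_disks, packets_per_block, scheme, rng):
    """Count how many packets land on each disk under BLR or PLR."""
    counts = [0] * num_disks
    if scheme == "BLR":
        # Consecutive packets travel together: one random disk per block.
        for start in range(0, num_packets, packets_per_block):
            counts[rng.randrange(num_disks)] += min(packets_per_block,
                                                    num_packets - start)
    else:
        # PLR: every packet is assigned to a disk independently.
        for _ in range(num_packets):
            counts[rng.randrange(num_disks)] += 1
    return counts

def mean_spread(scheme, trials=200):
    """Average standard deviation of the per-disk packet counts."""
    rng = random.Random(7)
    return statistics.mean(
        statistics.pstdev(per_disk_packet_counts(10_000, 8, 100, scheme, rng))
        for _ in range(trials))

print("BLR spread:", round(mean_spread("BLR"), 1))
print("PLR spread:", round(mean_spread("PLR"), 1))
```

The per-disk spread under BLR is roughly an order of magnitude larger than under PLR for these parameters, matching the analysis in the next section.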
The disadvantage is that memory buffers need to be allocated concurrently on all nodes for each stream. However, we believe the benefit of PLR outweighs this disadvantage since the cost of memory is continually decreasing. Therefore, packet-level randomization is a promising technique for high-performance media servers. In the next section we evaluate the load-balancing properties of PLR and BLR in terms of the uniformity of the data distribution on each disk.

4. PERFORMANCE ANALYSIS

With PLR, let the random variable $X_P$ denote the number of packets assigned to a specific disk. As proposed in prior work, the uniformity of the data distribution can be measured by the standard deviation and the coefficient of variation (CV) of $X_P$.
Term    Definition                                                 Units
$D$     Total number of disks                                      -
$S$     Total data size                                            bytes
$S_p$   Packet size                                                bytes
$S_b$   Block size                                                 bytes
$N_b$   Total data size in blocks                                  blocks
$N_p$   Total data size in packets                                 packets
$W_P$   The amount of data assigned to a disk in PLR               bytes
$W_B$   The amount of data assigned to a disk in BLR               bytes
$X_P$   The number of packets assigned to a disk in PLR            packets
$X_B$   The number of blocks assigned to a disk in BLR             blocks
$R$     Ratio of block size and packet size, i.e., $R = S_b/S_p$   -

Table 1. List of terms used repeatedly in this study and their respective definitions.

These two measures are denoted $\sigma_{X_P}$ and $CV_{X_P}$, respectively. By definition, $CV_{X_P}$ can be derived by dividing $\sigma_{X_P}$ by the mean value of $X_P$, $\mu_{X_P}$. If we consider the random assignment of a packet to a disk as a Bernoulli trial, which has probability $1/D$ of successfully allocating the packet to a specific disk, then the total number of successful Bernoulli trials can be naturally mapped to $X_P$. Intuitively, $X_P$ follows a Binomial distribution, where the number of Bernoulli trials is the total number of packets to be assigned, denoted $N_p$. Note that $N_p$ can be computed as $N_p = S/S_p$, where $S$ is the total data size and $S_p$ is the packet size. Therefore, the mean, standard deviation and coefficient of variation of $X_P$ can be obtained as shown in Equation 1:

$$\mu_{X_P} = \frac{N_p}{D}, \quad \sigma_{X_P} = \sqrt{N_p \frac{1}{D}\left(1 - \frac{1}{D}\right)}, \quad CV_{X_P} = \frac{\sigma_{X_P}}{\mu_{X_P}} = \sqrt{\frac{D-1}{N_p}}. \qquad (1)$$

With BLR, let the random variable $X_B$ denote the number of blocks assigned to a disk. Furthermore, $N_b$ represents the total data size in blocks. $N_b$ can be calculated as $N_b = S/S_b$, where $S_b$ denotes the block size. Similar to PLR, we obtain the mean, standard deviation and coefficient of variation of $X_B$ as follows:

$$\mu_{X_B} = \frac{N_b}{D}, \quad \sigma_{X_B} = \sqrt{N_b \frac{1}{D}\left(1 - \frac{1}{D}\right)}, \quad CV_{X_B} = \sqrt{\frac{D-1}{N_b}}. \qquad (2)$$

Let $W_P$ and $W_B$ denote the amount of data assigned to a disk with PLR and BLR, respectively. Then $W_P$ and $W_B$ can be calculated as

$$W_P = S_p X_P, \quad W_B = S_b X_B. \qquad (3)$$

Using Equations 1 and 3, we obtain the mean, standard deviation and coefficient of variation of $W_P$ as

$$\mu_{W_P} = \frac{S}{D}, \quad \sigma_{W_P} = S_p \sqrt{N_p \frac{1}{D}\left(1 - \frac{1}{D}\right)}, \quad CV_{W_P} = \sqrt{\frac{D-1}{N_p}}. \qquad (4)$$

Similarly, with Equations 2 and 3, we obtain the mean, standard deviation and coefficient of variation of $W_B$ as

$$\mu_{W_B} = \frac{S}{D}, \quad \sigma_{W_B} = S_b \sqrt{N_b \frac{1}{D}\left(1 - \frac{1}{D}\right)}, \quad CV_{W_B} = \sqrt{\frac{D-1}{N_b}}. \qquad (5)$$

Finally, from Equations 4 and 5, we obtain:

$$\mu_{W_B} = \mu_{W_P} \qquad (6)$$

and

$$\frac{\sigma_{W_B}}{\sigma_{W_P}} = \frac{S_b \sqrt{N_b}}{S_p \sqrt{N_p}} = \sqrt{R}, \qquad (7)$$
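The derivation above can be checked numerically. The following sketch uses illustrative parameter values (16 disks, 1,000-byte packets, 100 packets per block; none of these are from the paper) to verify the CV simplification of Equation 1 analytically and the $\sqrt{R}$ ratio of Equation 7 by Monte Carlo simulation:

```python
import math
import random
import statistics

D = 16          # number of disks (illustrative)
S_p = 1_000     # packet size in bytes (illustrative)
R = 100         # packets per block, so S_b = R * S_p
S_b = R * S_p
N_p = 10_000    # total number of packets
N_b = N_p // R  # total number of blocks

# Equation 1: X_P ~ Binomial(N_p, 1/D).
mu = N_p / D
sigma = math.sqrt(N_p * (1 / D) * (1 - 1 / D))
assert math.isclose(sigma / mu, math.sqrt((D - 1) / N_p))  # CV simplification

def disk0_load(rng, n_units, unit_size):
    """Bytes landing on disk 0 when n_units units are placed uniformly at random."""
    return unit_size * sum(rng.randrange(D) == 0 for _ in range(n_units))

rng = random.Random(3)
w_plr = [disk0_load(rng, N_p, S_p) for _ in range(400)]  # PLR: place packets
w_blr = [disk0_load(rng, N_b, S_b) for _ in range(400)]  # BLR: place blocks

ratio = statistics.pstdev(w_blr) / statistics.pstdev(w_plr)
print(f"sigma_WB / sigma_WP = {ratio:.1f}  (predicted sqrt(R) = {math.sqrt(R):.1f})")
```

The simulated standard-deviation ratio comes out close to $\sqrt{R} = 10$ for these parameters, while the sample means of both load distributions agree with $S/D$ as Equation 6 predicts.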
where $R$ represents the ratio of the block size to the packet size, i.e., $R = S_b/S_p$. As shown in Equation 6, the mean values of $W_B$ and $W_P$ are the same, which confirms that both the BLR and PLR schemes achieve the same level of load balancing in the long run, as expected. However, with respect to short term load balancing, PLR has a significant advantage over BLR, as indicated by Equation 7. Note the interesting fact that the ratio of $\sigma_{W_B}$ and $\sigma_{W_P}$ is determined only by the ratio of block to packet size. Next, we discuss the performance advantages of PLR in detail.

Figure 3. Ratio of load imbalance, G, in BLR and PLR as a function of the number of packets in a disk block.

Impact of the Block to Packet Size Ratio R: Figure 3 shows the ratio of load imbalance G as a function of the block to packet size ratio R. We observe that G increases sharply as R grows. The figure clearly indicates the significant performance gain of PLR over BLR. In fact, many hundreds of packets in one block is not unusual; for example, with a packet size on the order of a kilobyte, a block size of a megabyte or more is a quite common configuration in current generation streaming servers. This figure also suggests that, as the block size increases, short term load balancing with BLR degrades quickly. Recall that with larger disk blocks the disk overhead decreases (see Figure 2). Therefore, in BLR, there is a trade-off in determining the optimal block size.

5. CONCLUSIONS

Load balancing is important to ensure good overall performance in a scalable multimedia storage system. Periods of load imbalance can result in poor end-to-end performance for client sessions. This paper identifies and quantifies the performance benefit of the packet level randomization (PLR) scheme over the traditional block level randomization (BLR) scheme.
We intend to enhance PLR to support dynamic disk scaling operations. Furthermore, we plan to implement the PLR approach in our streaming prototype, HYDRA, and evaluate its performance with real measurements.

REFERENCES

1. S. Berson, S. Ghandeharizadeh, R. Muntz, and X. Ju, "Staggered Striping in Multimedia Information Systems," in Proceedings of the ACM SIGMOD International Conference on Management of Data, 1994.
2. J. R. Santos and R. R. Muntz, "Performance Analysis of the RIO Multimedia Storage System with Heterogeneous Disk Configurations," in ACM Multimedia Conference, (Bristol, UK), 1998.
3. J. R. Santos, R. R. Muntz, and B. Ribeiro-Neto, "Comparing Random Data Allocation and Data Striping in Multimedia Servers," in Proceedings of the SIGMETRICS Conference, (Santa Clara, California), June 2000.
4. A. Goel, C. Shahabi, S.-Y. D. Yao, and R. Zimmermann, "SCADDAR: An Efficient Randomized Technique to Reorganize Continuous Media Blocks," in Proceedings of the 18th International Conference on Data Engineering, February 2002.
5. F. Tobagi, J. Pang, R. Baird, and M. Gang, "Streaming RAID - A Disk Array Management System for Video Files," in First ACM Conference on Multimedia, August 1993.
6. C. Shahabi, R. Zimmermann, K. Fu, and S.-Y. D. Yao, "Yima: A Second Generation Continuous Media Server," IEEE Computer 35(6), June 2002.
7. R. Zimmermann, K. Fu, and W.-S. Ku, "Design of a Large Scale Data Stream Recorder," in The 5th International Conference on Enterprise Information Systems (ICEIS 2003), (Angers, France), April 2003.
8. V. Polimenis, "The Design of a File System that Supports Multimedia," Technical Report TR-91-020, ICSI, 1991.