The Feasibility of Supporting Large-Scale Live Streaming Applications with Dynamic Application End-Points

Transcription

1 The Feasibility of Supporting Large-Scale Live Streaming Applications with Dynamic Application End-oints aper 475, Regular paper, 4 pages Abstract While application end-point architectures have proven to be viable solutions for large-scale distributed applications such as distributed computing and file-sharing, there is little known about its feasibility for live streaming. Inherent properties of application end-points, for example, heterogeneity in bandwidth resources and dynamic group membership could adversely affect the feasibility of constructing a usable overlay. Even more challenging is attempting to do this at very large scales. In this paper, we use traces taken from a large content delivery network to characterize the behavior of users watching live audio and video streams. Based on these traces, we answer one of the most prominent architectural issues in the overlay multicast community: the feasibility of supporting large-scale groups using a purely application endpoint architecture. We show that in the majority of common scenarios, application end-points have inherent resources and stability to support large-scale live streaming. In addition, we motivate a real-world problem that has been largely ignored in previous designs, but must be addressed in order for application end-point architectures to be successful at very large scales: flash crowd joins. We show that a scalable and efficient join protocol can be implemented on top of dynamic application end-points. Introduction Live audio and video streams are now being delivered successfully over the Internet on a large scale. In the commercial sector, content delivery networks such as Akamai Technologies [] and Real Networks [5] have developed large scale dedicated infrastructure to deliver both live streams and video-on-demand. These architectures are capable of supporting many streams simultaneously and each stream may have thousands of subscribers. revious work in Overlay Multicast [7, 5, 9, 3,, 4, 2,, 7, 3, 2, 6, 2] has made the case that overlay networks are a promising architecture to enable quick deployment of multicast functionality on the Internet. In such an architecture, application end-points self-organize into an overlay structure and data is distributed along the links of the overlay. The responsibilities and cost of providing bandwidth is shared amongst the application end-points, reducing the burden at the content publisher. The lack of dependence on infrastructure support makes application end-point architectures easy to deploy, and economically viable. The ability for users to receive content that they would otherwise not have access to provides a natural incentive for them to contribute resources to the system. In contrast to content delivery networks, application end-point based architectures have shown promise for events where the peak group size is small-scale, on the order of to nodes [4]. The question remains whether or not such architectures are feasible at larger scales. In order to answer the feasibility question, we take the approach that we must first measure and understand the characteristics of application end-points. We leverage the wealth of data from a large content delivery network providing live streaming service. Because the system has been in everyday use for a some time, we can use the data to identify common real-world application end-point characteristics and behavior that are likely to impact the choice of architectures. Inherent properties of application end-point architectures, for example, heterogeneity in bandwidth resources and dynamic group membership could adversely affect the feasibility of constructing a usable overlay for data delivery. We look at feasibility along two key dimensions (i) bandwidth resources, and (ii) stability. The reason we choose these two dimensions is that they are fundamental in the sense that a purely application end-point architecture has little chance of providing good performance if these two are not satisfied. Secondary feasibility questions, such as network dynamics are relevant only when these primary feasibility requirements are satisfied. We find that in the majority of common scenarios, there is inherent bandwidth resources and inherent stability at application end-points, indicating the feasibility of such architectures for large-scale applications. Furthermore, we evaluate the viability of several design alternatives that seek to exploit such properties to improve system performance. In addition, we explore the design issues that one must address in order for application end-point architectures to be successful at very large scales. We find that a surprisingly large number of events have flash crowd behavior, motivating that a scalable and efficient solution for joining the system and managing group membership is required. We show that a simple solution based solely on application end-points can In Section 2, we describe and analyze the streaming media workload that we use in our study. Section 3 studies the feasibility of application end-point architectures to support large-scale live streaming events. In Section 4, we show that flash crowds are common in streaming media workloads and motivate the need for a scalable join and group management protocol. In Section 5, we summarize our findings.

2 4 live streams.2e+6 live requests 3 video requests.2e+6 audio requests 2 e+6 25 e+6 Number of live streams Number of live requests Number of video requests 2 5 Number of audio requests /4 /8 / /5 /29 2/3 2/27 / /24 Date (a) Distinct streams. /4 /8 / /5 /29 2/3 2/27 / /24 Date (b) Number of requests. /4 /8 / /5 /29 2/3 2/27 / /24 Date (c) Number of video requests. /4 /8 / /5 /29 2/3 2/27 / /24 Date (d) Number of audio requests. Figure : Daily summary of all live streams Stream ID Unique Hosts eak Hosts Video Unique Hosts Video eak Hosts Figure 2: The peak group size and unique number of entities for the large-scale streams. 2 Live Streaming Workload 2. Methodology In this study, we analyze the live streaming workload from a large content delivery network with the following motivating question: what can we learn about user behavior that will enable us to better understand the design requirements for building a live streaming system. We focus our analysis on characteristics that are most likely to impact design, such as group dynamics. We then evaluate the impact of these workloads on the performance of a application end-point based architecture. 2.2 Data Source The logs used in our study are collected from thousands of streaming servers belonging to a large content delivery network (CDN). The CDN uses a static overlay, composed of (i) edge nodes which are located at the edge of the network, very close to clients, and (ii) intermediate nodes that take streams from the original content publisher and split and replicate them to the edge nodes. The logs that we use in this study are from the edge nodes that directly serve client requests. The logs were collected over a 3-month period from October 23 to January 24. The daily statistics for live streaming traffic during that period is depicted in Figure. The traffic consists of three of the most popular streaming media formats. In Figure (a), there were typically 8- distinct streams on most days. However, there was a sharp drop in early December and a drop again in mid-december to January. This is because we do not have logs for one of the formats on those days due to a problem with our log collection infrastructure. Figure (b) depicts the number of requests for live streams which varies from 6, on the weekends to million on weekdays. Again, the drop in requests from mid-december onwards is due to the missing logs. Figure (c) and (d) depicts the number request for video streams and audio streams. The sharp peaks on various days correspond to special events. Note that there is an order of magnitude more requests for audio streams than video streams. In addition, audio streams have extremely regular weekend/weekday usage patterns. On the other hand, video streams are less regular, and are more dominated by special events with the very large events corresponding to the sharp peaks on various days. Streaming media events can be classified into two broad categories based on the event duration. The first category, which we call long-running events, are events in which there is live broadcast every day, all hours of the day. This is similar to always being on-the-air in radio terminology. The second category, which we call special events, are events in which the event duration is well-defined and typically on the order of a couple of hours. For example, a talk show that runs from 9am-am is only broadcast during that period, and has no traffic at any other time during the day. For simplicity, for either one of these categories, we break the events into 24-hour chunks, which we call streams. For the rest of the paper, we present analysis on the granularity of streams. Note that for special events, a stream is the same as an event. Rather than present an analysis on all the streams observed over the three-month period, which is an overwhelming amount of data, we limit the discussion in this paper to large-scale streams. Large-scale streams are defined as the streams in which the peak group size, or the simultaneous number of hosts is larger than, hosts. There were a total of over 653 large-scale streams, of which 6 were video streams and 593 were audio streams. In the following sections, we present results from the analysis conducted for 59 streams (55 video and 535 audio) encoded in two of the formats. The third format has similar events and characteristics to the others and are not discussed due to space limitations. Figure 2 depicts the peak group size and the unique number of I addresses (or entities, as we define in the next section). Most events had a peak group size of a few thousands. In addition to group size, we also summarize the session duration characteristics, which are analyzed in detail in Section

3 All 25kbps kbps 2kbps Cumulative robability Distribution mean 34.76, 394 Incarnations mean , 892 Entities min, 5 min (a) Membership over time. e+6 Session Duration (seconds) (b) Incarnation and entity session duration. Figure 3: Largest event. 2.3 Workload and Log rocessing Streaming workload: Each entry in the log corresponds to a request made by a client to an edge server. The following fields extracted from each entry are used in our study. User identification: I address and player ID Requested object: stream URI -stamps: session start time and session duration at the granularity of seconds erformance statistics: average received bandwidth for entire duration Entity vs. incarnation: We define an entity to be a unique user. An entity may join the broadcast many times, perhaps to tune in to distinct portions of the broadcast, and as a result have many incarnations. We use the I address in the logs to determine entity uniqueness. While this is not entirely accurate, it is the most robust in determining uniqueness. Its weaknesses are that it can (i) underestimate when multiple hosts are sharing the same I addresses in the case of NATs, firewalls, proxies, and DHC, and (ii) overestimate when the same host uses multiple I addresses in the case of DHC. We looked at other heuristics, such as the player id field which is intended to be a unique player identifier. However, in more than 5% of the requests, this field was anonymized by the user for privacy reasons. We also looked at combinations of operating system version, player version and user agent, but no combination was reliable in detecting uniqueness even for the cases in which we knew the clients were unique. 2.4 Largest event Next, we present more detailed statistics for the largest event in the logs. The event consisted of three streams with bitrates of 2 kbps audio, kbps video and audio, and 25 kbps video and audio. The event duration was 2 hours, from 9:-2:, as shown in Figure 3(a). Combining all three streams, the peak membership was 74, entities. The sharp rise in membership at 9: is caused by everyone wanting to tune in to the start of the event. There were roughly 9, entities and over 394, incarnations for the entire duration of the event. Note that there are many people who join the broadcast before the event started, probably to test that their set up was working. Also, there are many people who stay after the broadcast was over, for unknown reasons (they may have just left their player running). The 2 kbps audio stream has a sharper drop-off in membership than the other streams because many incarnations switch from audio to a better quality stream. The most dynamic segment of the stream was between 9:-9:3, where the peak join rate was over 7 arrivals/second and the peak leave rate was over 7 leaves/second. Figure 3(b) depicts the cumulative distribution of the incarnation and entity session duration time in seconds for the combined streams. For incarnation session duration, we see a sharp rise at 2 minutes. This is caused by a known clientside NAT/firewall problem that forced the streaming servers to time out on the connection. The average incarnation session duration was 22 minutes. And, more than 2% of incarnations stayed longer than 3 minutes. The entity session duration, which reflects the person s attention span, was much longer, with an average and median of 72 minutes and 63 minutes. The average is dominated by the tail of the distribution. Taking a closer look at the distribution, we find a fair amount of dynamics. For example, over 6% of incarnations and 5% of entities stay for less than 5 minutes. Furthermore, 5% of incarnations and 5% of entities stay for under minute. We explore what these numbers indicate about the stability of the system in Section 3.2. Unless otherwise stated, in the remaining sections, we combine all three streams into one stream, and refer to this one stream as the largest stream or event. We assume that the desired streaming bit rate for all hosts is 25 kbps. 3 Feasibility of overlay tree construction In this section, we look at the feasibility of supporting a large-scale live streaming application using a purely application end-point architecture with a focus on the data delivery plane. We capture feasibility using two important dimensions. The first dimension is bandwidth resources, and the second is system stability. 3

4 Access technology acket-pair measurement Outgoing bandwidth estimate Dial-up modems kbps BW kbps 3 kbps DSL, ISDN, Wireless kbps BW 6 kbps kbps Cable modems 6 kbps BW Mbps 25 kbps Edu, Others BW Mbps BW ublic only NAT and ublic Inefficient structure NAT NAT Connectivit Table : Mapping of access technologies to outgoing bandwidth. Type Degree-bound Number of hosts Free-riders (49.3%) Contributors (8.7%) Contributors 2 33 (8.4%) Contributors (5.2%) Contributors 2 85 (6.8%) Unknown (.6%) Total (%) Table 2: Number of hosts with assigned degree for the largest event. 3. Are There Enough Resources? In a purely application end-point architecture, there is no dependence on costly pre-provisioned infrastructure, making it favorable as an economical and quickly deployable alternative for streaming applications. On the other hand, the lack of any supporting infrastructure requires that application endpoints contribute their outgoing bandwidth as resources. The feasibility of such an architecture depends on whether or not there are enough bandwidth resources amongst application end-points to support all participants. To answer this feasibility question, we run trace-based analysis on 8 of the 59 large-scale streams. We first need to estimate the outgoing bandwidth resources at each host. Then, using these estimates, we derive the amount of resources in the system for each second in the traces. Note that the amount of resources are dependent on the join and leave patterns of participating hosts. 3.. Outgoing Bandwidth Estimation Data sources: We define resources as the amount of outgoing bandwidth that hosts in the system can contribute (i.e., how much bandwidth each host can send). We use a combination of data mining and active measurements for the estimation. Our primary goal was to generate an estimate for roughly 9, I addresses quickly and accurately. The most accurate methodology would be to actively measure the bandwidth from all the I addresses. However, doing so requires more time and resources than what we had available. Therefore, we took a different approach. We used existing tools and databases to help reduce the set of I addresses. Step : As a first order filter, we used speed test results from a popular web-site, broadbandreports.com. Speed tests are TC-based bandwidth measurements that are conducted from well-provisioned servers owned by the web-site. Users voluntarily go to the web-site to test their connection speeds. The results from the tests are aggregated based on DNS domain names or ISs, and are summarized over the past week on a publicly available web-page. Approximately, tests for over 2 ISs are listed in the summary. The Quality Index: 8/3 = 2.7 6/3 = 2. (a) (b) Figure 4: Example of how to compute the Quality Index. Quality Index Optimistic Cap 2 Distribution Cap 2 essimistic Cap 2 Optimistic Cap 6 Distribution Cap 6 essimistic Cap 6 Figure 6: Quality Index for largest event with two degree cap policies. measurements include bandwidth measurements in both incoming and outgoing directions. We use the outgoing bandwidth values for bandwidth estimation. We matched the hosts in our list with the ones from broadbandreports.com using DNS names. For the of hosts that matched, we assign their bandwidth values to the ones reported by broadbandreports.com. Step 2: Next, we aggregated the remaining 32,986 I addresses into /24 prefix blocks and conducted packet-pair based measurements to one host in each block. Of the 3,483 I addresses probed, 6,2 of them failed because the host did not respond do the probes. Finally, after all probing finished, we were able to compute bandwidth estimates for 7,463 I addresses. The bandwidth estimate for a /24 prefix was assigned to each host in that block, which resulted in estimates for a total of 9,28 (7.6%) I addresses. Step 3: We used a commercial service that maps I addresses to host information. The service is provided by the CDN who provided us with the streaming media logs. One of the fields in the host information database is access technology. We compared the overlapping matches from this database to the ones from broadbandreports.com and our DNS name mapping, and found them to be fairly consistent. Using this database, we were able to eliminate 8,42 hosts. Step 4: Finally, we used hosts DNS names to infer their access technology. For example, we use keywords such as adsl, cable-modem, dial-up, wireless,.edu and popular cable modem and dsl service providers such as rr.com and attbi. Of the remaining 5,546 I addresses, we matched 2,597 using this technique. Finally, we constructed a database manually 8/3 4

5 5 4 Optimistic Distribution essimistic 5 4 Optimistic Distribution essimistic 5 4 Optimistic Distribution essimistic Optimistic Distribution essimistic Quality Index 3 2 Quality Index 3 2 Quality Index 3 2 Quality Index Stream Sorted By Optimistic Quality Index (a) Audio streams, Degree Cap Stream Sorted By Optimistic Quality Index (b) Video streams, Degree Cap Stream Sorted By Optimistic Quality Index (c) Video streams, Degree Cap Stream Sorted By Optimistic Quality Index (d) Video streams, Degree Cap 2 Figure 5: Average quality index for each stream. for a subset of the remaining I addresses with resolved DNS names. This served to generate estimates for small DSL and cable modem providers, corporations and international educational institutions that do not use common DNS names. This techniques provided estimates for an additional,48 I addresses. Assign degree-bounds: To simplify the discussion, we normalize the bandwidth value by the streaming bit rate. For example, if a host has an outgoing link bandwidth of 3 kbps and the streaming rate is 25 kbps, then the normalized value is. That is, this host can have an out-degree of, or can support child at the full streaming rate. Although we assume a single-tree scenario, the out-degree values for multiple trees [2, 3, 6] can be derived from our estimates. Throughout this paper, we use degree as the outgoing bandwidth unit, instead of kbps. Using the data sources from the previous step, we end up with estimates with the following formats: Outgoing bandwidth (broadbandreports.com): We use this value to directly compute the degree. Average of the incoming and outgoing bandwidth (packetpair probing): Many access technologies have asymmetric bandwidth properties. For example, a DSL host with incoming link of 4 kbps and outgoing link of kbps will have an estimate of 25 kbps. Using the average of the two directions could over-estimate the outgoing link. Taking this into account, we discretize the bandwidth values from packet-pair into the access technology categories listed in Table, where BW stands for the bandwidth measured using packet-pair, and use the outgoing bandwidth estimate to compute the degree. Access technology (Host information database, DNS name resolution): We use the outgoing bandwidth estimate in Table to compute the degree. The degree estimate derived from the outgoing bandwidth value reflects the inherent capacity of a host. However, all that capacity may not be available for use. For example, the bandwidth may be shared across many applications on the end-host, or may be shared across many end-hosts in the case of a shared access link. Also, an application may want to be nice to the network and not saturate the link. In this paper, we set an absolute maximum bound of 2 for the outdegree such that no host in our simulations can exceed this limit, even if they have more resources to contribute. This roughly translates to 5 Mbps for video applications. We also study the effect of more conservative policies such as degree caps of 4, 6, and. Contributors vs. free-riders: The results from the previous two steps for the largest broadcast are listed in Table 2. Over half of the hosts have -degree, and are labeled as free-riders. Roughly 39% of the hosts are contributors, capable of supporting one or more children. Of these, 6.8% are hosts who are capable of supporting 2 or more children. Assign values to unknowns: After using all the data sources in the first step, we are missing estimates for % of the I addresses. For these unknown values, we assign resources to them using 3 assignment algorithms. The optimistic assignment assumes that all unknowns can contribute up to the maximum resource allocation (which we set to be degree 2). This provides an upper-bound on the best case resource estimate. The pessimistic assignment assumes that all unknowns are free-riders and contribute no resources. This provides a lower bound for the worst-case. The distribution algorithm assigns a random value drawn from the same distribution as the known resources. This provides a fair estimate assuming that the known and unknown resources follow the same distribution Quality Index To measure the resource capacity of the system, we use a metric called Quality Index [4]. The Quality Index is defined as the ratio of the supply of bandwidth to the demand for bandwidth in the system for a particular source rate. The supply is computed as the sum of all the degrees that application end-points participating in the system contribute. Note that the degree is dependent on the streaming bit rate, and in turn the Quality Index is also dependent on the streaming bit rate. Demand is computed as the number of participating end-points. For example, consider Figure 4, where each host has enough outgoing bandwidth to sustain 2 children. The number of unused slots is 5, and the Quality Index is "!#$ %'&(. A Quality Index of indicates that the system is fully saturated, and a ratio less than indicates that not all the participating hosts in the broadcast can receive the full source rate. As the Quality Index gets higher, the environment becomes less constrained and it becomes more feasible to construct a good overlay tree. A Quality Index of 2 indicates that there are enough resources to support two times the current number of participants Trace replay analysis In this section, we use Quality Index to measure the quality of resources across a set of 8 streams out of the 59 large-scale 5

6 streams. The analysis was conducted for all video streams and 5% of audio streams (randomly selected). To ensure some confidence in our results, we only analyzed streams that had sufficient bandwidth estimates (at least 7% of the I addresses in the traces had estimates) using the estimation methodology described earlier in Section 3... For each stream, we replay the trace using the group participation dynamics (joins and leaves) and compute the Quality Index for each second in the trace. First, we discuss the largest event, and then we present a summary for the other large-scale events. Figure 6 depicts the Quality Index as a function of time, with degree caps of 6 (bottom 3 lines) and 2 (top 3 lines) children. The interesting interval is between 9: - 2: when the event was taking place. Again, note that a Quality Index above means that there are sufficient resources to support the stream using a purely application end-point architecture. The highest and lowest curves for each degree cap policy are computed using optimistic and pessimistic assignments of bandwidth estimates to unknowns, respectively. Regardless of the treatment of hosts with unknown estimates and the degree cap policy, the Quality Index is always above during 9: - 2:. However, a degree cap of 6 places more constraints on the resources in the system and could potentially be more difficult to construct a tree with good performance. Figure 5 depicts a summary of the other large-scale streams. The quality index for audio streams is depicted in Figure 5(a). Each point on the x-axis represents an audio stream. The y-axis is the Quality Index for that stream averaged over the stream duration.the lowest curve in the figure is the Quality Index computed using the pessimistic assignment of bandwidth estimates to unknowns. For audio streams, even the pessimistic assignment is always between 2-3 when using a degree cap of 4 children. This is expected because audio is not a bandwidth-demanding application. The typical streaming rate for audio was between 6 kbps to 2 kbps. Most hosts on the Internet, including dialup modems, have outgoing bandwidth that is higher than 2 kbps. Application end-points participating in audio streaming applications can provide more than enough resources to support live streaming. Figure 5(b),(c), and (d) depict the average Quality Index for video streams with degree caps of 6,, and 2 children. As the degree cap increases, the Quality Index increases. The top most curve in all 3 figures represents the optimistic assignment. In the most optimistic view, across all degree cap policies (6,, and 2), only one stream had a Quality Index below one. In the worst case scenario, where the degree cap is 6 and the unknown assignment policy is pessimistic (b), roughly a third of video streams had Quality Index below. In order to determine feasibility, we look at the inherent amount of resources in the system. Using a degree cap of 2, and the distribution-based degree assignment for unknowns in Figure 5(d), we find that only stream has a Quality Index of less than. The stream with the worst Quality Index (4 on the x- axis) had a streaming rate of 3 kbps, but was composed almost exclusively (96%) of home broadband users (DSL and cable modem). Many home broadband connections can only support -25 kbps of outgoing bandwidth which is less than the steaming rate. Therefore, such hosts did. not contribute any resources to the system at all. To better understand whether or not this composition of hosts is common, we look at the nature of the event. This is a special event, starting on Sunday night at pm and ending at 2am in local time, where local time is determined based on the geographic location (Section 4.3 gives an overview of how geographic location is determined) of the largest group of hosts. Most participants are home users because the event took place on a weekend night where they are most likely to be at home. In contrast, most of the other events take place during the day, and are often accessed from more heterogeneous locations with potentially more bandwidth resources, such as from school or the workplace. About 5 of the 55 large-scale video streams fall under this category. Their Quality Index dips below for the distribution-based degree assignment in Figure 5(c). To summarize the results in this section, we find that there are more than sufficient bandwidth resources amongst application end-points to support audio streaming. In addition, there are enough inherent resources in the system at scales of or more simultaneous hosts to support video streaming in over 9% of the common scenarios. This shows that using application end-point architectures for live streaming is feasible. While we have shown that there are inherent resources, we wish to point out that assigning policies and mechanisms to extract those resources from application endpoints is beyond the scope of this paper. We have looked at simple policies, such as capping the degree bound to a static amount to make sure that no person contributes more than what is considered to be reasonable. There could be more complex policies to allow users to individually determine how much they are willing to contribute in return for some level of performance. While there are resources in the common scenarios, in % of the cases there were not enough resources to allow all participants to receive at the full streaming rate. In such scenarios, there are three generic classes of alternatives to consider. The first alternative is to enforce admission control and reject incoming free-riders when the Quality Index dips below one. A second alternative is to dynamically adapt the streaming bit-rate, either in the form of scalable coding [9], MDC [8], or multiple source rates [4]. This has the advantage that the system can still be fully supported using an application end-point architecture with a slight reduction in the perceived quality of the stream. Scaling the streaming by a factor of two effectively increases the Quality Index by a factor of two. And lastly, a third alternative is to add additional resources into the system. These resources can be statically allocated from an infrastructure service such as CDN s with the advantage that the resources only need to complement the already existing resources provided by the application endpoints. Another viable solution is to allocate resources in a more on-demand nature using a more dynamic pool of resources, for example, using a waypoint architecture [4]. 6

7 CDF Minutes 5th ercentile 25th ercentile 5th ercentile 95th ercentile Figure 7: Incarnation session duration in minutes. Interval between ancestor change Streams Random MinDepth Stability Figure 9: Median interval between ancestor change for 5 large-scale events. 3.2 Is There Any Stability? In this section, we look at the impact of group dynamics on stability, and evaluate mechanisms that can be used to increase the stability of the overlay Group dynamics Figure 7 depicts the incarnation session duration for the 59 large-scale streams. Note that an incarnation, as defined previously, refers to an instantiation of an entity or a unique host. The first curve on the left depicts the cumulative distribution of the observed 5th percentile incarnation session duration in all 59 streams. For example, 5% of incarnations in 3% of the streams had session durations of shorter than second (shown as. minutes). The next 3 curves are for the 25th, 5th and 95th percentile. Note that the same point on the y-axis does not necessarily correspond to the same stream across all curves. Based on this figure, we make two observations. First, there is a significant number of hosts that stay for very short durations. Looking at the 25th percentile curve, we find that for most streams, 25% of incarnations stay for under 2 minutes. Furthermore, looking at the 5th percentile curve, in most streams, 5% of hosts stay for roughly seconds. The most disastrous is that in the 5th percentile curve, more than % of the streams have extremely dynamic group membership with half of the incarnations staying less than 5 minutes. With such short session durations, it seems very unlikely that there could be any stability. Our second observation is that there appears to be some stability at the tail end, where there are a small number of incarnations that stay for very long periods. The 95th percentile curve in Figure 7 shows that for for most streams, the incarnations in the 95th percentile stay for longer than 3 minutes. There is a sharp rise at 3 minutes, caused by one long-running event. On multiple days, most people tuned in for 3 minutes. erhaps the tail can can help counter-act the effects of the high rate of dynamics at the other end of the distribution. Note that these observations are consistent with the session duration analysis of the largest event in Section Stability metrics When an incarnation leaves, it can cause all of its descendants to become disconnected from the overlay and stop receiving data. The descendants will need to find new parents and reconnect to continue receiving the stream. To capture stability of the overlay we look at two metrics: Mean interval between ancestor change for each incarnation. An ancestor change is caused by an ancestor leaving the group. This metric captures the typical performance of each incarnation. A change every second, or every minute is annoying. A larger value indicates better performance. If a host sees only one ancestor change during its session, the time between ancestor change is computed as its session duration. If a host sees no ancestor changes at all, the time between ancestor changes is infinity. Number of descendants of a departing incarnation. This metric captures overall stability of the system. If many hosts are affected by one host leaving, then the overall stability of the system is poor. However, assuming a balanced tree, most hosts will be leaf nodes and will not have children. Therefore, we hope to see that a large percentage of hosts will not have children when they leave Overlay protocol We simulate the effect of group dynamics on the overlay protocol using a trace-driven event-based simulator. The simulator takes the group dynamics trace from the real event and the degree assignments based on the techniques in the previous section, and simulates the overlay tree at each instance in time. Hosts in the simulator run a fully distributed selforganizing protocol to build a single connected tree rooted at the source. Note that we do not simulate any network dynamics or adaptation to network dynamics. Host Join: When a host joins, it contacts the source to get a random list of ) current group members. In our simulations, ) is set to. It then picks one of these members as its parent using the parent selection algorithm. Host Leave: When a host leaves, all of its descendants are disconnected from the overlay tree. For each of its descendants, this is counted as an ancestor change. Descendants then connect back to the tree by finding a new parent using the parent selection algorithm. arent Selection: When a host needs to find a parent, it selects a set of ) random hosts that are currently in the system, 7

8 9 8 Oracle Minimum Depth Random Longest-first 95 Oracle Minimum Depth Random Longest-first 9 8 Cumulative Distribution Cumulative Distribution Cumulative Distribution Interval between ancestor change (a) CDF of interval between ancestor change Number of Descendents (b) Number of descendants of a departing host. 2 Oracle Minimum Depth Random Longest-first Depth (c) Tree depth. Figure 8: erformance of largest event. probes them to see if they are currently connected to the tree and have enough resources to support a new incoming child, and then ranks them based on the parent selection criteria. We do not simulate the mechanisms for learning about the current hosts in the system, but simulate the effect of that knowledge propagation mechanism which provides random knowledge about hosts currently in the system. In a real system, Gossip-based mechanisms [6] can be used arent selection algorithms We ran simulations on 4 parent selection algorithms. Note that the chosen parent needs to be connected to the tree and have enough resources to support an incoming child (has not saturated its degree-bound), in addition to satisfying the parent selection criteria. Oracle: A host chooses the parent who will stay in the system the longest. This algorithm requires future knowledge and cannot be implemented in practice. However, it provides a good baseline comparison for the other algorithms. Longest-first: This algorithm attempts to predict the future and guess which nodes are stable by using the heuristic that if a host has stayed in the system for a long time, it will continue to stay for a long time. The intuition is that if the session duration distributions are heavy-tailed, then this heuristic should be a reasonable predictor for identifying the stable nodes. Minimum depth: A host chooses the parent with the minimum depth. If there is a tie, a random parent is selected from the ones with the minimum depth. The intuition is that a balanced shallow trees minimize the number of effected descendants when an ancestor higher up at the top of the tree leaves. Random: A host chooses a random parent. This algorithm provide a baseline for how a stability-agnostic algorithm would perform. Intuitively, random should perform the worst compared to the above algorithms. We used the degree assignment from the previous section, with a degree cap of 4 for audio streams and 2 for video streams, and the distribution-based assignment for the hosts with unknown measurements. Unless otherwise stated we use this same set up Results We simulated the performance of the 4 parent selection algorithms for the largest event over the most dynamic 3-minute segment from 9:-9:3. We also removed the hosts with NAT/firewall timeout problems discussed in Section 2 because their 2-minute time-outs are artificial and do not represent user behavior. We used the degree assignment from the previous section with a degree cap of 2 for video streams, and the distribution-based assignment for the hosts with unknown measurements. Unless otherwise stated we use this same set up for all video streams. We simulated runs for each algorithm with different random seeds. A CDF of the mean time interval between ancestor change is depicted in Figure 8(a). The x-axis is time in minutes. A larger interval is more desirable. Because we are simulating a 3-minute trace, the maximum time interval is 3 minutes, if a host sees on ancestor change. Hosts that do not see any ancestor changes have an infinite interval. For presentation purposes, we represent infinity as 3 on the x- axis. The bottom most line in the figure represents the CDF when using the oracle algorithm. Roughly % of the incarnations saw only one ancestor change in 3 minutes. Furthermore, 87% of incarnations did not see any changes at all. In fact, there were only one or two events that caused ancestor changes across all the runs. It is surprising that during the busiest 3 minutes in the trace, there is stability in the system. In addition, the overlay built by the oracle algorithm can exploit that inherent stability. The second-best algorithm is minimum depth. Over 8% of the incarnations saw either no changes, or or more minutes between changes. This is somewhat tolerable to humans as they will see a glitch every minutes or so. The random algorithm and the longest-first algorithm performed poorly in this metric. For random, only 65% of the incarnations saw no changes, or or more minutes between changes. To our surprise, the longest-first algorithm performed the worst, with 55% of incarnations seeing decent performance. The reason that it did not perform well stems from many factors. First, it may not always be correct in selecting the stable nodes. In fact, it was wrong for 9% of the cases because it had only 9% of hosts that had no descendants on departure (compared to % for oracle) as depicted in Figure 8(b). The number of descendants of a departing host is on the x-axis, in log scale. If longest-first were always correct, it would 8

9 overlap with the y-axis, like oracle where close to % of the incarnations had no descendants when they left the system. Note that longest-first is predicting correctly for some of the cases because it is performing better than random and min depth, which had 72% and 82% of incarnations with no descendants when they departed the system. One of the difficulties in getting accurate predictions is that at the start of the event, almost all hosts will appear to have been in the group for the same amount of time making stable hosts indistinguishable from dynamic hosts. An additional factor is that the consequence of its mistake is severe as depicted in Figure 8(c). The x-axis is the average depth of each node in the tree. The longest-first algorithm has taller trees than the random and min depth algorithms. Therefore, when it guesses incorrectly, a large number of descendants are effected. We examined the tail end of Figure 8(b) more closely and confirmed that this was the case. One interesting observation is that the oracle algorithm builds the tallest tree. The intuition here is that nodes that are stable will cluster together and stable branches will emerge. More nodes will cluster under these branches, making the tree taller. Although the height does not affect the stability results in our simulations, in practice a tall tree is more likely to suffer from problems with network dynamics. We find that min depth is the most effective and robust algorithm to enforce stability. Its property of minimizing damage seems to be the key to its good performance. The fact that it does not attempt to predict node stability makes it more robust to a variety of scenarios, as depicted in Figure 9. We ran the same set of simulations using the longest-first, random and min depth algorithms for 5 of the large-scale streams. These are the same streams as the ones used Section 3., with only half of the video streams used. Again, we used the distribution-based assignment for hosts with unknown measurements, a degree cap of 2 for video streams, and a degree cap of 4 for audio streams. The simulations were run over the most dynamic -hour segments in each trace. The y-axis is the median time interval between ancestor change observed in each stream in minutes. Again, min depth performed the best with 45 out of the 5 streams seeing a median time interval between ancestor changes of minutes or longer. Random and longest-first both performed poorly with only 2 and 3 streams seeing a median time interval of longer than minutes. While we present results based on 4 parent selection algorithms, we also explored many design alternatives. For example, we looked at prioritizing contributors, and combining multiple algorithms. However, we do not present them in this paper due to lack of space. We note that alternate algorithms did not perform as well as the ones listed above. For example, when the parent selection algorithm prioritized contributors such that they are higher up in the tree, the performance was as poor as random. This is explained by the observation that there is no correlation between being a contributor and being stable. We also looked at the impact of resource on stability. In particular, we looked at whether there is more stability if Join requests R J Explicit state update R E Keep-alives R K R Join responses R J Keep-alive responses R K Figure 2: Load at rendezvous point. there are more high degree nodes (i.e., more resources). We ran simulations on the audio streams with degree caps of 6 and, and found that there was only a slight improvement compared to when the degree cap was 4. To summarize, we find that there is inherent stability in application end-point architectures. Without future knowledge, the most effective and robust parent selection algorithm for maintaining stability and good performance is min depth. While we have looked at how to reduce the number of ancestor changes in the system, another important direction is to reduce the perceived impact of an ancestor change. In particular, mechanisms such as maintaining applicationlevel buffers to smooth out interruptions could help while the effected descendants are looking for new parents. Multiple trees [2, 3, 6] would also help reduce the impact of ancestor changes. In multiple trees, the stream is split and distributed across * independent trees. The probability of all trees seeing a disruption is small. And the impact of one tree seeing a disruption will degrade the quality of the stream. However, because a host now has * times more ancestors, it is likely that overall, it would see a larger number of ancestor changes. 4 Scalability for joining and membership management 4. Flash Crowds Are The Norm A common pattern emerged in our analysis of group membership across many of the large-scale streams. In 4% of the 59 large-scale streams, 236 (4%) of them exhibited flash crowd behavior. Breaking this down further, almost all video streams (53 out of 55) and 34% of audio streams had flash crowds. Figure and depict the group membership as a function of time for 4 large-scale video and 4 large-scale audio streams. The streams were sorted by their peak join arrival rate, and the streams with the maximum, 25th, 5th, and 75th percentile are shown in this figure. By 25th percentile, we mean that 25% of the streams had higher peak arrival rate than this stream. For all video streams in Figure, the sharp rise in group membership stands out. For the audio streams in Figure, the first two have a sharp rise in group membership. The last two have smoother join patterns. The video streams and audio streams with the maximum peak arrival rate had rates as high as arrivals/second and 2 arrivals/sec. For video streams, flash crowds are the common behavior. Intuitively, this is because all of the video streams and some of the audio streams are special events where the event 9

10 : 3: 6: 9: 2: 5: 8: 2: : (a) Max peak arrival rate. : 3: 6: 9: 2: 5: 8: 2: : (b) 25th percentile. : 3: 6: 9: 2: 5: 8: 2: : (c) 5th percentile. : 3: 6: 9: 2: 5: 8: 2: : (d) 75th percentile. Figure : Group membership for large-scale video streams, sorted by peak arrival rate : 3: 6: 9: 2: 5: 8: 2: : (a) Max peak arrival rate. : 3: 6: 9: 2: 5: 8: 2: : (b) 25th percentile. : 3: 6: 9: 2: 5: 8: 2: : (c) 5th percentile. : 3: 6: 9: 2: 5: 8: 2: : (d) 75th percentile. Figure : Group membership for large-scale audio streams, sorted by peak arrival rate. is taking place during a specific period, for example 2 hours. It is natural for people to want to start joining and watching the stream from the beginning. As long as there are special events, there are flash crowds. However, a number of longrunning audio streams also had flash crowds. We believe that special events that happened briefly during the long-running events, for example, invited guest appearances, could explain such behavior. With motivating data that flash crowds are a normal feature, it is very important to ensure that hosts can successfully join the system. If joining fails, there are no hosts in the system and no overlay to construct. The control plane in overlay multicast can be divided into three protocols: (i) join, which is used to bootstrap hosts into the system, (ii) membership management, which ensures that hosts know about other members in the system and (iii) tree construction and tree performance optimization, which depends on information gathered by the join and membership management protocol to construct a data delivery tree and optimize it. Given that flash crowds are normal in live streaming, the join protocol must be very scalable and efficient. In previous studies on Overlay Multicast scalability [7], the focus has been on the group management and tree construction protocols, however, attention has not been adequately given to flash crowds and the join protocol. In the remainder of this section, we outline the design requirements for scaling the join protocol and design a simple join protocol. We evaluate its performance in Section Join rotocol When a host wants to join the system, it needs to contact some node who knows the current members in the system. We refer to this node as the rendezvous point. For example, in End System Multicast [4] and CoopNet [3], the rendezvous point is the broadcast source and in NICE [7] the rendezvous points is separated from the data source. Newly joining members contact the rendezvous point to get a subset of members currently in the overlay. Current members in the group maintain periodic keep-alives with the rendezvous point Design guidelines This section motivates design guidelines resulting from the need to handle flash crowds. The rendezvous point cannot be overloaded. Figure 2 depicts the rendezvous point and the rate of join and keep-alive traffic that it needs to handle. Assuming that join requests, keep-alives and keep-alive responses have packet sizes of +, and the join responses have packet sizes of +-,/ ;:=<?>@:A64B, we can formulate the following bandwidth capacity constraint equations that need to be satisfied: CED@+FG!' CIHE+F3JLKM () C D +,N.9O ;:<Q>R:S64B T!UC H +NJVKW (2) KXM and K W are the bandwidth resources in the incoming and outgoing direction that the rendezvous point must dedicate to handling the control traffic. The rates C D for join rate and C-H keep-alive rate are in packets per second. Assuming that the rendezvous gives out a list of members to newly joining hosts, the outgoing bandwidth in equation (2) becomes the primary constraint. Substituting values from the largest event, where the peak C%D was 2 packets/sec and +%,/.32Y.95$698;:< is I-port pairs (6 Bytes), then KW needs to be more than 5 Mbps to just handle the membership lists. Keep-alive traffic would increase the bandwidth requirement even more. Explicit state update is required for maintaining fresh state. A less obvious failure than overload is that the group membership information given to newly joining hosts is useless

11 if it is stale, which is a very important problem for application end-point based live streaming because staleness is very dynamic in this context. Information is stale if it contains (i) members that have already left the group, or (ii) members that are currently saturated and cannot support any more children. Newly joining members need fresh, high quality membership information in order to find a parent and join the overlay. Generally, staleness can be addressed by maintaining more frequent keep-alives between members and the rendezvous points. However, using a single rendezvous point to maintain keep-alives between all hosts is very prohibitive. For example, using the session distribution presented in section 2, optimal saturation of hosts where a host gets saturated only if no other host in the system can take a child without getting saturated, and a bandwidth limit of Mbps, 56% of the membership information given to new hosts will be stale. A general principle to addressing overload is to distribute the load among multiple machines. Our solution leverages this principle. First, we simplify the computation functionality at the rendezvous server to only serve as a redirection mechanism that redirects incoming hosts to a number of membership service nodes. Second, the load of maintaining and distributing group membership lists is also distributed over the membership services nodes. We use clustering techniques to partition members into clusters and have one group member in each cluster serve as the membership server. Explicit state updates from a host go directly to its membership server, instead of the rendezvous point to maintain fresh state. Our proposed solution uses dynamic membership servers that are selected from amongst the contributors in the system. Because of their dynamic properties, robust mechanisms needed to find replacement membership servers for the cluster becomes another important guideline Join protocol We outline a simple join protocol that we will evaluate in the following section. We wish to highlight that the simplicity of the protocol allows for simple recovery given the dynamic nature of the membership service. Handling host join A new host joining the system contacts the rendezvous point. If there are existing membership servers, the rendezvous point gives out a subset of that list to newly joining hosts. The host then selects a membership server, either randomly or by using clustering techniques discussed in the next section. It contacts the chosen membership server, who then replies with a fresh list of current members in its cluster. The list is then given to the tree construction protocol. The tree construction protocol is responsible for connecting the newly joining host to some member in that list. Creating membership servers The rendezvous point is responsible for ensuring that there are enough membership servers in the system. Membership servers are created ondemand based on the needs of the system. For example, when a new host arrives, and there are not enough membership servers in the system, the rendezvous point will immediately assign the new host to function as a membership server (assuming the new host has enough resources to support the control traffic). Recovering from membership server dynamics Just before leaving a membership server looks inside its cluster to see if it can promote one of the hosts to become the new membership server. In some cases, due to resource constraints, a promotion may not be possible. The rendezvous server will notice that the number of membership servers has decreased and will create a new membership server from the newly arriving hosts. Note that when a membership server leaves, it only affects the ability for new hosts to join. The data overlay is not affected except for the hosts that are direct descendants of the membership server. The existing hosts that were part of the membership server s cluster need to find a new membership server. If a promotion was successful, the newly promoted host becomes their replacement membership server. The membership state can be quickly refreshed as hosts will maintain both explicit liveness and keep-alives with the replacement membership server. If a promotion was not successful, hosts will move to a different membership server. State maintenance In order to recover from dynamic membership servers, all membership servers exchange explicit state about their liveness with the rendezvous point. Membership servers also maintain explicit state, liveness, and information about a randomly selected subset of members in their cluster. In addition, all hosts inside a cluster exchange explicit state and maintain keep-alives with their membership server. When keep-alive messages are received at the membership server, the membership server will respond with a list of a subset of the other live membership servers in the system and other members outside its cluster (learned from exchanges with other membership servers). Interplay between control protocol and tree construction In this protocol, the control structure is loosely coupled with the data forwarding structure in the sense that a potential parent in the data forwarding structure is discovered through the control structure. Note that both protocols have full autonomy which makes recovery from failures independent and simple. 4.3 Resource-aware Clustering The previous section motivated the need for hierarchy and clustering of hosts in the group management structure. This section considers different clustering policies and evaluates their effectiveness. [8] showed that short delay is reasonably correlated with good bandwidth. We use Global Network ositioning (GN), geographic and random clustering policies. GN clustering approximates delay based clustering, however is much more efficient in terms of the measurements required [2]. Geographic clustering achieves locality in real world distance, which at a large scale approximates network distance. For example, the delay between two hosts on the east coast of the U.S. will likely be shorter than the delay between a host on the east and one on west coast. Similarly, hosts within a country will likely have shorter delays than hosts in two different countries.

12 Quality Index Cumulative Distribution of Clusters Random Clustering. Geographic Clustering. GN Clustering. NaturalClusters.Resource Cluster Figure 3: Quality Index of clusters Cluster Size Geographic GN Random Figure 4: Cumulative distribution of cluster size Data Acquisition and Simulation Setup GN coordinates are generated using the algorithm presented in [2]. We use active measurements to all hosts in the largest event from 3 different landmark nodes distributed around the world. Of the 8,92, only 27,35 hosts responded. We generated GN coordinates with dimentionality 8 for this set. Geographic data is acquired from a service provided by the CDN who provided us with the streaming-media logs. This service is the same as the one that provided access technology information and provided us with latitude and longitude coordinates for each host. Manual correlation with known geographic locations showed that the information was accurate for our purposes which is coarse-grain geographic clustering. We enhanced the tree construction simulator described in section 3 with the distributed join mechanism described in the previous section. However, when hosts join they are directed to the cluster with the closest membership server to them, where the measure of closest depends on the clustering policy. The membership server selection algorithm is the same as described in section 3. Also, the degree distribution used is the same as in section 3 The following section present results from simulations conducted using the largest event with GN and geographic clustering using the subset of hosts for which we had GN coordinates Cluster roperties We first analyze the largest event, which was introduced in section 2. Figure 3 shows the average quality index for all the clusters. For a system to support all members at a particular data rate, the quality index must be at least. Random clustering achieves close to uniform distribution of resources among clusters, where about 7% of clusters have quality index below. Since no locality exists with random clustering, the existence of resources does not necessarily mean they are useful. For example, a trans-oceanic link may need to be crossed to access a resource within the same cluster. Both geographic and GN clustering provide locality within clusters, however both policies produce clusters where 2% have quality index below. Figure 4 plots the cumulative distribution of cluster size for the three clustering policies. It shows that random clustering achieves cluster sizes with approximately uniform size distribution close to 2. The small clusters result from chosen membership servers staying for a very short duration. We noticed that all clusters with quality index below are very small using random. Geographic and GN clustering exhibit very different behavior. In general, these two clustering policies produce clusters with a wide variability in size. In addition, they produce a few large clusters with low quality index. These clusters comprise mainly of DSL and cable modem hosts. It is critical to ensure that (i) the quality index of a cluster stays above one such that hosts can find parents inside their own clusters, and (ii) the cluster sizes are bounded so the membership servers do not get overloaded. Therefore, the clustering algorithm must be aware of both constraints and maintain good values for available resources within clusters and size of clusters. Cluster Size Maintenance: Two possibilities for bounding the cluster size and handling overflows are: (i) redirection, where new hosts are redirected to the next best cluster until an unsaturated one is found and (ii) new cluster creation, where a new contributor host that is supposed to join a full cluster creates a new cluster and new free rider hosts are redirected. Resource Maintenance: Hosts within a cluster with quality index below must find parents outside their cluster. The design presented in the previous section incorporates this idea by having the membership servers provide its cluster members information about members in other clusters. It is still very important to maintain the quality index of a cluster above to reduce the recovery time from disconnects. We redirect free riders (using redirection from above) joining a cluster with quality index at or below to other clusters, but allow contributors in because they either increase the quality index or keep it the same Simulation Results Two parameters in the system that can potentially effect performance of hosts are the maximum cluster size and the minimum number of clusters. The number of clusters may increase if the new cluster creation mechanism is used to bound cluster size. We use the average and maximum intra-cluster distances to understand the impact of the two parameters and evaluate both control mechanisms. 2

13 9 9 Cumulative Distribution of Clusters Average 5 3 Maximum 5 Average 2 Maximum Average 2 Maximum 2 Average 5 Maximum Intra-Cluster Distance Cumulative Distribution of Clusters GN Average GN Maximum GN Redirect Average GN Redirect Maximum Intra-Cluster Distance Figure 5: Intra-cluster distances varying the number of clusters Figure 7: Intra-cluster distances for bounding quality index comparing naive (no bound) and redirect. Cumulative Distribution of Clusters GN Average 2 GN Maximum GN Redirect Average GN Redirect Maximum GN Create New Average GN Create New Maximum Intra-Cluster Distance Figure 6: Intra-cluster distances for bounding cluster size comparing naive (no bound), redirect and create new cluster. Analysis of intra-cluster distances varying the maximum cluster size between 2 and showed that the intracluster distances were not effected. However varying the minimum number of clusters does effect intra-cluster distances. Figure 5 plots the the cumulative distribution of the intra-cluster distances where the number of minimum clusters is varied between 5 and 5 and the maximum cluster size is maintained at 2. More clusters results in small intracluster distance for each cluster. The average improves only slightly, however the maximum improves significantly from close to 6ms for 5 clusters to about 25ms for 5 clusters. All remaining analysis uses minimum clusters and a 2 maximum cluster size. Next, we evaluate both control mechanisms. Figure 6 shows the cumulative distribution of intra-cluster distances for naive GN clustering, GN clustering with redirection and GN clustering with new cluster creation for bounding the cluster size. For 9% of the clusters the average cluster distance is within ms and the maximum reaches 3ms. This figure shows that redirection and new cluster creation do not effect the intra-cluster distances significantly as one may expect when hosts are not put into the closest cluster. Redirection of hosts from large clusters, whether they have low quality index or are at full host capacity, does not severely effect intra-cluster distances because there are many other hosts nearby due to the fact that the cluster is large and the second best cluster is probably also nearby. Similarly, using a new contributor to create a new cluster will work well because it should be expected that more hosts will join near Interval between ancestor change Random Random Minimum Depth Random Longest-first Random Random Geographic Minimum Depth Geographic Longest-first Geographic Streams Figure 8: Interval between ancestor change it. Figure 7 shows the cumulative distribution of intracluster distances for naive GN and GN using redirection for low quality index. Again, intra-cluster distances are not significantly effected. We compared the intra-cluster distances from our results above to distances resulting from clustering using the k-means algorithm. The key difference is that k-means will choose an optimal cluster center based on neighboring coordinates and our proposed mechanism chooses cluster heads randomly. The results for k-means assume all hosts are present at the same time so a direct comparison is not possible, however, it is still useful. Using k-means, 9% of the clusters had an average intra-cluster distance less than 5ms. Splitting large clusters and choosing optimal cluster heads is not a critical problem when clustering for the control plane. However, this does not convey the performance of hosts participating in the tree, which we consider next Summary of Results We evaluate the same set of streams from section 3 using geographic clustering with redirection to bound cluster size and maintain quality index above. We do not use GN clustering here because we did not have GN coordinates for hosts in all events. Figure 8 plots the interval between ancestor change for the different tree construction and clustering policies. The interval between ancestor change is very close to the result presented in section 3 without clustering. There are two ways that clustering can effect tree stability: (i) hosts 3

14 within different clusters do not have enough resources and (ii) hosts within different clusters have different session durations. We showed in figure 3 that resources are reasonably distributed and redirection can ensure available resources. In addition, we analyzed the session duration distribution for all clusters within an event and noticed that they show a similar trend. The similarity of resources and session durations among different clusters means that hosts in different clusters see similar amount of resources and session duration among their neighboring hosts and similar to what they would see if no clustering was used and their membership knowledge was randomly distributed over the entire group. This explains why the interval between ancestor change is similar for results with and without clustering. These results strongly suggest that clustering in the control plane can be accomplished with simple techniques that are aware of the available resources and cluster sizes. Clustering helps scale the group size, while providing locality in terms of network or geographic distances to hosts. Our results show that resources can easily be maintained within clusters and stability of the data tree is not impacted. 5 Summary In this paper, we used a large set of live streaming media traces to answer one of the most prominent architectural questions in overlay multicast. We look at the feasibility of supporting large-scale groups using a purely application end-point based architecture. We found solid supporting evidence that for most of the common application scenarios, there is enough stability and resources to support large-scale live streaming. In terms of resources, we believe that there are interesting open questions that need to be addressed in designing policies and mechanisms to extract resources from application end-points. For stability, we find that constructing overlays by optimizing for minimum depth provides the best performance. In addition to studying the feasibility of tree construction in the data plane, we also identify a surprisingly common behavior that has been overlooked in previous work. Almost all the video streams, and 3% of audio streams have flash crowd joins. This motivates that need to design scalable and efficient solutions for joining the overlay and group membership maintenance. We evaluate a solution that leverages application end-point nodes to serve as membership servers and partition the members into clusters for scalability. We find that proximity-based clustering mechanisms, enhanced with resource-awareness and cluster-sizeawareness can perform as well as solutions where there is no clustering in terms of resources and stability. References [] Akamai. [2] M. Castro,. Druschel, A. Kermarrec, A. Nandi, A. Rowstron, and A. Singh. SplitStream: Highbandwidth Content Distribution in Cooperative Environments. In roceedings of SOS, 23. [3] Y. Chawathe. Scattercast: An architecture for Internet broadcast distribution as an infrastructure service. Fall 2. h.d. thesis, U.C. Berkeley. [4] Y. Chu, A. Ganjam, T.S. Eugene Ng, S.G. Rao, K. Sripanidkulchai, J. Zhan, and H. Zhang. Early Experience with an Internet Broadcast System Based on Overlay Multiast. Technical Report cmu-cs-3-24, Carnegie Mellon University, December 23. [5] Y. Chu, S.G. Rao, and H. Zhang. A Case for End System Multicast. In roceedings of ACM Sigmetrics, June 2. [6] J. Albrecht D. Kostic, A. Rodriguez and A. Vahdat. Bullet: High Bandwidth Data Dissemination Using an Overlay Mesh. In roceedings of SOS, 23. [7]. Francis. Yoid: Your Own Internet Distribution, April 2. [8] Vivek K. Goyal. Multiple Description Coding: Ccompression Meets the Network. IEEE Signal rocessing Magazine, Vol. 8, pages 74 93, 2. [9] J. Jannotti, D. Gifford, K. L. Johnson, M. F. Kaashoek, and J. W. O Toole Jr. Overcast: Reliable Multicasting with an Overlay Network. In roceedings of the Fourth Symposium on Operating System Design and Implementation (OSDI), October 2. [] J. Liebeherr and M. Nahas. Application-layer Multicast with Delaunay Triangulations. In IEEE Globecom, November 2. [] A.M. Kermarrec M. Castro,. Druschel and A. Rowstron. Scribe: A large-scale and decentralized application-level multicast infrastructure. In IEEE Journal on Selected Areas in Communications Vol. 2 No. 8, Oct 22. [2] T.S. Eugene Ng and Hui Zhang. redicting Internet Network Distance with Coordinates-Based Approaches. In roceedings of INFOCOM, June 22. [3] V.N. admanabhan, H.J. Wang, and.a Chou. Resilient eer-to-peer Streaming. In roceedings of IEEE ICN, November 23. [4] Sylvia Ratnasamy, Mark Handley, Richard Karp, and Scott Shenker. Application-level Multicast using Content-Addressable Networks. In roceedings of NGC, 2. [5] Real broadcast network. [6] R. Renesse, Y. Minsky, and M. Hayden. A gossip-style failure detection service. Technical Report TR98-687, Cornell University Computer Science, 998. [7] B. Bhattacharjee S. Banerjee and C. Kommareddy. Scalable Application Layer Multicast. In roceedings of ACM SIGCOMM, August 22. [8] S. Saroiu,. K. Gummadi, and S. D. Gribble. A measurement study of peer-to-peer file sharing systems. In roceedings of Multimedia Computing and Networking (MMCN), January 22. [9] S.McCanne, V.Jacobson, and M.Vetterli. Receiverdriven layered multicast. In roceedings of ACM SIG- COMM, August 996. [2] S.Q.Zhuang, B.Y.Zhao, J.D.Kubiatowicz, and A.D.Joseph. Bayeux: An Architecture for Scalable and Fault-tolerant Wide-area Data Dissemination, April 2. Unpublished Report. [2] W. Wang, D. Helder, S. Jamin, and L. Zhang. Overlay optimizations for end-host multicast. In roceedings of Fourth International Workshop on Networked Group Communication (NGC), October 22. 4