Using Client Migrations for Load-Balancing in Video-on-Demand Systems




The Interdisciplinary Center, Herzlia
Efi Arazi School of Computer Science

Using Client Migrations for Load-Balancing in Video-on-Demand Systems

M.Sc. Final Project
Submitted by Adam Oren
Under the supervision of Dr. Tami Tamir
September, 2007

Table of Contents

1 Introduction
  1.1 Problem Statement
  1.2 Objectives and Motivation
  1.3 Organization of this Document
2 Preliminaries
  2.1 Definitions and Notation
  2.2 Implemented Algorithms
    2.2.1 The Static Phase
    2.2.2 The Dynamic Phase
3 Experimental Study
  3.1 Simulation Model
    3.1.1 Storage System
    3.1.2 File Properties
    3.1.3 User Behavior
  3.2 The Simulator
4 Simulation Results
  4.1 Experiment 1 - Varying arrival gaps
  4.2 Experiment 2 - Varying pause frequencies
  4.3 Experiment 3 - Varying remote service bandwidth costs
  4.4 Experiment 4 - Varying migration threshold
  4.5 Experiment 5 - Varying color migration thresholds
  4.6 Experiment 6 - Partial migrations
  4.7 Experiment 7 - Changing pause duration
  4.8 Experiment 8 - Tracking migrations performance over time
  4.9 Experiment 9 - Varying pause frequencies
5 Conclusions
  5.1 Upper bound
  5.2 "Migrate only at pauses" method performance
  5.3 Migrations can be done with a subset of the migration types and still give satisfying results
  5.4 Too low migration threshold causes low migration rate
  5.5 The remote service bandwidth cost has great impact on the number of rejected customers
  5.6 Long pause duration will reduce performance
  5.7 The system performs better with pause migrations than with free migrations over time
  5.8 Migrations improve performance at best by about 20%
6 Further Research

1 Introduction

1.1 Problem Statement

Given a set of video files with fixed popularities, we need to find a data placement mechanism and a dynamic load-balancing scheme that achieve the best Quality of Service with respect to various measures (including throughput, resource utilization, and resource requirements).

1.2 Objectives and Motivation

This document presents the experimental results obtained by feeding the simulation system we developed with different sets of parameters. Given the problem and some known parameters (number of movies, movie length, number of clusters, etc.), we want to detect sets of parameters that enable successful simulations. We declare a simulation successful if it has less than 1% rejections. We would like to find the set (or sets) of parameters that services the largest number of clients successfully.

Because of the dynamic nature of the system, we can use load balancing to improve the quality of service. Migrating customers from one disk to another can make room for other users, because different disks contain different movies and have different available load. We want the migrations to improve the locality of the service. We distinguish between three types of migrations:

1. Green migration: applying the migration will improve the locality of the service.
2. Blue migration: applying the migration won't change the locality of the service.
3. Red migration: applying the migration will hurt the locality of the service.

We distinguish between two settings. In the first, we allow migrations to occur only when a user pauses his transmission; this setting is motivated by the fact that, with current technologies, migrating a user during a movie causes a brief flickering, which it is desirable to avoid or minimize. In the second setting we allow migrations to occur whenever applicable, which gives us a good basis for comparison with the first setting. We would like to analyze the results and draw some conclusions about the system's behavior.
1.3 Organization of this Document

In section 2 we discuss the algorithms we implemented in our Media-on-Demand system simulator. We first give some definitions and notation used later to describe the algorithms, and then describe each algorithm briefly. In section 3 we present the experimental study: in section 3.1 we present the simulation model, and then we discuss the simulator itself and how it is implemented. In section 4 we present the analyzed data of the experimental results. In section 5 we present the conclusions of these experiments. Finally, in section 6 we suggest directions for further research and improvements of the model.

2 Preliminaries

2.1 Definitions and Notation

Static Phase - In the static phase we have two main processes: the first determines how many copies each movie should have in the system, according to its popularity, and the second determines on which disks and in which clusters to put those copies.

M - the number of movies the system contains.

Theta Parameter - A Zipf distribution over M items is defined as follows: the probability of the i-th item is p_i = C / i^Θ, where 0 ≤ Θ ≤ 1 and C is a normalizing constant.

Recursive Depth - how many times we divide the item collection into 80% / 20%.

Pop Cluster - The system services clients from 3 geographical regions. Each of the three pop-clusters is located in one of these regions, and each of them contains 10 disks. The pop clusters can service remote clients, but only up to a parameter-defined fraction of their bandwidth.

Root Cluster - The root consists of 20 disks and contains at least one copy of every movie in the system. The root is used to service remote clients when the local cluster cannot service them.

Disk Capacity - the number of movies a disk can contain.

Dynamic Phase - In the dynamic phase the system gets requests for movies and assigns customers to their requested movies, or rejects them if it cannot service them. The system also balances itself by migrating customers from one disk to another.

Successful simulation - a simulation in which at most 1% of the client requests are rejected.

Green migration edge - applying the migration will improve the locality of the service. That is, the source disk is in a remote cluster for some client while the target disk is local.

Blue migration edge - applying the migration will not change the locality of the service. That is, the source disk and the target disk belong to the same cluster or to two neighbor clusters.

Red migration edge - applying the migration will hurt the locality of the service. That is, the source disk is in a local cluster, while the target disk is in a remote cluster.

Parameters

Static Utilization Parameter - the fraction of the storage that is allowed to be used for the initial set of movies.

Migration Threshold - the load threshold a disk must reach in order to start migrating customers away from it. It is also used to determine whether a disk can accept migrations from other disks.

Red / Blue / Green Threshold - the load difference between the source disk and the target disk that must exist for the migration to be allowed.

Root / Pop Cluster Remote Cost - the load overhead required in order to give a customer remote service (from another pop cluster or from the root).

Expected Pause Frequency - the expected number of pauses a client will make while watching one movie.

Expected Pause Duration - the expected length of each of the client's pauses.

Migrate During Pause - determines whether the system will perform migrations during the customers' pauses.

Migrate During Not Pause - determines whether the system will perform migrations whenever it is possible to do so (not only during the customers' pauses).

Statistics

Arrival Gap - the expected gap in seconds between customers' requests.

Arrived Customers - how many customers were successfully serviced by the system.

Rejected Customers - how many customers were rejected by the system.

Migrated Customers - the total number of migrations performed.

Green Migrated - how many customers migrated over a green edge.

Blue Migrated - how many customers migrated over a blue edge.

Red Migrated - how many customers migrated over a red edge.

2.2 Implemented Algorithms

The simulation process is roughly divided into two phases: the static phase and the dynamic phase. The static phase initializes the system: during this phase we need to determine how many copies each movie should have and where those copies should be stored. We are given the vector of movies ordered by their popularity, and the distribution that defines the ratio between the movies. The dynamic phase starts when the first client requests a movie. Each disk has a certain fixed load capacity to service clients, and we need to determine which cluster and which disk in that cluster will service each client. If no disk can service the client, he is rejected. The system also balances the load over the disks by migrating customers from one disk to another from time to time. In this section we describe in detail the algorithms we used to implement the above phases.

2.2.1 The Static Phase

1. First, we calculate how many copies of each movie will be stored in the system.
   1. Compute how many copies should be stored in total: the number of copies we can store is the total number of disks, multiplied by the disk capacity, multiplied by the static utilization parameter.
   2. Next, find the probability for each movie to be requested.
      1. Initialize a vector of size M.
      2. Cut the vector into two parts: 20% and 80% (see image).

      3. The 20% part is Zipf-distributed:
         1. Calculate the Zipf constant: C = 1 / Σ_{i=1..0.2M} (1/i^Θ).
         2. For each of the Zipf-distributed movies, calculate its probability using Zipf: p_i = C / i^Θ.
         3. Assign these probabilities to the probability vector.
      4. Now handle the 80% part (the tail):
         1. level = 1
         2. k = RecursiveDepth
         3. While (level <= k):
            1. Cut the remaining part of the vector 80-20.
            2. The 20% part gets cumulative probability 0.8 of the remaining mass; level++; divide the tail 80-20 again and repeat recursively (until depth k).
            3. The 80% part (the tail) gets cumulative probability 0.2 of the remaining mass.
   3. We multiply the total number of copies computed in step 1 by the probability of each movie to get the number of copies we should assign to that movie.
      1. We make sure that each movie has at least one copy.
      2. Now we take all movies that have more copies than the number of disks (each disk should hold at most one copy of a given movie) and distribute the surplus among the other movies. For example, if a movie should get 80 copies and there are 50 disks in the system, then 30 copies should be distributed among all the movies that have fewer than 50 copies; if there are 1000 such movies, then each of them should get 0.03 more copies.
      3. We round the number of copies each movie got and distribute the remainder among all movies: if a movie has 47.83 copies, we round the number to 47 and distribute the remainder among the other movies (just as we did in step 2).
   4. Finally, we determine in which clusters and on which disks the movies should be stored:
      1. Store the movies starting from the most popular one and concluding with the least popular. Assume you need to store F_i copies of movie i; then:
      2. Store one copy of i on the root cluster.
      3. If F_i > 1, use round-robin over the clusters; in each cluster, store one copy of the movie on the least-loaded disk (that does not hold a copy of this movie yet). If all the disks of some cluster are full to capacity (1300 x Static Utilization Parameter), move to the next cluster. More than one disk of a cluster may end up holding a copy of the movie. Note that the root cluster participates in this stage as a regular cluster and might get additional copies of i.

2.2.2 The Dynamic Phase

As we mentioned in section 1.2, there are three types of migration edges: red, blue, and green. Each color denotes the effect on locality of migrating along that edge. A red edge means that we will hurt the locality, i.e., we will migrate a customer to a cluster further away than his current cluster. A blue edge means that the migration will have no effect on the locality: it is either a migration within the same cluster or to a cluster that is as far from the customer as his previous one. A green edge will actually improve the locality: migrating along it means the user will be serviced from a closer cluster than the previous one.
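As a rough illustration, the copy-count computation of section 2.2.1 (step 3) can be sketched as follows. This is our own simplified Python reading of the procedure (the simulator itself is written in C#); the function name and the uniform remainder handling are ours:

```python
def copy_counts(probs, total_copies, num_disks):
    """Turn a popularity vector into per-movie copy counts (sketch of
    section 2.2.1, step 3): proportional allocation, at least one copy
    per movie, at most one copy per disk, surplus spread uniformly."""
    raw = [max(1.0, p * total_copies) for p in probs]
    # Cap over-popular movies at one copy per disk and pool the surplus.
    surplus = sum(c - num_disks for c in raw if c > num_disks)
    raw = [min(c, num_disks) for c in raw]
    under = [i for i, c in enumerate(raw) if c < num_disks]
    if under:
        for i in under:
            raw[i] += surplus / len(under)
    # Round down, then hand the pooled remainder out one copy at a time.
    counts = [int(c) for c in raw]
    remainder = round(sum(raw) - sum(counts))
    for j in range(remainder):
        counts[j % len(counts)] += 1
    return counts
```

For example, with probabilities (0.9, 0.05, 0.05), 100 copies, and 50 disks, the first movie is capped at 50 copies and the 40 surplus copies are split between the other two.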
In order to migrate a customer from disk d1 to disk d2, three conditions need to hold: (1) d2 holds at least one copy of some file that is currently transmitted by d1; (2) the load on d1 is at least Migration Threshold; and (3) the load on d2 is at most Migration Threshold - g (another parameter), where g is the edge weight, specifically the Red / Blue / Green Threshold. We give the edge colors different weights because we would like to encourage green edges more than blue and red edges, and blue edges more than red edges: we would like to increase the locality as much as possible, because remote service costs more than local service.

The dynamic phase proceeds as follows:

1. Loop for 24 hours:
   1. Generate a new customer request:
      1. Generate the customer request time. The arrival time is generated using the arrival gap parameter as input to a random variable generator that simulates an exponential distribution.
      2. Generate the customer region (which of the three clusters the request is coming from).
      3. Randomize a movie that the customer requests.
   2. Check whether there are pending events to handle before the customer arrival:
      1. If there are departure events, release the customers from their movies and update the disk capacity.
      2. If there are pause events, check whether the disk has reached the migration threshold; if it has, check whether the customer can migrate to another disk:
         1. If the customer's edge is valid for migration, migrate the customer.
         2. Otherwise, find the best edge in the current state and, if one exists, migrate the customer along it.
   3. Add the customer arrival event to the time line (and handle it):
      1. Find an available disk in an available cluster containing the movie.
      2. If the local cluster doesn't contain the movie, search the other clusters and add the appropriate remote service cost.
      3. Assign the customer to the available disk found (or reject him if no disk was found).
      4. Select the best possible edge to another disk among all the disks containing the movie (in that order: green, blue, and then red).
      5. Add the edge to the graph.
   4. Generate customer pauses according to the expected number of pauses parameter:
      1. Distribute the pause events along the movie duration.
      2. Add a resume event after each pause.
   5. Add the leaving event to the time line. The leaving event is added after calculating the movie length plus the expected pausing duration.
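The migration eligibility test described above (conditions (1)-(3)) can be sketched as follows. The function name is ours, and the weights and threshold are the defaults used later in section 4 (Green = 0, Blue = 2, Red = 8, Migration Threshold = 400); they are illustrative, not the simulator's code:

```python
# Edge weight g per color; defaults taken from the experiments in section 4.
EDGE_WEIGHT = {"green": 0, "blue": 2, "red": 8}

def can_migrate(shares_file, load_src, load_dst, color, threshold=400):
    """Check the three migration conditions for moving a customer from a
    source disk to a target disk along an edge of the given color."""
    g = EDGE_WEIGHT[color]
    return (shares_file                     # (1) target holds a copy of the file
            and load_src >= threshold       # (2) source disk is loaded enough
            and load_dst <= threshold - g)  # (3) target has room, penalized by g
```

Because red edges carry the largest weight, a red migration needs a much lighter target disk than a green one, which is how the system discourages locality-hurting moves.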

3 Experimental Study

In this section we discuss the way we simulated the system and how the simulator is implemented.

3.1 Simulation Model

As mentioned above, we simulate an on-demand multimedia system. Our request arrival model is designed to simulate the customers' behavior in real scenarios. The simulation covers the arrival of movie requests and the pauses made by users.

3.1.1 Storage System

The Media-on-Demand system consists of 50 disks (servers) organized in four clusters. One of these clusters is the root; the other 3 are called pop-clusters. The root consists of 20 disks, and each of the 3 pop clusters consists of 10 disks. The storage capacity of each disk is 3.5 Terabytes. For standard movie files (we assume all the movies have the same storage requirement), this means that each disk is capable of storing 1300 movie files. Therefore, the total storage capacity of the system is 50 * 1300 = 65000 movie files. The streaming (load) capacity of each disk is L = 480. As will be explained later, the bandwidth required to transmit one stream depends on the locations of the transmitting server and the destination.

3.1.2 File Properties

All movies are assumed to have the same length of 90 minutes. There are 6000 movies in the system. The static phase determines the initial allocation of the movie files. In the following we describe the distribution of the file popularities; this distribution can also be viewed as the probability that a specific movie is requested by an arriving client. The first (most popular) 20% of the movies follow a Zipf distribution; the other 80% of the movies follow an 80-20 distribution. This means that 20% of this part has cumulative probability 0.8 and the rest (the tail), 80% of the movies, has cumulative probability 0.2. The first 20% have equal probability, and the other 80% continue to be divided 80-20 in the same way until we reach a certain level determined by a parameter.
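The recursive 80-20 tail of section 3.1.2 can be sketched roughly as follows. This is our own simplified reading of the scheme (distributing one unit of mass over the tail movies only); the function name and the equal split at each level are ours:

```python
def tail_8020(n, depth):
    """Spread a unit of probability mass over n movies by the recursive
    80-20 rule: at each level, the most popular 20% of the remaining
    movies share 0.8 of the remaining mass equally, and the other 80%
    carry 0.2 onward; at the final depth the leftover is spread equally."""
    probs = [0.0] * n
    start, remaining, mass = 0, n, 1.0
    for _ in range(depth):
        head = max(1, int(0.2 * remaining))
        for i in range(start, start + head):
            probs[i] = 0.8 * mass / head
        start += head
        remaining -= head
        mass *= 0.2
        if remaining == 0:
            return probs
    for i in range(start, n):
        probs[i] = mass / (n - start)
    return probs
```

With 1000 tail movies and depth 3, the first 200 movies get 0.004 each, the next 160 get 0.001 each, and so on, with the probabilities summing to 1.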
3.1.3 User Behavior

A client request is expected to arrive every "arrival gap" seconds. When a request arrives, if the system can service it, the client is serviced immediately; otherwise he is rejected. The client is also expected to make a certain number of pauses during the movie, determined by another parameter. The pause lengths are distributed exponentially with expected value ExpectedPauseLength, and the client resumes after that time. The client leaves the system after a period of time equal to the movie length plus the total length of his pauses. Clients have equal probability of coming from each of the three regions, and movies are picked by the clients according to their popularity.

3.2 The Simulator

The simulator is a Windows application written in .NET C#. The main class is called Simulator, and it manages the simulation process. It has three main methods: StaticAssignment(), StartSimulation(), and EndSimulation(). StaticAssignment() implements the static phase: it assigns the movies to the different clusters before the simulation starts. StartSimulation() implements the dynamic phase: it simulates client arrivals, assigns the clients to their requested movies, and balances the system. EndSimulation() is used to interrupt the simulation process and stop it; it is used inside the project when the system overloads, and can be used from outside the project as well.

In order to handle the different events in the right order, we have a class called TimeLine that keeps track of the different events and prevents, for example, a situation in which a client's request is serviced before other clients have been released. The TimeLine class keeps track of all the events, including clients' arrival, leaving, pausing, and resuming. In order to handle the migration process we have a class called MigrationManager. This class uses a graph object as a data structure to keep track of possible migrations and balance the load on the clusters.
Each time we assign a customer to a movie, we look for the best possible migration, so that when the time comes (for example, when that customer pauses) we can check whether the migration is possible and migrate the customer. MigrationManager also has a method to migrate an entire disk (if the disk can be migrated). The Logger class allows us to log the events to an XML file; it is turned off by default due to performance issues. The Statistics class is a static class we use to gather data during the simulation process (how many customers were serviced, how many were rejected, etc.). We use these results later to draw conclusions.

The project is designed to be object-oriented as much as possible. Entities in the system such as cluster, pop cluster, disk, root, and time event are represented by different classes with some hierarchy between them. These classes can easily be extended later if we want to simulate the system even more accurately (for example, adding names to the movies or to the clusters). Some classes aggregate other classes: for example, PopCluster (which is derived from Cluster) aggregates a collection of disks, and each disk aggregates a collection of the movies it contains, and of course the customers assigned to it. The other classes in the project mostly have technical roles: we use them to unit-test the project, manage settings, declare constants, draw graphs, and test the system.

4 Simulation Results

In order to check how the system's quality of service behaves under different sets of parameters, we ran several experiments. A good simulation is one that services a maximal number of customers and has less than 1% rejections. Since the rejection rate is our main success indicator, we will mostly check this parameter in the different scenarios. To provide a basis for comparison, all experiments were simulated under similar basic conditions:

1. 24-hour simulation.
2. Theta Parameter = 0.5.
3. Same input stream (unless mentioned otherwise).
4. 0.27 seconds arrival gap (unless mentioned otherwise).
5. We perform migrations on customers only during their pauses (unless mentioned otherwise).
6. We expect each customer to pause his movie twice (unless mentioned otherwise).
7. We expect each pause to last 60 seconds on average (unless mentioned otherwise).
8. 0.1 load units when we serve a customer through another cluster (unless mentioned otherwise).
9. 0.2 load units when we serve a customer through the root cluster (unless mentioned otherwise).
10. The migration threshold is 400 (unless mentioned otherwise).
11. Red Threshold = 8, Blue Threshold = 2, Green Threshold = 0 (unless mentioned otherwise).
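Interarrival gaps (item 4) and pause lengths (item 7) are both exponentially distributed (section 3.1.3). A small sketch of how such arrival times might be drawn; the function name and seeding are ours, and the simulator itself is written in C#:

```python
import random

def arrival_times(horizon_s, mean_gap_s, seed=1):
    """Yield customer arrival times (in seconds) over the simulation
    horizon, with exponentially distributed gaps of the given mean."""
    rng = random.Random(seed)
    t = 0.0
    while True:
        t += rng.expovariate(1.0 / mean_gap_s)  # mean gap = mean_gap_s seconds
        if t >= horizon_s:
            return
        yield t
```

With the 0.27 s default gap, one simulated hour yields about 3600 / 0.27 ≈ 13,300 arrivals.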

Figure 1.1: The distribution of the top 20% of the copies after the static phase, with curves for Theta = 0.4, 0.5, and 0.6 (see section 2.2.1).

The graph in figure 1.1 represents the accumulated probability of the movies according to the Zipf distribution under different Theta constants. Below we can see the actual data that we are trying to simulate; when Theta = 0.5 our graph is very similar to the actual data.

Figure 1.2: Proportion of video accesses sorted by popularity, out of a total of 6176 videos. (Taken from: Understanding User Behavior in Large-Scale Video-on-Demand Systems, EuroSys 2006.)

In figure 1.3 we can see the simulated distribution of the movie copy probabilities (for Theta = 0.5), in comparison to the real data in figure 1.2.

Figure 1.3: The simulated distribution of the movie copies created by our VoD system in the static phase (Theta = 0.5).

4.1 Experiment 1 - Varying arrival gaps

In this experiment we simulated the system with varying arrival gaps between the customer requests. In figure 2.1 we can see that a small gap (under 0.26 seconds) overloads the system, but already with a 0.28-second gap there are no rejected customers. As we can see in figure 2.2, the arrival gap has a strong effect on the number of accepted requests: when we increase the gap, fewer customers arrive, so the total number of serviced customers drops.
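A rough back-of-the-envelope check (our own arithmetic, not from the experiments): with a mean gap of g seconds, about 86400/g requests arrive in 24 hours, so against the theoretical bound of 384,000 serviced customers (section 5.1), any gap below 86400/384000 = 0.225 s must cause rejections; overload already appearing around 0.26 s means the system serves somewhat less than that bound:

```python
def arrivals_per_day(gap_s):
    """Expected number of requests in a 24-hour run for a given mean gap."""
    return 24 * 3600 / gap_s

# Gap at which arrivals would exactly match the section 5.1 upper bound.
break_even_gap = 24 * 3600 / 384_000

print(round(arrivals_per_day(0.27)))   # 320000 arrivals at the default gap
print(round(break_even_gap, 3))        # 0.225 seconds
```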

Figure 2.1: The number of rejected requests as a function of the arrival gap (see section 2.2.2).

Figure 2.2: The number of accepted requests as a function of the arrival gap (see section 2.2.2).

4.2 Experiment 2 - Varying pause frequencies

In this experiment we can see how the number of pauses affects the number of rejections. It might be surprising to see that, although migrations are only performed when a user pauses his movie, the quality of service is still reduced as the number of pauses increases. This happens because the more pauses each user makes, the more idle time the system cannot use: every pause lengthens the time a customer occupies the system (with the default parameters, two 60-second pauses stretch a 90-minute occupancy to 92 minutes, about 2.2% more), and thus the system's quality of service is damaged.

Figure 3: The number of rejected requests as a function of the pause frequency. The blue and red lines were generated when migrations happened only at pauses, and the orange line was generated when migrations happened always (see section 2.2.2).

4.3 Experiment 3 - Varying remote service bandwidth costs

In this experiment we can see how the bandwidth cost of remote service affects the number of rejected requests.

Figure 4: The number of rejected requests as a function of the fraction of bandwidth used for remote service (see section 2.2.2).

4.4 Experiment 4 - Varying migration threshold

Figure 5: The number of rejected requests as a function of the minimum migration threshold (see section 2.2.2).

4.5 Experiment 5 - Varying color migration thresholds

In this experiment we can see the number of rejected customers when we increase the threshold condition of each migration color separately, and when we increase all the thresholds together.

Figure 6: The number of rejected requests as a function of color-specific migration thresholds. Here we compare the performance while changing a different migration threshold each time, and all of them at the same time (see section 2.2.2). We can see that if we disable one of the migration types, the system's quality of service stays about the same (see section 5.3).

4.6 Experiment 6 - Partial migrations

In this experiment we can see that even if we use only a subset of the migration types, we are still able to get good results in most cases.

Red Migrations  Blue Migrations  Green Migrations  Rejected Customers
X               X                X                 298
X               X                -                 102
X               -                X                 244
-               X                X                 176
X               -                -                 151
-               X                -                 315
-               -                X                 overloaded
-               -                -                 overloaded

Table 1: The number of rejected customers as a function of the types of edges used in the simulations.

4.7 Experiment 7 - Changing pause duration

Figure 7: The number of rejected requests as a function of pause duration. Note that when the pause duration is 0 seconds, it is pointless to perform migrations only during pauses; had we taken that factor into account, the system would have overloaded in that case.

4.8 Experiment 8 - Tracking migrations performance over time

Figure 8: The number of migrations over time (represented by the current customer arrival). We can see that we get more migrations in 'free migrations' mode at the left side of the graph, but in the middle this changes and there are more migrations in 'pause migrations' mode (see section 5.7).

4.9 Experiment 9 - Varying pause frequencies

Figure 9: Maximum number of accepted requests as a function of the pause frequency. As in experiment 2, the peak is achieved when we expect 1-1.5 pauses per customer. If there are no pauses, and consequently no migrations, we can service only about 260,000 customers in 24 hours (instead of about 300,000).

5 Conclusions

We now summarize the main results of our study. We explain the results we obtained, both expected and unexpected, sum up their implications, and mention several issues that may be interesting for further research.

5.1 Upper bound

If the system kept every disk fully occupied with customers for the entire 24 hours, the number of customers serviced would be 384,000 (50 disks × 480 customers per disk × 16 movies = 384,000). Of course, this bound can never be reached, because requests do not arrive in groups and because of migrations and remote services, but it gives us a reference point for our results.

5.2 Performance of the migrate-only-at-pauses method

When the system migrates a customer from one server to another, the user may see flickering on the screen if the migration happens while the movie is playing. Because of that, we prefer to migrate customers only when they pause their movie. In section 4.2 we can see that no matter how many pauses we expect the user to make, the results are quite similar whether we migrate only at pauses or whenever possible. In section 4.1, if users are expected to pause their movie twice, the performance is similar to the results we get if we migrate customers whenever possible. From these results we conclude that migrating customers only when they pause their movie does not hurt the quality of service.

5.3 Migrations can be done with a subset of the migration types and still give satisfying results

As we can see in table 1 (section 4.6), in most cases we get good results even if we use only one type of edge. This happens because the total number of migrations is about the same in most cases, so the system compensates for the missing edge types and converges to similar results. The one exception is green edges: because the number of green migrations is relatively low, green edges alone are not enough and the system overloads.
Experiment 5 supports these results: we can see that changing the threshold of just one color does not affect the results very much. Looking at the red line, we can see that disabling the red migrations hurts the quality of service slightly more than disabling the others. The reason is that it is much easier to compensate for the loss of a migration type by using red migrations than by using any other

type of migration, because it is easier to hurt the locality than to improve it. When we disable the red migrations, the overall number of migrations is smaller, which causes more requests to be rejected. Of course, if we raise all thresholds together, the system's migrations are hurt along with the overall performance.

5.4 Too low a migration threshold causes a low migration rate

In experiment 4 (section 4.4) we can see that the minimum migration threshold can definitely be too low. This might be surprising at first, but the reason is that when we lower the minimum migration threshold we not only encourage disks to migrate, but also discourage disks from accepting migrations. The best result lies somewhere in the middle.

5.5 The remote service bandwidth cost has a great impact on the number of rejected customers

In experiment 3 (section 4.3) we can see that even the slightest increase in the remote service bandwidth cost can cause the system to overload very quickly. Under the experiment settings, local services alone do not suffice for a successful simulation, so the system must use remote service; raising the cost by just 0.1 bandwidth units means 10% more total remote service cost, and that is what causes the system to overload so quickly.

5.6 Long pause durations reduce performance

In experiment 7 (section 4.7) we can see that when we increase the pause duration, the number of rejected customers stays steady up to a certain point, then rises very quickly, and eventually the system overloads.

5.7 The system performs better with pause migrations than with free migrations over time

In experiment 8 (section 4.8) we can see that at the beginning there are more migrations in free-migrations mode, but as the simulation progresses, the pause migrations pull ahead. The reason is that with free migrations we follow a kind of greedy algorithm (i.e.
we migrate as soon as we can), whereas with pause migrations we wait longer before migrating and sometimes find a better migration available (blue instead of red, for example); as a result, the load is reduced and there is more room for other customers.
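The intuition behind this result can be sketched as follows. This is an illustrative toy model, not the project's actual simulator; the function names, the edge ranking, and the example sequence of feasible edges are assumptions made for the example. A greedy policy commits to the first feasible migration edge it sees, while a pause-based policy sees every edge that became feasible by pause time and can pick the best color.

```python
# Illustrative toy model of greedy ("free") vs. deferred ("pause") migration.
# Edge colors are ranked by how much they improve locality:
# green (improves) > blue (neutral) > red (hurts). Lower rank is better.
COLOR_RANK = {"green": 0, "blue": 1, "red": 2}

def greedy_choice(edge_stream):
    """Free migrations: take the first feasible edge that appears."""
    for edge in edge_stream:
        return edge  # migrate immediately over this edge
    return None

def deferred_choice(edge_stream):
    """Pause migrations: wait until the pause, then pick the best edge seen."""
    seen = list(edge_stream)
    return min(seen, key=lambda e: COLOR_RANK[e]) if seen else None

# Feasible edges in order of appearance during one customer's session.
edges_over_time = ["red", "red", "blue", "green"]

print(greedy_choice(edges_over_time))    # red   - migrated at once, worse edge
print(deferred_choice(edges_over_time))  # green - waited, found a better edge
```

Waiting trades migration immediacy for migration quality, which matches the crossover seen in figure 8.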

5.8 Migrations improve performance by at most about 20%

In experiment 9, we can see that if we expect the user to make 0 pauses, the system can accept about 260,000 requests successfully. If there are no pauses, then there are also no migrations (because migrations were performed only during pauses). We can also see in figure 9 that the peak is reached with an expected 1-1.5 pauses per user, at about 310,000 accepted requests under the same conditions. From conclusion 5.2 we know that adding more pauses only reduces performance, so we conclude that, at best, migrations improve performance by about 20% (under the same conditions).

6 Further Research

We might have obtained better results if the system changed its parameters dynamically according to its current load. For example, when the system is fairly empty, we may want to decrease the arrival gap to allow more requests through.

The system could give better results if, instead of rejecting a customer, it checked whether a slot is about to become available and serviced the customer a few seconds later. If each disk kept some bandwidth available for that purpose, results might improve further without delaying customers.

We may get better results by adding a delay to the edges instead of using them immediately. For example, if green migrations are performed immediately, blue migrations after 20 minutes, and red migrations after 45 minutes, and before performing the actual migration we check again whether a better migration exists, we might get better results (see section 5.7).

We could also use the information we have about the current load of the clusters and disks to encourage or discourage migrations to a certain disk or cluster.
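The color-dependent delay idea could be sketched as follows. This is a minimal sketch, not part of the project's simulator: the class and method names are hypothetical, and the delay values are the ones suggested above as an example. Candidate migrations are queued with a delay that depends on their color, and before a queued migration is executed we check whether a better-colored candidate for the same customer has appeared in the meantime.

```python
import heapq

# Assumed delays (seconds) per edge color, as suggested above:
# green migrations run immediately, blue after 20 minutes, red after 45.
DELAY = {"green": 0, "blue": 20 * 60, "red": 45 * 60}
COLOR_RANK = {"green": 0, "blue": 1, "red": 2}  # lower is better

class MigrationScheduler:
    """Queue candidate migrations; when they come due, execute only the
    best-colored candidate per customer and drop the worse ones."""

    def __init__(self):
        self._heap = []  # entries: (due_time, color, customer)

    def propose(self, now, color, customer):
        """Register a candidate migration discovered at time `now`."""
        heapq.heappush(self._heap, (now + DELAY[color], color, customer))

    def pop_due(self, now):
        """Return {customer: color} for the best due migration per customer."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap))
        best = {}
        for _, color, customer in due:
            if customer not in best or COLOR_RANK[color] < COLOR_RANK[best[customer]]:
                best[customer] = color
        return best

sched = MigrationScheduler()
sched.propose(now=0, color="red", customer="c1")     # due at t = 2700
sched.propose(now=100, color="blue", customer="c1")  # due at t = 1300
print(sched.pop_due(now=3000))  # {'c1': 'blue'} - the better edge wins
```

The delay gives better edges a chance to appear before a red migration is committed, which is exactly the effect observed in section 5.7.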

Abstract

Video on Demand (VOD) systems are designed to serve a large number of customers who want to order movies at the time they want to watch them. A VOD system contains files (movies) and servers (disks), where each disk can hold a fixed number of movies and serve a fixed number of customers simultaneously. The system consists of four clusters of disks: a super-cluster and three local clusters. The super-cluster contains 20 disks and at least one copy of every movie in the system, and each local cluster contains 10 disks. Customers arrive at the system from three different regions; if the system cannot serve them from their local cluster, it tries to serve them remotely, from a distant local cluster or from the super-cluster, at a certain bandwidth cost. In addition, when customers pause their movie, the system tries to optimize itself by migrating the customer to another, closer disk if possible. Customer migrations are performed during pauses so as not to interrupt the continuity of viewing, but the system can be configured to perform a migration whenever such an opportunity is detected.

To enable these migrations, the system maintains a data structure containing a graph that represents the current state of the system. Every disk in the system has a node in the graph, and between the nodes there may be three types of edges: green, blue, and red. A green edge is one over which migrating a customer brings the customer closer to his cluster (for example, moving from the super-cluster to the customer's local cluster); a blue edge is one over which migrating a customer does not affect the customer's proximity to his cluster (for example, moving from one disk to another within the same cluster); and a red edge is one over which migrating a customer moves the customer away from his cluster (for example, moving from the customer's local cluster to a distant cluster). Every edge has a weight, determined by its color, that affects whether it is chosen: if an edge of a certain color has weight X, the source disk has load level L1, and the target disk has load level L2, then we migrate from the source disk to the target disk only if L2 < X · L1.

The program we wrote runs a 24-hour simulation of the system and records the system's performance under its various parameters. The simulation consists of two main phases: the static phase and the dynamic phase.
The static phase is performed when the system is initialized, before any customers arrive. In this phase two main processes take place: the first determines how many copies of each movie exist in the system, according to the movies' popularity, and the second determines on which disk each copy resides. In the dynamic phase we simulate the arrival of customers from different regions requesting different movies, and the system must match each customer to a disk that can accept him (whether a disk in the local cluster or via remote service) or reject the customer. The system also simulates pauses that customers make while watching, and during these pauses it tries to migrate the pausing customer if possible. The simulation can be run either by reading pre-recorded events from a file or as a "live" simulation of the system. When the run ends, the parameters with which the program ran and the results of the simulation are written to an XML file that serves as the program's output. A successful simulation is defined as one in which at most 1% of the customers were rejected by the system.

The Interdisciplinary Center, Herzliya
Efi Arazi School of Computer Science

Using Client Migrations for Load Balancing in Video on Demand Systems

Submitted as a final project report toward the M.Sc. degree
by the student Adam Oren

The work was carried out under the supervision of Dr. Tami Tamir

September 2007