A Comparison between Centralized and Distributed Cloud Storage Data-Center Topologies




Maurice Bolhuis
University of Twente
P.O. Box 217, 7500AE Enschede
The Netherlands
M.Bolhuis-1@student.utwente.nl

ABSTRACT
The popularity and bandwidth usage of cloud storage services have increased rapidly in recent years [20]. To provide users with cloud storage that has low synchronization latency, cloud storage providers are interested in whether their data-center topology performs efficiently for their users and how they can improve it. It is not clear whether distributed cloud storage data-center topologies perform better than centralized ones. In this paper, a comparison between centralized and distributed cloud storage data-center topologies is made. The topologies used by different cloud storage applications are analyzed with data collected at global vantage points. The average throughput is used as the performance criterion. The results of this paper suggest that using a distributed data-center topology has a positive effect on average throughput compared to a centralized topology. This research contributes to an understanding of the impact of different cloud storage data-center topologies on the performance experienced by cloud storage users.

Keywords
Cloud Storage, Performance, Topology, Comparison, Distributed, Centralized, Dropbox, Google Drive, Measurements

1. INTRODUCTION
The number of cloud storage users is rising rapidly [20]. In addition, users store increasingly more data in the cloud. In 2011, around 7% of consumer content was stored in the cloud. This share is expected to increase to approximately one third of consumer content in 2016 [8]. Therefore, the number of data-centers required to store all those user files will increase. A cloud storage provider can take different approaches to data-center location. Providers like Dropbox and SkyDrive concentrate their storage servers in a small geographical area (often the US) [6, 21].
Figure 1 shows an example of what such a centralized topology could look like. All four clients in the figure connect to the same data-center.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. 19th Twente Student Conference on IT, June 24th, 2013, Enschede, The Netherlands. Copyright 2013, University of Twente, Faculty of Electrical Engineering, Mathematics and Computer Science.

Another approach is to distribute the data-centers around the globe, so that the topology is similar to that of a Content Delivery Network (CDN). Files that are requested often by different users in a CDN are cached at proxy servers near the end-users [26]. Figure 2 shows what such a distributed data-center topology might look like. The figure shows four clients that each connect to a different, nearby data-center. The connections between the data-centers indicate that the data-centers can mirror each other's data.

Figure 1. An example of a centralized topology.
Figure 2. An example of a distributed topology.

Google has data-centers around the globe [9]. Results from [3] indicate that Google might be using these world-wide data-centers for its Google Drive cloud storage. Research indicates that the performance of Dropbox is negatively influenced by its centralized topology due to the distance between the end-users and the data-centers [6]. However, Krishnan et al. [14] suggest that the performance of distributed topologies in which latency-based redirection forwards most clients to a geographically proximate node might be lower than one would expect, due to routing inefficiencies and queuing delays.
Thus, it is unclear to cloud storage providers whether, in terms of end-user performance, it makes a difference if they place new data-centers centrally, near existing data-centers, or distributed, near the end-users. Although research has been done on the general performance of several cloud storage applications [6, 12, 21] and on CDN performance [14], no research has been done on the impact of cloud storage data-center topologies on performance.

The goal of this paper is to get a thorough understanding of these effects by answering the following question: How does the performance of distributed cloud storage data-center topologies differ from that of centralized ones? The main research question focusses on the performance from an end-user perspective. In order to answer this question three sub-questions are proposed:

1. What are centralized and distributed data-center topologies for cloud storage?
2. Which performance criteria can be used for comparing the topologies?
3. What effects do the different topologies have on those performance criteria?

We approach these questions by first performing a literature survey to understand how the centralized and distributed topologies work. A cloud storage provider using a centralized topology and one using a distributed topology need to be found for the performance measurements. Dropbox and Google Drive are chosen, because [6] states that Dropbox uses a single cloud storage data-center location and [3, 4] indicate that Google Drive uses multiple cloud storage data-center locations. Active measurements are performed to validate this. Secondly, a literature survey is performed to formulate the performance criteria. Thirdly, active measurements using a methodology similar to that in [4] are performed to determine the data-center topology performance in different scenarios.

The remainder of this paper is organized as follows: Section 2 describes the related work. Section 3 answers the first sub-question by discussing centralized and distributed data-center topologies and verifying the topology of Dropbox and Google Drive. Section 4 answers the second sub-question by explaining which performance criteria can be used for comparison of data-center topologies, and furthermore explains how the experiment is set up. Section 5 shows the results of the performance experiments and answers the third sub-question. Section 6 concludes the paper and provides recommendations for future work.

2. RELATED WORK
In [6] Drago et al. describe how Dropbox works. Their study is related to this work since it describes features supported by the Dropbox application that could influence the results of the performance comparison. To improve the performance of Dropbox, the study proposes bringing the storage servers closer to the end-user. Krishnan et al. [14] performed a study on Google's CDN infrastructure and found that the performance of such a distributed topology is lower than one might expect due to routing inefficiencies. Slatman describes a methodology for measuring upload traffic while uploading files to SkyDrive [21]. This methodology was used to show that random files are suitable for the topology performance comparison of Dropbox and Google Drive. In [4], Bocchi et al. benchmark 5 popular cloud storage services, including Dropbox and Google Drive, and describe a methodology for determining data-center locations. In this paper additional scenarios are tested following this methodology for Dropbox and Google Drive. Furthermore, only the shortest round trip time was used for topology validation. Testa et al. explain how a distributed data-center topology works [22]. This information was used to define a distributed data-center topology and answer sub-question 1.

3. DATA-CENTER TOPOLOGIES
In this section the first sub-question is answered by introducing a definition of centralized and distributed cloud storage data-center topologies. The experiment of Bocchi et al. [4] is reproduced to validate that Dropbox still uses a centralized data-center topology and Google Drive uses a distributed data-center topology.

3.1 Methodology
The information of Testa et al. [22] was used to define centralized and distributed cloud storage data-center topologies. To obtain an overview of cloud storage providers using a centralized or a distributed topology, some popular providers were searched for on Google using "cloud storage backup" as the search term.
The websites of these providers were visited to search for information on data-center locations. Providers do not always state where they store user data, or are not very specific about it. Therefore, for some providers the topology could not be determined. Furthermore, the search for providers was not exhaustive. Both a cloud storage provider using a distributed topology and one using a centralized topology needed to be found for the topology performance comparison. Dropbox and Google Drive were chosen for topology validation, because the results of [6] and [3, 4] suggest that Dropbox and Google Drive use a centralized and a distributed topology respectively.

To validate that Dropbox still uses a centralized data-center topology and Google Drive uses a distributed topology, the experiment of Bocchi et al. [4] was reproduced. 200 PlanetLab nodes from 40 different countries in 5 continents were set up.

Figure 3. PlanetLab nodes used for measurements.

The cloud storage hostnames used by Dropbox were retrieved from [6]. Cloud storage hostnames used by Google Drive were collected by measuring the synchronization traffic of Google Drive 1.8 using Wireshark. The IP addresses were resolved from those hostnames by querying approximately 2500 DNS servers spread around the world. Because the MaxMind database is inaccurate for geolocation of IP addresses of internal networks or large corporate networks [23], an extra step to validate the topology was necessary. Each PlanetLab node sent ping packets to each of the cloud storage IP addresses. The location of the PlanetLab node with the shortest ping time to a storage IP address is assigned as the approximate data-center location. The RTT (Round Trip Time) is not accurate for determining the exact distance between two nodes, because of routing inefficiencies or congestion. Furthermore, some PlanetLab nodes have a lower-speed Internet connection, which results in higher RTTs.
However, since the aim is to validate distributed and centralized topologies and not to determine the exact data-center location, this method will be accurate enough.
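The nearest-node assignment described above can be illustrated with a minimal sketch. It assumes the ping results have already been reduced to the lowest RTT per (node, storage IP) pair; the function name, node names and IP addresses are made up for illustration and are not taken from the measurements.

```python
from typing import Dict, Tuple


def closest_node_per_ip(min_rtt_ms: Dict[Tuple[str, str], float]) -> Dict[str, str]:
    """Map each storage IP to the vantage point with the smallest RTT to it.

    min_rtt_ms maps (node_name, storage_ip) to the lowest ping time in ms.
    """
    best: Dict[str, Tuple[str, float]] = {}
    for (node, ip), rtt in min_rtt_ms.items():
        # Keep the node seen so far with the lowest RTT to this IP.
        if ip not in best or rtt < best[ip][1]:
            best[ip] = (node, rtt)
    return {ip: node for ip, (node, _) in best.items()}


# Made-up measurements: three hypothetical nodes, two documentation-range IPs.
sample = {
    ("node-us-east", "192.0.2.10"): 4.1,
    ("node-nl", "192.0.2.10"): 92.5,
    ("node-jp", "192.0.2.10"): 180.3,
    ("node-us-east", "198.51.100.7"): 95.0,
    ("node-nl", "198.51.100.7"): 3.2,
    ("node-jp", "198.51.100.7"): 250.8,
}
closest = closest_node_per_ip(sample)
distinct_nodes = set(closest.values())
```

If all storage IP addresses map to a handful of nodes in one region, that points to a centralized topology; if they map to nodes spread over many regions, that points to a distributed one.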

The IP address of the PlanetLab node with the shortest ping time to a certain cloud storage IP address was queried against the MaxMind GeoLite database to obtain the coordinates.

3.2 Centralized Topology

3.2.1 Definition
In this paper a centralized cloud storage data-center topology is defined as a topology in which the cloud storage provider has one or a few data-centers located in a small geographical area, so that the distance between the end-user and the data-center can potentially be large. An advantage of a centralized data-center topology is the economies of scale of operational expenses [16]. A disadvantage is the higher risk of a single point of failure. Table 1 gives an overview of some providers that have all their data-centers in a small geographical area.

Table 1. Providers with geographically centralized data-centers.

Cloud storage provider    Data-center location
AltDrive                  United States [1]
Dropbox                   United States [7]
SkyDrive                  United States [21]
Wuala                     Switzerland, Germany, France [25]

3.2.2 Dropbox Topology Validation
In total 51 different Dropbox storage IP addresses have been collected for validating that Dropbox uses a centralized topology with its data-centers located in the United States. However, as can be seen in figure 4, only 4 unique PlanetLab nodes, all located on the East Coast of the United States, had a shortest RTT to a certain Dropbox IP address. From these results it can be concluded that Dropbox is using a centralized topology for its cloud storage services.

Figure 4. PlanetLab nodes with a shortest RTT to a Dropbox storage IP address.

3.3 Distributed Topology

3.3.1 Definition
A distributed cloud storage data-center topology is a topology in which the cloud storage provider has multiple data-centers spread over a large geographical area and in which the user stores and retrieves data from the data-center closest to it. Testa et al. explain that DNS-based load-balancing algorithms are primarily used to determine the data-center to which traffic has to be directed [22]. Figure 5 shows how the clients are distributed over the data-centers. The content switch constantly monitors the traffic load of the data-centers (step 1). A client first performs a DNS query to retrieve the IP address of the data-center when trying to save its data in the cloud (step 2). Content switches at the edge of each data-center can resolve IP addresses from the hostname in the DNS query (step 3) [22]. The IP address returned by the content switch can depend on several factors, such as the DNS resolver's IP address or geographical location [26], the RTT [23], the traffic load and the number of active servers at each location [22]. Depending on the IP address the content switch returns, the client connects to a certain data-center (step 4). Data is mirrored between data-centers at the back-end [22], which enables end-users to also access files from the closest data-center when they are, for example, in another country (step 5).

Figure 5. Distribution based on DNS. Figure taken and slightly adapted from [22].

Table 2 shows an overview of some providers that have multiple data-centers spread over the world. Not all provider topologies were validated, because only one provider using a distributed topology needed to be found for the data-center topology performance comparison. However, this could be done using the same methodology as described in [4].

Table 2. Providers with geographically distributed data-centers.

Cloud storage provider    Data-center locations
ASUS WebStorage           Worldwide [2]
DollyDrive                Europe and United States [5]
Google Drive              Worldwide [3, 4, 9]
JungleDisk                Worldwide [13, 19]
Memopal                   Worldwide [15]
Mozy                      Worldwide [17]

3.3.2 Google Drive Topology Validation
This section describes the results of the Google Drive data-center topology validation. Table 3 shows the hostnames discovered while measuring the traffic of Google Drive. Only the IP addresses of the file storage and file upload hostnames were resolved using the DNS servers, because these IP addresses can give an indication of data-center location.

Table 3. Hostnames used by Google Drive and the corresponding service each is used for.

Hostname                              Service
accounts.google.com                   Authentication
clients.l.google.com                  Notifications
googlehosted.googleusercontent.com    File storage
upload.drive.google.com               File upload

Performing a dig command on these IP addresses returned hostnames such as wg-in-f132.1e100.net. Some additional valid Google hostnames and IP addresses were obtained by changing either the first two letters in the hostname or the number behind the letter f in the hostname. In total 377 Google IP addresses have been acquired.

Figure 6. PlanetLab nodes with a shortest RTT to a Google Drive IP address.

Figure 6 shows the results of the Google Drive topology validation. Only the PlanetLab nodes that had a shortest RTT to one or more of the 377 Google IP addresses are shown. The PlanetLab nodes with a shortest ping time to a certain Google IP address are distributed as follows: 19 unique nodes from 2 different countries in North America, 2 unique nodes from 2 different countries in South America, 13 unique nodes from 10 different countries in Europe, 6 unique nodes from 4 different countries in Asia and 1 unique node from 1 country in Oceania. Figure 6 shows great similarities with the map of Edge Points of Presence that Google presents on its website [11]. Therefore, there is a strong indication that Google is using a distributed topology for Google Drive. From now on it is assumed that Google Drive uses a distributed topology. It will be used for the comparison between centralized and distributed cloud storage data-center performance.

4. SETUP OF THE EXPERIMENT
This section answers the second sub-question. It discusses some performance criteria to determine how topology performance can be measured from a user perspective.
A file structure needed to be found for comparing data-center topology performance in different scenarios, such that the amount of traffic generated by uploading or downloading these files to Dropbox and Google Drive is similar, since the amount of network traffic transferred influences the performance. Furthermore, the methodology used for the performance experiments is explained.

4.1 Methodology
Performance criteria need to be found in order to compare data-center topology performance. RFC 3577 was used to define these performance criteria [24]. The flow of actions in cloud storage file synchronization was also studied to determine the measurement interval. The upload and download traffic sent by Google Drive and Dropbox was studied using Wireshark. Cloud storage applications can use different performance-enhancing features, such as chunking, that could interfere with the results. A literature study of [4] showed the maximum chunk size used by Google Drive and Dropbox. Active measurements using the same methodology as in [21] were performed to test whether uploading files containing readable text with repetition generates the same amount of traffic for both Google Drive and Dropbox. Google Drive release 1.8 together with Dropbox release 2.06 was installed on a Windows 7 virtual machine. The amount of upload traffic generated by uploading these files to Google Drive and Dropbox was measured using Wireshark. This experiment was repeated with random files.

4.2 Performance Criteria/Metrics
Section 4.9 of RFC 3577 defines performance as the quality of service delivered to end-users by applications [24]. According to this RFC, human end-users perceive only the availability and responsiveness of an application [24]. It defines these concepts as follows [24]:

Availability - The percentage of the time that the application is ready to give a user service.
Responsiveness - The speed at which the application delivers the requested service.

The application is only ready to give the user cloud storage service when the data-center is up. Therefore, availability in the context of cloud storage is the uptime of the data-centers. Because providers using a distributed topology and providers using a centralized topology both in general have high uptimes of around 99.97% [18], availability will not be discussed further in this paper.

Cloud storage providers offer storage in the cloud as a service. The speed at which a cloud storage application delivers this service can be measured by the speed at which files are transferred between the provider and the user. In this paper, responsiveness is defined as the average throughput during file transfer. The effects of a difference in responsiveness can be observed in the average throughput when it is calculated per file size, since timing is then the only variable to be monitored.

Figure 7 shows the typical flow of actions in cloud storage file synchronization. Figure 7a shows that after the file is added, the application needs some processing time. It then sets up the TCP connection and exchanges the SSL keys. Next, it transfers the file to the server and afterwards closes the TCP connection. Figure 7b shows the typical flow of actions when downloading a file from the cloud. The difference compared to figure 7a is that the file is transferred from the server to the client and that the flow includes the time for requesting and retrieving a file from the data-center. Performance will be measured from connection setup until the end of the file transfer. In the figure this is indicated by the shaded area. The processing time of the application is left out of the measurement interval, because it is the topology that is being studied. Furthermore, the teardown of the SSL connections is also left out, because it has large delays and the user in general only perceives performance up to the moment at which the entire file is uploaded to the server or downloaded to the local storage folder.
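A minimal sketch of how throughput over this measurement interval might be computed from a packet capture is given here. It assumes the packets of one direction of a single cloud storage TCP stream are available in time order as (timestamp, bytes on the wire, payload bytes) tuples, and it takes the end of the file transfer to be the last packet carrying payload. The tuple layout, function names and sample numbers are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, List, Tuple

# (timestamp in seconds, bytes on the wire, TCP payload bytes)
Packet = Tuple[float, int, int]


def throughput_mbit_s(packets: List[Packet]) -> float:
    """Throughput from the first packet up to the last packet with payload."""
    if not packets:
        raise ValueError("empty capture")
    start = packets[0][0]  # connection setup
    payload_times = [t for t, _, payload in packets if payload > 0]
    if not payload_times:
        raise ValueError("no payload-carrying packets")
    end = payload_times[-1]  # end of file transfer, teardown excluded
    if end <= start:
        raise ValueError("interval has no duration")
    # Every byte captured inside the interval counts, retransmissions included.
    total_bytes = sum(size for t, size, _ in packets if t <= end)
    return total_bytes * 8 / (end - start) / 1e6


def average_per_file_size(runs: List[Tuple[float, float]]) -> Dict[float, float]:
    """Average the per-run throughput for each file size (in MB)."""
    grouped: Dict[float, List[float]] = defaultdict(list)
    for file_size_mb, throughput in runs:
        grouped[file_size_mb].append(throughput)
    return {size: mean(values) for size, values in grouped.items()}


# Made-up capture: handshake packet, two data packets, teardown packet.
capture = [(0.0, 66, 0), (0.05, 1514, 1448), (1.0, 1514, 1448), (1.2, 66, 0)]
single_run = throughput_mbit_s(capture)
averages = average_per_file_size([(0.5, 8.0), (0.5, 10.0), (1.0, 12.0)])
```

In this made-up capture the teardown packet at 1.2 s falls outside the interval, so only the first three packets contribute to the byte count.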

Figure 7. Flow of actions in cloud storage file synchronization: (a) upload, (b) download. The upload flow consists of file added, processing time, TCP connection setup, SSL key exchange, file transfer and TCP connection teardown. The download flow additionally includes requesting the file and retrieving it from the data-center before the file transfer.

For uploading, only the traffic from the client to the server is selected within the cloud storage TCP stream. For downloading, it is the other way around. The interval used in the throughput calculation runs from the first packet within the cloud storage TCP stream (connection setup) until the last packet within the TCP stream that has a nonzero payload (end of file transfer). The throughput is then calculated by dividing the number of bytes measured within this interval (including retransmissions) by the interval time. The average throughput is calculated by averaging the calculated throughput per file size. This will be measured in different scenarios that are discussed in section 4.4.

4.3 Defining File Test Sets
Not just any file structure can be used in the performance experiments, because it could result in large differences in upload or download traffic between the cloud storage applications due to performance-enhancing features. For example, figure 8 shows the amount of traffic generated by uploading Lorem Ipsum files to Google Drive and Dropbox.

Figure 8. Upload traffic (MB) versus file size (MB) when uploading Lorem Ipsum files to Google Drive and Dropbox.

The curve showing the amount of upload traffic of Dropbox fluctuates considerably. According to Drago et al., Dropbox chunks files larger than 4 MB [6].
Chunking is a feature to split files into chunks of a maximum size [4]. Because the Lorem Ipsum files contain repetition, certain chunks of larger files might not get uploaded due to the deduplication feature of Dropbox [6]. Data deduplication is a technique to prevent re-transmission of file chunks already present on the server [4]. From figure 8 it can be concluded that both Google Drive and Dropbox compress the Lorem Ipsum files, because the measured upload traffic is less than the size of the file being uploaded.

Figure 9 shows the amount of upload traffic when uploading random files to Dropbox and Google Drive. The figure shows that the random files are not compressed. Furthermore, it shows that the amount of upload traffic is not affected by the deduplication feature when random files are used. The chunking feature of Dropbox does not interfere for files smaller than 4 MB. According to Bocchi et al. the maximum chunk size of Google Drive is 8 MB [4]. Therefore, files with sizes up to 4 MB can be used for performance experiments without the chunking feature interfering with the results of either Dropbox or Google Drive. From the figure it can be concluded that Dropbox has a higher overhead than Google Drive. The absolute difference in upload traffic increases to approximately 2.7 MB as the file size increases to 10 MB. This absolute difference is smaller for smaller files.

Figure 9. Upload traffic (MB) versus file size (MB) when uploading random files to Google Drive and Dropbox.

The amount of upload traffic sent by Google Drive and Dropbox is very different when Lorem Ipsum files are uploaded. Therefore, these files will not be used for the performance experiments. From figure 9 it can be concluded that random files are not affected by the compression and deduplication features. Therefore, these files are suitable for the performance experiments.
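The difference between the two kinds of test files can be illustrated with a small sketch. It uses Python's zlib as a stand-in compressor, which is an assumption: the compression schemes actually used by the two clients are not specified here, and the repeated sentence is only a simplified stand-in for a Lorem Ipsum file.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Compressed size divided by original size (lower means more compressible)."""
    return len(zlib.compress(data)) / len(data)


SIZE = 1_000_000  # 1 MB test files

# Random bytes, comparable to the random files used in the experiments.
random_file = os.urandom(SIZE)

# Repetitive readable text as a simplified stand-in for a Lorem Ipsum file.
sentence = b"Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
repetitive_file = (sentence * (SIZE // len(sentence) + 1))[:SIZE]

random_ratio = compression_ratio(random_file)
repetitive_ratio = compression_ratio(repetitive_file)
```

The random file stays at roughly its original size after compression, while the repetitive text shrinks to a small fraction of it. This is in line with the observation that random files make the upload traffic of the two applications comparable.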
Because the absolute difference in upload traffic between Google Drive and Dropbox is small for random files up to 4 MB compared to larger files, these smaller files are preferred.

4.4 Performance Experiment
A benchmarking methodology similar to that in [4] was used. A PC located in the Netherlands running Debian 7.0 together with a Windows 7 Enterprise virtual machine was used for the performance experiments. A Windows virtual machine was necessary, because Google Drive only has Windows, Mac, Android and iOS applications [10]. Both Dropbox and Google Drive were installed in the virtual machine. The PC is connected to a 1 Gbit/s Ethernet network at the University of Twente. An additional Windows PC was used for measuring throughput in the download experiments. Random files with sizes in the range of 0.5 MB to 4 MB were used for the performance experiments. These file sizes were chosen in accordance with the results of section 4.3. Each experiment was repeated 50 times. The average throughput is calculated as described in section 4.2. A confidence interval of 95% was used for plotting the results.

Figure 10. Setup for performance measurements: (A) upload, (B) download. The setup consists of a Linux PC (Linux controller, interface eth0) hosting the Windows 7 VM, a DNS server, the cloud storage provider and, for the download experiment, an additional Windows PC (Windows controller). The numbered steps are the DNS requests and responses (0a to 0d), the FTP transfer (1) and the upload and download steps (2 to 4).

The first scenario consists of adding random files one at a time to the local cloud storage folder of both Dropbox and Google Drive and measuring the upload traffic. Figure 10.A shows how this was implemented. When the cloud storage application is started, it performs a DNS request to resolve the IP addresses of the cloud storage hostnames (steps 0a and 0b). Random files created on the Linux PC can be transferred to the cloud storage folder within the Windows 7 VM via an FTP server that was running on it (step 1). The resulting upload traffic is routed via the Linux PC to the cloud storage provider, so that the Linux PC can measure the traffic (steps 2 and 3).

The second scenario also consists of adding files to the cloud storage folder one at a time. However, this time the files are uploaded to the provider via an additional Windows PC, which had both cloud storage applications installed with the same credentials. Figure 10.B shows the setup for this download experiment. When the cloud storage application is started, the IP addresses are resolved from the cloud storage hostnames on both the Linux PC and the Windows PC (steps 0a, 0b, 0c and 0d). The Linux PC creates a random file and transfers it via FTP to the cloud storage folder on the Windows PC (step 1). This Windows PC then starts to upload the file to the provider (step 2).
After a while, the file is downloaded to the cloud storage folder within the Windows 7 VM (steps 3 and 4). This download traffic is routed via the Linux host PC, which can measure it. Dropbox LAN sync, a Dropbox feature for synchronizing files within a LAN [6], was disabled for this experiment to prevent it from interfering with the results.

For the third scenario, both the upload and download experiments were repeated for Google Drive. In contrast to the previous scenarios, the Windows 7 VM and the Windows PC were forced to use a DNS server located in Gibbstown, United States. When Google Drive resolves a hostname for synchronizing files on startup, as shown by steps 0a and 0b in figure 10.A and the additional steps 0c and 0d in figure 10.B, the IP address of a data-center located in the US is returned. This forces the Google Drive application to synchronize its data with the data-center located in the United States. This DNS server was chosen because Gibbstown is relatively close to the Dropbox cloud storage data-centers in Virginia. This experiment was also done with a DNS server in Ireland. Ireland was chosen because of its location between the Netherlands and the US.

5. RESULTS

In this section the results of the performance experiments are discussed and the third sub-question is answered.

5.1 Upload Results

Figure 11 shows the average throughput while uploading files of different sizes to Google Drive and Dropbox. The Google Drive NL curve shows the average throughput during uploading while using the University of Twente DNS server. The Google Drive US curve shows the average throughput during uploading while using a US DNS server in Gibbstown, New Jersey. The Google Drive IR curve shows the average throughput during uploading while using a DNS server in Cork, Ireland. The results of Dropbox and Google Drive NL were obtained by using the methodology of the first scenario as described in section 4.4.
The results of Google Drive US and Google Drive IR were obtained by using the methodology of the third scenario as described in section 4.4.

Figure 11. Throughput while uploading different size files to Google Drive and Dropbox. (x-axis: File Size (MB), 0.5 to 4; y-axis: Throughput (Mbits/s); curves: Dropbox, Google Drive using University Twente DNS server (NL), Google Drive using DNS server Gibbstown (US), Google Drive using DNS server Cork (IR).)

The Dropbox curve shown in figure 11 first increases slightly. This could be explained by the TCP slow start mechanism. Because Dropbox has a high RTT, it takes a while for the acknowledgements to be received by the sender and therefore for the congestion window to increase. For relatively small files, Dropbox has already finished uploading before the TCP slow start phase ends, and thus before a larger congestion window is reached. This results in a lower throughput for these smaller files. The Google Drive IR curve follows a similar course. That the Google Drive NL curve does not show the same increase could be explained by its relatively low RTT compared to Dropbox. With a lower RTT, acknowledgements are received sooner, so the TCP congestion window also increases faster, which already results in a higher throughput for smaller files. However, it is unclear why the Google Drive US curve is nearly constant. It was expected that the Google Drive US curve would be similar to the Dropbox curve, because Google Drive US also has a high RTT.

Clearly, Google Drive NL performs much better in uploading files than Dropbox, Google Drive IR and Google Drive US. For all file sizes used, the average throughput of Google Drive NL in the figure is at least 1.7 times higher than that of Dropbox. For the 0.5 MB files, the average throughput of Google Drive NL is even 2.2 times higher. Remarkably, Google Drive NL has larger confidence intervals. The average throughput of Google Drive IR in figure 11 is at least 1.4 times lower than that of Google Drive NL. For the 0.5 MB files, this difference is a factor of 1.9.

Google Drive US performs worse than Google Drive NL, Google Drive IR and Dropbox. The Google Drive US average throughput is 5.1 times lower than that of Dropbox for the 0.5 MB files. For 3 MB files, this difference is even a factor of 6.6. It was expected that the results of Google Drive US and Dropbox would be closer to each other, because they both use a data-center in the US. It is unclear what causes the Google Drive US throughput to be so much lower than that of Dropbox. Compared to Google Drive NL, the average throughput of Google Drive US in the figure is at least 8.9 times lower.

5.2 Download Results

Figure 12 shows the average throughput while downloading files of different sizes from Google Drive and Dropbox. For Google Drive NL, Google Drive US and Google Drive IR, the same DNS servers as described in section 5.1 were used. The results of Dropbox and Google Drive NL were obtained by using the methodology of the second scenario as described in section 4.4. The results of Google Drive US and Google Drive IR were obtained by using the methodology of the third scenario as described in section 4.4.

Figure 12. Throughput while downloading different size files from Google Drive and Dropbox. (x-axis: File Size (MB), 0.5 to 4; y-axis: Throughput (Mbits/s); curves: Dropbox, Google Drive using University Twente DNS server (NL), Google Drive using DNS server Gibbstown (US), Google Drive using DNS server Cork (IR).)

The figure shows that for the file sizes tested, the average throughput of Google Drive NL is higher than that of Dropbox and Google Drive US. The average throughput of Google Drive NL in the figure is at least 1.8 times higher than that of Dropbox and 1.05 times higher than that of Google Drive US. In contrast to the upload results, Google Drive US performs better than Dropbox.
This difference in throughput increases from a factor of 1.1 for 0.5 MB files to a factor of 1.9 for 4 MB files.

It is remarkable that for most file sizes the Google Drive IR curve lies above the Google Drive NL curve, because the data-center of Google Drive IR is farther away from the client. No clear explanation could be found for this result. However, the error margins seem to overlap for file sizes up to 2 MB, which could mean that the average throughput values would move closer together if additional test runs were carried out. Because the measurements were not all performed at the same time, the network may have been more congested during the Google Drive NL measurements than during the Google Drive IR measurements. It could also be the case that provisioning for the data-center in the Netherlands is lower than for the data-center in Ireland.

For Google Drive US and Google Drive IR, the average throughput is higher when downloading than when uploading. However, for Google Drive NL and Dropbox the throughput is lower when downloading than when uploading. This is remarkable, because the PC has a symmetric Internet connection. This lower throughput can be explained by the time it takes for the data-center to actually send a file after it has been requested. It takes some time for the server to retrieve the file, as shown in figure 7 in section 4.2. For Dropbox this time is in the order of a few seconds, whereas for Google it is only a few tenths of a second.

The results of Google Drive IR, NL and US are relatively close. An inspection of the pcap files showed that the TCP receive window of the client often drops to almost zero. This could be explained by the virtual machine limiting the TCP receive window of the client. Consequently, the server could not send the files to the client at full speed.

6. CONCLUSIONS AND FUTURE WORK

This paper compares the performance effects of distributed versus centralized cloud storage data-center topologies.
A centralized cloud storage data-center topology was defined as a topology in which the cloud storage provider has one or a few data-centers located in a small geographical area. A distributed cloud storage data-center topology was defined as a topology in which the cloud storage provider has multiple data-centers spread over a large geographical area and in which users store and retrieve data from the data-center closest to them. It was verified that Dropbox uses a centralized topology. Furthermore, it was made plausible that Google Drive uses a distributed topology. Average throughput was identified as a possible performance criterion for comparing centralized and distributed data-center topologies from a user perspective.

The results of the performance measurements suggest that, in general, using a distributed data-center topology in which the data-center is closer to the client has a positive effect on performance compared to a centralized topology. However, client and server configuration also play an important role. For example, our measurements show that TCP configuration is sometimes more important for performance than the data-center location.

For future work, the performance measurements could be performed using a different methodology in which other bottlenecks are removed. Due to the limited time available, the measurements were not performed at more locations. In future work, the topology performance could also be measured at more locations and on different continents, such that for the distributed topology the clients are actually close to the data-center. Furthermore, the amount of time it takes for a file to spread in a distributed topology could be used as a metric for distributed topology performance.

7. ACKNOWLEDGEMENTS

The author would like to thank I. Drago, M.Sc. at the University of Twente, for his continuing support and valuable feedback. Furthermore, the author would also like to thank Dr. ir. G.
Karagiannis for his guidance and continuous feedback.

8. REFERENCES

[1] AltDrive. "Online Backup Biz Service". http://altdrive.com/biz.html. Accessed on: April 30, 2013.
[2] ASUS. "Is there any chance that my data will be lost or damaged?". https://support.asuswebstorage.com/estorage/eservice/1033/FAQViewKBArticle.aspx?kbarticleid=ccd55d57-702ade11-8f56-005056000173. Accessed on: April 30, 2013.
[3] E. Bocchi, "Personal Storage Services", Bachelor assignment, Politecnico di Torino, 2013.
[4] E. Bocchi, M. Mellia, I. Drago, A. Pras, and H. Slatman, "Benchmarking Personal Cloud Storage", Under submission for IMC, 2013.
[5] DollyDrive. "Dolly Drive is currently the best way to back up a Mac to the cloud". http://www.dollydrive.com/2012/02/dolly-drive-iscurrently-the-best-way-to-back-up-a-mac-to-the-cloudthanks-germany/. Accessed on: April 30, 2013.
[6] I. Drago, M. Mellia, M. M. Munafo, A. Sperotto, R. Sadre, and A. Pras, "Inside dropbox: understanding personal cloud storage services". In Proceedings of the 2012 ACM conference on Internet measurement conference, Boston, Massachusetts, USA, pp. 481-494, 2012.
[7] Dropbox. "Where does Dropbox store everyone's data?". https://www.dropbox.com/help/7/en/. Accessed on: March 8, 2013.
[8] Gartner. "Gartner Says That Consumers Will Store More Than a Third of Their Digital Content in the Cloud by 2016". http://www.gartner.com/newsroom/id/2060215/. Accessed on: March 27, 2013.
[9] Google. "Data center locations". http://www.google.com/about/datacenters/inside/locations/index.html. Accessed on: March 6, 2013.
[10] Google. "Get Google Drive everywhere". https://www.google.com/intl/en_us/drive/start/download.html. Accessed on: May 18, 2013.
[11] Google. "Peering & Content Delivery". https://peering.google.com/about/delivery_ecosystem.html. Accessed on: May 10, 2013.
[12] W. Hu, T. Yang, and J. N. Matthews, "The good, the bad and the ugly of consumer cloud storage". Operating Systems Review (ACM), vol. 44, no. 3, pp. 110-115, 2010.
[13] Jungle Disk. "Jungle Disk Personal". https://www.jungledisk.com/personal/. Accessed on: April 30, 2013.
[14] R. Krishnan, H. V. Madhyastha, S. Srinivasan, S. Jain, A. Krishnamurthy, T. Anderson, and J. Gao, "Moving beyond end-to-end path information to optimize CDN performance". In Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference, Chicago, Illinois, USA, pp. 190-201, 2009.
[15] Memopal. "How is my data protected?". http://memopal.com/en/supportfaq/how-is-my-dataprotected.htm. Accessed on: April 30, 2013.
[16] D. Mohney. "Centralized vs. distributed data centers: Green, redundant, both?". http://www.greendatacenternews.org/articles/473515/centralized-vs-distributed-data-centers-green-redu/. Accessed on: April 30, 2013.
[17] Mozy. "Knowledge Base". http://support.mozy.com/articles/en_us/faq/how-does-Mozy-protect-my-data-1342312483721/. Accessed on: April 30, 2013.
[18] Pingdom. "Cloud storage shoot-out: Google Drive vs. Dropbox vs. SkyDrive vs. Box". http://royal.pingdom.com/2012/06/21/cloud-storage-shootout-google-drive-vs-dropbox-vs-skydrive-vs-box-com/. Accessed on: May 19, 2013.
[19] Rackspace. "Our Global Data Centers". http://www.rackspace.com/whyrackspace/network/datacenters/. Accessed on: April 30, 2013.
[20] J. Rebello. "Subscriptions to Cloud Storage Services to Reach Half-Billion Level This Year". http://www.isuppli.com/mobile-and-wireless-Communications/News/Pages/Subscriptions-to-Cloud-Storage-Services-to-Reach-Half-Billion-Level-This-Year.aspx. Accessed on: March 27, 2013.
[21] H. Slatman, "Opening Up the Sky: A Comparison of Performance-Enhancing Features in SkyDrive and Dropbox". In Proceedings of the 18th Twente Student Conference on IT, 2013.
[22] S. Testa, and W. Chou, "The distributed data center: frontend solutions". IT Professional, vol. 6, no. 3, pp. 26-32, 2004.
[23] R. Torres, A. Finamore, J. R. Kim, M. Mellia, M. M. Munafo, and S. Rao, "Dissecting Video Server Selection Strategies in the YouTube CDN". In Proceedings of the 2011 31st International Conference on Distributed Computing Systems, pp. 248-257, 2011.
[24] S. Waldbusser, R. Cole, C. Kalbfleisch, and D. Romascanu, "Introduction to the Remote Monitoring (RMON) Family of MIB Modules", RFC 3577, Internet Engineering Task Force (IETF), 2003.
[25] Wuala. "About Wuala". http://www.wuala.com/en/about/. Accessed on: May 20, 2013.
[26] N. C. Zakas. "How content delivery networks (CDNs) work". http://www.nczonline.net/blog/2011/11/29/howcontent-delivery-networks-cdns-work/. Accessed on: April 29, 2013.
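
The slow-start explanation given in section 5.1 can be illustrated with a small idealized model. The sketch below is not part of the measurement setup and does not reproduce the measured values. It only shows why, for a loss-free transfer, average throughput rises with file size when the RTT is high and stays nearly flat when the RTT is low. The function names and all parameter values (a segment size of 1460 bytes, an initial congestion window of 10 segments, a 20 Mbit/s bottleneck, and RTTs of 10 ms and 100 ms) are assumptions chosen for illustration only.

```python
import math


def transfer_time(file_bytes, rtt_s, bottleneck_bps, mss=1460, init_cwnd=10):
    """Idealized transfer time in seconds for a loss-free TCP slow start.

    Assumptions (hypothetical): the congestion window doubles every round,
    no packets are lost, and a round lasts one RTT or as long as the
    bottleneck link needs to serialize the burst, whichever is longer.
    """
    remaining = math.ceil(file_bytes / mss)  # segments still to be sent
    cwnd = init_cwnd                         # congestion window in segments
    elapsed = 0.0
    while remaining > 0:
        burst = min(cwnd, remaining)
        serialization = burst * mss * 8 / bottleneck_bps
        elapsed += max(rtt_s, serialization)
        remaining -= burst
        cwnd *= 2                            # slow start: double per round
    return elapsed


def throughput_mbps(file_bytes, rtt_s, bottleneck_bps):
    """Average throughput in Mbit/s over the whole transfer."""
    seconds = transfer_time(file_bytes, rtt_s, bottleneck_bps)
    return file_bytes * 8 / seconds / 1e6


if __name__ == "__main__":
    BOTTLENECK = 20e6  # assumed bottleneck rate of 20 Mbit/s
    for label, rtt in (("low RTT (10 ms)", 0.010), ("high RTT (100 ms)", 0.100)):
        for size_mb in (0.5, 1, 2, 4):
            size = int(size_mb * 1_000_000)
            rate = throughput_mbps(size, rtt, BOTTLENECK)
            print(f"{label}: {size_mb} MB -> {rate:.1f} Mbit/s")
```

Under these assumptions, the high-RTT transfer spends several full round trips in slow start before the window fills the bottleneck, so small files see a clearly lower average throughput than large files. The low-RTT transfer reaches the bottleneck rate almost immediately for all file sizes.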