Department of Computer Science
Georgia Institute of Technology

FINAL REPORT

Back2Cloud: A Highly Scalable and Available Remote Backup Service

Professor Ling Liu
CS 8803 Advanced Internet Application Development
Spring 2009

Team #9
Supraja Narasimhan
Jerry Philip
Amit Warke
Lateef Yusuf
TABLE OF CONTENTS

Idea
Motivation
Amazon Simple Storage Service (S3)
System Architecture
Back2Cloud Features
  Versioning
  Chunking
  Compression
  Encryption
Back2Cloud-Online
Performance Analysis
Future Work
Challenges
Conclusion
References
Idea

To implement a remote backup service that automatically syncs all of a user's important data to the remote cloud computing framework provided by the Amazon Simple Storage Service (S3), for the purpose of data restoration in the event of any local data loss.

Motivation

Many businesses use tape-based backup servers to save all of their important data for later restoration in the event of data loss. However, few realize that even these backup servers are susceptible to various kinds of failures, such as server crashes, virus attacks, human errors, and natural disasters. Any of these can lead to catastrophic loss of all data without leaving a trace, and such complete data loss can put companies out of business with losses running into millions of dollars. Backup and recovery plans also require significant investment in infrastructure: hardware, software, and the training of employees to manage the backup servers. Moreover, a lot of time is spent backing up data to the servers.

A more cost-effective and efficient way to back up important data is to incorporate a remote backup service into the business. A remote backup service provides users with a remote storage repository, accessible through the internet from anywhere in the world, to which they back up their data. Remote backups are very easy to implement and require only software installed on the user side. This software takes care of syncing data to the remote storage, thereby reducing the burden of maintaining infrastructure and backups, and can run at any scheduled time. Remote backups are inexpensive, and in the event of a system crash data can be restored quickly from the remote servers with the click of a button. These systems also stand out in terms of security: data is transferred and stored in encrypted form, and only the owner of the data possesses the key to decrypt it for reading and writing.

We provide a reliable and highly available remote backup service by leveraging the cloud computing framework. Our service stores data remotely on Amazon S3, a highly reliable and scalable data store.
The reason for choosing a cloud computing framework is that data can be backed up to it anytime, from anywhere, provided web access is available. The main motive for using Amazon S3 was to save the cost of hosting our own infrastructure and instead use Amazon's infrastructure to provide the backup service to customers. Amazon S3's highly available, scalable, inexpensive, reliable, and simple distributed data storage framework makes it an attractive platform for our backup service. Because the cloud can be accessed from anywhere in the world with just an internet connection, the service can also reach mobile clients, who could lose data anywhere and would need only an internet connection to reach the cloud and restore their data.

Amazon Simple Storage Service (S3)

Amazon S3 is Amazon's highly proven distributed data store. It is a highly scalable, reliable, and fast storage infrastructure that allows its users to add, delete, and retrieve files. We write data to S3 as objects, each of which can range in size from 1 byte to 5 GB. Each object is stored in a bucket and retrieved via a unique, developer-assigned key; together, a bucket and a key uniquely identify an object (file) saved by the user in the storage system. Buckets must be explicitly created before they can be used, each user can create only up to 100 buckets, and bucket names are global, so each user must choose unique bucket names. Amazon provides a REST API to create, delete, and list buckets, as well as interfaces to create access control lists that grant read/write permissions to specific AWS customers.
System Architecture

Figure: Cloud architecture for the remote backup service

This architecture uses the Amazon Simple Storage Service (S3) to back up all the important files of the user in the cloud. Whether the user uses auto sync or manual sync, all files are sent as streams from the workstation to the application server in the cloud, which saves them in the user-defined bucket in S3. We used the REST (Representational State Transfer) protocol to interact with the Amazon cloud. In REST, each bucket in S3 is represented as a URI; we save files in the cloud by sending the data embedded in an HTTP request object and receive its metadata via the response object. Base64 encoding was used to prevent encoding and decoding issues in the cloud. When data is saved in S3, it is replicated across multiple servers for fault tolerance. We have also incorporated a web portal so that a user who has experienced a system crash can use someone else's machine to gain access to his files in the cloud and download the files he needs at that point in time.
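To make the REST interaction above concrete, below is a minimal sketch in Java of a signed S3 PUT request. It assumes the legacy (version 2) S3 request-signing scheme that was current in 2009; the class and method names are ours rather than Back2Cloud's, and java.util.Base64 (Java 8+) stands in for whatever Base64 utility the original code used.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.text.SimpleDateFormat;
import java.util.Base64;
import java.util.Date;
import java.util.Locale;
import java.util.TimeZone;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;

public class S3RestPut {

    /** Uploads data to https://<bucket>.s3.amazonaws.com/<key> with a signed PUT. */
    public static int put(String accessKey, String secretKey,
                          String bucket, String key, byte[] data) throws Exception {
        // S3 requires an RFC 1123 date on every request
        SimpleDateFormat fmt = new SimpleDateFormat("EEE, dd MMM yyyy HH:mm:ss z", Locale.US);
        fmt.setTimeZone(TimeZone.getTimeZone("GMT"));
        String date = fmt.format(new Date());
        String contentType = "application/octet-stream";

        // Legacy v2 string-to-sign: verb, MD5 (empty), type, date, canonical resource
        String stringToSign = "PUT\n\n" + contentType + "\n" + date + "\n/" + bucket + "/" + key;
        Mac mac = Mac.getInstance("HmacSHA1");
        mac.init(new SecretKeySpec(secretKey.getBytes(StandardCharsets.UTF_8), "HmacSHA1"));
        String signature = Base64.getEncoder()
                .encodeToString(mac.doFinal(stringToSign.getBytes(StandardCharsets.UTF_8)));

        // Each bucket and object is addressed as a URI, as described above
        URL url = new URL("https://" + bucket + ".s3.amazonaws.com/" + key);
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("PUT");
        conn.setDoOutput(true);
        conn.setRequestProperty("Date", date);
        conn.setRequestProperty("Content-Type", contentType);
        conn.setRequestProperty("Authorization", "AWS " + accessKey + ":" + signature);
        try (OutputStream out = conn.getOutputStream()) {
            // The report Base64-encodes payloads to sidestep encoding issues in the cloud
            out.write(Base64.getEncoder().encode(data));
        }
        return conn.getResponseCode(); // 200 indicates the object was stored
    }
}

A restore follows the same pattern with GET substituted for PUT in both the request method and the string-to-sign.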
Back2Cloud Features

Figure: Graphical user interface for the Back2Cloud application

Back2Cloud interacts with Amazon S3 through REST requests. Both requests and responses are streamed.
Below is an example that illustrates how Back2Cloud instantiates a service that performs simple queries against the S3 engine:

Back2CloudConfig config = new Back2CloudConfig();
config.awsAccessKeyId = awsAccessKeyId;         // AWS credentials identify the account
config.awsSecretAccessKey = awsSecretAccessKey; // secret key used to sign requests
IBack2Cloud service = new Back2CloudQuery(config);

Back2Cloud implements three other services to streamline and support threading, chunking, compression, and versioning: the Compression Service, the Chunk Service, and the Snapshot Service. The features of the Back2Cloud application are described below.

Versioning

Back2Cloud implements versioning using the concept of snapshots. When you back up a file or directory, Back2Cloud places those files into a snapshot, which is a version of those files at that point in time. Each snapshot can be restored to disk. Each snapshot can also be updated independently, although the default behavior is to roll updates into new snapshots. Back2Cloud implements a FileWatcher daemon that continually monitors the files in the backup folder and syncs updates into new snapshots. The daemon monitors writes, accesses, file creations, and file renames, and notifies the Back2Cloud process to invoke the backup procedure to the S3 cloud. Back2Cloud uses Amazon SimpleDB to save metadata about files; the attributes are synced along with the files to the cloud and retrieved when restoring or accessing files from the cloud.
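The report does not include the FileWatcher source, so the following is a minimal polling sketch of how such a daemon could detect changes, assuming a timestamp comparison (Java's WatchService API did not yet exist in 2009). The class and callback names are hypothetical, and deletions and renames would need additional bookkeeping.

import java.io.File;
import java.util.HashMap;
import java.util.Map;

/** Polling watcher: detects new or modified files under the backup folder
    and hands them to a callback that triggers a new snapshot. */
public class FileWatcherDaemon implements Runnable {
    public interface Listener { void fileChanged(File f); }

    private final File backupFolder;
    private final Listener listener;
    private final Map<String, Long> lastModified = new HashMap<>();

    public FileWatcherDaemon(File backupFolder, Listener listener) {
        this.backupFolder = backupFolder;
        this.listener = listener;
    }

    @Override
    public void run() {
        try {
            while (!Thread.currentThread().isInterrupted()) {
                scan(backupFolder);
                Thread.sleep(5_000); // poll every five seconds
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    private void scan(File dir) {
        File[] entries = dir.listFiles();
        if (entries == null) return; // directory vanished or is unreadable
        for (File f : entries) {
            if (f.isDirectory()) { scan(f); continue; }
            Long previous = lastModified.put(f.getPath(), f.lastModified());
            if (previous == null || previous != f.lastModified()) {
                listener.fileChanged(f); // notify Back2Cloud to version this file
            }
        }
    }
}

Such a daemon would be started once, on a background thread, when the application launches, with the listener invoking the snapshot procedure.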
Chunking

Back2Cloud splits files into chunks of variable size. Chunk boundaries are determined by the contents of the data rather than by fixed offsets into a file. This keeps data transfer and storage to a minimum, since only chunks that have not been seen before are uploaded. Chunks have an average size of 128 KB, although they can range from 2 KB to 5 MB. This method of chunking also means that a full backup of a set of files that have already been placed in a snapshot does not re-upload those files; if the files have changed since they were last backed up, only the chunks that have changed need to be uploaded. The result is a new full snapshot from which the full set of files can be restored. This is de-duplication between snapshots.

This method of chunking also enables de-duplication within snapshots. If two files are backed up at the same time and share chunks of data, each of those chunks is uploaded and stored only once; Back2Cloud keeps track of the fact that a chunk is referenced by multiple files. Another advantage is that Back2Cloud can process and transfer multiple chunks in parallel, taking full advantage of the network connection and allowing faster backups than if the chunks were uploaded serially. Finally, because Back2Cloud processes files in chunks, an error during transfer (the process is killed, the network connection goes away, a power outage, etc.) does not force Back2Cloud to start over from the beginning: it picks up where it left off and avoids re-uploading chunks it has already uploaded.

Compression

Another feature of Back2Cloud is that chunks are compressed before they are synced to the cloud. This avoids wasting network bandwidth and saves on the data-in transfer and storage costs of the Amazon cloud computing framework. Back2Cloud uses GZIP compression, via the java.util.zip.GZIPOutputStream and java.util.zip.GZIPInputStream classes.
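The report does not name the chunking algorithm, but boundaries "determined by their contents" describes content-defined chunking with a rolling hash, as popularized by LBFS. The sketch below is one way to realize it, with constants matching the 2 KB minimum, 5 MB maximum, and roughly 128 KB average above; the class name, window size, and hash constants are our assumptions. Each finished chunk is GZIP-compressed, as in the Compression section.

import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPOutputStream;

public class Chunker {
    private static final int WINDOW = 48;             // sliding window size in bytes
    private static final long PRIME = 1099511628211L; // rolling-hash multiplier
    private static final long MASK = (1L << 17) - 1;  // 17 bits -> ~128 KB average chunks
    private static final int MIN_CHUNK = 2 * 1024;        // 2 KB lower bound
    private static final int MAX_CHUNK = 5 * 1024 * 1024; // 5 MB upper bound
    private static final long POW;                    // PRIME^WINDOW, to expire old bytes
    static {
        long p = 1;
        for (int i = 0; i < WINDOW; i++) p *= PRIME;
        POW = p;
    }

    /** Splits the stream into content-defined chunks and GZIP-compresses each one.
        Reads byte-at-a-time for clarity; a real implementation would buffer. */
    public static List<byte[]> chunkAndCompress(InputStream in) throws IOException {
        List<byte[]> chunks = new ArrayList<>();
        ByteArrayOutputStream chunk = new ByteArrayOutputStream();
        byte[] ring = new byte[WINDOW]; // the last WINDOW bytes seen
        long hash = 0;
        int b, n = 0;
        while ((b = in.read()) != -1) {
            int old = ring[n % WINDOW] & 0xFF;   // byte leaving the window
            ring[n % WINDOW] = (byte) b;
            hash = hash * PRIME + b - old * POW; // roll the hash forward one byte
            chunk.write(b);
            n++;
            // Cut where the hash matches the mask, respecting the size bounds
            boolean atBoundary = chunk.size() >= MIN_CHUNK && (hash & MASK) == MASK;
            if (atBoundary || chunk.size() >= MAX_CHUNK) {
                chunks.add(gzip(chunk.toByteArray()));
                chunk.reset();
            }
        }
        if (chunk.size() > 0) chunks.add(gzip(chunk.toByteArray())); // final partial chunk
        return chunks;
    }

    private static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        }
        return out.toByteArray();
    }
}

Because boundaries depend only on the bytes near the cut point, an insertion early in a file shifts at most a few nearby chunks, which is what makes the de-duplication described above effective.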
Encryption

Back2Cloud implements DES encryption to support optional file encryption. Files are encrypted before they are synced to the cloud and decrypted when they are restored to a local disk. This prevents unauthorized access to data stored in a public cloud like Amazon's.
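A minimal sketch of DES encryption with the standard javax.crypto API is shown below. The report does not specify a cipher mode, so ECB with PKCS5 padding is assumed purely for illustration, and the class name is hypothetical. (DES is weak by modern standards; a current implementation would choose AES.)

import javax.crypto.Cipher;
import javax.crypto.SecretKey;
import javax.crypto.SecretKeyFactory;
import javax.crypto.spec.DESKeySpec;

public class DesFileCrypto {
    private final SecretKey key;

    public DesFileCrypto(byte[] eightByteKey) throws Exception {
        // DES carries its 56-bit key in 8 bytes; only the user holds this key
        DESKeySpec spec = new DESKeySpec(eightByteKey);
        key = SecretKeyFactory.getInstance("DES").generateSecret(spec);
    }

    /** Encrypts a chunk before it is synced to the cloud. */
    public byte[] encrypt(byte[] plain) throws Exception {
        Cipher c = Cipher.getInstance("DES/ECB/PKCS5Padding");
        c.init(Cipher.ENCRYPT_MODE, key);
        return c.doFinal(plain);
    }

    /** Decrypts a chunk when it is restored to the local disk. */
    public byte[] decrypt(byte[] cipherText) throws Exception {
        Cipher c = Cipher.getInstance("DES/ECB/PKCS5Padding");
        c.init(Cipher.DECRYPT_MODE, key);
        return c.doFinal(cipherText);
    }
}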
Back2Cloud-Online

Figure: Online web portal providing ubiquitous access to user data from any system connected to the internet

We designed and implemented an online web portal, Back2Cloud-Online, as an extension of the existing Back2Cloud application, giving Back2Cloud's users one more way to gain access to their files. The motivation for this was:

- Users who are on the move may not use the same system very often, and it is unreasonable to expect them to install the Back2Cloud application on every system they use.
- It gives users convenient access to their files independent of the system, requiring only an internet connection.
- It opens the service up to a larger audience.

The web portal was hosted on our local Apache server. PHP was used for this code, owing to the documentation and support available and, most of all, for its ease of object-oriented programming. The initial setup involved having the XAMPP stack ready: Apache2, MySQL, and PHP running on OS X. Since we used Amazon S3 to back up our files, the code included S3.php, which provides functions for communicating with Amazon's system. The main functions used for Back2Cloud-Online were:

- listBuckets($detailed = false): returns a list of the buckets for that particular account.
- getBucket($bucket): fetches a bucket by name so that its contents can be obtained.
- putBucket($bucket): creates a bucket in Amazon S3 with the name that is passed to the function.
- deleteBucket($bucket): deletes an existing bucket from the account.
- putObject($input, $bucket) and putObjectFile($file, $bucket): put files uploaded through the online form into the designated bucket. It is also possible to use access control lists to grant another user access to the bucket if needed.
- deleteObject($file): deletes an individual file within a bucket.
page.php contains the webpage that the end user sees; it includes S3.php and makes use of the functions defined there. Like the desktop application, page.php requires the AWS access key and secret key in order to retrieve the buckets that belong to the account once Amazon has authenticated it. The secret key is used to sign the information sent to Amazon from the user, and since only the user and Amazon know this key, the information is protected; a timestamp is also included to prevent replay attacks by a malicious user. The page displays the contents of the bucket dynamically, so it shows files uploaded through the application as well as through the browser. Downloading these files is very convenient: the user simply clicks on a filename, which immediately prompts the user that a download is about to begin, via an ordinary HTML anchor (a href) link.
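As a rough illustration, page.php could drive the functions above along these lines. This is a hypothetical sketch rather than the project's actual code: the form field, bucket variable, and download script names are invented, and the constructor follows the widely used S3.php class.

<?php
// Hypothetical sketch of how page.php might use the S3.php functions listed above
require_once 'S3.php';

$s3 = new S3($awsAccessKey, $awsSecretKey); // keys entered by the user at login

// Show every bucket owned by this account, with its files as download links
foreach ($s3->listBuckets() as $bucket) {
    echo '<h2>' . htmlspecialchars($bucket) . '</h2>';
    foreach ($s3->getBucket($bucket) as $name => $meta) {
        // Clicking the filename starts the download via a plain a href link
        $href = 'download.php?bucket=' . urlencode($bucket) . '&file=' . urlencode($name);
        echo '<a href="' . $href . '">' . htmlspecialchars($name) . '</a><br/>';
    }
}

// A file submitted through the online form is put into the user's bucket
if (isset($_FILES['upload'])) {
    $s3->putObjectFile($_FILES['upload']['tmp_name'], $userBucket);
}
?>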
Performance Analysis

Figure: Table comparing different cloud-based backup solutions

Among the plethora of remote backup services on the internet, only a subset exploits cloud infrastructure to store data. A brief comparison of commercial cloud-based remote backup providers is included (Table 1). Despite its prominence in the cloud marketplace, Microsoft has yet to offer a remote backup utility; its Windows SkyDrive application focuses on raw user-driven storage and sharing and does not implement basic backup features such as auto sync and encryption. Backblaze, Jungle Disk, and Mozy, on the other hand, all provide these two integral features. Moreover, Backblaze and Mozy offer versioning, while Jungle Disk and Mozy allow for scheduled backup. Jungle Disk offers the user the most latitude in deciding how and where files are stored. As for rates, Jungle Disk and Mozy ask the user to pay per unit of storage used, whereas Backblaze requires a flat fee each month or year, which is less amenable to the changing storage needs of end users. We found only one commercial provider, Jungle Disk, that explicitly advertises its use of Amazon S3. Rackspace, an Amazon competitor that owns Jungle Disk, offers users the choice of storing their data on either Amazon S3 (http://www.jungledisk.com/desktop/why.aspx) or the Rackspace cloud, using GUI- or API-based services (http://www.mosso.com/cloudfiles.jsp).

Back2Cloud synthesizes in a single package features that were previously scattered across different applications on the market. It delivers a highly available and scalable backup service by virtue of using Amazon S3. Finally, it distinguishes itself from its peers by providing failure recovery in case the network connection fails during the backup process: because the backup schedule is documented in a log before the file transfer, backup can resume at the last incompletely transferred file instead of starting from scratch.
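This recovery behavior amounts to a simple write-ahead log. The sketch below illustrates the idea in Java; it is our illustration rather than Back2Cloud's actual code, and the log format (PLAN/DONE records) is invented.

import java.io.*;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class BackupLog {
    public interface Uploader { void upload(File f) throws IOException; }

    private final File logFile;

    public BackupLog(File logFile) { this.logFile = logFile; }

    /** Returns the paths already marked DONE by a previous, interrupted run. */
    public Set<String> loadCompleted() throws IOException {
        Set<String> done = new HashSet<>();
        if (!logFile.exists()) return done;
        try (BufferedReader r = new BufferedReader(new FileReader(logFile))) {
            String line;
            while ((line = r.readLine()) != null) {
                if (line.startsWith("DONE ")) done.add(line.substring(5));
            }
        }
        return done;
    }

    /** Appends a record: PLAN before a transfer starts, DONE after it succeeds. */
    public void append(String status, String path) throws IOException {
        try (FileWriter w = new FileWriter(logFile, true)) {
            w.write(status + " " + path + "\n");
        }
    }

    /** Uploads every file not already completed, logging around each transfer. */
    public void backupAll(List<File> files, Uploader uploader) throws IOException {
        Set<String> done = loadCompleted();
        for (File f : files) {
            if (done.contains(f.getPath())) continue; // finished in an earlier run
            append("PLAN", f.getPath());
            uploader.upload(f);          // may fail mid-transfer
            append("DONE", f.getPath()); // only reached on success
        }
    }
}

Combined with chunk-level resume (described under Chunking), a crashed backup restarts at the first file whose DONE record is missing.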
Figure: Timing analysis comparing the download performance of Back2Cloud against Mozy

Backup Software    Download Rate (Time Taken to Restore)
Back2Cloud         1.02 MB/sec
Mozy               0.4 MB/sec

We performed a performance analysis of the Back2Cloud application by streaming data of various sizes, ranging from 10 MB to 100 MB, from the local system on which Back2Cloud was running to the Amazon S3 storage service. We measured the time it took for the data to be backed up in the cloud and the time it took to restore data from the cloud back to the local system. In a test of download speeds on a Georgia Tech LAWN Ethernet connection with a downlink bandwidth of 87.92 MB/sec, Mozy provided a download rate of 0.4 MB/sec, much slower than Back2Cloud's download rate of 1.02 MB/sec. Mozy also uses the Amazon S3 storage network but does not reveal how it backs up data.
We conjecture that chunking and compressing the data may have helped us achieve better download performance than Mozy.

Figure: Timing analysis comparing the upload performance of Back2Cloud against Mozy

Backup Software    Upload Rate (Time Taken to Backup)
Back2Cloud         0.98 MB/sec
Mozy               0.1 MB/sec

In a test of upload speeds on a Georgia Tech LAWN Ethernet connection with an uplink bandwidth of 6.65 MB/sec, Mozy provided an upload rate of 0.1 MB/sec, much slower than Back2Cloud's upload rate of 0.98 MB/sec. Our intelligent chunking and GZIP compression schemes may account for the improved performance over Mozy.
Future Work

Back2Cloud is currently designed for a single user, but we intend to extend it so that multiple users can use the application. We also need to incorporate a data-sharing mechanism by which different users of the application can share their buckets or files for collaboration. Currently the application only backs up files that have been saved or modified and then closed; we need to work on algorithms and techniques that can sync modified data from open files to the cloud.

Challenges

- Availability of service: We have not yet considered the scenario in which the entire Amazon cloud goes down due to a power failure or a massive system crash. We would need to back up data to more than one cloud provider, either by backing up to one cloud and copying to another, or by splitting the backup across both clouds in some proportion, so that if one cloud crashes the data still survives on the other provider.
- Data confidentiality and auditability: Amazon S3 is a public cloud that many users access to save and retrieve information, so our customers' data could be compromised. We must incorporate better encryption and security techniques to protect our users' data.
- Data transfer bottlenecks: The cloud computing framework, though highly available and scalable, is effective only when the internet connection between the local system and the cloud is fast and reliable. Current network uplink limitations and heavy network traffic can slow down the entire Back2Cloud application, so we must optimize our solution to back up data during low-load conditions and buffer it in memory in the meantime, i.e., some intelligent buffering mechanism.
Conclusion

We believe cloud computing will be the application of tomorrow, as can be seen in how fast it is catching on today, with so many of the big players recognizing the viability of remote storage and computing. Systems like the MacBook Air, the iPhone, and netbooks can afford to be limited in hardware resources and yet, given only a fast internet connection, have all the capabilities of a fully fledged PC that would be bulkier and less mobile. This is a huge boon for the average consumer, who no longer has to pay for or worry about hardware crashes and gets corporate-grade reliability for his data, which beats buying extended warranties on expensive hardware.

References

Prof. Ling Liu, CS 8803 Advanced Internet Application Development, Georgia Tech, Spring 2009.
Amazon S3 documentation: http://aws.amazon.com/documentation/