BigShare - A scalable file sharing service


Alexandros Daglis, Manolis Karpathiotakis, Georgios Psaropoulos
October 18

1 Introduction

In recent years, file hosting services have grown into an established domain and an important part of IaaS. The increasing popularity of such services has resulted in the creation of both dedicated cloud services (e.g., Dropbox) and services incorporated into the environments provided by major IT vendors (e.g., Microsoft Azure, Google Drive). In this document, we present the design of BigShare, a file hosting service to be deployed by EPFL. The purpose of BigShare is to offer file hosting and sharing services to users throughout the world. Our basic concern is to offer flexible file sharing: intuitive access, easy file organization, and detailed control over the access permissions of each file and directory. We also aim for high durability guarantees, scalability, and judicious use of storage and bandwidth.

BigShare is designed as a client-server system that provides a file hosting and sharing service. The system consists of the client software that runs on a variety of user devices, and the infrastructure hosting the service itself, which handles file storage and management while interacting with the client software. The environment of this system comprises the service's users, the physical environment surrounding our infrastructure, and the Internet that connects the clients to the service.

Our service provides each user with a private storage space and a variety of interfaces to manage it. Besides transferring files to and from their space, a user can also organize it with directories, edit file and directory metadata, and share files and directories with other users of the service. To facilitate that, we provide a flexible API that can be used to build a wide range of desktop and mobile client apps. These clients communicate with the server side of our system, which is responsible for reliably providing our service. Our infrastructure comprises an on-premises cluster that handles user requests and manages user information, and a storage system, which can be either Amazon S3 or our own data warehouse. While our design assumes the use of the former, we also discuss the alternative of maintaining the storage on-premises as well.

The rest of this document is organized as follows: Section 2 presents in more detail the requirements on which BigShare's design is based. Section 3 presents the architecture of BigShare, outlining the layers of the architecture and the interactions between them. Section 4 discusses implementation details behind the modules of BigShare. Section 5 explains the API exposed by each module. Section 6 discusses an alternative approach to our service's data storage, focusing on a potential on-premises deployment of BigShare. Section 7 introduces metrics we could use to evaluate various aspects of our system in a service deployment scenario. Finally, Section 8 concludes.

2 System Requirements

BigShare is designed according to a prioritized list of requirements:

Usability: We need to provide the functionality users have grown to expect from a file hosting service. File management options must be complete, intuitive, and easy to use; for instance, we need to support uploading, downloading, directory creation, sharing, deleting, and renaming. To achieve that, we want to provide an interface all users are already familiar with, i.e., one that resembles a conventional file system. Our aim is therefore to provide the abstraction of a file hierarchy that stems from a single home directory and consists of user-created files and directories.

Durability: We have to provide strong guarantees that an uploaded file will be retrievable in the future; in other words, data loss should be extremely rare. We address this concern with redundancy at the low level, by replicating data across multiple locations.

Scalability: BigShare has to scale, providing our file hosting and sharing service to millions of users. Our design tries to avoid bottlenecks that would limit the number of users. To this end, different sub-domains are used for the discrete parts of the service, and communication between these parts is kept to a minimum. Moreover, BigShare exploits the aforementioned redundancy to support load balancing.

Storage efficiency: We want to minimize the amount of data stored on our data servers. We achieve this through both intra- and inter-user deduplication: we identify identical data and store them once rather than multiple times. While replication might seem to act against the purpose of deduplication, i.e., storage savings, it is essential for preventing data loss and providing reliability. Deduplication and replication are thus two conscious design decisions with orthogonal purposes.

We also discuss various secondary characteristics that are important for internet services. Security is one such characteristic: we are aware of possible security issues and discuss some of them, but we do not delve into details, as security is not at the top of our priority list. While we do not design a system with obvious and naive security holes, building a system that provides exceptionally strong security guarantees is not one of our primary concerns either. Finally, availability is another important requirement of online internet services.

3 Architecture

BigShare is designed to be modular. The system consists of well-defined modules with discrete functionality and appropriate interfaces that enable straightforward interoperability. This section describes BigShare's modules and abstractions, how we apply a naming scheme on top of them, and how they are combined into a layered architecture.

3.1 Modules and Abstractions

Each module of our system has a specific role. Apart from the client, a module on its own, we identify three discrete functionalities that are essential for our service:

- Authentication
- Metadata control
- Data storage

This section provides a brief description of the four modules that comprise BigShare.

3.1.1 Client

The client is the software that is essential to access the service. It is the service's interface to the users: it exposes the service's functionality, but also restricts the form of requests to comply with the API and semantics of the service. Internally, it is responsible for file transmission to our storage component. Files are compressed and then split into chunks. Both the compressed chunks and any auxiliary payload information are encrypted before transmission, so that our metadata and data components are populated with encrypted information. As an additional security mechanism, TLS is used as the transmission protocol. We elaborate on the chunking mechanism in Section 4.3.

3.1.2 Authentication

The authentication module is responsible for initializing the client's communication with the service. User interaction with BigShare begins when a user decides to register with our service through a BigShare client. A registration request including a proposed username and password is sent to the authentication module, which processes it and responds with an acknowledgment message containing a status code that indicates either successful registration or failure; in the latter case, the client notifies the user of the reason for the failed attempt. Once a user has created a BigShare account with a unique username, they can use it to log into the service. Log-in uses a token-based mechanism: when a user wants to log into the service, their client communicates with the authentication module, which either provides a token that validates the client to interact with BigShare for a limited time, or responds with an error message that notifies the user of the error cause (usually wrong login credentials).

3.1.3 Data Storage

The data storage module is responsible for managing the data. All files are split into chunks of an upper-bounded size. Physical data representation and data loss prevention are this module's responsibilities; reliability is achieved through transparent data replication. The abstraction provided by the module is simple: it receives and stores chunks of data, which must be retrieved and returned unmodified upon subsequent requests. In our current BigShare design, the storage module is implemented on top of the Amazon S3 storage infrastructure.

The data storage module has no comprehension of files or directories; chunks are the module's first-class citizens, and any further blocking is handled internally by S3. Its only responsibilities are:

- receiving storage requests that lead to the storage of new data chunks, and
- receiving query requests for already stored chunks and returning the requested chunks.
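The following minimal sketch illustrates this abstraction: a content-addressed store that keeps chunks keyed by their SHA-2 names. The class and method names are ours, for illustration; the actual module is backed by S3.

import hashlib

class ChunkStore:
    """Toy content-addressed chunk store, a stand-in for the data
    storage module's abstraction (the real module is backed by S3)."""

    def __init__(self):
        self._chunks = {}  # SHA-2 name -> chunk bytes

    def store(self, chunk: bytes) -> str:
        name = hashlib.sha256(chunk).hexdigest()
        self._chunks[name] = chunk   # storing twice is a harmless no-op
        return name                  # the acknowledgment carries the name

    def retrieve(self, name: str) -> bytes:
        return self._chunks[name]    # KeyError models "chunk not found"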

Each chunk is identified by a unique SHA-2 signature. This uniqueness is verified by the metadata module (Section 3.1.4), which also uses these signatures for deduplication. The answer to each successful chunk storage request is an acknowledgment. A subsequent query for the chunk with a given identifier is resolved by the data module, and the contents of the appropriate chunk are returned.

3.1.4 Metadata Control

The metadata control module is responsible for keeping track of user file permissions and the mapping between files and their constituent chunks. Essentially, it provides the abstraction of a file system. The module is itself multilayered and is described in more detail in Section 3.2.2. After authentication, clients communicate with the metadata control module prior to any actual file transfer. The metadata control module is responsible for listing a user's accessible files upon request, checking permissions upon upload, handling download or modification requests, modifying permissions upon request, handling file sharing requests, and initiating the actual data transfer (upload or download) after all the appropriate checks have succeeded.

3.2 Layering

BigShare's modules are logically organized in two layers: the frontend and the backend. The frontend of BigShare is the client, a form of which runs on each user's machine (desktop/mobile app). The backend consists of three peer modules: authentication, metadata control, and data storage. Having peer modules in a single layer, rather than each module in a separate layer, benefits BigShare: a request does not have to go through several layers, but communicates directly with a certain module, according to the request's state. While the three peer modules materialize the backend layer of our service, each of them has a discrete functionality.

Figure 1 illustrates BigShare's layered architecture. The client transforms the user's requests into a sequence of communications with the three modules of the backend layer. While most communication occurs between the client and each of the backend modules, there is also some limited communication between the metadata module and the data storage and authentication modules, as we mention in Section 5.

Figure 1: High-level layered architecture of BigShare: the frontend (client) on top of the backend's authentication, metadata control, and data storage modules
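The metadata module's responsibilities can be summarized as an interface; the sketch below is illustrative (the method names are ours), while the concrete request/response API appears in Section 5.

from abc import ABC, abstractmethod

class MetadataControl(ABC):
    """Illustrative summary of the metadata control module's role."""

    @abstractmethod
    def list_contents(self, token: str, pathname: str) -> list[str]:
        """List a user's accessible files and directories under a path."""

    @abstractmethod
    def authorize_upload(self, token: str, pathname: str, filename: str) -> str:
        """Check permissions and return a unique key for the upload."""

    @abstractmethod
    def resolve_download(self, token: str, path: str) -> list[str]:
        """Return the ordered chunk signatures that comprise a file."""

    @abstractmethod
    def share(self, token: str, path: str, users: list[str], perms: str) -> None:
        """Grant permissions and link the file into users' Shared folders."""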

3.2.1 Frontend layer

The frontend layer includes the client module. This layer defines the user's interaction with the service. It exposes a user-friendly API and translates user requests to a form appropriate for the underlying service. Depending on the request and its state, the client knows which backend module to communicate with and how to formulate the message. The client also receives responses from the backend and transforms them into user-friendly messages.

Client sublayers: The client layer is itself multilayered, as illustrated in Figure 2. On the left side of the figure, the direct communication of the GUI sublayer with the request generation sublayer serves actions that do not require data transfers, such as the list command. The sublayers on the right side are needed for data transfers. To illustrate their functionality, we describe a file upload scenario. A user uses the client's GUI to initiate a file upload. The file is first compressed and then split into chunks in the second and third sublayers; these two steps take place in a pipelined fashion to achieve high performance. For each chunk, a SHA-2 signature is generated (sublayer 4). The encryption sublayer then encrypts the data being sent. In the general case, the encryption key is provided by BigShare's backend. Data encryption does not aim to prevent eavesdropping; that is prevented by using the TLS protocol for the client-backend communication. Instead, data are uploaded and stored encrypted in the data module, to ensure data privacy even in the case of an attack that results in data leaks. To obtain stronger security and privacy for sensitive files, users may use a custom encryption key. We discuss the role of the SHA-2 signatures and the need for an optional custom encryption key in Section 4.3. Finally, the bottom sublayer (request generation) formulates the request and sends it to the appropriate backend module.

Figure 2: Internal layers of the client (GUI, compression, file chunking & chunk handling, SHA-2 generation, encryption, request generation)
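A minimal sketch of the right-hand pipeline follows, assuming gzip for compression, SHA-256 as the SHA-2 variant, and a 4 MB chunk size; encryption and the pipelining itself are omitted for brevity.

import gzip
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB, the size used by Dropbox [4]

def prepare_upload(path: str) -> list[tuple[str, bytes]]:
    """Compress a file, split it into fixed-size chunks, and pair each
    chunk with its SHA-256 signature (computed before encryption)."""
    with open(path, "rb") as f:
        compressed = gzip.compress(f.read())
    chunks = [compressed[i:i + CHUNK_SIZE]
              for i in range(0, len(compressed), CHUNK_SIZE)]
    return [(hashlib.sha256(c).hexdigest(), c) for c in chunks]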

3.2.2 Backend layer

Apart from the two high-level layers of the system, the Metadata Control and Data Storage modules are themselves layered. We describe these modules' sublayers below.

Metadata Control sublayers: BigShare allows its users to manipulate their own file system instance located in the cloud. To achieve this efficiently, BigShare borrows a subset of the straightforward yet effective design of the UNIX file system. The resulting BigShare file system stack is as follows:

- Link layer: enables sharing by providing links to a user's files/directories to specified users.
- Absolute path name layer: provides a root for the naming hierarchies. While there is a single root, each user sees their own home directory as the root of their file hierarchy.
- Path name layer: organizes files into naming hierarchies.
- File name layer: provides human-oriented names for files.
- Inode number layer: provides machine-oriented names for files.
- File layer: organizes chunks into files.

The file layer contains the information about the chunks that constitute each file. Every file entry contains the names of the relevant chunks, along with the order in which the file can be reconstructed. The chunk names are hash codes generated using SHA-2, which we consider adequate for chunk identification; we will see how they are utilized throughout the rest of this document. These file entries are similar to the concept of inodes, and we use the two terms interchangeably. On top of the file layer, the inode number layer implements a naming mechanism for the file entries; it is implemented in a straightforward manner, returning integer IDs that represent each file entry/inode. The first layer that deals with human-readable names is the file name layer, which associates the inode numbers with human-readable names, as provided by the users. Building on top of these layers, the path name layer provides support for user-specified directories. Every directory is characterized by a directory inode, which maintains the information about all the directory's contents as well as its user-provided name. This name also includes context information, which specifies its parent directories.

Supporting the presence of multiple users requires our system to accommodate growth, so we need to handle user additions in an elegant manner. To this end, the absolute path name layer provides a universal context (root) in the directory service to facilitate horizontal growth in the number of users. Specifically, any user registered in BigShare is provided a home directory, and these per-user directory trees are unified under a global context. This global information can be used to facilitate sharing, as paths between different user home directories can be specified. The information about this global context is only accessible to the system itself, not to its users. Besides restricting the visibility of the global context, a permission mechanism ensures that each user has access only to files and directories uploaded by them or shared with them. This is achieved by incorporating ownership information in the inode of each file or directory. Every uploaded file is assigned a single owner, namely the uploader. To enable sharing, read and write permissions can be granted to additional users or groups of users. However, enhancing a file entry with permissions for an additional user is not enough for them to access the file; the sharing mechanism must also create a link to the file in the user's home directory. This is handled by the metadata module's top layer, namely the link layer.
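To make the file and directory entries concrete, here is a minimal sketch of the inode structures described above; the field names are ours, for illustration.

from dataclasses import dataclass, field

@dataclass
class Inode:
    inode_id: int                   # integer ID from the inode number layer
    owner: str                      # the uploader is the single owner
    # additional users/groups and their granted rights ("r" or "rw")
    permissions: dict[str, str] = field(default_factory=dict)

@dataclass
class FileInode(Inode):
    # SHA-2 chunk names, in the order needed to reconstruct the file
    chunk_names: list[str] = field(default_factory=list)

@dataclass
class DirectoryInode(Inode):
    # human-readable child name -> inode number (file name layer mapping)
    children: dict[str, int] = field(default_factory=dict)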

Data Storage sublayers: The data storage module is also organized in layers. The minimum layer requirements are a file layer, which maps files (chunks, in our case) to disk blocks, and a block layer, responsible for identifying and managing physical disk blocks. As we are using Amazon S3 in our current BigShare design, the module's layer stack is much deeper, as it supports all the functionality and flexibility of a file system.

3.3 Naming

As we saw in the previous section, naming is utilized in the file system abstraction that our service provides. In addition, BigShare uses different namespaces to distinguish the discrete parts of the service. In the current design of BigShare, each client request is forwarded to a different backend sub-layer, depending on the request's stage. Each of the system's modules is assigned a different sub-domain. Specifically, the authentication module owns auth.bigshare.epfl.ch. Requests to the metadata control module employ the control[X].bigshare.epfl.ch sub-domain, where X is replaced with an integer based on the actual metadata server the client communicates with. In a similar manner, the data storage module is exposed through the data[Y].bigshare.epfl.ch sub-domain. We vary X and Y in the requests to enable scalability and load balancing: clients alternate between these domain names to distribute the load equally across the backend servers. Furthermore, as the number of users increases, we can easily scale the system with the increased load by adding more control and data sub-domains. Control servers balance the number of requests they serve, and data servers also distribute the stored chunks equally. Users are distributed across the service's metadata servers; the clients direct requests to a certain server according to the user account. As a request proceeds and data access is required, the metadata module notifies the client of the sub-domain(s) of the data server(s) that need to be contacted. We further discuss the scaling of the data and metadata modules in the following section.
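As a sketch of how a client could map a user account to a metadata sub-domain, the hash-based scheme and server count below are our assumptions, not fixed by the design:

import hashlib

NUM_CONTROL_SERVERS = 16  # hypothetical deployment size

def control_subdomain(username: str) -> str:
    """Statically partition users across metadata servers by hashing
    the account name, yielding a stable control[X] sub-domain."""
    digest = hashlib.sha256(username.encode()).hexdigest()
    x = int(digest, 16) % NUM_CONTROL_SERVERS
    return f"control{x}.bigshare.epfl.ch"

# Every request for the same account lands on the same metadata server.
print(control_subdomain("alice"))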

4 Implementation Considerations

In this section we elaborate on some important implementation aspects of our design. We analyze the mechanisms of file sharing and file update. We address performance concerns, presenting a deduplication mechanism that enables bandwidth and storage savings, and we also discuss how our design is influenced by our scalability requirements.

4.1 File Sharing

Sharing and permissions are handled by the various layers of the metadata module, as described in Section 3.2.2. At a high level, every user account has a default Shared folder under the home directory. This is where the link layer creates the references to the files other users have shared with the interested user. In other words, every user can find the files that have been shared with them under their User Home/Shared directory.

4.2 File Update

Special care needs to be taken for updates of existing files, especially shared ones. The concern arises from the possibility of one user updating a file that is being downloaded by another. The update itself is no different from a normal upload: it is as if a new file is uploaded, replacing an older file of the same name. However, as an update usually means that the new file is based on the file's previously uploaded version, it is highly likely that some of the file's chunks will not need to be uploaded, thanks to the deduplication mechanism described in Section 4.3. Thus, the primary concern that differentiates updates from normal uploads is concurrent access.

To address the concurrent access problem, we follow a versioning approach. The metadata layer keeps track of files that are being downloaded. If an update request arrives during a download, the metadata layer creates a new version of the file and initiates the update by responding to the client's update request. New download requests arriving during the update are served the old version of the file; when the update completes, subsequent download requests get the latest version. When all active downloads of the older version complete, the metadata module transparently deletes that older version and also notifies the data module to delete the old version's corresponding chunks once no user refers to them any more. The semantics provided are thus that a user gets the latest complete version of the file at the time the download request is initiated. A similar approach is followed when multiple concurrent update requests to the same file are received: after both versions are successfully uploaded, the one that was initiated last is eventually kept as the single valid version. Intuitively, a last-write-wins policy is enforced, where the ordering is based on the time of request initiation. As a final remark, these mechanisms are required not only for files shared by multiple users, but also in the general case, as a private file might be modified by multiple clients on different devices of the same user.
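A minimal sketch of this versioning logic follows; the bookkeeping structures are ours, for illustration, and persistence and locking are omitted.

class VersionedFile:
    """Toy last-write-wins versioning, as kept by the metadata layer."""

    def __init__(self, chunk_names: list[str]):
        self.versions = [chunk_names]  # chunk-name lists, oldest first
        self.readers = {}              # version index -> active downloads

    def begin_download(self) -> int:
        v = len(self.versions) - 1     # latest complete version
        self.readers[v] = self.readers.get(v, 0) + 1
        return v

    def end_download(self, v: int):
        self.readers[v] -= 1
        if self.readers[v] == 0 and v != len(self.versions) - 1:
            self.versions[v] = None    # old version becomes garbage; its
                                       # unreferenced chunks can be deleted

    def commit_update(self, chunk_names: list[str]):
        # The update initiated last wins; earlier concurrent versions
        # become eligible for deletion once no download references them.
        self.versions.append(chunk_names)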

4.3 Data Deduplication

To offer competitive performance, BigShare needs to reduce redundant network bandwidth and storage volume on the side of the data servers. Redundant information must therefore not be transmitted blindly to the data module, so we introduce a data deduplication mechanism that reduces the number of duplicate data copies to be transmitted and stored. Note that this mechanism is orthogonal to any replication taking place in the data module for reliability. BigShare tries to minimize both the data transmitted by the clients and the data volume stored in the data module. As previously explained, files are split into fixed-size chunks prior to transmission. When a chunk of data is to be uploaded, the client first submits the SHA-2 signature of the chunk to the metadata module; the module decrypts the signature and checks whether the data chunk has been uploaded in the past. If so, the client is notified that no further action is required, and it is enough to add the chunk's identifier to the inode entry of the file it belongs to. In effect, the user experiences a sped-up upload. The rest of the chunks, for which no already-stored duplicates have been identified, are uploaded normally and stored in the data module.
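A sketch of the metadata-side check follows, assuming a simple set of known signatures; the real module consults its file entries.

def filter_unknown(signatures: list[str], known: set[str]) -> list[str]:
    """Return only the chunk signatures the service has never stored;
    these are the chunks the client actually has to upload."""
    return [s for s in signatures if s not in known]

# Hypothetical round-trip: the client submits all signatures of a file
# and uploads only the chunks whose signatures come back as missing.
missing = filter_unknown(["sig_a", "sig_b", "sig_c"], known={"sig_b"})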

To achieve additional bandwidth savings, any chunk sent to the data module is compressed prior to transmission.

Our decision to make chunks the first-class citizens of the data module is also related to our deduplication efforts: by applying deduplication at the chunk level, storage savings also apply at the sub-file level. Another benefit is that, as files are deterministically chunked at the client and uploaded chunk by chunk, an upload can resume from an arbitrary chunk if it was previously interrupted.

The choice of the chunk size is a tradeoff. A smaller chunk could further improve the deduplication rate, and thus further decrease the storage capacity needed. However, the smaller the chunk, the larger the cost of keeping track of the chunks that comprise each file. The chunk size used by existing services is therefore typically several megabytes: GFS [5] uses a chunk size of 64MB, while 4MB is the chunk size used by Dropbox [4] and also the default size in Windows Azure [2].

The collision concern that may arise about deduplication using hash signatures, i.e., two different chunks having the same signature and one being lost because it is mistakenly considered identical to the other, is not unfounded. However, while SHA-1 was reported to be susceptible to such a flaw, SHA-2 is much stronger, and no collisions have been found for it so far. We therefore do not consider this concern substantial.

If we employ cross-user data deduplication, the privacy of the data in question can be considered compromised [6]. For example, a malevolent user could try to upload a sensitive file and, based on the upload time they experience, find out whether some other user has already uploaded that file. A different scenario is the following: if a bank uses a specific document template to inform a customer of their PIN, a malevolent user could upload multiple versions of this document, keeping the name of the target customer fixed and only changing the PIN. Again, the upload time can indicate whether a match has been found. As we realize that different users have different privacy requirements, we offer users two options. Users who opt for the vanilla version of BigShare get the mechanisms explained so far. Otherwise, users have the option of picking themselves the encryption key that will be used for data encryption. When the client uses such a key, BigShare cannot cross-compare this user's data with that of other users. Still, intra-user data deduplication remains available to handle the cases where the user uploads the same file again, using the same encryption method.
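The effect of the custom key on deduplication can be sketched as follows; mixing the key into the chunk name is our illustration of the mechanism, using a keyed MAC (HMAC) as the construction.

import hashlib
import hmac

def chunk_name(chunk: bytes, user_key: bytes | None = None) -> str:
    """Chunk identifier used for deduplication. Without a user key the
    name depends only on the contents, so identical chunks collide
    across users (cross-user dedup works). With a user-chosen key the
    name is user-specific: intra-user dedup still works, cross-user
    dedup does not."""
    if user_key is None:
        return hashlib.sha256(chunk).hexdigest()
    return hmac.new(user_key, chunk, hashlib.sha256).hexdigest()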

4.4 Scalability

Like many other distributed systems, BigShare is subject to the CAP theorem [1]. As its data module relies on Amazon S3, it employs eventual consistency for the stored chunks, aiming for high availability and partition tolerance. A similar tradeoff needs to be considered for the metadata module. One option would be to sacrifice high availability and employ a design similar to the Google File System [5]: the metadata module would comprise a single main node, and in case of failure, shadow nodes could be used as a fallback, providing an eventually consistent view of the metadata. Though simple in its design, this solution could obviously cause issues in case of master failure or significant load.

Alternatively, we can opt for a distributed solution, as depicted in Figure 3: user metadata storage would take place on one of N available metadata servers. Each node is assigned an area of responsibility from the user domain. Small node groups (e.g., groups of 3-4 nodes) would be formed, and rigid consensus mechanisms [7] would be applied within each of them.

Consistent hashing mechanisms are indeed applied to this end in cloud-based solutions [3], albeit with a sloppy quorum mechanism. By opting for this design, we can handle potential crashes and increased load more efficiently.

Figure 3: Distributed deployment of metadata servers

Employing the decentralized solution also means that we need to resort to best-effort inter-user deduplication: finding out whether a data chunk has already been uploaded by a different user requires probing multiple metadata nodes. Probabilistic data structures such as Bloom filters could be used to reduce the information transmitted for this purpose. Removing from storage duplicate chunks whose metadata is stored on different metadata servers can happen offline, when the service's load is low. In that case, the benefit of reduced storage requirements is still eventually gained, but the performance benefit to clients from online inter-user deduplication is reduced. In this document, we favor the distributed solution, as a single-master solution could significantly affect availability.
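As an illustration, here is a minimal Bloom filter that a metadata node could ship to its peers to summarize the chunk signatures it knows about; the bit-array size and hash count below are arbitrary choices, not part of the design.

import hashlib

class BloomFilter:
    """Toy Bloom filter over chunk signatures. False positives are
    possible (a probed node may still lack the chunk), but false
    negatives are not, so no duplicate is missed by offline dedup."""

    def __init__(self, m_bits: int = 1 << 20, k_hashes: int = 4):
        self.m, self.k = m_bits, k_hashes
        self.bits = bytearray(m_bits // 8)

    def _positions(self, signature: str):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{signature}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m

    def add(self, signature: str):
        for p in self._positions(signature):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, signature: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(signature))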

5 API

This section describes the service's APIs. All of the backend layer's modules, i.e., Authentication, Metadata Control, and Data Storage, expose a set of functions, the vast majority of which are used directly by the client. Our RPC semantics follow the at-most-once semantics of HTTP. To avoid hanging, all RPCs have a default timeout; when a timeout expires, the call and its effects are assumed to have failed, and the RPC has to be re-initiated by the client.

5.1 Authentication API

User Registration: register <user> <password> -> ACK / NACK (username used)
The client uses this call to request the creation of a new account. The authentication module creates a new account entry in the database if the proposed username is unique, and acknowledges the account creation.

User Sign-In: login <user> <password> -> Token / NACK
The client requests to log into the service by sending the user's username and password. The authentication module looks up the (user, password) key-value pair in the database and replies with a token that grants access to the service's modules to the client acting on behalf of that user account.

Password change: change_pwd <user> <old password> <new password> -> ACK / NACK
The client requests to change the password of a given account. The authentication module first checks whether the (user, old password) key-value pair is valid, and if so, replaces the old password with the new one.

5.2 Metadata API

File Upload - Request: upload_request <pathname> <filename> <token> -> <unique key> / Non-existing path / Permission denied
The client requests to upload a new file. If the action is allowed, the metadata module returns a unique key that is used in the subsequent steps of the upload.

List Contents: list <pathname> <token> -> List<name> / Non-existing path / Permission denied
The client asks the metadata module to list the directories and files under a certain path (the requested path is always prefixed by the user's home folder). The metadata layer first checks whether the requested path corresponds to a valid directory, and if so, verifies that the requesting user (given their token) has permission to view the contents of the requested directory.

File Upload - Hashmap: upload_hashmap <token> <unique key> <hashmap> -> missing_chunks <hashmap>
(The unique key was acquired from File Upload - Request.) The client sends a hashmap that contains the hash code of each chunk of the to-be-uploaded file to the metadata module. The metadata module filters out the hash codes that already exist in the service (deduplication) and responds by sending back to the client the hashmap with the remaining hash values.

File Upload - Chunks: register_uploaded_chunks <list chunks> <unique key> -> ACK / NACK
After the client has successfully stored the chunks in the data module, it notifies the metadata module about the successful upload. The metadata module creates a new file entry for this user (identified by the unique key generated earlier) and adds the chunk IDs that comprise the file. This last API call completes the file upload procedure.

File Download: download <path/to/file> <token> -> list<chunk hashes> / Not found / Permission denied
Used by clients to request a file download. The metadata module verifies that the requested file exists and that the requesting user has the required permissions. If these conditions are satisfied, the response contains the list of chunk signatures (hashes) that comprise the file.

File Deletion: delete <path/to/file> <token> -> ACK / Not found / Permission denied
Used by clients to delete a file. The metadata module verifies that the file exists and that the user has the required permissions. If these conditions are satisfied, the metadata module removes the file's metadata from the user's account and acknowledges the deletion. If the chunks of the deleted file are not also referenced by another user (a deduplication effect), the metadata module creates a deletion request for those chunks and sends it to the data module.

File Share: share_file <pathname> list<userid> <permissions> <token> -> ACK / Invalid path / Invalid user id / Permission denied
Used by a client to share a file. If the requesting user is the file's owner, the request succeeds, and the metadata module creates a link under each of the requested users' Shared directories, granting the requested permissions (read/write).

Create directory: mkdir <pathname> <directory name> <token> -> ACK / Invalid path / Permission denied
Requested by a client. If allowed, the metadata module adds a new directory node in the path name hierarchy of that user.

Delete directory: rmdir <path/to/directory> <token> -> ACK / Invalid path / Permission denied
Requested by a client. If allowed, the metadata module deletes that node from the user's path name hierarchy and also recursively deletes all files contained in that directory, as described in the File Deletion function.

Rename directory: d_rename <path/to/directory> <new name> <token> -> ACK / Invalid path / Name collision / Permission denied
Used by a client to rename a target directory. The metadata module renames the directory if this is allowed by the user's permissions and if no other directory with the same name exists under the same namespace.

Change permissions: chmod <pathname> <permissions> <token> -> ACK / Invalid path / Permission denied
Used by a client to change the permissions of a target file or directory. If the action is allowed, the metadata module changes the permissions of the file or, for a directory, the permissions of all files under that directory, recursively.

5.3 Data API

Store Chunks: store_chunks list<<chunk id> <data chunk>> <authorization key> -> ACK / NACK
Used by a client to store the chunks of a to-be-uploaded file that the metadata module has identified as not already present in the service's storage. The data module is responsible for storing the uploaded data reliably and redundantly, and for acknowledging once it is done.

Retrieve Chunks: retrieve_chunks list<chunk ids> <authorization key> -> list<chunks> / Chunk not found
Used by a client to retrieve the chunks that comprise a file to be downloaded. The client previously acquired the list of chunk identifiers that comprise the file from the metadata module.

Delete Chunks: delete_chunks list<chunk ids> <authorization key> -> ACK / NACK
Called by the metadata module when user deletion requests result in chunks that are no longer referenced by anyone. Those chunks need to be deleted, so the metadata module requests the data module to do so.

Note: Since access to the S3 data storage requires authentication, we assume that the clients have suitable authorization keys. Conceptually, these can be obtained from S3 by the metadata module, and then be provided to the clients upon request.
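Putting the API together, the sketch below traces a complete file upload through the three modules; the auth, meta, and data objects are hypothetical RPC stubs for the corresponding sub-domains, and error handling is omitted.

def upload_file(auth, meta, data, user, password, pathname, filepath):
    """End-to-end upload: authenticate, negotiate deduplication with
    the metadata module, store only the missing chunks, and register
    the file. The auth/meta/data stubs are hypothetical."""
    token = auth.login(user, password)
    key = meta.upload_request(pathname, filepath, token)

    # signature -> chunk, from the client pipeline sketch in Section 3.2.1
    chunks = dict(prepare_upload(filepath))
    missing = meta.upload_hashmap(token, key, list(chunks))

    # Only chunks unknown to the service travel over the network; the
    # token stands in for the S3 authorization key mentioned in the Note.
    data.store_chunks([(sig, chunks[sig]) for sig in missing], token)
    meta.register_uploaded_chunks(list(chunks), key)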

6 In-house storage

The use of the Amazon S3 storage infrastructure provides us with highly available and scalable storage. Still, another option would be to deploy the entire BigShare system on EPFL-owned machinery, in order to be independent of any third-party service. Given our current design, the S3 data module could be replaced by an EPFL data module. The latter would implement the chunk layer of our web-based file system; below this layer, a key-value store could be used to store and retrieve our chunks efficiently.

Using an in-house solution deprives us of the flexibility offered by Amazon's services. Specifically, we no longer have an elastic solution that transparently adds or removes machinery based on our current needs and load. In addition, S3 allowed us to be agnostic to potential hardware crashes; for an in-house solution, we would need to set up a replication-based mechanism ourselves to provide failure protection. Keeping these replicas in sync would also require us to implement a state machine replication mechanism.

The extra effort required by an in-house solution can be amortized by the performance benefits gained from the use of dedicated servers. Price and performance per instance are better, as each node is not shared between us and other tenants. Performance is also more predictable, as resource allocation is more fine-grained and less random, and there is no longer a dependency on the workload of the other users of the physical machine. Performance further benefits from the removal of the extra authorization layer that S3 enforces on data access. As we would be able to build our own layers on top of the storage module, we could establish a unified authorization method for our entire service. In addition, this increased flexibility would allow for more versatile communication between the data and metadata modules. For example, when a client requests to download a file, the metadata module could directly notify the data module, so that the latter starts sending the requested data directly to the client, removing the need to use the client as a forwarder of the data retrieval request. In general, the predefined API and functionality provided by S3 limit the flexibility of our modules' interactions and in most cases require the data-metadata communication to occur via the client.

Choosing our own infrastructure also allows us to tune the data module's performance to our needs. For instance, we are free to scale up the machines comprising the storage clusters if needed, or to populate our datacenter with Memcached servers to provide fast access to hot data. Finally, the use of S3 imposes eventual consistency on our data module; if we aim for full ACID compliance, porting to an in-house solution is necessary.

7 Evaluation Metrics

A successful BigShare deployment should conform to the requirements we prioritized in our design: usability, durability, scalability, and storage efficiency. An evaluation of an implementation will be based on metrics that express each of those properties well. Below we discuss representative metrics for each property.

Usability cannot be evaluated objectively, as it is also a matter of GUI design; its evaluation thus relies on user feedback on how easy it is to navigate and customize their private directories according to their needs. This feedback also includes suggestions for new features, which can lead to a richer, more versatile API.

Durability can be concretely measured as the amount of data lost during a specified time period, for instance the number of chunks lost in a year. However, this metric mainly depends on the data module; when using S3, the evaluation results are thus representative of Amazon's storage system rather than of our service.

Scalability refers to offering a stable quality of service to our users that is only slightly affected by the increasing number of both users and files. Latency and bandwidth utilization are two metrics that quantify this: the average bandwidth dedicated to each request, as well as the average latency, are representative metrics for evaluating how our system responds to increasing load.

Storage efficiency is another measurable property. The effectiveness of our chunk deduplication mechanism can be evaluated by comparing the nominal total size of the data stored in our service with the actual space needed to store them. This metric can also be used to decide on the optimal chunk size for BigShare. The efficiency of our eventual inter-user deduplication policy can be evaluated by measuring the average time needed to identify similar chunks owned by different metadata servers.

Apart from the metrics we can use to directly evaluate our primary design goals, there are further important characteristics we should be able to evaluate for our system. We therefore also discuss a few more important metrics.

High availability is an important service property, even though we did not focus on providing it. A metric for our service's availability would be the frequency of outages. Based on these data and the identification of each outage's cause, we can identify unanticipated weaknesses and accordingly provision our infrastructure or redesign parts of the system.

Hardware utilization metrics are important for any service provider that owns at least part of the infrastructure. Both average and peak utilization are crucial for optimal infrastructure provisioning. If the load variation is significant, the use of our infrastructure will not be cost-effective; if our metrics indicate so, we should consider moving to a virtualized environment to allow for efficient consolidation.

8 Conclusion

File sharing services have become significantly popular in the past years. Web-based collaborative applications attract a large audience, which expects the data it shares to be persistent and readily accessible. We designed BigShare to address these needs. We achieve scalability of the service by using a modular architecture, employing clearly separated modules with minimal interactions between the layers they belong to. We provide details on the internal mechanisms used and discuss the tradeoffs involved in achieving scalability and high performance. Finally, we reconsider our design by substituting an in-house solution for the S3-powered black-box storage layer, and discuss the influence of this change on the overall service.

References

[1] Eric A. Brewer. Towards robust distributed systems (abstract). In PODC, page 7, 2000.

[2] Brad Calder, Ju Wang, Aaron Ogus, Niranjan Nilakantan, Arild Skjolsvold, Sam McKelvie, Yikang Xu, Shashwat Srivastav, Jiesheng Wu, Huseyin Simitci, Jaidev Haridas, Chakravarthy Uddaraju, Hemal Khatri, Andrew Edwards, Vaman Bedekar, Shane Mainali, Rafay Abbasi, Arpit Agarwal, Mian Fahim ul Haq, Muhammad Ikram ul Haq, Deepali Bhardwaj, Sowmya Dayanand, Anitha Adusumilli, Marvin McNett, Sriram Sankaran, Kavitha Manivannan, and Leonidas Rigas. Windows Azure Storage: a highly available cloud storage service with strong consistency. In SOSP, pages 143-157, 2011.

[3] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. Dynamo: Amazon's highly available key-value store. In SOSP, pages 205-220, 2007.

[4] Idilio Drago, Marco Mellia, Maurizio M. Munafò, Anna Sperotto, Ramin Sadre, and Aiko Pras. Inside Dropbox: understanding personal cloud storage services. In Internet Measurement Conference, pages 481-494, 2012.

[5] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google File System. In SOSP, pages 29-43, 2003.

[6] Danny Harnik, Benny Pinkas, and Alexandra Shulman-Peleg. Side channels in cloud services: deduplication in cloud storage. IEEE Security & Privacy, 8(6):40-47, 2010.

[7] Leslie Lamport. The part-time parliament. ACM Trans. Comput. Syst., 16(2):133-169, 1998.


More information

Alliance Key Manager Solution Brief

Alliance Key Manager Solution Brief Alliance Key Manager Solution Brief KEY MANAGEMENT Enterprise Encryption Key Management On the road to protecting sensitive data assets, data encryption remains one of the most difficult goals. A major

More information

VICTORIA UNIVERSITY OF WELLINGTON Te Whare Wānanga o te Ūpoko o te Ika a Māui

VICTORIA UNIVERSITY OF WELLINGTON Te Whare Wānanga o te Ūpoko o te Ika a Māui VICTORIA UNIVERSITY OF WELLINGTON Te Whare Wānanga o te Ūpoko o te Ika a Māui School of Engineering and Computer Science Te Kura Mātai Pūkaha, Pūrorohiko PO Box 600 Wellington New Zealand Tel: +64 4 463

More information

The Security Behind Sticky Password

The Security Behind Sticky Password The Security Behind Sticky Password Technical White Paper version 3, September 16th, 2015 Executive Summary When it comes to password management tools, concerns over secure data storage of passwords and

More information

SECURITY ANALYSIS OF A SINGLE SIGN-ON MECHANISM FOR DISTRIBUTED COMPUTER NETWORKS

SECURITY ANALYSIS OF A SINGLE SIGN-ON MECHANISM FOR DISTRIBUTED COMPUTER NETWORKS SECURITY ANALYSIS OF A SINGLE SIGN-ON MECHANISM FOR DISTRIBUTED COMPUTER NETWORKS Abstract: The Single sign-on (SSO) is a new authentication mechanism that enables a legal user with a single credential

More information

Online Transaction Processing in SQL Server 2008

Online Transaction Processing in SQL Server 2008 Online Transaction Processing in SQL Server 2008 White Paper Published: August 2007 Updated: July 2008 Summary: Microsoft SQL Server 2008 provides a database platform that is optimized for today s applications,

More information

High Security Online Backup. A Cyphertite White Paper February, 2013. Cloud-Based Backup Storage Threat Models

High Security Online Backup. A Cyphertite White Paper February, 2013. Cloud-Based Backup Storage Threat Models A Cyphertite White Paper February, 2013 Cloud-Based Backup Storage Threat Models PG. 1 Definition of Terms Secrets Passphrase: The secrets passphrase is the passphrase used to decrypt the 2 encrypted 256-bit

More information

Cloud Computing at Google. Architecture

Cloud Computing at Google. Architecture Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale

More information

Research and Application of Redundant Data Deleting Algorithm Based on the Cloud Storage Platform

Research and Application of Redundant Data Deleting Algorithm Based on the Cloud Storage Platform Send Orders for Reprints to reprints@benthamscience.ae 50 The Open Cybernetics & Systemics Journal, 2015, 9, 50-54 Open Access Research and Application of Redundant Data Deleting Algorithm Based on the

More information

The Sierra Clustered Database Engine, the technology at the heart of

The Sierra Clustered Database Engine, the technology at the heart of A New Approach: Clustrix Sierra Database Engine The Sierra Clustered Database Engine, the technology at the heart of the Clustrix solution, is a shared-nothing environment that includes the Sierra Parallel

More information

Network Attached Storage. Jinfeng Yang Oct/19/2015

Network Attached Storage. Jinfeng Yang Oct/19/2015 Network Attached Storage Jinfeng Yang Oct/19/2015 Outline Part A 1. What is the Network Attached Storage (NAS)? 2. What are the applications of NAS? 3. The benefits of NAS. 4. NAS s performance (Reliability

More information

ANALYSIS OF SMART METER DATA USING HADOOP

ANALYSIS OF SMART METER DATA USING HADOOP ANALYSIS OF SMART METER DATA USING HADOOP 1 Balaji K. Bodkhe, 2 Dr. Sanjay P. Sood MESCOE Pune, CDAC Mohali Email: 1 balajibodkheptu@gmail.com, 2 spsood@gmail.com Abstract The government agencies and the

More information

IDENTIFYING AND OPTIMIZING DATA DUPLICATION BY EFFICIENT MEMORY ALLOCATION IN REPOSITORY BY SINGLE INSTANCE STORAGE

IDENTIFYING AND OPTIMIZING DATA DUPLICATION BY EFFICIENT MEMORY ALLOCATION IN REPOSITORY BY SINGLE INSTANCE STORAGE IDENTIFYING AND OPTIMIZING DATA DUPLICATION BY EFFICIENT MEMORY ALLOCATION IN REPOSITORY BY SINGLE INSTANCE STORAGE 1 M.PRADEEP RAJA, 2 R.C SANTHOSH KUMAR, 3 P.KIRUTHIGA, 4 V. LOGESHWARI 1,2,3 Student,

More information

How To Get To A Cloud Storage And Byod System

How To Get To A Cloud Storage And Byod System Maginatics Security Architecture What is the Maginatics Cloud Storage Platform? Enterprise IT organizations are constantly looking for ways to reduce costs and increase operational efficiency. Although

More information

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind

More information

Feature and Technical

Feature and Technical BlackBerry Enterprise Server for Microsoft Exchange Version: 5.0 Service Pack: 4 Feature and Technical Overview Published: 2013-11-07 SWD-20131107160132924 Contents 1 Document revision history...6 2 What's

More information

SHARPCLOUD SECURITY STATEMENT

SHARPCLOUD SECURITY STATEMENT SHARPCLOUD SECURITY STATEMENT Summary Provides details of the SharpCloud Security Architecture Authors: Russell Johnson and Andrew Sinclair v1.8 (December 2014) Contents Overview... 2 1. The SharpCloud

More information

SECURE, ENTERPRISE FILE SYNC AND SHARE WITH EMC SYNCPLICITY UTILIZING EMC ISILON, EMC ATMOS, AND EMC VNX

SECURE, ENTERPRISE FILE SYNC AND SHARE WITH EMC SYNCPLICITY UTILIZING EMC ISILON, EMC ATMOS, AND EMC VNX White Paper SECURE, ENTERPRISE FILE SYNC AND SHARE WITH EMC SYNCPLICITY UTILIZING EMC ISILON, EMC ATMOS, AND EMC VNX Abstract This white paper explains the benefits to the extended enterprise of the on-

More information

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14

Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases. Lecture 14 Department of Computer Science University of Cyprus EPL646 Advanced Topics in Databases Lecture 14 Big Data Management IV: Big-data Infrastructures (Background, IO, From NFS to HFDS) Chapter 14-15: Abideboul

More information

Security Overview Enterprise-Class Secure Mobile File Sharing

Security Overview Enterprise-Class Secure Mobile File Sharing Security Overview Enterprise-Class Secure Mobile File Sharing Accellion, Inc. 1 Overview 3 End to End Security 4 File Sharing Security Features 5 Storage 7 Encryption 8 Audit Trail 9 Accellion Public Cloud

More information

IT Architecture Review. ISACA Conference Fall 2003

IT Architecture Review. ISACA Conference Fall 2003 IT Architecture Review ISACA Conference Fall 2003 Table of Contents Introduction Business Drivers Overview of Tiered Architecture IT Architecture Review Why review IT architecture How to conduct IT architecture

More information

The Importance of a Resilient DNS and DHCP Infrastructure

The Importance of a Resilient DNS and DHCP Infrastructure White Paper The Importance of a Resilient DNS and DHCP Infrastructure DNS and DHCP availability and integrity increase in importance with the business dependence on IT systems The Importance of DNS and

More information

Gladinet Cloud Backup V3.0 User Guide

Gladinet Cloud Backup V3.0 User Guide Gladinet Cloud Backup V3.0 User Guide Foreword The Gladinet User Guide gives step-by-step instructions for end users. Revision History Gladinet User Guide Date Description Version 8/20/2010 Draft Gladinet

More information

Last Updated: July 2011. STATISTICA Enterprise Server Security

Last Updated: July 2011. STATISTICA Enterprise Server Security Last Updated: July 2011 STATISTICA Enterprise Server Security STATISTICA Enterprise Server Security Page 2 of 10 Table of Contents Executive Summary... 3 Introduction to STATISTICA Enterprise Server...

More information

Considerations In Developing Firewall Selection Criteria. Adeptech Systems, Inc.

Considerations In Developing Firewall Selection Criteria. Adeptech Systems, Inc. Considerations In Developing Firewall Selection Criteria Adeptech Systems, Inc. Table of Contents Introduction... 1 Firewall s Function...1 Firewall Selection Considerations... 1 Firewall Types... 2 Packet

More information

Key Considerations and Major Pitfalls

Key Considerations and Major Pitfalls : Key Considerations and Major Pitfalls The CloudBerry Lab Whitepaper Things to consider before offloading backups to the cloud Cloud backup services are gaining mass adoption. Thanks to ever-increasing

More information

THE WINDOWS AZURE PROGRAMMING MODEL

THE WINDOWS AZURE PROGRAMMING MODEL THE WINDOWS AZURE PROGRAMMING MODEL DAVID CHAPPELL OCTOBER 2010 SPONSORED BY MICROSOFT CORPORATION CONTENTS Why Create a New Programming Model?... 3 The Three Rules of the Windows Azure Programming Model...

More information

User's Guide. Product Version: 2.5.0 Publication Date: 7/25/2011

User's Guide. Product Version: 2.5.0 Publication Date: 7/25/2011 User's Guide Product Version: 2.5.0 Publication Date: 7/25/2011 Copyright 2009-2011, LINOMA SOFTWARE LINOMA SOFTWARE is a division of LINOMA GROUP, Inc. Contents GoAnywhere Services Welcome 6 Getting Started

More information

Cloud Service Model. Selecting a cloud service model. Different cloud service models within the enterprise

Cloud Service Model. Selecting a cloud service model. Different cloud service models within the enterprise Cloud Service Model Selecting a cloud service model Different cloud service models within the enterprise Single cloud provider AWS for IaaS Azure for PaaS Force fit all solutions into the cloud service

More information

Cloud Gateway. Agenda. Cloud concepts Gateway concepts My work. Monica Stebbins

Cloud Gateway. Agenda. Cloud concepts Gateway concepts My work. Monica Stebbins Approved for Public Release; Distribution Unlimited. Case Number 15 0196 Cloud Gateway Monica Stebbins Agenda 2 Cloud concepts Gateway concepts My work 3 Cloud concepts What is Cloud 4 Similar to hosted

More information

MinCopysets: Derandomizing Replication In Cloud Storage

MinCopysets: Derandomizing Replication In Cloud Storage MinCopysets: Derandomizing Replication In Cloud Storage Asaf Cidon, Ryan Stutsman, Stephen Rumble, Sachin Katti, John Ousterhout and Mendel Rosenblum Stanford University cidon@stanford.edu, {stutsman,rumble,skatti,ouster,mendel}@cs.stanford.edu

More information

Multi-Datacenter Replication

Multi-Datacenter Replication www.basho.com Multi-Datacenter Replication A Technical Overview & Use Cases Table of Contents Table of Contents... 1 Introduction... 1 How It Works... 1 Default Mode...1 Advanced Mode...2 Architectural

More information

Monitoring Traffic manager

Monitoring Traffic manager Monitoring Traffic manager eg Enterprise v6 Restricted Rights Legend The information contained in this document is confidential and subject to change without notice. No part of this document may be reproduced

More information

Release Notes. LiveVault. Contents. Version 7.65. Revision 0

Release Notes. LiveVault. Contents. Version 7.65. Revision 0 R E L E A S E N O T E S LiveVault Version 7.65 Release Notes Revision 0 This document describes new features and resolved issues for LiveVault 7.65. You can retrieve the latest available product documentation

More information

Chapter 13 File and Database Systems

Chapter 13 File and Database Systems Chapter 13 File and Database Systems Outline 13.1 Introduction 13.2 Data Hierarchy 13.3 Files 13.4 File Systems 13.4.1 Directories 13.4. Metadata 13.4. Mounting 13.5 File Organization 13.6 File Allocation

More information

Chapter 13 File and Database Systems

Chapter 13 File and Database Systems Chapter 13 File and Database Systems Outline 13.1 Introduction 13.2 Data Hierarchy 13.3 Files 13.4 File Systems 13.4.1 Directories 13.4. Metadata 13.4. Mounting 13.5 File Organization 13.6 File Allocation

More information

Bootstrap guide for the File Station

Bootstrap guide for the File Station Bootstrap guide for the File Station Introduction Through the File Server it is possible to store files and create automated backups on a reliable, redundant storage system. NOTE: this guide considers

More information

Data Modeling for Big Data

Data Modeling for Big Data Data Modeling for Big Data by Jinbao Zhu, Principal Software Engineer, and Allen Wang, Manager, Software Engineering, CA Technologies In the Internet era, the volume of data we deal with has grown to terabytes

More information

White Paper: Cloud Identity is Different. World Leading Directory Technology. Three approaches to identity management for cloud services

White Paper: Cloud Identity is Different. World Leading Directory Technology. Three approaches to identity management for cloud services World Leading Directory Technology White Paper: Cloud Identity is Different Three approaches to identity management for cloud services Published: March 2015 ViewDS Identity Solutions A Changing Landscape

More information

Novell ZENworks 10 Configuration Management SP3

Novell ZENworks 10 Configuration Management SP3 AUTHORIZED DOCUMENTATION Software Distribution Reference Novell ZENworks 10 Configuration Management SP3 10.3 November 17, 2011 www.novell.com Legal Notices Novell, Inc., makes no representations or warranties

More information

Service Overview CloudCare Online Backup

Service Overview CloudCare Online Backup Service Overview CloudCare Online Backup CloudCare s Online Backup service is a secure, fully automated set and forget solution, powered by Attix5, and is ideal for organisations with limited in-house

More information

Evaluation of different Open Source Identity management Systems

Evaluation of different Open Source Identity management Systems Evaluation of different Open Source Identity management Systems Ghasan Bhatti, Syed Yasir Imtiaz Linkoping s universitetet, Sweden [ghabh683, syeim642]@student.liu.se 1. Abstract Identity management systems

More information

Basic Unix/Linux 1. Software Testing Interview Prep

Basic Unix/Linux 1. Software Testing Interview Prep Basic Unix/Linux 1 Programming Fundamentals and Concepts 2 1. What is the difference between web application and client server application? Client server application is designed typically to work in a

More information

CHAPTER 2 MODELLING FOR DISTRIBUTED NETWORK SYSTEMS: THE CLIENT- SERVER MODEL

CHAPTER 2 MODELLING FOR DISTRIBUTED NETWORK SYSTEMS: THE CLIENT- SERVER MODEL CHAPTER 2 MODELLING FOR DISTRIBUTED NETWORK SYSTEMS: THE CLIENT- SERVER MODEL This chapter is to introduce the client-server model and its role in the development of distributed network systems. The chapter

More information

Data Consistency on Private Cloud Storage System

Data Consistency on Private Cloud Storage System Volume, Issue, May-June 202 ISS 2278-6856 Data Consistency on Private Cloud Storage System Yin yein Aye University of Computer Studies,Yangon yinnyeinaye.ptn@email.com Abstract: Cloud computing paradigm

More information

Security Provider Integration LDAP Server

Security Provider Integration LDAP Server Security Provider Integration LDAP Server 2015 Bomgar Corporation. All rights reserved worldwide. BOMGAR and the BOMGAR logo are trademarks of Bomgar Corporation; other trademarks shown are the property

More information

Data Replication in Privileged Credential Vaults

Data Replication in Privileged Credential Vaults Data Replication in Privileged Credential Vaults 2015 Hitachi ID Systems, Inc. All rights reserved. Contents 1 Background: Securing Privileged Accounts 2 2 The Business Challenge 3 3 Solution Approaches

More information

Enterprise SSO Manager (E-SSO-M)

Enterprise SSO Manager (E-SSO-M) Enterprise SSO Manager (E-SSO-M) Many resources, such as internet applications, internal network applications and Operating Systems, require the end user to log in several times before they are empowered

More information

Scalable Multiple NameNodes Hadoop Cloud Storage System

Scalable Multiple NameNodes Hadoop Cloud Storage System Vol.8, No.1 (2015), pp.105-110 http://dx.doi.org/10.14257/ijdta.2015.8.1.12 Scalable Multiple NameNodes Hadoop Cloud Storage System Kun Bi 1 and Dezhi Han 1,2 1 College of Information Engineering, Shanghai

More information

Final Year Project Interim Report

Final Year Project Interim Report 2013 Final Year Project Interim Report FYP12016 AirCrypt The Secure File Sharing Platform for Everyone Supervisors: Dr. L.C.K. Hui Dr. H.Y. Chung Students: Fong Chun Sing (2010170994) Leung Sui Lun (2010580058)

More information

SOLVING LOAD REBALANCING FOR DISTRIBUTED FILE SYSTEM IN CLOUD

SOLVING LOAD REBALANCING FOR DISTRIBUTED FILE SYSTEM IN CLOUD International Journal of Advances in Applied Science and Engineering (IJAEAS) ISSN (P): 2348-1811; ISSN (E): 2348-182X Vol-1, Iss.-3, JUNE 2014, 54-58 IIST SOLVING LOAD REBALANCING FOR DISTRIBUTED FILE

More information