Project Orwell: Distributed Document Integrity Verification

Tommy MacWilliam
tmacwilliam@cs.harvard.edu

Abstract

Project Orwell is a client and server application designed to facilitate the preservation of documents. Unlike existing projects, Orwell promotes the duplication of a given document across a network of hosts and, by introducing an incentive system, targets individuals (as opposed to organizations or universities) seeking to preserve content. Orwell allows users to install their own client in order to upload content in any format to a central server as well as host content uploaded by others. The resulting distribution of equivalent content across multiple hosts ensures that a document's content will be preserved even if some clients, particularly that of the original author, are compromised. The integrity of the documents hosted on the user's client is manually or automatically verified against the network of clients also hosting that document, such that the central server does not play a role in the verification process. In the event of a conflict, users are notified and are able to view differences between document versions as well as resolve conflicts in hosted documents.

I. INTRODUCTION

As we discussed in class, cloud-hosted e-books allow content providers to push updated documents to users' devices unbeknownst to their owners. This issue is representative of the larger problem of document integrity verification: how can content providers ensure their versions of hosted documents match the accepted, standard versions? For example, suppose a website hosts the text of the United States Constitution. If a malicious user gains access to the server, he or she could change (perhaps very slightly) the text of the site's version of the Constitution without the site administrator noticing.
In the case of a high-traffic website, this change has the potential to propagate to other sources, as users have reason to believe the hosted copy of the document is correct. Alternatively, consider the case of a software aggregation service like CNET's Download.com: mirrored installers must match those of the original distributors to protect users from malware. However, content providers cannot simply rely on a version hosted by a single authoritative source, as a compromise of this central store is no different from the aforementioned scenario. Instead, distributing the task of document verification across a network of providers allows hosts to verify content without the need for a centralized data store; rather than comparing a document to a single correct copy, content providers can compare their copy to those of multiple peers. In the event any node in the network is found to have a document mismatch, the network can be notified so the error can be corrected and the document can be rolled back to its previous state.

Project Orwell seeks to address the problem of document integrity assurance. After registering with a central server and obtaining an API key, Orwell users download a client package that functions as a content management application. Using the client application, users can upload content for other clients to host as well as host content uploaded by other users. Because the content is duplicated across many hosts, documents will be preserved even if a single server is compromised or ceases to host the content. Furthermore, document integrity is periodically verified across hosts in order to ensure that all copies of the document

are equivalent to the originally uploaded document. Consequently, Project Orwell provides a mechanism for users to preserve documents while assuring readers that the content is at all times faithful to the originally uploaded document.

In this paper, we present Project Orwell's design decisions and implementation. In Section II, we present work related to Project Orwell and discuss the differences in Orwell's approach. In Section III, we describe how each of Orwell's components was implemented, and in Section IV, we discuss future areas of research.

II. RELATED WORK

A. Cloud Storage

Many users turn to traditional cloud storage systems like Box, Dropbox, and SugarSync to back up their files. Each of these services includes a client application that automatically synchronizes users' content with a central server hosted by a third party. If multiple devices (potentially owned by different users) are configured to collaborate on a directory or document, then the changes of one user will be automatically propagated to the other devices.

Project Orwell seeks to solve two problems with this cloud storage model. First, these storage solutions are most commonly used to back up content on one user's device to the cloud. However, implicit in the single backup model is trust in the third party, as users' content will be lost or inaccessible if the server is compromised or the third party ceases to exist. Second, the collaboration model allows one user to create changes that can propagate to all consumers of the document. While in many cases this model is ideal (e.g., if many users are contributing to a single document), it is not appropriate for frozen, static content, as a user's copy of a document can be changed by another user without notice.

B. LOCKSS

LOCKSS [1], or Lots Of Copies Keep Stuff Safe, is a project developed by Stanford University that seeks to preserve libraries' digital content.
To participate in the LOCKSS network, publishers must host a LOCKSS permission statement online and then grant permission to a library to archive their content [2]. Meanwhile, libraries install the LOCKSS software on a specialized machine, which downloads content from authorized publishers and continually verifies the integrity of the locally hosted documents. By duplicating content across multiple LOCKSS boxes, publishers ensure seamless and constant access to documents; if the publisher's copy becomes unavailable, then readers are automatically redirected to a LOCKSS-hosted copy. LOCKSS has been successful in garnering adoption among both libraries and publishers: the Global LOCKSS network contains over 9,000 titles from 510 publishers at the time of writing. LOCKSS thus appeals to both publishers and libraries because users have perpetual access to content owned by publishers or subscribed to by libraries.

Project Orwell is built for a different audience: individuals. While LOCKSS is designed for established libraries and publishers, Project Orwell allows any user to become both a host and a publisher. For example, LOCKSS requires publishers to apply to join the LOCKSS network, charges institutions thousands of dollars in annual fees [3], and allows publishers to deny access to particular hosts, all of which exclude individuals from publishing or hosting content. On the other hand, Project Orwell is available free of charge, and there are no restrictions on what content can be published or hosted by a user. Moreover, the LOCKSS box software requires a dedicated machine with a static IP address, while Project Orwell's client can be installed on any web server without disrupting the server's ability to deliver other content (which is a more feasible requirement for an individual). Similarly, while LOCKSS attempts to route all traffic to the original publisher, Project Orwell clients deliver content

directly, which in turn distributes the bandwidth needed to transfer content across multiple hosts; this system is more cost-effective for users whose servers do not have the bandwidth necessary for widespread content delivery. Finally, because users must opt in to host Orwell documents (in contrast to the web crawlers used by LOCKSS), Orwell, unlike LOCKSS, incentivizes users to host documents by enforcing the constraint that users must host as many documents uploaded by others as they upload themselves. Consequently, Project Orwell aims to establish a lower barrier to entry and create a more open community.

Figure 1. Users can upload and verify documents through Orwell's client application.

III. IMPLEMENTATION

Because Orwell requires the installation of a client application on the user's web server, Orwell was designed using a standard LAMP stack in order to maximize compatibility and minimize dependency constraints. Both the client and server utilize CakePHP [4], a web framework for the PHP [5] programming language. The backend stores information in a MySQL [6] database, and the jQuery [7] and Bootstrap [8] JavaScript and CSS libraries are used in frontend components. The source code for both the client and server applications can be found on GitHub [9].

Both components adhere to the MVC architecture. As such, all models are found in app/model, all controllers in app/controller, and all views in app/view. JavaScript code is located in app/webroot/js, and CSS is located in app/webroot/css.

A. Installation

To register for an Orwell client, users first navigate to the Project Orwell website, on which the server component of the project is hosted. To sign up, users need only provide a valid email address and the URL at which the client will be publicly hosted. Upon signing up, a random (and salted) shared secret is generated by the server for the client.
This shared secret will later be used by the client to authenticate against the central server for all API requests, ensuring that requests to clients or to the server cannot be forged. The user can then download the client installer, which, when run on the

client's web host, downloads the Orwell client application, installs dependencies, creates the necessary MySQL database schema, and configures the server-generated API key.

Figure 2. Clients upload documents to the Orwell server so other clients can host them, thereby creating a network of hosts.

B. Uploading Documents

When users navigate to their Orwell client (which must be accessible at the URL provided by the user upon signing up for the client) for the first time, they are prompted to create an administrator account. Any number of user accounts can be created on the client, so that multiple individuals can manage the client's documents; for security purposes, a valid user must be logged in to perform any document management operations on the client.

Users can upload documents of any type from their client. After the user selects a document and gives it a title, the document is uploaded from the client to the server, along with the client's shared secret and webroot (to enforce authentication). Upon receiving the document, the server computes a 128-bit MD5 hash and a 160-bit SHA-1 hash of its contents. If another document on the server has the same MD5 and SHA-1 hashes as the uploaded document, the server concludes that the documents are equivalent and does not store a second copy. The server is thus able to de-duplicate content in the interest of space efficiency. If, on the other hand, the document does not already exist on the server, the document is assigned a globally unique ID that will identify the document across all clients, and the document is stored on the server's filesystem. Finally, in either case, the uploaded document is also stored on the client's filesystem using the document's globally unique identifier.

C. Downloading Documents to Host

Users can also browse the universe of documents hosted across all Orwell networks from the client.
This list is both sortable and searchable, and users can also see the number of clients that are already hosting each document. From this list of documents (which only includes documents that are not hosted on the user's client), users can download documents from the server to host on their own client. When the user presses a document's Download button, an AJAX request is sent to the client indicating the ID of the document that will be downloaded. First, the client checks whether the document in question is already hosted (by searching its database for the document's identifier), and aborts the process if the document is found, again in the interest of de-duplication. The client then makes a request to the central server with both the ID of the document to be downloaded and the client's API key. The server responds with the

set of URLs at which the same document is hosted by other clients, which is then stored in the client's database to be used in the document verification process described in the next section. The server also notifies each of these hosts that a new client has joined the verification network for the document being downloaded, as seen in Figure 3, such that all clients in the network form a connected graph. The server then sends the contents of the file to the downloading client, per Figure 2. The client is thus able to store a local copy of both the document and the URLs of other clients that are also hosting the document. Both uploaded and downloaded documents are accessible via a public URL on the client, so anyone can easily view the content a client hosts.

Figure 3. When a new client hosts a document, the verification network remains a connected graph.

D. Document Verification

The document verification process ensures the integrity of the documents stored on clients by comparing their contents to those of other Orwell clients hosting the same document. On the client's document management view, seen in Figure 1, users can verify their documents' integrity by simply clicking the Verify button. This triggers an AJAX request to the client with the ID of the document to verify. The client then randomly selects another client also hosting the document to verify against. The use of randomness by all clients across all networks ensures that all hosts have an equal probability of being chosen, which helps ensure that a compromised document is detected by some client. After selecting a client to verify against, the client compares an MD5 hash and a SHA-1 hash of its document to the corresponding hashes of the document hosted by the selected client.
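The selection and comparison steps just described can be sketched in a few lines. This is an illustration only, written in Python rather than the project's PHP: the function names `fingerprint` and `verify`, the in-memory `peers` mapping, and the example URLs are assumptions made for this sketch, and a real client would presumably request the peer's hashes over HTTP rather than hold the peer's bytes locally.

```python
import hashlib
import random
from typing import Dict, Tuple


def fingerprint(data: bytes) -> Tuple[str, str]:
    """Return the (MD5, SHA-1) hex digests of a document's contents."""
    return hashlib.md5(data).hexdigest(), hashlib.sha1(data).hexdigest()


def verify(local_copy: bytes, peers: Dict[str, bytes],
           rng: random.Random) -> Tuple[str, bool]:
    """Pick one peer uniformly at random and compare fingerprints.

    `peers` is a hypothetical stand-in mapping each peer URL to the bytes
    that peer serves for the document. Returns (peer_url, matches), where
    matches is True only if both the MD5 and SHA-1 digests agree.
    """
    peer_url = rng.choice(sorted(peers))  # every peer is equally likely
    matches = fingerprint(local_copy) == fingerprint(peers[peer_url])
    return peer_url, matches


if __name__ == "__main__":
    original = b"We the People of the United States"
    tampered = original.replace(b"People", b"Peeple")
    peers = {"http://a.example/orwell": original,
             "http://b.example/orwell": tampered}
    peer, ok = verify(original, peers, random.Random())
    print(peer, "matches" if ok else "MISMATCH")
```

Comparing the two digests as a pair means that a mismatch in either hash is reported as a mismatch, which mirrors the rule stated in the text.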
If either pair of hash values does not match, then the versions of the document are not the same, which triggers a notification to the verifying client.

The success of Orwell's document verification model depends on multiple users opting in to host a given document. To encourage users to host others' documents, Orwell implements a give-and-take incentive structure. After creating their Orwell client, users are able to upload five documents; for each subsequent document, users must host at least one document uploaded by another client (and users will not be able to upload additional content until they do so). This constraint is enforced by

Orwell's central server, so users cannot bypass this restriction simply by altering the source code of their own clients.

In addition to verifying documents via the client view, clients can periodically verify a number of randomly selected documents. Cron [10], for example, can be used to automate this process, which can be run from the command line and utilizes the same verification code used by the web frontend.

It is important to note that the average load placed on clients during the automatic document verification process does not depend on the size of the network. In a network of $n$ clients, a client $n_0$ that is scheduled to perform a verification selects one of the other clients in the network with equal probability; the probability that any particular other client is chosen by $n_0$ is therefore $1/(n-1)$. Because each of the $n$ clients in the network selects each other client with probability $1/(n-1)$, and each client makes its selection independently, the expected number of times a given client is selected by another client (after all clients have performed a verification) is simply

$$\sum_{i=1}^{n-1} \frac{1}{n-1} = (n-1) \cdot \frac{1}{n-1} = 1$$

by linearity of expectation. Intuitively, the expected load on each client should be independent of the network's size: with each additional client, the total number of verifications performed across the network increases, but the probability that a particular client is chosen by another client decreases. This result demonstrates that automatic document verification networks are indeed scalable, such that clients hosting a popular document should not expect any increase in load on average. Rather, the load placed on a client depends on the total number of documents it hosts, which parallels the inevitable constraint that storage space places on the number of documents that can be hosted by a single client.

Furthermore, this expected value result suggests that the verification network will quickly detect any document compromises.
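The expected-load argument above can be checked empirically. The following Python sketch is not part of the Orwell code base; the function names and the choice of network sizes are assumptions made for illustration. It simulates rounds in which every client selects one other client uniformly at random and reports the average number of times a fixed client is selected per round.

```python
import random
from typing import List


def simulate_round(n: int, rng: random.Random) -> List[int]:
    """One round: each of n clients picks one *other* client uniformly.

    Returns counts, where counts[i] is how often client i was selected.
    """
    counts = [0] * n
    for chooser in range(n):
        # Draw among the n - 1 other clients by skipping the chooser's index.
        pick = rng.randrange(n - 1)
        if pick >= chooser:
            pick += 1
        counts[pick] += 1
    return counts


def mean_load_on_client(n: int, rounds: int, seed: int = 0) -> float:
    """Average number of selections of client 0 per round."""
    rng = random.Random(seed)
    total = sum(simulate_round(n, rng)[0] for _ in range(rounds))
    return total / rounds


if __name__ == "__main__":
    for n in (5, 50, 500):
        print(n, round(mean_load_on_client(n, rounds=2000), 3))
```

Since each round hands out exactly $n$ selections among $n$ clients, the per-client average is exactly 1 by construction; the simulation simply shows that the load observed by one particular client also stays close to 1 as $n$ grows.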
The probability that a given client $n_0$ in the network is selected at least once after each client has made a selection is equal to the complement of the probability that no other client selects $n_0$. In a network of $n$ clients, this probability is equal to:

$$P(n_0) = 1 - \left( \frac{n-2}{n-1} \right)^{n-1}$$

As the size of the document verification network grows without bound, the probability that a given client is selected at least once therefore remains greater than 60%, as we have:

$$\lim_{n \to \infty} \left[ 1 - \left( \frac{n-2}{n-1} \right)^{n-1} \right] = 1 - \frac{1}{e} \approx 0.632$$

Therefore, the random client selection process should be an effective means of ensuring documents' integrity across a fully connected network.

E. Conflict Resolution

If during the verification process two documents are found to be different, clients are given the ability to repair the documents. In the event of a mismatch, users can view the differences between the locally hosted document and the document hosted by the client it was verified against. Each administrator of the client is also sent an email notification containing a link to this diff view. After examining the differences in the documents, the user is asked which copy of the document is correct. If the user indicates that the local copy of the document is correct, then the administrators of the other client (the one verified against) are sent an email notification from which they too can view the differences between the documents. If, on the other hand, the user indicates that the locally hosted copy of the document is incorrect, then the client can repair the document using the other client's copy.

IV. FUTURE WORK

Initial Orwell development has been completed and the project has been tested on a

small scale. However, future research should test the performance of the implementation on a larger scale. In particular, the load placed on clients hosting a varying number of documents over varying verification network sizes should be analyzed. When signing up to host a particular document, the client should be aware of the foreseen load as a result of periodic verification. Similarly, the performance of the client application should be tested on a variety of hardware in order to make the minimum system requirements clear to users.

Furthermore, additional research should investigate the effectiveness of Orwell's incentive structure. While Orwell's give-and-take requirements for uploading documents seem fair, the model has not been tested among real users. Consequently, its effectiveness and reception (e.g., fairness, feasibility of expectations, etc.) among users are currently unknown. If the current model is indeed ineffective, then other structures, perhaps incorporating elements of gamification like achievements or karma points, should be investigated.

V. REFERENCES

[1] LOCKSS. http://www.lockss.org/
[2] How LOCKSS Works. http://www.lockss.org/about/how-it-works/
[3] How to Join. http://www.lockss.org/join/
[4] CakePHP. http://cakephp.org
[5] PHP. http://php.net
[6] MySQL. http://mysql.com
[7] jQuery. http://jquery.com
[8] Bootstrap. http://twitter.github.com/bootstrap
[9] Project Orwell source code. https://github.com/tmac721/project-orwell
[10] Cron. http://en.wikipedia.org/wiki/cron