Shroudbase Technical Overview

Differential Privacy

Differential privacy is a rigorous mathematical definition of database privacy developed for privacy-preserving data analysis. Specifically, it ensures that a computation does not reveal information about individual records in its input, by requiring that the computation behave almost identically on any two databases that differ by at most a single record.

Formally, a mechanism $M$ mapping datasets to distributions over an output space $R$ is $\varepsilon$-differentially private if, for every $S \subseteq R$ and for all datasets $A$, $A'$ such that at most one record must be added or removed to change $A$ into $A'$,

$$\Pr[M(A) \in S] \le e^{\varepsilon} \, \Pr[M(A') \in S].$$

We can interpret the definition as follows. Suppose there are two databases: one that contains an individual's data, $A$, and one that does not, $A'$. Then for small values of $\varepsilon$, there is no output an adversary could use to distinguish between $A$ and $A'$. As such, it is virtually impossible to identify any information about an individual when differential privacy is achieved. The guarantee ensures that participating in a dataset will not disclose personal information about an individual, regardless of any external information or datasets, regardless of the computational power of an adversary, and regardless of any statistical techniques that exist or may be developed in the future.

Differential privacy is typically achieved by adding statistical noise to the output of queries or, more abstractly, to the method of choosing responses to queries. A decade of research in the field has produced an array of algorithms that achieve differential privacy for a wide range of data analysis methods. These algorithms have been refined to introduce minimal noise, and they come with strong, provable guarantees of accuracy.
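As a concrete illustration of noise addition, the sketch below answers a counting query with the standard Laplace mechanism and numerically checks the $e^{\varepsilon}$ bound from the definition on two neighboring databases. It is a minimal example, not part of Shroudbase: the function names (`laplace_noise`, `noisy_count`, `worst_ratio`) and the sample data are assumptions made for illustration.

```python
import math
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def noisy_count(records, predicate, epsilon, rng):
    """Answer a counting query with epsilon-differential privacy.

    Adding or removing one record changes the true count by at most 1,
    so Laplace noise with scale 1/epsilon is enough.
    """
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon, rng)


def laplace_density(x, center, scale):
    """Density of the Laplace distribution centered at `center`."""
    return math.exp(-abs(x - center) / scale) / (2.0 * scale)


def worst_ratio(count_a, count_b, epsilon, outputs):
    """Largest output-density ratio between two databases over `outputs`."""
    scale = 1.0 / epsilon
    return max(
        laplace_density(y, count_a, scale) / laplace_density(y, count_b, scale)
        for y in outputs
    )


# Hypothetical data: database A and a neighbor A' with one record removed.
ages = [23, 35, 41, 52, 67, 29, 38]
neighbor = ages[:-1]
epsilon = 0.5
rng = random.Random(0)

print("noisy count on A :", noisy_count(ages, lambda a: a >= 30, epsilon, rng))
print("noisy count on A':", noisy_count(neighbor, lambda a: a >= 30, epsilon, rng))

# The true counts are 5 and 4; the density ratio never exceeds e^epsilon.
grid = [x / 10.0 for x in range(-200, 300)]
print("worst ratio:", worst_ratio(5, 4, epsilon, grid), "bound:", math.exp(epsilon))
```

A smaller `epsilon` widens the noise and tightens the bound on the ratio, which is the trade-off between accuracy and privacy that the definition formalizes.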
However, this interactive model requires that noise be drawn from a fixed distribution on multiple occasions, which introduces a critical drawback. The database comes with a budget and querying is costly: once this budget is exhausted, differential privacy is no longer satisfied.

Producing Synthetic Data

The key to practical, differentially private data analysis is generating synthetic databases. Because these databases are computed from the original data by differentially private algorithms, any further computation over them is also differentially private. As a result, these databases do not impose any limitations on data access, and they remain private even in the event of a security breach.

An example of a simple method for producing synthetic data on low-dimensional datasets is the MWEM algorithm, which accurately answers statistical queries (queries that count the number of records satisfying a certain predicate). MWEM (Multiplicative Weights Exponential Mechanism) maintains an approximating dataset over a domain of records, initialized to the uniform distribution over that domain. At each iteration, the algorithm chooses a query with a high error on the approximate data, poses this query to the true data, and improves the approximate data so that it answers that query more accurately. After a specified number of iterations, the algorithm outputs the average of the approximate databases produced at each iteration as the synthetic data. The error of this algorithm, defined as the maximum error over all queries, provably grows only logarithmically in the number of queries and is asymptotically smaller than the number of records.

The mathematical details are as follows. The algorithm takes as input a database $D$, a number of records $n$, a set $Q$ of queries, a number of iterations $T$, and a privacy parameter $\varepsilon$ (a small number). First, a distribution $A_0$ is initialized to be the uniform distribution over the universe $U$ of records.

The exponential mechanism, which satisfies differential privacy, is used to choose queries. At a given iteration $i$ of the algorithm, the exponential mechanism chooses a query $q_i$ from the distribution that assigns each $q \in Q$ probability proportional to

$$\exp\left(\frac{\varepsilon \, |q(A_{i-1}) - q(D)|}{2T}\right),$$

where $A_{i-1}$ is the approximate database at iteration $i-1$ and $D$ is the true data.

The mechanism for posing the query to the data achieves differential privacy by adding Laplace noise to the output of the query. That is, the measurement of the output of a query is taken to be

$$m_i = q_i(D) + \mathrm{Lap}\left(\frac{2T}{\varepsilon}\right).$$

At each iteration $i$, the approximate database is updated using the multiplicative weights algorithm:

$$A_i(x) \propto A_{i-1}(x) \cdot \exp\left(\frac{q_i(x) \, (m_i - q_i(A_{i-1}))}{2n}\right).$$

Once the algorithm has completed $T$ iterations, $A = \mathrm{avg}_{i<T} A_i$ is output as a synthetic database. The worst-case error of the algorithm is given by

$$\max_{q \in Q} |q(A) - q(D)| \le 2n \sqrt{\frac{\log |U|}{T}} + \frac{10 \, T \log |Q|}{\varepsilon}.$$

The MWEM algorithm, however, is not a universal solution to the release of synthetic data. The algorithm has worst-case exponential complexity, so it is not practical for high-dimensional datasets. Moreover, the accuracy guarantees it provides hold only for linear queries. Although compositions of linear queries can be used to implement a broad range of statistical techniques, MWEM does not provide any accuracy guarantees for certain crucial methods in data analysis, such as regressions.
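The MWEM iteration described above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not Shroudbase's implementation: the database is represented as a histogram over an explicitly enumerated universe, each query is a 0/1 vector over that universe, and the names `mwem`, `evaluate`, and `laplace_noise` are hypothetical. The update is renormalized so that the approximate database always carries total weight $n$.

```python
import math
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def evaluate(query, weights):
    """Linear query: sum of q(x) * weight(x) over the universe."""
    return sum(q * w for q, w in zip(query, weights))


def mwem(histogram, queries, T, epsilon, seed=0):
    """Return a synthetic histogram for `histogram` (the true data D).

    histogram: true record counts, one entry per universe element.
    queries:   list of 0/1 vectors over the universe (counting queries).
    """
    rng = random.Random(seed)
    n = sum(histogram)
    size = len(histogram)
    approx = [n / size] * size          # A_0: uniform, scaled to n records
    history = []
    for _ in range(T):
        # Exponential mechanism: favor queries with high error on approx.
        errors = [abs(evaluate(q, approx) - evaluate(q, histogram))
                  for q in queries]
        top = max(errors)               # subtracted only for numerical stability
        scores = [math.exp(epsilon * (e - top) / (2 * T)) for e in errors]
        q = rng.choices(queries, weights=scores, k=1)[0]

        # Laplace mechanism: noisy measurement of the chosen query.
        m = evaluate(q, histogram) + laplace_noise(2 * T / epsilon, rng)

        # Multiplicative weights update, then renormalize to total weight n.
        gap = m - evaluate(q, approx)
        updated = [a * math.exp(qx * gap / (2 * n))
                   for a, qx in zip(approx, q)]
        total = sum(updated)
        approx = [n * u / total for u in updated]
        history.append(approx)

    # Output the average of the approximate databases A_1, ..., A_T.
    return [sum(column) / T for column in zip(*history)]


# Hypothetical example: 100 records over a universe of 4 record types.
true_hist = [40, 10, 30, 20]
count_queries = [[1, 0, 0, 0], [1, 1, 0, 0], [0, 0, 1, 1], [0, 1, 0, 1]]
synthetic = mwem(true_hist, count_queries, T=10, epsilon=1.0, seed=1)
print([round(w, 2) for w in synthetic])
```

Because the histogram has one entry per element of the universe, its size grows exponentially with the number of attributes, which is the complexity limitation noted above.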
Shroudbase

The approach used in this algorithm is the foundation for many of the advanced algorithms deployed by the Shroudbase platform, which produces and manages synthetic data through differentially private mechanisms. Shroudbase's patent-pending software deploys a repertoire of privacy-preserving algorithms to enable accurate data analytics on sensitive data, far beyond the capabilities of MWEM. These range from producing summary statistics to machine learning and optimization. Shroudbase:

- efficiently produces synthetic data on terabytes of high-dimensional databases
- efficiently produces synthetic data that preserves the accuracy of generalized linear models, such as regressions
- maintains these private databases in a centralized, easy-to-use platform
- answers millions of MySQL queries without requiring the user to specify them in advance

Shroudbase is a platform for producing and managing these differentially private synthetic databases.

Shroudbase Infrastructure

I. Privatization

Privatizing data with Shroudbase is a one-step process. The client simply enters the information required to access their database, along with an endpoint at which to store the synthetic data. The platform currently privatizes any structured data, including MySQL, PostgreSQL, Microsoft SQL, SQLite, Excel spreadsheets, and CSV files. The privatization procedure can be run through our cloud cluster or locally by installing the Shroudbase Database Management System on the client's machines. If the client uses a local implementation, then the entire procedure can be executed without Shroudbase ever reading or storing any sensitive information.

II. Storage

Privatized data is stored with the Shroudbase Cloud Database Service. While many online storage systems only protect data in transit, Shroudbase ensures that the only data that enters the cloud is synthetic data with no personally identifiable information. Practically speaking, this means that nobody, whether a hacker, a government agency, or an employee of Shroudbase, can ever access any personal information through Shroudbase, because it simply isn't there. Clients access this service through the Shroudbase administrative control panel or the Shroudbase Database Management System, an installable package for controlled data access and administration. Clients also have the option of storing the privatized data locally.

III. Querying

The Shroudbase Query Client provides an easy and intuitive way to use privatized databases. This client interface takes in SQL-formatted commands and outputs responses in a format similar to that of MySQL's client interface. The client is run by calling sb from the command line with the appropriate hostname and port for the database the user is connecting to. Queries with Shroudbase are identical to MySQL queries, and Shroudbase supports most statistical functions found in MySQL.

IV. Updating

Shroudbase's patent-pending technology supports inserting additional data into the database while preserving privacy. When additional data is added, the Shroudbase system stores the data in an intermediary state until the Shroudbase server detects that an update needs to occur. When an update occurs, the privatization job is off-loaded to Shroudbase's privatization infrastructure to be recomputed in the cloud.

Note: For clients who wish to run specialized analysis not currently supported by Shroudbase synthetic datasets, we provide custom implementations of adaptive differentially private mechanisms.