2809 Telegraph Avenue, Suite 206, Berkeley, California 94705
leapyear.io
Future-proof data privacy

Copyright 2015 LeapYear Technologies, Inc. All rights reserved. This document does not provide you with any legal rights to any intellectual property in any LeapYear product. You may copy and use this document for your internal, reference purposes.
Introduction

Big data analytics present a well-documented opportunity: organizations can make more profitable and more informed business decisions, and individuals can enjoy higher-quality services and easier, healthier lives. These broad benefits are tempered by complex challenges associated with the use of personal data. According to a 2014 White House report on big data analytics, careless or malicious use of financial information, medical records, location history, and other sensitive information can eclipse longstanding civil rights protections governing how personal information is used in housing, credit, employment, health, education, and the marketplace.[1] Meanwhile, the amount of personal data collected continues to grow at a breakneck pace, driven by competitive pressures. As a result, privacy protection is a moving target that many corporations and governments fail to hit, with expensive consequences.

In 2014, over a billion personal records were breached, an increase of 78% over the previous year.[2] Organizations such as JPMorgan Chase, Home Depot, and eBay suffered high-profile data breaches that compromised the privacy of millions. The business consequences of these breaches have been dramatic. Following a data breach, major retailers have reported drops in sales of between 2% and 6%, on top of costs incurred through lawsuits, government fines, IT investments, and rebranding efforts.[3] Consumers have begun dramatically changing their spending habits to ensure that their personal information remains private. According to a 2013 report by Radius Global Market Research, more than 75% of internet users surveyed said they would stop doing business with a company if they felt their privacy was violated, and 51% said they'd already stopped buying from certain retailers out of concern for the privacy of their data.[4]
[1] http://www.whitehouse.gov/sites/default/files/docs/big_data_privacy_report_may_1_2014.pdf
[2] http://www.gemalto.com/press/pages/gemalto-releases-findings-of-2014-breach-level-index.aspx
[3] https://www.unboundid.com/blog/2014/10/02/infographic-questions-to-ask-to-avoid-a-data-breach
[4] http://www.emarketer.com/article/consumers-of-all-ages-more-concerned-about-online-data-privacy/1010815/1
To limit potential damages and encourage better management of data, government regulations have been established in every information-driven industry: HIPAA, HITECH, FINRA, GLBA, PCI, FERPA, FACTA, and the EU Data Protection Directive, to name a few. However, these regulations severely burden business operations by restricting the collection, usage, and sharing of data. Healthcare privacy regulations in particular have had a damaging effect on research and analysis, especially in public health research and genomics. Researchers report that one out of every three dollars budgeted for clinical research is spent on regulatory compliance,[5] and the IDC Digital Universe Report estimates that less than 10% of useful health data is currently utilized for research and analysis.[6] The value of data is compromised across industries by privacy concerns, limiting business applications such as market research, quality assessment, and resale of valuable data.

The regulatory approach to protecting privacy also fails to account for new advances in data science and computer science. As techniques for drawing insights from limited data become more powerful and prevalent, the category of data that can be used to compromise privacy is expanding at an alarming rate. For instance, in the summer of 2014, the personal information of 83 million households and small businesses was accessed by hackers who breached the databases of JPMorgan Chase, the largest bank in the United States. The bank stated several times that account information was not compromised; only phone numbers, email addresses, and home addresses were stolen. However, privacy researchers have found that information as common as a zip code can be combined with other data to link anonymized sensitive information to individuals.
In the mid-1990s, Latanya Sweeney, a student at MIT, was able to link together two publicly available databases, one of voter records and one of anonymized health records, and easily match de-identified medical records to names.[7] She simply cross-referenced common traits such as gender, zip code, and date of birth that were present in both databases. Sweeney went on to show through her research that roughly 87% of the U.S. population can be uniquely identified by their gender, zip code, and date of birth.[8]

Through these clever combinations of data, malicious attackers can re-identify anonymized databases of sensitive information, which they can then use to obtain credit cards, wire money from bank accounts, and receive free medical services. Regulations and self-imposed privacy policies take a brute-force approach to this challenge, mandating the removal of information to match the progress of analytical innovations. This approach is insufficient to protect privacy in the long term. Given the sheer volume of personal information that is already public or poorly protected, a patient and persistent attacker can easily access enough information to re-identify virtually any de-identified database.

Further complicating the problem is the fundamental tradeoff between precision and privacy. Research in information theory, a field which focuses on quantifying and identifying properties of data, has shown with mathematical certainty that data cannot be perfectly anonymized without compromising some of the data's usefulness for statistical analysis.[9] Any method of privacy protection must walk the fine line of providing maximum value from data while protecting individual privacy.

[5] http://c-changetogether.org/websites/cchange/images/hipaa/c-change_hipaa_cost_study_web_version.pdf
[6] http://www.emc.com/collateral/analyst-reports/idc-the-digital-universe-in-2020.pdf
[7] http://www.ncbi.nlm.nih.gov/pmc/articles/pmc3032399/pdf/nihms-264889.pdf
[8] http://dataprivacylab.org/projects/identifiability/paper1.pdf
[9] http://www.cse.psu.edu/~asmith/privacy598/papers/dn03.pdf
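The linkage attack Sweeney performed can be sketched in a few lines. The records below are invented for illustration; the point is that two tables sharing quasi-identifiers (zip code, gender, date of birth) can be joined to re-identify "anonymized" rows even though neither table contains both a name and a diagnosis.

```python
# Public voter rolls: names plus quasi-identifiers (data invented for illustration).
voter_records = [
    {"name": "A. Smith", "zip": "02139", "sex": "F", "dob": "1960-07-01"},
    {"name": "B. Jones", "zip": "02139", "sex": "M", "dob": "1955-03-12"},
]

# "De-identified" medical records: names removed, but the same
# quasi-identifiers remain.
medical_records = [
    {"zip": "02139", "sex": "F", "dob": "1960-07-01", "diagnosis": "diabetes"},
]

# Join the two tables on (zip, sex, dob) to recover identities.
key = lambda r: (r["zip"], r["sex"], r["dob"])
names_by_key = {key(v): v["name"] for v in voter_records}

reidentified = [
    (names_by_key[key(m)], m["diagnosis"])
    for m in medical_records
    if key(m) in names_by_key
]
print(reidentified)  # -> [('A. Smith', 'diabetes')]
```

Because roughly 87% of Americans are uniquely determined by these three attributes, a join like this succeeds for the overwhelming majority of records.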
The Current State of Privacy Protection
Below are standard methods for privacy-preserving analysis and their shortcomings:

Summary statistics
  Description: Provide only aggregate statistics on the database, such as the mean, median, and mode.
  Shortcomings: Reconstruction attacks can identify individuals using only summary statistics.[10]

Hashing
  Description: Replace PII such as names and SSNs with numbers ("hashes") generated by a hash function. Such a function cannot be inverted: given a value, it is simple to find its hash, but given a hash, it is hard to find the original value.
  Shortcomings: Hashing can be reversed by having a computer simply check all possibilities to recover the original value. Even without reversing the hash, linkage attacks can use outside information for identification. Vulnerable to security breaches of the hash function.

Query auditing
  Description: Restrict which queries can be asked by certain users based on permissions.
  Shortcomings: Severely limits the analytical utility of the data. Impossible if the number of potential queries is very large.

Data masking
  Description: Restrict which entries in the database can be seen by certain users based on permissions.
  Shortcomings: Prone to human error. Limits insights for analytics. Defeated by linkage attacks.

k-anonymity
  Description: Modify data so that each combination of identifying attributes is shared by k other members of the dataset.
  Shortcomings: Only protects against identity disclosure, not attribute disclosure: even though the quasi-identifying values are the same for k individuals, an adversary can still learn a sensitive attribute of a target individual.

[10] http://www.cis.upenn.edu/~aaroth/papers/privacybook.pdf, page 8
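The brute-force reversal of hashing mentioned above is easy to demonstrate. The sketch below uses SHA-256 and a 4-digit PIN as a stand-in for any low-entropy identifier (the same attack works on SSNs or dates of birth, just with more candidates to try):

```python
import hashlib

def pseudonymize(value: str) -> str:
    """Replace a PII value with its SHA-256 hash ('hashing' from the table)."""
    return hashlib.sha256(value.encode()).hexdigest()

# A "de-identified" record: the 4-digit PIN was replaced by its hash.
leaked_hash = pseudonymize("4729")

# Brute-force reversal: the attacker hashes every possible PIN
# (only 10,000 candidates) and compares against the leaked hash.
recovered = next(
    pin for pin in (f"{i:04d}" for i in range(10_000))
    if pseudonymize(pin) == leaked_hash
)
print(recovered)  # -> 4729
```

Because the space of SSNs, birth dates, or phone numbers is small by cryptographic standards, the hash function's one-way property provides no real protection for such fields.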
In addition, query auditing and data masking fail to protect privacy in the event of a security breach. The strict protocols they require can easily be violated if the security system implementing them is not robust. Furthermore, these privacy-preserving approaches to analytics require vigilant enforcement and organization-wide adoption to be effective. Employee education and company policies for the handling of data are expensive, time-consuming, and far from foolproof. Many companies, interested in utilizing their data with minimal inconvenience, employ only the bare minimum of privacy protection.

Clearly, current de-identification techniques, privacy policies, and governmental regulations are ineffective and inefficient. They fail to protect individual privacy and restrict the collection and analysis of critical information. Medical research organizations are forced to sacrifice vital scientific progress to ensure HIPAA compliance, while social networks and e-commerce platforms self-impose stringent privacy policies that build user trust but inhibit lucrative analytics. To take full advantage of information, there is a need for a new paradigm of data privacy.

Differential Privacy

Differential privacy is a recent, mathematically rigorous definition of privacy which has inspired a field of research at the intersection of statistics and computer science. Specifically, if a database has been computed by a differentially private algorithm, then the presence or absence of any one individual in the database makes no significant difference in the likelihood of each possible response to a database query. In practice, differential privacy promises that nobody will be able to learn any significant additional information about an individual from his or her information being included in a database. Frank McSherry, one of the inventors of differential privacy, described this privacy protection as "future-proofed."
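Stated formally, using the standard definition of ε-differential privacy from the research literature (the notation is not taken from this document): a randomized algorithm M is ε-differentially private if, for every pair of databases D and D′ differing in a single individual's record, and for every set S of possible outputs,

```latex
\Pr[\mathcal{M}(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[\mathcal{M}(D') \in S]
```

For small ε, e^ε ≈ 1 + ε, so adding or removing any one person's data changes the probability of every possible outcome by at most a factor close to 1.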
Identifying individuals in a database computed by a differentially private algorithm is effectively impossible, even with unlimited time and outside information.[11] It is considered the gold standard of privacy by the privacy community.[12]

There are several reasons why differential privacy has not yet been widely accepted in industry. Primarily, it is because differential privacy is a definition, and the definition itself provides no methods of achieving it in practice. Furthermore, it does not speak to the utility of the data after it is accessed through differentially private mechanisms. Designing algorithms that achieve differential privacy while maintaining the accuracy of statistical analysis is an active, narrow field of research, with only a few experts advancing the science.

One of the first mechanisms for achieving differential privacy was the addition of random noise, or distortions, to the output of queries. The magnitude of the noise added to a particular query is a function of the largest change a single entry could have on the output of that query. This method, known as the Laplace Mechanism, allows one to answer aggregate count queries (e.g., "How many people in the database have black or brown hair and live in California?") with a fair amount of accuracy, but it fails to provide useful results for more sophisticated statistical analysis of the database. In order for differential privacy to be preserved by the Laplace Mechanism, a privacy budget is placed on the database. Each query costs a portion of this budget, and once the budget is exhausted, access must be terminated. Moreover, if a security breach allows an attacker to bypass the querying mechanism, the raw data is entirely compromised. These practical shortcomings of the Laplace Mechanism and other early methods for achieving differential privacy have resulted in the standard being viewed as a theoretical ideal, but too strict a requirement for real-world application.[13]
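The Laplace Mechanism for count queries can be sketched in a few lines. A count query has sensitivity 1 (adding or removing one person changes the count by at most 1), so noise drawn from a Laplace distribution with scale 1/ε suffices for ε-differential privacy; the numbers below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def laplace_count(true_count: int, epsilon: float) -> float:
    """Answer a count query under the Laplace Mechanism.

    Sensitivity of a count is 1, so Laplace noise with
    scale = sensitivity / epsilon gives epsilon-differential privacy.
    """
    sensitivity = 1.0
    noise = rng.laplace(loc=0.0, scale=sensitivity / epsilon)
    return true_count + noise

# e.g. "How many people in the database have brown hair and live in CA?"
true_count = 12_345
noisy = laplace_count(true_count, epsilon=0.5)
print(round(noisy))  # close to 12,345, but randomly perturbed
```

Smaller ε means stronger privacy but larger noise, which is precisely the precision-versus-privacy tradeoff described earlier.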
[11] http://www.scientificamerican.com/article/privacy-by-the-numbers-a-new-approach-to-safeguarding-data/
[12] http://arxiv.org/abs/1402.3329
[13] http://www.jetlaw.org/wp-content/uploads/2014/06/bambauer_final.pdf
However, the seminal paper which introduced the Laplace Mechanism motivated a decade of research in the field of differential privacy. This has resulted in the development of differentially private analysis: analytical techniques, including regressions and machine learning algorithms, that achieve both differential privacy and an extremely high degree of accuracy. In the past few years, algorithms have emerged for producing synthetic datasets, which are inherently differentially private. These synthetic datasets can be optimized to answer a large number of queries of a client's choosing with extreme accuracy while satisfying the highest standard of privacy. Synthetic data imposes no privacy budget and remains private even in the event of a security breach. These rapid advancements are moving differential privacy from a theoretical ideal to a precise, practical, and quantifiable standard of data privacy.
Shroudbase
Shroudbase is a platform for storing, sharing, and analyzing sensitive data. It provides compliance, analytical flexibility, and the highest standard of data privacy.[14] Shroudbase provides a patent-pending system for creating, managing, updating, and querying differentially private synthetic datasets. These versions are effectively identical in function to the original data, except that they are permanently de-identified. This holds even if the privatized data is analyzed, sold, published, combined with other data, or stolen.

While current methods of de-identification can significantly hinder access to insights by removing information from the original data, Shroudbase protects privacy without removing any information from the database, enabling analysis of previously untouchable data. Its algorithms intelligently recompute databases, creating permanently de-identified copies of the original data. Beyond completely anonymizing sensitive data, Shroudbase achieves the strongest standard of data privacy: differential privacy. We have shown with mathematical proof that the presence of any single individual in a differentially private database does not significantly affect the outcome of any analysis on the database. Consequently, the amount of additional information disclosed about an individual by his or her inclusion in a database produced by Shroudbase is negligible. This holds even in the event of a security breach: if a database that had been privatized by Shroudbase were illegally accessed and published online, the data would still be differentially private.

Unlike other differentially private mechanisms, Shroudbase is practical for a wide range of uses, including business intelligence, research, and open-source applications. Users of the software can ask unlimited queries of their data and update their data without affecting the privacy protection.
Shroudbase produces synthetic data that is optimized for accurate analytics, ranging from summary statistics to machine learning.

[14] For more technical details on Shroudbase and differential privacy, please visit shroudbase.com/technology
How it Works
I. Privatization

Privatizing data with Shroudbase is a one-step process. The client simply enters the information required to access their database along with an endpoint to store the synthetic data. The platform currently privatizes any structured data, including MySQL, PostgreSQL, Microsoft SQL Server, SQLite, Excel spreadsheets, and CSV files. The privatization procedure can be run through our cloud cluster or locally by installing the Shroudbase Database Management System on the client's machines. If the client uses a local implementation, then the entire procedure can be executed without Shroudbase ever reading or storing any sensitive information.
II. Storage

Privatized data is stored with the Shroudbase Cloud Database Service. While many online storage systems only protect data in transit, Shroudbase ensures that the only data that enters the cloud is synthetic data with no personally identifiable information. Practically speaking, this means that nobody, whether a hacker, a government agency, or an employee of Shroudbase, can ever access any personal information through Shroudbase, because it simply isn't there. Clients access this service through the Shroudbase administrative control panel or the Shroudbase Database Management System, an installable package for controlled data access and administration.
III. Querying

The Shroudbase Query Client provides an easy and intuitive way to use privatized databases. The client interface takes in SQL-formatted commands and outputs responses in a format similar to MySQL's client interface. It can be run by calling 'sb' from the command line with the appropriate hostname and port for the database the user is connected to. Queries with Shroudbase are identical to MySQL queries, and Shroudbase supports most statistical functions found in MySQL.

IV. Updating

Shroudbase's patent-pending technology supports inserting additional data into the database while preserving privacy. When additional data is added, the Shroudbase system stores it in an intermediary state until the Shroudbase server detects that an update needs to occur. When an update occurs, the privatization job is off-loaded to Shroudbase's privatization infrastructure to be recomputed in the cloud.
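Since the query interface accepts standard SQL, a typical aggregate query looks exactly as it would against MySQL. To keep the sketch self-contained and runnable, the example below runs such a query against an in-memory sqlite3 database standing in for a privatized Shroudbase database; the table and rows are invented for illustration.

```python
import sqlite3

# An in-memory database standing in for a privatized Shroudbase database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE patients (state TEXT, hair TEXT)")
conn.executemany(
    "INSERT INTO patients VALUES (?, ?)",
    [("CA", "brown"), ("CA", "black"), ("NY", "brown"), ("CA", "brown")],
)

# The same style of COUNT query discussed with the Laplace Mechanism:
(count,) = conn.execute(
    "SELECT COUNT(*) FROM patients "
    "WHERE state = 'CA' AND hair IN ('black', 'brown')"
).fetchone()
print(count)  # -> 3
```

Against a Shroudbase database, the difference is not in the SQL but in the data underneath: the answer is computed over the synthetic rows, so no per-query noise, budget, or restriction is needed.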
Results

As with any technique that rigorously protects privacy, some accuracy is lost because of the statistical noise introduced to the data itself. However, Shroudbase has been optimized to deliver highly accurate results for aggregate statistical analysis and advanced data-mining algorithms. Furthermore, the platform supports analysis of sensitive, high-dimensional data on the order of terabytes. The table below summarizes the performance of Shroudbase on a variety of databases. Query accuracy is defined as one minus the largest fractional difference between the output of any benchmarked query on the original data and on the data produced by Shroudbase.

Dataset           Entries   Attributes   Distinct Properties   Runtime        Query Accuracy
National Census   236,844   4            26                    1 min 50 s     99.7%
State Census      30,162    16           265                   3 min 21 s     98.8%
Blood Donations   748       7            65                    1 min 14 s     99.7%
Movie Reviews     943       4,000        40,000                2 hr 21 min    99.1%
Genomics          58        7,000        70,000                2 hr 48 min    93.9%

Datasets: National Census is a dataset of abridged 2010 census data. State Census is a dataset of state and local census data from 1995. Blood Donations is a dataset of blood donations and information about the donors. Movie Reviews is a high-dimensional dataset of publicly collected user movie ratings and information about the users. Genomics is a high-dimensional genomics dataset containing around 7,000 genomic markers for 58 cancer patients.
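The worst-case accuracy metric described above can be sketched as follows. The query answers here are made up for illustration; they are not LeapYear's benchmark numbers.

```python
# Hypothetical answers to the same five queries, asked of the original data
# and of the synthetic (privatized) data.
original = [23, 45, 31, 52, 38]
synthetic = [24, 44, 31, 53, 37]

def worst_case_accuracy(orig_answers, synth_answers):
    """1 minus the largest fractional difference across all queries."""
    worst = max(
        abs(o - s) / abs(o) for o, s in zip(orig_answers, synth_answers)
    )
    return 1.0 - worst

acc = worst_case_accuracy(original, synthetic)
print(f"{acc:.1%}")  # -> 95.7%
```

Because the metric takes the single most erroneous query, it is a conservative summary: every other query in the benchmark is at least this accurate.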
Compliance

Privacy experts agree that databases computed by differentially private algorithms satisfy the HIPAA Privacy Rule's requirements for de-identified data. This agreement represents more than a consensus in an industry survey: one of the two methods for compliance with the HIPAA Privacy Rule is the Expert Determination Method, which is outlined below.

(Source: HHS Guidance Regarding Methods for De-identification of PHI in Accordance with the HIPAA Privacy Rule)

At the request of a client, independent statisticians can verify that our process satisfies requirements for de-identified data under HIPAA.
Conclusion

Modern organizations face a daunting challenge: utilizing the sensitive data they have collected to gain a competitive advantage while simultaneously protecting the privacy of their customers and patients. Standard methods of privacy protection are no longer acceptable; they are costly to implement, time-consuming for large quantities of data, vulnerable to escalating threats, and restrictive of data utility. Shroudbase is a new paradigm of data management that offers streamlined, mathematically provable privacy by design to everyone. The technology provides analytical accuracy and the highest standard of data privacy while providing tools that work seamlessly with a client's existing infrastructure.
Frequently Asked Questions
OVERVIEW

What is differential privacy?
Differential privacy is a mathematical definition of privacy.[15] It states that the presence or absence of any one individual in a database makes no significant difference in the likelihood of each possible response to a database query.

What is Shroudbase?
Shroudbase is LeapYear's patent-pending platform for creating, managing, and analyzing privatized copies of quantitative data. These copies are effectively identical to the original data, except that they will never release any information that can be used to identify any individual. This holds even if the privatized data is sold, shared, published, stolen, or submitted to any kind of statistical analysis.

What do you mean by mathematically proven privacy?
We have shown through rigorous mathematical proof that the chance of learning anything more about any particular individual from their inclusion in a database produced by Shroudbase, through any method, is negligible. This statement holds no matter what outside information is used to augment the analysis, no matter how advanced statistical techniques become, and even in the case of a security breach. The databases we produce are differentially private.

How do you achieve differential privacy?
Our proprietary algorithms recompute your data, modifying it slightly by introducing statistical noise to its contents. This distortion prevents anyone from learning private information about a specific individual, even if the data that was privatized contains personally identifiable information (PII).

[15] For more technical details on Shroudbase and differential privacy, please visit LeapYear.io/shroudbase technology
Does Shroudbase compromise the accuracy of analysis?
As with any method of data privacy, some accuracy must be lost. However, the amount of statistical noise is precisely calibrated to conceal information about specific individuals while still answering statistical queries with near-perfect accuracy.

What happens if a privatized database is hacked?
From the standpoint of individual privacy, nothing. Our synthetic databases do not contain any personally identifiable information, so privacy is protected even if the entire contents of a database produced by Shroudbase are revealed. Privatized data produced by Shroudbase is considered de-identified information even if it is stolen and published.

USAGE

How do I use Shroudbase to privatize data?
Privatizing data with Shroudbase is a one-step process. Simply enter the information required to access your database along with an endpoint to store the synthetic data, and our algorithms will compute a synthetic, permanently de-identified copy of the original data.

How do I use Shroudbase to store data?
Privatized data is stored with the Shroudbase Cloud Database Service. The only data that enters the cloud is synthetic data with no personally identifiable information. Clients access this service through the Shroudbase Database Management System, an installable package for controlled data access and administration.

How do I use Shroudbase to query data?
The Shroudbase Query Client provides an easy and intuitive way to use privatized databases. Queries with Shroudbase are identical to MySQL queries, and Shroudbase supports most statistical functions found in MySQL.
What kind of data can be privatized?
Shroudbase can work with virtually any kind of structured data, including:
- standard MySQL/Oracle/SQL Server solutions
- Excel tables
- cloud and clustered solutions
- qualitative and text-based data

How long does privatization take?
The length of the process depends on the size and complexity of the database, but most databases can be privatized in a matter of hours.

How do I add data to a privatized database?
You can add and remove rows just as you would with a standard database. The Shroudbase management software uses proprietary algorithms to intelligently determine when the dataset requires recomputation to maintain privacy. This procedure is carried out automatically and asynchronously.

PRIVACY

What is the difference between differential privacy and standard de-identification?
Privacy: Standard de-identification can be reversed to piece together private information, while differential privacy guarantees, through rigorous mathematical proof, that it is effectively impossible to identify an individual, regardless of what outside information is used to augment the analysis, no matter how advanced statistical techniques become, and even in the case of a security breach.
Accuracy: Shroudbase carries out this process without ever removing any information from the database. Instead, we make complete, permanently de-identified copies that are precisely modified to protect privacy. Shroudbase ensures that these modifications have virtually no effect on the analytical utility of the data. Standard de-identification techniques, on the other hand, remove or inefficiently distort information and are incapable of providing any measure of accuracy.
How is Shroudbase different from most differential privacy techniques?
A significant portion of the differential privacy literature is focused on adaptive privacy-preserving mechanisms. Adaptive mechanisms provide noisy, or distorted, responses to queries. These techniques provide theoretical guarantees of accuracy and differential privacy in a variety of settings. However, they require that queries have associated privacy costs, and once a privacy budget is exhausted, differential privacy no longer holds. This causes several problems in practice: limiting the number of queries is entirely impractical for effective usage of data, and collusion could allow groups to violate privacy without the knowledge of the database curator.

Our solution is to produce synthetic databases. Synthetic data is an approximation of the true dataset optimized to accurately answer a set of queries. The algorithms which produce this approximation are differentially private and thereby ensure that any analysis of the data is private. There is no need to put any restrictions on data access, and the database remains differentially private even in the event of a security breach.

How can you ensure differential privacy without limiting the queries a client can ask the database?
Typically, differential privacy is achieved by adding statistical noise to the output of queries, which is vulnerable to collusion. Our method is to recompute a synthetic database which contains only privatized information. This allows us to preserve differential privacy while providing the client unrestricted access to the data.
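The privacy-budget problem with adaptive mechanisms can be sketched as follows. This is an illustrative toy, not LeapYear code: each Laplace-noised query spends part of a total ε budget (by the standard composition property of differential privacy), and once the budget is spent, the curator must refuse further queries or the guarantee no longer holds.

```python
import numpy as np

rng = np.random.default_rng(1)

class BudgetedCurator:
    """Toy adaptive curator: answers count queries until epsilon runs out."""

    def __init__(self, data, total_epsilon):
        self.data = data
        self.remaining = total_epsilon

    def count(self, predicate, epsilon):
        """Answer a count query, charging `epsilon` against the budget."""
        if epsilon > self.remaining:
            raise RuntimeError("privacy budget exhausted: access terminated")
        self.remaining -= epsilon
        true_count = sum(1 for row in self.data if predicate(row))
        return true_count + rng.laplace(scale=1.0 / epsilon)

curator = BudgetedCurator(data=list(range(100)), total_epsilon=1.0)
curator.count(lambda x: x < 50, epsilon=0.6)       # allowed
curator.count(lambda x: x % 2 == 0, epsilon=0.4)   # allowed; budget now 0
# Any further query raises RuntimeError: the data can no longer be used.
```

Synthetic data sidesteps this entirely: the budget is spent once, when the synthetic database is generated, and every subsequent query is free because it touches only the privatized copy.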
Are privatized databases HIPAA compliant?
HIPAA requirements for de-identifying information can be met through the Expert Determination Method (45 C.F.R. § 164.514(b)):[16]

"A covered entity may determine that health information is not individually identifiable health information only if: (1) A person with appropriate knowledge of and experience with generally accepted statistical and scientific principles and methods for rendering information not individually identifiable: (i) Applying such principles and methods, determines that the risk is very small that the information could be used, alone or in combination with other reasonably available information, by an anticipated recipient to identify an individual who is a subject of the information; and (ii) Documents the methods and results of the analysis that justify such determination."

Privacy experts have already agreed that differentially private databases satisfy this requirement, even those that are complete copies of databases with PII. At the request of a client, privacy experts can verify that our software satisfies requirements for de-identified data under HIPAA.

[16] http://www.gpo.gov/fdsys/pkg/cfr-2002-title45-vol1/pdf/cfr-2002-title45-vol1-sec164-514.pdf