Microsoft Corporation
One Microsoft Way
Redmond, WA 98052-6399
Tel 425 882 8080
Fax 425 936 7329
http://www.microsoft.com/

March 31, 2014

Ms. Nicole Wong
Big Data Study
Office of Science and Technology Policy
Eisenhower Executive Office Building
1650 Pennsylvania Ave. NW
Washington, DC 20502

Re: Government Big Data (FR Doc. 2014-04660)

Dear Ms. Wong:

Thank you for the opportunity to submit comments in response to the Office of Science and Technology Policy's Request for Information regarding big data. Microsoft applauds the Administration for initiating a comprehensive review of big data and its public policy implications, and for seeking broad input through the RFI and the various public events. We are responding primarily to the first and third questions posed by the RFI.

Big data holds tremendous promise for society. Like many other developments enabled by technology, however, big data raises important public policy questions. We believe that government policy ought to be directed at promoting the learning that can be gained by analysis of big data while ensuring that privacy is not sacrificed in the process. The benefits enabled by analysis of huge datasets will not be possible if data is locked up in government or private silos. Therefore, we believe that government ought to consider carefully how best to ensure that, in general, data is broadly available to enable big data analysis. Some of this data will relate to people. To address that, government should strengthen, but also adapt, privacy law to enable the collection and use of large datasets while preserving privacy. This will no doubt entail tradeoffs on which reasonable people may have widely varying views. In determining how best to address privacy concerns, we believe government should draw upon a full range of tools: not only law, but also technology, articulated best practices and technical standards. We address these points below.
The Promise of Big Data

Advances in technology have led to the digitization of massive amounts of information. We're increasingly surrounded by sensors (in our smartphones, tablets, cars and even common appliances), all
constantly recording data. Vast amounts of data are generated in other ways across the private and public sectors. As the costs of computing and storage have dropped, it has become increasingly possible to collect, retain, aggregate and analyze all this data. Sophisticated analytical techniques enable researchers to detect trends and correlations among disparate phenomena and thereby make important predictions.

The key to unlocking the promise of big data is to enable the collection and broad availability of large volumes of data. Unlike random sampling, which was the foundation of statistical analysis in the 20th century, big data largely relies upon the collection of as much information as possible, pulled from a variety of sources, to create new, combined datasets. The collection of data may occur before anyone even realizes what insights may later be drawn from it. Using all of the data available related to a particular phenomenon, collected from various datasets over time, facilitates detecting patterns and improves the quality of prediction.

A few examples drawn from work done at Microsoft Research illustrate the point. More than ten years ago, Microsoft researchers demonstrated the power of big data in natural language processing through their work to improve a grammar checker for Microsoft Word.[1] They began by pulling words, the data for their work, from a variety of sources, such as news articles, scientific abstracts, government transcripts and literature. They trained their grammar checker on increasingly large datasets drawn from these sources, and found that its accuracy greatly improved as the training dataset grew. For example, one of the grammar algorithms was only 75% accurate in predicting grammar problems when trained with a corpus of a half million words, but 95% accurate when trained with a corpus of a billion words.
More recently, Microsoft researchers were able to assist doctors aiming to understand HIV mutation by applying analytical methods first developed for an entirely different purpose: fighting email spam. HIV is hard to address because it is constantly mutating to avoid attack by the human immune system. Email spammers program their spam to constantly mutate, too, to avoid email filters. By applying methods initially developed to analyze spam mutations, doctors could better understand how different immune systems respond to the mutations of the HIV virus.

While data about spam was helpful in researching HIV, search queries on Bing proved useful for other medical research to discover potentially dangerous drug interactions. Microsoft Research worked with researchers at Stanford University on an analysis that identified side effects when a patient takes Paxil, a widely used antidepressant, together with Pravachol, a leading cholesterol-reducing drug.[2] Using Bing search engine logs, the researchers determined that people who searched on the names of both of those drugs had a much higher likelihood of also searching for diabetes-related side effects (such as headache or fatigue) than a person who searched for only one of the drugs. This suggested a dangerous interaction between the two drugs that could result in blood sugar levels in the diabetic range.

[1] Michele Banko and Eric Brill, Scaling to Very Very Large Corpora for Natural Language Disambiguation, Proceedings of the Annual Meeting of the Association for Computational Linguistics, 26-33 (2001).
[2] Ryen White, Nicholas Tatonetti, Nigam Shah, et al., Web-scale Pharmacovigilance: Listening to Signals from the Crowd, J Am Med Inform Assoc, doi:10.1136/amiajnl-2012-001482 (2013).

Government Should Promote the Broad Availability of Big Data

Government can play multiple roles in ensuring that society realizes the benefits of big data while other important values are protected. Government is obviously an important source of data about taxes, education, labor, defense, energy consumption, weather, health, communications, transportation, entitlement programs and more. It is a steward of that data as well. Access to data such as this will be important if new insights are to be gained and innovation promoted across disciplines as varied as health, security and economics.

Government should establish policies with a view toward making data generally available, but limited as appropriate to address other important societal values. Information that essentially belongs to society as a whole, such as data describing the physical or economic world in which we live, ought to be made generally available in an efficient manner. Where data relates to individuals, government should employ a risk-based approach to assess the benefits of data sharing against harms to privacy, taking into account de-identification approaches and other techniques that may help to reduce privacy risks.

In its role as policymaker, the government should be cognizant of its duty to promote "the Progress of Science and useful Arts" (the constitutional basis for copyright) when considering the application of copyright law to large datasets. Large datasets and the aggregation of smaller pieces of data encourage uses that advance the progress of science and research, and computational uses of such data do not impinge upon the dataset owner's reasonable expectations. Such uses are generally allowed today because U.S.
law does not provide inherent copyright protection for databases, and such a use would likely be considered permissible fair use in any event. In considering further developments in copyright law, government should strive to maintain approaches such as these that enable third parties to make use of large datasets in ways that are transformative and socially beneficial.

As an emissary to other governments, the U.S. government has a role to play in encouraging data to flow across borders while ensuring that privacy is respected. This will require the U.S. government to work closely with governments around the world to ensure that privacy regimes do not pose obstacles to the cross-border flow of data but rather appropriately protect the rights of people. Similarly, the U.S. government should work with other governments to promote copyright laws that foster access to data for all kinds of uses.

Government Should Strengthen Privacy Regulation and Adapt It for Big Data

Some large datasets will relate to people: their activities, interests, health and the like. Given the likely ubiquity of sensors in the years to come and the increasing digitization of so many human activities, there is a real risk that data about people could be misused. We believe that the promise of big data will not be realized unless approaches are established to address privacy and civil liberties concerns.
Privacy regulation in the United States is a complex (yet incomplete) patchwork of federal and state rules that apply to particular industry sectors, particular types of data or particular data uses. While these rules are generally based on the Fair Information Practice Principles, there has been a heavy emphasis on the principles of notice and consent at the time of data collection. Today, however, the notice and consent paradigm has begun to show significant signs of strain under the weight of big data. We believe the notice and consent paradigm, and privacy regulation as a whole, should be strengthened, but also adapted to a big data world, in order to address this.

Today's heavy reliance on notice and consent places most of the burden of privacy protection on individuals. This is a problem. People are confronted with lengthy, detailed and often complex privacy statements from nearly every retailer, online service provider and other organization with which they interact. (In providing these long statements, organizations are aiming to comply with current law.) In theory, people would read these statements and then make informed choices on the basis of them. In practice, people often fail to do so, as they would be overwhelmed if they tried.[3] Instead, they quickly discard paper privacy statements and click through online statements to agree with the terms of privacy notices without reading the terms, much less trying to understand them. The law on the books is satisfied, but this is weak privacy protection.

There is a second problem: even as the existing notice and consent paradigm may fail to provide real protection for people, it may serve to preclude beneficial uses of data that would present little privacy risk. Big data analysis often depends upon using datasets, often in combination with others, in ways that were not contemplated when the data was originally collected.
If the original privacy notices did not foresee a beneficial use, the data may not be available for big data analysis.

To address all this, we believe that government should look at ways to focus use of notice and consent in those areas where decisions really can be informed and meaningful, and where privacy concerns are significant. Some data uses are widely expected or understood, provide high potential societal benefit, or create a low risk of harm. For example, when people purchase an item over the Internet, they understand that the retailer will use the mailing address they provide in order to ship the item to them. Bloating privacy notices with detail about such uses distracts from disclosures that are more important.

Other data uses may entail a high risk of privacy harm and little societal benefit. It might be appropriate to generally preclude such uses, rather than allow them as long as notice is nominally provided in some multi-page legal document. For example, merely providing notice should not enable firms to use big data to discriminate against vulnerable communities in ways that would not be allowed in other circumstances.

[3] For example, it has been estimated that, on average, every American Internet user would have to spend 244 hours every year to read all the privacy statements he or she encounters. See Aleecia M. McDonald and Lorrie Faith Cranor, The Cost of Reading Privacy Policies, 4 ISJLP 543 (2008).
In between these two cases is a wide range of potential data uses where reasonable people may not expect particular data uses, or may find particular uses objectionable. This is where it is important that people be provided with meaningful notice and an opportunity to consent, or not, to uses of data about them.

This approach could be complemented by adapting privacy regulation to take greater account of a broader range of Fair Information Practice Principles. Where notice and consent are unwarranted or impractical (as for data collected by small sensors), other protections should be called into play. These may include data security; maintenance of the confidentiality, integrity and availability of the data; data minimization through the use of de-identification techniques; and mechanisms for transparency (beyond consumer notices) and other means of creating accountability for all data uses.

Privacy regulation could also be strengthened through the adoption of a more consistent and comprehensive approach to privacy law in the United States, one that reflects the broader and balanced approach to the Fair Information Practice Principles described above. Microsoft has long supported adoption of an omnibus, baseline federal privacy law. We believe this is the right approach because the increasingly complex patchwork of state and federal laws has resulted in an overlapping, inconsistent and incomplete approach to protecting privacy. This approach is confusing from the perspective of consumers, and unnecessarily burdensome for organizations.

The sectoral approach to privacy regulation that we have today may be even more problematic in the context of big data. As noted above, the value of big data is often realized when data is combined and analyzed in new ways. Yet such combinations may be precluded by differing privacy regimes applying to data first collected in varying sectors.
A baseline federal privacy law could help ensure that all companies in the big data ecosystem are applying a clear set of responsible data practices, while also enabling the societal value of big data to be more easily realized.

Government Should Explore a Variety of Ways to Protect Privacy in a World of Big Data

The challenge of unlocking the value of big data while protecting the privacy of those whose data is included requires a multifaceted solution. The government should look to technology, best practices, law and standards to help address this.

Technology. Data privacy has traditionally been concerned with the collection, use and disclosure of data that identifies individuals. Privacy frameworks have generally divided data into one of two categories, data that does not identify individuals or data that does, with the assumption that most data is in one category or the other. We will likely need to abandon this binary model. In a world of big data, data that used to be considered non-identifiable, such as the colors of cars, their makes and models, and the times of day when they are on the road, might be used to identify people, especially when combined with other information about where people live and work. While perfect anonymization of big data sets may not be mathematically possible, a variety of techniques hold promise to greatly reduce the practical risk of privacy harm. As in other areas of public policy, government should consider the likelihood and severity of potential risks, and weigh them against other societal benefits.
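To make the linkage risk concrete, here is a small, hypothetical sketch in Python. The datasets, field names and people are invented purely for illustration; the sketch shows how two individually innocuous datasets can be joined on quasi-identifiers (car color and make) so that an "anonymous" record ends up naming a person:

```python
# "Anonymous" traffic observations: no names, just car attributes.
observations = [
    {"color": "red", "make": "Civic", "hour": 7},
    {"color": "blue", "make": "Model S", "hour": 22},
]

# A second, seemingly unrelated dataset: vehicle registrations.
registrations = [
    {"owner": "A. Smith", "color": "red", "make": "Civic"},
    {"owner": "B. Jones", "color": "blue", "make": "Model S"},
]

def link(observations, registrations):
    # Join on the quasi-identifiers (color, make). Where a combination
    # is unique in the registration data, the "anonymous" observation
    # now names a person and reveals when they were on the road.
    matches = []
    for obs in observations:
        candidates = [r for r in registrations
                      if (r["color"], r["make"]) == (obs["color"], obs["make"])]
        if len(candidates) == 1:
            matches.append((candidates[0]["owner"], obs["hour"]))
    return matches
```

Defenses in the k-anonymity family work by ensuring that no combination of quasi-identifiers is unique in the released data, so that the uniqueness condition in this sketch never holds.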
It may be best to treat the identifiability of data as a continuum. Some data is highly identifiable to anyone who sees it, and other data is nearly anonymous and requires significant computing power to re-identify, with gradations between the two extremes. Government should promote research on defining and advancing pseudonymization and de-identification techniques, which would help to address privacy concerns while enabling big data analysis. Better definitions of these concepts would help society know what promises can be made at different levels of de-identification. That would have two benefits: it would avoid overpromising what de-identification can achieve while encouraging the use of de-identification for what it does achieve. This research should build on recent work, such as work on k-anonymity and on understanding how easily big data can be re-identified using other sources of information.[4]

One helpful technique that should be explored further is called differential privacy. Differential privacy does not rely on removing information from a database or changing it, but rather limits access to the underlying data and provides results to queries of the data that include random but small levels of inaccuracy, or distortion.[5] If the level of distortion is set correctly, the datasets can be usefully exploited without revealing information that could be used to re-identify individuals.

Government should explore greater use of cryptographic technologies as well. De-identification often involves cryptographically hashing information that directly identifies individuals, such as account numbers. More aggressive uses of cryptography may become practical over time. For example, current research is exploring how to use encrypted data sets to answer questions about the underlying data even though the data itself cannot be recovered, and some of the technologies are promising.[6] Further research may expand these techniques into wider application.
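The distortion idea behind differential privacy can be sketched briefly. The Python below is a minimal, illustrative implementation of the Laplace mechanism for a single count query; the function names and parameters are our own assumptions for illustration, not drawn from the cited research:

```python
import math
import random

def laplace_noise(scale):
    # Sample Laplace(0, scale) noise via the inverse-CDF transform of a
    # uniform draw on (-0.5, 0.5).
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_count(records, predicate, epsilon):
    # Answer "how many records satisfy the predicate?" with noise
    # calibrated to the query's sensitivity: adding or removing one
    # person changes a count by at most 1, so Laplace(1/epsilon) noise
    # gives epsilon-differential privacy for this single query.
    true_count = sum(1 for r in records if predicate(r))
    return true_count + laplace_noise(1.0 / epsilon)
```

A smaller epsilon means more distortion and stronger privacy; a larger epsilon means answers closer to the true count. The analyst sees only the distorted answers, never the underlying records.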
[4] For more information about k-anonymity, see Latanya Sweeney, k-anonymity: A Model for Protecting Privacy, International Journal on Uncertainty, Fuzziness and Knowledge-based Systems, 10(5), 557-570 (2002), and subsequent research.
[5] See, e.g., Cynthia Dwork, Differential Privacy, 33rd International Colloquium on Automata, Languages and Programming, part II (2006).
[6] For example, secure multi-party computation as described by Yehuda Lindell and Benny Pinkas, Privacy Preserving Data Mining, Journal of Cryptology, 15(3):177-206 (2002), and Yehuda Lindell and Benny Pinkas, Secure Multiparty Computation for Privacy-Preserving Data Mining, Journal of Privacy and Confidentiality, 1(1):59-98 (2009). Another example is structured encryption being developed by the MetaCrypt project at http://research.microsoft.com/metacrypt.

Many promising techniques for de-identification, encryption and differential privacy are still only in the research stage, and further work to commercialize them should be encouraged. By recognizing the potential for misuse of big data and the risks of various levels of de-identification, policymakers could help society quantify the value that these technologies offer. That, in turn, would likely encourage industry investments in those potentially useful technologies.

Best Practices. Technologies such as those described above should be supplemented by best practices, in government and the private sector, regarding the use of those technologies. Such best practices would help to guard against attacks on the system or other attempts to circumvent privacy protection. For example, encryption is only as good as the practices put in place to safeguard the security of decryption keys. Pseudonyms can be compromised if they are independently mapped to identifiers, such as through look-up tables.

Best practices should be standardized and accompanied by robust audit mechanisms. Auditable standards for information security already exist and provide a way for organizations to document their policies in a form that can be understood and verified by third parties. Audits should focus on the controls that organizations put in place. For example, access by users of a dataset to other information that can re-identify individuals in the first dataset should be controlled and logged. Technology has a role to play in standardizing and enforcing these policies, including by associating or tagging policy requirements directly to datasets.

Law. Some technological solutions should be backed up by legal requirements. This is important because in many cases people may not know the identities of all the organizations that have access to data about them, much less receive effective notice of the privacy policies of those organizations. Uniform legal rules could help address this kind of risk. For example, if appropriate legal rules were established that generally prohibited the re-identification of de-identified data, de-identified data could be shared with greater confidence that privacy would be preserved. Any such rules should, of course, be carefully crafted to avoid locking in technologies now that may need to be changed quickly as knowledge of big data and its privacy implications progresses.

Standards. One way to provide technological flexibility is to craft law that relies upon industry technical standards. The Federal Information Security Management Act of 2002 (FISMA) is an example of this approach.
To help implement FISMA, the National Institute of Standards and Technology (NIST) issued NIST Special Publication (SP) 800-53, which: (1) defines a risk management process; (2) specifies the risks that stakeholders must consider; and (3) provides lists of effective mitigations. The risks in the FISMA context are associated with information security, essentially the risk of loss, corruption or inappropriate disclosure of data. A standard for big data and privacy could adopt a similar approach, where the risks would include the improper re-identification or other illegitimate use of big data datasets (such as for illegal discrimination).

A standard for big data and privacy should seek to continuously improve how risks are addressed and to adapt to new risks, as FISMA and NIST SP 800-53 have done. Such a standard should describe a process of assessing risk, taking measures to reduce risk, assessing the effectiveness of the measures, and then returning to reassess risk. Finally, a successful standard should require any organization that follows it to document how it does so and to submit to third-party validation.

It may make sense for NIST to be responsible for the development of any such standard. NIST has considerable experience serving in this role in other contexts. If NIST were to undertake such an effort, it should canvass a broad set of stakeholders, from the public and private sectors, to identify the risks that any successful big data and privacy standard should address.
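As a final illustration, the look-up-table risk to pseudonyms noted under Best Practices can be sketched in Python. The snippet contrasts an unkeyed hash of an account number, which an attacker can reverse by precomputing hashes over the space of plausible account numbers, with a keyed hash whose secret key is held by the data custodian. The function names and key handling here are illustrative assumptions, not a description of any particular system:

```python
import hashlib
import hmac
import secrets

def weak_pseudonym(account_number):
    # Unkeyed hash: deterministic and public, so anyone can build a
    # look-up table mapping hash(candidate) -> candidate over the
    # account-number space and reverse these pseudonyms.
    return hashlib.sha256(account_number.encode()).hexdigest()

# Secret key held by the data custodian; it must itself be
# access-controlled and kept separate from the dataset.
SECRET_KEY = secrets.token_bytes(32)

def keyed_pseudonym(account_number, key=SECRET_KEY):
    # Keyed hash (HMAC-SHA256): without the key, a look-up table cannot
    # be precomputed, so the pseudonym resists reversal by enumeration.
    return hmac.new(key, account_number.encode(), hashlib.sha256).hexdigest()
```

Safeguarding and logging access to the key is precisely the kind of organizational control that the audits described above could verify.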
Conclusion

Microsoft appreciates the opportunity to comment on this RFI. We hope our comments will prove useful as the Administration continues its study of this important topic.

Yours sincerely,

David A. Heiner
Vice President & Deputy General Counsel, Legal and Corporate Affairs
Microsoft Corporation