Privacy: Legal Aspects of Big Data and Information Security
Presentation at the 2nd National Open Access Workshop, 21-22 October 2013, Izmir, Turkey
John N. Gathegi, University of South Florida, Tampa, FL
Visiting Professor, Hacettepe University, Ankara, Turkey
Characteristics of big data and data mining:
--Big data refers to massive amounts of seemingly unrelated data, collected from a variety of sources, that are aggregated in massive data depository systems.
--usually data sets too big for common database software to manage or process.
--3 defining features (Rubinstein, 2013):
1. availability of massive data continuously collected in multiple ways, including:
-online
-mobile devices
-location tracking
-data sharing apps
-smart environment interactions and monitoring (e.g., Internet of Things)
--big data will increasingly be derived from the Internet of Things.
-web 2.0 user-generated data, including personal information sharing
2. use of high-speed, high-transfer-rate computers with massive storage capability, utilizing the cloud computing model
3. use of new computational frameworks... for storing and analyzing this huge volume of data.
--summary: more data, faster computers, new analytic techniques
Data mining: extraction of information from massive amounts of data, leading to unexpected new knowledge: associations, patterns, and meanings that were previously buried in the data. Analyzing the data requires massively complex data mining algorithms and statistical methods.
--Think of Google: email data (Gmail), search data, personal information, web navigation data, geographic location data, voice communication data, video communication data, image management and processing data, translation data
--major benefits to industry and society in the area of innovations and service delivery (e.g. medical research, traffic management), but also some downsides, especially in the area of privacy.
--Think of Facebook: nearly a billion users uploading personal information
Rubinstein (2013) notes several intertwined trends that are presenting great challenges to privacy:
-the popularity of social networking sites that permit individuals to voluntarily share personal data
-the growth of cloud computing
-the ubiquity of mobile devices and of physical sensors that transmit geo-location information
-the growing use of data mining technologies enabling the aggregation and analysis of data from multiple sources
--Add to this Open Access and you have a problem!
According to Nicholas Terry (2012): Data aggregation and customer profiling are hardly news. The developments that mark out big data are the scale of the data collection and the increasing sophistication of predictive analytics.
--problems: data mining; profiling (cookies are not the primary concern anymore)
-finding hidden correlations, enabling interesting predictions
--right to be forgotten (addressed somewhat in Europe but almost ignored in the US)
--subverted by the ability to re-identify data subjects using non-personal data.
-blurring the line between personal and non-personal data
-data aggregation to provide anonymity loses its meaning
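The re-identification problem noted above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: all records, names, and the choice of quasi-identifiers (zip code, birth year, sex) are invented, and real linkage attacks are considerably more elaborate. The idea is that a "de-identified" data set can be joined to a public list on attributes that are not personal on their own.

```python
from collections import defaultdict

# Toy data only: every record and name below is invented for illustration.
deidentified = [
    {"zip": "33620", "birth_year": 1975, "sex": "F", "diagnosis": "diabetes"},
    {"zip": "33620", "birth_year": 1982, "sex": "M", "diagnosis": "asthma"},
    {"zip": "33612", "birth_year": 1975, "sex": "F", "diagnosis": "hypertension"},
]
public_roll = [
    {"name": "Alice Example", "zip": "33620", "birth_year": 1975, "sex": "F"},
    {"name": "Bob Example", "zip": "33620", "birth_year": 1982, "sex": "M"},
    {"name": "Carol Example", "zip": "33612", "birth_year": 1975, "sex": "F"},
    {"name": "Dana Example", "zip": "33612", "birth_year": 1975, "sex": "F"},
]

# Hypothetical choice of attributes that are "non-personal" in isolation.
QUASI_IDENTIFIERS = ("zip", "birth_year", "sex")


def key(record):
    """Combine the quasi-identifiers of a record into one lookup key."""
    return tuple(record[field] for field in QUASI_IDENTIFIERS)


def reidentify(deidentified, public_roll):
    """Return {name: diagnosis} for rows that match exactly one named person."""
    candidates = defaultdict(list)
    for person in public_roll:
        candidates[key(person)].append(person["name"])

    matches = {}
    for row in deidentified:
        names = candidates.get(key(row), [])
        if len(names) == 1:  # a unique match re-identifies the data subject
            matches[names[0]] = row["diagnosis"]
    return matches


result = reidentify(deidentified, public_roll)
print(result)
```

In this toy case two of the three "anonymous" rows map to a single named person; the third stays ambiguous only because two people happen to share the same combination of attributes.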
Consider this:
--purchase by Walmart in 2012 of Social Calendar (a Facebook application). Walmart already had Shopycat, a Facebook app of its own that is a gift-recommendation service. Why purchase and not build its own?
Points to:
--weakest link: over-reliance on informed consent (most people do not read or understand disclosures, and have no idea about the subsequent use, or even custody, of their personal information)
Other BD problems
--Also allows automated decision-making about individuals, e.g., creditworthiness, insurance eligibility, etc.
--process is opaque and affords little chance for individual feedback or correction of the underlying data
--BD users are unable to provide adequate notice of purpose and use of data to individuals, since they cannot tell in advance what they will find
--Users cannot effectively consent to the use of their information because they cannot monitor the correlations made possible by the data mining
--dangers of predictive analysis
-Target analysis producing a pregnancy prediction score based on women customers' purchase patterns (identification of pregnancy and prediction of due date), e.g., daughter sent baby ads, upsetting father
-Pre-crime police departments (as in the movie Minority Report) apprehending criminals based on prediction of their future deeds (thought police?)
-redlining certain neighborhoods (for insurance purposes, social services, etc.)
--Tene and Polonetsky (2013) make the very salient points that: In a big data world, what calls for scrutiny is often not the accuracy of the raw data but rather the accuracy of the inferences drawn from the data. Inaccurate, manipulative or discriminatory conclusions may be drawn from perfectly innocuous, accurate data.
--de-identification is often reversible
--privacy v. societal benefit, e.g., Tene and Polonetsky (2013) pose the following question: what if the analysis of de-identified online search engine logs enables:
-identification of a life-threatening epidemic in x% of cases
-saving y lives
-assuming a z% chance of re-identification for a certain subset of search engine users
should such an analysis be permitted?
No surprise that it is in the health area that privacy has received the most sympathy and attention. But even here, the US, for example, has depended on HIPAA, which is supposed to protect against disclosure of patient data. However, as Terry (2012) points out, HIPAA protects against disclosure, not against collection! He notes that a lot of traditional health information circulates in a mainly HIPAA-free zone.
--Harvard researchers collected data on Facebook users to study changes in their interests and friendships over time. They released the data to the world for research because it was supposed to be anonymous. Other researchers quickly found that they could de-anonymize parts of the data set.
On the other hand: Stanford researchers discovered that taking an antidepressant drug together with a cholesterol-reducing drug can raise patients' blood glucose to diabetic levels. They analyzed data in adverse-effect reporting data sets to create a symptomatic footprint for diabetes-inducing drugs, then searched for this footprint in interactions between pairs of drugs. Four pairs with this effect were found, among them Paxil and Pravachol. Next they examined Bing search engine logs to see whether people who searched for both drugs were more likely to also report the symptoms than those who searched for only one of the drugs. They found support in the data and potentially saved the lives of 1 million Americans.
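The search-log comparison described above can be sketched roughly as follows. This is a toy illustration, not the researchers' method: the log structure, user IDs, and symptom terms are all invented assumptions, and a real study would need proper statistical testing. It simply compares the share of users searching for symptoms among those who searched for both drugs against those who searched for only one.

```python
# Hypothetical search terms; the symptom list is an assumption for illustration.
DRUGS = {"paxil", "pravachol"}
SYMPTOMS = {"fatigue", "blurred vision", "frequent urination"}

# Invented, de-identified log: user ID -> set of terms that user searched for.
logs = {
    "u1": {"paxil", "pravachol", "fatigue"},
    "u2": {"paxil", "pravachol", "blurred vision"},
    "u3": {"paxil", "pravachol"},
    "u4": {"paxil", "fatigue"},
    "u5": {"paxil"},
    "u6": {"pravachol"},
    "u7": {"pravachol"},
    "u8": {"weather"},
}


def symptom_rates(logs):
    """Return the share of symptom searchers among 'both' and 'one' drug groups."""
    counts = {"both": [0, 0], "one": [0, 0]}  # [users, users with a symptom term]
    for terms in logs.values():
        n_drugs = len(terms & DRUGS)
        if n_drugs == 0:
            continue  # user searched for neither drug; not part of the comparison
        group = "both" if n_drugs == 2 else "one"
        counts[group][0] += 1
        if terms & SYMPTOMS:
            counts[group][1] += 1
    return {
        group: (hits / total if total else 0.0)
        for group, (total, hits) in counts.items()
    }


rates = symptom_rates(logs)
print(rates)
```

With this invented data the symptom rate is higher in the "both drugs" group than in the "one drug" group, which is the kind of signal the study looked for. Note that the same logs that enable such a finding are also the ones that raise the re-identification concern.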
Industry not the only BD driver
--In 2012 President Obama launched a Big Data R&D initiative to advance the science and technology of managing, analyzing, visualizing and extracting information from large, diverse, distributed, and heterogeneous data sets.
--Terry (2012) also notes that in the future BD will come from less structured sources, including "[w]eb-browsing data trails, social network communications, sensor data and surveillance data." Much of it is "exhaust data," or data created unintentionally as a byproduct of social networks, web searches, smartphones, and other online behaviors.
This means that with industry, social behavior, and government behind it, BD is only going to grow larger, and the privacy problems associated with it are going to grow not in tandem, but exponentially.
Ethics
Look beyond the law; ethics of BD research
--availability makes it ethical?
--research ethics boards have insufficient understanding of the process of anonymizing and mining data, or the errors that can lead to data becoming personally identifiable
--effects may not be realized until many years into the future
--data contributors (e.g., social networkers) usually do not have researchers as their audience
--many have no idea of the processes currently gathering and using their data
--difference between being in public and being public
--even in the area of litigation, electronic discovery can uncover both criminal acts and non-criminal embarrassing acts
Conclusions
--BD is here to stay
--Increasingly happening in the cloud, and with open access
--Erasing the notion of public/private space distinction
Hierarchy in the BD World
3 classes of people in the Big Data world (Manovich, 2011):
(1) those who create data (consciously or by leaving digital footprints)
(2) those who have the means to collect it
(3) those who have the expertise to analyze it (smallest group, and most privileged)
-A pyramid?
Tene and Polonetsky (2013) note that presently the benefits of big data do not accrue to the individuals whose data is harvested, only to the big businesses that use such data:
--those who aggregate and mine this data neither view their informational assets as public goods held on trust nor seem particularly interested in protecting the privacy of their data subjects. The opposite is true, because the big data business model is selling information about their data subjects.
To make it less of a pyramid, they advocate empowering individuals to control their information by giving them meaningful rights to access their data in a usable, machine-readable format.
advantages:
-unleash innovation for user-side applications and services
-give users an incentive to participate in the data economy (by aligning their own self-interest with broader societal goals)
What you think about this proposal is a debate we will have to be willing to undertake, today or another day!
Thank you!
jgathegi@usf.edu