Big Data, Little Privacy

by Jim Reno, Distinguished Engineer, Chief Architect for Security, CA Technologies

So far this year it seems that not a week has passed without a news article in the popular media that has something to do with privacy, along with a host of articles in the technical media about Big Data. The Big Data articles are mostly about how it's the hottest area of computing: the fastest advancements, the best job market, the most interesting to venture capitalists. The privacy articles often follow the sky-is-falling motif, focusing on some event like a breach and extrapolating dire consequences. There's usually little evidence of real damage done, but the threat feels real.

I find these trends interesting because they intersect. Big Data is all about collecting as much data (most of it about people and their habits) as possible, and finding new and more powerful ways to analyze it. Privacy is all about not collecting data, and if not keeping all aspects of our lives hidden, at least exercising some control over who knows what. In my mind that puts Big Data and personal privacy on a collision course.

For example, the story of the Girls Around Me app was tailor-made for such attention. In case you missed it, the app combined data from Foursquare and Facebook to show users the pictures and location of nearby women. Or at least it did until Foursquare shut off their access. It looks like Girls Around Me only used data that was publicly available (that is, where the users' privacy settings were set that way) and so may not have broken any laws.1 The story splashed because of the particularly tacky way the app developer promoted it, as if the target market was stalkers. Had it been "Find New Friends Nearby," I wonder if it would have drawn so much interest?
In fact, I'd find it completely believable if it were revealed that the entire thing was a stunt, fabricated simply to expose the potential bad things that can be done with social networking. Either way, it's an example of publicly available data stores being put to a use that, even if it may not be violating anyone's privacy, is at least creepy and potentially dangerous. Big Data usually enters into the privacy stories in ways like that. More examples:

- Using marketing data to predict personal customer events, as reported by the New York Times in February.2
- FCC fines for impeding investigation of data collection practices.3
- FTC actions against social networking sites to improve consumer consent and adopt better privacy practices.4
- Recent moves to link together user data from multiple services to obtain a consolidated picture of each user.5
- Initiatives to implement Do Not Track capabilities in browsers, and discussion about whether web sites should be required to respect those settings.6
- Criticisms of social networking sites for weak privacy policies, and for changing those policies with little notice and in ways that are difficult for users to understand.7

About the author

Jim Reno is a Distinguished Engineer and Chief Architect for Security at CA Technologies. He has more than 30 years' experience in software development in the areas of system software, networking and security. Jim came to CA through the 2010 acquisition of Arcot Systems, where as CTO he worked on authentication and risk technologies, including co-invention of 3-D Secure, an early identity federation protocol used broadly by the major credit card networks for online authentication of cardholders. At CA Technologies, Jim guides the overall architecture of CA's Security products and the security subsystems of all CA products. Jim lives in Northern California with his wife and three children.
What kind of data is involved? Potentially:

- Every web site I visit; every page of every web site, and my typical habits in browsing that site
- Everything I search for
- Anything I post to any web site, such as comments
- Anything I post to online services I use, like email or social networking sites
- Anything I buy, where I bought it, how I paid for it. This includes both online and offline purchases, because even offline the info gets into a computer and likely gets into the cloud somewhere.
- Historical or publicly available personal information like addresses, phone numbers
- Credit history

In many cases there are laws or regulations to protect the personal information. For example, in the U.S., financial information and other data that can be used to commit identity theft are generally protected. Often sites handle this type of information by masking or encoding it. While that helps, some studies show that even with encoded data, secrets can sometimes be extracted with powerful enough analytical tools. An interesting study by Acquisti & Gross showed that it is possible to deduce Social Security numbers fairly well knowing only a person's birth date and where he was born, both easily obtained publicly.*

So in what way is Big Data connected with privacy, and why is it a matter for concern? Marketers have been collecting and analyzing user data for years, well before the advent of the Internet; what's different now? One difference is that the Internet is being used in ways that few people expected back when it was first invented. I've been involved with development of Internet technologies for many years, from back when it was known as the ARPANET. I worked on development of a TCP/IP stack when TCP was new. Back then, just getting systems to work together was the primary concern: few, if any, thought about issues like security.
There may have been a few visionaries imagining personal network access for billions of people, but it certainly did not enter the collective consciousness. The Internet was about sharing information, not about hiding it, and so an underlying foundation for privacy and security was never laid down. Once you have established a technology or design and it gains widespread usage, it is very difficult to change. Engineers often lament the shortcomings of a system and want to start over: next time, we'll build it right. The truth is that only failed systems get to be built anew, because successful ones cannot be easily displaced. Consider the move to IPv6. The original IP specification used a 32-bit address, because 4 billion computers sure seemed like enough. IPv6 work started in 1992, with the core specifications released in 1996, but even now it is barely implemented. A security example: one of the curses of modern email is that it's easy to spoof a sender's address, that is, make an email look like it came from any address you like. When network mail protocols were being developed, nobody thought about security.

* Another example of the power of Big Data analytical tools: Cornell researchers showed that by analyzing large sets of motion data taken from observations of a double pendulum, they could find fundamental laws of nature without any prior knowledge of physics or mechanics.
That's because security hasn't typically been a driver of new technology. Rather, scale, profit, publicity and curiosity are the big factors. The pattern of technology getting ahead of consideration of its consequences has repeated throughout history, both in real life and in art (think of Frankenstein). Just because we can do it, should we?

In the case of Big Data, a lot of the "should we" comes down in the area of privacy. We can collect enormous amounts of data about individuals, and use evolving tools to analyze it in new ways. There's unquestionably a lot of potential profit, because the more you know about people and can predict their behavior, the more effectively you can market to them. When asked about their data collection and privacy policies, the usual response from Big Data companies has phrases in it like "better serving our customers," "targeted ads," and "ads more relevant to our users." That's the basic deal: we give you free services (like email); you let us throw ads at you. It's a business model that goes back at least as far as radio. And the more we know about you, the more effective the ads are and the more money advertisers will pay. Hopefully you'll be less annoyed by the ads if they are for products you actually need.

Most users are OK with that deal, and don't seem concerned about the privacy side of the equation. In fact, if a few ads are all that happens, there probably isn't any reason for concern. But once the data is collected it is hard to guarantee that it is only going to be used for a specific purpose. Worse, the fundamental nature of invention and technology is that people will find new ways to use things that were not anticipated by their designers; Girls Around Me being a case in point. I doubt that either Facebook or Foursquare anticipated that particular use of their systems. We can't know what people are going to do with the data, but we do know, or at least feel, that a lot of the data being collected is somehow sensitive.
If we draw parallels with the physical world, a lot of the individual data items being collected might not seem significant. I walk into Safeway and buy a loaf of bread. Someone (let's call him Fred) happens to be standing in line behind me. I'm wearing my badge from work, so Fred can see my name and that I'm with CA Technologies. Maybe I'm carrying a cup of coffee and he can tell I like non-fat lattes from Starbucks. The brand of bread I like is visible, and he might note that I have a Visa card issued by a particular bank. I don't make any attempt to hide any of these things, because individually they don't seem terribly significant, and I can't imagine going through life wearing a mask and pulling a curtain over everything I do. These events seem normal and harmless to me; I do them in public and have no expectation of keeping them secret.

Now suppose that as I exit the store, Fred follows me. He has a notebook (sorry, it's the 21st century: a cell phone) and records all my activities. He has a record of everything I buy, and of every step I make. He maintains a respectful distance and may not be violating anyone's privacy, but he's always there, and his record builds up a complete picture of me and my habits. How I walk. What I eat. What I buy and where I like to buy it. That for short buildings I tend to prefer the stairs to the elevator. That I get impatient at crosswalks and push the button multiple times. These individual observations don't seem terribly important, but put together, and given that someone is actively collecting them about me specifically, they feel threatening and dangerous.

In the online world, the data collection being done today is the equivalent of that person following me around all day. Sites that include Facebook Like buttons, or that include tracking cookies from online advertising companies, send your surfing habits back to their owners whether or not you use their services. In addition, there's all the data that I willingly provide as I visit those sites or use their services. Online social networking is fundamentally based on communicating information about me and my habits; that data is all available to be analyzed and correlated. The intent of the person following me may be benign, even beneficial, but it still feels threatening. In the online world people don't see the threat because they don't see the person following them.

Historically, advertising was a one-way thing, with information being presented to the consumer in the form of ads or other marketing tools. Information was thrown at us and we could choose to ignore it. However, advertising online is now a two-way exchange, where information is being collected about us at the same time as we are being sent information. We can see (and can ignore) the ad, but we don't always see what it is sending back.

A good part of why it seems threatening is because it's personal. The advances in data storage and analysis have allowed targeted marketing to move from looking at aggregate behavior across populations to the specific behavior of individuals. Fred's on the street corner, noting down everyone's button-pushing behavior. I chat with him about what he's doing, and he tells me that he's trying to design friendlier traffic signals, and so is noting how many people push the button more than once. That feels pretty harmless, because it isn't about me specifically. When I see the store putting dip next to chips because their research shows people tend to buy them together, it doesn't bother me. But if I walked into the store and they offered me a bag of chips with the note "we know you finished your last bag yesterday evening," I'd be unnerved. As data collection capacity increases and analytical tools get better, the ability to correlate activities across different systems improves.
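The mechanics behind that invisible follower are mundane. Each page that embeds a third-party button or tracking pixel triggers a request to the third party, and the browser automatically attaches that party's cookie (a stable visitor ID) plus the address of the page being viewed. A toy sketch of the tracker's side of that exchange (the function and site names here are invented for illustration, not any real service's API) shows how a cross-site browsing profile accumulates:

```python
from collections import defaultdict

# Maps a tracking-cookie ID to the list of pages where the tracker's
# widget was embedded -- in effect, that visitor's browsing trail.
profiles = defaultdict(list)

def log_widget_request(cookie_id, referring_page):
    """Record one request to the embedded button/pixel.

    The visitor never sends anything deliberately; the browser supplies
    the cookie and the referring URL on its own with each request.
    """
    profiles[cookie_id].append(referring_page)

# One user (cookie "abc123") reading three unrelated sites that all
# happen to embed the same widget:
log_widget_request("abc123", "https://news.example/politics")
log_widget_request("abc123", "https://shop.example/running-shoes")
log_widget_request("abc123", "https://health.example/allergies")

# The tracker now holds a cross-site profile for that single visitor,
# assembled without any of the three sites sharing data directly.
```

The point of the sketch is how little machinery is required: the correlation key is nothing more than a cookie the user has likely never seen.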
So the parallels I draw above between the physical and online worlds cease to be parallel and come together. Actions that I take online can be correlated against actions I take in the physical world. When I bought that loaf of bread, the details of the transaction were entered into a database and can be correlated with my activities online; perhaps I'll see bread-related ads the next time I use my computer. As the world becomes more connected and online, the number of data sources and the volume of data about me expand enormously.

Is there really any risk? The immediate example that comes to mind is identity theft. There are also periodic stories about breaches, often of financial systems: a typical story will report some number of millions of potentially compromised accounts. I rarely read a story that details any specific harm to any specific people, which dilutes the impact, making the event seem distant from our day-to-day lives. The perception of identity theft is that it mostly comes from the actions of individuals: people who respond to phishing attacks, or who are careless about things like choosing secure passwords. This is a very human response to danger: we want to feel like we are in control and that bad things can be prevented if we take the appropriate actions.

But it's very hard for people to protect themselves from potential abuses of Big Data. Most users don't see any threat and aren't willing to take action to protect themselves, especially if doing so means giving up free services like email, web search and social networking. Techniques to block tracking are complex and beyond the technical sophistication of the average user.

There are efforts underway to change that. The most recent versions of the major browsers support preferences and headers whereby a user can specify that he does not wish to be tracked. Unfortunately, compliance with these measures is usually voluntary on the part of web sites, and government regulators have been reluctant to put mandatory measures in place. The recent European cookie directive requires companies to make their use of tracking cookies transparent and to obtain explicit consent before dropping any cookies onto a user's computer.9 Also, there is a battle brewing in the standards organizations about whether the default setting for Do Not Track should be on or off, with advertisers on one side and privacy advocates on the other. Both groups know that the vast majority of users may not take the extra step to change the default setting, although there is evidence that the growing publicity about privacy is causing more users to change their settings.10 Another general issue with such controls is the distributed nature of the Internet, which makes jurisdiction a key question: different governments implementing different policies make a uniform solution difficult to achieve.

The privacy policies implemented by the web sites are an important factor but don't really make things better. Most web site privacy policies read like legal documents; not surprisingly, they were probably written by lawyers. I often wonder on reading them whether they are there to protect the end users or to protect the company running the site. The length and complexity of these policies make most users skip them. Users may have a bit of a fatalistic attitude: "I don't understand this, it's too long to read, and they are going to do what they want with the data anyway, so why bother." Most sites reserve the right to unilaterally change their privacy policies, and will do so when it is to their advantage.
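Technically, the Do Not Track mechanism could hardly be simpler: the browser adds a `DNT: 1` header to each request, and a cooperating site checks for it before setting tracking cookies. A minimal server-side check might look like this (the function name and the plain headers dict are illustrative assumptions; real frameworks expose headers through their own request objects):

```python
def tracking_allowed(headers):
    """Return False when the request carries the Do Not Track header.

    `headers` is assumed to be a dict of HTTP request header names to
    values. The browser sends `DNT: 1` when the user has opted out of
    tracking. Note that honoring the header is voluntary on the server
    side -- which is exactly the weakness discussed above: this check
    only matters if the site chooses to run it.
    """
    return headers.get("DNT", "").strip() != "1"
```

For example, `tracking_allowed({"DNT": "1"})` returns `False`, while a request with no `DNT` header at all returns `True`. Since most users never change defaults, whether the browser ships with the header on or off largely decides the outcome, which is why the standards fight over the default matters so much.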
In some specific domains, such as health and financial information, regulation has had more effect on the collection, and particularly the safeguarding, of data, but not broadly. HIPAA, for example, has had a noticeable effect on privacy of medical information in the United States. I see this even in visits to the doctor: for example, reception areas are now usually structured so that people waiting cannot overhear the people checking in. Previously, check-in involved a line where everyone waiting could hear every detail. Civil liability and breach notification laws have driven better collection and storage procedures for financial data like credit card numbers. So regulation can have an impact, provided it is sufficiently comprehensive and there are consequences for violations.

Today, though, most data collection privacy is managed through industry self-regulation. Most social networking sites have privacy settings that can be adjusted by their users. This is a step forward and helps, but often the defaults are set to favor sharing of information rather than user privacy. Many researchers report that few users actively manage their privacy settings, making the defaults important, as they are the setting for most users. Often the data collected, or the methods used to collect it, are obscure and technical: most users have no real idea of what a tracking cookie is or how it is used. And while there are resources available to educate people, there is no perceivable incentive to do so.

Even when the companies collecting the data have the best of intentions, the fundamental nature of some of our web interactions creates the potential for exposure, because the Internet is widely distributed. Social networking has cheapened the word "friend": people measure their popularity by number of friends. So friends are no longer people with whom you have a significant relationship and can trust, but are now often people who previously would barely have qualified as acquaintances.
Yet these friends have a greater level of access to personal data than the general public. In fact, on many sites privacy settings only give the choice between "public" and "friends only"; friends get access to everything. So something posted can be copied by a friend, posted elsewhere, copied by a friend of a friend, and so on, percolating across the Internet into many unknown data stores, where any idea of the original user's privacy settings is distant and impossible to check. Once in those stores, Big Data tools allow it to be correlated, searched and analyzed like any other item.

Suppose the data is just wrong, or out of date, or a lie? There has been considerable debate in Europe over the European Commission's proposal to create a "right to be forgotten," giving individuals the right to require data stores to remove personal data at their request. The idea is controversial in concept because it's virtually impossible to comply with. Some read the EC's right-to-be-forgotten requirement to mean that users can require social media networks to erase all references to an individual, not just from that social network itself but from the whole Internet. It would require the ability to track the data as it goes from place to place on the web, which is not easy to do. And remember, the European concept of personal data can include even a person's name or address. That information can be spread all over the web once it's collected.

The right to be forgotten may also conflict with the fundamental right of freedom of speech. If I am convicted of a crime and serve the appropriate penalty, how long should that fact follow me? Some legal systems provide a mechanism for it not to follow me; for example, consider the sealing of court records of juvenile offenders. On the other hand, if a web site is required to remove such data, doesn't that impinge on the freedom of speech of the site owner? Regardless of the underlying issues of fundamental rights, there are practical considerations.
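The copy-by-a-friend-of-a-friend percolation is easy to simulate. In this toy sketch (store names and the post ID are invented), each share copies a post into another independent store, and deleting the original afterwards leaves every copy untouched:

```python
# Each "store" is modeled as a set of post IDs it holds. Shares copy a
# post from one store to another, like friend-of-a-friend reposting.
stores = {"origin": {"post-42"}, "friend": set(), "friend_of_friend": set()}

def share(post, src, dst):
    """Copy a post between stores. The destination keeps no link back
    to the original or to its privacy settings."""
    if post in stores[src]:
        stores[dst].add(post)

share("post-42", "origin", "friend")
share("post-42", "friend", "friend_of_friend")

# The author deletes the original post...
stores["origin"].discard("post-42")

# ...but the copies survive in stores the author cannot even enumerate.
surviving = [name for name, posts in stores.items() if "post-42" in posts]
# surviving == ["friend", "friend_of_friend"]
```

Removal only ever touches the store it is addressed to; a right to be forgotten, to work as proposed, would somehow have to reach every `dst` that ever appeared, which no party can enumerate.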
Defining what constitutes personal data is reasonable within a specific domain (like financial or medical information), but very difficult in general. It's hard to imagine how a data store operator could implement a mechanism for handling removal requests, since it puts them in the position of having to determine the identity of the requester and whether the removal request is legitimate. When data goes into their systems, they have no way to determine its accuracy or, often, the identity of the submitter. This is the nature of Big Data: collect everything, and use powerful tools later to analyze it. It is much easier to collect it all than to select only certain parts, or to edit the data during collection. Plus, given continuing decreases in storage costs, once the data is in the system it is much easier to keep it forever than to make decisions about what portion of it to delete. And if the data percolates from store to store and site to site, even cooperative attempts to remove or correct it may fail. A person trying to remove something he posted could be faced with the hopeless task of having to track everywhere it went on the Internet.

A side effect is that erroneous data is likely to live on forever. Consider threads requesting postcards or emails for sick children. Often these are hoaxes or are years out of date. There are many resources that try to provide the truth, yet the same threads pop up every so often, and don't seem to ever just go away. We are living in a world where data is forever.

A few years ago the daughter of a friend, after her high school graduation, asked me if I had any advice for her about college. After the initial shock of a teenager actually wanting my opinion wore off, I thought about typical mundane advice about studying hard, etc.
Then I told her this: "When I was in school, longer ago than I want to think about, I knew that when I was out and about with friends, and I did something incredibly stupid, the worst I had to worry about was those friends teasing me about it for a long time. You have to worry about someone taking a video of it and it being spread to everyone in the world, so adjust your behavior accordingly." If she were asking today, I would add: "And you should assume that video will follow you for the rest of your life."

There are financial considerations as well. The huge amounts of data being collected represent a tremendous asset. There is corresponding business pressure to monetize that asset, and to continue to find new ways to extract value from it. I think history tells us that if something is technically possible and can be done profitably, someone will do it. Continued improvements in analytical tools will make it easier to extract value around individuals rather than statistics about aggregate populations, which puts individual privacy directly in tension with business goals. Monetizing the data, tools and analysis will require selling them to other companies. That puts the data provider in the position of having to vet the legitimacy of their customers, which gets harder to do as the number of customers increases and the size of each decreases. How difficult will it become for a data provider to determine that a buyer of targeted ad data is indeed legitimate and not, say, a front for criminal activity like identity theft?

I am in the security business, where Big Data, particularly around personal information, plays an important role. A central (and hard) problem for many sites is that of identifying and authenticating users. There is a wide variety of technologies available for the actual event of authenticating a user, and our products implement a number of them. Simple passwords are still the most commonly used mechanism, but we are seeing increasing adoption of other things like software multifactor tokens and mobile one-time password generators. Regardless of the technology for the authentication event, there are a couple of common problems that must be solved.
The first is, when initially establishing the user's account, how do you know it's really him? The second is, how do you recover in the situation where the primary authentication technology fails, e.g., the user has forgotten his password?

If a site is creating user accounts for people with whom they have no prior relationship, the initial identity problem often isn't an issue. Usually in this case they don't really care who the user is; it's just a new account. For example, when creating an account on one of the free services, I set up a user name and password and perhaps provide some personal information like my real name. I can use any name I like; the site doesn't check it, and really has no way of knowing whether I have that name or not. I set a password and perhaps the answers to a couple of security questions. All of the security data is collected at the time of account creation. If I forget the password, they can use the security questions to verify I was the one who opened the account, and do a password reset. There is no way for them to know if my answers are "true"; only that they match what I originally gave. If I forget the security questions as well, I'm likely out of luck. Many sites use multiple backup mechanisms to protect against this situation.

Given that all the data I provide can be invented by me, I am in effect creating a fictional identity. If that data makes it into many stores, and is accessed by other people, it probably doesn't matter. But suppose instead of a free account, I am signing up for home banking. Then the online account must be
linked to my real bank account, and the bank must establish that I am the legitimate owner of that account. They could do that by requiring me to go to a branch personally, but that doesn't scale well, nor does it match users' Internet expectations. So they create a self-service signup process. They can start identifying me by using data I provided when the original bank account was created, but that data is often sparse and may be out of date. For example, they may have only my name, Social Security number, and perhaps current address (banks often tell me their contact data on customers is incomplete and filled with errors). Big Data can help them in this case, because starting with those facts they can access other data stores for information with which to identify me. For example, they can access credit history information and ask me questions about past loans or other transactions. But it hurts them as well, because they are not the only ones that can access that data, and the more data about me that gets into the Big DataSphere, the easier it is for someone else to access it.

Password reset is often plagued with these problems when it is based on security questions. There seem to be only so many different security questions, and they often relate to personal preferences or life events. These same preferences and life events are exactly the kinds of things people post to social networking sites and that get spread around widely.*

There are consequences for enterprises as well, beyond just the difficulties my bank has that I just described. The boundary between my job and my life in the online world is blurring. People are using personal mobile devices at work, rather than company-issued ones. Organizations are starting to use cloud-provided services like email rather than internally managed versions, putting corporate data in places where it potentially can leak elsewhere.
Despite company policy, employees are not always discreet about using Internet resources like social networking sites, which can expose sensitive corporate data. The risks range from corporate embarrassment to legal trouble if the exposed data is something (like personally identifiable information) that the company was required by law to protect. So not only is individual privacy at risk; corporate privacy is, also, and in ways that are often out of the company's control.

All in all, I'm concerned about Big Data and privacy, but I'm not in a panic. There are hopeful signs. Watchdog organizations like the Electronic Frontier Foundation, and some government agencies like the US Federal Trade Commission, are actively working in this area and there is much debate. So the potential problems are at least being discussed. But the fundamental aspects of Internet and data technology probably are such that we need to adjust to a new privacy norm, where people live with an understanding that in the online world they are exposing themselves to the whole world. Maybe it's just the latest step in human social evolution. As people moved from families of hunter-gatherers, to tribes, to settlements, towns and cities, we had to create ways to live with our daily lives exposed to ever-increasing numbers of strangers. Perhaps the Internet is the final stage, where we expand that to everyone, and so we'll create social structures to make that work, too.
* This summer, Wired reporter Mat Honan's Apple, Google and Twitter accounts were taken over, and his iPad, iPhone and MacBook were all wiped clean, through an attack involving access to online security data, correlated across multiple sites. You can read his account of it on wired.com.11