Big Data Security and Privacy Kevin T. Smith, Novetta Solutions AFCEA CyberSecurity Symposium 2014 June 25, 2014 Ksmith <AT> Novetta.com KevinTSmith <AT> Comcast.Net
Big Data
With the increase in computing power, electronic devices, and Internet accessibility, more data than ever is being produced, collected, and transmitted.
Interesting Facts*:
 - Facebook collects 250 terabytes a day
 - Digital data production worldwide doubled in 2009 to 1 zettabyte (1 million petabytes)
 - Worldwide digital production is expected to reach 7.9 zettabytes in 2015 and 35 zettabytes in 2020
Organizations have recognized the power of data analysis but are struggling to manage the massive amounts of information they have.
*Stats from Thomson Reuters & InfoQ, http://www.infoq.com/news/2013/12/hadoopusage
Securing Big Data: Why Should We Care?
Regulatory, Access Control & Releasability Concerns
 - Regulatory: Many organizations are required to enforce access control & privacy restrictions on data sets (HIPAA, privacy laws) or face steep penalties and fines
 - Access Control: U.S. Government organizations are required to provide access control based on need-to-know & formal authorization credentials
 - Releasability: Big Data brings new data management challenges, & organizations are struggling to understand what results they can release without unintentionally disclosing information
Insider Threat / Threats on Availability
 - How do you control access to your analytics?
 - Many deployments are unsecured
 - Your data is only a distributed delete away
Mismanagement of Data Sets & Breaches are Costly
 - AOL Research "Data Valdez" Incident: listed as one of CNN/Money's "Dumbest Moments in Business"; $5M settlement + $100 to each member at the time + $50 to any member concerned
 - Netflix Contest & Anonymized Data Set: class action lawsuit, $9M settlement
 - PlayStation (2011): experts predict costs to Sony between $2.4 and $2.6 billion
*Ponemon Institute, Cost of Data Breach Study: Global Analysis, May 2013
What Makes Securing Big Data Different?
Unique Challenges to Big Data Analytics
 - Distributed Security: When data and processing are distributed across a cluster, there are many moving parts to secure for confidentiality, integrity, and availability. This often makes security on these systems complex to develop & configure.
 - Combination of Different Sources: Big Data analytics solutions are great at bringing many data sources together & doing analytics on their combination. Given that each data source may have its own access control policy, how do you enforce security policies on the combination of these data sources?
 - Aggregation & Differential Privacy: When you combine different sources of data, you may discover connections between them that disclose more information than you intended, potentially violating access control and privacy policies.
 - Unintended Deduction from Large Data Sets: Data sets are typically so large that it is difficult to determine what sensitive information may be deduced from them.
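One possible answer to the "combination of different sources" question is to give any derived record the union of the access requirements of its inputs, so that a reader must be cleared for every contributing source. The sketch below illustrates that idea only; the `Record` type, the label names (`HIPAA`, `HR`), and the helper functions are hypothetical and not taken from any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    source: str
    value: str
    required: frozenset  # authorizations a reader must hold (hypothetical labels)


def combine(*records):
    """Derive a record whose policy is the union of the inputs' requirements."""
    required = frozenset().union(*(r.required for r in records))
    value = " + ".join(r.value for r in records)
    return Record(source="derived", value=value, required=required)


def can_read(user_auths, record):
    """A reader needs every authorization the record requires."""
    return record.required <= frozenset(user_auths)


medical = Record("clinic", "diagnosis=flu", frozenset({"HIPAA"}))
payroll = Record("hr", "salary=50000", frozenset({"HR"}))
joined = combine(medical, payroll)

print(can_read({"HIPAA"}, joined))        # False: missing HR
print(can_read({"HIPAA", "HR"}, joined))  # True
```

Taking the union is the conservative choice. It does not address the aggregation problem, where the combination is more sensitive than any single input.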
Deduction & Differential Privacy Example Could a data analyst working for Commissioner Gordon deduce that Batman is Bruce Wayne?
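To make the question concrete, here is a minimal sketch of how such a deduction could happen by joining two data sets that each look harmless alone. All of the data is invented for illustration, and the `link` function is a hypothetical helper, not a real analytics API.

```python
from collections import Counter

# Hypothetical data set 1: where and when the alias was seen.
sightings = [
    ("Batman", "2014-06-01", "Narrows"),
    ("Batman", "2014-06-08", "Docks"),
    ("Batman", "2014-06-15", "Old Town"),
]

# Hypothetical data set 2: location records for named residents.
resident_locations = [
    ("Bruce Wayne", "2014-06-01", "Narrows"),
    ("Bruce Wayne", "2014-06-08", "Docks"),
    ("Bruce Wayne", "2014-06-15", "Old Town"),
    ("Lucius Fox", "2014-06-01", "Narrows"),
    ("Harvey Dent", "2014-06-08", "Docks"),
    ("Harvey Dent", "2014-06-15", "Uptown"),
]


def link(sightings, locations):
    """Return the names present at every (date, district) where the alias was seen."""
    sighting_keys = {(date, place) for _, date, place in sightings}
    matches = Counter(
        name for name, date, place in set(locations) if (date, place) in sighting_keys
    )
    return sorted(name for name, n in matches.items() if n == len(sighting_keys))


print(link(sightings, resident_locations))  # ['Bruce Wayne']
```

Neither data set names the alias and the person together, yet the join on date and district singles out one resident.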
To Complicate the Matter
Most data analytics tools were designed without security in mind.
Example: Apache Hadoop
Originally No Security Model
 - No authentication of users or services
 - Anyone could submit arbitrary code to be executed
 - Anyone could add data to, delete data from, or read data from the distributed file system
 - You could write a service that impersonated a Hadoop service
 - Later, after authorization was added, impersonating a user took only a command-line switch
2009 Yahoo! Security Retrofit
Resulting Security Model is Complex
 - Configuration is complex
 - No data-at-rest encryption
 - Kerberos-centric
 - Limited authorization capabilities
 - Easy to mess up if you don't know what you are doing
Things Are Changing, But They Are Changing Slowly!
 - An alphabet soup of secure distributions, vendor add-ons & security-focused companies
 - Companies releasing Hadoop distros are taking security seriously (see recent press releases: Cloudera and Gazzang, Hortonworks and XA Secure)
 - Much activity in open source efforts like Project Rhino & projects like Apache Sentry
All Security Needs to be Policy-Driven
Air Gap & Isolation Approaches
 - Network isolation in various forms is used in lieu of security in closed networks
 - Import/export is problematic
 - Accidents may still happen
 - Does not solve issues related to differential privacy or authorization (AuthZ)
Augmenting Analytic Security with Other Tools (Ex: Apache Accumulo)
Find your analytics tools' limitations & complement your solution with other tools and libraries. The example here shows building a security layer over Hadoop.
 - Cell-level access control via visibility labels
 - By default, uses its own database for users & credentials
 - Can be extended in code to use other Identity & Access Management infrastructure
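The idea behind cell-level visibility can be sketched in a few lines: every cell carries a boolean label expression, and a scan returns only the cells whose expression is satisfied by the reader's authorizations. The code below is a simplified illustration and not Accumulo's API. The function names and label values are hypothetical, and the parser assumes ordinary precedence (`&` binds tighter than `|`), whereas Accumulo itself is stricter about mixing operators without parentheses.

```python
import re

TOKEN = re.compile(r"\s*([A-Za-z0-9_]+|[&|()])")


def tokenize(expr):
    """Split a label expression into words, '&', '|', '(' and ')'."""
    tokens, pos = [], 0
    expr = expr.strip()
    while pos < len(expr):
        m = TOKEN.match(expr, pos)
        if not m:
            raise ValueError(f"bad visibility expression: {expr!r}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


def is_visible(expr, auths):
    """Evaluate a label expression against a set of authorizations."""
    tokens = tokenize(expr)
    if not tokens:
        return True  # assumption: an empty label is visible to everyone
    pos = 0

    def parse_or():
        nonlocal pos
        result = parse_and()
        while pos < len(tokens) and tokens[pos] == "|":
            pos += 1
            rhs = parse_and()  # always parse, so tokens are consumed
            result = result or rhs
        return result

    def parse_and():
        nonlocal pos
        result = parse_term()
        while pos < len(tokens) and tokens[pos] == "&":
            pos += 1
            rhs = parse_term()
            result = result and rhs
        return result

    def parse_term():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of expression")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            result = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return result
        if tok in ("&", "|", ")"):
            raise ValueError(f"unexpected token {tok!r}")
        return tok in auths

    result = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return result


def scan(cells, auths):
    """Return the values of the cells the reader is allowed to see."""
    return [value for _, _, value, label in cells if is_visible(label, auths)]


# (row, column, value, visibility label); all values are invented
cells = [
    ("wayne", "name", "Bruce Wayne", ""),
    ("wayne", "alias", "Batman", "secret&(gcpd|league)"),
]

print(scan(cells, set()))               # ['Bruce Wayne']
print(scan(cells, {"secret", "gcpd"}))  # ['Bruce Wayne', 'Batman']
```

Because the check runs per cell, two readers scanning the same row can legitimately see different columns.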
Differential Privacy & Deduction
Many approaches are in the academic sphere
 - Cynthia Dwork from Microsoft Research is one of the leading researchers
 - Lots of university work
 - Lots of math involved
I'm involved in more practical solutions (but no math)
 - Determining access control policies up front & applying that policy
 - Determining entities that should not resolve (Batman + Bruce Wayne) & including this in the security of the system
 - Sometimes this involves an aggregation filter component to prevent the resolution of entities
We will still need to follow the academic research in this area.
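An aggregation filter of the kind mentioned above could, in its simplest form, sit between the analytic and the consumer and suppress any output row that would link two entities the policy says must not resolve. The sketch below assumes each result row lists the entities it mentions. The `FORBIDDEN_LINKS` policy, the row layout, and `filter_results` are hypothetical names used only for illustration.

```python
# Policy decided up front: pairs of entities that must never appear resolved together.
FORBIDDEN_LINKS = frozenset({frozenset({"Batman", "Bruce Wayne"})})


def filter_results(rows, forbidden=FORBIDDEN_LINKS):
    """Drop any output row whose entities would resolve a forbidden pair."""
    released = []
    for row in rows:
        entities = set(row["entities"])
        if any(pair <= entities for pair in forbidden):
            continue  # releasing this row would disclose the link
        released.append(row)
    return released


results = [
    {"entities": ["Batman", "Joker"], "summary": "seen together at the docks"},
    {"entities": ["Bruce Wayne", "Lucius Fox"], "summary": "board meeting"},
    {"entities": ["Batman", "Bruce Wayne"], "summary": "same vehicle, same night"},
]

for row in filter_results(results):
    print(row["summary"])
# seen together at the docks
# board meeting
```

A filter like this only catches links stated within a single row. It does not stop a reader from deducing the link across several released rows, which is part of why the academic research remains relevant.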
Final Thoughts: General Guidance
Every Security Approach Is Different
 - Security is a journey, not a destination
Know Your Security Requirements
 - Understand your security requirements & policies related to access to data
Know the Security Policies of Your Data
 - Understand the security policies of your data so that you can enforce them
Know Your Tools & Their Limitations
 - Understand, from an in-depth perspective, how to successfully meet your security goals
 - Understand the limitations of your tools & augment your solutions with other approaches
Understand the Unique Challenges of Big Data Security
 - Combination of different sources & resulting policies
 - Aggregation and differential privacy (Netflix contest)
 - Unintended disclosure (the Batman problem)