Formal Methods for Preserving Privacy for Big Data Extraction Software

M. Brian Blake and Iman Saleh
University of Miami, Coral Gables, FL

Abstract

Given the inexpensive nature and increasing availability of information storage media, businesses, government agencies, healthcare professionals, and individuals worldwide have exponentially increased their production and persistence of large amounts of data, whether such data are captured as text, images, or sound. As such, it is no surprise that merely coining the term Big Data has generated a viral level of energy in industry and the research community. In order to realize the benefits of Big Data, users must capture and store their information in systematic and predictable data stores using trusted software that extracts, transforms, and loads (ETL) the data. This position paper focuses on challenges associated with the protection of sensitive information that, when mishandled or correlated with other pieces of data, can jeopardize users' privacy. We motivate the need for enhanced formal approaches that will enable (1) the creation of a customized, contemporary software lifecycle and framework for Big Data testing, (2) the development of new approaches for automated test generation, including approaches that leverage the efforts of crowds and social networks, and (3) the development of algorithms for continuously ensuring privacy as ETL applications and data evolve.

Introduction

Businesses and government agencies are generating and continuously collecting huge amounts of data. The current increased focus on large amounts of data will undoubtedly create opportunities and avenues for understanding the processing of such data across many varying domains. In this position paper, we use the term Big Data to denote multi-terabyte and larger repositories of interconnected (sometimes with unknown relationships) data that might be realized as logs, product information, transactions, or medical and biological collections. The data can be structured or unstructured and can contain complex types such as text documents, graphs, images, audio, and video files. Examples of Big Data include repositories of information captured from social media on the Internet, large databases of genomic information (a domain that we have leveraged in our ongoing projects), and consumers' buying histories and logs. Analysis of these huge repositories introduces fascinating new opportunities for discovering patterns, relationships, and insights that contribute to different branches of science. Researchers, for example, have analyzed data from over 7 million electronic medical records and used them to illustrate, in a powerful visualization, the (sometimes surprising) relationships between medical conditions on the basis of the frequency of co-occurrences [1]. The visualized network is called the Health InfoScape (Figure 1). Such visualizations raise significant HIPAA concerns [2].

Figure 1. Health InfoScape: a visualization of the relationships between medical conditions on the basis of the frequency of co-occurrences.

The potential of Big Data, however, comes with a price: users' privacy is often at risk. Guarantees of conformance to privacy terms and regulations are limited in current Big Data analytics and mining practices. Furthermore, current practices do not ensure that privacy terms are enforced as the data and the underlying software evolve. In recent events, the government healthcare website exemplified the challenges faced by Big Data applications. The website requests enrollees to provide personally identifiable information such as date of birth, Social Security number, and income information. According to the website's privacy policy, that information is shared among many government agencies, including the IRS, the Social Security Administration, and the Department of Homeland Security. However, the website has been criticized because there are no guarantees that this information is used solely for the purpose of verifying the enrollee's income, immigration status, and healthcare eligibility. Users expect that the information will not be used, shared, or stored beyond that purpose. This example and many others [17][18] show that there is a distinct need to build trust into applications that extract, transform, and load (ETL) data at this scale. Furthermore, these processes must ensure the protection of sensitive information. Developers should be able to verify that their applications comply with privacy agreements and that sensitive information is kept private regardless of changes in the applications and/or privacy regulations.

Overview of Big Data Loading and Testing

To address these challenges, we have identified a need for new contributions in the areas of formal methods and testing procedures. A 2013 article on Big Data testing [15] identifies the main testing areas, depicted in Figure 2, on a Hadoop architecture, the de facto standard for Big Data. In our work, we have discovered the need for new paradigms for privacy conformance testing in the four areas of the ETL process, as follows:

Figure 2. Big Data architecture and main testing areas [15].

(1) Pre-Hadoop Process Validation: This step represents the data loading process. At this step, the privacy specifications define the sensitive pieces of data that can uniquely identify a user or an entity. Privacy terms can also specify which pieces of data may be stored and for how long. Schema restrictions can be applied at this step as well.

(2) MapReduce Process Validation: This process manipulates Big Data assets to efficiently respond to a query. Privacy terms can dictate the minimum number of returned records required to conceal individual values, in addition to constraints on data sharing between different processes (a minimal sketch of such a check follows this list).

(3) ETL Process Validation: Similar to step (2), the warehousing logic needs to be verified at this step for compliance with privacy terms. Some data values may be aggregated anonymously, or excluded from the warehouse if their inclusion implies a high probability of identifying individuals.

(4) Reports Testing: Reports are another form of query, potentially with higher visibility and a wider audience. Privacy terms that define the notion of purpose are essential to verify that sensitive data are not reported except for specified uses.
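To make the kind of check implied by steps (2) and (4) concrete, the following minimal sketch suppresses any aggregate group smaller than a policy-defined threshold before results are released. The threshold, function names, and example data are hypothetical and not part of any existing framework or of the approach proposed here; this is a sketch of the idea, not a definitive implementation.

# Minimal sketch (illustrative only): a post-query check that suppresses any
# aggregate group smaller than a privacy threshold, in the spirit of the
# "minimum number of returned records" rule in step (2).
from collections import Counter

MIN_GROUP_SIZE = 5  # assumed value taken from the applicable privacy policy


def enforce_min_group_size(records, group_key):
    """Drop every group whose record count falls below MIN_GROUP_SIZE,
    so that no released result can single out an individual."""
    records = list(records)
    counts = Counter(r[group_key] for r in records)
    return [r for r in records if counts[r[group_key]] >= MIN_GROUP_SIZE]


if __name__ == "__main__":
    query_result = (
        [{"zip": "33146", "diagnosis": "flu"}] * 5
        + [{"zip": "90210", "diagnosis": "rare condition"}]  # singleton group
    )
    released = enforce_min_group_size(query_result, group_key="zip")
    # Only the ZIP code with at least 5 records survives; the singleton is suppressed.
    print(released)

In practice such a check would run as a validation stage after the MapReduce or reporting step, with the threshold and grouping keys drawn from the formal privacy specification rather than hard-coded.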

Challenges for Formal Validation in Loading Big Data

Big Data applications have evolved through commercial use, but they still lack the theoretical rigor needed to provide confidence and repeatability. Formal approaches, when applied to Big Data and its applications, will enable the development of trusted systems with respect to compliance with privacy regulations. Through our work, we introduce techniques that leverage the efforts of social networks. Crowdsourcing techniques can help address the subjectivity of privacy rules while also providing a way to cope with the sheer size of the data when those rules are verified [3][20]. Several high-level research challenges must be addressed in order to enable our investigations and the implementation of our formalization framework, namely:

Formal methods and specifications must be customized to capture the privacy requirements of Big Data. Any formal methods [4][10][12][13] applied to these problems must provide the rigor necessary to specify Big Data privacy regulations, given the data's unstructured and highly variable nature. The specification needs to capture the sensitivity level of different pieces of data, both individually and in combination. It also needs to incorporate the notion of differential privacy, in which the risk of exposing personal information is probabilistically bounded (a minimal sketch of this mechanism appears at the end of this section). At the same time, the formalization language must remain reasonably easy for application developers to use.

Techniques and tools must be developed that reason [7][8][9] about the specification of rules in order to automatically generate compliance tests (an illustrative rule-to-check sketch also appears at the end of this section).

Human input (i.e., crowdsourcing techniques) should be leveraged to efficiently and effectively guide automatic code and rule repair. Given the huge amount of data and the risk involved in automatically correcting sensitive data, we believe that leveraging crowds (perhaps of socially linked groups [3][20]) can increase trust in data-driven applications. We also believe that users' input is valuable in defining and refining privacy rules, given their subjective nature.

Current privacy-preserving mining algorithms for Big Data must facilitate verification of privacy compliance. Given the high volume and the mix of structured and unstructured data, we need a set of new algorithms for managing and analyzing Big Data privately. These algorithms may build on current privacy-preserving data-mining techniques [5][6][11][14][19]. Leveraging the appropriate formal specification, we propose a family of algorithms that can (1) comply with the advertised privacy regulations and (2) automatically adapt to changes in the application logic and the privacy rules.

A new software lifecycle (or customizations to existing contemporary lifecycles) must be created to incorporate the proposed approaches for developing trusted software for Big Data. Privacy considerations must be integrated into popular software development lifecycles [16] to ensure wide adoption of these approaches. Privacy must be built into applications as a property rather than added as an afterthought. Consequently, the phases of existing software development processes must be redefined to integrate the formalizations that define privacy. Subsequently, innovations are required that extend current testing practices and the underlying processes for handling changes in privacy rules and application logic.
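To illustrate the differential-privacy notion referenced above, the following minimal sketch adds Laplace noise to a count query, the textbook mechanism for epsilon-differential privacy. The epsilon value, function names, and example query are assumptions made for illustration only; they are not part of the framework proposed in this paper.

# Minimal differential-privacy sketch (illustrative only): perturb a count
# query with Laplace noise so that the presence or absence of any single
# individual changes the output distribution by at most a factor of e^epsilon.
import math
import random


def laplace_noise(scale):
    """Sample Laplace(0, scale) noise via inverse transform sampling."""
    u = 0.0
    while u == 0.0:            # avoid log(0) at the distribution's edge
        u = random.random()
    u -= 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))


def private_count(true_count, epsilon):
    """A count query has sensitivity 1, so the noise scale is 1/epsilon."""
    return true_count + laplace_noise(1.0 / epsilon)


if __name__ == "__main__":
    # Hypothetical query: "how many records in the repository match condition X?"
    print(private_count(true_count=12873, epsilon=0.1))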
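Similarly, the following sketch shows how a small machine-readable privacy specification could be turned into executable compliance checks, in the spirit of the automated test generation described above. The rule format, field names, and purposes are hypothetical; a real specification language would need far richer semantics (retention periods, quasi-identifier combinations, privacy budgets, and so on).

# Illustrative sketch (not the paper's specification language): a tiny
# machine-readable privacy spec and the compliance checks generated from it.
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldRule:
    field: str
    sensitivity: str            # "identifier", "quasi-identifier", or "public"
    allowed_purposes: tuple     # purposes for which the field may be released


# Hypothetical specification for a healthcare-enrollment data set.
PRIVACY_SPEC = [
    FieldRule("ssn",           "identifier",       ()),
    FieldRule("date_of_birth", "quasi-identifier", ("eligibility_check",)),
    FieldRule("income",        "quasi-identifier", ("eligibility_check",)),
    FieldRule("state",         "public",           ("eligibility_check", "reporting")),
]


def generate_compliance_checks(spec):
    """Turn each rule into an executable check over (released_fields, purpose)."""
    def make_check(rule):
        def check(released_fields, purpose):
            if rule.field not in released_fields:
                return True                      # field is not released at all
            return purpose in rule.allowed_purposes
        check.__name__ = "check_" + rule.field
        return check
    return [make_check(r) for r in spec]


if __name__ == "__main__":
    checks = generate_compliance_checks(PRIVACY_SPEC)
    report_fields = {"state", "income"}          # fields a report would expose
    failures = [c.__name__ for c in checks if not c(report_fields, "reporting")]
    print(failures)    # -> ['check_income']: income may not appear in reports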
Conclusion

As Big Data analytics become prevalent in the operations of industry and government organizations, principled methods must ensure privacy both now and as the data evolve. The aforementioned challenges require coordinated research in the general areas of software engineering, data management, and machine learning.

References

[1] "At the Intersection of Big Data and Healthcare: What 7.2 Million Medical Records Can Tell Us," CCC Blog. [Online]. Available: http://www.cccblog.org/2012/08/23/at-the-intersection-of-big-data-and-healthcare-what-7-2-million-medical-records-can-tell-us/. [Accessed: 08-Jan-2014].
[2] "Summary of the HIPAA Security Rule." [Online]. Available: http://www.hhs.gov/ocr/privacy/hipaa/understanding/srsummary.html. [Accessed: 04-Jan-2014].
[3] D. Schall, M. B. Blake, and S. Dustdar, "Programming Human and Software-Based Web Services," IEEE Computer, vol. 43, no. 7, pp. 82-86, July 2010.
[4] E. Sobel and M. R. Clarkson, "Formal Methods Application: An Empirical Tale of Software Development," IEEE Transactions on Software Engineering, pp. 308-320, 2002.
[5] I. Saleh and M. Eltoweissy, "P3ARM-t: Privacy-Preserving Protocol for Association Rule Mining with t-Collusion Resistance," Journal of Computers, vol. 2, no. 2, Apr. 2007.
[6] I. Saleh, A. Mokhtar, A. Shoukry, and M. Eltoweissy, "P3ARM: Privacy-Preserving Protocol for Association Rule Mining," in 2006 IEEE Information Assurance Workshop, 2006, pp. 76-83.
[7] I. Saleh, G. Kulczycki, and M. B. Blake, "Demystifying Data-Centric Web Services," IEEE Internet Computing, vol. 13, no. 5, pp. 86-90, 2009.
[8] I. Saleh, G. Kulczycki, and M. B. Blake, "Formal Specification and Verification of Transactional Service Composition," in 2011 IEEE World Congress on Services (SERVICES), 2011, pp. 474-481.
[9] I. Saleh, G. Kulczycki, M. B. Blake, and Y. Wei, "Formal Methods for Data-Centric Web Services: A Case Study," International Journal of Services Computing, vol. 1, no. 1, Dec. 2013; I. Saleh, G. Kulczycki, M. B. Blake, and Y. Wei, "Formal Methods for Data-Centric Web Services: From Model to Implementation," in 2013 IEEE International Conference on Web Services (ICWS), 2013.
[10] J. M. Wing, "A Specifier's Introduction to Formal Methods," Computer, vol. 23, no. 9, pp. 8, 10-22, 24, Sep. 1990.
[11] J. Paranthaman and T. A. A. Victoire, "Hybrid Techniques for Privacy Preserving in Data Mining," Life Science Journal, vol. 10, no. 7s, 2013.
[12] J.-R. Abrial, "Formal Methods in Industry: Achievements, Problems, Future," Shanghai, China, 2006, pp. 761-768.
[13] Luqi and J. A. Goguen, "Formal Methods: Promises and Problems," IEEE Software, vol. 14, no. 1, pp. 73-85, Feb. 1997.
[14] M. D. L. Kotak and M. S. Shukla, "Protecting Sensitive Rules Based on Classification in Privacy Preserving Data Mining," International Journal of Engineering, vol. 2, no. 11, 2013.
[15] M. Gudipati, S. Rao, N. D. Mohan, and N. K. Gajja, "Big Data: Testing Approach to Overcome Quality Challenges."
[16] M. B. Blake, "Decomposing Composition: Service-Oriented Software Engineers," IEEE Software, Special Issue on Realizing Service-Centric Software Systems, vol. 24, no. 6, pp. 68-77, Nov./Dec. 2007.

[17] A. Narayanan and V. Shmatikov, "How to Break Anonymity of the Netflix Prize Dataset," arXiv preprint cs/0610105, 2006.
[18] R. C. Wong, "Big Data Privacy," J. Inform. Tech. Softw. Eng., vol. 2, p. e114, 2012.
[19] R. Nix, M. Kantarcioglu, and K. J. Han, "Approximate Privacy-Preserving Data Mining on Vertically Partitioned Data," in Data and Applications Security and Privacy XXVI, Springer, 2012, pp. 129-144.
[20] W. Tan, M. B. Blake, I. Saleh, and S. Dustdar, "Social-Network-Sourced Big Data Analytics," IEEE Internet Computing, vol. 17, no. 5, pp. 62-69, Sept./Oct. 2013.
[21] X. Fu, "Conformance Verification of Privacy Policies," in Web Services and Formal Methods, Springer, 2011, pp. 86-100.