DATA MINING - 1DL360


DATA MINING - 1DL360, Fall 2013
An introductory class in data mining
http://www.it.uu.se/edu/course/homepage/infoutv/per1ht13
Kjell Orsborn
Uppsala Database Laboratory, Department of Information Technology, Uppsala University, Uppsala, Sweden
10/12/13

Introduction to Data Mining
Privacy in data mining (slides and selected papers)
Kjell Orsborn
Department of Information Technology, Uppsala University, Uppsala, Sweden

Privacy and security in data mining
- Protecting private data is an important concern for society.
  - Several laws now require explicit consent prior to analysis of an individual's data.
- However, its importance is not limited to individuals.
  - Corporations might also need to protect the privacy of their information, even though sharing it for analysis could benefit the company.
- Clearly, the trade-off between sharing information for analysis and keeping it secret to preserve corporate trade secrets and customer privacy is a growing challenge.

Techniques for privacy and security
- Most data mining applications operate under the assumption that all the data is available at a single central repository, called a data warehouse.
  - This poses a huge privacy problem, because breaching the security of that single repository exposes all the data.
- A naive solution to the problem is de-identification: remove all identifying information from the data and then release it.
  - Pinpointing exactly what constitutes identifying information is difficult.
  - Worse, even if de-identification is possible and (legally) acceptable, it is extremely hard to do effectively without losing the data's utility.
  - Studies have used externally available public information to re-identify anonymous data, and showed that effectively anonymizing the data required removal of substantial detail.
- Another solution is to avoid centralized warehouses.
  - This requires specialized distributed data mining algorithms, e.g. secure multi-party computation.
  - Accurate methods have been shown for classification and association analysis.
- A third approach is data transformation and perturbation, i.e. modifying data so that it no longer represents real individuals.
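The re-identification risk mentioned on this slide can be illustrated with a small linking sketch. It is only an illustration: the records, the field names (`zip`, `birth_year`, `sex`, `name`, `diagnosis`) and the helper `link` are hypothetical and do not come from the slides or from any cited study. The idea is that a released table without names can still be joined with a public list on the attributes the two have in common.

```python
def link(released, public, quasi_identifiers):
    """Return (name, diagnosis) pairs for released rows matched by exactly one public row."""
    matches = []
    for rec in released:
        key = tuple(rec[q] for q in quasi_identifiers)
        candidates = [p for p in public
                      if tuple(p[q] for q in quasi_identifiers) == key]
        if len(candidates) == 1:  # a unique match re-identifies the record
            matches.append((candidates[0]["name"], rec["diagnosis"]))
    return matches

# Hypothetical "de-identified" release: names removed, other attributes kept.
released = [
    {"zip": "75105", "birth_year": 1970, "sex": "F", "diagnosis": "asthma"},
    {"zip": "75237", "birth_year": 1985, "sex": "M", "diagnosis": "diabetes"},
]

# Hypothetical public list (for example a register) that includes names.
public = [
    {"name": "Person A", "zip": "75105", "birth_year": 1970, "sex": "F"},
    {"name": "Person B", "zip": "75237", "birth_year": 1985, "sex": "M"},
    {"name": "Person C", "zip": "75237", "birth_year": 1985, "sex": "M"},
]

reidentified = link(released, public, ["zip", "birth_year", "sex"])
print(reidentified)
```

In this toy data only the first released row is matched uniquely. The second row is shared by two people in the public list, which is the kind of ambiguity that group-based anonymization tries to guarantee for every row.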

Privacy-preserving techniques in data mining
- Most methods use some form of transformation of the data to preserve privacy.
- Typically, these methods reduce the granularity of representation in order to reduce the risk of disclosure.
- Randomization techniques
  - Introduce noise
- Group-based anonymization, e.g. k-anonymity
  - Prohibits too detailed queries
- Distributed privacy preservation
  - Prohibits distribution of individual data while supporting aggregate results
- Downgrading application effectiveness
  - Results such as association rules and classifiers might violate privacy; they can be restricted by association rule hiding, classifier downgrading and query auditing.

Privacy-preserving techniques in data mining
- Randomization techniques
  - Additive perturbation techniques: introduce noise, e.g. drawn from a statistical distribution
    - Can be attacked by analyzing the correlation structure of the randomized data
    - Can also be attacked by matching the distribution of the randomized data with the distribution of known public information
  - Multiplicative perturbation techniques
    - E.g. applying multidimensional projections to reduce the dimensionality of the data
  - Data swapping
    - Values for different records are swapped while correct aggregate values can still be computed
  - The randomization approach is well suited for privacy preservation in data stream mining, since the noise added to a record is independent of the rest of the data
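A minimal sketch of additive perturbation, assuming independent zero-mean Gaussian noise is added to each value. The function name `perturb`, the noise level `sigma` and the age values are illustrative assumptions, not taken from the slides. Individual values are distorted, while an aggregate such as the mean stays close to the original when there are many records.

```python
import random
import statistics

def perturb(values, sigma, seed=0):
    """Return a copy of values with independent Gaussian noise added to each entry."""
    rng = random.Random(seed)  # fixed seed only so the sketch is repeatable
    return [v + rng.gauss(0.0, sigma) for v in values]

# Hypothetical sensitive attribute (ages), repeated to get 1000 records.
ages = [23, 35, 41, 29, 52, 47, 38, 31, 60, 44] * 100
noisy_ages = perturb(ages, sigma=10.0)

true_mean = statistics.mean(ages)
noisy_mean = statistics.mean(noisy_ages)
print(f"true mean {true_mean:.2f}, mean of perturbed data {noisy_mean:.2f}")
```

Because each record's noise is drawn without looking at any other record, the same function can be applied to records one at a time as they arrive, which is the property that makes this approach attractive for data streams.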

Privacy-preserving techniques in data mining
- Group-based anonymization techniques
  - k-anonymity
    - Generalization and/or suppression of attributes to avoid identification of individual data
    - Each release of the data must be such that every combination of values of quasi-identifiers (indirect identifiers) can be indistinguishably matched to at least k respondents.
  - l-diversity
    - In addition to k-anonymity, focuses on maintaining the diversity of sensitive attributes
  - t-closeness model
    - Further enhancement to deal with e.g. skewed data sets
  - Potential problems with sequential releases
    - Several releases of data might reveal more details
    - Linking successive releases must be prevented
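The k-anonymity condition above can be checked mechanically. The following is a sketch under stated assumptions: the table, the attribute names and the helpers `is_k_anonymous` and `generalize_age` are hypothetical, and the generalization shown (replacing an exact age by a ten-year range) is just one simple choice.

```python
from collections import Counter

def is_k_anonymous(records, quasi_identifiers, k):
    """True if every combination of quasi-identifier values occurs in at least k records."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return all(count >= k for count in groups.values())

def generalize_age(record, width=10):
    """Replace the exact age by a range of the given width (a simple generalization)."""
    low = (record["age"] // width) * width
    return {**record, "age": f"{low}-{low + width - 1}"}

# Hypothetical table; the zip codes are already partly suppressed.
records = [
    {"age": 23, "zip": "752**", "diagnosis": "flu"},
    {"age": 27, "zip": "752**", "diagnosis": "asthma"},
    {"age": 34, "zip": "753**", "diagnosis": "flu"},
    {"age": 38, "zip": "753**", "diagnosis": "diabetes"},
]
generalized = [generalize_age(r) for r in records]

print(is_k_anonymous(records, ["age", "zip"], k=2))      # exact ages are unique
print(is_k_anonymous(generalized, ["age", "zip"], k=2))  # age ranges form groups of two
```

The check only looks at the quasi-identifiers. It says nothing about how varied the sensitive attribute is within each group, which is the gap that l-diversity and t-closeness address.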

Privacy-preserving techniques in data mining
- Distributed privacy preservation
  - Horizontal partitioning (see the example on a following slide)
  - Vertical partitioning (see the example on a following slide)
  - Distributed algorithms for aggregate operations (see the example on a following slide)
  - Distributed algorithms for k-anonymity
  - Semi-honest adversaries
  - Malicious adversaries

Distributed data mining
The way the data is distributed also plays an important role in defining the problem, because data can be partitioned into many parts either vertically or horizontally. Vertical partitioning of data implies that although different sites gather information about the same set of entities, they collect different feature sets. Banks, for example, collect financial transaction information, whereas the IRS collects tax information. Figure 2 illustrates vertical partitioning and the kind of useful knowledge we can extract from it. The figure describes two databases, one containing individual medical records and another containing cell-phone information for the same set of people. Mining the joint global database might reveal information such as "cell phones with Li/Ion batteries can lead to brain tumors in diabetics."

Distributed data mining
In horizontal partitioning, different sites collect the same set of information but about different entities. Different supermarkets, for example, collect the same type of grocery shopping data. Figure 3 illustrates horizontal partitioning and shows the credit-card databases of two different (local) credit unions. Taken together, we might see that fraudulent customers often have similar transaction histories. However, no credit union has sufficient data by itself to discover the patterns of fraudulent behavior.
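The difference between the two kinds of partitioning described in these two slides can be made concrete with a toy table. Everything in the sketch is hypothetical: the rows, the column names and the helper functions are invented for illustration and do not reproduce Figure 2 or Figure 3.

```python
def vertical_partition(rows, key, feature_sets):
    """Each site holds all entities but only its own feature set (plus the join key)."""
    return [[{key: r[key], **{f: r[f] for f in feats}} for r in rows]
            for feats in feature_sets]

def horizontal_partition(rows, n_sites):
    """Each site holds all features but only a subset of the entities."""
    return [rows[i::n_sites] for i in range(n_sites)]

def join_vertical(parts, key):
    """Rebuild the global table from vertical partitions by joining on the key."""
    merged = {}
    for part in parts:
        for row in part:
            merged.setdefault(row[key], {}).update(row)
    return [merged[k] for k in sorted(merged)]

# Hypothetical global table.
global_table = [
    {"id": 1, "diagnosis": "diabetes", "battery": "Li/Ion"},
    {"id": 2, "diagnosis": "none", "battery": "NiMH"},
    {"id": 3, "diagnosis": "diabetes", "battery": "Li/Ion"},
    {"id": 4, "diagnosis": "asthma", "battery": "Li/Ion"},
]

medical_site, phone_site = vertical_partition(
    global_table, "id", [["diagnosis"], ["battery"]])
site_a, site_b = horizontal_partition(global_table, 2)
```

In the vertical case a joint pattern only becomes visible after joining on the shared key. In the horizontal case each site could mine on its own, but with fewer entities. Privacy-preserving distributed algorithms aim to obtain the global result without materializing `global_table` at any single site.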

Secure distributed computation
- The secure sum protocol is a simple example of an (information-theoretically) secure multi-party computation (SMC).
- Site k generates a random number R chosen uniformly from [0..n], adds it to its local value x_k, and then sends the sum (R + x_k) mod n to site (k + 1) mod l.
- A drawback of SMC is the inefficiency and complexity of the model.
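The protocol step on this slide can be simulated in a single process. The sketch below makes several assumptions that are not on the slide: the initiating site is site 0, every later site adds its own value to the message it receives and passes it on, the initiator finally subtracts its random mask, R is drawn from 0..n-1, and the true total is known to be smaller than n. The function name `secure_sum` is hypothetical.

```python
import secrets

def secure_sum(local_values, n):
    """Simulate a ring-based secure sum; requires 0 <= sum(local_values) < n.

    Returns the total and the list of messages passed between sites.
    """
    # Simulation-only guard: in a real run no party could check this directly,
    # the bound n would have to be agreed on in advance.
    if not 0 <= sum(local_values) < n:
        raise ValueError("the true sum must lie in [0, n)")

    r = secrets.randbelow(n)                 # initiator's secret random mask R
    running = (r + local_values[0]) % n      # message from site 0 to site 1
    messages = [running]
    for x in local_values[1:]:               # each site adds its own value
        running = (running + x) % n
        messages.append(running)             # the last message returns to site 0
    total = (running - r) % n                # initiator removes the mask
    return total, messages

total, messages = secure_sum([5, 12, 7], n=100)
print(total)
```

Each message a site sees is uniformly distributed modulo n because of the mask, so a single site learns nothing about the values of the sites before it. This simple version does not protect against neighbouring sites that collude.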

Privacy-preserving techniques in data mining
- Privacy preservation of application results
  - Related to disclosure control in statistical databases
  - Association rule hiding
    - Distortion
    - Blocking
  - Downgrading classifier effectiveness
    - Modifying data so that classification accuracy is reduced while retaining the utility of the data for other applications
  - Query auditing and inference control
    - Query auditing: one or more queries from a sequence of queries are denied
    - Query inference control: the underlying data (or the query result) is perturbed so that privacy is preserved
    - See the slides on statistical database security

Statistical database security
- Databases often include sensitive information about single individuals that must be protected from unauthorized use.
- However, statistical information should be extractable from the database.
- Statistical database security must prohibit access to individual data elements.
- Three main security mechanisms: conceptual, restriction-based, and perturbation-based.
- Examples:
  - Prohibit queries on attribute level
  - Allow only queries for statistical aggregation (statistical queries)
  - Prohibit statistical queries when the selection from the population is too small
  - Prohibit repeated statistical queries on the same tuples
  - Introduce distortion into the data

Security in statistical databases
- Statistical database security (also called inference control) should prevent protected information from being inferred from the set of allowed and fully legitimate statistical queries (statistical aggregation).
- A security problem arises when statistical information is to be provided without releasing sensitive information about individuals.
- The main problem in SDB security is to reach a good compromise between the privacy of individuals and the needs of organizations for knowledge, information management and analysis.

Inference protection techniques
- Inference protection techniques can be divided into three main categories: conceptual, restriction-based, and perturbation-based techniques.
- Conceptual techniques
  - Treat the security problem on a conceptual level
  - Lattice model
  - Conceptual partitioning

Inference protection techniques
- Restriction-based techniques
  - Prevent certain types of statistical queries
  - Query-set size control
  - Expanded query-set size control
  - Query-set overlap control
  - Audit-based control
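As an illustration of the first restriction-based technique, query-set size control, the sketch below answers a COUNT query only when the set of matching rows is neither too small nor too large. One common formulation requires both the query set and its complement to contain at least a threshold number of rows. The threshold `min_size`, the table and the function `count_query` are hypothetical choices made for this example.

```python
def count_query(rows, predicate, min_size):
    """Answer COUNT only if the query set and its complement both have >= min_size rows."""
    size = sum(1 for r in rows if predicate(r))
    if size < min_size or size > len(rows) - min_size:
        return None  # query refused: the answer would be too close to individual data
    return size

# Hypothetical statistical database with ten rows.
rows = [{"dept": "A" if i < 4 else "B", "salary": 30 + i, "id": i} for i in range(10)]

print(count_query(rows, lambda r: r["dept"] == "A", min_size=2))   # 4 rows match
print(count_query(rows, lambda r: r["id"] == 7, min_size=2))       # single row: refused
print(count_query(rows, lambda r: True, min_size=2))               # whole table: refused
```

Size control alone can be circumvented by combining several permitted queries, which is the motivation for the overlap and audit-based controls listed above.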

Inference protection techniques
- Perturbation-based techniques
  - Modify the information that is stored or presented
  - Data swapping
  - Random-sample queries
  - Fixed perturbation
  - Query-based perturbation
  - Rounding (systematic, random, controlled)
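Two of the rounding variants in the list can be sketched for non-negative integer query results. The definitions used here are common textbook ones and are an assumption about what the slide intends: systematic rounding goes to the nearest multiple of a base, and random rounding goes up or down with probabilities chosen so that the expected value equals the true value. The function names and the base 5 are illustrative.

```python
import random

def systematic_round(value, base):
    """Round a non-negative integer to the nearest multiple of base (ties go up)."""
    return ((value + base // 2) // base) * base

def random_round(value, base, rng):
    """Round to an adjacent multiple of base; the expected result equals value."""
    lower = (value // base) * base
    remainder = value - lower
    # Round up with probability remainder / base, otherwise round down.
    if rng.random() < remainder / base:
        return lower + base
    return lower

rng = random.Random(42)  # fixed seed only so the sketch is repeatable
true_count = 17
print(systematic_round(true_count, 5))
print(random_round(true_count, 5, rng))
```

With random rounding, repeating the same query gives varying answers whose average approaches the true value, so a deployment would typically need to fix the rounded answer per query or limit repetitions.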

Privacy-preserving techniques in data mining
- Limitations of privacy
  - The curse of dimensionality
    - Many privacy-preserving algorithms have problems in high-dimensional spaces due to sparseness
- Applications of privacy-preserving data mining
  - Medical databases
    - Sensitive information about patients, family members, addresses, etc.
  - Bioterrorism
    - E.g. the need to compare a possible anthrax attack with data from outbreaks of common respiratory diseases
  - Homeland security
    - Credential validation problem, identity theft, web camera and video surveillance, watch list problem
  - Genomic privacy
    - Keeping DNA data private while making it available for analysis