DATA MINING - 1DL105, 1DL025




DATA MINING - 1DL105, 1DL025, Fall 2009
An introductory class in data mining
http://www.it.uu.se/edu/course/homepage/infoutv/ht09
Kjell Orsborn
Uppsala Database Laboratory, Department of Information Technology, Uppsala University, Uppsala, Sweden
12/17/09

Introduction to Data Mining
Privacy in data mining (slides and selected papers)

Privacy and security in data mining
- Protecting private data is an important concern for society.
- Several laws now require explicit consent prior to analysis of an individual's data.
- However, its importance is not limited to individuals: corporations might also need to protect the privacy of their information, even though sharing it for analysis could benefit the company.
- Clearly, the trade-off between sharing information for analysis and keeping it secret to preserve corporate trade secrets and customer privacy is a growing challenge.

Techniques for privacy and security
- Most data mining applications operate under the assumption that all the data is available at a single central repository, called a data warehouse.
- This poses a huge privacy problem, because violating the security of that single repository exposes all the data.
- A naive solution to the problem is de-identification: remove all identifying information from the data and then release it.
  - Pinpointing exactly what constitutes identifying information is difficult.
  - Worse, even if de-identification is possible and (legally) acceptable, it is extremely hard to do effectively without losing the data's utility.
  - Studies have used externally available public information to re-identify anonymous data, and have shown that effectively anonymizing the data requires removing substantial detail.
- Another solution is to avoid centralized warehouses.
  - This requires specialized distributed data mining algorithms, e.g. secure multiparty computation.
  - Accurate methods have been shown for classification and association analysis.
- A third approach is data perturbation, i.e. modifying the data so that it no longer represents real individuals.
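The re-identification risk mentioned above can be illustrated with a small linkage sketch. Everything in it is hypothetical and not taken from the slides: the records, the field names (`zip`, `birth_year`, `sex`, `diagnosis`, `name`) and the helper `link_records`. It only shows how a table with names removed might be joined with a public table on attributes the two have in common.

```python
def link_records(deidentified, public, quasi_identifiers, sensitive):
    """Pair public names with sensitive values when the quasi-identifiers
    of a de-identified row match exactly one public record."""
    index = {}
    for person in public:
        key = tuple(person[q] for q in quasi_identifiers)
        index.setdefault(key, []).append(person["name"])

    matches = []
    for row in deidentified:
        key = tuple(row[q] for q in quasi_identifiers)
        names = index.get(key, [])
        if len(names) == 1:  # unambiguous match: the row is re-identified
            matches.append((names[0], row[sensitive]))
    return matches


# Hypothetical data: names were removed from the medical table before release.
medical = [
    {"zip": "75105", "birth_year": 1970, "sex": "F", "diagnosis": "diabetes"},
    {"zip": "75237", "birth_year": 1985, "sex": "M", "diagnosis": "asthma"},
]
# Hypothetical public register that still carries names.
register = [
    {"name": "Alice", "zip": "75105", "birth_year": 1970, "sex": "F"},
    {"name": "Bob", "zip": "75237", "birth_year": 1985, "sex": "M"},
    {"name": "Carl", "zip": "75237", "birth_year": 1985, "sex": "M"},
]

reidentified = link_records(medical, register, ["zip", "birth_year", "sex"], "diagnosis")
print(reidentified)
```

In this made-up data the first medical row matches a single person and is re-identified, while the second stays ambiguous because two register entries share its attribute values. Making every row ambiguous in this way is what forces the removal of substantial detail.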

Distributed data mining
- The way the data is distributed also plays an important role in defining the problem, because data can be partitioned into many parts either vertically or horizontally.
- Vertical partitioning of data implies that although different sites gather information about the same set of entities, they collect different feature sets. Banks, for example, collect financial transaction information, whereas the IRS collects tax information.
- Figure 2 illustrates vertical partitioning and the kind of useful knowledge we can extract from it. The figure describes two databases, one containing individual medical records and another containing cell-phone information for the same set of people.
- Mining the joint global database might reveal such information as "cell phones with Li/Ion batteries can lead to brain tumors in diabetics".
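A minimal sketch of vertical partitioning, assuming two sites that share a person identifier. The identifiers `p1`..`p3`, the field names and the function `join_vertical` are invented for illustration. The plain join below only shows what the "joint global database" would look like; privacy-preserving distributed mining aims to obtain the pattern without materializing such a join at one site.

```python
# Site A (hypothetical): medical records keyed by a shared person identifier.
medical_site = {
    "p1": {"diabetic": True, "brain_tumor": True},
    "p2": {"diabetic": True, "brain_tumor": False},
    "p3": {"diabetic": False, "brain_tumor": False},
}
# Site B (hypothetical): cell-phone records for the same people.
phone_site = {
    "p1": {"battery": "Li/Ion"},
    "p2": {"battery": "NiCd"},
    "p3": {"battery": "Li/Ion"},
}


def join_vertical(site_a, site_b):
    """Combine the feature sets of the entities known to both sites."""
    shared = site_a.keys() & site_b.keys()
    return {key: {**site_a[key], **site_b[key]} for key in sorted(shared)}


joint = join_vertical(medical_site, phone_site)
# A pattern that spans both feature sets is only visible in the joint table.
pattern = [
    key for key, row in joint.items()
    if row["diabetic"] and row["battery"] == "Li/Ion" and row["brain_tumor"]
]
print(pattern)
```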

Distributed data mining
- In horizontal partitioning, different sites collect the same set of information but about different entities. Different supermarkets, for example, collect the same type of grocery shopping data.
- Figure 3 illustrates horizontal partitioning and shows the credit-card databases of two different (local) credit unions.
- Taken together, we might see that fraudulent customers often have similar transaction histories. However, no credit union has sufficient data by itself to discover the patterns of fraudulent behavior.
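Horizontal partitioning can be sketched in the same spirit. The two credit unions, the field `night_withdrawals`, the helper `support` and the threshold `MIN_SUPPORT` are all assumptions made up for this example. It only illustrates that a pattern may be too rare at each site yet frequent enough in the combined rows.

```python
# Two hypothetical credit unions store the same fields about different customers.
union_a = [
    {"customer": "a1", "night_withdrawals": True, "fraud": True},
    {"customer": "a2", "night_withdrawals": True, "fraud": True},
    {"customer": "a3", "night_withdrawals": False, "fraud": False},
]
union_b = [
    {"customer": "b1", "night_withdrawals": True, "fraud": True},
    {"customer": "b2", "night_withdrawals": True, "fraud": True},
    {"customer": "b3", "night_withdrawals": False, "fraud": False},
]


def support(records):
    """Count records showing the made-up pattern: night withdrawals and fraud."""
    return sum(1 for r in records if r["night_withdrawals"] and r["fraud"])


MIN_SUPPORT = 3  # assumed threshold for calling a pattern significant
local_a, local_b = support(union_a), support(union_b)
global_support = support(union_a + union_b)  # same schema, so the rows simply stack
print(local_a >= MIN_SUPPORT, local_b >= MIN_SUPPORT, global_support >= MIN_SUPPORT)
```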

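Secure distributed computation
- The secure sum protocol is a simple example of an (information-theoretically) secure multiparty computation (SMC).
- Site k generates a random number R uniformly chosen from [0..n], adds this to its local value x_k, and then sends the sum R + x_k (mod n) to site k+1 (mod l), where l is the number of sites.
- Each following site adds its own local value (mod n) and passes the result on; when the message returns to site k, that site subtracts R to obtain the total without having seen any other site's individual value.
- The drawbacks of SMC are its inefficiency and the complexity of the model.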
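The ring pass described above can be simulated in a single process. This is a sketch under stated assumptions, not a networked implementation: the function name `secure_sum` and the sample values are invented, the first list element plays the role of the initiating site, and R is drawn from 0..n-1 on the assumption that the true total is smaller than n.

```python
import random


def secure_sum(local_values, n, rng=None):
    """Simulate the ring pass; returns (total, messages seen on the ring).

    Assumes the true total lies in the range 0 .. n-1.
    """
    rng = rng or random.Random()
    mask = rng.randrange(n)                 # R, known only to the initiating site
    running = (mask + local_values[0]) % n  # message from the initiator to the next site
    messages = [running]
    for value in local_values[1:]:          # every other site adds its own value
        running = (running + value) % n
        messages.append(running)
    total = (running - mask) % n            # initiator removes R from the final message
    return total, messages


values = [12, 7, 30]                        # hypothetical local values x_k
total, messages = secure_sum(values, n=1000)
print(total)
```

Each message on the ring is a partial sum shifted by the unknown R, so a single site that sees one message learns nothing about the values added before it. Sites that collude can still compare the messages they receive and send, which this sketch does not address.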

Statistical database security
- Databases often include sensitive information about single individuals that must be protected from unauthorized use.
- However, statistical information should be extractable from the database.
- Statistical database security must prohibit access to individual data elements.
- Three main security mechanisms: conceptual, restriction-based, and perturbation-based.
- Examples:
  - prohibit queries at the attribute level
  - allow only queries for statistical aggregation (statistical queries)
  - prohibit statistical queries when the selection from the population is too small
  - prohibit repeated statistical queries on the same tuples
  - introduce distortion into the data
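Two of the examples above, answering only aggregate queries and refusing them when the selection is too small, might look as follows. The table `employees`, the exception `QueryRefused`, the function `restricted_average` and the threshold of three rows are assumptions chosen for illustration.

```python
class QueryRefused(Exception):
    """Raised when a statistical query is not allowed to be answered."""


def restricted_average(rows, attribute, predicate, min_size):
    """Return the average of `attribute` over the selected rows, but only
    if at least `min_size` rows are selected."""
    values = [row[attribute] for row in rows if predicate(row)]
    if len(values) < min_size:
        raise QueryRefused("query set too small")
    return sum(values) / len(values)


# Hypothetical table with one sensitive attribute (salary).
employees = [
    {"name": "Ann", "dept": "R&D", "salary": 52000},
    {"name": "Ben", "dept": "R&D", "salary": 48000},
    {"name": "Cy", "dept": "R&D", "salary": 50000},
    {"name": "Eve", "dept": "Sales", "salary": 61000},
]

rd_average = restricted_average(
    employees, "salary", lambda r: r["dept"] == "R&D", min_size=3)
print(rd_average)

try:
    # This selection isolates one person, so the query is refused.
    restricted_average(employees, "salary", lambda r: r["name"] == "Eve", min_size=3)
except QueryRefused as err:
    print("refused:", err)
```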

Security in statistical databases
- Statistical database security (also called inference control) should prevent protected information from being inferred from the set of allowed and fully legitimate statistical queries (statistical aggregation).
- A security problem occurs when statistical information must be provided without releasing sensitive information concerning individuals.
- The main problem in SDB security is to reach a good compromise between the privacy of individuals and the needs of organizations for knowledge, information management, and analysis.
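A short sketch of why inference control is needed even when only aggregates are answered: two queries that each cover enough people can be subtracted to isolate one individual. The table `staff`, the helper `allowed_sum` and its minimum size of three are hypothetical.

```python
class QueryRefused(Exception):
    """Raised when a statistical query is not allowed to be answered."""


def allowed_sum(rows, attribute, predicate, min_size=3):
    """Answer a SUM query only if it covers at least `min_size` rows."""
    values = [row[attribute] for row in rows if predicate(row)]
    if len(values) < min_size:
        raise QueryRefused("query set too small")
    return sum(values)


# Hypothetical table; the manager's salary is the protected value.
staff = [
    {"name": "Ann", "dept": "R&D", "manager": False, "salary": 52000},
    {"name": "Ben", "dept": "R&D", "manager": False, "salary": 48000},
    {"name": "Cy", "dept": "R&D", "manager": False, "salary": 50000},
    {"name": "Dee", "dept": "R&D", "manager": True, "salary": 90000},
]

# Both queries are legitimate on their own: they cover 4 and 3 rows.
everyone = allowed_sum(staff, "salary", lambda r: r["dept"] == "R&D")
non_managers = allowed_sum(
    staff, "salary", lambda r: r["dept"] == "R&D" and not r["manager"])

# Their difference is exactly one person's salary.
inferred = everyone - non_managers
print(inferred)
```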

Inference protection techniques
- One can divide inference protection techniques into three main categories: conceptual, restriction-based, and perturbation-based techniques.
- Conceptual techniques treat the security problem at a conceptual level:
  - lattice model
  - conceptual partitioning

Inference protection techniques
- Restriction-based techniques prevent certain types of statistical queries from being answered:
  - query-set size control
  - expanded query-set size control
  - query-set overlap control
  - audit-based control
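As a rough illustration of query-set size control combined with query-set overlap control, the sketch below remembers which rows each answered query covered and refuses a new query that shares too many rows with an earlier one. The class `OverlapControlledDB`, the parameters `min_size` and `max_overlap`, and the table are assumptions, not a definition from the slides.

```python
class QueryRefused(Exception):
    """Raised when a statistical query is not allowed to be answered."""


class OverlapControlledDB:
    def __init__(self, rows, min_size, max_overlap):
        self._rows = rows
        self._min_size = min_size
        self._max_overlap = max_overlap
        self._answered = []  # query sets (row indices) of earlier answers

    def total(self, attribute, predicate):
        query_set = frozenset(
            i for i, row in enumerate(self._rows) if predicate(row))
        if len(query_set) < self._min_size:
            raise QueryRefused("query set too small")
        for earlier in self._answered:
            if len(query_set & earlier) > self._max_overlap:
                raise QueryRefused("too much overlap with an earlier query")
        self._answered.append(query_set)
        return sum(self._rows[i][attribute] for i in query_set)


# Hypothetical table; the manager's salary is the protected value.
staff = [
    {"name": "Ann", "dept": "R&D", "manager": False, "salary": 52000},
    {"name": "Ben", "dept": "R&D", "manager": False, "salary": 48000},
    {"name": "Cy", "dept": "R&D", "manager": False, "salary": 50000},
    {"name": "Dee", "dept": "R&D", "manager": True, "salary": 90000},
]

db = OverlapControlledDB(staff, min_size=3, max_overlap=1)
first = db.total("salary", lambda r: r["dept"] == "R&D")
try:
    # Shares three rows with the first query, so subtracting the two
    # answers would expose a single salary; the query is refused.
    db.total("salary", lambda r: r["dept"] == "R&D" and not r["manager"])
    second_refused = False
except QueryRefused:
    second_refused = True
print(first, second_refused)
```

The price of this control is visible even in the toy example: after one answered query, most other useful queries on the same small table are refused as well.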

Inference protection techniques
- Perturbation-based techniques modify the information that is stored or presented:
  - data swapping
  - random-sample queries
  - fixed perturbation
  - query-based perturbation
  - rounding (systematic, random, controlled)
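Two of the listed techniques can be sketched briefly. The functions `swap_attribute` and `random_round`, the table `patients` and the rounding base of 5 are hypothetical, and the rounding rule shown (round up with probability proportional to the remainder) is one assumed variant of random rounding rather than the only one.

```python
import random


def swap_attribute(rows, attribute, rng):
    """Data swapping: permute one attribute's values among the records."""
    values = [row[attribute] for row in rows]
    rng.shuffle(values)
    return [{**row, attribute: value} for row, value in zip(rows, values)]


def random_round(value, base, rng):
    """Randomly round a non-negative integer to a neighbouring multiple of base."""
    lower = (value // base) * base
    remainder = value - lower
    if remainder == 0:
        return value
    # Round up with probability remainder/base, so the expected result equals value.
    return lower + base if rng.random() < remainder / base else lower


rng = random.Random(0)  # fixed seed only to make the sketch repeatable
patients = [
    {"id": 1, "age": 34},
    {"id": 2, "age": 51},
    {"id": 3, "age": 47},
    {"id": 4, "age": 29},
]
swapped = swap_attribute(patients, "age", rng)
rounded = [random_round(row["age"], 5, rng) for row in patients]
print(swapped, rounded)
```

Swapping keeps the overall distribution of the attribute intact while breaking the link between a value and a specific record; rounding hides exact values at the cost of some precision in every answer.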