Privacy-Preserving Big Data Publishing


Privacy-Preserving Big Data Publishing
Hessam Zakerzadeh (University of Calgary, Canada), Charu C. Aggarwal (IBM T.J. Watson, USA), Ken Barker (University of Calgary, Canada)
SSDBM '15

Data Publishing
- OECD* declaration on access to research data
- Policy in the Canadian Institutes of Health Research (CIHR)
* OECD: Organisation for Economic Co-operation and Development
Ref: Introduction to Privacy-Preserving Data Publishing: Concepts and Techniques (2010).

Data Publishing
Benefits:
- Facilitating the research community in confirming published results.
- Ensuring the availability of original data for meta-analysis.
- Making data available for instruction and education.
Requirement: the privacy of individuals whose data is included must be preserved.

Tug of War
- Preserving the privacy of individuals whose data is included (requires anonymization).
- Usefulness (utility) of the published data.

Key Question in Data Publishing
How can we preserve the privacy of individuals while publishing data of high utility?

Privacy Models
Privacy-preserving models:
- Interactive setting (e.g. differential privacy)
- Non-interactive setting (e.g. k-anonymity, l-diversity), via randomization or generalization

K-Anonymity
Attributes are classified as identifiers, quasi-identifiers, and sensitive attributes. K-anonymity requires that each record be indistinguishable from at least k-1 other records on the quasi-identifier attributes.

K-Anonymity
Records sharing the same (generalized) quasi-identifier values form an equivalence class (EQ).

L-Diversity
Each equivalence class must additionally contain at least l "well-represented" values of the sensitive attribute.
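The two properties above can be illustrated with a minimal sketch (not from the slides; the record layout — a (quasi-identifier tuple, sensitive value) pair — and function names are illustrative assumptions):

```python
from collections import defaultdict

def equivalence_classes(records):
    """Group records by their quasi-identifier tuple."""
    eqs = defaultdict(list)
    for qi, sensitive in records:
        eqs[qi].append(sensitive)
    return eqs

def is_k_anonymous(records, k):
    """Every equivalence class contains at least k records."""
    return all(len(s) >= k for s in equivalence_classes(records).values())

def is_l_diverse(records, l):
    """Every equivalence class contains at least l distinct sensitive values."""
    return all(len(set(s)) >= l for s in equivalence_classes(records).values())

# Example: two equivalence classes on generalized (age range, zip prefix).
table = [
    (("20-30", "53*"), "flu"),
    (("20-30", "53*"), "cold"),
    (("30-40", "54*"), "flu"),
    (("30-40", "54*"), "flu"),   # second EQ has only one sensitive value
]
# table is 2-anonymous but not 2-diverse.
```

Note the gap the slides highlight: the second class satisfies k-anonymity for k = 2 but fails l-diversity for l = 2, since all its sensitive values are identical.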

Assumptions in the State of the Art of Anonymization
Implicit assumptions in current anonymization:
- Small- to moderate-size data
- Batch and one-time process
- Focus on quality of the published data

Still Valid? New assumptions:
- Small- to moderate-size data → big data (in tera- or petabytes), e.g. health data or web search logs
- Batch and one-time process → repeated application
- Focus on quality of the published data → quality + scalability

Naïve Solution? Divide & Conquer
Inspired by streaming-data anonymization techniques:
- Divide the big data into small parts (fragments).
- Anonymize each part (fragment) individually and in isolation.

Naïve Solution? Divide & Conquer
Inspired by streaming-data anonymization techniques:
- Divide the big data into small parts (fragments).
- Anonymize each part (fragment) individually and in isolation.
What do we lose? Quality. Rule of thumb: more data means less generalization/perturbation (the large-crowd effect).

Main Question to Answer in Big Data Privacy
Is it possible to take advantage of the entire data set in the anonymization process without losing scalability?

Map-Reduce-Based Anonymization
Idea: distribute the computationally heavy anonymization process among different processing nodes such that the distribution does not affect the quality (utility) of the anonymized data.

Map-Reduce Paradigm
(figure: data flow of a MapReduce job)

Mondrian-like Map-Reduce Algorithm
Traditional Mondrian:
1. Pick a dimension (e.g. the one with the widest range), called the cut dimension.
2. Pick a point along the cut dimension, called the cut point.
3. Split the data into two equivalence classes along the cut dimension at the cut point, provided the privacy condition is not violated.
4. Repeat until no further split is possible.
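The four steps above can be sketched as a single-machine recursion in Python. This is a minimal sketch, assuming a median cut point and k-anonymity as the privacy condition; the function names and data layout are illustrative, not the paper's code:

```python
def widest_dimension(records):
    """Step 1: pick the dimension with the widest value range as the cut dimension."""
    dims = range(len(records[0]))
    return max(dims, key=lambda d: max(r[d] for r in records) - min(r[d] for r in records))

def mondrian(records, k):
    """Steps 2-4: recursively split into equivalence classes of size >= k."""
    dim = widest_dimension(records)
    values = sorted(r[dim] for r in records)
    cut = values[len(values) // 2]                 # step 2: cut point = median
    left = [r for r in records if r[dim] < cut]
    right = [r for r in records if r[dim] >= cut]
    if len(left) >= k and len(right) >= k:         # step 3: privacy condition holds
        return mondrian(left, k) + mondrian(right, k)   # step 4: repeat
    return [records]                               # no further split possible

classes = mondrian([(12, 33, 5), (56, 33, 11), (12, 99, 5), (40, 50, 8)], 2)
```

Each returned list of records is one equivalence class; publishing then replaces each record by the per-dimension ranges of its class.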

Mondrian-like Map-Reduce Algorithm
(figures: successive splits produced by traditional Mondrian)

Mondrian-like Map-Reduce Algorithm
Preliminaries:
- Each equivalence class is divided into at most q equivalence classes.
- A global file is shared among all nodes. It contains the equivalence classes formed so far, organized in a tree structure (the equivalence-class tree); initially it contains only the most general equivalence class.

Mapper

Mapper: Example
Iteration 1: only one equivalence class exists (call it eq1): eq1 = [1:100],[1:100],[1:100]
Data records:
12, 33, 5
56, 33, 11
12, 99, 5
Mapper's output (key, value):
<eq1, <(12,1),(33,1),(5,1)>>
<eq1, <(56,1),(33,1),(11,1)>>
<eq1, <(12,1),(99,1),(5,1)>>
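A minimal sketch of this mapper in Python (the global-file representation and names like `in_eq` are assumptions for illustration): for each record, look up its equivalence class in the shared global file and emit the EQ id as the key, with per-dimension (value, count = 1) pairs as the value.

```python
def in_eq(record, eq):
    """True if the record falls inside the EQ's per-dimension ranges."""
    return all(lo <= v <= hi for v, (lo, hi) in zip(record, eq["ranges"]))

def mapper(record, eq_tree):
    """Emit (EQ id, [(value, 1) per dimension]) for the record's equivalence class."""
    for eq in eq_tree:                       # lookup in the shared global file
        if in_eq(record, eq):
            return (eq["id"], [(v, 1) for v in record])

# Iteration 1: the global file holds only the most general EQ.
eq_tree = [{"id": "eq1", "ranges": [(1, 100)] * 3}]
out = [mapper(r, eq_tree) for r in [(12, 33, 5), (56, 33, 11), (12, 99, 5)]]
```

The counts of 1 exist so that the combiner and reducer can aggregate them into per-dimension frequency tables.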

Combiner

Combiner Example
Mapper's output (combiner's input):
<eq1, <(12,1),(33,1),(5,1)>>
<eq1, <(56,1),(33,1),(11,1)>>
<eq1, <(12,1),(99,1),(5,1)>>
Combiner's output (per-dimension value counts):
<eq1, < {12:2, 56:1}, {33:2, 99:1}, {5:2, 11:1} >>
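The combiner's local aggregation can be sketched as follows (a simplified illustration, assuming the mapper output layout shown above):

```python
from collections import Counter

def combiner(key, values):
    """Merge (value, count) pairs per dimension into per-dimension count tables."""
    counts = [Counter() for _ in values[0]]
    for pairs in values:
        for d, (v, c) in enumerate(pairs):
            counts[d][v] += c
    return (key, counts)

key, counts = combiner("eq1", [
    [(12, 1), (33, 1), (5, 1)],
    [(56, 1), (33, 1), (11, 1)],
    [(12, 1), (99, 1), (5, 1)],
])
```

Collapsing repeated values into counts locally is what cuts the traffic between mappers and reducers: the reducer only needs frequencies per dimension, not the raw records.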

Reducer

Reducer's Output
Combiner's output (reducer's input):
<eq1, < {12:2, 56:1}, {33:2, 99:1}, {5:2, 11:1} >>
Reducer's output:
If eq1 is splittable: <[12:56],[33:45],[5:11], 1> and <[12:56],[45:99],[5:11], 1> (both added to the global file)
If eq1 is un-splittable: <[12:56],[33:99],[5:11], 0> (added to the global file)
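The reducer's logic can be sketched like this. It is a simplified stand-in, assuming a range-midpoint cut rule and k-anonymity as the privacy condition (the slides do not fix either choice), not the paper's implementation:

```python
from collections import Counter

def reducer(eq_id, count_tables, k):
    """Merge the combiners' per-dimension count tables, then try one Mondrian
    split of the widest dimension; emit (ranges, splittable-flag) entries."""
    merged = [Counter() for _ in count_tables[0]]
    for tables in count_tables:              # one table list per combiner
        for d, t in enumerate(tables):
            merged[d].update(t)
    ranges = [(min(t), max(t)) for t in merged]
    dim = max(range(len(merged)), key=lambda d: ranges[d][1] - ranges[d][0])
    lo, hi = ranges[dim]
    cut = (lo + hi) // 2                     # assumed cut rule: range midpoint
    left = sum(c for v, c in merged[dim].items() if v < cut)
    right = sum(c for v, c in merged[dim].items() if v >= cut)
    if left >= k and right >= k:             # privacy condition: k-anonymity
        return [(ranges[:dim] + [(lo, cut)] + ranges[dim + 1:], 1),
                (ranges[:dim] + [(cut, hi)] + ranges[dim + 1:], 1)]
    return [(ranges, 0)]                     # un-splittable: flag 0

tables = [[Counter({12: 2, 56: 1}), Counter({33: 2, 99: 1}), Counter({5: 2, 11: 1})]]
split = reducer("eq1", tables, k=1)          # splittable: two child EQs, flag 1
stuck = reducer("eq1", tables, k=2)          # un-splittable with k = 2, flag 0
```

Whichever branch fires, the emitted entries are appended to the global file, so the next iteration's mappers see the updated equivalence-class tree.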

Improvement 1 (Data-Transfer Improvement)
The mapper outputs only records belonging to splittable equivalence classes.
Requirement: the global file includes a flag for each equivalence class indicating whether it is splittable; if a data record belongs to an un-splittable EQ, the mapper does not output it.
e.g.
[1:100],[1:100],[1:100], 1
[12:56],[33:45],[5:11], 1
[12:56],[45:99],[5:11], 1
[12:30],[33:45],[5:11], 1
[30:56],[33:45],[5:11], 0

Improvement 2 (Memory Improvement)
What if the global file doesn't fit into the memory of the mapper nodes? Break the global file down into multiple small files.
How?
- Split the big data into small files.
- Create a global file per small input file; each global file contains only a subset of the total equivalence classes.
- Each small global file is referred to as a global subset equivalence-class (gsec) file.

Improved Algorithm
Mapper's output:
<eq1, <f1, (12,1),(33,1),(5,1)>>
<eq1, <f1, (56,1),(33,1),(11,1)>>
<eq1, <f1, (12,1),(99,1),(5,1)>>
Combiner's output:
<eq1, <f1, {12:2, 56:1}, {33:2, 99:1}, {5:2, 11:1} >>
Reducer's output:
If eq1 is splittable: <[12:56],[33:45],[5:11], 1, f1> and <[12:56],[45:99],[5:11], 1, f1>
If eq1 is un-splittable: <[12:56],[33:99],[5:11], 0, f1>

(figure: example run with k = 2, q = 2)

Further Analysis
Time complexity (per round): mapper, combiner, reducer.
Data transfer (per round): mapper to combiner (local), combiner to reducer (across the network).

Experiments
The experiments answer the following questions:
- How well does the algorithm scale?
- How much information is lost in the anonymization process?
- How much data is transferred between mappers/combiners and combiners/reducers in each iteration?

Experiments
Data sets:
- Poker data set: 1M records, 11 dimensions each
- Synthetic data set: 10M records, 15 dimensions each, 1.4 GB
- Synthetic data set: 100M records, 15 dimensions each, 14.5 GB
Information-loss baseline: each data set is split into 8 fragments, and each fragment is anonymized individually.
State of the art: MapReduce Top-Down Specialization (MRTDS) [Zhang et al. '14]

Experiment Settings
- Hadoop cluster on AceNet
- 32 nodes, each with 16 cores and 64 GB RAM
- Running Red Hat Enterprise Linux 4.8

Information Loss vs. K (figure)

Information Loss vs. L (figure)

Running Time vs. K (L) (figure)

Data Transfer vs. Iteration # between mappers and combiners (figure)

Data Transfer vs. Iteration # between combiners and reducers (figure)

Future Work
- Extension to other data types (graph-data anonymization, set-valued data anonymization, etc.)
- Extension to other privacy models
