Hui(Wendy) Wang Stevens Institute of Technology New Jersey, USA. VLDB Cloud Intelligence workshop, 2012



Similar documents
Privacy-Preserving Mining of Association Rules On Cloud by improving Rob Frugal Algorithm

Merkle Hash Tree based Techniques for Data Integrity of Outsourced Data

Homomorphic Encryption Schema for Privacy Preserving Mining of Association Rules

Multi-table Association Rules Hiding

Privacy-Preserving in Outsourced Transaction Databases from Association Rules Mining

International Journal of Scientific & Engineering Research, Volume 4, Issue 10, October-2013 ISSN

Data Outsourcing based on Secure Association Rule Mining Processes

Cloud Computing: Provide privacy and Security in Databaseas-a-Service

A NOVEL APPROACH FOR MULTI-KEYWORD SEARCH WITH ANONYMOUS ID ASSIGNMENT OVER ENCRYPTED CLOUD DATA

PRIVACY-PRESERVING PUBLIC AUDITING FOR SECURE CLOUD STORAGE

Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

Improving data integrity on cloud storage services

How To Ensure Correctness Of Data In The Cloud

Secure Data transfer in Cloud Storage Systems using Dynamic Tokens.

Advances in Natural and Applied Sciences

Open source Google-style large scale data analysis with Hadoop

PaRFR : Parallel Random Forest Regression on Hadoop for Multivariate Quantitative Trait Loci Mapping. Version 1.0, Oct 2012

IMPROVED BYZANTINE FAULT TOLERANCE TWO PHASE COMMIT PROTOCOL

Hadoop Parallel Data Processing

An Integrity Lock Architecture for Supporting Distributed Authorizations in Database Federations

Large-scale Data Mining: MapReduce and Beyond Part 2: Algorithms. Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook

Near Sheltered and Loyal storage Space Navigating in Cloud

Secure Collaborative Privacy In Cloud Data With Advanced Symmetric Key Block Algorithm

The Improved Job Scheduling Algorithm of Hadoop Platform

CLOUD BASED PEER TO PEER NETWORK FOR ENTERPRISE DATAWAREHOUSE SHARING

Data Storage Security in Cloud Computing

Prerequisites. Course Outline

Finding Insights & Hadoop Cluster Performance Analysis over Census Dataset Using Big-Data Analytics

Secure Way of Storing Data in Cloud Using Third Party Auditor

Distributed Computing and Big Data: Hadoop and MapReduce

Analysis of MapReduce Algorithms

Arnab Roy Fujitsu Laboratories of America and CSA Big Data WG

The basic data mining algorithms introduced may be enhanced in a number of ways.

A Novel Technique of Privacy Protection. Mining of Association Rules from Outsourced. Transaction Databases

An Efficient Multi-Keyword Ranked Secure Search On Crypto Drive With Privacy Retaining

Top Ten Security and Privacy Challenges for Big Data and Smartgrids. Arnab Roy Fujitsu Laboratories of America

An Introduction to Data Mining

A Comprehensive Data Forwarding Technique under Cloud with Dynamic Notification

Crowdsourcing Entity Resolution: a Short Overview and Open Issues

Trusted Public Auditing Process for Secure Cloud Storage

Big Data Analytics and Healthcare

Azure Machine Learning, SQL Data Mining and R

Secrecy Maintaining Public Inspecting For Secure Cloud Storage

International Journal of Infinite Innovations in Engineering and Technology. ISSN (Online): , ISSN (Print):

International Journal for Research in Applied Science & Engineering Technology (IJRASET) A Review on Big Data Cloud Computing

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Task Scheduling in Hadoop

Secure Framework and Sparsity Structure of Linear Programming in Cloud Computing P.Shabana 1 *, P Praneel Kumar 2, K Jayachandra Reddy 3

Map/Reduce Affinity Propagation Clustering Algorithm

Development of Effective Audit Service to Maintain Integrity of Migrated Data in Cloud

Cloud Platforms, Challenges & Hadoop. Aditee Rele Karpagam Venkataraman Janani Ravi

International Journal of Advanced Research in Computer Science and Software Engineering

MUTI-KEYWORD SEARCH WITH PRESERVING PRIVACY OVER ENCRYPTED DATA IN THE CLOUD

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Problem Solving Hands-on Labware for Teaching Big Data Cybersecurity Analysis

CHAPTER 5 INTELLIGENT TECHNIQUES TO PREVENT SQL INJECTION ATTACKS

PRIVACY PRESERVING ASSOCIATION RULE MINING

Ensuring Data Storage Security in Cloud Computing

Fast Analytics on Big Data with H20

A Direct Numerical Method for Observability Analysis

Classification On The Clouds Using MapReduce

SECURE AND TRUSTY STORAGE SERVICES IN CLOUD COMPUTING

Improving Apriori Algorithm to get better performance with Cloud Computing

Map-Reduce for Machine Learning on Multicore

Big Data Analytics Opportunities and Challenges

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH

How To Ensure Correctness Of Data In The Cloud

preliminary experiment conducted on Amazon EC2 instance further demonstrates the fast performance of the design.

RECOMMENDATION SYSTEM USING BLOOM FILTER IN MAPREDUCE

DYNAMIC QUERY FORMS WITH NoSQL

Dynamic Data in terms of Data Mining Streams

PRIVACY ASSURED IMAGE STACK MANAGEMENT SERVICE IN CLOUD

An Efficient Data Correctness Approach over Cloud Architectures

What We Can Do in the Cloud (2) -Tutorial for Cloud Computing Course- Mikael Fernandus Simalango WISE Research Lab Ajou University, South Korea

Open source large scale distributed data management with Google s MapReduce and Bigtable

Chapter 20: Data Analysis

Index Terms Cloud Storage Services, data integrity, dependable distributed storage, data dynamics, Cloud Computing.

HadoopRDF : A Scalable RDF Data Analysis System

Analytics on Big Data

EFFICIENT AND SECURE DATA PRESERVING IN CLOUD USING ENHANCED SECURITY

Active Learning SVM for Blogs recommendation

International Journal of Advanced Research in Computer Science and Software Engineering

Database Application Developer Tools Using Static Analysis and Dynamic Profiling

Large-Scale Test Mining

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Introduction to Hadoop

Transcription:

Integrity Verification of Cloud-hosted Data Analytics Computations (Position paper) Hui(Wendy) Wang Stevens Institute of Technology New Jersey, USA 9/4/ 1

Data-Analytics-as-a- Service (DAaS) Outsource: Data + Analytics Needs Analytics results Data Owner (Client) Owns large volume of data Computationally weak Cloud Computationally powerful Provide data analytics as a service Examples of DAaS Google Prediction APIs, Amazon EC2 9/4/ 2

Fear of Cloud: Security 9/4/ 3

Integrity Verification of the result in DAaSParadigm The server cannot be fully trusted We focus on result integrity o The client should be able to verify that the analytics result returned by the cloud is correct. Challenges: o Analytics result is unknown before mining. o The client is computationally weak to perform sophisticated analysis. 9/4/ 4

Related Work Integrity Assurance for Database-as-a-Service (DAS) Paradigm o Merklehash trees [2,3], signatures on a chain of paired tuples [4], challenge tokens [5], and counterfeit records [9] Protect sensitive data and data mining results in the data-mining-as-a-service (DMAS) o Association rule mining [1, 6-7] Integrity Verification in DMAS Paradigm o Frequent itemset mining [8] 9/4/ 5

Introduction Related Work Preliminaries Outline Our Verification Approach Future Work and Conclusion 9/4/ 6

Introduction Related Work Preliminaries Outline Our Verification Approach Future Work and Conclusion 9/4/ 7

Summarization Form Training data: an m n dimensional matrix X Target lables: an m 1 matrix Y Problem: o Find the parameter vector θ such that Y = θ T X. The solution: obtain θ = (X T X) 1 X T Y. This computation can be reformulated to compute (1) A = X T X, and (2) B = X T Y. In other words, compute 9/4/ 8

Importance of Summarization Form A large class of machine learning algorithms can be expressed in the summation form. Examples of summarization form based algorithms: o Locally weighted linear regression, o Naïve Bayes, o Neural network, o Principle component analysis, o 9/4/ 9

Summarization Form in MapReduce Mappers: o Each Mapper is assigned a split and/or a split o The Mappers compute the partial values Reducers: o The Reducers sum up the partial values As and Bs. 9/4/ 10

Attack Model Non-collusive workers o Return incorrect result independently, without consulting other malicious workers. Collusive workers o Communicate with each other before cheating. 9/4/ 11

Verification Goal 9/4/ 12

Introduction Related Work Preliminaries Outline Our Verification Approach Future Work and Conclusion 9/4/ 13

Architecture MapReduce core consists of one master JobTracker task and many TaskTracker tasks. Typical configurations run the JobTracker task on the same machine, called the master, and run TaskTracker tasks on other machines, called slaves. We assume the master node is trusted, and is responsible for verification. 9/4/ 14

Overview of Our Verification Approaches Catch non-collusive malicious workers o When honest workers take majority: the replication-based verification approach o When malicious workers take majority: the artificial data injection (ADI) approach Catch collusive malicious workers o The instance-hiding verification approach 9/4/ 15

Catch Non-collusive Malicious Workers When honest workers take majority o We propose the replication-based verification approach The master node assigns the same task to multiple workers. The majority of workers will return correct answer Workers whose results are inconsistent with the majority of the workers that are assigned the same task are caught as malicious. If there is no winning answer by majority voting, the master node assigns more copies of the task to additional workers 9/4/ 16

Catch Non-collusive Malicious Workers When malicious workers take majority o We propose the artificial data injection (ADI) verification approach The master nodes inserts an artificial k l matrix Xainto Xs 9/4/ 17

ADI Approach (Cont.) Verification o The master node pre-computes o After the worker returns checks whether, the master node o If there exists any mismatch, the master node concludes with 100% certainty that the worker returns incorrect answer. o Otherwise, the master node determines the result correctness with a probability, where β is the precision threshold given in (α, β)-correctness requirement. 9/4/ 18

ADT Discussion To satisfy (α, β)-correctness, it must satisfy that Where is the number of columns in the artificial matrix X a (the number of rows of X a is the same as that of X). s It only needs a small to catch workers that change a small fraction of result with high correctness probability. l is independent of the size of the input matrix. Thus our ADI mechanism is especially useful for verification of computation of large matrics. 9/4/ 19

ADT Complexity The complexity of verification preparation: O(kl). The complexity of verification: O(kl 2 ) o K: number of columns of the input matrix X s o L: number of columns of the artificial matrix X a 9/4/ 20

Catch Collusive Malicious Workers ADT approach cannot resist the cheating of collusive malicious workers We propose the instance-hiding verification approach. Before assigning Xto Xsto the workers, the master node applies transformation on Xsby computing where Ts is a k k transformation matrix. o Ts is unique for each input matrix Xs. o Then the master node injects the artificial data Xainto the transformed matrix X s and apply the ADI verification procedure. 9/4/ 21

Post-processing The post-processing procedure eliminates the noise due to the insertion of artificial matrix Non-collusive malicious workers o For the ADI verification approach, the master node returns as the real answer of Collusive malicious workers o For the instance-hiding approach, the master node computes o After that, the master node computes 9/4/ 22

Conclusion Result integrity verification of the result in cloudbased data-analytics-as-a-service paradigm is very important We consider summarization form, in which a large class of machine learning algorithms can be expressed. We propose verification approaches for both noncollusive and collusive malicious Mappers 9/4/ 23

Open Questions What will be the cost that our verification techniques will bring to computations in real-world cloud, e.g., Amazon EC2? Can we define a budget-driven model to allow the client to specify her verification needs in terms of budget (possibly in monetary format) besides α and β? How can we identify the collusive and non-collusive workers, as well as whether collusive workers take the majority, in a cloud in practice? Can we achieve a deterministic verification guarantee by adapting the existing cryptographic techniques to DAaS paradigm? 9/4/ 24

References 1. Giannotti, F., Lakshmanan, L.V., Monreale, A., Pedreschi, D., Wang, H.: Privacy-preserving mining of association rules from outsourced transaction databases. In: SPCC (2010) 2. Li, F., Hadjieleftheriou, M., Kollios, G., Reyzin, L.: Dynamic authenticated index structures for outsourced databases. In: SIGMOD (2006) 3. Mykletun, E., Narasimha, M., Tsudik, G.: Authentication and integrity in outsourced databases. Trans. Storage 2 (May 2006) 4. Pang, H., Jain, A., Ramamritham, K., Tan, K.-L.: Verifying completeness of relational query results in data publishing. In: SIGMOD (2005) 5. Sion, R.: Query execution assurance for outsourced databases. In: VLDB (2005) 6. Tai, C.-H., Yu, P.S., Chen, M.-S.: k-support anonymity based on pseudo taxonomy for outsourcing of frequent itemset mining. In: SIGKDD (2010) 7. Wong, W.K., Cheung, D.W., Hung, E., Kao, B., Mamoulis, N.: Security in outsourcing of association rule mining. In: VLDB (2007) 8. Wong, W.K., Cheung, D.W., Kao, B., Hung, E., Mamoulis, N.: An audit environment for outsourcing of frequent itemset mining. PVLDB 2 (2009) 9. Xie, M., Wang, H., Yin, J., Meng, X.: Integrity auditing of outsourced data. In: VLDB (2007) 9/4/ 25

Q & A Thanks! 9/4/ 26