Shroudbase Technical Overview

Differential Privacy

Differential privacy is a rigorous mathematical definition of database privacy developed for the problem of privacy-preserving data analysis. Specifically, it ensures that a computation does not reveal information about individual records present in the input by requiring that the computation behave almost identically on any two databases that differ by at most a single record.

Formally, a mechanism $M$ mapping datasets to distributions over an output space $R$ is $\epsilon$-differentially private if, for every $S \subseteq R$ and for all datasets $A$, $A'$ such that at most one record must be added or removed to change $A$ into $A'$,

\[ \Pr[M(A) \in S] \le e^{\epsilon} \, \Pr[M(A') \in S]. \]

We can interpret the definition as follows: if there are two databases, one containing an individual's data, $A$, and one without this individual's data, $A'$, then for small values of $\epsilon$ there is no output an adversary could use to distinguish between $A$ and $A'$. As such, it is virtually impossible to identify any information about an individual when differential privacy is achieved. It ensures that personal information about an individual will not be disclosed by participating in a dataset, regardless of any external information or datasets, regardless of the computational power of an adversary, and regardless of any statistical techniques that exist or may be developed in the future.

Differential privacy is typically achieved by adding statistical noise to the output of queries or, more abstractly, to the method of choosing responses to queries. A decade of research in the field has produced an array of algorithms that achieve differential privacy for a wide range of data analysis methods. These algorithms have been refined to introduce minimal noise, and come with strong, provable guarantees of accuracy.
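As a minimal sketch of the noise-adding approach described above, the following Python answers a counting query with Laplace noise. The function names, the sample data, and the choice of the Laplace mechanism with scale $1/\epsilon$ are illustrative assumptions; none of them are taken from the Shroudbase software.

```python
import math
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Draw one sample from a zero-mean Laplace distribution with the given scale."""
    # The difference of two independent Exponential(1) draws is Laplace(0, 1).
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def private_count(records, predicate, epsilon: float, rng: random.Random) -> float:
    """Answer a counting query with Laplace noise (hypothetical helper).

    Adding or removing one record changes a count by at most 1, so noise
    with scale 1 / epsilon is enough for epsilon-differential privacy.
    """
    true_count = sum(1 for record in records if predicate(record))
    return true_count + laplace_noise(1.0 / epsilon, rng)


def laplace_density(x: float, scale: float) -> float:
    """Probability density of a zero-mean Laplace distribution at x."""
    return math.exp(-abs(x) / scale) / (2.0 * scale)


if __name__ == "__main__":
    rng = random.Random(0)
    ages = [34, 51, 29, 62, 45]  # made-up example records
    print(private_count(ages, lambda age: age >= 40, epsilon=0.5, rng=rng))
```

For two neighboring databases whose true counts are $c$ and $c + 1$, the ratio `laplace_density(y - c, 1 / epsilon) / laplace_density(y - (c + 1), 1 / epsilon)` never exceeds $e^{\epsilon}$ for any output `y`, which is the inequality in the definition applied to this mechanism.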
However, this interactive model requires that noise be drawn from a fixed distribution on multiple occasions, which introduces a critical drawback: the database comes with a budget, and querying is costly. Once this budget is exhausted, differential privacy is no longer satisfied.

Producing Synthetic Data

The key to practical, differentially private data analysis is generating synthetic databases. These databases are computed by differentially private algorithms on the original data, and therefore ensure that any computation over the data is differentially private. As a result, these databases do not impose any limitations on data access, and remain private even in the event of a security breach.

An example of a simple method for producing synthetic data on low-dimensional datasets that accurately answers statistical queries (queries that count the number of records satisfying a certain predicate) is the MWEM algorithm. MWEM (Multiplicative Weights Exponential Mechanism) maintains an approximating dataset over a domain of records, initialized to be a uniform distribution over the set of records. At each iteration, the algorithm chooses a query with a high error on the approximate data, poses this query to the true data, and improves the approximate data to more accurately answer that specific query. After a specified number of iterations, the algorithm outputs the average of the approximate databases produced at each iteration as the synthetic data. The accuracy of this algorithm, defined to be the maximum error of any query, is provably logarithmic in the number of queries and asymptotically smaller than the number of records.

The mathematical details are as follows. The algorithm takes as input a database $D$, a number of records $n$, a set $Q$ of queries, a number of iterations $T$, and a privacy parameter $\epsilon$ (a small number). First, a distribution $A_0$ is initialized to be the uniform distribution over the universe $U$ of records. The exponential mechanism, which satisfies differential privacy, is used to choose queries. At a given iteration $i$ of the algorithm, the exponential mechanism chooses a query $q_i$ from the distribution that weights each $q \in Q$ by

\[ \exp\left( \epsilon \, \frac{|q(A_{i-1}) - q(D)|}{2T} \right), \]

where $A_{i-1}$ is the approximate database at iteration $i - 1$ and $D$ is the true data. The mechanism for posing the query to the data achieves differential privacy by adding Laplace noise to the output of the query. That is, the measurement of the output of a query is taken to be

\[ m_i = q_i(D) + \mathrm{Lap}\left( \frac{2T}{\epsilon} \right). \]

At each iteration $i$, the approximate database is updated using the multiplicative weights algorithm:

\[ A_i(x) = A_{i-1}(x) \cdot \exp\left( q_i(x) \cdot \frac{m_i - q_i(A_{i-1})}{2n} \right). \]

Once the algorithm has completed $T$ iterations, $A = \operatorname{avg}_{i<T} A_i$ is output as a synthetic database. The worst-case error of the algorithm is given by

\[ \max_{q \in Q} |q(A) - q(D)| \le 2n \sqrt{\frac{\log |U|}{T}} + \frac{10\, T \log |Q|}{\epsilon}. \]

The MWEM algorithm, however, is not a universal solution to the release of synthetic data. The algorithm has worst-case exponential complexity, so it is not practical for high-dimensional datasets. Moreover, the accuracy guarantees it provides hold only for linear queries. Although compositions of linear queries can be used to implement a broad range of statistical techniques, MWEM does not provide any accuracy guarantees for certain crucial methods in data analysis, such as regressions.
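The MWEM steps described above can be sketched in Python for a small universe. This is an illustrative sketch under stated assumptions, not Shroudbase's implementation: the data is represented as a histogram of counts over the universe, each query is a 0/1 vector selecting records, the approximate database is rescaled after every update so that it always holds $n$ records, and the output averages the databases produced by the $T$ updates. The names `mwem`, `evaluate`, and `laplace_noise` are hypothetical.

```python
import math
import random


def laplace_noise(scale, rng):
    """Zero-mean Laplace sample: difference of two exponential draws."""
    return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))


def evaluate(query, hist):
    """Linear counting query: total weight of the records the query selects."""
    return sum(q_x * h_x for q_x, h_x in zip(query, hist))


def mwem(true_hist, queries, T, epsilon, rng):
    """Sketch of MWEM; returns a synthetic histogram over the same universe."""
    n = sum(true_hist)
    size = len(true_hist)
    approx = [n / size] * size  # A_0: uniform, scaled to n records (assumption)
    produced = []
    for _ in range(T):
        # Exponential mechanism: weight each query by exp(eps * error / 2T).
        errors = [abs(evaluate(q, approx) - evaluate(q, true_hist))
                  for q in queries]
        top = max(errors)  # shifting by the maximum avoids overflow
        weights = [math.exp(epsilon * (err - top) / (2 * T)) for err in errors]
        chosen = rng.choices(queries, weights=weights, k=1)[0]

        # Noisy measurement: m_i = q_i(D) + Lap(2T / eps).
        measurement = evaluate(chosen, true_hist) + laplace_noise(2 * T / epsilon, rng)

        # Multiplicative weights update, then rescale to n records (assumption).
        step = (measurement - evaluate(chosen, approx)) / (2 * n)
        approx = [a_x * math.exp(q_x * step) for a_x, q_x in zip(approx, chosen)]
        total = sum(approx)
        approx = [a_x * n / total for a_x in approx]
        produced.append(approx)

    # Average the approximate databases produced at each iteration.
    return [sum(column) / T for column in zip(*produced)]


if __name__ == "__main__":
    rng = random.Random(1)
    true_hist = [90, 5, 3, 2]  # made-up counts over a 4-record universe
    queries = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
    print(mwem(true_hist, queries, T=10, epsilon=1.0, rng=rng))
```

The loop over the full universe in every update is what makes the method impractical in high dimensions: the histogram has one entry per possible record, so its size grows exponentially with the number of attributes.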
Shroudbase

The approach used in MWEM is the foundation for many of the advanced algorithms deployed by the Shroudbase platform, which produces and manages synthetic data through differentially private mechanisms. Shroudbase's patent-pending software deploys a repertoire of privacy-preserving algorithms to enable accurate data analytics on sensitive data, far beyond the capabilities of MWEM. These range from producing summary statistics to machine learning and optimization. Shroudbase:

- efficiently produces synthetic data on terabytes of high-dimensional databases
- efficiently produces synthetic data that preserves the accuracy of generalized linear models, such as regressions
- maintains these private databases in a centralized, easy-to-use platform
- answers millions of MySQL queries without requiring the user to specify them in advance

Shroudbase is a platform for producing and managing these differentially private synthetic databases.

Shroudbase Infrastructure

I. Privatization

Privatizing data with Shroudbase is a one-step process. The client simply enters the information required to access their database along with an endpoint to store the synthetic data. The platform currently privatizes any structured data, including MySQL, PostgreSQL, Microsoft SQL, sqlite3, Excel spreadsheets, and CSV files. The privatization procedure can be run through our cloud cluster or locally by installing the Shroudbase Database Management System on the client's machines. If the client uses a local implementation, then the entire procedure can be executed without Shroudbase ever reading or storing any sensitive information.

II. Storage

Privatized data is stored with the Shroudbase Cloud Database Service. While many online storage systems only protect data in transit, Shroudbase ensures that the only data that enters the cloud is synthetic data with no personally identifiable information. Practically speaking, this means that nobody (a hacker, a government agency, or an employee of Shroudbase) can ever access any personal information through Shroudbase, because it simply isn't there. Clients access this service through the Shroudbase administrative control panel or the Shroudbase Database Management System, an installable package for controlled data access and administration. Clients also have the option of storing the privatized data locally.

III. Querying

The Shroudbase Query Client provides an easy and intuitive way to use privatized databases. This client interface takes in SQL-formatted commands and outputs responses in a format similar to that of MySQL's client interface. It is run by calling sb from the command line with the hostname and port of the database the user is connecting to. Queries with Shroudbase are identical to MySQL queries, and Shroudbase supports most statistical functions found in MySQL.

IV. Updating

Shroudbase's patent-pending technology supports inserting additional data into the database while preserving privacy. When additional data is added, the Shroudbase system stores the data in an intermediary state until the Shroudbase server detects that an update needs to occur. When an update occurs, the privatization job is offloaded to Shroudbase's privatization infrastructure to be recomputed in the cloud.

Note: For clients who wish to run specialized analysis not currently supported by Shroudbase synthetic datasets, we provide custom implementations of adaptive differentially private mechanisms.