Differential privacy in health care analytics and medical research: An interactive tutorial
1 Differential privacy in health care analytics and medical research: An interactive tutorial. Speaker: Moritz Hardt, Theory Group, IBM Almaden. February 21, 2012.
2 Overview: 1. Releasing medical data: What could go wrong? 2. Differential privacy: a formal privacy guarantee. 3. A framework for differentially private data analysis. 4. Open problems and discussion.
3 Insert your favorite application: health care analytics, medical decision support, preventing epidemics, drug development, medical data (e.g., patient records, diagnostic data, DNA sequences, drug purchases, insurance data). Goal: Protect privacy of individuals while allowing useful analyses.
4 What could go wrong? Many examples, several about medical data. Here: The Case of Genome-Wide Association Studies (GWAS).
5 GWAS Setup: 1. NIH takes DNA of, say, 100 test candidates with a common phenotype (e.g., a certain disease). 2. NIH releases minor allele frequencies (MAF) of the test population at, say, 100,000 positions (SNPs). Goal: Find association between SNPs and phenotype.
6 Attack on GWAS data [Homer et al.] Can infer membership in test group of an individual with known DNA from published data! [Figure: three tables side by side: per-SNP MAF of the test population; minor allele indicators (YES/NO per SNP) for Moritz's DNA; per-SNP MAF of the reference population (HapMap data, public).]
7 Attack on GWAS data [Homer et al.] [Same figure, now annotated "probably": Moritz's DNA is probably part of the test population.]
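The comparison in the figure can be made concrete with a small simulation. The sketch below is not the code of Homer et al.; it uses a simplified distance statistic in the spirit of their attack (sum over SNPs of how much closer the individual's alleles are to the test frequencies than to the reference frequencies). All data are synthetic, and the names `membership_statistic`, `sample_people`, the frequency range, and the choice of 20,000 SNPs (smaller than the 100,000 on the slide, to keep the run light) are assumptions made for illustration.

```python
import numpy as np

def membership_statistic(y, reference_maf, test_maf):
    """Sum over SNPs of |y - reference| - |y - test|.

    Large positive values mean the individual's alleles sit closer to the
    published test-group frequencies than to the public reference.
    """
    return float(np.sum(np.abs(y - reference_maf) - np.abs(y - test_maf)))

rng = np.random.default_rng(0)
num_snps, test_size, ref_size = 20_000, 100, 1_000

# Hypothetical true minor allele frequencies of the underlying population.
p = rng.uniform(0.05, 0.5, size=num_snps)

def sample_people(count):
    """Genotypes as minor-allele fractions in {0, 0.5, 1} per SNP."""
    return rng.binomial(2, p, size=(count, num_snps)) / 2.0

test_group = sample_people(test_size)
test_maf = test_group.mean(axis=0)            # what gets published
# Reference frequencies estimated from ref_size unrelated people.
reference_maf = rng.binomial(2 * ref_size, p) / (2.0 * ref_size)

member = test_group[0]                        # someone in the test group
outsider = sample_people(1)[0]                # someone who is not

d_member = membership_statistic(member, reference_maf, test_maf)
d_outsider = membership_statistic(outsider, reference_maf, test_maf)
print(f"member: {d_member:.1f}, outsider: {d_outsider:.1f}")
```

Each SNP leaks only a tiny amount, but the contributions add up over many positions, which is why the member's statistic separates clearly from the outsider's.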
8 Interesting characteristics: Only innocuous-looking data was released. Data was HIPAA compliant. Data curator is trusted (NIH). Attack uses background knowledge (HapMap data set) available in public domain. Attack uses unanticipated algorithm. Need rigorous privacy guarantee!
9 Differential Privacy [Dwork-McSherry-Nissim-Smith-06] Rigorous privacy guarantee with important properties (e.g., handles attacks with background information, composes nicely, is robust). Hundreds of papers in theory, databases, statistics, systems, programming languages. Downloadable libraries (PINQ), implemented algorithms. Intuition: Presence or absence of a single individual in the data set cannot be inferred from the output of the algorithm.
10 Differential Privacy [Dwork-McSherry-Nissim-Smith-06] A database is a collection of rows (tuples) from a universe $U$. Databases $D, D'$ are neighboring if they differ in only one row (i.e., one individual). Example: $D$ = {GWAS test population}, $D' = D \cup$ {Moritz's DNA}. Definition. Randomized algorithm $M$ is $\varepsilon$-differentially private if for all neighboring $D, D'$ and every set $S$ of outcomes: $\Pr[M(D) \in S] \le (1 + \varepsilon)\,\Pr[M(D') \in S]$. Think of $\varepsilon = 0.01$.
11 Differential Privacy [Dwork-McSherry-Nissim-Smith-06] Same definition with $\exp(\varepsilon)$ in place of $1 + \varepsilon$: $\Pr[M(D) \in S] \le \exp(\varepsilon)\,\Pr[M(D') \in S]$. Think of $\varepsilon = 0.01$.
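12 Definition. Randomized algorithm $M$ is $\varepsilon$-differentially private if for all neighboring $D, D'$ and every set of outcomes $S$: $\Pr[M(D) \in S] \le \exp(\varepsilon)\,\Pr[M(D') \in S]$. [Figure: output densities of $M(D)$ and $M(D')$ over the space of outputs; density ratio bounded by $\exp(\varepsilon)$.]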
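A short worked calculation, added here as a reading aid and not taken from the slides, shows why a bound of $1 + \varepsilon$ and a bound of $\exp(\varepsilon)$ are nearly interchangeable at $\varepsilon = 0.01$, and that the guarantee holds in both directions because the neighboring relation is symmetric.

```latex
e^{\varepsilon} = 1 + \varepsilon + \tfrac{\varepsilon^{2}}{2} + \cdots
\quad\Longrightarrow\quad
e^{0.01} \approx 1.01005 \approx 1 + 0.01 .

\Pr[M(D) \in S] \le e^{\varepsilon}\,\Pr[M(D') \in S]
\quad\text{and}\quad
\Pr[M(D') \in S] \le e^{\varepsilon}\,\Pr[M(D) \in S] .
```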
13 What does it mean? Suppose database teaches that smoking causes cancer. Smoker S's insurance goes up. This is true even if S is not in the database! Learning that smoking causes cancer is the whole point. Differential privacy: No harm caused by participation in a data set; automatically resilient against GWAS-style attacks.
14 Is it too strong? Some tasks you can do while guaranteeing DP: Statistical queries on the data set (data cubes, contingency tables, range queries). Learning algorithms (SQ learning, SVM, PCA, logistic regression, decision trees, clustering, online learning). Monitor data streams (heavy hitters, norm estimation). In this talk: Focus on statistical queries.
15 Statistical Queries with Differential Privacy. [Diagram: an analyst sends a set $Q$ of statistical queries to a trusted curator holding data set $D$; the curator returns a synopsis $S: Q \to \mathbb{R}$ (e.g., list of answers, synthetic data set, data structure).] Requirement: Synopsis satisfies differential privacy. Maximize usefulness of synopsis. Here: accuracy of answers.
16 Statistical queries. Example: How many people in $D$ smoke and have cancer? Generally, specified by predicate $f: U \to \{0,1\}$. Answer $f(D) = \sum_{i \in D} f(i)$ ranges between $0$ and $n$. Fact: For every neighboring $D, D'$ we have $|f(D) - f(D')| \le 1$.
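As an illustration of the counting query and the sensitivity fact, here is a minimal Python sketch. The row layout and the names `counting_query` and `smokes_and_has_cancer` are hypothetical choices for this example, not part of the talk.

```python
from typing import Callable, Dict, List

Row = Dict[str, bool]

def counting_query(database: List[Row], predicate: Callable[[Row], bool]) -> int:
    """Return f(D): the number of rows in D that satisfy the predicate f."""
    return sum(1 for row in database if predicate(row))

def smokes_and_has_cancer(row: Row) -> bool:
    """The predicate from the slide's example."""
    return row["smokes"] and row["cancer"]

D = [
    {"smokes": True, "cancer": True},
    {"smokes": True, "cancer": False},
    {"smokes": False, "cancer": False},
]
# A neighboring database: D plus one more individual.
D_neighbor = D + [{"smokes": True, "cancer": True}]

answer = counting_query(D, smokes_and_has_cancer)
neighbor_answer = counting_query(D_neighbor, smokes_and_has_cancer)
print(answer, neighbor_answer)
```

Adding or removing one row changes the count by at most one, which is the property the noise in the next slides is calibrated to.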
17 Laplacian Mechanism [DMNS06] Given query $f$: 1. Compute true answer $f(D)$. 2. Output $f(D) + \mathrm{Lap}(1/\varepsilon)$.
18 Laplacian Mechanism [DMNS06] (continued) $\mathrm{Lap}(1/\varepsilon)$ has density proportional to $\exp(-\varepsilon |z|)$. [Figure: output densities of $M(D)$ and $M(D')$, centered at $f(D)$ and $f(D')$; density ratio bounded by $1 + \varepsilon$ (more precisely, by $\exp(\varepsilon)$).]
19 Laplacian Mechanism [DMNS06] (continued) Handle $k$ queries by adding $\mathrm{Lap}(k/\varepsilon)$ independently to each answer. Error scales as $O(k/\varepsilon)$ per query! Need $k \ll |D|$.
20 Laplacian Mechanism [DMNS06] (continued) Can we answer more queries? YES!
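The mechanism is short enough to sketch directly. The following is a minimal illustration using NumPy's Laplace sampler; the function name `laplace_mechanism` and its `num_queries` parameter are assumptions of this sketch, with `num_queries` playing the role of k from the slides.

```python
import numpy as np

def laplace_mechanism(true_answer, epsilon, rng, num_queries=1):
    """Return true_answer plus Laplace noise of scale num_queries / epsilon.

    With num_queries = 1 this is f(D) + Lap(1/epsilon); with k queries the
    scale grows to k/epsilon so that all k answers together stay within
    a total budget of epsilon.
    """
    scale = num_queries / epsilon
    return true_answer + rng.laplace(loc=0.0, scale=scale)

rng = np.random.default_rng(0)
true_count = 42          # e.g. number of smokers with cancer in D
epsilon = 0.1

noisy = np.array([laplace_mechanism(true_count, epsilon, rng)
                  for _ in range(10_000)])
print("mean of noisy answers:", noisy.mean())
print("mean absolute error:  ", np.abs(noisy - true_count).mean())
```

The mean absolute error of Laplace noise equals its scale, so a single query at epsilon = 0.1 is off by about 10 on average, and k queries by about 10k each.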
21 Multiplicative Weights Approach* [H-Rothblum 10, Gupta-H-Roth-Ullman 11, H-McSherry-Ligett 12] Nearly optimal accuracy. High accuracy on huge query sets: can have k poly(n). Simple and scalable implementation gives significant empirical improvements on real-world queries and data sets. Output is synthetic data set. * Previous work using different ideas: Blum-Ligett-Roth 08, Dwork-Naor-Reingold-Rothblum-Vadhan 09, Dwork-Rothblum-Vadhan 10, Roth-Roughgarden 10.
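22 Histogram View. Represent $D$ as normalized histogram vector $x \in \mathbb{R}^N$ where $N = |U|$ and $x_i$ = fraction of $i$'s in $D$. [Figure: histogram over coordinates $1, \dots, N$; neighboring histograms must map to similar output distributions!] Statistical query becomes vector $f \in \{0,1\}^N$. Answer $f(x) = \langle f, x \rangle$ (inner product).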
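A minimal sketch of the histogram view, assuming universe elements are encoded as integers 0, ..., N-1 (an encoding chosen for this example; the function names are likewise hypothetical):

```python
import numpy as np

def normalized_histogram(records, universe_size):
    """x_i = fraction of records equal to universe element i."""
    counts = np.bincount(records, minlength=universe_size)
    return counts / len(records)

def answer_query(f, x):
    """Answer f(x) = <f, x> for a 0/1 query vector f."""
    return float(np.dot(f, x))

records = np.array([0, 2, 2, 3, 3, 3, 1, 2])   # data set D, n = 8
x = normalized_histogram(records, universe_size=4)

f = np.array([0, 0, 1, 1])                     # "is the record 2 or 3?"
print(x, answer_query(f, x))
```

Because x is normalized, the answer is a fraction of the data set; multiplying by n recovers the count from the earlier slides.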
23 Algorithm: Input: Histogram $x$, query set $Q$. Idea: Maintain differentially private histogram $x_t$ with $x_0$ uniform, increase quality of $x_t$ iteratively. For $t = 1, \dots, T$: 1. Find bad query: Find query $f_t$ which maximizes $|f_t(x_{t-1}) - f_t(x)|$. 2. Improve histogram: Obtain $x_t$ from $x_{t-1}$ using multiplicative weights update defined by $f_t$. Output: $x_T$.
26 Finding maximal violations (exponential mechanism [MT 07]) [Figure: bar chart of the violations $|f(x) - f(x_{t-1})|$ for the queries $f_1, f_2, f_3, \dots, f_k$.]
27 Finding maximal violations (continued) Add $\mathrm{Lap}(1/\varepsilon)$ to each violation.
28 Finding maximal violations (continued) Pick maximal violation.
29 Finding maximal violations (continued) Lemma. Expected violation is MAX $- \log(k)/\varepsilon$.
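The selection step pictured on these slides (add Laplace noise to every violation, then take the largest) can be sketched as follows. This is only an illustration of that picture; the function name `pick_noisy_max` is hypothetical, and the violations are assumed to be on the count scale so that noise of scale 1/epsilon is meaningful.

```python
import numpy as np

def pick_noisy_max(violations, epsilon, rng):
    """Add independent Lap(1/epsilon) noise to each violation and
    return the index of the largest noisy value."""
    noise = rng.laplace(loc=0.0, scale=1.0 / epsilon, size=len(violations))
    return int(np.argmax(violations + noise))

rng = np.random.default_rng(0)
# Violations |f_j(x) - f_j(x_{t-1})| for k = 4 queries (made-up numbers).
violations = np.array([3.0, 40.0, 7.0, 5.0])

picks = [pick_noisy_max(violations, epsilon=1.0, rng=rng) for _ in range(1000)]
print("fraction of rounds picking query 1:", np.mean(np.array(picks) == 1))
```

When one violation stands out by much more than log(k)/epsilon, the noisy choice almost always agrees with the true maximum, which matches the lemma's message.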
31 Multiplicative Weights Update. [Figure: estimate $x_{t-1}$, input $x$ and query $f_t$ plotted over coordinates $1, \dots, N$, before the update.] Suppose $f_t(x_{t-1})$ is too small!
32 Multiplicative Weights Update. [Same figure, after the update.] $x_t(i) = x_{t-1}(i)\,(1 + \eta f_t(i))/Z$, with $\eta$ suitably chosen and $Z$ = normalization.
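A minimal sketch of a single update step, following the formula on the slide. The function name `mw_update` and the concrete numbers are assumptions for illustration; a negative `eta` would handle the opposite case where the estimate is too large.

```python
import numpy as np

def mw_update(x_prev, f, eta):
    """One multiplicative weights step: scale every coordinate selected
    by f by (1 + eta), then renormalize so the histogram sums to 1."""
    x_new = x_prev * (1.0 + eta * f)
    return x_new / x_new.sum()        # division by Z

x_true = np.array([0.1, 0.1, 0.4, 0.4])   # input histogram x
x_prev = np.full(4, 0.25)                  # estimate x_{t-1} (uniform)
f = np.array([0.0, 0.0, 1.0, 1.0])         # query f_t

# f(x_prev) = 0.5 is too small compared with f(x_true) = 0.8,
# so we use a positive eta to push mass onto the queried coordinates.
x_new = mw_update(x_prev, f, eta=0.5)
print(f @ x_prev, f @ x_new, f @ x_true)
```

In this toy example the estimate of the query moves from 0.5 to 0.6, one step closer to the true answer 0.8.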
34 Privacy/Utility analysis. Privacy: Algorithm is a composition of $T$ $\varepsilon$-differentially private steps. Hence, it is $\varepsilon T$-differentially private overall by the Composition theorem [DMNS06, DRV10]. Utility: Why can we choose $T$ small? Potential function $\mathrm{RE}(x \,\|\, x_t) \in [0, \log N]$. Lemma: Potential drops with every update by roughly $(f_t(x) - f_t(x_{t-1}))^2$. Conclusion: $O(\alpha^{-2} \log N)$ steps give accuracy $\alpha$.
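Putting the two steps together gives a loop whose privacy cost adds up round by round. The sketch below is a toy rendering, not the implementation from [H-McSherry-Ligett 12]. Several details are assumptions of this sketch: the direction of each update is taken from a noisy measurement of the chosen query, each round's budget is split evenly between selection and measurement, the step size `eta` is fixed, and the data and range queries are synthetic.

```python
import numpy as np

def private_multiplicative_weights(x_true, n, queries, rounds,
                                   eps_per_round, eta, rng):
    """Run T = rounds iterations of: noisy selection, noisy measurement,
    multiplicative weights update. Returns the final estimate x_T."""
    num_queries, universe_size = queries.shape
    x_est = np.full(universe_size, 1.0 / universe_size)   # x_0 uniform
    # Assumption: half of each round's budget per noisy step.
    noise_scale = 2.0 / eps_per_round
    true_counts = n * (queries @ x_true)
    for _ in range(rounds):
        est_counts = n * (queries @ x_est)
        # 1. Find bad query: largest noisy violation (count scale).
        noisy_violations = np.abs(est_counts - true_counts) + rng.laplace(
            0.0, noise_scale, size=num_queries)
        t = int(np.argmax(noisy_violations))
        f = queries[t]
        # Noisy measurement decides whether the estimate is too small.
        noisy_answer = true_counts[t] + rng.laplace(0.0, noise_scale)
        direction = 1.0 if noisy_answer > est_counts[t] else -1.0
        # 2. Improve histogram: multiplicative weights update.
        x_est = x_est * (1.0 + direction * eta * f)
        x_est = x_est / x_est.sum()
    return x_est

def max_error(x_true, x_est, queries):
    """Largest absolute error over all queries, as a fraction of n."""
    return float(np.max(np.abs(queries @ (x_true - x_est))))

rng = np.random.default_rng(0)
N = 16
# All one-dimensional range queries [lo, hi] over the universe.
queries = np.array([[1.0 if lo <= i <= hi else 0.0 for i in range(N)]
                    for lo in range(N) for hi in range(lo, N)])
x_true = np.full(N, 0.02)
x_true[3] = 0.7                      # synthetic data with one heavy bin

rounds, eps_per_round = 60, 0.05
total_epsilon = rounds * eps_per_round   # composition: T times epsilon
x_T = private_multiplicative_weights(x_true, n=10_000, queries=queries,
                                     rounds=rounds,
                                     eps_per_round=eps_per_round,
                                     eta=0.1, rng=rng)
initial_error = max_error(x_true, np.full(N, 1.0 / N), queries)
final_error = max_error(x_true, x_T, queries)
print(total_epsilon, initial_error, final_error)
```

The total budget is simply the number of rounds times the per-round epsilon, so keeping T small (as the potential argument allows) is what keeps the overall guarantee meaningful.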
35 Implementation Computational bottleneck: Enumerating over all N coordinates N could be exponential (high-dimensional data) Parallelizable, scalable implementation with heuristic tweaks [H-McSherry-Ligett 12] Handles huge data sets with >100 attributes (so N ) Works well for data cubes and range queries (important statistical query classes) on real-world data
36 One-dimensional range queries on blood transfusion data set. [Two plots comparing MW+EM with SVD, annotated "lower bound on previous approach [Li-Miklau11]"; y-axis = average squared error per query; x-axis of one plot = number of queries (randomly chosen queries, epsilon fixed at 0.1); x-axis of the other = epsilon value (2000 randomly chosen queries).]
37 Two-dimensional range queries on blood transfusion data set. [Two plots comparing MW+EM with SVD, annotated "lower bound on previous approach [Li-Miklau11]"; y-axis = average squared error per query; x-axis of one plot = number of queries (randomly chosen queries, epsilon fixed at 0.1); x-axis of the other = epsilon value (2000 randomly chosen queries).]
38 Scalability: MW successfully ignores irrelevant attributes! [Two plots with very similar shapes: "Adult Dataset, binary attributes" and "Adult Dataset, binary attributes + 50 random attributes"; each shows Elapsed Milliseconds and Maximum Error.]
39 Scalability: Most time spent evaluating queries, MW logic negligible. [Plot: "Synthetic Data, varying # of attributes", showing Milliseconds in MW and Milliseconds in Total.]
40 Summary: Differential privacy gives a formal privacy guarantee that you can trust. Powerful algorithms achieve differential privacy at surprising levels of utility. Would be exciting to try out differential privacy in an industry-scale application!
41 Guiding future research. Finding good models: Who is sharing data with whom? Where do we need strong privacy guarantees? Finding good assumptions: What assumptions are commonly true for data arising in practice? Finding good problems: What algorithms/tools should we focus on?
42 Thank you
43 Differential Privacy and HIPAA. HIPAA does not imply sufficient privacy protection. Stronger guarantees are needed (increasingly clear to policy makers, see IOM report "Beyond the HIPAA Privacy Rule"). Stronger guarantees mean less utility in some settings. Does differential privacy comply with current HIPAA regulations? Yes and no. Statistician Provision gives some room for differential privacy. Safe Harbor Provision preferred in practice for legal reasons.
Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott
More informationA Sublinear Bipartiteness Tester for Bounded Degree Graphs
A Sublinear Bipartiteness Tester for Bounded Degree Graphs Oded Goldreich Dana Ron February 5, 1998 Abstract We present a sublinear-time algorithm for testing whether a bounded degree graph is bipartite
More information0.1 Phase Estimation Technique
Phase Estimation In this lecture we will describe Kitaev s phase estimation algorithm, and use it to obtain an alternate derivation of a quantum factoring algorithm We will also use this technique to design
More informationBargaining Solutions in a Social Network
Bargaining Solutions in a Social Network Tanmoy Chakraborty and Michael Kearns Department of Computer and Information Science University of Pennsylvania Abstract. We study the concept of bargaining solutions,
More informationCI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.
CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
More informationWeek 1: Introduction to Online Learning
Week 1: Introduction to Online Learning 1 Introduction This is written based on Prediction, Learning, and Games (ISBN: 2184189 / -21-8418-9 Cesa-Bianchi, Nicolo; Lugosi, Gabor 1.1 A Gentle Start Consider
More informationDecompose Error Rate into components, some of which can be measured on unlabeled data
Bias-Variance Theory Decompose Error Rate into components, some of which can be measured on unlabeled data Bias-Variance Decomposition for Regression Bias-Variance Decomposition for Classification Bias-Variance
More informationSIMPLIFIED PERFORMANCE MODEL FOR HYBRID WIND DIESEL SYSTEMS. J. F. MANWELL, J. G. McGOWAN and U. ABDULWAHID
SIMPLIFIED PERFORMANCE MODEL FOR HYBRID WIND DIESEL SYSTEMS J. F. MANWELL, J. G. McGOWAN and U. ABDULWAHID Renewable Energy Laboratory Department of Mechanical and Industrial Engineering University of
More informationHow I won the Chess Ratings: Elo vs the rest of the world Competition
How I won the Chess Ratings: Elo vs the rest of the world Competition Yannis Sismanis November 2010 Abstract This article discusses in detail the rating system that won the kaggle competition Chess Ratings:
More informationCS346: Advanced Databases
CS346: Advanced Databases Alexandra I. Cristea A.I.Cristea@warwick.ac.uk Data Security and Privacy Outline Chapter: Database Security in Elmasri and Navathe (chapter 24, 6 th Edition) Brief overview of
More informationData Warehousing und Data Mining
Data Warehousing und Data Mining Multidimensionale Indexstrukturen Ulf Leser Wissensmanagement in der Bioinformatik Content of this Lecture Multidimensional Indexing Grid-Files Kd-trees Ulf Leser: Data
More informationAccurately and Efficiently Measuring Individual Account Credit Risk On Existing Portfolios
Accurately and Efficiently Measuring Individual Account Credit Risk On Existing Portfolios By: Michael Banasiak & By: Daniel Tantum, Ph.D. What Are Statistical Based Behavior Scoring Models And How Are
More information8.1 Min Degree Spanning Tree
CS880: Approximations Algorithms Scribe: Siddharth Barman Lecturer: Shuchi Chawla Topic: Min Degree Spanning Tree Date: 02/15/07 In this lecture we give a local search based algorithm for the Min Degree
More informationINTRUSION PREVENTION AND EXPERT SYSTEMS
INTRUSION PREVENTION AND EXPERT SYSTEMS By Avi Chesla avic@v-secure.com Introduction Over the past few years, the market has developed new expectations from the security industry, especially from the intrusion
More informationData Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing. C. Olivia Rud, VP, Fleet Bank
Data Mining: An Overview of Methods and Technologies for Increasing Profits in Direct Marketing C. Olivia Rud, VP, Fleet Bank ABSTRACT Data Mining is a new term for the common practice of searching through
More informationHow To Bet On An Nfl Football Game With A Machine Learning Program
Beating the NFL Football Point Spread Kevin Gimpel kgimpel@cs.cmu.edu 1 Introduction Sports betting features a unique market structure that, while rather different from financial markets, still boasts
More informationChoice under Uncertainty
Choice under Uncertainty Part 1: Expected Utility Function, Attitudes towards Risk, Demand for Insurance Slide 1 Choice under Uncertainty We ll analyze the underlying assumptions of expected utility theory
More informationChapter 11 Monte Carlo Simulation
Chapter 11 Monte Carlo Simulation 11.1 Introduction The basic idea of simulation is to build an experimental device, or simulator, that will act like (simulate) the system of interest in certain important
More information