Differential privacy in health care analytics and medical research An interactive tutorial


1 Differential privacy in health care analytics and medical research An interactive tutorial Speaker: Moritz Hardt Theory Group, IBM Almaden February 21, 2012
2 Overview 1. Releasing medical data: What could go wrong? 2. Differential privacy: a formal privacy guarantee 3. A framework for differentially private data analysis 4. Open problems and discussion
3 Insert your favorite application: Health care analytics, medical decision support, preventing epidemics, drug development, medical data (e.g., patient records, diagnostic data, DNA sequences, drug purchases, insurance data) Goal: Protect privacy of individuals while allowing useful analyses
4 What could go wrong? Many examples, several about medical data. Here: The Case of Genome-Wide Association Studies (GWAS)
5 GWAS Setup: 1. NIH takes DNA of, say, 100 test candidates with common phenotype (e.g., certain disease) 2. NIH releases minor allele frequencies of test population at, say, 100,000 positions (SNPs) Goal: Find association between SNPs and phenotype
6 Attack on GWAS data [Homer et al.] Can infer membership in test group of an individual with known DNA from published data! [Figure: three tables side by side: SNP minor allele frequencies (MAF) of the test population; Moritz's DNA, listing per SNP whether the minor allele (MA) is present (YES/NO); and SNP MAFs of the reference population (HapMap data, public).]
7 Attack on GWAS data [Homer et al.] Can infer membership in test group of an individual with known DNA from published data! [Figure: the same three tables, now with the annotation "probably" next to the test population.]
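The idea behind the attack can be illustrated on simulated data. The sketch below is not the code of Homer et al.; it assumes a simplified distance statistic in their style (the sum over SNPs of |y - ref| - |y - test|), a scaled-down 20,000 SNPs instead of 100,000, and made-up allele frequencies. All function and variable names are hypothetical.

```python
import numpy as np

def membership_statistic(y, test_maf, ref_maf):
    """Sum over SNPs of |y - ref| - |y - test|; large values suggest membership."""
    return float(np.sum(np.abs(y - ref_maf) - np.abs(y - test_maf)))

rng = np.random.default_rng(0)
num_snps, group_size = 20_000, 100

# Assumed population minor allele frequencies (simulated, not real data).
population_maf = rng.uniform(0.1, 0.5, size=num_snps)

def sample_people(count):
    # Each person carries 0, 1 or 2 minor alleles per SNP; store the fraction.
    return rng.binomial(2, population_maf, size=(count, num_snps)) / 2.0

test_group = sample_people(group_size)        # stands in for the GWAS test population
reference_group = sample_people(group_size)   # stands in for the public reference data
outsider = sample_people(1)[0]                # a person in neither group
member = test_group[0]                        # a person inside the test group

test_maf = test_group.mean(axis=0)            # what the study publishes
ref_maf = reference_group.mean(axis=0)

t_member = membership_statistic(member, test_maf, ref_maf)
t_outsider = membership_statistic(outsider, test_maf, ref_maf)
print(f"member: {t_member:.1f}, outsider: {t_outsider:.1f}")
```

Each SNP leaks only a tiny bias toward the member's own genotype, but summed over many SNPs the bias separates members from outsiders, which is why releasing only aggregate frequencies is not enough.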
8 Interesting characteristics: Only innocuous-looking data was released. Data was HIPAA compliant. Data curator is trusted (NIH). Attack uses background knowledge (HapMap data set) available in public domain. Attack uses unanticipated algorithm. Need rigorous privacy guarantee!
9 Differential Privacy [Dwork-McSherry-Nissim-Smith 06] Rigorous privacy guarantee with important properties (e.g., handles attacks with background information, composes nicely, is robust). Hundreds of papers in theory, databases, statistics, systems, programming languages. Downloadable libraries (PINQ), implemented algorithms. Intuition: Presence or absence of a single individual in the data set cannot be inferred from the output of the algorithm.
10 Differential Privacy [Dwork-McSherry-Nissim-Smith 06] A database is a collection of rows (tuples) from a universe $U$. Databases $D, D'$ are neighboring if they differ in only one row (i.e., one individual). Example: $D$ = {GWAS test population}, $D'$ = $D$ with Moritz's DNA added or removed. Definition. Randomized algorithm $M$ is $\varepsilon$-differentially private if for all neighboring $D, D'$ and every set $S$ of outcomes: $\Pr[M(D) \in S] \le (1 + \varepsilon)\,\Pr[M(D') \in S]$. Think of $\varepsilon = 0.01$.
11 Differential Privacy [Dwork-McSherry-Nissim-Smith 06] Same setting, with the bound in exponential form. Definition. Randomized algorithm $M$ is $\varepsilon$-differentially private if for all neighboring $D, D'$ and every set $S$ of outcomes: $\Pr[M(D) \in S] \le \exp(\varepsilon)\,\Pr[M(D') \in S]$. Think of $\varepsilon = 0.01$, for which $\exp(\varepsilon) \approx 1 + \varepsilon$.
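12 Definition. Randomized algorithm $M$ is $\varepsilon$-differentially private if for all neighboring $D, D'$ and every set of outcomes $S$: $\Pr[M(D) \in S] \le \exp(\varepsilon)\,\Pr[M(D') \in S]$. [Figure: densities of $M(D)$ and $M(D')$ over the outputs; the density ratio is bounded by $\exp(\varepsilon)$.]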
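To make the definition concrete, the sketch below checks it exhaustively for a tiny mechanism. The mechanism is randomized response on a single bit, which is not part of the talk and is used here only because its output distribution can be written down exactly; the "databases" are the two possible values of one row. Function names are hypothetical.

```python
import math
from itertools import chain, combinations

def output_distribution(bit, eps):
    """Randomized response: report the true bit with probability e^eps / (1 + e^eps)."""
    p_truth = math.exp(eps) / (1.0 + math.exp(eps))
    return {bit: p_truth, 1 - bit: 1.0 - p_truth}

def worst_case_ratio(eps):
    """Largest Pr[M(D) in S] / Pr[M(D') in S] over all non-empty outcome sets S."""
    outcomes = (0, 1)
    # The empty set is skipped because both probabilities are 0 there.
    subsets = chain.from_iterable(combinations(outcomes, r) for r in (1, 2))
    dist0 = output_distribution(0, eps)
    dist1 = output_distribution(1, eps)
    worst = 0.0
    for subset in subsets:
        pr0 = sum(dist0[o] for o in subset)
        pr1 = sum(dist1[o] for o in subset)
        worst = max(worst, pr0 / pr1, pr1 / pr0)
    return worst

eps = 0.01
ratio = worst_case_ratio(eps)
print(ratio, math.exp(eps))
```

For this mechanism the worst-case ratio meets the bound $\exp(\varepsilon)$ with equality, so the guarantee is tight.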
13 What does it mean? Suppose database teaches that smoking causes cancer. Smoker S's insurance premium goes up. This is true even if S is not in the database! Learning that smoking causes cancer is the whole point. Differential privacy: No harm caused by participation in a data set; automatically resilient against GWAS-style attacks.
14 Is it too strong? Some tasks you can do while guaranteeing DP: Statistical queries on the data set (data cubes, contingency tables, range queries). Learning algorithms (SQ learning, SVM, PCA, logistic regression, decision trees, clustering, online learning). Monitor data streams (heavy hitters, norm estimation). In this talk: Focus on statistical queries.
15 Statistical Queries with Differential Privacy. [Diagram: a trusted curator holds the data set $D$, receives a set $Q$ of statistical queries, and releases a synopsis $S: Q \to \mathbb{R}$ to the analyst (e.g., list of answers, synthetic data set, data structure).] Requirement: Synopsis satisfies differential privacy. Maximize usefulness of synopsis. Here: accuracy of answers.
16 Statistical queries. Example: How many people in $D$ smoke and have cancer? Generally, specified by predicate $f: U \to \{0,1\}$. Answer $f(D) = \sum_{i \in D} f(i)$ ranges between 0 and $n$. Fact: For every neighboring $D, D'$ we have $|f(D) - f(D')| \le 1$.
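A minimal sketch of a counting query and the fact above, using a made-up three-row database; the field names `smokes` and `cancer` and the function names are illustrative assumptions, not part of the talk.

```python
def count_query(database, predicate):
    """f(D): the number of rows i in D with f(i) = 1."""
    return sum(1 for row in database if predicate(row))

def smokes_and_has_cancer(row):
    return row["smokes"] and row["cancer"]

D = [
    {"smokes": True, "cancer": True},
    {"smokes": True, "cancer": False},
    {"smokes": False, "cancer": False},
]
# A neighboring database: one additional individual.
D_prime = D + [{"smokes": True, "cancer": True}]

answer = count_query(D, smokes_and_has_cancer)
answer_prime = count_query(D_prime, smokes_and_has_cancer)
print(answer, answer_prime, abs(answer - answer_prime))
```

Because one individual contributes a single 0 or 1 to the sum, the two answers can differ by at most 1, whatever the predicate is.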
17 Laplacian Mechanism [DMNS06] Given query $f$: 1. Compute true answer $f(D)$. 2. Output $f(D) + \mathrm{Lap}(1/\varepsilon)$.
18 Laplacian Mechanism [DMNS06] Given query $f$: 1. Compute true answer $f(D)$. 2. Output $f(D) + \mathrm{Lap}(1/\varepsilon)$, where the noise density at $z$ is proportional to $\exp(-\varepsilon |z|)$. [Figure: output densities of $M(D)$ and $M(D')$, centered at $f(D)$ and $f(D')$; the density ratio is bounded by $\exp(\varepsilon) \approx 1 + \varepsilon$.]
19 Laplacian Mechanism [DMNS06] Given query $f$: 1. Compute true answer $f(D)$. 2. Output $f(D) + \mathrm{Lap}(1/\varepsilon)$. Handle $k$ queries by adding $\mathrm{Lap}(k/\varepsilon)$ to each answer independently. Error scales as $O(k/\varepsilon)$ per query! Need $k \ll |D|$.
20 Laplacian Mechanism [DMNS06] Given query $f$: 1. Compute true answer $f(D)$. 2. Output $f(D) + \mathrm{Lap}(1/\varepsilon)$. Handle $k$ queries by adding $\mathrm{Lap}(k/\varepsilon)$ to each answer independently. Error scales as $O(k/\varepsilon)$ per query! Need $k \ll |D|$. Can we answer more queries? YES!
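The mechanism for $k$ counting queries can be sketched as follows. This is an illustration under the assumption that every query has sensitivity 1, so that independent noise of scale $k/\varepsilon$ per answer is appropriate; the function name and the example answers are hypothetical.

```python
import numpy as np

def laplace_mechanism(true_answers, eps, rng):
    """Add independent Lap(k / eps) noise to each of k sensitivity-1 answers."""
    true_answers = np.asarray(true_answers, dtype=float)
    k = true_answers.size
    return true_answers + rng.laplace(loc=0.0, scale=k / eps, size=k)

rng = np.random.default_rng(0)
answers = np.array([120.0, 45.0, 300.0])   # made-up true counts
eps = 0.5

noisy = laplace_mechanism(answers, eps, rng)
print(noisy)

# The expected absolute error per query equals the noise scale k / eps,
# so it grows linearly with the number of queries.
errors = [np.abs(laplace_mechanism(answers, eps, rng) - answers).mean()
          for _ in range(5000)]
print(np.mean(errors), answers.size / eps)
```

With three queries and $\varepsilon = 0.5$ the noise scale is already 6 per answer, which shows why this approach needs the number of queries to stay far below the data set size.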
21 Multiplicative Weights Approach* [H-Rothblum 10, Gupta-H-Roth-Ullman 11, H-McSherry-Ligett 12] Nearly optimal accuracy. High accuracy on huge query sets: can have k poly(n). Simple and scalable implementation; gives significant empirical improvements on real-world queries and data sets. Output is synthetic data set. * Previous work using different ideas: Blum-Ligett-Roth 08, Dwork-Naor-Reingold-Rothblum-Vadhan 09, Dwork-Rothblum-Vadhan 10, Roth-Roughgarden 10
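22 Histogram View. Represent $D$ as normalized histogram vector $x \in \mathbb{R}^N$ where $N = |U|$ and $x_i$ = fraction of $i$'s in $D$. [Figure: histogram over coordinates $1, \dots, N$, annotated "must map to similar output distributions!"] Statistical query becomes vector $f \in \{0,1\}^N$. Answer $f(x) = \langle f, x \rangle$ (inner product).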
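A small sketch of the histogram view, assuming the universe is encoded as the integers 0 to N-1; the records and the query below are made up for illustration.

```python
import numpy as np

def normalized_histogram(records, universe_size):
    """x_i = fraction of records equal to i, for i in 0..universe_size-1."""
    counts = np.bincount(records, minlength=universe_size)
    return counts / len(records)

records = np.array([0, 2, 2, 3])          # a data set D with n = 4 rows
x = normalized_histogram(records, universe_size=4)

# The query "is the row's value at least 2?" as a 0/1 vector over the universe.
f = np.array([0.0, 0.0, 1.0, 1.0])
answer = float(f @ x)                      # inner product <f, x>
print(x, answer)                           # [0.25 0.   0.5  0.25] 0.75
```

In this normalized form the answer is a fraction of the data set rather than a count; multiplying by n recovers the counting query.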
23 Algorithm: Input: Histogram $x$, query set $Q$. Idea: Maintain differentially private histogram $x_t$ with $x_0$ uniform, increase quality of $x_t$ iteratively. For $t = 1, \dots, T$: 1. Find bad query: Find query $f_t$ which maximizes $|f_t(x_{t-1}) - f_t(x)|$. 2. Improve histogram: Obtain $x_t$ from $x_{t-1}$ using multiplicative weights update defined by $f_t$. Output: $x_T$.
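The two steps of the loop can be sketched end to end on simulated data. This is a hedged illustration, not the implementation from the cited papers: it assumes Laplace noise of scale 1/eps on the violations measured in records, reuses the noisy value of the selected query for the sign and size of the update, and picks the step size as half the noisy normalized violation (capped at 1/2). No formal privacy accounting is attempted, and all names and parameter values are hypothetical.

```python
import numpy as np

def private_mw(x, queries, n, eps, rounds, rng):
    """Iteratively improve a uniform histogram using noisy selection of bad queries."""
    est = np.full(x.size, 1.0 / x.size)              # x_0 is uniform
    true_answers = queries @ x
    for _ in range(rounds):
        # Step 1: find a bad query. Violations are measured in records and
        # perturbed with Lap(1/eps) noise before taking the maximum.
        violations = n * (true_answers - queries @ est)
        noisy = violations + rng.laplace(scale=1.0 / eps, size=violations.size)
        j = int(np.argmax(np.abs(noisy)))
        # Step 2: multiplicative weights update defined by query j.
        # Assumption: the step size is half the noisy normalized violation.
        eta = float(np.clip(noisy[j] / (2.0 * n), -0.5, 0.5))
        est = est * (1.0 + eta * queries[j])
        est = est / est.sum()
    return est

rng = np.random.default_rng(0)
N, n, k = 64, 5000, 200

# Simulated one-dimensional data concentrated around the value 20.
records = np.clip(np.rint(rng.normal(20, 4, size=n)), 0, N - 1).astype(int)
x = np.bincount(records, minlength=N) / n

# k random range queries [lo, hi], each encoded as a 0/1 vector of length N.
ends = np.sort(rng.integers(0, N, size=(k, 2)), axis=1)
positions = np.arange(N)
queries = ((positions >= ends[:, :1]) & (positions <= ends[:, 1:])).astype(float)

uniform = np.full(N, 1.0 / N)
estimate = private_mw(x, queries, n, eps=0.05, rounds=200, rng=rng)

error_before = float(np.max(np.abs(queries @ (x - uniform))))
error_after = float(np.max(np.abs(queries @ (x - estimate))))
print(error_before, error_after)
```

The returned estimate is itself a histogram over the universe, so it can be used like a synthetic data set: any query in the set is answered by an inner product with it.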
26 Finding maximal violations (exponential mechanism [MT 07]). [Figure: bar chart of the violations $|f_t(x) - f_t(x_{t-1})|$ for the queries $f_1, f_2, f_3, \dots, f_k$.]
27 Finding maximal violations (exponential mechanism [MT 07]). Add $\mathrm{Lap}(1/\varepsilon)$ to each violation. [Figure: the same bar chart with noise added.]
28 Finding maximal violations (exponential mechanism [MT 07]). Add $\mathrm{Lap}(1/\varepsilon)$ to each violation. Pick maximal violation.
29 Finding maximal violations (exponential mechanism [MT 07]). Add $\mathrm{Lap}(1/\varepsilon)$ to each violation. Pick maximal violation. Lemma. Expected violation is at least $\mathrm{MAX} - O(\log(k)/\varepsilon)$.
31 Multiplicative Weights Update. [Figure: estimate $x_{t-1}$ and input $x$ plotted over coordinates $1, \dots, N$, with query $f_t$ marked; before update.] Suppose $f_t(x_{t-1})$ too small!
32 Multiplicative Weights Update. [Figure: the same plot after update.] $x_t(i) = x_{t-1}(i)\,(1 + \eta f_t(i))/Z$, with $\eta$ suitably chosen and $Z$ = normalization.
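A single update step can be traced by hand on a four-element universe. The sketch assumes a fixed step size of 0.5 purely for illustration; the values and names are made up.

```python
import numpy as np

def mw_update(estimate, query, eta):
    """x_t(i) = x_{t-1}(i) * (1 + eta * f_t(i)) / Z, with Z the normalization."""
    unnormalized = estimate * (1.0 + eta * query)
    return unnormalized / unnormalized.sum()

previous = np.full(4, 0.25)                 # x_{t-1}: uniform over 4 elements
f_t = np.array([0.0, 0.0, 1.0, 1.0])        # query on which the estimate is too small
updated = mw_update(previous, f_t, eta=0.5)

print(updated)                              # approximately [0.2 0.2 0.3 0.3]
print(float(f_t @ previous), float(f_t @ updated))
```

Weight moves onto the coordinates selected by the query, so the estimated answer rises from 0.5 to 0.6 while the estimate remains a probability distribution.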
34 Privacy/Utility analysis. Privacy: Algorithm is composition of $T$ $\varepsilon$-differentially private steps. Hence, $\varepsilon T$-differentially private overall by Composition theorem [DMNS06, DRV10]. Utility: Why can we choose $T$ small? Potential function $\mathrm{RE}(x_t \,\|\, x) \in [0, \log N]$. Lemma: Potential drops with every update by roughly $(f_t(x) - f_t(x_{t-1}))^2$. Conclusion: $O(\alpha^{-2} \log N)$ steps give accuracy $\alpha$.
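The step from the lemma to the conclusion is a telescoping argument. The sketch below writes $\Phi_t$ for the value of the relative-entropy potential after round $t$ (a shorthand introduced here, not notation from the slides) and ignores constants and the noise added in the selection step.

```latex
\begin{align*}
  &\Phi_t \in [0, \log N] \quad \text{for all } t, \qquad
  \Phi_{t-1} - \Phi_t \;\gtrsim\; \bigl(f_t(x) - f_t(x_{t-1})\bigr)^2 . \\
  &\text{If } \bigl|f_t(x) - f_t(x_{t-1})\bigr| > \alpha
    \text{ in every round } t = 1, \dots, T, \text{ then} \\
  &\log N \;\ge\; \Phi_0 - \Phi_T
    \;=\; \sum_{t=1}^{T} \bigl(\Phi_{t-1} - \Phi_t\bigr)
    \;\gtrsim\; T \alpha^2
  \quad\Longrightarrow\quad
  T \;\lesssim\; \frac{\log N}{\alpha^2} .
\end{align*}
```

So after roughly $\alpha^{-2} \log N$ rounds the worst query can no longer be off by more than $\alpha$, and by composition the privacy cost of those rounds is $\varepsilon T$.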
35 Implementation. Computational bottleneck: Enumerating over all $N$ coordinates; $N$ could be exponential (high-dimensional data). Parallelizable, scalable implementation with heuristic tweaks [H-McSherry-Ligett 12]. Handles huge data sets with >100 attributes (so $N$ is exponentially large). Works well for data cubes and range queries (important statistical query classes) on real-world data.
36 One-dimensional range queries on blood transfusion data set. [Figure: two plots with series MW+EM and SVD, annotated "lower bound on previous approach [Li-Miklau 11]". y-axis: average squared error per query. x-axis of one plot: number of queries (randomly chosen queries, epsilon fixed at 0.1). x-axis of the other plot: epsilon value (2000 randomly chosen queries).]
37 Two-dimensional range queries on blood transfusion data set. [Figure: two plots with series MW+EM and SVD, annotated "lower bound on previous approach [Li-Miklau 11]". y-axis: average squared error per query. x-axis of one plot: number of queries (randomly chosen queries, epsilon fixed at 0.1). x-axis of the other plot: epsilon value (2000 randomly chosen queries).]
38 Scalability: MW successfully ignores irrelevant attributes! [Figure: two panels showing elapsed milliseconds and maximum error, one for the Adult data set with binary attributes and one for the Adult data set with binary attributes + 50 random attributes. Very similar shapes.]
39 Scalability: Most time spent evaluating queries, MW logic negligible. [Figure: two panels on synthetic data with varying number of attributes, each showing milliseconds in MW and milliseconds in total.]
40 Summary: Differential privacy gives a formal privacy guarantee that you can trust. Powerful algorithms achieve differential privacy at surprising levels of utility. Would be exciting to try out differential privacy in an industry-scale application!
41 Guiding future research Finding good models Who is sharing data with whom? Where do we need strong privacy guarantees? Finding good assumptions What assumptions are commonly true for data arising in practice? Finding good problems What algorithms/tools should we focus on?
42 Thank you
43 Differential Privacy and HIPAA. HIPAA does not imply sufficient privacy protection. Stronger guarantees are needed (increasingly clear to policy makers, see IOM report "Beyond the HIPAA Privacy Rule"). Stronger guarantees mean less utility in some settings. Does differential privacy comply with current HIPAA regulations? Yes and no. Statistician Provision gives some room for differential privacy. Safe Harbor Provision preferred in practice for legal reasons.
More informationAccurately and Efficiently Measuring Individual Account Credit Risk On Existing Portfolios
Accurately and Efficiently Measuring Individual Account Credit Risk On Existing Portfolios By: Michael Banasiak & By: Daniel Tantum, Ph.D. What Are Statistical Based Behavior Scoring Models And How Are
More information4.4 The recursiontree method
4.4 The recursiontree method Let us see how a recursion tree would provide a good guess for the recurrence = 3 4 ) Start by nding an upper bound Floors and ceilings usually do not matter when solving
More informationSAS Certificate Applied Statistics and SAS Programming
SAS Certificate Applied Statistics and SAS Programming SAS Certificate Applied Statistics and Advanced SAS Programming Brigham Young University Department of Statistics offers an Applied Statistics and
More informationGlobally, about 9.7% of cancers in men are prostate cancers, and the risk of developing the
Chapter 5 Analysis of Prostate Cancer Association Study Data 5.1 Risk factors for Prostate Cancer Globally, about 9.7% of cancers in men are prostate cancers, and the risk of developing the disease has
More informationMODELING RANDOMNESS IN NETWORK TRAFFIC
MODELING RANDOMNESS IN NETWORK TRAFFIC  LAVANYA JOSE, INDEPENDENT WORK FALL 11 ADVISED BY PROF. MOSES CHARIKAR ABSTRACT. Sketches are randomized data structures that allow one to record properties of
More informationAn Introduction to Machine Learning
An Introduction to Machine Learning L5: Novelty Detection and Regression Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au Tata Institute, Pune,
More informationTesting Big data is one of the biggest
Infosys Labs Briefings VOL 11 NO 1 2013 Big Data: Testing Approach to Overcome Quality Challenges By Mahesh Gudipati, Shanthi Rao, Naju D. Mohan and Naveen Kumar Gajja Validate data quality by employing
More informationPerformance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification
Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati tnpatil2@gmail.com, ss_sherekar@rediffmail.com
More informationCI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.
CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes
More informationBiomedical Big Data and Precision Medicine
Biomedical Big Data and Precision Medicine Jie Yang Department of Mathematics, Statistics, and Computer Science University of Illinois at Chicago October 8, 2015 1 Explosion of Biomedical Data 2 Types
More informationKnowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
More informationOnline Classification on a Budget
Online Classification on a Budget Koby Crammer Computer Sci. & Eng. Hebrew University Jerusalem 91904, Israel kobics@cs.huji.ac.il Jaz Kandola Royal Holloway, University of London Egham, UK jaz@cs.rhul.ac.uk
More informationDecompose Error Rate into components, some of which can be measured on unlabeled data
BiasVariance Theory Decompose Error Rate into components, some of which can be measured on unlabeled data BiasVariance Decomposition for Regression BiasVariance Decomposition for Classification BiasVariance
More informationWeek 1: Introduction to Online Learning
Week 1: Introduction to Online Learning 1 Introduction This is written based on Prediction, Learning, and Games (ISBN: 2184189 / 2184189 CesaBianchi, Nicolo; Lugosi, Gabor 1.1 A Gentle Start Consider
More informationThe CRM for ordinal and multivariate outcomes. Elizabeth GarrettMayer, PhD Emily Van Meter
The CRM for ordinal and multivariate outcomes Elizabeth GarrettMayer, PhD Emily Van Meter Hollings Cancer Center Medical University of South Carolina Outline Part 1: Ordinal toxicity model Part 2: Efficacy
More informationHow I won the Chess Ratings: Elo vs the rest of the world Competition
How I won the Chess Ratings: Elo vs the rest of the world Competition Yannis Sismanis November 2010 Abstract This article discusses in detail the rating system that won the kaggle competition Chess Ratings:
More informationPrincipled Reasoning and Practical Applications of Alert Fusion in Intrusion Detection Systems
Principled Reasoning and Practical Applications of Alert Fusion in Intrusion Detection Systems Guofei Gu College of Computing Georgia Institute of Technology Atlanta, GA 3332, USA guofei@cc.gatech.edu
More informationData Mining 5. Cluster Analysis
Data Mining 5. Cluster Analysis 5.2 Fall 2009 Instructor: Dr. Masoud Yaghini Outline Data Structures IntervalValued (Numeric) Variables Binary Variables Categorical Variables Ordinal Variables Variables
More information