ANONYMIZATION METHODS AS TOOLS FOR FAIRNESS
|
|
|
- Felicity Hunt
- 9 years ago
- Views:
Transcription
1 ANONYMIZATION METHODS AS TOOLS FOR FAIRNESS Salvatore Ruggieri KDD Lab Department of Computer Science University of Pisa, Italy
2 2 Message of the talk
3 Motivation: risks in data publishing 3 (and in learning from data) Privacy risks re-identification or attribute inference Discrimination risks? discriminatory decisions An employer may notice from public census data that the race or sex of workers act as proxy of the workers productivity. The employer may then use those visible traits for hiring decisions. A machine learning model to profile applicants to a bank loan may learn from past application records some patterns of traditional prejudices Solutions? The model may predict not to loan to a minority group. dataset sanitization for discrimination prevention
4 Reductions of problems 4 Problem X reduces to problem Y if an algorithm that solves Y can be used to solve X sol X (I) = f( sol Y ( g(i) ) ) g(.) f(.) Widely used concept in Computability Computational complexity Programming Assume that g() and f() are «simple» X is «easier or equal than» than Y If Y reduces to X also then X and Y are «equivalent»
5 Reductions of problems 5 Problem X reduces to problem Y if an algorithm that solves Y can be used to solve X sol X (I) = f( sol Y ( g(i) ) ) g(.) f(.) X = sanitize a dataset for discrimination prevention wrt -protection Y = sanitize a dataset for privacy protection wrt t-closeness Message of the talk: X and Y are «equivalent» (in a weak sense) Main reference: S. Ruggieri. Using t-closeness anonymity to control for nondiscrimination. Transactions on Data Privacy 7 (4) : , 2014.
6 6 Discrimination measures
7 Discrimination measures 7 What is the degree of discrimination suffered? Legal principle of proportional representation city=nyc benefit denied benefit granted total women men total p 1 = proportion of benefit denied to women = 6/10 = 60% p 2 = proportion of benefit denied to men = 1/5 = 20% Risk difference (RD) is p 1 - p 2 = 40% Risk ratio (RR) is p 1 / p 2 = 3 Relative chance (RC) is (1-p 1 )/(1-p 2 ) = 0.5 Odds ratio (OR) is RR / RC = 6
8 Discrimination measures 8 What is the degree of discrimination suffered? Legal principle of proportional representation Example: jury selection city=nyc benefit denied p 0 = proportion of women in the overall population = 10/15 = 67% p = proportion of women in the «benefit granted» population = 4/8 = 50% Castaneda rule in the U.S. (1977): p 0 m 2 -b 3 Binomial distribution = m 2 p 0 (1 p 0 ) benefit granted total women 6 4 (b) 10 men total 7 8 (m 2 ) 15
9 Extensions to account for: 9 Lack of comparison term occurs when there are no men (or women) in the context ZIP=100 benefit denied benefit granted total women men total p 1 = proportion of benefit denied to women = 6/10 = 60% p 2 = undefined = p - = proportion of benefit denied to women in the whole dataset All discrimination measures extends smoothly
10 Extensions to account for: 10 Random effects rather than explicit discrimination: Confidence intervals for discrimination measures [Pedreschi et al. 2009] Causality in discrimination conclusions: Do women from NYC have the same characteristics of men they are compared with? Or do they differ as per skills or other admissible reasons? Propensity score weighting [Ridgeway2006] Weighted risk difference (wrd) is p 1 p w = 40% p w = x men denied w(x) / x men w(x) Pr x women = w x Pr( x man) w x = Pr woman x /(1 Pr woman x ) Propensity score weights
11 11 Discrimination in a dataset
12 -protection city=nyc city=nyc birth=1965 benefit benefit total denied granted women men PND attributes PD attribute total City Birth date Sex Benefit NYC 1973 M No NYC 1965 F No NYC 1965 M Yes LA 1973 M No RD = 100% - 50% = 100% 50% A non-empty 4-fold contingency table is -protective if the discrimination measure is lower or equal than a threshold A dataset if a-protective if all of its non-empty 4-fold contingency tables are -protective for any subset (or conjunction) of PND items
13 Local approaches 13 Extract classification rules: gender=women, B benefit=denied with B providing a context of discrimination E.g., B city = NYC with measure > Notice cover(b) is the context of analysis B can be a closed itemset (all distinct covers!)
14 A Java library: dd 14 DDTable tb = new DDTable(); tb.loadfromarff("credit"); tb.setdiscmetadata("personal_status=female_div_or_dep_or_mar, foreign_worker=yes", "class=bad", 20); tb.closeditemset = true; tb.extractitemsets(); tb.initscan(); double largest = -1; ContingencyTable ct = null; while( (ct = tb.nextct())!= null) { double diff = ct.rd(); if( diff > largest ) largest = diff; } tb.endscan(); Download it from
15 15 Level curves of top-k tables
16 16 Privacy in a dataset
17 k-anonimity 17 A partition-based measure of risk in data disclosure QI attributes sensitive attribute q-block size = 3 ZIP Birth date Sex Desease F Yes F No F No M No Q-block = rows with same values for all QIs A q-block is k-anonimous if its size is at least k A dataset is k-anonimous if every q-block is k-anonimous Any individual cannot is indistinguishable from k-1 others
18 t-closeness 18 A partition-based measure of attribute inference risk in data disclosure QI attributes sensitive attribute q-block size = 3 p = 33.3% ZIP Birth date Sex Desease F Yes F No F No M No A q-block is t-close if it maintains the proportion of sensitive values p = proportion of Yes in the q-block p* = proportion of Yes in the whole dataset Condition: p-p* < t A dataset is t-close if every q-block is t-close
19 Differences between t-close and a-protect 19 Distributions t-closeness, single: p-p* < t for every q-block -protection wrt RD, joint: p 1 - p 2 for every 4-fold c.t. Monotonicity property t-closeness fixes all values of QI attributes -protection fixes some values of QI attributes If the rows s.t. city=nyc, birth=x are -protective, for all X, then the rows s.t. city=nyc may be not -protective Sympson s paradox t-closeness and -protection are not equivalent models
20 Sympson s paradox 20 RD = p 1 p 2
21 t-closeness reduces to -protection
22 Reductions of problems 22 X = sanitize a dataset for privacy protection wrt t-closeness Y = sanitize a dataset for discrimination prevention wrt -protection g(.) f(.) g(i) = I+2 new attributes D1,D2 with PND = QI PD = {D1, D2}
23 Reductions of problems 23 X = sanitize a dataset for privacy protection wrt t-closeness Y = sanitize a dataset for discrimination prevention wrt -protection g(.) f(.) g(i) = I+2 new attributes D1,D2 with PND = QI PD = {D1, D2} contingency tables have no comparison term! Use p - = proportion of Yes in the whole dataset = p* Fix QIs (eg., ZIP=100, Birth=1965, Sex = F) RD = p 1 p - < for D1=True RD = p - - p 1 < for D2=True Thus, p 1 p - = p p * < -protection implies t-closeness, for t =
24 -protection reduces to t-closeness
25 Reductions of problems 25 X = sanitize a dataset for privacy protection wrt t-closeness Y = sanitize a dataset for discrimination prevention wrt -protection g(.) f(.) g(i) = I with QI= PND+PD Notice that: p* = p -
26 city=nyc benefit denied benefit granted total women men total p 1 = proportion of benefit denied to women = 6/10 = 60% p 2 = proportion of benefit denied to men = 1/5 = 20% Risk difference (RD) is p1 - p2 = 40% Triangle inequality RD = p1 - p2 p1 p* + p2 p* If the dataset is t-close where QI attributes = PND attributes + PD attribute Sensitive attribute = decision attribute RD = p1 - p2 p1 p* + p2 p* 2t then it is 2t-protective
27 Formal results 27 A dataset does not contain discrimination (more than bd f (t)) if an attacker cannot be confident (more than a threshold t) on the decision assigned to an individual by exploiting the differences in the fraction of positive and negative decisions between the protected and the unprotected groups. The role of an ``attacker" here is played by the anti-discrimination analyst, whose objective is to unveil from data a context where negative decisions are biased against the protected group.
28 Formal results 28 Disjunctive items A=v 1 A=v n Ex., age in [25,30] is a disjunctive item Stronger conclusion than -protection: context with conjunctions of possibly disjunctive items are covered!! Patterns used in decision trees, association rule classifiers!! city=nyc, age in [25,30] benefit denied benefit granted total women men total
29 Reductions of problems 29 X = sanitize a dataset for discrimination prevention wrt -protection Y= sanitize a dataset for privacy protection wrt t-closeness g(.) f(.) The reduction show ONE way to sanitize X using Y. Can all sanitized versions of I be obtained through reduction?
30 Sympson s paradox 30 RD = p 1 p 2 p* = 0.4 p = 0.57 p = 0.5 p = 0.25 p = 0.33 The dataset is 0.17-close
31 Reductions of problems 31 X = sanitize a dataset for discrimination prevention wrt -protection Y= sanitize a dataset for privacy protection wrt t-closeness g(.) f(.) The reduction show ONE way to sanitize X using Y. Can all sanitized versions of I be obtained through such a reduction? Assume the answer is positive: Let I be the Sympon s paradox dataset and = (I is already -protective) There exists a «empty» sanitization Y of I s.t. 2t 0.283, i.e. t Impossible because I is only 0.17-close Message of the talk: X and Y are «equivalent» (in a weak sense)
32 Main results and application 32 t-closeness implies bd(t)-protection where bd() is a function dependent on the discrimination measure (RD,RR,OR, ) the bound bd(t) can be reached in limit cases Application: data anonymization methods can be applied to sanitization wrt non-discrimination
33 33 Multidimensional recoding: dmondrian
34 34 Example
35 35 Bucketization & redistrib.: dsabre
36 36 Effective to reduce discrimination
37 37 dsabre is better
38 38 also regarding information loss
39 Quality: median relative error 39 Count queries are basic elements in classifier construction, e.g., in decision tree or association rule classifiers Class attribrute Relative error of a count query is est prec /prec prec = count over original dataset est = count over sanitized dataset (uniform distribution of values) Median relative error is the median error on 10K randomly generated count queries
40 40 but more sensitive to high dim/card
41 Related work 41 Incognito-like search for k-anonymous & a-protective sanitization [Haijan & DAMI 2014] Impact of k-anonimity sanitization on a-protection [Haijan & DPADM 2012] Techniques for achieving both k-anonimity and a-protection in knowledge disclosure [Haijan et DPADM 2012] Non-discriminatory (fair) classification as a generalization of differential privacy [Dwork et ITCS 2012], [Zemel et ICML 2013]
42 Thanks, questions?
5) The table below describes the smoking habits of a group of asthma sufferers. two way table ( ( cell cell ) (cell cell) (cell cell) )
MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question. Determine which score corresponds to the higher relative position. 1) Which score has a better relative
Association Between Variables
Contents 11 Association Between Variables 767 11.1 Introduction............................ 767 11.1.1 Measure of Association................. 768 11.1.2 Chapter Summary.................... 769 11.2 Chi
CS346: Advanced Databases
CS346: Advanced Databases Alexandra I. Cristea [email protected] Data Security and Privacy Outline Chapter: Database Security in Elmasri and Navathe (chapter 24, 6 th Edition) Brief overview of
Adverse Impact Ratio for Females (0/ 1) = 0 (5/ 17) = 0.2941 Adverse impact as defined by the 4/5ths rule was not found in the above data.
1 of 9 12/8/2014 12:57 PM (an On-Line Internet based application) Instructions: Please fill out the information into the form below. Once you have entered your data below, you may select the types of analysis
Statistics 2014 Scoring Guidelines
AP Statistics 2014 Scoring Guidelines College Board, Advanced Placement Program, AP, AP Central, and the acorn logo are registered trademarks of the College Board. AP Central is the official online home
MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.
Sample Practice problems - chapter 12-1 and 2 proportions for inference - Z Distributions Name MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question. Provide
We begin by presenting the current situation of women s representation in physics departments. Next, we present the results of simulations that
Report A publication of the AIP Statistical Research Center One Physics Ellipse College Park, MD 20740 301.209.3070 [email protected] July 2013 Number of Women in Physics Departments: A Simulation Analysis
DATA MINING - 1DL360
DATA MINING - 1DL360 Fall 2013" An introductory class in data mining http://www.it.uu.se/edu/course/homepage/infoutv/per1ht13 Kjell Orsborn Uppsala Database Laboratory Department of Information Technology,
MATH 140 Lab 4: Probability and the Standard Normal Distribution
MATH 140 Lab 4: Probability and the Standard Normal Distribution Problem 1. Flipping a Coin Problem In this problem, we want to simualte the process of flipping a fair coin 1000 times. Note that the outcomes
In the general population of 0 to 4-year-olds, the annual incidence of asthma is 1.4%
Hypothesis Testing for a Proportion Example: We are interested in the probability of developing asthma over a given one-year period for children 0 to 4 years of age whose mothers smoke in the home In the
An Application of the Cox Proportional Hazards Model to the Construction of Objective Vintages for Credit in Financial Institutions, Using PROC PHREG
Paper 3140-2015 An Application of the Cox Proportional Hazards Model to the Construction of Objective Vintages for Credit in Financial Institutions, Using PROC PHREG Iván Darío Atehortua Rojas, Banco Colpatria
Elementary Statistics
Elementary Statistics Chapter 1 Dr. Ghamsary Page 1 Elementary Statistics M. Ghamsary, Ph.D. Chap 01 1 Elementary Statistics Chapter 1 Dr. Ghamsary Page 2 Statistics: Statistics is the science of collecting,
Mind on Statistics. Chapter 4
Mind on Statistics Chapter 4 Sections 4.1 Questions 1 to 4: The table below shows the counts by gender and highest degree attained for 498 respondents in the General Social Survey. Highest Degree Gender
T-61.6010 Non-discriminatory Machine Learning
T-61.6010 Non-discriminatory Machine Learning Seminar 1 Indrė Žliobaitė Aalto University School of Science, Department of Computer Science Helsinki Institute for Information Technology (HIIT) University
Class 19: Two Way Tables, Conditional Distributions, Chi-Square (Text: Sections 2.5; 9.1)
Spring 204 Class 9: Two Way Tables, Conditional Distributions, Chi-Square (Text: Sections 2.5; 9.) Big Picture: More than Two Samples In Chapter 7: We looked at quantitative variables and compared the
Use of the Chi-Square Statistic. Marie Diener-West, PhD Johns Hopkins University
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this
4. Continuous Random Variables, the Pareto and Normal Distributions
4. Continuous Random Variables, the Pareto and Normal Distributions A continuous random variable X can take any value in a given range (e.g. height, weight, age). The distribution of a continuous random
CALCULATIONS & STATISTICS
CALCULATIONS & STATISTICS CALCULATION OF SCORES Conversion of 1-5 scale to 0-100 scores When you look at your report, you will notice that the scores are reported on a 0-100 scale, even though respondents
Random Forest Based Imbalanced Data Cleaning and Classification
Random Forest Based Imbalanced Data Cleaning and Classification Jie Gu Software School of Tsinghua University, China Abstract. The given task of PAKDD 2007 data mining competition is a typical problem
Descriptive Inferential. The First Measured Century. Statistics. Statistics. We will focus on two types of statistical applications
Introduction: Statistics, Data and Statistical Thinking The First Measured Century FREC 408 Dr. Tom Ilvento 213 Townsend Hall [email protected] http://www.udel.edu/frec/ilvento http://www.pbs.org/fmc/index.htm
Chapter 2: Linear Equations and Inequalities Lecture notes Math 1010
Section 2.1: Linear Equations Definition of equation An equation is a statement that equates two algebraic expressions. Solving an equation involving a variable means finding all values of the variable
ARX A Comprehensive Tool for Anonymizing Biomedical Data
ARX A Comprehensive Tool for Anonymizing Biomedical Data Fabian Prasser, Florian Kohlmayer, Klaus A. Kuhn Chair of Biomedical Informatics Institute of Medical Statistics and Epidemiology Rechts der Isar
Chapter 4. Probability and Probability Distributions
Chapter 4. robability and robability Distributions Importance of Knowing robability To know whether a sample is not identical to the population from which it was selected, it is necessary to assess the
Density Curve. A density curve is the graph of a continuous probability distribution. It must satisfy the following properties:
Density Curve A density curve is the graph of a continuous probability distribution. It must satisfy the following properties: 1. The total area under the curve must equal 1. 2. Every point on the curve
Week 3&4: Z tables and the Sampling Distribution of X
Week 3&4: Z tables and the Sampling Distribution of X 2 / 36 The Standard Normal Distribution, or Z Distribution, is the distribution of a random variable, Z N(0, 1 2 ). The distribution of any other normal
RATIOS, PROPORTIONS, PERCENTAGES, AND RATES
RATIOS, PROPORTIOS, PERCETAGES, AD RATES 1. Ratios: ratios are one number expressed in relation to another by dividing the one number by the other. For example, the sex ratio of Delaware in 1990 was: 343,200
Differential privacy in health care analytics and medical research An interactive tutorial
Differential privacy in health care analytics and medical research An interactive tutorial Speaker: Moritz Hardt Theory Group, IBM Almaden February 21, 2012 Overview 1. Releasing medical data: What could
Two Correlated Proportions (McNemar Test)
Chapter 50 Two Correlated Proportions (Mcemar Test) Introduction This procedure computes confidence intervals and hypothesis tests for the comparison of the marginal frequencies of two factors (each with
Inclusion and Exclusion Criteria
Inclusion and Exclusion Criteria Inclusion criteria = attributes of subjects that are essential for their selection to participate. Inclusion criteria function remove the influence of specific confounding
Statistics for Sports Medicine
Statistics for Sports Medicine Suzanne Hecht, MD University of Minnesota ([email protected]) Fellow s Research Conference July 2012: Philadelphia GOALS Try not to bore you to death!! Try to teach
Statistical tests for SPSS
Statistical tests for SPSS Paolo Coletti A.Y. 2010/11 Free University of Bolzano Bozen Premise This book is a very quick, rough and fast description of statistical tests and their usage. It is explicitly
Pay Equity Christopher Rootham, Nelligan O Brien Payne LLP
Pay Equity Christopher Rootham, Nelligan O Brien Payne LLP Introduction This paper is designed to provide a brief introduction and overview of the Ontario Pay Equity Act. It is important for both employers
Theories of Discrimination
Theories of Discrimination Taste discrimination: individuals (employers, employees, customers) prefer certain individuals or groups to others. Statistical discrimination: agents make decisions about individuals
Unit 12 Logistic Regression Supplementary Chapter 14 in IPS On CD (Chap 16, 5th ed.)
Unit 12 Logistic Regression Supplementary Chapter 14 in IPS On CD (Chap 16, 5th ed.) Logistic regression generalizes methods for 2-way tables Adds capability studying several predictors, but Limited to
SCHOOL OF HEALTH AND HUMAN SCIENCES DON T FORGET TO RECODE YOUR MISSING VALUES
SCHOOL OF HEALTH AND HUMAN SCIENCES Using SPSS Topics addressed today: 1. Differences between groups 2. Graphing Use the s4data.sav file for the first part of this session. DON T FORGET TO RECODE YOUR
Introduction to Data Mining
Introduction to Data Mining Jay Urbain Credits: Nazli Goharian & David Grossman @ IIT Outline Introduction Data Pre-processing Data Mining Algorithms Naïve Bayes Decision Tree Neural Network Association
Comparing Two Groups. Standard Error of ȳ 1 ȳ 2. Setting. Two Independent Samples
Comparing Two Groups Chapter 7 describes two ways to compare two populations on the basis of independent samples: a confidence interval for the difference in population means and a hypothesis test. The
Introduction to Hypothesis Testing OPRE 6301
Introduction to Hypothesis Testing OPRE 6301 Motivation... The purpose of hypothesis testing is to determine whether there is enough statistical evidence in favor of a certain belief, or hypothesis, about
Statistics E100 Fall 2013 Practice Midterm I - A Solutions
STATISTICS E100 FALL 2013 PRACTICE MIDTERM I - A SOLUTIONS PAGE 1 OF 5 Statistics E100 Fall 2013 Practice Midterm I - A Solutions 1. (16 points total) Below is the histogram for the number of medals won
Mind on Statistics. Chapter 12
Mind on Statistics Chapter 12 Sections 12.1 Questions 1 to 6: For each statement, determine if the statement is a typical null hypothesis (H 0 ) or alternative hypothesis (H a ). 1. There is no difference
(Big) Data Anonymization Claude Castelluccia Inria, Privatics
(Big) Data Anonymization Claude Castelluccia Inria, Privatics BIG DATA: The Risks Singling-out/ Re-Identification: ADV is able to identify the target s record in the published dataset from some know information
Statistiek I. Proportions aka Sign Tests. John Nerbonne. CLCG, Rijksuniversiteit Groningen. http://www.let.rug.nl/nerbonne/teach/statistiek-i/
Statistiek I Proportions aka Sign Tests John Nerbonne CLCG, Rijksuniversiteit Groningen http://www.let.rug.nl/nerbonne/teach/statistiek-i/ John Nerbonne 1/34 Proportions aka Sign Test The relative frequency
CHAPTER 15 NOMINAL MEASURES OF CORRELATION: PHI, THE CONTINGENCY COEFFICIENT, AND CRAMER'S V
CHAPTER 15 NOMINAL MEASURES OF CORRELATION: PHI, THE CONTINGENCY COEFFICIENT, AND CRAMER'S V Chapters 13 and 14 introduced and explained the use of a set of statistical tools that researchers use to measure
Chi-square test Fisher s Exact test
Lesson 1 Chi-square test Fisher s Exact test McNemar s Test Lesson 1 Overview Lesson 11 covered two inference methods for categorical data from groups Confidence Intervals for the difference of two proportions
ECON 443 Labor Market Analysis Final Exam (07/20/2005)
ECON 443 Labor Market Analysis Final Exam (07/20/2005) I. Multiple-Choice Questions (80%) 1. A compensating wage differential is A) an extra wage that will make all workers willing to accept undesirable
Point and Interval Estimates
Point and Interval Estimates Suppose we want to estimate a parameter, such as p or µ, based on a finite sample of data. There are two main methods: 1. Point estimate: Summarize the sample by a single number
Odds ratio, Odds ratio test for independence, chi-squared statistic.
Odds ratio, Odds ratio test for independence, chi-squared statistic. Announcements: Assignment 5 is live on webpage. Due Wed Aug 1 at 4:30pm. (9 days, 1 hour, 58.5 minutes ) Final exam is Aug 9. Review
C. The null hypothesis is not rejected when the alternative hypothesis is true. A. population parameters.
Sample Multiple Choice Questions for the material since Midterm 2. Sample questions from Midterms and 2 are also representative of questions that may appear on the final exam.. A randomly selected sample
The test uses age norms (national) and grade norms (national) to calculate scores and compare students of the same age or grade.
Reading the CogAT Report for Parents The CogAT Test measures the level and pattern of cognitive development of a student compared to age mates and grade mates. These general reasoning abilities, which
Bivariate Statistics Session 2: Measuring Associations Chi-Square Test
Bivariate Statistics Session 2: Measuring Associations Chi-Square Test Features Of The Chi-Square Statistic The chi-square test is non-parametric. That is, it makes no assumptions about the distribution
E3: PROBABILITY AND STATISTICS lecture notes
E3: PROBABILITY AND STATISTICS lecture notes 2 Contents 1 PROBABILITY THEORY 7 1.1 Experiments and random events............................ 7 1.2 Certain event. Impossible event............................
Answer: Quantity A is greater. Quantity A: 0.717 0.717717... Quantity B: 0.71 0.717171...
Test : First QR Section Question 1 Test, First QR Section In a decimal number, a bar over one or more consecutive digits... QA: 0.717 QB: 0.71 Arithmetic: Decimals 1. Consider the two quantities: Answer:
How To Collect Data From A Large Group
Section 2: Ten Tools for Applying Sociology CHAPTER 2.6: DATA COLLECTION METHODS QUICK START: In this chapter, you will learn The basics of data collection methods. To know when to use quantitative and/or
IBM SPSS Direct Marketing
IBM Software IBM SPSS Statistics 19 IBM SPSS Direct Marketing Understand your customers and improve marketing campaigns Highlights With IBM SPSS Direct Marketing, you can: Understand your customers in
p ˆ (sample mean and sample
Chapter 6: Confidence Intervals and Hypothesis Testing When analyzing data, we can t just accept the sample mean or sample proportion as the official mean or proportion. When we estimate the statistics
Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C
3.4 Statistical inference for 2 populations based on two samples
3.4 Statistical inference for 2 populations based on two samples Tests for a difference between two population means The first sample will be denoted as X 1, X 2,..., X m. The second sample will be denoted
Introduction to Hypothesis Testing
I. Terms, Concepts. Introduction to Hypothesis Testing A. In general, we do not know the true value of population parameters - they must be estimated. However, we do have hypotheses about what the true
Basic Probability Concepts
page 1 Chapter 1 Basic Probability Concepts 1.1 Sample and Event Spaces 1.1.1 Sample Space A probabilistic (or statistical) experiment has the following characteristics: (a) the set of all possible outcomes
Chapter 8 Hypothesis Testing Chapter 8 Hypothesis Testing 8-1 Overview 8-2 Basics of Hypothesis Testing
Chapter 8 Hypothesis Testing 1 Chapter 8 Hypothesis Testing 8-1 Overview 8-2 Basics of Hypothesis Testing 8-3 Testing a Claim About a Proportion 8-5 Testing a Claim About a Mean: s Not Known 8-6 Testing
HealthStream Regulatory Script
HealthStream Regulatory Script Diversity in the Workplace Version: May 2007 Lesson 1: Introduction Lesson 2: Significance of Workplace Diversity Lesson 3: Diversity Programs Lesson 4: Doing Your Part Lesson
Weight of Evidence Module
Formula Guide The purpose of the Weight of Evidence (WoE) module is to provide flexible tools to recode the values in continuous and categorical predictor variables into discrete categories automatically,
GIS I Business Exr02 (av 9-10) - Expand Market Share (v3b, Jul 2013)
GIS I Business Exr02 (av 9-10) - Expand Market Share (v3b, Jul 2013) Learning Objectives: Reinforce information literacy skills Reinforce database manipulation / querying skills Reinforce joining and mapping
Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data
Knowledge Discovery and Data Mining Unit # 2 1 Structured vs. Non-Structured Data Most business databases contain structured data consisting of well-defined fields with numeric or alphanumeric values.
PRIVACY PRESERVING ASSOCIATION RULE MINING
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 10, October 2014,
Non-Parametric Tests (I)
Lecture 5: Non-Parametric Tests (I) KimHuat LIM [email protected] http://www.stats.ox.ac.uk/~lim/teaching.html Slide 1 5.1 Outline (i) Overview of Distribution-Free Tests (ii) Median Test for Two Independent
CHAPTER 6: Continuous Uniform Distribution: 6.1. Definition: The density function of the continuous random variable X on the interval [A, B] is.
Some Continuous Probability Distributions CHAPTER 6: Continuous Uniform Distribution: 6. Definition: The density function of the continuous random variable X on the interval [A, B] is B A A x B f(x; A,
Statistics I for QBIC. Contents and Objectives. Chapters 1 7. Revised: August 2013
Statistics I for QBIC Text Book: Biostatistics, 10 th edition, by Daniel & Cross Contents and Objectives Chapters 1 7 Revised: August 2013 Chapter 1: Nature of Statistics (sections 1.1-1.6) Objectives
UK application rates by country, region, constituency, sex, age and background. (2015 cycle, January deadline)
UK application rates by country, region, constituency, sex, age and background () UCAS Analysis and Research 30 January 2015 Key findings JANUARY DEADLINE APPLICATION RATES PROVIDE THE FIRST RELIABLE INDICATION
Section 12 Part 2. Chi-square test
Section 12 Part 2 Chi-square test McNemar s Test Section 12 Part 2 Overview Section 12, Part 1 covered two inference methods for categorical data from 2 groups Confidence Intervals for the difference of
INDISTINGUISHABILITY OF ABSOLUTELY CONTINUOUS AND SINGULAR DISTRIBUTIONS
INDISTINGUISHABILITY OF ABSOLUTELY CONTINUOUS AND SINGULAR DISTRIBUTIONS STEVEN P. LALLEY AND ANDREW NOBEL Abstract. It is shown that there are no consistent decision rules for the hypothesis testing problem
Row vs. Column Percents. tab PRAYER DEGREE, row col
Bivariate Analysis - Crosstabulation One of most basic research tools shows how x varies with respect to y Interpretation of table depends upon direction of percentaging example Row vs. Column Percents.
CYBER SCIENCE 2015 AN ANALYSIS OF NETWORK TRAFFIC CLASSIFICATION FOR BOTNET DETECTION
CYBER SCIENCE 2015 AN ANALYSIS OF NETWORK TRAFFIC CLASSIFICATION FOR BOTNET DETECTION MATIJA STEVANOVIC PhD Student JENS MYRUP PEDERSEN Associate Professor Department of Electronic Systems Aalborg University,
Multinomial and Ordinal Logistic Regression
Multinomial and Ordinal Logistic Regression ME104: Linear Regression Analysis Kenneth Benoit August 22, 2012 Regression with categorical dependent variables When the dependent variable is categorical,
Introduction to Analysis of Variance (ANOVA) Limitations of the t-test
Introduction to Analysis of Variance (ANOVA) The Structural Model, The Summary Table, and the One- Way ANOVA Limitations of the t-test Although the t-test is commonly used, it has limitations Can only
More details on the inputs, functionality, and output can be found below.
Overview: The SMEEACT (Software for More Efficient, Ethical, and Affordable Clinical Trials) web interface (http://research.mdacc.tmc.edu/smeeactweb) implements a single analysis of a two-armed trial comparing
Protecting Respondents' Identities in Microdata Release
1010 IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 13, NO. 6, NOVEMBER/DECEMBER 2001 Protecting Respondents' Identities in Microdata Release Pierangela Samarati, Member, IEEE Computer Society
Fraud Detection for Online Retail using Random Forests
Fraud Detection for Online Retail using Random Forests Eric Altendorf, Peter Brende, Josh Daniel, Laurent Lessard Abstract As online commerce becomes more common, fraud is an increasingly important concern.
Easily Identify the Right Customers
PASW Direct Marketing 18 Specifications Easily Identify the Right Customers You want your marketing programs to be as profitable as possible, and gaining insight into the information contained in your
P (B) In statistics, the Bayes theorem is often used in the following way: P (Data Unknown)P (Unknown) P (Data)
22S:101 Biostatistics: J. Huang 1 Bayes Theorem For two events A and B, if we know the conditional probability P (B A) and the probability P (A), then the Bayes theorem tells that we can compute the conditional
The Importance of Statistics Education
The Importance of Statistics Education Professor Jessica Utts Department of Statistics University of California, Irvine http://www.ics.uci.edu/~jutts [email protected] Outline of Talk What is Statistics? Four
Probability Distributions
Learning Objectives Probability Distributions Section 1: How Can We Summarize Possible Outcomes and Their Probabilities? 1. Random variable 2. Probability distributions for discrete random variables 3.
Assessment Anchors and Eligible Content
M07.A-N The Number System M07.A-N.1 M07.A-N.1.1 DESCRIPTOR Assessment Anchors and Eligible Content Aligned to the Grade 7 Pennsylvania Core Standards Reporting Category Apply and extend previous understandings
SOLUTIONS TO BIOSTATISTICS PRACTICE PROBLEMS
SOLUTIONS TO BIOSTATISTICS PRACTICE PROBLEMS BIOSTATISTICS DESCRIBING DATA, THE NORMAL DISTRIBUTION SOLUTIONS 1. a. To calculate the mean, we just add up all 7 values, and divide by 7. In Xi i= 1 fancy
Data Mining. Knowledge Discovery, Data Warehousing and Machine Learning Final remarks. Lecturer: JERZY STEFANOWSKI
Data Mining Knowledge Discovery, Data Warehousing and Machine Learning Final remarks Lecturer: JERZY STEFANOWSKI Email: [email protected] Data Mining a step in A KDD Process Data mining:
Errata and updates for ASM Exam C/Exam 4 Manual (Sixteenth Edition) sorted by page
Errata for ASM Exam C/4 Study Manual (Sixteenth Edition) Sorted by Page 1 Errata and updates for ASM Exam C/Exam 4 Manual (Sixteenth Edition) sorted by page Practice exam 1:9, 1:22, 1:29, 9:5, and 10:8
Linear Models in STATA and ANOVA
Session 4 Linear Models in STATA and ANOVA Page Strengths of Linear Relationships 4-2 A Note on Non-Linear Relationships 4-4 Multiple Linear Regression 4-5 Removal of Variables 4-8 Independent Samples
Easily Identify Your Best Customers
IBM SPSS Statistics Easily Identify Your Best Customers Use IBM SPSS predictive analytics software to gain insight from your customer database Contents: 1 Introduction 2 Exploring customer data Where do
University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task
University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task Graham McDonald, Romain Deveaud, Richard McCreadie, Timothy Gollins, Craig Macdonald and Iadh Ounis School
CITY OF DAYTON HUMAN RELATIONS COUNCIL AFFIRMATIVE ACTION ASSURANCE (AAA) FORM
CITY OF DAYTON HUMAN RELATIONS COUNCIL AFFIRMATIVE ACTION ASSURANCE (AAA) FORM The City of Dayton requires an Affirmative Action Assurance form approved by the Human Relations Council for all entities
Paid and Unpaid Labor in Developing Countries: an inequalities in time use approach
Paid and Unpaid Work inequalities 1 Paid and Unpaid Labor in Developing Countries: an inequalities in time use approach Paid and Unpaid Labor in Developing Countries: an inequalities in time use approach
