Differential Privacy Tutorial Simons Institute Workshop on Privacy and Big Data. Katrina Ligett Caltech
1 Differential Privacy Tutorial. Simons Institute Workshop on Privacy and Big Data. Katrina Ligett, Caltech
3 individuals have lots of interesting data... we want to compute on it
14 finding statistical correlations
  - genotype/phenotype associations
  - correlating medical outcomes with risk factors or events
publishing aggregate statistics
noticing events/outliers
  - intrusion detection
  - disease outbreaks
datamining/learning tasks
  - use customer data to update strategies
17 "No personally identifiable information was released"

user            query
John Doe        tax forms
John Doe        irs 1040
John Doe        error in form 1099
Katrina Ligett  data privacy
Katrina Ligett  aol search debacle
Katrina Ligett  Ligett DBLP
Katrina Ligett  computer science news
Katrina Ligett  Caltech rankings
Katrina Ligett  weather Pasadena
Jane Smith      youtube
Jane Smith      free tv download
Jane Smith      streaming tv
Chris Jones     childrens books
Chris Jones     dr seuss
Chris Jones     the cat and the hat
Chris Jones     gifts for children

18-21 same log with "Katrina Ligett" replaced by "user"; the queries themselves are unchanged
Ad hoc solutions are risky!
Huge opportunity for formalism.
22 This doesn't apply to me! I don't want to publish the whole dataset!
23-31 individuals hold data... what if it's sensitive?

name           DOB      sex  weight  smoker  [unlabeled]
John Doe       12/1/51  M    185     Y       N
Jane Smith     3/3/46   F    140     N       N
Ellen Jones    4/24/59  F    160     Y       Y
Jennifer Kim   3/1/70   F    135     N       N
Rachel Waters  9/5/43   F    140     N       N

[figure builds: the table, or an output computed from it such as "18%", is passed to "public"]
35 This doesn't apply to me! I don't want to publish the whole dataset!
not so fast... see, e.g., Korolova 2011's Facebook microtargeting attack
... must pay attention to all uses of sensitive data
39 what to promise about output?
access to the output should not enable one to learn anything about an individual that could not be learned without access
is this possible?
hint: either privacy or utility separately is easy
43 what if we wanted to do a study about smoking and [blank]?
[example database table: rows John Doe, Jane Smith, Ellen Jones, Jennifer Kim, Rachel Waters]
public: there is a correlation of xxx
but what if someone knew Alice is a smoker?
45 what to promise about output?
access to the output should not enable one to learn anything about an individual that could not be learned without access
not possible!
52 what to promise about output?
think of output as randomized
[example database table with columns name, DOB, sex, weight, smoker; an output such as "18%" goes to "public"]
promise: if you leave the database, no outcome will change probability by very much
53 more formally...
  - database D: a set of rows, one per person
  - sanitizing algorithm M: probabilistically maps D to an event or object in the outcome space
55 differential privacy [DinurNissim03, DworkNissimMcSherrySmith06]
ε-differential privacy for mechanism M: for any two neighboring data sets $D_1$, $D_2$ and any $C \subseteq \mathrm{range}(M)$,
$$\Pr[M(D_1) \in C] \le e^{\varepsilon} \Pr[M(D_2) \in C]$$
$e^{\varepsilon} \approx 1 + \varepsilon$ for small ε
59 differential privacy: $\Pr[M(D_1) \in C] \le e^{\varepsilon} \Pr[M(D_2) \in C]$
[figure: the inequality overlaid on the example database table (name, DOB, sex, weight, smoker)] C. Dwork
60 (ε,δ)-differential privacy: $\Pr[M(D_1) \in C] \le e^{\varepsilon} \Pr[M(D_2) \in C] + \delta$ C. Dwork
61 differential privacy: $\Pr[M(D_1) \in C] \le e^{\varepsilon} \Pr[M(D_2) \in C]$
  - is a statistical property of mechanism behavior
  - unaffected by auxiliary information
  - independent of adversary's computational power
62 differential privacy: $\Pr[M(D_1) \in C] \le e^{\varepsilon} \Pr[M(D_2) \in C]$
promise: if you leave the database, no outcome will change probability by very much
is this achievable?
63 yes!
65 if your output is a number (e.g. 18%)...
[example database table: rows John Doe, Jane Smith, Ellen Jones, Jennifer Kim, Rachel Waters]
add noise with particular shape
68 sensitivity of a function f
$$\Delta f = \max_{D_1, D_2} |f(D_1) - f(D_2)|$$ for neighboring data sets $D_1$, $D_2$
  - measures how much one person can affect output
  - sensitivity is 1 for counting queries that count number of rows satisfying a predicate
69 scale noise with sensitivity $\Delta f = \max_{D_1, D_2} |f(D_1) - f(D_2)|$
[DMNS06]: on query f, can add scaled symmetric noise Lap(b) with $b = \Delta f/\varepsilon$ to achieve ε-differential privacy.
70 Laplace distribution Lap(b): $p(z) = \exp(-|z|/b)/(2b)$, variance $= 2b^2$
72 applying the Laplace mechanism
single counting query: how many people in the database satisfy predicate P?
  - sensitivity = 1
  - can add noise Lap(1/ε)
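A minimal sketch of the single counting query with Lap(1/ε) noise, using only the Python standard library. The names `laplace_noise` and `private_count` and the toy rows are illustrative assumptions, not part of the talk; the Laplace sample is drawn as the difference of two exponential variates.

```python
import random


def laplace_noise(scale, rng):
    """Sample Lap(scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(rows, predicate, epsilon, rng=None):
    """Counting query (sensitivity 1) released with Lap(1/epsilon) noise."""
    rng = rng or random.Random()
    true_count = sum(1 for row in rows if predicate(row))
    return true_count + laplace_noise(1.0 / epsilon, rng)


# Hypothetical toy rows in the spirit of the example table:
# (name, sex, weight, smoker)
rows = [
    ("John Doe", "M", 185, True),
    ("Jane Smith", "F", 140, False),
    ("Ellen Jones", "F", 160, True),
    ("Jennifer Kim", "F", 135, False),
    ("Rachel Waters", "F", 140, False),
]

# Noisy answer to "how many rows are smokers?"
noisy_smokers = private_count(rows, lambda r: r[3], epsilon=0.5,
                              rng=random.Random(0))
```

Smaller ε means a larger noise scale 1/ε, so on a five-row table the noise can easily dominate the count; the noise does not shrink with more rows, but its relative effect does.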
75 what if want more than one query?... composition
  - an ε1-DP mechanism, followed by an ε2-DP mechanism, is (ε1 + ε2)-DP
  - can also sum both the epsilons and the deltas for (ε, δ)-DP
  - more sophisticated argument: k runs of (ε, δ)-DP mechanisms give $(\varepsilon', k\delta + \delta')$-DP for $\varepsilon' = \sqrt{2k \ln(1/\delta')}\,\varepsilon + k\varepsilon(e^{\varepsilon} - 1)$
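To make the two composition bounds concrete, here is a small calculator sketch. The function names are hypothetical; the second formula is the "more sophisticated argument" bound as stated on the slide, with `delta_prime` standing for the extra slack δ'.

```python
import math


def basic_composition(epsilons, deltas=None):
    """Sum the epsilons (and the deltas) of a sequence of DP mechanisms."""
    deltas = deltas if deltas is not None else [0.0] * len(epsilons)
    return sum(epsilons), sum(deltas)


def advanced_composition(epsilon, delta, k, delta_prime):
    """Bound for k runs of an (epsilon, delta)-DP mechanism.

    Returns (eps_total, delta_total) with
    eps_total = sqrt(2 k ln(1/delta')) * eps + k * eps * (e^eps - 1).
    """
    eps_total = (math.sqrt(2.0 * k * math.log(1.0 / delta_prime)) * epsilon
                 + k * epsilon * (math.exp(epsilon) - 1.0))
    return eps_total, k * delta + delta_prime


# Compare the two bounds for 100 runs of a 0.1-DP mechanism.
basic_eps, _ = basic_composition([0.1] * 100)
adv_eps, adv_delta = advanced_composition(0.1, 0.0, k=100, delta_prime=1e-6)
```

For many runs of a small-ε mechanism the second bound grows roughly like the square root of k rather than linearly, at the price of a nonzero δ'.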
76 applying the Laplace mechanism
vector-valued queries of dimension d: can apply composition and add noise $\mathrm{Lap}(d\,\Delta f/\varepsilon)$ in each component of output, where $\Delta f$ is the sensitivity of each component
78 applying the Laplace mechanism
histogram queries
  - could again use noise Lap(d/ε)
  - but actually only need ~Lap(1/ε), since sensitivity generalizes as max L1 distance
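A sketch of the histogram case, assuming neighboring databases differ by adding or removing one row, so that one person changes exactly one bin by 1 and the L1 sensitivity of the whole histogram is 1. The names are illustrative, and the bin list must be fixed in advance rather than read off the data.

```python
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Sample Lap(scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_histogram(values, bins, epsilon, rng=None):
    """Release every bin count with independent Lap(1/epsilon) noise.

    `bins` is a fixed, data-independent list of categories; bins that do not
    occur in the data are still released (with true count 0).
    """
    rng = rng or random.Random()
    counts = Counter(values)
    return {b: counts.get(b, 0) + laplace_noise(1.0 / epsilon, rng)
            for b in bins}


hist = private_histogram(["F", "F", "M", "F", "F"], bins=["F", "M"],
                         epsilon=1.0, rng=random.Random(0))
```

The noise per bin does not grow with the number of bins d, which is the point of measuring sensitivity as an L1 distance instead of composing d separate counting queries.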
79 Gaussian mechanism [DKMMN06]: Gaussian noise gives (ε, δ)-DP with $\sigma \ge \sqrt{2\ln(2/\delta)}/\varepsilon \cdot (\text{max } L_2 \text{ distance})$
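A sketch of the Gaussian mechanism using the noise level quoted on the slide. The function names are hypothetical, and the sketch assumes ε < 1, the regime in which this kind of bound is usually stated; other references use slightly different constants.

```python
import math
import random


def gaussian_sigma(l2_sensitivity, epsilon, delta):
    """Noise level from the slide: sqrt(2 ln(2/delta)) / epsilon * (max L2 distance)."""
    return math.sqrt(2.0 * math.log(2.0 / delta)) / epsilon * l2_sensitivity


def gaussian_mechanism(vector, l2_sensitivity, epsilon, delta, rng=None):
    """Add independent N(0, sigma^2) noise to each coordinate of the answer."""
    rng = rng or random.Random()
    sigma = gaussian_sigma(l2_sensitivity, epsilon, delta)
    return [v + rng.gauss(0.0, sigma) for v in vector]


noisy = gaussian_mechanism([10.0, 20.0, 30.0], l2_sensitivity=1.0,
                           epsilon=0.5, delta=1e-5, rng=random.Random(0))
```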
81 Ok, but I wanted to use my database for more than a handful of statistics...
Data can be big in two dimensions:
  - more rows makes privacy easier (lower sensitivity)
  - more columns makes it harder (more queries to preserve)
83 handling an exponential number of queries
[example database table: rows John Doe, Jane Smith, Ellen Jones, Jennifer Kim, Rachel Waters]
  - what fraction over age 50?
  - what fraction smoke and have [blank]?
  - what fraction of males over 150 lbs?
  - ...
90 a brief history of synthetic data (theory)
  - BLR08: ε-DP, error $\log^{1/3}|Q| \cdot n^{2/3}$
  - DNRRV09: (ε, δ)-DP, error $|Q|^{o(1)} \cdot n^{1/2}$
  - DRV10: (ε, δ)-DP, error $\mathrm{polylog}|Q| \cdot n^{1/2}$
  - HR10: (ε, δ)-DP, error $\log|Q| \cdot n^{1/2}$
  - HLM12: simple & matches best bounds
Can (sometimes) do much better than naive noise addition, with much more sophisticated techniques
93 exponential mechanism [MT07]
select an element $C \in \mathrm{range}(M)$ with probability $\propto \exp(\varepsilon\, u(D, C)/(2\Delta u))$, where u is a scoring function
  - privacy obvious
  - utility... depends
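A minimal sketch of the exponential mechanism over a finite candidate set. The signature is an illustrative assumption: `score(data, c)` plays the role of u(D, C) and `score_sensitivity` the role of Δu. Subtracting the maximum score before exponentiating only rescales the weights, so the sampling probabilities are unchanged.

```python
import math
import random


def exponential_mechanism(data, candidates, score, score_sensitivity,
                          epsilon, rng=None):
    """Sample a candidate with probability proportional to
    exp(epsilon * score / (2 * score_sensitivity))."""
    rng = rng or random.Random()
    scores = [score(data, c) for c in candidates]
    top = max(scores)  # shift for numerical stability
    weights = [math.exp(epsilon * (s - top) / (2.0 * score_sensitivity))
               for s in scores]
    return rng.choices(candidates, weights=weights, k=1)[0]


# Toy use: pick the most common label; the score is a count, so Delta u = 1.
data = ["a"] * 50 + ["b"] * 10
pick = exponential_mechanism(data, ["a", "b", "c"],
                             score=lambda d, c: d.count(c),
                             score_sensitivity=1.0, epsilon=1.0,
                             rng=random.Random(0))
```

Utility depends on how well separated the scores are: here the gap of 40 between "a" and "b" makes any other outcome extremely unlikely, while near-tied scores would give close to a uniform choice.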
94 [BLR08] combines
  - exponential mechanism [MT07] for sampling complex output space
  - sample complexity bounds from learning theory to guarantee existence of good output
95-99 [BLR08] Size $\tilde{O}(\mathrm{VCDIM}(Q)/\varepsilon^2)$
[figure: the example database (columns name, DOB, sex, weight, smoker), extended to more rows, shown alongside repeated candidate rows "YYY 1/11/74 F 130 N Y" and "ZZZ 10/5/44 F 150 N N"]
101 [HLM12]
  - simple to describe and to implement
  - actually implemented and tested it
  - state of the art in theory, performs well in practice (and quickly, despite bad worst-case news)
... will hear more about multiplicative weights-based techniques in Jon's talk
102 I have to know all my queries in advance?!
103 interactive mechanisms
  - so far, have discussed creating synthetic data, where the query set must be known in advance
  - tools exist to answer a similar number of queries on the fly (correlating randomness across queries)
105 It seems like DP would add too much noise for my application.
... stop and think about what this means
109 DP connected to robustness of computation to presence or absence of individuals
  - computation not robust? (should worry!)
  - need more data (individuals) to get desired privacy-utility tradeoff (should think)
  - expect on real data will be robust (we can do something about this!)
110 robustness (an aside)
  - robustness in DP sense not identical to statistical robustness: DP is worst-case rather than with respect to a distribution
  - there are connections (will mention shortly)
111 expect study robust on actual data
  - idea 1: bootstrapping
112 bootstrap aggregation
  - given training dataset, create many new training sets of smaller size by sampling uniformly with replacement
  - fit your model (estimate your statistic) on each
  - combine (e.g., voting, averaging)
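The three steps can be sketched in a few lines. This is plain, non-private bootstrap aggregation; the function name, its parameters and the choice of averaging as the combiner are illustrative assumptions.

```python
import random
import statistics


def bootstrap_aggregate(data, statistic, n_sets, set_size, rng=None,
                        combine=statistics.mean):
    """Resample with replacement, estimate on each resample, then combine."""
    rng = rng or random.Random()
    estimates = [statistic(rng.choices(data, k=set_size))  # uniform, with replacement
                 for _ in range(n_sets)]
    return combine(estimates)


data = list(range(1, 101))
bagged_mean = bootstrap_aggregate(data, statistics.mean, n_sets=200,
                                  set_size=30, rng=random.Random(0))
```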
113 bootstrap intuition
  - a robust statistic should be stable on most reasonably sized subsets of the data
  - if statistic is somewhat unstable, aggregation smooths result
  - if statistic was very stable, loss in precision should be small
114 bootstrap for privacy
  - if function is not low-sensitivity but suspect it's usually stable, not clear how to guarantee DP
  - if aggregation preserves privacy, get DP guarantee even when aggregating non-DP estimates
115 bootstrap for privacy = subsample and aggregate [NRS07]
116 subsample and aggregate: good news
  - can use any DP aggregation function (as long as choice doesn't depend on data)
  - private aggregation just requires adding noise scaled to sensitivity of the aggregation function
  - privacy is immediate!
117 subsample and aggregate: bad news
  - may be difficult to bound worst-case sensitivity of aggregation function
  - default bound is max of its range (quite bad)
  - may be difficult to get good utility guarantees
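A sketch of one simple variant, under assumptions that are not from the slides: the data is split into disjoint blocks by index (so replacing one row changes one block estimate), each estimate is clamped to a known range [lo, hi], and the aggregation is a noisy average. The sensitivity of the average is then the "max of its range" default, (hi - lo) / n_blocks. All names are hypothetical.

```python
import random
import statistics


def laplace_noise(scale, rng):
    """Sample Lap(scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def subsample_and_aggregate(data, statistic, n_blocks, lo, hi, epsilon,
                            rng=None):
    """Noisy average of clamped per-block estimates.

    Assumes len(data) >= n_blocks and that neighbors differ by replacing
    one row, so at most one block estimate changes.
    """
    rng = rng or random.Random()
    blocks = [data[i::n_blocks] for i in range(n_blocks)]  # disjoint blocks
    estimates = [min(hi, max(lo, statistic(b))) for b in blocks]
    average = sum(estimates) / n_blocks
    sensitivity = (hi - lo) / n_blocks  # one block moves by at most hi - lo
    return average + laplace_noise(sensitivity / epsilon, rng)


data = [float(i) for i in range(1000)]
noisy_median = subsample_and_aggregate(data, statistics.median, n_blocks=10,
                                       lo=0.0, hi=1000.0, epsilon=10.0,
                                       rng=random.Random(0))
```

The underlying statistic (here a median) is never analyzed for sensitivity; only the clamped average is. The cost is visible in the noise scale, which depends on the full range hi - lo rather than on how stable the statistic actually is on this data.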
119 subsample and aggregate: applications
  - underlying function might be selecting best model from among set of m options; could aggregate with a noisy max
  - similarly, could output a set of significant features (as for LASSO)
120 expect study robust on actual data
  - idea 1: bootstrapping
  - idea 2: check robustness before proceeding
121 check robustness
  - would like to be able to test in DP manner whether computation should proceed
  - "should": e.g., whether desired function is robust (low-sensitivity) on actual data
  - if not, halt
123 local sensitivity of function f on database D
$$\max_{D'} \|f(D) - f(D')\|_1$$ for $D'$ a neighboring data set of D
measures how much one person can affect output on this data
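For a scalar statistic over a small finite domain, local sensitivity can be computed by brute force, which makes the contrast with worst-case sensitivity concrete. This sketch assumes neighbors differ by replacing one row; the function name and the toy data are illustrative.

```python
import statistics


def local_sensitivity(f, data, domain):
    """Largest change in f when one row of `data` is replaced by any
    value from `domain` (brute force over all neighbors)."""
    base = f(data)
    worst = 0
    for i in range(len(data)):
        for v in domain:
            neighbor = data[:i] + [v] + data[i + 1:]
            worst = max(worst, abs(f(neighbor) - base))
    return worst


domain = range(0, 101)
# Values packed around the median: one person can barely move it.
tight = local_sensitivity(statistics.median, [1, 2, 3, 4, 100], domain)
# Values spread out around the median: one person can move it a lot.
loose = local_sensitivity(statistics.median, [0, 0, 50, 100, 100], domain)
```

Local sensitivity itself depends on the data, so adding noise scaled directly to it is not private on its own; that is the gap the propose-test-release approach addresses.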
127 propose-test-release [DL09]
  - propose a bound on local sensitivity
  - test in DP manner whether actual data satisfies bound
  - if fails, halt
  - if passes, release function with noise tailored to local sensitivity
131 notes on propose-test-release
  - test could be "what is L1 distance to closest database that fails local sensitivity bound?"
  - test only has sensitivity 1
  - can use conservative local sensitivity threshold
  - could still sometimes fail; can't get ε-DP
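A generic sketch of the propose-test-release steps. Two things here are assumptions rather than content of the talk: the caller supplies `distance_to_violation`, a hypothetical function returning the distance from the data to the closest database that fails the proposed bound, and the test compares the noisy distance against the threshold ln(1/δ)/ε, a common choice in textbook treatments. The result is an (ε, δ)-style guarantee with slightly larger ε, not pure ε-DP.

```python
import math
import random


def laplace_noise(scale, rng):
    """Sample Lap(scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def propose_test_release(data, f, proposed_bound, distance_to_violation,
                         epsilon, delta, rng=None):
    """Return a noisy f(data), or None if the private test says halt."""
    rng = rng or random.Random()
    # Test: the distance has sensitivity 1, so Lap(1/epsilon) noise suffices.
    noisy_distance = (distance_to_violation(data, proposed_bound)
                      + laplace_noise(1.0 / epsilon, rng))
    if noisy_distance <= math.log(1.0 / delta) / epsilon:
        return None  # halt: data may be too close to violating the bound
    # Release: noise tailored to the proposed local sensitivity bound.
    return f(data) + laplace_noise(proposed_bound / epsilon, rng)


# Hypothetical distance functions, for illustration only.
far_from_violation = lambda d, b: 10 ** 6
at_violation = lambda d, b: 0

released = propose_test_release([1, 2, 3], sum, 1.0, far_from_violation,
                                epsilon=1.0, delta=1e-6, rng=random.Random(0))
halted = propose_test_release([1, 2, 3], sum, 1.0, at_violation,
                              epsilon=1.0, delta=1e-6, rng=random.Random(0))
```

Computing the distance to the closest violating database is the function-specific and often expensive part; the sketch deliberately leaves it to the caller.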
133 DP output need not be noisy!
PTR can be used to privately check whether distance to nearest unstable data set is far, and if so release the true f(x)
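One concrete instance, sketched under stated assumptions: f is the most common value in a non-empty list, neighbors differ by adding or removing one row, and the gap between the top two counts serves as the distance to the nearest data set on which the answer could change. The gap has sensitivity 1 under that notion of neighbor. The function name and the threshold 1 + ln(1/δ)/ε are illustrative choices, not taken from the talk.

```python
import math
import random
from collections import Counter


def laplace_noise(scale, rng):
    """Sample Lap(scale) as the difference of two i.i.d. exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def stable_mode(values, epsilon, delta, rng=None):
    """Return the exact mode if a noisy stability test passes, else None.

    Assumes `values` is non-empty.
    """
    rng = rng or random.Random()
    ranked = Counter(values).most_common(2)
    top_value, top_count = ranked[0]
    second_count = ranked[1][1] if len(ranked) > 1 else 0
    gap = top_count - second_count  # rows to add/remove before a tie
    noisy_gap = gap + laplace_noise(1.0 / epsilon, rng)
    if noisy_gap > 1.0 + math.log(1.0 / delta) / epsilon:
        return top_value  # released without any noise on the answer itself
    return None


clear_winner = stable_mode(["flu"] * 500 + ["cold"] * 20, epsilon=1.0,
                           delta=1e-6, rng=random.Random(0))
tie = stable_mode(["a"] * 10 + ["b"] * 10, epsilon=1.0, delta=1e-6,
                  rng=random.Random(0))
```

The randomness is entirely in the decision to answer; when the mechanism does answer, the output is the true value.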
136 robustness, revisited
  - robustness with respect to adding/removing a few points from dataset
  - robustness with respect to subsampling
138 subsample & aggregate + propose-test-release
[ST13]: can modify subsample & aggregate so that it outputs the true f(x) with high probability when f is subsampling stable on x
shows that DP model selection only increases sample complexity of model selection by O(log(1/δ)/ε)
146 DP statistics
  - connections to robustness; interquartile distance, median, linear regression [DworkLei09]
  - M-estimators [Lei11, NekipelovYakovlev11]
  - for almost any estimator that is asymptotically normal on i.i.d. samples, DP adds asymptotically no additional perturbation [Smith11]
  - convergence rate of DP estimators tied to gross error sensitivity [ChaudhuriHsu12]
  - minimizing convex loss functions [ChaudhuriMonteleoniSarwate11, Rubinstein et al. 2012, KiferSmithThakurta12]
  - model selection [SmithThakurta13]
  - empirical investigations [VuSlavkovic09, ChaudhuriMonteleoniSarwate11, AbowdSchneiderVilhuber13]
150 I need to look at the data before I know what statistics to run.
  - not DP
  - interactive (or hybrid interactive/noninteractive) mechanisms?
  - big data force us to formalize looking at the data
156 summary
  - privacy easy to get wrong; DP provides compelling definition and useful dose of paranoia
  - powerful tools exist (some with no cost of privacy, and some with no noise!)
  - powerful intuition from notions of robustness
  - many nearly ready (and quite relevant) to common big data applications
  - no ready-to-use, commercial-grade applications: need demand!
157 Differential Privacy Tutorial, Katrina Ligett. "Privacy and Data-Based Research," with Ori Heffetz. Available on SSRN. 68
