Xiao-Li Meng Department of Statistics, Harvard University Thanks to many students and colleagues
Transcription
1 Statistical Paradises and Paradoxes in Big Data Xiao-Li Meng Department of Statistics, Harvard University Thanks to many students and colleagues 1
2 Paradises Much larger general pipeline: Statistics Concentration (Major) Size at Harvard College Much better airplane conversations Golden era for methodological research Emerging theoretical foundations 2
3 Berkeley Group: Integrating Stat/Prob with CS, ML, IS, and Math Rigorous theory of the trade-off between statistical and computational efficiency, under confidentiality, etc., based on classical statistical decision theory. Wide-ranging statistical machine learning theory, methodology, algorithms, using empirical process, signal processing & information theory (e.g., MDL principle). Automated Targeted Learning and Super Learning built upon well-established semiparametric and nonparametric theory. Algebraic statistics, e.g., studying statistical hypothesis testing via algebraic geometry and computational and combinatorial techniques. 3
4 BFF group: Integrating Bayes, Frequentist, and Fiducial perspectives Fusion learning via confidence distributions (CD) Combining results from multiple analyses under possibly different perspectives 4
5 Jianqing Fan's Group (Princeton): Bringing statistical theory and methods to the forefront of Big Data Fan et al. (2014) Challenges of Big Data Analysis National Science Review (China) 1: Salient features of Big Data Heterogeneity (Individuality) Noise accumulation Spurious correlation Incidental endogeneity FanBigDataReview.pdf 5
6 Great Promises and Grand Challenges Multi-Resolution Inference Multi-Phase Inference Multi-Source Inference o Meng (2014) A Trio of Inference Problems That Could Win You a Nobel Prize in Statistics (if you help fund it). COPSS 50th Anniversary Volume. o Blocker and Meng (2013) The Potential and Perils of Preprocessing: Building New Foundations. Bernoulli, 19, o Xie and Meng (2016) Dissecting Multiple Imputation from a Multiphase Inference Perspective: What Happens When God's, Imputer's and Analyst's Models are Uncongenial? (With discussion). Statistica Sinica, to appear. 6
7 OnTheMap Project of US Census Bureau Developed by LED (Local Employment Dynamics). Users zoom into any region of the US for paired employee-employer information. Used diverse data sources: surveys and administrative datasets with confidential information. Thanks to Jeremy Wu of C. B. 7
8 Multi-Resolution 8
9 Multi-Phase To protect confidentiality, the displayed data are synthetic: draws from a posterior. Each data source itself has gone through multiple clean up processes, most of which are gray boxes or even 9
10 Multi-Source Built from more than 20 data sources in the LEHD (Longitudinal Employer-Household Dynamics) system. Survey Samples: Monthly survey of 60,000 households covering only 0.05% of households. Administrative Records: Unemployment insurance wage records covering more than 90% of the US workforce; Never intended for inference purposes. Census Data: Quarterly census of earnings and wages covering 98% of US jobs. 10
11 A Trio of NP-Hard Inference Problems Multi-Resolution: How do we infer estimands with resolution far exceeding any possible estimators? Is it possible for such inference to be qualitatively robust even if it cannot be quantitatively robust? Multi-Phase: (Big) Data are almost never collected, preprocessed, and analyzed in a single phase. What theory and methods accommodate this multi-phase setup? Multi-Source: Which one is better: a survey sample covering 1% or an administrative record covering 95% of the population? How should we combine information from these sources? Is it worth combining? 11
12 So which one is better for estimating the population mean: a 1% simple random sample (SRS) or a 95% administrative (observational) dataset (AD)? 1. 1% SRS 2. 95% AD 3. It depends! 4. Is this a trick question? 12
13 A fundamental principle of statistics: Variance-Bias Tradeoff
Total Error = Variance + Bias$^2$
Probabilistic SRS: $\frac{1-f_s}{n_s}\,S^2$
Large non-prob data: $0 + r^2\,\frac{1-f_a}{f_a}\,S^2$
$f$ is the fraction of the population: $f = n/N$
$r$ is the correlation between the (honest) responded/recorded value $X$ and the probability of response/recording, $P(X)$
Big Data Paradox: the larger the data, the more pronounced the bias
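The two error expressions on this slide can be compared numerically. The following is a minimal sketch, not the speaker's own computation: it takes S^2 = 1 (it cancels in the comparison) and assumes a population of N = 320 million, a figure inferred from the next slide's "5% (16M)" rather than stated here. The function names are hypothetical.

```python
import math

def mse_srs(f_s: float, N: float, S2: float = 1.0) -> float:
    """Error of an SRS mean: [(1 - f_s) / n_s] * S^2, with n_s = f_s * N."""
    n_s = f_s * N
    return (1.0 - f_s) / n_s * S2

def mse_nonprob(r: float, f_a: float, S2: float = 1.0) -> float:
    """Error of a large non-probability dataset: r^2 * [(1 - f_a) / f_a] * S^2."""
    return r ** 2 * (1.0 - f_a) / f_a * S2

def breakeven_r(f_s: float, f_a: float, N: float) -> float:
    """Value of |r| at which the two errors are equal."""
    return math.sqrt(mse_srs(f_s, N) * f_a / (1.0 - f_a))

N = 320e6  # assumed population size (not given on this slide)
r_star = breakeven_r(f_s=0.01, f_a=0.95, N=N)
print(f"95% AD beats 1% SRS only if |r| < {r_star:.4f}")
```

Under these assumptions the 95% administrative dataset is more accurate than the 1% SRS only when |r| is below roughly 0.0024, which is consistent with the "It depends!" option in the preceding poll.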
14 For estimating a population mean, if r=0.1, how large does an AD, as a percentage of the US population, need to be in order to produce a more accurate sample average than an SRS with n=100 does? 1. <0.5% (1.6M) 2. 5% (16M) 3. 10% (32M) 4. 20% (64M) 5. 50% (160M) 6. 75% (240M) 7. 90% (288M) 8. >95% (303M) 14
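The poll question can be worked out from the variance-bias expressions on the previous slide by setting r^2 (1 - f_a)/f_a <= 1/n and solving for f_a, which gives f_a >= r^2 / (r^2 + 1/n). The sketch below assumes the SRS finite-population correction is negligible (n = 100 against a population of hundreds of millions) and that S^2 cancels; the function name is hypothetical.

```python
def min_ad_fraction(r: float, n_srs: int) -> float:
    """Smallest fraction f_a for which r^2 * (1 - f_a) / f_a <= 1 / n_srs.

    Assumes the SRS finite-population correction is ignored and S^2 cancels.
    """
    srs_error = 1.0 / n_srs
    return r ** 2 / (r ** 2 + srs_error)

f_needed = min_ad_fraction(r=0.1, n_srs=100)
print(f"AD must cover at least {f_needed:.0%} of the population")
```

With r = 0.1 and n = 100 both terms equal 0.01, so the administrative dataset has to cover about half of the population before its sample average matches an SRS of only 100 people.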
15 Big Data: Big Size or Big Fraction? Size matters, but only after having quality. Importance of combining non-probabilistic samples with probabilistic ones, however small the latter are. More does NOT guarantee better: I got more data, my model is more refined, but my estimator is getting worse! Am I just dumb? (Meng and Xie, 2014, Econometric Reviews) 15
16 So when/why do we need Big Data? Individualized treatments (e.g., medical; educational; marketing; news) Inference/prediction with very weak signal to noise ratio (e.g., climate change) Understand deeply connected (spatial) networks and (temporal) dynamics 16
17 What does Big Data mean for you? We see you and others more clearly 2015/11/1 17
18 Gift: Treatment for you based only on data from people like you. Curse: No one is perfectly like you. 18
19 Personalized Treatment: Sounds heavenly, but where on Earth did they find the right guinea pig for me? Liu and Meng (2014) A Fruitful Resolution to Simpson's Paradox via Multi-Resolution Inference, The American Statistician. 19
20 A Painful Problem 20
21 Kidney Stone Treatment
C. R. Charig, D. R. Webb, S. R. Payne, O. E. Wickham (March 1986) Br Med J (Clin Res Ed) 292 (6524):

              Overall         Small Stone     Large Stone
Treatment A   78% (273/350)   93% (81/87)     73% (192/263)
Treatment B   83% (289/350)   87% (234/270)   69% (55/80)

A: Open Surgery; B: Percutaneous Nephrolithotomy
22 Treatment A: 73% successful (Large Stones), 93% successful (Small Stones), 78% successful (Overall). Treatment B: 69% successful (Large Stones), 87% successful (Small Stones), 83% successful (Overall). Uneven distribution of stone sizes across treatments makes overall success rate misleading. 22
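The reversal in the kidney stone data can be checked directly from the success counts quoted on the Kidney Stone Treatment slide. This is a small illustrative sketch; the dictionary layout and function names are hypothetical, and only the counts come from the slides.

```python
# (successes, patients) per treatment and stone size, as quoted on the slide
counts = {
    "A": {"small": (81, 87), "large": (192, 263)},
    "B": {"small": (234, 270), "large": (55, 80)},
}

def rate(successes: int, total: int) -> float:
    """Success rate as a fraction."""
    return successes / total

def overall(treatment: str) -> tuple[int, int]:
    """Aggregate successes and patients over both stone sizes."""
    successes = sum(s for s, _ in counts[treatment].values())
    total = sum(n for _, n in counts[treatment].values())
    return successes, total

for t in counts:
    by_size = {size: f"{rate(*c):.0%}" for size, c in counts[t].items()}
    print(t, by_size, "overall", f"{rate(*overall(t)):.0%}")
```

Treatment A has the higher success rate within each stone size, yet the lower rate once the two sizes are pooled, because A was used mostly on large stones (263 of 350 patients) and B mostly on small ones (270 of 350).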
23 Simpson's Paradox Dealing with the disparities between aggregated analysis and disaggregated analyses Determining the right level (primary resolution) for analysis Understanding the bias-variance (relevance-robustness) trade-off 23
24 So what would be the right resolution? Let's take a CarTalk challenge (7/111/2015) 24
25 From CarTalk: You tested positive for D on a test with 95% accuracy. What's the chance you actually have D, given that the prevalence of D is 0.1%? 1. 1-5% 2. 5-10% 3. 10-25% 4. 25-50% 5. 50-75% 6. 75-95% 7. Could be anything 8. I have no idea. 25
26 It could be anything depending on the meaning of accuracy and
Need to know how accurate the test is among those with no disease (specificity) AND among those with the disease (sensitivity).
The probability could be 1 if specificity = 100%.
For a rare disease, overall accuracy ~ specificity.
Then the answer is less than 2%, if this was a random screening test.
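27 100,000 People for Screening
0.1% have D: 100 D, of whom 95% test positive (95 pos) and 5% test negative (5 neg)
99.9% have no D: 99,900 no D, of whom 5% test positive (4,995 pos) and 95% test negative (94,905 neg)
95/(95+4,995) = 1.87%
1,000 with Symptoms
10% have D: 100 D, of whom 95% test positive (95 pos) and 5% test negative (5 neg)
90% have no D: 900 no D, of whom 5% test positive (45 pos) and 95% test negative (855 neg)
95/(95+45) = 67.9%
Conditioning is the Soul of Statistics --- Joe Blitzstein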
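The two tree calculations are instances of Bayes' theorem with different prevalences. The sketch below reproduces them under the assumption, taken from the tree, that "95% accuracy" means both sensitivity and specificity equal 95%; the function name is hypothetical.

```python
def prob_disease_given_positive(prevalence: float,
                                sensitivity: float,
                                specificity: float) -> float:
    """P(D | positive test) by Bayes' theorem."""
    true_pos = sensitivity * prevalence                   # P(pos and D)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)  # P(pos and no D)
    return true_pos / (true_pos + false_pos)

# Random screening: prevalence 0.1%
screening = prob_disease_given_positive(0.001, 0.95, 0.95)
# People who already show symptoms: prevalence 10%
symptomatic = prob_disease_given_positive(0.10, 0.95, 0.95)

print(f"screening:   {screening:.2%}")
print(f"symptomatic: {symptomatic:.1%}")
```

The same test result carries very different meaning depending on which population the tested person is conditioned on, which is the point of the quotation closing the slide.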
28 Bayes' Theorem "When the facts change, I change my opinion. What do you do, sir?" ~ John Maynard Keynes 28
29 Useful Statistical Principles/Concepts for Data Science
Data Selection and Replication Mechanisms: Randomization, sampling, experiments, observational studies, missing data mechanisms; latent variable/constructs; potential outcome; confidentiality protections
Conditioning vs. Marginalizing: Disaggregation vs. aggregation, sub-population analysis, individualized inference, Simpson's paradox, ecological fallacy
Bias-Variance Trade-off: Efficiency vs. Robustness, Relevance vs. Robustness; model predictability vs. fitness
Inference principles/perspectives: Likelihood principle; Bayesian thinking; fiducial argument for objectivity; uncertainty quantifications.
30 A Traditional Statistical Theme/Aim: Seeking representative samples to infer about populations. A Big-Data Statistical Theme/Aim: Constructing approximating populations to infer about individuals. (Figure labels: Targeted Individual; Approx. Population) 30
31 One more V for Big Data: Veracity 31
32 I find your presentation 1. Inspiring and thought provoking 2. informative and I learned a few things 3. confusing and not very helpful 4. what a waste of my time! 32
More informationKnowledge Discovery and Data Mining. Structured vs. Non-Structured Data
Knowledge Discovery and Data Mining Unit # 2 1 Structured vs. Non-Structured Data Most business databases contain structured data consisting of well-defined fields with numeric or alphanumeric values.
More informationHT2015: SC4 Statistical Data Mining and Machine Learning
HT2015: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Bayesian Nonparametrics Parametric vs Nonparametric
More informationAuxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus
Auxiliary Variables in Mixture Modeling: 3-Step Approaches Using Mplus Tihomir Asparouhov and Bengt Muthén Mplus Web Notes: No. 15 Version 8, August 5, 2014 1 Abstract This paper discusses alternatives
More informationPredicting the Performance of a First Year Graduate Student
Predicting the Performance of a First Year Graduate Student Luís Francisco Aguiar Universidade do Minho - NIPE Abstract In this paper, I analyse, statistically, if GRE scores are a good predictor of the
More informationCriminal Justice Evaluation Framework (CJEF): Conducting effective outcome evaluations
Criminal Justice Research Department of Premier and Cabinet Criminal Justice Evaluation Framework (CJEF): Conducting effective outcome evaluations THE CRIMINAL JUSTICE EVALUATION FRAMEWORK (CJEF) The Criminal
More informationNon Parametric Inference
Maura Department of Economics and Finance Università Tor Vergata Outline 1 2 3 Inverse distribution function Theorem: Let U be a uniform random variable on (0, 1). Let X be a continuous random variable
More informationExample: Find the expected value of the random variable X. X 2 4 6 7 P(X) 0.3 0.2 0.1 0.4
MATH 110 Test Three Outline of Test Material EXPECTED VALUE (8.5) Super easy ones (when the PDF is already given to you as a table and all you need to do is multiply down the columns and add across) Example:
More informationICT Perspectives on Big Data: Well Sorted Materials
ICT Perspectives on Big Data: Well Sorted Materials 3 March 2015 Contents Introduction 1 Dendrogram 2 Tree Map 3 Heat Map 4 Raw Group Data 5 For an online, interactive version of the visualisations in
More informationSTATISTICAL DATA ANALYSIS
STATISTICAL DATA ANALYSIS INTRODUCTION Fethullah Karabiber YTU, Fall of 2012 The role of statistical analysis in science This course discusses some statistical methods, which involve applying statistical
More information1 Maximum likelihood estimation
COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N
More informationInteractive Analytical Processing in Big Data Systems,BDGS: AMay Scalable 23, 2014 Big Data1 Generat / 20
Interactive Analytical Processing in Big Data Systems,BDGS: A Scalable Big Data Generator Suite in Big Data Benchmarking,Study about DataSet May 23, 2014 Interactive Analytical Processing in Big Data Systems,BDGS:
More informationExample: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C
More informationWhat? So what? NOW WHAT? Presenting metrics to get results
What? So what? NOW WHAT? What? So what? Visualization is like photography. Impact is a function of focus, illumination, and perspective. What? NOW WHAT? Don t Launch! Prevent your own disastrous decisions
More information