The Peril of Vast Search (and How Target Shuffling Can Save Science)




John F. Elder, Ph.D.
elder@datamininglab.com
@johnelder4
300 West Main Street, Suite 301, Charlottesville, Virginia 22903
434-973-7673
www.datamininglab.com

Overview
- Crisis in epidemiology (the study of health causes and effects), or more generally, in learning from observational studies
- The Vast Search Effect problem
- Placebo is a worthy foe (performance baseline)
- Target Shuffling: measure the "placebo effect" of building a model from a database
- Simple examples: investment timing, oil & gas production
- Complex examples: customer response, medical recommendations

Crisis of False Research Findings
- Amgen could replicate only 6 of 58 studies
- Bayer HealthCare replicated only 25% of 67 studies
- BMJ: 92% of 1,500 referees missed serious errors
- 157 of 304 journals accepted a fake paper by Bohannon
- Stan Young examined controlled experiments that tried to replicate 12 data discoveries: 0 replicated, 7 were neutral, 5 reversed

xkcd: Significance
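The comic referenced here shows the vast search effect in miniature: 20 jelly-bean colors are each tested at the 0.05 level, and one comes up "significant". A minimal sketch of the arithmetic follows, assuming the tests are independent and every null hypothesis is true; the function name is illustrative, not from the talk.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance that at least one of n_tests independent null tests
    looks significant at level alpha (assumes independence)."""
    return 1.0 - (1.0 - alpha) ** n_tests


# One test keeps the advertised 5% false-alarm rate; 20 tests do not.
rate_1 = familywise_error_rate(1)
rate_20 = familywise_error_rate(20)
print(f"1 test:   {rate_1:.3f}")
print(f"20 tests: {rate_20:.3f}")
```

With 20 tests the chance of at least one false alarm is roughly 64%, so a single "significant" color is close to the expected outcome of the search rather than a discovery.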

Placebo is a worthy foe (baseline result)
- Stronger when it has side effects

Target Shuffling
On the training data:
1. Build a model to predict the target variable, and note its strength (e.g., R-squared, lift, correlation, explanatory power).
2. Randomly shuffle the target vector to break the relationship between each output and its vector of inputs.
3. Search for a new best model or most interesting result, and save its strength. (Don't save the meaningless model.)
4. Repeat steps 2 and 3 often, and create a distribution of the strengths of the Best Apparent Discoveries (BADs).
5. Evaluate where your true results (from step 1) are on (or beyond) this BAD distribution. This is its significance, or probability that a result as strong as it can occur by chance.
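The five steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the author's implementation: the "model search" is reduced to picking the single input with the largest absolute correlation to the target, the data are synthetic, and all names (`best_strength`, `target_shuffle`, the demo variables) are hypothetical.

```python
import random
from statistics import mean


def correlation(x, y):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def best_strength(inputs, target):
    """Stand-in for the model search: strength of the best single-input model."""
    return max(abs(correlation(column, target)) for column in inputs)


def target_shuffle(inputs, target, n_shuffles=200, seed=0):
    rng = random.Random(seed)
    original = best_strength(inputs, target)          # step 1
    bads = []                                         # Best Apparent Discoveries
    for _ in range(n_shuffles):                       # step 4: repeat
        shuffled = list(target)
        rng.shuffle(shuffled)                         # step 2: break the link
        bads.append(best_strength(inputs, shuffled))  # step 3: re-search, keep strength only
    # Step 5: share of shuffled searches at least as strong as the real one.
    p_chance = sum(b >= original for b in bads) / n_shuffles
    return original, p_chance, bads


# Synthetic demo data (an assumption for illustration): 30 noise inputs, 60 rows.
data_rng = random.Random(1)
inputs = [[data_rng.gauss(0, 1) for _ in range(60)] for _ in range(30)]
signal_target = [2.0 * x + data_rng.gauss(0, 1) for x in inputs[0]]  # real relationship
noise_target = [data_rng.gauss(0, 1) for _ in range(60)]             # no relationship

strength_signal, p_signal, bads_signal = target_shuffle(inputs, signal_target)
strength_noise, p_noise, bads_noise = target_shuffle(inputs, noise_target)
print(f"signal: strength={strength_signal:.2f}, chance probability={p_signal:.3f}")
print(f"noise:  strength={strength_noise:.2f}, chance probability={p_noise:.3f}")
```

The key design point is that the whole search is rerun on every shuffle, so the BAD distribution reflects how strong a result the search can produce from noise, not how strong one fixed model looks.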

Analogy: Students get back someone else's test score

Stock Trading System Example (starting mid-90s)

Target Shuffling Code Example: Evaluate the quality of an investment timing strategy

    READ file fund_1yr date position return
    MULTIPLY position return trade
    SUM trade original
    PRINT original
    REPEAT 1000
      SHUFFLE position pos
      MULTIPLY pos return trade
      SUM trade total
      SCORE total Z
    END
    HISTOGRAM Z
    COUNT Z > original better
    DIVIDE better 1000 prop_bet
    PRINT prop_bet

Result: 15 of 1,000 shuffles were better than the original, i.e., a 1.5% chance that the strategy's result arose by chance.
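For readers without the resampling language used in the listing above, the same procedure can be sketched in Python. This is a hedged translation, not the original program: the `fund_1yr` file is not available, so the demo uses synthetic returns and a deliberately perfect timing rule (both are assumptions), and the function name is hypothetical.

```python
import random


def timing_shuffle_test(position, returns, n_shuffles=1000, seed=0):
    """Shuffle the position column against the returns and count how
    often a shuffled strategy beats the original total."""
    rng = random.Random(seed)
    # MULTIPLY position return trade; SUM trade original
    original = sum(p * r for p, r in zip(position, returns))
    better = 0
    for _ in range(n_shuffles):                       # REPEAT 1000
        pos = list(position)
        rng.shuffle(pos)                              # SHUFFLE position pos
        total = sum(p * r for p, r in zip(pos, returns))
        if total > original:                          # COUNT Z > original
            better += 1
    return original, better / n_shuffles              # DIVIDE better 1000


# Synthetic demo (assumed data): 250 daily returns and a "perfect" timer
# that is invested (1) exactly on the up days and out (0) otherwise.
data_rng = random.Random(42)
returns = [data_rng.gauss(0.0, 0.01) for _ in range(250)]
position = [1 if r > 0 else 0 for r in returns]

original, prop_better = timing_shuffle_test(position, returns)
print(f"original total return: {original:.4f}")
print(f"proportion of shuffles that did better: {prop_better:.3f}")
```

Because the shuffle keeps the same number of invested days, it asks only whether the timing of those days mattered. A perfect timer cannot be beaten by any reshuffling, so its proportion is zero; a real strategy lands somewhere on the shuffled distribution, as in the 1.5% result reported above.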

Gas Production (recent work at ERI)

Gas Production: Lift (cumulative gains)

Gas Production: 95% chance intervals

Data Cube search

See our writeup on Orange Cars (datamininglab.com)

Summary
- We love stories and will believe anything, so interpretability is no protection against error.
- Science requires replication and transparency.
- 65-95% of health discovery papers are false, due to the vast search effect (multiple comparisons).
- Resampling (e.g., cross-validation) grades fairly.
- Target Shuffling measures the placebo effect of the data and modeling process.
- Add TS to your arsenal to better find truth!

John F. Elder IV
Founder & CEO, Elder Research, Inc.

Dr. John Elder heads the USA's largest and most experienced data mining consulting team. Founded in 1995, Elder Research, Inc. has offices in Charlottesville, Virginia; Washington, DC; and Baltimore, Maryland (www.datamininglab.com). ERI focuses on Federal, commercial, and investment applications of advanced analytics, including text mining, credit scoring, process optimization, cross-selling, drug efficacy, market timing, and fraud detection. John earned Electrical Engineering degrees from Rice University, and a PhD in Systems Engineering from the University of Virginia, where he's an adjunct professor teaching Optimization or Data Mining. Prior to 20 years at ERI, he spent 5 years in aerospace defense consulting, 4 heading research at an investment management firm, and 2 in Rice's Computational & Applied Mathematics department. Dr. Elder has authored innovative data mining tools, is a frequent keynote speaker, and has chaired International Analytics conferences. John was honored to serve for 5 years on a panel appointed by President Bush to guide technology for National Security. His book with Bob Nisbet and Gary Miner, Handbook of Statistical Analysis & Data Mining Applications, won the PROSE award for top book in Mathematics for 2009. His book with Giovanni Seni, Ensemble Methods in Data Mining, was published in 2010, and his book with colleague Andrew Fast and 4 others on Practical Text Mining won the 2012 PROSE award for Computer Science. John is grateful to be a follower of Christ and father of 5.