The Peril of Vast Search
and how Target Shuffling can Save Science
John F. Elder, Ph.D.
elder@datamininglab.com
@johnelder4
300 West Main Street, Suite 301
Charlottesville, Virginia 22903
434-973-7673
www.datamininglab.com
Overview
- Crisis in Epidemiology (study of health causes & effects), or generally, in learning from Observational Studies
- Vast Search Effect problem
- Placebo is a worthy foe (performance baseline)
- Target Shuffling: measure the "placebo effect" of building a model from a database
- Simple Examples: Investment timing; Oil & Gas production
- Complex Examples: Customer response; Medical recommendations
Crisis of False Research Findings
- Amgen could replicate only 6 of 58 studies
- Bayer HealthCare replicated only 25% of 67 studies
- BMJ: 92% of 1,500 referees missed serious errors
- 157 of 304 journals accepted the fake Bohannon paper
- Stan Young examined controlled experiments trying to replicate 12 data discoveries: 0 replicated; 7 neutral; 5 reversed
xkcd: Significance
Placebo is a worthy foe (baseline result)
- Stronger when it has side effects
Target Shuffling
On the training data:
1. Build a model to predict the target variable, and note its strength (e.g., R-squared, lift, correlation, explanatory power).
2. Randomly shuffle the target vector to break the relationship between each output and its vector of inputs.
3. Search for a new best model or most interesting result, and save its strength. (Don't save the meaningless model.)
4. Repeat steps 2 and 3 often, and create a distribution of the strengths of the Best Apparent Discoveries (BADs).
5. Evaluate where your true results (from step 1) are on (or beyond) this BAD distribution. This is its significance, or the probability that a result as strong as it can occur by chance.
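The five steps can be sketched in Python as follows. This is a minimal illustration, not the tooling used for the examples in this talk. It assumes the "search" is simply picking the single input with the strongest absolute correlation to the target; the helper names (`correlation`, `best_strength`, `target_shuffle`) and the noise data in the demo are hypothetical.

```python
import random


def correlation(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5


def best_strength(features, target):
    """Stand-in for the model search: the strongest |correlation|
    between any single input column and the target."""
    return max(abs(correlation(col, target)) for col in features)


def target_shuffle(features, target, n_shuffles=200, seed=0):
    """Return (actual strength, list of BAD strengths, proportion of
    shuffles whose best result was at least as strong as the actual one)."""
    rng = random.Random(seed)
    actual = best_strength(features, target)             # step 1
    shuffled = list(target)
    bads = []
    for _ in range(n_shuffles):                          # step 4
        rng.shuffle(shuffled)                            # step 2
        bads.append(best_strength(features, shuffled))   # step 3: keep only the strength
    p_value = sum(b >= actual for b in bads) / n_shuffles  # step 5
    return actual, bads, p_value


if __name__ == "__main__":
    # Hypothetical data: 40 pure-noise inputs and a pure-noise target.
    rng = random.Random(42)
    n_rows, n_features = 30, 40
    features = [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_features)]
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    actual, bads, p = target_shuffle(features, target)
    print(f"best |r| found by search: {actual:.2f}")
    print(f"share of shuffles with a BAD at least as strong: {p:.2f}")
```

The key design point is that the whole search is repeated inside the loop, so the BAD distribution reflects how strong a result the search process can produce from data with no real relationship.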
Analogy: Students get back someone else's test score
Stock Trading System Example (starting mid-90s)
Target Shuffling Code Example: Evaluate the quality of an investment timing strategy

READ file fund_1yr date position return
MULTIPLY position return trade
SUM trade original
PRINT original
REPEAT 1000
  SHUFFLE position pos
  MULTIPLY pos return trade
  SUM trade total
  SCORE total Z
END
HISTOGRAM Z
COUNT Z > original better
DIVIDE better 1000 prop_bet
PRINT prop_bet

15 of 1,000 were better = 1.5% chance of chance
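For readers without a resampling package, the same test can be approximated in Python. This is a rough port under stated assumptions: `positions` and `returns` are equal-length lists (one entry per date), the function name `timing_shuffle_test` is hypothetical, and the demo uses made-up random data rather than the fund_1yr file.

```python
import random


def timing_shuffle_test(positions, returns, n_shuffles=1000, seed=0):
    """Shuffle the positions against the returns and report how often
    a shuffled strategy beats the original one."""
    rng = random.Random(seed)
    # Total return of the strategy as actually traded.
    original = sum(p * r for p, r in zip(positions, returns))
    shuffled = list(positions)
    totals = []
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break the link between position and date
        totals.append(sum(p * r for p, r in zip(shuffled, returns)))
    # Count the shuffles that did strictly better than the original.
    better = sum(t > original for t in totals)
    return original, totals, better / n_shuffles


if __name__ == "__main__":
    # Hypothetical data: 250 daily returns and random in/out positions.
    rng = random.Random(7)
    returns = [rng.gauss(0.0005, 0.01) for _ in range(250)]
    positions = [rng.choice([0, 1]) for _ in range(250)]
    original, totals, prop_better = timing_shuffle_test(positions, returns)
    print(f"original total return: {original:.4f}")
    print(f"proportion of shuffles that did better: {prop_better:.3f}")
```

Shuffling the positions rather than drawing new ones keeps the number of days in and out of the market fixed, so only the timing is being tested.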
Gas Production (recent work at ERI)
Gas Production: Lift (cumulative gains)
Gas Production: 95% chance intervals
Data Cube search
See our writeup on Orange Cars (datamininglab.com)
Summary
- We love stories and will believe anything. So interpretability is no protection against error.
- Science requires replication & transparency.
- 65-95% of health discovery papers are false, due to the vast search effect (multiple comparisons).
- Resampling (e.g., cross-validation) grades fairly.
- Target Shuffling measures the placebo effect of the data * modeling process.
- Add TS to your arsenal to better find truth!
John F. Elder IV
Founder & CEO, Elder Research, Inc.
Dr. John Elder heads the USA's largest and most experienced data mining consulting team. Founded in 1995, Elder Research, Inc. has offices in Charlottesville, Virginia; Washington, DC; and Baltimore, Maryland (www.datamininglab.com). ERI focuses on Federal, commercial, and investment applications of advanced analytics, including text mining, credit scoring, process optimization, cross-selling, drug efficacy, market timing, and fraud detection. John earned Electrical Engineering degrees from Rice University, and a PhD in Systems Engineering from the University of Virginia, where he's an adjunct professor teaching Optimization or Data Mining. Prior to 20 years at ERI, he spent 5 years in aerospace defense consulting, 4 heading research at an investment management firm, and 2 in Rice's Computational & Applied Mathematics department. Dr. Elder has authored innovative data mining tools, is a frequent keynote speaker, and has chaired International Analytics conferences. John was honored to serve for 5 years on a panel appointed by President Bush to guide technology for National Security. His book with Bob Nisbet and Gary Miner, Handbook of Statistical Analysis & Data Mining Applications, won the PROSE award for top book in Mathematics for 2009. His book with Giovanni Seni, Ensemble Methods in Data Mining, was published in 2010, and his book with colleague Andrew Fast and 4 others on Practical Text Mining won the 2012 PROSE award for Computer Science. John is grateful to be a follower of Christ and father of 5.