A Statistician s View of Big Data

Similar documents
Molecular descriptors and chemometrics: a powerful combined tool for pharmaceutical, toxicological and environmental problems.

Ph.D. in Bioinformatics and Computational Biology Degree Requirements

Statistics for BIG data

Principles of Data Mining by Hand&Mannila&Smyth

Data Visualization in Cheminformatics. Simon Xi Computational Sciences CoE Pfizer Cambridge

PreciseTM Whitepaper

CLUSTER ANALYSIS WITH R

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Cross Validation. Dr. Thomas Jensen Expedia.com

Introduction To Real Time Quantitative PCR (qpcr)

UKB_WCSGAX: UK Biobank 500K Samples Genotyping Data Generation by the Affymetrix Research Services Laboratory. April, 2015

Identification algorithms for hybrid systems

Multi-GPU Load Balancing for Simulation and Rendering

ANALYTICS CENTER LEARNING PROGRAM

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Introducing Oracle Crystal Ball Predictor: a new approach to forecasting in MS Excel Environment

Next Generation Sequencing: Adjusting to Big Data. Daniel Nicorici, Dr.Tech. Statistikot Suomen Lääketeollisuudessa

Statistical issues in the analysis of microarray data

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

KNIME Enterprise server usage and global deployment at NIBR

How To Run Statistical Tests in Excel

Basic Analysis of Microarray Data

Analysis of gene expression data. Ulf Leser and Philippe Thomas

Data Mining mit der JMSL Numerical Library for Java Applications

AGILENT S BIOINFORMATICS ANALYSIS SOFTWARE

The Data Mining Process

TEACHING OF STATISTICS IN KENYA. John W. Odhiambo University of Nairobi Nairobi

TIBCO Spotfire Helps Organon Bridge the Data Gap Between Basic Research and Clinical Trials

Factors for success in big data science

One Statistician s Perspectives on Statistics and "Big Data" Analytics

MD - Data Mining

Statistics in Applications III. Distribution Theory and Inference

Molecular Genetics: Challenges for Statistical Practice. J.K. Lindsey

Lecture 3: Linear methods for classification

Gene Expression Analysis

Machine Learning and Data Mining. Fundamentals, robotics, recognition

Discovery and Quantification of RNA with RNASeq Roderic Guigó Serra Centre de Regulació Genòmica (CRG)

Simple Predictive Analytics Curtis Seare

Application of Predictive Analytics for Better Alignment of Business and IT

HYPOTHESIS TESTING: CONFIDENCE INTERVALS, T-TESTS, ANOVAS, AND REGRESSION

TOWARD BIG DATA ANALYSIS WORKSHOP

Building risk prediction models - with a focus on Genome-Wide Association Studies. Charles Kooperberg

Business Analytics and Credit Scoring

Statistics Graduate Courses

An Introduction to Genomics and SAS Scientific Discovery Solutions

How many of you have checked out the web site on protein-dna interactions?

Local classification and local likelihoods

Learning outcomes. Knowledge and understanding. Competence and skills

Nathan Brown. The Application of Consensus Modelling and Genetic Algorithms to Interpretable Discriminant Analysis.

White Paper. Redefine Your Analytics Journey With Self-Service Data Discovery and Interactive Predictive Analytics

Statistical Models in Data Mining

FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem

Classification and Regression by randomforest

Course Requirements for the Ph.D., M.S. and Certificate Programs

Classification Problems

Predictive modelling around the world

Adequacy of Biomath. Models. Empirical Modeling Tools. Bayesian Modeling. Model Uncertainty / Selection

Predictive Modeling in Workers Compensation 2008 CAS Ratemaking Seminar

Package empiricalfdr.deseq2

Machine Learning for Medical Image Analysis. A. Criminisi & the InnerEye MSRC

Predictive Modeling Techniques in Insurance

Go where the biology takes you. Genome Analyzer IIx Genome Analyzer IIe

Predictive Data modeling for health care: Comparative performance study of different prediction models

University of Glasgow - Programme Structure Summary C1G MSc Bioinformatics, Polyomics and Systems Biology

VALIDATION OF ANALYTICAL PROCEDURES: TEXT AND METHODOLOGY Q2(R1)

Introduction to Data Mining

HT2015: SC4 Statistical Data Mining and Machine Learning

On the effect of data set size on bias and variance in classification learning

Statistical Machine Learning

KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics

Use advanced techniques for summary and visualization of complex data for exploratory analysis and presentation.

Why do statisticians "hate" us?

System Informatics: Combine Engineering Knowledge and Big Data for Model Building and Decision Making

Proposal for New Program: BS in Data Science: Computational Analytics

Data Analysis on the ABI PRISM 7700 Sequence Detection System: Setting Baselines and Thresholds. Overview. Data Analysis Tutorial

Ira J. Haimowitz Henry Schwarz

Pipeline Pilot Enterprise Server. Flexible Integration of Disparate Data and Applications. Capture and Deployment of Best Practices

Micromyx. Micromyx. A Microbiology Services Company. Lab Services Research - Consulting -

Introduction to transcriptome analysis using High Throughput Sequencing technologies (HTS)

CHAPTER 1 INTRODUCTION

APPLYING DATA MINING TECHNIQUES TO FORECAST NUMBER OF AIRLINE PASSENGERS

Essentials of Real Time PCR. About Sequence Detection Chemistries

DATA MINING WITH DIFFERENT TYPES OF X-RAY DATA

Chapter Managing Knowledge in the Digital Firm

8. KNOWLEDGE BASED SYSTEMS IN MANUFACTURING SIMULATION

Data Mining - Evaluation of Classifiers

Transcription:

A Statistician s View of Big Data Max Kuhn, Ph.D (Pfizer Global R&D, Groton, CT) Kjell Johnson, Ph.D (Arbor Analytics, Ann Arbor MI)

What Does Big Data Mean? The advantages and issues related to Big Data can be broken down into two areas: Big N : more samples or cases Big P: more variables or attributes or fields Mostly, I think the catchphrase is associated more with N than P. Does Big Data solve my problems? Maybe 1 1 the basic answer given by every statistician throughout time Max Kuhn (Groton CT) Statistician s View 2 / 23

Can You Be More Specific? It depends on what are you using it for? does it solve some unmet need? does it get in the way? Basically, it comes down to: Bigger Data Better Data at least not necessarily. Max Kuhn (Groton CT) Statistician s View 3 / 23

Big N - Interpolation One situation where it probably doesn t help when samples are added within the mainstream of the data In effect, we are just filling in the predictor space by increasing the granularity. After the first 10K observations, the model will not change very much and it becomes a game of nm. This does pose an interesting interaction within the variance bias trade off. Big N goes a long way to reducing the model variance. Given this, can high variance/low bias models be improved? Max Kuhn (Groton CT) Statistician s View 4 / 23

Variance Bias Trade Off Maybe. Many high(ish) variance/low bias models tend to be very complex and computationally demanding. Adding more data allows these models to more accurately reflect the complexity of the data but would require specialized solutions to be feasible. At this point, the best approach is supplanted by the available approaches (not good). Form should still follow function. Max Kuhn (Groton CT) Statistician s View 5 / 23

Variance Bias Trade Off What about low variance/high bias models? There is some room here for improvement since the abundance of data allows more opportunities for exploratory data analysis to tease apart the functional forms to lower the bias (i.e. improved feature engineering) or to select features. For example, non linear terms for logistic regression models can be parametrically formalized based on the results of spline or loess smoothers. Max Kuhn (Groton CT) Statistician s View 6 / 23

A Simulated Pattern 1 Output 0 1 0 2 4 6 8 10 Input Max Kuhn (Groton CT) Statistician s View 7 / 23

High Bias Model (LOESS) with n = 50 1 Output 0 1 0 2 4 6 8 10 Input Max Kuhn (Groton CT) Statistician s View 8 / 23

High Bias Model (LOESS) with n = 500 1 Output 0 1 0 2 4 6 8 10 Input Max Kuhn (Groton CT) Statistician s View 9 / 23

Lower Bias Model (LOESS) with n = 1K 1 Output 0 1 0 2 4 6 8 10 Input Max Kuhn (Groton CT) Statistician s View 10 / 23

Lower Bias Model (LOESS) with n = 5K 1 Output 0 1 0 2 4 6 8 10 Input Max Kuhn (Groton CT) Statistician s View 11 / 23

Diminishing Returns on N At some point, adding more ( of the same) data does not do any good. Performance stabilizes but computational complexity increases. The modeler becomes hand cuffed to whatever technology is available to handle large amounts of data. Determining what data to use is more important than worrying about the model. Max Kuhn (Groton CT) Statistician s View 12 / 23

Two Examples from Computational Chemistry Pfizer relies on quantitative structural activity relationship (QSAR) models for medicinal chemistry. Assay data for existing compounds are used to create a relationship between the characteristic of interest and molecular descriptors (e.g molecular weight, the number of carbons, etc) When new compounds are designed, we can estimate their characteristics prior to synthesis. We also have existing assay data on large numbers of compounds to build predictive models. Max Kuhn (Groton CT) Statistician s View 13 / 23

Global Versus Local Models When developing a compound, once a hit is obtained, the chemists begin to tweak the structure to make improvements. The leads to a chemical series We could build QSAR models on the large number of existing compounds (a global model) or on the series of interest (a local model). Our experience is that local models beat global models the majority of the time. Here, fewer (of the most relevant) compounds is better. Our first inclination is to use all the data because our (overall) error rate should get better. Like politics, All Problems Are Local. Max Kuhn (Groton CT) Statistician s View 14 / 23

Data Quality One Tier 1 screen is an assay for logp (the partition coefficient) which we use as a measure of greasyness. Tier 1 means that logp is estimated for most compounds (via model and/or assay) There was an existing, high throughput assay on a large number of historical compounds. However, the data quality was poor. Several years ago, a new assay was developed that was lower throughput and higher quality. This became the default assay for logp. The model was re-built on a small (1K) set chemically diverse compounds. In the end, fewer compounds were assayed but the model performance was much better and costs were lowered. Max Kuhn (Groton CT) Statistician s View 15 / 23

Big N and/or P - Reducing Extrapolation However, Big N might start to sample from rare populations. For example: a customer cluster that is less frequent but have high profitability (via Big N ). a specific mutation or polymorphism that helps derive a new drug target (via Big N and Big P) highly non linear activity cliffs in computational chemistry can be elucidated (via Big N ) Now, we have the ability to solve some unmet need. Max Kuhn (Groton CT) Statistician s View 16 / 23

Big P is Most Important I believe that we have reached a plateau in machine learning. Breakthroughs in predictive performance are not likely to be a result of new algorithms. Real progress will be made by improving the data by enriching the information content available or reducing noise. Max Kuhn (Groton CT) Statistician s View 17 / 23

Example RNAseq Traditional RNA expression profiling revolves around hybridization arrays (e.g. Affymetrix) These assays are very noisy, static and bias to the 3 end. RNAseq applies sequencing technology to RNA transcripts and, as a consequence: digital estimates of RNA abundance are generated the transcript library is no longer static and there is no 3 bias the complexity of the data is drastically increased Max Kuhn (Groton CT) Statistician s View 18 / 23

Affy 3 Bias Max Kuhn (Groton CT) Statistician s View 19 / 23

Example RNAseq The good: the sensitivity/resolution is better than previous technologies we can now effectively find splice variants and other genetic components the measurement system error can be dramatically improved The possibly bad: the size of the data may let the tool drive the analysis may scientists barely have enough time to truly capitalize on the existing, lower resolution data Big N may be required to really make break throughs due to low event rates Max Kuhn (Groton CT) Statistician s View 20 / 23

Big N May Enable True Bayesian Machine Learning Pr[Y X ] = Pr[X Y ]Pr[Y ] Pr[X ] The are no Bayesians in foxholes (Breiman, 1977) The biggest issue here is Pr[X ] (and it s conditional). We usually make ridiculous assumptions about the multivariate densities (e.g. LDA, QDA, naive Bayes etc) because we have to. With a good N :P ratio, maybe we can focus more on prediction uncertainty and other issues via Bayesian methods. Max Kuhn (Groton CT) Statistician s View 21 / 23

Big N May Enable Effective Biological Networks A good percentage of biological networks are derived from data with P >>>> N Completely over determined systems lead to severe ad hockery Lack of application of good high dimensional methodologies (e.g. regularization). This is a good example of where Big N can really offer a contribution. High Content Screening via Flow Cytometry using cellular data (instead of wellular data ) dramatically increases N and allows better estimates of networks. The nature of this technology reduces P significantly (compared to an array) but the unintended side effect of this is that the researcher usually thinks about what they need to measure. The problem is that the N here isn t really the N we need (i.e. repeated experimentation) and has the potential to over fit to the data. Max Kuhn (Groton CT) Statistician s View 22 / 23

Summary I think our first inclination is to go big because It s cool I could write a few papers/packages More is better, right? The availability of Big Data should be a trigger to really re-evaluate what we are trying to solve and why this will help. Thanks to Martin, for the invitation to speak and the stimulating discussions Vini Bonato, for the Affy probe visualization Max Kuhn (Groton CT) Statistician s View 23 / 23