Why do statisticians "hate" us?



Similar documents
Data Mining Practical Machine Learning Tools and Techniques

Principles of Dat Da a t Mining Pham Tho Hoan hoanpt@hnue.edu.v hoanpt@hnue.edu. n

Data, Measurements, Features

Data Mining: Overview. What is Data Mining?

EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.

Perspectives on Data Mining

Statistics 215b 11/20/03 D.R. Brillinger. A field in search of a definition a vague concept

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health

Some Essential Statistics The Lure of Statistics

Cross Validation. Dr. Thomas Jensen Expedia.com

The Scientific Data Mining Process

Data Mining Methods: Applications for Institutional Research

An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century

Data Mining and Exploration. Data Mining and Exploration: Introduction. Relationships between courses. Overview. Course Introduction

Knowledge Discovery and Data Mining

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Data Mining. Nonlinear Classification

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) ( ) Roman Kern. KTI, TU Graz

Dynamic Data in terms of Data Mining Streams

Chapter 6. The stacking ensemble approach

Decision Trees from large Databases: SLIQ

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Organizing Your Approach to a Data Analysis

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

not possible or was possible at a high cost for collecting the data.

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

Model Combination. 24 Novembre 2009

Lluis Belanche + Alfredo Vellido. Intelligent Data Analysis and Data Mining

Penalized regression: Introduction

EFFECTIVE USE OF THE KDD PROCESS AND DATA MINING FOR COMPUTER PERFORMANCE PROFESSIONALS

Data Mining for Fun and Profit

Statistical Models in Data Mining

Statistics for BIG data

Sentiment Analysis of Twitter Feeds for the Prediction of Stock Market Movement

Social Media Mining. Data Mining Essentials

Data Mining Applications in Fund Raising

Ensemble Learning Better Predictions Through Diversity. Todd Holloway ETech 2008

Algorithmic Trading Session 1 Introduction. Oliver Steinki, CFA, FRM

Supervised Learning (Big Data Analytics)

Regression Modeling Strategies

What is Data Analysis. Kerala School of MathematicsCourse in Statistics for Scientis. Introduction to Data Analysis. Steps in a Statistical Study

Azure Machine Learning, SQL Data Mining and R

Assumptions. Assumptions of linear models. Boxplot. Data exploration. Apply to response variable. Apply to error terms from linear model

Bootstrapping Big Data

Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone:

Data Mining. Knowledge Discovery, Data Warehousing and Machine Learning Final remarks. Lecturer: JERZY STEFANOWSKI

Gerry Hobbs, Department of Statistics, West Virginia University

Ensemble Methods. Adapted from slides by Todd Holloway h8p://abeau<fulwww.com/2007/11/23/ ensemble- machine- learning- tutorial/

DHL Data Mining Project. Customer Segmentation with Clustering

A Review of Data Mining Techniques

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier

Section 14 Simple Linear Regression: Introduction to Least Squares Regression

Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models.

Elements of statistics (MATH0487-1)

: Introduction to Machine Learning Dr. Rita Osadchy

A Property & Casualty Insurance Predictive Modeling Process in SAS

Data Mining. for Process Improvement DATA MINING. Paul Below, Quantitative Software Management, Inc. (QSM)

Chapter 12 Discovering New Knowledge Data Mining

Knowledge Discovery and Data Mining

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Why Ensembles Win Data Mining Competitions

Understanding Characteristics of Caravan Insurance Policy Buyer

DESCRIPTIVE STATISTICS AND EXPLORATORY DATA ANALYSIS

DATA MINING: AN OVERVIEW

Employer Health Insurance Premium Prediction Elliott Lui

Data Mining: An Overview. David Madigan

Improving the Performance of Data Mining Models with Data Preparation Using SAS Enterprise Miner Ricardo Galante, SAS Institute Brasil, São Paulo, SP

The Application of Exploratory Data Analysis (EDA) in Auditing

Data Mining: An Introduction

Principles of Data Mining

Going Big in Data Dimensionality:

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Paper AA Get the highest bangs for your marketing bucks using Incremental Response Models in SAS Enterprise Miner TM

Introduction. A. Bellaachia Page: 1

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Data Mining Part 5. Prediction

Machine Learning and Data Mining. Fundamentals, robotics, recognition

T Non-discriminatory Machine Learning

Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.

Lecture 10: Regression Trees

Prediction of Stock Performance Using Analytical Techniques

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Multiple Linear Regression in Data Mining

How To Understand The Theory Of Probability

Knowledge Discovery and Data Mining

Data Mining for Manufacturing: Preventive Maintenance, Failure Prediction, Quality Control

CHARACTERISTICS IN FLIGHT DATA ESTIMATION WITH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINES

Data Mining Algorithms Part 1. Dejan Sarka

SAS Software to Fit the Generalized Linear Model

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

Combining Linear and Non-Linear Modeling Techniques: EMB America. Getting the Best of Two Worlds

Introduction to Predictive Modeling Using GLMs

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

Transcription:

Why do statisticians "hate" us? David Hand, Heikki Mannila, Padhraic Smyth "Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner." Results of data mining are models or patterns. observational vs. experimental data data mining uses data that has usually been collected for some other purpose. Ex. retail data (record of purchases in a store) The data mining process may not be connected to the data acquisition process. In statistics data is often collected to answer specific questions Data mining sometimes referred to as secondary data analysis Large data sets: If data sets were small we would use statistical exploratory analysis Large to a statistician - few hundred to a thousand data points Large to a data miner - millions or billions of data points Exceedingly large datasets have problems associated with them beyond what is traditionally considered by statisticians Many statistical methods require some type of exhaustive search. As datasets become larger wrt to records and variables, search becomes computationally more prohibitive (looking at all subsets of variables requires search of 2 p - 1 sets) Increase in number of variables leads to the curse of dimensionality curse of dimensionality - exponential rate of growth in the number of unit cells in the data space as the number of variables increases. Ex. 1 binary variable - can take on a value of 0 or 1, hence two "cells"

to get a good estimate of the variable let's say we need 10 observations per cell. 20 observations in all. 2 1 X 10. two binary variables you need 40 observations. or 2 2 X 10. 10 binary variables you need 10240 observations or 2 10 X 10. 20 variables 10485760 observations. What if we have variables that take on multiple values? Things get MUCH worse. Large data sets of issues with the access. Statistical viewpoint: John Elder, Daryl Pregibon, A Statistical Perspective on Knowledge Discovery in Databases A brief history if statistics relative to KDD 1960's "The robustness era freed statisticians of the shackles of narrow models depending on unrealistic assumptions (e.g. normality)" Early 1970's Exploratory Data Analysis (EDA) - statistical insights and modeling are data driven Arguments against: Don't bias the hypothesis and the model Arguments for: By looking at the data and summaries of that data we can choose richer statistical models Statisticians could "look" at the data BEFORE they created a model! o statistical models could break the data into structure and noise o data = fit + residual o look at the residual and see if there was any more "fit" that can be done o iterative process Visualizations - graphs, pictures are worth a thousand words!! Humans can visualize relationships better than any program (still true!!)

Data descriptive methods were more acceptable as opposed to mathematically based methods o data reexpression - log(age) vs. using raw age value o ignore outliers so to get a better description of the major portion of the data Late 1970's normal theory linear model extended to generalized Linear Models which included probability models (elder, Wedderburn, 1974: McCullagh, elder, 1989) EM algorithm (Expectation Maximization) - ways to solve estimation problems with incomplete data. Complete data may also benefit from being treated as a missing data problem. Bishop Fienberg, and Holland - analysis of nominal or discrete data using loglinear models. Early 1980's Elimination of low order bias of an estimator by using resampling (jack knife) technique was generalized to sampling without replacement from data (bootstrap) It allowed for the determination of the error in the estimate focus on estimating precision of estimators rather than bias removal Late 1980's Use of scatterplot smoothers or sliding windows yielded models with global nonlinear fits applications in classification, regression, discriminaton Early 1990's focus shifted from model estimation to model selection Many candidate models can be considered for data hybrid models, formed from multiple "good" models o yields models that are more accurate o reduced variance

Data Mining, Data Dredging, Snooping, Fishing If you searched long enough you would always find some model to fit a data set arbitrarily well Two contributing factors: o complexity of the model o size of possible models If models are flexible wrt size of available data, then we can fit data arbitrarily well But we need to be able to generalize so that we can model new data - Avoid overfit Therefore we need simpler models o Occam's Razor - given a choice of solutions, the simplest may be the best Extreme case - Model = Data Perfect Fit - Yes! Generalize Well - o! Interesting - o! Given a simple model, consider a billion of them. One may fit!! Ex. Predict a response variable X from a predictor Y Given a very large set of variable, X 1, X 2, X 3,, X p where p is large, then there is a high probability of finding an X that correlates with Y even if no real correlation exists in the domain. Ex. When a team from a particular football league wins the super bowl, then a leading stock market index goes up. Leinweber: perfect prediction of S & P 500 as a function of butter production, cheese production and sheep populations in Bangladesh and the US. o easy solutions. One strategy - split data set so that model is built on part of data and tested on another (cross validation)

Best soln - See if what you produce makes sense!! - To who? To the domain expert/user.