What is Data mining?




STAT: DATA MINING. Javier Cabrera. Fall

What is Data mining?
[Diagram labels: Business Question; Find Data; Data Collection (internal databases, data warehouses, Internet, online databases); Data Processing; Extract Information; Data Analysis; Answer.]

Welcome to Data mining
- Data collected in large databases: relational databases, Internet, data warehouses.
- Large datasets: many variables and cases.
- Mostly noisy data: missing values, zeros, outliers.
- No random samples.

Data mining objective
- To extract valuable information.
- To identify nuggets: small clusters of observations in these data that contain potentially valuable information.
- The definition of valuable is generally reflected by a large response value of a specific category of a qualitative response.
- The main challenge of data mining is sifting through a large volume of data that is noisy and badly behaved, that may have many missing values, or that may simply be irrelevant.

Moore's law
- Processing capacity doubles every couple of years (exponential).
- Hard disk storage capacity doubles every 1… months (used to be every … months).
- Bottlenecks are no longer about speed: processing capacity is not growing as fast as data acquisition.

How large is large?
By number of cases:
- Small: < … (no CLT)
- Moderate: … < … (CLT)
- Moderately large: … < … ("tolerable")
- Large: … +: no computations.
By number of variables:
- Small: one variable.
- Moderate: fewer than 1… variables. Matrix inversion.
- Large: more than 1… variables.
By database size:
- Large: does not fit in memory.

Data mining software
- Fast computations.
- Economic use of memory.
- Flexible (and user-friendly) graphics interface.
- Software that will be used in class: Clementine from SPSS, SAS, R.
- Other software: Enterprise Miner, Spotfire, C.
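The kinds of noise listed above (missing values, zeros, outliers) can be counted before any modeling. The following is a minimal sketch, not part of the original slides: the vector `x` is simulated, the helper name `noise_summary` is hypothetical, and the outlier rule (1.5 times the interquartile range beyond the quartiles) is an assumed convention rather than the course's definition.

```r
# Hypothetical noisy variable: skewed values plus zeros, missing values,
# and a few extreme observations (all simulated)
set.seed(1)
x <- c(rlnorm(950, meanlog = 3), rep(0, 30), rep(NA, 15),
       1e5, 2e5, 3e5, 4e5, 5e5)

# Count missing values, exact zeros, and outliers by the 1.5 * IQR rule
noise_summary <- function(x) {
  ok <- x[!is.na(x)]
  q <- quantile(ok, c(0.25, 0.75))
  fence <- 1.5 * (q[2] - q[1])
  c(n = length(x),
    missing = sum(is.na(x)),
    zeros = sum(ok == 0),
    outliers = sum(ok < q[1] - fence | ok > q[2] + fence))
}

s <- noise_summary(x)
print(s)
```

Running the same summary on each column of a data frame (for example with `sapply`) gives a quick first picture of how badly behaved a dataset is.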

Methods and Techniques
- Data summarization, EDA, basic statistics.
- Advanced data visualization.
- Data reduction: variable and case subsetting, sampling.
- Dimension reduction: principal components, covariance.
- Cluster analysis (segmentation): k-means, hierarchical.
- Classification techniques (pattern recognition): LDA, QDA; trees; neural nets; support vector machines; nearest neighbors.
- Model-based methods: linear, non-linear, logistic.

What is new?
- Improved methodology and software.
- Solve business problems: data is from regular businesses.
- Objective: better business decisions.

Case Study: SALES OF ORTHOPEDIC EQUIPMENT
The objective of this study is to find ways to increase sales of orthopedic material from our company to hospitals in the United States.

VARIABLES:
BEDS : NUMBER OF HOSPITAL BEDS
RBEDS : NUMBER OF BEDS
OUT-V : NUMBER OF OUTPATIENT VISITS
ADM : ADMINISTRATIVE COST (in $1's per year)
SIR : REVENUE FROM INPATIENT
SALES1 : SALES OF … EQUIP. FOR THE LAST 1… MO
HIP9 : NUMBER OF HIP OPERATIONS FOR 199…
KEE9 : NUMBER OF KNEE OPERATIONS FOR 199…
TH : TEACHING HOSPITAL? 0, 1
TRAUMA : DO THEY HAVE A TRAUMA UNIT? 0, 1
… : DO THEY HAVE A … UNIT? 0, 1
HIP9 : NUMBER OF HIP OPERATIONS FOR 199…
KEE9 : NUMBER OF KNEE OPERATIONS FOR 199…
FEMUR9 : NUMBER OF FEMUR OPERATIONS FOR 199…

The Role of visualization
Data visualization methods are attractive tools for analyzing such datasets for several reasons:
- They show many features (expected and unexpected) of a dataset at once and, as such, are well equipped to pick up subtle structures of interest and anomalies as well as clear patterns.
- They allow (in fact, encourage) flexible interaction with the data.
- They can be more readily understood by non-statisticians (although their properties may not be).
- Good user-friendly graphics software is becoming more readily available.

Data visualization methods
Large datasets create visualization challenges.
Scatterplots: large numbers of points may hide the underlying structure.
- Apply data binning and use an image graph.
- Avoid masking by duplicating plots and highlighting subgroups.
- Sometimes it is enough to graph a subset selected at random.

Many variables at once: there are many ingenious tools for this.
- Scatterplot matrix:
  - all variables
  - all descriptor variables, color-coded according to one response
  - all response variables, color-coded according to one descriptor
- Plot selected 2D views to highlight some feature of the data:
  - principal components analysis (spread)
  - projection pursuit (clustering)
- Look at all 2D views of the data via a dynamic display (rotating 3D display, grand tour).
- Conditional plots.
- Multiple windows with brush and link.

Data Binning
[Figure: two panels, Scatter Plot and Binning Plot; horizontal axis x.]
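Two of the tools listed above, the color-coded scatterplot matrix and the principal components view, are available in base R. The sketch below is illustrative only: it uses R's built-in `iris` data as a stand-in for the course data, and the choice to standardize the variables before computing principal components is an assumption.

```r
# Stand-in data: R's built-in iris measurements (not the course data)
X <- iris[, 1:4]
grp <- iris$Species

# Scatterplot matrix of all variables, color-coded by one response
pairs(X, col = as.integer(grp), pch = 20)

# Principal components: project onto the two directions of largest spread
pc <- prcomp(X, scale. = TRUE)
scores <- pc$x[, 1:2]
var_explained <- pc$sdev^2 / sum(pc$sdev^2)

# Low-dimensional view of the data, with the same color coding
plot(scores, col = as.integer(grp), pch = 20, xlab = "PC1", ylab = "PC2")
```

The vector `var_explained` shows how much of the total spread each component captures, which helps decide whether a two-component view is adequate.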

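R example of Binning Plot

    # Binning plot: count the points falling in each cell of an nr x nc grid
    # and display the counts as an image.
    # The numeric constants of the original were lost in extraction; the
    # defaults and call arguments below are assumed values.
    f.plot <- function(x, y, nr = 20, nc = 20, scale = "raw") {
      # Bin index of each point, prefixed with dummy points (i, 1) and (1, j)
      # so that every row and column of the grid appears in the table
      zx <- c(1:nr, rep(1, nc), 1 + trunc(nr * (x - min(x)) / (max(x) - min(x))))
      zx[zx > nr] <- nr
      zy <- c(rep(1, nr), 1:nc, 1 + trunc(nc * (y - min(y)) / (max(y) - min(y))))
      zy[zy > nc] <- nc
      z <- unclass(table(zx, zy))
      # Remove the dummy points again
      z[, 1] <- z[, 1] - 1
      z[1, ] <- z[1, ] - 1
      if (scale == "l") z <- log(1 + z)
      # z[i, j] is the count for x-bin i and y-bin j, the layout image()
      # expects, so z is passed without transposing
      image(z = z,
            x = seq(from = min(x), to = max(x), length.out = nr + 1),
            y = seq(from = min(y), to = max(y), length.out = nc + 1),
            xlab = "", ylab = "", col = topo.colors(100))
    }

    # Run this code line by line
    x <- rnorm(10000); y <- rnorm(10000)
    plot(x, y)
    f.plot(x, y, 10, 10)
    f.plot(x, y, 50, 50)
    f.plot(x, y, 100, 100)
    f.plot(x, y, 100, 100, "l")
    f.plot(x, y, 50, 50, "l")
    # A small curved cluster added to the main cloud
    ux <- rnorm(500) / 4
    uy <- ux^2 - 0.5
    f.plot(c(x, ux), c(y, uy), 50, 50, "l")

Masking effect
[Figure: masking effect using color. Panels: drawing green dots first; drawing purple dots first.]

Conditional Plot example
[Figure: conditional plots of log(1 + SALES1), given ADM and given BEDS.]

Pairwise Scatter Plot
[Figure: pairwise scatter plot; axis labels include log(1 + SALES1) and sqrt(kee9).]

Feature recognition methods
- variable and case selection
- cluster analysis (unsupervised pattern recognition)
  - partitioning methods (e.g., k-means, k-medoids)
  - hierarchical methods (e.g., agglomerative nesting)
  - two-way clustering
- classification (supervised pattern recognition, discriminant analysis)
  - trees (e.g., CART, C, Firm, Tree, ARF)
  - model-based methods (e.g., logistic regression)
  - artificial neural networks
- role of robust methods / diagnostics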
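The partitioning and hierarchical clustering methods named above can be tried with base R's `kmeans` and `hclust`. This is a minimal sketch on simulated data, not the case study: the two-group layout, the choice of two clusters, and average linkage are all assumptions made for illustration.

```r
set.seed(42)
# Hypothetical data: two well-separated groups in two dimensions
X <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 6), ncol = 2))

# Partitioning method: k-means with k = 2 and several random starts
km <- kmeans(X, centers = 2, nstart = 10)

# Hierarchical method: agglomerative clustering, tree cut into 2 groups
hc <- hclust(dist(X), method = "average")
hc_groups <- cutree(hc, k = 2)

# Cross-tabulate the two partitions to see how far they agree
agreement <- table(kmeans = km$cluster, hclust = hc_groups)
print(agreement)
```

Cluster labels are arbitrary, so agreement shows up as one large cell per row of the table rather than as a large diagonal.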

Cluster Example
[Figure: cluster example.]

Tree methods
I. Dependent variable is categorical
- Classification trees (e.g., CART, C, Firm, Tree, ARF)
- Decision trees
- Decision rules
II. Dependent variable is numerical
- Regression tree

Example: Personal loan decision
[Figure: decision tree with splits on Credit Card?, Car?, and Age; leaves are Approve or Reject.]

Regression Tree
[Figure: a function f and its tree form; axis label X.]

Regression Tree for log(1 + SALES1)
[Figure: regression tree; splitting variables include HIP9, RBEDS, ADM, KEE9, FEMUR9, BEDS, OUTV, and SIR.]

Linear Models
Linear model: $Y = X\beta + \varepsilon$
Least squares estimator: $b = (X^T X)^{-1} X^T Y$

Linear Discriminants
Linear discriminant: $Y = 0$ or $1$.
- Estimate $b$ by least squares.
- Predict 1 if $Xb > 0.5$, 0 otherwise.

Example: Pima Indians
[Figure: Pima Indians example; classes Diabetes and None.]
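The least squares discriminant described above can be written directly from the estimator formula. The following is a sketch on simulated two-class data, not the Pima Indians data; the predictor names `x1` and `x2` are hypothetical, and the cutoff of 0.5 (the midpoint between the class codes 0 and 1) is an assumed value.

```r
set.seed(7)
n <- 200
# Hypothetical two-class data: class 1 is shifted in both predictors
y <- rep(c(0, 1), each = n / 2)
x1 <- rnorm(n, mean = 2 * y)
x2 <- rnorm(n, mean = 2 * y)
X <- cbind(1, x1, x2)   # design matrix with an intercept column

# Least squares estimator b = (X'X)^{-1} X'Y
b <- solve(crossprod(X), crossprod(X, y))

# Classify: predict 1 when the fitted value exceeds 0.5 (assumed cutoff)
pred <- as.integer(X %*% b > 0.5)
accuracy <- mean(pred == y)
print(table(observed = y, predicted = pred))
```

Solving the normal equations with `solve(crossprod(X), ...)` avoids forming the inverse explicitly; `lm(y ~ x1 + x2)` returns the same coefficients.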

Example: Pima Indians, KNN with k = …
[Figure: three panels showing KNN classification of the Pima Indians data for different values of k.]

Machine Learning, Pattern Recognition, Data Mining
Techniques:
- Artificial Neural Nets
- Support Vector Machines
Objective: try to emulate the way the brain works (???)
Hoax: the functioning of the brain is not yet understood. Any relation with Artificial Neural Nets is purely anecdotal.
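The nearest-neighbor examples above classify a point by a majority vote among its k closest training cases. A minimal base R sketch of that idea follows; the function name `knn_predict`, the class labels, and the simulated data are hypothetical, and Euclidean distance is an assumed choice of metric.

```r
# k-nearest neighbors by majority vote, using Euclidean distance
knn_predict <- function(train, labels, test, k = 1) {
  apply(test, 1, function(p) {
    d <- sqrt(colSums((t(train) - p)^2))  # distance from p to each training row
    nearest <- order(d)[seq_len(k)]       # indices of the k closest rows
    votes <- table(labels[nearest])
    names(votes)[which.max(votes)]        # ties go to the first label in sort order
  })
}

set.seed(3)
# Hypothetical training data: two classes centered at (0, 0) and (4, 4)
train <- rbind(matrix(rnorm(60, mean = 0), ncol = 2),
               matrix(rnorm(60, mean = 4), ncol = 2))
labels <- rep(c("A", "B"), each = 30)
test <- rbind(c(0, 0), c(4, 4))

pred_k1 <- knn_predict(train, labels, test, k = 1)
pred_k15 <- knn_predict(train, labels, test, k = 15)
```

Small values of k follow the training data closely and give irregular class boundaries, while larger values smooth the boundary at the cost of local detail.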