Statistical Models in Data Mining

Similar documents
Principles of Data Mining

Data, Measurements, Features

Learning outcomes. Knowledge and understanding. Competence and skills

Principles of Data Mining by Hand&Mannila&Smyth

Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague.

Principles of Dat Da a t Mining Pham Tho Hoan hoanpt@hnue.edu.v hoanpt@hnue.edu. n

STA 4273H: Statistical Machine Learning

DATA MINING TECHNIQUES AND APPLICATIONS

Statistics Graduate Courses

Silvermine House Steenberg Office Park, Tokai 7945 Cape Town, South Africa Telephone:

Classification Problems

An Introduction to Data Mining

Introduction to Data Mining

HT2015: SC4 Statistical Data Mining and Machine Learning

Machine Learning Overview

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

How To Understand The Theory Of Probability

Is a Data Scientist the New Quant? Stuart Kozola MathWorks

The Scientific Data Mining Process

Data Mining Algorithms Part 1. Dejan Sarka

Machine Learning with MATLAB David Willingham Application Engineer

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Statistics for BIG data

A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

The Data Mining Process

Azure Machine Learning, SQL Data Mining and R

Information Management course

CS 2750 Machine Learning. Lecture 1. Machine Learning. CS 2750 Machine Learning.

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

MACHINE LEARNING IN HIGH ENERGY PHYSICS

Lecture 3: Linear methods for classification

Imputing Values to Missing Data

Course Syllabus For Operations Management. Management Information Systems

Supervised Learning (Big Data Analytics)

MS1b Statistical Data Mining

Big Data: Rethinking Text Visualization

Introduction to Data Mining

Diagrams and Graphs of Statistical Data

Database Marketing, Business Intelligence and Knowledge Discovery

Machine Learning and Statistics: What s the Connection?

Data Mining and Exploration. Data Mining and Exploration: Introduction. Relationships between courses. Overview. Course Introduction

CS 688 Pattern Recognition Lecture 4. Linear Models for Classification

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Predict Influencers in the Social Network

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S.

Introduction. A. Bellaachia Page: 1

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Machine Learning.

Statistics, Data Mining and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data. and Alex Gray

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Social Media Mining. Data Mining Essentials

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

Machine Learning for Data Science (CS4786) Lecture 1

Bayesian networks - Time-series models - Apache Spark & Scala

Data Mining and Neural Networks in Stata

Data Mining with SQL Server Data Tools

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

Data Mining: An Introduction

Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012

Content-Based Recommendation

Machine learning for algo trading

Server Load Prediction

Machine Learning and Data Mining. Fundamentals, robotics, recognition

Machine Learning and Data Mining. Regression Problem. (adapted from) Prof. Alexander Ihler

Sanjeev Kumar. contribute

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University

Exploratory Data Analysis

An Overview of Knowledge Discovery Database and Data mining Techniques

Data Mining: An Overview. David Madigan

ANALYTICS IN BIG DATA ERA

Chapter 12 Discovering New Knowledge Data Mining

Prediction of Heart Disease Using Naïve Bayes Algorithm

KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics

Exploratory Data Analysis with MATLAB

Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup

Data Mining: Overview. What is Data Mining?

Faculty of Science School of Mathematics and Statistics

: Introduction to Machine Learning Dr. Rita Osadchy

Algebra Academic Content Standards Grade Eight and Grade Nine Ohio. Grade Eight. Number, Number Sense and Operations Standard

Multivariate Normal Distribution

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Organizing Your Approach to a Data Analysis

CHAPTER 2 Estimating Probabilities

A Review of Data Mining Techniques

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

Data Mining for Business Intelligence. Concepts, Techniques, and Applications in Microsoft Office Excel with XLMiner. 2nd Edition

Introduction to nonparametric regression: Least squares vs. Nearest neighbors

Data Mining Analytics for Business Intelligence and Decision Support

Data Mining. Dr. Saed Sayad. University of Toronto

Knowledge Discovery from patents using KMX Text Analytics

Visualizing class probability estimators

Linear Discrimination. Linear Discrimination. Linear Discrimination. Linearly Separable Systems Pairwise Separation. Steven J Zeil.

Supervised and unsupervised learning - 1

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

Transcription:

Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari

Flood of Data New York Times, January 11, 2010 Video and Image Data Unstructured Structured and Unstructured (Text) Data 2 Srihari

Large Data Sets are Ubiquitous 1.Due to digital data acquisition and storage technology Business Supermarket transactions Credit card usage records Telephone call details Government statistics Scientific Images of astronomical bodies Molecular databases Medical records 2. Automatic data production leads to need for automatic data consumption 3. Large databases mean vast amounts of information 4. Difficulty lies in converting data to useful knowledge 3 Srihari

Data Mining Definition Analyze Observational Data to find unsuspected relationships and Summarize data in novel ways that are understandable and useful to data owner Unsuspected Relationships non-trivial, implicit, previously unknown Ex of Trivial: Those who are pregnant are female Summarize as Patterns and Models (usually probabilistic) Linear Equations, Rules, Clusters, Graphs, Tree Structures, Recurrent Patterns in Time Series Extracting useful information from large data sets Usefulness: meaningful: lead to some advantage, usually economic Analysis: Automatic/Semi-automatic Process (Extraction of knowledge) Srihari

Reasons for Uncertainty 1. Data may only be a sample of population to be studied Uncertain about extent to which samples differ from each other 2. Interest is in making a prediction about tomorrow based on data we have today 3. Cannot observe some values and need to make a guess 5 Srihari

Dealing with Uncertainty Several Conceptual bases 1. Probability 2. Fuzzy Sets 3. Rough Sets Probability Theory vs Probability Calculus Probability Calculus is well-developed Generally accepted axioms and derivations Probability Theory has scope for perspectives 6 Lack theoretical backbone and the wide acceptance of probability Mapping real world to what probability is

Frequentist vs Bayesian Frequentist Probability is objective It is the limiting proportion of times event occurs in identical situations Bayesian An idealization since all customers are not identical Subjective probability Explicit characterization of all uncertainty including any parameters estimated from the data Frequently yield same results 7 Srihari

Data Mining vs Statistics Observational Data Objective of data mining exercise plays no role in data collection strategy E.g., Data collected for Transactions in a Bank Experimental Data Collected in Response to Questionnaire Efficient strategies to Answer Specific Questions In this way it differs from much of statistics For this reason, data mining is referred to as secondary data analysis 8 Srihari

Statistics vs Data Mining Size of data set (large in data mining) Eyeballing not an option (terabytes of data) Entire dataset rather than a sample Many variables Curse of dimensionality Make predictions Small sample sizes can lead to spurious discovery: Superbowl winner conference correlates to stock market (up/down)

Multidisciplinary terminology Structured Data Training Set Unstructured Data Information Retrieval Machine Learning Pattern Recognition Records Database Data Mining Statistics Samples Table Visualization Artificial Intelligence Expert Systems Data Points Instances 10 Leading Conference known as Knowledge Discovery and Data Mining Srihari

Data Mining Tasks Not so much a single technique Idea that there is more knowledge hidden in the data than shows itself on the surface Any technique that helps to extract more out of data is useful Five major task types: 1. Exploratory Data Analysis (Visualization: boxplots, charts) 2. Descriptive Modeling (Density estimation, Clustering) Model 3. Predictive Modeling (Classification and Regression) building 4. Discovering Patterns and Rules (Association rules) 5. Retrieval by Content (Retrieve items similar to pattern of interest) 11 Srihari

12 Clustering Old Faithful (Hydrothermal Geyser in Yellowstone) 272 observations Duration (mins, horiz axis) vs Time to next eruption (vertical axis) Simple Gaussian unable to capture structure Linear superposition of two Gaussians is better Gaussian has limitations in modeling real data sets Gaussian Mixture Models give very complex densities p( x) = π N( x µ, Σ k = 1 π k are mixing coefficients that sum to one One dimension Three Gaussians in blue Sum in red K k k k )

Global Model 13 Models and Patterns High level global description of data set Make statement about any point in d-space E.g., prediction, clustering It takes a large sample perspective Summarizing data in convenient, concise way Local Patterns Make statement about restricted regions of d-space E.g.: if x > thresh1 then Prob (y > thresh2) = p Departure from run of data Identify members with unusual properties Outliers in a database

Models for Prediction: Regression and Classification Predict response variable from given values of others Response variable y given predictor variables x 1,.., x d When y is quantitative the task is known as regression When y is categorical, it is known as classification learning or supervised classification 14

Statistical Models for Regression and Classification Generative Naïve Bayes Mixtures of multinomials Mixtures of Gaussians Hidden Markov Models (HMM) Bayesian networks Markov random fields Discriminative Logistic regression SVMs Traditional neural networks Nearest neighbor Conditional Random Fields (CRF) Gaussian Processes 15

National Academy of Sciences: Keck Center 16

Regression Problem: Carbon Dioxide in Atmosphere 400? CO 2 380 Concentration ppm 360 340 320 1960 1980 2000 2010 2020 Year

Regression Single input variable 18 Linear Models Polynomial y(x,w) = w 0 +w 1 x+w 2 x 2 + =Σ w i x i Several variables linear combination of non-linear (basis) functions Bayesian Linear Regression Classification y(x,w) = w 0 + M 1 j=1 w j φ j (x) = w T φ(x) Logistic Regression (with sigmoid or soft-max) y(x,w) = σ[w T φ(x)]

Neural Network Function Overall function M D y k (x,w) = σ (2) w kj h (1) (1) (2) w ji x i + w j 0 + w k 0 j=1 i=1 Where w is the set of all weights and bias parameters nonlinear functions from inputs {x i } to outputs {y k } Note presence of both σ and h functions If σ is identity for regression If σ is sigmoid for two-class classification If σ is softmax for multi-classification 19

Gaussian Processes for Regression Radically different viewpoint not involving weight parameters Functions are drawn from a Gaussian where each data point is a function Gaussian Kernel k(x,x') = exp( x x' 2 /2σ 2 ) Exponential Kernel k(x,x') = exp( θ x x' ) Ornstein-Uhlenbeck process for Brownian motion 20

Gaussian Process with Two Samples Let y be a function (curve) of a one-dimensional variable x We take two samples y 1 and y 2 corresponding to x 1 and x 2 Assume they have a bivariate Gaussian distribution Each point from this distribution y 2 y 1 x 1 x 2 has an associated probability It also defines a function y(x) Assuming that two points are enough to define a curve y 2 More than two points will be needed to define a curve Which leads to a higher dimensional probability distribution 21 y 1

Gaussian Process Regression Generalize multivariate Gaussian to infinite variables (over all values of input x) A Gaussian distribution is fully specified by a mean vector µ and covariance matrix Σ f = (f 1,..f n ) T ~ N (µ,σ) indexes i =1,..n A Gaussian process is fully specified by a mean function m(x) and covariance function k(x,x ) f (x) ~ GP(m(x), k(x,x )) indexes x Kernel function k appears in place of covariance matrix Both express similarity of two in multidimensional space

Gaussian Process Fit to CO 2 data

Dual Role of Probability and Statistics in Data Analysis Generative Model of data allows data to be generated from the model Inference allows making statements about data 24

2. Nature of Data Sets Structured Data set of measurements from an environment or process Simple case n objects with d measurements each: n x d matrix d columns are called variables, features, attributes or fields 25

Unstructured Data 1. Structured Data Well-defined tables, attributes (columns), tuples (rows) 2. Unstructured Data World wide web Documents (HTML/XML) and hyperlinks HTML: tree structure with text and attributes embedded at nodes XML pages use metadata descriptions Text Documents Document viewed as sequence of words and punctuations Mining Tasks» Text categorization» Clustering Similar Documents» Finding documents that match a query Image Databases 26

Retrieval by Content User has pattern of interest and wishes to find that pattern in database, Ex: Text Search Estimate the relative importance of web pages using a feature vector whose elements are derived from the Query-URL pair Image Search Search a large database of images by using content descriptors such as color, texture, relative position 27 Srihari

Representations of Text Documents Boolean Vector Document is a vector where each element is a bit representing presence/absence of word A set of documents can be represented as matrix (d,w) where document d and word w has value 1 or 0 (sparse matrix) Vector Space Representation Each element has a value such as no. of occurrences or frequency A set of documents represented as a document-term matrix 28

Vector Space Example Document-Term Matrix t1 database t2 SQL t3 index t4 regression t5 likelihood t6 linear d ij represents number of times that term appears in that document 29

Image Retrieval Results Crime scene mark and their closest matches

Conclusion Data mining objective is to make discoveries from data We want to be as confident as we can that our conclusions are correct Nothing is certain Fundamental tool is probability Universal language for handling uncertainty Allows us to obtain best estimates even with data inadequacies and small samples 31 Srihari