Chapter 3
Supervised and unsupervised learning

3.1 Introduction

The science of learning plays a key role in statistics, data mining and artificial intelligence, intersecting with areas of engineering, finance and industry. Typically, a statistician observes an outcome measurement (quantitative or categorical) together with a set of input variables, and the goal is to predict the outcome for future observations. Based on training data, where both the outcome and the input measurements are available, we try to build a prediction model (known as a learner) for new, unseen objects. In a supervised learning problem, the presence of such an outcome variable guides the learner. In a more challenging scenario, however, we observe only the features and the outcome is not measured. The task is then to find patterns in the data that organize them for future purposes. This is called unsupervised learning. We consider the following examples to introduce learning problems in different fields.

Example 1: Information from 4601 email messages is collected to predict whether a message is a genuine email or spam. The learner is to be built to identify and filter out the spam. The true outcome (email/spam) is available for the training data, as well as the relative frequencies of 57 common words and characters (you / your / free / ! / remove etc.). These words are chosen so that their relative frequencies differ maximally between the two types of message.
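To make the spam example concrete, here is a minimal sketch of training a classifier on word-frequency features. The synthetic feature matrix, the logistic-regression choice and the scikit-learn usage are illustrative assumptions, not part of the original study.

```python
# A minimal sketch of the spam example, assuming scikit-learn is available.
# The feature matrix is synthetic; the real data would hold the relative
# frequencies of 57 words/characters for each of the 4601 messages.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_messages, n_features = 4601, 57           # as in the example
X = rng.random((n_messages, n_features))    # placeholder word frequencies
y = rng.integers(0, 2, n_messages)          # 0 = email, 1 = spam (placeholder labels)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

learner = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", learner.score(X_test, y_test))
```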

Example 2: Stamey et al. collected data to examine the correlation between Prostate Specific Antigen (PSA), an indicator of prostate cancer, and several clinical measurements in 97 men about to receive a radical prostatectomy. The learner needs to predict the PSA (or its log) from the input measurements, namely cancer volume, prostate weight, age, amount of benign prostatic hyperplasia, etc. This is a regression problem.

Example 3: The USPS aims to sort handwritten envelopes using their ZIP codes. Digitized (and standardized) images of the digits 0-9 are collected from numerous envelopes as 16 × 16 greyscale maps, each pixel having an intensity in the range 0-255. The learner's task is to identify the digit from its image. (Due to the stringent error specification, some digits can be classified as unclear and sorted manually later.)

Example 4: A gene expression dataset is collected with expression levels from 64 cancer tumors from different patients. There are 6830 genes in total (a typical HDLSS problem: high dimension, low sample size), and the gene expression data is displayed as a heat map (green = low, red = high). The goal can be posed as a regression problem with the gene expressions as output and the genes and samples as inputs, or as an unsupervised learning problem where we want to cluster the samples in the 6830-dimensional space in a specific way.

3.1.1 Supervised learning

Supervised learning is a technique for creating a function from training data. The training data consist of pairs of input objects (a vector of characteristics) and desired outputs. The output of the function can be a continuous value (regression) or a class label (classification). The task of the learner is to predict the value of the outcome for any valid input object after having seen a number of training examples. In a global model, the goal is to estimate a function $g$, given a set of points $(x, g(x))$. A minimal sketch of this estimation problem, in the spirit of Example 2, follows.
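In the sketch below, the clinical predictors, coefficients and noise are synthetic placeholders (not the Stamey et al. measurements), and scikit-learn's least-squares fit stands in for the estimate of $g$.

```python
# A minimal regression sketch in the spirit of Example 2: predict a quantitative
# outcome (log PSA, say) from clinical inputs via least squares.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n, p = 97, 4                                  # 97 men, 4 illustrative predictors
X = rng.normal(size=(n, p))                   # cancer volume, weight, age, BPH (standardized)
beta = np.array([0.8, 0.3, 0.1, 0.2])         # arbitrary "true" coefficients
y = X @ beta + rng.normal(scale=0.5, size=n)  # log PSA with noise

g_hat = LinearRegression().fit(X, y)          # estimate of the function g
x_new = rng.normal(size=(1, p))               # a future observation
print("predicted log PSA:", g_hat.predict(x_new)[0])
```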

The basic problem of supervised learning deals with predicting the response variables from the independent variables. When the output is quantitative, the problem is known as regression. A categorical output leads to classification and separation. In essence, the input $X$ is a collection of $p$ associated variables, and for each $X$ an observed value $Y$ of the output acts as the supervisor. The goal is to build a learner, guided by a training set of $N$ samples of the pair $(Y, X)$, so that it can predict the value $\hat{Y}(x)$ for a future observation $x$. In the case of regression, $Y$ is quantitative, whereas classification problems are indicated by discrete values of $Y$. However, classification problems can also be seen as regression problems where $Y$ takes values in a $k$-dimensional vector space ($k$ denoting the number of classes). Here $Y = (y_1, \ldots, y_k)^t$ with $y_i = 1$ if the observation falls in the $i$th class and $0$ otherwise; in other words, $Y$ takes the vector value $e_i$ when the observation is in the $i$th class. The learner can either predict the class $i$ for a new observation $x$, or it can produce a vector $(p_1, \ldots, p_k)(x)$, where $p_i(x)$ denotes the probability of coming from the $i$th class. Given such a predictor, the decision can then be to choose the class $i$ with the maximum probability $p_i(x)$, as in the sketch below. In this chapter we will mainly focus on classification techniques, and mention regression only where required.

3.1.2 Unsupervised learning

Unsupervised learning is a method of learning in which a model is fit to observations. It is distinguished from supervised learning by the absence of an a priori output. A data set of input objects is gathered, the learner treats them as a set of random variables, and a joint density model is built for the data set. Typically one has a set of $N$ observations $X_1, \ldots, X_N$, all random $p$-vectors, with joint density $p(x)$. The goal is to infer the properties of this density directly, without a supervising variable, which adds some difficulty to the characterization. However, it is sometimes an advantage to know that $X$ represents all the variables under consideration, so that we need not infer how $p(x)$ changes when conditioning on the values of other variables.
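The one-hot coding of $Y$ and the argmax decision rule described above can be written out directly; in the sketch below the class probabilities are made up purely for illustration.

```python
# A minimal sketch of the one-hot coding of Y and the argmax decision rule.
import numpy as np

k = 3                                   # number of classes

def one_hot(i, k):
    """Return e_i: the k-vector with 1 in position i and 0 elsewhere."""
    e = np.zeros(k)
    e[i] = 1.0
    return e

Y = one_hot(1, k)                       # observation in class 1 -> (0, 1, 0)

# A learner that outputs (p_1, ..., p_k)(x); here a fixed placeholder.
p_x = np.array([0.2, 0.5, 0.3])
predicted_class = int(np.argmax(p_x))   # choose the class with maximum p_i(x)
print(Y, "->", predicted_class)
```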

In low dimensions ($p \le 3$), there are a variety of effective ways to estimate the density $p(x)$ directly at all $x$ values, including kernel estimation and spline estimation. These local non-parametric methods fail in high dimensions due to the curse of dimensionality.

Curse of dimensionality: Suppose we look for a local hyper-cubical neighborhood about a target point that captures a fraction $r$ of the observations. For simplicity, consider the $p$-dimensional unit hypercube and assume that the data points are scattered uniformly over the region. As the neighborhood needs to enclose a fraction $r$ of the unit volume, its expected edge length is $e_p(r) = r^{1/p}$. For $p = 10$, to capture only 1% of the volume the required edge length is $e_{10}(0.01) \approx 0.63$, about 63% of the entire range of each input. Such neighborhoods are no longer local. Moreover, if we make $r$ small as well, so that we cover only a small range of each variable, we have fewer and fewer observations to average over, and the estimate will have high variance. Another problem is that all sample points are close to the edge of the data. Consider $N$ points uniformly distributed in a $p$-dimensional unit ball centered at the origin. The median distance from the origin to the closest data point is $d(p, N) = \bigl(1 - \tfrac{1}{2}^{1/N}\bigr)^{1/p}$: the probability that all $N$ points lie farther than $d$ from the origin is $(1 - d^p)^N$, and setting this equal to $1/2$ gives the formula. For $p = 10$, even with $N = 500$, the median distance is about $0.52$, i.e. more than half-way to the boundary. Clearly, if one wants to estimate the density near the origin, there are very few observations to rely upon. Finally, as the sampling density scales as $N^{1/p}$, it takes an enormously large amount of data to obtain a dense sample (i.e. a sample with high density at an input point). All these sparsity-related problems make non-parametric estimation of the density a difficult and unreliable business.
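These curse-of-dimensionality figures are easy to verify numerically; the short sketch below recomputes the edge length $e_p(r)$ and the median closest-point distance $d(p, N)$ quoted above.

```python
# Verify the curse-of-dimensionality arithmetic quoted in the text.

def edge_length(r, p):
    """Expected edge length e_p(r) = r**(1/p) of a hypercube capturing fraction r."""
    return r ** (1.0 / p)

def median_closest(p, n):
    """Median distance from the origin to the closest of n uniform points in a unit p-ball."""
    return (1.0 - 0.5 ** (1.0 / n)) ** (1.0 / p)

print(edge_length(0.01, 10))    # ~0.63: capturing 1% of volume needs 63% of each axis
print(median_closest(10, 500))  # ~0.52: more than half-way to the boundary
```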

The goal of unsupervised learning is to characterize the collection of $x$ values where $p(x)$ is large. However, the validity of the techniques cannot be measured directly. Common techniques include identifying low-dimensional manifolds within the $X$-space that carry high density (PCA, multidimensional scaling etc.) and finding multiple regions of the $X$-space containing modes (cluster analysis, mixture modeling). We will discuss cluster analysis in detail, and a few others will be mentioned as well.
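As a brief preview of the two families of techniques just named, the sketch below projects synthetic data onto a low-dimensional subspace with PCA and then searches for modes with k-means; the data and the scikit-learn usage are illustrative assumptions, ahead of the detailed treatment of clustering later in the chapter.

```python
# A minimal preview: a low-dimensional projection (PCA) followed by a search
# for modes (k-means clustering). The data are synthetic.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
# Two well-separated groups in 50 dimensions (no labels are used below).
X = np.vstack([rng.normal(0.0, 1.0, (100, 50)),
               rng.normal(4.0, 1.0, (100, 50))])

Z = PCA(n_components=2).fit_transform(X)                  # low-dimensional manifold
labels = KMeans(n_clusters=2, n_init=10).fit_predict(Z)   # regions containing modes
print(np.bincount(labels))                                # roughly 100 points per cluster
```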