Introduction to data mining. Example of remote sensing image analysis



Similar documents
Big data and Earth observation New challenges in remote sensing images interpretation

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Social Media Mining. Data Mining Essentials

Environmental Remote Sensing GEOG 2021

How To Cluster

Chapter 7. Cluster Analysis

Introduction to Data Mining

Data Mining and Machine Learning in Bioinformatics

Foundations of Artificial Intelligence. Introduction to Data Mining

Information Management course

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Introduction to Artificial Intelligence G51IAI. An Introduction to Data Mining

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Neural Networks Lesson 5 - Cluster Analysis

Data Mining Classification: Decision Trees

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

The Scientific Data Mining Process

An Assessment of the Effectiveness of Segmentation Methods on Classification Performance

Classification algorithm in Data mining: An Overview

Data Mining on Social Networks. Dionysios Sotiropoulos Ph.D.

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health

Reference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors

SAMPLE MIDTERM QUESTIONS

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

Data Mining. Nonlinear Classification

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Data Mining - Evaluation of Classifiers

An Overview of Knowledge Discovery Database and Data mining Techniques

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

Using Data Mining for Mobile Communication Clustering and Characterization

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Multiscale Object-Based Classification of Satellite Images Merging Multispectral Information with Panchromatic Textural Features

Data Mining: Introduction. Lecture Notes for Chapter 1. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Unsupervised learning: Clustering

Knowledge Discovery and Data Mining

COMPARISON OF OBJECT BASED AND PIXEL BASED CLASSIFICATION OF HIGH RESOLUTION SATELLITE IMAGES USING ARTIFICIAL NEURAL NETWORKS

Chapter 12 Discovering New Knowledge Data Mining

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer

Data Mining for Knowledge Management. Classification

Knowledge Discovery from patents using KMX Text Analytics

Hadoop Operations Management for Big Data Clusters in Telecommunication Industry

The Data Mining Process

Summary Data Mining & Process Mining (1BM46) Content. Made by S.P.T. Ariesen

Azure Machine Learning, SQL Data Mining and R

Data Mining Part 5. Prediction

Classification and Prediction

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

Final Project Report

WATER BODY EXTRACTION FROM MULTI SPECTRAL IMAGE BY SPECTRAL PATTERN ANALYSIS

A Review of Data Mining Techniques

Chapter 20: Data Analysis

Data Mining + Business Intelligence. Integration, Design and Implementation

International Journal of Advance Research in Computer Science and Management Studies

Data Mining Algorithms Part 1. Dejan Sarka

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Data Mining. 1 Introduction 2 Data Mining methods. Alfred Holl Data Mining 1

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda

Machine Learning using MapReduce

Analytics on Big Data

Data Mining: Foundation, Techniques and Applications

Review for Introduction to Remote Sensing: Science Concepts and Technology

Clustering & Visualization

Resolutions of Remote Sensing

Overview. Background. Data Mining Analytics for Business Intelligence and Decision Support

Cluster Analysis: Basic Concepts and Methods

Clustering Artificial Intelligence Henry Lin. Organizing data into clusters such that there is

Content-Based Recommendation

Calculation of Minimum Distances. Minimum Distance to Means. Σi i = 1

Data Mining for Customer Service Support. Senioritis Seminar Presentation Megan Boice Jay Carter Nick Linke KC Tobin

Clustering UE 141 Spring 2013

CS Master Level Courses and Areas COURSE DESCRIPTIONS. CSCI 521 Real-Time Systems. CSCI 522 High Performance Computing

. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns

Data Mining Applications in Higher Education

Rule based Classification of BSE Stock Data with Data Mining

Data Mining and Exploration. Data Mining and Exploration: Introduction. Relationships between courses. Overview. Course Introduction

K-Means Cluster Analysis. Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Unsupervised Data Mining (Clustering)

Introduction of Information Visualization and Visual Analytics. Chapter 4. Data Mining

Extraction of Satellite Image using Particle Swarm Optimization

Lecture 10: Regression Trees

Title. Introduction to Data Mining. Dr Arulsivanathan Naidoo Statistics South Africa. OECD Conference Cape Town 8-10 December 2010.

Comparison of K-means and Backpropagation Data Mining Algorithms

Principles of Data Mining by Hand&Mannila&Smyth

Cluster Analysis. Isabel M. Rodrigues. Lisboa, Instituto Superior Técnico

Distances, Clustering, and Classification. Heatmaps

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Quality Assessment in Spatial Clustering of Data Mining

Index Contents Page No. Introduction . Data Mining & Knowledge Discovery

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Numerical Algorithms Group

Specific Usage of Visual Data Analysis Techniques

Transcription:

Ocean's Big Data Mining, 2014 (Data mining in large sets of complex oceanic data: new challenges and solutions) 8-9 Sep 2014 Brest (France) Monday, September 8, 2014, 4:00 pm - 5:30 pm Introduction to data mining. Example of remote sensing image analysis Prof. Pierre Gançarski, University of Strasbourg Icube lab. After a brief introduction about data mining (why-what-how), the talk presents the two type of tasks: prediction tasks which consist in learning a model from data able to predict unknown values, generally class label and description tasks which consist in finding human-interpretable patterns that describe the data. Classification of data, supervised or not, is one of the most used approach in data mining. Then, the talk introduces the most widely used classification and clustering methods. Applications of these two approaches are mainly illustrated in remote sensing image analysis but a lot of domains are concerned by these methods. About Pierre Gançarski Pierre Gançarski is full Professor in Computer Science Department of the ICube laboratory (University of Strasbourg - France). His research interests concern the complex data mining by hybrid learning multi-classifiers and particularly the unsupervised classification user-centered. It is also interested in studying ways to take into account the domain knowledge in the mining process. His main area (but not the single one) of applications is the classification of remote sensing images. SUMMER SCHOOL #OBIDAM14 / 8-9 Sep 2014 Brest (France) oceandatamining.sciencesconf.org

Data mining and remote sensing images interpretation Pierre Gançarski ICube CNRS - Université de Strasbourg 2014 Pierre Gançarski Data mining - Remote sensing images 1/123

Contents 1 Data mining 2 Remote sensing image 3 Classification 4 Unsupervised classification Pierre Gançarski Data mining - Remote sensing images 2/123

1 Data mining 2 Remote sensing image 3 Classification 4 Unsupervised classification Pierre Gançarski Data mining - Remote sensing images 3/123

Data Mining : why? Data Lots of data is being collected : Web data, e-commerce Purchases at department/grocery stores Bank/Credit Card transactions ; Phone call Skies Telescopes scanning Microarrays to gene expression data Scientific simulations Homics... and... Remote sensing images Pierre Gançarski Data mining - Remote sensing images 4/123

Data Mining : why? Data Lots of data is being collected Comprehension/Analysis/Understanding by an human is infeasible for such raw data Pierre Gançarski Data mining - Remote sensing images 5/123

Data Mining : why? Data Lots of data is being collected Comprehension/Analysis/Understanding by an human is infeasible for such raw data Computers are cheaper and cheaper and more powerful Pierre Gançarski Data mining - Remote sensing images 6/123

Data Mining : why? Data Lots of data is being collected : Comprehension/Analysis/Understanding by an human is infeasible for such raw data Computers are cheaper and cheaper and more powerful! How to extract interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from such data using computer? Pierre Gançarski Data mining - Remote sensing images 7/123

Data Mining : why? Data Lots of data is being collected : Comprehension/Analysis/Understanding by an human is infeasible for such raw data Computers are cheaper and cheaper and more powerful! How to extract interesting (non-trivial, implicit, previously unknown and potentially useful) information or patterns from such data using computer? Solution Data warehousing and on-line analytical processing Data mining : extraction of interesting knowledge (rules, regularities, patterns, constraints) from data... Pierre Gançarski Data mining - Remote sensing images 8/123

Data Mining : what? Data mining Tasks : Knowledge discovery of hidden patterns and insights Two kinds of method : Induction-based methods : learn the model, apply it to new data, get the result! Prediction Example : Based on past results, who will pass the DM exam next week and why? Patterns extraction! Discovery of hidden patterns Example : Based on past results, can we extract groups of students with same behavior? Pierre Gançarski Data mining - Remote sensing images 9/123

Data Mining : how? Objective Objective of data mining tasks Predictive data mining : Use some variables to learn model able to predict unknown or future values of other variables (e.g., class label)! Induction-based methods Regression, classification, rules extraction Descriptive data mining : Find human-interpretable patterns that describe the data! Patterns extraction Clustering, associations extraction Pierre Gançarski Data mining - Remote sensing images 10/123

Data mining : Knowledge Discovery from Data Data Mining : a part of the global KDD process Learning the application domain: relevant prior knowledge and goals of extraction Knowledge Validation Data Cleaning, Sélection, Integration Mining tasks Models Acquisition Pierre Gançarski Data mining - Remote sensing images 11/123

Data mining : Knowledge Discovery from Data Data Mining : a part of the global KDD process Knowledge Validation Data Cleaning, Sélection, Integration Mining tasks Models Acquisition Creating a target data set: data selection Data cleaning and preprocessing (may take 60% of e ort!) Data reduction and transformation: useful features, dimensionality and variable reduction Pierre Gançarski Data mining - Remote sensing images 12/123

Data mining : Knowledge Discovery from Data Data Mining : a part of the global KDD process Knowledge Validation Data Cleaning, Sélection, Integration Mining tasks Models Acquisition Choosing tasks of data mining: classi cation, regression, association, clustering Choosing the mining algorithm(s) : Data mining Pierre Gançarski Data mining - Remote sensing images 13/123

Data mining : Knowledge Discovery from Data Data Mining : a part of the global KDD process Patterns/Models evaluation visualization, transformation, removing patterns, etc. New knowledge integration Knowledge Validation Data Cleaning, Sélection, Integration Mining tasks Models Acquisition Pierre Gançarski Data mining - Remote sensing images 14/123

Data mining : Knowledge Discovery from Data Data Mining : a part of the global KDD process Knowledge Validation Data Cleaning, Sélection, Integration Mining tasks Models Go back if needed... Acquisition Pierre Gançarski Data mining - Remote sensing images 15/123

Data mining : Knowledge Discovery from Data Data Mining : a part of the global KDD process Knowledge Validation Data Cleaning, Sélection, Integration Mining tasks Models Acquisition Pierre Gançarski Data mining - Remote sensing images 16/123

Data mining : tasks Common data mining tasks Regression [Predictive] Classification [Predictive] Clustering [Descriptive] Association Rule Discovery [Descriptive] Pierre Gançarski Data mining - Remote sensing images 17/123

1 Data mining 2 Remote sensing image 3 Classification 4 Unsupervised classification Pierre Gançarski Data mining - Remote sensing images 18/123

Remote sensing image What s it? Image captured by an aerial or satellite system : optic sensors : spectral responses of various surface covers associated with sunshine radar sensors Lidar... Thanks to the LIVE lab (A. Puissant) and the IPGS lab (J.-P. Malet) for providing images. Pierre Gançarski Data mining - Remote sensing images 19/123

Remote sensing image Three dimensions spatial resolution : surface covered by a pixel (from 300m to few tens of centimetres) spectral resolution : number of spectral information (from blue to infrared) corresponding to the number of sensors radiometric resolution : linked to the ability to recognize small brightness variations (from 256 to 64000 level) Pierre Gançarski Data mining - Remote sensing images 20/123

Remote sensing image Spatial resolutions High spatial resolution (HSR) : 20 or 10m Green Red Near infrared (NIR) Pierre Gançarski Data mining - Remote sensing images 21/123

Remote sensing image Spatial resolutions Very high spatial resolution (VHSR) : from 5m to 0.5m Pierre Gançarski Data mining - Remote sensing images 22/123

Data mining Remote sensing image Classification Unsupervised classification Remote sensing image Spatial resolutions Level of analysis is linked to the spatial resolution 1 km Urban analysis Quickbird 2.8m VSR ASTER 15m Geographic objects SPOT 20m HSR Area studies Landsat ETM 30m Pierre Gançarski Data mining - Remote sensing images 23/123

Data mining Remote sensing image Classification Unsupervised classification Remote sensing image Spatial resolutions Level of analysis is linked to the spatial resolution Examples : Urban blocks 10m : Districts 2,4 m : Urban blocks 60cm : Urban objects MRS HRS THRS Homogeneous areas Set of elementary objects spacially organized Set of elementary objects with higher heterogeneity Pierre Gançarski Data mining - Remote sensing images 24/123

Data mining Remote sensing image Classification Unsupervised classification Remote sensing image Spatial resolutions Level of analysis is linked to the spatial resolution Urban analysis Landslide analysis IGN 0.5m VSR Object sub-parts RAPIDEYES 5m MSR LANDSAT 30m Pierre Gançarski HSR Objects Natural areas Data mining - Remote sensing images 25/123

Remote sensing image Three dimensions spatial resolution : surface covered by a pixel (from 300m to few tens of centimetres) spectral resolution : number of spectral information (from blue to infrared) corresponding to the number of sensors radiometric resolution : linked to the ability to recognize small brightness variations (from 256 to 64000 level) Pierre Gançarski Data mining - Remote sensing images 26/123

Remote sensing image Spectral resolutions Hight spectral resolution : One hundred of radiometric bands (or more) band #1 band #22 Aerial view band #29 band #38 Pierre Gançarski Data mining - Remote sensing images 27/123

Remote sensing image Image interpretation Discrimination between (kinds of) objects can depend on the spectral resolution Orthophoto 0.5m CASI 2m Quickbird 2.8m SPOT 5 10m ASTER 15m SPOT 1,2,3 20m IRS 1D 23.5m LandsatTM 5 30m LandsatTM 7 30m Landsat MSS 60m Vegetation 1000m blue green red 0.4 0.5 0.6 0.7 0.9 0.8 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.1 2.0 2.2 2.3 2.4 2.5 near infrared mid infrared panchromatic wavelength (µm) Orthophoto CASI: 32 contiguous bands (at 15nm each) Spectral signature of safe vegetation soil Landsat 5: plus TIR 10.4-12.5 µm Landsat 7: plus TIR ASTER: plus 5 canals TIR 40 20 Pierre Gançarski Data mining - Remote sensing images 28/123

Remote sensing image Three dimensions spatial resolution : surface covered by a pixel (from 300m to few tens of centimetres) spectral resolution : number of spectral information (from blue to infrared) corresponding to the number of sensors radiometric resolution : linked to the ability to recognize small brightness variations (from 256 to 64000 level) Pierre Gançarski Data mining - Remote sensing images 29/123

Image interpretation Semantic gap There are differences between the visual interpretation of the spectral information and the semantic interpretation of the pixels The semantic is not always explicitly contained in the image and depends on domain knowledge and on the context. =) This problem is known as the semantic gap and is defined as the lack of concordance between low-level information (i.e. automatically extracted from the images) and high-level information (i.e. analyzed by geographers) Pierre Gançarski Data mining - Remote sensing images 30/123

Pixels or regions Per-pixel analysis Pixels are analyzed according only their radiometric responses (with possibly some indexes of texture and/or immediate neighbourhood) Pierre Gançarski Data mining - Remote sensing images 31/123

Pixels or regions Per-pixel analysis Pixels are classified analyzed only their radiometric responses (with possibly some indexes of texture and/or immediate neighbourhood) Efficient with low resolutions but... In (V)HSR images, a pixel is only almost any cases, a small sub-part of the thematic object The object to be analyzed are composed of a lot of pixels (from few tens to few hundred or more).! New approach : Object-based Image Analysis (OBIA) Pierre Gançarski Data mining - Remote sensing images 32/123

Pixels or regions Object-based Image Analysis 1 Segmentation of the image : grouping neighbouring pixels according a given homogeneous criterion 2 Characterization of these segments with supplemental radiometric, textural, spatial features,...! object (or region) = segment + set of features 3 Analysis of these regions Pierre Gançarski Data mining - Remote sensing images 33/123

Pixels or regions Region-based classification 1 Segmentation of the image : grouping neighbouring pixels according a given homogeneous criterion 2 Characterization of these segments with supplemental radiometric, textural, spatial features,...! region = segment + set of features 3 Classification of these regions Two main problems Segmentation Feature identification and selection Pierre Gançarski Data mining - Remote sensing images 34/123

Pixels or regions Image segmentation Segmentation is the division of an image into spatially continuous, disjoint and homogeneous areas, i.e. the segments. Segmentation of an image into a given number of regions is a problem with a large number of possible solutions. There are no right, wrong or better solutions but instead meaningful and useful heuristic approximations of partitions of space. Pierre Gançarski Data mining - Remote sensing images 35/123

Pixels or regions Image segmentation These two segmentation is intrinsically right. The choice depends on their usefulness in the next step (e.g., classification) Pierre Gançarski Data mining - Remote sensing images 36/123

Pixels or regions Feature identification and selection What type of features can be used for geographic object classification? There a an infinity of such features. How to select the best features for class discrimination?! How to select the smaller number of features without sacrificing accuracy? Pierre Gançarski Data mining - Remote sensing images 37/123

Image interpretation Solutions To bridge this semantic gap, a lot of approaches can be used : Human visual interpretation : untractable with the new kinds of images (VHSR especially) Per pixel classification (supervised or not) : Automatic or semi-automatic labelisation of the pixels according to thematic classes Segmentation and region classification : Construction of sub-parts of the image and labelisation of these (possibly by a classification process)... Pierre Gançarski Data mining - Remote sensing images 38/123

1 Data mining 2 Remote sensing image 3 Classification 4 Unsupervised classification Pierre Gançarski Data mining - Remote sensing images 39/123

Principles of classification Induction Classification is based on induction principle Deductive reasoning is truth-preserving : All human have a heart All professors are human (for the moment) Therefore, all professors have a heart Induction reasoning adds information All professor observed so far have a heart. Therefore, all professors have a heart. Pierre Gançarski Data mining - Remote sensing images 40/123

Principles of classification Supervised classification Supervised classification is the task of inferring a model from labelled training data : each example is a pair consisting of an input object and a desired class. ) The classes to be learned are known a priori Supervisor Desired class Training data Learned model Predited class Error Model seeking Classical methods : Support vector machine, Decision tree, Artificial neural network... Pierre Gançarski Data mining - Remote sensing images 41/123

Classification schema Offline : Induction (training) Define a Training set :Each record contains a set of attributes, one of the attributes is the class (class label). Find a model for class attribute as a function of the values of other attributes. Learn model Learning algorithm MODEL Pierre Gançarski Data mining - Remote sensing images 42/123

Classification schema Inline : Deduction (Generalization or Prediction) Using the model, give a class label to unseen records MODEL Apply model yes no no yes yes Unseen data Pierre Gançarski Data mining - Remote sensing images 43/123

Classification evaluation How to evaluate the performance of a model? Use of a confusion matrix : Example : N data and two classes + and - Actual class Predicted class + - + A B - C D A : True positive (TP) ; B : False negative (FN) ; C : False positive (FP) ; D : True negative (TN) N = A + B + C + D Pierre Gançarski Data mining - Remote sensing images 44/123

Classification evaluation Accuracy Defined as the ratio of well-classified objects : A + D A + B + C + D = TP + TN N Pierre Gançarski Data mining - Remote sensing images 45/123

Classification evaluation Accuracy Example : We can apply the learned model to the data (without using class labels! ) MODEL yes no Apply model yes Pierre Gançarski Data mining - Remote sensing images 46/123

Classification evaluation Accuracy Example : We can apply the learned model to the data (without using class labels! ) MODEL yes no Apply model yes Actual class Predicted class yes no yes 2 1 no 2 5 Acc= 2 + 5 10 = 0.7 Conclusion : 30% of training error Pierre Gançarski Data mining - Remote sensing images 47/123

Classification evaluation Accuracy Example : We can apply the learned model to the data (without using class labels! ) Conclusion : 30% of training error Suppose we redo training (with new parameters for instance) until this error is zero or near. Question : Is the model right now? Pierre Gançarski Data mining - Remote sensing images 48/123

Classification evaluation Accuracy Example : We can apply the learned model to the data (without using class labels! ) Conclusion : 30% of training error Suppose we redo training (with new parameters for instance) until this error is zero or near. Question : Is the model right now? Not necessary : the most important is that the model correctly classifies unknown data Pierre Gançarski Data mining - Remote sensing images 49/123

Classification evaluation Accuracy Example : We can apply the learned model to the data (without using class labels! ) Conclusion : 30% of training error Suppose we redo training (with new parameters for instance) until this error is zero or near. Question : Is the model right now? Not necessary : the most important is that the model correctly classifies unknown data Question : How to calculate accuracy in this case without any information about actual classes? Pierre Gançarski Data mining - Remote sensing images 50/123

Classification evaluation Accuracy How to calculate accuracy without any information about actual classes? 1 Ask the expert 2 Use the labeled data in a different way by keeping some of them to test the model Pierre Gançarski Data mining - Remote sensing images 51/123

Classification evaluation Cross-validation Usually, the given data set is divided into training and test sets, with training set used to build the model and test set used to validate it (often 2/3-1/3). N-fold cross-validation : 1 The given data set is divided into N subset 2 The model is learn using (N-1) subsets 3 The remaining subset is used to validation 4 The operation (step 2) is repeated N times (with a different test subset each time). 5 The best model is kept. Pierre Gançarski Data mining - Remote sensing images 52/123

Classification evaluation Accuracy limitation The accuracy is very sensible to the skew of data Number of examples - =9990;Numberofexamples+ =10 If model predicts everything to be class -, accuracyis 9990/10000 = 99.9 % Precision/Recall #(c c) #(c c)+#(c c) P(c) = actually belong to class c #(c c) #(c c)+#( c c) R(c) = class c which are classified as c ratio of objects classified as c which ratio of objects belonging to actual Pierre Gançarski Data mining - Remote sensing images 53/123

Classification evaluation Precision/Recall MODEL yes no Apply model yes Actual class Predicted class yes no yes 2 1 no 2 5 Acc= 2 + 5 10 = 0.7 P(yes) = R(yes) = P(no) = R(no) = #(yes yes) #(yes yes)+#(yes no) = 2 2+2 = 1/2 #(yes yes) #(yes yes)+#(no yes) = 2 2+1 = 2/3 #(no no) #(no no)+#(no yes) = 5 5+1 = 5/6 #(no no) #(no no)+#(yes no) = 5 5+2 = 5/7 Pierre Gançarski Data mining - Remote sensing images 54/123

Classification Alotofclassificationtechniques The most popular : K-Nearest-Neighbor Decision Tree based Methods Bayesian classifier : not directly usable in RS analysis Rule-based Methods : not directly usable in RS analysis Artificial Neural Networks : to long to explain (sorry) Support Vector Machines! tomorrow Pierre Gançarski Data mining - Remote sensing images 56/123

K-Nearest-Neighbor Principle Use of a similarity measure (or distance) between data The label of unseen data is set according to label of its K nearest neighbors What is the class of red star? Pierre Gançarski Data mining - Remote sensing images 57/123

K-Nearest-Neighbor Principle Use of a similarity measure (or distance) between data The label of unseen data is set according to label of its K nearest neighbors k=1 : blue triangle Pierre Gançarski Data mining - Remote sensing images 58/123

K-Nearest-Neighbor Principle Use of a similarity measure (or distance) between data The label of unseen data is set according to label of its K nearest neighbors k=1 : blue triangle ; k=3 : blue triangle ; Pierre Gançarski Data mining - Remote sensing images 59/123

K-Nearest-Neighbor Principle Use of a similarity measure (or distance) between data The label of unseen data is set according to label of its K nearest neighbors k=1, k=3 : blue triangle ; k=5 : undefined Pierre Gançarski Data mining - Remote sensing images 60/123

K-Nearest-Neighbor Principle Use of a similarity measure (or distance) between data The label of unseen data is set according to label of its K nearest neighbors k=1, k=3 : blue triangle ; k=5 : undefined ; k=7 : yellow cross Pierre Gançarski Data mining - Remote sensing images 61/123

K-Nearest-Neighbor Principle Use of a similarity measure (or distance) between data The label of unseen data is set according to label of its K nearest neighbors k=1, k=3 : blue triangle ; k=5 : undefined ; k=7 : yellow cross k=n? Pierre Gançarski Data mining - Remote sensing images 62/123

K-Nearest-Neighbor Characteristics Lazy learner : It does not build models explicitly Classifier new data is relatively expensive Too high influence (unstability) of the K parameter Need of a similarity measure (or distance) between data Simarity measure Per-pixel classification : Euclidean distance between radiometric values p (XS1 i XS1 j ) 2 + +(XS3 i XS3 j ) 2 Object-oriented approaches : depends on types of features Can be difficult to define Pierre Gançarski Data mining - Remote sensing images 63/123

Decison Trees Principle Decision trees are rule-based classifiers that consist of a hierarchy of decision points (the nodes of the tree). Training Yes Attrib1 No B NO Attrib2 Small, Large Medium Attrib3 NO <80K NO > 80K <100K YES > 100K NO Pierre Gançarski Data mining - Remote sensing images 64/123

Decison Trees Principle Decision trees are rule-based classifiers that consist of a hierarchy of decision points (the nodes of the tree). Prediction Yes Attrib1 No Attr1 = No Attrib2 = Small Attrib3 = 130K Class =?? B NO <80K NO Small, Large Attrib3 > 80K <100K YES Attrib2 Medium NO > 100K NO Pierre Gançarski Data mining - Remote sensing images 65/123

Decison Trees Training How to build a such tree from the training data?! Recursive partitioning At the beginning, all the records are in the root node (considered as the current node C). General procedure : If C only contains records from the same class c t then C become the leaf labeled c t If C contains records that belong to more than one class, use an attribute test to split the data into smaller subsets (associated each to a children node). Recursively apply the procedure to each children node Pierre Gançarski Data mining - Remote sensing images 66/123

Decison Trees Training How to specify the attribute test condition? Depends on the attribute types : nominal,ordinal, continuous Depends on number of ways to split : binary / multiple How to determine the best split? Heterogenity indexes : Gini Index G(C) =1 [p(c i C)] 2 ; Entropy : E(C) = p(c i C)log 2 p(c i C) Misclassification error... Pierre Gançarski Data mining - Remote sensing images 67/123

Classification Hyperplan-based methods Principle : Given a to classes training dataset, find a hyperplan which separates data into the two classes Pierre Gançarski Data mining - Remote sensing images 68/123

Classification Hyperplan-based methods Principle : Given a to classes training dataset, find a hyperplan which separates data into the two classes Where is really the boundary? Overfitting appear when the model would exactly fit to the training data Pierre Gançarski Data mining - Remote sensing images 69/123

Classification Hyperplan-based methods Principle : Given a to classes training dataset, find a hyperplan which separates data into the two classes Where is really the boundary? The red hyperplan probably overfits the Pink circle class while the blue one the Blue square one Pierre Gançarski Data mining - Remote sensing images 70/123

Classification Hyperplan-based methods Methods : Artificial neural network find one of such hyperplan Support Vector Machine find the best one Pierre Gançarski Data mining - Remote sensing images 71/123

Classification Hyperplan-based methods What happen if the dataset is not linearly separable? Pierre Gançarski Data mining - Remote sensing images 72/123

Classification Hyperplan-based methods What happen if the dataset is not linearly separable? The model is too simple ) underfitting Pierre Gançarski Data mining - Remote sensing images 73/123

Classification Hyperplan-based methods What happen if the dataset is not linearly separable? Complexify the model? Pierre Gançarski Data mining - Remote sensing images 74/123

Classification Hyperplan-based methods What happen if the dataset is not linearly separable? Complexify the model : more and more? Pierre Gançarski Data mining - Remote sensing images 75/123

Classification Hyperplan-based methods What happen if the dataset is not linearly separable? Problem : if the blue square is a noise (or an outlet) all the points in light blue area will be classified as blue square! Test error can increase! Overfitting Pierre Gançarski Data mining - Remote sensing images 76/123

Classification Hyperplan-based methods Experiments show that probability of overfitting increases with complexity. For instance, complexity of a decision tree increases with the number of node When stop the learning? Overfitting Pierre Gançarski Data mining - Remote sensing images 77/123

Classification Hyperplan-based methods What happen if the boundary exists but is no linear? Pierre Gançarski Data mining - Remote sensing images 78/123

Classification Hyperplan-based methods What happen if the boundary exists but is no linear? Transform data by projecting them into higher dimensional space in which they are linearly separable Kernel-based methods : Support Vector Machine Pierre Gançarski Data mining - Remote sensing images 79/123

Classification and images Supervised classification : case of VHSR What kind of object as training data? pixels?! Represent only small parts of real objects geographic object?! Need to construct candidate segments before learning. Pierre Gançarski Data mining - Remote sensing images 80/123

Classification and images Supervised classification : case of VHSR How many classes? 10? - Road - Building... 50? - Road - Street - Red car - Blue car - Lighted (half) roof - Shadowed roof... Pierre Gançarski Data mining - Remote sensing images 81/123

Classification and images Supervised classification : case of VHSR How to give enough examples by class with high number of classes? Pierre Gançarski Data mining - Remote sensing images 82/123

Classification and images Supervised classification : case of VHSR Supervised approaches are mainly used to pixels classification in low resolution images Pierre Gançarski Data mining - Remote sensing images 83/123

1 Data mining 2 Remote sensing image 3 Classification 4 Unsupervised classification Pierre Gançarski Data mining - Remote sensing images 84/123

Unsupervised classification Supervised vs. Unsupervised learning Supervised learning : discover patterns in the data that relate data attributes with a target (class) attribute. These patterns are then utilized to predict the values of the target attribute in future data instances. Unsupervised learning : The data have no target attribute Explore the data to find some intrinsic structures or hidden properties. Pierre Gançarski Data mining - Remote sensing images 85/123

Unsupervised classification Clustering Clustering is a technique for finding similarity groups in data, called clusters : Pierre Gançarski Data mining - Remote sensing images 86/123

Unsupervised classification Clustering Clustering is a technique for finding similarity groups in data, called clusters : Homogeneous groups such that two objects of the same class are more similar two objects of different classes Pierre Gançarski Data mining - Remote sensing images 87/123

Unsupervised classification Clustering Clustering is a technique for finding similarity groups in data, called clusters : Homogeneous groups such that two objects of the same class are more similar two objects of different classes Homogeneous groups such as dissimilarity/distances between groups are highest Pierre Gançarski Data mining - Remote sensing images 88/123

Unsupervised classification Unsupervised classification : clustering Clustering is often called an unsupervised learning task as no class values denoting an a priori grouping of the data instances are given, which is the case in supervised learning. Due to historical reasons, clustering is often considered synonymous with unsupervised learning. Association rule mining is also unsupervised Typical applications As a stand-alone tool to get insight into data distribution As a preprocessing step for other algorithms Each cluster would be assigned to a thematic class by the expert. Pierre Gançarski Data mining - Remote sensing images 89/123

Unsupervised classification Unsupervised classification : clustering Partitioning : Construct various partitions and then evaluate them by some criterion Hierarchical : Create a hierarchical decomposition of the set of objects using some criterion Model-based : Hypothesize a model for each cluster and find best fit of models to data Density-based : Guided by connectivity and density functions Pierre Gançarski Data mining - Remote sensing images 90/123

Unsupervised classification A well-known partitioning algorithm : Kmeans Objective : Minimization of intraclass inertia Given K, find a partition of K clusters that optimizes the chosen partitioning criterion Process : 1 Choice of the number K of clusters 2 Random choice of K centroids (seed) in the data space 3 Iteration : 1 Assign each pixel to the cluster that has the closest centroid 2 Recalculate the positions of the K centroids 4 Repeat Step 3 until the centroids no longer move Pierre Gançarski Data mining - Remote sensing images 91/123

Unsupervised classification A well-known clustering algorithm : Kmeans Example on SPOT image Pierre Gançarski Data mining - Remote sensing images 92/123

Unsupervised classification A well-known clustering algorithm : Kmeans The data space corresponding to the Spot image XS2 XS3 XS3 XS1 XS3 XS1 XS2 XS1 XS2 Pierre Gançarski Data mining - Remote sensing images 93/123

Unsupervised classification A well-known clustering algorithm : Kmeans Random choice of K = 5 centroids (seed) in the data space XS2 XS3 XS3 XS1 XS3 XS1 XS2 XS1 XS2 Pierre Gançarski Data mining - Remote sensing images 94/123

Unsupervised classification A well-known clustering algorithm : Kmeans Objective : Minimization of intraclass inertia Process : 1 Choice of the number K of clusters 2 Random choice of K centroids (seed) in the data space 3 Iteration : 1 Assign each pixel to the cluster that has the closest centroid 2 Recalculate the positions of the K centroids 4 Repeat Step 3 until the centroids no longer move Pierre Gançarski Data mining - Remote sensing images 95/123

Unsupervised classification A well-known clustering algorithm : Kmeans Assign each pixel to the cluster that has the closest centroid XS3 XS1 Pierre Gançarski Data mining - Remote sensing images 96/123

Unsupervised classification A well-known clustering algorithm : : Kmeans Each pixel is colorized according to the clusters is belong to. Pierre Gançarski Data mining - Remote sensing images 97/123

Unsupervised classification A well-known clustering algorithm : Kmeans Objective : Minimization of intraclass inertia Process : 1 Choice of the number K of clusters 2 Random choice of K centroids (seed) in the data space 3 Iteration : 1 Assign each pixel to the cluster that has the closest centroid 2 Recalculate the positions of the K centroids 4 Repeat Step 3 until the centroids no longer move Pierre Gançarski Data mining - Remote sensing images 98/123

Unsupervised classification A well-known clustering algorithm : Kmeans Recalculate the positions of the K centroids Pierre Gançarski Data mining - Remote sensing images 99/123

Unsupervised classification A well-known clustering algorithm : Kmeans Objective : Minimization of intraclass inertia Process : 1 Choice of the number K of clusters 2 Random choice of K centroids (seed) in the data space 3 Iteration : 1 Assign each pixel to the cluster that has the closest centroid 2 Recalculate the positions of the K centroids 4 Repeat Step 3 until the centroids no longer move Pierre Gançarski Data mining - Remote sensing images 100/123

Unsupervised classification A well-known clustering algorithm : Kmeans Assign each pixel to the cluster that has the closest centroid Pierre Gançarski Data mining - Remote sensing images 101/123

Unsupervised classification A well-known clustering algorithm : Kmeans On SPOT image, each pixel is colorized according to the clusters is belong to. Pierre Gançarski Data mining - Remote sensing images 102/123

Unsupervised classification A well-known clustering algorithm : Kmeans Recalculate the positions of the K centroids Pierre Gançarski Data mining - Remote sensing images 103/123

Unsupervised classification A well-known clustering algorithm : Kmeans and so on... XS3 XS1 Pierre Gançarski Data mining - Remote sensing images 104/123

Unsupervised classification A well-known clustering algorithm : Kmeans and so on... Pierre Gançarski Data mining - Remote sensing images 105/123

Unsupervised classification A well-known clustering algorithm : Kmeans and so on... Pierre Gançarski Data mining - Remote sensing images 106/123

Unsupervised classification A well-known clustering algorithm : Kmeans and so on... Pierre Gançarski Data mining - Remote sensing images 107/123

Unsupervised classification A well-known clustering algorithm : Kmeans and so on... Pierre Gançarski Data mining - Remote sensing images 108/123

Unsupervised classification A well-known clustering algorithm : Kmeans and so on... Pierre Gançarski Data mining - Remote sensing images 109/123

Unsupervised classification A well-known clustering algorithm : Kmeans and so on... Pierre Gançarski Data mining - Remote sensing images 110/123

Unsupervised classification A well-known clustering algorithm : Kmeans and so on... Pierre Gançarski Data mining - Remote sensing images 111/123

Unsupervised classification A well-known clustering algorithm : Kmeans until the centroids do not move Pierre Gançarski Data mining - Remote sensing images 112/123

Unsupervised classification Kmeans : Weaknesses The algorithm is only applicable iff mean is defined. The user needs to specify K. Kmeans is sensitive to outliers Kmeans is strong sensitive to initialization Kmeans is not suitable to discover clusters with non-convex shapes Despite weaknesses, Kmeans is still one of the most popular algorithms due to its simplicity and efficiency Pierre Gançarski Data mining - Remote sensing images 113/123

Unsupervised classification Hierarchical clustering Agglomerative (bottom up) clustering : merges the most similar (or nearest) pair of clusters or objects stops when all the data objects are merged into a single cluster Divisive (top down) clustering : It starts with all data points in one cluster, the root. Splits the root into a set of child clusters. Each child cluster is recursively divided further Stops when only singleton clusters of individual data points remain, i.e., each cluster with only a single point Pierre Gançarski Data mining - Remote sensing images 114/123

Unsupervised classification Ascendant hierarchical clustering f g h a b c e d Pierre Gançarski Data mining - Remote sensing images 115/123

Unsupervised classification Ascendant hierarchical clustering f g h a b c e d g h Pierre Gançarski Data mining - Remote sensing images 116/123

Unsupervised classification Ascendant hierarchical clustering f g h a b c e d g h b c Pierre Gançarski Data mining - Remote sensing images 117/123

Unsupervised classification Ascendant hierarchical clustering f g h a b c e d g h d b c Pierre Gançarski Data mining - Remote sensing images 118/123

Unsupervised classification Ascendant hierarchical clustering f g h a b c e d g h a d b c Pierre Gançarski Data mining - Remote sensing images 119/123

Unsupervised classification Ascendant hierarchical clustering f g h a b c e d f g h a d b c Pierre Gançarski Data mining - Remote sensing images 120/123

Unsupervised classification Ascendant hierarchical clustering f g h a b c e d f g h e a d b c Pierre Gançarski Data mining - Remote sensing images 121/123

Unsupervised classification Ascendant hierarchical clustering f g h a b c e d f g h e a d b c Pierre Gançarski Data mining - Remote sensing images 122/123

Unsupervised classification Ascendant hierarchical clustering Avantage : No need of average method Weakness : Algorithm cost (two-by-two distance evaluate) Often preceded by partitioning algorithm to reduce the dataset size Pierre Gançarski Data mining - Remote sensing images 123/123