# Classification Techniques for Remote Sensing


Selim Aksoy, Department of Computer Engineering, Bilkent University, Bilkent, 06800, Ankara (saksoy/courses/cs551). UYGU 2014. © 2014, Selim Aksoy (Bilkent University).

## Outline

1. Introduction
2. Bayesian Decision Theory: Parametric Models; Non-parametric Methods
3. Feature Reduction and Selection
4. Non-Bayesian Classifiers: Distance-based Classifiers; Decision Boundary-based Classifiers
5. Unsupervised Learning and Clustering
6. Algorithm-Independent Learning Issues: Estimating and Comparing Classifiers; Combining Classifiers
7. Structural and Syntactic Pattern Recognition

## Human versus Machine Perception

Humans have developed highly sophisticated skills for sensing their environment and taking actions according to what they observe. We would like to give similar capabilities to machines. This is of particular interest in remote sensing due to:

- the expansion in data volume with increasing spatial, spectral, and temporal resolutions,
- the increasing complexity of the data content,
- the urgency of the need to extract useful information.

## Classification

What is classification? Finding boundaries in a feature space.

Figure 1: Example classification boundaries for 2D data sets.

Classification methods use techniques from mathematics, statistics, computer science, etc., to solve this problem.

## An Example

Problem: sorting incoming fish on a conveyor belt according to species. Assume that we have only two kinds of fish: sea bass and salmon.

Figure 2: Picture taken from a camera.

## An Example: Decision Process

What kind of information can distinguish one species from the other? Length, width, weight, number and shape of fins, tail shape, etc.

What can cause problems during sensing? Lighting conditions, position of the fish on the conveyor belt, camera noise, etc.

What are the steps in the process? Capture image, isolate fish, take measurements, make decision.

## An Example: Selecting Features

Assume a fisherman told us that a sea bass is generally longer than a salmon. We can use length as a feature and decide between sea bass and salmon according to a threshold on length. How can we choose this threshold?

## An Example: Selecting Features

Figure 3: Histograms of the length feature for the two types of fish in the training samples.

How can we choose the threshold $l$ to make a reliable decision?
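One simple way to choose the threshold is to scan candidate values and keep the one with the fewest errors on the training samples. A minimal sketch with hypothetical, synthetically generated length samples (the real histograms are those of Figure 3):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical training lengths; sea bass are longer on average.
sea_bass = rng.normal(loc=60.0, scale=8.0, size=200)
salmon = rng.normal(loc=45.0, scale=7.0, size=200)

def best_threshold(longer, shorter):
    """Scan candidate thresholds l, classify length > l as the longer
    species, and keep the l with the fewest training errors."""
    candidates = np.sort(np.concatenate([longer, shorter]))
    errors = [np.sum(longer <= l) + np.sum(shorter > l) for l in candidates]
    return candidates[int(np.argmin(errors))]

l_star = best_threshold(sea_bass, salmon)
```

Because the two histograms overlap, even the best threshold leaves some training samples misclassified.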

## An Example: Selecting Features

Even though sea bass are longer than salmon on average, there are many examples of fish for which this observation does not hold. Try another feature: the average lightness of the fish scales.

## An Example: Selecting Features

Figure 4: Histograms of the lightness feature for the two types of fish in the training samples.

It looks easier to choose a threshold $x$, but we still cannot make a perfect decision.

## An Example: Cost of Error

We should also consider the costs of the different errors we make in our decisions. For example, suppose the fish packing company knows that:

- Customers who buy salmon will object vigorously if they see sea bass in their cans.
- Customers who buy sea bass will not be unhappy if they occasionally see some expensive salmon in their cans.

How does this knowledge affect our decision?

## An Example: Multiple Features

Assume we also observed that sea bass are typically wider than salmon. We can use two features in our decision:

- lightness: $x_1$
- width: $x_2$

Each fish image is now represented as a point (feature vector) $\mathbf{x} = (x_1, x_2)^T$ in a two-dimensional feature space.

## An Example: Multiple Features

Figure 5: Scatter plot of the lightness and width features for the training samples.

We can draw a decision boundary to divide the feature space into two regions. Does it look better than using only lightness?

## An Example: Multiple Features

Does adding more features always improve the results?

- Avoid unreliable features.
- Be careful about correlations with existing features.
- Be careful about measurement costs.
- Be careful about noise in the measurements.

Is there some curse for working in very high dimensions?

## An Example: Decision Boundaries

Can we do better with another decision rule? More complex models result in more complex boundaries.

Figure 6: We may distinguish the training samples perfectly, but how can we predict how well we will generalize to unknown samples?

## An Example: Decision Boundaries

How can we manage the tradeoff between the complexity of decision rules and their performance on unknown samples?

Figure 7: Different criteria lead to different decision boundaries.

## More on Complexity

Figure 8: Regression example: plot of 10 sample points for the input variable $x$ along with the corresponding target variable $t$. The green curve is the true function that generated the data.

## More on Complexity

Figure 9: Polynomial curve fitting: plots of polynomials of various orders (0th, 1st, 3rd, and 9th in panels (a)–(d)), shown as red curves, fitted to the set of 10 sample points.

## More on Complexity

Figure 10: Polynomial curve fitting: plots of 9th order polynomials fitted to (a) 15 and (b) 100 sample points.

## Pattern Recognition Systems

Figure 11: Object/process diagram of a pattern recognition system: the physical environment is observed through data acquisition/sensing; the data pass through pre-processing, feature extraction, and classification (using a learned model) to post-processing, which produces the decision. A parallel training path takes training data through pre-processing and feature extraction/selection to model learning/estimation.

## Pattern Recognition Systems

- Data acquisition and sensing: measurements of physical variables. Important issues: bandwidth, resolution, sensitivity, distortion, SNR, latency, etc.
- Pre-processing: removal of noise in the data; isolation of patterns of interest from the background.
- Feature extraction: finding a new representation in terms of features.

## Pattern Recognition Systems

- Model learning and estimation: learning a mapping between features and pattern groups and categories.
- Classification: using the features and the learned models to assign a pattern to a category.
- Post-processing: evaluation of confidence in decisions; exploitation of context to improve performance; combination of experts.

## The Design Cycle

Figure 12: The design cycle: collect data, select features, select model, train classifier, evaluate classifier.

Data collection: collecting training and testing data. How can we know when we have an adequately large and representative set of samples?

## The Design Cycle

Feature selection:

- Domain dependence and prior information.
- Computational cost and feasibility.
- Discriminative features: similar values for similar patterns; different values for different patterns.
- Invariant features with respect to translation, rotation, and scale.
- Robust features with respect to occlusion, distortion, deformation, and variations in the environment.

## The Design Cycle

Model selection:

- Domain dependence and prior information.
- Definition of design criteria.
- Parametric vs. non-parametric models.
- Handling of missing features.
- Computational complexity.
- Types of models: templates, decision-theoretic or statistical, syntactic or structural, neural, and hybrid.

How can we know how close we are to the true model underlying the patterns?

## The Design Cycle

Training: how can we learn the rule from data?

- Supervised learning: a teacher provides a category label or cost for each pattern in the training set.
- Unsupervised learning: the system forms clusters or natural groupings of the input patterns.
- Semi-supervised learning: a class of supervised learning techniques that also make use of unlabeled data for training.
- Active learning: a special case of semi-supervised learning in which the learning algorithm interactively queries the user to obtain the desired outputs at new data points.

## The Design Cycle

Evaluation:

- How can we estimate the performance with the training samples?
- How can we predict the performance with future data?
- Problems of overfitting and generalization.

## Bayesian Decision Theory

Bayesian decision theory is a statistical approach that quantifies the tradeoffs between various decisions using the probabilities and costs that accompany such decisions.

Fish sorting example: define $w$, the type of fish we observe (the class), as a random variable where $w = w_1$ for sea bass and $w = w_2$ for salmon. $P(w_1)$ is the a priori probability that the next fish is a sea bass; $P(w_2)$ is the a priori probability that the next fish is a salmon.

## Prior Probabilities

Prior probabilities reflect our knowledge of how likely each type of fish is to appear before we actually see it. How can we choose $P(w_1)$ and $P(w_2)$?

- Set $P(w_1) = P(w_2)$ if they are equiprobable (uniform priors).
- We may use different values depending on the fishing area, the time of the year, etc.

Assume there are no other types of fish (exclusivity and exhaustivity), so that $P(w_1) + P(w_2) = 1$.

## Making a Decision

How can we make a decision with only the prior information?

Decide $w_1$ if $P(w_1) > P(w_2)$, otherwise decide $w_2$.

What is the probability of error for this decision?

$$P(\text{error}) = \min\{P(w_1), P(w_2)\}$$

## Class-conditional Probabilities

Let's try to improve the decision using the lightness measurement $x$. Let $x$ be a continuous random variable. Define $p(x \mid w_j)$ as the class-conditional probability density (the probability density of $x$ given that the class is $w_j$, for $j = 1, 2$). $p(x \mid w_1)$ and $p(x \mid w_2)$ describe the difference in lightness between the populations of sea bass and salmon.

## Posterior Probabilities

Suppose we know $P(w_j)$ and $p(x \mid w_j)$ for $j = 1, 2$, and measure the lightness of a fish as the value $x$. Define $P(w_j \mid x)$ as the a posteriori probability (the probability of the class being $w_j$ given the measured feature value $x$). We can use the Bayes formula to convert the prior probability to the posterior probability

$$P(w_j \mid x) = \frac{p(x \mid w_j)\,P(w_j)}{p(x)}$$

where $p(x) = \sum_{j=1}^{2} p(x \mid w_j)\,P(w_j)$.
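The conversion can be checked numerically. A small sketch with hypothetical priors and likelihood values at a single measured $x$:

```python
import numpy as np

# Hypothetical priors P(w1), P(w2) and likelihoods p(x|w1), p(x|w2)
# evaluated at one measured lightness value x.
priors = np.array([2 / 3, 1 / 3])
likelihoods = np.array([0.05, 0.20])

evidence = np.sum(likelihoods * priors)        # p(x), the normalizer
posteriors = likelihoods * priors / evidence   # P(wj|x) via the Bayes formula
```

Here the evidence is $0.05 \cdot 2/3 + 0.20 \cdot 1/3 = 0.1$, so the posteriors are $(1/3, 2/3)$: the prior favored $w_1$, but the measurement reverses the decision.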

## Making a Decision

$p(x \mid w_j)$ is called the likelihood and $p(x)$ is called the evidence. How can we make a decision after observing the value of $x$?

Decide $w_1$ if $P(w_1 \mid x) > P(w_2 \mid x)$, otherwise decide $w_2$.

Rewriting the rule gives: decide $w_1$ if

$$\frac{p(x \mid w_1)}{p(x \mid w_2)} > \frac{P(w_2)}{P(w_1)},$$

otherwise decide $w_2$. Note that, at every $x$, $P(w_1 \mid x) + P(w_2 \mid x) = 1$.

## Making a Decision

Figure 13: Optimum thresholds for different priors.

## Probability of Error

What is the probability of error for this decision?

$$P(\text{error} \mid x) = \begin{cases} P(w_1 \mid x) & \text{if we decide } w_2 \\ P(w_2 \mid x) & \text{if we decide } w_1 \end{cases}$$

What is the average probability of error?

$$P(\text{error}) = \int p(\text{error}, x)\,dx = \int P(\text{error} \mid x)\,p(x)\,dx$$

The Bayes decision rule minimizes this error because $P(\text{error} \mid x) = \min\{P(w_1 \mid x), P(w_2 \mid x)\}$.

## Probability of Error

Figure 14: Components of the probability of error for equal priors and a non-optimal decision point $x^*$. The optimal point $x_B$ minimizes the total shaded area and gives the Bayes error rate.

## Receiver Operating Characteristics

Consider the two-category case and define $w_1$: target is present, $w_2$: target is not present.

Table 1: Confusion matrix.

| True \ Assigned | $w_1$ | $w_2$ |
| --- | --- | --- |
| $w_1$ | correct detection | mis-detection |
| $w_2$ | false alarm | correct rejection |

Mis-detection is also called a false negative or Type II error. A false alarm is also called a false positive or Type I error.

## Receiver Operating Characteristics

If we use a parameter (e.g., a threshold) in our decision, the plot of these rates for different values of the parameter is called the receiver operating characteristic (ROC) curve.

Figure 15: Example receiver operating characteristic (ROC) curves for different settings of the system.
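An ROC curve can be traced by sweeping the decision threshold over a detector score. A sketch with hypothetical Gaussian scores for the two categories:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical scores: higher values give more evidence the target is present.
target_scores = rng.normal(2.0, 1.0, 500)    # w1: target present
clutter_scores = rng.normal(0.0, 1.0, 500)   # w2: target not present

# Sweep the threshold; each setting yields one (false alarm, detection) pair.
thresholds = np.linspace(-4.0, 6.0, 101)
p_detection = np.array([(target_scores > t).mean() for t in thresholds])
p_false_alarm = np.array([(clutter_scores > t).mean() for t in thresholds])
```

Plotting `p_detection` against `p_false_alarm` gives the ROC curve; raising the threshold trades fewer false alarms for more mis-detections.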

## Bayesian Decision Theory

How can we generalize to

- more than one feature? Replace the scalar $x$ by the feature vector $\mathbf{x}$.
- more than two classes? Just a difference in notation.
- allowing actions other than just decisions? Allow the possibility of rejection.
- different risks in the decision? Define how costly each action is.

## Minimum-error-rate Classification

Let $\{w_1, \ldots, w_c\}$ be the finite set of $c$ classes (categories). Let $\mathbf{x}$ be the $d$-component vector-valued random variable called the feature vector. If all errors are equally costly, the minimum-error decision rule is defined as

Decide $w_i$ if $P(w_i \mid \mathbf{x}) > P(w_j \mid \mathbf{x})$ for all $j \neq i$.

The resulting error is called the Bayes error and is the best performance that can be achieved.
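With posteriors for all $c$ classes in hand, the rule is just an argmax. A minimal sketch (the posterior values are hypothetical):

```python
import numpy as np

def min_error_decision(posteriors):
    """Minimum-error-rate rule: decide the class wi whose posterior
    P(wi|x) is largest (ties broken by the first maximum)."""
    return int(np.argmax(posteriors))

# Hypothetical posteriors P(w1|x), P(w2|x), P(w3|x) at one feature vector x.
decision = min_error_decision([0.2, 0.5, 0.3])
```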

## Bayesian Decision Theory

Bayesian decision theory shows us how to design an optimal classifier if we know the prior probabilities $P(w_i)$ and the class-conditional densities $p(\mathbf{x} \mid w_i)$. Unfortunately, we rarely have complete knowledge of the probabilistic structure. However, we can often find design samples or training data that include particular representatives of the patterns we want to classify.

## Bayesian Decision Theory

Parametric models: assume that the form of the density functions is known.

- Density models (e.g., Gaussian)
- Mixture models (e.g., mixture of Gaussians)
- Hidden Markov models
- Graphical models

Non-parametric models: no assumption about the form.

- Histogram-based estimation
- Parzen window estimation
- Nearest neighbor estimation

## The Gaussian Density

The Gaussian can be considered a model where the feature vectors for a given class are continuous-valued, randomly corrupted versions of a single typical or prototype vector.

Some properties of the Gaussian:

- Analytically tractable.
- Completely specified by the 1st and 2nd moments.
- Has the maximum entropy of all distributions with a given mean and variance.
- Many processes are asymptotically Gaussian (central limit theorem).
- Uncorrelatedness implies independence (for jointly Gaussian variables).

## Univariate Gaussian

For $x \in \mathbb{R}$:

$$p(x) = N(\mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right]$$

where

$$\mu = E[x] = \int x\,p(x)\,dx, \qquad \sigma^2 = E[(x - \mu)^2] = \int (x - \mu)^2\,p(x)\,dx.$$
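The density is straightforward to evaluate directly from the formula; a sketch:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return np.exp(-0.5 * z ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

# The density integrates to one (checked here with a Riemann sum).
xs = np.linspace(-8.0, 8.0, 16001)
area = gaussian_pdf(xs, 0.0, 1.0).sum() * (xs[1] - xs[0])
```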

## Univariate Gaussian

Figure 16: A univariate Gaussian distribution has roughly 95% of its area in the range $|x - \mu| \leq 2\sigma$.

## Multivariate Gaussian

For $\mathbf{x} \in \mathbb{R}^d$:

$$p(\mathbf{x}) = N(\boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right]$$

where

$$\boldsymbol{\mu} = E[\mathbf{x}] = \int \mathbf{x}\,p(\mathbf{x})\,d\mathbf{x}, \qquad \Sigma = E[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T] = \int (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T\,p(\mathbf{x})\,d\mathbf{x}.$$

## Multivariate Gaussian

Figure 17: Samples drawn from a two-dimensional Gaussian lie in a cloud centered on the mean $\boldsymbol{\mu}$. The loci of points of constant density are the ellipses for which $(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})$ is constant, where the eigenvectors of $\Sigma$ determine the directions and the corresponding eigenvalues determine the lengths of the principal axes. The quantity $r^2 = (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})$ is called the squared Mahalanobis distance from $\mathbf{x}$ to $\boldsymbol{\mu}$.

## Gaussian Density Estimation

The maximum likelihood estimates of a Gaussian are

$$\hat{\boldsymbol{\mu}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i \qquad \text{and} \qquad \hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} (\mathbf{x}_i - \hat{\boldsymbol{\mu}})(\mathbf{x}_i - \hat{\boldsymbol{\mu}})^T.$$

Figure 18: Gaussian density estimation examples: histograms and Gaussian fits for a random sample from $N(10, 2^2)$ and for a random sample from the mixture $0.5\,N(10, 0.4^2) + 0.5\,N(11, 0.5^2)$.
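The two estimators can be sketched directly on hypothetical 2-D samples drawn from a known Gaussian:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical 2-D samples from a known Gaussian, for illustration only.
true_mu = np.array([10.0, -1.0])
true_cov = np.array([[2.0, 0.5], [0.5, 1.0]])
x = rng.multivariate_normal(true_mu, true_cov, size=5000)

# Maximum likelihood estimates: sample mean and 1/n covariance.
mu_hat = x.mean(axis=0)
centered = x - mu_hat
sigma_hat = centered.T @ centered / len(x)
```

With 5000 samples the estimates are already close to the generating parameters; note the ML covariance divides by $n$, not $n - 1$.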

## Case 1: $\Sigma_i = \sigma^2 I$

Figure 19: If the covariance matrices of two distributions are equal and proportional to the identity matrix, then the distributions are spherical in $d$ dimensions, and the boundary is a generalized hyperplane of $d - 1$ dimensions, perpendicular to the line separating the means. The decision boundary shifts as the priors are changed.

## Case 2: $\Sigma_i = \Sigma$

Figure 20: Probability densities with equal but asymmetric Gaussian distributions. The decision hyperplanes are not necessarily perpendicular to the line connecting the means.

## Case 3: $\Sigma_i$ arbitrary

Figure 21: Arbitrary Gaussian distributions lead to Bayes decision boundaries that are general hyperquadrics.

## Case 3: $\Sigma_i$ arbitrary

Figure 22: Arbitrary Gaussian distributions lead to Bayes decision boundaries that are general hyperquadrics.

## Mixture Densities

A mixture model is a linear combination of $m$ densities

$$p(\mathbf{x} \mid \Theta) = \sum_{j=1}^{m} \alpha_j\,p_j(\mathbf{x} \mid \theta_j)$$

where $\Theta = (\alpha_1, \ldots, \alpha_m, \theta_1, \ldots, \theta_m)$ such that $\alpha_j \geq 0$ and $\sum_{j=1}^{m} \alpha_j = 1$.

$\alpha_1, \ldots, \alpha_m$ are called the mixing parameters, and $p_j(\mathbf{x} \mid \theta_j)$, $j = 1, \ldots, m$, are called the component densities. The most commonly used mixture model is the mixture of Gaussians.

## Expectation-Maximization

Estimation of the mixture model parameters can be done using the Expectation-Maximization (EM) algorithm. The EM algorithm is a general iterative method for finding the maximum likelihood estimates of the parameters of a distribution from training data.

The EM algorithm:

- First, computes the expected value of the complete-data log-likelihood using the current parameter estimates (expectation step).
- Then, finds new values of the parameters that maximize this expectation (maximization step).
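The two steps can be sketched for a one-dimensional mixture of Gaussians. This is a minimal illustration on hypothetical synthetic data with a fixed iteration count, not a full implementation (no convergence test or safeguards against degenerate components):

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical 1-D data: 30% from N(-2, 1), 70% from N(3, 1).
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])

def em_gmm(x, m=2, iters=50):
    """EM for a 1-D mixture of m Gaussians (a minimal sketch)."""
    n = len(x)
    alpha = np.full(m, 1.0 / m)
    mu = np.quantile(x, np.linspace(0.1, 0.9, m))  # spread-out initial means
    var = np.full(m, x.var())
    for _ in range(iters):
        # E-step: responsibilities gamma_ij = P(component j | x_i).
        dens = alpha * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
               / np.sqrt(2.0 * np.pi * var)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing weights, means, and variances.
        nj = gamma.sum(axis=0)
        alpha = nj / n
        mu = (gamma * x[:, None]).sum(axis=0) / nj
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nj
    return alpha, mu, var

alpha, mu, var = em_gmm(x)
```

On this well-separated example the estimated means and mixing parameters approach the generating values.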

Figure 23: Illustration of the EM algorithm iterations, shown in panels (a)–(f), for a mixture of two Gaussians.

Figure 24: 1-D Bayesian classification examples where the data for each class come from a mixture of three Gaussians: (a) true densities and sample histograms; (b) linear Gaussian classifier with $P_e = \ldots$; (c) quadratic Gaussian classifier with $P_e = \ldots$; (d) mixture of Gaussians classifier with $P_e = \ldots$. The Bayes error is $P_e = \ldots$.

Figure 25: 2-D Bayesian classification examples where the data for the classes come from a banana-shaped distribution and a bivariate Gaussian: (a) scatter plot; (b) linear Gaussian classifier with $P_e = \ldots$; (c) quadratic Gaussian classifier with $P_e = \ldots$; (d) mixture of Gaussians classifier with $P_e = \ldots$.

Figure 26: 2-D Bayesian classification examples where the data for each class come from a banana-shaped distribution: (a) scatter plot; (b) quadratic Gaussian classifier with $P_e = \ldots$; (c) mixture of Gaussians classifier with $P_e = \ldots$.

## Histogram Method

A very simple method is to partition the space into a number of equally-sized cells (bins) and compute a histogram.

Figure 27: Histogram in one dimension.

The estimate of the density at a point $x$ becomes

$$p(x) = \frac{k}{nV}$$

where $n$ is the total number of samples, $k$ is the number of samples in the cell that includes $x$, and $V$ is the volume of that cell.
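The estimate $p(x) = k/(nV)$ can be computed directly by counting the samples in the bin that contains $x$. A sketch in one dimension (the samples and the bin width are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, 10000)  # hypothetical 1-D samples

def histogram_density(samples, point, bin_width=0.2):
    """Estimate p(point) = k / (n V) from the bin containing the point."""
    n = len(samples)
    left = np.floor(point / bin_width) * bin_width
    k = np.sum((samples >= left) & (samples < left + bin_width))
    return k / (n * bin_width)

p0 = histogram_density(x, 0.0)
```

At the mode of the standard normal the estimate should land near the true density $1/\sqrt{2\pi} \approx 0.399$.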

## Non-parametric Methods

Advantages:

- No assumptions are needed about the distributions ahead of time (generality).
- With enough samples, convergence to an arbitrarily complicated target density can be obtained.

Disadvantages:

- The number of samples needed may be very large (the number grows exponentially with the dimensionality of the feature space).
- There may be severe requirements for computation time and storage.

## Classification Error

To apply these results to multiple classes, separate the training samples into $c$ subsets $D_1, \ldots, D_c$, with the samples in $D_i$ belonging to class $w_i$, and then estimate each density $p(\mathbf{x} \mid w_i, D_i)$ separately.

Different sources of error:

- Bayes error: due to overlapping class-conditional densities (related to the features used).
- Model error: due to an incorrect model.
- Estimation error: due to estimation from a finite sample (can be reduced by increasing the amount of training data).

## Feature Reduction and Selection

In practical multicategory applications, it is not unusual to encounter problems involving tens or hundreds of features. Intuitively, it may seem that each feature is useful for at least some of the discriminations. There are two issues that we must be careful about:

- How is the classification accuracy affected by the dimensionality (relative to the amount of training data)?
- How is the computational complexity of the classifier affected by the dimensionality?

## Problems of Dimensionality

In general, if the performance obtained with a given set of features is inadequate, it is natural to consider adding new features. Unfortunately, it has frequently been observed in practice that, beyond a certain point, adding new features leads to worse rather than better performance. This is called the curse of dimensionality. Potential reasons include wrong assumptions in model selection or estimation errors due to the finite number of training samples for high-dimensional observations (overfitting).

## Problems of Dimensionality

All of the commonly used classifiers can suffer from the curse of dimensionality. Dimensionality can be reduced by:

- redesigning the features,
- selecting an appropriate subset among the existing features,
- combining existing features.

Principal Components Analysis (PCA) seeks a projection that best represents the data in a least-squares sense. Linear Discriminant Analysis (LDA) seeks a projection that best separates the data in a least-squares sense.
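PCA can be sketched as an eigendecomposition of the sample covariance matrix; projecting onto the eigenvectors decorrelates the features. A minimal illustration on hypothetical correlated 2-D data:

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical correlated 2-D data.
x = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.2], [1.2, 1.0]], size=2000)

# PCA: eigendecomposition of the sample covariance matrix.
mu = x.mean(axis=0)
cov = np.cov(x - mu, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
order = np.argsort(eigvals)[::-1]           # largest variance first
projected = (x - mu) @ eigvecs[:, order]    # uncorrelated coordinates
```

The first projected coordinate carries the greatest variance and the off-diagonal covariance of the projected data vanishes, matching Figure 28.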

## Examples

Figure 28: Scatter plot (red dots) and the principal axes for a bivariate sample: (a) scatter plot; (b) projection onto $\mathbf{e}_1$; (c) projection onto $\mathbf{e}_2$. The blue line shows the axis $\mathbf{e}_1$ with the greatest variance and the green line shows the axis $\mathbf{e}_2$ with the smallest variance. The features are now uncorrelated.

## Examples

Figure 29: Scatter plot and the PCA and LDA axes for a bivariate sample with two classes: (a) scatter plot; (b) projection onto the first PCA axis; (c) projection onto the first LDA axis. The histogram of the projection onto the first LDA axis shows better separation than the projection onto the first PCA axis.

## Examples

Figure 30: Scatter plot and the PCA and LDA axes for a bivariate sample with two classes: (a) scatter plot; (b) projection onto the first PCA axis; (c) projection onto the first LDA axis. The histogram of the projection onto the first LDA axis shows better separation than the projection onto the first PCA axis.

## Examples

Figure 31: An image and the first six PCA bands (after projection). Histogram equalization was applied to all images for better visualization.

## Examples

Figure 32: An image and the six LDA bands (after projection). Histogram equalization was applied to all images for better visualization.

## Feature Selection

An alternative to feature reduction, which uses linear or non-linear combinations of features, is feature selection, which reduces dimensionality by selecting subsets of the existing features. The first step in feature selection is to define a criterion function, which is often a function of the classification error. Note that the use of the classification error in the criterion function makes feature selection procedures dependent on the specific classifier used.

## Feature Selection

The most straightforward approach would require examining all $\binom{d}{m}$ possible subsets of size $m$ and selecting the subset that performs the best according to the criterion function. However, the number of subsets grows combinatorially, making this exhaustive search impractical. Iterative procedures such as sequential forward selection or sequential backward selection are often used instead, but they cannot guarantee the selection of the optimal subset.
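Sequential forward selection can be sketched generically: at each iteration, add the feature that most improves the criterion function. The criterion below is a hypothetical stand-in (a simple additive score) for the classifier-based accuracy used in practice:

```python
import numpy as np

def sequential_forward_selection(score, n_features, n_select):
    """Greedy SFS: repeatedly add the single feature that maximizes
    the criterion function score(subset)."""
    selected = []
    remaining = list(range(n_features))
    while remaining and len(selected) < n_select:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical criterion: each feature contributes a fixed score.
weights = np.array([0.5, 0.1, 0.9, 0.2])
subset = sequential_forward_selection(lambda s: weights[s].sum(), 4, 2)
```

Sequential backward selection is the mirror image: start from all features and greedily remove the least useful one.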

## Examples

Figure 33: Results of sequential forward feature selection for classification of a satellite image using 28 features (DEM elevation, IKONOS and aerial image bands, and their Gabor texture features). The x-axis shows the classification accuracy (%) and the y-axis shows the features added at each iteration (the first iteration is at the bottom). The highest accuracy value is shown with a star.

## Examples

Figure 34: Results of sequential backward feature selection for classification of a satellite image using 28 features (DEM elevation, IKONOS and aerial image bands, and their Gabor texture features). The x-axis shows the classification accuracy (%) and the y-axis shows the features removed at each iteration (the first iteration is at the bottom). The highest accuracy value is shown with a star.

## Non-Bayesian Classifiers

Distance-based classifiers:

- Minimum distance (nearest mean) classifier
- k-nearest neighbor classifier

Decision boundary-based classifiers:

- Linear discriminant functions
- Support vector machines
- Neural networks
- Decision trees

## The k-nearest Neighbor Classifier

Given the training data $D = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ as a set of $n$ labeled examples, the nearest neighbor classifier assigns a test point $\mathbf{x}$ the label associated with its closest neighbor in $D$. The k-nearest neighbor classifier classifies $\mathbf{x}$ by assigning it the label most frequently represented among the $k$ nearest samples. Closeness is defined using a distance function.

Figure 35: Classifier for $k = 5$.
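A minimal k-nearest neighbor sketch with Euclidean distance and hypothetical 2-D training data:

```python
import numpy as np

def knn_classify(x_train, y_train, x, k=5):
    """Label x with the majority class among its k nearest training
    samples (Euclidean distance)."""
    d = np.linalg.norm(x_train - x, axis=1)
    nearest = y_train[np.argsort(d)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]

# Hypothetical 2-D training data for two classes.
x_train = np.array([[0.0, 0.0], [0.2, 0.1], [-0.1, 0.2],
                    [3.0, 3.0], [3.1, 2.9], [2.8, 3.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])
```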

## Distance Functions

A general class of metrics for $d$-dimensional patterns is the Minkowski metric

$$L_p(\mathbf{x}, \mathbf{y}) = \left(\sum_{i=1}^{d} |x_i - y_i|^p\right)^{1/p}$$

also referred to as the $L_p$ norm.

The Euclidean distance is the $L_2$ norm

$$L_2(\mathbf{x}, \mathbf{y}) = \left(\sum_{i=1}^{d} |x_i - y_i|^2\right)^{1/2}.$$

The Manhattan or city block distance is the $L_1$ norm

$$L_1(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{d} |x_i - y_i|.$$

## Distance Functions

The $L_\infty$ norm is the maximum of the distances along the individual coordinate axes

$$L_\infty(\mathbf{x}, \mathbf{y}) = \max_{i=1,\ldots,d} |x_i - y_i|.$$

Figure 36: Each colored shape consists of points at a distance 1.0 from the origin, measured using different values of $p$ in the Minkowski $L_p$ metric.
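The three special cases all follow from one function of $p$; a small sketch:

```python
import numpy as np

def minkowski(x, y, p):
    """L_p distance; p = 1 (city block), 2 (Euclidean), np.inf (max norm)."""
    diff = np.abs(np.asarray(x) - np.asarray(y))
    if np.isinf(p):
        return diff.max()
    return (diff ** p).sum() ** (1.0 / p)
```

For $\mathbf{x} = (0, 0)$ and $\mathbf{y} = (3, 4)$ this gives $L_1 = 7$, $L_2 = 5$, and $L_\infty = 4$.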

## Linear Discriminant Functions

Figure 37: Linear decision boundaries produced by using one linear discriminant for each class.

## Support Vector Machines

Linear discriminant functions are optimal if the underlying distributions are Gaussians having equal covariance for each class. In the general case, the problem of finding linear discriminant functions can be formulated as a problem of optimizing a criterion function. Among all hyperplanes separating the data, there exists a unique one yielding the maximum margin of separation between the classes.

## Support Vector Machines

Figure 38: The margin is defined as the perpendicular distance between the decision boundary ($y = 0$) and the closest of the data points, which lie on the lines $y = -1$ and $y = 1$ (left). Maximizing the margin leads to a particular choice of decision boundary (right). The location of this boundary is determined by a subset of the data points, known as the support vectors, which are indicated by the circles.

## Support Vector Machines

Given a set of training patterns and class labels as $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n) \in \mathbb{R}^d \times \{\pm 1\}$, the goal is to find a classifier function $f : \mathbb{R}^d \to \{\pm 1\}$ such that $f(\mathbf{x}) = y$ will correctly classify new patterns.

Support vector machines are based on the class of hyperplanes

$$(\mathbf{w} \cdot \mathbf{x}) + b = 0, \qquad \mathbf{w} \in \mathbb{R}^d,\; b \in \mathbb{R},$$

corresponding to the decision functions $f(\mathbf{x}) = \operatorname{sign}((\mathbf{w} \cdot \mathbf{x}) + b)$.

## Support Vector Machines

Figure 39: A binary classification problem of separating balls from diamonds. The optimal hyperplane is orthogonal to the shortest line connecting the convex hulls of the two classes (dotted), and intersects it halfway between the two classes. There is a weight vector $\mathbf{w}$ and a threshold $b$ such that the points closest to the hyperplane satisfy $|(\mathbf{w} \cdot \mathbf{x}_i) + b| = 1$, corresponding to $y_i((\mathbf{w} \cdot \mathbf{x}_i) + b) \geq 1$. The margin, measured perpendicularly to the hyperplane, equals $2 / \|\mathbf{w}\|$.

83 Support Vector Machines To construct the optimal hyperplane, we can define the following optimization problem: minimize (1/2)‖w‖² subject to y_i((w · x_i) + b) ≥ 1, i = 1, ..., n. The solution can be obtained using quadratic programming techniques, where the solution vector is a weighted summation of a subset of the training patterns, called the support vectors. The support vectors lie on the margin and carry all relevant information about the classification problem (the remaining patterns are irrelevant). UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 83 / 108
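As a minimal numerical sketch of these conditions (not a quadratic programming solver), the snippet below checks the constraints y_i((w · x_i) + b) ≥ 1 for a hand-chosen hyperplane on a hypothetical 2D data set and evaluates the margin 2/‖w‖; both the data and the weight vector are illustrative assumptions, not a learned solution.

```python
import math

# Hypothetical 2D training set: class +1 on the right, class -1 on the left.
X = [(1.0, 0.0), (2.0, 1.0), (-1.0, 0.0), (-2.0, -1.0)]
y = [+1, +1, -1, -1]

# Hand-chosen canonical hyperplane w.x + b = 0 (an assumption for illustration).
w = (1.0, 0.0)
b = 0.0

def functional_margin(x, label):
    """y_i((w . x_i) + b): must be >= 1 for every training pattern."""
    return label * (w[0] * x[0] + w[1] * x[1] + b)

constraints_ok = all(functional_margin(x, label) >= 1 for x, label in zip(X, y))
margin = 2.0 / math.hypot(w[0], w[1])  # the margin of separation equals 2/||w||

print(constraints_ok, margin)  # -> True 2.0
```

Here the closest point of each class sits at distance 1 from the hyperplane x_1 = 0, so the total margin is 2, matching 2/‖w‖ for ‖w‖ = 1.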

84 Support Vector Machines In many real-world problems there will be no linear boundary separating the classes, and the problem of searching for an optimal separating hyperplane is meaningless. However, we can extend the above ideas to handle non-separable data by relaxing the constraints. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 84 / 108

85 Support Vector Machines The new optimization problem becomes: minimize (1/2)‖w‖² + C ∑_{i=1}^{n} ξ_i subject to (w · x_i) + b ≥ +1 − ξ_i for y_i = +1, (w · x_i) + b ≤ −1 + ξ_i for y_i = −1, ξ_i ≥ 0, i = 1, ..., n, where ξ_i, i = 1, ..., n, are called the slack variables and C is a regularization parameter. The term C ∑_{i=1}^{n} ξ_i can be thought of as measuring some amount of misclassification, where lowering the value of C corresponds to a smaller penalty for misclassification. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 85 / 108
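The slack-variable objective can be minimized in several ways; one simple illustrative choice (an assumption here, since the slides solve it by quadratic programming) is stochastic sub-gradient descent on the equivalent hinge-loss form of the same objective:

```python
import random

def train_soft_margin(X, y, C=1.0, lr=0.01, epochs=300, seed=0):
    """Sub-gradient descent on (1/2)||w||^2 + C * sum_i max(0, 1 - y_i((w.x_i) + b)).
    A pure-Python sketch for tiny toy problems, not a production solver."""
    rnd = random.Random(seed)
    d, n = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        order = list(range(n))
        rnd.shuffle(order)
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # gradient of the regularizer (1/2)||w||^2, split across the n samples
            for j in range(d):
                w[j] -= lr * w[j] / n
            if margin < 1:  # the slack variable xi_i is active for this pattern
                for j in range(d):
                    w[j] += lr * C * y[i] * X[i][j]
                b += lr * C * y[i]
    return w, b

# Toy linearly separable data; the learned hyperplane should classify it correctly.
X = [(2.0, 2.0), (3.0, 3.0), (-2.0, -2.0), (-3.0, -1.0)]
y = [+1, +1, -1, -1]
w, b = train_soft_margin(X, y)
preds = [1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1 for x in X]
print(preds)
```

With separable data and a moderate C this recovers a separating hyperplane; with overlapping classes the same code trades margin violations against ‖w‖, which is exactly the role of C above.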

86 Support Vector Machines Figure 40: Illustration of the slack variables ξ_i ≥ 0. Data points with circles around them are support vectors. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 86 / 108

87 Support Vector Machines Both the quadratic programming problem and the final decision function f(x) = sign(∑_{i=1}^{n} α_i y_i (x · x_i) + b) depend only on the dot products between patterns. We can generalize this result to the non-linear case by mapping the original input space into some other space F using a non-linear map Φ : R^d → F, and performing the linear algorithm in the F space, which only requires the dot products k(x, y) = Φ(x) · Φ(y). UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 87 / 108

88 Support Vector Machines Even though F may be high-dimensional, a simple kernel k(x, y) such as the following can be computed efficiently. Table 2: Common kernel functions. Polynomial: k(x, y) = (x · y)^p. Sigmoidal: k(x, y) = tanh(κ(x · y) + θ). Radial basis function: k(x, y) = exp(−‖x − y‖²/(2σ²)). Once a kernel function is chosen, we can substitute Φ(x_i) for each training example x_i, and perform the optimal hyperplane algorithm in F. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 88 / 108
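The kernels in Table 2 translate directly into code; a minimal Python version follows, with the parameter names p, kappa (κ), theta (θ), and sigma (σ) taken from the table:

```python
import math

def dot(x, z):
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, p=2):
    """k(x, y) = (x . y)^p"""
    return dot(x, z) ** p

def sigmoid_kernel(x, z, kappa=1.0, theta=0.0):
    """k(x, y) = tanh(kappa * (x . y) + theta)"""
    return math.tanh(kappa * dot(x, z) + theta)

def rbf_kernel(x, z, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))"""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq / (2.0 * sigma ** 2))

print(polynomial_kernel((1, 2), (3, 4)))    # (1*3 + 2*4)^2 = 121
print(rbf_kernel((1.0, 0.0), (1.0, 0.0)))   # distance 0 from itself -> 1.0
```

Each function computes Φ(x) · Φ(z) in the (possibly infinite-dimensional) space F without ever forming Φ explicitly, which is the point of the kernel trick.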

89 Support Vector Machines This results in the non-linear decision function of the form f(x) = sign(∑_{i=1}^{n} α_i y_i k(x, x_i) + b), where the parameters α_i are computed as the solution of the quadratic programming problem. In the original input space, the hyperplane corresponds to a non-linear decision function whose form is determined by the kernel. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 89 / 108

90 Neural Networks Figure 41: A neural network consists of an input layer, an output layer and usually one or more hidden layers that are interconnected by modifiable weights represented by links between layers. They learn the values of these weights as a mapping from the input to the output. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 90 / 108

91 Decision Trees Figure 42: Decision trees classify a pattern through a sequence of questions, in which the next question asked depends on the answer to the current question. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 91 / 108

92 Unsupervised Learning and Clustering Clustering is an unsupervised procedure that uses unlabeled samples. Unsupervised procedures are used for several reasons: Collecting and labeling a large set of sample patterns can be costly. One can train with a large amount of unlabeled data, and then use supervision to label the groupings found. Exploratory data analysis can provide insight into the nature or structure of the data. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 92 / 108

93 Clusters A cluster is composed of a number of similar objects collected or grouped together. Patterns within a cluster are more similar to each other than to patterns in different clusters. Clusters may be described as connected regions of a multi-dimensional space containing a relatively high density of points, separated from other such regions by a region containing a relatively low density of points. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 93 / 108

94 Clustering Clustering is a very difficult problem because data can reveal clusters with different shapes and sizes. Figure 43: The number of clusters in the data often depends on the resolution (fine vs. coarse) with which we view the data. How many clusters do you see in this figure? 5, 8, 10, more? UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 94 / 108

95 Clustering Most clustering algorithms are based on the following two popular techniques: Iterative squared-error partitioning. Agglomerative hierarchical clustering. One of the main challenges is to select an appropriate measure of similarity to define clusters, a choice that is often both data (cluster shape) and context dependent. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 95 / 108

96 Squared-error Partitioning Suppose that the given set of n patterns has somehow been partitioned into k clusters D_1, ..., D_k. Let n_i be the number of samples in D_i and let m_i be the mean of those samples, m_i = (1/n_i) ∑_{x ∈ D_i} x. Then, the sum-of-squared errors is defined by J_e = ∑_{i=1}^{k} ∑_{x ∈ D_i} ‖x − m_i‖². For a given cluster D_i, the mean vector m_i (centroid) is the best representative of the samples in D_i. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 96 / 108
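The criterion J_e can be computed directly from a given partition; a small Python sketch follows, with points as tuples and clusters as lists of points (an illustrative data layout, not a prescribed one):

```python
def sum_of_squared_errors(clusters):
    """J_e = sum over clusters D_i of sum over x in D_i of ||x - m_i||^2,
    where m_i is the mean (centroid) of cluster D_i."""
    total = 0.0
    for D in clusters:
        n_i, d = len(D), len(D[0])
        m = [sum(x[j] for x in D) / n_i for j in range(d)]  # centroid m_i
        total += sum(sum((x[j] - m[j]) ** 2 for j in range(d)) for x in D)
    return total

# Two tight clusters: every point is 1 unit from its centroid, so J_e = 4.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0), (12.0, 0.0)]]
print(sum_of_squared_errors(clusters))  # -> 4.0
```

A partitioning algorithm such as k-means can be read as a search for a partition with low J_e.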

97 Squared-error Partitioning A general algorithm for iterative squared-error partitioning: 1. Select an initial partition with k clusters. Repeat steps 2 through 5 until the cluster membership stabilizes. 2. Generate a new partition by assigning each pattern to its closest cluster center. 3. Compute new cluster centers as the centroids of the clusters. 4. Repeat steps 2 and 3 until an optimum value of the criterion function is found (e.g., when a local minimum is found or a predefined number of iterations are completed). 5. Adjust the number of clusters by merging and splitting existing clusters or by removing small or outlier clusters. This algorithm, without step 5, is also known as the k-means algorithm. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 97 / 108
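A compact sketch of the k-means core (steps 2 through 4, without the merge/split adjustments of step 5); the pick-the-first-k initialization is a simplifying assumption here, since random selection is more common in practice:

```python
def kmeans(points, k, max_iters=100):
    """Iterative squared-error partitioning: alternate assigning each pattern
    to the closest center with recomputing centers as cluster centroids."""
    d = len(points[0])
    centers = [list(p) for p in points[:k]]  # simplistic initial partition
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # step 2: assign each pattern to its closest cluster center
        clusters = [[] for _ in range(k)]
        for p in points:
            best = min(range(k),
                       key=lambda c: sum((p[j] - centers[c][j]) ** 2 for j in range(d)))
            clusters[best].append(p)
        # step 3: recompute centers as centroids (keep old center if a cluster empties)
        new_centers = [
            [sum(p[j] for p in D) / len(D) for j in range(d)] if D else centers[c]
            for c, D in enumerate(clusters)]
        if new_centers == centers:  # step 4: cluster membership has stabilized
            break
        centers = new_centers
    return centers, clusters

points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
centers, clusters = kmeans(points, 2)
print(sorted(len(D) for D in clusters))  # -> [3, 3]
```

On this toy data the two compact, well-separated blobs are recovered exactly, which matches the conditions under which k-means is said to work well on the next slide.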

98 Squared-error Partitioning k-means is computationally efficient and gives good results if the clusters are compact, hyperspherical in shape, and well-separated in the feature space. However, the need to choose k and the initial partition is the main drawback of this algorithm. The value of k is often chosen empirically or based on prior knowledge about the data. The initial partition is often chosen by generating k random points uniformly distributed within the range of the data, or by randomly selecting k points from the data. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 98 / 108

99 Hierarchical Clustering In some applications, groups of patterns share some characteristics when viewed at a particular level. Hierarchical clustering tries to capture these multi-level groupings using hierarchical representations. Figure 44: A dendrogram can represent the results of hierarchical clustering algorithms. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 99 / 108
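Agglomerative hierarchical clustering starts with one cluster per pattern and repeatedly merges the closest pair; the sketch below uses single linkage (minimum pairwise distance between clusters), one common but here assumed choice of linkage:

```python
def agglomerative(points, target_k):
    """Naive agglomerative clustering with single linkage; merging until
    target_k clusters remain corresponds to cutting the dendrogram at one level."""
    clusters = [[p] for p in points]

    def dist2(a, b):
        return sum((x - z) ** 2 for x, z in zip(a, b))

    def linkage(A, B):  # single linkage: closest pair across the two clusters
        return min(dist2(a, b) for a in A for b in B)

    while len(clusters) > target_k:
        # find and merge the closest pair of clusters
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
print(sorted(sorted(D) for D in agglomerative(points, 2)))
```

Recording the sequence of merges (rather than stopping at target_k) yields the full dendrogram of Figure 44.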

100 Algorithm-Independent Learning Issues We have seen many learning algorithms and techniques for pattern recognition. Some of these algorithms may be preferred because of their lower computational complexity. Others may be preferred because they take into account some prior knowledge of the form of the data. Given practical constraints such as finite training data, no pattern classification method is inherently superior to any other. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 100 / 108

101 Estimating and Comparing Classifiers Classification error can be estimated using misclassification and false alarm rates. To compare learning algorithms, we should use independent training and test data generated using static division, rotated division (e.g., cross-validation), or bootstrap methods. Using the error on points not in the training set (also called the off-training set error) is important for evaluating the generalization ability of an algorithm. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 101 / 108
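A rotated division (k-fold cross-validation) can be sketched as an index generator; the contiguous-chunk fold boundaries below are the usual convention, an assumption that the data are not ordered by class:

```python
def kfold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation:
    each of the n samples appears in exactly one test fold."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

folds = list(kfold_indices(10, 5))
print(len(folds), [len(test) for _, test in folds])  # -> 5 [2, 2, 2, 2, 2]
```

Averaging the off-training-set error over the k folds gives the rotated-division estimate used to compare algorithms.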

102 Combining Classifiers Just like different features capturing different properties of a pattern, different classifiers also capture different structures and relationships of these patterns in the feature space. An empirical comparison of different classifiers can help us choose one of them as the best classifier for the problem at hand. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 102 / 108

103 Combining Classifiers Although most of the classifiers may have similar error rates, the sets of patterns misclassified by different classifiers do not necessarily overlap. Not relying on a single decision but rather combining the advantages of different classifiers is intuitively promising for improving the overall accuracy of classification. Such combinations are variously called combined classifiers, ensemble classifiers, mixture-of-expert models, or pooled classifiers. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 103 / 108

104 Combining Classifiers Some reasons for combining multiple classifiers to solve a given classification problem can be stated as follows: Access to different classifiers, each developed in a different context and for an entirely different representation/description of the same problem. Availability of multiple training sets, each collected at a different time or in a different environment, and possibly even using different features. Local performances of different classifiers, where each classifier may have its own region in the feature space where it performs the best. Different performances due to different initializations and randomness inherent in the training procedure. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 104 / 108

105 Combining Classifiers In summary, we may have different feature sets, training sets, classification methods, and training sessions, all resulting in a set of classifiers whose outputs may be combined. Combination architectures can be grouped as: Parallel: all classifiers are invoked independently and then their results are combined by a combiner. Serial (cascading): individual classifiers are invoked in a linear sequence where the number of possible classes for a given pattern is gradually reduced. Hierarchical (tree): individual classifiers are combined into a structure, which is similar to that of a decision tree, where the nodes are associated with the classifiers. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 105 / 108

106 Combining Classifiers Examples of classifier combination schemes are: Majority voting where each classifier makes a binary decision (vote) about each class and the final decision is made in favor of the class with the largest number of votes. Bayesian combination: sum, product, maximum, minimum and median of the posterior probabilities from individual classifiers. Bagging where multiple classifiers are built by bootstrapping the original training set. Boosting where a sequence of classifiers is built by training each classifier using data sampled from a distribution derived from the empirical misclassification rate of the previous classifier. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 106 / 108
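Majority voting, the first scheme above, reduces to counting votes per pattern; a minimal Python sketch follows (the label lists are hypothetical):

```python
from collections import Counter

def majority_vote(predictions):
    """predictions[c][i] is the label classifier c assigns to pattern i.
    Returns, for each pattern, the class with the largest number of votes
    (ties broken by the order in which labels are first seen)."""
    n_patterns = len(predictions[0])
    combined = []
    for i in range(n_patterns):
        votes = Counter(clf[i] for clf in predictions)
        combined.append(votes.most_common(1)[0][0])
    return combined

# Three classifiers, two patterns: the ensemble recovers the consensus label
# even though the third classifier is wrong on both patterns.
print(majority_vote([["a", "b"], ["a", "b"], ["b", "a"]]))  # -> ['a', 'b']
```

The Bayesian combination rules in the list differ only in replacing the vote counts with sums, products, or order statistics of the classifiers' posterior probabilities.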

107 Structural and Syntactic Pattern Recognition Statistical pattern recognition attempts to classify patterns based on a set of extracted features and an underlying statistical model for the generation of these patterns. Ideally, this is achieved with a rather straightforward procedure: determine the feature vector, train the system, classify the patterns. Unfortunately, there are also many problems where patterns contain structural and relational information that is difficult or impossible to quantify in feature vector form. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 107 / 108

108 Structural and Syntactic Pattern Recognition Structural pattern recognition assumes that pattern structure is quantifiable and extractable so that structural similarity of patterns can be assessed. Typically, these approaches formulate hierarchical descriptions of complex patterns built up from simpler primitive elements. This structure quantification and description are mainly done using: Formal grammars Relational descriptions (principally graphs) Then, recognition and classification are done using: Parsing (for formal grammars) Relational graph matching (for relational descriptions) UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 108 / 108


CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes

### Support Vector Machines Explained

March 1, 2009 Support Vector Machines Explained Tristan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introduction This document has been written in an attempt to make the Support Vector Machines (SVM),

### NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

### Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

### Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

Classifying Large Data Sets Using SVMs with Hierarchical Clusters Presented by :Limou Wang Overview SVM Overview Motivation Hierarchical micro-clustering algorithm Clustering-Based SVM (CB-SVM) Experimental

### Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague. http://ida.felk.cvut.

Machine Learning and Data Analysis overview Jiří Kléma Department of Cybernetics, Czech Technical University in Prague http://ida.felk.cvut.cz psyllabus Lecture Lecturer Content 1. J. Kléma Introduction,

### KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics

ROCHESTER INSTITUTE OF TECHNOLOGY COURSE OUTLINE FORM KATE GLEASON COLLEGE OF ENGINEERING John D. Hromi Center for Quality and Applied Statistics NEW (or REVISED) COURSE (KGCOE- CQAS- 747- Principles of

### Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Data Mining Clustering (2) Toon Calders Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining Outline Partitional Clustering Distance-based K-means, K-medoids,

### Data Mining Practical Machine Learning Tools and Techniques. Slides for Chapter 7 of Data Mining by I. H. Witten and E. Frank

Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 7 of Data Mining by I. H. Witten and E. Frank Engineering the input and output Attribute selection Scheme independent, scheme

### 6. If there is no improvement of the categories after several steps, then choose new seeds using another criterion (e.g. the objects near the edge of

Clustering Clustering is an unsupervised learning method: there is no target value (class label) to be predicted, the goal is finding common patterns or grouping similar examples. Differences between models/algorithms

### Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

### Improved Fuzzy C-means Clustering Algorithm Based on Cluster Density

Journal of Computational Information Systems 8: 2 (2012) 727 737 Available at http://www.jofcis.com Improved Fuzzy C-means Clustering Algorithm Based on Cluster Density Xiaojun LOU, Junying LI, Haitao

### COLLEGE OF SCIENCE. John D. Hromi Center for Quality and Applied Statistics

ROCHESTER INSTITUTE OF TECHNOLOGY COURSE OUTLINE FORM COLLEGE OF SCIENCE John D. Hromi Center for Quality and Applied Statistics NEW (or REVISED) COURSE: COS-STAT-747 Principles of Statistical Data Mining

### Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012

Clustering Big Data Anil K. Jain (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012 Outline Big Data How to extract information? Data clustering

### Learning. Artificial Intelligence. Learning. Types of Learning. Inductive Learning Method. Inductive Learning. Learning.

Learning Learning is essential for unknown environments, i.e., when designer lacks omniscience Artificial Intelligence Learning Chapter 8 Learning is useful as a system construction method, i.e., expose

### Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototype-based clustering Density-based clustering Graph-based

### Statistical Models in Data Mining

Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of

### Unsupervised Data Mining (Clustering)

Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in

### Exploratory Data Analysis with MATLAB

Computer Science and Data Analysis Series Exploratory Data Analysis with MATLAB Second Edition Wendy L Martinez Angel R. Martinez Jeffrey L. Solka ( r ec) CRC Press VV J Taylor & Francis Group Boca Raton

### A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions

A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois at Urbana-Champaign IBM T. J. Watson Research Center

### Principles of Data Mining by Hand&Mannila&Smyth

Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences