# Classification Techniques for Remote Sensing


Selim Aksoy, Department of Computer Engineering, Bilkent University, Bilkent, 06800, Ankara (saksoy/courses/cs551). UYGU 2014. © 2014, Selim Aksoy (Bilkent University).

## Outline

1. Introduction
2. Bayesian Decision Theory: Parametric Models; Non-parametric Methods
3. Feature Reduction and Selection
4. Non-Bayesian Classifiers: Distance-based Classifiers; Decision Boundary-based Classifiers
5. Unsupervised Learning and Clustering
6. Algorithm-Independent Learning Issues: Estimating and Comparing Classifiers; Combining Classifiers
7. Structural and Syntactic Pattern Recognition

## Human versus Machine Perception

Humans have developed highly sophisticated skills for sensing their environment and taking actions according to what they observe. We would like to give similar capabilities to machines. This is of particular interest in remote sensing due to:

- the expansion in data volume with increasing spatial, spectral, and temporal resolutions,
- the increasing complexity of the data content,
- the urgency of the need to extract useful information.

## Classification

What is classification? Finding boundaries in a feature space.

Figure 1: Example classification boundaries for 2D data sets.

Classification methods use techniques from mathematics, statistics, computer science, etc., to solve this problem.

## An Example

Problem: sorting incoming fish on a conveyor belt according to species. Assume that we have only two kinds of fish: sea bass and salmon.

Figure 2: Picture taken from a camera.

## An Example: Decision Process

What kind of information can distinguish one species from the other? Length, width, weight, number and shape of fins, tail shape, etc.

What can cause problems during sensing? Lighting conditions, position of the fish on the conveyor belt, camera noise, etc.

What are the steps in the process? Capture image, isolate fish, take measurements, make decision.

## An Example: Selecting Features

Assume a fisherman told us that a sea bass is generally longer than a salmon. We can use length as a feature and decide between sea bass and salmon according to a threshold on length. How can we choose this threshold?

## An Example: Selecting Features

Figure 3: Histograms of the length feature for the two types of fish in the training samples.

How can we choose the threshold $l$ to make a reliable decision?
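One simple way to choose the threshold is to scan candidate values and keep the one with the fewest errors on the training samples. A minimal sketch with hypothetical, synthetically generated length samples (the real histograms are those of Figure 3):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical training lengths; sea bass are longer on average.
sea_bass = rng.normal(loc=60.0, scale=8.0, size=200)
salmon = rng.normal(loc=45.0, scale=7.0, size=200)

def best_threshold(longer, shorter):
    """Scan candidate thresholds l, classify length > l as the longer
    species, and keep the l with the fewest training errors."""
    candidates = np.sort(np.concatenate([longer, shorter]))
    errors = [np.sum(longer <= l) + np.sum(shorter > l) for l in candidates]
    return candidates[int(np.argmin(errors))]

l_star = best_threshold(sea_bass, salmon)
```

Because the two histograms overlap, even the best threshold leaves some training samples misclassified.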

## An Example: Selecting Features

Even though sea bass are longer than salmon on average, there are many examples of fish for which this observation does not hold. Try another feature: the average lightness of the fish scales.

## An Example: Selecting Features

Figure 4: Histograms of the lightness feature for the two types of fish in the training samples.

It looks easier to choose a threshold $x$, but we still cannot make a perfect decision.

## An Example: Cost of Error

We should also consider the costs of the different errors we make in our decisions. For example, suppose the fish packing company knows that:

- Customers who buy salmon will object vigorously if they see sea bass in their cans.
- Customers who buy sea bass will not be unhappy if they occasionally see some expensive salmon in their cans.

How does this knowledge affect our decision?

## An Example: Multiple Features

Assume we also observed that sea bass are typically wider than salmon. We can use two features in our decision:

- lightness: $x_1$
- width: $x_2$

Each fish image is now represented as a point (feature vector) $\mathbf{x} = (x_1, x_2)^T$ in a two-dimensional feature space.

## An Example: Multiple Features

Figure 5: Scatter plot of the lightness and width features for the training samples.

We can draw a decision boundary to divide the feature space into two regions. Does it look better than using only lightness?

## An Example: Multiple Features

Does adding more features always improve the results?

- Avoid unreliable features.
- Be careful about correlations with existing features.
- Be careful about measurement costs.
- Be careful about noise in the measurements.

Is there some curse for working in very high dimensions?

## An Example: Decision Boundaries

Can we do better with another decision rule? More complex models result in more complex boundaries.

Figure 6: We may distinguish the training samples perfectly, but how can we predict how well we will generalize to unknown samples?

## An Example: Decision Boundaries

How can we manage the tradeoff between the complexity of decision rules and their performance on unknown samples?

Figure 7: Different criteria lead to different decision boundaries.

## More on Complexity

Figure 8: Regression example: plot of 10 sample points for the input variable $x$ along with the corresponding target variable $t$. The green curve is the true function that generated the data.

## More on Complexity

Figure 9: Polynomial curve fitting: plots of polynomials of various orders (0th, 1st, 3rd, and 9th in panels (a)–(d)), shown as red curves, fitted to the set of 10 sample points.

## More on Complexity

Figure 10: Polynomial curve fitting: plots of 9th order polynomials fitted to (a) 15 and (b) 100 sample points.

## Pattern Recognition Systems

Figure 11: Object/process diagram of a pattern recognition system: the physical environment is observed through data acquisition/sensing; the data pass through pre-processing, feature extraction, and classification (using a learned model) to post-processing, which produces the decision. A parallel training path takes training data through pre-processing and feature extraction/selection to model learning/estimation.

## Pattern Recognition Systems

- Data acquisition and sensing: measurements of physical variables. Important issues: bandwidth, resolution, sensitivity, distortion, SNR, latency, etc.
- Pre-processing: removal of noise in the data; isolation of patterns of interest from the background.
- Feature extraction: finding a new representation in terms of features.

## Pattern Recognition Systems

- Model learning and estimation: learning a mapping between features and pattern groups and categories.
- Classification: using the features and the learned models to assign a pattern to a category.
- Post-processing: evaluation of confidence in decisions; exploitation of context to improve performance; combination of experts.

## The Design Cycle

Figure 12: The design cycle: collect data, select features, select model, train classifier, evaluate classifier.

Data collection: collecting training and testing data. How can we know when we have an adequately large and representative set of samples?

## The Design Cycle

Feature selection:

- Domain dependence and prior information.
- Computational cost and feasibility.
- Discriminative features: similar values for similar patterns; different values for different patterns.
- Invariant features with respect to translation, rotation, and scale.
- Robust features with respect to occlusion, distortion, deformation, and variations in the environment.

## The Design Cycle

Model selection:

- Domain dependence and prior information.
- Definition of design criteria.
- Parametric vs. non-parametric models.
- Handling of missing features.
- Computational complexity.
- Types of models: templates, decision-theoretic or statistical, syntactic or structural, neural, and hybrid.

How can we know how close we are to the true model underlying the patterns?

## The Design Cycle

Training: how can we learn the rule from data?

- Supervised learning: a teacher provides a category label or cost for each pattern in the training set.
- Unsupervised learning: the system forms clusters or natural groupings of the input patterns.
- Semi-supervised learning: a class of supervised learning techniques that also make use of unlabeled data for training.
- Active learning: a special case of semi-supervised learning in which the learning algorithm interactively queries the user to obtain the desired outputs at new data points.

## The Design Cycle

Evaluation:

- How can we estimate the performance with the training samples?
- How can we predict the performance with future data?
- Problems of overfitting and generalization.

## Bayesian Decision Theory

Bayesian decision theory is a statistical approach that quantifies the tradeoffs between various decisions using the probabilities and costs that accompany such decisions.

Fish sorting example: define $w$, the type of fish we observe (the class), as a random variable where $w = w_1$ for sea bass and $w = w_2$ for salmon. $P(w_1)$ is the a priori probability that the next fish is a sea bass; $P(w_2)$ is the a priori probability that the next fish is a salmon.

## Prior Probabilities

Prior probabilities reflect our knowledge of how likely each type of fish is to appear before we actually see it. How can we choose $P(w_1)$ and $P(w_2)$?

- Set $P(w_1) = P(w_2)$ if they are equiprobable (uniform priors).
- We may use different values depending on the fishing area, the time of the year, etc.

Assume there are no other types of fish (exclusivity and exhaustivity), so that $P(w_1) + P(w_2) = 1$.

## Making a Decision

How can we make a decision with only the prior information?

Decide $w_1$ if $P(w_1) > P(w_2)$, otherwise decide $w_2$.

What is the probability of error for this decision?

$$P(\text{error}) = \min\{P(w_1), P(w_2)\}$$

## Class-conditional Probabilities

Let's try to improve the decision using the lightness measurement $x$. Let $x$ be a continuous random variable. Define $p(x \mid w_j)$ as the class-conditional probability density (the probability density of $x$ given that the class is $w_j$, for $j = 1, 2$). $p(x \mid w_1)$ and $p(x \mid w_2)$ describe the difference in lightness between the populations of sea bass and salmon.

## Posterior Probabilities

Suppose we know $P(w_j)$ and $p(x \mid w_j)$ for $j = 1, 2$, and measure the lightness of a fish as the value $x$. Define $P(w_j \mid x)$ as the a posteriori probability (the probability of the class being $w_j$ given the measured feature value $x$). We can use the Bayes formula to convert the prior probability to the posterior probability

$$P(w_j \mid x) = \frac{p(x \mid w_j)\,P(w_j)}{p(x)}$$

where $p(x) = \sum_{j=1}^{2} p(x \mid w_j)\,P(w_j)$.
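The conversion can be checked numerically. A small sketch with hypothetical priors and likelihood values at a single measured $x$:

```python
import numpy as np

# Hypothetical priors P(w1), P(w2) and likelihoods p(x|w1), p(x|w2)
# evaluated at one measured lightness value x.
priors = np.array([2 / 3, 1 / 3])
likelihoods = np.array([0.05, 0.20])

evidence = np.sum(likelihoods * priors)        # p(x), the normalizer
posteriors = likelihoods * priors / evidence   # P(wj|x) via the Bayes formula
```

Here the evidence is $0.05 \cdot 2/3 + 0.20 \cdot 1/3 = 0.1$, so the posteriors are $(1/3, 2/3)$: the prior favored $w_1$, but the measurement reverses the decision.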

## Making a Decision

$p(x \mid w_j)$ is called the likelihood and $p(x)$ is called the evidence. How can we make a decision after observing the value of $x$?

Decide $w_1$ if $P(w_1 \mid x) > P(w_2 \mid x)$, otherwise decide $w_2$.

Rewriting the rule gives: decide $w_1$ if

$$\frac{p(x \mid w_1)}{p(x \mid w_2)} > \frac{P(w_2)}{P(w_1)},$$

otherwise decide $w_2$. Note that, at every $x$, $P(w_1 \mid x) + P(w_2 \mid x) = 1$.

## Making a Decision

Figure 13: Optimum thresholds for different priors.

## Probability of Error

What is the probability of error for this decision?

$$P(\text{error} \mid x) = \begin{cases} P(w_1 \mid x) & \text{if we decide } w_2 \\ P(w_2 \mid x) & \text{if we decide } w_1 \end{cases}$$

What is the average probability of error?

$$P(\text{error}) = \int p(\text{error}, x)\,dx = \int P(\text{error} \mid x)\,p(x)\,dx$$

The Bayes decision rule minimizes this error because $P(\text{error} \mid x) = \min\{P(w_1 \mid x), P(w_2 \mid x)\}$.

## Probability of Error

Figure 14: Components of the probability of error for equal priors and a non-optimal decision point $x^*$. The optimal point $x_B$ minimizes the total shaded area and gives the Bayes error rate.

## Receiver Operating Characteristics

Consider the two-category case and define $w_1$: target is present, $w_2$: target is not present.

Table 1: Confusion matrix.

| True \ Assigned | $w_1$ | $w_2$ |
| --- | --- | --- |
| $w_1$ | correct detection | mis-detection |
| $w_2$ | false alarm | correct rejection |

Mis-detection is also called a false negative or Type II error. A false alarm is also called a false positive or Type I error.

## Receiver Operating Characteristics

If we use a parameter (e.g., a threshold) in our decision, the plot of these rates for different values of the parameter is called the receiver operating characteristic (ROC) curve.

Figure 15: Example receiver operating characteristic (ROC) curves for different settings of the system.
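An ROC curve can be traced by sweeping the decision threshold over a detector score. A sketch with hypothetical Gaussian scores for the two categories:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical scores: higher values give more evidence the target is present.
target_scores = rng.normal(2.0, 1.0, 500)    # w1: target present
clutter_scores = rng.normal(0.0, 1.0, 500)   # w2: target not present

# Sweep the threshold; each setting yields one (false alarm, detection) pair.
thresholds = np.linspace(-4.0, 6.0, 101)
p_detection = np.array([(target_scores > t).mean() for t in thresholds])
p_false_alarm = np.array([(clutter_scores > t).mean() for t in thresholds])
```

Plotting `p_detection` against `p_false_alarm` gives the ROC curve; raising the threshold trades fewer false alarms for more mis-detections.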

## Bayesian Decision Theory

How can we generalize to

- more than one feature? Replace the scalar $x$ by the feature vector $\mathbf{x}$.
- more than two classes? Just a difference in notation.
- allowing actions other than just decisions? Allow the possibility of rejection.
- different risks in the decision? Define how costly each action is.

## Minimum-error-rate Classification

Let $\{w_1, \ldots, w_c\}$ be the finite set of $c$ classes (categories). Let $\mathbf{x}$ be the $d$-component vector-valued random variable called the feature vector. If all errors are equally costly, the minimum-error decision rule is defined as

Decide $w_i$ if $P(w_i \mid \mathbf{x}) > P(w_j \mid \mathbf{x})$ for all $j \neq i$.

The resulting error is called the Bayes error and is the best performance that can be achieved.
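With posteriors for all $c$ classes in hand, the rule is just an argmax. A minimal sketch (the posterior values are hypothetical):

```python
import numpy as np

def min_error_decision(posteriors):
    """Minimum-error-rate rule: decide the class wi whose posterior
    P(wi|x) is largest (ties broken by the first maximum)."""
    return int(np.argmax(posteriors))

# Hypothetical posteriors P(w1|x), P(w2|x), P(w3|x) at one feature vector x.
decision = min_error_decision([0.2, 0.5, 0.3])
```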

## Bayesian Decision Theory

Bayesian decision theory shows us how to design an optimal classifier if we know the prior probabilities $P(w_i)$ and the class-conditional densities $p(\mathbf{x} \mid w_i)$. Unfortunately, we rarely have complete knowledge of the probabilistic structure. However, we can often find design samples or training data that include particular representatives of the patterns we want to classify.

## Bayesian Decision Theory

Parametric models: assume that the form of the density functions is known.

- Density models (e.g., Gaussian)
- Mixture models (e.g., mixture of Gaussians)
- Hidden Markov models
- Graphical models

Non-parametric models: no assumption about the form.

- Histogram-based estimation
- Parzen window estimation
- Nearest neighbor estimation

## The Gaussian Density

The Gaussian can be considered a model where the feature vectors for a given class are continuous-valued, randomly corrupted versions of a single typical or prototype vector.

Some properties of the Gaussian:

- Analytically tractable.
- Completely specified by the 1st and 2nd moments.
- Has the maximum entropy of all distributions with a given mean and variance.
- Many processes are asymptotically Gaussian (central limit theorem).
- Uncorrelatedness implies independence (for jointly Gaussian variables).

## Univariate Gaussian

For $x \in \mathbb{R}$:

$$p(x) = N(\mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right]$$

where

$$\mu = E[x] = \int x\,p(x)\,dx, \qquad \sigma^2 = E[(x - \mu)^2] = \int (x - \mu)^2\,p(x)\,dx.$$
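The density is straightforward to evaluate directly from the formula; a sketch:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return np.exp(-0.5 * z ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

# The density integrates to one (checked here with a Riemann sum).
xs = np.linspace(-8.0, 8.0, 16001)
area = gaussian_pdf(xs, 0.0, 1.0).sum() * (xs[1] - xs[0])
```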

## Univariate Gaussian

Figure 16: A univariate Gaussian distribution has roughly 95% of its area in the range $|x - \mu| \leq 2\sigma$.

## Multivariate Gaussian

For $\mathbf{x} \in \mathbb{R}^d$:

$$p(\mathbf{x}) = N(\boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right]$$

where

$$\boldsymbol{\mu} = E[\mathbf{x}] = \int \mathbf{x}\,p(\mathbf{x})\,d\mathbf{x}, \qquad \Sigma = E[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T] = \int (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T\,p(\mathbf{x})\,d\mathbf{x}.$$

## Multivariate Gaussian

Figure 17: Samples drawn from a two-dimensional Gaussian lie in a cloud centered on the mean $\boldsymbol{\mu}$. The loci of points of constant density are the ellipses for which $(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})$ is constant, where the eigenvectors of $\Sigma$ determine the directions and the corresponding eigenvalues determine the lengths of the principal axes. The quantity $r^2 = (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})$ is called the squared Mahalanobis distance from $\mathbf{x}$ to $\boldsymbol{\mu}$.

## Gaussian Density Estimation

The maximum likelihood estimates of a Gaussian are

$$\hat{\boldsymbol{\mu}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i \qquad \text{and} \qquad \hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} (\mathbf{x}_i - \hat{\boldsymbol{\mu}})(\mathbf{x}_i - \hat{\boldsymbol{\mu}})^T.$$

Figure 18: Gaussian density estimation examples: histograms and Gaussian fits for a random sample from $N(10, 2^2)$ and for a random sample from the mixture $0.5\,N(10, 0.4^2) + 0.5\,N(11, 0.5^2)$.
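The two estimators can be sketched directly on hypothetical 2-D samples drawn from a known Gaussian:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical 2-D samples from a known Gaussian, for illustration only.
true_mu = np.array([10.0, -1.0])
true_cov = np.array([[2.0, 0.5], [0.5, 1.0]])
x = rng.multivariate_normal(true_mu, true_cov, size=5000)

# Maximum likelihood estimates: sample mean and 1/n covariance.
mu_hat = x.mean(axis=0)
centered = x - mu_hat
sigma_hat = centered.T @ centered / len(x)
```

With 5000 samples the estimates are already close to the generating parameters; note the ML covariance divides by $n$, not $n - 1$.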

## Case 1: $\Sigma_i = \sigma^2 I$

Figure 19: If the covariance matrices of two distributions are equal and proportional to the identity matrix, then the distributions are spherical in $d$ dimensions, and the boundary is a generalized hyperplane of $d - 1$ dimensions, perpendicular to the line separating the means. The decision boundary shifts as the priors are changed.

## Case 2: $\Sigma_i = \Sigma$

Figure 20: Probability densities with equal but asymmetric Gaussian distributions. The decision hyperplanes are not necessarily perpendicular to the line connecting the means.

## Case 3: $\Sigma_i$ arbitrary

Figure 21: Arbitrary Gaussian distributions lead to Bayes decision boundaries that are general hyperquadrics.

## Case 3: $\Sigma_i$ arbitrary

Figure 22: Arbitrary Gaussian distributions lead to Bayes decision boundaries that are general hyperquadrics.

## Mixture Densities

A mixture model is a linear combination of $m$ densities

$$p(\mathbf{x} \mid \Theta) = \sum_{j=1}^{m} \alpha_j\,p_j(\mathbf{x} \mid \theta_j)$$

where $\Theta = (\alpha_1, \ldots, \alpha_m, \theta_1, \ldots, \theta_m)$ such that $\alpha_j \geq 0$ and $\sum_{j=1}^{m} \alpha_j = 1$.

$\alpha_1, \ldots, \alpha_m$ are called the mixing parameters, and $p_j(\mathbf{x} \mid \theta_j)$, $j = 1, \ldots, m$, are called the component densities. The most commonly used mixture model is the mixture of Gaussians.

## Expectation-Maximization

Estimation of the mixture model parameters can be done using the Expectation-Maximization (EM) algorithm. The EM algorithm is a general iterative method for finding the maximum likelihood estimates of the parameters of a distribution from training data.

The EM algorithm:

- First, computes the expected value of the complete-data log-likelihood using the current parameter estimates (expectation step).
- Then, finds new values of the parameters that maximize this expectation (maximization step).
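The two steps can be sketched for a one-dimensional mixture of Gaussians. This is a minimal illustration on hypothetical synthetic data with a fixed iteration count, not a full implementation (no convergence test or safeguards against degenerate components):

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical 1-D data: 30% from N(-2, 1), 70% from N(3, 1).
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])

def em_gmm(x, m=2, iters=50):
    """EM for a 1-D mixture of m Gaussians (a minimal sketch)."""
    n = len(x)
    alpha = np.full(m, 1.0 / m)
    mu = np.quantile(x, np.linspace(0.1, 0.9, m))  # spread-out initial means
    var = np.full(m, x.var())
    for _ in range(iters):
        # E-step: responsibilities gamma_ij = P(component j | x_i).
        dens = alpha * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
               / np.sqrt(2.0 * np.pi * var)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing weights, means, and variances.
        nj = gamma.sum(axis=0)
        alpha = nj / n
        mu = (gamma * x[:, None]).sum(axis=0) / nj
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nj
    return alpha, mu, var

alpha, mu, var = em_gmm(x)
```

On this well-separated example the estimated means and mixing parameters approach the generating values.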

Figure 23: Illustration of the EM algorithm iterations, shown in panels (a)–(f), for a mixture of two Gaussians.

Figure 24: 1-D Bayesian classification examples where the data for each class come from a mixture of three Gaussians: (a) true densities and sample histograms; (b) linear Gaussian classifier with $P_e = \ldots$; (c) quadratic Gaussian classifier with $P_e = \ldots$; (d) mixture of Gaussians classifier with $P_e = \ldots$. The Bayes error is $P_e = \ldots$.

Figure 25: 2-D Bayesian classification examples where the data for the classes come from a banana-shaped distribution and a bivariate Gaussian: (a) scatter plot; (b) linear Gaussian classifier with $P_e = \ldots$; (c) quadratic Gaussian classifier with $P_e = \ldots$; (d) mixture of Gaussians classifier with $P_e = \ldots$.

Figure 26: 2-D Bayesian classification examples where the data for each class come from a banana-shaped distribution: (a) scatter plot; (b) quadratic Gaussian classifier with $P_e = \ldots$; (c) mixture of Gaussians classifier with $P_e = \ldots$.

## Histogram Method

A very simple method is to partition the space into a number of equally-sized cells (bins) and compute a histogram.

Figure 27: Histogram in one dimension.

The estimate of the density at a point $x$ becomes

$$p(x) = \frac{k}{nV}$$

where $n$ is the total number of samples, $k$ is the number of samples in the cell that includes $x$, and $V$ is the volume of that cell.
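The estimate $p(x) = k/(nV)$ can be computed directly by counting the samples in the bin that contains $x$. A sketch in one dimension (the samples and the bin width are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, 10000)  # hypothetical 1-D samples

def histogram_density(samples, point, bin_width=0.2):
    """Estimate p(point) = k / (n V) from the bin containing the point."""
    n = len(samples)
    left = np.floor(point / bin_width) * bin_width
    k = np.sum((samples >= left) & (samples < left + bin_width))
    return k / (n * bin_width)

p0 = histogram_density(x, 0.0)
```

At the mode of the standard normal the estimate should land near the true density $1/\sqrt{2\pi} \approx 0.399$.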

## Non-parametric Methods

Advantages:

- No assumptions are needed about the distributions ahead of time (generality).
- With enough samples, convergence to an arbitrarily complicated target density can be obtained.

Disadvantages:

- The number of samples needed may be very large (the number grows exponentially with the dimensionality of the feature space).
- There may be severe requirements for computation time and storage.

## Classification Error

To apply these results to multiple classes, separate the training samples into $c$ subsets $D_1, \ldots, D_c$, with the samples in $D_i$ belonging to class $w_i$, and then estimate each density $p(\mathbf{x} \mid w_i, D_i)$ separately.

Different sources of error:

- Bayes error: due to overlapping class-conditional densities (related to the features used).
- Model error: due to an incorrect model.
- Estimation error: due to estimation from a finite sample (can be reduced by increasing the amount of training data).

## Feature Reduction and Selection

In practical multicategory applications, it is not unusual to encounter problems involving tens or hundreds of features. Intuitively, it may seem that each feature is useful for at least some of the discriminations. There are two issues that we must be careful about:

- How is the classification accuracy affected by the dimensionality (relative to the amount of training data)?
- How is the computational complexity of the classifier affected by the dimensionality?

## Problems of Dimensionality

In general, if the performance obtained with a given set of features is inadequate, it is natural to consider adding new features. Unfortunately, it has frequently been observed in practice that, beyond a certain point, adding new features leads to worse rather than better performance. This is called the curse of dimensionality. Potential reasons include wrong assumptions in model selection or estimation errors due to the finite number of training samples for high-dimensional observations (overfitting).

## Problems of Dimensionality

All of the commonly used classifiers can suffer from the curse of dimensionality. Dimensionality can be reduced by:

- redesigning the features,
- selecting an appropriate subset among the existing features,
- combining existing features.

Principal Components Analysis (PCA) seeks a projection that best represents the data in a least-squares sense. Linear Discriminant Analysis (LDA) seeks a projection that best separates the data in a least-squares sense.
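PCA can be sketched as an eigendecomposition of the sample covariance matrix; projecting onto the eigenvectors decorrelates the features. A minimal illustration on hypothetical correlated 2-D data:

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical correlated 2-D data.
x = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.2], [1.2, 1.0]], size=2000)

# PCA: eigendecomposition of the sample covariance matrix.
mu = x.mean(axis=0)
cov = np.cov(x - mu, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
order = np.argsort(eigvals)[::-1]           # largest variance first
projected = (x - mu) @ eigvecs[:, order]    # uncorrelated coordinates
```

The first projected coordinate carries the greatest variance and the off-diagonal covariance of the projected data vanishes, matching Figure 28.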

## Examples

Figure 28: Scatter plot (red dots) and the principal axes for a bivariate sample: (a) scatter plot; (b) projection onto $\mathbf{e}_1$; (c) projection onto $\mathbf{e}_2$. The blue line shows the axis $\mathbf{e}_1$ with the greatest variance and the green line shows the axis $\mathbf{e}_2$ with the smallest variance. The features are now uncorrelated.

## Examples

Figure 29: Scatter plot and the PCA and LDA axes for a bivariate sample with two classes: (a) scatter plot; (b) projection onto the first PCA axis; (c) projection onto the first LDA axis. The histogram of the projection onto the first LDA axis shows better separation than the projection onto the first PCA axis.

## Examples

Figure 30: Scatter plot and the PCA and LDA axes for a bivariate sample with two classes: (a) scatter plot; (b) projection onto the first PCA axis; (c) projection onto the first LDA axis. The histogram of the projection onto the first LDA axis shows better separation than the projection onto the first PCA axis.

## Examples

Figure 31: An image and the first six PCA bands (after projection). Histogram equalization was applied to all images for better visualization.

## Examples

Figure 32: An image and the six LDA bands (after projection). Histogram equalization was applied to all images for better visualization.

## Feature Selection

An alternative to feature reduction, which uses linear or non-linear combinations of features, is feature selection, which reduces dimensionality by selecting subsets of the existing features. The first step in feature selection is to define a criterion function, which is often a function of the classification error. Note that the use of the classification error in the criterion function makes feature selection procedures dependent on the specific classifier used.

## Feature Selection

The most straightforward approach would require examining all $\binom{d}{m}$ possible subsets of size $m$ and selecting the subset that performs the best according to the criterion function. However, the number of subsets grows combinatorially, making this exhaustive search impractical. Iterative procedures such as sequential forward selection or sequential backward selection are often used instead, but they cannot guarantee the selection of the optimal subset.
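Sequential forward selection can be sketched generically: at each iteration, add the feature that most improves the criterion function. The criterion below is a hypothetical stand-in (a simple additive score) for the classifier-based accuracy used in practice:

```python
import numpy as np

def sequential_forward_selection(score, n_features, n_select):
    """Greedy SFS: repeatedly add the single feature that maximizes
    the criterion function score(subset)."""
    selected = []
    remaining = list(range(n_features))
    while remaining and len(selected) < n_select:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical criterion: each feature contributes a fixed score.
weights = np.array([0.5, 0.1, 0.9, 0.2])
subset = sequential_forward_selection(lambda s: weights[s].sum(), 4, 2)
```

Sequential backward selection is the mirror image: start from all features and greedily remove the least useful one.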

## Examples

Figure 33: Results of sequential forward feature selection for classification of a satellite image using 28 features (DEM elevation, IKONOS and aerial image bands, and their Gabor texture features). The x-axis shows the classification accuracy (%) and the y-axis shows the features added at each iteration (the first iteration is at the bottom). The highest accuracy value is shown with a star.

## Examples

Figure 34: Results of sequential backward feature selection for classification of a satellite image using 28 features (DEM elevation, IKONOS and aerial image bands, and their Gabor texture features). The x-axis shows the classification accuracy (%) and the y-axis shows the features removed at each iteration (the first iteration is at the bottom). The highest accuracy value is shown with a star.

## Non-Bayesian Classifiers

Distance-based classifiers:

- Minimum distance (nearest mean) classifier
- k-nearest neighbor classifier

Decision boundary-based classifiers:

- Linear discriminant functions
- Support vector machines
- Neural networks
- Decision trees

## The k-nearest Neighbor Classifier

Given the training data $D = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ as a set of $n$ labeled examples, the nearest neighbor classifier assigns a test point $\mathbf{x}$ the label associated with its closest neighbor in $D$. The k-nearest neighbor classifier classifies $\mathbf{x}$ by assigning it the label most frequently represented among the $k$ nearest samples. Closeness is defined using a distance function.

Figure 35: Classifier for $k = 5$.
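A minimal k-nearest neighbor sketch with Euclidean distance and hypothetical 2-D training data:

```python
import numpy as np

def knn_classify(x_train, y_train, x, k=5):
    """Label x with the majority class among its k nearest training
    samples (Euclidean distance)."""
    d = np.linalg.norm(x_train - x, axis=1)
    nearest = y_train[np.argsort(d)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]

# Hypothetical 2-D training data for two classes.
x_train = np.array([[0.0, 0.0], [0.2, 0.1], [-0.1, 0.2],
                    [3.0, 3.0], [3.1, 2.9], [2.8, 3.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])
```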

## Distance Functions

A general class of metrics for $d$-dimensional patterns is the Minkowski metric

$$L_p(\mathbf{x}, \mathbf{y}) = \left(\sum_{i=1}^{d} |x_i - y_i|^p\right)^{1/p}$$

also referred to as the $L_p$ norm.

The Euclidean distance is the $L_2$ norm

$$L_2(\mathbf{x}, \mathbf{y}) = \left(\sum_{i=1}^{d} |x_i - y_i|^2\right)^{1/2}.$$

The Manhattan or city block distance is the $L_1$ norm

$$L_1(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{d} |x_i - y_i|.$$

## Distance Functions

The $L_\infty$ norm is the maximum of the distances along the individual coordinate axes

$$L_\infty(\mathbf{x}, \mathbf{y}) = \max_{i=1,\ldots,d} |x_i - y_i|.$$

Figure 36: Each colored shape consists of points at a distance 1.0 from the origin, measured using different values of $p$ in the Minkowski $L_p$ metric.
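The three special cases all follow from one function of $p$; a small sketch:

```python
import numpy as np

def minkowski(x, y, p):
    """L_p distance; p = 1 (city block), 2 (Euclidean), np.inf (max norm)."""
    diff = np.abs(np.asarray(x) - np.asarray(y))
    if np.isinf(p):
        return diff.max()
    return (diff ** p).sum() ** (1.0 / p)
```

For $\mathbf{x} = (0, 0)$ and $\mathbf{y} = (3, 4)$ this gives $L_1 = 7$, $L_2 = 5$, and $L_\infty = 4$.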

## Linear Discriminant Functions

Figure 37: Linear decision boundaries produced by using one linear discriminant for each class.

## Support Vector Machines

Linear discriminant functions are optimal if the underlying distributions are Gaussians having equal covariance for each class. In the general case, the problem of finding linear discriminant functions can be formulated as a problem of optimizing a criterion function. Among all hyperplanes separating the data, there exists a unique one yielding the maximum margin of separation between the classes.

## Support Vector Machines

Figure 38: The margin is defined as the perpendicular distance between the decision boundary ($y = 0$) and the closest of the data points, which lie on the lines $y = -1$ and $y = 1$ (left). Maximizing the margin leads to a particular choice of decision boundary (right). The location of this boundary is determined by a subset of the data points, known as the support vectors, which are indicated by the circles.

## Support Vector Machines

Given a set of training patterns and class labels as $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n) \in \mathbb{R}^d \times \{\pm 1\}$, the goal is to find a classifier function $f : \mathbb{R}^d \to \{\pm 1\}$ such that $f(\mathbf{x}) = y$ will correctly classify new patterns.

Support vector machines are based on the class of hyperplanes

$$(\mathbf{w} \cdot \mathbf{x}) + b = 0, \qquad \mathbf{w} \in \mathbb{R}^d,\; b \in \mathbb{R},$$

corresponding to the decision functions $f(\mathbf{x}) = \operatorname{sign}((\mathbf{w} \cdot \mathbf{x}) + b)$.

## Support Vector Machines

Figure 39: A binary classification problem of separating balls from diamonds. The optimal hyperplane is orthogonal to the shortest line connecting the convex hulls of the two classes (dotted), and intersects it halfway between the two classes. There is a weight vector $\mathbf{w}$ and a threshold $b$ such that the points closest to the hyperplane satisfy $|(\mathbf{w} \cdot \mathbf{x}_i) + b| = 1$, corresponding to $y_i((\mathbf{w} \cdot \mathbf{x}_i) + b) \geq 1$. The margin, measured perpendicularly to the hyperplane, equals $2 / \|\mathbf{w}\|$.

83 Support Vector Machines To construct the optimal hyperplane, we can define the following optimization problem: minimize (1/2)‖w‖² subject to y_i((w · x_i) + b) ≥ 1, i = 1, ..., n. The solution can be obtained using quadratic programming techniques, where the solution vector is a weighted summation of a subset of the training patterns, called the support vectors. The support vectors lie on the margin and carry all relevant information about the classification problem (the remaining patterns are irrelevant). UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 83 / 108
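As a minimal numerical sketch of these conditions (not a quadratic programming solver), the snippet below checks the constraints y_i((w · x_i) + b) ≥ 1 for a hand-chosen hyperplane on a hypothetical 2D data set and evaluates the margin 2/‖w‖; both the data and the weight vector are illustrative assumptions, not a learned solution.

```python
import math

# Hypothetical 2D training set: class +1 on the right, class -1 on the left.
X = [(1.0, 0.0), (2.0, 1.0), (-1.0, 0.0), (-2.0, -1.0)]
y = [+1, +1, -1, -1]

# Hand-chosen canonical hyperplane w.x + b = 0 (an assumption for illustration).
w = (1.0, 0.0)
b = 0.0

def functional_margin(x, label):
    """y_i((w . x_i) + b): must be >= 1 for every training pattern."""
    return label * (w[0] * x[0] + w[1] * x[1] + b)

constraints_ok = all(functional_margin(x, label) >= 1 for x, label in zip(X, y))
margin = 2.0 / math.hypot(w[0], w[1])  # the margin of separation equals 2/||w||

print(constraints_ok, margin)  # -> True 2.0
```

Here the closest point of each class sits at distance 1 from the hyperplane x_1 = 0, so the total margin is 2, matching 2/‖w‖ for ‖w‖ = 1.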

84 Support Vector Machines In many real-world problems there will be no linear boundary separating the classes, and the problem of searching for an optimal separating hyperplane is meaningless. However, we can extend the above ideas to handle non-separable data by relaxing the constraints. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 84 / 108

85 Support Vector Machines The new optimization problem becomes: minimize (1/2)‖w‖² + C ∑_{i=1}^{n} ξ_i subject to (w · x_i) + b ≥ +1 − ξ_i for y_i = +1, (w · x_i) + b ≤ −1 + ξ_i for y_i = −1, ξ_i ≥ 0, i = 1, ..., n, where ξ_i, i = 1, ..., n, are called the slack variables and C is a regularization parameter. The term C ∑_{i=1}^{n} ξ_i can be thought of as measuring some amount of misclassification, where lowering the value of C corresponds to a smaller penalty for misclassification. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 85 / 108
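The slack-variable objective can be minimized in several ways; one simple illustrative choice (an assumption here, since the slides solve it by quadratic programming) is stochastic sub-gradient descent on the equivalent hinge-loss form of the same objective:

```python
import random

def train_soft_margin(X, y, C=1.0, lr=0.01, epochs=300, seed=0):
    """Sub-gradient descent on (1/2)||w||^2 + C * sum_i max(0, 1 - y_i((w.x_i) + b)).
    A pure-Python sketch for tiny toy problems, not a production solver."""
    rnd = random.Random(seed)
    d, n = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        order = list(range(n))
        rnd.shuffle(order)
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # gradient of the regularizer (1/2)||w||^2, split across the n samples
            for j in range(d):
                w[j] -= lr * w[j] / n
            if margin < 1:  # the slack variable xi_i is active for this pattern
                for j in range(d):
                    w[j] += lr * C * y[i] * X[i][j]
                b += lr * C * y[i]
    return w, b

# Toy linearly separable data; the learned hyperplane should classify it correctly.
X = [(2.0, 2.0), (3.0, 3.0), (-2.0, -2.0), (-3.0, -1.0)]
y = [+1, +1, -1, -1]
w, b = train_soft_margin(X, y)
preds = [1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1 for x in X]
print(preds)
```

With separable data and a moderate C this recovers a separating hyperplane; with overlapping classes the same code trades margin violations against ‖w‖, which is exactly the role of C above.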

86 Support Vector Machines Figure 40: Illustration of the slack variables ξ_i ≥ 0. Data points with circles around them are support vectors. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 86 / 108

87 Support Vector Machines Both the quadratic programming problem and the final decision function f(x) = sign(∑_{i=1}^{n} α_i y_i (x · x_i) + b) depend only on the dot products between patterns. We can generalize this result to the non-linear case by mapping the original input space into some other space F using a non-linear map Φ : R^d → F, and performing the linear algorithm in the F space, which only requires the dot products k(x, y) = Φ(x) · Φ(y). UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 87 / 108

88 Support Vector Machines Even though F may be high-dimensional, a simple kernel k(x, y) such as the following can be computed efficiently. Table 2: Common kernel functions. Polynomial: k(x, y) = (x · y)^p. Sigmoidal: k(x, y) = tanh(κ(x · y) + θ). Radial basis function: k(x, y) = exp(−‖x − y‖²/(2σ²)). Once a kernel function is chosen, we can substitute Φ(x_i) for each training example x_i, and perform the optimal hyperplane algorithm in F. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 88 / 108
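The kernels in Table 2 translate directly into code; a minimal Python version follows, with the parameter names p, kappa (κ), theta (θ), and sigma (σ) taken from the table:

```python
import math

def dot(x, z):
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, p=2):
    """k(x, y) = (x . y)^p"""
    return dot(x, z) ** p

def sigmoid_kernel(x, z, kappa=1.0, theta=0.0):
    """k(x, y) = tanh(kappa * (x . y) + theta)"""
    return math.tanh(kappa * dot(x, z) + theta)

def rbf_kernel(x, z, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))"""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq / (2.0 * sigma ** 2))

print(polynomial_kernel((1, 2), (3, 4)))    # (1*3 + 2*4)^2 = 121
print(rbf_kernel((1.0, 0.0), (1.0, 0.0)))   # distance 0 from itself -> 1.0
```

Each function computes Φ(x) · Φ(z) in the (possibly infinite-dimensional) space F without ever forming Φ explicitly, which is the point of the kernel trick.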

89 Support Vector Machines This results in the non-linear decision function of the form f(x) = sign(∑_{i=1}^{n} α_i y_i k(x, x_i) + b), where the parameters α_i are computed as the solution of the quadratic programming problem. In the original input space, the hyperplane corresponds to a non-linear decision function whose form is determined by the kernel. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 89 / 108

90 Neural Networks Figure 41: A neural network consists of an input layer, an output layer and usually one or more hidden layers that are interconnected by modifiable weights represented by links between layers. They learn the values of these weights as a mapping from the input to the output. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 90 / 108

91 Decision Trees Figure 42: Decision trees classify a pattern through a sequence of questions, in which the next question asked depends on the answer to the current question. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 91 / 108

92 Unsupervised Learning and Clustering Clustering is an unsupervised procedure that uses unlabeled samples. Unsupervised procedures are used for several reasons: Collecting and labeling a large set of sample patterns can be costly. One can train with a large amount of unlabeled data, and then use supervision to label the groupings found. Exploratory data analysis can provide insight into the nature or structure of the data. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 92 / 108

93 Clusters A cluster is composed of a number of similar objects collected or grouped together. Patterns within a cluster are more similar to each other than to patterns in different clusters. Clusters may be described as connected regions of a multi-dimensional space containing a relatively high density of points, separated from other such regions by a region containing a relatively low density of points. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 93 / 108

94 Clustering Clustering is a very difficult problem because data can reveal clusters with different shapes and sizes. Figure 43: The number of clusters in the data often depends on the resolution (fine vs. coarse) with which we view the data. How many clusters do you see in this figure? 5, 8, 10, more? UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 94 / 108

95 Clustering Most clustering algorithms are based on the following two popular techniques: Iterative squared-error partitioning. Agglomerative hierarchical clustering. One of the main challenges is to select an appropriate measure of similarity to define clusters, a choice that is often both data (cluster shape) and context dependent. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 95 / 108

96 Squared-error Partitioning Suppose that the given set of n patterns has somehow been partitioned into k clusters D_1, ..., D_k. Let n_i be the number of samples in D_i and let m_i be the mean of those samples, m_i = (1/n_i) ∑_{x ∈ D_i} x. Then, the sum-of-squared errors is defined by J_e = ∑_{i=1}^{k} ∑_{x ∈ D_i} ‖x − m_i‖². For a given cluster D_i, the mean vector m_i (centroid) is the best representative of the samples in D_i. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 96 / 108
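The criterion J_e can be computed directly from a given partition; a small Python sketch follows, with points as tuples and clusters as lists of points (an illustrative data layout, not a prescribed one):

```python
def sum_of_squared_errors(clusters):
    """J_e = sum over clusters D_i of sum over x in D_i of ||x - m_i||^2,
    where m_i is the mean (centroid) of cluster D_i."""
    total = 0.0
    for D in clusters:
        n_i, d = len(D), len(D[0])
        m = [sum(x[j] for x in D) / n_i for j in range(d)]  # centroid m_i
        total += sum(sum((x[j] - m[j]) ** 2 for j in range(d)) for x in D)
    return total

# Two tight clusters: every point is 1 unit from its centroid, so J_e = 4.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0), (12.0, 0.0)]]
print(sum_of_squared_errors(clusters))  # -> 4.0
```

A partitioning algorithm such as k-means can be read as a search for a partition with low J_e.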

97 Squared-error Partitioning A general algorithm for iterative squared-error partitioning: 1. Select an initial partition with k clusters. Repeat steps 2 through 5 until the cluster membership stabilizes. 2. Generate a new partition by assigning each pattern to its closest cluster center. 3. Compute new cluster centers as the centroids of the clusters. 4. Repeat steps 2 and 3 until an optimum value of the criterion function is found (e.g., when a local minimum is found or a predefined number of iterations are completed). 5. Adjust the number of clusters by merging and splitting existing clusters or by removing small or outlier clusters. This algorithm, without step 5, is also known as the k-means algorithm. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 97 / 108
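A compact sketch of the k-means core (steps 2 through 4, without the merge/split adjustments of step 5); the pick-the-first-k initialization is a simplifying assumption here, since random selection is more common in practice:

```python
def kmeans(points, k, max_iters=100):
    """Iterative squared-error partitioning: alternate assigning each pattern
    to the closest center with recomputing centers as cluster centroids."""
    d = len(points[0])
    centers = [list(p) for p in points[:k]]  # simplistic initial partition
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # step 2: assign each pattern to its closest cluster center
        clusters = [[] for _ in range(k)]
        for p in points:
            best = min(range(k),
                       key=lambda c: sum((p[j] - centers[c][j]) ** 2 for j in range(d)))
            clusters[best].append(p)
        # step 3: recompute centers as centroids (keep old center if a cluster empties)
        new_centers = [
            [sum(p[j] for p in D) / len(D) for j in range(d)] if D else centers[c]
            for c, D in enumerate(clusters)]
        if new_centers == centers:  # step 4: cluster membership has stabilized
            break
        centers = new_centers
    return centers, clusters

points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
centers, clusters = kmeans(points, 2)
print(sorted(len(D) for D in clusters))  # -> [3, 3]
```

On this toy data the two compact, well-separated blobs are recovered exactly, which matches the conditions under which k-means is said to work well on the next slide.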

98 Squared-error Partitioning k-means is computationally efficient and gives good results if the clusters are compact, hyperspherical in shape, and well-separated in the feature space. However, the need to choose k and the initial partition is the main drawback of this algorithm. The value of k is often chosen empirically or based on prior knowledge about the data. The initial partition is often chosen by generating k random points uniformly distributed within the range of the data, or by randomly selecting k points from the data. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 98 / 108

99 Hierarchical Clustering In some applications, groups of patterns share some characteristics when viewed at a particular level. Hierarchical clustering tries to capture these multi-level groupings using hierarchical representations. Figure 44: A dendrogram can represent the results of hierarchical clustering algorithms. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 99 / 108
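Agglomerative hierarchical clustering starts with one cluster per pattern and repeatedly merges the closest pair; the sketch below uses single linkage (minimum pairwise distance between clusters), one common but here assumed choice of linkage:

```python
def agglomerative(points, target_k):
    """Naive agglomerative clustering with single linkage; merging until
    target_k clusters remain corresponds to cutting the dendrogram at one level."""
    clusters = [[p] for p in points]

    def dist2(a, b):
        return sum((x - z) ** 2 for x, z in zip(a, b))

    def linkage(A, B):  # single linkage: closest pair across the two clusters
        return min(dist2(a, b) for a in A for b in B)

    while len(clusters) > target_k:
        # find and merge the closest pair of clusters
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
print(sorted(sorted(D) for D in agglomerative(points, 2)))
```

Recording the sequence of merges (rather than stopping at target_k) yields the full dendrogram of Figure 44.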

100 Algorithm-Independent Learning Issues We have seen many learning algorithms and techniques for pattern recognition. Some of these algorithms may be preferred because of their lower computational complexity. Others may be preferred because they take into account some prior knowledge of the form of the data. Given practical constraints such as finite training data, no pattern classification method is inherently superior to any other. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 100 / 108

101 Estimating and Comparing Classifiers Classification error can be estimated using misclassification and false alarm rates. To compare learning algorithms, we should use independent training and test data generated using static division, rotated division (e.g., cross-validation), or bootstrap methods. Using the error on points not in the training set (also called the off-training set error) is important for evaluating the generalization ability of an algorithm. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 101 / 108
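A rotated division (k-fold cross-validation) can be sketched as an index generator; the contiguous-chunk fold boundaries below are the usual convention, an assumption that the data are not ordered by class:

```python
def kfold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation:
    each of the n samples appears in exactly one test fold."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

folds = list(kfold_indices(10, 5))
print(len(folds), [len(test) for _, test in folds])  # -> 5 [2, 2, 2, 2, 2]
```

Averaging the off-training-set error over the k folds gives the rotated-division estimate used to compare algorithms.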

102 Combining Classifiers Just like different features capturing different properties of a pattern, different classifiers also capture different structures and relationships of these patterns in the feature space. An empirical comparison of different classifiers can help us choose one of them as the best classifier for the problem at hand. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 102 / 108

103 Combining Classifiers Although most of the classifiers may have similar error rates, the sets of patterns misclassified by different classifiers do not necessarily overlap. Not relying on a single decision but rather combining the advantages of different classifiers is intuitively promising for improving the overall accuracy of classification. Such combinations are variously called combined classifiers, ensemble classifiers, mixture-of-expert models, or pooled classifiers. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 103 / 108

104 Combining Classifiers Some reasons for combining multiple classifiers to solve a given classification problem can be stated as follows: Access to different classifiers, each developed in a different context and for an entirely different representation/description of the same problem. Availability of multiple training sets, each collected at a different time or in a different environment, and possibly even using different features. Local performances of different classifiers, where each classifier may have its own region in the feature space where it performs the best. Different performances due to different initializations and randomness inherent in the training procedure. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 104 / 108

105 Combining Classifiers In summary, we may have different feature sets, training sets, classification methods, and training sessions, all resulting in a set of classifiers whose outputs may be combined. Combination architectures can be grouped as: Parallel: all classifiers are invoked independently and then their results are combined by a combiner. Serial (cascading): individual classifiers are invoked in a linear sequence where the number of possible classes for a given pattern is gradually reduced. Hierarchical (tree): individual classifiers are combined into a structure, which is similar to that of a decision tree, where the nodes are associated with the classifiers. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 105 / 108

106 Combining Classifiers Examples of classifier combination schemes are: Majority voting where each classifier makes a binary decision (vote) about each class and the final decision is made in favor of the class with the largest number of votes. Bayesian combination: sum, product, maximum, minimum and median of the posterior probabilities from individual classifiers. Bagging where multiple classifiers are built by bootstrapping the original training set. Boosting where a sequence of classifiers is built by training each classifier using data sampled from a distribution derived from the empirical misclassification rate of the previous classifier. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 106 / 108
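Majority voting, the first scheme above, reduces to counting votes per pattern; a minimal Python sketch follows (the label lists are hypothetical):

```python
from collections import Counter

def majority_vote(predictions):
    """predictions[c][i] is the label classifier c assigns to pattern i.
    Returns, for each pattern, the class with the largest number of votes
    (ties broken by the order in which labels are first seen)."""
    n_patterns = len(predictions[0])
    combined = []
    for i in range(n_patterns):
        votes = Counter(clf[i] for clf in predictions)
        combined.append(votes.most_common(1)[0][0])
    return combined

# Three classifiers, two patterns: the ensemble recovers the consensus label
# even though the third classifier is wrong on both patterns.
print(majority_vote([["a", "b"], ["a", "b"], ["b", "a"]]))  # -> ['a', 'b']
```

The Bayesian combination rules in the list differ only in replacing the vote counts with sums, products, or order statistics of the classifiers' posterior probabilities.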

107 Structural and Syntactic Pattern Recognition Statistical pattern recognition attempts to classify patterns based on a set of extracted features and an underlying statistical model for the generation of these patterns. Ideally, this is achieved with a rather straightforward procedure: determine the feature vector, train the system, classify the patterns. Unfortunately, there are also many problems where patterns contain structural and relational information that is difficult or impossible to quantify in feature vector form. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 107 / 108

108 Structural and Syntactic Pattern Recognition Structural pattern recognition assumes that pattern structure is quantifiable and extractable so that structural similarity of patterns can be assessed. Typically, these approaches formulate hierarchical descriptions of complex patterns built up from simpler primitive elements. This structure quantification and description are mainly done using: Formal grammars Relational descriptions (principally graphs) Then, recognition and classification are done using: Parsing (for formal grammars) Relational graph matching (for relational descriptions) UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 108 / 108


CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes

### Support Vector Machines Explained

March 1, 2009 Support Vector Machines Explained Tristan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introduction This document has been written in an attempt to make the Support Vector Machines (SVM),

### NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

### Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

### Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

Classifying Large Data Sets Using SVMs with Hierarchical Clusters Presented by :Limou Wang Overview SVM Overview Motivation Hierarchical micro-clustering algorithm Clustering-Based SVM (CB-SVM) Experimental

### Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague. http://ida.felk.cvut.

Machine Learning and Data Analysis overview Jiří Kléma Department of Cybernetics, Czech Technical University in Prague http://ida.felk.cvut.cz psyllabus Lecture Lecturer Content 1. J. Kléma Introduction,

### KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics

ROCHESTER INSTITUTE OF TECHNOLOGY COURSE OUTLINE FORM KATE GLEASON COLLEGE OF ENGINEERING John D. Hromi Center for Quality and Applied Statistics NEW (or REVISED) COURSE (KGCOE- CQAS- 747- Principles of

### Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Data Mining Clustering (2) Toon Calders Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining Outline Partitional Clustering Distance-based K-means, K-medoids,

### Data Mining Practical Machine Learning Tools and Techniques. Slides for Chapter 7 of Data Mining by I. H. Witten and E. Frank

Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 7 of Data Mining by I. H. Witten and E. Frank Engineering the input and output Attribute selection Scheme independent, scheme

### 6. If there is no improvement of the categories after several steps, then choose new seeds using another criterion (e.g. the objects near the edge of

Clustering Clustering is an unsupervised learning method: there is no target value (class label) to be predicted, the goal is finding common patterns or grouping similar examples. Differences between models/algorithms

### Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

### Improved Fuzzy C-means Clustering Algorithm Based on Cluster Density

Journal of Computational Information Systems 8: 2 (2012) 727 737 Available at http://www.jofcis.com Improved Fuzzy C-means Clustering Algorithm Based on Cluster Density Xiaojun LOU, Junying LI, Haitao

### COLLEGE OF SCIENCE. John D. Hromi Center for Quality and Applied Statistics

ROCHESTER INSTITUTE OF TECHNOLOGY COURSE OUTLINE FORM COLLEGE OF SCIENCE John D. Hromi Center for Quality and Applied Statistics NEW (or REVISED) COURSE: COS-STAT-747 Principles of Statistical Data Mining

### Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012

Clustering Big Data Anil K. Jain (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012 Outline Big Data How to extract information? Data clustering

### Learning. Artificial Intelligence. Learning. Types of Learning. Inductive Learning Method. Inductive Learning. Learning.

Learning Learning is essential for unknown environments, i.e., when designer lacks omniscience Artificial Intelligence Learning Chapter 8 Learning is useful as a system construction method, i.e., expose

### Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototype-based clustering Density-based clustering Graph-based

### Statistical Models in Data Mining

Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of

### Unsupervised Data Mining (Clustering)

Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in

### Exploratory Data Analysis with MATLAB

Computer Science and Data Analysis Series Exploratory Data Analysis with MATLAB Second Edition Wendy L Martinez Angel R. Martinez Jeffrey L. Solka ( r ec) CRC Press VV J Taylor & Francis Group Boca Raton

### A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions

A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois at Urbana-Champaign IBM T. J. Watson Research Center

### Principles of Data Mining by Hand&Mannila&Smyth

Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences