# Classification Techniques for Remote Sensing


Selim Aksoy, Department of Computer Engineering, Bilkent University, Bilkent, 06800, Ankara (saksoy/courses/cs551). UYGU 2014. © 2014, Selim Aksoy (Bilkent University).

## Outline

1. Introduction
2. Bayesian Decision Theory: Parametric Models; Non-parametric Methods
3. Feature Reduction and Selection
4. Non-Bayesian Classifiers: Distance-based Classifiers; Decision Boundary-based Classifiers
5. Unsupervised Learning and Clustering
6. Algorithm-Independent Learning Issues: Estimating and Comparing Classifiers; Combining Classifiers
7. Structural and Syntactic Pattern Recognition

## Human versus Machine Perception

Humans have developed highly sophisticated skills for sensing their environment and taking actions according to what they observe. We would like to give similar capabilities to machines. This is of particular interest in remote sensing due to:

- the expansion in data volume with increasing spatial, spectral, and temporal resolutions,
- the increasing complexity of the data content,
- the urgency of the need to extract useful information.

## Classification

What is classification? Finding boundaries in a feature space.

Figure 1: Example classification boundaries for 2D data sets.

Classification methods use techniques from mathematics, statistics, computer science, etc., to solve this problem.

## An Example

Problem: sorting incoming fish on a conveyor belt according to species. Assume that we have only two kinds of fish: sea bass and salmon.

Figure 2: Picture taken from a camera.

## An Example: Decision Process

What kind of information can distinguish one species from the other? Length, width, weight, number and shape of fins, tail shape, etc.

What can cause problems during sensing? Lighting conditions, position of the fish on the conveyor belt, camera noise, etc.

What are the steps in the process? Capture image, isolate fish, take measurements, make decision.

## An Example: Selecting Features

Assume a fisherman told us that a sea bass is generally longer than a salmon. We can use length as a feature and decide between sea bass and salmon according to a threshold on length. How can we choose this threshold?

## An Example: Selecting Features

Figure 3: Histograms of the length feature for the two types of fish in the training samples.

How can we choose the threshold $l$ to make a reliable decision?
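One simple way to choose the threshold is to scan candidate values and keep the one with the fewest errors on the training samples. A minimal sketch with hypothetical, synthetically generated length samples (the real histograms are those of Figure 3):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical training lengths; sea bass are longer on average.
sea_bass = rng.normal(loc=60.0, scale=8.0, size=200)
salmon = rng.normal(loc=45.0, scale=7.0, size=200)

def best_threshold(longer, shorter):
    """Scan candidate thresholds l, classify length > l as the longer
    species, and keep the l with the fewest training errors."""
    candidates = np.sort(np.concatenate([longer, shorter]))
    errors = [np.sum(longer <= l) + np.sum(shorter > l) for l in candidates]
    return candidates[int(np.argmin(errors))]

l_star = best_threshold(sea_bass, salmon)
```

Because the two histograms overlap, even the best threshold leaves some training samples misclassified.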

## An Example: Selecting Features

Even though sea bass are longer than salmon on average, there are many examples of fish for which this observation does not hold. Try another feature: the average lightness of the fish scales.

## An Example: Selecting Features

Figure 4: Histograms of the lightness feature for the two types of fish in the training samples.

It looks easier to choose a threshold $x$, but we still cannot make a perfect decision.

## An Example: Cost of Error

We should also consider the costs of the different errors we make in our decisions. For example, suppose the fish packing company knows that:

- Customers who buy salmon will object vigorously if they see sea bass in their cans.
- Customers who buy sea bass will not be unhappy if they occasionally see some expensive salmon in their cans.

How does this knowledge affect our decision?

## An Example: Multiple Features

Assume we also observed that sea bass are typically wider than salmon. We can use two features in our decision:

- lightness: $x_1$
- width: $x_2$

Each fish image is now represented as a point (feature vector) $\mathbf{x} = (x_1, x_2)^T$ in a two-dimensional feature space.

## An Example: Multiple Features

Figure 5: Scatter plot of the lightness and width features for the training samples.

We can draw a decision boundary to divide the feature space into two regions. Does it look better than using only lightness?

## An Example: Multiple Features

Does adding more features always improve the results?

- Avoid unreliable features.
- Be careful about correlations with existing features.
- Be careful about measurement costs.
- Be careful about noise in the measurements.

Is there some curse for working in very high dimensions?

## An Example: Decision Boundaries

Can we do better with another decision rule? More complex models result in more complex boundaries.

Figure 6: We may distinguish the training samples perfectly, but how can we predict how well we will generalize to unknown samples?

## An Example: Decision Boundaries

How can we manage the tradeoff between the complexity of decision rules and their performance on unknown samples?

Figure 7: Different criteria lead to different decision boundaries.

## More on Complexity

Figure 8: Regression example: plot of 10 sample points for the input variable $x$ along with the corresponding target variable $t$. The green curve is the true function that generated the data.

## More on Complexity

Figure 9: Polynomial curve fitting: plots of polynomials of various orders (0th, 1st, 3rd, and 9th in panels (a)–(d)), shown as red curves, fitted to the set of 10 sample points.

## More on Complexity

Figure 10: Polynomial curve fitting: plots of 9th order polynomials fitted to (a) 15 and (b) 100 sample points.

## Pattern Recognition Systems

Figure 11: Object/process diagram of a pattern recognition system: the physical environment is observed through data acquisition/sensing; the data pass through pre-processing, feature extraction, and classification (using a learned model) to post-processing, which produces the decision. A parallel training path takes training data through pre-processing and feature extraction/selection to model learning/estimation.

## Pattern Recognition Systems

- Data acquisition and sensing: measurements of physical variables. Important issues: bandwidth, resolution, sensitivity, distortion, SNR, latency, etc.
- Pre-processing: removal of noise in the data; isolation of patterns of interest from the background.
- Feature extraction: finding a new representation in terms of features.

## Pattern Recognition Systems

- Model learning and estimation: learning a mapping between features and pattern groups and categories.
- Classification: using the features and the learned models to assign a pattern to a category.
- Post-processing: evaluation of confidence in decisions; exploitation of context to improve performance; combination of experts.

## The Design Cycle

Figure 12: The design cycle: collect data, select features, select model, train classifier, evaluate classifier.

Data collection: collecting training and testing data. How can we know when we have an adequately large and representative set of samples?

## The Design Cycle

Feature selection:

- Domain dependence and prior information.
- Computational cost and feasibility.
- Discriminative features: similar values for similar patterns; different values for different patterns.
- Invariant features with respect to translation, rotation, and scale.
- Robust features with respect to occlusion, distortion, deformation, and variations in the environment.

## The Design Cycle

Model selection:

- Domain dependence and prior information.
- Definition of design criteria.
- Parametric vs. non-parametric models.
- Handling of missing features.
- Computational complexity.
- Types of models: templates, decision-theoretic or statistical, syntactic or structural, neural, and hybrid.

How can we know how close we are to the true model underlying the patterns?

## The Design Cycle

Training: how can we learn the rule from data?

- Supervised learning: a teacher provides a category label or cost for each pattern in the training set.
- Unsupervised learning: the system forms clusters or natural groupings of the input patterns.
- Semi-supervised learning: a class of supervised learning techniques that also make use of unlabeled data for training.
- Active learning: a special case of semi-supervised learning in which the learning algorithm interactively queries the user to obtain the desired outputs at new data points.

## The Design Cycle

Evaluation:

- How can we estimate the performance with the training samples?
- How can we predict the performance with future data?
- Problems of overfitting and generalization.

## Bayesian Decision Theory

Bayesian decision theory is a statistical approach that quantifies the tradeoffs between various decisions using the probabilities and costs that accompany such decisions.

Fish sorting example: define $w$, the type of fish we observe (the class), as a random variable where $w = w_1$ for sea bass and $w = w_2$ for salmon. $P(w_1)$ is the a priori probability that the next fish is a sea bass; $P(w_2)$ is the a priori probability that the next fish is a salmon.

## Prior Probabilities

Prior probabilities reflect our knowledge of how likely each type of fish is to appear before we actually see it. How can we choose $P(w_1)$ and $P(w_2)$?

- Set $P(w_1) = P(w_2)$ if they are equiprobable (uniform priors).
- We may use different values depending on the fishing area, the time of the year, etc.

Assume there are no other types of fish (exclusivity and exhaustivity), so that $P(w_1) + P(w_2) = 1$.

## Making a Decision

How can we make a decision with only the prior information?

Decide $w_1$ if $P(w_1) > P(w_2)$, otherwise decide $w_2$.

What is the probability of error for this decision?

$$P(\text{error}) = \min\{P(w_1), P(w_2)\}$$

## Class-conditional Probabilities

Let's try to improve the decision using the lightness measurement $x$. Let $x$ be a continuous random variable. Define $p(x \mid w_j)$ as the class-conditional probability density (the probability density of $x$ given that the class is $w_j$, for $j = 1, 2$). $p(x \mid w_1)$ and $p(x \mid w_2)$ describe the difference in lightness between the populations of sea bass and salmon.

## Posterior Probabilities

Suppose we know $P(w_j)$ and $p(x \mid w_j)$ for $j = 1, 2$, and measure the lightness of a fish as the value $x$. Define $P(w_j \mid x)$ as the a posteriori probability (the probability of the class being $w_j$ given the measured feature value $x$). We can use the Bayes formula to convert the prior probability to the posterior probability

$$P(w_j \mid x) = \frac{p(x \mid w_j)\,P(w_j)}{p(x)}$$

where $p(x) = \sum_{j=1}^{2} p(x \mid w_j)\,P(w_j)$.
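The conversion can be checked numerically. A small sketch with hypothetical priors and likelihood values at a single measured $x$:

```python
import numpy as np

# Hypothetical priors P(w1), P(w2) and likelihoods p(x|w1), p(x|w2)
# evaluated at one measured lightness value x.
priors = np.array([2 / 3, 1 / 3])
likelihoods = np.array([0.05, 0.20])

evidence = np.sum(likelihoods * priors)        # p(x), the normalizer
posteriors = likelihoods * priors / evidence   # P(wj|x) via the Bayes formula
```

Here the evidence is $0.05 \cdot 2/3 + 0.20 \cdot 1/3 = 0.1$, so the posteriors are $(1/3, 2/3)$: the prior favored $w_1$, but the measurement reverses the decision.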

## Making a Decision

$p(x \mid w_j)$ is called the likelihood and $p(x)$ is called the evidence. How can we make a decision after observing the value of $x$?

Decide $w_1$ if $P(w_1 \mid x) > P(w_2 \mid x)$, otherwise decide $w_2$.

Rewriting the rule gives: decide $w_1$ if

$$\frac{p(x \mid w_1)}{p(x \mid w_2)} > \frac{P(w_2)}{P(w_1)},$$

otherwise decide $w_2$. Note that, at every $x$, $P(w_1 \mid x) + P(w_2 \mid x) = 1$.

## Making a Decision

Figure 13: Optimum thresholds for different priors.

## Probability of Error

What is the probability of error for this decision?

$$P(\text{error} \mid x) = \begin{cases} P(w_1 \mid x) & \text{if we decide } w_2 \\ P(w_2 \mid x) & \text{if we decide } w_1 \end{cases}$$

What is the average probability of error?

$$P(\text{error}) = \int p(\text{error}, x)\,dx = \int P(\text{error} \mid x)\,p(x)\,dx$$

The Bayes decision rule minimizes this error because $P(\text{error} \mid x) = \min\{P(w_1 \mid x), P(w_2 \mid x)\}$.

## Probability of Error

Figure 14: Components of the probability of error for equal priors and a non-optimal decision point $x^*$. The optimal point $x_B$ minimizes the total shaded area and gives the Bayes error rate.

## Receiver Operating Characteristics

Consider the two-category case and define $w_1$: target is present, $w_2$: target is not present.

Table 1: Confusion matrix.

| True \ Assigned | $w_1$ | $w_2$ |
| --- | --- | --- |
| $w_1$ | correct detection | mis-detection |
| $w_2$ | false alarm | correct rejection |

Mis-detection is also called a false negative or Type II error. A false alarm is also called a false positive or Type I error.

## Receiver Operating Characteristics

If we use a parameter (e.g., a threshold) in our decision, the plot of these rates for different values of the parameter is called the receiver operating characteristic (ROC) curve.

Figure 15: Example receiver operating characteristic (ROC) curves for different settings of the system.
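An ROC curve can be traced by sweeping the decision threshold over a detector score. A sketch with hypothetical Gaussian scores for the two categories:

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical scores: higher values give more evidence the target is present.
target_scores = rng.normal(2.0, 1.0, 500)    # w1: target present
clutter_scores = rng.normal(0.0, 1.0, 500)   # w2: target not present

# Sweep the threshold; each setting yields one (false alarm, detection) pair.
thresholds = np.linspace(-4.0, 6.0, 101)
p_detection = np.array([(target_scores > t).mean() for t in thresholds])
p_false_alarm = np.array([(clutter_scores > t).mean() for t in thresholds])
```

Plotting `p_detection` against `p_false_alarm` gives the ROC curve; raising the threshold trades fewer false alarms for more mis-detections.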

## Bayesian Decision Theory

How can we generalize to

- more than one feature? Replace the scalar $x$ by the feature vector $\mathbf{x}$.
- more than two classes? Just a difference in notation.
- allowing actions other than just decisions? Allow the possibility of rejection.
- different risks in the decision? Define how costly each action is.

## Minimum-error-rate Classification

Let $\{w_1, \ldots, w_c\}$ be the finite set of $c$ classes (categories). Let $\mathbf{x}$ be the $d$-component vector-valued random variable called the feature vector. If all errors are equally costly, the minimum-error decision rule is defined as

Decide $w_i$ if $P(w_i \mid \mathbf{x}) > P(w_j \mid \mathbf{x})$ for all $j \neq i$.

The resulting error is called the Bayes error and is the best performance that can be achieved.
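With posteriors for all $c$ classes in hand, the rule is just an argmax. A minimal sketch (the posterior values are hypothetical):

```python
import numpy as np

def min_error_decision(posteriors):
    """Minimum-error-rate rule: decide the class wi whose posterior
    P(wi|x) is largest (ties broken by the first maximum)."""
    return int(np.argmax(posteriors))

# Hypothetical posteriors P(w1|x), P(w2|x), P(w3|x) at one feature vector x.
decision = min_error_decision([0.2, 0.5, 0.3])
```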

## Bayesian Decision Theory

Bayesian decision theory shows us how to design an optimal classifier if we know the prior probabilities $P(w_i)$ and the class-conditional densities $p(\mathbf{x} \mid w_i)$. Unfortunately, we rarely have complete knowledge of the probabilistic structure. However, we can often find design samples or training data that include particular representatives of the patterns we want to classify.

## Bayesian Decision Theory

Parametric models: assume that the form of the density functions is known.

- Density models (e.g., Gaussian)
- Mixture models (e.g., mixture of Gaussians)
- Hidden Markov models
- Graphical models

Non-parametric models: no assumption about the form.

- Histogram-based estimation
- Parzen window estimation
- Nearest neighbor estimation

## The Gaussian Density

The Gaussian can be considered a model where the feature vectors for a given class are continuous-valued, randomly corrupted versions of a single typical or prototype vector.

Some properties of the Gaussian:

- Analytically tractable.
- Completely specified by the 1st and 2nd moments.
- Has the maximum entropy of all distributions with a given mean and variance.
- Many processes are asymptotically Gaussian (central limit theorem).
- Uncorrelatedness implies independence (for jointly Gaussian variables).

## Univariate Gaussian

For $x \in \mathbb{R}$:

$$p(x) = N(\mu, \sigma^2) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right]$$

where

$$\mu = E[x] = \int x\,p(x)\,dx, \qquad \sigma^2 = E[(x - \mu)^2] = \int (x - \mu)^2\,p(x)\,dx.$$
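The density is straightforward to evaluate directly from the formula; a sketch:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    """Univariate normal density N(mu, sigma^2) evaluated at x."""
    z = (x - mu) / sigma
    return np.exp(-0.5 * z ** 2) / (np.sqrt(2.0 * np.pi) * sigma)

# The density integrates to one (checked here with a Riemann sum).
xs = np.linspace(-8.0, 8.0, 16001)
area = gaussian_pdf(xs, 0.0, 1.0).sum() * (xs[1] - xs[0])
```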

## Univariate Gaussian

Figure 16: A univariate Gaussian distribution has roughly 95% of its area in the range $|x - \mu| \leq 2\sigma$.

## Multivariate Gaussian

For $\mathbf{x} \in \mathbb{R}^d$:

$$p(\mathbf{x}) = N(\boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{d/2}\,|\Sigma|^{1/2}} \exp\left[-\frac{1}{2}(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})\right]$$

where

$$\boldsymbol{\mu} = E[\mathbf{x}] = \int \mathbf{x}\,p(\mathbf{x})\,d\mathbf{x}, \qquad \Sigma = E[(\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T] = \int (\mathbf{x} - \boldsymbol{\mu})(\mathbf{x} - \boldsymbol{\mu})^T\,p(\mathbf{x})\,d\mathbf{x}.$$

## Multivariate Gaussian

Figure 17: Samples drawn from a two-dimensional Gaussian lie in a cloud centered on the mean $\boldsymbol{\mu}$. The loci of points of constant density are the ellipses for which $(\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})$ is constant, where the eigenvectors of $\Sigma$ determine the directions and the corresponding eigenvalues determine the lengths of the principal axes. The quantity $r^2 = (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu})$ is called the squared Mahalanobis distance from $\mathbf{x}$ to $\boldsymbol{\mu}$.

## Gaussian Density Estimation

The maximum likelihood estimates of a Gaussian are

$$\hat{\boldsymbol{\mu}} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i \qquad \text{and} \qquad \hat{\Sigma} = \frac{1}{n} \sum_{i=1}^{n} (\mathbf{x}_i - \hat{\boldsymbol{\mu}})(\mathbf{x}_i - \hat{\boldsymbol{\mu}})^T.$$

Figure 18: Gaussian density estimation examples: histograms and Gaussian fits for a random sample from $N(10, 2^2)$ and for a random sample from the mixture $0.5\,N(10, 0.4^2) + 0.5\,N(11, 0.5^2)$.
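The two estimators can be sketched directly on hypothetical 2-D samples drawn from a known Gaussian:

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical 2-D samples from a known Gaussian, for illustration only.
true_mu = np.array([10.0, -1.0])
true_cov = np.array([[2.0, 0.5], [0.5, 1.0]])
x = rng.multivariate_normal(true_mu, true_cov, size=5000)

# Maximum likelihood estimates: sample mean and 1/n covariance.
mu_hat = x.mean(axis=0)
centered = x - mu_hat
sigma_hat = centered.T @ centered / len(x)
```

With 5000 samples the estimates are already close to the generating parameters; note the ML covariance divides by $n$, not $n - 1$.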

## Case 1: $\Sigma_i = \sigma^2 I$

Figure 19: If the covariance matrices of two distributions are equal and proportional to the identity matrix, then the distributions are spherical in $d$ dimensions, and the boundary is a generalized hyperplane of $d - 1$ dimensions, perpendicular to the line separating the means. The decision boundary shifts as the priors are changed.

## Case 2: $\Sigma_i = \Sigma$

Figure 20: Probability densities with equal but asymmetric Gaussian distributions. The decision hyperplanes are not necessarily perpendicular to the line connecting the means.

## Case 3: $\Sigma_i$ arbitrary

Figure 21: Arbitrary Gaussian distributions lead to Bayes decision boundaries that are general hyperquadrics.

## Case 3: $\Sigma_i$ arbitrary

Figure 22: Arbitrary Gaussian distributions lead to Bayes decision boundaries that are general hyperquadrics.

## Mixture Densities

A mixture model is a linear combination of $m$ densities

$$p(\mathbf{x} \mid \Theta) = \sum_{j=1}^{m} \alpha_j\,p_j(\mathbf{x} \mid \theta_j)$$

where $\Theta = (\alpha_1, \ldots, \alpha_m, \theta_1, \ldots, \theta_m)$ such that $\alpha_j \geq 0$ and $\sum_{j=1}^{m} \alpha_j = 1$.

$\alpha_1, \ldots, \alpha_m$ are called the mixing parameters, and $p_j(\mathbf{x} \mid \theta_j)$, $j = 1, \ldots, m$, are called the component densities. The most commonly used mixture model is the mixture of Gaussians.

## Expectation-Maximization

Estimation of the mixture model parameters can be done using the Expectation-Maximization (EM) algorithm. The EM algorithm is a general iterative method for finding the maximum likelihood estimates of the parameters of a distribution from training data.

The EM algorithm:

- First, computes the expected value of the complete-data log-likelihood using the current parameter estimates (expectation step).
- Then, finds new values of the parameters that maximize this expectation (maximization step).
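The two steps can be sketched for a one-dimensional mixture of Gaussians. This is a minimal illustration on hypothetical synthetic data with a fixed iteration count, not a full implementation (no convergence test or safeguards against degenerate components):

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical 1-D data: 30% from N(-2, 1), 70% from N(3, 1).
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 1, 700)])

def em_gmm(x, m=2, iters=50):
    """EM for a 1-D mixture of m Gaussians (a minimal sketch)."""
    n = len(x)
    alpha = np.full(m, 1.0 / m)
    mu = np.quantile(x, np.linspace(0.1, 0.9, m))  # spread-out initial means
    var = np.full(m, x.var())
    for _ in range(iters):
        # E-step: responsibilities gamma_ij = P(component j | x_i).
        dens = alpha * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) \
               / np.sqrt(2.0 * np.pi * var)
        gamma = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixing weights, means, and variances.
        nj = gamma.sum(axis=0)
        alpha = nj / n
        mu = (gamma * x[:, None]).sum(axis=0) / nj
        var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / nj
    return alpha, mu, var

alpha, mu, var = em_gmm(x)
```

On this well-separated example the estimated means and mixing parameters approach the generating values.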

Figure 23: Illustration of the EM algorithm iterations, shown in panels (a)–(f), for a mixture of two Gaussians.

Figure 24: 1-D Bayesian classification examples where the data for each class come from a mixture of three Gaussians: (a) true densities and sample histograms; (b) linear Gaussian classifier with $P_e = \ldots$; (c) quadratic Gaussian classifier with $P_e = \ldots$; (d) mixture of Gaussians classifier with $P_e = \ldots$. The Bayes error is $P_e = \ldots$.

Figure 25: 2-D Bayesian classification examples where the data for the classes come from a banana-shaped distribution and a bivariate Gaussian: (a) scatter plot; (b) linear Gaussian classifier with $P_e = \ldots$; (c) quadratic Gaussian classifier with $P_e = \ldots$; (d) mixture of Gaussians classifier with $P_e = \ldots$.

Figure 26: 2-D Bayesian classification examples where the data for each class come from a banana-shaped distribution: (a) scatter plot; (b) quadratic Gaussian classifier with $P_e = \ldots$; (c) mixture of Gaussians classifier with $P_e = \ldots$.

## Histogram Method

A very simple method is to partition the space into a number of equally-sized cells (bins) and compute a histogram.

Figure 27: Histogram in one dimension.

The estimate of the density at a point $x$ becomes

$$p(x) = \frac{k}{nV}$$

where $n$ is the total number of samples, $k$ is the number of samples in the cell that includes $x$, and $V$ is the volume of that cell.
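The estimate $p(x) = k/(nV)$ can be computed directly by counting the samples in the bin that contains $x$. A sketch in one dimension (the samples and the bin width are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(0.0, 1.0, 10000)  # hypothetical 1-D samples

def histogram_density(samples, point, bin_width=0.2):
    """Estimate p(point) = k / (n V) from the bin containing the point."""
    n = len(samples)
    left = np.floor(point / bin_width) * bin_width
    k = np.sum((samples >= left) & (samples < left + bin_width))
    return k / (n * bin_width)

p0 = histogram_density(x, 0.0)
```

At the mode of the standard normal the estimate should land near the true density $1/\sqrt{2\pi} \approx 0.399$.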

## Non-parametric Methods

Advantages:

- No assumptions are needed about the distributions ahead of time (generality).
- With enough samples, convergence to an arbitrarily complicated target density can be obtained.

Disadvantages:

- The number of samples needed may be very large (the number grows exponentially with the dimensionality of the feature space).
- There may be severe requirements for computation time and storage.

## Classification Error

To apply these results to multiple classes, separate the training samples into $c$ subsets $D_1, \ldots, D_c$, with the samples in $D_i$ belonging to class $w_i$, and then estimate each density $p(\mathbf{x} \mid w_i, D_i)$ separately.

Different sources of error:

- Bayes error: due to overlapping class-conditional densities (related to the features used).
- Model error: due to an incorrect model.
- Estimation error: due to estimation from a finite sample (can be reduced by increasing the amount of training data).

## Feature Reduction and Selection

In practical multicategory applications, it is not unusual to encounter problems involving tens or hundreds of features. Intuitively, it may seem that each feature is useful for at least some of the discriminations. There are two issues that we must be careful about:

- How is the classification accuracy affected by the dimensionality (relative to the amount of training data)?
- How is the computational complexity of the classifier affected by the dimensionality?

## Problems of Dimensionality

In general, if the performance obtained with a given set of features is inadequate, it is natural to consider adding new features. Unfortunately, it has frequently been observed in practice that, beyond a certain point, adding new features leads to worse rather than better performance. This is called the curse of dimensionality. Potential reasons include wrong assumptions in model selection or estimation errors due to the finite number of training samples for high-dimensional observations (overfitting).

## Problems of Dimensionality

All of the commonly used classifiers can suffer from the curse of dimensionality. Dimensionality can be reduced by:

- redesigning the features,
- selecting an appropriate subset among the existing features,
- combining existing features.

Principal Components Analysis (PCA) seeks a projection that best represents the data in a least-squares sense. Linear Discriminant Analysis (LDA) seeks a projection that best separates the data in a least-squares sense.
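PCA can be sketched as an eigendecomposition of the sample covariance matrix; projecting onto the eigenvectors decorrelates the features. A minimal illustration on hypothetical correlated 2-D data:

```python
import numpy as np

rng = np.random.default_rng(5)
# Hypothetical correlated 2-D data.
x = rng.multivariate_normal([0.0, 0.0], [[3.0, 1.2], [1.2, 1.0]], size=2000)

# PCA: eigendecomposition of the sample covariance matrix.
mu = x.mean(axis=0)
cov = np.cov(x - mu, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending order
order = np.argsort(eigvals)[::-1]           # largest variance first
projected = (x - mu) @ eigvecs[:, order]    # uncorrelated coordinates
```

The first projected coordinate carries the greatest variance and the off-diagonal covariance of the projected data vanishes, matching Figure 28.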

## Examples

Figure 28: Scatter plot (red dots) and the principal axes for a bivariate sample: (a) scatter plot; (b) projection onto $\mathbf{e}_1$; (c) projection onto $\mathbf{e}_2$. The blue line shows the axis $\mathbf{e}_1$ with the greatest variance and the green line shows the axis $\mathbf{e}_2$ with the smallest variance. The features are now uncorrelated.

## Examples

Figure 29: Scatter plot and the PCA and LDA axes for a bivariate sample with two classes: (a) scatter plot; (b) projection onto the first PCA axis; (c) projection onto the first LDA axis. The histogram of the projection onto the first LDA axis shows better separation than the projection onto the first PCA axis.

## Examples

Figure 30: Scatter plot and the PCA and LDA axes for a bivariate sample with two classes: (a) scatter plot; (b) projection onto the first PCA axis; (c) projection onto the first LDA axis. The histogram of the projection onto the first LDA axis shows better separation than the projection onto the first PCA axis.

## Examples

Figure 31: An image and the first six PCA bands (after projection). Histogram equalization was applied to all images for better visualization.

## Examples

Figure 32: An image and the six LDA bands (after projection). Histogram equalization was applied to all images for better visualization.

## Feature Selection

An alternative to feature reduction, which uses linear or non-linear combinations of features, is feature selection, which reduces dimensionality by selecting subsets of the existing features. The first step in feature selection is to define a criterion function, which is often a function of the classification error. Note that the use of the classification error in the criterion function makes feature selection procedures dependent on the specific classifier used.

## Feature Selection

The most straightforward approach would require examining all $\binom{d}{m}$ possible subsets of size $m$ and selecting the subset that performs the best according to the criterion function. However, the number of subsets grows combinatorially, making this exhaustive search impractical. Iterative procedures such as sequential forward selection or sequential backward selection are often used instead, but they cannot guarantee the selection of the optimal subset.
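Sequential forward selection can be sketched generically: at each iteration, add the feature that most improves the criterion function. The criterion below is a hypothetical stand-in (a simple additive score) for the classifier-based accuracy used in practice:

```python
import numpy as np

def sequential_forward_selection(score, n_features, n_select):
    """Greedy SFS: repeatedly add the single feature that maximizes
    the criterion function score(subset)."""
    selected = []
    remaining = list(range(n_features))
    while remaining and len(selected) < n_select:
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
        remaining.remove(best)
    return selected

# Hypothetical criterion: each feature contributes a fixed score.
weights = np.array([0.5, 0.1, 0.9, 0.2])
subset = sequential_forward_selection(lambda s: weights[s].sum(), 4, 2)
```

Sequential backward selection is the mirror image: start from all features and greedily remove the least useful one.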

## Examples

Figure 33: Results of sequential forward feature selection for classification of a satellite image using 28 features (DEM elevation, IKONOS and aerial image bands, and their Gabor texture features). The x-axis shows the classification accuracy (%) and the y-axis shows the features added at each iteration (the first iteration is at the bottom). The highest accuracy value is shown with a star.

## Examples

Figure 34: Results of sequential backward feature selection for classification of a satellite image using 28 features (DEM elevation, IKONOS and aerial image bands, and their Gabor texture features). The x-axis shows the classification accuracy (%) and the y-axis shows the features removed at each iteration (the first iteration is at the bottom). The highest accuracy value is shown with a star.

## Non-Bayesian Classifiers

Distance-based classifiers:

- Minimum distance (nearest mean) classifier
- k-nearest neighbor classifier

Decision boundary-based classifiers:

- Linear discriminant functions
- Support vector machines
- Neural networks
- Decision trees

## The k-nearest Neighbor Classifier

Given the training data $D = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$ as a set of $n$ labeled examples, the nearest neighbor classifier assigns a test point $\mathbf{x}$ the label associated with its closest neighbor in $D$. The k-nearest neighbor classifier classifies $\mathbf{x}$ by assigning it the label most frequently represented among the $k$ nearest samples. Closeness is defined using a distance function.

Figure 35: Classifier for $k = 5$.
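A minimal k-nearest neighbor sketch with Euclidean distance and hypothetical 2-D training data:

```python
import numpy as np

def knn_classify(x_train, y_train, x, k=5):
    """Label x with the majority class among its k nearest training
    samples (Euclidean distance)."""
    d = np.linalg.norm(x_train - x, axis=1)
    nearest = y_train[np.argsort(d)[:k]]
    labels, counts = np.unique(nearest, return_counts=True)
    return labels[np.argmax(counts)]

# Hypothetical 2-D training data for two classes.
x_train = np.array([[0.0, 0.0], [0.2, 0.1], [-0.1, 0.2],
                    [3.0, 3.0], [3.1, 2.9], [2.8, 3.2]])
y_train = np.array([0, 0, 0, 1, 1, 1])
```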

## Distance Functions

A general class of metrics for $d$-dimensional patterns is the Minkowski metric

$$L_p(\mathbf{x}, \mathbf{y}) = \left(\sum_{i=1}^{d} |x_i - y_i|^p\right)^{1/p}$$

also referred to as the $L_p$ norm.

The Euclidean distance is the $L_2$ norm

$$L_2(\mathbf{x}, \mathbf{y}) = \left(\sum_{i=1}^{d} |x_i - y_i|^2\right)^{1/2}.$$

The Manhattan or city block distance is the $L_1$ norm

$$L_1(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{d} |x_i - y_i|.$$

## Distance Functions

The $L_\infty$ norm is the maximum of the distances along the individual coordinate axes

$$L_\infty(\mathbf{x}, \mathbf{y}) = \max_{i=1,\ldots,d} |x_i - y_i|.$$

Figure 36: Each colored shape consists of points at a distance 1.0 from the origin, measured using different values of $p$ in the Minkowski $L_p$ metric.
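The three special cases all follow from one function of $p$; a small sketch:

```python
import numpy as np

def minkowski(x, y, p):
    """L_p distance; p = 1 (city block), 2 (Euclidean), np.inf (max norm)."""
    diff = np.abs(np.asarray(x) - np.asarray(y))
    if np.isinf(p):
        return diff.max()
    return (diff ** p).sum() ** (1.0 / p)
```

For $\mathbf{x} = (0, 0)$ and $\mathbf{y} = (3, 4)$ this gives $L_1 = 7$, $L_2 = 5$, and $L_\infty = 4$.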

## Linear Discriminant Functions

Figure 37: Linear decision boundaries produced by using one linear discriminant for each class.

## Support Vector Machines

Linear discriminant functions are optimal if the underlying distributions are Gaussians having equal covariance for each class. In the general case, the problem of finding linear discriminant functions can be formulated as a problem of optimizing a criterion function. Among all hyperplanes separating the data, there exists a unique one yielding the maximum margin of separation between the classes.

## Support Vector Machines

Figure 38: The margin is defined as the perpendicular distance between the decision boundary ($y = 0$) and the closest of the data points, which lie on the lines $y = -1$ and $y = 1$ (left). Maximizing the margin leads to a particular choice of decision boundary (right). The location of this boundary is determined by a subset of the data points, known as the support vectors, which are indicated by the circles.

## Support Vector Machines

Given a set of training patterns and class labels as $(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n) \in \mathbb{R}^d \times \{\pm 1\}$, the goal is to find a classifier function $f : \mathbb{R}^d \to \{\pm 1\}$ such that $f(\mathbf{x}) = y$ will correctly classify new patterns.

Support vector machines are based on the class of hyperplanes

$$(\mathbf{w} \cdot \mathbf{x}) + b = 0, \qquad \mathbf{w} \in \mathbb{R}^d,\; b \in \mathbb{R},$$

corresponding to the decision functions $f(\mathbf{x}) = \operatorname{sign}((\mathbf{w} \cdot \mathbf{x}) + b)$.

## Support Vector Machines

Figure 39: A binary classification problem of separating balls from diamonds. The optimal hyperplane is orthogonal to the shortest line connecting the convex hulls of the two classes (dotted), and intersects it halfway between the two classes. There is a weight vector $\mathbf{w}$ and a threshold $b$ such that the points closest to the hyperplane satisfy $|(\mathbf{w} \cdot \mathbf{x}_i) + b| = 1$, corresponding to $y_i((\mathbf{w} \cdot \mathbf{x}_i) + b) \geq 1$. The margin, measured perpendicularly to the hyperplane, equals $2 / \|\mathbf{w}\|$.

83 Support Vector Machines To construct the optimal hyperplane, we can define the following optimization problem: minimize (1/2)‖w‖² subject to y_i((w · x_i) + b) ≥ 1, i = 1, ..., n. The solution can be obtained using quadratic programming techniques, where the solution vector is a weighted summation of a subset of the training patterns, called the support vectors. The support vectors lie on the margin and carry all relevant information about the classification problem (the remaining patterns are irrelevant). UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 83 / 108
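As a minimal numerical sketch of these conditions (not a quadratic programming solver), the snippet below checks the constraints y_i((w · x_i) + b) ≥ 1 for a hand-chosen hyperplane on a hypothetical 2D data set and evaluates the margin 2/‖w‖; both the data and the weight vector are illustrative assumptions, not a learned solution.

```python
import math

# Hypothetical 2D training set: class +1 on the right, class -1 on the left.
X = [(1.0, 0.0), (2.0, 1.0), (-1.0, 0.0), (-2.0, -1.0)]
y = [+1, +1, -1, -1]

# Hand-chosen canonical hyperplane w.x + b = 0 (an assumption for illustration).
w = (1.0, 0.0)
b = 0.0

def functional_margin(x, label):
    """y_i((w . x_i) + b): must be >= 1 for every training pattern."""
    return label * (w[0] * x[0] + w[1] * x[1] + b)

constraints_ok = all(functional_margin(x, label) >= 1 for x, label in zip(X, y))
margin = 2.0 / math.hypot(w[0], w[1])  # the margin of separation equals 2/||w||

print(constraints_ok, margin)  # -> True 2.0
```

Here the closest point of each class sits at distance 1 from the hyperplane x_1 = 0, so the total margin is 2, matching 2/‖w‖ for ‖w‖ = 1.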

84 Support Vector Machines In many real-world problems there will be no linear boundary separating the classes, and the problem of searching for an optimal separating hyperplane is meaningless. However, we can extend the above ideas to handle non-separable data by relaxing the constraints. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 84 / 108

85 Support Vector Machines The new optimization problem becomes: minimize (1/2)‖w‖² + C ∑_{i=1}^{n} ξ_i subject to (w · x_i) + b ≥ +1 − ξ_i for y_i = +1, (w · x_i) + b ≤ −1 + ξ_i for y_i = −1, ξ_i ≥ 0, i = 1, ..., n, where ξ_i, i = 1, ..., n, are called the slack variables and C is a regularization parameter. The term C ∑_{i=1}^{n} ξ_i can be thought of as measuring some amount of misclassification, where lowering the value of C corresponds to a smaller penalty for misclassification. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 85 / 108
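The slack-variable objective can be minimized in several ways; one simple illustrative choice (an assumption here, since the slides solve it by quadratic programming) is stochastic sub-gradient descent on the equivalent hinge-loss form of the same objective:

```python
import random

def train_soft_margin(X, y, C=1.0, lr=0.01, epochs=300, seed=0):
    """Sub-gradient descent on (1/2)||w||^2 + C * sum_i max(0, 1 - y_i((w.x_i) + b)).
    A pure-Python sketch for tiny toy problems, not a production solver."""
    rnd = random.Random(seed)
    d, n = len(X[0]), len(X)
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        order = list(range(n))
        rnd.shuffle(order)
        for i in order:
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            # gradient of the regularizer (1/2)||w||^2, split across the n samples
            for j in range(d):
                w[j] -= lr * w[j] / n
            if margin < 1:  # the slack variable xi_i is active for this pattern
                for j in range(d):
                    w[j] += lr * C * y[i] * X[i][j]
                b += lr * C * y[i]
    return w, b

# Toy linearly separable data; the learned hyperplane should classify it correctly.
X = [(2.0, 2.0), (3.0, 3.0), (-2.0, -2.0), (-3.0, -1.0)]
y = [+1, +1, -1, -1]
w, b = train_soft_margin(X, y)
preds = [1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1 for x in X]
print(preds)
```

With separable data and a moderate C this recovers a separating hyperplane; with overlapping classes the same code trades margin violations against ‖w‖, which is exactly the role of C above.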

86 Support Vector Machines Figure 40: Illustration of the slack variables ξ_i ≥ 0. Data points with circles around them are support vectors. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 86 / 108

87 Support Vector Machines Both the quadratic programming problem and the final decision function f(x) = sign(∑_{i=1}^{n} α_i y_i (x · x_i) + b) depend only on the dot products between patterns. We can generalize this result to the non-linear case by mapping the original input space into some other space F using a non-linear map Φ : R^d → F, and performing the linear algorithm in the F space, which only requires the dot products k(x, y) = Φ(x) · Φ(y). UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 87 / 108

88 Support Vector Machines Even though F may be high-dimensional, a simple kernel k(x, y) such as the following can be computed efficiently. Table 2: Common kernel functions. Polynomial: k(x, y) = (x · y)^p. Sigmoidal: k(x, y) = tanh(κ(x · y) + θ). Radial basis function: k(x, y) = exp(−‖x − y‖²/(2σ²)). Once a kernel function is chosen, we can substitute Φ(x_i) for each training example x_i, and perform the optimal hyperplane algorithm in F. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 88 / 108
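The kernels in Table 2 translate directly into code; a minimal Python version follows, with the parameter names p, kappa (κ), theta (θ), and sigma (σ) taken from the table:

```python
import math

def dot(x, z):
    return sum(a * b for a, b in zip(x, z))

def polynomial_kernel(x, z, p=2):
    """k(x, y) = (x . y)^p"""
    return dot(x, z) ** p

def sigmoid_kernel(x, z, kappa=1.0, theta=0.0):
    """k(x, y) = tanh(kappa * (x . y) + theta)"""
    return math.tanh(kappa * dot(x, z) + theta)

def rbf_kernel(x, z, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * sigma^2))"""
    sq = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq / (2.0 * sigma ** 2))

print(polynomial_kernel((1, 2), (3, 4)))    # (1*3 + 2*4)^2 = 121
print(rbf_kernel((1.0, 0.0), (1.0, 0.0)))   # distance 0 from itself -> 1.0
```

Each function computes Φ(x) · Φ(z) in the (possibly infinite-dimensional) space F without ever forming Φ explicitly, which is the point of the kernel trick.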

89 Support Vector Machines This results in the non-linear decision function of the form f(x) = sign(∑_{i=1}^{n} α_i y_i k(x, x_i) + b), where the parameters α_i are computed as the solution of the quadratic programming problem. In the original input space, the hyperplane corresponds to a non-linear decision function whose form is determined by the kernel. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 89 / 108

90 Neural Networks Figure 41: A neural network consists of an input layer, an output layer and usually one or more hidden layers that are interconnected by modifiable weights represented by links between layers. They learn the values of these weights as a mapping from the input to the output. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 90 / 108

91 Decision Trees Figure 42: Decision trees classify a pattern through a sequence of questions, in which the next question asked depends on the answer to the current question. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 91 / 108

92 Unsupervised Learning and Clustering Clustering is an unsupervised procedure that uses unlabeled samples. Unsupervised procedures are used for several reasons: Collecting and labeling a large set of sample patterns can be costly. One can train with a large amount of unlabeled data, and then use supervision to label the groupings found. Exploratory data analysis can provide insight into the nature or structure of the data. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 92 / 108

93 Clusters A cluster is composed of a number of similar objects collected or grouped together. Patterns within a cluster are more similar to each other than to patterns in different clusters. Clusters may be described as connected regions of a multi-dimensional space containing a relatively high density of points, separated from other such regions by a region containing a relatively low density of points. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 93 / 108

94 Clustering Clustering is a very difficult problem because data can reveal clusters with different shapes and sizes. Figure 43: The number of clusters in the data often depends on the resolution (fine vs. coarse) with which we view the data. How many clusters do you see in this figure? 5, 8, 10, more? UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 94 / 108

95 Clustering Most clustering algorithms are based on the following two popular techniques: Iterative squared-error partitioning. Agglomerative hierarchical clustering. One of the main challenges is to select an appropriate measure of similarity to define clusters, a choice that is often both data (cluster shape) and context dependent. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 95 / 108

96 Squared-error Partitioning Suppose that the given set of n patterns has somehow been partitioned into k clusters D_1, ..., D_k. Let n_i be the number of samples in D_i and let m_i be the mean of those samples, m_i = (1/n_i) ∑_{x ∈ D_i} x. Then, the sum-of-squared errors is defined by J_e = ∑_{i=1}^{k} ∑_{x ∈ D_i} ‖x − m_i‖². For a given cluster D_i, the mean vector m_i (centroid) is the best representative of the samples in D_i. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 96 / 108
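The criterion J_e can be computed directly from a given partition; a small Python sketch follows, with points as tuples and clusters as lists of points (an illustrative data layout, not a prescribed one):

```python
def sum_of_squared_errors(clusters):
    """J_e = sum over clusters D_i of sum over x in D_i of ||x - m_i||^2,
    where m_i is the mean (centroid) of cluster D_i."""
    total = 0.0
    for D in clusters:
        n_i, d = len(D), len(D[0])
        m = [sum(x[j] for x in D) / n_i for j in range(d)]  # centroid m_i
        total += sum(sum((x[j] - m[j]) ** 2 for j in range(d)) for x in D)
    return total

# Two tight clusters: every point is 1 unit from its centroid, so J_e = 4.
clusters = [[(0.0, 0.0), (2.0, 0.0)], [(10.0, 0.0), (12.0, 0.0)]]
print(sum_of_squared_errors(clusters))  # -> 4.0
```

A partitioning algorithm such as k-means can be read as a search for a partition with low J_e.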

97 Squared-error Partitioning A general algorithm for iterative squared-error partitioning: 1. Select an initial partition with k clusters. Repeat steps 2 through 5 until the cluster membership stabilizes. 2. Generate a new partition by assigning each pattern to its closest cluster center. 3. Compute new cluster centers as the centroids of the clusters. 4. Repeat steps 2 and 3 until an optimum value of the criterion function is found (e.g., when a local minimum is found or a predefined number of iterations are completed). 5. Adjust the number of clusters by merging and splitting existing clusters or by removing small or outlier clusters. This algorithm, without step 5, is also known as the k-means algorithm. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 97 / 108
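A compact sketch of the k-means core (steps 2 through 4, without the merge/split adjustments of step 5); the pick-the-first-k initialization is a simplifying assumption here, since random selection is more common in practice:

```python
def kmeans(points, k, max_iters=100):
    """Iterative squared-error partitioning: alternate assigning each pattern
    to the closest center with recomputing centers as cluster centroids."""
    d = len(points[0])
    centers = [list(p) for p in points[:k]]  # simplistic initial partition
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # step 2: assign each pattern to its closest cluster center
        clusters = [[] for _ in range(k)]
        for p in points:
            best = min(range(k),
                       key=lambda c: sum((p[j] - centers[c][j]) ** 2 for j in range(d)))
            clusters[best].append(p)
        # step 3: recompute centers as centroids (keep old center if a cluster empties)
        new_centers = [
            [sum(p[j] for p in D) / len(D) for j in range(d)] if D else centers[c]
            for c, D in enumerate(clusters)]
        if new_centers == centers:  # step 4: cluster membership has stabilized
            break
        centers = new_centers
    return centers, clusters

points = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
centers, clusters = kmeans(points, 2)
print(sorted(len(D) for D in clusters))  # -> [3, 3]
```

On this toy data the two compact, well-separated blobs are recovered exactly, which matches the conditions under which k-means is said to work well on the next slide.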

98 Squared-error Partitioning k-means is computationally efficient and gives good results if the clusters are compact, hyperspherical in shape, and well-separated in the feature space. However, the need to choose k and the initial partition is the main drawback of this algorithm. The value of k is often chosen empirically or based on prior knowledge about the data. The initial partition is often chosen by generating k random points uniformly distributed within the range of the data, or by randomly selecting k points from the data. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 98 / 108

99 Hierarchical Clustering In some applications, groups of patterns share some characteristics when viewed at a particular level. Hierarchical clustering tries to capture these multi-level groupings using hierarchical representations. Figure 44: A dendrogram can represent the results of hierarchical clustering algorithms. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 99 / 108
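Agglomerative hierarchical clustering starts with one cluster per pattern and repeatedly merges the closest pair; the sketch below uses single linkage (minimum pairwise distance between clusters), one common but here assumed choice of linkage:

```python
def agglomerative(points, target_k):
    """Naive agglomerative clustering with single linkage; merging until
    target_k clusters remain corresponds to cutting the dendrogram at one level."""
    clusters = [[p] for p in points]

    def dist2(a, b):
        return sum((x - z) ** 2 for x, z in zip(a, b))

    def linkage(A, B):  # single linkage: closest pair across the two clusters
        return min(dist2(a, b) for a in A for b in B)

    while len(clusters) > target_k:
        # find and merge the closest pair of clusters
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters[j]
        del clusters[j]
    return clusters

points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
print(sorted(sorted(D) for D in agglomerative(points, 2)))
```

Recording the sequence of merges (rather than stopping at target_k) yields the full dendrogram of Figure 44.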

100 Algorithm-Independent Learning Issues We have seen many learning algorithms and techniques for pattern recognition. Some of these algorithms may be preferred because of their lower computational complexity. Others may be preferred because they take into account some prior knowledge of the form of the data. Given practical constraints such as finite training data, no pattern classification method is inherently superior to any other. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 100 / 108

101 Estimating and Comparing Classifiers Classification error can be estimated using misclassification and false alarm rates. To compare learning algorithms, we should use independent training and test data generated using static division, rotated division (e.g., cross-validation), or bootstrap methods. Using the error on points not in the training set (also called the off-training set error) is important for evaluating the generalization ability of an algorithm. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 101 / 108
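A rotated division (k-fold cross-validation) can be sketched as an index generator; the contiguous-chunk fold boundaries below are the usual convention, an assumption that the data are not ordered by class:

```python
def kfold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation:
    each of the n samples appears in exactly one test fold."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        yield train, test
        start += size

folds = list(kfold_indices(10, 5))
print(len(folds), [len(test) for _, test in folds])  # -> 5 [2, 2, 2, 2, 2]
```

Averaging the off-training-set error over the k folds gives the rotated-division estimate used to compare algorithms.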

102 Combining Classifiers Just like different features capturing different properties of a pattern, different classifiers also capture different structures and relationships of these patterns in the feature space. An empirical comparison of different classifiers can help us choose one of them as the best classifier for the problem at hand. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 102 / 108

103 Combining Classifiers Although most of the classifiers may have similar error rates, the sets of patterns misclassified by different classifiers do not necessarily overlap. Not relying on a single decision but rather combining the advantages of different classifiers is intuitively promising for improving the overall accuracy of classification. Such combinations are variously called combined classifiers, ensemble classifiers, mixture-of-expert models, or pooled classifiers. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 103 / 108

104 Combining Classifiers Some reasons for combining multiple classifiers to solve a given classification problem can be stated as follows: Access to different classifiers, each developed in a different context and for an entirely different representation/description of the same problem. Availability of multiple training sets, each collected at a different time or in a different environment, and possibly even using different features. Local performances of different classifiers, where each classifier may have its own region in the feature space where it performs the best. Different performances due to different initializations and randomness inherent in the training procedure. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 104 / 108

105 Combining Classifiers In summary, we may have different feature sets, training sets, classification methods, and training sessions, all resulting in a set of classifiers whose outputs may be combined. Combination architectures can be grouped as: Parallel: all classifiers are invoked independently and then their results are combined by a combiner. Serial (cascading): individual classifiers are invoked in a linear sequence where the number of possible classes for a given pattern is gradually reduced. Hierarchical (tree): individual classifiers are combined into a structure, which is similar to that of a decision tree, where the nodes are associated with the classifiers. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 105 / 108

106 Combining Classifiers Examples of classifier combination schemes are: Majority voting where each classifier makes a binary decision (vote) about each class and the final decision is made in favor of the class with the largest number of votes. Bayesian combination: sum, product, maximum, minimum and median of the posterior probabilities from individual classifiers. Bagging where multiple classifiers are built by bootstrapping the original training set. Boosting where a sequence of classifiers is built by training each classifier using data sampled from a distribution derived from the empirical misclassification rate of the previous classifier. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 106 / 108
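Majority voting, the first scheme above, reduces to counting votes per pattern; a minimal Python sketch follows (the label lists are hypothetical):

```python
from collections import Counter

def majority_vote(predictions):
    """predictions[c][i] is the label classifier c assigns to pattern i.
    Returns, for each pattern, the class with the largest number of votes
    (ties broken by the order in which labels are first seen)."""
    n_patterns = len(predictions[0])
    combined = []
    for i in range(n_patterns):
        votes = Counter(clf[i] for clf in predictions)
        combined.append(votes.most_common(1)[0][0])
    return combined

# Three classifiers, two patterns: the ensemble recovers the consensus label
# even though the third classifier is wrong on both patterns.
print(majority_vote([["a", "b"], ["a", "b"], ["b", "a"]]))  # -> ['a', 'b']
```

The Bayesian combination rules in the list differ only in replacing the vote counts with sums, products, or order statistics of the classifiers' posterior probabilities.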

107 Structural and Syntactic Pattern Recognition Statistical pattern recognition attempts to classify patterns based on a set of extracted features and an underlying statistical model for the generation of these patterns. Ideally, this is achieved with a rather straightforward procedure: determine the feature vector, train the system, classify the patterns. Unfortunately, there are also many problems where patterns contain structural and relational information that is difficult or impossible to quantify in feature vector form. UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 107 / 108

108 Structural and Syntactic Pattern Recognition Structural pattern recognition assumes that pattern structure is quantifiable and extractable so that structural similarity of patterns can be assessed. Typically, these approaches formulate hierarchical descriptions of complex patterns built up from simpler primitive elements. This structure quantification and description are mainly done using: Formal grammars Relational descriptions (principally graphs) Then, recognition and classification are done using: Parsing (for formal grammars) Relational graph matching (for relational descriptions) UYGU 2014 c 2014, Selim Aksoy (Bilkent University) 108 / 108


CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes

### Support Vector Machines Explained

March 1, 2009 Support Vector Machines Explained Tristan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introduction This document has been written in an attempt to make the Support Vector Machines (SVM),

### NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Chapter 340 Principal Components Regression Introduction is a technique for analyzing multiple regression data that suffer from multicollinearity. When multicollinearity occurs, least squares estimates

### Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

### Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

Classifying Large Data Sets Using SVMs with Hierarchical Clusters Presented by :Limou Wang Overview SVM Overview Motivation Hierarchical micro-clustering algorithm Clustering-Based SVM (CB-SVM) Experimental

### Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague. http://ida.felk.cvut.

Machine Learning and Data Analysis overview Jiří Kléma Department of Cybernetics, Czech Technical University in Prague http://ida.felk.cvut.cz psyllabus Lecture Lecturer Content 1. J. Kléma Introduction,

### KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics

ROCHESTER INSTITUTE OF TECHNOLOGY COURSE OUTLINE FORM KATE GLEASON COLLEGE OF ENGINEERING John D. Hromi Center for Quality and Applied Statistics NEW (or REVISED) COURSE (KGCOE- CQAS- 747- Principles of

### Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Data Mining Clustering (2) Toon Calders Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining Outline Partitional Clustering Distance-based K-means, K-medoids,

### Data Mining Practical Machine Learning Tools and Techniques. Slides for Chapter 7 of Data Mining by I. H. Witten and E. Frank

Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 7 of Data Mining by I. H. Witten and E. Frank Engineering the input and output Attribute selection Scheme independent, scheme

### 6. If there is no improvement of the categories after several steps, then choose new seeds using another criterion (e.g. the objects near the edge of

Clustering Clustering is an unsupervised learning method: there is no target value (class label) to be predicted, the goal is finding common patterns or grouping similar examples. Differences between models/algorithms

### Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be

### Improved Fuzzy C-means Clustering Algorithm Based on Cluster Density

Journal of Computational Information Systems 8: 2 (2012) 727 737 Available at http://www.jofcis.com Improved Fuzzy C-means Clustering Algorithm Based on Cluster Density Xiaojun LOU, Junying LI, Haitao

### COLLEGE OF SCIENCE. John D. Hromi Center for Quality and Applied Statistics

ROCHESTER INSTITUTE OF TECHNOLOGY COURSE OUTLINE FORM COLLEGE OF SCIENCE John D. Hromi Center for Quality and Applied Statistics NEW (or REVISED) COURSE: COS-STAT-747 Principles of Statistical Data Mining

### Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012

Clustering Big Data Anil K. Jain (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012 Outline Big Data How to extract information? Data clustering

### Learning. Artificial Intelligence. Learning. Types of Learning. Inductive Learning Method. Inductive Learning. Learning.

Learning Learning is essential for unknown environments, i.e., when designer lacks omniscience Artificial Intelligence Learning Chapter 8 Learning is useful as a system construction method, i.e., expose

### Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototype-based clustering Density-based clustering Graph-based

### Statistical Models in Data Mining

Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of

### Unsupervised Data Mining (Clustering)

Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in

### Exploratory Data Analysis with MATLAB

Computer Science and Data Analysis Series Exploratory Data Analysis with MATLAB Second Edition Wendy L Martinez Angel R. Martinez Jeffrey L. Solka ( r ec) CRC Press VV J Taylor & Francis Group Boca Raton

### A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions

A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois at Urbana-Champaign IBM T. J. Watson Research Center

### Principles of Data Mining by Hand&Mannila&Smyth

Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences