Word Level Script Identification for Scanned Document Images

Huanfeng Ma and David Doermann
Language and Media Processing Laboratory, Institute for Advanced Computer Studies
University of Maryland, College Park, MD 20742, USA

ABSTRACT

In this paper, we compare the performance of three classifiers used to identify the script of words in scanned document images. In both training and testing, Gabor filters are applied and 16 channels of features are extracted. Three classifiers, Support Vector Machines (SVM), Gaussian Mixture Models (GMM) and k-nearest-neighbor (k-nn), are used to identify different scripts at the word level (glyphs separated by white space). The classifiers are applied to a variety of bilingual dictionaries and their performance is compared. Experimental results show the capability of Gabor filters to capture script features and the effectiveness of the three classifiers for script identification at the word level.

Keywords: Script Identification, Support Vector Machines (SVM), Gaussian Mixture Model (GMM), k-nearest-neighbor (k-nn), Gabor Filter

1. INTRODUCTION

1.1. Background

In recent years, the demand for tools that can recognize, search and retrieve written and spoken sources of multilingual information has increased tremendously. With the rapid growth of online repositories, researchers and developers of cross-lingual search and translation systems can easily obtain many of the resources they need from the Internet. However, significant resources are still accessible only in printed form, especially for sparse, low-density languages. Manipulation and conversion of these printed documents is essential for many researchers and organizations.
One of the most important tasks to address with printed documents is the automatic recognition of text, which usually consists of three steps: (1) zone segmentation and text region identification using document layout analysis; (2) text line, word and character segmentation; and (3) optical character recognition (OCR). In the last step, OCR systems are often designed to work on documents in one specific script. In order to parse bilingual or multilingual documents such as patents [1] or bilingual dictionaries, or to perform multilingual document retrieval [2], the script must be identified before words are fed to an appropriate OCR system.

Our motivation for script identification stems from attempts to acquire lexicons from bilingual dictionaries. In our previous work [2, 3], a document image was first segmented into physical zones and then into entries based on extracted features. For bilingual dictionaries with one non-Roman script, script identification is essential both in entry segmentation and in parsing and tagging the entry itself.

Further author information: (Send correspondence to Huanfeng Ma)
Huanfeng Ma: hfma@umiacs.umd.edu
David Doermann: doermann@umiacs.umd.edu

1.2. Previous Work

In earlier work on script identification, Hochberg et al. [4] described a technique for identifying 13 scripts, including highly connected ones. In their algorithm, a scale-normalized cluster template was created for each script, based on the frequent characters or word shapes of that script. Scripts were then classified by comparing a subset of the document's textual symbols with these templates. Spitz et al. [1, 5] initially divided scripts into Asian (Chinese, Japanese and Korean) or Roman based on the observation that upward concavities are distributed evenly along the vertical axis of Asian characters, but tend to appear at specific locations in Roman characters. Discrimination among Asian scripts was then made on the basis of character density. Waked et al. [6] used features of the horizontal projection of a text line to classify scripts into three categories: Arabic, Roman and Ideographic. Specific features such as character complexity or curvature can be used to distinguish different scripts if the classifier designer has sufficient knowledge of the scripts; such classification is therefore case-dependent.

In recent years, texture analysis techniques have been introduced to classify different font styles and font faces. Zhu et al. [7] presented a font recognition algorithm based on global texture analysis. Gabor filters were used to extract the global texture features, and the extracted features were used to recognize different font styles and faces. They also demonstrated the capability to identify Chinese and English documents with different fonts.

2. SCRIPT IDENTIFICATION AT THE WORD LEVEL

All of the script identification approaches mentioned above operate at the block or page level, which means that a block or page is assumed to contain a single script. This is not the case for bilingual dictionaries, where text in different scripts may be interlaced.
For example, in the English-Chinese bilingual dictionary shown in Figure 1, there is no rule to identify which part should be Chinese and which part should be English unless the content is known. It is therefore impossible to group words of the same script into a single component, and the identification must be done at the word level.

Figure 1. English-Chinese dictionary.

In our work, we perform script identification using Gabor filter analysis of textures. The use of Gabor filters to extract texture features of images was motivated by two factors: (1) the Gabor representation has been shown to be optimal in the sense of minimizing the joint two-dimensional uncertainty in space and frequency [8]; and (2) Gabor filters can be considered orientation- and scale-tunable edge and line detectors, and the statistics of these micro-features in a given region are often used to characterize the underlying texture information.

In our previous work [9], we proposed a general approach which computed the mean and standard deviation feature vectors of each class in the training phase. Each test sample was then classified based on its distance to each class. Suppose the feature vector lies in a d-dimensional space, and the computed mean and standard deviation feature vectors for class $\lambda_i$ are $\mu^{(i)}$ and $\alpha^{(i)}$, where $i = 1, \ldots, M$ and $M$ is the number of classes. Then for each test sample $x \in R^d$, the distance between the sample and each class is computed as:

$$d(x, \lambda_i) = \sum_{k=1}^{d} \left| \frac{x_k - \mu_k^{(i)}}{\alpha_k^{(i)}} \right|, \qquad i = 1, \ldots, M$$

The sample is assigned to the class with the smallest distance.
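The distance-based rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the use of NumPy, the function names `fit_class_stats` and `classify`, the toy two-dimensional features, the absolute value inside the sum and the small floor on the standard deviation are choices made for this sketch, not details taken from the earlier system.

```python
import numpy as np

def fit_class_stats(samples_by_class):
    """Compute the per-class mean and standard deviation feature vectors."""
    stats = {}
    for label, samples in samples_by_class.items():
        samples = np.asarray(samples, dtype=float)
        mu = samples.mean(axis=0)
        # Population standard deviation; the floor avoids division by zero
        # for constant features (an assumption of this sketch).
        alpha = np.maximum(samples.std(axis=0), 1e-12)
        stats[label] = (mu, alpha)
    return stats

def classify(x, stats):
    """Return the label with the smallest normalized distance, plus all distances."""
    x = np.asarray(x, dtype=float)
    distances = {
        label: float(np.sum(np.abs(x - mu) / alpha))
        for label, (mu, alpha) in stats.items()
    }
    return min(distances, key=distances.get), distances

# Hypothetical toy features: two classes in a 2-dimensional feature space.
train = {
    "roman": [[1.0, 2.0], [1.2, 2.2], [0.8, 1.8]],
    "arabic": [[4.0, 0.5], [4.4, 0.7], [3.6, 0.3]],
}
stats = fit_class_stats(train)
label, distances = classify([1.1, 2.1], stats)
print(label, distances)
```

Because each feature is divided by its class standard deviation, features that vary widely within a class contribute less to the distance than tightly clustered ones.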

In this paper, k-nearest-neighbor (k-nn), Support Vector Machine (SVM) and Gaussian Mixture Model (GMM) classifiers are applied to improve the performance. The performance of these three classifiers is compared in Section 4.

3. SYSTEM DESIGN

A diagram of the system is shown in Figure 2. The main operations in each part of the system are described in detail in the following subsections.

Figure 2. System architecture.

3.1. Preprocessing

The goal of document image preprocessing is to clean the document image and remove variations that may affect the final identification results. The main operations are: (1) image deskewing; (2) line removal; and (3) symbol removal.

Deskewing. During scanning, the document may be skewed. Word segmentation is based on the bounding box of the segmented word, which can be affected by skew. Deskewing is based on a horizontal projection profile [10]. Assuming the skew angle is less than 15 degrees, we first obtain the horizontal projection profile of all text lines. By iteratively rotating the image and computing the correlation of the profile, we obtain the deskewing angle. The image is then rotated by this angle.

Line removal. The lines we want to remove from bilingual dictionary pages usually appear as long horizontal lines at the top or bottom of a page, or as long vertical lines in the middle, so we are concerned primarily with these two cases. The line detection and removal algorithm therefore does not need to be complicated: a Hough transform is applied to detect the lines to be removed.

Symbol removal. In most bilingual dictionaries, there are some special symbols which belong to neither of the scripts we want to identify. Simply assigning these symbols to one class can degrade the classification performance. Before classification, these symbols need to be detected and removed from the original image to generate a clean image.
In our work, a template matching approach was applied to detect the symbols. First, we extracted all the symbols and created one model template for each symbol. Then, for each saved template, we scanned the image to detect and recognize the symbol based on a generalized Hausdorff measure. The generalized Hausdorff distance measures the degree of mismatch between two point sets, and can therefore be used to evaluate the resemblance of one point set to another [11]. Once a symbol is detected, the rectangular area that covers it is set to the background color. Figure 3 shows an example of symbol detection and removal.

Figure 3. Symbol detection and removal. The left image is the original image, the middle image shows the removed symbols, and the right image is the clean image.

3.2. Word Extraction and Processing

Script classification is applied at the word level, so before texture features are extracted, all words need to be extracted from the document image. In our work, the Docstrum algorithm [12] was applied to perform word segmentation. Word images in different classes, and even different word images in the same class, may have different sizes (width and height). To make the features consistent, word image replication and scaling are applied to create a normalized image of a predefined size (64 × 64 pixels in our case). The features used in the following sections are extracted from images of this size. Figure 4 shows word image replication and scaling examples for two different scripts (Arabic and Roman).

Figure 4. Word image replication and scaling. (a, c) Original images; (b, d) normalized images.

3.3. Feature Extraction

A pair of isotropic Gabor filters is applied to extract the texture features of each class.

3.3.1. Gabor filter design

The computational models of the 2D isotropic Gabor filters are:

$$h_e(x, y) = g(x, y) \cos[2\pi f (x \cos\theta + y \sin\theta)]$$

$$h_o(x, y) = g(x, y) \sin[2\pi f (x \cos\theta + y \sin\theta)]$$

where $h_e$ and $h_o$ are the even-symmetric and odd-symmetric Gabor filters, and $g(x, y)$ is an isotropic Gaussian function of the form:

$$g(x, y) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{x^2 + y^2}{2\sigma^2}\right)$$

The spatial frequency responses of the Gabor functions are:

$$H_e(u, v) = \frac{H_1(u, v) + H_2(u, v)}{2}$$

$$H_o(u, v) = \frac{H_1(u, v) - H_2(u, v)}{2j}$$

where $j = \sqrt{-1}$ and

$$H_1(u, v) = \exp\{-2\pi^2\sigma^2 [(u - f\cos\theta)^2 + (v - f\sin\theta)^2]\}$$

$$H_2(u, v) = \exp\{-2\pi^2\sigma^2 [(u + f\cos\theta)^2 + (v + f\sin\theta)^2]\}$$

Here $f$, $\theta$ and $\sigma$ are the spatial frequency, orientation and space constant of the Gabor envelope. In our case, the image size is normalized to 64 × 64, so four values of spatial frequency are selected: 0.04, 0.08, 0.16 and [...]. The combination of these four frequencies with four selected values of $\theta$ (0°, 45°, 90°, 135°) gives a total of 16 Gabor channels. The non-orthogonality of the Gabor wavelets implies that there is redundant information in the filtered images. To reduce the redundancy, the filters are designed to ensure that the half-peak magnitude supports of the filter responses in the frequency spectrum touch each other, as shown in Figure 5. The space constant $\sigma$ is therefore selected using the formula $\sigma = 1/(0.6f)$.

Figure 5. Frequency response of Gabor filters (left: desired response; right: real response).

3.3.2. Feature representation

The Gabor wavelet transform of an image $I(x, y)$ is defined as:

$$G_{mn}(x, y) = \iint I(s, t)\, g_{mn}^{*}(x - s, y - t)\, ds\, dt$$

where $g_{mn}$ is the Gabor filter of channel $(m, n)$, with $m, n = 0, \ldots, 3$, and * indicates the complex conjugate. Based on the computed mean $\mu_{mn}$ and standard deviation $\sigma_{mn}$ of the magnitude of the transform coefficients, a feature vector (of dimension 32, representing 16 channels) is constructed as:

$$x = [\mu_{00}, \sigma_{00}, \mu_{01}, \sigma_{01}, \ldots, \mu_{33}, \sigma_{33}]$$

where $\mu_{mn}$ and $\sigma_{mn}$ are computed as:

$$\mu_{mn} = \iint |G_{mn}(x, y)|\, dx\, dy$$

$$\sigma_{mn} = \sqrt{\iint \left(|G_{mn}(x, y)| - \mu_{mn}\right)^2 dx\, dy}$$
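To make the feature construction concrete, the sketch below samples the even and odd filters, combines them into one complex kernel per channel, filters a 64 × 64 image and collects the mean and standard deviation of the response magnitude. Several points are assumptions of this sketch and not details from the paper: NumPy is used, filtering is done as a circular convolution through the FFT, the statistics are normalized by the number of pixels, the image is a random placeholder standing in for a normalized word image, and the fourth frequency (0.32) is a guess that continues the doubling pattern, since the value is missing from the text.

```python
import numpy as np

def gabor_pair(size, f, theta, sigma):
    """Sample the even (cosine) and odd (sine) isotropic Gabor filters."""
    half = size // 2
    y, x = np.mgrid[-half:size - half, -half:size - half].astype(float)
    g = np.exp(-(x**2 + y**2) / (2.0 * sigma**2)) / (2.0 * np.pi * sigma**2)
    phase = 2.0 * np.pi * f * (x * np.cos(theta) + y * np.sin(theta))
    return g * np.cos(phase), g * np.sin(phase)

def gabor_features(image, frequencies, thetas_deg=(0, 45, 90, 135)):
    """Return [mu_00, sigma_00, mu_01, sigma_01, ...] for a square image."""
    image = np.asarray(image, dtype=float)
    size = image.shape[0]
    spectrum = np.fft.fft2(image)
    features = []
    for f in frequencies:
        sigma = 1.0 / (0.6 * f)  # space constant formula quoted in the text
        for theta in np.deg2rad(thetas_deg):
            h_e, h_o = gabor_pair(size, f, theta, sigma)
            # Complex kernel with its centre moved to index (0, 0). For a
            # real image, conjugating the kernel only conjugates the
            # response, so the magnitude used below is unchanged.
            kernel = np.fft.ifftshift(h_e + 1j * h_o)
            response = np.fft.ifft2(spectrum * np.fft.fft2(kernel))
            magnitude = np.abs(response)
            features.extend([magnitude.mean(), magnitude.std()])
    return np.array(features)

# Placeholder input: a random binary 64 x 64 image.
rng = np.random.default_rng(0)
word_image = (rng.random((64, 64)) > 0.5).astype(float)
# The last frequency is an assumed value, not taken from the paper.
features = gabor_features(word_image, frequencies=(0.04, 0.08, 0.16, 0.32))
print(features.shape)
```

With four frequencies and four orientations the result has 32 entries, matching the feature dimension stated above.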

3.4. Classifier Design

Although the system can extract the texture features of different scripts, how they are best used depends on the characteristics of the specific scripts. It is important to assign appropriate weights to different features based on the training samples. As mentioned in Section 2, in previous work [9] we performed the classification based on the distance of a test sample to each class. The distribution of training samples in the feature space was taken into account only by normalizing the distance with the standard deviations of the training samples. To improve the classification performance, we employ three new classifiers.

3.4.1. k-nn classifier

The k-nearest-neighbor classifier is an extension of the Nearest Neighbor classifier introduced by Cover and Hart [13]. As illustrated in Figure 6, a test sample x is classified by assigning it the label most frequently represented among the k nearest training samples; that is, a decision is made by examining the labels of the k nearest neighbors and taking a vote.

Figure 6. The k-nearest-neighbor classifier. It starts at the test sample x and grows a spherical region until k training samples are enclosed. The test sample is labeled by a majority vote of these samples. In this k=3 case, the test sample x would be labeled with the class of the * points.

3.4.2. SVM classifier

SVMs were first introduced in the late seventies, but have only recently received increased attention. SVMs have been applied in many fields such as handwritten digit recognition [14], object recognition [15], speaker identification [16], face detection in images [17] and text categorization [18]. The SVM classifier constructs a best separating hyperplane (the maximal margin plane) in a high-dimensional feature space which is defined by nonlinear transformations of the original feature variables.
Consider the binary classification task in which we have a set of training samples $\{x_i, y_i\}$, $i = 1, \ldots, N$, with $x_i \in R^d$ and $y_i \in \{-1, +1\}$, where the $y_i$ are labels corresponding to the two classes $\lambda_1$ and $\lambda_2$. The discriminant function is defined as:

$$g(x) = w^T \Phi(x) + b \qquad (1)$$

with the decision rule

$$w^T \Phi(x_i) + b > 0 \ \text{ for } x_i \in \lambda_1 \text{ with } y_i = +1 \qquad (2)$$

$$w^T \Phi(x_i) + b < 0 \ \text{ for } x_i \in \lambda_2 \text{ with } y_i = -1 \qquad (3)$$

and all training points are correctly classified if

$$y_i (w^T \Phi(x_i) + b) > 0 \ \text{ for all } i \qquad (4)$$

Figure 7(a) shows two linearly separable sets of data. Many possible hyperplanes can separate these two sets. The goal of the SVM is to determine the hyperplane for which the margin is the largest, where the margin is the distance between two parallel hyperplanes (H1 and H2 in Figure 7, termed the canonical hyperplanes) on each side of the hyperplane H that separates the data. The data points that lie on the two canonical hyperplanes are called support vectors (circled in Figure 7).

Figure 7. Separating hyperplanes for two sets of data. (a) Linear separating hyperplanes; (b) nonlinear separating hyperplanes. The separating hyperplane is $H: w^T \Phi(x) + b = 0$ and the two canonical hyperplanes are $H_1: w^T \Phi(x) + b = +1$ and $H_2: w^T \Phi(x) + b = -1$. The circled data points, which lie on the two canonical hyperplanes, are support vectors.

The transformation defined by the mapping function $\Phi(x)$ in Eq. 1 can be linear or nonlinear, so it can be applied both to linearly separable data and to data that can only be separated nonlinearly. Figure 7(a) shows an example of separating hyperplanes for linearly separable data, while the two data sets shown in Figure 7(b) can only be separated nonlinearly. For nonlinear SVMs, the kernel function $K(x_i, x_j)$, which is defined as

$$K(x_i, x_j) = \Phi(x_i) \cdot \Phi(x_j)$$

can be polynomial, Gaussian or sigmoid. Burges [18] gave a detailed description of how to find the separating hyperplanes. We chose the SVM implementation SVM-light [19] and the polynomial kernel function in our work. The SVM was trained using randomly chosen training pages.

3.4.3. GMM classifier

The Gaussian Mixture Model (GMM) classifier models the probability density function of a feature vector $x$ by a weighted combination of $M$ multivariate Gaussian densities:

$$p(x \mid \Lambda) = \sum_{i=1}^{M} p_i\, g_i(x)$$

where the weight (mixing parameter) $p_i$ corresponds to the prior probability that feature $x$ was generated by component $i$, and satisfies $\sum_{i=1}^{M} p_i = 1$. Each component $\lambda_i$ is a Gaussian, written $\lambda_i = N(p_i, \mu_i, \Sigma_i)$, whose probability density is:

$$g_i(x) = \frac{1}{\sqrt{(2\pi)^d\, |\Sigma_i|}} \exp\left(-\frac{1}{2}(x - \mu_i)^T \Sigma_i^{-1} (x - \mu_i)\right)$$

where $\mu_i$ and $\Sigma_i$ are the mean vector and covariance matrix of Gaussian mixture component $i$, and $d$ is the dimension of the input feature vector.
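The component density $g_i(x)$ and the mixture density $p(x \mid \Lambda)$ defined above can be evaluated numerically as in the following sketch. It assumes NumPy; the function names `gaussian_density` and `mixture_density` and the two-component toy parameters are hypothetical and chosen only so that the results are easy to check by hand.

```python
import numpy as np

def gaussian_density(x, mean, cov):
    """Evaluate the d-dimensional Gaussian density g_i(x)."""
    x, mean, cov = (np.asarray(a, dtype=float) for a in (x, mean, cov))
    d = mean.shape[0]
    diff = x - mean
    # (x - mu)^T Sigma^{-1} (x - mu), computed without forming the inverse.
    mahalanobis = diff @ np.linalg.solve(cov, diff)
    norm = np.sqrt((2.0 * np.pi) ** d * np.linalg.det(cov))
    return float(np.exp(-0.5 * mahalanobis) / norm)

def mixture_density(x, weights, means, covs):
    """Evaluate p(x | Lambda) as the weighted sum of component densities."""
    return sum(p * gaussian_density(x, m, c)
               for p, m, c in zip(weights, means, covs))

# Toy mixture: two equally weighted unit-covariance components in 2-D.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [2.0, 0.0]]
covs = [np.eye(2), np.eye(2)]

peak = gaussian_density([0.0, 0.0], means[0], covs[0])
midpoint = mixture_density([1.0, 0.0], weights, means, covs)
print(peak, midpoint)
```

For 32-dimensional features the determinant and the exponential can underflow, so a practical implementation would more likely work with log densities; that variation is omitted here to keep the sketch close to the formulas.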
The Gaussian mixture is thus completely specified by the mean vectors, covariance matrices and mixture weights of all components, and can be represented by

$$\Lambda = \{\lambda_i = N(p_i, \mu_i, \Sigma_i)\}, \qquad i = 1, \ldots, M$$

The probability that an observed input vector $x$ belongs to the class $\lambda_i = N(p_i, \mu_i, \Sigma_i)$ is given, in terms of density, by

$$p(\lambda_i \mid x) = \frac{p(x \mid \lambda_i)\, p(\lambda_i)}{p(x \mid \Lambda)} = \frac{p_i\, g_i(x)}{\sum_{j=1}^{M} p_j\, g_j(x)} \qquad (5)$$

For script identification, the number of components $M$ is the number of different scripts. For bilingual documents and 16 channels of Gabor filter features, we therefore have $M = 2$ and $d = 32$. Given $N$ training samples $\{x_1, x_2, \ldots, x_N\}$, the initial Gaussian mixture model represented by $(p_i, \mu_i, \Sigma_i)$ is estimated from the training samples using standard techniques:

$$p_i = \frac{1}{N} \sum_{n=1}^{N} p(\lambda_i \mid x_n) = \frac{N_i}{N} \qquad (6)$$

$$\mu_i = \frac{\sum_{n=1}^{N} p(\lambda_i \mid x_n)\, x_n}{\sum_{n=1}^{N} p(\lambda_i \mid x_n)} = \frac{1}{N_i} \sum_{k=1}^{N_i} x_k^{(i)} \qquad (7)$$

$$\Sigma_i = \frac{\sum_{n=1}^{N} p(\lambda_i \mid x_n)(x_n - \mu_i)(x_n - \mu_i)^T}{\sum_{n=1}^{N} p(\lambda_i \mid x_n)} = \frac{1}{N_i} \sum_{k=1}^{N_i} (x_k^{(i)} - \mu_i)(x_k^{(i)} - \mu_i)^T \qquad (8)$$

In Eqs. 6, 7 and 8, $1 \le i \le M$, $N_i$ is the number of samples which belong to class $\lambda_i$, and $x_k^{(i)}$ is the k-th of these samples. Because the distributions of script components differ from page to page, the estimated models are refined iteratively via maximum-likelihood detection. At each iteration, the decision for each observation $x$ (test sample) is:

$$p(\lambda_1 \mid x) \underset{\lambda_2}{\overset{\lambda_1}{\gtrless}} p(\lambda_2 \mid x) \qquad (9)$$

Substituting Eq. 5 into Eq. 9 and taking the logarithm of both sides, we obtain the following maximum likelihood decision rule:

$$(x - \mu_2)^T \Sigma_2^{-1} (x - \mu_2) - (x - \mu_1)^T \Sigma_1^{-1} (x - \mu_1) \underset{\lambda_2}{\overset{\lambda_1}{\gtrless}} \ln|\Sigma_1| - \ln|\Sigma_2| + 2\ln p_2 - 2\ln p_1 \qquad (10)$$

The procedure to obtain the classifier that identifies the two scripts is:

(1) Estimate the parameters $(p_i, \mu_i, \Sigma_i)$ of the Gaussian mixture model using Eqs. 6, 7 and 8;
(2) For each feature vector $x$, perform the classification based on Eq. 10;
(3) Re-estimate the parameters $(p_i, \mu_i, \Sigma_i)$ based on the newly classified feature vectors;
(4) Go back to step (2) and perform the classification again until the iteration stop condition is satisfied.

4. EXPERIMENTS

The proposed approaches were applied to 20 randomly chosen pages from each of four bilingual dictionaries: Arabic-English, Korean-English, Hindi-English and Chinese-English. Based on these pages, we performed the following two experiments.

4.1. Experiment 1: leave-one-out

This experiment tests how each individual classifier affects the performance for limited data. For each of the four dictionaries, we partition the 20 pages into 19 training pages and 1 test page. The process is repeated a total of 20 times and the accuracy across all partitions is shown in Table 1.

4.2. Experiment 2: use-one-training

In this experiment, a single page of the 20 pages is selected as the training set. The trained system is applied to all of the other pages and the average accuracy is recorded. Compared with the first experiment, these results show how a smaller (and more realistic) training set affects the performance. The results of this experiment are shown in Table 2.
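The two evaluation protocols can be expressed as simple page-level loops. The sketch below is an illustration under assumptions: pages are represented as lists of (feature, label) pairs, the function names are hypothetical, a toy nearest-mean classifier on one-dimensional features stands in for the k-nn, SVM and GMM classifiers, and the use-one-training protocol is assumed to be repeated with every page taking a turn as the training page.

```python
from statistics import mean

def evaluate(pages, train_idx, test_idx, fit, predict):
    """Train on the given pages and return the mean accuracy over test pages."""
    model = fit([sample for i in train_idx for sample in pages[i]])
    accuracies = []
    for i in test_idx:
        correct = sum(predict(model, x) == y for x, y in pages[i])
        accuracies.append(correct / len(pages[i]))
    return mean(accuracies)

def leave_one_out(pages, fit, predict):
    """Each page is the test page once; all other pages are used for training."""
    n = len(pages)
    return [evaluate(pages, [j for j in range(n) if j != i], [i], fit, predict)
            for i in range(n)]

def use_one_training(pages, fit, predict):
    """Each page is the only training page once; all other pages are tested."""
    n = len(pages)
    return [evaluate(pages, [i], [j for j in range(n) if j != i], fit, predict)
            for i in range(n)]

# Stand-in classifier: nearest class mean on a 1-D feature.
def fit_nearest_mean(samples):
    sums = {}
    for x, y in samples:
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    return {y: total / count for y, (total, count) in sums.items()}

def predict_nearest_mean(model, x):
    return min(model, key=lambda y: abs(x - model[y]))

# Four toy "pages", each holding one word per script.
pages = [[(0.1 * i, "roman"), (5.0 + 0.1 * i, "other")] for i in range(4)]
loo = leave_one_out(pages, fit_nearest_mean, predict_nearest_mean)
one = use_one_training(pages, fit_nearest_mean, predict_nearest_mean)
print(loo, one)
```

Statistics such as the average, standard deviation, median, minimum and maximum can then be computed over the returned per-run accuracies.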

Table 1. Leave-One-Out experimental results (k=3 for k-nn; STD: standard deviation). For each script (Arabic, Chinese, Korean, Hindi) and each classifier (k-nn, SVM, GMM), the table reports the average, STD, median, minimum and maximum accuracy (%).

Table 2. Use-One-Training experimental results (k=3 for k-nn; STD: standard deviation). For each script (Arabic, Chinese, Korean, Hindi) and each classifier (k-nn, SVM, GMM), the table reports the average, STD, median, minimum and maximum accuracy (%).

4.3. Experimental Result Analysis

The results in Tables 1 and 2 show the capability of Gabor filters to capture the features of different scripts and the effectiveness of the three classifiers in identifying scripts at the word level. A comparison of Table 1 and Table 2 shows that a large number of training samples (19 pages) produces better performance than a small number of training samples, although the difference is often minimal.

To show the robustness of these classifiers, we can also visualize the average accuracy and standard deviation of each classifier for each dictionary (Figure 8). From Figure 8(a), we can see that for a large number of training samples, all three classifiers are robust, with a maximal standard deviation of 5.43%; the k-nn classifier obtains the best average accuracy while the SVM has the minimal deviation. Figure 8(b) shows that a relatively small number of training samples can still provide reasonable results, although the training set must be selected very carefully. In the last column of Table 2, there is a very low accuracy of 15.34%, which is highlighted in italics. The reason for this low accuracy is that the page used for training contained only a few words, which made the GMM classifier fail. Figure 8(b) also shows that for small training sets, the performance of the three classifiers is almost the same.

Figure 8. The means and standard deviations of the three classifiers working on the four dictionaries (panels (a) and (b)).

In the above analysis, the k value of the k-nn classifier was always set to 3. However, the choice of k may also affect the performance of this classifier. Using different k values (k = 1, 3, 5), we obtained the results of the leave-one-out experiment for all four dictionaries, which are shown in Figure 9. These results show that, for this script identification task, the 3-nn classifier often gives the best result, while the trend of the results is consistent across the different k values.

Figure 9. Experimental result comparison of the k-nn classifier with k = 1, 3, 5. The results are sorted based on the values of the 3-nn classifier.
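The effect of k in the k-nn voting rule can be seen on a small example. The following sketch uses only the Python standard library; the Euclidean distance, the function name `knn_classify` and the two-dimensional toy points are assumptions made for illustration, since the distance metric is not stated in the text.

```python
import math
from collections import Counter

def knn_classify(x, training, k=3):
    """Label x by majority vote among its k nearest training samples."""
    # Sort training samples by (assumed) Euclidean distance to x.
    neighbours = sorted(training, key=lambda item: math.dist(x, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    # With two classes and odd k there are no ties to break.
    return votes.most_common(1)[0][0]

# Toy training set: one "arabic" outlier sits inside the "roman" cluster.
training = [
    ((0.0, 0.0), "roman"),
    ((0.2, 0.1), "roman"),
    ((0.1, 0.3), "roman"),
    ((1.0, 1.0), "arabic"),
    ((1.2, 0.9), "arabic"),
    ((0.45, 0.45), "arabic"),
]

results = {k: knn_classify((0.4, 0.4), training, k) for k in (1, 3, 5)}
print(results)
```

In this toy case k = 1 follows the single outlier, while k = 3 and k = 5 are decided by the surrounding cluster, which is one reason a small k greater than 1 can be more stable.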

Figure 10. Word segmentation for different scripts and image quality. (a) Over-segmentation of Arabic words; (b) over-segmentation of Chinese words caused by low image quality; (c) perfect word segmentation of Chinese words.

4.4. Factors in the Preprocessing Phase that Affect the Performance

Our aim is to provide a general script identification approach. The identification is a sequential process with three main phases: document image preprocessing, word segmentation and script identification. By manually examining the results, we found that the following factors in the preprocessing phase could affect the identification results:

Word segmentation and font face: Since script identification is performed at the word level, word segmentation results heavily affect the identification performance. Browsing the results for the Arabic-English dictionary, we noticed that many of the incorrectly identified Arabic words are over-segmented. This was caused by the nature of the Arabic language and the layout of this dictionary, in which some of the bounding boxes of text lines overlap (Figure 10(a)). Many of the incorrect identifications also occurred when an italic Roman word was identified as Arabic, possibly because the two have similar texture features.

Word segmentation and image quality: Checking the Chinese/Roman identification results with low performance, we found that incorrect identification was also caused by over-segmentation of Chinese words (Figure 10(b)). In addition to over-segmentation, another factor that contributed to incorrect identification is low image quality. Comparing the second and third images in Figure 10, we can see that the image in Figure 10(b) has lower quality and the word segmentation is not as good (this page has the lowest accuracy). The image in Figure 10(c) has higher quality and the word segmentation is perfect (this page has the highest identification accuracy).
Single-character word: In all four dictionaries, another challenge is single-character Roman words. Although the identification is performed at the word level, training still yields a global texture representation of the Roman script. After word image replication and scaling, however, a single character may not have a texture similar to that representation, which leads to incorrect identification.

5. CONCLUSION

In this paper, we have compared the performance of three classifiers applied to script identification at the word level. All three classifiers are based on Gabor filter features. Experiments were carried out on Arabic-, Chinese-, Hindi- and Korean-English bilingual dictionaries, and the results show the effectiveness of the classifiers. Compared to our previous work [9], all three classifiers (k-nn, SVM and GMM) significantly improve the classification performance. Since our classification is at the word level, one primary factor that may affect the accuracy is word segmentation error, which may be caused by scanning noise, text line spacing, word spacing, size and so on. We strongly believe that the results could be significantly improved by addressing the word segmentation problems.

6. ACKNOWLEDGMENT

The support of this research under DARPA cooperative agreement N , National Science Foundation grant EIA and DOD contract MDA90402C0406 is gratefully acknowledged.

REFERENCES

1. A. L. Spitz, "Determination of the script and language content of document images," IEEE Trans. Pattern Analysis and Machine Intelligence 19(3).
2. D. Doermann, H. Ma, B. Karagol-Ayan, and D. W. Oard, "Lexicon acquisition from bilingual dictionaries," in SPIE Conference Document Recognition and Retrieval, (San Jose, CA).
3. H. Ma and D. Doermann, "Bootstrapping structured page segmentation," in SPIE Conference Document Recognition and Retrieval, (Santa Clara, CA).
4. J. Hochberg, P. Kelly, T. Thomas, and L. Kerns, "Automatic script identification from document images using cluster-based templates," IEEE Trans. Pattern Analysis and Machine Intelligence 19(2).
5. P. Sibun and A. L. Spitz, "Language determination: Natural language processing from scanned document images," in Proc. 4th Conference on Applied Natural Language Processing, (Stuttgart).
6. B. Waked, S. Bergler, and C. Y. Suen, "Skew detection, page segmentation, and script classification of printed document images," in IEEE International Conference on Systems, Man, and Cybernetics (SMC 98), (San Diego, CA).
7. Y. Zhu, T. Tan, and Y. Wang, "Font recognition based on global texture analysis," IEEE Trans. Pattern Analysis and Machine Intelligence 23(10).
8. J. G. Daugman, "Complete discrete 2D Gabor transforms by neural networks for image analysis and compression," IEEE Trans. Acoustics, Speech and Signal Processing 36.
9. H. Ma and D. Doermann, "Gabor filter based multi-class classifier for scanned document images," in 7th International Conference on Document Analysis and Recognition, (Edinburgh, Scotland).
10. D. J. Ittner and H. S. Baird, "Language-free layout analysis," in IAPR 2nd Int'l Conf. on Document Analysis and Recognition, (Tsukuba Science City, Japan).
11. D. P. Huttenlocher, G. A. Klanderman, and W. J. Rucklidge, "Comparing images using the Hausdorff distance," IEEE Trans. on Pattern Analysis and Machine Intelligence 15(9).
12. L. O'Gorman, "The document spectrum for page layout analysis," IEEE Trans. Pattern Analysis and Machine Intelligence 15(11).
13. T. M. Cover and P. E. Hart, "Nearest neighbor pattern classification," IEEE Trans. Information Theory IT-13(1).
14. C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning 20.
15. V. Blanz, B. Schölkopf, H. Bülthoff, C. Burges, V. Vapnik, and T. Vetter, "Comparison of view-based object recognition algorithms using realistic 3D models," in International Conference on Artificial Neural Networks, (Berlin).
16. M. Schmidt, "Identifying speaker with support vector networks," in Interface 96 Proceedings, (Sydney).
17. E. Osuna, R. Freund, and F. Girosi, "Training support vector machines: an application to face detection," in 1997 Conference on Computer Vision and Pattern Recognition, (San Juan, Puerto Rico).
18. C. J. C. Burges, "A tutorial on support vector machines for pattern recognition," Data Mining and Knowledge Discovery 2(2).
19. T. Joachims, Advances in Kernel Methods - Support Vector Learning, ch. Making Large-Scale SVM Learning Practical. MIT Press, 1999.


More information

Assessment. Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall

Assessment. Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall Automatic Photo Quality Assessment Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall Estimating i the photorealism of images: Distinguishing i i paintings from photographs h Florin

More information

Knowledge Discovery from patents using KMX Text Analytics

Knowledge Discovery from patents using KMX Text Analytics Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs anton.heijs@treparel.com Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers

More information

Introduction to Support Vector Machines. Colin Campbell, Bristol University

Introduction to Support Vector Machines. Colin Campbell, Bristol University Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.

More information

Classification of Fingerprints. Sarat C. Dass Department of Statistics & Probability

Classification of Fingerprints. Sarat C. Dass Department of Statistics & Probability Classification of Fingerprints Sarat C. Dass Department of Statistics & Probability Fingerprint Classification Fingerprint classification is a coarse level partitioning of a fingerprint database into smaller

More information

Support Vector Machines Explained

Support Vector Machines Explained March 1, 2009 Support Vector Machines Explained Tristan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introduction This document has been written in an attempt to make the Support Vector Machines (SVM),

More information

An Introduction to Machine Learning

An Introduction to Machine Learning An Introduction to Machine Learning L5: Novelty Detection and Regression Alexander J. Smola Statistical Machine Learning Program Canberra, ACT 0200 Australia Alex.Smola@nicta.com.au Tata Institute, Pune,

More information

Analecta Vol. 8, No. 2 ISSN 2064-7964

Analecta Vol. 8, No. 2 ISSN 2064-7964 EXPERIMENTAL APPLICATIONS OF ARTIFICIAL NEURAL NETWORKS IN ENGINEERING PROCESSING SYSTEM S. Dadvandipour Institute of Information Engineering, University of Miskolc, Egyetemváros, 3515, Miskolc, Hungary,

More information

Video OCR for Sport Video Annotation and Retrieval

Video OCR for Sport Video Annotation and Retrieval Video OCR for Sport Video Annotation and Retrieval Datong Chen, Kim Shearer and Hervé Bourlard, Fellow, IEEE Dalle Molle Institute for Perceptual Artificial Intelligence Rue du Simplon 4 CH-190 Martigny

More information

Vision based Vehicle Tracking using a high angle camera

Vision based Vehicle Tracking using a high angle camera Vision based Vehicle Tracking using a high angle camera Raúl Ignacio Ramos García Dule Shu gramos@clemson.edu dshu@clemson.edu Abstract A vehicle tracking and grouping algorithm is presented in this work

More information

An Overview of Knowledge Discovery Database and Data mining Techniques

An Overview of Knowledge Discovery Database and Data mining Techniques An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,

More information

A Simple Introduction to Support Vector Machines

A Simple Introduction to Support Vector Machines A Simple Introduction to Support Vector Machines Martin Law Lecture for CSE 802 Department of Computer Science and Engineering Michigan State University Outline A brief history of SVM Large-margin linear

More information

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin

More information

Component Ordering in Independent Component Analysis Based on Data Power

Component Ordering in Independent Component Analysis Based on Data Power Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals

More information

SOURCE SCANNER IDENTIFICATION FOR SCANNED DOCUMENTS. Nitin Khanna and Edward J. Delp

SOURCE SCANNER IDENTIFICATION FOR SCANNED DOCUMENTS. Nitin Khanna and Edward J. Delp SOURCE SCANNER IDENTIFICATION FOR SCANNED DOCUMENTS Nitin Khanna and Edward J. Delp Video and Image Processing Laboratory School of Electrical and Computer Engineering Purdue University West Lafayette,

More information

DATA MINING TECHNIQUES AND APPLICATIONS

DATA MINING TECHNIQUES AND APPLICATIONS DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,

More information

E-commerce Transaction Anomaly Classification

E-commerce Transaction Anomaly Classification E-commerce Transaction Anomaly Classification Minyong Lee minyong@stanford.edu Seunghee Ham sham12@stanford.edu Qiyi Jiang qjiang@stanford.edu I. INTRODUCTION Due to the increasing popularity of e-commerce

More information

15.062 Data Mining: Algorithms and Applications Matrix Math Review

15.062 Data Mining: Algorithms and Applications Matrix Math Review .6 Data Mining: Algorithms and Applications Matrix Math Review The purpose of this document is to give a brief review of selected linear algebra concepts that will be useful for the course and to develop

More information

Data, Measurements, Features

Data, Measurements, Features Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are

More information

Face detection is a process of localizing and extracting the face region from the

Face detection is a process of localizing and extracting the face region from the Chapter 4 FACE NORMALIZATION 4.1 INTRODUCTION Face detection is a process of localizing and extracting the face region from the background. The detected face varies in rotation, brightness, size, etc.

More information

Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague. http://ida.felk.cvut.

Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague. http://ida.felk.cvut. Machine Learning and Data Analysis overview Jiří Kléma Department of Cybernetics, Czech Technical University in Prague http://ida.felk.cvut.cz psyllabus Lecture Lecturer Content 1. J. Kléma Introduction,

More information

Classification algorithm in Data mining: An Overview

Classification algorithm in Data mining: An Overview Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information

Classification Problems

Classification Problems Classification Read Chapter 4 in the text by Bishop, except omit Sections 4.1.6, 4.1.7, 4.2.4, 4.3.3, 4.3.5, 4.3.6, 4.4, and 4.5. Also, review sections 1.5.1, 1.5.2, 1.5.3, and 1.5.4. Classification Problems

More information

Data Mining: A Preprocessing Engine

Data Mining: A Preprocessing Engine Journal of Computer Science 2 (9): 735-739, 2006 ISSN 1549-3636 2005 Science Publications Data Mining: A Preprocessing Engine Luai Al Shalabi, Zyad Shaaban and Basel Kasasbeh Applied Science University,

More information

Operations Research and Knowledge Modeling in Data Mining

Operations Research and Knowledge Modeling in Data Mining Operations Research and Knowledge Modeling in Data Mining Masato KODA Graduate School of Systems and Information Engineering University of Tsukuba, Tsukuba Science City, Japan 305-8573 koda@sk.tsukuba.ac.jp

More information

How To Cluster

How To Cluster Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

More information

Environmental Remote Sensing GEOG 2021

Environmental Remote Sensing GEOG 2021 Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class

More information

Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier

Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier D.Nithya a, *, V.Suganya b,1, R.Saranya Irudaya Mary c,1 Abstract - This paper presents,

More information

jorge s. marques image processing

jorge s. marques image processing image processing images images: what are they? what is shown in this image? What is this? what is an image images describe the evolution of physical variables (intensity, color, reflectance, condutivity)

More information

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015 RESEARCH ARTICLE OPEN ACCESS Data Mining Technology for Efficient Network Security Management Ankit Naik [1], S.W. Ahmad [2] Student [1], Assistant Professor [2] Department of Computer Science and Engineering

More information

Subspace Analysis and Optimization for AAM Based Face Alignment

Subspace Analysis and Optimization for AAM Based Face Alignment Subspace Analysis and Optimization for AAM Based Face Alignment Ming Zhao Chun Chen College of Computer Science Zhejiang University Hangzhou, 310027, P.R.China zhaoming1999@zju.edu.cn Stan Z. Li Microsoft

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

Recognition Method for Handwritten Digits Based on Improved Chain Code Histogram Feature

Recognition Method for Handwritten Digits Based on Improved Chain Code Histogram Feature 3rd International Conference on Multimedia Technology ICMT 2013) Recognition Method for Handwritten Digits Based on Improved Chain Code Histogram Feature Qian You, Xichang Wang, Huaying Zhang, Zhen Sun

More information

Static Environment Recognition Using Omni-camera from a Moving Vehicle

Static Environment Recognition Using Omni-camera from a Moving Vehicle Static Environment Recognition Using Omni-camera from a Moving Vehicle Teruko Yata, Chuck Thorpe Frank Dellaert The Robotics Institute Carnegie Mellon University Pittsburgh, PA 15213 USA College of Computing

More information

Statistical Models in Data Mining

Statistical Models in Data Mining Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of

More information

Document Image Retrieval using Signatures as Queries

Document Image Retrieval using Signatures as Queries Document Image Retrieval using Signatures as Queries Sargur N. Srihari, Shravya Shetty, Siyuan Chen, Harish Srinivasan, Chen Huang CEDAR, University at Buffalo(SUNY) Amherst, New York 14228 Gady Agam and

More information

Support Vector Machine (SVM)

Support Vector Machine (SVM) Support Vector Machine (SVM) CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

Blog Post Extraction Using Title Finding

Blog Post Extraction Using Title Finding Blog Post Extraction Using Title Finding Linhai Song 1, 2, Xueqi Cheng 1, Yan Guo 1, Bo Wu 1, 2, Yu Wang 1, 2 1 Institute of Computing Technology, Chinese Academy of Sciences, Beijing 2 Graduate School

More information

Categorical Data Visualization and Clustering Using Subjective Factors

Categorical Data Visualization and Clustering Using Subjective Factors Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,

More information

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!

More information

Palmprint Recognition. By Sree Rama Murthy kora Praveen Verma Yashwant Kashyap

Palmprint Recognition. By Sree Rama Murthy kora Praveen Verma Yashwant Kashyap Palmprint Recognition By Sree Rama Murthy kora Praveen Verma Yashwant Kashyap Palm print Palm Patterns are utilized in many applications: 1. To correlate palm patterns with medical disorders, e.g. genetic

More information

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning. Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott

More information

Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

More information

Multi-modal Human-Computer Interaction. Attila Fazekas. Attila.Fazekas@inf.unideb.hu

Multi-modal Human-Computer Interaction. Attila Fazekas. Attila.Fazekas@inf.unideb.hu Multi-modal Human-Computer Interaction Attila Fazekas Attila.Fazekas@inf.unideb.hu Szeged, 04 July 2006 Debrecen Big Church Multi-modal Human-Computer Interaction - 2 University of Debrecen Main Building

More information

Tracking Moving Objects In Video Sequences Yiwei Wang, Robert E. Van Dyck, and John F. Doherty Department of Electrical Engineering The Pennsylvania State University University Park, PA16802 Abstract{Object

More information

Research on Chinese financial invoice recognition technology

Research on Chinese financial invoice recognition technology Pattern Recognition Letters 24 (2003) 489 497 www.elsevier.com/locate/patrec Research on Chinese financial invoice recognition technology Delie Ming a,b, *, Jian Liu b, Jinwen Tian b a State Key Laboratory

More information

Low-resolution Character Recognition by Video-based Super-resolution

Low-resolution Character Recognition by Video-based Super-resolution 2009 10th International Conference on Document Analysis and Recognition Low-resolution Character Recognition by Video-based Super-resolution Ataru Ohkura 1, Daisuke Deguchi 1, Tomokazu Takahashi 2, Ichiro

More information

An Energy-Based Vehicle Tracking System using Principal Component Analysis and Unsupervised ART Network

An Energy-Based Vehicle Tracking System using Principal Component Analysis and Unsupervised ART Network Proceedings of the 8th WSEAS Int. Conf. on ARTIFICIAL INTELLIGENCE, KNOWLEDGE ENGINEERING & DATA BASES (AIKED '9) ISSN: 179-519 435 ISBN: 978-96-474-51-2 An Energy-Based Vehicle Tracking System using Principal

More information

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression

Logistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max

More information

Character Image Patterns as Big Data

Character Image Patterns as Big Data 22 International Conference on Frontiers in Handwriting Recognition Character Image Patterns as Big Data Seiichi Uchida, Ryosuke Ishida, Akira Yoshida, Wenjie Cai, Yaokai Feng Kyushu University, Fukuoka,

More information

Mobile Phone APP Software Browsing Behavior using Clustering Analysis

Mobile Phone APP Software Browsing Behavior using Clustering Analysis Proceedings of the 2014 International Conference on Industrial Engineering and Operations Management Bali, Indonesia, January 7 9, 2014 Mobile Phone APP Software Browsing Behavior using Clustering Analysis

More information

Template-based Eye and Mouth Detection for 3D Video Conferencing

Template-based Eye and Mouth Detection for 3D Video Conferencing Template-based Eye and Mouth Detection for 3D Video Conferencing Jürgen Rurainsky and Peter Eisert Fraunhofer Institute for Telecommunications - Heinrich-Hertz-Institute, Image Processing Department, Einsteinufer

More information

Galaxy Morphological Classification

Galaxy Morphological Classification Galaxy Morphological Classification Jordan Duprey and James Kolano Abstract To solve the issue of galaxy morphological classification according to a classification scheme modelled off of the Hubble Sequence,

More information

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S.

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S. AUTOMATION OF ENERGY DEMAND FORECASTING by Sanzad Siddique, B.S. A Thesis submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment of the Requirements for the Degree

More information

Factoring Patterns in the Gaussian Plane

Factoring Patterns in the Gaussian Plane Factoring Patterns in the Gaussian Plane Steve Phelps Introduction This paper describes discoveries made at the Park City Mathematics Institute, 00, as well as some proofs. Before the summer I understood

More information

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototype-based clustering Density-based clustering Graph-based

More information

How To Filter Spam Image From A Picture By Color Or Color

How To Filter Spam Image From A Picture By Color Or Color Image Content-Based Email Spam Image Filtering Jianyi Wang and Kazuki Katagishi Abstract With the population of Internet around the world, email has become one of the main methods of communication among

More information

Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets

Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets Methodology for Emulating Self Organizing Maps for Visualization of Large Datasets Macario O. Cordel II and Arnulfo P. Azcarraga College of Computer Studies *Corresponding Author: macario.cordel@dlsu.edu.ph

More information

Java Modules for Time Series Analysis

Java Modules for Time Series Analysis Java Modules for Time Series Analysis Agenda Clustering Non-normal distributions Multifactor modeling Implied ratings Time series prediction 1. Clustering + Cluster 1 Synthetic Clustering + Time series

More information

Early defect identification of semiconductor processes using machine learning

Early defect identification of semiconductor processes using machine learning STANFORD UNIVERISTY MACHINE LEARNING CS229 Early defect identification of semiconductor processes using machine learning Friday, December 16, 2011 Authors: Saul ROSA Anton VLADIMIROV Professor: Dr. Andrew

More information

Predict Influencers in the Social Network

Predict Influencers in the Social Network Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

More information

How To Fix Out Of Focus And Blur Images With A Dynamic Template Matching Algorithm

How To Fix Out Of Focus And Blur Images With A Dynamic Template Matching Algorithm IJSTE - International Journal of Science Technology & Engineering Volume 1 Issue 10 April 2015 ISSN (online): 2349-784X Image Estimation Algorithm for Out of Focus and Blur Images to Retrieve the Barcode

More information

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining Extend Table Lens for High-Dimensional Data Visualization and Classification Mining CPSC 533c, Information Visualization Course Project, Term 2 2003 Fengdong Du fdu@cs.ubc.ca University of British Columbia

More information

Cluster Analysis: Advanced Concepts

Cluster Analysis: Advanced Concepts Cluster Analysis: Advanced Concepts and dalgorithms Dr. Hui Xiong Rutgers University Introduction to Data Mining 08/06/2006 1 Introduction to Data Mining 08/06/2006 1 Outline Prototype-based Fuzzy c-means

More information

Author Gender Identification of English Novels

Author Gender Identification of English Novels Author Gender Identification of English Novels Joseph Baena and Catherine Chen December 13, 2013 1 Introduction Machine learning algorithms have long been used in studies of authorship, particularly in

More information

Signature Segmentation from Machine Printed Documents using Conditional Random Field

Signature Segmentation from Machine Printed Documents using Conditional Random Field 2011 International Conference on Document Analysis and Recognition Signature Segmentation from Machine Printed Documents using Conditional Random Field Ranju Mandal Computer Vision and Pattern Recognition

More information

Denial of Service Attack Detection Using Multivariate Correlation Information and Support Vector Machine Classification

Denial of Service Attack Detection Using Multivariate Correlation Information and Support Vector Machine Classification International Journal of Computer Sciences and Engineering Open Access Research Paper Volume-4, Issue-3 E-ISSN: 2347-2693 Denial of Service Attack Detection Using Multivariate Correlation Information and

More information

A fast multi-class SVM learning method for huge databases

A fast multi-class SVM learning method for huge databases www.ijcsi.org 544 A fast multi-class SVM learning method for huge databases Djeffal Abdelhamid 1, Babahenini Mohamed Chaouki 2 and Taleb-Ahmed Abdelmalik 3 1,2 Computer science department, LESIA Laboratory,

More information

Visualization by Linear Projections as Information Retrieval

Visualization by Linear Projections as Information Retrieval Visualization by Linear Projections as Information Retrieval Jaakko Peltonen Helsinki University of Technology, Department of Information and Computer Science, P. O. Box 5400, FI-0015 TKK, Finland jaakko.peltonen@tkk.fi

More information

Visual Structure Analysis of Flow Charts in Patent Images

Visual Structure Analysis of Flow Charts in Patent Images Visual Structure Analysis of Flow Charts in Patent Images Roland Mörzinger, René Schuster, András Horti, and Georg Thallinger JOANNEUM RESEARCH Forschungsgesellschaft mbh DIGITAL - Institute for Information

More information

Comparing Support Vector Machines, Recurrent Networks and Finite State Transducers for Classifying Spoken Utterances

Comparing Support Vector Machines, Recurrent Networks and Finite State Transducers for Classifying Spoken Utterances Comparing Support Vector Machines, Recurrent Networks and Finite State Transducers for Classifying Spoken Utterances Sheila Garfield and Stefan Wermter University of Sunderland, School of Computing and

More information

Tracking and Recognition in Sports Videos

Tracking and Recognition in Sports Videos Tracking and Recognition in Sports Videos Mustafa Teke a, Masoud Sattari b a Graduate School of Informatics, Middle East Technical University, Ankara, Turkey mustafa.teke@gmail.com b Department of Computer

More information

Towards better accuracy for Spam predictions

Towards better accuracy for Spam predictions Towards better accuracy for Spam predictions Chengyan Zhao Department of Computer Science University of Toronto Toronto, Ontario, Canada M5S 2E4 czhao@cs.toronto.edu Abstract Spam identification is crucial

More information

Introduction to Pattern Recognition

Introduction to Pattern Recognition Introduction to Pattern Recognition Selim Aksoy Department of Computer Engineering Bilkent University saksoy@cs.bilkent.edu.tr CS 551, Spring 2009 CS 551, Spring 2009 c 2009, Selim Aksoy (Bilkent University)

More information

D-optimal plans in observational studies

D-optimal plans in observational studies D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational

More information

Online Farsi Handwritten Character Recognition Using Hidden Markov Model

Online Farsi Handwritten Character Recognition Using Hidden Markov Model Online Farsi Handwritten Character Recognition Using Hidden Markov Model Vahid Ghods*, Mohammad Karim Sohrabi Department of Electrical and Computer Engineering, Semnan Branch, Islamic Azad University,

More information

Visualization of Large Font Databases

Visualization of Large Font Databases Visualization of Large Font Databases Martin Solli and Reiner Lenz Linköping University, Sweden ITN, Campus Norrköping, Linköping University, 60174 Norrköping, Sweden Martin.Solli@itn.liu.se, Reiner.Lenz@itn.liu.se

More information

Linear Classification. Volker Tresp Summer 2015

Linear Classification. Volker Tresp Summer 2015 Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong

More information

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler Data Mining: Exploring Data Lecture Notes for Chapter 3 Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler Topics Exploratory Data Analysis Summary Statistics Visualization What is data exploration?

More information

Automatic Calibration of an In-vehicle Gaze Tracking System Using Driver s Typical Gaze Behavior

Automatic Calibration of an In-vehicle Gaze Tracking System Using Driver s Typical Gaze Behavior Automatic Calibration of an In-vehicle Gaze Tracking System Using Driver s Typical Gaze Behavior Kenji Yamashiro, Daisuke Deguchi, Tomokazu Takahashi,2, Ichiro Ide, Hiroshi Murase, Kazunori Higuchi 3,

More information

Linear Threshold Units

Linear Threshold Units Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

More information

Pattern Analysis. Logistic Regression. 12. Mai 2009. Joachim Hornegger. Chair of Pattern Recognition Erlangen University

Pattern Analysis. Logistic Regression. 12. Mai 2009. Joachim Hornegger. Chair of Pattern Recognition Erlangen University Pattern Analysis Logistic Regression 12. Mai 2009 Joachim Hornegger Chair of Pattern Recognition Erlangen University Pattern Analysis 2 / 43 1 Logistic Regression Posteriors and the Logistic Function Decision

More information

Machine Learning and Pattern Recognition Logistic Regression

Machine Learning and Pattern Recognition Logistic Regression Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,

More information

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING AAS 07-228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations

More information