Text versus non-text Distinction in Online Handwritten Documents
Emanuel Indermühle, Horst Bunke
Institute of Computer Science and Applied Mathematics, University of Bern, CH-3012 Bern, Switzerland

Faisal Shafait, Thomas Breuel
Image Understanding and Pattern Recognition Research, German Research Center for AI (DFKI), D Kaiserslautern, Germany

ABSTRACT
The aim of this paper is to explore how well the task of text vs. non-text distinction can be solved in online handwritten documents using only offline information. Two systems are introduced. The first system generates a document segmentation first. For this purpose, four methods originally developed for machine printed documents are compared: x-y cut, morphological closing, Voronoi segmentation, and whitespace analysis. A state-of-the-art classifier then distinguishes between text and non-text zones. The second system follows a bottom-up approach that classifies connected components. Experiments are performed on a new dataset of online handwritten documents containing different content types in arbitrary arrangements. The best system assigns 94.3% of the pixels to the correct class.

Categories and Subject Descriptors
I.7.5 [Document and Text Processing]: Document Capture - Document analysis

General Terms
Algorithms

Keywords
Document Zone Classification, Document Segmentation, Online Handwritten Documents

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SAC'10, March 22-26, 2010, Sierre, Switzerland. Copyright 2010 ACM.

1. INTRODUCTION
The distinction of different content types like text, drawings, formulas, or tables is one of the key tasks in the analysis of documents. It allows one to assign content of one particular type to the corresponding, specialized recognition system, which is indispensable for fully understanding each type of information contained in a complex document [12]. In the domain of online handwritten documents this topic attracts increasing attention, as documents produced on a Tablet PC or with Anoto technology typically include diverse contents, such as text, graphics, formulas, and tables. Since the most frequent content type is text, a good first step is to distinguish between text and non-text ink in a document.

Two approaches can be applied to distinguish text from non-text in documents. Under the top-down strategy, the entire document is first segmented into meaningful document zones. Then the zones are classified as text or non-text [9, 20]. The bottom-up strategy, on the other hand, performs classification on small, naturally given parts of a document, e.g., pixels, connected components, or individual strokes in online documents [2, 8, 18]. A clustering algorithm may follow to group small entities into larger, meaningful segments.

In the literature, both approaches have been applied to machine printed documents. Top-down methods prevail if the document structure can be analyzed rather easily [9] (as in scientific papers or newspapers, for example). Pixel classification is preferred where the structure is difficult to recognize [2] (as in magazines, where text and images may be mixed rather irregularly). In the field of online handwritten document analysis, the distinction of text and non-text is accomplished with a bottom-up approach in [8, 13], where single strokes, as the smallest entities, are classified. In [17] the top-down strategy is applied.

In our work the target is to distinguish text from non-text content in online handwritten documents.
However, in the current paper we report only on the use of methods that rely exclusively on offline information. We intend to expand these methods in the near future by additionally using online information. The first system presented in this paper implements the top-down approach with document segmentation methods originally developed for machine printed documents [15]. The segmentation methods are applied to the images of the documents, i.e., to the offline version of the given online documents. Then, a support vector machine (SVM) decides whether a resulting document zone consists of text or non-text ink. This system is compared with a system implementing the bottom-up strategy, where connected components, as the smallest entities, are classified with the same classifier.

The rest of the paper is organized as follows. In Section 2 the dataset on which the experiments have been performed is introduced. Section 3 gives an overview of the two systems and describes the segmentation methods used in the top-down approach.
Section 4 introduces the classifier and the features extracted from the document zones. In Section 5, the experimental setup and the results are shown. Finally, conclusions are drawn in Section 6.

Figure 1: Two sample documents from the dataset. Text ink is black, non-text ink is gray.

Figure 2: Hierarchical annotation of the online handwritten documents in the dataset.

2. DATASET
The methods proposed in this paper are trained and tested on a set of online handwritten documents which have recently been collected at the University of Bern and will become available to the public in the near future. The dataset consists of 1,000 documents produced by 200 writers. The documents contain text in textblocks, lists, tables, and diagrams, as well as non-text in drawings and diagrams. About 72% of all strokes belong to text. Examples of these documents can be seen in Figure 1.

In the database generation process, the writers compiled each individual document by randomly copying parts from the following sources:
- 200 diagrams (circuits, URL diagrams, charts, flow charts, chemical formulas, etc.) obtained from Wikimedia Commons (footnote 1). In contrast to drawings, diagrams contain text labels.
- 200 mathematical formulas obtained from Wikipedia (footnote 2)
- drawings obtained from Wikimedia Commons (footnote 1)
- random sentences from the Brown corpus [6]
- tables containing numbers and random nouns from the Brown corpus [6]
- lists of random nouns from the Brown corpus [6]

No further constraints were imposed on the writers. Therefore, some of the documents are quite challenging regarding the proper discrimination of text and non-text.

The data format used to store the documents is InkML (footnote 3). The digital ink, which is the main part of the document, is stored as groups of successive vectors consisting of X and Y coordinates, time, and pressure value. In addition, the annotation of the ink is stored in the same files. It is structured hierarchically, as illustrated in Figure 2.
Footnotes:
1 Wikimedia Commons
2 Wikipedia
3 InkML is a format proposed by a working group of the W3C. Currently a draft is published at InkML/

The aim of this paper is to explore how well the task of text vs. non-text distinction can be solved using only offline information. Therefore the digital ink is transformed to an offline image format. Binary images are the input to the systems. For all images, the size is chosen to be 1000 pixels for the longer side. This results in a resolution of about 84 dots per inch (dpi).

3. SYSTEMS
Two systems are proposed and compared to each other in this paper. The first one, referred to as segment classification, follows the top-down approach, while the second one, connected component classification, falls into the category of the bottom-up strategies.

3.1 Segment Classification
In the first system, the input images are segmented into document zones. The zones are then classified into zones containing text and zones containing non-text by an SVM, which has been trained on the segments of the segmentation ground truth images. Four different segmentation methods are compared. The selection was inspired by [15], where these methods are applied to machine printed documents. Two of the methods proposed in [15] are ignored, since they are based on constraints that are not met by the documents considered in the current paper. One of the methods was simplified for the same reason. In this work, source code of the OCRopus [4] project was adopted for the implementation of the segmentation methods.

3.1.1 X-Y Cut
The recursive x-y cut (RXYC) algorithm by Nagy et al. [11] is a tree-based top-down algorithm. Starting with the entire document as the root, the image is split into two or more parts, which become the children. This is repeated recursively with every node until the image cannot be split any further. The leaf nodes represent the final segmentation. To split an image, the horizontal and vertical projection histograms $v_y$ and $v_x$ are calculated.
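The projection histograms that drive the x-y cut can be sketched as follows. This is a minimal illustration, not the OCRopus implementation: the function name `projection_histograms` and the image representation (a list of rows with 1 for ink and 0 for background) are assumptions made here, as is the convention that $v_x$ is indexed by column and $v_y$ by row.

```python
from typing import List, Tuple

Image = List[List[int]]  # assumed encoding: 1 = ink (black), 0 = background


def projection_histograms(img: Image) -> Tuple[List[int], List[int]]:
    """Return (v_x, v_y): ink pixel counts per column and per row.

    Hypothetical helper for illustration; v_x is indexed by the x
    coordinate (columns), v_y by the y coordinate (rows).
    """
    height = len(img)
    width = len(img[0]) if height else 0
    # Count ink pixels in every column.
    v_x = [sum(img[r][c] for r in range(height)) for c in range(width)]
    # Count ink pixels in every row.
    v_y = [sum(row) for row in img]
    return v_x, v_y


if __name__ == "__main__":
    page = [
        [1, 0, 0, 1],
        [1, 0, 0, 0],
        [0, 0, 0, 0],
    ]
    v_x, v_y = projection_histograms(page)
    print(v_x, v_y)
```

Runs of small values in either histogram are the candidate positions for a cut.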
The valleys in the histograms are defined as the values smaller than the noise removal thresholds $t^n_x$ and $t^n_y$, which are linearly scaled to the corresponding dimension of the image. The values in these valleys are set to zero. If a continuous valley in $v_x$ or $v_y$ is wider than the threshold $t_x$ or $t_y$, respectively, the image is split vertically or horizontally at the center of this valley. The procedure stops when no image at a leaf position in the tree can be split any further under the given thresholds. The parameters to validate are $t^n_x$ and $t^n_y$, which lead to more segments if they are larger, and $t_x$ and $t_y$, which lead to more segments if they are smaller.

3.1.2 Morphological Smearing
The morphological closing operation is applied to the binary images. This operation consists of a dilation followed by an erosion, as defined in [7]. In effect, the background of the new image is the union of the regions covered by a structuring element when it is translated to every position at which it does not touch any foreground pixel of the original image. The structuring element is a rectangle with width $c_x$ and height $c_y$. Every connected component of the resulting image becomes part of the final segmentation. The parameters to validate are the dimensions of the structuring element, which result in more segments if they are smaller.

3.1.3 Voronoi Diagram based Algorithm
The Voronoi diagram based algorithm was introduced by Kise et al. in [10]. In its first step, the contour of the foreground is sampled with a sampling rate equal to $r_s$. Noise is then removed using a noise removal threshold $t_n$. For each sample point a Voronoi region is generated, resulting in a Voronoi diagram covering the entire document. Then, adjacent Voronoi regions $v_1$ and $v_2$ of this diagram are merged as long as one of the following criteria is satisfied:
- their separating edge crosses foreground
- $d(v_1, v_2) / T_{d1} < 1$
- $d(v_1, v_2) / T_{d2} + a_r(v_1, v_2) / t_a < 1$

where $d(v_1, v_2)$ is the minimum distance between the foreground pixels of the Voronoi regions $v_1$ and $v_2$. The inter-character gap $T_{d1}$ is defined by the first maximum in the nearest neighbor histogram of the connected components. The inter-line gap $T_{d2}$ is derived from the second maximum in the nearest neighbor histogram by adding a margin control factor $f_m$ (we refer to [10] for more details). The area ratio $a_r$ is defined as:

$$a_r(v_1, v_2) = \frac{\max(a(g_1), a(g_2))}{\min(a(g_1), a(g_2))} \qquad (1)$$

where $g_1$ in $v_1$ and $g_2$ in $v_2$ are the connected components with the smallest distance and $a(g_i)$ is the area of $g_i$.
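The three merge criteria and the area ratio of Equation (1) can be written down directly. The sketch below is only an illustration of the decision rule for a single pair of regions. The function names and the idea of passing the distance, the two component areas, and the edge test as plain arguments are assumptions, and the computation of the Voronoi diagram itself is not shown.

```python
def area_ratio(area_1: float, area_2: float) -> float:
    """Area ratio a_r of Equation (1): larger area divided by smaller area."""
    return max(area_1, area_2) / min(area_1, area_2)


def should_merge(edge_crosses_foreground: bool,
                 distance: float,
                 area_1: float,
                 area_2: float,
                 t_d1: float,
                 t_d2: float,
                 t_a: float) -> bool:
    """Return True if two adjacent Voronoi regions satisfy a merge criterion.

    distance is d(v1, v2); area_1 and area_2 are a(g1) and a(g2);
    t_d1, t_d2 and t_a are the inter-character gap, the inter-line gap
    and the area ratio threshold. All names are hypothetical.
    """
    # Criterion 1: the separating edge crosses foreground.
    if edge_crosses_foreground:
        return True
    # Criterion 2: the regions are closer than the inter-character gap.
    if distance / t_d1 < 1:
        return True
    # Criterion 3: combined distance and area ratio test.
    return distance / t_d2 + area_ratio(area_1, area_2) / t_a < 1


if __name__ == "__main__":
    print(should_merge(False, 10, 10, 40, t_d1=5, t_d2=20, t_a=40))
```

With these illustrative thresholds, two regions 10 pixels apart whose components differ in area by a factor of 4 are merged by the third criterion, while the same pair at a distance of 18 pixels and an area factor of 10 is kept apart.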
The free parameters that have to be validated are the sampling rate $r_s$, the noise removal threshold $t_n$, the margin control factor $f_m$, which controls the inter-line gap, and the area ratio threshold $t_a$.

3.1.4 Whitespace Analysis
This algorithm, which was introduced by Baird [1], analyzes the background of a document image. It starts by finding a set of largest white rectangles covering the entire background of the document. This is done by an algorithm proposed by Breuel [3]. Then the set of rectangles is sorted according to a weight $w_1(c)$ of the rectangle $c$. Originally $w_1(c)$ is defined as:

$$w_1(c) = \mathrm{area}(c) \cdot W\left(\log_2 \frac{\mathrm{height}(c)}{\mathrm{width}(c)}\right) \qquad (2)$$

where $W(x)$ is a weighting function which has been experimentally determined in [1]. The following approximation was proposed by Shafait [15]:

$$W(x) = \begin{cases} 0.5 & \text{if } x < 3 \\ 1.5 & \text{if } 3 \le x < 5 \\ 1 & \text{if } x \ge 5 \end{cases} \qquad (3)$$

However, as the whitespace in handwritten documents is not as evenly distributed as in printed documents, this weighting function may not be suited here. Therefore, three other weights to sort the rectangles are evaluated:

$$w_2(c) = \mathrm{area}(c) \qquad (4)$$
$$w_3(c) = \mathrm{width}(c)^2 \qquad (5)$$
$$w_4(c) = \mathrm{height}(c)^2 \qquad (6)$$

In descending order, using any of the weighting functions, the white rectangles are sequentially plotted onto the output image, which is initialized with black pixels. This process is terminated as soon as the following stopping rule is satisfied:

$$w_i(c_j) - f_w \frac{j}{m} \le t_s \qquad (7)$$

where $c_j$ is the last rectangle added to the mask, $f_w$ is a factor to weight the influence of the number of segments added, $m$ is the total number of rectangles, and $t_s$ is a stopping threshold. Finally the segmentation is defined by the black regions left in the output image. The free parameters to validate are $f_w$, $t_s$, and the weights $w_1, \ldots, w_4$.

3.2 Connected Component Classification
The connected component classification system is characterized by performing a connected component analysis in order to segment a given input document.
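A connected component analysis of this kind, using the 8-point neighborhood that the second system relies on, can be sketched with a breadth-first flood fill. This is an illustrative stand-in rather than the implementation used in the paper. The function name `label_components` and the image encoding (1 for black ink, 0 for background) are assumptions.

```python
from collections import deque
from typing import List


def label_components(img: List[List[int]]) -> List[List[int]]:
    """Label 8-connected regions of black pixels.

    img is a list of rows with 1 = black (ink) and 0 = background
    (assumed encoding). Returns a label image of the same size in which
    background is 0 and the components are numbered 1, 2, ... in
    row-major order of their first pixel.
    """
    height = len(img)
    width = len(img[0]) if height else 0
    labels = [[0] * width for _ in range(height)]
    next_label = 0
    for r0 in range(height):
        for c0 in range(width):
            if img[r0][c0] != 1 or labels[r0][c0] != 0:
                continue  # background, or already part of a component
            next_label += 1
            labels[r0][c0] = next_label
            queue = deque([(r0, c0)])
            while queue:
                r, c = queue.popleft()
                # Visit all 8 neighbors (the center pixel is already labeled).
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if (0 <= rr < height and 0 <= cc < width
                                and img[rr][cc] == 1
                                and labels[rr][cc] == 0):
                            labels[rr][cc] = next_label
                            queue.append((rr, cc))
    return labels


if __name__ == "__main__":
    ink = [
        [1, 0, 0],
        [0, 1, 0],
        [0, 0, 0],
        [0, 0, 1],
    ]
    for row in label_components(ink):
        print(row)
```

Diagonally touching pixels end up in the same component, which is the difference between the 8-point and the 4-point neighborhood.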
The connected component analysis is performed in an 8-point neighborhood and results in regions of connected black pixels. For classification, an SVM is used again. It is trained on the labelled connected components of the training set. Note that the connected component analysis has no free parameters.

4. CLASSIFIER
4.1 Support Vector Machine
One of the most popular classification methods is the support vector machine (SVM) [14, 16, 19]. The key idea is to find a hyperplane that separates the data into two classes with a maximal margin. Such a hyperplane can only be found if the data is linearly separable. If the data is not linearly separable, a weaker definition of the margin, the soft margin, can be used. In this case, the optimal hyperplane is the one that minimizes the sum of errors and maximizes the margin at the same time. This optimization problem is usually solved by quadratic programming. In order to improve the classification of non-linearly separable data, an explicit mapping to a higher-dimensional feature space can be performed, or instead, a kernel function can be applied. In our experimental evaluation we make use of an SVM with the RBF kernel $\kappa(x, x') = \exp(-\gamma \|x - x'\|^2)$, where $\gamma > 0$. Hence, besides the weighting parameter $C$, which controls whether the maximization of the margin or the minimization of the error is more important, the metaparameter $\gamma$ has to be tuned. The experiments were performed using the libsvm [5] implementation.

4.2 Feature Extraction
Different feature sets to classify document zones in printed documents have been investigated by Keysers et al. in [9]. A well performing set of run-length histograms and connected component statistics has been proposed, which can be extracted very fast. The features used in the current paper for the classification of the document zones in the first system and of the connected components in the second system are, in fact, similar to those of [9].
The feature set consists of run-length histograms of black and white pixels along the horizontal, vertical, and the two diagonal directions. Each histogram uses eight bins, counting runs of length $\le 1$, $\le 3$, $\le 7$, $\le 15$, $\le 31$, $\le 63$, $\le 127$, and $\ge 128$. Additionally, histograms with the same bins as mentioned before are created from the width and height distributions of the connected components within the document zone. Moreover, a two-dimensional 64-bin histogram of the joint distribution of widths and heights, as well as a histogram of the nearest neighbor distances of the connected components, are calculated. All in all, 152 numerical features are extracted.

The part of the feature set created from connected component statistics is meaningless when extracted from a single connected component, as would be the case in the second system. Therefore these features are omitted in that system.

Table 1: Parameter validation of the four segmentation methods. If four values are present in the Best column, then they vary for the four cross validation runs.

Segmenter    Parameter   Range                        Best
X-Y Cut      $t^n_x$     10 · {0,..., 6}              10, 10, 10, 0
             $t^n_y$     10 · {0,..., 3}              10
             $t^c_x$     2 · {0,..., 15}              2
             $t^c_y$     2 · {0,..., 15}              2
Smearing     $c_x$       5 · {0,..., 20}              10, 0, 0, 5
             $c_y$       5 · {0,..., 20}              10, 5, 5, 5
Voronoi      $r_s$       {0,..., 8}                   1
             $t_n$       5 · {0,..., 7}               5, 5, 35, 10
             $f_m$
             $t_a$       {0,..., 10}                  1
Whitespace   $t_s$       {0,..., 40}                  1
             $f_w$       {0,..., 10}                  1
             $w$         Baird, area, width, height   width

5. EXPERIMENTS
5.1 Setup
For all experiments, four-fold cross validation is performed in order to reduce the risk of a biased dataset division. Four disjoint, equally sized subsets $ss_0, \ldots, ss_3$ of the dataset are created. In run $i$ ($i = 0, \ldots, 3$), the training set is defined as $ss_{(0+i)} \cup ss_{(1+i) \bmod 4}$, the validation set as $ss_{(2+i) \bmod 4}$, and the test set as $ss_{(3+i) \bmod 4}$.

For the first system, the SVM is trained on the segmentation ground truth of the training set. In a grid search on the validation set, the best parameter combination of $C$ and $\gamma$ in the range $\{2^{2j} \mid j = -12, \ldots, 12\}$ is identified. The SVM with this parameter set is then applied on the test set.
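The rotation of the four subsets over the cross validation runs can be made concrete with a few lines of code. The helper below is a sketch under the indexing given in the setup. The function name `fold_roles` and its return format are assumptions introduced for illustration.

```python
from typing import List, Tuple


def fold_roles(run: int, n_subsets: int = 4) -> Tuple[List[int], int, int]:
    """Return (training subsets, validation subset, test subset) for a run.

    Subsets are identified by their index 0..n_subsets-1. In run i the
    training set is ss_(0+i) and ss_(1+i mod 4), the validation set is
    ss_(2+i mod 4), and the test set is ss_(3+i mod 4).
    """
    training = [(0 + run) % n_subsets, (1 + run) % n_subsets]
    validation = (2 + run) % n_subsets
    test = (3 + run) % n_subsets
    return training, validation, test


if __name__ == "__main__":
    for i in range(4):
        training, validation, test = fold_roles(i)
        print(f"run {i}: train={training} validation={validation} test={test}")
```

Over the four runs every subset serves exactly once as the validation set and exactly once as the test set, so no document is used for testing in a run in which it was seen during training or parameter validation.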
The classification rates of the four segmentation methods are maximized over all parameter combinations on the validation set as well. The parameter combinations that lead to the best performance on the validation set (and are applied on the independent test set) are shown in Table 1.

In the second system the SVM classifier is trained on the connected components of the training set. The parameters $C$ and $\gamma$ are validated in the same range as for the first system. Using the optimal parameter values, the connected components of the test set are classified to get the final result.

In both systems the recognition rate is calculated as the fraction of pixels within a document that have been correctly assigned to the text or the non-text class. The mean value is computed over all documents evaluated.

5.2 Results
Figure 3 shows the recognition rates that were achieved when classifying the document zones returned by the four segmentation methods. With a recognition rate above 0.92, the Voronoi segmentation and morphological smearing methods are significantly better than the X-Y cut and the whitespace analysis. (This is statistically backed using a dependent t-test for paired samples with a significance level of $\alpha = 0.05$.) The reason for the difference might be that the latter two methods can correctly segment only documents in Manhattan layout, which is not given for the dataset used here.

Figure 3: Recognition rates achieved with the five different methods.

The parameter sets of the different segmentation methods that lead to the best recognition rate also lead to an over-segmentation. It seems easier to correctly classify small segments than to generate a segmentation with large document zones containing only one content type. This finding is in line with the results of the connected component classification system, where small segments are produced by default. With a recognition rate of 0.943 it performs significantly better than all other methods.
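The recognition rate behind these numbers, as defined in the setup, is a per-document pixel accuracy averaged over documents. A minimal sketch follows. It assumes that each document is given as two equally long sequences of per-pixel class labels (ground truth and prediction) and that only ink pixels are listed; both the representation and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def pixel_recognition_rate(truth: Sequence[str],
                           predicted: Sequence[str]) -> float:
    """Fraction of pixels of one document assigned to the correct class."""
    if len(truth) != len(predicted):
        raise ValueError("label sequences must have the same length")
    if not truth:
        raise ValueError("document contains no labeled pixels")
    correct = sum(1 for t, p in zip(truth, predicted) if t == p)
    return correct / len(truth)


def mean_recognition_rate(
        documents: List[Tuple[Sequence[str], Sequence[str]]]) -> float:
    """Mean of the per-document recognition rates."""
    rates = [pixel_recognition_rate(t, p) for t, p in documents]
    return sum(rates) / len(rates)


if __name__ == "__main__":
    doc_a = (["text", "text", "non-text", "non-text"],
             ["text", "non-text", "non-text", "non-text"])
    doc_b = (["text", "text"], ["text", "text"])
    print(mean_recognition_rate([doc_a, doc_b]))
```

Because the mean is taken over documents rather than over all pixels, a small document contributes as much to the final figure as a densely written one.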
An analysis of the results on the different content types (see Figure 4) explains the advantage of the connected component classification system. On tables and diagrams, its recognition rate is higher than that of the other methods. It seems that the lines occurring in the tables lead to a wrong classification. The connected component classification can classify each text part in a table individually, while the segmentation methods cannot split the tables apart. The problem with diagrams might be the text labels. If they belong to the same segment as the rest of the diagram, they will be misclassified.

6. CONCLUSION
The distinction of text and non-text is an important step in transcribing a scanned or online recorded document into an electronic office document (e.g. ODF). In this paper, we compare two systems to solve this problem using only methods originally developed for offline handwritten and machine printed documents. The first system implements the top-down approach. It segments the documents into zones which are then classified by an SVM. Four different segmentation methods are compared to each other. In the second system, a bottom-up strategy is implemented where connected components are classified by an SVM. The dataset on which the experiments have been performed is a collection of online handwritten documents containing text, drawings, diagrams, tables, lists, and formulas.

In the result section we demonstrate that the distinction of text and non-text content in handwritten documents is possible. Both approaches reach good recognition rates. However, with 94.3% of all pixels correctly assigned, the bottom-up strategy outperforms the top-down approach. The use of connected components as document zones reduces the risk of mixing text and non-text content, which
inevitably results in misclassified pixels.

In the near future we intend to extend the proposed methods by using online information. In addition, other bottom-up approaches will be applied which use strokes or individual pixels as their smallest entities. The dataset introduced in this paper opens possibilities for further investigations in the field of document analysis. The detection of different content types or the reconstruction of tables and lists might be other interesting subjects for future work.

Figure 4: Recognition rates of the distinction between text and non-text considering only pixels of text blocks, drawings, diagrams, or tables. (Vertical axis: recognition rate. Content types: Text, Drawing, Diagram, Table. Methods: Conn. Comp., Whitespace, X-Y Cut, Morph. Op., Voronoi.)

7. ACKNOWLEDGMENTS
This work has been supported by the Swiss National Center of Competence in Research (NCCR) on Interactive Multimodal Information Management (IM2). We thank the developers of OCRopus for making their software available to the public.

8. REFERENCES
[1] H. S. Baird. Background structure in document images. In H. Bunke, P. Wang, and H. S. Baird, editors, Document Image Analysis. World Scientific, Singapore.
[2] H. S. Baird, M. A. Moll, and C. An. Document image content inventories. In Proc. 14th Conf. on Document Recognition and Retrieval, page 296.
[3] T. M. Breuel. Two geometric algorithms for layout analysis. In Proceedings of the Workshop on Document Analysis Systems, Princeton, NJ, USA.
[4] T. M. Breuel. The OCRopus open source OCR system. In Proceedings IS&T/SPIE 20th Annual Symposium 2008.
[5] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines.
[6] W. Francis and H. Kucera. Manual of information to accompany a standard sample of present-day edited American English for use with digital computers. Department of Linguistics, Brown University.
[7] R. M. Haralick, S. R. Sternberg, and X. Zhuang. Image analysis using mathematical morphology. IEEE Trans. on Pattern Analysis and Machine Intelligence, 9(4).
[8] A. K. Jain, A. M. Namboodiri, and J. Subrahmonia. Structure in on-line documents. In Proc. 6th Int. Conf. on Document Analysis and Recognition.
[9] D. Keysers, F. Shafait, and T. M. Breuel. Document image zone classification: a simple high-performance approach. In Proc. 2nd Int. Conf. on Computer Vision Theory and Applications, pages 44-51.
[10] K. Kise, A. Sato, and M. Iwata. Segmentation of page images using the area Voronoi diagram. Computer Vision and Image Understanding, 70(3), June.
[11] G. Nagy, S. Seth, and M. Viswanathan. A prototype document image analysis system for technical journals. Computer, 25(7):10-22.
[12] O. Okun, D. Doermann, and M. Pietikäinen. Page segmentation and zone classification: the state of the art. Technical report, University of Maryland, Language and Media Processing Laboratory.
[13] S. Rossignol, D. Willems, A. Neumann, and L. Vuurpijl. Mode detection and incremental recognition. In Proc. 9th Int. Workshop on Frontiers in Handwriting Recognition.
[14] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press.
[15] F. Shafait, D. Keysers, and T. M. Breuel. Performance evaluation and benchmarking of six page segmentation algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(6), June.
[16] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press.
[17] M. Shilman, P. Liang, and P. Viola. Learning non-generative grammatical models for document analysis. In Proc. of 10th Int. Conf. on Computer Vision, volume 2, Los Alamitos, CA, USA. IEEE Computer Society.
[18] C. Strouthopoulos and N. Papamarkos. Text identification for document image analysis using a neural network. Image and Vision Computing, 16.
[19] V. Vapnik. Statistical Learning Theory. John Wiley.
[20] K. Y. Wong, R. G. Casey, and F. M. Wahl. Document analysis system. IBM Journal of Research and Development, 26(6), 1982.
More informationSignature verification using Kolmogorov-Smirnov. statistic
Signature verification using Kolmogorov-Smirnov statistic Harish Srinivasan, Sargur N.Srihari and Matthew J Beal University at Buffalo, the State University of New York, Buffalo USA {srihari,hs32}@cedar.buffalo.edu,mbeal@cse.buffalo.edu
More informationTOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM
TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam
More informationStatistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees
Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.
More informationVideo OCR for Sport Video Annotation and Retrieval
Video OCR for Sport Video Annotation and Retrieval Datong Chen, Kim Shearer and Hervé Bourlard, Fellow, IEEE Dalle Molle Institute for Perceptual Artificial Intelligence Rue du Simplon 4 CH-190 Martigny
More informationDocument Image Retrieval using Signatures as Queries
Document Image Retrieval using Signatures as Queries Sargur N. Srihari, Shravya Shetty, Siyuan Chen, Harish Srinivasan, Chen Huang CEDAR, University at Buffalo(SUNY) Amherst, New York 14228 Gady Agam and
More informationVisual Structure Analysis of Flow Charts in Patent Images
Visual Structure Analysis of Flow Charts in Patent Images Roland Mörzinger, René Schuster, András Horti, and Georg Thallinger JOANNEUM RESEARCH Forschungsgesellschaft mbh DIGITAL - Institute for Information
More informationExtend Table Lens for High-Dimensional Data Visualization and Classification Mining
Extend Table Lens for High-Dimensional Data Visualization and Classification Mining CPSC 533c, Information Visualization Course Project, Term 2 2003 Fengdong Du fdu@cs.ubc.ca University of British Columbia
More informationLarge Margin DAGs for Multiclass Classification
S.A. Solla, T.K. Leen and K.-R. Müller (eds.), 57 55, MIT Press (000) Large Margin DAGs for Multiclass Classification John C. Platt Microsoft Research Microsoft Way Redmond, WA 9805 jplatt@microsoft.com
More informationFace detection is a process of localizing and extracting the face region from the
Chapter 4 FACE NORMALIZATION 4.1 INTRODUCTION Face detection is a process of localizing and extracting the face region from the background. The detected face varies in rotation, brightness, size, etc.
More informationData Mining for Knowledge Management. Classification
1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh
More informationNAVIGATING SCIENTIFIC LITERATURE A HOLISTIC PERSPECTIVE. Venu Govindaraju
NAVIGATING SCIENTIFIC LITERATURE A HOLISTIC PERSPECTIVE Venu Govindaraju BIOMETRICS DOCUMENT ANALYSIS PATTERN RECOGNITION 8/24/2015 ICDAR- 2015 2 Towards a Globally Optimal Approach for Learning Deep Unsupervised
More informationImplementation of OCR Based on Template Matching and Integrating it in Android Application
International Journal of Computer Sciences and EngineeringOpen Access Technical Paper Volume-04, Issue-02 E-ISSN: 2347-2693 Implementation of OCR Based on Template Matching and Integrating it in Android
More informationMultiple Kernel Learning on the Limit Order Book
JMLR: Workshop and Conference Proceedings 11 (2010) 167 174 Workshop on Applications of Pattern Analysis Multiple Kernel Learning on the Limit Order Book Tristan Fletcher Zakria Hussain John Shawe-Taylor
More informationClass #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris
Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines
More informationClustering & Visualization
Chapter 5 Clustering & Visualization Clustering in high-dimensional databases is an important problem and there are a number of different clustering paradigms which are applicable to high-dimensional data.
More informationEnsembles and PMML in KNIME
Ensembles and PMML in KNIME Alexander Fillbrunn 1, Iris Adä 1, Thomas R. Gabriel 2 and Michael R. Berthold 1,2 1 Department of Computer and Information Science Universität Konstanz Konstanz, Germany First.Last@Uni-Konstanz.De
More informationHow To Cluster
Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main
More informationMorphological segmentation of histology cell images
Morphological segmentation of histology cell images A.Nedzved, S.Ablameyko, I.Pitas Institute of Engineering Cybernetics of the National Academy of Sciences Surganova, 6, 00 Minsk, Belarus E-mail abl@newman.bas-net.by
More informationClassification of Bad Accounts in Credit Card Industry
Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition
More informationPredict Influencers in the Social Network
Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons
More information14.10.2014. Overview. Swarms in nature. Fish, birds, ants, termites, Introduction to swarm intelligence principles Particle Swarm Optimization (PSO)
Overview Kyrre Glette kyrrehg@ifi INF3490 Swarm Intelligence Particle Swarm Optimization Introduction to swarm intelligence principles Particle Swarm Optimization (PSO) 3 Swarms in nature Fish, birds,
More informationData Quality Mining: Employing Classifiers for Assuring consistent Datasets
Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Fabian Grüning Carl von Ossietzky Universität Oldenburg, Germany, fabian.gruening@informatik.uni-oldenburg.de Abstract: Independent
More informationImage Normalization for Illumination Compensation in Facial Images
Image Normalization for Illumination Compensation in Facial Images by Martin D. Levine, Maulin R. Gandhi, Jisnu Bhattacharyya Department of Electrical & Computer Engineering & Center for Intelligent Machines
More informationData, Measurements, Features
Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are
More informationPATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical
More informationCharacter Image Patterns as Big Data
22 International Conference on Frontiers in Handwriting Recognition Character Image Patterns as Big Data Seiichi Uchida, Ryosuke Ishida, Akira Yoshida, Wenjie Cai, Yaokai Feng Kyushu University, Fukuoka,
More informationIBM SPSS Direct Marketing 23
IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release
More informationThe Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
More informationComponent Ordering in Independent Component Analysis Based on Data Power
Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals
More informationLecture 2: The SVM classifier
Lecture 2: The SVM classifier C19 Machine Learning Hilary 2015 A. Zisserman Review of linear classifiers Linear separability Perceptron Support Vector Machine (SVM) classifier Wide margin Cost function
More informationSuper-resolution method based on edge feature for high resolution imaging
Science Journal of Circuits, Systems and Signal Processing 2014; 3(6-1): 24-29 Published online December 26, 2014 (http://www.sciencepublishinggroup.com/j/cssp) doi: 10.11648/j.cssp.s.2014030601.14 ISSN:
More informationKeywords image processing, signature verification, false acceptance rate, false rejection rate, forgeries, feature vectors, support vector machines.
International Journal of Computer Application and Engineering Technology Volume 3-Issue2, Apr 2014.Pp. 188-192 www.ijcaet.net OFFLINE SIGNATURE VERIFICATION SYSTEM -A REVIEW Pooja Department of Computer
More informationRecognition of Handwritten Digits using Structural Information
Recognition of Handwritten Digits using Structural Information Sven Behnke Martin-Luther University, Halle-Wittenberg' Institute of Computer Science 06099 Halle, Germany { behnke Irojas} @ informatik.uni-halle.de
More informationDecision Trees from large Databases: SLIQ
Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values
More informationContribution of Multiresolution Description for Archive Document Structure Recognition
Contribution of Multiresolution Description for Archive Document Structure Recognition Aurélie Lemaitre, Jean Camillerapp, Bertrand Coüasnon To cite this version: Aurélie Lemaitre, Jean Camillerapp, Bertrand
More informationAnalecta Vol. 8, No. 2 ISSN 2064-7964
EXPERIMENTAL APPLICATIONS OF ARTIFICIAL NEURAL NETWORKS IN ENGINEERING PROCESSING SYSTEM S. Dadvandipour Institute of Information Engineering, University of Miskolc, Egyetemváros, 3515, Miskolc, Hungary,
More informationGamma Distribution Fitting
Chapter 552 Gamma Distribution Fitting Introduction This module fits the gamma probability distributions to a complete or censored set of individual or grouped data values. It outputs various statistics
More informationLinear Threshold Units
Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear
More informationHigh-dimensional labeled data analysis with Gabriel graphs
High-dimensional labeled data analysis with Gabriel graphs Michaël Aupetit CEA - DAM Département Analyse Surveillance Environnement BP 12-91680 - Bruyères-Le-Châtel, France Abstract. We propose the use
More informationIntroducing diversity among the models of multi-label classification ensemble
Introducing diversity among the models of multi-label classification ensemble Lena Chekina, Lior Rokach and Bracha Shapira Ben-Gurion University of the Negev Dept. of Information Systems Engineering and
More informationHow To Fix Out Of Focus And Blur Images With A Dynamic Template Matching Algorithm
IJSTE - International Journal of Science Technology & Engineering Volume 1 Issue 10 April 2015 ISSN (online): 2349-784X Image Estimation Algorithm for Out of Focus and Blur Images to Retrieve the Barcode
More informationIntrusion Detection via Machine Learning for SCADA System Protection
Intrusion Detection via Machine Learning for SCADA System Protection S.L.P. Yasakethu Department of Computing, University of Surrey, Guildford, GU2 7XH, UK. s.l.yasakethu@surrey.ac.uk J. Jiang Department
More informationMAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS
MAXIMIZING RETURN ON DIRET MARKETING AMPAIGNS IN OMMERIAL BANKING S 229 Project: Final Report Oleksandra Onosova INTRODUTION Recent innovations in cloud computing and unified communications have made a
More informationVector storage and access; algorithms in GIS. This is lecture 6
Vector storage and access; algorithms in GIS This is lecture 6 Vector data storage and access Vectors are built from points, line and areas. (x,y) Surface: (x,y,z) Vector data access Access to vector
More informationJournal of Industrial Engineering Research. Adaptive sequence of Key Pose Detection for Human Action Recognition
IWNEST PUBLISHER Journal of Industrial Engineering Research (ISSN: 2077-4559) Journal home page: http://www.iwnest.com/aace/ Adaptive sequence of Key Pose Detection for Human Action Recognition 1 T. Sindhu
More informationResearch on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2
Advanced Engineering Forum Vols. 6-7 (2012) pp 82-87 Online: 2012-09-26 (2012) Trans Tech Publications, Switzerland doi:10.4028/www.scientific.net/aef.6-7.82 Research on Clustering Analysis of Big Data
More informationClassification algorithm in Data mining: An Overview
Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department
More informationColour Image Segmentation Technique for Screen Printing
60 R.U. Hewage and D.U.J. Sonnadara Department of Physics, University of Colombo, Sri Lanka ABSTRACT Screen-printing is an industry with a large number of applications ranging from printing mobile phone
More informationDEVNAGARI DOCUMENT SEGMENTATION USING HISTOGRAM APPROACH
DEVNAGARI DOCUMENT SEGMENTATION USING HISTOGRAM APPROACH Vikas J Dongre 1 Vijay H Mankar 2 Department of Electronics & Telecommunication, Government Polytechnic, Nagpur, India 1 dongrevj@yahoo.co.in; 2
More informationDomain Classification of Technical Terms Using the Web
Systems and Computers in Japan, Vol. 38, No. 14, 2007 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J89-D, No. 11, November 2006, pp. 2470 2482 Domain Classification of Technical Terms Using
More informationHow To Filter Spam Image From A Picture By Color Or Color
Image Content-Based Email Spam Image Filtering Jianyi Wang and Kazuki Katagishi Abstract With the population of Internet around the world, email has become one of the main methods of communication among
More informationLocal features and matching. Image classification & object localization
Overview Instance level search Local features and matching Efficient visual recognition Image classification & object localization Category recognition Image classification: assigning a class label to
More informationSupport Vector Machines for Dynamic Biometric Handwriting Classification
Support Vector Machines for Dynamic Biometric Handwriting Classification Tobias Scheidat, Marcus Leich, Mark Alexander, and Claus Vielhauer Abstract Biometric user authentication is a recent topic in the
More informationAnalysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j
Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet
More informationSVM Ensemble Model for Investment Prediction
19 SVM Ensemble Model for Investment Prediction Chandra J, Assistant Professor, Department of Computer Science, Christ University, Bangalore Siji T. Mathew, Research Scholar, Christ University, Dept of
More informationClustering. Data Mining. Abraham Otero. Data Mining. Agenda
Clustering 1/46 Agenda Introduction Distance K-nearest neighbors Hierarchical clustering Quick reference 2/46 1 Introduction It seems logical that in a new situation we should act in a similar way as in
More informationIBM SPSS Direct Marketing 22
IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release
More informationSpam detection with data mining method:
Spam detection with data mining method: Ensemble learning with multiple SVM based classifiers to optimize generalization ability of email spam classification Keywords: ensemble learning, SVM classifier,
More informationD-optimal plans in observational studies
D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational
More informationALGEBRA. sequence, term, nth term, consecutive, rule, relationship, generate, predict, continue increase, decrease finite, infinite
ALGEBRA Pupils should be taught to: Generate and describe sequences As outcomes, Year 7 pupils should, for example: Use, read and write, spelling correctly: sequence, term, nth term, consecutive, rule,
More informationLCs for Binary Classification
Linear Classifiers A linear classifier is a classifier such that classification is performed by a dot product beteen the to vectors representing the document and the category, respectively. Therefore it
More informationA Simple Feature Extraction Technique of a Pattern By Hopfield Network
A Simple Feature Extraction Technique of a Pattern By Hopfield Network A.Nag!, S. Biswas *, D. Sarkar *, P.P. Sarkar *, B. Gupta **! Academy of Technology, Hoogly - 722 *USIC, University of Kalyani, Kalyani
More informationClustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016
Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with
More informationEffective Data Mining Using Neural Networks
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 8, NO. 6, DECEMBER 1996 957 Effective Data Mining Using Neural Networks Hongjun Lu, Member, IEEE Computer Society, Rudy Setiono, and Huan Liu,
More informationLeveraging Ensemble Models in SAS Enterprise Miner
ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to
More informationAutomatic 3D Reconstruction via Object Detection and 3D Transformable Model Matching CS 269 Class Project Report
Automatic 3D Reconstruction via Object Detection and 3D Transformable Model Matching CS 69 Class Project Report Junhua Mao and Lunbo Xu University of California, Los Angeles mjhustc@ucla.edu and lunbo
More informationSegmentation and Automatic Descreening of Scanned Documents
Segmentation and Automatic Descreening of Scanned Documents Alejandro Jaimes a, Frederick Mintzer b, A. Ravishankar Rao b and Gerhard Thompson b a Columbia University b IBM T.J. Watson Research Center
More informationMedical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu
Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?
More informationMaking Sense of the Mayhem: Machine Learning and March Madness
Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research
More informationEnvironmental Remote Sensing GEOG 2021
Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class
More information