Text versus non-text Distinction in Online Handwritten Documents

Size: px
Start display at page:

Download "Text versus non-text Distinction in Online Handwritten Documents"

Transcription

1 Text versus non-text Distinction in Online Handwritten Documents Emanuel Indermühle Horst Bunke Institute of Computer Science and Applied Mathematics University of Bern, CH-3012 Bern, Switzerland Faisal Shafait Thomas Breuel Image Understanding and Pattern Recognition Research German Research Center for AI (DFKI), D Kaiserslautern, Germany ABSTRACT The aim of this paper is to explore how well the task of text vs. nontext distinction can be solved in online handwritten documents using only offline information. Two systems are introduced. The first system generates a document segmentation first. For this purpose, four methods originally developed for machine printed documents are compared: x-y cut, morphological closing, Voronoi segmentation, and whitespace analysis. A state-of-the art classifier then distinguishes between text and non-text zones. The second system follows a bottom-up approach that classifies connected components. Experiments are performed on a new dataset of online handwritten documents containing different content types in arbitrary arrangements. The best system assigns 94.3% of the pixels to the correct class. Categories and Subject Descriptors I.7.5 [Document and Text Processing]: Document Capture Document analysis General Terms Algorithms Keywords Document Zone Classification, Document Segmentation, Online Handwritten Documents 1. INTRODUCTION The distinction of different content types like text, drawings, formulas, or tables is one of the key tasks in the analysis of documents. It allows one to assign content of one particular type to the corresponding, specialized recognition systems, which is inevitable to Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SAC 10 March 22-26, 2010, Sierre, Switzerland. Copyright 2010 ACM /10/03...$ fully understand each type of information contained in a complex document [12]. In the domain of online handwritten documents this topic attracts increasing attention, as documents produced on a Tablet PC or with Anoto technology typically include diverse contents, such as text, graphics, formulas and tables. Since the most frequent content type is text, it is a good start to distinguish between text and non-text ink in a document first. Two approaches can be applied to distinguish text from non-text in documents. Starting with an entire document, under the topdown strategy the document is segmented into meaningful document zones. Then the zones can be classified into text or nontext [9, 20]. The bottom-up strategy, on the other hand, performs classification on small, naturally given parts of a document e.g. pixels, connected components, or individual strokes in online documents [2, 8, 18]. A clustering algorithm may follow to group small entities into larger, meaningful segments. In the literature, both approaches have been applied to machine printed documents. Top-down methods are prevailing if the document structure can be analyzed rather easily [9] (as in scientific papers or newspapers, for example). Pixel classification is preferred where the structure is difficult to recognize [2] (as in magazines where text and images may be mixed rather irregularly). In the field of online handwritten document analysis, the distinction of text and non-text is accomplished with a bottom-up approach in [8, 13] where single strokes, as the smallest entities, are classified. In [17] the top down strategy is applied. In our work the target is to distinguish text from non-text content in online handwritten documents. However, in the current paper we report only on the use of methods that rely exclusively on offline information. We intend to expand these methods in the near future by additionally using online information. The first system presented in this paper implements the top-down approach with document segmentation methods originally developed for machine printed documents [15]. The segmentation methods are applied to the images of the documents, i.e. to the offline version of given online documents. Then, a support vector machine (SVM) decides if a resulting document zone consist of text or nontext ink. This system is compared with a system implementing the bottom-up strategy, where connected components, as the smallest entities, are classified with the same classifier. The rest of the paper is organized as follows. In Section 2 the dataset on which the experiments have been performed is introduced. Section 3 gives an overview of the two systems and describes the segmentation methods used in the top-down approach.

2 Figure 2: Hierarchical annotation of the online handwritten documents in the dataset. Figure 1: Two sample documents from the dataset. Text ink is black, non-text ink is gray. Section 4 introduces the classifier and the features extracted from the document zones. In Section 5, the experimental setup and the results are shown. Finally, conclusions are drawn in Section DATASET The methods proposed in this paper are trained and tested on a set of online handwritten documents which have recently been collected at the University of Bern, and in the near future they will become available to the public. The dataset consists of 1,000 documents produced by 200 writers. The documents contain text in textblocks, lists, tables, and diagrams, as well as non-text in drawings and diagrams. About 72% of all strokes belongs to text. Examples of these documents can be seen in Figure 1. In the database generation process, the writers compiled each individual document by randomly copying parts from the following sources: 200 diagrams (circuits, URL diagrams, charts, flow charts, chemical formulas, etc.) obtained from Wikimedia Commons 1. In contrast to drawings, diagrams contain text labels. 200 mathematical formulas obtained from Wikipedia drawings obtained from Wikimedia Commons 1 random sentences from the Brown corpus [6] tables containing numbers and random nouns from the Brown corpus [6] list of random nouns from the Brown corpus [6] No further constraint were imposed on the writers. Therefore, some of the documents are quite challenging regarding the proper discrimination of text and non-text. The data format used to store the documents is InkML 3. The digital ink, which is the main part of the document, is stored in terms of groups of successive vectors consisting of X-, and Y coordinates, time, and pressure value. Additional to this information, the annotation of the ink is stored in the same files, which is structured hierarchically as illustrated in Figure 2. 1 Wikimedia Commons: 2 Wikipedia: 3 InkML is a format proposed by a working group of the W3C. Currently a draft is published at InkML/ The aim of this paper is to explore how well the task of text vs. non-text distinction can be solved using only offline information. Therefore the digital ink is transformed to the offline image format. Binary images are the input to the systems. For all images, the size is chosen to be 1000 pixels for the longer side. This results in a resolution of about 84 dots per inch (dpi). 3. SYSTEMS Two systems are proposed and compared to each other in this paper. The first one, referred to as segment classification, follows the top-down approach, while the second one, connected component classification, falls into the category of the bottom-up strategies. 3.1 Segment Classification In the first system, the input images are segmented into document zones. The zones are then classified into zones containing text and zones containing non-text by an SVM, which has been trained on the segments of the segmentation ground truth images. Four different segmentation methods are compared. The selection was inspired by [15] where these methods are applied to machine printed documents. Two of the methods proposed in [15] are ignored, since they are based on constraints that are not met by the documents considered in the current paper. One of the methods was simplified for the same reason. In this work, source code of the OCRopus [4] project was adopted for the implementation of the segmentation methods X-Y Cut The recursive x-y cut (RXYC) algorithm by Nagy et al. [11] is a tree-based top-down algorithm. Starting with the entire document as the root, the image is split into two or more parts, which become the children. This is repeated recursively with every node until the image can not be split any further. The leaf nodes represent the final segmentation. To split an image, the horizontal and vertical projection histograms v y and v x are calculated. The valleys in the histograms are defined as the values smaller than the noise removal thresholds t n x and t n y, which are linearly scaled to the corresponding dimension of the image. The values in these valleys are set to zero. If a continuous valley in v x or v y is larger than the thresholds t x or t y, respectively, the image is split vertically or horizontally at the center of this valley. This procedure is stopped when no image at a leaf position in the tree can be split any further under the given thresholds. The parameters to validate are t n x and t n y, which lead to more segments if they are larger, and t x and t y which lead to more segments if they are smaller Morphological Smearing On the binary images, the morphological operation closing was applied. This operation consists of the morphological operation dilation followed by erosion as they are defined in [7]. Basically,

3 the background of the new image is defined as the union of the regions covered by a structuring element when it is translated to every position in the image and it is not touching any foreground pixel of the original image. The structuring element is a rectangle with width c x and height c y. Every connected component of the resulting image becomes part of the final segmentation. The parameters to validate are the dimensions of the structuring element, which will result in more segments if they are smaller Voronoi Diagram based Algorithm The Voronoi diagram based algorithm was introduced by Kise et al. in [10]. In its first step, the contour of the foreground is sampled with a sampling rate equal to r s. Noise is then removed using a noise removal threshold t n. For each sample point a Voronoi region is generated resulting in a Voronoi-diagram covering the entire document. Then, adjacent Voronoi regions v 1 and v 2 of this diagram are merged as long as one of the following criteria is satisfied: their separating edge crosses foreground d(v 1, v 2)/T d1 < 1 d(v 1, v 2)/T d2 + a r(v 1, v 2)/t a < 1 where d(v 1, v 2) is the minimum distance between the foreground pixels of the Voronoi regions v 1 and v 2. The inter-character gap T d1 is defined by the first maximum in the nearest neighbor histogram of the connected components. The inter-line gap T d2 is derived from the second maximum in the nearest neighbor histogram by adding a margin control factor f m (we refer to [10] for more details). The area ratio a r is defined as: a r(v 1, v 2) = max(a(g1), a(g2)) min(a(g 1), a(g 2)) where g 1 v 1 and g 2 v 2 are the connected components with the smallest distance and a(g i) is the area of g i. The free parameters that have to be validated are the sampling rate r s, the noise removal threshold t n, the margin control factor f m, which controls the inter-line gap, and the area ratio threshold t a Whitespace Analysis This algorithm, which was introduced by Baird [1], analyzes the background of a document image. It starts by finding a set of largest white rectangles covering the entire background of the document. This is done by an algorithm proposed by Breuel [3]. Then the set of rectangles is sorted according to a weight w 1(c) of the rectangle c. Originally w 1(c) is defined as: «height(c) w 1(c) = area(c) W log 2 width(c) «(2) where W (x) is a weighting function which has been experimentally determined in [1]. The following approximation was proposed by Shafait [15]: 8 >< 0.5 if x < 3 W (x) = 1.5 if 3 x < 5 (3) >: 1 if x 5 However, as the whitespace in handwritten documents is not as equally distributed as in printed documents, this weighting function may not be suited here. Therefore, three other weights to sort (1) the rectangles are evaluated: w 2(c) = area(c) (4) w 3(c) = width(c) 2 (5) w 4(c) = height(c) 2 (6) In descending order, using any of the weighting functions, the white rectangles are sequentially plotted onto the output image, which is initialized with black pixels. This process is terminated as soon as the following stopping rule is satisfied: j w i(c j) f w ts (7) m where c j is the last rectangle added to the mask, f w is a factor to weight the influence of the number of segments added, m is the total amount of rectangles, and t s is a stopping threshold. Finally the segmentation is defined by the black regions left in the output image. The free parameters to validate are f w, t s and the weights w 1,..., w Connected Component Classification The connected component classification systems is characterized by performing a connected component analysis in order to segment a given input document. The connected component analysis is performed in a 8-point neighborhood and results in regions with connected black pixels. For classification, an SVM is used again. It is trained on the labelled connected components of the training set. Note that the connected component analysis has no free parameters. 4. CLASSIFIER 4.1 Support Vector Machine One of the most popular classification methods is the support vector machine (SVM)[14, 16, 19]. The key idea is to find a hyperplane that separates the data into two classes with a maximal margin. Such a hyperplane can only be found if the data is linearly separable. If linear separability is not fulfilled, a weaker definition of the margin, the soft margin, can be used. In this case, the optimal hyperplane is the one that minimizes the sum of errors and maximizes the margin at the same time. This optimization problem is usually solved by quadratic programming. In order to improve the classification of non-linearly separable data, an explicit mapping to a higher-dimension feature space can be performed, or instead, a kernel function can be applied. In our experimental evaluation we make use of an SVM with RBF-kernel κ(x, x ) = exp ` γ x x 2, where γ > 0. Hence, besides the weighting parameter C, which controls whether the maximization of the margin or the minimization of the error is more important, the metaparameter γ has to be tuned. The experiments were performed using the libsvm [5] implementation. 4.2 Feature Extraction Different feature sets to classify document zones in printed documents have been investigated by Keysers et al. in [9]. A well performing set of run-length histograms and connected component statistics have been proposed, which can be extracted very fast. The features used in the current paper for the classification of the document zone in the first system and for the connected components in the second system are, in fact, similar to those of [9]. The feature set consists of run-length histograms of black and white pixels along the horizontal, vertical, and the two diagonal directions. Each histogram uses eight bins, counting runs of length 1, 3, 7, 15, 31, 63, 127, and 128. Additionally, histograms

4 Segmenter Parameter Range Best X-Y Cut t n x 10 {0,..., 6} 10,10,10,0 t n x 10 {0,..., 3} 10 t c x 2 {0,..., 15} 2 t c x 2 {0,..., 15} 2 Smearing c x 5 {0,..., 20} 10,0,0,5 c y 5 {0,..., 20} 10,5,5,5 Voronoi r s {0,..., 8} 1 t n 5 {0,..., 7} 5,5,35,10 f m t a {0,..., 10} 1 Whitespace t s {0,..., 40} 1 f w {0,..., 10} 1 w Baird, area, width, height width Recognition Rate Voronoi Morph. Op. X-Y Cut Whitespace Conn. Comp. Table 1: Parameter validation of the four segmentation methods. If four values are present in the Best column then they vary for the four cross validation runs. with the same bins as mentioned before are created from the width and height distributions of the connected components within the document zone. Moreover, a two dimensional 64-bin histogram of the joint distribution of widths and heights, as well as a histogram of the nearest neighbor distances of the connected components are calculated. All in all, 152 numerical features are extracted. The part of the feature set created from connected component statistics is meaningless when extracted from a single connected component, as it would be the case in the second system. Therefore these features are omitted in that system. 5. EXPERIMENTS 5.1 Setup For all experiments, four-fold cross validation is performed in order to reduce the risk of a biased dataset division. Four disjoint, equally sized subsets ss 0,..., ss 3 of the dataset are created. In run i (i = 0,..., 3), the training set is defined as ss (0+i) ss (1+i mod 4), the validation set as ss (2+i mod 4), and the test set as ss (3+i mod 4). For the first system, the SVM is trained on the segmentation ground truth of the training set. In a grid search on the validation set, the best parameter combination of C and γ in the range of {2 2 j j = 12,..., 12} is identified. The SVM with this parameter set is then applied on the test set. The classification rates of the four segmentation methods are maximized over all parameter combinations on the validation set as well. The parameter combinations that lead to the best performance on the validation set (and are applied on the independent test set) are shown in Table 1. In the second system the SVM-classifier is trained on the connected components of the training set. The parameters C and γ are validated in the same range as for the first system. Using the optimal parameter values, the connected components of the test set are classified to get the final result. In both systems the recognition rate is calculated as the fraction of pixels within a document that have been correctly assigned to the text or the non-text class. The mean value is computed over all documents evaluated. 5.2 Results Figure 3 shows the recognition rates that were achieved when classifying the document zones returned by the four segmentation methods. With a recognition rate above 0.92, the Voronoi segmenation and morphological smearing methods are significantly better than the X-Y cut and the whitespace analysis. (This is statistically Figure 3: Recognition rates achieved with the five different methods. backed using a dependent t-test for paired samples with a significance level of α = 0.05.) The reason for the difference might be that the latter two methods can correctly segment documents in Manhattan layout only, which is not given for the dataset used here. The parameter sets of the different segmentation methods that lead to the best recognition rate also lead to an over-segmentation. It seams easier to correctly classify small segments than to generate a segmentation with large document zones containing only one content type. This finding confirms the results of the connected component classification system, where small segments are produced by default. With a recognition rate of it performs significantly better than all other methods. Analyzing the results on the different content types (see Figure 4), the advantage of the connected component classification system can be explained. On tables and diagrams, its recognition rate is higher than that of other methods. It seems that the lines occurring in the tables lead to a wrong classification. The connected component classification can classify each text part in a table individually, while the segmentation methods can not split the tables apart. The problem with diagrams might be the text labels. If they belong to the same segment as the rest of the diagram, they will be misclassified. 6. CONCLUSION The distinction of text and non-text is an important step to transcribe a scanned or online recorded document into an electronic office document (e.g. ODF). In this paper, we compare two systems to solve this problem using only methods originally developed for offline handwritten and machine printed documents. The first system is implementing the top-down approach. It segments the documents into zones which are then classified by an SVM. Four different segmentation methods are compared to each other. In the second system, a bottom-up strategy is implemented where connected components are classified by an SVM. The dataset on which the experiments have been performed is a collection of online handwritten documents containing text, drawings, diagrams, tables, lists, and formulas. In the result section we demonstrate that the distinction of text and non-text content in handwritten document is possible. Both approaches reach good recognition rates. However, with 94.3% of all pixels correctly assigned, the bottom-up strategy outperforms the top-down approach. The use of connected components as document zones reduces the risk to mix text and non-text content, which

5 Recognition Rate Conn. Comp. Whitespace X-Y Cut Morph. Op. Voronoi Text Drawing Diagram Table Figure 4: Recognition rates of the distinction between text and non-text considering only pixels of text blocks, drawing, diagram, or tables. inevitably results in misclassified pixels. In the near future we intend to extend the proposed methods by using online information. In addition, other bottom-up approaches will be applied which use strokes or individual pixels as their smallest entities. The dataset introduced in this paper opens possibilities for further investigations in the field of document analysis. The detection of different content types or the reconstruction of tables and lists might be other interesting subjects for future work. 7. ACKNOWLEDGMENTS This work has been supported by the Swiss National Center of Competence in Research (NCCR) on Interactive Multimodal Information Management (IM2). We thank the developers of OCRopus 4 for making available their software to the public. 8. REFERENCES [1] H. S. Baird. Background structure in document images. In H. Bunke, P. Wang, and H. S. Baird, editors, Document Image Analysis, pages World Scientific, Singapore, [2] H. S. Baird, M. A. Moll, and C. An. Document image content inventories. In Proc. 14th Conf. on Document Recognition and Retrieval, page 296, [3] T. M. Breuel. Two geometric algorithms for layout analysis. In Proceedings of the Workshop on Document Analysis Systems, pages , Princeton, NJ, USA, [4] T. M. Breuel. The OCRopus open source OCR system. In Proceedings IS&T/SPIE 20th Annual Symposium 2008, [5] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, Software available at [6] W. Francis and H. Kucera. Manual of information to accompany a standard sample of present-day edited American English for use with digital computers. Department of Linguistics, Brown University, OCRopus: [7] R. M. Haralick, S. R. Sternberg, and X. Zhuang. Image analysis using mathematical morphology. IEEE Trans. on Pattern Analysis and Machine Intelligence, 9(4): , [8] A. K. Jain, A. M. Namboodiri, and J. Subrahmonia. Structure in on-line documents. In In Proc. 6th Int. Conf. on Document Analysis and Recognition, pages , [9] D. Keysers, F. Shafait, and T. M. Breuel. Document image zone classification a simple high-performance approach. In Proc. 2nd Int. Conf. on Computer Vision Theory and Applications, pages 44 51, [10] K. Kise, A. Sato, and M. Iwata. Segmentation of page images using the area voronoi diagram. Computer Vision and Image Understanding, 70(3): , June [11] G. Nagy, S. Seth, and M. Viswanathan. A prototype document image analysis system for technical journals. Computer, 25(7):10 22, [12] O. Okun, D. Doermann, and M. Pietikäinen. Page segmentation and zone classification: the state of the art. Technical report, University of Maryland, Language and Media Processing Laboratory, [13] S. Rossignol, D. Willems, A. Neumann, and L. Vuurpijl. Mode detection and incremental recognition. In In Proc. 9th Int. Workshop on Frontiers in Handwriting Recognition, pages , [14] B. Schölkopf and A. Smola. Learning with Kernels. MIT Press, [15] F. Shafait, D. Keysers, and T. M. Breuel. Performance evaluation and benchmarking of six page segmentation algorithms. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(6): , Jun [16] J. Shawe-Taylor and N. Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, [17] M. Shilman, P. Liang, and P. Viola. Learning non-generative grammatical models for document analysis. In Porc. of 10th Int. Conf. on Computer Vision, volume 2, pages , Los Alamitos, CA, USA, IEEE Computer Society. [18] C. Strouthopoulos and N. Papamarkos. Text identification for document image analysis using a neural network. Image and Vision Computing, 16: , [19] V. Vapnik. Statistical Learning Theory. John Wiley, [20] K. Y. Wong, R. G. Casey, and F. M. Wahl. Document analysis system. IBM Journal of Research and Development, 26(6): , 1982.

IAMonDo-database: an Online Handwritten Document Database with Non-uniform Contents

IAMonDo-database: an Online Handwritten Document Database with Non-uniform Contents IAMonDo-database: an Online Handwritten Document Database with Non-uniform Contents Emanuel Indermühle Institute of Computer Science and Applied Mathematics University of Bern, CH-3012 Bern, Switzerland

More information

Page Frame Detection for Marginal Noise Removal from Scanned Documents

Page Frame Detection for Marginal Noise Removal from Scanned Documents Page Frame Detection for Marginal Noise Removal from Scanned Documents Faisal Shafait 1, Joost van Beusekom 2, Daniel Keysers 1, and Thomas M. Breuel 2 1 Image Understanding and Pattern Recognition (IUPR)

More information

The Role of Size Normalization on the Recognition Rate of Handwritten Numerals

The Role of Size Normalization on the Recognition Rate of Handwritten Numerals The Role of Size Normalization on the Recognition Rate of Handwritten Numerals Chun Lei He, Ping Zhang, Jianxiong Dong, Ching Y. Suen, Tien D. Bui Centre for Pattern Recognition and Machine Intelligence,

More information

Support Vector Machines with Clustering for Training with Very Large Datasets

Support Vector Machines with Clustering for Training with Very Large Datasets Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France theodoros.evgeniou@insead.fr Massimiliano

More information

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang Classifying Large Data Sets Using SVMs with Hierarchical Clusters Presented by :Limou Wang Overview SVM Overview Motivation Hierarchical micro-clustering algorithm Clustering-Based SVM (CB-SVM) Experimental

More information

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!

More information

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

More information

OCRopus Addons. Internship Report. Submitted to:

OCRopus Addons. Internship Report. Submitted to: OCRopus Addons Internship Report Submitted to: Image Understanding and Pattern Recognition Lab German Research Center for Artificial Intelligence Kaiserslautern, Germany Submitted by: Ambrish Dantrey,

More information

Support Vector Machines Explained

Support Vector Machines Explained March 1, 2009 Support Vector Machines Explained Tristan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introduction This document has been written in an attempt to make the Support Vector Machines (SVM),

More information

Determining optimal window size for texture feature extraction methods

Determining optimal window size for texture feature extraction methods IX Spanish Symposium on Pattern Recognition and Image Analysis, Castellon, Spain, May 2001, vol.2, 237-242, ISBN: 84-8021-351-5. Determining optimal window size for texture feature extraction methods Domènec

More information

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail

More information

High-Performance Signature Recognition Method using SVM

High-Performance Signature Recognition Method using SVM High-Performance Signature Recognition Method using SVM Saeid Fazli Research Institute of Modern Biological Techniques University of Zanjan Shima Pouyan Electrical Engineering Department University of

More information

Signature Segmentation from Machine Printed Documents using Conditional Random Field

Signature Segmentation from Machine Printed Documents using Conditional Random Field 2011 International Conference on Document Analysis and Recognition Signature Segmentation from Machine Printed Documents using Conditional Random Field Ranju Mandal Computer Vision and Pattern Recognition

More information

The Artificial Prediction Market

The Artificial Prediction Market The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory

More information

Support Vector Machine (SVM)

Support Vector Machine (SVM) Support Vector Machine (SVM) CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising

More information

A Learning Based Method for Super-Resolution of Low Resolution Images

A Learning Based Method for Super-Resolution of Low Resolution Images A Learning Based Method for Super-Resolution of Low Resolution Images Emre Ugur June 1, 2004 emre.ugur@ceng.metu.edu.tr Abstract The main objective of this project is the study of a learning based method

More information

Introduction to Support Vector Machines. Colin Campbell, Bristol University

Introduction to Support Vector Machines. Colin Campbell, Bristol University Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.

More information

Active Learning SVM for Blogs recommendation

Active Learning SVM for Blogs recommendation Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the

More information

Three Effective Top-Down Clustering Algorithms for Location Database Systems

Three Effective Top-Down Clustering Algorithms for Location Database Systems Three Effective Top-Down Clustering Algorithms for Location Database Systems Kwang-Jo Lee and Sung-Bong Yang Department of Computer Science, Yonsei University, Seoul, Republic of Korea {kjlee5435, yang}@cs.yonsei.ac.kr

More information

Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28

Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28 Recognition Topics that we will try to cover: Indexing for fast retrieval (we still owe this one) History of recognition techniques Object classification Bag-of-words Spatial pyramids Neural Networks Object

More information

Knowledge Discovery from patents using KMX Text Analytics

Knowledge Discovery from patents using KMX Text Analytics Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs anton.heijs@treparel.com Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers

More information

A fast multi-class SVM learning method for huge databases

A fast multi-class SVM learning method for huge databases www.ijcsi.org 544 A fast multi-class SVM learning method for huge databases Djeffal Abdelhamid 1, Babahenini Mohamed Chaouki 2 and Taleb-Ahmed Abdelmalik 3 1,2 Computer science department, LESIA Laboratory,

More information

ECE 533 Project Report Ashish Dhawan Aditi R. Ganesan

ECE 533 Project Report Ashish Dhawan Aditi R. Ganesan Handwritten Signature Verification ECE 533 Project Report by Ashish Dhawan Aditi R. Ganesan Contents 1. Abstract 3. 2. Introduction 4. 3. Approach 6. 4. Pre-processing 8. 5. Feature Extraction 9. 6. Verification

More information

Circle Object Recognition Based on Monocular Vision for Home Security Robot

Circle Object Recognition Based on Monocular Vision for Home Security Robot Journal of Applied Science and Engineering, Vol. 16, No. 3, pp. 261 268 (2013) DOI: 10.6180/jase.2013.16.3.05 Circle Object Recognition Based on Monocular Vision for Home Security Robot Shih-An Li, Ching-Chang

More information

5. Binary objects labeling

5. Binary objects labeling Image Processing - Laboratory 5: Binary objects labeling 1 5. Binary objects labeling 5.1. Introduction In this laboratory an object labeling algorithm which allows you to label distinct objects from a

More information

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction

COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction COMP3420: Advanced Databases and Data Mining Classification and prediction: Introduction and Decision Tree Induction Lecture outline Classification versus prediction Classification A two step process Supervised

More information

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic

More information

Topological Data Analysis Applications to Computer Vision

Topological Data Analysis Applications to Computer Vision Topological Data Analysis Applications to Computer Vision Vitaliy Kurlin, http://kurlin.org Microsoft Research Cambridge and Durham University, UK Topological Data Analysis quantifies topological structures

More information

A Simple Introduction to Support Vector Machines

A Simple Introduction to Support Vector Machines A Simple Introduction to Support Vector Machines Martin Law Lecture for CSE 802 Department of Computer Science and Engineering Michigan State University Outline A brief history of SVM Large-margin linear

More information

Signature verification using Kolmogorov-Smirnov. statistic

Signature verification using Kolmogorov-Smirnov. statistic Signature verification using Kolmogorov-Smirnov statistic Harish Srinivasan, Sargur N.Srihari and Matthew J Beal University at Buffalo, the State University of New York, Buffalo USA {srihari,hs32}@cedar.buffalo.edu,mbeal@cse.buffalo.edu

More information

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam

More information

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees

Statistical Data Mining. Practical Assignment 3 Discriminant Analysis and Decision Trees Statistical Data Mining Practical Assignment 3 Discriminant Analysis and Decision Trees In this practical we discuss linear and quadratic discriminant analysis and tree-based classification techniques.

More information

Video OCR for Sport Video Annotation and Retrieval

Video OCR for Sport Video Annotation and Retrieval Video OCR for Sport Video Annotation and Retrieval Datong Chen, Kim Shearer and Hervé Bourlard, Fellow, IEEE Dalle Molle Institute for Perceptual Artificial Intelligence Rue du Simplon 4 CH-190 Martigny

More information

Document Image Retrieval using Signatures as Queries

Document Image Retrieval using Signatures as Queries Document Image Retrieval using Signatures as Queries Sargur N. Srihari, Shravya Shetty, Siyuan Chen, Harish Srinivasan, Chen Huang CEDAR, University at Buffalo(SUNY) Amherst, New York 14228 Gady Agam and

More information

Visual Structure Analysis of Flow Charts in Patent Images

Visual Structure Analysis of Flow Charts in Patent Images Visual Structure Analysis of Flow Charts in Patent Images Roland Mörzinger, René Schuster, András Horti, and Georg Thallinger JOANNEUM RESEARCH Forschungsgesellschaft mbh DIGITAL - Institute for Information

More information

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining Extend Table Lens for High-Dimensional Data Visualization and Classification Mining CPSC 533c, Information Visualization Course Project, Term 2 2003 Fengdong Du fdu@cs.ubc.ca University of British Columbia

More information

Large Margin DAGs for Multiclass Classification

Large Margin DAGs for Multiclass Classification S.A. Solla, T.K. Leen and K.-R. Müller (eds.), 57 55, MIT Press (000) Large Margin DAGs for Multiclass Classification John C. Platt Microsoft Research Microsoft Way Redmond, WA 9805 jplatt@microsoft.com

More information

Face detection is a process of localizing and extracting the face region from the

Face detection is a process of localizing and extracting the face region from the Chapter 4 FACE NORMALIZATION 4.1 INTRODUCTION Face detection is a process of localizing and extracting the face region from the background. The detected face varies in rotation, brightness, size, etc.

More information

Data Mining for Knowledge Management. Classification

Data Mining for Knowledge Management. Classification 1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh

More information

NAVIGATING SCIENTIFIC LITERATURE A HOLISTIC PERSPECTIVE. Venu Govindaraju

NAVIGATING SCIENTIFIC LITERATURE A HOLISTIC PERSPECTIVE. Venu Govindaraju NAVIGATING SCIENTIFIC LITERATURE A HOLISTIC PERSPECTIVE Venu Govindaraju BIOMETRICS DOCUMENT ANALYSIS PATTERN RECOGNITION 8/24/2015 ICDAR- 2015 2 Towards a Globally Optimal Approach for Learning Deep Unsupervised

More information

Implementation of OCR Based on Template Matching and Integrating it in Android Application

Implementation of OCR Based on Template Matching and Integrating it in Android Application International Journal of Computer Sciences and EngineeringOpen Access Technical Paper Volume-04, Issue-02 E-ISSN: 2347-2693 Implementation of OCR Based on Template Matching and Integrating it in Android

More information

Multiple Kernel Learning on the Limit Order Book

Multiple Kernel Learning on the Limit Order Book JMLR: Workshop and Conference Proceedings 11 (2010) 167 174 Workshop on Applications of Pattern Analysis Multiple Kernel Learning on the Limit Order Book Tristan Fletcher Zakria Hussain John Shawe-Taylor

More information

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

More information

Clustering & Visualization

Clustering & Visualization Chapter 5 Clustering & Visualization Clustering in high-dimensional databases is an important problem and there are a number of different clustering paradigms which are applicable to high-dimensional data.

More information

Ensembles and PMML in KNIME

Ensembles and PMML in KNIME Ensembles and PMML in KNIME Alexander Fillbrunn 1, Iris Adä 1, Thomas R. Gabriel 2 and Michael R. Berthold 1,2 1 Department of Computer and Information Science Universität Konstanz Konstanz, Germany First.Last@Uni-Konstanz.De

More information

How To Cluster

How To Cluster Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

More information

Morphological segmentation of histology cell images

Morphological segmentation of histology cell images Morphological segmentation of histology cell images A.Nedzved, S.Ablameyko, I.Pitas Institute of Engineering Cybernetics of the National Academy of Sciences Surganova, 6, 00 Minsk, Belarus E-mail abl@newman.bas-net.by

More information

Classification of Bad Accounts in Credit Card Industry

Classification of Bad Accounts in Credit Card Industry Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition

More information

Predict Influencers in the Social Network

Predict Influencers in the Social Network Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

More information

14.10.2014. Overview. Swarms in nature. Fish, birds, ants, termites, Introduction to swarm intelligence principles Particle Swarm Optimization (PSO)

14.10.2014. Overview. Swarms in nature. Fish, birds, ants, termites, Introduction to swarm intelligence principles Particle Swarm Optimization (PSO) Overview Kyrre Glette kyrrehg@ifi INF3490 Swarm Intelligence Particle Swarm Optimization Introduction to swarm intelligence principles Particle Swarm Optimization (PSO) 3 Swarms in nature Fish, birds,

More information

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Fabian Grüning Carl von Ossietzky Universität Oldenburg, Germany, fabian.gruening@informatik.uni-oldenburg.de Abstract: Independent

More information

Image Normalization for Illumination Compensation in Facial Images

Image Normalization for Illumination Compensation in Facial Images Image Normalization for Illumination Compensation in Facial Images by Martin D. Levine, Maulin R. Gandhi, Jisnu Bhattacharyya Department of Electrical & Computer Engineering & Center for Intelligent Machines

More information

Data, Measurements, Features

Data, Measurements, Features Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information

Character Image Patterns as Big Data

Character Image Patterns as Big Data 22 International Conference on Frontiers in Handwriting Recognition Character Image Patterns as Big Data Seiichi Uchida, Ryosuke Ishida, Akira Yoshida, Wenjie Cai, Yaokai Feng Kyushu University, Fukuoka,

More information

IBM SPSS Direct Marketing 23

IBM SPSS Direct Marketing 23 IBM SPSS Direct Marketing 23 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 23, release

More information

The Scientific Data Mining Process

The Scientific Data Mining Process Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In

More information

Component Ordering in Independent Component Analysis Based on Data Power

Component Ordering in Independent Component Analysis Based on Data Power Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals

More information

Lecture 2: The SVM classifier

Lecture 2: The SVM classifier Lecture 2: The SVM classifier C19 Machine Learning Hilary 2015 A. Zisserman Review of linear classifiers Linear separability Perceptron Support Vector Machine (SVM) classifier Wide margin Cost function

More information

Super-resolution method based on edge feature for high resolution imaging

Super-resolution method based on edge feature for high resolution imaging Science Journal of Circuits, Systems and Signal Processing 2014; 3(6-1): 24-29 Published online December 26, 2014 (http://www.sciencepublishinggroup.com/j/cssp) doi: 10.11648/j.cssp.s.2014030601.14 ISSN:

More information

Keywords image processing, signature verification, false acceptance rate, false rejection rate, forgeries, feature vectors, support vector machines.

Keywords image processing, signature verification, false acceptance rate, false rejection rate, forgeries, feature vectors, support vector machines. International Journal of Computer Application and Engineering Technology Volume 3-Issue2, Apr 2014.Pp. 188-192 www.ijcaet.net OFFLINE SIGNATURE VERIFICATION SYSTEM -A REVIEW Pooja Department of Computer

More information

Recognition of Handwritten Digits using Structural Information

Recognition of Handwritten Digits using Structural Information Recognition of Handwritten Digits using Structural Information Sven Behnke Martin-Luther University, Halle-Wittenberg' Institute of Computer Science 06099 Halle, Germany { behnke Irojas} @ informatik.uni-halle.de

More information

Decision Trees from large Databases: SLIQ

Decision Trees from large Databases: SLIQ Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values

More information

Contribution of Multiresolution Description for Archive Document Structure Recognition

Contribution of Multiresolution Description for Archive Document Structure Recognition Contribution of Multiresolution Description for Archive Document Structure Recognition Aurélie Lemaitre, Jean Camillerapp, Bertrand Coüasnon To cite this version: Aurélie Lemaitre, Jean Camillerapp, Bertrand

More information

Analecta Vol. 8, No. 2 ISSN 2064-7964

Analecta Vol. 8, No. 2 ISSN 2064-7964 EXPERIMENTAL APPLICATIONS OF ARTIFICIAL NEURAL NETWORKS IN ENGINEERING PROCESSING SYSTEM S. Dadvandipour Institute of Information Engineering, University of Miskolc, Egyetemváros, 3515, Miskolc, Hungary,

More information

Gamma Distribution Fitting

Gamma Distribution Fitting Chapter 552 Gamma Distribution Fitting Introduction This module fits the gamma probability distributions to a complete or censored set of individual or grouped data values. It outputs various statistics

More information

Linear Threshold Units

Linear Threshold Units Linear Threshold Units w x hx (... w n x n w We assume that each feature x j and each weight w j is a real number (we will relax this later) We will study three different algorithms for learning linear

More information

High-dimensional labeled data analysis with Gabriel graphs

High-dimensional labeled data analysis with Gabriel graphs High-dimensional labeled data analysis with Gabriel graphs Michaël Aupetit CEA - DAM Département Analyse Surveillance Environnement BP 12-91680 - Bruyères-Le-Châtel, France Abstract. We propose the use

More information

Introducing diversity among the models of multi-label classification ensemble

Introducing diversity among the models of multi-label classification ensemble Introducing diversity among the models of multi-label classification ensemble Lena Chekina, Lior Rokach and Bracha Shapira Ben-Gurion University of the Negev Dept. of Information Systems Engineering and

More information

How To Fix Out Of Focus And Blur Images With A Dynamic Template Matching Algorithm

How To Fix Out Of Focus And Blur Images With A Dynamic Template Matching Algorithm IJSTE - International Journal of Science Technology & Engineering Volume 1 Issue 10 April 2015 ISSN (online): 2349-784X Image Estimation Algorithm for Out of Focus and Blur Images to Retrieve the Barcode

More information

Intrusion Detection via Machine Learning for SCADA System Protection

Intrusion Detection via Machine Learning for SCADA System Protection Intrusion Detection via Machine Learning for SCADA System Protection S.L.P. Yasakethu Department of Computing, University of Surrey, Guildford, GU2 7XH, UK. s.l.yasakethu@surrey.ac.uk J. Jiang Department

More information

MAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS

MAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS MAXIMIZING RETURN ON DIRET MARKETING AMPAIGNS IN OMMERIAL BANKING S 229 Project: Final Report Oleksandra Onosova INTRODUTION Recent innovations in cloud computing and unified communications have made a

More information

Vector storage and access; algorithms in GIS. This is lecture 6

Vector storage and access; algorithms in GIS. This is lecture 6 Vector storage and access; algorithms in GIS This is lecture 6 Vector data storage and access Vectors are built from points, line and areas. (x,y) Surface: (x,y,z) Vector data access Access to vector

More information

Journal of Industrial Engineering Research. Adaptive sequence of Key Pose Detection for Human Action Recognition

Journal of Industrial Engineering Research. Adaptive sequence of Key Pose Detection for Human Action Recognition IWNEST PUBLISHER Journal of Industrial Engineering Research (ISSN: 2077-4559) Journal home page: http://www.iwnest.com/aace/ Adaptive sequence of Key Pose Detection for Human Action Recognition 1 T. Sindhu

More information

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2 Advanced Engineering Forum Vols. 6-7 (2012) pp 82-87 Online: 2012-09-26 (2012) Trans Tech Publications, Switzerland doi:10.4028/www.scientific.net/aef.6-7.82 Research on Clustering Analysis of Big Data

More information

Classification algorithm in Data mining: An Overview

Classification algorithm in Data mining: An Overview Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department

More information

Colour Image Segmentation Technique for Screen Printing

Colour Image Segmentation Technique for Screen Printing 60 R.U. Hewage and D.U.J. Sonnadara Department of Physics, University of Colombo, Sri Lanka ABSTRACT Screen-printing is an industry with a large number of applications ranging from printing mobile phone

More information

DEVNAGARI DOCUMENT SEGMENTATION USING HISTOGRAM APPROACH

DEVNAGARI DOCUMENT SEGMENTATION USING HISTOGRAM APPROACH DEVNAGARI DOCUMENT SEGMENTATION USING HISTOGRAM APPROACH Vikas J Dongre 1 Vijay H Mankar 2 Department of Electronics & Telecommunication, Government Polytechnic, Nagpur, India 1 dongrevj@yahoo.co.in; 2

More information

Domain Classification of Technical Terms Using the Web

Domain Classification of Technical Terms Using the Web Systems and Computers in Japan, Vol. 38, No. 14, 2007 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J89-D, No. 11, November 2006, pp. 2470 2482 Domain Classification of Technical Terms Using

More information

How To Filter Spam Image From A Picture By Color Or Color

How To Filter Spam Image From A Picture By Color Or Color Image Content-Based Email Spam Image Filtering Jianyi Wang and Kazuki Katagishi Abstract With the population of Internet around the world, email has become one of the main methods of communication among

More information

Local features and matching. Image classification & object localization

Local features and matching. Image classification & object localization Overview Instance level search Local features and matching Efficient visual recognition Image classification & object localization Category recognition Image classification: assigning a class label to

More information

Support Vector Machines for Dynamic Biometric Handwriting Classification

Support Vector Machines for Dynamic Biometric Handwriting Classification Support Vector Machines for Dynamic Biometric Handwriting Classification Tobias Scheidat, Marcus Leich, Mark Alexander, and Claus Vielhauer Abstract Biometric user authentication is a recent topic in the

More information

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet

More information

SVM Ensemble Model for Investment Prediction

SVM Ensemble Model for Investment Prediction 19 SVM Ensemble Model for Investment Prediction Chandra J, Assistant Professor, Department of Computer Science, Christ University, Bangalore Siji T. Mathew, Research Scholar, Christ University, Dept of

More information

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda Clustering 1/46 Agenda Introduction Distance K-nearest neighbors Hierarchical clustering Quick reference 2/46 1 Introduction It seems logical that in a new situation we should act in a similar way as in

More information

IBM SPSS Direct Marketing 22

IBM SPSS Direct Marketing 22 IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release

More information

Spam detection with data mining method:

Spam detection with data mining method: Spam detection with data mining method: Ensemble learning with multiple SVM based classifiers to optimize generalization ability of email spam classification Keywords: ensemble learning, SVM classifier,

More information

D-optimal plans in observational studies

D-optimal plans in observational studies D-optimal plans in observational studies Constanze Pumplün Stefan Rüping Katharina Morik Claus Weihs October 11, 2005 Abstract This paper investigates the use of Design of Experiments in observational

More information

ALGEBRA. sequence, term, nth term, consecutive, rule, relationship, generate, predict, continue increase, decrease finite, infinite

ALGEBRA. sequence, term, nth term, consecutive, rule, relationship, generate, predict, continue increase, decrease finite, infinite ALGEBRA Pupils should be taught to: Generate and describe sequences As outcomes, Year 7 pupils should, for example: Use, read and write, spelling correctly: sequence, term, nth term, consecutive, rule,

More information

LCs for Binary Classification

LCs for Binary Classification Linear Classifiers A linear classifier is a classifier such that classification is performed by a dot product beteen the to vectors representing the document and the category, respectively. Therefore it

More information

A Simple Feature Extraction Technique of a Pattern By Hopfield Network

A Simple Feature Extraction Technique of a Pattern By Hopfield Network A Simple Feature Extraction Technique of a Pattern By Hopfield Network A.Nag!, S. Biswas *, D. Sarkar *, P.P. Sarkar *, B. Gupta **! Academy of Technology, Hoogly - 722 *USIC, University of Kalyani, Kalyani

More information

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with

More information

Effective Data Mining Using Neural Networks

Effective Data Mining Using Neural Networks IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 8, NO. 6, DECEMBER 1996 957 Effective Data Mining Using Neural Networks Hongjun Lu, Member, IEEE Computer Society, Rudy Setiono, and Huan Liu,

More information

Leveraging Ensemble Models in SAS Enterprise Miner

Leveraging Ensemble Models in SAS Enterprise Miner ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to

More information

Automatic 3D Reconstruction via Object Detection and 3D Transformable Model Matching CS 269 Class Project Report

Automatic 3D Reconstruction via Object Detection and 3D Transformable Model Matching CS 269 Class Project Report Automatic 3D Reconstruction via Object Detection and 3D Transformable Model Matching CS 69 Class Project Report Junhua Mao and Lunbo Xu University of California, Los Angeles mjhustc@ucla.edu and lunbo

More information

Segmentation and Automatic Descreening of Scanned Documents

Segmentation and Automatic Descreening of Scanned Documents Segmentation and Automatic Descreening of Scanned Documents Alejandro Jaimes a, Frederick Mintzer b, A. Ravishankar Rao b and Gerhard Thompson b a Columbia University b IBM T.J. Watson Research Center

More information

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu

Medical Information Management & Mining. You Chen Jan,15, 2013 You.chen@vanderbilt.edu Medical Information Management & Mining You Chen Jan,15, 2013 You.chen@vanderbilt.edu 1 Trees Building Materials Trees cannot be used to build a house directly. How can we transform trees to building materials?

More information

Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research

More information

Environmental Remote Sensing GEOG 2021

Environmental Remote Sensing GEOG 2021 Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class

More information