IAMonDo-database: an Online Handwritten Document Database with Non-uniform Contents
Emanuel Indermühle
Institute of Computer Science and Applied Mathematics, University of Bern, CH-3012 Bern, Switzerland

Marcus Liwicki
Knowledge Management Department, German Research Center for AI (DFKI), Kaiserslautern, Germany

Horst Bunke
Institute of Computer Science and Applied Mathematics, University of Bern, CH-3012 Bern, Switzerland

ABSTRACT
In this paper we present a new database of online handwritten documents with different contents such as text, drawings, diagrams, formulas, tables, lists, and markings. It was designed to serve as a standard dataset for the development, training, testing, and comparison of methods in the field of handwritten document analysis. The database can serve as a basis for layout analysis and for different segmentation and recognition tasks using online or only offline information. It comprises 1,000 documents produced by approximately 200 writers, with a total of 329,849 online strokes. Few constraints were imposed on the writers when creating the documents. Nonetheless, the database has a stable distribution of the different content types. A software tool was developed to allow easy access to the documents, which are stored in InkML. We also present two experiments which show the challenge this database poses. They may serve as reference results for further research in this area.

Categories and Subject Descriptors
H.4 [Information Systems Applications]: Miscellaneous; D.2.8 [Software Engineering]: Metrics (complexity measures, performance measures)

1. INTRODUCTION
The understanding of entire handwritten documents is a challenging problem that includes a number of difficult tasks. In a first step, the layout of a given handwritten document has to be analyzed to isolate the different content types. These content types can then be directed to specialized systems, including handwriting, symbol, or table recognizers.
While a lot of research has been directed to the latter three tasks [3, 11, 18], layout analysis for handwritten documents has only recently begun to receive increasing attention. This is partly due to the rise of novel online handwriting recording devices and their use in daily tasks. Jain et al. were pioneers in this field [8]. Large companies like HP and Microsoft [2] are joining the effort.

The need for a tailored database supporting the development, training, and testing of systems is a central issue in the field of document analysis. Various databases are available and popular in the community. A well-known example is CEDAR [7], a database of address-related text images for the recognition of addresses on letters. The IAMDB [12] is a dataset of handwritten text line images containing sentences from the LOB corpus. These two examples cover offline handwritten text. For online handwriting, widely used datasets are UNIPEN [6], a large repository of online handwritten texts; IRONOFF [17], a database of online text that is mapped to the corresponding offline images; and IAMonDB [10], which is designed similarly to IAMDB but contains online handwritten text. In research on Japanese and Chinese handwriting recognition, online handwritten character databases are often used [13]. All of these databases focus on handwritten text only.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DAS '10, June 9-11, 2010, Boston, MA, USA. Copyright 2010 ACM /10/06...$10.00
Databases covering entire documents with drawings, images, and structured text include the UW-I and UW-II databases [15] as well as the more recent ISRI dataset [14], which contain journal articles and memorandums in English and Japanese. However, they are limited to machine-printed documents.

In this paper we introduce the IAMonDo-database (see footnote 1), which is the first database of online handwritten documents with different content types made available to the public. It consists of 1,000 documents containing handwritten text, drawings, diagrams, formulas, tables, lists, and marking elements arranged in an unconstrained way. The database is designed for the development of algorithms for segmentation, content type distinction, and document annotation recognition. Six different classes of content types occur in this database. It is also possible to use the database to distinguish between text and non-text. Moreover, it can be used for recognition and segmentation tasks such as document zone segmentation, formula recognition, symbol recognition, table recognition, text line separation, word segmentation, and handwriting recognition.

Footnote 1: IAMonDo-database stands for IAM Online Document Database. IAM is the abbreviation of Institut für Informatik und angewandte Mathematik.
In Section 2 the design of the IAMonDo-database is described. Its documents are stored in the flexible InkML language (see Section 3), and annotations from document zones down to individual words are available (see Section 4). We also present some results that demonstrate the challenge this database poses (Section 6). In Section 7 conclusions are drawn and future developments are discussed.

2. DESIGN AND ACQUISITION
The primary purpose of the database is to serve as a dataset for the analysis and development of methods that distinguish between different content types in online handwritten documents. Its secondary purpose is to assist research on the recognition of document annotations. Two requirements arise from this: the documents need to contain different content types, and they need to be annotated. (We refer to these annotations as markings to avoid confusion with the ground truth of the documents, which is referred to as annotations of the digital ink.)

2.1 Database Contents
We assume that online handwritten documents are typically created in the context of meetings or note taking during lessons. The content types used in such a context were considered when creating this database:

- Text block: all text that is structured neither as a list nor as a table. Moreover, a text block must not occur as a label in a diagram.
- Drawing: graphs, diagrams, maps, symbols, or freehand objects.
- Diagram: same as drawings, but may include text labels.
- Formula: on one or multiple lines.
- Table: with and without ruling.
- List: containing one word per line.

It can be expected that in real situations most of the content is text. However, to study text vs. non-text distinction, a certain amount of drawings and diagrams should be included. This is taken into account in the IAMonDo-database (see Figure 1).
Another element of these documents is markings, which are applied after the writer has created a document. The marking elements listed below are a subset of the Microsoft gesture set (see Figure 2 for an illustration).

1. Underlining of text
2. Marking of text on one or multiple lines by angles on the top left as start mark and angles on the bottom right as end mark
3. Marking of text by enclosing it in square brackets
4. Marking of entire text lines by a vertical stroke on the right or left side of the text block
5. Encircling
6. Text labels annotating these markings
7. Strokes connecting a marking with the annotating text label

[Figure 1: bar chart of the content types, measured in traces and in pixels, on a scale from 0% to 100%. Legible values: text 82.3% and text 67.8%. Legend entries: text block, list/table, formula, diagram (labels), diagram, drawing, non-text.]
Figure 1: Different content types and their representation in the database. Other denotes strokes structuring the document, like ruling in tables, arrows, or separating lines.

Figure 2: The different marking elements applied to documents.

2.2 Templates
To ensure that each document contains a sufficiently large amount of text, it was decided to give the contributors templates which they had to copy. This also solved the problem of writers not knowing what to write on the blank paper. An additional advantage of using templates is the possibility to control the content distribution.

Having control over the contents led to the decision to use a language corpus as the text source. In the field of linguistics, several large text collections, known as language corpora, are available. They often offer detailed information about the text, for example whether a word is a noun, a verb, or belongs to another word class. It was decided to use the Brown corpus [5], which is a collection of American English texts covering different fields of writing. For this database the following categories were considered:

- Press: reportage
- Press: editorial
- Press: reviews
- Religion
- Popular lore
- Belles lettres, biography, essays
- Miscellaneous: government and house organs
- Education and scientific publications
- Fiction: general
- Fiction: mystery
- Fiction: adventure

The other content types were also predefined in the templates. Due to legal concerns, drawings, diagrams, and formulas were obtained from Wikipedia (see footnote 2) and Wikimedia Commons (see footnote 3). The templates are generated from the following sets:

- Consecutive sentences obtained from the Brown text corpus [5] (categories A, B, C, D, F, G, H, J, K, L, and N).
- 200 drawings obtained from Wikimedia Commons.
- Diagrams (which are drawings with text labels) obtained from Wikimedia Commons.
- Formulas obtained from Wikipedia.
- Nouns from the Brown text corpus [5].
- Random numbers.

When generating the templates, two conflicting requirements arise. On the one hand, the template's content should be compiled randomly so that it cannot be predicted. On the other hand, some rules must be applied to guarantee the desired distribution. The following rules are applied during the generation of one template:

- Text: A random sentence is selected, and the following sentences are appended as long as the text is shorter than 200 characters.
- One random diagram is included.
- With a probability of 0.5 either a random formula or a random drawing is added.
- With a probability of 0.5 either a list or a table is added.
- With equal probability the table is without ruling, with ruling for the title line, with just horizontal ruling, or with a fully ruled grid.
- The text in the table is aligned left, aligned right, or centered.
- The table has 2, 3, or 4 columns and 2 to 6 rows.
- The table contains random nouns in the first column and row, and random numbers of random length in the remaining cells.
- The list has 2 to 7 random nouns.

Footnote 2: Wikipedia
Footnote 3: Wikimedia Commons

2.3 Acquisition
For the acquisition, the Logitech IO2 Pen, powered by Anoto technology, was used as the recording device.
It has several advantages compared to a tablet PC or an electronic whiteboard. First, it is accurate. Second, besides the coordinates of the digital ink, it delivers time and pressure information. Therefore, the writing speed as well as the force at every sampling point can be calculated. Third, the acquisition process is quite easy because no tablet has to be adjusted and no whiteboard has to be installed. Only a pen and paper need to be handed out to the writer, who can do the writing whenever he or she has time. Afterwards, the pens are plugged into a cradle and the digital ink is transferred to the host computer.

One of the main problems when acquiring a database is the recruitment of writers. We asked 200 persons to produce 5 documents each, which took about 40 minutes per writer. The writers were mainly students at the institutes of the authors. During a recording session, printed instructions and five templates were given to each writer, who was asked to copy each template's content onto a sheet of paper. There was no further supervision. The instructions contain a small form to collect metadata about the writer, so the following information can be linked to each individual document:

- A writer ID, which makes it possible to tell whether two documents are from the same writer
- The writer's gender
- Whether the writer is left-handed or right-handed
- The age
- The native language
- The nationality
- The level of education
- The profession

In the instructions the writers are requested to copy the content of the template to the paper. They are asked to rearrange the content, i.e. to make it different from the template. The writers were encouraged to use multiple columns for the text, to break apart content elements, to structure the documents with arrows and separating lines, and to simplify or reduce drawings and diagrams. The last point was added because the writers tended to spend a lot of time copying the drawings correctly, while only a sketchy style was requested.
They were asked to correct misspelled words by crossing them out rather than overwriting them. No further constraints on how to create the document were imposed on them. When they had finished a document, they were asked to mark (annotate) it. Doing this at the end more likely reflects the workflow in a realistic context. Figures 3 and 4 show how a writer copied the content of a template to the document and then added the markings.

[Figure 3: sample template. Legible template text: "And he had a feeling -- thanks to the girl -- that things would get worse before they got better. They had the house cleaned up by noon, and Wilson sent the boy out to the meadow to bring in the horses." Legible list or table entries: rancher, functions, daughter.]
Figure 3: Template as it has been presented to the writer.

Figure 4: Document as it has been created by a writer.

As expected, the writers did not always behave according to the instructions given to them. However, the results are very good. Some unexpected behavior reduced the desired diversity of the documents, while other behavior increased it. Although they were asked to rearrange the content, the writers often placed the text at the top of the document. In the majority of cases, the use of multiple columns, the breaking apart of content elements, and the structuring with arrows or separating lines were omitted. On the other hand, in about half of the documents the paper was turned and used in landscape rather than portrait layout, which was not expected and resulted in more diversity.

2.4 Additional Information
The database has been split into five disjoint subsets produced by different writers. Each subset contains approximately 200 documents. For the first four sets (0, 1, 2, 3) two different experimental setups are adopted. In the first one, sets 0 and 1 are used for training, set 2 is used to validate system parameters, and set 3 is the test set. The second setup is a 4-fold cross validation where sets (0 + i) mod 4 and (1 + i) mod 4 are used for training, set (2 + i) mod 4 for validation, and set (3 + i) mod 4 for testing, for i = 0, ..., 3. Set 4 is an independent test set, which should be used only once for a system.

The database consists of 1,000 documents with 329,849 individual online strokes. The entire database includes 68,441 words, 7,576 text lines, 2,532 table cells, 2,069 list items, and 5,646 diagram labels, which are annotated with their transcription. The database contains 905 diagrams, 1,456 drawings, 485 formulas, 536 lists, 446 tables, and 1,474 text blocks. Of all documents, 833 contain markings, with 8,260 individual marking elements in total.
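The rotation of the second experimental setup can be written out explicitly. The following is a minimal sketch, assuming the set indices are taken modulo 4 as described above; the function name `cross_validation_folds` and the dictionary keys are hypothetical and not part of the database tools.

```python
def cross_validation_folds(n_sets=4):
    """Return one dict per fold, rotating the roles of sets 0..n_sets-1.

    Fold i trains on sets (0 + i) mod n and (1 + i) mod n, validates on
    set (2 + i) mod n and tests on set (3 + i) mod n.
    """
    folds = []
    for i in range(n_sets):
        folds.append({
            "train": [(0 + i) % n_sets, (1 + i) % n_sets],
            "validation": (2 + i) % n_sets,
            "test": (3 + i) % n_sets,
        })
    return folds


folds = cross_validation_folds()
for i, fold in enumerate(folds):
    print(i, fold)
```

Fold 0 reproduces the first setup (training on sets 0 and 1, validation on set 2, testing on set 3), and every one of the four sets serves as the test set exactly once. Set 4 is deliberately left out of the rotation.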
A few templates were copied several times by different writers. This happened accidentally, because not all templates given to the writers were tracked and because an example that was handed out was interpreted as a template.

3. FORMAT
To store the documents, the InkML language [4], proposed by the World Wide Web Consortium (W3C), was chosen. InkML, whose full name is Ink Markup Language, is an XML-based language. It offers high flexibility in how information can be stored and presented. It is also partly used by the International Unipen Foundation (see footnote 4) [1].

InkML is intended to preserve the original form of the data while offering the possibility to create different views of the ink. This is done by two constructs. The first one is the TraceView element. This XML element can either contain other TraceView elements or reference ink data. This allows hierarchical structuring of the data, which is used in the documents of this database. The second construct allows the data to be transformed in different ways. In this work it is used to correct the orientation of the documents by applying an affine transformation to the original coordinate system.

Additionally, InkML allows annotations to be included as name-value pairs virtually everywhere. Since this gives a high degree of freedom, it is important to define application-specific rules. Such rules, however, cannot be defined within InkML and are therefore specific to this database. We have defined the structure of the annotation in a separate XML file.

Footnote 4: International Unipen Foundation
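To illustrate how nested TraceView elements carry a hierarchical annotation, the following minimal sketch walks such a tree with Python's standard XML parser. The sample document is hypothetical: the annotation names ("type", "transcription"), the label values, and the use of the `traceDataRef` attribute are assumptions made for illustration and may differ from the files in the database.

```python
import xml.etree.ElementTree as ET

INKML_NS = "http://www.w3.org/2003/InkML"

# Hypothetical sample; real database files may use other names and values.
SAMPLE = """<ink xmlns="http://www.w3.org/2003/InkML">
  <trace xml:id="t0">10 10, 20 12, 30 15</trace>
  <trace xml:id="t1">40 10, 50 11</trace>
  <traceView>
    <annotation type="type">Document</annotation>
    <traceView>
      <annotation type="type">Textblock</annotation>
      <traceView>
        <annotation type="type">Word</annotation>
        <annotation type="transcription">hello</annotation>
        <traceView traceDataRef="#t0"/>
        <traceView traceDataRef="#t1"/>
      </traceView>
    </traceView>
  </traceView>
</ink>"""


def q(tag):
    """Qualify a tag name with the InkML namespace."""
    return "{%s}%s" % (INKML_NS, tag)


def walk(view, depth=0):
    """Yield (depth, label, transcription, trace reference) for each view."""
    # Only direct children are inspected, so annotations stay with their view.
    annotations = {a.get("type"): (a.text or "")
                   for a in view.findall(q("annotation"))}
    yield (depth, annotations.get("type"),
           annotations.get("transcription"), view.get("traceDataRef"))
    for child in view.findall(q("traceView")):
        yield from walk(child, depth + 1)


root = ET.fromstring(SAMPLE)
nodes = [n for top in root.findall(q("traceView")) for n in walk(top)]
for depth, label, text, ref in nodes:
    print("  " * depth + str(label or ref), text or "")
```

Inner views group other views, and leaf views only reference trace data, which mirrors the zone, word, and trace levels of the annotation.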
4. ANNOTATION
The database has detailed annotations from which ground truth in various formats can be generated. At the top level, different document zones are labeled with their content type. These zones are divided further into sub-elements, such as text lines, table cells, or drawings, which are annotated with their type and, if applicable, their transcription. Text lines and table cells are in turn divided into individual words, which are also annotated with their transcription. The grammar of the hierarchy, represented as a graph, is shown in Figure 5. At the InkML level, the hierarchically structured annotation is realized by annotating TraceView elements that are structured as desired. Formulas have not been annotated in detail. This can be done in the future, however.

[Figure 5: graph with the node labels document, diagram, table, list, text block, marking, drawing, formula, structure, text line, garbage, word, symbol, and TRACE; the marking elements include brackets, angles, etc.]
Figure 5: The hierarchical structure of the annotation.

The annotation was done using the software presented in Section 5. The process starts with the operator selecting traces which, grouped together, form for instance a word or a drawing. Such trace groups are annotated with one of the content types. These content types are proposed by the software following the rules defined in the additional XML file. The transcription, if applicable, is specified as well. Once a couple of words or other annotated trace groups exist, further groups can be built that constitute text lines, diagrams, or other high-level structures. Depending on the entities selected for grouping, the software proposes appropriate types for annotating the group. Sometimes it is hard even for the operator to recognize the transcription of a given word. In these cases it is handy to have the template to look at. The whole database was annotated this way in about 200 hours.

Different forms of ground truth can be generated from the annotation of a document. Some export formats have already proven useful in our experiments. If document analysis is performed using only offline information, the document can be converted to a color-coded image as proposed in [16]. Several different codes can be used to color different document zones, text lines, or even individual words. This is useful for page, line, or word segmentation tasks as well as for content distinction. For handwriting recognition, it is possible to export individual text line or word images, accompanied by their transcription. For online handwritten document analysis, feature vectors of traces or even individual sampling points can be exported directly, labeled with a class number representing the intended classification.

5. SOFTWARE
InkAnno is the name of the software tool used to handle the documents of this database. It is written in the Java programming language and therefore runs on a number of platforms. To export images it depends on the JAI (Java Advanced Imaging) library (see footnote 5), and for PDF export the itext library (see footnote 6) must be installed.

The core of the software is a library which reflects the structure of an InkML document in a changeable model. Using this library, the software offers a graphical user interface which can be used to display the documents, edit the TraceView tree, modify annotations, and export the document into different formats. These formats include images, color-coded images, PDFs, lists of feature vectors, and of course InkML files. New exporters can be developed with little effort.

As InkML is quite complex, it is not recommended to reimplement interpreters but to use existing libraries. As of now, the library proposed here is the only one implementing enough language elements to interpret this database. However, there are other projects, like the InkML toolkit project (see footnote 7), heading towards full InkML compatibility.
InkAnno is published under the open source GPL license and is therefore freely accessible (see footnote 8).

6. EXPERIMENTS
The purpose of this database is primarily to develop and evaluate methods for the distinction of different content types. In this section we present two methods which address the text vs. non-text distinction problem.

6.1 Trace classification
The first method was introduced by Jain et al. [8]. It classifies the individual ink traces using two simple features: the length of the trace and the accumulated curvature. Jain et al. tested this method on a dataset which is not available to the public. It was not clear from their paper which classifier was used, but it was probably linear regression. They achieved a recognition rate of 99.1% on a test set with 35,882 text traces and 930 non-text traces.

In our experiment, which reruns the method on the database presented in this paper, an SVM classifier is used instead. SVMs, like linear regression, are capable of finding a linear separation between two classes. Taking advantage of the kernel trick, however, it is possible to increase the dimensionality of the feature space. The hyperplane found by the SVM, which is linear in the higher-dimensional space, corresponds to a nonlinear separation when projected to the original space. Using this classifier we can expect a result at least as good as with linear regression.

Footnote 5: Java advanced imaging library: net/
Footnote 6: itext library:
Footnote 7: See:
Footnote 8: Visit the homepage of our research group for more information
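The two trace features of Section 6.1 can be computed from the sampled pen coordinates alone. The sketch below is an illustration under stated assumptions, not the implementation used in the experiments: it takes the trace length as the sum of the distances between consecutive points and the accumulated curvature as the sum of the absolute turning angles between consecutive segments. The exact definitions in [8] may differ, and both function names are hypothetical.

```python
import math


def trace_length(points):
    """Sum of Euclidean distances between consecutive sampling points."""
    return sum(math.hypot(x2 - x1, y2 - y1)
               for (x1, y1), (x2, y2) in zip(points, points[1:]))


def accumulated_curvature(points):
    """Sum of absolute turning angles (radians) between consecutive segments.

    Assumed definition; zero-length segments are skipped because they have
    no direction.
    """
    total = 0.0
    for a, b, c in zip(points, points[1:], points[2:]):
        v1 = (b[0] - a[0], b[1] - a[1])
        v2 = (c[0] - b[0], c[1] - b[1])
        if v1 == (0, 0) or v2 == (0, 0):
            continue
        cross = v1[0] * v2[1] - v1[1] * v2[0]
        dot = v1[0] * v2[0] + v1[1] * v2[1]
        total += abs(math.atan2(cross, dot))
    return total


straight = [(0, 0), (1, 0), (2, 0)]
corner = [(0, 0), (1, 0), (1, 1)]
features = [(trace_length(t), accumulated_curvature(t))
            for t in (straight, corner)]
print(features)
```

Both example traces have the same length, but only the one with a right-angle corner accumulates curvature, which is the kind of difference a classifier on these two features can exploit.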
[Figure 6: bar chart of the recognition rate for Total, Text, Table, Drawing, and Diagram.]
Figure 6: Results of text vs. non-text trace classification, displayed for individual content types.

[Figure 7: bar chart of the recognition rate for Total, Text, Table, Drawing, and Diagram.]
Figure 7: Results of text vs. non-text connected component classification, displayed for individual content types.

As one can see in Figure 6, the recognition rate of the SVM on this database is 91.3%, which is significantly lower than the 99.1% reported in [8]. The difference can be explained by the fact that the dataset used by Jain et al. is highly unbalanced. Even if every trace were classified as text, the recognition rate would be 97.5% (35,882 of 36,812 traces). With a text to non-text ratio of roughly 4:1, this is not the case for the database presented here. Since the features are the same and the classifiers can be seen as performing equally well, the difference must lie in the challenge posed by the two datasets. The IAMonDo-database, with its balanced content and varying style, poses a problem similar to that of real-world documents and therefore serves as a suitable basis for measuring the performance of text vs. non-text distinction.

6.2 Connected component classification
The second method considers only offline information. After applying a connected component analysis (8-neighborhood) to the document, 82 features are extracted from each connected component. Keysers et al. [9] have investigated different feature sets for document zone classification. They proposed a well-performing set of run-length histograms and connected component statistics which can be extracted very fast. The features used in the current paper for the classification of the connected components are similar to those. The feature set consists of run-length histograms of black and white pixels along the horizontal, vertical, and the two diagonal directions. Each histogram uses eight bins, counting runs of length 1, 3, 7, 15, 31, 63, 127, and 128. For each histogram the mean and the variance are considered.
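The run-length features can be sketched for a small binary image. This is an illustration only, under several assumptions that the text does not settle: the bin values 1 to 127 are read as upper bounds with the last bin collecting all runs of 128 pixels or more, mean and variance are taken over the run lengths, and black pixels are encoded as 1. All function names are hypothetical.

```python
from statistics import mean, pvariance

BIN_UPPER = [1, 3, 7, 15, 31, 63, 127]  # assumed upper bounds; last bin: >= 128


def scan_lines(img):
    """Return pixel sequences along rows, columns and the two diagonals."""
    h, w = len(img), len(img[0])
    rows = [list(r) for r in img]
    cols = [[img[y][x] for y in range(h)] for x in range(w)]
    diag = [[img[y][y + d] for y in range(h) if 0 <= y + d < w]
            for d in range(-(h - 1), w)]
    anti = [[img[y][s - y] for y in range(h) if 0 <= s - y < w]
            for s in range(h + w - 1)]
    return [rows, cols, diag, anti]


def run_lengths(seq, value):
    """Lengths of maximal runs of `value` in a pixel sequence."""
    runs, n = [], 0
    for v in seq:
        if v == value:
            n += 1
        elif n:
            runs.append(n)
            n = 0
    if n:
        runs.append(n)
    return runs


def histogram(runs):
    """Count runs per bin; runs longer than 127 fall into the last bin."""
    hist = [0] * (len(BIN_UPPER) + 1)
    for r in runs:
        idx = next((i for i, u in enumerate(BIN_UPPER) if r <= u),
                   len(BIN_UPPER))
        hist[idx] += 1
    return hist


def component_features(img):
    """4 directions x 2 colors x (8 bins + mean + variance) + width + height."""
    feats = []
    for direction in scan_lines(img):
        for value in (1, 0):  # black runs first, then white runs
            runs = [r for seq in direction for r in run_lengths(seq, value)]
            feats += histogram(runs)
            feats.append(mean(runs) if runs else 0.0)
            feats.append(pvariance(runs) if runs else 0.0)
    feats += [len(img[0]), len(img)]
    return feats


feats = component_features([[1, 1, 0],
                            [0, 1, 0]])
print(len(feats))  # 82
```

Eight histograms with ten values each give 80 features, and the width and height of the component bring the vector to 82 entries.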
Adding the width and height of the considered connected component, the total comes to 82 features. As in the trace classification, an SVM is trained on a training set. It achieves 94.4% correctly classified pixels (see Figure 7 and footnote 9). Note that in this experiment only offline information is used.

7. CONCLUSION AND OUTLOOK
A database of online handwritten documents containing text blocks, drawings, diagrams, formulas, lists, and tables as well as markings has been described in this paper. The documents were written in an unconstrained manner and reflect documents generated in a realistic context. With 200 writers, many different styles of writing and drawing can be found. However, the local and global distribution of the content types is kept rather stable over all individual documents.

The database can serve as a basis for content distinction, marking recognition, layout analysis, formula recognition, handwriting recognition, word and text line segmentation, and many more tasks. Various formats of ground truth can be generated thanks to the detailed annotation of the documents. A software system has been presented which allows easy access to the documents. It can also be used to display, modify, and transform the documents.

Two text vs. non-text distinction methods have been presented in this paper. The first one is based on a method proposed in [8] and is applied to show the challenge this dataset poses when compared with another dataset. The other experiment applies simple methods using only the offline information of the database. The results may serve as a benchmark for future systems developed for this database.

At our institute we plan to use the database for ongoing research. The database itself is considered complete at the current stage. Annotation errors that are occasionally detected will be corrected. The software tool, InkAnno, will undergo further development, hopefully in collaboration with other institutes.
Footnote 9: Note that the percentage of correctly classified pixels in the connected component classification cannot be compared directly to the percentage of correctly classified traces used for the trace classification.
Acknowledgments. We would like to thank all the individuals who contributed to the database described in this paper. Particular thanks go to Karin Indermühle for her recruiting skills and Amrei Schröttke for annotating the whole database. Further thanks are due to IM2.VP for financial support.

8. REFERENCES
[1] M. Agrawal, K. Bali, and S. Madhvanath. Upx: A new xml representation for annotated datasets of online handwriting data. In Proc. 8th Int. Conf. on Document Analysis and Recognition, pages ,
[2] C. M. Bishop, M. Svensen, and G. E. Hinton. Distinguishing text from graphics in on-line handwritten ink. In Proc. 9th Int. Workshop on Frontiers in Handwriting Recognition, pages , Washington, DC, USA, IEEE Computer Society.
[3] H. Bunke. Recognition of cursive Roman handwriting past, present and future. In Proc. 7th Int. Conf. on Document Analysis and Recognition, Edinburgh, pages IEEE,
[4] Y.-M. Chee, M. Froumentin, and S. Watt, editors. Ink markup language (InkML). World Wide Web Consortium,
[5] W. Francis and H. Kucera. Manual of information to accompany a standard sample of present-day edited American English for use with digital computers. Department of Linguistics, Brown University,
[6] I. Guyon, L. Schomaker, R. Plamondon, M. Liberman, and S. Janet. UNIPEN project of on-line data exchange and recognizer benchmarks. In Proc. 12th Int. Conf. on Pattern Recognition, volume 2, pages vol.2, Oct
[7] J. Hull. A database for handwritten text recognition research. IEEE Transactions on Pattern Analysis and Machine Intelligence, 16(5): ,
[8] A. K. Jain, A. M. Namboodiri, and J. Subrahmonia. Structure in on-line documents. In Proc. 6th Int. Conf. on Document Analysis and Recognition, pages ,
[9] D. Keysers, F. Shafait, and T. M. Breuel. Document image zone classification a simple high-performance approach. In Proc. 2nd Int. Conf. on Computer Vision Theory and Applications, pages 44 51,
[10] M. Liwicki and H. Bunke. IAM-OnDB an on-line English sentence database acquired from handwritten text on a whiteboard. In Proc. 8th Int. Conf. on Document Analysis and Recognition, volume 2, pages ,
[11] J. Lladós, E. Valveny, G. Sánchez, and E. Martí. Symbol recognition: Current advances and perspectives. In Proc. 4th Int. Workshop on Graphics Recognition Algorithms and Applications, pages , London, UK, Springer-Verlag.
[12] U.-V. Marti and H. Bunke. The IAM-database: an English sentence database for offline handwriting recognition. Int. Journal on Document Analysis and Recognition, 5:39 46,
[13] M. Nakagawa and M. Onuma. On-line handwritten japanese text recognition free from constrains on line direction and character orientation. In Proc. 7th Int. Conf. on Document Analysis and Recognition, pages , Edinburgh, Scotland,
[14] T. A. Nartker, S. V. Rice, and S. E. Lumos. Software tools and test data for research and testing of page-reading ocr systems. In International Symposium on Electronic Imaging Science and Technology, volume 1, pages SPIE,
[15] I. Phillips, J. Ha, R. Haralick, and D. Dori. The implementation methodology for a cd-rom english document database. In Proc. 2nd Int. Conf. on Document Analysis and Recognition, pages , Oct
[16] F. Shafait, D. Keysers, and T. M. Breuel. Pixel-accurate representation and evaluation of page segmentation in document images. In Proc. 18th Int. Conf. on Pattern Recognition, volume 1, pages ,
[17] C. Viard-Gaudin, P. M. Lallican, P. Binter, and S. Knerr. The ireste on/off (ironoff) dual handwriting database. In Proc. 5th Int. Conf. on Document Analysis and Recognition, pages , Los Alamitos, CA, USA, IEEE Computer Society.
[18] R. Zanibbi, D. Blostein, and R. Cordy. A survey of table recognition: Models, observations, transformations, and inferences. Int. Journal of Document Analysis and Recognition, 7(1):1 16, 2004.
APPENDIX
A. SAMPLE DOCUMENTS
