Automatic Handwritten Character Segmentation for Paleographical Character Shape Analysis

Similar documents
Cursive Handwriting Recognition for Document Archiving

ICFHR 2010 Tutorial: Multimodal Computer Assisted Transcription of Handwriting Images

Online Farsi Handwritten Character Recognition Using Hidden Markov Model

The First Online 3D Epigraphic Library: The University of Florida Digital Epigraphy and Archaeology Project

III. SEGMENTATION. A. Origin Segmentation

Incremental and Offline Handwriting Recognition for the Venice Time Machine

Signature Segmentation from Machine Printed Documents using Conditional Random Field

Visualizing Data: Scalable Interactivity

2. Distributed Handwriting Recognition. Abstract. 1. Introduction

Document Image Retrieval using Signatures as Queries

Format OCR ICR. ID Protect From Vanguard Systems, Inc.

Offline Word Spotting in Handwritten Documents

Welcome page of the German Script Tutorial: script.byu.edu/german.

TRANSKRIBUS. Research Infrastructure for the Transcription and Recognition of Historical Documents

ECE 533 Project Report Ashish Dhawan Aditi R. Ganesan

Signature Segmentation and Recognition from Scanned Documents

Signature verification using Kolmogorov-Smirnov. statistic

Handwritten Text Line Segmentation by Shredding Text into its Lines

NAVIGATING SCIENTIFIC LITERATURE A HOLISTIC PERSPECTIVE. Venu Govindaraju

Efficient on-line Signature Verification System

Machine Learning: Overview

Contribution of Multiresolution Description for Archive Document Structure Recognition

VECTORAL IMAGING THE NEW DIRECTION IN AUTOMATED OPTICAL INSPECTION

OCRopus Addons. Internship Report. Submitted to:

Text versus non-text Distinction in Online Handwritten Documents

AUTOMATIC PHONEME SEGMENTATION WITH RELAXED TEXTUAL CONSTRAINTS

The Role of Size Normalization on the Recognition Rate of Handwritten Numerals

LOCAL SURFACE PATCH BASED TIME ATTENDANCE SYSTEM USING FACE.

A Lightweight and Effective Music Score Recognition on Mobile Phone

2 Signature-Based Retrieval of Scanned Documents Using Conditional Random Fields

Machine Learning and Statistics: What s the Connection?

Character Image Patterns as Big Data

Handwritten Signature Verification using Neural Network

Bayesian Network Modeling of Hangul Characters for On-line Handwriting Recognition

Using Lexical Similarity in Handwritten Word Recognition

Interactive person re-identification in TV series

Keywords image processing, signature verification, false acceptance rate, false rejection rate, forgeries, feature vectors, support vector machines.

Turkish Radiology Dictation System

Template-based Eye and Mouth Detection for 3D Video Conferencing

Offline Recognition of Unconstrained Handwritten Texts Using HMMs and Statistical Language Models. Alessandro Vinciarelli, Samy Bengio and Horst Bunke

An e-learning platform for multiform and interactive education of scholars in Greek Paleography

Introduction to Pattern Recognition

Lecture 9: Introduction to Pattern Analysis

Pattern Recognition of Japanese Alphabet Katakana Using Airy Zeta Function

Automated Multilingual Text Analysis in the Europe Media Monitor (EMM) Ralf Steinberger. European Commission Joint Research Centre (JRC)

Research on Chinese financial invoice recognition technology

The key to successful web design is planning. Creating a wireframe can be part of this process.

A new normalization technique for cursive handwritten words

A Method of Caption Detection in News Video

Script and Language Identification for Handwritten Document Images. Judith Hochberg Kevin Bowers * Michael Cannon Patrick Kelly

PageX: An Integrated Document Processing and Management Software for Digital Libraries

Information Visualization of Attributed Relational Data

STATIC SIGNATURE RECOGNITION SYSTEM FOR USER AUTHENTICATION BASED TWO LEVEL COG, HOUGH TRANSFORM AND NEURAL NETWORK

Off-line Handwriting Recognition by Recurrent Error Propagation Networks

THE recognition of Arabic writing has many applications

Page Frame Detection for Marginal Noise Removal from Scanned Documents

Machine Learning with MATLAB David Willingham Application Engineer

DIAGONAL BASED FEATURE EXTRACTION FOR HANDWRITTEN ALPHABETS RECOGNITION SYSTEM USING NEURAL NETWORK

Signature Region of Interest using Auto cropping

A Dynamic Approach to Extract Texts and Captions from Videos

Automatic Extraction of Signatures from Bank Cheques and other Documents

Improving Traceability of Requirements Through Qualitative Data Analysis

PROCESSING & MANAGEMENT OF INBOUND TRANSACTIONAL CONTENT

Low-resolution Character Recognition by Video-based Super-resolution

A colour Code Algorithm for Signature Recognition

Tools & Resources for Visualising Conversational-Speech Interaction

. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns

Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment

Form Design Guidelines Part of the Best-Practices Handbook. Form Design Guidelines

A Genetic Algorithm-Evolved 3D Point Cloud Descriptor

TEXT-FILLED STACKED AREA GRAPHS Martin Kraus

Increase your efficiency with maximum productivity and minimal work

University of Waterloo Library

Professional Profile

Applying Deep Learning to Car Data Logging (CDL) and Driver Assessor (DA) October 22-Oct-15

Transcription:

International Workshop on Document Analysis Systems Segmentation for Paleographical Character Shape Analysis Théodore Bluche, Dominique Stutzmann, Christopher Kermorvant April 12, 2016

Paleographical Character Shape Analysis Paleography = study of ancient and historical handwriting Goal of character shape analysis: gather occurences of each character and identify different forms or graphical events Digital Humanities: use automatic approaches (computer vision, HTR) to leverage the large quantity of transcribed data Result: about 700M segmented characters = the biggest database for paleographers Shape Analysis Introduction 1 / 21

The ORIFLAMMS Project Ontology Research, Image Features, Letterform Analysis on Multilingual Medieval Scripts Funded by French National Research Agency (ANR) Gloal: Evolution and variability of handwriting Latin manuscripts from Europe 12th-15th centuries Inscriptions, books, registers, charters... Shape Analysis Introduction 2 / 21

Data (a)graal (b)fontenay Figure: Examples from the Graal (Lyons, City Library, PA 77, fol. 187v) and Fontenay Database (Dijon, Archives départementales de Côte d'or, 15 H 203). Shape Analysis Introduction 3 / 21

Open Visualisation of Results 900 pages have been automatically segmented into 21241 lines, 198219 words and 694100 characters! http://oriflamms.teklia.com Shape Analysis Introduction 4 / 21

Overview Introduction Method Results Conclusion Shape Analysis Introduction 5 / 21

Method Introduction Method Results Conclusion Shape Analysis Method 6 / 21

Related work Text-image alignment / Ground-truth mapping: Rothfeder et al. (2006) : G. Washington database : word alignments from text line with HMMs Fischer et al. (2011) : St. Gall database : alignment of inaccurate transcriptions from text line images with HMMs Kornfield et al. (2004); Stamatopoulos et al. (2010); Leydier et al. (2014) : based on image and transcription features Gatos et al. (2014) semi-supervised Feng & Manmatha (2006) : align OCR results with ground-truth (text-to-text) Al Azawi et al. (2013); Bluche et al. (2014) : using FSTs Shape Analysis Method 7 / 21

Goal: Retrieve character segmentation from unsegmented transcribed images Forced alignment: Using HTR for Alignment Uses previous scholarly work Large corpora automation Creates the training data for future HTR Shape Analysis Method 8 / 21

Method 1. apply a text line segmentation algorithm to the full page 2. assign the line transcripts to the line images.3 use them to train a first HMM based on GMMs.4 assign the line transcription to the line images with the trained GMM-HMM 5. based on this new alignment, train a new GMM-HMM recognizer. Finally, train a final text recognizer based on deep neural networks HMMs. Shape Analysis Method 9 / 21

Details of the HTR System Overview: Preprocessing - conversion to gray levels, deskew, deslant, contrast enhancement, height normalization Feature extraction - handcrafted features using a sliding window of width 3px with no overlap Model - Hidden Markov Models (HMM) associated with a sliding window approach segmentation of the ''text image'' as a by-product. HMMs for characters, and for several writing variants: Conjunction: last stroke of the first letter superposed with the first stroke of the second one Elision: initial stroke of a letter is left out Ligature: two or more letters are joined as a single glyph Allograph: the same letter can have different forms these phenomena are of core interest for palaeographers (allow for identification of scribes, dating, broader understanding of the evolutions of the Latin script in the Middle Age) Shape Analysis Method 10 / 21

Graphical Events Modeling With HMMs (a)conjonction (pa) (b)elision (eu) (c)ligature (st) (d)allograph of s (e)allograph of v Figure: Example of lexical modeling for the word ''merveilles'' Shape Analysis Method 11 / 21

Results Introduction Method Results Conclusion Shape Analysis Results 12 / 21

Segmentation Results A lot of data were automatically extracted: Level Graal Fontenay Segmented lines 10,362 1,363 Segmented words 114,273 22,730 Segmented characters 504,5230 128,946 how to evaluate the results? Shape Analysis Results 13 / 21

Segmentation Evaluation Word segmentation manually corrected word positions assess automatic alignment quality corrected boundaries : ref = (ref l, ref r ) segmented boundaries : hyp = (hyp l, hyp r ) Measures: absolute error = hyp l ref l + hyp r ref r, left relative error = ref l hyp l, right relative error = ref r hyp r. Character segmentation randomly selected 2% of these characters using a uniform distribution a palaeographer validated the segmentation rejection if a structural stroke was missing a structural stroke from a neighbour character was added Shape Analysis Results 14 / 21

Word Segmentation - Absolute Error (a) Histogram of absolute word boundary errors in pixels (b) Cumulative histogram of absolute word boundary errors in pixels dashed line is half a character avg. width, plain line is 1 character avg. width Graal : 63% of boundaries are correct with a 11 px tolerance and 99% are correct with a 23px tolerance. Fontenay : 72% of boundaries are correct with a 22px tolerance and 94% are correct with a 45px tolerance. Shape Analysis Results 15 / 21

Word Segmentation - Right and Left Errors words tend to be cropped Shape Analysis Results 16 / 21

Character Segmentation On average on all the sampled characters, the segmentation error was 10.4% for the Graal 13.3% for Fontenay corpus Shape Analysis Results 17 / 21

Conclusion Introduction Method Results Conclusion Shape Analysis Conclusion 18 / 21

Conclusion A lot of characters segmented automatically Despite errors, that quantity of alignment and segmentation helped paleographers for their analysis Next step (in progress) of automation: automatic clustering of character shapes... also : extend this method to align more corpora, and even transcribe new material In the end: successful collaboration in interdisciplinary research aligned corpora will be released publicly at the end of the project (2016) continued collaboration on a new project Shape Analysis Conclusion 19 / 21

Thanks! tb@a2ia.com Shape Analysis Conclusion 20 / 21

References Al Azawi, M., Liwicki, M., & Breuel, T. M. (2013). WFST-based ground truth alignment for difficult historical documents with text modification and layout variations. In Document Recognition and Retrieval. Bluche, T., Moysset, B., & Kermorvant, C. (2014). Automatic line segmentation and ground-truth alignment of handwritten documents. In International Conference on Frontiers in Handwriting Recognition. Feng, S., & Manmatha, R. (2006). A hierarchical, HMM-based automatic evaluation of OCR accuracy for a digital library of books. Joint Conference on Digital libraries. Fischer, A., Frinken, V., Fornés, A., & Bunke, H. (2011). Transcription alignment of Latin manuscripts using hidden Markov models. In Workshop on Historical Document Imaging and Processing. Gatos, B., Louloudis, G., Causer, T., Grint, K., Romero, V., Sánchez, J.-A., Toselli, A. H., & Vidal, E. (2014). Ground-truth production in the transcriptorium project. In Document Analysis Systems (DAS), 2014 11th IAPR International Workshop on, (pp. 237--241). IEEE. Kornfield, E., Manmatha, R., & Allan, J. (2004). Text alignment with handwritten documents. In Int. Workshop on Document Image Analysis for Libraries. Leydier, Y., Eglin, V., Bres, S., & Stutzmann, D. (2014). Learning-free text-image alignment for medieval manuscripts. In Frontiers in Handwriting Recognition (ICFHR), 2014 14th International Conference on, (pp. 363--368). IEEE. Rothfeder, J., Manmatha, R., & Rath, T. M. (2006). Aligning Transcripts to Automatically Segmented Handwritten Manuscripts. In Document Analysis Systems. Stamatopoulos, N., Louloudis, G., & Gatos, B. (2010). Efficient transcript mapping to ease the creation of document image segmentation ground truth with text-image alignment. In Frontiers in Handwriting Recognition (ICFHR), 2010 International Conference on, (pp. 226--231). IEEE. Shape Analysis Conclusion 21 / 21