ICFHR 2010 Tutorial: Multimodal Computer Assisted Transcription of Handwriting Images




ICFHR 2010 Tutorial: Multimodal Computer Assisted Transcription of Handwriting Images

Part III: Multimodality in Computer Assisted Transcription

Alejandro H. Toselli, Moisés Pastor and Verónica Romero
{ahector,moises,vromero}@iti.upv.es
Pattern Recognition and Human Language Technology Group
Instituto Tecnológico de Informática
Universidad Politécnica de Valencia, Spain
November 2010

Tutorial Contents and Schedule

I. Introduction
   - Multimodal Interaction in Pattern Recognition
   - Quick Survey of Handwritten Text Recognition (HTR) concepts and techniques
   - Interactive-Predictive Pattern Recognition and Document Image Analysis

I-p. Off-line HTR in practice
   - HTR Preprocessing
   - Training HMMs using the Hidden Markov Model Toolkit (HTK)
   - Training Language Models and Dictionaries for HTR
   - HTR Experiments

II. Computer-Assisted Transcription of Text Images (CATTI)
   - Human interaction in HTR
   - A CATTI formal framework
   - Feedback-derived dynamic language modelling and search
   - Performance measures and results achieved in typical applications

II-p. CATTI in practice
   - Adapting Language Models and Search for CATTI
   - CATTI Experiments
   - Analyzing quantitatively the CATTI performance

III. Multimodality in CATTI (MM-CATTI)
   - Touchscreen-based multimodal user correction
   - A formal framework
   - Multimodal language modelling and search
   - Performance measures and results achieved in typical applications

III-p. Demonstration of a complete system in a real HTR task

Page III.1

Index

1. Multi-Modal Interaction in Text Image Transcription .... 3
2. Formal framework for MM-CATTI .... 6
3. Multimodal language modelling and search .... 13
4. Experiments .... 16
5. Multimodal IPPR for other Document Processing tasks .... 21
6. Conclusions .... 24
7. Bibliography .... 25
8. MM-CATTI demonstration in a real HTR task .... 26

Page III.2

Multimodal Interaction for Text Transcription

- Human feedback for CATTI has so far been assumed to come in the form of keyboard and mouse actions.
- At the expense of losing the deterministic accuracy of this traditional input modality, more ergonomic multimodal interfaces are possible: e.g., voice, gaze tracking, etc.
- A very natural modality for text transcription is on-line HTR on a touchscreen: Multimodal CATTI (MM-CATTI).
- The increased ergonomics come at the cost of new errors expected from the decoding of the feedback signals.
- Solving the multimodal interaction problem requires achieving a modality synergy, where the main and feedback data streams help each other to optimize overall accuracy.

Page III.3

A multimodal desktop for interactive Text Transcription

(Screenshot: the page image, the on-going transcription "antiguos ci udadanos, que en elcastillo en llamaradas", and the user's corrective pen strokes drawn on it.)

On-line HTR on a touchscreen offers a very natural modality for interactive text image transcription: the original image, the on-going transcription and the user's corrective feedback pen strokes are jointly displayed for the user's comfort.

Page III.4

Multimodal Interaction for Text Transcription (illustration)

Successive transcription hypotheses as the user interacts with the system:

a) antiguas andanadas que en elcastillo en llamaradas
b) antiguas andanadas que en elcastillo en llamaradas
c) antiguos ci udadanos, que en elcastillo en llamaradas
d) antiguos ci udadanos, que en Castilla se llamaron
e) antiguos ci udadanos, que en Castilla se llamaban

Page III.5

General statistical framework for MM-CATTI

Let:
- x be the input image,
- p be a user-validated transcription prefix,
- s' be the previous system-suggested suffix,
- t be the user feedback (on-line touchscreen pen strokes), aimed at accepting or amending parts of s' and/or at adding more text,
- κ be some keystrokes typed by the user to correct (other) parts of s' and/or to add more text.

Now the system has to suggest a new suffix ŝ as a continuation of the prefix p, conditioned by the on-line pen strokes t and the typed text κ. The general problem is to find ŝ given x, p, s', t and κ, considering all possible decodings d of t (i.e., letting d be a hidden variable):

    ŝ = argmax_s Σ_d Pr(s, d | x, p, s', t, κ)

Page III.6
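As a concrete reading of this formula, here is a minimal Python sketch (toy data, not part of the tutorial) that ranks candidate suffixes by marginalizing out the hidden decoding d:

```python
from collections import defaultdict

def best_suffix(joint_scores):
    """Return argmax_s sum_d Pr(s, d | x, p, s', t, kappa).

    joint_scores: dict mapping (suffix, decoding) pairs to joint
    probabilities, assumed already conditioned on x, p, s', t, kappa.
    """
    marginal = defaultdict(float)
    for (suffix, _decoding), prob in joint_scores.items():
        marginal[suffix] += prob  # sum over the hidden decoding d
    return max(marginal, key=marginal.get)

# Toy usage: two candidate suffixes, two decodings of the pen strokes t.
scores = {("se llamaban", "edad"): 0.30, ("se llamaban", "dad"): 0.05,
          ("se llamaron", "edad"): 0.20, ("se llamaron", "dad"): 0.10}
print(best_suffix(scores))  # -> "se llamaban"
```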

A more realistic framework

In the general formulation, the user can type independently of the result of the on-line handwritten decoding process. This is not realistically useful in practice.

More realistic scenario: wait for the decoding ˆd of the touchscreen data t prior to typing κ. These keystrokes will typically be used to fix possible on-line handwritten recognition errors in ˆd.

Each interaction step can be formulated in three phases (see the sketch after the example below):

1. The system relies on the input image and the previous transcription prefix in order to search for a good transcription suffix, as in CATTI:

       ŝ = argmax_s Pr(s | x, p)

2. Given ŝ, the user produces (possibly null) on-line pen strokes t, and the system decodes t into an optimal character, word (or word sequence) ˆd:

       ˆd = argmax_d Pr(d | x, p, ŝ, t)

3. The user enters amendment keystrokes κ, if necessary, and consolidates a new prefix p based on the previous p, ˆd, κ and parts of ŝ.

Page III.7

Example (x is the input text image; touch-screen corrections were shown in red on the original slide, keyboard corrections underlined):

  Step     Op        Current hypothesis
  STEP-0   ŝ (≡ ŵ)   opposite this Comment Bill in that thought
  STEP-1   p, t      opposite this Comment Bill in that thought   (user pen-writes over "opposite")
           ˆd        opposed this Comment Bill in that thought    (on-line decoding, correct)
           κ, p      opposed this Comment Bill in that thought    (no keystrokes needed; p = "opposed")
           ŝ (s')    opposed the Government Bill in that thought
  STEP-2   p, t      opposed the Government Bill in that thought  (user pen-writes over "in")
           ˆd        opposed the Government Bill whack thought    (on-line decoding error)
           κ         opposed the Government Bill which thought    (user types "which")
           p         opposed the Government Bill which brought    (p = "opposed the Government Bill which")
           ŝ (s')    opposed the Government Bill which brought
  FINAL    p, t      opposed the Government Bill which brought
           κ         opposed the Government Bill which brought #  (user accepts with "#")
           p ≡ T     opposed the Government Bill which brought

POST-EDIT: 6 corrections. CATTI: 2 keyboard corrections. MM-CATTI: 3 corrections, 2 touch-screen + 1 keyboard.

Page III.8
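A schematic Python rendering of this three-phase loop; the callbacks `offline_suffix`, `online_decode` and `get_user_feedback` are placeholders for illustration, not the tutorial's actual interfaces:

```python
def mm_catti_session(x, offline_suffix, online_decode, get_user_feedback):
    """Schematic MM-CATTI interaction loop over the three phases above."""
    p = ""  # user-validated prefix, initially empty
    while True:
        # Phase 1: off-line HTR proposes a suffix continuing the prefix.
        s_hat = offline_suffix(x, p)
        # The user validates a correct part of s_hat and reacts to the
        # first error with pen strokes t and/or keystrokes kappa.
        accepted, t, kappa = get_user_feedback(p, s_hat)
        if t is None and kappa is None:
            return p + s_hat  # whole suggestion accepted: done
        # Phase 2: decode the (possibly empty) pen strokes into d_hat.
        d_hat = online_decode(t, x, p, s_hat) if t else ""
        # Phase 3: keystrokes fix on-line decoding errors; consolidate the
        # new prefix from p, the accepted part of s_hat, d_hat and kappa.
        p = p + accepted + (kappa if kappa else d_hat)
```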

Decoding the feedback data for MM-CATTI

Assuming t is independent of x, p and ŝ given d, the feedback pen-stroke decoding can be rewritten as:

    ˆd = argmax_d Pr(d | x, p, ŝ, t)
       = argmax_d Pr(d, x, p, ŝ, t)
       = argmax_d Pr(x, p, ŝ) · Pr(d | x, p, ŝ) · Pr(t | d, x, p, ŝ)
       = argmax_d Pr(d | x, p, ŝ) · Pr(t | d)

- Pr(t | d) is modelled by HMM models of the word(s) in d.
- Pr(d | x, p, ŝ) is provided by a language model constrained by the input image x, by the previous prefix p and by the suffix ŝ.

In practice, a grammar scale factor α is generally used:

    ˆd = argmax_d Pr(d | x, p, ŝ)^α · Pr(t | d)^(1−α)

Several assumptions and constraints can be adopted for Pr(d | x, p, ŝ).

Page III.9

MM-CATTI feedback decoding: LM constraints

Different level-constrained cases can be considered:

- Most complex case: use all the conditions in Pr(d | x, p, ŝ), including the input image x (not considered for now).
- Simplest, baseline case: pure on-line HTR, ignoring all the available conditions: Pr(d | x, p, ŝ) ≈ P(d) (standard N-gram).
- Error-conditioned case: consider the wrong word(s) the user tries to correct: Pr(d | x, p, ŝ) ≈ P(d | s_e).
- Compromise: only consider the information provided by the prefix p and the off-line HTR prediction ŝ: Pr(d | x, p, ŝ) ≈ P(d | p, ŝ).

Page III.10
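In log space this α-weighted combination is a plain log-linear score. A minimal sketch, assuming a candidate list with language-model and HMM scores is already available (names and the α value are illustrative):

```python
import math

def decode_feedback(candidates, alpha=0.7):
    """Return argmax_d P(d | x, p, s_hat)^alpha * P(t | d)^(1 - alpha).

    candidates: dict d -> (lm_prob, hmm_prob), where lm_prob approximates
    Pr(d | x, p, s_hat) and hmm_prob is the HMM likelihood Pr(t | d).
    alpha is the grammar scale factor, tuned empirically.
    """
    def score(d):
        lm, hmm = candidates[d]
        return alpha * math.log(lm) + (1 - alpha) * math.log(hmm)
    return max(candidates, key=score)

# Toy usage: the LM strongly prefers "edad" even though the HMM slightly
# prefers "dad".
print(decode_feedback({"edad": (0.4, 0.01), "dad": (0.1, 0.02)}))  # edad
```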

MM-CATTI feedback decoding: LM constraints (continued)

Compromise: only consider the information provided by the prefix p and the off-line HTR prediction ŝ: Pr(d | x, p, ŝ) ≈ P(d | p, ŝ).

This simplifies the feedback pen-stroke decoding, while still achieving an effective synergy between the main and feedback modalities:

- The user accepts an initial correct part of ŝ, denoted ŝ_a; this validates a new overall correct prefix p' = p ŝ_a and signals the first erroneous word(s) of ŝ, denoted ŝ_e.
- The user produces some (possibly null) pen strokes t aimed at amending ŝ_e.
- The multimodal (on-line HTR) decoder is constrained to find a transcription of t which is a suitable continuation of p' and depends on the wrong word(s) ŝ_e: P(d | p, ŝ) ≈ P(d | p', ŝ_e).

Page III.11

(The interaction example of Page III.8 is repeated here to illustrate these constraints: same operations and the same correction counts.)

Page III.12

Language Modeling and Search

To simplify notation, p and e are used instead of p' and ŝ_e, where e can be a sequence of one or more words or characters (just one in practice). Assuming e is a single word, the decoder should produce a word hypothesis ˆd for the pen strokes t. The language model constraints can be approached using n-grams, depending on the multimodal scenario considered.

Baseline P(d): does not take into account any interaction-derived information; since only whole-word touchscreen corrections are assumed, only uni-grams actually make sense.

Error-conditioned model P(d | e):

    P(d | e) = 0                    if d = e
    P(d | e) = P(d) / (1 − P(e))    if d ≠ e

Page III.13

Language Modeling and Search (continued)

Compromise: the prefix-and-error conditioned language model can be approached as:

    P(d | p, e) ≈ 0                                                    if d = e
    P(d | p, e) ≈ P(d | p_{k−n+2}^{k}) / (1 − P(e | p_{k−n+2}^{k}))    if d ≠ e

where k is the length of p.

The information provided by the interaction process should allow for improved feedback-decoding accuracy.

Page III.14
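Both conditioned distributions amount to masking the erroneous word and renormalizing. A sketch with illustrative names (`ngram_prob` is an assumed callback standing in for an n-gram model, not a library function):

```python
def error_conditioned(unigram, e):
    """P(d | e): zero out the wrong word e and renormalize, as above."""
    return {d: (0.0 if d == e else pr / (1.0 - unigram[e]))
            for d, pr in unigram.items()}

def prefix_error_conditioned(ngram_prob, context, e, vocab):
    """P(d | p, e): n-gram continuation of the prefix context with the
    erroneous word e disabled. ngram_prob(d, context) is assumed to
    return P(d | p_{k-n+2}^{k})."""
    pe = ngram_prob(e, context)
    return {d: (0.0 if d == e else ngram_prob(d, context) / (1.0 - pe))
            for d in vocab}

# Toy usage of the error-conditioned unigram:
print(error_conditioned({"edad": 0.5, "media": 0.3, "epoca": 0.2}, "media"))
# -> {'edad': 0.714..., 'media': 0.0, 'epoca': 0.285...}
```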

Example of building a dynamic 2-gram LM

(Figure: on the left, the original bi-gram model P(w) used by the off-line HTR system, with states for the words "la", "edad", "media", "epoca", "actual" and "de"; on the right, the conditioned bi-gram model P(d | p, e) derived from it for prefix = "la" and disabled word = "media".)

From the original 2-gram model (used by the off-line HTR system), a prefix-and-error conditioned 2-gram sub-model is derived, which takes as initial state the one corresponding to the prefix "la". This simplified language model is used by the on-line HTR feedback sub-system to recognize the handwritten word "edad", intended to replace the wrong, off-line misrecognized word "media", which is disabled by this model.

Page III.15

MM-CATTI experimental framework

- As in the case of CATTI, all the experiments assume interaction happens only at the word level; that is, each interaction step involves the correction of a single, whole word from the system-predicted suffix.
- Evaluating the feedback subsystem requires on-line touch-screen data. This data is the very essence of human interaction, but for experimental purposes we cannot afford the permanent availability of users to test the system.
- The production of on-line feedback data can be properly simulated using a publicly available on-line handwritten corpus: UNIPEN.
- Furthermore, this allows the experiments to be fully reproducible by strictly using public resources.

Page III.16
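A small sketch of this sub-model derivation, using made-up probabilities in the spirit of the slide's "la" / "media" / "edad" example:

```python
def conditioned_bigram(bigram, prev_word, disabled):
    """Build the conditioned distribution of the example: start from the
    bigram state of the last prefix word, remove the disabled
    (misrecognized) word, and renormalize the remaining successors."""
    row = dict(bigram[prev_word])
    mass = row.pop(disabled, 0.0)
    return {w: pr / (1.0 - mass) for w, pr in row.items()}

# Toy bigram table (the probabilities are invented for illustration):
bigram = {"la": {"edad": 0.4, "media": 0.35, "epoca": 0.25}}
print(conditioned_bigram(bigram, "la", "media"))
# -> {'edad': 0.615..., 'epoca': 0.384...}: with "media" disabled,
#    "edad" becomes the most likely decoding of the pen strokes.
```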

Simulating touch-screen feedback with the UNIPEN corpus

- Every word that the user should write on the touchscreen needs to be obtained from UNIPEN data.
- The touchscreen data for each needed word are generated by concatenating the data from random character samples from three UNIPEN categories: 1a (digits), 1c (lowercase letters) and 1d (symbols).
- To increase realism, each needed word is produced using only characters belonging to the same writer, for each of three randomly chosen writers. A sketch of this generation procedure is given after the tables below.

(Figure: examples of words generated by concatenating UNIPEN characters from three test writers, along with samples of the same words written by two real writers.)

Page III.17

Touch-screen feedback and UNIPEN datasets

Training: 42 symbols and digits, plus the 1,000 most frequent English and Spanish words, generated from UNIPEN characters of 17 writers.

Test: for each off-line HTR task, the number of unique on-line words and word instances needed as feedback to correct the word errors made by the plain off-line HTR system:

  Task      Unique words   Word instances
  ODEC-M3   378            753
  IAMDB     510            755
  CS        648            1,196
  Total     1,536          2,704

Overall statistics of the UNIPEN training and test data used in the experiments:

  Number of different:   Train    Test    Lexicon
  writers                17       3       -
  digits (1a)            1,301    234     10
  letters (1c)           12,298   2,771   26
  symbols (1d)           3,578    3,317   32
  total characters       17,177   6,322   68

Page III.18
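A rough sketch of the word-synthesis procedure. The `samples` layout is an assumption for illustration; real UNIPEN data needs its own parsing:

```python
import random

def synthesize_word(word, samples, writers, rng=random.Random(0)):
    """Simulate touch-screen input for `word` by concatenating on-line
    character samples from a single UNIPEN writer, as described above.

    samples: dict writer -> char -> list of point sequences, where a
    point sequence is a list of (x, y) pen positions.
    """
    writer = rng.choice(writers)  # all characters from the same writer
    trajectory, x_offset = [], 0.0
    for ch in word:
        points = rng.choice(samples[writer][ch])
        shifted = [(x + x_offset, y) for (x, y) in points]
        trajectory.extend(shifted)
        # Place the next character to the right, with a small gap.
        x_offset = max(x for x, _ in shifted) + 10.0
    return trajectory
```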

MM-CATTI on-line HTR decoding accuracy

Writer-averaged feedback decoding error rates for the different corpora and three language models: plain unigram (U, baseline), error-conditioned unigram (U_e) and prefix-and-error conditioned bigram (B_e). The relative accuracy improvements of U_e and B_e with respect to U are shown in the last two columns.

  Corpus    Lexicon   Feedback ER (%)       Relative Improv. (%)
                      U     U_e    B_e      U_e    B_e
  ODEC-M3   2,790     5.1   5.0    3.1      2.0    39.2
  IAMDB     8,017     4.6   4.3    3.5      6.5    23.9
  CS-page   2,623     6.4   6.2    5.8      3.1    9.3

U, U_e and B_e correspond to P(d), P(d | e) and P(d | p, e), respectively. The baseline is U, with no interaction-derived constraints.

Page III.19

CATTI / MM-CATTI empirical results

From left to right: post-editing corrections (WER), interactive corrections needed (WSR) for CATTI and for MM-CATTI with the U and B_e feedback models, and overall estimated effort reduction (EFR) for CATTI and MM-CATTI. All results are percentages.

  Corpus    Post-edit   CATTI   MM-CATTI           Overall EFR
            WER         WSR     WSR_U   WSR_Be     CATTI   MM-CATTI (U / B_e)
  ODEC      22.9        18.9    19.7    19.5       17.5    14.0 / 14.8
  IAMDB     25.3        21.1    22.1    21.8       16.6    12.6 / 13.8
  CS-page   28.5        26.9    28.6    28.4       5.6     -0.04 / 0.40

- WER, WSR: word error rate and word stroke ratio. In the MM-CATTI case, the WSR takes into account the total amount of interactions (touchscreen + keystroke inputs); WSR_U and WSR_Be correspond to the feedback models P(d) and P(d | p, e), respectively.
- EFR: overall estimated effort reduction, assuming equal effort for touch-screen (TS) and keyboard (KBD) interactions.

Page III.20
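The derived columns in both tables are consistent with two simple ratios; these formulas are inferred from the tabulated numbers (they reproduce them exactly) rather than stated explicitly on the slides:

```latex
% Relative improvement of model M over the unigram baseline U:
\[
\mathrm{Improv}(M) = 100\cdot\frac{\mathrm{ER}_U - \mathrm{ER}_M}{\mathrm{ER}_U},
\qquad \text{e.g. ODEC-M3, } B_e:\; 100\cdot\frac{5.1 - 3.1}{5.1} \approx 39.2
\]

% Estimated effort reduction with respect to post-editing:
\[
\mathrm{EFR} = 100\cdot\frac{\mathrm{WER} - \mathrm{WSR}}{\mathrm{WER}},
\qquad \text{e.g. ODEC, CATTI: } 100\cdot\frac{22.9 - 18.9}{22.9} \approx 17.5
\]
```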

Multimodal Interaction for other Document Processing tasks?

- Image Restoration and Enhancement
- Layout Analysis
- Skew Correction
- Line Detection
- Slant Correction
- Size Normalization
- ... more?

Page III.21

Multimodal Interaction for Document Layout Analysis?

(Illustration only.)

Page III.22

Multimodal Interaction for Document Layout Analysis?

(Illustration only.)

Page III.23

General Conclusions of the Tutorial

- Current HTR accuracy is not enough for fully automatic, high-quality transcription of most handwritten text images of interest.
- Human post-editing can be very expensive and is hardly acceptable to professional transcribers (e.g., palaeographers).
- Computer-assisted, interactive-predictive processing offers promise for significant improvements in practical performance and user acceptance.
- Computer-assisted, interactive-predictive processing can also be useful for many other Document Analysis tasks.

Page III.24

Bibliography

- E. Vidal, F. Casacuberta, L. Rodríguez, J. Civera and C. Martínez. Computer-assisted translation using speech recognition. IEEE Transactions on Audio, Speech and Language Processing, 14(3):941-951, 2006.
- E. Vidal, L. Rodríguez, F. Casacuberta and I. García-Varea. Interactive Pattern Recognition. 4th Joint Workshop on Multimodal Interaction and Related Machine Learning Algorithms (MLMI-07), volume 4892 of LNCS, pp. 60-71. Brno, Czech Republic, June 2007.
- A.H. Toselli, V. Romero and E. Vidal. Computer Assisted Transcription of Text Images and Multimodal Interaction. 5th Joint Workshop on Multimodal Interaction and Related Machine Learning Algorithms (MLMI-08), Utrecht, The Netherlands, September 2008.

Page III.25

MM-CATTI DEMO

Page III.26