OCR for historical printings: a tutorial

Dr. Uwe Springmann, CIS
Kolloquium Korpuslinguistik (corpus linguistics colloquium), HU Berlin, 23.04.2014

OCR @ CIS: Centrum für Informations- und Sprachverarbeitung (Center for Information and Language Processing)
- The OCR group (led by Prof. Dr. Klaus Schulz) has existed for over 10 years
- Specific areas: interactive and automatic postcorrection, lexical resources for improved OCR results
- Partner in the EU IMPACT project (Improving Access to Text, 2008-2011)
- Open source postcorrection tool PoCoTo
- Since 2013: recognition and postcorrection of Latin texts

Agenda
1. Document acquisition
2. Preprocessing
3. OCR
4. Evaluation
5. Postcorrection
6. Training

1. Document Acquisition
- Deutsche Digitale Bibliothek (DDB): www.ddb.de, with links to page images of the holding institutions
- Bayerische Staatsbibliothek (BSB): www.bsb-muenchen.de or www.digitale-sammlungen.de, sometimes with OCR results for texts
- Google Books: books.google.com, often texts from BSB in lower resolution, but sometimes with OCR even when BSB does not offer it
- Europeana: www.europeana.eu

Book search @ DDB: book on podagra and herbs (Adam von Bodenstein, 1577)

Jump to source (screenshot not preserved)

Recommendations
- If you want to manually correct OCR, go to Google (they offer the best freely available OCR results)
- If you want your own OCR, go to DDB or BSB and download the PDF (we will be taking this route in this tutorial)
- Or, scan your own book (at 300 dpi minimum)!

2. Preprocessing
- (page splitting)
- deskewing
- border removal
- crop
- binarize
- dewarp
- despeckle
- ...
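
To make the "binarize" step above concrete, here is a minimal sketch in Python. It uses Otsu's method, a standard global thresholding technique chosen here purely for illustration; it is not necessarily what ScanTailor does internally. The function names `otsu_threshold` and `binarize` and the flat list of gray values are assumptions made for this example.

```python
def otsu_threshold(pixels):
    """Return the gray level t (0-255) that maximizes between-class variance.

    Pixels with value <= t are treated as ink, all others as paper.
    Illustrative sketch only; real tools work on 2D images and often
    use locally adaptive thresholds instead of one global value.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    weight_dark = 0   # number of pixels with value <= t
    sum_dark = 0      # sum of gray values of those pixels
    for t in range(256):
        weight_dark += hist[t]
        sum_dark += t * hist[t]
        if weight_dark == 0:
            continue
        weight_light = total - weight_dark
        if weight_light == 0:
            break
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        var_between = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, threshold):
    """Map every pixel to pure black (0) or pure white (255)."""
    return [0 if p <= threshold else 255 for p in pixels]


# Toy data (made up): three dark ink pixels, five light paper pixels.
gray = [30, 40, 35, 200, 210, 220, 205, 215]
t = otsu_threshold(gray)
print(t, binarize(gray, t))
```

A binarized page with a uniformly white background compresses far better than a colored scan, which is why cleaned tif files end up so much smaller than the raw png files.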

UNIX preprocessing commands
    # extract PDF pages 64-104 into a new file
    pdftk WieSichMeniglich_Basel_1557-original.pdf cat 64-104 output kräuter.pdf
    # split into one PDF per page (001.pdf, 002.pdf, ...)
    mkdir pdf
    pdftk kräuter.pdf burst output pdf/%03d.pdf
    # convert every page to png (ImageMagick) and move the results to ../png
    mkdir png
    cd pdf
    for f in *.pdf; do convert "$f" "${f/%pdf/png}"; done
    mv *.png ../png

Tutorial: ScanTailor (10 min)
Using your downloaded PDF and ScanTailor (www.scantailor.org), produce a set of clean tif files of pages 65-105.
- 002.png: 3.8 MB (colored image, background counts)
- 002.tif: 64.7 kB (!), more than 50 times smaller (background is white = 0 kB)

Tutorial: gimagereader (5 min)
Convert page 2 (file 002.tif) into text!

4. Evaluation
How good (in % correct characters = accuracy) is your text?
Compare against a correct transcription (= ground truth)!
OCR evaluation toolkit (ISRI/UNLV), adapted to UTF-8 by Nick White:
https://gitorious.org/ancient-greek-training-for-tesseract/ocr-evaluation-tools/source/207f421c198d12f793d3ba0215677a294bed1583
Engine                 Accuracy (%)   Remark
Tesseract (deu-frak)   65.03          raw png, not cleaned tif!
Tesseract (deu-frak)   76.12          s for ſ not counted as error
OCRopus (fraktur)      78.94          s for ſ not counted as error
Google books           78.18          Google sponsors Tess. & Ocrop.
ABBYY FR 11 (gothic)   83.23          industry leader
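
As a rough illustration of what "% correct characters" means, the following Python sketch computes character accuracy from the edit distance between ground truth and OCR output. It assumes accuracy is defined as (n - errors) / n, with n the number of ground truth characters and errors the Levenshtein distance; the function names and the `long_s_as_s` switch are hypothetical, and this is not the code of the ISRI/UNLV toolkit.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of insertions, deletions
    and substitutions needed to turn string a into string b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,         # delete ca
                               current[j - 1] + 1,      # insert cb
                               previous[j - 1] + cost)) # substitute or match
        previous = current
    return previous[-1]


def char_accuracy(ground_truth, ocr_text, long_s_as_s=False):
    """Character accuracy in percent, relative to the ground truth length.

    With long_s_as_s=True, long s (ſ) and round s are treated as the same
    character, so "s for ſ" is not counted as an error.
    """
    if long_s_as_s:
        ground_truth = ground_truth.replace("ſ", "s")
        ocr_text = ocr_text.replace("ſ", "s")
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = edit_distance(ground_truth, ocr_text)
    return 100.0 * (len(ground_truth) - errors) / len(ground_truth)


# Made-up example line, not taken from the evaluated book pages.
truth = "diſer wurtzel"
ocr = "diser wurtzcl"
print(round(char_accuracy(truth, ocr), 2))
print(round(char_accuracy(truth, ocr, long_s_as_s=True), 2))
```

The second call shows why the two Tesseract rows in the table differ so much: whether a round s in place of ſ counts as an error changes the score considerably for Fraktur text.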

5. Postcorrection
Can we clean up messy OCR output?
a) Tesseract: interactive correction in gimagereader (side-by-side view + dictionary)
b) OCRopus: interactive correction in browser (text line synopsis of image + OCR)
    ocropus-gtedit html book/????/??????.bin.png -o corr.html; firefox corr.html
c) Tesseract: interactive correction in PoCoTo (postcorrection tool of CIS, open source)

OCRopus line synopsis
- The Fraktur model is not too well adapted to the book font (Schwabacher), but it's a start
- Correct the OCR output in the browser
- The generated ground truth can be used for later training
- With better training, OCRopus will yield better results
- Correct remaining errors in the same way

PostCorrectionTool PoCoTo
- locally installable Java package for postcorrection
- developed at CIS as part of the EU IMPACT project
- word synopsis: image + OCR with interactive correction
- error profiling: calculates a statistical error model that distinguishes a) historical spelling (not an error) from b) proper OCR errors, and proposes the most probable correction candidate
- batch correction: ranks errors according to frequency and error pattern and enables quick correction decisions in a concordance view
- try it out:
  http://www.digitisation.eu/tools/browse/ocr-post-correction-and-enrichment/post-correction-tool/
  https://github.com/thorstenv/pocoto
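
The idea of ranking errors by frequency and error pattern can be sketched in a few lines of Python. This is only a toy illustration under simplifying assumptions (single-character substitutions, a ready-made list of corrections); it is not PoCoTo's error profiling algorithm, and the function names and word pairs are made up for this example.

```python
from collections import Counter


def substitution_pattern(ocr_token, corrected_token):
    """Return (wrong_char, right_char) if the two tokens have the same
    length and differ in exactly one position, else None."""
    if len(ocr_token) != len(corrected_token):
        return None
    diffs = [(a, b) for a, b in zip(ocr_token, corrected_token) if a != b]
    return diffs[0] if len(diffs) == 1 else None


def rank_error_patterns(pairs):
    """Count substitution patterns over (ocr_token, corrected_token) pairs
    and return them sorted by descending frequency."""
    counts = Counter()
    for ocr_token, corrected_token in pairs:
        pattern = substitution_pattern(ocr_token, corrected_token)
        if pattern is not None:
            counts[pattern] += 1
    return counts.most_common()


# Hypothetical corrections. "vnd" is a historical spelling, not an OCR
# error, so it stays unchanged and contributes no pattern.
corrections = [
    ("micht", "nicht"),
    ("meben", "neben"),
    ("hans", "haus"),
    ("vnd", "vnd"),
]
for (wrong, right), count in rank_error_patterns(corrections):
    print(f"{wrong} -> {right}: {count}")
```

Sorting by frequency puts the most common confusion first, so that one decision in a concordance view can fix many occurrences at once.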

PoCoTo (developed at CIS)
Views shown: error frequency; word synopsis (Tesseract hOCR output); page context

PoCoTo (developed at CIS)
Views shown: error pattern; concordance with batch correction

6. Training
Can we beat ABBYY (83%) by training an open source engine on the relevant font(s)?
OCRopus training on images: ground truth data available!
pp. 1-34: training set; pp. 35-40: test set
Page   Accuracy (%)
35     97.19
36     98.58
37     98.75
38     97.37
39     97.62
40     97.29
Caution: p. 35 has 12 (out of 22) errors due to confusion in the ground truth between precomposed ů (U+016F) and u followed by the combining ring above (U+030A). The difference is only visible in some fonts.
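
The two encodings of ů can be made comparable by Unicode normalization. The sketch below uses Python's standard `unicodedata` module. Applying NFC to both ground truth and OCR output before evaluation is a suggestion made here, not a step described in the tutorial, and the helper name `normalize_text` is hypothetical.

```python
import unicodedata

precomposed = "\u016f"   # ů as a single code point (U+016F)
combined = "u\u030a"     # u followed by COMBINING RING ABOVE (U+030A)

# The two strings render alike in many fonts but are not equal.
print(precomposed == combined)         # False
print(len(precomposed), len(combined)) # 1 2


def normalize_text(text):
    """Convert text to NFC so that canonically equivalent sequences such
    as u + U+030A become the single precomposed code point U+016F."""
    return unicodedata.normalize("NFC", text)


print(normalize_text(combined) == precomposed)  # True

# A word from the sample output, written in both encodings.
word_a = "m\u016fter"
word_b = "mu\u030ater"
print(normalize_text(word_a) == normalize_text(word_b))  # True
```

Without such a step, an evaluation tool counts every mismatch between the two encodings as a character error even though the OCR output looks correct.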

Training result, page 36
Kreüter ner erſcheinüg/ vnſerer teütſcher zaun oder hagwurtzel/ gar micht/ welche der mehrertheil balbierer für rechte Ariſtolochiam rotun dam einſamlend. Dioſc. Diſer wurtzel etwas mit wein myrrhen vnd pfeffer getruncken/ reiniget die weiber von vberfliſzigem vn rath der můter/ treibt auſz die an d geburt vñ weiber menſes. Ein ſalb gemacht vonn diſer wurtzen zeitloſen vñ anagallide zeücht vſz ſpreiſſel/ doo rnvñ geſchiferte bein. Hiemitt beſchlies ich mein rede diſer zeit von den zwelff zei chen kreütteren/ begaoren menck lich welle mirs im beſten aufnem men als dañ ichs gethan/ hab ſye weitleüffiger beſchribẽ wellen/ ſo ſind yetziger kürtze viel vrſach_en/ vorauſz dieweil ich groſſen koſten angewendet in ſůchung der kreü ter auſz eignem willen vñ beüt_tel/ ncn

Thank you for your attention! Questions?
springmann@cis.uni-muenchen.de