CIS Dr. Uwe Springmann a tutorial Kolloquium Korpuslinguistik HU Berlin, 23.04.2014
OCR @ CIS: Centrum für Informations- und Sprachverarbeitung
- The OCR group (led by Prof. Dr. Klaus Schulz) has existed for over 10 years
- Specific areas: interactive and automatic postcorrection, lexical resources for improved OCR results
- Partner in the EU IMPACT project (Improving Access to Text, 2008-2011)
- Open source postcorrection tool PoCoTo
- Since 2013: recognition and postcorrection of Latin texts
Agenda
1. Document acquisition
2. Preprocessing
3. OCR
4. Evaluation
5. Postcorrection
6. Training
1. Document Acquisition
- Deutsche Digitale Bibliothek (DDB): www.ddb.de, with links to page images of the holding institutions
- Bayerische Staatsbibliothek (BSB): www.bsb-muenchen.de or www.digitale-sammlungen.de, sometimes with OCR results for texts
- Google Books: books.google.com, often texts from BSB in lower resolution, but sometimes with OCR even when BSB does not offer it
- Europeana: www.europeana.eu
Book search @ DDB: book on podagra and herbs (Adam von Bodenstein, 1557)
Jump to source
Recommendations
- If you want to correct OCR manually, go to Google (they offer the best freely available OCR results)
- If you want to run your own OCR, go to DDB or BSB and download the pdf (this is the route taken in this tutorial)
- Or scan your own book (at 300 dpi minimum)!
2. Preprocessing
- (page splitting)
- deskewing
- border removal
- cropping
- binarization
- dewarping
- despeckling
- ...
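To make the binarization step concrete, here is a minimal sketch in pure Python. It uses Otsu's global threshold, which is an assumption chosen for illustration and not necessarily what ScanTailor or any OCR engine applies internally. The function names `otsu_threshold` and `binarize` are hypothetical, and a page is represented simply as a flat list of gray values from 0 (black) to 255 (white).

```python
def otsu_threshold(pixels):
    """Return the gray level t that maximises the between-class variance
    when pixels <= t form the dark class and pixels > t the light class."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for t in range(256):
        weight_dark += hist[t]
        sum_dark += t * hist[t]
        if weight_dark == 0:
            continue  # no dark pixels yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        between = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if between > best_var:
            best_var, best_t = between, t
    return best_t


def binarize(pixels):
    """Map every pixel to 0 (ink) or 255 (white background)."""
    t = otsu_threshold(pixels)
    return [255 if p > t else 0 for p in pixels]


# Made-up toy "image": three dark ink pixels and three light paper pixels.
page = [10, 12, 11, 200, 210, 205]
clean = binarize(page)
```

A global threshold like this works for evenly lit scans; unevenly lit or stained pages typically need a locally adaptive method instead.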
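UNIX preprocessing commands

    # extract PDF pages 64-104 into a new file
    pdftk WieSichMeniglich_Basel_1557-original.pdf cat 64-104 output kräuter.pdf
    # split into one PDF per page (pdf/001.pdf, pdf/002.pdf, ...)
    mkdir pdf
    pdftk kräuter.pdf burst output pdf/%03d.pdf
    # convert each single-page PDF to PNG and collect the images in png/
    mkdir png
    cd pdf
    for f in *.pdf; do convert "$f" "${f/%pdf/png}"; done
    mv *.png ../png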
Tutorial: ScanTailor (10 min)
Using your downloaded pdf and ScanTailor (www.scantailor.org), produce a set of clean tif files of pages 65-105.
- 002.png: 3.8 MB (colored image, background counts)
- 002.tif: 64.7 kB (!), more than 50 times smaller (background is white = 0 kB)
Tutorial: gImageReader (5 min)
Convert page 2 (file 002.tif) into text!
4. Evaluation
How good (in % correct characters = accuracy) is your text?
Compare against a correct transcription (= ground truth)!
OCR evaluation toolkit (ISRI/UNLV), adapted to UTF-8 by Nick White:
https://gitorious.org/ancient-greek-training-for-tesseract/ocr-evaluation-tools/source/207f421c198d12f793d3ba0215677a294bed1583

Engine                  Accuracy (%)  Remark
Tesseract (deu-frak)    65.03         raw png, not cleaned tif!
Tesseract (deu-frak)    76.12         s for ſ not counted as error
OCRopus (fraktur)       78.94         s for ſ not counted as error
Google Books            78.18         Google sponsors Tesseract & OCRopus
ABBYY FR 11 (gothic)    83.23         industry leader
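As a rough illustration of what such an accuracy figure measures, the sketch below computes character accuracy as (n - errors) / n, where n is the number of characters in the ground truth and errors is the edit distance (insertions, deletions, substitutions) between ground truth and OCR output. This definition is an assumption made for the sketch. The function names and the `long_s_as_s` switch are hypothetical and are not part of the ISRI/UNLV toolkit; the switch only mimics the "s for ſ not counted as error" setting from the table.

```python
def levenshtein(a, b):
    """Minimum number of single-character edits turning string a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def char_accuracy(ground_truth, ocr_text, long_s_as_s=False):
    """Character accuracy in percent, relative to the ground truth length."""
    if long_s_as_s:
        # Treat long s and round s as the same character on both sides.
        ground_truth = ground_truth.replace("ſ", "s")
        ocr_text = ocr_text.replace("ſ", "s")
    n = len(ground_truth)
    if n == 0:
        raise ValueError("ground truth must not be empty")
    errors = levenshtein(ground_truth, ocr_text)
    return 100.0 * (n - errors) / n


# Made-up example line: the OCR engine printed a round s for the long s.
strict = char_accuracy("diſer wurtzel", "diser wurtzel")
lenient = char_accuracy("diſer wurtzel", "diser wurtzel", long_s_as_s=True)
```

With this definition, accuracy can become negative when the OCR output contains many spurious insertions, so very noisy pages should be inspected by hand.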
5. Postcorrection
Can we clean up messy OCR output?
a) Tesseract: interactive correction in gImageReader (side-by-side view + dictionary)
b) OCRopus: interactive correction in the browser (text line synopsis of image + OCR)
    ocropus-gtedit html book/????/??????.bin.png -o corr.html; firefox corr.html
c) Tesseract: interactive correction in PoCoTo (postcorrection tool of CIS, open source)
OCRopus line synopsis
- The Fraktur model is not too well adapted to the book font (Schwabacher), but it's a start
- Correct the OCR output in the browser
- The generated ground truth can be used for later training
- With better training, OCRopus will yield better results
- Correct remaining errors in the same way
PostCorrectionTool PoCoTo
- Locally installable Java package for postcorrection
- Developed at CIS as part of the EU IMPACT project
- Word synopsis: image + OCR with interactive correction
- Error profiling: calculates a statistical error model that distinguishes a) historical spelling (not an error) from b) proper OCR errors, and proposes the most probable correction candidate
- Batch correction: ranks errors by frequency and error pattern and enables quick correction decisions in a concordance view
- Try it out:
  http://www.digitisation.eu/tools/browse/ocr-post-correction-and-enrichment/post-correction-tool/
  https://github.com/thorstenv/pocoto
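The idea of ranking errors by frequency and pattern can be illustrated with a small sketch. It is not PoCoTo's profiler: it assumes that already corrected (OCR word, correct word) pairs are available, and it makes no attempt to separate historical spellings from OCR errors. The function names and the example word pairs are made up for illustration.

```python
from collections import Counter
from difflib import SequenceMatcher


def error_patterns(ocr_word, correct_word):
    """Return (ocr_fragment, correct_fragment) pairs where the words differ."""
    matcher = SequenceMatcher(None, ocr_word, correct_word, autojunk=False)
    return [(ocr_word[i1:i2], correct_word[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


def rank_patterns(word_pairs):
    """Count error patterns over all word pairs, most frequent first."""
    counts = Counter()
    for ocr_word, correct_word in word_pairs:
        counts.update(error_patterns(ocr_word, correct_word))
    return counts.most_common()


# Made-up corrections; in practice they would come from a correction session.
pairs = [("micht", "nicht"), ("meben", "neben"), ("vnb", "vnd")]
ranking = rank_patterns(pairs)
```

A frequent pattern such as m for n is a natural candidate for batch correction, because one decision in the concordance view can fix many tokens at once.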
PoCoTo (developed at CIS)
error frequency, word synopsis (Tesseract hOCR output), page context
PoCoTo (developed at CIS)
error pattern, concordance with batch correction
6. Training
Can we beat ABBYY (83%) by training an open source engine on the relevant font(s)?
OCRopus training on images: ground truth data available!
pp. 1-34: training set; pp. 35-40: test set

page  acc. (%)
35    97.19
36    98.58
37    98.75
38    97.37
39    97.62
40    97.29

Caution: on p. 35, 12 of the 22 errors are due to confusions in the ground truth between ů (U+016F, precomposed character) and u + U+030A (u followed by the combining ring above). The difference between the two forms is only visible in some fonts.
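One way to avoid this kind of spurious error is to bring ground truth and OCR output into the same Unicode normalization form before comparing them. The sketch below uses NFC from Python's standard `unicodedata` module; whether the ground truth for this book was normalized in this way is not stated here, so treat it as a suggestion. The helper name `normalize_nfc` and the example word are chosen for illustration.

```python
import unicodedata


def normalize_nfc(text):
    """Compose base letters and combining marks where Unicode allows it."""
    return unicodedata.normalize("NFC", text)


precomposed = "m\u016fter"   # ů as the single code point U+016F
combining = "mu\u030ater"    # u followed by U+030A (combining ring above)

# The two strings look identical in most fonts but compare as different.
differ_before = precomposed != combining
# After NFC normalization they are the same sequence of code points.
same_after = normalize_nfc(precomposed) == normalize_nfc(combining)
```

Applying the same normalization to both training and test transcriptions keeps the two encodings of ů from being counted as recognition errors.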
Training result page 36

Kreüter ner erſcheinüg/ vnſerer teütſcher zaun oder hagwurtzel/ gar micht/ welche der mehrertheil balbierer für rechte Ariſtolochiam rotun dam einſamlend. Dioſc. Diſer wurtzel etwas mit wein myrrhen vnd pfeffer getruncken/ reiniget die weiber von vberfliſzigem vn rath der můter/ treibt auſz die an d geburt vñ weiber menſes. Ein ſalb gemacht vonn diſer wurtzen zeitloſen vñ anagallide zeücht vſz ſpreiſſel/ doo rnvñ geſchiferte bein. Hiemitt beſchlies ich mein rede diſer zeit von den zwelff zei chen kreütteren/ begaoren menck lich welle mirs im beſten aufnem men als dañ ichs gethan/ hab ſye weitleüffiger beſchribẽ wellen/ ſo ſind yetziger kürtze viel vrſach_en/ vorauſz dieweil ich groſſen koſten angewendet in ſůchung der kreü ter auſz eignem willen vñ beüt_tel/ ncn
Thank you for your attention!
Questions? springmann@cis.uni-muenchen.de