Figure 1: The OCR result of a text block generated by a commercial OCR system, TypeReader 3.0 from ExperVision Inc.. In the graphical user interface f

Similar documents
Using PDF Files in CONTENTdm

Adobe Acrobat 9 Pro Accessibility Guide: PDF Accessibility Overview

PDF Primer PDF. White Paper

MSOW. MSO for the Web MSONet Workstation Configuration Guide

Best practices for producing high quality PDF files

Adobe Acrobat 6.0 Professional

ebooks: Exporting EPUB files from Adobe InDesign

Electronic Docket Filings Michigan Public Service Commission Department of Licensing and Regulatory Affairs

ADOBE ACROBAT X PRO SCAN AND OPTICAL CHARACTER RECOGNITION (OCR)

PDF Accessibility Overview

ENVISION TM WEB HOSTING CHECKLIST

Lesson Review Answers

Links. Blog. Great Images for Papers and Presentations 5/24/2011. Overview. Find help for entire process Quick link Theses and Dissertations

Court Services Online - e-filing. Frequently Asked Questions

DATASHEET Scanned images saved as TIFF or image PDF. s with TIFF or image-based PDF attachments

Chapter 10: Multimedia and the Web

Adobe Acrobat 9 Pro Accessibility Guide: Using the Accessibility Checker

ADOBE DREAMWEAVER CS3 TUTORIAL

Creating Web Pages with Netscape/Mozilla Composer and Uploading Files with CuteFTP

Voluntary Product Accessibility Template Blackboard Learn Release 9.1 April 2014 (Published April 30, 2014)

OneTouch 4.0 with OmniPage OCR Features. Mini Guide

In addition, a decision should be made about the date range of the documents to be scanned. There are a number of options:

Office of History. Using Code ZH Document Management System

Create a PDF File. Tip. In this lesson, you will learn how to:

Version of Barcode Toolbox adds support for Adobe Illustrator CS

OnePurdue Printer Support

GUIDEBOOK FOR TECHNOLOGY COMPETENCIES BOSTON COLLEGE LYNCH SCHOOL OF EDUCATION

Designing for Overseas Chinese Readers: Some Guidelines

Georgia State University s Web Accessibility Policy (proposed)

TUTORIAL 4 Building a Navigation Bar with Fireworks

How To Scan A Document

Note: A WebFOCUS Developer Studio license is required for each developer.

Creating Accessible Word Documents

MerchantConnect Premium

The Document Review Process: Automation of your document review and approval. A White Paper. BP Logix, Inc.

2. Basic operations

The Reporting Console

PaperlessPrinter. Version 3.0. User s Manual

You can start Scan2Text from the Start menu (Start, All Programs, Claro Software, Scan2Text). This brings up the main selection window:

Adobe Dreamweaver Exam Objectives

Surfing the Internet. Dodge County 4-H Tech Team January 22, 2004

OpenIMS 4.2. Document Management Server. User manual

MillMedia Guidelines

A. Scan to PDF Instructions

Computer software. Computer software: summary/overview/abstract (Part 1)

e CABINET AND DOCULEX Document Capture and Electronic File Conversion

11 ways to migrate Lotus Notes applications to SharePoint and Office 365

Chapter 19. Developing Course and Laboratory Homepages for the World Wide Web

Archiving digital documents and s in PDF/A

Creating Accessible PDF Documents with Adobe Acrobat 7.0 A Guide for Publishing PDF Documents for Use by People with Disabilities

Preservation Handbook

Document Management Solutions

SYSTEM DEVELOPMENT AND IMPLEMENTATION

Quick Guide for Accessible PDF July 2013 Training:

Secret Communication through Web Pages Using Special Space Codes in HTML Files

Scan Process Distribute. Secure Scanning integrated in your Office Printing Infrastructure

Nuance Power PDF Advanced.

Adobe Dreamweaver CC 14 Tutorial

Image only High Quality Print Image + text

Fig (1) (a) Server-side scripting with PHP. (b) Client-side scripting with JavaScript.

Suggestions and Tips for Managing and Uploading Files for the AEP Application

Document Conventions... 2 Technical Requirements Logging On... 3 Logging Off Main Menu Panel... 4 Contents Panel... 4 Document Panel...

Representative Guide for Electronic Records Express Sending Individual Case Responses by Secure Website

Enriched Links: A Framework For Improving Web Navigation Using Pop-Up Views

Fireworks 3 Animation and Rollovers

Image Based Spam: White Paper

Contents. Launching FrontPage Working with the FrontPage Interface... 3 View Options... 4 The Folders List... 5 The Page View Frame...

Optimizing Adobe PDF files for display on mobile devices

ACE: Illustrator CC Exam Guide

File by OCR Manual. Updated December 9, 2008

Saving work in the CMS Edit an existing page Create a new page Create a side bar section... 4

Optical Character Recognition (OCR)

Using Impatica for Power Point

Using the Acrobat X Pro Accessibility Checker

Tic, Tie & Calculate Quick Start Guide. Quick Start Guide

Web Publisher s Kit Getting Started Guide FOR WINDOWS

Newspaper Digitization Brief Background

Automated Medical Citation Records Creation for Web-Based On-Line Journals

A framework for Itinerary Personalization in Cultural Tourism of Smart Cities

Contents 1. Introduction... 2

Going Paperless The Utah Experience. Mike Pecorelli Project Manager Utah DEQ

Microsoft Word 2013 Tutorial

2. Distributed Handwriting Recognition. Abstract. 1. Introduction

Implementing a Web-based Transportation Data Management System

Digital Workflow How to make & use digital signatures

Introduction to OCR in Revu

Turnitin Blackboard 9.0 Integration Instructor User Manual

ABBYY PDF Transformer+ User s Guide

RIT Scholar Works User Guide

15 minutes is not much so I will try to give some crucial guidelines and basic knowledge.

How to Copyright Your Book

38 Essential Website Redesign Terms You Need to Know

Snap Server Manager Section 508 Report

Introduction to OpenOffice Writer 2.0 Jessica Kubik Information Technology Lab School of Information University of Texas at Austin Fall 2005

Internet Explorer An Introduction

JISIS and Web Technologies

Website Editor User Guide

INDIVIDUAL bizhub ENHANCEMENT

Hypercosm. Studio.

QQConnect Overview Guide

Transcription:

REPRESENTING OCRED DOCUMENTS IN HTML Tao Hong and Sargur N. Srihari Center of Excellence for Document Analysis and Recognition State University of New York at Bualo Bualo, New York 14228 email: ftaohong,sriharig@cedar.buffalo.edu ABSTRACT: OCR is an error-prone process. It is time-consuming and expensive to manually proofread OCR results. The errors remaining in OCRed texts can cause serious problems in reading and understanding if they do not refer to the original image representation. As demonstrated in this paper, a hybrid document which combines symbolic representation and image representation may relieve the problem. If we represent a OCRed document properly in HTML, OCR errors will not have much negative eect on the human reading process in a HTML browser and can be corrected by using a HTML authoring tool. Under the approach, an experiment evaluating a Japanese OCR system developed in CEDAR is also reported in this paper. 1 Overview of the Approach OCR is a process to transform a given document from its image representation into its symbolic representation. After this process, we obtain a text document which is electronically searchable, indexable and reusable. However, the transformation is error-prone, especially when the image quality is poor. Therefore, if we just save the pure text document generated by the OCR process, the text may have some potential errors. Although errors in a OCRed text may not cause problem in tasks such as text categorization and retrieval, they may have a negative eect if the text is to be read by ahuman being. In a current OCR system, although its nal output usually is just the symbolic content, the detailed information on segmentation and recognition is stored as an intermediate step. The detailed information usually includes: (1) coordinates of the bounding boxes of text blocks, lines, words and characters; (2) alternatives and condence scores of recognition for each character/word image. Based on this information, many potential OCR errors or suspicious points can be marked. Traditionally, a OCR system provides a specially designed graphical user interface (GUI) for an end-user to proofread OCRed text and manually correct those errors. Figure 1 shows a OCRed text generated by a commercial OCR system. It is very time-consuming and expensive to manually proofread OCR results if a large number of scanned documents have to be transformed into texts. Therefore, for OCRed documents in large digital libraries, usually no proofreading is applied. By reading a OCRed text as shown in Figure 1, people may feel frustrated because OCR errors prevent them from getting exact information from the document. Although it was originally designed for presenting documents on WWW, HTML is rapidly developing into the most popular document representation and a universal, platform-independent client interface for a broad range of applications running on Internet/Intranet and network-centric application servers. Because HTML allows embedded images inside normal text, a OCRed document can be represented in HTML as a combination of text and images. The point is to make potential OCR errors transparent to our reading process: for those words which are recognized with high condence, recognition results are used; otherwise, original word images are placed. Figure 2 shows the HTML le for the OCRed text in Figure 1. Similarly, Figure 3 illustrates a small Japanese document image and its OCR result in HTML. Although there are more than 6,000 Kanji characters in the Japanese character set, the Japanese OCR system used here is designed to recognize about 3,000 JIS level 1 Kanji characters which are often used in Japan. When characters from JIS level 2 are encountered, recognition results for those characters will always be incorrect. It is best to keep those characters as original images so that the text will still be

Figure 1: The OCR result of a text block generated by a commercial OCR system, TypeReader 3.0 from ExperVision Inc.. In the graphical user interface for proofreading, the places where the system failed or had low condence are highlighted. An user can correct those errors after examining the original image. Figure 2: The HTML le generated from the OCRed text in Figure 1 is displayed in Netscape Navigator. It is a hybrid representation of text and images. Word images from the original document image are embedded when the OCR system failed or had low condence in recognizing the word.

(a) (b) Figure 3: An example of a Japanese document image (a) and its OCR result in HTML (b). The Japanese OCR engine can only recognize about 3,000 often used Kanji characters (the rst level of JIS code). Two Kanji characters above have no way to be recognized correctly because they are from the second level of JIS code. They are displayed as images in the HTML le because of recognition failure. readable. Because HTML allows hyperlinks between dierent entities, in a HTML le for a OCRed document, we can create hyperlinks between a word string and its image so that a user can check the original image whenever he/she feels uncertain about the recognition result. We will demonstrate this later in our experiment. Hybrid representation of OCRed documents is not a new concept. In Adobe's Acrobat Capture [1], it has already been used to encode a recognized text into a PDF le which is much compressed and can be rendered the same as the original image in Acrobat Reader. Here, we propose to use HTML for the hybrid representation. The main advantages of doing so are: HTML is a simple but powerful markup language for the network environment, and there are many HTML browsing and authoring tools available on dierent platforms. In the process of developing an experimental Japanese OCR system, we realized that a Web browser such as Netscape Navigator can be a nice environment for OCR system evaluation. By automatically converting recognition results into hyperlinked HTML les, we can do OCR result evaluation, verication and editing easily on dierent platforms across a local network or on the Internet. By taking this approach, much programming work on GUI design and multilingual support can be avoided. 2 An Experiment on Evaluating Japanese Document Recognition Using HTML Cherry Blossom is a prototype system for Japanese document recognition which has been developed at CEDAR in the past few years [2]. This system is designed to meet several challenges of machine-printed Japanese document recognition: wide variations in data quality (facsimile, photocopy, newspaper etc.),

Figure 4: Evaluating OCR results using HTML and Netscape Navigator. There are three frames in the homepage. The top one is the TOC frame which has a table in which each entry is a hyperlink to one of six HTML les to be displayed in the right bottom frame, the MAIN frame. The left bottom one is the OCR frame which is used to display the recognition result for a character image. The HTML le currently being displayed in the MAIN frame is the hybrid representation of the recognition results, in which each character position is either a JIS code or an image. Each character position in the Main frame is also a hyperlink to its OCR result. By clicking on a character position, its OCR result will be displayed in the OCR frame. low scan resolution (e.g., 200ppi), large character set (about 3,000 Kanji characters in the rst level JIS code), and a variety of font and point sizes. Given a Japanese document page, the system can deskew the image, segment the page into blocks, discriminate between text/non-text blocks, determine the alignment of a text block, segment text regions into lines and further into character images, and recognize character images. In order for rapid prototyping of evaluation tools which are exible and platform-dependent, a framework for integrating the JOCR system with Internet/Intranet has been outlined. Under the framework, the output of the JOCR system has to be converted to HTML les, which can be viewed by using a Web browser such as Netscape Navigator. Simple Perl scripts can be written to convert the OCR results into HTML les. More detail description of the experiment can be found in [3]. For a Japanese document page, the recognition system can generate OCR results for each text block. In the current experiment, we focused on the task of converting OCR results for a text block into HTML les. The aspects of system performance to be evaluated are: character segmentation, character recognition and character clustering 1. Given OCR results of a text block, a Perl script was written to generate six HTML les. In each of those HTML les, for every character position (either a JIS code or an image), a hyperlink to its OCR le is provided so that its OCR result can be examined by clicking the mouse on it. The OCR le of a character is also in HTML. In the le, the character image is extracted from the block image and saved in GIF format; the classication result is also saved in a HTML le. 1 The HTML demonstrations for various Japanese and Chinese document pages are located at http://www.cedar.buffalo.edu/jocr/html-demo/index.html.

The Netscape Navigator browser provides a mechanism to have several frames simultaneously inside its window. Each frame can display an individual HTML le. Using the browser, a homepage for JOCR results is designed with three frames: the TOC frame, the OCR frame and the MAIN frame. Figure 4 is a snapshot of a homepage automatically generated for a OCRed Japanese document. Activated by selecting the entry Symbol or Image in the TOC frame, a hybrid HTML le for OCR results can be displayed in the MAIN frame. Here, each character is either a JIS code or an image. For this text block, most characters are in JIS because their OCR condence scores are high. For those characters with low OCR condence, their original images are displayed. Obviously, those positions where images are placed usually have segmentation problems or recognition errors. Although there are many segmentation and recognition errors in the OCR result which makes the result in JIS code dicult to read, the content in the hybrid representation is quite readable and not much information has been lost due to segmentation and recognition errors. For several positions, OCR results are incorrect but with high condence. Therefore, JIS codes are placed there. But it is not dicult for a user to detect those errors in their local context and by examing their original images. If the user wants to correct those errors, he can load the hybrid HTML le into a HTML editor and replace those images and wrong JIS codes with the correct JIS codes. It is easy to modify the Perl script when the appearance of a HTML page needs to be changed. In our experiment, the performance of character segmentation, character recognition and image-based clustering were evaluated. New scripts for converting OCR results to HTML les can be designed rapidly when some other aspects of OCR performance need to be studied. 3 Summary OCR is an error-prone process. The errors remaining in the OCRed text can cause serious problems in reading and understanding. As demonstrated in this paper, a hybrid document which combines symbolic representation and image representation may relieve the problem. Such a hybrid document in HTML is searchable and also error-recoverable. If we represent a OCRed document in such a hybrid format, OCR errors will not have much negative eect on the human reading process. Therefore, the hybrid HTML can be recommended as an ideal format for OCRed documents in an online digital library [4]. References [1] http://www.adobe.com/acrobat/capture.html. Adobe's Acrobat Capture. [2] S. N. Srihari, G. Srikantan, T. Hong, and B. Grom. A general-purpose Japanese optical character recognition system. In Symposium on Document Analysis and Information Retrieval, pages 271{286, April 15-17 1996. [3] T. Hong, S.F. Wu, and S.N. Srihari. Evaluating japanese document recognition in the internet/intranet environment. In IAPR Workshop On Document Analysis Systems, pages 633{650, Oct. 14-16 1996. [4] M. Lesk. Substituting images for books: The economics for libraries. In Symposium on Document Analysis and Information Retrieval, pages 1{16, April 15-17 1996.