STRETCH : A System for Document Storage and Retrieval by Content

E. Appiani, L. Boato, S. Bruzzo, A.M. Colla, M. Davite and D. Sciarra
RES Department, Elsag spa, Via G. Puccini, Genova (Italy)
{enrico.appiani,luisa.boato,sandra.bruzzo,annamaria.colla,marco.davite,donatella.sciarra}@elsag.it

Abstract

This paper describes a system for storing and retrieving imaged multimedia documents by content. The system is being developed within the Esprit project STRETCH (STorage and RETrieval by Content of imaged documents). The core of the STRETCH system is a powerful Archiving and Retrieval Engine, based on a structured document representation and capable of activating appropriate methods to characterise and automatically index heterogeneous documents with variable layout, and subsequently to retrieve them in response to complex queries. The resulting document base, or Docu-base, relies on an object-oriented internal representation and on related characterisation and search methods. A prototype was implemented and successfully tested, in particular in the creation of an invoice archive.

1. Introduction

STRETCH (STorage and RETrieval by Content of imaged documents), ESPRIT Project n , aims at developing a system for storing and retrieving imaged multimedia documents based on their content. STRETCH addresses the Archive Reference System (ARS) market, which covers heterogeneous applications involving large documentary databases. Driven both by new technology developments and by the growing need of enterprises, specialised user communities and the public to improve the efficiency of information diffusion and communication, there is an ever increasing demand for tools that automatically convert information held on paper into digital information (the "zero-paper" option). The objective of STRETCH is twofold.
First, STRETCH aims at combining direct digitalisation, mostly based on the location of information fields and OCR, with image indexing open to multimedia, by applying advanced techniques derived from Image Analysis and Pattern Recognition. STRETCH aims at developing a common Archiving and Retrieval shell based on a structured document representation and capable of activating appropriate functions to characterise and subsequently retrieve multimedia documents on the user's demand. To make such a system effective, the bottleneck of document profiling must be avoided, in particular by overcoming the limitations of pre-defined indexing schemes. Second, STRETCH must overcome the main limitations of current ARS systems, offering in particular ease of use and programming, and the ability to adapt dynamically to generic multimedia documents.

The STRETCH goal of realizing a document archive according to the user's view requires the integration of innovative modeling techniques with well-established automatic indexing techniques to enable content-based retrieval. The core technology employed is document processing, in terms of image enhancement, layout analysis, field location, logo location and recognition, tag identification, and Intelligent Character Recognition (ICR). The structure of each document is derived, the document is classified according to the user specification, and the information content is extracted from the relevant fields. A document representation suited to support this complex processing has been designed accordingly, along with the corresponding database representation. In this respect, STRETCH may be regarded as a document meta-engine [1] that introduces document logic into existing databases focused on document fields. The document logic, based on the STRETCH Document Internal Representation (DIR), is employed for document analysis and classification, information extraction, indexing and retrieval.
STRETCH is being tested in three different environments, which are representative examples of the application fields addressed: an accounts payable archive (invoices and related documents: bills of entry, transport documents, and so on), a document archive for the Public Administration (circular letters, statistical reports, and so on), and a medical image archive (myocardial SPECT maps, thorax radiographs). STRETCH started on December 1st. At the time of writing, STRETCH has successfully completed the user requirement capture phase and has consolidated the technical specifications of the data model and of the overall system architecture. The detailed system design and implementation, based on an object-oriented (OO) approach and incremental prototype production, have been progressively revised and refined during the development stage, following an iterative assessment of the level of user satisfaction. A demonstration system, endowed with most of the foreseen functionality, has been produced.

This paper schematically describes the STRETCH architecture, functionality and achievements to date. In the following we briefly present the system architecture and components (Section 2), the data model (Section 3), and some preliminary experiments in the invoice domain (Section 4). Finally, Section 5 presents some conclusions.

2. Architecture and main components

From an architectural point of view, the project relies on a client/server solution, networked through corporate Intranets and (possibly) the Internet. The main achievement is the development of a powerful Archiving and Retrieval Engine (ARE), based on a general document representation and capable of activating appropriate methods to characterise and retrieve imaged documents. The user interface is a portable thin client. The STRETCH architecture consists of:

Client layer: a portable, user-friendly and intuitive Graphical User Interface (GUI);
Server layer: a scalable Archiving and Retrieval Engine (ARE);
Database layer: a structured document database, or Docu-base.

Figure 1 summarizes the main functional components in the Client, Server and Docu-base layers. A standard Corba middleware links all the components in the Client and Server layers, ensuring interoperability among heterogeneous platforms. This choice makes it possible to plug in any Corba-compliant commercial component directly, while a Corba wrapper may be implemented for components that are not Corba compliant.
The STRETCH Client, also called Docu-client, consists of the STRETCH GUI with related tools. Any other External Application client (for instance, the GUI of an external application extended with STRETCH access) can use the STRETCH server components through their Corba interfaces. The GUI provides all the ARE services in a user-friendly way through extensive use of windows, pop-up menus, buttons, icons and thumbnails, as well as context-sensitive menus and help windows. In particular, the GUI is used to manage and monitor user sessions and profiles; to let the user define new applications and document classes (Maintenance and Definition Tool); to acquire new documents directly from a scanner or other acquisition sources; to visualize archived or retrieved documents; and to query the archive.

The STRETCH Server, also called Docu-server, is centered on the STRETCH Archiving and Retrieval Engine (ARE), which uses the Document Internal Representation (DIR). Because most of the supported functions have high-performance and scalability requirements, the ARE can rely on distributed configurations including, when necessary, parallel machines. The ARE may also interoperate with specialized External Archiving/Retrieval Engines and External Applications, integrated as Corba components.

The STRETCH Database, also called Docu-base, includes the main DBMS storing the DIR instances. These instances may contain references to External Databases, if required by application constraints, or to external specialized Archiving/Retrieval Engines. For example, the Docu-base can archive the document description while the external databases store the document images or specific fields.

The STRETCH Maintenance and Definition Tool (MDT) consists of both Client and Server components, in charge of application definition, user profile management and system configuration. STRETCH is an open system, thanks to the standard middleware and the interoperability with external components and applications.
Examples of supporting or external components are a text retrieval engine or an ERP system. The basic implementation choices are: Java for the Client objects and for the top-level Server session objects; C++ for the Server internal service objects (DIR Objects) and the ARE Manager, chosen in particular for code efficiency; and an OODB schema based on the DIR and User Profile classes, which are the permanent objects in the system. DIR methods relevant for retrieval are defined both in the server objects and in the OODB.

3. The data model

Any document can be described with respect to two different aspects: the physical one and the logical one. The physical structure of a document, also called layout structure, is the collection of the extracted objects obtained by repeatedly partitioning the document content into increasingly smaller parts (down to basic objects) on the basis of the layout appearance. An object of the layout structure is also called a physical object. Similarly, the logical structure of a document is the collection of the extracted objects obtained by repeatedly dividing the document content into increasingly smaller parts on the basis of the human-perceptible meaning of the content. An object of the logical structure is called a logical object.
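The two views can be pictured with a small sketch. The Python classes below are purely illustrative: the names PhysicalObject and LogicalObject, the (top, left, height, width) box convention and the sample labels are assumptions made for this example and are not the DIR classes of the STRETCH system.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Assumed convention for this sketch: (top, left, height, width) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class PhysicalObject:
    """A node of the layout structure: a region split into smaller regions."""
    box: Box
    children: List["PhysicalObject"] = field(default_factory=list)

    def basic_objects(self) -> List["PhysicalObject"]:
        """Return the leaves of the layout structure (the basic objects)."""
        if not self.children:
            return [self]
        return [leaf for child in self.children for leaf in child.basic_objects()]


@dataclass
class LogicalObject:
    """A node of the logical structure: a unit of meaning and its sub-units."""
    label: str
    children: List["LogicalObject"] = field(default_factory=list)
    # Physical objects in which this logical object appears on the page.
    regions: List[PhysicalObject] = field(default_factory=list)

    def find(self, label: str) -> Optional["LogicalObject"]:
        """Depth-first search for the first logical object with this label."""
        if self.label == label:
            return self
        for child in self.children:
            hit = child.find(label)
            if hit is not None:
                return hit
        return None


# Hypothetical one-page document: a header, a table and an amount box.
header = PhysicalObject((0, 0, 20, 80))
table = PhysicalObject((20, 0, 70, 80))
amount = PhysicalObject((90, 50, 10, 30))
page = PhysicalObject((0, 0, 100, 80), [header, table, amount])

# Logical view of the same page, linked to the physical regions.
invoice = LogicalObject("Invoice", [
    LogicalObject("Supplier", regions=[header]),
    LogicalObject("Total", regions=[amount]),
])
```

Keeping the two hierarchies separate and linking them through region references is one possible design; it lets the same logical object point to different physical positions in different documents.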

A domain of documents can be defined as a group of documents which can be clustered with respect to their subject or use according to the user's view: for example, journals, tax forms, business letters, invoices and check forms can be regarded as different domains. Since the documents of a domain share the same main subject and are used for similar or related functions, they are characterized by some logical and physical similarities. Some logical objects are common to all the documents of the domain. For example, in an invoice the logical object Total, which contains the amount due to the issuer, is always present. Similarly, a logical object that appears in different types of documents of the same domain usually keeps some physical features related to its position within the document. Documents belonging to a given domain can be further characterized by different layout or logical structures. Thus documents which feature physical or contextual similarities can be clustered into classes.

The internal data structure used to represent documents, the so-called Document Internal Representation (DIR), is defined in terms of class objects, with their relevant information and the methods to process such information. The DIR objects describe a document from both the physical and the logical viewpoint. For most of the applications considered by STRETCH, the DIR data structure is based on Modified X-Y trees (Sect. 3.1). The DIR represents the core system data, since the Archiving and Retrieval Engine uses it during archiving (DIR generation) and retrieval (access and feature matching). The domain representation is based on a similar approach: domain objects are template structures and fields whose methods are the strategies to process new documents and extract information from them. To implement domain knowledge relevant to specific documents and types of physical structures, we make use of a template structure named correlation graph (Sect. 3.2).
The correlation graph makes it possible to implement information extraction strategies based on advanced image recognition and reading technologies.

3.1. Modified X-Y tree

The Modified X-Y tree (M-X-Y tree) [2] is derived from the X-Y tree [3,4], a well-known data-driven method for page layout analysis. The M-X-Y tree is well suited to the physical representation of documents with complex layout. The basic assumption behind this approach is that the structured elements of the page (columns, paragraphs, titles, figures, lines of text, printed symbols) are generally laid out in rectangular blocks, which can almost always be divided into groups in such a way that blocks that are adjacent to one another within a group have one dimension in common [3]. The method uses thresholded projection profiles to split the document into successively smaller rectangular blocks [4]. A projection profile is the histogram of the number of black pixels along parallel lines through the document; depending on the direction of the lines, the profile is horizontal or vertical. A thresholded projection profile is obtained by comparing the values of a projection profile with a given threshold. The blocks are split by alternately making horizontal and vertical cuts along either white spaces, found by using the thresholded projection profile, or horizontal or vertical ruler lines. The result of such a segmentation can be represented as a tree, where the root stands for the whole page, the leaves stand for blocks of the page, and each level alternately represents the results of horizontal (X-cut) or vertical (Y-cut) segmentation. In order to maintain consistency in the data representation, the ruler lines, although used as separators, are also stored as leaves in the M-X-Y tree. The tree structure is enriched with descriptions of the relationships among leaves.
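The recursive splitting described above can be sketched in a few lines of Python. This is a minimal sketch of the classic X-Y cut on white spaces only, not the modified variant of [2]: it ignores ruler lines and adjacency links, and when no cut is found in the expected direction it simply retries the other direction once before declaring a leaf. The function names, the dictionary layout of the tree nodes and the default values of `threshold` and `min_gap` are assumptions made for illustration.

```python
def projection_profile(img, horizontal):
    """Count black pixels (1s) along each horizontal or vertical line."""
    if horizontal:
        return [sum(row) for row in img]        # one value per row
    return [sum(col) for col in zip(*img)]      # one value per column


def white_gaps(profile, threshold, min_gap):
    """Interior runs of at least min_gap lines whose count is <= threshold."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value <= threshold and start is None:
            start = i
        elif value > threshold and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    n = len(profile)
    # Runs touching the block border are margins, not separators.
    return [(s, e) for s, e in runs if e - s >= min_gap and s > 0 and e < n]


def xy_cut(img, top=0, left=0, horizontal=True, threshold=0, min_gap=2,
           retried=False):
    """Split a binary image (list of rows of 0/1) into a tree of blocks."""
    gaps = white_gaps(projection_profile(img, horizontal), threshold, min_gap)
    if not gaps:
        if retried:  # no cut in either direction: this block is a leaf
            return {"box": (top, left, len(img), len(img[0])),
                    "cut": None, "children": []}
        return xy_cut(img, top, left, not horizontal, threshold, min_gap, True)
    size = len(img) if horizontal else len(img[0])
    bounds = [0] + [x for gap in gaps for x in gap] + [size]
    children = []
    for s, e in zip(bounds[0::2], bounds[1::2]):  # segments between the gaps
        if horizontal:
            child = xy_cut(img[s:e], top + s, left, False, threshold, min_gap)
        else:
            sub = [row[s:e] for row in img]
            child = xy_cut(sub, top, left + s, True, threshold, min_gap)
        children.append(child)
    return {"box": (top, left, len(img), len(img[0])),
            "cut": "horizontal" if horizontal else "vertical",
            "children": children}


def leaves(node):
    """Collect the boxes (top, left, height, width) of all leaf blocks."""
    if not node["children"]:
        return [node["box"]]
    return [box for child in node["children"] for box in leaves(child)]


# Toy page: a full-width title above two columns, separated by white space.
page = [[0] * 10 for _ in range(9)]
for r in range(0, 3):
    page[r] = [1] * 10
for r in range(5, 9):
    page[r] = [1, 1, 1, 1, 0, 0, 1, 1, 1, 1]
tree = xy_cut(page)
```

On the toy page the root is split by a horizontal cut into the title block and the lower block, and the lower block is then split by a vertical cut into its two columns.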
Adjacency links among the leaves of the tree can be seen as an adjacency graph, where the nodes of the graph correspond to the leaves of the tree. An adjacency graph [5] describes the structure of a document by giving the position of the nearest objects in the horizontal and vertical directions (above, below, left and right relations).

3.2. Correlation graph

The correlation graph is a template structure used to implement domain knowledge, possibly extracted automatically from document samples [6]. This representation is suited to variable-layout documents with some spatial structure, or to documents whose semantics can be recognized by looking for textual tags in the image. It can be applied either to a full document or to a part of it. The correlation graph describes a document understanding strategy which uses both predefined template elements (implementing, for each field type, field reading inside a search area in the image) and search area computations for the fields still to be read, based on the position of fields already found. Meaningful fields are of three main types: (i) fields to be read as ASCII strings by the recognition strategy; (ii) textual or geometric tags used by the recognition strategy to understand the document structure; and (iii) image fields that can be recognized by suitable methods (e.g. logos).

4. The invoice application

4.1. Passive invoice management

In the scenario of passive invoice management, STRETCH aims at providing, on the one hand, data entry automation for VAT recording purposes, interfacing an ERP system, and, on the other hand, the invoice acquisition, archiving and retrieval capabilities that make the electronic copy immediately available to all authorized users. In the STRETCH environment new invoices can be grouped into batches to be scanned. The acquisition process produces the electronic copy of the invoices in a suitable format, which can vary from binary to colour images depending on user requirements. New invoices can be fed automatically to the information extraction procedure, which mainly consists of document classification and automatic ICR reading. The extracted data, which are used for indexing and as input to the VAT recording procedure, must be error free, so a supervision phase before archiving and VAT registration is advisable. The electronic copy of the invoices is then archived in the Docu-base with the previously extracted archiving indexes. The ICR recognition results also provide automatic data entry to the VAT registration procedure, which is usually part of the ERP system. The retrieval function is reserved to authorized users and makes all archived documents available for immediate consultation, with the advantage of eliminating the circulation of paper copies. Content-based retrieval makes it possible to find invoices from any partial information.

The prototype is centered on the INFORMATION EXTRACTION (document classification and reading) and ARCHIVING processes. The SUPERVISION procedure simply consists in presenting the invoice image together with the recognized fields. The user can correct any information and then confirm the data, which can no longer be modified after archiving. The ARCHIVING process directly demonstrates how the recognition results map into the STRETCH internal knowledge representation structure, at the moment stored in a relational database. The RETRIEVAL functions are based on the presentation of a form for Query Definition.
SQL queries are allowed on the values of known fields, with standard AND-OR expressions.

4.2. Information extraction

Three document processing steps are activated in order to extract information for indexing and for the VAT registration procedure (see Figure 2). First, the M-X-Y tree generation produces the M-X-Y tree representation of the invoice (see Sect. 3.1). Then the classification procedure based on the M-X-Y tree produces the document classification, i.e. the supplier identification. Last, the reading strategy [6] for that supplier is applied, which is based on ICR techniques including field finding, neural character reading [7], tag finding and logo recognition [8]. The ICR reader locates and reads the information written on invoices issued by a given supplier. If the supplier identification fails, a general reading strategy can be applied. The ICR result is a set of text strings used as indexes by the archiving procedure and a set of data used as input by the VAT recording procedure. Each information field is assigned a basic type that defines how the field value is interpreted during retrieval. For example, one can search for a date field older than a given threshold, or for a string field similar to a given word. The reading strategy internally employs a set of tags that have a significant spatial relation with information fields, such as Date, Invoice Number, Total and VAT. A set of the most relevant fields from user requirements was selected for the prototype.
This set consists of:

Supplier (string): the supplier name inherited from the MXY-based classifier, used both as an archiving index and for VAT registration; the supplier logo is located as accessory information;
Date (date): date of issue, used both as an index and for VAT registration;
Invoice number (string): used as an index and for VAT registration;
Total (integer): the total amount of the invoice, used for VAT registration;
IVA (integer): the total amount of Italian VAT.

4.3. Preliminary experimental results

The test set for the demo system consisted of 250 real passive invoices of a company of the Finmeccanica Group. They show different layouts, various styles and many different fonts and font sizes. All the invoices show a company logo, usually in one-to-one correspondence with the supplier, but invoices issued by the same supplier have neither a fixed layout nor a unique standard writing style. All the documents in the test set are composed of a single page. The acquisition produced binary (black and white) images, with 300 DPI x 300 DPI resolution. No specific filtering or enhancement was applied to the images. The performance of the information extraction stage was as follows: the M-X-Y tree-based classification achieved 97.8% correct classification in top position; fields were correctly located in 98.4% of cases; automatic reading of the field values produced a total of 31 misclassification errors (96.9% correct on fields, 100% on tags). Problems were mainly encountered with very noisy images, dot matrix fonts and italic fonts.

5. Conclusions

This paper has presented a short architectural and functional description of the STRETCH system, along with the current achievements. The demonstration system implemented for automated invoice processing has been briefly described and some experimental results presented. The system relies on an open three-tiered architecture, with the capability to interoperate with external applications, engines and databases. Such openness takes into account that advanced technology is nowadays available for document processing. STRETCH is expected to provide the document logic viewpoint on top of archives of text or images that are either flat or explicitly indexed. The openness of STRETCH, together with the adopted standard middleware and object-oriented approach, will make it possible to integrate future technology innovations.

Acknowledgements

We would like to acknowledge the contributions of the whole STRETCH work team, in particular P. Penna (AET, Genova), E. Francesconi and S. Marinai (DSI University of Firenze), and M. Diligenti (DI University of Siena).

6. References

[1] M. Beigi et al., MetaSEEK: A Content-Based Meta-Search Engine for Images, SPIE Proceedings on Storage and Retrieval for Image and Video Databases, vol. 3312, Jan
[2] F. Cesarini, M. Gori, S. Marinai, G. Soda, Structured document segmentation and representation by the modified X-Y tree, Proc. ICDAR 99 (to appear)
[3] G. Nagy and S. Seth, Hierarchical representation of optically scanned documents, in Proc. of the International Conference on Pattern Recognition, pp ,
[4] G. Nagy and M. Viswanathan, Dual representation of segmented technical documents, in Proc. First Int'l Conf. Document Anal. Recog., pp ,
[5] J. Yuan, Y. Y. Tang, and C. Y. Suen, Four directional adjacency graphs (FDAG) and their application in locating fields in forms, in Proc. Third Int'l Conf. Document Anal. Recog., (Montreal, Canada), pp ,
[6] L. Boato, E. Cattani, M. Davite, B. Villa, Automatic Programming of Variable Layout Image Documents Reading Applications based on Minimum Description Length Induction, AI*IA Workshop on Automatic Learning and Natural Language, Turin, Italy, Dec
[7] A.M. Colla, P.
Pedrazzi, Single and Coupled Neural Handprinted Character Classifiers, in M. Marinaro and P.G. Morasso (Eds.), ICANN 94, Proc. Intl. Conf. on Artificial Neural Networks, Sorrento, Italy, May , vol. II, pp , Springer-Verlag (1994).
[8] M. Corvi, E. Ottaviani, Multiresolution logo recognition, Proc. Int. Workshop on Visual Form, Capri,

Figure 1. Main layers with functional modules and data. [Diagram labels: Docu-client, Docu-server, Docubase; Acquisition, GUI, Maintenance & Definition Tool (C), Enhancement, Segmentation, Layout Analysis, Archiving / Retrieval Engine, Content-based Image Search, Content-based Docum. Analysis, Information Retrieval, OCR / ICR, DBMS, Maintenance & Definition Tool (S), Document Internal Repres., Image Analysis, Database Instances.]

Figure 2. The recognition process for the invoice application. [Diagram labels: Image, MXY Generation, MXY Tree, MXY-based CLASSIFIER, Supplier Name, Document Class, Reading Strategy, Information from Fields.]