Recognize This: Using Intelligent Document Recognition to Automate Enterprise Content Management
By Arthur Gingrande and Don Post
Published by AIIM
Software Solutions

Dispatcher: EMC Captiva Intelligent Document Recognition (IDR)

Dispatcher, the IDR solution from EMC Captiva, provides a strong set of innovative technologies that enable corporations to streamline their document flow. By using Intelligent Document Recognition technology in conjunction with an enterprise document capture platform, businesses can significantly reduce the manual labor required to identify, sort, and enter critical business data from paper documents, reduce the cost of processing this information, and improve the overall productivity of their business processes.

Dispatcher Value Propositions

- Streamline the flow of data: Dispatcher uses multiple text and image analysis technologies to process multiple document types in a single capture flow, including structured documents (e.g., forms), semi-structured documents (e.g., invoices, explanations of benefits), and unstructured documents (e.g., correspondence, legal documents).
- Reduce scanning preparation time and cost: Dispatcher eliminates the need to manually pre-sort documents and to use bar codes and separator sheets by automatically detecting the appropriate document type.
- Optimize transactional processes: By reducing the preparation required for document scanning, Dispatcher accelerates document routing and maximizes workflow and business process efficiency (e.g., Accounts Payable, Case Management applications) while reducing the manual interaction required to process incoming documents.

Customer Pain Points and Dispatcher Business Impact

Customer pain point: Inability to prioritize document handling; urgent documents are processed in the same process as regular documents.
Dispatcher business impact: Automatically prioritize your document flow by classifying urgent information and accelerating associated processes.

Customer pain point: Need to classify complex document folders under time constraints.
Dispatcher business impact: Manage and control document folders (e.g., loan applications, patient medical record folders, multiple-page Explanations of Benefits) to reduce processing time and cost.

Customer pain point: Difficulty automating the extraction of critical business data from less structured documents.
Dispatcher business impact: Accelerate data extraction for complex unstructured documents (e.g., invoices, Explanations of Benefits), reducing typing errors and processing cost.

Customer pain point: Slow business processes.
Dispatcher business impact: Process more documents, with fewer resources, faster.

EMC Captiva
10145 Pacific Heights Boulevard
San Diego, CA 92121
858.320.1100
www.emc.com/captiva

EMC2 and Captiva are registered trademarks and Dispatcher is a trademark of EMC Corporation. All other trademarks used herein are the property of their respective owners. Copyright 2007 EMC Corporation. All rights reserved.
By now, American business has conceded that the myth of the paperless office will indeed remain a myth. The main reason: paper has real utility. If it did not, we would not be making and distributing as many copies of each document as we do, and the printing industry would be dying. While it is true that somewhere around 70 percent of all new corporate documents are digitally created, the remaining 30 percent that are distributed and processed as paper documents will stay at that rate and remain a problem for the foreseeable future.

Moreover, a large number of forms that originate digitally are still printed and distributed as paper documents. In particular, this pertains to documents such as invoices, medical claim forms, and bills of lading, where, for auditing and recordkeeping purposes, it is often handier for a purchase order or a bill to be processed and tracked by a barcode on a paper form (for instance, FedEx and HCFA forms) than it is to send the information to its destination digitally.

Table of Contents
The Unstructured Document
Document Classification Defined
Adding Workflow and BPM to Back-end Processing
Document Classification and Records Management
Benefits of Intelligent Document Recognition
Where to Start
Glossary

© 2007 AIIM - The ECM Association, 1100 Wayne Avenue, Suite 1100, Silver Spring, MD 20910, 301.587.8202, www.aiim.org. Reproduction in whole or in part without written permission is prohibited. AIIM is a registered trademark. All other vendor and product names are assumed to be trade and service marks of their respective companies.
The Unstructured Document

As any information technology specialist or line-of-business systems administrator knows, the majority of business documents and forms (e.g., invoices, purchase orders, resumes, and work orders) are received at their destination in no particular order and in formats and layouts that are unexpected and unpredictable from the point of view of the processing agent. That is to say, they are unstructured.

The term unstructured, when associated with a document, refers to a group of functionally alike forms with dissimilar layouts that are received by a processing agent in collectively high volume and that require manual data entry to capture the form data. Because the design and the location of the data fields on any specific form are not known in advance, these form types cannot be recognized using conventional intelligent character recognition (ICR) technology, which relies on predefined templates to enable data location.

This means that if companies want to automatically identify and recognize the vast majority of business documents they receive in their mailrooms each day, they must consider integrating auto-classification and unstructured document recognition software into their existing enterprise-wide enterprise content management (ECM), enterprise resource planning (ERP), customer relationship management (CRM), and electronic records management (ERM) systems. Over the last 15 years, the computer chip has chased Moore's Law up the power curve, accompanied by similar geometric increases in PC memory capabilities and hardware support, to the point where automated document recognition software is ready for mainstream adoption.
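To make the template limitation described above concrete, the following is a minimal Python sketch, not any vendor's implementation. It assumes a recognition engine has already produced words with page coordinates; the `Word` type, the two sample vendor layouts, the `TOTAL_ZONE` rectangle, and both lookup functions are hypothetical names invented for illustration. It contrasts a fixed-zone (template) lookup with a lookup that searches for a label wherever it appears on the page.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: int  # left edge of the word on the page, in pixels (hypothetical units)
    y: int  # top edge of the word on the page, in pixels


def read_zone(words, left, top, right, bottom):
    """Template-style lookup: return the text found inside a fixed rectangle."""
    hits = [w for w in words if left <= w.x <= right and top <= w.y <= bottom]
    return " ".join(w.text for w in sorted(hits, key=lambda w: (w.y, w.x)))


def read_right_of_label(words, label, line_tolerance=5):
    """Layout-independent lookup: find the label anywhere on the page,
    then return the nearest word to its right on the same text line."""
    for anchor in words:
        if anchor.text.lower().rstrip(":") == label.lower():
            same_line = [
                w for w in words
                if abs(w.y - anchor.y) <= line_tolerance and w.x > anchor.x
            ]
            if same_line:
                return min(same_line, key=lambda w: w.x).text
    return None


# Two invoices carrying the same kind of data in different places (made-up data).
vendor_a = [Word("Total:", 400, 700), Word("118.50", 470, 700)]
vendor_b = [Word("TOTAL", 60, 300), Word("92.00", 140, 302)]

# A zone drawn for vendor A's layout only.
TOTAL_ZONE = (450, 690, 560, 710)

zone_a = read_zone(vendor_a, *TOTAL_ZONE)          # finds vendor A's total
zone_b = read_zone(vendor_b, *TOTAL_ZONE)          # finds nothing on vendor B's page
label_a = read_right_of_label(vendor_a, "total")   # works on vendor A
label_b = read_right_of_label(vendor_b, "total")   # and on vendor B
```

The zone lookup succeeds only on the layout it was drawn for, which is why each new layout would need its own template. The label-based lookup is deliberately simplistic, and real unstructured recognition uses far richer cues, but it shows the general idea of locating data by its relationship to other content rather than by fixed position.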
Nowadays, a high-volume, industrial-strength forms automation system incorporated into an ECM solution can generate a payback of less than one year when installed, integrated, and tweaked properly. This holds true for systems that handle unstructured as well as structured documents. In fact, automated document classification technology, often neural network driven, can now accurately sort and categorize documents arbitrarily received and scanned by companies in their mailrooms, lock boxes, and departmental locations at rates of over 25,000 pieces of mail per hour. Once the mail is sorted, classified, and scanned, ICR technology converts the images of the envelope contents into computer-usable data at a price that is eminently cost-justifiable: not infrequently, the payback is less than a year.

Document Classification Defined

Until a few years ago, unstructured documents had to be presorted into groups of similar form types before they could be recognized by ICR-based forms automation software. Since then, as the underlying pattern recognition technology has advanced, the preliminary sorting procedure, known as document classification, has also been automated. In this manner, document classification technology replaces labor-intensive sorting procedures by allowing large, arbitrarily ordered collections of forms to be scanned and processed with an absolute minimum amount of document preparation.

In the past, document-centric applications grounded in ICR-based forms processing and image data capture technologies failed to meet experts' predictions of extensive user adoption, largely because of widespread misperception and misunderstanding about ICR accuracy rates, and because forms processing applications were narrowly focused on structured forms. Over the last five years, however, the development of document classification and unstructured forms recognition software has outstripped the rate at which mainstream users have adopted it. Accordingly, there presently exists a technology lag that business users can exploit in order to gain a competitive advantage.

Adding Workflow and BPM to Back-end Processing

After documents are scanned and classified and the images are converted into computer-usable data using ICR, both the data and the image should be indexed to enable them to be searched and stored in a secure data repository. The data file includes more than images; it cross-references other documents in the archive database and records all the actions carried out by the persons in charge of the file, which enables historical tracking and auditing. How and in what order back-end procedures are carried out depends upon the specific application, but their implementation invariably requires a workflow component to successfully fulfill the business process.

For example, take invoice processing. To authorize a payment, an invoice must be verified by comparing it with the purchase order and/or other documents that describe the goods, their purchase price, and the time the goods were received and accepted by the company that placed the order.
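The invoice verification step just described, comparing the invoice with the purchase order and the record of goods received, is often called a three-way match. The Python sketch below illustrates one way such a check could look once the data has been extracted. It is an assumption-laden illustration, not a description of any product: the `Line` type, the `three_way_match` function, the tolerance parameter, and the sample data are all hypothetical.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class Line:
    item: str
    quantity: int
    unit_price: Decimal


def three_way_match(invoice, purchase_order, received, price_tolerance=Decimal("0.00")):
    """Compare invoice lines with the purchase order and the received quantities.

    invoice, purchase_order: lists of Line
    received: dict mapping item name to quantity received and accepted
    Returns a list of discrepancy messages; an empty list means no exceptions.
    """
    ordered = {line.item: line for line in purchase_order}
    problems = []
    for line in invoice:
        po_line = ordered.get(line.item)
        if po_line is None:
            problems.append(f"{line.item}: not on purchase order")
            continue
        if abs(line.unit_price - po_line.unit_price) > price_tolerance:
            problems.append(
                f"{line.item}: price {line.unit_price} differs from ordered {po_line.unit_price}"
            )
        got = received.get(line.item, 0)
        if line.quantity > got:
            problems.append(f"{line.item}: billed {line.quantity}, received {got}")
    return problems


# Hypothetical sample data.
po = [Line("toner", 10, Decimal("42.00")), Line("paper", 50, Decimal("3.25"))]
goods_received = {"toner": 10, "paper": 40}
invoice = [Line("toner", 10, Decimal("42.00")), Line("paper", 50, Decimal("3.25"))]

issues = three_way_match(invoice, po, goods_received)
```

In a workflow setting, an empty result would let the invoice proceed toward payment, while any discrepancy would route it to a person as an exception. Prices use `Decimal` rather than floating point so that currency comparisons are exact.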
Then, before paying the bill, the accounts payable department must extract the information necessary to validate the invoice data by using business rules and lookup tables, and the department must also accurately index the invoice data and export it to the enterprise data repository for access via search and retrieval.

Medical claims and explanation-of-benefits (EOB) forms must also be indexed on the back end for downstream data adjudication that involves the detection of fraud and the discovery of illegal unbundling of medical procedures before payments can be remitted for reimbursement to the billing agent. Moreover, Medicare and Medicaid claim forms must meet government-mandated processing deadlines and turnaround times, and may additionally be required to interface with a state-sponsored EDI network that facilitates speedy reimbursement.

Accordingly, constructing the right workflow for a given application requires a thorough knowledge of its underlying business processes. The degree of understanding must be comprehensive enough to enable the user to define explicitly all the requisite steps of, and exceptions to, the workflow or business process management (BPM) procedures involved.

Adding workflow to automated document recognition can be done in two ways: the workflow engine can come packaged with the forms classification and recognition software, or the software can include an API for integrating the application with specific outside workflow modules or BPM applications. Each instance of application integration requires its own work process analysis in order to assess the complexity and variety of factors that collectively determine the parameters of how the application must interface with the recognition and back-end components of the document recognition software. This enables the extraction of critical business data from, as well as the export of critical business data to, the appropriate administrative interfaces, whether they involve BPM, ERP, CRM, or dedicated vertical business processes.

Document Classification and Records Management

Compliance legislation such as Sarbanes-Oxley (SOX), the USA Patriot Act, the Payment Card Industry Data Security Standard, and the International Traffic in Arms Regulations (for defense contractors) requires that an organization produce information for the federal government on demand or face consequences of fine or imprisonment. Over the past two years, American corporations have paid in excess of $3 billion in fines for failure to comply when subpoenaed by the U.S. Government, and over 200 CEOs have gone to prison since SOX was passed. It may cost an average of $500,000 per corporation to install a records management system that will enable compliance with SOX, but it sure beats doing jail time.

The answer to regulatory compliance is to incorporate intelligent document recognition technology into a fully integrated records management system, known as an electronic document records management system (EDRMS). If the completed system is enterprise-wide, the enormous savings from benefits produced by forms recognition and document management technology will end up paying for the installation, particularly in high-volume operations. Since the penalties for non-compliance are so great, best practices mandate hiring an attorney to ensure that the completed system satisfies legal discovery requirements.

EDRMS implementation requires setting up a document retention schedule that governs the lifecycle of all corporate records. A document retention schedule is also a document destruction schedule that, if properly kept, allows a company to legally rid itself of burdensome records at its earliest convenience. Since ignorance of the law is not a valid excuse, a firm that fails to establish a document retention schedule is still responsible for all the documents the government requests, even if it has already destroyed them. And if a document retention schedule is set up but the company fails to rigorously enforce it, the firm can still be held liable for those documents that it succeeds in destroying.

Setting up a proper electronic records management system requires that a business obtain the consulting services of a certified records manager (CRM) who is familiar with EDRMS technology. The CRM will do a thorough analysis of the document management lifecycle of every document category, and then construct a retention schedule that separates the documents that can be considered records and must be retained from those that do not have to be kept. Delineating the two is not always an easy task; the easiest way to conceive the difference is to think of a record as any document that has the potential to be subpoenaed into court as evidence in a trial.

A successfully implemented electronic document records management system will permit ongoing visibility across disparate systems and processes, will establish objective identification and analysis of key compliance and performance indicators, will enable real-time monitoring and reporting of changing financial conditions, and will provide information forensics for accurate auditing of the financial supply chain. By accurately identifying documents that qualify as records as soon as the business receives them, the system makes compliance with SOX and other government regulations both more efficient and more certain.

Benefits of Intelligent Document Recognition

If a company invests heavily in EDM, ERP, BPM, or workflow, it has not optimized the data capture process unless and until it adds automatic document classification and data extraction to its operations. Integrating document classification and forms recognition with electronic content management brings new capabilities to an enterprise that extend beyond data capture into the realm of transactional content management. The result is called intelligent document recognition (IDR).

IDR technologies improve productivity by accelerating overall system throughput while drastically reducing document preparation and data entry costs. The major benefits include:

- Immediate labor savings from automated mail processing. Identifying, sorting, classifying, indexing, and dispatching incoming documents consume as much as 80 percent of the labor involved in processing mail at corporations. By scanning, classifying, and automatically indexing incoming mail documents and then dispatching those documents from a central location, companies can expect to cut their mail processing costs by 40-60 percent.
- Significant labor savings from automated data entry. Automating the document classification and forms automation process cuts costs by eliminating the need for separator sheets and barcodes and reduces the need for sorters, document prep workers, and data entry operators, which can cut manual keying labor by 60 percent or more.
- Better financial control from accelerating business processes.
Converting incoming documents into images that can be instantly classified and recognized shortens the business cycle. Incoming invoices can be paid faster and checks can be cashed sooner, thereby establishing a company's control over its net cash position sooner, while allowing a company to realize bottom-line cash flow more quickly.
- Improved productivity from fast document classification and accurate ICR. Ad hoc classification and recognition of a wide variety of business documents enables quick and accurate content extraction for import into enterprise work processes, including mission-critical applications, an improvement that accelerates the underlying business cycles and increases overall productivity.
- More efficient allocation and management of business resources. Truncated business cycles plus improvements in business processes and records management enable executives to better track, audit, allocate, and manage corporate business resources while meeting legal compliance objectives.
- Discounts from faster payment processing. Rapid processing of accurate invoice payment data makes cash available faster to vendors, thus ensuring that buyers receive their contractual discounts for meeting payment deadlines.
- Better records management. Upfront and accurate records classification and immediate extraction of data from qualifying business documents enable efficient compliance with government regulations and rapid access to stored records management data.
- Improved customer relationship management. Better records management and faster, more accurate forms processing with fewer payment errors mean that customers receive higher-quality service with a faster response time.
- Fast payback and cost-effective ROI. Savings from less manual data entry at a low cost of ownership typically translate into a fast payback, often less than one year, which improves proportionately as greater processing volumes improve economies of scale.

Underlying Technology of IDR

The science that underlies IDR software is a product of twenty years of evolution in forms automation technology. ICR templates need no longer be used to find data on forms. Instead, intelligent algorithms take their place. Sometimes the algorithms examine and compute the geometrical and spatial relationships between the text data elements, such as rows or subheadings (rather than graphical objects), and use the results to discover where important form data is most likely to reside. By taking advantage of the computing horsepower and memory that reside in the average desktop PC, a variety of methods, including blob analysis, edge detection, feature extraction, long-line detection, and multi-line character segmentation, are combined or operated in parallel to find form objects, columns, and data fields based solely on analyzing form topology. In fact, the process need not involve character recognition at all; the text can be treated on a higher level of abstraction, as a pattern of blobs. However, when ICR is integrated with topological and morphological analysis, it adds tremendous value and yields a powerful document classification tool.

Where to Start: Pick a Fail-safe Project

To successfully implement an IDR project, select an initial application that is critical to a company's core business activities. This strategy will ensure that the potential savings are there to begin with. Typically, analysis will reveal at least one department with an application that can cost-justify the investment: a no-brainer. Below are the salient criteria used to qualify the effectiveness of an intelligent document recognition solution:

- Paper forms-intensive lines of business with high clerical staffing.
- Documents contain time-sensitive information directly linked to the bottom line.
- Department is preferably a source of revenue (a profit center).
- Customer service applications that can yield a competitive advantage.
- Forms contain mostly machine print and little hand print, which minimizes ICR errors.
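The cost-justification screening described above can be sketched as a simple payback calculation. The Python example below is only an illustration under stated assumptions: the `payback_months` function is hypothetical, the dollar figures and department names are invented, and the only numbers borrowed from this paper are the 40 and 60 percent labor reductions cited in the benefits section.

```python
def payback_months(upfront_cost, monthly_labor_cost, labor_reduction, monthly_running_cost=0.0):
    """Months needed for labor savings to repay the upfront investment.

    labor_reduction is a fraction (0.60 means 60 percent less manual labor).
    Returns None when the net monthly saving is zero or negative.
    """
    monthly_saving = monthly_labor_cost * labor_reduction - monthly_running_cost
    if monthly_saving <= 0:
        return None
    return upfront_cost / monthly_saving


# Hypothetical candidate departments; all dollar amounts are invented.
candidates = {
    "accounts payable": payback_months(300_000, 60_000, 0.60, 4_000),
    "claims intake": payback_months(300_000, 20_000, 0.40, 4_000),
}

# Departments whose payback falls within one year under these assumptions.
within_a_year = [name for name, months in candidates.items()
                 if months is not None and months <= 12]
```

Under these made-up inputs, the higher-volume department repays the investment in under a year while the smaller one does not, which mirrors the advice to start where clerical staffing and paper volume are highest. A real analysis would also account for integration effort, exception handling, and accuracy on the actual forms.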
Obviously, because recognition accuracy is significantly dependent on form design, odds for success are improved greatly by using ICR-friendly forms. On the other hand, if the forms are unstructured, accuracy can be maximized by optimizing document image cleanup and data image enhancement.

Once a company puts a forms automation solution in place and develops integration expertise, it should focus on developing best practices by establishing a project team with the mission of seeking out improvements enabled by document and data capture. The team is responsible for evaluating and rolling out additional line-of-business and departmental IDR applications, including the mailroom.

In today's fast-moving global economy, integrating intelligent document recognition into an existing enterprise-wide content management system facilitates rapid access to important, time-sensitive, mission-critical information almost as soon as it hits the company mailroom. In so doing, IDR technology enables improved support of executive decision-making, accelerates funds transfer, and expedites immediate access to market data, all of which work to create a competitive advantage for the host company.

About the Authors

Arthur Gingrande, ICP, is a founding partner of IMERGE Consulting and a nationally acclaimed expert and pioneer in image-based intelligent character recognition (ICR), electronic forms, and forms automation. He is the co-founder of Symbus Technology (now Captiva). He is also the current president of the New England Chapter of TAWPI and a board member of the New England AIIM Chapter. Since 1991, over 300 of Mr. Gingrande's articles on ICR, forms automation, imaging, ECM, and EDMS have been published in various trade periodicals. Mr. Gingrande is the author of the AIIM publications Forms Automation from ICR to Electronic Forms to the Internet and Using Forms Automation to Boost Enterprise Productivity, the definitive books about the role of forms automation in document management and electronic commerce. His latest AIIM book is Processing Unstructured Documents: Challenges and Solutions. He also wrote Cost-Justifying an ICR Solution and Measuring and Improving Data Entry Productivity, both published by TAWPI. He can be contacted by telephone at 781-258-8181 or by email at arthur@imergeconsult.com.

Donald Post, CDIA+, ICP, ERMp, ECMp, LIT, has been a partner of IMERGE Consulting, Inc. since 1999. Previously, he was senior manager of Xerox Professional Services and a senior consultant with A.T. Kearney Management Consultants. He has been the speaker for two AIIM Webinars in 2007, both about forms processing. Mr. Post is one of the first 28 CDIAs and is recognized as an AIIM Master of Information Technology Laureate (LIT). He is recognized as an Information Capture Professional (ICP) by TAWPI and is a trainer for the ICP, CDIA+, and AIIM ERM Certificate courses. He served as chair of the AIIM 2005 Conference and chair of TAWPI 2006, and also serves on the board of TAWPI. He can be contacted by telephone at 815-398-0344 or by email at dpost@imergeconsult.com.

About IMERGE Consulting

IMERGE Consulting is a leading independent analyst firm providing management and technology consulting services that help clients save money and improve productivity. This focus includes training, records management, electronic content management, business process improvement, document and data capture, collaboration, and compliance advisory services. IMERGE helps organizations improve ROI, clearly define requirements and business case, evaluate software, and deploy systems faster. IMERGE has over 20 offices located near major cities across North America.
GLOSSARY

ASCII - American Standard Code for Information Interchange.

Algorithm - A list of exact steps to perform a specific calculation or programming problem. A precise description of the solution to a problem.

Automated Document Recognition - Process of extracting data from a scanned image of a paper document into ASCII through the use of ICR technology.

Automated Forms Processing - The ability for software to accept scanned forms and extract data from the boxes and lines to populate databases. Software usually includes the ability to drop out the form so that recognition accuracy improves. Intelligent Document Recognition automatically identifies document types from the layout and structure of the document.

Barcode - A unique symbolic code made up of a series of vertical bars that is used for fast and accurate identification of items using an optical scanner.

Batch Processing - A serial form of data processing in which groups or batches of transactions are input to the system. Batch processing stands in contrast to distributed real-time processing, in which forms are processed immediately upon initial input.

Database - (1) Electronic collection of records stored in a central file and accessible by many users for many applications. (2) Collection of data elements within records or files that have relationships with other records or files. Relational databases, in which data is stored in standard rows, tables, and columns, are most common. XML databases are a developing technology.

Document - A collection of data and/or graphics in either analog or digital form. In digital form, the collection is called an Electronic Document or Bit Map.

Document Image Management - Process of capturing, storing, categorizing, and retrieving documents regardless of original format, using micrographics and/or electronic imaging (scanning, OCR, ICR, etc.).

Document Management - Software that controls and organizes documents throughout an enterprise. Incorporates document and content capture, workflow, document repositories, COLD/ERM and output systems, and information retrieval systems.

Drop-Out Ink - Background or characters printed using special ink, which is transparent to an optical scanner or image camera.

EDI - Electronic Data Interchange.

Electronic Document Management System (EDMS) - An integrated system employing EIM and workflow to automate business work processes for increased productivity and improved competitive positioning. Workflow and change of culture and work process are essential for productivity increase.

Electronic Forms/Web Forms - Forms designed, managed, and processed completely in an electronic environment.

Form - A formatted document containing data in fields that are identified on the document.

Forms Automation - See forms processing.

HCR (Handprint Character Recognition) - Character recognition technology that converts images of handprint characters into ASCII code.

ICR (Intelligent Character Recognition) - Advanced form of character recognition technology that refers to recognition of both hand-printed and machine-printed characters. It can also include capabilities such as learning fonts during processing or using context analysis to aid recognition accuracy.

ICR-friendly Form - A form specifically designed to enable ICR technology to easily recognize the hand print data on it. Design elements include devices that separate characters, such as boxes and tick marks, and pre-printed symbols such as dollar signs and punctuation marks, all printed in drop-out ink.

Intelligent Document Classification - Process of automatically identifying document types from the layout and structure of the document and the use of ICR technology.

OMR (Optical Mark Recognition) - Software that detects the presence, or absence, of marks in defined areas such as boxes or circles; used for processing questionnaires, standardized tests, etc. Also known as mark sense.

Structured Form - A group of functionally alike forms with identical layouts that can be automatically batch-processed by a processing agent using ICR technology.

Unstructured Form - A group of functionally alike forms with dissimilar layouts that are received by a processing agent in collectively high volume and that usually require manual data entry in order to fully capture the form data.

Web Content Management - A technology that addresses the content creation, review, approval, and publishing processes of Web-based content.

XML (eXtensible Markup Language) - An established standard, based on the Standard Generalized Markup Language, designed to facilitate document construction from standard data items. Also used as a generic data exchange mechanism.
Published by AIIM - The Enterprise Content Management Association

AIIM is the international authority on Enterprise Content Management (ECM): the tools and technologies used to capture, manage, store, preserve, and deliver content and documents related to organizational processes. ECM enables four key business drivers: Continuity, Collaboration, Compliance, and Costs. For over 60 years, AIIM has been the leading non-profit organization focused on helping users to understand the challenges associated with managing documents, content, records, and business processes. As a neutral and unbiased source of information and education, AIIM serves the needs of its members and the industry through Market Intelligence, Professional Development, and Market Access. Today, AIIM is international in scope, independent, implementation-focused, and, as the representative of the entire ECM industry (including users, suppliers, and the channel), acts as the industry's intermediary. Learn more about AIIM by visiting www.aiim.org.

Sponsored by EMC Corporation
176 South Street
Hopkinton, MA 01748
www.software.emc.com

EMC Corporation is the world's leading developer and provider of information infrastructure technology and solutions that enable organizations of all sizes to transform the way they compete and create value from their information. Information about EMC's products and services can be found at www.emc.com.

EMC Documentum

The EMC Documentum family of enterprise content management solutions provides a comprehensive, fully unified software platform to manage and leverage content in a cost-effective, controlled manner, providing secured access and re-use across the enterprise. It combines a unified platform with a strong compliance infrastructure to support key business needs including collaboration, interactive content management, transactional content management, and archiving. To learn more about EMC Documentum, please visit www.software.emc.com/documentum.

EMC Captiva

EMC Captiva input management solutions transform information from paper, fax, and electronic sources into business-ready electronic content usable in enterprise applications. Captiva input management solutions meet the needs of organizations large and small and can readily scale from one department to a global enterprise. Solutions include document capture, forms processing, invoice processing, intelligent document recognition and classification, distributed capture, imaging solutions, applications monitoring, and premium services. Visit www.software.emc.com/captiva for more information.