Combining Data Integration and Information Extraction Techniques


Dean Williams (dean@dcs.bbk.ac.uk)
School of Computer Science and Information Systems, Birkbeck College, University of London

Abstract

We describe a class of applications which are built using databases comprising some structured data and some free text. Conventional database management systems have proved ineffective for these applications, and such applications are rarely suited to current text and data mining techniques. We argue that combining Information Extraction and Data Integration techniques is a promising direction for research, and we outline how our ESTEST system demonstrates this approach.

1. Introduction

A class of applications exists which can be characterised by the way in which they combine data conforming to a schema with some related free text. We describe this application class in Section 2. Our approach is to combine Data Integration (DI) and Information Extraction (IE) techniques to better exploit the text data; Section 3 summarises related areas of research and shows how our method relates to them. Section 4 details why we believe text is used in these applications and, as a result, why we believe combining DI and IE techniques will benefit them. Details of our system, Experimental Software to Extract Structure from Text (ESTEST), are given in Section 5, which shows how we plan to realise our goals. Finally, we give our conclusions and plans for future work in Section 6.

2. Partially Structured Data

In [1] King and Poulovassilis define a distinct category of data: partially structured data (PSD). Many database applications rely on storing significant amounts of data in the form of free text. Recent developments in database technology have improved the facilities available for storing large amounts of text; however, the provision for making use of this text data still relies largely on searching it for keywords.

A class of applications exists where the information to be stored consists partly of structured data conforming to a schema, with the remainder left as free text. We consider this data to be partially structured. This idea of PSD is distinct from semistructured data, which is generally taken to mean data that is self-describing: in semistructured data there may be no schema defined, but the data itself contains some structural information, e.g. XML tags.
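To make the notion of PSD concrete, the following minimal sketch shows what a partially structured record might look like in the Road Traffic Accident domain discussed below. The field names and values are illustrative assumptions only; they are not taken from the actual STATS19/STATS20 format.

```python
from dataclasses import dataclass

@dataclass
class AccidentReport:
    """A partially structured record: fixed schema fields plus free text.

    Field names are illustrative; the real STATS19/20 forms define
    their own structured attributes.
    """
    report_id: int          # structured: conforms to the schema
    date: str               # structured
    road_class: str         # structured
    vehicles_involved: int  # structured
    narrative: str          # unstructured: free-text account

report = AccidentReport(
    report_id=1042,
    date="2004-06-12",
    road_class="A-road",
    vehicles_involved=2,
    narrative="Vehicle 1 pulled out of the side road and collided with "
              "the cyclist, who was thrown onto the verge.",
)

# Conventional SQL queries reach the four structured fields easily, but
# the entities in `narrative` (the side road, the cyclist, the verge)
# remain invisible to the schema until something like IE extracts them.
```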

An example of an application based on the use of PSD is operational intelligence gathering, as used in serious crime investigations. The data collected in this application area takes the form of a report that contains some structured data, such as the name of the police officer making the report, the time and location of the incident, and details of the subjects and locations mentioned in the report. This is combined with the actual report of the sighting or information received, which is captured as text. A number of other text-based applications exist in crime, e.g. for witness statements and scene of crime reports. Other application domains we are familiar with that have partially structured data include Road Traffic Accident reports, where standard-format statistics are combined with free-text accounts written in a formalised subset of English. In bioinformatics, structured databases such as SWISS-PROT [2] include comment fields that contain related unstructured information.

A common theme of many of these applications, including crime and SWISS-PROT, is a requirement for expert users to annotate the text, trying to use standard terms in order to assist with queries, reduce duplication and highlight important facts. This is often a time-consuming, demanding task with results less effective than desired, and applications to assist with this work are being developed both as academic research projects, e.g. [3], and as commercial software, e.g. [4].

3. Related Areas

A number of active areas of research deal with text in databases, and we use the following definitions to establish how our approach relates to them:

Data Integration: providing a single schema over a collection of data sources that facilitates queries across the sources [5].

Information Extraction: finding pre-defined entities in text and using the extracted data to fill slots in a template, using shallow NLP techniques [6].

Data Mining / Knowledge Discovery in Databases: finding patterns in structured data, discovering new deep knowledge embedded in the data.

Text Mining: the application of data mining to text (often some NLP process creates a structured dataset from the text, which is then used for data mining [7]).

Graph-Based Data Models: current industry-standard databases are essentially record-based (e.g. the relational model or some form of object data model), and the schema must be determined in advance of populating the database. Graph-based data models offer finer semantic granularity and greater flexibility [8].

We are not proposing a text mining technique which finds patterns in very large collections of text, as in Nahm and Mooney [9], who combine IE with text mining. For many of the PSD applications we have described this is unlikely to be effective, since there are no very large static datasets to be mined (although there are exceptions, e.g. SWISS-PROT); rather, new query requirements arise over time, and extensions to the schema are required.

We propose an evolutionary system in which the user iterates through the following steps as new information sources and new query requirements arise. First, an initial integrated schema is built from a variety of sources, including structured data schemas, domain ontologies and natural language ontologies. Then information extraction rules are semi-automatically generated from this schema to be used as input to the IE processor. The data extracted from the text is added to the integrated schema and becomes available to answer queries. The schema may then be extended as new data sources are added or new schema elements are identified, and the process repeats. Figure 1 shows how the user will use the ESTEST system in this evolutionary manner. Because of the evolutionary approach, we suggest that a graphical workbench will be required for end-user use of ESTEST, and we intend to consider the requirements of such a workbench.

[Figure 1 (diagram): the evolutionary ESTEST cycle - Integrate Datasources, Create Data to assist the IE process, Information Extraction (IE), Integrate Results of IE, Enhance Schema and Query Global Schema, operating over a Global Schema and Extracted Data, with control flow and data flow indicated.]

Fig. 1. Evolutionary Use of the ESTEST System
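To make the control flow of Fig. 1 concrete, here is a minimal, self-contained sketch of the evolutionary cycle. Every name in it (the functions, the dictionary-based schema) is an invented placeholder for illustration; the real ESTEST and AutoMed interfaces are not shown here.

```python
# Toy sketch of the evolutionary cycle in Fig. 1.
# All names are invented placeholders, not the real ESTEST/AutoMed API.

def integrate(sources):
    """Step 1: build an integrated schema from schemas and ontologies."""
    return {"concepts": set().union(*sources), "extent": []}

def generate_ie_rules(schema):
    """Step 2: semi-automatically derive IE templates from the schema."""
    return list(schema["concepts"])  # one template per concept

def run_ie(templates, texts):
    """Step 3: fill the schema-derived templates from the free text."""
    return [(t, text) for t in templates for text in texts if t in text]

def merge(schema, extracted):
    """Step 4: add extracted data to the schema, ready for querying."""
    schema["extent"].extend(extracted)
    return schema

schema = integrate([{"officer", "location"}, {"vehicle"}])
for new_concepts in [set(), {"witness"}]:   # each pass: new requirements
    schema["concepts"] |= new_concepts      # step 5: extend the schema
    rules = generate_ie_rules(schema)
    schema = merge(schema, run_ie(rules, ["a witness saw the vehicle"]))

print(schema["extent"])  # 'vehicle' matches twice, 'witness' once
```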

4. Combining Data Integration and Information Extraction

We believe that the data collected in the form of free text is important to PSD applications and is not stored as text because it is of secondary value; rather, there are two main reasons for storing data as text in PSD applications:

- It is not possible to know in advance all of the queries that will be required in the future. The text captured represents an intuitive attempt by the user to provide all information that could possibly be relevant. The Road Traffic Accident reports are a good example of this: the schema of the structured part of the data covers all currently known requirements, in a format known as STATS20 [10], and the text part is used when new reporting requirements arise.

- Data is captured as text because of the difficulty of dynamically extending a schema in a conventional DBMS, where simply adding a column to an existing table can be a major task in a production system. For example, in systems storing witness statements in crime reports, as entities and relationships are mentioned for the first time it is not possible to dynamically expand the underlying schema, and so the new information is stored only in its text form. Furthermore, the real-world entities and relationships described in the text are related to the entities in the structured part of the data.

An application combining IE and Data Integration will provide advantages in these applications for a number of reasons. Information Extraction is based on the idea of filling pre-defined templates, and Data Integration can provide a global schema to be used as a template: combining the schema of the structured data with ontologies and other metadata sources can create the global schema / template. Metadata from the data sources can be used to assist the IE process by semi-automatically creating the required input to the IE modules. Data Integration systems which use a low-level graph-based common data model (e.g. AutoMed [11]) are able to extend the schema as new entities become known, without the overhead associated with conventional DBMS, because they are not based on record structures such as the tables of relational databases. Finally, the templates filled by the IE process provide a new data source to be added to the global schema, supporting new queries which could not previously be answered.
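To see why a graph-based model sidesteps the schema-evolution overhead just described, consider this minimal sketch of a node-and-edge store in the spirit of, but far simpler than, the HDM described in Section 5. The class and its API are invented for illustration and are not AutoMed's.

```python
# A toy graph store in the spirit of a low-level graph data model such
# as the HDM; this API is invented for illustration, not AutoMed's.

class GraphStore:
    def __init__(self):
        self.nodes = set()   # schema concepts and instance values alike
        self.edges = set()   # (label, source, target) triples

    def add_node(self, node):
        self.nodes.add(node)

    def add_edge(self, label, source, target):
        self.nodes.update((source, target))
        self.edges.add((label, source, target))

store = GraphStore()
store.add_node("person")

# A new entity type turns up in the text for the first time. In a
# relational schema this would mean ALTER TABLE or new tables; here it
# is just new nodes and edges, added while the database is live.
store.add_node("vehicle")
store.add_edge("drives", "person", "vehicle")
store.add_edge("instance_of", "p1:John Smith", "person")
store.add_edge("instance_of", "v1:red Ford", "vehicle")
store.add_edge("drives", "p1:John Smith", "v1:red Ford")
```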

5. The ESTEST System

Our ESTEST system makes use of the AutoMed heterogeneous data integration system being developed at Birkbeck and Imperial Colleges [12]. In data integration systems, several data sources, each with an associated local schema, are integrated to form a single virtual database with an associated global schema. If the data sources conform to different data models, these need to be transformed into a common data model as part of the integration process. The AutoMed system uses a low-level graph-based data model, the HDM, as its common data model; this is suitable for incremental growth of a global schema as new requirements arise. We have developed an AutoMed HDM data store [13] to hold instance data and intermediate results for ESTEST. AutoMed implements bi-directional schema transformation pathways to transform and integrate heterogeneous schemas [11], a flexible approach amenable to including new domain knowledge dynamically.

In summary, the ESTEST system works as follows. The data sources are first identified and integrated into a single global schema. In AutoMed, each data model that can be integrated is defined in terms of the HDM, and each construct in an external data model has an associated set of HDM nodes and edges. In ESTEST, some features of the data models must be preserved across all the integrated data sources: an IS-A concept hierarchy, support for attributes, the identification of text data to be mined, and the ability to attach word forms to concepts. To facilitate the automatic creation of the global schema, all the data sources used by ESTEST are transformed into an ESTEST data model, so each construct in an external model also has a set of transformations mapping it onto the ESTEST data model. Once all the data sources have been transformed into this standard representation and mappings between schema elements have been obtained, it is possible to integrate the schemas.

ESTEST then takes the metadata in the global schema and uses it to suggest input to the IE process. The user confirms, corrects and extends this configuration data, and the IE process is run. We make use of the GATE [14] IE architecture to build the ESTEST IE processor. As well as reusing standard IE components such as named entity gazetteers, sentence splitters and pattern-matching grammars (with configuration inputs semi-automatically created by ESTEST), a number of new IE components are being developed:

TemplateFromSchema: takes an ESTEST global schema, creates the templates to be filled by the IE engine, and creates input for the standard IE components.

NE-DB: named entity recognition in IE is typically driven by flat-file lists; the NE-DB component will instead associate a query on the global schema with an annotation type. A list of word forms will be materialised in the HDM store for use while the IE process is running (GATE NE gazetteers generate finite state machines for the possible transitions of tokens).

WordForm: for a given concept, obtains relevant word forms from the WordNet natural language ontology. More words can be generated by increasing the number of traversals allowed through the WordNet hierarchy, ordered by an approximation of semantic distance.

The templates filled by the IE process are then used to add to the extent of the corresponding concepts in the global schema: extracted annotations which match objects in the global schema are placed in the HDM store. The global query facilities of AutoMed then become available to the user, who can query the global schema using the IQL query language [15, 16]. For more detailed information on the design of the ESTEST system we refer the reader to [17], and for an example of its operation in the Road Traffic Accident domain to [18].

Recent work within the Tristarp group [19] has resulted in advanced visualisation tools for graph-based databases becoming available [20], which may be of assistance in the proposed user workbench. This research interest is also reflected in recent industry products: the Sentences [21] DBMS from Lazysoft is based on a quadruple store and sets out to challenge the dominance of the relational model.
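As an illustration of the kind of WordNet traversal the WordForm component describes, here is a small sketch using NLTK's WordNet interface. The breadth-first traversal depth is our own crude stand-in for semantic distance; this is an illustrative approximation, not the actual ESTEST code.

```python
# Sketch of WordNet-based word-form expansion for a concept, in the
# spirit of the WordForm component above. Requires the NLTK WordNet
# corpus (pip install nltk; then nltk.download('wordnet')).
from collections import deque
from nltk.corpus import wordnet as wn

def word_forms(concept, max_depth=1):
    """Collect lemmas for `concept` plus those of hyponym/hypernym
    synsets up to `max_depth` traversals away, ordered by traversal
    depth as an approximation of semantic distance."""
    forms, seen = [], set()
    queue = deque((s, 0) for s in wn.synsets(concept))
    while queue:
        synset, depth = queue.popleft()
        if synset in seen or depth > max_depth:
            continue
        seen.add(synset)
        for lemma in synset.lemma_names():
            word = lemma.replace("_", " ")
            if word not in forms:
                forms.append(word)
        # Neighbouring synsets are one traversal further away.
        for neighbour in synset.hyponyms() + synset.hypernyms():
            queue.append((neighbour, depth + 1))
    return forms

# e.g. word_forms("vehicle", max_depth=1) yields "vehicle" plus nearby
# terms such as "craft" or "rocket"; raising max_depth admits more words.
```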

6. Conclusions and Future Work

We have discussed how a class of applications based on partially structured data is not adequately supported by current database and data mining techniques, and we have set out why we believe that combining Information Extraction and Data Integration techniques is a promising direction for research. We are currently completing an initial implementation of the ESTEST system, which we will test in the Road Traffic Accident reporting and Crime Investigation domains. ESTEST extends the facilities offered by data integration systems by moving towards handling text, and extends IE systems by attempting to use schema information to semi-automatically configure the IE process.

References

1. P.J.H. King and A. Poulovassilis. Enhancing database technology to better manage and exploit partially structured data. Technical report, Birkbeck College, University of London, 2000.
2. A. Bairoch, B. Boeckmann, S. Ferro, and E. Gasteiger. Swiss-Prot: juggling between evolution and stability. Brief. Bioinform., 5:39-55, 2000.
3. SOCIS: Scene of Crime Information System. http://www.computing.surrey.ac.uk/ai/socis/.
4. QUENZA. http://www.xanalys.com/quenza.html.
5. A.Y. Halevy. Data integration: a status report. In G. Weikum, H. Schöning, and E. Rahm, editors, BTW, volume 26 of LNI, pages 24-29. GI, 2003.
6. D. Appelt. An introduction to information extraction. Artificial Intelligence Communications, 1999.
7. A.H. Tan. Text mining: the state of the art and the challenges. In Proceedings of the PAKDD 1999 Workshop on Knowledge Discovery from Advanced Databases, pages 65-70, 1999.
8. W. Kent. Limitations of record-based information models. ACM Transactions on Database Systems, 4(1):107-131, March 1979.
9. U.Y. Nahm and R. Mooney. Using information extraction to aid the discovery of prediction rules from text. In Proceedings of the KDD-2000 Workshop on Text Mining, pages 51-58, 2000.
10. UK Government Department for Transport. Instructions for the completion of road accident report form STATS19. http://www.dft.gov.uk/stellent/groups/dft_transstats/documents/page/dft_transstats_505596.pdf.
11. P.J. McBrien and A. Poulovassilis. Data integration by bi-directional schema transformation rules. In Proc. ICDE'03, 2003.
12. AutoMed Project. http://www.doc.ic.ac.uk/automed/.
13. D. Williams. The AutoMed HDM data store. Technical report, AutoMed Project, 2003.
14. H. Cunningham, D. Maynard, K. Bontcheva, and V. Tablan. Experience with a language engineering architecture: three years of GATE. In Proceedings of the 40th Anniversary Meeting of the Association for Computational Linguistics (ACL'02), 2002.
15. A. Poulovassilis. The AutoMed Intermediate Query Language. Technical report, AutoMed Project, 2001.
16. E. Jasper. Global query processing in the AutoMed heterogeneous database environment. In Proc. BNCOD'02, LNCS 2405, pages 46-49, 2002.
17. D. Williams and A. Poulovassilis. Combining data integration with natural language technology for the semantic web. In Proc. Workshop on Human Language Technology for the Semantic Web and Web Services, at ISWC'03, page TBC, 2003.
18. D. Williams and A. Poulovassilis. An example of the ESTEST approach to combining unstructured text and structured data. In DEXA Workshops, pages 191-195. IEEE Computer Society, 2004.
19. Tristarp Project. http://www.dcs.bbk.ac.uk/tristarp.
20. M.N. Smith and P.J.H. King. Database support for exploring criminal networks. In Intelligence and Security Informatics: First NSF/NIJ Symposium, 2003.
21. Lazysoft (maker of Sentences). http://www.lazysoft.com/index.html.