Instance-Based Learning and Information Extraction for the Generation of Metadata

Andreas D. Lattner (Center for Computing Technologies TZI, University of Bremen, Germany)
Otthein Herzog (Center for Computing Technologies TZI, University of Bremen, Germany)

Abstract: Knowledge Management has recently become popular in enterprises hoping to achieve a competitive advantage. Describing information by metadata allows for performing detailed queries uniformly over information items from different information sources and enables goal-directed search and automatic provision of relevant information. As the manual acquisition of metadata is very costly, support for this task is desired. This work presents two metadata extractors for the creation of metadata. The first applies Instance-Based Learning for the adoption of metadata from similar objects. The second extracts information by applying regular expressions. Both extractors have been integrated into the metadata generation framework of the KnowWork system and have been evaluated in experiments on two real-world data sets from the engineering domain.

Key Words: Metadata Generation, Instance-Based Learning, Information Extraction, Knowledge Management

Category: H.3.

1 Introduction

Knowledge Management has recently become popular in enterprises hoping to achieve a competitive advantage. Having uniform access to all information within a company allows users to find information faster. Information can be structured by setting up an ontology for the existing information items. An ontology is the explicit specification of a conceptualization [Gruber 1993]. All information classes and their properties can be defined in an ontology. Instances of these classes are representations of business information objects (e.g., a bill of material of a certain product). Their properties (attributes and associations to other objects) are described by metadata.

Metadata allows for performing detailed queries uniformly over information items from different information sources. It enables goal-directed search and automatic provision of relevant information. But how does the metadata get into the system? The manual acquisition of metadata is very costly. As users see no direct benefit, their motivation for entering metadata is usually quite low. Therefore, support by semi-automated metadata generation is needed to overcome this situation. Semi-automated means that the created metadata has to be understood as a suggestion, which should be verified by the user. As no metadata generation approach will provide perfect metadata, there is a trade-off between checking the created metadata and letting erroneous data into the system.

2 Related Work

Many different fields are related to metadata generation. If unstructured documents have to be processed, text classification or information extraction can be used for the creation of metadata (e.g., [Yang and Liu 1999, Hobbs et al. 1996]). In both cases, automated and manual approaches can be distinguished. [Sebastiani 00] gives a good survey of machine learning in the area of automated text categorization. Examples of learning information extraction rules can be found in [Soderland 1999] and [Junker et al. 1999].

Inter- and intranet web pages may also be enhanced by metadata, e.g., in the Semantic Web context. Various approaches treat the classification of web pages, e.g., [Pierre 00], metadata generation for web pages [Jenkins et al. 1999, Stuckenschmidt and van Harmelen 00], or the creation of knowledge bases from the World Wide Web [Craven et al. 2000].

If metadata has to be created for databases, other approaches can be applied. Database contents can be adopted directly via database wrappers or can be mapped to the defined vocabulary of an appropriate ontology (e.g., [Tork Roth and Schwarz 1997, Bergamaschi et al. 1999]). The technologies to be applied strongly depend on the information sources managed by the system.

Figure 1: Mapping between attributes and metadata generation extractors

Figure 2: Adoption of metadata from the k-nearest neighbors

3 Metadata Generation with KnowWork

The metadata generation framework MetaGen is a module of the KnowWork system [Tönshoff et al. 00] and has been introduced in [Lattner and Apitz 00]. The KnowWork system allows for managing information classes and information items within its domain model, an ontology representation. Information items can be described by attributes and linked to other items via metadata. With the metadata generation framework it is possible to create metadata for arbitrary information items. Its flexible structure allows metadata extraction modules to be integrated as needed by implementing an extractor interface. The different extractors can be connected to all defined attributes for creating metadata. Fig. 1 illustrates the mapping between metadata generation extractors and attributes from different information types. Two extractors have been implemented and evaluated so far: the TextSimilarityExtractor and the RegExExtractor. Both are briefly described in the next two subsections.

3.1 TextSimilarityExtractor

The TextSimilarityExtractor uses an instance-based learning approach for the creation of metadata. It adopts metadata from the k-nearest neighbors (k-nn) by applying a similarity measure based on text content. An example is given in Fig. 2. If values for attribute a of object o are to be created, these steps are performed:

1. Collect the k nearest neighbors (default setting is k = 5) of the object o that also have the attribute a, i.e., that are instances of the class (or one of its subclasses) where the attribute a is defined.
2. Collect and count all values for attribute a of the neighbors n_1, n_2, ..., n_k.
3. Take over the values:
   o If a is a single-valued attribute, take the most frequent value as created metadata.
   o If a is set-valued, take the l most frequent values as created metadata, where l is the average number of values for attribute a among the k neighbors.
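The three steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the KnowWork implementation: the function names `cosine` and `suggest_values`, the data layout (a list of text/metadata pairs), the whitespace tokenization, and the use of raw term counts as the similarity basis are all assumptions made for the example. The average l is rounded to an integer, which is also an assumption.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors (dicts)."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def suggest_values(text, annotated, attribute, k=5, set_valued=False):
    """Suggest values for `attribute` of a new document.

    `annotated` is a list of (text, metadata) pairs, where metadata maps
    attribute names to lists of values (hypothetical layout).
    """
    query = Counter(text.lower().split())
    # Step 1: rank only those documents that carry the attribute.
    candidates = [(cosine(query, Counter(t.lower().split())), meta[attribute])
                  for t, meta in annotated if attribute in meta]
    neighbors = sorted(candidates, key=lambda c: c[0], reverse=True)[:k]
    if not neighbors:
        return []
    # Step 2: collect and count all values among the neighbors.
    counts = Counter(v for _, values in neighbors for v in values)
    # Step 3: one value for single-valued attributes, otherwise the
    # (rounded) average number of values per neighbor.
    if set_valued:
        avg = sum(len(values) for _, values in neighbors) / len(neighbors)
        n_values = max(1, round(avg))
    else:
        n_values = 1
    return [value for value, _ in counts.most_common(n_values)]


docs = [
    ("pump test report series A", {"contact": ["Mrs. Green"]}),
    ("pump test protocol series A", {"contact": ["Mrs. Green"]}),
    ("pump invoice series B", {"contact": ["Mr. Blue"]}),
    ("holiday schedule", {"other": ["x"]}),
]
print(suggest_values("pump test report series A draft", docs, "contact", k=3))
# prints ['Mrs. Green']
```

The example data mirrors the situation in Figure 2: two of the three most similar documents name Mrs. Green as contact person, so that value is suggested for the new document.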

In the case of the TextSimilarityExtractor, the similarity measure is computed from text documents in vector representations. Our implementation uses the mindaccess SDK, a software development kit for integrating the mindaccess system. mindaccess features, among others, search and classification techniques for text documents [insiders 00]. All documents d_j are represented by their term vectors:

$$ d_j = (w_{1,j}, w_{2,j}, \ldots, w_{t,j}) $$

The utilized term-weighting strategy follows the TF-IDF scheme. The similarity between two documents d_1 and d_2 is computed by the cosine of the angle between their vectors (cf. [Baeza-Yates and Ribeiro-Neto 1999]):

$$ sim(d_1, d_2) = \frac{\vec{d_1} \cdot \vec{d_2}}{|\vec{d_1}| \, |\vec{d_2}|} = \frac{\sum_{i=1}^{t} w_{i,1} \, w_{i,2}}{\sqrt{\sum_{i=1}^{t} w_{i,1}^2} \, \sqrt{\sum_{i=1}^{t} w_{i,2}^2}} $$

3.2 RegExExtractor

With the RegExExtractor, regular expressions for information extraction from texts can be defined. It uses the Jakarta ORO package, which provides Perl5-compatible regular expressions.

Figure 3: Information extraction with regular expressions (extraction rules for Document-ID, Author, and Checked by are matched against the text content, and the matched values are annotated to the document)

If certain patterns that indicate where attribute values can be found appear frequently in texts, this information can be used for information extraction. Many documents contain such patterns, e.g., for pointing out the creation date or author data. In these cases, extraction rules can be defined. In the following example the author name is expected after the Created by: string. All characters after the text Created by: and before the end of line are extracted as the value. After the colon at least one tab or white space is expected. The corresponding extraction rule is Created by:[\t ]+([\w0-9-_]+)$. Fig. 3 illustrates the use of regular expressions for the extraction of information. The figure shows simplified extraction rules for better understanding.
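A rule of this kind can be tried out with Python's standard `re` module instead of Jakarta ORO. The sketch below is an assumption-laden illustration, not the RegExExtractor itself: the rule table `RULES`, the function `extract_metadata`, and the second rule for "Checked by" are hypothetical. The character class is written as `[\w-]`, which matches the same characters as `[\w0-9-_]` because `\w` already covers digits and the underscore.

```python
import re

# Hypothetical rule table: attribute name -> pattern with one capture group.
# re.MULTILINE lets "$" match at the end of every line, not only at the
# end of the whole text.
RULES = {
    "Author": re.compile(r"Created by:[\t ]+([\w-]+)$", re.MULTILINE),
    "Checked by": re.compile(r"checked by:[\t ]+([\w-]+)$", re.MULTILINE),
}


def extract_metadata(text, rules=RULES):
    """Return a dict with the first match of every rule that fires."""
    found = {}
    for attribute, pattern in rules.items():
        match = pattern.search(text)
        if match:
            found[attribute] = match.group(1)
    return found


sample = "Change report\nCreated by:\tMeyer\nchecked by: Mueller\nStatus: open\n"
print(extract_metadata(sample))
# prints {'Author': 'Meyer', 'Checked by': 'Mueller'}
```

Because the capture group only admits word characters and hyphens, a value such as "M. Meyer" would not be matched by this rule; a broader character class would be needed for names containing dots or spaces.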

4 Evaluation

Both extractors have been applied to real-world data sets from the engineering domain, which have been provided by two of our application partners in the KnowWork project. Due to non-disclosure agreements we are not permitted to show any original data used in our experiments.

The data set from the first company consists of 0 documents of three different document types. All documents have six attributes: five single-valued and one set-valued attribute (Tab. 1). The other data set for evaluation has been provided by a second company. It has 95 documents from a single document type (Change Report). 21 attributes have been assigned to each document. Eight attributes are single-valued and 13 are set-valued (Tab. 2).

For both data sets, three independent experiments have been performed in each case. For each experiment, the data sets were randomly divided into a training set (ca. 60% of the documents) and a testing set (ca. 40% of the documents). The sizes of the testing sets were 4 documents in the first case and 38 in the second one. For each document from the test sets, values for all attributes have been created. The quality of the metadata generation was evaluated by computing precision and recall of the created metadata. The precision is the ratio of the correctly created values to all created values. The recall is the ratio of the correctly created values to all actual values for an attribute. These measures have been calculated for each attribute on its own and for all attributes together.

For the first data set only the TextSimilarityExtractor has been used. As these documents were very homogeneous, taking over attribute values from similar documents worked very well. The precision was on average 9.7% at a recall of 9.7%. These results should not be overrated, because for many attributes only a few possible values existed (e.g., file format). The most challenging attribute here was the set-valued keywords attribute. But even in this case, the TextSimilarityExtractor returned good results with a precision of 79.4% and a recall of 76.% (Tab. 1).

The data set from the second company was more complex. Some attribute values could not be determined by the k-nn approach (e.g., the creation date). For the first seven attributes (see Table 2) the RegExExtractor was used with manually created extraction rules. The TextSimilarityExtractor was applied for the remaining fourteen attributes. The overall average precision and recall on this data set turned out to be 67.7% and 67.9%, respectively. For some attributes the precision and recall values of the TextSimilarityExtractor were quite low. This happens if some attribute values appear only sporadically, or if text similarity is a weak predictor of the attribute values. The TextSimilarityExtractor performed poorly at creating values for the attributes product type, change reason, hardware version, and software version. In the worst case the precision and recall are 5.7% and 7.7%. Nevertheless, in many cases quite good results were achieved. For twelve of the attributes the precision and recall values were higher than 75% on average (Tab. 2).
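The precision and recall measures defined in this section can be written down compactly. The following is a minimal sketch under assumed names and data layout: `precision_recall` and the dictionaries mapping a document identifier to a set of values for one attribute are hypothetical and do not reflect the evaluation code actually used.

```python
def precision_recall(created, actual):
    """Precision and recall of created values for one attribute.

    `created` and `actual` map a document id to the set of created
    and actual values, respectively (hypothetical layout).
    """
    doc_ids = set(created) | set(actual)
    # A created value counts as correct if it is among the actual values
    # of the same document.
    correct = sum(len(created.get(d, set()) & actual.get(d, set()))
                  for d in doc_ids)
    n_created = sum(len(values) for values in created.values())
    n_actual = sum(len(values) for values in actual.values())
    precision = correct / n_created if n_created else 0.0
    recall = correct / n_actual if n_actual else 0.0
    return precision, recall


created = {"d1": {"pump", "valve"}, "d2": {"motor"}}
actual = {"d1": {"pump"}, "d2": {"motor", "gear"}, "d3": {"seal"}}
precision, recall = precision_recall(created, actual)
print(round(precision, 3), round(recall, 3))
# prints 0.667 0.5
```

In this toy example two of the three created values are correct (precision 2/3), while only two of the four actual values were found (recall 1/2). Computing the measures over all attributes together amounts to pooling the counts before dividing.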

Attribute        Set-valued  Extractor  Exp. 1  Exp. 1  Exp. 2  Exp. 2  Exp. 3  Exp. 3  average  average
Contact person               TextSim
File type                    TextSim
Project                      TextSim
Customer                     TextSim
Keywords         X           TextSim
Author                       TextSim
Overall

Table 1: Evaluation results of data set from company 1

5 Conclusion

The experiments on the two data sets show quite promising results. Even fairly simple approaches can achieve good results on real-world data. Even though the first data set was quite homogeneous, the experiments showed the practicability of the two integrated metadata generation extractors. For some attributes it is not recommended to apply these extractors, because the quality of the created metadata is not good enough. But if the extractors are applied only to suitable attributes, they can be a great help for the user during the metadata acquisition phase.

Depending on the user's requirements, the recall of documents based on the created metadata can be increased by taking over more values. This has the advantage that more values (probably including the right ones) are presented to the user. As taking over more values might also include erroneous ones, such a modification could lead to worse precision values.

Acknowledgements

The content of this paper is a partial result of the KnowWork project, which is funded by the German Ministry for Education and Research (BMBF) under grant 0 IN 00 D. We wish to express our gratitude to the KnowWork colleagues and students at TZI for their contribution during the development and implementation of some of the ideas and concepts presented in this paper. We also want to acknowledge the efforts of the KnowWork project partners, especially the enterprises which provided the data sets for the evaluation of the extractors, and insiders Wissensbasierte Systeme GmbH for the provision and integration of their technologies into the KnowWork system.

Attribute          Set-valued  Extractor  Exp. 1  Exp. 1  Exp. 2  Exp. 2  Exp. 3  Exp. 3  average  average
Report ID                      RegEx
SAP number                     RegEx
Product series     X           RegEx
Author                         RegEx
Checked by                     RegEx
Approved by                    RegEx
Creation date                  RegEx
Power/kW           X           TextSim
Product type       X           TextSim
OEM variant        X           TextSim
Mech. Constr.      X           TextSim
Mounting form      X           TextSim
Hardware variant   X           TextSim
Software variant   X           TextSim
Change reason      X           TextSim
Hardware version   X           TextSim
Software version   X           TextSim
Area of validity   X           TextSim
Categories         X           TextSim
File type                      TextSim
Paper format                   TextSim
Overall

Table 2: Evaluation results of data set from company 2

References

[Baeza-Yates and Ribeiro-Neto 1999] Baeza-Yates, R.; Ribeiro-Neto, B.: Modern Information Retrieval. ACM Press, New York; Addison-Wesley, 1999.

[Bergamaschi et al. 1999] Bergamaschi, S.; Castano, S.; Vincini, M.: Semantic Integration of Semistructured and Structured Data Sources. SIGMOD Record, 8():54-59, 1999.

[Craven et al. 2000] Craven, M.; Dipasquo, D.; Freitag, D.; McCallum, A.; Mitchell, T.; Nigam, K.; Slattery, S.: Learning to Construct Knowledge Bases from the World Wide Web. Artificial Intelligence, 8(-), 2000.

[Gruber 1993] Gruber, T. R.: A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition, 5(), 1993.

[Hobbs et al. 1996] Hobbs, J.; Appelt, D.; Bear, J.; Israel, D.; Kameyama, M.; Stickel, M.; Tyson, M.: FASTUS: Extracting Information from Natural Language Texts. In: E. Roche and Y. Schabes (Eds.): Finite State Devices for Natural Language Processing, MIT Press, 1996.

[insiders 00] mindaccess Overview and Concepts, Release .7, Technical Report, insiders Wissensbasierte Systeme GmbH.

[Jenkins et al. 1999] Jenkins, C.; Jackson, M.; Burden, P.; Wallis, J.: Automatic RDF Metadata Generation for Resource Discovery. Computer Networks, 3, 1999.

[Junker et al. 1999] Junker, M.; Sintek, M.; Rinck, M.: Learning for Text Categorization and Information Extraction with ILP. Proceedings of the Workshop on Learning Language in Logic, 1999.

[Lattner and Apitz 00] Lattner, A. D.; Apitz, R.: A Metadata Generation Framework for Heterogeneous Information Sources. Proceedings of the 2nd International Conference on Knowledge Management (I-KNOW 0), Graz, Austria, July -, 00.

[Pierre 00] Pierre, J. M.: On the Automated Classification of Web Sites. Linköping Electronic Articles in Computer and Information Science, Vol. 6, 00.

[Sebastiani 00] Sebastiani, F.: Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(), 00.

[Soderland 1999] Soderland, S.: Learning Information Extraction Rules for Semi-structured and Free Text. Machine Learning, 34(-3):33-7, 1999.

[Stuckenschmidt and van Harmelen 00] Stuckenschmidt, H.; van Harmelen, F.: Ontology-based Metadata Generation from Semi-Structured Information. Proceedings of the 1st International Conference on Knowledge Capture (K-CAP 00), Morgan Kaufmann, 00.

[Tönshoff et al. 00] Tönshoff, H. K.; Apitz, R.; Lattner, A. D.; Schlieder, C.: KnowWork - An Approach to Co-ordinate Knowledge within Technical Sales, Design and Process Planning Departments. Proceedings of the 7th International Conference on Concurrent Enterprising, Bremen, Germany, 7 9th June 00.

[Tork Roth and Schwarz 1997] Tork Roth, M.; Schwarz, P.: Don't Scrap It, Wrap It! A Wrapper Architecture for Legacy Sources. In: Proceedings of the 3rd VLDB Conference, Athens, Greece, 1997.

[Yang and Liu 1999] Yang, Y.; Liu, X.: A Re-examination of Text Categorization Methods. Proceedings of the ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR'99), 1999.