An Overview of a Role of Natural Language Processing in An Intelligent Information Retrieval System

Asanee Kawtrakul

ABSTRACT

In the information-age society, advanced retrieval techniques and the automatic extraction of useful information from streams of text have become societal needs. A system that meets these needs should provide more in-depth information about document contents than a standard information retrieval system, which relies on keyword-based analysis and matching computation. Clearly, natural language processing plays a role in capturing text information and making it accessible to users. To achieve effective and robust extraction performance, it will be necessary to combine natural language processing, knowledge representation, reasoning and inference for handling large-scale text. This paper gives some of the historical perspective, methodology and state of the art of the role of Natural Language Processing in an information retrieval (IR) system.

Keywords: conventional information retrieval, intelligent information retrieval, natural language processing, information retrieval, artificial intelligence, nonbibliographic databases.

Introduction

The tremendous growth in the use of computers for typesetting and the continued rapid decline in the cost of mass storage devices have caused an explosion of nonbibliographic, or full-text, databases in electronic form. With these computer technologies, online information retrieval has reached mainstream computer users. Two approaches can be distinguished when retrieving information: the conventional approach and the artificial intelligence (AI) based approach. The major function of the former is document delivery, that is, supplying what we need from the sea of available material. The output of the system consists of one or more bibliographic references, perhaps with some added information such as an abstract or the full text of the documents.
The latter provides more effective and robust retrieval performance, extracting more in-depth information about the contents of its corpus than the former does. In the information-age society, advanced retrieval techniques and the automatic extraction of useful information from streams of text, coupled with automated indexing, hypertext, summarization and abstracting, have become societal needs and the functions of Intelligent Information Retrieval (IIR) [4, 14]. The marriage between IR systems and AI techniques is therefore natural and potentially important. The combination of natural language processing (NLP), knowledge representation, reasoning and inference will derive IR's power from huge quantities of raw text in an intelligent manner. Through this integration, IIR is seen as one of several powerful new tools for online information retrieval. This paper gives some of the historical perspective, methodology and state of the art of the role of NLP in an intelligent information retrieval system.

Conventional IR vs. Intelligent IR

Traditionally, information retrieval emphasizes document retrieval, which depends heavily on human classification and on humanly prepared search strategies, while intelligent information retrieval emphasizes automatic extraction of useful information and facilitates interaction between the user and the system by giving the user natural language access tools.

Conventional Information Retrieval

In principle, conventional information retrieval (CIR) systems are based on determining the relationship, relevant or nonrelevant, between the information need of the users and the information in the documents. The outline of conceptual IR is shown in Fig. 1(a). The stored documents need to be organized and controlled. Organization and control activities include classification, cataloging, subject indexing and abstracting [17]. In early CIR, those activities depended heavily on information specialists, who need both an understanding of what the document is about, that is, some comprehension of its subject matter, and a good knowledge of the users' needs. An expanded view of the retrieval operations is shown in Fig. 1(b). As depicted in Fig. 1(b), the queries, or information needs, of the users must be analyzed and translated into particular vocabularies. Usually, the output of CIR consists of one or more bibliographic references with some added information such as an abstract or the full text of documents.

In operational environments, the stored documents are represented by sets of index terms, sometimes called term vectors. Usually the terms are unweighted, although in some retrieval situations each term may be assigned a weight to reflect its relative importance. Queries may similarly be expressed by using sets of unweighted or weighted terms. In many practical systems, the query terms are joined by Boolean operators.
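The weighted and unweighted term vectors described above can be illustrated with a minimal Python sketch. The inner product used as the matching function, the names match_score and unweighted, and the sample weights are assumptions made for illustration; the paper does not prescribe a particular matching computation.

```python
from typing import Dict, Iterable

# A term vector maps each index term to its weight (illustrative type alias).
TermVector = Dict[str, float]


def unweighted(terms: Iterable[str]) -> TermVector:
    """Represent an unweighted term set by giving every term weight 1.0."""
    return {term: 1.0 for term in terms}


def match_score(query: TermVector, document: TermVector) -> float:
    """Inner product over the terms shared by query and document.

    This is one simple choice of matching computation, assumed here.
    """
    return sum(w * document[t] for t, w in query.items() if t in document)


# Hypothetical documents: d1 uses weighted terms, d2 unweighted terms.
documents = {
    "d1": {"information": 0.8, "retrieval": 0.6, "indexing": 0.2},
    "d2": unweighted(["natural", "language", "retrieval"]),
}
query = {"retrieval": 1.0, "language": 0.5}

# Rank documents by decreasing match score.
ranked = sorted(documents,
                key=lambda d: match_score(query, documents[d]),
                reverse=True)
```

Treating an unweighted term set as a vector of 1.0 weights lets the same matching function serve both the weighted and the unweighted case.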
Based on an inverted index, the set of indexed documents corresponding to a given query formulation is easily determined. Various extensions to the standard inverted-index technology have been proposed [10]. Those extensions are distance constraints, term weights, synonym specification and term truncation. More recently, because of the explosion of nonbibliographic databases and of free-style queries submitted directly by naive online searchers, intelligent information retrieval, which is controlled by automatic, machine-performed procedures, is needed.

Intelligent Information Retrieval

The existence of nonbibliographic databases and online information retrieval by computer users poses many problems for information access systems, such as the language for document representation, the command language, database selection, and problems relating to user friendliness and ease of use. Since nonbibliographic databases outnumber their bibliographic counterparts, the need for automatic IR (or Intelligent IR) increases.
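To make the inverted-index retrieval described for CIR concrete, the following minimal Python sketch builds an inverted index and answers Boolean queries, with a trailing asterisk standing for term truncation. All function names, the asterisk convention and the sample documents are assumptions made for illustration, not part of any system discussed in the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set


def build_inverted_index(docs: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Map each index term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, terms in docs.items():
        for term in terms:
            index[term.lower()].add(doc_id)
    return dict(index)


def postings(index: Dict[str, Set[str]], term: str) -> Set[str]:
    """Documents for one term; a trailing '*' means truncation (prefix match)."""
    term = term.lower()
    if term.endswith("*"):
        stem = term[:-1]
        result: Set[str] = set()
        for indexed_term, doc_ids in index.items():
            if indexed_term.startswith(stem):
                result |= doc_ids
        return result
    return set(index.get(term, set()))


def boolean_and(index: Dict[str, Set[str]], *terms: str) -> Set[str]:
    """Documents containing every query term."""
    sets = [postings(index, t) for t in terms]
    return set.intersection(*sets) if sets else set()


def boolean_or(index: Dict[str, Set[str]], *terms: str) -> Set[str]:
    """Documents containing at least one query term."""
    return set().union(*(postings(index, t) for t in terms))


def boolean_not(index: Dict[str, Set[str]], all_ids: Iterable[str], term: str) -> Set[str]:
    """Documents that do not contain the term."""
    return set(all_ids) - postings(index, term)


# Hypothetical documents, each already reduced to its index terms.
docs = {
    "d1": ["information", "retrieval", "indexing"],
    "d2": ["natural", "language", "retrieval"],
    "d3": ["natural", "language", "indexes"],
}
index = build_inverted_index(docs)

both = boolean_and(index, "natural", "retrieval")
either = boolean_or(index, "information", "language")
truncated = postings(index, "index*")
retrieval_not_natural = boolean_and(index, "retrieval") & boolean_not(index, docs, "natural")
```

Because each term maps directly to a set of document ids, Boolean operators reduce to set intersection, union and difference, which is why the matching document set is easy to determine.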

Figure 1. Conventional Information Retrieval. a) Conceptual information retrieval (labels: queries, documents, matching computation, retrieval of relevant items). b) Expanded CIR (labels: queries, conceptual analysis and translation, formal statements, documents, text indexing, indexed documents, matching computation, retrieval of relevant items). Modified from Fig. 8.1 in [10].

Intelligent Information Retrieval (IIR) systems differ from CIR systems in that they must be more flexible, user-friendly and responsive; they index and classify automatically; they may segment, combine or synthesize a response rather than just retrieving documents; and they may extract useful information [4]. Table 1 summarizes the differences between CIR and IIR. Each approach has advantages and disadvantages for information retrieval. CIR is rigid and inflexible but fast, portable, relatively inexpensive and relatively easy to learn. IIR is highly expressive and flexible but potentially ambiguous, slow, brittle and expensive.

Table 1. The comparison of CIR and IIR

Function: document representation
    CIR: controlled vocabularies
    IIR: natural language
Function: indexing
    CIR: keyword analysis by information specialists
    IIR: automatic indexing
Function: method of retrieval
    CIR: artificial query language
    IIR: natural language
Function: retrieval strategy and presentation
    CIR: bibliographic reference with some added information
    IIR: the portion or part of on-line text; hypertext; produce a direct answer to the user's question; integrate multiple texts that repeat, correct or augment one another

Natural Language in IIR

An information retrieval system without vocabulary control may be referred to as a natural language (NL) or free-text system. Natural language systems have become both more prevalent and more feasible because of the explosion of nonbibliographic databases. In addition to free text, IIR may be concerned with NL queries, which are attractive to naïve users who do not want to learn an artificial query language with Boolean operators, proximity and truncation. Even though natural language provides a highly exhaustive, highly flexible and user-friendly IR system, it brings many language problems: synonymy, unknown words, ill-formed sentences, syntactic ambiguities caused by structure, semantic ambiguities caused by homographs, and contextual ambiguity [17]. Searching NL is therefore more difficult than searching controlled vocabularies. However, NL searching offers a number of benefits; for example, it permits searches of unlimited specificity. Table 2 summarizes the advantages and disadvantages of natural language systems.

Table 2. Advantages and disadvantages of NL systems (modified from Table 2.9 in [16])

Advantages:
    highly expressive
    permits a variety of access points
    highly flexible
    highly representative of reality
    represents (any) many points of view
    requires no training to use
    easy to represent new and complex concepts
    freedom of expression
    high degree of exhaustivity
    no indexing necessary
Disadvantages:
    very difficult to make generic searches
    problem with synonyms
    problem with homographs
    problem with false drops
    ambiguous, fuzzy, soft
    non-standardized
    not very compact
    user must think of own search terms, synonyms, etc.

Role of Natural Language Processing in IIR

Documents and queries in CIR are generally represented by document descriptions and an artificial query language, respectively. Document descriptions consist of single words, or sometimes groups of words, extracted from the document texts. Artificial query languages are processed with Boolean operators, proximity and truncation. The retrieval of relevant items depends on matching computation. Accordingly, no refined language analysis or understanding needs to be included in the information retrieval system. However, documents are necessarily available in the form of natural language, or free text, which calls for IIR to derive IR's power.

In an IIR environment, a linguistic component could indicate term relationships between indexing units, which might be used to generate more complete and representative document descriptions than can be obtained from single terms alone. More refined content analysis might improve retrieval effectiveness and enhance traditional retrieval techniques with question-answering and information extraction capabilities that provide specific responses to certain questions. Additionally, a natural language capability might facilitate interaction between user and system by giving the user natural language access tools to replace conventional formalized query languages.
To implement IIR, the marriage between IR systems and AI techniques, which include Natural Language Processing, knowledge representation, reasoning and inference, is therefore natural and potentially important. An overview of AI-based IR is given in Figure 2. A useful summary of early work in CIR and IIR is provided by [2, 5, 6, 7, 12].

Figure 2. AI-based Information Retrieval (labels: NL query, conceptual analysis, natural language analysis and understanding, linguistic knowledge, domain specific knowledge, query representation, document, document analysis (including text indexing), document representation, retrieval technique + inference engine, information extraction).

As mentioned in the section Natural Language in IIR, several problems with the use of NL for IR are language problems. Those problems can be divided into three levels: word level, sentence level and discourse level. The role of NLP in IR, then, consists of morphological processing, syntax processing, semantic processing, discourse analysis and pragmatic interpretation. Using those five steps, knowledge representation techniques [9, 11] and inference techniques [15], a document representation and a representation of the user's information need can be provided. Table 3 summarizes the problem-solving techniques; only some are given (for details see [1, 5, 6, 8, 13, 15]). Since the greater detail of full text calls for more sophisticated NLP, some IIR systems in [6] included text filtering, which sets some texts aside early so that later stages of processing are not applied to them.
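The idea of text filtering mentioned above, where some texts are set aside before the more expensive processing stages, can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the keyword-overlap test, the names passes_filter and process_corpus, the sample domain terms and the stand-in deep_analysis step are all hypothetical and are not taken from the systems in [6].

```python
from typing import Callable, Iterable, List, Set

# Characters stripped from the edges of each token (illustrative choice).
PUNCTUATION = ".,;:!?()\"'"


def passes_filter(text: str, domain_terms: Set[str], min_hits: int = 1) -> bool:
    """Cheap test: does the text mention at least min_hits domain terms?"""
    words = {w.strip(PUNCTUATION).lower() for w in text.split()}
    return len(words & domain_terms) >= min_hits


def process_corpus(texts: Iterable[str],
                   domain_terms: Set[str],
                   deep_analysis: Callable[[str], str]) -> List[str]:
    """Apply the expensive later stage only to texts that pass the filter."""
    return [deep_analysis(t) for t in texts if passes_filter(t, domain_terms)]


# Hypothetical corpus and domain vocabulary.
texts = [
    "Automatic indexing improves retrieval.",
    "The weather was sunny today.",
]
domain_terms = {"indexing", "retrieval"}

# str.upper stands in for a costly NLP stage such as parsing.
analyzed = process_corpus(texts, domain_terms, str.upper)
```

The design point is only that the filter is far cheaper than the stages it guards, so texts it rejects cost almost nothing.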

Table 3. An Overview of Problems and Their Solutions in IIR

Component of NLP in IIR: morphological processing
    Problems: spelling errors; word boundary ambiguity; POS tagging ambiguity; unknown words
    Problem-solving techniques: preprocessing; spelling correction; word filtering; probabilistic POS tagger
Component of NLP in IIR: syntax processing and semantic processing
    Problems: ill-formed sentences; complex sentences; long sentences; structure ambiguities; idioms, ellipses; homographs; synonyms
    Problem-solving techniques: fragment processing; partial parsing; corpus-based approach; knowledge-based approach; dictionary and thesaurus method
Component of NLP in IIR: discourse analysis
    Problems: anaphora reference
    Problem-solving techniques: discourse module; anaphora resolution; coherence reasoning
Component of NLP in IIR: knowledge representation techniques
    Problems: knowledge engineering bottleneck
    Problem-solving techniques: shallow knowledge; domain specific knowledge
Component of NLP in IIR: reasoning and inference
    Problems: implicit and uncertain information
    Problem-solving techniques: uncertain inference; rule-based inference

Conclusion

The ambitious goal of IIR is the completeness and accuracy of the information extracted. To handle a realistic information retrieval task, IR technology should be integrated with AI techniques. While the IR system offers a vehicle, AI techniques, including NL analysis and understanding, knowledge representation, reasoning and inference, will produce useful results. The success of an IIR system will depend on how well the principles of NLP have been used. However, the greater detail in full text apparently calls for more sophisticated natural language processing. According to MUC-4 in 1992 [6], IIR is a key factor in encouraging NLP researchers to move from small-scale systems and artificial data to large-scale systems operating on human language.

References

1. Asanee Kawtrakul, Thumkanon Chalatip, Jamjanay Thitima, Muangyunnan Parinee, Poolwan Kritsada, A Gradual Refinement Model for a Robust Thai Morphological Analyzer, Coling 96 (in press).
2. Cowie, Jim and Lehnert, Wendy, Information Extraction, Communications of the ACM, 39(1), 1996.
3. Croft, W. Bruce, Text Retrieval and Inference, Text-Based Intelligent Systems: Current Research and Practice in Information Extraction and Retrieval, Lawrence Erlbaum Associates, Publishers, 1992.
4. Croft, W. Bruce, NSF Center for Intelligent Information Retrieval, Communications of the ACM, 38(4): 42-43, 1995.
5. DARPA, Proceedings of the 3rd Message Understanding Conference (MUC-3), Morgan Kaufmann, 1991.
6. DARPA, Proceedings of the 4th Message Understanding Conference (MUC-4), Morgan Kaufmann, 1992.
7. DARPA, Proceedings of the Tipster Text Program (Phase I), Morgan Kaufmann, 1993.
8. David D. McDonald, Robust Partial-Parsing Through Incremental, Multi-Algorithm, Text-Based Intelligent Systems: Current Research and Practice in Information Extraction and Retrieval, Lawrence Erlbaum Associates, Publishers, 1992.
9. Fox, E.A., Tutorial on Knowledge-Based Information Retrieval, Fifteenth International Conference on Research and Development in Information Retrieval, 1992.
10. Gerard Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley Publishing Company, 1989.
11. Judith L. Klavans, Visions of the Digital Library: Views on Using Computational Linguistics and Semantic Nets in Information Retrieval, Current Issues in Computational Linguistics: In Honour of Don Walker, Kluwer Academic Publishers.
12. Lewis, David D. and Sparck Jones, Karen, Natural Language Processing for Information Retrieval, Communications of the ACM, 39(1): 92-101, 1996.
13. Nobesawa S., Tsutsumi J., Nitta T., Ono K., Jiang S. and Nakanishi M., Segmenting a Sentence into Morphemes using Statistic Information Between Words, Coling 94, pp. 227-232, 1994.
14. Paul S. Jacobs, Introduction: Text Power and Intelligent Systems, Text-Based Intelligent Systems, LEA Publishers, pp. 1-8, 1992.
15. Richard M. Tong and Daniel Shapiro, Experimental investigations of uncertainty in a rule-based system for information retrieval, International Journal of Man-Machine Studies, 22: 265-282, 1985.
16. Stephen P. Harter, Online Information Retrieval: Concepts, Principles, and Techniques, Library and Information Science, 1986.
17. F. Wilfrid Lancaster, Information Retrieval Systems: Characteristics, Testing and Evaluation, John Wiley & Sons, 1979.