An Overview of a Role of Natural Language Processing in An Intelligent Information Retrieval System




Asanee Kawtrakul

ABSTRACT

In the information-age society, advanced retrieval techniques and the automatic extraction of useful information from streams of text have become societal needs. A system that meets these needs should provide more in-depth information about document contents than does a standard information retrieval system, which relies on keyword-based analysis and matching computation. Clearly, natural language processing plays a role in capturing text information and making it accessible to users. To achieve effective and robust extraction performance, it will be necessary to combine natural language processing, knowledge representation, reasoning and inference for handling large-scale text. This paper gives some of the historical perspective, methodology and state of the art of the role of Natural Language Processing in an information retrieval (IR) system.

Keywords: conventional information retrieval, intelligent information retrieval, natural language processing, information retrieval, artificial intelligence, nonbibliographic databases.

Introduction

The tremendous growth in the use of computers for typesetting and the continued rapid decline in the cost of mass storage devices have caused an explosion of nonbibliographic, or full-text, databases in electronic form. With these advances in computer technology, online information retrieval has reached mainstream computer users.

Two approaches can be distinguished when retrieving information: the conventional approach and the artificial intelligence (AI) based approach. The major function of the former is document delivery: supplying what we need from the sea of available material. The output of the system consists of one or more bibliographic references, perhaps with some added information such as an abstract or the full text of documents. The latter provides more effective and robust retrieval performance and extracts more in-depth information about the contents of its corpus than does the former.

In the information-age society, advanced retrieval techniques and the automatic extraction of useful information from streams of text, coupled with automated indexing, hypertext, summarization and abstracting, have become societal needs and the functions of Intelligent Information Retrieval (IIR) [4, 14]. The marriage between IR systems and AI techniques is, then, natural and potentially important. The combination of natural language processing (NLP), knowledge representation, reasoning and inference will derive IR's power from huge quantities of raw text in an intelligent manner. Through this integration, IIR is seen as one of several powerful new tools for online information retrieval. This paper gives some of the historical perspective, methodology and state of the art of the role of NLP in an intelligent information retrieval system.

Conventional IR vs. Intelligent IR

Traditionally, information retrieval emphasizes document retrieval, which depends heavily on human classification and on humanly prepared search strategies, while intelligent information retrieval emphasizes automatic extraction of useful information and facilitates interaction between the user and the system by giving the user natural language access tools.

Conventional Information Retrieval

In principle, conventional information retrieval (CIR) systems are based on determining the relationship (relevant or nonrelevant) between the information need of the users and the information in the documents. The outline of conceptual IR is shown in Fig. 1(a). The stored documents need to be organized and controlled. Organization and control activities include classification, cataloging, subject indexing and abstracting [17]. In early CIR, those activities depended heavily on information specialists, who needed both an understanding of what a document is about (that is, some comprehension of its subject matter) and a good knowledge of the user's need. An expanded view of the retrieval operations is shown in Fig. 1(b). As depicted there, the queries or information needs of the users must be analyzed and translated into particular vocabularies. Usually, the output of CIR consists of one or more bibliographic references with some added information, such as an abstract or the full text of documents.

In operational environments, the stored documents are represented by sets of index terms, sometimes called term vectors. Usually the terms are unweighted, although in some retrieval situations each term may be assigned a weight to reflect its relative importance. Queries may similarly be expressed as sets of unweighted or weighted terms. In many practical systems, the query terms are joined by Boolean operators. Given an inverted index, the set of indexed documents corresponding to a given query formulation is easily determined. Various extensions to the standard inverted-index technology have been proposed [10]: distance constraints, term weights, synonym specification and term truncation.

More recently, because of the explosion of nonbibliographic databases and of free-style queries submitted directly by naive online searchers, intelligent information retrieval, controlled by automatic, machine-performed procedures, has become necessary.

Intelligent Information Retrieval

The existence of nonbibliographic databases and online information retrieval by computer users poses many problems for information access systems, such as the language for document representation, the command language, database selection, user friendliness and ease of use. Since nonbibliographic databases outnumber their bibliographic counterparts, the need for automatic IR (or Intelligent IR) increases.
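The inverted-index lookup, Boolean operators and term truncation described for CIR above can be illustrated with a minimal sketch. It is not taken from [10]; the function names, the `*` truncation marker and the three sample documents are assumptions made purely for illustration.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def lookup(index, term):
    """Return the ids for a term; a trailing '*' (assumed syntax) means truncation."""
    term = term.lower()
    if term.endswith("*"):
        prefix = term[:-1]
        hits = set()
        for indexed_term, ids in index.items():
            if indexed_term.startswith(prefix):
                hits |= ids
        return hits
    return set(index.get(term, set()))

# Hypothetical document collection.
docs = {
    1: "natural language processing for retrieval",
    2: "boolean retrieval with inverted index",
    3: "indexing documents by information specialists",
}
index = build_inverted_index(docs)

# Boolean operators become set operations on the posting sets.
and_hits = lookup(index, "retrieval") & lookup(index, "boolean")  # AND
or_hits = lookup(index, "language") | lookup(index, "inverted")   # OR
not_hits = lookup(index, "retrieval") - lookup(index, "boolean")  # AND NOT
trunc_hits = lookup(index, "index*")                              # truncation
```

Representing each posting list as a set keeps the Boolean operators one-liners; a production system would more likely use sorted lists so that distance constraints and term weights can be stored alongside each posting.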

[Figure 1. Conventional Information Retrieval. a) Conceptual information retrieval: queries and documents feed a matching computation that retrieves relevant items. b) Expanded CIR: conceptual analysis and translation turn queries into formal statements, text indexing turns documents into indexed documents, and a matching computation retrieves relevant items. (Modified from Fig. 8.1 in [10].)]

Intelligent Information Retrieval (IIR) systems differ from CIR systems in that they must be more flexible, user-friendly and responsive; they index and classify automatically; they may segment, combine or synthesize a response rather than just retrieving documents; and they may extract useful information [4]. Table 1 summarizes the differences between CIR and IIR. Each approach has advantages and disadvantages for information retrieval. CIR is rigid and inflexible but fast, portable, relatively inexpensive and relatively easy to learn. IIR is highly expressive and flexible but potentially ambiguous, slow, brittle and expensive.

Table 1. The comparison of CIR and IIR

Function: document representation
  CIR: controlled vocabularies
  IIR: natural language

Function: indexing
  CIR: keyword analysis by information specialists
  IIR: automatic indexing

Function: method of retrieval
  CIR: artificial query language
  IIR: natural language

Function: retrieval strategy / presentation
  CIR: bibliographic reference with some added information
  IIR: the portion or part of on-line text; hypertext; produce a direct answer to the user's question; integrate multiple texts that repeat, correct or augment one another

Natural Language in IIR

An information retrieval system without vocabulary control may be referred to as a natural language (NL) or free-text system. Natural language systems have become both more prevalent and more feasible because of the explosion of nonbibliographic databases. In addition to free text, IIR may be concerned with NL queries, which are attractive to naive users who do not want to learn an artificial query language with its Boolean operators, proximity and truncation.

Even though natural language provides a highly exhaustive, highly flexible and user-friendly IR system, it brings many language problems: synonymy, unknown words, ill-formed sentences, syntactic ambiguities caused by structure, semantic ambiguities caused by homographs, and contextual ambiguity [17]. Searching NL is, then, more difficult than searching controlled vocabularies. However, NL searching offers a number of benefits; in particular, it permits searches of unlimited specificity. Table 2 summarizes the advantages and disadvantages of natural language systems.
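The synonym and homograph problems named above can be made concrete with a small sketch. It is an illustration only: the thesaurus entries, the documents and the helper names `expand` and `search` are all hypothetical and do not come from the paper or from [17].

```python
# Hypothetical thesaurus: each term maps to a set of synonyms.
THESAURUS = {"car": {"automobile", "auto"}}

def expand(query_terms, thesaurus):
    """Add the known synonyms of every query term to the query."""
    expanded = set(query_terms)
    for term in query_terms:
        expanded |= thesaurus.get(term, set())
    return expanded

def search(docs, terms):
    """Free-text match: return ids of documents sharing at least one term."""
    return {doc_id for doc_id, text in docs.items()
            if terms & set(text.lower().split())}

# Hypothetical free-text collection with no vocabulary control.
docs = {
    1: "used car prices",
    2: "automobile repair manual",
    3: "river bank erosion",
}

plain = search(docs, {"car"})                        # misses the synonym in doc 2
expanded = search(docs, expand({"car"}, THESAURUS))  # thesaurus recovers doc 2
false_drop = search(docs, {"bank"})                  # homograph: the financial sense was meant
```

A searcher looking for financial institutions gets document 3 back for "bank", which is the kind of false drop that a controlled vocabulary would prevent and that word matching alone cannot.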

Table 2. Advantages and disadvantages of NL systems (modified from Table 2.9 in [16])

Advantages:
- highly expressive
- permits a variety of access points
- highly flexible
- highly representative of reality
- represents (any) many points of view
- requires no training to use
- easy to represent new and complex concepts
- freedom of expression
- high degree of exhaustivity
- no indexing necessary

Disadvantages:
- very difficult to make generic searches
- problem with synonyms
- problem with homographs
- problem with false drops
- ambiguous, fuzzy, soft
- non-standardized
- not very compact
- user must think of own search terms, synonyms, etc.

Role of Natural Language Processing in IIR

Documents and queries in CIR are generally represented by document descriptions and an artificial query language, respectively. Document descriptions consist of single words, or sometimes groups of words, extracted from the document texts. Queries in an artificial query language are processed using Boolean operators, proximity and truncation. The retrieval of relevant items depends on a matching computation. Accordingly, no refined language analysis or understanding needs to be included in the information retrieval system.

However, documents are necessarily available in the form of natural language, or free text, which needs IIR to derive IR's power. In an IIR environment, a linguistic component could indicate term relationships between indexing units, which might be used to generate more complete and representative document descriptions than can be obtained from single terms alone. More refined content analysis might improve retrieval effectiveness and enhance traditional retrieval techniques with question-answering and information extraction capabilities that provide specific responses to certain questions. Additionally, a natural language capability might facilitate interaction between user and system by giving the user natural language access tools to replace conventional formalized query languages.

To implement IIR, the marriage between IR systems and AI techniques, which include natural language processing, knowledge representation, reasoning and inference, is then natural and potentially important. An overview of AI-based IR is given in Figure 2. A useful summary of early work in CIR and IIR is provided by [2, 5, 6, 7, 12].
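To illustrate how relationships between indexing units can yield richer document descriptions than single terms alone, the sketch below pairs adjacent content words into two-word descriptors. Simple adjacency is only a crude stand-in for a real linguistic component, and the stopword list, function names and sample sentence are assumptions made for this example.

```python
# Small illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "of", "a", "an", "in", "and", "for", "to", "is"}

def tokenize(text):
    """Lowercase the text and strip simple punctuation from each word."""
    return [w.strip(".,;:").lower() for w in text.split()]

def single_terms(text):
    """CIR-style description: content words taken one at a time."""
    return [w for w in tokenize(text) if w and w not in STOPWORDS]

def phrase_descriptors(text):
    """Pair adjacent content words to capture a term relationship."""
    words = tokenize(text)
    phrases = []
    for left, right in zip(words, words[1:]):
        if left and right and left not in STOPWORDS and right not in STOPWORDS:
            phrases.append(f"{left} {right}")
    return phrases

text = "The retrieval of information in natural language text."
terms = single_terms(text)
phrases = phrase_descriptors(text)
```

Here `terms` holds five isolated words, while `phrases` keeps "natural language" together as one indexing unit. It also produces the less useful "language text", which is where syntactic analysis would do better than adjacency.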

[Figure 2. AI-based Information Retrieval. An NL query passes through conceptual analysis (natural language analysis and understanding) to a query representation; documents pass through document analysis, including text indexing (natural language analysis and understanding), to a document representation. Both analyses draw on linguistic knowledge and domain-specific knowledge. A retrieval technique combined with an inference engine links the two representations and supports information extraction.]

As mentioned in section 3, several problems with the use of NL for IR are language problems. Those problems can be divided into three levels: word level, sentence level and discourse level. The role of NLP in IR, then, consists of morphological processing, syntax processing, semantic processing, discourse analysis and pragmatic interpretation. Using those five steps together with knowledge representation techniques [9, 11] and inference techniques [15], a document representation and a representation of the user's information need can be provided. Table 3 summarizes the problems and the techniques for solving them; for details see [1, 5, 6, 8, 13, 15]. Since the greater detail of full text calls for more sophisticated NLP, some IIR systems in [6] included text filtering, so that later stages of processing are not applied to texts that have been filtered out.
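The text filtering mentioned above can be sketched as a cheap gate placed in front of the costly analysis stages. This is a minimal sketch, not a description of any system in [6]: the trigger terms, the threshold, the sample texts and the `deep_analysis` placeholder are all hypothetical.

```python
def is_relevant(text, trigger_terms, min_hits=1):
    """Cheap filter: count how many trigger terms occur in the text."""
    words = {w.strip(".,;:!?").lower() for w in text.split()}
    return len(words & trigger_terms) >= min_hits

def deep_analysis(text):
    """Placeholder for the costly stages (syntax, semantics, discourse)."""
    return {"text": text, "analyzed": True}

def process(texts, trigger_terms):
    """Apply the costly stages only to texts that pass the filter."""
    results = []
    for text in texts:
        if is_relevant(text, trigger_terms):
            results.append(deep_analysis(text))
    return results

# Hypothetical input stream and domain trigger terms.
texts = [
    "A bomb exploded near the embassy.",
    "The market closed higher today.",
]
triggers = {"bomb", "attack", "kidnapping"}
results = process(texts, triggers)
```

Only the first text reaches `deep_analysis`. The trade-off is that an overly strict filter discards relevant texts before any deeper stage could recognize them.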

Table 3. An overview of problems and their solutions in IIR

Component of NLP in IIR: morphological processing
  Problems: spelling errors; word boundary ambiguity; POS tagging ambiguity; unknown words
  Problem-solving techniques: preprocessing; spelling correction; word filtering; probabilistic POS tagger

Component of NLP in IIR: syntax processing / semantic processing
  Problems: ill-formed sentences; complex sentences; long sentences; structure ambiguities; idioms, ellipses; homographs; synonyms
  Problem-solving techniques: fragment processing; partial parsing; corpus-based approach; knowledge-based approach; dictionary and thesaurus method

Component of NLP in IIR: discourse analysis
  Problems: anaphora reference
  Problem-solving techniques: discourse module; anaphora resolution; coherence reasoning

Component of NLP in IIR: knowledge representation techniques
  Problems: knowledge engineering bottleneck
  Problem-solving techniques: shallow knowledge; domain-specific knowledge

Component of NLP in IIR: reasoning and inference
  Problems: implicit and uncertain information
  Problem-solving techniques: uncertain inference; rule-based inference

Conclusion

The ambitious goal of IIR is the completeness and accuracy of the information extracted. To process a realistic information retrieval task, IR technology should be integrated with AI techniques. While the IR system offers a vehicle, AI techniques, including NL analysis and understanding, knowledge representation, reasoning and inference, will produce useful results. The success of an IIR system will depend on how well the principles of NLP have been used. However, the greater detail in full text apparently calls for more sophisticated natural language processing. According to MUC-4 in 1992 [6], IIR is a key factor in encouraging NLP researchers to move from small-scale systems and artificial data to large-scale systems operating on human language.

References

1. Asanee Kawtrakul, Thumkanon Chalatip, Jamjanay Thitima, Muangyunnan Parinee, Poolwan Kritsada, A Gradual Refinement Model for a Robust Thai Morphological Analyzer, Coling 96. (in press)
2. Cowie, Jim and Lehnert, Wendy, Information Extraction, Communications of the ACM, 39(1), 1996.
3. Croft, W. Bruce, Text Retrieval and Inference, Text-Based Intelligent Systems: Current Research and Practice in Informational Extraction and Retrieval, Lawrence Erlbaum Associates, Publishers, 1992.
4. Croft, W. Bruce, NSF Center for Intelligent Information Retrieval, Communications of the ACM, 38(4): 42-23, 1995.
5. DARPA, Proceedings of the 3rd Message Understanding Conference (MUC-3), Morgan Kaufmann, 1991.
6. DARPA, Proceedings of the 4th Message Understanding Conference (MUC-4), Morgan Kaufmann, 1992.
7. DARPA, Proceedings of the Tipster Text Program (Phase I), Morgan Kaufmann, 1993.
8. David D. McDonald, Robust Partial-Parsing Through Incremental, Multi-Algorithm, Text-Based Intelligent Systems: Current Research and Practice in Informational Extraction and Retrieval, Lawrence Erlbaum Associates, Publishers, 1992.
9. Fox, E.A., Tutorial on Knowledge-Based Information Retrieval, Fifteenth International Conference of Research and Development in Information Retrieval, 1992.
10. Gerard Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley Publishing Company, 1989.
11. Judith L. Klavans, Visions of the Digital Library: Views on Using Computational Linguistics and Semantic Nets in Information Retrieval, Current Issues in Computational Linguistics: In Honour of Don Walker, Kluwer Academic Publishers.
12. Lewis, David D. and Sparck Jones, Karen, Natural Language Processing for Information Retrieval, Communications of the ACM, 39(1): 92-101, 1996.
13. Nobesawa S., Tsutsumi J., Nitta T., Ono K., Jiang S. and Nakamishi M., Segmenting a Sentence into Morphems using Statistic Information Between Words, Coling 94, pp. 227-232, 1994.
14. Paul S. Jacobs, Introduction: Text Power and Intelligent Systems, Text-Based Intelligent Systems, LEA Publishers, pp. 1-8, 1992.
15. Richard M. Tong and Daniel Shapiro, Experimental investigations of uncertainty in a rule-based system for information retrieval, International Journal of Man-Machine Studies, 22: 265-282, 1985.
16. Stephen P. Harter, Online Information Retrieval: Concepts, Principles, and Techniques, Library and Information Science, 1986.
17. F. Wilfrid Lancaster, Information Retrieval Systems: Characteristics, Testing and Evaluation, John Wiley & Sons, 1979.