An Information Retrieval using weighted Index Terms in Natural Language document collections

Internet and Information Technology in Modern Organizations: Challenges & Answers, p. 635

Ahmed A. A. Radwan, Minia University, Minia, Egypt
*Bahgat A. Abdel Latef, Minia University, Minia, Egypt
Abdel Mgeid A. Ali, Minia University, Minia, Egypt
Osman A. Sadek, Minia University, Minia, Egypt

Abstract

Indexing a document is a method of describing its content so that it can be retrieved more easily from a document store. This paper describes the implementation of automatic indexing with various term weighting schemes in an IR (Information Retrieval) system, using the CISI document collection, which consists of abstracts of information retrieval papers, and the NPL collection, which consists of abstracts of electronic engineering documents. The system starts with a simple form of text representation: it extracts keywords and represents each document as a vector of weights that express the importance of those keywords in the documents of the collection. It then evaluates and compares the retrieval effectiveness of various search models based on automatic text-word indexing, and presents experimental results from a study of the improvements in text retrieval effectiveness obtained by successively applying these approaches.

1. Introduction

Recently, people have started dealing with an increasing number of electronic documents in information networks. Finding the specific documents that users need among all available documents is an important issue. Information Storage and Retrieval Systems make large volumes of text accessible to people with information needs [2], [6]. The user provides an outline of his requirement, perhaps a list of keywords relating to the topic, a question, or even an example document. The system searches its database for documents that are related to the user's query and presents those that are most relevant.
Most document retrieval systems use keywords to retrieve documents. These systems first extract keywords from documents and then assign weights to the keywords by using different approaches. Such systems have two major problems. One is how to extract keywords precisely [15], [1], [9] and the other is how to decide the weight of each keyword [14], [3]. Gerard Salton was long an advocate of term weighting approaches and was himself a pioneer in developing term weighting schemes. He and Christopher Buckley summarized the results of the previous 20 years in their paper [4], which was reprinted in [11].

The remainder of this paper is organized as follows. Section 2 describes the document collections used in our system. Section 3 presents the system model: the IR architecture, the various indexing schemes, cosine similarity, and the recall-precision measures. Section 4 presents our evaluation methodology and compares the retrieval effectiveness of the approaches used to index and retrieve documents. Section 5 summarizes the main findings of our study. Finally, Section 6 outlines future work, in which we will apply a genetic algorithm to improve the performance of the information retrieval system.

* Corresponding Author

2. Documents Collections

At its first conference in 1992, TREC (Text REtrieval Conference) used a collection containing 2 GB of text together with over 50 queries (called topics in TREC jargon). In general, newswire articles are used as full-text documents. IR collections are composed of two kinds of textual material: a set of documents and a set of queries. Each query is associated with a list of relevant documents. The list can be flat (all documents listed are assumed to be equally relevant to the given query) or ordered (each relevant document is given a relevance level). Because they are obtained by automatic methods (pooling), the relevance lists of large document collections are often flat. The TREC collection includes, for example, documents and queries from the Wall Street Journal sub-collection.

The TREC collection is large, ranging from 2 GB up to 20 GB (7.5 million documents) for the very large corpus track introduced in TREC-6, and it requires time-consuming preparation before experiments can be carried out effectively at a local site. An alternative approach is to use a smaller collection. In this case, as suggested by Hersh [17], a minimal evaluation must use at least 1000 documents and at least 50 queries.

The set of relevant documents for each example information request (topic) is obtained from a pool of possibly relevant documents. This pool is created by taking the top K documents (usually K = 100) in the rankings generated by the various participating retrieval systems. The documents in the pool are then shown to human assessors, who ultimately decide on the relevance of each document. TREC uses a working, user-centered definition of relevance [7].

We study two document collections, CISI (Centre for Inventions and Scientific Information) and NPL (National Physical Laboratory), each of which covers a specific subject area. For example, the CISI collection is concerned with information retrieval. Table 1 lists several test collections, including CISI and NPL, with some of their properties.

Table 1. Test collections related to the CISI and NPL collections

Collection   Subject                 No. Docs   No. Queries
ADI          Information Science
CACM         Computer Science
CISI         Information Retrieval
CRAN         Aeronautics
LISA         Library Science
MEDLARS      Biomedicine
NLM          Biomedicine
NPL          Elec. Engineering       11,
TIME         General Articles

3. Information Retrieval System Model

An Information Retrieval (IR) system manages its text resources by processing their words to extract content-descriptive index terms and assign them to documents or queries [12]. In naturally spoken or written language, words appear in many morphological variants even when they refer to a common concept. Therefore, the words usually undergo pre-processing: they are stemmed [16], [13]. Stemming allows words that appear in different morphological variants to be grouped together, indicating a similar meaning. Most stemming algorithms reduce word forms to an approximation of a common morphological root, called the stem. The objective is to eliminate the variation that arises from different grammatical forms of the same word; e.g., retrieve, retrieved, retrieves and retrieval should all be recognized as forms of the same word. Hence, a user who formulates a query does not need to specify every possible form of a word that he believes may occur in the documents he is searching for.

Another common form of preprocessing is the elimination of common words that have little power to discriminate relevant from nonrelevant documents, e.g., the, a, it and similar words. Hence, IR engines are usually provided with a stop list of such noise words.

The remaining set of terms defines a space in which each distinct term represents one dimension. Since we represent each document as a set of terms, we can view this space as a document space [4], [6]. We can then assign a numeric weight to each term in a given document, representing an estimate (usually, but not necessarily, statistical) of the usefulness of the given term as a descriptor of the given document.
This means that the weight of a given term estimates its usefulness for distinguishing the given document from other documents in the same collection. Note that a given term may receive a different weight in each document in which it occurs; a term may be a better descriptor of one document than of another. The following procedure is applied to each local document:

i) Identify the individual text words.
ii) Remove special function (negative) words contained in a list of excluded words (and, of, or, but, etc.).
iii) Reduce the remaining words to word-stem form by applying a suffix deletion method.
iv) Assign a term weight to the remaining word stems based on the frequency of the stem in the individual local document, the overall inverse document frequency of the stem in the collection, and the local document length.
v) Represent each document as a vector of term weights in the document space.
vi) Classify each document into one or more categories under a threshold.
vii) Apply the previous steps to the queries to build the query vectors.
viii) Get the top 30 relevant documents for a given query, according to the cosine similarity measure.
ix) Compute the recall-precision for each query, and then the average recall-precision over the given queries.
x) Draw a graph that represents the average recall-precision relationship.

Figure 1 shows an example of an Information Retrieval System architecture.

Various Indexing Schemes

The simplest way to decide the weight of a term is to let the weight be the term frequency (tf), i.e., the number of occurrences of that term in the given document, computed for every document in the collection. In longer documents, however, terms tend to occur more frequently. Thus, to allow for variation in document size, the weight is usually "normalized". Two kinds of normalization are described in [8]. In the first, the term frequency tf is divided by tf max (the maximum term frequency in the document). This kind of normalization is called mn (maximum normalization). The second kind, antf (augmented normalized term frequency), is given by equation (1), which causes the normalized tf to vary between 0.5 and 1:

$$ W_{ij} = 0.5 + 0.5 \, \frac{tf_{ij}}{tf_{max}} \qquad (1) $$

where W_{ij} is the weight of term i in document j, tf_{ij} is the frequency of term i in document j, and tf_{max} is the maximum term frequency in document j. The purpose of term frequency normalization (in either form) is that the weight (the importance) of a term in a given document should depend on its frequency of occurrence relative to other terms in the same document, not on its absolute frequency of occurrence.
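As an illustration of steps i) to iv) and of the two term-frequency normalizations, the following is a minimal Python sketch, not the system's actual implementation. The stop list `STOP_WORDS`, the suffix list `SUFFIXES`, and the function names are hypothetical choices made for the example; in particular, `crude_stem` is a toy suffix-deletion routine and not the Porter algorithm [13].

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list; a real system would use a fuller one.
STOP_WORDS = {"and", "of", "or", "but", "the", "a", "it", "in", "is", "to"}

# Hypothetical suffix list for a crude suffix-deletion step (not the Porter stemmer).
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def index_terms(text):
    """Steps i-iii: identify words, drop stop words, reduce words to stems."""
    words = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(w) for w in words if w not in STOP_WORDS]


def mn_weights(stems):
    """Maximum normalization (mn): tf divided by the largest tf in the document."""
    tf = Counter(stems)
    if not tf:
        return {}
    tf_max = max(tf.values())
    return {term: count / tf_max for term, count in tf.items()}


def antf_weights(stems):
    """Augmented normalized tf (antf): 0.5 + 0.5 * tf / tf_max."""
    tf = Counter(stems)
    if not tf:
        return {}
    tf_max = max(tf.values())
    return {term: 0.5 + 0.5 * count / tf_max for term, count in tf.items()}


doc = "The system retrieves documents and the retrieved documents are ranked"
stems = index_terms(doc)
print(stems)
print(mn_weights(stems))
print(antf_weights(stems))
```

In this toy document the most frequent stems receive weight 1 under both schemes, while a stem that occurs once receives 0.5 under mn and 0.75 under antf, which shows how antf keeps every weight above 0.5.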
Weighting a term by absolute frequency would obviously tend to favor longer documents over shorter ones. However, there is a potential flaw in mn: the normalization factor for a given document depends only on the frequency of the most frequent term(s) in the document, so a single term with an unusually high frequency lowers the weights of all other terms. The same problem arises with antf, but to a less extreme degree: the most frequent term still has a weight of one, as with mn, but it cannot drag the weights of the other terms below 0.5.

A commonly used alternative for normalizing the term frequency is to take its natural logarithm plus a constant, e.g., "log(tf) + 1". This technique, called ltf (logarithmic term frequency), does not explicitly take document length or maximum term frequency into account, but it does reduce the importance of raw term frequency in collections with widely varying document lengths. It also reduces the effect of a term with an unusually high frequency within a given document. In general, it reduces the effect of all variation in term frequency.

Another measure, introduced in [10], is the inverse document frequency (idf), defined as follows:

$$ idf_i = \log \frac{N}{n_i} $$

where N is the number of documents and n_i is the total number of documents containing the term i. Several methods have been proposed for combining tf with the idf measure. The most successful and widely used scheme for automatic generation of weights is tf * idf. Another approach, proposed in [4], is given as follows:

$$ W_{ij} = \left( 0.5 + 0.5 \, \frac{freq_{ij}}{maxfreq_j} \right) \cdot \log \frac{N}{n_i} $$

where W_{ij} is the weight, freq_{ij} is the frequency of the term i in the document j, and maxfreq_j is the maximum frequency of any term in the document j. However, frequency and the other approaches can be used separately or together to obtain appropriate index weights.

Similarity and Recall-Precision measures

The proposed system is based on the vector space model [6], in which both documents and queries are represented as vectors. The components of a vector are the weights of the keywords extracted from the document or query. We use the ltf * idf, antf and antf * idf weighting schemes. The cosine similarity is then used to measure the relevance between a document and a query:

$$ sim(Q, D) = \frac{\sum_{i=1}^{t} q_i \, d_i}{\sqrt{\sum_{i=1}^{t} q_i^2} \; \sqrt{\sum_{i=1}^{t} d_i^2}} $$

where Q is the query vector, D is the vector of document d, q_i and d_i are the weights of term i in Q and D, and t is the number of index terms. We then rank the documents according to their similarity with each query, and measure the effectiveness of the system in retrieving relevant documents with the recall and precision measures:

$$ Recall = \frac{\text{number of relevant documents retrieved}}{\text{total number of relevant documents}} $$

$$ Precision = \frac{\text{number of relevant documents retrieved}}{\text{total number of documents retrieved}} $$

4. Experimental Studies

We studied a vector space model system with different weighting schemes applied to the CISI and NPL collections for 100 queries, computing the average recall-precision for each weighting scheme on each collection. Table 2 and Table 3 give the results for some of these schemes on the CISI and NPL collections respectively, and Figure 2 and Figure 3 show the corresponding recall-precision curves.

Table 2. Average recall-precision for 100 queries applied on the CISI collection (columns: recall; average precision for 100 test queries under normalization and normalization*idf)

Table 3. Average recall-precision for 100 queries applied on the NPL collection (columns: recall; average precision for 100 test queries under normalization and normalization*idf)
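The ranking and evaluation steps used in these experiments (cosine similarity between query and document vectors, then recall and precision over the retrieved list) can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' code: vectors are assumed to be sparse dictionaries mapping a term to its weight, `idf` assumes the natural-log form log(N / n_i), and the document identifiers, weights, and relevance judgments in the usage lines are invented toy data.

```python
import math


def idf(num_docs, doc_freq):
    """Inverse document frequency, assuming the form log(N / n_i)."""
    return math.log(num_docs / doc_freq)


def cosine(query_vec, doc_vec):
    """Cosine similarity of two sparse vectors stored as term -> weight dicts."""
    dot = sum(w * doc_vec.get(term, 0.0) for term, w in query_vec.items())
    q_norm = math.sqrt(sum(w * w for w in query_vec.values()))
    d_norm = math.sqrt(sum(w * w for w in doc_vec.values()))
    if q_norm == 0.0 or d_norm == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (q_norm * d_norm)


def rank(query_vec, doc_vecs, top_k=30):
    """Return the ids of the top_k documents in decreasing order of similarity."""
    scored = [(cosine(query_vec, vec), doc_id) for doc_id, vec in doc_vecs.items()]
    # Sort by similarity (descending); ties are broken by document id.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:top_k]]


def recall_precision(retrieved, relevant):
    """Recall and precision of a retrieved list against the relevant set."""
    hits = len(set(retrieved) & set(relevant))
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Toy data (hypothetical): three weighted document vectors and one query.
docs = {
    "d1": {"retriev": 1.0, "index": 0.75},
    "d2": {"genetic": 1.0},
    "d3": {"retriev": 0.75, "weight": 1.0},
}
query = {"retriev": 1.0}

top = rank(query, docs, top_k=2)
print(top, recall_precision(top, {"d1", "d2"}))
```

Storing vectors as dictionaries keeps the dot product proportional to the number of query terms, which suits short queries against a large vocabulary. The default `top_k=30` mirrors step viii) of the procedure in Section 3.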

Figure 2. Average recall-precision for 100 queries applied on the CISI collection (curves: augmented normalization, normalization*idf).

Figure 3. Average recall-precision for 100 queries applied on the NPL collection (curves: augmented normalization, normalization*idf).

5. Conclusion

From the comparison among the schemes shown in Table 2 and Table 3, we conclude that, in this study of 100 queries applied to the CISI and NPL collections, antf is more effective than antf * idf and ltf * idf. We also note that the idf of a given term is a statistical measure that characterizes the term relative to a given collection, not relative to a query. It would therefore be inefficient to recompute the weight of such a term in every document in which it occurs whenever new documents are added to the collection (or old documents are removed), since idf must be recomputed for each descriptor term in the affected collection.

6. Future Work

In future work, we will apply genetic algorithms (GAs) to the relevance feedback problem to improve the effectiveness of IR systems based on the vector space model, and compare that technique with one of the best traditional relevance feedback methods, the Ide dec-hi method [5]. In that work we will use the CISI and NPL collections, with the following experimental scheme for implementing relevance feedback with the different methods (the GAs and the Ide dec-hi method). For each collection, each query is compared with all the documents using the cosine similarity measure. This yields a list giving the similarity of each query with every document in the collection. This list is ranked in decreasing order of similarity.
The normalized document vectors corresponding to the top 15 documents of the list (those to be used as feedback), together with their relevance scores and the normalized query vector, are provided as input to the query optimization algorithm.

7. References

[1] A. Chen, J. He, L. Xu, F. C. Gey and J. Meggs, Chinese text retrieval without using a dictionary. ACM SIGIR'97, Philadelphia, PA, USA, pp. 42-49.
[2] C. J. Van Rijsbergen, Information Retrieval. Butterworths, London, second edition.
[3] D. Lewis, R. Schapire, J. P. Callan and R. Papka, Training algorithms for linear text classifiers. ACM SIGIR'96, Zurich, Switzerland.
[4] G. Salton and C. Buckley, Term weighting approaches in automatic text retrieval. Information Processing and Management.
[5] G. Salton and C. Buckley, Improving retrieval performance by relevance feedback. Journal of the American Society for Information Science, Vol. 41, No. 4.
[6] G. Salton and M. J. McGill, Introduction to modern information retrieval. Englewood Cliffs, NJ: Prentice-Hall, 1983.
[7] E. M. Voorhees and D. Harman, Proceedings of the sixth text retrieval conference, TREC-6, 1997.
[8] J. H. Lee, Combining multiple evidence from different properties of weighting schemes. Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
[9] K. L. Kwok, Comparing representations in Chinese information retrieval. ACM SIGIR'97, Philadelphia, PA, USA, pp. 34-41.
[10] K. S. Jones, A statistical interpretation of term specificity and its application in retrieval. J. Documentation, pp. 11-20.
[11] K. Sparck Jones and P. Willett (Eds.), Readings in information retrieval. San Francisco: Morgan Kaufmann.
[12] M. Bacchin, N. Ferro and M. Melucci, A probabilistic model for stemmer generation. Information Processing and Management, Vol. 41.
[13] M. F. Porter, An algorithm for suffix stripping. Program, Vol. 14, No. 3.
[14] M. Gordon, Probabilistic and genetic algorithms in document retrieval. Communications of the ACM, Vol. 31, No. 10.
[15] R. A. Baeza-Yates, Text retrieval: theory and practice. In International federation for information processing congress, Vol. 1, Madrid, Spain.
[16] W. B. Frakes and R. Baeza-Yates (Eds.), Information retrieval: data structures & algorithms. Englewood Cliffs, NJ: Prentice-Hall.
[17] W. Hersh, Information Retrieval: a Healthcare Perspective. Springer, 1996.