AFFAIR: Agent-based Flexible Framework and Architecture for Information Retrieval


Ottar Viken Valvåg

AFFAIR: Agent-based Flexible Framework and Architecture for Information Retrieval
Requirements elicitation, design and prototype implementation

Project assignment in TDT4710 Information Management, Specialization
Subject teacher: Ingeborg Sølvberg
Teaching supervisors: Ingeborg Sølvberg, Jeanine Lilleng

Department of Computer and Information Science, NTNU
Autumn 2003


Summary

Modern information retrieval systems often combine a variety of basic techniques to achieve good search results. I have developed AFFAIR, an agent-based, flexible framework suitable for experimentation with such combinations. The information retrieval process is modeled by an agent processing a query to produce a result set consisting of documents from a document collection. AFFAIR consists of a collection of Java interfaces defining the required functionality of these components.

The AFFAIR framework has been tested by implementing two simple information retrieval strategies as agents. A document collection driver was implemented to allow the agents to search articles from the Bergens Tidende newspaper. The testing showed that the implementation of information retrieval strategies as agents within the AFFAIR framework was feasible. A generic merging agent combining the results of other agents was also developed. This was used to combine the other two agents' results, and could be used to easily combine any information retrieval strategies implemented as agents in the AFFAIR framework.

Information retrieval strategies may be categorized into three different paradigms, referred to as the Set Theoretic, Vector and Probabilistic models of information retrieval. As part of the development of AFFAIR, existing research into the combination of information retrieval techniques has been reviewed. This review showed that the Vector model of information retrieval has a dominant position. It also showed that researchers have tended to investigate combination schemes that are limited to a single model of information retrieval. This is despite research results demonstrating that the combination of different retrieval paradigms provides the greatest potential improvements to retrieval results.

Studies of existing information retrieval systems revealed that few are explicitly designed to support the combination of different retrieval paradigms. In fact, most systems are based on a single model of information retrieval. In addition, many systems emphasize the optimization of efficiency and scalability over the flexibility needed to allow for cross-paradigm experiments. Thus, restrictions in existing frameworks, as well as the dominant position of the Vector model, may have contributed to limiting research into the combination of different information retrieval paradigms. AFFAIR may help to alleviate this problem.

It is concluded that AFFAIR seems to satisfy its objective of providing a suitable framework for experimentation with the combination of different retrieval techniques. Further work on the AFFAIR framework could include the addition of new collection drivers, the development of a better user interface, and the inclusion of batch query processing to support benchmarking experiments. Other uses of AFFAIR, such as its use as a collaborative or teaching tool, or its application to multilingual or non-textual information retrieval, could also be explored.


Acknowledgements

This report is the result of my project assignment in the Information Management specialization module of my Master of Technology in Computer Science degree at NTNU.

I would like to thank my teaching supervisors, Professor Ingeborg Sølvberg and dr.ing. fellow Jeanine Lilleng, at the Information Management group of NTNU's Department of Computer and Information Science, for their contributions to and advice about this project. Professor Sølvberg, with her extensive experience, has given well-founded guidance on the organization and presentation of my work. Jeanine, with her enthusiasm and attentiveness, has been a great source of inspiration and a driving force. In short, Professor Sølvberg's attention to the big picture and Jeanine's attention to detail have greatly improved the quality of this report.

I would also like to thank the Bergens Tidende newspaper, as well as Professor Torbjørn Nordgård at the Computer Linguistics section of NTNU's Department of Language and Communication Studies, for providing me with access to the Bergens Tidende document collection used to test the AFFAIR framework.


Reader's guide

Chapter 1 observes that modern information retrieval systems often combine a variety of basic techniques to achieve good search results. A unified framework for testing and developing new combinations is desired. This report attempts to define and prototype AFFAIR, an architecture suitable for such work.

In chapter 2, a brief overview of the common paradigms of textual information retrieval is given. Information retrieval strategies may be categorized into three main classes: Set Theoretic, Algebraic and Probabilistic information retrieval. The theoretical foundations of these paradigms are Set Theory, Vector Algebra and Probability Theory, respectively.

Chapter 3 goes on to study how different information retrieval strategies may be combined to produce better results. In general, the combination of multiple sources of evidence to produce a good overall result is known as multiple evidence combination. Research into the application of multiple evidence combination to information retrieval has to a large degree focused on combination within a single retrieval paradigm. Two possible reasons for this are mentioned. One is that the Vector Algebraic model of information retrieval has a dominant position. Another is that researchers tend to do experiments that are possible within their existing frameworks for information retrieval. Therefore, it is hypothesized that a general framework providing support for the easy combination of different information retrieval techniques, regardless of retrieval paradigm, could be beneficial.

In chapter 4, some existing systems and architectures for information retrieval are reviewed. It is found that few systems are explicitly designed to support multiple evidence combination. In fact, most systems are based on a single model of information retrieval. In addition, many systems emphasize the optimization of efficiency and scalability over the flexibility needed to allow for creative experiments. It is decided that AFFAIR can contribute by providing an abstract framework supporting the general process of information retrieval.

Chapter 5 discusses a suitable way to represent documents and collections of documents. This leads to the development of a general interface to document collections. By writing a collection driver implementing this interface, any kind of textual document collection may be incorporated into the proposed framework.

In chapter 6, the entire framework is specified in terms of clean interfaces. The information retrieval process is modeled by a search agent, which processes a query and returns a result set consisting of ranked documents. The agents can be used to implement any kind of information retrieval strategy.

Examples of a document collection driver and three agent implementations searching this collection are given in chapter 7. It is shown that different information retrieval paradigms can be easily implemented in AFFAIR. It is also demonstrated how an agent may produce its results by combining the efforts of other agents. Thus, arbitrary combinations of different retrieval techniques are possible.

In chapter 8, it is concluded that AFFAIR seems to satisfy its objective of providing a suitable framework for experimentation with the combination of different retrieval techniques. Possible framework extensions and new applications of AFFAIR are also mentioned.


Contents

1 Introduction
2 Basic Information Retrieval Techniques
   2.1 Set Theoretic IR techniques
       2.1.1 Fuzzy Information Retrieval
       2.1.2 Extended Boolean Information Retrieval
   2.2 Algebraic IR techniques
   2.3 Probabilistic IR techniques
   2.4 Structured IR techniques
3 Combining information retrieval techniques
   3.1 Combining result sets
   3.2 Combining queries
   3.3 Common combinations of techniques
   3.4 Discussion
4 Existing architectures for Information Retrieval
   4.1 The SMART system
   4.2 The URSA architecture
   4.3 The INQUERY system
   4.4 The OKAPI system
   4.5 The ProFusion engine
   4.6 The SENTINEL system
   4.7 The SIRE engine
   4.8 The AIRE engine
   4.9 Discussion
5 Documents and Document Collections
   5.1 Document types
       Design decisions
   5.2 Identification schemes
       Design decisions
   5.3 Metadata
       Design decisions
   5.4 Document granularity
       Design decisions
   5.5 Document formatting
       Design decisions
   5.6 Collection maintenance
   5.7 Summary
6 The AFFAIR framework
   6.1 Requirements
   6.2 Framework
   6.3 Prototype
       The Agent interface
       The Query interface
       The ResultSet interface
       The DocumentCollection interface
       The Document interface
       The Tools class
7 Testing of AFFAIR
   7.1 The Bergens Tidende document collection
   7.2 User interface
   7.3 Agent implementations
       A simplistic Boolean agent
       A vector-based agent
       A multiple evidence combination agent
   7.4 Discussion
8 Conclusion and further work
   Further work
References
Appendix A: Java language mechanisms
Appendix B: AFFAIR source code

Figures

Figure 1: The main components of the system
Figure 2: Simple agent collaboration
Figure 3: A taxonomy of information retrieval models
Figure 4: Venn diagram illustrating relevant and non-relevant overlap
Figure 5: Formulae for relevant and non-relevant overlap
Figure 6: SQL retrieval in SIRE
Figure 7: The components of an information retrieval process
Figure 8: The Document interface
Figure 9: Query execution using sequential access
Figure 10: Query execution using direct access
Figure 11: The central framework constituents
Figure 12: Class diagram of the AFFAIR prototype
Figure 13: Example document from BT collection
Figure 14: Same document in AFFAIR form
Figure 15: The no.ntnu.affair.drivers.bt package
Figure 16: Main loop of AFFAIR user interface
Figure 17: Output from a run of the user interface
Figure 18: XML-formatted search result
Figure 19: The performSearch() method of the ExampleAgent
Figure 20: The VectorAgent's indexing routine
Figure 21: The MergingAgent code
Figure 22: Implementation of the CombMNZ algorithm
Figure 23: MergingAgent search results

Tables

Table 1: Overview of examined IR systems
Table 2: The Dublin Core Metadata Element Set
Table 3: Mapping between BT and DC metadata categories
Table 4: DC metadata made explicit by the collection driver
Table 5: Untranslatable BT metadata categories
Table 6: Summary of test agent properties

1 Introduction

Information Retrieval (IR) is the process of locating documents relevant to some user-specified criterion from a collection of digital documents. Based on the user's specifications, the documents containing the most relevant information must be retrieved. The collection of documents can be a digitized library, the Internet, an image database or any other kind of information collection. In this report only documents comprised of textual information are considered.

In the context of textual documents, the general IR problem of finding information relevant to user criteria is usually boiled down to a search for documents based on a search query. The query can be anything from a small set of keywords to a central paragraph or even a template document. Based on the keywords, the information retrieval system must be able to understand what kind of information the user is looking for, and it must be able to evaluate any document and measure to what extent the requested information is present. This is an AI-complete problem, so approximate techniques are used to produce acceptable results. Some examples of techniques used to estimate a textual document's relevance with regard to a given set of keywords are:

- Word matching: Does the document contain the keywords?
- Use of dictionaries: Matches synonyms and related terms as well.
- N-grams: Used to match similar words, inflections, etc.
- Semantic analysis: Use of formal logic and natural language understanding to evaluate the information content of documents.
- Statistical methods: Take advantage of language statistics, such as the tendency of two words to occur together.

Some basic information retrieval techniques are described in greater detail in chapter 2. In practical research, a combination of techniques is used to achieve as good a result as possible. The results of the techniques are weighted and combined into one result. This is done in many different ways, usually depending on the techniques that are to be combined and the framework that is available.

A more uniform and general framework for experimenting with combinations of different IR techniques is desirable. This could make it easier for researchers to experiment with different combinations, eliminating or reducing the need for special adaptations for each new combination. In this report I look into the possibilities of creating such a framework using an agent-based architecture.

Agents are autonomous program units capable of working towards a set of goals.[1] The distinction between a software agent and a software component is a bit blurred, but an agent is in general thought to work at a higher level of abstraction. That is, an agent has high-level goals that must be achieved by low-level means. This requires a certain degree of intelligence, or problem-solving ability, in the agent. An important type of agent is the collaborative agent. Collaborative agents solve a larger problem by combining the efforts of many simpler agents. Such agents seem to be applicable to information retrieval.

The idea behind the agent-based architecture for information retrieval described in this report is to create a system consisting of a number of information retrieval agents. The agents have a uniform interface providing methods for searching a document collection, but each agent uses its own technique to perform this search. The framework then allows the user to specify searches, which techniques should be combined, and which document collection should be searched. The architecture is intended to be used for developing and benchmarking new IR methods. The principal components of the architecture are shown in Figure 1.

An important part of the development of this architecture is to specify the interfaces and protocols for communication between these components. It is important to design these interfaces as broadly and generally as possible, to maximize the number of different techniques that are compatible with the framework. At the same time, it is desirable to allow for relatively complex patterns of collaboration, so the interfaces must be able to convey finer details as well. In order to find out what the interfaces must provide, it is necessary to study different IR techniques in detail.

In addition to interagent communication, the agents need a uniform interface to the document collection. One possibility is to create an interface to a generic document collection, with drivers making different types of concrete document collections accessible through this interface. In this way a collection of text files and the Internet can be accessed in the same way.

[Figure 1: The main components of the system — an interface for query specification, a layer for search planning and combination of techniques, a set of agents (labeled CIA, KGB, MI6 and Mossad in the illustration) connected by a communication protocol, and an interface to document collections giving access to the document collection.]

Two additional interfaces must be defined to complete the architecture. The first one concerns a structure for representing partial or complete search results, to be used by agents to exchange results and to return a complete result to the user. The result sets should contain as much information as possible, including characteristics related to the IR technique that was used. These characteristics may be used by other agents when combining results and/or performing searches. They can also be useful when the complete results are analyzed in connection with a benchmarking experiment.

The final interface to be defined is a user interface to the system. Ideally, such an interface should provide simple tools to perform searches and specify new search methods. The specification of new search methods involves the specification of agent collaboration patterns. The simplest form of collaboration is the merging of two result sets produced by different agents. The final result set must be ranked in accordance with some weighting of the two partial result sets' attributes. More complex collaboration patterns may involve feedback loops, where partial search results guide further searches or are used to tune essential parameters. In such cases, internal characteristics stored in the result sets may play a critical role.

The design of a user interface allowing a fully flexible way of specifying agent collaboration patterns is probably a large task. In the absence of such an interface, expert involvement is required. An acceptable solution is to create a new agent implementing the desired collaboration pattern. This agent will coordinate the effort of other agents and compile the final result. An illustration of this approach is shown in Figure 2.

[Figure 2: Simple agent collaboration — a coordinating agent receives the query from the interface for query specification and delegates the search to other agents (labeled CIA, MI6 and Mossad in the illustration), which access the document collection through the interface to document collections.]

The prototype for this architecture is defined and implemented in Java, which is well suited to the programming of agents due to its object-oriented nature. Interfaces and protocols are also implemented in Java. Ideally, communication between system components should be independent of programming language. This would enable different parts of the system to be implemented in different programming languages. For example, an agent using formal logic reasoning could be implemented in Prolog, whereas a computation-intensive agent could be implemented in C. However, due to time considerations the initial prototype is programmed in Java in its entirety.

To summarize, there are a number of tasks that must be solved in order to realize the idea of an agent-based architecture for information retrieval. These include:

- The definition of a generic interface to document collections.
- The installation of a document collection and the implementation of a driver making the documents accessible through the interface mentioned above.
- The definition of a format for the storage and exchange of search results.
- The definition of an agent interface and a protocol for interagent communication.
- The implementation of two or more agents using different IR techniques.
- The definition of a user interface or framework for performing searches.
- The design of a user interface for specifying new ways of combining IR techniques.
- Testing the architecture by benchmarking an arbitrary method of information retrieval.

My specific tasks in relation to this work, listed in order of priority, have been:

- The definition of an agent interface and a protocol for interagent communication.
- The definition of a format for the storage and exchange of search results.
- The definition or discovery of a suitable document collection interface.
- The installation of a simple, limited document collection, and the implementation of a driver making the documents accessible through the interface mentioned above.
- The implementation of two or more agents using different IR techniques.

My aim was to develop a simple working prototype, where more advanced details in the interfaces and protocols can be added at a later time. The prototype was dubbed AFFAIR, standing for Agent-based Flexible Framework and Architecture for Information Retrieval.
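As a first impression of the kind of interfaces this architecture calls for, the following is a minimal Java sketch. All names and signatures here are illustrative assumptions only; the actual AFFAIR interfaces are specified in chapter 6 and listed in Appendix B.

```java
// Illustrative sketch only: the real AFFAIR interfaces are specified in
// chapter 6 and Appendix B. All names and signatures here are assumptions.
import java.util.List;

interface Query {
    List<String> getTerms();            // the user's search terms
}

interface ResultSet {
    List<String> getDocumentIds();      // retrieved documents, best match first
    double getScore(String documentId); // technique-specific relevance estimate
}

interface DocumentCollection {
    Iterable<String> getDocumentIds();  // enumerate the collection
    String getText(String documentId);  // raw document text
}

interface Agent {
    // Every retrieval technique is hidden behind this uniform method.
    ResultSet search(Query query, DocumentCollection collection);
}
```

Even at this level of detail, the key property is visible: every retrieval technique sits behind the same search() method, so agents can be swapped and combined freely.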

2 Basic Information Retrieval Techniques

In order to develop interfaces that are able to represent and exchange the information used by different information retrieval techniques, a closer look at the most common techniques is necessary. The following discussion is based on chapter 3 of the book by Baeza-Yates and Ribeiro-Neto[2], as well as chapter 17 of Jurafsky and Martin's book[3] and a survey by Faloutsos and Oard.[4]

Baeza-Yates and Ribeiro-Neto sort classic IR techniques into three basic categories: the Set Theoretic, Algebraic and Probabilistic models of information retrieval. In addition, models that take advantage of structural information about documents are considered. This taxonomy is illustrated in Figure 3, which is adapted from their book.[2]

[Figure 3: A taxonomy of information retrieval models. The classic models (Boolean, Vector, Probabilistic) are refined into Set Theoretic models (Fuzzy, Extended Boolean), Algebraic models (Generalized Vector, Latent Semantic Indexing, Neural Networks) and Probabilistic models (Inference Network, Belief Network). Structured models (Non-overlapping Lists, Proximal Nodes) form a separate branch. The models apply to IR applications such as search or filtering.]

In general, all current textual IR techniques view documents as a series of terms. A term usually corresponds to a word in the text, but may encompass phrases as well. Current IR techniques usually rely on the simplifying assumption that the semantics of a document are fully described by the individual terms present in it. Most often, the terms' actual order in the document is not even considered. This simplifies the information problem and enables search systems to construct indexes relating terms to documents, speeding up retrieval.

An approach rejecting the simplifying term assumption would have to interpret the semantics of a document and represent them in a machine-understandable structure that could be used to answer queries. Knowledge representation languages capable of this do exist, but a lot of manual work is required in order to supply the knowledge base necessary for efficient inference mechanisms. Knowledge representation techniques are still not mature enough to be applicable to general information retrieval problems. Nevertheless, it is important to keep in mind that the term assumption is indeed a simplifying assumption and not a law of nature.

In the following, the Set Theoretic, Algebraic, Probabilistic and Structured models for IR are reviewed. Some subcategorizations of these models are examined, and architecture requirements arising from the different models are identified.

2.1 Set Theoretic IR techniques

Set Theoretic IR techniques express information queries as boolean algebra expressions. For example, the need for documents concerning software architectures could be expressed as the query "software AND architecture". This query would produce a result consisting of all documents containing the terms "software" and "architecture". By viewing the documents containing the term "software" as one set, and the documents containing the term "architecture" as another, the result is achieved by taking the intersection of these two sets. Thus Set Theoretic IR may be viewed as a series of set operations such as intersection, union and complement, corresponding to the AND, OR and NOT operators, respectively.

Set Theoretic IR techniques have been widely adopted because of their simplicity and intuitiveness. The main problems with these techniques are:

- The query performs a binary decision; documents are either marked as relevant or irrelevant.
- Users have problems formulating sophisticated boolean queries.

2.1.1 Fuzzy Information Retrieval

Fuzzy Information Retrieval is an attempt to alleviate the problem of binary decisions. Fuzzy IR uses Fuzzy Set Theory instead of classical Set Theory to produce search results. In Fuzzy Set Theory, an element can partly belong to a set. This is expressed by a membership function taking a value from 0 to 1, where 0 indicates no membership and 1 indicates full membership. The set operations for intersection, union and complement calculate algebraic combinations of the elements' membership functions with regard to the two operand sets. This calculation is used to create membership functions of elements with regard to the resulting set. By using the same retrieval technique as in classic Set Theoretic IR, a final result set is produced in which every document has a degree of participation which is hopefully related to its relevance. For an explanation of how the documents' degrees of participation in the sets associated with each index term are calculated, see page 36 of Baeza-Yates and Ribeiro-Neto's book.[2]

2.1.2 Extended Boolean Information Retrieval

The Extended Boolean model for Information Retrieval is a hybrid technique combining some of the features of algebraic and set theoretic techniques. Similarly to other Set Theoretic techniques, Extended Boolean IR is performed by combining sets of documents using the set intersection, union or complement operations. As in Fuzzy IR, the implementation of the set operations has been changed. A parameter p is introduced which can be used to vary the effect of the set operations. When p = 1, the queries are evaluated in a way similar to the algebraic IR techniques discussed in section 2.2. When p = ∞, the operations equal the operations used in Fuzzy IR. Thus, the Extended Boolean technique provides great flexibility with regard to the specification and evaluation of queries. Again, we refer to Baeza-Yates and Ribeiro-Neto's book[2], page 40, for more details.
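To illustrate how a fuzzy query could be evaluated, the following is a minimal Java sketch of fuzzy set operations on membership values. The algebraic product/sum combination used here is one common choice (Zadeh's min/max is another); the sketch is an assumption for illustration and is not taken from the AFFAIR source.

```java
// A minimal sketch of fuzzy set operations on membership values, assuming
// the algebraic product/sum combinations (min/max is another common choice).
// Illustrative only; not taken from the AFFAIR source.
final class FuzzyOps {
    // AND: intersection of fuzzy sets
    static double and(double muA, double muB) {
        return muA * muB;             // algebraic product
    }
    // OR: union of fuzzy sets
    static double or(double muA, double muB) {
        return muA + muB - muA * muB; // algebraic sum
    }
    // NOT: complement of a fuzzy set
    static double not(double muA) {
        return 1.0 - muA;
    }

    public static void main(String[] args) {
        // A document with membership 0.8 for "software" and 0.4 for
        // "architecture" gets degree ~0.32 for "software AND architecture".
        System.out.println(and(0.8, 0.4));
    }
}
```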

2.2 Algebraic IR techniques

Algebraic IR techniques view documents and queries as vectors, where each dimension of the vector represents a term. Common terms thought to carry little information are usually eliminated using stop lists. The values of the different coordinates in a vector represent the weight or importance of the corresponding term. Similarity (in terms of information content) between documents and queries is then measured by comparing the n-dimensional distance between the corresponding vectors, where n is the total number of terms in the document collection, or by comparing the vectors' directions.

To calculate the vectors of documents, the frequencies of terms in the documents are counted. This produces a term-by-document matrix[3], in which every column of the matrix can be viewed as a vector representation of a given document. In addition to frequency counts for every document, the frequency of a term in the entire document collection is often used to normalize the weights in the matrix. The following scheme, known as tf-idf weighting[3], is widely used:

$$W_{i,j} = TF_{i,j} \cdot IDF_i$$

$W_{i,j}$ denotes the weight of term $i$ in the vector representation of document $j$. $TF_{i,j}$ is the frequency of term $i$ in document $j$, and $IDF_i = \log(N/n_i)$, where $N$ is the total number of documents and $n_i$ is the number of documents in which term $i$ occurs. IDF stands for inverse document frequency.

The calculation of the vector representation of the query is often more involved, as user queries often consist of few terms and rarely carry enough information to guarantee a good search result. Techniques such as stemming of terms, n-grams, relevance feedback and query expansion using thesauri are used to provide greater recall and/or precision.

With regard to the proposed architecture, this means that documents should be viewable as vectors, and that search results should include the vectors assigned to the query and the retrieved documents. Different algebraic search techniques may calculate vectors differently, and the vectors could potentially be used when the results of different IR techniques are combined. The calculation of document vectors is typically a preprocessing activity that should be supported by the architecture.
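As an illustration of the weighting scheme above, the following Java sketch computes tf-idf weights for a small collection of tokenized documents. It is a minimal, self-contained example, not part of the AFFAIR prototype.

```java
// Minimal tf-idf sketch (not part of the AFFAIR prototype): computes
// W[i][j] = TF[i][j] * log(N / n[i]) for a few tokenized documents.
import java.util.*;

public class TfIdf {
    public static void main(String[] args) {
        List<List<String>> docs = List.of(
                List.of("software", "architecture", "agents"),
                List.of("software", "retrieval"),
                List.of("information", "retrieval", "agents", "agents"));
        int N = docs.size();

        // n_i: number of documents containing term i
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : docs)
            for (String term : new HashSet<>(doc))
                docFreq.merge(term, 1, Integer::sum);

        // One weight vector per document: W_ij = TF_ij * log(N / n_i)
        for (List<String> doc : docs) {
            Map<String, Double> weights = new HashMap<>();
            for (String term : doc) {
                double tf = Collections.frequency(doc, term);
                double idf = Math.log((double) N / docFreq.get(term));
                weights.put(term, tf * idf);
            }
            System.out.println(weights);
        }
    }
}
```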

2.3 Probabilistic IR techniques

Whereas Set Theoretic IR techniques attempt to calculate a set of relevant documents by direct execution of the set operations specified in the query, and algebraic techniques calculate a similarity measure between all documents and the query, probabilistic IR techniques take a slightly different approach. These techniques try to estimate the probability that a document is relevant, or belongs to the so-called ideal set consisting of all relevant documents. Formally, an attempt is made to estimate the probability $P(R|d_j)$ for all $j$; that is, the probability that document $j$ belongs to the set $R$ of relevant documents. Instead of using this probability as a ranking criterion, the odds of a document being relevant are used, expressed as the ratio of the probability of a document being relevant to the probability of it being irrelevant: $P(R|d_j)/P(\bar{R}|d_j)$. Using Bayes' rule, the problem can be reformulated to estimating the ratio $P(d_j|R)/P(d_j|\bar{R})$. The probabilities involved in this ratio are the probability of drawing document $j$ from the relevant set and the probability of drawing document $j$ from the irrelevant set, respectively. These probabilities are much easier to estimate than the probabilities in the original ratio. By assuming independence of the terms in a query, estimating $P(d_j|R)/P(d_j|\bar{R})$ can again be broken down to estimating $P(k_i|R)$ and $P(k_i|\bar{R})$ for all terms $k_i$ of the query.

Given an initial estimate of these probabilities, it is possible to recursively refine the estimates using a set $V$ of documents thought to be relevant. The set $V$ can be determined by a human user (relevance feedback), or simply by taking the top 10 documents of the previous round of the recursion (blind relevance feedback).

Thus, a probabilistic query should include the usual search terms, but also a set $V$ of documents thought to be relevant, and estimates of the probabilities $P(k_i|R)$ and $P(k_i|\bar{R})$ for all the search terms. The initial query can omit these values and let the search agent determine sensible initial estimates. Also, the result set should include the estimated probabilities.

In theory, probabilistic IR techniques could perform well because documents are ranked according to their estimated probability of being relevant. The main disadvantage of probabilistic IR techniques is that they don't take the frequency of index terms within documents into account. As in the classic set theoretic model, the term weights are binary.

2.4 Structured IR techniques

Some IR techniques reference the structure of a document, such as font formatting, layout, paragraphs, etc. This enables the user to specify queries referencing structural information, such as a search for all illustrations with captions containing the words "information retrieval taxonomy". Naturally, these techniques need access to the structural information of a document. Unfortunately, no format for document structure is universally adopted. Thus, the document collection drivers should ideally translate the structural information of their documents to a standard format. In the absence of such a mechanism, agents employing structured models for IR will have to access the raw document data directly and derive whatever structural information is possible.
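To round off this chapter, the following Java sketch illustrates the initial ranking step of the probabilistic model from section 2.3. It assumes the commonly used initial estimates $P(k_i|R) = 0.5$ and $P(k_i|\bar{R}) = n_i/N$; both the estimates and the code are illustrative assumptions, not the AFFAIR implementation.

```java
// Sketch of the initial ranking step of the probabilistic model in
// section 2.3, assuming the common initial estimates P(k_i|R) = 0.5 and
// P(k_i|~R) = n_i / N. Illustrative only; not the AFFAIR implementation.
import java.util.*;

public class ProbabilisticRanking {
    public static void main(String[] args) {
        List<Set<String>> docs = List.of(
                Set.of("software", "architecture", "agents"),
                Set.of("information", "retrieval", "agents"),
                Set.of("digital", "library", "catalogs"),
                Set.of("image", "database", "retrieval"),
                Set.of("web", "search", "engines"));
        Set<String> query = Set.of("software", "agents");
        int N = docs.size();

        // n_i for each query term
        Map<String, Long> n = new HashMap<>();
        for (String k : query)
            n.put(k, docs.stream().filter(d -> d.contains(k)).count());

        // Rank each document by the log-odds of relevance, summed over the
        // query terms it contains (term weights are binary).
        for (Set<String> doc : docs) {
            double score = 0.0;
            for (String k : query) {
                if (!doc.contains(k)) continue;
                double pR = 0.5;                    // initial P(k_i | R)
                double pNR = (double) n.get(k) / N; // initial P(k_i | ~R)
                score += Math.log((pR * (1 - pNR)) / (pNR * (1 - pR)));
            }
            System.out.println(doc + " -> " + score);
        }
    }
}
```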

3 Combining information retrieval techniques

In this chapter, different ways of combining information retrieval techniques are reviewed, and some commonly used combinations of information retrieval techniques are identified. This is necessary to determine what kinds of agent interaction and collaboration patterns must be supported by the proposed architecture.

The idea of combining the efforts of different information retrieval engines is not new. Various experiments have been done in order to investigate the practical utility of such approaches. The combination of different sources of information to produce an improved result is sometimes referred to as multiple evidence combination in the literature. Fusion or data fusion are also common terms, but they are sometimes limited to the merging techniques discussed in section 3.1. The most general term seems to be multiple evidence combination.

In the following, some experiments using multiple evidence combination are reviewed. Section 3.1 discusses multiple evidence combinations involving multiple information retrieval strategies, whereas section 3.2 discusses approaches using only one retrieval strategy but varying other parameters such as the query or document representation. Section 3.3 lists the techniques that are most commonly combined, and section 3.4 discusses the implications of recent research with regard to the proposed architecture.

3.1 Combining result sets

A common type of data fusion is merging, which involves combining two or more ranked result sets produced by different information retrieval engines. Fox and Shaw[17] describe two simple algorithms for combining such result sets, CombSUM and CombMNZ. These algorithms assume that every document in a result set has been assigned a relevance estimate or similarity value ranging from 0 to 1. CombSUM gives every document a new score equaling the sum of the relevance estimates for the document in each result set. The merged result set is then ranked according to this score. CombMNZ favors documents that are judged relevant in many result sets, as the CombSUM score of a document is multiplied by the number of result sets in which the document occurs.

Lee[13] used the same merging algorithms, but considered document rank instead of similarity value, and showed that assigning new scores based on the ranks of the documents gives better results when the result sets to be merged exhibit quite different rank-similarity curves. Lee also investigated the conditions that must be present in order to benefit from the merging of result sets, and found that the relevant and non-relevant overlap of different result sets are correlated with the degree of improvement that can be expected from these and other merging algorithms.

The relevant overlap of two result sets is found by taking the number of relevant documents found in both result sets and dividing by the number of relevant documents found in at least one of the result sets. Intuitively, relevant overlap measures the degree to which different techniques make the same correct predictions. Similarly, the non-relevant overlap of two result sets is found by taking the number of non-relevant documents found in both result sets and dividing by the number of non-relevant documents found in at least one of the result sets. Intuitively, non-relevant overlap measures the degree to which different techniques make the same mistakes or omissions. The concepts of relevant and non-relevant overlap are illustrated in the Venn diagram in Figure 4.

[Figure 4: Venn diagram illustrating relevant and non-relevant overlap. A large rectangle represents all documents, divided into relevant and non-relevant documents. Two overlapping result sets are outlined; R_1 and R_2 denote the relevant documents unique to each result set, R_both the relevant documents common to both, and IR_1, IR_2 and IR_both the corresponding non-relevant documents.]

In the diagram, the large rectangle represents all documents, which are divided into relevant documents (green waves) and non-relevant documents (red crosshatches). Two result sets are shown, outlined by dashed blue and dotted purple lines. The relevant overlap of these two result sets is the ratio $R_{both}/(R_1 + R_2 + R_{both})$. Similarly, the non-relevant overlap is the ratio $IR_{both}/(IR_1 + IR_2 + IR_{both})$. Formulae for calculating relevant and non-relevant overlap, generalized to any number of result sets, are shown in Figure 5.

Figure 5: Formulae for relevant and non-relevant overlap:

$$\mathit{ROverlap} = \frac{|R \cap S_1 \cap S_2 \cap \dots \cap S_n|}{|(R \cap S_1) \cup (R \cap S_2) \cup \dots \cup (R \cap S_n)|}$$

$$\mathit{NROverlap} = \frac{|\mathit{NR} \cap S_1 \cap S_2 \cap \dots \cap S_n|}{|(\mathit{NR} \cap S_1) \cup (\mathit{NR} \cap S_2) \cup \dots \cup (\mathit{NR} \cap S_n)|}$$

In the formulae, $R$ denotes the set of all relevant documents, $\mathit{NR}$ is the set of all non-relevant documents, and $S_1, \dots, S_n$ are the different result sets.

Zhang et al. built on the idea of relevant overlap in developing a novel fusion algorithm.[8] Their algorithm first clustered the documents in each result set. Then, clusters that had high overlap with clusters in other result sets were judged to be more relevant than others, and the documents belonging to such clusters were assigned a higher score. Finally, the new scores were used to combine all the result sets into a single set. Their experiments showed that this merging algorithm compared favorably to other, more conventional merging algorithms.
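A direct translation of these formulae into code may make them more concrete. The sketch below computes an overlap measure for any number of result sets, given a set of relevance judgments; it is an illustration, not AFFAIR code.

```java
// Illustrative computation of Lee's overlap measures (not AFFAIR code):
// |R ∩ S_1 ∩ ... ∩ S_n| / |(R ∩ S_1) ∪ ... ∪ (R ∩ S_n)|.
import java.util.*;

public class Overlap {
    static double overlap(Set<String> judged, List<Set<String>> resultSets) {
        Set<String> inAll = new HashSet<>(judged); // judged docs found in every set
        Set<String> inAny = new HashSet<>();       // judged docs found in any set
        for (Set<String> s : resultSets) {
            inAll.retainAll(s);
            for (String doc : s)
                if (judged.contains(doc)) inAny.add(doc);
        }
        return inAny.isEmpty() ? 0.0 : (double) inAll.size() / inAny.size();
    }

    public static void main(String[] args) {
        Set<String> relevant = Set.of("d1", "d2", "d3");
        List<Set<String>> results = List.of(Set.of("d1", "d2", "d4"),
                                            Set.of("d2", "d3", "d5"));
        System.out.println("ROverlap = " + overlap(relevant, results)); // 1/3
        // NROverlap: call overlap() with the set of non-relevant documents.
    }
}
```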

Mounir and associates[11] tested the feasibility of fusion in the context of Web-based search engines. They developed a heuristic system called FIRE which provided fusion heuristics that outperformed the basic CombSUM and CombMNZ heuristics. Their system combined the outputs of three different search engines, and they found that fusing the search engines' result sets using the CombMNZ algorithm produced better results than any of the individual search engines. They went on to develop heuristics that not only took the document ranks into account, but also took advantage of the raw scores assigned to the documents by the various search engines. The various scores were linearly weighted, and an optimal weighting was determined experimentally. The experiments showed that the various search engines should be weighted according to their dissimilarity, and not just by how well they perform individually. They observe that there is little to be gained by combining two very good engines if they both produce the same results.

Bartell and his colleagues[15] devised a linear weighting scheme for the combination of different result sets which is applicable to all ranked result sets. Using well-understood numerical optimization techniques and a set of training queries, they demonstrate how the optimal combination of any set of techniques can be determined. They point out that the evaluation of a technique in isolation gives little indication of the technique's potential usefulness in combination with other techniques. Methods that perform badly on their own can contribute greatly to the refinement of another technique. They further suggest that their automatic weighting scheme may be used to identify promising combinations of techniques; however, the problem of identifying an initial set of techniques to be evaluated remains.

McCabe and her colleagues wanted to check whether systemic differences, such as variations in parsers and the use of stemming, stop words and other indexing techniques, influence the results of combining techniques.[10] They implemented a probabilistic and a vector space retrieval strategy using the same parser and the same relational retrieval engine. They found that the results were consistent with previous experiments, and concluded that Lee's overlap ratio seems to be a suitable indicator of the potential for benefiting from result set merging.

Two years later, however, the same team investigated the combination of highly effective retrieval strategies with systemic differences held constant.[9] Their results showed that the implemented strategies produced good individual results, whereas the combination of the individual results provided an improvement of only 4%. They believe that as the effectiveness of the component retrieval strategies increases, the benefit of their combination decreases. The most effective retrieval strategies known produce more similar results than previously thought, they claim.

This claim was further investigated by some of the same people. In 2003, they showed empirically that Lee's overlap ratio is an insufficient indicator of the improvements to be gained from the fusion of techniques.[5] For fusion to improve effectiveness, the result sets being fused must contain a significant number of unique relevant documents, which must be highly ranked, they conclude. They explain the previous empirical support for Lee's hypothesis by systemic variations such as parsing rules, stemming, phrase processing and relevance feedback. Apparently, they do not consider these variations to be part of a retrieval strategy. However, they offer no suggestions as to what systemic properties should be used. It would seem to me that existing fusion algorithms definitely have merit as long as no universal set of optimal systemic properties has been identified.
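The CombSUM and CombMNZ heuristics discussed in this section are simple enough to sketch directly. The following Java sketch is an illustration only; the AFFAIR prototype's own CombMNZ implementation is shown in Figure 22.

```java
// Illustrative CombSUM/CombMNZ merging (after Fox and Shaw): each result
// set maps document ids to similarity values in [0, 1]. Not the AFFAIR
// implementation, which is shown in Figure 22.
import java.util.*;

public class Fusion {
    static Map<String, Double> merge(List<Map<String, Double>> resultSets,
                                     boolean mnz) {
        Map<String, Double> sum = new HashMap<>();   // CombSUM score
        Map<String, Integer> hits = new HashMap<>(); // sets containing the doc
        for (Map<String, Double> rs : resultSets) {
            for (Map.Entry<String, Double> e : rs.entrySet()) {
                sum.merge(e.getKey(), e.getValue(), Double::sum);
                hits.merge(e.getKey(), 1, Integer::sum);
            }
        }
        if (mnz) // CombMNZ: multiply by the number of sets containing the doc
            sum.replaceAll((doc, score) -> score * hits.get(doc));
        return sum; // rank documents by descending score
    }

    public static void main(String[] args) {
        List<Map<String, Double>> sets = List.of(
                Map.of("d1", 0.9, "d2", 0.4),
                Map.of("d2", 0.8, "d3", 0.5));
        // CombMNZ: d2 scores (0.4 + 0.8) * 2 = 2.4 and is ranked first.
        System.out.println(merge(sets, true));
    }
}
```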

3.2 Combining queries

The previous section focused on the combination of result sets produced by different retrieval strategies. However, the combination of result sets also has its applications when the same retrieval mechanism is used for all result sets. In these cases, the variations in the result sets are caused by variations in the query or document representations.

Fox and Shaw performed their retrieval runs on different partitions of the TREC collections in parallel, and subsequently merged the result sets.[17] This was done because of hardware and storage limitations. A complicating factor when merging results from different collections is that collection statistics are sometimes used to rank documents in a result set. Thus, the collection influences the characteristics of the result set, and result sets from different collections are not directly comparable. A prime example of an often used collection statistic is the inverse document frequency of the tf-idf vector based model. Fox and Shaw avoided this problem by using ranking algorithms that didn't employ collection-specific statistics.

Belkin and colleagues used human search experts to formulate five independent boolean query formulations designed to solve the same information need.[16] These queries were automatically combined by the INQUERY probabilistic inference network retrieval engine. The engine then used the combined query to retrieve a result set. Experiments showed that the combined boolean queries outperformed every individual boolean query. After experimenting with the weighting of the different queries, they were also able to use the boolean queries to enhance the performance of INQUERY's natural language based queries, which were used as a baseline, even though the individual boolean queries all performed worse than the baseline.

Lee developed a fully automatic way of generating multiple query representations for a given information problem.[12] First, an initial query was generated for the given information problem, and an initial retrieval was performed. Then five different relevance feedback methods implemented in the SMART system were applied to the top-retrieved documents of the initial run and used to produce five different query expansions. These five feedback methods generated quite dissimilar query expansions, but the different query expansions resulted in similar levels of retrieval effectiveness. Consequently, they were successfully combined into one final result set which outperformed the individual queries.

Lee also experimented with the combination of different term weighting schemes for the same queries.[14] In vector based models, the terms of a query must be weighted and matched with a term vector representation of the various documents in the collection. Needless to say, the weighting scheme used to determine the vector representation of a document influences the result of the search. Lee showed that different weighting schemes have different strengths and weaknesses and thus benefit from multiple evidence combination.

Kamps, Monz and de Rijke experimented with multiple evidence combination in the context of multilingual information retrieval.[7] In their experiments, the queries and documents were expressed in different languages. They combined n-gram techniques with morphological tools in the formulation and translation of queries. They found that combining runs could consistently improve even high quality base runs, and that the interpolation factors required for the best gain in performance seemed to be fairly robust across different topics.

Chowdhury, Beitzel and Jensen experimented with the combination of different TREC query representations.[6] The TREC conferences for the evaluation of text retrieval techniques use a collection of documents categorized into topics. If the topic descriptions are viewed as queries, the categorization can be viewed as a gold standard result of these queries over the entire document collection. The topic descriptions are passages of text divided into title, description, narrative and other sections. Chowdhury and his colleagues used two different query representations: one short representation consisting of the title and description of the topic, and one long representation consisting of the title and narrative of the topic. The two query representations were used to retrieve documents with their lab's IR engine, AIRE. A merging technique that took the large difference in the two representations' query lengths into account was then suggested, and this technique was found to outperform the traditional CombSUM and CombMNZ merging algorithms in this setting. This experiment illustrates that when differences in result set characteristics are caused by systematic differences in query representations, the merging algorithm should take these underlying causes into account.

3.3 Common combinations of techniques

The two previous sections have described various ways of combining queries or result sets. A lot of research has focused on the actual merging algorithms. In many cases, variations around a single technique are employed to produce multiple sources of evidence. In this section some examples of the combination of two different retrieval techniques are listed.

In their influential TREC experiments[17], Fox and Shaw tested the combination of vector based retrieval with P-norm extended boolean retrieval. Two different vector representations of queries were used, and three different p parameters were used in the extended boolean retrieval runs. The five result sets were combined in various ways using various merging algorithms.

Belkin et al.'s experiments on the combination of boolean queries tested the combination of classical boolean and natural language queries.[16]

In demonstrating the abilities of the SIRE information retrieval engine and experimenting with the impact of systemic differences on fusion effectiveness, McCabe et al. combined a probabilistic and a vector space retrieval technique.[10] In later experiments investigating the validity of the relevant overlap hypothesis put forward by Lee, they used the same engine with multiple vector-based and probabilistic techniques.[5]

The remainder of the examined experiments are generally limited to experiments over variations of the vector space model for information retrieval. This model apparently has a dominating position, as it is included in every multiple evidence combination experiment that I have reviewed, with the exception of Belkin et al.'s combination of boolean and natural language queries dating back to 1993.[16] No example of a combination of a probabilistic and a boolean technique has been found.

3.4 Discussion

There is wide evidence in the literature that the combination of multiple sources of evidence can greatly improve information retrieval performance. Critics have pointed out that the utility of multiple evidence combination is reduced as the performance of the individual techniques increases. This is only natural. The goal of research should at any rate be to develop as good an information retrieval strategy as possible, and multiple evidence combination may provide vital contributions to this work.
Fox and Shaw found that the combination of different search paradigms provided better results than just combining multiple similar queries.[17] In her overview of the first TREC conference for the evaluation of text retrieval techniques[19], Donna Harman commented that the various submitted strategies retrieved very different documents. Thus, there should be great potential for improvement in the combination of many or all common retrieval techniques. Consequently, it is somewhat surprising that more effort hasn't been put into research on combinations of totally different search paradigms. Instead, researchers have shown great creativity in devising new ways of employing multiple evidence combination within a single paradigm. One reason for this may be the dominant position of the vector based model for information retrieval. Another reason may be that researchers have performed the experiments that have been practically possible within their existing frameworks for information retrieval. As we will see in the next chapter, these frameworks tend to be based on a single model of information retrieval.

4 Existing architectures for Information Retrieval

Any attempt at developing a new architecture should take existing efforts into consideration. In this chapter, some examples of existing frameworks and architectures for information retrieval are examined. This will enable us to study key features that should be present in our proposed architecture in order to aid further research into multiple evidence combination. In addition, the features making the proposed agent-based architecture original and worthwhile are identified.

The systems considered subsequently are chosen because they represent typical implementations of various search paradigms, or because they are applicable or have been applied to multiple evidence combination. Table 1 gives an overview of the systems that have been reviewed. The systems are reviewed in the order that they were developed.

Search system | Underlying IR model(s)  | Special features
SMART         | Vector                  | Boolean queries may be expressed
URSA          | Vector                  | SQL-like query language; distributed architecture
INQUERY       | Probabilistic           | Inference network
OKAPI         | Probabilistic           | Manual relevance feedback
ProFusion     | Multiple evidence       | Limited to web searching
SENTINEL      | Vector + n-grams        | 3D visualization tool
SIRE          | Relational vector model | Built on an RDBMS
AIRE          | Flexible                | Built to support multiple paradigms using a single parser

Table 1: Overview of examined IR systems

4.1 The SMART system

The SMART information retrieval system was developed at Cornell University in Ithaca, New York, and dates back to the early 1960s. The system has been reimplemented many times. This discussion is based on the C implementation described by Chris Buckley in 1985.[32]

The goals of SMART were twofold. First, it should provide a flexible experimental system for research in information retrieval. Second, it should provide a fast, portable and interactive environment for actual users. Buckley notes that these two goals naturally conflict with each other, and states that the current SMART design is an attempt to satisfy each as much as possible. Two types of flexibility are emphasized in SMART: flexible algorithm parameters and flexible design. The flexible parameters allow experimenters to easily try out new tweaks to the underlying algorithms. The design is a collection of modules and library routines typical of well-structured C programs, and Buckley argues that this allows programmers to easily modify the algorithms.

SMART is based on the vector model for information retrieval. Within this model, some variations are possible. The system supports three different modes of access to the document collection:

- Sequential: iterate through all documents
- Inverted file: look up the relevant files in an index
- Clustered: iterate through clusters of documents
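To make the difference between these access modes concrete, the following is a minimal Java sketch of the inverted file mode: an index mapping terms to the documents containing them, so that a query only touches candidate documents instead of scanning the whole collection. It illustrates the general idea only, not SMART's actual data structures.

```java
// Minimal illustration of inverted-file access (the general idea, not
// SMART's actual data structures): map each term to the ids of the
// documents containing it, then answer queries from the index alone.
import java.util.*;

public class InvertedFile {
    public static void main(String[] args) {
        Map<String, List<String>> docs = Map.of(
                "d1", List.of("software", "architecture"),
                "d2", List.of("information", "retrieval"),
                "d3", List.of("software", "agents"));

        // Build the inverted index: term -> sorted set of document ids.
        Map<String, Set<String>> index = new HashMap<>();
        for (var entry : docs.entrySet())
            for (String term : entry.getValue())
                index.computeIfAbsent(term, t -> new TreeSet<>())
                     .add(entry.getKey());

        // Sequential access would scan d1..d3; inverted access looks up
        // the posting list for the query term directly.
        System.out.println(index.getOrDefault("software", Set.of()));
        // prints [d1, d3]
    }
}
```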


More information

SEARCH ENGINE WITH PARALLEL PROCESSING AND INCREMENTAL K-MEANS FOR FAST SEARCH AND RETRIEVAL

SEARCH ENGINE WITH PARALLEL PROCESSING AND INCREMENTAL K-MEANS FOR FAST SEARCH AND RETRIEVAL SEARCH ENGINE WITH PARALLEL PROCESSING AND INCREMENTAL K-MEANS FOR FAST SEARCH AND RETRIEVAL Krishna Kiran Kattamuri 1 and Rupa Chiramdasu 2 Department of Computer Science Engineering, VVIT, Guntur, India

More information

Stemming Methodologies Over Individual Query Words for an Arabic Information Retrieval System

Stemming Methodologies Over Individual Query Words for an Arabic Information Retrieval System Stemming Methodologies Over Individual Query Words for an Arabic Information Retrieval System Hani Abu-Salem* and Mahmoud Al-Omari Department of Computer Science, Mu tah University, P.O. Box (7), Mu tah,

More information

In mathematics, there are four attainment targets: using and applying mathematics; number and algebra; shape, space and measures, and handling data.

In mathematics, there are four attainment targets: using and applying mathematics; number and algebra; shape, space and measures, and handling data. MATHEMATICS: THE LEVEL DESCRIPTIONS In mathematics, there are four attainment targets: using and applying mathematics; number and algebra; shape, space and measures, and handling data. Attainment target

More information

Guideline for Implementing the Universal Data Element Framework (UDEF)

Guideline for Implementing the Universal Data Element Framework (UDEF) Guideline for Implementing the Universal Data Element Framework (UDEF) Version 1.0 November 14, 2007 Developed By: Electronic Enterprise Integration Committee Aerospace Industries Association, Inc. Important

More information

Improved Software Testing Using McCabe IQ Coverage Analysis

Improved Software Testing Using McCabe IQ Coverage Analysis White Paper Table of Contents Introduction...1 What is Coverage Analysis?...2 The McCabe IQ Approach to Coverage Analysis...3 The Importance of Coverage Analysis...4 Where Coverage Analysis Fits into your

More information

Current Standard: Mathematical Concepts and Applications Shape, Space, and Measurement- Primary

Current Standard: Mathematical Concepts and Applications Shape, Space, and Measurement- Primary Shape, Space, and Measurement- Primary A student shall apply concepts of shape, space, and measurement to solve problems involving two- and three-dimensional shapes by demonstrating an understanding of:

More information

COMPUTER SCIENCE (5651) Test at a Glance

COMPUTER SCIENCE (5651) Test at a Glance COMPUTER SCIENCE (5651) Test at a Glance Test Name Computer Science Test Code 5651 Time Number of Questions Test Delivery 3 hours 100 selected-response questions Computer delivered Content Categories Approximate

More information

User research for information architecture projects

User research for information architecture projects Donna Maurer Maadmob Interaction Design http://maadmob.com.au/ Unpublished article User research provides a vital input to information architecture projects. It helps us to understand what information

More information

Constructing a TpB Questionnaire: Conceptual and Methodological Considerations

Constructing a TpB Questionnaire: Conceptual and Methodological Considerations Constructing a TpB Questionnaire: Conceptual and Methodological Considerations September, 2002 (Revised January, 2006) Icek Ajzen Brief Description of the Theory of Planned Behavior According to the theory

More information

Managing Variability in Software Architectures 1 Felix Bachmann*

Managing Variability in Software Architectures 1 Felix Bachmann* Managing Variability in Software Architectures Felix Bachmann* Carnegie Bosch Institute Carnegie Mellon University Pittsburgh, Pa 523, USA fb@sei.cmu.edu Len Bass Software Engineering Institute Carnegie

More information

Information Retrieval. Lecture 8 - Relevance feedback and query expansion. Introduction. Overview. About Relevance Feedback. Wintersemester 2007

Information Retrieval. Lecture 8 - Relevance feedback and query expansion. Introduction. Overview. About Relevance Feedback. Wintersemester 2007 Information Retrieval Lecture 8 - Relevance feedback and query expansion Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 32 Introduction An information

More information

Information Retrieval Models

Information Retrieval Models Information Retrieval Models Djoerd Hiemstra University of Twente 1 Introduction author version Many applications that handle information on the internet would be completely inadequate without the support

More information

3. Mathematical Induction

3. Mathematical Induction 3. MATHEMATICAL INDUCTION 83 3. Mathematical Induction 3.1. First Principle of Mathematical Induction. Let P (n) be a predicate with domain of discourse (over) the natural numbers N = {0, 1,,...}. If (1)

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

Introduction to Fixed Effects Methods

Introduction to Fixed Effects Methods Introduction to Fixed Effects Methods 1 1.1 The Promise of Fixed Effects for Nonexperimental Research... 1 1.2 The Paired-Comparisons t-test as a Fixed Effects Method... 2 1.3 Costs and Benefits of Fixed

More information

IFS-8000 V2.0 INFORMATION FUSION SYSTEM

IFS-8000 V2.0 INFORMATION FUSION SYSTEM IFS-8000 V2.0 INFORMATION FUSION SYSTEM IFS-8000 V2.0 Overview IFS-8000 v2.0 is a flexible, scalable and modular IT system to support the processes of aggregation of information from intercepts to intelligence

More information

A Business Process Services Portal

A Business Process Services Portal A Business Process Services Portal IBM Research Report RZ 3782 Cédric Favre 1, Zohar Feldman 3, Beat Gfeller 1, Thomas Gschwind 1, Jana Koehler 1, Jochen M. Küster 1, Oleksandr Maistrenko 1, Alexandru

More information

Business Intelligence and Decision Support Systems

Business Intelligence and Decision Support Systems Chapter 12 Business Intelligence and Decision Support Systems Information Technology For Management 7 th Edition Turban & Volonino Based on lecture slides by L. Beaubien, Providence College John Wiley

More information

Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment

Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment 2009 10th International Conference on Document Analysis and Recognition Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment Ahmad Abdulkader Matthew R. Casey Google Inc. ahmad@abdulkader.org

More information

Purposes and Processes of Reading Comprehension

Purposes and Processes of Reading Comprehension 2 PIRLS Reading Purposes and Processes of Reading Comprehension PIRLS examines the processes of comprehension and the purposes for reading, however, they do not function in isolation from each other or

More information

Fuzzy Logic for Software Metric Models Throughout the Development Life-Cycle. Andrew Gray Stephen MacDonell

Fuzzy Logic for Software Metric Models Throughout the Development Life-Cycle. Andrew Gray Stephen MacDonell DUNEDIN NEW ZEALAND Fuzzy Logic for Software Metric Models Throughout the Development Life-Cycle Andrew Gray Stephen MacDonell The Information Science Discussion Paper Series Number 99/20 September 1999

More information

Predict the Popularity of YouTube Videos Using Early View Data

Predict the Popularity of YouTube Videos Using Early View Data 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING EFFICIENT DATA PRE-PROCESSING FOR DATA MINING USING NEURAL NETWORKS JothiKumar.R 1, Sivabalan.R.V 2 1 Research scholar, Noorul Islam University, Nagercoil, India Assistant Professor, Adhiparasakthi College

More information

16.1 MAPREDUCE. For personal use only, not for distribution. 333

16.1 MAPREDUCE. For personal use only, not for distribution. 333 For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several

More information

South Carolina College- and Career-Ready (SCCCR) Probability and Statistics

South Carolina College- and Career-Ready (SCCCR) Probability and Statistics South Carolina College- and Career-Ready (SCCCR) Probability and Statistics South Carolina College- and Career-Ready Mathematical Process Standards The South Carolina College- and Career-Ready (SCCCR)

More information

SIGIR 2004 Workshop: RIA and "Where can IR go from here?"

SIGIR 2004 Workshop: RIA and Where can IR go from here? SIGIR 2004 Workshop: RIA and "Where can IR go from here?" Donna Harman National Institute of Standards and Technology Gaithersburg, Maryland, 20899 donna.harman@nist.gov Chris Buckley Sabir Research, Inc.

More information

Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework

Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework Usha Nandini D 1, Anish Gracias J 2 1 ushaduraisamy@yahoo.co.in 2 anishgracias@gmail.com Abstract A vast amount of assorted

More information

Teaching Methodology for 3D Animation

Teaching Methodology for 3D Animation Abstract The field of 3d animation has addressed design processes and work practices in the design disciplines for in recent years. There are good reasons for considering the development of systematic

More information

Fusion of Information Retrieval Engines (FIRE)

Fusion of Information Retrieval Engines (FIRE) Fusion of Information Retrieval Engines (FIRE) S.Alaoui Mounir, N. Goharian, M. Mahoney, A. Salem, O. Frieder Computer Science Department Florida Institute of Technology Melbourne, FL 32901 Abstract We

More information

Healthcare Measurement Analysis Using Data mining Techniques

Healthcare Measurement Analysis Using Data mining Techniques www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 03 Issue 07 July, 2014 Page No. 7058-7064 Healthcare Measurement Analysis Using Data mining Techniques 1 Dr.A.Shaik

More information

Implementing Portfolio Management: Integrating Process, People and Tools

Implementing Portfolio Management: Integrating Process, People and Tools AAPG Annual Meeting March 10-13, 2002 Houston, Texas Implementing Portfolio Management: Integrating Process, People and Howell, John III, Portfolio Decisions, Inc., Houston, TX: Warren, Lillian H., Portfolio

More information

WRITING A CRITICAL ARTICLE REVIEW

WRITING A CRITICAL ARTICLE REVIEW WRITING A CRITICAL ARTICLE REVIEW A critical article review briefly describes the content of an article and, more importantly, provides an in-depth analysis and evaluation of its ideas and purpose. The

More information

Introduction to Systems Analysis and Design

Introduction to Systems Analysis and Design Introduction to Systems Analysis and Design What is a System? A system is a set of interrelated components that function together to achieve a common goal. The components of a system are called subsystems.

More information

3 An Illustrative Example

3 An Illustrative Example Objectives An Illustrative Example Objectives - Theory and Examples -2 Problem Statement -2 Perceptron - Two-Input Case -4 Pattern Recognition Example -5 Hamming Network -8 Feedforward Layer -8 Recurrent

More information

IAI : Expert Systems

IAI : Expert Systems IAI : Expert Systems John A. Bullinaria, 2005 1. What is an Expert System? 2. The Architecture of Expert Systems 3. Knowledge Acquisition 4. Representing the Knowledge 5. The Inference Engine 6. The Rete-Algorithm

More information

Fall 2012 Q530. Programming for Cognitive Science

Fall 2012 Q530. Programming for Cognitive Science Fall 2012 Q530 Programming for Cognitive Science Aimed at little or no programming experience. Improve your confidence and skills at: Writing code. Reading code. Understand the abilities and limitations

More information

Downloaded from UvA-DARE, the institutional repository of the University of Amsterdam (UvA) http://hdl.handle.net/11245/2.122992

Downloaded from UvA-DARE, the institutional repository of the University of Amsterdam (UvA) http://hdl.handle.net/11245/2.122992 Downloaded from UvA-DARE, the institutional repository of the University of Amsterdam (UvA) http://hdl.handle.net/11245/2.122992 File ID Filename Version uvapub:122992 1: Introduction unknown SOURCE (OR

More information

Component visualization methods for large legacy software in C/C++

Component visualization methods for large legacy software in C/C++ Annales Mathematicae et Informaticae 44 (2015) pp. 23 33 http://ami.ektf.hu Component visualization methods for large legacy software in C/C++ Máté Cserép a, Dániel Krupp b a Eötvös Loránd University mcserep@caesar.elte.hu

More information

Introduction. A. Bellaachia Page: 1

Introduction. A. Bellaachia Page: 1 Introduction 1. Objectives... 3 2. What is Data Mining?... 4 3. Knowledge Discovery Process... 5 4. KD Process Example... 7 5. Typical Data Mining Architecture... 8 6. Database vs. Data Mining... 9 7.

More information

A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING

A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING Sumit Goswami 1 and Mayank Singh Shishodia 2 1 Indian Institute of Technology-Kharagpur, Kharagpur, India sumit_13@yahoo.com 2 School of Computer

More information

The Sierra Clustered Database Engine, the technology at the heart of

The Sierra Clustered Database Engine, the technology at the heart of A New Approach: Clustrix Sierra Database Engine The Sierra Clustered Database Engine, the technology at the heart of the Clustrix solution, is a shared-nothing environment that includes the Sierra Parallel

More information

Sample Size and Power in Clinical Trials

Sample Size and Power in Clinical Trials Sample Size and Power in Clinical Trials Version 1.0 May 011 1. Power of a Test. Factors affecting Power 3. Required Sample Size RELATED ISSUES 1. Effect Size. Test Statistics 3. Variation 4. Significance

More information

Knowledge Discovery using Text Mining: A Programmable Implementation on Information Extraction and Categorization

Knowledge Discovery using Text Mining: A Programmable Implementation on Information Extraction and Categorization Knowledge Discovery using Text Mining: A Programmable Implementation on Information Extraction and Categorization Atika Mustafa, Ali Akbar, and Ahmer Sultan National University of Computer and Emerging

More information

JERIBI Lobna, RUMPLER Beatrice, PINON Jean Marie

JERIBI Lobna, RUMPLER Beatrice, PINON Jean Marie From: FLAIRS-02 Proceedings. Copyright 2002, AAAI (www.aaai.org). All rights reserved. User Modeling and Instance Reuse for Information Retrieval Study Case : Visually Disabled Users Access to Scientific

More information

PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS.

PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS. PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS Project Project Title Area of Abstract No Specialization 1. Software

More information

Moral Hazard. Itay Goldstein. Wharton School, University of Pennsylvania

Moral Hazard. Itay Goldstein. Wharton School, University of Pennsylvania Moral Hazard Itay Goldstein Wharton School, University of Pennsylvania 1 Principal-Agent Problem Basic problem in corporate finance: separation of ownership and control: o The owners of the firm are typically

More information

Query Recommendation employing Query Logs in Search Optimization

Query Recommendation employing Query Logs in Search Optimization 1917 Query Recommendation employing Query Logs in Search Optimization Neha Singh Department of Computer Science, Shri Siddhi Vinayak Group of Institutions, Bareilly Email: singh26.neha@gmail.com Dr Manish

More information

Information Retrieval Elasticsearch

Information Retrieval Elasticsearch Information Retrieval Elasticsearch IR Information retrieval (IR) is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches

More information

Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC

Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC 1. Introduction A popular rule of thumb suggests that

More information

Introduction to IR Systems: Supporting Boolean Text Search. Information Retrieval. IR vs. DBMS. Chapter 27, Part A

Introduction to IR Systems: Supporting Boolean Text Search. Information Retrieval. IR vs. DBMS. Chapter 27, Part A Introduction to IR Systems: Supporting Boolean Text Search Chapter 27, Part A Database Management Systems, R. Ramakrishnan 1 Information Retrieval A research field traditionally separate from Databases

More information

Towards a Visually Enhanced Medical Search Engine

Towards a Visually Enhanced Medical Search Engine Towards a Visually Enhanced Medical Search Engine Lavish Lalwani 1,2, Guido Zuccon 1, Mohamed Sharaf 2, Anthony Nguyen 1 1 The Australian e-health Research Centre, Brisbane, Queensland, Australia; 2 The

More information

Parsing Technology and its role in Legacy Modernization. A Metaware White Paper

Parsing Technology and its role in Legacy Modernization. A Metaware White Paper Parsing Technology and its role in Legacy Modernization A Metaware White Paper 1 INTRODUCTION In the two last decades there has been an explosion of interest in software tools that can automate key tasks

More information

Basics of Dimensional Modeling

Basics of Dimensional Modeling Basics of Dimensional Modeling Data warehouse and OLAP tools are based on a dimensional data model. A dimensional model is based on dimensions, facts, cubes, and schemas such as star and snowflake. Dimensional

More information

SPATIAL DATA CLASSIFICATION AND DATA MINING

SPATIAL DATA CLASSIFICATION AND DATA MINING , pp.-40-44. Available online at http://www. bioinfo. in/contents. php?id=42 SPATIAL DATA CLASSIFICATION AND DATA MINING RATHI J.B. * AND PATIL A.D. Department of Computer Science & Engineering, Jawaharlal

More information

types of information systems computer-based information systems

types of information systems computer-based information systems topics: what is information systems? what is information? knowledge representation information retrieval cis20.2 design and implementation of software applications II spring 2008 session # II.1 information

More information

(Refer Slide Time: 01:52)

(Refer Slide Time: 01:52) Software Engineering Prof. N. L. Sarda Computer Science & Engineering Indian Institute of Technology, Bombay Lecture - 2 Introduction to Software Engineering Challenges, Process Models etc (Part 2) This

More information

American Journal of Engineering Research (AJER) 2013 American Journal of Engineering Research (AJER) e-issn: 2320-0847 p-issn : 2320-0936 Volume-2, Issue-4, pp-39-43 www.ajer.us Research Paper Open Access

More information

INTRUSION PREVENTION AND EXPERT SYSTEMS

INTRUSION PREVENTION AND EXPERT SYSTEMS INTRUSION PREVENTION AND EXPERT SYSTEMS By Avi Chesla avic@v-secure.com Introduction Over the past few years, the market has developed new expectations from the security industry, especially from the intrusion

More information

DATA MODELING AND RELATIONAL DATABASE DESIGN IN ARCHAEOLOGY

DATA MODELING AND RELATIONAL DATABASE DESIGN IN ARCHAEOLOGY DATA MODELING AND RELATIONAL DATABASE DESIGN IN ARCHAEOLOGY by Manuella Kadar Abstract. Data from archaeological excavation is suitable for computerization although they bring challenges typical of working

More information

Fairfield Public Schools

Fairfield Public Schools Mathematics Fairfield Public Schools AP Statistics AP Statistics BOE Approved 04/08/2014 1 AP STATISTICS Critical Areas of Focus AP Statistics is a rigorous course that offers advanced students an opportunity

More information

Five High Order Thinking Skills

Five High Order Thinking Skills Five High Order Introduction The high technology like computers and calculators has profoundly changed the world of mathematics education. It is not only what aspects of mathematics are essential for learning,

More information

SCADE SUITE SOFTWARE VERIFICATION PLAN FOR DO-178B LEVEL A & B

SCADE SUITE SOFTWARE VERIFICATION PLAN FOR DO-178B LEVEL A & B SCADE SUITE SOFTWARE VERIFICATION PLAN FOR DO-78B LEVEL A & B TABLE OF CONTENTS. INTRODUCTION..... PURPOSE..... RELATED DOCUMENTS..... GLOSSARY... 9.. CONVENTIONS..... RELATION WITH OTHER PLANS....6. MODIFICATION

More information

1 Solving LPs: The Simplex Algorithm of George Dantzig

1 Solving LPs: The Simplex Algorithm of George Dantzig Solving LPs: The Simplex Algorithm of George Dantzig. Simplex Pivoting: Dictionary Format We illustrate a general solution procedure, called the simplex algorithm, by implementing it on a very simple example.

More information

Program Visualization for Programming Education Case of Jeliot 3

Program Visualization for Programming Education Case of Jeliot 3 Program Visualization for Programming Education Case of Jeliot 3 Roman Bednarik, Andrés Moreno, Niko Myller Department of Computer Science University of Joensuu firstname.lastname@cs.joensuu.fi Abstract:

More information

The Software Process. The Unified Process (Cont.) The Unified Process (Cont.)

The Software Process. The Unified Process (Cont.) The Unified Process (Cont.) The Software Process Xiaojun Qi 1 The Unified Process Until recently, three of the most successful object-oriented methodologies were Booch smethod Jacobson s Objectory Rumbaugh s OMT (Object Modeling

More information

Information Retrieval Systems in XML Based Database A review

Information Retrieval Systems in XML Based Database A review Information Retrieval Systems in XML Based Database A review Preeti Pandey 1, L.S.Maurya 2 Research Scholar, IT Department, SRMSCET, Bareilly, India 1 Associate Professor, IT Department, SRMSCET, Bareilly,

More information

www.gr8ambitionz.com

www.gr8ambitionz.com Data Base Management Systems (DBMS) Study Material (Objective Type questions with Answers) Shared by Akhil Arora Powered by www. your A to Z competitive exam guide Database Objective type questions Q.1

More information

Overview. Essential Questions. Precalculus, Quarter 4, Unit 4.5 Build Arithmetic and Geometric Sequences and Series

Overview. Essential Questions. Precalculus, Quarter 4, Unit 4.5 Build Arithmetic and Geometric Sequences and Series Sequences and Series Overview Number of instruction days: 4 6 (1 day = 53 minutes) Content to Be Learned Write arithmetic and geometric sequences both recursively and with an explicit formula, use them

More information

Graphical Web based Tool for Generating Query from Star Schema

Graphical Web based Tool for Generating Query from Star Schema Graphical Web based Tool for Generating Query from Star Schema Mohammed Anbar a, Ku Ruhana Ku-Mahamud b a College of Arts and Sciences Universiti Utara Malaysia, 0600 Sintok, Kedah, Malaysia Tel: 604-2449604

More information

Clustering Technique in Data Mining for Text Documents

Clustering Technique in Data Mining for Text Documents Clustering Technique in Data Mining for Text Documents Ms.J.Sathya Priya Assistant Professor Dept Of Information Technology. Velammal Engineering College. Chennai. Ms.S.Priyadharshini Assistant Professor

More information

Requirements Traceability Recovery

Requirements Traceability Recovery MASTER S THESIS Requirements Traceability Recovery - A Study of Available Tools - Author: Lina Brodén Supervisor: Markus Borg Examiner: Prof. Per Runeson April 2011 Abstract This master s thesis is focused

More information

Categorical Data Visualization and Clustering Using Subjective Factors

Categorical Data Visualization and Clustering Using Subjective Factors Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,

More information

Greatest Common Factors and Least Common Multiples with Venn Diagrams

Greatest Common Factors and Least Common Multiples with Venn Diagrams Greatest Common Factors and Least Common Multiples with Venn Diagrams Stephanie Kolitsch and Louis Kolitsch The University of Tennessee at Martin Martin, TN 38238 Abstract: In this article the authors

More information

IF The customer should receive priority service THEN Call within 4 hours PCAI 16.4

IF The customer should receive priority service THEN Call within 4 hours PCAI 16.4 Back to Basics Backward Chaining: Expert System Fundamentals By Dustin Huntington Introduction Backward chaining is an incredibly powerful yet widely misunderstood concept, yet it is key to building many

More information

Conditional Probability, Independence and Bayes Theorem Class 3, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom

Conditional Probability, Independence and Bayes Theorem Class 3, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom Conditional Probability, Independence and Bayes Theorem Class 3, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom 1 Learning Goals 1. Know the definitions of conditional probability and independence

More information

Principles of Data-Driven Instruction

Principles of Data-Driven Instruction Education in our times must try to find whatever there is in students that might yearn for completion, and to reconstruct the learning that would enable them autonomously to seek that completion. Allan

More information