AFFAIR: Agent-based Flexible Framework and Architecture for Information Retrieval


Ottar Viken Valvåg

AFFAIR: Agent-based Flexible Framework and Architecture for Information Retrieval
Requirements elicitation, design and prototype implementation

Project assignment in TDT4710 Information Management, Specialization
Subject teacher: Ingeborg Sølvberg
Teaching supervisors: Ingeborg Sølvberg, Jeanine Lilleng

Department of Computer and Information Science, NTNU
Autumn 2003


Summary

Modern information retrieval systems often combine a variety of basic techniques to achieve good search results. I have developed AFFAIR, an agent-based, flexible framework suitable for experimentation with such combinations. The information retrieval process is modeled by an agent processing a query to produce a result set consisting of documents from a document collection. AFFAIR consists of a collection of Java interfaces defining the required functionality of these components.

The AFFAIR framework has been tested by implementing two simple information retrieval strategies as agents. A document collection driver was implemented to allow the agents to search articles from the Bergens Tidende newspaper. The testing showed that the implementation of information retrieval strategies as agents within the AFFAIR framework was feasible. A generic merging agent combining the results of other agents was also developed. This was used to combine the other two agents' results, and could be used to easily combine any information retrieval strategies implemented as agents in the AFFAIR framework.

Information retrieval strategies may be categorized into three different paradigms, referred to as the Set Theoretic, Vector and Probabilistic models of information retrieval. As part of the development of AFFAIR, existing research into the combination of information retrieval techniques has been reviewed. This review showed that the Vector model of information retrieval has a dominant position. It also showed that researchers have tended to investigate combination schemes that are limited to a single model of information retrieval. This is despite research results demonstrating that the combination of different retrieval paradigms provides the greatest potential improvements to retrieval results.

Studies of existing information retrieval systems revealed that few are explicitly designed to support the combination of different retrieval paradigms. In fact, most systems are based on a single model of information retrieval. In addition, many systems emphasize the optimization of efficiency and scalability over the flexibility needed to allow for cross-paradigm experiments. Thus, restrictions in existing frameworks, as well as the dominant position of the Vector model, may have contributed to limiting research into the combination of different information retrieval paradigms. AFFAIR may help to alleviate this problem.

It is concluded that AFFAIR seems to satisfy its objective of providing a suitable framework for experimentation with the combination of different retrieval techniques. Further work on the AFFAIR framework could include the addition of new collection drivers, the development of a better user interface, and the inclusion of batch query processing to support benchmarking experiments. Other uses of AFFAIR, such as its use as a collaborative or teaching tool, or its application to multilingual or non-textual information retrieval, could also be explored.


Acknowledgements

This report is the result of my project assignment in the Information Management specialization module of my Master of Technology in Computer Science degree at NTNU.

I would like to thank my teaching supervisors, Professor Ingeborg Sølvberg and dr.ing. fellow Jeanine Lilleng, at the Information Management group of NTNU's Department of Computer and Information Science, for their contributions to and advice about this project. Professor Sølvberg, with her extensive experience, has given well-founded guidance on the organization and presentation of my work. Jeanine, with her enthusiasm and attentiveness, has been a great source of inspiration and a driving force. In short, Professor Sølvberg's attention to the big picture and Jeanine's attention to detail have greatly improved the quality of this report.

I would also like to thank the Bergens Tidende newspaper, as well as Professor Torbjørn Nordgård at the Computer Linguistics section of NTNU's Department of Language and Communication Studies, for providing me with access to the Bergens Tidende document collection used to test the AFFAIR framework.


Reader's guide

Chapter 1 observes that modern information retrieval systems often combine a variety of basic techniques to achieve good search results. A unified framework for testing and developing new combinations is desired. This report attempts to define and prototype AFFAIR, an architecture suitable for such work.

In chapter 2, a brief overview of the common paradigms of textual information retrieval is given. Information retrieval strategies may be categorized into three main classes: Set Theoretic, Algebraic and Probabilistic information retrieval. The theoretical foundations of these paradigms are Set Theory, Vector Algebra and Probability Theory, respectively.

Chapter 3 goes on to study how different information retrieval strategies may be combined to produce better results. In general, the combination of multiple sources of evidence to produce a good overall result is known as multiple evidence combination. Research into the application of multiple evidence combination to information retrieval has to a large degree focused on combination within a single retrieval paradigm. Two possible reasons for this are mentioned. One is that the Vector Algebraic model of information retrieval has a dominant position. Another is that researchers tend to do experiments that are possible within their existing frameworks for information retrieval. Therefore, it is hypothesized that a general framework providing support for the easy combination of different information retrieval techniques, regardless of retrieval paradigm, could be beneficial.

In chapter 4, some existing systems and architectures for information retrieval are reviewed. It is found that few systems are explicitly designed to support multiple evidence combination. In fact, most systems are based on a single model of information retrieval. In addition, many systems emphasize the optimization of efficiency and scalability over the flexibility needed to allow for creative experiments. It is decided that AFFAIR can contribute by providing an abstract framework supporting the general process of information retrieval.

Chapter 5 discusses a suitable way to represent documents and collections of documents. This leads to the development of a general interface to document collections. By writing a collection driver implementing this interface, any kind of textual document collection may be incorporated into the proposed framework.

In chapter 6, the entire framework is specified in terms of clean interfaces. The information retrieval process is modeled by a search agent, which processes a query and returns a result set consisting of ranked documents. The agents can be used to implement any kind of information retrieval strategy.

Examples of a document collection driver and three agent implementations searching this collection are given in chapter 7. It is shown that different information retrieval paradigms can be easily implemented in AFFAIR. It is also demonstrated how an agent may produce its results by combining the efforts of other agents. Thus, arbitrary combinations of different retrieval techniques are possible.

In chapter 8, it is concluded that AFFAIR seems to satisfy its objective of providing a suitable framework for experimentation with the combination of different retrieval techniques. Possible framework extensions and new applications of AFFAIR are also mentioned.


Contents

1 Introduction
2 Basic Information Retrieval Techniques
   2.1 Set Theoretic IR techniques
       2.1.1 Fuzzy Information Retrieval
       2.1.2 Extended Boolean Information Retrieval
   2.2 Algebraic IR techniques
   2.3 Probabilistic IR techniques
   2.4 Structured IR techniques
3 Combining information retrieval techniques
   3.1 Combining result sets
   3.2 Combining queries
   3.3 Common combinations of techniques
   3.4 Discussion
4 Existing architectures for Information Retrieval
   4.1 The SMART system
   4.2 The URSA architecture
   4.3 The INQUERY system
   4.4 The OKAPI system
   4.5 The ProFusion engine
   4.6 The SENTINEL system
   4.7 The SIRE engine
   4.8 The AIRE engine
   4.9 Discussion
5 Documents and Document Collections
   5.1 Document types
       Design decisions
   5.2 Identification schemes
       Design decisions
   5.3 Metadata
       Design decisions
   5.4 Document granularity
       Design decisions
   5.5 Document formatting
       Design decisions
   5.6 Collection maintenance
   5.7 Summary
6 The AFFAIR framework
   6.1 Requirements
   6.2 Framework
   6.3 Prototype
       The Agent interface
       The Query interface
       The ResultSet interface
       The DocumentCollection interface
       The Document interface
       The Tools class
7 Testing of AFFAIR
   7.1 The Bergens Tidende document collection
   7.2 User interface
   7.3 Agent implementations
       A simplistic Boolean agent
       A vector-based agent
       A multiple evidence combination agent
   7.4 Discussion
8 Conclusion and further work
   Further work
References
Appendix A: Java language mechanisms
Appendix B: AFFAIR source code

Figures

Figure 1: The main components of the system
Figure 2: Simple agent collaboration
Figure 3: A taxonomy of information retrieval models
Figure 4: Venn diagram illustrating relevant and non-relevant overlap
Figure 5: Formulae for relevant and non-relevant overlap
Figure 6: SQL retrieval in SIRE
Figure 7: The components of an information retrieval process
Figure 8: The Document interface
Figure 9: Query execution using sequential access
Figure 10: Query execution using direct access
Figure 11: The central framework constituents
Figure 12: Class diagram of the AFFAIR prototype
Figure 13: Example document from BT collection
Figure 14: Same document in AFFAIR form
Figure 15: The no.ntnu.affair.drivers.bt package
Figure 16: Main loop of AFFAIR user interface
Figure 17: Output from a run of the user interface
Figure 18: XML-formatted search result
Figure 19: The performSearch() method of the ExampleAgent
Figure 20: The VectorAgent's indexing routine
Figure 21: The MergingAgent code
Figure 22: Implementation of the CombMNZ algorithm
Figure 23: MergingAgent search results

Tables

Table 1: Overview of examined IR systems
Table 2: The Dublin Core Metadata Element Set
Table 3: Mapping between BT and DC metadata categories
Table 4: DC metadata made explicit by the collection driver
Table 5: Untranslatable BT metadata categories
Table 6: Summary of test agent properties

1 Introduction

Information Retrieval (IR) is the process of locating documents relevant to some user-specified criterion from a collection of digital documents. Based on the user's specifications, the documents containing the most relevant information must be retrieved. The collection of documents can be a digitized library, the Internet, an image database or any other kind of information collection. In this report only documents comprised of textual information are considered.

In the context of textual documents, the general IR problem of finding information relevant to user criteria is usually boiled down to a search for documents based on a search query. The query can be anything from a small set of keywords to a central paragraph or even a template document. Based on the keywords, the information retrieval system must be able to understand what kind of information the user is looking for, and it must be able to evaluate any document and measure to what extent the requested information is present. This is an AI-complete problem, so approximate techniques are used to produce acceptable results. Some examples of techniques used to estimate a textual document's relevance with regard to a given set of keywords are:

- Word matching: Does the document contain the keywords?
- Use of dictionaries: Matches synonyms and related terms as well.
- N-grams: Used to match similar words, inflections, etc.
- Semantic analysis: Use of formal logic and natural language understanding to evaluate the information content of documents.
- Statistical methods: Take advantage of language statistics, such as the tendency of two words to occur together.

Some basic information retrieval techniques are described in greater detail in chapter 2. In practical research, a combination of techniques is used to achieve as good a result as possible. The results of the techniques are weighted and combined into one result. This is done in many different ways, usually depending on the techniques that are to be combined and the framework that is available.

A more uniform and general framework for experimenting with combinations of different IR techniques is desirable. This could make it easier for researchers to experiment with different combinations, eliminating or reducing the need for special adaptations for each new combination. In this report I look into the possibilities of creating such a framework using an agent-based architecture.

Agents are autonomous program units capable of working towards a set of goals.[1] The distinction between a software agent and a software component is a bit blurred, but an agent is in general thought to work at a higher level of abstraction. That is, an agent has high-level goals that must be achieved by low-level means. This requires a certain degree of intelligence, or problem-solving ability, in the agent. An important type of agent is the collaborative agent. Collaborative agents solve a larger problem by combining the efforts of many simpler agents. Such agents seem to be applicable to information retrieval.

The idea behind the agent-based architecture for information retrieval described in this report is to create a system consisting of a number of information retrieval agents. The agents have a uniform interface providing methods for searching a document collection, but each agent uses its own technique to perform this search. The framework then allows the user to specify searches, which techniques should be combined, and which document collection should be searched. The architecture is intended to be used for developing and benchmarking new IR methods. The principal components of the architecture are shown in Figure 1.

An important part of the development of this architecture is to specify the interfaces and protocols for communication between these components. It is important to design these interfaces as broadly and generally as possible, to maximize the number of different techniques that are compatible with the framework. At the same time, it is desirable to allow for relatively complex patterns of collaboration, so the interfaces must be able to convey finer details as well. In order to find out what the interfaces must provide, it is necessary to study different IR techniques in detail.

In addition to interagent communication, the agents need a uniform interface to the document collection. One possibility is to create an interface to a generic document collection, with drivers making different types of concrete document collections accessible through this interface. In this way a collection of text files and the Internet can be accessed in the same way.

[Figure 1: The main components of the system — an interface for query specification, a layer for search planning and combination of techniques, a set of agents (labeled CIA, KGB, MI6 and Mossad in the illustration) connected by a communication protocol, and an interface to document collections giving access to the document collection.]

Two additional interfaces must be defined to complete the architecture. The first one concerns a structure for representing partial or complete search results, to be used by agents to exchange results and to return a complete result to the user. The result sets should contain as much information as possible, including characteristics related to the IR technique that was used. These characteristics may be used by other agents when combining results and/or performing searches. They can also be useful when the complete results are analyzed in connection with a benchmarking experiment.

The final interface to be defined is a user interface to the system. Ideally, such an interface should provide simple tools to perform searches and specify new search methods. The specification of new search methods involves the specification of agent collaboration patterns. The simplest form of collaboration is the merging of two result sets produced by different agents. The final result set must be ranked in accordance with some weighting of the two partial result sets' attributes. More complex collaboration patterns may involve feedback loops, where partial search results guide further searches or are used to tune essential parameters. In such cases, internal characteristics stored in the result sets may play a critical role.

The design of a user interface allowing a fully flexible way of specifying agent collaboration patterns is probably a large task. In the absence of such an interface, expert involvement is required. An acceptable solution is to create a new agent implementing the desired collaboration pattern. This agent will coordinate the effort of other agents and compile the final result. An illustration of this approach is shown in Figure 2.

[Figure 2: Simple agent collaboration — a coordinating agent receives the query from the interface for query specification and delegates the search to other agents (labeled CIA, MI6 and Mossad in the illustration), which access the document collection through the interface to document collections.]

The prototype for this architecture is defined and implemented in Java, which is well suited to the programming of agents due to its object-oriented nature. Interfaces and protocols are also implemented in Java. Ideally, communication between system components should be independent of programming language. This would enable different parts of the system to be implemented in different programming languages. For example, an agent using formal logic reasoning could be implemented in Prolog, whereas a computation-intensive agent could be implemented in C. However, due to time considerations the initial prototype is programmed in Java in its entirety.

To summarize, there are a number of tasks that must be solved in order to realize the idea of an agent-based architecture for information retrieval. These include:

- The definition of a generic interface to document collections.
- The installation of a document collection and the implementation of a driver making the documents accessible through the interface mentioned above.
- The definition of a format for the storage and exchange of search results.
- The definition of an agent interface and a protocol for interagent communication.
- The implementation of two or more agents using different IR techniques.
- The definition of a user interface or framework for performing searches.
- The design of a user interface for specifying new ways of combining IR techniques.
- Testing the architecture by benchmarking an arbitrary method of information retrieval.

My specific tasks in relation to this work, listed in order of priority, have been:

- The definition of an agent interface and a protocol for interagent communication.
- The definition of a format for the storage and exchange of search results.
- The definition or discovery of a suitable document collection interface.
- The installation of a simple, limited document collection, and the implementation of a driver making the documents accessible through the interface mentioned above.
- The implementation of two or more agents using different IR techniques.

My aim was to develop a simple working prototype, where more advanced details in the interfaces and protocols can be added at a later time. The prototype was dubbed AFFAIR, standing for Agent-based Flexible Framework and Architecture for Information Retrieval.
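As a first impression of the kind of interfaces this architecture calls for, the following is a minimal Java sketch. All names and signatures here are illustrative assumptions only; the actual AFFAIR interfaces are specified in chapter 6 and listed in Appendix B.

```java
// Illustrative sketch only: the real AFFAIR interfaces are specified in
// chapter 6 and Appendix B. All names and signatures here are assumptions.
import java.util.List;

interface Query {
    List<String> getTerms();            // the user's search terms
}

interface ResultSet {
    List<String> getDocumentIds();      // retrieved documents, best match first
    double getScore(String documentId); // technique-specific relevance estimate
}

interface DocumentCollection {
    Iterable<String> getDocumentIds();  // enumerate the collection
    String getText(String documentId);  // raw document text
}

interface Agent {
    // Every retrieval technique is hidden behind this uniform method.
    ResultSet search(Query query, DocumentCollection collection);
}
```

Even at this level of detail, the key property is visible: every retrieval technique sits behind the same search() method, so agents can be swapped and combined freely.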

2 Basic Information Retrieval Techniques

In order to develop interfaces that are able to represent and exchange the information used by different information retrieval techniques, a closer look at the most common techniques is necessary. The following discussion is based on chapter 3 of the book by Baeza-Yates and Ribeiro-Neto[2], as well as chapter 17 of Jurafsky and Martin's book[3] and a survey by Faloutsos and Oard.[4]

Baeza-Yates and Ribeiro-Neto sort classic IR techniques into three basic categories: the Set Theoretic, Algebraic and Probabilistic models of information retrieval. In addition, models that take advantage of structural information about documents are considered. This taxonomy is illustrated in Figure 3, which is adapted from their book.[2]

[Figure 3: A taxonomy of information retrieval models. The classic models (Boolean, Vector, Probabilistic) are refined into Set Theoretic models (Fuzzy, Extended Boolean), Algebraic models (Generalized Vector, Latent Semantic Indexing, Neural Networks) and Probabilistic models (Inference Network, Belief Network). Structured models (Non-overlapping Lists, Proximal Nodes) form a separate branch. The models apply to IR applications such as search or filtering.]

In general, all current textual IR techniques view documents as a series of terms. A term usually corresponds to a word in the text, but may encompass phrases as well. Current IR techniques usually rely on the simplifying assumption that the semantics of a document are fully described by the individual terms present in it. Most often, the terms' actual order in the document is not even considered. This simplifies the information problem and enables search systems to construct indexes relating terms to documents, speeding up retrieval.

An approach rejecting the simplifying term assumption would have to interpret the semantics of a document and represent them in a machine-understandable structure that could be used to answer queries. Knowledge representation languages capable of this do exist, but a lot of manual work is required in order to supply the knowledge base necessary for efficient inference mechanisms. Knowledge representation techniques are still not mature enough to be applicable to general information retrieval problems. Nevertheless, it is important to keep in mind that the term assumption is indeed a simplifying assumption and not a law of nature.

In the following, the Set Theoretic, Algebraic, Probabilistic and Structured models for IR are reviewed. Some subcategorizations of these models are examined, and architecture requirements arising from the different models are identified.

2.1 Set Theoretic IR techniques

Set Theoretic IR techniques express information queries as boolean algebra expressions. For example, the need for documents concerning software architectures could be expressed as the query "software AND architecture". This query would produce a result consisting of all documents containing the terms "software" and "architecture". By viewing the documents containing the term "software" as one set, and the documents containing the term "architecture" as another, the result is achieved by taking the intersection of these two sets. Thus Set Theoretic IR may be viewed as a series of set operations such as intersection, union and complement, corresponding to the AND, OR and NOT operators, respectively.

Set Theoretic IR techniques have been widely adopted because of their simplicity and intuitiveness. The main problems with these techniques are:

- The query performs a binary decision; documents are either marked as relevant or irrelevant.
- Users have problems formulating sophisticated boolean queries.

2.1.1 Fuzzy Information Retrieval

Fuzzy Information Retrieval is an attempt to alleviate the problem of binary decisions. Fuzzy IR uses Fuzzy Set Theory instead of classical Set Theory to produce search results. In Fuzzy Set Theory, an element can partly belong to a set. This is expressed by a membership function taking a value from 0 to 1, where 0 indicates no membership and 1 indicates full membership. The set operations for intersection, union and complement calculate algebraic combinations of the elements' membership functions with regard to the two operand sets. This calculation is used to create membership functions of elements with regard to the resulting set. By using the same retrieval technique as in classic Set Theoretic IR, a final result set is produced in which every document has a degree of participation which is hopefully related to its relevance. For an explanation of how the documents' degrees of participation in the sets associated with each index term are calculated, see page 36 of Baeza-Yates and Ribeiro-Neto's book.[2]

2.1.2 Extended Boolean Information Retrieval

The Extended Boolean model for Information Retrieval is a hybrid technique combining some of the features of algebraic and set theoretic techniques. Similarly to other Set Theoretic techniques, Extended Boolean IR is performed by combining sets of documents using the set intersection, union or complement operations. As in Fuzzy IR, the implementation of the set operations has been changed. A parameter p is introduced which can be used to vary the effect of the set operations. When p = 1, the queries are evaluated in a way similar to the algebraic IR techniques discussed in section 2.2. When p = ∞, the operations equal the operations used in Fuzzy IR. Thus, the Extended Boolean technique provides great flexibility with regard to the specification and evaluation of queries. Again, we refer to Baeza-Yates and Ribeiro-Neto's book[2], page 40, for more details.
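To illustrate how a fuzzy query could be evaluated, the following is a minimal Java sketch of fuzzy set operations on membership values. The algebraic product/sum combination used here is one common choice (Zadeh's min/max is another); the sketch is an assumption for illustration and is not taken from the AFFAIR source.

```java
// A minimal sketch of fuzzy set operations on membership values, assuming
// the algebraic product/sum combinations (min/max is another common choice).
// Illustrative only; not taken from the AFFAIR source.
final class FuzzyOps {
    // AND: intersection of fuzzy sets
    static double and(double muA, double muB) {
        return muA * muB;             // algebraic product
    }
    // OR: union of fuzzy sets
    static double or(double muA, double muB) {
        return muA + muB - muA * muB; // algebraic sum
    }
    // NOT: complement of a fuzzy set
    static double not(double muA) {
        return 1.0 - muA;
    }

    public static void main(String[] args) {
        // A document with membership 0.8 for "software" and 0.4 for
        // "architecture" gets degree ~0.32 for "software AND architecture".
        System.out.println(and(0.8, 0.4));
    }
}
```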

2.2 Algebraic IR techniques

Algebraic IR techniques view documents and queries as vectors, where each dimension of the vector represents a term. Common terms thought to carry little information are usually eliminated using stop lists. The values of the different coordinates in a vector represent the weight or importance of the corresponding term. Similarity (in terms of information content) between documents and queries is then measured by comparing the n-dimensional distance between the corresponding vectors, where n is the total number of terms in the document collection, or by comparing the vectors' directions.

To calculate the vectors of documents, the frequencies of terms in the documents are counted. This produces a term-by-document matrix[3], in which every column of the matrix can be viewed as a vector representation of a given document. In addition to frequency counts for every document, the frequency of a term in the entire document collection is often used to normalize the weights in the matrix. The following scheme, known as tf-idf weighting[3], is widely used:

$$W_{i,j} = TF_{i,j} \cdot IDF_i$$

$W_{i,j}$ denotes the weight of term $i$ in the vector representation of document $j$. $TF_{i,j}$ is the frequency of term $i$ in document $j$, and $IDF_i = \log(N/n_i)$, where $N$ is the total number of documents and $n_i$ is the number of documents in which term $i$ occurs. IDF stands for inverse document frequency.

The calculation of the vector representation of the query is often more involved, as user queries often consist of few terms and rarely carry enough information to guarantee a good search result. Techniques such as stemming of terms, n-grams, relevance feedback and query expansion using thesauri are used to provide greater recall and/or precision.

With regard to the proposed architecture, this means that documents should be viewable as vectors, and that search results should include the vectors assigned to the query and the retrieved documents. Different algebraic search techniques may calculate vectors differently, and the vectors could potentially be used when the results of different IR techniques are combined. The calculation of document vectors is typically a preprocessing activity that should be supported by the architecture.
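As an illustration of the weighting scheme above, the following Java sketch computes tf-idf weights for a small collection of tokenized documents. It is a minimal, self-contained example, not part of the AFFAIR prototype.

```java
// Minimal tf-idf sketch (not part of the AFFAIR prototype): computes
// W[i][j] = TF[i][j] * log(N / n[i]) for a few tokenized documents.
import java.util.*;

public class TfIdf {
    public static void main(String[] args) {
        List<List<String>> docs = List.of(
                List.of("software", "architecture", "agents"),
                List.of("software", "retrieval"),
                List.of("information", "retrieval", "agents", "agents"));
        int N = docs.size();

        // n_i: number of documents containing term i
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : docs)
            for (String term : new HashSet<>(doc))
                docFreq.merge(term, 1, Integer::sum);

        // One weight vector per document: W_ij = TF_ij * log(N / n_i)
        for (List<String> doc : docs) {
            Map<String, Double> weights = new HashMap<>();
            for (String term : doc) {
                double tf = Collections.frequency(doc, term);
                double idf = Math.log((double) N / docFreq.get(term));
                weights.put(term, tf * idf);
            }
            System.out.println(weights);
        }
    }
}
```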

2.3 Probabilistic IR techniques

Whereas Set Theoretic IR techniques attempt to calculate a set of relevant documents by direct execution of the set operations specified in the query, and algebraic techniques calculate a similarity measure between all documents and the query, probabilistic IR techniques take a slightly different approach. These techniques try to estimate the probability that a document is relevant, or belongs to the so-called ideal set consisting of all relevant documents. Formally, an attempt is made to estimate the probability $P(R|d_j)$ for all $j$; that is, the probability that document $j$ belongs to the set $R$ of relevant documents. Instead of using this probability as a ranking criterion, the odds of a document being relevant are used, expressed as the ratio of the probability of a document being relevant to the probability of it being irrelevant: $P(R|d_j)/P(\bar{R}|d_j)$. Using Bayes' rule, the problem can be reformulated to estimating the ratio $P(d_j|R)/P(d_j|\bar{R})$. The probabilities involved in this ratio are the probability of drawing document $j$ from the relevant set and the probability of drawing document $j$ from the irrelevant set, respectively. These probabilities are much easier to estimate than the probabilities in the original ratio. By assuming independence of the terms in a query, estimating $P(d_j|R)/P(d_j|\bar{R})$ can again be broken down to estimating $P(k_i|R)$ and $P(k_i|\bar{R})$ for all terms $k_i$ of the query.

Given an initial estimate of these probabilities, it is possible to recursively refine the estimates using a set $V$ of documents thought to be relevant. The set $V$ can be determined by a human user (relevance feedback), or simply by taking the top 10 documents of the previous round of the recursion (blind relevance feedback).

Thus, a probabilistic query should include the usual search terms, but also a set $V$ of documents thought to be relevant, and estimates of the probabilities $P(k_i|R)$ and $P(k_i|\bar{R})$ for all the search terms. The initial query can omit these values and let the search agent determine sensible initial estimates. Also, the result set should include the estimated probabilities.

In theory, probabilistic IR techniques could perform well because documents are ranked according to their estimated probability of being relevant. The main disadvantage of probabilistic IR techniques is that they don't take the frequency of index terms within documents into account. As in the classic set theoretic model, the term weights are binary.

2.4 Structured IR techniques

Some IR techniques reference the structure of a document, such as font formatting, layout, paragraphs, etc. This enables the user to specify queries referencing structural information, such as a search for all illustrations with captions containing the words "information retrieval taxonomy". Naturally, these techniques need access to the structural information of a document. Unfortunately, no format for document structure is universally adopted. Thus, the document collection drivers should ideally translate the structural information of their documents to a standard format. In the absence of such a mechanism, agents employing structured models for IR will have to access the raw document data directly and derive whatever structural information is possible.
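To round off this chapter, the following Java sketch illustrates the initial ranking step of the probabilistic model from section 2.3. It assumes the commonly used initial estimates $P(k_i|R) = 0.5$ and $P(k_i|\bar{R}) = n_i/N$; both the estimates and the code are illustrative assumptions, not the AFFAIR implementation.

```java
// Sketch of the initial ranking step of the probabilistic model in
// section 2.3, assuming the common initial estimates P(k_i|R) = 0.5 and
// P(k_i|~R) = n_i / N. Illustrative only; not the AFFAIR implementation.
import java.util.*;

public class ProbabilisticRanking {
    public static void main(String[] args) {
        List<Set<String>> docs = List.of(
                Set.of("software", "architecture", "agents"),
                Set.of("information", "retrieval", "agents"),
                Set.of("digital", "library", "catalogs"),
                Set.of("image", "database", "retrieval"),
                Set.of("web", "search", "engines"));
        Set<String> query = Set.of("software", "agents");
        int N = docs.size();

        // n_i for each query term
        Map<String, Long> n = new HashMap<>();
        for (String k : query)
            n.put(k, docs.stream().filter(d -> d.contains(k)).count());

        // Rank each document by the log-odds of relevance, summed over the
        // query terms it contains (term weights are binary).
        for (Set<String> doc : docs) {
            double score = 0.0;
            for (String k : query) {
                if (!doc.contains(k)) continue;
                double pR = 0.5;                    // initial P(k_i | R)
                double pNR = (double) n.get(k) / N; // initial P(k_i | ~R)
                score += Math.log((pR * (1 - pNR)) / (pNR * (1 - pR)));
            }
            System.out.println(doc + " -> " + score);
        }
    }
}
```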

3 Combining information retrieval techniques

In this chapter, different ways of combining information retrieval techniques are reviewed, and some commonly used combinations of information retrieval techniques are identified. This is necessary to determine what kinds of agent interaction and collaboration patterns must be supported by the proposed architecture.

The idea of combining the efforts of different information retrieval engines is not new. Various experiments have been done in order to investigate the practical utility of such approaches. The combination of different sources of information to produce an improved result is sometimes referred to as multiple evidence combination in the literature. Fusion or data fusion are also common terms, but they are sometimes limited to the merging techniques discussed in section 3.1. The most general term seems to be multiple evidence combination.

In the following, some experiments using multiple evidence combination are reviewed. Section 3.1 discusses multiple evidence combinations involving multiple information retrieval strategies, whereas section 3.2 discusses approaches using only one retrieval strategy but varying other parameters such as the query or document representation. Section 3.3 lists the techniques that are most commonly combined, and section 3.4 discusses the implications of recent research with regard to the proposed architecture.

3.1 Combining result sets

A common type of data fusion is merging, which involves combining two or more ranked result sets produced by different information retrieval engines. Fox and Shaw[17] describe two simple algorithms for combining such result sets, CombSUM and CombMNZ. These algorithms assume that every document in a result set has been assigned a relevance estimate or similarity value ranging from 0 to 1. CombSUM gives every document a new score equaling the sum of the relevance estimates for the document in each result set. The merged result set is then ranked according to this score. CombMNZ favors documents that are judged relevant in many result sets, as the CombSUM score of a document is multiplied by the number of result sets in which the document occurs.

Lee[13] used the same merging algorithms, but considered document rank instead of similarity value, and showed that assigning new scores based on the ranks of the documents gives better results when the result sets to be merged exhibit quite different rank-similarity curves. Lee also investigated the conditions that must be present in order to benefit from the merging of result sets, and found that the relevant and non-relevant overlap of different result sets are correlated with the degree of improvement that can be expected from these and other merging algorithms.

The relevant overlap of two result sets is found by taking the number of relevant documents found in both result sets and dividing by the number of relevant documents found in at least one of the result sets. Intuitively, relevant overlap measures the degree to which different techniques make the same correct predictions. Similarly, the non-relevant overlap of two result sets is found by taking the number of non-relevant documents found in both result sets and dividing by the number of non-relevant documents found in at least one of the result sets. Intuitively, non-relevant overlap measures the degree to which different techniques make the same mistakes or omissions. The concepts of relevant and non-relevant overlap are illustrated in the Venn diagram in Figure 4.

[Figure 4: Venn diagram illustrating relevant and non-relevant overlap. A large rectangle represents all documents, divided into relevant and non-relevant documents. Two overlapping result sets are outlined; R_1 and R_2 denote the relevant documents unique to each result set, R_both the relevant documents common to both, and IR_1, IR_2 and IR_both the corresponding non-relevant documents.]

In the diagram, the large rectangle represents all documents, which are divided into relevant documents (green waves) and non-relevant documents (red crosshatches). Two result sets are shown, outlined by dashed blue and dotted purple lines. The relevant overlap of these two result sets is the ratio $R_{both}/(R_1 + R_2 + R_{both})$. Similarly, the non-relevant overlap is the ratio $IR_{both}/(IR_1 + IR_2 + IR_{both})$. Formulae for calculating relevant and non-relevant overlap, generalized to any number of result sets, are shown in Figure 5.

Figure 5: Formulae for relevant and non-relevant overlap:

$$\mathit{ROverlap} = \frac{|R \cap S_1 \cap S_2 \cap \dots \cap S_n|}{|(R \cap S_1) \cup (R \cap S_2) \cup \dots \cup (R \cap S_n)|}$$

$$\mathit{NROverlap} = \frac{|\mathit{NR} \cap S_1 \cap S_2 \cap \dots \cap S_n|}{|(\mathit{NR} \cap S_1) \cup (\mathit{NR} \cap S_2) \cup \dots \cup (\mathit{NR} \cap S_n)|}$$

In the formulae, $R$ denotes the set of all relevant documents, $\mathit{NR}$ is the set of all non-relevant documents, and $S_1, \dots, S_n$ are the different result sets.

Zhang et al. built on the idea of relevant overlap in developing a novel fusion algorithm.[8] Their algorithm first clustered the documents in each result set. Then, clusters that had high overlap with clusters in other result sets were judged to be more relevant than others, and the documents belonging to such clusters were assigned a higher score. Finally, the new scores were used to combine all the result sets into a single set. Their experiments showed that this merging algorithm compared favorably to other, more conventional merging algorithms.
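A direct translation of these formulae into code may make them more concrete. The sketch below computes an overlap measure for any number of result sets, given a set of relevance judgments; it is an illustration, not AFFAIR code.

```java
// Illustrative computation of Lee's overlap measures (not AFFAIR code):
// |R ∩ S_1 ∩ ... ∩ S_n| / |(R ∩ S_1) ∪ ... ∪ (R ∩ S_n)|.
import java.util.*;

public class Overlap {
    static double overlap(Set<String> judged, List<Set<String>> resultSets) {
        Set<String> inAll = new HashSet<>(judged); // judged docs found in every set
        Set<String> inAny = new HashSet<>();       // judged docs found in any set
        for (Set<String> s : resultSets) {
            inAll.retainAll(s);
            for (String doc : s)
                if (judged.contains(doc)) inAny.add(doc);
        }
        return inAny.isEmpty() ? 0.0 : (double) inAll.size() / inAny.size();
    }

    public static void main(String[] args) {
        Set<String> relevant = Set.of("d1", "d2", "d3");
        List<Set<String>> results = List.of(Set.of("d1", "d2", "d4"),
                                            Set.of("d2", "d3", "d5"));
        System.out.println("ROverlap = " + overlap(relevant, results)); // 1/3
        // NROverlap: call overlap() with the set of non-relevant documents.
    }
}
```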

Mounir and associates[11] tested the feasibility of fusion in the context of Web-based search engines. They developed a heuristic system called FIRE which provided fusion heuristics that outperformed the basic CombSUM and CombMNZ heuristics. Their system combined the outputs of three different search engines, and they found that fusing the search engines' result sets using the CombMNZ algorithm produced better results than any of the individual search engines. They went on to develop heuristics that not only took the document ranks into account, but also took advantage of the raw scores assigned to the documents by the various search engines. The various scores were linearly weighted, and an optimal weighting was determined experimentally. The experiments showed that the various search engines should be weighted according to their dissimilarity, and not just by how well they perform individually. They observe that there is little to be gained by combining two very good engines if they both produce the same results.

Bartell and his colleagues[15] devised a linear weighting scheme for the combination of different result sets which is applicable to all ranked result sets. Using well-understood numerical optimization techniques and a set of training queries, they demonstrate how the optimal combination of any set of techniques can be determined. They point out that the evaluation of a technique in isolation gives little indication of the technique's potential usefulness in combination with other techniques. Methods that perform badly on their own can contribute greatly to the refinement of another technique. They further suggest that their automatic weighting scheme may be used to identify promising combinations of techniques; however, the problem of identifying an initial set of techniques to be evaluated remains.

McCabe and her colleagues wanted to check whether systemic differences, such as variations in parsers and the use of stemming, stop words and other indexing techniques, influence the results of combining techniques.[10] They implemented a probabilistic and a vector space retrieval strategy using the same parser and the same relational retrieval engine. They found that the results were consistent with previous experiments, and concluded that Lee's overlap ratio seems to be a suitable indicator of the potential for benefiting from result set merging.

Two years later, however, the same team investigated the combination of highly effective retrieval strategies with systemic differences held constant.[9] Their results showed that the implemented strategies produced good individual results, whereas the combination of the individual results provided an improvement of only 4%. They believe that as the effectiveness of the component retrieval strategies increases, the benefit of their combination decreases. The most effective retrieval strategies known produce more similar results than previously thought, they claim.

This claim was further investigated by some of the same people. In 2003, they showed empirically that Lee's overlap ratio is an insufficient indicator of the improvements to be gained from the fusion of techniques.[5] For fusion to improve effectiveness, the result sets being fused must contain a significant number of unique relevant documents, which must be highly ranked, they conclude. They explain the previous empirical support for Lee's hypothesis by systemic variations such as parsing rules, stemming, phrase processing and relevance feedback. Apparently, they do not consider these variations to be part of a retrieval strategy. However, they offer no suggestions as to what systemic properties should be used. It would seem to me that existing fusion algorithms definitely have merit as long as no universal set of optimal systemic properties has been identified.
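The CombSUM and CombMNZ heuristics discussed in this section are simple enough to sketch directly. The following Java sketch is an illustration only; the AFFAIR prototype's own CombMNZ implementation is shown in Figure 22.

```java
// Illustrative CombSUM/CombMNZ merging (after Fox and Shaw): each result
// set maps document ids to similarity values in [0, 1]. Not the AFFAIR
// implementation, which is shown in Figure 22.
import java.util.*;

public class Fusion {
    static Map<String, Double> merge(List<Map<String, Double>> resultSets,
                                     boolean mnz) {
        Map<String, Double> sum = new HashMap<>();   // CombSUM score
        Map<String, Integer> hits = new HashMap<>(); // sets containing the doc
        for (Map<String, Double> rs : resultSets) {
            for (Map.Entry<String, Double> e : rs.entrySet()) {
                sum.merge(e.getKey(), e.getValue(), Double::sum);
                hits.merge(e.getKey(), 1, Integer::sum);
            }
        }
        if (mnz) // CombMNZ: multiply by the number of sets containing the doc
            sum.replaceAll((doc, score) -> score * hits.get(doc));
        return sum; // rank documents by descending score
    }

    public static void main(String[] args) {
        List<Map<String, Double>> sets = List.of(
                Map.of("d1", 0.9, "d2", 0.4),
                Map.of("d2", 0.8, "d3", 0.5));
        // CombMNZ: d2 scores (0.4 + 0.8) * 2 = 2.4 and is ranked first.
        System.out.println(merge(sets, true));
    }
}
```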

3.2 Combining queries

The previous section focused on the combination of result sets produced by different retrieval strategies. However, the combination of result sets also has its applications when the same retrieval mechanism is used for all result sets. In these cases, the variations in the result sets are caused by variations in the query or document representations.

Fox and Shaw performed their retrieval runs on different partitions of the TREC collections in parallel, and subsequently merged the result sets.[17] This was done because of hardware and storage limitations. A complicating factor when merging results from different collections is that collection statistics are sometimes used to rank documents in a result set. Thus, the collection influences the characteristics of the result set, and result sets from different collections are not directly comparable. A prime example of an often used collection statistic is the inverse document frequency of the tf-idf vector based model. Fox and Shaw avoided this problem by using ranking algorithms that didn't employ collection-specific statistics.

Belkin and colleagues used human search experts to formulate five independent boolean query formulations designed to solve the same information need.[16] These queries were automatically combined by the INQUERY probabilistic inference network retrieval engine. The engine then used the combined query to retrieve a result set. Experiments showed that the combined boolean queries outperformed every individual boolean query. After experimenting with the weighting of the different queries, they were also able to use the boolean queries to enhance the performance of INQUERY's natural language based queries, which were used as a baseline, even though the individual boolean queries all performed worse than the baseline.

Lee developed a fully automatic way of generating multiple query representations for a given information problem.[12] First, an initial query was generated for the given information problem, and an initial retrieval was performed. Then five different relevance feedback methods implemented in the SMART system were applied to the top-retrieved documents of the initial run and used to produce five different query expansions. These five feedback methods generated quite dissimilar query expansions, but the different query expansions resulted in similar levels of retrieval effectiveness. Consequently, they were successfully combined into one final result set which outperformed the individual queries.

Lee also experimented with the combination of different term weighting schemes for the same queries.[14] In vector based models, the terms of a query must be weighted and matched with a term vector representation of the various documents in the collection. Needless to say, the weighting scheme used to determine the vector representation of a document influences the result of the search. Lee showed that different weighting schemes have different strengths and weaknesses and thus benefit from multiple evidence combination.

Kamps, Monz and de Rijke experimented with multiple evidence combination in the context of multilingual information retrieval.[7] In their experiments, the queries and documents were expressed in different languages. They combined n-gram techniques with morphological tools in the formulation and translation of queries. They found that combining runs could consistently improve even high quality base runs, and that the interpolation factors required for the best gain in performance seemed to be fairly robust across different topics.

Chowdhury, Beitzel and Jensen experimented with the combination of different TREC query representations.[6] The TREC conferences for the evaluation of text retrieval techniques use a collection of documents categorized into topics. If the topic descriptions are viewed as queries, the categorization can be viewed as a gold standard result of these queries over the entire document collection. The topic descriptions are passages of text divided into title, description, narrative and other sections. Chowdhury and his colleagues used two different query representations: one short representation consisting of the title and description of the topic, and one long representation consisting of the title and narrative of the topic. The two query representations were used to retrieve documents with their lab's IR engine, AIRE. A merging technique that took the large difference in the two representations' query lengths into account was then suggested, and this technique was found to outperform the traditional CombSUM and CombMNZ merging algorithms in this setting. This experiment illustrates that when differences in result set characteristics are caused by systematic differences in query representations, the merging algorithm should take these underlying causes into account.

3.3 Common combinations of techniques

The two previous sections have described various ways of combining queries or result sets. A lot of research has focused on the actual merging algorithms. In many cases, variations around a single technique are employed to produce multiple sources of evidence. In this section some examples of the combination of two different retrieval techniques are listed.

In their influential TREC experiments[17], Fox and Shaw tested the combination of vector based retrieval with P-norm extended boolean retrieval. Two different vector representations of queries were used, and three different p parameters were used in the extended boolean retrieval runs. The five result sets were combined in various ways using various merging algorithms.

Belkin et al.'s experiments on the combination of boolean queries tested the combination of classical boolean and natural language queries.[16]

In demonstrating the abilities of the SIRE information retrieval engine and experimenting with the impact of systemic differences on fusion effectiveness, McCabe et al. combined a probabilistic and a vector space retrieval technique.[10] In later experiments investigating the validity of the relevant overlap hypothesis put forward by Lee, they used the same engine with multiple vector-based and probabilistic techniques.[5]

The remainder of the examined experiments are generally limited to experiments over variations of the vector space model for information retrieval. This model apparently has a dominating position, as it is included in every multiple evidence combination experiment that I have reviewed, with the exception of Belkin et al.'s combination of boolean and natural language queries dating back to 1993.[16] No example of a combination of a probabilistic and a boolean technique has been found.

3.4 Discussion

There is wide evidence in the literature that the combination of multiple sources of evidence can greatly improve information retrieval performance. Critics have pointed out that the utility of multiple evidence combination is reduced as the performance of the individual techniques increases. This is only natural. The goal of research should at any rate be to develop as good an information retrieval strategy as possible, and multiple evidence combination may provide vital contributions to this work.
Fox and Shaw found that the combination of different search paradigms provided better results than just combining multiple similar queries.[17] In her overview of the first TREC conference for the evaluation of text retrieval techniques[19], Donna Harman commented that the various submitted strategies retrieved very different documents. Thus, there should be great potential for improvement in the combination of many or all common retrieval techniques. Consequently, it is somewhat surprising that more effort hasn't been put into research on combinations of totally different search paradigms. Instead, researchers have shown great creativity in devising new ways of employing multiple evidence combination within a single paradigm. One reason for this may be the dominant position of the vector based model for information retrieval. Another reason may be that researchers have performed the experiments that have been practically possible within their existing frameworks for information retrieval. As we will see in the next chapter, these frameworks tend to be based on a single model of information retrieval.

4 Existing architectures for Information Retrieval

Any attempt at developing a new architecture should take existing efforts into consideration. In this chapter, some examples of existing frameworks and architectures for information retrieval are examined. This will enable us to study key features that should be present in our proposed architecture in order to aid further research into multiple evidence combination. In addition, the features making the proposed agent-based architecture original and worthwhile are identified.

The systems considered subsequently are chosen because they represent typical implementations of various search paradigms, or because they are applicable or have been applied to multiple evidence combination. Table 1 gives an overview of the systems that have been reviewed. The systems are reviewed in the order that they were developed.

Search system | Underlying IR model(s)  | Special features
SMART         | Vector                  | Boolean queries may be expressed
URSA          | Vector                  | SQL-like query language; distributed architecture
INQUERY       | Probabilistic           | Inference network
OKAPI         | Probabilistic           | Manual relevance feedback
ProFusion     | Multiple evidence       | Limited to web searching
SENTINEL      | Vector + n-grams        | 3D visualization tool
SIRE          | Relational vector model | Built on an RDBMS
AIRE          | Flexible                | Built to support multiple paradigms using a single parser

Table 1: Overview of examined IR systems

4.1 The SMART system

The SMART information retrieval system was developed at Cornell University in Ithaca, New York, and dates back to the early 1960s. The system has been reimplemented many times. This discussion is based on the C implementation described by Chris Buckley in 1985.[32]

The goals of SMART were twofold. First, it should provide a flexible experimental system for research in information retrieval. Second, it should provide a fast, portable and interactive environment for actual users. Buckley notes that these two goals naturally conflict with each other, and states that the current SMART design is an attempt to satisfy each as much as possible. Two types of flexibility are emphasized in SMART: flexible algorithm parameters and flexible design. The flexible parameters allow experimenters to easily try out new tweaks to the underlying algorithms. The design is a collection of modules and library routines typical of well-structured C programs, and Buckley argues that this allows programmers to easily modify the algorithms.

SMART is based on the vector model for information retrieval. Within this model, some variations are possible. The system supports three different modes of access to the document collection:

- Sequential: iterate through all documents
- Inverted file: look up the relevant files in an index
- Clustered: iterate through clusters of documents
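To make the difference between these access modes concrete, the following is a minimal Java sketch of the inverted file mode: an index mapping terms to the documents containing them, so that a query only touches candidate documents instead of scanning the whole collection. It illustrates the general idea only, not SMART's actual data structures.

```java
// Minimal illustration of inverted-file access (the general idea, not
// SMART's actual data structures): map each term to the ids of the
// documents containing it, then answer queries from the index alone.
import java.util.*;

public class InvertedFile {
    public static void main(String[] args) {
        Map<String, List<String>> docs = Map.of(
                "d1", List.of("software", "architecture"),
                "d2", List.of("information", "retrieval"),
                "d3", List.of("software", "agents"));

        // Build the inverted index: term -> sorted set of document ids.
        Map<String, Set<String>> index = new HashMap<>();
        for (var entry : docs.entrySet())
            for (String term : entry.getValue())
                index.computeIfAbsent(term, t -> new TreeSet<>())
                     .add(entry.getKey());

        // Sequential access would scan d1..d3; inverted access looks up
        // the posting list for the query term directly.
        System.out.println(index.getOrDefault("software", Set.of()));
        // prints [d1, d3]
    }
}
```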


More information

SEARCH ENGINE WITH PARALLEL PROCESSING AND INCREMENTAL K-MEANS FOR FAST SEARCH AND RETRIEVAL

SEARCH ENGINE WITH PARALLEL PROCESSING AND INCREMENTAL K-MEANS FOR FAST SEARCH AND RETRIEVAL SEARCH ENGINE WITH PARALLEL PROCESSING AND INCREMENTAL K-MEANS FOR FAST SEARCH AND RETRIEVAL Krishna Kiran Kattamuri 1 and Rupa Chiramdasu 2 Department of Computer Science Engineering, VVIT, Guntur, India

More information

Stemming Methodologies Over Individual Query Words for an Arabic Information Retrieval System

Stemming Methodologies Over Individual Query Words for an Arabic Information Retrieval System Stemming Methodologies Over Individual Query Words for an Arabic Information Retrieval System Hani Abu-Salem* and Mahmoud Al-Omari Department of Computer Science, Mu tah University, P.O. Box (7), Mu tah,

More information

In mathematics, there are four attainment targets: using and applying mathematics; number and algebra; shape, space and measures, and handling data.

In mathematics, there are four attainment targets: using and applying mathematics; number and algebra; shape, space and measures, and handling data. MATHEMATICS: THE LEVEL DESCRIPTIONS In mathematics, there are four attainment targets: using and applying mathematics; number and algebra; shape, space and measures, and handling data. Attainment target

More information

Guideline for Implementing the Universal Data Element Framework (UDEF)

Guideline for Implementing the Universal Data Element Framework (UDEF) Guideline for Implementing the Universal Data Element Framework (UDEF) Version 1.0 November 14, 2007 Developed By: Electronic Enterprise Integration Committee Aerospace Industries Association, Inc. Important

More information

Improved Software Testing Using McCabe IQ Coverage Analysis

Improved Software Testing Using McCabe IQ Coverage Analysis White Paper Table of Contents Introduction...1 What is Coverage Analysis?...2 The McCabe IQ Approach to Coverage Analysis...3 The Importance of Coverage Analysis...4 Where Coverage Analysis Fits into your

More information

Current Standard: Mathematical Concepts and Applications Shape, Space, and Measurement- Primary

Current Standard: Mathematical Concepts and Applications Shape, Space, and Measurement- Primary Shape, Space, and Measurement- Primary A student shall apply concepts of shape, space, and measurement to solve problems involving two- and three-dimensional shapes by demonstrating an understanding of:

More information

COMPUTER SCIENCE (5651) Test at a Glance

COMPUTER SCIENCE (5651) Test at a Glance COMPUTER SCIENCE (5651) Test at a Glance Test Name Computer Science Test Code 5651 Time Number of Questions Test Delivery 3 hours 100 selected-response questions Computer delivered Content Categories Approximate

More information

User research for information architecture projects

User research for information architecture projects Donna Maurer Maadmob Interaction Design http://maadmob.com.au/ Unpublished article User research provides a vital input to information architecture projects. It helps us to understand what information

More information

Constructing a TpB Questionnaire: Conceptual and Methodological Considerations

Constructing a TpB Questionnaire: Conceptual and Methodological Considerations Constructing a TpB Questionnaire: Conceptual and Methodological Considerations September, 2002 (Revised January, 2006) Icek Ajzen Brief Description of the Theory of Planned Behavior According to the theory

More information

Managing Variability in Software Architectures 1 Felix Bachmann*

Managing Variability in Software Architectures 1 Felix Bachmann* Managing Variability in Software Architectures Felix Bachmann* Carnegie Bosch Institute Carnegie Mellon University Pittsburgh, Pa 523, USA fb@sei.cmu.edu Len Bass Software Engineering Institute Carnegie

More information

Information Retrieval. Lecture 8 - Relevance feedback and query expansion. Introduction. Overview. About Relevance Feedback. Wintersemester 2007

Information Retrieval. Lecture 8 - Relevance feedback and query expansion. Introduction. Overview. About Relevance Feedback. Wintersemester 2007 Information Retrieval Lecture 8 - Relevance feedback and query expansion Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 32 Introduction An information

More information

Information Retrieval Models

Information Retrieval Models Information Retrieval Models Djoerd Hiemstra University of Twente 1 Introduction author version Many applications that handle information on the internet would be completely inadequate without the support

More information

3. Mathematical Induction

3. Mathematical Induction 3. MATHEMATICAL INDUCTION 83 3. Mathematical Induction 3.1. First Principle of Mathematical Induction. Let P (n) be a predicate with domain of discourse (over) the natural numbers N = {0, 1,,...}. If (1)

More information

Social Media Mining. Data Mining Essentials

Social Media Mining. Data Mining Essentials Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

More information

Introduction to Fixed Effects Methods

Introduction to Fixed Effects Methods Introduction to Fixed Effects Methods 1 1.1 The Promise of Fixed Effects for Nonexperimental Research... 1 1.2 The Paired-Comparisons t-test as a Fixed Effects Method... 2 1.3 Costs and Benefits of Fixed

More information

IFS-8000 V2.0 INFORMATION FUSION SYSTEM

IFS-8000 V2.0 INFORMATION FUSION SYSTEM IFS-8000 V2.0 INFORMATION FUSION SYSTEM IFS-8000 V2.0 Overview IFS-8000 v2.0 is a flexible, scalable and modular IT system to support the processes of aggregation of information from intercepts to intelligence

More information

A Business Process Services Portal

A Business Process Services Portal A Business Process Services Portal IBM Research Report RZ 3782 Cédric Favre 1, Zohar Feldman 3, Beat Gfeller 1, Thomas Gschwind 1, Jana Koehler 1, Jochen M. Küster 1, Oleksandr Maistrenko 1, Alexandru

More information

Business Intelligence and Decision Support Systems

Business Intelligence and Decision Support Systems Chapter 12 Business Intelligence and Decision Support Systems Information Technology For Management 7 th Edition Turban & Volonino Based on lecture slides by L. Beaubien, Providence College John Wiley

More information

Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment

Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment 2009 10th International Conference on Document Analysis and Recognition Low Cost Correction of OCR Errors Using Learning in a Multi-Engine Environment Ahmad Abdulkader Matthew R. Casey Google Inc. ahmad@abdulkader.org

More information

Purposes and Processes of Reading Comprehension

Purposes and Processes of Reading Comprehension 2 PIRLS Reading Purposes and Processes of Reading Comprehension PIRLS examines the processes of comprehension and the purposes for reading, however, they do not function in isolation from each other or

More information

Fuzzy Logic for Software Metric Models Throughout the Development Life-Cycle. Andrew Gray Stephen MacDonell

Fuzzy Logic for Software Metric Models Throughout the Development Life-Cycle. Andrew Gray Stephen MacDonell DUNEDIN NEW ZEALAND Fuzzy Logic for Software Metric Models Throughout the Development Life-Cycle Andrew Gray Stephen MacDonell The Information Science Discussion Paper Series Number 99/20 September 1999

More information

Predict the Popularity of YouTube Videos Using Early View Data

Predict the Popularity of YouTube Videos Using Early View Data 000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

More information

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING EFFICIENT DATA PRE-PROCESSING FOR DATA MINING USING NEURAL NETWORKS JothiKumar.R 1, Sivabalan.R.V 2 1 Research scholar, Noorul Islam University, Nagercoil, India Assistant Professor, Adhiparasakthi College

More information

16.1 MAPREDUCE. For personal use only, not for distribution. 333

16.1 MAPREDUCE. For personal use only, not for distribution. 333 For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several

More information

South Carolina College- and Career-Ready (SCCCR) Probability and Statistics

South Carolina College- and Career-Ready (SCCCR) Probability and Statistics South Carolina College- and Career-Ready (SCCCR) Probability and Statistics South Carolina College- and Career-Ready Mathematical Process Standards The South Carolina College- and Career-Ready (SCCCR)

More information

SIGIR 2004 Workshop: RIA and "Where can IR go from here?"

SIGIR 2004 Workshop: RIA and Where can IR go from here? SIGIR 2004 Workshop: RIA and "Where can IR go from here?" Donna Harman National Institute of Standards and Technology Gaithersburg, Maryland, 20899 donna.harman@nist.gov Chris Buckley Sabir Research, Inc.

More information

Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework

Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework Effective Data Retrieval Mechanism Using AML within the Web Based Join Framework Usha Nandini D 1, Anish Gracias J 2 1 ushaduraisamy@yahoo.co.in 2 anishgracias@gmail.com Abstract A vast amount of assorted

More information

Teaching Methodology for 3D Animation

Teaching Methodology for 3D Animation Abstract The field of 3d animation has addressed design processes and work practices in the design disciplines for in recent years. There are good reasons for considering the development of systematic

More information

Fusion of Information Retrieval Engines (FIRE)

Fusion of Information Retrieval Engines (FIRE) Fusion of Information Retrieval Engines (FIRE) S.Alaoui Mounir, N. Goharian, M. Mahoney, A. Salem, O. Frieder Computer Science Department Florida Institute of Technology Melbourne, FL 32901 Abstract We

More information

Healthcare Measurement Analysis Using Data mining Techniques

Healthcare Measurement Analysis Using Data mining Techniques www.ijecs.in International Journal Of Engineering And Computer Science ISSN:2319-7242 Volume 03 Issue 07 July, 2014 Page No. 7058-7064 Healthcare Measurement Analysis Using Data mining Techniques 1 Dr.A.Shaik

More information

Implementing Portfolio Management: Integrating Process, People and Tools

Implementing Portfolio Management: Integrating Process, People and Tools AAPG Annual Meeting March 10-13, 2002 Houston, Texas Implementing Portfolio Management: Integrating Process, People and Howell, John III, Portfolio Decisions, Inc., Houston, TX: Warren, Lillian H., Portfolio

More information

WRITING A CRITICAL ARTICLE REVIEW

WRITING A CRITICAL ARTICLE REVIEW WRITING A CRITICAL ARTICLE REVIEW A critical article review briefly describes the content of an article and, more importantly, provides an in-depth analysis and evaluation of its ideas and purpose. The

More information

Introduction to Systems Analysis and Design

Introduction to Systems Analysis and Design Introduction to Systems Analysis and Design What is a System? A system is a set of interrelated components that function together to achieve a common goal. The components of a system are called subsystems.

More information

3 An Illustrative Example

3 An Illustrative Example Objectives An Illustrative Example Objectives - Theory and Examples -2 Problem Statement -2 Perceptron - Two-Input Case -4 Pattern Recognition Example -5 Hamming Network -8 Feedforward Layer -8 Recurrent

More information

IAI : Expert Systems

IAI : Expert Systems IAI : Expert Systems John A. Bullinaria, 2005 1. What is an Expert System? 2. The Architecture of Expert Systems 3. Knowledge Acquisition 4. Representing the Knowledge 5. The Inference Engine 6. The Rete-Algorithm

More information

Fall 2012 Q530. Programming for Cognitive Science

Fall 2012 Q530. Programming for Cognitive Science Fall 2012 Q530 Programming for Cognitive Science Aimed at little or no programming experience. Improve your confidence and skills at: Writing code. Reading code. Understand the abilities and limitations

More information

Downloaded from UvA-DARE, the institutional repository of the University of Amsterdam (UvA) http://hdl.handle.net/11245/2.122992

Downloaded from UvA-DARE, the institutional repository of the University of Amsterdam (UvA) http://hdl.handle.net/11245/2.122992 Downloaded from UvA-DARE, the institutional repository of the University of Amsterdam (UvA) http://hdl.handle.net/11245/2.122992 File ID Filename Version uvapub:122992 1: Introduction unknown SOURCE (OR

More information

Component visualization methods for large legacy software in C/C++

Component visualization methods for large legacy software in C/C++ Annales Mathematicae et Informaticae 44 (2015) pp. 23 33 http://ami.ektf.hu Component visualization methods for large legacy software in C/C++ Máté Cserép a, Dániel Krupp b a Eötvös Loránd University mcserep@caesar.elte.hu

More information

Introduction. A. Bellaachia Page: 1

Introduction. A. Bellaachia Page: 1 Introduction 1. Objectives... 3 2. What is Data Mining?... 4 3. Knowledge Discovery Process... 5 4. KD Process Example... 7 5. Typical Data Mining Architecture... 8 6. Database vs. Data Mining... 9 7.

More information

A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING

A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING A FUZZY BASED APPROACH TO TEXT MINING AND DOCUMENT CLUSTERING Sumit Goswami 1 and Mayank Singh Shishodia 2 1 Indian Institute of Technology-Kharagpur, Kharagpur, India sumit_13@yahoo.com 2 School of Computer

More information

The Sierra Clustered Database Engine, the technology at the heart of

The Sierra Clustered Database Engine, the technology at the heart of A New Approach: Clustrix Sierra Database Engine The Sierra Clustered Database Engine, the technology at the heart of the Clustrix solution, is a shared-nothing environment that includes the Sierra Parallel

More information

Sample Size and Power in Clinical Trials

Sample Size and Power in Clinical Trials Sample Size and Power in Clinical Trials Version 1.0 May 011 1. Power of a Test. Factors affecting Power 3. Required Sample Size RELATED ISSUES 1. Effect Size. Test Statistics 3. Variation 4. Significance

More information

Knowledge Discovery using Text Mining: A Programmable Implementation on Information Extraction and Categorization

Knowledge Discovery using Text Mining: A Programmable Implementation on Information Extraction and Categorization Knowledge Discovery using Text Mining: A Programmable Implementation on Information Extraction and Categorization Atika Mustafa, Ali Akbar, and Ahmer Sultan National University of Computer and Emerging

More information

JERIBI Lobna, RUMPLER Beatrice, PINON Jean Marie

JERIBI Lobna, RUMPLER Beatrice, PINON Jean Marie From: FLAIRS-02 Proceedings. Copyright 2002, AAAI (www.aaai.org). All rights reserved. User Modeling and Instance Reuse for Information Retrieval Study Case : Visually Disabled Users Access to Scientific

More information

PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS.

PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS. PSG College of Technology, Coimbatore-641 004 Department of Computer & Information Sciences BSc (CT) G1 & G2 Sixth Semester PROJECT DETAILS Project Project Title Area of Abstract No Specialization 1. Software

More information

Moral Hazard. Itay Goldstein. Wharton School, University of Pennsylvania

Moral Hazard. Itay Goldstein. Wharton School, University of Pennsylvania Moral Hazard Itay Goldstein Wharton School, University of Pennsylvania 1 Principal-Agent Problem Basic problem in corporate finance: separation of ownership and control: o The owners of the firm are typically

More information

Query Recommendation employing Query Logs in Search Optimization

Query Recommendation employing Query Logs in Search Optimization 1917 Query Recommendation employing Query Logs in Search Optimization Neha Singh Department of Computer Science, Shri Siddhi Vinayak Group of Institutions, Bareilly Email: singh26.neha@gmail.com Dr Manish

More information

Information Retrieval Elasticsearch

Information Retrieval Elasticsearch Information Retrieval Elasticsearch IR Information retrieval (IR) is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches

More information

Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC

Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC 1. Introduction A popular rule of thumb suggests that

More information

Introduction to IR Systems: Supporting Boolean Text Search. Information Retrieval. IR vs. DBMS. Chapter 27, Part A

Introduction to IR Systems: Supporting Boolean Text Search. Information Retrieval. IR vs. DBMS. Chapter 27, Part A Introduction to IR Systems: Supporting Boolean Text Search Chapter 27, Part A Database Management Systems, R. Ramakrishnan 1 Information Retrieval A research field traditionally separate from Databases

More information

Towards a Visually Enhanced Medical Search Engine

Towards a Visually Enhanced Medical Search Engine Towards a Visually Enhanced Medical Search Engine Lavish Lalwani 1,2, Guido Zuccon 1, Mohamed Sharaf 2, Anthony Nguyen 1 1 The Australian e-health Research Centre, Brisbane, Queensland, Australia; 2 The

More information

Parsing Technology and its role in Legacy Modernization. A Metaware White Paper

Parsing Technology and its role in Legacy Modernization. A Metaware White Paper Parsing Technology and its role in Legacy Modernization A Metaware White Paper 1 INTRODUCTION In the two last decades there has been an explosion of interest in software tools that can automate key tasks

More information

Basics of Dimensional Modeling

Basics of Dimensional Modeling Basics of Dimensional Modeling Data warehouse and OLAP tools are based on a dimensional data model. A dimensional model is based on dimensions, facts, cubes, and schemas such as star and snowflake. Dimensional

More information

SPATIAL DATA CLASSIFICATION AND DATA MINING

SPATIAL DATA CLASSIFICATION AND DATA MINING , pp.-40-44. Available online at http://www. bioinfo. in/contents. php?id=42 SPATIAL DATA CLASSIFICATION AND DATA MINING RATHI J.B. * AND PATIL A.D. Department of Computer Science & Engineering, Jawaharlal

More information

types of information systems computer-based information systems

types of information systems computer-based information systems topics: what is information systems? what is information? knowledge representation information retrieval cis20.2 design and implementation of software applications II spring 2008 session # II.1 information

More information

(Refer Slide Time: 01:52)

(Refer Slide Time: 01:52) Software Engineering Prof. N. L. Sarda Computer Science & Engineering Indian Institute of Technology, Bombay Lecture - 2 Introduction to Software Engineering Challenges, Process Models etc (Part 2) This

More information

American Journal of Engineering Research (AJER) 2013 American Journal of Engineering Research (AJER) e-issn: 2320-0847 p-issn : 2320-0936 Volume-2, Issue-4, pp-39-43 www.ajer.us Research Paper Open Access

More information

INTRUSION PREVENTION AND EXPERT SYSTEMS

INTRUSION PREVENTION AND EXPERT SYSTEMS INTRUSION PREVENTION AND EXPERT SYSTEMS By Avi Chesla avic@v-secure.com Introduction Over the past few years, the market has developed new expectations from the security industry, especially from the intrusion

More information

DATA MODELING AND RELATIONAL DATABASE DESIGN IN ARCHAEOLOGY

DATA MODELING AND RELATIONAL DATABASE DESIGN IN ARCHAEOLOGY DATA MODELING AND RELATIONAL DATABASE DESIGN IN ARCHAEOLOGY by Manuella Kadar Abstract. Data from archaeological excavation is suitable for computerization although they bring challenges typical of working

More information

Fairfield Public Schools

Fairfield Public Schools Mathematics Fairfield Public Schools AP Statistics AP Statistics BOE Approved 04/08/2014 1 AP STATISTICS Critical Areas of Focus AP Statistics is a rigorous course that offers advanced students an opportunity

More information

Five High Order Thinking Skills

Five High Order Thinking Skills Five High Order Introduction The high technology like computers and calculators has profoundly changed the world of mathematics education. It is not only what aspects of mathematics are essential for learning,

More information

SCADE SUITE SOFTWARE VERIFICATION PLAN FOR DO-178B LEVEL A & B

SCADE SUITE SOFTWARE VERIFICATION PLAN FOR DO-178B LEVEL A & B SCADE SUITE SOFTWARE VERIFICATION PLAN FOR DO-78B LEVEL A & B TABLE OF CONTENTS. INTRODUCTION..... PURPOSE..... RELATED DOCUMENTS..... GLOSSARY... 9.. CONVENTIONS..... RELATION WITH OTHER PLANS....6. MODIFICATION

More information

1 Solving LPs: The Simplex Algorithm of George Dantzig

1 Solving LPs: The Simplex Algorithm of George Dantzig Solving LPs: The Simplex Algorithm of George Dantzig. Simplex Pivoting: Dictionary Format We illustrate a general solution procedure, called the simplex algorithm, by implementing it on a very simple example.

More information

Program Visualization for Programming Education Case of Jeliot 3

Program Visualization for Programming Education Case of Jeliot 3 Program Visualization for Programming Education Case of Jeliot 3 Roman Bednarik, Andrés Moreno, Niko Myller Department of Computer Science University of Joensuu firstname.lastname@cs.joensuu.fi Abstract:

More information

The Software Process. The Unified Process (Cont.) The Unified Process (Cont.)

The Software Process. The Unified Process (Cont.) The Unified Process (Cont.) The Software Process Xiaojun Qi 1 The Unified Process Until recently, three of the most successful object-oriented methodologies were Booch smethod Jacobson s Objectory Rumbaugh s OMT (Object Modeling

More information

Information Retrieval Systems in XML Based Database A review

Information Retrieval Systems in XML Based Database A review Information Retrieval Systems in XML Based Database A review Preeti Pandey 1, L.S.Maurya 2 Research Scholar, IT Department, SRMSCET, Bareilly, India 1 Associate Professor, IT Department, SRMSCET, Bareilly,

More information

www.gr8ambitionz.com

www.gr8ambitionz.com Data Base Management Systems (DBMS) Study Material (Objective Type questions with Answers) Shared by Akhil Arora Powered by www. your A to Z competitive exam guide Database Objective type questions Q.1

More information

Overview. Essential Questions. Precalculus, Quarter 4, Unit 4.5 Build Arithmetic and Geometric Sequences and Series

Overview. Essential Questions. Precalculus, Quarter 4, Unit 4.5 Build Arithmetic and Geometric Sequences and Series Sequences and Series Overview Number of instruction days: 4 6 (1 day = 53 minutes) Content to Be Learned Write arithmetic and geometric sequences both recursively and with an explicit formula, use them

More information

Graphical Web based Tool for Generating Query from Star Schema

Graphical Web based Tool for Generating Query from Star Schema Graphical Web based Tool for Generating Query from Star Schema Mohammed Anbar a, Ku Ruhana Ku-Mahamud b a College of Arts and Sciences Universiti Utara Malaysia, 0600 Sintok, Kedah, Malaysia Tel: 604-2449604

More information

Clustering Technique in Data Mining for Text Documents

Clustering Technique in Data Mining for Text Documents Clustering Technique in Data Mining for Text Documents Ms.J.Sathya Priya Assistant Professor Dept Of Information Technology. Velammal Engineering College. Chennai. Ms.S.Priyadharshini Assistant Professor

More information

Requirements Traceability Recovery

Requirements Traceability Recovery MASTER S THESIS Requirements Traceability Recovery - A Study of Available Tools - Author: Lina Brodén Supervisor: Markus Borg Examiner: Prof. Per Runeson April 2011 Abstract This master s thesis is focused

More information

Categorical Data Visualization and Clustering Using Subjective Factors

Categorical Data Visualization and Clustering Using Subjective Factors Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,

More information

Greatest Common Factors and Least Common Multiples with Venn Diagrams

Greatest Common Factors and Least Common Multiples with Venn Diagrams Greatest Common Factors and Least Common Multiples with Venn Diagrams Stephanie Kolitsch and Louis Kolitsch The University of Tennessee at Martin Martin, TN 38238 Abstract: In this article the authors

More information

IF The customer should receive priority service THEN Call within 4 hours PCAI 16.4

IF The customer should receive priority service THEN Call within 4 hours PCAI 16.4 Back to Basics Backward Chaining: Expert System Fundamentals By Dustin Huntington Introduction Backward chaining is an incredibly powerful yet widely misunderstood concept, yet it is key to building many

More information

Conditional Probability, Independence and Bayes Theorem Class 3, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom

Conditional Probability, Independence and Bayes Theorem Class 3, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom Conditional Probability, Independence and Bayes Theorem Class 3, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom 1 Learning Goals 1. Know the definitions of conditional probability and independence

More information

Principles of Data-Driven Instruction

Principles of Data-Driven Instruction Education in our times must try to find whatever there is in students that might yearn for completion, and to reconstruct the learning that would enable them autonomously to seek that completion. Allan

More information