Searching and surfing the web using a semi-adaptive meta-engine

A. Castellucci, G. Ianni
DEIS, Università della Calabria, Rende (CS), Italy
tony73@writeme.com, ianni@deis.unical.it

D. Vasile
Pitagora S.p.A., Rende (CS), Italy
vasile@pitagora.it

S. Costa
CM Sistemi Sud S.r.l., Cosenza, Italy
sebastiano.costa@gruppocm.it

Abstract

Global Search (1) is a web agent that integrates and enhances many well-known search techniques in order to improve the quality of the information gathered from ordinary web search engines. It features intelligent merging of relevant documents from different search engines, anticipated adaptive exploration and evaluation of links from the current result set, and automated derivation of refined queries based on user relevance feedback.

1. Introduction

The recent explosive growth of the World Wide Web has drawn the attention of a wide range of users to the difficulty of information retrieval over the Internet. The usual workaround is the adoption of huge web indexes that can be queried with keyword-based questions, such as the well-known Altavista, Lycos and Google. Unfortunately, no existing index can successfully track all the existing web pages, in spite of many recent brute-force attempts such as the Inktomi indexing project [12]. Moreover, the document selection technique adopted by each search engine is often very arbitrary and heterogeneous [4]. Merging the documents found by different search engines enhances both the overall web coverage and the quality of the documents found: many techniques for automatically collecting results from different search engines are known, such as the ones shown in MetaCrawler, Profusion, Inquirus and SavvySearch [19, 8, 14, 4], and the one from the recent commercial experience of Copernic [3].
Moreover, many studies have considered a) the possibility of agent-based autonomous search, pursuing various purposes [16, 13], and b) the possibility of involving the user, in order to learn from his or her preferences about the pages found [18].

(1) The design of this prototype (also called Good Stuff Agent) was fully sponsored by C.I.E.S., Centro di Ingegneria Economica e Sociale, P.te P. Bucci, Rende (CS), Italy.

Global Search Agent (GSA in the following) is a standalone application to be installed on the user's Internet-connected machine. From the application, the user specifies requests in the usual way, as a set of keywords. GSA queries a relevant set of search engines, then collects and ranks their results; the user can browse documents as soon as they are displayed, while the system searches for other results and browses and ranks the links adjacent to the initial ones. Moreover, the user can a) classify the queries made in a structured concept tree: the agent uses the tree structure to restrict the retrieved documents to those expected to fall within the tree; and b) express an opinion about each document found: GSA uses these preferences to find further keywords that can improve how well the retrieved documents match what the user means. Searches can be scheduled, configured with respect to many parameters (set of search engines queried, ranking technique, duration, etc.), and delegated to a remote instance of the agent, which pushes back the results found when the search is done.

2. Meta-searching

Many problems arise when we try to successfully merge results from heterogeneous search sources. First, the right way to query each search engine differs widely from one engine to another. Second, results are sent back to the user in a semistructured form (usually an HTML page): an ad hoc parser is therefore needed for each different search source the agent may wish to query.
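The need for one ad hoc parser per search source can be illustrated with a minimal Python sketch. It is not GSA code: the two engine names, their HTML layouts, and the `Result` record are hypothetical, chosen only to show how differently formatted result pages can be reduced to one engine-independent form and merged without waiting for all engines to finish.

```python
import re
from dataclasses import dataclass
from typing import Callable, Dict, Iterator


@dataclass(frozen=True)
class Result:
    """Engine-independent description of one search hit (hypothetical record)."""
    url: str
    title: str
    engine: str


def parse_engine_a(html: str) -> Iterator[Result]:
    # Hypothetical layout: <a class="hit" href="URL">TITLE</a>
    for url, title in re.findall(r'<a class="hit" href="([^"]+)">([^<]*)</a>', html):
        yield Result(url, title.strip(), "engine_a")


def parse_engine_b(html: str) -> Iterator[Result]:
    # Hypothetical layout: <li><b>TITLE</b> <i>URL</i></li>
    for title, url in re.findall(r'<li><b>([^<]*)</b>\s*<i>([^<]+)</i></li>', html):
        yield Result(url.strip(), title.strip(), "engine_b")


# One parser per search source; adding a source means adding one entry here.
PARSERS: Dict[str, Callable[[str], Iterator[Result]]] = {
    "engine_a": parse_engine_a,
    "engine_b": parse_engine_b,
}


def merge(pages: Dict[str, str]) -> Iterator[Result]:
    """Yield results one at a time, skipping URLs already reported by another engine."""
    seen = set()
    for engine, html in pages.items():
        for result in PARSERS[engine](html):
            if result.url not in seen:
                seen.add(result.url)
                yield result
```

Because `merge` is a generator, a caller can display or rank each result as soon as it is parsed. No relevance value is attached to a result at this stage.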
Each parser acts as an independent entity and supplies the main application with a new result (in an engine-independent form) whenever a new document is parsed. Unlike [8, 10] and [4], GSA does not attempt to merge results using heuristic techniques intended to deal with the unknown ranking functions of each search engine. This approach did not prove useful for sorting the documents found; moreover, it would force GSA to gather all the results before a single document could be displayed. Thus, parsers do not provide relevance values; the ranking and merging steps are deferred to the stage described in the following section.

3. Adaptive exploration

When the main agent is notified of a new, potentially relevant result, a new entity called a spider is created. A spider retrieves the document it was started on, establishes its ranking, and decides whether it is worth exploring the links leaving the current document. If so, a new child spider is started for each link found. This approach sacrifices efficiency (each document must be retrieved) but effectively removes poorly ranked and/or unreachable documents. The two main questions arising here are a) how to rank a document, and b) how to automatically select interesting links.

To assign a ranking value to a document we chose a ranking function based on the one proposed in [14]. This function has three components: a) a presence component, whose value is proportional to the presence of at least one occurrence of a given term within the page text; b) a frequency component, which weights the overall number of occurrences of a given term within a page; and c) a distance component, which weights the overall distance between occurrences of the given terms. The ranking function takes the document to be scored and a set W of given keywords, and is evaluated as follows (the cardinality of W also appears in the definition):

[Equation: ranking function; not recoverable from the source text]

The presence component is the sum of the presence values of each term. The presence of a term is the maximum similarity found for that term within the text considered. The similarity is introduced in order to take the stem of each word into account: unlike [7], we chose a stemming algorithm that is independent of the language in which the text is supposed to be written.
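The three components described above can be made concrete with a small Python sketch. It is only an illustration under stated assumptions and does not reproduce the authors' formula: the shared-prefix `similarity`, the significance `threshold`, the `max_dist` cut-off, and the way the components are normalized and combined are all hypothetical choices.

```python
import re
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Shared-prefix ratio in [0, 1]; an assumed, language-independent stand-in for stemming."""
    a, b = a.lower(), b.lower()
    common = 0
    for x, y in zip(a, b):
        if x != y:
            break
        common += 1
    return common / max(len(a), len(b), 1)


def rank(text, keywords, threshold=0.7, max_dist=20, weights=(1.0, 1.0, 1.0)):
    """Score a text against keywords; the result always lies in [0, 1)."""
    words = re.findall(r"\w+", text.lower())
    # positions[k] lists (word index, similarity) for each significant occurrence of k.
    positions = {}
    for k in keywords:
        scored = ((i, similarity(k, w)) for i, w in enumerate(words))
        positions[k] = [(i, s) for i, s in scored if s >= threshold]
    n = len(positions) or 1
    # Presence: best similarity found for each keyword, averaged over the keywords.
    presence = sum(max((s for _, s in occ), default=0.0) for occ in positions.values()) / n
    # Frequency: similarity-weighted occurrence count, squashed so it never reaches 1.
    total = sum(s for occ in positions.values() for _, s in occ)
    frequency = total / (total + n)
    # Distance: closeness (measured in words) of the nearest occurrences of each keyword pair.
    pairs = list(combinations(positions, 2))
    closeness = 0.0
    for a, b in pairs:
        gaps = [abs(i - j) for i, _ in positions[a] for j, _ in positions[b]]
        if gaps:
            closeness += 1.0 - min(min(gaps), max_dist) / max_dist
    distance = closeness / len(pairs) if pairs else 0.0
    wp, wf, wd = weights
    return (wp * presence + wf * frequency + wd * distance) / (wp + wf + wd)
```

With these assumptions, a text in which the keywords appear next to each other scores higher than one in which they are several words apart, and a text containing no similar word scores zero.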
The similarity of a term with another one is maximal when the two terms are identical: two words with the same stem have a similarity very near to that maximum (i.e. the first one can be considered a significant occurrence of the second one), whereas words with low similarity with respect to the set W are cut off. The value of the frequency component is the total sum of the significant occurrences of the words of W; each significant occurrence is weighted by the corresponding similarity value. The distance component depends on the number of words of W with a significant presence value and on the minimum distance found between two significant occurrences of each pair of words of W. Two constants control the shape of the function, three suitably chosen weights balance the three components, and a further parameter sets the maximum distance (in words) considered significant for two occurrences of terms in W. At the moment, these values can be set by the user.

Unlike the function in [14], our ranking function a) directly embeds some stemming techniques, b) expresses distances in words rather than in characters, and c) is naturally bounded within a given range (it ranges from a minimum value to an asymptotic maximum). This eliminates the need to rescale the rank values at the end of the search and to know a priori the number of documents retrieved; it also defines a sort of ideal document, whose relevance value tends to the right edge of the allowed score interval.

As far as spider behaviour is concerned, the idea is close to the approach of Letizia and WebWatcher [16, 13], but the purposes are quite different. Each spider takes into account the list of ranking values scored by the last documents visited and the concept subtree to which the originating query belongs. Each tree node carries a concept name and some concept keywords (currently chosen by the user). These values are used to compute a happiness function based on the most recently visited documents.
In particular, the happiness is computed from the combined scores received by the documents considered:

[Equation: happiness function and combined score; not recoverable from the source text]

Given a set of keywords representing a query over the web, and the sets of keywords representing the ancestor concepts of that query, the combined score of a document is obtained by ranking the document against these keyword sets in turn; two fixed parameters control the computation. The function is similar to the average score of the last documents visited: when a document scores too low, the spider tries to score it using the concept keywords of the parent node of the originating query, and so on, until the root node is reached or a worthwhile score is obtained. However, these further ranking steps carry a lower weight when the overall happiness is computed.

When the happiness of a spider falls to the given threshold value, the spider dies; otherwise, if the maximum allowed depth has not yet been reached (i.e. the maximum number of documents a spider can explore regardless of its happiness), the spider creates a child spider (which inherits the state of its parent) for each link within the current document. A link pointing to an unknown and/or unwanted information source (such as a binary file) is automatically discarded. The search goes on as long as at least one spider is alive; the higher the happiness of a spider, the higher its execution priority.

[Figure 1 diagram: WWW, MetaSearch, Spider, Scheduler, FeedBack, Remote Control and User blocks of the GSA architecture, exchanging unranked and ranked URLs]

4. Learning from user preferences

Following [18], the user can express a boolean preference (e.g. hot document, cold document) on each document retrieved, or ignore some of them; the user can then ask GSA to take these preferences into account. GSA parses the hot and cold documents and extracts a set of good terms and a set of bad terms (the latter is not used in the current release). We chose not to adopt traditional Bayesian clustering methods [15, 18]: to be effective, such techniques burden the user by requiring the classification of very large sets of documents. We therefore preferred a heuristic technique, which showed very interesting performance, especially with small sets of documents. In order to extract a suitable set of good terms, GSA compiles a ranking of candidate terms and outputs the best-scoring ones. A set of stopwords [7], containing very common English and Italian terms, is excluded a priori from the ranking (obviously, this does not prevent the user from manually entering a stopword in a search).
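The extraction of good terms from hot and cold documents can be sketched in Python as follows. This is a simplified stand-in, not the relevance function used by GSA: the tiny stopword list, the exclusion of the query words themselves, and the plus-one/minus-one scoring per document are assumptions made for illustration.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (English and Italian); a real list would be much longer.
STOPWORDS = {"the", "a", "of", "and", "in", "il", "la", "di", "e"}


def tokens(text):
    """Lower-cased word tokens of a text."""
    return re.findall(r"\w+", text.lower())


def suggest_terms(hot_docs, cold_docs, query, top=3):
    """Rank candidate terms: +1 per hot document containing the term, -1 per cold one."""
    excluded = STOPWORDS | {q.lower() for q in query}
    scores = Counter()
    for doc in hot_docs:
        # A set is used so that repeated occurrences in one document count once.
        scores.update(set(tokens(doc)) - excluded)
    for doc in cold_docs:
        scores.subtract(set(tokens(doc)) - excluded)
    # Best score first; ties are broken alphabetically to keep the output deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, score in ranked[:top] if score > 0]
```

For instance, with two hot documents about the sailing competition and one cold document about a restaurant, `suggest_terms(hot, cold, ["luna", "rossa"], top=2)` would return the terms shared by the hot documents only.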
The score of each term is obtained by a relevance function computed over the set of good documents, the set of bad documents, and the set of words of the originating query:

[Equation: term relevance function; not recoverable from the source text]

The function depends on the number of occurrences of the term within each document and on the minimum distance (in words) between a significant occurrence of the term and a query term. Each term increments its score if it appears in a good document and decrements its score if it appears in a bad document. Occurrences of a term in a document beyond the first one do not alter the value of the function very much.

Figure 1. The GSA executing environment

5. System Architecture

We describe the GSA architecture with an example. The system starts its activity when a query is entered by the user, by the scheduler (which manages a list of previously arranged queries), or by a remote instance of GSA requesting a search. Assume the entered keywords are Luna Rossa. An additional starting URL can also be given to the system. GSA activates the Spider (SE in the following) and the Metasearch (ME in the following). The two subsystems work in parallel: in this case, the former starts a spider in order to explore and rank the given starting URL, while the latter queries all the available search engines using the keywords Luna Rossa. ME extracts single results from the search engines as soon as they are available, and prompts SE to start a spider on each extracted document. SE manages the spiders: each spider parses a URL, ranks it, decides whether the URL is relevant enough for the corresponding page to be displayed, and decides whether it is worth deploying further spiders on the neighborhood of that URL. The search terminates on user intervention or when ME and SE have no further documents to analyze (i.e. no more spiders can be generated), but the user can examine results while the system is still performing the search.

The Feedback (FE in the following) works offline.
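The way SE manages spiders can be sketched with a priority queue. The Python fragment below is a toy model under explicit assumptions: the web is an in-memory dictionary of links, document scores are given rather than computed, happiness is taken as the plain average of the last `window` scores, and the parameter names and default values are hypothetical.

```python
import heapq


def crawl(web, scores, start_urls, threshold=0.3, max_depth=3, window=3):
    """Explore a toy web graph with happiness-driven spiders.

    web: url -> list of linked urls; scores: url -> rank in [0, 1] (both toy inputs).
    Returns the visited (url, score) pairs, best score first.
    """
    visited, ranked = set(), []
    # Heap entries: (-happiness, tie-breaker, url, depth, score history of the parent).
    heap = [(-1.0, i, url, 0, ()) for i, url in enumerate(start_urls)]
    heapq.heapify(heap)
    counter = len(heap)
    while heap:  # the search goes on as long as some spider is alive
        _, _, url, depth, history = heapq.heappop(heap)
        if url in visited or url not in web:
            continue  # already explored or unreachable document
        visited.add(url)
        # The child inherits the parent's history and appends its own score.
        history = (history + (scores.get(url, 0.0),))[-window:]
        happiness = sum(history) / len(history)
        ranked.append((url, history[-1]))
        if happiness <= threshold or depth >= max_depth:
            continue  # the spider dies without spawning children
        for link in web[url]:
            # Happier spiders get higher priority (smaller heap key).
            heapq.heappush(heap, (-happiness, counter, link, depth + 1, history))
            counter += 1
    return sorted(ranked, key=lambda r: -r[1])
```

In this model a spider that lands on a poorly scored page stops there, so pages reachable only through it are never retrieved, while branches with a good score history keep being explored first.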
The user can specify his/her opinion on which documents are interesting and which are not by marking the corresponding entries of the document list. Once the user's opinion is given (even on a small subset of the overall set of documents retrieved), FE can be started. The output of FE is a set of relevant words, suggested to the user in order to refine the search. Suppose the user is interested in a competition involving the boat Luna Rossa; he/she then marks the documents found accordingly (e.g. marking all the documents not related to sailing as non-relevant). In this case, the words that GSA outputs include america and cup.

6. Experimental results

Table 1 reports some results on a set of 20 short-term queries containing keywords pertaining to different domains. For each query we considered the ten most relevant documents reported by each search engine. The table indicates, for each search engine, the number of documents found, the number of documents actually considered relevant, and the percentage of relevant documents with respect to the documents found. Results from GSA were computed by halting the evaluation after 30 seconds. The connection speed was about 2 Mbit/s. The constants chosen for the scoring function were [...].

Engine      # Documents found   # Relevant documents   Relevance rate
GSA                                                     %
Altavista                                               %
Excite                                                  %
Google                                                  %
Hotbot                                                  %

Table 1. The results of GSA against some other well-known search engines.

Usually, GSA performs fast and very well against single search engines when short-duration searches are submitted; the overhead of directly retrieving each document is more than offset when five or more search engines are queried simultaneously (tests were performed doing meta-search on Altavista, Excite, FastSearch, Google, Hotbot, Lycos, Yahoo, and Webcrawler [1, 5, 6, 9, 11, 21, 20]). GSA also proves very useful when long-duration (e.g. overnight) searches are planned.

Table 2 summarizes the typical relevance rate and the average score of the first ten documents retrieved when the same search is halted after 1 minute, after 5 minutes, or never (in the latter case GSA halts after a time that depends on the initial happiness of the spiders). The quality of the documents retrieved increases significantly on long-term queries, whereas the relevance rate reaches its maximum very soon.

Query duration   Relevance rate   Average score
1 minute         70.0%
5 minutes        100.0%
never halted     %                878

Table 2. The results of GSA on long duration queries (best score=1000).

7. Further Search issues

At the moment we are studying several improvements to the architecture of GSA, such as the automatic generation of engine-dependent parsers [17] and the improvement of spider behaviour through better happiness functions and some cooperative exchange of information between spiders. Moreover, we think the system could be improved by introducing automated concept tree derivation as in [22] and automated parameter tuning [2]. Finally, we should complete the method for learning user preferences with better stemmed parsing and some form of clustering of the terms found.

References

[1] Altavista web site.
[2] B. T. Bartell, G. W. Cottrell, and R. K. Belew. Optimizing parameters in a ranked retrieval system using multi-query relevance feedback. Proc. of the Symposium on Document Analysis and Information Retrieval, Las Vegas.
[3] Copernic web site.
[4] D. Dreilinger and A. E. Howe. Savvysearch: A metasearch engine that learns which search engines to query. AI Magazine, 18(2):19-25.
[5] Excite web site.
[6] Fastsearch web site.
[7] W. Frakes and R. Baeza-Yates (eds.). Information Retrieval: Data structures and algorithms. Prentice-Hall.
[8] S. Gauch, G. Wang, and M. Gomez. Profusion: intelligent fusion from multiple, distributed search engines. Journal of Universal Computer Science, 2(9).
[9] Google web site.
[10] L. Gravano and H. G. Molina. Merging ranks from heterogeneous internet sources. Proc. of the 23rd VLDB Conference, Athens, Greece.
[11] Hotbot web site.
[12] Inktomi web site.
[13] T. Joachims, D. Freitag, and T. Mitchell. Webwatcher: A tour guide for the world wide web. Proc. of the 15th Int. Joint Conf. on Artificial Intelligence, Nagoya, Japan, 1997.
[14] S. Lawrence and C. L. Giles. Inquirus, the NECI meta search engine. Proc. of the Seventh International World Wide Web Conference, Brisbane, Australia.
[15] E. J. Keogh and M. J. Pazzani. Learning augmented bayesian classifiers: a comparison of distribution-based and classification-based approaches. Uncertainty 99: The Seventh International Workshop on Artificial Intelligence and Statistics, Ft. Lauderdale FL, USA.
[16] H. Lieberman. Letizia: An agent that assists web browsing. Proc. of the 14th Int. Joint Conf. on Artificial Intelligence, IJCAI 95, Montréal, Québec, Canada.
[17] S. Nestorov, S. Abiteboul, and R. Motwani. Inferring structure in semistructured data. SIGMOD Record, 26(1):54-66, March.
[18] M. Pazzani, J. Muramatsu, and D. Billsus. Syskill & Webert: Identifying interesting web sites. Proc. of the 13th Nat. Conf. on Artificial Intelligence, AAAI 96, pages 54-61.
[19] E. Selberg and O. Etzioni. The metacrawler architecture for resource aggregation on the web. IEEE Expert.
[20] Webcrawler web site.
[21] Yahoo web site.
[22] S. Yamada and Y. Osawa. Planning to guide concept understanding in the WWW. AAAI Workshop on AI and Information Integration, 1998.