Promoting Agriculture Knowledge via Public Web Search Engines: An Experience by an Iranian Librarian in Response to Agricultural Queries

Sedigheh Mohamadesmaeil
Assistant Professor, Department of Library and Information Sciences, Science and Research Branch, Islamic Azad University, Tehran, Iran
m.esmaeili2@gmail.com

Saeed Ghaffari
Department of Library & Information Science, Payam Noor University, Qom, Iran
Ghaffari13@yahoo.com

Abstract. Although the Internet has already become a valuable resource for information retrieval, agricultural interest groups and users face important challenges in gaining extensive access to this information. A number of specialized search engines, directories and sites serve the agricultural subject domain on the web, but it appears that the major public search engines may be able to answer scientific queries just as well. This paper aims to determine whether this holds in the agricultural domain by comparing and measuring five major public search engines in response to agricultural queries. To assess recall, precision and overlap, five well-used search engines (Google, Yahoo, AltaVista, AOL, ASK) were chosen, and six agricultural keywords selected from CAB (Intercropping, Carnivorous plants, Soil pollution, Plant viruses, Irrigation farming, Organic farming) were searched in each of them. The best search engines for answering these subject terms are identified: ASK achieved 63% precision and 22% recall and retrieved the most relevant agricultural documents, while Yahoo had 44% overlap with the other search engines and thus scored the highest rank on that measure. The findings reveal that major public search engines are a suitable alternative for finding agricultural information. The results can also help agricultural centers, agricultural information specialists and agricultural interest groups (users) seek better agricultural resources. This research investigates web search engines' recall, precision, and overlap using agricultural queries.

Originally presented at the 7th International Conference on Webometrics, Informetrics and Scientometrics (WIS) and 12th COLLNET Meeting, September 20-23, 2011, Istanbul Bilgi University, Istanbul, Turkey. Published Online First: 15 December 2012. http://www.tarupublications.com/journals/cjsim/cjsim.htm
COLLNET JOURNAL OF SCIENTOMETRICS AND INFORMATION MANAGEMENT (Online First)
It sheds light on the uniqueness of the top results retrieved by search engines. In other words, this paper demonstrates the significant value of search engines in web retrieval, even in expert areas.

Keywords: Agriculture Knowledge, Public Web Search Engines, Google, Yahoo, AltaVista, AOL, ASK, recall ratio, precision ratio, overlap, agricultural information, information retrieval

1. Introduction

Although the Internet is already a valuable resource for agricultural information retrieval, there are important challenges to be faced before users will have extensive access to this information (Aguillo, 2000 [1]). Searching is the main activity on the web, and the major search engines are the most frequently used tools for accessing information (Nielsen, 2005 [8]). Many commercial web search engines offer public access to web sites, including Yahoo!, MSN Search, Google and Northern Light. Web search engines can differ from one another in three ways: crawling reach, frequency of updates, and relevancy analysis. The performance capabilities and limitations of web search engines, and the differences between them, therefore constitute an important and significant research area (Spink et al., 2006 [18]). A large variety of search engines exists on the web, and it is essential for agricultural librarians, as information experts, and for agriculturalists to identify the best search engines for agricultural information retrieval, both to introduce them to agricultural researchers and to use them themselves. If search engines with a high recall ratio are identified, users (here, agricultural experts) can rely on them with confidence when searching the web. Thus, this paper aims to calculate the recall, precision and overlap of well-used popular search engines in response to agricultural expressions.

2. Related Studies

Since the mid-1990s, web searching has become a crucial area of study. Ding and Marchionini (1996) [9] investigated Infoseek, Lycos and Open Text for precision, duplication and degree of overlap using five complex queries; the first twenty hits, assessed for precision, showed that the best results were obtained from Lycos and Open Text. Leighton and Srivastava [13] searched fifteen queries on AltaVista, Excite, HotBot, Infoseek and Lycos, taking the first twenty hits for the evaluation of precision. Chu and Rosenthal [6] investigated AltaVista, Excite and Lycos for their search capabilities and precision; using ten search queries of varying complexity and evaluating the first ten results for relevance, they found that AltaVista outperformed both Excite and Lycos in search facilities and retrieval performance. Clarke and Willett [7] searched thirty queries of varying nature on AltaVista, Excite and Lycos and obtained the best results in terms of precision, recall and coverage from AltaVista. Bar-Ilan [3] investigated six search
engines using a single query, "Erdos"; all 6,681 retrieved hits were examined for precision, overlap and an estimated recall, and the study reports that no search engine has high recall. Jansen et al. [12], Spink et al. [17], and Spink and Jansen [16] highlight key searching trends from 1997 to 2004, including that most web users do not enter many queries during a search session and view few results pages. Link analysis has also developed as a major web research area (Thelwall [19]). Cheney and Perry [5] compare the relative sizes of the Yahoo! and Google indexes. Mowshowitz and Kawaguchi [15] examined how web search engine results differ from an expected distribution. Egghe and Rousseau [10] analyze IR system overlap from a mathematical perspective, and Bar-Ilan [2] discusses a statistical comparison of overlap in web search engines. Bar-Yossef and Gurevich [4] discuss methods for comparing web search engine indexes. Isfandyari Moghaddam [11] carried out a comparative study of the overlap of search results in meta search engines and their common underlying search engines. Mohammadesmaeil, Lafzighazi and Gilvari [14] carried out a study entitled "Comparing Search Engines and Meta Search Engines in Pharmaceutic Information Retrieval". The objective of that research was to measure the relevance of documents retrieved from search engines and meta search engines in the field of pharmacology; its findings help web users, especially pharmaceutical researchers and specialists, to know which search tools cover more pharmaceutical information and to use those tools to access the required information. That research used a descriptive survey method: six major search engines and meta search engines introduced by the website www.searchenginewatch.com as well-used internet search tools were chosen.
Pharmaceutical keywords were chosen from the Medical Subject Headings (MeSH), and the selected pharmacology terms were searched in each search engine. The first 10 results of each search engine were selected for the evaluation of recall and precision, and the data were analyzed with Excel. The findings showed that, among the search engines, Yahoo retrieved the most pharmaceutical documents and scored the highest rank (34%), while AOL had 62% precision and 21% recall and retrieved the most relevant pharmaceutical documents. Among the meta search engines, Dogpile retrieved the most pharmaceutical documents and scored the highest rank (22%), followed by Metacrawler (21%) and Info (19%), while Excite had 62% precision and 22% recall and retrieved the most relevant pharmaceutical documents. The researchers concluded that search engines and meta search engines are suitable tools for amateur and professional users alike and offer suitable search capabilities and facilities. Although search engines are useful for retrieving relevant documents, users are advised to repeat their searches in several search engines to reach the relevant documents among the vast sources available on the web. Briefly, these studies show that recall, precision and overlap are a considerable subject matter for surveys of web search engine performance. Most web search engine studies have used general query samples; in this research, we instead survey the recall, precision and overlap of five popular and well-used search engines (Google, Yahoo, AltaVista, AOL, ASK) in relation to six more precise, subject-specific agricultural keywords selected from CAB and searched in each of these search engines. The best search engines for answering these subject terms are then identified.
3. Methodology

In May 2010, a set of six queries relating to agricultural topics was chosen from the agricultural subject headings of CAB. Five major search engines, introduced at that time by the website www.searchenginewatch.com as well-used search tools, were selected. To determine their recall, precision and overlap, these five search engines were queried with the selected terms from 25th June to 10th July, 2010. The first 10 hits on each search engine's result pages for each of the six query terms were taken as the search population. The research elements are as follows:

Major search engines: Google, Yahoo, AltaVista, AOL, ASK.

Search queries: Intercropping, Carnivorous plants, Soil pollution, Plant viruses, Irrigation farming, Organic farming.

3.1. Estimation of Precision, Recall and Overlap

Determining recall and precision requires deciding which retrieved hits are relevant and which are not. Relevance judgments were made by the authors and scored as follows:

Exactly relevant: hits in which the requested terms appear in full among the title words of the retrieved document.
Relevant: hits in which compositions of the stems of the requested terms appear in the title words of the retrieved document.
Partly relevant: hits in which part of the requested terms is combined, as a prefix or suffix, into a word in the title of the retrieved document.
Not relevant: hits in which none of the requested terms appears in the title of the retrieved document.

Exactly relevant, relevant and partly relevant hits are all considered relevant and scored 1; not-relevant hits are scored 0. Precision is the fraction of search outputs that are relevant for a particular query; its calculation therefore requires knowledge of the relevant and non-relevant hits in the evaluated set of documents (Clarke & Willett, 1997).
Thus it is possible to calculate the absolute precision of each search engine, which provides an indication of the relevance of the system. In the context of the present study, precision is defined as:

Precision = (sum of the relevance scores of scholarly documents retrieved by a search engine) / (total number of results evaluated)
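The title-based scoring scheme and the precision measure above can be sketched in code. The following is a minimal illustration, not the authors' implementation; the matching rules are simplified assumptions standing in for the manual judgments described in Section 3.1:

```python
# Minimal sketch of title-based relevance scoring and precision.
# A hit scores 1 when every query term appears in the title, either
# as a whole title word ("exactly relevant") or inside a longer word
# ("relevant"/"partly relevant"), and 0 otherwise ("not relevant").

def relevance_score(query: str, title: str) -> int:
    """Score one retrieved hit against the query using only its title."""
    title_lower = title.lower()
    terms = query.lower().split()
    if all(t in title_lower.split() for t in terms):
        return 1  # every term appears as a full title word
    if all(t in title_lower for t in terms):
        return 1  # terms appear inside longer title words
    return 0      # some requested term is absent from the title

def precision(query: str, titles: list[str]) -> float:
    """Precision = sum of relevance scores / number of results evaluated."""
    scores = [relevance_score(query, t) for t in titles]
    return sum(scores) / len(scores)

hits = [
    "Intercropping of maize and beans in semi-arid zones",  # scores 1
    "Yield gains from intercropping systems",               # scores 1
    "Annual report of the ministry of roads",               # scores 0
]
print(precision("intercropping", hits))  # 2 relevant of 3 -> about 0.67
```

In the study itself the scores came from the authors' manual judgments; the automatic matcher here only mimics the spirit of those rules.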
Table 1: Total number of relevant scholarly documents retrieved by each of the five search engines (agricultural information retrieval)

Subject Terms        Google    Yahoo     AltaVista   AOL      ASK      Total
Intercropping         230000    542000    541000      56600    48800    1418400
Carnivorous plants    384000   1690000   1660000      62400    53900    3850300
Soil pollution        504000   1050000   1050000     118000    99900    2821900
Plant viruses         624000    793000    796000      64500    55600    2333100
Irrigation farming     49600     84000    116000       6760     5840     262200
Organic farming      2060000   8960000   9010000     410000   561500   21001500
Total                3851600  13119000  13173000     718260   825540   31687400

Recall, on the other hand, is the ability of a retrieval system to obtain all or most of the relevant documents in the collection; it therefore requires knowledge not only of the relevant documents that were retrieved but also of those that were not (Clarke & Willett, 1997). The relative recall value is thus defined as:

Relative recall = (total number of relevant scholarly documents retrieved by a search engine) / (sum of scholarly documents retrieved by all five search engines)

To calculate the overlap of the above search engines, each keyword was searched in each search engine, and the resulting lists were compared with each other. Overlap is thus defined as:

Overlap = (total number of results a search engine has in common with the other search engines) / (number of keywords × number of results × number of other search engines)

3.2. Results

The mean precision and relative recall of the selected search engines for retrieving agricultural information are presented in Table 2. Comparing the mean precision, ASK scored the highest rank (63%), followed by Yahoo (61%) and AltaVista (60%), while AOL received the lowest precision (56%) (Figure 2). Comparing the corresponding mean relative recall values, ASK has the highest recall (22%), followed by Yahoo (21%) and then AltaVista and Google (20%), while AOL received the lowest recall (19%) (Figure 3).
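The relative recall and overlap measures defined in Section 3.1 can be sketched as follows. This is a toy illustration with hypothetical engines and URL sets, not the study's data, and the per-query overlap shown is a simplified version of the formula above, which additionally averages over all keywords:

```python
# Toy illustration of relative recall and overlap for a single query.
# `results` maps each hypothetical engine to the set of URLs it
# returned; the study's own overlap formula also normalizes over the
# number of keywords and the number of other search engines.

results = {
    "EngineA": {"u1", "u2", "u3", "u4"},
    "EngineB": {"u2", "u3", "u5"},
    "EngineC": {"u6"},
}

def relative_recall(engine: str, results: dict) -> float:
    """One engine's hit count over the hits of all engines combined."""
    total = sum(len(hits) for hits in results.values())
    return len(results[engine]) / total

def overlap(engine: str, results: dict) -> float:
    """Fraction of an engine's results also returned by another engine."""
    others = set().union(*(h for e, h in results.items() if e != engine))
    return len(results[engine] & others) / len(results[engine])

print(relative_recall("EngineA", results))  # 4 of 8 hits -> 0.5
print(overlap("EngineA", results))          # u2 and u3 shared -> 0.5
```

Note that relative recall is measured against the pooled output of all engines rather than against the (unknowable) set of all relevant web documents, which is why the reported recall values sum to roughly 100% across the five engines.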
Table 2: Mean precision and relative recall of the search engines during 2010

Search engine   AltaVista   Yahoo   Google   ASK   AOL
Precision       60%         61%     58%      63%   56%
Recall          20%         21%     20%      22%   19%

Figure 1: Percentage of the total number of relevant scholarly documents retrieved by each of the five search engines (agricultural information retrieval)

Figure 2: Precision of the five search engines in agricultural information retrieval
Figure 3: Recall of the five search engines in agricultural information retrieval

Table 3: Percentage of overlap of the five search engines in agricultural information retrieval

Search engine   ASK   AOL   AltaVista   Yahoo   Google
Overlap         22%   40%   43%         44%     38%

The percentage of overlap of the five search engines in agricultural information retrieval is shown in Table 3.

4. Conclusion

In this study ASK ranks as the top search engine, with the highest percentage of relevant returns (63% precision and 22% recall) and overall good performance in the currency of its sources of information. Also, Yahoo had 44% overlap with the other search engines, so Yahoo scored the highest rank on that measure. This research has also produced significant findings for all web users and web researchers, especially agriculturists. The study shows that, even with the best search engines, only about half of the retrieved results are relevant. A major result of our study is that the first-page results returned by the five major search engines included in this study are different from one
another. Search engines seldom agree on the first page of results returned for any query; that is, there is little agreement among search engines on what the best results for a given query are. Nonetheless, the major search engines are suitable tools for finding agricultural information. The huge number of sources retrieved from the web must be examined and carefully evaluated, since users cannot predict the quality and timeliness of search results. Even so, searching the web does enable users to discover highly current information: agricultural conferences and products, current statistics, news, services and full-text articles.

References

[1] Aguillo, Isidro, A new generation of tools for search, recovery and quality evaluation of World Wide Web medical resources, Management in Medicine, Vol. 14(4), 2000, pp. 240-248.
[2] Bar-Ilan, J., Comparing rankings of search results on the web, Information Processing & Management, Vol. 41, 2005, pp. 1511-1519.
[3] Bar-Ilan, J., On the overlap, the precision and estimated recall of search engines: A case study of the query "Erdos", Scientometrics, Vol. 42(2), 1998, pp. 207-208.
[4] Bar-Yossef, Z., Gurevich, M., Random sampling from a search engine's index, Proceedings of the 2006 World Wide Web Conference, 22-26 May 2006, Edinburgh, Scotland, 2006.
[5] Cheney, M., Perry, M., A comparison of the size of the Yahoo! and Google indices, 2005, available at: http://vburton.ncsa.uiuc.edu/indexsize.html
[6] Chu, H., and Rosenthal, M., Search engines for the World Wide Web: a comparative study and evaluation methodology, in: Proceedings of the ASIS 1996 Annual Conference, October 1996, Vol. 33, pp. 127-135. Retrieved August 19, 2003 from http://www.asis.org/annual-96/electronicproceedings/chu.html
[7] Clarke, S., and Willett, P., Estimating the recall performance of search engines, ASLIB Proceedings, Vol. 49(7), 1997, pp. 184-189.
[8] Sullivan, D., Nielsen NetRatings: search engine ratings,
in Search Engine Watch.
[9] Ding, W., and Marchionini, G., A comparative study of Web search service performance, in: Proceedings of the ASIS 1996 Annual Conference, October 1996, Vol. 33, pp. 136-142.
[10] Egghe, L., Rousseau, R., Classical retrieval and overlap measures satisfy the requirements for rankings based on a Lorenz curve, Information Processing and Management, Vol. 42(10), 2006, pp. 106-120.
[11] Isfandyari Moghaddam, Alireza; Parirokh, Mehri, A comparative study on overlapping of search results in metasearch engines and their common underlying search engines, Library Review, Vol. 55(5), 2006, pp. 301-306.
[12] Jansen, B. J., Spink, A., Saracevic, T., Real life, real users, and real needs: a study and analysis of user queries on the web, Information Processing & Management, Vol. 36(2), 2000, pp. 207-227.
[13] Leighton, H., Performance of four WWW index services: Lycos, Infoseek, Webcrawler and WWW Worm, 1996. Retrieved June 10, 2005 from http://www.winona.edu/library/webind.htm
[14] Mohammadesmaeil, S., Lafzighazi, E., Gilvari, A., Comparing search engines and meta search engines in pharmaceutic information retrieval, Health Information Management, Vol. 5(2), 2008.
[15] Mowshowitz, A., Kawaguchi, A., Measuring search engine bias, Information Processing and Management, Vol. 41, 2005, pp. 1193-1205.
[16] Spink, A., Jansen, B. J. (Eds), Web Search: Public Searching of the Web, Springer, Berlin, 2004.
[17] Spink, A., Jansen, B. J., Wolfram, D., Saracevic, T., From e-sex to e-commerce: Web search changes, IEEE Computer, Vol. 35(3), 2002, pp. 133-135.
[18] Spink, Amanda et al., Overlap among major web search engines, Internet Research, Vol. 16(9), 2006, p. 419.
[19] Thelwall, M., Link Analysis: An Information Science Perspective, Elsevier Academic Press, 2004.