
American Journal of Engineering Research (AJER) 2013
e-ISSN: 2320-0847  p-ISSN: 2320-0936
Volume-2, Issue-4, pp-39-43
www.ajer.us

Research Paper    Open Access

A COMPARATIVE STUDY OF BYG SEARCH ENGINES

Kailash Kumar 1, Vinod Bhadu 2
1 Research Scholar, Suresh Gyan Vihar University, Jaipur
2 Technology Lead, Infosys Limited, Chandigarh

Abstract: This paper compares the retrieval effectiveness of the Bing, Yahoo and Google (BYG) search engines. The precision and relative recall of each search engine were considered for evaluating their effectiveness. General queries were tested. The results of the study showed that the precision of Google was higher than that of the other two search engines, and that Yahoo had better precision than Bing.

Keywords: Internet, search engines, Google, Yahoo, Bing, precision, relative recall.

I. INTRODUCTION

The Web can be used as a quick and direct reference for any type of information from all over the world. However, information found on the Web needs to be filtered and may include voluminous misinformation or non-relevant information. The Internet surfer may not be aware of the many search engines available for finding information on a topic quickly, and may use different search strategies. Finding useful information quickly on the Internet poses a challenge to both ordinary users and information professionals. Though the performance of currently available search engines has been improving continuously, with powerful search capabilities of various types, the lack of comprehensive coverage, the inability to predict the quality of retrieved results, and the absence of controlled vocabularies make it difficult for users to use search engines effectively. The use of the Internet as an information resource needs to be carefully evaluated, as no traditional quality standards or controls have been applied to the Web.
In this study, an attempt was made to assess the precision and relative recall of three major search engines, i.e., Bing, Yahoo and Google (BYG).

II. SEARCH ENGINES AND SEARCH QUERIES

Three search engines, namely Bing, Yahoo and Google (BYG), were considered to examine the precision of some selected search queries during March 10, 2013 to March 17, 2013. In order to retrieve relevant data from each search engine, the advanced search features of the search engines were used. Since many sites were retrieved by the search engines for each query, it was decided to select only the first 30 sites, as a user hardly goes beyond three to four pages of search results. Only results from India were selected for evaluation. A total of 15 queries from various disciplines were selected for the study (see Appendix 1).

III. PRECISION OF SEARCH ENGINES

After a search, the user is sometimes able to retrieve relevant information and sometimes retrieves irrelevant information. The ability to find the right information accurately is reflected in the precision value of the search engine (Shafi & Rather, 2005). In this paper, the search results retrieved by Google, Yahoo and Bing were scored using criteria adapted from earlier studies (Ding & Marchionini, 1996; Clarke & Willett, 1997): if a web page closely matched the subject matter of the search query, it was categorized as more relevant; if a web page was not closely related to the subject matter but contained some concepts relevant to the query, it was categorized as less relevant and given a score of 1; and a page unrelated to the query was categorized as irrelevant and given a score of 0.

If a web page consisted of a whole series of links, rather than the information required, it was categorized as useful for its links. These criteria enabled the calculation of the precision of the search engines for each search query using the formula:

    Precision = (sum of the scores of the sites retrieved by a search engine) / (total number of sites selected for evaluation)

3.1 Precision of Google

Google, being one of the most popular search engines on the Internet, was selected as one of the search engines for comparison. Google focuses on the link structure of the Web to determine relevant results and is representative of the variety of easy-to-use search engines. This study measured the relevance of the web sites retrieved for each search query. Only English pages were searched for each query, since web pages in other languages would be difficult to assess for relevancy. Since the number of search results retrieved was large, only the first 30 sites were selected for analysis. Around 70% of the results were found to be more relevant, 13.77% were less relevant, and only 7.77% were found to be irrelevant. A further 7.5% of the results did not contain any direct information but were useful through the links they provided, and only 0.44% of the results were either shown but deleted or pages no longer available.

3.2 Precision of Yahoo

Yahoo is another popular and well-known Internet search engine. The same set of search queries and the same methodology were used with Yahoo. Yahoo is the second largest search engine on the web by query volume, at 6.42%, after its competitor Google at 85.35%. Around 67% of the results were found to be more relevant, 16.88% were less relevant, and only 8.0% were found to be irrelevant.
A further 6.8% of the results did not contain any direct information but were useful through the links they provided, and only 1.0% of the results were either shown but deleted or pages no longer available.
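As a concrete sketch of the two measures used in this study, the precision formula of Section III and the relative recall formula of Section IV can be computed as follows. The relevance scores and hit counts below are hypothetical illustrations, not the paper's data.

```python
def precision(scores):
    # Precision = sum of relevance scores of the retrieved sites
    #             / total number of sites selected for evaluation.
    return sum(scores) / len(scores)

def relative_recall(hits_one_engine, hits_all_engines):
    # Relative recall = sites retrieved by one engine
    #                   / sum of sites retrieved by all engines (BYG).
    return hits_one_engine / hits_all_engines

# Hypothetical scores for the first 30 results of one query on one engine
# (1 = page with some relevant concepts, 0 = irrelevant page):
scores = [1] * 21 + [0] * 9
print(precision(scores))  # 0.7

# Hypothetical total hit counts for one query on the three engines:
hits = {"Google": 920_000, "Yahoo": 61_000, "Bing": 16_000}
total = sum(hits.values())
for engine, n in hits.items():
    print(engine, round(relative_recall(n, total), 3))
```

Note that relative recall is computed from raw hit counts, so an engine with a much larger index dominates it regardless of how relevant its results are; that is why precision is evaluated separately.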

3.3 Precision of Bing

Bing is yet another popular and well-known Internet search engine. The same set of search queries and the same methodology were used with Bing. Bing was unveiled by Microsoft CEO Steve Ballmer on May 28, 2009 at the All Things Digital conference in San Diego, for release on June 1. Notable changes included the listing of search suggestions as queries are entered and a list of related searches (called the "Explore pane") based on semantic technology from Powerset, which Microsoft purchased in 2008. Around 64% of the results were found to be more relevant, 15.77% were less relevant, and only 9.0% were found to be irrelevant. A further 10.0% of the results did not contain any direct information but were useful through the links they provided, and only 0.66% of the results were either shown but deleted or pages no longer available.

IV. RELATIVE RECALL OF BING, YAHOO AND GOOGLE

Recall is the ability of a system to retrieve all or most of the relevant documents in the collection (Shafi & Rather, 2005). The relative recall can be calculated using the following formula:

    Relative recall = (total number of sites retrieved by a search engine) / (sum of sites retrieved by all search engines (BYG))

The relative recall of Bing, Yahoo and Google (BYG) for general queries was calculated and is presented in Table 4. The overall relative recall of Google was 0.922, for Yahoo it was 0.061, and for Bing it was 0.016. Figure 1 shows the relative recall of Bing, Yahoo and Google (BYG) for general queries. In the case of Google, search query 5 had the highest relative recall value of 0.98, and search query 3 the least, with a value of 0.31. In the case of Yahoo, the highest relative recall was for search query 3 (0.34), with the least relative recall for search query 5 (0.005). Similarly, for Bing the highest relative recall value, 0.34, was for search query 3 and the lowest, 0.005, was for search query 5.

Fig 1: Relative Recall of Google, Yahoo and Bing Search Engines

V. CONCLUSION

The World Wide Web, with its short history, has experienced significant changes. While the earlier search engines were established based on traditional database and information retrieval methods, many other algorithms and methods have since been added to them to improve their results. The precision value varies

American Journal of Engineering Research (AJER) 2013 among the search engines depending on the database size. The gigantic size of the Web and vast variety of the users' needs and interests as well as the potential of the Web as a commercial market have brought about many changes and a great demand for the development of better search engines. The present study estimated the precision of Google, Yahoo and Bing. The results of the study also showed that the precision of Google was high as compared to Yahoo and Bing and Yahoo has better precision than Bing. It was observed that Google, Yahoo and Bing showed diversity in their search capabilities, user interface and also in the quality of information. However these two search engines retrieved comparatively more relevant sites or links as compared to irrelevant sites. Google utilized the Web graph or link structure of the Web to become one of the most comprehensive and reliable search engines. This study provided evidence that the Google was able to give better search results with more precision and more relative recall as compared to Yahoo which would explain why it is the most widely used search engine for the Internet. Appendix 1 List of Queries REFERENCES [1]. Clarke, S., & Willett, P. (1997). Estimating the recall performance of search engines. ASLIB Proceedings, 49 (7), 184-189. [2]. Chu, H., & Rosenthal, M. (1996). Search engines for the World Wide Web: A comparative study and evaluation methodology. Proceedings of the ASIS 1996 Annual Conference, 33, 127-35. [3]. Ding, W., & Marchionini, G. (1996). A Comparative study of the Web search service performance. Proceedings of the ASIS 1996 Annual Conference, 33, 136-142 [4]. Leighton, H. (1996). Performance of four WWW index services, Lycos, Infoseek, Webcrawler and WWW Worm. Retrieved from http://www.winona.edu/library/webind.htm [5]. Shafi, S. M., & Rather, R. A. (2005). 
Precision and recall of five search engines for retrieval of scholarly information in the field of biotechnology. Webology, 2(2). Retrieved from http://www.webology.ir/2005/v2n2/a12.html
[6]. Wu, G., & Li, J. (1999). Comparing Web search engine performance in searching consumer health information: Evaluation and recommendations. Bulletin of the Medical Library Association, 87(4), 456-461.

Kailash et al. / IJAIR, Vol. 2, Issue 7, ISSN: 2278-7844

SEARCH ENGINE OPTIMIZATION USING QUERY PARSING

Kailash Kumar 1, Vinod Bhadu 2
1 Research Scholar, Suresh Gyan Vihar University, Jaipur. Email: maheshwari_kailash2002@rediffmail.com
2 Technology Lead, Infosys Limited, Chandigarh

Abstract: The Internet has become a vast information source in recent years, and searching for a desired piece of information or document on the Internet is one of the most important issues. Every search engine has its own database that defines the set of documents it can search. No single search engine is capable of searching all kinds of data on the Internet. In modern practice, an interface is created that calls multiple search engines in order to satisfy a user's query. Calling every search engine for each user request is not feasible, as it increases cost: network traffic grows when the query is sent to many search engines, some of which may be useless for it. This paper suggests an accurate and fast search mechanism that addresses this problem using query parsing techniques.

Keywords: Meta search engine, parsing, information retrieval, search engines, network traffic.

I. INTRODUCTION

Nowadays, the Internet has become an important source of information, and many search engines have been created to find desired data on it. Every search engine has its own database that defines the set of documents that can be searched by that engine. Generally, an index entry is created in the database for each document and stored in the search engine, where it is used to identify the document in the database. It is therefore very difficult for any single search engine to answer every kind of query. There are two types of search engines in the market, namely general purpose search engines and special purpose search engines.
General purpose search engines provide searching capabilities for all kinds of documents on the Internet, while special purpose search engines focus on documents confined to a specific domain. Google, AltaVista and HotBot are examples of general purpose search engines, whereas there are millions of special purpose search engines currently available on the Internet. The number of web pages on the Internet is increasing at a very high rate. Therefore, it is very difficult to cover all kinds of data in a single search engine, for several reasons. First, the processing power and storage capabilities may not scale to the rapidly increasing and virtually infinite amount of data. Second, collecting all kinds of data on the Internet and keeping them reasonably up to date is not an easy task, if it is possible at all; it is a time-consuming process for the crawlers that search engines use to collect data automatically. An alternative approach is to use multilevel search engines over the Internet. At the lower level, local search engines are used, which are grouped at a higher level based on the relatedness of their databases; these groups are in turn grouped together to form the next higher level, and so on. At the top level, there is only one search engine, called the Meta search engine. Whenever the Meta search engine receives a request from the user in the form of a query, it passes the request to the appropriate (Meta) search engines in depth-first search order. This approach has its own advantages. First, the response time of query processing is substantially reduced, because user queries are evaluated against smaller databases in parallel. Second, the index of a local search engine is modified only when the documents in its database are updated, i.e., index updating is localized. Third, local information can be collected more conveniently, easily and in a more timely manner.
Last, but not least, the memory space and processing power of each local search engine can be managed easily. A schematic diagram of the above scheme is shown in Fig 1. When a Meta search engine calls many search engines, there may be a serious problem of inefficiency. For example, for a given query, only a small fraction of all search engines may contain useful documents. As a result, if every search engine is called for each query, there may be a substantial loss of network bandwidth and increased network traffic. Moreover, the local resources of each search engine are wasted when useless databases are searched. The most appropriate solution to this

problem is to first identify those search engines that are most likely to provide useful results for a given user query, and then pass the query to the appropriate search engines for the desired documents. But the question is how to identify potentially useful search engines. The solution to the above problem is to rank all the underlying databases in decreasing order of their usefulness for each query, using metadata that describes the contents of each database. Generally, the ranking is based on parameters that ordinary users may not be able to utilize to fit their needs. The current approach can tell the user, to some degree of accuracy, which search engine is likely to be the most useful for a given query, which the second most useful, and so on. Although such a ranking scheme is useful, it cannot say anything about the actual usefulness of any particular search engine to the user.

Fig 1: A Typical Meta Search Engine (the Meta search engine passes the user's query to search engines 1 through m and collects their results)

The usefulness of a search engine for a given user query is measured in terms of two parameters: first, the number of documents (NoDoc) in the database of the search engine that are likely to be useful for the query, that is, whose similarities with the query, as measured by a certain global similarity function, are higher than a specified threshold; and second, the average similarity (AvgSim) of these potentially useful documents. It is important to keep in mind that the global similarity function may differ from the local similarity function used by a local search engine. These two parameters together describe the usefulness of a search engine for a given user query. Mathematically, they can be defined as follows:

    NoDoc(q, T, D) = number of documents d in D with sim(q, d) > T

    AvgSim(q, T, D) = ( sum of sim(q, d) over all d in D with sim(q, d) > T ) / NoDoc(q, T, D)

where T is a threshold value, D is the database of the search engine, and sim(q, d) is the global similarity between the query q and a document d in the database D.
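Assuming the global similarities sim(q, d) for a database are already available as a list of numbers, the two usefulness parameters can be sketched as:

```python
def no_doc(similarities, threshold):
    # NoDoc: how many documents have similarity with the query above T.
    return sum(1 for s in similarities if s > threshold)

def avg_sim(similarities, threshold):
    # AvgSim: average similarity of the documents counted by NoDoc.
    useful = [s for s in similarities if s > threshold]
    return sum(useful) / len(useful) if useful else 0.0

# Hypothetical similarity values for one search engine's database:
sims = [0.9, 0.8, 0.6, 0.3, 0.1]
print(no_doc(sims, 0.5))   # 3 documents above the threshold T = 0.5
print(avg_sim(sims, 0.5))  # (0.9 + 0.8 + 0.6) / 3
```

In practice the Meta search engine cannot see the individual similarities, which is exactly why the paper goes on to estimate these two quantities from a compact database representative.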
It is very important for users to determine which search engines to use and how many documents to retrieve from each. For instance, if the user can forecast that a highly ranked search engine with a large database has very few useful documents, and that searching such a large database is not cost effective, then the user may decide not to use that search engine. The cost of using such a search engine can also be reduced by limiting the number of documents to be returned to the number of useful documents in the engine. In this paper, a new measure is proposed that is easy to understand and informative, and that characterises the usefulness of a search engine with respect to a user's query. Next, a subrange-based estimation method is proposed to recognize the search engines to use for a given query and to estimate the usefulness of each search engine for the query.

II. RELATED WORK

In order to discover the search engines useful for a query, some attributes of each search engine's database must be stored in the Meta search engine. Such information is known as the representative of a search engine. Based on the representatives used, different methods can be developed for identifying useful search engines. Several Meta search engines are in operation, using a range of methods to recognize potentially useful search engines [3], [5], [7], [8], [9], [10]. However, the database representatives used in most Meta search engines cannot be used to estimate the number of globally most similar documents in each search engine [1], [7], [8], [10]. In addition, the measures that are used by these Meta search

engines to rank the search engines are not easy to understand. As a result, separate methods have to be used to convert these measures into the number of documents to retrieve from each search engine. Another shortcoming of these measures is that they are independent of the similarity threshold. As a result, a search engine will always be ranked the same, regardless of how many documents are desired, as long as the databases of the search engines are fixed. This conflicts with the following situation: for a given query, a search engine may contain many moderately similar documents but very few or zero highly similar documents. In this case, a good measure should rank the search engine high if a large number of moderately similar documents is desired, and rank it low if only highly similar documents are desired. A probabilistic model for distributed information retrieval is proposed in [2]; the method is more suitable in an environment where previously retrieved documents have been identified as either relevant or irrelevant. In gGloSS [5], a database of m distinct terms is represented by m pairs (f_i, w_i), where f_i is the number of documents in the database that contain the i-th term and w_i is the sum of the weights of the i-th term over all documents in the database. The usefulness of a search engine with respect to a given query is defined as the sum of all document similarities with the query that are greater than a threshold. A method is proposed in [8] to estimate the number of useful documents in a database for the binary and independent case: each document d is represented as a binary vector such that a 0 or 1 at the i-th position indicates the absence or presence of the i-th term in d, and the occurrences of terms in different documents are assumed to be independent. A significant amount of information is lost when documents are represented by binary vectors.
As a consequence, these methods are rarely used in practice. The estimation method in [6] assumes term weights to be non-binary.

III. USEFULNESS ESTIMATION

In this section, the basic method for estimating the usefulness of a search engine is described; it allows the term weights to be any non-negative real numbers. The basic assumptions used in this method are: the distributions of the occurrences of the terms in the documents are independent, and, for a given database, all documents containing a term have the same weight for that term. This basic method can estimate the usefulness of a search engine very accurately. Next, a subrange-based statistical method is described which eliminates the second assumption.

IV. BASIC METHOD

Consider a database D of a search engine with m distinct terms. Each document d in the database can be represented as a vector d = (d_1, ..., d_m), where d_i is the weight of the i-th term t_i in representing the document, 1 <= i <= m. Similarly, a query can be represented as q = (q_1, ..., q_m), where q_i is the weight of the i-th term in the query; if a term does not appear in the query, its corresponding weight is zero. The similarity between query and document can be defined as the dot product of their respective vectors:

    sim(q, d) = q . d = q_1*d_1 + q_2*d_2 + ... + q_m*d_m

In this method, the database D is represented as m pairs (p_i, w_i), where p_i is the probability that term t_i appears in a document of D and w_i is the average weight of t_i in the set of documents containing it. For a given query, this database representative is used to estimate the usefulness of database D. Consider the following generating function, in which X is a dummy variable:

    ( p_1 * X^(w_1*q_1) + (1 - p_1) ) * ( p_2 * X^(w_2*q_2) + (1 - p_2) ) * ... * ( p_m * X^(w_m*q_m) + (1 - p_m) )

If the terms are independent and the weight of term t_i, whenever present in a document, is the w_i given in the database representative, then the coefficient of X^s in the above function is the probability that a document in D has similarity s with q.

V.
SUBRANGE-BASED ESTIMATION METHOD

In the basic method it was assumed that all documents containing a term have the same weight for that term. In reality this is not so: different documents containing a term may have different weights for it. This problem is overcome in the subrange-based statistical method. For a term t_i, let w_i be the average and s_i the standard deviation of the weights of t_i in the set

Kailash et al./ IJAIR Vol. 2 Issue 7 ISSN: 2278-7844 of documents containing the term. Let be the probability that the term appears in a document in the database. Let be the term in a specific query then the following equation is included in the probability generating function: Where, is the weight of the term in the user query. The above equation assumes that the term has uniform weight of for all documents containing the term while in reality the term weights may have nonuniform distribution among the documents having the term. Let the weights of the terms be Where, the number of documents having the term and k = p * n and the total number of documents in the database. Now, suppose that we partition the weight range of into four subranges, each containing 25 percent of the term weights, as follows. The first subrange contains the weights from to, where the second subrange contains the weights from to, where ; the third subrange contains the weights from to, where, and the last subrange contains weights from to. In the first subrange, the median is the weight of the term weights in the subrange and is, where, similarly, the median weights in the second, the third, and the fourth subranges have median weights, and respectively, where, and Then, the distribution of the term weights of may be approximated by the following distribution: The term has a uniform weight of for the first 25 percent of the documents having the term, another uniform weight of for the next 25 percent of the documents, another uniform weight of for the next 25 percent of documents and another uniform weight of for the last 25 percent of documents. 
With the above weight approximation, for a query containing term t_i, the polynomial above can be replaced in the generating function by the following polynomial:

    (p_i/4) * X^(m_i1*q_i) + (p_i/4) * X^(m_i2*q_i) + (p_i/4) * X^(m_i3*q_i) + (p_i/4) * X^(m_i4*q_i) + (1 - p_i)

since 25 percent of the documents containing term t_i are assumed to have weight m_ij for the term, for each j = 1, ..., 4. Essentially, the new polynomial is obtained from the original one by decomposing the probability that a document contains the term into four probabilities, corresponding to the four subranges. It is important to note that the subrange-based method needs to know the standard deviation of the weights of each term. As a result, a database with m terms is now represented as m triplets (p_i, w_i, s_i), where p_i is the probability that term t_i appears in a document of the database, w_i is the average weight of t_i in all documents containing it, and s_i is the standard deviation of those weights. Furthermore, if the maximum normalized weight of each term is used for the highest subrange, then the database representative will contain quadruplets (p_i, w_i, s_i, mnw_i), with mnw_i being the maximum normalized weight of term t_i. The experimental results indicate that the maximum normalized weight is a critical parameter that can drastically improve the estimation accuracy of search engine usefulness.

VI. ISSUES ON APPLICABILITY

If the representative of a database used by an estimation method is large relative to the database itself, then the estimation method will have poor scalability, as it will be difficult to scale to thousands of text databases. Suppose each term occupies four bytes and each number (probability, average weight, standard deviation, and maximum normalized weight) also occupies four bytes. For the subrange-based estimation method, the probabilities, average weights, standard deviations, and maximum normalized weights of all m terms are stored in the database representative, resulting in a total storage overhead of (4 + 4 * 4) * m = 20m bytes for a database with m distinct terms.
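As a minimal sketch of the basic generating-function method of Section IV, the product can be expanded term by term to obtain the distribution of document similarities, from which NoDoc and AvgSim are estimated. The representative pairs and query weights below are hypothetical; the subrange refinement of Section V would simply replace each term's single (probability, average weight) branch with four quartile branches.

```python
def similarity_distribution(representative, query):
    # Expand prod_i ( p_i * X**(w_i * q_i) + (1 - p_i) ).
    # Returns {similarity value: probability that a random document
    # of the database has that similarity with the query}.
    dist = {0.0: 1.0}
    for (p, w), q in zip(representative, query):
        if q == 0 or p == 0:
            continue  # the term contributes nothing to the dot product
        new = {}
        for sim, prob in dist.items():
            # branch 1: the term is absent from the document
            new[sim] = new.get(sim, 0.0) + prob * (1 - p)
            # branch 2: the term is present, adding w * q to the similarity
            s = round(sim + w * q, 10)
            new[s] = new.get(s, 0.0) + prob * p
        dist = new
    return dist

def estimate_usefulness(representative, query, n_docs, threshold):
    # Estimated NoDoc and AvgSim for a database of n_docs documents.
    dist = similarity_distribution(representative, query)
    useful = {s: p for s, p in dist.items() if s > threshold}
    p_useful = sum(useful.values())
    no_doc = n_docs * p_useful
    avg = sum(s * p for s, p in useful.items()) / p_useful if useful else 0.0
    return no_doc, avg

# Two query terms; for each, (probability of appearing, average weight):
rep = [(0.5, 2.0), (0.2, 1.0)]
q = [1.0, 1.0]
print(estimate_usefulness(rep, q, n_docs=100, threshold=1.5))
# possible similarities are 0, 1, 2 and 3; only 2 and 3 exceed T = 1.5,
# giving roughly 50 expected documents with average similarity about 2.2
```

The dictionary-based expansion keeps only the distinct similarity values that actually occur, which is what makes the polynomial product tractable for queries with few terms.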
If the number of search engines is very large, the representatives can be clustered to form a hierarchy of representatives. Each query is first compared against the highest-level representatives, and only those representatives whose ancestor representatives have been estimated to contain a large number of very similar documents are examined further. As a result, most database representatives will not be compared against the query. To obtain an accurate representative of a database, we need to know the following information: first, the number of documents in the database; second, the
2013 IJAIR. ALL RIGHTS RESERVED 520

document frequency of each term in the database (i.e., the number of documents in the database that contain the term); and third, the weight of each term in each document in the database. The first two quantities are needed to compute the probabilities, and the third is needed to compute the average weights, maximum normalized weights, and standard deviations. The first two can be obtained easily. For example, when a query containing a single term is submitted to a search engine, the number of hits returned is the document frequency of that term. In the Internet environment, however, it may not be feasible to expect a search engine to provide the weight of each term in each document it indexes. The following techniques can be used to obtain the average term weights, their standard deviations, and the maximum normalized term weights. 1. A sampling technique can be used to estimate the average weight and the standard deviation of each term. When a query is submitted to a search engine, a set S of documents is returned as the result of the search. For each term t appearing in S and each document d in S, the term frequency of t in d (i.e., the number of times t appears in d) can be computed, and from it the weight of term t in document d. If the weights of t in a reasonably large number of documents can be computed, then an approximate average weight and an approximate standard deviation for term t can be obtained. Since the documents returned for each query may contain many different terms, this estimation can be carried out for many terms at the same time. 2. The maximum normalized weight of each term t can be found directly, with respect to the global similarity function used in the metasearch engine, as follows: submit t as a single-term query to the local search engine, which retrieves documents according to its local similarity function.
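The sampling technique in item 1 can be sketched as follows. Treating the term frequency as the raw weight and dividing by the document's vector length for normalization are assumptions made here for concreteness, since the text does not fix a particular weighting formula:

```python
import math
from collections import defaultdict

def sampled_term_statistics(sample_docs):
    """Estimate per-term statistics from a sample of returned documents.

    sample_docs: each document given as a list of its terms.
    Returns {term: (avg_weight, std_dev, max_normalized_weight)}, where
    the weight of t in d is taken to be its term frequency in d and the
    normalized weight divides by the document's vector length.
    """
    raw = defaultdict(list)         # term -> raw weights in docs containing it
    normalized = defaultdict(list)  # term -> normalized weights
    for doc in sample_docs:
        if not doc:
            continue
        tf = defaultdict(int)
        for term in doc:
            tf[term] += 1
        length = math.sqrt(sum(f * f for f in tf.values()))
        for term, f in tf.items():
            raw[term].append(f)
            normalized[term].append(f / length)
    stats = {}
    for term, ws in raw.items():
        avg = sum(ws) / len(ws)
        std = math.sqrt(sum((w - avg) ** 2 for w in ws) / len(ws))
        stats[term] = (avg, std, max(normalized[term]))
    return stats
```

Because each returned document contributes weights for all of its terms, a single sample of documents updates the statistics of many terms at once, as the text notes.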
VII. CONCLUSIONS

The paper introduces the usefulness measure of a search engine, which is intuitive and easily understood by users. A statistical method is presented to estimate the usefulness of a given search engine with respect to each query. Accurate estimation of the usefulness measure allows a metasearch engine to send queries only to the appropriate local search engines to be processed, which in turn saves both the communication cost and the local processing cost substantially. The estimation method has the following properties: 1. The estimation makes use of the number of documents desired by the user (or the retrieval threshold), unlike some other estimation methods, which rank search engines without using this information. 2. It guarantees that the search engines containing the most similar documents are correctly identified when the submitted queries are single-term queries. Internet users submit a high percentage of such short queries, and all of them can be sent to the correct search engines to be processed.
