Web Search
Web Usage in Client-Server Design
  A client (e.g., a browser) communicates with a server via HTTP
    Hypertext Transfer Protocol: a lightweight and simple protocol for asynchronously carrying a variety of payloads (text, images, audio, and video)
  A client sends an HTTP request to a web server by specifying a URL (uniform resource locator)
  Web pages are encoded in HTML (hypertext markup language)
  A browser can ignore what it does not understand
    A browser renders as much as it can and does not crash on incompatible features
  Publishing becomes unprecedentedly easy
J. Pei: Information Retrieval and Web Search -- Web Search Basics 2
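
To make the request step concrete, here is a minimal sketch of how a URL maps to the text of an HTTP GET request. The function name build_get_request and the example.com URL are illustrative assumptions, not part of the lecture; a real browser sends many more headers.

```python
from urllib.parse import urlsplit


def build_get_request(url: str) -> str:
    """Return the text of a minimal HTTP/1.1 GET request for `url`."""
    parts = urlsplit(url)
    # The request line carries the path (and query string), not the full URL.
    path = parts.path or "/"
    if parts.query:
        path += "?" + parts.query
    lines = [
        f"GET {path} HTTP/1.1",
        f"Host: {parts.netloc}",   # the server name goes in the Host header
        "Connection: close",
        "",                        # blank line ends the header section
        "",
    ]
    return "\r\n".join(lines)


# Hypothetical URL, used only for illustration.
request = build_get_request("http://www.example.com/index.html")
print(request)
```

The server answers with a status line, headers, and a payload such as an HTML page, which the browser then renders as far as it understands it.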
Making Web Info Discoverable
  Full-text index search engines
    AltaVista, Excite, and Infoseek
    Use keyword search interfaces supported by inverted indexes and ranking mechanisms
  Taxonomies populated with web pages in categories
    Yahoo!
    Allow users to browse through a hierarchical tree of category labels
    A convenient and intuitive approach in the early days of the web
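
The inverted-index idea behind full-text engines can be sketched in a few lines. This is a toy illustration under simplifying assumptions (whitespace tokenization, conjunctive matching, no ranking); the documents and function names are made up for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the set of ids of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


# Toy collection, invented for illustration.
docs = {
    1: "web search engines index pages",
    2: "taxonomies organize web pages",
    3: "search engines rank pages",
}
index = build_inverted_index(docs)
print(search(index, "search engines"))  # [1, 3]
print(search(index, "web pages"))       # [1, 2]
```

A real engine would also score and order the matching documents; that ranking step is omitted here.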
Drawbacks of Taxonomy Methods
  Accurately classifying web pages into taxonomy tree nodes is very costly and cannot scale up to the size of the web
    Low-quality web pages are not interesting at all to most users
  No standard taxonomies: the taxonomies in users' minds and those in editors' minds may be different
    Almost certain when the taxonomy trees are big, e.g., 1,000+ distinct nodes
  The popularity of taxonomies declined over time
  Taxonomies are good for building a knowledge base, though!
Problems in Purely Full-text Indexes
  Differences between books and web pages
    Number: a relatively small number of books on a specific topic versus a huge number of web pages on a query
    Quality: the average quality of books is much higher
    Length: books are considerably longer than web pages, so full-text based relevance is more reliable on books
  Most web pages are of low quality and uninteresting
    Finding all books relevant to a topic and letting the user select: feasible, since the number of highly related books is not big
    Finding all web pages relevant to a query and letting the user select: infeasible, since too many pages are related to a query
  Find the high-quality web pages
    Ideas: asking an expert, or finding authoritative web pages, which needs more information than just the full text, namely links as well
Static versus Dynamic Web Pages
  Static web pages: the content does not vary from one request to the next
    The content can still be updated from time to time, possibly frequently
    There are a finite number of static web pages
  Dynamic web pages: pages mechanically generated by an application server in response to a query to a database
    There are an infinite number of dynamic web pages
The Web Graph
  The static web: a directed graph consisting of static HTML pages together with the hyperlinks between them
    Each web page is a node, each hyperlink is a directed edge
  Anchor text: the text surrounding the origin of a hyperlink
  The web graph is not strongly connected
  The in-degrees of web pages follow a power-law distribution
    $\mathrm{Freq}(i) \propto 1 / i^{\alpha}$, where $\alpha \approx 2.1$
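
As a small illustration of the graph view, the sketch below counts in-degrees from an edge list and evaluates the unnormalized power-law weight 1 / i^alpha. The tiny graph and the function names are assumptions for the example; a graph this small cannot exhibit the power law itself.

```python
from collections import Counter


def in_degree_histogram(nodes, edges):
    """Return a Counter mapping in-degree -> number of pages with that in-degree."""
    in_degree = {node: 0 for node in nodes}
    for _source, target in edges:
        in_degree[target] += 1
    return Counter(in_degree.values())


def power_law_weight(i, alpha=2.1):
    """Unnormalized weight 1 / i**alpha of in-degree i (i >= 1)."""
    return 1.0 / i ** alpha


# Hypothetical four-page web: each pair (u, v) is a hyperlink from u to v.
nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("c", "b"), ("d", "b"), ("a", "c"), ("b", "d")]
histogram = in_degree_histogram(nodes, edges)
print(sorted(histogram.items()))  # [(0, 1), (1, 2), (3, 1)]

# Under the power law, in-degree 10 is far rarer than in-degree 1.
print(power_law_weight(10) / power_law_weight(1))
```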
The Bowtie Shape of the Web Graph
  SCC: strongly connected component
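
A rough sketch of how the bowtie regions can be computed from an edge list, assuming we already know one seed page inside the central SCC. The labels IN and OUT are the usual names for the two sides of the bowtie and do not appear in the text above; the graph and function names are invented for illustration.

```python
from collections import defaultdict, deque


def reachable(adjacency, start):
    """Return the set of nodes reachable from `start` by breadth-first search."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


def bowtie(edges, seed):
    """Split nodes into SCC, IN, OUT and OTHER relative to the SCC containing `seed`."""
    forward, backward, nodes = defaultdict(list), defaultdict(list), set()
    for source, target in edges:
        forward[source].append(target)
        backward[target].append(source)
        nodes.update((source, target))
    downstream = reachable(forward, seed)   # pages the seed can reach
    upstream = reachable(backward, seed)    # pages that can reach the seed
    core = downstream & upstream            # mutually reachable: the SCC
    return {
        "SCC": core,
        "IN": upstream - core,
        "OUT": downstream - core,
        "OTHER": nodes - upstream - downstream,
    }


# Hypothetical graph: in1 -> {a <-> b} -> out1, plus an unrelated link x -> y.
edges = [("in1", "a"), ("a", "b"), ("b", "a"), ("b", "out1"), ("x", "y")]
parts = bowtie(edges, seed="a")
print({name: sorted(members) for name, members in parts.items()})
```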
Advertising: Branding
  A company uses graphical banner advertisements on popular websites to give viewers a positive feeling about the company's brand
  Advertisements are shown alongside algorithmic search results
  Cost per mille (CPM): the cost to the company of having its banner advertisement displayed 1,000 times
  Cost per click (CPC): priced by the number of times an advertisement is clicked
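
The two pricing models differ only in what is counted, as this small arithmetic sketch shows. All rates and counts below are made-up numbers for illustration.

```python
def cpm_cost(impressions, rate_per_thousand):
    """Cost under CPM pricing: pay a fixed rate per 1,000 displays."""
    return impressions / 1000 * rate_per_thousand


def cpc_cost(clicks, rate_per_click):
    """Cost under CPC pricing: pay a fixed rate per click."""
    return clicks * rate_per_click


# Hypothetical campaign: 50,000 displays, of which 400 were clicked.
print(cpm_cost(50_000, 2.00))  # 100.0
print(cpc_cost(400, 0.25))     # 100.0
```

With these numbers the two schemes happen to cost the same; a lower click-through rate would make CPC cheaper for the advertiser, a higher one would favor CPM.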
Advertising: Sponsored Search
  Advertisers pay for users' clicks
  Goto: for each query term q, it accepted bids from companies that wanted their web pages shown for q, and returned the pages of all advertisers who bid for q, ordered by their bids
    If the user clicked a result, the corresponding advertiser paid Goto
  A popular advertising approach in search engines
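
A minimal sketch of the Goto-style scheme described above: advertisers bid on query terms, results are ordered by bid, and a click triggers a payment. The slide does not say how much is charged per click; this sketch assumes the advertiser pays its own bid. Advertiser names and bid amounts are hypothetical.

```python
def ranked_advertisers(bids, query):
    """Return the advertisers who bid on `query`, highest bid first."""
    offers = bids.get(query, {})
    return sorted(offers, key=lambda advertiser: offers[advertiser], reverse=True)


def charge_for_click(bids, query, advertiser):
    """Amount owed for one click (assumption: the advertiser pays its own bid)."""
    return bids[query][advertiser]


# Hypothetical bids per query term, in dollars per click.
bids = {
    "flowers": {
        "shop-a.example": 0.40,
        "shop-b.example": 0.75,
        "shop-c.example": 0.10,
    }
}
print(ranked_advertisers(bids, "flowers"))
print(charge_for_click(bids, "flowers", "shop-b.example"))
```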
Spamming, SEO and SEM
  If a page is ranked high by search engines, the page has a good opportunity to earn branding advertising revenue
  Paid inclusion: an owner pays to have her/his web page included in the search engine's index
  Spamming: deliberately manipulating content and links to make a page rank high in search engines
  Search engine optimization (SEO) and search engine marketing (SEM): understanding how search engines rank pages and how to allocate marketing campaign budgets to different keywords and to different sponsored search engines
  Click spam: clicks on sponsored search results that are not from bona fide search users
    For example, to exhaust the advertising budget of a competitor
Spamming Tricks
  Cloaking: returning different pages depending on whether the HTTP request comes from a search engine or a human user's browser
  Doorway pages: pages containing text and metadata carefully chosen to rank highly on selected search keywords
Categories of Search Queries
  Informational queries: seeking general information on a broad topic
    "What is panda?"
    Multiple web pages are needed to answer
  Navigational queries: seeking the website or home page of a single entity that the user has in mind
    "Air Canada": seeking the home page of Air Canada instead of any agents selling Air Canada airfare
    Precision at 1 is what matters
  Transactional queries: leading to transactions on the web, e.g., purchasing a product, downloading a file, or joining a social website
Index Size Estimation
  What percentage of the web is indexed by a search engine?
    There is an infinite number of dynamic web pages
  Given two search engines, what are the relative sizes of their indexes?
    A search engine can return a page that has not been fully or even partially indexed
    Search engines organize indexes in various tiers and partitions; not all indexed pages are examined on every search
  Rough estimation under an (unrealistic) assumption: there is a finite size for the web, from which each search engine chooses an independent, uniform subset to index
Capture-Recapture Method
  Let $x$ be the probability that a random page in $E_1$ is indexed by $E_2$
  Symmetrically, let $y$ be the probability that a random page in $E_2$ is indexed by $E_1$
  Since $x |E_1| \approx y |E_2|$, we have $|E_1| / |E_2| \approx y / x$
  [Figure: overlapping index sets $E_1$ and $E_2$; the overlap has size $|E_1 \cap E_2| \approx x |E_1| \approx y |E_2|$]
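
The estimate can be checked on toy data. In the sketch below the two "indexes" are plain Python sets and, to keep the result deterministic, every page is used as the sample; in practice only a random sample of each index is available. The set contents and function names are assumptions for illustration.

```python
def overlap_fraction(sample, other_index):
    """Fraction of sampled pages that are also present in `other_index`."""
    return sum(1 for page in sample if page in other_index) / len(sample)


def relative_index_size(sample_e1, index_e2, sample_e2, index_e1):
    """Estimate |E1| / |E2| as y / x, following the capture-recapture argument."""
    x = overlap_fraction(sample_e1, index_e2)  # P(page of E1 is in E2)
    y = overlap_fraction(sample_e2, index_e1)  # P(page of E2 is in E1)
    if x == 0:
        raise ValueError("no sampled page of E1 was found in E2; cannot estimate")
    return y / x


# Toy indexes: |E1| = 600, |E2| = 300, and 200 pages in common.
e1 = set(range(0, 600))
e2 = set(range(400, 700))
ratio = relative_index_size(e1, e2, e2, e1)
print(ratio)  # close to 2: E1 is about twice as large as E2
```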
Sampling Techniques (1)
  How to conduct unbiased sampling from outside the search engine?
    Conceptually, we need to generate a random page from the entire web and test it for presence in each search engine
    Picking a web page uniformly at random is difficult
  Random searches: begin with a log of web searches, send a random search from the log, and pick a random page from the results
    The log may be biased, and a random result from a search may not be a uniformly random page indexed by the search engine
Sampling Techniques (2)
  Random IP addresses: generate random IP addresses and send a request to the corresponding server, collecting all pages at that server
    Many hosts may share one IP address, a server may not accept HTTP requests from the host of the sampling program, and the sample is biased toward the many sites that have few web pages
  Random walks: run a random walk starting at an arbitrary page until it converges to a steady-state distribution, from which we can pick a web page with a fixed probability
    The web is not strongly connected, so some pages are not in the sampling space; a random walk may take a long time to converge
Random Queries
  Idea: pick a page (almost) uniformly at random from a search engine's index by posing a random query to it
    Picking a random word in a dictionary? Not good: frequencies of words vary a lot
  Implementation
    Build a lexicon by crawling a limited portion of the web or a representative subset of the web (e.g., Yahoo!)
    Use a random conjunctive query on $E_1$ and pick a page p at random from the top 100 returned results
    Test p for presence in $E_2$ by choosing 6-8 low-frequency terms in p and using them in a conjunctive query for $E_2$
    Iterate a large number of times
  Classroom discussion: why do we use conjunctive queries of many words?
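
The procedure above can be sketched against two toy in-memory "search engines" (dictionaries from page id to text). Everything here is an illustrative assumption: the pages, the two-word random queries, and the use of k = 3 rare terms instead of 6-8 because the toy pages are short. Real engines are queried over the network and return ranked results.

```python
import random
from collections import Counter


def conjunctive_search(engine, terms):
    """Return the ids of pages in `engine` whose text contains every term."""
    return [pid for pid, text in engine.items()
            if all(term in text.split() for term in terms)]


def estimate_overlap(e1, e2, doc_freq, trials=500, k=3, seed=0):
    """Estimate the fraction of E1's pages that are also indexed by E2."""
    rng = random.Random(seed)
    lexicon = sorted(doc_freq)
    sampled = found = 0
    for _ in range(trials):
        query = rng.sample(lexicon, 2)                 # random conjunctive query
        results = conjunctive_search(e1, query)[:100]  # top 100 results of E1
        if not results:
            continue
        page = rng.choice(results)
        # Probe E2 with the k lowest-frequency terms of the sampled page.
        rare = sorted(set(e1[page].split()), key=lambda w: (doc_freq[w], w))[:k]
        sampled += 1
        if page in conjunctive_search(e2, rare):
            found += 1
    return found / sampled if sampled else None


# Hypothetical indexes: E2 holds two of E1's four pages.
E1 = {
    "p1": "web search engine index ranking",
    "p2": "web graph hyperlink anchor text",
    "p3": "search engine advertising bids clicks",
    "p4": "web crawler index freshness",
}
E2 = {pid: E1[pid] for pid in ("p1", "p3")}
# Document frequencies from a "crawl" (here simply the pages of E1).
doc_freq = Counter(w for text in E1.values() for w in set(text.split()))
x_hat = estimate_overlap(E1, E2, doc_freq)
print(x_hat)
```

Because pages matching more word pairs are sampled more often, the estimate is biased toward longer documents, so it need not equal the true overlap of one half.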
Problems in Random Queries
  The sample is biased toward longer documents
  Picking from the top 100 results of $E_1$ induces a bias from the ranking algorithm of $E_1$
  Either $E_1$ or $E_2$ may not respond to the queries
    $E_2$ may not handle conjunctive queries of many words
    $E_1$ or $E_2$ may reject robotic spam queries
  Improvements
    Use phrases
    Estimate the bias and remove it using statistical methods
Random Walk Sampling
  A random walk on a virtual graph derived from documents
    Two documents (nodes) are linked by an edge if they share two or more words
    Never instantiate the graph
  Move from a document d to another by picking a pair of keywords in d, running a query on a search engine, and picking a random document from the results
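
One step of this walk can be sketched with a toy in-memory engine standing in for a real search engine. The pages, the function name random_walk, and the fixed random seed are assumptions for illustration; the virtual graph is never built, each move is answered by a query.

```python
import random


def random_walk(engine, start, steps, seed=0):
    """Walk over documents that share at least two words, one query per move."""
    rng = random.Random(seed)
    current = start
    path = [current]
    for _ in range(steps):
        words = sorted(set(engine[current].split()))
        w1, w2 = rng.sample(words, 2)          # pick a pair of keywords in d
        results = [pid for pid, text in engine.items()
                   if w1 in text.split() and w2 in text.split()]
        current = rng.choice(results)          # results always include d itself
        path.append(current)
    return path


# Hypothetical engine: page id -> page text.
engine = {
    "p1": "web search engine index ranking",
    "p2": "web graph hyperlink anchor text",
    "p3": "search engine advertising bids clicks",
    "p4": "web crawler index freshness",
}
path = random_walk(engine, start="p1", steps=10)
print(path)
```

In this toy walk a document may be chosen again as its own successor, since it always matches the query built from its own keywords.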
Summary
  The client-server usage of the web
  Two types of search engines: full-text versus taxonomies
  The web graph
  Advertising and spamming
  Categories of web search queries
  Estimation of index sizes of search engines
To-Do List
  According to the latest research results, which search engine may have the largest coverage/index?
    Search the web for the answer!