Automated Price Comparison Shopping Search Engine PriceHunter


Elwin Chai, Rick Jones
Faculty Advisor: Dr. Zachary Ives

Abstract

In this paper, we explore the possibility of creating a product search engine that is able to dynamically find commercial sites, independent of merchant feeds and other human involvement in the management of internal databases. We briefly evaluate the constraints of current shopping search engines and the benefits of offering a fully automated version. In addition, we consider the application of JTidy, stemmers and wrappers in order to extract the relevant information from a commercial website.

Introduction

In the past, the Internet could be thought of as just a repository of static information, and search engines merely offered Internet users basic information retrieval. However, as the web evolved into a bustling marketplace where online transactions are the norm, there is a need for more specific search capabilities. Ultimately, it is hoped that the ideal search engine reduces search costs, in terms of time and money, for consumers in a perfectly efficient market.

Numerous shopping search engines already exist (Sullivan, 2003), but they are mostly constrained by an essentially static database of available products. PriceScan [4] lists products from a manually updated database, classified under static categories. Kelkoo [3] and Yahoo! Shopping [5] utilize similar database frameworks, where merchants submit their products to be classified manually by the search companies according to a predetermined structure. (Any method of database management that involves a case-by-case human decision is considered manual, whether it is the merchant or the search company who makes the decision.) Amazon [1] is a distributor which sells a wide range of products, but in reality maintains a finite database of either products in its own inventory or registered re-sale products.

One search engine drew our attention because it seems to use a more dynamic approach to searching. Froogle [2] led us to believe that a fully automated search engine was possible, because we initially thought that it scours the web for relevant products on sale instead of utilizing a static database. After closer investigation, we discovered that it also relies on merchant feeds, like Yahoo! Shopping, and offers free listing of products. Moreover, there are other shortcomings of Froogle that could be improved on (Mills, 2003).

In order to search over a database that is as dynamic as the growth of the Internet, we need to be able to construct such a database directly from web content. Hence, we divided the process into five main steps: 1) exploring the web, 2) deciding the relevancy of sites, 3) information extraction, 4) database management and 5) information retrieval.

Given the extensive amount of research on the features of a search engine, there is already an established base of methods for crawling the web, database management, and information retrieval, which includes ranking a page based on a query (Lee et al., 1997). As such, the steps that require more research are the areas of site relevancy and information extraction. Some research has been done on the automatic classification of websites (Pierre, 2001), but it has not concentrated specifically on commercial sites. Nonetheless, an important observation was made there: when determining the relevancy of a web page, metadata provide critical information on top of the plain content of the page. An example of metadata is whether a word is displayed in bold or in the title. Google captures some pieces of metadata in a bitmap format for every keyword (Brin and Page, 1998).

Information extraction, on the other hand, has been explored through the use of wrappers (Kushmerick et al., 1997). There are even proposed toolkits to help construct a wrapper (Sahuguet and Azavant, 1999), which laid out the fundamentals of wrapper creation that helped us in our own practical implementation. It should be noted that even though there is general consensus on the need for wrappers, the initially proposed wrappers seem far too site-specific and idealistic to be implemented in practice. Moreover, there is still debate over how effective and practical they can be (Kushmerick, 2000). Nevertheless, we sought to integrate concepts from wrappers to aid us in data extraction.

The paper first presents the architecture with which we have chosen to build our search engine. We then proceed with a detailed explanation of our design choices, in line with the steps of the workflow, as well as the challenges we faced. Finally, we conclude with how future work can augment the effectiveness of this project.

Architecture

To lay out the framework of the program, we track the flow of a single document (or web page) through the system.

Diagram 1: System architecture. A document flows from the Internet through the Webcrawler (fed by the Frontier Manager and subject to robots restrictions), past the Heuristics Manager to information extraction, then through the Stemmer and Keyword Manager to the Database Manager (docinfo and docsearch), which Search queries.

A web page is first processed by the Webcrawler, which extracts its links. The links are in turn managed by the Frontier Manager to ensure the crawler has a steady supply of documents to process. The document is then passed on to the Heuristics Manager, which decides if the page is a commercial website selling a product. If so, the Heuristics Manager passes the page on to have its information extracted. Information extraction occurs in two stages: 1) extracting the price information of the website and 2) extracting the keywords within the document. The keywords are first stemmed and then packaged by the Keyword Manager before they are inserted into the database through the Database Manager. Finally, Search uses the dynamically constructed database to answer user queries.

Webcrawler

All operations begin with the Webcrawler. We based our crawler design on the Mercator model (Heydon and Najork, 1999), as illustrated in the figure below.

Diagram 2: The Mercator crawler model.

The Webcrawler starts with a given set of seed URLs. From these URLs it proceeds to fetch the actual pages from the web to be processed. Processing in this case includes extracting links to other pages and deciding the desirability of each page. If a page is found to be desirable, it is passed down the pipeline so that its information can be extracted. The crawler terminates after repeating this process for a predetermined number of pages. If a page has already been processed by the crawler, it is not passed on to the rest of the workflow; however, its links are still extracted to ensure that the crawler always has links to follow and documents to process.

In the processing of a page, there are two main obstacles to overcome. The first obstacle is politeness. In particular, crawlers have to be aware if a page does not want to be searched or indexed. This information is conveyed in two forms: 1) as a robots.txt file on the server, or 2) in the meta tag of the HTML document. Initially, the Webcrawler must check whether a robots.txt file exists at the base URL. To reduce processing time, the crawler maintains a special robots table in the database, and it first checks this table to decide whether a website should be crawled. If no entry exists, the crawler attempts to obtain and parse a robots.txt file for the host site; thus, the crawler only has to obtain and parse the robots.txt file once. The robots table is recreated on every new crawl, for two reasons: 1) restrictions may increase, e.g. a site previously marked as having no robots.txt file might have added one, and 2) restrictions may change or decrease, e.g. a news site changing its robots.txt to match its changing content. The Webcrawler then proceeds to obtain the document and search for the robots meta tag that specifies restrictions for the crawler to obey.

Diagram 3: Two forms of crawler restrictions. A robots.txt file:

User-agent: *
Disallow: /~yikesinc/
Disallow: /~gravinaj/

and a robots meta tag:

<html>
<head>
<meta name="robots" content="noindex,nofollow"/>
<title> </title>
</head>
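To make the politeness check concrete, a minimal sketch of it follows. The class and method names are illustrative, and the in-memory map stands in for PriceHunter's database-backed robots table; only the record addressed to all crawlers ("User-agent: *") is honoured.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.*;

public class RobotsChecker {
    // host -> list of disallowed path prefixes (empty list = no restrictions)
    private final Map<String, List<String>> robotsTable = new HashMap<>();

    public boolean isAllowed(URL page) {
        String host = page.getHost();
        List<String> disallowed = robotsTable.get(host);
        if (disallowed == null) {                  // cache miss: fetch once per host
            disallowed = fetchDisallowRules(host);
            robotsTable.put(host, disallowed);
        }
        for (String prefix : disallowed) {
            if (page.getPath().startsWith(prefix)) return false;
        }
        return true;
    }

    private List<String> fetchDisallowRules(String host) {
        List<String> rules = new ArrayList<>();
        try {
            URL robots = new URL("http://" + host + "/robots.txt");
            HttpURLConnection conn = (HttpURLConnection) robots.openConnection();
            if (conn.getResponseCode() != 200) return rules;  // no robots.txt: no rules
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                boolean applies = false;
                String line;
                while ((line = in.readLine()) != null) {
                    line = line.trim();
                    // Only the record addressed to every crawler applies to us.
                    if (line.toLowerCase().startsWith("user-agent:")) {
                        applies = line.substring(11).trim().equals("*");
                    } else if (applies && line.toLowerCase().startsWith("disallow:")) {
                        String path = line.substring(9).trim();
                        if (!path.isEmpty()) rules.add(path);
                    }
                }
            }
        } catch (Exception e) { /* treat fetch failures as unrestricted */ }
        return rules;
    }
}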

The second obstacle is memory. Every document the Webcrawler processes generates multiple links to other documents. Since the crawler is the sole generator and processor of links, the number of links will continue to grow until crawling is completed. The solution is the creation of a separate thread called the FrontierManager.

Diagram 4: The FrontierManager moves links between the FrontierIn and FrontierOut queues, which sit between the Webcrawler and the frontier files on disk (FrontierInFile, FrontierOutFile, FrontierTmpFile).

The Webcrawler obtains links from the FrontierIn queue, into which the initial seed URLs are placed. When the crawler extracts links from a page, it pushes them into the FrontierOut queue. The FrontierManager handles the movement of links from the FrontierOut queue into the FrontierIn queue. This is accomplished by checking the sizes of both queues. When the size of FrontierIn falls below a certain threshold, the manager attempts to read links from disk and places them into the queue. If no links are found on disk, the FrontierManager can move links directly from FrontierOut into FrontierIn. On the other hand, if the size of FrontierOut exceeds a certain point, the manager writes the links from that queue onto disk. When crawling is complete, the FrontierManager stores any unprocessed links in the FrontierIn queue back to disk. Multiple FrontierManagers can be created to handle multiple Webcrawlers.
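The threshold strategy can be sketched as follows. This is only an approximation of the design: the thresholds are arbitrary, an in-memory deque simulates the frontier files on disk, and the manager here polls rather than being signaled by the crawler, transferring one link at a time instead of blocks.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.concurrent.BlockingQueue;

public class FrontierManager implements Runnable {
    private static final int IN_LOW = 100;    // refill FrontierIn below this
    private static final int OUT_HIGH = 1000; // drain FrontierOut above this

    private final BlockingQueue<String> frontierIn;
    private final BlockingQueue<String> frontierOut;
    private final Deque<String> spill = new ArrayDeque<>(); // stands in for disk

    public FrontierManager(BlockingQueue<String> in, BlockingQueue<String> out) {
        this.frontierIn = in;
        this.frontierOut = out;
    }

    @Override public void run() {
        while (!Thread.currentThread().isInterrupted()) {
            // Refill: prefer links already spilled to "disk"; otherwise
            // move links straight from FrontierOut into FrontierIn.
            if (frontierIn.size() < IN_LOW) {
                String link = spill.isEmpty() ? frontierOut.poll() : spill.poll();
                if (link != null) frontierIn.offer(link);
            }
            // Drain: push excess out-links to "disk".
            while (frontierOut.size() > OUT_HIGH) {
                String link = frontierOut.poll();
                if (link == null) break;
                spill.add(link);
            }
            try { Thread.sleep(10); } catch (InterruptedException e) { return; }
        }
    }
}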

This design prevents the Webcrawler from consuming unbounded memory due to link extraction. However, there is still the possibility of contention between the FrontierManager and the Webcrawler over the FrontierIn and FrontierOut queues. To avoid such contention, the following observations are made. When the Webcrawler extracts links and places them in the FrontierOut queue, it does not need access to FrontierIn. Conversely, when the Webcrawler is obtaining a new document to search and checking its permissions, it does not need access to FrontierOut. Finally, the Webcrawler needs access to neither queue when forwarding a document down the pipeline. Given these observations, the manager adopts the following strategy: when the Webcrawler signals the FrontierManager to proceed, it is assumed that the crawler does not need to access the given queue. Thus the FrontierIn queue can be processed while the Webcrawler is extracting links. However, disk access is expensive, and, depending on robots.txt permissions, the Webcrawler may need extended access to a particular queue or may skip certain steps. To avoid these costs, the FrontierManager is only signaled when a queue reaches a certain threshold. The threshold is also set such that, should the queue in question not receive immediate attention from the FrontierManager, no delays are caused. In addition, such a schedule allows the FrontierManager to read and write blocks of data to disk more efficiently. Finally, should the FrontierIn queue become empty or the FrontierOut queue become full, the crawler can block until the FrontierManager processes the given request.

Apart from interacting with the FrontierManager, the Webcrawler is also responsible for parsing web pages into DOM (Document Object Model) documents using JTidy (Marchal, 2003). Since most web pages are not well formed according to proper XML structure, it is difficult to traverse a document meaningfully without heavy processing. JTidy produces a DOM tree that can be easily traversed to extract nodes, identifiable by their names. In the context of HTML documents, the nodes represent the HTML entity tags, like <a href="..."> and <img src="...">. (For our purposes, the terms tag and node are used interchangeably to refer to both the elements in the HTML document and the nodes of the corresponding DOM tree.)
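A minimal sketch of this parsing step follows, using JTidy's org.w3c.tidy.Tidy class and its parseDOM method; pulling out the <a> nodes by tag name illustrates the kind of traversal the Webcrawler performs when extracting links. The URL is a placeholder.

import java.io.InputStream;
import java.net.URL;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.w3c.tidy.Tidy;

public class PageParser {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://www.example.com/");
        try (InputStream in = url.openStream()) {
            Tidy tidy = new Tidy();
            tidy.setQuiet(true);          // suppress progress output
            tidy.setShowWarnings(false);  // most pages would drown us in warnings
            Document doc = tidy.parseDOM(in, null);  // tidy into a DOM tree

            NodeList anchors = doc.getElementsByTagName("a");
            for (int i = 0; i < anchors.getLength(); i++) {
                Element a = (Element) anchors.item(i);
                String href = a.getAttribute("href");
                if (!href.isEmpty()) System.out.println(href);
            }
        }
    }
}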

Heuristics

Not all of the web sites obtained through crawling are relevant to product searches. In fact, we are only concerned with sites that offer products or services for sale. Therefore, to avoid processing irrelevant web pages, there needs to be a way of deciding whether a page is relevant. Beyond mere relevance, we have grouped sites with products to offer into three broad categories: 1) online stores that permit buyers to execute the purchase transaction online, 2) auction sites that offer potential buyers an environment in which to bid for items they want to buy, and 3) offline stores that provide buyers the information needed to eventually acquire the items from a physical store, or through an offline exchange. In general, products offered by sites like Amazon.com and Yahoo! Shopping fall under the first category. Products listed on Ebay.com or AuctionFire.com belong to the second category, while classified ads and services like tutoring or childcare are grouped in the third category.

The main method that we elected to use is a bitwise scoring system. In essence, a 19-bit score is calculated for every page, divided into four main sections. The first 4 bits record the occurrence of generic traits of a commercial site; e.g. the most significant bit notes the existence of an HTML input button on the page. The next 7 bits of the score measure the extent to which a page fits the characteristics of an online store, whereas the following 5 and 3 bits analyze the same page with respect to auction sites and offline stores respectively. For any one category, the characteristics that distinguish it are ranked in descending order, so that the most important ones occupy the most significant bits of the score, thereby facilitating sorting by the aggregate score. Taking the identification of online stores as an example, we believe that any site that fits this category should have, at a minimum, a shopping cart (or an equivalent feature), followed by a check-out option, and so on.

Some of the signs we hoped to find on a page were simply text mentioning the availability of the product, shipping costs and the return policy. Nevertheless, text-based identifiers produced many false positives, since they do not uniquely recognize pages that are truly online stores. A bogus web site that simply comments on return policies in general may still be scored on that characteristic, albeit erroneously. Ultimately, we decided to use only non-text identifiers, i.e. explicit HTML entity tags like <input>, <form> and <button>, to provide more accuracy in the site classification process. The bits used to encode text identifiers are thus left unused.

There also remains a trade-off between using String.indexOf and StringTokenizer matching when searching for sub-strings. indexOf allows a searcher to find "bid" within the string "submitbid", but does not ignore the string "obidos" (which occurs on all Amazon sites). The reverse is true for StringTokenizer, which matches each token. Additionally, StringTokenizer matching does not allow for matching multiple words, like "shopping cart" in "Add to shopping cart". In the end, we decided to rely primarily on indexOf to find sub-strings, because it is more accurate and more efficient. Furthermore, only the most important and significant characteristics were preserved, and they provide adequate heuristic information to classify websites. Nonetheless, because we are operating under an open-world assumption, the results are in no way conclusive or perfect.

Diagram 5: Layout of the 19-bit score, from most to least significant bit. Generic (4 bits): input button, price, item ID, payment methods. Online (7 bits): shopping cart, check out, quantity, availability, account, shipping, return policy. Auction (5 bits): bid, time left, seller, auction, buy it now. Offline (3 bits): contact, directions, opening hours. Identifiers in grey in the original diagram indicate text identifiers that were eventually unused.
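The sketch below illustrates the bitwise scheme: each characteristic sets one bit, with more significant bits reserved for more important traits. The exact bit assignments and checks here are illustrative, not PriceHunter's actual layout, and the text identifiers are shown only to demonstrate the indexOf matching discussed above.

import org.w3c.dom.Document;

public class HeuristicsScorer {
    // Generic section: the most significant bits of the 19-bit score.
    static final int INPUT_BUTTON = 1 << 18;   // HTML input button present
    static final int PRICE        = 1 << 17;   // a well-formatted price present
    // Online-store section: the next bits, ranked by importance.
    static final int SHOPPING_CART = 1 << 14;
    static final int CHECK_OUT     = 1 << 13;
    // ... the remaining online, auction and offline bits follow the same pattern.

    public static int score(Document doc, String pageText) {
        int score = 0;
        // Non-text identifiers: explicit HTML entity tags.
        if (doc.getElementsByTagName("input").getLength() > 0) score |= INPUT_BUTTON;
        if (pageText.indexOf("$") >= 0)                        score |= PRICE;
        // Text identifiers, located with indexOf as discussed above.
        if (pageText.indexOf("shopping cart") >= 0) score |= SHOPPING_CART;
        if (pageText.indexOf("check out") >= 0)     score |= CHECK_OUT;
        return score;
    }
}

Because the important traits sit in the high bits, comparing or sorting the aggregate scores naturally ranks pages by how store-like they are.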

Price Extraction

Price is a critical piece of information that needs to be extracted from a given page. However, there is no standard format with which to identify the price that corresponds to a given item on a page. Thus two sets of wrapper functions have been developed to help automatically identify and extract prices: one for online stores and the other for auction sites. Our Price Extractor utilizes the DOM tree of the parsed HTML document and exploits the information contained within its structure.

There are generally two strategies to extract the price: 1) identify the true price through a set of criteria, or 2) attempt to eliminate all prices but the true price. (The true price of a site is what a human user is able to pick out as the price of the item sold.) Given that, for an online store, the only two characteristics of the true price are being in a table and being isolated, these are insufficient pieces of information for identification. Hence, the second strategy is adopted, which can be viewed as passing the list of all prices through a sequence of filters. In contrast, auction sites do not have well-defined prices competing with the true price, so the Extractor simply tries to identify the true price using the first strategy.

To construct such a list of prices, every text node is first checked to see if it contains a well-formatted price, such as $17.99. The price and the node that contains it are then stored in a linked list. In addition, each price is marked as isolated or not isolated. An isolated price is one that appears alone within a given tag. For example, <tag>$17.99</tag> in the diagram below is isolated, while <tag>$12.00 (40%)</tag> is not. From our empirical observations, the majority of sites present the true price as isolated, making isolation an important characteristic of a true price. After the document is searched, this list of prices can be processed to extract the true price. If there is only one price on the page, it is trivially the true price. Otherwise, if there is only a single isolated price, it is likewise considered to be the true price of the item.

Diagram 6: An example product page with an isolated true price ($17.99) alongside a struck-out list price ($29.99) and a non-isolated savings figure ($12.00 (40%)).
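A sketch of this first step, finding well-formatted prices in text nodes and testing them for isolation, follows. The regular expression is our guess at what counts as well formatted, not the exact pattern used.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PriceSpotter {
    private static final Pattern PRICE =
            Pattern.compile("\\$\\d{1,3}(,\\d{3})*(\\.\\d{2})?");

    /** Returns the first well-formatted price in the text, or null. */
    public static String findPrice(String nodeText) {
        Matcher m = PRICE.matcher(nodeText);
        return m.find() ? m.group() : null;
    }

    /** A price is isolated when, whitespace aside, it is all the node contains. */
    public static boolean isIsolated(String nodeText, String price) {
        return nodeText.trim().equals(price);
    }

    public static void main(String[] args) {
        String node1 = "$17.99", node2 = "$12.00 (40%)";
        System.out.println(isIsolated(node1, findPrice(node1)));  // true
        System.out.println(isIsolated(node2, findPrice(node2)));  // false
    }
}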

In the event that there is more than one isolated price, the list of prices is examined for possible true prices. At any point, if only one isolated price remains in the list after eliminating the others, it is deemed to be the true price. Other types of prices that may appear on a page are: 1) prices of items recommended by the commercial site, 2) strike-out prices, and 3) list prices. Each of these can be systematically removed from the list through a series of filters.

As seen at Amazon.com, a list of recommended items, or items that consumers are likely to be interested in, is often included on a page containing one main item for sale. From our experience, these recommendation lists contain more than four products. Even when there are fewer than four recommended items, all the prices in the list are typically not isolated. In addition, pages do not typically group more than two prices around the main price, since these only help to indicate the good deal or savings the consumer is getting. Hence, prices that occur in a table with more than four prices, or in a table containing only non-isolated prices, are all eliminated.

Diagram 7: A recommendation list, whose prices are eliminated by the table filter.

After filtering out recommended prices, the remaining prices are scanned for strike-out tags, i.e. <strike> or <s>, such as the list price $29.99 in Diagram 6. This tag indicates that the price is crossed out and is not the true price. Because the tags must occur around a price in the document, they can easily be located by searching through all of the sibling tags for any given price (there may be more than one sibling tag around a price). Once the tags are found, the prices contained within them are removed from consideration.

The system then searches for all list prices, which, instead of being the true price of an item, are the manufacturer's recommended price. This price is often placed in close proximity to the true price; its position most probably highlights the generous discount that the store is offering. While there are synonyms such as "MSRP" and "retail price", there is a limited number of such terms. Nevertheless, searching for keywords such as "list" proves more difficult than searching for a strike-out tag. The keyword "list" need only appear before the price and inside the same table tag, but its actual location may be hard to find. To find the keywords, a recursive depth-first traversal of the document is implemented. From a given price in the list, the function recursively traverses up through its parent nodes until it reaches the table tag. From each parent node, it searches the children in a depth-first manner to identify a node containing sub-strings like "list price". Because multiple prices may be under the same table, we avoid misidentifying a price by not searching any children beyond the one that contains the initial price. If there is, in fact, a parent node that contains a list-price child node, the initial price is eliminated.
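Stepping back to the strike-out filter, one way to realize it is sketched below: walk outward from the text node holding a candidate price and discard it if a <strike> or <s> tag encloses it. Checking enclosing tags this way is an approximation of the sibling-tag search described above.

import org.w3c.dom.Node;

public class StrikeOutFilter {
    /** True if any enclosing tag crosses the price out. */
    public static boolean isStruckOut(Node priceTextNode) {
        for (Node n = priceTextNode.getParentNode(); n != null; n = n.getParentNode()) {
            String tag = n.getNodeName().toLowerCase();
            if (tag.equals("strike") || tag.equals("s")) return true;
        }
        return false;
    }
}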

The Price Extractor also attempts to remove prices in the list that indicate shipping cost or the discount the consumer is receiving, in the same manner as list prices. If no appropriate price is found, the Price Extractor returns null instead of trying to return an average of the remaining prices in the list, because the result of such a calculation would not be based on any defensible logic.

As for auction sites, it should be noted that these sites often state two important prices: the currently offered bid and a price one can pay to immediately buy the item. The decision was made to extract the currently offered bid, since this can be substantially lower than the immediate buy-out price and offers a more accurate representation of the lowest price the buyer could pay. Thus, to identify the current bid price, the technique used for identifying list prices is applied, only this time only prices matching the criteria are kept. The key term is "current bid" or "starting bid". If this fails, the system defaults to looking for any price that can be associated with just the word "bid".

The most surprising feature of these extraction functions is how well they perform. Since it is impossible to collect every shopping page on the web, our results are based on the pages we actually crawled. Essentially, price extraction for online sites returns a price more often than not, and when it does, it almost always returns the true price. The results for auction sites, on the other hand, were not as encouraging. Some auction sites do not even mention the word "bid" and are thus eliminated by our Extractor.

Database

As the back-end of our search engine, we require a database to store all the information gathered from crawling the web. We selected MySQL as our database of choice, as it is a free SQL server that we could install locally on our systems. In addition, MySQL is a relational database, which we believe offers flexibility for the project.

To process the main body of a web page, pre-processing is performed on the relevant text nodes found within the document. This involves removing punctuation marks as well as common character entities used in HTML documents, e.g. &nbsp; for a space; these are found and replaced with regular ASCII equivalents, if necessary. Then, stemming is done using an implementation of Porter's algorithm for suffix stripping (Porter, 1980). In doing so, words like "run", "runs" and "running" are all stored as "run" and hashed to the same keyword ID. Searches are thus made more versatile, because searching for "running" will also return pages containing "runs".

The database takes in keyword-site pairs (each keyword is associated with the URL from which it was extracted) and processes them into the proper tables in the database. We used the SHA-1 algorithm to hash the URI and the keyword to facilitate the storing and retrieval of both items. The insertion of site-keyword pairs into the database mostly involves accessing the wordlist and termlist tables (see Appendix I for all tables and the E-R diagram). It should be noted that insertions into these tables are not entirely independent, since new words may require updating both tables. Given this constraint, the number of threads that can run simultaneously is quite limited. Thus, while the design can accommodate many threads, it is built expecting only a few in practice.
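To illustrate the keyword hashing above, the sketch below derives a bigint keyword ID from a stemmed word using SHA-1, as the wordlist schema suggests. Folding the 160-bit digest into its first 8 bytes is our assumption, not a documented detail.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class KeywordHasher {
    public static long keywordId(String stemmedWord) throws Exception {
        MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
        byte[] digest = sha1.digest(stemmedWord.getBytes(StandardCharsets.UTF_8));
        long id = 0;
        for (int i = 0; i < 8; i++) {           // keep the first 8 of 20 bytes
            id = (id << 8) | (digest[i] & 0xff);
        }
        return id;
    }

    public static void main(String[] args) throws Exception {
        // "run", "runs" and "running" all stem to "run", so they share one ID.
        System.out.println(keywordId("run"));
    }
}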

The design for the interaction of the Webcrawler and the database is a simplified version of the design for the FrontierManager (see Diagram 4). Essentially, keywords are extracted from a document and placed into an Out queue. A thread called the KeywordManager then moves keywords from the Out queue to the In queue, from which they are read by the database for insertion or updating. However, there are certain key differences. The primary difference is that the Webcrawler is a separate thread from the database. Thus, how many keywords must be offloaded to disk depends entirely on whether the Webcrawler is generating keywords faster than the database can process them, or vice versa; this is the classic producer-consumer problem. In addition, the KeywordManager is not built to accommodate multiple instances. Finally, whereas the FrontierManager can be set to run at times when the Webcrawler is not using a given queue, the KeywordManager simply empties and fills the queues as requested. This is because the KeywordManager is transferring data between two independently running threads, so even having the threads invoke the KeywordManager only when they are not actively using the queues can create contention that prevents the KeywordManager from getting work done. Nevertheless, the KeywordManager follows the FrontierManager in getting to queues before they are full or empty, and performs block reading and writing to avoid unnecessary disk I/O.

Search

The main purpose of PriceHunter is to provide users with search capabilities, and to this end we have defined a basic set of search features. Firstly, to determine whether a page is relevant enough to be returned as a result of a search, its score is calculated based on the vector space model (Lee et al., 1997). The model measures the vector distance between the HTML document and the search query as a proxy for how similar the two are to each other. The results from processing the query are ranked by default based on the vector space score. Should two documents obtain the same score, they are further distinguished by an alternate score, which records the number of times each word in the search query occurs in the document itself.

Users are able to specify which words they require every page in the search results to have. By default, the words entered on the search line do not all have to appear in every web site in the results list, i.e. searching is based on disjunctive keywords. However, to narrow down the search results, users can refine their search using quotation marks. All words appearing within quotations are considered conjunctive conditionals, whereas words otherwise separated by white space are disjunctive conditionals. For example, the string ("one two" three "four five") can be interpreted as a request for sites containing both "one" and "two", or simply "three", or both "four" and "five".

Whether the user wishes to streamline his search using additional query-line shortcuts, e.g. canon i320 $ [online], or by selecting his preferences using an online form, there is great flexibility in the search system. Essentially, the user is offered the option of specifying the type of sites in the results set and the range of prices within which the products must fall.
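The quotation-mark grammar above can be handled with a simple split on the quote character, as sketched below; the method name and the representation (a list of conjunctive word groups, OR-ed together) are illustrative.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class QueryParser {
    /** Quoted phrases become conjunctive groups; bare words become
     *  single-word groups; the groups are disjuncts of one another. */
    public static List<List<String>> parse(String query) {
        List<List<String>> groups = new ArrayList<>();
        String[] parts = query.split("\"");   // odd indices fall inside quotes
        for (int i = 0; i < parts.length; i++) {
            String part = parts[i].trim();
            if (part.isEmpty()) continue;
            if (i % 2 == 1) {
                // Inside quotes: all words must co-occur.
                groups.add(Arrays.asList(part.split("\\s+")));
            } else {
                // Outside quotes: each word is its own disjunct.
                for (String w : part.split("\\s+")) groups.add(Arrays.asList(w));
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        System.out.println(parse("\"one two\" three \"four five\""));
        // [[one, two], [three], [four, five]]
    }
}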
Since PriceHunter is ultimately a web-based search engine, we designed an appropriately user-friendly web interface, hosted on a Tomcat server, that allows the user to submit queries to the SQL server and view the results.

In addition, the user is able to sort through the results list and refine the search by a price range or by specifying the type of sites.

Diagram 8: The PriceHunter web interface.

To further avoid returning invalid or irrelevant results, the system automatically calculates the median price of all initial search results, together with the standard deviation of the same group, to determine the range of prices on which the engine should concentrate. While this method does not immediately guarantee the uniformity of the search results, it provides an initial filtering of pages that are clearly irrelevant to the search.

Taking into account the fact that MySQL does not have a query optimizer, and that processing a search query afresh each time may require lengthy database access to calculate the results, completed searches are also cached in the database. Whenever the same search is made again, as determined by the SHA-1 hash of the search string, the cached results are simply returned, if present. To allow for changes in web pages, each cached search is also timestamped. When the timestamp of a search query is more than two weeks old, the system forces fresh processing of the search. In theory, the expiration period of the timestamp should be as long as the lag time between complete web crawls.
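A minimal sketch of the median-and-deviation filter described above follows. The paper does not state how wide the retained range is, so keeping prices within one standard deviation of the median is an assumption.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PriceRangeFilter {
    public static List<Double> filter(List<Double> prices) {
        if (prices.isEmpty()) return prices;
        List<Double> sorted = new ArrayList<>(prices);
        Collections.sort(sorted);
        double median = sorted.get(sorted.size() / 2);

        double mean = prices.stream().mapToDouble(p -> p).average().orElse(0);
        double var = prices.stream().mapToDouble(p -> (p - mean) * (p - mean))
                           .average().orElse(0);
        double sd = Math.sqrt(var);

        // Keep only results near the median; the rest are treated as
        // clearly irrelevant to the search.
        List<Double> kept = new ArrayList<>();
        for (double p : prices) {
            if (Math.abs(p - median) <= sd) kept.add(p);
        }
        return kept;
    }
}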

Challenges

Throughout the design and implementation of our search engine, we faced several challenges and learnt many important lessons. The most important lesson is that the web is essentially unstructured and consists largely of ill-formed HTML pages. In addition, the content changes continuously and does not follow any convention. While Tidy is supposed to convert ill-formed HTML into XHTML, i.e. well-formed XML versions of HTML, there are certain irregularities that it does not convert appropriately. For example, child nodes that exist within the title tag are relocated by Tidy to the body tag, as in Diagram 9.

Diagram 9: Tidy relocating a child node of the title tag. Before Tidy:

<head> <title>2004 Mitsubishi Galant <a href="...">Cars</a> </title> </head> <body> </body>

After Tidy:

<head> <title>2004 Mitsubishi Galant </title> </head> <body> <a href="...">Cars</a> </body>

This problem is also partially a cause of the difficulty in identifying relevant search results. Due to the excessive amount of information presented on any one page, extracting all the words that exist on the page returns too many false positives in the results. On the other hand, extracting only the base set of keywords from the meta tags and title tags may provide too few keywords for the user to find enough relevant websites. In the case above, the word "Cars" has been dropped from the title, and losing this piece of critical information will cause the page not to be returned when the user searches for cars. In addition, since the user is likely to sort the results by price, it is highly undesirable for the result set to contain irrelevant pages with low prices.

Future Work

While we attempted to provide for scalability, there remains room for improvement in making the entire search engine completely expandable. In particular, the number of Webcrawlers can be increased to distribute the crawling workload. Similarly, more parallelism could potentially alleviate the currently heavy workload of the back-end, which stems from the fact that disk access is relatively expensive. As mentioned before, Tidy has shortcomings in processing the ill-formed web. In order to extract information based on what the user sees, Tidy should be improved to properly reflect the HTML document as presented visually by most browsers, and to preserve the intended structure.

In terms of searching, the location of keywords on the page could be recorded so that positional querying can be performed. While keywords from the meta tag do not inherently have positional data associated with them, those within the title tag do; as such, it is non-trivial to assign positions to a word. Furthermore, metadata about keywords, e.g. formatting data, could be used to augment our current model to return relevant results. Building on our method of extracting prices, other pieces of information can be obtained from the web page, e.g. shipping cost, item number, seller, etc. In particular, the category of a product may also be assigned dynamically, thereby creating an automatic category list of all products in the database. Ultimately, to become a fully viable product search engine, this method would have to be able to extract more than just prices from web pages.

Conclusion

There is sufficient evidence that there is some level of convention and structure among commercial sites for wrapper functions to extract information reasonably. We were surprised to find that the functions we developed achieved a substantial amount of success.

Nonetheless, we have restricted ourselves to a small set of web sites. The diversity of the wider web, and the constant change it undergoes, means that further investigation is necessary to determine the viability of a fully automated product search engine. The content of these websites may simply become too diversified and irregular to warrant the applicability of any fixed set of heuristics in the long term. Even so, as the web develops, there may be greater standardization among online stores.

References

[1] Amazon.
[2] Froogle.
[3] Kelkoo.
[4] PriceScan.
[5] Yahoo! Shopping.
[6] Sergey Brin and Lawrence Page. The Anatomy of a Large-Scale Hypertextual Web Search Engine. Computer Networks and ISDN Systems, 1998.
[7] Allan Heydon and Marc Najork. Mercator: A Scalable, Extensible Web Crawler. World Wide Web Journal, December 1999.
[8] Nicholas Kushmerick, Daniel S. Weld and Robert Doorenbos. Wrapper Induction for Information Extraction. International Joint Conference on Artificial Intelligence, 1997.
[9] Nicholas Kushmerick. Wrapper Verification. World Wide Web Journal 3(2), 2000.
[10] Dik L. Lee, Huei Chuang and Kent Seamons. Document Ranking and the Vector-Space Model. IEEE Software, March/April 1997.
[11] Benoit Marchal. Tip: Convert from HTML to XML with HTML Tidy. 18 September 2003.
[12] Jason Mills. Early Froogle BETA Shortcomings. Top Site Listings, 15 January 2003.
[13] John M. Pierre. On the Automated Classification of Web Sites. Electronic Transactions on Artificial Intelligence, 2001.
[14] Martin F. Porter. An Algorithm for Suffix Stripping. Program, Vol. 14, No. 3, 1980.
[15] Arnaud Sahuguet and Fabien Azavant. Building Light-Weight Wrappers for Legacy Web Data-Sources Using W4F. Proceedings of the 25th International Conference on VLDB, 1999.
[16] Danny Sullivan. Shopping Search Engines. Search Engine Watch, 5 December 2003.

Appendix I

CREATE TABLE document (
    docid   bigint,
    uri     varchar(255),
    score   char(24),
    type    char(10),
    title   varchar(255),
    descr   varchar(255),
    PRIMARY KEY (docid));

CREATE TABLE documentinfo (
    price   float,
    docid   bigint,
    PRIMARY KEY (docid),
    FOREIGN KEY (docid) REFERENCES document);

CREATE TABLE termlist (
    docid   bigint,
    wordid  bigint,
    word    varchar(255),
    hit     bigint,
    weight  float,
    PRIMARY KEY (docid, wordid),
    FOREIGN KEY (docid) REFERENCES document,
    FOREIGN KEY (wordid) REFERENCES wordlist);

CREATE TABLE wordlist (
    wordid  bigint,
    word    varchar(255),
    nhits   bigint,
    idf     float,
    PRIMARY KEY (wordid));

CREATE TABLE robots (
    host     bigint,
    disallow varchar(128),
    PRIMARY KEY (host, disallow));

CREATE TABLE documentsearch (
    searchid    bigint,
    docid       bigint,
    ranking     float,
    alt_ranking int,
    timestamp   timestamp,
    PRIMARY KEY (searchid, docid),
    FOREIGN KEY (docid) REFERENCES document);

ER Diagram for PriceHunter

The E-R diagram relates the tables above: Document (docid, uri, score, type, title, descr) contains DocumentInfo (docid, price); Robots holds the crawl restrictions (host, disallow); Document is searched through DocumentSearch (searchid, docid, ranking, alt_ranking, timestamp); and Termlist (docid, wordid, word, hit, weight) links Document to Wordlist (wordid, word, nhits, idf). In the original diagram, primary keys are shown in bold and underlined; the remaining shared attributes are foreign keys.


Building a Question Classifier for a TREC-Style Question Answering System Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given

More information

Official Amazon Checkout Extension for Magento Commerce. Documentation

Official Amazon Checkout Extension for Magento Commerce. Documentation Official Amazon Checkout Extension for Magento Commerce Documentation 1. Introduction This extension provides official integration of your Magento store with Inline Checkout by Amazon service. Checkout

More information

Legal Informatics Final Paper Submission Creating a Legal-Focused Search Engine I. BACKGROUND II. PROBLEM AND SOLUTION

Legal Informatics Final Paper Submission Creating a Legal-Focused Search Engine I. BACKGROUND II. PROBLEM AND SOLUTION Brian Lao - bjlao Karthik Jagadeesh - kjag Legal Informatics Final Paper Submission Creating a Legal-Focused Search Engine I. BACKGROUND There is a large need for improved access to legal help. For example,

More information

Automatic Recommendation for Online Users Using Web Usage Mining

Automatic Recommendation for Online Users Using Web Usage Mining Automatic Recommendation for Online Users Using Web Usage Mining Ms.Dipa Dixit 1 Mr Jayant Gadge 2 Lecturer 1 Asst.Professor 2 Fr CRIT, Vashi Navi Mumbai 1 Thadomal Shahani Engineering College,Bandra 2

More information

AUTOMATE CRAWLER TOWARDS VULNERABILITY SCAN REPORT GENERATOR

AUTOMATE CRAWLER TOWARDS VULNERABILITY SCAN REPORT GENERATOR AUTOMATE CRAWLER TOWARDS VULNERABILITY SCAN REPORT GENERATOR Pragya Singh Baghel United College of Engineering & Research, Gautama Buddha Technical University, Allahabad, Utter Pradesh, India ABSTRACT

More information

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2 Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue

More information

Pizza SEO: Effective Web. Effective Web Audit. Effective Web Audit. Copyright 2007+ Pizza SEO Ltd. info@pizzaseo.com http://pizzaseo.

Pizza SEO: Effective Web. Effective Web Audit. Effective Web Audit. Copyright 2007+ Pizza SEO Ltd. info@pizzaseo.com http://pizzaseo. 1 Table of Contents 1 (X)HTML Code / CSS Code 1.1 Valid code 1.2 Layout 1.3 CSS & JavaScript 1.4 TITLE element 1.5 META Description element 1.6 Structure of pages 2 Structure of URL addresses 2.1 Friendly

More information

How To Manage Inventory In Commerce Server

How To Manage Inventory In Commerce Server 4 The Inventory System Inventory management is a vital part of any retail business, whether it s a traditional brick-and-mortar shop or an online Web site. Inventory management provides you with critical

More information

Hierarchical Bloom Filters: Accelerating Flow Queries and Analysis

Hierarchical Bloom Filters: Accelerating Flow Queries and Analysis Hierarchical Bloom Filters: Accelerating Flow Queries and Analysis January 8, 2008 FloCon 2008 Chris Roblee, P. O. Box 808, Livermore, CA 94551 This work performed under the auspices of the U.S. Department

More information

SAP Business Objects Business Intelligence platform Document Version: 4.1 Support Package 7 2015-11-24. Data Federation Administration Tool Guide

SAP Business Objects Business Intelligence platform Document Version: 4.1 Support Package 7 2015-11-24. Data Federation Administration Tool Guide SAP Business Objects Business Intelligence platform Document Version: 4.1 Support Package 7 2015-11-24 Data Federation Administration Tool Guide Content 1 What's new in the.... 5 2 Introduction to administration

More information

An Alternative Web Search Strategy? Abstract

An Alternative Web Search Strategy? Abstract An Alternative Web Search Strategy? V.-H. Winterer, Rechenzentrum Universität Freiburg (Dated: November 2007) Abstract We propose an alternative Web search strategy taking advantage of the knowledge on

More information

A Platform for Large-Scale Machine Learning on Web Design

A Platform for Large-Scale Machine Learning on Web Design A Platform for Large-Scale Machine Learning on Web Design Arvind Satyanarayan SAP Stanford Graduate Fellow Dept. of Computer Science Stanford University 353 Serra Mall Stanford, CA 94305 USA arvindsatya@cs.stanford.edu

More information

A Mind Map Based Framework for Automated Software Log File Analysis

A Mind Map Based Framework for Automated Software Log File Analysis 2011 International Conference on Software and Computer Applications IPCSIT vol.9 (2011) (2011) IACSIT Press, Singapore A Mind Map Based Framework for Automated Software Log File Analysis Dileepa Jayathilake

More information

Flattening Enterprise Knowledge

Flattening Enterprise Knowledge Flattening Enterprise Knowledge Do you Control Your Content or Does Your Content Control You? 1 Executive Summary: Enterprise Content Management (ECM) is a common buzz term and every IT manager knows it

More information

Software Requirement Specification For Flea Market System

Software Requirement Specification For Flea Market System Software Requirement Specification For Flea Market System By Ilya Verlinsky, Alexander Sarkisyan, Ambartsum Keshishyan, Igor Gleyser, Andrey Ishuninov 1 INTRODUCTION 1.1 Purpose 1.1.1 Purpose of SRS document

More information

Exchanger XML Editor - Canonicalization and XML Digital Signatures

Exchanger XML Editor - Canonicalization and XML Digital Signatures Exchanger XML Editor - Canonicalization and XML Digital Signatures Copyright 2005 Cladonia Ltd Table of Contents XML Canonicalization... 2 Inclusive Canonicalization... 2 Inclusive Canonicalization Example...

More information

Social Network Website to Monitor Behavior Change Design Document

Social Network Website to Monitor Behavior Change Design Document Social Network Website to Monitor Behavior Change Design Document Client: Yolanda Coil Advisor: Simanta Mitra Team #11: Gavin Monroe Nicholas Schramm Davendra Jayasingam Table of Contents PROJECT TEAM

More information

Skills for Employment Investment Project (SEIP)

Skills for Employment Investment Project (SEIP) Skills for Employment Investment Project (SEIP) Standards/ Curriculum Format for Web Application Development Using DOT Net Course Duration: Three Months 1 Course Structure and Requirements Course Title:

More information

The Architectural Design of FRUIT: A Family of Retargetable User Interface Tools

The Architectural Design of FRUIT: A Family of Retargetable User Interface Tools The Architectural Design of : A Family of Retargetable User Interface Tools Yi Liu Computer Science University of Mississippi University, MS 38677 H. Conrad Cunningham Computer Science University of Mississippi

More information

LabVIEW Internet Toolkit User Guide

LabVIEW Internet Toolkit User Guide LabVIEW Internet Toolkit User Guide Version 6.0 Contents The LabVIEW Internet Toolkit provides you with the ability to incorporate Internet capabilities into VIs. You can use LabVIEW to work with XML documents,

More information

Web Page Change Detection Using Data Mining Techniques and Algorithms

Web Page Change Detection Using Data Mining Techniques and Algorithms Web Page Change Detection Using Data Mining Techniques and Algorithms J.Rubana Priyanga 1*,M.sc.,(M.Phil) Department of computer science D.N.G.P Arts and Science College. Coimbatore, India. *rubanapriyangacbe@gmail.com

More information

Cassandra A Decentralized, Structured Storage System

Cassandra A Decentralized, Structured Storage System Cassandra A Decentralized, Structured Storage System Avinash Lakshman and Prashant Malik Facebook Published: April 2010, Volume 44, Issue 2 Communications of the ACM http://dl.acm.org/citation.cfm?id=1773922

More information

Detection of SQL Injection Attacks by Combining Static Analysis and Runtime Validation

Detection of SQL Injection Attacks by Combining Static Analysis and Runtime Validation Detection of SQL Injection Attacks by Combining Static Analysis and Runtime Validation Witt Yi Win, and Hnin Hnin Htun Abstract SQL injection attack is a particularly dangerous threat that exploits application

More information

Postgres Plus xdb Replication Server with Multi-Master User s Guide

Postgres Plus xdb Replication Server with Multi-Master User s Guide Postgres Plus xdb Replication Server with Multi-Master User s Guide Postgres Plus xdb Replication Server with Multi-Master build 57 August 22, 2012 , Version 5.0 by EnterpriseDB Corporation Copyright 2012

More information

High-performance XML Storage/Retrieval System

High-performance XML Storage/Retrieval System UDC 00.5:68.3 High-performance XML Storage/Retrieval System VYasuo Yamane VNobuyuki Igata VIsao Namba (Manuscript received August 8, 000) This paper describes a system that integrates full-text searching

More information

The Challenge of Managing On-line Transaction Processing Applications in the Cloud Computing World

The Challenge of Managing On-line Transaction Processing Applications in the Cloud Computing World The Challenge of Managing On-line Transaction Processing Applications in the Cloud Computing World Marcia Kaufman, COO and Principal Analyst Sponsored by CloudTran The Challenge of Managing On-line Transaction

More information

Web. Services. Web Technologies. Today. Web. Technologies. Internet WWW. Protocols TCP/IP HTTP. Apache. Next Time. Lecture #3 2008 3 Apache.

Web. Services. Web Technologies. Today. Web. Technologies. Internet WWW. Protocols TCP/IP HTTP. Apache. Next Time. Lecture #3 2008 3 Apache. JSP, and JSP, and JSP, and 1 2 Lecture #3 2008 3 JSP, and JSP, and Markup & presentation (HTML, XHTML, CSS etc) Data storage & access (JDBC, XML etc) Network & application protocols (, etc) Programming

More information

Novell Identity Manager

Novell Identity Manager AUTHORIZED DOCUMENTATION Manual Task Service Driver Implementation Guide Novell Identity Manager 4.0.1 April 15, 2011 www.novell.com Legal Notices Novell, Inc. makes no representations or warranties with

More information

WEBSITE PENETRATION VIA SEARCH

WEBSITE PENETRATION VIA SEARCH WEBSITE PENETRATION VIA SEARCH Azam Zia Muhammad Ayaz Email: azazi022@student.liu.se, muhay664@student.liu.se Supervisor: Juha Takkinen, juhta@ida.liu.se Project Report for Information Security Course

More information

Analysis of Web Archives. Vinay Goel Senior Data Engineer

Analysis of Web Archives. Vinay Goel Senior Data Engineer Analysis of Web Archives Vinay Goel Senior Data Engineer Internet Archive Established in 1996 501(c)(3) non profit organization 20+ PB (compressed) of publicly accessible archival material Technology partner

More information