Buglook: A Search Engine for Bug Reports


Georgi Chulkov
May 18, 2007

Project Report, Networks and Distributed Systems Seminar
Supervisor: Dr. Juergen Schoenwaelder
Jacobs University Bremen

Abstract

Buglook is a new type of search engine that is aware of the structure of the websites it can search. Because it is constrained to a single application domain - in this case, bug repository systems - it is able to better understand the pages it finds and downloads. It is designed to run with minimal overhead, reusing the search capabilities of the bug trackers it searches, and relies on search modules to parse different categories of websites. It has extensibility as its highest priority, encapsulating all site-category-specific code away from the general code. This paper presents the design and implementation of Buglook in detail, and suggests further improvements, largely related to performance.

1 Introduction

Modern search engines make it very easy to find a web page containing given text in mere seconds. They continuously crawl and index the World Wide Web (WWW), thus allowing a user to only worry about what to search for, not where. Despite the massive amount of web page content available worldwide, search engines generally take less than a second to return a list of results to their user [6]. The list is additionally sorted by relevance in some way [7].

To achieve such astonishing performance, search engines restrict themselves to pure textual search. They view a web page as little more than a list of words, and make no attempt to infer any structure or other semantic information from it. The reason for this is twofold. Obviously, any processing beyond text indexing would be too expensive given the massive amounts of data involved, but in addition search engines must deal with any web page on any website of any application domain. Generally, two pages from the set of web pages on the WWW have nothing in common beyond the use of HTML as their markup language.

If one restricts searching to a given set of websites within a well-known application domain, ignoring structural similarity between web pages is no longer an optimal (or the only) solution. A search engine that can derive structure from the similarity within such web pages could allow more expressive search queries, and could better define the relevance of web pages in order to sort the results. Classic pure text search, on the other hand, is constrained to word similarity and proximity as its relevance criterion [7].

This paper explores the problem of searching within support request ticket websites, also known as bug report systems, bug repositories, or bug trackers.¹ It presents Buglook, a search engine designed to search for bug reports within such repositories.

¹ Throughout this paper, the terms bug repository and bug tracker are used interchangeably.

The choice of application domain is motivated by the importance of searching within bug repositories. Bug reports are short notices of problems within software, submitted by its users to the developers of the program(s). In a classical support scenario, a bug is confirmed by the developer, and a fix is created and issued to users, as well as integrated into the next version of the software. False alarms are usually handled by the developer explaining why a reported bug is not in fact a problem with the software, but a mistake on the user's end. Consequently, bug reports as found in bug repositories often contain the solution to the specific issues described in them. When they do not, they typically at least confirm whether a given problem truly exists, whether the developer knows about it, and whether a fix is being worked on.

Locating such information is often difficult with a general search engine, largely because of the numerous irrelevant results that are likely to be returned for a given set of keywords. As an example, suppose a user is looking for bugs describing some malfunction, such that they are confirmed to be errors and are being worked on. If, besides the keywords describing the problem, the user specifies keywords such as "confirmed" or "fixed", they will likely get many bug reports that are confirmed and fixed, but do not pertain to their problem. If the user does not specify these additional keywords, they are likely to get many results that are not bug reports at all. A search engine that is designed with bug reports in mind would be able to derive structure common to all reports, such as the state of the bug ("confirmed", "fixed", etc.), date of reporting, date of last activity, and many more properties that a general search engine has no concept of. This information can in turn be used within search queries to provide highly relevant results to a user.

Given that bug repository systems implement their own search capabilities that are aware of the inherent properties of their content, why is a domain-specific search engine needed? The answer is quite simple: a search engine can search many bug trackers simultaneously, saving the user a lot of manual effort. The user would only need to know what they are looking for, but not where to find it.

1.1 Searching Bug Trackers vs. the General Web

Web pages within a bug tracker differ in significant ways from general web pages on the WWW. A bug tracker is invariably backed by a database or some other data storage system, and uses a well-defined template to transform given data into an HTML page. A general web page can be similarly generated from external data, manually coded by a human being, or both. This distinction has several important consequences.

Using a single template guarantees the consistency of the HTML structure of all pages within a single bug tracker.

The generated web pages differ only in their textual content, or in the number of well-defined elements they have (e.g. the number of replies to a bug report), but their overall structure is identical. A given datum, such as the title of the bug report or the date of the last reply, can be extracted in an identical way from every bug report. Structural similarity in turn makes it possible to parse the pages and extract their data with considerable accuracy, without resorting to complex natural-language processing tools.

On the other hand, there is no guarantee that a bug tracker will always generate the exact same page twice from the same external data. Subtle differences (such as the bug tracker version, the current date, or the time it took to generate a page) imply that it is difficult to know if a page has been modified without downloading the page and extracting its modification time, if one is available. In addition, it is not possible to simply index all pages, because their number is unbounded, even though the source data they were generated from is finite.

1.2 Buglook: Automated Search of Bug Repositories

Buglook is a search engine that operates on bug tracker systems. It uses a tracker's internal search engine, and parses the HTML pages that the search returns. It is able to derive structural information from bug reports. Buglook requires no indexing of a given bug tracker before it is able to search it. It can only operate on certain categories of bug trackers (called sitetypes) that it knows about; examples of sitetypes are all Bugzilla-based bug trackers, all phpBB forums, and so on. Each sitetype is supported by a Buglook plug-in called a search module, which encapsulates the peculiarities of a given sitetype, while Buglook itself handles all generic operations such as downloading and parsing HTML.

To the user, Buglook looks similar to classic search engines, except that it supports as much structure as its search modules are able to derive, and is constrained to the sites that it has search modules for. To the bug tracker, Buglook looks like a user who performs a search and views the results of that search.

The remainder of this paper is organized as follows. Section 2 discusses the architecture and building blocks of Buglook, as well as its search algorithm and current state of implementation. Section 3 describes the search modules' interface and functionality in detail, using a search module for Bugzilla bug trackers as an example. Finally, section 4 explores various performance improvements to the search algorithm that could make Buglook much faster without compromising search quality.

[Figure 1: Buglook Architecture]

2 Design and Implementation

2.1 Buglook Architecture

Buglook consists of three essential parts that complement each other. It has an engine that is responsible for downloading, HTML parsing, and all other general functionality that is required for searching any bug repository. Another part is the set of all search modules. Each one implements the sitetype-specific functionality required to parse a given category of bug trackers. Finally, a front-end handles user interaction such as query input and formatting of search results. The overall architecture of Buglook is illustrated in Figure 1.

2.1.1 Building Blocks

Buglook is written in Java. Java is a cross-platform language especially suitable for complex web applications, and also has numerous useful libraries written in it. For downloading web pages, Buglook uses the Apache Commons project's HttpClient library, which supports several protocols (HTTP, FTP, and more), SSL security, cookies, authentication, and both GET and POST HTTP requests, among other features [2].

HTML parsing is achieved via JTidy [8], a Java port of tidy. The latter is a popular library for parsing real-world HTML, which is often not well-formed, nonconforming to HTML standards, or both. While tidy's main application is to produce an otherwise equivalent but completely correct HTML/XHTML document, its Java port JTidy also has a Document Object Model (DOM) interface. DOM is an API that allows direct access to the HTML code as a tree structure [1].
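As a concrete illustration of how these two building blocks fit together, the following is a minimal sketch (not Buglook's actual code) of downloading a page with Commons HttpClient and converting it into a DOM tree with JTidy; only the two libraries' documented classes are used, everything else is illustrative.

    import java.io.InputStream;
    import org.apache.commons.httpclient.HttpClient;
    import org.apache.commons.httpclient.methods.GetMethod;
    import org.w3c.dom.Document;
    import org.w3c.tidy.Tidy;

    // Downloads one web page and converts it into a DOM tree, the two generic
    // steps the engine performs before handing the tree to a search module.
    public class PageFetcher {
        public static Document fetch(String url) throws Exception {
            HttpClient client = new HttpClient();
            GetMethod get = new GetMethod(url);
            try {
                int status = client.executeMethod(get);  // perform the HTTP GET
                if (status != 200) {
                    throw new Exception("HTTP error " + status + " for " + url);
                }
                InputStream body = get.getResponseBodyAsStream();
                Tidy tidy = new Tidy();
                tidy.setQuiet(true);          // suppress Tidy's progress output
                tidy.setShowWarnings(false);  // real-world HTML produces many warnings
                return tidy.parseDOM(body, null);  // repair the HTML, build a DOM tree
            } finally {
                get.releaseConnection();
            }
        }
    }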

From the DOM trees downloaded by HttpClient and returned by JTidy, search modules are required to perform certain tasks involving data extraction. To facilitate this, Buglook provides the ability to specify regular expressions over HTML trees, and to search an entire DOM tree for matching subtrees. Search modules are thus able to express search criteria of the form "find all elements with a given title that have certain attributes and whose children (recursively) match the following criteria". Such regular expressions make it easy to specify complex heuristics for parsing different web pages within a sitetype.²

2.2 Search Algorithm

When searching a bug repository for some search terms, Buglook simulates a user going to the root web page of the repository and filling in a search form there. It then receives one or more web pages of results. Each result is then parsed to obtain information about that entry, and is finally presented to the user as a search result. The search algorithm, before any performance optimizations are applied, is shown in Algorithm 1. The four operations marked with asterisks in the listing are the responsibilities of the search modules. Note that the algorithm takes three input elements - the search query, a root URL, and a range of results. The URL can be supplied by the user of Buglook or can be obtained by other means; one way of obtaining a list of root URLs is outlined below.

2.2.1 Obtaining Root URLs

Normally, a user would like to search multiple bug trackers, or may have no idea where to search. In such cases, supplying root URLs to Buglook is tedious or impossible, respectively. One way to obtain a list of parseable root URLs is to use a classic web crawler that follows all links from a starting page, passing each obtained URL through Buglook's detection (see section 3.1.1). The URLs that pass detection would then be added to a list of URLs, against which searches would be performed. Finding good root URLs is then a matter of running the crawler from a good starting point - such as the result of a classical search engine query for terms related to the available search modules (e.g. a search for "Bugzilla bug tracker"). This functionality has not been implemented yet, and currently Buglook requires a user-supplied root URL for each search.

2.3 Current State of Implementation

At present, Buglook has a working implementation of the simple search algorithm, as well as a functioning web interface. As a proof of concept, a working Bugzilla search module has also been implemented. It is used as an example throughout the discussion of search modules below. While it functions correctly, Buglook is not quite as fast as a traditional index-based search engine. Some possible optimizations that could close the performance gap are outlined in section 4.

² Perfect consistency is only guaranteed within the same site. Within different sites of the same sitetype, page structure is typically similar but not the same.
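To make the tree-pattern facility of section 2.1.1 concrete, here is an illustrative sketch of how such patterns could be represented and matched; the TreePattern class and its fields are invented for this illustration, and only the org.w3c.dom types come from the real DOM API.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;
    import org.w3c.dom.Element;
    import org.w3c.dom.Node;
    import org.w3c.dom.NodeList;

    // A toy "regular expression" over DOM trees: an element matches if its tag
    // name and required attributes match, and every child pattern is matched
    // somewhere among its descendants.
    class TreePattern {
        String tag;                        // required element name, e.g. "form"
        Map<String, String> attrs;         // required attribute name/value pairs
        List<TreePattern> childPatterns;   // criteria the element's subtree must satisfy

        boolean matches(Element e) {
            if (!e.getTagName().equalsIgnoreCase(tag)) return false;
            for (Map.Entry<String, String> a : attrs.entrySet()) {
                if (!a.getValue().equalsIgnoreCase(e.getAttribute(a.getKey()))) return false;
            }
            // every child pattern must match somewhere below this element
            for (TreePattern c : childPatterns) {
                boolean found = false;
                NodeList kids = e.getChildNodes();
                for (int i = 0; i < kids.getLength() && !found; i++) {
                    found = !c.findAll(kids.item(i)).isEmpty();
                }
                if (!found) return false;
            }
            return true;
        }

        // Collect all elements in the subtree rooted at n that match this pattern.
        List<Element> findAll(Node n) {
            List<Element> hits = new ArrayList<Element>();
            if (n instanceof Element && matches((Element) n)) hits.add((Element) n);
            NodeList kids = n.getChildNodes();
            for (int i = 0; i < kids.getLength(); i++) {
                hits.addAll(findAll(kids.item(i)));
            }
            return hits;
        }
    }

A search module could then describe, say, a form element with method=get that contains a text input field, and retrieve all matching subtrees in one call.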

Algorithm 1 The Buglook search algorithm (simple version: no optimizations)

Require: root URL u_root, search query q, result range m..n
 1: R_HTML ← download u_root
 2: R_DOM ← parse HTML of R_HTML
 3: for all search modules m ∈ M do
 4:     *detect* whether m is suitable for R_DOM
 5:     if m is suitable for R_DOM then
 6:         p ← m
 7:     end if
 8: end for
 9: if no suitable p was found then
10:     fail
11: end if
12: L ← empty list of URLs
13: u_search ← *compile* a search request with p and q from R_DOM
14: while u_search points to a next page of results do
15:     S_HTML ← download u_search
16:     S_DOM ← parse HTML of S_HTML
17:     *extract* list of bug URLs with p from S_DOM and add to L
18:     u_search ← next page of results
19: end while
20: X ← empty list of search results
21: for all u_bug ∈ L within range m..n do
22:     B_HTML ← download u_bug
23:     B_DOM ← parse HTML of B_HTML
24:     x ← *parse* search result with p from B_DOM
25:     add x to X
26: end for
27: return X

Table 1: Search requests for the query "foo" for several Bugzilla bug trackers

    Bug Tracker            Search Request                    Handled by Buglook?
    bugs.gentoo.org        ../buglist.cgi?quicksearch=foo    yes
    bugzilla.gnome.org     ../buglist.cgi?query=foo          yes
    bugs.kde.org           ../quick_search.cgi?id=foo        yes
    bugzilla.kernel.org    unknown (uses JavaScript)         no

3 Search Modules

Perhaps the most important part of Buglook is its search modules, which are the mechanisms that extract structural information from a DOM tree. By writing a new search module, one can transparently extend Buglook to operate on new types of bug trackers. In doing so, one only has to worry about the code specific to that website, without having to handle downloading, parsing, or searching within a DOM tree manually.

The challenge in writing a search module is that it needs to define heuristics for extracting information from a web page that work on a number of different repositories. For example, the included Bugzilla module is able to handle (most) bug trackers derived from the popular Bugzilla software [5]. While all Bugzilla trackers share some common characteristics, any two of them will most likely look different, and correspondingly have different HTML trees. The differences come not only from using different versions of the bug tracker software, but also because site administrators often customize software to better suit their needs. As a specific example, consider the bug trackers shown in Table 1. They are all based on Bugzilla, but each has a different way of submitting search queries.

3.1 Interface and Functionality

All search modules have to support four operations: detection, compilation, extraction, and parsing.

3.1.1 Detection

Detection stands for the ability to determine whether a particular DOM tree represents a page that can be handled by a search module.

More precisely, it answers the question of whether the module can figure out how to fill in a search form on the root web page, and extract bug URLs from the resulting result page(s).

    public boolean canHandle(URL base, Document rootDoc);

Here, as well as in all other operations, the first parameter is the URL of the page being processed - the only information about a web page that is not contained in the DOM tree. It is required in order to correctly resolve relative links to absolute URLs that can be downloaded. The second parameter is the actual DOM tree of the web page in question. The root page of a bug tracker is the first one that a user sees when they first visit a site.

Detection's primary priority is to be fast. It does not need to be perfectly accurate: if it claims to be able to handle a site, but subsequently fails, Buglook will detect this and try another suitable module if one is available. Thus, false positives do not pose a problem.

3.1.2 Compilation

Compilation (of a search request) has the prototype:

    public FetchRequest getSearchFromRoot(URL base, Document rootDoc, String query)
            throws SearchModuleException;

Other than the base URL and the root page's DOM tree, the module also receives the search query the user has submitted. It must return a FetchRequest structure, which is little more than a URL with several additional parameters, such as the GET or POST request type, and request parameters and their names.

A module must therefore find a search form within the DOM tree of a root page. If there are several forms on the page, the module must guess the correct one - the one that will allow searching rather than, for example, logging in. When the correct form is identified, the module must find out what fields to fill in with the query, or equivalently, what request parameter the query must be specified in, and where it must be submitted.

3.1.3 Extraction

After it has obtained a search request via compilation, Buglook executes the request, resulting in the first page of search results. Arguably the simplest task a search module must perform, extraction is the act of finding which links on the search result page are URLs to bug reports. In addition, a module has to determine whether there are more pages of bug reports: some bug trackers only return a handful of results at a time, and have a link titled "Next" or similar that leads to further entries. Extraction's prototype reflects this additional requirement:

    public ListURLAndFetchRequest getBugsFromSearchRes(URL base, Document searchResDoc)
            throws SearchModuleException;

The returned structure, a pair of a list of URLs and a FetchRequest, requires some explanation. The list contains the URLs of the bug reports. The additional FetchRequest must be the request for the next page of results if one exists, or null otherwise. In this way, Buglook can access all results for a search, no matter whether the bug tracker presents them at once or on several pages.

The simple search algorithm in fact just downloads and parses all result pages one after another. Once all bug URLs are added to a single list, it no longer matters whether the URLs came from a single page or from multiple ones. Of course, such a naive approach is fairly slow in practice, and generally unparallelizable. (Section 4 discusses this and other performance considerations.) What is important is that the responsibility for downloading next pages at a given time does not lie with the search module; it merely must indicate if and where more results can be found.

3.1.4 Parsing

The final and most complex operation is parsing:

    public SearchResult fillDetails(URL base, Document bugDoc)
            throws SearchModuleException;

Given the DOM tree of a single bug report, the module must fill in a SearchResult structure. The latter has fields for a URL and a title, among others. A module should try to fill in as many as possible and leave out the ones that cannot be parsed reliably.

A problem arises when bug tracker software is customized for a particular site. Elements important to the functionality of a site but not to its appearance, such as the HTML code for forms and the format of linked URLs, are seldom modified and work relatively consistently within a sitetype. On the other hand, the pages that contain the bug reports themselves look very different across different trackers, and therefore have significantly different HTML structure. Whereas detection, compilation, and extraction largely deal with functional HTML elements (forms and links), parsing has to extract human-readable information, which is encapsulated in very different presentational HTML structures. For instance, the title of a bug report might be in a <h1> tag in one bug tracker, but in the first row of a table in another. Reliable heuristics that locate the required data with high probability of success are therefore very difficult to define.
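The paper does not reproduce SearchResult in full. A minimal sketch that is consistent with the fields mentioned here and with the bug properties listed in the introduction might look as follows; all fields beyond url and title are illustrative.

    import java.net.URL;
    import java.util.Date;

    // A parsed bug report. Only url and title are filled in by the current
    // implementation (see below); the remaining fields illustrate the kind of
    // structure a richer parser could provide. A module leaves a field null
    // when it cannot be parsed reliably.
    public class SearchResult {
        public URL url;            // location of the bug report (known before parsing)
        public String title;       // short summary of the bug
        public String state;       // e.g. "confirmed" or "fixed" (illustrative)
        public Date reported;      // date of reporting (illustrative)
        public Date lastActivity;  // date of last activity (illustrative)
    }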

As the semantics provided by bug report pages vary somewhat between different sitetypes, the correct approach to reducing the parsing difficulty could be different for different search modules. In the extreme case, a module could have several parsers specific to sites that do not conform to a good default parser. Due to time constraints, it was not possible to explore this issue further, and as a result, the current version of Buglook only fills in a bug's URL and title in the SearchResult structure. Although it was not attempted, a solution similar to the one outlined in [4] is believed to be suitable.

3.2 Example: The Bugzilla Module

Buglook has a search module for Bugzilla-based repositories. Below it will serve as a specific example to illustrate how a search module works.

3.2.1 Detection

Detection is implemented in a trivial (and arguably not optimal) way: simply attempt to compile a search request, and if that works, assume that the page is a Bugzilla site. This method will give false positives (any site with a search form that uses a GET request will do!), but will never claim that a parseable page cannot be parsed.

3.2.2 Compilation

The Bugzilla module uses regular expressions over DOM trees to look for all forms that use a GET request. From these, the first one that has a text field with a non-null name is selected as the search form. Its details are then extracted from the HTML tags of the form and field. Algorithm 2 gives a more precise description of the process.

Algorithm 2 Bugzilla compilation

Require: root URL u, DOM tree D, search query q
 1: F ← find all search forms with attributes method=get and action=<anything>.cgi
 2: if F is empty then
 3:     fail
 4: end if
 5: for all forms f ∈ F do
 6:     a ← the value of attribute action
 7:     T ← find all text input fields with a non-null name attribute
 8:     if T is non-empty then
 9:         let t ∈ T
10:         n ← the value of attribute name of t
11:         break for loop
12:     end if
13: end for
14: if no n was found then
15:     fail
16: end if
17: return u/a?n=q

3.2.3 Extraction

From the several Bugzilla sites that were examined, it is notable that all use the same format for links to bug reports: show_bug.cgi?id=NUMBER. Such an observation makes it very easy to extract all links to bug reports in Bugzilla: simply collect the values of the href attributes of <a> elements (links) that have the right format. Further, Bugzilla always returns a single page of results, so the FetchRequest for the next page of results is always null.

3.2.4 Parsing

Due to time constraints, parsing was constrained to obtaining the title. The title is easily obtained by retrieving the text between the <title> and </title> HTML tags, and removing everything up to and including the first dash. Consider the HTML code:

    <title>Foo Bug - Bar does not work</title>

It reduces to the title "Bar does not work". If no dash is found, the entire string between the two tags is taken as the title. Note that in addition to the title, the SearchResult structure also requires the URL of the bug report, but it is already known before parsing.
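Both Bugzilla heuristics just described reduce to simple string operations. A sketch follows; the regular expression and method names are assumptions for illustration, not Buglook's actual code.

    import java.util.regex.Pattern;

    // The two Bugzilla-specific heuristics described above.
    public class BugzillaHeuristics {
        // Links to bug reports always have the form "show_bug.cgi?id=NUMBER".
        private static final Pattern BUG_LINK = Pattern.compile("show_bug\\.cgi\\?id=\\d+");

        public static boolean isBugLink(String href) {
            return href != null && BUG_LINK.matcher(href).find();
        }

        // Strip everything up to and including the first dash of the <title> text;
        // if there is no dash, the entire string is the title.
        public static String titleFromTitleText(String titleText) {
            int dash = titleText.indexOf('-');
            return (dash >= 0 ? titleText.substring(dash + 1) : titleText).trim();
        }
    }

For the example above, titleFromTitleText("Foo Bug - Bar does not work") yields "Bar does not work".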

4 Performance Optimizations

The performance of Buglook's simple search algorithm is about two orders of magnitude worse than that of a modern index-based textual search engine, often taking several tens of seconds to deliver a single page of 10 search results. Simple measurements conclusively show that the bottleneck is the network overhead of downloading pages, while the time spent on local processing is insignificant in comparison. Therefore, in the following discussion performance is measured in terms of how many pages are downloaded, in what order, and when.

Algorithm 3 The Buglook search algorithm (optimized with on-demand result page loading, multithreaded downloading, and site caching)

Require: root URL u_root, search query q, result range m..n
 1: if u_root has been downloaded before then
 2:     p ← the search module that was used to handle u_root
 3:     u_search ← the cached search request with the query replaced by q
 4: else
 5:     R_HTML ← download u_root
 6:     R_DOM ← parse HTML of R_HTML
 7:     for all search modules m ∈ M do
 8:         detect whether m is suitable for R_DOM
 9:         if m is suitable for R_DOM then
10:             p ← m
11:         end if
12:     end for
13:     if no suitable p was found then
14:         fail
15:     end if
16:     u_search ← compile a search request with p and q from R_DOM
17: end if
18: L ← empty list of URLs
19: S_HTML ← download u_search
20: S_DOM ← parse HTML of S_HTML
21: I ← the set of fetch requests for the next pages that contain bug reports m..n
22: for all i ∈ I do
23:     I_HTML ← download i
24:     I_DOM ← parse HTML of I_HTML
25:     add any bugs in m..n to L by extracting them from I_DOM with p
26: end for
27: X ← empty list of search results
28: for all u_bug ∈ L do
29:     start a new thread z
30:     B_HTML ← download u_bug
31:     B_DOM ← parse HTML of B_HTML
32:     x ← parse search result with p from B_DOM
33:     add x to X
34:     stop z
35: end for
36: wait for all threads to stop
37: return X
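The final loop of Algorithm 3 downloads bug pages concurrently, which in Java maps naturally onto a bounded thread pool. The sketch below assumes the PageFetcher helper sketched in section 2.1.1 and a SearchModule type exposing the fillDetails operation of section 3.1.4; both names are illustrative, not Buglook's actual classes.

    import java.net.URL;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;
    import org.w3c.dom.Document;

    // Concurrent version of the bug-download loop at the end of Algorithm 3:
    // each bug page is downloaded and parsed by a worker from a bounded pool.
    public class ParallelBugFetcher {
        public static List<SearchResult> fetchAll(List<URL> bugUrls, final SearchModule p)
                throws InterruptedException {
            final List<SearchResult> results =
                    Collections.synchronizedList(new ArrayList<SearchResult>());
            ExecutorService pool = Executors.newFixedThreadPool(8);  // bounded worker count
            for (final URL u : bugUrls) {
                pool.execute(new Runnable() {
                    public void run() {
                        try {
                            Document dom = PageFetcher.fetch(u.toString()); // download + parse HTML
                            results.add(p.fillDetails(u, dom));             // module-specific parsing
                        } catch (Exception e) {
                            // a bug page that fails to download or parse is simply omitted
                        }
                    }
                });
            }
            pool.shutdown();                             // no further tasks
            pool.awaitTermination(5, TimeUnit.MINUTES);  // "wait for all threads to stop"
            return results;
        }
    }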

4.1 Search List Reuse

When not told otherwise, Buglook defaults to showing only the first 10 results of a query. Even when using the simple algorithm, it will download only these 10 results, even though it has the list of all bug reports. However, when another set of results is requested from that known list, the simple algorithm has to be executed again. Once again it performs the detection, compilation, and extraction operations (which download exactly two pages in total), and then parses the new set of results.

Clearly, a better solution is to reuse the already known list of URLs and to download only the new set of results in order to parse them. Unfortunately, doing so is not as simple as it sounds, because it would require the engine to keep state for every search. A request for further results may come at an arbitrary later time, if ever, and in addition the engine cannot distinguish a request for more results from a new search with the same query. One possible solution is the use of browser cookies to keep state, falling back to timeout-constrained server-side state management (or no state management at all) when cookies are not available.

A natural question is whether reusing the list of URLs is really necessary. Intuitively, fetching a root page and a single search result page requires two page downloads - a number that does not grow as the number of total results increases. Unfortunately, the time that it takes for the bug tracker to return its search result page grows linearly with the number of matching bugs, and therefore the overhead of reconstructing the URL list can become very large for general queries with many results. This observation justifies the need for URL list reuse.

4.2 On-demand Result Page Loading

For simplicity, the previous subsection assumed a single page of search results. Constructing the list of bug URLs becomes even more expensive when there are several pages of results returned by the bug tracker, chained via "Next" links. The simple search algorithm will download all of them in order, an approach which clearly does not scale.

One solution is to derive the number of bugs per bug tracker result page from the first such page obtained after compilation. Then, given an interval of requested bugs (e.g. 71 to 80), Buglook would only download the chain of "Next" pages up to the upper bound of the bugs it needs to display. Going further, downloading only the bug tracker pages that contain the requested bugs (and not everything that comes before them in the chain) requires the ability to fetch a page by number, rather than following "Next" links. That in turn requires extending the search module interface with a function (called "next") that returns a FetchRequest for the right page, derived from links on the first result page. Because not all bug trackers support jumping to an arbitrary page of results, the new function could be specified as an optional operation that may or may not be supported by search modules, as sketched below.
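Such an optional operation might be expressed as a separate interface that a module implements only when its sitetype supports jumping to an arbitrary result page; the interface and method signature below are hypothetical, since the paper only names the operation "next".

    import java.net.URL;
    import org.w3c.dom.Document;

    // Hypothetical optional extension of the search module interface.
    public interface PageAddressable {
        // Derive a fetch request for result page pageNumber from the links on the
        // first result page, instead of walking the chain of "Next" links.
        FetchRequest next(URL base, Document firstResultPage, int pageNumber)
                throws SearchModuleException;
    }

Buglook could test for the capability with instanceof and fall back to following "Next" links when a module does not implement it.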

4.3 Multithreaded Downloading

While URL list construction is expensive, downloading the required range of search results is usually even more expensive. One reason why it takes so long is that the simple algorithm always downloads pages in sequence. A significantly better solution is to download bug pages concurrently in several threads, up to a reasonable limit. While such a setup reduces the impact of network latency at the expense of network bandwidth, it is important to note that Buglook only fetches the HTML of a page, without any images or other high-volume objects, and therefore bandwidth is unlikely to become a bottleneck.

4.4 Bug Caching

Multithreaded downloading in turn poses a different problem: what if the bug tracker becomes overloaded? That, and the fact that downloading will always take time whether multithreaded or not, makes it necessary to avoid downloading altogether whenever possible. Given that all bug trackers assign a unique identifier to a given bug, it is trivial to distinguish one bug from another, even if all data fields for a bug are changed completely (except the identifier, which cannot change). Intuitively, this should make it possible to cache bugs that have been downloaded before, and to compare search URLs against the cached identifiers to see whether a bug is known or not. A bug with a matching known id could be served from the cache, as sketched below.

Unfortunately, there is no general mechanism to detect whether a bug has changed from a known previous state without actually downloading its current page. Further complicating matters is the fact that all pages are dynamically generated by the bug tracker server when they are requested. A tracker has no obligation to produce the same web page twice for the same data, as there can be subtle differences in content such as retrieval time, current time on the server, and others. Moreover, popular important bugs can be modified many times within a day, while older ones sometimes stay untouched for months; therefore, no reasonable cache timeout can be defined. In other words, correct caching is the single most difficult problem that Buglook needs to solve in order to become as fast as classic search engines.
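A minimal sketch of such an id-keyed cache follows; it assumes the SearchResult sketch from section 3.1.4 and deliberately leaves out any invalidation policy, which is exactly the open problem described above.

    import java.util.HashMap;
    import java.util.Map;

    // Sketch of a bug cache keyed by the tracker's unique bug identifier. The
    // hard part - deciding when an entry is stale - is the open problem described
    // above, so invalidation is left entirely to the caller.
    public class BugCache {
        private final Map<String, SearchResult> byId = new HashMap<String, SearchResult>();

        public SearchResult lookup(String bugId) {
            return byId.get(bugId);  // null means the bug must be downloaded
        }

        public void store(String bugId, SearchResult parsed) {
            byId.put(bugId, parsed);
        }

        public void invalidate(String bugId) {
            byId.remove(bugId);      // e.g. when a fresh download reveals a change
        }
    }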

4.5 Site Caching

An easier if less capable optimization is caching search information about specific websites. Once a given URL has been detected and a search request has been compiled successfully, it is unlikely that the mechanism for specifying search requests for that particular site will ever change. If Buglook were extended to remember how to search each URL it encounters, it could skip downloading a root page and doing the local processing required to identify a suitable search module. If a search on a known URL ever failed (because of a large system change on the tracker server), the URL would be deleted from the cache, and the entire algorithm would be run again.

4.6 User Interface Optimization

A final optimization can be applied to the user interface component of Buglook. Currently it serves pure HTML pages, which must be completely defined before they are shipped to the user. In other words, although the search engine knows most of the important information that will end up on a result page (the list of URLs) relatively early in the search process, it must wait until it has retrieved all other information before it can deliver anything. An alternative approach would return a web page consisting only of URLs, and would update that page as more information about the entries becomes available. Rather than page reloads, a technology such as AJAX [3] could be used to implement this in an unobtrusive manner.

5 Conclusion

Buglook is an initial version of a new type of search engine - one that knows what it is looking for. By reusing the capabilities of the websites it searches, it can be used as a convenient tool to quickly search a large number of bug repositories of a known type. It has no infrastructure overhead, as it does not require any prior indexing or web crawling, and it encapsulates all general functionality away from all sitetype-specific code, in order to be readily extensible. As some or all of the previously described performance optimizations are implemented in the future, it is believed that Buglook can improve its performance by a factor as large as ten, and consequently become as fast as traditional search engines that do not attempt to derive the structure of the content they search. Finally, the core idea behind Buglook could certainly be extended beyond bug repository systems to any application domain that deals with structured or semi-structured data.

References

[1] World Wide Web Consortium. Document Object Model (DOM).
[2] Apache Software Foundation. HttpClient Home.
[3] Jesse James Garrett. Ajax: A new approach to web applications.

[4] N. Kushmerick. Wrapper induction: Efficiency and expressiveness. Artificial Intelligence, pages 15-68.
[5] The Mozilla Organization. About :: Bugzilla.
[6] Sergey Brin and Lawrence Page. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the Seventh International Conference on World Wide Web (WWW7). Elsevier Science Publishers B.V.
[7] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report SIDL-WP.
[8] SourceForge. JTidy.


Migration Manager v6. User Guide. Version 1.0.5.0 Migration Manager v6 User Guide Version 1.0.5.0 Revision 1. February 2013 Content Introduction... 3 Requirements... 3 Installation and license... 4 Basic Imports... 4 Workspace... 4 1. Menu... 4 2. Explorer...

More information

Setting Up Dreamweaver for FTP and Site Management

Setting Up Dreamweaver for FTP and Site Management 518 442-3608 Setting Up Dreamweaver for FTP and Site Management This document explains how to set up Dreamweaver CS5.5 so that you can transfer your files to a hosting server. The information is applicable

More information

Short notes on webpage programming languages

Short notes on webpage programming languages Short notes on webpage programming languages What is HTML? HTML is a language for describing web pages. HTML stands for Hyper Text Markup Language HTML is a markup language A markup language is a set of

More information

Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications

Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications Comparing Microsoft SQL Server 2005 Replication and DataXtend Remote Edition for Mobile and Distributed Applications White Paper Table of Contents Overview...3 Replication Types Supported...3 Set-up &

More information

Coveo Platform 7.0. Oracle Knowledge Connector Guide

Coveo Platform 7.0. Oracle Knowledge Connector Guide Coveo Platform 7.0 Oracle Knowledge Connector Guide Notice The content in this document represents the current view of Coveo as of the date of publication. Because Coveo continually responds to changing

More information

Electronic Ticket and Check-in System for Indico Conferences

Electronic Ticket and Check-in System for Indico Conferences Electronic Ticket and Check-in System for Indico Conferences September 2013 Author: Bernard Kolobara Supervisor: Jose Benito Gonzalez Lopez CERN openlab Summer Student Report 2013 Project Specification

More information

EUR-Lex 2012 Data Extraction using Web Services

EUR-Lex 2012 Data Extraction using Web Services DOCUMENT HISTORY DOCUMENT HISTORY Version Release Date Description 0.01 24/01/2013 Initial draft 0.02 01/02/2013 Review 1.00 07/08/2013 Version 1.00 -v1.00.doc Page 2 of 17 TABLE OF CONTENTS 1 Introduction...

More information

Webapps Vulnerability Report

Webapps Vulnerability Report Tuesday, May 1, 2012 Webapps Vulnerability Report Introduction This report provides detailed information of every vulnerability that was found and successfully exploited by CORE Impact Professional during

More information

SharePoint Integration Framework Developers Cookbook

SharePoint Integration Framework Developers Cookbook Sitecore CMS 6.3 to 6.6 and SIP 3.2 SharePoint Integration Framework Developers Cookbook Rev: 2013-11-28 Sitecore CMS 6.3 to 6.6 and SIP 3.2 SharePoint Integration Framework Developers Cookbook A Guide

More information

Exploratory Testing in an Agile Context

Exploratory Testing in an Agile Context Exploratory Testing in an Agile Context A guide to using Exploratory Testing on Agile software development teams. Elisabeth Hendrickson 2 Exploratory Testing. So you bang on the keyboard randomly, right?

More information

Challenges of Automated Web Application Scanning

Challenges of Automated Web Application Scanning 1 Challenges of Automated Web Application Scanning "Why automated scanning only solves half the problem." Blackhat Windows 2004 Seattle, WA Jeremiah Grossman (CEO) WhiteHat Security, Inc. 2 Speaker Bio

More information

Installation & User Guide

Installation & User Guide SharePoint List Filter Plus Web Part Installation & User Guide Copyright 2005-2011 KWizCom Corporation. All rights reserved. Company Headquarters KWizCom 50 McIntosh Drive, Unit 109 Markham, Ontario ON

More information

Using Steelhead Appliances and Stingray Aptimizer to Accelerate Microsoft SharePoint WHITE PAPER

Using Steelhead Appliances and Stingray Aptimizer to Accelerate Microsoft SharePoint WHITE PAPER Using Steelhead Appliances and Stingray Aptimizer to Accelerate Microsoft SharePoint WHITE PAPER Introduction to Faster Loading Web Sites A faster loading web site or intranet provides users with a more

More information

Make search become the internal function of Internet

Make search become the internal function of Internet Make search become the internal function of Internet Wang Liang 1, Guo Yi-Ping 2, Fang Ming 3 1, 3 (Department of Control Science and Control Engineer, Huazhong University of Science and Technology, WuHan,

More information

Test Run Analysis Interpretation (AI) Made Easy with OpenLoad

Test Run Analysis Interpretation (AI) Made Easy with OpenLoad Test Run Analysis Interpretation (AI) Made Easy with OpenLoad OpenDemand Systems, Inc. Abstract / Executive Summary As Web applications and services become more complex, it becomes increasingly difficult

More information

FileMaker Server 15. Custom Web Publishing Guide

FileMaker Server 15. Custom Web Publishing Guide FileMaker Server 15 Custom Web Publishing Guide 2004 2016 FileMaker, Inc. All Rights Reserved. FileMaker, Inc. 5201 Patrick Henry Drive Santa Clara, California 95054 FileMaker and FileMaker Go are trademarks

More information

Web Caching With Dynamic Content Abstract When caching is a good idea

Web Caching With Dynamic Content Abstract When caching is a good idea Web Caching With Dynamic Content (only first 5 pages included for abstract submission) George Copeland - copeland@austin.ibm.com - (512) 838-0267 Matt McClain - mmcclain@austin.ibm.com - (512) 838-3675

More information

IRLbot: Scaling to 6 Billion Pages and Beyond

IRLbot: Scaling to 6 Billion Pages and Beyond IRLbot: Scaling to 6 Billion Pages and Beyond Presented by Xiaoming Wang Hsin-Tsang Lee, Derek Leonard, Xiaoming Wang, and Dmitri Loguinov Internet Research Lab Computer Science Department Texas A&M University

More information

GUI Test Automation How-To Tips

GUI Test Automation How-To Tips www. routinebot.com AKS-Labs - Page 2 - It s often said that First Impression is the last impression and software applications are no exception to that rule. There is little doubt that the user interface

More information

Defining and Signaling Relationships Between Domains

Defining and Signaling Relationships Between Domains Defining and Signaling Relationships Between Domains Casey Deccio John Levine Abstract Various Internet protocols and applications require some mechanism for determining whether two Domain Name System

More information

EXTENDING JMETER TO ALLOW FOR WEB STRUCTURE MINING

EXTENDING JMETER TO ALLOW FOR WEB STRUCTURE MINING EXTENDING JMETER TO ALLOW FOR WEB STRUCTURE MINING Agustín Sabater, Carlos Guerrero, Isaac Lera, Carlos Juiz Computer Science Department, University of the Balearic Islands, SPAIN pinyeiro@gmail.com, carlos.guerrero@uib.es,

More information

White Paper. Java Security. What You Need to Know, and How to Protect Yourself. 800.266.7798 www.inductiveautomation.com

White Paper. Java Security. What You Need to Know, and How to Protect Yourself. 800.266.7798 www.inductiveautomation.com White Paper Java Security What You Need to Know, and How to Protect Yourself Java Security: What You Need to Know, and How to Protect Yourself Ignition HMI, SCADA and MES software by Inductive Automation

More information

MUSICIAN WEB-SERVICE USING RUBY-ON-RAILS, SOAP, FLEX & AJAX

MUSICIAN WEB-SERVICE USING RUBY-ON-RAILS, SOAP, FLEX & AJAX RIVIER ACADEMIC JOURNAL, VOLUME 2, NUMBER 2, FALL 2006 MUSICIAN WEB-SERVICE USING RUBY-ON-RAILS, SOAP, FLEX & AJAX John A. Dion* M.S. in Computer Science, Rivier College 2006 Keywords: musician management

More information