SYNTACTICAL INTEGRATION OF PRODUCT INFORMATION FROM SEMI-STRUCTURED SOURCES

Department of Computer Science, Institute for Systems Architecture, Chair of Computer Networks

Diplomarbeit

SYNTACTICAL INTEGRATION OF PRODUCT INFORMATION FROM SEMI-STRUCTURED SOURCES

Ludwig Hähne
Mat.-Nr.:

Supervised by:
Dipl.-Medieninf. Maximilian Walther
Prof. Dr. rer. nat. habil. Dr. h. c. Alexander Schill

Submitted on July 16, 2009


ABSTRACT

This thesis presents a novel product information retrieval and extraction system. The goal is to provide a solution which automatically locates the manufacturer's page of a given product and extracts relevant product attributes. The document retrieval subsystem exploits multiple web search services and uses various heuristics to improve the ranking. The unsupervised extraction of product attributes is based on syntactic features of the product pages. XPath queries are used to cluster and select genuine product attributes from web documents. Three different extraction rule induction algorithms are presented. One variant uses multiple training documents, another incorporates already extracted data, and a supervised solution falls back on user-supplied examples. A web crawler was developed which automatically retrieves pages sharing common underlying page templates. The implementation extends an experimental federated search engine developed at the TU Dresden. The extracted product attributes are meant to enrich already available data with first-hand information gathered from the respective manufacturer sites. The system was evaluated according to a gold standard. Considering the low expenses in terms of user guidance effort and execution time, the system exhibits good precision and recall metrics.


CONFIRMATION

I confirm that I independently prepared the thesis and that I used only the references and auxiliary means indicated in the thesis.

Dresden, July 16, 2009


CONTENTS

1 Introduction
2 State of the Art
  Document Retrieval
    Document Model
    Retrieval Effectiveness
    Web Crawler
    Summary
  Information Extraction
    Data Model
    Wrapper Induction
    Supervised Information Extraction
    Semi-Supervised Information Extraction
    Unsupervised Information Extraction
    Case Studies
    Summary
  Information Integration
  Legal Considerations
  Fedseeko
    Producer Information Integration
  Summary
3 Requirements
  Information Description
    Product Pages
  Functional Description
  Behavioral Description
  Validation Criteria
  Summary
4 Design
  Data Design
  Retrieving Product Pages
  Information Extraction from Product Pages
  Architectural Design
  Fedseeko Integration
  Summary
5 Implementation
  Product Page Retrieval
    Locating the Producer Site
    Locating the Product Page
    Crawling Related Product Pages
    Locator Architecture
  Information Extraction Prototype
    Data Regions
    Phrase Matching
    Phrase Clustering
    XPath Query Generalization
    Wrapper Induction
    Conclusion
  Information Extraction Implementation
    Wrapper Induction
    Attribute Extraction
    Selecting a Wrapper
    Architecture of the Web IE Subsystem
  Fedseeko Integration
  Summary
6 Evaluation
  Feature Comparison
  Effectiveness and Performance Evaluation
    Test Products
    Product Page Retrieval Effectiveness
    Related Page Crawling Effectiveness
    Information Extraction Effectiveness
  Summary
7 Conclusion
  Future Work
A Glossary


LIST OF FIGURES

2.1 Interplay of document retrieval, information extraction and integration in web data extraction
2.2 Template-driven web page creation from database records
2.3 Different wrapper induction strategies [CKGS06]
2.4 General tree mapping example [ZL05]
2.5 Iterative partial tree alignment example [ZL05]
2.6 Wrapper induction example for RoadRunner [CMM01]
2.7 Input pages in ExAlg [AGM03]
2.8 Generalized nodes and data regions in DEPTA [ZL05]
2.9 Fedseeko architecture [WSS09]
3.1 Overview of information flow
3.2 Product page example with the extraction targets being highlighted
Information flow during extraction
Selecting a product page from a set of candidates using multiple techniques
Navigating to a related product page (Nikon D90 to Nikon D3X)
Examples of specification data embedded in different containers
Clustering text nodes from multiple documents
Source code of the two pages from figure
Architecture overview of the complete system
Ranking a set of candidate documents using multiple techniques
5.2 Architecture of the DR subsystem
Supervised retrieval and extraction
Architecture of the IE subsystem
Fedseeko product administration view
Word cloud visualizing the most common terms in key phrases
Effectiveness of locating the right producer sites and product pages
Product page retrieval runtime performance distribution
Number of successful operations of each isolated component
Correctness and completeness of extraction results
Example of a nested template page
Example of specification page for multiple products
Information extraction runtime performance

1 INTRODUCTION

The World Wide Web is a place where millions if not billions of products are marketed, searched, sold, bought and reviewed. Potential customers have a multitude of different sources at their disposal to facilitate a purchase decision. There are various product review sites, web shops provide product descriptions, blogs gain popularity as information resources and there is the information published by the product's manufacturer. An important factor is the reliability of the individual information sources. When it comes to buying an expensive product, a customer probably prefers to resort to the most reliable source of information. However, it is getting increasingly difficult to find first-hand product information via a simple web search. To reach potential customers, manufacturers have to compete with many other information providers in order to receive attention and a good search engine rank.

Nowadays, web search engines are the single point of contact interfacing to the exuberant information in the World Wide Web. However, today's web search engines predominantly only inform about the whereabouts of data and still cannot answer complex queries. It is very difficult to do better as long as the web content is not semantically interwoven. Tim Berners-Lee is not alone in believing the Semantic Web to be the future of the Internet [BLHL01]. Instead of phrasing keyword queries and wading through search results to find relevant information, the vision is letting the Semantic Web answer actual questions. In the context of product information retrieval one might want to ask questions like: How much power does the latest Siemens refrigerator consume compared to its predecessor and the new flagship product of Penguin Electrics?

As old as this vision is, it still has a long way to go. Web developers are required to semantically describe their data in languages that may seem too complex and lavish to pick up easily. Especially the lack of obvious short-term benefits may impede the adoption of Semantic Web technologies. It is not helpful either that a semantic query system needs a somewhat complete knowledge base in the target domain to be valuable for a potential user. But what if semantic data could be condensed out of existing web pages?

One idea is to bridge the gap between the "syntactic" Web and the Semantic Web by automatically transferring information from traditional web pages into a semantic context with the help of information extraction techniques. Admittedly, information extraction systems will not immediately provide the anticipated power of the Semantic Web without further efforts. But these systems might help to facilitate the migration process in some well-defined domains, one of which might be product information extraction. With an automatic product information extraction and integration system at hand, it would be possible to find similar products based on all kinds of feature-related criteria. It would also relieve the customer of retrieving the producer information for the interesting products manually. Furthermore, such a system would be manufacturer- and vendor-independent.

This work presents a novel approach towards automatic Web information extraction and strives to become an enabling technology for product information integration. A prototype implementation was developed and integrated into a federated search engine, demonstrating the practical viability for product information integration and its inherent challenges: locating product pages, automatically collecting training data for pattern mining, and identifying and extracting valuable product data. The product page location component resorts to multiple web search services and incorporates various heuristics to optimize the retrieval precision. The extraction exploits structural characteristics of template-generated web pages. Extraction rules are stored as XPath queries in the system. A low-complexity clustering algorithm is utilized to derive these extraction rules. Three algorithms are proposed, corresponding to different degrees of automation.

Chapter two provides the theoretical background and discusses the state of the art in Web information extraction and related fields of research. In chapter three the requirements of the novel product information extraction system are analyzed. The subsequent chapters deal with the design and implementation of the software system. Chapter six dissects the advantages and drawbacks of the presented solution and evaluates the system according to a gold standard. Finally, a summary and an outlook are given.

2 STATE OF THE ART

Integrating information from the World Wide Web into a local database relies on three major components, as depicted in figure 2.1. In this chapter, important concepts of document retrieval and information extraction are outlined and an overview of the state of the art in each field is given. This work strongly focuses on information extraction and thus presents a selection of existing information extraction systems. Information integration is covered briefly for the sake of completeness. The chapter closes with the presentation of Fedseeko, the system into which the new information extraction system shall be integrated.

Figure 2.1: Interplay of document retrieval, information extraction and integration in web data extraction

2.1 DOCUMENT RETRIEVAL

"Knowledge is of two kinds. We know a subject ourselves, or we know where we can find information upon it."
Samuel Johnson

Information retrieval (IR) is often only loosely defined. Moreover, in the context of most retrieval systems information retrieval actually refers to document retrieval. In effect, information retrieval shall be synonymous to document retrieval (DR) in this thesis. But for being less ambiguous, the latter term is preferred. Lancaster gives the following definition of IR that also draws a dividing line separating related fields of research like fact retrieval or question answering [Lan68]:

Definition 1 (Information Retrieval) An information retrieval system does not inform (i.e. change the knowledge of) the user on the subject of his inquiry. It merely informs on the existence (or non-existence) and whereabouts of documents relating to his request.

Document retrieval aims to find relevant information from a large corpus of documents. Given a user query, traditional DR systems identify and rank documents in a corporate or library network or on a single host (e.g. desktop search). In the context of the Internet, DR is an important foundation of web search technologies, with web pages building the document corpus. Due to the vast amount of web content with trillions of web pages, web search systems have different requirements than traditional DR systems.

User queries normally are lists of words. Based on a query, the DR system finds relevant documents by matching the query tokens with the documents' contents. In the simplest case, each word occurring in the query must also occur in the document. Phrase queries are also a very common instrument in IR. In addition, the query may contain Boolean operators or means to express that two tokens must occur near each other. However, complex query constructs are rarely used in practice as those make the DR task more difficult for the users. In the following, DR document models, effectiveness metrics and web crawlers are discussed.

2.1.1 Document Model

The document model specifies how the documents and queries are represented and governs how the relevance of a document with respect to a query is computed. A document can be modeled in many different ways. It is common to most models that documents and queries are treated as a "bag of words or terms" in which term sequence and position are ignored [Liu06]. An important characteristic of document models is whether and how term interdependencies are modeled. In the simplest case, each word is treated independently. According to Kuropka, the various approaches can be divided into set-theoretic models (e.g. Boolean model), algebraic models (e.g. vector-space model) and probabilistic models [Kur04].

The different models will be briefly presented in the following.

In the Boolean model each term is only checked for its presence or absence in a document. A query in a Boolean retrieval system can be given as a logical equation combining terms with logic operators, e.g. "James Joyce" AND Trieste. A document is relevant with respect to the query if the contained set of terms makes the query logically true. Boolean models have the disadvantage that no ranking can be derived from the simple definition of the problem. Neither the term frequency is examined nor does the model permit inexact matches.

In the vector-space model a document is represented by an n-dimensional vector, in which each dimension represents a distinct term of the vocabulary from the whole document corpus. The weight of the term is computed from its occurrence characteristic in the document. The query is also modeled as such a vector. Now the relevance of the document with respect to the query can be computed as the cosine of the angle between the two vectors, defined as the cosine similarity (see equation (2.1)).

\cos \theta = \frac{d \cdot q}{\lVert d \rVert \, \lVert q \rVert} \quad (2.1)

An example for a probabilistic approach are language models, which were first proposed for document retrieval by Ponte and Croft [PC98]. In a statistical language model a probability distribution of the n-grams is computed for each document in the corpus. The idea is to derive the ranking of a document d_i with respect to a query q from the a posteriori probability P(d_i | q). This is essentially the likelihood of the query being generated by the respective language model.

The ranking derived from the degree of relevance is governed by the internal document model. It may reflect poorly the actual relevance of documents as perceived by the user. Thus, effectiveness metrics are required to evaluate the performance of a DR system.

2.1.2 Retrieval Effectiveness

Numerous metrics have been proposed to measure the performance of DR systems. The most commonly used metrics are precision and recall. Assuming a document is either relevant or irrelevant with respect to a query, precision is the fraction of relevant documents in the set of retrieved documents. In contrast, recall is the ratio of the number of relevant documents retrieved to the total number of relevant documents (including those that were not retrieved). Both metrics are related and are most often examined in context of each other. For example, it is trivial to achieve 100% recall by just returning all documents for every query. However, the precision metric would immediately reveal the deficiency of such an approach. Another commonly used metric is the F-score (or F-measure), which is defined as the weighted harmonic mean of precision and recall.

Web search engines typically present search results in buckets of around ten documents. Users, however, do not consider search results beyond the first few result pages. In effect, a relevant but very low ranked 1 document is essentially useless from the user's perspective. Therefore, the ranking is also considered in the performance evaluation by only examining the first i search results.

Let D be the whole document corpus set. A query is submitted to a given DR system. D_retrieved ⊆ D is an ordered set of all retrieved documents while D^i_retrieved are the i top-ranked documents returned by the system. D_relevant ⊆ D is the set of all relevant documents. The effectiveness metrics can be computed according to the equations (2.2).

\text{precision}(i) = \frac{|D_{relevant} \cap D^i_{retrieved}|}{|D^i_{retrieved}|} \qquad \text{recall}(i) = \frac{|D_{relevant} \cap D^i_{retrieved}|}{|D_{relevant}|} \qquad \text{F-score}(i) = \frac{2 \cdot \text{precision}(i) \cdot \text{recall}(i)}{\text{precision}(i) + \text{recall}(i)} \quad (2.2)

1 Depending on the user's web browsing behavior and motivation he or she might wade through one hundred search results but more likely not more than ten search results will be considered by the user.
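The metrics of equation (2.2) translate directly into a few lines of code. The following sketch is an illustration only, with a hypothetical ranked result list and relevance judgments; it is not taken from the thesis' evaluation code.

def precision_recall_fscore(ranked_results, relevant, i):
    """Compute precision, recall and F-score over the top-i results (equation 2.2)."""
    top_i = ranked_results[:i]
    hits = len([doc for doc in top_i if doc in relevant])
    precision = hits / len(top_i) if top_i else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score

# Hypothetical ranked result list and relevance judgments
ranked = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d2", "d3"}
print(precision_recall_fscore(ranked, relevant, 3))  # (0.666..., 0.666..., 0.666...)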

In order to identify all relevant documents, the DR system first needs to be aware of the existence and whereabouts of the individual web pages. The gathering of web pages is performed by a web crawler, which is presented in the following section.

2.1.3 Web Crawler

Web IR systems have to gather web pages to build a document index. This non-trivial task is performed by a web crawler, also known as spider or robot. Web crawlers recursively follow links in web pages to build a document index. As the Internet is constantly evolving, web sites need to be visited regularly to account for new or changed content.

Definition 2 (Spider [FOL09]) A program that automatically explores the World-Wide Web by retrieving a document and recursively retrieving some or all the documents that are referenced in it.

The best known web crawlers are universal crawlers which operate on behalf of web search engines collecting the data for the document index. According to Brin and Page, the crawler is the most fragile component of a search engine because it has to interact with millions of remote servers all beyond the control of the system [BP98]. Thus, a crawler has to be very robust and handle a multitude of corner cases even if that might affect only a single page. Crawlers may impose a huge stress on the resources of the respective hosts if the request rate is not limited, leading to denial of service attacks in the worst case. Furthermore, crawlers should identify themselves and comply with the robots exclusion standard 2.

In addition to universal crawlers, there are focused and topical crawlers exploring the Web based on user preferences. These try to find only pages relating to a category of interest or being similar to a set of seed pages. Focused crawlers differ from universal crawlers in their strategy of picking URLs that shall be visited.

2.1.4 Summary

Important concepts of IR have been briefly presented. Information retrieval in terms of document retrieval is just a first step in obtaining and processing knowledge in an information system. The next step of information processing is the extraction of more fine-grained information from the retrieved documents.

2.2 INFORMATION EXTRACTION

"Get your facts first, and then you can distort them as much as you please."
Mark Twain

Roughly speaking, information extraction (IE) aims to condense knowledge about a specific domain of interest. Attributes of the domain's entities or facts are distilled from one or more input documents. The goal of IE is enabling the information system to reason based on the extracted data. For example, an IE system that collects facts on the world's countries may extract attributes such as population, capital or natality [UTF08].

Definition 3 (Information Extraction [SAI01]) Rather than indicating which documents need to be read by a user, [Information Extraction] extracts pieces of information that are salient to the user's needs.

IE produces structured data from unstructured and semi-structured documents. Semi-structured data typically refers to tables and lists, which are characteristic for web pages. Whether a document is perceived as structured or unstructured depends on the research domain. Databases are typically regarded as structured data while free text is commonly classified as unstructured. The classification, however, cannot be solely based on the data format. It is quite possible to dump a whole unstructured document into a single database record, or strictly format a text file as a sequence of key-value tuples. Similarly, an HTML body may contain an unstructured stream of free text or a fine-grained table. Nevertheless, in the IE community, HTML is commonly classified as semi-structured data, while XML documents with available meta-data are considered structured [CKGS06]. The dividing line between semi-structured and structured data is drawn between documents containing some kind of syntactic structuring elements (e.g. HTML tags) and semantic tags of the data.

While IE for unstructured documents like free text has been thoroughly investigated during the last decades, as indicated by the success of the Message Understanding Conferences [Gri97], IE for semi-structured documents has received growing interest from researchers during the last years. For the respective tasks, different techniques are required.

2 The robot exclusion standard or protocol is a de facto standard described at org/wc/robots.html.

Traditional IE needs to extract knowledge from human language texts and typically uses lexicons and grammars to achieve this goal. Web IE takes advantage of the fact that web pages are often automatically generated from (structured) database records. Because web pages are created by static templates, machine learning and pattern recognition techniques can be applied to analyze the syntactic structure of the documents.

Web scraping or screen scraping are used synonymously for Web IE. The Jargon File 3 gives the following definition of screen scraping stressing the unintended usage of the medium.

Definition 4 (Screen scraping) The act of capturing data from a system or program by snooping the contents of some display that is not actually intended for data transport or inspection by programs. [...] it often refers to parsing the HTML in generated web pages with programs designed to mine out particular patterns of content.

Chang et al. give an overview of contemporary Web information extraction systems and categorize those based on task difficulty, extraction technique and degree of automation [CKGS06].

2.2.1 Data Model

In the following, a generic IE data model is described informally. The chosen model is derived from the data model known from relational databases and is also referenced by other IE researchers [AGM03, Liu06]. According to this model, the data is structured as nested relations made up of basic types arranged in tuples and sets. A basic type B is an atomic entity, typically a string in the context of web pages. The tuple type ⟨T_1, T_2, ..., T_n⟩ is an ordered collection of other types T_i. Tuples map to data records in a database context. Set types {T} are constructed by multiple elements of the same type T, like a list of equally typed tuples.

Let S be the schema of a book description. The data record (tuple) describing a book might comprise the title, a set of authors, the publisher of the book and the number of pages. Then the schema can be described as S = ⟨B_title, {B_name}_authors, B_publisher, B_pages⟩. An instance of S is the value x = ⟨"Ulysses", {"James Joyce"}, "Penguin", 1040⟩.

A template-based semi-structured page is created from one or more data records stored in a database and a template, as illustrated in figure 2.2 on the next page. A template maps instances of a certain schema to a web page. More formally, an encoded web page P is created from a data record x and a template T via a template mapping function λ. Thus, the page creation process can be modeled as P = λ(T, x). The IE task is to extract x from P with T being unknown. If λ⁻¹_T is the extraction function associated with template T, x' = λ⁻¹_T(P) is performed by the extractor. The schema of the extracted data rarely matches the model of the original data schema. Either the IE system is not able to extract all data fields or only a subset of the data fields are required.

3 The Jargon File is "a comprehensive compendium of hacker slang illuminating many aspects of hackish tradition, folklore, and humor." [Jar03]

Figure 2.2: Template-driven web page creation from database records

Therefore, x' is generally an incomplete approximation of the original data record x. For example, the schema for the extracted data in the running example might be S' = ⟨B_title, B_authors, B_publisher⟩ with the nested data records for the authors collapsed into a single data field and the page count being omitted. In practice, many IE systems use simpler data models for the extraction targets than the one described. Particularly, nesting of set and tuple types is not supported by the majority of the available IE systems. An example template for the running example is given in listing 2.1 using a pseudo template language.

<html><body>
<h1>books</h1>
<ul>
<li><b>Title:</b> <i><% print book.title %></i></li>
<% for each author in book.authors %>
<li><b>Author:</b> <% print author.name %></li>
<% end %>
<li>Publisher: <% print book.publisher %></li>
<li><% print book.pages %> pages</li>
</ul>
</body></html>

Listing 2.1: Template example
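The page creation model P = λ(T, x) can be pictured with a toy rendering function that plays the role of the template of listing 2.1. This is a minimal illustration under assumed names and data layout; it is not the mechanism used by any of the cited systems.

# A nested data record following the schema S = <B_title, {B_name}_authors, B_publisher, B_pages>
book = {
    "title": "Ulysses",
    "authors": ["James Joyce"],
    "publisher": "Penguin",
    "pages": 1040,
}

def render(record):
    """Toy template mapping function: lambda(T, x) producing an encoded HTML page P."""
    items = [f"<li><b>Title:</b> <i>{record['title']}</i></li>"]
    items += [f"<li><b>Author:</b> {name}</li>" for name in record["authors"]]
    items.append(f"<li>Publisher: {record['publisher']}</li>")
    items.append(f"<li>{record['pages']} pages</li>")
    return "<html><body>\n<h1>books</h1>\n<ul>\n" + "\n".join(items) + "\n</ul>\n</body></html>"

page = render(book)  # P = lambda(T, x); the IE task is to recover x from P without knowing T
print(page)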

So far, the web page has been assumed to be a static document. However, techniques such as Ajax allow operations to be performed asynchronously, for example the deferred loading of additional content using XMLHttpRequest [Gar05]. This poses new challenges for DR and IE systems if relevant information becomes only available after performing a certain action, like clicking a link or button on the page. A potential solution to remedy this problem is to drive a full-fledged web browser with a JavaScript interpreter and use a plug-in like Watir 4 to store static snapshots of the dynamic page.

4 Watir is an open-source library for automating web browsers.

As has already been identified in this section, the goal of the Web IE system is to extract data embedded in web pages created from a template. This task is performed by a wrapper program which may be hand-crafted or automatically generated. Wrapper generation techniques are discussed in the following section.

2.2.2 Wrapper Induction

According to a very general definition, a wrapper provides an interface to an entity and allows it to be treated as if being something else. In the Web IE context, a wrapper allows a web page to be regarded as a database record. Consequently, the wrapper is responsible for extracting one or more data records from web pages.

Early IE systems were programmed manually. 5 A set of web documents is examined and common patterns have to be identified by a human operator. Recurrent patterns enable the programmer to write a wrapper for extracting the target data, either manually or aided by pattern specification languages. The hand-crafted wrapper should then be able to extract data from documents sharing the same template.

<html><body>
<h1>books</h1>
<ul>
<li><b>Title:</b> <i>Ulysses</i></li>
<li><b>Author:</b> James Joyce</li>
<li>Publisher: Penguin</li>
<li>1040 pages</li>
</ul>
</body></html>

Listing 2.2: Sample web page

Listing 2.2 shows a simple web page generated from the aforementioned template. Assuming the extraction task is to extract the book's title, the programmer might write a program that skips to the <i> tag and extracts the text that follows until the closing </i> tag. Alternatively, regular expressions or XPath queries could be used. The different variants to represent extraction rules are discussed on page 15. Manually programmed wrappers are prone to failures when templates change, require knowledge of the employed technologies and are very labor-intensive. In contrast to manually specifying extraction rules, wrapper induction systems derive these from a set of training documents with various degrees of automation.

Regardless of how the wrapper was generated, Web IE systems have to deal with the problems of wrapper verification and wrapper repair. A wrapper relies on the extraction targets to be encoded in a certain way. However, web pages are subject to change and information providers may choose to replace their templates at any time. This causes hardship for wrapper maintenance. The detection of whether the wrapper is suited to extract data from a presented page is called the wrapper verification problem. 6 Adapting the wrapper to a changed template is called the wrapper repair problem. A way to approach both problems is to learn and verify characteristic patterns of the target data. In case of failure, the patterns can be used in attempting to adapt the wrapper to the new template. However, both tasks are very difficult to solve and are still an active research area [Liu06].

5 Special purpose IE tasks often are still conducted manually, e.g. extracting the links from a web search result page.
6 In fact, wrapper verification is also needed if the IE system may be confronted with ineligible pages, i.e. pages that are created from different templates.
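A hand-crafted wrapper for listing 2.2 might look like the sketch below. It is an illustrative assumption, not code from the thesis, and its brittleness against template changes is exactly the maintenance problem described above.

SAMPLE_PAGE = """<html><body>
<h1>books</h1>
<ul>
<li><b>Title:</b> <i>Ulysses</i></li>
<li><b>Author:</b> James Joyce</li>
<li>Publisher: Penguin</li>
<li>1040 pages</li>
</ul>
</body></html>"""

def extract_title(page):
    """Naive hand-crafted wrapper: skip to the <i> tag and take the text up to </i>."""
    start = page.find("<i>")
    if start == -1:
        return None  # wrapper verification would flag this page as incompatible
    start += len("<i>")
    end = page.find("</i>", start)
    return page[start:end] if end != -1 else None

print(extract_title(SAMPLE_PAGE))  # "Ulysses"
# If the template changes to e.g. <i class="title">, this wrapper silently fails:
print(extract_title(SAMPLE_PAGE.replace("<i>", '<i class="title">')))  # None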

Figure 2.3: Different wrapper induction strategies [CKGS06]

The goal of wrapper induction is to derive the encoding template from a collection of encoded instances of the same type. Repeated patterns in HTML documents can be detected with string or tree matching and alignment techniques. These will be discussed in the next sections.

String Matching

String matching helps revealing to what extent two character strings resemble each other. The Levenshtein distance is a commonly used algorithm to compute the similarity of two strings [Lev65]. It is defined as the minimum number of operations to transform one string into the other. These operations are inserting, deleting or replacing a single character in the string. The edit distance can be computed using dynamic programming. Let s_1 and s_2 be the input strings and n and m the respective character counts. The table D of dimension (n + 1) × (m + 1) is initialized with D_{i,0} = i and D_{0,j} = j. The remaining cells are computed using equation (2.3).

\forall (i, j), i \in [1..n], j \in [1..m]: \quad D_{i,j} = \min \begin{cases} D_{i-1,j-1} & \text{same character} \\ D_{i-1,j-1} + 1 & \text{replace} \\ D_{i,j-1} + 1 & \text{insert} \\ D_{i-1,j} + 1 & \text{delete} \end{cases} \quad (2.3)

The final edit distance is retrieved from the bottom right corner cell D_{n,m}. An alignment path can be traced back through the matrix illustrating the operations. The time complexity of the algorithm is O(nm). Table 2.1 shows an example matrix of the comparison of the character strings "sheep" and "shepard" yielding a Levenshtein distance of 4. For similarity computations, the edit distance can be normalized by dividing it by the length of the longer string max(n, m).
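A direct transcription of this dynamic program into code might look as follows; this is a generic sketch of the Levenshtein computation, not the thesis' implementation.

def levenshtein(s1, s2):
    """Edit distance between s1 and s2 via the dynamic program of equation (2.3)."""
    n, m = len(s1), len(s2)
    # D has dimension (n + 1) x (m + 1); D[i][0] = i and D[0][j] = j
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s1[i - 1] == s2[j - 1] else 1  # same character vs. replace
            D[i][j] = min(D[i - 1][j - 1] + cost,      # keep or replace
                          D[i][j - 1] + 1,             # insert
                          D[i - 1][j] + 1)             # delete
    return D[n][m]  # bottom right corner cell

print(levenshtein("shepard", "sheep"))  # 4, as in table 2.1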

Table 2.1: Edit distance matrix of the strings "shepard" and "sheep"

           s   h   e   p   a   r   d
       0   1   2   3   4   5   6   7
   s   1   0   1   2   3   4   5   6
   h   2   1   0   1   2   3   4   5
   e   3   2   1   0   1   2   3   4
   e   4   3   2   1   1   2   3   4
   p   5   4   3   2   1   2   3   4

Tree Matching

String matching across non-trivial Web documents is a complex and expensive operation considering the average document length in terms of characters. There are no pre-determined boundaries and the content and length of the data may differ across multiple documents or records. The semi-structured nature of Web documents led to the application of tree matching to conduct IE tasks. Tree matching compares the structure of two trees and computes a cost of pairing the vertices. In the context of Web IE, the DOM-tree or parts thereof are commonly compared by using the element tags as the vertices' labels.

Tree matching computes a minimum-cost mapping for two ordered labeled trees. According to the general definition, each node appears no more than once and the order and hierarchical relations among nodes are preserved. Figure 2.4 on the facing page illustrates such a mapping. Tai presented the first polynomial algorithm for computing the edit distance based on dynamic programming [Tai79]. The algorithm has a complexity of O(n_1 n_2 h_1 h_2) in time and space, with n_1 and n_2 being the number of nodes and h_1 and h_2 the heights of the respective trees. Cost functions are assigned to the editing operations transforming one tree into another, i.e. relabeling, deleting and inserting nodes. Relabeling is of special interest as it lends itself to identifying recurrent patterns in similarly structured documents. More elaborate cost functions for the relabel operation may exploit syntactic (e.g. string edit distance) or semantic (e.g. feature vector) similarities. Zigoris et al. propose using support vector machines to learn the parameters of the cost function for semantic matching. The preliminary results, however, indicated no performance gain in comparison to simpler cost functions [ZEZ06].

Figure 2.4: General tree mapping example [ZL05]

A more restrictive variant of tree matching was defined by Selkow in 1977 [Sel77]. According to Selkow's definition, insertion and deletion are limited to the leaf nodes and node replacement is not supported. In effect, the aim of tree matching is to find the maximum matching where every node pair has the same parent nodes. This definition has been found to better fit web documents because structural (i.e. level-crossing) changes are not generally applicable to DOM-trees [CAM01]. Simple tree matching (STM) is an algorithm solving this problem in quadratic time [Yan91]. It is again based on dynamic programming and shown in listing 2.3.

STM(A, B)
  if A_root ≠ B_root then
    return 0
  else
    m ← number of children of A
    n ← number of children of B
    M_{i,0} ← 0 for all i ∈ [0..m]
    M_{0,j} ← 0 for all j ∈ [0..n]
    for i = 1 to m do
      for j = 1 to n do
        M_{i,j} ← max(M_{i,j-1}, M_{i-1,j}, M_{i-1,j-1} + STM(A_i, B_j))
    return M_{m,n} + 1

Listing 2.3: Simple tree matching algorithm
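For readers who prefer executable code, the following is a small rendering of the STM pseudocode of listing 2.3, using plain tuples (label, children) as trees; the tree representation is an assumption made for this sketch.

def stm(a, b):
    """Simple tree matching: number of matching node pairs of two ordered labeled trees.
    A tree is a tuple (label, [child trees])."""
    if a[0] != b[0]:              # roots contain different symbols
        return 0
    m, n = len(a[1]), len(b[1])
    M = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            M[i][j] = max(M[i][j - 1],
                          M[i - 1][j],
                          M[i - 1][j - 1] + stm(a[1][i - 1], b[1][j - 1]))
    return M[m][n] + 1            # +1 for the matching root pair

# Two small DOM-like trees sharing most of their structure
t1 = ("ul", [("li", []), ("li", []), ("li", [])])
t2 = ("ul", [("li", []), ("li", [])])
print(stm(t1, t2))  # 3: the root plus two matched <li> children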

Multiple Alignment

In order to identify patterns in case more than two strings or trees are involved, multiple sequence alignment (MSA) techniques can be applied. Multiple alignment has its foundation in molecular biology where it is used to identify similarities of sequences (e.g. proteins). Given a set of similar sequences, MSA tries to find an optimal alignment by inserting gaps into the sequences. Carrillo and Lipman presented an algorithm based on multidimensional dynamic programming that yields optimal results but has an exponential time complexity [CL88]. Hence, various heuristic methods have been proposed, amongst which the center star method has found its way into IE systems.

In this method, a center sequence x_c is selected from a set of sequences X, minimizing the pair-wise distance to the other sequences.

x_c = \arg\min_{x_c \in X} \sum_{x_i \in X} d(x_i, x_c) \quad (2.4)

Afterwards, the alignments with the remaining sequences are computed and gaps are inserted into the center string where necessary. The time complexity of the center star method is O(n^2 k^2) for n sequences of length k. While being of polynomial complexity, the character sequence lengths of HTML pages still incur excessive runtime behavior in IE systems.
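As a small illustration of the center star selection in equation (2.4), the sketch below picks the center sequence using the Levenshtein distance as d; both the distance choice and the toy sequences are assumptions made for this example.

def edit_distance(s1, s2):
    """Levenshtein distance, used here as the pair-wise distance d."""
    prev = list(range(len(s2) + 1))
    for i, c1 in enumerate(s1, 1):
        cur = [i]
        for j, c2 in enumerate(s2, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (c1 != c2)))
        prev = cur
    return prev[-1]

def center_sequence(sequences):
    """Select the center sequence minimizing the summed distance to all others (equation 2.4)."""
    return min(sequences, key=lambda c: sum(edit_distance(x, c) for x in sequences))

# Toy tag sequences standing in for encoded records
seqs = ["ulliba", "ullib", "ulliiba", "ulba"]
print(center_sequence(seqs))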

Partial Tree Alignment

Partial tree alignment was specifically crafted to solve the multiple alignment problem in an IE context [ZL05]. It aligns multiple trees by progressively growing a seed tree. The latter is initialized to be the tree with the maximum number of nodes. This way it likely aligns well with the other trees. The remaining trees are matched by linking matching nodes and trying to insert nodes into the seed tree for which no match was found. Nodes are only inserted if a position can be uniquely determined. That is, if the neighboring siblings in the source tree are matched with consecutive siblings in the seed tree. Figure 2.5 illustrates growing such a seed tree T_s from three input trees.

Figure 2.5: Iterative partial tree alignment example [ZL05]

Extraction Rules

Once the extraction targets are identified, rules to mine the relevant information need to be formalized and stored for future use. There are various possibilities ranging from first-order logic rules over regular expressions to XPath and CSS selectors. Logic rules are primarily used in free-text IE where common tokens and characteristic delimiters facilitating the other approaches are rarely available.

Regular expressions have been widely adopted for data mining from semi-structured documents. In the example in listing 2.2 on page 10 the title of the book can be mined with the regular expression <i>(\w+)</i>. In practice, however, regular expressions are not very well suited to match data in HTML documents. To correctly match all possible variations of a specific HTML tag with a regular expression is a daunting task, especially due to the statefulness of the HTML syntax. For example, the given expression will not work if the <i> tag contains any attributes and will unintentionally match occurrences of the tag in comments or strings.

Therefore, the interest has recently shifted to query languages like XPath or CSS selectors, which are much more suitable for extracting information from an HTML or XML document. Especially the usage of the XPath language in Web information extraction has gained importance with a growing number of libraries supporting this query mechanism. In a nutshell, XPath queries provide means to address node-sets or individual nodes in the DOM tree of an XML (or HTML) document. For instance, //li/i/text() addresses the title phrase of the book in the running example while querying for //ul/li[1] returns the node containing the whole book-title attribute.

XPath queries are far more powerful than the examples given above. This complexity, however, has caused hardship for providing full support of the XPath standard in implementations and an uncertainty concerning the complexity of XPath queries in general. Gottlob et al. have shown that large fragments of XPath are of LOGCFL 7 complexity and thus can be massively parallelized [GKP03]. A more elaborate treatise on XPath can be found in Essential XML Quick Reference [SG01]. O'Keefe and Trotman present a number of query languages besides XPath and argue that most available solutions are overly complicated [OT03]. On the one hand, the lack of comprehensive support of the XPath 1.0 standard in many query libraries backs this assumption. On the other hand, in Web IE the expressive power to select the relevant parts of the available information with the utmost precision is a more favorable goal than a simpler yet inferior solution. CSS selectors, for example, share similar concepts with XPath queries but are not quite as powerful.

After foundational approaches and techniques have been covered, supervised, semi-supervised and unsupervised IE system concepts are presented along with a few exemplary case studies.

7 Logarithmically Reducible to Context-Free Languages
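To make the XPath examples concrete, the following sketch evaluates them against the sample page of listing 2.2 using the lxml library; lxml is merely one common choice here and is not mandated by the thesis.

from lxml import html

PAGE = """<html><body>
<h1>books</h1>
<ul>
<li><b>Title:</b> <i>Ulysses</i></li>
<li><b>Author:</b> James Joyce</li>
<li>Publisher: Penguin</li>
<li>1040 pages</li>
</ul>
</body></html>"""

tree = html.fromstring(PAGE)

# //li/i/text() addresses the title phrase directly
print(tree.xpath("//li/i/text()"))          # ['Ulysses']

# //ul/li[1] returns the node containing the whole book-title attribute
first_item = tree.xpath("//ul/li[1]")[0]
print(first_item.text_content())            # 'Title: Ulysses'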

2.2.3 Supervised Information Extraction

Manually observing recurrent patterns in web pages is a rather cumbersome and error-prone process which can be alleviated by automatically learning extraction rules from labeled training documents. This approach is referred to as supervised IE. As depicted in figure 2.3 on page 11, the user has to label relevant data with the help of a graphical user interface (GUI). In the example of the book page, the user may mark "Ulysses" as the title of the book and does that for a set of other pages. The IE system then tries to derive rules from these examples and, depending on the IE system, may suggest additional informative pages to be labeled by the user.

For example, Rapier is a supervised extraction system that uses a relational learning algorithm [CM97]. It initializes the system with specific rules to extract the labeled data and successively replaces those with more general rules. Syntactic and semantic information is incorporated using a part-of-speech (POS) tagger. Extraction rules consist of pre-filler, filler and post-filler patterns for each data field. These describe the context and syntax of the extraction target. The respective patterns for extracting the publisher name in the running example could be "</li>", "<li>", "Publisher:" as pre-filler tokens and "</li>", "<li>" as post-filler tokens. Depending on the training data, the filler pattern might specify that the publisher name consists of at most two words which were labeled as nouns by the POS tagger. Other examples of supervised IE systems are SRV [Fre98], WIEN [KWD97], SoftMealy [HD98], STALKER [MMK99] and DEByE [LRNdS02].

2.2.4 Semi-Supervised Information Extraction

Labeling training data in advance is a labor-intensive process limiting the scope of the IE system. Instead of requiring labeled data, semi-supervised IE systems extract potentially interesting data and let the user decide what shall be extracted. In other words, the user provides feedback to the IE system which is incorporated into the wrapper generation process. In the running example, a semi-supervised system might recover title, author and the publisher as extractable data fields from a set of unlabeled book pages. The user then selects which fields shall be extracted and how to integrate the information, e.g. by labeling the titles as such in the extraction target tuple. An example for a semi-supervised system is IEPAD [CL01]. Apart from extraction target selection, semi-supervised IE systems are very similar to unsupervised IE systems.

2.2.5 Unsupervised Information Extraction

Automatic or unsupervised IE systems extract data from unlabeled training documents. The core concept behind all unsupervised IE systems is to identify repetitive patterns in the input data and to extract the data items embodied in the recurrent pattern.

Unsupervised IE systems can be subdivided into page-level extraction systems and record-level extraction systems. The former extract data from a page-wide template, while the latter assume multiple data records of the same type are available, rendered by a common template into one page. In case multiple records exist in a single web page, it might be possible to derive extraction rules from a single web page, assuming the individual data records can be told apart. The record-level extraction task can be described as trying to extract various items from a list page (e.g. a product list from a web shop). In contrast, page-level extraction tasks require multiple pages (e.g. product detail pages) to discover patterns and learn extraction rules.

Evidently, record-level extraction systems can only operate on documents containing multiple data records and require means to identify the data regions describing the individual data records. The latter problem can be tackled with string or tree alignment techniques. Examples for such systems are DEPTA [ZL05] and NET [LZ05]. Page-level extraction systems can treat the whole input page as a data region from which the data record shall be extracted. However, multiple pages 8 for wrapper induction need to be fetched in advance. Thus, the problem of collecting training data is shifted into the DR domain and is rarely addressed by IE researchers. Examples for page-level extraction systems are RoadRunner [CMM01] and ExAlg [AGM03].

2.2.6 Case Studies

In the following, a selection of well-known IE systems is presented which try to solve similar problems. One semi-supervised and three unsupervised IE systems are presented, illustrating various techniques and the associated constraints to solve different IE tasks.

RoadRunner

RoadRunner is one of the early unsupervised Web IE systems, presented in 2001 by Crescenzi, Mecca and Merialdo [CMM01]. It compares multiple pages and generates union-free 9 regular expressions based on the identified similarities and differences. RoadRunner initializes the wrapper with a random page of the input set and matches the remaining pages using an algorithm called ACME matching. The wrapper is generalized for every encountered mismatch. Text string mismatches are interpreted as data fields; tag mismatches are treated as indicators of optional items and iterators. In the RoadRunner data model, individual data items must be separated by HTML tags but tags must not occur as part of the data field. Figure 2.6 on the following page shows an example of a wrapper generated from two input pages.

8 At least two training pages are required for page-level wrapper induction. Depending on the IE system and the template, however, ten or even more training pages may be necessary to successfully derive extraction rules.
9 A union-free regular expression does not contain disjunctions (e.g. (A|B)).

Figure 2.6: Wrapper induction example for RoadRunner [CMM01]

The runtime complexity is exponential in the input string length. Therefore, heuristics were introduced to limit the exploration space.

ExAlg

Arasu and Garcia-Molina propose an IE system automatically deducing the template from a set of template-generated pages [AGM03]. ExAlg has a hierarchically structured data model and supports optional elements and disjunctions. A web page is modeled as a list of tokens in which a token might either be an HTML tag or a word from a text node. ExAlg builds equivalence classes of the tokens found in the input documents. Based on these sets of tokens, the underlying template is deduced. Figure 2.7 on the next page shows four example pages where each template-token is labeled with an index. Tokens with the same occurrence vector across all input documents build an equivalence class. The idea is that tokens emitted from the same template constructor will likely occur with the same frequency. Furthermore, ExAlg can detect tokens with multiple roles, e.g. the token Name in Book Name and Reviewer Name carries a different meaning in each occurrence. It differentiates between roles based on the occurrence-path 10 and the spans of valid equivalence classes.

10 The occurrence-path, as defined by Arasu and Garcia-Molina, has a close resemblance to an XPath query.

For instance, an equivalence class in the given example is {<li>, Reviewer, Name, Rating, Text, </li>} with the occurrence vector ⟨1, 2, 1, 0⟩. ExAlg defines large and frequent equivalence classes (LFEQs) as classes containing many tokens which occur in a large fraction of the input documents. The LFEQs are hierarchically structured and the order of the tokens is preserved. The nesting is governed by the span formed by all tokens in the respective equivalence class. LFEQs are passed to the analysis stage in which the template is deduced.

Figure 2.7: Input pages in ExAlg [AGM03]

Starting from the root LFEQ (the tokens occurring exactly once in all input documents), ExAlg searches for non-empty positions between consecutive tokens and generates type constructors for these locations. Nested LFEQs are recursively visited and the types are constructed according to the data model. The generated template can then be used to extract data from input pages. For the given example, the original schema ⟨B_Book, {⟨B_Reviewer, B_Score, B_Text⟩}⟩ can be recovered by analyzing the four input pages.

ExAlg has a sophisticated data model compared to other automatic IE systems. Moreover, ExAlg operates on the token level, not on the tag level as many other unsupervised extraction systems do, and thus has the chance of extracting attributes embedded in text nodes without any markup. The effectiveness of the extraction tends to improve with the number of input pages. However, experiments indicate that ExAlg works well for collections of under ten input documents given that the occurrence of the attributes to be extracted exceeds the chosen threshold.
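A toy version of the equivalence class idea: count how often each token occurs in every input document and group tokens whose occurrence vectors are identical. This is a simplification assumed for illustration; ExAlg additionally considers roles, occurrence-paths and validity checks, and the tokenized pages below are hypothetical.

from collections import Counter, defaultdict

def equivalence_classes(token_lists):
    """Group tokens by their occurrence vector across all input documents."""
    vocabulary = set(tok for tokens in token_lists for tok in tokens)
    counts = [Counter(tokens) for tokens in token_lists]
    classes = defaultdict(set)
    for tok in vocabulary:
        vector = tuple(c[tok] for c in counts)  # occurrence vector of the token
        classes[vector].add(tok)
    return classes

# Four hypothetical tokenized pages (tags and words mixed, as in ExAlg)
pages = [
    ["<html>", "Book", "Name", "Reviews", "<li>", "Reviewer", "Name", "</li>", "</html>"],
    ["<html>", "Book", "Name", "Reviews", "</html>"],
    ["<html>", "Book", "Name", "Reviews", "<li>", "Reviewer", "Name", "</li>",
     "<li>", "Reviewer", "Name", "</li>", "</html>"],
    ["<html>", "Book", "Name", "Reviews", "<li>", "Reviewer", "Name", "</li>", "</html>"],
]
for vector, tokens in equivalence_classes(pages).items():
    print(vector, sorted(tokens))
# Template tokens such as <li> and Reviewer share the vector (1, 0, 2, 1),
# while the multi-role token Name ends up in a class of its own.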

IEPAD

IEPAD, a semi-supervised IE system, was presented by Chang and Liu in 2001 [CL01]. It is capable of extracting homogeneous data records from a set of unlabeled pages. IEPAD generates wrappers by discovering repetitive patterns using multiple string alignment. The input document is converted to a binary representation of the data. HTML tags and text elements are mapped to a set of fixed-length binary tokens. A PAT tree, which is a binary suffix tree, is created from the binary representation. The PAT tree, in turn, is used to find repetitive patterns by recording occurrence count and reference points for each recurring pattern. To tolerate inexact matches, the center star algorithm is applied to obtain generalized extraction patterns. The candidate patterns and the occurrence metrics are presented to the user. Upon selection of a pattern, a regular expression is created from the binary representation. Thus, the wrapper can also operate on web pages without transforming those into the binary representation.

DEPTA

DEPTA stands for Data Extraction based on Partial Tree Alignment and is an unsupervised IE system [ZL05]. DEPTA extracts data records from list pages with an algorithm called MDR, taking advantage of the tree structure of the HTML page. MDR was first presented by Liu et al. in 2003 [LGZ03]. The design of MDR is based on two observations about data records. The first observation states that similar objects are likely located in a contiguous region and formatted with almost identical or at least similar HTML tags. The second observation is that similar data records are built by sub-trees of a common parent node.

The algorithm first builds the DOM-tree for the web page and stores the bounding box for each element. 11 Adjacent nodes that share the same parent are then compared by computing the string edit distance of the tag strings. If the estimated similarity exceeds a predefined threshold, the group of nodes is identified as a data region. To account for data records that are spread over multiple sibling nodes, the concept of generalized nodes was introduced. Generalized nodes encompass one or more sibling nodes. Figure 2.8 on the facing page shows an abstracted tag tree where nodes 5, 6 and 8, 9, 10 build two data regions as the respective nodes in each region are similar. The combined node pairs (14, 15), (16, 17) and (18, 19) are also similar to each other and each pair builds a generalized node.

Data records are derived from generalized nodes. However, there are cases when such a node does not represent a single data record. DEPTA handles some special cases to deal with these discontinuities in data records. Finally, data fields are extracted from the alleged data records. After all tag-trees belonging to the data record are assembled in a new tree, partial tree alignment is performed to induce the structure of the data. The idea is to match the fields from all data records to build a generalized representation of the data record.

11 The visual information for each tag is supplied by a web browser.
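The core similarity test of MDR described above can be sketched as follows: serialize each child sub-tree into its tag string and flag runs of adjacent siblings whose similarity stays above a threshold. The tag-string serialization, the use of difflib's ratio as an approximation of normalized edit similarity, the threshold value and the grouping logic are all simplifications assumed for this sketch.

from difflib import SequenceMatcher

def similar(tags_a, tags_b, threshold=0.8):
    """Approximate tag-string similarity of two sibling sub-trees (1.0 = identical)."""
    return SequenceMatcher(None, tags_a, tags_b).ratio() >= threshold

def data_regions(child_tag_strings, threshold=0.8):
    """Group adjacent similar siblings into candidate data regions."""
    regions, current = [], [0]
    for i in range(1, len(child_tag_strings)):
        if similar(child_tag_strings[i - 1], child_tag_strings[i], threshold):
            current.append(i)
        else:
            if len(current) > 1:
                regions.append(current)
            current = [i]
    if len(current) > 1:
        regions.append(current)
    return regions

# Tag strings of the children of one parent node (e.g. rows of a product list)
children = ["table tr td td", "tr td a td", "tr td a td", "tr td a td", "div span"]
print(data_regions(children))  # e.g. [[1, 2, 3]]: the three similar sub-trees form a data region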

Figure 2.8: Generalized nodes and data regions in DEPTA [ZL05]

MDR can handle non-contiguous data records and is capable of extracting data records that span multiple sibling nodes. The assumption is made that HTML tags are generated by the template and text nodes belong to the data to be extracted. Visual cues are consulted to distinguish individual data records. However, the extraction is limited to flat data records. Support for nested data records (e.g. two data records sharing data items from a common parent data record) was added in a successor system called NET [LZ05]. In the latter system, a post-order traversal of the tag tree is performed to identify data records at different levels. NET uses simple tree matching to compute the tree similarity and aligns the trees whose similarity is above a chosen threshold.

2.2.7 Summary

This section introduced Web IE concepts and techniques and presented a few interesting automatic IE systems from the literature. An information system consisting of a document retrieval and information extraction component is able to identify relevant Web pages and extract salient data from the respective pages. However, to embed the obtained information into an existing knowledge base, information integration techniques are required.

2.3 INFORMATION INTEGRATION

"It is a very sad thing that nowadays there is so little useless information."
Oscar Wilde

After retrieving and extracting information from heterogeneous sources, the obtained data needs to be related to existing data. The inherent challenges of information integration (II) originate in the structural and semantic heterogeneity of the various information sources. Data can be laid out and stored in different ways depending on the chosen data model, leading to structural heterogeneity. Semantic heterogeneity is concerned with the content and meaning of the data.

Wache et al. state the problem of information integration and semantic interoperability as follows [WVV+01]: "In order to achieve semantic interoperability in a heterogeneous information system, the meaning of the information that is interchanged has to be understood across the systems. Semantic conflicts occur whenever two contexts do not use the same interpretation of the information."

According to Pollock and Hodgson, semantic conflicts can be classified as naming conflicts, scaling and unit conflicts, confounding conflicts or domain conflicts. Naming conflicts occur in the presence of synonyms and homonyms, i.e. multiple names exist for the same entity. Different units and currencies lead to scaling conflicts. Metrics may either be explicitly encoded in the data or implicitly assumed. Confounding conflicts arise when a same-named entity is defined differently by the various information providers. Finally, domain conflicts occur when data is modeled with distinct domain-specific intentions resulting in overlapping or disjoint concepts [PH04].

Information integration can be approached with ontology-mapping techniques. Ontologies are well suited to model hidden and implicit knowledge for different domains. Wache et al. give a concise overview of ontology-based information integration techniques [WVV+01].

2.4 LEGAL CONSIDERATIONS

Retrieving, extracting and integrating information published by a third party may have legal implications. The terms of service of the respective sites apply, which may prohibit web scraping of their content. Although a few precedents exist, this is a grey area of law and has been ruled upon differently depending on the jurisdiction and the case. Adhering to the terms of use of a web site only being visited by the IR/IE system is not realizable unless the terms could be retrieved and understood by the crawler. Legal advice should be sought before employing web scraping in a public or commercial software system.

2.5 FEDSEEKO

Fedseeko is a federated search engine with the goal to facilitate obtaining product information from the Internet [WSS09]. It uses adapters to access diverse product information providers such as online shopping malls, producer sites and third party information portals like forums or blogs. The information sources are accessed via web services if such a possibility exists. For instance, the Amazon Product Advertising API 12 provides extensive vendor information through a web service. In case no such interface exists, the information may be extracted using web scraping techniques.

Figure 2.9 depicts the architecture of Fedseeko and its internal and external interfaces. The reference implementation is based on Ruby on Rails.

Figure 2.9: Fedseeko architecture [WSS09]

2.5.1 Producer Information Integration

In the following section, some important aspects of the original producer information extraction implementation will be outlined. As a first step, the manufacturer URL for a given product is retrieved by a web search query. The first hit of a web search restricted to the .com domain is considered to be the producer site and will be the basis of downstream product page searches. The product page is located via a phrase search on the suspected producer site.

Fedseeko uses XPath queries to address the individual nodes associated with a product attribute. The mining of XPath queries requires guidance. An example key/value pair needs to be supplied, which is used to locate the proper product URL. Starting from the suspected product page, the linked pages are walked and page contents are matched via a similarity check with the key/value phrases. The search stops once a page with the requested resemblance is found. Once a matching product page is found, a Scrubyt 13 extractor computes the XPath queries for the key, value and base query respectively. The identified XPath queries are associated with the producer, implying a single producer-wide template. Fedseeko uses mapping ontologies to relate producer information to available information of similar products by other manufacturers.

The shortcomings of the existing producer information solution are first and foremost the required amount of user supervision.

13 scrubyt! is a Ruby library designed to facilitate web scraping tasks.

Supplying samples for each attribute and producer is a labor-intensive process, especially considering the large variety of producers and the number of attributes associated with some products (in the domain of digital cameras, for instance, more than one hundred attributes may be listed per product). Furthermore, the limitation of one template per producer is an oversimplifying presumption. Large producers with a manifold product range may use slightly different templates for different product categories. A new approach towards producer information retrieval and extraction, aiming to overcome the deficiencies of the existing implementation, will be presented in this thesis.

2.6 SUMMARY

An overview was given covering the research areas information retrieval, information extraction and information integration. The brief treatise of IR focused on effectiveness metrics, while an in-depth introduction to Web IE was provided. Important IE techniques have been presented and exemplary IE systems have been examined. Some of the methods and techniques will be reused and referenced in the subsequent chapters. Information integration was covered only briefly for the sake of completeness but is otherwise outside the scope of this thesis (a related work, conducted contemporaneously, revamps the ontology mapping in Fedseeko). Finally, the federated search engine Fedseeko has been introduced and its producer information integration component was examined. During the course of this thesis, a replacement of this component will be developed.

3 REQUIREMENTS

The goal of the revised information extraction component is to minimize the effort as well as the cost of obtaining and providing first-hand product information. Upon a query for a certain product, the system shall extract all available product attributes from the manufacturer's web site without requiring guidance or supervision. In contrast to the existing IE system, web sites based on not yet encountered templates shall be analyzed automatically, with extraction rules being inferred and stored for future requests. A change of a known template requiring different extraction rules should be detected and acted upon. In this chapter, the information flow of the retrieval and extraction system is analyzed. A functional and behavioral description is given. Finally, the validation criteria for the software system will be briefly covered.

3.1 INFORMATION DESCRIPTION

In a nutshell, the information extraction system shall locate product pages on the Internet and extract product attributes without any mandatory user interaction. As depicted in figure 3.1 on the following page, the only input to the software system is a product descriptor. This product descriptor or identifier may be manually entered or may originate from vendor databases or other sources listing products. The input is a tuple comprising a manufacturer name and a product identifier. The latter can be decomposed into a list of tokens, where the tokens describe a specific product. Based on this information, the manufacturer's product page is to be retrieved. An example input is "Apple Inc." and "MacBook Pro". The output is an ordered set of attribute tuples extracted from the product page associated with a product. Each attribute tuple consists of a key and a value character string, e.g. "Weight", "42 kg".
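To make these shapes concrete, the following minimal Ruby sketch, written for illustration and not taken from the implementation, shows the product descriptor and the resulting attribute tuples as plain data; the variable names are assumptions, and the attribute values are reused from the examples given in the text.

    # Input: a product descriptor comprising a manufacturer name and a
    # product identifier; the identifier can be decomposed into tokens.
    descriptor = { manufacturer: "Apple Inc.", product: "MacBook Pro" }
    tokens = descriptor[:product].split  # => ["MacBook", "Pro"]

    # Output: an ordered set of attribute tuples, each a key/value string pair
    # (values reused from the examples in the text, purely for illustration).
    attributes = [
      ["Weight", "42 kg"],
      ["Total Pixels", "12.9 million"]
    ]
    attributes.each { |key, value| puts "#{key}: #{value}" }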

Figure 3.1: Overview of information flow

The extracted attributes may be saved in a database, be passed to a downstream processor, or be presented directly to the user. There is a product detail view in Fedseeko presenting the producer information alongside other related data like product reviews. Furthermore, the extracted data is passed to an information integration system performing ontology mapping. The latter task is carried out by a separate system which will not be discussed herein. The sources of the attributes to be extracted are product detail pages residing at the respective manufacturer sites. Empirical observations regarding these pages will be presented in the next section.

Product Pages

The IE engine shall be able to extract product attributes from a large number of heterogeneous manufacturer pages. The following empirical observations describe characteristics of typical product pages.

1. A product page with sufficient information often describes only a single product but may contain data for different product variants.
2. A manufacturer may use more than one template for different product categories or families.
3. There might be very few pages available with a common template.
4. Multiple description pages with different templates might exist for the same product, e.g. a summary and a specification page.

These characteristics do not apply to all product domains. Throughout this work, the focus is placed on those kinds of products for which a human operator could easily tell product features apart by looking at the product page. Figure 3.2 on the next page shows a product page of a Nikon digital camera for which attributes like "Total Pixels", "12.9 million" shall be extracted.

Figure 3.2: Product page example with the extraction targets being highlighted

3.2 FUNCTIONAL DESCRIPTION

The complete product information retrieval system can be decomposed into two major components. One component is responsible for the identification of the manufacturer site as well as the proper product page. The other component's task is to extract product attributes from the aforementioned product page. The document retrieval component locates and fetches the product page from the manufacturer's web site. If multiple pages exist for a single product, the page with the most syntactically structured content should be picked. For example, a specifications page is better suited for Web IE than a free-text summary page. The information extraction component extracts attribute tuples from a product page of a specific template. Its job is to filter irrelevant data and identify the useful bits of information in a given document. Either new rules are derived to identify the extraction targets, or previously stored rules are used to extract data from a page created from an already encountered template. Extraction from a page based upon a known template is an on-line operation, i.e. it is performed while the user of the system waits for a response to his request.

Therefore, it should deliver results within the timeframe given for the overall Fedseeko query to complete. In other words, if a query for a Fedseeko product detail page should respond within fifteen seconds, the extraction's execution time should not exceed this bound in the average case. As it might not be possible to select the proper wrapper object to extract data from a given document, a wrapper shall be able to detect ineligible input pages. In effect, the wrapper verification problem must be solved inside the wrapper object. The wrapper induction component creates extraction rules for one or more pages sharing a specific template. Wrapper induction only needs to be executed if a new template is discovered or a known template has changed. Thus, the operation may be performed off-line on a best-effort basis.

3.3 BEHAVIORAL DESCRIPTION

Most of the system's operations are invisible to the user. Upon requesting detailed information for a given product, the system will retrieve the product page and extract all product attributes from that page. No user input is required. However, the system may not be able to retrieve the proper product page, may fail to extract any information, or may select bogus data. For these cases, the user may intervene after the retrieval and extraction steps have been executed. The user shall be given means to correct the estimated product page URL. Furthermore, extracted data may be discarded, whereupon the extraction can be restarted. Should the automatic extraction fail to deliver meaningful data, the user may provide hints to facilitate the extraction process.

3.4 VALIDATION CRITERIA

The software system is evaluated according to a gold standard (Wikipedia defines a gold standard test as a "diagnostic test or benchmark that is regarded as definitive" [Wik09]; test results are interpreted such that no false-positive or false-negative results are included). A control group of one hundred products from twenty different domains is used to validate the proper operation of the system as well as to measure the effectiveness of the retrieval and extraction components. In order to spot the cause of extraction failures, the subsystems are examined individually. The automatic extraction of attributes shall work reliably in the majority of the test cases. With additional information, it ought to be possible to successfully extract the proper data from four out of five documents. For each test product, the proper product URL is gathered manually and a reference attribute is recorded. This manually gathered data is matched with the automatically computed data during evaluation. The document retrieval subsystem either succeeds in locating a product page suitable for information extraction, or fails to do so.

Therefore, the precision metric follows the probabilistic interpretation and states the probability that the returned document is relevant.

3.5 SUMMARY

This chapter stated the goal of the software system, and requirements were analyzed from various perspectives. Based on the given problem analysis, a software system will be developed. Its design, implementation and evaluation will be presented throughout the subsequent chapters.


4 DESIGN

The system design is outlined in this chapter. A description of each component required to solve the problem is provided as a processing narrative and in the context of the architectural design.

4.1 DATA DESIGN

The input and output data are depicted in figure 4.1. The key components have been identified as the product page locator, responsible for DR, and the components revolving around the wrapper logic, responsible for Web IE. Both components and their design constraints will be presented in this section.

Figure 4.1: Information flow during extraction
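As an illustration of this flow, a hedged Ruby sketch is given below; the collaborator and method names (locator, wrapper_db, inducer and their methods) are assumptions made for this text and do not correspond to the actual implementation.

    # Illustrative pipeline corresponding to figure 4.1; all collaborators are
    # passed in as duck-typed objects and their names are assumptions.
    def extract_product_attributes(manufacturer, product_id, locator:, wrapper_db:, inducer:)
      # Document retrieval: locate the product page on the manufacturer's site.
      page = locator.locate(manufacturer, product_id)

      # Reuse a stored wrapper if one matches the page's template,
      # otherwise induce a new wrapper and store it for future requests.
      wrapper = wrapper_db.find_for(page) || wrapper_db.store(inducer.call(page))

      # Information extraction: apply the wrapper to obtain key/value tuples.
      wrapper.extract(page)
    end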

4.1.1 Retrieving Product Pages

The DR component must supply the downstream IE processor with a genuine product page. In contrast to the more common DR systems in which a large set of documents is returned, selecting the proper product page is a binary choice. Either the right product page is identified or the IE component won't be able to extract relevant data. In effect, the goal of the document retrieval subsystem is to optimize the precision for the top-ranked candidate (i.e. according to the terminology introduced in section 2.1.2, precision(1) shall be maximized). In a full-fledged product page retrieval system, all manufacturer sites would have to be indexed in advance in order to allow the retrieval of subordinate product pages. However, this work focuses on the information extraction task and only limited resources are available. Hence, it was chosen not to build a dedicated document index for product page retrieval from the World Wide Web. Instead, the results of existing web search services are used and combined to pick the product page. The results of multiple web search engines such as Google Search, MSN Search and Yahoo! Search shall be aggregated to obtain a maximum coverage of the World Wide Web and benefit from the well-established ranking algorithms used in the respective services.

Product page retrieval is laid out as a two-step process. In a first step, the producer page is located and, in a second step, the product page is searched at the producer site. In this manner, first-hand product information is not intermixed with third-party information like web shop offers or product reviews. If the proper producer site was not identified in the first step, i.e. the product is not featured on the suspected site, the DR component should fall back to another candidate.

Product Page Ranking

During product page retrieval on the producer site, the DR subsystem tries to pick the proper page from the top-ranked set of candidates of multiple web search engines. Considering more than just the single top-ranked candidate improves the chance that a relevant document is among the set of retrieved documents. The ranking of the individual search engines is combined using Borda ranking, known from social choice theory. In Borda ranking, named after Jean-Charles de Borda who proposed it as an election method in 1770, every voter announces an ordered list of preferred candidates. If there are n candidates, the top-ranked candidate of each voter receives n points and each lower-ranked candidate receives a decremented score. Borda ranking and other search result combination methods are discussed in Web Data Mining by Bing Liu [Liu06]. Table 4.1 shows the search results of an artificial query. As indicated in the example, a combined ranking may not suffice to select the proper document from a set of candidates.

Table 4.1: Top four search results of two web search engines

    Document                                 Relevant?   Borda rank
    /news/november/the_new_shiny_product     no          4
    /products/detail.html?category=6&id=17   yes         4
    /products/index.html?category=6          no          6
    /forum/show.html?post=42                 no          3
    /reviews/products/17                     no          3

Therefore, additional metrics are incorporated to refine the original ranking. Figure 4.2 on the next page gives an overview of the approaches used to process the candidate list. Some techniques try to identify a page that contains specification information and other methods scan for references to the searched product. The scores of the individual techniques need to be appropriately balanced in order to identify specification data associated with the right product.
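To make the Borda combination step concrete, the following Ruby sketch, written for this text and not taken from the implementation, aggregates the ranked result lists of several engines; the candidate URLs in the usage example are purely illustrative, and the additional heuristic scores discussed below are ignored.

    # Combine the ranked result lists of several engines with Borda counting:
    # in a list of n candidates the top-ranked document receives n points,
    # the next one n-1, and so on; scores are summed per document.
    def borda_scores(ranked_lists)
      scores = Hash.new(0)
      ranked_lists.each do |list|
        n = list.size
        list.each_with_index { |doc, i| scores[doc] += n - i }
      end
      scores
    end

    engine_a = ["/products/detail?id=17", "/news/new_product", "/forum/post42"]
    engine_b = ["/news/new_product", "/products/detail?id=17", "/reviews/17"]
    p borda_scores([engine_a, engine_b])
    # => {"/products/detail?id=17"=>5, "/news/new_product"=>5, "/forum/post42"=>1, "/reviews/17"=>1}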

Figure 4.2: Selecting a product page from a set of candidates using multiple techniques

The list of candidates is extended with linked documents referring to potential specification pages. A content-type check ensures only processable documents are considered. Simple keyword heuristics examining the components of the URL are implemented to distinguish genuine product information from related but ineligible pages, such as product news, user reviews or forum postings. The page title is matched with the product identifier, and the body of the page is scanned for known key phrases (a list of key phrases is supplied by the IE subsystem, which assembles a collection of characteristic attribute keys from already extracted data). Finally, all scores are combined using empirically estimated weights. The candidate with the highest combined score is returned.

Finding Related Product Pages

In the conclusion of the ExAlg paper, Arasu and Garcia-Molina identified the automatic crawling of template-generated pages as a promising approach to enhance future Web IE systems [AGM03]. A slight variation of this idea is implemented with the goal of automatically gathering training data for wrapper induction. If no domain knowledge is available to identify relevant data on a given page, the wrapper generator requires at least one other page sharing the same template to detect recurrent patterns. A promising approach to finding similar pages is to crawl a fraction of the producer site starting from the product page and select a page with a similar URL, content and structure. There are two driving ideas behind this concept.

First, the empirical observation has been made that it often takes no more than two clicks to navigate from a product page to another one of a similar product (see figure 4.3 for an example). Second, similar URLs are more likely to reference template-sharing pages, e.g. /product.html?name=spam and /product.html?name=eggs. This is due to routing mechanisms in template-based web application frameworks. Thus, a web crawler with a limited recursion depth likely picks up a related product page. Tree matching is performed to ensure the found page has a similar content and syntactic structure.

Figure 4.3: Navigating to a related product page (Nikon D90 to Nikon D3X)

The crawling will be performed on demand. On the one hand, this is an expensive operation putting considerable stress both on the Fedseeko server and on the respective producer sites. On the other hand, the recursion limit puts an upper bound on the execution time. Furthermore, the crawling will only be executed as a last resort, namely in case the extraction approach which draws upon domain knowledge fails.

4.1.2 Information Extraction from Product Pages

The IE engine shall be able to extract product attributes from a wide variety of manufacturer pages. In the following, the empirical observations regarding product pages from the requirements analysis are revisited and the consequences for the design of the IE system are discussed.

Observation 1: A product page with sufficient information often describes only a single product.

Product detail pages are rarely multi-record pages and typically describe only a single product. In effect, multi-record IE systems like DEPTA are not suited for this particular task. Either multiple pages generated from the same template are required for deriving extraction rules, or domain knowledge must be incorporated to identify relevant data in the page. In the first case, one or more related product pages sharing the same template must be found automatically for driving a multi-page IE algorithm.
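As an illustration of how such related pages might be gathered, the following Ruby sketch performs a depth-limited crawl and scores URL similarity; it is a minimal sketch written for this text, not the implementation, and the tree matching of page contents mentioned above is omitted.

    require 'nokogiri'
    require 'open-uri'
    require 'uri'

    # Depth-limited crawl of the producer site starting at the product page,
    # collecting candidate URLs on the same host (method names are assumptions).
    def related_candidates(start_url, depth: 2)
      host = URI(start_url).host
      seen = [start_url]
      frontier = [start_url]
      depth.times do
        links = frontier.flat_map do |url|
          doc = Nokogiri::HTML(URI.open(url))
          doc.css('a[href]').map { |a| URI.join(url, a['href']).to_s rescue nil }
        end
        frontier = links.compact.uniq.select { |u| URI(u).host == host && !seen.include?(u) }
        seen.concat(frontier)
      end
      seen - [start_url]
    end

    # Crude URL similarity (length of the common prefix); pages generated from
    # the same template, e.g. /product.html?name=spam and /product.html?name=eggs,
    # tend to score high.
    def url_similarity(a, b)
      a.chars.zip(b.chars).take_while { |x, y| x == y }.size
    end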

Observation 2: A manufacturer may use more than one template for different product categories or families.

In contrast to the original implementation, multiple extraction rules, each associated with a distinct template, shall be stored per producer. It needs to be decided which wrapper to select for extracting data from a given page or whether to generate a new wrapper object. Thus, the wrapper shall automatically detect whether the document was created from the template it was trained for.

Observation 3: There might be very few pages available with a common template.

In effect, the wrapper induction algorithm shall be able to derive rules even if only two training pages are supplied. Most existing IE systems require considerably more training data to induce extraction rules, e.g. ExAlg draws upon the target attributes' occurrence characteristics, which can hardly be derived from only two training pages. Even if many related product pages are available, they have to be found and gathered automatically, which is an expensive operation.

Observation 4: There is no domain knowledge available up-front.

The system starts with a clean slate, meaning that no domain knowledge is initially available to facilitate the extraction of product attributes. Therefore, the IE system shall be able to extract data using only syntactic features of the input page. This is required for bootstrapping the system. None of the existing IE systems presented in chapter 2 is particularly suited for the task. Therefore, a new IE system shall be developed which is able to extract data under the assumptions stated above.

Exploratory Prototype

An IE prototype was developed to answer the question whether it is feasible to extract data under the following conditions:

- No domain knowledge is available.
- There are only two unlabeled input pages.
- The extraction target is a set of key/value tuples, i.e. the product attributes.

The evolution of the prototype implementation and experimental results are presented in section 5.2. With the help of the prototype, the possibility of extracting data under the conditions stated above could be demonstrated.

Information Extraction Data Model

A more formal description of the data model, according to the notations introduced in the data model section of chapter 2, is given below.

As already mentioned, the target data is assumed to be encoded in a product detail page describing a single product. Product attributes are a flat set of key/value pairs; nested data records are not supported. Thus, the formal description of the data model is S = { B_key, B_value }_attributes. The extracted data record for figure 3.2 on page 27 looks like x = { "Image Sensor Format", "DX", "Image Sensor Type", "CMOS", ... }. It is worth mentioning that the key component of the attribute tuples may belong to the template or be part of the data in the original data schema. Either way, key strings are always stored with the extracted data.

Product Attributes Selection

The core assumption of the extraction logic is that all product attributes in a document can be selected by a common XPath query. For example, the rows of a table, the items of a list, or consecutive paragraphs: all these entities can be addressed by a single XPath query. Figure 4.4 shows three representative examples of product specification data stored in different container elements in the DOM-tree. An XPath query can be computed for each shown example which is able to extract all available attributes from the respective page.

Figure 4.4: Examples of specification data embedded in different containers: (a) table, (b) list, (c) nested blocks

In addition, it is assumed that it is sufficient for attribute extraction to compute the XPath query which addresses the key phrases. It is deemed possible to infer the ancestor element in the DOM-tree comprising the whole attribute. Thus, the XPath query selecting the key phrases is split into an absolute path to address the attribute elements (the attribute path), and a relative path to select the key part of the attribute (the key path). The value of an attribute can be derived by collapsing all remaining text nodes underneath the attribute node. A wrapper stores the attribute path and the key path. The extraction is reduced to running an XPath query with the help of a capable library. The algorithm is outlined in listing 4.1 on the next page. The evaluation of a document is performed at the tag level and closely follows the DOM-tree representation of the web page. The set of phrases in a document is equivalent to the set of text nodes in the DOM-tree. In effect, the individual attributes must be syntactically distinct. For example, if the product attributes are all contained in a single paragraph with no further mark-up, it is not possible to extract those elements.
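As an illustration of the extraction step just described, the following Ruby sketch applies a wrapper consisting of an attribute path and a key path to a page; it assumes the Nokogiri library and uses purely illustrative XPath queries and markup, so it is a minimal sketch rather than the algorithm of listing 4.1.

    require 'nokogiri'

    # Apply a wrapper consisting of an absolute attribute path and a relative
    # key path; the value is obtained by collapsing the remaining text of the
    # attribute node once the key text has been removed.
    def extract_attributes(html, attribute_path, key_path)
      doc = Nokogiri::HTML(html)
      doc.xpath(attribute_path).map do |attr_node|
        key_node = attr_node.at_xpath(key_path)
        next if key_node.nil?
        value = attr_node.text.sub(key_node.text, '').strip
        [key_node.text.strip, value]
      end.compact
    end

    # Illustrative wrapper for a specification table as in figure 4.4 (a):
    # the attribute path selects the rows, the key path the header cell.
    html = <<~HTML
      <table class="specs">
        <tr><th>Image Sensor Format</th><td>DX</td></tr>
        <tr><th>Image Sensor Type</th><td>CMOS</td></tr>
      </table>
    HTML
    p extract_attributes(html, "//table[@class='specs']//tr", "th")
    # => [["Image Sensor Format", "DX"], ["Image Sensor Type", "CMOS"]]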
