An Open Platform for Collecting Domain Specific Web Pages and Extracting Information from Them
Vangelis Karkaletsis and Constantine D. Spyropoulos
NCSR Demokritos, Institute of Informatics & Telecommunications, Aghia Paraskevi Attikis, Athens, Greece {vangelis, iit.demokritos.gr/skel

Abstract. The paper presents a platform that facilitates the use of tools for collecting domain-specific web pages and for extracting information from them. It also supports the configuration of such tools to new domains and languages. The platform provides a user-friendly interface through which the user can specify the domain-specific resources (ontology, lexica, corpora for the training and testing of the tools), train the collection and extraction tools using these resources, and test the tools under various configurations. The platform design is based on the methodology proposed for web information retrieval and extraction in the context of the R&D project CROSSMARC.

1 Introduction

The growing volume of web content in various languages and formats, together with its lack of structure and its diversity, has made information and knowledge management a real challenge for the information society. Enabling large-scale information extraction (IE) from the Web is a crucial issue for the future of the Internet. The traditional approach to Web IE is to create wrappers, i.e. sets of extraction rules, either manually or automatically. At run-time, wrappers extract information from unseen collections of Web pages of known layout and fill the slots of a predefined template. The manual creation of wrappers presents many shortcomings due to the overhead of writing and maintaining them.
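As a minimal illustration of such a hand-written wrapper (a hypothetical sketch: the page layout, slot names and rules below are invented and are not CROSSMARC's rule language), each slot of the predefined template can be filled by a layout-specific extraction rule:

```python
import re

# A hand-written "wrapper" for one known page layout: each template slot
# is filled by a regular-expression extraction rule tied to that layout.
RULES = {
    "product":      re.compile(r'<h1 class="name">(.*?)</h1>'),
    "price":        re.compile(r'<span class="price">([\d.]+)</span>'),
    "manufacturer": re.compile(r'<td class="brand">(.*?)</td>'),
}

def apply_wrapper(html: str) -> dict:
    """Fill the predefined template; slots with no match stay None."""
    template = {}
    for slot, rule in RULES.items():
        match = rule.search(html)
        template[slot] = match.group(1) if match else None
    return template

page = ('<h1 class="name">TravelMate 290</h1>'
        '<td class="brand">Acer</td>'
        '<span class="price">1099.00</span>')
print(apply_wrapper(page))
# {'product': 'TravelMate 290', 'price': '1099.00', 'manufacturer': 'Acer'}
```

The maintenance problem discussed in the text is visible here: any change to the site's markup silently breaks the rules, which must then be rewritten by hand.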
On the other hand, the automatic creation of wrappers (wrapper induction) also presents problems, since re-training of the wrappers is necessary when changes occur in the formatting of the targeted Web site or when pages from a similar Web site are to be analyzed. Training an effective site-independent wrapper is an attractive solution in terms of scalability, since any domain-specific page could be processed without relying heavily on the hypertext structure.

V. Karkaletsis and C.D. Spyropoulos: An Open Platform for Collecting Domain Specific Web Pages and Extracting Information from Them, StudFuzz 185 (2005). © Springer-Verlag Berlin Heidelberg 2005

The collection of the application-specific web pages which will be processed by the wrappers is also a crucial issue. A collection mechanism is necessary for locating the application-specific web sites and identifying the interesting pages within them. The design and development of web page collection and extraction systems must consider requirements such as enabling adaptation to new domains and languages, facilitating maintenance for an existing domain, providing strategies for effective site navigation, ensuring personalized access, and handling structured, semi-structured or unstructured data. The implementation of a web page collection and extraction mechanism that effectively addresses these issues was the motivation for the R&D project CROSSMARC, which was partially funded by the EC. CROSSMARC resulted in a system for web information retrieval and extraction that can be trained for new applications and languages, and a customization infrastructure that supports configuration of the system to new domains and languages. Based on the methodology proposed in CROSSMARC, we started the development of a new platform to facilitate the use of collection and extraction tools as well as their customization. The platform provides a user-friendly interface through which the user can specify the domain-specific resources (ontology, lexica, corpora for the training and testing of the tools), train the collection and extraction tools using these resources, and test the tools under various configurations. The current version of the platform incorporates mainly CROSSMARC tools for the case studies in which it is being tested; however, its open architecture also enables the incorporation of new tools. The paper first outlines the CROSSMARC work in relation to other work in the area.
It then presents the first version of the platform, as well as some first results from its use in case studies.

2 Related Work

Collection of domain-specific web pages involves the use of focused web crawling and spidering technologies. The motivation for focused crawling comes from the poor performance of general-purpose search engines, which depend on the results of generic Web crawlers. The aim is to adapt the behavior of the search engine to the user's requirements. The term focused crawling was introduced in [1], where the system presented starts with a set of representative pages and a topic hierarchy and tries to find more instances of interesting topics in the hierarchy by following the links in the seed pages. Another interesting approach to focused crawling is adopted by the InfoSpiders system [4],
a multi-agent focused crawler which uses as starting points a set of keywords and a set of root pages. The crawler implemented in the CROSSMARC approach involves three different crawlers, which exploit topic hierarchies, keywords from domain ontologies and lexica, and a set of representative pages [8]. While in focused crawling the aim is to adapt the behavior of the search engine to the requirements of a user, in site-specific spidering the spider navigates within a Web site, following the best-scored links first. Each Web page visited is evaluated in order to decide whether it is really relevant to the topic, and its hyperlinks are scored in order to decide whether they are likely to lead to useful pages. Site-specific spidering therefore involves two decision functions: one that classifies Web pages as interesting (e.g. laptop offers) or not, and one that scores hyperlinks according to their potential usefulness. The input to the first decision function is a Web page visited by the spider and its output is a binary decision. This is a typical text classification task; various machine learning methods have been used for constructing such text classifiers, and [6] provides an up-to-date survey of these approaches. In CROSSMARC we examined a large number of classification approaches in order to find the most appropriate one for each domain and language. The second decision function in site-specific spidering is a regression function: its input is a hyperlink, together with its anchor and possibly surrounding text, and its output is a score corresponding to the probability of reaching a product page quickly through this link. As with classification, a variety of machine learning methods are available for learning regression functions.
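The best-scored-first navigation loop and its two decision functions can be sketched as follows. This is a toy, self-contained example: the keyword-based page classifier and anchor-text link scorer below are simple stand-ins for the trained classification and regression models described above, and the site is modelled as an in-memory map rather than fetched over HTTP.

```python
import heapq

def page_is_interesting(text: str) -> bool:
    """Decision function 1: binary page classification.
    Stand-in keyword rule; in practice a trained text classifier."""
    return "laptop" in text.lower() and "price" in text.lower()

def score_link(anchor_text: str) -> float:
    """Decision function 2: link scoring (a regression function).
    Stand-in heuristic over the anchor text; in practice a learned score."""
    cues = {"laptop": 0.5, "notebook": 0.4, "products": 0.3, "offers": 0.3}
    return min(1.0, sum(w for cue, w in cues.items() if cue in anchor_text.lower()))

def spider(site, start_url, threshold=0.2, max_pages=100):
    """Best-scored-first navigation within one site.
    `site` maps url -> (page_text, [(anchor_text, url), ...])  (toy model)."""
    frontier = [(-1.0, start_url)]          # max-heap via negated scores
    seen, interesting = {start_url}, []
    while frontier and len(seen) <= max_pages:
        _, url = heapq.heappop(frontier)    # visit best-scored link first
        text, links = site[url]
        if page_is_interesting(text):       # decision function 1
            interesting.append(url)
        for anchor, target in links:
            s = score_link(anchor)          # decision function 2
            if target in site and target not in seen and s >= threshold:
                seen.add(target)
                heapq.heappush(frontier, (-s, target))
    return interesting

site = {
    "/":        ("Welcome to our e-shop", [("Laptop offers", "/laptops"),
                                           ("About us", "/about")]),
    "/laptops": ("Laptop X, price: 999 EUR", []),
    "/about":   ("Company history", []),
}
print(spider(site, "/"))   # ['/laptops']
```

The threshold on link scores is what keeps the spider from exhaustively crawling the site: low-scoring links such as "About us" are never followed.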
However, in contrast to text classification, the task of hyperlink scoring has not been studied extensively in the literature. Most of the work on scoring and ordering links refers to Web-wide crawlers rather than site-specific spiders, and is based on the popularity of the pages pointed to by the links being examined. This approach is inappropriate for the spider implemented in CROSSMARC. The only closely related work identified in the literature is [5], which uses a type of simplified reinforcement learning to score the hyperlinks met by a Web spider. A reinforcement learning link scoring methodology was also examined in CROSSMARC and compared against a rule-based methodology. Concerning information extraction from web pages, a number of systems have been developed to extract structured data from web pages. A good recent survey of existing web extraction tools can be found in [3], where a classification of these tools is proposed based on the technologies used for wrapper creation or induction. According to [3], tools can be classified into the following categories:

- Languages for wrapper development: languages designed to assist the manual creation of wrappers.
- HTML-aware tools: these tools convert a web page into a tree representation that reflects the HTML tag hierarchy; extraction rules are then applied to the tree representation.
- Wrapper induction tools: these tools generate delimiter-based rules relying on page formatting features rather than linguistic ones; they present similarities with the HTML-aware tools.
- NLP-based tools: these tools employ natural language processing (NLP) techniques, such as part-of-speech tagging and phrase chunking, to learn extraction rules.
- Ontology-based tools: these tools employ a domain-specific ontology to locate ontology instances in the web page, which are then used to fill the template slots.

Fig. 1. Classification of web extraction tools (taken from [3]) and CROSSMARC's position

CROSSMARC employs most of the categories of web extraction tools presented in [3] (see Fig. 1). It uses:

- Wrapper Induction (WI) techniques, in order to exploit the formatting features of the web pages.
- NLP techniques, to exploit linguistic features of the web pages, enabling the processing of domain-specific web pages on different sites and in different languages (multilingual, site-independent).
- Ontology engineering, to enable the creation and maintenance of ontologies, language-specific lexica, and other application-specific resources.

Details on the CROSSMARC extraction tools are presented in [2]. More relevant publications can be found at the project's web site.
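The ontology-based category can be illustrated with a minimal sketch. The ontology fragment, slot names and surface forms below are invented for illustration (CROSSMARC's actual laptop ontology and lexica are far richer): ontology instances located in the page text fill the corresponding template slots.

```python
# A domain ontology fragment: each attribute (template slot) lists surface
# forms taken from a language-specific lexicon. All values are illustrative.
ONTOLOGY = {
    "processor": ["pentium 4", "pentium m", "celeron"],
    "screen":    ["15.4 inch", "14.1 inch"],
    "os":        ["windows xp", "linux"],
}

def extract_with_ontology(page_text: str) -> dict:
    """Locate ontology instances in the page and fill the template slots."""
    text = page_text.lower()
    filled = {}
    for slot, surface_forms in ONTOLOGY.items():
        hits = [form for form in surface_forms if form in text]
        if hits:
            filled[slot] = hits[0]   # keep the first instance found
    return filled

page = "New laptop with Pentium M processor, 15.4 inch TFT screen, Windows XP."
print(extract_with_ontology(page))
# {'processor': 'pentium m', 'screen': '15.4 inch', 'os': 'windows xp'}
```

Because the rules refer to domain concepts rather than HTML structure, this style of extraction is largely insensitive to page layout, which is why it combines well with the formatting-based WI techniques above.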
Fig. 2. System's agent-based architecture

3 The Platform

CROSSMARC resulted in a core system for web information retrieval and extraction, which can be trained for new applications and languages, and a customization infrastructure that supports configuration of the system to new domains and languages. The core system implements a distributed, multi-agent, open and multilingual architecture, which is depicted in Fig. 2. It involves components for the identification of interesting web sites (focused crawling), the location of domain-specific web pages within these sites (spidering), the extraction of information about product/offer descriptions from the collected web pages, and the storage and presentation of the extracted information to the end-user according to his/her preferences. The infrastructure for configuring to new domains and languages involves: an ontology management system for the creation and maintenance of the ontology, the lexicons and other ontology-related resources; a methodology and a tool for the formation of the corpus necessary for the training and testing of the modules in the spidering component; and a methodology and a tool for the collection and annotation of the corpus necessary for the training and testing of the information extraction components. Based on this work, we started the development of a platform that enables the integration, training and testing of collection and extraction tools (such as the ones developed in CROSSMARC) under a common interface. The experience gained from building three different applications using CROSSMARC tools assisted the platform design significantly. These applications concerned the extraction of information from: laptop offers in e-retailers' web sites (in four languages),
job offers in IT companies' web sites (in four languages), and holiday packages in the sites of travel agencies (in two languages).

Fig. 3. Ontology tab: invoking the ontology management system

According to the CROSSMARC methodology, the building of an application involves two main stages. The first concerns the creation of the application-specific resources using the customization infrastructure, whereas the second concerns the training of the integrated system using the application-specific resources, and the system configuration. The first stage is realized, in our platform, by the Ontology and Corpora tabs. Through the Ontology tab (see Fig. 3), the user can invoke an ontology management system in order to create or update the domain-specific ontology, the lexicons under the domain ontology, the important entities and fact types for the domain, and the user stereotype definitions according to the ontology. In the current version, the ontology management system of CROSSMARC is used. The Ontology tab also enables the user to specify the location of the ontology-related resources he/she wants to use in the next steps of the application building (see Fig. 4). Through the Corpora tab the user can perform several tasks. The user can invoke the Corpus Formation Tool (CFT), which helps users build a corpus of positive and negative pages with respect to a given domain (see Fig. 5). This corpus is then used for the training and testing of the Page Filtering component of the spidering tool. In addition, the user can specify the folder(s) where the corpora for the training and testing of the information extraction components are stored, and also invoke the annotation tool. The current version of the platform employs a different annotation tool from the one that was included in the CROSSMARC distribution. The new tool is one of those provided by the Ellogon
language engineering platform of our laboratory. However, the platform also supports the use of the CROSSMARC Web annotation tool [7].

Fig. 4. Ontology tab: specifying the ontology-related resources

Fig. 5. Corpora tab: invoking the Corpus Formation Tool

The second processing stage is realized by the Training and Extraction tabs. Through the Training tab (see Fig. 6), the user can invoke the machine-learning-based training tools for the Page Filtering, Link Scoring, and Information Extraction components. In the case of Information Extraction in particular, training involves two separate modules: the named entity recognition and classification (NERC) module and the fact extraction (FE) module. The current version of the platform employs the Ellogon-based NERC and FE modules developed by our laboratory. The platform can also support
the use of the other NERC and FE training tools developed in the context of CROSSMARC, since they all share common I/O specifications.

Fig. 6. Training tab: invoking the training tool for page filtering

Fig. 7. Extraction tab: configuring the spidering component (advanced options)

Through the Extraction tab (see Fig. 7), the user can configure and test the Crawling, Spidering and Information Extraction components. In the case of Crawling, the user can set the starting points for the crawler by editing the corresponding configuration file. In a similar way, a different crawler can be incorporated and configured for specific domains. A new crawler is currently under development and will be tested through the platform in a future case study. In the case of Spidering, the user can select the model for page filtering and link scoring (machine-learning or heuristics-based), edit the heuristics-based model, set a threshold for link scoring, and adjust several more advanced options. The user can test the components
with various configurations, view the results, and decide on the preferred configuration. Concerning Information Extraction, the user can test the NERC and FE components separately, and configure the demarcation components. In the current version, the platform supports only the NERC component. It must be noted that the outcome of using the platform is not necessarily a complete web content collection and extraction system. As shown in the case studies section, the platform user can build a crawler for a new domain, a collection system (crawler and spider), a named entity recognition system, or an information extraction system, depending on the needs of the specific task and the domain.

4 Case Studies

The current version of the platform was used for building several applications. Some of these applications are presented below, grouped according to the different tasks. The first group of applications involves the development of crawlers for an information filtering task. More specifically, the task was to develop crawlers for specific topics (English and Greek were covered) that return lists of web sites for these topics. These lists would be used to train an information filtering system. Examples of topics include web sites that provide a service to communicate (chat) with other users in real time, web sites that provide services (send/receive messages), sites with job offers, etc. In these cases, the Extraction tab of the platform was used to configure the starting points of the crawler, test it, and find the best configuration for each topic. Another group of applications concerns the development of systems collecting web pages for specific domains and languages. An example domain is personal web pages of academic staff in university departments (Greek pages were covered).
Such applications involve the training of both the crawling and the spidering components using the platform's functionality: the Ontology tab for creating the domain-specific ontology and lexica, the Corpora tab for creating the corpus for the training of page filtering, the Training tab for training the page filtering and link scoring components, and the Extraction tab for configuring and testing the crawling and spidering components. A third group of applications concerns the development of named entity recognition systems for specific domains and languages, which requires the collection and annotation of the necessary corpus and the training and testing of the system. In a similar way, information extraction systems can be developed. The final group of applications integrates the collection and extraction mechanisms, as was the case for the CROSSMARC domains. The platform,
in its current status, does not support the development of such integrated applications.

5 Concluding Remarks

The CROSSMARC project implemented a distributed, multi-agent, open and multilingual architecture for web retrieval and extraction, which integrates several components based on state-of-the-art AI technologies and commercial tools. Based on this work, we are developing a platform that enables the integration, training and testing of collection and extraction tools, such as the ones developed in CROSSMARC. A first version of this platform is currently being tested in several case studies for the development of focused crawlers, spiders, and information extraction systems. The current version employs mainly CROSSMARC tools; however, due to its open design, other tools have also been employed, and more will be integrated and tested in the near future.

References

1. Chakrabarti S., van den Berg M.H., Dom B.E.: Focused Crawling: a new approach to topic-specific Web resource discovery. Proceedings of the 8th International World Wide Web Conference, Toronto, Canada (1999)
2. Karkaletsis V., Spyropoulos C.D., Grover C., Pazienza M.T., Coch J., Souflis D.: A Platform for Cross-lingual, Domain and User Adaptive Web Information Extraction. Proceedings of the European Conference on Artificial Intelligence (ECAI), Valencia, Spain (2004)
3. Laender A., Ribeiro-Neto B., da Silva A., Teixeira J.: A Brief Survey of Web Data Extraction Tools. ACM SIGMOD Record, vol. 31(2) (2002)
4. Menczer F., Belew R.K.: Adaptive Retrieval Agents: Internalizing Local Context and Scaling up to the Web. Machine Learning, 39(2/3) (2000)
5. Rennie J., McCallum A.: Efficient Web Spidering with Reinforcement Learning. Proceedings of the 16th International Conference on Machine Learning (ICML-99) (1999)
6. Sebastiani F.: Machine learning in automated text categorization. ACM Computing Surveys, 34(1) (2002)
7. Sigletos G., Farmakiotou D., Stamatakis K., Paliouras G., Karkaletsis V.: Annotating Web pages for the needs of Web Information Extraction applications. Proceedings of the 12th International WWW Conference (Poster Session), Budapest, Hungary (2003)
8. Stamatakis K., Karkaletsis V., Paliouras G., Horlock J., Grover C., Curran J.R., Dingare S.: Domain-Specific Web Site Identification: The CROSSMARC Focused Web Crawler. Proceedings of the 2nd International Workshop on Web Document Analysis (WDA 2003), Edinburgh, UK (2003)
More informationThe Multi-courses Tutoring System Design
The Multi-courses Tutoring System Design Goran Šimić E-mail: gshimic@eunet.yu The Military educational center for signal, computer science and electronic warfare, Veljka Lukića Kurjaka 1, 11000 Belgrade,
More informationOverview. What is Information Retrieval? Classic IR: Some basics Link analysis & Crawlers Semantic Web Structured Information Extraction/Wrapping
Overview What is Information Retrieval? Classic IR: Some basics Link analysis & Crawlers Semantic Web Structured Information Extraction/Wrapping Hidir Aras, Digitale Medien 1 Agenda (agreed so far) 08.4:
More informationWhy are Organizations Interested?
SAS Text Analytics Mary-Elizabeth ( M-E ) Eddlestone SAS Customer Loyalty M-E.Eddlestone@sas.com +1 (607) 256-7929 Why are Organizations Interested? Text Analytics 2009: User Perspectives on Solutions
More informationUnderstanding Web personalization with Web Usage Mining and its Application: Recommender System
Understanding Web personalization with Web Usage Mining and its Application: Recommender System Manoj Swami 1, Prof. Manasi Kulkarni 2 1 M.Tech (Computer-NIMS), VJTI, Mumbai. 2 Department of Computer Technology,
More informationMining Text Data: An Introduction
Bölüm 10. Metin ve WEB Madenciliği http://ceng.gazi.edu.tr/~ozdemir Mining Text Data: An Introduction Data Mining / Knowledge Discovery Structured Data Multimedia Free Text Hypertext HomeLoan ( Frank Rizzo
More informationDeveloping Microsoft SharePoint Server 2013 Advanced Solutions
Course 20489B: Developing Microsoft SharePoint Server 2013 Advanced Solutions Course Details Course Outline Module 1: Creating Robust and Efficient Apps for SharePoint In this module, you will review key
More informationEktron to EPiServer Digital Experience Cloud: Information Architecture
Ektron to EPiServer Digital Experience Cloud: Information Architecture This document is intended for review and use by Sr. Developers, CMS Architects, and other senior development staff to aide in the
More informationSemWeB Semantic Web Browser Improving Browsing Experience with Semantic and Personalized Information and Hyperlinks
SemWeB Semantic Web Browser Improving Browsing Experience with Semantic and Personalized Information and Hyperlinks Melike Şah, Wendy Hall and David C De Roure Intelligence, Agents and Multimedia Group,
More informationModule Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg
Module Catalogue for the Bachelor Program in Computational Linguistics at the University of Heidelberg March 1, 2007 The catalogue is organized into sections of (1) obligatory modules ( Basismodule ) that
More informationKeywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.
Volume 4, Issue 11, November 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Analytics
More informationFlattening Enterprise Knowledge
Flattening Enterprise Knowledge Do you Control Your Content or Does Your Content Control You? 1 Executive Summary: Enterprise Content Management (ECM) is a common buzz term and every IT manager knows it
More informationSpidering and Filtering Web Pages for Vertical Search Engines
Spidering and Filtering Web Pages for Vertical Search Engines Michael Chau The University of Arizona mchau@bpa.arizona.edu 1 Introduction The size of the Web is growing exponentially. The number of indexable
More informationComputer-aided Document Indexing System
Journal of Computing and Information Technology - CIT 13, 2005, 4, 299-305 299 Computer-aided Document Indexing System Mladen Kolar, Igor Vukmirović, Bojana Dalbelo Bašić and Jan Šnajder,, An enormous
More informationA Framework of Personalized Intelligent Document and Information Management System
A Framework of Personalized Intelligent and Information Management System Xien Fan Department of Computer Science, College of Staten Island, City University of New York, Staten Island, NY 10314, USA Fang
More informationA Platform for Large-Scale Machine Learning on Web Design
A Platform for Large-Scale Machine Learning on Web Design Arvind Satyanarayan SAP Stanford Graduate Fellow Dept. of Computer Science Stanford University 353 Serra Mall Stanford, CA 94305 USA arvindsatya@cs.stanford.edu
More informationAN INTELLIGENT TUTORING SYSTEM FOR LEARNING DESIGN PATTERNS
AN INTELLIGENT TUTORING SYSTEM FOR LEARNING DESIGN PATTERNS ZORAN JEREMIĆ, VLADAN DEVEDŽIĆ, DRAGAN GAŠEVIĆ FON School of Business Administration, University of Belgrade Jove Ilića 154, POB 52, 11000 Belgrade,
More informationAbstract. Find out if your mortgage rate is too high, NOW. Free Search
Statistics and The War on Spam David Madigan Rutgers University Abstract Text categorization algorithms assign texts to predefined categories. The study of such algorithms has a rich history dating back
More informationCourse 20489B: Developing Microsoft SharePoint Server 2013 Advanced Solutions OVERVIEW
Course 20489B: Developing Microsoft SharePoint Server 2013 Advanced Solutions OVERVIEW About this Course This course provides SharePoint developers the information needed to implement SharePoint solutions
More informationText Mining and its Applications to Intelligence, CRM and Knowledge Management
Text Mining and its Applications to Intelligence, CRM and Knowledge Management Editor A. Zanasi TEMS Text Mining Solutions S.A. Italy WITPRESS Southampton, Boston Contents Bibliographies Preface Text Mining:
More informationThe Enron Corpus: A New Dataset for Email Classification Research
The Enron Corpus: A New Dataset for Email Classification Research Bryan Klimt and Yiming Yang Language Technologies Institute Carnegie Mellon University Pittsburgh, PA 15213-8213, USA {bklimt,yiming}@cs.cmu.edu
More informationFiltering Noisy Contents in Online Social Network by using Rule Based Filtering System
Filtering Noisy Contents in Online Social Network by using Rule Based Filtering System Bala Kumari P 1, Bercelin Rose Mary W 2 and Devi Mareeswari M 3 1, 2, 3 M.TECH / IT, Dr.Sivanthi Aditanar College
More informationSpamNet Spam Detection Using PCA and Neural Networks
SpamNet Spam Detection Using PCA and Neural Networks Abhimanyu Lad B.Tech. (I.T.) 4 th year student Indian Institute of Information Technology, Allahabad Deoghat, Jhalwa, Allahabad, India abhimanyulad@iiita.ac.in
More informationDATA MINING TECHNIQUES AND APPLICATIONS
DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,
More informationSEARCH ENGINE OPTIMIZATION
SEARCH ENGINE OPTIMIZATION WEBSITE ANALYSIS REPORT FOR miaatravel.com Version 1.0 M AY 2 4, 2 0 1 3 Amendments History R E V I S I O N H I S T O R Y The following table contains the history of all amendments
More informationFacilitating Business Process Discovery using Email Analysis
Facilitating Business Process Discovery using Email Analysis Matin Mavaddat Matin.Mavaddat@live.uwe.ac.uk Stewart Green Stewart.Green Ian Beeson Ian.Beeson Jin Sa Jin.Sa Abstract Extracting business process
More informationDesign and Development of an Ajax Web Crawler
Li-Jie Cui 1, Hui He 2, Hong-Wei Xuan 1, Jin-Gang Li 1 1 School of Software and Engineering, Harbin University of Science and Technology, Harbin, China 2 Harbin Institute of Technology, Harbin, China Li-Jie
More informationRecognition and Privacy Preservation of Paper-based Health Records
Quality of Life through Quality of Information J. Mantas et al. (Eds.) IOS Press, 2012 2012 European Federation for Medical Informatics and IOS Press. All rights reserved. doi:10.3233/978-1-61499-101-4-751
More informationWeb Data Extraction: 1 o Semestre 2007/2008
Web Data : Given Slides baseados nos slides oficiais do livro Web Data Mining c Bing Liu, Springer, December, 2006. Departamento de Engenharia Informática Instituto Superior Técnico 1 o Semestre 2007/2008
More informationUsing Semantic Data Mining for Classification Improvement and Knowledge Extraction
Using Semantic Data Mining for Classification Improvement and Knowledge Extraction Fernando Benites and Elena Sapozhnikova University of Konstanz, 78464 Konstanz, Germany. Abstract. The objective of this
More informationMIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts
MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts Julio Villena-Román 1,3, Sara Lana-Serrano 2,3 1 Universidad Carlos III de Madrid 2 Universidad Politécnica de Madrid 3 DAEDALUS
More informationBILINGUAL TRANSLATION SYSTEM
BILINGUAL TRANSLATION SYSTEM (FOR ENGLISH AND TAMIL) Dr. S. Saraswathi Associate Professor M. Anusiya P. Kanivadhana S. Sathiya Abstract--- The project aims in developing Bilingual Translation System for
More informationONLINE RESUME PARSING SYSTEM USING TEXT ANALYTICS
ONLINE RESUME PARSING SYSTEM USING TEXT ANALYTICS Divyanshu Chandola 1, Aditya Garg 2, Ankit Maurya 3, Amit Kushwaha 4 1 Student, Department of Information Technology, ABES Engineering College, Uttar Pradesh,
More informationARTIFICIAL INTELLIGENCE METHODS IN EARLY MANUFACTURING TIME ESTIMATION
1 ARTIFICIAL INTELLIGENCE METHODS IN EARLY MANUFACTURING TIME ESTIMATION B. Mikó PhD, Z-Form Tool Manufacturing and Application Ltd H-1082. Budapest, Asztalos S. u 4. Tel: (1) 477 1016, e-mail: miko@manuf.bme.hu
More informationFramework for Intelligent Crawler Engine on IaaS Cloud Service Model
International Journal of Information & Computation Technology. ISSN 0974-2239 Volume 4, Number 17 (2014), pp. 1783-1789 International Research Publications House http://www. irphouse.com Framework for
More informationDeveloping Microsoft SharePoint Server 2013 Advanced Solutions MOC 20489
Developing Microsoft SharePoint Server 2013 Advanced Solutions MOC 20489 Course Outline Module 1: Creating Robust and Efficient Apps for SharePoint In this module, you will review key aspects of the apps
More informationIII. DATA SETS. Training the Matching Model
A Machine-Learning Approach to Discovering Company Home Pages Wojciech Gryc Oxford Internet Institute University of Oxford Oxford, UK OX1 3JS Email: wojciech.gryc@oii.ox.ac.uk Prem Melville IBM T.J. Watson
More informationDeveloping Microsoft SharePoint Server 2013 Advanced Solutions
Course 20489B: Developing Microsoft SharePoint Server 2013 Advanced Solutions Page 1 of 9 Developing Microsoft SharePoint Server 2013 Advanced Solutions Course 20489B: 4 days; Instructor-Led Introduction
More informationExperiments in Web Page Classification for Semantic Web
Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address
More informationUtilising Ontology-based Modelling for Learning Content Management
Utilising -based Modelling for Learning Content Management Claus Pahl, Muhammad Javed, Yalemisew M. Abgaz Centre for Next Generation Localization (CNGL), School of Computing, Dublin City University, Dublin
More informationText Opinion Mining to Analyze News for Stock Market Prediction
Int. J. Advance. Soft Comput. Appl., Vol. 6, No. 1, March 2014 ISSN 2074-8523; Copyright SCRG Publication, 2014 Text Opinion Mining to Analyze News for Stock Market Prediction Yoosin Kim 1, Seung Ryul
More informationApplication of ontologies for the integration of network monitoring platforms
Application of ontologies for the integration of network monitoring platforms Jorge E. López de Vergara, Javier Aracil, Jesús Martínez, Alfredo Salvador, José Alberto Hernández Networking Research Group,
More informationBridging CAQDAS with text mining: Text analyst s toolbox for Big Data: Science in the Media Project
Bridging CAQDAS with text mining: Text analyst s toolbox for Big Data: Science in the Media Project Ahmet Suerdem Istanbul Bilgi University; LSE Methodology Dept. Science in the media project is funded
More informationSearch and Information Retrieval
Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search
More informationOptimised Realistic Test Input Generation
Optimised Realistic Test Input Generation Mustafa Bozkurt and Mark Harman {m.bozkurt,m.harman}@cs.ucl.ac.uk CREST Centre, Department of Computer Science, University College London. Malet Place, London
More informationMulti-agent System for Web Advertising
Multi-agent System for Web Advertising Przemysław Kazienko 1 1 Wrocław University of Technology, Institute of Applied Informatics, Wybrzee S. Wyspiaskiego 27, 50-370 Wrocław, Poland kazienko@pwr.wroc.pl
More informationNATURAL LANGUAGE QUERY PROCESSING USING PROBABILISTIC CONTEXT FREE GRAMMAR
NATURAL LANGUAGE QUERY PROCESSING USING PROBABILISTIC CONTEXT FREE GRAMMAR Arati K. Deshpande 1 and Prakash. R. Devale 2 1 Student and 2 Professor & Head, Department of Information Technology, Bharati
More informationOntology-Based Discovery of Workflow Activity Patterns
Ontology-Based Discovery of Workflow Activity Patterns Diogo R. Ferreira 1, Susana Alves 1, Lucinéia H. Thom 2 1 IST Technical University of Lisbon, Portugal {diogo.ferreira,susana.alves}@ist.utl.pt 2
More informationA QoS-Aware Web Service Selection Based on Clustering
International Journal of Scientific and Research Publications, Volume 4, Issue 2, February 2014 1 A QoS-Aware Web Service Selection Based on Clustering R.Karthiban PG scholar, Computer Science and Engineering,
More informationBuilding A Smart Academic Advising System Using Association Rule Mining
Building A Smart Academic Advising System Using Association Rule Mining Raed Shatnawi +962795285056 raedamin@just.edu.jo Qutaibah Althebyan +962796536277 qaalthebyan@just.edu.jo Baraq Ghalib & Mohammed
More informationCLASSIFICATION AND CLUSTERING METHODS IN THE DECREASING OF THE INTERNET COGNITIVE LOAD
Acta Electrotechnica et Informatica No. 2, Vol. 6, 2006 1 CLASSIFICATION AND CLUSTERING METHODS IN THE DECREASING OF THE INTERNET COGNITIVE LOAD Kristína MACHOVÁ, Ivan KLIMKO Department of Cybernetics
More informationMOVING MACHINE TRANSLATION SYSTEM TO WEB
MOVING MACHINE TRANSLATION SYSTEM TO WEB Abstract GURPREET SINGH JOSAN Dept of IT, RBIEBT, Mohali. Punjab ZIP140104,India josangurpreet@rediffmail.com The paper presents an overview of an online system
More informationHELP DESK SYSTEMS. Using CaseBased Reasoning
HELP DESK SYSTEMS Using CaseBased Reasoning Topics Covered Today What is Help-Desk? Components of HelpDesk Systems Types Of HelpDesk Systems Used Need for CBR in HelpDesk Systems GE Helpdesk using ReMind
More informationTrends in corpus specialisation
ANA DÍAZ-NEGRILLO / FRANCISCO JAVIER DÍAZ-PÉREZ Trends in corpus specialisation 1. Introduction Computerised corpus linguistics set off around the 1960s with the compilation and exploitation of the first
More informationNAVIGATING SCIENTIFIC LITERATURE A HOLISTIC PERSPECTIVE. Venu Govindaraju
NAVIGATING SCIENTIFIC LITERATURE A HOLISTIC PERSPECTIVE Venu Govindaraju BIOMETRICS DOCUMENT ANALYSIS PATTERN RECOGNITION 8/24/2015 ICDAR- 2015 2 Towards a Globally Optimal Approach for Learning Deep Unsupervised
More informationSchema documentation for types1.2.xsd
Generated with oxygen XML Editor Take care of the environment, print only if necessary! 8 february 2011 Table of Contents : ""...........................................................................................................
More information