Gestão e Tratamento da Informação




Web Data Extraction. Departamento de Engenharia Informática, Instituto Superior Técnico, 1st Semester 2007/2008. Slides based on the official slides of the book Web Data Mining © Bing Liu, Springer, December 2006.

Outline: 1. Introduction; 2. The Data Model; 3. Wrapper Induction


Introduction. A large amount of the information on the Web is contained in regularly structured data objects, often data records retrieved from underlying databases. Such Web data records are important: for example, lists of products and services. Applications include comparative shopping, meta-search, meta-query, etc. Processing this data implies extracting it from semi-structured Web pages and converting it to structured information. This is done by using wrappers.

Wrappers. Wrappers are small applications or scripts capable of extracting information from semi-structured sources, such as HTML. Wrappers can be built manually (e.g., using regular expressions, XSLT, ...) or automatically, using machine learning techniques.
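As a sketch of the manual route, a hand-written wrapper can be little more than a regular expression over the page source. The HTML layout and field names below are hypothetical, chosen only to illustrate the idea:

```python
import re

# Hypothetical semi-structured product listing (not a real page).
html = """
<div class="item"><h3>Chomsky On Anarchism</h3><span>$11.53</span></div>
<div class="item"><h3>Manufacturing Consent</h3><span>$11.78</span></div>
"""

# One capture group per attribute; the surrounding tags act as delimiters.
pattern = re.compile(r"<h3>(.*?)</h3><span>\$([\d.]+)</span>")

# Each match yields one structured record (title, price).
records = [(title, float(price)) for title, price in pattern.findall(html)]
print(records)
```

Such wrappers are quick to write but brittle: any change in the page layout breaks the pattern, which motivates the automatic approaches below.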

Web Data Pages. There are two types of data-rich pages. List pages: each such page contains one or more lists of data records, each list in a specific region of the page; the data records come in two forms, flat and nested. Detail pages: each such page focuses on a single object, but can contain a lot of related and unrelated information.

A List Page

A Detail Page

Extraction Results
Title                       | Author(s)             | Price  | Rating
Chomsky On Anarchism        | Noam Chomsky, ...     | $11.53 | 4
Manufacturing Consent       | Edward S. Herman, ... | $11.78 | 4
11 de septiembre ...        | Noam Chomsky          | $8.95  | 3.5
Imperial Grand Strategy ... | Noam Chomsky          | $17.99 | 5

Outline: 1. Introduction; 2. The Data Model; 3. Wrapper Induction

The Data Model. Most Web data can be modeled as nested relations: typed objects allowing nested sets and tuples. An instance of a type T is simply an element of the domain of T. Definition: Let B = {B1, B2, ..., Bk} be a set of basic types; each Bi is atomic, with domain dom(Bi). If T1, T2, ..., Tn are types, then [T1, T2, ..., Tn] is a tuple type with domain dom([T1, T2, ..., Tn]) = { [v1, v2, ..., vn] | vi ∈ dom(Ti) }. If T is a tuple type, then {T} is a set type with domain dom({T}) = 2^dom(T).

An Example Nested Tuple. Classic flat relations are of un-nested (flat) set types; nested relations are of arbitrary set types. An example tuple type: book (title: string; author: set (name: string); prices: set (use: string; price: number)). An example tuple: ["Manufacturing Consent", {"Edward S. Herman", "Noam Chomsky"}, {["used", 4.78], ["new", 8.95]}]
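The example nested tuple can be written down directly as plain Python data. This is only a sketch of the model: Python tuples stand for tuple instances and Python sets for set instances, with names mirroring the slide:

```python
# The book instance from the slide, as nested plain data.
book = (
    "Manufacturing Consent",               # title: string
    {"Edward S. Herman", "Noam Chomsky"},  # author: set (name: string)
    {("used", 4.78), ("new", 8.95)},       # prices: set (use: string; price: number)
)

title, authors, prices = book
assert "Noam Chomsky" in authors           # set membership: one author
assert ("new", 8.95) in prices             # nested tuple inside a set
```

The nesting mirrors the type tree: the outer tuple has one child per attribute, and each set contains elements of a single element type.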

Type Trees. Definition: A basic type Bi is a leaf. A tuple type [T1, T2, ..., Tn] is a tree rooted at a tuple node with n sub-trees, one for each Ti. A set type {T} is a tree rooted at a set node with one sub-tree. We introduce a labeling of a type tree, defined recursively: if a set node is labeled Φ, then its child is labeled Φ.0; if a tuple node is labeled Φ, then its n children are labeled Φ.1, ..., Φ.n.

An Example Type Tree
tuple (book)
    string (title)
    set (author)
        string (name)
    set (prices)
        tuple (price)
            string (use)
            number (price)
Note: attribute names are not part of the type tree.

Instance Trees. Definition: An instance (constant) of a basic type is a leaf. A tuple instance [v1, v2, ..., vn] forms a tree rooted at a tuple node with n children (sub-trees) representing the attribute values v1, v2, ..., vn. A set instance {e1, e2, ..., en} forms a set node with n children (sub-trees) representing the set elements e1, e2, ..., en. Note: a tuple instance is usually called a data record.

An Example Instance Tree
tuple
    Manufacturing Consent
    set
        Edward S. Herman
        Noam Chomsky
    set
        tuple
            used
            4.78
        tuple
            new
            8.95

HTML Encoding of Data. There are no designated tags for each type: HTML was not designed as a data encoding language, so any HTML tag can be used for any type. For a tuple type, the values (also called data items) of different attributes are usually encoded differently, to distinguish them and to highlight important items. A tuple may be partitioned into several groups (sub-tuples); each group covers a disjoint subset of attributes and may be encoded differently.

HTML Encoding and Type Trees. Definition: For a leaf node of a basic type Φ, an instance c is encoded as enc(Φ : c) = OPEN-TAGS c CLOSE-TAGS, where OPEN-TAGS is a sequence of HTML open tags and CLOSE-TAGS is a sequence of HTML close tags.

HTML Encoding and Type Trees (cont.) A tuple node of type [Φ.1, Φ.2, ..., Φ.n] is partitioned into h groups: Φ.1, ..., Φ.e; Φ.(e+1), ..., Φ.g; ...; Φ.(k+1), ..., Φ.n. An instance [v1, v2, ..., vn] is encoded as: enc(Φ : [v1, v2, ..., vn]) = OPEN-TAGS1 enc(v1) ... enc(ve) CLOSE-TAGS1 OPEN-TAGS2 enc(v(e+1)) ... enc(vg) CLOSE-TAGS2 ... OPEN-TAGSh enc(v(k+1)) ... enc(vn) CLOSE-TAGSh

HTML Encoding and Type Trees (cont.) For a set node labeled Φ, a non-empty set instance {e1, e2, ..., en} is encoded as: enc(Φ : {e1, e2, ..., en}) = OPEN-TAGS enc(e1) enc(e2) ... enc(en) CLOSE-TAGS
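Taken together, the three cases above amount to a small recursive encoder. The sketch below assumes one illustrative tag scheme; the concrete tags, and the use of Python lists for set instances (to keep element order deterministic), are assumptions, not part of the definition:

```python
def enc(node):
    """Encode an instance tree as HTML per the enc(...) scheme (sketch)."""
    if isinstance(node, str):    # leaf: OPEN-TAGS value CLOSE-TAGS
        return f"<b>{node}</b>"
    if isinstance(node, tuple):  # tuple instance: here one group per attribute
        return "".join(enc(v) for v in node)
    if isinstance(node, list):   # set instance: wrap each encoded element
        return "<ul>" + "".join(f"<li>{enc(e)}</li>" for e in node) + "</ul>"
    raise TypeError(f"unsupported node: {node!r}")

book = ("Manufacturing Consent",
        ["Edward S. Herman", "Noam Chomsky"],
        [("used", "4.78"), ("new", "8.95")])
print(enc(book))
```

Different choices of OPEN-TAGS/CLOSE-TAGS per node yield different pages for the same instance tree, which is exactly why extraction has to learn the encoding rather than assume it.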

An Example
<p>
  <h3>Manufacturing Consent</h3>
  <ul>
    <li>Edward S. Herman</li>
    <li>Noam Chomsky</li>
  </ul>
  <ul>
    <li><strong>used</strong> <it>4.78</it></li>
    <li><strong>new</strong> <it>8.95</it></li>
  </ul>
</p>

Limitations. Clearly, this mark-up encoding does not cover all cases found in Web pages. In fact, each group of a tuple type can be further divided. Moreover, in an actual Web page the encoding may not be done by HTML tags alone: words and punctuation marks can also be used.

Outline: 1. Introduction; 2. The Data Model; 3. Wrapper Induction

Wrappers are generated by using machine learning to learn extraction rules: the user marks the target items in a few training pages, the system learns extraction rules from these pages, and the rules are then applied to extract items from other pages. There are many wrapper induction systems: WIEN (Kushmerick et al., IJCAI-97), Softmealy (Hsu and Dung, 1998), BWI (Freitag and Kushmerick, AAAI-00), WL2 (Cohen et al., WWW-02). We will focus on Stalker (Muslea et al., Agents-99).

Stalker: A Hierarchical System. Stalker performs hierarchical wrapper learning: extraction is isolated at different levels of the hierarchy, which makes it suitable for nested data records, and each item is extracted independently of the others. The extraction is done using a tree structure called the EC tree (embedded catalog tree), which is based on the type tree of the data. To extract each target item (a node), the wrapper needs a rule that extracts the item from its parent.

Extraction Rules. Each extraction is done using two rules: a start rule and an end rule. The start rule identifies the beginning of the node and the end rule identifies its end. This strategy is applicable both to leaf nodes (which represent data items) and to list nodes. For a list node, list iteration rules are additionally needed to break the list into individual data records (tuple instances).

Landmarks. The extraction rules are based on the idea of landmarks. A landmark is a sequence of consecutive tokens. Landmarks are used to locate the beginning and the end of a target item; the rules use landmarks to extract the items.

An Example
<p>Seller Name:<b>Good Books Inc.</b><br><br>
<li>321 Red Street, <i>Mullen</i>, Phone 0-<i>485</i>-3463-1766</li>
<li>62 Blue Street, <i>Finner</i>, Phone (621) 2176-2771</li>
<li>543 Yellow Street, <i>Pitsher</i>, Phone 0-<i>788</i>-3452-4638</li>
<li>321 Orange Street, <i>Kingston</i>, Phone: (333)-6755-7891</li></p>
To extract the seller name, rule R1 can identify the beginning: R1: SkipTo(<b>). This rule means that the system should start from the beginning of the page and skip all tokens until it sees the first <b> tag; <b> is a landmark. Similarly, to identify the end of the seller name, we use: R2: SkipTo(</b>)
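The SkipTo rules above can be sketched with a simple tokenizer and a landmark search. Both the tokenizer and the skip_to helper are assumptions for illustration, not Stalker's actual implementation:

```python
import re

def tokenize(page):
    # Tokens are HTML tags, words/numbers, or single punctuation marks.
    return re.findall(r"<[^>]+>|\w+|[^\w\s]", page)

def skip_to(tokens, start, landmark):
    """Index just past the first occurrence of `landmark` (a token
    sequence) at or after `start`, or None if it never occurs."""
    n = len(landmark)
    for i in range(start, len(tokens) - n + 1):
        if tokens[i:i + n] == landmark:
            return i + n
    return None

page = "<p>Seller Name:<b>Good Books Inc.</b><br><br>"
tokens = tokenize(page)
begin = skip_to(tokens, 0, ["<b>"])         # R1: SkipTo(<b>) finds the start
end = skip_to(tokens, begin, ["</b>"]) - 1  # R2: SkipTo(</b>) finds the end
print(tokens[begin:end])
```

Running the start rule from the page beginning and the end rule from the start position yields the token span of the seller name.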

An Example (cont.) A rule may not be unique. For example, we can also use the following rules to identify the beginning of the name: R3: SkipTo(Name Punctuation HtmlTag), or R4: SkipTo(Name) SkipTo(<b>). R3 means that we skip everything until the word Name followed by a punctuation symbol and then an HTML tag. In this case, Name Punctuation HtmlTag together is a landmark; Punctuation and HtmlTag are wildcards.

Learning Extraction Rules. Stalker uses sequential covering to learn extraction rules for each target item. In each iteration, it learns a perfect rule that covers as many positive examples as possible without covering any negative example. Once a positive example is covered by a rule, it is removed. The algorithm ends when all positive examples are covered; the result is an ordered list of all learned rules.

The Stalker Algorithm
Algorithm LearnRules(examples)
1. rule ← ∅
2. while examples ≠ ∅ do
   a. disjunct ← LearnDisjunct(examples)
   b. remove all items in examples covered by disjunct
   c. add disjunct to rule
3. return rule

The Stalker Algorithm (cont.)
Algorithm LearnDisjunct(examples)
1. seed ← the shortest example
2. candidates ← GetInitialCandidates(seed)
3. while candidates ≠ ∅ do
   a. D ← BestDisjunct(candidates)
   b. if D is a perfect disjunct then return D
   c. remove D from candidates
   d. candidates ← candidates ∪ Refine(D, seed)
4. return D

The Stalker Algorithm (cont.)
GetInitialCandidates(seed): return the tokens t that immediately precede the example, and the wildcards that match t.
BestDisjunct(candidates): return the candidate with more correct matches, fewer false positives, fewer wildcards, and longer landmarks.
Refine(D, seed): specialize D by adding more terminals.
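The LearnRules/LearnDisjunct loop can be sketched at toy scale. This sketch simplifies heavily, and every detail is an assumption for illustration: examples are (token_list, target_index) pairs, candidates are single-landmark SkipTo rules taken from the token preceding the target, and there is no wildcard generation or Refine step:

```python
def applies(landmark, tokens, target):
    """True if SkipTo(landmark) from the page start lands exactly at target."""
    n = len(landmark)
    for i in range(len(tokens) - n + 1):
        if tokens[i:i + n] == landmark:
            return i + n == target    # SkipTo stops at the first occurrence
    return False

def learn_disjunct(examples):
    # Seed: the shortest uncovered example; the only initial candidate here
    # is the literal token immediately preceding the target.
    seed_tokens, seed_target = min(examples, key=lambda e: len(e[0]))
    candidates = [[seed_tokens[seed_target - 1]]]
    # BestDisjunct reduced to: most correct matches over the examples.
    return max(candidates,
               key=lambda d: sum(applies(d, t, pos) for t, pos in examples))

def learn_rules(examples):
    rule = []                        # ordered list of learned disjuncts
    while examples:
        d = learn_disjunct(examples)
        remaining = [e for e in examples if not applies(d, *e)]
        if len(remaining) == len(examples):
            break                    # no progress; give up on the rest
        rule.append(d)
        examples = remaining         # covered examples are removed
    return rule

examples = [(["Ph", "(", "621", ")"], 2), (["Ph", "(", "333", ")"], 2)]
print(learn_rules(examples))
```

On this toy data the single landmark "(" covers both examples in the first iteration, so the learned rule has one disjunct.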

An Example: Extracting Area Codes
E1: <li>321 Red Street, <i>Mullen</i>, Ph 0-<i>485</i>-3463-1766</li>
E2: <li>62 Blue Street, <i>Finner</i>, Ph (621) 2176-2771</li>
E3: <li>543 Yellow Street, <i>Pitsher</i>, Ph 0-<i>788</i>-3452-4638</li>
E4: <li>321 Orange Street, <i>Kingston</i>, Ph: (333)-6755-7891</li>
The following candidate disjuncts are generated: D1: SkipTo(() and D2: SkipTo(Punctuation). D1 is selected by BestDisjunct, and it is a perfect disjunct, so the first iteration of LearnRules() ends; E2 and E4 are removed.

An Example (cont.)
E1: <li>321 Red Street, <i>Mullen</i>, Ph 0-<i>485</i>-3463-1766</li>
E3: <li>543 Yellow Street, <i>Pitsher</i>, Ph 0-<i>788</i>-3452-4638</li>
LearnDisjunct() will select E1 as the seed. Two candidates are then generated: D3: SkipTo(<i>) and D4: SkipTo(HtmlTag). Both candidates match too early in the uncovered examples E1 and E3, and thus cannot uniquely locate the positive items; refinement is needed.

Refinement. Refinement specializes a disjunct by adding more terminals to it; a terminal is a token or one of its matching wildcards. The hope is that the refined version will uniquely identify the positive items in some examples without matching any negative item in any example. There are two types of refinement: landmark refinement and topology refinement.

Landmark Refinement. Landmark refinement increases the size of a landmark by concatenating a terminal.
E1: <li>321 Red Street, <i>Mullen</i>, Ph 0-<i>485</i>-3463-1766</li>
E3: <li>543 Yellow Street, <i>Pitsher</i>, Ph 0-<i>788</i>-3452-4638</li>
D5: SkipTo(-<i>)
D6: SkipTo(Punctuation <i>)
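Generating refinements like D5 and D6 from D3 can be sketched as a step that prepends either the literal preceding token or a wildcard that matches it. The wildcard table, the helper, and the simplification of refining at the first match in the seed are all assumptions for illustration:

```python
import re

# Illustrative wildcard classes: a token matches a wildcard if its predicate
# returns True.
WILDCARDS = {
    "Punctuation": lambda t: re.fullmatch(r"[^\w\s]", t) is not None,
    "Numeric":     lambda t: t.isdigit(),
    "HtmlTag":     lambda t: t.startswith("<") and t.endswith(">"),
}

def refine_landmark(landmark, seed_tokens):
    """Return specialized landmarks: one per way of naming the token that
    immediately precedes the first match of `landmark` in the seed."""
    n = len(landmark)
    for i in range(1, len(seed_tokens) - n + 1):
        if seed_tokens[i:i + n] == landmark:
            prev = seed_tokens[i - 1]
            refined = [[prev] + landmark]            # literal preceding token
            refined += [[name] + landmark            # wildcards matching it
                        for name, match in WILDCARDS.items() if match(prev)]
            return refined
    return []

# Token context around the target in E1; refining SkipTo(<i>) here yields
# the analogues of D5 (-<i>) and D6 (Punctuation <i>).
print(refine_landmark(["<i>"], ["Ph", "0", "-", "<i>", "485", "</i>"]))
```

Each refinement makes the landmark longer and therefore more selective, at the cost of covering fewer pages.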

Topology Refinement. Topology refinement increases the number of landmarks by adding 1-terminal landmarks.
E1: <li>321 Red Street, <i>Mullen</i>, Ph 0-<i>485</i>-3463-1766</li>
E3: <li>543 Yellow Street, <i>Pitsher</i>, Ph 0-<i>788</i>-3452-4638</li>
D7: SkipTo(321) SkipTo(<i>)
D8: SkipTo(Red) SkipTo(<i>)
D9: SkipTo(Street) SkipTo(<i>)
D10: SkipTo(,) SkipTo(<i>)
D11: SkipTo(<i>) SkipTo(<i>)
...
D19: SkipTo(Numeric) SkipTo(<i>)
D20: SkipTo(Punctuation) SkipTo(<i>)

Final Results
E1: <li>321 Red Street, <i>Mullen</i>, Ph 0-<i>485</i>-3463-1766</li>
E2: <li>62 Blue Street, <i>Finner</i>, Ph (621) 2176-2771</li>
E3: <li>543 Yellow Street, <i>Pitsher</i>, Ph 0-<i>788</i>-3452-4638</li>
E4: <li>321 Orange Street, <i>Kingston</i>, Ph: (333)-6755-7891</li>
Using BestDisjunct, D5 is selected as the final solution, as it has the longest last landmark: D5: SkipTo(-<i>). Since all the examples are now covered, LearnRules() returns the disjunctive (start) rule combining D1 and D5: R7: either SkipTo(() or SkipTo(-<i>)

Wrapper Maintenance. Wrapper verification: if the site changes, does the wrapper know about the change? Wrapper repair: if a change is correctly detected, how can the wrapper be repaired automatically? One way to deal with both problems is to learn the characteristic patterns of the target items; these patterns are then used to monitor the extraction and check whether the extracted items are correct. These tasks are extremely difficult: they often need contextual and semantic information to detect changes and to find the new locations of the target items. Wrapper maintenance is still an active research area.

Questions?