Web-Scale Extraction of Structured Data Michael J. Cafarella, Jayant Madhavan & Alon Halevy




The Deep Web: Surfacing Hidden Value (Michael K. Bergman)
Web-Scale Extraction of Structured Data (Michael J. Cafarella, Jayant Madhavan & Alon Halevy)
Presented by Mat Kelly, CS895 Web-based Information Retrieval, Old Dominion University, September 27, 2011

Papers' Contributions
- Bergman attempts various methods of estimating the size of the Deep Web
- Cafarella proposes concrete methods of extracting deep-web data, estimates its size more reliably, and offers a surprising caveat about the estimation

What Is the Deep Web?
- Pages that do not exist in search engines; created dynamically as the result of a search
- Much larger than the surface web (400-550x): 7,500 TB (deep) vs. 19 TB (surface) [in 2001]
- Information resides in databases
- 95% of the information is publicly accessible

Estimating the Size
Analysis procedure for more than 100 known deep-web sites:
1. Webmasters were queried for record count and storage size; 13% responded
2. Some sites explicitly stated their database size without the need for webmaster assistance
3. Site sizes were compiled from lists provided at conferences
4. A site's own search capability was used with a term known not to exist, e.g. NOT ddfhrwxxct
5. If the size was still unknown, the site was not analyzed

Further Attempts at Size Estimation: Overlap Analysis
- Compare (pair-wise) random listings from two independent sources
- Repeat pair-wise with all previously collected sources known to contain deep-web content
- From the commonality of the listings, the total size can then be estimated: the fraction of one source's listings that also appear in the other approximates that source's coverage of the whole, so total size = (src1 listings x src2 listings) / (shared listings)
- Provides a lower bound on the size of the deep web, since the source list is incomplete
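The overlap analysis above is the classic capture-recapture estimator. A minimal sketch, with made-up listing counts for illustration:

```python
def overlap_estimate(n_src1: int, n_src2: int, n_shared: int) -> float:
    """Capture-recapture estimate of total population size.

    If src1 and src2 are independent random samples of the deep web,
    the fraction of src2's listings also found in src1 approximates
    src1's coverage of the whole: n_shared / n_src2 ~ n_src1 / total,
    so total ~ n_src1 * n_src2 / n_shared.
    """
    if n_shared == 0:
        raise ValueError("sources share no listings; estimate undefined")
    return n_src1 * n_src2 / n_shared

# Hypothetical example: two directories of 100 deep-web sites each
# share 20 listings, implying roughly 500 deep-web sites in total.
print(overlap_estimate(100, 100, 20))  # -> 500.0
```

Since real directories are not truly independent random samples (popular sites appear in both), the shared count is inflated and the estimate skews low, which is why the slide calls it a lower bound.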

Further Attempts at Size Estimation: Multiplier on Average Site's Size
- From a listing of 17,000 candidate sites, 700 were randomly selected; 100 of these could be fully characterized
- Randomized queries were issued to these 100 sites, with results returned as HTML pages; the mean page size was calculated and used for the estimate
- Pipeline: 17K queried deep websites, 700 randomly chosen, 100 fully characterizable sites used, results pages produced and analyzed
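The multiplier step itself is a simple scale-up: take the mean database size over the characterized sample and multiply by the candidate-list size. A toy sketch, where the sample sizes are randomly generated stand-ins, not Bergman's measurements:

```python
import random
import statistics

# All numbers below are made up for illustration only.
random.seed(42)
NUM_CANDIDATE_SITES = 17_000

# Pretend these are the measured database sizes (MB) of the 100
# fully characterized sample sites.
sample_sizes_mb = [random.lognormvariate(3.5, 1.2) for _ in range(100)]

mean_site_mb = statistics.mean(sample_sizes_mb)
estimated_total_mb = mean_site_mb * NUM_CANDIDATE_SITES
print(f"mean site: {mean_site_mb:.1f} MB, "
      f"estimated total: {estimated_total_mb / 1e6:.2f} TB")
```

The weakness, as the pipeline suggests, is attrition: only 100 of 17,000 candidates could be fully characterized, and any bias in which sites are characterizable propagates directly into the total.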

Other Methods Used for Estimation
- Pageviews ("What's Related" on Alexa) and link references
- Growth analysis obtained from Whois: from 100 surface and 100 deep websites, the date each site was established was acquired, then combined and plotted to add time as a factor in the estimation

Overall Findings from the Various Analyses
- The mean deep website has a web-expressed database (HTML included) of 74.4 MB
- Actual record counts can be derived from one in seven deep websites
- On average, deep websites receive half as much monthly traffic as surface websites
- The median deep website receives more than two times the traffic of a random surface website

The Follow-up Paper: Web-Scale Extraction of Structured Data
Three systems used for extracting deep-web data:
- TextRunner
- WebTables
- Deep-Web Surfacing (relevant to Bergman)
By using these methods, the data can be aggregated for use in other services, e.g.:
- Synonym finding
- Schema auto-complete
- Type prediction

TextRunner
- Parses natural-language text from crawls into n-ary tuples, e.g. "Albert Einstein was born in 1879" becomes the tuple <Einstein, 1879> with the was_born_in relation
- This has been done before, but TextRunner:
  - Works in batch mode: consumes an entire crawl and produces a large amount of data
  - Pre-computes good extractions before queries arrive and aggressively indexes them
  - Discovers relations on the fly, where other systems use preprogrammed relations
  - Other methods are query-driven and perform all of the work on demand
[Slide figure: searching the extracted tuples returns Argument 1: Einstein, Argument 2: 1879, Predicate: born, from the sentence "Albert Einstein was born in 1879."]
Demo: http://bit.ly/textrunner
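The extraction step can be illustrated with a toy stand-in. The real TextRunner uses a trained extractor over parsed text; the single hard-coded pattern below is purely illustrative of the (arg1, predicate, arg2) tuple shape:

```python
import re

# Toy pattern for one relation; TextRunner itself discovers relations
# on the fly rather than hard-coding them like this.
BORN_IN = re.compile(r"(\w[\w .]*?) was born in (\d{4})")

def extract_tuples(text: str):
    """Yield (arg1, predicate, arg2) triples found in raw text."""
    for match in BORN_IN.finditer(text):
        yield (match.group(1), "was_born_in", match.group(2))

crawl_text = "Albert Einstein was born in 1879. Alan Turing was born in 1912."
print(list(extract_tuples(crawl_text)))
# -> [('Albert Einstein', 'was_born_in', '1879'),
#     ('Alan Turing', 'was_born_in', '1912')]
```

In batch mode, a pass like this runs over the whole crawl up front and the resulting tuples are indexed, so queries hit precomputed extractions instead of triggering extraction on demand.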

TextRunner's Accuracy

Trial            Corpus size (pages)   Tuples extracted   Accuracy
Early trial      9 million             1 million
Follow-up trial  500 million           900 million        results not yet available

Source: http://turing.cs.washington.edu/papers/banko-thesis.pdf

Downsides of TextRunner
- Text-centric extractors rely on binary relations in language (two nouns and a linking relation)
- Unable to extract data that conveys relations in table form (but WebTables [next] can)
- Because relations are analyzed on the fly, the output model is not relational, e.g. we cannot know that Einstein is a human attribute and 1879 a birth year

WebTables
- Designed to extract data from content within HTML's table tag
- Ignores calendars, single cells, and tables used as the basis for site design
- A general crawl of 14.1B tables contains 154M true relational databases (1.1%)

How Does WebTables Work?
1. Throw out tables with a single cell, calendars, and those used for layout; accomplished with hand-written detectors
2. Label the remaining tables as relational or non-relational using statistically trained classifiers; classification is based on the number of rows, columns, empty cells, number of columns with numeric-only data, etc.
[Slide figure: a calendar grid (days 1-30, discarded) contrasted with relational data, a small table of Groups 1-4 against Trials 1-3 with numeric measurements]
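A minimal sketch of the filtering idea, using hand-written thresholds in place of WebTables' trained classifier (the feature set mirrors the slide: rows, columns, empty cells, numeric-only columns; the thresholds are invented):

```python
def table_features(rows):
    """Compute simple features for a table given as a list of rows,
    where each row is a list of cell strings."""
    n_rows = len(rows)
    n_cols = max(len(r) for r in rows)
    cells = [c for r in rows for c in r]
    empty_frac = sum(1 for c in cells if not c.strip()) / len(cells)
    numeric_cols = sum(
        1 for i in range(n_cols)
        if all(r[i].replace(".", "", 1).isdigit()
               for r in rows if i < len(r) and r[i].strip())
    )
    return n_rows, n_cols, empty_frac, numeric_cols

def looks_relational(rows, min_rows=2, min_cols=2, max_empty=0.5):
    """Hand-written heuristic standing in for the trained classifier."""
    n_rows, n_cols, empty_frac, _ = table_features(rows)
    return n_rows >= min_rows and n_cols >= min_cols and empty_frac <= max_empty

layout_table = [["nav"], ["footer"]]           # single-column layout scaffolding
data_table = [["Group", "Trial 1", "Trial 2"],
              ["Group 1", "9.5", "5.2"],
              ["Group 2", "10", "12"]]
print(looks_relational(layout_table), looks_relational(data_table))  # -> False True
```

The real system trains the second-stage classifier on human-labeled tables rather than fixing thresholds by hand, which is what lets it trade off the 81% recall against 41% output precision reported on the next slide.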

WebTables' Accuracy
- The procedure retains 81% of the truly relational databases in the input corpus, though only 41% of the output is relational (superfluous data)
- 271M relations are produced, including 125M of the raw input's 154M true relations (and 146M false ones)

Downsides of WebTables
- Does not recover multi-table databases
- Traditional database constraints (e.g. key constraints) cannot be expressed with the table tag
- Metadata is difficult to distinguish from table contents
  - A second trained classifier can be run to determine whether metadata exists
  - Human-marked filtering of true relations indicates 71% have metadata
  - The secondary classifier performs well, with a precision of 89% and recall of 85%

Two Approaches to Obtaining Access to Deep-Web Databases
1. Create vertical search engines for specific domains (e.g. cars, books), with a semantic mapping and a mediator for each domain
   - Not scalable
   - Difficult to identify the domain-query mapping
2. Surfacing: pre-compute relevant form submissions, then index the resulting HTML
   - Leverages current search infrastructure

Surfacing Deep-Web Databases
1. Select values for each input in the form; trivial for select menus, challenging for text boxes
2. Perform an enumeration of the inputs; simple enumeration is wasteful and unscalable
Text inputs fall into one of two categories:
1. Generic inputs that accept most keywords
2. Typed text inputs that only accept values in a particular domain

Enumerating Generic Inputs
- Examine the page for good candidate keywords to bootstrap an iterative probing process
- When keywords produce valid results, obtain more keywords from the results page
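The iterative probing loop above can be sketched as follows. The `submit_form` helper is a hypothetical stand-in for posting a keyword to the form and scraping words off the results page, stubbed here with a fake database so the example runs standalone:

```python
# Fake results pages keyed by query keyword, for illustration only.
FAKE_DB = {
    "jazz": ["jazz", "blues", "swing"],
    "blues": ["blues", "soul"],
    "swing": ["swing", "jazz"],
    "soul": ["soul"],
}

def submit_form(keyword):
    """Stand-in for submitting `keyword` and scraping the results page."""
    return FAKE_DB.get(keyword, [])

def probe(seed_keywords, max_rounds=5):
    """Bootstrap from seed keywords, harvesting new keywords from each
    valid results page until no new keywords appear."""
    tried, frontier = set(), set(seed_keywords)
    for _ in range(max_rounds):
        new = set()
        for kw in frontier - tried:
            tried.add(kw)
            new.update(submit_form(kw))  # harvest candidate keywords
        frontier |= new
        if frontier <= tried:            # nothing new left to try
            break
    return tried

print(sorted(probe(["jazz"])))  # -> ['blues', 'jazz', 'soul', 'swing']
```

Seeding from words already on the form's page gives the loop domain-relevant starting points, so it converges on the vocabulary the underlying database actually answers.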

Selecting Input Combinations
- Crawling forms with multiple inputs is expensive and not scalable
- Introduced notion: the input template
  - Given a set of binding inputs, a template is the set of all form submissions using only the Cartesian product of the binding inputs' values
  - Keeping only the informative templates in a form results in only a few hundred form submissions per form
- The number of form submissions is proportional to the size of the database underlying the form, NOT to the number of inputs and their possible combinations
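Template enumeration can be sketched with `itertools.product`. The form fields and candidate values below are hypothetical, chosen only to show the shape of one template over two binding inputs:

```python
from itertools import product

# Hypothetical form: the first two fields carry data, the third only
# affects presentation and would not be chosen as a binding input.
form_inputs = {
    "make": ["honda", "toyota"],
    "zip": ["10001", "94103"],
    "sort_by": ["price", "year"],  # presentation-only; not binding
}

def enumerate_template(binding):
    """All form submissions for one template over `binding` inputs;
    non-binding inputs are left at their defaults (omitted here)."""
    names = sorted(binding)
    for values in product(*(form_inputs[n] for n in names)):
        yield dict(zip(names, values))

submissions = list(enumerate_template({"make", "zip"}))
print(len(submissions))  # 2 makes x 2 zips -> 4 submissions
```

Choosing only informative binding inputs is what keeps the count near the size of the underlying database: a template over k small value sets yields their product, not the product of every input on the form.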

Extraction Caveats
- Semantics are lost when using only the results pages
- Annotations: a future challenge is to find the right kind of annotation that can be used by the IR-style index most effectively

In Summary
- The Deep Web is large, much larger than the surface web
- Bergman gave various means of estimating the deep web's size and some methods of accomplishing this
- Cafarella et al. provided a much more structured approach to surfacing the content, not just to estimate its magnitude but also to integrate its contents
- Cafarella suggests a better way to estimate deep-web size, independent of the number of fields and their possible combinations

References
Bergman, M. K. (2001). The Deep Web: Surfacing Hidden Value. Journal of Electronic Publishing 7, 1-17. Available at: http://www.press.umich.edu/jep/07-01/bergman.html
Cafarella, M. J., Madhavan, J., and Halevy, A. (2009). Web-scale extraction of structured data. ACM SIGMOD Record 37, 55. Available at: http://portal.acm.org/citation.cfm?doid=1519103.1519112