Web Data Management - Some Issues

Similar documents

Abstract 1. INTRODUCTION

Search and Information Retrieval

Principles of Distributed Database Systems

Chapter-1 : Introduction 1 CHAPTER - 1. Introduction

Make search become the internal function of Internet

CMDB Federation (CMDBf) Frequently Asked Questions (FAQ) White Paper

How To Use Windows Live Family Safety On Windows 7 (32 Bit) And Windows Live Safety (64 Bit) On A Pc Or Mac Or Ipad (32)

HELP DESK SYSTEMS. Using CaseBased Reasoning

A Comparative Approach to Search Engine Ranking Strategies

Distributed Computing and Big Data: Hadoop and MapReduce

MOBICIP NAME. Mobicip. Company. Version Client. Type of product. Computer. Devices supported. Linux Ubuntu Windows 7 (Tested on Windows)

Sentiment Analysis. D. Skrepetos 1. University of Waterloo. NLP Presenation, 06/17/2015

Indexing Full Packet Capture Data With Flow

WEBVIEW An SQL Extension for Joining Corporate Data to Data Derived from the World Wide Web

The Data Grid: Towards an Architecture for Distributed Management and Analysis of Large Scientific Datasets

XML Processing and Web Services. Chapter 17

Data Grids. Lidan Wang April 5, 2007

Data Integration and Exchange. L. Libkin 1 Data Integration and Exchange

Sage CRM Connector Tool White Paper

Generic Log Analyzer Using Hadoop Mapreduce Framework

D5.3.2b Automatic Rigorous Testing Components

Report on the Dagstuhl Seminar Data Quality on the Web

AN ENHANCED DATA MODEL AND QUERY ALGEBRA FOR PARTIALLY STRUCTURED XML DATABASE

A SURVEY ON WEB MINING TOOLS

User's voice OPTENET WEBFILTER PC. Comprehensibility: Test user did not provide any comments. Look and Feel: Time to install and configure: 45 minutes

Web Mining. Margherita Berardi LACAM. Dipartimento di Informatica Università degli Studi di Bari

A Survey on Product Aspect Ranking

Ranked Keyword Search Using RSE over Outsourced Cloud Data

Bisecting K-Means for Clustering Web Log data

Our SEO services use only ethical search engine optimization techniques. We use only practices that turn out into lasting results in search engines.

A COMPREHENSIVE REVIEW ON SEARCH ENGINE OPTIMIZATION

Data Mining System, Functionalities and Applications: A Radical Review

A Framework-based Online Question Answering System. Oliver Scheuer, Dan Shen, Dietrich Klakow

Web Application Hosting Cloud Architecture

CYBERSITTER NAME. LLC/Solid Oak Software. Company. Version Client. Type of product. Computer. Devices supported

Natural Language to Relational Query by Using Parsing Compiler

User's voice. Safe Eyes. User 1: I was surprised how the tool was easy to install and configure. Comprehensibility:

Objectives. Distributed Databases and Client/Server Architecture. Distributed Database. Data Fragmentation

An Enhanced Framework For Performing Pre- Processing On Web Server Logs

Workload Characterization and Analysis of Storage and Bandwidth Needs of LEAD Workspace

Fig (1) (a) Server-side scripting with PHP. (b) Client-side scripting with JavaScript.

Revealing Trends and Insights in Online Hiring Market Using Linking Open Data Cloud: Active Hiring a Use Case Study

A Survey Study on Monitoring Service for Grid

Technologies for a CERIF XML based CRIS

How to Choose the Right Data Storage Format for Your Measurement System

Complex Event Processing (CEP) Why and How. Richard Hallgren BUGS

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH

A collaborative platform for knowledge management

High-Volume Data Warehousing in Centerprise. Product Datasheet

Optimization of Internet Search based on Noun Phrases and Clustering Techniques

User's voice NET NANNY. Quite simple to install and configure. Improve messages and possible reactions when a page is blocked. Comprehensibility:

Search Result Optimization using Annotators

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Introduction to Data Mining

How To Build A Connector On A Website (For A Nonprogrammer)

Architecture of an Ontology-Based Domain- Specific Natural Language Question Answering System

Bayesian Spam Filtering

Towards a Context-Aware IDE-Based Meta Search Engine for Recommendation about Programming Errors and Exceptions

An Intelligent Approach for Integrity of Heterogeneous and Distributed Databases Systems based on Mobile Agents

AN INTEGRATION APPROACH FOR THE STATISTICAL INFORMATION SYSTEM OF ISTAT USING SDMX STANDARDS

Enhancing the relativity between Content, Title and Meta Tags Based on Term Frequency in Lexical and Semantic Aspects

Web Service Based Data Management for Grid Applications

DomainPlex API Documentation

REVIEW PAPER ON PERFORMANCE OF RESTFUL WEB SERVICES

Dr. Anuradha et al. / International Journal on Computer Science and Engineering (IJCSE)

Key Components of WAN Optimization Controller Functionality

META DATA QUALITY CONTROL ARCHITECTURE IN DATA WAREHOUSING

Preprocessing Web Logs for Web Intrusion Detection

Bazaarvoice SEO implementation guide

Big Systems, Big Data

A SYSTEM FOR AUTOMATIC QUERY EXPANSION IN A BROWSER-BASED ENVIRONMENT

Data Driven Success. Comparing Log Analytics Tools: Flowerfire s Sawmill vs. Google Analytics (GA)

A Survey on Preprocessing of Web Log File in Web Usage Mining to Improve the Quality of Data

Monitoring Web Browsing Habits of User Using Web Log Analysis and Role-Based Web Accessing Control. Phudinan Singkhamfu, Parinya Suwanasrikham

CLOUD COMPUTING SECURITY ARCHITECTURE - IMPLEMENTING DES ALGORITHM IN CLOUD FOR DATA SECURITY

Transcription:

Web Data Management - Some Issues

Properties of Web Data Lack of a schema Data is at best semi-structured Missing data, additional attributes, similar data but not identical Volatility Changes frequently May conform to one schema now, but not later Scale Does it make sense to talk about a schema for Web? How do you capture everything? Querying difficulty What is the user language? What are the primitives? Aren t search engines or metasearch engines sufficient? Web Data Management 2

Outline Distribution Models Modeling Issues Web Data Integration Web Querying Web Caching Web Data Management 3

Data Delivery on the Internet Properties of information supply It is very large in volume It is highly heterogeneous May not have a properly defined schema Data available from too many devices and in streaming fashion Data stream systems Properties of information consumption It is data intensive Use of large data sets is common It requires access to diverse data sources Existing databases and/or repositories must somehow be glued together Application integration Web Data Management 4

Data Delivery Alternatives Delivery Mode Hybrid Push-only Pull-only Periodic Unicast Conditional Irregular Frequency Communication Method One-to-many Web Data Management 5

Web Data Modeling Can t depend on a strict schema to structure the data Data are self-descriptive {name: {first: Tamer, last: Ozsu }, institution: University of Waterloo, salary: 300000} Usually represented as an edge-labeled graph XML can also be modeled this way name institution salary first last UW 300000 Tamer Ozsu Web Data Management 6

Web Data Integration What is being integrated? Integration or interoperation? More flexible architectures Barriers to joining and leaving federations should be minimum Participants should be able to maintain their own environments as much as possible Application as well as data integration Role of XML Data model Exchange format Flexible operation Ability to deal with data inconsistencies as well as schema inconsistencies Systems should be able to deal with failures and incomplete federations Web Data Management 7

Approaches to Web Querying Search engines and metasearchers Keyword-based Category-based Information integration Semistructured data querying Special Web query languages Learning-based systems Question-Answering Web Data Management 8

Information Integration Basic principle: Integrate part of the Web data into a database as either virtual or materialized views and query over these views Example systems: Information Manifold [Levy et al., 1996] Araneus [Atzeni et al., 1997] WSQ/DSQ [Goldman & Widom, 2000] Web Data Management 9

Evaluation Advantages Well-understood Well-known database techniques can be brought to bear Disadvantages Not querying the entire Web; more querying some data on the Web Does not scale well; integration methodology should be low overhead Web Data Management 10

Semistructured Data Querying Basic principle: Consider Web as a collection of semistructured data and use those techniques Uses an edge-labeled graph model of data Example systems & languages: Lore/Lorel [Abiteboul et al., 1997] UnQL [Buneman et al., 1996] StruQL [Fernandez et al., 1997] Web Data Management 11

Lorel Example Select Find addresses zip codes of all restaurants cheap restaurants with zip code 92310 Select Guide.restaurant(.address)?.zipcode Guide.restaurant.address Where Guide.restaurant.% Guide.restaurant.address.zipcode grep cheap = 92310 Web Data Management 12

Evaluation Advantages Simple and flexible Fits the natural link structure of Web pages Disadvantages Data model too simple (no record construct or ordered lists) Graph can become very complicated Aggregation and typing combined DataGuides No differentiation between connection between documents and subpart relationships Web Data Management 13

Web Query Languages Basic principle: Take into account the documents content and internal structure as well as external links The graph structures are more complex Examples WebSQL [Mendelzon et al., 1996] W3QS [Kanopnicki & Shmueli, 1995] WebLog [Lakshmanan et al., 1993] WebOQL [Arocena & Mendelzon, 1999] StruQL [Fernandez et al., 1997] First Generation Second Generation Web Data Management 14

WebOQL Example Find, in the cspapers database, all the papers authored by Smith and extract their title and URL of the full version of the papers. select [y.title, y.url] from x in cspapers, y in x where y.authors ~ ``Smith Web Data Management 15

Evaluation Advantages More powerful data model - Hypertree Ordered edge-labeled tree Internal and external arcs Language can exploit different arc types (structure of the Web pages can be accessed) Languages can construct new complex structures. Disadvantages You still need to know the graph structure Complexity issue Web Data Management 16

Learning-Based Approaches Basic principle: Learn what the user s intent is from the query and find the data Some based on NLP, others metasearch systems; agent technology and mining-based Examples: InfoSpider [Menczer & Below, 1998] WebWatcher [Joachims & Freitag, 1997] Fab [Balabanovic, 1997] Syskill & Webert [Pazzani et al., 1996] WebSifter II [Kerschberg et al., 2001] Web Data Management 17

Question-Answer Approach Basic principle: Web pages that could contain the answer to the user query are retrieved and the answer extracted from them. NLP and information extraction techniques Used within IR in a closed corpus; extensions to Web Examples QASM [Radev et al., 2001] Ask Jeeves Mulder [Kwok et al, 2001] WebQA [Lam & Özsu, 2002] Web Data Management 18

WebQA Objectives Query the entire Web Return actual answers, not URLs Scale with additional data sources Accept fuzziness (precision/recall) Do not depend on existence of a schema Web Data Management 19

Interaction with Web Server Web Data Management 20

Query Parser Categorizer NL question WebQAL Generator Query category Valid WebQAL WebQAL WebQAL Checker Query can be NL or WebQAL (internal language) <Category> [-output <output option>] -keywords <keyword list> Elimination of stopwords Categorization Name Place Time Quantity Abbreviation Weather Other Very light weight NLP Rule-based Only categorization 99% accurate on TREC-9 questions Web Data Management 21

Summary Retriever WebQAL Keyword Generator Keyword list Record Retriever Wrapper Source Ranker Record Consolidator/ Ranker Ranked source list Record Retriever Wrapper Keyword generator: a set of keywords for sources Source ranker: identify more promising sources for this query Record consolidator/ranker: submits keywords and consolidates results Source Name Snippet Local Rank Ranking algorithm Google 1 Ranked Records list Record retriever:retrieves records from wrappers Wrapper: uniform API Submit Next HTTP Data Source HTTP Search Engine Web Data Management 22

Answer Extraction List of ranked records Candidate Retriever Rearranger Candidate identification Rule-based based on the category Candidates are scored based on frequency of occurrence in records Rearranger takes care of anomalies Score(Bell) > Score(Alexander Graham Bell) Output Converter Top ten answers (user readable) Web Data Management 23

Evaluation Using TREC-9 List of 693 questions and a list of documents Answers should be 50-byte or 250-byte passage, not exact answers Ranked score between 0 (worst) and 1 (best) Score = 1/n where n is the rank of the correct answer Two measures Accuracy Efficiency Web Data Management 24

Experiment 1 Run TREC-9 queries directly against the search engines 0.5 0.4 0.3 0.2 0.1 0 Name Place Time Quantity Abbreviation Other Overall AllTheWeb AltaVista Ask Jeeves Excite Google Overture Teoma Yahoo Web Data Management 25

Experiment 2 Run TREC-9 queries through WebQA; use CIA Fact Book 2001 as secondary 1 0.8 0.6 0.4 0.2 0 Name Place Time Quantity Abbreviation Other Overall AllTheWeb AltaVista AskJeeves Excite Google Overture Teoma Yahoo Web Data Management 26

Experiment 3 Run TREC-9 queries using best combination 1 0.9 0.8 0.7 0.6 0.5 0.4 0.3 0.2 0.1 0 Name (Y/E) Place (F/Y/O) Time (F/Y/E) Quantity (F/Y/E) Abbreviation (F/Y/A) Other (F/Y/O) Overall Web Data Management 27

WebQA Performance Local Execution Time Remote Execution Time 10000 8000 (sec) 6000 4000 2000 0 Name Place Time Quantity Abbreviation Other Overall Web Data Management 28

Web Caching Storing objects at places through which a users request passes. browser cache, proxy cache, server cache Merits of Web caching: reduces network traffic reduces client latency reduces server load Caching architectures With respect to location Hierarchical, distributed Cache consistency issues Weak cache consistency algorithms Strong cache consistency algorithms Web Data Management 29

Classification Client- Validation Server Invalidation C/S Interaction Strong Pollingevery-time Invalidation Lease Weak TTL,PCV PSI N/A Web Data Management 30

Why Strong Consistency? Motivating Examples Online Shopping Store Travel Tickets Reservation Stock Quotes Online Auction Web Data Management 31