XML Data Integration using Fragment Join




Jian Gong, David W. Cheung, Nikos Mamoulis, and Ben Kao
Department of Computer Science, The University of Hong Kong, Pokfulam, Hong Kong, China
{jgong,dcheung,nikos,kao}@cs.hku.hk

Abstract. We study the problem of answering XML queries over multiple data sources under a schema-independent scenario where XML schemas and schema mappings are unavailable. We develop the fragment join operator, a general operator that merges two XML fragments based on their overlapping components. We formally define the operator and propose an efficient algorithm for implementing it. We define schema-independent query processing over multiple data sources and propose a novel framework to solve this problem. We provide theoretical analysis and experimental results that show that our approaches are both effective and efficient.

1 Introduction

Data integration allows global queries to be answered by data that is distributed among multiple heterogeneous data sources [1]. Through a unified query interface, global distributed queries are processed as if they were done on a single integrated data source. To achieve data integration, schema mapping is often used, which consists of a set of mapping rules that define the semantic relationship between the global schema and the local schemas (at the data sources). In these systems, such as Clio [2], processing a global query typically involves two steps: query rewriting and data merging. While much work has been done on query rewriting, very little has been done on data merging. In most existing approaches, data merging is mostly an ad hoc computation: a special data merging routine is custom-coded for each mapping rule. This approach leads to inflexible system design. In this paper we propose a schema-independent framework that allows data merging to be processed without referring to any specific schema mapping rules.

Let us illustrate our idea by an example. Figure 1(a) shows two XML documents taken from the UA Cinema website and the IMDB website, respectively. Both UA and IMDB contain the title and the director of each movie. In addition, UA contains venue and price, while IMDB contains the movie's reviews. Consider a user who wants to find out the title, director, price, and review for each movie. This is expressed by the twig pattern query shown in Figure 1(b). Note that neither UA nor IMDB can answer the query alone because UA lacks reviews and IMDB lacks pricing information.
The (global) query thus has to be broken into two query fragments, one for each site. The returned results from the two sites should then be merged based on their common components. Figure 1(c) shows an example of the query result. Our goal is to answer such twig pattern queries in a schema-independent fashion where mapping rules are not needed.

Fig. 1. Query on sample XML documents and the results. (a) Sample XML documents. (b) A query twig pattern. (c) Sample match of the query. (d) Projected queries. (e) Join XML fragments to obtain the query result.

In our approach, we join data fragments based on their overlapping content in order to answer queries. For example, we first project the global query on the two XML documents and obtain two local queries (Figure 1(d)). Then, we retrieve XML fragments m(t, d, p) from UA and m(t, d, r) from IMDB. Afterward, we join these fragments based on their overlapping parts, which are title (t) and director (d) (Figure 1(e)).

2 Preliminaries

An XML document $D$ is a rooted, node-labeled tree $D = \langle N, E, r \rangle$, wherein $N$ is a node set, $E \subseteq N \times N$ is an edge set, and $r \in N$ is the root node. Each node in an XML document has a label and may contain some text. The vocabulary of an XML document $d$, denoted by $v(d)$, is the set of distinct node labels of $d$.

Definition 1. (XML FRAGMENT) An XML fragment $f$ is an edge-labeled XML document, where each edge is labeled by either / (parent-child edge) or // (ancestor-descendant edge). An XML fragment $f$ is a fragment of an XML document $d$, denoted as $f \sqsubseteq d$, if there exists an injective mapping $\lambda : f.N \to d.N$, such that: (i) $\forall n \in f.N$, $n = \lambda(n)$, and (ii) $\forall e(n_1, n_2) \in f.E$ labeled as / (resp., //), $\lambda(n_1)$ is the parent (resp., ancestor) of $\lambda(n_2)$.

Definition 2. (TWIG PATTERN AND MATCH) A twig pattern is an XML fragment, where the text content of the nodes is disregarded. A fragment $f$ is a match to a twig pattern $q$, denoted as $f \models q$, if there exists a mapping $\gamma : q.N \to f.N$, such that the node labels and edges of $q$ are preserved in $f$. A fragment $f_1$ is contained in another fragment $f_2$, denoted as $f_1 \subseteq f_2$, if all the nodes and edges of $f_1$ are contained in $f_2$.

Definition 3.
(PROJECTION) Given a fragment $f$ and the vocabulary $v(d)$ of a document $d$, the projection of $f$ on $v(d)$, denoted as $\rho_{v(d)}(f)$, is obtained by removing from $f$ all the nodes whose labels are not in $v(d)$ and the corresponding connecting edges.

3 The fragment join operator

Definition 4. (FRAGMENT JOIN) Given a set of fragments $f_1, \dots, f_n$ ($n \ge 2$), a fragment $f$ is a join of $f_1, \dots, f_n$, denoted as $\bowtie(f_1, \dots, f_n) \to f$, if there exist $f'_1 \subseteq f, \dots, f'_n \subseteq f$ such that: 1) $f'_i = f_i$, $1 \le i \le n$, 2) $\forall n \in f.N$, $n \in f'_1.N \cup \dots \cup f'_n.N$, and 3) $\forall e \in f.E$, $e \in f'_1.E \cup \dots \cup f'_n.E$. In addition, the join set of $f_1, \dots, f_n$ is a set of fragments $F = \{ f \mid \bowtie(f_1, \dots, f_n) \to f \}$, denoted as $\bowtie(f_1, \dots, f_n) \Rightarrow F$.

Definition 5. (JOINT SUB-TREE) Given two fragments $f_1$ and $f_2$, a subtree $js$ is a joint sub-tree of $f_1$ and $f_2$ if (1) $js \subseteq f_1$, $js \subseteq f_2$, and (2) the root of $js$ = the root of $f_2$.

Fig. 2. XML fragment join on different joint sub-trees. (a) The fragment $f_1$ of $d_1$, $f_2$ of $d_2$. (b) Joint nodes and corresponding join results.

Figure 2(b) shows the five results of the fragment join between $f_1$ and $f_2$ shown in Figure 2(a). Each of these results is based on a joint sub-tree, whose nodes are pointed to by double-arrowed dashed lines in the two fragments. We propose Algorithm 1 for evaluating the fragment join of two fragments $f_1$ and $f_2$. For example, consider the first join result shown in Figure 2(b). The joint sub-tree for this join result consists of a lone node. The boundary nodes are the children of the root node in $f_2$ (underlined in the figure). The subtrees of these boundary nodes are attached to the matching node in $f_1$, forming the join result.

4 Schema-independent, query-based data integration

Our research problem is formally stated as follows: given XML documents $d_1$ and $d_2$, and a twig pattern query $q$, compute $F = \{ f \mid f \models q;\ \bowtie(f_1, f_2) \to f;\ f_1 \sqsubseteq d_1;\ f_2 \sqsubseteq d_2 \}$. Our approach to solve this problem consists of the following phases.

Projection. The query $q$ is rewritten into local queries $q_1 = \rho_{v(d_1)}(q)$ and $q_2 = \rho_{v(d_2)}(q)$ using the project operator (Section 2). We then apply the fragment join operator on $q_1$ and $q_2$ to find a joint sub-tree $js$ for which the join result is $q$.

Matching. Two sets of fragments $F_1$ and $F_2$ are returned, which contain all matches to the local query $q_1$ in $d_1$ and all matches to the local query $q_2$ in $d_2$, respectively¹.

Join. For each pair of fragments $(f_1, f_2) \in F_1 \times F_2$, we compute the fragment join of $f_1$ and $f_2$ using the joint sub-tree obtained in the projection phase. The join results are returned as the query's answer.

¹ We thank the authors of [3] for providing us with the implementation of TwigList, used as a module for evaluating twig queries in our work.
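The projection and join steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Node`, `project`, and `fragment_join` are hypothetical, labels are assumed to be unique within a fragment so that joint nodes can be paired by label, joint nodes are required to carry equal text, every edge is treated as a parent-child edge, and projection drops a removed node together with its sub-tree. The sample values follow the movie example of Figure 1.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Node:
    """A fragment node: a label, optional text, and child sub-trees."""
    label: str
    text: Optional[str] = None
    children: List["Node"] = field(default_factory=list)


def copy_tree(n: Node) -> Node:
    """Deep-copy a fragment so that the inputs are never modified."""
    return Node(n.label, n.text, [copy_tree(c) for c in n.children])


def find(n: Node, label: str) -> Optional[Node]:
    """Return the first node carrying `label` in pre-order, or None."""
    if n.label == label:
        return n
    for c in n.children:
        hit = find(c, label)
        if hit is not None:
            return hit
    return None


def project(q: Node, vocab: Set[str]) -> Optional[Node]:
    """Keep only nodes whose labels occur in `vocab`.

    Simplifying assumption: a removed node takes its sub-tree with it.
    """
    if q.label not in vocab:
        return None
    kept = [p for p in (project(c, vocab) for c in q.children) if p is not None]
    return Node(q.label, q.text, kept)


def fragment_join(f1: Node, f2: Node, joint: Set[str]) -> Optional[Node]:
    """Join f1 and f2 on the joint sub-tree whose node labels are `joint`.

    Returns None if a joint node is missing or its text differs.
    """
    f = copy_tree(f1)
    pairs = []
    for label in sorted(joint):
        x1, x2 = find(f, label), find(f2, label)
        if x1 is None or x2 is None or x1.text != x2.text:
            return None
        pairs.append((x1, x2))
    for x1, x2 in pairs:
        for c in x2.children:
            if c.label not in joint:  # boundary node: attach its sub-tree
                x1.children.append(copy_tree(c))
    return f


if __name__ == "__main__":
    ua = Node("movie", None, [Node("title", "The Godfather"),
                              Node("director", "Francis Coppola"),
                              Node("price", "5")])
    imdb = Node("movie", None, [Node("title", "The Godfather"),
                                Node("director", "Francis Coppola"),
                                Node("review", "very good")])
    joined = fragment_join(ua, imdb, {"movie", "title", "director"})
    print([c.label for c in joined.children])
    # prints ['title', 'director', 'price', 'review']
```

Pairing the joint nodes before attaching any sub-tree keeps the lookup from seeing nodes that were added during the join. A fuller version would enumerate all joint sub-trees and distinguish / from // edges.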

Algorithm 1 The join evaluation algorithm
Input: XML fragments f1 and f2
Output: a set of XML fragments F, with the joint sub-trees used for each f in F
 1: JS <- enumerateJointSubtrees(f1, f2)
 2: for all js in JS do
 3:   f <- join(f1, f2, js)
 4:   output (f, js)
 5: end for
 6: repeat 1-5 with f1 and f2 exchanged, if necessary

function join(f1, f2, js)
 1: f <- copy(f1)
 2: for all x in js.N do
 3:   let x1, x2 be the corresponding nodes of x in f1 and f2, respectively
 4:   for all x2's child c do
 5:     if c not in js.N then
 6:       sf <- constructFragment(f2, c)
 7:       addChild(f, x1, sf)
 8:     end if
 9:   end for
10: end for
11: return f

Figure 3 illustrates our approach (the found joint sub-tree contains the underlined nodes).

Fig. 3. Query answering from multiple data sources: projection, matching, and join. (a) Two XML documents. (b) The twig pattern query and the Projection-Matching-Join query answering process.

We note that projecting a global query onto local sources so that one single local query is applied to each source may not be sufficient to retrieve the complete set of query results. For example, consider again query $q$ in Figure 3. We observe that joining $q_{11}$, a sub-twig pattern of $q_1$ containing two nodes and the edge between them, with $q_2$ also gives us $q$ (using a joint sub-tree). Therefore, in order to ensure that all valid query results are found, we should consider all pairs of sub-twig patterns of $q_1$ and $q_2$ that can form $q$.

Definition 6. (RECOVERABILITY) Given a twig pattern $q$, a pair of twig patterns $(q_i, q_j)$ is recoverable for $q$, denoted as $(q_i, q_j) \rightsquigarrow q$, if $\bowtie(q_i, q_j) \to q$ using some joint sub-tree $js$; else, $(q_i, q_j)$ is non-recoverable for $q$, denoted as $(q_i, q_j) \not\rightsquigarrow q$.

We add two more schema-level phases to the Projection-Matching-Join framework, in order to ensure completeness of the query results.

Decomposition. After the projection phase in which local queries $q_1$ and $q_2$ are derived, the decomposition phase returns: $Q_1 = \{ q_i \mid q_i \subseteq q_1 \}$, and $Q_2 = \{ q_j \mid q_j \subseteq q_2 \}$.

Recoverability checking. After the decomposition phase, this phase returns: $\{ (q_i, q_j) \mid (q_i, q_j) \in Q_1 \times Q_2 \wedge (q_i, q_j) \rightsquigarrow q \}$.

5 Experimental evaluation and conclusion

We use DBLP and CiteSeer datasets in our experiments. The raw CiteSeer data are in plain text BibTeX format. We converted them into an XML file having a similar schema to that of DBLP data. The size of the CiteSeer dataset is 15MB. We randomly sample the original DBLP (13MB) dataset to extract the publication records and attributes, and obtain five DBLP datasets, whose sizes are: 1MB, 1MB, 2MB, 4MB, and 8MB, respectively. Thus, we have five pairs of datasets used for the queries, each consisting of the CiteSeer dataset plus one of the sampled DBLP datasets. We manually created four test twig pattern queries, named Q1-Q4, each of which queries a set of attributes of papers. All these queries can only be answered using both the DBLP and CiteSeer datasets (but not one of the two datasets alone) by fragment join in our framework.

Fig. 4. Overall performance of PDRMJ for all queries and datasets (one panel per query, Q1-Q4).

The overall performance of our complete, optimized approach (PDRMJ) is tested in Figure 4 for all queries Q1-Q4 on all datasets. The overall response time is broken down into two parts: (i) the time spent by all sub-twig pattern queries issued against the different sources, and (ii) the time spent by the fragment joins. We observe that the performance for all queries scales roughly linearly with the size of the DBLP dataset (recall that the size of the CiteSeer dataset is fixed). In addition, nearly half of the cost is due to the twig pattern queries against the sources.

In conclusion, we developed a fragment join operator for query-based data integration from multiple sources. We studied the problem of schema-independent data integration based on this operator. We conducted experiments to show the effectiveness of our approaches.

References

1. Lenzerini, M.: Data integration: a theoretical perspective. In: PODS. (2002)
2. Yu, C., Popa, L.: Constraint-based XML query rewriting for data integration. In: SIGMOD. (2004)
3. Qin, L., Yu, J.X., Ding, B.: TwigList: make twig pattern matching fast. In: DASFAA. (2007)