XML Data Integration using Fragment Join

Jian Gong, David W. Cheung, Nikos Mamoulis, and Ben Kao
Department of Computer Science, The University of Hong Kong
Pokfulam, Hong Kong, China
{jgong,dcheung,nikos,kao}@cs.hku.hk

Abstract. We study the problem of answering XML queries over multiple data sources under a schema-independent scenario where XML schemas and schema mappings are unavailable. We develop the fragment join operator, a general operator that merges two XML fragments based on their overlapping components. We formally define the operator and propose an efficient algorithm for implementing it. We define schema-independent query processing over multiple data sources and propose a novel framework to solve this problem. We provide theoretical analysis and experimental results that show that our approaches are both effective and efficient.

1 Introduction

Data integration allows global queries to be answered by data that is distributed among multiple heterogeneous data sources [1]. Through a unified query interface, global distributed queries are processed as if they were issued on a single integrated data source. To achieve data integration, a schema mapping is often used, which consists of a set of mapping rules that define the semantic relationship between the global schema and the local schemas (at the data sources). In these systems, such as Clio [2], processing a global query typically involves two steps: query rewriting and data merging. While much work has been done on query rewriting, very little has been done on data merging. In most existing approaches, data merging is mostly an ad hoc computation: a special data merging routine is custom-coded for each mapping rule. This approach leads to inflexible system design. In this paper we propose a schema-independent framework that allows data merging to be processed without referring to any specific schema mapping rules.

Let us illustrate our idea by an example. Figure 1(a) shows two XML documents taken from the UA Cinema website and the IMDB website, respectively. Both UA and IMDB contain the title and the director of each movie. In addition, UA contains venue and price, while IMDB contains the movie's reviews. Consider a user who wants to find out the title, director, price, and review for each movie. This is expressed by the twig pattern query shown in Figure 1(b). Note that neither UA nor IMDB can answer the query alone because UA lacks reviews and IMDB lacks pricing information.
The (global) query thus has to be broken into two query fragments, one for each site. The returned results from the two sites should then be merged based on their common components. Figure 1(c) shows an example of the query result. Our goal is to answer such twig pattern queries in a schema-independent fashion where mapping rules are not needed.
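As a rough illustration of merging results on their common components, the movie example can be reduced to flat records. This is only a simplified sketch: the `merge_on_common` helper and the dictionary representation are assumptions made for illustration, whereas the operator developed in this paper works on XML tree fragments rather than flat records.

```python
from typing import Dict, Optional

Record = Dict[str, str]


def merge_on_common(left: Record, right: Record) -> Optional[Record]:
    """Merge two flat records if they overlap and agree on every shared key.

    Hypothetical helper for illustration only; it ignores tree structure.
    """
    shared = left.keys() & right.keys()
    # No overlap, or a conflict on an overlapping component: nothing to merge.
    if not shared or any(left[k] != right[k] for k in shared):
        return None
    return {**left, **right}


# Partial results, following the values shown in the running example.
ua = {"title": "The Godfather", "director": "Francis Coppola", "price": "5"}
imdb = {"title": "The Godfather", "director": "Francis Coppola", "review": "very good"}

merged = merge_on_common(ua, imdb)
print(merged)
```

Neither record alone carries both the price and the review; the merged record does, because the two records agree on the title and the director.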
[Figure 1 panels: (a) Sample XML documents; (b) A query twig pattern; (c) Sample match of the query; (d) Projected queries; (e) Join XML fragments to obtain the query result]
Fig. 1. Query on sample XML documents and the results.

In our approach, we join data fragments based on their overlapping content in order to answer queries. For example, we first project the global query on the two XML documents and obtain two local queries (Figure 1(d)). Then, we retrieve XML fragments m(t, d, p) from UA and m(t, d, r) from IMDB. Afterward, we join these fragments based on their overlapping parts, which are title (t) and director (d) (Figure 1(e)).

2 Preliminaries

An XML document $d$ is a rooted, node-labeled tree $d = \langle N, E, r \rangle$, wherein $N$ is a node set, $E \subseteq N \times N$ is an edge set, and $r \in N$ is the root node. Each node in an XML document has a label and may contain some text. The vocabulary of an XML document $d$, denoted by $v(d)$, is the set of distinct node labels of $d$.

Definition 1. (XML FRAGMENT) An XML fragment $f$ is an edge-labeled XML document, where each edge is labeled by either / (parent-child edge) or // (ancestor-descendant edge). An XML fragment $f$ is a fragment of an XML document $d$, denoted as $f \sqsubseteq d$, if there exists an injective mapping $\lambda : f.N \to d.N$, such that: (i) $\forall n \in f.N$, $n = \lambda(n)$, and (ii) $\forall e(n_1, n_2) \in f.E$ labeled as / (resp., //), $\lambda(n_1)$ is the parent (resp., ancestor) of $\lambda(n_2)$.

Definition 2. (TWIG PATTERN AND MATCH) A twig pattern is an XML fragment, where the text content of the nodes is disregarded. A fragment $f$ is a match to a twig pattern $q$, denoted as $f \models q$, if there exists a mapping $\gamma : q.N \to f.N$, such that the node labels and edges of $q$ are preserved in $f$. A fragment $f_1$ is contained in another fragment $f_2$, denoted as $f_1 \subseteq f_2$, if all the nodes and edges of $f_1$ are contained in $f_2$.

Definition 3.
(PROJECTION) Given a fragment $f$ and the vocabulary $v(d)$ of a document $d$, the projection of $f$ on $v(d)$, denoted as $\rho_{v(d)}(f)$, is obtained by removing from $f$ all the nodes whose labels are not in $v(d)$, together with the corresponding connecting edges.

3 The fragment join operator

Definition 4. (FRAGMENT JOIN) Given a set of fragments $f_1, \dots, f_n$ ($n \ge 2$), a fragment $f$ is a join of $f_1, \dots, f_n$, denoted as $\bowtie(f_1, \dots, f_n) \to f$, if $\exists f_1' \subseteq f, \dots, f_n' \subseteq f$,
such that: 1) $f_i' = f_i$, $1 \le i \le n$, 2) $\forall n \in f.N$, $n \in f_1'.N \cup \dots \cup f_n'.N$, and 3) $\forall e \in f.E$, $e \in f_1'.E \cup \dots \cup f_n'.E$. In addition, the join set of $f_1, \dots, f_n$ is the set of fragments $F = \{f \mid \bowtie(f_1, \dots, f_n) \to f\}$, denoted as $\bowtie(f_1, \dots, f_n) \Rightarrow F$.

Definition 5. (JOINT SUB-TREE) Given two fragments $f_1$ and $f_2$, a subtree $js$ is a joint sub-tree of $f_1$ and $f_2$ if (1) $js \subseteq f_1$ and $js \subseteq f_2$, and (2) the root of $js$ = the root of $f_2$.

[Figure 2 panels: (a) The fragments f1 of d1, f2 of d2; (b) Joint nodes and corresponding join results]
Fig. 2. XML fragment join on different joint sub-trees.

Figure 2(b) shows the five results of the fragment join between $f_1$ and $f_2$ shown in Figure 2(a). Each of these results is based on a joint sub-tree, whose nodes are pointed to by double-arrowed dashed lines in the two fragments. We propose Algorithm 1 for evaluating the fragment join of two fragments $f_1$ and $f_2$. For example, consider the first join result shown in Figure 2(b). The joint sub-tree for this join result consists of a single node. The boundary nodes are the children of the root node in $f_2$ (underlined in the figure). The subtrees of these boundary nodes are attached to the matching node in $f_1$, forming the join result.

4 Schema-independent, query-based data integration

Our research problem is formally stated as follows: given XML documents $d_1$ and $d_2$, and a twig pattern query $q$, compute $F = \{f \mid f \models q;\ \bowtie(f_1, f_2) \to f;\ f_1 \sqsubseteq d_1;\ f_2 \sqsubseteq d_2\}$. Our approach to solving this problem consists of the following phases.

Projection. The query $q$ is rewritten into local queries $q_1 = \rho_{v(d_1)}(q)$ and $q_2 = \rho_{v(d_2)}(q)$ using the projection operator (Section 2). We then apply the fragment join operator on $q_1$ and $q_2$ to find a joint sub-tree $js$ for which the join result is $q$.

Matching. Two sets of fragments $F_1$ and $F_2$ are returned, which contain all matches to the local query $q_1$ in $d_1$ and all matches to the local query $q_2$ in $d_2$, respectively [footnote 1].

Join. For each pair of fragments $(f_1, f_2) \in F_1 \times F_2$, we compute the fragment join of $f_1$ and $f_2$ using the joint sub-tree obtained in the projection phase. The join results are returned as the query's answer.

Footnote 1: We thank the authors of [3] for providing us with the implementation of TwigList, used as a module for evaluating twig queries in our work.
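The join phase described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the `Node` class, `copy_tree`, and `fragment_join` are hypothetical names, the joint sub-tree is assumed to be given as pairs of corresponding nodes in the two fragments, all edges are treated as parent-child edges, and the enumeration of joint sub-trees is not shown.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    """A node of an XML fragment (edge labels / and // are omitted here)."""
    label: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def copy_tree(node: Node, mapping: Dict[int, Node]) -> Node:
    """Deep-copy a tree, recording id(original) -> copy in `mapping`."""
    clone = Node(node.label, node.text)
    mapping[id(node)] = clone
    clone.children = [copy_tree(child, mapping) for child in node.children]
    return clone


def fragment_join(f1: Node, f2: Node, joint_pairs: List[Tuple[Node, Node]]) -> Node:
    """Join f2 into a copy of f1 along a joint sub-tree.

    `joint_pairs` lists (x1, x2): the nodes of f1 and f2 that correspond to
    the same joint sub-tree node (assumed to be computed elsewhere).
    """
    mapping: Dict[int, Node] = {}
    result = copy_tree(f1, mapping)
    joint_in_f2 = {id(x2) for _, x2 in joint_pairs}
    for x1, x2 in joint_pairs:
        for child in x2.children:
            # Boundary node: a child in f2 that lies outside the joint sub-tree.
            if id(child) not in joint_in_f2:
                mapping[id(x1)].children.append(copy_tree(child, {}))
    return result


# The movie example: the fragments overlap on movie, title, and director.
ua = Node("movie", children=[
    Node("title", "The Godfather"),
    Node("director", "Francis Coppola"),
    Node("price", "5"),
])
imdb = Node("movie", children=[
    Node("title", "The Godfather"),
    Node("director", "Francis Coppola"),
    Node("review", "very good"),
])
pairs = [(ua, imdb),
         (ua.children[0], imdb.children[0]),
         (ua.children[1], imdb.children[1])]

joined = fragment_join(ua, imdb, pairs)
print([child.label for child in joined.children])
```

In this sketch the review subtree of the IMDB fragment is the only boundary subtree, so it is copied under the movie node of the copied UA fragment, and the input fragments are left unmodified.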
Algorithm 1 The join evaluation algorithm
Input: XML fragments f1 and f2
Output: set of XML fragments F, with the joint sub-trees used for each f in F
 1: JS <- enumerateJointSubtrees(f1, f2)
 2: for all js in JS do
 3:   f <- join(f1, f2, js)
 4:   output (f, js)
 5: end for
 6: repeat 1-5 with f1 and f2 exchanged, if necessary

function join(f1, f2, js)
 1: f <- copy(f1)
 2: for all x in js.N do
 3:   let x1, x2 be the corresponding nodes of x in f1 and f2, respectively
 4:   for all x2's child c do
 5:     if c not in js.N then
 6:       sf <- constructFragment(f2, c)
 7:       addChild(f, x1, sf)    // attach sf under the copy of x1 in f
 8:     end if
 9:   end for
10: end for
11: return f

Figure 3 illustrates our approach (the joint sub-tree that is found contains the underlined nodes). We note that projecting a global query onto local sources so that one single local query is applied to each source may not be sufficient to retrieve the complete set of query results. For example, consider again query $q$ in Figure 3. We observe that joining $q_{11}$, a sub-twig pattern of $q_1$ containing two nodes and the edge between them, with $q_2$ also gives us $q$. Therefore, in order to ensure that all valid query results are found, we should consider all pairs of sub-twig patterns of $q_1$ and $q_2$ that can form $q$.

Definition 6. (RECOVERABILITY) Given a twig pattern $q$, a pair of twig patterns $(q_i, q_j)$ is recoverable for $q$, denoted as $(q_i, q_j) \rightsquigarrow q$, if $\bowtie(q_i, q_j) \to q$ using some joint sub-tree $js$; else, $(q_i, q_j)$ is non-recoverable for $q$, denoted as $(q_i, q_j) \not\rightsquigarrow q$.

We add two more schema-level phases to the Projection-Matching-Join framework, in order to ensure completeness of the query results.

Decomposition. After the projection phase in which local queries $q_1$ and $q_2$ are derived, the decomposition phase returns: $Q_1 = \{q_i \mid q_i \subseteq q_1\}$ and $Q_2 = \{q_j \mid q_j \subseteq q_2\}$.

Recoverability checking. After the decomposition phase, this phase returns: $\{(q_i, q_j) \mid (q_i, q_j) \in Q_1 \times Q_2 \wedge (q_i, q_j) \rightsquigarrow q\}$.

5 Experimental evaluation and conclusion

We use DBLP and CiteSeer datasets in our experiments. The raw CiteSeer data are in plain text BibTeX format. We converted them into an XML file having a similar schema
to that of DBLP data. The size of the CiteSeer dataset is 15MB. We randomly sample the original DBLP (13MB) dataset to extract the publication records and attributes, and obtain five DBLP datasets, whose sizes are: 1MB, 1MB, 2MB, 4MB, and 8MB, respectively. Thus, we have five pairs of datasets used for the queries, each consisting of the CiteSeer dataset plus one of the sampled DBLP datasets. We manually create four test twig pattern queries, named Q1-Q4, each of which queries a set of attributes of papers. All these queries can only be answered using both the DBLP and CiteSeer datasets (but not one of the two datasets alone) by fragment join in our framework.

[Figure 3 panels: (a) Two XML documents; (b) The twig pattern query and the Projection-Matching-Join query answering process]
Fig. 3. Query answering from multiple data sources: projection, matching, and join.

[Figure 4 panels: Q1, Q2, Q3, Q4; legend entry: fragment join]
Fig. 4. Overall performance of PDRMJ for all queries and datasets.

The overall performance of our complete, optimized approach (PDRMJ) is shown in Figure 4 for all queries Q1-Q4 on all datasets. The overall response time is broken down into two parts: (i) the time spent by all sub-twig pattern queries issued against the different sources, and (ii) the time spent by the fragment joins. We observe that the performance for all queries scales roughly linearly with the size of the DBLP dataset (recall that the size of the CiteSeer dataset is fixed). In addition, nearly half of the cost is due to the twig pattern queries against the sources.

In conclusion, we developed a fragment join operator for query-based data integration from multiple sources. We studied the problem of schema-independent data integration based on this operator. We conducted experiments to show the effectiveness of our approaches.

References

1. Lenzerini, M.: Data integration: a theoretical perspective. In: PODS. (2002)
2. Yu, C., Popa, L.: Constraint-based XML query rewriting for data integration. In: SIGMOD. (2004)
3. Qin, L., Yu, J.X., Ding, B.: TwigList: make twig pattern matching fast. In: DASFAA. (2007)