Schema Clustering and Retrieval for Multi-domain Pay-As-You-Go Data Integration Systems

Hatem A. Mahmoud
University of Waterloo
Waterloo, ON, Canada

Ashraf Aboulnaga
University of Waterloo
Waterloo, ON, Canada

ABSTRACT

A data integration system offers a single interface to multiple structured data sources. Many application contexts (e.g., searching structured data on the web) involve the integration of large numbers of structured data sources. At web scale, it is impractical to use manual or semi-automatic data integration methods, so a pay-as-you-go approach is more appropriate. A pay-as-you-go approach entails using a fully automatic approximate data integration technique to provide an initial data integration system (i.e., an initial mediated schema, and initial mappings from source schemas to the mediated schema), and then refining the system as it gets used. Previous research has investigated automatic approximate data integration techniques, but all existing techniques require the schemas being integrated to belong to the same conceptual domain. At web scale, it is impractical to classify schemas into domains manually or semi-automatically, which limits the applicability of these techniques. In this paper, we present an approach for clustering schemas into domains without any human intervention and based only on the names of attributes in the schemas. Our clustering approach deals with uncertainty in assigning schemas to domains using a probabilistic model. We also propose a query classifier that determines, for a given keyword query, the most relevant domains to this query. We experimentally demonstrate the effectiveness of our schema clustering and query classification techniques.

Categories and Subject Descriptors
H.4.0 [Information Systems Applications]: General

General Terms
Algorithms, Design, Experimentation

Keywords
data integration, clustering, classification

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SIGMOD'10, June 6–11, 2010, Indianapolis, Indiana, USA.
Copyright 2010 ACM.

1. INTRODUCTION

As the number of structured data sources on the web continues to increase, so does the difficulty of organizing them and making them accessible. A prominent example of structured data sources on the web is the large number of web sites that provide access to databases through web forms. Such databases hidden behind web forms are usually called the deep web or the hidden web, and are believed to surpass the surface web in quantity and quality [4]. Recent studies by Google estimate an order of 10 million high quality HTML forms [3]. Many other types of structured data sources spanning a wide spectrum of domains are also available on the web, such as HTML tables and downloadable spreadsheets. The need to provide access to a large number of heterogeneous structured data sources also arises on a smaller scale in personal information management and scientific data management applications [9].

One of the approaches used to access such large numbers of heterogeneous structured data sources is to treat their data as mere documents and apply keyword search on them using information retrieval (IR) techniques. For the deep web, various techniques have been proposed to surface it, making it searchable via traditional IR techniques [4]. This approach, however, does not take much advantage of the structure of data sources.

Another approach that takes advantage of such structure is to use data integration. Data integration systems provide the user with a unified interface to access a set of data sources that provide information about the same real-world domain but have different schemas. Typically, a data integration system is established by first defining a mediated schema that represents the domain that is being considered and acts as the user's interface to the system. Mappings are then defined from the schemas of the various data sources to that mediated schema. Creating and maintaining data integration systems has always been an expensive process that consumes much effort. Consequently, much research has been done to facilitate that process by developing techniques that recommend mediated schemas and schema mappings to the user [6]. Different data integration techniques require different levels of user involvement.
At web scale, the massive number of data sources makes even semi-automatic data integration techniques impractical. For example, attempts to use semi-automatic integration techniques by Google [4] indicate that a human annotator working on data integration with the help of semi-automatic tools can integrate only 100 schemas on average per day. The other alternative, which is fully automatic data integration, produces imprecise mediated schemas and schema mappings. Therefore, it was suggested [3] that a pay-as-you-go data integration approach is the only way to deal with web-scale data integration. A pay-as-you-go data integration system accepts approximate and incomplete integration as a starting point, and allows further enhancements to be introduced later, whenever deemed necessary. The system starts providing services, e.g. keyword search, without having to wait until full and precise integration takes place.

To deal with imprecision in fully automatic integration, prior research [6, 7] proposes using a probabilistic model where several possible mediated schemas are generated, each assigned a probability value. Then, from each data source to each of the generated mediated schemas, several possible mappings are generated, and each mapping is also assigned a probability value. However, all existing fully automatic integration techniques assume that the data sources to be integrated belong to the same domain, so a preprocessing phase is still needed to cluster data sources into domains before data integration takes place [3]. Without such a step, data integration is more likely to produce semantically incoherent mediated schemas and inaccurate mappings to these schemas. Surprisingly, there has been very little work on automatic clustering of data sources into domains.

In this paper, we present an approach for clustering structured data sources into domains based on their schemas. When working on structured web data sources, we are faced with many challenges. First, the only information guaranteed to be available about a data source is attribute names. Even simple information like attribute data types is not always easy to determine. Therefore, our clustering approach relies entirely on attribute names to cluster schemas into domains. Second, we do not know in advance all the domains we should have or how many they are, since the web is essentially about everything. Consequently, we use a clustering algorithm that does not make assumptions about the number or the types of domains in advance. Third, since we are proposing a fully automatic technique, we need to handle uncertainty in deciding which domain a schema should be assigned to. We use a probabilistic model to deal with this uncertainty, where each data source may belong to multiple domains with different probabilities.
Typically, after schemas are clustered into domains, existing techniques of schema mediation and mapping will be run on each domain separately. Our work integrates well with previous work on schema mediation and mapping with uncertainty [6, 7].

At query time, we need to give the user the capability to search for relevant domains. For example, a search engine needs to detect when a keyword query contains attribute names that are relevant to one or more of the domains constructed in the clustering phase. More concretely, a keyword query like "departure Toronto destination New York" contains two attribute names that are relevant to the travel domain, namely "departure" and "destination". The search engine can then retrieve the mediated schemas of relevant domains and present them to the user in the form of structured query interfaces as part of the search results page, ranked by their relevance to the query. The user can then pose structured queries over any of those query interfaces and retrieve structured data. We present a technique based on naive Bayes classification to determine the domains relevant to a query and rank these domains according to their relevance to the query. The probabilistic nature of the domains adds more challenges to classification. Figure 1 illustrates the architecture of our system with an example of a typical use case.

Figure 1: System architecture illustrated via an example of a typical use case.

Our contributions in this paper can be summarized as follows:

1. A fully automatic technique for clustering schemas into domains based on attribute names only.
2. A probabilistic approach for handling uncertainty in clustering. Our approach integrates seamlessly with existing approaches for schema mediation and mapping with uncertainty.
3. A technique based on naive Bayes classification to determine the domains relevant to a keyword query and rank those domains according to their relevance to the query. Our classifier takes into account the fact that the domains are probabilistic.
4. An experimental evaluation on schemas from a wide spectrum of domains.

The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 provides a more formal problem definition, while Section 4 provides an overview of our solution. Section 5 explains our schema clustering approach. Section 6 explains our approach for retrieving domains that are relevant to a keyword query. Our experimental evaluation is presented in Section 7, and we conclude in Section 8.

2. RELATED WORK

We use machine learning techniques from the domain of document clustering [2] to deal with schema clustering. When determining which domains are relevant to a query, we use naive Bayes classification, which is a machine learning technique that has been used extensively in document classification and other applications [5].

The idea of using probabilistic models to deal with uncertainty in data integration was considered in [6] and [7]. However, it was used in the context of probabilistic schema mediation and mapping, while the problem of clustering schemas into domains was left open. Our use of probabilistic models in clustering integrates seamlessly with [6] and [7].

Clustering schemas into domains has been considered in [2]. However, while [2] presents a customized algorithm that is designed to fit eight predefined domains (like cars and movies), our clustering approach is designed to handle any number of arbitrary and overlapping domains. Furthermore, the approach in [2] assumes that for each domain there are anchor attributes that do not occur except in that domain, while we do not rely on the existence of anchor attributes. Additionally, our approach deals with uncertainty in clustering via probabilistic models, and uses simple (yet effective) distance measures in clustering.

Schema clustering is also mentioned in [3] as part of the proposed pay-as-you-go architecture, but without details on how to deal with the numerous challenges that arise when clustering web data sources. Finally, clustering is also used as a tool in another phase of data integration, namely schema mediation, where it is the attributes that are clustered, not the schemas [, 6, 8].

3. PROBLEM DEFINITION

Existing techniques of automatic data integration assume that the data sources to be integrated belong to the same domain. For these techniques to work on a large number of data sources from multiple domains, there has to be an initial step in which the data sources are clustered into domains. The objective of this paper is to automate the clustering step. Therefore, we consider the two problems of (1) clustering schemas into domains, and (2) retrieving and ranking relevant domains at query time. For the purpose of our research, we define the notion of a domain as follows:

Definition 3.1 (Domain). A domain is a set of single-table schemas with sufficiently large intra-domain similarity and sufficiently large inter-domain dissimilarity, according to some measure of similarity.

We also define a schema as a set of attribute names, and an attribute name as a set of terms (e.g., the attribute name "First Name" consists of the terms "First" and "Name").

Our system takes as an input a set of single-table schemas, where each schema is extracted from a structured data source (e.g., a web form, an HTML table, a downloadable spreadsheet). We focus on single-table schemas since most data sources on the web belong to that category. Our approach works on schemas without assuming any access to the actual databases behind these schemas, so as to be general enough to handle deep web data sources without the need to surface them. Moreover, the only information we need to know about a schema is the attribute names, which is often the only information that is available. So, for example, attribute data types are not required. We also do not assume any prior information about the exact number or nature of potential domains. Consequently, domains need to be discovered from the available schemas. Our problem generally involves uncertainty in determining whether two schemas belong to the same domain or not.

The output of the clustering phase is a set of domains, where each domain is a set of schemas as in Definition 3.1. In a typical pay-as-you-go system, each output domain will be fed as an input to a schema mediation and mapping algorithm. Schema mediation and mapping is already a well-studied problem and is not a focus of this paper, but we have to ensure that our solutions integrate well with previous work.

At query time, we need to provide the user with the capability to retrieve domains relevant to a keyword query, taking into consideration that our domains are constructed with uncertainty. We define a keyword query as a set of terms. The query processing phase takes as an input a keyword query and a set of domains, and outputs for each domain its degree of relevance to that query.

4. SOLUTION OVERVIEW

We use hierarchical agglomerative clustering to group schemas into domains.
This algorithm operates by iteratively merging similar schemas together into clusters and merging similar clusters together into larger clusters, until a maximum level of inter-cluster dissimilarity is reached [8]. Since textual similarity among attribute names is the basis upon which mediated schemas and schema mappings are usually generated, it is reasonable to rely on the same basis when measuring schema-to-schema similarity during schema clustering. Therefore, we assume that the probability that two schemas belong to the same domain can be determined based on the textual similarity between the attribute names of the two schemas. Previous empirical studies [] have shown that attribute names within the same domain tend to be similar across different schemas. Moreover, relying only on attribute names makes it possible to apply our approach on data sources whose data and data types are not plainly exposed (e.g. the deep web). Our experiments on the schemas of hundreds of web data sources from diverse domains show that assigning probabilities based on textual similarities works well.

We handle uncertainty in schema clustering based on a probabilistic model. Besides being mathematically appropriate, using a probabilistic model is consistent with previous research that deals with uncertainty in pay-as-you-go data integration systems [6, 7]. The steps of constructing the probabilistic model and drawing inferences from it can be summarized as follows:

1. Each schema is represented by a feature vector, which we construct based on the terms extracted from the attribute names of the schema.
2. Hierarchical agglomerative clustering is applied to the feature vectors of the schemas to group them into domains.
3. Schemas that have equal or close similarities to multiple domains are assigned to each of these domains with different probabilities. The probabilities are based on the similarities between schemas and domains.
4. When a user poses a keyword query over the system, naive Bayes classification is used to determine, for each domain, the probability that the query belongs to that domain. Relevant domains are then ranked based on probability values.

Between Steps 3 and 4, existing techniques from previous research can be used to generate a mediated schema for each domain and then generate probabilistic mappings from the schemas in the domain to the domain's mediated schema. The generated schema mappings are also probabilistic so as to handle the uncertainty in determining which attributes in a source schema correspond to which attributes in the mediated schema [6]. A probabilistic mapping from a source schema to a mediated schema is basically a set of possible mappings, each assigned a probability. The probability assigned to any individual tuple retrieved from a domain at query time is the product of two probabilities: (1) the probability that the schema from which that tuple is retrieved belongs to that domain (obtained from our work), and (2) the probability that the schema mapping based on which the tuple was mapped to the mediated schema of the domain is the correct mapping (obtained from the probabilistic schema mapping technique, e.g., [6]).

5. SCHEMA CLUSTERING

Given a set of schemas $S = \{S_1, S_2, \ldots, S_{|S|}\}$ as input, our target is to output a set of clusters $C = \{C_1, C_2, \ldots, C_{|C|}\}$, where $C_r \subseteq S$, for all $C_r \in C$. Optimally, for all $i, j = 1, 2, \ldots, |S|$, and for all $r = 1, 2, \ldots, |C|$, the two schemas $S_i$ and $S_j$ should belong to $C_r$ if and only if $S_i$ and $S_j$ represent the same real-world domain. However, we have no means by which we can determine automatically and with absolute certainty whether any two given schemas represent the same domain or not. We have to rely on approximate methods and accept best-effort results, which is an essential aspect of the pay-as-you-go approach. We assume that the probability that two schemas belong to the same domain can be determined based on the textual similarity between the attribute names of the two schemas.

5.1 Creating Feature Vectors

Before proceeding with clustering, we need to characterize each schema with a feature vector. Feature vectors are needed both during the clustering process and during query classification.
We use a vector space model similar to that used in document clustering [2]; that is, if there are $d$ distinct terms in all given schemas, we characterize each schema with a vector comprised of $d$ binary features, one feature for each distinct term to indicate whether this term exists in the schema or not. We use binary features instead of, for example, counting the frequency of terms in schemas, because schema attributes usually contain few terms, so binary features are sufficient. Algorithm 1 describes how feature vectors are created.

Algorithm 1 Create Feature Vectors
procedure CreateFeatureVectors
    input: Set of schemas $S = \{S_1, S_2, \ldots, S_{|S|}\}$
    for all $S_i \in S$ do
        Define the set of terms $T_i$
        Extract all terms from $S_i$'s attribute names to $T_i$
        Convert all terms in $T_i$ into canonical form
        Remove very small terms and stop words from $T_i$
    end for
    Sort all terms in $\bigcup_{i=1}^{|S|} T_i$ into a vector $L$
    for all $S_i \in S$ do
        Define a binary vector $F_i$, where $\dim F_i = \dim L$
        for all terms $L_j$ in $L$ do
            if $\max_{t \in T_i} \mathrm{t\_sim}(L_j, t) \geq \tau_{t\_sim}$ then
                $F_i^j \leftarrow 1$
            else
                $F_i^j \leftarrow 0$
            end if
        end for
    end for
    return $F = \{F_1, F_2, \ldots, F_{|S|}\}$
end procedure

First, for each schema $S_i \in S$, we extract all the terms from $S_i$ by splitting its attribute names over a set of pre-defined delimiters, like white spaces, slashes and underscores. For example, given the schema {Class ID, Day/Time, Professor Name, Subject}, the set of extracted terms will be {Class, ID, Day, Time, Professor, Name, Subject}. We also split attribute names that consist of several capital-started terms concatenated to each other (e.g. "MaxNumberOfStudents" is split into "Max", "Number", "Of" and "Students"). Splitting attribute names is motivated by the observation that individual terms within the attribute names of schemas in a single domain can cluster together better than the whole attribute names, since they tend to be less sensitive to rephrasing (e.g., "Professor Name" versus "Name of the Professor").

We convert all terms to a canonical form for better comparisons (e.g. all characters to lower case), then we remove stop words and extremely short terms (e.g. terms with less than three letters). The result is the set $T = \{T_1, T_2, \ldots, T_{|T|}\}$, where $T_i$ is the set of terms extracted from the schema $S_i$.
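The term extraction step can be illustrated with a minimal Python sketch. The function name `extract_terms`, the `min_len` parameter and the tiny stop-word list are assumptions made for illustration only; the text does not specify a stop-word list or a particular implementation.

```python
import re

# Assumed stop-word list for illustration; the paper does not enumerate one.
STOP_WORDS = {"the", "and", "for", "with", "from"}

def extract_terms(attribute_names, min_len=3):
    """Return the set of canonical terms for one schema (a set T_i)."""
    terms = set()
    for name in attribute_names:
        # Separate concatenated capital-started terms:
        # "MaxNumberOfStudents" -> "Max Number Of Students".
        spaced = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", name)
        # Split on the delimiters named in the text: white space, slashes, underscores.
        for token in re.split(r"[\s/_]+", spaced):
            term = token.lower()  # canonical form: lower case
            # Drop extremely short terms and stop words.
            if len(term) >= min_len and term not in STOP_WORDS:
                terms.add(term)
    return terms
```

Under these assumptions, `extract_terms(["Class ID", "Day/Time", "Professor Name", "Subject"])` keeps `class`, `day`, `time`, `professor`, `name` and `subject`, and drops `id` because it has fewer than three letters.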
Next, all terms in $\bigcup_{i=1}^{|T|} T_i$ are sorted into a vector of terms $L = \langle L_1, L_2, \ldots, L_{\dim L} \rangle$, where $\dim L = \left|\bigcup_{i=1}^{|T|} T_i\right|$. We then create, for each $S_i \in S$, a binary feature vector $F_i$, such that $\dim F_i = \dim L$. Let $F_i^j$ denote the $j$th feature in $F_i$. The vector $F_i$ characterizes $S_i$ by indicating, for each term $L_j$ in $L$, whether $S_i$ contains a term that is sufficiently similar to $L_j$ or not; if yes then $F_i^j = 1$, otherwise $F_i^j = 0$.

For each $S_i \in S$, $F_i$ is computed as follows. Let $\mathrm{t\_sim}$ be a function that takes two terms $t'$ and $t''$ as input and returns a real value in the range $[0, 1]$ that indicates how similar the two terms are. For each term $L_j$ in $L$, we compute $\max_{t \in T_i} \mathrm{t\_sim}(L_j, t)$; that is, the maximum among all the similarities between $L_j$ and each of the terms in $S_i$. We then compare that maximum to a threshold $\tau_{t\_sim}$ that we set based on our knowledge of the similarity function $\mathrm{t\_sim}$. If $\max_{t \in T_i} \mathrm{t\_sim}(L_j, t) \geq \tau_{t\_sim}$ then $F_i^j = 1$, otherwise $F_i^j = 0$.

There are already several well-studied functions for measuring term similarity [5]. In our work, we use a function that is based on the longest common substring. Let the function $\mathrm{LCS}(t', t'')$ denote the longest common substring between the two terms $t'$ and $t''$, and the function $\mathrm{len}(t)$ denote the number of characters in the term $t$; then

$$\mathrm{t\_sim}(t', t'') = \frac{2 \cdot \mathrm{len}(\mathrm{LCS}(t', t''))}{\mathrm{len}(t') + \mathrm{len}(t'')}$$

That is, the length of the longest common substring divided by the average of the lengths of the two terms. We pick a high value for $\tau_{t\_sim}$, for example 0.8, to ensure sufficient similarity. The longest common substring can be computed efficiently in linear time using suffix trees [0]. Another possible alternative for the term similarity function $\mathrm{t\_sim}$ is to use a function that recognizes two terms to be similar if and only if they have the same stem.

5.2 Clustering Algorithm

We use hierarchical agglomerative clustering as described in Algorithm 2. We choose hierarchical clustering because we do not know the appropriate number of clusters in advance, and hierarchical clustering does not require prior knowledge of this number.

Algorithm 2 Cluster Schema
procedure ClusterSchema
    input: Set of schemas $S = \{S_1, S_2, \ldots, S_{|S|}\}$
    $k \leftarrow 1$
    $U^{(k)} \leftarrow \{\{S_1\}, \{S_2\}, \ldots, \{S_{|S|}\}\}$
    Let $(U_i^{(k)}, U_j^{(k)})$ be: $\arg\max_{U_i^{(k)}, U_j^{(k)} \in U^{(k)};\ U_i^{(k)} \neq U_j^{(k)}} \mathrm{c\_sim}(U_i^{(k)}, U_j^{(k)})$
    while $\mathrm{c\_sim}(U_i^{(k)}, U_j^{(k)}) \geq \tau_{c\_sim}$ do
        $U_{ij}^{(k+1)} \leftarrow U_i^{(k)} \cup U_j^{(k)}$
        $U^{(k+1)} \leftarrow (U^{(k)} \setminus \{U_i^{(k)}, U_j^{(k)}\}) \cup \{U_{ij}^{(k+1)}\}$
        $k \leftarrow k + 1$
        Let $(U_i^{(k)}, U_j^{(k)})$ be: $\arg\max_{U_i^{(k)}, U_j^{(k)} \in U^{(k)};\ U_i^{(k)} \neq U_j^{(k)}} \mathrm{c\_sim}(U_i^{(k)}, U_j^{(k)})$
    end while
    return $C = U^{(k)}$
end procedure

First, we measure the similarity between every two schemas by measuring the similarity between their feature vectors. Let the function $\mathrm{s\_sim}(S_i, S_j)$ be the similarity measure between the two schemas $S_i$ and $S_j$, where $S_i, S_j \in S$. We use the Jaccard coefficient as a similarity measure since it is known to be suitable for high dimensional binary feature vectors [7]. Thus,

$$\mathrm{s\_sim}(S_i, S_j) = \mathrm{Jaccard}(F_i, F_j) = \frac{|\{r : F_i^r = 1 \text{ and } F_j^r = 1\}|}{|\{r : F_i^r = 1 \text{ or } F_j^r = 1\}|}$$

All schema-to-schema similarities should be computed and memoized (i.e., cached) in advance so as to avoid recomputing them multiple times during clustering.

Next, we proceed to clustering. Initially, every schema is considered a singleton cluster in its own right. Then agglomerative hierarchical clustering operates iteratively by merging the most similar pair of clusters among the set of available clusters into one new cluster, based on some measure of cluster similarity. At the beginning of each iteration $k$, we denote the set of clusters that we have as $U^{(k)}$. Since we start by placing every schema in a singleton cluster, $U^{(1)} = \{\{S_1\}, \{S_2\}, \ldots, \{S_{|S|}\}\}$.
After each iteration $k$, the number of clusters shrinks by one as we merge the two closest (most similar) clusters into one new cluster, i.e., $|U^{(k+1)}| = |U^{(k)}| - 1$. We define the similarity between any two clusters $U_i^{(k)}$ and $U_j^{(k)}$ as follows:

$$\mathrm{c\_sim}(U_i^{(k)}, U_j^{(k)}) = \frac{1}{|U_i^{(k)}|\,|U_j^{(k)}|} \sum_{S_{i'} \in U_i^{(k)}} \sum_{S_{j'} \in U_j^{(k)}} \mathrm{s\_sim}(S_{i'}, S_{j'})$$

That is, the average of the similarities between every schema in $U_i^{(k)}$ and every schema in $U_j^{(k)}$. We show in all our experiments (Section 7.2) that other cluster similarity measures can also be used and give similar results.

For each iteration $k$, let the closest pair of clusters be $U_i^{(k)}$ and $U_j^{(k)}$; then the new (merged) cluster will be the union of $U_i^{(k)}$ and $U_j^{(k)}$; that is, $U_{ij}^{(k+1)} = U_i^{(k)} \cup U_j^{(k)}$. For every other cluster $U_c^{(k)} \in U^{(k)} \setminus \{U_i^{(k)}, U_j^{(k)}\}$, $U_c^{(k)}$ remains the same in $U^{(k+1)}$; that is, $U_c^{(k+1)} = U_c^{(k)}$. Thus we have

$$U^{(k+1)} = (U^{(k)} \setminus \{U_i^{(k)}, U_j^{(k)}\}) \cup \{U_{ij}^{(k+1)}\}$$

For every pair of clusters in $U^{(k+1)}$ not including $U_{ij}^{(k+1)}$, inter-cluster similarities remain the same as they were in the previous iteration, so there is no need to recompute them. For $U_{ij}^{(k+1)}$, we can compute its similarity to every other cluster $U_c^{(k+1)} \in U^{(k+1)} \setminus \{U_{ij}^{(k+1)}\}$ in a constant amount of time by utilizing the memoized values from the previous iteration as follows:

$$\mathrm{c\_sim}(U_c^{(k+1)}, U_{ij}^{(k+1)}) = \frac{|U_i^{(k)}| \cdot \mathrm{c\_sim}(U_c^{(k)}, U_i^{(k)}) + |U_j^{(k)}| \cdot \mathrm{c\_sim}(U_c^{(k)}, U_j^{(k)})}{|U_i^{(k)}| + |U_j^{(k)}|}$$

Thus, our memoization can be updated in $O(|U^{(k+1)}|)$ running time.

Clustering stops when the most similar pair of clusters $(U_i^{(k)}, U_j^{(k)})$ is not similar enough; that is, $\mathrm{c\_sim}(U_i^{(k)}, U_j^{(k)}) < \tau_{c\_sim}$, where $\tau_{c\_sim}$ is a pre-defined threshold. Our experiments in Section 7.2 elaborate on the choice of $\tau_{c\_sim}$. Let the last set of clusters produced before the algorithm stops be $C = \{C_1, C_2, \ldots, C_{|C|}\}$. The set $C$ is the output of our clustering algorithm.

5.3 Assigning Probabilities

The main source of uncertainty in schema clustering is the schemas that lie on the boundaries between clusters. Actually, in some cases, assigning these boundary schemas to clusters is arbitrary. For example, consider the case when agglomerative clustering is running and there exist three clusters $U_1^{(k)}$, $U_2^{(k)}$ and $U_3^{(k)}$. It is possible to have $\mathrm{c\_sim}(U_1^{(k)}, U_2^{(k)}) = \mathrm{c\_sim}(U_1^{(k)}, U_3^{(k)}) \geq \tau_{c\_sim}$.
If no other pair of clusters is as similar as $(U_1^{(k)}, U_2^{(k)})$ and $(U_1^{(k)}, U_3^{(k)})$, then either $U_2^{(k)}$ or $U_3^{(k)}$ will be merged with $U_1^{(k)}$. The choice will typically be arbitrary. Other possible sources of uncertainty include cases of very small differences between $\mathrm{c\_sim}(U_1^{(k)}, U_2^{(k)})$ and $\mathrm{c\_sim}(U_1^{(k)}, U_3^{(k)})$. Thus, we consider assigning a single schema to multiple domains with different probabilities if it has sufficient similarity to all of them. Algorithm 3 explains how these probabilities are assigned.

Since we are going to assign some schemas to multiple domains, while each schema in $S$ belongs to one and only one cluster in $C$, we need to separate the concept of domains from the concept of clusters. We use the notion of clusters to refer to sets of schemas that partition $S$, like those returned by Algorithm 2. We use the notion of domains to refer to sets of schemas too; however, every schema in $S$ may belong to multiple domains with different probabilities.

Algorithm 3 Assign Probabilities
procedure AssignProbabilities
    input: Set of schemas $S = \{S_1, S_2, \ldots, S_{|S|}\}$
    input: Set of clusters $C = \{C_1, C_2, \ldots, C_{|C|}\}$
    for all $C_r \in C$ do
        Define a domain $D_r$
    end for
    for all $S_i \in S$ do
        for all $C_r \in C$ do
            if $\mathrm{s\_c\_sim}(S_i, C_r) \geq \tau_{c\_sim}$ and $\frac{\mathrm{s\_c\_sim}(S_i, C_r)}{\max_{C_q \in C} \mathrm{s\_c\_sim}(S_i, C_q)} \geq 1 - \theta$ then
                $\Pr(S_i \in D_r) \leftarrow \frac{\mathrm{s\_c\_sim}(S_i, C_r)}{\sum_{D_q \in D(S_i)} \mathrm{s\_c\_sim}(S_i, C_q)}$
            else
                $\Pr(S_i \in D_r) \leftarrow 0$
            end if
        end for
    end for
    return $\{(S_i, D_r, \Pr(S_i \in D_r)) : \text{for all } S_i \text{ and } D_r\}$
end procedure

We construct domains from the clusters returned by Algorithm 2 as follows. First, we consider the existence of a cluster as an indicator of the existence of a domain, so the number of domains equals the number of clusters. Let the set of domains be $D = \{D_1, D_2, \ldots, D_{|D|}\}$, where $|D| = |C|$, and each domain $D_r \in D$ corresponds to a cluster $C_r \in C$, for all $r$. We then examine every schema $S_i \in S$; if $S_i$ is sufficiently similar to multiple clusters then we assign $S_i$ to the domains corresponding to these clusters with different probabilities based on the similarities between $S_i$ and each of these clusters. The similarity between a schema $S_i \in S$ and a cluster $C_r \in C$ is measured as follows:

$$\mathrm{s\_c\_sim}(S_i, C_r) = \frac{1}{|C_r|} \sum_{S_j \in C_r} \mathrm{s\_sim}(S_i, S_j)$$

That is, the average of the schema similarities between $S_i$ and all the schemas in $C_r$.

For any schema $S_i$ to be assigned to any domain $D_r$, two conditions must be satisfied. First, the value of $\mathrm{s\_c\_sim}(S_i, C_r)$ must be at least $\tau_{c\_sim}$. Second, we require that the ratio between $\mathrm{s\_c\_sim}(S_i, C_r)$ and the maximum similarity between $S_i$ and any cluster be sufficiently large; that is,

$$\frac{\mathrm{s\_c\_sim}(S_i, C_r)}{\max_{C_q \in C} \mathrm{s\_c\_sim}(S_i, C_q)} \geq 1 - \theta$$

for some $\theta \in [0, 1]$. The threshold $\theta$ quantifies the degree of uncertainty allowed when assigning schemas to domains; higher $\theta$ means higher uncertainty. In our experiments, we set $\theta = 0.02$.

For each schema $S_i \in S$, let

$$D(S_i) = \left\{D_r : \mathrm{s\_c\_sim}(S_i, C_r) \geq \tau_{c\_sim} \text{ and } \frac{\mathrm{s\_c\_sim}(S_i, C_r)}{\max_{C_q \in C} \mathrm{s\_c\_sim}(S_i, C_q)} \geq 1 - \theta\right\}$$

Also, for each domain $D_r \in D$, let $S(D_r) = \{S_i : D_r \in D(S_i)\}$. For all $S_i \notin S(D_r)$, $\Pr(S_i \in D_r) = 0$.
Otherwise, for all $S_i \in S(D_r)$, the probability that $S_i$ belongs to $D_r$ is estimated as the schema-to-cluster similarity between $S_i$ and $C_r$, normalized so that all the probabilities assigned to $S_i$ sum up to 1. That is,

$$\Pr(S_i \in D_r) = \begin{cases} \dfrac{\mathrm{s\_c\_sim}(S_i, C_r)}{\sum_{D_q \in D(S_i)} \mathrm{s\_c\_sim}(S_i, C_q)} & \text{if } S_i \in S(D_r) \\ 0 & \text{otherwise} \end{cases}$$

The output of this phase is the set of triples $\{(S_i, D_r, \Pr(S_i \in D_r)) : \text{for all } S_i \in S \text{ and } D_r \in D\}$. Triples with $\Pr(S_i \in D_r) = 0$ do not need to be represented explicitly.

After schema clustering, a fully automatic schema mediation and mapping technique (e.g., [6]) can be run to generate a probabilistic mediated schema for each domain $D_r$, and generate probabilistic mappings from each schema $S_i \in S(D_r)$ to the mediated schema of $D_r$.

6. QUERY CLASSIFICATION

In this section we investigate the issue of answering keyword queries posed over our multi-domain data integration system by retrieving and ranking relevant domains. We use a naive Bayes classifier to determine the probability that a keyword query belongs to any of the domains that are constructed during the clustering phase. For the classifier to do that, some of the keywords in the query need to be similar to some attribute names in the relevant domains. The design of our classifier ensures that expensive operations are performed at system setup time rather than query time.

At query time, we create a feature vector to characterize the keyword query in the same manner as we did for every schema in $S$ in Section 5.1. Let $Q$ denote the set of keywords in the keyword query entered by the user. We convert all keywords into our canonical form, and remove stop words and extremely small keywords. The result is the set of terms $T_Q$. We then construct the binary feature vector $F_Q$. Let $F_Q^j$ be the $j$th feature in $F_Q$, and $L_j$ be the $j$th term in the vector $L$ that contains all terms in all schemas. We set the value of the feature $F_Q^j$ to 1 if there exists a term in $T_Q$ that is sufficiently similar to $L_j$; that is, if $\max_{t \in T_Q} \mathrm{t\_sim}(L_j, t) \geq \tau_{t\_sim}$. Otherwise, $F_Q^j = 0$.

Our target is to determine the posterior probability for each domain $D_r$; that is, given $F_Q$, the probability that $Q$ belongs to $D_r$. Let us denote that probability as $\Pr(D_r \mid F_Q)$.
According to Bayes' rule:

$$\Pr(D_r \mid F_Q) = \frac{\Pr(F_Q \mid D_r)\,\Pr(D_r)}{\Pr(F_Q)} \qquad (1)$$

$\Pr(F_Q \mid D_r)$ is the probability that an arbitrary schema $S_{rand}$, randomly chosen from the domain $D_r$, has a feature vector equal to $F_Q$. $\Pr(D_r)$ is the probability that an arbitrary schema $S_{rand}$, randomly chosen from $S$, belongs to $D_r$. $\Pr(F_Q)$ is the probability that an arbitrary schema $S_{rand}$, randomly chosen from $S$, has a feature vector equal to $F_Q$.

Based on application context, we may assign $Q$ to the domain that has the maximum posterior probability (i.e., $\arg\max_{D_r} \Pr(D_r \mid F_Q)$), or we may return a list of relevant domains ranked by posterior probabilities. Note that, in Equation 1, we do not need to compute $\Pr(F_Q)$ since it is constant for all $D_r \in D$ and thus it does not affect the relative order of posterior probabilities. Consequently, for each domain $D_r$, we only need to compute $\Pr(F_Q \mid D_r)\,\Pr(D_r)$.

We make the fundamental assumption of the naive Bayes classifier, which is the assumption that all features are conditionally independent given the domain. Consequently,

$$\Pr(F_Q \mid D_r)\,\Pr(D_r) = \Pr(D_r) \prod_{j=1}^{\dim L} \Pr(F_Q^j \mid D_r) \qquad (2)$$

where $\Pr(F_Q^j \mid D_r)$ is the probability that an arbitrary schema $S_{rand}$, randomly chosen from the domain $D_r$, has its $j$th feature $F_{rand}^j$ equal to $F_Q^j$.

The values of $\Pr(D_r)$ and $\Pr(F_Q^j \mid D_r)$, for all $j$, depend on which schemas are assigned to which domains. This assignment is determined based on another probability distribution as described in Section 5.3. Therefore, $\Pr(D_r)$ can be expressed in terms of the following summation over all subsets of $S(D_r)$:

$$\Pr(D_r) = \sum_{S' \subseteq S(D_r)} \Pr(D_r \mid D_r = S')\,\Pr(D_r = S') \qquad (3)$$

Similarly, for all $j$,

$$\Pr(F_Q^j \mid D_r) = \sum_{S' \subseteq S(D_r)} \Pr(F_Q^j \mid D_r = S', D_r)\,\Pr(D_r = S' \mid D_r) \qquad (4)$$

We analyze the probabilities on the right-hand sides of Equations 3 and 4 as follows. First, the probability $\Pr(D_r \mid D_r = S')$ is estimated as:

$$\Pr(D_r \mid D_r = S') = \frac{|S'|}{|S|} \qquad (5)$$

Second, $\Pr(D_r = S')$ can be expressed as the probability of the following conjunction:

$$\Pr(D_r = S') = \Pr\Big(\bigwedge_{S_i \in S'} S_i \in D_r \;\wedge \bigwedge_{S_i \notin S'} S_i \notin D_r\Big)$$

We make a second simplifying assumption by assuming that the assignments of schemas to domains are statistically independent. Consequently,

$$\Pr(D_r = S') = \prod_{S_i \in S'} \Pr(S_i \in D_r) \prod_{S_i \notin S'} \Pr(S_i \notin D_r) \qquad (6)$$

Third, $\Pr(F_Q^j \mid D_r = S', D_r)$ is the probability that the $j$th feature of an arbitrary schema $S_{rand}$, randomly chosen from $S'$, equals $F_Q^j$. This probability is estimated as follows:

$$\Pr(F_Q^j \mid D_r = S', D_r) = \frac{|\{S_i : S_i \in S' \text{ and } F_i^j = F_Q^j\}|}{|S'|}$$

Since all features are binary, the last equation can be rewritten as follows:

$$\Pr(F_Q^j \mid D_r = S', D_r) = \begin{cases} \dfrac{\sum_{S_i \in S'} F_i^j}{|S'|} & \text{if } F_Q^j = 1 \\[2ex] 1 - \dfrac{\sum_{S_i \in S'} F_i^j}{|S'|} & \text{if } F_Q^j = 0 \end{cases} \qquad (7)$$

However, there are two problems with Equation 7. The first problem is that $|S'|$ can be zero; if, for all $S_i \in S(D_r)$, $\Pr(S_i \in D_r) \neq 1$, then there is a non-zero probability that $D_r$ is empty, so $|S'|$ may be zero. The second problem is related to robustness.
If a query has an extra term (i.e., a term that does not exist in any of the schemas in $S(D_r)$), then no matter how many other terms are common between the query and the schemas in $S(D_r)$, the probability that the query belongs to $D_r$ will be zero. To see that, assume that $F_i^j = 0$ for all $S_i \in S(D_r)$. Then, according to Equation 7, the probability $\Pr(F_Q^j \mid D_r = S', D_r)$ will be zero, for all $S' \subseteq S(D_r)$. Substituting $\Pr(F_Q^j \mid D_r = S', D_r)$ in Equation 4, the probability $\Pr(F_Q^j \mid D_r)$ will also be zero. Eventually, by substituting $\Pr(F_Q^j \mid D_r)$ in Equation 2, the posterior probability will be zero. Similarly, it is easy to see that, if a query has a missing term (i.e., a term that exists in all the schemas in $S(D_r)$ but not in the query), then no matter how many other terms are common between the query and the schemas in $S(D_r)$, the probability that the query belongs to $D_r$ will be zero.

To solve these two problems we use the m-estimate of probabilities [3]. Basically, for each domain $D_r$, and for each subset $S' \subseteq S(D_r)$, we act as if $S'$ has $m$ additional schemas, some of them have all their features set to 1, while the others have all their features set to 0. Consequently, Equation 7 can be rewritten as follows:

$$\Pr(F_Q^j \mid D_r = S', D_r) = \begin{cases} \dfrac{\sum_{S_i \in S'} F_i^j + p \cdot m}{|S'| + m} & \text{if } F_Q^j = 1 \\[2ex] 1 - \dfrac{\sum_{S_i \in S'} F_i^j + p \cdot m}{|S'| + m} & \text{if } F_Q^j = 0 \end{cases} \qquad (8)$$

where $p \in (0, 1)$ is the fraction of additional schemas that have all their features set to 1. A typical choice would be to set $p = 0.5$ so as to give the classifier no bias towards either extra terms or missing terms. However, we need to consider the fact that keyword queries are usually short. A typical keyword query will contain a small subset of the terms in the schemas of $S(D_r)$, plus a small number of extra terms, so it is much more likely to have missing terms than extra terms. We set $m = 1 + |S'|$ and $p = 1/\dim L$, which gives a stronger bias towards missing terms.

Finally, the probability $\Pr(D_r = S' \mid D_r)$ can be computed using Bayes' rule as follows:

$$\Pr(D_r = S' \mid D_r) = \frac{\Pr(D_r \mid D_r = S')\,\Pr(D_r = S')}{\Pr(D_r)} \qquad (9)$$

Note that the probabilities on the right-hand side of Equation 9 are already computed as part of Equation 3. By substituting Equations 5, 6, 8 and 9 into Equations 3 and 4, and then substituting Equations 3 and 4 into Equation 2, we obtain the posterior probabilities required to rank domains.
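The chain of substitutions above can be made concrete with a small Python sketch that evaluates the right-hand side of Equation 2 for one domain by brute-force enumeration of the subsets of $S(D_r)$. The function names `domain_score` and `rank_domains` and the input layout are hypothetical, and the smoothing parameters are read as $m = 1 + |S'|$ and $p = 1/\dim L$, which is an assumption about the notation. The sketch favours clarity over speed: it does not pre-compute anything at setup time and it assumes a non-empty vocabulary.

```python
from itertools import product

def domain_score(members, n_schemas, fq):
    """Right-hand side of Equation 2 for one domain D_r.

    members: list of (binary feature vector, Pr(S_i in D_r)) for schemas in S(D_r).
    n_schemas: total number of schemas |S|.
    fq: binary feature vector of the query.
    """
    dim = len(fq)
    p = 1.0 / dim  # assumed reading: p = 1 / dim L
    weighted = []  # (subset S', Pr(D_r | D_r = S') * Pr(D_r = S'))
    for mask in product([0, 1], repeat=len(members)):
        pr_subset = 1.0  # Equation 6
        for keep, (_, prob) in zip(mask, members):
            pr_subset *= prob if keep else 1.0 - prob
        joint = (sum(mask) / n_schemas) * pr_subset  # Equation 5 times Equation 6
        if joint > 0.0:
            subset = [vec for keep, (vec, _) in zip(mask, members) if keep]
            weighted.append((subset, joint))
    pr_domain = sum(joint for _, joint in weighted)  # Equation 3
    if pr_domain == 0.0:
        return 0.0
    score = pr_domain
    for j in range(dim):
        pr_feature = 0.0  # Equation 4
        for subset, joint in weighted:
            m = 1 + len(subset)  # assumed reading: m = 1 + |S'|
            ones = sum(vec[j] for vec in subset)
            q_one = (ones + p * m) / (len(subset) + m)  # Equation 8 for F_Q^j = 1
            q = q_one if fq[j] == 1 else 1.0 - q_one
            pr_feature += q * joint / pr_domain  # weight from Equation 9
        score *= pr_feature
    return score

def rank_domains(domains, n_schemas, fq):
    """domains: dict name -> members. Returns (names by decreasing score, scores)."""
    scores = {name: domain_score(m, n_schemas, fq) for name, m in domains.items()}
    return sorted(scores, key=scores.get, reverse=True), scores
```

As a usage example, take a four-term vocabulary (departure, destination, author, title), a "travel" domain holding two certain schemas with vector `[1, 1, 0, 0]`, and a "books" domain holding `[0, 0, 1, 1]` once with probability 1.0 and once with probability 0.6. For the query vector `[1, 1, 0, 0]` and `n_schemas=4`, the travel domain is ranked first.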
Bayesian classification is expensive, but all of the expensive operations in our case can be done at setup time rather than query time. Equations 5, 6 and 9 are all independent of the user's query. Equation 8 depends on the value of the query feature $F_Q^j$, but since $F_Q^j$ may assume one of only two values (either 0 or 1) we can still compute Equation 8 at setup time for both $F_Q^j = 0$ and $F_Q^j = 1$. Therefore, all the probabilities used on the right-hand side of Equation 2 can be pre-computed and stored at setup time. At query time, calculating Equation 2 for all domains takes $O(|D| \dim L)$ running time per query.
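The split between setup time and query time can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' C++ prototype: the function names `build_domain_model` and `rank_domains`, the toy lexicon, and the membership probabilities are all hypothetical. Setup enumerates the subsets of each domain's schemas (Equations 3 to 6, 8 and 9); query time only multiplies stored values (Equation 2).

```python
from itertools import combinations
from math import prod


def build_domain_model(membership, features, n_schemas, dim_l):
    """Setup time: precompute Pr(D_r) and Pr(F_Q^j = v | D_r) for one domain.

    membership: {schema_id: Pr(S_i in D_r)} for schemas with non-zero probability.
    features:   {schema_id: tuple of dim_l binary features}.
    """
    ids = list(membership)
    p = 1.0 / dim_l  # fraction of virtual schemas with all features set to 1
    weighted = []    # pairs (Pr(D_r | D_r = S') * Pr(D_r = S'), S')
    for k in range(len(ids) + 1):
        for subset in combinations(ids, k):
            chosen = set(subset)
            # Eq. 6: membership decisions are assumed independent.
            pr_subset = prod(membership[i] if i in chosen else 1.0 - membership[i]
                             for i in ids)
            if pr_subset > 0.0:  # subsets that omit a certain schema drop out
                weighted.append((len(subset) / n_schemas * pr_subset, subset))  # Eq. 5
    prior = sum(w for w, _ in weighted)  # Eq. 3
    cond = []
    for j in range(dim_l):
        pr_one = 0.0
        for w, subset in weighted:
            m = 1 + len(subset)
            ones = sum(features[i][j] for i in subset)
            estimate = (ones + p * m) / (len(subset) + m)  # Eq. 8, case F_Q^j = 1
            pr_one += estimate * w / prior                 # Eq. 9 weight inside Eq. 4
        cond.append((1.0 - pr_one, pr_one))  # indexed by the query feature value
    return prior, cond


def rank_domains(models, query_features):
    """Query time: evaluate Eq. 2 per domain and sort by descending score."""
    scores = {name: prior * prod(cond[j][v] for j, v in enumerate(query_features))
              for name, (prior, cond) in models.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: lexicon (title, director, make, model) and four schemas.
features = {"s1": (1, 1, 0, 0), "s2": (1, 0, 0, 0),
            "s3": (0, 0, 1, 1), "s4": (0, 0, 1, 0)}
models = {
    "movies": build_domain_model({"s1": 1.0, "s2": 0.6}, features, 4, 4),
    "cars": build_domain_model({"s3": 1.0, "s4": 1.0, "s2": 0.4}, features, 4, 4),
}
print(rank_domains(models, (1, 1, 0, 0)))  # query mentioning "title" and "director"
```

The sketch enumerates every subset and discards those with zero probability, which is simple but exponential in the number of schemas per domain; it is meant for toy inputs only.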

To analyze the setup time needed for Equation 2, let us first define the set of uncertain schemas for each domain $D_r$ as $\hat{S}(D_r) = \{S_i : S_i \in S(D_r) \text{ and } \Pr(S_i \in D_r) < 1\}$. $\hat{S}(D_r)$ is the set of all schemas that belong to $D_r$ with probabilities strictly smaller than 1 and strictly greater than 0. We will also use the term certain schemas to refer to schemas that belong to $D_r$ with probability 1; that is, $S(D_r) \setminus \hat{S}(D_r)$. Any subset $S' \subseteq S(D_r)$ may include or exclude any uncertain schemas and still maintain a non-zero probability; that is, $\Pr(D_r = S') \neq 0$. However, if any certain schema is excluded from $S'$, then $\Pr(D_r = S')$ will be zero according to Equation 6, and consequently $S'$ will not contribute to the summation in Equation 3. Additionally, by substituting $\Pr(D_r = S')$ into Equation 9, $\Pr(D_r = S' \mid D_r)$ will also be zero, and again $S'$ will not contribute to the summation in Equation 4. Therefore, when computing the summations in Equations 3 and 4, we only need to consider the subsets that contain all the certain schemas in $S(D_r)$. This prunes the number of subsets to be considered from $2^{|S(D_r)|}$ to $2^{|\hat{S}(D_r)|}$. By memoizing intermediate values, we can calculate the probabilities in Equation 2 for all domains at setup time in

$$O\Big(\max_{D_r \in D}\big\{|\hat{S}(D_r)|\, 2^{|\hat{S}(D_r)|}\big\}\, |D| \dim L + |S| \dim L\Big)$$

running time. The growth rate of setup time is dominated by the need to enumerate all possible combinations of uncertain schemas for each domain. Thus, the time to construct the classifier depends on the number of uncertain schemas much more strongly than on the total number of schemas. Uncertain schemas are schemas that lie on the boundaries between domains; that is, schemas with close similarities to multiple domains. Typically, they should not constitute significant portions of their domains. Whenever necessary, the number of uncertain schemas can be decreased by decreasing the parameter $\theta$ (Section 5.3).

7. EXPERIMENTS

7.1 Experimental Setup

We implement our algorithms in C++ and use this prototype to evaluate the effectiveness of our schema clustering and query classification techniques. We run our experiments on a Windows Vista machine, with an Intel Centrino Duo 2GHz processor and 3GB RAM.
The goal of our experiments is to demonstrate that our techniques can correctly cluster schemas of data sources available on the web into domains, and can classify keyword queries into appropriate domains. For these experiments we need schemas of web data sources labeled with their correct domains, and we need queries that are also labeled with domains. Generating and labeling schemas and queries in a meaningful way in itself poses some interesting challenges, which we describe next.

7.1.1 Schemas Used

We use the schema set used in [6], which we obtained from the authors of that paper. That schema set consists of 2323 schemas from 5 different domains that were extracted from Google's web index, and we refer to it as the DDH set after the initials of the authors of [6]. The domains in DDH are few and sharply separated, and thus are expected to lend themselves perfectly to clustering. Data sources on the web are not restricted to a small number of well-defined domains, but rather come from extremely diverse and overlapping domains. To test our schema clustering and query classification methods on such diverse and overlapping domains, we collect our own schema sets manually through web search and through lists of hidden web data sources that are available on-line (e.g., in Wikipedia). We extracted two sets of schemas from two types of data sources. The first schema set, which we refer to as DW, is extracted from deep web data sources. For that schema set, we find the web form interfaces to deep web data sources, and manually extract the attribute names in each form. These attribute names form the schema of the data source. The second schema set, which we refer to as SS, is extracted from downloadable spreadsheets that we obtained using Google's search by file type option. The schema of a spreadsheet consists of the manually extracted headers of the columns in the spreadsheet. The attribute names in DW schemas tend to be phrased in a better way and are more accurately indicative of the domain than the ones in SS schemas. In both schema sets, around 25% of the schemas are unique in the sense that a human would not cluster any of them with any other schemas in their sets.
These unique schemas are expected to remain unclustered after the clustering algorithm terminates.

7.1.2 Evaluating Schema Clustering

To evaluate the effectiveness of schema clustering, a typical approach would be to manually assign a label to each schema indicating its domain, and then to measure the effectiveness of the clustering algorithm at grouping schemas with the same domain label. This approach works well for the DDH schema set since the domains are sharply separated, but it does not work well for DW and SS. For DW and SS, the boundaries between different domains are not always obvious, and a single schema may be correctly classified into several domains. The following example illustrates this problem.

Example 7.1. Consider the following two schemas, extracted from two different data sources, both providing information about faculty members:

$S_1$: (faculty member, office phone, email, fax)
$S_2$: (name, position, affiliation, research interests)

Although both schemas are concerned with faculty members, they provide different types of information. In principle, $S_1$ and $S_2$ should be clustered together since a user looking for information about faculty members may find both of them useful. However, considering the fact that the objective of clustering in our case is to serve as a preliminary step before schema mediation and mapping, clustering $S_1$ and $S_2$ together may not be so useful since the two schemas together do not provide good input for schema mediation and mapping algorithms. The question is: If the clustering algorithm clusters $S_1$ and $S_2$ together, is that a false positive? If it does not cluster them together, is that a false negative?

In order to deal with this problem, we perform our experiments as follows. For the two schema sets DW and SS, we manually associate each schema $S_i$ with a set of labels $B(S_i)$ according to what we perceive as potential domains for $S_i$. Example domain labels that we use include movies, bibliography, and people. Each schema is labeled with at least one label. Table 1 provides detailed statistics about the labels used for DW, SS, and their union. These numbers indicate that few labels have the majority of schemas associated with them, while the majority of labels have few schemas.

[Table 1: Statistics about schema sets. Columns: DW, SS, Both. Rows: Number of schemas; Max. terms per schema; Avg. terms per schema; Number of labels used; Max. labels per schema; Avg. labels per schema; Max. schemas per label; Avg. schemas per label.]

Let the set of all labels used be $B = \bigcup_{i=1}^{|S|} B(S_i) = \{B_1, B_2, \ldots, B_{|B|}\}$. Also, let $S(B_j)$ denote the set of all schemas labeled with $B_j$. We run the clustering algorithm on our schema sets and examine the set of domains $D$ that is produced by the clustering algorithm, and we label each domain $D_r \in D$ with the set of dominant labels within that domain. Usually there is only one dominant label but sometimes there are several labels that equally dominate the domain. Let $B(D_r)$ denote the set of dominant labels in the domain $D_r$. Also, let $D(B_j)$ denote the set of domains dominated by $B_j$. We determine dominant labels as follows:

$$B(D_r) = \arg\max_{B_j \in B} \sum_{S_i \in S(B_j)} \Pr(S_i \in D_r)$$

Summing the probabilities should be interpreted as a weighted counting of the schemas in $D_r$ and is not intended to have a probabilistic meaning. We also sum probabilities as a weighted counting of schemas when estimating precision and recall. If more than one label equally dominates the domain, we include them all in $B(D_r)$. A special case is when the dominant label of $D_r$ does not have an absolute majority; that is,

$$\max_{B_j \in B} \sum_{S_i \in S(B_j)} \Pr(S_i \in D_r) < \frac{1}{2} \sum_{S_i \in S} \Pr(S_i \in D_r)$$

We then call $D_r$ a non-homogeneous domain. A non-homogeneous domain is treated as if it has no dominant label; i.e., $B(D_r) = \emptyset$. When computing precision and recall, schemas assigned to non-homogeneous domains are all counted as false negatives. We also compute the fraction of schemas assigned to non-homogeneous domains as one of our performance measures. Another special case is a domain with only one schema. That happens when a schema is not sufficiently similar to any other schemas in $S$, given the value used for the threshold $\tau_{c\_sim}$. We measure the fraction of unclustered schemas, and exclude them from other performance measurements like precision and recall. One last case to be considered is when two different domains are dominated by the same label; i.e., $B(D_i) \cap B(D_j) \neq \emptyset$ where $i \neq j$.
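We use the term fragmentation to refer to that case and we measure the degree of fragmentation in our experiments by computing the average number of domains dominated by each label; that is, $\frac{1}{|B|} \sum_{B_j \in B} |D(B_j)|$. Finally, we measure precision and recall as follows.

Precision: For each schema $S_i \in S(D_r)$, if $B(S_i) \cap B(D_r) \neq \emptyset$ then $S_i$ contributes to the true positives of $D_r$, denoted as $TP_{D_r}$, by the probability of membership of $S_i$ in $D_r$.

$$TP_{D_r} = \sum_{S_i : B(S_i) \cap B(D_r) \neq \emptyset} \Pr(S_i \in D_r)$$

Similarly, the false positives of $D_r$, denoted as $FP_{D_r}$, are estimated as

$$FP_{D_r} = \sum_{S_i : B(S_i) \cap B(D_r) = \emptyset} \Pr(S_i \in D_r)$$

We therefore estimate the average precision as

$$\frac{1}{|D|} \sum_{D_r \in D} \frac{TP_{D_r}}{TP_{D_r} + FP_{D_r}}$$

Recall: For each domain $D_r$, if $B_j \in B(D_r)$ and there exists $S_i \in S(D_r)$ such that $B_j \in B(S_i)$, then $S_i$ contributes to the true positives of $B_j$, denoted as $TP_{B_j}$, by the probability of membership of $S_i$ in $D_r$; that is,

$$TP_{B_j} = \sum_{D_r \in D(B_j)} \sum_{S_i \in S(B_j)} \Pr(S_i \in D_r)$$

Similarly, the false negatives of $B_j$, denoted as $FN_{B_j}$, are estimated as

$$FN_{B_j} = \sum_{D_r \notin D(B_j)} \sum_{S_i \in S(B_j)} \Pr(S_i \in D_r)$$

We therefore estimate the average recall as

$$\frac{1}{|B|} \sum_{B_j \in B} \frac{TP_{B_j}}{TP_{B_j} + FN_{B_j}}$$

To evaluate our clustering approach, we measure the effect of changing $\tau_{c\_sim}$ on performance measures like precision, recall, fragmentation, unclustered schemas, and non-homogeneous domains. We also compare the performance of our clustering algorithm when other cluster-to-cluster similarity measures are used instead of the average Jaccard similarity that is described in Section 5.2. The other alternatives we consider for cluster-to-cluster similarity are Min. Jaccard, Max. Jaccard, and Total Jaccard. These three similarity measures can be defined as follows. Let $U_i^{(k)}$ and $U_j^{(k)}$ be two clusters at a given iteration $k$.

Min. Jaccard: The minimum of the Jaccard similarities between every schema in $U_i^{(k)}$ and every schema in $U_j^{(k)}$.

$$c\_sim(U_i^{(k)}, U_j^{(k)}) = \min_{S_{i'} \in U_i^{(k)},\, S_{j'} \in U_j^{(k)}} s\_sim(S_{i'}, S_{j'})$$

Max. Jaccard: The maximum of the Jaccard similarities between every schema in $U_i^{(k)}$ and every schema in $U_j^{(k)}$.

$$c\_sim(U_i^{(k)}, U_j^{(k)}) = \max_{S_{i'} \in U_i^{(k)},\, S_{j'} \in U_j^{(k)}} s\_sim(S_{i'}, S_{j'})$$

Total Jaccard: The number of terms common between all the schemas in $U_i^{(k)}$ and $U_j^{(k)}$ divided by the number of all terms that exist in any of the schemas in $U_i^{(k)}$ or $U_j^{(k)}$.

$$c\_sim(U_i^{(k)}, U_j^{(k)}) = \frac{\Big|\Big\{l : \bigwedge_{S_{i'} \in U_i^{(k)}} F_{i'}^l = 1 \,\wedge \bigwedge_{S_{j'} \in U_j^{(k)}} F_{j'}^l = 1\Big\}\Big|}{\Big|\Big\{l : \bigvee_{S_{i'} \in U_i^{(k)}} F_{i'}^l = 1 \,\vee \bigvee_{S_{j'} \in U_j^{(k)}} F_{j'}^l = 1\Big\}\Big|}$$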
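The cluster-to-cluster similarity measures above can be written compactly if each schema is represented by its set of terms. The following Python sketch makes two assumptions that are not stated in this section: that the schema-to-schema similarity is the plain Jaccard coefficient over term sets, and that Avg. Jaccard (defined in Section 5.2, not reproduced here) is the mean over all cross-cluster pairs. Function names and the toy clusters are hypothetical.

```python
from itertools import product


def jaccard(a, b):
    """Assumed schema-to-schema similarity: Jaccard coefficient of two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def min_jaccard(u, v):
    return min(jaccard(a, b) for a, b in product(u, v))


def max_jaccard(u, v):
    return max(jaccard(a, b) for a, b in product(u, v))


def avg_jaccard(u, v):
    # Assumption: mean of the pairwise similarities across the two clusters.
    return sum(jaccard(a, b) for a, b in product(u, v)) / (len(u) * len(v))


def total_jaccard(u, v):
    schemas = list(u) + list(v)
    common = set(schemas[0]).intersection(*schemas[1:])  # terms in every schema
    present = set().union(*schemas)                      # terms in any schema
    return len(common) / len(present) if present else 0.0


# Hypothetical clusters, each a list of schemas given as term sets.
u = [{"title", "director", "year"}, {"title", "year"}]
v = [{"title", "genre"}, {"title", "year", "genre"}]
for measure in (min_jaccard, max_jaccard, avg_jaccard, total_jaccard):
    print(measure.__name__, round(measure(u, v), 4))
```

The pairwise measures touch every cross-cluster pair of schemas, while Total Jaccard needs one intersection and one union over all schemas in both clusters.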

[Figure 2: Schema clustering quality. Panels: average precision, average recall, fraction of schemas in non-homogeneous domains, fraction of unclustered schemas, and average fragmentation, each plotted against c_sim for Avg. Jaccard, Total Jaccard, Min. Jaccard, and Max. Jaccard.]

7.1.3 Generating Queries

To evaluate our query classification algorithm we need to simulate a typical query formulation process in which the user enters a query that includes some attribute names with a particular domain in mind. This query formulation process is a random process that we simulate as follows. We let the number of keywords in each query range from 1 to 10, with 100 queries generated for each number in this range. We use the same domain labeling terminology as in Section 7.1.2. For each randomly generated query $Q_{rand}$, we pick from $B$ a random label $B_{rand}$ for $Q_{rand}$ to target. The label $B_{rand}$ is selected based on the following probability distribution:

$$\Pr(B_{rand}) = \frac{|S(B_{rand})|}{\sum_{j=1}^{|B|} |S(B_j)|}$$

Therefore, a label associated with a larger number of schemas will receive a larger number of queries, ensuring a balanced distribution of queries. Having selected a label $B_{rand}$, we start generating the keywords of the query. For simplicity, we treat the multiple keywords in the same query as a set of conditionally-independent and identically-distributed random variables given $B_{rand}$. Let $T_{all}$ be the set of all terms extracted from all schemas as explained in Section 5.1; that is, $T_{all} = \bigcup_{S_i \in S} T_i$. We need to pick from $T_{all}$ some keywords that a user will typically associate with $B_{rand}$ as characteristic keywords that distinguish it from other labels. For each term $t_l \in T_{all}$, let $Freq(t_l, B_j)$ indicate the number of schemas in $S(B_j)$ that contain the term $t_l$; that is, $Freq(t_l, B_j) = |\{S_i : S_i \in S(B_j) \text{ and } t_l \in T_i\}|$. When picking terms for $B_{rand}$, we filter out the terms that do not exist in a sufficiently large fraction of schemas in $S(B_{rand})$. The fraction that we use for DW and SS is 0.25, while in the case of DDH we use only 0.1 since the size of $S(B_{rand})$ in the case of DDH is counted in hundreds. After filtering out infrequent terms, we need to estimate for each of the remaining terms the probability that the term will be used in a query that targets $B_{rand}$.
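The first two steps of this query generator, picking a target label in proportion to its number of schemas and filtering out infrequent terms, can be sketched as follows. The helper names, the toy labels, and the use of a non-strict comparison against the fraction threshold are assumptions made for illustration.

```python
import random
from collections import Counter


def pick_label(schemas_by_label, rng):
    """Draw a label with probability proportional to its number of schemas."""
    labels = list(schemas_by_label)
    weights = [len(schemas_by_label[label]) for label in labels]
    return rng.choices(labels, weights=weights, k=1)[0]


def frequent_terms(label_schemas, min_fraction):
    """Keep terms found in at least min_fraction of the label's schemas."""
    freq = Counter(term for schema in label_schemas for term in set(schema))
    return {term for term, count in freq.items()
            if count / len(label_schemas) >= min_fraction}


# Hypothetical labeled schemas, each schema given as a set of terms.
schemas_by_label = {
    "movies": [{"title", "director"}, {"title", "year"}, {"title", "genre"}, {"title"}],
    "people": [{"name", "phone"}],
}
rng = random.Random(0)  # fixed seed so the sketch is repeatable
label = pick_label(schemas_by_label, rng)
print(label, sorted(frequent_terms(schemas_by_label[label], 0.25)))
```

A schema that carries several labels would simply appear in the list of each of its labels, which matches the way $S(B_j)$ is defined.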
We use the following formula to estimate the degree by which a term $t_l$ distinguishes a label $B_j$ from other labels:

$$\lambda(t_l, B_j) = \frac{Freq(t_l, B_j) \Big/ \sum_{t \in T_{all}} Freq(t, B_j)}{\dfrac{1}{|B|} \sum_{B_k \in B} \dfrac{Freq(t_l, B_k)}{\sum_{t \in T_{all}} Freq(t, B_k)}}$$

That is, the ratio between the relative frequency of $t_l$ in $B_j$, and the average relative frequency of $t_l$ in all domain labels. We normalize $\lambda(t_l, B_j)$ such that, given a label $B_j$, the summation of the normalized $\lambda(t_l, B_j)$, for all $t_l$, equals 1. The normalized value of $\lambda(t_l, B_j)$ is used as the probability of picking the term $t_l$ given that the label $B_j$ has already been picked. Therefore,

$$\Pr(t_{rand} \mid B_{rand}) = \frac{\lambda(t_{rand}, B_{rand})}{\sum_{t \in T_{all}} \lambda(t, B_{rand})}$$

This way, we assign higher probabilities to the terms that exist in $S(B_{rand})$ with higher ratios relative to their existence in the schemas of other labels.

7.2 Schema Clustering Quality

We compare the effectiveness of our clustering algorithm when using the four similarity measures: Min. Jaccard, Max. Jaccard, Avg. Jaccard and Total Jaccard. We also measure the effect of changing the value of $\tau_{c\_sim}$ on the quality of clustering. First we run our clustering algorithm on the DDH schema set. The clustering algorithm works perfectly on DDH, giving precision and recall values above 0.99 for all $\tau_{c\_sim} \geq 0.2$ and for all similarity measures, except Max. Jaccard which gives low recall for $\tau_{c\_sim} < 0.5$. The perfect performance of the clustering algorithm on DDH is expected since the schemas in DDH belong to few well-separated domains. Next we run the clustering algorithm on the union of the two schema sets DW and SS. Figure 2 shows the performance of the clustering algorithm on the union of DW and SS, using the four similarity measures, as $\tau_{c\_sim}$ varies from 0.1 to 0.9. The figure shows that all the similarity measures perform almost the same, except for Max. Jaccard which performs

worse than the rest under some settings, and is therefore not recommended. Total Jaccard, which is more expensive than the rest, provides no substantial gains over Avg. Jaccard or Min. Jaccard, so it too is not recommended. We recommend either Avg. Jaccard or Min. Jaccard. We use Avg. Jaccard as our default similarity measure as described in Section 5.2. The figure also illustrates how $\tau_{c\_sim}$ affects the effectiveness of clustering. As $\tau_{c\_sim}$ increases, precision and recall increase, and the fraction of schemas in non-homogeneous domains decreases, which are all desirable effects. However, as $\tau_{c\_sim}$ increases, the number of unclustered schemas also increases, which is undesirable. We should nevertheless take into account that 25% of the schemas are already unique (as mentioned in Section 7.1.1) and should therefore remain unclustered. At the extreme value of $\tau_{c\_sim} = 0.9$, all schemas are unclustered. Therefore, we have a trade-off between the number of unclustered schemas and the quality of clustering as measured through precision, recall and the fraction of schemas in non-homogeneous domains. Fragmentation, which does not include unclustered schemas or non-homogeneous domains, generally increases as the value of $\tau_{c\_sim}$ increases from 0.1 to 0.5, since higher values of $\tau_{c\_sim}$ prohibit similar clusters from getting merged before the clustering algorithm terminates, and therefore they get fragmented. Starting from around 0.5, as the value of $\tau_{c\_sim}$ increases fragmentation decreases because $\tau_{c\_sim}$ is becoming so high that it breaks many domains down into unclustered schemas. As more domains get broken down into unclustered schemas (which are not counted as domains), the number of domains significantly decreases. Therefore, there is much less potential to have a label associated with multiple domains. This set of experiments suggests setting $\tau_{c\_sim}$ between 0.2 and 0.3. It also shows that clustering is robust since it is not very sensitive to minor changes in $\tau_{c\_sim}$.

[Table 2: Evaluation of schema clustering. Columns: DW, SS, Both, for $\tau_{c\_sim} = 0.2$ and for $\tau_{c\_sim} = 0.3$. Rows: Precision; Recall; Unclustered; Non-homog.; Fragmentation.]
Table 2 presents results from a set of experiments that focuses on the performance of the clustering algorithm for $\tau_{c\_sim} = 0.2$ and 0.3. This set of experiments is performed on each of the two sets of schemas DW and SS separately, and on the union of DW and SS. As we saw previously, increasing the value of $\tau_{c\_sim}$ from 0.2 to 0.3 increases precision and recall, and decreases the fraction of schemas in non-homogeneous domains, but it also increases the fraction of unclustered schemas. The performance measures are generally better for DW than SS because SS is more noisy and less rigidly structured than DW. The performance on the union of DW and SS is between the performance on the individual sets, which is expected. The important observations are that clustering quality is high, and varying $\tau_{c\_sim}$ does not cause major variations in any of the clustering performance measures. From these experiments, we see that the clustering algorithm produces high quality results for different data sets, and can be effectively and robustly controlled using the parameter $\tau_{c\_sim}$.

7.3 Effect On Mediation And Mapping

Although it is possible in principle to perform schema mediation and mapping without prior clustering, our experiments show that there are serious problems that arise when doing that. In this section, we describe two problems that we observed when doing schema mediation and mapping on our schema sets without prior clustering. For the purpose of our experiments, we use the probabilistic schema mediation and mapping algorithms described in [6]. The first problem is related to the semantic coherence of mediated attributes. It is common to encounter two attributes from two different domains having exactly the same name but with different meanings depending on the domain. For example, in the DW schema set, the attribute family name is used in a schema from the people domain to refer to the last name of a person, and in a schema from the biology domain to refer to the family of a living organism (i.e., taxonomic rank). When performing mediation and mapping on DW without clustering the schemas first, these two attributes are associated with each other in a single mediated attribute.
At runtime, when posing a query on the mediated schema to retrieve values from the attribute family name, the result is an incoherent set of values obtained from both data sources. This problem does not arise when schemas are clustered before mediation. The second problem is related to the size of the mediated schema. One of the techniques used in schema mediation to make it tractable is to use an attribute frequency threshold to filter out attributes that appear in only a small fraction of schemas (e.g., in [6] the threshold is 0.1). However, this threshold is problematic if no clustering is done before mediation. In that case, the threshold will eliminate most or all the attributes from the domains that have fewer schemas than other domains, causing these small domains to be under-represented or completely absent in the mediated schema. For example, when performing schema mediation on the DDH schema set with a threshold of 0.1 and without clustering, the result is a mediated schema in which 2 of the 5 domains of DDH are absent. Even after reducing the threshold to 0.01, the smallest domain, namely people, is still under-represented with only 4 attributes in the mediated schema, not including the most relevant attributes like phone, address, and email. Picking a very small threshold value will cause larger domains to be over-represented by including a large number of infrequent and uninteresting attributes in the mediated schema. Going to the extreme of completely eliminating the threshold (i.e., using a threshold of 0) results in a meaningless mediated schema that is merely a union of all attributes from all schemas (2060 mediated attributes in the case of DDH). Besides being meaningless, this huge number of mediated attributes significantly increases the running time of schema mediation and mapping. The total running time for mediation and mapping in this case is 5 hours, while in all our experiments, when doing schema clustering, mediation, and mapping, using typical parameters, the total end-to-end running time is always less than 25 minutes. We conclude that schema clustering before mediation and mapping improves quality and scalability.

7.4 Query Classification Quality

In this section, we present experiments to evaluate the accuracy of our naive Bayes classifier.
First, we run our clustering algorithm to cluster schemas into domains. Next, we

construct a naive Bayes classifier as described in Section 6 based on the domains that are generated from clustering. We then use that classifier to classify queries that we generate randomly as described in Section 7.1.3. The classifier returns a ranked list of domains, sorted in descending order of their relevance to the query. For each query size from 1 to 10 we compute the top-1 fraction, which is the fraction of queries for which the top-ranked domain identified by the classifier is labeled with the same label $B_{rand}$ as the query. We also compute the top-3 fraction, which is the fraction of queries for which at least one of the top three domains is labeled with the same label as the query. The top-3 fraction is meaningful only for DW and SS since the number of labels is relatively large. For DDH, where the number of labels is only 5, we only compute the top-1 fraction. Our experiments on DDH give almost perfect results, with the top-1 fraction being 1 for all query sizes, except for single-keyword queries where the top-1 fraction drops slightly below 1. The classification results on the union of DW and SS are shown in Figure 3. As the number of keywords per query increases, classification accuracy increases until the top-1 fraction becomes almost 1. Our results show that the classifier works well, even though the keyword queries generated by our random query generator sometimes include very non-indicative keywords due to the random nature of the query generator. For small query sizes, it is quite common to generate a query that is dominated by non-indicative keywords. In addition to quality, we also measure the running time needed to construct the naive Bayes classifier. For the large schema set DDH, it takes only about 5 minutes to construct the classifier, while for the union of DW and SS it takes less than a minute.

[Figure 3: Query classification quality. Fraction of queries with relevant top results (top-1 fraction and top-3 fraction) versus number of keywords.]

8. CONCLUSION

The growing number of structured data sources on the web has entailed a growing interest in data integration for these sources. Existing data integration techniques operate on data sources that belong to a single domain. At web scale, it is infeasible to cluster data sources into domains manually.
We deal with this problem and propose a schema clustering approach that leverages techniques from document clustering. We use a probabilistic model to handle the uncertainty in assigning schemas to domains, which fits with previous work on data integration with uncertainty. We also propose a technique based on naive Bayes classification that reasons on top of our probabilistic model in order to assign keyword queries posed by users to the most relevant domains.

Acknowledgements

We thank Alon Halevy and Anish Das Sarma for sharing the schema set from [6] with us. This work was supported by a Google Research Award and by the Natural Sciences and Engineering Research Council of Canada (NSERC) through the Business Intelligence Network strategic networks grant.

9. REFERENCES

[1] A. Aboulnaga and K. El Gebaly. μBE: User guided source selection and schema mediation for internet scale data integration. In ICDE.
[2] N. O. Andrews and E. A. Fox. Recent developments in document clustering. Technical report, Computer Science, Virginia Tech.
[3] B. Cestnik. Estimating probabilities: A crucial task in machine learning. In Proc. of the 9th European Conference on Artificial Intelligence, 1990.
[4] K. C.-C. Chang, B. He, C. Li, M. Patel, and Z. Zhang. Structured databases on the web: observations and implications. SIGMOD Rec., 33(3).
[5] W. W. Cohen, P. Ravikumar, and S. E. Fienberg. A comparison of string distance metrics for name-matching tasks. In Proc. of the Workshop on Data Cleaning and Object Consolidation at KDD.
[6] A. Das Sarma, X. Dong, and A. Halevy. Bootstrapping pay-as-you-go data integration systems. In SIGMOD.
[7] X. L. Dong, A. Halevy, and C. Yu. Data integration with uncertainty. VLDBJ, 18(2).
[8] R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. Wiley-Interscience Publication, second edition.
[9] M. Franklin, A. Halevy, and D. Maier. From databases to dataspaces: a new abstraction for information management. SIGMOD Rec., 34(4).
[10] D. Gusfield. Algorithms on Strings, Trees and Sequences: Computer Science and Computational Biology. Cambridge University Press, 1997.
[11] B. He and K. C.-C. Chang. Statistical schema matching across web query interfaces. In SIGMOD.
[12] B. He, T. Tao, and K. C.-C. Chang. Organizing structured web sources by query schemas: a clustering approach. In CIKM.
[13] J. Madhavan, S. Cohen, X. Dong, A. Halevy, S. Jeffery, D. Ko, and C. Yu. Web-scale data integration: You can afford to pay as you go. In CIDR.
[14] J. Madhavan, D. Ko, L. Kot, V. Ganapathy, A. Rasmussen, and A. Halevy. Google's deep web crawl. Proc. VLDB Endow., 1(2).
[15] T. M. Mitchell. Machine Learning. McGraw-Hill, 1997.
[16] E. Rahm and P. A. Bernstein. A survey of approaches to automatic schema matching. VLDBJ, 10(4), 2001.
[17] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Addison-Wesley Longman.
[18] W. Wu, A. Doan, and C. T. Yu. Merging interface schemas on the deep web via clustering aggregation. In ICDM, 2005.

Newton-Raphson Method of Solving a Nonlinear Equation Autar Kaw

Newton-Raphson Method of Solving a Nonlinear Equation Autar Kaw Newton-Rphson Method o Solvng Nonlner Equton Autr Kw Ater redng ths chpter, you should be ble to:. derve the Newton-Rphson method ormul,. develop the lgorthm o the Newton-Rphson method,. use the Newton-Rphson

More information

Incorporating Negative Values in AHP Using Rule- Based Scoring Methodology for Ranking of Sustainable Chemical Process Design Options

Incorporating Negative Values in AHP Using Rule- Based Scoring Methodology for Ranking of Sustainable Chemical Process Design Options 20 th Europen ymposum on Computer Aded Process Engneerng ECAPE20. Perucc nd G. Buzz Ferrrs (Edtors) 2010 Elsever B.V. All rghts reserved. Incorportng Negtve Vlues n AHP Usng Rule- Bsed corng Methodology

More information

Resistive Network Analysis. The Node Voltage Method - 1

Resistive Network Analysis. The Node Voltage Method - 1 esste Network Anlyss he nlyss of n electrcl network conssts of determnng ech of the unknown rnch currents nd node oltges. A numer of methods for network nlyss he een deeloped, sed on Ohm s Lw nd Krchoff

More information

Fuzzy Clustering for TV Program Classification

Fuzzy Clustering for TV Program Classification Fuzzy Clusterng for TV rogrm Clssfcton Yu Zhwen Northwestern olytechncl Unversty X n,.r.chn, 7007 yuzhwen77@yhoo.com.cn Gu Jnhu Northwestern olytechncl Unversty X n,.r.chn, 7007 guh@nwpu.edu.cn Zhou Xngshe

More information

Vector Geometry for Computer Graphics

Vector Geometry for Computer Graphics Vector Geometry for Computer Grphcs Bo Getz Jnury, 7 Contents Prt I: Bsc Defntons Coordnte Systems... Ponts nd Vectors Mtrces nd Determnnts.. 4 Prt II: Opertons Vector ddton nd sclr multplcton... 5 The

More information

WHAT HAPPENS WHEN YOU MIX COMPLEX NUMBERS WITH PRIME NUMBERS?

WHAT HAPPENS WHEN YOU MIX COMPLEX NUMBERS WITH PRIME NUMBERS? WHAT HAPPES WHE YOU MIX COMPLEX UMBERS WITH PRIME UMBERS? There s n ol syng, you n t pples n ornges. Mthemtns hte n t; they love to throw pples n ornges nto foo proessor n see wht hppens. Sometmes they

More information

Optimal Pricing Scheme for Information Services

Optimal Pricing Scheme for Information Services Optml rcng Scheme for Informton Servces Shn-y Wu Opertons nd Informton Mngement The Whrton School Unversty of ennsylvn E-ml: shnwu@whrton.upenn.edu e-yu (Shron) Chen Grdute School of Industrl Admnstrton

More information

Regular Sets and Expressions

Regular Sets and Expressions Regulr Sets nd Expressions Finite utomt re importnt in science, mthemtics, nd engineering. Engineers like them ecuse they re super models for circuits (And, since the dvent of VLSI systems sometimes finite

More information

ALABAMA ASSOCIATION of EMERGENCY MANAGERS

ALABAMA ASSOCIATION of EMERGENCY MANAGERS LBM SSOCTON of EMERGENCY MNGERS ON O PCE C BELLO MER E T R O CD NCY M N G L R PROFESSONL CERTFCTON PROGRM .. E. M. CERTFCTON PROGRM 2014 RULES ND REGULTONS 1. THERE WLL BE FOUR LEVELS OF CERTFCTON. BSC,

More information

Irregular Repeat Accumulate Codes 1

Irregular Repeat Accumulate Codes 1 Irregulr epet Accumulte Codes 1 Hu Jn, Amod Khndekr, nd obert McElece Deprtment of Electrcl Engneerng, Clforn Insttute of Technology Psden, CA 9115 USA E-ml: {hu, mod, rjm}@systems.cltech.edu Abstrct:

More information

WiMAX DBA Algorithm Using a 2-Tier Max-Min Fair Sharing Policy

WiMAX DBA Algorithm Using a 2-Tier Max-Min Fair Sharing Policy WMAX DBA Algorthm Usng 2-Ter Mx-Mn Fr Shrng Polcy Pe-Chen Tseng 1, J-Yn Ts 2, nd Wen-Shyng Hwng 2,* 1 Deprtment of Informton Engneerng nd Informtcs, Tzu Ch College of Technology, Hulen, Twn pechen@tccn.edu.tw

More information

Homework 3 Solutions

Homework 3 Solutions CS 341: Foundtions of Computer Science II Prof. Mrvin Nkym Homework 3 Solutions 1. Give NFAs with the specified numer of sttes recognizing ech of the following lnguges. In ll cses, the lphet is Σ = {,1}.

More information

Bayesian Updating with Continuous Priors Class 13, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom

Bayesian Updating with Continuous Priors Class 13, 18.05, Spring 2014 Jeremy Orloff and Jonathan Bloom Byesin Updting with Continuous Priors Clss 3, 8.05, Spring 04 Jeremy Orloff nd Jonthn Bloom Lerning Gols. Understnd prmeterized fmily of distriutions s representing continuous rnge of hypotheses for the

More information

Experiment 6: Friction

Experiment 6: Friction Experiment 6: Friction In previous lbs we studied Newton s lws in n idel setting, tht is, one where friction nd ir resistnce were ignored. However, from our everydy experience with motion, we know tht

More information

A Hadoop Job Scheduling Model Based on Uncategorized Slot

A Hadoop Job Scheduling Model Based on Uncategorized Slot Journl of Communctons Vol. 10, No. 10, October 2015 A Hdoop Job Schedulng Model Bsed on Unctegored Slot To Xue nd Tng-tng L Deprtment of Computer Scence, X n Polytechnc Unversty, X n 710048, Chn Eml: xt73@163.com;

More information

What is Candidate Sampling

What is Candidate Sampling What s Canddate Samplng Say we have a multclass or mult label problem where each tranng example ( x, T ) conssts of a context x a small (mult)set of target classes T out of a large unverse L of possble

More information

Reasoning to Solve Equations and Inequalities

Reasoning to Solve Equations and Inequalities Lesson4 Resoning to Solve Equtions nd Inequlities In erlier work in this unit, you modeled situtions with severl vriles nd equtions. For exmple, suppose you were given usiness plns for concert showing

More information

Section 5-4 Trigonometric Functions

Section 5-4 Trigonometric Functions 5- Trigonometric Functions Section 5- Trigonometric Functions Definition of the Trigonometric Functions Clcultor Evlution of Trigonometric Functions Definition of the Trigonometric Functions Alternte Form

More information

EQUATIONS OF LINES AND PLANES

EQUATIONS OF LINES AND PLANES EQUATIONS OF LINES AND PLANES MATH 195, SECTION 59 (VIPUL NAIK) Corresponding mteril in the ook: Section 12.5. Wht students should definitely get: Prmetric eqution of line given in point-direction nd twopoint

More information

ORIGIN DESTINATION DISAGGREGATION USING FRATAR BIPROPORTIONAL LEAST SQUARES ESTIMATION FOR TRUCK FORECASTING
