A Class of Linear Algorithms to Process Sets of Segments

Gonzalo Navarro, Ricardo Baeza-Yates
Department of Computer Science, University of Chile
Blanco Encalada 2120, Santiago, Chile
{gnavarro,rbaeza}@dcc.uchile.cl

Abstract

We address the problem of efficiently performing operations on sets of segments. While current solutions to operate segments focus on single operations (e.g. insertion or searching), we are interested in set-oriented operations (e.g. union, difference and others more specific to segments). In those cases, extending the current approaches leads to $O(n \log n)$ time complexity to manipulate sets of $n$ elements. We show that a wide class of operations can in fact be performed in $O(n)$ time, i.e. in a constant amortized cost per processed segment. We present the general framework and show a number of operations of that kind, depicting and analyzing the algorithms. Finally, we show some applications of this technique.

This work has been supported in part by FONDECYT grants 1940271 and 1950622.

1 Introduction

In many practical applications the problem of manipulating segments arises under different forms. Typical examples are computational geometry [14, 11, 6, 3], temporal databases [7, 2], database models with constraints [9, 8] and structured text search [13, 10].

Because of this situation, the problem of manipulating a set of segments has been extensively studied [14, 5]. All these approaches focus on "single" operations, in which a single segment is operated against a set of segments. Examples are: searching for a segment, inserting a new segment into the set, removing a segment from the set, etc. In [5], it is shown how to achieve $O(\log n)$ behavior for these operations.

However, for some applications, that type of operations are less important, while more "set-oriented" operations are needed. This is the case, for example, of set-oriented query languages for structured text or temporal databases. Examples of the operations that are commonly performed in these applications are retrieving segments including other segments (e.g. all chapters containing at least four figures, in structured text search), or all segments from a set shortly preceding a segment from another set (e.g. all musical scenes where a given actor appears that shortly precede a colorful scene, in a temporal database describing movies).

A trivial extension of the current approaches to deal with these requirements leads to $O(n \log n)$ solutions for $n$ elements. Our aim is to show that under quite general assumptions, much better solutions can be found.

We show that in many cases, a solution similar to list merging [1] can be applied to sets of segments, thus leading to $O(n)$ algorithms. Therefore, we extend this simple mechanism to deal with a more complex data structure, and analyze under which situations the idea works.

An important consideration is that we support nesting in the segments forming a set, since if no nesting is allowed, the segments are trivially linearly ordered, and it is easy to develop linear algorithms [4].
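The main contribution of this paper is a simple technique to perform $O(n)$ set operations in $O(n)$ amortized time, while the cost of an individual operation is known to be $\Omega(\log n)$ in the worst case. An earlier but more detailed version of part of this work and its application to structured text retrieval can be found in [12, 13].

This paper is organized as follows. In section 2 we explain our model of operation. In section 3 we explain the general scheme of our solution. In section 4 we show a number of problems for which we found linear algorithms. In section 5 we show some complex operations for which we found more expensive solutions. In section 6 we show some applications using these algorithms. Finally, in section 7 we present our conclusions and future work directions.

2 Preliminaries

A segment is a pair $\langle x, y \rangle$, where $x$ and $y$ are real numbers and $x \le y$. We define $From(\langle x, y \rangle) = x$ and $To(\langle x, y \rangle) = y$. A segment $a = \langle x, y \rangle$ is said to contain another one $b = \langle x', y' \rangle$ (and we denote it as $a \supseteq b$ or $b \subseteq a$) iff $x \le x' \wedge y' \le y$. If a segment contains another one we say that they nest.

The segments are said to be disjoint iff $y < x' \vee y' < x$ (we say $a < b$ in the first case and $a > b$ in the other). If two segments do not nest and are not disjoint we say that they overlap. Finally, we use the equal sign between segments with the obvious meaning, and $a \supset b$ or $b \subset a$ to denote $a \supseteq b \wedge a \ne b$.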
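The relations defined in the Preliminaries can be written down directly as predicates. The following is a minimal sketch in Python; the names `Segment`, `contains`, `disjoint`, `nest` and `overlap` are illustrative choices and do not come from the paper.

```python
from typing import NamedTuple

class Segment(NamedTuple):
    x: float  # From(s)
    y: float  # To(s); a valid segment has x <= y

def contains(a: Segment, b: Segment) -> bool:
    """a contains b: x <= x' and y' <= y (non-strict, so a contains itself)."""
    return a.x <= b.x and b.y <= a.y

def disjoint(a: Segment, b: Segment) -> bool:
    """Either a ends before b starts (a < b) or b ends before a starts (a > b)."""
    return a.y < b.x or b.y < a.x

def nest(a: Segment, b: Segment) -> bool:
    """Two segments nest when one of them contains the other."""
    return contains(a, b) or contains(b, a)

def overlap(a: Segment, b: Segment) -> bool:
    """Segments overlap when they neither nest nor are disjoint."""
    return not nest(a, b) and not disjoint(a, b)
```

For any pair of segments exactly one of `nest`, `disjoint` and `overlap` holds, and that three-way case split is what the comparisons in the later algorithms rely on.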

Our general problem is: given two sets of segments $A$ and $B$, obtain a new set $C$ by processing $A$ and $B$ in some way, in $O(|A| + |B|)$ time.

To achieve that goal, we impose two further restrictions, one on the sets and one on the processing operations:

- Each set must form a hierarchy, i.e. no overlaps are allowed between any two segments of a given set. This is because the only way we could obtain linear algorithms for sets with overlapping segments was by preventing nesting inside the set. In that case the segments can be trivially ordered by their first or last extreme, and the normal list merging algorithms work [4]. However, in some applications (e.g. structured text search) nesting can be more important than overlapping, so we are interested in providing nesting.

  Observe that we do not need the union of both operands to form a hierarchy, as long as the result does. That is, our algorithms also work if segments from different arguments overlap.

- We are interested in set-oriented operations, on which we impose the additional restriction of operating with proximal segments. Proximality means that the presence or absence of a given segment in the final result must be defined in terms of relatively close segments in the arguments. This is because we plan to traverse the arguments in synchronization to produce the results.

We are not interested in how the sets are built in the first place, or in how they are finally used. Our scheme is not very efficient for building the trees by consecutive random insertions, but in the applications we are interested in, those problems are solved in an ad-hoc way (we show an example in section 6).

3 A Solution Scheme

Our approach to the solution can be defined in general terms as follows:

- We use a tree data structure to arrange the segments of each set. Since there are no overlaps, we define the tree by the containment relations between segments.

- To obtain the solution set (which is also arranged in a tree), we traverse both operand trees simultaneously, in a synchronized way, while we generate the solution tree at the same time. The idea is to generalize the list merging algorithms, to make them operate on trees and select elements under more complex criteria. To be able to generate the whole solution by traversing the operands just once, the assumption of proximal segments is necessary. All the algorithms consist of variations of this idea.

In what follows, we describe in more detail our data structure and algorithmic scheme.

3.1 Data Structure

As said, we arrange the set of segments in a tree. The criterion to define the tree is straightforward: a segment $a$ descends from another one $b$ in the tree iff $a \subset b$. Although for clarity we do not allow repetitions, the algorithms are easily modified to account for this.
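To make the containment criterion concrete, here is a hedged sketch of one way to arrange a hierarchy of segments into an ordered forest. The paper leaves tree construction open, so this is an illustration rather than the authors' method: `Node` and `build_forest` are hypothetical names, and the initial sort costs $O(n \log n)$, which is outside the linear traversals discussed here.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    seg: tuple                                   # (x, y) with x <= y
    children: list = field(default_factory=list)

def build_forest(segments):
    """Arrange non-overlapping, non-repeated segments into an ordered forest.

    A segment becomes a descendant of every segment that contains it.
    """
    forest, stack = [], []        # stack holds the chain of currently open ancestors
    # Start ascending, end descending: a container is seen before its contents.
    for x, y in sorted(segments, key=lambda s: (s[0], -s[1])):
        # Close every open ancestor that does not contain (x, y).
        while stack and not (stack[-1].seg[0] <= x and y <= stack[-1].seg[1]):
            if x <= stack[-1].seg[1]:
                # Not contained and not disjoint: the input is not a hierarchy.
                raise ValueError("overlapping segments do not form a hierarchy")
            stack.pop()
        node = Node((x, y))
        (stack[-1].children if stack else forest).append(node)
        stack.append(node)
    return forest
```

Siblings come out in increasing order and every child lies inside its parent, which are the two constraints the tree type must satisfy.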

A formal definition of our type Tree follows:

    $Segm = \{\langle x, y \rangle \,/\, x, y \in R \wedge x \le y\}$
    $Subtree = Segm \times Tree$
    $Tree = Subtree^*$

    $\forall (s, ((s_1, t_1), \ldots, (s_k, t_k))) \in Subtree, \forall i \in 1..k, \; s \supset s_i$
    $\forall ((s_1, t_1), \ldots, (s_k, t_k)) \in Tree, \forall i \in 1..k-1, \; s_i < s_{i+1}$

where $R$ is the set of real numbers. As it can be seen, the root of our tree does not have an associated segment. Our trees can be seen in fact as forests with order among their trees.

We define some functions to access this tree type:

- $node: Subtree \to Segm$ and $subtree: Subtree \to Tree$ are selectors, i.e. if $S = (s, ((s_1, t_1), \ldots, (s_k, t_k))) \in Subtree$, then $node(S) = s$ and $subtree(S) = ((s_1, t_1), \ldots, (s_k, t_k))$.
- $head: Subtree^+ \to Subtree$ returns the first element of the list, i.e. $head(\{l_1, \ldots, l_k\}) = l_1$.
- $tail: Subtree^+ \to Subtree^*$ eliminates the first element, i.e. $tail(\{l_1, \ldots, l_k\}) = \{l_2, \ldots, l_k\}$.
- $\varepsilon \in Tree$ denotes an empty tree.

Although we do not consider any particular representation, the data structure for our trees must allow efficient computation of these access functions.

3.2 Algorithmic Scheme

The general form of our algorithms consists of traversing both trees, normally in pre or postorder. At each step, the current nodes of both trees are compared, and depending on the results and the operation, we move in one or both trees, going to the next node or jumping directly to the next sibling (thus skipping the current subtree). Each particular case is a variation of this general idea. We show a number of examples in the next section.

Thus we are going to describe the algorithms by marking nodes that must be selected or rejected from the final solution (almost all useful operations select elements from only one operand). This marking can be a simple boolean mark or it may have a more complex meaning. This provides a good abstraction, since we do not detail how we collect marked or delete unmarked nodes. Moreover, the marking algorithm can be used for two complementary operations, depending on whether we select or reject the marked nodes. Some algorithms can be solved without marking, though.

We describe now an important abstraction related to the way we traverse the trees. Unlike list traversal, which is trivial, we use two types of traversal operations here, one to "go right" and other to "go down" in the tree. Since several trees are traversed at the same time and it is wasteful to store pointers to parents, we keep explicit stacks to implement these two traversals.

- At the beginning of each algorithm, we initialize an empty stack for each argument. This stack holds pointers to trees, and is referred to as $\langle argument \rangle.stack$. We use the normal operations push, pop, empty and top on the stacks.
- After initializing the stack, if the tree is not empty, we push the first top-level node of the tree into the stack.

- The algorithms access only the top of the stacks and terminate when a stack is empty. Argument tree names are uppercase letters. Their lowercase version denotes the top nodes of their corresponding stacks, e.g. $p = head(top(P.stack))$. If $P.stack$ is empty, $p$ is assumed to be $\langle \infty, \infty \rangle$ (where $\infty = \infty$, $\infty > x \; \forall x \in R$, and $\infty + x = \infty \; \forall x \in R$).
- To go right, we replace the current top of the stack by its next sibling. If there is no next sibling, we pop the current top and retry the operation with the parent. We eventually empty the stack in this way.
- To go down in the tree, we test whether the current top of the stack has children or not. If it does, we push its first child into the stack. If it does not, we perform a "go right" operation. In fact, going down means "process first the children and then the siblings".

We use a pseudocode notation for our algorithms. We include a case-like instruction (a big left brace). Figure 1 shows the basic algorithms for tree traversal.

    Init(P)
        Empty(P.stack).
        If ($P \ne \varepsilon$) Push(P.stack, P).

    Down(P)
        If ($subtree(p) \ne \varepsilon$) Push(P.stack, subtree(p))
        else Right(P).

    Right(P)
        While ($\neg$Empty(P.stack) $\wedge$ tail(Top(P.stack)) $= \varepsilon$) Pop(P.stack).
        If ($\neg$Empty(P.stack)) Top(P.stack) $\leftarrow$ tail(Top(P.stack)).

Figure 1: Basic operations for tree traversal.

We use the following numbers in the analysis: $n_X$ is the size of the set corresponding to operand $X$, $h_X$ is the height of its tree (in the worst case it can be $n_X$) and $d_X$ is the maximum degree of its tree (it can also be $n_X$ in a flat tree). We also use $n$, $d$ and $h$ as the maximum corresponding value between all operands (there are normally two operands).

Two observations about the analysis:

- It should be clear that either collecting marked or deleting unmarked nodes is $O(n)$ time, where $n$ is the number of nodes of the tree.
- Although a particular operation of tree traversal can work up to $O(h)$, we note that the whole traversal, even by using Down, is $O(n)$. This is because no edge of the tree is traversed more than twice. So, the amortized cost of tree traversals is always linear with the size of the tree.

We are now in position to describe a number of example algorithms. Their description is very simple with these primitives.
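The traversal primitives of Figure 1 can be sketched with an explicit stack as follows. This is an illustrative Python version under stated assumptions: `Node`, `Cursor`, `preorder` and `top_level` are hypothetical names, and each stack entry stores a sibling list plus an index, standing in for the "pointer to a tree" of the text so that taking the tail costs $O(1)$.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    seg: tuple                                   # (x, y) with x <= y
    children: list = field(default_factory=list)

class Cursor:
    """Explicit-stack cursor over an ordered forest (a list of Node)."""

    def __init__(self, forest):                  # Init(P)
        self.stack = [[forest, 0]] if forest else []

    def node(self):
        """Current node, or None once the stack is empty."""
        if not self.stack:
            return None
        siblings, i = self.stack[-1]
        return siblings[i]

    def seg(self):
        """Current segment; the sentinel (inf, inf) stands for an empty stack."""
        n = self.node()
        return n.seg if n is not None else (math.inf, math.inf)

    def right(self):                             # Right(P)
        # Pop every level whose current node has no next sibling ...
        while self.stack and self.stack[-1][1] + 1 == len(self.stack[-1][0]):
            self.stack.pop()
        # ... then advance to the next sibling at the level that remains.
        if self.stack:
            self.stack[-1][1] += 1

    def down(self):                              # Down(P)
        kids = self.node().children
        if kids:
            self.stack.append([kids, 0])
        else:
            self.right()

def preorder(forest):
    """Repeated Down visits every node in preorder."""
    c, out = Cursor(forest), []
    while c.node() is not None:
        out.append(c.node().seg)
        c.down()
    return out

def top_level(forest):
    """Repeated Right visits only the top level, skipping whole subtrees."""
    c, out = Cursor(forest), []
    while c.node() is not None:
        out.append(c.node().seg)
        c.right()
    return out
```

The `preorder` helper follows the observation that going down means processing the children first and then the siblings. Each level of the stack is pushed once and popped once, in line with the amortized $O(n)$ argument.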

4 Linear Operations

There are a number of interesting problems that admit a linear implementation, for example: set manipulation, segments including or included in others, segments after or before others, etc. In this section we explain in detail a couple of them and their analysis.

4.1 Set Difference and Intersection

Set difference and intersection are complementary versions of a single marking algorithm. The idea is that the trees are unmarked at the first place, and we traverse both trees in synchronization, marking all nodes of the first argument that are also in the second one. Thus, we later implement set difference by collecting unmarked nodes and set intersection by collecting marked nodes.

Figure 2 shows the algorithm to mark the tree. $P$ and $Q$ are the arguments.

    Init(P). Init(Q).
    While $\max(To(p), To(q)) < \infty$
        $p < q$ : Right(P).
        $p > q$ : Right(Q).
        $p = q$ : Mark $p$. Down(P).
        $p \subset q$ : Down(Q).
        else : Down(P).

Figure 2: Marking algorithm for set difference or intersection.

This algorithm is linear, since a single traversal is done on each argument (that traversal has been already shown linear), and we work $O(1)$ at each step. Therefore, we have $O(n_P + n_Q) = O(n)$ time. It is also $\Omega(n)$ in the worst case.

4.2 Segments Included in Others

Another interesting operation is $In(P, Q)$, that selects elements from $P$ that are included in a segment of $Q$.

The algorithm to solve $In(P, Q)$ is presented in Figure 3. We traverse the top levels of $P$ and $Q$, in synchronism. When a node of $P$ is included in a top-level node of $Q$, that subtree of $P$ is marked. Otherwise we discard that $P$ node and continue with its children.

To avoid too much marking overhead, we state that a mark in a node means that not only that node but also its whole subtree is considered marked. The collection algorithm must be modified to account for this.

This algorithm is $O(n_P + n_Q) = O(n)$, by the same arguments as before. In this case, we can refine this analysis as follows: each time we do a Down(P) is because it contains an element of $Q$ or because it overlaps with an element of $Q$, thus each extreme of each segment of (the top-level of) $Q$ is compared, at most, with a complete path of $P$ (length $h_P$). That is because once we descend the first level in $P$, the relevant list from the top-level of $Q$ has only one element (the original $q$). At each level of this path, the operation can take us $d_P$ comparisons, thus the cost is $O(d_Q h_P d_P) = O(d^2 h)$. That means, for example, that in an application with constant $d$ and balanced trees the operation takes $O(\log n)$ (at least to do the marking).
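The two marking algorithms of this section (Figure 2 for difference and intersection, Figure 3 for $In(P, Q)$) can be sketched in Python as follows. This is a reading of the figures under stated assumptions, not the authors' code: `Node`, `Cursor`, `inside`, `mark_common`, `mark_included` and `collect` are hypothetical names, and `collect` uses plain recursion for brevity.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Node:
    seg: tuple                                   # (x, y) with x <= y
    children: list = field(default_factory=list)
    marked: bool = False

class Cursor:
    """Explicit-stack cursor: entries are [sibling list, index]."""
    def __init__(self, forest):
        self.stack = [[forest, 0]] if forest else []
    def node(self):
        return self.stack[-1][0][self.stack[-1][1]] if self.stack else None
    def seg(self):
        return self.node().seg if self.stack else (math.inf, math.inf)
    def right(self):
        while self.stack and self.stack[-1][1] + 1 == len(self.stack[-1][0]):
            self.stack.pop()
        if self.stack:
            self.stack[-1][1] += 1
    def down(self):
        kids = self.node().children
        if kids:
            self.stack.append([kids, 0])
        else:
            self.right()

def inside(a, b):
    """a is included in b (non-strict)."""
    return b[0] <= a[0] and a[1] <= b[1]

def mark_common(P, Q):
    """Figure 2: mark every node of P whose segment also occurs in Q."""
    p, q = Cursor(P), Cursor(Q)
    while max(p.seg()[1], q.seg()[1]) < math.inf:
        a, b = p.seg(), q.seg()
        if a[1] < b[0]:            # p < q
            p.right()
        elif b[1] < a[0]:          # p > q
            q.right()
        elif a == b:               # p = q
            p.node().marked = True
            p.down()
        elif inside(a, b):         # p strictly inside q
            q.down()
        else:                      # p contains q, or they overlap
            p.down()

def mark_included(P, Q):
    """Figure 3: mark top-most nodes of P included in a top-level node of Q."""
    p, q = Cursor(P), Cursor(Q)
    while max(p.seg()[1], q.seg()[1]) < math.inf:
        a, b = p.seg(), q.seg()
        if a[1] < b[0]:
            p.right()
        elif b[1] < a[0]:
            q.right()
        elif inside(a, b):         # the mark covers the whole subtree of p
            p.node().marked = True
            p.right()
        else:
            p.down()

def collect(forest, marked=True, inherit=False, above=False):
    """Preorder list of segments whose (possibly inherited) mark equals `marked`."""
    out = []
    for n in forest:
        m = n.marked or (inherit and above)
        if m == marked:
            out.append(n.seg)
        out += collect(n.children, marked, inherit, m)
    return out
```

After `mark_common(P, Q)`, `collect(P, marked=True)` gives the intersection and `collect(P, marked=False)` the difference. After `mark_included(P, Q)`, passing `inherit=True` makes a mark extend to the whole subtree, as section 4.2 requires of the collection step.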

    Init(P). Init(Q).
    While $\max(To(p), To(q)) < \infty$
        $p < q$ : Right(P).
        $p > q$ : Right(Q).
        $p \subseteq q$ : Mark $p$. Right(P).
        else : Down(P).

Figure 3: Marking algorithm for segments included in others.

Then, the complexity of this operator is $O(\min(n_P + n_Q, d_Q h_P d_P)) = O(\min(n, d^2 h))$.

5 Complex Operations

Although most operations can be implemented in linear time, there are more complex operations that seem not to have a linear implementation: segments included in others with positions, segments including $k$ segments, and complex versions of "after" and "before" are some examples. We explain in detail the first one.

5.1 Segments Included in Others with Positions

An operation that happens to be useful for applications is to select segments from $P$ that are included in some segment $q$ of $Q$ at a given position. The position is defined in terms of the other segments of $P$ that are included under the same $q$ (e.g. the third section of each chapter). Since there may be segments included in $q$ that nest in $P$ (e.g. sections inside sections), we define "position" by only considering the maximal segments of $P$ included in $q$, for each $q$. If a segment of $P$ overlaps with $q$, the segment and its descendants in $P$ are not considered (see Figure 4). Other criteria produce slightly different algorithms.

[Figure 4: diagram showing segments of $P$ under a segment $q$.]

Figure 4: Criterion to participate in inclusion with positions. Ellipses indicate selected segments.

Another concern is which language will we use to denote the allowed positions of an included node. We can use, e.g. first, $k$-th, last, last$-k$, prime positions, etc. For this algorithm we use a generic predicate to avoid any restriction: $S$ is an expression denoting the allowed positions.

The algorithm requires simple boolean marking. We traverse both trees in synchronization. When we find a set of nodes of $P$ included in one of $Q$, we mark the $S$-th nodes, and then we pass again over the included $P$-nodes, this time comparing them with the subtree of the $Q$-node. If, instead, the node of $P$ includes one of $Q$, we follow the children of the $P$-node. Figure 5 shows the algorithm.

    Init(P). Init(Q).
    While $\max(To(p), To(q)) < \infty$
        $p < q$ : Right(P).
        $p > q$ : Right(Q).
        $p \subseteq q$ : $lp \leftarrow$ Top(P.stack). $pos \leftarrow 1$.
                While $head(lp) \subseteq q$
                    If ($pos \in S$) Mark $head(lp)$.
                    $lp \leftarrow tail(lp)$. $pos \leftarrow pos + 1$.
                Down(Q).
        else : Down(P).

Figure 5: Marking algorithm for segments included in others at a given position.

To analyze this algorithm, consider that we can traverse both $P$ and $Q$ completely, and that the final collection of nodes is $O(n_P)$. Observe that each element of $Q$ is deleted from the problem by doing at most $O(d_P)$ work (when $p \subseteq q$), thus the algorithm is $O(n_P + n_Q d_P)$. But also observe that each element of $P$ can be worked on by at most a complete path of $Q$, thus the algorithm is also $O(n_Q + n_P h_Q)$. Then, the algorithm has the best from both complexities, namely $O(\min(n_P + n_Q d_P, n_Q + n_P h_Q)) = O(n \min(d, h))$.

To see that it is also $\Omega(n^2)$, consider the following example: $In(\{\langle 1,1 \rangle, \langle 2,2 \rangle, \ldots, \langle n,n \rangle\}, \{\langle 1,n \rangle, \langle 2,n \rangle, \ldots, \langle n,n \rangle\}, S)$.

This non-linear complexity may be surprising, since the requirement of proximality seems to be met. But observe that in order to determine whether a $P$ node is to be marked or not, we have to consider many nodes of $Q$, not just one, and there seems to be no better way to do that. If the language of positions ($S$) is restricted we can obtain better complexities, e.g. if we allow to express only ranges we can implement it in $O(n \min(h, k \log d))$ where $k$ is the number of ranges we have.

6 Some Applications

Our technique applies to a number of dissimilar applications. In this section we briefly discuss a temporal database and explain in detail a structured text search application. Our aim is not only to expose real situations where the problem arises, but also to show the practical performance of our algorithms.

6.1 Temporal Databases

Temporal databases [7, 2] can manipulate information with temporal validity. Their data is based on events that occur at a given point or interval in time. If we have a set-oriented query language we will be interested in questions such as "give me all the events (and their times) that satisfy some constraint" (an example is given in the introduction).

In fact, temporal databases are much more complex, since they also deal with uncertainty in time (e.g. A happened before B, without knowing when). We just want to show that if a database has to manage a number of quantitative facts (i.e. events that are known to have happened at known times), it may be interesting for its back-end to perform set-oriented operations on these facts. These algorithms would be a part of a more general inference engine, being an efficient way to select relevant facts to work on.

In most temporal databases, the time intervals can overlap and nest. As explained earlier, we cannot handle results (final or intermediate) having segment overlapping. Therefore, this is a restriction we impose on the applicability of our model to temporal databases.

This does not mean that there cannot be overlaps in the database. We can divide the knowledge into a set of sequential processes, and subsets of different processes can be combined into operations, although each answer and intermediate result is a subset of some sequential process. Those sequential processes can be hierarchically structured.

6.2 Structured Text Search

Modern textual databases are a good example of hierarchical structuring. We describe here a specific model to query structured text databases, called Proximal Nodes [13, 12]. The development of this model has motivated this work on efficient algorithms to evaluate queries, although some operations differ slightly from those exposed here.

In this model, a textual database is seen as a text (a sequence of symbols) plus a number of independent hierarchical structures built on the text. Each hierarchy is strict, although there may be overlaps between different hierarchies. The nodes of the hierarchies are the structural components of the database (e.g. chapters, pages, etc.). Examples of hierarchical views of the text are: chapter/section/paragraph and fascicle/page/line. In this case, it makes full sense for answers to be a subset of a given hierarchy.

The leaves of the query syntax trees are names of structural components and text pattern-matching expressions. Each structural component (e.g. chapters, figures, pages) is preindexed, so the set of all elements of a given type can be retrieved in a time proportional to the size of the retrieved set. Pattern-matching uses another index to return a list of segments of the text that matched the pattern. These lists are considered to be part of a special hierarchy. This is how the sets are built in the first place. All internal nodes of query syntax trees correspond to operations between segments of the operands, and they are further restricted to operate only on proximal segments.

It is shown that the query language obtained under these restrictions has good expressivity, while the implementation (using the algorithms we expose here) is very efficient. The model constitutes a good tradeoff between the models which achieve high efficiency by strongly restricting the structuring possibilities and the query language of the database and those which have rich structuring and querying facilities but a high execution overhead.

Some examples of queries are:

- Give me all references to Knuth's books in chapters 2-4.
- I want text paragraphs in italics which are before (but in the same page) a figure that says something about the earth.
- I want a paragraph preceding another paragraph where the word "Computer" appears before (at 10 symbols or less) the word "science". Both paragraphs must be in the same page.

- I want all sections with mathematical formulas that are not appendices.

We also implemented a lazy version of the algorithms, that only processes the nodes that are needed for the final result (thus saving a lot of work when processing the intermediate results). This mechanism can also be used to implement a navigational interface, in which the user sees the top-level of the result tree and asks to expand only some nodes. In this case, we save the work of computing nodes that are not to be seen. Although the worst-case complexity of the lazy operations is worse than that of "full" operations, the real times are better, since fewer nodes are processed in practice.

A prototype has been implemented for this model, to test the average performance of the algorithms. We used a database of C programs and LaTeX documents, whose structuring is quite different. We tested each operation on different operands of sizes (i.e. number of nodes) ranging from 100 to 10000. We also tested a number of more complex queries, to compare the full and lazy versions on real queries.

Figure 6 shows a typical example of the times of an algorithm (i.e. a single operation). They correspond to a Sun SparcClassic of approximately SpecMark 26 and 16 MB of RAM.

[Figure 6: plot of time in seconds (logarithmic axis, 0.01 to 1.0) against operand size (100 to 10000), for the full and lazy versions.]

Figure 6: Typical times for equal-sized operands. Observe that we use a logarithmic scale.

From the tests we extract the following conclusions (they may differ for other applications):

- The full versions of the algorithms are all linear in practice, since the situations under which they are not are very unlikely to occur in structured text databases (e.g. a very deep tree).
- The full versions have very low variance, their times being highly predictable and proportional to the sum of the sizes of all intermediate results. The constant for our machine is approximately 50,000 nodes processed per second per operator.

- The lazy version is normally better than the full one in practice, despite the worse complexities, especially for complex queries. This is because fewer nodes are expanded. Lazy algorithms have very large variance, though.

7 Conclusions and Future Work

We analyzed the problem of manipulating a set of segments when we are interested in set-oriented operations. Classical solutions provide simple operations with $O(\log n)$ complexity, which leads to $O(n \log n)$ solutions for set-oriented operations.

However, we show that, under some general assumptions, set-oriented operations can be performed in linear time. These assumptions are: the operands and the result must not have overlapping segments inside, and the operations must work on proximal segments.

We developed a framework oriented to tree traversal that generalizes the idea of list traversal for merge-like operations, and applied the framework to solve in linear time a number of set-oriented example operations on segments. The aim was to show that the framework is flexible enough to be adapted to a number of apparently dissimilar problems. This technique works well for set-oriented operations because of the amortized cost, i.e. its performance for single operations is not good.

Finally, we presented some applications to which this technique could be applied. We also include average times of a real implementation of the algorithms.

There are a number of future work directions related to this work. The most important are:

- Find a technique that keeps these good results while improving the performance for single operations.
- Extend these ideas to allow overlaps into a set, since this will open a wealth of new possibilities to apply this technique.
- Extend the framework to account for more dimensions, e.g. manipulating hypercubes instead of just one-dimensional segments. This may open a number of opportunities of applications in computational geometry and in map processing (e.g. in geographic information systems).
- Study a disk-based implementation, trying to minimize seek times. Some work on this direction has already been done.
- Search for more operations that can be implemented in linear time by using this framework, especially by looking at applications that can benefit from this work.
- Study a parallel implementation, since our algorithms seem to be highly parallelizable.

Acknowledgments

We thank Claudia Medeiros and Nina Edelweiss for their help on temporal databases. We also thank the referees for their helpful comments.

References

[1] A. Aho, R. Sethi, and J. Ullman. Compilers: Principles, Techniques and Tools. Addison-Wesley, 1986.

[2] J. Allen, J. Hendler, and A. Tate, editors. Readings in Planning. Morgan Kaufmann, 1990.

[3] J. Bentley. Algorithms for Klee's rectangle problems. Dept. of Computer Science, Carnegie-Mellon Univ. Unpublished notes, 1977.

[4] C. Clarke, G. Cormack, and F. Burkowski. Schema-independent retrieval from heterogeneous structured text. In Procs. of the 4th Annual Symposium on Document Analysis and Information Retrieval, Apr. 1995.

[5] T. Cormen, C. Leiserson, and R. Rivest. Introduction to Algorithms. The MIT Press, 1990.

[6] H. Edelsbrunner. Dynamic data structures for orthogonal intersection. Technical Report F59, Tech. Univ. Graz, Institute für Informationsverarbeitung, 1980.

[7] A. T. et al. Temporal Databases: Theory, Design and Implementation. Benjamin Cummings, 1993.

[8] P. Kanellakis and D. Goldin. Constraint programming and database query languages. Technical Report CS-94-31, Dept. of Computer Science, Brown University, June 1994.

[9] P. Kanellakis, S. Ramaswamy, D. Vengroff, and J. Vitter. Indexing for data models with constraints and classes. Technical Report CS-93-21, Dept. of Computer Science, Brown University, May 1993.

[10] A. Loeffen. Text databases: A survey of text models and systems. ACM SIGMOD Conference. ACM SIGMOD Record, 23(1):97-106, Mar. 1994.

[11] E. McCreight. Priority search trees. Technical Report CSL-81-5, Xerox PARC, 1981.

[12] G. Navarro. A language for queries on structure and contents of textual databases. Master's thesis, Dept. of Computer Science, Univ. of Chile, Apr. 1995. ftp://sunsite.dcc.uchile.cl/pub/users/gnavarro/thesis95.ps.gz.

[13] G. Navarro and R. Baeza-Yates. A language for queries on structure and contents of textual databases. In Proc. ACM SIGIR'95, pages 93-101, 1995. ftp://sunsite.dcc.uchile.cl/pub/users/gnavarro/sigir95.ps.gz.

[14] F. Preparata and M. Shamos. Computational Geometry. Springer-Verlag, 2nd edition, 1988.