
A Bit-parallel Approach to Suffix Automata: Fast Extended String Matching

Gonzalo Navarro (1,3) and Mathieu Raffinot (2)

(1) Dept. of Computer Science, University of Chile. Blanco Encalada 2120, Santiago, Chile. gnavarro@dcc.uchile.cl.
(2) Institut Gaspard Monge, Cité Descartes, Champs-sur-Marne, 77454 Marne-la-Vallée Cedex 2, France. raffinot@monge.univ-mlv.fr.
(3) Partially supported by Chilean Fondecyt grant 1-950622.

Abstract. We present a new algorithm for string matching. The algorithm, called BNDM, is the bit-parallel simulation of a known (but recent) algorithm called BDM. BDM skips characters using a "suffix automaton" which is made deterministic in the preprocessing. BNDM, instead, simulates the nondeterministic version using bit-parallelism. This algorithm is 20%-25% faster than BDM, 2-3 times faster than other bit-parallel algorithms, and 10%-40% faster than all the Boyer-Moore family. This makes it the fastest algorithm in all cases except for very short or very long patterns (e.g. on English text it is the fastest between 5 and 110 characters). Moreover, the algorithm is very simple, which makes it easy to implement other variants of BDM that are extremely complex in their original formulation. We show that, as other bit-parallel algorithms, BNDM can be extended to handle classes of characters in the pattern and in the text, multiple patterns, and to allow errors in the pattern or in the text, combining simplicity, efficiency and flexibility. We also generalize the suffix automaton definition to handle classes of characters. To the best of our knowledge, this extension has not been studied before.

1 Introduction

The string-matching problem is to find all the occurrences of a given pattern $p = p_1 p_2 \ldots p_m$ in a large text $T = t_1 t_2 \ldots t_n$, both sequences of characters from a finite character set $\Sigma$.

Several algorithms exist to solve this problem. One of the most famous, and the first having linear worst-case behavior, is Knuth-Morris-Pratt (KMP) [14]. A second algorithm, as famous as KMP, which allows skipping characters, is Boyer-Moore (BM) [6]. This algorithm leads to several variations, like Horspool [12] and Sunday [20], forming the fastest known string-matching algorithms.

A large part of the research in efficient algorithms for string matching can be regarded as looking for automata which are efficient in some sense. For instance, KMP is simply a deterministic automaton that searches the pattern, whose main merit is that it is $O(m)$ in space and construction time. Many variations of the BM family are supported by an automaton as well.

Another automaton, called "suffix automaton", is used in [9,10,11,15,19], where the idea is to search a substring of the pattern instead of a prefix (as KMP) or a suffix (as BM). Optimal sublinear algorithms on average, like "Backward DAWG Match" (BDM) or Turbo BDM [10,11], have been obtained with this approach, which has also been extended to multipattern matching [9,11,19] (i.e. looking for the occurrences of a set of patterns).

Another related line of research is to take those automata in their nondeterministic form instead of making them deterministic. Usually the nondeterministic versions are very simple and regular and can be simulated using "bit-parallelism" [1]. This technique uses the intrinsic parallelism of the bit manipulations inside computer words to perform many operations in parallel. Competitive algorithms have been obtained for exact string matching [2,22], as well as approximate string matching [22,23,3]. Although these algorithms work well only on relatively short patterns, they are simpler, more flexible, and have very low memory requirements.

In this paper we merge some aspects of the two approaches in order to obtain a fast string matching algorithm, called Backward Nondeterministic DAWG Matching (BNDM), which we extend to handle classes of characters, to search multiple patterns, and to allow errors in the pattern and/or in the text, like Shift-Or [2]. BNDM uses a nondeterministic suffix automaton that is simulated using bit-parallelism. This new algorithm has the advantage of being faster than previous ones which could be extended in such a way (typically 2-3 times faster than Shift-Or), faster than its deterministic-automaton counterpart BDM (20%-25% faster), using little space in comparison with the BDM or Turbo BDM algorithms, and being very simple to implement. It becomes the fastest string matching algorithm, beating all the Boyer-Moore family (Sunday included) by 10% to 40%. Only for very short (up to 2-6 letters) or very long patterns (past 90-150 letters), depending on $|\Sigma|$ and the architecture, other algorithms become faster than BNDM (Sunday and BDM, respectively). Moreover, we define a new suffix automaton which handles classes of characters and we simulate its nondeterministic version using bit-parallelism. This extension has not been considered for the BDM or Turbo BDM algorithms before.

An expanded version of this work can be found in [17].

We introduce some notation now. A word $x \in \Sigma^*$ is a factor (i.e. substring) of $p$ if $p$ can be written $p = uxv$, $u, v \in \Sigma^*$. We denote $Fact(p)$ the set of factors of $p$. A factor $x$ of $p$ is called a suffix of $p$ if $p = ux$. The set of suffixes of $p$ is called $Suff(p)$.

We denote as $b_\ell \ldots b_1$ the bits of a mask of length $\ell$. We use exponentiation to denote bit repetition (e.g. $0^3 1 = 0001$). We use C-like syntax for operations on the bits of computer words: "|" is the bitwise-or, "&" is the bitwise-and, "^" is the bitwise-xor and "~" complements all the bits. The shift-left operation, "<<", moves the bits to the left and enters zeros from the right, i.e. $b_m b_{m-1} \ldots b_2 b_1 << r = b_{m-r} \ldots b_2 b_1 0^r$. We can also interpret bit masks as integers to perform arithmetic operations on them.
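The mask notation above can be tried out directly. The following is a minimal sketch in Python, assuming masks are stored in ordinary integers and truncated to a fixed length by hand; the helper names `mask_str` and `shift_left` are illustrative and do not come from the paper.

```python
def mask_str(x: int, length: int) -> str:
    """Render the low `length` bits of x as b_length ... b_1 (highest bit first)."""
    return format(x & ((1 << length) - 1), f"0{length}b")


def shift_left(x: int, r: int, length: int) -> int:
    """b_m ... b_1 << r = b_{m-r} ... b_1 0^r, keeping only `length` bits."""
    return (x << r) & ((1 << length) - 1)


m = 7
b = 0b1100110                                  # some mask b_7 ... b_1

print(mask_str(0b1, 4))                        # bit repetition: 0^3 1
print(mask_str(shift_left(b, 2, m), m))        # b_5 ... b_1 0^2
print(mask_str(b | 0b0011001, m))              # bitwise-or
print(mask_str(b & 0b1000100, m))              # bitwise-and
print(mask_str(b ^ 0b1111111, m))              # bitwise-xor
print(mask_str(~b, m))                         # complement, truncated to m bits
```

Python integers are unbounded, so the truncation to `length` bits plays the role that the fixed word size $w$ plays on a real machine.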

2 Searching with Suffix Automata

We describe in this section the BDM pattern matching algorithm [10,11]. This algorithm is based on a suffix automaton. We first describe such an automaton and then explain how it is used in the search algorithm.

2.1 Suffix Automata

A suffix automaton on a pattern $p = p_1 p_2 \ldots p_m$ (frequently called DAWG($p$), for Deterministic Acyclic Word Graph) is the minimal (incomplete) deterministic finite automaton that recognizes all the suffixes of this pattern. By "incomplete" we mean that some transitions are not present.

The nondeterministic version of this automaton has a very regular structure and is shown in Figure 1. We show now how the corresponding deterministic automaton is built.

[Figure 1: initial state I and states 0, 1, ..., 7.]
Fig. 1. A nondeterministic suffix automaton for the pattern $p$ = baabbaa. Dashed lines represent epsilon transitions (i.e. they occur without consuming any input). I is the initial state of the automaton.

Given a factor $x$ of the pattern $p$, $endpos(x)$ is the set of all the pattern positions where an occurrence of $x$ ends (there is at least one, since $x$ is a factor of the pattern, and there are as many as repetitions of $x$ inside $p$). Formally, given $x \in Fact(p)$, we define $endpos(x) = \{ i \,/\, \exists u,\ p_1 p_2 \ldots p_i = ux \}$. We call each such integer a position. For example, $endpos(baa) = \{3, 7\}$ in the word baabbaa. Notice that $endpos(\varepsilon)$ is the complete set of possible positions (recall that $\varepsilon$ is the empty string). Notice that for any $u, v$, $endpos(u)$ and $endpos(v)$ are either disjoint or one contained in the other.

We define an equivalence relation $\equiv$ between factors of the pattern. For $u, v \in Fact(p)$, we define $u \equiv v$ if and only if $endpos(u) = endpos(v)$ (notice that one of the factors must be a suffix of the other for this equivalence to hold, although the converse is not true). For instance, in our example pattern $p$ = baabbaa, we have that $baa \equiv aa$ because in all the places where aa ends in the pattern, baa ends also (and vice-versa).

The nodes of the DAWG correspond to the equivalence classes of $\equiv$, i.e. to sets of positions. A state, therefore, can be thought of as a factor of the pattern already recognized, except that we do not distinguish between some factors. Another way to see it is that the set of positions is in fact the set of active states in the nondeterministic automaton.

There is an edge labeled $\sigma$ from the set of positions $\{i_1, i_2, \ldots, i_k\}$ to $\gamma_p(i_1 + 1, \sigma) \cup \gamma_p(i_2 + 1, \sigma) \cup \ldots \cup \gamma_p(i_k + 1, \sigma)$, where

$$\gamma_p(i, \sigma) = \begin{cases} \{i\} & \text{if } i \le m \text{ and } p_i = \sigma \\ \emptyset & \text{otherwise} \end{cases}$$

which is the same as saying that we try to extend the factor that we recognized with the next text character, and keep the positions that still match. If we are left with no matching positions, we do not build the transition. The initial state corresponds to the set $\{0..m\}$. Finally, a state is terminal if its corresponding subset of positions contains the last position $m$ (i.e. we matched a suffix of the pattern). As an example, the deterministic suffix automaton of the word baabbaa is given in Figure 2.

[Figure 2: nodes labeled with position sets, among them {0,1,2,3,4,5,6,7}, {2,3,6,7}, {1,4,5}, {2,6}, {3,7}, 4, 5, 6, 7.]
Fig. 2. Deterministic suffix automaton of the word baabbaa. The largest node is the initial state.

The (deterministic) suffix automaton is a well known structure [8,5,11,18], and we do not prove any of its properties here (nor the correctness of the previous construction). The size of DAWG($p$) is linear in $m$ (counting both nodes and edges), and it can be built in linear time [8]. A very important fact for our algorithm is that this automaton can be used not only to recognize the suffixes of $p$, but also the factors of $p$. By the suffix automaton definition, there is a path labeled by $x$ from the initial node of DAWG($p$) if and only if $x$ is a factor of $p$.

2.2 Search Algorithm

The suffix automaton structure is used in [10,11] to design a simple pattern matching algorithm called BDM. This algorithm is $O(mn)$ time in the worst case, but optimal on average ($O(n \log m / m)$ time) (footnote 4). Other more complex variations such as Turbo BDM [10] and MultiBDM [11,19] achieve linear time in the worst case.

Footnote 4: The lower bound of $O(n \log m / m)$ on average for any pattern matching algorithm under a Bernoulli model is from A. C. Yao in [24].
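The position-set construction of Section 2.1 can be sketched in a few lines of Python. This is an illustration only, not the linear-time construction cited as [8]: it builds the states by plain breadth-first exploration of position sets, and the names `endpos`, `build_dawg` and `read_factor` are assumptions made for this sketch.

```python
from collections import deque


def endpos(p: str, x: str) -> set:
    """Positions i (1-based) where an occurrence of x ends in p; endpos('') = {0..m}."""
    return {i for i in range(len(x), len(p) + 1) if p[i - len(x):i] == x}


def build_dawg(p: str):
    """Deterministic suffix automaton whose states are sets of end positions."""
    m = len(p)
    initial = frozenset(range(m + 1))          # the set {0..m}
    trans = {}                                 # (state, char) -> state
    seen, queue = {initial}, deque([initial])
    while queue:
        state = queue.popleft()
        for c in sorted(set(p)):
            # gamma_p(i + 1, c): keep i + 1 when p_{i+1} = c (p[i] in 0-based Python)
            nxt = frozenset(i + 1 for i in state if i < m and p[i] == c)
            if nxt:                            # no transition is built for the empty set
                trans[state, c] = nxt
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
    terminal = {s for s in seen if m in s}     # states containing the last position m
    return initial, trans, terminal


def read_factor(dawg, x: str):
    """Follow x from the initial state; returns the reached state or None."""
    initial, trans, _ = dawg
    state = initial
    for c in x:
        state = trans.get((state, c))
        if state is None:
            return None
    return state


dawg = build_dawg("baabbaa")
print(sorted(endpos("baabbaa", "baa")))        # positions where baa ends
print(read_factor(dawg, "aab"))                # a factor: some state is reached
print(read_factor(dawg, "aba"))                # not a factor: no path
```

Under these assumptions, the state reached after reading a factor $x$ is exactly $endpos(x)$, so equivalent factors such as baa and aa end in the same state.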

To search a pattern $p = p_1 p_2 \ldots p_m$ in a text $T = t_1 t_2 \ldots t_n$, the suffix automaton of $p^r = p_m p_{m-1} \ldots p_1$ (i.e. the pattern read backwards) is built. A window of length $m$ is slid along the text, from left to right. The algorithm searches backwards inside the window for a factor of the pattern $p$ using the suffix automaton. During this search, if a terminal state is reached which does not correspond to the entire pattern $p$, the window position is remembered (in a variable last). This corresponds to finding a prefix of the pattern starting at position last inside the window and ending at the end of the window (since the suffixes of $p^r$ are the reverse prefixes of $p$). The last recognized prefix is the longest one. The backward search ends for one of two possible reasons:

1. We fail to recognize a factor, i.e. we reach a letter $\sigma$ that does not correspond to a transition in DAWG($p^r$). Figure 3 illustrates this case. We then shift the window to the right by last characters (we cannot miss an occurrence because in that case the suffix automaton would have found its prefix in the window).

2. We reach the beginning of the window, therefore recognizing the pattern $p$. We report the occurrence, and we shift the window exactly as in the previous case (notice that we have the previous last value).

[Figure 3: a window over the text. The search for a factor with the DAWG runs backwards; the window position is recorded in last when a terminal state is reached; the search fails to recognize a factor at $\sigma$, so the pattern cannot start before $\sigma$. The maximum prefix starts at last, which gives the safe shift to the new window.]
Fig. 3. Basic search with the suffix automaton.

Search example: we search the pattern aabbaab in the text T = abbabaabbaab. We first build DAWG($p^r$ = baabbaa), which is given in Figure 2. We note the current window between square brackets. We begin with T = [abbabaa]bbaab, m = 7, last = 7.

1. T = [abbabaa]bbaab. a is a factor of $p^r$ and a reverse prefix of $p$. last = 6.
2. T = [abbabaa]bbaab. aa is a factor of $p^r$ and a reverse prefix of $p$. last = 5.
3. T = [abbabaa]bbaab. aab is a factor of $p^r$. We fail to recognize the next a. So we shift the window to last. We search again in the position T = abbab[aabbaab], last = 7.
4. T = abbab[aabbaab]. b is a factor of $p^r$.
5. T = abbab[aabbaab]. ba is a factor of $p^r$.
6. T = abbab[aabbaab]. baa is a factor of $p^r$, and a reverse prefix of $p$. last = 4.
7. T = abbab[aabbaab]. baab is a factor of $p^r$.
8. T = abbab[aabbaab]. baabb is a factor of $p^r$.
9. T = abbab[aabbaab]. baabba is a factor of $p^r$.
10. T = abbab[aabbaab]. We recognize the word aabbaab and report an occurrence.

3 Bit-Parallelism

In [2], a new approach to text searching was proposed. It is based on bit-parallelism [1], which consists in taking advantage of the intrinsic parallelism of the bit operations inside a computer word to cut down the number of operations by a factor of at most $w$, where $w$ is the number of bits in the computer word.

The Shift-Or algorithm uses bit-parallelism to simulate the operation of a nondeterministic automaton that searches the pattern in the text (see Figure 4). As this automaton is simulated in time $O(mn)$, the Shift-Or algorithm achieves $O(mn/w)$ worst-case time (optimal speedup). If we convert the automaton to a deterministic one we get a version of KMP [14], which is $O(n)$ search time, although twice as slow in practice for $m \le w$.

[Figure 4: states 0, 1, ..., 7 linked by arrows labeled b, a, a, b, b, a, a.]
Fig. 4. A nondeterministic automaton to search the pattern $p$ = baabbaa in a text. The initial state is 0.

We explain now a variant of the Shift-Or algorithm (called Shift-And). The algorithm first builds a table $B$ which for each character stores a bit mask $b_m \ldots b_1$. The mask in $B[c]$ has the $i$-th bit set if and only if $p_i = c$. The state of the search is kept in a machine word $D = d_m \ldots d_1$, where $d_i$ is set whenever the state numbered $i$ in Figure 4 is active. Therefore, we report a match whenever $d_m$ is set.

We set $D = 0$ originally, and for each new text character $T_j$, we update $D$ using the formula

$$D' \leftarrow ((D << 1) \;|\; 0^{m-1}1) \;\&\; B[T_j]$$

which mimics what occurs inside the nondeterministic automaton for each new text character: each state gets the value of the previous one, provided the text character matches the corresponding arrow. The "$|\; 0^{m-1}1$" corresponds to the initial self-loop. For patterns longer than the computer word (i.e. $m > w$), the algorithm uses $\lceil m/w \rceil$ computer words for the simulation (not all of them are active all the time).

This algorithm is very simple and can be extended to handle classes of characters (i.e. each pattern position matches a set of characters), and to allow mismatches. This paradigm was later enhanced to support wildcards, regular expressions, approximate search, etc., yielding the fastest algorithms for those problems [22,3]. Bit-parallelism became a general way to simulate simple nondeterministic automata instead of converting them to deterministic ones. This is how we use it in our algorithm.

4 Bit-Parallelism on Suffix Automata

We simulate the BDM algorithm using bit-parallelism. The result is an algorithm which is simpler, uses less memory, has more locality of reference, and is easily extended to handle more complex patterns. We first assume that $m \le w$ and show later how to extend the algorithm for longer patterns.

4.1 The Basic Algorithm

We simulate the reverse version of the automaton of Figure 1. Just as for Shift-And, we keep the state of the search using $m$ bits of a computer word $D = d_m \ldots d_1$.

The BDM algorithm moves a window over the text. Each time the window is positioned at a new text position just after pos, it searches backwards the window $T_{pos+1} .. T_{pos+m}$ using the DAWG automaton, until either $m$ iterations are performed (which implies a match in the current window) or the automaton cannot perform any transition.

In our case, the bit $d_i$ at iteration $k$ is set if and only if $p_{m-i+1} .. p_{m-i+k} = T_{pos+1+m-k} .. T_{pos+m}$. Since we begin at iteration 0, the initial value for $D$ is $1^m$. There is a match if and only if after iteration $m$ it holds $d_m = 1$. Whenever $d_m = 1$, we have matched a prefix of the pattern in the current window. The longest prefix matched corresponds to the next window position.

The algorithm is as follows. Each time we position the window in the text we initialize $D$ and scan the window backwards. For each new text character we update $D$. Each time we find a prefix of the pattern ($d_m = 1$) we remember the position in the window. If we run out of 1's in $D$ then there cannot be a match and we suspend the scanning (this corresponds to not having any transition to follow in the automaton). If we can perform $m$ iterations then we report a match.

We use a mask $B$ which for each character $c$ stores a bit mask. This mask sets the bits corresponding to the positions where the pattern has the character $c$ (just as in Shift-And). The formula to update $D$ follows

$$D' \leftarrow (D \;\&\; B[T_j]) << 1$$

The algorithm is summarized in Figure 5. Some optimizations done on the real code, related to improved flow of control and bit manipulation tricks, are not shown for clarity.

BNDM (p = p_1 p_2 ... p_m, T = t_1 t_2 ... t_n)
 1.  Preprocessing
 2.     For c ∈ Σ do B[c] ← 0^m
 3.     For i ∈ 1..m do B[p_{m-i+1}] ← B[p_{m-i+1}] | 0^{m-i} 1 0^{i-1}
 4.  Search
 5.     pos ← 0
 6.     While pos <= n - m do
 7.        j ← m, last ← m
 8.        D = 1^m
 9.        While D != 0^m do
10.           D ← D & B[T_{pos+j}]
11.           j ← j - 1
12.           if D & 1 0^{m-1} != 0^m then
13.              if j > 0 then last ← j
14.              else report an occurrence at pos + 1
15.           D ← D << 1
16.        End of while
17.        pos ← pos + last
18.     End of while

Fig. 5. Bit-parallel code for BDM. Some optimizations are not shown for clarity.

Search example: we search the pattern aabbaab in the text T = abbabaabbaab. Immediately after each step number (1 to 11) we show the text and note the current window between square brackets. We begin with T = [abbabaa]bbaab, D = 1111111, B[a] = 1100110, B[b] = 0011001, m = 7, last = 7, j = 7.

 1. [abbabaa]bbaab.   1111111 & 1100110,  D = 1100110,  j = 6, last = 6
 2. [abbabaa]bbaab.   1001100 & 1100110,  D = 1000100,  j = 5, last = 5
 3. [abbabaa]bbaab.   0001000 & 0011001,  D = 0001000,  j = 4, last = 5
 4. [abbabaa]bbaab.   0010000 & 1100110,  D = 0000000,  j = 3, last = 5
    We fail to recognize the next a. So we shift the window to last. We search again in the position abbab[aabbaab], last = 7, j = 7.
 5. abbab[aabbaab].   1111111 & 0011001,  D = 0011001,  j = 6, last = 7
 6. abbab[aabbaab].   0110010 & 1100110,  D = 0100010,  j = 5, last = 7
 7. abbab[aabbaab].   1000100 & 1100110,  D = 1000100,  j = 4, last = 4
 8. abbab[aabbaab].   0001000 & 0011001,  D = 0001000,  j = 3, last = 4
 9. abbab[aabbaab].   0010000 & 0011001,  D = 0010000,  j = 2, last = 4
10. abbab[aabbaab].   0100000 & 1100110,  D = 0100000,  j = 1, last = 4
11. abbab[aabbaab].   1000000 & 1100110,  D = 1000000,  j = 0, last = 4
    Report an occurrence at 6.

4.2 Handling Longer Patterns

We can cope with longer patterns by setting up an array of words $D_t$ and simulating the work on a long computer word. We propose a different alternative which was experimentally found to be faster.

If $m > w$, we partition the pattern in $M = \lceil m/w \rceil$ subpatterns $s_i$, such that $p = s_1 s_2 \ldots s_M$ and $s_i$ is of length $m_i = w$ if $i < M$ and $m_M = m - w(M - 1)$. Those subpatterns are searched with the basic algorithm.

We now search $s_1$ in the text with the basic algorithm. If $s_1$ is found at a text position $j$, we verify whether $s_2$ follows it. That is, we position a window at $T_{j+m_1} .. T_{j+m_1+m_2-1}$ and use the basic algorithm for $s_2$ in that window. If $s_2$ is in the window, we continue similarly with $s_3$ and so on. This process ends either because we find the complete pattern and report it, or because we fail to find a subpattern $s_i$.

We have to move the window now. An easy alternative is to use the shift $last_1$ that corresponds to the search of $s_1$. However, if we tested the subpatterns $s_1$ to $s_i$, each one has a possible shift $last_i$, and we use the maximum of all shifts.
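The Shift-And update of Section 3 and the basic BNDM loop of Figure 5 translate almost line by line into code. The sketch below is one possible Python rendering for the case $m \le w$; it uses Python's unbounded integers in place of a machine word, so the word-size limit and the subpattern scheme of Section 4.2 are not modeled. The function names and the use of dictionaries for the table $B$ are choices made for this sketch.

```python
def shift_and(p: str, t: str) -> list:
    """Shift-And: bit i-1 of D is set when p_1..p_i ends at the current text position."""
    m = len(p)
    B = {}
    for i, c in enumerate(p):                  # bit i (0-based) stands for p_{i+1}
        B[c] = B.get(c, 0) | (1 << i)
    D, out = 0, []
    for j, c in enumerate(t):
        D = ((D << 1) | 1) & B.get(c, 0)       # D' <- ((D << 1) | 0^{m-1}1) & B[T_j]
        if D & (1 << (m - 1)):                 # d_m is set: an occurrence ends here
            out.append(j - m + 2)              # 1-based starting position
    return out


def bndm(p: str, t: str) -> list:
    """Basic BNDM following Figure 5; returns 1-based starting positions."""
    m, n = len(p), len(t)
    if m == 0 or m > n:
        return []
    full = (1 << m) - 1                        # 1^m
    high = 1 << (m - 1)                        # 1 0^{m-1}
    B = {}
    for i in range(1, m + 1):                  # line 3: set bit i of B[p_{m-i+1}]
        c = p[m - i]
        B[c] = B.get(c, 0) | (1 << (i - 1))
    out, pos = [], 0
    while pos <= n - m:                        # line 6
        j, last = m, m
        D = full
        while D != 0:                          # line 9
            D &= B.get(t[pos + j - 1], 0)      # T_{pos+j} in the paper's 1-based indexing
            j -= 1
            if D & high:                       # a prefix of p ends at the window end
                if j > 0:
                    last = j
                else:
                    out.append(pos + 1)        # the whole window matched
            D = (D << 1) & full                # line 15, truncated to m bits
        pos += last                            # line 17
    return out


print(bndm("aabbaab", "abbabaabbaab"))
print(shift_and("aabbaab", "abbabaabbaab"))
```

Both calls run on the worked example of Section 4.1, where the single occurrence starts at text position 6. The only departure from the pseudocode is the explicit `& full` after the shift, which is needed because Python integers do not overflow.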

4.3 Analysis

The preprocessing time for our algorithm is $O(m + |\Sigma|)$ if $m \le w$, and $O(m(1 + |\Sigma|/w))$ otherwise.

In the simple case $m \le w$, the analysis is the same as for the BDM algorithm. That is, $O(mn)$ in the worst case (e.g. $T = a^n$, $p = a^m$), $O(n/m)$ in the best case (e.g. $T = a^n$, $p = a^{m-1}b$), and $O(n \log_{|\Sigma|} m / m)$ on average (which is optimal). Our algorithm, however, benefits from more locality of reference, since we do not access an automaton but only a few variables which can be put in registers (with the exception of the $B$ table). As we show in the experiments, this difference makes our algorithm the fastest one.

When $m > w$, our algorithm is $O(nm^2/w)$ in the worst case (since each of the $O(mn)$ steps of the BDM algorithm forces to work on $\lceil m/w \rceil$ computer words). The best case occurs when the text traversal using $s_1$ always performs its maximum shift after looking at one character, which is $O(n/w)$. We show, finally, that the average case is $O(n \log_{|\Sigma|} w / w)$. Clearly these complexities are worse than those of the simple BDM algorithm for long enough patterns. We show in the experiments up to which length our version is faster in practice.

The search cost for $s_1$ is $O(n \log_{|\Sigma|} w / w)$. With probability $1/|\Sigma|^w$, we find $s_1$ and check for the rest of the pattern. The check for $s_2$ in the window costs $O(w)$ at most. With probability $1/|\Sigma|^w$ we find $s_2$ and check $s_3$, and so on. The total cost incurred by the existence of $s_2 \ldots s_M$ is at most

$$\sum_{i=1}^{M-1} \frac{w}{|\Sigma|^{wi}} \;\le\; \varepsilon \;=\; \frac{w}{|\Sigma|^{w}} \left(1 + O\!\left(\frac{w}{|\Sigma|^{w}}\right)\right) \;=\; o(1)$$

which therefore does not affect the main cost to search $s_1$ (neither in theory since the extra cost is $o(1)$ nor in practice since $\varepsilon$ is very small). We consider the shifts now. The search of each subpattern $s_i$ provides a shift $last_i$, and we take the maximum shift. Now, the shift $last_i$ participates in this maximum with probability $1/|\Sigma|^{wi}$. The longest possible shift is $w$. Hence, if we sum (instead of taking the maximum) the longest possible shifts $w$ with their probabilities, we get into the same sum above, which is $\varepsilon = o(1)$. Therefore, the average shift is longer than $last_1$ and shorter than $last_1 + \varepsilon = last_1 + o(1)$, and hence the cost is that of searching $s_1$ plus lower order terms.

5 Further Improvements

5.1 A Linear Algorithm

Although our algorithm has optimal average case, it is not linear in the worst case even for $m \le w$, since we can traverse the complete window backwards and advance it in one character. Our aim now is to reduce its worst case from $O(nm^2/w)$ to $O(nm/w)$, i.e. $O(n)$ when $m = O(w)$.

Improved variations on BDM already exist, such as Turbo BDM and Turbo RF [10,15], the last one being linear in the worst case and still sublinear on average. The main idea is to avoid retraversing the same characters in the backward window verification using the fact that when we advance the window in last positions, we already know that $T_{i+last} .. T_{i+m-1}$ is a prefix of the pattern (recall Figure 3). The ending position of the prefix in the window is usually called the critical position. The main problem if this area is not retraversed is how to determine the next shift, since among all possible shifts in $T_{i+last} .. T_{i+m-1}$ we remember only the first one.

One strategy adds a kind of BM machine to the BDM algorithm. It works as follows: let $u$ be the pattern prefix before the critical position. If we reach the critical position after reading (backwards) a factor $z$ with the DAWG, it is possible to know whether $z^r$ is a suffix of the pattern $p$: if $z^r$ is a suffix (i.e. $p = u z^r$), we recognize the whole pattern $p$, and the next shift corresponds to the longest border of $p$ (i.e. the longest proper prefix that is also a suffix), which can be computed in advance. If $z^r$ is not a suffix, it appears in the pattern in a set of positions which is given by the state we reached in the suffix automaton. We shift to the rightmost occurrence of $z^r$ in the pattern.

It is not difficult to simulate this idea in our BNDM algorithm. To know if the factor $z$ we read with the DAWG is a suffix, we test whether $d_{|z|} = 1$. To get the rightmost occurrence, we seek the rightmost 1 in $D$, which we can get (if it exists) in constant time with $\log_2(D \,\&\, \sim(D - 1))$ (footnote 5). We implemented this algorithm under the name BM BNDM in the experimental part of this paper.

This algorithm remains quadratic, because we do not keep a prefix of the pattern after the BM shift. We do that now. Recall that $u$ is the prefix before the critical position. The Turbo RF (second variation) [10] uses a complicated preprocessing phase to associate in linear time an occurrence of $z^r$ in the pattern to a border $b_u$ of $u$, in order to obtain the maximal prefix of the pattern that is a suffix of $u z^r$. Moreover, the Turbo RF uses a suffix tree, and it is quite difficult to use this preprocessing phase on DAWGs. With our simulation, this preprocessing phase becomes simple. To each prefix $u_i$ of the pattern $p$, we associate a mask $Bord[i]$ that registers the starting positions of the borders of $u_i$ ($\varepsilon$ included). This table is precomputed in linear time. Now, to join one occurrence of $z^r$ to a border of $u$, we want the positions which start a border of $u$ and continue with an occurrence of $z^r$. The first set of positions is $Bord[i]$, and the second one is precisely the current $D$ value (i.e. positions in the pattern where the recognized factor $z$ ends). Hence, the bits of $X = Bord[i] \,\&\, D$ are the positions satisfying both criteria. As we want the rightmost such occurrence (i.e. the maximal prefix), we take again $\log_2(X \,\&\, \sim(X - 1))$. We implemented this algorithm under the name Turbo BNDM in the experimental part of this paper.

Footnote 5: It is faster and cleaner to implement this $\log_2$ by shifting the mask to the right until it becomes zero. Using this technique we can use the simpler expression $D \wedge (D - 1)$ (bitwise-xor) and get the same result.

5.2 A Constant-Space Algorithm

It is also interesting to notice that, although the algorithm needs $O(|\Sigma| m / w)$ extra space, we can make it constant space on a binary alphabet $\Sigma_2 = \{0, 1\}$.

The trick is that in this case, $B[1] = p$ and $B[0] = \sim B[1]$. Therefore, we need no extra storage apart from the pattern itself to perform all the operations. In theory, any text over a finite alphabet $\Sigma$ could be searched in constant space by representing the symbols of $\Sigma$ with bits and working on the bits (the misaligned matches have to be later discarded). This involves an average search time of

$$\frac{n \log_2 |\Sigma|}{m \log_2 |\Sigma|} \, \log_2(m \log_2 |\Sigma|) \;=\; \text{normal time} \times \log_2 |\Sigma| \left(1 + \frac{\log_2 \log_2 |\Sigma|}{\log_2 m}\right)$$

which, if the alphabet is considered of constant size, is of the same order as the normal search time.

6 Extensions

We analyze now some extensions applicable to our basic scheme, which form a successful combination of efficiency and flexibility.

6.1 Classes of Characters

As in the Shift-Or algorithm, we allow that each position in the pattern matches not only a single character but an arbitrary set of characters. We call "extended patterns" those that are more complex than a simple string to be searched. In this work the only extended patterns we deal with are those allowing a class of characters at each position.

We denote $p = C_1 C_2 \ldots C_m$ such extended patterns. A word $x = x_1 x_2 \ldots x_r$ in $\Sigma^*$ is a factor of an extended pattern $p = C_1 C_2 \ldots C_m$ if there exists an $i$ such that $x_1 \in C_{i-r+1}, x_2 \in C_{i-r+2}, \ldots, x_r \in C_i$. Such an $i$ is called a position of $x$ in $p$. A factor $x = x_1 x_2 \ldots x_r$ of $p = C_1 C_2 \ldots C_m$ is a suffix if $x_1 \in C_{m-r+1}, x_2 \in C_{m-r+2}, \ldots, x_r \in C_m$.

Similarly to the first part of this work, we design an automaton which recognizes all suffixes of an extended pattern $p = C_1 C_2 \ldots C_m$. This automaton is not a DAWG anymore. We call it Extended DAWG. To our knowledge, this kind of automaton has never been studied. We first give a formal construction of the Extended DAWG (proving its correctness) and later present a bit-parallel implementation.

Construction. The construction we use is quite similar to the one we give for the DAWG, except for the new definition of suffixes. For any factor $x$ of $p$, we denote L-endpos($x$) the set of positions of $x$ in $p$. For example, L-endpos(baa) = $\{3, 7\}$ in the extended pattern b[a,b]abbaa, and L-endpos(bba) = $\{3, 6\}$ (notice that, unlike before, the sets of positions can be not disjoint and no one a subset of the other). We define the equivalence relation $\equiv_E$ for $u, v$ factors of $p$ by

$$u \equiv_E v \text{ if and only if L-endpos}(u) = \text{L-endpos}(v).$$
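The definition of L-endpos can be checked on the example above with a direct, brute-force computation. This is a small illustrative sketch: the extended pattern is represented as a list of strings, each string being the class of characters allowed at that position, and the names `l_endpos` and `equivalent` are hypothetical.

```python
def l_endpos(classes, x: str) -> set:
    """Positions i (1-based) at which x can end in the extended pattern C_1 ... C_m."""
    m, r = len(classes), len(x)
    # x_1 in C_{i-r+1}, ..., x_r in C_i; classes is 0-based, hence index i - r + k
    return {i for i in range(r, m + 1)
            if all(x[k] in classes[i - r + k] for k in range(r))}


def equivalent(classes, u: str, v: str) -> bool:
    """u and v are equivalent when their sets of positions coincide."""
    return l_endpos(classes, u) == l_endpos(classes, v)


p = ["b", "ab", "a", "b", "b", "a", "a"]       # the extended pattern b[a,b]abbaa

print(sorted(l_endpos(p, "baa")))              # the text gives {3, 7}
print(sorted(l_endpos(p, "bba")))              # the text gives {3, 6}
print(equivalent(p, "baa", "bba"))             # the two sets overlap but differ
```

The two sets share position 3 without either containing the other, which is the situation that cannot arise for plain patterns.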

Lemma 1. Let $p$ be an extended pattern and $\equiv_E$ the equivalence relation on its factors (as previously defined). The equivalence relation $\equiv_E$ is compatible with the concatenation on words.

This lemma allows us to define an automaton from this equivalence class. States of the automaton are the equivalence classes of $\equiv_E$. There is an edge labeled by $\sigma$ from the set of positions $\{i_1, i_2, \ldots, i_k\}$ to $\gamma_p(i_1 + 1, \sigma) \cup \gamma_p(i_2 + 1, \sigma) \cup \ldots \cup \gamma_p(i_k + 1, \sigma)$, if it is not empty. We define $\gamma_p(i, \sigma)$ with $i \in \{0, 1, \ldots, m, m + 1\}$, $\sigma \in \Sigma$ by

$$\gamma_p(i, \sigma) = \begin{cases} \{i\} & \text{if } i \le m \text{ and } \sigma \in C_i \\ \emptyset & \text{otherwise} \end{cases}$$

The initial node of the automaton is the set that contains all the positions. Terminal nodes of the automaton are the sets of positions that contain $m$. As an example, the suffix automaton of the word [a,b]aa[a,b]baa is given in Figure 6.

[Figure 6: nodes labeled with sets of positions, the initial node being {0,1,2,3,4,5,6,7}.]
Fig. 6. Extended DAWG of the extended pattern [a,b]aa[a,b]baa (positions numbered 0 to 7).

Lemma 2. The Extended DAWG of an extended pattern $p = C_1 C_2 \ldots C_m$ recognizes the set of suffixes of $p$.

We can use this new automaton to recognize the set of suffixes of an extended pattern $p$. We do not give an algorithm to build this Extended DAWG in its deterministic form, but we simulate the deterministic automaton using bit-parallelism.

A bit-parallel implementation: from the above construction, the only modification that our algorithm needs is that the $B$ table has the $i$-th bit set for all characters belonging to the set of the $i$-th position of the pattern. Therefore we simply change line 3 (part of the preprocessing) in the algorithm of Figure 5 to

    For i ∈ 1..m, c ∈ Σ do if c ∈ p_{m-i+1} then B[c] ← B[c] | 0^{m-i} 1 0^{i-1}

such that now the preprocessing takes $O(|\Sigma| m)$ time but the search algorithm does not change.

We combine the flexibility of extended patterns with the efficiency of a Boyer-Moore-like algorithm. It should be clear, however, that the efficiency of the shifts can be degraded if the classes of characters are significantly large and prevent long shifts. However, this is much more resistant than some simple variations of Boyer-Moore since it uses more knowledge about the matched characters.

We point out now another extension related to classes of characters: the text itself may have basic characters as well as other symbols denoting sets of basic characters. This is common, for instance, in DNA databases. We can easily handle such texts. Assume that the symbol $C$ represents the set $\{c_1, \ldots, c_r\}$. Then we set $B[C] = B[c_1] \,|\, \ldots \,|\, B[c_r]$. This is much more difficult to achieve with algorithms not based on bit-parallelism.

6.2 Multiple Patterns

To search a set of patterns $P_1 \ldots P_r$ (i.e. reporting the occurrences of all of them) of length $m$ in parallel, we can use an arrangement proposed in [22], which concatenates the patterns as follows: $P = P^1_1 P^2_1 \ldots P^r_1 \; P^1_2 P^2_2 \ldots P^r_2 \; \ldots \; P^1_m P^2_m \ldots P^r_m$ (i.e. all the first letters, then all the second letters, etc.) and searches $P$ just as a single pattern. The only difference in the algorithm of Figure 5 is that the shift is not in one bit but in $r$ bits in line 15 (since we have $r$ bits per multipattern position) and that instead of looking for the highest bit $d_m$ of the computer word we consider all the $r$ bits corresponding to the highest position. That is, we replace the old $1 0^{m-1}$ test mask by $1^r 0^{r(m-1)}$ in line 12.

This will automatically search for $r$ words of length $m$ and keep all the bits needed for each word. Moreover, it will report the matches of any of the patterns and will not allow shifting more than what all patterns allow to shift.

An alternative arrangement is: $P = P_1 P_2 \ldots P_r$ (i.e. just concatenate the patterns). In this case the shift in line 15 is for one bit, and the mask for line 12 is $(1 0^{m-1})^r$. On some processors a shift in one position is faster than a shift in $r > 1$ positions, which could be an advantage for this arrangement. On the other hand, in this case we must clear the bits that are carried from the highest position of a pattern to the next one, replacing line 15 by $D \leftarrow (D << 1) \,\&\, (1^{m-1} 0)^r$. This involves an extra operation. Finally, this arrangement allows patterns of different lengths for the algorithm of Wu and Manber [22], which is not possible in their current proposal.

Clearly these techniques cannot be applied to the case $m > \lfloor w/2 \rfloor$. However, if $m \le \lfloor w/2 \rfloor$ and $rm > w$ we divide the set of patterns into $\lceil r / \lfloor w/m \rfloor \rceil$ groups, so that the patterns in each group fit in $w$ bits. Since this skips characters, it is better on average than [22]. As we show in the experiments, this is also better than sequentially searching each pattern in turn, even given that the shifts are the most conservative among all the $r$ patterns.
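The alternative arrangement of Section 6.2 (plain concatenation $P = P_1 P_2 \ldots P_r$) is the easier one to prototype. The following Python sketch is an illustration under stated assumptions, not the authors' code: all patterns have the same length $m$, Python integers stand in for a word of $rm$ bits, pattern number $k$ (0-based) occupies bits $km$ to $km + m - 1$, and the function name `multi_bndm` is made up for this example.

```python
def multi_bndm(patterns, t: str) -> list:
    """Search r equal-length patterns at once; returns (1-based position, pattern index)."""
    r, m, n = len(patterns), len(patterns[0]), len(t)
    assert all(len(q) == m for q in patterns), "sketch assumes equal lengths"
    B, high, clear = {}, 0, 0
    for k, q in enumerate(patterns):
        base = k * m
        for i in range(1, m + 1):              # bit i of block k stands for q_{m-i+1}
            c = q[m - i]
            B[c] = B.get(c, 0) | (1 << (base + i - 1))
        high |= 1 << (base + m - 1)            # builds (1 0^{m-1})^r
        clear |= ((1 << m) - 2) << base        # builds (1^{m-1} 0)^r
    full = (1 << (r * m)) - 1
    out, pos = [], 0
    while pos <= n - m:
        j, last = m, m
        D = full
        while D != 0:
            D &= B.get(t[pos + j - 1], 0)
            j -= 1
            hit = D & high                     # test mask of line 12
            if hit:
                if j > 0:
                    last = j                   # longest prefix of any pattern so far
                else:
                    for k in range(r):         # report every pattern whose top bit is set
                        if hit >> (k * m + m - 1) & 1:
                            out.append((pos + 1, k))
            D = (D << 1) & clear               # line 15 with the carried bits cleared
        pos += last
    return out


print(multi_bndm(["aab", "abb", "bab"], "abbabaabbaab"))
```

Because `last` is updated whenever any pattern has a prefix ending at the window end, the shift is the most conservative one among the $r$ patterns, as the text describes.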

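The rule $B[C] = B[c_1] \mid \ldots \mid B[c_r]$ for text symbols that denote sets of basic characters can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the paper's code: the function name `bndm_with_text_classes` is hypothetical, the class symbols `N` and `R` follow the common nucleotide ambiguity codes and are chosen only for the example, and class symbols are assumed to be distinct from the basic characters of the pattern.

```python
def bndm_with_text_classes(pattern, text, classes):
    """BNDM search where the text may contain symbols standing for sets.

    classes maps a text symbol C to the basic characters c_1..c_r it denotes.
    Returns the starting positions of all occurrences.
    """
    m = len(pattern)
    if m == 0:
        raise ValueError("empty pattern")
    full = (1 << m) - 1
    top = 1 << (m - 1)                     # mask 1 0^(m-1)

    # Usual table: bit m-1-i of B[c] is set when pattern[i] == c.
    B = {}
    for i, c in enumerate(pattern):
        B[c] = B.get(c, 0) | (1 << (m - 1 - i))

    # Extension for text classes: B[C] = B[c_1] | ... | B[c_r].
    for symbol, members in classes.items():
        mask = 0
        for c in members:
            mask |= B.get(c, 0)
        B[symbol] = mask

    # The search loop itself is unchanged.
    out, pos, n = [], 0, len(text)
    while pos <= n - m:
        j, last, D = m, m, full
        while D:
            D &= B.get(text[pos + j - 1], 0)
            j -= 1
            if D & top:
                if j > 0:
                    last = j
                else:
                    out.append(pos)
            D = (D << 1) & full
        pos += last
    return out


# 'N' stands for any nucleotide and 'R' for A or G (illustrative choice).
print(bndm_with_text_classes("ACGT", "AANGTACRT", {"N": "ACGT", "R": "AG"}))
```

Only the preprocessing grows, by one OR per member of each class; the scanning code is the same as for plain text.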
6.3 Approximate String Matching

Approximate string matching is the problem of finding all the occurrences of a pattern in a text allowing at most $k$ "errors". The errors are insertions, deletions and replacements to perform in the pattern so that it matches the text. In [22], an efficient filter is proposed to determine that large text areas cannot contain an occurrence. It is based on dividing the pattern in $k+1$ pieces and searching all the pieces in parallel. Since $k$ errors cannot destroy the $k+1$ pieces, some of the pieces must appear with no errors close to each occurrence. They use the multipattern search algorithm mentioned in the previous paragraph. In [4, 3], a multipattern Boyer-Moore strategy is preferred, which is faster but does not handle classes of characters and other extensions. This algorithm is the fastest one for low error levels.

Our multipattern search technique presented in the previous section combines the best of both worlds: our performance is comparable to Boyer-Moore algorithms and we keep the flexibility of bit-parallelism to handle classes of characters. We show in the experiments how our algorithm performs in this setup.

7 Experimental Results

We ran extensive experiments on random and natural language text to show how efficient our algorithms are in practice. The experiments were run on a Sun UltraSparc-1 of 167 MHz, with 64 Mb of RAM, running SunOS 5.5.1. We measure CPU times, which are within 2% with 95% confidence. We used random texts and patterns with $\sigma = 2$ to 64, as well as natural language text and DNA sequences.

We show in Figure 7 some of the results for short ($m \le w$) and long ($m > w$) patterns. The comparison includes the best known algorithms: BM, BM-Sunday, KMP (too slow to appear in the plots, close to 0.14 sec/Mb), Shift-Or (not always shown, close to 0.07 sec/Mb), classical BDM, and our three bit-parallel variants: BNDM, BM_BNDM and Turbo_BNDM.

Our bit-parallel algorithms are always the fastest for short patterns, except for $m \le 2$-$6$. The fastest algorithm is BM_BNDM, though it is very close to simple BNDM. Classical BDM, on the other hand, is sometimes slower than the BM family. Turbo_BNDM is competitive with simple BNDM and has linear worst case. Our algorithms are especially good for small alphabets since they use more information on the matched pattern than others. The only good competitor for small alphabets is Boyer-Moore, which however is slower because the code is more complex (notice that Boyer-Moore is faster than BDM, but slower than BNDM). For larger alphabets, on the other hand, another very simple algorithm gets very close: BM-Sunday. However, we are always at least 10% faster.

On longer patterns^6 our algorithm ceases to improve because it basically searches for the first $w$ letters of the pattern, while classical BDM keeps improving. Hence, our algorithm ceases to be the best one (beaten by BDM) for $m \ge 90$-$150$. This value would at least double in a 64-bit architecture.

^6 We did not include the more complex variations of our algorithm because they have already been shown to be very similar to the simple one. We also did not include the algorithms which are known not to improve, such as Shift-Or and KMP.

We show also some illustrative results using classes of characters, which were generated manually as follows: we select from an English text an infrequent word, namely "responsible" (close to 10 matches per megabyte). Then we replace its first or last characters by the class {a..z}. This will adversely affect the shifts of the BNDM algorithm. We compare the efficiency against Shift-Or. The result is presented in Table 1, which shows that even in the case of three initial or final letters allowing a large class of characters the shifts are significant and we double the performance of Shift-Or. Hence, our goals of handling classes of characters with improved search times are achieved.

    Pattern        Shift-Or   BNDM
    responsible      6.58     2.71
    responsibl?      6.51     2.96
    responsib??      6.52     3.23
    responsi???      6.49     3.40
    ?esponsible      6.46     2.93
    ??sponsible      6.55     3.42
    ???ponsible      6.51     3.78

Table 1. Search times with classes of characters, in 1/100-th of seconds per megabyte on English text. The question mark '?' represents the class {a..z}.

We present in Figure 8 some results on our multipattern algorithm, to show that although we take the minimum shift among all the patterns, we can still do better than searching each pattern in turn. We take random groups of five patterns of length 6 and compare our multipattern algorithm (in its two versions, called Multi-BNDM (1) and (2) according to their presentation order) against five sequential searches with BNDM (called BNDM in the legend), and against the parallel version proposed in [22] (called Multi-WM). As can be seen, our first arrangement is slightly more efficient than the second one; both are always more efficient than a sequential search (although the improvement is not five-fold but two- or three-fold because of shorter shifts), and are more efficient than the proposal of [22] provided $\sigma \ge 8$.

Finally, we show the performance of our multipattern algorithm when used for approximate string matching. We include the fastest known algorithms in the comparison [4, 3, 7, 13, 16, 23, 22]. We compare those algorithms against our version of [4] (where the Sunday algorithm is replaced by our BNDM), while we consider [22] not as the bit-parallel algorithm presented there but their other proposal, namely reduction to exact searching using their algorithm Multi-WM for multipattern search (shown in the previous experiment). Figure 9 shows the
results for different alphabet sizes and $m = 20$.

Since BNDM is not very good for very short patterns, the approximate search algorithm ceases to be competitive shortly before the original version [4]. This is because the length of the patterns to search for is $O(m/k)$. Despite this drawback, our algorithm is quite close to [4] (sometimes even faster), which makes it a reasonably competitive yet more flexible alternative, while being faster than the other flexible candidate [22].

8 Conclusions

We present a new algorithm (called BNDM) based on the bit-parallel simulation of a nondeterministic suffix automaton. This automaton has been previously used in deterministic form in an algorithm called BDM. Our new algorithm is experimentally shown to be very fast on average. It is the fastest algorithm in all cases for patterns from length 5 to 110 (on English; the bounds vary depending on the alphabet size and the architecture). We present also some variations called Turbo_BNDM and BM_BNDM which are derived from the corresponding variants of BDM. These variants are much more simply implemented using bit-parallelism and become practical algorithms. Turbo_BNDM has average performance very close to BNDM, though $O(n)$ worst case behavior, while BM_BNDM is slightly faster than BNDM. The BNDM algorithm can be extended simply and efficiently to handle classes of characters, multiple pattern matching and approximate pattern matching, among others.

The new suffix automaton we introduce and simulate for classes of characters has never been studied. Its study should make it possible to extend BDM and Turbo_RF to handle classes of characters.

The Agrep software [21] is in many cases faster than BNDM. However, Agrep is just a BM algorithm which uses pairs of characters instead of single ones. This is an orthogonal technique that can be incorporated in all algorithms, and a general study of this technique would make it possible to improve the speed of pattern matching software. We plan to work on this idea too.

References

1. R. Baeza-Yates. Text retrieval: Theory and practice. In 12th IFIP World Computer Congress, volume I, pages 465–476. Elsevier Science, September 1992.
2. R. Baeza-Yates and G. Gonnet. A new approach to text searching. CACM, 35(10):74–82, October 1992.
3. R. Baeza-Yates and G. Navarro. A faster algorithm for approximate string matching. In Proc. of CPM'96, pages 1–23, 1996.
4. R. Baeza-Yates and C. Perleberg. Fast and practical approximate pattern matching. In Proc. CPM'92, pages 185–192. Springer-Verlag, 1992. LNCS 644.
5. A. Blumer, A. Ehrenfeucht, and D. Haussler. Average sizes of suffix trees and dawgs. Discrete Applied Mathematics, 24(1):37–45, 1989.
6. R. S. Boyer and J. S. Moore. A fast string searching algorithm. Communications of the ACM, 20(10):762–772, 1977.
7. W. Chang and J. Lampe. Theoretical and empirical comparisons of approximate string matching algorithms. In Proc. of CPM'92, pages 172–181, 1992. LNCS 644.
8. M. Crochemore. Transducers and repetitions. Theor. Comput. Sci., 45(1):63–86, 1986.
9. M. Crochemore, A. Czumaj, S. Jarominek, T. Lecroq, W. Plandowski, and W. Rytter. Fast practical multi-pattern matching. Rapport 93-3, Institut Gaspard Monge, Université de Marne la Vallée, 1993.
10. M. Crochemore, A. Czumaj, L. Gasieniec, S. Jarominek, T. Lecroq, W. Plandowski, and W. Rytter. Speeding up two string-matching algorithms. Algorithmica, 12:247–267, 1994.
11. M. Crochemore and W. Rytter. Text algorithms. Oxford University Press, 1994.
12. R. N. Horspool. Practical fast searching in strings. Softw. Pract. Exp., 10:501–506, 1980.
13. P. Jokinen, J. Tarhio, and E. Ukkonen. A comparison of approximate string matching algorithms. Software Practice and Experience, 26(12):1439–1458, 1996.
14. D. E. Knuth, J. H. Morris, Jr, and V. R. Pratt. Fast pattern matching in strings. SIAM Journal on Computing, 6(1):323–350, 1977.
15. T. Lecroq. Recherches de mot. Thèse de doctorat, Université d'Orléans, France, 1992.
16. G. Navarro. A partial deterministic automaton for approximate string matching. In Proc. of WSP'97, pages 112–124. Carleton University Press, 1997.
17. G. Navarro and M. Raffinot. A bit-parallel approach to suffix automata: Fast extended string matching. Technical Report TR/DCC-98-1, Dept. of Computer Science, Univ. of Chile, Jan 1998. ftp://ftp.dcc.uchile.cl/pub/users/gnavarro/bndm.ps.gz.
18. M. Raffinot. Asymptotic estimation of the average number of terminal states in dawgs. In R. Baeza-Yates, editor, Proc. of WSP'97, pages 140–148, Valparaiso, Chile, November 12-13, 1997. Carleton University Press.
19. M. Raffinot. On the multi backward dawg matching algorithm (MultiBDM). In R. Baeza-Yates, editor, Proceedings of the 4th South American Workshop on String Processing, pages 149–165, Valparaiso, Chile, November 12-13, 1997. Carleton University Press.
20. D. Sunday. A very fast substring search algorithm. CACM, 33(8):132–142, August 1990.
21. S. Wu and U. Manber. Agrep – a fast approximate pattern-matching tool. In Proc. of USENIX Technical Conference, pages 153–162, 1992.
22. S. Wu and U. Manber. Fast text searching allowing errors. CACM, 35(10):83–91, October 1992.
23. S. Wu, U. Manber, and E. Myers. A sub-quadratic algorithm for approximate limited expression matching. Algorithmica, 15(1):50–67, 1996.
24. A. C. Yao. The complexity of pattern matching for a random string. SIAM Journal on Computing, 8(3):368–387, 1979.

This article was processed using the LaTeX macro package with LLNCS style.

[Figure 7: six plots of search time against pattern length m. Legend: BDM, BNDM, BM_BNDM, Turbo_BNDM, Shift-Or, Sunday, Boyer-Moore.]

Fig. 7. Times in 1/100-th of seconds per megabyte. For first to third row, random text with $\sigma = 4$, random text with $\sigma = 64$ and English text. Left column shows short patterns, right column shows long patterns.

[Figure 8: plot of search time t against alphabet size (2 to 64). Legend: Multi-BNDM (1), Multi-BNDM (2), Multi-WM, BNDM.]

Fig. 8. Times in 1/100-th of seconds per megabyte, for multipattern search on random text of different alphabet sizes (x axis).

[Figure 9: two plots of search time t against the number of errors k (1 to 10). Legend: Ex. Part. (ours), Ex. Part. [4], Ex. Part. [22], Bit Parall. [3], Col. Part. [7], Counting [13], DFA [16], 4-russians [23].]

Fig. 9. Times in seconds per megabyte, for random text on patterns of length 20, and $\sigma = 16$ and 64 (first and second column, respectively). The x axis is the number of errors allowed.
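The exact-partitioning filters compared in Figure 9 rest on the argument of Section 6.3: if the pattern is cut into $k+1$ pieces, at least one piece appears unaltered inside every occurrence with at most $k$ errors. The Python sketch below illustrates that idea only. It makes several assumptions that are not from the paper: the function names are hypothetical, `str.find` stands in for the multipattern search (Multi-BNDM in the setup described here), and candidate areas are verified with a plain dynamic-programming scan.

```python
def approx_ends(pattern, text, k):
    """Classical dynamic programming: end positions (exclusive) of substrings
    of text that match pattern with at most k errors."""
    m = len(pattern)
    col = list(range(m + 1))               # col[i]: errors matching pattern[:i]
    ends = []
    for e, c in enumerate(text, start=1):
        diag = col[0]                      # previous column, row i-1
        col[0] = 0                         # a match may start anywhere
        for i in range(1, m + 1):
            old = col[i]
            col[i] = min(diag + (pattern[i - 1] != c), old + 1, col[i - 1] + 1)
            diag = old
        if col[m] <= k:
            ends.append(e)
    return ends


def split_pieces(pattern, k):
    """Cut pattern into k+1 contiguous pieces; returns (offset, piece) pairs."""
    m, pieces, start = len(pattern), [], 0
    for i in range(k + 1):
        end = start + m // (k + 1) + (1 if i < m % (k + 1) else 0)
        pieces.append((start, pattern[start:end]))
        start = end
    return pieces


def filter_search(pattern, text, k):
    """Exact-partitioning filter: search the pieces exactly, verify nearby."""
    m, n = len(pattern), len(text)
    if not 0 <= k < m:
        raise ValueError("need 0 <= k < len(pattern)")
    ends = set()
    for offset, piece in split_pieces(pattern, k):
        p = text.find(piece)               # stand-in for multipattern search
        while p != -1:
            # An occurrence containing this piece lies inside [lo, hi).
            lo = max(0, p - offset - k)
            hi = min(n, p - offset + m + k)
            for e in approx_ends(pattern, text[lo:hi], k):
                ends.add(lo + e)
            p = text.find(piece, p + 1)
    return sorted(ends)


text = "surgery and surveys are surveyed"
print(filter_search("survey", text, 1) == approx_ends("survey", text, 1))
```

The pieces have length about $m/(k+1)$, which is why the filter degrades as $k$ grows: short pieces are found often and allow only short shifts.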