The IMS Corpus Workbench
Corpus Administrator's Manual

Oliver Christ
Institut für maschinelle Sprachverarbeitung
- Computerlinguistik -
Universität Stuttgart
Azenbergstr. 12
D 70174 Stuttgart 1

Created: Thu Feb 24 10:34:11 1994 (oli)
Last Modified: Wed Nov 9 14:33:27 1994 (oli)
Released: - not yet -
tc.bibentry: christ:94b

IMS Corpus Workbench: Administrator's Manual

Contents

1 Overview
  1.1 Introduction
  1.2 The role of the Corpus Administrator
  1.3 Organization of this manual
  1.4 Credits and Acknowledgements

2 Internal corpus representation
  2.1 Positional attributes
    2.1.1 Integerized files
    2.1.2 Inversed item sequence
    2.1.3 Example: a simple word search
    2.1.4 The set of positional attributes
  2.2 Other attribute types
    2.2.1 Structural attributes
    2.2.2 Alignment attributes
    Bigram and mapping tables
  2.3 External tools and dynamic attributes

3 Encoding: Transforming a corpus into its internal representation
  3.1 The internal representation of a corpus
  3.2 The encode program
  3.3 The makeall program
  3.4 Space requirements
    Positional attributes
    Structural attributes

4 The corpus registry
  4.1 Some remarks about nomenclature
  4.2 The contents of a registry file
    The header
    Positional attributes
    Structural attributes
    Alignment attributes
    Dynamic attributes
    Mapping tables
    ngram tables
  4.3 Registration of remote corpora
  4.4 A last example
  4.5 Steps to follow

5 Remote access - client and server setup
  5.1 The .rat and .ratlog files
  5.2 How to start the corpus data server

6 Utilities and debugging tools
  6.1 Decoding of corpus and attribute information
    Decoding of corpus information: decode
    Decoding of word lists: lexdecode
  6.2 Creation and Decoding of Bigram Tables
    Creation of bigram tables: gen-bigrams
    Decoding of bigram tables: decode-bigrams
  6.3 Creation and Decoding of Mapping Tables
    Creation of mapping tables: gen-mapping-table
    Decoding of mapping tables: decode-mapping-table
  6.4 General utilities
    Comparing word lists and corpora: check-coverage
    Converting internal integers to readable numbers: itoa
    Converting readable numbers to internal integers: atoi

7 Access control and security issues
  7.1 Controlling local access to corpora
  7.2 Controlling remote access to corpora

A Hardware and operating system requirements

B Reused software packages and copyright notices
  B.1 The regular expression matcher by Henry Spencer

Chapter 1

Overview

1.1 Introduction

The IMS corpus workbench is a set of tools for the efficient encoding, representation and querying of large text corpora. This manual describes how to encode a text corpus and how the various administration tools must be used to transform a text corpus into the representation used by the access tools. This manual does not describe the functionality of the query tools or the architecture of the workbench in general. If the reader is not familiar with the overall architecture, we recommend a "top-down" reading through the more general papers, especially [Christ, 1994] for an overview of the system architecture as a whole.

The internal representation of a corpus consists of a set of files which represent the corpus data, the different "items" (i.e., words) used in the corpus and several index files for efficient lookup. To transform a text corpus from its textual representation to the internal representation used by the IMS workbench, the following steps have to be performed:

1. transformation of the text file into one-word-per-line format;
2. encoding of the text file;
3. declaration of the corpus in a global "registry directory";
4. and building several file indices.

Steps 1 and 3 have to be done manually; for steps 2 and 4 there are tools within the workbench.

The third step, the registration of a corpus, is necessary since almost all tools (except the one used in step 2 above) access a corpus via a symbolic name. When a symbolic name is passed to a tool, it is looked up in a central directory (called the "corpus registry"), where the tool expects to find a file with the very same name as the symbolic name of the corpus to be accessed. This file holds a description of the components of the corpus, mainly a list of where the components are stored physically. So, a user does not have to know where a corpus is stored; he or she only has to know its symbolic name to access it.

After a corpus is transformed into its internal representation and registered, it can be used by the various tools of the workbench, for example the query tools (xkwic, cqp, print-aligned).

1.2 The role of the Corpus Administrator

Within the IMS corpus workbench, the corpus administrator has three tasks. First, the administrator provides users with new corpora or changes existing corpora when some information has to be added or updated. Second, the administrator has to properly install the usually large corpus files in the file system and to find an "optimal" place with regard to backup policies, disk usage and access efficiency. Third, the corpus administrator has to care about access control, since corpora exist where copyrights or license agreements inhibit unrestricted access, even within one institution. These tasks are similar to "standard" system administration. We therefore suggest that the corpus administration tasks are fulfilled by someone who is familiar with the standard UNIX text processing tools, backup strategies, and security and access control, that is, your local system administrator.

1.3 Organization of this manual

This manual is organized as follows. Chapter 2 explains the internal data structures which are used to store the corpus data. You may skip the entire chapter if you want; it is not necessary for the other chapters, but useful if you have problems with the tools or want to learn how to manipulate the data files. Chapter 3 describes the various steps which are necessary to transform a corpus from its textual representation into its internal representation. Chapter 4 describes in detail the registry directory and the format of the files which describe the physical attributes of a corpus. Chapter 5 describes how to set up the client-server capabilities of the workbench. Chapter 6 describes utilities and related tools which either add more (or other types of) information to a corpus or are useful for other purposes, for example for debugging a corpus (during encoding). Another important point is access control for corpora, which is discussed in chapter 7. To check whether the tools can run at all on your system, you may refer to appendix A for the hardware and operating system specific requirements of our tools.

1.4 Credits and Acknowledgements

The internal data structures we use in Cqp, Xkwic and some other tools of the workbench make use of integerized data files and reversed indices. Both of these techniques have been well known in the area of information processing for many decades, but to our knowledge the first who applied them to text and corpus processing in the linguistic area was Ken Church. He deserves our greatest thanks for pointing us to these methods.

Neither the authors, nor IMS, nor the University of Stuttgart make any representations about the suitability of the software described herein or the associated documentation for any purpose. It is provided "as is" without express or implied warranty. We disclaim all warranties with regard to the software described herein or the related documentation, including all implied warranties of merchantability and fitness. In no event shall we be liable for any special, indirect or consequential damages or any damages whatsoever resulting from loss of use, data or profits, whether in an action of contract, negligence or other tortious action, arising out of or in connection with the use or performance of this software.

Chapter 2

Internal corpus representation

This chapter explains how corpus data is represented internally. When you understand the internal representation, you can use the tools of the toolbox to create, update or change corpus information without having to go back to the textual version and encode the whole stuff again. You will also be able to figure out how to encode the internal representation for files which cannot be computed by the tools of the toolbox, for example due to memory problems, software bugs or limitations.

If you do not need to "hack" with the corpus data, you may skip the entire chapter. The understanding of the internal representation is not necessary for the other parts of this manual, but useful if you encounter problems with the tools.

2.1 Positional attributes

Within the IMS corpus workbench, a corpus can have an arbitrary number of annotations of different types. In our system, a corpus is primarily regarded as a sequence of words (not as a sequence of characters). The words, then, are numbered, so that we can directly access the word at a certain corpus position p (i.e., the first word in the corpus, or, in general, the nth word of the corpus). This leads to the more general notion of positional attributes, which is the most important annotation type. Attributes of this class have a (string) value at each corpus position.[1]

[1] When the corpus is stored in a verticalized one-word-per-line format, a corpus position can also be regarded as the number of a line in this representation.

[Figure 2.1: Corpus positions and values. For each corpus position up to n-1, the figure shows the value of the attribute word (Pierre, Vinken, ",", 61, years, old, ..., blessing, ...) and of the attribute pos (N, N, IP, NUM, N, ADJ, ..., N, IP).]

The corpus text falls within the class of positional attributes, since we can specify, for each corpus position, the word which occurs at that position. The positional attribute which holds the corpus text proper always has the predefined attribute name "word". Other positional attributes are, for example, part-of-speech tags, which are assigned to the words of the corpus. In our view, we regard POS-tags as assigned to a corpus position rather than to the word at that position. Then, the positional attributes "word" and "tag" do not differ very much any more: both have, for each corpus position, a value which is, in our case, a string (as illustrated in figure 2.1). We therefore use the same internal representation for the word sequence of the corpus (the corpus text) and the tag sequence (the associated POS-tags), as well as for other, additional positional attributes like lemmas, morphosyntactic tags, etc. In other words, a tagged corpus is in our view a set of two positional attributes of equal length, one of which captures the sequence of words, the other captures the sequence of tags. In the following, we therefore use the term "item" to abstract from the type of information encoded in such a positional attribute (here, word vs. tag). A corpus in general, then, is a collection of attributes of different types.

The question of the internal representation of corpora with multiple (positional) annotations can thus be reduced to the question of representing a single positional attribute (remember that all positional attributes must be of equal length, that is, encode item streams of equal length).

The two key concepts of the internal representation of a positional attribute are:

- integerized representation: items are encoded as integer numbers, where equal items (words, ...) get the same integer code. The sequence of items is then represented as a sequence of integer numbers;
- inversed file indices: for the sequence of numbers, an inversed file is created. The inversed file captures, for each item (better: item code), the set of occurrences of the item in the positional attribute.

For the construction of the integer code, you normally need a segmentation or tokenization tool, since "the," and "the" are considered different and, undesirably, get different codes.
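
The integerized representation can be illustrated with a minimal Python sketch. It is not the workbench's implementation: the function name `integerize` and the choice to hand out codes in order of first occurrence are assumptions made for illustration only.

```python
def integerize(items):
    """Map each distinct item to an integer code and encode the sequence.

    Assumption for illustration: codes are assigned in order of first
    occurrence, so the first distinct item gets code 0, the next 1, etc.
    """
    codes = {}       # item string -> integer code
    sequence = []    # the item sequence, represented as integer codes
    for item in items:
        if item not in codes:
            codes[item] = len(codes)
        sequence.append(codes[item])
    return codes, sequence


# "the," (not tokenized) and "the" are different strings, hence different codes.
tokens = ["the", "board", "of", "the", "company", "the,"]
codes, sequence = integerize(tokens)
print(sequence)
```

Both occurrences of "the" share one code, while the untokenized "the," receives a code of its own, which is the effect the tokenization remark above warns about.
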
The advantage of the integer code is that the represented items have equal internal length (in the case of integers, 4 bytes on our machines). Since the length of the item sequence is known and the items are of equal length, the item sequence can be handled like an array of items, with the advantage of random access. The inversed file is needed for lookup: since it directly indexes the set of occurrences of a given item (code), the occurrences can be computed in a single step.

2.1.1 Integerized files

As said above, the first set of data structures is an integerized file of the textual representation of the item sequence. Then, the item sequence is represented as a sequence of integer codes.

It is obvious that at least two functions are needed to handle this encoding:

- first, a function to compute the integer code of a given item (a string);
- second, a function to retrieve the (character) string when the code is given.

These two functions require some auxiliary data structures to be efficiently computable. The first data structure is the item list or "lexicon": it captures the set of (different) items. Internally, this is the set of strings occurring in the item sequence, where a NULL character (octal \000) is appended to the end of each word. The file is not sorted (but it may be). A UNIX command to produce this file would be:

    (2.1)  sort -u 1wpl-item-seq |
           tr '\n' '\0' > lexicon

where it is assumed that the input item sequence is in one-word-per-line format. In this example, the output would be sorted, but this is not necessary. The item list already defines the item code for each item, since it is assumed that the first item in the item list has code 0, the next one has code 1, and so on.

Note that "traditional" trs delete ASCII 0 (\0) from the input stream, so that the example above will not work with a "traditional tr". GNU's tr does not have this bug.

For the lookup of a string in this list, it is useful to have an index of the starting positions of the strings in this file. This index gives, for each item code, a mapping from item codes to the file offsets (in bytes) in the item list. Thus, the starting position of the string represented by the item code c is computed in one step via this index, which we call the item list index or lexicon index. Since the strings in the item list are terminated with the NULL character (which must not occur in the items themselves), the string represented by an item code is everything between the starting position computed by the item index and the next NULL character.

Again, this index can be computed by a UNIX command when the lexicon is already computed:

    (2.2)  tr '\0' '\n' < lexicon |
           gawk 'BEGIN { pos = 0 }
                 { print pos; pos = pos + length($1) + 1 }' |
           atoi > lexicon.idx

atoi is a utility program included in the toolbox and maps numbers (represented textually as a sequence of digits) to their internal representation.

The next data structure supports the mapping from strings to their item codes. This could be done in a number of different ways; currently, binary search over a sorted string index is implemented.[2] For example, the same functionality could be achieved with tries, which perhaps would need more space, but could compute the conversion in n steps, where n is the length of the input string.

[2] The method currently implemented in the workbench is very simple and could be sped up a lot, but since it is rarely used (all computations are done on the item codes, whenever possible, instead of strings), we didn't yet convert it to a more efficient method.
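
The layout of the item list and its index can be sketched in Python as follows. This is an illustration under stated assumptions, not the toolbox code: the function names are hypothetical, strings are encoded as Latin-1, and integers are written as 4-byte big-endian values, whereas the manual only says that an integer takes 4 bytes on the authors' machines.

```python
import struct


def build_lexicon(items):
    """Return (lexicon_bytes, index_bytes) for distinct items, in the given order."""
    lexicon = bytearray()
    offsets = []
    for item in items:
        offsets.append(len(lexicon))              # byte offset where this string starts
        lexicon += item.encode("latin-1") + b"\0"  # each string is NUL-terminated
    # Assumption: 4-byte big-endian integers; the real byte order may differ.
    index = struct.pack(">%di" % len(offsets), *offsets)
    return bytes(lexicon), index


def string_for_code(lexicon, index, code):
    """Retrieve the string for an item code in one step via the index."""
    (start,) = struct.unpack_from(">i", index, 4 * code)
    end = lexicon.index(b"\0", start)              # string runs up to the next NUL
    return lexicon[start:end].decode("latin-1")


lexicon, index = build_lexicon(["never", "the", "cucumber"])
print(string_for_code(lexicon, index, 2))
```

Because every index entry has the same width, the entry for code c sits at byte offset 4*c, which is what makes the code-to-string lookup a single step.
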

The binary search requires a sorted structure. For this purpose, we do not keep a sorted item list, but rather another index (denoted by Ls) which holds, for each position p of an item in a "virtual" sorted item list (ranging from 0 to the number of encoded items minus 1), the item code at this position. So, Ls(0) is the item code of the "smallest" item, Ls(1) is the code of the second-smallest item, etc. The sorted item list can thus be textually printed by the function

    for (i = 0; i < "size of item set"; i++) {
      code = sortidx[i];
      s = lexidx[code];
      print s;
    }

Here, for each possible position i in the sorted index, first the item code code at that position is computed. Then, through accessing the item index, the character string represented by code is determined, which is then printed.

As you can imagine, this file can easily be produced by a UNIX command:

    (2.3)  tr '\0' '\n' < lexicon |
           gawk '{ print NR-1 "\t" $1 }' |
           sort -k 2 |
           gawk '{ print $1 }' |
           atoi > lexicon.srt

(sort -k 2 is the POSIX spelling of the older sort +1.) The first line computes the strings from the item list, which are then prefixed by their code (which is the "position" in the item list), beginning with code 0 for the first word. This list of code/value pairs is then sorted by the values, which occur in the second column. The output of the sorting is then filtered, so that only the codes are printed. The code sequence is then transformed into the internal format and written to the index file.

Note: One of the reasons we do not use these UNIX commands to create the data structures is that the UNIX sort command sometimes handles the order of 8-bit characters differently from other programs (it works with signed characters, whereas internally we work with unsigned characters). So the UNIX commands which use sort only create the same files when the standard 7-bit ASCII character set is used. The internal functions which map from items to item codes will not work properly otherwise.

The data structures used to represent the encoded item sequence and the associated auxiliary data structures which facilitate the necessary mappings are illustrated in figure 2.2.

[Figure 2.2: Integerized items and associated data structures. Panels: Item Sequence, Item List Index, Item List, Sorted Index; the item index maps an item code to a string (example items: never, the, cucumber, ";", "&", wine).]

The only file for which we didn't yet give a UNIX command is the item sequence (or better, the sequence of encoded items). gawk's arrays can be used for this purpose. One possibility is to read an already existing item list file, which may be produced by the commands above. Another possibility is to produce the item list and the indices in a single gawk run. The script below can be used for this second purpose, but it assigns other item codes:

    (2.4)  BEGIN {
             maxcode = 0;
             position = 0;
           }
           {
             if (!($1 in itemlist)) {
               # new item: append it to the item list and record its offset
               print $1 > "lexicon.asc"
               itemlist[$1] = maxcode;
               print position > "lexicon.idx.asc"
               code = maxcode;
               maxcode++;
               position = position + length($1) + 1;
             } else
               code = itemlist[$1];
             print code > "corpus.asc"
           }

After the code is executed with a text file as input, the ASCII representations have to be converted into the internal format (this could be done via pipes in the gawk script also, but we left that out here for the sake of clarity):

    (2.5)  atoi < corpus.asc > corpus
           tr '\n' '\0' < lexicon.asc > lexicon
           atoi < lexicon.idx.asc > lexicon.idx
           rm -f *.asc

After that, command 2.3 can be used to produce the sorted item list index.

Oliver Christ, IMS Stuttgart, August 9, 1996

2.1.2 Inversed item sequence

The second set of data structures concerns the inversed file index associated with the item sequence. This inversed file holds, for each item code, the set of positions in the item

sequence where the item code occurs. Through the mapping functions introduced in the last section, we can also regard the inversed item sequence as a list of corpus positions where a certain word or part-of-speech tag occurs.

The inversed file is represented by a set of three files:

- first, the inversed file itself, which contains a set of corpus positions;
- second, an index into this file. This index returns, for each item code, the start point of the associated occurrences in the inversed file;
- third, a table of item code frequencies, which gives, for each item code, the number of occurrences of the code in the corpus (which is, of course, equal to the size of the set of occurrences).

The three files can also be computed by UNIX commands. First, the reversed sequence is produced by the following command:

    (2.6)  itoa corpus |
           gawk '{ print $1 "\t" NR-1 }' |
           sort -ns |
           gawk '{ print $2 }' |
           atoi > corpus.rev

First, the internal representation of the item sequence is converted into readable numbers. Each number in this sequence is then suffixed with its position in the corpus, and the resulting code/position pairs are sorted by the code. From this sequence, the code is stripped off, so that we only get the sequence of positions, which is exactly the inversed file.

The frequencies can already be computed in the gawk encode script (2.4), but another possibility is a slightly modified version of the script above:

    (2.7)  itoa corpus |
           gawk '{ print $1 "\t" NR-1 }' |
           sort -ns |
           gawk '{ print $1 }' |
           uniq -c |
           gawk '{ print $1 }' |
           atoi > corpus.cnt

Here, we keep the code sequence of the code/position pairs. This sequence of codes appears in sorted order. By the call to the uniq utility, equal subsequent lines (here: codes) are collapsed into a single line and counted. These counts are stripped off and converted to internal format.[3]

[3] It would be more efficient to use a gawk array to hold the item code counts, since the sort step could be omitted. The version here is just for clarity.

The last file, the index into the inversed file, can simply be computed from the frequencies by summing them up:

    (2.8)  itoa corpus.cnt |
           gawk 'BEGIN { pos = 0 } { print pos; pos += $1 }' |
           atoi > corpus.rdx
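
The three components of the inversed file can also be sketched in a few lines of Python. The sketch below is an illustration, not the toolbox implementation: the function name `invert` is hypothetical, and it replaces the sort step of commands 2.6 and 2.7 with a counting pass over the codes, in the spirit of the array-based variant mentioned in the footnote.

```python
def invert(sequence, n_codes):
    """Build the reversed file, its index and the frequency table.

    sequence: the item sequence as a list of item codes
    n_codes:  the number of distinct item codes (size of the item list)
    """
    freq = [0] * n_codes                 # occurrences per item code (corpus.cnt)
    for code in sequence:
        freq[code] += 1

    start = [0] * n_codes                # start offset per code (corpus.rdx)
    for code in range(1, n_codes):
        start[code] = start[code - 1] + freq[code - 1]

    rev = [0] * len(sequence)            # positions grouped by code (corpus.rev)
    fill = list(start)                   # next free slot for each code
    for position, code in enumerate(sequence):
        rev[fill[code]] = position
        fill[code] += 1
    return rev, start, freq


rev, start, freq = invert([0, 1, 2, 0, 3, 0], 4)
print(rev, start, freq)
```

Within the block of each code the positions come out in ascending order, which matches what the stable numeric sort in command 2.6 produces.
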

Now, the whole set of seven files representing the data of a positional attribute (which we call the seven components of a positional attribute) has been created. In the toolbox, there are tools which perform these steps much faster than the shell scripts presented here. But in some cases, the utilities of the toolbox run into memory problems, and then these scripts may help to produce the encoded version of a corpus.

[Figure 2.3: Reversed file indices. Panels: Index into reversed file, Reversed File, Nr of item occurrences (freq); the item index maps an item code to its set of occurrences.]

The components associated with the inversed file and their meanings are illustrated in figure 2.3. The next section will show the single steps which are taken when a simple search for an item, a word for example, is performed.

2.1.3 Example: a simple word search

After the internal data structures have been introduced, we can compute the concordance for a single item, for example the word "the". Most data structures can be treated as an array, so we use the symbols

- C for the item sequence (accessed by C[i] where i is a corpus position). The elements of C are item codes;
- R for the reversed item sequence (accessed by R[i] where i is an index into this sequence, computed from IR below). The elements of R are corpus positions;

[Figure 2.4: A simple word search. Panels: Sorted Index, Word list, Index into reversed corpus, Nr of item occurrences (freq), Reversed Corpus, Corpus; the figure traces a "match" from the index into the word list and the word list lookup to a concordance element.]

- IL for the item list index (accessed by IL[c] where c is an item code). The elements of IL are byte offsets into the item list;
- F for the item frequency table (accessed by F[c] where c is an item code). The elements of F are item frequencies in C;
- IR for the reversed item sequence index (accessed by IR[c] where c is an item code). The elements of IR are pointers (offsets) into R;
- L for the item list (an array of characters, only accessed by offsets of IL);
- SL for the sorted item list index (accessed by SL[i] where i is a position in the "virtual" sorted item list). The elements of SL are item codes.

These seven arrays are the components of a positional attribute.

For computing the set of occurrences of a textual item in C, the following steps have to be taken (also illustrated in figure 2.4 for the word "the"):

- first, the item code c(i) of item i has to be determined. For this purpose, the sorted item index SL is consulted and searched with binary search until the item code is found;
- if the item code could be determined, the reversed item sequence index is consulted to determine the starting position rs(i) = IR[c(i)] of the position set associated with i in the reversed item sequence;
- second, the item frequency list is accessed to compute the "length" of the position set, f(i) = F[c(i)];
- then, the set of occurrences P(i) is the set of positions stored in the reversed item sequence R starting at rs(i) with length f(i) (R[rs(i)] ... R[rs(i) + f(i) - 1]).

The task of computing the set of occurrences of i in the item sequence is then completed. Note that the item sequence itself didn't have to be accessed.

For computing the concordance and printing it, though, the item sequence C must be consulted. When cl is the left display context (in terms of items) and cr is the right context, for each p in P(i) the "subsequence" between [p - cl, p + cr] in C must be computed (within the bounds (0, |C| - 1)). For each item k in this subsequence, the associated (textual) item must be determined by computing the start position ts(k) = IL[k] of k in the item list index. Then, the item list can be consulted to get the string s(k), which then is printed.

2.1.4 The set of positional attributes

The IMS Corpus Toolbox supports an arbitrary number of positional attributes. Each positional attribute has its own set of components. For each positional attribute, the lengths of C and R are equal (see also figure 2.5):

    |C| = |R|
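
The lookup steps of section 2.1.3, together with the length condition just stated, can be sketched in Python. This is a simplified illustration: the function names are hypothetical, and a plain list of strings named `lexicon` stands in for the combination of the item list L and its index IL.

```python
def code_of(item, lexicon, SL):
    """Binary search over the virtual sorted item list; returns a code or None."""
    lo, hi = 0, len(SL) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        s = lexicon[SL[mid]]          # string at sorted position mid
        if s == item:
            return SL[mid]
        if s < item:
            lo = mid + 1
        else:
            hi = mid - 1
    return None


def concordance(item, lexicon, C, SL, IR, F, R, cl=1, cr=1):
    """Return one context string per occurrence of item."""
    c = code_of(item, lexicon, SL)
    if c is None:
        return []
    lines = []
    for p in R[IR[c]:IR[c] + F[c]]:   # occurrences, found without touching C
        lo = max(0, p - cl)           # clip the context to the corpus bounds
        hi = min(len(C) - 1, p + cr)
        lines.append(" ".join(lexicon[k] for k in C[lo:hi + 1]))
    return lines


lexicon = ["the", "board", "of", "company"]   # item code -> string
C = [0, 1, 2, 0, 3]                           # "the board of the company"
SL = [1, 3, 2, 0]                             # codes in sorted string order
F = [2, 1, 1, 1]
IR = [0, 2, 3, 4]
R = [0, 3, 1, 2, 4]
assert len(C) == len(R)                       # |C| = |R|
print(concordance("the", lexicon, C, SL, IR, F, R))
```

As in the description above, the item sequence C is only read when the context is printed; the set of occurrences itself comes from SL, IR, F and R alone.
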

Similarly, the lengths of IL, IR, SL, and F are equal:

    |IL| = |IR| = |SL| = |F|

[Figure 2.5: The set of positional attributes. For each of the positional attributes word, pos and lemma, the figure shows the seven components: Item Seq, Reversed Item Seq, Item freqs, Index for RC, Index for IL, Sorted Idx, Item list.]

Furthermore, between all positional attributes associated with a corpus, the lengths of the item sequences of these attributes must be equal:

    |C_word| = |C_lemma| = |C_pos| = |C_syn| = ...

Of course, no such condition usually holds between the other components of two positional attributes.

2.2 Other attribute types

2.2.1 Structural attributes

Structural attributes capture information about boundaries of sentences, paragraphs, phrases, or other entities. Internally, these structures are represented as intervals of corpus positions, which are the start and end point (inclusive) of the structure. Such an interval is a pair of corpus positions. Therefore, each structural item needs 8 bytes of storage internally (4 bytes, the size of an integer, for each of the two positions).

Currently, there are two limitations with respect to structural attributes:

- first, the intervals must not be recursive (for example, embedded NPs in NPs);
- and they must not be overlapping.

Figure 2.6 illustrates the representation of structural attributes. The number of structural attributes associated with a corpus is not limited.

[Figure 2.6: Structural attributes. Above the positional attributes word (Pierre, Vinken, ",", 61, years, old, ..., blessing, ...) and pos (N, N, IP, NUM, N, ADJ, ..., N, IP), the figure shows the structural attributes s and paragraph as intervals of corpus positions.]

Creating structural attribute data

Unlike positional attributes, the data for a structural attribute is stored in a single file, S. Normally, the structural attribute data is created with the encode utility. But in some cases, it is useful to manipulate or create the files through other utilities. The data file is an array of integer pairs, where |S| is the number of intervals. The file size of S is then 4 * 2 * |S| bytes, since for each interval, two integer numbers have to be stored, each of which needs 4 bytes.

If the files are constructed manually (without the help of encode, for example, after the corpus has been encoded), a simple awk script can help. First, however, you must be aware of the internal representation of positional attributes and the "logics" of corpus positions; second, some pitfalls have to be circumvented.

Let's assume that a one-word-per-line input file with marked sentence boundaries (in SGML style, like <s> </s>) is available. Then, the intervals can be extracted by the following awk script (the output of which has to be converted into internal integer format with atoi):

    (2.9)  BEGIN {
             position = 0;
             open = 0;
             structure = "s"
             opentag = "<" structure ">"
             closetag = "</" structure ">"
           }
           {
             if ($1 == closetag) {
               # closing tag, don't increment position.
               if (open) {
                 print position-1;  # thanks to a3@wsserv.vdl.nl (Adri Verhoef)
                 open = 0;
               } else {
                 print "closing non-open group at line " NR ": " $0 >> "/dev/stderr"
                 exit
               }
             } else if ($1 == opentag) {
               # tag, don't increment position then
               if (open) {
                 # forgot to close group, which we don't consider an error
                 print position-1
               }
               print position
               open = 1
             } else if ($1 ~ /<\/?[a-zA-Z]+>/) {
               # do nothing, other structural tag?
             } else
               position++;
           }
           END {
             # close a group that is still open at the last corpus position
             if (open)
               print position-1
           }

First, care must be taken when groups are closed which are not open. The other case, reopening open groups, is not considered an error, since closing tags are optional. Additionally, when structure tags are used in the text, the line number (position) must not be incremented. But even then, this awk program may yield errors. So, at least check whether the size of the resulting file can be divided by 8.

2.2.2 Alignment attributes

Bigram and mapping tables

2.3 External tools and dynamic attributes

[Figure 2.7: Alignment attributes. The positional attributes word and pos of one corpus (Pierre, Vinken, ",", 61, years, old, ...) are linked by an alignment to the word attribute of a second corpus (Pierre, Vinken, ",", 61, ",", wird, ...).]

[Figure 2.8: Bigram and mapping tables. Bigram tables are shown over the items Pierre, Vinken, 61, years; mapping tables over the tags NP, NPS, PUNCT, CARD, N.]

[Figure 2.9: External (dynamic) attributes. Stages shown: value request, pipe() invocation, value passing, value computation by the external tool, value return, value check/conversion in the data access module.]
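
To close the chapter, the consistency check recommended in section 2.2.1 for hand-built structural attribute files (the file size must be divisible by 8) can be sketched in Python. The function name is hypothetical, and two things are assumptions rather than documented facts: that integers are stored as 4-byte big-endian values, and that intervals appear in ascending order, as the awk script (2.9) writes them.

```python
import struct


def read_intervals(data):
    """Decode a structural attribute file into (start, end) pairs and check it."""
    if len(data) % 8 != 0:
        raise ValueError("file size is not a multiple of 8 bytes")
    # Assumption: 4-byte big-endian integers; the real byte order may differ.
    values = struct.unpack(">%di" % (len(data) // 4), data)
    intervals = list(zip(values[0::2], values[1::2]))
    previous_end = -1
    for start, end in intervals:
        if start > end:
            raise ValueError("interval (%d, %d) ends before it starts" % (start, end))
        if start <= previous_end:
            # covers both overlapping and embedded (recursive) intervals
            raise ValueError("interval (%d, %d) overlaps its predecessor" % (start, end))
        previous_end = end
    return intervals


data = struct.pack(">4i", 0, 4, 5, 9)   # two sentences: positions 0-4 and 5-9
print(read_intervals(data))
```

The second check rejects exactly the two cases named as limitations of structural attributes: intervals that overlap and intervals that are embedded in one another.
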

Chapter 3

Encoding: Transforming a corpus into its internal representation

Within the IMS corpus workbench, a corpus can have an arbitrary number of annotations of different types. In our system, a corpus is primarily regarded as a sequence of words (not as a sequence of characters). The words, then, are numbered, so that we can talk about the word at a certain corpus position p, the first word in the corpus, or, in general, the nth word of the corpus. This leads to the more general notion of positional attributes, which is the most important annotation type. Attributes of this class have a (string) value at each corpus position.[1]

[1] When the corpus is stored in a verticalized one-word-per-line format, a corpus position can also be regarded as the number of a line in this representation.

The corpus text falls within the class of positional attributes, since we can specify, for each corpus position, the word which occurs at that position. The positional attribute which holds the corpus text proper always has the predefined attribute name "word". Other positional attributes are, for example, part-of-speech tags, which are assigned to the words of the corpus. In our view, we regard POS-tags as assigned to a corpus position rather than to the word at that position. Then, the positional attributes "word" and "tag" do not differ very much any more: both have, for each corpus position, a value which is, in our case, a string. We therefore use the same representation for the word sequence of the corpus (the corpus text) and the tag sequence (the associated POS-tags). In other words, a tagged corpus is in our view a set of two corpora of equal length, one of which captures the sequence of words, the other captures the sequence of tags. In the following, we therefore use the term "item" to abstract from the type of information encoded in such an attribute (here, word vs. tag). A corpus, then, is a collection of attributes of different types.

In the following section 3.1, we briefly describe the internal representation of a corpus. Section 3.2 describes the steps which are necessary to prepare a textually represented corpus to be suitable as input for the encoding tools, as well as the first of the two encoding tools, encode. Section 3.3 then describes the second encoding tool, which is used to build the indices associated with a corpus.

3.1 The internal representation of a corpus

After encoding, each item of a textual corpus is represented as a unique integer value.[2] For example, if the first item of a text corpus is "The", all "The"s in the text corpus will internally be represented as the integer number 0.[3] The corpus can then be physically represented as a sequence of integer numbers. To be able to get the item which is represented by an integer, another (indexed) file holds the mappings from integers to strings. Up to now, 3 files are necessary to hold the information: the corpus (consisting of a sequence of integers), the "lexicon" which holds, for each integer, the string it represents, and an index to the lexicon. These three files are those which are produced within the second step during the encoding of a corpus. The tool which performs this task is called encode and is described below in section 3.2.

[2] The internal corpus representation we use is highly inspired by an (unfortunately unpublished) draft paper by Ken W. Church, "A set of UNIX tools for processing large text corpora".

[3] Of course, "the" will get another code than "The", since "The" and "the" textually differ and are not considered equal.

Two additional files are built next: the first holds information about the sorted sequence of the items in the lexicon, which is necessary to efficiently compute the integer code of an item, given the string. The other file holds, for each item, the number of times it occurs in the corpus.

For efficient lookup, a reversed file or reversed index has to be built. This index holds, for each item in the corpus, the corpus positions where the item occurs. This index is indexed itself, leading to another file. In summary, we have seven files so far which represent the information of one positional attribute. The four additional files which are not built by the encode program are created with the makeall program described in section 3.3. But prior to the encoding of a corpus, it has to be transformed into a format which is suitable as input for the encode program, which is described in the next section.

3.2 The encode program

When a corpus consists of several positional attributes (for example, a POS attribute in addition to the standard "word" attribute), it can either be encoded in one single step (provided that it is in a suitable textual input format for encode) or the various positional attributes can be encoded one after another and be added to an already existing corpus. This latter way is also useful when one of the positional attributes has been changed, for example, when a tagset has been changed or a better tagger became available to produce more accurate tag assignments.

In both cases, the input format is a one-word-per-line format, where each item which is to be encoded occurs on a single line. This line may contain blanks, providing a way to encode adjacent multi-word lexemes, if desired. But care should be taken to avoid blanks at the end of an item, since, for example, "the" and "the " are different strings and therefore are encoded with different codes, which can lead to undesired effects when not all occurrences of "the" are found in a text due to a blank at the end of some.

In the case of the corpus text, the input may look as follows:

    Pierre
    Vinken
    ,
    61
    years
    old
    ,
    will
    join
    the
    board
    as
    a
    nonexecutive
    director
    Nov.
    29
    .

Another file, then, may hold the sequence of assigned tags (in which case both files must hold the same number of lines). The input format can, for example, be produced out of a raw text file with the tr command[4]:

    tr ' ' '\n' < text_file > 1wpl-file

[4] The tr command is a standard command available on many platforms and is not part of this tool set.

This command replaces all blanks in the input file with line breaks. The encode program then takes a one-word-per-line input file (or reads that format from stdin) and creates the three files corpus, lexicon and lexicon.idx in the current directory:

    encode -t 1wpl-file

The -t option instructs encode to read its input from the file given as an argument of the option instead of reading the standard input. To do both the transformation and the encoding in a single step, one could enter the following pipe:

    tr ' ' '\n' < text_file | encode

or, if one wants to hold the text in a compressed format:

    zcat text_file.gz | tr ' ' '\n' | encode

In general, the one-word-per-line format can be produced by any program you like. In the simple tr examples above, it is already assumed that the corpus has been tokenized (special characters have been separated from the words); perhaps even a sentence boundary detector must be run on a raw corpus to produce the appropriate input file. The simple example only shows how encode processes its input; in general, this method is not applicable to new, raw corpora, which first may have to be preprocessed by other tools.

Now, encode may also take an annotated corpus with several positional attributes in a single file. In this case, each line of the input format consists of a number of attribute values, separated by tabulator characters. Thus, the input file consists of several columns, each of which denotes one positional attribute. A POS-tagged text then may look as follows:

    Pierre<tab>NP
    Vinken<tab>NP
    ,<tab>,
    61<tab>CD
    years<tab>NNS
    old<tab>JJ
    ,<tab>,
    will<tab>MD
    join<tab>VB
    the<tab>DT
    board<tab>NN
    as<tab>IN
    a<tab>DT
    nonexecutive<tab>JJ
    director<tab>NN
    Nov.<tab>NP
    29<tab>CD
    .<tab>SENT

where <tab> denotes a single tabulator character (ASCII value 9) [5]. encode must then know which positional attribute is represented in the other columns of the file and how they shall be named. This is done with the -P option:

    encode -t <input-file> -P pos

Here, the option -P pos ("P" for "positional attribute") instructs encode to treat the second column in the file as the sequence of values of the pos attribute. The files associated with the pos attribute have their names prefixed with pos (which leads to pos.corpus, pos.lexicon and so on). The order in which the -P options are given is relevant, since the first -P option denotes the name of the attribute represented in the second column in the input file, the second -P option denotes the third column, etc. By default, the first column is treated as the word sequence and therefore gets the prefix word, but this can be overridden with the -p option. Please refer to the manual page of encode for details.

[5] There are no general hints on how to produce this input format. In general, it is a good idea to use standard tools like awk and sed.

Up to 32 positional attributes can currently be encoded in a single step. If large amounts of text are to be encoded, try to determine in advance the disk space the corpus needs after encoding and look for a file system where enough space is available. Some hints on the expected size are given in section 3.4. As explained in chapter 4, it is possible to split the files of a corpus between several file systems in case there isn't enough space on a single disk. If all else fails, you may have to encode the set of positional attributes in several runs of encode, each with the appropriate prefix passed with the -p option.

A corpus (or an arbitrary positional attribute) may be assigned another type of information which can be encoded with the encode program, namely structural information, which can be used to represent article, sentence or paragraph boundaries. This kind of information is represented in the input file with SGML-like markers:

    <article>
    <s>
    Pierre<tab>NP
    Vinken<tab>NP
    ,<tab>,

    29<tab>CD
    .<tab>SENT
    <s>
    </s>
    </article>

Of course, structural information can be encoded independently of additional positional attributes in the file. Some points have to be noted:

- in a line with a structure marker (s, article), no values of positional attributes may occur;
- the end tags may be omitted. In that case, a structure spans all items until the start of the next structure or the end of file;
- in a structure marker line, everything after a blank or after the closing angle bracket (>) of the tag is neglected;
- structures must not be recursive or overlapping, that is, trees cannot be represented.

In the above example, the call for encode would look like this:

    encode -t <input-file> -P pos -S article -S s

Here, the two encoded structural attributes are each declared with the -S (for "structural attribute") option. The order in which the structural attributes are declared does not matter. Again, up to 32 structural attributes can currently be encoded in a single step.

Be careful to declare all structural attributes in the encode call, since undeclared structural attributes are considered as simple attribute values in the first column and therefore are treated literally. If you are encoding several positional attributes at once, an error will occur, since lines with undeclared structural attributes in the first column in general do not have tabulator-separated columns after them. The rule is simple: either a line consists of a structure attribute designator in angle brackets, or it is a line with a fixed number of tabulator-separated columns. Errors may occur if this rule is not obeyed.

encode accepts a number of further options:

- -p <prefix> has the effect that the files belonging to the positional attribute in the first column of the input file will get the prefix "prefix.". Note that the dot at the end of the prefix is added automatically and must not be part of the option value. Normally, the files get the name corpus[.cnt,.rev,.rdx] and lexicon[.idx,.srt]. This will lead to problems if a positional attribute is encoded after a corpus is already present in the directory the data is written to, since the names of the corpus files may collide with those of the new attribute. Therefore, when the prefix is not given, data may be overwritten and lost. This option affects only the names of the files for the positional attribute in the first column of the input file; the other positional attributes, if present, will get the prefix given with the -P option;

- -d <path> lets the user specify the directory in which the data shall be written. The path should not end with a slash. Default is to write all output files to the current directory;
- -s instructs encode to skip empty lines (lines with no characters, not even blanks, in them) during encoding.

encode accepts a number of further options. Please refer to the encode manual page for the most recent description of the program.

Please be aware that encode writes its output files into the current directory unless the -d option is specified. Be careful when you add (or update) a positional attribute to a corpus after encoding the corpus itself: a loss of data may occur during the encoding of the positional attribute, since it tries to write the data to the same files in which the corpus data may already be stored. Either you should pass the -p option to prefix the files belonging to the first column, or put each positional attribute in a directory of its own to prevent encode from overwriting important data. It is a very good idea to change the file access mode of all files belonging to a positional attribute to non-writeable for anyone after encoding, in order to prevent accidental overwriting.

After encoding a corpus, each positional attribute has to be declared in the corpus registry. Please refer to chapter 4 for a detailed description of how to do this. The makeall program, which constructs the second set of files during the encoding process, will not work on undeclared positional attributes or corpora.

3.3 The makeall program

After a positional attribute is declared in the registry, makeall must be run to construct the necessary index files. There are no options, and the only required argument makeall accepts is the symbolic name of the corpus for which the index files shall be produced:

    makeall treebank

This will produce all missing files for all positional attributes declared for the corpus treebank.

If you only want to produce the files for a single positional attribute, give the name of the attribute as an additional argument:

    makeall treebank pos

(given that these symbolic names are those of the respective positional attributes). Note that this call can be issued from any point in the file system, since makeall looks up in the registry to find the corpus data. Data is only written to the directory specified in the registry description file. It is therefore a good idea to run makeall either on a very fast machine or on a machine which locally holds the disks the corpus is stored on (to reduce network load in case of NFS-mounted file systems).
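Before encode and makeall are run on a large corpus, it can save time to check the input file against the format rule from section 3.2: every line is either a declared structure marker in angle brackets or a line with a fixed number of tabulator-separated columns, and no item should end in a blank. The following Python sketch is not part of the toolset; the function name check_encode_input and its parameters are hypothetical and only illustrate the rule.

```python
import re

# A structure marker line such as <s>, </s> or <article>. Everything after a
# blank or after the closing '>' is ignored, as described in section 3.2.
MARKER = re.compile(r"^</?([a-z][a-z0-9]*)[\s>]")


def check_encode_input(lines, n_columns, structures):
    """Return (line_number, message) pairs for lines that break the input rule.

    lines       -- iterable of input lines (hypothetical helper, not an encode API)
    n_columns   -- number of positional attributes, i.e. tab-separated columns
    structures  -- set of structural attribute names that will be passed with -S
    """
    problems = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        match = MARKER.match(line)
        if match and match.group(1) in structures:
            continue  # declared structure marker: no attribute values expected
        columns = line.split("\t")
        if len(columns) != n_columns:
            problems.append(
                (number, f"expected {n_columns} columns, found {len(columns)}")
            )
        elif any(value != value.rstrip(" ") for value in columns):
            problems.append((number, "trailing blank in an attribute value"))
    return problems
```

An undeclared marker such as <p> is reported here only when more than one column is expected, which mirrors the behaviour described above: with a single column it would simply be treated as a literal item.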

Note: makeall will currently try to create non-compressed files for attributes which already have the complete data in compressed form. This is a bug and will be fixed in a future release.

After running makeall on all positional attributes of a corpus, the corpus is ready for use.

makeall currently produces a lot of debugging output. This output is only important when errors occur or when the program has to be debugged.

When trying to encode really big corpora (20 million words and above), encode and makeall may have problems due to memory or swap space limitations and yield error messages like "can't allocate memory" or "not enough memory". These problems can only be solved by providing more swap space to the machine the program is running on (or by running encode/makeall on another machine at your site which has enough swap space). The available swap space is shown with the pstat -s command and should be enough to hold the whole corpus data (see below in section 3.4). Ask your system administrator for further help.

3.4 Space requirements

This section gives some hints on how much space will be used by encoded attributes. Currently, we only cover positional and structural attributes here.

3.4.1 Positional attributes

Let A be the sequence of items captured by the positional attribute. The length of this sequence, |A|, therefore is the number of elements of this sequence. Further, let D be the set of distinct strings encoded in A (the list of different words, for example). Then, |D| is the number of distinct words.

Two numbers are important:

- the number of items in the input file (|A|), which is equal to the number of lines in the one-word-per-line input file for encode (minus the number of structural annotation markers, if present). This number can, for example, be computed with the wc -l command (maybe pre-piped with a grep -v command to get rid of the structural annotation markers);
- the number of distinct items in the input file (|D|). This number can be computed by running the pipe sort -u | wc -l over the respective column in the one-word-per-line input text file.

Another important number is the space needed for the one-time representation of all different items, which here is denoted S. This is the sum of the lengths of each different word plus one:

$$S = \sum_{s \in D} (\mathrm{strlen}(s) + 1) = |D| + \sum_{s \in D} \mathrm{strlen}(s)$$

The adding of 1 is necessary since a null character ('\0') is added to each string. The number is given by running the pipe sort -u | wc -c over the respective column in the input file. [6]

Now, the size of one positional attribute (in bytes) can be computed as follows:

$$M_p = 2(|A| \cdot 4) + S + 4(4|D|) = 8|A| + 16|D| + S$$

The size of the input text file does not go into this formula, since it is not needed anymore after encoding.

For each positional attribute, this formula has to be evaluated again, since the number of different items in the attribute (|D|) and the space needed to represent them once (S) usually differ between several positional attributes (|A| must be constant for any two positional attributes of a corpus). Since positional attributes other than the word attribute usually have a small number of distinct values, the space needed to represent a positional attribute mainly depends on its length, which is not very surprising. A less accurate number for the space requirement can be roughly estimated by multiplying the size of the uncompressed input text file(s) by 2.

3.4.2 Structural attributes

The data of a structural attribute, for example sentence boundaries, is stored in a single file, as an ordered sequence of corpus intervals (that is, pairs of corpus positions). So, the computation of the space needed to represent the information of one structural attribute is very simple: let S be the structural attribute. Then, |S| is the number of intervals stored in this attribute ("the number of sentences"), each being a pair of two corpus positions. Since each corpus position is stored as a 4-byte integer number, the space can be computed as follows:

$$M_s = |S| \cdot 4 \cdot 2 = 8|S|$$

So, if you want to represent 1000 sentences, you need 8000 bytes to store the data.

[6] The columns can be extracted from a multi-column file with the cut program, which is contained within GNU's/FSF's set of text utilities, or with the awk utility.
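The two formulas of this section can be evaluated with a few lines of Python, as an alternative to the wc and sort pipes. This is only a sketch under stated assumptions: the function names are hypothetical, and strlen is approximated by the byte length of each string in an encoding chosen by the caller (the default below is an assumption, not something the manual specifies).

```python
def positional_attribute_size(items, encoding="utf-8"):
    """Estimate M_p = 8|A| + 16|D| + S in bytes for one positional attribute.

    items    -- attribute values in corpus order, one per corpus position
    encoding -- assumed byte encoding used to measure strlen(s)
    """
    n_items = len(items)          # |A|
    distinct = set(items)         # D
    n_distinct = len(distinct)    # |D|
    # S: every distinct string once, plus one terminating null byte each.
    lexicon_bytes = sum(len(s.encode(encoding)) + 1 for s in distinct)
    return 8 * n_items + 16 * n_distinct + lexicon_bytes


def structural_attribute_size(n_intervals):
    """Estimate M_s = 8|S|: two 4-byte corpus positions per interval."""
    return 8 * n_intervals


if __name__ == "__main__":
    words = ["the", "board", "the", "director"]
    print(positional_attribute_size(words))   # 8*4 + 16*3 + (4+6+9) = 99
    print(structural_attribute_size(1000))    # 8000
```

For the four-item example, |A| = 4, |D| = 3 and S = 19, so the estimate is 99 bytes; the second call reproduces the 8000 bytes for 1000 sentences given above.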

Chapter 4

The corpus registry

The corpus registry holds, for each corpus being processed by the workbench, a file which describes where in the file system the various files which build the corpus are stored. Additionally, the registry holds a file which describes who can access a local corpus from remote hosts and a file which captures a log of all remote connections to local corpora. This chapter only describes the description files for local corpora and the description files for remote corpora; the two additional files which are necessary for remote connections are described in chapter 5.

The registry is simply a globally accessible directory, called the registry directory, and holds, for each corpus, a file with the same name as the corpus name, in lowercase letters. [1] Such a file is called the registry file of the corpus. The contents of the file, described in section 4.2 below, define which annotations are associated with the corpus and where the data is stored. An annotation which is not defined in the registry file of a corpus cannot be accessed by any of the tools. Similarly, when an attribute is defined in the registry, it is supposed to be accessible by all tools. It is within the responsibility of the corpus administrator (you!) to assure that all annotations defined in all registry files are accessible, and that only those attributes are defined which are in fact accessible.

[1] In Cqp, all corpus names are entered in uppercase, but they are converted to lowercase to load the correct corpus.

Currently, there are some rules how to name the attributes of a corpus and how to name the respective registry files. These rules are described in the following section.

At the end of this chapter, section 4.5 summarizes the single steps which you should follow when preparing and registering a new corpus.

As already mentioned in section 1.2 above, corpora exist where access has to be controlled. This issue is discussed in chapter 7 below.

4.1 Some remarks about nomenclature

Two simple rules have to be obeyed for the definition of names for corpora and attributes:

- corpus and attribute names must begin with a lowercase letter, and may be followed by an arbitrarily long sequence of lowercase letters or digits.

By default, the tools expect the registry directory to be /corpora/c1/registry. Since at your site this directory most probably does not exist, the default value can be overridden by the environment variable CORPUS_REGISTRY. Please do not add a slash at the end of the value of this variable. We suggest that either each user of the tools sets this environment variable in his or her .tcshrc or .cshrc shell initialization file in his/her home directory, or that it is set in one of the global shell initialization files, which usually reside in /etc and are only writeable by the system administrator or the superuser. Please ask your system administrator for further help in case you don't know where or how to set the variable.

Additionally, almost all tools of the toolbox take the -r command line option to specify a non-default registry directory.

4.2 The contents of a registry file

In the registry file, all attributes of a corpus are declared. Additionally, some "global" variables are set.

A registry file may contain empty lines. A comment begins with a hash mark (#); everything up to the end of the line is not read.

The format of a registry file is:

    <Header information and global variables>
    <Attribute definitions>

This order has to be kept in all registry files. In the attribute definition section, the attributes may be declared in any order.

4.2.1 The header

The header consists of 4 declarations of values, each of which is preceded by the field name. The field names (keywords) are all uppercase.

The field names are

- a short (one-line) description of the corpus (keyword NAME). The field value is a string enclosed in double quotes;
- a unique identifier (keyword ID). The field value is a symbol. Usually, the field value should be the same as the file name of the registry file;
- optionally, the "home directory" of the corpus (keyword HOME). The field value is a path (not enclosed in double quotes);

- optionally, the path of the "info file" of the corpus (keyword INFO). This text file should contain a description of the corpus, its annotations, perhaps administrative information, etc. It is also a good idea to include a description of the tagset of the part-of-speech annotation there, if the corpus is tagged.

If the HOME field is missing, you have to specify the path for each attribute, so it is more convenient to define this field when all corpus-related data files are kept in a single directory.

If the INFO field is missing, no corpus information can be displayed in Cqp or Xkwic (in Cqp, this is done with the info command; in Xkwic, this file is displayed when the Info button is selected in the corpus list).

After the header, the set of corpus attributes (annotations) is declared. The different types of annotations are

- positional attributes (section 4.2.2);
- structural attributes (section 4.2.3);
- mapping tables (section 4.2.4);
- ngram tables (section 4.2.5);
- alignment information (section 4.2.6);
- dynamic attributes (section 4.2.7).

As said before, all annotations may be declared in almost any order (but see the notes in the case of mapping tables and bigram tables).

4.2.2 Positional attributes

A positional attribute is encoded as a set of seven files, which have been described above in section 3.2. These files are called the components of a positional attribute. A registry file defines, for each component of a positional attribute, the file name in which the component is stored. It is not necessary to manually set the names of all components, since there are default rules how to compute undefined component file names from the defined ones. In fact, we suggest to rely on the default names which encode assigns to the files. Then, you do not have to declare component paths at all.

The declaration of a positional attribute looks as follows:

    ATTRIBUTE Name OptBody

where Name is the identifier for the attribute (like word, pos, lemma, ...). The OptBody is optional and is only needed when you have to define non-default file names.

When you use the body, it can be one of the following:

- a definition of the component paths, CompPathSpec;

- or the declaration that the attribute is found on a remote host, RemoteSpec;
- or a path which overwrites the HOME field of the corpus and says that all components are stored at a different place, AltPath.

The RemoteSpec currently is not supported, since in the actual distribution, access to remotely stored corpora is disabled. When the AltPath declaration is used, it declares a path different from the corpus path for this special attribute.

The component path specification, CompPathSpec, must be enclosed in braces {...}. Within these braces, a sequence of component name/path specification pairs is listed. Each component name may only occur once:

    {
      ComponentID PathSpec
      ...
    }

One component is "virtual" (DIR) and doesn't describe file names but rather denotes the "home directory" of the attribute. Two other virtual components are ANAME, the name of the attribute just being defined, and APATH, which is the "home directory" of the corpus and defaults to the value of the HOME field of the header (if present).

The PathSpec is a standard path, in which macros may be used. Such a macro is started with a dollar sign $ and directly followed by a component name. For example, the macro $ANAME represents the value of the attribute name. Macro values may be used in path specifications. For example,

    $APATH/$ANAME.f

stands for the concatenation of the value of the APATH variable (usually the corpus home directory), followed by a slash, followed by the value of the ANAME variable, and then followed by .f.

In general, it is possible to refer to the value of any component by prefixing its component identifier with a dollar sign ($). Thus, when a component value is defined, it is possible to use the values of previously defined components in the definition. It is an error when you refer to a component value which is not yet defined.

The following table lists the components, the component identifiers and the default value or macro through which the (default) value is computed:
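The macro mechanism described in this section can be illustrated with a small Python sketch. It is not the registry parser of the workbench: the function name expand_path_spec is hypothetical, the directory /corpora/treebank is an invented example value, and the sketch assumes that component identifiers consist of uppercase letters only.

```python
import re

# Assumption: a macro is '$' directly followed by an uppercase component name.
MACRO = re.compile(r"\$([A-Z]+)")


def expand_path_spec(path_spec, components):
    """Replace every $NAME in path_spec by the value of component NAME.

    components -- dict of component values defined so far; referring to a
                  component that is not yet defined is reported as an error.
    """
    def substitute(match):
        name = match.group(1)
        if name not in components:
            raise KeyError(f"component {name} is not defined yet")
        return components[name]

    return MACRO.sub(substitute, path_spec)


if __name__ == "__main__":
    defined = {"APATH": "/corpora/treebank", "ANAME": "word"}
    print(expand_path_spec("$APATH/$ANAME.f", defined))
    # prints /corpora/treebank/word.f
```

Raising an error for an unknown name mirrors the rule that a component value must be defined before it is referred to.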


More information

Sending an Email Message from a Process

Sending an Email Message from a Process Adobe Enterprise Technical Enablement Sending an Email Message from a Process In this topic, you will learn how the Email service can be used to send email messages from a process. Objectives After completing

More information

Table Of Contents. iii

Table Of Contents. iii PASSOLO Handbook Table Of Contents General... 1 Content Overview... 1 Typographic Conventions... 2 First Steps... 3 First steps... 3 The Welcome dialog... 3 User login... 4 PASSOLO Projects... 5 Overview...

More information

NLP Programming Tutorial 5 - Part of Speech Tagging with Hidden Markov Models

NLP Programming Tutorial 5 - Part of Speech Tagging with Hidden Markov Models NLP Programming Tutorial 5 - Part of Speech Tagging with Hidden Markov Models Graham Neubig Nara Institute of Science and Technology (NAIST) 1 Part of Speech (POS) Tagging Given a sentence X, predict its

More information

The make utility. Basics

The make utility. Basics The make utility Basics make is a utility that helps keep the executable versions of programs current. It automatically updates a target file when changes are made to the files used to build the target.

More information

IBM MaaS360 Mobile Document Editor User Guide

IBM MaaS360 Mobile Document Editor User Guide IBM MaaS360 Mobile Document Editor User Guide Introduction MaaS360 Mobile Document Editor allows you to edit files directly in IBM MaaS360 Secure Mobile Mail or in your IBM MaaS360 Docs Repository. MaaS360

More information

Data Intensive Computing Handout 6 Hadoop

Data Intensive Computing Handout 6 Hadoop Data Intensive Computing Handout 6 Hadoop Hadoop 1.2.1 is installed in /HADOOP directory. The JobTracker web interface is available at http://dlrc:50030, the NameNode web interface is available at http://dlrc:50070.

More information

A Time Efficient Algorithm for Web Log Analysis

A Time Efficient Algorithm for Web Log Analysis A Time Efficient Algorithm for Web Log Analysis Santosh Shakya Anju Singh Divakar Singh Student [M.Tech.6 th sem (CSE)] Asst.Proff, Dept. of CSE BU HOD (CSE), BUIT, BUIT,BU Bhopal Barkatullah University,

More information

Document Management: Document Imaging System Setup

Document Management: Document Imaging System Setup Document Management: DocMan, Release 4.1 2003 Enterprise Computer Systems, Inc., Greenville, SC Notice This manual is provided to enhance your knowledge of the software product. It is your responsibility

More information

FAR014: A Day in the Life for a Mars Planner

FAR014: A Day in the Life for a Mars Planner FAR014: A Day in the Life for a Mars Planner featuring the new Replenishment Workbench Robert Collom, Supply Capability Manager Mars Information Services Diana Mitten, Product Director JDA Agenda Introduction

More information

CHAPTER 11 LEGAL ACCOUNTING MODULE 11.0 OVERVIEW 11.1 REQUIREMENTS AND INSTALLATION. 11.1.1 Special Requirements. 11.1.

CHAPTER 11 LEGAL ACCOUNTING MODULE 11.0 OVERVIEW 11.1 REQUIREMENTS AND INSTALLATION. 11.1.1 Special Requirements. 11.1. EXTENDED SERVICE OPTIONS CHAPTER 11 11.0 OVERVIEW The Legal Accounting Module provides line item tracking of legal expenses incurred during the collection process. You can track expenses incurred by the

More information

Horizon Debt Collect. User s and Administrator s Guide

Horizon Debt Collect. User s and Administrator s Guide Horizon Debt Collect User s and Administrator s Guide Microsoft, Windows, Windows NT, Windows 2000, Windows XP, and SQL Server are registered trademarks of Microsoft Corporation. Sybase is a registered

More information

Eligible Professional Menu Measure Frequently Asked Questions

Eligible Professional Menu Measure Frequently Asked Questions Eligible Professional Menu Measure Frequently Asked Questions Drug Formulary Checks 1. If an EP is unable to meet the measure of a meaningful use objective because it is outside of the scope of his or

More information

An Eprints Apache Log Filter for Non-Redundant Document Downloads by Browser Agents

An Eprints Apache Log Filter for Non-Redundant Document Downloads by Browser Agents An Eprints Apache Log Filter for Non-Redundant Document Downloads by Browser Agents Ed Sponsler Caltech Library System http://resolver.caltech.edu/caltechlib:spoeal04 December, 2004 Contents 1 Abstract

More information

Microsoft Dynamics GP. Pay Steps for Human Resources Release 9.0

Microsoft Dynamics GP. Pay Steps for Human Resources Release 9.0 Microsoft Dynamics GP Pay Steps for Human Resources Release 9.0 Copyright Copyright 2006 Microsoft Corporation. All rights reserved. Complying with all applicable copyright laws is the responsibility of

More information

Introduction to Linux operating system. module Basic Bioinformatics PBF

Introduction to Linux operating system. module Basic Bioinformatics PBF Introduction to Linux operating system module Basic Bioinformatics PBF What is Linux? A Unix-like Operating System A famous open source project Free to use, distribute, modify under a compatible licence

More information

PageR Enterprise Monitored Objects - AS/400-5

PageR Enterprise Monitored Objects - AS/400-5 PageR Enterprise Monitored Objects - AS/400-5 The AS/400 server is widely used by organizations around the world. It is well known for its stability and around the clock availability. PageR can help users

More information

RiOffice Users Manual

RiOffice Users Manual RiOffice Users Manual Rio Networks 9/23/2009 Contents Available Services... 4 Core PBX Features... 4 Voicemail Features... 4 Call Center Features... 4 Call Features... 4 Using Your Phone... 5 Phone Layout...

More information

Intercluster Lookup Service

Intercluster Lookup Service When the (ILS) is configured on multiple clusters, ILS updates Cisco Unified Communications Manager with the current status of remote clusters in the ILS network. The ILS cluster discovery service allows

More information

Data Intensive Computing Handout 4 Hadoop

Data Intensive Computing Handout 4 Hadoop Data Intensive Computing Handout 4 Hadoop Hadoop 1.2.1 is installed in /HADOOP directory. The JobTracker web interface is available at http://dlrc:50030, the NameNode web interface is available at http://dlrc:50070.

More information

CSCI 5417 Information Retrieval Systems Jim Martin!

CSCI 5417 Information Retrieval Systems Jim Martin! CSCI 5417 Information Retrieval Systems Jim Martin! Lecture 9 9/20/2011 Today 9/20 Where we are MapReduce/Hadoop Probabilistic IR Language models LM for ad hoc retrieval 1 Where we are... Basics of ad

More information

System and Network Management

System and Network Management - System and Network Management Network Management : ability to monitor, control and plan the resources and components of computer system and networks network management is a problem created by computer!

More information

VBA Microsoft Access 2007 Macros to Import Formats and Labels to SAS

VBA Microsoft Access 2007 Macros to Import Formats and Labels to SAS WUSS 2011 VBA Microsoft Access 2007 Macros to Import Formats and Labels to SAS Maria S. Melguizo Castro, Jerry R Stalnaker, and Christopher J. Swearingen Biostatistics Program, Department of Pediatrics

More information

Web Development using PHP (WD_PHP) Duration 1.5 months

Web Development using PHP (WD_PHP) Duration 1.5 months Duration 1.5 months Our program is a practical knowledge oriented program aimed at learning the techniques of web development using PHP, HTML, CSS & JavaScript. It has some unique features which are as

More information

Installing and Setting up Microsoft DNS Server

Installing and Setting up Microsoft DNS Server Training Installing and Setting up Microsoft DNS Server Introduction Versions Used Windows Server 2003 Setup Used i. Server Name = martini ii. Credentials: User = Administrator, Password = password iii.

More information

Click-To-Talk. ZyXEL IP PBX License IP PBX LOGIN DETAILS. Edition 1, 07/2009. LAN IP: https://192.168.1.12 WAN IP: https://172.16.1.1.

Click-To-Talk. ZyXEL IP PBX License IP PBX LOGIN DETAILS. Edition 1, 07/2009. LAN IP: https://192.168.1.12 WAN IP: https://172.16.1.1. Click-To-Talk ZyXEL IP PBX License Edition 1, 07/2009 IP PBX LOGIN DETAILS LAN IP: https://192.168.1.12 WAN IP: https://172.16.1.1 Username: admin Password: 1234 www.zyxel.com Copyright 2009 ZyXEL Communications

More information

Microsoft Dynamics GP. Field Service Preventive Maintenance

Microsoft Dynamics GP. Field Service Preventive Maintenance Microsoft Dynamics GP Field Service Preventive Maintenance Copyright Copyright 2011 Microsoft. All rights reserved. Limitation of liability This document is provided as-is. Information and views expressed

More information

SWIFT MT940 MT942 formats for exporting data from OfficeNet Direct

SWIFT MT940 MT942 formats for exporting data from OfficeNet Direct SWIFT MT940 MT942 formats for exporting data from OfficeNet Direct January 2008 2008 All rights reserved. With the exception of the conditions specified in or based on the 1912 Copyright Act, no part of

More information

FortiVoice. Version 7.00 User Guide

FortiVoice. Version 7.00 User Guide FortiVoice Version 7.00 User Guide FortiVoice Version 7.00 User Guide Revision 2 28 October 2011 Copyright 2011 Fortinet, Inc. All rights reserved. Contents and terms are subject to change by Fortinet

More information

Connect Ticket Entry. Quick Reference Guide

Connect Ticket Entry. Quick Reference Guide Connect Ticket Entry Quick Reference Guide Davisware 514 Market Loop West Dundee, IL 60118 Phone: (847) 426-6000 Fax: (847) 426-6027 Contents are the exclusive property of Davisware. Copyright 2015. All

More information

Template and Daily Schedules

Template and Daily Schedules IDX PATIENT SCHEDULING APPLICATION Template and Daily Schedules TRAINING GUIDE IDX 9.0 MSU HEALTHTEAM TRAINING AND EDUCATION JANUARY 2007 1 MODULE 1 OVERVIEW OF SCHEDULES... 3 MAINTENANCE ACTIVITIES...

More information

Using Process Monitor

Using Process Monitor Using Process Monitor Process Monitor Tutorial This information was adapted from the help file for the program. Process Monitor is an advanced monitoring tool for Windows that shows real time file system,

More information

Chapter 24: Creating Reports and Extracting Data

Chapter 24: Creating Reports and Extracting Data Chapter 24: Creating Reports and Extracting Data SEER*DMS includes an integrated reporting and extract module to create pre-defined system reports and extracts. Ad hoc listings and extracts can be generated

More information

Borderware Firewall Server Version 7.1. VPN Authentication Configuration Guide. Copyright 2005 CRYPTOCard Corporation All Rights Reserved

Borderware Firewall Server Version 7.1. VPN Authentication Configuration Guide. Copyright 2005 CRYPTOCard Corporation All Rights Reserved Borderware Firewall Server Version 7.1 VPN Authentication Configuration Guide Copyright 2005 CRYPTOCard Corporation All Rights Reserved http://www.cryptocard.com Overview The BorderWare Firewall Server

More information

CPSM MEDITECH 5.67. Inventory Inquiries

CPSM MEDITECH 5.67. Inventory Inquiries CPSM MEDITECH 5.67 Inventory Inquiries Contents CPSM Inventory Inquires... 2 Stock Inquiry... 2 Select... 11 Item Inquiry... 16 Purchase Order Inquiry... 32 Check Purchase Order Number... 38 View Vendor

More information

enabling prepaid products & services everywhere General description: This documentation will illustrate how to configure the AMS Gateway Airtime Module for IQ Business/IQ Enterprise/IQ POS/IQ Free POS

More information

Chapter 6: Data Entry

Chapter 6: Data Entry Chapter 6: Data Entry The Imports module in SEER*DMS includes data entry screens that allow you to enter data using the keyboard. This feature is used to key in data printed on paper forms or images of

More information

Trustkeeper PCI Compliance Guide for Merchants

Trustkeeper PCI Compliance Guide for Merchants Trustkeeper PCI Compliance Guide for Merchants For questions about Trustkeeper and the enrollment process please contact Trustwave at 866-659-9067. 1. Register yourself with Trustkeeper The first step

More information

Paperless Collection System. PCS Collector

Paperless Collection System. PCS Collector Paperless Collection System PCS Collector About this Manual This IDX Training Manual is written to give you a step-by-step guide for your classroom training and a handy reference for your daily work. The

More information

Applying Co-Training Methods to Statistical Parsing. Anoop Sarkar http://www.cis.upenn.edu/ anoop/ anoop@linc.cis.upenn.edu

Applying Co-Training Methods to Statistical Parsing. Anoop Sarkar http://www.cis.upenn.edu/ anoop/ anoop@linc.cis.upenn.edu Applying Co-Training Methods to Statistical Parsing Anoop Sarkar http://www.cis.upenn.edu/ anoop/ anoop@linc.cis.upenn.edu 1 Statistical Parsing: the company s clinical trials of both its animal and human-based

More information

This presentation explains how to monitor memory consumption of DataStage processes during run time.

This presentation explains how to monitor memory consumption of DataStage processes during run time. This presentation explains how to monitor memory consumption of DataStage processes during run time. Page 1 of 9 The objectives of this presentation are to explain why and when it is useful to monitor

More information

Apply PERL to BioInformatics (II)

Apply PERL to BioInformatics (II) Apply PERL to BioInformatics (II) Lecture Note for Computational Biology 1 (LSM 5191) Jiren Wang http://www.bii.a-star.edu.sg/~jiren BioInformatics Institute Singapore Outline Some examples for manipulating

More information

INASP: Effective Network Management Workshops

INASP: Effective Network Management Workshops INASP: Effective Network Management Workshops Linux Familiarization and Commands (Exercises) Based on the materials developed by NSRC for AfNOG 2013, and reused with thanks. Adapted for the INASP Network

More information

DOMIQ, SIP and Mobotix cameras

DOMIQ, SIP and Mobotix cameras DOMIQ, SIP and Mobotix cameras This tutorial is the second in the series in which we present integration of Mobotix devices with the DOMIQ system. The main subject of this tutorial is the implementation

More information

Configuring Denial of Service Protection

Configuring Denial of Service Protection 24 CHAPTER This chapter contains information on how to protect your system against Denial of Service (DoS) attacks. The information covered in this chapter is unique to the Catalyst 6500 series switches,

More information

Logging Service and Log Viewer for CPC Monitoring

Logging Service and Log Viewer for CPC Monitoring TECHNICAL BULLETIN Logging Service and Log Viewer for CPC Monitoring Overview CPC has developed a set of add-on programs for its Monitoring software that generates logs of events and errors encountered

More information

Real Estate Reports Overview Quick Reference Guide

Real Estate Reports Overview Quick Reference Guide Real Estate Reports Overview Quick Reference Guide Overview This guide shows you the options available for customising the standard RE reports available in SAP. It covers the following: Using individual

More information

Thirdlane User Portal 2.1. Users Guide 05/12/2008. Third Lane Technologies, LLC 39 Power Lane Fairfax, CA 94930. http://www.thirdlane.

Thirdlane User Portal 2.1. Users Guide 05/12/2008. Third Lane Technologies, LLC 39 Power Lane Fairfax, CA 94930. http://www.thirdlane. Thirdlane User Portal 2.1 Users Guide 05/12/2008 Third Lane Technologies, LLC 39 Power Lane Fairfax, CA 94930 http://www.thirdlane.com Copyright 2003-2008. Third Lane Technologies, LLC. All rights reserved.

More information

VTiger CRM + Joomla/ChronoForms Integration

VTiger CRM + Joomla/ChronoForms Integration VTiger CRM + Joomla/ChronoForms Integration Table of Contents 1.- Configuration of VTiger... 2 A.- Enabling the Webforms module... 2 B.- Creating a service user... 2 C.- Editing the Webforms configuration

More information

Avaya Network Configuration Manager User Guide

Avaya Network Configuration Manager User Guide Avaya Network Configuration Manager User Guide May 2004 Avaya Network Configuration Manager User Guide Copyright Avaya Inc. 2004 ALL RIGHTS RESERVED The products, specifications, and other technical information

More information

COSC 3351 Software Design. Architectural Design (II) Edgar Gabriel. Spring 2008. Virtual Machine

COSC 3351 Software Design. Architectural Design (II) Edgar Gabriel. Spring 2008. Virtual Machine COSC 3351 Software Design Architectural Design (II) Spring 2008 Virtual Machine A software system of virtual machine architecture usually consists of 4 components: Program component: stores the program

More information