The IMS Corpus Workbench
Corpus Administrator's Manual

Oliver Christ
oli@ims.uni-stuttgart.de

Institut für maschinelle Sprachverarbeitung
Universität Stuttgart
- Computerlinguistik -
Azenbergstr. 12
D 70174 Stuttgart 1

Created: Thu Feb 24 10:34:11 1994 (oli)
Last Modified: Wed Nov 9 14:33:27 1994 (oli)
Released: - not yet -
tc.bibentry: christ:94b
Contents

1 Overview
  1.1 Introduction
  1.2 The role of the Corpus Administrator
  1.3 Organization of this manual
  1.4 Credits and Acknowledgements
2 Internal corpus representation
  2.1 Positional attributes
    2.1.1 Integerized files
    2.1.2 Inversed item sequence
    2.1.3 Example: a simple word search
    2.1.4 The set of positional attributes
  2.2 Other attribute types
    2.2.1 Structural attributes
    2.2.2 Alignment attributes
    2.2.3 Bigram and mapping tables
  2.3 External tools and dynamic attributes
3 Encoding: Transforming a corpus into its internal representation
  3.1 The internal representation of a corpus
  3.2 The encode program
  3.3 The makeall program
  3.4 Space requirements
    3.4.1 Positional attributes
    3.4.2 Structural attributes
4 The corpus registry
  4.1 Some remarks about nomenclature
  4.2 The contents of a registry file
    4.2.1 The header
    4.2.2 Positional attributes
    4.2.3 Structural attributes
    4.2.4 Mapping tables
    4.2.5 ngram tables
    4.2.6 Alignment attributes
    4.2.7 Dynamic attributes
  4.3 Registration of remote corpora
  4.4 A last example
  4.5 Steps to follow
5 Remote access - client and server setup
  5.1 The .rat and .ratlog files
  5.2 How to start the corpus data server
6 Utilities and debugging tools
  6.1 Decoding of corpus and attribute information
    6.1.1 Decoding of corpus information: decode
    6.1.2 Decoding of word lists: lexdecode
  6.2 Creation and Decoding of Bigram Tables
    6.2.1 Creation of bigram tables: gen-bigrams
    6.2.2 Decoding of bigram tables: decode-bigrams
  6.3 Creation and Decoding of Mapping Tables
    6.3.1 Creation of mapping tables: gen-mapping-table
    6.3.2 Decoding of mapping tables: decode-mapping-table
  6.4 General utilities
    6.4.1 Comparing word lists and corpora: check-coverage
    6.4.2 Converting internal integers to readable numbers: itoa
    6.4.3 Converting readable numbers to internal integers: atoi
7 Access control and security issues
  7.1 Controlling local access to corpora
  7.2 Controlling remote access to corpora
A Hardware and operating system requirements
B Reused software packages and copyright notices
  B.1 The regular expression matcher by Henry Spencer
Chapter 1
Overview

1.1 Introduction

The IMS corpus workbench is a set of tools for the efficient encoding, representation and querying of large text corpora. This manual describes how to encode a text corpus and how the various administration tools must be used to transform a text corpus into the representation used by the access tools. This manual does not describe the functionality of the query tools or the architecture of the workbench in general. If the reader is not familiar with the overall architecture, we recommend a "top-down" reading through the more general papers, especially [Christ, 1994] for an overview of the system architecture as a whole.

The internal representation of a corpus consists of a set of files which represent the corpus data, the different "items" (i.e., words) used in the corpus and several index files for efficient lookup. To transform a text corpus from its textual representation to the internal representation used by the IMS workbench, the following steps have to be performed:

1. transformation of the text file into one-word-per-line format;
2. encoding of the text file;
3. declaration of the corpus in a global "registry directory";
4. and building several file indices.

Steps 1 and 3 have to be done manually; for steps 2 and 4 there are tools within the workbench.

The third step, the registration of a corpus, is necessary since almost all tools (except the one used in step 2 above) access a corpus via a symbolic name. When a symbolic name is passed to a tool, it is looked up in a central directory (called the "corpus registry"), where the tool expects to find a file with the very same name as the symbolic name of the corpus to be accessed. This file holds a description of the components of the corpus, mainly a list of where the components are stored physically. So, a user does not have to know where a corpus is stored; he or she only has to know its symbolic name to access it.
After a corpus is transformed into its internal representation and registered, it can be used by the various tools of the workbench, for example the query tools (xkwic, cqp, print-aligned).

1.2 The role of the Corpus Administrator

Within the IMS corpus workbench, the corpus administrator has the task of providing users with new corpora or changing existing corpora when some information has to be added or updated. Second, the administrator has to properly install the usually large corpus files in the file system and to find an "optimal" place with regard to backup policies, disk usage and access efficiency. Third, the corpus administrator has to care about access control, since corpora exist where copyrights or license agreements inhibit unrestricted access, even within one institution. These tasks are similar to "standard" system administration. We therefore suggest that the corpus administration tasks are fulfilled by someone who is familiar with the standard UNIX text processing tools, backup strategies, and security and access control, that is, your local system administrator.

1.3 Organization of this manual

This manual is organized as follows. The next chapter 2 explains the internal data structures which are used to store the corpus data. You may skip the entire chapter if you want; it is not necessary for the other chapters, but useful if you have problems with the tools or want to learn how to manipulate the data files. Chapter 3, then, describes the various steps which are necessary to transform a corpus from its textual representation into its internal representation. Chapter 4, then, describes in detail the registry directory and the format of the files which describe the physical attributes of a corpus. Chapter 5 describes how to set up the client-server capabilities of the workbench. Chapter 6 describes utilities and related tools which either add more (or other types of) information to a corpus or are useful for other purposes, for example for debugging of a corpus (during encoding). Another important point is access control for corpora, which is discussed in chapter 7. To check whether the tools can run at all on your system, you may refer to appendix A for hardware and operating system specific requirements of our tools.
1.4 Credits and Acknowledgements

The internal data structures we use in Cqp, Xkwic and some other tools of the workbench make use of integerized data files and reversed indices. Both of these techniques have been well known in the area of information processing for many decades, but to our knowledge the first who applied them to text and corpus processing in the linguistic area was Ken Church. He deserves our greatest thanks for pointing us to these methods.
Neither the authors, nor IMS, nor the University of Stuttgart make any representations about the suitability of the software described herein or the associated documentation for any purpose. It is provided "as is" without express or implied warranty. We disclaim all warranties with regard to the software described herein or the related documentation, including all implied warranties of merchantability and fitness. In no event shall we be liable for any special, indirect or consequential damages or any damages whatsoever resulting from loss of use, data or profits, whether in an action of contract, negligence or other tortious action, arising out of or in connection with the use or performance of this software.
Chapter 2
Internal corpus representation

This chapter explains how corpus data is represented internally. When you understand the internal representation, you can use the tools of the toolbox to create, update or change corpus information without having to go back to the textual version and encode the whole stuff again. You will also be able to figure out how to encode the internal representation for files which cannot be computed by the tools of the toolbox, for example due to memory problems, software bugs or limitations.

If you do not need to "hack" with the corpus data, you may skip the entire chapter. The understanding of the internal representation is not necessary for the other parts of this manual, but useful if you encounter problems with the tools.

2.1 Positional attributes

Within the IMS corpus workbench, a corpus can have an arbitrary number of annotations of different types. In our system, a corpus is primarily regarded as a sequence of words (not as a sequence of characters). The words, then, are numbered, so that we can directly access the word at a certain corpus position p (i.e., the first word in the corpus, or, in general, the nth word of the corpus). This leads to the more general notion of positional attributes, which is the most important annotation type. Attributes of this class have a (string) value at each corpus position.¹

  position:  0       1       2    3    4      5    ...  n-2       n-1
  word:      Pierre  Vinken  ,    61   years  old  ...  blessing  .
  pos:       N       N       IP   NUM  N      ADJ  ...  N         IP

Figure 2.1: Corpus positions and values

¹ When the corpus is stored in a verticalized one-word-per-line format, a corpus position can also be regarded as the number of a line in this representation.
The corpus text falls within the class of positional attributes, since we can specify, for each corpus position, the word which occurs at that position. The positional attribute which holds the corpus text proper always has the predefined attribute name "word". Other positional attributes are, for example, part-of-speech tags, which are assigned to the words of the corpus. In our view, we regard POS tags as assigned to a corpus position rather than to the word at that position. Then, the positional attributes "word" and "tag" do not differ very much any more: both have, for each corpus position, a value which is, in our case, a string (as illustrated in figure 2.1). We therefore use the same internal representation for the word sequence of the corpus (the corpus text) and the tag sequence (the associated POS tags), as well as for other, additional positional attributes like lemmas, morphosyntactic tags, etc. In other words, a tagged corpus is in our view a set of two positional attributes of equal length, one of which captures the sequence of words, the other captures the sequence of tags. In the following, we therefore use the term "item" to abstract from the type of information encoded in such a positional attribute (here, word vs. tag). A corpus in general, then, is a collection of attributes of different types.

The question of the internal representation of corpora with multiple (positional) annotations can thus be reduced to the question of representing a single positional attribute (remember that all positional attributes must be of equal length, that is, encode item streams of equal length).

The two key concepts of the internal representation of a positional attribute are:

- integerized representation: items are encoded as integer numbers, where equal items (words, ...) get the same integer code. The sequence of items is then represented as a sequence of integer numbers;
- inversed file indices: for the sequence of numbers, an inversed file is created. The inversed file captures, for each item (better: item code), the set of occurrences of the item in the positional attribute.

For the construction of the integer code, you normally need a segmentation or tokenization tool, since "the," and "the" are considered different and undesirably get different codes.
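The integerized representation described above can be made concrete with a small sketch. The Python below is illustrative only and is not the workbench's implementation: the function name `integerize` is hypothetical, and codes are assigned in order of first occurrence, which is just one possible choice.

```python
def integerize(items):
    """Map each distinct item to an integer code and encode the item sequence.

    Assumption for this sketch: codes are handed out in order of first
    occurrence, so the first distinct item gets code 0.
    """
    codes = {}      # item string -> integer code
    lexicon = []    # integer code -> item string
    sequence = []   # the item sequence as integer codes
    for item in items:
        if item not in codes:
            codes[item] = len(lexicon)
            lexicon.append(item)
        sequence.append(codes[item])
    return lexicon, sequence


words = ["the", "cat", "saw", "the", "dog"]
lexicon, sequence = integerize(words)
print(lexicon)    # ['the', 'cat', 'saw', 'dog']
print(sequence)   # [0, 1, 2, 0, 3]
```

Both occurrences of "the" receive code 0, and the original item sequence can be recovered by looking up each code in the lexicon.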
The advantage of the integer code is that the represented items have equal internal length (in the case of integers, 4 bytes on our machines). Since the length of the item sequence is known and the items are of equal length, the item sequence can be handled like an array of items, with the advantage of random access. The inversed file is needed for lookup: since it directly indexes the set of occurrences of a given item (code), the occurrences can be computed in a single step.

2.1.1 Integerized files

As said above, the first set of data structures is an integerized file of the textual representation of the item sequence. Then, the item sequence is represented as a sequence of integer codes.

It is obvious that at least two functions are needed to handle this encoding:
- first, a function to compute the integer code of a given item (a string);
- second, a function to retrieve the (character) string when the code is given.

These two functions require some auxiliary data structures to be efficiently computable. The first data structure is the item list or "lexicon": it captures the set of (different) items. Internally, this is the set of strings occurring in the item sequence, where a NULL character (octal \000) is padded at the end of each word. The file is not sorted (but it may be). A UNIX command to produce this file would be:

(2.1)   sort -u 1wpl-item-seq |
          tr '\n' '\0' > lexicon

where it is assumed that the input item sequence is in one-word-per-line format. In this example, the output would be sorted, but this is not necessary. The item list already defines the item code for each item, since it is assumed that the first item in the item list has code 0, the next one has code 1, and so on.

Note that "traditional" trs delete ASCII 0 (\0) from the input stream, so that the example above will not work with "traditional tr". GNU's tr does not have this bug.

For the lookup of a string in this list, it is useful to have an index of starting positions of the strings in this file. This index gives, for each item code, a mapping from item codes to the file offsets (in bytes) in the item list. Thus, the starting position of the string represented by the item code c is computed in one step via this index, which we call the item list index or lexicon index. Since the strings in the item list are terminated with the NULL character (which must not occur in the items themselves), the string represented by an item code is everything between the starting position computed by the item index up to the next NULL character.

Again, this index can be computed by a UNIX command when the lexicon is already computed:

(2.2)   tr '\0' '\n' < lexicon |
          gawk 'BEGIN { pos = 0 }
                { print pos; pos = pos + length($1) + 1 }' |
          atoi > lexicon.idx

atoi is a utility program included in the toolbox and maps numbers (represented textually as a sequence of digits) to their internal representation.

The next data structure supports the mapping from strings to their item codes. This could be done in a number of different ways; currently, binary search over a sorted string index is implemented.² For example, the same functionality could be achieved with tries, which perhaps would need more space, but could compute the conversion in n steps where n is the length of the input string.

² The method currently implemented in the workbench is very simple and could be sped up a lot, but since it is rarely used (all computations are done on the item codes, whenever possible, instead of strings), we didn't yet convert it to a more efficient method.
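As a hedged illustration of the item list and its index, the following Python sketch builds both in memory and retrieves the string for a given code. The names `build_lexicon` and `item_string` are hypothetical, the Latin-1 encoding is an assumption of this sketch, and the byte layout only mirrors the description above (NULL-terminated strings plus byte offsets); it is not the toolbox's own code.

```python
def build_lexicon(items):
    """Return the item list as NULL-terminated bytes and the offset of each item."""
    data = bytearray()
    offsets = []                      # item code -> byte offset into data
    for item in items:
        offsets.append(len(data))
        # Assumption: items are 8-bit strings; Latin-1 is used for the sketch.
        data += item.encode("latin-1") + b"\0"
    return bytes(data), offsets


def item_string(data, offsets, code):
    """Retrieve the string for an item code: everything up to the next NULL."""
    start = offsets[code]
    end = data.index(b"\0", start)
    return data[start:end].decode("latin-1")


lexicon, lexidx = build_lexicon(["&", "cucumber", "never", "the"])
print(lexidx)                            # [0, 2, 11, 17]
print(item_string(lexicon, lexidx, 3))   # the
```

Each offset is the previous offset plus the length of the previous item plus one byte for the terminator, which is the same arithmetic the gawk command in (2.2) performs.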
The binary search requires a sorted structure. For this purpose, we do not keep a sorted item list, but rather another index (denoted by Ls) which holds, for each position p of an item in a "virtual" sorted item list (ranging from 0 to the number of encoded items minus 1), the item code at this position. So, Ls(0) is the item code of the "smallest" item, and Ls(1) is the code of the second-smallest item, etc. The sorted item list can thus be textually printed by the function

        for (i = 0; i < "size of item set"; i++) {
          code = sortidx[i];
          s = lexidx[code];
          print s;
        }

Here, for each possible position in the sorted index, i, first the item code code at that position is computed. Then, through accessing the item index, the character string represented by code is determined, which is then printed.

As you can imagine, this file can easily be produced by a UNIX command:

(2.3)   tr '\0' '\n' < lexicon |
          gawk '{ print NR-1 "\t" $1 }' |
          sort +1 |
          gawk '{ print $1 }' |
          atoi > lexicon.srt

The first line computes the strings from the item list, which are then prefixed by their code (which is the "position" in the item list), beginning with code 0 for the first word. This list of code/value pairs is then sorted by the values, which occur in the second column. The output of the sorting is then filtered, so that only the codes are printed. The code sequence is then transformed into the internal format and written to the index file.

Note: One of the reasons we do not use these UNIX commands to create the data structures is that the UNIX sort command sometimes handles the order of 8-bit characters differently from other programs (it works with signed characters, whereas internally we work with unsigned characters). So the UNIX commands which use sort only create the same files when the standard 7-bit ASCII character set is used. The internal functions which map from items to item codes will not work properly otherwise.

The data structures used to represent the encoded item sequence and the associated auxiliary data structures which facilitate the necessary mappings are illustrated in figure 2.2.

Figure 2.2: Integerized items and associated data structures (diagram showing the item sequence, the item list index, the item list and the sorted index; item index ==> string)

The only file for which we didn't yet give a UNIX command is the item sequence (or better, the sequence of encoded items). gawk's arrays can be used for this purpose. One possibility is to read an already existing item list file, which may be produced by the commands above. Another possibility is to produce the item list and the indices in a single gawk run. The script below can be used for this second purpose, but it assigns other item codes:

(2.4)   BEGIN {
          maxcode = 0;
          position = 0;
        }
        {
          if (!($1 in itemlist)) {
            print $1 > "lexicon.asc"
            print position > "lexicon.idx.asc"
            itemlist[$1] = maxcode;
            code = maxcode;
            maxcode++;
            position = position + length($1) + 1;
          } else
            code = itemlist[$1];
          print code > "corpus.asc"
        }

Oliver Christ, IMS Stuttgart, August 9, 1996

After the code is executed with a text file as input, the ASCII representations have to be converted into the internal format (this could be done via pipes in the gawk script also, but we left that out here for the sake of clarity):

(2.5)   atoi < corpus.asc > corpus
        atoi < lexicon.idx.asc > lexicon.idx
        tr '\n' '\0' < lexicon.asc > lexicon
        rm -f *.asc

After that, command 2.3 can be used to produce the sorted item list index.

2.1.2 Inversed item sequence

The second set of data structures concerns the inversed file index associated with the item sequence. This inversed file holds, for each item code, the set of positions in the item
sequence where the item code occurs. Through the mapping functions introduced in the last section, we can also regard the inversed item sequence as a list of corpus positions where a certain word or part-of-speech tag occurs.

The inversed file is represented by a set of three files:

- first, the inversed file itself, which contains a set of corpus positions;
- second, an index into this file. This index returns, for each item code, the start point of the associated occurrences in the inversed file;
- third, a table of item code frequencies, which gives, for each item code, the number of occurrences of the code in the corpus (which is, of course, equal to the size of the set of occurrences).

The three files can also be computed by UNIX commands. First, the reversed sequence is produced by the following command:

(2.6)   itoa corpus |
          gawk '{ print $1 "\t" NR-1 }' |
          sort -ns |
          gawk '{ print $2 }' |
          atoi > corpus.rev

First, the internal representation of the item sequence is converted into readable numbers. Each number in this sequence is then suffixed with its position in the corpus, and the result is sorted by the code, so that we get code/position pairs. From this sequence, the code is stripped off, so that we only get the sequence of positions, which is exactly the inversed file.

The frequencies can already be computed in the gawk encode script (2.4), but another possibility is a slightly modified version of the script above:

(2.7)   itoa corpus |
          gawk '{ print $1 "\t" NR-1 }' |
          sort -ns |
          gawk '{ print $1 }' |
          uniq -c |
          gawk '{ print $1 }' |
          atoi > corpus.cnt

Here, we keep the code sequence of the code/position pairs. This sequence of codes appears in sorted order. By the call to the uniq utility, equal subsequent lines (here: codes) are collapsed into only a single line and counted. The counts are then extracted (the codes are stripped off) and converted to internal format.³

The last file, the index into the inversed file, can simply be computed from the frequencies by summing them up:

(2.8)   itoa corpus.cnt |
          gawk 'BEGIN { pos = 0 } { print pos; pos += $1 }' |
          atoi > corpus.rdx

³ It would be more efficient to use a gawk array to hold the item code counts, since the sort step could be omitted. The version here is just for clarity.
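The relationship between the three files can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the toolbox's code: the function names `invert` and `occurrences` are hypothetical, and plain in-memory lists stand in for the files corpus.rev, corpus.rdx and corpus.cnt.

```python
def invert(corpus, n_codes):
    """Build the inversed file (rev), its index (rdx) and the frequencies (cnt)."""
    # Frequency table: number of occurrences of each item code.
    cnt = [0] * n_codes
    for code in corpus:
        cnt[code] += 1

    # Index into the inversed file: running sum of the frequencies.
    rdx = []
    pos = 0
    for c in cnt:
        rdx.append(pos)
        pos += c

    # Inversed file: corpus positions grouped by item code, ascending per code.
    rev = [0] * len(corpus)
    fill = list(rdx)                  # next free slot for each code
    for position, code in enumerate(corpus):
        rev[fill[code]] = position
        fill[code] += 1
    return rev, rdx, cnt


def occurrences(rev, rdx, cnt, code):
    """All corpus positions of an item code, read off in a single slice."""
    start = rdx[code]
    return rev[start:start + cnt[code]]


corpus = [0, 1, 2, 0, 3]              # encoded item sequence
rev, rdx, cnt = invert(corpus, 4)
print(cnt)                            # [2, 1, 1, 1]
print(rdx)                            # [0, 2, 3, 4]
print(rev)                            # [0, 3, 1, 2, 4]
print(occurrences(rev, rdx, cnt, 0))  # [0, 3]
```

The lookup never touches the item sequence itself: the start offset and the frequency are enough to locate all positions of a code in the inversed file.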
Now, the whole set of seven files representing the data of a positional attribute (which we call the seven components of a positional attribute) has been created. In the toolbox, there are tools which perform these steps much faster than the shell scripts presented here. But in some cases, the utilities of the toolbox run into memory problems, and then these scripts may help to produce the encoded version of a corpus.

Figure 2.3: Reversed file indices (diagram showing the index into the reversed file, the reversed file and the number of item occurrences (freq); item index ==> set of occurrences)

The components associated with the inversed file and their meanings are illustrated in figure 2.3. The next section will show the single steps which are taken when a simple search for an item, a word for example, is performed.

2.1.3 Example: a simple word search

After the internal data structures have been introduced, we can compute the concordance for a single item, for example the word the. Most data structures can be treated as an array, so we use the symbols

- C for the item sequence (accessed by C[i] where i is a corpus position). The elements of C are item codes;
- R for the reversed item sequence (accessed by R[i] where i is an index into this sequence, computed from IR below). The elements of R are corpus positions;
Figure 2.4: A simple word search (diagram showing the sorted index, the word list and its index, the index into the reversed corpus, the number of item occurrences (freq), the reversed corpus and the corpus, with the lookup path from a word to a "match" and a concordance element)
- IL for the item list index (accessed by IL[c] where c is an item code). The elements of IL are byte offsets into the item list;
- F for the item frequency table (accessed by F[c] where c is an item code). The elements of F are item frequencies in C;
- IR for the reversed item sequence index (accessed by IR[c] where c is an item code). The elements of IR are pointers (offsets) into R;
- L for the item list (an array of characters, only accessed by offsets of IL);
- SL for the sorted item list index (accessed by SL[i] where i is a position in the "virtual" sorted item list). The elements of SL are item codes.

These seven arrays are the components of a positional attribute.

For computing the set of occurrences of a textual item i in C, the following steps have to be taken (also illustrated in figure 2.4 for the word "the"):

- first, the item code $c(i)$ of item i has to be determined. For this purpose, the sorted item index SL is consulted and searched with binary search until the item code is found;
- if the item code could be determined, the reversed item sequence index is consulted to determine the starting position $rs(i) = IR[c(i)]$ of the position set associated with i in the reversed item sequence;
- second, the item frequency list is accessed to compute the "length" of the position set $f(i) = F[c(i)]$;
- then, the set of occurrences $P(i)$ is the set of positions stored in the reversed item sequence R starting at $rs(i)$ with length $f(i)$ ($R[rs(i)] \ldots R[rs(i) + f(i) - 1]$).

The task of computing the set of occurrences of i in the item sequence is then completed. Note that the item sequence itself didn't have to be accessed.

For computing the concordance and printing it, though, the item sequence C must be consulted. When $cl$ is the left display context (in terms of items) and $cr$ is the right context, for each $p \in P(i)$ the "subsequence" between $[p - cl, p + cr]$ in C must be computed (in the bounds $(0, |C| - 1)$). For each item k in this subsequence, the associated (textual) item must be determined by computing the start position $ts(k) = IL[k]$ of k in the item list index. Then, the item list can be consulted to get the string $s(k)$, which then is printed.

2.1.4 The set of positional attributes
The IMS Corpus Toolbox supports an arbitrary number of positional attributes. Each positional attribute has its own set of components. For each positional attribute, the lengths of C and R are equal (see also figure 2.5):

$$|C| = |R|$$
and the lengths of IL, IR, SL, and F are equal:

$$|IL| = |IR| = |SL| = |F|$$

Furthermore, between all positional attributes associated with a corpus, the lengths of the item sequences of these attributes must be equal:

$$|C_{word}| = |C_{lemma}| = |C_{pos}| = |C_{syn}| = \ldots$$

Of course, no such condition usually holds between the other components of two positional attributes.

Figure 2.5: The set of positional attributes (diagram showing, for each of the positional attributes word, pos and lemma, the seven components: item sequence, reversed item sequence, item frequencies, index into the reversed item sequence, item list index, sorted index and item list)

2.2 Other attribute types

2.2.1 Structural attributes

Structural attributes capture information about boundaries of sentences, paragraphs, phrases, or other entities. Internally, these structures are represented as intervals of corpus positions, which are the start and end point (inclusive) of the structure. Such an interval is a pair of corpus positions. Therefore, each structural item needs 8 bytes of storage internally (4 bytes, the size of an integer, for each of the two positions).

Currently, there are two limitations with respect to structural attributes:
- first, the intervals must not be recursive (for example, embedded NPs in NPs);
- and they must not be overlapping.

Figure 2.6: Structural attributes (diagram showing the positional attributes word and pos over corpus positions 0 to n-1, with the structural attributes s and paragraph drawn as intervals over these positions)

Figure 2.6 illustrates the representation of structural attributes. The number of structural attributes associated with a corpus is not limited.

Creating structural attribute data

Unlike positional attributes, the data for a structural attribute is stored in a single file, S. Normally, the structural attribute data is created with the encode utility. But in some cases, it is useful to manipulate or create the files through other utilities. The data file S is an array of integer pairs, where $|S|$ is the number of intervals. The file size of S is then $4 \cdot 2 \cdot |S|$ bytes, since for each interval, two integer numbers have to be stored, each of which needs 4 bytes.

If the files are constructed manually (without the help of encode, for example, after the corpus has been encoded), a simple awk script can help. You must, however, be aware of the internal representation of positional attributes and the "logics" of corpus positions; second, some pitfalls have to be circumvented.

Let's assume that a one-word-per-line input file with marked sentence boundaries (in SGML-style, like <s> </s>) is available. Then, the intervals can be extracted by the following awk script (the output of which has to be converted into internal integer format with atoi):

(2.9)   BEGIN {
          position = 0;
          open = 0;
          structure = "s"
          opentag = "<" structure ">"
          closetag = "</" structure ">"
        }
        {
          if ($1 == closetag) {
            # closing tag, don't increment position.
            if (open) {
              print position-1;  # thanks to a3@wsserv.vdl.nl (Adri Verhoef)
              open = 0;
            } else {
              print "closing non-open group at line " NR ": " $0 >> "/dev/stderr"
              exit
            }
          } else if ($1 == opentag) {
            # tag, don't increment position then
            if (open) {
              # forgot to close group, which we don't consider an error
              print position-1
            }
            print position
            open = 1
          } else if ($1 ~ /<\/?[a-zA-Z]+>/) {
            # do nothing, other structural tag?
          } else {
            position++;
          }
        }
        END {
          if (open)
            print position-1
        }

First, care must be taken when groups are closed which are not open. The other case, reopening open groups, is not considered an error, since closing tags are optional. Additionally, when structure tags are used in the text, the line number (position) must not be incremented. But even then, this awk program may yield errors. So, at least check whether the size of the resulting file can be divided by 8.

2.2.2 Alignment attributes

2.2.3 Bigram and mapping tables

2.3 External tools and dynamic attributes
Figure 2.7: Alignment attributes (diagram showing the positional attributes word and pos of one corpus aligned with the word attribute of a second corpus)

Figure 2.8: Bigram and mapping tables (diagram showing bigram tables over word pairs and mapping tables between tag sets)

Figure 2.9: External (dynamic) attributes (diagram showing a value request passing from the data access module via pipe() invocation to an external tool, with value passing, value computation, value return and value check/conversion)
Chapter 3
Encoding: Transforming a corpus into its internal representation

Within the IMS corpus workbench, a corpus can have an arbitrary number of annotations of different types. In our system, a corpus is primarily regarded as a sequence of words (not as a sequence of characters). The words, then, are numbered, so that we can talk about the word at a certain corpus position p, the first word in the corpus, or, in general, the nth word of the corpus. This leads to the more general notion of positional attributes, which is the most important annotation type. Attributes of this class have a (string) value at each corpus position.¹

The corpus text falls within the class of positional attributes, since we can specify, for each corpus position, the word which occurs at that position. The positional attribute which holds the corpus text proper always has the predefined attribute name "word". Other positional attributes are, for example, part-of-speech tags, which are assigned to the words of the corpus. In our view, we regard POS tags as assigned to a corpus position rather than to the word at that position. Then, the positional attributes "word" and "tag" do not differ very much any more: both have, for each corpus position, a value which is, in our case, a string. We therefore use the same representation for the word sequence of the corpus (the corpus text) and the tag sequence (the associated POS tags). In other words, a tagged corpus is in our view a set of two corpora of equal length, one of which captures the sequence of words, the other captures the sequence of tags. In the following, we therefore use the term "item" to abstract from the type of information encoded in such an attribute (here, word vs. tag). A corpus, then, is a collection of attributes of different types.

In the following section 3.1, we briefly describe the internal representation of a corpus. Section 3.2 describes the steps which are necessary to prepare a textually represented corpus to be suitable as input for the encoding tools, as well as the first of the two encoding tools, encode. Section 3.3 then describes the second encoding tool, which is used to build the indices associated with a corpus.

¹ When the corpus is stored in a verticalized one-word-per-line format, a corpus position can also be regarded as the number of a line in this representation.
3.1 The internal representation of a corpus

After encoding, each item of a textual corpus is represented as a unique integer value.² For example, if the first item of a text corpus is "The", all "The"s in the text corpus will internally be represented as the integer number 0.³ The corpus can then be physically represented as a sequence of integer numbers. To be able to get the item which is represented by an integer, another (indexed) file holds the mappings from integers to strings. Up to now, 3 files are necessary to hold the information: the corpus (consisting of a sequence of integers), the "lexicon" which holds, for each integer, the string it represents, and an index to the lexicon. These three files are those which are produced within the second step during the encoding of a corpus. The tool which performs this task is called encode and is described below in section 3.2.

Two additional files are built next: the first holds information about the sorted sequence of the items in the lexicon, which is necessary to efficiently compute the integer code of an item, given the string. The other file holds, for each item, the number of times it occurs in the corpus.

For efficient lookup, a reversed file or reversed index has to be built. This index holds, for each item in the corpus, the corpus positions where the item occurs. This index is indexed itself, leading to another file. In summary, we have seven files so far which represent the information of one positional attribute. The four additional files which are not built by the encode program are created with the makeall program described in section 3.3. But prior to the encoding of a corpus, it has to be transformed into a format which is suitable as input for the encode program, which is described in the next section.

3.2 The encode program

When a corpus consists of several positional attributes (for example, a POS attribute in addition to the standard "word" attribute), it can either be encoded in one single step (provided that it is in a suitable textual input format for encode) or the various positional attributes can be encoded one after another and be added to an already existing corpus.
This latter way is also useful when one of the positional attributes has been changed, for example, when a tagset has been changed or a better tagger was available to produce more accurate tag assignments.

² The internal corpus representation we use is highly inspired by an (unfortunately) unpublished draft paper by Ken W. Church, "A set of UNIX tools for processing large text corpora".

³ Of course, "the" will get another code than "The", since "The" and "the" textually differ and are not considered equal.

In both cases, the input format is a one-word-per-line format, where each item which is to be encoded occurs in a single line. This line may contain blanks, providing a way to encode adjacent multi-word lexemes, if desired. But care should be taken to avoid blanks at the end of an item, since, for example, "the" and "the " are different strings and therefore are encoded with different codes, which can lead to undesired effects when not all occurrences of the are found in a text due to a blank at the end of some.

In the case of the corpus text, the input may look as follows:
    Pierre
    Vinken
    ,
    61
    years
    old
    ,
    will
    join
    the
    board
    as
    a
    nonexecutive
    director
    Nov.
    29
    .

Another file, then, may hold the sequence of assigned tags (in which case both files must hold the same number of lines). The input format can, for example, be produced out of a raw text file with the tr command [4]:

    tr ' ' '\n' < text_file > 1wpl-file

This command replaces all blanks in the input file with line breaks. The encode program then takes a one-word-per-line input file (or reads that format from stdin) and creates the three files corpus, lexicon and lexicon.idx in the current directory:

    encode -t 1wpl-file

The -t option instructs encode to read its input from the file given as an argument of the option instead of reading the standard input. To do both the transformation and the encoding in a single step, one could enter the following pipe:

    tr ' ' '\n' < text_file | encode

or, if one wants to hold the text in a compressed format:

    zcat text_file.gz | tr ' ' '\n' | encode

In general, the one-word-per-line format can be produced by any program you like. In the simple tr examples above, it is already assumed that the corpus has been tokenized (special characters have been separated from the words); perhaps even a sentence boundary detector must be run on a raw corpus to produce the appropriate input file. The simple example only shows how encode processes its input. In general, this method is not applicable to new, raw corpora, which first may have to be preprocessed by other tools.

[4] The tr command is a standard command available on many platforms and is not part of this toolset.

Oliver Christ, IMS Stuttgart, August 9, 1996

Now, encode may also take an annotated corpus with several positional attributes in a single file. In this case, each line of the input format consists of a number of attribute values, separated by tabulator characters. Thus, the input file consists of several columns, each of which denotes one positional attribute. A POS-tagged text then may look as follows:
    Pierre<tab>NP
    Vinken<tab>NP
    ,<tab>,
    61<tab>CD
    years<tab>NNS
    old<tab>JJ
    ,<tab>,
    will<tab>MD
    join<tab>VB
    the<tab>DT
    board<tab>NN
    as<tab>IN
    a<tab>DT
    nonexecutive<tab>JJ
    director<tab>NN
    Nov.<tab>NP
    29<tab>CD
    .<tab>SENT

where <tab> denotes a single tabulator character (ASCII value 9) [5]. The encode program must then know which positional attribute is represented in the other columns of the file and how they shall be named. This is done with the -P option:

    encode -t <input-file> -P pos

Here, the option -P pos ("P" for "positional attribute") instructs encode to treat the second column in the file as the sequence of values of the pos attribute. The files associated with the pos attribute have their names prefixed with pos (which leads to pos.corpus, pos.lexicon and so on). The order in which the -P options are given is relevant, since the first -P option denotes the name of the attribute represented in the second column in the input file, the second -P option denotes the third column, etc. By default, the first column is treated as the word sequence and therefore gets the prefix word, but this can be overridden with the -p option. Please refer to the manual page of encode for details.

Up to 32 positional attributes can currently be encoded in a single step. If large amounts of text are to be encoded, try to determine the disk space the corpus needs after encoding in advance and look for a file system where enough space is available. Some hints on the expected size are given in section 3.4. As explained in chapter 4, it is possible to split the files of a corpus between several file systems in case there isn't enough space on a single disk. If all else fails, you may have to encode the set of positional attributes in several runs of encode, each with the appropriate prefix passed with the -p option.

A corpus (or an arbitrary positional attribute) may be assigned another type of information which can be encoded with the encode program, namely structural information which can be used to represent article, sentence or paragraph boundaries. This kind of information is represented in the input file with SGML-like markers:

[5] There are no general hints on how to produce this input format. In general, it is a good idea to use standard tools like awk and sed.

    <article>
    <s>
    Pierre<tab>NP
    Vinken<tab>NP
    ,<tab>,
    ...
    29<tab>CD
    .<tab>SENT
    <s>
    ...
    </s>
    </article>

Of course, structural information can be encoded independently of additional positional attributes in the file. Some points have to be noted:

- in a line with a structure marker (s, article), no values of positional attributes may occur;
- the end tags may be omitted. In that case, a structure spans all items until the start of the next structure or the end of file;
- in a structure marker line, everything after a blank or after the closing angle bracket (>) of the tag is ignored;
- structures must not be recursive or overlapping, that is, trees cannot be represented.

In the above example, the call for encode would look like this:

    encode -t <input-file> -P pos -S article -S s

Here, the two encoded structural attributes are each declared with the -S (for "structural attribute") option. The order in which the structural attributes are declared does not matter.

Be careful to declare all structural attributes in the encode call, since undeclared structural attributes are considered as simple attribute values in the first column and therefore are treated literally. If you are encoding several positional attributes at once, an error will occur, since lines with undeclared structural attributes in the first column in general do not have tabulator-separated columns after them. The rule is simple: either a line consists of a structure attribute designator in angle brackets, or it is a line with a fixed number of tabulator-separated columns. Errors may occur if this rule is not obeyed.

Again, up to 32 structural attributes can currently be encoded in a single step.

encode accepts a number of further options:

- -p <prefix> has the effect that the files belonging to the positional attribute in the first column of the input file will get the prefix "prefix.". Note that the dot at the end of the prefix is added automatically and must not be part of the option value. Normally, the files get the name corpus[.cnt,.rev,.rdx] and lexicon[.idx,.srt]. This will lead to problems if a positional attribute is encoded after a corpus is already present in the directory the data is written to, since the names of the corpus files may collide with those of the new attribute. Therefore, when the prefix is not given, data may be overwritten and lost. This option affects only the names of the files for the positional attribute in the first column of the input file; the other positional attributes (if present) will get the prefix given with the -P option;
- -d <path> lets the user specify the directory in which the data shall be written. The path should not end with a slash. Default is to write all output files to the current directory;
- -s instructs encode to skip empty lines (lines with no characters, not even blanks, in them) during encoding.

encode accepts a number of further options. Please refer to the encode manual page for the most recent description of the program.

Please be aware that encode writes its output files into the current directory unless the -d option is specified. Be careful when you add (or update) a positional attribute to a corpus after encoding the corpus itself: a loss of data may occur during the encoding of the positional attribute, since it tries to write the data to the same files in which the corpus data may already be stored. Either you should pass the -p option to prefix the files belonging to the first column, or put each positional attribute in a directory of its own to prevent encode from overwriting important data. It is a very good idea to change the file access mode of all files belonging to a positional attribute to non-writeable for anyone after encoding, in order to prevent accidental overwriting.

After encoding a corpus, each positional attribute has to be declared in the corpus registry. Please refer to chapter 4 for a detailed description of how to do this. The makeall program, which constructs the second set of files during the encoding process, will not work on undeclared positional attributes or corpora.

3.3 The makeall program

After a positional attribute is declared in the registry, makeall must be run to construct the necessary index files. There are no options, and the only argument makeall requires is the symbolic name of the corpus for which the index files shall be produced:

    makeall treebank

This will produce all missing files for all positional attributes declared for the corpus treebank. If you only want to produce the files for a single positional attribute, give the name of the attribute as an additional argument:

    makeall treebank pos

(given that these symbolic names are those of the respective corpus and positional attribute). Note that this call can be issued from any point in the file system, since makeall looks up in the registry to find the corpus data. Data is only written to the directory specified in the registry description file. It is therefore a good idea to run makeall either on a very fast machine or on a machine which locally holds the disks the corpus is stored on (to reduce network load in case of NFS-mounted file systems).
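The division of labour between encode and makeall described in sections 3.1 to 3.3 can be illustrated with a small in-memory sketch. This is not the workbench's implementation: the function names are hypothetical, Python lists stand in for the binary files, and the lexicon index (lexicon.idx) is implicit because a Python list is already indexable.

```python
from collections import Counter, defaultdict

def encode_attribute(items):
    """Integerize a token sequence (the part attributed to encode).

    Returns (corpus, lexicon): corpus holds one integer id per corpus
    position, lexicon[i] is the string with id i. Ids are assigned in
    order of first occurrence, so the first item gets id 0.
    """
    ids = {}
    corpus, lexicon = [], []
    for item in items:
        if item not in ids:
            ids[item] = len(lexicon)
            lexicon.append(item)
        corpus.append(ids[item])
    return corpus, lexicon

def build_indexes(corpus, lexicon):
    """Derive the four additional structures (the part attributed to makeall)."""
    # Sort index: ids ordered by their strings, enabling binary search by string.
    sort_index = sorted(range(len(lexicon)), key=lexicon.__getitem__)
    # Frequencies: number of occurrences of each id.
    counts = Counter(corpus)
    freqs = [counts[i] for i in range(len(lexicon))]
    # Reversed corpus: corpus positions grouped by id, plus the start
    # offset of each id's group (the index into the reversed corpus).
    positions = defaultdict(list)
    for pos, i in enumerate(corpus):
        positions[i].append(pos)
    reversed_corpus, rev_offsets = [], []
    for i in range(len(lexicon)):
        rev_offsets.append(len(reversed_corpus))
        reversed_corpus.extend(positions[i])
    return sort_index, freqs, reversed_corpus, rev_offsets

def lookup(word, lexicon, sort_index, freqs, reversed_corpus, rev_offsets):
    """Return all corpus positions of word, or [] if it does not occur."""
    lo, hi = 0, len(sort_index)
    while lo < hi:  # binary search over the sorted lexicon
        mid = (lo + hi) // 2
        if lexicon[sort_index[mid]] < word:
            lo = mid + 1
        else:
            hi = mid
    if lo == len(sort_index) or lexicon[sort_index[lo]] != word:
        return []
    i = sort_index[lo]
    start = rev_offsets[i]
    return reversed_corpus[start:start + freqs[i]]
```

A word search then needs only the sort index, the frequencies and the reversed corpus; the integer sequence itself is not scanned.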
Note: makeall will currently try to create non-compressed files for attributes which already have the complete data in compressed form. This is a bug and will be fixed in a future release.

After running makeall on all positional attributes of a corpus, the corpus is ready for use.

makeall currently produces a lot of debugging output. This output is only important when errors occur or when the program has to be debugged.

When trying to encode really big corpora (20 million words and above), encode and makeall may have problems due to memory or swap space limitations and yield error messages like "can't allocate memory" or "not enough memory". These problems can only be solved by providing more swap space to the machine the program is running on (or by running encode/makeall on another machine at your site which has enough swap space). The available swap space is shown with the pstat -s command and should be enough to hold the whole corpus data (see below in section 3.4). Ask your system administrator for further help.

3.4 Space requirements

This section gives some hints on how much space will be used by an encoded attribute. Currently, we only cover positional and structural attributes here.

3.4.1 Positional attributes

Let A be the sequence of items captured by the positional attribute. The length of this sequence, |A|, therefore is the number of elements of this sequence. Further, let D be the set of distinct strings encoded in A (the list of different words, for example). Then, |D| is the number of distinct words.

Two numbers are important:

- the number of items in the input file (|A|), which is equal to the number of lines in the one-word-per-line input file for encode (minus the number of structural annotation markers, if present). This number can, for example, be computed with the wc -l command (maybe pre-piped with a grep -v command to get rid of the structural annotation markers);
- the number of distinct items in the input file (|D|) (this number can be computed by running the pipe sort -u | wc -l over the respective column in the one-word-per-line input text file).

Another important number is the space needed for the one-time representation of all different items, which here is denoted S. This is the sum of the lengths of each different word plus one:

    $S = \sum_{s \in D} (\mathrm{strlen}(s) + 1) = |D| + \sum_{s \in D} \mathrm{strlen}(s)$
The adding of 1 is necessary since a null character ('\0') is added to each string. The number is given by running the pipe sort -u | wc -c over the respective column in the input file. [6]

Now, the size of one positional attribute (in bytes) can be computed as follows:

    $M_p = 2(|A| \cdot 4) + S + 4(4|D|) = 8|A| + 16|D| + S$

The size of the input text file does not go into this formula, since it is not needed any more after encoding.

For each positional attribute, this formula has to be evaluated again, since the number of different items in the attribute (|D|) and the space needed to represent them once (S) usually differ between several positional attributes (|A| must be constant for any two positional attributes of a corpus). Since positional attributes other than the word attribute usually have a small number of distinct values, the space needed to represent a positional attribute mainly depends on its length, which is not very surprising. A less accurate figure for the space requirement can be roughly estimated by multiplying the size of the uncompressed input text file(s) by 2.

3.4.2 Structural attributes

The data of a structural attribute, for example sentence boundaries, is stored in a single file, as an ordered sequence of corpus intervals (that is, pairs of corpus positions). So, the computation of the space needed to represent the information of one structural attribute is very simple: let S be the structural attribute. Then, |S| is the number of intervals stored in this attribute ("the number of sentences"), each being a pair of two corpus positions. Since each corpus position is stored as a 4-byte integer number, the space can be computed as follows:

    $M_s = (|S| \cdot 4 \cdot 2) = 8|S|$

So, if you want to represent 1000 sentences, you need 8000 bytes to store the data.

[6] The columns can be extracted from a multi-column file with the cut program which is contained within GNU's/FSF's set of text utilities, or with the awk utility.
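The two formulas of this section can be evaluated directly from a one-word-per-line column. The following is a minimal sketch, not part of the toolset; the function names are hypothetical, and string lengths are measured in bytes after UTF-8 encoding, which is an assumption (the manual only speaks of strlen).

```python
def positional_attribute_size(items):
    """Estimate M_p = 8|A| + 16|D| + S in bytes for one positional attribute."""
    n_items = len(items)      # |A|: number of corpus positions
    distinct = set(items)     # D: set of distinct strings
    # S: every distinct string stored once, plus one null byte each.
    s = sum(len(word.encode("utf-8")) + 1 for word in distinct)
    return 8 * n_items + 16 * len(distinct) + s

def structural_attribute_size(n_intervals):
    """Estimate M_s = 8|S| in bytes: two 4-byte corpus positions per interval."""
    return 8 * n_intervals
```

For the three-item sequence "the", "board", "the" this gives 8·3 + 16·2 + (4 + 6) = 66 bytes, and 1000 sentence intervals need 8000 bytes, as stated above.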
Chapter 4

The corpus registry

The corpus registry holds, for each corpus being processed by the workbench, a file which describes where in the file system the various files which build the corpus are stored. Additionally, the registry holds a file which describes who can access a local corpus from remote hosts and a file which captures a log of all remote connections to local corpora. This chapter only describes the description files for local corpora and the description files for remote corpora; the two additional files which are necessary for remote connections are described in chapter 5.

The registry is simply a globally accessible directory, called the registry directory, and holds, for each corpus, a file with the same name as the corpus name, in lowercase letters [1]. Such a file is called the registry file of the corpus. The contents of the file, described in section 4.2 below, define which annotations are associated with the corpus and where the data is stored. An annotation which is not defined in the registry file of a corpus cannot be accessed by any of the tools. Similarly, when an attribute is defined in the registry, it is supposed to be accessible by all tools. It is within the responsibility of the corpus administrator (you!) to assure that all annotations defined in all registry files are accessible, and that only those attributes are defined which are in fact accessible.

Currently, there are some rules on how to name the attributes of a corpus and how to name the respective registry files. These rules are described in the following section.

At the end of this chapter, section 4.5 summarizes the single steps which you should follow when preparing and registering a new corpus.

As already mentioned in section 1.2 above, corpora exist where access has to be controlled. This issue is discussed in chapter 7 below.

[1] In Cqp, all corpus names are entered in uppercase, but they are converted to lowercase to load the correct corpus.

4.1 Some remarks about nomenclature

Two simple rules have to be obeyed for the definition of names for corpora and attributes:
- corpus and attribute names must begin with a lowercase letter, and may be followed by an arbitrarily long sequence of lowercase letters or digits.

By default, the tools expect the registry directory to be /corpora/c1/registry. Since at your site this directory most probably does not exist, the default value can be overridden by the environment variable CORPUS_REGISTRY. Please do not add a slash at the end of the value of this variable. We suggest that either each user of the tools sets this environment variable in his or her .tcshrc or .cshrc shell initialization file in his or her home directory, or that it is set in one of the global shell initialization files, which usually reside in /etc and are only writeable by the system administrator or the superuser. Please ask your system administrator for further help in case you do not know where or how to set the variable.

Additionally, almost all tools of the toolbox take the -r command line option to specify a non-default registry directory.

4.2 The contents of a registry file

In the registry file, all attributes of a corpus are declared. Additionally, some "global" variables are set.

A registry file may contain empty lines. A comment begins with a hash mark (#); everything up to the end of the line is not read.

The format of a registry file is:

    <Header information and global variables>
    <Attribute definitions>

This order has to be kept in all registry files. In the attribute definition section, the attributes may be declared in any order.

4.2.1 The header

The header consists of 4 declarations of values, each of which is preceded by the field name. The field names (keywords) are all uppercase. The fields are

- a short (one-line) description of the corpus (keyword NAME). The field value is a string enclosed in double quotes;
- a unique identifier (keyword ID). The field value is a symbol. Usually, the field value should be the same as the file name of the registry file;
- optionally, the "home directory" of the corpus (keyword HOME). The field value is a path (not enclosed in double quotes);
- optionally, the path of the "info file" of the corpus (keyword INFO). This text file should contain a description of the corpus, its annotations, perhaps administrative information, etc. It is also a good idea to include a description of the tagset of the part-of-speech annotation there, if the corpus is tagged.

If the HOME field is missing, you have to specify the path for each attribute, so it is more convenient to define this field when all corpus-related data files are kept in a single directory.

If the INFO field is missing, no corpus information can be displayed in Cqp or Xkwic (in Cqp, this is done with the info command; in Xkwic, this file is displayed when the Info button is selected in the corpus list).

After the header, the set of corpus attributes (annotations) is declared. The different types of annotations are

- positional attributes (section 4.2.2);
- structural attributes (section 4.2.3);
- mapping tables (section 4.2.4);
- ngram tables (section 4.2.5);
- alignment information (section 4.2.6);
- dynamic attributes (section 4.2.7).

As said before, all annotations may be declared in almost any order (but see the notes in the case of mapping tables and bigram tables).

4.2.2 Positional attributes

A positional attribute is encoded as a set of seven files, which have been described above in section 3.1. These files are called the components of a positional attribute. A registry file defines, for each component of a positional attribute, the file name in which the component is stored. It is not necessary to manually set the names of all components, since there are default rules on how to compute undefined component file names from the defined ones. In fact, we suggest relying on the default names which encode assigns to the files. Then, you do not have to declare component paths at all.

The declaration of a positional attribute looks as follows:

    ATTRIBUTE Name OptBody

where Name is the identifier for the attribute (like word, pos, lemma, ...). The OptBody is optional and is only needed when you have to define non-default file names.

When you use the body, it can be one of the following:

- a definition of the component paths, CompPathSpec;
- or the declaration that the attribute is found on a remote host, RemoteSpec;
- or a path which overwrites the HOME field of the corpus and says that all components are stored at a different place, AltPath.

The RemoteSpec currently is not supported, since in the actual distribution, access to remotely stored corpora is disabled. When the AltPath declaration is used, it declares a path different from the corpus path for this special attribute.

The component path specification, CompPathSpec, must be enclosed in braces { ... }. Within these braces, a sequence of component name/path specification pairs is listed. Each component name may only occur once:

    {
      ComponentID PathSpec
      ...
    }

One component is "virtual" (DIR) and doesn't describe file names but rather denotes the "home directory" of the attribute. Two other virtual components are ANAME, the name of the attribute just being defined, and APATH, which is the "home directory" of the corpus and defaults to the value of the HOME field of the header (if present).

The PathSpec is a standard path, in which macros may be used. Such a macro is started with a dollar sign $ and directly followed by a component name. For example, the macro $ANAME represents the value of the attribute name. Macro values may be used in path specifications. For example,

    $APATH/$ANAME.f

stands for the concatenation of the value of the APATH variable (usually the corpus home directory), followed by a slash, followed by the value of the ANAME variable, and then followed by .f.

In general, it is possible to refer to the value of any component by prefixing its component identifier with a dollar sign ($). Thus, when a component value is defined, it is possible to use the values of previously defined components in the definition. It is an error when you refer to a component value which is not yet defined.

The following table lists the components, the component identifiers and the default value or macro through which the (default) value is computed:
    Component          ComponentID   Default Rule/Value
    Directory          DIR           $APATH
    Corpus             CORPUS        $DIR/$ANAME.corpus
    ReversedCorpus     REVCORP       $CORPUS.rev
    RevCorpusIdx       REVCIDX       $CORPUS.rdx
    CorpusFreqs        FREQS         $CORPUS.cnt
    Lexicon            LEXICON       $DIR/$ANAME.lexicon
    LexiconIndex       LEXIDX        $LEXICON.idx
    LexiconSortindex   LEXSRT        $LEXICON.srt

This table also shows the component path values when there is no component path specification at all. The APATH field defaults to the HOME directory of the corpus (the root directory, /, is assumed if this home specification is missing in the header). The name of the attribute being defined, ANAME, is always known. Then, the other components get their path values by the rules listed in the table.

You can, of course, set all component path values as you like, but it is strongly recommended to rely on the default values.

Note that the word attribute always must be defined, and that an attribute with the name word must be present in every corpus declaration (registry file).

Let's look at a small example. The following registry file only defines the positional attribute word:

    NAME "Penn Treebank"
    ID
    HOME /corpora/kwic/up
    ATTRIBUTE word

Due to the default rules, the paths of the word attribute components have the following values:

    Component          Value
    Directory          $APATH = /corpora/kwic/up
    Corpus             $DIR/$ANAME.corpus = /corpora/kwic/up/word.corpus
    ReversedCorpus     $CORPUS.rev = /corpora/kwic/up/word.corpus.rev
    RevCorpusIdx       $CORPUS.rdx = /corpora/kwic/up/word.corpus.rdx
    CorpusFreqs        $CORPUS.cnt = /corpora/kwic/up/word.corpus.cnt
    Lexicon            $DIR/$ANAME.lexicon = /corpora/kwic/up/word.lexicon
    LexiconIndex       $LEXICON.idx = /corpora/kwic/up/word.lexicon.idx
    LexiconSortindex   $LEXICON.srt = /corpora/kwic/up/word.lexicon.srt

Please note that no checking is done with respect to file name collisions. If you use the same component path for different components (or for components of different attributes of different corpora etc.), the system will certainly crash, and data damage may also result.

The following example shows a declaration of a corpus with the "old" naming conventions:

    NAME "Getaggter BZK (``Neues Deutschland'')"
    ID
    HOME /corpora/kwic/bzk-tagged
    ATTRIBUTE word {
      CORPUS  $APATH/corpus
      LEXICON $APATH/lexicon
    }
    ATTRIBUTE pos {
      CORPUS  $APATH/corpus.pos
      LEXICON $APATH/pos.lexicon
    }

Here, the paths of the word attribute components have the following values:

    Component          Value
    Directory
    ReversedCorpus     /corpora/kwic/bzk/corpus.rev
    RevCorpusIdx       /corpora/kwic/bzk/corpus.rdx
    CorpusFreqs        /corpora/kwic/bzk/corpus.cnt
    LexiconIndex       /corpora/kwic/bzk/lexicon.idx
    LexiconSortindex   /corpora/kwic/bzk/lexicon.srt

The paths of the pos attribute components have the following values:

    Component          Value
    Directory
    ReversedCorpus     /corpora/kwic/bzk/corpus.pos.rev
    RevCorpusIdx       /corpora/kwic/bzk/corpus.pos.rdx
    CorpusFreqs        /corpora/kwic/bzk/corpus.pos.cnt
    LexiconIndex       /corpora/kwic/bzk/pos.lexicon.idx
    LexiconSortindex   /corpora/kwic/bzk/pos.lexicon.srt

The alternate path specification

In the alternate attribute path specification, AltPath, you "overwrite" the default APATH value. The component path values are computed by the default rules. With the AltPath specification, you "move" the whole set of attribute components to another directory, while keeping the standard name conventions.

Example:

    NAME "Wall Street Journal, very large"
    ID wsj
    HOME /corpora/kwic/wsj
    ATTRIBUTE word
    ATTRIBUTE pos /var/space/kwic/wsj.pos

Here, the word attribute is on /corpora/kwic/wsj, but the set of components which belong to the pos attribute are stored on another file system, perhaps due to limitations of disk space.
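The default rules and the macro mechanism of section 4.2.2 can be traced with a short sketch. It is only an illustration, not the registry parser of the workbench: resolve_components and its parameters are hypothetical names, and components are resolved in the order of the default table rather than in the order in which they are written in a registry file.

```python
import re

# Component identifiers and default rules, in the order of the table above.
DEFAULT_RULES = [
    ("DIR", "$APATH"),
    ("CORPUS", "$DIR/$ANAME.corpus"),
    ("REVCORP", "$CORPUS.rev"),
    ("REVCIDX", "$CORPUS.rdx"),
    ("FREQS", "$CORPUS.cnt"),
    ("LEXICON", "$DIR/$ANAME.lexicon"),
    ("LEXIDX", "$LEXICON.idx"),
    ("LEXSRT", "$LEXICON.srt"),
]

def resolve_components(aname, home, overrides=None, altpath=None):
    """Compute all component paths of one positional attribute.

    overrides corresponds to a CompPathSpec (ComponentID -> PathSpec),
    altpath to an AltPath. APATH falls back to HOME, then to "/".
    """
    values = {"ANAME": aname, "APATH": altpath or home or "/"}
    overrides = overrides or {}

    def expand(spec):
        def replace_macro(match):
            name = match.group(1)
            if name not in values:
                # Referring to a component that is not yet defined is an error.
                raise KeyError("component %s is not yet defined" % name)
            return values[name]
        return re.sub(r"\$([A-Z]+)", replace_macro, spec)

    for component, rule in DEFAULT_RULES:
        values[component] = expand(overrides.get(component, rule))
    return values
```

Called as resolve_components("word", "/corpora/kwic/up"), the sketch reproduces the values of the first example table; passing overrides={"CORPUS": "$APATH/corpus.pos", "LEXICON": "$APATH/pos.lexicon"} reproduces the "old" naming convention for pos.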
The remote specification

The remote specification is currently disabled, so it is not described here.

4.2.3 Structural attributes

Structural attributes also have components, but their values cannot be changed in the registry file. So, the declaration of a structural attribute is very simple:

    STRUCTURE Name OptStorageSpec

STRUCTURE is the keyword, and Name is the name of the structure being defined. Optionally, this declaration may be followed by a storage specification, OptStorageSpec:

- this can either be a path, which "moves" the attribute data to another directory like the AltPath specification for positional attributes (the default is to store the data in the corpus home directory);
- or a remote declaration, which is currently disabled and therefore doesn't need to be covered here.

Normally, the declaration of a structural attribute simply is a sequence of the keyword STRUCTURE, followed by the name of the structure:

    STRUCTURE s
    STRUCTURE p
    STRUCTURE article
    STRUCTURE np

Structural attributes are stored in only one file, and the name of this file is derived by the rule $APATH/$ANAME.rng (where $APATH defaults to the value of HOME).

4.2.4 Mapping tables

The declaration of a mapping table looks as follows:

    MAPTABLE SourceName TargetName OptStorageSpec

The OptStorageSpec is the same as in the case of structural attributes above, and has the same meaning.

The keyword MAPTABLE introduces the declaration of a mapping table. Then, the names of two positional attributes follow. These positional attributes must already be declared (so the declaration is not order-independent in this case). Currently, mapping tables are unidirectional, that is, you have to compute and declare mapping tables in both "directions" if you need them (for example, from pos to word as well as from word to pos).

Example declaration:

    MAPTABLE word pos
    MAPTABLE pos word
4.2.5 ngram tables

The declaration of an ngram table looks as follows:

    NGRAM PAName n OptStorageSpec

The OptStorageSpec is the same as in the case of structural attributes above, and has the same meaning.

The keyword NGRAM introduces the declaration of an ngram table. Then, the name of a positional attribute follows, which must already be declared earlier in the registry file. The PAName is then followed by the "dimension" of the table, n. Currently, only bigram tables are supported, so that this value n always must be 2.

Example declaration:

    NGRAM word 2
    NGRAM pos 2

4.2.6 Alignment attributes

With an alignment declaration, it is expressed that the corpus where this attribute is being declared is aligned to another corpus which also has its registry file. The declaration is as follows:

    ALIGNED CorpusName OptStorageSpec

The OptStorageSpec is the same as in the case of structural attributes above, and has the same meaning.

The CorpusName must be the name of an existing (registered) corpus.

Alignment tables are unidirectional. So, if the other aligned corpus also is aligned to the corpus currently being defined, the alignment in the other direction must be declared in the registry file of the other corpus.

Example declaration:

    ALIGNED hansard-f

4.2.7 Dynamic attributes

Dynamic attributes are declared like functions: they have a name, an argument list, and a return value type. Additionally, it must be declared how the value is computed from the external knowledge source.

The declaration of a dynamic attribute looks as follows:
    DYNAMIC Name(ArgList):RVType ShellCall

The Name is the name of the dynamic attribute, following the standard naming conventions. After the name, the argument list is written in parentheses. The argument list is a sequence of comma-separated type identifiers. Currently, the type identifiers STRING (for character string arguments), INT (for integer type arguments) and POS (for corpus-position arguments) are supported. The return value type, RVType, also must be one of these type identifiers.

The ShellCall is a string, enclosed in double quotes. It defines how the return value is computed by a call to an external tool. The value computed by the external tool (which is expected to occur on the stdout of the external tool after its termination) is coerced to the return value type, whether this makes sense or not. So it is your task to make sure that the external tool computes information which can be coerced to the return value type.

In the ShellCall, you refer to the arguments of the function with $1 for the first argument, $2 for the second, and so on. It is an error when argument numbers are used which are higher than the number of arguments declared. It is not an error, though, when declared arguments are not used in the shell call.

Example:

    DYNAMIC wndist(STRING,STRING,STRING):INT "/usr/local/bin/wnreq -s '$1' '$2' '$3'"

Here, the dynamic attribute wndist takes three STRING-type arguments and returns an INT-type value. The value is computed by calling the external program wnreq with some parameters and the (actual) arguments glued into the shell call.

When the value of a dynamic attribute is requested, the actual parameters of the function (which must agree with the declared argument types) are glued into the shell call at the positions indicated by the argument references ($n). Then, this shell call is evaluated (via the pipe mechanism of Unix). Errors can result when arguments are passed which contain characters which are interpreted by the shell. These should either be excluded from being passed through constraints in the query, or you have to design your shell call carefully (for example, by surrounding the variable references with single quotes, although this only partially helps). In a future release of the external knowledge source interface, perhaps we will support argument passing to the stdin of the external tool, so that shell evaluation does not take place any more.

The pipe to the external program is not kept open (we will perhaps fix that in a later release), so that the external tool is invoked for each value request. Two things are important:

- first, the number of dynamic calls must be minimized: use external information in queries carefully! Batch querying should also be considered;
- the startup time of the external tool should be minimized. It should only load as little data as necessary, and the time necessary to compute the requested information should be small. This can best be achieved by designing a client-server architecture, so that the server only starts once, and small clients (like wnreq in the example above) do not have to load large amounts of data.
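The quoting problem mentioned above can be made concrete with a small sketch of the substitution step. This is not how the workbench glues arguments into the call; build_shell_call is a hypothetical helper that shows one defensive variant, in which every argument is quoted with Python's shlex.quote and any single quotes already written around a $n reference are dropped so that the value is not quoted twice.

```python
import re
import shlex

def build_shell_call(template, args):
    """Replace $1, $2, ... in template by shell-quoted argument values.

    Raises ValueError when a reference is higher than the number of
    arguments (or is $0). Unused arguments are not an error.
    """
    def substitute(match):
        index = int(match.group(1) or match.group(2))
        if index < 1 or index > len(args):
            raise ValueError("argument $%d is not declared" % index)
        return shlex.quote(str(args[index - 1]))

    # Match either a reference in single quotes ('$1') or a bare one ($1).
    return re.sub(r"'\$(\d+)'|\$(\d+)", substitute, template)
```

With the template of the wndist example and the arguments dog, cat and it's, the resulting command line splits back into exactly those three values, even though the last one contains a single quote that would break the plain '$3' form.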
Please remember that the external tool interface as well as dynamic attributes in general are experimental tools, and are not optimized towards efficiency.

4.3 Registration of remote corpora

Remote corpora are currently disabled, so this section can be skipped.

When corpora are stored remotely and have to be accessed via the network, they are declared differently than local corpora. For each remote corpus, a registry file similar to a "standard" registry file has to be created. This file declares that a corpus (or a positional attribute) is stored remotely. It may only contain the fields NAME, ID and REMOTE. NAME and ID have the same meaning as described above and are optional. The value of the REMOTE "component" is the name of the remote host which shall be connected when the corpus is accessed. No other component values may appear in the registry file for remote corpora.

Here is an example for a full and valid remote declaration:

    REMOTE maple.gardeners.edu

4.4 A last example

We hope that the format of the registry file is simple enough to understand without too many explanations once you are fairly familiar with it. So just have a look at this last example.

    NAME "Hansard-Corpus (English Part)"
    ID
    HOME /corpora/kwic/hansard-e
    INFO /corpora/kwic/hansard-e/.info
    ATTRIBUTE word {
      CORPUS  $APATH/corpus
      LEXICON $APATH/lexicon
    }
    ATTRIBUTE pos
    ATTRIBUTE lemma
    STRUCTURE s
    STRUCTURE np
    ALIGNED hansard-f
    MAPTABLE word pos
    MAPTABLE pos word
    NGRAM word 2
    NGRAM pos 2
    DYNAMIC wndist(STRING,STRING,STRING):INT "/corpora/bin/wnreq -s '$1' '$2' '$3'"
4.5 Steps to follow

Now, what are the steps to follow if you have to register a new corpus?

- First, try to estimate the space requirements of the encoded corpus and find a place for it in your file system. Consider splitting attributes or single attribute components if you do not find enough space on a single file system.
- Make a new directory for the data of the new corpus. If there is enough space, try to put all data in this directory. Then, you will only have to set the HOME field of the registry file, and all other file names are then computed by the default rules. Remember that encode, the first step during corpus preparation, puts all data files in a single directory, so at least for this step there must be enough space on the disk. If you do not find enough space, you have to split the (positional) attributes between several file systems. Then, you will have to encode the set of positional attributes in several steps (by using the -p option for encode and only encoding one column of the input file in a single run of encode).
- Run encode, and let it write all corpus data in the new directory, if there is enough space. You may also use /tmp as an intermediate place to hold the data, and then move the data files to their proper destinations with mv.
- Register the new corpus: give it a new, unique name and create the registry file. Declare all annotations which have been produced by the run of encode. Remember that annotations can be added (or removed) at any time.
- If the corpus is registered, run describe-corpus with the name of the new corpus. This will show you whether the registry file is syntactically correct, which attributes are declared, where they will be stored, and what their status is. Check this list carefully.
- After all positional attributes have been encoded (either in a single step or in several steps), run makeall once, which will create all components of all positional attributes which have not yet been produced by encode.
- Create and register additional attributes (mapping tables, bigram tables, alignment data, ...) with the respective utilities or by other means.
- You may check your corpus again with describe-corpus.
- Check the file permissions of all files produced by encode and makeall. Data files and the registry file should have the permission 444, and the directories (the registry directory and the data directory) should be readable (security and access control is covered in chapter 7).
- Start Cqp or Xkwic, and check whether the new corpus appears in the list of available corpora.

Good luck!
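The permission step in the list above is easy to forget, so it may be worth scripting. The following sketch is not one of the workbench utilities; make_read_only is a hypothetical helper which assumes that all data files of the corpus sit directly in one directory, and it sets mode 444 on each regular file found there.

```python
import os

def make_read_only(data_dir):
    """Set permission 444 on every regular file directly inside data_dir.

    Returns the sorted list of file names whose mode was set.
    Subdirectories are left untouched.
    """
    changed = []
    for name in sorted(os.listdir(data_dir)):
        path = os.path.join(data_dir, name)
        if os.path.isfile(path):
            os.chmod(path, 0o444)  # read-only for owner, group and others
            changed.append(name)
    return changed
```

Running it once after makeall, and again on the registry file's directory if desired, protects the encoded data against accidental overwriting by a later encode run.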
Chapter 5

Remote access - client and server setup

In the current distribution, the corpus data server is not included, so the IMS corpus toolbox does not support remote corpus data access at the moment. It was too slow anyway ...

This chapter explains how corpora are prepared to allow remote access. To "export" a corpus, it must be stored locally and be accessible by local users (that is, it must be declared in the registry). Basically, on each machine where corpora are stored which are to be exported to remote users, a server process must run which waits for corpus data requests from "outside" and serves these requests, after checking whether the remote user is authorized to access the respective corpus. It should be clear that remote access to corpora is the most vulnerable point with respect to copyright and access control issues. We therefore suggest that remote access is only granted either for "free" resources or when you have fully understood how to control remote access.

Remote connections can be built only if a server is running on the host on which a corpus is stored. Almost all tools (but neither encode nor makeall) have built-in remote access capabilities without the need for special tools or setups. Another access criterion is whether the user trying to access a corpus remotely is allowed to access the corpus at all. The following section 5.1 describes how to grant access to corpora, and section 5.2 describes how to run the corpus data server.

5.1 The .rat and .ratlog files

Remote access to corpora which are stored physically (either on local disks or on NFS-mounted disks) on the same machine the server is running on is controlled by a file called .rat ("remote access table") in the registry directory. This file holds an arbitrary number of lines, each being a pair of regular expressions based on the POSIX egrep syntax (a description of this syntax can be found, for example, in the documentation of the FSF/GNU regex package). The first regular expression in each line describes symbolic corpus names as they occur in the registry; the second regular expression describes who has access to the positional attributes described in the first expression.

When a corpus is to be accessed remotely, the server goes through all lines in the remote access table and checks whether the name of the corpus matches the first expression of the line. If so, it checks whether the name of the user at the remote host (user@host) is matched by the second regular expression. If so, access is granted and no more lines in the table are checked. If not, the next line is tried. If no line is matched, access is denied.

The .rat file could, for example, look as follows:

    treebank.* (joe|mary|chris)@(rose|tulip)\.gardeners\.edu
    treebank.* jack@.*
    up.* (joe|jack)@.*\.gardeners\.edu
    .* corpus_adm@maple\.gardeners\.edu

The first line grants access to all positional attributes defined for the corpus treebank to joe, mary and chris when they try to access the corpus from one of the hosts rose.gardeners.edu or tulip.gardeners.edu. Note that the dot, if not prefixed by a backslash, matches every character and has to be escaped with a backslash when a literal dot is expected. The second line grants access to the same positional attributes for jack, no matter from which host he tries to access the corpus. Such a line should not occur in your .rat file, since there are many jacks out there and you surely don't want all of them to get access to your corpus data. You should always try to specify the users and hosts from which access is to be granted as precisely as possible. The third line grants access to all attributes of the corpus up for joe and jack, no matter from which host in the gardeners.edu domain they are connecting. The fourth line, also a little bit too general for my taste, grants access to all corpora for the user corpus_adm@maple.gardeners.edu.

Again, a "#" in the first column of the .rat file indicates a comment which extends up to the end of the line.

When a corpus has other positional attributes beyond the "word" attribute, it is important to allow remote access to all positional attributes assigned to the corpus (if this access should be possible over the network). This is achieved through the .* at the end of the first expression of each line, which describes the positional attributes for which access should be granted.

The .ratlog file, which also resides in the registry directory, logs all requests for remote access to corpora and whether a connection request was granted or not. Entries in this file look, for example, as follows:

    CDS on maple at Thu Jan 20 16:47:27 1994
    login request from joe@tulip.gardeners.edu for corpus treebank (granted)
    CDS on maple at Thu Jan 20 16:52:03 1994
    login request from bill@tulip.gardeners.edu for corpus treebank (denied)
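The lookup procedure described in this section can be sketched in a few lines of Python. This is an illustration only, not the server's code. It rests on several assumptions: Python's re module stands in for POSIX egrep matching, an expression has to match the whole name (suggested by the trailing .* convention, but not confirmed here), the two expressions on a line are separated by white space, and the function names are hypothetical.

```python
import re

RAT_EXAMPLE = r"""
# corpus pattern   user@host pattern
treebank.*  (joe|mary|chris)@(rose|tulip)\.gardeners\.edu
up.*        (joe|jack)@.*\.gardeners\.edu
"""


def parse_rat(text):
    """Turn the text of a .rat file into a list of compiled pattern pairs."""
    rules = []
    for line in text.splitlines():
        # Skip empty lines and comments ("#" in the first column).
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 2:
            raise ValueError(f"expected two expressions, got: {line!r}")
        rules.append((re.compile(fields[0]), re.compile(fields[1])))
    return rules


def access_granted(rules, corpus, user_at_host):
    """Try each line in order; the first line matching both names grants access."""
    for corpus_re, user_re in rules:
        if corpus_re.fullmatch(corpus) and user_re.fullmatch(user_at_host):
            return True
    # No line matched: access is denied.
    return False
```

With these rules, `access_granted(parse_rat(RAT_EXAMPLE), "treebank", "joe@tulip.gardeners.edu")` is granted, while the same request from bill is denied because no line matches his name.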
5.2 How to start the corpus data server

Starting the server is quite simple: you only have to start the program cds (corpus data server) as a background process. There are, however, some points to be respected:

- at most one cds process may run on a host;
- the cds process must be started by the same person who owns the .rat and the .ratlog file. The file .rat must be present as described in the previous section. If the .ratlog file doesn't exist when cds is started, it can be created with the touch command (although cds tries to create the file in case it is not yet present).

Remote access to corpora is slow. We only implemented it for testing purposes, so don't expect it to work as fast as the access to locally stored corpora.
Chapter 6

Utilities and debugging tools

This chapter is "under construction", so please refer to the manual pages of the tools.

In addition to the encode and makeall utilities introduced in chapter 3, there are some other utilities for the construction of the various attributes or the display of the information captured by these attributes. These tools are described in this chapter. Unless otherwise indicated, Unix manual pages have been written for each of the tools mentioned in this chapter and should be available at your site.

6.1 Decoding of corpus and attribute information

6.1.1 Decoding of corpus information: decode

decode decodes (that is, prints encoded attribute values of) a registered corpus encoded with encode and makeall and prints the data textually on stdout. The user can select the attributes which are printed, as well as the start and end corpus positions. Alternatively, corpus positions may be piped into decode in order to print the values at these positions. The attribute values are separated by a tabulator character and are preceded by the attribute name. The order in which the attribute values are printed is the same as the order of the corresponding command line arguments. At least one attribute must be specified with one of the options.

6.1.2 Decoding of word lists: lexdecode

lexdecode prints the values of a positional attribute of a registered corpus encoded with encode and makeall textually on stdout. If no positional attribute name is given via the -p option, the values of the standard positional attribute word are printed. Additionally, information about the absolute frequency of the values in the corpus can be printed, and/or the values can be printed in lexically ascending order.
6.2 Creation and Decoding of Bigram Tables

6.2.1 Creation of bigram tables: gen-bigrams

gen-bigrams computes bigram tables for a positional attribute of a corpus in a certain window.

gen-bigrams can use two different algorithms to compute the table: the sequential method (which can be selected through the -s option and is the default) steps sequentially through the corpus words and increments the counts, whereas the reversed method (selectable with the -r option) works via the reversed index.

When a bias is given, only bigram cells with a higher count than this bias are stored. When a minimal frequency is given, bigrams are only computed for those words which have at least the frequency mfreq. In this case, the computation is always done via the reversed corpus.

6.2.2 Decoding of bigram tables: decode-bigrams

decode-bigrams prints the contents of the bigram table associated with a positional attribute of a corpus for the given window size (which currently is always 2, but has to be given as an argument).

By default, the table is printed in the internal form, but with the -t option, a more readable, tabular output is produced. Then, the window width and height can be altered with the appropriate parameters.

6.3 Creation and Decoding of Mapping Tables

6.3.1 Creation of mapping tables: gen-mapping-table

gen-mapping-table computes mapping tables from the positional attribute sourcepa to the positional attribute targetpa of corpus. The tables are direction-specific: if you want to have mapping tables in the other direction, you must create them, too.

gen-mapping-table can use three different algorithms to compute the table. The standard method (which can be selected through the -s option and is the default) allocates a large table, runs sequentially through the corpus and increments the counts. The tree computation (selected with the -t option) internally uses a tree-like structure, thus speeding up search. Third, the direct method (selected with the -d option) allocates a huge table with space for all possible cells. It can only be used for small tables (less than 10 MB in size), but should be the fastest method. The table size is computed by multiplying the numbers of values of the two attributes, times 4 (space for one integer).
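The size rule for the direct method is easy to check before running gen-mapping-table. The following Python sketch only restates the arithmetic from the paragraph above. The function names are hypothetical, and reading "10 MB" as 10 * 2**20 bytes is an assumption.

```python
INT_SIZE = 4                     # bytes per table cell (one integer)
DIRECT_LIMIT = 10 * 1024 * 1024  # "less than 10 MB", assuming 1 MB = 2**20 bytes


def direct_table_bytes(n_source_values, n_target_values):
    """Size of the full table: one 4-byte cell per pair of values."""
    return n_source_values * n_target_values * INT_SIZE


def direct_method_feasible(n_source_values, n_target_values):
    """True if the full table stays below the 10 MB limit."""
    return direct_table_bytes(n_source_values, n_target_values) < DIRECT_LIMIT
```

For example, mapping an attribute with 50 distinct values to one with 40,000 distinct values needs 50 * 40,000 * 4 = 8,000,000 bytes, which fits. With 100,000 target values the table grows to 20,000,000 bytes, and one of the other two methods has to be used.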
6.3.2 Decoding of mapping tables: decode-mapping-table

decode-mapping-table prints the contents of the mapping table associated with the source positional attribute sourcepa and the target positional attribute targetpa of the corpus corpus.

By default, the table is printed in the internal form, but with the -t option, a more readable, tabular output is produced. By default, the whole table is printed, but by using the -p option, decode-mapping-table reads the sets of values of the source attribute from stdin. Alternatively, a regular expression pattern can be given as the last argument. Then, the mappings are only displayed for those source values which match the pattern.

6.4 General utilities

6.4.1 Comparing word lists and corpora: check-coverage

check-coverage is a program which reads a list of words either from stdin (default) or from the file specified with the -f option. For each word, check-coverage then looks in the list of corpora specified in the corpus-names list (the maximum is 32 corpus names) whether the word occurs in the word attribute of these corpora. If not, the word is printed to stdout. If the word occurs in one of the corpora and if the -s option ("print success") is given, matches are also written to stdout.

6.4.2 Converting internal integers to readable numbers: itoa

itoa reads, from stdin or from each of the named files, a sequence of integers in the internal (machine-dependent) data format (4-byte integers) and writes each number textually to stdout, one number per line.

6.4.3 Converting readable numbers to internal integers: atoi

atoi reads, from stdin or from each of the named files, a sequence of numbers represented as digit sequences. The input must be in a one-number-per-line format and must only contain digits and newlines. The numbers (not the individual digits, of course) are then written to stdout as a sequence of 4-byte ints, in the internal (machine-dependent) data format. The function is similar to the atoi(3) function of the C library.
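The two conversions in 6.4.2 and 6.4.3 can be illustrated with Python's struct module. This is a sketch of the data format as described above, not the source of the itoa and atoi tools. It assumes signed integers in the byte order of the machine it runs on, and the function names are hypothetical.

```python
import struct

INT_FORMAT = "=i"  # native byte order, 4-byte signed int (signedness is an assumption)


def atoi_stream(text):
    """Convert one-number-per-line text into a sequence of 4-byte integers."""
    out = bytearray()
    for line in text.splitlines():
        # Only digits are allowed on a line, as required in 6.4.3.
        if not (line.isascii() and line.isdigit()):
            raise ValueError(f"not a digit sequence: {line!r}")
        out += struct.pack(INT_FORMAT, int(line))
    return bytes(out)


def itoa_stream(data):
    """Convert a sequence of 4-byte integers into text, one number per line."""
    if len(data) % 4 != 0:
        raise ValueError("input length is not a multiple of 4 bytes")
    return "".join(f"{value}\n" for (value,) in struct.iter_unpack(INT_FORMAT, data))
```

Because the format is machine-dependent, data written this way on one architecture cannot be assumed to be readable on a machine with a different byte order.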
Chapter 7

Access control and security issues

This chapter describes how to control access to the corpus data. Access control is an important issue when corpora with restricted access are provided, for example the TIPSTER corpora available from the Linguistic Data Consortium (LDC).

7.1 Controlling local access to corpora

Access to corpus data can currently only be controlled by restricting the readability of the registry directory, the registry files in the registry directory and the directory the corpus data is stored in, by setting the user and group ids and the access permissions of the files and directories.

The easiest way to control access is by properly setting the read permissions of a registry file. If someone cannot read a registry file, he or she cannot access the corpus by way of its symbolic name. However, independent of the readability of the registry file, a user can figure out where the components of a corpus are stored and (with some knowledge of the internal corpus representation or with the help of the utilities mentioned in the previous chapter 6) can reconstruct the corpus text from the encoded data unless the components are read-protected. We therefore recommend that the access restrictions of the directory the encoded data is stored in (user, group and r/w permissions) are the same as the access restrictions of the registry file. When a corpus is more or less "public" in the sense that its use is either free or restricted to your institution, access control is probably not necessary.

Further, we strongly recommend that

- the registry directory is owned by the corpus administrator;
- the registry directory is readable by everyone but writeable only for the corpus administrator (mode 755);
- the registry files in the registry directory are only writeable by the corpus administrator, and readable exactly by those users which may access the corpus;
- the directories the data of a positional attribute is stored in ("data directories") are writeable only by the corpus administrator;
- the data directories have the same group id, owner id and read permissions as the registry file which describes the attribute (plus the necessary "exec" bits);
- the data files (components) have mode 444 (are not writeable by anyone). For changes or updates, only the corpus administrator may change the access restrictions of a file or a directory.

Additionally, the read/exec permissions of tools other than the query tools (Xkwic, Cqp, print-aligned) may be restricted to the corpus administrator. The utility lexdecode, however, should be executable by all corpus users since it produces useful frequency information which doesn't allow the reconstruction of the corpus text proper. We strongly recommend, however, that execution permission for the corpus data server (cds) is restricted to only the corpus administrator. None of the programs should be setuid.

Since the decoding utilities make it possible to textually decode a corpus with all its information, care should be taken that unauthorized users (guests with temporary accounts, for example) cannot use these tools. One idea, for example, is to set the setgid bit of the query tools (Cqp, Xkwic) to the group id under which the corpus data, the registry files etc. are stored (the top directory suffices). The corpus data and the registry files (or the directories in which they are stored) should then be readable only for members of this "corpus group", but not readable for members of other groups (o-rx). The decoding utilities must not be setgid for the corpus group. By this strategy, the corpus data can be accessed via the query tools, but not via the decoding utilities if the user does not belong to the group which has access to the corpus registry files by the normal group permissions.

7.2 Controlling remote access to corpora

As mentioned in chapter 5 above, remote access is a vulnerable point with respect to access control. With some knowledge about the internals of the tools, it would be possible for almost everyone to fake his or her identity and to gain illegal access to your corpora. Although this is not a trivial task, you should keep in mind that it is at least possible. The only way to prevent this is not to export corpora at all, that is, not to run a corpus data server (and to prevent other users from starting it as well). Additionally, it is good policy to check the .ratlog file frequently in order to gain an overview of which corpora are accessed by whom.

The .rat and the .ratlog files should be owned by the corpus administrator and have access mode 600, that is, readable and writeable only by the owner (i.e., the corpus administrator).

Remember that the corpus data server is usually run by the corpus administrator. Therefore, the server process has access to all registry files and to all positional attributes defined for all corpora. Access to corpora from "outside" is therefore only controlled by checking whether the corpus/user pair is contained in the remote access table (.rat).
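The mode recommendations in this chapter (444 for data files, 755 for the registry directory, 600 for .rat and .ratlog, no setuid programs) can be checked mechanically. The following Python sketch is one possible way to do so on a Unix system. It is not part of the toolbox, and all function names are hypothetical.

```python
import os
import stat
import tempfile


def audit(path, expected_mode):
    """Return a list of problems found for `path` (empty if it looks fine)."""
    st_mode = os.stat(path).st_mode
    actual = stat.S_IMODE(st_mode) & 0o777  # permission bits only
    problems = []
    if actual != expected_mode:
        problems.append(f"{path}: mode {actual:03o}, expected {expected_mode:03o}")
    if st_mode & stat.S_ISUID:
        problems.append(f"{path}: setuid bit is set")
    return problems


def demo():
    """Create a throwaway .rat file and audit it before and after fixing its mode."""
    with tempfile.TemporaryDirectory() as registry:
        rat = os.path.join(registry, ".rat")
        open(rat, "w").close()
        os.chmod(rat, 0o644)       # too permissive: readable by everyone
        before = audit(rat, 0o600)
        os.chmod(rat, 0o600)       # recommended mode for .rat and .ratlog
        after = audit(rat, 0o600)
        return before, after
```

A check of this kind covers modes only. Ownership and group membership, which matter for the setgid strategy described in 7.1, still have to be verified separately.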
Therefore, the entries in the .rat file must be carefully designed, and no one but the corpus administrator should be able to read or write the .rat file. You should have a fair knowledge of regular expressions and take care that the expressions are not "overgenerating". If you are not sure about which user names or attributes are matched by your regular expressions, we suggest that you enumerate all attribute/user pairs with their full names and do not use "wildcards" (.), the Kleene star (*) or the plus construct (+).

And, lastly, don't broadcast that you permit access to your corpora too widely or to people who aren't in the list of intended users. Just to be on the safe side ...
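One way to see whether an expression is "overgenerating" is to try it against a list of names before it goes into the .rat file. The Python sketch below does this and also shows a helper that escapes regular expression metacharacters so that a full name matches only itself. Both functions are hypothetical illustrations. Python's re module is used as a stand-in for egrep matching, and whole-name matching is an assumption.

```python
import re

# Characters with a special meaning in extended regular expressions.
_META = re.compile(r"([.\[\](){}*+?|^$\\])")


def ere_escape(name):
    """Prefix every metacharacter with a backslash so the name matches literally."""
    return _META.sub(lambda m: "\\" + m.group(1), name)


def matched_names(pattern, candidates):
    """Return the candidates that the pattern matches as a whole."""
    compiled = re.compile(pattern)
    return [name for name in candidates if compiled.fullmatch(name)]


USERS = [
    "jack@rose.gardeners.edu",
    "jack@elsewhere.org",
    "joe@rose.gardeners.edu",
]
```

Here `matched_names("jack@.*", USERS)` returns both jacks, including the one at elsewhere.org, whereas `matched_names(ere_escape("jack@rose.gardeners.edu"), USERS)` returns only the intended user.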
Appendix A

Hardware and operating system requirements

The tools have been developed and tested on the following development platform:

    Hardware:       Sun-4M/Sparcstation 3 (2 CPUs, 64 MB memory, 50 MHz)
    Compiler:       gcc V2.5.7 on SunOS Release 4.1.3
    Window System:  X Window System (tm), Version 11, Release 5 (X11R5)
    Widget set:     OSF/Motif (tm) V1.2

We currently only deliver binary files for this platform. No other systems are supported. In order to run the tools, at least 32 MB of memory are recommended. One CPU is of course sufficient.
Appendix B

Reused software packages and copyright notices

In the implementation of our system, we made use of the following software packages. We thank the providers of the software and include their disclaimers and copyright notices.

B.1 The regular expression matcher by Henry Spencer

    Copyright (c) 1992 Henry Spencer.
    Copyright (c) 1992, 1993
    The Regents of the University of California. All rights reserved.

    This code is derived from software contributed to Berkeley by
    Henry Spencer of the University of Toronto.

    Redistribution and use in source and binary forms, with or without
    modification, are permitted provided that the following conditions
    are met:
    1. Redistributions of source code must retain the above copyright
       notice, this list of conditions and the following disclaimer.
    2. Redistributions in binary form must reproduce the above copyright
       notice, this list of conditions and the following disclaimer in the
       documentation and/or other materials provided with the distribution.
    3. All advertising materials mentioning features or use of this software
       must display the following acknowledgement:
       This product includes software developed by the University of
       California, Berkeley and its contributors.
    4. Neither the name of the University nor the names of its contributors
       may be used to endorse or promote products derived from this software
       without specific prior written permission.

    THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
    ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
    IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
    ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
    FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
    DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
    OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
    HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
    LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
    OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
    SUCH DAMAGE.

    @(#)regex.h 8.1 (Berkeley) 6/2/93