The IMS Corpus Workbench
Corpus Administrator's Manual

Oliver Christ
oli@ims.uni-stuttgart.de

Institut für maschinelle Sprachverarbeitung
Universität Stuttgart
- Computerlinguistik -
Azenbergstr. 12
D 70174 Stuttgart 1

Created: Thu Feb 24 10:34:11 1994 (oli)
Last Modified: Wed Nov 9 14:33:27 1994 (oli)
tc.bibentry: christ:94b
Released: - not yet -

Contents

1 Overview
    1.1 Introduction
    1.2 The role of the Corpus Administrator
    1.3 Organization of this manual
    1.4 Credits and Acknowledgements
2 Internal corpus representation
    2.1 Positional attributes
        2.1.1 Integerized files
        2.1.2 Inversed item sequence
        2.1.3 Example: a simple word search
        2.1.4 The set of positional attributes
    2.2 Other attribute types
        2.2.1 Structural attributes
        2.2.2 Alignment attributes
        2.2.3 Bigram and mapping tables
    2.3 External tools and dynamic attributes
3 Encoding: Transforming a corpus into its internal representation
    3.1 The internal representation of a corpus
    3.2 The encode program
    3.3 The makeall program
    3.4 Space requirements
        3.4.1 Positional attributes
        3.4.2 Structural attributes
4 The corpus registry
    4.1 Some remarks about nomenclature
    4.2 The contents of a registry file
        4.2.1 The header
        4.2.2 Positional attributes
        4.2.3 Structural attributes
        4.2.4 Mapping tables
        4.2.5 ngram tables
        4.2.6 Alignment attributes
        4.2.7 Dynamic attributes
    4.3 Registration of remote corpora
    4.4 A last example
    4.5 Steps to follow
5 Remote access - client and server setup
    5.1 The .rat and .ratlog files
    5.2 How to start the corpus data server
6 Utilities and debugging tools
    6.1 Decoding of corpus and attribute information
        6.1.1 Decoding of corpus information: decode
        6.1.2 Decoding of word lists: lexdecode
    6.2 Creation and Decoding of Bigram Tables
        6.2.1 Creation of bigram tables: gen-bigrams
        6.2.2 Decoding of bigram tables: decode-bigrams
    6.3 Creation and Decoding of Mapping Tables
        6.3.1 Creation of mapping tables: gen-mapping-table
        6.3.2 Decoding of mapping tables: decode-mapping-table
    6.4 General utilities
        6.4.1 Comparing word lists and corpora: check-coverage
        6.4.2 Converting internal integers to readable numbers: itoa
        6.4.3 Converting readable numbers to internal integers: atoi
7 Access control and security issues
    7.1 Controlling local access to corpora
    7.2 Controlling remote access to corpora
A Hardware and operating system requirements
B Reused software packages and copyright notices
    B.1 The regular expression matcher by Henry Spencer

Chapter 1
Overview

1.1 Introduction

The IMS corpus workbench is a set of tools for the efficient encoding, representation and querying of large text corpora. This manual describes how to encode a text corpus and how the various administration tools must be used to transform a text corpus into the representation used by the access tools. This manual does not describe the functionality of the query tools or the architecture of the workbench in general. If the reader is not familiar with the overall architecture, we recommend a "top-down" reading through the more general papers, especially [Christ, 1994] for an overview of the system architecture as a whole.

The internal representation of a corpus consists of a set of files which represent the corpus data, the different "items" (i.e., words) used in the corpus and several index files for efficient lookup. To transform a text corpus from its textual representation to the internal representation used by the IMS workbench, the following steps have to be performed:

1. transformation of the text file in one-word-per-line format;
2. encoding of the text file;
3. declaration of the corpus in a global "registry directory";
4. and building several file indices.

Steps 1 and 3 have to be done manually, for steps 2 and 4 there are tools within the workbench.

The third step, the registration of a corpus, is necessary since almost all tools (but the one used in step 2 above) access a corpus via a symbolic name. When a symbolic name is passed to a tool, it is looked up in a central directory (called the "corpus registry"), where the tool expects to find a file with the very same name as the symbolic name of the corpus to be accessed. This file holds a description of the components of the corpus, mainly a list where the components are stored physically. So, a user does not have to know where a corpus is stored, he or she only has to know its symbolic name to access it.

After a corpus is transformed into its internal representation and registered, it can be used by the various tools of the workbench, for example the query tools (xkwic, cqp, print-aligned).

1.2 The role of the Corpus Administrator

Within the IMS corpus workbench, the corpus administrator has the tasks to provide users with new corpora or to change existing corpora when some information has to be added or updated. Second, the administrator has to properly install the usually large corpus files in the file system and to find an "optimal" place with regard to backup policies, disk usage and access efficiency. Third, the corpus administrator has to care about access control, since corpora exist where copyrights or license agreements inhibit an unrestricted access, even within one institution. These tasks are similar to "standard" system administration. We therefore suggest that the corpus administration tasks are fulfilled by someone who is familiar with the standard UNIX text processing tools, backup strategies, and security and access control, that is, your local system administrator.

1.3 Organization of this manual

This manual is organized as follows. The next chapter 2 explains the internal data structures which are used to store the corpus data. You may skip the entire chapter if you want, it is not necessary for the other chapters, but useful if you have problems with the tools or want to learn how to manipulate the data files. Chapter 3, then, describes the various steps which are necessary to transform a corpus from its textual representation into its internal representation. Chapter 4, then, describes in detail the registry directory and the format of the files which describe the physical attributes of a corpus. Chapter 5 describes how to set up the client-server-capabilities of the workbench. Chapter 6 describes utilities and related tools which either add more (or other types of) information to a corpus or are useful for other purposes, for example for debugging of a corpus (during encoding). Another important point is access control for corpora, which is discussed in chapter 7. To check whether the tools can run at all on your system, you may refer to appendix A for hardware and operating system specific requirements of our tools.

1.4 Credits and Acknowledgements

The internal data structures we use in Cqp, Xkwic and some other tools of the workbench make use of integerized data files and reversed indices. Both of these techniques are well known in the area of information processing for many decades, but to our knowledge the first who applied them to text and corpus processing in the linguistic area was Ken Church. He deserves our greatest thanks for pointing us to these methods.

Neither the authors, nor IMS, nor the University of Stuttgart make any representations about the suitability of the software described herein or the associated documentation for any purpose. It is provided "as is" without express or implied warranty. We disclaim all warranties with regard to the software described herein or the related documentation, including all implied warranties of merchantability and fitness, in no event shall we be liable for any special, indirect or consequential damages or any damages whatsoever resulting from loss of use, data or profits, whether in an action of contract, negligence or other tortious action, arising out of or in connection with the use or performance of this software.

Chapter 2
Internal corpus representation

This chapter explains how corpus data is represented internally. When you understand the internal representation, you can use the tools of the toolbox to create, update or change corpus information without having to go back to the textual version and encode the whole corpus again. You will also be able to figure out how to encode the internal representation for files which cannot be computed by the tools of the toolbox, for example due to memory problems, software bugs or limitations.

If you do not need to "hack" with the corpus data, you may skip the entire chapter. The understanding of the internal representation is not necessary for the other parts of this manual, but useful if you encounter problems with the tools.

2.1 Positional attributes

Within the IMS corpus workbench, a corpus can have an arbitrary number of annotations of different types. In our system, a corpus is primarily regarded as a sequence of words (not as a sequence of characters). The words, then, are numbered, so that we can directly access the word at a certain corpus position p (i.e., the first word in the corpus, or, in general, the nth word of the corpus). This leads to the more general notion of positional attributes, which is the most important annotation type. Attributes of this class have a (string) value at each corpus position. [1]

    position:  0       1       2   3    4      5    ...  n-2       n-1
    word:      Pierre  Vinken  ,   61   years  old  ...  blessing  .
    pos:       N       N       IP  NUM  N      ADJ  ...  N         IP

    Figure 2.1: Corpus positions and values

[1] When the corpus is stored in a verticalized one-word-per-line format, a corpus position can also be regarded as the number of a line in this representation.
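The notion of a positional attribute can be illustrated with a small sketch. It is not part of the workbench: the dictionary `corpus` and the helper `value_at` are hypothetical names, and the sample covers only the first six corpus positions of Figure 2.1 (the elided middle of the sentence is left out).

```python
# Illustrative data only: the first six corpus positions of Figure 2.1.
words = ["Pierre", "Vinken", ",", "61", "years", "old"]
tags = ["N", "N", "IP", "NUM", "N", "ADJ"]

# Hypothetical container: attribute name -> values indexed by corpus position.
corpus = {"word": words, "pos": tags}


def value_at(attribute: str, position: int) -> str:
    """Return the value of one positional attribute at a corpus position."""
    return corpus[attribute][position]


# Every positional attribute has exactly one value per corpus position,
# so all attributes of a corpus have the same length.
assert len({len(values) for values in corpus.values()}) == 1

print(value_at("word", 3), value_at("pos", 3))
```

Because the values are addressed by position rather than by word, the same lookup works for any further attribute (lemmas, morphosyntactic tags) added to the container.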

The corpus text falls within the class of positional attributes, since we can specify, for each corpus position, the word which occurs at that position. The positional attribute which holds the corpus text proper always has the predefined attribute name "word". Other positional attributes are, for example, part-of-speech tags, which are assigned to the words of the corpus. In our view, we regard POS-tags as assigned to a corpus position rather than to the word at that position. Then, the positional attributes "word" and "tag" do not differ very much anymore: both have, for each corpus position, a value which is, in our case, a string (as illustrated in figure 2.1). We therefore use the same internal representation for the word sequence of the corpus (the corpus text) and the tag sequence (the associated POS-tags), as well as for other, additional positional attributes like lemmas, morphosyntactic tags, etc. In other words, a tagged corpus is in our view a set of two positional attributes of equal length, one of which captures the sequence of words, the other captures the sequence of tags. In the following, we therefore use the term "item" to abstract from the type of information encoded in such a positional attribute (here, word vs. tag). A corpus in general, then, is a collection of attributes of different types.

The question of the internal representation of corpora with multiple (positional) annotations can thus be reduced to the question of representing a single positional attribute (remember that all positional attributes must be of equal length, that is, encode equal length item streams).

The two key concepts of the internal representation of a positional attribute are:

- integerized representation: items are encoded as integer numbers, where equal items (words, ...) get the same integer code. The sequence of items is then represented as a sequence of integer numbers;
- inversed file indices: for the sequence of numbers, an inversed file is created. The inversed file captures, for each item (better: item code) the set of occurrences of the item in the positional attribute.

For the construction of the integer code, you normally need a segmentation or tokenization tool, since "the," and "the" are considered different and undesirably get different codes.
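The two concepts, and the effect of missing tokenization, can be sketched as follows. This is only an illustration under stated assumptions: the function names `integerize` and `invert` are hypothetical, codes are assigned in order of first occurrence (one of several possible conventions), and plain Python lists and dictionaries stand in for the binary files of the workbench.

```python
def integerize(items):
    """Assign integer codes in order of first occurrence.

    Returns (lexicon, codes): lexicon[code] is the item string and
    codes is the item sequence rewritten as integers.
    """
    code_of = {}
    lexicon = []
    codes = []
    for item in items:
        if item not in code_of:
            code_of[item] = len(lexicon)
            lexicon.append(item)
        codes.append(code_of[item])
    return lexicon, codes


def invert(codes):
    """Map each item code to the list of corpus positions where it occurs."""
    occurrences = {}
    for position, code in enumerate(codes):
        occurrences.setdefault(code, []).append(position)
    return occurrences


# Without tokenization, "the," and "the" are different items.
raw = "the cat saw the, dog".split()
# After tokenization the comma is an item of its own.
tokenized = ["the", "cat", "saw", "the", ",", "dog"]

lexicon, codes = integerize(tokenized)
print(codes)             # both occurrences of "the" share code 0
print(invert(codes)[0])  # corpus positions of "the"
print(integerize(raw)[1])
```

In the untokenized input every item receives a code of its own, so a search for "the" would miss the second occurrence.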
The advantage of the integer code is that the represented items have equal internal length (in the case of integers, 4 bytes on our machines). Since the length of the item sequence is known and the items are of equal length, the item sequence can be handled like an array of items, with the advantage of random access. The inversed file is needed for lookup: since it directly indexes the set of occurrences of a given item (code), the occurrences can be computed in a single step.

2.1.1 Integerized files

As said above, the first set of data structures is an integerized file of the textual representation of the item sequence. Then, the item sequence is represented as a sequence of integer codes.

It is obvious that at least two functions are needed to handle this encoding:

- first, a function to compute the integer code of a given item (a string);
- second, a function to retrieve the (character) string when the code is given.

These two functions require some auxiliary data structures to be efficiently computable.

The first data structure is the item list or "lexicon": it captures the set of (different) items. Internally, this is the set of strings occurring in the item sequence, where a NULL character (octal \000) is padded at the end of each word. The file is not sorted (but it may be). A UNIX command to produce this file would be:

    (2.1)  sort -u 1wpl-item-seq |
           tr '\n' '\0' > lexicon

where it is assumed that the input item sequence is in one-word-per-line format. In this example, the output would be sorted, but this is not necessary. The item list already defines the item code for each item, since it is assumed that the first item in the item list has code 0, the next one has code 1, and so on.

Note that "traditional" trs delete ASCII 0 (\0) from the input stream, so that the example above will not work with "traditional tr". GNU's tr does not have this bug.

For the lookup of a string in this list, it is useful to have an index of starting positions of the strings in this file. This index gives, for each item code, a mapping from item codes to the file offsets (in bytes) in the item list. Thus, the starting position of the string represented by the item code c is computed in one step via this index, which we call the item list index or lexicon index. Since the strings in the item list are terminated with the NULL character (which must not occur in the items themselves), the string represented by an item code is everything between the starting position computed by the item index up to the next NULL character.

Again, this index can be computed by a UNIX command when the lexicon is already computed:

    (2.2)  tr '\0' '\n' < lexicon |
           gawk 'BEGIN {pos = 0}
                 {print pos; pos = pos + length($1) + 1}' |
           atoi > lexicon.idx

atoi is a utility program included in the toolbox and maps numbers (represented textually as a sequence of digits) to their internal representation.

The next data structure supports the mapping from strings to their item codes. This could be done in a number of different ways; currently, binary search over a sorted string index is implemented. [2] For example, the same functionality could be achieved with tries, which perhaps would need more space, but could compute the conversion in n steps where n is the length of the input string.

[2] The method currently implemented in the workbench is very simple and could be sped up a lot, but since it is rarely used (all computations are done on the item codes, whenever possible, instead of strings), we didn't yet convert it to a more efficient method.
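The item list and its index, as produced by commands 2.1 and 2.2, can be sketched in memory as follows. Several details are assumptions rather than facts from this manual: integers are packed as 4-byte big-endian values (the manual only states that integers take 4 bytes), items are encoded as Latin-1, and Python's `sorted` stands in for `sort -u`, whose ordering depends on the locale. The names `build_lexicon` and `string_of` are hypothetical.

```python
import struct

# Assumption: internal integers are 4-byte big-endian signed values.
INT = struct.Struct(">i")


def build_lexicon(items):
    """Return (lexicon, index) as bytes objects.

    lexicon holds the distinct items, each terminated by a NUL byte;
    index holds, for each item code, the byte offset of its string.
    """
    lexicon = bytearray()
    offsets = []
    for item in sorted(set(items)):
        offsets.append(len(lexicon))
        lexicon += item.encode("latin-1") + b"\0"
    index = b"".join(INT.pack(offset) for offset in offsets)
    return bytes(lexicon), index


def string_of(code, lexicon, index):
    """Look up the string for an item code via the item list index."""
    (start,) = INT.unpack_from(index, code * INT.size)
    end = lexicon.index(b"\0", start)  # the string ends at the next NUL
    return lexicon[start:end].decode("latin-1")


lexicon, index = build_lexicon(["the", "cucumber", "never", "the", "wine"])
print(string_of(2, lexicon, index))
```

The lookup needs one index access and one scan to the next NUL byte, which is the "one step" access described above.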

The binary search requires a sorted structure. For this purpose, we do not keep a sorted item list, but rather another index (denoted by Ls) which holds, for each position p of an item in a "virtual" sorted item list (ranging from 0 to the number of encoded items minus 1), the item code at this position. So, Ls(0) is the item code of the "smallest" item, and Ls(1) is the code of the second-smallest item, etc. The sorted item list can thus be textually printed by the function

    for (i = 0; i < "size of item set"; i++) {
        code = sortidx[i];
        s = lexidx[code];
        print s;
    }

Here, for each possible position in the sorted index, i, first the item code code at that position is computed. Then, through accessing the item index, the character string represented by code is determined, which is then printed.

As you can imagine, this file can easily be produced by a UNIX command:

    (2.3)  tr '\0' '\n' < lexicon |
           gawk '{print NR-1 "\t" $1}' |
           sort -k 2 |
           gawk '{print $1}' |
           atoi > lexicon.srt

The first line computes the strings from the item list, which are then prefixed by their code (which is the "position" in the item list), beginning with code 0 for the first word. This list of code/value pairs is then sorted by the values, which occur in the second column. The output of the sorting is then filtered, so that only the codes are printed. The code sequence is then transformed into the internal format and written to the index file.

Note: One of the reasons we do not use these UNIX commands to create the data structures is that the UNIX sort command sometimes handles the order of 8-bit characters differently from other programs (it works with signed characters, whereas internally we work with unsigned characters). So the UNIX commands which use sort only create the same files when the standard 7-bit ASCII character set is used. The internal functions which map from items to item codes will not work properly otherwise.

The data structures used to represent the encoded item sequence and the associated auxiliary data structures which facilitate the necessary mappings are illustrated in figure 2.2.

The only file for which we didn't yet give a UNIX command is the item sequence (or better, the sequence of encoded items). gawk's arrays can be used for this purpose. One possibility is to read an already existing item list file, which may be produced by the commands above. Another possibility is to produce the item list and the indices in a single gawk run. The script below can be used for this second purpose, but it assigns other item codes:

    (2.4)  BEGIN {
             maxcode = 0;
             position = 0;
           }
           {
             if (!($1 in itemlist)) {
               # new item: append it to the item list and its index
               print $1 > "lexicon.asc"
               itemlist[$1] = maxcode;
               print position > "lexicon.idx.asc"
               code = maxcode;
               maxcode++;
               position = position + length($1) + 1;
             }
             else
               code = itemlist[$1];
             print code > "corpus.asc"
           }

Oliver Christ, IMS Stuttgart, August 9, 1996

Figure 2.2: Integerized items and associated data structures (item sequence, item list index, item list, sorted index; item index ==> string)

After the code is executed with a text file as input, the ASCII representations have to be converted into the internal format (this could be done via pipes in the gawk script also, but we left that out here for the sake of clarity):

    (2.5)  atoi < corpus.asc > corpus
           tr '\n' '\0' < lexicon.asc > lexicon
           atoi < lexicon.idx.asc > lexicon.idx
           rm -f *.asc

After that, command 2.3 can be used to produce the sorted item list index.

2.1.2 Inversed item sequence

The second set of data structures concerns the inversed file index associated with the item sequence. This inversed file holds, for each item code, the set of positions in the item

sequence where the item code occurs. Through the mapping functions introduced in the last section, we can also regard the inversed item sequence as a list of corpus positions where a certain word or part-of-speech tag occurs.

The inversed file is represented by a set of three files:

- first, the inversed file itself, which contains a set of corpus positions;
- second, an index into this file. This index returns, for each item code, the start point of the associated occurrences in the inversed file;
- third, a table of item code frequencies, which gives, for each item code, the number of occurrences of the code in the corpus (which is, of course, equal to the size of the set of occurrences).

The three files can also be computed by UNIX commands. First, the reversed sequence is produced by the following command:

    (2.6)  itoa corpus |
           gawk '{print $1 "\t" NR-1}' |
           sort -ns |
           gawk '{print $2}' |
           atoi > corpus.rev

First, the internal representation of the item sequence is converted into readable numbers. Each number in this sequence is then suffixed with its position in the corpus, and the resulting code/position pairs are sorted by the code. From this sequence, only the positions are kept, which gives exactly the inversed file.

The frequencies can already be computed in the gawk encode script (2.4), but another possibility is a slightly modified version of the script above:

    (2.7)  itoa corpus |
           gawk '{print $1 "\t" NR-1}' |
           sort -ns |
           gawk '{print $1}' |
           uniq -c |
           gawk '{print $1}' |
           atoi > corpus.cnt

Here, we keep the code sequence of the code/position pairs. This sequence of codes appears in sorted order. By the call to the uniq utility, equal subsequent lines (here: codes) are collapsed into only a single line and counted. These counts are then extracted and converted to internal format. [3]

The last file, the index into the inversed file, can simply be computed from the frequencies by summing them up:

    (2.8)  itoa corpus.cnt |
           gawk 'BEGIN {pos = 0} {print pos; pos += $1}' |
           atoi > corpus.rdx

[3] It would be more efficient to use a gawk array to hold the item code counts, since the sort step could be omitted. The version here is just for clarity.
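The three files of the inversed index can also be derived in one pass over the code sequence without sorting, in the spirit of footnote 3. The sketch below is an illustration, not the workbench implementation: `invert_sequence` is a hypothetical name, and Python lists stand in for the files corpus.rev, corpus.cnt and corpus.rdx.

```python
def invert_sequence(codes, n_items):
    """Compute (rev, freq, start) for an integerized item sequence.

    freq[c]  : number of occurrences of code c     (cf. corpus.cnt)
    start[c] : offset of the positions of c in rev (cf. corpus.rdx)
    rev      : corpus positions grouped by code    (cf. corpus.rev)
    """
    freq = [0] * n_items
    for code in codes:
        freq[code] += 1

    # Running sum of the frequencies, as in command 2.8.
    start = []
    pos = 0
    for count in freq:
        start.append(pos)
        pos += count

    # Fill each code's slice in corpus order; this yields the same result
    # as the stable numeric sort in command 2.6.
    rev = [0] * len(codes)
    fill = list(start)
    for position, code in enumerate(codes):
        rev[fill[code]] = position
        fill[code] += 1
    return rev, freq, start


C = [2, 0, 2, 1, 3, 2]
R, F, IR = invert_sequence(C, 4)
print(R[IR[2]:IR[2] + F[2]])  # corpus positions where code 2 occurs
```

The slice `R[IR[c] : IR[c] + F[c]]` is the complete set of occurrences of code c, obtained without scanning the item sequence.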

Now, the whole set of seven files representing the data of a positional attribute (which we call the seven components of a positional attribute) has been created. In the toolbox, there are tools which perform these steps much faster than the shell scripts presented here. But in some cases, the utilities of the toolbox run into memory problems, and then these scripts may help to produce the encoded version of a corpus.

Figure 2.3: Reversed file indices (index into reversed file, reversed file, number of item occurrences; item index ==> set of occurrences)

The components associated with the inversed file and their meanings are illustrated in figure 2.3. The next section will show the single steps which are taken when a simple search for an item, a word for example, is performed.

2.1.3 Example: a simple word search

After the internal data structures have been introduced, we can compute the concordance for a single item, for example the word "the". Most data structures can be treated as an array, so we use the symbols

- C for the item sequence (accessed by C[i] where i is a corpus position). The elements of C are item codes;
- R for the reversed item sequence (accessed by R[i] where i is an index into this sequence, computed from IR below). The elements of R are corpus positions;
- IL for the item list index (accessed by IL[c] where c is an item code). The elements of IL are byte offsets into the item list;
- F for the item frequency table (accessed by F[c] where c is an item code). The elements of F are item frequencies in C;
- IR for the reversed item sequence index (accessed by IR[c] where c is an item code). The elements of IR are pointers (offsets) into R;
- L for the item list (an array of characters, only accessed by offsets of IL);
- SL for the sorted item list index (accessed by SL[i] where i is a position in the "virtual" sorted item list). The elements of SL are item codes.

These seven arrays are the components of a positional attribute.

For computing the set of occurrences of a textual item in C, the following steps have to be taken (also illustrated in figure 2.4 for the word "the"):

- first, the item code c(i) of item i has to be determined. For this purpose, the sorted item index SL is consulted and searched with binary search until the item code is found;
- if the item code could be determined, the reversed item sequence index is consulted to determine the starting position rs(i) = IR[c(i)] of the position set associated with i in the reversed item sequence;
- second, the item frequency list is accessed to compute the "length" of the position set f(i) = F[c(i)];
- then, the set of occurrences P(i) is the set of positions stored in the reversed item sequence R starting at rs(i) with length f(i) (R[rs(i)] ... R[rs(i) + f(i) - 1]).

Figure 2.4: A simple word search (sorted index, word list, index into reversed corpus, number of item occurrences, reversed corpus, corpus, concordance element)

The task of computing the set of occurrences of i in the item sequence is then completed. Note that the item sequence itself didn't have to be accessed.

For computing the concordance and printing it, though, the item sequence C must be consulted. When cl is the left display context (in terms of items) and cr is the right context, for each p in P(i) the "subsequence" between [p - cl, p + cr] in C must be computed (in the bounds (0, |C| - 1)). For each item k in this subsequence, the associated (textual) item must be determined by computing the start position ts(k) = IL[k] of k in the item list index. Then, the item list can be consulted to get the string s(k), which then is printed.

2.1.4 The set of positional attributes

The IMS Corpus Toolbox supports an arbitrary number of positional attributes. Each positional attribute has its own set of components. For each positional attribute, the lengths of C and R are equal (see also figure 2.5):

    |C| = |R|
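The word search of section 2.1.3 can be sketched on top of the seven components. The class below is an illustration under assumptions, not the workbench code: `PositionalAttribute` and its methods are hypothetical names, all components are held in memory as Python lists (L as a string with NUL terminators), and item codes are assigned in order of first occurrence, as in script 2.4.

```python
class PositionalAttribute:
    """In-memory sketch of the seven components C, L, IL, SL, F, IR and R."""

    def __init__(self, items):
        codes, lexicon = {}, []
        self.C = []                          # item sequence (item codes)
        for item in items:
            if item not in codes:
                codes[item] = len(lexicon)
                lexicon.append(item)
            self.C.append(codes[item])
        self.L, self.IL = "", []             # item list and its offset index
        for item in lexicon:
            self.IL.append(len(self.L))
            self.L += item + "\0"
        # Sorted index: item codes in the order of their strings.
        self.SL = sorted(range(len(lexicon)), key=lambda c: lexicon[c])
        self.F = [0] * len(lexicon)          # item frequencies
        for code in self.C:
            self.F[code] += 1
        self.IR, pos = [], 0                 # start offsets into R
        for count in self.F:
            self.IR.append(pos)
            pos += count
        # Reversed sequence: positions grouped by code (sorted() is stable).
        self.R = sorted(range(len(self.C)), key=lambda p: self.C[p])

    def string_of(self, code):
        start = self.IL[code]
        return self.L[start:self.L.index("\0", start)]

    def code_of(self, item):
        """Binary search over the sorted index SL; None if item is unknown."""
        lo, hi = 0, len(self.SL) - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            candidate = self.string_of(self.SL[mid])
            if candidate == item:
                return self.SL[mid]
            if candidate < item:
                lo = mid + 1
            else:
                hi = mid - 1
        return None

    def occurrences(self, item):
        code = self.code_of(item)
        if code is None:
            return []
        rs = self.IR[code]
        return self.R[rs:rs + self.F[code]]  # C itself is not touched

    def concordance(self, item, cl, cr):
        lines = []
        for p in self.occurrences(item):
            lo, hi = max(0, p - cl), min(len(self.C) - 1, p + cr)
            lines.append(" ".join(self.string_of(k) for k in self.C[lo:hi + 1]))
        return lines


pa = PositionalAttribute("never the cucumber ; the wine & the cucumber".split())
print(pa.occurrences("the"))
print(pa.concordance("the", 1, 1))
```

Only `concordance` reads the item sequence C; `occurrences` is answered from SL, IL, L, IR, F and R alone, and the invariant |C| = |R| holds by construction.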

Likewise, the lengths of IL, IR, SL and F are equal:

    |IL| = |IR| = |SL| = |F|

Furthermore, between all positional attributes associated with a corpus, the lengths of the item sequences of these attributes must be equal:

    |C_word| = |C_lemma| = |C_pos| = |C_syn| = ...

Of course, no such condition usually holds between the other components of two positional attributes.

Figure 2.5: The set of positional attributes (each of the attributes word, pos and lemma has its own item sequence, reversed item sequence, item frequencies, index for RC, index for IL, sorted index and item list)

2.2 Other attribute types

2.2.1 Structural attributes

Structural attributes capture information about boundaries of sentences, paragraphs, phrases, or other entities. Internally, these structures are represented as intervals of corpus positions, which are the start and end point (inclusive) of the structure. Such an interval is a pair of corpus positions. Therefore, each structural item needs 8 bytes of storage internally (4 bytes, the size of an integer, for each of the two positions).

Currently, there are two limitations with respect to structural attributes:

- first, the intervals must not be recursive (for example, embedded NPs in NPs);
- and they must not be overlapping.

Figure 2.6: Structural attributes (the positional attributes word and pos of figure 2.1, with the structural attributes s and paragraph marked as intervals of corpus positions)

Figure 2.6 illustrates the representation of structural attributes. The number of structural attributes associated with a corpus is not limited.

Creating structural attribute data

Unlike positional attributes, the data for a structural attribute is stored in a single file, S. Normally, the structural attribute data is created with the encode utility. But in some cases, it is useful to manipulate or create the files through other utilities. The data file S is an array of integer pairs, where |S| is the number of intervals. The file size of S is then 4 * 2 * |S| bytes, since for each interval, two integer numbers have to be stored, each of which needs 4 bytes.

If the files are constructed manually (without the help of encode, for example, after the corpus has been encoded), a simple awk script can help. You must, however, be aware of the internal representation of positional attributes and the "logics" of corpus positions; second, some pitfalls have to be circumvented.

Let's assume that a one-word-per-line input file with marked sentence boundaries (in SGML-style, like <s> ... </s>) is available. Then, the intervals can be extracted by the following awk script (the output of which has to be converted into internal integer format with atoi):

    (2.9)  BEGIN {
             structure = "s"
             opentag = "<" structure ">"
             closetag = "</" structure ">"
             open = 0;
             position = 0;
           }
           {
             if ($1 == closetag) {
               # closing tag, don't increment position.
               if (open) {
                 print position-1;  # thanks to a3@wsserv.vdl.nl (Adri Verhoef)
                 open = 0;
               }
               else {
                 print "closing non-open group at line " NR ": " $0 >> "/dev/stderr"
                 exit
               }
             }
             else if ($1 == opentag) {
               # tag, don't increment position then
               if (open) {
                 # forgot to close group, which we don't consider an error
                 print position-1
               }
               print position
               open = 1
             }
             else if ($1 ~ /<\/?[a-zA-Z]+>/) {
               # do nothing, other structural tag?
             }
             else
               position++;
           }
           END {
             if (open)
               print position-1
           }

First, care must be taken when groups are closed which are not open. The other case, reopening open groups, is not considered an error, since closing tags are optional. Additionally, when structure tags are used in the text, the line number (position) must not be incremented. But even then, this awk program may yield errors. So, at least check whether the size of the resulting file can be divided by 8.

2.2.2 Alignment attributes

2.2.3 Bigram and mapping tables

2.3 External tools and dynamic attributes

Figure 2.7: Alignment attributes (the positional attributes word and pos of figure 2.1, aligned with the word attribute of a second corpus: Pierre Vinken , 61 , wird ...)

Figure 2.8: Bigram and mapping tables

Figure 2.9: External (dynamic) attributes (value request, pipe() invocation, data access module, value passing, value check/conversion, external tool, value computation, value return)

Chapter 3

Encoding: Transforming a corpus into its internal representation

Within the IMS corpus workbench, a corpus can have an arbitrary number of annotations of different types. In our system, a corpus is primarily regarded as a sequence of words (not as a sequence of characters). The words, then, are numbered, so that we can talk about the word at a certain corpus position p, the first word in the corpus, or, in general, the nth word of the corpus. This leads to the more general notion of positional attributes, which is the most important annotation type. Attributes of this class have a (string) value at each corpus position.[1]

The corpus text falls within the class of positional attributes, since we can specify, for each corpus position, the word which occurs at that position. The positional attribute which holds the corpus text proper always has the predefined attribute name "word". Other positional attributes are, for example, part-of-speech tags, which are assigned to the words of the corpus. In our view, we regard POS tags as assigned to a corpus position rather than to the word at that position. Then, the positional attributes "word" and "tag" do not differ very much anymore: both have, for each corpus position, a value which is, in our case, a string. We therefore use the same representation for the word sequence of the corpus (the corpus text) and the tag sequence (the associated POS tags). In other words, a tagged corpus is in our view a set of two corpora of equal length, one of which captures the sequence of words, the other captures the sequence of tags. In the following, we therefore use the term "item" to abstract from the type of information encoded in such an attribute (here, word vs. tag). A corpus, then, is a collection of attributes of different types.

In the following section 3.1, we briefly describe the internal representation of a corpus. Section 3.2 describes the steps which are necessary to prepare a textually represented corpus to be suitable as input for the encoding tools, as well as the first of the two encoding tools, encode. Section 3.3 then describes the second encoding tool, which is used to build the indices associated with a corpus.

[1] When the corpus is stored in a verticalized one-word-per-line format, a corpus position can also be regarded as the number of a line in this representation.

3.1 The internal representation of a corpus

After encoding, each item of a textual corpus is represented as a unique integer value.[2] For example, if the first item of a text corpus is "The", all "The"s in the text corpus will internally be represented as the integer number 0.[3] The corpus can then be physically represented as a sequence of integer numbers. To be able to get the item which is represented by an integer, another (indexed) file holds the mappings from integers to strings. Up to now, 3 files are necessary to hold the information: the corpus (consisting of a sequence of integers), the "lexicon" which holds, for each integer, the string it represents, and an index to the lexicon. These three files are those which are produced within the second step during the encoding of a corpus. The tool which performs this task is called encode and is described below in section 3.2.

Two additional files are built next: the first holds information about the sorted sequence of the items in the lexicon, which is necessary to efficiently compute the integer code of an item, given the string. The other file holds, for each item, the number of times it occurs in the corpus.

For efficient lookup, a reversed file or reversed index has to be built. This index holds, for each item in the corpus, the corpus positions where the item occurs. This index is indexed itself, leading to another file. In summary, we have seven files so far which represent the information of one positional attribute. The four additional files which are not built by the encode program are created with the makeall program described in section 3.3. But prior to the encoding of a corpus, it has to be transformed into a format which is suitable as input for the encode program, which is described in the next section.

[2] The internal corpus representation we use is highly inspired by an (unfortunately unpublished) draft paper by Ken W. Church, "A set of Unix tools for processing large text corpora".
[3] Of course, "the" will get another code than "The", since "The" and "the" textually differ and are not considered equal.

3.2 The encode program

When a corpus consists of several positional attributes (for example, a POS attribute additionally to the standard "word" attribute), it can either be encoded in one single step (provided that it is in a suitable textual input format for encode) or the various positional attributes can be encoded one after another and be added to an already existing corpus. This latter way is also useful when one of the positional attributes has been changed, for example, when a tagset has been changed or a better tagger was available to produce more accurate tag assignments.

In both cases, the input format is a one-word-per-line format, where each item which is to be encoded occurs in a single line. This line may contain blanks, providing a way to encode adjacent multi-word lexemes, if desired. But care should be taken to avoid blanks at the end of an item, since, for example, "the" and "the " are different strings and therefore are encoded with different codes, which can lead to undesired effects when not all occurrences of "the" are found in a text due to a blank at the end of some.

In the case of the corpus text, the input may look as follows:

    Pierre
    Vinken
    ,
    61
    years
    old
    ,
    will
    join
    the
    board
    as
    a
    nonexecutive
    director
    Nov.
    29
    .

Another file, then, may hold the sequence of assigned tags (in which case both files must hold the same number of lines). The input format can, for example, be produced out of a raw text file with the tr command[4]:

    tr ' ' '\n' < text_file > 1wpl-file

[4] The tr command is a standard command available on many platforms and is not part of this toolset.

This command replaces all blanks in the input file with line breaks. The encode program then takes a one-word-per-line input file (or reads that format from stdin) and creates the three files corpus, lexicon and lexicon.idx in the current directory:

    encode -t 1wpl-file

The -t option instructs encode to read its input from the file given as an argument of the option instead of reading the standard input. To do both the transformation and the encoding in a single step, one could enter the following pipe:

    tr ' ' '\n' < text_file | encode

or, if one wants to hold the text in a compressed format:

    zcat text_file.gz | tr ' ' '\n' | encode

In general, the one-word-per-line format can be produced by any program you like. In the simple tr examples above, it is already assumed that the corpus has been tokenized (special characters have been separated from the words); perhaps even a sentence boundary detector must be run on a raw corpus to produce the appropriate input file. The simple example only shows how encode processes its input; in general, this method is not applicable to new, raw corpora, which first may have to be preprocessed by other tools.

Now, encode may also take an annotated corpus with several positional attributes in a single file. In this case, each line of the input format consists of a number of attribute values, separated by tabulator characters. Thus, the input file consists of several columns, each of which denotes one positional attribute. A POS-tagged text then may look as follows:

    Pierre<tab>NP
    Vinken<tab>NP
    ,<tab>,
    61<tab>CD
    years<tab>NNS
    old<tab>JJ
    ,<tab>,
    will<tab>MD
    join<tab>VB
    the<tab>DT
    board<tab>NN
    as<tab>IN
    a<tab>DT
    nonexecutive<tab>JJ
    director<tab>NN
    Nov.<tab>NP
    29<tab>CD
    .<tab>SENT

where <tab> denotes a single tabulator character (ASCII value 9).[5] encode must then know which positional attribute is represented in the other columns of the file and how they shall be named. This is done with the -P option:

    encode -t <input-file> -P pos

[5] There are no general hints on how to produce this input format. In general, it is a good idea to use standard tools like awk and sed.

Here, the option -P pos ("P" for "positional attribute") instructs encode to treat the second column in the file as the sequence of values of the pos attribute. The files associated with the pos attribute have their names prefixed with pos (which leads to pos.corpus, pos.lexicon and so on). The order in which the -P options are given is relevant, since the first -P option denotes the name of the attribute represented in the second column in the input file, the second -P option denotes the third column etc. By default, the first column is treated as the word sequence and therefore gets the prefix word, but this can be overridden with the -p option. Please refer to the manual page of encode for details.

Up to 32 positional attributes can currently be encoded in a single step. If large amounts of text are to be encoded, try to determine the disk space the corpus needs after encoding in advance and look for a file system where enough space is available. Some hints on the expected size are given in section 3.4. As explained in chapter 4, it is possible to split the files of a corpus between several file systems in case there isn't enough space on a single disk. If all else fails, you may have to encode the set of positional attributes in several runs of encode, each with the appropriate prefix passed with the -p option.

A corpus (or an arbitrary positional attribute) may be assigned another type of information which can be encoded with the encode program, namely structural information which can be used to represent article, sentence or paragraph boundaries. This kind of information is represented in the input file with SGML-like markers:

    <article>
    <s>
    Pierre<tab>NP
    Vinken<tab>NP
    ,<tab>,
    ...
    29<tab>CD
    .<tab>SENT
    <s>
    </s>
    </article>

Of course, structural information can be encoded independently of additional positional attributes in the file. Some points have to be noted:

* in a line with a structure marker (s, article), no values of positional attributes may occur;
* the end tags may be omitted. In that case, a structure spans all items until the start of the next structure or the end of file;
* in a structure marker line, everything after a blank or after the closing angle bracket (>) of the tag is neglected;
* structures must not be recursive or overlapping, that is, trees cannot be represented.

In the above example, the call for encode would look like this:

    encode -t <input-file> -P pos -S article -S s

Here, the two encoded structural attributes are each declared with the -S (for "structural attribute") option. The order in which the structural attributes are declared does not matter.

Be careful to declare all structural attributes in the encode call, since undeclared structural attributes are considered as simple attribute values in the first column and therefore are treated literally. If you are encoding several positional attributes at once, an error will occur, since lines with undeclared structural attributes in the first column in general do not have tabulator-separated columns after them. The rule is simple: a line consists either of a structure attribute designator in angle brackets or of a fixed number of tabulator-separated columns. Errors may occur if this rule is not obeyed.

Again, up to 32 structural attributes can currently be encoded in a single step.

encode accepts a number of further options:

* -p <prefix> has the effect that the files belonging to the positional attribute in the first column of the input file will get the prefix "prefix.". Note that the dot at the end of the prefix is added automatically and must not be part of the option value. Normally, the files get the name corpus[.cnt,.rev,.rdx] and lexicon[.idx,.srt]. This will lead to problems if a positional attribute is encoded after a corpus is already present in the directory the data is written to, since the names of the corpus files may collide with those of the new attribute. Therefore, when the prefix is not given, data may be overwritten and lost. This option affects only the names of the files for the positional attribute in the first column of the input file; the other positional attributes (if present) will get the prefix given with the -P option;

* -d <path> lets the user specify the directory in which the data shall be written. The path should not end with a slash. Default is to write all output files to the current directory;
* -s instructs encode to skip empty lines (lines with no characters, not even blanks, in it) during encoding.

Please refer to the encode manual page for the most recent description of the program.

Please be aware that encode writes its output files into the current directory unless the -d option is specified. Be careful when you add (or update) a positional attribute to a corpus after encoding the corpus itself: a loss of data may occur during the encoding of the positional attribute, since it tries to write the data to the same files in which the corpus data may already be stored. Either you should pass the -p option to prefix the files belonging to the first column or put each positional attribute in a directory of itself to prevent encode from overwriting important data. It is a very good idea to change the file access mode of all files belonging to a positional attribute to non-writeable for anyone after encoding, in order to prevent accidental overwriting.

3.3 The makeall program

After encoding a corpus, each positional attribute has to be declared in the corpus registry. Please refer to chapter 4 for a detailed description of how to do this. The makeall program, which constructs the second set of files during the encoding process, will not work on undeclared positional attributes or corpora.

After a positional attribute is declared in the registry, makeall must be run to construct the necessary index files. There are no options, and the only required argument makeall accepts is the symbolic name of the corpus for which the index files shall be produced:

    makeall treebank

This will produce all missing files for all positional attributes declared for the corpus treebank. If you only want to produce the files for a single positional attribute, give the name of the attribute as an additional argument:

    makeall treebank pos

(given that treebank and pos are the symbolic names of the corpus and of the positional attribute, respectively). Note that this call can be issued from any point in the file system, since makeall looks up in the registry to find the corpus data. Data is only written to the directory specified in the registry description file. It is therefore a good idea to run makeall either on a very fast machine or on a machine which locally holds the disks the corpus is stored on (to reduce network load in case of NFS-mounted file systems).
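To make the division of labour between encode and makeall more tangible, the seven kinds of information kept for one positional attribute (section 3.1) can be imitated with in-memory Python lists. This is a conceptual sketch only: the function names are hypothetical, the lists stand in for files, and the real on-disk formats are not reproduced.

```python
def encode_items(items):
    """Conceptual counterpart of encode: integer corpus plus lexicon."""
    lexicon, codes, corpus = [], {}, []
    for item in items:
        if item not in codes:
            codes[item] = len(lexicon)  # first occurrence gets the next code
            lexicon.append(item)
        corpus.append(codes[item])
    return corpus, lexicon


def build_indices(corpus, lexicon):
    """Conceptual counterpart of makeall: frequencies, sort index, reversed index."""
    freqs = [0] * len(lexicon)
    positions = [[] for _ in lexicon]
    for pos, code in enumerate(corpus):
        freqs[code] += 1
        positions[code].append(pos)
    # item codes ordered by the string they stand for
    sort_index = sorted(range(len(lexicon)), key=lambda code: lexicon[code])
    return freqs, sort_index, positions


if __name__ == "__main__":
    corpus, lexicon = encode_items(["the", "board", "The", "board"])
    print(corpus, lexicon)
    print(build_indices(corpus, lexicon))
```

Note that "the" and "The" receive different codes, as footnote 3 points out, and that the reversed index lists the corpus positions of every item.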

Note: "makeall" will currently try to create non-compressed files for attributes which already have the complete data in compressed form. This is a bug and will be fixed in a future release.

After running makeall on all positional attributes of a corpus, the corpus is ready for use.

makeall currently produces a lot of debugging output. This output is only important when errors occur or when the program has to be debugged.

When trying to encode really big corpora (20 million words and above), encode and makeall may have problems due to memory or swap space limitations and yield error messages like "can't allocate memory" or "not enough memory". These problems can only be solved by providing more swap space to the machine the program is running on (or by running encode/makeall on another machine at your site which has enough swap space). The available swap space is shown with the pstat -s command and should be enough to hold the whole corpus data (see below in section 3.4). Ask your system administrator for further help.

3.4 Space requirements

This section gives some hints on how much space will be used by an encoded attribute. Currently, we only cover positional and structural attributes here.

3.4.1 Positional attributes

Let A be the sequence of items captured by the positional attribute. The length of this sequence, |A|, therefore is the number of elements of this sequence. Further, let D be the set of distinct strings encoded in A (the list of different words, for example). Then, |D| is the number of distinct words.

Two numbers are important:

* the number of items in the input file (|A|), which is equal to the number of lines in the one-word-per-line input file for encode (minus the number of structural annotation markers, if present). This number can, for example, be computed with the wc -l command (maybe pre-piped with a grep -v command to get rid of the structural annotation markers);
* the number of distinct items in the input file (|D|) (this number can be computed by running the pipe sort -u | wc -l over the respective column in the one-word-per-line input text file).

Another important number is the space needed for the one-time representation of all different items, which here is denoted S. This is the sum of the lengths of each different word plus one:

\[ S = \sum_{s \in D} (\mathrm{strlen}(s) + 1) = |D| + \sum_{s \in D} \mathrm{strlen}(s) \]

The adding of 1 is necessary since a null character ('\0') is added to each string. The number is given by running the pipe sort -u | wc -c over the respective column in the input file.[6]

Now, the size of one positional attribute (in bytes) can be computed as follows:

\[ M_p = 2(|A| \cdot 4) + S + 4(4 \cdot |D|) = 8|A| + 16|D| + S \]

The size of the input text file does not go into this formula, since it is not needed anymore after encoding.

For each positional attribute, this formula has to be evaluated again, since the number of different items in the attribute (|D|) and the space needed to represent them once (S) usually differ between several positional attributes (|A| must be constant for any two positional attributes of a corpus). Since positional attributes other than the word attribute usually have a small number of distinct values, the space needed to represent a positional attribute mainly depends on its length, which is not very surprising. A less accurate estimate of the space requirement can be obtained by multiplying the size of the uncompressed input text file(s) by 2.

3.4.2 Structural attributes

The data of a structural attribute, for example sentence boundaries, is stored in a single file, as an ordered sequence of corpus intervals (that is, pairs of corpus positions). So, the computation of the space needed to represent the information of one structural attribute is very simple: let S be the structural attribute. Then, |S| is the number of intervals stored in this attribute ("the number of sentences"), each being a pair of two corpus positions. Since each corpus position is stored as a 4-byte integer number, the space can be computed as follows:

\[ M_s = (|S| \cdot 4 \cdot 2) = 8|S| \]

So, if you want to represent 1000 sentences, you need 8000 bytes to store the data.

[6] The columns can be extracted from a multi-column file with the cut program which is contained within GNU's/FSF's set of text utilities, or with the awk utility.

Oliver Christ, IMS Stuttgart, August 9, 1996
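The two formulas of section 3.4 are easy to evaluate before encoding. The following sketch assumes that the items of one column have already been read into a Python list; the function names are hypothetical, and string lengths are counted as UTF-8 bytes, which is an assumption about the input encoding.

```python
def positional_attribute_bytes(items):
    """M_p = 8|A| + 16|D| + S for one positional attribute."""
    distinct = set(items)
    # S: every distinct string once, plus one byte for its null terminator
    s = sum(len(item.encode("utf-8")) + 1 for item in distinct)
    return 8 * len(items) + 16 * len(distinct) + s


def structural_attribute_bytes(n_intervals):
    """M_s = 8|S|: two 4-byte corpus positions per interval."""
    return 8 * n_intervals


if __name__ == "__main__":
    print(positional_attribute_bytes(["the", "board", "the"]))
    print(structural_attribute_bytes(1000))
```

For the three-item list, |A| = 3, |D| = 2 and S = 4 + 6 = 10, so the estimate is 24 + 32 + 10 bytes.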

Chapter 4

The corpus registry

The corpus registry holds, for each corpus being processed by the workbench, a file which describes where in the file system the various files which build the corpus are stored. Additionally, the registry holds a file which describes who can access a local corpus from remote hosts and a file which captures a log of all remote connections to local corpora. This chapter only describes the description files for local corpora and the description files for remote corpora; the two additional files which are necessary for remote connections are described in chapter 5.

The registry is simply a globally accessible directory, called the registry directory, and holds, for each corpus, a file with the same name as the corpus name, in lowercase letters.[1] Such a file is called the registry file of the corpus. The contents of the file, described in section 4.2 below, define which annotations are associated with the corpus and where the data is stored. An annotation which is not defined in the registry file of a corpus cannot be accessed by any of the tools. Similarly, when an attribute is defined in the registry, it is supposed to be accessible by all tools. It is within the responsibility of the corpus administrator (you!) to assure that all annotations defined in all registry files are accessible, and that only those attributes are defined which are in fact accessible.

Currently, there are some rules how to name the attributes of a corpus and how to name the respective registry files. These rules are described in the following section.

As already mentioned in section 1.2 above, corpora exist where access has to be controlled. This issue is discussed in chapter 7 below.

At the end of this chapter, section 4.5 summarizes the single steps which you should follow when preparing and registering a new corpus.

[1] In Cqp, all corpus names are entered in uppercase, but they are converted to lowercase to load the correct corpus.

4.1 Some remarks about nomenclature

Two simple rules have to be obeyed for the definition of names for corpora and attributes:

* corpus and attribute names must begin with a lowercase letter, and may be followed by an arbitrarily long sequence of lowercase letters or digits.

By default, the tools expect the registry directory to be /corpora/c1/registry. Since at your site this directory most probably does not exist, the default value can be overridden by the environment variable CORPUS_REGISTRY. Please do not add a slash at the end of the value of this variable. We suggest that either each user of the tools sets this environment variable in his or her .tcshrc or .cshrc shell initialization file in his/her home directory, or that it is set in one of the global shell initialization files, which usually reside in /etc and are only writeable by the system administrator or the superuser. Please ask your system administrator for further help in case you shouldn't know where or how to set the variable.

Additionally, almost all tools of the toolbox take the -r command line option to specify a non-default registry directory.

4.2 The contents of a registry file

In the registry file, all attributes of a corpus are declared. Additionally, some "global" variables are set.

A registry file may contain empty lines. A comment begins with a hash mark (#); everything up to the end of the line is not read.

The format of a registry file is:

    <Header information and global variables>
    <Attribute definitions>

This order has to be kept in all registry files. In the attribute definition section, the attributes may be declared in any order.

4.2.1 The header

The header consists of 4 declarations of values, each of which is preceded by the field name. The field names (keywords) are all uppercase. The fields are

* a short (one-line) description of the corpus (keyword NAME). The field value is a string enclosed in double quotes;
* a unique identifier (keyword ID). The field value is a symbol. Usually, the field value should be the same as the file name of the registry file;
* optionally, the "home directory" of the corpus (keyword HOME). The field value is a path (not enclosed in double quotes);

* optionally, the path of the "info file" of the corpus (keyword INFO). This text file should contain a description of the corpus, its annotations, perhaps administrative information, etc. It is also a good idea to include a description of the tagset of the part-of-speech annotation there, if the corpus is tagged.

If the HOME field is missing, you have to specify the path for each attribute, so it is more convenient to define this field when all corpus-related data files are kept in a single directory.

If the INFO field is missing, no corpus information can be displayed in Cqp or Xkwic (in Cqp, this is done with the info command; in Xkwic, this file is displayed when the Info button is selected in the corpus list).

After the header, the set of corpus attributes (annotations) is declared. The different types of annotations are

* positional attributes (section 4.2.2);
* structural attributes (section 4.2.3);
* mapping tables (section 4.2.4);
* ngram tables (section 4.2.5);
* alignment information (section 4.2.6);
* dynamic attributes (section 4.2.7).

As said before, all annotations may be declared in almost any order (but see the notes in the case of mapping tables and bigram tables).

4.2.2 Positional attributes

A positional attribute is encoded as a set of seven files, which have been described above in section 3.2. These files are called the components of a positional attribute. A registry file defines, for each component of a positional attribute, the file name in which the component is stored. It is not necessary to manually set the names of all components, since there are default rules how to compute undefined component file names from the defined ones. In fact, we suggest relying on the default names which encode assigns to the files. Then, you do not have to declare component paths at all.

The declaration of a positional attribute looks as follows:

    ATTRIBUTE Name OptBody

where Name is the identifier for the attribute (like word, pos, lemma, ...). The OptBody is optional and is only needed when you have to define non-default file names.

When you use the body, it can be one of the following:

* a definition of the component paths, CompPathSpec;

* or the declaration that the attribute is found on a remote host, RemoteSpec;
* or a path which overwrites the HOME field of the corpus and says that all components are stored at a different place, AltPath.

The RemoteSpec currently is not supported, since in the current distribution, access to remotely stored corpora is disabled. When the AltPath declaration is used, it declares a path different from the corpus path for this special attribute.

The component path specification, CompPathSpec, must be enclosed in braces {...}. Within these braces, a sequence of component name/path specification pairs is listed. Each component name may only occur once:

    {
      ComponentID PathSpec
      ...
    }

One component is "virtual" (DIR) and doesn't describe file names but rather denotes the "home directory" of the attribute. Two other virtual components are ANAME, the name of the attribute just being defined, and APATH, which is the "home directory" of the corpus and defaults to the value of the HOME field of the header (if present).

The PathSpec is a standard path, in which macros may be used. Such a macro is started with a dollar sign $ and directly followed by a component name. For example, the macro $ANAME represents the value of the attribute name. Macro values may be used in path specifications. For example,

    $APATH/$ANAME.f

stands for the concatenation of the value of the APATH variable (usually the corpus home directory), followed by a slash, followed by the value of the ANAME variable, and then followed by .f.

In general, it is possible to refer to the value of any component by prefixing its component identifier with a dollar sign ($). Thus, when a component value is defined, it is possible to use the values of previously defined components in the definition. It is an error when you refer to a component value which is not yet defined.

The following table lists the components, the component identifiers and the default value or macro through which the (default) value is computed:

    Component          Component ID   Default Rule/Value
    Directory          DIR            $APATH
    Corpus             CORPUS         $DIR/$ANAME.corpus
    ReversedCorpus     REVCORP        $CORPUS.rev
    RevCorpusIdx       REVCIDX        $CORPUS.rdx
    CorpusFreqs        FREQS          $CORPUS.cnt
    Lexicon            LEXICON        $DIR/$ANAME.lexicon
    LexiconIndex       LEXIDX         $LEXICON.idx
    LexiconSortindex   LEXSRT         $LEXICON.srt

This table also shows the component path values when there is no component path specification at all. The APATH field defaults to the HOME directory of the corpus (the root directory, /, is assumed if this home specification is missing in the header). The name of the attribute being defined, ANAME, is always known. Then, the other components get their path values by the rules listed in the table.

You can, of course, set all component path values as you like, but it is strongly recommended to rely on the default values.

Note that the word attribute always must be defined, and that an attribute with the name word must be present in every corpus declaration (registry file).

Let's look at a small example. The following registry file only defines the positional attribute word:

    NAME "Penn Treebank"
    ID
    HOME /corpora/kwic/up
    ATTRIBUTE word

Due to the default rules, the paths of the word attribute components have the following values:

    Component          Value
    Directory          $APATH = /corpora/kwic/up
    Corpus             $DIR/$ANAME.corpus = /corpora/kwic/up/word.corpus
    ReversedCorpus     $CORPUS.rev = /corpora/kwic/up/word.corpus.rev
    RevCorpusIdx       $CORPUS.rdx = /corpora/kwic/up/word.corpus.rdx
    CorpusFreqs        $CORPUS.cnt = /corpora/kwic/up/word.corpus.cnt
    Lexicon            $DIR/$ANAME.lexicon = /corpora/kwic/up/word.lexicon
    LexiconIndex       $LEXICON.idx = /corpora/kwic/up/word.lexicon.idx
    LexiconSortindex   $LEXICON.srt = /corpora/kwic/up/word.lexicon.srt

Please note that no checking is done with respect to file name collisions. If you use the same component path for different components (or for components of different attributes of different corpora etc.), the system will certainly crash, and data damage may also result.

The following example shows a declaration of a corpus with the "old" naming conventions:

    NAME "Getaggter BZK (``Neues Deutschland'')"
    ID
    HOME /corpora/kwic/bzk-tagged
    ATTRIBUTE word {
      CORPUS  $APATH/corpus
      LEXICON $APATH/lexicon
    }
    ATTRIBUTE pos {
      CORPUS  $APATH/corpus.pos
      LEXICON $APATH/pos.lexicon
    }

Here, the paths of the word attribute components have the following values:

    Component          Value
    ReversedCorpus     /corpora/kwic/bzk/corpus.rev
    RevCorpusIdx       /corpora/kwic/bzk/corpus.rdx
    CorpusFreqs        /corpora/kwic/bzk/corpus.cnt
    LexiconIndex       /corpora/kwic/bzk/lexicon.idx
    LexiconSortindex   /corpora/kwic/bzk/lexicon.srt

The paths of the pos attribute components have the following values:

    Component          Value
    ReversedCorpus     /corpora/kwic/bzk/corpus.pos.rev
    RevCorpusIdx       /corpora/kwic/bzk/corpus.pos.rdx
    CorpusFreqs        /corpora/kwic/bzk/corpus.pos.cnt
    LexiconIndex       /corpora/kwic/bzk/pos.lexicon.idx
    LexiconSortindex   /corpora/kwic/bzk/pos.lexicon.srt

The alternate path specification

In the alternate attribute path specification, AltPath, you "overwrite" the default APATH value. The component path values are computed by the default rules. With the AltPath specification, you "move" the whole set of attribute components to another directory, while keeping the standard name conventions.

Example:

    NAME "Wall Street Journal, very large"
    ID wsj
    HOME /corpora/kwic/wsj
    ATTRIBUTE word
    ATTRIBUTE pos /var/space/kwic/wsj.pos

Here, the word attribute is on /corpora/kwic/wsj, but the set of components which belong to the pos attribute are stored on another file system, perhaps due to limitations of disk space.
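The default rules and the macro mechanism of section 4.2.2 can be imitated in a few lines of Python. This is a sketch under assumptions, not the workbench's implementation: the function name and the dictionary layout are hypothetical, and a macro is assumed to be a dollar sign followed by uppercase letters only.

```python
import re

# Component ID and default rule, in an order in which every macro
# refers only to components that are already defined.
DEFAULT_RULES = [
    ("DIR", "$APATH"),
    ("CORPUS", "$DIR/$ANAME.corpus"),
    ("REVCORP", "$CORPUS.rev"),
    ("REVCIDX", "$CORPUS.rdx"),
    ("FREQS", "$CORPUS.cnt"),
    ("LEXICON", "$DIR/$ANAME.lexicon"),
    ("LEXIDX", "$LEXICON.idx"),
    ("LEXSRT", "$LEXICON.srt"),
]


def component_paths(aname, apath, overrides=None):
    """Expand the component paths of one positional attribute."""
    values = {"ANAME": aname, "APATH": apath}
    rules = dict(DEFAULT_RULES)
    rules.update(overrides or {})  # entries of a CompPathSpec, if any
    for component, _ in DEFAULT_RULES:
        # a KeyError here means a macro refers to an undefined component
        values[component] = re.sub(
            r"\$([A-Z]+)", lambda m: values[m.group(1)], rules[component]
        )
    return values


if __name__ == "__main__":
    for name, path in component_paths("word", "/corpora/kwic/up").items():
        print(name, path)
```

Passing `{"CORPUS": "$APATH/corpus.pos", "LEXICON": "$APATH/pos.lexicon"}` as overrides yields the "old" naming scheme shown above, and passing a different `apath` corresponds to an AltPath declaration.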

The remote specification

The remote specification is currently disabled, so we do not have to write anything about it here.

4.2.3 Structural attributes

Structural attributes also have components, but their values cannot be changed in the registry file. So, the declaration of a structural attribute is very simple:

    STRUCTURE Name OptStorageSpec

STRUCTURE is the keyword, and Name is the name of the structure being defined. Optionally, this declaration may be followed by a storage specification, OptStorageSpec:

* this can either be a path, which "moves" the attribute data to another directory like the AltPath specification for positional attributes (the default is to store the data in the corpus home directory);
* or a remote declaration, which is currently disabled and therefore doesn't need to be covered here.

Normally, the declaration of a structural attribute simply is a sequence of the keyword STRUCTURE, followed by the name of the structure:

    STRUCTURE s
    STRUCTURE p
    STRUCTURE article
    STRUCTURE np

Structural attributes are stored in only one file, and the name of this file is derived by the rule $APATH/$ANAME.rng (where $APATH defaults to the value of HOME).

4.2.4 Mapping tables

The declaration of a mapping table looks as follows:

    MAPTABLE SourceName TargetName OptStorageSpec

The OptStorageSpec is the same as in the case of structural attributes above, and has the same meaning.

The keyword MAPTABLE introduces the declaration of a mapping table. Then, the names of two positional attributes follow. These positional attributes must already be declared (so the declaration is not order-independent in this case). Currently, mapping tables are unidirectional, that is, you have to compute and declare mapping tables in both "directions" if you need them (for example, from pos to word as well as from word to pos).

Example declaration:

    MAPTABLE word pos
    MAPTABLE pos word

4.2.5 ngram tables

The declaration of an ngram table looks as follows:

    NGRAM PAName n OptStorageSpec

The OptStorageSpec is the same as in the case of structural attributes above, and has the same meaning.

The keyword NGRAM introduces the declaration of an ngram table. Then, the name of a positional attribute follows, which already must be declared earlier in the registry file. The PAName is then followed by the "dimension" of the table, n. Currently, only bigram tables are supported, so that this value n always must be 2.

Example declaration:

    NGRAM word 2
    NGRAM pos 2

4.2.6 Alignment attributes

With an alignment declaration, it is expressed that the corpus where this attribute is being declared is aligned to another corpus which also has its registry file. The declaration is as follows:

    ALIGNED CorpusName OptStorageSpec

The OptStorageSpec is the same as in the case of structural attributes above, and has the same meaning.

The CorpusName must be the name of an existing (registered) corpus.

Alignment tables are unidirectional. So, if the other aligned corpus also is aligned to the corpus currently being defined, the alignment in the other direction must be declared in the registry file of the other corpus.

Example declaration:

    ALIGNED hansard-f

4.2.7 Dynamic attributes

Dynamic attributes are declared like functions: they have a name, an argument list, and a return value type. Additionally, it must be declared how the value is computed from the external knowledge source.

The declaration of a dynamic attribute looks as follows:

    DYNAMIC Name(ArgList): RVType ShellCall

The Name is the name of the dynamic attribute, following the standard naming conventions. After the name, the argument list is written in parentheses. The argument list is a sequence of comma-separated type identifiers. Currently, the type identifiers STRING (for character string arguments), INT (for integer type arguments) and POS (for corpus-position arguments) are supported. The return value type, RVType, also must be one of these type identifiers.

The ShellCall is a string, enclosed in double quotes. It defines how the return value is computed by a call to an external tool. The value computed by the external tool (which is expected to occur on the stdout of the external tool after its termination) is coerced to the return value type, whether this makes sense or not. So it is your task to make sure that the external tool computes information which can be coerced to the return value type.

In the ShellCall, you refer to the arguments of the function with $1 for the first argument, $2 for the second, and so on. It is an error when argument numbers are used which are higher than the number of arguments declared. It is not an error, though, when declared arguments are not used in the shell call.

Example:

    DYNAMIC wndist(STRING, STRING, STRING): INT "/usr/local/bin/wnreq -s '$1' '$2' '$3'"

Here, the dynamic attribute wndist takes three STRING-type arguments and returns an INT-type value. The value is computed by calling the external program wnreq with some parameters and the (actual) arguments glued into the shell call.

When the value of a dynamic attribute is requested, the actual parameters of the function (which must agree with the declared argument types) are glued into the shell call at the positions indicated by the argument references ($n). Then, this shell call is evaluated (via the pipe mechanism of Unix). Errors can result when arguments are passed which contain characters that are interpreted by the shell. These should either be excluded from being passed through constraints in the query, or you have to design your shell call carefully (for example, by surrounding the variable references with single quotes, although this only partially helps). In a future release of the external knowledge source interface, perhaps we will support argument passing to the stdin of the external tool, so that shell evaluation does not take place anymore.

The pipe to the external program is not kept open (we will perhaps fix that in a later release), so that the external tool is invoked for each value request. Two things are important:

* first, the number of dynamic calls must be minimized: use external information in queries carefully! Batch querying should also be considered;
* the startup time of the external tool should be minimized. It should only load as little data as necessary, and the time necessary to compute the requested information should be small. This can best be achieved by designing a client-server architecture, so that the server only starts once, and small clients (like wnreq in the example above) do not have to load large amounts of data.
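The substitution of argument references and the quoting problem described above can be illustrated with a small Python sketch. It is not how the workbench builds its shell call (the text only says the arguments are "glued" in); the function name is hypothetical, and the sketch deliberately swaps in `shlex.quote` to escape every argument, so the template must contain bare `$1`, `$2`, ... without surrounding single quotes.

```python
import re
import shlex


def build_shell_call(template, args):
    """Replace $1, $2, ... in template by shell-quoted argument values."""
    def substitute(match):
        index = int(match.group(1))
        if index < 1 or index > len(args):
            # mirrors the rule that references beyond the declared
            # number of arguments are an error
            raise ValueError(f"${index} refers to an undeclared argument")
        return shlex.quote(str(args[index - 1]))

    return re.sub(r"\$(\d+)", substitute, template)


if __name__ == "__main__":
    call = build_shell_call("/usr/local/bin/wnreq -s $1 $2 $3",
                            ["dog", "cat", "it's"])
    print(call)
```

Quoting each value separately keeps an argument such as `it's` from terminating a hand-written single-quoted reference, which is the weakness the text hints at with "this only partially helps".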

Please remember that the external tool interface as well as dynamic attributes in general are experimental tools, and are not optimized for efficiency.

4.3 Registration of remote corpora

Remote corpora are currently disabled, so this section can be skipped.

When corpora are stored remotely and have to be accessed via the network, they are declared differently from local corpora. For each remote corpus, a registry file similar to a "standard" registry file has to be created. This file declares that a corpus (or a positional attribute) is stored remotely. It must only contain the fields NAME, ID and REMOTE. NAME and ID have the same meaning as described above and are optional. The value of the REMOTE "component" is the name of the remote host which is to be contacted when the corpus is accessed. No other component values may appear in the registry file for remote corpora.

Here is an example of a full and valid remote declaration:

    REMOTE maple.gardeners.edu

4.4 A last example

We hope that the format of the registry file is simple enough to be understood without too many explanations once you are fairly familiar with it. So just have a look at this last example.

    NAME "Hansard-Corpus (English Part)"
    ID
    HOME /corpora/kwic/hansard-e
    INFO /corpora/kwic/hansard-e/.info
    ATTRIBUTE word {
        CORPUS $APATH/corpus
        LEXICON $APATH/lexicon
    }
    ATTRIBUTE pos
    ATTRIBUTE lemma
    STRUCTURE s
    STRUCTURE np
    ALIGNED hansard-f
    MAPTABLE word pos
    MAPTABLE pos word
    NGRAM word 2
    NGRAM pos 2
    DYNAMIC wndist(STRING,STRING,STRING):INT
            "/corpora/bin/wnreq -s '$1' '$2' '$3'"

Oliver Christ
IMS Stuttgart
August 9, 1996

4.5 Steps to follow

Now, what are the steps to follow if you have to register a new corpus?

- First, try to estimate the space requirements of the encoded corpus and find a place for it in your file system. Consider splitting of attributes or single attribute components if you do not find enough place on a single file system. Remember that encode, the first step during corpus preparation, puts all data files in a single directory, so at least for this step there must be enough space on the disk.
- Make a new directory for the data of the new corpus. If there is place enough, try to put all data in this directory. Then, you will only have to set the HOME field of the registry file, and all other file names are then computed by the default rules. If you do not find enough space, you have to split the (positional) attributes between several file systems. Then, you will have to encode the set of positional attributes in several steps (by using the -p option for encode and only encoding one column of the input file in a single run of encode).
- Run encode, and let it write all corpus data in the new directory, if there is enough place. You may also use /tmp as an intermediate place to hold the data, and then mv the data files to their proper destinations.
- Register the new corpus: give it a new, unique name and create the registry file.
- Declare all annotations which have been produced by the run of encode. Remember that annotations can be added (or removed) at any time.
- If the corpus is registered, run describe-corpus with the name of the new corpus. This will show you whether the registry file is syntactically correct, which attributes are declared, where they will be stored, and what their status is. Check this list carefully.
- After all positional attributes have been encoded (either in a single step or in several steps), run makeall once, which will create all components of all positional attributes which have not yet been produced by encode.
- You may check your corpus again with describe-corpus.
- Create and register additional attributes (mapping tables, bigram tables, alignment data, ...) with the respective utilities or by other means.
- Check the file permissions of all files produced by encode and makeall. Data files and the registry file should have the permission 444, and the directories (the registry directory and the data directory) should be readable (security and access control is covered in chapter 7).
- Start Cqp or Xkwic, and check whether the new corpus appears in the list of available corpora.

Good luck!
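The permission check in the list above can be automated. The following Python sketch lists regular files in a data directory whose mode is not 444. It is only a convenience under stated assumptions: the function name find_wrong_modes is hypothetical and not part of the Workbench, and the directory passed in is whatever directory you chose for the corpus data.

```python
import os
import stat


def find_wrong_modes(data_dir, expected_mode=0o444):
    """Return (path, mode) pairs for regular files whose permission
    bits differ from expected_mode. Subdirectories are not descended."""
    problems = []
    for name in sorted(os.listdir(data_dir)):
        path = os.path.join(data_dir, name)
        info = os.stat(path)
        if not stat.S_ISREG(info.st_mode):
            continue  # only data files are checked here
        mode = stat.S_IMODE(info.st_mode)
        if mode != expected_mode:
            problems.append((path, oct(mode)))
    return problems
```

Calling find_wrong_modes with the HOME directory of the corpus (for instance /corpora/kwic/hansard-e from the example registry file) after encode and makeall have run returns an empty list when every data file already has mode 444.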

Chapter 5
Remote access - client and server setup

In the current distribution, the corpus data server is not included, so the IMS corpus toolbox does not support remote corpus data access at the moment. It was too slow anyway ...

This chapter explains how corpora are prepared to allow remote access. To "export" a corpus, it must be stored locally and be accessible by local users (that is, it must be declared in the registry). Basically, on each machine where corpora are stored which are to be exported to remote users, a server process must run which waits for corpus data requests from "outside" and serves these requests, after checking whether the remote user is authorized to access the respective corpus. It should be clear that remote access to corpora is the most vulnerable point with respect to copyright and access control issues. We therefore suggest that remote access is only granted for either "free" resources or when you have fully understood how to control remote access.

Remote connections can be built only if a server is running on the host on which a corpus is stored. Almost all tools (but neither encode nor makeall) have built-in remote access capabilities without the need for special tools or setups. Another access criterion is whether the user trying to access a corpus remotely is allowed to access the corpus at all. The following section 5.1 describes how to grant access to corpora, and section 5.2 describes how to run the corpus data server.

5.1 The .rat and .ratlog files

Remote access to corpora which are stored physically (either on local disks or on NFS mounted disks) on the same machine the server is running on is controlled by a file called .rat ("remote access table") in the registry directory. This file holds an arbitrary number of lines, each being a pair of regular expressions based on the POSIX egrep syntax [1]. The first regular expression in each line describes symbolic corpus names as they occur in the registry; the second regular expression describes who has access to the positional attributes described in the first expression.

[1] A description of this syntax can be found, for example, in the documentation of the FSF/GNU regex package.

When a corpus is to be accessed remotely, the server goes through all lines in the remote access table and checks whether the name of the corpus matches the first expression of the line. If so, it is checked whether the name of the user at the remote host (user@host) is matched by the second regular expression. If so, access is granted and no more lines in the table are checked. If not, the next line is tried. If no line is matched, access is denied.

The .rat file could, for example, look as follows:

    treebank.*   (joe|mary|chris)@(rose|tulip)\.gardeners\.edu
    treebank.*   jack@.*
    up.*         (joe|jack)@.*\.gardeners\.edu
    .*           corpus_adm@maple\.gardeners\.edu

The first line grants access to all positional attributes defined for the corpus treebank to joe, mary and chris when they try to access the corpus from one of the hosts rose.gardeners.edu or tulip.gardeners.edu. Note that the dot, if not prefixed by a backslash, matches every character and has to be escaped with a backslash when a literal dot is expected. The second line grants access to the same positional attributes for jack, no matter from which host he tries to access the corpus. Such a line should not occur in your .rat file, since there are many jacks out there and you surely don't want all of them to get access to your corpus data. You should always try to specify the users and hosts from which access is to be granted as precisely as possible. The third line grants access to all up attributes for joe and jack, no matter from which host in the gardeners.edu domain they are connecting. The fourth line, also a little bit too general for my taste, grants access to all corpora for the user corpus_adm@maple.gardeners.edu.

Again, a "#" in the first column of the .rat file indicates a comment which extends up to the end of the line.

When a corpus has other positional attributes beyond the "word" attribute, it is important to allow remote access to all positional attributes assigned to the corpus (if this access should be possible over the network). This is achieved through the .* at the end of the first expression of each line, which describes the positional attributes for which access should be granted.

The .ratlog file, also residing in the registry directory, logs all requests for remote access to corpora and whether a connection request was granted or not. Entries in this file look, for example, as follows:

    CDS on maple at Thu Jan 20 16:47:27 1994
      login request from joe@tulip.gardeners.edu for corpus treebank (granted)
    CDS on maple at Thu Jan 20 16:52:03 1994
      login request from bill@tulip.gardeners.edu for corpus treebank (denied)
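The first-match rule described in this section can be sketched in a few lines of Python. This is an illustration, not the server's code. It rests on several assumptions that the text does not confirm: the two expressions on a line are separated by whitespace, an expression must match the whole name rather than a part of it, and Python's re module is used as a stand-in for the POSIX egrep syntax. The names parse_rat and access_granted are hypothetical.

```python
import re

# Sample table in the style of the example above (the overly general
# lines are left out on purpose).
RAT_TEXT = r"""
# corpus pattern   user@host pattern
treebank.*  (joe|mary|chris)@(rose|tulip)\.gardeners\.edu
up.*        (joe|jack)@.*\.gardeners\.edu
"""


def parse_rat(text):
    """Turn the table text into a list of compiled pattern pairs."""
    rules = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # blank line, or comment in the first column
        corpus_pattern, user_pattern = line.split(None, 1)
        rules.append(
            (re.compile(corpus_pattern), re.compile(user_pattern.strip()))
        )
    return rules


def access_granted(rules, corpus, user_at_host):
    """Grant access on the first line where both expressions match;
    deny if no line matches."""
    for corpus_re, user_re in rules:
        if corpus_re.fullmatch(corpus) and user_re.fullmatch(user_at_host):
            return True
    return False


rules = parse_rat(RAT_TEXT)
print(access_granted(rules, "treebank", "joe@tulip.gardeners.edu"))
print(access_granted(rules, "treebank", "bill@tulip.gardeners.edu"))
```

With this table, the two requests correspond to the granted and denied entries shown in the .ratlog example. Testing candidate lines this way before installing them is one means of spotting expressions that match more users than intended.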

5.2 How to start the corpus data server

Starting the server is quite simple: you only have to start the program cds (corpus data server) as a background process. There are, however, some points to be respected:

- only one cds process must run on a host;
- the cds process must be started by the same person who owns the .rat and the .ratlog file. The file .rat must be present as described in the previous section. If the .ratlog file doesn't exist when cds is started, it can be created with the touch command (although cds tries to create the file in case it should not yet be present).

Remote access to corpora is slow. We only implemented it for testing purposes, so don't expect it to work as fast as the access to locally stored corpora.

Chapter 6
Utilities and debugging tools

This chapter is "under construction", so please refer to the manual pages of the tools.

In addition to the encode and makeall utilities introduced in chapter 3, there are some other utilities for the construction of the various attributes or the display of the information captured by these attributes. These tools are described in this section. Unless otherwise indicated, for each of the tools mentioned in this chapter Unix manual pages have been written and should be available at your site.

6.1 Decoding of corpus and attribute information

6.1.1 Decoding of corpus information: decode

decode decodes (that is, prints encoded attribute values of) a registered corpus encoded with encode and makeall and prints the data textually on stdout. The user can select the attributes which are printed, as well as the start and end corpus positions. Alternatively, corpus positions may be piped into decode in order to print the values at these positions.

The attribute values are separated by a tabulator character and are preceded by the attribute name. The order in which the attribute values are printed is the same as the order of the corresponding command line arguments. At least one attribute must be specified with one of the options.

6.1.2 Decoding of word lists: lexdecode

lexdecode prints the values of a positional attribute of a registered corpus encoded with encode and makeall textually on stdout. If no positional attribute name is given via the -p option, the values of the standard positional attribute word are printed. Additionally, information about the absolute frequency of the values in the corpus can be printed and/or the values can be printed in lexically ascending order.

6.2 Creation and Decoding of Bigram Tables

6.2.1 Creation of bigram tables: gen-bigrams

gen-bigrams computes bigram tables for a positional attribute of a corpus in a certain window.

gen-bigrams can use two different algorithms to compute the table: the sequential method (which can be selected through the -s option and is the default) sequentially shifts the corpus words and increments the counts, whereas the reversed method (selectable with the -r option) works via the reversed index.

When a bias is given, only bigram cells with a higher count than this bias are stored. When a minimal frequency is given, bigrams are only computed for those words which have at least the frequency mfreq. In this case, the computation is always done via the reversed corpus.

6.2.2 Decoding of bigram tables: decode-bigrams

decode-bigrams prints the contents of the bigram table associated with a positional attribute of a corpus in the window size (which currently always is 2, but has to be given as an argument).

By default, the table is printed in the internal form, but with the -t option, a more readable, tabular output is produced. Then, the window width and height can be altered with the appropriate parameters.

6.3 Creation and Decoding of Mapping Tables

6.3.1 Creation of mapping tables: gen-mapping-table

gen-mapping-table computes mapping tables from the positional attribute sourcepa to the positional attribute targetpa of corpus. The tables are direction-specific - if you want to have mapping tables in the other direction, you must create them, too.

gen-mapping-table can use three different algorithms to compute the table: the standard method (which can be selected through the -s option and is the default) allocates a large table and runs sequentially through the corpus and increments the counts. The tree computation (selected with the -t option) internally uses a tree-like structure, thus speeding up search. Third, the direct method (selected with the -d option) allocates a huge table with space for all possible cells; it can only be used for small tables (less than 10 MB size), but should be the fastest method. The table size is computed by multiplying the number of values of the two attributes, times 4 (space for one integer).
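The size rule for the direct method is simple arithmetic and can be checked before running gen-mapping-table. The sketch below is an illustration only. The function names are hypothetical, and the reading of "10 MB" as 10 * 2**20 bytes is an assumption, since the text does not say which definition of a megabyte is meant.

```python
BYTES_PER_CELL = 4                      # one integer per cell, as stated above
DIRECT_METHOD_LIMIT = 10 * 1024 * 1024  # "less than 10 MB"; assumes 1 MB = 2**20 bytes


def direct_table_size(source_values, target_values):
    """Size in bytes of a table with one cell per value pair."""
    return source_values * target_values * BYTES_PER_CELL


def direct_method_feasible(source_values, target_values):
    """True if the full table stays below the stated size limit."""
    return direct_table_size(source_values, target_values) < DIRECT_METHOD_LIMIT


# A made-up example: 50,000 distinct word forms mapped to 50 tags.
print(direct_table_size(50_000, 50))
print(direct_method_feasible(50_000, 50))
print(direct_method_feasible(100_000, 50))
```

In the made-up example, 50,000 source values and 50 target values give 2,500,000 cells, or 10,000,000 bytes, which is just under the assumed limit; doubling the number of source values exceeds it.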

6.3.2 Decoding of mapping tables: decode-mapping-table

decode-mapping-table prints the contents of the mapping table associated with the source positional attribute sourcepa and the target positional attribute targetpa of the corpus corpus.

By default, the table is printed in the internal form, but with the -t option, a more readable, tabular output is produced. By default, the whole table is printed, but by using the -p option, decode-mapping-table reads the sets of values of the source attribute from stdin. Alternatively, a regular expression pattern can be given as the last argument. Then, the mappings are only displayed for those source values which match the pattern.

6.4 General utilities

6.4.1 Comparing word lists and corpora: check-coverage

check-coverage is a program which reads a list of words either from stdin (default) or from the file specified with the -f option. check-coverage then, for each word, looks in the list of corpora specified in the corpus-names list (maximum is 32 corpus names) whether the word occurs in the word attribute of these corpora. If not, the word is printed to stdout. If the word occurs in one of the corpora and if the -s option ("print success") is given, matches are also written to stdout.

6.4.2 Converting internal integers to readable numbers: itoa

itoa reads, from stdin or from each of the named files, a sequence of integers in the internal (machine-dependent) data format (4 byte integers) and writes each number textually to stdout, one number per line.

6.4.3 Converting readable numbers to internal integers: atoi

atoi reads, from stdin or from each of the named files, a sequence of numbers represented as digit sequences. The input must be in a one-number-per-line format and must only contain digits and newlines. The numbers (not the individual digits, of course) are then written to stdout as a sequence of 4-byte ints, in the internal (machine-dependent) data format. The function is similar to the atoi(3) function of the C library.
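The two conversions performed by itoa and atoi can be pictured with Python's struct module. This is a sketch of the data format as described above, not a replacement for the tools. It assumes that the integers are signed and stored in the byte order of the machine running the code; the function names are hypothetical.

```python
import struct

INT_FORMAT = "=i"  # native byte order, exactly 4 bytes, signed (an assumption)
INT_SIZE = struct.calcsize(INT_FORMAT)


def ints_to_text(raw):
    """Like itoa: binary 4-byte integers in, one decimal number per line out."""
    if len(raw) % INT_SIZE != 0:
        raise ValueError("input length is not a multiple of 4 bytes")
    return "".join(
        f"{value}\n" for (value,) in struct.iter_unpack(INT_FORMAT, raw)
    )


def text_to_ints(text):
    """Like atoi: one decimal number per line in, binary 4-byte integers out."""
    values = [int(line) for line in text.splitlines() if line]
    return struct.pack(f"={len(values)}i", *values)


raw = text_to_ints("1\n42\n1996\n")
print(len(raw))           # number of bytes written
print(ints_to_text(raw))  # the original numbers, one per line
```

Because the format is machine-dependent, a file written on a host with one byte order will not decode correctly on a host with the other; the sketch inherits that limitation by design.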

Chapter 7
Access control and security issues

This chapter describes how to control access to the corpus data. Access control is an important issue when corpora with restricted access are provided, for example the Tipster corpora available from the Linguistic Data Consortium (LDC).

7.1 Controlling local access to corpora

Access to corpus data currently can only be controlled by restricting the readability of the registry directory, the registry files in the registry directory and the directory the corpus data is stored in by setting the user and group ids and the access permissions of the files and directories.

The easiest way to control access is by properly setting the read permissions of a registry file. If someone cannot read a registry file, he or she cannot access the corpus by way of its symbolic name. However, independent of the readability of the registry file, a user can figure out where the components of a corpus are stored and - with some knowledge of the internal corpus representation or with the help of the utilities mentioned in the previous chapter 6 - can reconstruct the corpus text from the encoded data unless the components are read-protected. We therefore recommend that the access restrictions of the directory the encoded data is stored in (user, group and r/w permissions) are the same as the access restrictions of the registry file. When a corpus is more or less "public" in the sense that its use is either free or restricted to your institution, access control is probably not necessary.

Further, we strongly recommend that

- the registry directory is owned by the corpus administrator;
- the registry directory is readable by everyone but writeable only for the corpus administrator (mode 755);
- the registry files in the registry directory are only writeable by the corpus administrator, and readable exactly by those users which may access the corpus;
- the directories the data of a positional attribute is stored in ("data directories") are writeable only by the corpus administrator;
- the data directories have the same group id, owner id and read permissions as the registry file which describes the attribute (plus the necessary "exec" bits);
- the data files (components) have mode 444 (are not writeable by anyone). For changes or updates, only the corpus administrator may change the access restrictions of a file or a directory.

Additionally, the read/exec permissions of tools other than the query tools (Xkwic, Cqp, print-aligned) may be restricted to the corpus administrator. The utility lexdecode, however, should be executable by all corpus users since it produces useful frequency information which doesn't allow the reconstruction of the corpus text proper. We strongly recommend, however, that execution permission for the corpus data server (cds) is restricted to only the corpus administrator. None of the programs should be setuid.

Since the decoding utilities permit to textually decode a corpus with all its information, care should be taken that unauthorized users (guests with temporary accounts, for example) cannot use these tools. One idea, for example, is to set the setgid bit of the query tools (Cqp, Xkwic) to the group id under which the corpus data, the registry files etc. are stored (the top directory suffices). The corpus data and the registry files (or the directories in which they are stored), then, should be readable only for members of this "corpus group", but not readable for members of other groups (o-rx). The decoding utilities must not be setgid for the corpus group. By this strategy, the corpus data can be accessed via the query tools, but not via the decoding utilities if the user does not belong to the group which has access to the corpus registry files by the normal group permissions.

7.2 Controlling remote access to corpora

As mentioned in chapter 5 above, remote access is a vulnerable point with respect to access control. With some knowledge about the internals of the tools, it would be possible for almost everyone to fake his or her identity and to gain illegal access to your corpora. Although this is not a trivial task, you should keep in mind that it is at least possible. The only way to prevent this is not to export corpora at all, that is, not to run a corpus data server (and to prevent other users to start it as well). Additionally, it is good policy to check the .ratlog file frequently in order to gain an overview which corpora are accessed by whom.

The .rat and the .ratlog files should be owned by the corpus administrator and have access mode 600, that is, readable and writeable only by the owner (i.e., the corpus administrator).

Remember that the corpus data server is usually run by the corpus administrator. Therefore, the server process has access to all registry files and to all positional attributes defined for all corpora. Access to corpora from "outside" is therefore only controlled by checking whether the corpus/user pair is contained in the remote access table (.rat).

Therefore, the entries in the .rat file must be carefully designed and no one but the corpus administrator should be able to read or write the .rat file. You should have a fair knowledge of regular expressions and take care that the expressions are not "overgenerating". If you are not sure about which user names or attributes are matched by your regular expressions, we suggest that you best enumerate all attribute/user pairs with their full names and not to use "wildcards" (.), the Kleene star (*) or the plus construct (+).

And, lastly, don't broadcast that you permit access to your corpora too widely or to people who aren't in the list of intended users. Just to be on the safe side ...
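One way to follow the advice on enumerating full names is to generate the table lines from explicit lists, escaping every character that would otherwise act as an operator. The Python sketch below illustrates the idea under stated assumptions: the function name literal_rat_lines is hypothetical, the two expressions are assumed to be separated by a single space, and re.escape follows Python's escaping rules, which may differ in detail from the POSIX egrep syntax that the server expects, so generated lines should be reviewed before use.

```python
import re


def literal_rat_lines(corpora, users_at_hosts):
    """Build one table line per corpus/user pair, with no wildcards.

    Every name is escaped so that a dot, star or plus in a name is
    matched literally instead of acting as a regular expression operator.
    """
    lines = []
    for corpus in corpora:
        for user in users_at_hosts:
            lines.append(f"{re.escape(corpus)} {re.escape(user)}")
    return lines


lines = literal_rat_lines(
    ["treebank"],
    ["joe@rose.gardeners.edu", "mary@tulip.gardeners.edu"],
)
for line in lines:
    print(line)
```

The table grows with the product of corpora and users, which is the price of avoiding expressions that match more than intended.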

Appendix A
Hardware and operating system requirements

The tools have been developed and tested on the following development platform:

    Hardware:       Sun-4M/Sparcstation 3 (2 CPUs, 64 MB memory, 50 MHz)
    Compiler:       gcc V2.5.7 on SunOS Release 4.1.3
    Window System:  X Window System (tm), Version 11, Release 5 (X11R5)
    Widget set:     OSF/Motif (tm) V1.2

We currently only deliver binary files for this platform. No other systems are supported. In order to run the tools, at least 32 MB of memory are recommended. One CPU is of course sufficient.

Appendix B
Reused software packages and copyright notices

In the implementation of our system, we made use of the following software packages. We thank the providers of the software and include their disclaimers and copyright notices.

B.1 The regular expression matcher by Henry Spencer

Copyright (c) 1992 Henry Spencer.
Copyright (c) 1992, 1993
The Regents of the University of California. All rights reserved.

This code is derived from software contributed to Berkeley by
Henry Spencer of the University of Toronto.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions
are met:
1. Redistributions of source code must retain the above copyright
   notice, this list of conditions and the following disclaimer.
2. Redistributions in binary form must reproduce the above copyright
   notice, this list of conditions and the following disclaimer in the
   documentation and/or other materials provided with the distribution.
3. All advertising materials mentioning features or use of this software
   must display the following acknowledgement:
   This product includes software developed by the University of
   California, Berkeley and its contributors.
4. Neither the name of the University nor the names of its contributors
   may be used to endorse or promote products derived from this software
   without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
SUCH DAMAGE.

@(#)regex.h 8.1 (Berkeley) 6/2/93

Bibliography

[Christ, 1993] Oliver Christ. The Xkwic User Manual. Institut für maschinelle Sprachverarbeitung, Universität Stuttgart, 1993.

[Christ, 1994] Oliver Christ. A modular and flexible architecture for an integrated corpus query system. In Proceedings of COMPLEX'94: 3rd Conference on Computational Lexicography and Text Research (Budapest, July 7-10 1994), Budapest, Hungary, 1994. CMP-LG archive id 9408005.

[Schulze and Christ, 1994] Bruno M. Schulze and Oliver Christ. The CQP User's Manual. Institut für maschinelle Sprachverarbeitung, Universität Stuttgart, Version 1.0d, May 1994. (Revised October 1994).

[Schulze, 1994] Bruno M. Schulze. Entwurf und Implementierung eines Anfragesystems für Textcorpora. Diplomarbeit Nr. 1059, Universität Stuttgart, Institut für maschinelle Sprachverarbeitung (IMS) and Institut für Informatik, January 1994. (In German).