Arabic Morphological Analysis on the Internet



K. R. Beesley [1]
Xerox Research Centre Europe
Grenoble Laboratory
6, chemin de Maupertuis
38240 MEYLAN, France
Tel: 33 4 76 61 50 64
Fax: 33 4 76 61 50 99
Ken.Beesley@xrce.xerox.com

Abstract: [Arabic, Morphology, Finite State] This paper describes a finite-state morphological analyzer of written Modern Standard Arabic words that is available for testing on the Internet at http://www.xrce.xerox.com/research/mltt/arabic. The system consists of the analyzer proper, running on a network server, and Java applets that run on the user's machine and render words in standard Arabic orthography both for input and output. An overview of the system is provided, including the history, finite-state technology, dictionary coverage and status.

1 Introduction

In 1996, the Xerox Research Centre Europe produced a large morphological analyzer for Modern Standard Arabic, henceforth Arabic (Beesley, 1996). In 1997, the rules were rewritten to support generation more reliably, and a Java user interface was added to allow users to interact with the system via the Internet in standard Arabic orthography. The analyzer-generator is based on dictionaries from an earlier project at ALPNET (Beesley, 1989; Beesley et al., 1989; Beesley, 1990; Buckwalter, 1990), but the entire system was extensively redesigned and rebuilt using Xerox finite-state technology.

The system accepts on-line orthographical words of Arabic and returns morphological analyses that identify affixes and separate roots from patterns. Input words may include full diacritics, partial diacritics, or no diacritics; if diacritics are present, they reduce the amount of ambiguity accordingly. A fully-voweled spelling and a terse but useful English gloss are also returned as part of each analysis.
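The effect of diacritics on ambiguity can be illustrated with a small sketch. It is not the Xerox implementation: the candidate list, the romanized spellings and the function names are assumptions made up for illustration, with the letters a, i and u standing in for optional Arabic diacritics. A typed word is compatible with a fully-voweled candidate if it can be obtained by dropping zero or more diacritics, so each diacritic the user supplies can only shrink the candidate set.

```python
# Hypothetical romanized data: a, i, u stand in for optional Arabic diacritics.
DIACRITICS = set("aiu")

def compatible(typed: str, fully_voweled: str) -> bool:
    """True if `typed` equals `fully_voweled` with zero or more diacritics
    dropped; consonants may never be dropped."""
    i = 0
    for ch in fully_voweled:
        if i < len(typed) and typed[i] == ch:
            i += 1                      # this symbol was typed by the user
        elif ch not in DIACRITICS:
            return False                # a consonant is missing or different
    return i == len(typed)

def surviving(typed: str, candidates: list[str]) -> list[str]:
    """Fully-voweled candidates that remain possible for the typed word."""
    return [c for c in candidates if compatible(typed, c)]

# Illustrative candidate spellings only, not output of the real analyzer.
CANDIDATES = ["kataba", "kutiba", "kutubu"]

if __name__ == "__main__":
    for typed in ("ktb", "kutb", "kutib"):
        print(typed, "->", surviving(typed, CANDIDATES))
```

With no diacritics all three candidates survive; each added diacritic removes the candidates it contradicts.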
The system has wide dictionary coverage and is intended as a pedagogical dictionary-lookup aid, a comprehension-assistance tool, and as a component in larger natural-language-processing systems.

2 Challenges of the Arabic System

There are two principal challenges in building any serious morphological analyzer for written Arabic. First, the system must in fact accept words of Arabic as input, returning correct and useful morphological analyses. Second, in order to be psychologically acceptable to many users, the input and output must be rendered in traditional Arabic orthography rather than in romanization.

2.1 Arabic Morphological Analysis

In its simplest diagrammatic form, a morphological analyzer can be characterized as a black-box module (see Figure 1) that accepts words of Arabic and outputs morphological analyses. Because they usually serve as components in larger natural-language-processing systems, morphological analyzers and their internal workings are often quite invisible to the average user. In computer analyses of Arabic, or of any other language, the input words are of course in digital form, with the original characters represented as numbers according to encoding standards such as ASMO8859-6 and UNICODE.

[Figure 1: A Morphological Analyzer as a Black Box. A word enters the Morphological Analyzer, which returns morphological analyses.]

As for the content of the morphological analysis itself, it will always be somewhat theory-dependent, especially in the case of Arabic. In the broadest terms, a morphological analyzer should separate and identify the component morphemes of the input word, labeling them somehow with sufficient information to be useful for the purposes at hand. In the case of Arabic, one would certainly expect a morphological analyzer to separate and identify prefixed word-like morphemes such as the conjunctions wa and fa, prefixed prepositions such as bi and li, the definite article, verbal prefixes and suffixes, nominal case suffixes, and enclitic direct-object and possessive-pronoun suffixes. Things become more complicated in the formal analysis of the typical Arabic or, more generally, Semitic STEM; various lexicographers and linguists propose

1. Stems as one-part structures: i.e. a stem is simply treated as a single morpheme.
2. Stems as two-part constructs of a ROOT and a stem-framework PATTERN: e.g. ktb (كتب) [2] and _u_i_, where the underscores represent slots for the root consonants, sometimes termed RADICALS. In such a system, the root and pattern morphemes are said informally to interdigitate to form stems like kutib (كُتِب). For such an analysis of Hebrew stems, see (Harris, 1941).
3. Stems as three-part constructs of a ROOT, a consonant-vowel (CV) TEMPLATE, and a VOCALIZATION: e.g. ktb and template CVCVC and vocalization ui to form kutib (كُتِب), drs and CVCCVC and vocalization a to form darras (دَرَّس), ktb and CVVCVC and a to form kaatab (كاتَب), etc. The templates are drawn from a set of perhaps dozens, and the radicals and vowels are associated with the C and V slots via controversial association rules.
   Such a three-way division was popularized by some of the early work of John J. McCarthy (McCarthy, 1981). [3]
4. Stems as multi-part constructs of a root, a prosodically motivated pattern (from a severely limited set), a vocalization, and various affixes. This approach characterizes some of the more recent work of McCarthy (McCarthy, 1993).

In what follows, I shall assume that stems are built from at least two morpheme components, including a root usually consisting of three (but sometimes two or four) radicals like ktb (كتب), drs (درس), fcl (فعل), klkl (كلكل), sm (سم) or tm (تم). I shall also assume that a crucial task for a morphological analyzer is to identify these underlying roots, even if they are obscured in the surface string by phonological or orthographical processes. Beyond the arguments for the linguistic reality of Semitic roots [4], an entirely practical motivation for identifying roots is that words in traditional Arabic dictionaries, such as the authoritative Hans Wehr dictionary (Wehr, 1979), are organized under root headings. To look up a word in such a dictionary, the student must know the root.

Beyond the question of how the stem should be split up theoretically into morphemes, a scientific (testable and disprovable) theory must translate into a workable mechanism for generation and analysis. For a general survey of computational approaches to Arabic, see (Kiraz, 1994b). At ALPNET, we used a slight enhancement of the popular TWO-LEVEL MORPHOLOGY (Koskenniemi, 1983; Gajek et al., 1983; Beesley, 1990; Antworth, 1990), a theoretical and computational framework that has been used to build significant morphological analyzers for English, Greek, Russian, German, Polish, Hungarian, Finnish, Swedish, Danish, Norwegian, Swahili, Klingon and no doubt many other languages. [5] The current system was built using Xerox finite-state technology, a set of basic algorithms and compilers for building and manipulating finite-state transducers. In both the ALPNET and Xerox implementations, roots and patterns are formalized as regular languages that are straightforwardly combined into stems via intersection, a finite-state operation.

2.2 Arabic Script Input and Output

The ALPNET system was intended to be a component, not a stand-alone system with a user interface, and users were forced to interact with it using a Roman transliteration. This represented a psychological barrier to many reviewers, especially to native Arabic speakers. While trained linguists are used to making a distinction between a language and the orthography (or orthographies) used to write it (Sampson, 1985; Coulmas, 1989; Daniels and Bright, 1996), some observers of the old ALPNET system saw Roman characters on the screen and concluded immediately that we weren't dealing with real Arabic at all.

It is important to make a distinction between what I will call TRANSLITERATIONS and TRANSCRIPTIONS. I use the term transliteration to denote an orthography that reproduces all and only the significant orthographical distinctions found in a conventional orthography (Arabic orthography in our case) but which carefully replaces the original symbols with new symbols that are more convenient to store or display. The limitations of typewriters, printers and computer software in Europe and the Americas often make it convenient to represent Arabic orthography using Roman letters, as in the ALPNET system. To qualify as a transliteration, under this definition, the new orthography must be trivially and unambiguously convertible back and forth with the original orthography. Computer text encodings like UNICODE are also transliterations where the substituted symbols are numbers rather than letters. All text on a computer, whether English, Arabic or whatever, is transliterated.
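A minimal sketch can make the reversibility requirement concrete. The table below is an assumption for illustration: it covers only a handful of symbols, using the widely known Buckwalter-style pairings for those letters, and it is not the transliteration table of the ALPNET or Xerox systems. The point is only that a one-to-one table can be inverted mechanically, so nothing is lost in either direction.

```python
# Illustrative subset only; symbol choices follow the common Buckwalter scheme.
TO_ARABIC = {
    "A": "\u0627",  # alif
    "b": "\u0628",  # baa
    "t": "\u062A",  # taa
    "d": "\u062F",  # daal
    "r": "\u0631",  # raa
    "s": "\u0633",  # siin
    "k": "\u0643",  # kaaf
    "a": "\u064E",  # fatha (diacritic)
    "u": "\u064F",  # damma (diacritic)
    "i": "\u0650",  # kasra (diacritic)
    "~": "\u0651",  # shadda (diacritic)
}

# A transliteration must be one-to-one, so the table can simply be inverted.
TO_ROMAN = {arabic: roman for roman, arabic in TO_ARABIC.items()}
assert len(TO_ROMAN) == len(TO_ARABIC), "mapping is not reversible"

def to_arabic(roman_text: str) -> str:
    """Convert romanized symbols to Unicode Arabic, symbol by symbol."""
    return "".join(TO_ARABIC[ch] for ch in roman_text)

def to_roman(arabic_text: str) -> str:
    """Convert Unicode Arabic back to the romanized symbols."""
    return "".join(TO_ROMAN[ch] for ch in arabic_text)

if __name__ == "__main__":
    word = "kutiba"
    print(to_arabic(word), "<->", to_roman(to_arabic(word)))
```

A phonological transcription would fail the `len` check or the round trip, because several orthographic symbols may share one transcription symbol or have none at all.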
The trivial reversibility of traditional Arabic orthography and a proper transliteration means that they are, for all computational purposes, equivalent.

Transliterations, in the sense just defined, must be distinguished from transcriptions, which usually serve for phonological description and may have little relation to the conventional orthography. Transcriptions have valid uses, and they represent possible orthographies for Arabic, but they are not automatically mappable into the conventional orthography. A computer system based on a transcription would be of no use in analyzing genuine Arabic text written in conventional Arabic orthography.

Despite the fact that the ALPNET system used a proper transliteration rather than a transcription of Arabic, the interface was still not esthetically acceptable to many users. The solution to this problem in the Xerox system was to write a Java interface that renders the underlying codes as genuine Arabic orthography. As HTML and Java still have no built-in support for the display of Arabic, a bitmap font, derived from a public-domain font by Yannis Haralambous, was smuggled into the applet disguised as integer arrays, and the rendering algorithm was implemented at low levels using Java graphics functions.

3 System Description

The information flow in the Xerox Arabic system is shown in Figure 2. Users access the system via the Arabic homepage at http://www.xrce.xerox.com/research/mltt/arabic. The Arabic HTML entry page is mostly filled by a Java applet that represents a virtual keyboard. Users can type in words either by mouse-clicking on the virtual-key objects or by typing the corresponding keys on their physical keyboards. Four different keyboard layouts are currently provided: PC-like, Mac-like, and the implementors' preferred Buckwalter Transliteration laid out on both English and French keyboards. As words are typed, the appropriate UNICODE Arabic characters are added to an internal buffer (Figure 3).
That buffer is observed by an Arabic Canvas object that renders the actual Arabic orthography on the screen, updating the display each time the buffer is changed in any way.

When the user presses the Enter key (or clicks on the Enter-key object) the buffer contents are sent to a Perl CGI script running on a server in Grenoble. The script applies each input word in an upward direction to the morphological analyzer (Figure 2), which is implemented as a finite-state transducer (FST). Typically there are several output strings, each representing a possible analysis of the input word. When presenting the various solutions back to the user, it is valuable to display the fully-voweled (full diacritics) spelling of each solution. So each solution string is then applied in a downward direction to a morphological generator, which is exactly the same as the analyzer except for having a lower-level language that is restricted to fully-voweled strings. The various solutions are also tokenized into morphemes, which are looked up in a dictionary of English glosses. The Perl script incorporates the analyses, the generated fully-voweled strings, and the English glosses into an HTML page which is sent back to the user's Internet browser. Arabic words in the HTML document are enclosed in APPLET elements which cause the browser to invoke the Java applet that renders the string in traditional Arabic orthography on the user's screen.

[Figure 2: Information Flow in the Xerox Arabic System. Labels shown: Web Browser (Java) on the User Machine; Internet; CGI Script (Perl), Morphological Analyzer (FST), Morphological Generator (FST), English Gloss and Buffer on the Xerox Server. Input words go to the analyzer, which yields analysis1 ... analysisn; fully voweled strings and glosses are returned as output HTML.]

4 Finite-State Morphological Processing

4.1 General Finite-State Theory and Techniques

The Arabic Morphological Analyzer is built using finite-state compilers and algorithms, and the results are stored and run as finite-state transducers. As both Two-Level Morphology (Koskenniemi, 1983; Karttunen, 1983; Gajek et al., 1983; Antworth, 1990) and Finite-State Morphology (Karttunen et al., 1992; Karttunen, 1994) are abundantly documented elsewhere, only an outline of finite-state computing will be provided here. Both systems are based on the insight, originally due to C. Douglas Johnson (Johnson, 1972) and rediscovered by Ronald Kaplan and Martin Kay (Kaplan and Kay, 1981; Kaplan and Kay, 1994), that the rewrite rules used traditionally by linguists to describe phonological derivations are only finite-state in power and can be modeled as finite-state transducers.

Let's step back a minute to see what this means. All computer scientists and formal linguists are acquainted with REGULAR EXPRESSIONS and the REGULAR LANGUAGES that they denote. In formal language theory, a LANGUAGE is simply a set of strings of symbols (which linguists usually think of as words consisting of letters). If we consider English, French and Arabic similarly as sets of strings, then linguists can, and do, write regular expressions, i.e.
finite-state grammars, that model the natural language as closely as possible. From a computational point of view, regular expressions can be compiled into FINITE-STATE MACHINES (FSMs) that accept all the words of the language and reject all words that are not in the language. The mathematical properties of regular languages are well understood; they are closed under operations such as union, concatenation, and intersection.

A regular expression modeling a natural language like Spanish, compiled into a finite-state machine, could be the basis for a simple linguistic product like a spell checker in a word processor. Every word in the document would be applied to the finite-state machine, and every word rejected by the machine would be flagged as misspelled. Such a spell checker would be useful and reliable to the extent that the formal language corresponds to the real Spanish language.

[Figure 3: Arabic Script Display in Java. Labels shown: Physical Keyboard; Virtual Keyboard Java Object; ascii events (e.g. ktb); UNICODE Observable Buffer Object (kaaf, taa, baa); Arabic Canvas Object.]

REGULAR RELATIONS, which are not nearly so well known (Kaplan and Kay, 1994), are relations or mappings between two regular languages. For our human convenience, we can think of a finite-state relation as having an upper-side language and a lower-side language; each string in one language is related to one or more strings in the other language. From another point of view, we can also think of relations as languages that consist of ordered pairs of strings, and any ordered pair of strings is either in the relation or it's not. A FINITE-STATE TRANSDUCER (FST) is the corresponding machine that accepts all and only the ordered pairs in the relation; if given a string from the lower language, an FST returns all the related strings in the upper language, and vice versa. Finite-state transducers can implement more interesting linguistic systems such as morphological analyzers, tokenizers, part-of-speech taggers, and even some simple syntactic parsers.

Consider the cascade of two rewrite rules in Figure 4 that maps an underlying string kaNpat, consisting of the morpheme kaN, with an underspecified nasal N, concatenated with another morpheme pat, into the surface string kammat. Johnson's result tells us that the rules can be modeled as regular relations and can therefore be compiled into finite-state transducers.

    kaNpat    regular language L1
        Rule 1: N -> m / _ p    (FST1)
    kampat    regular language L2
        Rule 2: p -> m / m _    (FST2)
    kammat    regular language L3

Figure 4: A Cascade of Rewrite Rules Corresponds to a Composition of Transducers
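The two-rule cascade can be imitated in a few lines. This is only a sketch under stated assumptions: it uses Python regular-expression substitution rather than compiled transducers, the function names are invented for illustration, and the underspecified nasal is written as a capital N. Each rule is applied to the output of the previous one, which is exactly what a cascade means.

```python
import re

def rule1(s: str) -> str:
    """Rule 1: N -> m / _ p  (the underspecified nasal becomes m before p)."""
    return re.sub(r"N(?=p)", "m", s)

def rule2(s: str) -> str:
    """Rule 2: p -> m / m _  (p becomes m after m)."""
    return re.sub(r"(?<=m)p", "m", s)

def cascade(s: str) -> str:
    """Apply the rules in order, passing through the intermediate level."""
    return rule2(rule1(s))

if __name__ == "__main__":
    underlying = "kaNpat"
    intermediate = rule1(underlying)
    surface = rule2(intermediate)
    print(underlying, "->", intermediate, "->", surface)
```

Unlike a transducer, these functions run in one direction only; recovering kaNpat from kammat would need a separately written inverse.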
The well-understood mathematics of finite-state transducers tells us that if there is a regular relation between L1 and L2, and another regular relation between L2 and L3, then there exists another regular relation that maps directly from L1 to L3, effectively discarding the intermediate level. If R1 and R2 can indeed be compiled into finite-state transducers FST1 and FST2, then the single transducer can be computed as the COMPOSITION FST1 .o. FST2 of the two transducers. By successive application of composition, which is yet another well-understood finite-state operation, even a complex grammar with 60 levels could be reduced to two levels, with corresponding benefits in runtime speed. In addition, transducers are inherently bidirectional, allowing the grammar to be run either forwards for generation or backwards for analysis.

This was all very nice in theory, but efficient algorithms for compiling and manipulating FSTs did not exist in the early 1980s. Xerox started work to produce these algorithms and rule compilers based on them (Karttunen et al., 1987; Karttunen and Beesley, 1992). Soon it was realized that morphological information (such as part-of-speech tags, person, number, tense and mood tags) could be encoded as symbols within strings, and that lexicons themselves could thus be compiled into transducers (Karttunen, 1993). This lexical FST could then be composed with the derivation rules to create a single all-encompassing FST called a LEXICAL TRANSDUCER (Karttunen et al., 1992; Karttunen, 1994) that contained all the lexical and derivational information, mapping directly between upper-level analysis strings and lower-level surface strings. Such a lexical transducer (Figure 5) then implements the Black Box morphological analyzer sketched above in Figure 1.

In the Xerox Arabic system, lexicons are written in the lexc language and are compiled into finite-state transducers.
Rules to intersect roots and patterns, and derivational rules to perform deletion, epenthesis, assimilation and metathesis, are written in the twolc language and in another notation known as REPLACE RULES (Karttunen, 1995; Karttunen and Kempe, 1995; Kempe and Karttunen, 1996). Not surprisingly, the rules controlling the realization of hamza (the glottal stop) are particularly subtle. The rules also compile into finite-state transducers, and all the components of the grammar are combined into a single lexical transducer using the finite-state union and composition algorithms. By keeping within the finite-state domain, grammatical components can be combined and modified using any valid finite-state operation. Lexical transducers can be run forwards to generate or backwards to analyze, and they are computationally very efficient for natural-language problems. The code that runs lexical transducers is completely language-independent.
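Composition and bidirectionality can be illustrated without any finite-state toolkit by treating a relation as a plain set of string pairs. The sketch below is an assumption-laden toy: real regular relations are usually infinite and are represented by transducers, whereas here the relations R1 and R2 are tiny hand-listed sets, and the helper names are hypothetical. It shows that composing the two rule relations gives a single relation with no intermediate level, and that the same relation can be queried downward (generation) or upward (analysis).

```python
Relation = set[tuple[str, str]]

def compose(r1: Relation, r2: Relation) -> Relation:
    """Pairs (x, z) such that r1 relates x to some y and r2 relates y to z."""
    return {(x, z) for (x, y1) in r1 for (y2, z) in r2 if y1 == y2}

def apply_down(r: Relation, upper: str) -> list[str]:
    """Generation: all lower-side strings related to an upper-side string."""
    return sorted(lo for (up, lo) in r if up == upper)

def apply_up(r: Relation, lower: str) -> list[str]:
    """Analysis: all upper-side strings related to a lower-side string."""
    return sorted(up for (up, lo) in r if lo == lower)

# Hand-listed fragments of the two rule relations (illustrative only).
# R1: N -> m before p; other strings map to themselves.
R1: Relation = {("kaNpat", "kampat"), ("kampat", "kampat"), ("kammat", "kammat")}
# R2: p -> m after m; other strings map to themselves.
R2: Relation = {("kampat", "kammat"), ("kammat", "kammat")}

# One relation from the underlying level straight to the surface level.
R = compose(R1, R2)

if __name__ == "__main__":
    print("down:", apply_down(R, "kaNpat"))
    print("up:  ", apply_up(R, "kammat"))
```

Running the composed relation upward on kammat returns three candidate underlying strings, a small picture of why analysis is typically ambiguous while the machinery itself stays the same in both directions.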

[Figure 5: A Morphological Analyzer-Generator can be Implemented as a Finite State Transducer. The Finite State Transducer relates the Language of Analysis Strings to the Language of Orthographical Surface Strings.]

4.2 Finite-State Grammars and the Arabic Stem

In Two-Level and Finite-State Grammars, dictionary-like formalisms (Antworth, 1990) and even high-level languages such as lexc (Karttunen, 1993) are used to define lexicons. The finite-state operations that they traditionally assume are union and concatenation. Linguists specify sublexicons of morphemes, such as Spanish verb stems or Arabic perfect verb endings, and these morpheme strings are formally unioned together. A notation of CONTINUATION CLASSES designates for each morpheme which other classes of morphemes can follow, and this translates naturally into concatenation. The effective limitation of classic finite-state lexicography to modeling concatenative morphology has been noted and criticized, and it has often been assumed that these systems would be useless for modeling non-concatenative morphology as in Arabic.

In the ALPNET Arabic system (Beesley, 1989; Beesley, 1990) and at Xerox, the Arabic lexicons also use the continuation-class notation to indicate the co-occurrence of morphemes in words, and most continuations are implemented in the traditional way as concatenation. However, the association of morphemes comprising the stem is formalized as intersection. At ALPNET, lacking actual finite-state algorithms such as intersection, the intersection of roots and patterns was simulated in C code at runtime, a process dubbed DETOURING. At Xerox, roots and patterns are pre-intersected at compile time using finite-state rules.

A simplified example can be included here. Let us assume that a stem like daras (دَرَس) is analyzed as a combination of a root drs, a CV pattern CVCVC and a vocalization a. We formalize roots as regular languages, e.g. [6]

    define drs [ d r s ]/? ;
    define ktb [ k t b ]/? ;

where ? represents any symbol and where / is the ignore operator. The root drs is therefore defined as the regular language consisting of all strings containing d, r and s, in that order, and ignoring any characters around them. Similarly, we define C as the union of all the consonants in our fragment, V as the union of all vowels, and the various CV templates as regular languages consisting of concatenations of consonants and vowels.

    define C [ d | r | s | k | t | b | q | b | l ] ;
    define V [ a | i | u ] ;
    ! etcetera
    define FormI C V C V C ;
    define FormII C V C X V C ;
    define FormIII C V V C V C ;
    ! etcetera

Vocalizations are defined similarly:

    define PerfAct [ a* ]/nv ;
    define PerfPass [ u* i ]/nv ;

where [ a* ]/nv denotes the regular language consisting of any number of a's, ignoring all non-vowel symbols. With such definitions, abstract stems can then be computed straightforwardly by intersection, e.g.

    drs & FormI & PerfAct      yields the regular language consisting of just one string, daras;
    drs & FormIII & PerfPass   yields stem duuris;
    drs & FormII & PerfPass    yields stem durXis,

where symbol X is realized later as a copy of the previous consonant (or an orthographical shadda) by subsequently applied variation rules, similar to Antworth's finite-state treatment of reduplication in Tagalog (Antworth, 1990). The details of the formation of Arabic stems via intersection must be presented elsewhere (Beesley, 1997), and a new finite-state algorithm called MERGE is in development and promises results equivalent to intersection but with greater convenience and efficiency.

5 Conclusion

The Xerox Arabic Morphological Analyzer and Generator was made available for testing on the Internet on 9 September 1997. Currently the underlying lexicons include about 4930 roots, each one hand-encoded to indicate the subset of patterns with which it can combine; in practice, the average root participates in about 18 morphotactically distinct stems, yielding approximately 90,000 stems based on roots and patterns. Some of these are phonologically indistinguishable on the lower side, resulting in approximately 70,000 root-pattern intersections to be performed during compilation. The pattern dictionary contains about 400 phonologically distinct entries, many of them ambiguous. Other sublexicons include prefixes, suffixes, and whole stems that are not root-based. Altogether, the lexicon compiles into a finite-state transducer that accounts for 72,000,000 abstract words, each of which can be analyzed in any of its possible spellings, ranging from no diacritics to fully voweled. Additions to the lexicon are easy for those who are trained in the internal codes.
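Returning briefly to the stem construction of Section 4.2, the three-way intersection of root, template and vocalization can be imitated on a small scale. The sketch below is not the xfst computation and makes several assumptions: it enumerates the finite language of each template and filters it with two predicates, instead of intersecting automata; the consonant and vowel inventories are the small fragment used in the text; and all Python names are hypothetical. The result is the same set a true intersection would give for these inputs.

```python
import re
from itertools import product

C = "drsktbql"  # consonants of the small fragment
V = "aiu"       # vowels of the small fragment

def template(shape: str) -> set[str]:
    """Finite language of a CV template: 'C' and 'V' are slots, any other
    symbol (such as X) stands for itself."""
    slots = [C if s == "C" else V if s == "V" else s for s in shape]
    return {"".join(p) for p in product(*slots)}

def root(radicals: str):
    """Like [ d r s ]/? : the radicals in order, any symbols in between."""
    pat = re.compile(".*" + ".*".join(map(re.escape, radicals)) + ".*")
    return lambda w: pat.fullmatch(w) is not None

def vocalization(vowel_pattern: str):
    """Like [ a* ]/nv : the pattern must match the vowels of the word,
    with all non-vowel symbols ignored."""
    pat = re.compile(vowel_pattern)
    return lambda w: pat.fullmatch("".join(ch for ch in w if ch in V)) is not None

def stems(root_ok, tmpl: set[str], voc_ok) -> list[str]:
    """Strings that belong to all three languages at once."""
    return sorted(w for w in tmpl if root_ok(w) and voc_ok(w))

FORM_I, FORM_II, FORM_III = template("CVCVC"), template("CVCXVC"), template("CVVCVC")
DRS, KTB = root("drs"), root("ktb")
PERF_ACT, PERF_PASS = vocalization("a*"), vocalization("u*i")

if __name__ == "__main__":
    print(stems(DRS, FORM_I, PERF_ACT))
    print(stems(DRS, FORM_III, PERF_PASS))
    print(stems(DRS, FORM_II, PERF_PASS))
```

Brute-force enumeration is workable only because each template language here has a few thousand strings; the appeal of the finite-state approach is that the same intersection is computed on compact automata.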
The Xerox system is still in the research stage, but the project will be continued and commercialized if sufficient interest is shown. It is already known that the dictionary needs more proper names, and testing will no doubt uncover other omissions. We need to devise a way to handle multi-word expressions before the work expands into part-of-speech disambiguation and parsing. Work will continue on the user interface, including the provision of new virtual-keyboard layouts and cut-and-paste input from online text. The analyzer itself may also be redesigned to take advantage of the newest finite-state algorithms, such as merge, which provide a more efficient way to perform the intersections that build stems.

Notes

[1] Kenneth R. Beesley, D.Phil. 1983, Edinburgh University, is a Principal Scientist at the Xerox Research Centre Europe, Grenoble Laboratory, Multi-Lingual Theory and Technology Group.

[2] Arabic-script displays in this paper were encoded using the excellent ArabTeX package for TeX and LaTeX by Prof. Dr. Klaus Lagally of the University of Stuttgart. My thanks to him for his help, and especially for a custom-written style file that allows me to use my own favorite Arabic transliteration.

[3] The highly influential early Arabic work of (McCarthy, 1981) inspired a wide range of challenges, responses, and attempts at formalization. The formal assumptions of McCarthy were generally attacked by (Hudson, 1986), defended by (Haile and Mtenje, 1988) and then attacked again by (Hudson, 1991). Martin Kay (Kay, 1982; Kay, 1987) accepted McCarthy's data, but he formalized the association of roots, CV templates and vocalizations using a multi-level finite-state transducer. Other work inspired primarily by McCarthy's data includes (Bird and Blackburn, 1991) and (Kiraz, 1994a).

[4] McCarthy (McCarthy, 1981; McCarthy, 1982) cites the charming example of a Hijazi Bedouin play language wherein speakers freely scramble the order of the radicals within a root.
5 A large English system and many other two-level demonstrations are available free from the Summer Institute of Linguistics: http://www.sil.org/pckimmo/pc-kimmo.html. German, Russian, Swahili and the Scandinavian systems were created by Kimmo Koskenniemi, his students and associates: http://www.lingsoft.fi/. Another large German system was written by Anne Schiller (Schiller, 1996). MorphoLogic has used Two-Level Morphology to write large analyzers for Polish and Hungarian: http://www.morphologic.hu/morphenu.htm. A two-level Klingon analyzer was built in three weeks of madness by the present author (Beesley, 1992b; Beesley, 1992a; Okrand, 1985).
6 The define statements in the text are executable commands in the Xerox xfst interface to the finite-state algorithms.

References

Antworth, E. L. (1990). PC-KIMMO: a two-level processor for morphological analysis. Number 16 in Occasional publications in academic computing. Summer Institute of Linguistics, Dallas.
Beesley, K. R. (1989). Computer analysis of Arabic morphology: A two-level approach with detours. In Third Annual Symposium on Arabic Linguistics, Salt Lake City. University of Utah. Published as Beesley, 1991.
Beesley, K. R. (1990). Finite-state description of Arabic morphology. In Proceedings of the Second Cambridge Conference on Bilingual Computing in Arabic and English. No pagination.
Beesley, K. R. (1991). Computer analysis of Arabic morphology: A two-level approach with detours. In Comrie, B. and Eid, M., editors, Perspectives on Arabic Linguistics III: Papers from the Third Annual Symposium on Arabic Linguistics, pages 155-172. John Benjamins, Amsterdam. Read originally at the Third Annual Symposium on Arabic Linguistics, University of Utah, Salt Lake City, Utah, 3-4 March 1989.
Beesley, K. R. (1992a). Klingon morphology, part 2: Verbs. HolQeD, 1(3):10-18.
Beesley, K. R. (1992b). Klingon two-level morphology, part 1: Nouns. HolQeD, 1(2):16-24.
Beesley, K. R. (1996). Arabic finite-state morphological analysis and generation. In COLING-96 Proceedings, volume 1, pages 89-94, Copenhagen. Center for Sprogteknologi. The 16th International Conference on Computational Linguistics.
Beesley, K. R. (1997). Arabic stem morphotactics via finite-state intersection. In preparation.
Beesley, K. R., Buckwalter, T., and Newton, S. N. (1989). Two-level finite-state analysis of Arabic morphology. In Proceedings of the Seminar on Bilingual Computing in Arabic and English, Cambridge, England. No pagination.
Bird, S. and Blackburn, P. (1991). A logical approach to Arabic phonology. In EACL-91, pages 89-94.
Buckwalter, T. A. (1990). Lexicographic notation of Arabic noun pattern morphemes and their inflectional features. In Proceedings of the Second Cambridge Conference on Bilingual Computing in Arabic and English. No pagination.
Coulmas, F. (1989). The Writing Systems of the World. Blackwell, Oxford.
Daniels, P. T. and Bright, W. (1996). The World's Writing Systems. Oxford University Press, Oxford.
Gajek, O. et al. (1983). LISP implementation. In Dalrymple et al., editors, Texas Linguistic Forum, number 22, pages 187-202. Linguistics Department, University of Texas, Austin, TX.
Haile, A. and Mtenje, A. (1988). In defence of the autosegmental treatment of nonconcatenative morphology. Journal of Linguistics, 24:433-455. Reply to Hudson (1986).
Harris, Z. (1941). Linguistic structure of Hebrew. Journal of the American Oriental Society, 62:143-167.
Hudson, G. (1986). Arabic root and pattern morphology without tiers. Journal of Linguistics, 22:85-122. Reply to McCarthy (1981).
Hudson, G. (1991). On a defence of autosegmentalism. Journal of Linguistics, 27:163-177. Reply to Haile and Mtenje (1988).
Johnson, C. D. (1972). Formal aspects of phonological description. Mouton, The Hague.
Kaplan, R. M. and Kay, M. (1981). Phonological rules and finite-state transducers. In Linguistic Society of America Meeting Handbook, Fifty-Sixth Annual Meeting, New York. Abstract.
Kaplan, R. M. and Kay, M. (1994). Regular models of phonological rule systems. Computational Linguistics, 20(3):331-378.
Karttunen, L. (1983). KIMMO: a general morphological processor. In Dalrymple et al., editors, Texas Linguistic Forum, number 22, pages 165-186. Linguistics Department, University of Texas, Austin, TX.
Karttunen, L. (1993). Finite-state lexicon compiler. Technical Report ISTL-NLTT-1993-04-02, Xerox Palo Alto Research Center, Palo Alto, CA.
Karttunen, L. (1994). Constructing lexical transducers. In Proceedings of the 15th International Conference on Computational Linguistics, Coling, Kyoto, Japan.
Karttunen, L. (1995). The replace operator. In Proceedings of the 33rd Annual Meeting of the ACL, Cambridge, MA. Available at http://www.xrce.xerox.com/publis/mltt/mltttech.html.
Karttunen, L. and Beesley, K. R. (1992). Two-level rule compiler. Technical Report ISTL-92-2, Xerox Palo Alto Research Center, Palo Alto, CA.
Karttunen, L., Kaplan, R. M., and Zaenen, A. (1992). Two-level morphology with composition. In Proceedings COLING 92, pages 141-148, Nantes, France.
Karttunen, L. and Kempe, A. (1995). The parallel replacement operation in finite-state calculus. Technical Report MLTT-021, Rank Xerox Research Centre, Grenoble, France. Available at http://www.xrce.xerox.com/publis/mltt/mltttech.html.
Karttunen, L., Koskenniemi, K., and Kaplan, R. M. (1987). A compiler for two-level phonological rules. Technical report, Xerox Palo Alto Research Center and Center for the Study of Language and Information, Stanford University.
Kay, M. (1982). Nonconcatenative finite-state morphology. Uncertain date. A shorter version was later published as Kay (1987).
Kay, M. (1987). Nonconcatenative finite-state morphology. In Proceedings of the Third Conference of the European Chapter of the Association for Computational Linguistics, pages 2-10.
Kempe, A. and Karttunen, L. (1996). Parallel replacement in finite-state calculus. In COLING-96, Copenhagen.
Kiraz, G. (1994a). Multi-tape two-level morphology: a case study in Semitic non-linear morphology. In COLING-94: Papers Presented to the 15th International Conference on Computational Linguistics, volume 1, pages 180-186.
Kiraz, G. A. (1994b). Computational Analyses of Arabic Morphology. PhD thesis, University of Cambridge.
Koskenniemi, K. (1983). Two-level morphology: A general computational model for word-form recognition and production. Publication 11, University of Helsinki, Department of General Linguistics, Helsinki.
McCarthy, J. J. (1981). A prosodic theory of nonconcatenative morphology. Linguistic Inquiry, 12(3):373-418.
McCarthy, J. J. (1982). Prosodic templates, morphemic templates, and morphemic tiers. In Hulst, V. D. and Smith, N., editors, The Structure of Phonological Representations, number 1. Foris, Dordrecht.
McCarthy, J. J. (1993). Template form in prosodic morphology. In Stvan, L. et al., editors, Papers from the Third Annual Formal Linguistics Society of Midamerica Conference, pages 187-218, Bloomington, Indiana. Indiana University Linguistics Club.
Okrand, M. (1985). The Klingon Dictionary: English/Klingon, Klingon/English. Pocket Books, New York.
Sampson, G. (1985). Writing Systems. Hutchinson, London.
Schiller, A. (1996). Deutsche Flexions- und Kompositionsmorphologie mit PC-KIMMO. In Hausser, R., editor, Linguistische Verifikation: Dokumentation zur Ersten Morpholympics 1994, number 34 in Sprache und Information. Max Niemeyer Verlag, Tübingen.
Wehr, H. (1979). A Dictionary of Modern Written Arabic. Spoken Language Services, Inc., Ithaca, NY, 4th edition. Edited by J. Milton Cowan.