Integration of an XML electronic dictionary with linguistic tools for natural language processing




Information Processing and Management xxx (2006) xxx-xxx
www.elsevier.com/locate/infoproman

Octavio Santana Suárez, Francisco J. Carreras Riudavets, Zenón Hernández Figueroa, Antonio C. González Cabrera *

Department of Informática y Sistemas, Edificio de Informática y Matemáticas, Campus Universitario de Tafira, Universidad de Las Palmas de Gran Canaria, 35017 Las Palmas, Spain

Received 22 June 2006; received in revised form 9 August 2006; accepted 16 August 2006

Abstract

This study proposes the codification of lexical information in electronic dictionaries, in accordance with a generic and extensible XML schema model, and its combination with linguistic tools for natural language processing. Our approach differs from similar studies in that we propose XML coding only for those items from a dictionary of meanings that are less related to the lexical units. Linguistic information, such as morphology, syllables, phonology, etc., is supplied by specific linguistic tools. Using XML as a container for the information allows other XML tools to be used for carrying out searches or for presenting the information in different media. This model is particularly important as it combines two parallel paradigms (extensible markup of documents and computational linguistics), and it is also applicable to other languages. We have included a comparison with the markup proposal for printed dictionaries put forward by the Text Encoding Initiative (TEI). The proposed design has been validated with a dictionary of more than 145 000 accepted meanings.

© 2006 Elsevier Ltd. All rights reserved.

Keywords: Encoding; Dictionary; XML; Computational linguistics

1. Introduction

Dictionaries have weakly structured contents (entries), describing the information relative to each lexical unit.
At the same time, it has been demonstrated that the structure of the articles changes significantly from one dictionary to another. This study proposes a formal XML schema model for Spanish language dictionaries that represents, in XML format, the structure and contents that are most relevant for natural language processing. The main objectives are the following: (a) consider only the information inherent to the entity being defined, (b) structure the information in such a way that it allows efficient searches, and (c) allow integration with other linguistic tools and technologies. Information that is more related to the lexical unit than to the semantics represented, such as morphology, is not included in the XML design, as in the Spanish language this is highly irregular and can be obtained by means of linguistic tools that solve the problem (Santana, Pérez, Carreras, Hernández, & Rodríguez, 2003).

* Tel.: +34 928458729; fax: +34 928458711. E-mail addresses: osantana@dis.ulpgc.es (O. Santana Suárez), fcarreras@dis.ulpgc.es (F.J. Carreras Riudavets), zhernandez@dis.ulpgc.es (Z. Hernández Figueroa), agonzalez@dis.ulpgc.es (A.C. González Cabrera).
0306-4573/$ - see front matter © 2006 Elsevier Ltd. All rights reserved. doi:10.1016/j.ipm.2006.08.005

We have opted to use the XML standard as a container for the information inherent in electronic dictionaries because the source information (printed dictionaries) is given in a semi-structured way. In a relational database, this information would produce either a few tables with a large number of null-valued fields or a large number of tables. XML complies with the requirements for storing lexical resources, explained in detail in Bird and Gary (2002). Bird proposed storing this type of information in accessible and non-proprietary formats, so that it can be easily shared between applications.

As a method for representing the information in the XML document, we have chosen the W3C XML Schema recommendation instead of DTD (Document Type Definition). XML Schema is currently considered a replacement for DTD, and both are used to define restrictions on XML documents. In contrast to DTD, XML Schema provides strongly typed data and support for namespaces and, as it is itself based on XML, it is also extensible (Dongwon & Chu, 2000).

Once the dictionaries have been built in this format, they can be stored in relational databases with XML support, or in native XML databases, thereby becoming very valuable tools in the field of computational linguistics.
For example, by means of a Java interface and XSL transformations we could present the results of queries in different media, such as HTML pages or mobile devices by means of WAP, or implement web services for communication between applications.

The remainder of this paper is structured as follows. In Section 2 we discuss related work. Section 3 describes the formal schema proposed. The representation of the proposed model in XML Schema is shown in Section 4. In Section 5 we compare our schema with the TEI proposal. We provide justification of the proposed model in Section 6. In Section 7 we describe the validation of our approach. We touch on future work in Section 8 and conclude in Section 9.

2. Similar studies

The encoding of the information present in printed dictionaries has been dealt with extensively, although a good number of the studies have been aimed at English dictionaries. For example, Ide, Véronis, Warwick-Armstrong, and Calzolari (1992) drafted, in accordance with the first works from the Text Encoding Initiative (TEI), several guides for encoding dictionaries. Then, in Ide and Véronis (1995), the main problems encountered when designing a codification format for dictionaries were described. They highlighted, firstly, the conflict between the generality of a description, formulated in such a way that it is applicable to other dictionaries, and the capacity to describe precisely the specific structure of any one dictionary. Secondly, they mentioned the need to accommodate various views and applications of the encoded dictionary, for example as a printed object and as an information database.

The approaches most widely recognised and accepted by the international community are those put into effect by the Text Encoding Initiative.
In Sperberg-McQueen and Burnard (1994), a series of basic tags were defined for the encoding of mono- and multilingual dictionaries with human readers in mind, taking printed dictionaries as a reference. This recommendation was later transposed to the XML standard (Sperberg-McQueen & Burnard, 2004, chap. 12). Other studies have also appeared, dealing with the identification of the essential information fields that appear in dictionaries; Amsler and Tompa (1998) made an extensive proposal on the tags and attributes that should be used to characterise the information contained in dictionaries.

Nevertheless, all of these studies have been mainly determined by the format of printed dictionaries, whose restrictions limit the development of the models, including a more effective underlying structure for the lexical information (Ide, Kilgarriff, & Romary, 2000). Therefore, as already indicated by Nancy Ide, the TEI recommendation for the codification of electronic dictionaries does not provide a sufficiently concise description to be used efficiently in encoding the information schema of electronic dictionaries.

Other previous studies on this topic (Tutin & Véronis, 1998) highlighted the high degree of generality of the proposed DTD and described some of the problems in its use. Other researchers are also using the XML recommendation as a framework for the information included in dictionaries (Berger, Reitter, & Stede, 2002) although, as already indicated by some of them, the use of these technologies and their potential are only starting to be explored.

This work is the first to design an XML schema specifically for the treatment of Spanish. Our proposal also avoids the ambiguity of the TEI encoding specification. We have designed a formal XML schema able to support, in addition to English, the main Spanish dictionaries, and it may be easily extended to support other languages. The main aim of our proposal is to develop a means of better representing the information in the dictionaries. This opens a window towards future electronic dictionaries as sources of lexical and morphological wealth. A solid, robust and versatile design allows the development of different interfaces according to necessity and the destination of the information.

3. Formal model of the mono-lingual electronic dictionary: MOFDEM

Dictionaries store and organise data structures that we call entries. As indicated in Sperberg-McQueen and Burnard (2004, chap. 12), a simple dictionary entry should include the following information: the form of the word, the grammar class, the definition, synonyms, translation to other languages, etymology, references to other entries, usages and examples. Therefore, every entry of the dictionary provides a description of a certain word by means of a series of attributes. For the construction of this schema we have identified the types of information shown in dictionaries: etymology, definition, grammar category, etc.
The hierarchy of the elements has been determined and they have been labelled with explicit names, avoiding the use of abbreviations.

The structure of a dictionary can be represented as a finite set of elements, $E_i$, that includes each and all of the entries in the dictionary:

$D = \{E_1, \ldots, E_n\}$

Each entry $E_i$ is made up of the headword and its possible graphic variations, $W_i$ (generally, simple lexical units), plus a finite set of articles, $A_i$, including information relative to the specific word. Phrases in foreign languages and Latinisms, adopted due to their usage in the Spanish language, can become entries too, for example hot dog, ad hoc, etc. A phrase is also considered an entry when any of the complete words that make it up does not exist as a separate entry, for example en volandas. These phrases are dealt with as headwords, $W_i$. Productive suffixes and prefixes are treated in the same way. Separate articles are created according to the possible etymologies of the same word.

$E_i = \{W_i, A_i\}$, where $W_i = \{W_{i1}, \ldots, W_{im}\}$ and $A_i = \{A_{i1}, \ldots, A_{ir}\}$

In a hierarchical representation we have the following structure (Fig. 1).

Fig. 1. Structure of the dictionary entries (Dictionary > Entry 1 ... Entry i ... Entry n; Entry i > Words i, Article i1 ... Article ir; Words i > Word 1 ... Word m; Article > Head, Etymology, AcceptedMeanings, Expressions).

As shown in Fig. 1, each article $A_{ij}$ is represented by: a head, $HD_{ij}$; optionally its etymology, $ET_{ij}$; a finite set of accepted meanings, $AM_{ij}$; and a finite series of expressions or compound lexical units, $EX_{ij}$, made up of various words. The expression is shown in the entry corresponding to the first noun, adjective or verb included. Thus, the expression de buenas a primeras will be shown in the bueno entry and the expression a tontas y a ciegas will be shown in the tonto entry.

$A_{ij} = \{HD_{ij}, ET_{ij}, AM_{ij}, EX_{ij}\}$, where $AM_{ij} = \{AM_{ij1}, \ldots, AM_{ijp}\}$ and $EX_{ij} = \{EX_{ij1}, \ldots, EX_{ijq}\}$

The head, $HD_{ij}$, and the etymology, $ET_{ij}$, are simple elements (not made up of other elements) that include, respectively, the masculine and feminine forms of the headword (not only their endings) and the information relative to the origin of the specific word.

The various accepted meanings are derived from the various definitions of the word. Each one, $AM_{ijk}$, is made up of a series of elements or tags that characterise the same meaning and that can be represented in the following way (Fig. 2).

Fig. 2. Structure of the accepted meanings (AcceptedMeaning > Definition, ScientificName, Examples, GrammarCategories, Antonyms, Synonyms, Outlines, PrepositionalRegime, Localisation, Matters, Usages).

This structure can be given more hierarchical levels to represent the linguistic characteristics that interrelate the various elements. The basic representation has been chosen with the maximum level of detail, which allows the most convenient reorganisation later. The only simple elements within the accepted meaning are the definition, $DF_{ijk}$, of cardinality one, and the scientific name, $SN_{ijk}$, which is an optional element.

$AM_{ijk} = \{DF_{ijk}, SN_{ijk}, EX_{ijk}, GC_{ijk}, LC_{ijk}, MT_{ijk}, US_{ijk}, PR_{ijk}, OT_{ijk}, SI_{ijk}, AN_{ijk}\}$

The element including the examples, $EX_{ijk}$, provides quotes or sentences showing the usage of the word. In the case of quotes, information on their origin is provided (author, source, date, page) as well as the example text (Fig. 3).
$EX_{ijk} = \{EX_{ijk1}, \ldots, EX_{ijks}\}$, where $EX_{ijkh} = \{EXA_{ijkh}, EXS_{ijkh}, EXD_{ijkh}, EXP_{ijkh}, EXT_{ijkh}\}$

Fig. 3. Structure of the example element (Example > Author, Source, Date, Page, Text).

The grammar category, $GC_{ijk}$, shows the various syntactical functions that the word can carry out within a specific accepted meaning. Generally, a specific word has one sole grammar category associated with it; nevertheless, there are words that have various associated grammar categories, for example contrario, disminuido, pánfilo, ...

Not all words have the same degree of generality in terms of the geographical area of their usage. The localisation, $LC_{ijk}$, describes one or several geographical areas, $GA_{ijkh}$, where the word is mainly used. Each geographical area is made up of: the geographical zone or country and, optionally, the region or territorial division, the area (clearly defining a space within a region), and a clarification note (Fig. 4).

Fig. 4. Structure of the localisation (Geography > Zone, Region, Area, Note).

$LC_{ijk} = \{GA_{ijk1}, \ldots, GA_{ijkt}\}$ and $GA_{ijkh} = \{GAZ_{ijkh}, GAR_{ijkh}, GAA_{ijkh}, GAN_{ijkh}\}$

The matter, $MT_{ijk}$, indicates the fields of knowledge or professional activities where this word is more frequently used.

The usage, $US_{ijk}$, relates the conditions of use or the origin of the word to a specific accepted meaning. This element is made up of: the intention with which the term is used (e.g., scornful, humorous, etc.), the sociocultural scope where it is used (e.g., children, slang, etc.) and the linguistic environment it is used in (e.g., colloquial, educated, euphemistic, literary, etc.) (Fig. 5).

Fig. 5. Structure of the usage element (Usage > Intention, Collective, Speech).

$US_{ijk} = \{USI_{ijk}, USC_{ijk}, USS_{ijk}\}$

The rest of the elements that make up the accepted meaning are: the prepositional regime, $PR_{ijk}$, which indicates the prepositions that are used with the word; the outline, $OT_{ijk}$, which specifies the semantic and grammatical limits of the use of the word in a specific accepted meaning; and, finally, the synonyms, $SI_{ijk}$, and the antonyms, $AN_{ijk}$.

The last elements making up the article are the compound lexical units, made up of various words and labelled as expressions. Each expression element, $EP_{ijk}$, includes the phrase or expression entry, comprising the compound lexical unit and its possible variations (i.e., other ways of writing the same expression or phrase, for example, bien mirado, mirándolo bien, si bien se mira) and a finite series of accepted meanings of the expression, $AME_{ijk}$. The structure of the accepted meaning of the expression coincides with the prior definition of the tag Accepted Meaning.
$EP_{ijk} = \{EE_{ijk}, AME_{ijk}\}$, where $AME \equiv AM$

The structure of the expression gives rise to the following hierarchical representation (Fig. 6).

Fig. 6. Structure of the expressions (Expressions > Expression 1 ... Expression k ... Expression v; Expression k > Phrases, Accepted Meanings Expression; Phrases > Phrase 1 ... Phrase w).
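The hierarchy defined in this section can also be read as a set of nested record types. The following Python sketch is only an illustration under that reading: the class and field names are assumptions chosen to mirror the element names in Figs. 1 to 6, not part of the published model, and the sample data is the avería entry quoted in Section 5.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Example:                      # EX_ijkh: text plus optional origin data
    text: str
    author: Optional[str] = None
    source: Optional[str] = None
    date: Optional[str] = None
    page: Optional[int] = None


@dataclass
class GeographicalArea:             # GA_ijkh: zone is mandatory, the rest optional
    zone: str
    region: Optional[str] = None
    area: Optional[str] = None
    note: Optional[str] = None


@dataclass
class Usage:                        # US_ijk: intention, collective, speech
    intention: Optional[str] = None
    collective: Optional[str] = None
    speech: Optional[str] = None


@dataclass
class AcceptedMeaning:              # AM_ijk
    definition: str
    scientific_name: Optional[str] = None
    examples: List[Example] = field(default_factory=list)
    grammar_categories: List[str] = field(default_factory=list)
    localisation: List[GeographicalArea] = field(default_factory=list)
    matters: List[str] = field(default_factory=list)
    usage: Optional[Usage] = None
    prepositional_regime: List[str] = field(default_factory=list)
    outlines: List[str] = field(default_factory=list)
    synonyms: List[str] = field(default_factory=list)
    antonyms: List[str] = field(default_factory=list)


@dataclass
class Expression:                   # EP_ijk: phrase variants plus their meanings
    phrases: List[str]
    meanings: List[AcceptedMeaning] = field(default_factory=list)


@dataclass
class Article:                      # A_ij: one article per etymology
    head: str
    etymology: Optional[str] = None
    meanings: List[AcceptedMeaning] = field(default_factory=list)
    expressions: List[Expression] = field(default_factory=list)


@dataclass
class Entry:                        # E_i = {W_i, A_i}
    words: List[str]
    articles: List[Article] = field(default_factory=list)


def count_meanings(entry: Entry) -> int:
    """Total number of accepted meanings over all articles of an entry."""
    return sum(len(article.meanings) for article in entry.articles)


averia = Entry(
    words=["avería"],
    articles=[
        Article(head="avería", meanings=[
            AcceptedMeaning("Casa o lugar donde se crían aves.", grammar_categories=["f."]),
            AcceptedMeaning("Averío."),
        ]),
        Article(
            head="avería",
            etymology="Del ár. al- awariyya, las mercaderías estropeadas, "
                      "probablemente a través del cat. avaria.",
            meanings=[
                AcceptedMeaning("Daño que padecen las mercaderías o géneros.", grammar_categories=["f."]),
                AcceptedMeaning("Derecho de avería."),
                AcceptedMeaning("Azar, daño o perjuicio.", usage=Usage(speech="fam.")),
                AcceptedMeaning("Daño que impide el funcionamiento de un aparato, instalación, vehículo, etc."),
                AcceptedMeaning("Daño que por cualquier causa sufre la embarcación o su carga.", matters=["Mar."]),
            ],
        ),
    ],
)
```

Keeping one `Article` per etymology means that the two semantic blocks of avería stay separate while still sharing a single headword list.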

4. Representation of MOFDEM in XML Schema

The XML schema specifies the syntax, that is to say, the formal grammar that defines the restrictions on how the elements should appear in the XML document. The model described in the previous section is defined by means of an XML schema, in the following way:

    <?xml version="1.0" encoding="utf-8"?>
    <xs:schema xmlns="http://gedlc.ulpgc.es/diccionario.xsd"
               xmlns:mstns="http://gedlc.ulpgc.es/diccionario.xsd"
               xmlns:xs="http://www.w3.org/2001/XMLSchema"
               targetNamespace="http://gedlc.ulpgc.es/diccionario.xsd"
               elementFormDefault="qualified"
               attributeFormDefault="unqualified"
               version="8">
      <xs:element name="dictionary" type="tdictionary"/>

      <xs:complexType name="tdictionary">
        <xs:sequence>
          <xs:element name="title" type="xs:string"/>
          <xs:element name="entry" maxOccurs="unbounded">
            <xs:complexType><xs:sequence>
              <xs:element name="words">
                <xs:complexType><xs:sequence>
                  <xs:element name="word" type="xs:string" maxOccurs="15"/>
                </xs:sequence></xs:complexType>
              </xs:element>
              <xs:element name="article" maxOccurs="5">
                <xs:complexType>
                  <xs:sequence>
                    <xs:element name="head" type="xs:string"/>
                    <xs:element name="etymology" type="xs:string" minOccurs="0"/>
                    <xs:element name="meanings" minOccurs="0">
                      <xs:complexType><xs:sequence>
                        <xs:element name="meaning" type="tmeaning" maxOccurs="50"/>
                      </xs:sequence></xs:complexType>
                    </xs:element>
                    <xs:element name="expressions" minOccurs="0">
                      <xs:complexType>
                        <xs:sequence>
                          <xs:element name="expression" minOccurs="0" maxOccurs="150">
                            <xs:complexType>
                              <xs:complexContent>
                                <xs:extension base="texpression">
                                  <xs:attribute name="nexpression" type="xs:decimal" use="required"/>
                                </xs:extension>
                              </xs:complexContent>
                            </xs:complexType>
                          </xs:element>
                        </xs:sequence>
                        <xs:attribute name="nexpresion" type="xs:decimal" use="optional"/>
                      </xs:complexType>
                    </xs:element>
                  </xs:sequence>
                  <xs:attribute name="narticle" type="xs:decimal" use="required"/>
                </xs:complexType>
              </xs:element>
            </xs:sequence></xs:complexType>
          </xs:element>
        </xs:sequence>
      </xs:complexType>

      <xs:complexType name="tmeaning">
        <xs:sequence>
          <xs:element name="definition" type="xs:string"/>
          <xs:element name="scientificname" type="xs:string" minOccurs="0"/>
          <xs:element name="examples" minOccurs="0">
            <xs:complexType><xs:sequence>
              <xs:element name="example" type="texample" maxOccurs="10"/>
            </xs:sequence></xs:complexType>
          </xs:element>
          <xs:element name="grammarcategories">
            <xs:complexType><xs:sequence>
              <xs:element name="category" maxOccurs="5">
                <xs:complexType>
                  <xs:simpleContent>
                    <xs:extension base="xs:string">
                      <xs:attribute name="code" type="xs:decimal" use="required"/>
                      <xs:attribute name="class" type="xs:string" use="required"/>
                    </xs:extension>
                  </xs:simpleContent>
                </xs:complexType>
              </xs:element>
            </xs:sequence></xs:complexType>
          </xs:element>
          <xs:element name="localisation" minOccurs="0">
            <xs:complexType><xs:sequence>
              <xs:element name="geography" type="tgeography" maxOccurs="5"/>
            </xs:sequence></xs:complexType>
          </xs:element>
          <xs:element name="matters" minOccurs="0">
            <xs:complexType><xs:sequence>
              <xs:element name="matter" type="xs:string" maxOccurs="5"/>
            </xs:sequence></xs:complexType>
          </xs:element>
          <xs:element name="usages" minOccurs="0">
            <xs:complexType><xs:sequence>
              <xs:element name="usage" type="tuse" maxOccurs="3"/>
            </xs:sequence></xs:complexType>
          </xs:element>
          <xs:element name="prepositionalregime" minOccurs="0">
            <xs:complexType><xs:sequence>
              <xs:element name="preposition" type="xs:string" maxOccurs="5"/>
            </xs:sequence></xs:complexType>
          </xs:element>
          <xs:element name="outlines" minOccurs="0">
            <xs:complexType><xs:sequence>
              <xs:element name="outline" type="xs:string" maxOccurs="5"/>
            </xs:sequence></xs:complexType>
          </xs:element>
          <xs:element name="synonyms" minOccurs="0">
            <xs:complexType><xs:sequence>
              <xs:element name="synonym" type="xs:string" maxOccurs="15"/>
            </xs:sequence></xs:complexType>
          </xs:element>
          <xs:element name="antonyms" minOccurs="0">
            <xs:complexType><xs:sequence>
              <xs:element name="antonym" type="xs:string" maxOccurs="15"/>
            </xs:sequence></xs:complexType>
          </xs:element>
        </xs:sequence>
        <xs:attribute name="nmeaning" type="xs:decimal" use="required"/>
      </xs:complexType>

      <xs:complexType name="texpression">
        <xs:sequence>
          <xs:element name="phrases">
            <xs:complexType><xs:sequence>
              <xs:element name="phrase" type="xs:string" maxOccurs="5"/>
            </xs:sequence></xs:complexType>
          </xs:element>
          <xs:element name="meaningsexpression" type="tmeaning" maxOccurs="5"/>
        </xs:sequence>
      </xs:complexType>

      <xs:complexType name="tgeography">
        <xs:sequence>
          <xs:element name="zone" type="xs:string"/>
          <xs:element name="region" type="xs:string" minOccurs="0"/>
          <xs:element name="area" type="xs:string" minOccurs="0"/>
          <xs:element name="note" type="xs:string" minOccurs="0"/>
        </xs:sequence>
      </xs:complexType>

      <xs:complexType name="texample">
        <xs:sequence>
          <xs:element name="author" type="xs:string" minOccurs="0"/>
          <xs:element name="source" type="xs:string" minOccurs="0"/>
          <xs:element name="date" type="xs:date" minOccurs="0"/>
          <xs:element name="page" type="xs:decimal" minOccurs="0"/>
          <xs:element name="text" type="xs:string"/>
        </xs:sequence>
      </xs:complexType>

      <xs:complexType name="tuse">
        <xs:sequence>
          <xs:element name="intention" type="xs:string" minOccurs="0"/>
          <xs:element name="collective" type="xs:string" minOccurs="0"/>
          <xs:element name="speech" type="xs:string" minOccurs="0"/>
        </xs:sequence>
      </xs:complexType>
    </xs:schema>

5. MOFDEM versus TEI

The following is a brief comparison of the proposed model, MOFDEM, and the TEI guide for the encoding of printed dictionaries.

TEI defines two different elements for the dictionary entries: <entry>, which coincides with the conventional entries in most dictionaries, and <entryfree>, which uses the same elements but allows them to be combined in a freer way. The guide does not clearly establish when to use one or the other option, but recommends the use of <entry>. Another label is also defined, <superentry>, which allows for grouping of a series of homographs. The model proposed defines one sole label for the entries (<Entry>).
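To make the single-entry layout concrete, the following Python sketch parses a small hand-written instance and lists the definitions stored under a headword. It is a simplified illustration, not a schema-valid document: the target namespace is omitted, the element names follow the schema listing of Section 4 as printed, the `code` and `class` values on `<category>` are hypothetical placeholders, and the helper `definitions_for` is an assumption introduced here. The sample data is the avería entry discussed in this section.

```python
import xml.etree.ElementTree as ET

# Hand-written sample instance (no namespace, hypothetical category codes).
SAMPLE = """\
<dictionary>
  <title>Sample dictionary</title>
  <entry>
    <words><word>avería</word></words>
    <article narticle="1">
      <head>avería</head>
      <meanings>
        <meaning nmeaning="1">
          <definition>Casa o lugar donde se crían aves.</definition>
          <grammarcategories>
            <category code="1" class="noun">f.</category>
          </grammarcategories>
        </meaning>
        <meaning nmeaning="2">
          <definition>Averío.</definition>
          <grammarcategories>
            <category code="1" class="noun">f.</category>
          </grammarcategories>
        </meaning>
      </meanings>
    </article>
    <article narticle="2">
      <head>avería</head>
      <etymology>Del ár. al- awariyya, las mercaderías estropeadas, probablemente a través del cat. avaria.</etymology>
      <meanings>
        <meaning nmeaning="1">
          <definition>Daño que padecen las mercaderías o géneros.</definition>
          <grammarcategories>
            <category code="1" class="noun">f.</category>
          </grammarcategories>
        </meaning>
      </meanings>
    </article>
  </entry>
</dictionary>
"""


def definitions_for(root, headword):
    """Return (article number, meaning number, definition) for a headword."""
    results = []
    for entry in root.iter("entry"):
        words = [word.text for word in entry.findall("words/word")]
        if headword not in words:
            continue
        for article in entry.findall("article"):
            for meaning in article.findall("meanings/meaning"):
                results.append((
                    article.get("narticle"),
                    meaning.get("nmeaning"),
                    meaning.findtext("definition"),
                ))
    return results


root = ET.fromstring(SAMPLE)
for narticle, nmeaning, definition in definitions_for(root, "avería"):
    print(f"article {narticle}, meaning {nmeaning}: {definition}")
```

A single `<entry>` holds both articles, so a lookup by headword returns the meanings of every etymological block in one pass.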
As the Spanish language does not present the same range of homograph cases as the English language, we do not need two different types of entries.

Within each entry, TEI proposes using the labels <hom>, grouping together the information relative to a homograph, and <sense>, grouping together all the information relative to each sense; these groupings can give rise to various combinations. The proposed model instead uses a single tag for carrying out the groupings, labelled <Article>. The articles are in turn made up of a series of accepted meanings, <Meaning>, determined by the different definitions of the word, and of a series of expressions that include the word, <Expressions>. Discrimination by etymology indirectly groups together a series of accepted meanings that have a strong semantic relationship with each other. For example, the Spanish word avería has two well differentiated semantic blocks, coinciding with the different etymologies of avería. Each semantic block can include more than one definition.

avería1. 1. f. Casa o lugar donde se crían aves. 2. Averío.
avería2. Del ár. al- awariyya, las mercaderías estropeadas, probablemente a través del cat. avaria. 1. f. Daño que padecen las mercaderías o géneros. 2. Derecho de avería. 3. fam. Azar, daño o perjuicio. 4. Daño que impide el funcionamiento de un aparato, instalación, vehículo, etc. 5. Mar. Daño que por cualquier causa sufre la embarcación o su carga.

TEI recommends that the following information should be provided for each article:

(a) Information on the form <form>, orthography <orth>, pronunciation <pron>, hyphenation <hyph>, syllables <syll> and stress <stress>. In Spanish dictionaries part of this information might not make much sense since, for example, the pronunciation of Spanish follows general rules for all words rather than being specific to each word, as is the case in English.

(b) Grammar information: gender <gen>, number <number>, case <case>, person (1st, 2nd, 3rd, etc.) associated with a conjugated form <per>, tense <tns>, and mood <mood>. One of the premises of this model is not to include this type of information in the dictionary schema. This information can be obtained from a linguistic tool dealing with morphology and, therefore, it can be made available for all the dictionary entries without needing to be stored in the information base.

(c) Definitions of translations to other languages <trans>. The scope of this work is mono-lingual dictionaries and, therefore, this type of information is not included in the proposed schema.
(d) Etymological information: reference language <lang>, date <date>, reference words or phrases <mentioned>, phrase or word used to provide a definition for another word or phrase <gloss>, pronunciation <pron>, and usage <usg>. TEI provides a very detailed description on the etymological information that could be included in this work; however, we have decided not to detail the <Etymology> field due to a lack of structure of the lexicographic sources and to the lower performance of these details in applications for the processing of natural language. (e) Examples: real use examples <q>, phrases or quotes attributed by the narrator or author to some external organism of the text <quote>, quote from another document together with a bibliographic reference of its source <cit>. In this study for this label we propose more detail, allowing a sectorial and detailed search in this type of bibliographical information. (f) Usage information, indicated in the type of the element attribute <usg>, whose typical labels are: time for temporary use (archaic, obsolete, etc.), reg for register (argot, formal, taboo, ironic, funny, etc.), style (literal, figurative, etc.), acc for connotation effects (scornful, offensive, etc.), dom for matter (astronomy, philosophy, etc.), and geo for geographical use. The scheme proposed by this study has geographical marks <Geography> and techniques <Matters> that are independent of the linguistic use of the word <Usage>. The first two elements, geography and matters, restrict the universality of the word to a specific geographical area and to a technical or knowledge field, while the use reflects the linguistic context in which the word is used. 6. Justification of the proposed model Our model is characterised by containing those items from a dictionary of meanings that are less related to the lexical units and by using linguistic tools to include this information-headwording, flexion and verbal

tense, syllables and hyphenisation, and pronunciation. In this way the information provided by these tools is generalised to all the entries and is not limited to the irregular information an entry may have. Therefore, the following information has been deliberately eliminated from the model: gender and number, verbal tense and flexions in general, syllables, hyphenisation and pronunciation.

The advantage of including these linguistic tools is that the solution they provide extends to all the words of a language and not only to the entries. For example, in a dictionary that is XML encoded according to the TEI, the pronunciation would refer exclusively to the entry. By including linguistic tools, however, we could additionally have access to the pronunciation of all the flexional forms of that word, greatly enriching the dictionary.

What result do we obtain from a search in several of the main Spanish language electronic dictionaries for the words perrillo, pececito or precomiéndoselas? Simply, the electronic dictionary does not provide the appropriate entry, a statement that can be easily checked by using many of the electronic dictionaries available. This is because electronic dictionaries are applications limited to searching information represented in accordance with one of the proposed codification models; this information is taken directly from the printed dictionaries and is therefore subject to the limitations of paper.

7. Validation

In order to verify the validity of the proposed scheme we have used it to generate an XML document with more than 67,000 entries from one of the main Spanish dictionaries. The validity was checked by means of random selections.
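A check of this kind on a generated XML document can be sketched in a few lines. The example below is only an illustration under stated assumptions, not the authors' tooling: it is written in Python, the container element names Dictionary, Entry, Headword, Sense and Expression are hypothetical, and the sample document is a tiny hand-made one. It counts how often each element occurs in the accepted meanings of the word and, separately, in the accepted meanings of its expressions.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny hand-made sample. Dictionary, Entry, Headword, Sense and Expression
# are hypothetical names chosen for this sketch; the expression sense is
# placeholder text, not taken from any real dictionary.
SAMPLE = """
<Dictionary>
  <Entry>
    <Headword>avería</Headword>
    <Sense>
      <Definition>Daño que padecen las mercaderías o géneros.</Definition>
      <Example>(hypothetical example text)</Example>
    </Sense>
    <Sense>
      <Matter>Mar.</Matter>
      <Definition>Daño que por cualquier causa sufre la embarcación o su carga.</Definition>
    </Sense>
    <Expression>
      <Sense>
        <Usage>fam.</Usage>
        <Definition>(hypothetical definition of an expression)</Definition>
      </Sense>
    </Expression>
  </Entry>
</Dictionary>
"""


def count_elements(root):
    """Return (word_counts, expression_counts) of element tags inside senses."""
    word_counts = Counter()
    expr_counts = Counter()
    for entry in root.iter("Entry"):
        # Senses that are direct children of the entry belong to the word.
        for sense in entry.findall("Sense"):
            word_counts.update(child.tag for child in sense)
        # Senses nested inside an expression belong to that expression.
        for sense in entry.findall("Expression/Sense"):
            expr_counts.update(child.tag for child in sense)
    return word_counts, expr_counts


root = ET.fromstring(SAMPLE.strip())
word_counts, expr_counts = count_elements(root)

# Print one row per element name: word count, then expression count.
for tag in sorted(set(word_counts) | set(expr_counts)):
    print(f"{tag:<12}{word_counts[tag]:>6}{expr_counts[tag]:>6}")
```

Keeping the two counters separate makes it easy to see whether word senses and expression senses draw on the same set of elements, and an element that never appears in either counter would point to a part of the scheme that the conversion did not exercise.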
In order to achieve this we have identified the dictionary syntax, and we have developed an application that identifies its different elements and places them in their corresponding XML tags. A total of 125,292 accepted meanings and 18,406 expressions have been identified in the entries. Within the expressions, a further 19,887 accepted meanings have been processed. The processing of this dictionary has established, on the one hand, that no elements were left unidentified and, on the other hand, that all the elements defined have been used.

Once the XML document of the dictionary was obtained, various verifications were carried out. Initially, by using XSL, we obtained counters of the various elements corresponding to both the accepted meanings of the entry and the accepted meanings of the expression. In this way, we demonstrated that the accepted meanings of the word and those of the expression share their elements. Also by means of XSL, separate files were generated for the various elements, which made it much easier to verify that the entries had been processed correctly. Lastly, various entries were selected at random and their correct processing was verified manually (Table 1).

Table 1
Result of the processing of the dictionary

Element                  Word accepted meanings   Expression accepted meanings
<Definition>                             125292                          19887
<ScientificName>                           1749                            381
<Example>                                 49521                           7046
<Localisation>                             7811                            304
<Matter>                                  60816                           8334
<Usage>                                    5249                           5228
<Prepositionalregime>                      1503                              7
<Outline>                                    84                             42
<Synonym>                                 45465                            604
<Antonym>                                  4692                             22

8. Future work

The proposed scheme is only a first step towards building a final information system that supports an electronic dictionary. Nevertheless, there are important steps to be carried out in other essential parts that will

constitute this information system. These parts will make it possible to have an electronic dictionary able to provide a correct result for any type of word, with respect to flexion (Santana et al., 2003) or morpholexical relationship (Santana, Pérez, Carreras, & Rodríguez, 2004), and so to overcome the drawbacks of current electronic dictionaries. As far as electronic dictionaries are concerned, the system would bring important qualitative and quantitative improvements, as it would not be a simple transfer of the information existing in the printed dictionaries. To achieve this, the information system model will have to be developed further so as to allow the integration of the various existing linguistic tools with the information in the dictionary. In addition, we intend to develop interfaces that allow the information to be presented through various resources, mainly using XML (XPath, XQuery and XSL), Web, Java and WAP technologies.

9. Conclusions

The development of a common codification for different dictionaries is an extremely difficult task (Ide et al., 1992). This study attempts to provide a model that is useful for natural language processing and easy to use in the codification of information from various dictionaries of the Spanish language, excluding those items that are directly related to the lexical units. We have described a model, different from that proposed by the TEI (Sperberg-McQueen & Burnard, 2004, chap. 12), to define and structure the information included in electronic dictionaries, with specific rather than general purposes, and its viability has been tested with a leading Spanish dictionary. The TEI proposal for dictionaries consists of a large set of labels that can be used in different combinations depending on the needs of the encoder.
Our proposal presents a well defined set of labels whose only objective is to accommodate those dictionary items that are less related to the lexical units, so that electronic dictionaries can be used in natural language processing. The combination of linguistic tools with electronic dictionaries labelled in XML provides a new perspective in this area, with a series of advantages in the final product: access to the dictionary articles by means of any word of the written language (without the need to use the canonical form) and the ability to obtain the pronunciation, syllables and hyphenisation of any word (not only those of the entries expressed by the headword). The model is especially important as it is extendable to other languages and combines two parallel paradigms: extensible document markup and computational linguistics. This model will also be a very useful component in the construction of electronic dictionaries.

References

Amsler, R. A., & Tompa, F. W. (1998). An SGML-based standard for English monolingual dictionaries. In Fourth annual conference of the UW Center for the New Oxford English Dictionary (pp. 61-79). Waterloo, Ontario: University of Waterloo Center for the New Oxford English Dictionary.

Berger, D., Reitter, D., & Stede, M. (2002). XML/XSL in the dictionary: the case of discourse markers. In Proceedings of the 2nd workshop on NLP and XML (NLPXML-2002), 19th international conference on computational linguistics. Taipei, Taiwan. Available online at http://www.reitter-it-media.de/compling/papers/bergeretal_xmldiscmarkers_2002.pdf.

Bird, S., & Simons, G. (2002). Seven dimensions of portability for language documentation and description. In Proceedings of the workshop on portability issues in human language technologies, Third international conference on language resources and evaluation. Las Palmas, Canary Islands. Available online at http://arxiv.org/abs/cs/0204020.

Ide, N., Kilgarriff, A., & Romary, L. (2000). A formal model of dictionary structure and content. In Euralex 2000 Proceedings (pp. 113-126). Available online at ftp://ftp.itri.bton.ac.uk/reports/itri-00-30.pdf.

Ide, N., & Véronis, J. (1995). Encoding dictionaries. In N. Ide & J. Véronis (Eds.), The text encoding initiative: background and context (pp. 67-80). Dordrecht: Kluwer Academic Publishers.

Ide, N., Véronis, J., Warwick-Armstrong, S., & Calzolari, N. (1992). Principles for encoding machine readable dictionaries. In Fifth Euralex international congress. Available online at http://www.up.univ-mrs.fr/veronis/pdf/1992euralex.pdf.

Lee, D., & Chu, W. W. (2000). Comparative analysis of six XML schema languages. ACM SIGMOD Record, 29(3), 76-87. Available online at http://pike.psu.edu/publications/sigmod-record-00.pdf.

Santana, O., Pérez, J., Carreras, F., Hernández, Z., & Rodríguez, G. (2003). The Spanish morphology in Internet. In Web engineering, Lecture notes in computer science (Vol. 2722). Springer-Verlag. ISBN 3-540-40522-4, ISSN 0302-9743. Available online at http://www.gedlc.ulpgc.es/art_ps/art39.pdf.

Santana, O., Pérez, J., Carreras, F., & Rodríguez, G. (2004). Suffixal and prefixal morpholexical relationships of the Spanish. Lecture notes in artificial intelligence (Vol. 3230). Springer-Verlag. ISSN 0302-9743. Available online at http://www.gedlc.ulpgc.es/art_ps/art45.pdf.

Sperberg-McQueen, C. M., & Burnard, L. (1994). Guidelines for electronic text encoding and interchange. Text encoding initiative, Chicago and Oxford.

Sperberg-McQueen, C. M., & Burnard, L. (2004). The XML version of the TEI guidelines: print dictionaries. Text encoding initiative. Available online at http://www.tei-c.org/p4x/di.html.

Tutin, A., & Véronis, J. (1998). Electronic dictionary encoding: customizing the TEI guidelines. In Euralex 1998 Proceedings. Available online at http://www.up.univ-mrs.fr/veronis/pdf/1998euralex.pdf.