UNIVERSIDADE TÉCNICA DE LISBOA
INSTITUTO SUPERIOR TÉCNICO

A Framework for Integrating Natural Language Tools

João de Almeida Varelas Graça (Licenciado)

Dissertation submitted to obtain the Master's degree in Engenharia Informática e de Computadores

PROVISIONAL DOCUMENT

February 2006

Abstract

Natural Language Processing (NLP) is a subfield of artificial intelligence and linguistics that studies the problems inherent to the processing and manipulation of natural language. NLP systems are typically characterized by a pipeline architecture, in which several NLP tools, connected as a chain of filters, apply successive transformations to the data that flows through the system. Usually, each tool is independently developed by a different researcher whose focus is on his or her own research problem rather than on the future integration of the tool into a broader system. Hence, when integrating such tools, one may face problems that lead to information losses, such as: (i) the output of a tool consists of the data it has acted upon and usually does not contain all the input data; this raises a problem if the discarded information is required by a tool that appears at a later stage of the pipeline; (ii) each tool has its own input/output format, so conversions between data formats may be needed when a tool consumes data produced by another one, and this conversion may not be possible if the descriptive power of the formats is distinct; (iii) the formats used by different tools do not establish relations between the input/output data; these relations are useful for aligning information produced at different levels and for avoiding the repetition of common data across them. These problems make the reuse of NLP tools in distinct NLP systems a cumbersome task. This dissertation proposes a solution to these problems, using a client-server architecture. The server acts as a blackboard where all tools add and consult data. In our solution, a tool adds a layer of linguistic information over a data signal, and the system maintains the cross-relations between the existing layers of data. The data is kept in the repository under a conceptual model that is independent of the client tools and allows the representation of a broad range of linguistic information.

The tools interact with the repository through a generic remote API, which allows the creation of new data and the navigation through all the existing data. Moreover, this work provides libraries, implemented in several programming languages, that abstract the connection and communication protocol details between the NLP tools and the server. These libraries also offer additional levels of functionality that simplify the creation of NLP tools.

Resumo

Natural Language Processing (NLP) is a branch of Artificial Intelligence that studies the problems inherent to the processing and manipulation of natural language. NLP systems are normally characterized by a pipes and filters architecture, in which a set of NLP tools applies a succession of transformations to the data that flows through the system. Each tool is normally developed by a single researcher, whose concern centres on his or her own problem rather than on the integration of the tool into future systems. When different tools are integrated to create a system, the following problems typically arise, and may lead to information loss: i) the output of each tool consists of the data it has altered, and may not contain all of the input data; this may cause problems if the discarded information is needed by tools that appear later in the system; ii) each tool has its own data format, so the different formats must be converted to allow the tools to communicate with each other; additionally, the expressiveness of each format may differ, in which case the conversion may not be possible; iii) the different tools do not establish the relations between input and output data that are necessary to align the data produced by the various tools and to avoid the replication of information. These problems hinder the reuse of NLP tools in different NLP systems. This work presents a solution to these problems, consisting in the use of a client-server architecture. The server is a repository used by the tools to add and consult information. Each tool adds a level of information over a data signal, and the system maintains the relations between the various levels.

The data is kept in the repository under a conceptual model that is independent of the various tools and allows the representation of several types of linguistic information. The tools interact with the server through a remote interface that allows them to add data and to navigate through all the existing data. This work also offers libraries, implemented in several programming languages, that abstract the details of the connection and communication protocol between client and server. These libraries offer additional functionality to the tools, which simplifies their creation.

Keywords

Natural Language Processing systems
Natural Language Processing tools integration
Repository
Linguistic annotation
Data lineage
Information loss

Acknowledgments

I would like to express my gratitude to everyone who helped me during the development of this dissertation, provided me with their support, and endured my constant stress and bad temper. Without them this work would not have been possible. I would like to thank my supervisor, Professor Nuno Mamede, for all his guidance over these years, his constant advice and corrections, and his never-ending patience towards my doubts and requests. I also cannot forget the support provided by my co-supervisor, João Dias Pereira. Both helped me immensely in the development of this dissertation. My thanks extend to the INESC-ID Spoken Language Systems Lab team, who were extremely welcoming and cooperative, and particularly to those who worked more closely with me in this project: David Matos, Luísa Coheur, Ricardo Ribeiro, Joana Paulo, Fernando Baptista and Paula Vaz. I thank them for their suggestions and support. Thanks also to André Nascimento for his help with the proofreading of the dissertation's text. To all my friends who have always been there for me, even when things seemed to be going wrong: thank you for your words of comfort and motivation. And last, but certainly not least, I thank my family for their unconditional support, not only throughout this project, but throughout my entire life.

Lisboa, February 22, 2006
João de Almeida Varelas Graça

Contents

1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Proposed Solution
    1.3.1 Architecture
    1.3.2 Conversions between data formats
    1.3.3 Data lineage
    1.3.4 Summary
  1.4 Requirements
    1.4.1 Conceptual Model Requirements
    1.4.2 System Requirements
  1.5 Contributions
  1.6 Dissertation Structure

2 Related Work
  2.1 Introduction
  2.2 AGTK
  2.3 ATLAS
  2.4 EMDROS

  2.5 NLTK
  2.6 GATE
  2.7 Festival
  2.8 Summary

3 Conceptual Model
  3.1 Introduction
  3.2 Conceptual Model Entities
    3.2.1 Repository
    3.2.2 Data
    3.2.3 SignalData, Index and Region
    3.2.4 Analysis
    3.2.5 Segment and Segmentation
    3.2.6 Relation
    3.2.7 Classification
    3.2.8 CrossRelation
  3.3 Conceptual Model API
  3.4 Summary

4 Architecture
  4.1 Introduction
  4.2 Server architecture
    4.2.1 Server Architecture Description
      4.2.1.1 Data Layer

      4.2.1.2 Service Layer
      4.2.1.3 Remote Interface
    4.2.2 Server Architecture Implementation
      4.2.2.1 Data Layer
      4.2.2.2 Service Layer
      4.2.2.3 Remote Interface
    4.2.3 Server Architecture Interaction Examples
  4.3 Client Library
    4.3.1 Client Library Description
      4.3.1.1 Client Stub Layer
      4.3.1.2 Conceptual Model Layer
      4.3.1.3 Extra Layers
    4.3.2 Java Implementation
      4.3.2.1 Client Stub
      4.3.2.2 Conceptual Model
      4.3.2.3 Extra Layers
    4.3.3 Solution Validation Tools
      4.3.3.1 Simple NLP system
      4.3.3.2 Concurrent processing
  4.4 Summary

5 Conclusion and Future Work
  5.1 Summary
  5.2 Future work

  5.3 Contributions

A Conceptual Model API

B Repository Server
  B.1 Data Layer API
  B.2 Conceptual Model Persistent Format
  B.3 Data Transfer Objects
  B.4 Remote Interface API

List of Figures

1.1 Example of an NLP system using a pipes and filters architecture
1.2 Each tool passing all input data to output
1.3 Tool consuming data from different tools
1.4 Component combining data from different tools
1.5 An NLP system using the shared repository
1.6 External components converting data formats
1.7 Identification of words in a text
1.8 Ambiguous segmentation example
1.9 Classification ambiguity example
2.1 AGTK internal structure
2.2 Annotation Graphs example
2.3 ATLAS region utilization example
2.4 ATLAS Children utilization example
2.5 EMDROS database example
2.6 GATE annotation graph example
2.7 Utterance example
3.1 Conceptual Model class diagram
3.2 Repository class diagram

3.3 SignalData class diagram
3.4 TextSignalData class diagram
3.5 Analysis class diagram
3.6 Segmentation and Segment class diagram
3.7 Relation class diagram
3.8 Classification class diagram
3.9 CrossRelation class diagram
3.10 Iterator class diagram
3.11 Type and Description class diagram
3.12 Complete conceptual model class diagram
4.1 Client-server architecture
4.2 Server layers
4.3 DTO class diagram
4.4 Client Library internal structure
4.5 Extra layers interfaces
4.6 Examples notation

List of Tables

2.1 System requirements summary
2.2 Conceptual model requirements summary


1 Introduction

1.1 Motivation

Natural Language Processing (NLP) is a subfield of artificial intelligence and linguistics that studies the problems inherent to the processing and manipulation of natural language. It is devoted to making computers understand statements written in human languages. Several NLP systems have been developed to solve some of the major NLP tasks, such as question answering and dialogue systems. Usually, NLP systems are composed of several NLP tools, each in charge of a specific linguistic task, such as word sense disambiguation or syntactic analysis. In most systems, tools are executed in a pipeline manner, where each tool performs one processing step. It is common for these tools to use results from previous processing steps and to produce new linguistic information that can be used by the following steps. For instance, a word sense disambiguator may combine the output of a word tokenizer and the output of a part-of-speech tagger. Tools are independently developed by distinct individuals whose focus is on their own problem rather than on the future integration of the tool into a broader system. Tool data formats are developed according to the tools' requirements, and normally the output of each tool does not contain all the input information, so some data is discarded between the input and output of each tool. Hence, when integrating such tools, several problems arise, mainly related to the following: (i) how the tools communicate with each other and (ii) what kind of information flows between them. At the Spoken Language Systems Lab (L2F), where this work was developed, several NLP systems have already been created.

Whenever tools were integrated to compose a system, most of the detected problems were related to the information flow between those tools, which can lead to information losses. These problems are:

Architectural problems - the information discarded along the system may be required further ahead by other tools;
Conversion between data formats - conversions are necessary between the different data formats. If the expressiveness of each format is different, some formats may not be completely mappable into others.

Besides these problems, which lead to information losses, there is another problem concerning the data: how to maintain the data lineage between information produced by the different tools composing an NLP system. When viewing a tool's output as a layer of information over a primary data source, and considering that layers are normally related to each other, it is desirable to maintain relations between those layers. First, this preserves the data lineage between the different levels, allowing the navigation through related linguistic information produced by different tools. Secondly, tools can reference data from other layers, avoiding the repetition of common data. These types of relations are called cross-relations because they span linguistic information layers. Finally, a last problem concerns the fact that each NLP tool programmer usually develops his or her own data model to represent the linguistic information, along with input/output facilities for that data model. Since these different data models normally represent similar information, they tend to be very similar, so the redefinition of such models represents a waste of time.

Figure 1.1: Example of an NLP system using a pipes and filters architecture.

Figure 1.1 shows a simple NLP system composed of three components. The system's initial input is a text file. Component A performs the tokenization and part-of-speech tagging of the text, component B performs a post-morphologic analysis, and component C performs a syntactic analysis. In this example, two conversions between data formats are required: one between components A and B, and another between components B and C. This small system has the following problems:

Component B receives the results from component A, and one of its tasks is to separate the contractions used in the text. From that point on, the initial state of the text is lost.
The syntactic parser (C) only uses the morphologic features of the words identified by component A. Its output is a set of syntactic trees which hold no reference to the original words in the text.

At this stage, the system has already lost the information about which words belong to the text, information that might be required by components added to the system in the future. Moreover, problems may arise in the conversion between data formats if their expressiveness differs. For example, a phonetic transcription of the original text would need both the data of the syntactic analysis, to select a possible interpretation of the words of the text, and the original words themselves. Since part of this information was lost throughout the system, it would no longer be available, so the new component would require the output of all the tools as input and would have to merge the information between them. However, if the information produced by those tools were aligned, these problems could be overcome by following the relations between the different levels of data.
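To make the information loss concrete, the following sketch expresses the pipeline of Figure 1.1 as plain function composition (the types and names are hypothetical illustrations, not part of any system described here). Each stage consumes only the previous stage's output type, so no downstream component can reach back to the original text:

```java
// Hypothetical illustration of the lossy pipeline in Figure 1.1.
import java.util.List;

record TaggedToken(String form, String posTag) {}             // output of A
record MorphToken(String form, List<String> features) {}      // output of B: contractions split
record SyntaxTree(String label, List<SyntaxTree> children) {} // output of C: no word references

interface ComponentA { List<TaggedToken> run(String text); }
interface ComponentB { List<MorphToken> run(List<TaggedToken> tokens); }
interface ComponentC { SyntaxTree run(List<MorphToken> tokens); }

class Pipeline {
    static SyntaxTree analyse(String text, ComponentA a, ComponentB b, ComponentC c) {
        // After a.run(text) the raw text is discarded; after c.run(...) even
        // the word forms are gone, so a later phonetic-transcription component
        // could recover neither.
        return c.run(b.run(a.run(text)));
    }
}
```

A phonetic-transcription stage appended to this chain would need both the SyntaxTree and the original String, but the composition only hands it the SyntaxTree.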

The rest of this chapter is organized as follows. Section 1.2 describes the objectives of this dissertation. Then, Section 1.3 proposes a solution to the detected problems, in the form of a shared repository for NLP tools. Section 1.4 defines a set of requirements that a solution for the detected problems must fulfil, and Section 1.5 describes the major contributions of this work. Finally, Section 1.6 describes the structure of this thesis.

1.2 Objectives

The main objective of this work is to build an NLP framework for the creation of NLP systems that:

Avoids information losses between the tools composing a system;
Simplifies the implementation of new NLP tools by providing general input/output facilities and a data model to represent linguistic information;
Keeps the data from different tools aligned, allowing the navigation through related data produced by different tools.

1.3 Proposed Solution

This section first describes some alternatives for solving the architectural and data-format problems, and our solution to each one of them. It then presents an objective that arises from the solution to those problems, concerning the data lineage between the information produced by different tools. Finally, it presents the overall solution, composed of the solutions to each individual problem.

1.3.1 Architecture

A possible approach to the architectural problem is to guarantee that the tools produce all the information required by the next steps of the system. This simple approach only works when all of the system's tools are known in advance. Moreover, it demands that each tool produce extra information, and it is not extensible, because the addition of new tools may force changes to existing tools. A generalization of the previous alternative consists in guaranteeing that each tool passes all input information to its output (Figure 1.2). This strategy has the following problems: firstly, the tools must know how to manage a large amount of data which may not be related to the tools themselves.

Secondly, each tool may have to load and parse a large amount of data upon its initialization and, consequently, save a large amount of data when terminating. Finally, it complicates handling data from several tools simultaneously. For example, in Figure 1.3, component D, a morphologic disambiguation tool, requires the output of components B and C, both of which are part-of-speech taggers. In this example, component D has the responsibility of merging the common data from its input (text and dataA), besides performing its own processing. This merging operation is difficult and depends on the components used to produce the input, which leads to modifications in component D whenever component B or C is changed.

Figure 1.2: Each tool passing all input data to output.

Figure 1.3: Tool consuming data from different tools.

Another strategy, which does not impose any extra effort on tools, consists in having separate components that know how to combine information from different tools (see Figure 1.4). Nevertheless, this strategy requires building new components every time a new combination of tools is available. Moreover, information intersection is not a trivial task and requires considerable effort. This strategy was proposed in Galinha (de Matos et al., 2002, 2003).

Figure 1.4: Component combining data from different tools.

Our solution consists in shifting the architectural pattern from pipes and filters to a client-server style, where the server follows a shared data style. This solution proposes a shared repository where tools add new layers of information without changing the existing ones (see Figure 1.5). All linguistic information is available in the server; this way, tools only have to select the required information, avoiding the loading, parsing, and saving of extra data. Since the server follows a shared data style, information loss is avoided, because all information is kept in the repository and never removed. Moreover, with this solution the information-merging problem is avoided, since each tool only uses the layers it requires as input, adding the newly produced layer at the end.

Figure 1.5: An NLP system using the shared repository.
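A minimal sketch of the intended interaction with the repository of Figure 1.5 follows (the Repository and Layer interfaces and all method names are illustrative assumptions, not the actual interface defined later in this dissertation): each tool selects only the layers it needs and contributes a new one, so existing information is never modified or discarded.

```java
import java.util.List;

// Hypothetical shared-repository interface; illustrative names only.
interface Repository {
    Layer getLayer(String producerId);          // select a layer by the tool that produced it
    void addLayer(String producerId, Layer l);  // contribute a new layer; existing layers are untouched
}

interface Layer { List<String> elements(); }

class PosTaggerTool {
    void process(Repository repo) {
        Layer tokens = repo.getLayer("tokenizer");  // consume only what is needed
        Layer tags = tag(tokens);                   // produce part-of-speech annotations
        repo.addLayer("pos-tagger", tags);          // add a new layer; nothing is removed
    }

    private Layer tag(Layer tokens) {
        return tokens; // placeholder: real logic would build a new layer of tags
    }
}
```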

1.3.2 Conversions between data formats

A strategy to manage the different formats used by each tool consists in having specific components that are responsible for performing the data conversions (see Figure 1.6). Even if the creation of such components is simplified, for example by imposing that each tool consumes and produces data in XML and using an XSLT engine to perform the conversions, it is still necessary to define an XSLT stylesheet for each conversion.

Figure 1.6: External components converting data formats.

This approach has two drawbacks. The first is the need to define an XSLT stylesheet for each data conversion. This may not be a trivial task, and if n distinct components exist, the number of required stylesheets grows at a rate of n*n. The second drawback concerns the expressiveness of each data format: this approach assumes that the content of each format is somehow describable in all the other formats, which may not be true; in that case there will be information losses when the transformation is performed. This was the approach followed in (de Matos et al., 2002, 2003). Our solution consists in defining a conceptual model capable of representing a broad range of linguistic information produced by NLP tools. The information stored in the shared repository is described using this conceptual model. Moreover, the tools may also use this model to represent their own information, avoiding the need to perform any conversions. If the tools are not able to use this model, the model can be used as an interlingua between tools, so the number of necessary conversions is reduced from a rate of n*n to a rate of n. Since the model is able to represent a broad range of linguistic information, no information is lost due to lack of expressiveness, because every data format should be mappable into this model.
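The reduction from n*n to n can be sketched as follows (a hypothetical interface; the dissertation's actual model is defined in Chapter 3): each tool format needs a single adapter to and from the common model, instead of one converter per ordered pair of formats.

```java
// Hypothetical adapter interface; one implementation per tool format replaces
// one converter per ordered pair of formats (n adapters instead of n*(n-1) converters).
interface FormatAdapter<F> {
    ConceptualDocument toModel(F toolSpecificData);   // tool format -> conceptual model
    F fromModel(ConceptualDocument doc);              // conceptual model -> tool format
}

class ConceptualDocument { /* layers, segments, relations, ... */ }
```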

1.3.3 Data lineage

An approach to the data lineage problem is adding semantic information to each data layer in the shared repository. For example, supposing a morphologic layer and a syntactic layer, the layers' semantics would force the syntactic layer to use elements of the morphologic layer. This way, relations between layers would be established by the layers' semantics. However, with this approach it is difficult to have several layers of the same semantic type. It also restricts the types of layers that can be defined, as well as the content of each layer. These restrictions are not desirable, since we require the solution to be generic enough to allow the creation of any NLP system, composed of any type of tools. Since the information produced by all NLP tools will be kept in the same repository, represented under the same conceptual model, our solution consists in extending the conceptual model to allow the representation of cross-relations between the linguistic information belonging to different layers. Tools create cross-relations between their input/output data when creating the entities of the conceptual model.

1.3.4 Summary

The solutions previously proposed for each of the problems are not independent from each other. The overall solution consists in using a client-server architecture, where the server is a repository of linguistic information and primary data sources represented under a conceptual model, and the clients are NLP tools. All information produced by the clients is kept in the repository under a specific, univocally identified layer. Each client can select information from the repository by selecting specific layers, and navigate through information from different layers using the cross-relations existing between them. Furthermore, besides the described server, our proposed solution contains the definition of client libraries. A client library has the objective of simplifying the creation of NLP tools by abstracting the connection and communication protocol details between the NLP tools and the server. Moreover, each client library provides several layers of functionality that simplify the use of the server and therefore the implementation of new tools.

For example, the client library can provide an implementation of the conceptual model where each element acts as a proxy for the corresponding server element. Currently, there is a version of the client library for the Java programming language.
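A minimal sketch of this proxy idea, under assumed names (SegmentProxy, RemoteRepository, and fetchFeatures are illustrative, not the library's real API): the client-side object holds only the identifier of the corresponding server element and fetches its data on demand, so the tool never deals with the remote protocol directly.

```java
import java.util.Map;

// Hypothetical remote interface exposed by the repository server.
interface RemoteRepository {
    Map<String, String> fetchFeatures(long elementId);  // one remote call per element
}

// Client-side proxy: mirrors a server-side segment, loading its data lazily.
class SegmentProxy {
    private final long id;
    private final RemoteRepository server;
    private Map<String, String> features;               // cached after first access

    SegmentProxy(long id, RemoteRepository server) {
        this.id = id;
        this.server = server;
    }

    Map<String, String> features() {
        if (features == null) {
            features = server.fetchFeatures(id);        // transparent remote access
        }
        return features;
    }
}
```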

1.4 Requirements

This section defines a set of requirements for a solution that supports and simplifies the integration of independently developed NLP tools into NLP systems. These requirements were used to validate the proposed solution, and are the following:

No information should be lost between the tools in an NLP system;
Tools should only have to produce directly related information;
The solution should simplify the creation of new tools by providing an input/output interface, which handles the loading and saving of the data used by a tool, and by providing a data model that each tool can use to represent linguistic information;
The solution should minimize the number of conversion components required to build an NLP system when integrating existing NLP tools that do not comply with the system's model;
The provided interface should allow the navigation between information produced by different NLP tools.

To achieve these generic requirements, we defined two groups of requirements that the solution must fulfil:

Conceptual Model Requirements - the requirements for a conceptual model capable of representing and relating a broad range of linguistic information, described in Subsection 1.4.1;
System Requirements - the requirements of the underlying system, related to the interaction between the system and the NLP tools, described in Subsection 1.4.2.

1.4.1 Conceptual Model Requirements

The conceptual model's main requirement is that it must be able to represent a broad range of linguistic information produced by different NLP tools. Furthermore, the conceptual model must be extensible, because it is impossible to foresee all kinds of linguistic information that may appear in the future. We begin by distinguishing two conceptually different kinds of information that the conceptual model must represent:

Primary data sources, such as a text or a speech signal;
Linguistic information produced by NLP tools over primary data sources or over previously defined linguistic information.

We also identify four types of actions that NLP tools may perform:

Creation and editing of primary data sources, for example, the incremental creation of a new primary data source containing the phonetic transcription of the text belonging to another primary data source. This newly created data source can be the target of the linguistic information of other tools;
Identification of linguistic elements in a primary data source, for instance, the segmentation of a sentence into words;
Creation of relational information between linguistic elements, such as the relation between a verb and the corresponding subject;
Assignment of characteristics to a linguistic element or a relation, for example, the morphological features of a word.

Each NLP tool may produce several types of information at the same time. The linguistic information generated by an NLP tool is normally derived from linguistic information created by other tools. For example, a part-of-speech tagger will use the segments produced by a tokenizer and add morphologic information to those segments, and a morphological disambiguator may use the classifications produced by several part-of-speech taggers to select the most appropriate classification. The conceptual model must be able to represent several layers of both primary data sources and linguistic information. We have defined the following requirements regarding these two types of information:

The conceptual model must be able to represent any kind of primary data source, such as text, speech, video, or any combination of these;
The conceptual model must support the creation and editing of primary data sources;
All linguistic information (except primary data sources) produced by an NLP tool must be associated with a single layer;
The conceptual model must allow the selection of linguistic information through the identification of the layer that contains it;
Each layer is associated with the identification of the tool that produced it.

The last three requirements are necessary to simplify the identification of information inside the conceptual model. This way, all linguistic information is organized into layers identified by the tool that produced it. The conceptual model must represent the three types of linguistic information that each NLP tool can produce: (i) the identification of linguistic elements; (ii) the creation of relations between linguistic elements; (iii) the assignment of characteristics to linguistic elements and relations. Figure 1.7 shows an example of the identification of linguistic elements, namely the words of a text.

Figure 1.7: Identification of words in a text.

We have defined the following requirements for the representation of those linguistic elements:

The model must be able to represent ambiguity in the identification of linguistic elements; for example, a compound term can be segmented as a single segment containing the whole term or as one segment per word. Figure 1.8 shows an example of segmentation ambiguity;
The model must be able to represent trees of linguistic elements, for example, syntactic trees;
The model must allow the creation of relations between linguistic elements from other layers;
It must be possible to represent classification ambiguity, which corresponds to associating disjoint sets of characteristics with the same linguistic element, for example, distinct morphological features for the same word. Figure 1.9 shows distinct grammatical categories for the word 'that';
The model must allow the association of characteristics with linguistic elements or relations from other layers.

Figure 1.8: Ambiguous segmentation example.

Figure 1.9: Classification ambiguity example.

Besides the representation of linguistic information produced by each NLP tool, the conceptual model has the following requirements concerning the relations between linguistic information from different layers:

The model must be able to represent relations between linguistic elements from different layers. These relations represent dependencies between layers of information and allow the navigation between layers;
The conceptual model must allow linguistic elements to reference data belonging to a primary data source without having to copy its value, thus avoiding the repetition of the same data in several layers;
The model must be able to represent data which may not exist in any primary data source, for example, the separation of contractions.

1.4.2 System Requirements

This subsection presents the general requirements of the system which are not related to its conceptual model, but to the interaction between the system and the NLP tools:

The system must simplify iterating over the data in the repository, e.g., all segments of a segmentation. This is required because iteration is the most common form of interaction between an NLP tool and its data;
The system must allow the selection of data based on a layer's identification; this way, an NLP tool only handles the data it requires;

The system must allow access to data from an unfinished Analysis, to allow the parallel processing of data. This way, an NLP tool may consume information that is being produced, at the same time, by another NLP tool;
The system must persist its data;
The system must interact with NLP tools written in any programming language.

1.5 Contributions

The main contributions of this work are:

The definition of a conceptual model that can be used as a common model by several NLP tools, is able to represent a broad range of linguistic information and different types of primary data, and can relate the represented information;
The implementation of a framework for NLP systems that reduces the effort needed to implement and integrate NLP tools.

1.6 Dissertation Structure

Chapter 2 describes several works that solve problems similar to the ones addressed in this thesis. These works consist of architectures that try to simplify the creation of NLP systems based on independent NLP tools, and of linguistic annotation frameworks, which try to abstract the logical structure of linguistic annotations, thus creating a conceptual core capable of representing all kinds of linguistic information. Chapter 3 presents our conceptual model, which fulfils the requirements described in Subsection 1.4.1, by defining its entities and their responsibilities. Chapter 4 describes the proposed architecture for this work, its general principles, and an implementation. Finally, Chapter 5 makes some remarks about the developed solution and presents some pointers for future work.

2 Related Work

2.1 Introduction

This chapter presents a description of some proposals that we compared during the development of this work. The shared data style architecture was analysed. It consists of a data store used by several tools. A shared data store may be a repository or a blackboard. The main difference is that, while the repository is passive and the clients access the data as required, the blackboard is active and defines the client execution order. However, the term blackboard is sometimes used in a broader sense, meaning any shared data style. The idea of the shared data style is based on a metaphor in which a group of experts gathers around a blackboard and works cooperatively to solve a problem, using the blackboard as the workplace for developing the solution. The blackboard architecture is not a new technology: the first blackboard system, the Hearsay-II speech understanding system (Hayes-Roth et al., 1978), was developed in 1978. Some blackboard characteristics are described in (Corkill, 1991), from which we present those that also apply to a repository:

Independence of expertise - Each module is a specialist in solving a certain aspect of the problem. A module does not depend on other modules to produce its contributions: it only has to find the information it needs on the blackboard and then proceeds with no assistance from other modules. New modules may be added to the blackboard without the need to change any existing modules. This corresponds to the notion of NLP tools developed by independent researchers, who are not able to foresee how their tool is going to be used in broader systems;

Diversity in problem-solving methods - In a blackboard system, each module is a black box. The blackboard has no knowledge about the processing that each module performs, which corresponds to our notion of having a generic system that every NLP tool can use;
Flexible representation of blackboard information - A blackboard system does not place any restriction on the information that a module can add. This property matches our desire for semantic independence of the data, for the sake of generality;
Common interaction language - All modules interacting with the repository must have a common understanding of the data the blackboard holds, so a common language must be available and should be used by every tool using the repository. If a tool could place information on the repository without following the common language, that information would be useless, since no other tool could use it properly. This property is achieved by the usage of a conceptual model, together with an interface that all tools must comply with;
Position metrics - A module should not have to scan the entire blackboard, which can be very big, to find the information it requires. One solution is to divide the blackboard into regions, each corresponding to a particular group of information, which in our case corresponds to the information produced by each NLP tool;
Incremental solution generation - Blackboard systems operate incrementally: each module contributes to the solution with whatever it finds appropriate, so it must be possible to represent partial solutions and unsolved ambiguities.

A related field, called linguistic annotation, was also analysed. It deals with tools and formats for creating and managing linguistic annotations.

Linguistic annotation covers any descriptive or analytic annotation over a raw source of data. For instance, the segmentation of a text into words, or the morphologic features of those words, are both linguistic annotations. The motivation for this field is the enormous need for manually annotated corpora in the NLP field. These annotated corpora validate results from NLP tools and help train statistical NLP tools. The creation of annotated corpora is a very demanding and expensive job. Moreover, as the diversity of annotations over a corpus increases, so does the value of that linguistic database. Therefore, there is a great focus on the reuse of linguistic annotation databases, which has led to the definition of several standards for linguistic annotation formats. However, the diversity of existing annotation formats makes the reuse of such databases more difficult. Due to these problems, the development of linguistic annotation frameworks became a priority. The main objective of these frameworks is to develop a logical level of linguistic annotations, independent of the annotations' physical format, which, together with an interface and modules to convert from/to the existing data formats, should promote the reuse of annotated corpora. The logical level, whose focus is on the logical structure of linguistic annotations rather than their content, should be able to represent all kinds of linguistic annotations existing in the various formats, which is precisely the purpose of our conceptual model. The Linguistic Annotation home page (Consortium, 2006) collects information not only about tools that have been widely used for constructing annotated linguistic databases, but also about the formats commonly adopted by such tools and databases. Another path of research regarding linguistic annotations is the definition of a set of requirements for annotation formalisms. We compared the requirements proposed in (Reidsma et al., 2004), and the requirements being defined by the International Organization for Standardization (ISO), which has formed a sub-committee (SC4) under technical committee 37 (TC37, Terminology and other language resources) to define a standard for linguistic annotation (Ide and Romary, 2001; Ide et al., 2003), against the ones we have defined. We compared two existing linguistic annotation frameworks against our requirements.

In Section 2.2 we compare the Annotation Graphs Toolkit (AGTK) (Maeda et al., 2002). The AGTK is an implementation of the Annotation Graphs formalism (Bird and Liberman, 1999), which is the most cited work in this area. Section 2.3 compares the ATLAS architecture (Bird et al., 2000; Laprun et al., 1999), a generalization of the Annotation Graphs formalism that allows the usage of multidimensional signals. We also compare several works regarding architectures that simplify the creation or integration of NLP tools towards their usage in NLP systems. In Section 2.4 we compare EMDROS (Petersen, 2004), an open-source text database engine for the analysis and retrieval of analysed or annotated text. We continue in Section 2.5 by comparing the Natural Language Toolkit (NLTK) (Loper and Bird, 2002), a suite of Python libraries and programs for symbolic and statistical natural language processing. In Section 2.6 we compare GATE (Bontcheva et al., 2004), a general Java-based architecture for text engineering that promotes the integration of NLP tools by composing them into a pipes and filters architecture. Then, in Section 2.7, we compare the Festival speech synthesis system (Taylor et al., 1998; Black and Taylor, 1997), a general framework for building speech synthesis systems. Finally, in Section 2.8, we present a brief summary and some conclusions of this chapter.

2.2 AGTK

Annotation Graphs (Bird and Liberman, 1999) is a formal framework for representing linguistic annotations, based on the analysis of several existing annotation formats. This analysis led to the development of a conceptual core, called Annotation Graphs, that, according to Bird and Liberman (1999), can represent all kinds of linguistic annotations, thus serving as an interlingua between the different tools. The analysed annotation formats include:

TIMIT (Garofolo et al., 1986) - a corpus of read speech designed for the acquisition of acoustic-phonetic knowledge;

Partitur (Schiel et al., 1998) - the format of the Bavarian Archive for Speech Signals, built from the collective experience of a broad range of German speech databases;
CHILDES (MacWhinney, 1995) - a database of transcript data collected from children and adults who were learning foreign languages;
LACITO (Jacobson et al., 2001) - a collection of recorded and transcribed speech data of unwritten languages;
NIST UTF (NIST, 1998) - a universal transcription format developed by the US National Institute of Standards and Technology;
Switchboard (Godfrey et al., 1993) - a corpus of conversational speech containing several levels of distinct annotations;
MUC-7, Message Understanding Conference (Hirschman and Chinchor, 1997) - defines a format for representing linguistic annotations used in information extraction, named entity recognition, and coreference resolution.

Since that analysis showed that the Annotation Graphs formalism supersedes all the other annotation formats, and since we compare the Annotation Graphs formalism against our requirements below, no further evaluation was performed on the analysed formats. The Annotation Graphs formalism focuses on annotating speech signals; however, its ideas can be extended to text. The Annotation Graphs Toolkit (AGTK) (Maeda et al., 2002) is a framework that provides an instantiation of Annotation Graphs and simplifies the creation of annotation tools by providing a set of interfaces to access the data represented in that formalism. The AGTK architecture consists of three modules (see Figure 2.1):

Figure 2.1: AGTK internal structure.

AGLIB - AGLIB provides an interface to the Annotation Graphs formalism, implemented in C++. It is composed of the core module, the AGAPI and TreeAPI (Cotton and Bird, 2002), and the ODBC and IO interfaces;
AG wrappers - The AG wrappers allow the connection to AGLIB from several programming languages; for example, the Java wrapper provides a Java interface to AGLIB;
Input/Output plugins - The Input/Output plugins allow the definition of separate modules which convert linguistic annotations between the Annotation Graphs formalism and other formats.

The architecture of the AGTK was developed in order to simplify the creation of annotation tools. An annotation tool reads the existing linguistic annotations using the interface provided by the system, performs the annotations, and then saves all the annotations. This approach is the same as the one illustrated in Figure 1.2 and presents the same problems. The Input/Output facilities of the AGTK impose that all annotations are loaded and saved every time, so a lot of data has to be parsed by each NLP tool, even if the tool does not use that data. Moreover, if an NLP tool uses the output of two NLP tools which were executed in parallel, it has to merge the annotations from both tools. Furthermore, an NLP tool cannot consume data that is being produced at the same time by another tool, so no parallel processing is possible.

Figure 2.2: Annotation Graphs example.

These problems exist even if the provided ODBC interface is used, because the ODBC interface only replaces file Input/Output with database Input/Output. The representation formalism of the AGTK is the Annotation Graphs formalism: an annotation graph is a directed acyclic graph where edges contain a set of features and nodes may contain a time offset. The time offsets are relative to the signal that is being annotated. A formal description of the annotation graphs model can be found in Bird and Liberman (1999). Figure 2.2 shows an annotation graph for a morphologic analysis. Each word is represented by two nodes, marking its beginning and its end. Each arc represents the grammatical classification of that word, or an indication of a white space. The AGTK defines an implementation of the Annotation Graphs formalism where several annotation graphs can be grouped, together with several Timelines, in an AGSet. A Timeline represents a group of Signals that share the same reference. Each node of the graph is represented by an Anchor element, which corresponds to a named offset into a Signal. The arcs of the graph are represented by Annotation elements, which have a specific type and a set of attribute-value pairs called Features. Some elements may have an attribute called Metadata, a set of attribute-value pairs that allows the addition of non-linguistic information to those elements.
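The structure just described can be sketched as follows (illustrative classes only, not the AGTK's actual API): nodes are anchors carrying an optional offset into the signal, and each arc connects two anchors and carries a type and a feature set, as in the morphologic analysis of Figure 2.2.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative rendering of an annotation graph; not the AGTK API itself.
record Anchor(String id, Double offset) {}  // offset may be null (unanchored node)
record Annotation(Anchor from, Anchor to, String type, Map<String, String> features) {}

class AnnotationGraph {
    final List<Anchor> anchors = new ArrayList<>();
    final List<Annotation> arcs = new ArrayList<>();

    public static void main(String[] args) {
        AnnotationGraph g = new AnnotationGraph();
        Anchor a0 = new Anchor("a0", 0.0), a1 = new Anchor("a1", 3.0);
        g.anchors.add(a0);
        g.anchors.add(a1);
        // One arc per word, labelled with its grammatical category
        // (for instance, "The" tagged as a determiner).
        g.arcs.add(new Annotation(a0, a1, "pos", Map.of("word", "The", "category", "DET")));
    }
}
```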

Several problems exist concerning the representation of the linguistic information produced by an NLP tool. The first problem detected in the Annotation Graphs formalism is that it has no concept for representing a linguistic element: it represents linguistic elements by defining two nodes pointing to their beginning and their end. This representation choice does not allow the representation of relational information between linguistic elements, since that information would correspond to arcs between other arcs. This problem concerns both the concept of relational information that we defined in the requirements and the relations between different layers. Another problem stemming from the absence of a representation for linguistic elements is that, in order to keep the graph linked, we must represent arcs which do not correspond to any linguistic element. For example, in Figure 2.2 we had to use arcs to represent the spaces between words. Another problem arises from the fact that Annotation Graphs do not have the concept of classification ambiguity, meaning that a linguistic element may have several classifications. This ambiguity may be represented as several arcs between the same nodes; however, using this alternative, we cannot easily distinguish alternative classifications from other annotations performed over the same zone of the Signal. The representation of segmentation ambiguity and hierarchical segmentation is possible using different arcs between the existing nodes, but again we lack the structural semantics of those conceptually different elements, which makes the utilization of the annotations more difficult, even if the Metadata is used to provide information about the semantic role of each arc. The Annotation Graphs formalism does not allow the editing of primary sources of data. The Anchor used to identify points in the Signal is a number that corresponds to an offset in a file. This representation is very restrictive and does not accommodate, for example, multidimensional signals. Regarding the integration of linguistic information produced by several NLP tools, Annotation Graphs present some problems as well. The model has no concept of layer. If a layer is represented as an individual Annotation Graph, it is not possible to access data from other layers, which is not acceptable for our purposes.

Another alternative is to keep all information in a single Annotation Graph and use the Metadata to identify the information produced by each tool, but this alternative has several problems. Firstly, using this approach it is not possible to have several layers over different Signals. Secondly, it would be too difficult to select a sub-graph containing only the information produced by one NLP tool.

2.3 ATLAS

ATLAS (Bird et al., 2000; Laprun et al., 1999) provides a framework aimed at simplifying the development of linguistic annotation tools. The ATLAS architecture was developed as an extension of the Annotation Graphs formalism to handle multidimensional signals. There is an implementation of the ATLAS architecture in Java; however, the project was abandoned before any tool had used the proposed formalism. Nevertheless, the generalization performed over Annotation Graphs has some interesting features. The ATLAS architecture follows the same principles and therefore has the same problems as the AGTK architecture, so we will not describe them here. The ATLAS annotation model extends the Annotation Graphs formalism in three aspects: i) it allows the representation of multidimensional signals; ii) it provides a better representation of hierarchical structures; iii) it adds semantic information to the elements of the model. These extensions originated a more generic model: all data represented in an Annotation Graph can be represented by the ATLAS model, but the converse is not true. The representation of multidimensional regions is achieved by the introduction of the Region element, an abstraction representing a zone in a Signal delimited by two coordinates represented by Anchor elements. The Anchor element is the only tie between Annotations and the Signal. Figure 2.3 illustrates an annotation using the ATLAS model: there is a Signal, which is a text, and two Regions selecting the words 'The' and 'pretty'.

Figure 2.3: ATLAS region utilization example.

The addition of the Region concept allows the ATLAS model to represent all types of media. However, the model does not allow the editing of Signals.

The representation of hierarchical structures was improved with the addition of the Children element, which contains a list of other Annotations that are descendants of a parent Annotation. Figure 2.4 illustrates an example of the utilization of Children elements, where one Annotation contains several child Annotations.

The ATLAS architecture provides a semantic level, called Meta-Annotation Infrastructure for ATLAS (MAIA), which adds semantic information to the elements of the model. MAIA defines a type system used by each ATLAS element. The type of an element restricts the possible relations between elements and their features.

The extension performed by the ATLAS architecture provided a better representation for conceptually different linguistic phenomena, such as hierarchical trees and ambiguous segmentations, through the Children elements. It also fulfils the requirement of media independence. Even so, this model still presents the main problems described for the Annotation Graphs formalism.

Figure 2.4: ATLAS Children utilization example.

Nevertheless, the idea of representing zones in a data source using regions was used in our work, since it allows the rest of the model's interface to be independent of specific data source types.

2.4 EMDROS

EMDROS (Petersen, 2004) is a text database engine for the analysis and retrieval of analyzed or annotated text. The EMDROS system is composed of four layers:

Client layer - Represents the NLP tools that use the services provided by EMDROS;

MQL layer - Provides an interface to the MQL query language, and uses the EMdF layer to translate MQL queries into SQL calls to the database layer;

EMdF layer - Defines the annotation model used by the database;

Database layer - Represents a relational database which persistently stores the linguistic information.

The principal concept of the EMdF model is the Object element, which represents a linguistic element. It contains a set of Monads, which correspond to the minimum granularity units that the database may have. Objects are grouped into Object Types, which define the possible Features that each specific Object may have. A Feature is an attribute-value pair. Feature elements are the entities that store all the existing data in the database.

Figure 2.5: EMDROS database example.

Figure 2.5 shows an EMDROS database. The database contains six monads, one for each character, which in this case is the minimum granularity unit. There are two Object Types: letter and name. Each letter Object has the Feature surface, which contains the letter that the Object is representing. The name Object contains no Features. There are six letter Objects containing one Monad each, and one name Object containing a set with the six existing Monads. The EMdF model uses the Monads to establish relations between different Objects. In this example the name Object contains all the letter Objects, because its set of Monads contains all the letter Objects' Monads.

This model is too restrictive for our purposes. For example, the EMdF model limits the data source type to text; each linguistic element is identified as an Object which has a single set of Features, so it is not possible to represent several classifications for the same Object; and relational information between Objects from the same layer is also not possible.

2.5 NLTK

The Natural Language Toolkit (NLTK) (Loper and Bird, 2002) is a suite of Python libraries and programs for symbolic and statistical natural language processing. Its purpose is to aid teachers in computational linguistics courses by providing the means for students to create practical NLP components, thus helping them to gain an in-depth understanding of the issues involved in particular aspects of NLP.

The NLTK consists of a collection of independent modules, sub-divided into data-oriented modules, used to store different types of information, and task-oriented modules, used to perform a variety of tasks relevant to natural language processing. The NLTK limits the source data to text.

The NLTK uses a single class to represent the linguistic information about text: the Token class. A Token represents a single unit of text such as a word, a sentence, or a document. The Token contains a predefined set of attribute-value pairs; for example, the TAG attribute encodes the part-of-speech of a word, and the LOC attribute encodes the location of a token in the original text. Each task-oriented module adds new information to the Token, so a Token is a kind of blackboard. There are subclasses of Token defined for special purposes. For example, the TreeToken contains a list of child Tokens, which is useful for representing trees, like syntactic trees. This representation formalism does not satisfy our requirements. For example, it does not allow the editing of the original text, and it does not allow the representation of ambiguities, such as classification ambiguity.

Each NLP task is composed of an interface which defines at least one method to perform the processing task; for example, there is an interface for all tokenizers which defines a tokenize function. There can be different implementations of each interface, and the utilization of interfaces simplifies the integration of modules regardless of the chosen implementation.

Moreover, the NLTK restricts development to the Python programming language, and relies on the Python interpreter to work. For example, an NLP system executes by sequentially invoking several task-oriented modules in the same session of the interpreter, so the NLTK can be seen as a blackboard where the flow of execution of NLP tools is performed by the Python interpreter with some guidance from the user.

2.6 GATE

GATE (Bontcheva et al., 2004) is a general Java-based system for text engineering. It contains functionality for plugging in all kinds of NLP tools, such as POS taggers, sentence splitters, and named entity recognizers. GATE defines two kinds of resources: language resources, which refer to data-only resources such as lexicons and corpora, and processing resources, which represent NLP tools. An NLP system in GATE consists of a list of processing resources executed in a pipeline manner. All resources must use the Java programming language to interact with GATE and must follow a specific interface.

In GATE, language resources are composed of Corpora, each a set of Documents corresponding to texts. A processing resource produces Annotations over those Documents. Both Corpora and Documents have an associated set of Features composed of attribute-value pairs used to define properties over those entities. Annotations are arcs over a graph whose nodes correspond to offsets in a Document. Each Annotation has a type and a set of Features defining the properties of that Annotation. Figure 2.6 illustrates an example of some Annotations over a Document containing the text The red flower. There are four nodes for referencing the three words, and two Annotations for the word red.

This architecture does not allow the parallel processing of the data. Furthermore, the model used by the data resources has several limitations; an example is the fact that only text is allowed as source data. Other limitations related to the graph representation are similar to the ones described for the Annotation Graphs formalism in Section 2.2.

Figure 2.6: GATE annotation graph example.

2.7 Festival

The Festival Speech Synthesis System (Taylor et al., 1998; Black and Taylor, 1997) is a general framework for building speech synthesis systems. It offers a set of APIs that allows the implementation of full text-to-speech systems, while at the same time allowing the easy integration of new modules for that purpose.

The Festival framework proposes a formalism to represent the various types of linguistic information involved when converting text to speech. It contains the following entities:

Relation - A structure used to organize linguistic elements, such as words, into linguistic structures (lists or trees). It consists of a set of named links and a set of nodes. Nodes are purely positional units and contain no information apart from their links. Each node may have previous/next links to create lists, and up/down links to create trees. Each node has a link to an Item, which contains the linguistic information associated with that linguistic element;

Item - A group of features composed of attribute-value pairs, and a set of named links to nodes belonging to Relations. An Item can belong to several Relations;

Utterance - A collection of Relations.

Figure 2.7: Utterance example.

Figure 2.7 shows the representation of an Utterance structure. It contains two Relations: the syntax Relation and the word Relation. The syntax Relation is a tree, while the word Relation is a list. The Items contain the linguistic information associated with each Relation.

Festival has several types of Relations, each of which defines the node structure of that level and the nodes that are shared with other Relations. For example, a token Relation is a list of trees which starts as the list of tokens identified in the source text; each node may have several children corresponding to the words that the token is related to. The phrase Relation consists of a list of trees where each node may have several children corresponding to the words in that sentence. New types of Relations can be defined at run-time and used by a specific module.

A module in Festival is a function which receives an Utterance as input and returns that same Utterance with added information. From this perspective we can see the Utterance as a repository for linguistic information. However, this formalism does not have the required expressiveness: first, the source data for the Festival system must be text; secondly, it does not allow the representation of some kinds of ambiguity which are natural in NLP tools.

Each module in Festival must be declared in an initialization file. Festival defines several types of Utterance, and each kind of Utterance defines which modules must be called. For example, a text Utterance is composed of a string and corresponds to the initial state of the processing, so it will call all modules required to synthesize the text. At the other end, a wave Utterance, a wave file already generated, only involves its loading. This means that the Utterance type acts as a controller that defines which modules will be executed, and their execution order, resembling the controller in a blackboard architecture mentioned earlier in this chapter. The Festival architecture does not allow the concurrent execution of different modules.

2.8 Summary

This section presents a summary of the compared works. We start by describing the main reasons that led us to discard each work, and then we present a comparison of how well each work covers the defined requirements.

We concluded that the AGTK was not suited for our work, since its proposed architecture presents some of the problems that this dissertation proposes to solve. Moreover, the utilization of the Annotation Graphs formalism for our conceptual model was not possible, mainly because of the absence of the linguistic element concept and the lack of organization of information into layers.

Although the ATLAS model extends the AGTK formalism, adding to it some useful abstractions, it does not completely fulfil our requirements, so we did not use it as our conceptual model. The model's inadequacy for our requirements was confirmed by Chris Laprun, one of the ATLAS developers, during LREC 2002 [personal communication].

The EMDROS framework was discarded since its representation model is too restrictive for our objectives.

The NLTK restricts development to the Python programming language, and relies on the Python interpreter to work. Moreover, its underlying conceptual model is not able to fulfil all our requirements, so we decided that the NLTK could not be used.

The GATE framework presents the same problems as the AGTK formalism concerning its conceptual model. It also limits its utilization to the Java programming language. Furthermore, GATE promotes the integration of NLP tools into a pipes-and-filters architecture. For these reasons we decided that the GATE framework was not suited for our work.

Finally, the Festival framework was not used because its model did not allow the fulfilment of all our requirements, namely: i) its source data must be text; ii) it cannot represent all types of ambiguity defined in the requirements; iii) it does not allow the concurrent execution of different modules.

The coverage provided by each work regarding the requirements defined in Section 1.4 is summarized in two tables. Table 2.1 summarizes the framework requirements, and Table 2.2 contains the analysis regarding the conceptual model requirements. Each table entry may have one of the following values, expressing the coverage of that requirement:

S - Supported, meaning that the requirement is fulfilled by the work;

PS - Partially supported, meaning that although the work does not directly fulfil the requirement, it provides some way to fulfil it, for example, by establishing some convention;

NS - Not supported, meaning that the requirement cannot be fulfilled by that work.

We can see a clear distinction between the analysed works. The works defining linguistic annotation frameworks are able to represent all kinds of data, but lack the iteration facilities.

Requirement                                        AGTK  ATLAS  EMDROS  NLTK  GATE  FESTIVAL
Provide iteration facilities                       NS    NS     S       S     S     S
Selection of data based on a tool identification   PS    PS     PS      PS    S     PS
Parallel processing of the data                    NS    NS     S       NS    NS    NS
Persistent storage of all data                     S     S      S       S     S     S
Programming language independence                  S     S      S       NS    NS    S

Table 2.1: System requirements summary.

On the other hand, the works concerning the simplification of the creation of NLP systems are tied to a specific data type, but provide iteration facilities, which is natural, since that is the most common interaction pattern between NLP tools and the data they use.

None of the works, with the exception of GATE, provides access to the data through the identification of the NLP tool that produced it. However, this requirement may be fulfilled by the other works if we add metadata to each element identifying the tool that produced it. Nevertheless, using this approach the selection of relevant data becomes very difficult, since the interfaces were not defined for this purpose.

From Table 2.1 we can conclude that EMDROS is the only tool that allows parallel access to the linguistic information, mainly because it consists of an API that works directly on top of a database, translating requests into SQL queries. The GATE architecture is the only work that does not have all data available, because of its pipes-and-filters architecture, whose problems we described in Chapter 1. As for programming language independence, all tools besides GATE and NLTK provide a way to use their services from various programming languages.

None of the studied works can properly handle the editing of primary sources of data, because they treat those elements as external entities. This is a difference between the analysed works and ours, because we see the creation and editing of new Data Signals as a common process executed by NLP tools.

Requirement                                                    AGTK  ATLAS  EMDROS  NLTK  GATE  FESTIVAL
Manage several primary data sources                            S     S      NS      NS    PS    NS
Manage several layers of linguistic information                S     S      PS      S     NS    S
Allow the usage of all media types                             PS    S      NS      NS    NS    NS
Allow the editing of data sources                              NS    NS     PS      NS    NS    NS
All linguistic information from an NLP tool is
  associated with the same layer                               PS    PS     PS      PS    NS    PS
Layers associated with the NLP tool that produced them         PS    PS     PS      PS    NS    PS
Identification of linguistic elements                          PS    PS     PS      PS    PS    S
Represent ambiguity in linguistic elements                     PS    PS     NS      S     PS    PS
Represent trees of linguistic elements                         PS    S      PS      S     PS    PS
Represent Relations                                            NS    PS     PS      NS    NS    PS
Represent relations between segments from other layers         NS    NS     NS      NS    NS    PS
Assign characteristics to linguistic elements and relations    S     S      S       S     S     S
Can represent classification ambiguity                         PS    PS     NS      NS    PS    NS
Assign characteristics to linguistic elements or
  relations from other layers                                  NS    NS     NS      S     NS    S
Can represent relations between linguistic elements
  from different layers                                        NS    PS     PS      PS    NS    NS
Linguistic elements reference primary data sources             S     S      PS      S     S     S
Allow representation of data not contained in any
  primary data source                                          PS    PS     PS      PS    PS    PS

Table 2.2: Conceptual model requirements summary.

Nevertheless, ATLAS provides a good abstraction for primary data sources, which was incorporated into our conceptual model.

Furthermore, none of the tools has the concept of separating the information based on the NLP tool that produced it; this can be achieved by adding non-linguistic information to each element, but that approach makes the usage of the provided API harder. Another representation problem that we noticed was the inability to properly represent the various types of ambiguity, and relational information between elements produced by the same tool.

After analysing the work being carried out by the ISO committee, we realized that the representation of linguistic elements as a concept in its own right is very important, and it is described in their work. This concept is absent from most works, which focus on the representation of annotations, leaving the concept of linguistic element as a property of each annotation, thus making the representation of relational information and ambiguities harder. Finally, we noticed a lack of representational power to establish cross-relations between elements produced by different tools.


3 Conceptual Model

3.1 Introduction

This chapter presents the proposed conceptual model used to describe linguistic information. The conceptual model must be able to represent various types of linguistic information, and relate the information produced by NLP tools. Besides representing the different linguistic phenomena, the model's API must simplify the use of linguistic information by the tools composing an NLP system.

The same linguistic information may be represented in different ways according to its usage. For example, if the phonetic transcription of a text is going to be the target of several NLP tools, it should be represented as a new data source. On the contrary, the transcription of each word can be represented as a word's attribute if it is not going to be heavily used.

This model was built around the requirements defined in Section 1.4. The conceptual model consists of a set of related entities with predefined semantics. Each entity is described based on its meaning and on its responsibilities inside the conceptual model. The main entity is the Repository, which is a linguistic repository for the output and input of NLP tools. The Repository includes raw data sources (SignalData) and other linguistic information (Analyses). Each Analysis may contain three types of information: segmentation information, relational information, and classification information. Segmentation information concerns the identification of meaningful elements in the data source, for example, a word in a text. Relational information concerns relations between the identified elements, for example, the relation between the subject and verb of a sentence. Classification information corresponds to the association of characteristics with the identified linguistic elements or relations, e.g. the morphologic features of a word.

An API independent of any specific programming language was also defined. Each implementation of the conceptual model must provide that API.

This chapter is organized as follows: Section 3.2 describes the entities of the conceptual model, how they relate to each other, and their responsibilities. Then, in Section 3.3, we describe the API defined for the conceptual model. Finally, Section 3.4 presents some remarks about the proposed model.

3.2 Conceptual Model Entities

This section presents a description of the entities composing the conceptual model, which are illustrated, along with their relations, by a UML diagram in Figure 3.1. Each relation is represented with its multiplicity. When no multiplicity is displayed the default value applies, which is one. All relations are bi-directional unless the direction of the navigability is explicitly declared by an arrow at the end of a relation. The entities composing the conceptual model are:

Repository - A centralized linguistic repository that gathers the output of several NLP tools, and organizes that information into layers. There are two types of layers: SignalData and Analysis;

Data - An abstraction of a data type that can be used by the Repository, e.g. String;

SignalData - An abstraction of a raw data source, such as a text;

Index - A point in a SignalData;

Region - Defines a region in a SignalData, using a pair of Indexes;

Analysis - Linguistic information, other than SignalData elements, produced by an NLP tool;

Segment - A linguistic element, for example, a word;

Figure 3.1: Conceptual Model class diagram.

Segmentation - A set of Segments sequentially ordered, e.g., the words in a sentence;

Relation - A link between two Segments, for example, the relation between the subject and the verb of a phrase;

Classification - A set of characteristics of a Segment or Relation, for instance, the morphologic features of a word;

Alternative - A set of Segments representing different alternatives for an ambiguous linguistic element;

Cross-Relation - A structural relation between Segments from different Analyses.

We defined a set of principal entities of the Repository, which are: SignalData, Analysis, Segmentation, Segment, Classification, and Relation.
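To make the responsibilities of these entities more concrete, the sketch below shows how the principal entities could surface as Java interfaces. This is an illustrative sketch only: the names and signatures are assumptions chosen to mirror the descriptions above, the supporting types are reduced to markers so that the sketch is self-contained, and the normative API is the one defined in Appendix A.

    // Illustrative sketch of the principal conceptual model entities as Java
    // interfaces. Names and signatures are assumptions; see Appendix A.
    interface Data {}                                 // String, audio, ...
    interface Index {}                                // a point in a SignalData
    interface Region {}                               // zone between two Indexes
    interface Classifiable {}                         // Segment or Relation

    interface Layer { String getName(); boolean isOpen(); }

    interface Repository {
        SignalData createSignalData(String name);     // primary data layer
        Analysis createAnalysis(String name);         // NLP tool output layer
    }

    interface SignalData extends Layer {
        Data getData();                               // the raw content
        Region createRegion(Index begin, Index end);  // delimit a zone
    }

    interface Analysis extends Layer {
        Segmentation createSegmentation();
        Relation createRelation(Segment source, Segment destination);
        void close();                                 // layer becomes read-only
    }

    interface Segmentation {
        Segment createSegment(Region original);       // identify an element
        java.util.Iterator<Segment> getSegments();    // model's own Iterator in the API
    }

    interface Segment extends Classifiable {
        Region getRegion();                           // original data
        Data getDerivedData();                        // transformed data or null
    }

    interface Relation extends Classifiable {
        Segment getSource();
        Segment getDestination();
    }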

These principal entities may contain two attributes: (i) type and (ii) description. These attributes allow the addition of non-linguistic information to those entities. They carry no semantics inside the Repository; however, they were defined in order to allow the creation of specialized APIs that use them. The type attribute may be used by NLP tools to add semantics to the elements of the Repository, while the description attribute may be used by a graphical interface to indicate the meaning of each element. The creation of these attributes is optional, and they are not displayed in the examples in this section. The rest of this section describes each entity in detail.

3.2.1 Repository

A Repository element represents the main entity of the conceptual model. It is a centralized linguistic data store that maintains the output of several NLP tools. The information inside the Repository is organized into layers. Each layer is uniquely identified inside the Repository by its name. A layer also has information regarding the NLP tool that created it, and its creation date. The distinction between a layer's name and the name of the NLP tool that created it is required because an NLP tool can be executed several times during the execution of an NLP system. A layer may be open or closed, indicating whether the tool has already finished adding data to the Repository. A layer can only be changed while it is open.

The Repository can hold two different types of layers: i) layers containing primary data sources, which are called SignalData; ii) layers containing other linguistic information produced by NLP tools, which are called Analyses. The Repository is responsible for the creation and management of all the entities of the conceptual model. Figure 3.2 illustrates the Repository class diagram with its main methods.

The Repository element fulfils the requirement of representing several layers of linguistic information, both primary data sources and linguistic information produced by NLP tools. Furthermore, the organization of information into layers fulfils the requirement of allowing the selection of linguistic information based on a layer's identification.

Figure 3.2: Repository class diagram.

3.2.2 Data

A Data element represents a piece of data inside the Repository. The Data element provides an abstraction over all possible types of data that the Repository may handle, such as Strings or Wave files, allowing the definition of a uniform model and corresponding API independent of any specific data type. The Data element is used inside the Repository by the SignalData, to represent the data it contains, and by the Segment, to represent its derived data.

3.2.3 SignalData, Index and Region

A SignalData element is a representation of a raw data source used by NLP tools, like a text or an audio file, abstracting details such as its location and its Data type. All SignalData elements have a minimal granularity unit. In a text the minimal unit is a character, while in an audio file the minimal unit is a sample.

An Index represents a point in a SignalData using the SignalData's minimal unit. For example, in a text a possible Index representation is an integer representing a character position inside the text, while in a stereo audio recording the Index representation can be a pair of reals indicating the position of the sample in both stereo channels.

A Region defines a zone in a SignalData. It is defined by two Indexes and a SignalData.

The Region element encapsulates the details about the specifics of a SignalData and an Index.

A SignalData uses a Data element to represent its content, and has the responsibility of allowing the editing and management of the data contained in the raw data source. It is also responsible for the creation of Indexes and Regions over its own data. Figure 3.3 shows the SignalData class diagram and its main methods.

Figure 3.3: SignalData class diagram.

Example 3.1 - SignalData, Index, and Region

This example shows a Repository with a layer containing a SignalData. This SignalData represents a text which is kept inside the Repository. The minimal unit of this SignalData is the character; its possible Indexes are integers corresponding to the position of each character in the text, and are represented below the characters. The figure shows a Region of the SignalData, illustrated with a surrounding box, corresponding to the word rose, which is defined by the integer pair (8, 11) corresponding to the start and end characters of the word. The identification of the SignalData and the rest of its attributes are represented below the text.

The SignalData element fulfils the requirement of media independence, since it can represent any kind of media while providing the same interface to the rest of the conceptual model. Figure 3.4 illustrates a specific type of SignalData, called TextSignalData, used to represent text. The TextSignalData uses the TextIndex and StringData elements.

Figure 3.4: TextSignalData class diagram.

Furthermore, the SignalData element also fulfils the requirement regarding the editing of its data, through the methods the SignalData provides.
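As an illustration, the fragment below sketches how a client could reproduce Example 3.1 using the TextSignalData of Figure 3.4. The method names (createTextSignalData, append, createIndex, createRegion) are assumptions used only for illustration; the actual methods are those listed in Appendix A.

    // Sketch of Example 3.1: a text SignalData and a Region over the word
    // "rose". Method names are illustrative assumptions.
    TextSignalData text = repository.createTextSignalData("english-text");
    text.append(new StringData("The red rose is pretty"));

    // A text Index is a character position; the pair (8, 11) delimits the
    // start and end characters of the word "rose".
    TextIndex begin = text.createIndex(8);
    TextIndex end = text.createIndex(11);
    Region rose = text.createRegion(begin, end);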

3.2.4 Analysis

An Analysis element is a type of layer that represents all the information produced by an NLP tool, excluding primary data sources. An Analysis may contain segmentation information, relational information, and classification information (see Figure 3.5). Segmentation information represents the identification of linguistic elements in a SignalData, like words, sentences, or phonemes. Relational information represents links between linguistic elements from the same Analysis, and classification information represents the assignment of features to existing linguistic elements or relations.

Figure 3.5: Analysis class diagram.

Since all information produced by an NLP tool (excluding data sources) is gathered into an Analysis element, the requirement that all information produced by an NLP tool should be gathered into a layer is fulfilled.

3.2.5 Segment and Segmentation

A Segment element represents a linguistic element. The Segment may contain two Data elements: the original data and the derived data. The original data corresponds to a linguistic element identified in a SignalData, while the derived data corresponds to a possible transformation performed over the original data. A Segment may be ambiguous, meaning that it has a set of Alternative elements for the linguistic element that it identifies. It may be hierarchical, meaning that it may have a parent Segment and child Segments. A Segment also has a set of disjoint Classification elements (Subsection 3.2.7), where each Classification assigns a set of characteristics to the Segment, and a set of Relations (Subsection 3.2.6), which establish links between two Segments from the same Analysis. Finally, a Segment may belong to a set of CrossRelations (Subsection 3.2.8), which are used by the Repository to establish structural relations between Segments from different Analyses.

A Segmentation represents a list of ordered Segments, for example the words appearing in a text.

Figure 3.6 shows the Segment and Segmentation class diagram, containing the relations between the entities explained above.

Figure 3.6: Segmentation and Segment class diagram.

The rest of this subsection presents a deeper explanation, with examples, of the elements and relations described above.

Example 3.2 - Segment and Segmentation

This example shows a Repository with two layers. The first layer contains the SignalData described in Example 3.1, but with one Region for each word. The Regions are represented in the same way as in the previous example. For the sake of simplicity, the identification of the layer was omitted.

The second layer contains an Analysis whose identification is shown at the bottom. The Analysis contains a Segmentation composed of five Segments, one for each word. The Segments are ordered from left to right inside the Segmentation, indicating the order in which the words appear in the text. Each Segment contains a Region that identifies the zone in the SignalData corresponding to the Segment's linguistic element, which is represented as an arrow from the Segment to the Region on the SignalData's layer. The Region is drawn in the SignalData's layer for clarity, but it belongs to the corresponding Segment's layer.

Besides the data referenced by the Region belonging to a SignalData, which is designated as original data, a Segment may contain another kind of data, designated as derived data, which represents a transformation performed over the original data, for example, the separation of a contraction in a text. The use of derived data avoids the creation of a new SignalData in which only a small piece of text would be different, thus avoiding the duplication of the text. The derived data has priority over the original data, so if some element queries for the data of a Segment, the derived data is returned if it exists. Nevertheless, the original data is not lost and remains accessible through the Segment's Region. The Segment keeps both kinds of Data so that it can align the derived data with the original data from a SignalData.

Example 3.3 - Usage of derived data

This example describes the usage of derived data. It shows a Repository with two layers. The first layer is a SignalData that contains the text They're waiting outside. The representation of the elements SignalData, Index, and Region is the same as in the previous example.

The second layer is an Analysis with two Segmentations. The first Segmentation contains a Segment for each term of the SignalData, each referencing a Region in the SignalData as described in the previous example. The second Segmentation shows the usage of derived data to represent the separation of the contraction They're. It contains two Segments with derived data (represented in italic), one for each word composing the contraction. These Segments still contain the Region referencing the original data, so the original data is still available, which allows access to the original state of the text. To clarify the example we have repeated the representation of the SignalData layer, so that the arrows to the Regions do not overlap. However, there is only one layer with that SignalData.
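In code, the separation of the contraction in Example 3.3 could look like the minimal sketch below; the variables signal and analysis, and the methods createSegment and setDerivedData, are assumed names for the operations described above.

    // Sketch of Example 3.3: two Segments share the Region covering the
    // original contraction "They're" (characters 0-6), but each carries its
    // own derived data. Method names are illustrative assumptions.
    Region contraction = signal.createRegion(signal.createIndex(0),
                                             signal.createIndex(6));
    Segmentation separated = analysis.createSegmentation();

    Segment they = separated.createSegment(contraction);
    they.setDerivedData(new StringData("They"));   // derived form

    Segment are = separated.createSegment(contraction);
    are.setDerivedData(new StringData("are"));     // derived form

    // Queries return the derived data when it exists, while the original
    // text stays reachable through each Segment's Region.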

There are two possibilities to represent segmentation ambiguity: i) use one Segmentation for each possible segmentation; ii) use only one Segmentation containing Ambiguous Segments. The second alternative is preferable when only a small part of the data is ambiguous, because it avoids the repetition of equal structures.

We show these two approaches to representing segmentation ambiguity in Example 3.4 and Example 3.5, for the sentence John is from Great Britain, where Great Britain can be segmented as one Segment corresponding to the compound term, or as two Segments corresponding to each individual word.

Example 3.4 - Segmentation ambiguity representation using different Segmentations

This example shows the representation of segmentation ambiguity using two distinct Segmentations. The figure shows a Repository containing two layers, where the first layer corresponds to the SignalData. The second layer is an Analysis containing two Segmentations. For simplicity we have placed the data referenced by each Region in the Segment itself; however, as usual, the Segment does not contain the data: it contains a Region that references the data.

The difference between the two Segmentations resides in the representation of the compound term Great Britain, which is represented as only one Segment in the second Segmentation, and as two Segments in the first Segmentation. We can see that the rest of both Segmentations is the same, so we have a duplication of structures.

Example 3.5 - Segmentation ambiguity representation using Ambiguous Segments

This example shows the representation of segmentation ambiguity using Ambiguous Segments.

The difference between this example and Example 3.4 is that the Analysis uses only one Segmentation, which contains an Ambiguous Segment, instead of two Segmentations. The Ambiguous Segment contains two Alternatives. The first Alternative contains two Segments, one for each word, while the second Alternative contains only one Segment, for the compound term. This example shows how the usage of Ambiguous Segments can avoid the duplication of equal structures.

The representation of hierarchical structures is provided by Hierarchical Segments. A Hierarchical Segment has the concept of a parent Segment and a list of child Segments. This allows the representation of trees, which is useful, for example, to describe syntactic trees.
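A minimal sketch of building such a parent/child structure follows, reusing the compound term from the previous examples; addChild, and the variables trees, greatRegion and britainRegion, are assumed names for the operations and data described above.

    // Sketch of a Hierarchical Segment: a parent node covering the compound
    // term "Great Britain", with one child Segment per word. Method names
    // are illustrative assumptions.
    Segment compound = trees.createSegment();              // parent node
    Segment great = trees.createSegment(greatRegion);      // leaf for "Great"
    Segment britain = trees.createSegment(britainRegion);  // leaf for "Britain"

    compound.addChild(great);    // great and britain now have compound
    compound.addChild(britain);  // as their parent Segment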

Example 3.6 - Representation of syntactic trees

This example shows the representation of hierarchical structures by the conceptual model. The figure shows a Repository composed of three layers. The first layer is a SignalData containing the text I made her duck, represented as in the previous examples. The second layer is an Analysis which represents the output of a Tokenizer that segmented the text into words. We repeated the representation of these two layers so that the relations between the Segments of each syntactic tree and the Segments produced by the Tokenizer are clearer.

The third layer is an Analysis resulting from a Syntactic Parser, which produced two syntactic trees. The tree on the left has the meaning I made her bend, while the tree on the right has the meaning I cooked her duck. Each tree is represented as one Segmentation, which contains one Segment: the root of the syntactic tree, covering the entire sentence. The root Segment contains several children, which contain other children themselves, thus representing the entire syntactic tree. The leaves of each tree correspond to the Segments of the Tokenizer Analysis representing each individual word.

The Segments of the third layer are represented with their morphological features; however, these features do not belong to the Segments, they are Classifications associated with them. This representation was chosen just to keep the example simple.

From the examples presented above we conclude that the elements explained in this subsection allow the fulfilment of the requirements regarding: (i) the identification of linguistic elements; (ii) the referencing of data from a primary data source without copying it; (iii) the representation of data not belonging to any primary data source (contraction separation using derived data); (iv) the representation of segmentation ambiguity; and (v) the representation of hierarchical structures of linguistic elements.

3.2.6 Relation

A Relation element represents relational information between two Segments; for example, it may represent the relation between two constituents of a phrase. A Relation belongs to an Analysis. It has a source Segment and a destination Segment. These two Segments must belong to the same Analysis; however, that Analysis may be different from the one the Relation belongs to. A Relation may have several Classifications assigning characteristics to it. Figure 3.7 presents the Relation class diagram.

Figure 3.7: Relation class diagram.
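A minimal sketch of creating such a Relation, assuming the createRelation and createClassification methods sketched earlier and Segments named after the words they identify:

    // Sketch of Example 3.7: a Relation from the modifier "pretty" to the
    // word it modifies, "rose". Names are illustrative assumptions.
    Relation modifies = analysis.createRelation(pretty, rose);

    // A Relation may itself be classified, e.g. with its grammatical role.
    Classification role = analysis.createClassification(modifies);
    role.addFeature("role", new StringData("modifier"));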

Example 3.7 - Relation representation example

This example presents a Relation between the word rose and the word pretty. The figure presents the Repository explained in Example 3.2, in which a Relation was added to the Analysis.

The Relation element fulfils the requirement of allowing the creation of relational information between linguistic elements, even when those elements belong to a different layer from the Relation's own.

3.2.7 Classification

A Classification represents the association of a set of characteristics with a Classifiable element (a Segment or a Relation). The morphologic features of a word are an example of a Classification. Each Classification has a set of attribute-value pairs, where the key is a string and the value is any type of Data that the Repository handles. A Classification belongs to an Analysis. Each Classifiable element may have several Classifications, which is called classification ambiguity. A Classification may classify elements that belong to another Analysis, e.g. classify words identified by a tokenizer. Figure 3.8 shows the class diagram of the Classification element.

Figure 3.8: Classification class diagram.
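As a sketch, classification ambiguity amounts to attaching more than one Classification to the same Classifiable element; createClassification and addFeature are assumed names, and the duck Segment is taken from the Tokenizer Analysis of Example 3.6, illustrating classification across layers.

    // Sketch of classification ambiguity: a tagger attaches two alternative
    // part-of-speech Classifications to the same Segment ("duck" may be a
    // noun or a verb). The Segment belongs to the tokenizer's Analysis,
    // while the Classifications belong to the tagger's Analysis.
    Classification asNoun = taggerAnalysis.createClassification(duck);
    asNoun.addFeature("pos", new StringData("noun"));

    Classification asVerb = taggerAnalysis.createClassification(duck);
    asVerb.addFeature("pos", new StringData("verb"));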

Example 3.8 - Classification ambiguity example

This example shows the representation of classification ambiguity. The figure shows a Repository with three layers. The first two layers were already explained in Example 3.2. The third layer represents the output of a part-of-speech tagger. Each Segment identified by the tokenizer has several Classifications. In this example, each Classification contains only one attribute-value pair: the word's part of speech. This example also shows the association of Classifications with Segments from other layers.

The Classification element fulfils the requirement of representing classification ambiguity, and the requirement of associating Classifications with linguistic elements or relations from other layers.

3.2.8 CrossRelation

The CrossRelation element represents structural relations between Segments from different Analyses. These relations express the concept of derived from, meaning that a set of Segments is derived from another set of Segments. This is a very broad concept that establishes relations between layers. These relations are used to navigate through the data from different layers. Figure 3.9 shows the CrossRelation class diagram. A CrossRelation contains two sets of Segments: the child Segments, which were derived from the parent Segments. All Segments in each set belong to the same Analysis, thus defining the parent Analysis and the child Analysis.

Figure 3.9: CrossRelation class diagram.
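A hedged sketch of establishing such a relation between two layers follows; createCrossRelation, addParent and addChild are assumed names for the operations just described.

    // Sketch of a CrossRelation: the Portuguese Segment "rosa" is derived
    // from the English Segment "rose". Method names are illustrative
    // assumptions; each set of Segments belongs to a single Analysis.
    CrossRelation aligned = repository.createCrossRelation();
    aligned.addParent(rose);   // Segment from the English Analysis
    aligned.addChild(rosa);    // Segment from the Portuguese Analysis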

Example 3.9 - Text translation alignment

This example describes how CrossRelations can be used to align different SignalData elements. The figure shows the translation of the sentence The red rose is pretty from English to Portuguese, A rosa vermelha é bonita. The figure shows a Repository with four layers. The first two layers were explained in Example 3.2. The third layer is a SignalData containing the Portuguese translation of the English text from the first layer. The fourth layer is an Analysis produced by an English-to-Portuguese translator, which creates a Segmentation in which each Segment represents a Portuguese word from the third layer. The alignment between the two texts is achieved by adding CrossRelations between the Segments of the corresponding Analyses. The CrossRelations are represented as dotted arrows between the Segments.

The CrossRelation element fulfils the requirement of allowing the representation of relations between Segments from different Analyses, thus allowing the navigation through information from different layers.

3.3 Conceptual Model API

This section describes a programming-language-independent API for the conceptual model. The API describes the minimum set of methods that each implementation of the conceptual model must provide.

Besides the entities described in Section 3.2, some helper entities have been defined and used in the definition of the API. An Iterator element provides a way to access elements from an aggregate set without exposing their underlying representation.

Figure 3.10: Iterator class diagram.

Figure 3.10 shows the Iterator class diagram. It has two methods: one to test whether there is a next element, and another to retrieve that element.

Figure 3.11: type and description class diagram.

The type and description attributes are associated with two interfaces: HasSemanticType and HasDescription. Figure 3.11 shows their methods and the elements that implement those interfaces. Several exceptions were defined to report exceptional situations that may occur during the Repository's execution.
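A client-side traversal using the Iterator of Figure 3.10 could look like the sketch below, assuming the two methods are named hasNext and next, and that a Segmentation exposes an Iterator over its Segments through an assumed getSegmentIterator method.

    // Sketch of the Iterator facility: walking the Segments of a
    // Segmentation without exposing its internal representation.
    Iterator segments = segmentation.getSegmentIterator();
    while (segments.hasNext()) {
        Segment segment = (Segment) segments.next();
        // process the Segment, e.g. read its Region or its Classifications
    }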

Figure 3.12 shows a complete class diagram of the conceptual model, which contains the added entities and all relations existing between them. Each relation is represented with its multiplicity; when no multiplicity is displayed the default value applies, which is one. All relations are bi-directional unless the direction of the navigability is explicitly declared by an arrow at the end of a relation.

The defined API provides getters for all attributes of each entity. Each element has methods to create the entities for which it is responsible, e.g. an Analysis creates Segmentations, Relations and Classifications. Moreover, if an entity contains an aggregated set of other entities, it provides an Iterator over those entities and a method to test whether it contains any of them, e.g. a Segmentation has a method to test whether it contains Segments and another method to get the Iterator over those Segments. The Repository provides methods to select layers based on their attributes. The Iterator and Exception entities do not participate in any relations, since they are not contained by any element; they are simply used by the other elements. The HasSemanticType and HasDescription interfaces were omitted since they have no semantic meaning regarding the conceptual model. An in-depth description of the API of the conceptual model is provided in Appendix A.

3.4 Summary

In this chapter we explained the proposed conceptual model. We presented several examples of how the conceptual model can be used to represent linguistic phenomena, and of how it fulfils the requirements identified in Section 1.4.

The conceptual model can represent the same linguistic phenomena in several ways. This has both a benefit and a drawback. The benefit is that it allows the representation of linguistic information in a way that simplifies its usage by a specific NLP system. The drawback is that, since the model is not minimal, each NLP tool must specify how it represents its data in order to be usable by other NLP tools. If a tool expects a certain representation of the data which is not the one produced by another NLP tool, there will be a mismatch between the representations and the tools will not be able to cooperate.

Moreover, some decisions regarding the conceptual model can only be validated through experience in using it. These choices were based on our experience with NLP tools and on the usages that we predicted for the conceptual model. For example, a Classification's features are composed of attribute-value pairs; this may be generalized to allow the usage of feature structures. The Relation element provides a way to establish links between two Segments, derived from our experience with post-syntactic processing tools. However, we think that it is possible to have Relations with more than two arguments, as in the case where we wish to represent all the arguments of a verb. Finally, the CrossRelation element is used to establish relations between different layers, but this could be done with Hierarchical Segments as well. We could also use the CrossRelation element to represent relations between linguistic elements. We decided to keep these concepts separated since they are semantically different: the CrossRelation concept should only be used when a more specific concept cannot be applied.

Another open question concerns the relations between different layers. We establish relations between Analyses through CrossRelation elements, which can only be applied to Segments, but the concept of relations between layers could be useful for other entities. For example, we could establish CrossRelations between Classifications.

It is our conviction that the definition of such a conceptual model is never finished, and as we showed in Chapter 2, other attempts to create such a model have failed to fulfil the requirements we defined. Therefore, the improvement of the conceptual model will always be a part of the future work. Although the model was developed in the context of a shared repository for linguistic information, it is our conviction that it can be used for other purposes, such as an annotation model for a linguistic annotation framework.

Figure 3.12: Complete conceptual model class diagram.


4 Architecture

4.1 Introduction

This chapter presents the architecture of our solution. The proposed solution has a client-server architecture (see Figure 4.1). The clients are NLP tools, while the server mainly consists of a centralized repository of linguistic information and data sources, represented under the conceptual model explained in Chapter 3.

Figure 4.1: Client-server architecture.

In our solution, the clients can be characterized as rich clients, since they perform the major part of the processing. Each client interacts with the server through a remote API to consult and add linguistic data from/to the repository. The interaction between NLP tools and the repository can be seen as if each tool produced a layer of linguistic information, possibly based on the previously existing layers. Besides the remote API, we also define a client library that NLP tools can use to communicate with the server. Both approaches have their advantages, namely:

The remote API is independent of the tool's programming language;

The client library abstracts the connection and protocol details between the client and the server, and offers an implementation of the conceptual model. Therefore it is easier to implement an NLP tool using the client library than using the remote API directly. However, we must have one client library implementation per programming language.

Both kinds of interaction allow the navigation through the linguistic information, either horizontally (information produced by a single NLP tool) or vertically (information from different NLP tools which is somehow related).

The rest of this chapter is organized as follows: Section 4.2 explains the server's internal structure and its behavior, Section 4.3 details the client library, and Section 4.4 presents some final remarks about the proposed architecture.

4.2 Server architecture

This section describes the server's architecture. First, we present the general architecture principles. Subsection 4.2.2 presents an implementation of the architecture in Java using XML-RPC (specification, 2006) as the communication protocol. Finally, Subsection 4.2.3 presents some interaction examples between NLP tools and the server.

4.2.1 Server Architecture Description

The server architecture follows a shared-data style. We decided to use a repository architecture instead of a blackboard architecture because we do not want to manage the execution order of NLP tools. The server acts as a data store for the linguistic information that all NLP tools use. This architecture has the advantages of allowing clients to be added without the server's knowledge, and of allowing the integration of the data produced by all the tools. The linguistic information (described using the conceptual model explained in Chapter 3) is kept by the server.

The server is organized in three layers, where each layer can only use the adjacent layers through their interfaces. The layered approach promotes portability and maintainability, since the role of each layer is well identified, and the implementation of each layer can be changed without affecting the other layers. Figure 4.2 shows the existing layers, which are the following:

Figure 4.2: Server layers.

Data Layer - Responsible for managing the linguistic information stored in the repository;

Service Layer - Responsible for defining the boundaries and interface of the server, providing a functional interface over the Data Layer;

Remote Interface - Responsible for implementing the interface provided by the Service Layer according to a communication protocol.

4.2.1.1 Data Layer

The Data Layer contains the logic of the application, using the conceptual model to represent the linguistic information. Besides representing all the linguistic information, the Data Layer is also responsible for guaranteeing the persistence of all the data represented by the conceptual model.

The Data Layer extends the interface of the conceptual model with some methods required for this particular usage. First, each element must have a unique identifier, and methods to access an element by its identifier, which are used between the client and the server to easily select the required element. The identifier structure is hierarchic. Each SignalData and Analysis element has a unique identifier inside the Repository. Each Segmentation, Classification and Relation has a unique identifier inside an Analysis. Each Segment has a unique identifier inside a Segmentation. For example, a Segment is identified on the client side by the combination of three identifiers: the Analysis identifier, the Segmentation identifier, and the Segment identifier.
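The sketch below illustrates the composition of such an identifier; the class and the concrete identifier values are assumptions used only to show the hierarchic structure.

    // Illustrative sketch of the hierarchic identifier scheme: a Segment is
    // addressed by the triple (Analysis id, Segmentation id, Segment id).
    final class SegmentId {
        final String analysisId;       // unique inside the Repository
        final String segmentationId;   // unique inside the Analysis
        final String segmentId;        // unique inside the Segmentation

        SegmentId(String analysisId, String segmentationId, String segmentId) {
            this.analysisId = analysisId;
            this.segmentationId = segmentationId;
            this.segmentId = segmentId;
        }
    }

    // e.g. the seventh Segment of the "words" Segmentation produced by the
    // first run of a tokenizer (hypothetical values):
    SegmentId id = new SegmentId("tokenizer-1", "words", "7");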

The second extension concerns the iteration facilities. All elements that can be iterated must add to their interface a method that returns a list of the elements to be iterated. This method is used by the Service Layer to provide the iteration facilities. These extensions are described in Appendix B.1.

4.2.1.2 Service Layer

The Service Layer (Fowler, 2002) defines the server's boundary by providing a set of methods, and coordinates the server's response to each method. It is used by the Remote Interface layer, which handles the protocol-specific details, and it encapsulates the Data Layer. The Service Layer is responsible for hiding the details regarding the domain elements of the Data Layer: it transforms the domain elements into Data Transfer Objects (DTOs), which are passed to the client. The Service Layer is also responsible for providing methods that allow the creation of iteration facilities on the client side. Moreover, since the Service Layer is a single entry point into the server, it is an ideal place to perform logging and authentication actions.

Data Transfer Objects

A Data Transfer Object (DTO) (Fowler, 2002) is an object with no semantics, used to pass information between the client and the server. Each DTO can hold two kinds of information from the domain objects:

Identification information - Used to access domain objects;

Read-only information - Information from the domain objects which may not be changed by NLP tools, for example, an Analysis name, or the data referenced by a Segment. This information is passed because it will probably be required by NLP tools, and this way we avoid additional remote calls to fetch it.

We defined a DTO for each element of the conceptual model. Each DTO element must be serializable into some external representation, which is used to transfer the DTO between the client and the server in both directions.

The Service Layer together with the DTOs works as a Remote Facade (Fowler, 2002), thus diminishing the number of remote calls required for certain operations.

4.2.1.3 Remote Interface

The Remote Interface provides the methods that are available to the client according to the selected protocol. It communicates with the Service Layer, and is responsible for serializing the DTOs provided by the Service Layer into their external representation, which is sent across the connection. It is also responsible for assembling the DTOs back, and passing them to the Service Layer. The API of the Remote Interface layer is described in Appendix B.4.

4.2.2 Server Architecture Implementation

We have implemented a simplified version of the architecture to test the feasibility of our solution. In this implementation, the server is represented by a Java class named XmlRpcServer in the repository.server package. This class implements an XML-RPC server using the Apache XML-RPC package (project page, 2006). The XML-RPC protocol was chosen because of its simplicity and because it does not impose any restrictions on the programming language used by the client. Nevertheless, this is just one possible implementation of the Remote Interface layer, and it can be changed without affecting the rest of the architecture.

The XmlRpcServer class uses the WebServer class provided by the Apache XML-RPC package, which implements a simple HTTP server that provides only the core HTTP functionality required by the XML-RPC protocol.
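For illustration, starting such a server with the Apache XML-RPC 2.x WebServer could look like the sketch below; the handler name and the port are assumptions, and RepositoryHandler is the handler class described in Subsection 4.2.2.3.

    import org.apache.xmlrpc.WebServer;
    import repository.server.handlers.RepositoryHandler;

    // Minimal start-up sketch (assumed handler name and port). The
    // WebServer maps incoming XML-RPC calls to the registered handler's
    // Java methods.
    public class ServerMain {
        public static void main(String[] args) {
            WebServer webServer = new WebServer(8080);
            webServer.addHandler("repository", new RepositoryHandler());
            webServer.start();
        }
    }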

The next subsections detail the implementation of each layer.

4.2.2.1 Data Layer

The interface provided by the Data Layer to the Service Layer is implemented as a set of Java interfaces, in the repository.server.model.interfaces package, corresponding to the class diagram shown in Figure 3.12. These interfaces use the repository.server.model.exceptions package, which contains the exceptions thrown by the conceptual model. These exceptions are implemented as Java classes that extend the Exception class provided by the Java programming language.

In the repository.server.model package we define a set of Java classes that implement the interfaces of the conceptual model. This implementation of the conceptual model guarantees persistence by saving the entire Repository when an Analysis is closed. The Repository is serialized into an XML format, which is described in Appendix B.2, and saved into a file. This implementation was chosen for its simplicity, since our main objective was to test the feasibility of our solution. Nevertheless, improvements to the persistence mechanism are proposed as future work.

We defined one specific SignalData, the TextSignalData, to work with text. Along with the implementation of the TextSignalData we have implemented a specific Data, the StringData, and a specific Index, the TextIndex. The Iterator element corresponds to the Java Iterator interface.

To simplify the creation of our prototype, we imposed some restrictions on the SignalData interface. Since the SignalData will be the target of linguistic information produced by NLP tools, we must forbid the removal of data from it. We also forbid the addition of Data in the middle of a SignalData; new Data can only be appended at the end of a SignalData. These two restrictions guarantee the coherence of the linguistic information targeting a SignalData. The methods that cannot be used throw an UnavailableOperationException.

4.2.2.2 Service Layer

The Service Layer is implemented as a Java interface, RepositoryServices, in the server.services.interfaces package, and defines the methods provided by the server, which are then exposed according to a specific communication protocol by the Remote Interface layer. The interface is implemented using two Java classes in the server.services package: RepositoryServicesImp implements the corresponding interface methods, and DTOAssembler is used for assembling DTO objects from domain objects.

Data Transfer Objects

The DTO objects are defined in the repository.shared.dto package. They belong to this package because their implementation is shared by both the server and the Java implementation of the client library. Each DTO is implemented as a Java class consisting only of getter and setter methods. Figure 4.3 illustrates the DTO class diagram and the relations between the DTOs.

Figure 4.3: DTO class diagram.

In this implementation we have set the SegmentDTO as the main entity, so it

contains the list of its ClassificationDTO elements and its RelationDTO elements. We made this choice because we are convinced that it is more common for an NLP tool to access a Classification or a Relation through a Segment than the other way around. The RelationDTO contains its Classifications for the same reason.

This implementation serializes each DTO into an XML representation, which is passed to/from the client. So each DTO implements two methods:

String toXml() - Converts a DTO into its XML representation;

static DTO fromXmlRepresentation() - Creates a DTO element from its XML representation.

The DTD used for serializing the DTOs is described in Appendix B.3.

4.2.2.3 Remote Interface

The Remote Interface layer is implemented as a Java class named RepositoryHandler, in the repository.server.handlers package. This class is registered in the XML-RPC server as a handler class. The XML-RPC server handles the mapping between the XML-RPC calls and the Java methods defined in that class.

4.2.3 Server Architecture Interaction Examples

This section presents some sample interactions between a client and a server through its XML-RPC interface. The client starts by establishing a connection with the server using its address. Using the obtained connection, the client can interact with the server. The sequence diagrams shown in this section assume that the connection has already been established.
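To make the interaction concrete, the sketch below shows how a client might establish the connection and issue a remote call with the Apache XML-RPC 2.x client API. The URL and the handler prefix are assumptions consistent with the server sketch given earlier; the method name and arguments follow the Remote Interface API of Appendix B.4.

    import java.util.Vector;

    import org.apache.xmlrpc.XmlRpcClient;

    public class ClientConnectionSketch {
        public static void main(String[] args) throws Exception {
            // Establish a connection with the server using its address.
            XmlRpcClient client = new XmlRpcClient("http://localhost:8080/");

            // Example 4.1: create an Analysis; the reply carries the serialized AnalysisDTO.
            Vector<Object> params = new Vector<Object>();
            params.add("tokenization");   // analysis name (illustrative)
            params.add("word-tokenizer"); // creator (illustrative)
            Object analysisXml = client.execute("repository.createAnalysis", params);
            System.out.println(analysisXml);
        }
    }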

Example 4.1 - Sequence Diagram - NLP tool interaction principles

This example describes the principles of interaction between an NLP tool and the server. The tool starts by creating an Analysis. Then, using this Analysis, it performs its processing, and at the end it closes the Analysis, which in this implementation saves the entire Repository. From this point on, the Analysis cannot be changed by any client, unless it is opened again. The figure shows a sequence diagram of the creation and closing of an Analysis.

The client calls the remote method createAnalysis() on the server. The XML-RPC interface calls the method createAnalysis() on the Service Layer which, in turn, calls the method with the same name on the Repository element from the Data Layer. The Repository returns a new Analysis object, which is translated into an AnalysisDTO object in the Service Layer by the DTOAssembler. The XML-RPC interface serializes the AnalysisDTO object into its XML representation and sends it back to the client. After performing its processing, the client closes the Analysis.

Example 4.2 - Sequence Diagram - Specific Segment selection

This example describes the selection of a Segment based on the position it occupies inside its Segmentation. This method is provided to allow the implementation of iteration on the client side. The figure shows a sequence diagram of this interaction. The client calls the method

getSegmentAtPosition(). This method receives as arguments the XML representation of a Segmentation and an integer representing the position of the required Segment inside the Segmentation. The XML-RPC interface creates a SegmentationDTO object from the received XML description and calls the method with the same name in the Repository Service layer. The Repository Service gets the Analysis which contains the required Segmentation, using the identification contained inside the SegmentationDTO. Then the corresponding Segmentation is fetched using its identification. The Service Layer gets the list of all Segments of that Segmentation and selects the required Segment using the position argument. The Segment is then converted into a SegmentDTO by the DTOAssembler and passed back to the XML-RPC interface, which serializes it into its XML representation and returns it to the client.

4.3 Client Library

This section describes the client library. We start by describing the general principles of the client library in the next Subsection. Then Subsection 4.3.2 presents an implementation of the client library in Java. Finally, Subsection 4.3.3 describes some solution validation tools that we implemented to test the client library and the repository.

4.3.1 Client Library Description

The client library allows an NLP tool to abstract from details concerning the communication with the server and the data exchange protocol. It also provides

some high-level interfaces that may simplify the creation of NLP tools. The client library uses a layered architecture (see Figure 4.4), for the reasons explained in Section 4.2.1. The client library has the following layers:

Client Stub - Handles the interaction with the server;

Conceptual Model - Provides an implementation of the conceptual model;

Extra Layers - Provide domain-specific interfaces.

Figure 4.4: Client Library Internal Structure.

The rest of this Subsection details the principles of each layer.

4.3.1.1 Client Stub Layer

The Client Stub layer is responsible for communicating with the server under the chosen protocol, through the server's Remote Interface layer. The Client Stub layer API is the same as the API of the Remote Interface of the server, which is described in Appendix B.4. All the other layers of the client library depend on and use this layer; this way, the other layers are independent of the specific communication protocol that is being used.

4.3.1.2 Conceptual Model Layer

The Conceptual Model layer implements the conceptual model described in Chapter 3. It allows NLP tools to use the Conceptual Model as their object model, thus simplifying the

creation of new NLP tools. Since the concepts used by NLP tools are usually similar, by providing the conceptual model we intend to avoid the definition of an equivalent model every time a new NLP tool is created. In addition, by using only the interfaces provided by the Conceptual Model layer, its concrete implementation can be changed without changing the NLP tool. This way, an NLP tool can be used as a stand-alone tool or as a client tool connected to the shared repository, simply by changing the implementation of the Conceptual Model layer.

The Conceptual Model layer elements are proxies (Gamma et al., 1995) for the elements of the Conceptual Model in the server. The methods invoked on those elements are delegated to the corresponding elements in the server.

The Repository can be used concurrently by several NLP tools, so it is possible that a tool consumes information that is being produced by another tool at the same time. If the consumer tool is faster than the producer tool and depletes the data that is being produced, the consumer tool may finish its processing earlier than expected, due to a lack of data. To avoid this situation, the iterators on the client side have a blocking behaviour: the method hasNext() only returns false when the Analysis that contains the data being iterated is closed and there are no more elements to iterate. However, this policy can result in a deadlock for the consumer application if the producer application ends abruptly without closing its Analysis. So, we introduced a time limit for which a client method can be blocked in the method hasNext().

4.3.1.3 Extra Layers

The Extra Layers provide extensibility to the client library. They represent new layers that can be added on top of the previous ones, enabling the creation of domain-specific layers which may simplify the creation of new NLP tools. For example, a morphological classification tool could use a layer that provided the concepts of word, phrase and text, with methods such as nextWord() and addGender(Word w), as sketched below. We expect that the usage of the client library by NLP tool developers will promote the creation of extra functionality in this layer, which will be made available to NLP tool developers.
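The interface below sketches what such a word-oriented extra layer might look like. Only nextWord() and addGender(Word w) come from the example above; the remaining names (Word, hasNextWord(), Gender) are illustrative assumptions, not part of the implemented library.

    /** Hypothetical word-level extra layer built on top of the Conceptual
        Model layer; a sketch only. */
    public interface WordLayer {

        /** Illustrative value type wrapping a Segment that holds one word. */
        interface Word {
            String text();
        }

        /** Illustrative tag set; a real layer might carry richer values. */
        enum Gender { MASCULINE, FEMININE }

        /** Blocks like the underlying iterators while the producing Analysis
            is still open and no word is available yet. */
        boolean hasNextWord();

        Word nextWord();

        /** Records a gender classification for the given word, ultimately
            stored as a Classification on the corresponding Segment. */
        void addGender(Word w);
    }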

The usage of Extra Layers can also provide semantic meaning to the linguistic information kept in the Repository for a given NLP system. For example, an Extra Layer may use the type attribute of the elements of the conceptual model to add semantics to that data. However, that semantic meaning is not present in the Repository itself; it is only achieved through the use of Extra Layers.

4.3.2 Java Implementation

In this section we describe a Java implementation of the client library used to communicate with the server implementation described in the previous section. We have implemented a Client Stub layer that communicates through XML-RPC with the server implementation described in Subsection 4.2.2, an implementation of the Conceptual Model, and two new interfaces for handling text as Extra Layers.

4.3.2.1 Client Stub

The Client Stub layer is implemented as a Java class, XmlRpcConnection, in the repository.clientlib package, and uses the XmlRpcClient class provided by the Apache XML-RPC package to connect to the server. The Client Stub is the only tie to the specific communication protocol and is used by all the other layers of the client library. The Client Stub has the same API as the remote interface described in Subsection 4.2.2.

4.3.2.2 Conceptual Model

The Conceptual Model layer consists of a set of Java interfaces, defined in the repository.clientlib.model.interfaces package, that are used by NLP tools. These interfaces contain all the elements defined in the conceptual model in Chapter 3. Each interface is implemented in the repository.clientlib.model package as a Java class. Each class is a Decorator (Gamma et al., 1995) of the respective DTO defined in Subsection 4.2.2.2, adding the methods defined for the conceptual model. Because of this

dependency, the conceptual model implementation depends on a specific DTO implementation. Each entity contains the connection object and uses it to remotely invoke the selected method on the server side.

The usage of the Conceptual Model layer by an NLP tool starts with the creation of a Repository element using the constructor Repository(String localhost). This constructor establishes the connection with the server and encapsulates the connection object. From this point on, the NLP tool interacts with the server through the conceptual model entities.

4.3.2.3 Extra Layers

In this implementation we have defined four interfaces in the extra layers (see Figure 4.5).

Figure 4.5: Extra layers interfaces.

The ClientRepository interface adds methods for the creation of specific types of SignalData and Data. The StringData is used to handle strings. The TextSignalData interface (a type of SignalData) and the TextIndex (a type of Index) are used to handle text kept in the repository. These interfaces simplify the usage of text by NLP developers. For example, the TextSignalData provides the method String getSubstring(int start, int end). This method is mapped onto the following methods from the conceptual model API:

Given a SignalData sd and two integer positions start and end:

    Index startIndex = sd.createIndex(start);
    Index endIndex = sd.createIndex(end);
    Data d = sd.getData(startIndex, endIndex);
    String data = new String(d.getData());

4.3.3 Solution Validation Tools

We have defined several sample NLP tools which perform simple tasks, in order to verify the feasibility of the client library and the server. These tools mimic the behaviour of real NLP tools. We have grouped those tools into two NLP systems. The first represents a usual NLP system composed of several NLP tools, such as a part-of-speech tagger or a syntactic analyser. The second is used to test the blocking behaviour of the iterators on the client side.

4.3.3.1 Simple NLP system

This system is composed of several NLP tools that execute in a pipeline manner. Each tool uses results produced by previous tools. In the following examples we present each tool by showing its behaviour, and what information it consumes and produces. Each tool receives as command line arguments the identification of the layers that it will use. Furthermore, each tool knows how the information is organized in each layer. Figure 4.6 presents the notation used in the examples, unless indicated otherwise.

Example 4.3 - Primary data source creator

This tool places a TextSignalData in the repository containing the following text: "John and Mary aren't from Great Britain. He made her duck." It uses the TextSignalData interface provided by the client library; a sketch follows.
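The sketch below shows what this tool could look like against the Java client library. The Repository constructor and getSubstring come from the preceding subsections; createTextSignalData and append are assumed names for the ClientRepository and TextSignalData conveniences, since the text does not fix them, and the package holding the extra-layer interfaces is likewise an assumption.

    import repository.clientlib.model.Repository;

    public class PrimaryDataSourceCreator {
        public static void main(String[] args) throws Exception {
            // Connect to the repository server (address is illustrative).
            Repository repository = new Repository("http://localhost:8080/");
            // Hypothetical ClientRepository convenience for creating a TextSignalData.
            TextSignalData text =
                    repository.createTextSignalData("original-text", "primary-data-creator");
            // Hypothetical convenience wrapping SignalData.addData(Data).
            text.append("John and Mary aren't from Great Britain. He made her duck.");
            System.out.println(text.getSubstring(0, 4)); // prints "John"
        }
    }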

Figure 4.6: Examples notation.

Example 4.4 - Word Tokenizer

This tool segments the text produced by the Primary data source creator tool into words.

It separates the contraction "aren't" into two different words, "are" and "not", using derived data, and represents the compound term "Great Britain" as an ambiguous Segment. It uses the TextSignalData method getTextIterator() to iterate over the characters of the text.

Example 4.5 - Part-of-speech Tagger

This tool assigns several part-of-speech classifications to each word produced by the Word Tokenizer tool. This tool classifies only the Segment containing the text "Great Britain", which corresponds to selecting one of the Alternatives that will be used by the next NLP tools. It iterates over the Segments contained in the Segmentation created by the Word Tokenizer tool.

Example 4.6 - Sentence Identification

This tool creates Segments identifying the sentences from the original text.

It uses the Segments produced by the Word Tokenizer tool. The relations between each sentence and the words composing it are represented using CrossRelation elements between the Segments from both Analyses.

Example 4.7 - Syntactic Parser

This tool performs a syntactic analysis over the original text. The tool produces individual syntactic trees for each sentence in the text, which are kept in separate Segmentations. In this figure we only show parts of Analyses one, two and three, and the trees for the sentence "He made her duck", to simplify the picture.

The picture shows how the syntactic parser uses information from four other layers. It uses the Segments from the Sentence Identification tool as the roots of the syntactic trees, and the Segments from the Word Tokenizer tool as the leaves of those trees. The Classifications produced by the Part-of-speech Tagger are used when building the trees.

Example 4.8 - Post-Syntactic Parser

This tool produces some Relations over the text. Namely, it produces two Relations indicating that the "he" in the second sentence corresponds to the "John" in the first sentence, and that the "her" in the second sentence corresponds to the "Mary" in the first sentence.

It produces Relations over the Segments from Example 4.4.

Example 4.9 - English to Portuguese Translator

This tool creates a new SignalData consisting of the translation of the original text into Portuguese.

The translation is performed over the syntactic tree on the left, with the meaning "he made her bend", from Example 4.7. The translator also produces a Segmentation performing the tokenization of the Portuguese translation into words. These words are then aligned with the English ones defined in Example 4.4, through the use of CrossRelations. One of the CrossRelations in the picture is displayed in black to highlight the inversion between the verb "are" and the adverb "not" between the two languages. Moreover, in the picture we can see how some words in English originate two words in Portuguese and vice versa.

With the previous examples we intended to show a realistic interaction between the

repository and several NLP tools. We showed how each tool uses information from different layers in the repository, and how it is possible to navigate through information from different layers.

4.3.3.2 Concurrent processing

We built a small system, composed of only two tools, to test the concurrent utilization of the Repository. We defined a producer tool which produces Segments over a text, and a consumer that iterates over those Segments to assign Classifications. The producer tool interacts with the user and produces a new Segment every time a key is pressed. This way we can test the blocking mechanism of the iterators by blocking and unblocking the consumer.

4.4 Summary

This chapter described the general architecture of our solution, and an implementation of that architecture. The main objective of this implementation was to demonstrate the feasibility of our solution. However, some issues still require future study, namely the efficiency of this architecture, the parallel manipulation of the data, and the persistent storage of the data. Nevertheless, this implementation already fulfils the requirements defined in Section 1.4.

Regarding efficiency, some analysis has to be performed with real NLP tools to verify the cost of using the remote architecture in the tools' global execution time. We expect the execution time of each tool to increase due to the remote calls. On the other hand, since the tools do not have to load and save all the existing data from the repository, we expect a decrease in the tools' execution time. We intend to measure these two factors in order to verify whether the usage of a remote architecture indeed results in a significant efficiency loss.

We imposed some restrictions on the conceptual model in the server's implementation regarding the changing and removal of existing information. These restrictions

allowed a simple implementation of the Repository, guaranteeing that the information represented under the conceptual model is always coherent. Nevertheless, additional work can be done regarding these restrictions, namely removing them. For example, the removal of a part of a SignalData would lead to the removal of all Segments that intersected the removed region, and consequently to the removal of all Classifications, Relations and CrossRelations targeting the removed Segments. Also, if some Data were added in the middle of a SignalData, the only change required would be to update that Signal's Indexes, and the rest of the Repository would remain coherent. The main problem with this approach is that, since the Repository may be accessed concurrently, a change to the Repository may invalidate the data that an NLP tool is using.

Another issue concerns the persistence of the data. We have implemented a simple mechanism that saves the entire Repository every time an Analysis is closed. This simple solution presents some problems. First, it is inefficient, since we always save all the data when a single Analysis is closed. Moreover, this approach may present some problems when dealing with concurrent interactions with the Repository.

As for the evolution of the architecture, it is our policy to avoid the addition of functionality to the server side. We prefer to add functionality on the client side, and to keep the server side as stable as possible. However, if future use of the architecture shows that methods can be added to the server's Service Layer to diminish the number of remote calls required to perform a common task, those methods will surely be added.

From our experience developing simple NLP tools, we noticed that it is very difficult to interact with the server directly through its remote interface. One must handle all the details of the XML-RPC protocol, which is time consuming and should not be the focus of an NLP tool developer. So, it is our conviction that all interactions with the repository will use the client library. Moreover, if a client library is not available in a specific language, a user that wishes to use the repository will end up writing one himself, at least the Client Stub layer. Bearing this in mind, we wish to provide client libraries in several programming languages, even if we do not provide a full implementation. We

also count on users of this framework to submit and make available the client libraries, including any extra layers, that they may develop.

5 Conclusion and Future Work

5.1 Summary

This dissertation presents an NLP framework for the integration of NLP tools. The main objectives of this framework are: (i) to support the easy integration of independent NLP tools, avoiding the information losses that normally occur when such integration is made; (ii) to simplify the creation of new NLP tools by providing general input/output facilities and an implementation of a conceptual model capable of representing a broad range of linguistic information; (iii) to minimize the number of converter components that are required when integrating NLP tools that do not comply with the framework's model; (iv) to enable the development of NLP tools that benefit from the existence of a repository of cross-related linguistic information.

We defined a set of requirements that such a framework should obey:

No information should be lost between NLP tools in an NLP system;

Each NLP tool should only produce the information that concerns it;

The framework should simplify the creation of new NLP tools, by providing an input/output interface, which handles the loading and saving of data used by the tool, and a data model capable of representing a broad range of linguistic information;

The use of the framework should minimize the number of conversion modules required to build an NLP system, when integrating tools that do not comply with the framework's model;

The provided interface should allow navigation between information produced by different NLP tools.

We proposed a solution based on a conceptual model capable of representing linguistic information, and of relating the information produced by several NLP tools. This solution satisfies those requirements.

Chapter 3 presented a conceptual model capable of representing all types of linguistic information and data sources. We implemented a text data source for testing purposes, and we showed how the framework provides abstractions that allow it to be independent of any specific data type. We described how the conceptual model can represent several types of linguistic phenomena, and how it is extensible to handle linguistic information that was not accounted for. However, the conceptual model is not minimal, allowing the representation of the same linguistic phenomenon in several ways. So, NLP tool developers must know how the information that they wish to use is represented in order to use it.

Chapter 4 then described the framework's architecture, where NLP tools act as clients of a server which contains all the linguistic information produced. This interaction is accomplished through a remote API. However, its usage requires the handling of communication protocol details, and of the serialized version of the elements of the conceptual model. So this framework also provides client libraries, which are modules that NLP tools can use to interact with the repository. The client library allows the tools to abstract from details such as the remote communication and the protocol being used between the client and the server. The client library also provides an implementation of the defined conceptual model, simplifying the creation of NLP tools, which can use the conceptual model as their object model. We observed that the object models used by several NLP tools are very similar. By providing a conceptual model to NLP tool developers, we hope to avoid the definition of similar object models every time a new NLP tool is created. However, a client library must be implemented in every programming language that we wish to use to develop NLP tools. We implemented a client library in Java. Several sample NLP tools were also

developed to test the interaction between NLP tools and the repository, and to provide some examples of the usage of the framework.

5.2 Future work

This section presents some items for future work regarding both the development of the created framework and its usage.

We intend to integrate the tools from our laboratory, which perform the initial steps of every NLP system, with this framework. These tools are:

Smorph (Aït-Mokhtar, 1998) - a morphology processor;

PAsMo (Paulo, 2002) - a rule-based rewriter;

Marv (Ribeiro et al., 2002) - a morphological disambiguator;

Susana (Batista and Mamede, 2002) - a syntactic parser;

AlgAS, Ogre, and Asdecopas (Coheur et al., 2004a, 2004b) - semantic domain tools.

The integration of these tools requires the implementation of a client library in C++, since this is the language used to implement these tools. The tools require some changes to use the framework. First, their input/output mechanisms must be changed so that they comply with the framework's API. Secondly, they must keep relations between their input and output data to allow data lineage. For example, when a tool separates a contraction, it must keep a relation between the new words resulting from the separation of the contraction and the original contracted term. That information will be used in the creation of cross-relations when producing the tool's output. Moreover, we plan to refactor the existing tools, by replacing their object model with our proposed conceptual model. It is our conviction that this change will simplify the tools.

The integration of these tools using our framework will probably lead to the refinement of the conceptual model, and to the extension of the provided APIs, especially the ones provided as extra layers on the client side. After the integration of the tools, some tests regarding the efficiency of an NLP system using this architecture are required. This analysis can lead to some improvements in the architecture, concerning the utilization of a cache in the client library to diminish the number of remote calls, the content of each DTO, and the methods provided by the Service Layer. We also plan to implement more types of primary data sources, e.g. audio files, and thus enable the usage of other kinds of NLP systems, like natural language generation.

The concurrent execution of NLP tools requires further analysis. First, some restrictions were imposed on the implementation of the conceptual model due to concurrent execution. We decided to forbid the removal of data and the addition of data into the middle of a SignalData, since this simplified our implementation. In the future, these restrictions should be removed. However, their removal requires some analysis to guarantee that the information that each client is manipulating is coherent with the information maintained by the repository. We also wish to evaluate the possibility of using caching mechanisms in the client libraries to reduce the number of remote invocations, and thus improve the framework's efficiency.

The persistent storage of the data in the repository should be improved. The first improvement consists in saving only the data that has changed, instead of dumping the entire repository every time an Analysis is closed. Another improvement consists in implementing persistence at the object level, with transaction support and locking mechanisms. This requires some analysis, together with the concurrent access to the server, to guarantee the consistent state of the data. The integration of the work defined in (Cachopo, 2005), which is still under development, is a possible path, since that work proposes a framework that simplifies the definition of persistent objects.

Another path for future work consists in using this framework in other research projects that are being developed in our laboratory, such as character identification in

stories, anaphora resolution, semantic analysis, and a tool to aid the creation of poems. These projects will benefit from the usage of this framework, since it allows them to access all the information produced by NLP tools, and to navigate through related information using cross-relations. Moreover, the requirements of these projects were one of the main reasons that led to the development of this framework, since they suffered from the problems identified in Section 1.2, which caused difficulties in their development.

Another problem regarding the integration of NLP tools concerns the tags used by each NLP tool to classify linguistic elements. Even if the data structures used by two NLP tools are the same, if the tools use different tag sets to classify the linguistic elements, they will not be able to communicate. This problem was addressed in (de Matos, 2005). Since this framework establishes a single entry point for the assignment of classifications, a conversion between tag sets could be performed to guarantee that inside the repository all tags are represented in the same way. Each NLP tool would indicate which tag set it requires, and receive the information with the proper tags.

We wish to promote this framework as an annotation framework, and to do so, a graphical interface has to be developed to allow the editing of its data. Moreover, some converter modules have to be defined to allow the usage of data annotated in other formalisms, for instance the Annotation Graphs formalism, in which large corpora have already been annotated and are publicly available. This particular usage of the framework will allow the annotation of the corpora existing in our laboratory.

As for issues that require future research, we notice that this framework did not answer two important questions regarding the integration of NLP tools. First, how does each tool know what data it must fetch from the repository? For this problem, we intend to integrate the repository with a workflow mechanism and a browser. This will enable the creation of NLP systems by simply selecting, from a browser, which tools the NLP system should have. The browser will deal with the selection of the data for each NLP tool. The second question is how an NLP tool interprets the data stored in the repository. In this work we assume that each NLP tool knows the exact representation

of the data. This approach might be too restrictive in terms of extensibility. Some research should be done in this field, namely on the definition of a meta-language that allows each tool to define its data pre-requirements and post-requirements, and on a way to match this information automatically.

5.3 Contributions

The major contribution of this work is a framework that solves some problems detected when integrating independently developed NLP tools to form NLP systems. In our laboratory, those problems were preventing some projects from advancing as desired, namely due to the information loss that occurs during the execution of the first steps of NLP systems, and to the lack of relations between the produced information. This contribution includes the definition of a conceptual model capable of representing a broad range of linguistic information. This conceptual model is available to NLP tool developers, avoiding the redefinition of similar models and thus reducing the time required to build a tool. The contribution also includes a framework, using that model, for the integration of NLP tools that avoids the information loss that usually occurs in such integration. An implementation of the framework was developed in Java, together with an implementation of a client library and some extra interfaces, also in Java. Moreover, this work promotes the development of new NLP tools by enabling new possibilities through the availability of a large amount of related linguistic information. Finally, some directions for future work were defined.

Bibliography

Aït-Mokhtar, S. (1998). L'analyse présyntaxique en une seule étape. Ph.D. thesis, Université Blaise Pascal.

Batista, F. and N. Mamede (2002, November). SuSAna: Módulo multifuncional de análise sintáctica de superfície. In J. Gonzalo, A. Peñas, and A. Ferrández (Eds.), Proc. Multilingual Information Access and Natural Language Processing Workshop, Sevilla, Spain, pp. 29-37. IBERAMIA 2002.

Bird, S., D. Day, J. Garofolo, J. Henderson, C. Laprun, and M. Liberman (2000). ATLAS: A flexible and extensible architecture for linguistic annotation.

Bird, S. and M. Liberman (1999). A formal framework for linguistic annotation. Technical Report MS-CIS-99-01, University of Pennsylvania, Philadelphia, Pennsylvania.

Black, A. W. and P. A. Taylor (1997). The Festival Speech Synthesis System: System documentation. Technical Report HCRC/TR-83, Human Communication Research Centre, University of Edinburgh, Scotland, UK. Available at http://www.cstr.ed.ac.uk/projects/festival.html.

Bontcheva, K., V. Tablan, D. Maynard, and H. Cunningham (2004). Evolving GATE to Meet New Challenges in Language Engineering. Natural Language Engineering 10(3/4), 349-373.

Cachopo, J. M. P. (2005). Composition Constructs for Separation of Concerns. Ph.D. thesis, Technical University of Lisbon. Work in progress.

Coheur, L., N. Mamede, and G. G. Bès (2004a). From a surface analysis to a dependency structure. In Workshop on Recent Advances in Dependency Grammar (COLING 2004), Geneva, Switzerland.

Coheur, L., N. Mamede, and G. G. Bès (2004b, October). A multi-use incremental syntax-semantic interface. In EsTAL - España for Natural Language Processing, Alicante, Spain. Springer-Verlag.

Linguistic Data Consortium (2006). http://www.ldc.upenn.edu/annotation/.

Corkill, D. (1991, January). Blackboard Systems. AI Expert 6(9).

Cotton, S. and S. Bird (2002). An integrated framework for treebanks and multilayer annotations. In Proceedings of the Third International Conference on Language Resources and Evaluation, pp. 1670-1677.

de Matos, D. M. M. (2005, July). Construção de Sistemas de Geração Automática de Língua Natural. Ph.D. thesis, IST - UTL.

de Matos, D. M. M., A. M. F. Mateus, J. de Almeida Varelas Graça, and N. Mamede (2002, October). Empowering the user: a data-oriented application-building framework.

de Matos, D. M. M., J. L. Paulo, and N. Mamede (2003, June). Managing linguistic resources and tools. pp. 135-142.

Fowler, M. (2002, November). Patterns of Enterprise Application Architecture. Addison-Wesley Professional.

Gamma, E., R. Helm, R. Johnson, and J. Vlissides (1995, January). Design Patterns. Addison-Wesley Professional.

Garofolo, J. S., L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren (1986). The DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM.

Godfrey, J., E. Holliman, and J. McDaniel (1993). Switchboard: Telephone speech corpus for research and development. In IEEE Conference on Acoustics, Speech, and Signal Processing, pp. 517-520.

Hayes-Roth, F., D. J. Mostow, and M. S. Fox (1978). Understanding speech in the Hearsay-II system. In L. Bolc (Ed.), Speech Communication with Computers, pp. 9-42. München: Carl Hanser Verlag; London: Macmillan.

Hirschman, L. and N. Chinchor (1997). MUC-7 coreference task definition. In Message Understanding Conference Proceedings.

Ide, N. and L. Romary (2001). Standards for language resources. In Proceedings of the IRCS Workshop on Linguistic Databases, University of Pennsylvania, Philadelphia, pp. 141-149.

Ide, N., L. Romary, and E. de la Clergerie (2003). International standard for a linguistic annotation framework.

Jacobson, M., B. Michailovsky, and J. B. Lowe (2001). Linguistic documents synchronizing sound and text. Speech Commun. 33(1-2), 79-96.

Laprun, C., J. Fiscus, J. Garofolo, and S. Pajot (1999). Recent improvements to the ATLAS architecture. Technical report, National Institute of Standards and Technology. http://www.nist.gov/speech/atlas/download/hlt2002-atlas.pdf.

Loper, E. and S. Bird (2002). NLTK: The Natural Language Toolkit. CoRR cs.CL/0205028.

MacWhinney, B. (1995). The CHILDES Project: Tools for Analyzing Talk. Mahwah, NJ: Lawrence Erlbaum.

Maeda, K., X. Ma, H. Lee, and S. Bird (2002, January). The Annotation Graphs Toolkit (Version 1.0): Application Developer's Manual. Linguistic Data Consortium, University of Pennsylvania. http://www.ldc.upenn.edu/ag/doc/tech report/tr1.ps.

NIST (1998). A universal transcription format (UTF) annotation specification for evaluation of spoken language technology corpora.

Petersen, U. (2004). Emdros - a text database engine for analyzed or annotated text. In COLING.

Apache XML-RPC project page (2006). http://ws.apache.org/xmlrpc/.

Reidsma, D., N. Jovanovic, and D. Hofs (2004). Designing annotation tools based on properties of annotation problems. Technical Report 04-45, CTIT, Enschede, NL.

Ribeiro, R., L. Oliveira, and I. Trancoso (2002). Morphossyntactic Disambiguation for TTS Systems. In Proc. of the 3rd Intl. Conf. on Language Resources and Evaluation, Volume V, pp. 1427-1431. ELRA. ISBN 2951740808.

Schiel, F., S. Burger, A. Geumann, and K. Weilhammer (1998). The Partitur format at BAS.

XML-RPC specification (2006). http://www.xmlrpc.com/.

Taylor, P., A. Black, and R. Caley (1998). The architecture of the Festival speech synthesis system.

A Conceptual Model API

This Appendix describes the programming-language-independent API defined for the conceptual model. The next table describes the available exceptions. The rest of this appendix presents a table for each entity of the conceptual model, describing its methods. In the entity listings, each method is shown as name(arguments) : return type - description; "Throws:" lists the numbers of the exceptions below that the method may raise.

Exceptions

LayerClosedException (1) - Indicates that an element is being added to a closed layer.
InvalidSegmentationException (2) - Indicates that a segment is being added to an invalid segmentation.
ExistingAnalysisException (3) - Indicates that a method is creating an analysis with the name of an existing analysis.
ExistingSignalDataException (4) - Indicates that a method is creating a SignalData with the name of an existing SignalData.
InvalidDataTypeException (5) - Indicates that an incorrect type of data is being used.
InvalidIndexTypeException (6) - Indicates that an incorrect type of index is being used.
IndexBoundsException (7) - Indicates that a SignalData is being accessed outside of its bounds.
UnexistingDataTypeException (8) - Indicates that a Data element is being created for an unsupported type.
UnexistingSignalDataTypeException (9) - Indicates that a SignalData element is being created for an unsupported type.
UnavailableOperationException (10) - Indicates that an operation is not implemented.
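To show how the entities fit together, here is a sketch of a typical call sequence, rendered in Java even though the API itself is language independent. The method names come from the tables below (with camelCase restored); the string arguments are illustrative.

    // Sketch only: assumes a Repository obtained from a client library.
    void tagFirstRegion(Repository repository) throws Exception {
        Analysis analysis = repository.createAnalysis("tokens", "word-tokenizer");
        SignalData signal = repository.getSignalDataByName("original-text");

        // Segment the whole signal as a single region.
        Segmentation segmentation = analysis.createSegmentation();
        Region region = signal.createRegion(signal.getStartIndex(), signal.getEndIndex());
        Segment segment = segmentation.createSegment(region);

        // Classify the segment and persist the analysis by closing it.
        Classification classification = analysis.createClassification(segment);
        classification.addFeature("category", "noun");
        analysis.close();
    }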

Entity: HasDescription

getDescription() : String - Returns the description of the element.
setDescription(description) - Sets the description of the element.

Entity: HasSemanticType

getType() : String - Returns the type of the element.
setType(type) - Sets the type of the element.

Entity: Iterator

hasNext() : boolean - Indicates if there are any more elements to iterate.
next() : Element - Returns the next element in the set being iterated.

Entity: Repository

getAnalysesIterator() : Iterator - Returns an iterator over all analyses.
createAnalysis(name, creator) : Analysis - Creates a new Analysis. Throws: 3.
getAnalysisByName(name) : Analysis - Gets an analysis by its name.
existsAnalysisName(name) : boolean - Tests if a name is already used by an analysis.
getSignalsDataIterator() : Iterator - Returns an iterator over all SignalData elements.
createSignalData(signalDataType, name, creator) : SignalData - Creates a new SignalData of a given type. Throws: 4, 9.
getSignalDataByName(name) : SignalData - Gets a SignalData by its name.
existsSignalData(name) : boolean - Tests if a name is already used by a SignalData.
createData(dataType, data) : Data - Creates a Data element of a given type. Throws: 8.
createCrossRelation(parent and child segments) : CrossRelation - Creates a CrossRelation.

Entity: Layer

getRepository() : Repository - Returns the repository.
getName() : String - Returns the name of the layer.
getCreator() : String - Returns the name of the creator of the layer.
getTimestamp() : Date - Returns the creation date of the layer.
isClosed() : boolean - Indicates if the layer is closed.
close() - Closes the layer.
open() - Opens the layer.

Entity: Data

getData() : byte[] - Returns a representation of the data element.

Entity: SignalData

addData(data) - Adds some data at the end of the SignalData entity. Throws: 1, 5.
addData(data, index) - Adds some data in the middle of the SignalData entity. Throws: 1, 5, 6, 7.
getData(start and end index) : Data - Returns the data of the SignalData entity between the argument indexes. Throws: 6, 7.
getData() : Data - Returns all data of the SignalData entity.
removeData(start and end index) - Removes the data between the argument indexes. Throws: 1, 6, 7.
getStartIndex() : Index - Returns an index for the beginning of the SignalData entity.
getEndIndex() : Index - Returns an index for the end of the SignalData entity.
createRegion(start and end index) : Region - Returns a region for the SignalData entity. Throws: 6.
createIndex(indexValue) : Index - Creates a new Index. Throws: 6.

Entity: Region

getSignalData() : SignalData - Returns the SignalData entity associated with this region.
getStartIndex() : Index - Returns the beginning of the region.
getEndIndex() : Index - Returns the end of the region.
getData() : Data - Returns the data referenced by the region.

Entity: Analysis

createSegmentation() : Segmentation - Creates a segmentation. Throws: 1.
getSegmentationsIterator() : Iterator - Returns an iterator over all segmentations.
hasSegments() : boolean - Tests if the analysis has segmentations.
createClassification(classifiable) : Classification - Creates a classification. Throws: 1.
getClassificationsIterator() : Iterator - Returns an iterator over all classifications.
hasClassifications() : boolean - Tests if the analysis has classifications.
createRelation(source and destination segment) : Relation - Creates a relation. Throws: 1.
getRelationsIterator() : Iterator - Returns an iterator over all relations.
hasRelations() : boolean - Tests if the analysis has relations.

Entity: Classifiable

addClassification(classification) - Adds a classification to a classifiable element.
hasClassifications() : boolean - Tests if a classifiable has classifications.
getClassificationsIterator() : Iterator - Returns an iterator over the classifications.

Entity: Relation

getSourceSegment() : Segment - Returns the relation's source segment.
getDestinationSegment() : Segment - Returns the relation's destination segment.
getAnalysis() : Analysis - Returns the relation's analysis.

Entity: Segmentation

getAnalysis() : Analysis - Returns the analysis to which the segmentation belongs.
createSegment(region) : Segment - Creates a new Segment. Throws: 1.
getSegmentsIterator() : Iterator - Returns an iterator over all segments of the segmentation.
addSegment(segment) - Adds a segment to the segmentation. Throws: 1, 2.
hasSegments() : boolean - Indicates if the segmentation has segments.

Entity: Segment

getSegmentation() : Segmentation - Returns the segment's segmentation.
getData() : Data - Returns the derived data of the segment if it exists, or the original data otherwise.
getOriginalData() : Data - Returns the original data.
getRegion() : Region - Returns the segment's region.
isHierarchical() : boolean - Indicates if the segment is hierarchical.
getChildsIterator() : Iterator - Returns an iterator over the segment's child segments.
getParent() : Segment - Returns the segment's parent.
setParent(segment) - Sets the segment's parent.
addChild(segment) - Adds a child segment.
isAmbiguous() : boolean - Indicates if the segment is ambiguous.
getAlternatives() : Iterator - Returns an iterator over the segment's alternatives.
createAlternative() : Alternative - Creates a new alternative. Throws: 1.
getRelationsIterator() : Iterator - Returns an iterator over the segment's relations.
hasRelations() : boolean - Indicates if a segment has relations.
hasCrossRelations() : boolean - Indicates if a segment has CrossRelations.
getCrossRelationsIterator() : Iterator - Returns an iterator over the segment's CrossRelations.
getCrossRelationToAnalysis(analysisName) : CrossRelation - Gets a CrossRelation to a specific analysis.

Entity: Alternative

getSegmentsIterator() : Iterator - Returns an iterator over the segments.
addSegment(segment) - Adds a segment to the Alternative. Throws: 1, 2.

Entity: Classification

getAnalysis() : Analysis - Returns the corresponding analysis.
getClassifiable() : Classifiable - Returns the corresponding classifiable.
getFeaturesIterator() : Iterator - Returns an iterator over the features.
addFeature(key, value) - Adds a feature to the classification. Throws: 1.
getFeature(key) : Feature - Gets a feature by its key.

Entity: Feature

getKey() : String - Returns the key of the feature.
getValue() : String - Returns the value of the feature.

Entity: CrossRelation

getParentAnalysisName() : String - Gets the name of the analysis containing the parent segments.
getChildAnalysisName() : String - Gets the name of the analysis containing the child segments.
getParentSegmentsIterator() : Iterator - Returns an iterator over the parent segments.
getChildSegmentsIterator() : Iterator - Returns an iterator over the child segments.
addParentSegment(segment) - Adds a parent segment to the CrossRelation.
addChildSegment(segment) - Adds a child segment to the CrossRelation.
addParentSegments(list) - Adds a list of parent segments to the CrossRelation.
addChildSegments(list) - Adds a list of child segments to the CrossRelation.

B Repository Server

B.1 Data Layer API

This appendix describes the methods that were added to the conceptual model to allow its use in the server. The PersistentXml interface provides a method that all elements must implement. This method is used to save the linguistic information persistently. In this implementation, persistence is achieved by serializing all objects into XML and then storing them in a file. The rest of the interfaces add methods to access an element by its identification, and methods to access lists of elements from the repository, used to provide iteration facilities.
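In Java the interface reduces to a single method. The sketch below assumes a DOM Element as the return type, since the table below only names the type Element.

    import org.w3c.dom.Element;

    /** Implemented by every repository element so that the whole object graph
        can be serialized to XML when an Analysis is closed (sketch; the
        concrete Element type is an assumption). */
    public interface PersistentXml {
        Element toXml();
    }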

Entity: PersistentXml

toXml() : Element - Serializes an element into XML.

Entity: Alternative

getSegments() : List - Returns a list with all segments.

Entity: Analysis

getSegmentations() : List - Returns a list with all segmentations.
getRelations() : List - Returns a list with all relations.
getClassifications() : List - Returns a list with all classifications.
getId() : int - Returns the id of the analysis.
getSegmentationById(id) : Segmentation - Returns a segmentation by its id.
getRelationById(id) : Relation - Returns a relation by its id.
getClassificationById(id) : Classification - Returns a classification by its id.

Entity: Classifiable

getClassifications() : List - Returns a list with all classifications.

Entity: Classification

getFeatures() : List - Returns a list with all features.
getId() : int - Returns the id of the classification.

Entity: CrossRelation

getParentSegments() : List - Returns a list with all parent segments.
getChildSegments() : List - Returns a list with all child segments.

Entity: Relation

getId() : int - Returns the id of the relation.

Entity: Repository

getAnalysisById(id) : Analysis - Returns an analysis by its id.
getAnalyses() : List - Returns a list with all Analysis elements.
getSignalDataById(id) : SignalData - Returns a SignalData by its id.
getSignalsData() : List - Returns a list with all SignalData elements.

Entity: Segment

getId() : int - Returns the id of the segment.
getChilds() : List - Returns a list with all child segments.
getAlternatives() : List - Returns a list with all alternatives.

Entity: Segmentation

getId() : int - Returns the id of the segmentation.
getSegments() : List - Returns a list with all segments.
getSegmentById(id) : Segment - Returns a segment by its id.

Entity: SignalData

getSdType() : int - Returns the type of the SignalData.
getId() : int - Returns the id of the SignalData.

B.2 Conceptual Model Persistent Format

This appendix presents the DTD used to serialize the repository in order to save it persistently.

B.3 Data Transfer Objects

This appendix describes the DTD used to serialize the DTOs. This serialization is performed to pass the DTOs between the server and the client and vice-versa.

B.4 Remote Interface API

This section describes the interface provided by the remote layer. All methods that return a Vector are methods that can throw an exception on the server side. The Vector always contains two positions: the first position indicates whether an exception was thrown; the second position holds a string with the serialization of the requested element or, if an exception was thrown, the serialized DTO identifying that exception. In the rest of this appendix, saying that a method throws an exception means that the method may return a vector whose second position is the serialization of the exception. It is the client side's responsibility to use that information and throw the corresponding exception on the client side.
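The following sketch shows how a client stub might unwrap this two-position Vector. The concrete type of the flag in the first position is an assumption, and a real stub would rebuild and throw the matching Java exception from the exception DTO instead of the generic one used here.

    import java.util.Vector;

    public final class ResultVectors {
        private ResultVectors() {}

        /** Unwraps a remote result: [exception flag, serialized DTO or exception DTO]. */
        public static String unwrap(Object rawResult) {
            Vector<?> result = (Vector<?>) rawResult;
            boolean hadException = Boolean.parseBoolean(result.get(0).toString());
            String payload = (String) result.get(1);
            if (hadException) {
                // Here the payload identifies the server-side exception; a real
                // client stub would map it to the corresponding Java exception.
                throw new RuntimeException("server-side exception: " + payload);
            }
            return payload; // the serialized DTO of the requested element
        }
    }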

All elements are represented as Strings corresponding to their DTO serialization format. So when a method refers to an element as a return or argument value, it means the serialization of that element's DTO. The methods are described in groups and presented in tables; each method is listed as name (arguments; source element) : return value.

The first group contains methods used to create and add elements to the conceptual model. All these methods return a vector, because they may throw a LayerClosedException.

createSignalDataType* (Type, Name and Origin) : Analysis
createAnalysis (Name and Origin) : Analysis
createClassification (Classifiable; source: Analysis) : Classification
createRelation (Source and Destination Segment; source: Analysis) : Relation
createSegmentation (source: Analysis) : Segmentation
createSignalData (type, name and creator) : SignalData
createSegment (region; source: Segmentation) : Segment
createSegmentAlternative (source: Segmentation) : Alternative
addSegmentToSegmentation** (Segment; source: Segmentation)
addData*** (data; source: SignalData)
addFeature (key and value; source: Classification)

* This method may also throw an InvalidSignalDataTypeException.
** This method may also throw an InvalidSegmentationException.
*** This method may also throw an InvalidDataTypeException.

The next group also contains methods to create and add information to the conceptual model. However, these methods may also be used on closed layers.

createCrossRelation (Vector of parent and child segments) : CrossRelation
setParentSegment (parent Segment; source: Segment)
addSegmentToAlternative* (Segment; source: Alternative)
addChildSegment (child Segment; source: Segment)

* This method may also throw an InvalidSegmentationException.

The following table contains methods that test whether an element has a set of other elements, e.g. whether an Analysis has any Classifications. All these methods return a boolean indicating whether such elements exist.

hasSegmentations (source: Analysis)
hasClassifications (source: Analysis)
hasRelations (source: Analysis)
segmentationHasSegments (source: Segmentation)
segmentHasRelations (source: Segment)
segmentHasCrossRelations (source: Segment)
segmentHasClassifications (source: Segment)
relationHasClassifications (source: Relation)

The next table describes miscellaneous methods used by the conceptual model.

existsAnalysisName (name) : boolean
existsSignalDataName (name) : boolean
closeAnalysis (source: Analysis)
openAnalysis (source: Analysis)
isAnalysisClosed (source: Analysis) : boolean
isHierarchical (source: Segment) : boolean
isAmbiguous (source: Segment) : boolean

This table describes getters for the entities of the conceptual model.

getAnalysisByName (name) : Analysis
getSignalDataByName (name) : SignalData
getData* (start and end Index; source: SignalData) : Data
getRelationById** (identification) : Relation
getSegmentById** (identification) : Segment
getParentSegment (source: Segment) : Segment
getCrossRelationToAnalysis (analysis name; source: Segment) : CrossRelation

* May throw the following exceptions: IndexBoundsException and InvalidIndexTypeException.
** Used to get the real elements from the identifications that are kept in the DTOs.

Finally, the last table presents functions that select the element at a given position inside a set. All of them return a vector with the semantics explained above. If there is no element at the required position, an UnexistingIterableElementException is thrown.

getRelationFromSegmentAtPos (source: Segment) : Relation
getSegmentParentCrossRelationAtPos (source: Segment) : CrossRelation
getSegmentChildCrossRelationAtPos (source: Segment) : CrossRelation
getSegmentChildAtPos (source: Segment) : Segment
getAnalysisAtPos (source: Repository) : Analysis
getSignalDataAtPos (source: Repository) : SignalData
getSegmentAtPos (source: Segmentation) : Segment
getSegmentFromAlternativeAtPos (source: Alternative) : Segment
getClassificationFromRelationAtPos (source: Relation) : Classification
getClassificationFromSegmentAtPos (source: Segment) : Classification
getSegmentationAtPos (source: Analysis) : Segmentation
getClassificationAtPos (source: Analysis) : Classification
getRelationAtPos (source: Analysis) : Relation
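These position-selection methods are what the client-side iterators are built on. The sketch below shows that pattern; ClientStub and UnexistingIterableElementException mirror the names in this appendix, but their Java signatures are assumptions.

    /** Sketch: iterating over a Segmentation through the position-selection calls. */
    public class SegmentIterationSketch {
        interface ClientStub {
            String getSegmentAtPos(String segmentationXml, int pos)
                    throws UnexistingIterableElementException;
        }

        static class UnexistingIterableElementException extends Exception {}

        static void forEachSegment(ClientStub stub, String segmentationXml) {
            for (int pos = 0; ; pos++) {
                try {
                    String segmentXml = stub.getSegmentAtPos(segmentationXml, pos);
                    System.out.println(segmentXml); // hand the serialized SegmentDTO to the tool
                } catch (UnexistingIterableElementException e) {
                    break; // no element at this position: the set is exhausted
                }
            }
        }
    }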