UNIVERSIDADE TÉCNICA DE LISBOA
INSTITUTO SUPERIOR TÉCNICO

A Framework for Integrating Natural Language Tools

João de Almeida Varelas Graça
(Licenciado)

Dissertação para obtenção do Grau de Mestre em Engenharia Informática e de Computadores

DOCUMENTO PROVISÓRIO

Fevereiro 2006
Abstract

Natural Language Processing (NLP) is a subfield of artificial intelligence and linguistics that studies the problems inherent to the processing and manipulation of natural language. NLP systems are typically characterized by a pipeline architecture, in which several NLP tools, connected as a chain of filters, apply successive transformations to the data that flows through the system. Usually, each tool is independently developed by a different researcher whose focus is on his/her own research problem rather than on the future integration of the tool into a broader system. Hence, when integrating such tools, one may face problems that lead to information losses, such as: (i) the output of a tool consists of the data it has acted upon and usually does not contain all the input data; this raises a problem if the discarded information is required by a tool that appears at a later stage of the pipeline; (ii) each tool has its own input/output format, so conversions between data formats may be needed when a tool consumes data produced by another one; this conversion may not be possible if the descriptive power of each format is distinct; (iii) the formats used by different tools do not establish relations between the input/output data. These relations are useful for aligning information produced at different levels and for avoiding the repetition of common data across them. These problems make the reuse of NLP tools in distinct NLP systems a cumbersome task. This dissertation proposes a solution to these problems, based on a client-server architecture. The server acts as a blackboard where all tools add and consult data. In our solution, a tool adds a layer of linguistic information over a data signal, and the system maintains the cross-relations between the existing layers of data. The data is kept in the repository under a conceptual model that is independent of the client tools and allows the representation of a broad range of linguistic information.
The tools interact with the repository through a generic remote API which allows the creation of new data and the navigation through all the existing data. Moreover, this work provides libraries, implemented in several programming languages, that abstract the connection and communication protocol details between the NLP tools and the server. These libraries also offer additional levels of functionality that simplify the creation of NLP tools.
Resumo

O Processamento de Língua Natural (PLN) é um ramo da Inteligência Artificial que estuda os problemas inerentes ao processamento e manipulação da Língua Natural. Os sistemas de PLN são normalmente caracterizados por uma arquitectura de canais e filtros, onde um conjunto de ferramentas de PLN aplica um conjunto sucessivo de transformações aos dados que fluem no sistema. Cada ferramenta é normalmente desenvolvida por um investigador, cuja preocupação se centra no seu problema e não na integração da sua ferramenta em futuros sistemas. Quando se integram diferentes ferramentas para criar um sistema, surgem tipicamente os seguintes problemas, que podem levar à perda de informação: i) a saída de cada ferramenta consiste nos dados que ela alterou, e pode não conter todos os dados de entrada. Este facto pode originar problemas se a informação que foi descartada for necessária para ferramentas que apareçam posteriormente no sistema; ii) cada ferramenta possui o seu próprio formato de dados, logo é necessário converter os diferentes formatos para permitir que as ferramentas comuniquem entre si. Adicionalmente, a expressividade de cada formato pode ser diferente, caso em que a conversão pode não ser possível; iii) as diferentes ferramentas não estabelecem relações entre os dados de entrada e saída, necessárias para alinhar os dados produzidos por diversas ferramentas e para evitar a replicação de informação. Estes problemas dificultam a reutilização de ferramentas de PLN em diferentes sistemas de PLN. Este trabalho apresenta uma solução para estes problemas, que consiste na utilização de uma arquitectura cliente-servidor. O servidor é um repositório usado pelas ferramentas para adicionar e consultar informação. Cada ferramenta adiciona um nível de informação sobre um sinal de dados e o sistema mantém relações entre os diversos níveis. Os dados são guardados no repositório sob um modelo conceptual, independente das diversas ferramentas, e que permite representar diversos tipos de informação linguística. As ferramentas interagem com o servidor através de uma interface remota que permite que estas adicionem dados e naveguem através de todos os dados existentes. Este trabalho oferece ainda bibliotecas implementadas em diversas linguagens de programação que abstraem os detalhes referentes ao protocolo de ligação e comunicação entre o cliente e o servidor. Estas bibliotecas oferecem funcionalidade acrescida às ferramentas, o que simplifica a sua criação.
Keywords & Palavras Chave

Keywords

Natural Language processing systems
Natural Language processing tools integration
Repository
Linguistic Annotation
Data lineage
Information loss

Palavras Chave

Sistemas de processamento de Língua Natural
Integração de ferramentas de processamento de Língua Natural
Repositório
Anotação Linguística
Alinhamento de dados
Perda de informação
Acknowledgments

I would like to express my gratitude to everyone that helped me during the development of this dissertation, provided me with their support, and endured my constant stress and bad temper. Without them this work would not have been possible. I would like to thank my Supervisor, Professor Nuno Mamede, for all his guidance over these years, his constant advice and corrections, and his never-ending patience towards my doubts and requests. I also cannot forget the support provided by my Co-Supervisor, João Dias Pereira. Both helped me immensely in the development of this dissertation. My thanks extend to the INESC-ID Spoken Language Systems lab team, who were extremely welcoming and cooperative, and particularly to those who worked more closely with me in this project: David Matos, Luísa Coheur, Ricardo Ribeiro, Joana Paulo, Fernando Baptista and Paula Vaz. I thank them for their suggestions and support. Thanks also to André Nascimento for his help with the proof-reading of the dissertation's text. To all my friends that have always been there for me, even when things seemed to be going wrong, thank you for your words of comfort and motivation. And last, but certainly not least, I thank my family, for their unconditional support, not only throughout this project, but also throughout my entire life. Lisboa, February 22, 2006. João de Almeida Varelas Graça
Contents

1 Introduction
  1.1 Motivation
  1.2 Objectives
  1.3 Proposed Solution
    Architecture
    Conversions between data formats
    Data lineage
    Summary
  1.4 Requirements
    1.4.1 Conceptual Model Requirements
    1.4.2 System Requirements
  1.5 Contributions
  1.6 Dissertation Structure

2 Related Work
  2.1 Introduction
  2.2 AGTK
  2.3 ATLAS
  2.4 EMDROS
  2.5 NLTK
  2.6 GATE
  2.7 Festival
  2.8 Summary

3 Conceptual Model
  3.1 Introduction
  3.2 Conceptual Model Entities
    Repository
    Data
    SignalData, Index and Region
    Analysis
    Segment and Segmentation
    Relation
    Classification
    CrossRelation
  3.3 Conceptual Model API
  3.4 Summary

4 Architecture
  4.1 Introduction
  4.2 Server architecture
    Server Architecture Description
      Data Layer
      Service Layer
      Remote Interface
    Server Architecture Implementation
      Data Layer
      Service Layer
      Remote Interface
      Server Architecture Interaction Examples
  4.3 Client Library
    Client Library Description
      Client Stub Layer
      Conceptual Model Layer
      Extra Layers
    Java Implementation
      Client Stub
      Conceptual Model
      Extra Layers
  4.4 Solution Validation
    Tools
    Simple NLP system
    Concurrent processing
  4.5 Summary

5 Conclusion and Future Work
  5.1 Summary
  5.2 Future work
  5.3 Contributions

A Conceptual Model API 95

B Repository Server 105
  B.1 Data Layer API
  B.2 Conceptual Model Persistent Format
  B.3 Data Transfer Objects
  B.4 Remote Interface API
List of Figures

1.1 Example of an NLP system using a pipes and filters architecture
1.2 Each tool passing all input data to output
1.3 Tool consuming data from different tools
1.4 Component combining data from different tools
1.5 An NLP system using the shared repository
1.6 External components converting data formats
1.7 Identification of words in a text
1.8 Ambiguous segmentation example
1.9 Classification Ambiguity example
2.1 AGTK internal structure
2.2 Annotation Graphs example
2.3 ATLAS region utilization example
2.4 ATLAS Children utilization example
2.5 EMDROS database example
2.6 Gate annotation graph example
2.7 Utterance example
3.1 Conceptual Model class diagram
3.2 Repository class diagram
3.3 SignalData class diagram
3.4 TextSignalData class diagram
3.5 Analysis class diagram
3.6 Segmentation and Segment class diagram
3.7 Relation class diagram
3.8 Classification class diagram
3.9 CrossRelation class diagram
3.10 Iterator class diagram
3.11 type and description class diagram
3.12 Complete conceptual model class diagram
4.1 Client Server Architecture
4.2 Server layers
4.3 DTO class diagram
4.4 Client Library Internal Structure
4.5 Extra layers interfaces
4.6 Examples notation
List of Tables

2.1 System requirements resume
2.2 Conceptual model requirements resume
1 Introduction

1.1 Motivation

Natural Language Processing (NLP) is a subfield of artificial intelligence and linguistics that studies the problems inherent to the processing and manipulation of natural language. It is devoted to making computers understand statements written in human languages. Several NLP systems have been developed to solve some of the major NLP tasks, such as question answering and dialog systems. Usually, NLP systems are composed of several NLP tools, where each tool is in charge of a specific linguistic task, such as word sense disambiguation or syntactic analysis. In most systems, tools are executed in a pipeline manner, where each tool performs a processing step. It is common for these tools to use results from the previous processing steps and to produce new linguistic information that can be used by the following processing steps. For instance, a word sense disambiguator may combine the output of a word tokenizer and the output of a part-of-speech tagger.

Tools are independently developed by distinct individuals whose focus is on their own problem rather than on the future integration of the tool into a broader system. Tool data formats are developed according to the tools' requirements, and normally the output of each tool does not contain all the input information, so some data is discarded between the input and output of each tool. Hence, when integrating such tools, several problems arise, mainly related to the following: (i) how the tools communicate with each other and (ii) what kind of information flows between them.

At the Spoken Language Systems Lab (L2F), where this work was developed, several NLP systems have already been created. Whenever tools were integrated to compose a system, most of the detected problems were related to the information flow between those tools, which can lead to information losses. These problems are:

Architectural problems - the information discarded along the system may be required further ahead by other tools;

Conversion between data formats - conversions are necessary between the different data formats. If the expressiveness of each format is different, some of the formats may not be completely mappable into other formats.

Besides these problems that lead to information losses, there is another problem concerning the data: how to maintain the data lineage between information produced by the different tools composing an NLP system. When viewing a tool's output as a layer of information over a primary data source, and considering that layers are normally related to each other, it is desirable to maintain relations between those layers. First, this preserves the data lineage between the different levels, allowing the navigation through related linguistic information produced by different tools. Secondly, tools can reference data from other layers, avoiding the repetition of common data. These types of relations are called cross-relations because they span across linguistic information layers.

Finally, a last problem concerns the fact that each NLP tool programmer usually develops his or her own data model to represent the linguistic information, along with Input/Output facilities for that data model. Since these different data models normally represent similar information, they tend to bear a strong resemblance to each other. The redefinition of such similar models represents a waste of time.

Figure 1.1: Example of an NLP system using a pipes and filters architecture.
Figure 1.1 shows a simple NLP system composed of three components. The system's initial input data is a text file. Component A performs the tokenization and part-of-speech tagging of the text, component B performs a post-morphologic analysis, and component C performs a syntactic analysis. In this example, two conversions between data formats are required: one between components A and B, and another between components B and C. This small system has the following problems:

Component B receives the results from component A, and one of its tasks is to separate the contractions used in the text. From that point on, the initial state of the text is lost.

The syntactic parser (C) only uses the morphologic features of the words identified by component A. Its output is a set of syntactic trees which do not have any reference to the original words in the text.

At this stage, the system has already lost the information about the words that belong to the text, which might be required by components added to the system in the future. Moreover, problems in the conversion between data formats may exist if the formats' expressiveness differs. For example, if a phonetic transcription of the original text were required, it would need both the data of the syntactic analysis, to select a possible interpretation of the words of the text, and the original words in the text. Since part of the information was lost throughout the system, this information would not be present. So the new component would require all of the tools' outputs as input, and it would have to merge the information between the tools. However, if the information produced by those tools were aligned, these problems could be overcome by following the relations between the different levels of data.

The rest of this chapter is organized as follows. Section 1.2 describes the objective of this dissertation.
Then, Section 1.3 proposes a solution to the detected problems, in the form of a shared repository for NLP tools. Section 1.4 defines a set of requirements that a solution for the detected problems must fulfil, and Section 1.5 describes the major contributions of this work. Finally, Section 1.6 describes the structure of this thesis.
1.2 Objectives

The main objective of this work is to build an NLP framework for the creation of NLP systems that:

Avoids information losses between the tools composing a system;

Simplifies the implementation of new NLP tools by providing general Input/Output facilities and a data model to represent linguistic information;

Keeps the data from the tools aligned, allowing the navigation through related data from different tools.

1.3 Proposed Solution

This section first describes some alternatives for solving the architectural and data format problems, and our solution to each one of them. Then it presents an objective that arises from the solution to those problems, concerning the data lineage between the information produced by different tools. Finally, it presents the overall solution, composed of the solutions to each individual problem.

Architecture

A possible approach to the architectural problem is guaranteeing that the tools produce all the information required for the next steps of the system. This simple approach only works when all of the system's tools are known in advance. Moreover, this approach demands that each tool produce extra information. This approach is not extensible, because the addition of new tools may force changes on existing tools.

A generalization of the previous alternative consists in guaranteeing that each tool passes all input information to its output (Figure 1.2). This strategy has the following problems: firstly, the tools must know how to manage a large amount of data, which
may not be related to the tools themselves. Secondly, each tool may have to load and parse a large amount of data upon its initialization, and consequently save a large amount of data when terminating. Finally, it complicates handling data from other tools simultaneously. For example, in Figure 1.3, component D, which is a morphologic disambiguation tool, requires the output of component B and of component C, both of which are part-of-speech taggers. In this example, component D has the responsibility of merging the common data from its input (the text and data A), besides performing its own processing. This merging operation is difficult and depends on the components used to produce the input, which leads to modifications in component D whenever component B or C is changed.

Figure 1.2: Each tool passing all input data to output.

Figure 1.3: Tool consuming data from different tools.

Another strategy, which does not impose any extra effort on tools, consists in having separate components which know how to combine information from different tools (see Figure 1.4). Nevertheless, this strategy requires building new components every time a new combination of tools is available. Moreover, information intersection is not a trivial task and requires a considerable effort. This strategy was proposed in Galinha (de Matos et al., 2002, 2003).
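The burden that the pass-everything strategy (Figure 1.2) places on each filter can be sketched as follows. This is an illustrative mock-up, not code from any of the systems discussed: the Tool interface, the layer names and the dummy tagger are all invented for the example. Note how every tool must copy forward layers it never uses:

```java
import java.util.*;

public class PassThroughPipeline {
    // Hypothetical filter interface: each tool receives ALL layers produced
    // so far and must re-emit them, even the ones it does not understand.
    interface Tool {
        Map<String, Object> process(Map<String, Object> layers);
    }

    static final Tool TOKENIZER = layers -> {
        Map<String, Object> out = new HashMap<>(layers); // forward everything
        String text = (String) layers.get("text");
        out.put("tokens", Arrays.asList(text.split("\\s+")));
        return out;
    };

    static final Tool TAGGER = layers -> {
        Map<String, Object> out = new HashMap<>(layers); // forward everything
        @SuppressWarnings("unchecked")
        List<String> tokens = (List<String>) layers.get("tokens");
        List<String> tags = new ArrayList<>();
        for (String t : tokens) tags.add(t + "/NOUN"); // dummy tagging
        out.put("tags", tags);
        return out;
    };

    public static Map<String, Object> run(String text, List<Tool> pipeline) {
        Map<String, Object> layers = new HashMap<>();
        layers.put("text", text);
        for (Tool t : pipeline) layers = t.process(layers);
        return layers;
    }

    public static void main(String[] args) {
        Map<String, Object> result = run("the cat", Arrays.asList(TOKENIZER, TAGGER));
        // "text" survives only because every tool copied it forward.
        System.out.println(result.keySet());
    }
}
```

The copy in each lambda is exactly the extra effort criticized above: every tool carries data unrelated to its own task, and the cost grows with the number of layers in the system.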
Figure 1.4: Component combining data from different tools.

Our solution consists in shifting the architectural pattern from pipes and filters to a client-server style, where the server follows a shared-data style. This solution proposes a shared repository where tools add new layers of information without changing the existing ones (see Figure 1.5). All linguistic information is available in the server. This way, tools only have to select the required information, avoiding the loading, parsing and saving of extra data. Since the server is itself a shared data store, we avoid information loss, because all information is kept in the repository and is never removed. Moreover, with this solution the information merging problem is avoided, since each tool only uses the layers it requires as input, adding the newly produced layer at the end.

Figure 1.5: An NLP system using the shared repository.

Conversions between data formats

A strategy to manage the different formats used by each tool consists in having specific components that are responsible for performing the data conversions (see Figure 1.6). Even if the creation of such components is simplified, for example, by imposing that
each tool consumes/produces data in XML format, and an XSLT engine is used to perform the conversions, it is still necessary to define an XSLT style sheet for each conversion.

Figure 1.6: External components converting data formats.

This approach has two drawbacks. The first is the need to define an XSLT style sheet for each data conversion. This may not be a trivial task, and if n distinct components exist, the number of possible style sheets grows at a rate of n*n. The second drawback is the expressiveness of each data format. This approach assumes that the content of each format is somehow describable in all the other formats, which may not be true; in that case there will be information losses when the transformation is performed. This was the approach followed in (de Matos et al., 2002, 2003).

Our solution consists in defining a conceptual model capable of representing a broad range of linguistic information produced by NLP tools. The information stored in the shared repository is described using this conceptual model. Moreover, the tools may also use this model to represent their own information, avoiding the need to perform any conversions. If the tools are not able to use this model, the model can be used as an interlingua between tools, so the number of necessary conversions is reduced from a rate of n*n to a rate of n. Since the model is able to represent a broad range of linguistic information, no information is lost due to lack of expressiveness of the format, because every possible data format should be mappable into this model.

Data lineage

An approach to the data lineage problem is adding semantic information to each data layer in the shared repository. For example, supposing a morphologic layer and a
syntactic layer, the layers' semantics would force the use of morphologic layer elements by the syntactic layer. This way, relations between layers would be established by the layers' semantics. However, with this approach it is difficult to have several layers of the same semantic type. It also restricts the types of layers it is possible to define, as well as the content of each layer. These restrictions are not desirable, since we require the solution to be generic enough to allow the creation of any NLP system, composed of any type of tools.

Since the information produced by all NLP tools will be kept in the same repository, represented under the same conceptual model, our solution consists in extending the conceptual model to allow the representation of cross-relations between the linguistic information belonging to different layers. Tools create cross-relations between their input/output data when creating the entities of the conceptual model.

Summary

The solutions previously proposed for each of the problems are not independent from each other. The overall solution consists in using a client-server architecture, where the server is a repository of linguistic information and primary data sources, represented under a conceptual model. The clients are NLP tools. All information produced by the clients is kept in the repository under a specific, uniquely identified layer. Each client can select information from the repository by selecting specific layers, and navigate through information from different layers using the cross-relations existing between the layers.

Furthermore, besides the described server, our proposed solution includes the definition of client libraries. A client library has the objective of simplifying the creation of NLP tools by abstracting the connection and communication protocol details between the NLP tools and the server.
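The overall solution can be illustrated with a small in-memory sketch. The class and method names below are hypothetical simplifications invented for this example; the actual repository is remote and persistent, and its entities are those of the conceptual model described later. The sketch shows only the key ideas: tools append layers, never modify existing ones, and navigate between layers via cross-relations.

```java
import java.util.*;

// Minimal in-memory sketch of the proposed shared repository (illustrative only).
public class RepositorySketch {
    static class Element {
        final String layer; final String value;
        final List<Element> crossRelations = new ArrayList<>();
        Element(String layer, String value) { this.layer = layer; this.value = value; }
    }

    final Map<String, List<Element>> layers = new LinkedHashMap<>();

    // Each tool adds elements to its own layer; existing layers are never
    // changed or removed, so no information is lost.
    Element add(String layer, String value, Element... related) {
        Element e = new Element(layer, value);
        for (Element r : related) e.crossRelations.add(r);
        layers.computeIfAbsent(layer, k -> new ArrayList<>()).add(e);
        return e;
    }

    // Selection by layer: a tool only handles the data it requires.
    List<Element> layer(String name) {
        return layers.getOrDefault(name, Collections.emptyList());
    }

    public static void main(String[] args) {
        RepositorySketch repo = new RepositorySketch();
        // Tokenizer tool: adds a segmentation layer over the signal.
        Element cats = repo.add("tokenizer", "cats");
        Element sleep = repo.add("tokenizer", "sleep");
        // Tagger tool: adds a new layer, cross-related to the tokens it used.
        repo.add("tagger", "NOUN", cats);
        repo.add("tagger", "VERB", sleep);
        // Navigation: from a tag back to the token it classifies.
        Element tag = repo.layer("tagger").get(0);
        System.out.println(tag.value + " -> " + tag.crossRelations.get(0).value);
    }
}
```

In the real system the `add`/`layer` operations correspond to calls on the remote API, and the cross-relations are first-class entities of the conceptual model rather than plain object references.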
Moreover, each client library provides several layers of functionality that simplify the use of the server and therefore the implementation of
new tools. For example, the client library can provide an implementation of the conceptual model where each element acts as a proxy for the corresponding server element. Currently, there is a version of the client library for the Java programming language.

1.4 Requirements

This section defines a set of requirements for a solution that supports and simplifies the integration of independently developed NLP tools into NLP systems. These requirements were used to validate the proposed solution, and are the following:

No information should be lost between tools in an NLP system;

Tools should only have to produce information directly related to their own task;

The solution should simplify the creation of new tools, by providing an Input/Output interface, which handles the loading and saving of the data used by the tool, and also by providing a data model that each tool can use to represent linguistic information;

The solution should minimize the number of conversion components required to build an NLP system when integrating existing NLP tools that do not comply with the system's model;

The provided interface should allow the navigation between information produced by different NLP tools.

To achieve the previous generic requirements, we defined two groups of requirements that the solution must fulfil, namely:

Conceptual Model Requirements - the requirements for a conceptual model capable of representing and relating a broad range of linguistic information, which are described in Subsection 1.4.1;
System Requirements - the requirements of the underlying system, related to the interaction between the system and the NLP tools, which are described in Subsection 1.4.2.

1.4.1 Conceptual Model Requirements

The conceptual model's main requirement is that it must be able to represent a broad range of linguistic information produced by different NLP tools. Furthermore, the conceptual model must be extensible, because it is impossible to foresee all kinds of linguistic information that may appear in the future. We begin by distinguishing two conceptually different kinds of information that the conceptual model must represent:

Primary data sources, such as a text or a speech signal;

Linguistic information produced by NLP tools over primary data sources or over previously defined linguistic information.

We also identify four types of actions that NLP tools may perform:

Creation and editing of primary data sources, for example, the incremental creation of a new primary data source containing the phonetic transcription of the text belonging to another primary data source. This newly created data source can be the target of the linguistic information of other tools;

Identification of linguistic elements from a primary data source, for instance, the segmentation of a sentence into words;

Creation of relational information between linguistic elements, such as the relation between a verb and the corresponding subject;

Assignment of characteristics to a linguistic element or a relation, for example, the morphological features of a word.
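The four action types above can be made concrete with a short sketch. All class and field names here are hypothetical, chosen purely for illustration; they are not the entities of the conceptual model defined later in this dissertation:

```java
import java.util.*;

public class ToolActions {
    // A segment references a region of the signal instead of copying its text.
    static class Segment {
        final int start, end;
        final Map<String, String> features = new HashMap<>();
        Segment(int s, int e) { start = s; end = e; }
    }
    // A typed relation between two previously identified elements.
    static class Relation {
        final String type; final Segment from, to;
        Relation(String type, Segment from, Segment to) {
            this.type = type; this.from = from; this.to = to;
        }
    }

    public static void main(String[] args) {
        // 1. Creation of a primary data source.
        String signal = "John sleeps";

        // 2. Identification of linguistic elements (segmentation into words).
        Segment john = new Segment(0, 4);
        Segment sleeps = new Segment(5, 11);

        // 3. Creation of relational information (verb -> subject).
        Relation subj = new Relation("subject", sleeps, john);

        // 4. Assignment of characteristics to a linguistic element.
        john.features.put("category", "proper-noun");

        System.out.println(signal.substring(subj.to.start, subj.to.end)); // John
    }
}
```

A single tool may perform several of these actions at once; for instance, a parser both identifies constituents (action 2) and relates them to each other (action 3).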
Each NLP tool may produce several types of information at the same time. The linguistic information generated by an NLP tool is normally derived from linguistic information created by other tools. For example, a part-of-speech tagger will use the segments produced by a tokenizer and add morphologic information to those segments. A morphological disambiguator may use the classifications produced by several part-of-speech taggers to select the most appropriate classification. The conceptual model must be able to represent several layers of both primary data sources and linguistic information. We have defined the following requirements regarding these two types of information:

The conceptual model must be able to represent any kind of primary data source, such as text, speech, video, or any combination of these;

The conceptual model must support the creation and editing of primary data sources;

All linguistic information (except primary data sources) produced by an NLP tool must be associated with the same layer;

The conceptual model must allow the selection of linguistic information through the identification of the layer that contains it;

Each layer is associated with the identification of the tool that produced it.

The last three requirements are necessary to simplify the identification of information inside the conceptual model. This way, all linguistic information is organized into layers identified by the tool that produced them. The conceptual model must represent the three types of linguistic information that each NLP tool can produce: (i) the identification of linguistic elements; (ii) the creation of relations between linguistic elements; and (iii) the assignment of characteristics to linguistic elements and relations. Figure 1.7 shows an example of the identification of linguistic elements, namely the words of a text.
Figure 1.7: Identification of words in a text.

We have defined the following requirements for the representation of those linguistic elements:

The model must be able to represent ambiguity in the identification of linguistic elements; for example, a compound term can be segmented as a single segment containing the whole compound term, or as one segment per word. Figure 1.8 shows an example of segmentation ambiguity;

The model must be able to represent trees of linguistic elements, for example, syntactic trees;

The model must allow the creation of relations between linguistic elements from other layers;

It must be possible to represent classification ambiguity, which corresponds to associating disjoint sets of characteristics with the same linguistic element, for example, distinct morphological features for the same word. Figure 1.9 shows distinct grammatical categories for the word "that";

The model must allow the association of characteristics with linguistic elements or relations from other layers.

Figure 1.8: Ambiguous segmentation example.
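One possible way to represent the two kinds of ambiguity required above can be sketched as follows. The structures are hypothetical and for illustration only: alternative segmentations are kept side by side over the same region, and alternative classifications are disjoint feature sets attached to the same element.

```java
import java.util.*;

public class AmbiguitySketch {
    static class Segment {
        final int start, end;
        // Each inner map is one alternative, mutually exclusive reading.
        final List<Map<String, String>> classifications = new ArrayList<>();
        Segment(int s, int e) { start = s; end = e; }
    }

    public static void main(String[] args) {
        String text = "data base";

        // Segmentation ambiguity: one compound segment vs. one segment per word.
        List<Segment> asCompound = Arrays.asList(new Segment(0, 9));
        List<Segment> asWords = Arrays.asList(new Segment(0, 4), new Segment(5, 9));
        List<List<Segment>> alternatives = Arrays.asList(asCompound, asWords);

        // Classification ambiguity: "that" as conjunction OR pronoun.
        Segment that = new Segment(0, 4);
        that.classifications.add(Map.of("category", "conjunction"));
        that.classifications.add(Map.of("category", "pronoun"));

        System.out.println(alternatives.size() + " segmentations, "
            + that.classifications.size() + " readings");
    }
}
```

Keeping every alternative, instead of forcing an early choice, is what later allows a disambiguation tool to select among the readings produced by earlier tools.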
Figure 1.9: Classification Ambiguity example.

Besides the representation of the linguistic information produced by each NLP tool, the conceptual model has the following requirements concerning the relations between linguistic information from different layers:

The model must be able to represent relations between linguistic elements from different layers. These relations represent dependencies between layers of information and allow the navigation between layers;

The conceptual model must allow linguistic elements to reference data belonging to a primary data source without having to copy its value, thus avoiding the repetition of the same data in several layers;

The model must be able to represent data which may not exist in any primary data source, for example, the separation of contractions.

1.4.2 System Requirements

This subsection presents the general requirements of the system, which are related not to its conceptual model but to the interaction between the system and the NLP tools. They are the following:

The system must simplify the iteration over data in the repository, e.g., all segments from a segmentation. This is required because iteration is the most common form of interaction between an NLP tool and its data;

The system must allow the selection of data based on the layer's identification; this way, an NLP tool only handles the data it requires;
The system must allow access to data from an unfinished Analysis, to allow the parallel processing of data. This way, an NLP tool may consume information that is being produced at the same time by another NLP tool;

The system must persist its data;

The system must interact with NLP tools written in any programming language.

1.5 Contributions

The main contributions of this work are:

The definition of a conceptual model that can be used as a common model by several NLP tools, and that is able to represent a broad range of linguistic information and different types of primary data, and to relate the represented information;

The implementation of a framework for NLP systems that reduces the effort required to implement and integrate NLP tools.

1.6 Dissertation Structure

Chapter 2 describes several works that solve problems similar to the ones addressed in this thesis. These works consist of architectures that try to simplify the creation of NLP systems based on independent NLP tools, and of linguistic annotation frameworks, which try to abstract the logical structure of linguistic annotations, thus creating a conceptual core capable of representing all kinds of linguistic information. Chapter 3 presents our conceptual model, which fulfils the requirements described in Subsection 1.4.1, by defining its entities and their responsibilities. Chapter 4 describes the proposed architecture for this work, its general principles, and an implementation. Finally, Chapter 5 makes some remarks on the developed solution and presents some pointers for future work.
2 Related Work

2.1 Introduction

This chapter presents a description of some proposals that we compared during the development of this work. The shared data style architecture was analysed. It consists of a data store used by several tools. A shared data style may be a repository or a blackboard. The main difference is that while the repository is passive and the clients access the data as required, the blackboard is active and defines the client execution order. However, the term blackboard is sometimes used in a broader sense, meaning any shared data style. The idea of the shared data style is based on a metaphor where a group of experts gathers around a blackboard and works cooperatively to solve a problem, using the blackboard as the workplace for developing the solution. The blackboard architecture is not a new technology: the first blackboard system, the Hearsay-II speech understanding system (Hayes-Roth et al., 1978), was developed in the 1970s. Some blackboard characteristics are described in (Corkill, 1991), from which we present those that also apply to a repository:

Independence of expertise - Each module is a specialist in solving a certain aspect of the problem. A module does not depend on other modules to produce its contributions. A module only has to find the information it needs inside the blackboard and then it proceeds with no assistance from other modules. New modules may be added to the blackboard without the need to change any existing modules. This corresponds to the notion of NLP tools developed by independent researchers, who are not able to foresee how their tools are going to be used
in broader systems;

Diversity in problem solving methods - In a blackboard system each module is a black box. The blackboard has no knowledge about the processing that each module performs, which corresponds to our notion of having a generic system that every NLP tool can use;

Flexible representation of blackboard information - A blackboard system does not place any restriction on the information that a module can add. This property follows our desire for semantic independence of the data, for the sake of generality;

Common interaction language - All modules interacting with the repository must have a common understanding of the data the blackboard holds, so a common language must be available and should be used by every tool using the repository. If a tool could place information on the repository without following the common language, that information would not be useful, since no other tool could use it properly. This property is achieved by the usage of a conceptual model, together with an interface that all tools must comply with;

Position Metrics - A module should not have to scan the entire blackboard, which can be very big, to find the information it requires. One solution is to divide the blackboard into regions, each corresponding to a particular group of information, which in our case corresponds to the information produced by each NLP tool;

Incremental solution generation - Blackboard systems operate incrementally. Each module contributes to the solution with whatever it finds appropriate, so it must be possible to represent partial solutions and unsolved ambiguities.

A related field, called linguistic annotation, was also analysed. It deals with tools and formats for creating and managing linguistic annotations. Linguistic annotation
covers any descriptive or analytic annotation over a raw source of data. For instance, the segmentation of a text into words, or the morphologic features of those words, are both linguistic annotations. The motivation for this field is the enormous need for manually annotated corpora in the NLP field. These annotated corpora validate results from NLP tools and help training statistical NLP tools. The creation of annotated corpora is a very demanding and expensive job. Moreover, as the diversity of annotations over a corpus increases, so does the value of that linguistic database. Therefore, there is a great focus on the reutilization of linguistic annotation databases, which led to the definition of several standards for linguistic annotation formats. However, the diversity of existing annotation formats makes the reutilization of such databases more difficult. Due to these problems, the development of linguistic annotation frameworks became a priority. The main objective of these frameworks is to develop a logical level of linguistic annotations independent of the annotations' physical format, which, together with an interface and modules to convert from/to the existing data formats, should promote the reutilization of annotated corpora. The logical level, whose focus is on the logical structure of linguistic annotations rather than their content, should be able to represent all kinds of linguistic annotations existing in the various formats, which is precisely the purpose of our conceptual model. The Linguistic Annotation home page (Consortium, 2006) collects information not only about tools that have been widely used for constructing annotated linguistic databases, but also about the formats commonly adopted by such tools and databases. Another path of research regarding linguistic annotations is the definition of a set of requirements for annotation formalisms.
We compared the requirements proposed in (Reidsma et al., 2004), and the requirements being defined by the International Organization for Standardization (ISO), which has formed a sub-committee (SC4) under technical committee 37 (TC37, Terminology and other language resources) to define a standard for linguistic annotation (Ide and Romary, 2001; Ide et al., 2003), against the ones we have defined. We also compared two existing linguistic annotation frameworks against our requirements. In Section 2.2 we compare the Annotation Graphs Toolkit (AGTK) (Maeda et al., 2002). The AGTK is an implementation of the Annotation Graphs formalism (Bird and Liberman, 1999), which is the most cited work in this area. Section 2.3 compares the ATLAS architecture (Bird et al., 2000; Laprun et al., 1999), which is a generalization of the Annotation Graphs formalism to allow the usage of multidimensional signals. We also compared several works regarding architectures that simplify the creation or integration of NLP tools towards their usage in NLP systems. In Section 2.4 we compare EMDROS (Petersen, 2004), which is an open source text database engine for the analysis and retrieval of analysed or annotated text. We continue in Section 2.5 by comparing the Natural Language Toolkit (NLTK) (Loper and Bird, 2002), which is a suite of Python libraries and programs for symbolic and statistical natural language processing. In Section 2.6 we compare GATE (Bontcheva et al., 2004), which is a general Java-based architecture for text engineering that promotes the integration of NLP tools by composing them into a pipes and filters architecture. Then, in Section 2.7, we compare the Festival speech synthesis system (Taylor et al., 1998; Black and Taylor, 1997), which is a general framework for building speech synthesis systems. Finally, in Section 2.8, we present a brief summary and some conclusions of this chapter.

2.2 AGTK

Annotation Graphs (Bird and Liberman, 1999) is a formal framework for representing linguistic annotations based on the analysis of several existing annotation formats. This analysis led to the development of a conceptual core, called Annotation Graphs, that according to Bird and Liberman (1999) can represent all kinds of linguistic annotations, thus serving as an interlingua between the different tools.
The analysed annotation formats include:

TIMIT (Garofolo et al., 1986) - a corpus of read speech designed for the acquisition of acoustic-phonetic knowledge;
Partitur (Schiel et al., 1998) - the format of the Bavarian Archive for Speech Signals, made from the collective experience of a broad range of German speech databases;

CHILDES (MacWhinney, 1995) - a database of transcript data collected from children and adults who were learning foreign languages;

LACITO (Jacobson et al., 2001) - a collection of recorded and transcribed speech data of unwritten languages;

NIST UTF (NIST, 1998) - a universal transcription format developed by the US National Institute of Standards and Technology;

Switchboard (Godfrey et al., 1993) - a corpus of conversational speech, containing several levels of distinct annotations;

MUC-7 (Hirschman and Chinchor, 1997) - the Message Understanding Conference format for representing linguistic annotations used in information extraction, named entity recognition and coreference resolution.

Since that analysis showed that the Annotation Graphs formalism supersedes all the other annotation formats, and since we are going to compare the Annotation Graphs formalism against our requirements, no further evaluation was performed concerning the analysed formats. The Annotation Graphs formalism focuses on annotating speech signals; however, the ideas can be extended to text. The Annotation Graphs Toolkit (AGTK) (Maeda et al., 2002) is a framework that provides an instantiation of the Annotation Graphs and simplifies the creation of annotation tools by providing a set of interfaces to access the data represented in that formalism. The AGTK architecture consists of three modules (see Figure 2.1):
Figure 2.1: AGTK internal structure.

AGLIB - The AGLIB provides an interface to the Annotation Graphs formalism, implemented in C++. It is composed of the core module, the AGAPI and TreeAPI (Cotton and Bird, 2002), and the ODBC and IO interfaces;

AG wrappers - The AG wrappers allow the connection to the AGLIB from several programming languages; for example, the Java wrapper provides an interface in Java to the AGLIB;

Input/Output Plugins - The Input/Output Plugins allow the definition of separate modules, which convert linguistic annotations between the Annotation Graphs formalism and other formats.

The architecture of the AGTK was developed in order to simplify the creation of annotation tools. An annotation tool reads the existing linguistic annotations using the interface provided by the system, performs the annotations, and then saves all the annotations. This approach is the same as the one illustrated in Figure 1.2 and presents the same problems. The Input/Output facilities of the AGTK impose that all annotations are loaded and saved every time, so a lot of data has to be parsed by each NLP tool, even if the tool is not using that data. Moreover, if an NLP tool uses the output of two NLP tools which were executed in parallel, it has to merge the annotations from both tools. Furthermore, an NLP tool cannot consume data that is being produced at the same time by another tool, so no parallel processing is possible.
Figure 2.2: Annotation Graphs example.

These problems exist even if the provided ODBC interface is used, because the ODBC interface only replaces the file Input/Output with a database Input/Output. The representation formalism of the AGTK is the Annotation Graphs formalism: an annotation graph is a directed acyclic graph where edges contain a set of features and nodes may contain a time offset. The time offsets are relative to the signal that is being annotated. A formal description of the annotation graphs model can be found in Bird and Liberman (1999). Figure 2.2 shows an annotation graph for a morphologic analysis. Each word is represented by two nodes, marking its beginning and its end. Each arc represents the grammatical classification of a word, or an indication of a white space. The AGTK defines an implementation of the Annotation Graphs formalism where several annotation graphs can be grouped, together with several Timelines, in an AGSet. A Timeline represents a group of Signals that share the same reference. Each node of the graph is represented by an Anchor element, which corresponds to a named offset into a Signal. The arcs of the graph are represented by Annotation elements, which have a specific type and a set of attribute-value pairs called Features. Some elements may have an attribute called Metadata, a set of attribute-value pairs that allows the addition of non-linguistic information to those elements.
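As a concrete illustration, the node-and-arc structure just described can be sketched as follows. This is a simplified, hypothetical rendering of the formalism for exposition only, not the actual AGTK API:

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Anchor:
    """A node of the graph: a (possibly absent) offset into a Signal."""
    offset: float | None = None

@dataclass
class Annotation:
    """An arc between two Anchors, with a type and attribute-value Features."""
    start: Anchor
    end: Anchor
    type: str
    features: dict = field(default_factory=dict)

# morphologic analysis of the text "the dog": each word is delimited by two
# nodes, and an extra "space" arc is needed just to keep the graph linked
n0, n1, n2, n3 = Anchor(0), Anchor(3), Anchor(4), Anchor(7)
graph = [
    Annotation(n0, n1, "word", {"pos": "DET", "surface": "the"}),
    Annotation(n1, n2, "space"),
    Annotation(n2, n3, "word", {"pos": "N", "surface": "dog"}),
]
```

The sketch already exposes the problem discussed below: the words exist only implicitly as pairs of nodes, and the "space" arc carries no linguistic content at all.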
Several problems exist concerning the representation of the linguistic information produced by an NLP tool. The first problem detected in the Annotation Graphs formalism is that it has no concept representing a linguistic element. It represents linguistic elements by defining two nodes pointing to their beginning and their end. This representation choice does not allow the representation of relational information between linguistic elements, since that information would correspond to arcs between other arcs. This problem concerns both the concept of relational information that we defined in the requirements and the relations between different layers. Another problem arising from the absence of a representation for linguistic elements is that, in order to keep a linked graph, we must represent arcs which do not correspond to any linguistic element. For example, in Figure 2.2 we had to use arcs to represent the spaces between words. Another problem arises from the fact that the Annotation Graphs do not have the concept of classification ambiguity, in which a linguistic element may have several classifications. This ambiguity may be represented as several arcs between the same nodes; however, using this alternative we cannot easily distinguish several alternative classifications from other annotations performed over the same zone of the Signal. The representation of segmentation ambiguity and hierarchical segmentation is possible using different arcs between the existing nodes, but again we lack the structural semantics of those elements, which are conceptually different, thus making the utilization of the annotations more difficult, even if the Metadata is used to provide information about the semantic role of each arc. The Annotation Graphs formalism does not allow the edition of primary sources of data. The Anchor used to identify points in the Signal is a number that corresponds to an offset in a file.
This representation is very restrictive and does not allow, for example, the case of multidimensional signals. Regarding the integration of linguistic information produced by several NLP tools, the Annotation Graphs formalism presents some problems as well. The model has no concept of layer. If a layer is represented as an individual Annotation Graph, it is not possible to access data from other layers, which is not acceptable for our purposes. Another alternative is to keep all information in a single Annotation Graph and use the Metadata to identify the information produced by each tool, but this alternative has several problems. Firstly, using this approach it is not possible to have several layers over different Signals. Secondly, it would be too difficult to select a sub-graph containing only the information produced by one NLP tool.

2.3 ATLAS

ATLAS (Bird et al., 2000; Laprun et al., 1999) provides a framework aimed at simplifying the development of linguistic annotation tools. The ATLAS architecture extends the Annotation Graphs formalism to handle multidimensional signals. There is an implementation of the ATLAS architecture in Java. However, this project was abandoned before any tool had used the proposed formalism. Nevertheless, the generalization performed over the Annotation Graphs has some interesting features. The ATLAS architecture follows the same principles and therefore has the same problems as the AGTK architecture, so we will not describe them here. The ATLAS annotation model extends the Annotation Graphs formalism in three aspects: i) it allows the representation of multidimensional signals; ii) it provides a better representation of hierarchical structures; iii) it adds semantic information to the elements of the model. These extensions resulted in a more generic model: all data represented in an Annotation Graph can be represented in the ATLAS model, but the converse is not true. The representation of multidimensional regions is achieved by the introduction of the Region element, an abstraction representing a zone in a Signal which is delimited by two coordinates represented by Anchor elements. The Anchor element is the only tie between Annotations and the Signal. Figure 2.3 illustrates an annotation using the ATLAS model. There is a Signal which is a text, and two Regions selecting the
words The and pretty. In this example, the Anchors used by the Region elements consist of numbers which indicate character positions in the text. However, if another kind of Signal was used, only the type of the Anchor elements would change.

Figure 2.3: ATLAS region utilization example.

The addition of the Region concept allows the ATLAS model to represent all types of media. However, the model does not allow the editing of Signals. The representation of hierarchical structures was improved with the addition of the Children element, which contains a list of other Annotations that are descendants of a parent Annotation. Figure 2.4 illustrates the utilization of the Children elements, where one annotation contains several child Annotations. The ATLAS architecture provides a semantic level, called Meta-Annotation Infrastructure for ATLAS (MAIA), which adds semantic information to the elements of the model. MAIA defines a type system used by each ATLAS element. The type of an element restricts the possible relations between elements and their features. The extension performed by the ATLAS architecture provides a better representation for conceptually different linguistic phenomena, such as hierarchical trees and ambiguous segmentations, through the Children elements. It also fulfils the requirement of media independence. But even so, this model still presents the main problems described for the Annotation Graphs formalism.
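The Region and Children extensions can be sketched as follows. The names mirror the description above but are a hypothetical illustration, not the actual ATLAS API:

```python
from dataclasses import dataclass, field

@dataclass
class Anchor:
    """One coordinate into a Signal; its type depends on the Signal kind."""
    value: int

@dataclass
class Region:
    """A zone in a Signal, delimited by two Anchor coordinates."""
    start: Anchor
    end: Anchor

@dataclass
class Annotation:
    """An annotation over a Region; Children holds descendant Annotations,
    giving a direct representation of hierarchical structures."""
    region: Region
    type: str
    children: list = field(default_factory=list)

# two word annotations over a text Signal, grouped under a parent annotation
text = "The pretty dog"
the = Annotation(Region(Anchor(0), Anchor(3)), "word")
pretty = Annotation(Region(Anchor(4), Anchor(10)), "word")
phrase = Annotation(Region(Anchor(0), Anchor(10)), "phrase",
                    children=[the, pretty])
```

For a different kind of Signal (e.g. audio), only the Anchor coordinate type would change, which is the media independence the model provides.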
Figure 2.4: ATLAS Children utilization example.

Nevertheless, the idea of representing zones in a data source using regions was used in our work, since it allows the rest of the model's interface to be independent of specific data source types.

2.4 EMDROS

EMDROS (Petersen, 2004) is a text database engine for the analysis and retrieval of analysed or annotated text. The EMDROS system is composed of four layers:

Client Layer - Represents the NLP tools that use the services provided by EMDROS;

MQL layer - Provides an interface to the MQL query language, which uses the EMdF layer to translate the MQL queries into SQL calls to the database layer;

EMdF layer - Defines the annotation model used by the database;
Database layer - Represents a relational database which persistently stores the linguistic information.

The principal concept of the EMdF is the Object element, which represents a linguistic element. It contains a set of Monads, which correspond to the minimum granularity units that the database may have. Objects are grouped into Object Types, which define the possible Features that each specific Object may have. A Feature is an attribute-value pair. Feature elements are the entities that store all the existing data in the database.

Figure 2.5: EMDROS database example.

Figure 2.5 shows an EMDROS database. The database contains six monads, one for each character, which in this case is the minimum granularity unit. There are two Object Types, letter and name. Each letter Object has the Feature surface, which contains the letter that the Object is representing. The name Object contains no Features. There are six letter Objects, each containing one Monad, and one name Object containing a set with the six existing Monads. The EMdF model uses the Monads to establish relations between different Objects. In this example, the name Object contains all the letter Objects, because its set of Monads contains all the letter Objects' Monads. This model is too restrictive for our purposes: the EMdF limits the data source type to text; each linguistic element is identified as an Object which has a single set of Features, so it is not possible to represent several classifications for the same Object; and relational information between Objects from the same layer is also not possible.
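The monad-based containment just described can be sketched with Python sets standing in for Monad sets. This is a hypothetical rendering of the EMdF idea for exposition, not the actual EMDROS API:

```python
from dataclasses import dataclass, field

@dataclass
class EmdfObject:
    """An EMdF-style Object: a typed set of monads plus attribute-value Features."""
    object_type: str
    monads: frozenset
    features: dict = field(default_factory=dict)

    def contains(self, other: "EmdfObject") -> bool:
        # an Object contains another when its monad set is a superset
        return other.monads <= self.monads

# six monads, one per character: six letter Objects, each holding one monad
# and a surface Feature, plus one name Object covering all six monads
letters = [EmdfObject("letter", frozenset({i}), {"surface": c})
           for i, c in enumerate("pretty", start=1)]
name = EmdfObject("name", frozenset(range(1, 7)))
```

Because relations exist only through monad-set inclusion, two Objects of the same type cannot be directly related to one another, which illustrates the restriction noted above.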