Natural Language Processing: current projects and research at the IXA Group
IXA Research Group on NLP, University of the Basque Country
Xabier Artola Zubillaga

Motivation
- A language that seeks to survive in the modern information society requires language technology products.
- Most working applications are available only for the "big" languages.
- "Minority" languages have to make a great effort to face this challenge.
Multilinguae, 22/11/00. IXA Research Group on NLP (UPV/EHU)

How to face the challenge?
- An open proposal for the development of language technology.
- Steps to take: from the necessary infrastructure to useful language engineering (LE) applications.
- Based on the twelve years of experience of the IXA Research Group in natural language processing applied to Basque.

IXA Research Group on NLP (UPV/EHU) (I)
- Main research fields: NLP, computational linguistics, language engineering.
- Goals: to collaborate on laying the foundations for research, and to develop language processing software.
- Application language: mainly Basque.

IXA Research Group on NLP (UPV/EHU) (II)
- 1986/1987: 4-5 university lecturers (CS)
- 2000/2001: ~30 members
  - 13 lecturers (11 doctorates, senior researchers)
  - 13 PhD students (research grants)
  - A few research assistants assigned to projects
- Interdisciplinary team: computer scientists and linguists
- Collaboration on linguistic aspects: UZEI, Elhuyar,...
- For more information: http://ixa.si.ehu.es

IXA Research Group on NLP (UPV/EHU) (III)
- Relationships with other universities in Euskal Herria; Madrid; Toulouse; Barcelona; Maryland, Las Cruces (USA); Sydney (Australia); Massey (New Zealand); Rome (Italy); Helsinki (Finland);...
- And with companies: Hizkia, Jalgi, Egunkaria, Microsoft, Xerox, LingSoft, LexiQuest,...
- Funding: local government, University of the Basque Country, Spanish Government,...

What is the language industry?
- A growing number of people use computer systems in their everyday life.
  - Remote information retrieval
  - Document writing and correction
- Many of these systems involve the use and processing of language.
  - Consultation of dictionaries and encyclopedias
  - Second language learning
  - Translation of documents
  - Electronic messaging
  - Automatic phone services

Some terminology on Natural Language Processing (I)
- NLP deals with the automatic processing of both spoken and written language: communicating with and through computers by means of everyday language.
- Computational linguistics, or computer-oriented linguistics: the formalisation of linguistic knowledge for computer processing.

Some terminology on Natural Language Processing (II)
- Language engineering (LE): the production of computer systems which can recognise, understand, interpret and generate human language in all its forms.
- Typical products of LE are language software systems (lingware) such as lemmatisers, phrase recognisers, word sense disambiguation programs, translation aids, etc.
- All this is usually gathered under the heading of Human Language Technology.

Underlying philosophy (I)
- Use, share, and reuse:
  - theories, formalisms, and methodologies
  - techniques and expertise
  - technology
- Build our own linguistic resources in order to develop:
  - general and specific tools
  - applications and end-user products

Underlying philosophy (II)
As an example:
- Several OCR programs claim to have Basque among the languages they are set up for.
- None of them includes language-specific information (dictionary, bigram or trigram information, etc.).
- In some of them, support for ŕ (r acute) and similar obsolete features is the only basis for that claim.
Result:
- These programs do not work as well with Basque texts as they do with other languages.

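To make the point about language-specific information concrete, here is a minimal sketch of how character trigram statistics could help an OCR system choose between two candidate readings of a word. It is an illustration only, not a description of any real OCR product: the function names, the tiny training sample, and the scoring rule are all assumptions made for this example.

```python
from collections import Counter

def char_trigrams(text):
    """Return the character trigrams of every word, with ^ and $ as word boundaries."""
    grams = []
    for word in text.lower().split():
        padded = f"^{word}$"
        grams.extend(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams

def train(sample_text):
    """Count the trigrams seen in a sample of text in the target language."""
    return Counter(char_trigrams(sample_text))

def score(candidate, model):
    """Fraction of the candidate's trigrams that were seen in the sample."""
    grams = char_trigrams(candidate)
    if not grams:
        return 0.0
    return sum(1 for g in grams if g in model) / len(grams)

def pick(candidates, model):
    """Choose the candidate reading that looks most like the target language."""
    return max(candidates, key=lambda c: score(c, model))

# Hypothetical training sample: a few inflected forms of "etxe" (house)
# and "mendi" (mountain). A real model would be trained on a large corpus.
SAMPLE = "etxea mendia etxean mendian etxeak mendiak"
model = train(SAMPLE)

# An OCR engine unsure between "e" and "c" can prefer the reading
# whose trigrams are attested in Basque text.
best = pick(["etxcan", "etxean"], model)
```

With this toy sample, "etxean" scores higher than "etxcan" because all of its trigrams are attested. A practical system would use smoothed probabilities rather than a seen/unseen test.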
Strategic priorities: from basic research to application development
Diagram showing the progression:
- Basic and applied research: linguistic foundations, linguistic resources
- Research and development: language tools, end-user applications

Linguistic foundations & resources, tools and applications
- Linguistic foundations and resources: necessary infrastructure for the automatic processing of a language.
- Tools: mainly intended for application developers.
- Applications: commercial or non-commercial, for non-specialised end-users.

Phase I: laying foundations
Diagram, arranged over the linguistic levels Phonetics, Lexicon, Morphology, Syntax, Semantics:
- Foundations and resources: machine-readable dictionaries (MRDs); computational description of morphology; basic Lexical Database; raw corpus (written texts and speech recordings)

Phase II: first basic tools and applications
Diagram, arranged over the linguistic levels Phonetics, Lexicon, Morphology, Syntax, Semantics:
- Applications and tools: Xuxen (spelling checker/corrector); lemmatiser/tagger; morphological analyser; statistical tools for the treatment of corpora
- Foundations and resources: MRDs; computational description of morphology; enriched Lexical Database; morphologically annotated corpus

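As a rough illustration of why a morphological analyser is a natural basis for a spelling checker in a richly inflected language, the sketch below accepts a word only if it splits into a known stem plus a known suffix. It is a toy under stated assumptions and does not describe how Xuxen works: the stems, the suffix list, and all function names are hypothetical, and real Basque morphology involves many more suffixes and morphophonological changes that are ignored here.

```python
# Hypothetical toy lexicon: two noun stems and a handful of case/number suffixes.
STEMS = {"etxe", "mendi"}                    # house, mountain
SUFFIXES = {"", "a", "ak", "an", "ko", "ra"}

def analyse(word):
    """Return every (stem, suffix) split licensed by the toy lexicon."""
    word = word.lower()
    return [
        (word[:i], word[i:])
        for i in range(1, len(word) + 1)
        if word[:i] in STEMS and word[i:] in SUFFIXES
    ]

def is_correct(word):
    """A word is accepted when it has at least one morphological analysis."""
    return bool(analyse(word))

def suggest(word):
    """Propose accepted forms that differ from the word by one substituted letter."""
    word = word.lower()
    forms = {stem + suffix for stem in STEMS for suffix in SUFFIXES}
    return sorted(
        form for form in forms
        if len(form) == len(word)
        and sum(a != b for a, b in zip(form, word)) == 1
    )

analysis = analyse("etxean")       # [("etxe", "an")]
suggestions = suggest("etxaan")    # ["etxean"]
```

The design point is that the checker needs no list of full word forms: adding one stem to the lexicon makes all of its inflected forms checkable at once.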
Phase III: more advanced tools and applications
Diagram, arranged over the linguistic levels Phonetics, Lexicon, Morphology, Syntax, Semantics:
- Applications and tools: basic CALL; electronic dictionaries; web crawler; grammar checker; environment for linguistic tools integration; Xuxen (spelling checker/corrector); lemmatiser/tagger; surface syntax analyser; morphological analyser; statistical tools for the treatment of corpora; WSD
- Foundations and resources: MRDs; computational description of morphology; computational grammar; Lexical Database; morphologically and syntactically annotated corpus; lexical-semantic KB

Phase IV: multilinguality and general use applications
Diagram, arranged over the linguistic levels Phonetics, Lexicon, Morphology, Syntax, Semantics:
- Applications and tools: NL generation, translation aids, dialog systems,...; information retrieval and extraction; advanced CALL; electronic dictionaries; web crawler; grammar checker; environment for linguistic tools integration; Xuxen (spelling checker/corrector); lemmatiser/tagger; syntax analyser; morphological analyser; WSD; statistical tools for the treatment of corpora
- Foundations and resources: MRDs; computational description of morphology; computational grammar; multilingual lexical-semantic KB; Lexical Database; morphologically, syntactically, and semantically annotated multilingual corpus

What not to do (I)
- Do not start developing applications before linguistic foundations are created. Follow, in general, the sequence stated above: foundations, tools, and applications.
- When a new system must be built, do not create ad hoc linguistic resources. Design these resources to be easily extended for full coverage and make them reusable by any other tool or application.

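The advice on reusable resources can be sketched in code: one lexical resource, kept separate from the tools, is shared by a lemmatiser and a dictionary lookup instead of each tool carrying its own ad hoc word list. This is a minimal sketch under stated assumptions; the class, the functions, and the sample entries are hypothetical and are not taken from the IXA Group's Lexical Database.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Entry:
    lemma: str
    pos: str
    gloss: str

class Lexicon:
    """A single shared resource: entries plus the suffixes that attach to them."""

    def __init__(self, suffixes):
        self._entries = {}
        self.suffixes = tuple(suffixes)

    def add(self, entry):
        self._entries[entry.lemma] = entry

    def get(self, lemma):
        return self._entries.get(lemma)

def lemmatise(word, lexicon):
    """Tool 1: strip a known suffix and return the lemma, or None if unknown."""
    word = word.lower()
    if lexicon.get(word):
        return word
    for suffix in lexicon.suffixes:
        if suffix and word.endswith(suffix) and lexicon.get(word[:-len(suffix)]):
            return word[:-len(suffix)]
    return None

def dictionary_lookup(word, lexicon):
    """Tool 2: an electronic-dictionary lookup that reuses the lemmatiser."""
    lemma = lemmatise(word, lexicon)
    entry = lexicon.get(lemma) if lemma else None
    return f"{entry.lemma} ({entry.pos}): {entry.gloss}" if entry else None

lexicon = Lexicon(suffixes=["a", "ak", "an"])
lexicon.add(Entry("etxe", "noun", "house"))

before = lemmatise("liburua", lexicon)          # unknown word for now
lexicon.add(Entry("liburu", "noun", "book"))    # extend the resource once...
after = lemmatise("liburua", lexicon)           # ...and every tool benefits
looked_up = dictionary_lookup("etxean", lexicon)
```

Extending coverage means adding entries to the shared lexicon; neither tool needs to change.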
What not to do (II): example
- Basque is a language with a very rich morphology.
- We decided not to begin with advanced applications (machine translation,...) but rather to develop a broad foundation based on lexicon and morphology.
- Those foundations have now become the basis for present and future developments.

Reusability is a must: example (I)
Diagram relating the following components: structured electronic dictionaries; lemmatiser; surface syntax parser; Lexical Database; machine-readable dictionaries

Reusability is a must: example (II)
Diagram relating the following components: translation aids; structured electronic dictionaries; word-sense disambiguation; morphological analyser; surface syntax parser; Lexical Database; corpus

What not to do (III)
- When you complete a new resource or tool, do not keep it to yourself:
  - many researchers in the world are working on English, but only a few on each minority language
  - we will not become rich (market criteria do not usually apply)
- Results should be public and shared for research purposes.

Conclusions
- A long-term strategy for research and development in language engineering.
- Based on the experience of the IXA Group in the automatic processing of Basque.
- Every foundation, tool, and application developed in the previous phases is of great importance in facing new problems and challenges.
- The development of a sound language industry should be the result of a coordinated effort involving research groups, institutions and industry.