Natural Language Processing: current projects and research at the IXA Group




Natural Language Processing: current projects and research at the IXA Group
IXA Research Group on NLP, University of the Basque Country
Xabier Artola Zubillaga

Motivation
- A language that seeks to survive in the modern information society requires language technology products.
- Most of the working applications are available only for the "big" languages.
- "Minority" languages have to make a great effort to face this challenge.
Multilinguae, 22/11/00 IXA Research Group on NLP (UPV/EHU)

How to face the challenge?
- Open proposal for the development of language technology.
- Steps to take: from the necessary infrastructure to useful language engineering (LE) applications.
- Based on the twelve years of experience of the IXA Research Group in natural language processing applied to Basque.

IXA Research Group on NLP (UPV/EHU) (I)
- Main research fields: NLP, computational linguistics, language engineering.
- Goal: to help lay the foundations for research and for the development of language processing software.
- Application language: mainly Basque.

IXA Research Group on NLP (UPV/EHU) (II)
- 1986/1987: 4-5 university lecturers (CS)
- 2000/2001: ~30 members
  - 13 lecturers (11 doctorates, senior researchers)
  - 13 PhD students (research grants)
  - A few research assistants assigned to projects
- Interdisciplinary team: computer scientists & linguists
- Collaboration on linguistic aspects: UZEI, Elhuyar, ...
- For more information: http://ixa.si.ehu.es

IXA Research Group on NLP (UPV/EHU) (III)
- Relationships with other universities in Euskal Herria; Madrid; Toulouse; Barcelona; Maryland, Las Cruces (USA); Sydney (Australia); Massey (New Zealand); Rome (Italy); Helsinki (Finland); ...
- Relationships with companies: Hizkia, Jalgi, Egunkaria, Microsoft, Xerox, LingSoft, LexiQuest, ...
- Funding: local government, University of the Basque Country, Spanish Government, ...

What is the language industry?
A growing number of people use computer systems in their everyday life. Many of these systems involve the use and processing of language:
- Remote information retrieval
- Document writing and correction
- Consultation of dictionaries and encyclopedias
- Second language learning
- Translation of documents
- Electronic messaging
- Automatic phone services

Some terminology on Natural Language Processing (I)
- NLP deals with the automatic processing of both spoken and written language: communicating with or through computers by means of everyday language.
- Computational linguistics, or computer-oriented linguistics: the formalisation of linguistic knowledge for computer processing.

Some terminology on Natural Language Processing (II)
- Language engineering (LE): production of computer systems which can recognise, understand, interpret and generate human language in all its forms.
- Typical products of LE are language software systems (lingware) such as lemmatisers, phrase recognisers, word sense disambiguation programs, translation aids, etc.
- All this is usually gathered under the heading of Human Language Technology.

Underlying philosophy (I)
- Use, share, and reuse:
  - theories, formalisms, and methodologies
  - techniques and expertise
  - technology
- Build our own linguistic resources in order to develop:
  - general and specific tools
  - applications and end-user products

Underlying philosophy (II)
As an example:
- Several OCR programs claim to have Basque among the languages they are set up for.
- None of them includes language-specific information (dictionary, bigram or trigram statistics, etc.).
- In some of them, support for ŕ (r acute) and similar obsolete features is the only basis for that claim!
Result: these programs do not work as well on Basque texts as they do on texts in other languages.
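To make the point about language-specific information concrete, here is a minimal sketch of how character trigram statistics could help an OCR engine choose between competing readings of a word. It is not taken from the talk or from any IXA tool: the class `TrigramModel`, the function `pick_candidate`, the add-one smoothing, and the tiny training text are all assumptions made for illustration. A real system would estimate the statistics from a large Basque corpus.

```python
from collections import Counter
import math


def char_trigrams(text):
    """Yield the character trigrams of each word, padded with boundary markers."""
    for word in text.lower().split():
        padded = f"^^{word}$"
        for i in range(len(padded) - 2):
            yield padded[i:i + 3]


class TrigramModel:
    """Add-one smoothed character trigram model (hypothetical, illustrative only)."""

    def __init__(self, training_text):
        self.counts = Counter(char_trigrams(training_text))
        self.total = sum(self.counts.values())
        # One extra slot reserves probability mass for unseen trigrams.
        self.vocab = len(self.counts) + 1

    def log_prob(self, word):
        """Return the smoothed log-probability of a single word."""
        return sum(
            math.log((self.counts[t] + 1) / (self.total + self.vocab))
            for t in char_trigrams(word)
        )


def pick_candidate(model, candidates):
    """Choose the OCR reading that the language model finds most plausible."""
    return max(candidates, key=model.log_prob)


# Toy training text (assumption); far too small for real use.
TOY_CORPUS = "etxea etxean etxetik etxera etxeak mendia mendian menditik"
model = TrigramModel(TOY_CORPUS)

# Two hypothetical OCR readings of the same printed word.
best = pick_candidate(model, ["etxean", "etxcan"])
print(best)
```

With statistics like these, a reading containing letter sequences that never occur in the language scores lower than one built from frequent sequences. An OCR program with no such data for Basque has no basis for that preference.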

Strategic priorities: from basic research to application development
[Diagram labels: Research & development; End-user applications; Language tools; Basic & applied research; Linguistic foundations; Linguistic resources]

Linguistic foundations & resources, tools and applications
- Linguistic foundations and resources: necessary infrastructure for the automatic processing of a language.
- Tools: mainly intended for application developers.
- Applications: commercial or non-commercial, for non-specialised end-users.

Phase I: laying foundations
[Diagram labels: MRDs; Comp. description of morphology; Basic Lexical Database; Raw corpus (written texts & speech recordings)]
[Linguistic levels shown: Phonetics, Lexicon, Morphology, Syntax, Semantics]

Phase II: first basic tools and applications
[Diagram labels: Xuxen: spelling checker/corrector; Lemmatiser/Tagger; Morphological analyser; Statistical tools for the treatment of corpora; MRDs; Comp. description of morphology; Enriched Lexical Database; Morphologically annotated corpus]
[Linguistic levels shown: Phonetics, Lexicon, Morphology, Syntax, Semantics]

Phase III: more advanced tools and applications
[Diagram labels: Basic CALL; Electronic dictionaries; Web crawler; Grammar checker; Environment for linguistic tools integration; Xuxen: spelling checker/corrector; Lemmatiser/Tagger; Surface syntax analyser; Morphological analyser; Statistical tools for the treatment of corpora; WSD; MRDs; Comp. description of morphology; Comp. grammar; Lexical Database; Morphologically and syntactically annotated corpus; Lexical-semantic KB]
[Linguistic levels shown: Phonetics, Lexicon, Morphology, Syntax, Semantics]

Phase IV: multilinguality and general use applications
[Diagram labels: NL generation, translation aids, dialog systems, ...; Information retrieval and extraction; Advanced CALL; Electronic dictionaries; Web crawler; Grammar checker; Environment for linguistic tools integration; Xuxen: spelling checker/corrector; Lemmatiser/Tagger; Syntax analyser; Morphological analyser; WSD; Statistical tools for the treatment of corpora; MRDs; Comp. description of morphology; Comp. grammar; Multilingual lexical-semantic KB; Lexical Database; Morphologically, syntactically, and semantically annotated multilingual corpus]
[Linguistic levels shown: Phonetics, Lexicon, Morphology, Syntax, Semantics]

What not to do (I)
- Do not start developing applications before linguistic foundations are created. Follow, in general, the sequence stated above: foundations, tools, and applications.
- When a new system must be built, do not create ad hoc linguistic resources. Design these resources to be easily extended for full coverage and make them reusable by any other tool or application.
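The advice against ad hoc resources can be sketched in code. The example below is purely illustrative and does not describe the IXA Group's actual Lexical Database: `LexicalDatabase`, `LexicalEntry`, `is_known_word`, and `lemmatise` are hypothetical names, and the analysis is naive suffix concatenation over a toy lexicon, which ignores real Basque morphophonology. The point is only that two tools read from one shared, extensible resource instead of each keeping a private word list.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LexicalEntry:
    lemma: str
    pos: str


@dataclass
class LexicalDatabase:
    """A single shared resource that any tool can query and extend (hypothetical)."""
    entries: dict = field(default_factory=dict)  # lemma -> LexicalEntry
    suffixes: tuple = ()

    def add(self, lemma, pos):
        self.entries[lemma] = LexicalEntry(lemma, pos)

    def analyse(self, word):
        """Return (entry, suffix) pairs whose concatenation yields `word`."""
        analyses = []
        for suffix in ("",) + tuple(self.suffixes):
            if not word.endswith(suffix):
                continue
            stem = word[: len(word) - len(suffix)] if suffix else word
            if stem in self.entries:
                analyses.append((self.entries[stem], suffix))
        return analyses


def is_known_word(db, word):
    """Spelling-checker view: a word is accepted if it has any analysis."""
    return bool(db.analyse(word))


def lemmatise(db, word):
    """Lemmatiser view: the same analyses, reduced to their lemmas."""
    return sorted({entry.lemma for entry, _ in db.analyse(word)})


# Toy data (assumption): two noun stems and a handful of case/number suffixes.
db = LexicalDatabase(suffixes=("a", "an", "ak", "tik", "ra"))
db.add("etxe", "noun")
db.add("mendi", "noun")

print(is_known_word(db, "etxean"))   # accepted via stem "etxe" + suffix "an"
print(lemmatise(db, "menditik"))     # lemma found via suffix "tik"
```

Adding a new stem or suffix to the database improves both tools at once, which is the kind of reuse the slide argues for.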

What not to do (II): example
- Basque is a language with a very rich morphology.
- We decided not to begin with advanced applications (machine translation, ...) but rather to develop a broad foundation based on lexicon and morphology.
- Now those foundations have become the base for present and future developments.

Reusability is a must: example (I)
[Diagram labels: Structured electronic dictionaries; Lemmatiser; Surface syntax parser; Lexical Database; Machine-Readable Dictionaries]

Reusability is a must: example (II)
[Diagram labels: Translation aids; Structured electronic dictionaries; Word-sense disambiguation; Morphological analyser; Surface syntax parser; Lexical Database; Corpus]

What not to do (III)
- When you complete a new resource or tool, do not keep it to yourself:
  - many researchers in the world are working on English, but only a few on each minority language
  - we will not become rich (market criteria do not usually apply)
- Results should be public and shared for research purposes.

Conclusions
- Long-term strategy for research and development in language engineering.
- Based on the experience of the IXA Group in the automatic processing of Basque.
- Every foundation, tool, and application developed in the previous phases is of great importance for facing new problems and challenges.
- The development of a sound language industry should be the result of a coordinated effort involving research groups, institutions and industry.