Research Portfolio. Beáta B. Megyesi January 8, 2007




Research Activities

My research activities focus mainly on four areas:

Natural language processing

During the last ten years, since I started my academic career in computational linguistics, my main research topics have been corpus linguistics, part-of-speech tagging, morphological analysis, and shallow syntactic analysis (e.g. chunking, parsing), mainly using machine learning techniques for Swedish, English and Hungarian. I am also interested in using rule-based finite-state techniques to build a shallow syntactic analyzer for Swedish that provides both phrase-structure and dependency analysis. My work within NLP has been published at both national and international conferences; see papers 1, 2, 3, 4, 5, 8, 9, 10, 14, and 15 under Publications.

Speech research

Within speech research at KTH, the main purpose was to improve speech synthesis for Swedish. For a better-sounding text-to-speech system, in-depth analysis is needed to find out the relationship between prosodic and linguistic structure. Therefore, I studied the relationship between prosody, in terms of prosodic breaks, and linguistic structure in various speaking styles, in both spontaneous and non-spontaneous speech in different communicative situations. The results are published mainly at well-known international conferences; see papers 6, 7, 11, 12, 13, 15, 16, and 17 under Publications.

Parallel corpora and machine translation

For the last two years, I have been working on the development of a parallel corpus between Swedish and Turkish that can be used for machine translation as well as for linguistic analysis of the languages involved. This work has been carried out within the project Supporting research environment for minor languages (Classic, Turkish and Hindi) and serves as a pilot project aiming at developing methods to automatically build parallel corpora between various language pairs that might belong to different language types. The outcome of the pilot project is presented in paper 19 under Publications.

Text categorization

During the last year, I began to look into the automatic classification of texts into genres and text types using machine learning techniques, where the focus is on knowledge representation to explore linguistic features, both semantic (by means of automatically extracted keywords) and morpho-syntactic (e.g. part-of-speech, syntactic phrases and depth); see papers 18, 20 and 21 under Publications.

Publications

Reviewed Papers

The papers are available at http://stp.lingfil.uu.se/~bea/

1. Megyesi, B. 1999. Brill's PoS Tagger with Extended Lexical Templates for Hungarian. Workshop (W01) on Machine Learning in Human Language Technology, ACAI 99, Chania, Crete, Greece, July 5-16, 1999.
2. Megyesi, B. 1999. Improving Brill's PoS Tagger for an Agglutinative Language. In Proceedings of the Joint Sigdat Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC 99), pp. 275-284, University of Maryland, USA, June 21-22, 1999.
3. Megyesi, B. & Rydin, S. 2000. Towards a Finite-State Parser for Swedish. In Proceedings of NoDaLiDa 1999, pp. 115-123, Trondheim, Norway, December 9-10, 1999.
4. Berthelsen, H. & Megyesi, B. 2000. Ensemble of Classifiers for Noise Detection in PoS Tagged Corpora. In Proceedings of the Third International Workshop on TEXT, SPEECH and DIALOGUE, pp. 27-32, Brno, Czech Republic, September 13-16, 2000. Springer-Verlag, LNCS/LNAI series.
5. Megyesi, B. 2001. Data-Driven Methods for PoS tagging and Chunking of Swedish. In Proceedings of NoDaLiDa 2001, Uppsala, Sweden, May 21-22, 2001.
6. Gustafson-Čapková, S. & Megyesi, B. 2001. A Comparative Study of Pauses in Dialogues and Read Speech. In Proceedings of Eurospeech 2001, Volume 2, pp. 931-935, Aalborg, Denmark, September 3-7, 2001.
7. Megyesi, B. & Gustafson-Čapková, S. 2001. Pausing in Dialogues and Read Speech: Speakers' Production and Listeners' Interpretation. In Proceedings of the Workshop on Prosody in Speech Recognition and Understanding, pp. 107-113, New Jersey, USA, October 22-24, 2001.
8. Megyesi, B. 2001. Comparing Data-Driven Learning Algorithms for PoS Tagging of Swedish. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2001), pp. 151-158, Carnegie Mellon University, Pittsburgh, PA, USA, June 3-4, 2001.

9. Megyesi, B. 2001. Phrasal Parsing by Using Data-Driven PoS Taggers. In Proceedings of the Conference on Recent Advances in Natural Language Processing, Euro Conference RANLP-2001, pp. 166-173, Tzigov Chark, Bulgaria, September 5-7, 2001.
10. Megyesi, B. 2002. Shallow Parsing with PoS Taggers and Linguistic Features. Journal of Machine Learning Research: Special Issue on Shallow Parsing, JMLR (2), pp. 639-668, MIT Press.
11. Gustafson-Čapková, S. & Megyesi, B. 2002. Silence and Discourse Context in Read Speech and Dialogues in Swedish. In Proceedings of the Speech Prosody 2002 conference, Bernard Bel & Isabelle Marlien (eds.), pp. 363-366, Aix-en-Provence, France, April 11-13, 2002.
12. Carlson, R., Granström, B., Heldner, M., House, D., Megyesi, B., Strangert, E. & Swerts, M. 2002. Boundaries and groupings - the structuring of speech in different communicative situations: a description of the GROG project. In Proceedings of Fonetik 2002, TMH-QPSR Volume 44, pp. 65-69, Stockholm, Sweden, May 29-31, 2002.
13. Megyesi, B. & Gustafson-Čapková, S. 2002. Production and Perception of Pauses and their Linguistic Context in Read and Spontaneous Speech in Swedish. In Proceedings of ICSLP 2002, 7th International Conference on Spoken Language Processing, Denver, USA, September 16-20, 2002.
14. Megyesi, B. & Carlson, R. 2002. Data-Driven Methods for Building a Swedish Treebank. Extended abstract to the Swedish Treebank Symposium, November 28-29, 2002, Växjö University, Sweden.
15. Megyesi, B. 2002. Data-Driven Syntactic Analysis - Methods and Applications for Swedish. Doctoral Dissertation, Department of Speech, Music and Hearing, Kungliga Tekniska Högskolan.
16. Heldner, M. & Megyesi, B. 2003. Exploring the Prosody-Syntax Interface. In Proceedings of the 15th International Congress of Phonetic Sciences (ICPhS), August 2-9, 2003, Barcelona, Spain.
17. Heldner, M. & Megyesi, B. 2003. The Acoustic and Morpho-Syntactic Context of Prosodic Boundaries in Dialogs. In Proceedings of Fonetik 2003, June 2-3, 2003, Umeå, Sweden.
18. Wastholm, P., Kusma, A. & Megyesi, B. 2005. Using Linguistic Data for Genre Classification. In Proceedings of Swedish Artificial Intelligence and Learning Systems SAIS-SSLS 2005, Mälardalen University, Västerås, Sweden.
19. Bandmann Megyesi, B., Sågvall Hein, A. & Csató Johanson, É. 2006. Building a Swedish-Turkish Parallel Corpus. In Proceedings of Language Resources and Evaluation Conference LREC 2006, May 22-28, 2006, Genoa, Italy.

20. Hulth, A. & Megyesi, B. 2006. A Study on Automatically Extracted Keywords in Text Categorization. In Proceedings of Association for Computational Linguistics ACL 2006, June 17-23, 2006, Sydney, Australia.

Papers 1 and 2, as well as 16 and 17, can be considered the same papers. The PhD thesis, number 15, is partly based on the previously published papers.

Other material

1. Megyesi, B. 1998. A Short Descriptive Grammar for Hungarian. Dept. of Linguistics, Stockholm University.
2. Megyesi, B. 1998. Brill's Transformation-Based Tagger. Dept. of Linguistics, Stockholm University.

Supervised Research

Over the years, I have supervised several master's theses in computational linguistics and tutored and supervised project work in my courses (see also the Pedagogical portfolio for a more detailed description). One of the projects, on automatic text categorization, was of high quality, and together with my students I extended their work and wrote the paper Using Linguistic Data for Genre Classification (Wastholm, Kusma and Megyesi, 2005); see under Publications.

I arranged a Ph.D. course in Perl programming at the Department of Linguistics, Stockholm University, during Spring 1999. The work included planning the course and part of the lecturing. I was an invited speaker at the Swedish National Graduate School of Language Technology in 2003 and gave a talk about Phrasal Parsing by Using Data-Driven PoS Taggers. I also participated as a lecturer at the International PhD course on treebanks, arranged by Stockholm University in 2004.

Cooperation and funding

As I mentioned in the introduction, my research activities have been carried out partly by myself and partly in cooperation with other researchers. My work on data-driven tagging and chunking of Swedish was performed by myself alone. In 1999, I was a visiting researcher at New York University, where I worked on information extraction with prof. Ralph Grishman and Roman Yangarber.
I participated in implementing a new domain (natural disasters) in the Proteus information extraction system. The visit was supported by STINT, the Swedish Foundation for International Cooperation in Research and Higher Education.

At Stockholm University, I cooperated with my colleagues Sara Rydin (1999) on rule-based shallow parsing of Swedish, Harald Berthelsen (2000) on automatically finding and filtering annotation errors in English by using ensemble methods, and Sofia Gustafson-Čapková (2001-2002) on the relation between prosodic structure (in terms of pausing) and linguistic structure (on the morphological, syntactic and discourse levels). In all of this work, both authors participated fully in the projects.

At CTT, TMH, KTH, I worked with prof. Rolf Carlson, prof. Björn Granström, Dr David House and Dr Mattias Heldner, with prof. Eva Strangert at Umeå University (2002-2003), and with Dr Marc Swerts (2002) within the project Gräns och gruppering - Strukturering av talet i olika kommunikativa situationer (Boundaries and groupings - the structuring of speech in different communicative situations), led by prof. Eva Strangert and financed by the Swedish Research Council (2002-2004). My role in the project was to build a corpus by collecting the material and annotating it with prosodic phrases as well as on various linguistic levels, e.g. part-of-speech and phrase structure information, and to run statistical analyses to determine the relationship between the prosodic and linguistic structure.

I was one of the initiators of the Swedish Treebank Symposium and the Nordic treebank network in 2002 and 2003, together with prof. Joakim Nivre and prof. Martin Volk. Unfortunately, I was not able to follow the project, as I became a mother to twins and was on parental leave from September 2003 to September 2004.

Furthermore, I worked with Dr Anette Hulth on text categorization, where my main role was to provide the linguistic analysis needed in the knowledge representation phase for the categorization, to run the machine learning algorithm to build the models, and to evaluate them.

Currently, I participate in the project Supporting research environment for minor languages (Classic, Turkish and Hindi), supported by the Swedish Research Council and the Faculty of Languages at Uppsala University. I work 10% of my time together with prof. Anna Sågvall Hein, prof. Éva Csató Johanson and Dr Bengt Dahlqvist on building a Swedish-Turkish parallel corpus to be used in linguistic research and machine translation.
Besides administrative work such as maintaining the project page, my work includes corpus collection, normalization, annotation, and alignment. Prof. Kemal Oflazer at Sabanci University in Istanbul is also connected to the project, as he provides the morpho-syntactic analysis of the Turkish material.

I am also working in the newly founded project Methods and tools for automatic grammar extraction (Metoder och verktyg för automatisk grammatikextraktion), supported by the Swedish Research Council during the period 2006-2009, with prof. Anna Sågvall Hein (project leader) and prof. Joakim Nivre.

Presentations

Papers presented at international conferences/workshops:

Brill's PoS Tagger with Extended Lexical Templates for Hungarian. Workshop (W01) on Machine Learning in Human Language Technology, ACAI 99, Greece, 1999.
Towards a Finite-State Parser for Swedish. NoDaLiDa 1999, pp. 115-123, Norway, 1999.

Ensemble of Classifiers for Noise Detection in PoS Tagged Corpora. Third International Workshop on TEXT, SPEECH and DIALOGUE, pp. 27-32, Brno, Czech Republic, 2000.
Data-Driven Methods for PoS tagging and Chunking of Swedish. NoDaLiDa 2001, Sweden, 2001.
Pausing in Dialogues and Read Speech: Speakers' Production and Listeners' Interpretation. Workshop on Prosody in Speech Recognition and Understanding, NJ, USA, 2001.
Comparing Data-Driven Learning Algorithms for PoS Tagging of Swedish. Conference on Empirical Methods in Natural Language Processing (EMNLP 2001), Carnegie Mellon University, PA, USA, 2001.
Phrasal Parsing by Using Data-Driven PoS Taggers. Conference on Recent Advances in Natural Language Processing, Euro Conference RANLP-2001, Bulgaria, 2001.
Silence and Discourse Context in Read Speech and Dialogues in Swedish. Speech Prosody 2002 conference, France, 2002.
Production and Perception of Pauses and their Linguistic Context in Read and Spontaneous Speech in Swedish. ICSLP 2002, 7th International Conference on Spoken Language Processing, Colorado, USA, 2002.
Data-Driven Methods for Building a Swedish Treebank. Swedish Treebank Symposium, November 28-29, 2002, Växjö University, Sweden.
Building a Swedish-Turkish Parallel Corpus. Language Resources and Evaluation Conference LREC 2006, May 22-28, 2006, Genoa, Italy.

Between October 2000 and June 2003, I gave talks about the ongoing research within Natural Language Processing at CTT, KTH, for CTT's industrial partners, researchers and international reviewers at least two or three times a year.

Invited Lectures

DSV, Stockholm University, 1999
New York University, New York, 1999
Copenhagen Business School, Denmark, 2001
Swedish National Graduate School of Language Technology, 2003
International PhD course on treebanks, Stockholm University, 2004
Lund University, 2006

Professional Activities

Session chair for Named Entity Recognition at the Conference on Recent Advances in Natural Language Processing, Euro Conference RANLP-2001, pp. 166-173, Tzigov Chark, Bulgaria, September 5-7, 2001
Member of the program and organizing committee of the Swedish Treebank Symposium, November 28-29, 2002, Växjö University, Sweden; with Joakim Nivre (chair), Rolf Carlson, Lars Ahrenberg, and Lars Borin
Member of the program committee for the 41st Annual Meeting of the Association for Computational Linguistics (ACL), 2003
Member of the program committee for Nordiska Datalingvistdagarna, Nodalida 2003
Member of the program committee for the Conference on Recent Advances in Natural Language Processing (RANLP) 2003
Reviewer for the Journal of Natural Language Engineering, 2004
Member of the organizing and program committee for the Language Technology Conference, October 20-21, 2005, Uppsala University

Academic Honors

Young Researcher Award for the paper entitled Phrasal Parsing by Using Data-Driven PoS Taggers, received at the Euro Conference Recent Advances in Natural Language Processing, RANLP-2001, September 5-7, 2001, Tzigov Chark, Bulgaria.