Research Portfolio. Beáta B. Megyesi January 8, 2007

Research Activities

My research activities focus mainly on four areas.

Natural language processing

During the last ten years, since I started my academic career in computational linguistics, my main research topics have been corpus linguistics, part-of-speech tagging, morphological analysis, and shallow syntactic analysis (e.g. chunking and parsing), mainly using machine learning techniques, for Swedish and English as well as for Hungarian. I am also interested in using rule-based finite-state techniques to build a shallow syntactic analyzer for Swedish that provides both phrase-structure and dependency analysis. My work within NLP has been published at both national and international conferences; see papers 1, 2, 3, 4, 5, 8, 9, 10, 14, and 15 under Publications.

Speech research

Within speech research at KTH, the main purpose was to improve speech synthesis for Swedish. A better-sounding text-to-speech system requires in-depth analysis of the relationship between prosodic and linguistic structure. I therefore studied the relationship between prosody, in terms of prosodic breaks, and linguistic structure in various speaking styles, in both spontaneous and non-spontaneous speech in different communicative situations. The results are published mainly at well-known international conferences; see papers 6, 7, 11, 12, 13, 15, 16, and 17 under Publications.

Parallel corpora and machine translation

For the last two years, I have been working on the development of a Swedish-Turkish parallel corpus that can be used for machine translation as well as for linguistic analysis of the languages involved. This work has been carried out within the project Supporting research environment for minor languages (Classic, Turkish and Hindi) and serves as a pilot project aiming at developing methods for automatically building parallel corpora for language pairs that may belong to different language types. The outcome of the pilot project is presented in paper 19 under Publications.

Text categorization

During the last year, I began to look into the automatic classification of texts into genres and text types using machine learning techniques. The focus is on knowledge representation, exploring both semantic features (by means of automatically extracted keywords) and morpho-syntactic features (e.g. part-of-speech, syntactic phrases and depth); see papers 18, 20 and 21 under Publications.

Publications

Reviewed Papers

The papers are available at http://stp.lingfil.uu.se/~bea/

1. Megyesi, B. 1999. Brill's PoS Tagger with Extended Lexical Templates for Hungarian. Workshop (W01) on Machine Learning in Human Language Technology, ACAI '99, Chania, Crete, Greece, July 5 - July 16, 1999.
2. Megyesi, B. 1999. Improving Brill's PoS Tagger for an Agglutinative Language. In Proceedings of the Joint Sigdat Conference on Empirical Methods in Natural Language Processing and Very Large Corpora (EMNLP/VLC '99), pp. 275-284, University of Maryland, USA, June 21-22, 1999.
3. Megyesi, B. & Rydin, S. 2000. Towards a Finite-State Parser for Swedish. In Proceedings of NoDaLiDa 1999, pp. 115-123, Trondheim, Norway, December 9-10, 1999.
4. Berthelsen, H. & Megyesi, B. 2000. Ensemble of Classifiers for Noise Detection in PoS Tagged Corpora. In Proceedings of the Third International Workshop on TEXT, SPEECH and DIALOGUE, pp. 27-32, Brno, Czech Republic, September 13-16, 2000. Springer-Verlag in LNCS/LNAI series.
5. Megyesi, B. 2001. Data-Driven Methods for PoS tagging and Chunking of Swedish. In Proceedings of NoDaLiDa 2001, Uppsala, Sweden, May 21-22, 2001.
6. Gustafson-Čapková, S. & Megyesi, B. 2001. A Comparative Study of Pauses in Dialogues and Read Speech. In Proceedings of Eurospeech 2001, Volume 2, pp. 931-935, Aalborg, Denmark, September 3-7, 2001.
7. Megyesi, B. & Gustafson-Čapková, S. 2001. Pausing in Dialogues and Read Speech: Speakers' Production and Listeners' Interpretation. In Proceedings of the Workshop on Prosody in Speech Recognition and Understanding, pp. 107-113, New Jersey, USA, October 22-24, 2001.
8. Megyesi, B. 2001. Comparing Data-Driven Learning Algorithms for PoS Tagging of Swedish. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2001), pp. 151-158, Carnegie Mellon University, Pittsburgh, PA, USA, June 3-4, 2001.

9. Megyesi, B. 2001. Phrasal Parsing by Using Data-Driven PoS Taggers. In Proceedings of the Conference on Recent Advances in Natural Language Processing, Euro Conference RANLP-2001, pp. 166-173, Tzigov Chark, Bulgaria, September 5-7, 2001.
10. Megyesi, B. 2002. Shallow Parsing with PoS Taggers and Linguistic Features. Journal of Machine Learning Research: Special Issue on Shallow Parsing, JMLR (2), pp. 639-668, MIT Press.
11. Gustafson-Čapková, S. & Megyesi, B. 2002. Silence and Discourse Context in Read Speech and Dialogues in Swedish. In Proceedings of the Speech Prosody 2002 conference, Bernard Bel & Isabelle Marlien (eds.), pp. 363-366, Aix-en-Provence, France, April 11-13, 2002.
12. Carlson, R., Granström, B., Heldner, M., House, D., Megyesi, B., Strangert, E. & Swerts, M. 2002. Boundaries and groupings - the structuring of speech in different communicative situations: a description of the GROG project. In Proceedings of Fonetik 2002, TMH-QPSR Volume 44, pp. 65-69, Stockholm, Sweden, May 29-31, 2002.
13. Megyesi, B. & Gustafson-Čapková, S. 2002. Production and Perception of Pauses and their Linguistic Context in Read and Spontaneous Speech in Swedish. In Proceedings of ICSLP 2002 - 7th International Conference on Spoken Language Processing, Denver, USA, September 16-20, 2002.
14. Megyesi, B. & Carlson, R. 2002. Data-Driven Methods for Building a Swedish Treebank. Extended abstract to the Swedish Treebank Symposium, 28-29 November 2002, Växjö University, Sweden.
15. Megyesi, B. 2002. Data-Driven Syntactic Analysis - Methods and Applications for Swedish. Doctoral Dissertation, Department of Speech, Music and Hearing, Kungliga Tekniska Högskolan.
16. Heldner, M. & Megyesi, B. 2003. Exploring the Prosody-Syntax Interface. In Proceedings of the 15th International Congress of Phonetic Sciences (ICPhS), 2-9 August 2003, Barcelona, Spain.
17. Heldner, M. & Megyesi, B. 2003. The Acoustic and Morpho-Syntactic Context of Prosodic Boundaries in Dialogs. In Proceedings of Fonetik 2003, 2-3 June 2003, Umeå, Sweden.
18. Wastholm, P., Kusma, A. & Megyesi, B. 2005. Using Linguistic Data for Genre Classification. In Proceedings of Swedish Artificial Intelligence and Learning Systems SAIS-SSLS 2005, Mälardalen University, Västerås, Sweden.
19. Bandmann Megyesi, B., Sågvall Hein, A. & Csató Johanson, É. 2006. Building a Swedish-Turkish Parallel Corpus. In Proceedings of Language Resources and Evaluation Conference LREC 2006, May 22-28, 2006, Genoa, Italy.

20. Hulth, A. & Megyesi, B. 2006. A Study on Automatically Extracted Keywords in Text Categorization. In Proceedings of Association for Computational Linguistics ACL 2006, June 17-23, 2006, Sydney, Australia.

Papers 1 and 2, as well as 16 and 17, can be considered the same papers. The PhD thesis, number 15, is partly based on the previously published papers.

Other material

1. Megyesi, B. 1998. A Short Descriptive Grammar for Hungarian. Dept. of Linguistics, Stockholm University.
2. Megyesi, B. 1998. Brill's Transformation-Based Tagger. Dept. of Linguistics, Stockholm University.

Supervised Research

Over the years, I have supervised several master's theses in computational linguistics and tutored and supervised project work in my courses (see also the Pedagogical portfolio for a more detailed description). One of the projects, on automatic text categorization, was of high quality, and together with my students I extended their work and wrote the paper "Using Linguistic Data for Genre Classification" by Wastholm, Kusma and Megyesi in 2005 (paper 18 under Publications).

I arranged a Ph.D. course in Perl programming at the Department of Linguistics, Stockholm University, during Spring 1999. The work included planning the course and part of the lecturing. I was an invited speaker at the Swedish National Graduate School of Language Technology in 2003 and gave a talk on "Phrasal Parsing by Using Data-Driven PoS Taggers". I also participated as a lecturer at the International PhD course on treebanks, arranged by Stockholm University in 2004.

Cooperations and funding

As I mentioned in the introduction, my research activities have been carried out partly on my own and partly in cooperation with other researchers. My work on data-driven tagging and chunking of Swedish was carried out on my own. In 1999, I worked with prof. Ralph Grishman and Roman Yangarber at New York University, where I was a visiting researcher working on information extraction.
I participated in the implementation of a new domain (natural disasters) in the Proteus information extraction system. The visit was supported by STINT, the Swedish Foundation for International Cooperation in Research and Higher Education.

At Stockholm University, I cooperated with my colleagues Sara Rydin (1999) on rule-based shallow parsing of Swedish, with Harald Berthelsen (2000) on automatically finding and filtering annotation errors in English by using ensemble methods, and with Sofia Gustafson-Čapková (2001-2002) on the relation between prosodic structure (in terms of pausing) and linguistic structure (on the morphological, syntactic and discourse levels). In all of this work, both authors participated fully in the projects.

At CTT, TMH, KTH, I worked with prof. Rolf Carlson, prof. Björn Granström, Dr David House, and Dr Mattias Heldner, as well as with prof. Eva Strangert at Umeå University (2002-2003) and Dr Marc Swerts (2002), within the project Gräns och gruppering - Strukturering av talet i olika kommunikativa situationer, led by prof. Eva Strangert and financed by the Swedish Research Council (2002-2004). My role in the project was to build a corpus by collecting the material and annotating it with prosodic phrases as well as on various linguistic levels (e.g. part-of-speech and phrase structure information), and to run statistical analyses to determine the relationship between the prosodic and linguistic structure.

I was one of the initiators of the Swedish Treebank Symposium and the Nordic treebank network in 2002 and 2003, together with prof. Joakim Nivre and prof. Martin Volk. Unfortunately, I was not able to follow the project, as I became a mother of twins and was on parental leave from September 2003 to September 2004.

Furthermore, I worked with Dr Anette Hulth on text categorization, where my main role was to provide the linguistic analysis needed in the knowledge representation phase, to run the machine learning algorithm to build the models, and to evaluate them.

Currently, I participate in the project Supporting research environment for minor languages (Classic, Turkish and Hindi), supported by the Swedish Research Council and the Faculty of Languages at Uppsala University. I spend 10% of my time working with prof. Anna Sågvall Hein, prof. Éva Csató Johanson and Dr Bengt Dahlqvist on building a Swedish-Turkish parallel corpus to be used in linguistic research and machine translation. Besides administrative work such as maintaining the project page, my work includes corpus collection, normalization, annotation, and alignment. Prof. Kemal Oflazer at Sabanci University in Istanbul is also connected to the project, as he provides the morpho-syntactic analysis of the Turkish material.

I am also working in the newly founded project Methods and tools for automatic grammar extraction (Metoder och verktyg för automatisk grammatikextraktion), supported by the Swedish Research Council during the period 2006-2009, with prof. Anna Sågvall Hein (project leader) and prof. Joakim Nivre.

Presentations

Papers presented at international conferences/workshops:

Brill's PoS Tagger with Extended Lexical Templates for Hungarian. Workshop (W01) on Machine Learning in Human Language Technology, ACAI '99, Greece, 1999.
Towards a Finite-State Parser for Swedish. NoDaLiDa 1999, pp. 115-123, Norway, 1999.

Ensemble of Classifiers for Noise Detection in PoS Tagged Corpora. Third International Workshop on TEXT, SPEECH and DIALOGUE, pp. 27-32, Brno, Czech Republic, 2000.
Data-Driven Methods for PoS tagging and Chunking of Swedish. NoDaLiDa 2001, Sweden, 2001.
Pausing in Dialogues and Read Speech: Speakers' Production and Listeners' Interpretation. Workshop on Prosody in Speech Recognition and Understanding, NJ, USA, 2001.
Comparing Data-Driven Learning Algorithms for PoS Tagging of Swedish. Conference on Empirical Methods in Natural Language Processing (EMNLP 2001), Carnegie Mellon University, PA, USA, 2001.
Phrasal Parsing by Using Data-Driven PoS Taggers. Conference on Recent Advances in Natural Language Processing, Euro Conference RANLP-2001, Bulgaria, 2001.
Silence and Discourse Context in Read Speech and Dialogues in Swedish. Speech Prosody 2002 conference, France, 2002.
Production and Perception of Pauses and their Linguistic Context in Read and Spontaneous Speech in Swedish. ICSLP 2002 - 7th International Conference on Spoken Language Processing, Colorado, USA, 2002.
Data-Driven Methods for Building a Swedish Treebank. Swedish Treebank Symposium, 28-29 November 2002, Växjö University, Sweden.
Building a Swedish-Turkish Parallel Corpus. Language Resources and Evaluation Conference LREC 2006, May 22-28, 2006, Genoa, Italy.

Between October 2000 and June 2003, I gave talks about ongoing research within Natural Language Processing at CTT, KTH, for CTT's industrial partners and researchers as well as for international reviewers, at least two or three times a year.

Invited Lectures

DSV, Stockholm University, 1999
New York University, New York, 1999
Copenhagen Business School, Denmark, 2001
Swedish National Graduate School of Language Technology, 2003
International PhD course on treebanks, Stockholm University, 2004
Lund University, 2006

Professional Activities

Session chair for Named Entity Recognition at the Conference on Recent Advances in Natural Language Processing, Euro Conference RANLP-2001, Tzigov Chark, Bulgaria, September 5-7, 2001
Member of the program and organizing committee of the Swedish Treebank Symposium, 28-29 November 2002, Växjö University, Sweden, with Joakim Nivre (chair), Rolf Carlson, Lars Ahrenberg, and Lars Borin
Member of the program committee for the 41st Annual Meeting of the Association for Computational Linguistics (ACL), 2003
Member of the program committee for Nordiska Datalingvistdagarna, Nodalida 2003
Member of the program committee for the Conference on Recent Advances in Natural Language Processing (RANLP) 2003
Reviewer for the Journal of Natural Language Engineering, 2004
Member of the organizing and program committee for the Language Technology Conference, October 20-21, 2005, Uppsala University

Academic Honors

Young Researcher Award for the paper "Phrasal Parsing by Using Data-Driven PoS Taggers", received at the Euro Conference Recent Advances in Natural Language Processing, RANLP-2001, 5-7 September 2001, Tzigov Chark, Bulgaria.