PyCantonese: Cantonese linguistic research in the age of big data
|
|
|
- Imogene Sutton
- 10 years ago
- Views:
Transcription
1 PyCantonese: Cantonese linguistic research in the age of big data Jackson L. Lee University of Chicago Childhood Bilingualism Research Center, CUHK September 15, 2015
2 Grammar versus observed data What is linguistics all about? Grammar Observed data In grammar but not observed: Arguably the mainstream focus of linguistic research Why? Productivity, competence, etc How? Introspection, experiments, etc Observed but not in grammar (?): The noisy part of language (slips of tongue, I heard it but I d never say that, etc) Jackson Lee PyCantonese and big data 2
3 A bottom-up approach But grammar is ultimately based on the observed data. Grammar Observed data A strongly empirical view of linguistic research: Focus on what is observed. Where are the data? There s no shortage... big data research! Jackson Lee PyCantonese and big data 3
4 Big data for lingusitic research...as the theme of the 2015 Linguistic Summer Institute at UChicago: Jackson Lee PyCantonese and big data 4
5 Big data + Cantonese? Some (accessible) Cantonese corpora, by year of publication: The Hong Kong Cantonese Adult Language Corpus (Leung and Law 2001; Leung et al. 2004; Fung and Law 2013) Hong Kong Cantonese Child Language Corpus (Lee and Wong 1998) Cantonese Radio Corpus (Francis and Matthews 2005, 2006) The Hong Kong Bilingual Child Language Corpus (Yip and Matthews 2007) Early Cantonese Tagged Database (Yiu 2012) A Linguistic Corpus of Mid-20th Century Hong Kong Cantonese (Chin 2013) PolyU Corpus of Spoken Chinese (Foong et al. 2014) Hong Kong Cantonese Corpus (Luke and Wong 2015) Jackson Lee PyCantonese and big data 5
6 Big data + Cantonese? To what extent are these resources usable and extensible for the general research community? Issues: inconsistent/ad hoc data formats no general toolkits for handling data Jackson Lee PyCantonese and big data 6
7 PyCantonese PyCantonese is a toolkit for handling Cantonese corpus data. Evolving and expanding It is a Python library why Python? a general-purpose programming language the lingua franca for computational linguistics and natural language processing Similar data structures as in NLTK (Bird et al. 2009) An open-source tool Current collaborators: Litong Chen, Tsz-Him Tsui Full documentation (with installation instructions): Jackson Lee PyCantonese and big data 7
8 Accessing corpus data in PyCantonese PyCantonese comes with builtin corpus data! Currently, KK Luke s HKCanCor is included. The corpus provides wordsegmented data with: characters part-of-speech tags Jyutping romanization Jackson Lee PyCantonese and big data 8
9 Accessing corpus data through PyCantonese >>> import pycantonese as pc >>> corpus = pc.hkcancor >>> corpus.number_of_words() >>> corpus.number_of_characters() Jackson Lee PyCantonese and big data 9
10 Parsing Jyutping Jyutping onset, nucleus, coda, tone >>> import pycantonese as pc >>> pc.jyutping( hou2 ) [( h, o, u, 2 )] >>> pc.jyutping( hoeng1gong2 ) [( h, oe, ng, 1 ), ( g, o, ng, 2 )] Also provided: Conversion from Jyutping to Yale or to L A TEX TIPA Jackson Lee PyCantonese and big data 10
11 Basic search capabilities Possible search queries depend heavily on what is encoded and annotated in the corpus data: Jyutping elements? Part-of-speech tags? Characters? More here: Jackson Lee PyCantonese and big data 11
12 More search examples Use filtering strategies for more complicated search queries. Example: Find in HKCanCor all verb+noun word pairs. (defined as 1st word tag = V and 2nd word tag == N ) Approach: 1. Find all words tagged as V together with the immediately following word. 2. Within the results from step 1, retain only cases where the second word has the tag of N. Jackson Lee PyCantonese and big data 12
13 Finding V+N word pairs >>> import pycantonese as pc >>> corpus = pc.hkcancor >>> v = pc.search(corpus, pos="v", word_right=1) >>> len(v) # number of words with "V" >>> vn = list() >>> for wordpair in v:... if not wordpair or len(wordpair) < 2:... continue... if wordpair[1][1] and wordpair[1][1] == "N":... vn.append(wordpair) # save V+N >>> len(vn) # number of V+N word pairs 1535 Jackson Lee PyCantonese and big data 13
14 Some V+N pairs found >>> for i in range(3):... print(vn[i]) [( 聽 teng1, V ), ( 朋 友 pang4jau5, N )] [( 跟 gan1, V ), ( 旅 行 社 leoi5hang4se5, N )] [( 搭 daap3, V ), ( 飛 機 fei1gei1, N )] TODO: Allow regular expressions for search criteria. e.g., the part-of-speech tag of interest could be anything in the tagset that begins with a V (= some sort of verb). Jackson Lee PyCantonese and big data 14
15 Recurrent problem: Part-of-speech tagging Some issues of part-of-speech tagging: 1. How many tags do we use? HKCanCor: 46+ tags Google universal tagset: 12 tags (Petrov et al. 2011) 2. Relatedly, how fine-grained are the tags? e.g., distinguish proper nouns and common nouns? 3. Human annotation work is time-consuming and costly. Jackson Lee PyCantonese and big data 15
16 鬼 gwai2 ghost Examples from HKCanCor: 1. 好 hou2/d 鬼 gwai2/d1 細 sai3/a very GWAI small 2. 有 jau5/v1 鬼 gwai2/d1 今 日 gam1jat6/t resulting-in GWAI today What is the tag D1? These two instances of gwai2 are very different. (An expressive + negator in (2); see Beltrama and Lee (2015)) Current work: Mapping HKCanCor to the universal PoS tagset by Petrov et al Jackson Lee PyCantonese and big data 16
17 A related issue: Word segmentation Issues of word segmentation: 1. AB compound or two separate words? 2. grammatical characters (e.g., aspect markers) a separate word itself or part of a word? Jackson Lee PyCantonese and big data 17
18 Interrogatives A-not-A, A-not-AB If we treat A-not-AB as three words... What is hap in hap-m-happy? Similarly, 鍾 唔 鍾 意 like or not, etc. (In HKCanCor, the first A is treated as an abbreviation, with a tag starting with J.) Or perhaps things like A-not-AB should be treated as one word? (Lee 2012) Same problem: aspect markers Jackson Lee PyCantonese and big data 18
19 Ongoing work Corpus data prep (The Leung-Law-Fung HKCAC, the Francis-Matthews CRCorpus) General tools thus derived Jackson Lee PyCantonese and big data 19
20 Comparing some Hong Kong Cantonese corpora Both standard and non-standard data formats have been used. HKCAC HKCanCor CRCorpus Jackson Lee PyCantonese and big data 20
21 Potential new tools in PyCantonese...and a call for arms! A part-of-speech tagger Conversion between Jyutping and characters, both directions (Issues: Homophony and homography) Word segmentation (with all the usual problems!) Jackson Lee PyCantonese and big data 21
22 Data, data, data Ultimately, what is observed is the data. Data format: General direction for PyCantonese: Adopting the CHILDES CHAT format (MacWhinney 2000) Reasons: Rich annotations It is well documented and supported. XML format available by conversion readable by NLTK and PyCantonese! What about non-conversational data? Data prep: Other (publicly available) datasets out there? Audio(-visual) data? What annotations are desirable? Jackson Lee PyCantonese and big data 22
23 [Update ] Additional notes and code snippets are available here: Jackson Lee PyCantonese and big data 23
24 References I Beltrama, Andrea and Jackson L. Lee Great pizzas, ghost negations: The emergence and persistence of mixed expressives. In Proceedings of Sinn und Bedeutung 19. Bird, Steven, Edward Loper and Ewan Klein Natural Language Processing with Python. O Reilly Media Inc. Chin, Andy C New resources for Cantonese language studies: A linguistic corpus of mid-20th century Hong Kong Cantonese. Newsletter of Chinese Language 92(1): Foong, Ha Yap, Ying Yang and Tak-Sum Wong On the development of sentence final particles (and utterance tags) in Chinese. In Kate Beeching and Ulrich Detges (eds.), Discourse functions at the left and right periphery, Leiden: Koninklijke Brill NV. Francis, Elaine J. and Stephen Matthews A multi-dimensional approach to the category verb in Cantonese. Journal of Linguistics 41: Francis, Elaine J. and Stephen Matthews Categoriality and object extraction in Cantonese serial verb constructions. Natural Language and Linguistic Theory 24: Fung, Suk-Yee and Sam-Po Law A phonetically annotated corpus of spoken Cantonese: The Hong Kong Cantonese Adult Language Corpus. Newsletter of Chinese Language 92(1): 1 5. Jackson Lee PyCantonese and big data 24
25 References II Lee, Jackson L Fixed-tone reduplication in Cantonese. In McGill Working Papers in Linguistics 22(1). Proceedings from the Montreal-Ottawa-Toronto (MOT) Phonology Workshop 2011: Phonology in the 21st Century: In Honour of Glyne Piggott. Lee, Thomas Hung-Tak and Colleen Wong CANCORP: The Hong Kong Cantonese Child Language Corpus. Cahiers de Linguistique Asie Orientale 27(2): Leung, Man-Tak and Sam-Po Law HKCAC: The Hong Kong Cantonese adult language corpus. International Journal of Corpus Linguistics 6: Leung, Man-Tak, Sam-Po Law and Suk-Yee Fung Type and token frequencies of phonological units in Hong Kong Cantonese. Behavior Research Methods, Instruments, and Computer 36(3): Luke, Kang-Kwong and May Lai-Yin Wong The Hong Kong Cantonese Corpus: Design and uses. Journal of Chinese Linguistics. MacWhinney, Brian The childes project: The database, vol. 2. Psychology Press. Petrov, Slav, Dipanjan Das and Ryan McDonald A universal part-of-speech tagset. arxiv preprint arxiv: Yip, Virginia and Stephen Matthews The Bilingual Child: Early Development and Language Contact. Cambridge University Press. Yiu, Carine Yuk-Man Reconstructing early Chinese dialectal grammar: A study of directional verbs in Cantonese. Talk at the Workshop on Innovations in Cantonese Linguistics, March 16-17, Columbus: The Ohio State University. Jackson Lee PyCantonese and big data 25
Working with CHAT transcripts in Python
Working with CHAT transcripts in Python Jackson L. Lee, Ross Burkholder, Gallagher B. Flinn, and Emily R. Coppess University of Chicago January 28, 2016 Abstract This report introduces the Python library
A CHINESE SPEECH DATA WAREHOUSE
A CHINESE SPEECH DATA WAREHOUSE LUK Wing-Pong, Robert and CHENG Chung-Keng Department of Computing, Hong Kong Polytechnic University Tel: 2766 5143, FAX: 2774 0842, E-mail: {csrluk,cskcheng}@comp.polyu.edu.hk
Bilingualism and language acquisition in early childhood: research and application
Bilingualism and language acquisition in early childhood: research and application Virginia Yip (CUHK) and Stephen Matthews (HKU) Childhood Bilingualism Research Centre Outline Early bilingual acquisition
Markus Dickinson. Dept. of Linguistics, Indiana University Catapult Workshop Series; February 1, 2013
Markus Dickinson Dept. of Linguistics, Indiana University Catapult Workshop Series; February 1, 2013 1 / 34 Basic text analysis Before any sophisticated analysis, we want ways to get a sense of text data
Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features
, pp.273-280 http://dx.doi.org/10.14257/ijdta.2015.8.4.27 Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features Lirong Qiu School of Information Engineering, MinzuUniversity of
Linguistic Knowledge-driven Approach to Chinese Comparative Elements Extraction
Linguistic Knowledge-driven Approach to Chinese Comparative Elements Extraction Minjun Park Dept. of Chinese Language and Literature Peking University Beijing, 100871, China [email protected] Yulin Yuan
Natural Language Database Interface for the Community Based Monitoring System *
Natural Language Database Interface for the Community Based Monitoring System * Krissanne Kaye Garcia, Ma. Angelica Lumain, Jose Antonio Wong, Jhovee Gerard Yap, Charibeth Cheng De La Salle University
Annotated Corpora in the Cloud: Free Storage and Free Delivery
Annotated Corpora in the Cloud: Free Storage and Free Delivery Graham Wilcock University of Helsinki [email protected] Abstract The paper describes a technical strategy for implementing natural
Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic
Testing Data-Driven Learning Algorithms for PoS Tagging of Icelandic by Sigrún Helgadóttir Abstract This paper gives the results of an experiment concerned with training three different taggers on tagged
Customizing an English-Korean Machine Translation System for Patent Translation *
Customizing an English-Korean Machine Translation System for Patent Translation * Sung-Kwon Choi, Young-Gil Kim Natural Language Processing Team, Electronics and Telecommunications Research Institute,
Combining structured data with machine learning to improve clinical text de-identification
Combining structured data with machine learning to improve clinical text de-identification DT Tran Scott Halgrim David Carrell Group Health Research Institute Clinical text contains Personally identifiable
Accelerating and Evaluation of Syntactic Parsing in Natural Language Question Answering Systems
Accelerating and Evaluation of Syntactic Parsing in Natural Language Question Answering Systems cation systems. For example, NLP could be used in Question Answering (QA) systems to understand users natural
PoS-tagging Italian texts with CORISTagger
PoS-tagging Italian texts with CORISTagger Fabio Tamburini DSLO, University of Bologna, Italy [email protected] Abstract. This paper presents an evolution of CORISTagger [1], an high-performance
Special Topics in Computer Science
Special Topics in Computer Science NLP in a Nutshell CS492B Spring Semester 2009 Jong C. Park Computer Science Department Korea Advanced Institute of Science and Technology INTRODUCTION Jong C. Park, CS
Context Grammar and POS Tagging
Context Grammar and POS Tagging Shian-jung Dick Chen Don Loritz New Technology and Research New Technology and Research LexisNexis LexisNexis Ohio, 45342 Ohio, 45342 [email protected] [email protected]
Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words
, pp.290-295 http://dx.doi.org/10.14257/astl.2015.111.55 Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words Irfan
Master of Arts in Linguistics Syllabus
Master of Arts in Linguistics Syllabus Applicants shall hold a Bachelor s degree with Honours of this University or another qualification of equivalent standard from this University or from another university
Shallow Parsing with Apache UIMA
Shallow Parsing with Apache UIMA Graham Wilcock University of Helsinki Finland [email protected] Abstract Apache UIMA (Unstructured Information Management Architecture) is a framework for linguistic
Schema documentation for types1.2.xsd
Generated with oxygen XML Editor Take care of the environment, print only if necessary! 8 february 2011 Table of Contents : ""...........................................................................................................
ChildFreq: An Online Tool to Explore Word Frequencies in Child Language
LUCS Minor 16, 2010. ISSN 1104-1609. ChildFreq: An Online Tool to Explore Word Frequencies in Child Language Rasmus Bååth Lund University Cognitive Science Kungshuset, Lundagård, 222 22 Lund [email protected]
Stock Market Prediction Using Data Mining
Stock Market Prediction Using Data Mining 1 Ruchi Desai, 2 Prof.Snehal Gandhi 1 M.E., 2 M.Tech. 1 Computer Department 1 Sarvajanik College of Engineering and Technology, Surat, Gujarat, India Abstract
Computerized Language Analysis (CLAN) from The CHILDES Project
Vol. 1, No. 1 (June 2007), pp. 107 112 http://nflrc.hawaii.edu/ldc/ Computerized Language Analysis (CLAN) from The CHILDES Project Reviewed by FELICITY MEAKINS, University of Melbourne CLAN is an annotation
EXMARaLDA and the FOLK tools two toolsets for transcribing and annotating spoken language
EXMARaLDA and the FOLK tools two toolsets for transcribing and annotating spoken language Thomas Schmidt Institut für Deutsche Sprache, Mannheim R 5, 6-13 D-68161 Mannheim [email protected]
Dutch Parallel Corpus
Dutch Parallel Corpus Lieve Macken [email protected] LT 3, Language and Translation Technology Team Faculty of Applied Language Studies University College Ghent November 29th 2011 Lieve Macken (LT
Brill s rule-based PoS tagger
Beáta Megyesi Department of Linguistics University of Stockholm Extract from D-level thesis (section 3) Brill s rule-based PoS tagger Beáta Megyesi Eric Brill introduced a PoS tagger in 1992 that was based
SWIFT Aligner, A Multifunctional Tool for Parallel Corpora: Visualization, Word Alignment, and (Morpho)-Syntactic Cross-Language Transfer
SWIFT Aligner, A Multifunctional Tool for Parallel Corpora: Visualization, Word Alignment, and (Morpho)-Syntactic Cross-Language Transfer Timur Gilmanov, Olga Scrivner, Sandra Kübler Indiana University
Integrating NLTK with the Hadoop Map Reduce Framework 433-460 Human Language Technology Project
Integrating NLTK with the Hadoop Map Reduce Framework 433-460 Human Language Technology Project Paul Bone [email protected] June 2008 Contents 1 Introduction 1 2 Method 2 2.1 Hadoop and Python.........................
Milk, bread and toothpaste : Adapting Data Mining techniques for the analysis of collocation at varying levels of discourse
Milk, bread and toothpaste : Adapting Data Mining techniques for the analysis of collocation at varying levels of discourse Rob Sanderson, Matthew Brook O Donnell and Clare Llewellyn What happens with
Micro blogs Oriented Word Segmentation System
Micro blogs Oriented Word Segmentation System Yijia Liu, Meishan Zhang, Wanxiang Che, Ting Liu, Yihe Deng Research Center for Social Computing and Information Retrieval Harbin Institute of Technology,
Differences in linguistic and discourse features of narrative writing performance. Dr. Bilal Genç 1 Dr. Kağan Büyükkarcı 2 Ali Göksu 3
Yıl/Year: 2012 Cilt/Volume: 1 Sayı/Issue:2 Sayfalar/Pages: 40-47 Differences in linguistic and discourse features of narrative writing performance Abstract Dr. Bilal Genç 1 Dr. Kağan Büyükkarcı 2 Ali Göksu
31 Case Studies: Java Natural Language Tools Available on the Web
31 Case Studies: Java Natural Language Tools Available on the Web Chapter Objectives Chapter Contents This chapter provides a number of sources for open source and free atural language understanding software
Terminology Extraction from Log Files
Terminology Extraction from Log Files Hassan Saneifar 1,2, Stéphane Bonniol 2, Anne Laurent 1, Pascal Poncelet 1, and Mathieu Roche 1 1 LIRMM - Université Montpellier 2 - CNRS 161 rue Ada, 34392 Montpellier
CINTIL-PropBank. CINTIL-PropBank Sub-corpus id Sentences Tokens Domain Sentences for regression atsts 779 5,654 Test
CINTIL-PropBank I. Basic Information 1.1. Corpus information The CINTIL-PropBank (Branco et al., 2012) is a set of sentences annotated with their constituency structure and semantic role tags, composed
Language and Computation
Language and Computation week 13, Thursday, April 24 Tamás Biró Yale University [email protected] http://www.birot.hu/courses/2014-lc/ Tamás Biró, Yale U., Language and Computation p. 1 Practical matters
Prosodic focus marking in Bai
Prosodic focus marking in Bai Zenghui Liu 1, Aoju Chen 1,2 & Hans Van de Velde 1 Utrecht University 1, Max Planck Institute for Psycholinguistics 2 [email protected], [email protected], [email protected]
Computer Assisted Language Learning (CALL): Room for CompLing? Scott, Stella, Stacia
Computer Assisted Language Learning (CALL): Room for CompLing? Scott, Stella, Stacia Outline I What is CALL? (scott) II Popular language learning sites (stella) Livemocha.com (stacia) III IV Specific sites
The linguistic function of Cantonese discourse particles in the English medium online chat of Cantonese speakers
University of Wollongong Research Online University of Wollongong Thesis Collection University of Wollongong Thesis Collections 2009 The linguistic function of Cantonese discourse particles in the English
Automatic Text Analysis Using Drupal
Automatic Text Analysis Using Drupal By Herman Chai Computer Engineering California Polytechnic State University, San Luis Obispo Advised by Dr. Foaad Khosmood June 14, 2013 Abstract Natural language processing
Statistical Machine Translation
Statistical Machine Translation Some of the content of this lecture is taken from previous lectures and presentations given by Philipp Koehn and Andy Way. Dr. Jennifer Foster National Centre for Language
Word Completion and Prediction in Hebrew
Experiments with Language Models for בס"ד Word Completion and Prediction in Hebrew 1 Yaakov HaCohen-Kerner, Asaf Applebaum, Jacob Bitterman Department of Computer Science Jerusalem College of Technology
The Rise of Documentary Linguistics and a New Kind of Corpus
The Rise of Documentary Linguistics and a New Kind of Corpus Gary F. Simons SIL International 5th National Natural Language Research Symposium De La Salle University, Manila, 25 Nov 2008 Milestones in
An NLP Curator (or: How I Learned to Stop Worrying and Love NLP Pipelines)
An NLP Curator (or: How I Learned to Stop Worrying and Love NLP Pipelines) James Clarke, Vivek Srikumar, Mark Sammons, Dan Roth Department of Computer Science, University of Illinois, Urbana-Champaign.
Terminology Extraction from Log Files
Terminology Extraction from Log Files Hassan Saneifar, Stéphane Bonniol, Anne Laurent, Pascal Poncelet, Mathieu Roche To cite this version: Hassan Saneifar, Stéphane Bonniol, Anne Laurent, Pascal Poncelet,
Survey Results: Requirements and Use Cases for Linguistic Linked Data
Survey Results: Requirements and Use Cases for Linguistic Linked Data 1 Introduction This survey was conducted by the FP7 Project LIDER (http://www.lider-project.eu/) as input into the W3C Community Group
An online semi automated POS tagger for. Presented By: Pallav Kumar Dutta Indian Institute of Technology Guwahati
An online semi automated POS tagger for Assamese Presented By: Pallav Kumar Dutta Indian Institute of Technology Guwahati Topics Overview, Existing Taggers & Methods Basic requirements of POS Tagger Our
Reliable and Cost-Effective PoS-Tagging
Reliable and Cost-Effective PoS-Tagging Yu-Fang Tsai Keh-Jiann Chen Institute of Information Science, Academia Sinica Nanang, Taipei, Taiwan 5 eddie,[email protected] Abstract In order to achieve
UNKNOWN WORDS ANALYSIS IN POS TAGGING OF SINHALA LANGUAGE
UNKNOWN WORDS ANALYSIS IN POS TAGGING OF SINHALA LANGUAGE A.J.P.M.P. Jayaweera #1, N.G.J. Dias *2 # Virtusa Pvt. Ltd. No 752, Dr. Danister De Silva Mawatha, Colombo 09, Sri Lanka * Department of Statistics
What Is Linguistics? December 1992 Center for Applied Linguistics
What Is Linguistics? December 1992 Center for Applied Linguistics Linguistics is the study of language. Knowledge of linguistics, however, is different from knowledge of a language. Just as a person is
Parsing Software Requirements with an Ontology-based Semantic Role Labeler
Parsing Software Requirements with an Ontology-based Semantic Role Labeler Michael Roth University of Edinburgh [email protected] Ewan Klein University of Edinburgh [email protected] Abstract Software
Phase 2 of the D4 Project. Helmut Schmid and Sabine Schulte im Walde
Statistical Verb-Clustering Model soft clustering: Verbs may belong to several clusters trained on verb-argument tuples clusters together verbs with similar subcategorization and selectional restriction
How to make Ontologies self-building from Wiki-Texts
How to make Ontologies self-building from Wiki-Texts Bastian HAARMANN, Frederike GOTTSMANN, and Ulrich SCHADE Fraunhofer Institute for Communication, Information Processing & Ergonomics Neuenahrer Str.
The Transition of Phrase based to Factored based Translation for Tamil language in SMT Systems
The Transition of Phrase based to Factored based Translation for Tamil language in SMT Systems Dr. Ananthi Sheshasaayee 1, Angela Deepa. V.R 2 1 Research Supervisior, Department of Computer Science & Application,
O Reilly Ebooks Your bookshelf on your devices!
O Reilly Ebooks Your bookshelf on your devices! When you buy an ebook through oreilly.com you get lifetime access to the book, and whenever possible we provide it to you in five, DRM-free file formats
Motivation. Korpus-Abfrage: Werkzeuge und Sprachen. Overview. Languages of Corpus Query. SARA Query Possibilities 1
Korpus-Abfrage: Werkzeuge und Sprachen Gastreferat zur Vorlesung Korpuslinguistik mit und für Computerlinguistik Charlotte Merz 3. Dezember 2002 Motivation Lizentiatsarbeit: A Corpus Query Tool for Automatically
The Use of English Colour Terms in Big Data
The Use of English Colour Terms in Big Data Dimitris MYLONAS, 1,3 Matthew PURVER, 1 Mehrnoosh SADRZADEH, 1 Lindsay MACDONALD, 2 Lewis GRIFFIN, 3 1 School of Electronic Engineering and Computer Science,
From Terminology Extraction to Terminology Validation: An Approach Adapted to Log Files
Journal of Universal Computer Science, vol. 21, no. 4 (2015), 604-635 submitted: 22/11/12, accepted: 26/3/15, appeared: 1/4/15 J.UCS From Terminology Extraction to Terminology Validation: An Approach Adapted
The Power of Dots: Using Nonverbal Compensators in Chat Reference
The Power of Dots: Using Nonverbal Compensators in Chat Reference Jack M. Maness University Libraries University of Colorado at Boulder 184 UCB 1720 Pleasant St. Boulder, CO 80309 303-492-4545 [email protected]
Natural Language Processing
Natural Language Processing 2 Open NLP (http://opennlp.apache.org/) Java library for processing natural language text Based on Machine Learning tools maximum entropy, perceptron Includes pre-built models
Appraise: an Open-Source Toolkit for Manual Evaluation of MT Output
Appraise: an Open-Source Toolkit for Manual Evaluation of MT Output Christian Federmann Language Technology Lab, German Research Center for Artificial Intelligence, Stuhlsatzenhausweg 3, D-66123 Saarbrücken,
Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information
Ngram Search Engine with Patterns Combining Token, POS, Chunk and NE Information Satoshi Sekine Computer Science Department New York University [email protected] Kapil Dalwani Computer Science Department
A Method for Automatic De-identification of Medical Records
A Method for Automatic De-identification of Medical Records Arya Tafvizi MIT CSAIL Cambridge, MA 0239, USA [email protected] Maciej Pacula MIT CSAIL Cambridge, MA 0239, USA [email protected] Abstract
D2.4: Two trained semantic decoders for the Appointment Scheduling task
D2.4: Two trained semantic decoders for the Appointment Scheduling task James Henderson, François Mairesse, Lonneke van der Plas, Paola Merlo Distribution: Public CLASSiC Computational Learning in Adaptive
a Chinese-to-Spanish rule-based machine translation
Chinese-to-Spanish rule-based machine translation system Jordi Centelles 1 and Marta R. Costa-jussà 2 1 Centre de Tecnologies i Aplicacions del llenguatge i la Parla (TALP), Universitat Politècnica de
Comma checking in Danish Daniel Hardt Copenhagen Business School & Villanova University
Comma checking in Danish Daniel Hardt Copenhagen Business School & Villanova University 1. Introduction This paper describes research in using the Brill tagger (Brill 94,95) to learn to identify incorrect
Very often my clients ask me, Don I need Chinese translation. If I ask which Chinese? They just say Just Chinese. If I explain them there re more
Hello, fellow colleagues in Translation industry. And, Thank you very much for nice introduction. Vanessa. When you hear the topic Asian Languages and Markets, each of you probably had some questions or
Workshop. Neil Barrett PhD, Jens Weber PhD, Vincent Thai MD. Engineering & Health Informa2on Science
Engineering & Health Informa2on Science Engineering NLP Solu/ons for Structured Informa/on from Clinical Text: Extrac'ng Sen'nel Events from Pallia've Care Consult Le8ers Canada-China Clean Energy Initiative
Interactive Dynamic Information Extraction
Interactive Dynamic Information Extraction Kathrin Eichler, Holmer Hemsen, Markus Löckelt, Günter Neumann, and Norbert Reithinger Deutsches Forschungszentrum für Künstliche Intelligenz - DFKI, 66123 Saarbrücken
CENG 734 Advanced Topics in Bioinformatics
CENG 734 Advanced Topics in Bioinformatics Week 9 Text Mining for Bioinformatics: BioCreative II.5 Fall 2010-2011 Quiz #7 1. Draw the decompressed graph for the following graph summary 2. Describe the
Chunk Parsing. Steven Bird Ewan Klein Edward Loper. University of Melbourne, AUSTRALIA. University of Edinburgh, UK. University of Pennsylvania, USA
Chunk Parsing Steven Bird Ewan Klein Edward Loper University of Melbourne, AUSTRALIA University of Edinburgh, UK University of Pennsylvania, USA March 1, 2012 chunk parsing: efficient and robust approach
GRASP: Grammar- and Syntax-based Pattern-Finder for Collocation and Phrase Learning
PACLIC 24 Proceedings 357 GRASP: Grammar- and Syntax-based Pattern-Finder for Collocation and Phrase Learning Mei-hua Chen a, Chung-chi Huang a, Shih-ting Huang b, and Jason S. Chang b a Institute of Information
An Overview of Applied Linguistics
An Overview of Applied Linguistics Edited by: Norbert Schmitt Abeer Alharbi What is Linguistics? It is a scientific study of a language It s goal is To describe the varieties of languages and explain the
Faculty of Social Science
Division of Architecture Master of Architecture Full-time 2 years Admission Code: ARK HK$42,100 (per annum) for UGC-funded local students HK$130,000 (per annum) for self-financed students Admission Requirements:
SYSTRAN 混 合 策 略 汉 英 和 英 汉 机 器 翻 译 系 统
SYSTRAN Chinese-English and English-Chinese Hybrid Machine Translation Systems Jin Yang, Satoshi Enoue Jean Senellart, Tristan Croiset SYSTRAN Software, Inc. SYSTRAN SA 9333 Genesee Ave. Suite PL1 La Grande
Collecting Polish German Parallel Corpora in the Internet
Proceedings of the International Multiconference on ISSN 1896 7094 Computer Science and Information Technology, pp. 285 292 2007 PIPS Collecting Polish German Parallel Corpora in the Internet Monika Rosińska
The Chinese Language and Language Planning in China. By Na Liu, Center for Applied Linguistics
The Chinese Language and Language Planning in China By Na Liu, Center for Applied Linguistics This brief introduces the Chinese language and its varieties and describes Chinese language planning initiatives
Child Language Data Exchange System
Child Language Data Exchange System The Child Language Data Exchange System (abbreviated: CHILDES) was founded in 1984 by Brian MacWhinney (Carnegie Mellon University) and Catherine Snow (Harvard University)
Twitter Stock Bot. John Matthew Fong The University of Texas at Austin [email protected]
Twitter Stock Bot John Matthew Fong The University of Texas at Austin [email protected] Hassaan Markhiani The University of Texas at Austin [email protected] Abstract The stock market is influenced
Web Information Mining and Decision Support Platform for the Modern Service Industry
Web Information Mining and Decision Support Platform for the Modern Service Industry Binyang Li 1,2, Lanjun Zhou 2,3, Zhongyu Wei 2,3, Kam-fai Wong 2,3,4, Ruifeng Xu 5, Yunqing Xia 6 1 Dept. of Information
2014/02/13 Sphinx Lunch
2014/02/13 Sphinx Lunch Best Student Paper Award @ 2013 IEEE Workshop on Automatic Speech Recognition and Understanding Dec. 9-12, 2013 Unsupervised Induction and Filling of Semantic Slot for Spoken Dialogue
Chinese Open Relation Extraction for Knowledge Acquisition
Chinese Open Relation Extraction for Knowledge Acquisition Yuen-Hsien Tseng 1, Lung-Hao Lee 1,2, Shu-Yen Lin 1, Bo-Shun Liao 1, Mei-Jun Liu 1, Hsin-Hsi Chen 2, Oren Etzioni 3, Anthony Fader 4 1 Information
How To Translate English To Yoruba Language To Yoranuva
International Journal of Language and Linguistics 2015; 3(3): 154-159 Published online May 11, 2015 (http://www.sciencepublishinggroup.com/j/ijll) doi: 10.11648/j.ijll.20150303.17 ISSN: 2330-0205 (Print);
Annotation in Language Documentation
Annotation in Language Documentation Univ. Hamburg Workshop Annotation SEBASTIAN DRUDE 2015-10-29 Topics 1. Language Documentation 2. Data and Annotation (theory) 3. Types and interdependencies of Annotations
Transcriptions in the CHAT format
CHILDES workshop winter 2010 Transcriptions in the CHAT format Mirko Hanke University of Oldenburg, Dept of English [email protected] 10-12-2010 1 The CHILDES project CHILDES is the CHIld Language
Study Plan for Master of Arts in Applied Linguistics
Study Plan for Master of Arts in Applied Linguistics Master of Arts in Applied Linguistics is awarded by the Faculty of Graduate Studies at Jordan University of Science and Technology (JUST) upon the fulfillment
COMPUTATIONAL DATA ANALYSIS FOR SYNTAX
COLING 82, J. Horeck~ (ed.j North-Holland Publishing Compa~y Academia, 1982 COMPUTATIONAL DATA ANALYSIS FOR SYNTAX Ludmila UhliFova - Zva Nebeska - Jan Kralik Czech Language Institute Czechoslovak Academy
INVESTIGATING DISCOURSE MARKERS IN PEDAGOGICAL SETTINGS:
ARECLS, 2011, Vol.8, 95-108. INVESTIGATING DISCOURSE MARKERS IN PEDAGOGICAL SETTINGS: A LITERATURE REVIEW SHANRU YANG Abstract This article discusses previous studies on discourse markers and raises research
