How To Write A German Reference Corpus Of Computer Mediated Communication
|
|
|
- Muriel Williams
- 5 years ago
- Views:
Transcription
1 DeRiK: A German Reference Corpus of Computer-Mediated Communication Michael Beißwenger 1, Maria Ermakova 2, Alexander Geyken 2, Lothar Lemnitzer 2, Angelika Storrer 1 1 Department of German Language and Literature, TU Dortmund University, D Dortmund, {beisswenger,storrer}@tu-dortmund.de 2 Berlin-Brandenburg Academy of Sciences and Humanities, Digitales Wörterbuch der deutschen Sprache, Jägerstr. 22/23, D Berlin, {geyken,lemnitzer}@bbaw.de, [email protected] Paper presented at the Digital Humanities 2012 Conference. University of Hamburg, July 19, Abstract The paper describes an ongoing project that aims at building a reference corpus of German computer-mediated communication (CMC) as a new component of an already existing reference corpus of written contemporary German. The Deutsches Referenzkorpus zur internetbasierten Kommunikation (DeRiK) shall include data from the most prominent CMC genres amongst German Internet users and, thus, close a gap in the coverage of the corpus resources in the project Digitales Wörterbuch der deutschen Sprache (DWDS) which are maintained and provided by the Berlin-Brandenburg Academy of Sciences and the Humanities (BBAW). The focus of the paper is on the role of the DeRiK component within the DWDS framework, on sampling issues, and on CMC-specific issues of corpus annotation. 1. Project Background and Focus of the Paper In view of the increasing amount of reading and writing that people do on the Internet, up-to-date corpora of written contemporary language must take into consideration the impact of computer-mediated communication (CMC) on contemporary language and, thus, include samples of emerging written genres such as , weblogs, microblogging on Twitter, discussion boards and wiki discussions, chats and instant messaging conversations, and communication in social network sites. In this paper we present selected aspects of an ongoing project that aims at building a reference corpus of German CMC, called DeRiK ( Deutsches Refe- 1
2 renzkorpus zur internetbasierten Kommunikation ). 1 DeRiK is a joint initiative of TU Dortmund University and the Berlin-Brandenburg Academy of Sciences and the Humanities (BBAW) and is embedded in the scientific network Empirical Research on Internet-based Communication ( funded by the Deutsche Forschungsgemeinschaft (DFG). The corpus will be integrated into the lexical information system provided by the BBAW project Digitales Wörterbuch der deutschen Sprache (DWDS, 2 The focus of this paper is on the role of the DeRiK component within the DWDS framework and on sampling issues (section 2) as well as on CMC-specific issues of corpus annotation (section 3). 2. Integrating CMC Discourse into a Corpus of Contemporary German: Motivation, Sampling, and Application Fields DWDS ( is a lexical information system developed by and hosted at the BBAW. The system offers one-click-access to three different types of resources (Geyken, 2007): a) lexical resources: a common language dictionary 3, an etymological dictionary, and a thesaurus; b) corpus resources: a balanced reference corpus (called DWDS core corpus ) of German ranging from 1900 up to now, a set of additional newspaper corpora, and specialized corpora; c) statistical resources for words and word combinations. These resources are displayed alongside one another in separate panels (see Fig. 1). The system offers the choice among several views, i.e. between several profiles with predefined panel combinations. The CMC component DeRiK ( Deutsches Referenzkorpus zur internetbasierten Kommunikation ) will be integrated into this framework both as an independent panel and as a subcorpus of the DWDS core corpus. The data for DeRiK shall be collected not only once but on a regular basis; DeRiK, thus, will consist of several partial corpora, each of them representing data that has been collected at a certain point of time (e.g., within one year). The sampling of the data is guided by the findings of the ARD/ZDF-Onlinestudie, a German online usage survey ( ZDF-ONLINESTUDIE.DE) which reveals the usage preferences of German Internet users on an annual basis and according to online applications and age groups. The findings Another corpus of contemporary language which aims to include a CMC subcorpus is the Dutch SoNaR project (Reynaert et al. 2010). 3 This dictionary is based on a six-volume paper dictionary, the Wörterbuch der deutschen Gegenwartssprache (WDG, en.: Dictionary of Contemporary German ) published between 1962 and 1977 and compiled at the Deutsche Akademie der Wissenschaften (Klappenbach/Steinitz (eds.) ). 2
3 of this survey allow us to derive an ideal key for the composition of the DeRiK partial corpora, i.e. for deciding which CMC technologies have to be regarded as most prominent amongst German Internet users in any year and in which proportion discourse conducted on the basis of those technologies should be represented in the corpus. However, for practical reasons the project will set out to collect data of only those instances of CMC technologies indicated by the online survey for which the users have explicitly granted permission for (re-)distributing and (re-)using their written utterances for non-commercial purposes/academic research (e.g., by assigning the respective subtypes of the Creative Commons License to CMC documents or to CMC applications on the web). Thus, the key derived from the findings of the annual online survey will describe an ideal compilation (with ideal proportions of the CMC genres) while the legal constraints will compel us to implement this ideal key only in modified form. Since the data will be collected over several years, we will have the possibility to adapt our key for each phase of data collection to changing usage preferences according to the most recent version of the online survey as well as to changes in IPR restrictions on the use of CMC data retrieved from the web for scientific purposes. The first partial corpus of DeRiK will mostly include discourse from Wikipedia talk pages, a selection of forum and weblog discussions, chat conversations, and postings of selected Twitter users. The integration of the CMC reference corpus into the DWDS system may be valuable for various research and application fields, for example: a) Language variation, language change and stylistics: A general-language corpus that includes a CMC component will provide a broad empirical basis (a) for further, corpus-based investigations of the usage and dissemination of CMCspecific phenomena across linguistic varieties and digital genres, and (b) for comparative analyses of the features of CMC discourse and of traditional written genres (e.g. newspaper, fiction, scientific writing, nonliterary prose); it will thus facilitate to track and describe how new linguistic patterns and communicative genres emerge 4. b) Lexicology and lexicography: Besides genre-specific discourse markers and netspeak jargon (like lol laughing out loud or imho in my humble opinion), new vocabulary is characteristic for CMC discourse, e.g. funzen (an abbreviated variant of funktionieren to function) or gruscheln (a function of a German social network platform, most likely a blending of grüßen greet and kuscheln cuddle). There are also CMC-specific processes of lexical-semantic changes, e.g. the broadening of the concept of Freund (friend). Up-to-date lexical resources should document and describe these tendencies by integrating CMC data into their data basis. Once the first partial corpora of the DeRiK corpus are made 4 Overviews of the features of CMC discourse from a linguistic perspective can be found, e.g., in Herring (ed., 1996; 2010), Runkehl et al. (1998), Crystal (2001; 2011), Beißwenger/Storrer (2008), and Storrer (2012). 3
4 available in the DWDS system, it is intended to extend the DWDS dictionary component with entries describing new lexemes that have evolved from CMC discourse. In addition, the DWDS corpus system will then allow one to track how new vocabulary from CMC discourse (such as the examples mentioned above) spreads into traditional genres (e.g. newspaper, fiction, nonliterary prose). c) Language teaching: CMC has become an important part of everyday communication. Language- and culture-specific properties of CMC should, thus, also be taken into consideration in communicative approaches to Second Language Teaching. In this context, the DeRiK corpus and the documentation of CMC vocabulary in the DWDS dictionary may be useful resources. In school teaching, students with German as a native language may use the DWDS system to compare traditional written language with CMC and to explore how style varies across different genres. Fig. 1: Web frontend of the DWDS system ( 3. Annotation of CMC-Specific Phenomena One advantage of integrating DeRiK into the DWDS system is that users can profit from the DWDS corpus annotation and querying facilities: The corpus resources which are currently available in the DWDS system are lemmatized with the TAGH morphology (Geyken and Hanneforth, 2006) and tagged with the part-of-speech 4
5 tagger moot (Jurish, 2003). The corpus search engine DDC (Dialing DWDS Concordancer) supports linguistic queries on several annotation levels (word forms, lemmas, STTS part-of-speech categories) as well as in filtering (e.g. by text type) and sorting options. Since all corpus resources in the DWDS system are encoded according to the guidelines of the Text Encoding Initiative (TEI-P5), the project uses TEI also for the annotation of its CMC component. For this purpose, we have developed a TEIcompliant annotation schema that provides a macrostructure of CMC discourse which covers a broad range of CMC genres (see section 3.1); a partial schema for the description of selected CMC-specific phenomena ( interaction signs : emoticons, interaction words, interaction templates, addressing terms; see section 3.2 for details). A detailed description of the TEI schema for DeRiK is given in Beißwenger et al. (2012) 5. The discussion in this paper will focus on two core issues: The representation of CMC-specific micro- and macrostructures (section 3.1) and the annotation of typical netspeak elements (section 3.2). 3.1 Annotation of CMC-Specific Micro- and Macrostructures We introduced the category posting as a basic element to capture CMC micro- and macrostructures. A posting is defined as a content unit that is being sent to the server en bloc. Postings can usually be recognized by their formal structure, even if they have different forms and structures across CMC genres. This facilitates the automatic segmentation and annotation of CMC micro- and macrostructures. We use the term microstructure to refer to the internal structure of postings. There are cases in which a posting consists of exactly one portion of text. In other CMC genres, e.g. in discussion groups, postings may contain divisions and markup used by the author to structure their content. We use the term macrostructure to describe how the postings are sequenced. While microstructures are generated by an individual author, macrostructures do not emerge from the actions of just one user but from all posting activities of all users involved in a CMC conversation plus server routines for ordering the incoming postings. Our TEI makes a distinction between two major types of CMC macrostructures: logfile structures, which arrange the postings in a linear chronological order based on when they reached the server (as is the case in chats and instant messaging data); 5 The RNG schema file, a TEI-compliant ODD documentation as well as encoding examples are available at 5
6 thread structures, which arrange the postings using two dimensions with specific semantics: the above/below dimension representing a temporal before/after relation; the left/right dimension (by indentation), which usually symbolizes the topical affiliation of one posting to a previous posting (as is the case, e.g., in forum, weblog, and wiki discussions). 3.2 Annotation of Interaction Signs The corpus-based investigation of netspeak jargon is interesting in many research contexts (style variation and language change, discourse management, language teaching, etc.). Our annotation schema comprises elements for a set of netspeak phenomena which we term interaction signs. The term builds on the category interaktive Einheiten which was introduced in the three-volume scientific grammar of the German language Zifonun et al. (1997) to classify interjections (such as hm or oh my god ) and responsives (such as yes and no ) in spoken discourse. In contrast to part-of-speech-categories, interaction signs are not syntactically integrated and do not contribute to the compositional structure of sentences. In spoken discourse, they serve as devices for conversation management, i.e. they can be used to express reactions to the partners utterances or to display the speaker s emotions. Besides interjections and responsives, the category interaction sign includes four CMC-specific subcategories (see Fig. 2): 1. Emoticons, which are iconic units that are created with the keyboard and which typically serve as emotion or irony markers or as responsives. Being of iconic origin, the use of emoticons is not restricted to a specific language. However, different styles of emoticons exist e.g. Western style emoticons such as :-), :-(, ;-), or the :), or Japanese style emoticons such as (^_^), \(^_^)/, (*_*). 2. Interaction words, which are symbolic linguistic units whose morphologic construction is based on a word or a phrase. They may describe gestures or facial expressions, e.g. *g* (< grins grin), *fg* (< fat grin), *s* (< smile), or they are used for the simulation of actions and events. 3. Interaction templates, which are units that the user does not generate with the keyboard but which are generated automatically from a file with a previously prepared text or graphical element after the user has activated a template. 4. Addressing terms, which are units that are used to address an utterance to a particular interlocutor. 6
7 interaction signs specific for CMC interjections responsives emoticons interaction words interaction addressing templates terms EXAMPLES ach äh ah äh au aua eiei gell hm mhm na naja ne oh oh oho oi pst tja (etc.) ja okay nein Western style: :-) :) ;-) ;) :-( :( :-D :D Japanese style: o.o O.O \(^_^)/ (*_*) grins freu ganzdolleknuddel grübel schluck stirnrunzel lach schluchz lol rofl an bilbo21: Fig. 2: Typology of interaction signs (with examples) 4. Conclusion and Outlook Up to now, many assumptions about the Internet s impact on language change have been based upon small datasets. As a new component within the DWDS system, the DeRiK corpus is meant to be a resource for the investigation of language usage in CMC genres on a broader empirical basis. The annotation schema outlined in section 3 is used and evaluated in the ongoing work of the DeRiK project. The categories proposed in this schema will have to be further discussed within the CMC community. We consider the development of this schema as a first step towards the development of an annotation standard that will facilitate interoperability between language data and thus cross-language, cross-genre, and micro-diachronic investigations of CMC phenomena on the basis of distributed corpora. The schema focuses on linguistic aspects, but it is open for extensions motivated through other fields of research, i.e. cultural studies or sentiment analysis. The data that will be collected and annotated in the DeRIK project as well as the tools for their linguistic annotation will be made available through the CLARIN Language Resources and Technology infrastructure. References Beißwenger, M. and Storrer, A. (2008). Corpora of Computer-Mediated Communication. In Lüdeling, A. and Kytö, M. (eds), Corpus Linguistics. An International Handbook, vol. 1. Berlin, de Gruyter, pp Beißwenger, M., Ermakova, M., Geyken, A., Lemnitzer, L., Storrer, A. (2012). A TEI schema for the Representation of the Computer-mediated Communication. Journal of the Text Encoding Initiative (jtei), Issue 3 November 2012 (DOI: /jtei.476). (accessed 11 March 2012). 7
8 Crystal, D. (2001). Language and the Internet. Cambridge: Cambridge University Press. Crystal, D. (2011). Internet Linguistics. A Student Guide. New York: Routledge. Geyken, A. (2007). The DWDS corpus: A reference corpus for the German language of the 20th century. In Fellbaum, Ch. (ed), Collocations and Idioms. London: continuum, pp Herring, S. C. (ed) (1996). Computer-Mediated Communication. Linguistic, Social and Cross-Cultural Perspectives. Amsterdam/Philadelphia: John Benjamins (Pragmatics and Beyond New Series 39). Herring, S. C. (ed) (2010). Computer-Mediated Conversation, Part I. Special Issue of Language@Internet, vol (accessed 11 March 2012). Klein, W., Geyken, A. (2010). Das Digitale Wörterbuch der Deutschen Sprache (DWDS). In Heid, U., Schierholz, S., Schweickard, W., Wiegand, H. E., Gouws, R. H. and Wolski, W. (eds), Lexicographica. pp Reynaert, N., Oostdijk, O., De Clercq, H. et al. (2010). Balancing SoNaR: IPR versus Processing Issues in a 500-Million-Word Written Dutch Reference Corpus. In Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10). Paris, Runkehl, K., Siever, T., Schlobinski, P. (1998). Sprache und Kommunikation im Internet. Überblick und Analysen. Opladen: Westdeutscher Verlag. Storrer, A. (2012, in press). Sprachstil und Sprachvariation in sozialen Netzwerken. In Frank-Job, B., Mehler, A., Sutter, T. (eds), Die Dynamik sozialer und sprachlicher Netzwerke. Konzepte, Methoden und empirische Untersuchungen an Beispielen des WWW. Wiesbaden: VS Verlag für Sozialwissenschaften. TEI Consortium (eds) (2007). TEI P5: Guidelines for Electronic Text Encoding and Interchange. (accessed 25 October 2012). Klappenbach, R., Steinitz, W. (eds) ( ). Wörterbuch der deutschen Gegenwartssprache (WDG). 6 Bände. Berlin: Akademie-Verlag. Zifonun, G., Hoffmann, L., Strecker, B. et al. (eds) (1997). Grammatik der deutschen Sprache. 3 Bände. Berlin und New York: de Gruyter. 8
Adding Value to CMC Corpora: CLARINification and Part-of-Speech Annotation of the Dortmund Chat Corpus
Adding Value to CMC Corpora: CLARINification and Part-of-Speech Annotation of the Dortmund Chat Corpus Michael Beißwenger, Eric Ehrhardt, Andrea Horbach, Harald Lüngen, Diana Steffen, Angelika Storrer
Adding Value to CMC Corpora: CLARINification and Part-of-Speech Annotation of the Dortmund Chat Corpus
Adding Value to CMC Corpora: CLARINification and Part-of-Speech Annotation of the Dortmund Chat Corpus Michael Beißwenger 1, Eric Ehrhardt 2, Andrea Horbach 3, Harald Lüngen 4, Diana Steffen 3, Angelika
Breatling the Meaning of Tag Sets in CMC corpora
An extended tag set for annotating parts of speech in CMC corpora Thomas Bartz 1, Michael Beißwenger 1, Eric Ehrhardt 2, Angelika Storrer 2 1) 2) International Research Days: Social Media and CMC Corpora
Berlin-Brandenburg Academy of sciences and humanities (BBAW) resources / services
Berlin-Brandenburg Academy of sciences and humanities (BBAW) resources / services speakers: Kai Zimmer and Jörg Didakowski Clarin Workshop WP2 February 2009 BBAW/DWDS The BBAW and its 40 longterm projects
EXMARaLDA and the FOLK tools two toolsets for transcribing and annotating spoken language
EXMARaLDA and the FOLK tools two toolsets for transcribing and annotating spoken language Thomas Schmidt Institut für Deutsche Sprache, Mannheim R 5, 6-13 D-68161 Mannheim [email protected]
A TEI Schema for the Representation of Computer-mediated Communication
A TEI Schema for the Representation of Computer-mediated Communication Michael Beißwenger 1, Maria Ermakova 2, Alexander Geyken 2, Lothar Lemnitzer 2, Angelika Storrer 1 1 Department of German Language
Master of Arts in Linguistics Syllabus
Master of Arts in Linguistics Syllabus Applicants shall hold a Bachelor s degree with Honours of this University or another qualification of equivalent standard from this University or from another university
What Makes a Good Online Dictionary? Empirical Insights from an Interdisciplinary Research Project
Proceedings of elex 2011, pp. 203-208 What Makes a Good Online Dictionary? Empirical Insights from an Interdisciplinary Research Project Carolin Müller-Spitzer, Alexander Koplenig, Antje Töpel Institute
Generation of Word Profiles for large German corpora
Generation of Word Profiles for large German corpora Alexander Geyken, Alexander Siebert and Jörg Didakowski 1. Introduction Electronic corpora have been used in lexicography and the domain of language
Stefan Engelberg (IDS Mannheim), Workshop Corpora in Lexical Research, Bucharest, Nov. 2008 [Folie 1]
Content 1. Empirical linguistics 2. Text corpora and corpus linguistics 3. Concordances 4. Application I: The German progressive 5. Part-of-speech tagging 6. Fequency analysis 7. Application II: Compounds
WebLicht: Web-based LRT services for German
WebLicht: Web-based LRT services for German Erhard Hinrichs, Marie Hinrichs, Thomas Zastrow Seminar für Sprachwissenschaft, University of Tübingen [email protected] Abstract This software
Local Culture in Global English:
Local Culture in Global English: a case study of Kultur in Sprache / Sprachwissenschaft in Kulturwissenschaften Josef Schmied Chair English Language & Linguistics Chemnitz University of Technology www.tu-chemnitz.de/phil/english/linguist
The Rise of Documentary Linguistics and a New Kind of Corpus
The Rise of Documentary Linguistics and a New Kind of Corpus Gary F. Simons SIL International 5th National Natural Language Research Symposium De La Salle University, Manila, 25 Nov 2008 Milestones in
Aligning Word Senses in GermaNet and the DWDS Dictionary of the German Language
Aligning Word Senses in GermaNet and the DWDS Dictionary of the German Language Verena Henrich, Erhard Hinrichs, Reinhild Barkey Department of Linguistics University of Tübingen, Germany {vhenrich,eh,rbarkey}@sfs.uni-tuebingen.de
Local Culture in Global English:
Local Culture in Global English: a case study of Kultur in Sprache / Sprachwissenschaft in Kulturwissenschaften Josef Schmied Chair English Language & Linguistics Chemnitz University of Technology www.tu-chemnitz.de
Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words
, pp.290-295 http://dx.doi.org/10.14257/astl.2015.111.55 Efficient Techniques for Improved Data Classification and POS Tagging by Monitoring Extraction, Pruning and Updating of Unknown Foreign Words Irfan
Complex Predications in Argument Structure Alternations
Complex Predications in Argument Structure Alternations Stefan Engelberg (Institut für Deutsche Sprache & University of Mannheim) Stefan Engelberg (IDS Mannheim), Universitatea din Bucureşti, November
Web 2.0 Tools for Language Learning. Sadykova G.V Kazan Federal University
Web 2.0 Tools for Language Learning Sadykova G.V Kazan Federal University Web 1.0 vs Web 2.0 Web 1.0: HTML based, slow & expensive Internet connection, expensive software, limited interactivity, one-to-one/many
Discourse Markers in English Writing
Discourse Markers in English Writing Li FENG Abstract Many devices, such as reference, substitution, ellipsis, and discourse marker, contribute to a discourse s cohesion and coherence. This paper focuses
LINGUISTIC SUPPORT IN "THESIS WRITER": CORPUS-BASED ACADEMIC PHRASEOLOGY IN ENGLISH AND GERMAN
ELN INAUGURAL CONFERENCE, PRAGUE, 7-8 NOVEMBER 2015 EUROPEAN LITERACY NETWORK: RESEARCH AND APPLICATIONS Panel session Recent trends in Bachelor s dissertation/thesis research: foci, methods, approaches
Stefan Engelberg (IDS Mannheim), Workshop Corpora in Lexical Research, Bucharest, Nov. 2008 [Folie 1]
Content 1. Empirical linguistics 2. Text corpora and corpus linguistics 3. Concordances 4. Application I: The German progressive 5. Part-of-speech tagging 6. Fequency analysis 7. Application II: Compounds
Minnesota K-12 Academic Standards in Language Arts Curriculum and Assessment Alignment Form Rewards Intermediate Grades 4-6
Minnesota K-12 Academic Standards in Language Arts Curriculum and Assessment Alignment Form Rewards Intermediate Grades 4-6 4 I. READING AND LITERATURE A. Word Recognition, Analysis, and Fluency The student
ICAME Journal No. 24. Reviews
ICAME Journal No. 24 Reviews Collins COBUILD Grammar Patterns 2: Nouns and Adjectives, edited by Gill Francis, Susan Hunston, andelizabeth Manning, withjohn Sinclair as the founding editor-in-chief of
The English Language Learner CAN DO Booklet
WORLD-CLASS INSTRUCTIONAL DESIGN AND ASSESSMENT The English Language Learner CAN DO Booklet Grades 1-2 Includes: Performance Definitions CAN DO Descriptors For use in conjunction with the WIDA English
Motivation. Korpus-Abfrage: Werkzeuge und Sprachen. Overview. Languages of Corpus Query. SARA Query Possibilities 1
Korpus-Abfrage: Werkzeuge und Sprachen Gastreferat zur Vorlesung Korpuslinguistik mit und für Computerlinguistik Charlotte Merz 3. Dezember 2002 Motivation Lizentiatsarbeit: A Corpus Query Tool for Automatically
Your boldest wishes concerning online corpora: OpenSoNaR and you
1 Your boldest wishes concerning online corpora: OpenSoNaR and you Martin Reynaert TiCC, Tilburg University and CLST, Radboud Universiteit Nijmegen TiCC Colloquium, Tilburg University. October 16th, 2013
COMPUTATIONAL DATA ANALYSIS FOR SYNTAX
COLING 82, J. Horeck~ (ed.j North-Holland Publishing Compa~y Academia, 1982 COMPUTATIONAL DATA ANALYSIS FOR SYNTAX Ludmila UhliFova - Zva Nebeska - Jan Kralik Czech Language Institute Czechoslovak Academy
Modern foreign languages
Modern foreign languages Programme of study for key stage 3 and attainment targets (This is an extract from The National Curriculum 2007) Crown copyright 2007 Qualifications and Curriculum Authority 2007
Using German corpora for linguistic purposes. Dr. Kathrin Steyer Institut für Deutsche Sprache, Mannheim
Using German corpora for linguistic purposes Dr. Kathrin Steyer Institut für Deutsche Sprache, Mannheim Introduction This talk will give a first impression of the complex field of German corpora and methods
Dr. Anna Maria Schneider
Dr. Anna Maria Schneider Postdoctoral Researcher Faculty of Economics and Business Administration Humboldt Universität zu Berlin Rosenstraße 19 10178 Berlin, Germany anna maria.schneider[at]wiwi.hu berlin.de
University of Massachusetts Boston Applied Linguistics Graduate Program. APLING 601 Introduction to Linguistics. Syllabus
University of Massachusetts Boston Applied Linguistics Graduate Program APLING 601 Introduction to Linguistics Syllabus Course Description: This course examines the nature and origin of language, the history
Natural Language to Relational Query by Using Parsing Compiler
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 4, Issue. 3, March 2015,
APA Annotated Bibliography (Haddad)
APA Annotated Bibliography (Haddad) Gender and Online Communication 1 Arman Haddad Professor Andrews Psychology 101 14 October XXXX Patterns of Gender-Related Differences in Online Communication: An Annotated
Corpus and Discourse. The Web As Corpus. Theory and Practice MARISTELLA GATTO LONDON NEW DELHI NEW YORK SYDNEY
Corpus and Discourse The Web As Corpus Theory and Practice MARISTELLA GATTO B L O O M S B U R Y LONDON NEW DELHI NEW YORK SYDNEY Contents List of Figures xiii List of Tables xvii Preface xix Acknowledgements
CLARIN project DiscAn :
CLARIN project DiscAn : Towards a Discourse Annotation system for Dutch language corpora Ted Sanders Kirsten Vis Utrecht Institute of Linguistics Utrecht University Daan Broeder TLA Max-Planck Institute
psychology and its role in comprehension of the text has been explored and employed
2 The role of background knowledge in language comprehension has been formalized as schema theory, any text, either spoken or written, does not by itself carry meaning. Rather, according to schema theory,
Search and Information Retrieval
Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search
Quantitative Text Typology The Impact of Sentence Length
Quantitative Text Typology The Impact of Sentence Length Emmerich Kelih 1, Peter Grzybek 1, Gordana Antić 2, and Ernst Stadlober 2 1 Department for Slavic Studies, University of Graz, A-8010 Graz, Merangasse
Schema documentation for types1.2.xsd
Generated with oxygen XML Editor Take care of the environment, print only if necessary! 8 february 2011 Table of Contents : ""...........................................................................................................
Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features
, pp.273-280 http://dx.doi.org/10.14257/ijdta.2015.8.4.27 Tibetan-Chinese Bilingual Sentences Alignment Method based on Multiple Features Lirong Qiu School of Information Engineering, MinzuUniversity of
CINTIL-PropBank. CINTIL-PropBank Sub-corpus id Sentences Tokens Domain Sentences for regression atsts 779 5,654 Test
CINTIL-PropBank I. Basic Information 1.1. Corpus information The CINTIL-PropBank (Branco et al., 2012) is a set of sentences annotated with their constituency structure and semantic role tags, composed
The First Online 3D Epigraphic Library: The University of Florida Digital Epigraphy and Archaeology Project
Seminar on Dec 19 th Abstracts & speaker information The First Online 3D Epigraphic Library: The University of Florida Digital Epigraphy and Archaeology Project Eleni Bozia (USA) Angelos Barmpoutis (USA)
Annotation in Language Documentation
Annotation in Language Documentation Univ. Hamburg Workshop Annotation SEBASTIAN DRUDE 2015-10-29 Topics 1. Language Documentation 2. Data and Annotation (theory) 3. Types and interdependencies of Annotations
Mapping a Traditional Dialectal Dictionary with Linked Open Data
Mapping a Traditional Dialectal Dictionary with Linked Open Data Eveline Wandl-Vogt 1, Thierry Declerck 1,2 1 Institute for Corpus Linguistics and Text Technology, Austrian Academy of Sciences Sonnenfelsgasse
Stefan Engelberg (IDS Mannheim), Workshop Corpora in Lexical Research, Bucharest, Nov. 2008 [Folie 1]
Content 1. Empirical linguistics 2. Text corpora and corpus linguistics 3. Concordances 4. Application I: The German progressive 5. Part-of-speech tagging 6. Fequency analysis 7. Application II: Compounds
Master of Arts Program in Linguistics for Communication Department of Linguistics Faculty of Liberal Arts Thammasat University
Master of Arts Program in Linguistics for Communication Department of Linguistics Faculty of Liberal Arts Thammasat University 1. Academic Program Master of Arts Program in Linguistics for Communication
Common Core Progress English Language Arts
[ SADLIER Common Core Progress English Language Arts Aligned to the [ Florida Next Generation GRADE 6 Sunshine State (Common Core) Standards for English Language Arts Contents 2 Strand: Reading Standards
Declarative Parsing and Annotation of Electronic Dictionaries
Declarative Parsing and Annotation of Electronic Dictionaries Christian Schneiker 1, Dietmar Seipel 1, Werner Wegstein 2, and Klaus Prätor 3 1 Department of Computer Science {schneiker seipel}@informatik.uni-wuerzburg.de
TOP 10 RESOURCES FOR TEACHERS OF ENGLISH LANGUAGE LEARNERS. Melissa McGavock Director of Bilingual Education
TOP 10 RESOURCES FOR TEACHERS OF ENGLISH LANGUAGE LEARNERS Melissa McGavock Director of Bilingual Education Reading, Writing, and Learning in ESL: A Resource Book for K-12 Teachers Making Content Comprehensible
Text Mining - Scope and Applications
Journal of Computer Science and Applications. ISSN 2231-1270 Volume 5, Number 2 (2013), pp. 51-55 International Research Publication House http://www.irphouse.com Text Mining - Scope and Applications Miss
MULTIFUNCTIONAL DICTIONARIES
In: A. Zampolli, A. Capelli (eds., 1984): The possibilities and limits of the computer in producing and publishing dictionaries. Linguistica Computationale III, Pisa: Giardini, 279-288 MULTIFUNCTIONAL
Die Vielfalt vereinen: Die CLARIN-Eingangsformate CMDI und TCF
Die Vielfalt vereinen: Die CLARIN-Eingangsformate CMDI und TCF Susanne Haaf & Bryan Jurish Deutsches Textarchiv 1. The Metadata Format CMDI Metadata? Metadata Format? and more Metadata? Metadata Format?
Dutch Parallel Corpus
Dutch Parallel Corpus Lieve Macken [email protected] LT 3, Language and Translation Technology Team Faculty of Applied Language Studies University College Ghent November 29th 2011 Lieve Macken (LT
International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 ISSN 2229-5518
International Journal of Scientific & Engineering Research, Volume 4, Issue 11, November-2013 5 INTELLIGENT MULTIDIMENSIONAL DATABASE INTERFACE Mona Gharib Mohamed Reda Zahraa E. Mohamed Faculty of Science,
SignLEF: Sign Languages within the European Framework of Reference for Languages
SignLEF: Sign Languages within the European Framework of Reference for Languages Simone Greiner-Ogris, Franz Dotter Centre for Sign Language and Deaf Communication, Alpen Adria Universität Klagenfurt (Austria)
The Use of Text Corpora in Lexical Research
The Use of Text Corpora in Lexical Research Stefan Engelberg Workshop, Universitatea din Bucureşti, November 2008 http://www.ids-mannheim.de/ll/lehre/engelberg/ Webseite_CorpLex/CorpLex.html [email protected]
Teaching terms: a corpus-based approach to terminology in ESP classes
Teaching terms: a corpus-based approach to terminology in ESP classes Maria João Cotter Lisbon School of Accountancy and Administration (ISCAL) (Portugal) Abstract This paper will build up on corpus linguistic
Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov
Search and Data Mining: Techniques Text Mining Anya Yarygina Boris Novikov Introduction Generally used to denote any system that analyzes large quantities of natural language text and detects lexical or
SFSU MATESOL Conference May 8, 2009
Reflective communication: Using chat to build L2 skills Brittney Staropoli & Atsushi Koizumi Email: [email protected]; [email protected] 1. What is online chat? - a [real time] online conversation
NATURAL LANGUAGE QUERY PROCESSING USING PROBABILISTIC CONTEXT FREE GRAMMAR
NATURAL LANGUAGE QUERY PROCESSING USING PROBABILISTIC CONTEXT FREE GRAMMAR Arati K. Deshpande 1 and Prakash. R. Devale 2 1 Student and 2 Professor & Head, Department of Information Technology, Bharati
Radio D Teil 1. Deutsch lernen und unterrichten Arbeitsmaterialien. Episode 01 A Visit to the Countryside
Episode 01 A Visit to the Countryside, a young man, drives to the countryside to visit his mother Hanne. He plans to relax there, but soon finds out that the idyllic country landscape has its downsides
How To Write The English Language Learner Can Do Booklet
WORLD-CLASS INSTRUCTIONAL DESIGN AND ASSESSMENT The English Language Learner CAN DO Booklet Grades 9-12 Includes: Performance Definitions CAN DO Descriptors For use in conjunction with the WIDA English
THE BACHELOR S DEGREE IN SPANISH
Academic regulations for THE BACHELOR S DEGREE IN SPANISH THE FACULTY OF HUMANITIES THE UNIVERSITY OF AARHUS 2007 1 Framework conditions Heading Title Prepared by Effective date Prescribed points Text
The Oxford Learner s Dictionary of Academic English
ISEJ Advertorial The Oxford Learner s Dictionary of Academic English Oxford University Press The Oxford Learner s Dictionary of Academic English (OLDAE) is a brand new learner s dictionary aimed at students
LARGE-SCALE DOCUMENTARY DICTIONARIES
LARGE-SCALE DOCUMENTARY DICTIONARIES ON THE INTERNET Manuscript. Last change 8/8/2011 Alexander Geyken. Large-Scale Documentary Dictionaries on the Internet. In: Gouws, Rufus Hjalmar / Heid, Ulrich / Schweickard,
AntConc: Design and Development of a Freeware Corpus Analysis Toolkit for the Technical Writing Classroom
AntConc: Design and Development of a Freeware Corpus Analysis Toolkit for the Technical Writing Classroom Laurence Anthony Waseda University [email protected] Abstract In this paper, I will
Study Plan for Master of Arts in Applied Linguistics
Study Plan for Master of Arts in Applied Linguistics Master of Arts in Applied Linguistics is awarded by the Faculty of Graduate Studies at Jordan University of Science and Technology (JUST) upon the fulfillment
MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts
MIRACLE at VideoCLEF 2008: Classification of Multilingual Speech Transcripts Julio Villena-Román 1,3, Sara Lana-Serrano 2,3 1 Universidad Carlos III de Madrid 2 Universidad Politécnica de Madrid 3 DAEDALUS
INTERVALS WITHIN AND BETWEEN UTTERANCES. (.) A period in parentheses indicates a micropause less than 0.1 second long. Indicates audible exhalation.
All of the articles in this Special Issue use Jefferson s (2004) transcription conventions (see also Psathas, 1995; Hutchby and Woofit, 1998; ten Have; Markee and Kasper, 2004), in accordance with the
The SYSTRAN Linguistics Platform: A Software Solution to Manage Multilingual Corporate Knowledge
The SYSTRAN Linguistics Platform: A Software Solution to Manage Multilingual Corporate Knowledge White Paper October 2002 I. Translation and Localization New Challenges Businesses are beginning to encounter
Bridging CAQDAS with text mining: Text analyst s toolbox for Big Data: Science in the Media Project
Bridging CAQDAS with text mining: Text analyst s toolbox for Big Data: Science in the Media Project Ahmet Suerdem Istanbul Bilgi University; LSE Methodology Dept. Science in the media project is funded
Department of English. University of Innsbruck
Department of English University of Innsbruck Welcome! Welcome to the Department of English at the University of Innsbruck! Founded in 1898, our department is one of the oldest English departments in Austria.
Information and documentation The Dublin Core metadata element set
ISO TC 46/SC 4 N515 Date: 2003-02-26 ISO 15836:2003(E) ISO TC 46/SC 4 Secretariat: ANSI Information and documentation The Dublin Core metadata element set Information et documentation Éléments fondamentaux
Latin Syllabus S2 - S7
European Schools Office of the Secretary-General Pedagogical Development Unit Ref.: 2014-01-D-35-en-2 Orig.: FR Latin Syllabus S2 - S7 APPROVED BY THE JOINT TEACHING COMMITTEE ON 13 AND 14 FEBRUARY 2014
Register Differences between Prefabs in Native and EFL English
Register Differences between Prefabs in Native and EFL English MARIA WIKTORSSON 1 Introduction In the later stages of EFL (English as a Foreign Language) learning, and foreign language learning in general,
The Power of Dots: Using Nonverbal Compensators in Chat Reference
The Power of Dots: Using Nonverbal Compensators in Chat Reference Jack M. Maness University Libraries University of Colorado at Boulder 184 UCB 1720 Pleasant St. Boulder, CO 80309 303-492-4545 [email protected]
Language Meaning and Use
Language Meaning and Use Raymond Hickey, English Linguistics Website: www.uni-due.de/ele Types of meaning There are four recognisable types of meaning: lexical meaning, grammatical meaning, sentence meaning
Semantic annotation of requirements for automatic UML class diagram generation
www.ijcsi.org 259 Semantic annotation of requirements for automatic UML class diagram generation Soumaya Amdouni 1, Wahiba Ben Abdessalem Karaa 2 and Sondes Bouabid 3 1 University of tunis High Institute
Terminology Extraction from Log Files
Terminology Extraction from Log Files Hassan Saneifar 1,2, Stéphane Bonniol 2, Anne Laurent 1, Pascal Poncelet 1, and Mathieu Roche 1 1 LIRMM - Université Montpellier 2 - CNRS 161 rue Ada, 34392 Montpellier
Differences in linguistic and discourse features of narrative writing performance. Dr. Bilal Genç 1 Dr. Kağan Büyükkarcı 2 Ali Göksu 3
Yıl/Year: 2012 Cilt/Volume: 1 Sayı/Issue:2 Sayfalar/Pages: 40-47 Differences in linguistic and discourse features of narrative writing performance Abstract Dr. Bilal Genç 1 Dr. Kağan Büyükkarcı 2 Ali Göksu
Reading Competencies
Reading Competencies The Third Grade Reading Guarantee legislation within Senate Bill 21 requires reading competencies to be adopted by the State Board no later than January 31, 2014. Reading competencies
Pragmatic analysis of hotel websites in terms of interpersonal relationships. Theses of the PhD dissertation by. Kovács Péterné Dudás Andrea
Pragmatic analysis of hotel websites in terms of interpersonal relationships Theses of the PhD dissertation by Kovács Péterné Dudás Andrea Eötvös Loránd University Faculty of Humanities Doctoral School
Building and Annotating Corpora of Computer-Mediated Communication: Issues and Challenges at the Interface of Corpus and Computational Linguistics
Volume 29 Number 2 2014 ISSN 2190-6858 Journal for Language Technology and Computational Linguistics Building and Annotating Corpora of Computer-Mediated Communication: Issues and Challenges at the Interface
Processing: current projects and research at the IXA Group
Natural Language Processing: current projects and research at the IXA Group IXA Research Group on NLP University of the Basque Country Xabier Artola Zubillaga Motivation A language that seeks to survive
