Cultural Trends and language change

Gosse Bouma
Information Science, University of Groningen
NHL 2015/03

[Figure: Popularity of Wolf in English books]

Google Books Ngrams: Digital Library
Google Books is a project in which books are scanned, turned into text using OCR (Optical Character Recognition), and made searchable with Google search.
Currently approx. 20 M books, mostly English, mostly since 1800.
Google Books Ngrams: a valuable resource for cultural and linguistic studies.

Google Books & ngrams viewer
The Google Labs N-gram Viewer is the first tool of its kind, capable of precisely and rapidly quantifying cultural trends based on massive quantities of data. It is a gateway to culturomics! The browser is designed to enable you to examine the frequency of words (banana) or phrases ("United States of America") in books over time. You'll be searching through over 5.2 million books: 4% of all books ever published!
(A-users-guide-to-culturomics)

[Figure: Popularity of various *isms]

Google Books Ngrams Viewer
Jean-Baptiste Michel et al., Quantitative Analysis of Culture Using Millions of Digitized Books. Science.
books.html

Google Books Ngrams Viewer: Examples
freedom, liberty
vampire, werewolf, zombie
computer, phone, radio, gun
radio, television, internet
best, beft

Google Books 2.0: The USA is/are
Do people think of the United States as a singular or plural entity? And is this constant over time?
We can answer this question by computing how often "the United States" is followed by a frequent plural verb and dividing the result by the overall frequency of "the United States". We can do the same for "the United States" followed by a frequent singular verb.
The result is shown here: [figure not preserved]
Notice the use of arithmetic to sum and divide frequencies.
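The sum-and-divide arithmetic mentioned on this slide can be sketched in a few lines of Python. This is only an illustration: the yearly counts are made-up placeholders, not Google Books data, and the function name `share` is hypothetical.

```python
# Placeholder counts per year (NOT real Google Books data).
plural = {
    "The United States are": {1800: 40, 1900: 10},
    "The United States have": {1800: 20, 1900: 10},
}
singular = {
    "The United States is": {1800: 30, 1900: 200},
    "The United States has": {1800: 10, 1900: 100},
}
# Overall count of "the United States" per year (also a placeholder).
total = {1800: 200, 1900: 400}


def share(variants, total):
    """Sum the variant counts per year and divide by the overall count."""
    return {
        year: sum(counts[year] for counts in variants.values()) / total[year]
        for year in total
    }


plural_share = share(plural, total)
singular_share = share(singular, total)
```

With these placeholder numbers the plural share falls and the singular share rises between the two years, which is the kind of trend the is/are query is meant to expose.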

Determiners and Country Names
"The Ukraine" is incorrect both grammatically and politically, says Oksana Kyzyma of the Embassy of Ukraine in London. "Ukraine is both the conventional short and long name of the country," she says. "This name is stated in the Ukrainian Declaration of Independence and Constitution."

Google Books 2.0
Spelling, phrasal expressions
  recognise, recognize (try)
  gone missing (try)
  graduated college vs graduated from college
Search using part of speech
  experience_NOUN vs. experience_VERB (try)
  to always _VERB_, never, quickly, boldly
Arithmetic
  (The United States have + The United States are)/the United States
  (The United States has + The United States is)/the United States (try)

Literary Applications: First Person Fiction
Ted Underwood suggests there is a sharp drop in the percentage of first person narratives around 1800.
Can we investigate this using corpus linguistics?
we-dont-already-know-the-broad-outlines-of-literary-history

Literary Applications
Given novels that are clearly written with a 1st or 3rd person narrator:
Which words occur significantly more often in 1st or 3rd person novels?

Literary Applications
Given a large collection of fiction books:
Does the ratio between 1st and 3rd person pronouns change over time?
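One way to approach the pronoun question is sketched below. It is a minimal illustration under stated assumptions: the pronoun lists, the simple regex tokenizer, and the function name `first_person_share` are all choices made for this example, not part of the original slides. Instead of a raw ratio it reports the first-person share, first / (first + third), which stays defined when one of the two counts is zero.

```python
import re
from collections import Counter

# Assumed pronoun lists (illustrative, not exhaustive).
FIRST = {"i", "me", "my", "mine", "myself",
         "we", "us", "our", "ours", "ourselves"}
THIRD = {"he", "him", "his", "himself",
         "she", "her", "hers", "herself",
         "they", "them", "their", "theirs", "themselves"}


def first_person_share(books):
    """books: iterable of (year, text) pairs.

    Returns {year: first / (first + third)} for every year in which
    at least one first or third person pronoun was found.
    """
    first, third = Counter(), Counter()
    for year, text in books:
        for token in re.findall(r"[a-z]+", text.lower()):
            if token in FIRST:
                first[year] += 1
            elif token in THIRD:
                third[year] += 1
    years = sorted(set(first) | set(third))
    return {y: first[y] / (first[y] + third[y]) for y in years}


# Toy input; real use would pass full book texts with publication years.
demo = first_person_share([
    (1790, "I saw him and he saw me"),
    (1810, "She gave them her book"),
])
```

On a real collection the years would typically be binned (for example per decade) before comparing shares.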

Using Syntax: Dependency Relations
What are frequent direct objects of drink?
  drink => *_NOUN
What things are magnificent?
  *_NOUN_ => magnificent

Google Ngrams
Google BOOKS ngram viewer uses books.
Google NGRAMS is Web data.

Dutch Twitter Corpus
Since 2011 the Information Science Department of the University of Groningen has been collecting Dutch language tweets.
The goal is to collect a representative sample of all tweets posted in Dutch.
We estimate that our method captures approximately 40-60% of the relevant tweets.
Example tweets:
  RieksOsinga #CTAboutaleb op Mooie en inspirerende woorden.
  AHPOIESZ Vandaag ons winnend concept gepresenteerd aan College van
  is er een probleem met de mail? Krijg namelijk een 500 interland server Error. Gelukkig! ICT geeft ook aan dat er geen storing, dat scheelt;-) Fijne zondag nog!
  NHL_Hogeschool Klaar voor collegereeks Met twee bedrijven

Spelling Variation
How often is "eens" written as "is"?
Dit kan uiteraard wel is voorkomen in de statistiek. (English: "This can of course occasionally occur in the statistics.")
Source: Dutch Twitter Ngram counts. Query: wel [is,eens] %en

trigram               count    perc
wel eens voorkomen    ...      ...
wel is voorkomen      ...      ...
wel eens gebeuren     2,...    ...
wel is gebeuren       ...      ...
wel eens zien         34,...   ...
wel is zien           10,...   ...

Spelling Variation: zeςma
"Zeς ma is a discourse marker of Arabic etymology that is used in North Africa as well as in the French and Dutch varieties spoken by the North African diaspora in Europe... On the internet, users either omit ς or use another character, e.g. the digit 3." (Bouwmans, 2003)
Source: Dutch Twitter Ngram counts. Query: ze%ma

hits      word
76,218    zehma
45,561    ze3ma
29,944    zegma
15,058    Zehma
8,448     Ze3ma
2,845     Zegma
2,149     ze3hma
1,707     zemma
1,568     ZEHMA
1,553     zema
(and at least 20 other spelling variants)
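The hit counts listed on this slide can be turned into shares per spelling with a short Python sketch. The counts are copied from the slide; merging variants that differ only in capitalisation is an assumption made for this example, and the shares cover only the ten listed spellings, not the 20 or more variants the slide leaves out.

```python
from collections import Counter

# Hit counts as listed on the slide (query ze%ma).
hits = {
    "zehma": 76218, "ze3ma": 45561, "zegma": 29944, "Zehma": 15058,
    "Ze3ma": 8448, "Zegma": 2845, "ze3hma": 2149, "zemma": 1707,
    "ZEHMA": 1568, "zema": 1553,
}


def fold_case(counts):
    """Merge spellings that differ only in capitalisation."""
    folded = Counter()
    for word, n in counts.items():
        folded[word.lower()] += n
    return folded


def shares(counts):
    """Share of each spelling among the counted variants."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


folded = fold_case(hits)
variant_share = shares(folded)

for word, n in folded.most_common():
    print(f"{word:8s} {n:7d} {variant_share[word]:6.1%}")
```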

Language Change: Popularity of der (her/there)
ik ga der geld geven voor der verjaardag (English: "I am going to give her money for her birthday")
Mag je der vaseline opdoen? (English: "Are you allowed to put vaseline on it?")
Dutch Twitter Ngram counts 2014

Ngram statistics
Why use ngram counts?
Given enough data, ngram frequencies are often sufficient to study variation and trends.

Dutch Twitter Corpus    # (Million)
tweets                  2,500
tokens                  28,000
unigrams                ...
bigrams                 ...
trigrams                ...
grams                   ...
grams                   ...
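As a rough illustration of what an ngram count is, the sketch below counts ngrams in a tokenised tweet. It assumes plain whitespace tokenisation and a hypothetical helper `ngram_counts`; the corpus itself was presumably built with more careful tokenisation and at a very different scale.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count all contiguous n-token sequences in a list of tokens."""
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


# Toy example: one tweet, whitespace tokenisation (an assumption).
tokens = "dit kan uiteraard wel is voorkomen".split()
bigrams = ngram_counts(tokens, 2)
trigrams = ngram_counts(tokens, 3)
```

Summing such counters over all tweets gives the unigram, bigram, and trigram tables that the queries on the earlier slides run against.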

Comparable Tools and Resources
Twitter viewers (Univ Groningen, twiqs.nl): links to actual tweets, trends, metadata; slow for large periods and/or frequent ngrams
Google Web 1T 5-Gram Database for European languages: ngram counts for 133 billion words of Dutch webtext
Regex search, collocations: Corpus Frequency counts
Keuleers et al (2010), word frequencies based on Dutch subtitles...
Rovereto Twitter n-gram corpus with demographic metadata
  Herdagdelen (2013): a Twitter-based dataset using n-grams, thereby overcoming the limitations on the redistribution of raw tweets
  n-gram counts for 75 million English tweets
  With gender-of-author and time-of-posting

Twitter Ngrams Web Interface
Raw ngram counts ( )
Limited regex support, export results as csv, collocations, associations
Run your own experiment: download ngrams data
Evert (2010), Google Web 1T 5-Grams Made Easy (but not for the computer)

Twitter Ngrams Web Interface: Trends
Relative frequencies per month
For ngrams occurring at least once in each month
Using SQLite + Google Tables
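A minimal sketch of how relative frequencies per month might be computed with SQLite, using Python's standard `sqlite3` module. The table layout (`counts`, `totals`), the column names, and the demo numbers are all assumptions made for this illustration; the slides do not describe the actual database schema.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a hypothetical schema and toy data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE counts (ngram TEXT, month TEXT, n INTEGER)")
    conn.execute("CREATE TABLE totals (month TEXT PRIMARY KEY, total INTEGER)")
    conn.executemany("INSERT INTO counts VALUES (?, ?, ?)", [
        ("wel eens", "2014-01", 10),
        ("wel eens", "2014-02", 30),
        ("wel is", "2014-01", 5),   # missing in 2014-02, so filtered out
    ])
    conn.executemany("INSERT INTO totals VALUES (?, ?)", [
        ("2014-01", 1000),
        ("2014-02", 2000),
    ])
    return conn


def monthly_relative_frequencies(conn):
    """Relative frequency per month, only for ngrams seen in every month."""
    query = """
        SELECT c.ngram, c.month, 1.0 * c.n / t.total AS rel
        FROM counts AS c
        JOIN totals AS t ON t.month = c.month
        WHERE c.ngram IN (
            SELECT ngram FROM counts
            GROUP BY ngram
            HAVING COUNT(DISTINCT month) = (SELECT COUNT(*) FROM totals)
        )
        ORDER BY c.ngram, c.month
    """
    return conn.execute(query).fetchall()


rows = monthly_relative_frequencies(build_demo_db())
```

The `HAVING` clause implements the "at least once in each month" restriction; dividing by the monthly total makes months with different tweet volumes comparable.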

Twitter vs Google Web Ngrams
een meisje/liedje/... die/dat

noun       Twitter %die    web %die    ratio
meisje     ...             ...         ...
liedje     ...             ...         ...
kind       ...             ...         ...
type       ...             ...         ...
bedrijf    ...             ...         ...
boek       ...             ...         ...
nummer     ...             ...         ...
geld       ...             ...         ...
filmpje    ...             ...         ...
ding       ...             ...         ...

Twitter: , Web Ngrams: 2008

Enjoy!


More information

Index. 1. Case background 2. What did we test? 3. The results! 4. About Online Dialogue. How the usability of the internal search module

Index. 1. Case background 2. What did we test? 3. The results! 4. About Online Dialogue. How the usability of the internal search module internal search optimization with instant search How the usability of the internal search module lifted the conversion rate of this audience with 49% Index 1. Case background 2. What did we test? 3. The

More information

Utrecht Linguistic Database. Computational Tools for Linguistic Data March 15, 2002. Rapid Application Development

Utrecht Linguistic Database. Computational Tools for Linguistic Data March 15, 2002. Rapid Application Development Utrecht Linguistic Database Computational Tools for Linguistic Data March 15, 2002 Maaike Schoorlemmer Lennart Herlaar Harmen van der Iest Martin Everaert Alexis Dimitriadis Peter Ackema 1 Introduction

More information

Search and Information Retrieval

Search and Information Retrieval Search and Information Retrieval Search on the Web 1 is a daily activity for many people throughout the world Search and communication are most popular uses of the computer Applications involving search

More information

A chart generator for the Dutch Alpino grammar

A chart generator for the Dutch Alpino grammar June 10, 2009 Introduction Parsing: determining the grammatical structure of a sentence. Semantics: a parser can build a representation of meaning (semantics) as a side-effect of parsing a sentence. Generation:

More information

Crash Course on Grammar, Common Usage and APA style. Ioakim Boutakidis, Ph.D. Dept of Child & Adolescent Studies CSUF

Crash Course on Grammar, Common Usage and APA style. Ioakim Boutakidis, Ph.D. Dept of Child & Adolescent Studies CSUF Crash Course on Grammar, Common Usage and APA style Ioakim Boutakidis, Ph.D. Dept of Child & Adolescent Studies CSUF 2010 I. Punctuation: Comma Use College students generally do a good job with basic punctuation,

More information

Beyond N in N-gram Tagging

Beyond N in N-gram Tagging Beyond N in N-gram Tagging Robbert Prins Alfa-Informatica University of Groningen P.O. Box 716, NL-9700 AS Groningen The Netherlands r.p.prins@let.rug.nl Abstract The Hidden Markov Model (HMM) for part-of-speech

More information

14 Automatic language correction

14 Automatic language correction 14 Automatic language correction IA161 Advanced Techniques of Natural Language Processing J. Švec NLP Centre, FI MU, Brno December 21, 2015 J. Švec IA161 Advanced NLP 14 Automatic language correction 1

More information

Acquiring grammatical gender in northern and southern Dutch. Jan Klom, Gunther De Vogelaer

Acquiring grammatical gender in northern and southern Dutch. Jan Klom, Gunther De Vogelaer Acquiring grammatical gender in northern and southern Acquring grammatical gender in southern and northern 2 Research questions How does variation relate to change? (transmission in Labov 2007 variation

More information

Modern foreign languages

Modern foreign languages Modern foreign languages Programme of study for key stage 3 and attainment targets (This is an extract from The National Curriculum 2007) Crown copyright 2007 Qualifications and Curriculum Authority 2007

More information

Chorus Tweetcatcher Desktop

Chorus Tweetcatcher Desktop Chorus Tweetcatcher Desktop The purpose of this manual is to enable users of Chorus Tweetcatcher Desktop (Chorus-TCD) to begin collecting Twitter data for their projects. The document is split into two

More information

Workflow Solutions for Very Large Workspaces

Workflow Solutions for Very Large Workspaces Workflow Solutions for Very Large Workspaces February 3, 2016 - Version 9 & 9.1 - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

More information

NEDERBOOMS Treebank Mining for Data- based Linguistics. Liesbeth Augustinus Vincent Vandeghinste Ineke Schuurman Frank Van Eynde

NEDERBOOMS Treebank Mining for Data- based Linguistics. Liesbeth Augustinus Vincent Vandeghinste Ineke Schuurman Frank Van Eynde NEDERBOOMS Treebank Mining for Data- based Linguistics Liesbeth Augustinus Vincent Vandeghinste Ineke Schuurman Frank Van Eynde LOT Summer School - June, 2014 NEDERBOOMS Exploita)on of Dutch treebanks

More information

Long, often quite boring, notes of meetings

Long, often quite boring, notes of meetings Long, often quite boring, notes of meetings 1 Long, often quite boring, notes of meetings www.polidocs.nl Maarten Marx Universiteit van Amsterdam February 2009 Long, often quite boring, notes of meetings

More information

user checks! improve your design significantly"

user checks! improve your design significantly user checks! improve your design significantly" Workshop by Userneeds - Anouschka Scholten Assisted by ArjanneAnouk Interact Arjanne de Wolf AmsterdamUX Meet up - June 3, 2015 Make people s lives better.

More information

IP-NBM. Copyright Capgemini 2012. All Rights Reserved

IP-NBM. Copyright Capgemini 2012. All Rights Reserved IP-NBM 1 De bescheidenheid van een schaker 2 Maar wat betekent dat nu 3 De drie elementen richting onsterfelijkheid Genomics Artifical Intelligence (nano)robotics 4 De impact van automatisering en robotisering

More information

CLOUD ANALYTICS: Empowering the Army Intelligence Core Analytic Enterprise

CLOUD ANALYTICS: Empowering the Army Intelligence Core Analytic Enterprise CLOUD ANALYTICS: Empowering the Army Intelligence Core Analytic Enterprise 5 APR 2011 1 2005... Advanced Analytics Harnessing Data for the Warfighter I2E GIG Brigade Combat Team Data Silos DCGS LandWarNet

More information

Mining a Corpus of Job Ads

Mining a Corpus of Job Ads Mining a Corpus of Job Ads Workshop Strings and Structures Computational Biology & Linguistics Jürgen Jürgen Hermes Hermes Sprachliche Linguistic Data Informationsverarbeitung Processing Institut Department

More information

White Mere Community Primary School

White Mere Community Primary School White Mere Community Primary School KS2 Grammar and Punctuation Overview To ensure our pupils have a complete and secure understanding of discrete grammatical terms, punctuation and spelling rules, discrete

More information

What the National Curriculum requires in reading at Y5 and Y6

What the National Curriculum requires in reading at Y5 and Y6 What the National Curriculum requires in reading at Y5 and Y6 Word reading apply their growing knowledge of root words, prefixes and suffixes (morphology and etymology), as listed in Appendix 1 of the

More information

Cambridge Primary English as a Second Language Curriculum Framework

Cambridge Primary English as a Second Language Curriculum Framework Cambridge Primary English as a Second Language Curriculum Framework Contents Introduction Stage 1...2 Stage 2...5 Stage 3...8 Stage 4... 11 Stage 5...14 Stage 6... 17 Welcome to the Cambridge Primary English

More information

Gateway A2 Practice Online

Gateway A2 Practice Online Macmillan Practice Online is the easy way to get all the benefits of online learning and with over 100 courses to choose from, covering all competence levels and ranging from business English to exam practice

More information

TwitterCracy: Exploratory Monitoring of Twitter Streams for the 2016 U.S. Presidential Election Cycle

TwitterCracy: Exploratory Monitoring of Twitter Streams for the 2016 U.S. Presidential Election Cycle TwitterCracy: Exploratory Monitoring of Twitter Streams for the 2016 U.S. Presidential Election Cycle M. Atif Qureshi (B), Arjumand Younus, and Derek Greene Insight Center for Data Analytics, University

More information

CONTENT / ACTIVITY CAN DO PAGE LEVEL GRAMMAR

CONTENT / ACTIVITY CAN DO PAGE LEVEL GRAMMAR Speakout Starter Speakout CEF ALTE UCLES IELTS TOEIC TOEFL ibt PTE Starter - - 0-245 9-18 Elementary /A2 1 KET 3.0 246-500 19-29 1 Pre-intermediate A2/B1 2 PET 4.0 500-650 30-52 2 Intermediate B1+/B2 3

More information

Grammars and introduction to machine learning. Computers Playing Jeopardy! Course Stony Brook University

Grammars and introduction to machine learning. Computers Playing Jeopardy! Course Stony Brook University Grammars and introduction to machine learning Computers Playing Jeopardy! Course Stony Brook University Last class: grammars and parsing in Prolog Noun -> roller Verb thrills VP Verb NP S NP VP NP S VP

More information

Language Arts 7 Curriculum. Unit 1: Nouns, Pronouns, Adjectives. 2 weeks LA7.5, LA7.6

Language Arts 7 Curriculum. Unit 1: Nouns, Pronouns, Adjectives. 2 weeks LA7.5, LA7.6 Language Arts 7 Curriculum Unit 1: Nouns, Pronouns, Adjectives Lecture Exercises in textbook Worksheets Identify common and proper nouns Identify personal and possessive pronouns Identify adjectives and

More information

Kids College Computer Game Programming Exploring Small Basic and Procedural Programming

Kids College Computer Game Programming Exploring Small Basic and Procedural Programming Kids College Computer Game Programming Exploring Small Basic and Procedural Programming According to Microsoft, Small Basic is a programming language developed by Microsoft, focused at making programming

More information

THE EMOTIONAL VALUE OF PAID FOR MAGAZINES. Intomart GfK 2013 Emotionele Waarde Betaald vs. Gratis Tijdschrift April 2013 1

THE EMOTIONAL VALUE OF PAID FOR MAGAZINES. Intomart GfK 2013 Emotionele Waarde Betaald vs. Gratis Tijdschrift April 2013 1 THE EMOTIONAL VALUE OF PAID FOR MAGAZINES Intomart GfK 2013 Emotionele Waarde Betaald vs. Gratis Tijdschrift April 2013 1 CONTENT 1. CONCLUSIONS 2. RESULTS Reading behaviour Appreciation Engagement Advertising

More information

COOLS COOLS. Cools is nominated for the Brains Award! www.brainseindhoven.nl/nl/top_10/&id=507. www.cools-tools.nl. Coen Danckmer Voordouw

COOLS COOLS. Cools is nominated for the Brains Award! www.brainseindhoven.nl/nl/top_10/&id=507. www.cools-tools.nl. Coen Danckmer Voordouw Name Nationality Department Email Address Website Coen Danckmer Voordouw Dutch / Nederlands Man and Activity info@danckmer.nl www.danckmer.nl Project: Image: Photographer: Other images: COOLS CoenDVoordouw

More information

Morphology. Morphology is the study of word formation, of the structure of words. 1. some words can be divided into parts which still have meaning

Morphology. Morphology is the study of word formation, of the structure of words. 1. some words can be divided into parts which still have meaning Morphology Morphology is the study of word formation, of the structure of words. Some observations about words and their structure: 1. some words can be divided into parts which still have meaning 2. many

More information