1 1 Your boldest wishes concerning online corpora: OpenSoNaR and you Martin Reynaert TiCC, Tilburg University and CLST, Radboud Universiteit Nijmegen TiCC Colloquium, Tilburg University. October 16th, 2013
2 Outline 1 Section 1: SoNaR Overview 2 Section 2: SoNaR-500 Content Highlights 3 Section 3: Towards OpenSoNaR 4 Section 4: A peek at other online corpora 5 Section 5: Your own role in OpenSoNaR
3 3 Section 1: SoNaR Overview SoNaR: Patron & Consortium Patron: Dutch Language Union (Nederlandse Taalunie) in the STEVIN programme Consortium: Netherlands: Radboud University Nijmegen, Tilburg University, Twente University & Utrecht University Belgium (Flanders): University College Ghent, KU Leuven
4 Section 1: SoNaR Overview 4 STEVIN Nederlandstalig Referentiecorpus (SoNaR) project Directed at the actual construction of a 500 MW reference corpus of contemporary standard written Dutch as encountered in texts originating from the Dutch speaking language area in Flanders and the Netherlands as well as translations published in and targeted at this area Where possible, (re)use of formats, tools, protocols, and annotation schemes previously developed (D-Coi, but also COREA, DPC, Lassy)
5 Section 1: SoNaR Overview MW Reference corpus (SoNaR-500) includes native speaker language and the language of (professional) translators approx. two-thirds of the data originate from the Netherlands and one-third from Flanders only texts included from the year 1954 onwards, in practice: only born-digital texts comprises full texts rather than text samples includes texts from a wide range of 40 text types incl. texts from the new media for (almost) all data IPR has been arranged
6 6 Section 1: SoNaR Overview SoNaR-500: Automatic linguistic annotations automatic sentence splitting and tokenization: ILKTOK automatic POS tagging and lemmatisation and morphological analysis using FROG for all data tagset is essentially the set used in the CGN project automatic NE labeling by means of NERD Named Entity Recognition for Dutch: NE classifier developed on the basis of SoNaR-1
7 Section 2: SoNaR-500 Content Highlights 7 Contents Besides the texts: all available metadata has been recorded Over 30 text types old and new media all contemporary, written Dutch. Essentially all born-digital. everything from books to tweets... Only discuss a few of the highlights here...
8 8 Section 2: SoNaR-500 Content Highlights Old Media: Highlights: Books Largest donation by Dutch publisher Maarten Muntinga 145 titles: travel, thrillers, chick lit, life style, etc. almost 14 MW, almost no IPR-restrictions great news for KNAW project "The riddle of literary quality" 4 versions of the Bible 3 protestant, 1 catholic N-gram frequency lists are a shambles!!
9 9 Section 2: SoNaR-500 Content Highlights Old Media: Highlights: Magazines/Periodicals Largest donation by Flemish publisher Roularta all CD-roms published between 1991 and 2004 weekly delivery of new issues of their several titles no commercial use no language restriction: huge comparable corpus Dutch-French a possibility! Largest Dutch donation: De Groene Amsterdammer 12 years editions
10 10 Section 2: SoNaR-500 Content Highlights Old Media: Highlights: Newspapers Largest Flemish donation: Mediargus 1.3 BW available to the community, no IPR-agreement IPR-restriction 1: Take only what is required have taken Edition 2006 of 4 main Flemish newspapers. IPR-restriction 2: no commercial use Largest Dutch donation: PCM / De Persgroep Selection from Twente News Corpus 60MW
11 11 Section 2: SoNaR-500 Content Highlights Old Media: Highlights: Reports NIOD (Institute for War, Holocaust and Genocide Studies) Srebrenica report over 7,000 pages of text probably best known report in Dutch ever potential for building a parallel corpus: English version also online
12 12 Section 2: SoNaR-500 Content Highlights Old/New Media Highlights: Subtitles Largest Flemish donation: VRT wealth of TV-series, a.o. De Kampioenen, MW even more still in surplus Open Subtitle Corpus (Tiedemann) balanced set of film subtitles, produced by volunteers classified as both Dutch and Flemish huge surplus of over 384MW available...
13 13 Section 2: SoNaR-500 Content Highlights New Media: Highlights: Discussion Lists Politics.be largest internet forum in Flanders. IPR-settled. site delivered over 480MW, over 2 years ago... Incorporated only 25MW... TMF: Flemish youth forum highly dialectical... about 70% of posts classified by language recognizer TextCat as: " I don t know; Perhaps this is a language I haven t seen before? " 20MW were incorporated...
14 14 Section 2: SoNaR-500 Content Highlights New Media: Highlights: Chats Chats chats NL with IPR-agreement and metadata: 0.5MW elicited, mainly collected in schools and within a research team chats NL no IPR-agreement and without metadata: 8 M natural chat VL with IPR-agreement, without metadata: 9 M natural
15 15 Section 2: SoNaR-500 Content Highlights New Media: Highlights: Tweets & SMS Twitter: did not yet exist in 2005, no target in SoNaR MW incorporated token count prior to tokenization... SMS collected in two parallel campaigns in NL and VL a prize was offered to donators a well balanced (in terms of 1/3 VL vs. 2/3 NL) set of over 50K SMS were collected these represent about 620K SMS-words
16 16 Section 3: Towards OpenSoNaR Concluding remarks on the SoNaR corpus SoNaR project represents an excellent return on investment: SoNaR-500 definitely allows to sound most dimensions of contemporary, written Dutch SoNaR-1 has yielded world-class, large gold standards for semantic annotation SoNaR fills major gaps in the Dutch language resources infrastructure BUT: Its 2.1M text files accompanied by 2.1M metadata files present a major practical obstacle to most anyone Which is why we are building OpenSoNaR: a system for Online Personal Exploration and Navigation of SoNaR"
17 Section 3: Towards OpenSoNaR 17 Opening remarks on the OpenSoNaR project We want SoNaR online in as open a fashion as possible: usable for schoolkids and researchers alike You, researchers, are hereby invited to tell us what you expect from this system Please provide us with your wish lists, user cases The Dutch Language Union and the IPR agreements with text providers impose some restrictions: these we will honour The whole corpus is essentially free for non-commercial research purposes
18 Section 3: Towards OpenSoNaR 18 Intentions of OpenSoNaR Address this corpus exploration and exploitation problem Develop and make available an open corpus exploration and exploitation environment Tailor the front-ends to the desiderata of 4 CLARIN-NL priority groups of users
19 Section 3: Towards OpenSoNaR 19 Means employed by OpenSoNaR Intended originally to build on the well-known Corpus Workbench back-end Turned out INL had been developing a new alternative Java system BlackLab, based on Apache Lucene, open source via GitHub OpenSoNaR will build Whitelab front-ends according to the user groups specifications
20 Section 3: Towards OpenSoNaR 20 OpenSoNaR User Groups in more detail CLARIN-NL posited 6 priority user groups. OpenSoNaR addresses priority groups 2 to 5: Literary Sciences: Huygens ING, Karina van Dalen-Oskam, assisted by colleagues at Utrecht University Cultural Sciences: Meertens Institute, Nicoline van der Sijs, assisted by Ewoud Sanders Linguistics: TiCC: Jan Renkema & Ad Backus & colleagues Communication and Media Studies: TiCC: Fons Maes & Emiel Krahmer & colleagues
21 Section 4: A peek at other online corpora 21 Nederlab URL: https://www.nederlab.nl/onderzoeksportaal/ NWO Groot project Goes far beyond OpenSoNaR 4M euro budget vs. 150K, runs 5 years vs. 1 Diachronic vs. synchronic: all Dutch corpora, since about the year 800 Laboratory for research Exploration of the corpora: through time, space, genres: metadata Exploitation of the corpora: possible to make your own subcorpus, put that in your private or privately shared work space Analysis of the corpora: tools for all kinds of text/topic mining, comparison making, measuring particular features,... Let us take a look at the Clariah Demonstrator...
22 Section 4: A peek at other online corpora 22 Språkbanken (the Swedish Language Bank) 147 corpora to date, 1.4 Billion word tokens Korp (E.: crow): the corpora Karp (E: carp): the dictionaries From bird s eye view of the text to below surface information Mediated by wonderful little icons Below surface information from dictionaries, Wordnet, automatic parsing etc. Based on older technology: Corpus Work Bench Uses Corpus Query Language More elaborate than OpenSoNaR Less so than Nederlab
23 Section 4: A peek at other online corpora 23 Brieven als Buit: Why does he not write?" URL: About 1, th century Dutch letters Subset of the Gekaapte Brieven : mail taken as loot by privateers and confiscated by the High Court of Admiralty during the wars fought between The Netherlands and England Digitized and annotated by Dutch Institute of Lexicology (INL) INL corpus back-end system BlackLab also the basis for OpenSoNaR OpenSoNaR will at first have the functionalities of this site Let s get acquainted!
24 24 Section 5: Your own role in OpenSoNaR OpenSoNaR vs. these three online resources To summarize: Språkbanken Definitely a great source for inspiration and emulation Brieven als Buit Great start that already will provide good basic functionality to further build on Nederlab If the OpenSoNaR project provides, Nederlab will no doubt be enhanced by its accomplishments
25 Section 5: Your own role in OpenSoNaR 25 OpenSoNaR: online soon We are setting up a server for the OpenSoNaR website URL: Will host the rudimentary OpenSoNaR soon, corpus is being indexed at INL We will keep you informed, of course Practical guidelines for submitting wish-lists, case descriptions and for getting feedback via our issue tracker will be made available asap.
26 Section 5: Your own role in OpenSoNaR 26
27 Section 5: Your own role in OpenSoNaR 27
28 28 Thanks!! Section 5: Your own role in OpenSoNaR Thanks for your attention! Papers about SoNaR are available at: Your boldest wishes concerning online corpora: OpenSoNaR and you Martin Reynaert TiCC, Tilburg University and CLST, Radboud Universiteit Nijmegen
Dutch-Flemish Research Programme for Dutch Language and Speech Technology stevin programme project results Contents Design and editing: Erica Renckens (www.ericarenckens.nl) Design front cover: www.nieuw-eken.nl
Application form Investment Subsidy NWO Large The completed application form including attachments should Please read the brochure Investment Subsidy NWO Large 2011- be submitted electronically via the
Top 10 Reasons Faculty Fail When Using Blackboard CMS Dr. Penny Johnson Computer Science Carroll College email@example.com Abstract In today s ever increasing world of information technology it is not enough
Beyond Recommender Systems: Helping People Help Each Other Loren Terveen and Will Hill AT&T Labs - Research Abstract The Internet and World Wide Web have brought us into a world of endless possibilities:
http://www.davidlankes.org TITLE: Virtual Service AUTHOR(s): R. David Lankes PUBLICATION TYPE: Chapter DATE: 2002 FINAL CITATION: Virtual Service. Lankes, R. David (2002). In Melling, M. & Little, J. (Ed.),
THE ALTE CAN DO PROJECT English version Articles and Can Do statements produced by the members of ALTE 1992-2002 Contents Page Part 1 Appendix D to Council of Europe Common European Framework of Reference
DELIVERABLE Project Acronym: Ev3 Grant Agreement number: 620484 Project Title: Europeana Version 3 D1.1: RECOMMENDATIONS TO IMPROVE AGGREGATION INFRASTRUCTURE Revision 1.0 Date of submission 3 March 2015
A VISION OF THE ROLE AND FUTURE OF WEB ARCHIVES Kalev H. Leetaru 1 Graduate School of Library and Information Science University of Illinois Imagine a world in which libraries and archives had never existed.
The Invisible Library: Paradox of the Global Information Infrastructure CHRISTINEL. BORGMAN ABSTRACT LIBRARIESARE AN ESSENTIAL COMPONENT of a nation s information infrastructure, yet often they are invisible
Survey of learner corpora Norma A. Pravec Montclair State University 1 Introduction A learner corpus is a computerized textual database of the language produced by foreign language learners (Leech 1998).
Using the CEFR: Principles of Good Practice October 2011 What [the CEFR] can do is to stand as a central point of reference, itself always open to amendment and further development, in an interactive international
Many Hands Make Light Work: Public Collaborative OCR Text Correction in Australian Historic Newspapers Rose Holley March 2009 National Library of Australia ISBN 978 0 642 27694 0 About the author: Rose
http://www.davidlankes.org TITLE: Statistics, Measures and Quality Standards for Assessing Digital Library Services: Guidelines and Procedures AUTHOR(s): Charles McClure, R. David Lankes, Melissa Gross,
SENSE Midterm Review 2004 Wageningen University Leiden University Utrecht University University of Amsterdam (UvA) Vrije Universiteit, Amsterdam (VU) WIMEK - Wageningen Institute for Environment and Climate
Understanding and Exploiting User Intent in Community Question Answering Long Chen Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the Department of Computer
PARSE.Insight Deliverable D3.4 Survey Report Insight Project Number into 223758 digital preservation Project Title PARSE.Insight. INSIGHT into issues of Permanent Access to the Records of Science in Europe
Accessible Information and Communication Acknowledgements The : is an initiative of the Global Alliance for Accessible Technologies and Environments (GAATES). GAATES is the leading international notfor-profit
E-LEARNING C O N C E P T S, T R E N D S, A P P L I C A T I O N S About Epignosis LLC. All rights reserved. Epignosis LLC 315 Montgomery Street, 8th and 9th Floors San Francisco, California, CA 94104 United
Forum Product Development & Dissemination Guide National Forum on Education Statistics Sponsored by the National Center for Education Statistics Questions or comments should be directed to the Chairperson
DOI 10.1007/s10734-007-9065-5 Is that paper really due today? : differences in first-generation and traditional college students understandings of faculty expectations Peter J. Collier Æ David L. Morgan
SuccessFactors Admin: Recruiting Management Admin Guide v1204 (One Admin) For SuccessFactors v12 (One Admin) Last Modified 07/17/2012 2012 SuccessFactors, Inc. All rights reserved. Execution is the Difference
Business Certificates Business English Certificates (BEC) CEFR Levels B1, B2, C1 Handbook for Teachers CONTENTS Preface This handbook is for teachers who are preparing candidates for Cambridge English:
FINAL REPORT Union of International Associations 2013 Associations Survey International meeting organization and issues UIA International Meetings Issues Survey - 2013 Published November 2013 http://www.uia.org/publications/meetings-survey
Research report January 2010 CREATING AN ENGAGED WORKFORCE CREATING AN ENGAGED WORKFORCE FINDINGS FROM THE KINGSTON EMPLOYEE ENGAGEMENT CONSORTIUM PROJECT This report has been written by: Kerstin Alfes,
How to write CVs and Cover Letters lse.ac.uk/careers Contents Introduction 3 Before you start 4 How LSE Careers can help 5 Layout and design 6 Personal details 10 Education 12 Work experience 14 Achievements,
1. Project Title & Acronym and Abstract Title: Verrijkt Koninkrijk (Enriched Kingdom) Acronym: VK Abstract: Dr Loe de Jong s Het Koninkrijk der Nederlanden in de Tweede Wereldoorlog remains the most appealing
Part 1: General Portfolio Instructions Part 1 provides the following key resources, guidelines, and support: How to Use the Portfolio Instructions outlines the content of these instructions and provides
A Guide to Quality in Online Learning This work is licensed under the Creative Commons Attribution-ShareAlike 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/3.0/