Introducing Czech National Corpus Brown University, 04/09/16

Václav Cvrček

Introduction

Czech National Corpus project
Basic facts about the CNC:
  established in 1994 by Prof. František Čermák
  2 departments of the Faculty of Arts, Charles University in Prague (ICNC & ITCL)
  acknowledged in 2012 by MEYS as a research infrastructure for social sciences and humanities
  4,500+ registered users
  1,900 queries per day
  web portal: www.korpus.cz

Data

Language data
  SYN        2.3 bil.
  ORAL       5 mil.
  Diakorp    3.4 mil.
  InterCorp  1.4 bil.

SYN series
Written (i.e. published) synchronic texts:

  Corpus       Size   Content             Period
  SYN2000      100M   representative      1990-1999
  SYN2005      100M   representative      2000-2004
  SYN2006PUB   300M   journalistic texts  1989-2004
  SYN2009PUB   700M   journalistic texts  1995-2007
  SYN2010      100M   representative      2005-2009
  SYN2013PUB   935M   journalistic texts  2005-2009
  SYN2015      100M   representative      2010-2014
  SYN (v. 3)   2.3G   union of all SYN* corpora

All corpora are lemmatized, morphologically tagged and enriched with metadata (bibliographic information + text-type/genre classification).
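As a quick sanity check on the 2.3G figure for SYN (v. 3), the component sizes from the slide can simply be added up. This is a minimal sketch, assuming the component corpora do not overlap and that all sizes are counted in the same unit; the names `SYN_SIZES_M`, `total_size_m` and `representative_share` are illustrative and not part of any CNC tool.

```python
# Sizes in millions, copied from the SYN series slide.
SYN_SIZES_M = {
    "SYN2000": 100,
    "SYN2005": 100,
    "SYN2006PUB": 300,
    "SYN2009PUB": 700,
    "SYN2010": 100,
    "SYN2013PUB": 935,
    "SYN2015": 100,
}


def total_size_m(sizes):
    """Sum of all component sizes, in millions (assumes no overlap)."""
    return sum(sizes.values())


def representative_share(sizes):
    """Fraction of the total held by the representative corpora.

    Assumption: the corpora without the "PUB" suffix are the
    representative ones, as the slide's descriptions suggest.
    """
    representative = sum(v for k, v in sizes.items() if not k.endswith("PUB"))
    return representative / total_size_m(sizes)


total = total_size_m(SYN_SIZES_M)
print(f"Sum of components: {total}M, i.e. about {total / 1000:.1f}G")
print(f"Representative share: {representative_share(SYN_SIZES_M):.1%}")
```

Under these assumptions the components add up to roughly the stated 2.3G, with the journalistic (PUB) corpora making up most of the union.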

ORAL series
Unprepared, dialogical, informal spoken language

One-layer transcription corpora:
  ORAL2006   1.0M   Bohemian Czech only
  ORAL2008   1.0M   sociolinguistically balanced, Bohemian Czech only
  ORAL2013   2.8M   sociolinguistically balanced, whole Czech Republic, text-to-sound alignment

Older spoken corpora: Prague spoken corpus (0.5M), Brno spoken corpus (0.5M)

Diachronic corpus DIAKORP
  diachronic part of the CNC
  2.5 mil. words (v. 5)
  from the end of the 13th century to the beginning of the SYN section (1945)
  texts are transcribed, not transliterated
  current focus on 19th century lemmatization

Multilingual parallel corpus InterCorp
  Czech texts with translations to or from 30+ languages
  InterCorp (v. 8): core (= fiction) and collections (= journalism, subtitles)

             Core   Collections
  cs          85M           90M
  foreign    194M        1,229M

  partly lemmatized and tagged
  uneven amounts of text across language pairs

  Code  Language     Core    Total
  bg    Bulgarian    5.2M    28.1M
  da    Danish       3.0M    53.0M
  de    German      27.7M    77.1M
  en    English     15.5M   113.9M
  es    Spanish     17.5M   103.9M
  fi    Finnish      3.4M    45.2M
  fr    French       9.2M    87.0M
  hr    Croatian    15.5M    34.6M
  hu    Hungarian    5.4M    58.1M
  it    Italian      7.2M    65.6M
  pl    Polish      17.5M    79.9M
  ru    Russian      3.3M    13.4M
  sk    Slovak       7.4M    44.5M
  sl    Slovenian    0.9M    49.8M
  sr    Serbian      8.8M    29.6M
  uk    Ukrainian    5.1M     5.3M
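The unevenness across language pairs is easier to see when the core (fiction) part is expressed as a share of each language's total. The following is a minimal sketch using only the figures from the table above; it assumes that "Total" means core plus collections for that language, and the names `INTERCORP_M` and `core_share` are hypothetical helpers, not part of InterCorp.

```python
# (core, total) sizes in millions, copied from the per-language table.
INTERCORP_M = {
    "bg": (5.2, 28.1),
    "da": (3.0, 53.0),
    "de": (27.7, 77.1),
    "en": (15.5, 113.9),
    "es": (17.5, 103.9),
    "fi": (3.4, 45.2),
    "fr": (9.2, 87.0),
    "hr": (15.5, 34.6),
    "hu": (5.4, 58.1),
    "it": (7.2, 65.6),
    "pl": (17.5, 79.9),
    "ru": (3.3, 13.4),
    "sk": (7.4, 44.5),
    "sl": (0.9, 49.8),
    "sr": (8.8, 29.6),
    "uk": (5.1, 5.3),
}


def core_share(core, total):
    """Fraction of a language's total that belongs to the core."""
    return core / total


# Rank languages from the most core-heavy to the most collection-heavy.
ranked = sorted(
    INTERCORP_M.items(),
    key=lambda item: core_share(*item[1]),
    reverse=True,
)

for code, (core, total) in ranked:
    share = core_share(core, total)
    print(f"{code}  core {core:5.1f}M  total {total:6.1f}M  core share {share:6.1%}")
```

On these figures Ukrainian is almost entirely core text, while Slovenian consists almost entirely of collections, which is worth keeping in mind when comparing results across language pairs.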

Tools

CNC Tools
  main concordancer
  analysis of variants
  derivational morphology
  discourse analysis
  translation equivalents
All tools are available online within the portal www.korpus.cz

CNC research portal www.korpus.cz

KonText CNC concordancer


SyD exploring variants


Treq translation equivalents

Translation candidates for "workshop"

  %      Czech                        English
  39.4   dílna (workroom)             workshop
  30.4   seminář (seminar)            workshop
   8.7   workshop                     workshop
   4.6   pracovní (working)           workshop
   2.3   kurs (course)                workshop
   1.7   garáž (garage)               workshop
   1.7   krejčovna (tailor's shop)    workshop
   0.9   ateliér (studio)             workshop
   0.8   továrna (factory)            workshop
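One way to read such a candidate list is to ask how much of the usage the top few equivalents cover. The sketch below accumulates the percentages from the slide; it assumes the percentages are shares of all observed translations of "workshop" and that the remainder belongs to candidates not shown. The names `CANDIDATES`, `cumulative_coverage` and `candidates_needed` are illustrative and do not come from Treq.

```python
# (percentage, Czech candidate) pairs copied from the slide.
CANDIDATES = [
    (39.4, "dílna"),
    (30.4, "seminář"),
    (8.7, "workshop"),
    (4.6, "pracovní"),
    (2.3, "kurs"),
    (1.7, "garáž"),
    (1.7, "krejčovna"),
    (0.9, "ateliér"),
    (0.8, "továrna"),
]


def cumulative_coverage(candidates):
    """Return (word, percentage, running total) rows, most frequent first."""
    rows = []
    running = 0.0
    for pct, word in sorted(candidates, key=lambda c: c[0], reverse=True):
        running += pct
        rows.append((word, pct, round(running, 1)))
    return rows


def candidates_needed(candidates, target_pct):
    """Number of top candidates needed to reach target_pct, or None."""
    for count, (_, _, running) in enumerate(cumulative_coverage(candidates), 1):
        if running >= target_pct:
            return count
    return None


for word, pct, running in cumulative_coverage(CANDIDATES):
    print(f"{word:<10} {pct:5.1f}%  cumulative {running:5.1f}%")
```

The listed candidates do not add up to 100%, so `candidates_needed` returns `None` for targets that the slide's list cannot reach.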

User Services

User services
  hosting of corpora
  providing data packages (NLP)
  analysis of user data
  consulting, education, training

Repository Biblio

User forum and user support

CNC Wiki

www.korpus.cz