Introducing Czech National Corpus
Václav Cvrček
Brown University, 04/09/16

Introduction

Czech National Corpus project
Basic facts about the CNC:
  established in 1994 by Prof. František Čermák
  2 departments of the Faculty of Arts, Charles University in Prague (ICNC & ITCL)
  acknowledged by MEYS in 2012 as a research infrastructure for social sciences and humanities
  4,500+ registered users
  1,900 queries per day
  web portal: www.korpus.cz

Data

Language data
  SYN        2.3 billion
  ORAL       5 million
  Diakorp    3.4 million
  InterCorp  1.4 billion

SYN series
Written (i.e. published) synchronic texts:
  Corpus       Size   Content             Period
  SYN2000      100M   representative      1990–1999
  SYN2005      100M   representative      2000–2004
  SYN2006PUB   300M   journalistic texts  1989–2004
  SYN2009PUB   700M   journalistic texts  1995–2007
  SYN2010      100M   representative      2005–2009
  SYN2013PUB   935M   journalistic texts  2005–2009
  SYN2015      100M   representative      2010–2014
  SYN (v. 3)   2.3G   union of all SYN* corpora
All corpora are lemmatized, morphologically tagged and enriched with metadata (bibliographic information + text-type/genre classification).
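As a quick consistency check on the figures above, the sizes of the individual SYN corpora can be added up and compared with the 2.3G quoted for SYN (v. 3). The sketch below is only an illustration: the dictionary name and helper function are hypothetical, and it assumes the component corpora do not overlap, which the slides do not state.

```python
# Sizes in millions, copied from the SYN series list above.
# Assumption: the component corpora are disjoint, so sizes simply add up.
SYN_SIZES_M = {
    "SYN2000": 100,
    "SYN2005": 100,
    "SYN2006PUB": 300,
    "SYN2009PUB": 700,
    "SYN2010": 100,
    "SYN2013PUB": 935,
    "SYN2015": 100,
}


def total_size_m(sizes):
    """Return the summed size (in millions) of all corpora in `sizes`."""
    return sum(sizes.values())


total = total_size_m(SYN_SIZES_M)
print(f"{total}M = {total / 1000:.1f}G")  # prints "2335M = 2.3G"
```

Under that assumption the sum (2,335M) agrees with the rounded 2.3G given for the union corpus.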

ORAL series
Unprepared, dialogical, informal spoken language
One-layer transcription corpora:
  ORAL2006   1.0M   Bohemian Czech only
  ORAL2008   1.0M   sociolinguistically balanced, Bohemian Czech only
  ORAL2013   2.8M   sociolinguistically balanced, whole Czech Republic, text-to-sound alignment
Older spoken corpora:
  Prague spoken corpus (0.5M)
  Brno spoken corpus (0.5M)

Diachronic corpus DIAKORP
  the diachronic part of the CNC
  2.5 million words (v. 5)
  covers the period from the end of the 13th century to the beginning of the SYN section (1945)
  texts are transcribed, not transliterated
  current focus: lemmatization of 19th-century texts

Multilingual parallel corpus InterCorp
  Czech texts with translations to or from 30+ languages
  InterCorp (v. 8): core (= fiction) and collections (= journalism, subtitles, ...)

             Core    Collections
  cs         85M     90M
  foreign    194M    1,229M

  partly lemmatized and tagged
  uneven amount of text across language pairs

  Code  Language    Core     Total
  bg    Bulgarian    5.2M    28.1M
  da    Danish       3.0M    53.0M
  de    German      27.7M    77.1M
  en    English     15.5M   113.9M
  es    Spanish     17.5M   103.9M
  fi    Finnish      3.4M    45.2M
  fr    French       9.2M    87.0M
  hr    Croatian    15.5M    34.6M
  hu    Hungarian    5.4M    58.1M
  it    Italian      7.2M    65.6M
  pl    Polish      17.5M    79.9M
  ru    Russian      3.3M    13.4M
  sk    Slovak       7.4M    44.5M
  sl    Slovenian    0.9M    49.8M
  sr    Serbian      8.8M    29.6M
  uk    Ukrainian    5.1M     5.3M
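One way to make the unevenness across languages concrete is to compute, for each language, what share of its total consists of core texts. The following is a minimal sketch using the per-language figures above; the variable and function names are hypothetical and are not part of any CNC tool.

```python
# (core, total) sizes in millions, copied from the per-language figures above.
INTERCORP_M = {
    "bg": (5.2, 28.1), "da": (3.0, 53.0), "de": (27.7, 77.1),
    "en": (15.5, 113.9), "es": (17.5, 103.9), "fi": (3.4, 45.2),
    "fr": (9.2, 87.0), "hr": (15.5, 34.6), "hu": (5.4, 58.1),
    "it": (7.2, 65.6), "pl": (17.5, 79.9), "ru": (3.3, 13.4),
    "sk": (7.4, 44.5), "sl": (0.9, 49.8), "sr": (8.8, 29.6),
    "uk": (5.1, 5.3),
}


def core_share(core, total):
    """Fraction of a language's total size that belongs to the core."""
    return core / total


# Language codes ordered from the highest to the lowest core share.
ranked = sorted(
    INTERCORP_M,
    key=lambda code: core_share(*INTERCORP_M[code]),
    reverse=True,
)

for code in ranked:
    core, total = INTERCORP_M[code]
    print(f"{code}: {core_share(core, total):6.1%} core ({core}M of {total}M)")
```

With these figures, Ukrainian comes out almost entirely core while Slovenian is almost entirely collections, which is the kind of imbalance to keep in mind when comparing language pairs.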

Tools

CNC Tools
  main concordancer
  analysis of variants
  derivational morphology
  discourse analysis
  translation equivalents
All tools are available online within the portal www.korpus.cz

CNC research portal www.korpus.cz

KonText: CNC concordancer

SyD: exploring variants

Treq: translation equivalents

Translation candidates for "workshop"
  %      Czech       Gloss             English
  39.4   dílna       (workroom)        workshop
  30.4   seminář     (seminar)         workshop
   8.7   workshop                      workshop
   4.6   pracovní    (working)         workshop
   2.3   kurs        (course)          workshop
   1.7   garáž       (garage)          workshop
   1.7   krejčovna   (tailor's shop)   workshop
   0.9   ateliér     (studio)          workshop
   0.8   továrna     (factory)         workshop
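The percentages above can also be read cumulatively, for instance to see how much of the usage the top few candidates account for. The sketch below only post-processes the numbers shown on the slide; the names are hypothetical and it does not call Treq or reproduce how Treq computes its figures.

```python
# (percentage, Czech candidate) pairs copied from the list above.
CANDIDATES = [
    (39.4, "dílna"), (30.4, "seminář"), (8.7, "workshop"),
    (4.6, "pracovní"), (2.3, "kurs"), (1.7, "garáž"),
    (1.7, "krejčovna"), (0.9, "ateliér"), (0.8, "továrna"),
]


def cumulative_share(candidates, top_n=None):
    """Sum the percentages of the `top_n` most frequent candidates.

    With top_n=None, all listed candidates are summed.
    """
    rows = sorted(candidates, key=lambda row: row[0], reverse=True)[:top_n]
    return sum(pct for pct, _ in rows)


print(f"top 2: {cumulative_share(CANDIDATES, 2):.1f}%")  # prints "top 2: 69.8%"
print(f"listed: {cumulative_share(CANDIDATES):.1f}%")    # prints "listed: 90.5%"
```

The two most frequent candidates cover roughly 70% of the cases, and the nine listed candidates together about 90%; the remainder is presumably spread over rarer equivalents not shown on the slide.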

User Services

User services
  hosting of corpora
  providing data packages (NLP)
  analysis of user data
  consulting, education, training

Repository Biblio

User forum and user support

CNC Wiki

www.korpus.cz