Introducing Czech National Corpus
Brown University, 04/09/16
Václav Cvrček
Introduction
Czech National Corpus project
Basic facts about the CNC
  est. in 1994 by Prof. František Čermák
  2 departments of the Faculty of Arts, Charles University in Prague (ICNC & ITCL)
  in 2012 acknowledged by MEYS as a research infrastructure for social sciences and humanities
  4,500+ registered users
  1,900 queries per day
  web portal: www.korpus.cz
Data
Language data
  SYN        2.3 bil.
  ORAL       5 mil.
  Diakorp    3.4 mil.
  InterCorp  1.4 bil.
SYN series
Written (i.e. published) synchronic texts:
  SYN2000      100M   representative      (1990-1999)
  SYN2005      100M   representative      (2000-2004)
  SYN2006PUB   300M   journalistic texts  (1989-2004)
  SYN2009PUB   700M   journalistic texts  (1995-2007)
  SYN2010      100M   representative      (2005-2009)
  SYN2013PUB   935M   journalistic texts  (2005-2009)
  SYN2015      100M   representative      (2010-2014)
  SYN (v. 3)   2.3G   union of all SYN* corpora
All corpora are lemmatized, morphologically tagged and enriched with metadata (bibliographic information + text-type/genre classification).
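The 2.3G figure for SYN (v. 3) can be cross-checked against the component sizes in the list. The following is a minimal Python sketch, assuming the rounded sizes given on the slide (in millions) and assuming the component corpora do not overlap, so that the union is a plain sum. The variable names are illustrative only.

```python
# Sizes in millions, copied from the SYN series list (rounded figures).
syn_components_m = {
    "SYN2000": 100,
    "SYN2005": 100,
    "SYN2006PUB": 300,
    "SYN2009PUB": 700,
    "SYN2010": 100,
    "SYN2013PUB": 935,
    "SYN2015": 100,
}

# Assumption: the component corpora do not overlap,
# so the size of their union is simply the sum of the sizes.
total_m = sum(syn_components_m.values())
total_g = total_m / 1000

print(f"Sum of components: {total_m}M = {total_g:.1f}G")
```

Under these assumptions the sum is 2,335M, which agrees with the 2.3G reported for SYN (v. 3).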
ORAL series
Unprepared, dialogical, informal spoken language
One-layer transcription corpora:
  ORAL2006   1.0M   Bohemian Czech only
  ORAL2008   1.0M   sociolinguistically balanced, Bohemian Czech only
  ORAL2013   2.8M   sociolinguistically balanced, whole CR, text-to-sound alignment
Older spoken corpora: Prague spoken corpus (0.5M), Brno spoken corpus (0.5M)
Diachronic corpus DIAKORP
  diachronic part of the CNC
  2.5 mil. words (v. 5)
  the end of the 13th century to the beginning of the SYN section (1945)
  texts are transcribed, not transliterated
  current focus on 19th century lemmatization
Multilingual parallel corpus InterCorp
  Czech texts with translations to or from 30+ languages
  InterCorp (v. 8): core (= fiction) and collections (= journalism, subtitles)

             Core    Collections
  cs         85M     90M
  foreign    194M    1,229M

  partly lemmatized and tagged
  uneven amount of texts in language pairs
  Language         Core     Total
  bg Bulgarian     5.2M     28.1M
  da Danish        3.0M     53.0M
  de German        27.7M    77.1M
  en English       15.5M    113.9M
  es Spanish       17.5M    103.9M
  fi Finnish       3.4M     45.2M
  fr French        9.2M     87.0M
  hr Croatian      15.5M    34.6M
  hu Hungarian     5.4M     58.1M
  it Italian       7.2M     65.6M
  pl Polish        17.5M    79.9M
  ru Russian       3.3M     13.4M
  sk Slovak        7.4M     44.5M
  sl Slovenian     0.9M     49.8M
  sr Serbian       8.8M     29.6M
  uk Ukrainian     5.1M     5.3M
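How uneven the language pairs are can be seen by comparing the core size with the total for each language. The following is a minimal Python sketch using the figures from the table (in millions); the share of core texts in the total is a derived measure introduced here for illustration, not a statistic reported by the CNC.

```python
# (core, total) sizes in millions, copied from the language table.
intercorp_m = {
    "bg": (5.2, 28.1), "da": (3.0, 53.0), "de": (27.7, 77.1),
    "en": (15.5, 113.9), "es": (17.5, 103.9), "fi": (3.4, 45.2),
    "fr": (9.2, 87.0), "hr": (15.5, 34.6), "hu": (5.4, 58.1),
    "it": (7.2, 65.6), "pl": (17.5, 79.9), "ru": (3.3, 13.4),
    "sk": (7.4, 44.5), "sl": (0.9, 49.8), "sr": (8.8, 29.6),
    "uk": (5.1, 5.3),
}

# Share of the core (fiction) part in the total size of each language.
core_share = {lang: core / total for lang, (core, total) in intercorp_m.items()}

# Rank languages from the highest to the lowest core share.
ranked = sorted(core_share.items(), key=lambda item: item[1], reverse=True)

for lang, share in ranked:
    core, total = intercorp_m[lang]
    print(f"{lang}: core {core}M of {total}M ({share:.0%})")
```

With these figures, Ukrainian consists almost entirely of core texts, while Slovenian is dominated by the collections.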
Tools
CNC Tools
  main concordancer
  analysis of variants
  derivational morphology
  discourse analysis
  translation equivalents
All tools are available on-line within the portal www.korpus.cz
CNC research portal www.korpus.cz
KonText CNC concordancer
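For readers unfamiliar with concordancers, the core idea is a keyword-in-context (KWIC) listing: every occurrence of a search word shown with a few words on either side. The following is a toy Python sketch of that idea only. It is not KonText's implementation or API; the function name `kwic`, the `width` parameter and the sample sentence are all hypothetical.

```python
def kwic(tokens, keyword, width=3):
    """Return (left, hit, right) context triples for each occurrence of keyword."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


# Hypothetical sample text, tokenized by whitespace for simplicity.
sample = "The workshop was held in a small workshop behind the garage".split()
hits = kwic(sample, "workshop", width=2)

for left, hit, right in hits:
    # Right-align the left context so the keyword forms a column.
    print(f"{left:>10} | {hit} | {right}")
```

A real concordancer works on indexed, annotated corpora and supports queries over lemmas and tags, which this sketch does not attempt.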
SyD exploring variants
Treq translation equivalents
Translation candidates for "workshop"
  %      Czech                          English
  39.4   dílna (workroom)               workshop
  30.4   seminář (seminar)              workshop
  8.7    workshop                       workshop
  4.6    pracovní (working)             workshop
  2.3    kurs (course)                  workshop
  1.7    garáž (garage)                 workshop
  1.7    krejčovna (tailor's shop)      workshop
  0.9    ateliér (studio)               workshop
  0.8    továrna (factory)              workshop
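The percentages in the table do not add up to 100, which suggests that only the most frequent candidates are shown. The following is a minimal Python sketch of that arithmetic, assuming each percentage is the share of aligned occurrences of "workshop" rendered by the given Czech candidate (an interpretation of the table, not something the slide states).

```python
# (Czech candidate, percentage) pairs copied from the table.
candidates = [
    ("dílna", 39.4), ("seminář", 30.4), ("workshop", 8.7),
    ("pracovní", 4.6), ("kurs", 2.3), ("garáž", 1.7),
    ("krejčovna", 1.7), ("ateliér", 0.9), ("továrna", 0.8),
]

# Round to one decimal place to avoid floating-point noise in the sums.
listed = round(sum(pct for _, pct in candidates), 1)
top_two = round(sum(pct for _, pct in candidates[:2]), 1)
unlisted = round(100 - listed, 1)

print(f"Listed candidates cover {listed}%")
print(f"Top two candidates cover {top_two}%")
print(f"Remaining (not shown) {unlisted}%")
```

On these figures the two most frequent candidates, dílna and seminář, account for 69.8% of cases, and about 9.5% fall to candidates not listed on the slide.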
User Services
User services
  hosting of corpora
  providing data packages (NLP)
  analysis of user data
  consulting, education, training
Repository Biblio
User forum and user support
CNC Wiki
www.korpus.cz