Sketch Engine. Sketch Engine. SRDANOVIĆ ERJAVEC Irena, Web 1 Word Sketch Thesaurus Sketch Difference Sketch Engine

1 Sketch Engine SRDANOVIĆ ERJAVEC Irena, Sketch Engine Sketch Engine Web 1 Word Sketch Thesaurus Sketch Difference Sketch Engine JpWaC 4 Web Sketch Engine Kilgarriff & Rundell ,000 20, Heid et al. 2000, Kilgarriff & Tugwell 2001 Sketch Engine Kilgarriff et al Srdanović et al Sketch Engine Web Word Sketch Thesaurus Sketch Difference 1

2 Sketch Engine 2. Sketch Engine Sketch Engine Kilgarriff et al Erjavec et al Web Web Sketch Engine Sketch Engine 2.1. Sketch Engine Web Sketch Engine ( 4 JpWaC Web 1 Sharoff (2006) Ueyama & Baroni (2005) Web 5 WAC Baroni & Bernardini, eds BootCat Baroni et al HTML boilerplate removal Web ChaSen token lemma tag Erjavec et al jp.com Erjavec et al Srdanović et al Sketch Engine 2 3 URL Web JpWaC

3 1 Sketch Engine 2 Sketch Engine 3 Sketch Engine 2.2. Word Sketches 22 Word Sketch, Thesaurus Sketch Difference ChaSen Gahl 1998 corpus query syntax ( ) 4 Word Sketch 3

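A Word Sketch lists the collocates of a word by grammatical relation, ranked by salience. Each relation, such as modifies_n, is defined in the grammatical relations file. The definition below is a dual rule, so it yields two relations at once, modifier_ana and modifies_n:

*DUAL
=modifier_ana/modifies_n
2:"N.Ana" "Aux" "Pref.*"? 1:[tag="N.*" & tag!="N.Suff.*" & tag!="N.bnd.*"]

The token labelled 2: carries the tag N.Ana. It is followed by a token tagged Aux and, optionally, by a token tagged Pref.*. The token labelled 1: has a tag that matches N.* but matches neither N.Suff.* nor N.bnd.*.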
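The modifies_n rule quoted above can be read as a small pattern matcher over POS-tagged tokens. The following is a minimal sketch, not Sketch Engine's implementation. It assumes that each quoted value in the rule is a regular expression that must match the whole tag, and that tokens arrive as (word, tag) pairs. The function names and the sample phrase are hypothetical and serve only as illustration.

```python
import re

# Tag patterns copied from the rule. Treating them as regular expressions
# that must match the whole tag is an assumption of this sketch.
COLLOCATE = "N.Ana"
LINK = "Aux"
OPTIONAL_PREFIX = "Pref.*"
HEAD = "N.*"
HEAD_EXCLUDED = ("N.Suff.*", "N.bnd.*")


def tag_matches(pattern, tag):
    """True if the pattern matches the complete tag string."""
    return re.fullmatch(pattern, tag) is not None


def is_head(tag):
    """True for tags matching N.* but none of the excluded patterns."""
    return tag_matches(HEAD, tag) and not any(
        tag_matches(p, tag) for p in HEAD_EXCLUDED
    )


def find_modifies_n(tokens):
    """Return (head_noun, adjectival_noun) pairs found by the rule.

    tokens is a list of (word, tag) pairs in sentence order.
    """
    pairs = []
    for i, (word, tag) in enumerate(tokens):
        if not tag_matches(COLLOCATE, tag):
            continue
        j = i + 1
        # The token after the N.Ana token must be tagged Aux.
        if j >= len(tokens) or not tag_matches(LINK, tokens[j][1]):
            continue
        j += 1
        # Skip one optional prefix token.
        if j < len(tokens) and tag_matches(OPTIONAL_PREFIX, tokens[j][1]):
            j += 1
        if j < len(tokens) and is_head(tokens[j][1]):
            pairs.append((tokens[j][0], word))
    return pairs


# Hypothetical tagged phrase, not taken from the corpus.
phrase = [("静か", "N.Ana"), ("な", "Aux"), ("海", "N.g")]
print(find_modifies_n(phrase))
```

Because the rule is dual, each pair found this way would be recorded twice: once under modifies_n for the token labelled 2:, and once under modifier_ana for the token labelled 1:.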

5 * 0 N.* N.g N.Prop 0 1 Sketch Engine Concordance CQL Corpus Query Language [word= word= ] ChaSen [word= ] [word= ] [lemma= ] 3.2 [tag= N.* ]&[ word = ] Word Sketch Sketch Engine ChaSen IPADIC) IPADIC Sketch Engine Web ChaSen 5 ChaSen ChaSen Sketch Engine token kana lemma POS tag ( ) POS tag-eng ( ) - Adv.P - N.Ana Aux - N.g Aux Aux - Sym.p ChaSen ChaSen IPADIC ChaSen ChaSen 5

6 Word Sketch ChaSen Word Sketch Word Sketch Concordance 100 Word Sketch ChaSen Web 2.3. Thesaurus Sketch Difference Thesaurus Sketch Difference shared triples 3 triple Srdanović et al Thesaurus 6 Sketch Difference ,309 6, Web 6
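Section 2.3 describes the Thesaurus in terms of shared triples, and Sketch Difference contrasts the patterns two words have in common with those found for only one of them. The sketch below illustrates both ideas on a toy list of (word, relation, collocate) triples. It is a simplification: the overlap ratio used here is a plain set measure chosen for illustration, not the scoring Sketch Engine applies, and the relation names and words are hypothetical.

```python
def contexts(triples, word):
    """Collect the (relation, collocate) pairs observed for a word."""
    return {(rel, coll) for w, rel, coll in triples if w == word}


def shared_triple_similarity(triples, a, b):
    """Share of contexts that two words have in common (0.0 to 1.0)."""
    ctx_a, ctx_b = contexts(triples, a), contexts(triples, b)
    union = ctx_a | ctx_b
    return len(ctx_a & ctx_b) / len(union) if union else 0.0


def sketch_difference(triples, a, b):
    """Split contexts into common patterns and 'only' patterns."""
    ctx_a, ctx_b = contexts(triples, a), contexts(triples, b)
    return {
        "common": ctx_a & ctx_b,
        "only_a": ctx_a - ctx_b,
        "only_b": ctx_b - ctx_a,
    }


# Toy data, not drawn from JpWaC.
triples = [
    ("tea", "object_of", "drink"),
    ("coffee", "object_of", "drink"),
    ("tea", "modifier", "green"),
    ("coffee", "modifier", "black"),
]
print(shared_triple_similarity(triples, "tea", "coffee"))
print(sketch_difference(triples, "tea", "coffee"))
```

Ranking every other word in the corpus by such a similarity score gives a thesaurus-style list for a headword, while the three sets returned by sketch_difference correspond to the common and "only" blocks of a Sketch Difference display.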

7 Thesaurus 7 Sketch Difference only pattern 8 Sketch Difference only pattern 2.4. Web Web Web 7

8 Web Web Keller & Lapata 2003 Web Web JpWaC Web Web Sharoff 2006 Ueyama & Baroni 2005 Web Web Web Sharoff 2006 Ueyama & Baroni 2005 Web narrative style Web interactive style Web Web Web Ghani et al Web Web Web Web Web Crystal 2006 Web Web Web 8

9 Web 3. Sketch Engine Sketch Engine 3.1. Sketch Engine 80 Cobuild 90 Church & Hanks 1989 (MI) 2000 Word Sketch Sketch Engine BNC British National Corpus Rundell, ed Kilgarriff & Rundell (2002) Word Sketch Word Sketch Word Sketch Sketch Engine Word Sketch Sketch Engine 9
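The mutual information (MI) score of Church & Hanks (1989) mentioned above compares how often two words occur together with how often they would co-occur by chance: MI(x, y) = log2(P(x, y) / (P(x) P(y))). With raw corpus counts this becomes log2(f(x, y) * N / (f(x) * f(y))). The sketch below computes that value. The function and parameter names are hypothetical, and it shows plain MI only, not the salience score that Word Sketch displays.

```python
import math


def mutual_information(f_xy, f_x, f_y, n):
    """Pointwise mutual information (base 2) from raw counts.

    f_xy: co-occurrence count of the word pair
    f_x, f_y: individual counts of each word
    n: corpus size in tokens
    """
    if min(f_xy, f_x, f_y, n) <= 0:
        raise ValueError("all counts must be positive")
    return math.log2((f_xy * n) / (f_x * f_y))


# A pair seen 8 times, with word counts 16 and 32, in 1,024 tokens.
print(mutual_information(8, 16, 32, 1024))  # 4.0
```

A score of 0 means the pair co-occurs exactly as often as chance predicts. Because the raw co-occurrence count enters only through a ratio, very rare pairs can receive high scores, which is one reason lexicographic tools tend to combine MI with frequency information.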

10 Kilgarriff & Rundell 2002 challenge 2004 Sketch Engine Word Sketch 9 Word Sketch 9 modifier_ana modifier_ai verb verb verb verb 9 initiation trial - 10

11 Word Sketch challenge to something/somebody Concordance 10 Concordance CQL [word=" "] []{0,3} [word=" "] {0,3} 0 3 token 11 ( Word Sketch jaslo Erjavec et al
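The CQL query above has the shape [word="A"] []{0,3} [word="B"], where []{0,3} stands for any 0 to 3 tokens between the two words. The sketch below reproduces that behaviour on a plain token list. It is a simplification under stated assumptions: words are compared by exact equality rather than as regular expressions, the function name is hypothetical, and the placeholder tokens stand in for the Japanese words of the original query, which are not recoverable here.

```python
def find_with_gap(tokens, first, second, min_gap=0, max_gap=3):
    """Find (i, j) index pairs where tokens[i] == first, tokens[j] == second,
    and between min_gap and max_gap tokens lie between them."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok != first:
            continue
        for gap in range(min_gap, max_gap + 1):
            j = i + 1 + gap
            if j < len(tokens) and tokens[j] == second:
                hits.append((i, j))
    return hits


# Placeholder tokens: "A" and "B" are the two query words.
tokens = ["A", "x", "y", "B", "z", "z", "z", "z", "z", "B"]
print(find_with_gap(tokens, "A", "B"))  # [(0, 3)]
```

The second "B" at index 9 is not reported because eight tokens separate it from "A", which exceeds the upper bound of 3.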

12 Word Sketch 10 Word Sketch 1) 2) 3) 4) 1) 1, Sketch Engine 22 2 Sketch Engine Sketch Engine Sketch Engine 12

13 2) Word Sketch Word Sketch Sketch Engine Web Sketch Engine 3) Word Sketch Word Sketch 12 13

14 12 Word Sketch 4) Word Sketch Sketch Engine Thesaurus Sketch Difference A B A B A Sketch Difference 14

15 Web Web Word Sketch Sketch Engine 3.2. Sketch Engine Sketch Engine Word Sketch Thesaurus Sketch Difference Concordance suffix ( ) prefix suffix_base prefix_base bound_v V_bound suffix bound_v V_bound Sketch Difference / / 15

16 Word Sketch Word Sketch lemma 2) Concordance Concordance Concordance CQL Concordance CQL [word=" "][word=" "][lemma=" "] [word=" "][word=" "][lemma=" "] lemma 432 2,975 Collocation candidates 16

17 Concordance CQL [tag="v.*"][word=" "][word=" "][lemma=" "] Web 1,170 CQL [word=" "][word=" "][lemma=" "] Collocation candidates 10 Concordance [word=" "] [word=" "] [lemma=" "] 10,845 Collocation candidates 4, (lexical sets) 13 17

18 [word=" "][word=" "][word=" "][word=" "] [word=" "] [lemma=" "] Srdanović 2007 Word Sketch Word Sketch 3.3. Sketch Engine Sketch Engine Sketch Engine 1) Sketch Engine a b Sketch Engine Sketch Engine Nishina & Yoshihashi 2007 Smrž 2004 Sketch Engine 18

19 2) Sketch Engine 3) a ( ) b c d Sketch Engine Smrž 2004 Sketch Difference Thesaurus Sketch Engine Smrž 2004 Sketch Engine Sketch Engine 4) a b c Sketch Engine Sketch Engine Smith et al

3.4. Sketch Engine 2.3 Web Web Word Sketch Thesaurus Joyce 2005 Sketch Engine ChaSen ChaSen Corpus Builder Sketch Engine WebBootCaT Web Baroni et al Sketch Engine 1) ChaSen 4 Web 2) ChaSen Sketch Engine Word Sketch Thesaurus Sketch Difference Concordance 1) Web 2) 3) ChaSen ChaSen

References

Srdanović Erjavec, Irena , 83-89, 2007 Sketch Engine 18, , 2004
Baroni, Marco, Adam Kilgarriff, Jan Pomikálek & Pavel Rychlý (2006) WebBootCaT: a web tool for instant corpora, Proceedings of the EuraLex Conference 2006.
Baroni, Marco & Silvia Bernardini, eds. (2006) Wacky! Working papers on the Web as Corpus, Bologna: GEDIT.
Church, Kenneth Ward & Patrick Hanks (1989) Word association norms, mutual information, and lexicography, Proceedings of the 27th annual meeting on Association for Computational Linguistics.
Crystal, David (2006) Language and the Internet, Cambridge: Cambridge University Press.
Erjavec, Tomaž, Kristina Hmeljak Sangawa & Irena Srdanović Erjavec (2006) jaslo, A Japanese-Slovene Learners' Dictionary: Methods for Dictionary Enhancement, Proceedings of the 12th EURALEX International Congress.
Erjavec, Tomaž, Adam Kilgarriff & Irena Srdanović Erjavec (2007) A large public-access Japanese corpus and its query tool, CoJaS 2007, The Inaugural Workshop on Computational Japanese Studies.
Gahl, Susanne (1998) Automatic Extraction of subcategorization frames for corpus-based dictionary-building, Proc. EURALEX 1998.
Ghani, Rayid, Rosie Jones & Dunja Mladenic (2001) Using the Web to Create Minority Language Corpora, Proceedings of the 2001 ACM CIKM: Tenth International Conference on Information and Knowledge Management.
Heid, Ulrich, Stefan Evert, Vincent Docherty, Wolfgang Worsch & Matthias Wermke (2000) Computational tools for semi-automatic corpus-based updating of dictionaries, EURALEX 2000 Proceedings.
Joyce, Terry (2005) Constructing a large-scale database of Japanese word associations, In Katsuo Tamaoka (ed.) Corpus Studies on Japanese Kanji (Glottometrics 10), 82-98, Tokyo: Hituzi Syobo & Germany: RAM-Verlag: Lüdenscheid.
Keller, Frank & Maria Lapata (2003) Using the Web to Obtain Frequencies for Unseen Bigrams, Computational Linguistics 29 (3).
Kilgarriff, Adam & Michael Rundell (2002) Lexical Profiling Software and its Lexicographic Applications - a Case Study, EURALEX 2002 Proceedings.
Kilgarriff, Adam, Pavel Rychlý, Pavel Smrž & David Tugwell (2004) The Sketch Engine, Proc. Euralex.
Kilgarriff, Adam & David Tugwell (2001) WORD SKETCH: Extraction and Display of Significant Collocations for Lexicography, Proc. workshop "COLLOCATION: Computational Extraction, Analysis and Exploitation", 39th ACL & 10th EACL.
Nishina, Kikuko & Kenji Yoshihashi (2007) Japanese Composition Support System Displaying Occurrences and Example Sentences, Symposium on Large-scale Knowledge Resources (LKR2007).
Rundell, Michael, ed. (2002) Macmillan English Dictionary for Advanced Learners, London: Macmillan.
Sharoff, Serge (2006) Open-source corpora: using the net to fish for linguistic data, International Journal of Corpus Linguistics 11(4).
Smith, Simon, Alice Chen & Adam Kilgarriff (2007) A corpus query tool for SLA: learning Mandarin with the help of Sketch Engine, Practical Applications in Language and Computers - PALC 2007.
Smrž, Pavel (2004) Integrating Natural Language Processing into E-learning: A Case of Czech, Proceedings of the Workshop on eLearning for Computational Linguistics and Computational Linguistics for eLearning, COLING.
Srdanović Erjavec, Irena, Tomaž Erjavec & Adam Kilgarriff (2008) A web corpus and word-sketches for Japanese.
Ueyama, Motoko & Marco Baroni (2005) Automated construction and evaluation of a Japanese web-based reference corpus, Proceedings of Corpus Linguistics

Sketch Engine corpus query tool for Japanese and its possible applications

SRDANOVIĆ ERJAVEC Irena, NISHINA Kikuko
Tokyo Institute of Technology

Keywords: Sketch Engine, corpus linguistics, lexicography, second language learning, collocations

Abstract

Although corpus-based language research has been developing rapidly in recent years, resources are still lacking with regard to size, textual variety, and time of creation, as are efficient and user-friendly corpus query tools. This is also true of Japanese corpus linguistics, and it is one of the primary reasons for the recent rise in projects constructing Japanese corpus resources. In this paper, we present a method for extracting linguistic information from corpora using the Sketch Engine corpus query tool, which has recently been extended to the Japanese language. The Japanese version is based on a 400-million-word Japanese Web corpus, linguistically annotated with the morphological analyzer ChaSen, and on a Japanese grammatical relations file. The tool offers efficient and user-friendly ways of extracting concise linguistic data about words: their grammatical and collocational behavior, thesaurus-like information, and differences in usage between similar words. We explain, through examples, how the tool can be used in corpus lexicography, linguistic research, and computer-assisted learning of Japanese. The investigative part of the article concentrates on how the tool can be applied within the dictionary creation process, and the results illustrate how each of the tool's functions can contribute to that process.