Linking corpus-driven methodology to annotated and CEFR analyzed learner data. A profitable synergy?

Jarmo Harri Jantunen
CTAL 2012, Suzhou

Questions
  Does key word analysis of annotated data produce relevant information about second language acquisition?
  Does statistical information describe the development and growth of proficiency?
  Can certain lexical or grammatical items be seen as indicators of a proficiency level?
  Is the overuse (or underuse) of certain items specific to learner Finnish per se rather than to a certain proficiency level?

Key word analysis (Scott & Tribble 2006, Scott 2007)
  A key word is a word whose frequency is unusually high in a corpus in comparison with some norm (a reference corpus).
    Repetition, statistical significance
  Key words are calculated by comparing the frequency of a word in the studied data with its frequency in the reference data.
  To compute the keyness of an item, the program also computes the number of running words in the wordlist and the number of running words in the reference corpus, and cross-tabulates these.
  Log-likelihood test (p = .0001, critical value 15.13, minimum frequency 10)
  WordSmith Tools 4.0 software (Wordlist, KeyWords, Concord; Scott 2007)
  Key items (Jantunen 2008, 2011, forthcoming)
    key words (topic key words, learner language key words, genre key words)
    key tags

Example
  N   Key word  Freq.   %     RC. Freq.  RC. %  Keyness  P
  1   P1         5,319  1.36  1,773      0.62   913.51   0.0000000000
  2   2         11,827  3.02  5,560      1.96   771.93   0.0000000000
  3   MINÄ       1,350  0.35    334      0.12   376.54   0.0000000000
  4   1          6,785  1.73  3,424      1.20   318.99   0.0000000000
  5   NOM       10,956  2.80  6,068      2.13   303.44   0.0000000000
  6   ON         2,471  0.63    980      0.34   279.27   0.0000000000
  7   HUONE        348  0.09     20             259.63   0.0000000000
  8   MINUN        514  0.13     78      0.03   235.44   0.0000000000
  9   PRES       6,947  1.78  3,756      1.32   223.01   0.0000000000
  10  KELLO        521  0.13     87      0.03   220.86   0.0000000000

  Key tags: P1, NOM, PRES, 1, 2
  Key words: MINÄ (I), on (to be 3SG), HUONE (ROOM), minun (I GEN), KELLO (time, o'clock SG.NOM)
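The keyness computation described above can be sketched in a few lines. This is a minimal illustration of the standard 2x2 log-likelihood statistic, not WordSmith's own code. The function names are hypothetical, and the two corpus sizes are assumptions (rounded sub-corpus sizes), since the example table does not state which data sets were compared. Applying the minimum frequency to the studied data only, and signing the score for negative key items, are also assumptions of this sketch.

```python
import math

CRITICAL_VALUE = 15.13  # log-likelihood threshold for p = .0001 (from the slide)
MIN_FREQ = 10           # minimum frequency (from the slide); applied to the studied data here


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Log-likelihood (G2) for a 2x2 table: the item vs. all other running words."""
    observed = [
        [freq_study, freq_ref],
        [size_study - freq_study, size_ref - freq_ref],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [size_study, size_ref]
    total = size_study + size_ref
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2


def signed_keyness(freq_study, size_study, freq_ref, size_ref):
    """Positive when the item is relatively more frequent in the studied data, negative when less."""
    g2 = log_likelihood(freq_study, size_study, freq_ref, size_ref)
    if freq_study / size_study < freq_ref / size_ref:
        return -g2
    return g2


def is_key(freq_study, size_study, freq_ref, size_ref):
    """True if the item passes both the frequency and the significance threshold."""
    if freq_study < MIN_FREQ:
        return False
    return abs(signed_keyness(freq_study, size_study, freq_ref, size_ref)) >= CRITICAL_VALUE


# ASSUMED corpus sizes (not stated next to the example table).
STUDY_SIZE = 391_000
REF_SIZE = 284_000

# Row 1 of the example table: P1 occurs 5,319 times vs. 1,773 times in the reference data.
print(signed_keyness(5_319, STUDY_SIZE, 1_773, REF_SIZE))
print(is_key(5_319, STUDY_SIZE, 1_773, REF_SIZE))
```

Because the corpus sizes are rounded assumptions, the printed score should not be expected to reproduce the table's keyness value exactly.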

Data
  ICLFI, International Corpus of Learner Finnish (Jantunen 2011)
    subset of ICLFI: annotated and lemmatized learner production from Estonian students of Finnish
    CEFR analysed (Common European Framework of Reference for Languages)
      the CEFR describes language ability on a scale of levels from A1 (beginners) up to C2 (mastery)
      communicative framework for language assessment
    total size of the data: 914,000 items (652 texts)
      A1: 5,300 items (8 texts)
      A2: 172,000 items (220 texts)
      B1: 391,000 items (336 texts)
      B2: 284,000 items (77 texts)
      C1: 62,000 items (11 texts)
    genres: fictional (e.g. narratives, letters), non-fiction (e.g. essays, news, argumentative texts)
  Native Finnish Corpus (NF): non-translational subset of the Corpus of Translated Finnish (Mauranen 2000), c. 36 million items

Categorization of key items (Jantunen 2011)
  Key tags: CARD (cardinal), CC (conjunction), NUM (numeral), P1 (1st person), PRES (present tense), ADE (adessive case), IND (indicative mood)
  Topic keywords: HUONE 'ROOM', KAAPPI 'CLOSET', KEITTIÖ 'KITCHEN', ?KELLO 'WATCH, O'CLOCK', KOTI 'HOME', KOTINI 'home 1SG.POSS', KOTOISIN 'from', LUENTO 'LECTURE', OPISKELEN 'study 1SG', OPISKELLA 'STUDY', PERHE 'FAMILY', PERHEENI 'family 1SG.POSS', SÄNKY 'BED', SISKO 'SISTER', SYÖDÄ 'EAT', SYÖN 'eat 1SG', TARTOSSA 'Tartu INE', TARTTO 'TARTU', VELI 'BROTHER'
  Learner language keywords: ISO 'BIG', KAHDEKSAN 'EIGHT', KAKSI 'TWO', KÄYDÄ 'GO/VISIT', KÄYN 'go/visit 1SG', MENEN 'go 1SG', MENNÄ 'GO', MINÄ 'I', MINULLA 'I ADE', PIDÄN 'like 1SG', MINUN 'I GEN', OLEN 'be 1SG', OLLA 'BE', ON 'be 3SG', PALJON 'a lot', TAVALLISESTI 'usually'

Categorization of key items, CEFR analysed data

A1 vs. A2
  A, NOM
  OLLA, ovat, on, PIENI (BE, be 3PL, be 3SG, SMALL)

A2 vs. B1
  part of speech: NUM (CARD)
  cases: NOM, ADE
  number: SG
  tense: PRES
  mood: IND
  person: P1 (SG)
  MAIN, NH, LOC
  verbs: on, OLLA, menen, olen (be 3SG, BE, go 1SG, be 1SG)
  adjectives: ISO, VANHA (BIG, OLD)
  pronouns: MINÄ, minun, minulla (I, I GEN, I ADE)
  numerals: KAKSI, YKSI (TWO, ONE)
  adverbs: toisinaan, sitten (sometimes, then)
  adverbs/adpositions: vierellä, lähellä (beside, near)

B1 vs. B2
  part of speech: N, NUM (CARD), PROP
  cases: NOM, ADE, INE
  number: SG
  possessive: POSS
  tense: PRES
  mood: IND
  person: P1 (SG)
  MAIN, NH, LOC, CC, SUBJ
  verbs: on, menen, KÄYDÄ, OLLA (be 3SG, go 1SG, GO/VISIT, BE)
  adjectives: ISO, PIENI, MUKAVA (BIG, SMALL, NICE)
  pronouns: MINÄ, minun, minulla, HÄN (I, I GEN, I ADE, S/HE)
  numerals: KAKSI, NELJÄ, KYMMENEN, PUOLI (TWO, FOUR, TEN, HALF)
  adverbs: tavallisesti, siellä (usually, there)
  adverbs/adpositions: vieressä (beside)
  conjunctions: ja (and)
  nouns: KELLO (WATCH/O'CLOCK)

Categorization of key items, CEFR analysed data

B2 vs. C1
  part of speech: CARD, PROP
  cases: NOM
  tense: PAST
  mood: IND
  person: P3
  CC, SUBJ
  verbs: oli (be PAST 3SG)
  pronouns: HÄN, HE, ME (S/HE, THEY, WE)
  conjunctions: ja, kuten (and, like/such as)
  nouns: KELLO, MIES (WATCH/O'CLOCK, MAN)

[Figure: Some key items across levels and in native Finnish. Proportion (%) of NOM, IND, CARD, OLLA, ON, OVAT, OLEN and OLI at levels A1, A2, B1, B2, C1 and in native Finnish (NF).]
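The level-by-level comparisons above pair each CEFR level with the next one, and the figure reports proportions in percent. A minimal sketch of both steps follows, assuming the percentages are item frequencies divided by sub-corpus size. The sub-corpus sizes come from the Data slide; the names LEVEL_SIZES, adjacent_pairs and percent are hypothetical.

```python
# Sub-corpus sizes in items, as listed on the Data slide.
LEVEL_SIZES = {
    "A1": 5_300,
    "A2": 172_000,
    "B1": 391_000,
    "B2": 284_000,
    "C1": 62_000,
}


def adjacent_pairs(levels):
    """Pair every level with the next higher one, e.g. ('A1', 'A2')."""
    names = list(levels)  # dicts preserve insertion order in Python 3.7+
    return list(zip(names, names[1:]))


def percent(freq, corpus_size):
    """Relative frequency of an item as a percentage of all items in the corpus."""
    return 100.0 * freq / corpus_size


for lower, higher in adjacent_pairs(LEVEL_SIZES):
    print(f"{lower} vs. {higher}")

# ASSUMPTION: the P1 count of 5,319 from the example table is measured against
# the B1 sub-corpus; the slides do not say which data set the table is based on.
print(round(percent(5_319, LEVEL_SIZES["B1"]), 2))
```

Normalising to percentages is what makes levels of very different sizes (5,300 items at A1 against 391,000 at B1) comparable in a single chart.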

Categorization of negative key items, CEFR analysed data

A1 vs. A2
  PTV

A2 vs. B1
  part of speech: ADV, PROP
  cases: ELA, PTV, GEN, ILL
  comparison: CMP, SUP
  number: PL
  tense: PAST; person: PL
  voice: PASS
  non-finite: INF (F1), PCP
  PREMOD, PREMARK, AD, ATTR, OBJ, ADVL, CS, AUX
  nouns: IHMINEN (HUMAN BEING)
  verbs: EI, SAADA (NOT, TO GET)
  adjectives: HYVÄ, VANHA (GOOD, OLD)
  pronouns: KAIKKI (ALL/EVERY)
  adverbs: nyt, jo (now, already)
  conjunctions: että, kun, joka, koska, kuin, sekä (that, when, that/which, because, than, and/both and)
  other particles: vain, niin (only, so)

B1 vs. B2
  part of speech: ADV
  cases: PTV, GEN, ILL
  comparison: CMP
  number: PL
  tense: PAST; person: PL
  voice: PASS; mood: CND
  non-finite: INF (F1, F2), PCP
  PREMARK, PM, ATTR, OBJ, ADVL, CS, AUX
  nouns: ASIA, TAPA (MATTER/THING, MANNER)
  verbs: EI, VOIDA, SANOA, oli (NOT, BE ABLE TO, SAY, be PAST 3SG)
  adjectives: ERI, VAIKEA (DIFFERENT, DIFFICULT)
  pronouns: SE, sitä, KAIKKI, TÄMÄ (IT, it PTV, ALL/EVERY, THIS)
  adverbs: miten (how)
  conjunctions: että, kun, jos, eli, vaan (that, when, if, or, but)
  other particles: niin, juuri (so, just/right)

Categorization of negative key items, CEFR analysed data

B2 vs. C1
  cases: GEN, ESS
  voice: PASS
  non-finite: PCP
  PREMOD, ATTR, NEG
  nouns: MÄÄRÄ, MAHDOLLISUUS (NUMBER/AMOUNT, CHANCE)
  verbs: YRITTÄÄ, en (TRY, not 1SG)
  pronouns: MONI (MANY)
  adverbs: koskaan ((n)ever)
  adpositions: takia, aikana (due to, during)
  conjunctions: että (that)
  other particles: niin, kuitenkin, ainakin (so, however, at least)

[Figure: Some key items across levels and in native Finnish. Proportion (%) of GEN, PTV, PASS, PCP, ATTR, EI, NIIN and ETTÄ at levels A1, A2, B1, B2, C1 and in native Finnish (NF).]

Conclusions

Does key word analysis of annotated data produce relevant information about second language acquisition?
  It seems that KWA reveals lexical and grammatical items that might describe the gradual development of language acquisition (complexity, distribution).

Does statistical information describe the development and growth of proficiency?
  It reveals which lexical and grammatical items are typical (overused) at a certain level and which are emerging (but still underused).

Can certain lexical or grammatical items be seen as indicators of a proficiency level?
  It seems that, for example, certain forms of the verb OLLA ('BE'), NOMs, INDs and CARDs are typical at beginner level and that their proportion decreases as proficiency grows.
  The lack of certain items (e.g. GENs, PTVs, the verb EI 'NOT') also seems to characterise certain levels.

Is the overuse (or underuse) of certain items specific to learner Finnish per se rather than to a certain proficiency level?
  The items studied here do not support this, since at B2 level the proportions seem to be more or less similar to the proportions in native data.
  However, a more detailed KWA between different proficiency levels and native data is needed.

Accuracy? -> error analysis
Increasing the data size (A1, C1)
Comparison with other L1 backgrounds (a more universal tendency)