Linking corpus-driven methodology to annotated and CEFR analyzed learner data. A profitable synergy?




Jarmo Harri Jantunen
CTAL 2012, Suzhou

Questions
- Does key word analysis of annotated data produce relevant information about second language acquisition?
- Does statistical information describe the development and growth of proficiency level?
- Can certain lexical or grammatical items be seen as indicators of a proficiency level?
- Is the overuse (or underuse) of certain items specific to learner Finnish per se rather than to a certain proficiency level?

Key word analysis (Scott & Tribble 2006, Scott 2007)
- A key word is a word whose frequency is unusually high in a corpus in comparison with some norm (a reference corpus).
- Repetition, statistical significance.
- Key words are calculated by comparing the frequency of a word in the studied data with its frequency in the reference data.
- To compute the keyness of an item, the program also computes the number of running words in the wordlist and the number of running words in the reference corpus, and cross-tabulates these.
- Log-likelihood test (p = .0001, critical value 15.13, minimum frequency 10).
- WordSmith Tools 4.0 software (Wordlist, KeyWords, Concord; Scott 2007).

Key items (Jantunen 2008, 2011, forthcoming)
- key words (topic key words, learner language key words, genre key words)
- key tags

Example

N    Key word   Freq.    %      RC Freq.   RC %   Keyness   P
1    P1          5,319   1.36   1,773      0.62   913.51    0.0000000000
2    2          11,827   3.02   5,560      1.96   771.93    0.0000000000
3    MINÄ        1,350   0.35     334      0.12   376.54    0.0000000000
4    1           6,785   1.73   3,424      1.20   318.99    0.0000000000
5    NOM        10,956   2.80   6,068      2.13   303.44    0.0000000000
6    ON          2,471   0.63     980      0.34   279.27    0.0000000000
7    HUONE         348   0.09      20             259.63    0.0000000000
8    MINUN         514   0.13      78      0.03   235.44    0.0000000000
9    PRES        6,947   1.78   3,756      1.32   223.01    0.0000000000
10   KELLO         521   0.13      87      0.03   220.86    0.0000000000

(RC = reference corpus)

Key tags: P1, NOM, PRES, 1, 2
Key words: MINÄ (I), on (to be 3SG), HUONE (ROOM), minun (I GEN), KELLO (time, o'clock SG.NOM)
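The keyness computation described above can be sketched as follows. This is a minimal illustration, not the WordSmith Tools implementation: it assumes the common 2x2 log-likelihood (G2) formulation, and the function names and the two corpus sizes in the usage lines are hypothetical, since the slide does not state which corpora the example table compares. Only the critical value (15.13) and the minimum frequency (10) come from the slide.

```python
import math

CRITICAL_VALUE = 15.13  # from the slide: p = .0001
MIN_FREQ = 10           # from the slide: minimum frequency


def log_likelihood(freq_study, freq_ref, size_study, size_ref):
    """G2 statistic for a 2x2 table: item vs. all other running words,
    study corpus vs. reference corpus."""
    observed = [
        [freq_study, freq_ref],
        [size_study - freq_study, size_ref - freq_ref],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [size_study, size_ref]
    total = size_study + size_ref
    g2 = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = row_totals[i] * col_totals[j] / total
                g2 += o * math.log(o / expected)
    return 2.0 * g2


def is_key(freq_study, freq_ref, size_study, size_ref):
    """True if the item passes both thresholds named on the slide."""
    if freq_study < MIN_FREQ:
        return False
    return log_likelihood(freq_study, freq_ref, size_study, size_ref) >= CRITICAL_VALUE


# Frequencies of the tag P1 are taken from the example table;
# the corpus sizes are ASSUMED for illustration only.
ASSUMED_STUDY_SIZE = 391_000
ASSUMED_REF_SIZE = 284_000
p1_keyness = log_likelihood(5319, 1773, ASSUMED_STUDY_SIZE, ASSUMED_REF_SIZE)
print(f"P1 keyness under the assumed corpus sizes: {p1_keyness:.2f}")
```

An item whose relative frequency is identical in both corpora gets a keyness of zero. The statistic itself is unsigned, so the direction (overuse or underuse) has to be read from the relative frequencies.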

Data
- ICLFI, International Corpus of Learner Finnish (Jantunen 2011)
- subset of ICLFI: annotated and lemmatized learner production from Estonian students of Finnish
- CEFR analysed (Common European Framework of Reference for Languages)
  - the CEFR describes language ability on a scale of levels from A1 (beginners) up to C2 (mastery)
  - communicative framework for language assessment
- total size of the data: 914,000 items (652 texts)
  - A1: 5,300 items (8 texts)
  - A2: 172,000 items (220 texts)
  - B1: 391,000 items (336 texts)
  - B2: 284,000 items (77 texts)
  - C1: 62,000 items (11 texts)
- genres: fictional (e.g. narratives, letters), non-fiction (e.g. essays, news, argumentative texts)
- Native Finnish Corpus (NF): non-translational subset of the Corpus of Translated Finnish (Mauranen 2000), c. 36 million items

Categorization of key items (Jantunen 2011)

Key tags:
CARD (cardinal), CC (conjunction), NUM (numeral), P1 (1st person), PRES (present tense), ADE (adessive case), IND (indicative mood)

Topic keywords:
HUONE 'ROOM', KAAPPI 'CLOSET', KEITTIÖ 'KITCHEN', ?KELLO 'WATCH, O'CLOCK', KOTI 'HOME', KOTINI 'home 1SG.POSS', KOTOISIN 'from', LUENTO 'LECTURE', OPISKELEN 'study 1SG', OPISKELLA 'STUDY', PERHE 'FAMILY', PERHEENI 'family 1SG.POSS', SÄNKY 'BED', SISKO 'SISTER', SYÖDÄ 'EAT', SYÖN 'eat 1SG', TARTOSSA 'Tartu INE', TARTTO 'TARTU', VELI 'BROTHER'

Learner language keywords:
ISO 'BIG', KAHDEKSAN 'EIGHT', KAKSI 'TWO', KÄYDÄ 'GO/VISIT', KÄYN 'go/visit 1SG', MENEN 'go 1SG', MENNÄ 'GO', MINÄ 'I', MINULLA 'I ADE', PIDÄN 'like 1SG', MINUN 'I GEN', OLEN 'be 1SG', OLLA 'BE', ON 'be 3SG', PALJON 'a lot', TAVALLISESTI 'usually'

Categorization of key items, CEFR analysed data

A1 vs. A2
- tags: A, NOM
- words: OLLA 'BE', ovat 'be 3PL', on 'be 3SG', PIENI 'SMALL'

A2 vs. B1
- part of speech: NUM (CARD)
- cases: NOM, ADE
- number: SG
- tense: PRES
- mood: IND
- person: P1 (SG)
- other tags: MAIN, NH, LOC
- verbs: on 'be 3SG', OLLA 'BE', menen 'go 1SG', olen 'be 1SG'
- adjectives: ISO 'BIG', VANHA 'OLD'
- pronouns: MINÄ 'I', minun 'I GEN', minulla 'I ADE'
- numerals: KAKSI 'TWO', YKSI 'ONE'
- adverbs: toisinaan 'sometimes', sitten 'then'
- adverbs/adpositions: vierellä 'beside', lähellä 'near'

B1 vs. B2
- part of speech: N, NUM (CARD), PROP
- cases: NOM, ADE, INE
- number: SG
- possessive: POSS
- tense: PRES
- mood: IND
- person: P1 (SG)
- other tags: MAIN, NH, LOC, CC, SUBJ
- verbs: on 'be 3SG', menen 'go 1SG', KÄYDÄ 'GO/VISIT', OLLA 'BE'
- adjectives: ISO 'BIG', PIENI 'SMALL', MUKAVA 'NICE'
- pronouns: MINÄ 'I', minun 'I GEN', minulla 'I ADE', HÄN 'S/HE'
- numerals: KAKSI 'TWO', NELJÄ 'FOUR', KYMMENEN 'TEN', PUOLI 'HALF'
- adverbs: tavallisesti 'usually', siellä 'there'
- adverbs/adpositions: vieressä 'beside'
- conjunctions: ja 'and'
- nouns: KELLO 'WATCH/O'CLOCK'

B2 vs. C1
- part of speech: CARD, PROP
- cases: NOM
- tense: PAST
- mood: IND
- person: P3
- other tags: CC, SUBJ
- verbs: oli 'be PAST 3SG'
- pronouns: HÄN 'S/HE', HE 'THEY', ME 'WE'
- conjunctions: ja 'and', kuten 'like/such as'
- nouns: KELLO 'WATCH/O'CLOCK', MIES 'MAN'

[Figure: Some key items across levels and in native Finnish. Relative frequency (%) of NOM, IND, CARD, OLLA, ON, OVAT, OLEN and OLI at levels A1, A2, B1, B2 and C1 and in native Finnish (NF).]

Categorization of negative key items, CEFR analysed data

A1 vs. A2
- tags: PTV

A2 vs. B1
- part of speech: ADV, PROP
- cases: ELA, PTV, GEN, ILL
- comparison: CMP, SUP
- number: PL
- tense: PAST
- person: PL
- voice: PASS
- non-finite: INF (F1), PCP
- other tags: PREMOD, PREMARK, AD, ATTR, OBJ, ADVL, CS, AUX
- nouns: IHMINEN 'HUMAN BEING'
- verbs: EI 'NOT', SAADA 'TO GET'
- adjectives: HYVÄ 'GOOD', VANHA 'OLD'
- pronouns: KAIKKI 'ALL/EVERY'
- adverbs: nyt 'now', jo 'already'
- conjunctions: että 'that', kun 'when', joka 'that/which', koska 'because', kuin 'than', sekä 'and/both and'
- other particles: vain 'only', niin 'so'

B1 vs. B2
- part of speech: ADV
- cases: PTV, GEN, ILL
- comparison: CMP
- number: PL
- tense: PAST
- person: PL
- voice: PASS
- mood: CND
- non-finite: INF (F1, F2), PCP
- other tags: PREMARK, PM, ATTR, OBJ, ADVL, CS, AUX
- nouns: ASIA 'MATTER/THING', TAPA 'MANNER'
- verbs: EI 'NOT', VOIDA 'BE ABLE TO', SANOA 'SAY', oli 'be PAST 3SG'
- adjectives: ERI 'DIFFERENT', VAIKEA 'DIFFICULT'
- pronouns: SE 'IT', sitä 'it PTV', KAIKKI 'ALL/EVERY', TÄMÄ 'THIS'
- adverbs: miten 'how'
- conjunctions: että 'that', kun 'when', jos 'if', eli 'or', vaan 'but'
- other particles: niin 'so', juuri 'just/right'

B2 vs. C1
- cases: GEN, ESS
- voice: PASS
- non-finite: PCP
- other tags: PREMOD, ATTR, NEG
- nouns: MÄÄRÄ 'NUMBER/AMOUNT', MAHDOLLISUUS 'CHANCE'
- verbs: YRITTÄÄ 'TRY', en 'not 1SG'
- pronouns: MONI 'MANY'
- adverbs: koskaan '(n)ever'
- adpositions: takia 'due to', aikana 'during'
- conjunctions: että 'that'
- other particles: niin 'so', kuitenkin 'however', ainakin 'at least'

[Figure: Some key items across levels and in native Finnish. Relative frequency (%) of GEN, PTV, PASS, PCP, ATTR, EI, NIIN and ETTÄ at levels A1, A2, B1, B2 and C1 and in native Finnish (NF).]
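The split into key items and negative key items between two adjacent proficiency levels can be sketched like this. It is an illustration under stated assumptions, not the procedure actually run in WordSmith Tools: the function names are hypothetical, the frequency counts are invented toy data (not ICLFI figures), and applying the minimum frequency to the larger of the two counts is a simplification made for this sketch. Here an item counts as a key item when it is significantly more frequent in the lower level and as a negative key item when it is significantly less frequent there.

```python
import math
from collections import Counter


def log_likelihood(a, b, size_a, size_b):
    """G2 statistic for a 2x2 table (item vs. rest, corpus A vs. corpus B)."""
    cells = [(a, a + b, size_a), (b, a + b, size_b),
             (size_a - a, size_a + size_b - a - b, size_a),
             (size_b - b, size_a + size_b - a - b, size_b)]
    total = size_a + size_b
    return 2.0 * sum(o * math.log(o / (row * col / total))
                     for o, row, col in cells if o > 0)


def key_items(lower, higher, critical=15.13, min_freq=10):
    """Return (positive, negative) key items of `lower` against `higher`.

    positive: relatively more frequent in the lower level (overuse)
    negative: relatively less frequent in the lower level (underuse)
    """
    size_lower, size_higher = sum(lower.values()), sum(higher.values())
    positive, negative = [], []
    for item in set(lower) | set(higher):
        a, b = lower.get(item, 0), higher.get(item, 0)
        if max(a, b) < min_freq:  # assumption: threshold on the larger count
            continue
        score = log_likelihood(a, b, size_lower, size_higher)
        if score < critical:
            continue
        if a / size_lower > b / size_higher:
            positive.append((item, score))
        else:
            negative.append((item, score))
    by_score = lambda pair: -pair[1]
    return sorted(positive, key=by_score), sorted(negative, key=by_score)


# Invented toy tag counts for two adjacent levels (NOT real corpus data).
lower_level = Counter({"NOM": 300, "GEN": 20, "PRES": 200, "OTHER": 480})
higher_level = Counter({"NOM": 200, "GEN": 120, "PRES": 190, "OTHER": 490})

positive, negative = key_items(lower_level, higher_level)
print("key items:", [item for item, _ in positive])
print("negative key items:", [item for item, _ in negative])
```

In this toy data NOM comes out as a key item and GEN as a negative key item of the lower level, which mirrors the direction of the pattern reported on the slides, while PRES and OTHER differ too little to pass the threshold.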

Conclusions

Does key word analysis of annotated data produce relevant information about second language acquisition?
- It seems that KWA reveals lexical and grammatical items that might describe the gradual development of language acquisition (complexity, distribution).

Does statistical information describe the development and growth of proficiency level?
- It reveals which lexical and grammatical items are typical (overused) at a certain level and which are emerging (but still underused).

Can certain lexical or grammatical items be seen as indicators of a proficiency level?
- It seems that e.g. certain forms of the verb OLLA ('BE'), NOMs, INDs and CARDs are typical at beginner level and that their proportion decreases as proficiency grows.
- The lack of certain items (e.g. GENs, PTVs, the verb EI 'NOT') also seems to characterise certain levels.

Is the overuse (or underuse) of certain items specific to learner Finnish per se rather than to a certain proficiency level?
- The items studied here do not support this, since at B2 level the proportions seem to be more or less similar to the proportions in native data.
- However, a more detailed KWA between different proficiency levels and native data is needed.

- Accuracy? -> error analysis
- Increasing the data size (A1, C1)
- Comparison with other L1 backgrounds (a more universal tendency)