Assessing Student Learning Through Keyword Density Analysis of Online Class Messages

Xin Chen, New Jersey Institute of Technology, xc7@njit.edu
Brook Wu, New Jersey Institute of Technology, wu@njit.edu

ABSTRACT
This paper presents a novel approach that automatically assesses student class performance by analyzing keyword density in online class messages. Class performance is evaluated from three perspectives: the quality of students' course work, the quantity of their efforts, and the activeness of their participation; three measures (keyword density, message length, and message count) are derived from class messages to capture each perspective respectively. Noun phrases are extracted from the messages and weighted based on frequency and length. Keyword density is defined as the sum of the weights of the noun phrases appearing in a student's messages. Message length and message count are the number of words and the number of messages in the student's messages, respectively. The three measures are combined into a linear model, in which each measure accounts for a certain proportion of a final score, called the performance indicator. The experiment shows a high correlation between the performance indicator scores and the actual grades assigned by instructors. The rank orders of students by performance indicator score are highly correlated with those by actual grades as well. Evidence of the supplemental role of computer-assigned grades is also found.

Keywords
Keyword Density, Learning Assessment, Distance Learning, Noun Phrase

INTRODUCTION
Advances in information technology have popularized the online delivery of various types of classes, so-called virtual classrooms (Hiltz, 1994), in which instructors and students interact with each other mainly by exchanging messages in natural language. Students' work submitted online accounts for a large portion of their final grades. To assess students correctly, instructors have to read through all their messages and other online submissions, which can easily accumulate to a large volume over the course of one semester.
In addition, instructors may have different grading preferences, which will introduce bias into grades. Combining the assessments of multiple judges has been shown to increase the correctness of the final decision (Winkler and Clemen, 2002), so in addition to the instructor who solely makes the judgments of student performance, a second grader is useful and necessary. Using a second human grader, however, is not feasible in most situations because of the cost and limits of human resources. Computer programs, therefore, could be a better alternative. They could supplement the judgments made by the instructor and help instructors improve their grading by alerting them to disagreements with the grades they assign. The idea of developing computer programs to grade students' work was introduced in the 1960s (Page, 1966). It has been a long-time interest of researchers and has been used widely in assessing various types of student work, such as computer programs (Jones, 2001), prose (Page, 1994), language tests (Bachman et al., 2002), and essays (Page and Petersen, 1995; Larkey, 1998; Foltz et al., 1999; Landauer and Psotka, 2000; Burstein et al., 2003). However, although the goal of these approaches is to automatically assess student learning, most efforts have focused on individual work assessment (e.g., an essay or a piece of program code), despite the fact that students' course work accumulates over time. Correct assessment requires taking the entire volume of submissions into consideration. Recently the problem of text mining, a.k.a. text data mining (TDM) (Hearst, 1999) or knowledge discovery from text (KDT) (Feldman, 1995), has been attracting increasing attention. With the large amount of class discussions and other documents (e.g., assignments, essays, and projects) in text format, text mining techniques could be a solution for learning assessment.
This paper addresses the problem of automated assessment of student learning within a period of the learning process, and presents a novel approach which assesses student learning by mining keyword density in class messages. Our approach is, first, to extract conceptual entities, called keywords, from the class messages. Next, keywords are weighted according to their lengths and occurrences in the messages. The weight of a keyword reveals the importance of the corresponding concept.

Proceedings of the Tenth Americas Conference on Information Systems, New York, New York, August 2004
Message length and message count are two other measures, referring to the number of words and the number of messages respectively. The three measures are combined into a linear model, in which each measure accounts for a certain proportion of a final score, called the performance indicator. The remainder of this paper is organized as follows. Section 2 presents the basic terminology, notation, and concepts concerning keyword density and the assessment model. In Section 3, we discuss the experiment and evaluation method. Section 4 presents the results and our analysis. Section 5 concludes the paper with some final remarks.

ASSESSMENT MODEL
The basic idea in this work is to assess student learning by analyzing the messages and documents produced in the supporting e-learning systems. This section presents the assessment model, including the basic concepts concerning keyword density, message length, and message count, as well as the assessment model consisting of the three measures. The model assesses student learning from three aspects: the quality of students' course work, the quantity of their efforts, and the activeness of their participation; the three proposed measures (keyword density, message length, and message count) are derived from the class messages to measure each assessment aspect respectively.

Keyword Density Mining
Keyword density measures the quality of a student's work by analyzing the contents of the student's messages. Similar content-based approaches are found in the essay grading literature. They focus on the semantic relationships between words and their context. Manually graded training essays are required to construct a semantic space, which reflects the contextual usage of words. A test essay is compared to the documents in the space and assigned a score according to the grades of the nearest essay(s). Examples are the early work at Educational Testing Service (ETS) (Burstein et al., 1996) and the Latent Semantic Analysis (LSA) model (Landauer et al., 2000).
However, the major characteristic of our approach is that it does not require training essays, which are expensive and nearly unavailable in our problem. In addition, our approach deals with not just one essay, but a set of class messages. Below we give the basic definitions of the concepts used throughout the paper.

Keyword
We assume that the quality of learning is revealed by the quality of the messages generated by a student. The number of key concepts appearing in the messages reflects the knowledge range of the author, so the usage of key concepts could be an indicator of learning quality. Evidence from children's language learning (Snow and Ferguson, 1997) and discourse analysis theories, e.g. Discourse Representation Theory (Kamp, 1981), shows that the primary concepts in text are carried by noun phrases. Therefore, noun phrases are considered the conceptual entities in text messages. We define a keyword as a simple, non-recursive noun phrase, i.e. a basic noun phrase. Identifying basic noun phrases in free text is well researched in the field of Natural Language Processing. We implemented a noun phrase extractor which combines both Markov-based (Church, 1988) and rule-based (Brill, 1992) approaches. It disambiguates a multi-part-of-speech (POS) word by examining its previous n (2 to 4) tokens against a list of manually defined syntactic rules. The free text is first tokenized. A simplified WordNet database (Fellbaum, 1998), which contains words divided into four categories (noun, verb, adjective, and adverb) and the number of senses of each word used in each of the categories, is used to assign the right POS tag. The initial POS tag for a word is determined by selecting the category with the maximum number of senses. If a word is found in more than one category, it is marked as a multi-tag word. The second stage is multi-tag disambiguation. For every multi-tag word, the sequence of the POS tags of the previous n tokens is examined against a list of predefined syntactic rules. For example, "hit" can be either a noun or a verb.
If the previous word is a determiner (the, a, this, etc.), it will be tagged as a noun rather than a verb, and the multi-tag mark is removed. If none of the rules is matched, some heuristics are used to disambiguate the multi-tag words. For instance, if a word is found in both the noun and the verb category but ends with "tion", it is tagged as a noun. After tagging the text, the noun phrase extractor extracts noun phrases by selecting the sequences of POS tags that are of interest. The current sequence pattern is defined as [A]{N}, where A refers to Adjective, N refers to Noun, [ ] means optional, and { } means repetition.
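As a concrete illustration, the [A]{N} selection step can be sketched as follows, assuming the text has already been tokenized and POS-tagged. The tag names ('A', 'N', etc.) and the helper function are illustrative, not the authors' implementation:

```python
# Sketch of selecting basic noun phrases matching the pattern [A]{N}:
# an optional adjective followed by one or more nouns.

def extract_basic_noun_phrases(tagged):
    """tagged: list of (word, tag) pairs; returns basic noun phrases."""
    phrases, i, n = [], 0, len(tagged)
    while i < n:
        start = i
        if tagged[i][1] == 'A':               # [A]: optional leading adjective
            i += 1
        nouns = 0
        while i < n and tagged[i][1] == 'N':  # {N}: one or more nouns
            i += 1
            nouns += 1
        if nouns:
            phrases.append(' '.join(w for w, _ in tagged[start:i]))
        else:
            i = start + 1                     # no noun run here; advance one token
    return phrases

tagged = [('the', 'D'), ('expert', 'A'), ('systems', 'N'),
          ('use', 'V'), ('business', 'N'), ('models', 'N')]
print(extract_basic_noun_phrases(tagged))  # ['expert systems', 'business models']
```

A real extractor would, of course, run this over the tagger's output rather than hand-built pairs, and would need the disambiguation stage described above to produce reliable tags first.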
Keyword Weighting Scheme
In previous studies (Chen and Wu, 2004; Wu and Chen, 2004), keywords (noun phrases) were assumed equally important. In this study we borrow the idea of term weighting in information retrieval (Salton and Buckley, 1988) to assign different weights to keywords. The importance of a keyword is measured by its frequency: the more frequently a keyword is used, the more important it is. However, if a keyword is used by more students, it becomes less important in terms of differentiating one student's contribution to the class concept base from the others'. In other words, a student fully contributes to the class concept base only by adding new keywords that are not used by any other student. The contribution of adding a keyword to the concept base decreases as the number of its authors increases. In the extreme case, when a keyword is used by all other students, adding it to the concept base results in no contribution at all. From another point of view, concepts contributed by more students tend to be more general ones. For instance, in a course on Information Systems, "information systems" is likely to be used by most of the students, while "expert systems" is used by only a few of them. Messages containing too many general concepts tend to be superficial, lacking deep analysis and strong arguments. In class discussions and other course work, we encourage students to synthesize what they have learned in class and add their own understanding of the course materials. Although the usage of more specific keywords does not necessarily result in high-quality work, it is still preferred because it indicates that the student is bringing in new concepts, not just repeating existing ones. The length of a noun phrase should also be taken into consideration. Longer noun phrases are more descriptive than shorter ones. As a real example from the experiment shows, "COCOMO software development model" is more descriptive than "development model", so we assign higher weights to longer phrases.
Based on the analysis above, we assign weights to keywords as follows:

w = (1 + log(len)) × f × log(N / n)

where w is the weight of a keyword, len is the length (number of words) of the keyword, f is the frequency of the keyword in the concept base, N is the total number of students in the class, and n is the number of students who use the keyword in their messages. We use a log function of the length to prevent it from becoming dominant as it increases. This function is similar to the tf.idf measure in Information Retrieval (Salton, 1989), but the inverse frequency of a keyword is computed across students, not messages.

Keyword Density
After assigning weights to keywords, we calculate Keyword Density (KD) by adding up the weights of the keywords contributed by each student. The result is divided by the sum of the weights of all keywords in the concept base:

KD_i = W_i / W

where KD_i is the keyword density for student i, W_i is the sum of the weights of the keywords contributed by student i, and W is the sum of the weights of all keywords.

Message Length
The quantity of a student's efforts is measured by the length of his/her messages (ML). ML is calculated by counting all the words (not noun phrases), duplicated or not, in the student's messages. The absolute number is proportioned by the total size, which is the number of words in the entire class messages. Let ML_i be the message length for student i, and n_ij be the number of words in message j of student i; then

ML_i = (Σ_j n_ij) / n

where n is the total number of words in the class messages.

Message Count
Activeness of participation could be measured by the number of log-on times, but this information is not available inside the text messages. If we consider posting a message a valid participation, participation frequency can be measured by the Message Count (MC), which is the number of messages posted by a student. MC is defined as
MC_i = n_i / n

where MC_i is the message count of student i, n_i is the number of messages posted by student i, and n is the total number of messages.

Assessment Model
Taken together, the three measures are combined to compute a Performance Indicator (PI) score, which is defined as

PI_i = α·KD_i + β·ML_i + γ·MC_i

where PI_i is the performance indicator score assigned to student i, and the coefficients α, β, and γ are the weights of the three measures respectively. The coefficients are adjustable: instructors can define the values by specifying the importance of each evaluation aspect. For example, by defining α=5, β=3, and γ=1, the instructor indicates a preference for giving higher grades to students who are better able to synthesize knowledge learned from the class, rather than to those who post many short, content-poor messages.

EXPERIMENT
Five classes were selected for model validation. All classes are supported by an electronic conference system that enables instructors and students to communicate with each other via asynchronous text message exchanges. The class information and the conference designs are summarized in Table 1.

ID | Domain | Conference Design
C1 | Management | Required discussions on major topics in the course. Students are graded according to their posts only.
C2 | Information Science | Optional discussions on topics that students would like to share with each other. Grades are assigned for students' online participation.
C3, C4, C5 | Information Systems | Activities include required class discussions, debates, oral presentations, assignment submissions, a course project, and optional discussions. Some of the grading items are judged according to the documents submitted online, while others are based on materials handed in, such as project reports, PowerPoint slides, and so on.

Table 1: Summary of Course Information

All class messages were downloaded and converted to plain text, from which keywords were extracted and weighted. When calculating the performance indicator score, we set the three coefficients, α, β, and γ, to 1, because we did not know the instructors' grading preferences.
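Under the definitions above, the whole linear model reduces to a few lines of code. The following sketch combines the three class-normalized measures into PI scores; the per-student aggregates (keyword-weight sums, word counts, message counts) and the function name are hypothetical, not the authors' implementation:

```python
# A minimal sketch of PI_i = alpha*KD_i + beta*ML_i + gamma*MC_i,
# with each measure normalized by its class-wide total as defined above.

def performance_indicators(kw_weight_sums, word_counts, msg_counts,
                           alpha=1.0, beta=1.0, gamma=1.0):
    W = sum(kw_weight_sums.values())     # total keyword weight in the concept base
    n_words = sum(word_counts.values())  # words in the entire class messages
    n_msgs = sum(msg_counts.values())    # total messages posted
    return {s: alpha * kw_weight_sums[s] / W     # KD_i = W_i / W
             + beta * word_counts[s] / n_words   # ML_i = (sum_j n_ij) / n
             + gamma * msg_counts[s] / n_msgs    # MC_i = n_i / n
            for s in kw_weight_sums}

# Hypothetical two-student class, all coefficients 1 as in the experiment:
pi = performance_indicators({'a': 2.0, 'b': 1.0},   # keyword-weight sums
                            {'a': 300, 'b': 100},   # word counts
                            {'a': 3, 'b': 1})       # message counts
```

Since each measure is a share of a class total, the PI scores of a class always sum to α + β + γ, which is why the per-class ranges in Table 3 stay well below 1 when all coefficients are set to 1.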
However, when the grading preferences are known, it is easy to adjust the coefficients to comply with them.

RESULTS AND ANALYSIS
Table 2 shows some selected keywords with their weights and frequencies. Although "business" has a high frequency (30), its weight is still 0 because it is used by all students. The results also show that longer phrases and more frequent phrases have higher weights.

Keyword | Weight | Frequency / # of Authors
business | 0.00 | 30/13
business activities | 5.15 | 1/1
business application technology | 6.39 | 1/1
business models | 9.88 | 3/3

Table 2: Selected Keywords and Their Weights
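The weighting scheme w = (1 + log(len)) × f × log(N/n) behind Table 2 can be sketched as follows. The input format (a map from student to extracted keywords) is our assumption, not the authors' data structure; note how a keyword used by every student gets weight 0, as "business" does in Table 2:

```python
import math
from collections import Counter

# Sketch of w = (1 + log(len)) * f * log(N / n), where len is the phrase
# length in words, f its frequency in the concept base, N the number of
# students, and n the number of distinct students who use the phrase.

def keyword_weights(student_keywords, num_students):
    freq = Counter()     # f: total occurrences in the concept base
    authors = Counter()  # n: number of distinct students using the keyword
    for kws in student_keywords.values():
        freq.update(kws)
        authors.update(set(kws))
    return {kw: (1 + math.log(len(kw.split()))) * f
                * math.log(num_students / authors[kw])
            for kw, f in freq.items()}

# Hypothetical two-student class:
kws = {'s1': ['business', 'expert systems', 'expert systems'],
       's2': ['business']}
w = keyword_weights(kws, num_students=2)
print(w['business'])  # 0.0 -- used by every student, so log(N/n) = log(1) = 0
```

A two-word phrase used by one of two students, like "expert systems" here, gets a positive weight from both the length factor (1 + log 2) and the cross-student inverse frequency log(2/1).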
Table 3 summarizes the PI scores of the five classes.

Class | Range | N* | Mean (Std. Dev.)
C1 | 0.01~0.23 | 32 | 0.13 (0.06)
C2 | 0.01~0.53 | 27 | 0.16 (0.15)
C3 | 0.16~0.51 | 15 | 0.28 (0.11)
C4 | 0.04~0.38 | 17 | 0.18 (0.10)
C5 | 0.00~0.47 | 19 | 0.23 (0.12)
*: Number of students

Table 3: Summary of the PI Scores

To examine how accurately the PI scores assess student learning, we calculated the Pearson product-moment correlation (r) between the PI scores and the grades assigned by the instructor. Correlations between the individual measures and the actual grades were also calculated (see Table 4 for details). The results show that there is a high correlation between the PI scores and the actual grades (from 0.57 to 0.92). According to reports in the essay grading literature (Page and Petersen, 1995; Larkey, 1998; Foltz et al., 1999; Landauer and Psotka, 2000; Burstein et al., 2003), the correlation between the judgments of human experts varies from approximately 0.5 to 0.9. It is reasonable to assume that the correlation between the grades of two instructors also falls into this range. The results, therefore, suggest that the computer grader performs well in terms of correlating with human evaluators.

Class | r_PI-G | r_KD-G | r_ML-G | r_MC-G | r_ro
C1 | 0.92 | 0.90 | 0.87 | 0.88 | 0.94
C2 | 0.87 | 0.85 | 0.84 | 0.84 | 0.98
C3 | 0.62 | 0.54 | 0.58 | 0.77 | 0.74
C4* | 0.80 | 0.79 | 0.78 | 0.74 | 0.94
C5 | 0.62 | 0.65 | 0.50 | 0.52 | 0.63

r_PI-G: Correlation between the PI scores and the actual grades
r_KD-G: Correlation between the KD scores and the actual grades
r_ML-G: Correlation between the ML scores and the actual grades
r_MC-G: Correlation between the MC scores and the actual grades
r_ro: Correlation between the rank orders (Rg and Rp) of students
*: Calculated after the removal of an outlier

Table 4: Correlations

The low correlations in C3 and C5 caught our attention. By looking into the conference boards, we found the following possible reasons:
- The two courses contain a few conferences that are designed for course administration. For example, C3 has three conferences for assignment distribution and submission.
The questions and answers regarding the assignments should have been excluded from the analysis. Another similar conference is the exam conference. If the content of a conference is not used by the instructor for grading but is processed by our program, noise will be introduced.
- The conference boards have some files that cannot be processed by the current version of our program, such as attachments, PowerPoint slides, and audio files. However, these were manually evaluated by the instructors.
- There are private/closed conferences that cannot be processed by our program.

Unlike essay grading approaches, which attempt to assign a concrete score to an object, our performance indicator score does not attempt to predict the actual grades of students. It assesses students by comparing them with each other to distinguish good students from poor ones. The rank order of students by PI score would therefore be more interesting to instructors. It could help instructors assess students by grouping them at different levels, which is still a common approach used in practice
by many instructors. First, students are ranked by their actual grades in descending order, and the rank order is recorded as Rg. Similarly, another rank order, Rp, ranks students by their PI scores. The Spearman rank order correlation between Rg and Rp is then calculated. The results are shown in the last column of Table 4. The high correlation between Rg and Rp suggests that the PI scores rank students correctly. Evidence of the supplementary role of the PI scores was also found in the experiment. PI scores that deviate from the actual grades may suggest inappropriate grades, something special in the messages, or both. In C2, the PI score of one student was noticeably higher than the actual grade. By re-examining the student's messages, the instructor found that the student had copied and pasted a long message, along with the source URL, from the web without adding his/her personal opinions. Even though the instructor had encouraged students to share anything they found relevant and interesting to the class, without personal opinions and thoughts the instructor considered this to be less effort. Therefore, the original grade was confirmed. A similar case was found in C4, in which a student (hereafter known as S) got the highest PI score but a low grade. When consulted, the instructor explained that S was an exception in that class: S submitted almost every assignment late because of family and personal medical problems. The instructor accepted S's late assignments but gave low grades. After the makeup exam, the instructor changed S's final grade to a higher one. For this reason, we excluded S from the analysis when calculating the correlations (see Table 4 for details). Figure 1 illustrates the close relationship between Rg and Rp in C4, except for the outlier S discussed above.

Figure 1: Rank Orders of Students by PI Score and by Grade (C4)

With the program serving as a second grader, instructors are able to capture outliers and to reduce misjudgment, bias, and errors in grading.
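The two evaluation statistics used above, Pearson's product-moment r and the Spearman rank order correlation, can be computed with a short standard-library sketch. Tie handling in the ranking is omitted for brevity, and the grade and PI data below are made up:

```python
from statistics import mean

# Pearson product-moment correlation between two equal-length sequences.
def pearson(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Rank 1 = highest value (descending order, as for Rg and Rp); ties ignored.
def ranks(v):
    order = sorted(range(len(v)), key=lambda i: v[i], reverse=True)
    r = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

# Spearman rank order correlation = Pearson correlation of the rank vectors.
def spearman(x, y):
    return pearson(ranks(x), ranks(y))

grades = [95, 88, 76, 60]          # hypothetical instructor grades
pi     = [0.41, 0.33, 0.21, 0.05]  # hypothetical PI scores
```

With real data, a statistics package's implementations (which handle ties by averaging ranks) would be preferable; the sketch only shows the relationship between the two measures reported in Table 4.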
CONCLUSION
We have presented a model for automated assessment of student learning in virtual classrooms. The model is based on a text mining technique: keyword density mining from a large body of text. Noun phrases are extracted from class messages and weighted to build a concept base. Keyword density is then calculated for each student and is further combined with two other measures, message length and message count, into a linear model which predicts a performance indicator score for each student. The experiment shows that the assessment model works well. For both the absolute scores and the relative rank orders, correlations between the PI scores and the actual grades are generally high. Our future research plans include enhancing the current program so that it can handle attachment files and deal with special conferences. We also plan to test the model against courses in other domains.
REFERENCES
1. Bachman, L. F., Carr, N., Kamei, G., Kim, M., Pan, M. J., Salvador, C., and Sawaki, Y., A reliable approach to automatic assessment of short answer free responses, Proceedings of COLING 2002, The 19th International Conference on Computational Linguistics.
2. Brill, E., A simple rule-based part of speech tagger, Proceedings of the Third Annual Conference on Applied Natural Language Processing, ACL, 1992.
3. Burstein, J., Kaplan, R., Wolff, S., and Lu, C., Using Lexical Semantic Techniques to Classify Free-Responses, Proceedings of the SIGLEX 1996 Workshop, Annual Meeting of the Association for Computational Linguistics, University of California, Santa Cruz, 1996.
4. Burstein, J., Leacock, C., and Chodorow, M., Criterion(SM) Online Essay Evaluation: An Application for Automated Evaluation of Student Essays, Proceedings of the Fifteenth Annual Conference on Innovative Applications of Artificial Intelligence, Acapulco, Mexico, August 2003.
5. Chen, X. and Wu, B., Automated Evaluation of Students' Performance by Analyzing Online Messages, IRMA 2004, New Orleans.
6. Church, K. W., A Stochastic Parts Program and Noun Phrase Parser for Unrestricted Text, Proceedings of ANLP-88, Austin, TX, USA, 1988.
7. Feldman, R. and Dagan, I., Knowledge Discovery in Textual Databases (KDT), Proceedings of the First International Conference on Knowledge Discovery, KDD-95.
8. Fellbaum, C. (Ed.), WordNet: An Electronic Lexical Database, MIT Press, Cambridge, MA, 1998.
9. Foltz, P. W., Laham, D., and Landauer, T. K., The Intelligent Essay Assessor: Applications to Educational Technology, Interactive Multimedia Electronic Journal of Computer-Enhanced Learning, 1(2), 1999.
10. Hearst, M. A., Untangling Text Data Mining, Proceedings of ACL'99: the 37th Annual Meeting of the Association for Computational Linguistics, University of Maryland, June 20-26, 1999.
11. Hiltz, S. R., The Virtual Classroom: Learning Without Limits Via Computer Networks, Ablex, Norwood, NJ, 1994.
12. Jones, E. L., Grading student programs - a software testing approach, The Journal of Computing in Small Colleges, 16(2), January 2001.
13. Kamp, H., A Theory of Truth and Semantic Representation, in Formal Methods in the Study of Language, Vol. 1 (J. Groenendijk, T. Janssen, and M. Stokhof, Eds.), Mathematisch Centrum, 1981.
14. Landauer, T. K. and Psotka, J., Simulating Text Understanding for Educational Applications with Latent Semantic Analysis: Introduction to LSA, Interactive Learning Environments, 8(2), pp. 73-86, 2000.
15. Larkey, L. S., Automatic essay grading using text categorization techniques, Proceedings of the Twenty-First Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Melbourne, Australia, pp. 90-95, 1998.
16. Page, E. B., The imminence of grading essays by computer, Phi Delta Kappan, pp. 238-243, January 1966.
17. Page, E. B., Computer grading of student prose, using modern concepts and software, Journal of Experimental Education, 62, pp. 127-142, 1994.
18. Page, E. B. and Petersen, N. S., The computer moves into essay grading, Phi Delta Kappan, pp. 561-565, March 1995.
19. Salton, G. and McGill, M. J., Introduction to Modern Information Retrieval, McGraw-Hill, 1983.
20. Salton, G. and Buckley, C., Term weighting approaches in automatic text retrieval, Information Processing and Management, 24(5), pp. 513-523, 1988.
21. Snow, C. E. and Ferguson, C. A. (Eds.), Talking to Children: Language Input and Acquisition, Cambridge University Press, Cambridge, 1997.
22. Winkler, R. L. and Clemen, R. T., Multiple Experts vs. Multiple Methods: Combining Correlation Assessments, Decision Analysis, November 2002.
23. Wu, Y. B. and Chen, X., Assessing Distance Learning Students' Performance: A Natural Language Processing Approach to Analyzing Online Class Discussion Messages, ITCC 2004, Las Vegas.