How Course Management Systems Can Benefit from Exploratory Analysis of Student On-line Activity

Manuel Rubio-Sánchez, Raquel Hijón-Neira, Francisco Domínguez-Mateos, and Ángel Velázquez-Iturbide

Rey Juan Carlos University
Department of Computer Science Languages and Systems
c/ Tulipán s/n, 28-933 Madrid, Spain
{manuel.rubio,raquel.hijon,francisco.dominguez,angel.velazquez}@urjc.es

Abstract. Over the last few years, significant efforts have been made to develop effective Course Management Systems (CMS). Recently, they have begun to incorporate databases that contain valuable information about student on-line activity, behavior, scores, etc. Although CMS provide descriptive statistics related to these databases, they still lack adequate Exploratory Data Analysis (EDA) and Data Mining tools that would enhance a knowledge discovery process capable of revealing interesting patterns and hidden information. This paper analyzes the utility of EDA tools by examining data associated with a particular course taught through the Atnova Virtual Campus CMS. The study provides valuable information to be considered by any CMS, such as patterns of student activity, relevance of multivariate methods, or guidelines regarding the construction of databases.

1 Introduction

Many sophisticated CMS have been developed, and are in use, around the world. Educators who use these environments and tools, however, have very little support for evaluating students' activities and discriminating between their different on-line behaviors. New ways of obtaining information about learning patterns should be studied. This requires the development of effective methods for determining and evaluating student behavior in electronic environments.
In addition to the descriptive statistics provided by most web access log analysis tools (hit frequency, average, median, length and duration of sessions, and other limited low-level statistical measures), several data mining approaches have been adapted specifically for web usage mining [1]. The most popular methods include association rule mining, clustering, classification, sequential pattern analysis, and dependency modelling [2], as well as prediction. None of these applications, however, was tailored to distance learning, although the methods are general enough for e-learning systems to benefit from them. All of these techniques were designed for knowledge discovery from very large databases of numerical data [3], were adapted for web mining, and have been applied in on-line businesses with relative success. Accordingly, in [4] the authors designed and implemented a prototype tool that lets educators apply association rules to discover relationships between the learning activities that learners perform, sequential analysis to reveal interesting patterns in the sequences of on-line activities, and clustering to group similar access behaviors. Another study [5] presents a special web sequence analyzer for improving web page layout and structure based on the history of access sequences.

Carr-Chellman and Duchastel [6] argue that simply transposing traditional course material onto the web does not use the medium to its best advantage, and that effective on-line instruction has specific and different design requirements. However, determining learning behavior in electronic media is a complex problem. The difficulty resides in the fact that students mostly use these environments away from the classroom and out of sight of their educators. Without the informal monitoring that occurs in face-to-face teaching, it is difficult for educators to know how their students are using and responding to these environments. Educators have had to seek new ways of obtaining information about the learning patterns of their students. This requires the development of effective methods that determine and evaluate learner behavior in electronic environments.

For example, an analysis of student use of a courseware website (see [7]) found that the most popular on-line activities were passive and involved retrieving information rather than contributing. The authors concluded that the students were very goal oriented in their use of the website. Further information can be obtained by discovering how students access resources [8]. This can help educators understand students' preferred learning patterns.
A study carried out in [9] explored interactions of doctoral students with an on-line environment, and concluded that student interactions were goal focused. For instance, in a study of student use of a first-year geology website (see [10]), log file analysis showed that students accessed the most recent lecture notes first, picking up a couple of key slides, before returning to a previous lecture. This showed that students were accessing resources according to immediate need. Following a similar approach, another study showed that the average connection to the CMS lasted over thirty minutes [11].

Approaches for improving the e-learning experience in web-based educational systems, whose ultimate objective is the discovery of system usage patterns and, more generally, knowledge discovery in databases, include [12]: (1) methods of classifying students based on their usage patterns in a web-based course; (2) methods of system personalization (see [13]); and (3) methods that allow automatic detection of atypical behaviors.

Despite the existence of some research concerned with the mining of data generated by the use of e-learning systems, there is still a lack of standard methods and techniques to address some open problems in distance education. This paper analyzes the utility of EDA tools by examining data associated with a particular course taught through the Atnova Virtual Campus CMS. The study provides valuable information to be considered by any CMS, such as patterns of student activity, relevance of multivariate methods, or guidelines regarding the construction of databases.

The rest of the paper is organized as follows. Section 2 provides an overview of the course under study. Section 3 describes the different exploratory analysis methods, along with their results. Finally, several conclusions are summarized in Sec. 4.

2 Brief Course Description

The course under analysis, How to visually show data and explanations, belongs to the ADA-Madrid project [14], a pioneering program aimed at promoting the use of information and communication technologies in on-line teaching activities. Every year the project comprises about 20 general elective courses offered to students who attend any of the six public universities of Madrid (Complutense, Autónoma, Rey Juan Carlos, Politécnica, Alcalá, Carlos III). The maximum number of students per course is 60 (10 per university). Teaching is carried out via the Internet through the Atnova Virtual Campus CMS (see Fig. 1), but can also be complemented with special videoconference sessions.

Fig. 1. ADA-Madrid Virtual Campus.

The course is based on information visualization [15], a classical area of computer science, but is presented to students in a simple and pleasing manner. This is a requirement imposed by the ADA-Madrid project, under which students must be able to benefit from the courses regardless of their specific major. In this sense, the approach is general and informative, and the main concepts stem from the areas of graphical design and visualization. There exist several works on experiences with similar courses [16].

The teaching methodology consists of posting a new lesson every week (9 lessons in total). Students can either read the lessons directly on screen or print them on paper. They can also download them, as they are posted in commonly used electronic formats (PDF, HTML, DOC, AVI).
Interactions with professors and with other students are carried out by means of special forums, a chat tool, or simply by exchanging messages (not emails) within the Atnova Virtual Campus.
Evaluations are based on two assignments in which students are required to use graphical computer tools (40% each), and a final test (20%).

3 Exploratory Analysis

In order to gain insight into students' learning patterns and behavior, enhance the teaching methodology for future courses, and discover information unknown a priori, several visual exploratory analyses were carried out. For this purpose, a working-database was constructed to analyze the following numeric variables and their relationships: scores on both assignments, the test, and the final grade (4 variables); number of posts in forums, total number of established sessions, and total connection time in minutes (3 variables); number of established sessions per lesson (9 variables); and connection time in minutes per lesson (9 variables). Two additional nominal/categorical variables were also considered: whether or not the student dropped the course, and the student's major (used to estimate computer user skills).

3.1 Data Acquisition Process

The Atnova Virtual Campus CMS provides several predefined tables or databases, which can be exported to files with a particular format, containing information related to the numeric variables, for example: connection times per day, sessions and connection times, connection times per session, connection times per lesson, test scores, or final grades. The system is quite flexible, since instructors can customize the size or range of the tables. For instance, when analyzing connection times per day it is possible to choose a date interval (the first and last day). However, the system does not incorporate a tool for combining information from various tables. Note that the working-database used in the analysis contains data from several tables. This led to a very tedious, practically manual, task of merging them. One of the main problems encountered was the different number of entries per table.
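The merging step can be illustrated with a minimal sketch. It assumes that each exported table has already been read into a dictionary keyed by a student identifier; the function name `merge_tables`, the column names, and the sample values are hypothetical and are not part of the Atnova export format. Students missing from a table (for example, those who dropped the course before being graded) receive `None` for that table's columns, which is how the differing number of entries per table shows up in the merged result.

```python
from typing import Dict, List, Optional

# One exported table: student id -> {column name: value} (assumed layout).
Table = Dict[str, Dict[str, float]]


def merge_tables(tables: List[Table]) -> Dict[str, Dict[str, Optional[float]]]:
    """Outer-join several tables on student id; absent values become None."""
    # Collect every column name, preserving first-seen order.
    columns: List[str] = []
    for table in tables:
        for row in table.values():
            for col in row:
                if col not in columns:
                    columns.append(col)

    # Every student that appears in at least one table gets a full row.
    students = sorted({sid for table in tables for sid in table})
    merged: Dict[str, Dict[str, Optional[float]]] = {
        sid: {col: None for col in columns} for sid in students
    }

    # Fill in the values that are actually available.
    for table in tables:
        for sid, row in table.items():
            merged[sid].update(row)
    return merged


# Hypothetical exports: s02 has activity data but no final grade.
sessions = {
    "s01": {"sessions": 42, "minutes": 1310},
    "s02": {"sessions": 3, "minutes": 25},
}
grades = {"s01": {"final_grade": 8.5}}

working = merge_tables([sessions, grades])
```

Here `working["s02"]["final_grade"]` is `None`, marking the missing entry explicitly rather than silently dropping the student. Doing this by hand for every exported table is exactly the tedious step described above.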
Therefore, it would be very desirable for CMS to provide mechanisms enabling educators to create customized databases (see Fig. 2). In this setting, instructors would be able to select the specific variables needed for their analyses effortlessly.

Fig. 2. CMS should provide tools for merging different tables.

The data acquisition process can also reveal surprising information. In this case, when compiling information about the students' majors, it was found that every student had specified theirs, except for the 10 students belonging to the University of Alcalá. This suggests a possible problem in the application/registration system at that university.

3.2 Specific Category Analysis

In some analyses it is necessary to consider a particular partition of the original working-database. For on-line courses, it seems mandatory to evaluate students' performance according to their computer user skills. Several EDA techniques can be used to examine or estimate probability distributions. Fig. 3 uses box-plots to show the distributions of students' grades (on a scale from 0 to 10) according to their computer user skills: students with better computer user skills obtained, on average, higher grades. This result indicates, for example, the need to provide less experienced students with tutorials and user manuals to ease their adaptation to the CMS.

Fig. 3. Analysis of final grades according to computer user skills with box-plots.

Other aspects, such as gender, may also be analyzed. Although there have been various studies of web usage, few report differences in how courseware is used according to gender. The study in [7] found differences in the type of resources accessed by male and female students. Males used interactive resources significantly more than females, whereas females used passive resources more than males.

3.3 Correlations between Variables

The relationships between variables can also provide useful information. There exist multiple techniques for this purpose (see [17–19]). In Fig. 4 the correlation coefficient matrix (absolute values) shows a curious behavior pattern: there exists a high degree of correlation between the number of sessions used in consecutive lessons. Given this result, sharp drops in interest or performance could help identify potential drop-out students. In on-line courses it is very important to detect and understand, as early as possible, factors responsible for
dropping a course. Note that when a student drops a course, missing data appears in the database, which is difficult to process. Furthermore, the success of any on-line course depends on a high degree of student participation; instructors should therefore pay special attention to this matter. Thus, it would be desirable for CMS to provide adaptive guidance tools to prevent student failure.

Fig. 4. Absolute values of the symmetric correlation coefficient matrix.

A study [20], which analyzed log file interactions with different resources on a courseware website, found a relationship between the frequency of access to learning resources and final exam scores. The authors claim that this provides evidence that the use of relevant web contents improves learning. On the other hand, [21] reports a study of six measures of student behavior in a CMS that did not consistently correlate with grades. The same situation appears in the present study.

3.4 Multivariate Analysis

Multivariate analysis methods can detect similarities in learning behaviors and group (cluster) students. However, due to the curse of dimensionality (see [18, 19]), it is difficult to draw conclusions when the number of students is low or when working with a large number of variables. Fig. 5 shows projections of the working-database onto a plane with two graphical multivariate analysis methods: (a) Sammon's mapping [22], and (b) a self-organizing map, visualized with the U-matrix method [23, 19]. Since the database only contains information about 47 students (13 dropped the course), and since there are up to 25 variables, it is very unlikely for clusters to appear in the images. The only valuable information concerns outliers, which correspond to two extremes: users who barely access the CMS, and users who appear to be connected for weeks.
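As a rough illustration of what a projection such as the one in Fig. 5(a) tries to achieve, the sketch below evaluates Sammon's stress [22], the error measure that Sammon's mapping seeks to minimize: the sum over all pairs of points of the squared difference between original and projected distances, each term divided by the original distance, and the whole sum normalized by the total of the original distances. The paper does not describe its implementation, so the function names `euclidean` and `sammon_stress` and the sample points are assumptions made for illustration. The code only measures how well a given projection preserves distances; it does not perform the iterative optimization itself.

```python
import math
from typing import Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sammon_stress(
    high: Sequence[Sequence[float]], low: Sequence[Sequence[float]]
) -> float:
    """Sammon's stress between original points and their projections."""
    if len(high) != len(low):
        raise ValueError("both point sets must have the same length")
    total = 0.0  # sum of original pairwise distances (normalizer)
    error = 0.0  # weighted sum of squared distance differences
    for i in range(len(high)):
        for j in range(i + 1, len(high)):
            d_star = euclidean(high[i], high[j])
            if d_star == 0.0:
                # Assumption: coincident points are skipped to avoid
                # dividing by zero.
                continue
            d = euclidean(low[i], low[j])
            total += d_star
            error += (d_star - d) ** 2 / d_star
    if total == 0.0:
        raise ValueError("all original points coincide")
    return error / total


# Hypothetical data: three collinear 3-D points and two candidate projections.
original = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (6.0, 8.0, 0.0)]
faithful = [(0.0, 0.0), (5.0, 0.0), (10.0, 0.0)]  # preserves every distance
collapsed = [(0.0, 0.0), (5.0, 0.0), (0.0, 0.0)]  # folds the third point back

stress_faithful = sammon_stress(original, faithful)
stress_collapsed = sammon_stress(original, collapsed)
```

A stress of zero means that every pairwise distance is preserved, as with `faithful`; larger values indicate more distortion. Because each term is divided by the original distance, errors on small distances weigh more heavily, so the projection favors local structure. With 25 variables and fewer than 50 students, some distortion is to be expected in any planar projection.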
Fig. 5. Graphical multivariate methods: (a) Sammon's mapping; (b) self-organizing map.

4 Conclusions

This paper analyzes the utility and necessity of EDA tools for enhancing CMS. The study shows that these systems should incorporate a tool for customizing database tables in order to enable the selection of any particular set of variables. In this setting, EDA methods can be used effectively to discover information. Most techniques work well if the number of variables to be analyzed is relatively low. For instance, in the analyzed course, students with better computer user skills obtained, on average, higher grades. Detecting clusters in the data proved impractical due to the low number of students and the high dimensionality of the database. Graphical multivariate methods, however, were capable of recognizing outlier students. Finally, a pattern of student activity was discovered by analyzing correlations between variables: the numbers of sessions used in consecutive lessons were highly correlated.

5 Acknowledgements

This work is supported by project TIN2004-07568 of the Spanish Ministry of Education and Science.

References

1. Zaïane, O.R.: Web usage mining for a better web-based learning environment. In: Proc. of Conference on Advance Technology for Education, Banff, Alberta (2001) 60–64
2. Srivastava, J., Cooley, R., Deshpande, M., Tan, P.: Web usage mining: Discovery and applications of usage patterns from web data. SIGKDD Explorations 1(2) (2000) 12–23
3. Han, J., Kamber, M.: Data Mining Concepts and Techniques. Morgan Kaufmann Publisher (2001)
4. Zaïane, O.R.: Towards evaluating learners' behaviour in a web-based distance learning environment. In: Second IEEE International Conference on Advance Learning Technologies (ICALT'01), Madison, WI (2001) 357–360
5. Spiliopoulou, M., Faulstich, L.C., Winkler, K.: A data miner analyzing the navigational behaviour of web users. In: Proc. of workshop on Machine Learning in User Modeling of the ACAI'99, Creta, Greece (1999) 357–360
6. Carr-Chellman, A., Duchastel, P.: The ideal online course. British Journal of Educational Technology 31(3) (2000) 229–241
7. Peled, A., Rashty, D.: Logging for success: Advancing the use of WWW logs to improve computer mediated distance learning. Journal of Educational Computing Research 21 (1999) 413–431
8. Sheard, J., Albrecht, D.W., Butbul, E.: ViSION: Visualization student interactions online. In Treloar, A., Ellis, A., eds.: Proc. of the Eleventh Australasian World Wide Web Conference, Gold Coast, QLD, Australia, Southern Cross University, Lismore, NSW, Australia (2005) 48–58
9. McIsaac, M.S., Blocher, J.M., Mahes, V., Vrasidas, C.: Student and teacher perceptions of interaction in online computer-mediated communication. Educational Media International 36(2) (1999) 121–131
10. Hellwege, J., Gleadow, A., Naught, C.M.: Paperless lectures on the web: An evaluation of the educational outcomes of teaching geology using the web. In: Proc. of 13th Annual Conference of the Australian Society for Computers in Learning in Tertiary Education (ASCILITE'96), Adelaide, Australia, University of South Australia (1996)
11. Hijón, R., Velázquez, A.: Web, log analisis and surveys for tracking university students. In: Proc. of IADIS International Conference on Applied Computing, San Sebastián, Spain (2006) 561–564
12. Castro, F., Vellido, A., Nebot, A., Minguillón, J.: Detecting atypical student behaviour on an e-learning system. In: I National Symposium of Technologies of the Information and the Communications in the Education, SINTICE 2005, Granada, Spain (2005) 153–160
13. Brusilovsky, P.: Adaptive hypermedia. User Modeling and User-Adapted Interaction, Ten Year Anniversary Issue (1/2) (2001) 87–110
14. Velázquez, A., Rubio, M.: Chap. 12: Design and evaluation of the course: How to visually show data and explanations. In Criado, R., Conde, J.V., eds.: I Pedagogical Conference of the ADA-Madrid Project. (2005) 119–125 In Spanish.
15. Spence, R.: Information Visualization. Addison-Wesley, Harlow, England (2001)
16. Carter, R.: Teaching visual design principles for computer science students. Computer Science Education 3(1) (2003) 67–90
17. Tukey, J.W.: Exploratory Data Analysis. Addison-Wesley, Reading, Mass. (1977)
18. Du Toit, S.H., Steyn, A.G., Stumpf, R.H.: Graphical exploratory data analysis. Springer-Verlag, New York, NY, USA (1986)
19. Rubio, M.: New Methods for Visual Analysis of Self-Organizing Maps. PhD thesis, Polytechnic University of Madrid, Madrid, Spain (2004) In Spanish.
20. Lu, A.X., Zhu, J.J., Stokes, M.: The use and effects of web-based instruction: Evidence from a single-source study. Journal of Interactive Learning Research 11(2) (2000) 197–218
21. Nickles, G.M.: Correlations of student grades and behavior while using a course management system under different contexts. In: Proc. of the American Society for Engineering Education Annual Conference & Exposition, Portland, OR, US (2005)
22. Sammon Jr., J.W.: A nonlinear mapping for data structure analysis. IEEE Transactions on Computers C-18(5) (1969) 401–409
23. Kohonen, T.: Self-Organizing Maps. Third edn. Springer (2001)