Higher Education Web Information System Usage Analysis with a Data Webhouse

Carla Teixeira Lopes (ESTSP/FEUP, Portugal, carla.lopes@fe.up.pt) and Gabriel David (INESC-Porto/FEUP, Portugal, gtd@fe.up.pt)

Abstract. Usage analysis of a Web Information System is a valuable help to predict user needs, to assess the system's impact and to guide its improvement. This is usually done by analysing clickstreams, a low-level approach involving huge amounts of data, which calls for data warehouse techniques. This paper presents a dimensional model to monitor user behaviour in Higher Education Web Information Systems and an architecture for the extraction, transformation and load process. These have been applied in the development of a data warehouse to monitor the use of SIGARRA, the University of Porto's Higher Education Web Information System. The efficiency and effectiveness of this monitoring method were confirmed by the knowledge extracted from the analysis of a 3-month period. The main results and recommendations are also briefly described.

1 Introduction

The Web is growing in the number of users [12], usage rate [12] and complexity of its sites [5]. This medium is also frequently used as an access interface to organizational Information Systems (IS) and their applications. As the experience and expectations of users increase, the need to know and meet user demands becomes more pertinent. Monitoring users' behaviour helps to identify their needs and allows system adaptation based on their previous behaviour [15]. Besides system adaptation, it also supports the evaluation of the system against its initial specifications and goals, enables the development of personalization strategies [1, 4, 6], helps to increase system performance [6, 7], supports marketing decisions [3], helps to detect business opportunities that could otherwise remain unnoticed [10] and may contribute to increased system security [4, 14].

Monitoring the use of Web Information Systems (WIS) involves analysing clickstreams, a data source that aggregates information about all user actions in a website. Log file analyzers, applications that extract data directly from log files and generate several kinds of statistics, are one of the most widely adopted solutions to monitor WIS usage [11]. However, with this technique it is hard, if not impossible, to reach the level of analysis that other techniques allow. Log file analyzers lack the ability to integrate and correlate information from
different sources. They cannot, for example, correlate the number of accesses from a student to the web site with the programme the student is enrolled in. An alternative with more analytic potential, suitable for processing large quantities of data (as happens with clickstream data), is to use a data webhouse, that is, a data warehouse that stores clickstreams and other contextual data in order to understand user behaviour [8].

In Section 2 a dimensional model suitable for monitoring Higher Education Web Information Systems (HEWIS) is presented. This is the model used in the data webhouse that monitors the usage of the University of Porto's (UPorto) HEWIS. The architecture and a description of the processes involved in the extraction, transformation and load (ETL) are presented in Section 3.1. Section 3.2 presents some of the main results and Section 3.3 some recommendations. Conclusions and lines of future work are presented in the last section.

2 Dimensional Model

Considering the HEWIS scenario, a dimensional model to monitor this specific type of WIS usage has been defined. This process began with context analysis, followed by the establishment of the granularity, the definition of the relevant dimensions and the identification of facts.

2.1 Granularity

Although dimensional models should be developed with the most atomic information [9], when the business process is associated with very large quantities of information it is crucial to choose a granularity that is meaningful to the user and that, at the same time, adds value to the organization's knowledge. Since the main goal of the present data warehouse is the analysis of user behaviour, it has been decided to implement a granularity of web pages (see Figure 1) and web sessions (see Figure 2). The web page grain allows answering questions related to user actions inside sessions, which is not possible with just a session fact table. The web session grain allows greater performance on questions related to WIS sessions.

Fig. 1. Web Page Fact Table
Fig. 2. Web Session Fact Table

2.2 Dimensions

The model has 12 dimensions, described next. The Academic Date, User, Page, Session Type and Institution dimensions are specific to the higher education context.

Access Date. The Access Date dimension stores the civil calendar day on which the request was made. It has a single hierarchy with four levels: Year, Quarter, Month and Day.

Time of Day. To avoid the size of a dimension that would store the time of day for each day of the civil calendar, time has been split into a separate dimension. This dimension has one hierarchy with three levels: Hour, Minute and Second, and holds a record for each second of a day.
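As an illustration of these two dimensions, a minimal SQL sketch follows; all table and column names are assumptions for illustration, not the actual SIGARRA webhouse schema.

```sql
-- Minimal sketch of the two time-related dimensions (names are assumptions,
-- not the actual SIGARRA webhouse schema).
CREATE TABLE dim_access_date (
  access_date_key  NUMBER      PRIMARY KEY,  -- surrogate key
  full_date        DATE        NOT NULL,
  day_of_month     NUMBER(2)   NOT NULL,     -- Day level
  month_name       VARCHAR2(9) NOT NULL,     -- Month level
  quarter_of_year  NUMBER(1)   NOT NULL,     -- Quarter level
  calendar_year    NUMBER(4)   NOT NULL      -- Year level
);

-- One row per second of a generic day (24 x 60 x 60 = 86 400 rows),
-- instead of one row per second of every calendar day.
CREATE TABLE dim_time_of_day (
  time_of_day_key   NUMBER    PRIMARY KEY,   -- surrogate key
  hour_of_day       NUMBER(2) NOT NULL,      -- Hour level
  minute_of_hour    NUMBER(2) NOT NULL,      -- Minute level
  second_of_minute  NUMBER(2) NOT NULL       -- Second level
);
```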
Academic Date. An academic calendar is usually associated with different structures that differ in the number of modules (semesters, four-month periods and trimesters). Each of these structures is a different hierarchy, each with five levels: Year, Module (6, 4 or 3 months), Period (classes or examination period), Week (variable length, defined internally by each institution) and Day. The dimension also has a hierarchy related to academic sessions, which are specific periods, of variable length, in an academic calendar. For example, the University Day (UPorto's anniversary day) is a one-day academic session. All vacations are academic sessions. This hierarchy has three levels: Year, Session and Day. The Session level has information about the start and end date of the session, the session type (with classes, without classes, vacations) and the number of days in the session.

User. This is a crucial dimension for the segmentation of users and for behaviour analysis. Accesses can be made by human users (identified or anonymous) or web crawlers. Identified users are students or workers (faculty or staff). Anonymous users are those who access the HEWIS without signing in. Compared with WIS that gather information from online registration forms, HEWIS have the advantage of more trustworthy information about identified users, as it is usually obtained at the student's school registration or in the worker's act of contract. This dimension stores information about the user's academic degree, age group, gender, civil status, activity status, birthplace, role and department/service.

User Machine. The User Machine dimension gathers information about the physical geography (country) and web geography (top-level domain, domain) of the machine that generates the web request. It also records the machine's location with respect to the institution and the university and the nature of the access (for example: structured network, wireless network).

Agent. This dimension keeps information about the agent that made the request, either a browser used by a human or a crawler.
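Since the User dimension is central to segmentation, a minimal sketch of how it could be laid out is given below; attribute names and types are assumptions for illustration.

```sql
-- Minimal sketch of the User dimension (names and types are assumptions).
-- One row per identified user, plus generic rows for anonymous users
-- and for each known crawler.
CREATE TABLE dim_user (
  user_key         NUMBER       PRIMARY KEY,  -- surrogate key
  user_type        VARCHAR2(10) NOT NULL,     -- student / faculty / staff / anonymous / crawler
  academic_degree  VARCHAR2(40),
  age_group        VARCHAR2(10),
  gender           CHAR(1),
  civil_status     VARCHAR2(15),
  activity_status  VARCHAR2(15),
  birthplace       VARCHAR2(60),
  user_role        VARCHAR2(40),
  department       VARCHAR2(60)
);
```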
Page. This dimension is an obvious one in the context of WIS monitoring. Although it has been modelled with SIGARRA in mind, it can easily be adapted to other types of HEWIS. It has one hierarchy with four levels: Application, Module, Procedure and Page. An application is an autonomous software artefact with one or more modules. Modules are logical units of the main functionalities and can be seen as sets of related procedures. A procedure generates pages and is the conceptual unit of interaction with the user. The same procedure generates different pages if the received arguments are distinct. For instance, the official pages of department A and department B are both generated by the same procedure.

Referrer. This dimension describes the page that preceded the current access. This information is gathered from the log files and concerns the domain of the referrer and the referrer itself: port, procedure (if it belongs to the HEWIS), query (everything that follows the "?" in a URL), the identification and description of the search engine (if applicable) and the complete URL.

HTTP Status Code. This dimension has the category of the HTTP status code (Informational, Success, Redirection, Client Error, Server Error) and the description of the HTTP status code returned in the request.

Session Type. Here, web sessions are aggregated into predefined types of sessions. The dimension has one hierarchy with several levels: session context (for example: enrolment in a course), local context (for example: consulting information about a course) and the final state of the session (whether its main goal was achieved).

Event Type. This dimension has just one hierarchy with one level and describes what happened on a page at a specific time (for example: open a page, refresh a page, click a hyperlink, enter data in a form).

Institution. Information about the academic institution associated with the web request is stored in this dimension.

2.3 Fact Tables

Each line in the Page Fact Table (see Figure 1) corresponds to a page served by the HEWIS. The session id degenerate dimension is used to group pages into sessions. The double connection to the User dimension is explained by a SIGARRA functionality that allows a user to act on behalf of another user (for example, course grades may be inserted by the faculty's secretary). The fact table has five measures: page time to serve (number of seconds taken by the web server to process all requests related to this page), page dwell (number of seconds the complete page is visible in the user's browser), page hits loaded (number of resources loaded for the presentation of the page), page bytes transferred (sum of the bytes loaded in all the resources related to this page) and page sequence number (the sequence number of this page in the overall session).
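A minimal sketch of this fact table follows, reusing the illustrative dimensions above; the remaining dimension keys are abbreviated and all names are assumptions, not the actual schema.

```sql
-- Minimal sketch of the Page fact table (names are assumptions).
-- Grain: one row per page served by the HEWIS.
CREATE TABLE fact_web_page (
  access_date_key    NUMBER NOT NULL,  -- FK to dim_access_date
  time_of_day_key    NUMBER NOT NULL,  -- FK to dim_time_of_day
  academic_date_key  NUMBER NOT NULL,  -- FK to dim_academic_date
  user_key           NUMBER NOT NULL,  -- FK to dim_user: the requesting user
  acting_user_key    NUMBER NOT NULL,  -- FK to dim_user: user acted on behalf of
  page_key           NUMBER NOT NULL,  -- FK to dim_page
  referrer_key       NUMBER NOT NULL,  -- FK to dim_referrer
  status_key         NUMBER NOT NULL,  -- FK to dim_http_status
  -- (keys to the remaining dimensions omitted for brevity)
  session_id         NUMBER NOT NULL,  -- degenerate dimension: groups pages into sessions
  -- measures
  page_time_to_serve      NUMBER,  -- seconds to process all requests for the page
  page_dwell              NUMBER,  -- seconds the complete page is visible in the browser
  page_hits_loaded        NUMBER,  -- resources loaded to present the page
  page_bytes_transferred  NUMBER,  -- bytes over all resources of the page
  page_sequence_number    NUMBER   -- position of the page within the session
);
```

In this sketch the double connection to the User dimension appears as two role-playing foreign keys (user_key and acting_user_key) pointing at the same dimension table.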
A line in the Session Fact Table (see Figure 2) records the occurrence of a session in the HEWIS. A session is a set of page accesses, in a single browser session, by the same user, with less than 30 minutes between consecutive requests. The double connection to the Page dimension allows the identification of the entry and exit pages of a session. The time-related dimensions are associated with the session's first request. The referrer dimension records the session's first referrer. The measures of this fact table are: session span (number of seconds between the first request and the complete load of the last request), session time to serve (number of seconds taken to serve all the requests in the session), session dwell (number of seconds of visibility of all the pages in the session), session pages loaded (number of pages in the session), session procedures loaded (number of distinct procedures in the session), session pages to authentication (number of pages until authentication; if there is no authentication, this measure equals session pages loaded) and session bytes transferred (number of bytes transferred in the session).

3 SIGARRA Case Study

Although SIGARRA is deployed at the institution level and is supported by several database and web servers, the similarities between the HEWIS structures in the several institutions and the nature of a data warehouse suggest the adoption of an architecture centralized at the university level for the data webhouse. A prototype of a data webhouse has been built to monitor SIGARRA's usage in UPorto's Engineering Faculty, the institution where it is most used. As SIGARRA uses Oracle as its database management system (DBMS), this was also the underlying DBMS used in the staging area and in the data webhouse. Both coexist on a single machine, independent of SIGARRA's machines.

A three-month period of clickstream data has been loaded into a data webhouse with the dimensional model described before. As expected, after the webhouse load the fact tables are the largest tables (the Page fact table has 8 607 961 records and the Session fact table has 984 848 records), followed by the Page (497 865 records), Referrer (461 832 records) and User Machine (202 898 records) dimensions. While the log files from the 3-month period needed almost sixteen gigabytes of space (15.68 GB), the data webhouse with usage data from the same period needs almost three gigabytes (2.57 GB), a meaningful reduction of 83.6%.

3.1 Extraction, Transformation and Loading

The ETL involves getting the data from where it is created and putting it into the data warehouse, where it will be used. The architecture defined for the ETL process has three types of data sources: clickstreams, SIGARRA's database and other sources. The first comes from the web server logs. SIGARRA's database is essential to gather information about the institutions, their internal organization (departments, sections, etc.), academic data (academic calendar, academic events, evaluation periods), the HEWIS application structure, users and other kinds of data (countries, councils, parishes, postal codes, etc.). The last data source includes data such as the IP ranges of each type of access (wireless, structured network, etc.), data relating IP addresses to geographical areas, domain names, HTTP status codes, and information on search engines, browsers, crawlers, platforms and operating systems.

The extraction phase is the first to occur. In this phase, all data is extracted from its source and is transferred to the staging area with a simple file transfer. Then, the web server logs must be joined, parsed and transferred to the staging area. Parsing is done by a Perl script that takes a web log file as input and generates a tab-delimited file with several fields; it includes host IP address resolution, URL and referrer parsing, identification of search engines, browsers, crawlers and operating systems, and cookie parsing. After the data is loaded into the staging area, the clickstream is processed with PL/SQL, using a relational database. This process involves IP address/country resolution and session, page and user processing. Session and user tracking is based on session cookies, so it is necessary to overcome the absence of cookies in first requests. A period of 30 minutes of inactivity leads to a new session, as proposed by several authors [5, 2, 13]. A change in the user associated with the session leads to the same result. User tracking must also deal with authentications that occur in the middle of a session.
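As an illustration of the 30-minute rule, a sessionization step of this kind could be expressed with SQL window functions roughly as follows. Table and column names are assumptions; the actual processing is done in PL/SQL and, as noted above, also restarts sessions when the user changes.

```sql
-- Illustrative sessionization sketch (names are assumptions): a gap of more
-- than 30 minutes between consecutive requests of the same cookie starts a
-- new session. request_time is assumed to be a TIMESTAMP. The real PL/SQL
-- process also starts a new session when the authenticated user changes
-- mid-cookie; that rule is omitted here for brevity.
WITH flagged AS (
  SELECT r.*,
         CASE
           WHEN LAG(r.request_time) OVER (PARTITION BY r.cookie_id
                                          ORDER BY r.request_time) IS NULL
             OR r.request_time
                - LAG(r.request_time) OVER (PARTITION BY r.cookie_id
                                            ORDER BY r.request_time)
                > INTERVAL '30' MINUTE
           THEN 1 ELSE 0
         END AS session_start
  FROM stg_requests r
)
SELECT f.*,
       -- a running count of session starts numbers the sessions per cookie
       SUM(f.session_start) OVER (PARTITION BY f.cookie_id
                                  ORDER BY f.request_time) AS session_seq
FROM flagged f;
```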
Dimensions are built with information from the tab-delimited file generated by the clickstream parsing and from SIGARRA. Fact tables are built after the dimensions, due to the dependencies between them. The webhouse loading is done by copying the data from the staging area to the posting schema. At the end, all records that belong to a closed session are deleted from the staging area. A session is closed if it has no requests in the last 30 minutes of a day (sessions going on near the end of the day may continue on the following day and must be processed by the next ETL iteration).
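Assuming the ETL iteration runs shortly after midnight over the previous day's log, this final purge could look roughly like the following sketch (names are assumptions).

```sql
-- Illustrative purge of closed sessions from the staging area (names are
-- assumptions). A session is closed when its last request happened before
-- the last 30 minutes of the day being processed; the ETL is assumed to run
-- shortly after midnight, so that day ends at TRUNC(SYSDATE).
DELETE FROM stg_requests
WHERE session_id IN (
        SELECT session_id
        FROM   stg_requests
        GROUP  BY session_id
        HAVING MAX(request_time) < TRUNC(SYSDATE) - INTERVAL '30' MINUTE
      );
```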
3.2 Data Analysis

The data analysis was carried out using Structured Query Language (SQL) and On-Line Analytical Processing (OLAP). Due to the star structure of the dimensional model, the queries were simple and had good performance. Data analysis led to a detailed characterisation of SIGARRA's usage in several categories according to the main user types (students, faculty, staff, anonymous users, crawlers). Some of the results are described next.

Time Related Analysis of Sessions and Pages. The average number of sessions per day is 10 942 and its distribution by user type is presented in Figure 3. Excluding crawlers, the average session time span is 10.89 minutes, the staff's sessions being the longest ones (Figure 4). The average number of pages accessed per day is 95 575 and its distribution by user type is presented in Figure 5. Excluding crawlers' sessions, the average number of pages per session is 7.7.

Fig. 3. Distribution of sessions by user type
Fig. 4. Session span in minutes by user type
Fig. 5. Distribution of pages by user type

Session Referrers. Of all sessions, 79.23% were direct entries and 15.50% originated in search engines, Google being the most used one (99.8% of all search engine sessions).

User Machines. There were 155 distinct access countries. Staff and faculty users access mainly from inside the institution and anonymous users from outside (Figure 6). Inside the institution, most accesses come from the structured network.

Fig. 6. Access type by user type

Access Agent. Windows is the most used platform. As can be seen in Figure 7, faculty also use Unix and Macintosh platforms, although on a much smaller scale. As can be seen in Figure 8, the most used browsers are MSIE (88.10%) and Firefox (7.16%). Firefox use is growing and the inverse is happening with MSIE. The crawler with the most sessions and page requests is Googlebot (47.86% of all crawler sessions and 86.89% of all crawler requests for pages).

Fig. 7. Platforms used by user type
Fig. 8. Browsers used by user type

Number of Sessions by User Profile. There were 8 004 distinct users in the analysed period: 7 169 students, 416 faculty and 252 staff users. The number of sessions is higher for users under 20 years of age, for students of undergraduate programmes and particularly in the first curricular years of undergraduate programmes.
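Results such as these come from simple star joins over the fact tables. For illustration, a query of this kind could look as follows, reusing the sketched schema above (all names are assumptions).

```sql
-- Illustrative star join (names are assumptions): average number of
-- sessions per day, broken down by user type.
SELECT u.user_type,
       ROUND(COUNT(*) / COUNT(DISTINCT d.full_date), 1) AS avg_sessions_per_day
FROM   fact_web_session f
JOIN   dim_user        u ON u.user_key        = f.user_key
JOIN   dim_access_date d ON d.access_date_key = f.access_date_key
GROUP  BY u.user_type
ORDER  BY avg_sessions_per_day DESC;
```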
HEWIS Navigation. The student, programme and institution modules are the most used ones. Students and anonymous users have similar preferences in the pages they view. Two of the main entry pages are the Dynamic Mail Files page (reached by following hyperlinks to files in dynamic e-mail received) and the Computer Labs first page (because it is loaded in the background of every lab computer). Authentication is mainly done in the home page's authentication area (64.59% of all authentications) and the main underlying motivations are access to Dynamic Mail Files, Legislation, Summaries and Courses. The Help module is mainly used by anonymous users.

Specific Pages Usage. The most used home page links are: authentication, search and programmes. The two undergraduate programmes most viewed by anonymous users are Electrical and Computers Engineering and Informatics and Computing Engineering, and the two master programmes most viewed by anonymous users are the MSc in Informatics Engineering and the MSc in Information Management. The main searches are related to students, staff and courses.

3.3 Recommendations

The analysis allowed the detection of some unusual access patterns: excessively long anonymous sessions (about 170 000 pages requested over 3 days) coming from a machine with an institution hostname, and abnormally long processing times on a specific day of the analysed period.

It also allowed the production of some improvement recommendations. A direct link to the study plan of each programme should be added to the programme lists (the frequent path programme list / programme page / study plan / course page suggests that the course page is too distant from the home page). The initial page of the student module should be used to provide information and communicate with students (it is the 6th most viewed page, especially by anonymous users and students). Help module usage by faculty should be stimulated (these users rarely use this module, preferring the phone). The complexity and usability of the pages from which accesses to the help most often occur should be analysed. Marketing strategies should be developed to promote the programmes with fewer page views. In order to minimise URL changes and 404 (Not Found) errors, URLs independent of the underlying technology should be used (79.12% of the 404 errors are direct entries into the HEWIS, which suggests the use of bookmarks with links broken by URL changes). The procedures where the 500 (Internal Server Error) status code occurred most often should be analysed.
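Recommendations like the last two can be derived from simple star queries over the page fact table; an illustrative sketch follows (names are assumptions, including the 'direct entry' referrer value).

```sql
-- Illustrative query behind the 404 recommendation (names are assumptions):
-- number of 404 responses per procedure and the share of them that are
-- direct entries. 'direct entry' is an assumed dimension value used for
-- requests with no referrer.
SELECT p.procedure_name,
       COUNT(*) AS errors_404,
       ROUND(100 * AVG(CASE WHEN r.referrer_domain = 'direct entry'
                            THEN 1 ELSE 0 END), 2) AS pct_direct
FROM   fact_web_page   f
JOIN   dim_http_status s ON s.status_key   = f.status_key
JOIN   dim_page        p ON p.page_key     = f.page_key
JOIN   dim_referrer    r ON r.referrer_key = f.referrer_key
WHERE  s.status_code = 404
GROUP  BY p.procedure_name
ORDER  BY errors_404 DESC;
```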
4 Conclusions and Future Work

Data webhouse systems are presented here as a solution to monitor the use of Higher Education Web Information Systems (HEWIS). This paper aims to extend the application and study of webhousing systems to the academic context. Despite the similarities between Web Information Systems, there are differences between HEWIS and e-commerce sites, the latter being frequently used for exemplifications and instantiations of data webhouses. While a HEWIS aims to register and archive higher education activity and has features adapted to this scope, the main intent of an e-commerce site is to sell products and a well-defined set of procedures is available (add to shopping cart, insert payment information, etc.). As they have different goals and scopes, the relevant information is different (for example, the Academic Date dimension does not make sense in an e-commerce webhouse but is very important in a HEWIS webhouse), which justifies a different dimensional model.

A dimensional model to monitor HEWIS usage was described. This model has been implemented in a data webhouse prototype to monitor the usage of UPorto's HEWIS. In the development of this prototype, an extraction, transformation and loading architecture was defined that, with adaptations to specific data sources, can be used in similar contexts.

The prototype proved the usefulness of data webhouses for WIS and, more specifically, for HEWIS. It allowed the generation of knowledge about SIGARRA's user behaviour, the detection of abnormal situations and the definition of a set of recommendations. It was also possible to verify that there is a significant reduction in the amount of disk space required to store web usage data, which encourages storing web usage data in a dimensional model. Furthermore, it demonstrated the analytic flexibility of data webhouses, an advantage when compared with other monitoring techniques. It also showed that queries executed on a star dimensional model over a meaningful amount of data have good performance.

Future work involves applying data mining techniques that allow user clustering based on navigation paths or preferences, the discovery of navigation patterns, the detection of sets of pages that are likely to occur together in the same session and user classification based on predefined parameters.

References

1. Jesper Andersen, Anders Giversen, Allan H. Jensen, Rune S. Larsen, Torben Bach Pedersen, and Janne Skyt. Analyzing Clickstreams Using Subsessions. In Proceedings of the 3rd ACM International Workshop on Data Warehousing and OLAP, pages 25-32. ACM Press, 2000.
2. Bettina Berendt and Myra Spiliopoulou. Analysis of Navigation Behaviour in Web Sites Integrating Multiple Information Systems. The VLDB Journal, 9(1):56-75, 2000.
3. M. S. Chen, J. S. Park, and P. S. Yu. Data Mining for Path Traversal Patterns in a Web Environment. In Proceedings of the 16th International Conference on Distributed Computing Systems (ICDCS '96), page 385. IEEE Computer Society, 1996.
4. Robert Cooley. The Use of Web Structure and Content to Identify Subjectively Interesting Web Usage Patterns. ACM Trans. Inter. Tech., 3(2):93-116, 2003.
5. Robert Cooley, Bamshad Mobasher, and Jaideep Srivastava. Data Preparation for Mining World Wide Web Browsing Patterns. Knowledge and Information Systems, 1(2), 1999.
6. Magdalini Eirinaki and Michalis Vazirgiannis. Web Mining for Web Personalization. ACM Trans. Inter. Tech., 3(1):1-27, 2003.
7. Karuna P. Joshi, Anupam Joshi, Yelena Yesha, and Raghu Krishnapuram. Warehousing and Mining Web Logs. In Proceedings of the Second International Workshop on Web Information and Data Management, pages 63-68. ACM Press, 1999.
8. Ralph Kimball and Richard Merz. The Data Webhouse Toolkit. John Wiley & Sons, Inc., 2000.
9. Ralph Kimball, Laura Reeves, Margy Ross, and Warren Thornthwaite. The Data Warehouse Lifecycle Toolkit. John Wiley & Sons, Inc., 1998.
10. Ron Kohavi. Mining e-commerce Data: The Good, The Bad, and The Ugly. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 8-13. ACM Press, 2001.
11. Richard Li and Jon Salz. Clickstream Data Warehousing. ArsDigita Systems Journal, 2000. Available from: http://www.eveandersson.com/arsdigita/asj/clickstream/ [cited 2005-09-11].
12. Brij M. Masand, Myra Spiliopoulou, Jaideep Srivastava, and Osmar R. Zaiane. WEBKDD 2002: Web Mining for Usage Patterns & Profiles. SIGKDD Explor. Newsl., 4(2):125-127, 2002.
13. Jaideep Srivastava, Robert Cooley, Mukund Deshpande, and Pang-Ning Tan. Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data. SIGKDD Explor. Newsl., 1(2):12-23, 2000.
14. Mark Sweiger, Mark R. Madsen, Jimmy Langston, and Howard Lombard. Clickstream Data Warehousing. John Wiley & Sons, Inc., 2002.
15. Tak Woon Yan, Matthew Jacobsen, Hector Garcia-Molina, and Umeshwar Dayal. From User Access Patterns to Dynamic Hypertext Linking. Computer Networks and ISDN Systems, 28(7-11):1007-1014, 1996.