UNIVERSITI TEKNOLOGI MARA FACULTY OF INFORMATION TECHNOLOGY AND QUANTITATIVE SCIENCE THE DEVELOPMENT AND EVALUATION OF CONFIGURABLE WEB USAGE ANALYZER BY NASRUL AZLI BIN AHMAD 2004633591 B. Sc (HONS) DATA COMMUNICATION AND NETWORKING
THE DEVELOPMENT AND EVALUATION OF CONFIGURABLE WEB USAGE ANALYZER

BY
NASRUL AZLI BIN AHMAD
2004633591

A final project submitted in partial fulfillment of the requirement for the B. Sc (HONS) DATA COMMUNICATION AND NETWORKING

A project paper submitted to
FACULTY OF INFORMATION TECHNOLOGY AND QUANTITATIVE SCIENCE
UNIVERSITI TEKNOLOGI MARA
MAY 2006

Approved by the Examining Committee:
Project Supervisor, EN MOHD FAISAL BIN IBRAHIM
Examiner, PROF. MADYA DR. SAADIAH BINTI YAHYA
DECLARATION

I hereby declare that the work in this project paper is my own except for those quotations and summaries, which have been acknowledged.

NASRUL AZLI BIN AHMAD
2004633591
ACKNOWLEDGEMENT

Alhamdulillah, with gratitude and blessings from Allah S.W.T, I have finally completed my project without major problems within the period given. Firstly, I am grateful that I have been given the strength by Allah S.W.T to complete this project.

I would like to take this opportunity to express my gratitude to my Supervisor, Encik Mohd Faisal bin Ibrahim, for his guidance, encouragement and support, which helped me a great deal in completing this project. I feel very lucky to have been under his supervision. He is very supportive and always gave me ideas to enhance this project.

I would also like to express my appreciation to Prof. Madya Dr. Saadiah binti Yahya for her ideas and moral support. Moreover, I am thankful to my project team members, especially Rohaniah binti Yusof, for helping me to accomplish this project successfully.

Lastly, thank you to all my friends for their cooperation in helping me to complete this project. Thank you.

APRIL 2006
NASRUL AZLI BIN AHMAD
ABSTRACT

Learning Web users' preferences and adapting the Web information structure to the users' behaviors can improve the effectiveness of a particular website. In addition, automatic knowledge extraction from Web server log files, page tags, network packets and cookies can be useful for identifying reading patterns and inferring user profiles in order to design a website suited to different groups of users. Therefore, we developed a Web Usage Analyzer System to extract useful, implicit and previously unknown patterns from the usage of the website. In our study, we analyzed an online course using web usage mining techniques by examining users' behavior in terms of site usage, the involvement of the most active users based on their navigational activities, and the number of visits throughout the semester of the course. By using this system, lecturers are able to analyze students' behavior in terms of site usage such as last date accessed, login name, time and visited page. Lecturers can also investigate the trends of the website in terms of popularity by identifying the most visited page. In conclusion, the Web Usage Analyzer System provides lecturers with features to investigate and analyze the users' behaviors, activities and performance.
TABLE OF CONTENTS

DECLARATION
ACKNOWLEDGEMENT
ABSTRACT
TABLE OF CONTENTS
LIST OF ABBREVIATIONS
LIST OF FIGURES
LIST OF TABLES

1.0 INTRODUCTION
1.1 PROBLEM BACKGROUND
1.2 PROBLEM STATEMENT
1.3 OBJECTIVES OF THE RESEARCH
1.4 SCOPE OF THE RESEARCH
1.5 SIGNIFICANCES OF THE RESEARCH
1.5.1 Significances to the Future Researchers
1.5.2 Significances to the Content Developer
1.5.3 Significances to the NBPL Users
1.6 OUTLINE OF THE THESIS

2.0 LITERATURE REVIEW
2.1 INTRODUCTION
2.2 DEFINITION OF THE COMMON TERMS
2.3 WEB USAGE MINING
2.4 WEB-BASED COLLABORATIVE LEARNING
2.5 WEB USAGE MINING PHASES
2.5.1 Preprocessing
2.5.1.1 Content Preprocessing
2.5.1.2 Structure Preprocessing
2.5.1.3 Usage Preprocessing
2.5.2 Pattern Discovery
2.5.2.1 Statistical Analysis
2.5.2.2 Association Rules
2.5.2.3 Clustering
2.5.2.4 Classification
2.5.2.5 Sequential Patterns
2.5.2.6 Dependency Modeling
2.5.3 Pattern Analysis
2.6 NETWORK BASED PROJECT LEARNING
2.7 RELATED WORKS
2.7.1 Optimization of an Online Course with Web Usage Mining
2.7.2 Web Usage Mining for a Better Web-Based Learning Environment
2.7.3 Towards Evaluating Learners' Behavior in a Web-Based Distance Learning Environment
2.7.4 Web Mining and Knowledge Discovery of Usage Patterns
2.7.5 Web Usage Mining: Discovery and Applications of Usage Patterns from Web Data
2.8 CONCLUSIONS

3.0 METHODOLOGY
3.1 INTRODUCTION
3.2 DATA COLLECTION METHODOLOGY
3.3 PROJECT METHODOLOGY
3.3.1 Phase 1: Project Planning / Preliminary Investigation
3.3.2 Phase 2: Analysis
3.3.2.1 Hardware Requirements
3.3.2.2 Software Requirements
3.3.3 Phase 3: Design
3.3.4 Phase 4: Development
3.3.5 Phase 5: Testing and Implementation
3.3.6 Phase 6: Maintenance
3.4 CONCLUSIONS

4.0 SYSTEM OVERVIEW AND ARCHITECTURE
4.1 INTRODUCTION
4.2 USER MODEL
4.2.1 Context Diagram
4.3 DEVELOPMENT
4.3.1 Modification to the Database
4.3.2 Modification to the Source Code
4.3.3 Site Statistics Pages
4.3.3.1 Page Tagging
4.3.3.1.1 Internal Page Tagging
4.3.3.1.2 External Page Tagging
4.3.3.2 Site Statistics Menu
4.3.3.3 Detailed Statistics
4.3.3.4 Visitor Tracking
4.3.3.5 Personal Information
4.4 CONCLUSIONS

5.0 RESULTS AND FINDINGS
5.1 INTRODUCTION
5.2 CONFIGURABLE RANGE OF DATE LINE CHART
5.3 SITE STATISTICS ANALYSIS
5.3.1 Users' Behavior
5.3.2 Users' Involvement
5.3.3 Trends of NBPL Website
5.3.4 Data Transfer Efficiency and Duration Time Analysis
5.3.4.1 Testing
5.3.4.2 Data Transfer Efficiency
5.3.4.3 Duration Time
5.3.4.4 Application- and Total Frames
5.4 CONCLUSIONS

6.0 CONCLUSIONS AND RECOMMENDATIONS
6.1 INTRODUCTION
6.2 CONCLUSIONS
6.3 RECOMMENDATIONS

REFERENCES
APPENDIX A
LIST OF ABBREVIATIONS

WWW     World Wide Web
NBPL    Network Based Project Learning
HTML    HyperText Markup Language
LIST OF FIGURES

Figure 2.1: Web mining categories (Cooley et al., 1997; Chakrabarti et al., 1999)
Figure 2.2: General Architecture for Web Usage Mining (R. Cooley, B. Mobasher, J. Srivastava, 1999)
Figure 2.3: Web Usage Mining sub steps (L. E. Akman, B. Akkan, N. Baykal, 2003)
Figure 2.4: Details of Web Usage Mining Preprocessing (R. Cooley, B. Mobasher, J. Srivastava, 1999)
Figure 2.5: Raw data format (L. E. Akman, B. Akkan, N. Baykal, 2003)
Figure 2.6: Sample Web Server Log (J. Srivastava, R. Cooley, M. Deshpande, P. Tan, 2000)
Figure 2.7: Preprocessed Web Log (L. E. Akman, B. Akkan, N. Baykal, 2003)
Figure 3.1: Project Methodology Diagram
Figure 3.2: Block Chart of the System
Figure 3.3: Raw data string based on file name
Figure 3.4: Raw data string
Figure 3.5: Data cleaning technique will discard the unwanted keywords
Figure 3.6: Location and section will be saved into the database based on visited page
Figure 3.7: Example of Statistical Analysis Output
Figure 3.8: Example of SQL commands in Pattern Analysis phase
Figure 4.1: Context Diagram of Web Usage Analyzer System
Figure 4.2: Relationship Between Database Tables Involved in Web Usage Analyzer System
Figure 4.3: File Tagging Page
Figure 4.4: File Tags Form
Figure 4.5: Files in the root folder (named as htdocs) located in the web server
Figure 4.6: Syntax that must be included in each file in order to apply the Internal Page Tagging
Figure 4.7: PHP coding in `tiki-stats_tag.php` which is included in every file in the root folder
Figure 4.8: Example of Internal Page Tagging
Figure 4.9: New field is generated automatically in the `tiki_stats_all_section` table
Figure 4.10: PHP code to retrieve filename
Figure 4.11: PHP coding in `ext_links_in.php`
Figure 4.12: Example of tagging a non-PHP file located in the root folder
Figure 4.13: Online Notes Menu option
Figure 4.14: Need to bypass `ext_links_in.php` file before loading the requested page
Figure 4.15: Redirect to `Copper_Media-Coaxial_Cable.htm` page
Figure 4.16: PHP coding in `ext_links_out.php`
Figure 4.17: Example of tagging an external link such as a website URL
Figure 4.18: External Links Menu option
Figure 4.19: Need to bypass `ext_links_out.php` file before loading the requested page
Figure 4.20: Redirect to `www.uitm.edu.my` page
Figure 4.21: Site Statistics Menu Options
Figure 4.22: Options for Detailed Stats Submenu
Figure 4.23: Detailed Stats Output for File Galleries Section
Figure 4.24: Popup window for viewing the output in spreadsheet format
Figure 4.25: Detailed Stats output viewed in Microsoft Excel
Figure 4.26: Redirect to another page by clicking the hyperlink
Figure 4.27: Detailed information for individual NBPL user only
Figure 4.28: Visitor Tracking Page
Figure 4.29: Personal Information Details for All NBPL Users
Figure 4.30: Student Registration Form
Figure 5.1 (a) and (b): Example of Line Chart from 6th March to 24th April 2006
Figure 5.2: Example of Total Hits as of 9th April 2006
Figure 5.3: Detailed Statistics of NBPL Users For `File Galleries`
Figure 5.4: Total Hits For Selected User
Figure 5.5: Total Hits for All NBPL Users
Figure 5.6: Detailed Statistics For One of the Most Active
Figure 5.7: Detailed Statistics For One of the Most Inactive User
Figure 5.8: Daily Chart
Figure 5.9: Weekly Charts
Figure 5.10: Bandwidth and Latency
Figure 5.11: Data Transfer Efficiency to Load Statistics Pages
Figure 5.12: Data Transfer Efficiency for Viewing Different Chart Types
Figure 5.13: Duration Time to Load Site Statistics Page
Figure 5.14: Duration Time for Exporting Data to Ms Excel
Figure 5.15: Number of Frames Sent for Each Page in Windows 2003 Server
Figure 5.16: Number of Frames Sent for Each Page in Linux Server
LIST OF TABLES

Table 4.1: Table `tiki_stats_all_section`
Table 4.2: Table `tiki_stats_file_download`
Table 4.3: Table `tiki_stats_files_tag`
Table 4.4: Table `tiki_stats_logs`
Table 4.5: Table `tiki_stats_logs_day`
Table 4.6: Table `tiki_stats_poll`
CHAPTER I
INTRODUCTION

1.1 PROBLEM BACKGROUND

With the rapid evolution of information technology, the World Wide Web (WWW) has become the most important medium for collecting, sharing and distributing information to anyone, anytime and anywhere. According to Y. Wang (2000), the Web is a huge, diverse, dynamic, explosive and mostly unstructured data repository that supplies an incredible amount of information.

Education is one of the disciplines where web-based technology has been rapidly and successfully adopted. In this field, it may support many of the activities that occur in the classroom. It also provides and facilitates communication and feedback between users. These two elements are essential to an effective online learning environment. Furthermore, it supports various styles of learning such as collaborative learning, discussion-led learning, student-based learning, student-centered learning and resource-based learning. Besides providing flexibility of time and place, it can also accommodate the increasing number of students who use it, and it allows the available resources to be shared and reused.

Learning Web users' preferences and adapting the Web information structure to the users' behaviors can improve the effectiveness of a particular website. In addition, automatic knowledge extraction from Web server log files, page tags, network packets and cookies can be useful for identifying reading patterns and inferring user profiles in order to design a website suited to different groups of users. Therefore, Web mining can be useful in addressing these problems.
Web Usage Mining is one of the techniques that fall under Web mining. This research discusses the technique in depth and adapts it for a web-based collaborative learning website, or even for a web-based portal. In short, Y. Wang (2000) describes Web Usage Mining as the technique of predicting user behavior while users interact with the Web. In this category, information sources such as Web server logs must be processed in several steps before the hidden patterns can be discovered. Furthermore, Web Usage Mining also interacts with Web Content Mining and Web Structure Mining, with the clustering process of pattern discovery acting as a bridge.

1.2 PROBLEM STATEMENT

Several commercial Web log analysis tools are currently available, but it is difficult to find tools that are suitable for analyzing raw Web log data in order to retrieve significant and useful information. Moreover, most of the available tools are considered too slow, inflexible, expensive, difficult to maintain or very limited in the results that they can actually produce (D. Batista, M. J. Silva, 2001). Examples of the Web usage mining tools available today include WebSIFT, SpeedTracer, WebLogMiner, Shahabi, SurfAid, Analog and many more.

Furthermore, the web-based collaborative learning systems that are currently available lack a close student-lecturer relationship. Most of them do not provide suitable tools that allow lecturers to keep track of and assess all the activities performed by their students. Lecturers cannot evaluate the course content provided or the effectiveness of the learning process. From the students' point of view, they cannot communicate their problems and deficiencies in a natural way (M. E. Zorrilla, E. Menasalvas, D. Marin, E. Mora, J. Segovia, 2005).
The main objective of this project is therefore to extract valuable patterns from Web access log data containing the behavior characteristics of the users, which can be used to improve the web-based collaborative learning environment or to help in learning evaluation.
1.3 OBJECTIVES OF THE RESEARCH

To guide the successful completion of this research, several objectives have been determined and defined precisely. The objectives of this project are as follows:

a. to analyze users' behavior in terms of site usage such as last date accessed, login name, time and visited page.
b. to keep track of and trace the involvement of the frequent (most active) users from their navigational activities through the whole website.
c. to investigate the trends of the website, that is, which parts are more attractive to the users, so that the Web contents or features can be developed to fulfill the users' wants when surfing the website.
d. to analyze the data transfer efficiency and the duration time to load a page on both the Windows 2003 Server and Linux Server platforms.

1.4 SCOPE OF THE RESEARCH

This research focuses on the Network Based Project Learning (NBPL) community as an initial platform. The respondents of this project are those who have registered with NBPL. Their activities act as the input to, and contribute towards, the Web Usage Mining analysis.

1.5 SIGNIFICANCES OF THE RESEARCH

This research is significant to several parties, especially future researchers, the content developer and the users themselves.
1.5.1 Significances to the Future Researchers

Since this research focuses on NBPL, the Web Usage Mining technique can be further enhanced and applied to the larger scope of web-based collaborative learning disciplines or to any related transactional activities carried out via a website. For instance, the Web Usage Mining technique can be applied to a Learning Management System (LMS), Course Management System (CMS), Enterprise Learning Management (ELM) system or Learning Content Management System (LCMS).

1.5.2 Significances to the Content Developer

The Web Usage Mining technique is also beneficial to the content developer, because it allows them to carry out analyses that support the management of the NBPL website for continuous improvement. This can be part of the research value of this study. The content developer can continuously improve and enrich the content of the website based on users' interests so that the structure of the course content is effective for web-based collaborative learning purposes. The content developer can analyze the page and user statistics through the charts that are developed in this project.

1.5.3 Significances to the NBPL Users

By applying the Web Usage Mining technique, the content of the website becomes more beneficial to the users and can fulfill their needs. Each interaction of the users becomes an input to the content developer. For example, Web Usage Mining can make use of user information such as user id, date and time accessed and visited page. In addition, the users will enjoy surfing the website since the contents can build their interest.
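To make the kind of analysis described above more concrete, the following is a minimal sketch in Python of how usage records consisting of a user id, an access date and time, and a visited page could be summarised into hits per user, hits per page and the last access of each user. It is only an illustration under stated assumptions and is not the implementation used in this project: the record layout, the function name `summarize` and all sample values are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Hypothetical usage records: (login name, access date and time, visited page).
# The names and values are made up for illustration only.
VISITS = [
    ("ali", "2006-03-06 09:15:00", "index.php"),
    ("siti", "2006-03-06 10:02:00", "notes.php"),
    ("ali", "2006-03-07 14:30:00", "notes.php"),
    ("ali", "2006-03-08 08:45:00", "forum.php"),
    ("siti", "2006-03-09 16:20:00", "notes.php"),
]


def summarize(visits):
    """Count hits per user and per page, and record each user's last access."""
    hits_per_user = Counter()
    hits_per_page = Counter()
    last_access = {}
    for login, stamp, page in visits:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        hits_per_user[login] += 1
        hits_per_page[page] += 1
        # Keep only the most recent access time seen for this user.
        if login not in last_access or when > last_access[login]:
            last_access[login] = when
    return hits_per_user, hits_per_page, last_access


hits_per_user, hits_per_page, last_access = summarize(VISITS)

# The most active user and the most visited page are the top entries.
most_active_user = hits_per_user.most_common(1)[0][0]
most_visited_page = hits_per_page.most_common(1)[0][0]
```

In this sketch the most active user and the most visited page are simply the entries with the highest hit counts, which is the simplest possible measure of involvement and popularity.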
1.6 OUTLINE OF THE THESIS

This thesis consists of six chapters. This chapter highlights the background of the project, including the problem statement, the objectives to be achieved, and the scope and significances of the research. The next chapter, Literature Review, gives clear definitions of some common terms used throughout this project and also discusses related works. Chapter Three, Methodology, explains the phases and methods followed in completing this project. The overview and architecture of the system are explained in Chapter Four, System Overview and Architecture. The findings and results gained through this project are discussed in depth in Chapter Five. Finally, Chapter Six provides conclusions, suggestions and recommendations, either to improve NBPL or as a reference for future researchers.
CHAPTER II
LITERATURE REVIEW

2.1 INTRODUCTION

This chapter defines some common terms that are used throughout this project for a better understanding. In addition, this chapter also covers previous works that are related to this project.

2.2 DEFINITION OF THE COMMON TERMS

The title of this research is The Development and Evaluation of Configurable Web Usage Analyzer. As defined by The American Heritage Dictionary of the English Language (2004), the word "development" means the act of developing, or the determination of the best techniques for applying a new device or process to the production of goods or services. The word "evaluation" can be defined as an activity that assesses the effectiveness of an ongoing program in achieving its objectives, relies on the standards of project design to distinguish a program's effects, and aims at program improvement through a modification of current operations. Meanwhile, the word "configurable" can be understood as able to be designed, arranged, set up or shaped with a view to specific applications or for a particular purpose, depending on the user's requirements. The word "analyzer" refers to any of various instruments used for performing an analysis, as interpreted by WordNet.
The American Heritage Dictionary of the English Language (2004) defines the word "technique" as the systematic procedure by which a complex or scientific task is accomplished. In addition, according to WordNet, the word "technique" also means a practical method applied to some particular task, or skillfulness in the command of fundamentals deriving from practice and familiarity.

2.3 WEB USAGE MINING

The word "Web" is a short form of World Wide Web, which can be defined as a system of Internet servers that support specially formatted documents. The documents are formatted in a script called HyperText Markup Language (HTML) that supports links to other documents, as well as graphics, audio, and video files. This means that one document can be linked to another simply by clicking on hot spots. Not all Internet servers are part of the World Wide Web (Glossary of Library Terms, University at Buffalo Libraries, 2005).

Meanwhile, the word "usage" can be simply defined as use: to cause to act or to serve for a purpose, or as an instrument or material. The word "mining" means extracting something useful or valuable from a baser substance. In the context of this research, the terms web mining and data mining can also be used, since they carry the same meaning and support the title of this study.

According to XSB Inc, data mining is the process of autonomously extracting useful information or knowledge ("actionable assets") from large data stores or sets. Data mining can be performed on a variety of data stores, including the World Wide Web, relational databases, transactional databases, internal legacy systems, PDF documents, and data warehouses.

Web mining, in turn, is the application of data mining techniques to automatically discover and extract useful information from World Wide Web documents and services (O. Etzioni, 1996). L. E. Akman, B. Akkan and N. Baykal (2003) define web mining as discovering, analyzing and processing information from the World Wide Web.
Furthermore, the word "website" means a collection of interlinked documents on a Web server (Glossary of Library Terms, University at Buffalo Libraries, 2005) on a particular subject, including a beginning file called a home page. Other pages on the site can be reached, directly or indirectly, from the home page (Starr Sites, 1999).

Etzioni was the first to introduce the term Web mining, in his research paper (O. Etzioni, 1996). He questioned whether it is practical to mine Web data, and he stated that Web mining could be divided into three processes, which are Web Content Mining, Web Structure Mining and Web Usage Mining (also known as Web Log Mining).

Figure 2.1: Web mining categories (Cooley et al., 1997; Chakrabarti et al., 1999)
Figure 2.2: General Architecture for Web Usage Mining (R. Cooley, B. Mobasher, J. Srivastava, 1999)

The main idea of Web mining is to meet the Web users' needs. Briefly, according to Y. Wang (2000), Web Content Mining concentrates on the discovery or retrieval of useful information from Web contents, data or documents. Meanwhile, Web Structure Mining describes the technique of modeling the underlying link structures of the Web. Lastly, he defines Web Usage Mining as the technique that discovers the users' usage patterns and attempts to predict the users' behaviors while they interact with the Web.

L. E. Akman, B. Akkan and N. Baykal (2003) stated in their research that Web Usage Mining processes the usage data by extracting the information in the Web logs to discover the hidden patterns. The Web log provides a raw trace of the learners' navigation and activities on the site, as stated by O. R. Zaiane (2002). Web log data are relatively poor and unstructured, and they also contain erroneous and irrelevant entries.

In the context of a learning environment, the discovery of patterns from navigation history by Web Usage Mining can reveal the learners' navigation behavior and the efficiency of the models used in the online learning process, besides helping to evaluate the learners' activities. These patterns cannot be extracted with common statistical analysis alone (O. R. Zaiane, J. Luo, 2001).
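As an illustration of why raw Web log data must be cleaned before any patterns can be discovered, the following is a minimal sketch in Python that parses log lines and discards erroneous and irrelevant entries. It is not taken from the works cited above and is not the implementation used in this project. It assumes that the log follows the widely used Common Log Format, that requests for images, style sheets and scripts are irrelevant, and that only successful (2xx) requests are of interest. The names `LOG_PATTERN`, `IRRELEVANT_SUFFIXES` and `clean_log` and the sample lines are hypothetical.

```python
import re

# Assumed layout (Common Log Format):
# host ident user [time] "method path protocol" status size
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)

# Assumption: requests for these file types say nothing about page visits.
IRRELEVANT_SUFFIXES = (".gif", ".jpg", ".png", ".css", ".js", ".ico")


def clean_log(lines):
    """Return parsed entries, dropping malformed, irrelevant and failed requests."""
    kept = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # erroneous entry: does not have the expected fields
        entry = match.groupdict()
        # Ignore the query string when checking the file type.
        path = entry["path"].split("?", 1)[0].lower()
        if path.endswith(IRRELEVANT_SUFFIXES):
            continue  # irrelevant entry: embedded resource, not a page visit
        if not entry["status"].startswith("2"):
            continue  # failed or redirected request
        kept.append(entry)
    return kept


# Made-up sample lines for illustration only.
SAMPLE = [
    '10.0.0.5 - ali [06/Mar/2006:09:15:00 +0800] "GET /index.php HTTP/1.1" 200 5120',
    '10.0.0.5 - ali [06/Mar/2006:09:15:01 +0800] "GET /img/logo.gif HTTP/1.1" 200 812',
    '10.0.0.7 - siti [06/Mar/2006:10:02:00 +0800] "GET /notes.php?id=3 HTTP/1.1" 200 7340',
    '10.0.0.9 - - [06/Mar/2006:10:05:00 +0800] "GET /missing.php HTTP/1.1" 404 210',
    'corrupted line without the expected fields',
]

cleaned = clean_log(SAMPLE)
```

Of the five sample lines, only the two successful page requests remain after cleaning. Which entries count as irrelevant depends on the site being analyzed, so the suffix list and the status filter should be treated as adjustable assumptions.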
Many sophisticated web-based learning environments have been developed and are in use around the world, but very little has been done to automatically discover access patterns in order to understand learners' behavior in web-based distance learning (O. R. Zaiane, 2002).

2.4 WEB-BASED COLLABORATIVE LEARNING

The Web-based environment creates new possibilities to support and enhance communication within the lecturer-student community, while retaining the familiar face-to-face classroom interaction as one of the essential aspects of a learning process (K. H. Vat, 2001). Nowadays, web-based collaborative learning environments are benefiting from the rise of communication and information sharing services. However, the mere fact of setting up an environment for students and lecturers does not guarantee mutual collaboration or successful student learning (E. Gaudioso, O. C. Santos, A. Rodríguez, J. G. Boticario, 2004).

Collaborative learning is the idea that small, interdependent groups of students work together as a team to help each other learn. Small learning groups therefore play a very important role in the collaborative learning process, especially in a web-based collaborative learning environment (J. Zhao, D. McConnell, 2001).

In addition, B. L. Smith and J. T. MacGregor (1992) state that collaborative learning represents a significant shift away from the typical teacher-centered or lecture-centered milieu in college classrooms. In collaborative classrooms, the lecturing, listening or note-taking process may not disappear entirely, but it lives alongside other processes that are based on students' discussion and active work with the course material.