Hybrid Website, Based on Web Usage Mining Technique and Using Association Rules

Ali Seyed Shirkhorshidi, Hema Latha Krishna Nair, Yahya AL-Murtadha
Asia Pacific University of Technology and Innovation, Kuala Lumpur, Malaysia
{shirkhorshidi_ali@yhoo.co.uk, hema.krishna@ucti.edu.my, yahy.murtadha@ucti.edu.my}

Abstract: With the explosive growth of the World Wide Web, websites are getting richer every day, and some now contain hundreds of web pages and a large amount of content. As a result, web development is becoming more complicated and needs to improve in order to provide better service. Previous research shows that web log mining is a powerful way to mine visitors' browsing behavior patterns. Because these patterns reveal users' preferences, they can also be employed to improve the usability of a website and the satisfaction and loyalty of its users. This article introduces the hybrid website, based on a web mining technique called association rules, as a new approach to website design and development. "Hybrid" refers to the idea of designing a website with two separate portions: a dynamic portion and a static portion. The static portion presents a static structure prepared by administrators, while the dynamic portion responds to users' real-time behavior. The new framework empowers a website to interact with users and to reshape and redesign its dynamic portion based on their browsing behavior. Such a website acts like a human: as time goes by it becomes wiser through experience, and as it gains more data about its users it serves them better according to their preferences.

Keywords: Hybrid website; Web mining; Website usability

Figure 1: Traditional website system

1. Introduction

The World Wide Web (WWW) continues to grow at an amazing rate as an information gateway.
There is ample information available on websites; however, finding the desired information is a problem that users constantly face, and sometimes they get lost among the many web pages. Visitors want to find the information they need rapidly, so rival websites compete fiercely to present their information with a high level of usability in order to satisfy users and gain their loyalty. The International Organization for Standardization (ISO) defines usability as the "extent to which a product can be used by specified users to achieve specified goals with effectiveness, efficiency and satisfaction in a specified context of use." [1] For web development this definition can be simplified as follows: a usable website should present information effectively so that visitors can use it efficiently and with satisfaction.

The shape of an ordinary website is shown in Figure 1, and Figure 2 shows the online section of the proposed system. Although the whole proposed system is shown in Figure 7, the online section is also shown here separately to simplify the comparison between what exists and what is proposed.

Figure 2: Proposed website system

In a traditional website, a user clicks on a link, the request is sent to the server, and the web page is returned to the user. How the user browses the website makes no difference, and the design of a web page is the same for everyone. For example, no matter how or when a user visits a web page, as long as the designers have not changed it, it will always have the same shape, with the same ordering and design of information. The system has no intelligence to recognize user preferences and behavior; its design is static and does not adapt to users.

As studies have shown, 80% of total maintenance expenses are related to users' problems with the system rather than to technical bugs, and 64% of those are usability problems. [2] [3] Websites differ from other applications. Users may have limited choices when selecting software applications; in some cases they must pay to use them, and changing the software costs extra. In contrast, numerous websites are available in each category for users to visit at no expense. So as soon as users have difficulty finding the desired information on the current website, they will abandon it and head to a competitor's website. [4] [5] The view of usability expressed by Krug captures all of the above concerns in one simple sentence: "Don't make me think." [6]

Ratschiller and Gerken believe that the key to better usability is user-centered development. They present some key principles for user-centered development: [7] [8]

- Directly involving users in the design process
- Early and continuing evaluation of the website
- Iterative design and development

User-centered design, web development, and web usability are tightly related. [9] [10] Website development needs a careful decision-making process to discover the best way of presenting information to users. In fact, nobody can communicate the preferred arrangement of information better than the users themselves. One of the most important concerns is how to understand users' preferences. Web usage mining is an application of data mining which aims to discover users' browsing behavior patterns. [11] [12] [13] [14] [15] [16] Previous research has mostly addressed the applicability and efficiency of web usage mining as a way to help administrators make better decisions based on users' browsing behavior. The notable issue is that these patterns must be deployed by administrators, which is a time-consuming process.
In this paper, by proposing a hybrid website with two separate sections, we present a model to reduce this time. The first, the static section, is controlled and designed by administrators and protects the main structure of the website with which users are familiar. The second, the dynamic section, is controlled by a system that uses web usage mining to present information based on the latest patterns discovered from the web log file. The following section gives a brief literature review of web mining. Section 3 explains association rules and, along the way, discusses how the proposed system works. Section 4 discusses the solution and the proposed model, and conclusions are presented in Section 5.

2. Previous works

The aim of web mining and data mining is to search for meaningful knowledge in web logs and databases. Databases used in data mining are far more structured than web log files, so web mining needs extensive effort to preprocess log files before using them. [17] Web mining is classified into three main categories: web content mining, web structure mining, and web usage mining. [18]

Figure 3: Web mining classification

Web content mining aims to extract knowledge from the content of a web document or its description. In recent years this approach has been broadly used in search engines to improve their efficiency and accuracy, by introducing agent-based mining and database-driven mining in place of the traditional search method. Agents are intelligent software that steer the search towards more relevant web content by making intelligent use of user profiles. [17]

Web structure mining is an attempt to find personalized information from pages by analyzing the hyperlinks in and between the pages. [14] Page rank and hyperlink analysis also belong to this category. The goal of web structure mining is to generate a structured summary of the website and its web pages.
It is common for both web content mining and web structure mining to be used in one application. [15] Web usage mining is another approach in web mining; it uses log files to mine users' browsing behavior and serves different purposes such as site construction, adaptation, management, marketing and user modeling. [15] Web usage mining can be defined as the process of applying data mining techniques to web data in order to discover what users are interested in, based on their browsing behavior. [19] This paper concentrates on web usage mining and on using association rules to add interactivity to websites, so that a website can understand users' preferences from their browsing patterns and consequently propose the pages they are most likely to want to visit.

Previous research shows that web usage mining can be deployed to discover users' behavior patterns from server log files. Kharwar and his colleagues proposed basic association rules to optimize the content of the server log data. They discussed the Apriori algorithm in terms of speed and memory usage and concluded that it is an efficient approach for market basket analysis, website navigation analysis, homeland security, and so on. [11] Kumar and Rukmani carried out similar research on the Apriori and FP Growth algorithms. They described how to deploy these algorithms in a system for web usage mining and what their strengths and weaknesses are. [13] The use of web log files for web usage mining is explained by Gao in his article "Research on client behavior pattern recognition system based on web log mining". He goes into more detail about the system and explains how to design it. [20]

Web mining techniques have now emerged as an important research area for helping web users find the information they need. Subsequent research has mined server log files to generate accurate user-preferred patterns in order to develop recommendation systems for website administrators, for marketing and for targeted advertising. In many cases, after warehousing and cleaning the log files, association rules have been used for pattern recognition; the system proposed by these research works is illustrated in Figure 4. [11] [20] The primary data for the data mining process is provided by a database that warehouses the server log file. Different types of data are stored in the server log file, such as IP address, request time, requested file URL and transmitted bytes. [20] The main responsibility of the pre-processing stage is data selection and standardization, so as to form an analytical model applicable to the data mining process. Once the data has been cleaned and prepared, patterns are generated by data mining techniques. These patterns can facilitate administrators' decision making by identifying users' preferences, so that administrators can change the website according to users' needs in order to increase user satisfaction and loyalty.

Figure 4: Pattern discovery system

Although this system provides more knowledge about users, it has no direct and instant effect on their browsing experience. All of its benefits depend on administrators' decisions, and since decision making and change management are time-consuming processes, it takes a long time for users to feel and benefit from them. The most important problem is that this system works offline and needs a human decision-making process. These studies and other research show the potential of web usage mining and demonstrate its efficiency in terms of speed and time.
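The pre-processing stage described above, which selects fields such as IP address, request time and requested URL and standardizes them for mining, can be illustrated with a minimal sketch. Everything specific here is an assumption rather than something the cited systems prescribe: the log lines are taken to be in Common Log Format, requests for static resources are dropped by file suffix, and a visitor's session is closed after 30 minutes of inactivity. The names `parse_line` and `build_transactions` are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Assumed Common Log Format: host ident user [time] "METHOD url PROTO" status bytes
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)
# Assumption: these suffixes mark resources that are not page views.
IGNORED_SUFFIXES = (".css", ".js", ".png", ".jpg", ".gif", ".ico")
# Assumption: a gap longer than this starts a new session for the same IP.
SESSION_GAP = timedelta(minutes=30)


def parse_line(line):
    """Return (ip, time, url) for a successful page request, else None."""
    match = LOG_LINE.match(line)
    if match is None:
        return None  # malformed line: cleaned away
    if match["method"] != "GET" or not match["status"].startswith("2"):
        return None  # keep only successful page fetches
    if match["url"].lower().endswith(IGNORED_SUFFIXES):
        return None  # drop images, style sheets and scripts
    when = datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z")
    return match["ip"], when, match["url"]


def build_transactions(lines):
    """Group cleaned requests into per-session sets of visited URLs."""
    hits = sorted(h for h in map(parse_line, lines) if h is not None)
    transactions = []
    open_session = {}  # ip -> (time of last request, index into transactions)
    for ip, when, url in hits:
        last = open_session.get(ip)
        if last is None or when - last[0] > SESSION_GAP:
            transactions.append(set())
            index = len(transactions) - 1
        else:
            index = last[1]
        transactions[index].add(url)
        open_session[ip] = (when, index)
    return transactions
```

Each resulting set of URLs plays the role of one transaction for the pattern discovery step; a real deployment would probably also need to handle proxies, robots and authenticated users, which this sketch ignores.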
Even though such studies show the technical side of web mining, they do not propose an integrated solution for its application that can have a direct impact on website users. The new framework proposed in this article tries to fill this gap by combining two separate systems, the online website system and the offline pattern discovery system, into a single integrated system. This study aims to propose a new framework that makes website systems more intelligent, so that they understand their users' preferences and behaviors and respond to different user behavior with a distinctive website design. The proposed framework is based on web log mining.

3. Association Rules and Proposed System

Association rules are a data mining technique that is broadly used in marketing intelligence and market basket analysis. [17] This study, however, aims to use association rules in a different way: to improve users' browsing experience by predicting and proposing web pages they might want to see. Existing systems consider the relatedness of contents and help users see information relevant to what they are visiting; in fact they are content-based systems, not user-based ones. Association rules are able to recognize and predict users' preferences and to propose web pages based on users' behavior patterns. A brief explanation of association rules and their use in log file analysis can clarify their strength. An example association rule generated from a website log file could be:

URL1 → URL2 (support 20%, confidence 90%)

The support of the rule means that 20% of users visit both URL1 and URL2, and the confidence means that those who visit URL1 also visit URL2 90% of the time. Liu, in his book Web Data Mining, clearly defines concepts of association rules such as support and confidence. These definitions are given in the following paragraphs.

If we consider $I = \{i_1, i_2, i_3, i_4, \ldots\}$ as a set of items and $T = (t_1, t_2, t_3, t_4, \ldots)$ as a set of database transactions, then an association rule can be represented as $X \rightarrow Y$, where $X \subset I$, $Y \subset I$, and $X \cap Y = \emptyset$. $X$ (or $Y$) is a set of items, called an itemset. [21] The support count of $X$ in $T$ (denoted by $X.\mathrm{count}$) is the number of transactions in $T$ that contain $X$. The strength of a rule is measured by its support and confidence.

Support: The support of a rule $X \rightarrow Y$ is the percentage of transactions in $T$ that contain $X \cup Y$. It can be seen as an estimate of the probability $\Pr(X \cup Y)$. The support of the rule determines how frequently the rule is applicable in the transaction set $T$. If $n$ is the number of transactions in $T$, the support of the rule $X \rightarrow Y$ is computed as follows:

$$\mathrm{support} = \frac{(X \cup Y).\mathrm{count}}{n}$$

Confidence: The confidence of a rule $X \rightarrow Y$ is the percentage of transactions in $T$ containing $X$ that also contain $Y$. It is an estimate of the conditional probability $\Pr(Y \mid X)$. It is computed as follows:

$$\mathrm{confidence} = \frac{(X \cup Y).\mathrm{count}}{X.\mathrm{count}}$$

Confidence determines the predictability of the rule. If the confidence of a rule is too low, one cannot reliably infer or predict $Y$ from $X$.
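The two measures defined above can be checked on a small example. The following is a minimal sketch that represents each transaction as a Python set of page names; the five sample sessions and the function names are hypothetical and serve only to illustrate the definitions of support count, support and confidence.

```python
def support_count(itemset, transactions):
    """X.count: number of transactions that contain every item of X."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(x, y, transactions):
    """Fraction of all transactions that contain X union Y."""
    return support_count(set(x) | set(y), transactions) / len(transactions)


def confidence(x, y, transactions):
    """Fraction of the transactions containing X that also contain Y."""
    x_count = support_count(x, transactions)
    if x_count == 0:
        return 0.0  # the rule never applies, so nothing can be inferred
    return support_count(set(x) | set(y), transactions) / x_count


# Hypothetical sessions, one set of visited pages per transaction.
SESSIONS = [
    {"URL1", "URL2"},
    {"URL1", "URL2", "URL3"},
    {"URL1", "URL4"},
    {"URL2", "URL3"},
    {"URL1", "URL2", "URL5"},
]

print(support({"URL1"}, {"URL2"}, SESSIONS))     # 3 of 5 sessions -> 0.6
print(confidence({"URL1"}, {"URL2"}, SESSIONS))  # 3 of 4 URL1 sessions -> 0.75
```

In this sample, URL1 and URL2 occur together in three of the five sessions, and URL1 occurs in four sessions, so the rule URL1 → URL2 has support 60% and confidence 75%.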

Now, applying this to the website discussion: if the website has 5 web pages, the set I can be written as:

I = {URL1, URL2, URL3, URL4, URL5}

and the rules can be represented as:

URL1 → URL2
URL1 → URL3
URL1 → URL4
URL1 → URL5
URL2 → URL3
URL2 → URL4
URL2 → URL5
URL1, URL2 → URL3
URL1, URL2, URL3 → URL4
URL1, URL2, URL3 → URL5

Other rules can be generated from the set I, but these are enough for the explanation, so the others are not shown here. Association rule mining produces a support and a confidence for each rule. These values are generated from previous users' behavior: the system analyzes how previous users behaved and produces the support and confidence percentages that represent each rule's value. For example, consider a user who is visiting URL1 now. Based on the above rules, let us assume that association rule mining produces the following support and confidence:

URL1 → URL2 (support 40%, confidence 80%)
URL1 → URL3 (support 10%, confidence 50%)
URL1 → URL4 (support 60%, confidence 80%)
URL1 → URL5 (support 20%, confidence 60%)

If the system wants to propose the 2 most probable pages that the user might like to visit, then based on the above rules it will propose URL2 and URL4 to the new user, because these two rules have the highest confidence. One of the strengths of this proposing system is that it does not depend on users' profile data, so it works for very new users, even those who are visiting the website for the first time. The system analyzes the whole set of previous users to find the patterns that are similar to the new user's visiting pattern, and on that basis it proposes the web pages that are most likely to be the new user's preferred pages. With each click the user changes the rule, so the system updates its proposals based on the new rule.
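The selection of the two most probable pages in the example above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the rule values are the illustrative figures from the text, the dictionary layout and the function name `recommend` are hypothetical, and breaking ties in confidence by support is an assumption, since the text does not say how ties are resolved.

```python
def recommend(visited, rules, k=2):
    """Return the k consequents of the rules whose antecedent matches the visited pages."""
    visited = frozenset(visited)
    candidates = [
        rule for rule in rules
        if frozenset(rule["antecedent"]) == visited
        and rule["consequent"] not in visited
    ]
    # Assumption: rank by confidence first, then by support to break ties.
    candidates.sort(key=lambda r: (r["confidence"], r["support"]), reverse=True)
    return [rule["consequent"] for rule in candidates[:k]]


# Illustrative rule values taken from the example in the text.
RULES = [
    {"antecedent": {"URL1"}, "consequent": "URL2", "support": 0.40, "confidence": 0.80},
    {"antecedent": {"URL1"}, "consequent": "URL3", "support": 0.10, "confidence": 0.50},
    {"antecedent": {"URL1"}, "consequent": "URL4", "support": 0.60, "confidence": 0.80},
    {"antecedent": {"URL1"}, "consequent": "URL5", "support": 0.20, "confidence": 0.60},
]

print(recommend({"URL1"}, RULES))  # ['URL4', 'URL2']
```

Calling `recommend` again after every click, with the enlarged set of visited pages, corresponds to the rule update described in the text.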
For instance, suppose that the user clicks on URL2 as the next page to visit; the new set of rules will then be:

URL1, URL2 → URL3
URL1, URL2 → URL4
URL1, URL2 → URL5

The system calculates the support and confidence and then presents the rules with the best support and confidence to the user. After the user's next click the rules change again, and the system once more makes a proposal based on the new rule. Note that the system interacts with users continuously, on each click they make. Through these interactions the user feels that the website is alive, aware of each click, and proposing the best choices. Besides this exciting user experience, the user can always find the pages he or she is most likely to want, easily and quickly. Users therefore do not need to surf the whole website to find the information they want, because the system provides the desired pages for them. Another important point is that, as different people visit the website over time, the system becomes more accurate, because its decisions are based on the additional data gathered from new users. All information about users' browsing behavior is warehoused by the system, and this warehouse grows richer as time goes by and more users visit the website. We can say that, as time passes, the system gains more experience in how to interact with visitors.

4. Hybrid website framework

This research presents a framework that is a mixture of three different and separate knowledge areas, as represented in Figure 5. It tries to fully utilize the available technologies and their potential to develop a more usable website that can interact with users in real time and change its shape according to the way users interact with it.
The framework provides a base model for websites that are interactive and respond to each movement and click that users make. The website analyzes users' behavior and, on that basis, presents related links and information available on the website according to the visitor's preferences. As a result, with each click users see more relevant data, which they might never have expected to find on the website or which they would otherwise have spent a long time looking for. In addition, because the website adapts its design to each individual user in a different way, each user feels that the website is alive and reacts instantly to each of his or her movements. Users therefore feel more comfortable and more satisfied, and as a result their loyalty to the website increases.

Figure 5: Knowledge areas linked by hybrid interactive website

4.1 Hybrid Method

The main idea in dividing the web page into two sections is to use the power of automatic design tools without sacrificing the main structure of the website. If the whole design of the website were handed to automated design tools, users might get confused, because they would see a different design each time they visited the website. Using dynamic and static sections together provides a balance between the structure of the website and its interactivity. Figure 6 shows a schematic comparison between the newly proposed web page and a traditional, fully static web page. The static section of the website is the same as a traditional web page, and decisions about its design and content presentation depend fully on the administrators. The dynamic section of the web page interacts with users and automatically reshapes itself based on their real-time behavior.

Figure 6: Traditional and hybrid website

4.2 Interactivity on dynamic section

Providing interaction between the website and its users is the main function of the system. To achieve this, the website system needs to understand users' preferences and then reshape its dynamic section to show the content they want. To understand users and their behavior, association rules are used as described in the previous section. For this analysis, however, the system needs user data, which comes from warehousing the web log files. The following paragraphs explain how these data are warehoused and how they are put to use.

The whole proposed system is shown in Figure 7. It is divided into an online section and an offline section. In the offline section, server log files are cleaned and warehoused in a database. This database is continuously fed with new data from the log files, so as time passes more user data is stored in the warehouse. The data inside the data warehouse is analyzed and association rules are generated; these rules are stored in another database called the Rule Repository. Rules can be generated from the cleansed data and stored on a preset schedule, for example hourly, daily or in real time, depending on the availability of resources.

The online section of the website is almost the same as a traditional website; the only difference is the online proposing system application, which runs on the web server. The responsibility of this application is to fetch rules from the rule repository and build the dynamic section by ordering the links and information; it then combines the static and dynamic sections and shows the result to the user as the final web page. Proposing pages based on the rules is as simple as fetching data from the rule repository database and presenting it on the web page. In fact, all that is added to the website is a very simple database transaction, and all other time-consuming processes are shifted into the offline section.

Figure 7: Hybrid interactive website framework

5. Conclusion and future works

As discussed, the hybrid interactive website is an efficient framework for a new generation of websites that can partially design themselves based on users' preferences. This approach will improve users' browsing experience: they will feel that the website is interacting with them, predicting their wishes with each click and reshaping itself accordingly. The main benefits are as follows:

- Interactivity with the users on each click
- The system gains experience and improves itself

Since all websites already deal with various database transactions, and this new framework adds only a very common database fetch operation to the online section, it would not have a negative impact on the website's speed and performance.
To determine the best way to implement this system, it can be put into practice in the future using different languages and different technologies.

6. Acknowledgments

Ali Seyed Shirkhorshidi gratefully acknowledges Fatemeh Zahdedifar for her helpful discussion and revision and Zailan Arabee Bin Abdul Salam for his help.

7. Bibliography

[1] ISO-9241, International standard - ISO 9241-11:1998(E), 1 ed., International Organization for Standardization, 1998.
[2] Ahmed Seffah, Jan Gulliksen, Michel C. Desmarais, Human-Centered Software Engineering - Integrating Usability in the Software Development Lifecycle, vol. 8, Dordrecht, The Netherlands: Springer, 2005.
[3] B. W. Boehm, IEEE Software, vol. 8, no. 1, pp. 32-41, 1991.
[4] Lisa E, "5 Signs That Indicate Website Usability Problems," 2012. [Online]. Available: http://usabilitygeek.com/5-signs-that-indicate-websiteusability-problems/. [Accessed 4 May 2012].
[5] J. Nielsen, "Designing Web Usability," New Riders, vol. 3, no. 1, p. 419, 2000.
[6] S. Krug, Don't make me think, 2nd ed., Berkeley, California: New Riders, 2006.
[7] Tobias Ratschiller, Till Gerken, Web Application Development with PHP 4.0, 1st ed., Indianapolis, Indiana: New Riders, 2000.
[8] R. Katz-Haas, "User-Centered Design and Web Development," 1998. [Online]. Available: http://www.stcsig.org/usability/topics/articles/ucd%20_ web_devel.html. [Accessed 2 June 2012].
[9] J. Lazar, User-Centered Web Development, Jones and Bartlett Publishers, 2001.
[10] Noor Azura Zakaria, Media A. Ayu, User-centered web development for GMI alumni website, IEEE, 2010, pp. A32-A36.
[11] Ankit R Kharwar, Viral Kapadia, Nilesh Prajapati, Permal Patel, "Implementing Apriori algorithm on web server log," Gujarat, India, 2011.
[12] Ashok Kumar D, Lorain Charlet Annie M.C., "Web log mining using K-Apriori algorithm," International Journal of Computer Applications, vol. 41, no. 11, pp. 16-20, 2012.
[13] B. Santhosh Kumar, K.V. Rukmani, "Implementation of web usage mining using Apriori and FP Growth algorithms," Advanced Networking and Applications, vol. 01, no. 06, pp. 400-404, 2010.
[14] Jiang Yongbo, Zhang Ruili, "Intelligent search engines based on web mining," IEEE, pp. 171-144, 2011.
[15] Rekha Jain, G. N. Purohit, "Page ranking algorithms for web mining," International Journal of Computer Applications, vol. 13, no. 5, pp. 22-25, 2011.
[16] Yan Li, Xin-Zhong Chen, Bing-Ru Yang, "Research on web mining based intelligent search engine," Beijing, 2002.
[17] Liu Jian, Wang Yan-Qing, "Web log data mining based on association rule," in International Conference on Fuzzy Systems and Knowledge Discovery (FSKD), 2011.
[18] Miguel Gomes da Costa Júnior, Zhiguo Gong, "Web Structure Mining: An Introduction," in International Conference on Information Acquisition, Hong Kong and Macau, China, 2005.
[19] Qingtian Han, Xiaoyan Gao, Wenguo Wu, "Study on web mining algorithm based on usage mining," in 9th International Conference on Computer-Aided Industrial Design and Conceptual Design, 2008.
[20] W.-H. Gao, "Research on client behavior pattern recognition system based on web log mining," Qingdao, 2010.
[21] B. Liu, Web Data Mining, 2nd ed., Berlin: Springer, 2011.