Adding a New Level in KDD to Make Web Usage Mining More Efficient

Mohammad Ala'a AL_Hamami, PhD Student, Lecturer, m_ah_1@yahoo.com
Soukaena Hassan Hashem, PhD Student, Lecturer, soukaena_hassan@yahoo.com

Abstract
The application of data mining techniques to the World Wide Web is referred to as Web mining. However, there is no established vocabulary, which leads to confusion when comparing research efforts. The term Web mining has been used in two distinct ways. The first, Web content mining, is the process of information discovery from sources across the World Wide Web. The second, Web usage mining, is the process of mining for user browsing and access patterns. This research concentrates on one particular aspect: how to make Web usage mining more efficient. This can be done by adding a new level for Web usage mining. This new level is located before the data mining level and performs a classification of the Web log files. The classification is done according to some of the attributes contained in these Web files. Web usage mining is then applied to each class independently. This yields efficient Web usage mining for each class, because the volume of each class is limited. The visualization level then becomes clear and understandable to users, without the analysis that usually follows the data mining process to improve the understanding and clarity of the visualized mined patterns.

Keywords: Web mining, Web usage mining, decision tree classification.

1 Introduction [1]
With the explosive growth of information sources available on the World Wide Web, it has become increasingly necessary for users to utilize automated tools to find the desired information resources, and to track and analyze their usage patterns. These factors give rise to the necessity of creating server-side and client-side intelligent systems that can effectively mine for knowledge. Web mining can be broadly defined as the discovery and analysis of useful information from the World Wide Web. This describes the automatic search of
information resources available on-line, i.e., Web content mining, and the discovery of user access patterns from Web servers, i.e., Web usage mining. We present a taxonomy of Web mining and place the various aspects of Web mining in their proper context. There are several important issues, unique to the Web paradigm, that come into play if sophisticated types of analyses are to be done on server-side data collections. These include integrating various data sources such as server access logs, referrer logs, and user registration or profile information; resolving difficulties in the identification of users due to missing unique key attributes in collected data; and the importance of identifying user sessions or transactions from usage data, site topologies, and models of user behavior.

2 A Taxonomy of Web Mining [1,2,3]
In this section we present a taxonomy of Web mining, i.e., Web content mining and Web usage mining. This taxonomy is depicted in Figure 1.

Figure 1. Web mining taxonomy. Web mining is divided into Web content mining and Web usage mining. Web content mining covers the agent-based approach (intelligent search agents, information filtering/categorization, personalized Web agents) and the database approach (multilevel databases, querying Web databases). Web usage mining covers preprocessing, transaction identification, pattern discovery tools, and pattern analysis tools.

2.1 Web Content Mining
The lack of structure that permeates the information sources on the World Wide Web makes automated discovery of Web-based information difficult. Traditional search engines such as Lycos, Alta Vista, WebCrawler, MetaCrawler, and others provide some comfort to users, but do not generally provide structural information nor categorize, filter, or interpret documents. A recent study provides a comprehensive and statistically thorough comparative evaluation of the most popular search engines. In recent years these factors have prompted researchers to develop more intelligent tools for information retrieval, such as intelligent Web agents, and to extend data
mining techniques to provide a higher level of organization for the semi-structured data available on the Web.

2.2 Web Usage Mining
Web usage mining is the automatic discovery of user access patterns from Web servers. Organizations collect large volumes of data in their daily operations, generated automatically by Web servers and collected in server access logs. Other sources of user information include referrer logs, which contain information about the referring pages for each page reference, and user registration or survey data gathered via CGI scripts. Analyzing such data can help organizations determine the lifetime value of customers, cross-marketing strategies across products, and the effectiveness of promotional campaigns, among other things. It can also provide information on how to restructure a Web site to create a more effective organizational presence, and shed light on more effective management of workgroup communication and organizational infrastructure. For selling advertisements on the World Wide Web, analyzing user access patterns helps in targeting ads to specific groups of users.

Most existing Web analysis tools provide mechanisms for reporting user activity on the servers and various forms of data filtering. Using such tools it is possible to determine the number of accesses to the server and to individual files, the times of visits, and the domain names and URLs of users. However, these tools are designed to handle low to moderate traffic servers, and usually provide little or no analysis of data relationships among the accessed files and directories within the Web space. More sophisticated systems and techniques for the discovery and analysis of patterns are now emerging. These tools can be placed into two main categories, as discussed below.

2.2.1 Pattern Discovery Tools
The emerging tools for user pattern discovery use sophisticated techniques from AI, data mining, psychology, and information theory to mine for knowledge from collected data. For example, the WEBMINER system introduces a general architecture for Web usage mining. WEBMINER automatically discovers association rules and sequential patterns from server access logs. Other work introduces algorithms for finding maximal forward references and large reference sequences. These can, in turn, be used to perform various types of user traversal path analysis, such as identifying the most traversed paths through a Web locality. Other approaches use information foraging theory to combine path traversal patterns, Web page typing, and site topology information to categorize pages for easier access by users.
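To make the traversal-path analysis mentioned above more concrete, the following minimal Python sketch splits one user's click sequence into maximal forward references using one common formulation (a forward path is emitted whenever the user moves back to an already-visited page). The page names are invented for illustration; this is a sketch, not the algorithm of any specific cited system.

```python
def maximal_forward_references(session):
    """Split one user's click sequence into maximal forward references.

    A maximal forward reference is the forward path traversed up to the
    point just before the user moves back to an already-visited page.
    """
    path, refs, extending = [], [], False
    for page in session:
        if page in path:                          # backward reference
            if extending:
                refs.append(list(path))           # emit the forward path so far
            path = path[:path.index(page) + 1]    # back up to the revisited page
            extending = False
        else:                                     # forward reference
            path.append(page)
            extending = True
    if extending:
        refs.append(list(path))
    return refs

# Example: A -> B -> C -> back to B -> D yields [A, B, C] and [A, B, D]
print(maximal_forward_references(["A", "B", "C", "B", "D"]))
```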
2.2.2 Pattern Analysis Tools
Once access patterns have been discovered, analysts need the appropriate tools and techniques to understand, visualize, and interpret these patterns, e.g., the WebViz system. The WEBMINER system proposes an SQL-like query mechanism for querying the discovered knowledge (in the form of association rules and sequential patterns). Others have proposed using OLAP techniques such as data cubes for the purpose of simplifying the analysis of usage statistics from server access logs. These techniques and others are further discussed in the subsequent sections.

3 What Can Be Discovered [4]
The kinds of patterns that can be discovered depend upon the data mining tasks employed. By and large, there are two types of data mining tasks: descriptive data mining tasks that describe the general properties of the existing data, and predictive data mining tasks that attempt to make predictions based on inference over the available data. The data mining functionalities and the variety of knowledge they discover are briefly presented in the following list:

Characterization: Data characterization is a summarization of the general features of objects in a target class, and produces what are called characteristic rules.

Discrimination: Data discrimination produces what are called discriminant rules and is basically the comparison of the general features of objects between two classes, referred to as the target class and the contrasting class.

Association analysis: Association analysis is the discovery of what are commonly called association rules. It studies the frequency of items occurring together in transactional databases and, based on a threshold called support, identifies the frequent itemsets. Another threshold, confidence, which is the conditional probability that an item appears in a transaction when another item appears, is used to pinpoint association rules.

Classification: Classification analysis is the organization of data into given classes. Also known as supervised classification, classification uses given class labels to order the objects in the data collection. Classification approaches normally use a training set where all objects are already associated with known class labels. The classification algorithm learns from the training set and builds a model. The model is used to classify new objects.

Clustering: Similar to classification, clustering is the organization of data into classes. However, unlike classification, in clustering the class labels are unknown and it is up to the clustering algorithm to discover acceptable classes.
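To make the support and confidence thresholds concrete, here is a minimal Python sketch that computes both measures for a candidate rule X => Y over a small set of transactions. The transactions and the rule are invented for illustration and are not taken from the paper.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(x, y, transactions):
    """Conditional probability of Y given X: support(X union Y) / support(X)."""
    return support(x | y, transactions) / support(x, transactions)

# Toy transactions (each is a set of items) -- illustrative assumptions only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
x, y = {"bread"}, {"milk"}
print(support(x | y, transactions))    # 0.5  -> support of the rule bread => milk
print(confidence(x, y, transactions))  # 0.66 -> confidence of bread => milk
```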
4 Decision Tree Classification [5]
Decision tree classifiers (DTCs) are used successfully in many diverse areas such as radar signal classification, character recognition, remote sensing, medical diagnosis, expert systems, and speech recognition, to name only a few. Perhaps the most important feature of DTCs is their capability to break down a complex decision-making process into a collection of simpler decisions, thus providing a solution which is often easier to interpret. The decision tree classifier is one of the possible approaches to multistage decision making; table look-up rules, decision table conversion to optimal decision trees, and sequential approaches are others. The basic idea involved in any multistage approach is to break up a complex decision into a union of several simpler decisions, hoping the final solution obtained this way will resemble the intended desired solution.

We briefly describe some necessary terminology for describing trees.

Definitions:
1) A graph G = (V, E) consists of a finite, nonempty set of nodes (or vertices) V and a set of edges E. If the edges are ordered pairs (v, w) of vertices, then the graph is said to be directed.
2) A path in a graph is a sequence of edges of the form (v1, v2), (v2, v3), ..., (vn-1, vn). We say the path is from v1 to vn and is of length n.
3) A directed graph with no cycles is called a directed acyclic graph. A directed (or rooted) tree is a directed acyclic graph satisfying the following properties:
   i) There is exactly one node, called the root, which no edges enter. The root node contains all the class labels.
   ii) Every node except the root has exactly one entering edge.
   iii) There is a unique path from the root to each node.
4) If (v, w) is an edge in a tree, then v is called the father of w, and w is a son of v. If there is a path from v to w, then v is a proper ancestor of w and w is a proper descendant of v.
5) A node with no proper descendant is called a leaf (or a terminal). All other nodes (except the root) are called internal nodes.

The main objectives of decision tree classifiers are: 1) to classify correctly as much of the training sample as possible; 2) to generalize beyond the training sample so that unseen samples can be classified with as high an accuracy as possible; 3) to be easy to update as more training samples become available (i.e., to be incremental); and 4) to have as simple a structure as possible. The design of a DTC can then be decomposed into the following tasks:
1) The appropriate choice of the tree structure.
2) The choice of feature subsets to be used at each internal node.
3) The choice of the decision rule or strategy to be used at each internal node.
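As an illustration of the terminology above (root, internal nodes, leaves, and a decision rule at each internal node), the following minimal Python sketch represents a rooted decision tree and classifies a sample by following the unique path from the root to a leaf. The attribute names, splits, and class labels are invented for illustration.

```python
class Node:
    """A rooted decision tree: internal nodes hold a decision rule, leaves hold a class label."""
    def __init__(self, label=None, test=None, children=None):
        self.label = label            # set only on leaves
        self.test = test              # decision rule: maps a sample to a branch key
        self.children = children or {}

    def classify(self, sample):
        # Follow the unique path from the root down to a leaf.
        if self.label is not None:
            return self.label
        return self.children[self.test(sample)].classify(sample)

# Illustrative two-level tree (attributes and thresholds are assumptions).
tree = Node(
    test=lambda s: s["protocol"],
    children={
        "udp": Node(label="class_A"),
        "tcp": Node(
            test=lambda s: "low" if s["port"] < 1024 else "high",
            children={"low": Node(label="class_B"), "high": Node(label="class_C")},
        ),
    },
)
print(tree.classify({"protocol": "tcp", "port": 80}))   # -> class_B
```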
5 The Proposed System
In this research we describe an efficient Web usage mining framework. The key ideas are to preprocess the Web log file and then classify this log file into a number of files, each one representing a class; this classification is done by a decision tree classifier. After that, each class is submitted to Web usage mining independently. This makes Web usage mining more efficient, because each application, with all its services and their visitors, is studied separately. The general algorithm and all the details are explained briefly in the following sections.

5.1 Preprocessing Tasks
As discussed in the previous sections, analysis of how users are accessing a site is critical for determining effective marketing strategies and optimizing the logical structure of the Web site. In this research, specifically, there are a number of issues in preprocessing the data for mining that must be addressed before the mining algorithms can be run on Web usage data. The first major preprocessing task is transaction identification. Before any mining is done on Web usage data, sequences of page references must be grouped into logical units representing Web transactions or user sessions. This is done by converting the Web log file (see Figure 2) into a relational database (see Figure 3), which gives each user entry a transaction identifier; this preprocessing represents the basic step in beginning Web usage mining.

Figure 2. The application specialized to collect the Web log information.
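As a rough sketch of this preprocessing step, the snippet below turns raw log entries into relational records, assigning a transaction identifier (TID) to each entry, with fields corresponding to the columns of Figure 3. The input format (comma-separated fields in that order) is an assumption, since the paper does not specify how the collecting application writes its log; the two sample entries mirror the rows of Figure 3 as printed.

```python
import csv
from io import StringIO

# Assumed raw log format: local_ip, local_port, remote_ip, remote_port, state, type, timestamp
raw_log = StringIO(
    "12222318,139,335623377,80,Listen,Tcp,2:30\n"
    "12222318,139,44567822,50,Listen,Udp,5:50\n"
)

FIELDS = ["local_ip", "local_port", "remote_ip", "remote_port", "state", "type", "timestamp"]

def to_relational(log_file):
    """Convert the raw Web log into relational records, assigning a TID to each entry."""
    records = []
    for tid, row in enumerate(csv.reader(log_file), start=1):
        record = dict(zip(FIELDS, (field.strip() for field in row)))
        record["tid"] = tid
        records.append(record)
    return records

for rec in to_relational(raw_log):
    print(rec)
```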
TID | Local IP (A) | Local port (B) | Remote IP (C) | Remote port (D) | State (E) | Type (F) | Time stamp (G)
1   | 12222318    | 139            | 335623377     | 80              | Listen    | Tcp      | 2:30
2   | 12222318    | 139            | 44567822      | 50              | Listen    | Udp      | 5:50

Figure 3. The relational database D for the Web log (in/out) file.

The second preprocessing task is data cleaning. Techniques to clean a server log to eliminate irrelevant items are of importance for any type of Web log analysis, not just data mining. The discovered associations or reported statistics are only useful if the data represented in the server log gives an accurate picture of the user accesses to the Web site. Here, in this research, cleaning means eliminating the transactions of intruders; this is done by using the decision tree classifier. Then, by this classifier, the Web log file is classified into a number of classes so that each class can be mined separately. The decision tree classifier is explained in detail in the following section.

5.2 Decision Tree Classifier
We used the sibairwall program for collecting the Web log data, which is then converted to a relational database. This database has one global schema that includes the following attributes: local IP address, remote IP address, local port, remote port, state, type of protocol, and time stamp. We now build the decision tree classifier for two purposes: the first is to delete all intruder transactions, and the second is to classify the Web log file into previously known classes. A DTC is a tree as described in the previous sections, so the most important step is how to choose the attribute to be the root node, and then how to choose each of the internal nodes to complete the splitting and build the classifier. In our research we build this classifier without needing to measure the entropy of each of these attributes to decide which one should be the root node and which one is the most powerful attribute
to be the internal node to complete the splitting, because the discriminating power of each attribute is very clear in the Web log files. The root node is the remote IP address; by this attribute the tree is split into two classes: the first class is the intruder visitors class and the second is the normal visitors class. See Figure 4.

Figure 4. The first step in building the DTC: choosing the best attribute (root node). The root tests the remote IP address; if the IP is in the set of intruder visitors, the transaction falls into the intruder visitor class, and if the IP is in the set of normal visitors, it goes on to be split by another best attribute.

After splitting the decision tree into two classes, we eliminate the intruder visitors class and continue with the normal visitors class, selecting the most powerful attribute for splitting this class into the final classes. This attribute is the local port number, so the final classes are Web log files, each file representing the visitors of a specific application. See Figure 5.

Figure 5. The final DTC, which classifies the Web log file into classes according to the local port number attribute. The root tests the remote IP address: intruder IPs fall into the intruder visitor class, while normal IPs are split by local port number (e.g., 64, 80, 23) into the corresponding application classes (Application 1, Application 2, ...).
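A minimal sketch of this two-level classifier follows: the root tests the remote IP address against a set of known intruder addresses, and the second level routes normal transactions to an application class by local port number. The intruder set and the port-to-application mapping below are illustrative assumptions; in practice they would come from the site's own security and service configuration.

```python
# Illustrative assumptions: which remote IPs are intruders and which local
# port belongs to which application would be supplied by the site itself.
INTRUDER_IPS = {"44567822"}
PORT_TO_APPLICATION = {80: "application_1", 23: "application_2", 64: "application_3"}

def classify_transaction(record):
    """Two-level DTC: root node tests the remote IP, internal node tests the local port."""
    if record["remote_ip"] in INTRUDER_IPS:       # root node: remote IP address
        return "intruder_class"
    port = int(record["local_port"])              # internal node: local port number
    return PORT_TO_APPLICATION.get(port, "other_applications")

def split_log(records):
    """Drop intruder transactions and group the rest into per-application log files."""
    classes = {}
    for rec in records:
        label = classify_transaction(rec)
        if label == "intruder_class":
            continue                              # data cleaning: eliminate intruders
        classes.setdefault(label, []).append(rec)
    return classes

example = [{"remote_ip": "335623377", "local_port": "80"},
           {"remote_ip": "44567822", "local_port": "80"}]
print(split_log(example))   # only the first record survives, under application_1
```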
5.3 Discovery Techniques on Web Transactions in Each Class
Once user transactions or sessions have been identified, there are several kinds of access pattern mining that can be performed depending on the needs of the analyst, such as path analysis, discovery of association rules and sequential patterns, and clustering and classification. In this research we depend on association rule analysis as the basic tool for Web usage mining. Let A be a set of attributes and I be a set of values on A, called items. Any subset of I is called an itemset. The number of items in an itemset is called its length. Let D be a database with n attributes (columns). Define support(X) as the percentage of transactions (records) in D that contain the itemset X. An association rule is an expression X => Y (c, s), where X and Y are itemsets and X ∩ Y = ∅; s = support(X ∪ Y) is the support of the rule, and c = support(X ∪ Y) / support(X) is its confidence.

Association rule discovery techniques are generally applied to databases of transactions where each transaction consists of a set of items. In such a framework the problem is to discover all associations and correlations among data items where the presence of one set of items in a transaction implies (with a certain degree of confidence) the presence of other items. In the context of Web usage mining, this problem amounts to discovering the correlations among references to various files available on the server by a given client. By classifying the huge Web log file into smaller Web log files, one for each application, association rule analysis becomes much more efficient for Web usage mining, because this minimizes the work of discovering the associations and correlations among the itemsets of the files. Each transaction is comprised of a set of URLs accessed by a client in one visit to the server. For example, using association rule discovery techniques we can find correlations such as the following:
- 40% of clients who accessed the Web page with the email application /company/product1 also accessed /company/product2; or
- 30% of clients who accessed /company/special with the gopher application placed an online order in /company/product1.

After the Web mining process has been applied to each of the classified files and the hidden patterns extracted, we do not need to analyze these discovered patterns, because they will be clear and understandable at the visualization level. This clarity of the patterns comes from applying Web usage mining to specialized, well-determined files instead of huge Web data files, as in most previous research.
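The following sketch illustrates why mining each application class separately reduces the work: for each class, only the URL pairs that co-occur within that class's transactions are examined, and rules X => Y are reported when they pass (assumed) support and confidence thresholds. The per-class transaction data and the thresholds are invented for illustration.

```python
from itertools import combinations

def rules_for_class(transactions, min_support=0.3, min_confidence=0.6):
    """Pairwise association rules X => Y within one application class."""
    n = len(transactions)
    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    rules = []
    items = sorted(set().union(*transactions))
    for a, b in combinations(items, 2):
        for x, y in ((a, b), (b, a)):
            s = support({x, y})
            if s >= min_support:
                c = s / support({x})
                if c >= min_confidence:
                    rules.append((x, y, round(s, 2), round(c, 2)))
    return rules

# Toy per-class transactions (sets of URLs visited in one session) -- invented data.
classes = {
    "application_1": [{"/company/product1", "/company/product2"},
                      {"/company/product1", "/company/product2", "/company/special"},
                      {"/company/special"}],
}
for name, trans in classes.items():
    print(name, rules_for_class(trans))
```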
6 Conclusion
The term Web mining has been used to refer to techniques that encompass a broad range of issues. However, while meaningful and attractive, this very broadness has caused Web mining to mean different things to different people, and there is a need to develop a common vocabulary. Towards this goal we presented a definition of Web mining and developed a taxonomy of the various ongoing efforts related to it. Next, we presented a survey of the research in this area and concentrated on Web usage mining. We proposed the system with the following aims:
1. To apply Web usage mining to the Web log file after classifying it into a number of files, each one representing the Web log data for one application. This minimizes the number of associations and correlations that must be discovered, so we obtain Web usage mining that is optimized in time, storage space, and performance, with no need to analyze the discovered patterns in order to present a high-quality visualization.
2. In this research the DTC is used to classify the Web log file into application Web log files, because it is a powerful classification tool owing to its speed of classification and the small amount of data needed to train it.
3. The root node is selected as the remote IP address in order to eliminate the intruder transactions before classifying the Web file into application Web files.

References
1. M. S. Chen, J. Han, and P. S. Yu. Data mining: An overview from a database perspective. IEEE Trans. Knowledge and Data Engineering, 8:866-883, 1996.
2. U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
3. R. Kosala and H. Blockeel. Web Mining Research: A Survey. 2003.
4. J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.
5. S. R. Safavian and D. Landgrebe. A Survey of Decision Tree Classifier Methodology. landgreb@ecn.purdue.edu, 1999.