A Hybrid Approach To Web Usage Mining


A Hybrid Approach To Web Usage Mining
Authors: Søren E. Jespersen, Jesper Thorhauge, Torben Bach Pedersen
Technical Report, Department of Computer Science, Aalborg University
Created on July 17, 2002

A Hybrid Approach To Web Usage Mining

Søren E. Jespersen, Jesper Thorhauge, Torben Bach Pedersen
Department of Computer Science, Aalborg University

Abstract

With the large number of companies using the Internet to distribute and collect information, knowledge discovery on the web, or web mining, has become an important research area. Web usage mining, which is the main topic of this paper, focuses on knowledge discovery from the clicks in the web log for a given site (the so-called click-stream), especially on analysis of sequences of clicks. Existing techniques for analyzing click sequences have different drawbacks, i.e., either huge storage requirements, excessive I/O cost, or scalability problems when additional information is introduced into the analysis. In this paper we present a new hybrid approach for analyzing click sequences that aims to overcome these drawbacks. The approach is based on a novel combination of existing approaches, more specifically the Hypertext Probabilistic Grammar (HPG) and Click Fact Table approaches. The approach allows additional information, e.g., user demographics, to be included in the analysis without introducing performance problems. The development is driven by experiences gained from industry collaboration. A prototype has been implemented, and experiments are presented that show that the hybrid approach performs well compared to the existing approaches. This is especially true when mining sessions containing clicks with certain characteristics, i.e., when constraints are introduced. The approach is not limited to web log analysis, but can also be used for general sequence mining tasks.

1 Introduction

With the large number of companies using the Internet to distribute and collect information, knowledge discovery on the web, or web mining, has become an important research area. Web mining can be divided into three areas, namely web content mining, web structure mining, and web usage mining (also called web log mining) [11]. Web content mining focuses on discovery of the information stored on the Internet, e.g., as performed by the various search engines. Web structure mining can be used when improving the structural design of a website. Web usage mining, the main topic of this paper, focuses on knowledge discovery from the usage of individual web sites. Web usage mining is mainly based on the activities recorded in the web log, the log file written by the web server recording the individual requests made to the server. An important notion in a web log is the existence of user sessions. A user session is a sequence of requests from a single user within a certain time window. Of particular interest is the discovery of frequently performed sequences of actions by web users, i.e., frequent sequences of visited web pages. The work presented in this paper has been motivated by collaboration with the Zenaria company [25]. Zenaria is in the business of creating interactive stories, told through a series of video-sequences. The story is formed by a user first viewing a video-sequence and then choosing between some predefined options, based on the video-sequence. Depending on the user's choice, a new video-sequence is shown and new options are presented. The choices of a user will form a complete story - a walkthrough - reflecting the choices made by the individual user. Traditionally, stories have been distributed on CD-ROM and a consultant has evaluated the walkthroughs by observing the users.
However, Zenaria now wants to distribute their stories using the Internet, i.e., a walkthrough will correspond to a web session. Thus, the consultants will now have to use web usage mining technology to evaluate the walkthroughs. Much work has been performed on extracting various kinds of pattern information from web logs, and the application of the discovered knowledge ranges from improving the design and structure of a web site to enabling companies to provide more targeted marketing. One line of work features techniques for working directly on the log file [11, 12]. Another line of work concentrates on creating aggregated structures of the information in the web log [21, 17].

The Hypertext Probabilistic Grammar (HPG) framework [5], utilizing the theory of grammars, is such an aggregated structure. Yet another line of work focuses on using database technology in the clickstream analysis [1, 8], building so-called data webhouses [14]. Several database schemas have been suggested, e.g., the click fact star schema where the individual click is the primary fact [14]. Several commercial tools for analyzing web logs exist [24, 20, 22], but their focus is mostly on statistical measures, e.g., most frequently visited pages, and they provide only limited facilities for clickstream analysis. Finally, a prominent line of work focuses on mining sequential patterns in general sequence databases [2, 3, 17, 18]. However, all the mentioned approaches have inherent weaknesses in that they either have huge storage requirements, slow performance due to many scans over the data, or problems when additional information, e.g., user demographics, is introduced into the analysis. In this paper we present a new hybrid approach for analyzing click sequences that aims to overcome these drawbacks. The approach is based on a novel combination of existing approaches, more specifically the Hypertext Probabilistic Grammar (HPG) [4] and Click Fact Table [14] approaches. The new approach attempts to utilize the quick access and the flexibility with respect to additional information of the click fact table, and the capability of the HPG framework to quickly mine rules from large amounts of data. Specialized information is extracted from the click fact schema and presented using the HPG framework. The approach allows additional information, e.g., user demographics, to be included in the analysis without introducing performance problems. A prototype has been implemented and experiments are presented that show that the hybrid approach performs very well compared to the existing approaches. This is especially true when mining sessions containing clicks with certain characteristics, i.e., when constraints are introduced. The approach is not limited to web log analysis, but can also be used for general sequence mining tasks. We believe this paper to be the first to present an approach for mining frequent sequences in web logs that at the same time provides small storage requirements, very fast rule mining performance, and the ability to introduce additional information into the analysis with only a small performance penalty. This is done by exploiting existing data warehouse technology.

The paper is organized as follows. Section 2 describes the techniques on which we base our new hybrid approach. Section 3 describes the hybrid approach in detail. Section 4 describes the prototype implementation. Section 5 examines the performance of the hybrid approach. Section 6 concludes and points to future work. Appendix A presents the case in question, describing the problem domain. Appendix B provides further details on the performance experiments. Appendix C presents the details of the database schemas. The paper can be read and understood without reading the appendices.

2 Background

This section describes the approaches underlying our hybrid approach, namely the Click Fact, Session Fact, and Subsession Fact schemas, and the Hypertext Probabilistic Grammar (HPG) approach. Each approach has both strong and weak points, which are briefly described.

2.1 Database-Based Approaches

The click fact schema uses the individual click on the web site as the essential fact in the data warehouse [14].
This preserves most of the information found in a web log in the data warehouse. In our case, the click fact schema contains the following dimensions: URL dimension (the web pages; note that the fact table has references to both the requested page and the referring page, also known as the referrer), Date dimension, TimeOfDay dimension, Session dimension, and Timespan dimension. We could easily add any desired additional information as extra dimensions, e.g., a User dimension capturing user demographics. An example Click Fact schema can be found in Appendix C.

Pros: A strong point of this approach is that almost all information from the web log is retained within the data warehouse. Individual clicks within the individual sessions can be tracked and very detailed click information is available. This approach can utilize existing OLAP techniques such as pre-aggregation to efficiently extract knowledge on individual pages, as illustrated by the sketch below.
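As an illustration of this kind of page-level analysis, the following is a minimal sketch of an aggregate query on the click fact schema issued through JDBC. The table and column names (click_fact, url_dimension, url_key, page_url) and the connection URL are illustrative assumptions on our part; the concrete schema is the one given in Appendix C.

import java.sql.*;

// Minimal sketch: page-level statistics aggregated directly from a click fact
// schema. Table, column, and connection names are illustrative assumptions;
// the actual schema used in the prototype is described in Appendix C.
public class PageStats {
    public static void main(String[] args) throws SQLException {
        String sql =
            "SELECT u.page_url, COUNT(*) AS requests " +
            "FROM click_fact c JOIN url_dimension u ON c.url_key = u.url_key " +
            "GROUP BY u.page_url ORDER BY requests DESC";
        try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost/weblog");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(sql)) {
            while (rs.next()) {
                System.out.println(rs.getString("page_url") + ": " + rs.getLong("requests"));
            }
        }
    }
}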

Cons: When querying for sequences of clicks using the click fact schema, several join and self-join operations are needed to extract even relatively short sequences [1], severely degrading performance for large fact tables.

The Session Fact schema uses the entire session as the primary fact in the data warehouse [14], thereby ignoring the individual clicks within a session. Only the requests for the start and end page are stored as field values on the individual session fact entry. The approach is therefore not suitable for querying for detailed information about sequences of clicks or even individual clicks in a session. Queries on entire sessions are, however, quite efficient [1]. An example Session Fact schema can be found in Appendix C. We will not present strong and weak points for this approach, since it is unable to perform analysis of sequences of clicks, which is a key requirement for us.

The Subsession Fact schema was first proposed by Andersen et al. [1] and is aimed specifically at clickstream analysis on sequences of clicks. The approach introduces the concept of using subsessions as the fact, that is, explicitly storing all possible subsequences of clicks from a session in the data warehouse. This means that for all sessions, subsessions of all lengths are generated and stored explicitly in the data warehouse. In our case, the subsession fact schema has the dimensions Date, Session, TimeOfDay, and Timespan, as well as the URLSequence dimension capturing the corresponding sequence of web page requests. An example Subsession Fact schema can be found in Appendix C.

Pros: Storing subsessions explicitly in the database instead of implicitly (through join operations) allows for better performance on queries concerning sequences of clicks. There is a tradeoff between the amount of storage used and query performance, but if a web site has relatively short sessions, the storage overhead is manageable [1]. This technique can answer sequence-related queries much faster than, e.g., the click fact schema [1].

Cons: As hinted above, the longer the sessions, the more storage is needed for storing the information for all subsessions. The number of subsessions generated from a session of length n can be estimated as roughly n(n-1)/2, i.e., it grows quadratically with the session length [1]. A study shows that the storage overhead can vary between a factor of 4 and a factor above 20, depending on the characteristics of the sessions on a web site [1]. Some methods for reducing the number of subsessions that are to be stored have been suggested, including ignoring either very short or very long subsessions or ignoring certain subsessions. A sketch of the subsession generation is given below.
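To make the storage overhead concrete, the following is a minimal sketch of subsession generation as we understand the scheme of [1]: every contiguous subsequence of length at least two is materialized, optionally capped at a maximum length (the cap of 10 clicks is the convention adopted later in Section 5.1). All class and method names are our own.

import java.util.ArrayList;
import java.util.List;

// Sketch of subsession generation for the subsession fact approach: all
// contiguous subsequences of length >= 2 are materialized, capped at a maximum
// length. Without the cap, a session of n clicks yields n(n-1)/2 subsessions.
public class SubsessionGenerator {
    static List<List<String>> generate(List<String> session, int maxLength) {
        List<List<String>> subsessions = new ArrayList<>();
        for (int start = 0; start < session.size(); start++) {
            int longest = Math.min(session.size(), start + maxLength);
            for (int end = start + 2; end <= longest; end++) {
                subsessions.add(new ArrayList<>(session.subList(start, end)));
            }
        }
        return subsessions;
    }

    public static void main(String[] args) {
        List<String> session = List.of("A1", "A3", "A5", "A6", "A4");
        System.out.println(generate(session, 10).size() + " subsessions"); // 5*4/2 = 10
    }
}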
2.2 Hypertext Probabilistic Grammars

The nature of web sites, web pages, and link navigation has a nice parallel in formal language theory, which proves rather intuitive and presents a model for extracting information about user sessions. The model uses a Hypertext Probabilistic Grammar (HPG) [15] that rests upon the well-established theoretical area of languages and grammars. We will present this parallel using the example in Figure 2.1.

Figure 2.1: Example of a web site structure (pages A1 through A6 and the links connecting them).

The figure shows a number of web pages and the links that connect them. As can be seen, the structure is very similar to a grammar with a number of states and a number of productions leading from one state to another. It is this parallel that the model explores. The model uses all web pages as states (this holds only if the HPG is created with a so-called history depth of 1, see later) and adds two additional artificial states, the start state S and the end state F, to form all states of the grammar. We will throughout the paper use the terms state and page interchangeably. From the processing of the sessions in the web log, each state is marked with the number of times it has been requested. The probability of a production is assigned based on the information in the web log, so that the probability of a production is proportional to the number of times the given link was traversed relative to the number of times the state on the left side of the production was visited. Note that not all links within a web site may have been traversed, so some of the links might not be represented in an HPG. The probability of a string in the language of the HPG can be found by multiplying the probabilities of the productions needed to generate the string. Note that web pages might be linked in a circular fashion, and therefore the language of the HPG could be infinite. An HPG therefore specifies a threshold against which all strings are evaluated. If the probability of a string is below the threshold, the string will not be included in the language of the HPG (with the assigned threshold). This generates a complete language for a given HPG with a given threshold. Mining an HPG is essentially the process of extracting high-probability strings from the grammar. These strings are called rules (the notions of a rule and a string will be used interchangeably). These rules describe the most preferred trails on the web site, since they are traversed with a high probability. Mining can be done using both a breadth-first and a depth-first search algorithm [4]. A parameter is used when mining an HPG to allow for mining of rules inside the grammar, that is, rules whose leftmost state has not necessarily been the first request in any session. This is done by assigning probability to the productions from the start state to all other states, depending on this parameter and on whether or not the state occurs first in a session. An example of an HPG is shown in Figure 2.2.

Figure 2.2: Example of a Hypertext Probabilistic Grammar (the states of Figure 2.1 extended with the start state S and end state F, with transition probabilities on the productions).

Note that, as mentioned above, not all links on the web pages are necessarily represented in the grammar. One of the links from Figure 2.1 has not been traversed and is therefore not represented in the grammar. In Figure 2.2, a web page maps to a state in the grammar. The HPG can also be generated with a history depth above 1. With a history depth of, e.g., 2, each state represents two web pages requested in sequence. The structure of the HPG remains the same, but each state now has a memory of the last states traversed; with a history depth of 2, a state might then be named A1A2, for the traversal from A1 to A2. The mining of rules on the HPG using a simple breadth-first search algorithm has been shown to be too imprecise for extracting a manageable number of rules. Heuristics have been proposed to allow for a more fine-grained mining of rules from an HPG [7, 6]. The heuristics are aimed at specifying controls that more accurately and intuitively present relevant rules mined from an HPG, and allow for, e.g., generation of longer rules and for only returning a subset of the complete ruleset.

Representing Additional Information

An HPG has no memory of detailed click information in the states, so if rules using additional information, e.g., sessions for users with specific demographic parameters, were to be mined, each production would have to be split into a number of middle-states, where each middle-state would represent some specific combination of the parameters.
Each middle-state should then be labeled with the respective combination of information. This is illustrated in Figure 2.3, where A and B are original states and the new middle-states each represent one combination of the additional information. Note that the probability of going from a middle-state to B is 1 for each middle-state.

Figure 2.3: Including demographic information (each production from A to B is split over middle-states, one per category of the additional information).

Note that Figure 2.3 only illustrates additional information grouped into seven different categories, which, e.g., could be some demographic information about the user. If several different kinds of information were to be represented, the number of middle-states would increase exponentially, since all combinations of the different kinds of parameters would potentially have to be represented for all states in the HPG, thus creating a problem with state explosion. For instance, if the HPG should represent the gender and marital status of users, each production could potentially be split into four middle-states. However, if the HPG should also represent whether a user had children or not, each production needs to be split into eight middle-states. This factor increases with the cardinality of each additional parameter. This solution scales very poorly. For instance, representing gender, age in years, salary (grouped into ten categories), number of children (five categories), job status (ten categories), and years of working experience could easily require each production to be split into over 4 million middle-states. This can be seen by multiplying the cardinalities of all parameters (e.g., 2 x 100 x 10 x 5 x 10 x 40 = 4,000,000). Doing this for an HPG that includes only ten interconnected states would require over 400 million states (including middle-states) in the full HPG. This is clearly not a scalable solution, since the number of clicks represented might not even be 400 million. Furthermore, the existing algorithms [4] would have to be extended to work on this new type of states in the HPG. Alternatively, a single middle-state could be inserted in each production, containing a structure indicating the distribution of clicks over the parameters. This would effectively reduce the state explosion, but require significant changes to the existing mining algorithms to include parameters in the mining process and to support, e.g., mining of rules for a range of parameter values.

Pros: The size of an HPG is not proportional to the number of sessions it represents but to the number of states and productions in the grammar, which makes the model very compact, scalable, and self-contained. Additional sessions can easily be added to an existing HPG, thereby allowing an HPG to grow in the number of sessions represented without growing in size. Performing data mining on an HPG outputs a number of rules describing the most preferred trails on the web site using probability, which is a relatively intuitive measure of the usage of a web site. Mining the HPG for a small, high-probability set of rules requires the use of the described heuristics, which provide the user of the HPG with several parameters in order to tune the ruleset to his or her liking and extract rules for a very specific area.

Cons: The HPG does not preserve the ordering of clicks found in the sessions added (using a history depth above 1 will preserve some ordering). Therefore the rules mined could potentially be false trails, in the sense that none or few sessions include the trail, but a large number of sessions include the different parts of the trail. In order to extract rules for a specific subset of all sessions in the HPG, a specialized HPG for exactly these sessions needs to be created. This is because a complete HPG does not have a memory of individual clicks and their session, as described above.
A collection of sessions can potentially contain a very large number of different subsets, so building and storing specialized HPGs for every subset is not a scalable option, as mentioned above. Therefore, if more specialized HPGs are needed, the web log information must be stored in a form from which the desired sessions can be extracted later on, so that a specialized HPG can be built and mined.
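To make the compactness argument above concrete, the following is a minimal sketch of an in-memory HPG representation whose size depends only on the number of states and productions, not on the number of sessions folded into it. The class and field names are our own, and the sketch simplifies the model (for instance, visit counts are only maintained for pages occurring as referrers).

import java.util.HashMap;
import java.util.Map;

// Sketch of a compact in-memory HPG: counts are accumulated per state and per
// production, and probabilities are derived on demand, so adding further
// sessions never grows the structure beyond the number of pages and links.
public class Hpg {
    static class State {
        final String url;
        long visits; // simplified: number of clicks where this page is the referrer
        State(String url) { this.url = url; }
    }
    static class Production {
        final State from, to;
        long traversals; // number of times the corresponding link was followed
        Production(State from, State to) { this.from = from; this.to = to; }
        double probability() { return (double) traversals / from.visits; }
    }

    final Map<String, State> states = new HashMap<>();
    final Map<String, Production> productions = new HashMap<>();

    void addClick(String referrerUrl, String requestedUrl) {
        State from = states.computeIfAbsent(referrerUrl, State::new);
        State to = states.computeIfAbsent(requestedUrl, State::new);
        from.visits++;
        productions.computeIfAbsent(referrerUrl + " -> " + requestedUrl,
                key -> new Production(from, to)).traversals++;
    }
}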

Summary

As mentioned in the preceding sections, each approach has some inherent strengths and weaknesses. The click and subsession fact approaches handle additional information easily, but result either in huge I/O or huge storage requirements. On the other hand, an HPG can efficiently mine long rules for a large collection of sessions, but is not able to represent additional information in a scalable way. To remedy this situation, we now present a new approach for extracting information about the use of a web site, utilizing the potential for mining very specific rules present in the HPG approach while still allowing for the representation of additional information using a click fact schema in a data warehouse, utilizing existing DW technology. The main idea is to create HPGs on demand, where each dynamically created HPG represents a specialized part of the information in the web log.

3 The Hybrid Approach

3.1 Overview

Our proposal for a hybrid approach combines the click fact schema with the HPG model, creating a flexible technique able to answer both general and detailed queries regarding web usage. Using the detailed information from the click fact schema results in almost no data being lost in the conversion from web log to database. However, we also need some kind of abstraction or simplification applied to the query results. In short, we want to be able to specify exactly what subset of the information we want to discover knowledge for. An HPG provides us with a simple technique for creating an overview from detailed information. However, the scalability issues and the somewhat limited flexibility of the HPG model (see Section 2.2) must also be kept in mind, as we want to be flexible with regard to querying possibilities. The concept of the hybrid approach is shown in Figure 3.1.

Figure 3.1: Overview of the hybrid approach (constrained extraction from the click fact schema in the database, construction of an HPG, and mining of rules from the HPG).

As mentioned in Section 2.2, creating specialized HPGs would require storing the original web log information and creating each specialized HPG when required. Storing the original web log file is not very efficient since, e.g., a lot of non-optimized string processing would be required, so some other format needs to be devised. The click fact schema described above provides this detailed level, since it preserves the ordering of the web log and furthermore offers database functionality such as backup, optimized querying, and the possibility of applying OLAP techniques such as pre-aggregation. Moreover, a more subtle feature of the click fact schema lends itself directly to the HPG model and proves useful in our solution. The database schema for the click fact table includes unique keys for both the referrer and the destination of each click. These two keys uniquely identify a specific production within a grammar, since each key is a reference to a page and thus a state. Thereby we are able to extract all productions from the click fact table simply by returning all combinations of url_key and referer_key. Each occurrence of a specific combination of keys represents a single traversal of the corresponding link on the web site. Retrieving all states of an HPG is likewise immediately possible from the click fact schema.
The url_dimension table holds information about each individual page on the web site; therefore a single query can easily retrieve all states in the grammar and a count of how many times each state was visited, both in total and as first or last in a session. These queries can be used to initialize an HPG. This would normally be done using an algorithm iterating over all states in the sessions [4], but using the database representation, the required information can be retrieved in a few simple database queries, as sketched below. Note that some post-processing of the query results is necessary for a nice in-memory representation.
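The following sketch shows what such extraction queries could look like when issued over JDBC. The table and column names (click_fact, url_dimension, referer_key, url_key, page_url) and the connection URL are illustrative assumptions; the concrete schema is given in Appendix C, and additional queries (e.g., for first-in-session and last-in-session counts) are needed in practice.

import java.sql.*;

// Sketch of HPG initialization queries on the click fact schema: every distinct
// (referer_key, url_key) combination is a production, and every page in the
// URL dimension is a state. Names are illustrative; see Appendix C.
public class HpgExtractor {
    static final String PRODUCTIONS =
        "SELECT referer_key, url_key, COUNT(*) AS traversals " +
        "FROM click_fact GROUP BY referer_key, url_key";

    static final String STATES =
        "SELECT u.url_key, u.page_url, COUNT(c.url_key) AS visits " +
        "FROM url_dimension u LEFT JOIN click_fact c ON u.url_key = c.url_key " +
        "GROUP BY u.url_key, u.page_url";

    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost/weblog");
             Statement stmt = con.createStatement();
             ResultSet rs = stmt.executeQuery(PRODUCTIONS)) {
            while (rs.next()) { // the STATES query is processed analogously
                System.out.printf("%d -> %d : %d traversals%n",
                        rs.getInt("referer_key"), rs.getInt("url_key"), rs.getLong("traversals"));
            }
        }
    }
}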

3.2 Constructing Specialized HPGs

Creating specialized HPGs is indeed possible with this approach. Inserting a constraint layer between the database software and the creation process for an HPG allows for restrictions on the information represented in the HPG. Extending the queries described above to only extract information for clicks with certain characteristics allows for the creation of an HPG representing only this information. Thereby the rules mined on this HPG will be solely for clicks with the specific characteristics. Using this concept, specialized HPGs can be created on demand from the database. For instance, the consultant might be interested in learning about the characteristics of walkthroughs for male, senior salesmen in a company. The consultant specifies this constraint and the queries above are modified to only extract the productions and the states that apply to the constraint. The queries will therefore only extract the sessions generated by male, senior salesmen in the company, and the HPG built from these sessions will produce rules describing the characteristic behavior of male, senior salesmen. This approach combines some of the techniques described earlier so as to utilize their strong sides and avoid some of their weak sides.

Pros: The limitation of the simple HPG framework (see Section 2.2) of not being able to efficiently represent additional information is avoided with this hybrid approach. The ability to easily generate a specialized HPG overcomes the shortcoming of not being able to store all possible specialized HPGs. Saving the web log information in the click fact table (and thus in the database) gives us a tool for storing information which arguably is preferable to storing the original log file. A DBMS has many techniques for restoring, querying, and analyzing the information, with considerable performance gains over processing raw textual data such as a log file. Combining the click fact schema, which offers a detailed level of information, and the HPG framework, which offers a more generalized and compact view of the data, allows for different views on the same data within the same model, without storing information redundantly on non-volatile storage.

Cons: As the hybrid approach mines results using the HPG, false trails might be presented to the user, which is a characteristic inherited from the general HPG approach. This is obviously a critical issue, since it might lead to misinterpretations of the data. Using a history depth greater than 1 might reduce the number of false trails.

Open Issues: The performance of generating specialized HPGs using queries on top of a database is an open issue that will be explored next. The number of rules mined from an HPG should not be too large and the rules should not be too short, so some of the heuristics mentioned in Section 2.2 need to be implemented when mining the HPG in order to present the information to the user in a manageable way.

4 Prototype Description

4.1 System Overview

The architecture of the hybrid approach prototype is seen in Figure 4.1. All modules within the system are implemented in the Java programming language. The prototype works as follows. First, the web log file from the web server is converted into an XML-based format. Then a Quilt query [9] is executed on the XML file, resulting in a new XML file. An XML parser is then invoked, which parses the XML into the click fact schema contained within the database. This part of the system is called the Data Warehouse Loader.
We then use a simple Graphical User Interface (GUI) to control how the SQL queries extracting data from the database should be constructed. The SQL generator then constructs four SQL queries which are used to query the database. We call this part the Constraint Layer. The results from these queries are used to construct an HPG structure held in main memory. A BFS mining algorithm then extracts rules from the HPG. This final part is simply named HPG.

4.2 Loading the Data Warehouse

The system is designed to read information from a web log. The typical format of a web log is the Common Log Format (CLF) [23] or the Extended Common Log Format (ECLF) [23]. Such web logs contain the URL requested, the IP address (or a resolved name, if possible) of the computer making the request, and timestamps.

Figure 4.1: Overview of the prototype (web log, CLF-to-XML conversion, Quilt query, XML parser, and database in the Data Warehouse Loader; GUI and SQL generator in the Constraint Layer; HPG construction and BFS mining in the HPG part).

An entry in a CLF log file for a single request is seen below.

ask.cs.auc.dk - - [31/Oct/2001:09:48: ] "GET /education/dat1inf1 HTTP/1.0"

The identification of users is non-trivial, as mentioned in Section 2.2. The prototype does not implement any logic to handle user identification but simply assumes that an IP address maps directly to a user. Also, the prototype does not at present include any means to avoid proxy caching of data. Instead of writing a data warehouse loader reading the CLF format directly, we have chosen to convert it into a more high-level format, based on XML. Once in this format, we are able to import and process our log data in a vast variety of programs, which provides flexibility. The first step in loading the data into the data warehouse is cleansing the log data. Instead of cleansing directly on the log file, using a temporary table in the database, or using other proposed techniques [10], we use an implementation of the Quilt query language named Kweelt [19]. Quilt is the predecessor of the new XQuery language proposed by the W3C and is an SQL-like language for querying an XML structure. We use two Quilt queries to produce an XML file cleansed of irrelevant information and grouped into sessions for each host. When the web log is transformed into XML, we are ready to load it into the data warehouse. Using a SAX (Simple API for XML) parser, we have implemented separate transformers from XML to each of the data warehouse schemas described in Section 2. Note that the only modifications needed if we want to change the underlying DBMS are to ensure that the SQL types used by the transformer are supported on the chosen DBMS and that a driver for connecting the transformer with the DBMS exists. Provided with the XML file, the transformers parse our XML structure into the data warehouse schemas and the loading is completed. No scheme for handling the uncertainty associated with the decision-time is implemented in the prototype, but it could readily be extended to assume that a predefined timespan was used for handling, e.g., the streaming time of each request. Based on previous work [10], we have chosen to spawn a new session if the dwell time in a session exceeds 25 minutes. The spawning of new sessions is handled entirely by the log transformers in the prototype, and not by the Kweelt queries; a sketch is given below.
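A minimal sketch of this session-spawning rule follows, under the assumption that it is the gap between two consecutive requests from the same host (the dwell time on a page) that is compared against the 25-minute limit. The ClickRecord type and all other names are our own.

import java.util.ArrayList;
import java.util.List;

// Sketch of the 25-minute session-spawning rule applied by the log
// transformers: requests from one host are split into sessions whenever the
// gap between two consecutive requests exceeds the timeout.
public class Sessionizer {
    static final long TIMEOUT_MS = 25 * 60 * 1000;

    record ClickRecord(String host, String url, long timestampMs) {}

    static List<List<ClickRecord>> sessionize(List<ClickRecord> clicksFromOneHost) {
        List<List<ClickRecord>> sessions = new ArrayList<>();
        List<ClickRecord> current = new ArrayList<>();
        long lastTimestamp = Long.MIN_VALUE;
        for (ClickRecord click : clicksFromOneHost) { // assumed sorted by time
            if (!current.isEmpty() && click.timestampMs() - lastTimestamp > TIMEOUT_MS) {
                sessions.add(current); // spawn a new session
                current = new ArrayList<>();
            }
            current.add(click);
            lastTimestamp = click.timestampMs();
        }
        if (!current.isEmpty()) sessions.add(current);
        return sessions;
    }
}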

4.3 Constraint Layer Implementation

The main idea of combining a data warehouse with the HPG technique is the ability to constrain the data on which the HPG is built. A detailed description of how this constraining is achieved is presented here. We need SQL queries to extract our constrained set of data and then pass it on to a mechanism initializing our HPG. However, the extracted data must first be divided into session specific and click specific information. This distinction is very important, as the constructed HPG could otherwise be incorrect.

Session specific: Dimensions which are specific to an entire session will, when constrained on one or more of their attributes, always return entire sessions as the result. One such dimension is the session dimension. If the click fact schema is constrained to only return clicks referencing a subset of all sessions in the session dimension, it is assured that the clicks returned will form complete sessions. Also, if we assume that all sessions start and finish on the same date, the date dimension provides the same property as the session dimension. Clicks constrained on a specific date will then always belong to a connected sequence of clicks. In an HPG context this means that the constructed HPG never has any disconnected states - states with no productions going either to or from them.

Click specific: Dimensions containing information about a single click will, if the click fact table is constrained on a subset of these keys, produce a set of single clicks which probably do not form complete sessions. The probability of this increases as the cardinality of the attribute grows or the number of selected attributes shrinks. For instance, when constraining on 3 URLs from a set of 1000 in the URL dimension, the clicks returned will probably not constitute complete sessions and the probability of false trails increases dramatically. Furthermore, the HPG produced will then consist of 3 states with potentially no productions between them and some productions leading to states not included in this HPG. These 3 states are disconnected states. To be able to derive any rules from an HPG we need to have states with productions connecting them.

As we want the approach to provide the overview as well as the detail, these two types of dimensions must be dealt with before the HPG is built. The solution proposed here is to constrain the data in two stages. The two stages (see Figure 4.2) can briefly be described as follows. First, we retrieve a temporary result using dimensions which are thought to be click specific (1a). The temporary result comes from joining the click specific dimensions with the click fact schema on all constrained attributes. The distinct session keys from the temporary result (1b) can be used to get the subset of all clicks having these session keys, which is done in step 2. These distinct keys are session specific and will assure an interconnected HPG. Second, we constrain the result using dimensions which are thought to be session specific and the distinct session keys (2a). All constraints must be fulfilled on the dimensions when joining with the click fact schema. The collection of clicks retrieved is interconnected (2b). Note that in practice both steps are performed as one query; a sketch is given at the end of this section.

Figure 4.2: Extraction of data for the HPG in two stages (1a: identify clicks; 1b: retrieve session specific keys; 2a: identify all clicks in the sessions; 2b: use all clicks to build the HPG).

When executing queries on the database, java.sql.ResultSet objects are returned. The HPG's initializing methods are then provided with the respective ResultSet objects, which are dealt with internally in each method. The method used to initialize the HPG state set will initially run through the ResultSet object provided and create a State object for each row in the ResultSet object.
Transparency between the specific SQL query and the result used for HPG initialization is maintained, as the ResultSet interface is well-defined and not affected by a change in the SQL query. The HPG initialization method is at this stage provided with data, and as such, the extraction of data from the data warehouse is completed.
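The following is a minimal sketch of the combined two-stage query from Figure 4.2, expressed as a single prepared statement: the inner query implements the click specific stage and returns distinct session keys, and the outer query retrieves all clicks of those sessions subject to the session specific constraint. The table and column names (click_fact, session_dimension, url_dimension, total_session_seconds, page_url) and the use of a LIKE pattern are illustrative assumptions.

import java.sql.*;

// Sketch of the two-stage constrained extraction (see Figure 4.2). Stage 1
// (inner query) finds the sessions touched by the click specific constraint;
// stage 2 (outer query) returns all clicks of those sessions, filtered by the
// session specific constraint, so the resulting HPG has no disconnected states.
public class ConstraintLayer {
    static final String TWO_STAGE_QUERY =
        "SELECT c.referer_key, c.url_key, c.session_key " +
        "FROM click_fact c " +
        "JOIN session_dimension s ON c.session_key = s.session_key " +
        "WHERE s.total_session_seconds < ? " +               // session specific constraint
        "AND c.session_key IN (" +
        "  SELECT DISTINCT c2.session_key FROM click_fact c2 " +
        "  JOIN url_dimension u ON c2.url_key = u.url_key " +
        "  WHERE u.page_url LIKE ?)";                         // click specific constraint

    // The caller passes the ResultSet on to the HPG initialization methods
    // and is responsible for closing the statement afterwards.
    static ResultSet constrainedClicks(Connection con, int maxSeconds, String urlPattern)
            throws SQLException {
        PreparedStatement ps = con.prepareStatement(TWO_STAGE_QUERY);
        ps.setInt(1, maxSeconds);
        ps.setString(2, "%" + urlPattern + "%");
        return ps.executeQuery();
    }
}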

4.4 HPG Implementation

The HPG implementation consists of two parts: the HPG building method and a Breadth-First Search (BFS) mining algorithm. The BFS algorithm is chosen instead of the Depth-First Search (DFS) algorithm, as the latter has a higher memory consumption [4]. We have chosen to implement both parts using the algorithms and specifications presented by Borges [4]. As our work is focused on creating a hybrid between the standard click fact schema and the standard HPG model, we are not interested in altering any of these techniques. The only difference between the prototype implementation and the work by Borges [4] is that rules and productions are stored in main memory instead of a database. The HPG implemented in the prototype holds two simple lists in main memory, one for states and one for productions. These lists are initialized from the results from the constraint layer and the mining process works entirely on these lists. A simplified sketch of the mining step is given below.
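For illustration, the following is a simplified sketch of breadth-first rule mining over such in-memory structures: a trail is extended as long as its probability stays above the threshold, and trails that cannot be extended any further are emitted as rules. This sketches the idea only and is not a reimplementation of the exact algorithm and heuristics of Borges [4]; all names are our own.

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.Map;

// Simplified breadth-first mining sketch: trails are expanded level by level,
// pruning any trail whose accumulated probability drops below the threshold,
// and maximal surviving trails of length >= 2 are reported as rules.
public class BfsMiner {
    static class Trail {
        final List<String> states;
        final double probability;
        Trail(List<String> states, double probability) {
            this.states = states;
            this.probability = probability;
        }
    }

    // productions: state -> (next state -> transition probability)
    static List<Trail> mine(Map<String, Map<String, Double>> productions,
                            Map<String, Double> startProbabilities,
                            double threshold) {
        List<Trail> rules = new ArrayList<>();
        Deque<Trail> queue = new ArrayDeque<>();
        startProbabilities.forEach((state, p) -> {
            if (p >= threshold) queue.add(new Trail(List.of(state), p));
        });
        while (!queue.isEmpty()) {
            Trail trail = queue.poll();
            String last = trail.states.get(trail.states.size() - 1);
            boolean extended = false;
            for (Map.Entry<String, Double> next
                    : productions.getOrDefault(last, Map.of()).entrySet()) {
                double p = trail.probability * next.getValue();
                if (p >= threshold) {
                    List<String> longer = new ArrayList<>(trail.states);
                    longer.add(next.getKey());
                    queue.add(new Trail(longer, p));
                    extended = true;
                }
            }
            if (!extended && trail.states.size() > 1) rules.add(trail); // maximal trail
        }
        return rules;
    }
}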
5 Experimental Evaluations

In evaluating the hybrid approach, we decided to evaluate the performance of the on-the-fly creation of HPGs both by itself and against a straightforward SQL query on a subsession schema (see Section 2.1). Evaluating the hybrid model by itself allows us to identify potential bottlenecks in the on-the-fly creation of HPGs and target any optimization to the areas where performance suffers. Evaluating against the subsession schema allows us to more precisely foresee for what types of queries our approach could be a better choice than the subsession schema, and how the arguably greater effort of building and mining an HPG performs against a straightforward query on a database. We decided to evaluate against the subsession schema since Zenaria, as described, will mostly be interested in information about sequences of clicks. As walkthroughs are generally long sessions, the subsession schema will arguably perform better than the click fact schema, since each additional click in a sequence returned from a click fact schema requires self-join operations in the click fact table query [1]. We decided not to evaluate against a complete HPG representing the specialized information since, as we argued, that solution would scale very poorly and the mining algorithms would need some expansion (see Section 2.2). Also, techniques requiring a number of scans of the database for mining the patterns [2, 3, 17, 18] would perform too poorly to be used for on-line mining.

5.1 Experimental Settings

We now describe the various elements used as settings for our experimental evaluation. The goal of these experiments is to retrieve information concerning the performance of creating HPGs on-the-fly on top of a database. The experiments include both unconstrained mining and mining constrained on both click and session specific dimensions (see Section 4.3), and are performed on both a non-indexed and an indexed database. The conclusions of the experiments are summarized in Section 5.4. We have used the web log from the Computer Science Department at Aalborg University for October 2001. The web site is primarily an information site for students and staff but also contains personal sites with a wide range of content. The web log was 300 MB in ECLF, but we used only CLF-compliant entries, only accepted requests processed with code 200 (OK), and only included requests from outside users. Overall, the web log then contained sessions for a total of valid clicks amongst unique pages. The DBMS used in the evaluation is MySQL running on an Intel 933 MHz machine. The prototype described in Section 4 runs on an AMD Athlon 1800 MHz machine through a JBoss application server and is implemented using JSP files running on Tomcat (beta). The tests were executed from a JSP file, since this was the entry point initially chosen for the prototype. The insertion and generation of subsessions can be quite resource-demanding [1], so in order to avoid filling up the database with very long subsessions (some of the sessions from the web log were as long as 1900 requests) we adopted the suggested convention [1] of limiting the subsession length to 10 clicks. This avoids a massive blowup in the number of subsessions generated and ensures that only a constant number of extra subsessions is generated for each additional request in sessions longer than 10 clicks. Insertion of subsessions yielded a total of entries in the subsession fact table, divided amongst different subsessions. We now turn to the actual experimental results.

5.2 Analysis of the Hybrid Approach

The combination of using a click fact schema together with the HPG approach might pose problems or bottlenecks for the hybrid approach as a whole. To clarify whether certain parts of the approach perform better than others, and thereby identify possible areas in which optimization could have a significant effect, different experiments were performed. These experiments are not performed for comparison with other techniques, but merely as a detailed measurement of each part of the system. In the process of extracting rules from the data warehouse, three main tasks are performed. First, the database is queried for the data used to initialize the HPG. Second, the HPG is built in main memory based on the extracted data, and third, the BFS mining algorithm extracts rules from the HPG. To test how the approach performs as the number of states increases, a range query test was performed. All clicks belonging to sessions in which the amount of time being spent is below a certain threshold were extracted; e.g., for a threshold of 3000 seconds, all clicks belonging to sessions in which the total time spent is below 3000 seconds were extracted, and so forth. Increasing the threshold in steps of 100 seconds then increases the number of states included in the HPG. The result of this experiment is shown to the left in Figure 5.1.

Figure 5.1: Total Processing Time and Database Query Time

The figure shows that by far the most time-consuming part of the hybrid approach is the database query time. In most cases, more than 90% of the total time is spent in this part. On the other hand, the time used for the BFS mining is so short that it can hardly be distinguished in the figure. The fast running time is the result of an optimization of the data structures used for representing the HPG. The time spent on initializing the HPG adds only little processing time to the overall processing time, usually 5-10%. A detailed version of the time used to initialize the HPG can be found in Appendix B. Thus, it seems that a possible tuning effort should focus on the database query part. The most promising way to optimize the query time would be to apply materialized views [16], which could precompute most of the aggregate results required to build the HPG, yielding a huge performance improvement and making the database query time almost independent of the number of clicks in the database. However, the MySQL database used in the prototype does not implement this feature, so the experiment of using this technique is left for future work. The right side of Figure 5.1, which presents the database querying part of the total processing, shows that the query times are distributed with 60-70% for the StateSet extraction query, 20-30% for the ProductionSet extraction query, and 5-10% for the Session extraction query. Here, we have made the optimization of reusing the ProductionSet result to compute the total number of requests for each state. However, optimizing the query part should not be targeted at a specific database query, but at the overall database performance. One of the most obvious ways to optimize the database is to create indexes on the key attributes in the click fact schema, as sketched below. Our experiments (presented in detail in Appendix B) show a performance gain of around 25% from doing this. As mentioned above, materialized views are another promising opportunity for tuning.
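As an illustration, index creation of this kind could look as follows; the index, table, and column names are assumptions on our part, and the statements would normally be part of the schema definition rather than application code.

import java.sql.*;

// Sketch of the index tuning discussed above: indexes on the key attributes of
// the click fact table. All names are illustrative assumptions.
public class IndexTuning {
    public static void main(String[] args) throws SQLException {
        try (Connection con = DriverManager.getConnection("jdbc:mysql://localhost/weblog");
             Statement stmt = con.createStatement()) {
            stmt.executeUpdate("CREATE INDEX idx_click_url ON click_fact (url_key)");
            stmt.executeUpdate("CREATE INDEX idx_click_referer ON click_fact (referer_key)");
            stmt.executeUpdate("CREATE INDEX idx_click_session ON click_fact (session_key)");
        }
    }
}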

5.3 Comparative Analysis

We now compare the hybrid approach to the subsession schema. We compare the two approaches on extraction of information for all sessions, session specific constraints, and click specific constraints (see Section 4.3 for an explanation).

All sessions: This test compares the hybrid approach, extracting all clicks to the HPG and mining, against the subsession schema, which must query the database for the most frequent and longest subsessions. The results for indexed and non-indexed databases are shown to the left in Figure 5.2. It is seen that the non-optimized hybrid approach performs worse than the subsession schema by a factor of approximately 25. However, the hybrid approach where both the database querying and the BFS mining are optimized takes only 30% longer than the subsession approach. This is rather surprising, since the subsession approach only needs to query one table in the database, whereas the hybrid approach needs to extract information to memory and perform the mining process on the built HPG. The small performance advantage of the subsession approach comes at a large price: the storage requirements are more than 5 times higher than for the hybrid approach, due to the need to store subsessions of all lengths.

Figure 5.2: Hybrid Versus Subsession Approach For All Rules and Click Constraints

Click specific constraints: This test is evaluated by extracting information for all sessions containing requests including the string tbp, which identifies a specific personal homepage on the web site. The results using both approaches with this constraint (again, both non-indexed and indexed) are shown to the right in Figure 5.2. The hybrid approach proves to be a factor of 10 faster than the subsession schema approach even in the non-optimized case; the optimized version is 20 times faster. The obvious reason for the relatively slow database query in the subsession schema approach is the need to compare each entry in the url_sequence_dimension with tbp. This is necessary since the sequence of URL requests is stored as a single string, so all entries must be compared.

Figure 5.3: Hybrid Versus Subsession Approach For Session Specific Constraints

Session specific constraints: This test resembles the range query test performed in Section 5.2. The session specific constraint is evaluated using a constraint on the total_session_seconds field from the session_dimension table. Figure 5.3 shows the results of evaluating the hybrid approach and the subsession schema on this session constraint, using both a non-indexed and an indexed version of the schemas. The figure shows that the optimized hybrid approach outperforms all the other approaches by an average factor of 3-4. All the other approaches are rather similar in performance. Again, the superior performance of the optimized hybrid approach is accompanied by a storage requirement that is more than 5 times smaller than for the subsession approach. Note furthermore that the hybrid approach provides some functionality which cannot be achieved with the subsession schema. Since the click fact schema stores each click, click specific constraints can be put on the time spent on individual clicks. This is not possible with the subsession schema approach, since the time spent on individual clicks is not stored in that schema.

5.4 Experiment Summary

To conclude, our evaluation of the hybrid approach has shown that the optimized hybrid approach is very competitive when compared to the subsession fact approach. Even with a storage requirement that is more than 5 times smaller than for the subsession fact approach, the optimized hybrid approach performs 3-20 times faster; only when mining all rules is the performance a little slower. This makes the hybrid approach the clear winner. As a consequence, the query response time is fast enough to allow on-line rule mining. Furthermore, the hybrid approach allows for click specific constraints on the time spent on individual pages, which cannot be handled by the subsession fact approach. Also, the subsession fact table approach will break down if the average sessions are very long, as in the Zenaria case.

6 Conclusion and Future Work

Motivated by the need for an efficient way of mining sequence rules for large amounts of web log data, we presented a hybrid approach to web usage mining that aims to overcome some of the drawbacks of previous solutions. The approach was based on a novel combination of existing approaches, more specifically the Hypertext Probabilistic Grammar (HPG) [4] and Click Fact Table [14] approaches. The new approach attempted to utilize the quick access and the flexibility with respect to additional information of the click fact table, and the capability of the HPG framework to quickly mine rules from large amounts of data. Specialized information was extracted from the click fact schema and presented using the HPG framework. The approach allows additional information, e.g., user demographics, to be included in the analysis without introducing performance problems. A prototype has been implemented and experiments are presented that show that the hybrid approach performs very well compared to existing approaches, especially when mining sessions containing clicks with certain characteristics, i.e., when constraints are introduced. In future work, we will finish the development of a tool for Zenaria. First of all, a front end which can be used by the consultants needs to be developed, including methods for visualization of the extracted rules. The mining process should be modified to include the heuristics mentioned in Section 2.2, which will allow for better control of rule extraction.
Solutions to the open issues mentioned in Appendix A need to be added to the Data Warehouse Loader module, including, e.g., user tracking using cookies. Optimization of the prototype, especially the database querying, is also an area of future improvement. The hybrid approach inherits the risk of presenting false trails from the HPG, as mentioned in Section 3. Developing a method to work around this risk, potentially utilizing history depth or some other technique, is an interesting direction. Finally, an expansion of the HPG mining process where individual pages can be assigned a measure of importance in the mining process is desirable. Such an expansion would improve the capability of the consultants to tune the mining process to specific pages in the interactive stories.

References

[1] J. Andersen, A. Giversen, A. H. Jensen, R. S. Larsen, T. B. Pedersen, and J. Skyt. Analyzing clickstreams using subsessions. In Proceedings of the Second International Workshop on Data Warehousing and OLAP.

[2] R. Agrawal and R. Srikant. Mining sequential patterns. In Proceedings of the Eleventh International Conference on Data Engineering, pp. 6-10.

[3] R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In Proceedings of the Fifth International Conference on Extending Database Technology, pp. 3-17.

[4] J. Borges. A Data Mining Model to Capture User Web Navigation Patterns. PhD thesis, Department of Computer Science, University College London.

[5] J. Borges and M. Levene. Data mining of user navigation patterns. In Proceedings of WEBKDD.

[6] J. Borges and M. Levene. Heuristics for mining high quality user web navigation patterns. Research Note RN/99/68, Department of Computer Science, University College London, Gower Street, London, UK.

[7] J. Borges and M. Levene. A fine grained heuristic to capture web navigation patterns. SIGKDD Explorations, 2(1):40-50.

[8] A. G. Büchner, S. S. Anand, M. D. Mulvenna, and J. G. Hughes. Discovering internet marketing intelligence through web log mining. SIGMOD Record, 27(4):54-61.

[9] D. D. Chamberlin, J. Robie, and D. Florescu. Quilt: An XML query language for heterogeneous data sources. In WebDB (Informal Proceedings).

[10] R. Cooley, B. Mobasher, and J. Srivastava. Data preparation for mining world wide web browsing patterns. Knowledge and Information Systems, 1(1):5-32.

[11] R. Cooley, J. Srivastava, and B. Mobasher. Web mining: Information and pattern discovery on the world wide web. In Proceedings of the Ninth IEEE International Conference on Tools with Artificial Intelligence (ICTAI 97).

[12] R. Cooley, P. Tan, and J. Srivastava. WebSIFT: The web site information filter system. In Proceedings of the 1999 KDD Workshop on Web Mining.

[13] Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann.

[14] Ralph Kimball and Richard Merz. The Data Webhouse Toolkit. Wiley.

[15] Mark Levene and George Loizou. A probabilistic approach to navigation in hypertext. Information Sciences, 114(1-4).

[16] Oracle Corporation. Oracle 8.1.5i: Tuning with materialized views. server.815/a67775/ch2.htm.

[17] J. Pei, J. Han, B. Mortazavi-Asl, and H. Zhu. Mining access patterns efficiently from web logs. In Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining.

[18] J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal, and M. Hsu. PrefixSpan: Mining sequential patterns by prefix-projected growth. In Proceedings of the Seventeenth International Conference on Data Engineering.

[19] Arnaud Sahuguet. Kweelt: Querying XML in the new millennium.

[20] Sawmill.

[21] Myra Spiliopoulou and Lukas C. Faulstich. WUM: A Web Utilization Miner. In Proceedings of the Workshop on the Web and Data Bases.

[22] WebTrends LogAnalyzer.

[23] World Wide Web Consortium. W3C httpd Common Log Format.

[24] K.-L. Wu, P. S. Yu, and A. Ballman. SpeedTracer: A web usage mining and analysis tool. IBM Systems Journal, 37.

[25] Zenaria A/S.

A The Zenaria Case

The paper is written in cooperation with the company Zenaria A/S [25]. This section provides some background information about the company and describes the area in which they are in need of a new solution. It also includes some details about adapting the web mining concept to the Zenaria case.

A.1 Current Scenario

Zenaria A/S is in the business of creating interactive stories, mainly in the form of a story told through a series of video-sequences. The story is formed by a user viewing a video-sequence and choosing between some predefined options based on the current video-sequence. Depending on the actual choice of the user, a new video-sequence is shown and new options are presented. The choices of a user will form a complete story - a walkthrough - and the outcome of this walkthrough will reflect the choices made by the individual user. An example of the structure of an interactive story is illustrated in Figure A.1. Stories vary in length, but a typical walkthrough of a story features around 15 video-sequences. A screenshot from a video-sequence is shown in Figure A.2.

Figure A.1: Example of the structure of a story

Stories are typically designed to educate and evaluate employees in a company, and the evaluation is supported by a consultant supervising the walkthrough of each user. Several factors weigh into the evaluation of a single user, but the complete story as formed by the choices, and the time spent from a video-sequence has been viewed until a choice is made, referred to as the decision-time (not to be confused with the dwell-time, see later), are important parameters in evaluating a user. The stories are distributed on individual CD-ROMs and do not at present include any way of storing the information generated by walkthroughs in a central database. The consultant must therefore supervise the walkthroughs of most users to be able to evaluate the employees of a company as a whole.

A.2 Future Scenario

Zenaria A/S would like to change this way of distributing their interactive stories and use the Internet instead. The company does not wish to develop a specialized program or use any existing proprietary program to allow users to access the stories online. They have decided to allow users to access the stories using only a web browser. This will allow users to access the stories at all times and from any computer connected to the Internet. Distributing the stories this way does, however, make the task of the consultant rather impossible. He will no longer be able to supervise all users and will therefore not be able to evaluate the usage of the story as a whole, at least not without some extra information being presented to him. The main task is therefore to present as many parameters as possible to the consultant and thus enable the consultant to perform some of the same evaluation(s) as before. This means, amongst other things, that the walkthroughs of users must be traceable and the decision-time of each user on each video-sequence should also be available.

Figure A.2: Screenshot from a Zenaria story.

A.3 Using Web Usage Mining

Since access to the interactive stories is done entirely through the HTTP protocol, the web log will reflect all walkthroughs of a story as a number of requests forming user sessions. Presenting such information to the consultant and allowing for some kind of querying upon the information would allow the consultant to evaluate the usage of the story without actually being present when the users walk through the story (note that with the previous approach the consultant could visually observe the users; using the Internet, this perspective is lost for the consultant). Designing a system capable of supporting the job of the consultant requires some knowledge of what types of questions the consultant could desire to have answered. Some of the important types are mentioned below.

- Questions concerning common choices of users, e.g., finding a typical path through the story or parts of the story that have been used frequently by the users.
- Questions concerning the decision time of the users, e.g., what constitutes a typical or a minimal walkthrough measured in decision-time.
- Questions focusing on how specific groups of users have used the story, e.g., all men or all team leaders within a company.
- Questions combining elements of the above, which would further improve the usefulness of a web mining tool for the consultant.

Note specifically that the consultant is interested in how the stories are used and not as much in the usage of individual video-sequences. This means that there is a definite need for knowledge discovery on sequences of clicks in the web log.

The task is therefore to create an environment in which the consultant can easily pose these and other questions and have the results returned and illustrated in a manner that supports the consultant in evaluating the use of a story. Answering questions concerning specific groups of users will require the users to specify some information, e.g., demographic data, which can be used by the consultant in a later evaluation. This is not a part of the software currently used by Zenaria, so it must also be supplied by a new framework. Note also that the consultant cannot be presumed to be a technical person, and therefore the tools must present themselves easily and understandably to the consultant.

A.4 Potential Problems

Several specific problems arise when attempting to adapt web usage mining to the problem domain of Zenaria. Some of these are mentioned briefly in the following. Note that some of the problems are inherent to web mining whereas others are specific to the Zenaria case. The solutions chosen for these problems in our implemented prototype are described in Section 4. A small sketch illustrating how several of the problems could be handled on the server side is given at the end of this section.

Estimating Decision-time: As mentioned in Section A.2, tracking the decision-time is important, but transferring this to an Internet setting is rather hard. In any case it will only be possible to create an estimate of this time, since the web log of the server only reflects the time between two requests. Between two requests a lot of other things besides the decision-time take place, namely generating the response, transferring the content to the client (where the streamed video in Zenaria's case is a potential bottleneck), viewing the video-sequence, and transferring the request back to the web server. Therefore some logic handling the estimation of the actual decision-time needs to be implemented, or a scheme in which this uncertainty is handled must be adopted.

Tracking the Chosen Option: Several choices presented after viewing a video-sequence might lead to the exact same next video-sequence (and thus potentially the same web page) but have very different meanings, semantically, when evaluating the walkthrough. This means that the consultant needs to know not only which sequence of pages was viewed, but also which choices were taken leading from one web page to another. This calls for either sending the chosen option along with the request as a parameter (thus identifying the option from the request string recorded in the web log) or solving it using some other kind of server-side logic.

User Identification: Operating only on the web log makes it impossible to identify users uniquely. The only information present is the IP address, which only identifies a computer (and since an IP address can be re-assigned using, e.g., DHCP, in principle it does not even identify an individual computer), not a user. The problem multiplies when computers support several concurrent users making requests to the web server. A similar problem arises when a number of users are located behind a gateway or a firewall, since all requests made behind the gateway will appear in the web log with the IP address of the gateway. Several schemes for solving these problems have been suggested, including using cookie identification and web server logic to identify users [14]. There are also issues concerning privacy on the Internet that will further hinder identification of users, but these are beyond the scope of this paper.

Tracking Requests: Some users access the Internet through a proxy server that stores information locally, away from the web server. Depending on how Zenaria chooses to implement the framework, the fact that requests will not be forwarded all the way to the web server can severely hinder the tracking of walkthroughs from the web log. Some solutions are suggested in [14], including explicitly setting no-cache information on the response to the client and calculating unseen requests.
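Below is a minimal, hypothetical sketch of how the last three problems, together with the estimation of the decision-time, could be handled with server-side logic in a Java servlet. It is not the Zenaria software or the prototype described in Section 4; the class name, the request parameter names, the cookie name, and the overhead estimate are all assumptions made for the example.

import java.io.IOException;
import java.util.UUID;
import javax.servlet.http.Cookie;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class StoryServlet extends HttpServlet {

    protected void doGet(HttpServletRequest request, HttpServletResponse response)
            throws IOException {
        // Tracking the chosen option: the option is sent as a query parameter,
        // e.g. /story?sequence=12&option=3, so it appears in the request string
        // recorded in the web log.
        String sequenceId = request.getParameter("sequence");
        String optionId = request.getParameter("option");

        // User identification: a cookie distinguishes users that share an IP
        // address, e.g. users behind a gateway or firewall.
        String userId = findUserCookie(request);
        if (userId == null) {
            userId = UUID.randomUUID().toString();
            response.addCookie(new Cookie("storyUser", userId));
        }

        // Tracking requests: disable caching so proxy servers forward every
        // request to the web server instead of answering it locally.
        response.setHeader("Cache-Control", "no-cache, no-store");
        response.setHeader("Pragma", "no-cache");
        response.setDateHeader("Expires", 0);

        // ... look up the next video-sequence for (sequenceId, optionId),
        // stream the page to the client, and log the click ...
    }

    private static String findUserCookie(HttpServletRequest request) {
        Cookie[] cookies = request.getCookies();
        if (cookies == null) {
            return null;
        }
        for (Cookie cookie : cookies) {
            if ("storyUser".equals(cookie.getName())) {
                return cookie.getValue();
            }
        }
        return null;
    }

    /**
     * Estimating decision-time: the time between two requests minus a rough
     * estimate of the transfer and viewing time of the video-sequence shown
     * in between. This can only ever be an approximation.
     */
    static long estimateDecisionTimeMillis(long previousRequestTime,
                                           long currentRequestTime,
                                           long transferAndViewingEstimate) {
        return Math.max(0, currentRequestTime - previousRequestTime
                           - transferAndViewingEstimate);
    }
}

With the chosen option carried in the query string, the web log alone suffices to distinguish choices that lead to the same page, and the cookie value can be logged alongside each request to separate users sharing an IP address.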
B Additional Experiments

This section contains more details about the experimental evaluation of the hybrid approach. First, the running time of the BFS-Mine algorithm is presented to the left in Figure B.3. We see that the time used by BFS-Mine is negligible (less than 50 ms) in almost all cases. Only in one case does the time go up to 350 ms, probably due to the garbage collection performed by the Java runtime system. As the other parts of the hybrid approach take several seconds, the time used by BFS-Mine makes no real difference to the overall performance.
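To give an impression of what such a mining step does, the general idea of breadth-first mining over an HPG with a probability threshold can be sketched as follows. This is only an illustration of the general threshold-based mining scheme, not the BFS-Mine implementation measured above; the grammar representation, the class names, and the choice of reporting only maximal strings are assumptions made for the sketch (it also assumes that no cycle consists solely of probability-one productions, so that the expansion terminates).

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;
import java.util.Map;

public class HpgBfsSketch {

    /** transitions.get(a).get(b) is the probability of the production a -> b. */
    private final Map<String, Map<String, Double>> transitions;
    /** startProbabilities.get(a) is the probability of the start production S -> a. */
    private final Map<String, Double> startProbabilities;

    public HpgBfsSketch(Map<String, Map<String, Double>> transitions,
                        Map<String, Double> startProbabilities) {
        this.transitions = transitions;
        this.startProbabilities = startProbabilities;
    }

    /** Returns maximal strings whose derivation probability is at least lambda. */
    public List<List<String>> mine(double lambda) {
        List<List<String>> rules = new ArrayList<List<String>>();
        Deque<Candidate> queue = new ArrayDeque<Candidate>();
        for (Map.Entry<String, Double> start : startProbabilities.entrySet()) {
            if (start.getValue() >= lambda) {
                List<String> path = new ArrayList<String>();
                path.add(start.getKey());
                queue.add(new Candidate(path, start.getValue()));
            }
        }
        while (!queue.isEmpty()) {
            Candidate candidate = queue.poll();
            String lastState = candidate.path.get(candidate.path.size() - 1);
            Map<String, Double> outgoing = transitions.get(lastState);
            if (outgoing == null) {
                outgoing = Collections.emptyMap();
            }
            boolean extended = false;
            for (Map.Entry<String, Double> production : outgoing.entrySet()) {
                double probability = candidate.probability * production.getValue();
                if (probability >= lambda) {        // prune strings below the threshold
                    List<String> longer = new ArrayList<String>(candidate.path);
                    longer.add(production.getKey());
                    queue.add(new Candidate(longer, probability));
                    extended = true;
                }
            }
            if (!extended) {                        // the string cannot be extended further
                rules.add(candidate.path);
            }
        }
        return rules;
    }

    private static final class Candidate {
        final List<String> path;
        final double probability;

        Candidate(List<String> path, double probability) {
            this.path = path;
            this.probability = probability;
        }
    }
}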

Figure B.3: BFS Mining Performance and Non-indexed vs Indexed Click Fact

Next, experiments with indexing the click fact schema are presented. As the keys from each dimension are widely used when joins with the fact table are performed, we have chosen to create indexes on the primary keys of the dimension tables (a sketch of such index definitions is given at the end of this section). The performance gained when comparing a non-indexed version with this indexed version of the click fact schema can be seen to the right in Figure B.3. A performance gain of around 25% is obtained by creating these indexes. This indexing strategy is not necessarily optimal; creating an index on all referenced keys in the fact table could further decrease the query time.

Figure B.4: HPG Initialization Time

A detailed view of the time used to initialize the HPG can be found in Figure B.4. The HPG initialization process is divided into three parts. First, the set of states for the HPG is initialized. Then, the set of productions is initialized, and finally the start production probabilities are calculated. Figure B.4 shows that this last step uses very little time; the remaining initialization time is divided roughly evenly between state set and production set initialization. Again, we believe the spikes in the figure are caused by Java garbage collection. However, the spikes only make a difference of about 200 ms and are thus of no importance for the overall performance.
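A hedged sketch of this kind of indexing, assuming lower-case versions of the table and column names from Figure C.5 and a JDBC connection to the warehouse (the connection URL and the index names are assumptions made for the example), could look as follows.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

public class CreateDimensionIndexes {

    public static void main(String[] args) throws SQLException {
        // One index on the primary key of each dimension table of the
        // click fact schema (Figure C.5).
        String[] indexStatements = {
            "CREATE INDEX url_dim_idx       ON url_dimension (url_key)",
            "CREATE INDEX date_dim_idx      ON date_dimension (date_key)",
            "CREATE INDEX timeofday_dim_idx ON timeofday_dimension (timeofday_key)",
            "CREATE INDEX session_dim_idx   ON session_dimension (session_key)",
            "CREATE INDEX timespan_dim_idx  ON timespan_dimension (timespan_key)"
        };
        Connection connection =
            DriverManager.getConnection("jdbc:postgresql://localhost/weblog");
        Statement statement = connection.createStatement();
        try {
            for (String sql : indexStatements) {
                statement.executeUpdate(sql);
            }
            // The further option mentioned above would also index the foreign
            // keys of the fact table itself, e.g.:
            // CREATE INDEX click_fact_url_idx ON click_fact (url_key)
        } finally {
            statement.close();
            connection.close();
        }
    }
}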

C Data Warehouse Schemas

This section contains examples of DW schemas for the Click Fact, Session Fact, and Subsession Fact approaches. The Click Fact schema is shown in Figure C.5. The Session Fact schema is shown in Figure C.6. The Subsession Fact schema is shown in Figure C.7. For each schema, the tables and their attributes are listed below.

Figure C.5: The Click Fact Schema
- URL Dimension: Url_key, Url_name, Part_of_site
- Date Dimension: Date_key, Day_of_month, Month, Quarter, Year, Day_of_week, Day_of_year, Workday, Holiday
- Click Fact: Url_key, Referrer_key, Date_key, TimeOfDay_key, Session_key, Timespan_key, Number_in_session, Is_first, Is_last, Click_seconds
- TimeOfDay Dimension: TimeOfDay_key, Hour, Minute, Second, Secs_since_midnight, Time_span
- Session Dimension: Session_key, Ip, Login, Start_page, End_page, Session_clicks, Total_session_seconds
- Timespan Dimension: Timespan_key, Seconds

Figure C.6: The Session Fact Schema
- Date Dimension: Date_key, Day_of_month, Month, Quarter, Year, Day_of_week, Day_of_year, Workday, Holiday
- Session Fact: Date_key, TimeOfDay_key, User_key, Start_page, End_page
- TimeOfDay Dimension: TimeOfDay_key, Hour, Minute, Second, Secs_since_midnight, Time_span
- User Dimension: User_key, Ip, Login

Figure C.7: The Subsession Fact Schema
- Url_Sequence Dimension: Url_sequence_key, Url_sequence, Is_last, Is_first, Length, Number_of, Last_url, First_url
- Date Dimension: Date_key, Day_of_month, Month, Quarter, Year, Day_of_week, Day_of_year, Workday, Holiday
- Subsession Fact: Url_sequence_key, Session_key, TimeOfDay_key, Date_key, Timespan_key
- Session Dimension: Session_key, Ip, Login, Start_page, End_page, Session_clicks, Total_session_seconds
- TimeOfDay Dimension: TimeOfDay_key, Hour, Minute, Second, Secs_since_midnight, Time_span
- Timespan Dimension: Timespan_key, Seconds
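As an illustration of how the questions from Section A.3 map onto these schemas, the following sketch queries the Subsession Fact schema (Figure C.7) for the most frequently occurring page sequences of length at least three. It is an example only, not the prototype's actual query; the lower-case table names and the connection URL are assumptions made for the example.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

public class FrequentSubsessions {

    public static void main(String[] args) throws SQLException {
        // Count how often each stored URL sequence of length >= 3 occurs,
        // joining the Subsession Fact table with the Url_Sequence dimension.
        String sql =
            "SELECT us.url_sequence, COUNT(*) AS occurrences " +
            "FROM subsession_fact sf " +
            "JOIN url_sequence_dimension us " +
            "  ON sf.url_sequence_key = us.url_sequence_key " +
            "WHERE us.length >= 3 " +
            "GROUP BY us.url_sequence " +
            "ORDER BY occurrences DESC";

        Connection connection =
            DriverManager.getConnection("jdbc:postgresql://localhost/weblog");
        Statement statement = connection.createStatement();
        ResultSet result = statement.executeQuery(sql);
        while (result.next()) {
            System.out.println(result.getString("url_sequence")
                    + " occurred " + result.getLong("occurrences") + " times");
        }
        result.close();
        statement.close();
        connection.close();
    }
}

Restricting the same query to a specific group of users would only require an additional join with the Session dimension (or with a user dimension extended with demographic data) and a corresponding condition in the WHERE clause.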
