A Data Warehouse/OLAP Framework for Web Usage Mining and Business Intelligence Reporting


A Data Warehouse/OLAP Framework for Web Usage Mining and Business Intelligence Reporting

Xiaohua Hu, College of Information Science, Drexel University, Philadelphia, PA, USA
Nick Cercone, Faculty of Computer Science, Dalhousie University, Halifax, Nova Scotia, Canada

Abstract

Web usage mining is the application of data mining techniques to discover usage patterns and behaviors from web data (clickstreams, purchase records, customer information, etc.) in order to understand and serve e-commerce customers better and improve the online business. In this paper we present a general Data Warehouse/OLAP framework for web usage mining and business intelligence reporting. We integrate web data warehouse construction, data mining, and On-Line Analytical Processing (OLAP) into the e-commerce system; this tight integration dramatically reduces the time and effort required for web usage mining, business intelligence reporting, and mining deployment. Our Data Warehouse/OLAP framework consists of four phases: data capture, webhouse construction (clickstream marts), pattern discovery and cube construction, and pattern evaluation and deployment. We discuss the data transformation operations needed for web usage mining and business reporting at the clickstream, session, and customer levels, describe the problems and challenging issues in each phase in detail, propose plausible solutions, and illustrate them with examples from real websites. Our Data Warehouse/OLAP framework has been integrated into several commercial e-commerce systems, and we believe it will be very useful for developing real-world web usage mining and business intelligence reporting systems.

1. Introduction

Knowledge about customers and an understanding of customer needs are essential for customer retention in a web store for online e-commerce applications, since competitors are just one click away.
To maintain a successful e-commerce solution, it is necessary to collect and analyze customer click behavior at the web store. A web site generates a large amount of reliable data and is a killer domain for data mining applications. Web usage mining can help an e-commerce solution improve up-selling, cross-selling, personalized ads, clickthrough rates, and so on by analyzing clickstream and customer purchase data with data mining techniques. Web usage mining has recently attracted much attention from researchers and e-business professionals, and it offers many benefits to an e-commerce web site, such as:
- Targeting customers based on usage behavior or profile (personalization)
- Adjusting web content and structure dynamically based on the page access patterns of users (adaptive web site)

- Enhancing service quality and delivery to the end user (cross-selling, up-selling)
- Improving web server performance based on web traffic analysis
- Identifying hot areas and killer areas of the web site
We present a general Data Warehouse/OLAP framework for web usage mining and business intelligence reporting in which data mining is tightly integrated into the e-commerce system. Our Data Warehouse/OLAP framework consists of four phases: data capture, webhouse construction (clickstream marts), pattern discovery, and pattern evaluation, as shown in Figure 1. The framework provides the appropriate data transformations (also called ETL: Extraction, Transformation and Loading) from the OLTP system to the data warehouse, builds data cubes from the data warehouse, mines the data for business analysis, and finally deploys the mining results to improve the online business. We describe the problems and challenging issues in each phase in detail and provide a general approach and guidelines for web usage mining and business intelligence reporting in e-commerce. The rest of the paper is organized as follows: in Section 2, we discuss the various data capture methods and some of their pitfalls and challenges. In Section 3, we describe the data transformation operations for web data at different levels of granularity (clickstream, session, and customer level) and show how to organize the dimension and fact tables of the webhouse, which is the data source for web usage mining and business intelligence reporting. We discuss cube construction and various data mining methods for web usage mining in Section 4 and pattern evaluation (mining rule evaluation) in Section 5. We conclude in Section 6 with some insightful discussion.
Data Capture (clickstream, sale, customer, product, etc.) → Data Webhouse Construction (dimensions, fact tables, aggregation tables, etc.) → Mining, OLAP (rules, prediction models, cubes, reports, etc.) → Pattern Evaluation & Deployment
Figure 1: The Data Warehouse/OLAP Data Flow Diagram

2. Data Capture

Capturing the necessary data in the data collection stage is a key step for a successful data mining task. A large part of web data is represented in the web logs collected at the web server. A web log records the interactions between the web server and web users (web browsers). A typical web log (Common Log Format) contains information such as the Internet provider's IP address, the ID or password for access to a restricted area, a time stamp of the URL request, the method of the transaction, the status or error code, and the size in bytes of the transaction. The Extended Log Format includes extra information such as the referrer and agent. Web logs were originally designed to help debug web servers. One of the fundamental flaws of analyzing web log data is that log files contain information about the files transferred from the server to the client, not information about the people visiting the web site

[9,19]. Some of these fields are useless for data mining and are filtered out in the data preprocessing step. Others, such as the IP address, referrer, and agent, can reveal much about the site visitors and the web site. Mining a web store often starts with the web log data, which must go through a set of transformations before data mining algorithms can be applied. In order to have a complete picture of the customers, web usage data should include the web server access log, browser logs, user profiles, registration data, user sessions, cookies, user search keywords, and user business events [1,9,14]. Based on our practice and experience in web usage mining, we believe that web usage mining requires the conflation of multiple data sources. The data needed to perform the analysis should come from five main sources: (1) the web server logs recording the visitors' clickstream behavior (page template, cookie, transfer log, time stamp, IP address, agent, referrer, etc.); (2) product information (product hierarchy, manufacturer, price, color, size, etc.); (3) content information of the web site (images, GIFs, video clips, etc.); (4) customer purchase data (quantity of products, payment amount and method, shipping address, etc.); (5) customer demographic information (age, gender, income, education level, lifestyle, etc.). Data collected at a typical web site fall into different levels of granularity: page view, session, order item, order header, and customer. A page view carries information such as the type of the page and the duration spent on it; a session consists of a sequence of page views; an order contains a number of order items. It is best practice in the data collection phase to collect the finest-grained and most detailed data possible describing the clicks on the web server and the items sold at the web store.
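As a concrete illustration, a Common Log Format record of the kind described above can be parsed with a regular expression before any filtering or transformation. This is a minimal Python sketch; the log line and field names are illustrative, and real logs vary between servers:

```python
import re

# A hypothetical Common Log Format line with Extended Log fields appended
# (referrer and agent); all values are made up.
LOG_LINE = (
    '203.0.113.7 - jdoe [10/Oct/2000:13:55:36 -0700] '
    '"GET /product/view.html HTTP/1.0" 200 2326 '
    '"http://referrer.example/index.html" "Mozilla/4.08"'
)

# host, ident, user, timestamp, request, status, bytes, then the optional
# Extended Log Format fields: referrer and agent.
CLF_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)")?'
)

def parse_log_line(line):
    """Return the log fields as a dict, or None if the line is malformed."""
    m = CLF_PATTERN.match(line)
    return m.groupdict() if m else None

hit = parse_log_line(LOG_LINE)
print(hit["host"], hit["status"], hit["referrer"])
```

Each parsed record then becomes one candidate row at the page-view grain, to be filtered and sessionized in the later steps.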
Each web server will potentially report different details, but at the lowest level we should be able to obtain a record for every page hit and every item sold if we want a complete portfolio of the click behavior and sales situation of the web store. There are various methods to capture and collect valuable visitor information for e-commerce at the server level, proxy level, and client level, through CGI interfaces, Java APIs, or JavaScript [1,9,14]. Most of them use web log data or packet sniffers as the data source for the clickstream. Web log data are not sufficient for data mining purposes for the following main reasons: (1) they cannot identify sessions; (2) they lack web store transaction data; the web store transaction log records all sale-related information of a web store and is necessary for business analysis and data mining in order to answer basic and important business questions such as "Which referrer site leads to more product sales at my site?", "What is the conversion rate of the web site?", and "Which parts of my web site are most attractive to purchasers?"; (3) they lack the business events of the web store; business events such as adding an item to the shopping cart, search key events, and abandoning the shopping cart are very useful for analyzing the shopping and browsing behavior of users. In our framework, we believe that collecting data at the web application server layer is the most effective approach, as suggested by some commercial vendors [9,14]. The web application server controls all user activities, such as registration and logging in/out, and can create a unified database to store web log data, sale transaction data, and the business events of

the web site. A discussion of these methods is beyond the scope of this paper; interested readers may refer to [9,14]. There are challenging issues in the data capture phase for web usage mining. Three problems are the most common, encountered in almost all web usage mining projects, and they have a huge impact on the success or failure of such projects: (1) how to sessionize the clickstream data; (2) how to filter crawlers' sessions; and (3) how to gather customers' information. Below we discuss each of them in detail.

2.1 Session Data

A user web session is a sequence of consecutive page views (hits) before the user explicitly logs out or times out. A user who visits a site in the morning and then again in the evening counts as two user visits (sessions). Because of the statelessness of HTTP, clickstream data is just a sequence of page hits, and a page hit may be an isolated event that is hard to analyze without its context. To make raw clickstream data usable in web usage mining, the clickstream needs to be collected and transformed in such a way that it has a session perspective. Thus the first task after data collection is to identify the sessions in the clickstream (sessionizing the clickstream). In some web usage mining systems, individual log entries are aggregated during preprocessing into server sessions according to the IP address and agent information, and new sessions are identified using a 30-minute intersession timeout period [23,24]. Within each session, the log entries are grouped into separate requests, where each request may correspond to an individual user click or a search event. Nonetheless, there are some serious problems with processing the data this way.
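The timeout-based sessionization just described can be sketched as follows. This is an illustrative Python sketch, not a production implementation; it assumes each hit carries an IP address, user agent, timestamp, and URL under the hypothetical field names shown:

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # the intersession timeout from [23,24]

def sessionize(hits):
    """Group page hits into sessions keyed by (IP, agent), starting a new
    session whenever the gap between consecutive hits of the same visitor
    exceeds the 30-minute timeout. Each hit is a dict with 'ip', 'agent',
    'time' (a datetime), and 'url'; the field names are illustrative."""
    sessions = []
    open_session = {}   # (ip, agent) -> index of that visitor's open session
    for hit in sorted(hits, key=lambda h: h["time"]):
        key = (hit["ip"], hit["agent"])
        idx = open_session.get(key)
        if idx is None or hit["time"] - sessions[idx][-1]["time"] > SESSION_TIMEOUT:
            sessions.append([hit])              # start a new session
            open_session[key] = len(sessions) - 1
        else:
            sessions[idx].append(hit)           # continue the open session
    return sessions
```

For example, two hits five minutes apart fall into one session, while a third hit from the same visitor hours later opens a second session. The problems discussed next (dynamic IPs, shared firewall addresses) are exactly the cases where this (ip, agent) key breaks down.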
Many Internet users connect through an Internet Service Provider (ISP), so their IP address may be assigned dynamically and it is very likely that the same user will have a different address in different sessions [6,7,14]. Another problem is that users behind a firewall can all share the same IP address, which makes the IP address unsuitable as an identification variable for such sessions. Recognizing the limitations of relying on the IP address, many web sites use cookies as a workaround to sessionize the clickstream. A cookie is a mechanism that allows the web server to store its own information about a user on the user's hard drive. It is a small file that a web server sends to a web user and stores on his computer so that it can remember something about that user at a later time. The location of the cookies depends on the browser: Internet Explorer stores each cookie as a separate file under a Windows subdirectory, while Netscape stores all cookies in a single cookies.txt file. Sites often use cookies to store customization settings or user demographic data. The main purpose of cookies is to identify users and possibly prepare customized web pages for them. If cookies are turned on, the user sends the cookie back to the web server each time his browser opens one of its web pages, and the web server can identify the requesting user's computer unambiguously. All hits with the same cookie are then treated as one session until the user explicitly logs out or times out. In some situations, out of privacy concerns, users choose to turn cookies off; the web site then needs to use the login id, referrer, and agent information, if possible, to identify users and server sessions [9,23].
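The fallback order of identifiers described in this section (cookie first, then login id, then IP address plus agent) can be sketched as a small helper; the field names are hypothetical:

```python
def visitor_key(hit):
    """Pick the most reliable visitor identifier available for a hit, in
    order of preference: cookie id, then login id, then IP address plus
    user agent. The field names are illustrative; cookies may be absent
    when the user has turned them off."""
    if hit.get("cookie_id"):
        return ("cookie", hit["cookie_id"])
    if hit.get("login_id"):
        return ("login", hit["login_id"])
    # Weakest identifier: breaks for dynamic IPs and shared firewalls.
    return ("ip_agent", hit["ip"], hit["agent"])
```

A sessionizer can then key its open sessions on `visitor_key(hit)` instead of the bare (IP, agent) pair, so that cookie-enabled visitors are tracked reliably even when their IP address changes.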

2.2 Crawler Sessions

A crawler is a software agent that traverses web sites by following the links in web pages. Search engines use crawlers to index web pages; crawlers can also help users gather information such as prices for certain products, and help web designers diagnose web site problems (such as response time or isolated web pages). Most crawlers adopt a breadth-first retrieval strategy to increase their coverage of the web site. In our experience with data from some web sites, at times up to 30% of site clickstream session traffic may come from crawlers; these sessions are called crawler sessions. Crawler sessions may mislead data mining analysis and generate inaccurate or incorrect results if they are not filtered out. For example, when an association rule mining algorithm is used to find the page click orders in a session, as pointed out in [1,4,10,23], it may inadvertently generate frequent item sets involving web pages from different page categories. Such spurious patterns may lead an analyst of an e-commerce site to believe that web surfers are interested in products from various categories when in fact crawlers induce those patterns [9,23]. This problem can be avoided if crawler sessions are removed from the data set during data preprocessing; identifying crawler sessions is therefore very important for web usage mining. There are a few ways to identify a crawler session. In [23], a classification model is built to identify such sessions. Crawler sessions tend to have some of the following characteristics: images turned off, empty referrers, visits to the robots.txt file, very short page duration times, a depth-first or breadth-first traversal pattern of the site, and no purchases [6]. Some web sites adopt the approach of placing an invisible link on a page; since only crawlers follow invisible links (regular users cannot click them), sessions containing invisible links are considered crawler sessions.
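The heuristic characteristics listed above can be combined into a simple rule-based filter. This is an illustrative sketch, not the trained classification model of [23]; the feature names and the two-signal threshold are assumptions:

```python
def is_crawler_session(session):
    """Flag a session as a crawler based on the rule-of-thumb signals
    described in the text. `session` is a dict of precomputed per-session
    features; names and thresholds are illustrative, not tuned."""
    if session.get("followed_invisible_link"):
        return True  # only crawlers follow invisible links
    signals = [
        session.get("visited_robots_txt", False),
        session.get("empty_referrer", False),
        session.get("images_off", False),
        session.get("avg_page_seconds", 999.0) < 1.0,  # very short durations
    ]
    # Treat two or more crawler signals as sufficient evidence.
    return sum(signals) >= 2

# Filtering during preprocessing then reduces to a comprehension:
# human_sessions = [s for s in sessions if not is_crawler_session(s)]
```

In practice such a filter would be validated against labeled sessions, and a proper classifier (as in [23]) used where the heuristics are too coarse.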
2.3 Customer Demographics (Offline Data)

Retaining customers and increasing sales is the only way for an e-commerce web store to survive in this very competitive online market. To retain customers, you need to understand their needs and preferences. As pointed out in [7,11,17], fostering and promoting repeated sales requires knowledge about customers' preferences, consumption rate, behavior, and lifestyle. This knowledge generally requires knowing items such as a customer's income, age, gender, and life style. To find the best way to reach its customers and increase sales, a company needs to enrich the clickstream with this offline information. The use of demographics, psychographics, property information, household characteristics, individual characteristics, and lifestyle data has helped database marketing professionals improve sales, retain customers, and acquire new customers for bricks-and-mortar stores for decades. This information should also be used in a web store to enhance the vast amount of customer and clickstream behavior already captured at the website. In the web store, customer information can be collected through a registration form, which is often limited. Some web sites offer incentives to users to encourage them to register or answer a set of questions. The problem is that users tend not to give the information, or provide inaccurate information, in registration forms. Fortunately, there are many commercial marketing database vendors that collect this information based on zip codes or physical addresses. This information should be integrated with the web data for additional insight into the identity, attributes, lifestyles, and behaviors of the web site's visitors and customers [17]. There are several sources of demographic information at various levels, such as CACI, Acxiom, and Experian, to name a few. CACI provides neighborhood demographics;

Acxiom gives household-level psychographics; and Experian provides the MOSAIC targeting system, which identifies consumers according to the type of neighborhood in which they live [17]. These external offline demographics can tell you who your online visitors and customers are, where they live, and consequently how they think, behave, and are likely to react to your online offers and incentives. Database marketers have used this information for years to segment their customers and potential prospects. The demographic and socioeconomic profiles are aggregated from several sources, including credit card issuers, county recorder offices, census records, and other cross-referenced statistics [17]. When analyzing and mining customer demographics from web data, the privacy of the customers should always be kept in mind: profiling customers is harmful when web sites fail to do it anonymously.

3. Data Webhouse Construction

A data warehouse provides the data source for online analytical processing and data mining. Designing a proper data warehouse schema and populating it with data from the OLTP system is very time consuming and complex. A well-designed data warehouse feeds the business with the right information at the right time so that the right decisions can be made in an e-commerce system [20,21,9]. In Section 2, we discussed data capture methods for the web site, which collect the clickstream, sales, customer, shipment, payment, and product information. These are online transaction data and are stored in the transaction database system (OLTP). The database schemas of the OLTP system are based on E-R modeling, normalized to reduce redundancy, and designed to maintain atomicity, consistency, and integrity so as to preserve speed and efficiency for day-to-day business operations such as inserting, updating, and deleting a transaction.
An OLTP query normally accesses only a small set of records in the database but demands a very quick response. For web usage mining purposes, we need a database schema (called a data warehouse) designed to support decision making and data analysis (On-Line Analytical Processing). Typical relational databases are designed for on-line transaction processing (OLTP) and do not meet the requirements of effective on-line analytical processing; as a result, data warehouses are designed differently from traditional relational databases. Data warehouses use OLTP data for historical, read-only analysis. The data in a data warehouse are normally organized with multidimensional modeling in a star schema (fact tables plus the surrounding dimension tables). The requirements of clickstream data make the warehouse schema design even more complicated: the web challenges the current view of the data warehouse with multiple new requirements [11]. The data warehouse is required to make the customer clickstream available for analysis, so a new term, webhouse, was coined by Ralph Kimball [11,12]. A webhouse plays an integral role in the web revolution as the analysis platform for all the behavioral data arriving from the clickstream, as well as for the many web sites that rely on the data warehouse to customize and drive the end user's web experience in real time [11]. We use webhouse to refer to the data warehouse system for web usage mining. The webhouse is the data source for data mining and business intelligence reporting in the Data Warehouse/OLAP framework, and it contains the fundamental business content of what a web store sells, together with its web services and capabilities. A webhouse should allow you to analyze all hits on a web site and all products sold in the web store from many viewpoints.
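To make the star schema idea concrete, the following sketch builds a toy fact table with two dimension tables in SQLite (via Python's sqlite3) and runs one OLAP-style roll-up query. All table and column names are illustrative, not the actual webhouse schema, which is developed in Section 3.2:

```python
import sqlite3

# A minimal star schema: one fact table with foreign keys into two
# dimension tables. Names and sample rows are illustrative only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE page_dim (page_id INTEGER PRIMARY KEY, page_category TEXT);
CREATE TABLE date_dim (date_id INTEGER PRIMARY KEY, day TEXT, weekend_flag INTEGER);
CREATE TABLE clickstream_fact (
    page_id INTEGER REFERENCES page_dim(page_id),
    date_id INTEGER REFERENCES date_dim(date_id),
    page_view_time INTEGER            -- seconds spent on the page
);
INSERT INTO page_dim VALUES (1, 'Product Info'), (2, 'Order Form');
INSERT INTO date_dim VALUES (10, '2024-01-06', 1), (11, '2024-01-08', 0);
INSERT INTO clickstream_fact VALUES (1, 10, 30), (1, 11, 45), (2, 11, 120);
""")

# An OLAP-style roll-up: total viewing time per page category.
rows = conn.execute("""
    SELECT p.page_category, SUM(f.page_view_time)
    FROM clickstream_fact f JOIN page_dim p ON f.page_id = p.page_id
    GROUP BY p.page_category ORDER BY p.page_category
""").fetchall()
print(rows)   # [('Order Form', 120), ('Product Info', 75)]
```

The fact table holds only keys and numeric measures at the chosen grain (one row per page hit here); the textual descriptions live in the dimension tables, so analytical queries are joins from the fact table out to the dimensions followed by a GROUP BY roll-up.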
Many systems have been developed to mine web log records and can find association patterns and sequential patterns in web accesses, but in order to understand customers (for example, repeat visitors vs. single visitors, or single-purchase vs. multiple-purchase customers), it is necessary to include additional information such as order information from

the web store, product information about the products, user browsing sequences from the clickstream, and customer information from the user table. Below we discuss the requirement analysis and the dimensional modeling technique used to design the webhouse.

3.1 Requirement Analysis of the Webhouse

It is necessary to build a comprehensive view of the immense stream of clicks arriving at the web site, including the items sold through the site. We want to build a webhouse that provides insightful information and answers the important business questions of e-commerce. The design of a webhouse starts with requirement analysis. We spent significant time interviewing our clients, business analysts, engineers/developers, and end users to gather their requirements and the kinds of business problems they hope the webhouse will answer. Their questions cover a wide range of areas:
- Web site activity (hourly, daily, weekly, monthly, quarterly, etc.)
- Product sales (by region, by brand, by domain, by browser type, by time, etc.)
- Customers (by type, by age, by gender, by region, buyer vs. visitor, heavy buyer vs. light buyer, etc.)
- Vendors (by type, by region, by price range, etc.)
- Referrers (by domain, by sale amount, by visit numbers, etc.)
- Navigational behavior patterns (top entry page, top exit page, killer page, hot page, etc.)
- Click conversion ratio
- Shipments (by regular or express mail, etc.)
- Payments (by cash, by credit card, by e-money, etc.)
Some of the important questions are:
- Who are my most profitable customers?
- What is the difference between buyers and non-buyers at my site?
- Which parts of my site attract the most visits?
- Which part of my site is a session killer?
- Which parts of the site lead to the most purchases?
- What is the typical click path that leads to a purchase?
- What is the typical path of customers who abandoned the shopping cart?
- What percentage of customers visit the product section?
- What is the new-visitor click profile?
- What are the top/bottom products?
- What are the peak traffic hours?
We analyze these questions and determine the dimensions that need to be constructed and the fact measures the business analysts are interested in. After identifying the dimensions and measures, we can move to the next step: webhouse schema design.
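The output of this requirement analysis can be captured as a simple mapping from business questions to candidate dimensions and measures. The entries below are illustrative examples drawn from the question lists above, not a complete mapping:

```python
# Hypothetical requirement-analysis output: each business question mapped
# to the dimensions and fact measures needed to answer it.
QUESTION_MAP = {
    "Which parts of my site attract the most visits?":
        {"dimensions": ["Page"], "measures": ["hit count"]},
    "What are the peak traffic hours?":
        {"dimensions": ["Time"], "measures": ["hit count"]},
    "Who are my most profitable customers?":
        {"dimensions": ["User"], "measures": ["profit", "revenue"]},
    "Which referrer site leads to more product sales?":
        {"dimensions": ["Session (referrer)", "Product"],
         "measures": ["quantity sold", "revenue"]},
}

# Collect the distinct dimensions the webhouse must therefore provide.
needed = sorted({d for q in QUESTION_MAP.values() for d in q["dimensions"]})
print(needed)
```

Walking every gathered question through such a table gives the complete set of dimensions and measures that the schema design in the next section must support.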

3.2 Webhouse Schema Design

In the webhouse there are one or a few fact tables and a set of smaller tables called dimension tables. The fact table is where the numerical measurements of the business are stored; each of these measurements is taken at the intersection of all dimensions. The dimension tables are where the textual descriptions of the dimensions of the business are stored [11]. There are several methodologies for designing a data warehouse, such as the architecture-based methodology proposed by Anahory and Murray [2] and the four-step methodology used by Ralph Kimball. To construct the webhouse for the Data Warehouse/OLAP framework, we adopted Kimball's methodology and built the webhouse with dimensional modeling techniques. The four steps are: (1) define the source of data, (2) choose the grain of the fact tables, (3) choose the dimensions appropriate for the grain, and (4) choose the facts appropriate for that grain. Below we discuss each step in detail.

Define the Source Data

Since we wish to analyze the click behavior and sales situation of an online web store, we need data for every web hit on the site and every item sold there. Data are collected at the page request level (clickstream) or, for purchases, at the order item level, and all of this information is already available in the transaction database (OLTP). In the data collection phase, we have collected every page view, purchase record, and customer record in the web database system, which is the data source for our webhouse. We need to extract these from the transaction database and transform them into the webhouse according to the design of the dimension and fact tables described later. In addition, we also need product, user, page, time, payment, shipping, and promotion information.

Choose the Grain of the Fact Tables

The fact table is the center of the webhouse.
It contains a list of all measures and points to the key value of the lowest level of each dimension. The lowest level of each dimension table, together with the business problems and domain, determines the granularity of the fact table. Before the fact tables can be designed in detail, a decision must be made as to what an individual low-level record in each fact table means; this is the grain of the fact table [11]. In order to analyze the clickstream, every page hit should have a row in the clickstream fact table, which defines the grain of the clickstream. To analyze the sales business of the web store, every item sold should have a row in the order item fact table; thus the grain of the item fact table in the webhouse is every item sold.

Choose the Dimensions Appropriate for the Grain

Dimensions are qualifiers that give meaning to measures. They organize the data based on the what, when, and where components of a business question. Dimensions are stored in dimension tables made up of dimensional elements and attributes. Each dimension is composed of related items, or elements, arranged in a hierarchy, and each element represents a different level of summarization. For example, products roll up to subcategories, which roll up to categories (which in turn roll up to departments, etc.). The lowest level in the hierarchy is determined by the lowest level of detail required for the analysis; levels higher than the base level store redundant data. This denormalized table reduces the number of joins required for a query, and makes it easier for users to start querying at a higher level and to drill down to lower levels of detail as needed. All of the elements relating to the product, for example, would comprise the product dimension. This

allows the user to query for all categories, and drill down to the subcategory or product level for more detailed information. Below we discuss some of the important dimensions in the webhouse. (Discussion of some other dimensions, such as the business event dimension and the promotion dimension, is omitted because of space limitations; these dimensions are almost the same in the webhouse as in a traditional data warehouse.)

Session Dimension

The session dimension is more than just a tag that groups together all the page events constituting a single user's session; it is the place where we label the session and trace its activity [11] in order to describe the session's characteristics. The characteristics of a session should include: session length, the total page requests of the session, the first and last pages of the session, the referrer of the session, cookie id, user agent, client host, the first and last request times, total browsing time of the session, average viewing time per page, and session visit count. We may also need to characterize sessions as sessions with purchase, sessions without purchase, random browsing sessions, crawler sessions, etc. With this information, we are able to answer business questions such as: on which pages do customers enter my site (top first pages of sessions) and where do they leave (top last pages of sessions)? What are the characteristics of sessions that lead to a purchase?

Page Dimension

Site area analysis is very important in order to understand which parts of the web site attract most of the hits, which parts lead to a purchase, which part of the site is a session killer, and which parts are less visited and superfluous. The page dimension should contain meaningful context that tells the analyst the user's web site location.
Each web page must contain some simple descriptors identifying the location and type of the page, such as "Login," "Registration," "Hot Product," "Product Info," "Company Info," "Frequently Asked Questions," and "Order Form" [12]. A large web site should have a hierarchical description associated with each page that gives progressively more detail about what constitutes the page. This information needs to be stored in the page dimension and maintained consistently as the web site is updated and modified. A page dimension should also contain information such as the page template, page category, number of images, and banners on the page.

Time Dimension

The time dimension is very important in every data warehouse, because every fact table in the data warehouse is a time series of observations of some sort. In traditional data warehouses the time dimension is at a daily grain, but for the webhouse the granularity is finer; we have seen webhouses that record at the hourly grain or even the minute level. A date column in a relational table normally has the format year, month, day, hour, minute, and seconds (YYYYMMDD::HHMMSS), and we need to create new attributes representing day of week, day of year, and quarter from the date column. Since, in a web environment, we are analyzing both clickstream behavior and sales, it makes perfect sense to have two time hierarchies. One is more or less the traditional time dimension in the data warehouse: the date, related to day, week, month, quarter, and year (data transformation functions may be needed to construct new attributes and properties such as weekday, weekend, and holiday season), which is useful for comparing sales across days, months, quarters, or years. The other time hierarchy is the time of day, related to a specific spot within a day: the hour and minute in the

day (some useful derived attributes are early morning, late afternoon, evening, working hours, lunch break, etc.). This time hierarchy is useful for site traffic analysis.

User Dimension

To obtain good customer profiles, variables describing the characteristics of the customer should be added. If available, this information comes from a data warehouse where all customer characteristics and historical information about click behavior are stored. To combine this information with the transaction data, users must identify themselves when visiting the web site so that the cookie id can be matched with their names and the transactional data can be merged with customer-relevant data. The customer dimension should contain information such as name, addresses, gender, age, demographics, and lifestyle. Identifying the user is very important for distinguishing different types of visitors to the web site. In the user dimension we need to label users as single visitors, repeat visitors, visitors with a single purchase, visitors with multiple purchases, or most profitable customers based on the amount they spend. Based on the user dimension information, we should be able to answer business questions related to the different user types.

Product Dimension

The product dimension describes the complete portfolio of what the web site sells online, and the information varies between online stores; for example, Amazon.com has a much larger product dimension than an online bank. Normally the product dimension should contain information such as the product key, SKU description, product properties (weight, size, color, package type, etc.), brand, subcategory, department, price, manufacturer, and warranty information.

Choose the Facts Appropriate for That Grain

Choosing the appropriate fact measures for the grain of the fact table depends on the business objective and analysis purposes. For the clickstream fact, we can choose the time (number of seconds) the user spent on each page.
For the order fact table, we can choose revenue, profit, cost, quantity, and other measures. The star schema for the webhouse is constructed as shown in Figure 2.

Data Transformation

Creating a warehouse is not enough, because much important information is not in the data warehouse yet. For example, for a session, it is essential to know the number of pages, the time spent, and whether the session leads to a purchase or not. For a customer, it is necessary to create attributes such as whether the customer is a repeat visitor, a heavy spender, or an occasional shopper. These new attributes need to be created/derived from existing database columns to make data mining and reporting easier or even possible. There are two sets of transformations that need to take place: (1) data must be transferred from the OLTP systems to the OLAP systems; (2) data may need to go through some transformation or conversion to create new values which are not explicitly represented in the data warehouse. The first set of transformations is relatively stable and straightforward; there are many ETL tools on the market for this purpose [10]. The second set of transformations presents a significant challenge for web usage mining, since many of these transformations are tied to the application domain and business goals. Typically in the web warehouse, the data are collected at the clickstream level. For data mining and business intelligence reporting purposes, the data in the data warehouse need to be transformed or aggregated to different levels of granularity (session level, order-header level, or customer level) depending on the

Clickstream Fact Table: BusinessEvent_id, Session_id, Time_id, User_id, Page_id, Date_id, Product_id, Page_view_time, Quantity_ordered, and many more

Order Item Fact Table: Order_item_id, Session_id, Time_id, User_id, Page_id, Date_id, Product_id, Promotion_id, Order_item_price, QuantitySold, TotalCost, Profit, Revenue, and many more

Session Dimension: Session_id, Session_length, Referrer, Agent, Session_host_name, Session_IPAddress, Cookie_id, Client_host, First_request_time, Last_request_time, Total_time_spent, Average_time_per_page, Session_customer_id, Session_visit_count, SessionWPurchase_flag, RandomBrowsing_flag, CrawlerSession_flag, Sessiontimeout_flag, and many more

User Dimension: User_id, City, State, Country, Gender, Age, Profession, Education_level, Marital_status, Phone_#, Repeat_visitor_flag, Frequent_purchase_flag, Heavy_spender_flag, Reader/Browser_flag, #OfKids, House_income, and many more

Date Dimension: Date_id, Day, Week, Month, Quarter, Year, Day_#_in_month, Day_#_in_quarter, Day_#_in_year, Week_#_in_month, Week_#_in_quarter, Week_#_in_year, Weekday_flag, Weekend_flag, Holiday_flag, Season, and many more

Time Dimension: Time_id, Second, Minute, Hour, EarlyMorning_flag, LateAfternoon_flag, LunchTime_flag, DinnerTime_flag, LateEvening_flag, and many more

Business Event Dimension: BusinessEvent_id, BusinessEventType, BusinessEventDesc, Search_key_flag, Shopping_cart_flag, and many more

Page Dimension: Page_id, PageTemplate, PageLocation, PageType, PageCategory, PageDescription, Registration_page_flag, Shipping_page_flag, Checkout_page_flag, NumOfProducts, NumOfImage, NumberOfBanner, and many more

Product Dimension: Product_id, SKUDescription, Brand, SubCategory, Dept, Size, Color, Weight, Price, Manufacturer, Warranty_info, and many more

Promotion Dimension: Promotion_id, PromotionName, PriceReductionPct, AdvType, CouponType, BeginDate, EndDate, Promotion_cost, Promotion_region, and many more

Figure 2: Star Schema of the Webhouse

mining and reporting goals. For example, if the analyst is interested in the difference between sessions with and without purchases, then transformation/aggregation operations need to be performed to convert clickstream data to the session level. If she wants to understand the customers, such as what the characteristics of the most profitable customers are, then the data need to be transformed/aggregated further from the session level to the customer level. There are 3 types of transformations in the web usage mining context:

1. Generalizing/extracting primitive values into higher-level values. For example, the referrer column for each click has too many different values, but some useful information is embedded in it, so it is useful to create new columns from it, such as the host of the referrer and the domain of the referrer. Similarly, new domain and host columns can be created from ISPs and customers' e-mail addresses.

2. Grouping/summarizing information from multiple columns. For example, in the customer preference survey, there are columns such as Prefer Basketball, Prefer Football, and Prefer Baseball corresponding to the customer's first, second, and third preferred sports. For mining or reporting purposes, it is better to generate a new column that summarizes the customer's sport preference.

3. Inferring information not directly available from existing database columns. For example, to build a picture of a customer's product page views, we need to know whether a click record is a product page view from Brands, which is not directly available. This information can be inferred from the Template and Referrer columns.

Based on our experience, below are some of the typical data transformation operations we have found to be popular and useful for web usage mining and reporting.
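As a concrete sketch of the first transformation type (generalizing a primitive referrer URL into higher-level host and domain columns) — the example URL is hypothetical:

```python
from urllib.parse import urlparse

def referrer_features(referrer: str) -> dict:
    # extract the host of the referrer, e.g. "www.yahoo.com"
    host = urlparse(referrer).hostname or ""
    # take the last dot-separated component as the domain, e.g. "COM"
    domain = host.rsplit(".", 1)[-1].upper() if "." in host else ""
    return {"referrer_host": host, "referrer_domain": domain}

print(referrer_features("http://www.yahoo.com/search?p=fat+boy"))
```

The same pattern applies to deriving domain and host columns from ISP or e-mail address strings.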
(1) Click Level Transformation

Transformation Name | Transformation Description | Result Type
Referrer indicator for a product page | Creates an indicator variable for the referrer of an arbitrary product page. Product detail page views are important information for a webstore. Within a webstore, a visitor can reach a product page from different places depending on how the webstore is designed (e.g., ViewfromHotArea, ViewfromGifts). To analyze the clickstream data, it is helpful to know which area each product page view comes from; this is defined based on the Type, Template, and Referrer columns. | Boolean
Page view time | The number of seconds that a person spends viewing a page | Double
Credit card indicator (MasterCard, Visa, AMX, etc.) | Indicates which type of credit card the transaction was completed with | Boolean
Decode the query string | Returns the search arguments the customers typed while they surfed the web site | String
Path of session | Pulls the templates of the session into a long string | String
Detailed path of the session | Similar to the operation above, except it returns the detailed dynamically generated pages | String
Last page of the session | Returns the last page of the session | String
First page of the session | Returns the first page of the session | String
Clickarea | Tells which area a click is on, or None for a non-click | Boolean

Click tags | Checks whether a click is a giftbox view or shopping cart view | String
Purchases of products that appear on Whats_hot pages | It is very useful to know who bought products from the Whats_hot pages, or what products were bought from the Whats_hot pages. However, this is very hard to do without an event log. What can be done is to find purchases of products that appear on Whats_hot pages; note that these products may appear on other pages and customers can buy them there. | Boolean
When did a customer fill the registration (survey) form? | A web site normally has an optional registration form that contains some survey questions. Knowing the answers to these questions can help to understand customers better (before any purchases, or after purchases). | Boolean

The above transformations can capture a lot of essential information for reporting, help business analysts understand and improve the website's performance and function, and increase customer satisfaction. For example, using the decode-query-string transformation, we can capture the top 10 failed search keywords from the customers, as shown in Table 1 from a real online motor store. "Fat boy" and "Chrome" are the most popular items the customers are looking for. The store manager can then decide to add these items to the webstore if a lot of customers show interest in them.

Search String | # Of Searches
Fat boy | 1566
Chrome | 791
Motorclothes | 443
Gtype Fuel tank | 325
G-sportster | 280
maintenance | 260
C-sidecar | 210
sissy bar | 175
seat | 169
touring | 163

Table 1: Top 10 Failed Searches

(2) Session Level Transformation

Transformation Name | Transformation Description | Result Type
Customer browser name | Returns a string containing the browser's name from the useragent; all unknown browser names are grouped into "Others" | String
Browser release | The release number of the browser given the useragent string. The main release number is for Mozilla. | String
It will contain the release number for MS Internet Explorer inside ( ) if the browser is an IE browser, and AOL with its release number inside [ ] if the browser is an AOL browser.
Browser OS | The OS running the browser | String
Returned visitor | True if the user is a returned visitor | Boolean
Session Length | The total clicks of this session | Integer

Long session | Indicates whether the session is a long one or not (more than 6 clicks) | Boolean
Short session | Indicates whether the session is a short one or not (1 or 2 clicks) | Boolean
Session duration | The total time spent on this session | Double
Referrer host | Host of the referrer | String
Referrer domain | Domain of the referrer | String
Url site | Returns the url site such as YAHOO, Excite, and so on | String
ISP host | Internet Service Provider host | String
What day it is of the first visit | A number to indicate which day it is for the first visit | Double
What day it is of the last visit | A number to indicate which day it is for the last visit | Double
Is the visit a weekend | Indicates whether the visit happens on a weekend or not | Boolean
Is the visit a weekday | Indicates whether the visit happens on a weekday or not | Boolean
Any purchase on this session | Indicates whether the session leads to any purchase or not | Boolean
Purchase amount in different areas | In addition to the numbers of product detail page views that came from different areas, it is also important to know customers' purchase amounts from each area. Since it is hard to trace precisely where a purchase comes from, it can be estimated by distributing the total purchase amount to each area using the proportion of the number of product detail page views from that area. | Double
Purchase quantity in different areas | Similar to customers' purchase amount from each area, it is necessary to know customers' purchase quantity from each area. | Double
HourofDay of the server | Shows the visitor's time from the first request date based on the location of the server | Double
Time period of the day | Based on the HourofDay, more columns can be added to indicate whether the visit time is in the morning, evening, and so on (such as early morning, late evening, lunch time, etc.) | Boolean

Table 2 below shows the top 10 paths of a website without any purchase.
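Many of the session-level attributes above (session length, duration, long/short-session flags, purchase flag, first/last page) amount to one aggregation pass over a session's clicks. A sketch with illustrative field names:

```python
from datetime import datetime

def session_features(clicks: list[dict]) -> dict:
    # clicks: clickstream rows of one session, ordered by time;
    # field names ("time", "page", "order_amount") are assumptions.
    times = sorted(datetime.fromisoformat(c["time"]) for c in clicks)
    n = len(clicks)
    return {
        "session_length": n,
        "session_duration": (times[-1] - times[0]).total_seconds(),
        "long_session_flag": n > 6,    # more than 6 clicks
        "short_session_flag": n <= 2,  # 1 or 2 clicks
        "purchase_flag": any(c.get("order_amount", 0) > 0 for c in clicks),
        "first_page": clicks[0]["page"],
        "last_page": clicks[-1]["page"],
    }

clicks = [
    {"time": "2024-07-02T12:30:00", "page": "main.jsp"},
    {"time": "2024-07-02T12:31:10", "page": "product.jsp"},
    {"time": "2024-07-02T12:35:00", "page": "checkout.jsp", "order_amount": 49.0},
]
print(session_features(clicks))
```

Running this per session converts clickstream-level data into the session-level rows that the reports below are built on.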
These paths can help the website understand customer click behavior and reveal many of the reasons why customers left the website without purchasing. For example, the top path is main.jsp->splash.jsp (14622 sessions): the customers visited main.jsp and then left the website after they clicked splash.jsp. Further analysis by the web designer found that splash.jsp took a while to compile and download, and the animation effect of splash.jsp made a huge portion of the store's contents invisible. This caused a lot of frustration among the customers, so they left. After splash.jsp was removed, the conversion rate improved significantly.

Web Path | Count
main.jsp->splash.jsp | 14622
main.jsp->main.jsp | 3731
main.jsp->main.jsp->main.jsp | 790
main.jsp->login.jsp | 329
main.jsp->hot.jsp->registration.jsp | 303
Login.jsp | 274
main.jsp->survey.jsp | 216
product.jsp | 212
main.jsp->product.jsp | 192
main.jsp->search.jsp |

Table 2: Top 10 Paths Leading to Non-Purchase Sessions

(3) Customer Level Transformation

Transformation Name | Transformation Description | Result Type
Domain | The domain name is the portion of the address after the last period, such as COM, NET, EDU, etc. | String
Hostname | The hostname is the portion of the address after the at sign (@) and before the last period (.) | String
Time zone | Time zone of the customer |
Areacode | Area code of the customer's phone number | String
Country region | Country region of the customer | String
Repeat buyers | Indicates whether the visitor is a repeat buyer or not | Boolean
Single visit customer | Customer visited only once with no purchase | Boolean
Multiple visit customer | Customer visited multiple times but made no purchase | Boolean
Single visit buyer | Customer visited once and made a purchase | Boolean
Multiple visit buyer | Customer visited multiple times and made at least one purchase | Boolean
Profit ratio (average revenue per visit) | Profit ratio is defined as the total number of sales divided by the total number of visits | Double
Propensity-to-purchase ratio | Indicates the likelihood that the visitor is going to purchase something | Double
Things preferred and things really bought | In the survey form, there are questions like "preferred brands," "preferred products," and "special needs." It is valuable information to know the correlation between what a customer prefers and what he/she buys. | String

The customer level transformations create many new columns in the data warehouse to make reporting and data mining easier and more meaningful at the customer level. For example, identifying whether a customer is a single visitor, a buyer, a repeat buyer, etc. is very important for the webstore. Table 3 can reveal how many customers are loyal customers, occasional shoppers, or just pure visitors.
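The visitor/buyer labels feeding such a breakdown can be derived from per-customer visit and purchase counts. A minimal sketch (label names follow the table above; the decision logic is an illustrative assumption):

```python
def customer_label(visits: int, purchases: int) -> str:
    # Derive a customer-level label from session-level aggregates.
    if visits == 0:
        return "Unknown"
    if purchases == 0:
        return "Single Visit" if visits == 1 else "Multiple Visit"
    return "Single Visit Buyer" if visits == 1 else "Multiple Visit Buyer"

print(customer_label(1, 0))
print(customer_label(5, 2))
```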
Type | Count
Single Visit | 1823
Multiple Visit | 37
Single Visit Buyer | 269
Multiple Visit Buyer | 58
Unknown | 2846

Table 3: Single/Multiple Visitors/Buyers

After the data transformations are done, data in the webhouse are organized into different levels. Below are some of the most useful summary tables and fact tables for web usage mining and reporting.

CLICK_LINES | A row for each Web page viewed
SESSIONS | A row for each Web session
CUSTOMERS | A row for each customer
GIFT_LINES | A row for each gift registry item of each customer
ORDER_LINES | A row for each order line of each order
ORDER_HEADERS | A row for each order of each customer
PROMOTIONS | A row for each promotion folder and promotion defined in the system

LINE_ITEMS | ORDER_LINES joined with CUSTOMERS, ORDER_HEADERS, PRODUCTS, ASSORTMENT, PROMOTIONS

Table 4: Some Summary and Fact Tables in the Webhouse

4. Pattern Discovery: A Data Warehouse/OLAP Approach

Data Warehouse/OLAP (On-Line Analytical Processing) is an approach that integrates data mining, data warehousing, and OLAP technologies. OLAP systems pre-calculate summary information (data cubes) to enable drilling, pivoting, slicing and dicing, and filtering, so that the business can be analyzed from multiple angles or views (dimensions). Web mining your site in the webhouse can reveal actionable and meaningful patterns about users and useful click sequences for web site design. Below we discuss each of them in detail.

4.1 Construct Cubes from the Webhouse

A data cube is pre-calculated summary data organized so that the cells of the cube contain measured values and the edges of the cube define the natural dimensions of the data. (The data cube may have more than 3 dimensions, so technically it should be called a hypercube.) The dimensional elements in the cube are organized in a hierarchy, and you can roll up and/or drill down the dimension hierarchy to get a different view or understanding of the cube data. A data cube offers benefits for data analysis such as an immediate response to a business query and the ability to drill down and roll up the multi-dimensional data in the cube, analyzing business measures such as profit, revenue, and quantity from different angles and perspectives and against various ancillary factors. We can create two cubes from the webhouse as shown in Figure 2: one cube for the clickstream and another for the order items, based on the clickstream and order item fact tables and the session, product, user, page, and time dimension tables. In the webhouse, we already have data organized based on a multi-dimensional model; all that is required is to plug into the OLAP software.
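For intuition, the pre-aggregation an OLAP engine performs can be sketched as summing a measure over every subset of the chosen dimensions, so that roll-up and drill-down become simple lookups. This is a toy cube; the rows and column names are made up:

```python
from itertools import combinations
from collections import defaultdict

def build_cube(rows, dims, measure):
    # Aggregate the measure for every combination of grouping dimensions,
    # from the fully detailed cells down to the grand total.
    cube = defaultdict(float)
    for r in range(len(dims) + 1):
        for group in combinations(dims, r):
            for row in rows:
                key = (group, tuple(row[d] for d in group))
                cube[key] += row[measure]
    return cube

rows = [
    {"product": "seat", "month": "Jan", "revenue": 100.0},
    {"product": "seat", "month": "Feb", "revenue": 50.0},
    {"product": "tank", "month": "Jan", "revenue": 200.0},
]
cube = build_cube(rows, ["product", "month"], "revenue")
print(cube[(("product",), ("seat",))])                 # roll up over months
print(cube[(("product", "month"), ("tank", "Jan"))])   # detailed cell
print(cube[((), ())])                                  # grand total
```

A real OLAP engine adds dimension hierarchies and sparse storage on top of exactly this kind of group-by aggregation.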
There are many OLAP tools, such as MS OLAP, Cognos, and Essbase, to choose from to build large cubes. Slicing and dicing these cubes reveals significant information about the web site and sales situation. For example, we can find out the top pages of the site, top domains, top browsers, the view time of top pages, top exit pages of the site, top referrers of the site, top products by sales or quantity, top referrers by page request, sale, quantity, or users, and web site activity by day, month, or even hour and minute. We can also find out who our visitors are, how much they spend, the sales cycles, etc. From the OLAP cubes, many business intelligence reports can be derived. Business reports are the most important tool for business analysts but are underappreciated by a lot of companies. Business intelligence reports can provide much insightful information about the web store, such as sales of products across different referrers, best-selling/worst-selling products, top/bottom domains, top searched keywords, etc.

4.2 Mining the Webhouse Data

OLAP is a key component of this approach, but OLAP alone is not good enough for e-commerce applications. Some of the challenging questions cannot be answered by examining the measured values in the cubes. For example, for a question such as "Given a set of page views, will the visitor view another page on the site or will the visitor leave?", it is very difficult if not impossible to find a satisfactory answer based on the OLAP cube data from the webhouse. Many mining algorithms and methods, such as association algorithms, decision trees, neural networks, Bayesian algorithms, and clustering methods, can be applied in web usage mining to derive insightful knowledge rules to understand

the business and customers, build prediction models for classification, and generate campaign scores for product promotion. Below we discuss how these algorithms can help to solve some of the challenging problems in e-commerce.

Association Rules

Association rule algorithms were originally designed to analyze market basket data to find correlations among items purchased together: if a customer buys product A, what is the likelihood that he will buy product B? In web usage mining, association rule algorithms can be used for two purposes. First, analyzing the on-line purchase data to determine which products are sold together by on-line customers (similar to traditional supermarket basket data analysis). On-line shopping databases contain historical data on prior customer choices, where each customer has selected a subset of products. This data can be used to generate a dynamic recommendation of new items to a customer who is in the process of making an item choice. Another use of the association rule algorithm is to analyze the page view hits in a session. Websites also display a dynamically changing set of links to related sites depending on the browsing pattern during a surfing session. An adapted association algorithm can find related pages that are often visited together, even when the pages have no hyperlinks between them. As a result of association rule analysis, it is possible to optimize the web site structure and detect drawbacks that had not been obvious in the past. This information may help web designers redesign their web site (adding direct links between strongly correlated pages); it may also help the web server pre-fetch or pre-compile web pages (presently many web sites have dynamic page generation) to reduce the user waiting time.
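The core support/confidence computation behind this kind of basket analysis can be sketched as follows; the baskets and the minimum-support threshold are made-up illustrations:

```python
from itertools import combinations
from collections import Counter

def pair_rules(baskets, min_support=0.2):
    # Count item and pair co-occurrence, then emit rules a ==> c with
    # support = P(a, c) and confidence = P(c | a).
    n = len(baskets)
    item_counts = Counter(i for b in baskets for i in set(b))
    pair_counts = Counter()
    for b in baskets:
        for a, c in combinations(sorted(set(b)), 2):
            pair_counts[(a, c)] += 1
    rules = []
    for (a, c), cnt in pair_counts.items():
        support = cnt / n
        if support < min_support:
            continue
        rules.append((a, c, support, cnt / item_counts[a]))  # a ==> c
        rules.append((c, a, support, cnt / item_counts[c]))  # c ==> a
    return rules

baskets = [["Bloom", "Dirty_Girl"], ["Bloom", "Philosophy"],
           ["Bloom", "Dirty_Girl", "Blue_Q"], ["Philosophy"]]
for a, c, sup, conf in pair_rules(baskets):
    print(f"{a} ==> {c}  support={sup:.2f}  confidence={conf:.2f}")
```

Production algorithms such as Apriori prune the candidate space instead of enumerating all pairs, but the support/confidence definitions are the same.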
However, we feel that recommendation is inherently a different problem; the main reason is that preferences are due largely to taste and interest. When a customer surfs the webstore, whether purchasing or just visiting, not all actions (putting an item into the shopping cart, or clicking through different web pages) are selected because of their association with previous actions (other items already in the cart, or pages already visited) [8]. We believe there are two behaviors: renewal choice and association choice. Starting from scratch, some need drives the customer to click the first page or select the first item; this independent need is what we call the renewal choice. After the first move, a customer may stop, or click another page/select another item by association or by another renewal choice, iteratively. We propose a hybrid approach (a statistical association rule approach) to compute the probability of a new move becoming the next choice given the current status, and make a recommendation list based on a ranking of this probability. What makes this approach different from the usual association rule approaches is that it accounts not only for the choice making, or buying, associated with the items present in the shopping cart (associative buying), but also for the fact that a customer exercises an independent choice unrelated to the existing items in the shopping cart (renewal buying). We compute the probabilities of both renewal choice and associative choice given the items in the shopping cart, and obtain the probabilities for each item given the partial basket content and given each of these two buying modes. The results from this analysis are very useful for promoting cross-selling and up-selling in the online web store. Based on this consideration, we tested this approach on one client's

site, and the association rules in Table 5 reveal that this approach generates more meaningful and actionable associations.

Rule
Bloom ==> Dirty_Girl
Dirty_Girl ==> Bloom
Philosophy ==> Bloom
Bloom ==> Philosophy
Dirty_Girl ==> Blue_Q
Blue_Q ==> Dirty_Girl
Tony_And_Tina ==> Girl
Philosophy ==> Tony_And_Tina
Tony_And_Tina ==> Philosophy
Demeter_Fragrances ==> Smell_This
Girl ==> Tony_And_Tina
Smell_This ==> Demeter_Fragrances

Table 5: Associations in a beauty-supply web store

Classification/Prediction

Classification/prediction is a very popular data mining technique: build a model based on training data, then apply the model to assign a new item to a certain class. There are many algorithms for classification, such as decision trees, neural networks, Bayesian networks, and probability theory. For example, to understand the customers who spend more than $12 in the web site, you can use a decision tree algorithm to build a model, which may reveal a pattern such as: the customers who spend more than $12 are single females in a certain age range who earn more than a certain amount a year. Another application for classification/prediction is target-oriented campaigns. A mass campaign has a very low response rate, typically 2-3%. In a target-oriented campaign, the company only sends the campaign message to the small portion of customers who are most likely to respond. Even though sending e-mails to all on-line customers is very cheap, it is still important to target effectively, as suggested by Berry and Linoff, because customers might read one targeted e-mail but are less likely to read a lot of junk messages [4,6]. Another important reason is that if the customers are fed up with off-target messages, they can revoke their permission to be contacted by the web store.
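As an illustration of such response scoring — here with a toy Naive Bayes model and made-up features, not the actual models used in any production system:

```python
from collections import defaultdict

def train(history):
    # history: list of (features: dict, responded: bool) from past campaigns
    counts = {True: defaultdict(lambda: defaultdict(int)),
              False: defaultdict(lambda: defaultdict(int))}
    class_counts = {True: 0, False: 0}
    for feats, y in history:
        class_counts[y] += 1
        for k, v in feats.items():
            counts[y][k][v] += 1
    return counts, class_counts

def score(model, feats):
    # P(respond | features) via Naive Bayes with Laplace smoothing
    counts, class_counts = model
    total = sum(class_counts.values())
    probs = {}
    for y in (True, False):
        p = class_counts[y] / total
        for k, v in feats.items():
            p *= (counts[y][k][v] + 1) / (class_counts[y] + 2)
        probs[y] = p
    return probs[True] / (probs[True] + probs[False])

history = [({"repeat_visitor": True}, True), ({"repeat_visitor": True}, True),
           ({"repeat_visitor": False}, False), ({"repeat_visitor": False}, True)]
model = train(history)
customers = {"c1": {"repeat_visitor": True}, "c2": {"repeat_visitor": False}}
ranked = sorted(customers, key=lambda c: score(model, customers[c]), reverse=True)
print(ranked)  # most likely responders first
```

Only the top of the ranked list then receives the campaign e-mail, which is the targeting strategy described in the text.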
To identify who is most likely to respond to a campaign, avoid generating too many off-target e-mails, and improve service quality, we can build a prediction model based on historical data (which contains the responders vs. non-responders of past campaigns), then apply the prediction model to the current customers and sort the customer list by probability score; the top of the sorted list contains those customers who are most likely to respond to the campaign.

Clustering

Clustering techniques are useful when there are no classes to be classified or predicted. Clustering algorithms group a set of objects into different groups based on similarity measures, so that objects in the same group are similar to each other and objects in different groups are different. In web usage mining, clustering algorithms can be used in several ways: (1) Profiling customers based on features such as purchase amount, region, and purchased products. For example, we can group customers into groups such as heavy spenders, light spenders, or browsers based on the amount spent. We can extract

similar features from a cluster and find out, for example, that heavy spenders are mostly young, single male technical professionals. The results of clustering web data can help the on-line store identify customer segments with common characteristics, target these segments for campaigns or product promotions, and make special offers tailored to their needs and requirements. (2) Clustering navigational paths of web hits. As shown in [7,16], clustering navigational paths is very important for user segmentation; the results can help web designers understand or predict visitors' navigation patterns to make the web site more efficient or closer to the visitors' preferences. For example, if the clustering results show that pages P1, P2, and P3 are in the same cluster, then the web server can pre-fetch pages P2 and P3, or pre-compile them, while the user is still viewing page P1, reducing the loading or compile time and thus the user waiting latency. Another potential use is to find subsets of users that would benefit from sharing a single web cache rather than using individual ones.

5. Pattern Evaluation and Deployment

In the Data Warehouse/OLAP framework, the last step is to evaluate the mining results and then adopt the actionable ones. After the mining algorithms are applied, many patterns may be identified, but not all of them are interesting or actionable. Unlike most pattern evaluation approaches, which rely on SQL statements to query the database and evaluate the results, in our Data Warehouse/OLAP framework the data cube is an essential component of the mining procedure, and we can dice and roll up the data cube to easily verify the mining results. After the mined patterns are verified to be golden nuggets, data miners and data analysts can take proper actions based on the useful, actionable mining results.
In traditional data mining applications, it is always challenging or time-consuming to convince the organization to take actions based on the mining results to improve the business. For example, in a brick-and-mortar store, if the data mining results reveal that customers who buy product A tend to buy product B, then in order to create a cross-sell opportunity based on this finding, a possible action is to put products A and B together on the same shelf, which requires physically moving the products between shelves. In a web store, such a discovery can be acted on easily without much cost and hassle: it is very flexible to change the web site design and layout and put relevant product information together to create cross-sell and up-sell opportunities. Another example is the customer campaign. A lot of companies send marketing campaign information such as catalogs and coupons via snail mail based on data mining prediction models; the whole procedure normally takes a few months and costs millions of dollars. But in a web store, sending campaign e-mails to a massive number of customers is already a key component of the e-commerce system; targeting customers based on data mining findings via e-mail to promote products is easy to implement, and these findings can quickly bring more revenue to the web store.

6. Conclusion

An e-commerce webstore provides a killer domain for data mining applications. In this paper we have proposed a framework for web usage mining and business intelligence reporting. We address some of the key problems and issues in web usage mining applications. We use the web application server to collect all the relevant data (clickstream, transaction, customer information) for analysis purposes and provide a unified database schema for different data sources. The construction of the webhouse is an integral part of our framework, which provides an integrated environment for data collection and data

transformation. In the framework, we integrate the data warehouse construction, data mining, business intelligence reporting, and pattern deployment into the e-commerce system smoothly. This tight integration significantly reduces the total time and effort to build a data mining system for web usage mining. We provide a general approach and guidelines for on-line web stores to mine their web data and generate business intelligence reports. We identify some of the challenging problems and pitfalls in each phase and provide possible solutions. Our framework focuses on on-line web stores, and we believe it can be adapted to other domains such as Business-to-Business. The framework and ideas presented in the paper have been implemented in some commercial web usage mining systems through the first author's consulting engagements with industry vendors. There are other challenging problems in web usage mining, such as how to scale web mining algorithms to handle large amounts of data in the 100 GB or even terabyte range (some large e-commerce sites like Yahoo handle 1 billion page views a day). Scalability is crucial for a successful e-commerce system. We hope to report our findings on this research topic in the near future.

7. References

[1] Accrue Software Inc., Driving Business Decisions in Web Time, Web Mining Whitepaper
[2] Anahory, S. and Murray, D., Data Warehousing in the Real World, Addison Wesley, 1997
[3] Suhail Ansari, Ron Kohavi, Llew Mason and Zijian Zheng, Integrating E-Commerce and Data Mining: Architecture and Challenges, WebKDD 2000 Workshop
[4] Jonathan Becher, Ronny Kohavi, Tutorial on E-commerce and Clickstream Mining, First SIAM International Conference on Data Mining
[5] Michael Berry, Gordon Linoff, Mastering Data Mining: The Art and Science of Customer Relationship Management, John Wiley & Sons
[6] Catledge L.
and Pitkow J., Characterizing browsing behaviors on the World Wide Web, Computer Networks and ISDN Systems, 27(6), 1995
[7] Domingos P., Hulten G., A General Method for Scaling Up Machine Learning Algorithms and its Application to Clustering, Proc. of ICML-2001
[8] Hong S.J., Natarajan R., Belitskaya I., A New Approach for Item Choice Recommendation
[9] Hu X., Cercone N., An OLAM Approach for Web Usage Mining, Proc. of the 2002 IEEE International Conference on Fuzzy Systems
[10] Kdnuggets.com
[11] Ralph Kimball, The Data Warehouse Toolkit, John Wiley and Sons, 1996
[12] Ralph Kimball, Clicking with Your Customer, Intelligent Enterprise, Jan 05, 1999, Vol 2, No. 1
[13] Ralph Kimball, Richard Merz, The Data Webhouse Toolkit: Building the Web-Enabled Data Warehouse, John Wiley and Sons, 2002
[14] Ronny Kohavi, Mining E-Commerce Data: The Good, the Bad and the Ugly, invited paper at SIGKDD 2001 Industry Track
[15] Ronny Kohavi and Foster Provost, Applications of Data Mining to Electronic Commerce, Data Mining and Knowledge Discovery, 5(1), 2001
[16] Raymond Kosala, Hendrik Blockeel, Web Mining Research: A Survey, ACM SIGKDD Explorations, Vol 2, Issue 1
[17] Jesus Mena, Data Mining Your Website, Digital Press
[18] Jesus Mena, Beyond the Shopping Cart, Intelligent Enterprise, Jan 05, 1999, Vol 2, No. 1
[19] Sane Solutions, Analyzing Web Site Traffic, 2002


OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP Data Warehousing and End-User Access Tools OLAP and Data Mining Accompanying growth in data warehouses is increasing demands for more powerful access tools providing advanced analytical capabilities. Key

More information

Basics of Dimensional Modeling

Basics of Dimensional Modeling Basics of Dimensional Modeling Data warehouse and OLAP tools are based on a dimensional data model. A dimensional model is based on dimensions, facts, cubes, and schemas such as star and snowflake. Dimensional

More information

Designing a Dimensional Model

Designing a Dimensional Model Designing a Dimensional Model Erik Veerman Atlanta MDF member SQL Server MVP, Microsoft MCT Mentor, Solid Quality Learning Definitions Data Warehousing A subject-oriented, integrated, time-variant, and

More information

The Data Webhouse. Toolkit. Building the Web-Enabled Data Warehouse WILEY COMPUTER PUBLISHING

The Data Webhouse. Toolkit. Building the Web-Enabled Data Warehouse WILEY COMPUTER PUBLISHING The Data Webhouse Toolkit Building the Web-Enabled Data Warehouse Ralph Kimball Richard Merz WILEY COMPUTER PUBLISHING John Wiley & Sons, Inc. New York Chichester Weinheim Brisbane Singapore Toronto Contents

More information

HOW DOES GOOGLE ANALYTICS HELP ME?

HOW DOES GOOGLE ANALYTICS HELP ME? Google Analytics HOW DOES GOOGLE ANALYTICS HELP ME? Google Analytics tells you how visitors found your site and how they interact with it. You'll be able to compare the behavior and profitability of visitors

More information

not possible or was possible at a high cost for collecting the data.

not possible or was possible at a high cost for collecting the data. Data Mining and Knowledge Discovery Generating knowledge from data Knowledge Discovery Data Mining White Paper Organizations collect a vast amount of data in the process of carrying out their day-to-day

More information

Chapter 6 - Enhancing Business Intelligence Using Information Systems

Chapter 6 - Enhancing Business Intelligence Using Information Systems Chapter 6 - Enhancing Business Intelligence Using Information Systems Managers need high-quality and timely information to support decision making Copyright 2014 Pearson Education, Inc. 1 Chapter 6 Learning

More information

Web Traffic Capture. 5401 Butler Street, Suite 200 Pittsburgh, PA 15201 +1 (412) 408 3167 www.metronomelabs.com

Web Traffic Capture. 5401 Butler Street, Suite 200 Pittsburgh, PA 15201 +1 (412) 408 3167 www.metronomelabs.com Web Traffic Capture Capture your web traffic, filtered and transformed, ready for your applications without web logs or page tags and keep all your data inside your firewall. 5401 Butler Street, Suite

More information

Data Warehouse design

Data Warehouse design Data Warehouse design Design of Enterprise Systems University of Pavia 21/11/2013-1- Data Warehouse design DATA PRESENTATION - 2- BI Reporting Success Factors BI platform success factors include: Performance

More information

CHAPTER 5: BUSINESS ANALYTICS

CHAPTER 5: BUSINESS ANALYTICS Chapter 5: Business Analytics CHAPTER 5: BUSINESS ANALYTICS Objectives The objectives are: Describe Business Analytics. Explain the terminology associated with Business Analytics. Describe the data warehouse

More information

1 Which of the following questions can be answered using the goal flow report?

1 Which of the following questions can be answered using the goal flow report? 1 Which of the following questions can be answered using the goal flow report? [A] Are there a lot of unexpected exits from a step in the middle of my conversion funnel? [B] Do visitors usually start my

More information

Foundations of Business Intelligence: Databases and Information Management

Foundations of Business Intelligence: Databases and Information Management Foundations of Business Intelligence: Databases and Information Management Problem: HP s numerous systems unable to deliver the information needed for a complete picture of business operations, lack of

More information

Tutorials for Project on Building a Business Analytic Model Using Data Mining Tool and Data Warehouse and OLAP Cubes IST 734

Tutorials for Project on Building a Business Analytic Model Using Data Mining Tool and Data Warehouse and OLAP Cubes IST 734 Cleveland State University Tutorials for Project on Building a Business Analytic Model Using Data Mining Tool and Data Warehouse and OLAP Cubes IST 734 SS Chung 14 Build a Data Mining Model using Data

More information

Analyzing the footsteps of your customers

Analyzing the footsteps of your customers Analyzing the footsteps of your customers - A case study by ASK net and SAS Institute GmbH - Christiane Theusinger 1 Klaus-Peter Huber 2 Abstract As on-line presence becomes very important in today s e-commerce

More information

web analytics ...and beyond Not just for beginners, We are interested in your thoughts:

web analytics ...and beyond Not just for beginners, We are interested in your thoughts: web analytics 201 Not just for beginners, This primer is designed to help clarify some of the major challenges faced by marketers today, such as:...and beyond -defining KPIs in a complex environment -organizing

More information

Understanding Web personalization with Web Usage Mining and its Application: Recommender System

Understanding Web personalization with Web Usage Mining and its Application: Recommender System Understanding Web personalization with Web Usage Mining and its Application: Recommender System Manoj Swami 1, Prof. Manasi Kulkarni 2 1 M.Tech (Computer-NIMS), VJTI, Mumbai. 2 Department of Computer Technology,

More information

University of Gaziantep, Department of Business Administration

University of Gaziantep, Department of Business Administration University of Gaziantep, Department of Business Administration The extensive use of information technology enables organizations to collect huge amounts of data about almost every aspect of their businesses.

More information

Demystifying Digital Introduction to Google Analytics. Mal Chia Digital Account Director

Demystifying Digital Introduction to Google Analytics. Mal Chia Digital Account Director Demystifying Digital Introduction to Google Analytics Mal Chia Digital Account Director @malchia @communikateetal Slides will be emailed after the session 2 Workshop Overview 1. Introduction 2. Getting

More information

Data Warehouse Snowflake Design and Performance Considerations in Business Analytics

Data Warehouse Snowflake Design and Performance Considerations in Business Analytics Journal of Advances in Information Technology Vol. 6, No. 4, November 2015 Data Warehouse Snowflake Design and Performance Considerations in Business Analytics Jiangping Wang and Janet L. Kourik Walker

More information

Google Analytics Guide. for BUSINESS OWNERS. By David Weichel & Chris Pezzoli. Presented By

Google Analytics Guide. for BUSINESS OWNERS. By David Weichel & Chris Pezzoli. Presented By Google Analytics Guide for BUSINESS OWNERS By David Weichel & Chris Pezzoli Presented By Google Analytics Guide for Ecommerce Business Owners Contents Introduction... 3 Overview of Google Analytics...

More information

Alexander Nikov. 7. ecommerce Marketing Concepts. Consumers Online: The Internet Audience and Consumer Behavior. Outline

Alexander Nikov. 7. ecommerce Marketing Concepts. Consumers Online: The Internet Audience and Consumer Behavior. Outline INFO 3435 E-Commerce Teaching Objectives 7. ecommerce Marketing Concepts Alexander Nikov Identify the key features of the Internet audience. Discuss the basic concepts of consumer behavior and purchasing

More information

Arti Tyagi Sunita Choudhary

Arti Tyagi Sunita Choudhary Volume 5, Issue 3, March 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Web Usage Mining

More information

Database Marketing, Business Intelligence and Knowledge Discovery

Database Marketing, Business Intelligence and Knowledge Discovery Database Marketing, Business Intelligence and Knowledge Discovery Note: Using material from Tan / Steinbach / Kumar (2005) Introduction to Data Mining,, Addison Wesley; and Cios / Pedrycz / Swiniarski

More information

Enhance Preprocessing Technique Distinct User Identification using Web Log Usage data

Enhance Preprocessing Technique Distinct User Identification using Web Log Usage data Enhance Preprocessing Technique Distinct User Identification using Web Log Usage data Sheetal A. Raiyani 1, Shailendra Jain 2 Dept. of CSE(SS),TIT,Bhopal 1, Dept. of CSE,TIT,Bhopal 2 sheetal.raiyani@gmail.com

More information

CHAPTER 3 PREPROCESSING USING CONNOISSEUR ALGORITHMS

CHAPTER 3 PREPROCESSING USING CONNOISSEUR ALGORITHMS CHAPTER 3 PREPROCESSING USING CONNOISSEUR ALGORITHMS 3.1 Introduction In this thesis work, a model is developed in a structured way to mine the frequent patterns in e-commerce domain. Designing and implementing

More information

WEBSITE ANALYSIS OVERVIEW

WEBSITE ANALYSIS OVERVIEW WEBSITE ANALSIS OVERVIEW Key Analysis Areas Web Traffic Web Visitors Web Navigation ecommerce Customers in this Area Include: BBC Worldwide Caja Duero La Caixa LendingTree.com Lexmark International Nygård

More information

Customer Relationship Management

Customer Relationship Management Customer Relationship Management CRM is Any application or initiative designed to help an organization optimize interactions with customers, suppliers, or prospects via one or more touch points for the

More information

Dimensional Data Modeling for the Data Warehouse

Dimensional Data Modeling for the Data Warehouse Lincoln Land Community College Capital City Training Center 130 West Mason Springfield, IL 62702 217-782-7436 www.llcc.edu/cctc Dimensional Data Modeling for the Data Warehouse Prerequisites Students should

More information

Customer Analytics. Turn Big Data into Big Value

Customer Analytics. Turn Big Data into Big Value Turn Big Data into Big Value All Your Data Integrated in Just One Place BIRT Analytics lets you capture the value of Big Data that speeds right by most enterprises. It analyzes massive volumes of data

More information

GOOGLE ANALYTICS 101

GOOGLE ANALYTICS 101 GOOGLE ANALYTICS 101 Presented By Adrienne C. Dupree Please feel free to share this report with anyone who is interested in the topic of building a profitable online business. Simply forward it to them

More information

Sterling Business Intelligence

Sterling Business Intelligence Sterling Business Intelligence Release Note Release 9.0 March 2010 Copyright 2010 Sterling Commerce, Inc. All rights reserved. Additional copyright information is located on the documentation library:

More information

Privacy Policy - LuxTNT.com

Privacy Policy - LuxTNT.com Privacy Policy - LuxTNT.com Overview TNT Luxury Group Limited (the owner of LuxTNT.com). knows that you care how information about you is used and shared, and we appreciate your trust that we will do so

More information

Data Warehousing and Data Mining

Data Warehousing and Data Mining Data Warehousing and Data Mining Part I: Data Warehousing Gao Cong gaocong@cs.aau.dk Slides adapted from Man Lung Yiu and Torben Bach Pedersen Course Structure Business intelligence: Extract knowledge

More information

Data Warehousing and OLAP Technology for Knowledge Discovery

Data Warehousing and OLAP Technology for Knowledge Discovery 542 Data Warehousing and OLAP Technology for Knowledge Discovery Aparajita Suman Abstract Since time immemorial, libraries have been generating services using the knowledge stored in various repositories

More information

Foundations of Business Intelligence: Databases and Information Management

Foundations of Business Intelligence: Databases and Information Management Chapter 5 Foundations of Business Intelligence: Databases and Information Management 5.1 Copyright 2011 Pearson Education, Inc. Student Learning Objectives How does a relational database organize data,

More information

Data Warehousing and Data Mining in Business Applications

Data Warehousing and Data Mining in Business Applications 133 Data Warehousing and Data Mining in Business Applications Eesha Goel CSE Deptt. GZS-PTU Campus, Bathinda. Abstract Information technology is now required in all aspect of our lives that helps in business

More information

EVALUATION OF E-COMMERCE WEB SITES ON THE BASIS OF USABILITY DATA

EVALUATION OF E-COMMERCE WEB SITES ON THE BASIS OF USABILITY DATA Articles 37 Econ Lit C8 EVALUATION OF E-COMMERCE WEB SITES ON THE BASIS OF USABILITY DATA Assoc. prof. Snezhana Sulova, PhD Introduction Today increasing numbers of commercial companies are using the electronic

More information

Data Mining for Fun and Profit

Data Mining for Fun and Profit Data Mining for Fun and Profit Data mining is the extraction of implicit, previously unknown, and potentially useful information from data. - Ian H. Witten, Data Mining: Practical Machine Learning Tools

More information

Mario Guarracino. Data warehousing

Mario Guarracino. Data warehousing Data warehousing Introduction Since the mid-nineties, it became clear that the databases for analysis and business intelligence need to be separate from operational. In this lecture we will review the

More information

Fluency With Information Technology CSE100/IMT100

Fluency With Information Technology CSE100/IMT100 Fluency With Information Technology CSE100/IMT100 ),7 Larry Snyder & Mel Oyler, Instructors Ariel Kemp, Isaac Kunen, Gerome Miklau & Sean Squires, Teaching Assistants University of Washington, Autumn 1999

More information

CHAPTER 4: BUSINESS ANALYTICS

CHAPTER 4: BUSINESS ANALYTICS Chapter 4: Business Analytics CHAPTER 4: BUSINESS ANALYTICS Objectives Introduction The objectives are: Describe Business Analytics Explain the terminology associated with Business Analytics Describe the

More information

Database Design Patterns. Winter 2006-2007 Lecture 24

Database Design Patterns. Winter 2006-2007 Lecture 24 Database Design Patterns Winter 2006-2007 Lecture 24 Trees and Hierarchies Many schemas need to represent trees or hierarchies of some sort Common way of representing trees: An adjacency list model Each

More information

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10 1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom

More information

Web Usage Mining. from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer Chapter written by Bamshad Mobasher

Web Usage Mining. from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer Chapter written by Bamshad Mobasher Web Usage Mining from Bing Liu. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, Springer Chapter written by Bamshad Mobasher Many slides are from a tutorial given by B. Berendt, B. Mobasher,

More information

Analytics case study

Analytics case study Analytics case study Carer s Allowance service (DWP) Ashraf Chohan Performance Analyst Government Digital Service (GDS) Contents Introduction... 3 The Carer s Allowance exemplar... 3 Meeting the digital

More information

Google Analytics Health Check Laying the foundations for successful analytics and optimisation

Google Analytics Health Check Laying the foundations for successful analytics and optimisation Google Analytics Health Check Laying the foundations for successful analytics and optimisation Google Analytics Property [UA-1234567-1] Domain [Client URL] Date of Review MMM YYYY Consultant [Consultant

More information

DATA WAREHOUSING AND OLAP TECHNOLOGY

DATA WAREHOUSING AND OLAP TECHNOLOGY DATA WAREHOUSING AND OLAP TECHNOLOGY Manya Sethi MCA Final Year Amity University, Uttar Pradesh Under Guidance of Ms. Shruti Nagpal Abstract DATA WAREHOUSING and Online Analytical Processing (OLAP) are

More information

Data Mining: Overview. What is Data Mining?

Data Mining: Overview. What is Data Mining? Data Mining: Overview What is Data Mining? Recently * coined term for confluence of ideas from statistics and computer science (machine learning and database methods) applied to large databases in science,

More information

Why Business Intelligence

Why Business Intelligence Why Business Intelligence Ferruccio Ferrando z IT Specialist Techline Italy March 2011 page 1 di 11 1.1 The origins In the '50s economic boom, when demand and production were very high, the only concern

More information

Data Mining Solutions for the Business Environment

Data Mining Solutions for the Business Environment Database Systems Journal vol. IV, no. 4/2013 21 Data Mining Solutions for the Business Environment Ruxandra PETRE University of Economic Studies, Bucharest, Romania ruxandra_stefania.petre@yahoo.com Over

More information

Setting Up Solar Web Commerce. Release 8.6.9

Setting Up Solar Web Commerce. Release 8.6.9 Setting Up Solar Web Commerce Release 8.6.9 Legal Notices 2011 Epicor Software Corporation. All rights reserved. Unauthorized reproduction is a violation of applicable laws. Epicor and the Epicor logo

More information

SKoolAide Privacy Policy

SKoolAide Privacy Policy SKoolAide Privacy Policy Welcome to SKoolAide. SKoolAide, LLC offers online education related services and applications that allow users to share content on the Web more easily. In addition to the sharing

More information

Google Analytics Basics

Google Analytics Basics Google Analytics Basics Contents Google Analytics: An Introduction...3 Google Analytics Features... 3 Google Analytics Interface... Changing the Date Range... 8 Graphs... 9 Put Stats into Context... 10

More information

Regain Your Privacy on the Internet

Regain Your Privacy on the Internet Regain Your Privacy on the Internet by Boris Loza, PhD, CISSP from SafePatrol Solutions Inc. You'd probably be surprised if you knew what information about yourself is available on the Internet! Do you

More information

Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1. Dejan Sarka Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

More information

MINING CLICKSTREAM-BASED DATA CUBES

MINING CLICKSTREAM-BASED DATA CUBES MINING CLICKSTREAM-BASED DATA CUBES Ronnie Alves and Orlando Belo Departament of Informatics,School of Engineering, University of Minho Campus de Gualtar, 4710-057 Braga, Portugal Email: {alvesrco,obelo}@di.uminho.pt

More information

CHAPTER SIX DATA. Business Intelligence. 2011 The McGraw-Hill Companies, All Rights Reserved

CHAPTER SIX DATA. Business Intelligence. 2011 The McGraw-Hill Companies, All Rights Reserved CHAPTER SIX DATA Business Intelligence 2011 The McGraw-Hill Companies, All Rights Reserved 2 CHAPTER OVERVIEW SECTION 6.1 Data, Information, Databases The Business Benefits of High-Quality Information

More information

Sterling Business Intelligence

Sterling Business Intelligence Sterling Business Intelligence Concepts Guide Release 9.0 March 2010 Copyright 2009 Sterling Commerce, Inc. All rights reserved. Additional copyright information is located on the documentation library:

More information

BUILDING BLOCKS OF DATAWAREHOUSE. G.Lakshmi Priya & Razia Sultana.A Assistant Professor/IT

BUILDING BLOCKS OF DATAWAREHOUSE. G.Lakshmi Priya & Razia Sultana.A Assistant Professor/IT BUILDING BLOCKS OF DATAWAREHOUSE G.Lakshmi Priya & Razia Sultana.A Assistant Professor/IT 1 Data Warehouse Subject Oriented Organized around major subjects, such as customer, product, sales. Focusing on

More information

Multidimensional Modeling - Stocks

Multidimensional Modeling - Stocks Bases de Dados e Data Warehouse 06 BDDW 2006/2007 Notice! Author " João Moura Pires (jmp@di.fct.unl.pt)! This material can be freely used for personal or academic purposes without any previous authorization

More information

Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data

Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data INFO 1500 Introduction to IT Fundamentals 5. Database Systems and Managing Data Resources Learning Objectives 1. Describe how the problems of managing data resources in a traditional file environment are

More information

graphical Systems for Website Design

graphical Systems for Website Design 2005 Linux Web Host. All rights reserved. The content of this manual is furnished under license and may be used or copied only in accordance with this license. No part of this publication may be reproduced,

More information

ANALYSIS OF WEBSITE USAGE WITH USER DETAILS USING DATA MINING PATTERN RECOGNITION

ANALYSIS OF WEBSITE USAGE WITH USER DETAILS USING DATA MINING PATTERN RECOGNITION ANALYSIS OF WEBSITE USAGE WITH USER DETAILS USING DATA MINING PATTERN RECOGNITION K.Vinodkumar 1, Kathiresan.V 2, Divya.K 3 1 MPhil scholar, RVS College of Arts and Science, Coimbatore, India. 2 HOD, Dr.SNS

More information

Friends Asking Friends 2.94. New Features Guide

Friends Asking Friends 2.94. New Features Guide Friends Asking Friends 2.94 New Features Guide 8/10/2012 Friends Asking Friends 2.94 Friends Asking Friends US 2012 Blackbaud, Inc. This publication, or any part thereof, may not be reproduced or transmitted

More information

Lost in Space? Methodology for a Guided Drill-Through Analysis Out of the Wormhole

Lost in Space? Methodology for a Guided Drill-Through Analysis Out of the Wormhole Paper BB-01 Lost in Space? Methodology for a Guided Drill-Through Analysis Out of the Wormhole ABSTRACT Stephen Overton, Overton Technologies, LLC, Raleigh, NC Business information can be consumed many

More information

Data W a Ware r house house and and OLAP II Week 6 1

Data W a Ware r house house and and OLAP II Week 6 1 Data Warehouse and OLAP II Week 6 1 Team Homework Assignment #8 Using a data warehousing tool and a data set, play four OLAP operations (Roll up (drill up), Drill down (roll down), Slice and dice, Pivot

More information

An Ideal E-Commerce Architecture for Building Web Sites Supporting Analysis and Personalization

An Ideal E-Commerce Architecture for Building Web Sites Supporting Analysis and Personalization 1 Information Organization and Retrieval Class, Berkeley Oct 19, 2000 An Ideal E-Commerce Architecture for Building Web Sites Supporting Analysis and Personalization Ronny Kohavi, Ph.D. Director of Data

More information

Using reporting and data mining techniques to improve knowledge of subscribers; applications to customer profiling and fraud management

Using reporting and data mining techniques to improve knowledge of subscribers; applications to customer profiling and fraud management Using reporting and data mining techniques to improve knowledge of subscribers; applications to customer profiling and fraud management Paper Jean-Louis Amat Abstract One of the main issues of operators

More information

Google Analytics for Robust Website Analytics. Deepika Verma, Depanwita Seal, Atul Pandey

Google Analytics for Robust Website Analytics. Deepika Verma, Depanwita Seal, Atul Pandey 1 Google Analytics for Robust Website Analytics Deepika Verma, Depanwita Seal, Atul Pandey 2 Table of Contents I. INTRODUCTION...3 II. Method for obtaining data for web analysis...3 III. Types of metrics

More information

Our Data & Methodology. Understanding the Digital World by Turning Data into Insights

Our Data & Methodology. Understanding the Digital World by Turning Data into Insights Our Data & Methodology Understanding the Digital World by Turning Data into Insights Understanding Today s Digital World SimilarWeb provides data and insights to help businesses make better decisions,

More information

Internet Advertising Glossary Internet Advertising Glossary

Internet Advertising Glossary Internet Advertising Glossary Internet Advertising Glossary Internet Advertising Glossary The Council Advertising Network bring the benefits of national web advertising to your local community. With more and more members joining the

More information

Business white paper. The road to strategic website design The Optimost Web Optimization Maturity Model

Business white paper. The road to strategic website design The Optimost Web Optimization Maturity Model Business white paper The road to strategic website design The Optimost Web Optimization Maturity Model Digital marketers spend large sums attracting website traffic. Much of that investment is wasted.

More information

The Fundamentals of B2C Marketing Automation for Effective Marketing Communications

The Fundamentals of B2C Marketing Automation for Effective Marketing Communications The Fundamentals of B2C Marketing Automation for Effective Marketing Communications Mark Patron February 2013 Email and Website Optimisation Introduction Marketing automation is a process that uses insight

More information

Index. AdWords, 182 AJAX Cart, 129 Attribution, 174

Index. AdWords, 182 AJAX Cart, 129 Attribution, 174 Index A AdWords, 182 AJAX Cart, 129 Attribution, 174 B BigQuery, Big Data Analysis create reports, 238 GA-BigQuery integration, 238 GA data, 241 hierarchy structure, 238 query language (see also Data selection,

More information

Business Intelligence Solutions for Gaming and Hospitality

Business Intelligence Solutions for Gaming and Hospitality Business Intelligence Solutions for Gaming and Hospitality Prepared by: Mario Perkins Qualex Consulting Services, Inc. Suzanne Fiero SAS Objective Summary 2 Objective Summary The rise in popularity and

More information

Foundations of Business Intelligence: Databases and Information Management

Foundations of Business Intelligence: Databases and Information Management Foundations of Business Intelligence: Databases and Information Management Content Problems of managing data resources in a traditional file environment Capabilities and value of a database management

More information

5 Big Data Use Cases to Understand Your Customer Journey CUSTOMER ANALYTICS EBOOK

5 Big Data Use Cases to Understand Your Customer Journey CUSTOMER ANALYTICS EBOOK 5 Big Data Use Cases to Understand Your Customer Journey CUSTOMER ANALYTICS EBOOK CUSTOMER JOURNEY Technology is radically transforming the customer journey. Today s customers are more empowered and connected

More information

Web Analytics and the Importance of a Multi-Modal Approach to Metrics

Web Analytics and the Importance of a Multi-Modal Approach to Metrics Web Analytics Strategy Prepared By: Title: Prepared By: Web Analytics Strategy Unilytics Corporation Date Created: March 22, 2010 Last Updated: May 3, 2010 P a g e i Table of Contents Web Analytics Strategy...

More information

SAS BI Dashboard 3.1. User s Guide

SAS BI Dashboard 3.1. User s Guide SAS BI Dashboard 3.1 User s Guide The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2007. SAS BI Dashboard 3.1: User s Guide. Cary, NC: SAS Institute Inc. SAS BI Dashboard

More information

INFO 1400. Koffka Khan. Tutorial 6

INFO 1400. Koffka Khan. Tutorial 6 INFO 1400 Koffka Khan Tutorial 6 Running Case Assignment: Improving Decision Making: Redesigning the Customer Database Dirt Bikes U.S.A. sells primarily through its distributors. It maintains a small customer

More information

BUSINESS IMPACT OF POOR WEB PERFORMANCE
