Databases - Data Mining (GF Royle, N Spadaccini 2006-2010)




This lecture
This lecture introduces data mining through market basket analysis.

Data Mining
An organization that stores data about its operations will rapidly accumulate a vast amount of data. For example, a supermarket might enter check-out scanner data directly into a database, thus getting a record of all purchases made at that supermarket.
Alternatively, data might be collected for the express purpose of data mining. For example, the Busselton Project is a longitudinal study that has accumulated 30 years of health-related data about the people in Busselton. (See http://bsn.uwa.edu.au.)
Data Mining, or more generally Knowledge Discovery in Databases (KDD), refers to the general process of trying to extract interesting or useful patterns from a (usually huge) dataset.

Data Mining: Rules
One of the fundamental types of interesting pattern is an association between observations that might reflect some important underlying mechanism.
For example, the Busselton Project may find associations between health-related observations: perhaps a correlation between elevated blood pressure at age 30 and the development of Type 2 diabetes at age 50.

Data Mining Research
Research into data mining is one of the most active areas of current database research, with a number of different aspects:
  KDD techniques: Research into the theoretical statistical techniques underlying KDD, such as regression, classification and clustering.
  Scalability: Research into algorithms for these techniques that scale effectively as the data volume reaches many terabytes.
  Integration: Research into integrating KDD tools into standard databases.

Market Basket Analysis
We will only consider one simple technique, called market basket analysis, for finding association rules.
A market basket is a collection of items associated with a single transaction.
The canonical example of market basket analysis is a supermarket customer who purchases all the items in their shopping basket.

Market Basket Analysis
By analysing the contents of the shopping basket one may be able to infer the purchasing behaviours of the customer.
The aim of market basket analysis is to analyse millions of transactions to try to determine patterns in the items that are purchased together.
This information can then be used to guide specials, display layout, shop-a-docket vouchers, catalogues and so on.

Market Basket Analysis: Sample Data

  transid  custid  item
  111      201     pen
  111      201     ink
  111      201     milk
  111      201     juice
  112      105     pen
  112      105     ink
  112      105     milk
  113      106     pen
  113      106     milk
  114      201     pen
  114      201     ink
  114      201     juice
  114      201     water

(This dataset is from Chapter 26 of R & G.)
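The sample relation stores one row per purchased item, whereas the analysis works on whole baskets. The following minimal sketch shows one way to group the rows by transaction. The names `ROWS` and `build_baskets` are illustrative assumptions, not part of the lecture.

```python
from collections import defaultdict

# Rows of the sample relation as (transid, custid, item) tuples.
ROWS = [
    (111, 201, "pen"), (111, 201, "ink"), (111, 201, "milk"), (111, 201, "juice"),
    (112, 105, "pen"), (112, 105, "ink"), (112, 105, "milk"),
    (113, 106, "pen"), (113, 106, "milk"),
    (114, 201, "pen"), (114, 201, "ink"), (114, 201, "juice"), (114, 201, "water"),
]


def build_baskets(rows):
    """Group items by transaction id; each basket is a set of items."""
    baskets = defaultdict(set)
    for transid, _custid, item in rows:
        baskets[transid].add(item)
    return dict(baskets)


baskets = build_baskets(ROWS)
for transid, items in sorted(baskets.items()):
    print(transid, sorted(items))
```

Using a set per transaction means a repeated item within one transaction would be counted once, which matches the itemset view used in the rest of the lecture.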

Market Basket Analysis: Terminology
An itemset is a set of one or more items: for example {pen} is an itemset, as is {milk, juice}.
The support of an itemset is the percentage of transactions in the database that contain all of the items in the itemset. For example:
  The itemset {pen} has support 100%: pens are purchased in all 4 transactions.
  The itemset {pen, juice} has support 50%: pen and juice are purchased together in 2 of the 4 transactions.
  The itemset {pen, ink} has support 75%: pen and ink are purchased together in 3 of the 4 transactions.
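As a small illustration of the definition of support, the sketch below counts the baskets that contain every item of an itemset. The list `BASKETS` restates the four sample transactions; the function name `support` is an assumption made for this example, and it returns a fraction rather than a percentage.

```python
# The four sample transactions, one set of items per transaction.
BASKETS = [
    {"pen", "ink", "milk", "juice"},   # transaction 111
    {"pen", "ink", "milk"},            # transaction 112
    {"pen", "milk"},                   # transaction 113
    {"pen", "ink", "juice", "water"},  # transaction 114
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for basket in baskets if itemset <= basket)
    return hits / len(baskets)


print(support({"pen"}, BASKETS))           # 1.0
print(support({"pen", "juice"}, BASKETS))  # 0.5
print(support({"pen", "ink"}, BASKETS))    # 0.75
```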

Market Basket Analysis: Frequent Itemsets
The first step of mining for association rules is to identify frequent itemsets, that is, all itemsets that have support at least equal to some user-defined minimum support.
In this example, setting minimum support to 70%, we would get

  Itemset       Support
  {pen}         100%
  {ink}         75%
  {milk}        75%
  {pen, ink}    75%
  {pen, milk}   75%
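For a toy dataset the table above can be reproduced by checking every combination of items. The sketch below does exactly that; the function name `frequent_itemsets_brute_force` is hypothetical, and the approach is only meant for small examples.

```python
from itertools import combinations

BASKETS = [
    {"pen", "ink", "milk", "juice"},   # transaction 111
    {"pen", "ink", "milk"},            # transaction 112
    {"pen", "milk"},                   # transaction 113
    {"pen", "ink", "juice", "water"},  # transaction 114
]


def frequent_itemsets_brute_force(baskets, min_support):
    """Check every non-empty combination of items against min_support."""
    items = sorted(set().union(*baskets))
    result = {}
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            candidate = frozenset(combo)
            s = sum(1 for b in baskets if candidate <= b) / len(baskets)
            if s >= min_support:
                result[candidate] = s
    return result


frequent = frequent_itemsets_brute_force(BASKETS, 0.70)
for itemset, s in frequent.items():
    print(sorted(itemset), f"{s:.0%}")
```

With five distinct items this tests 31 candidate itemsets, each against all four baskets.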

Market Basket Analysis: Finding Frequent Itemsets
In a toy example like this, it is simple to just check every possible combination of the items, but this process does not scale: n distinct items give 2^n - 1 non-empty itemsets to check.
However, it is easy to devise a straightforward algorithm based on the a priori property:
  Every subset of a frequent itemset is also a frequent itemset.
The algorithm proceeds by first finding single-element frequent itemsets and then extending them, element by element, until they are no longer frequent.

Market Basket Analysis: Sample run
The first scan of our sample relation yields the three itemsets
  {pen}, {ink}, {milk}
In the second step we augment each of these by an additional item that is itself a frequent item, and then check each of the itemsets
  {pen, ink}, {milk, ink}, {pen, milk}
thereby determining that
  {pen, ink}, {pen, milk}
are frequent itemsets.

Market Basket Analysis: Sample run cont.
Now the only possible 3-item itemset to check would be
  {pen, ink, milk}
but this can be rejected immediately because it contains a subset
  {milk, ink}
that is not itself frequent.
With a huge database, the scan to check the frequency of each candidate itemset dominates the time taken, and hence eliminating an itemset without a scan is very useful.
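The sample run can be traced with a short level-wise sketch based on the a priori property. This is an illustrative simplification rather than a tuned implementation: the function name `apriori` and the returned `pruned` list are assumptions made for this example, and candidates are generated by simply joining pairs of frequent itemsets.

```python
from itertools import combinations

BASKETS = [
    {"pen", "ink", "milk", "juice"},   # transaction 111
    {"pen", "ink", "milk"},            # transaction 112
    {"pen", "milk"},                   # transaction 113
    {"pen", "ink", "juice", "water"},  # transaction 114
]


def apriori(baskets, min_support):
    """Return (frequent itemsets with support, candidates pruned without a scan)."""
    n = len(baskets)

    def scan(candidates):
        # One pass over the data: keep candidates that meet min_support.
        kept = {}
        for c in candidates:
            s = sum(1 for b in baskets if c <= b) / n
            if s >= min_support:
                kept[c] = s
        return kept

    frequent, pruned = {}, []
    current = scan({frozenset([item]) for b in baskets for item in b})
    k = 1
    while current:
        frequent.update(current)
        k += 1
        # Join frequent (k-1)-itemsets into candidate k-itemsets.
        joined = {a | b for a in current for b in current if len(a | b) == k}
        candidates = set()
        for c in joined:
            # A priori property: every (k-1)-subset must itself be frequent.
            if all(frozenset(s) in current for s in combinations(c, k - 1)):
                candidates.add(c)
            else:
                pruned.append(c)
        current = scan(candidates)
    return frequent, pruned


frequent, pruned = apriori(BASKETS, 0.70)
print(sorted(sorted(i) for i in frequent))
print([sorted(i) for i in pruned])
```

On the sample data the only pruned candidate is {pen, ink, milk}, which is rejected because {milk, ink} is not frequent, so no third scan of the data is needed.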

Association Rules
An association rule is an expression such as
  {pen} ⇒ {ink}
indicating that the occurrences of ink in a transaction are associated with the occurrences of pen.
The overall aim of market basket analysis is to try to find association rules in the data.
If an association rule extracted from the data represents a genuine pattern in shopper behaviour, then this can be used in a variety of ways.
Although the terminology is all about shopper behaviour, the concepts can easily be translated to more significant projects such as trying to associate behaviour or diet with disease or mortality.

Association Rules: Support of a rule
Suppose that X and Y are itemsets. Then the support of the association rule X ⇒ Y is the support of the itemset X ∪ Y.
Thus the support of the rule
  {pen} ⇒ {ink}
is 75%.
Normally market basket analysis is only concerned with association rules involving frequent itemsets, because while there may be a very strong association between, say, lobster and champagne, this will not form a large proportion of sales.

Association Rules: Confidence of a rule
The confidence of a rule X ⇒ Y is the proportion of transactions involving X that also involve Y.
In other words, if s(X) denotes the support of X, then the confidence is
  s(X ∪ Y) / s(X)
Thus in our example, {pen} ⇒ {ink} has a confidence level of 75/100 = 75%, whereas the rule {ink} ⇒ {pen} has confidence 100%.
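The two measures can be checked on the sample data with a few lines of code. The sketch below works with transaction counts so that the ratio s(X ∪ Y) / s(X) is computed without intermediate rounding; the function names `count`, `rule_support` and `confidence` are assumptions made for this example.

```python
BASKETS = [
    {"pen", "ink", "milk", "juice"},   # transaction 111
    {"pen", "ink", "milk"},            # transaction 112
    {"pen", "milk"},                   # transaction 113
    {"pen", "ink", "juice", "water"},  # transaction 114
]


def count(itemset, baskets):
    """Number of baskets that contain every item in itemset."""
    itemset = set(itemset)
    return sum(1 for basket in baskets if itemset <= basket)


def rule_support(x, y, baskets):
    """Support of the rule X => Y, i.e. the support of X union Y."""
    return count(set(x) | set(y), baskets) / len(baskets)


def confidence(x, y, baskets):
    """Proportion of the baskets containing X that also contain Y."""
    return count(set(x) | set(y), baskets) / count(x, baskets)


print(rule_support({"pen"}, {"ink"}, BASKETS))  # 0.75
print(confidence({"pen"}, {"ink"}, BASKETS))    # 0.75
print(confidence({"ink"}, {"pen"}, BASKETS))    # 1.0
```

Note that `confidence` divides by the number of baskets containing X, so it is only defined for itemsets X that occur at least once.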

Association Rules: Beer and Nappies
One of the best-known marketing stories is the (possibly apocryphal) story of beer and nappies.
A Wal-Mart manager noticed one Friday that a lot of customers were buying both beer and nappies. Analysing past transaction data showed that while beer and nappies were not particularly associated during the week, there was a sudden upsurge in the association on Friday evenings.
Thinking about why there might be this association, the manager concluded that because nappies are heavy and bulky, the job of buying nappies was often left to fathers who picked them up after work on Fridays, and also stocked up on beer for the weekend.

Association Rules: Cross-selling
The manager responded to this information by putting the premium beer displays and specials right next to the nappy aisle.
The fathers who previously bought regular beer were now encouraged to buy the premium beer, and some of the fathers who hadn't even thought about beer started to buy it.
This version of the story is paraphrased from http://www.information-drivers.com/market_basket_analysis.htm.

Association Rules: Interpreting association rules
The 100% confidence level of the rule {ink} ⇒ {pen} indicates that, in this data, shoppers who buy ink always buy a pen as well.
How should an association rule of this type be interpreted?
Clearly there is a high correlation between buying pens and buying ink.
When faced with a high correlation, it is tempting, but incorrect, to assume that the rule indicates a causal relationship such as
  "Buying ink causes people to buy pens"

Association Rules: Example scenario
A pizza restaurant records the following sales for pizzas with extra toppings, in various combinations. The toppings are mushrooms (M), pepperoni (P) and extra cheese (C).

  Menu item  Pizza sales  Extra toppings
  1          100          M
  2          150          P
  3          200          C
  4          400          M & P
  5          300          M & C
  6          200          P & C
  7          100          M, P & C
  8          550          None
  Total      2000

Association Rules: Required analysis
Complete a market basket analysis to answer the following questions:
  1 Find the frequent itemsets with minimum support 40%.
  2 Find association rules with minimum confidence 50%.
  3 What is the strongest inference we can make about consumer behaviour when choosing extra toppings?

Association Rules: Support - single item sets
1 Find the frequent itemsets with minimum support 40%.
  s({M}) = (100 + 400 + 300 + 100) / 2000 = 45%
  s({P}) = (150 + 400 + 200 + 100) / 2000 = 42.5%
  s({C}) = (200 + 300 + 200 + 100) / 2000 = 40%

Association Rules: Support - two item sets
1 Find the frequent itemsets with minimum support 40%.
  s({M, P}) = (400 + 100) / 2000 = 25%
  s({M, C}) = (300 + 100) / 2000 = 20%
  s({P, C}) = (200 + 100) / 2000 = 15%
None of the two-item sets reaches the 40% minimum support, so the frequent itemsets are {M}, {P} and {C}.
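The support figures for the pizza data can be recomputed from the sales table. In the sketch below each menu item is keyed by its set of extra toppings; the names `SALES`, `support_pct` and `supports` are assumptions made for this example.

```python
from itertools import combinations

# Pizza sales keyed by the set of extra toppings on each menu item.
SALES = {
    frozenset("M"): 100,
    frozenset("P"): 150,
    frozenset("C"): 200,
    frozenset("MP"): 400,
    frozenset("MC"): 300,
    frozenset("PC"): 200,
    frozenset("MPC"): 100,
    frozenset(): 550,  # no extra toppings
}
TOTAL = sum(SALES.values())  # 2000 pizzas


def support_pct(itemset):
    """Percentage of all pizzas sold that include every topping in itemset."""
    itemset = frozenset(itemset)
    sold = sum(n for toppings, n in SALES.items() if itemset <= toppings)
    return 100 * sold / TOTAL


supports = {
    "".join(combo): support_pct(combo)
    for size in (1, 2, 3)
    for combo in combinations("MPC", size)
}
frequent = [name for name, pct in supports.items() if pct >= 40]

for name, pct in supports.items():
    print(name, pct)
print("frequent:", frequent)
```

A pizza with toppings M, P and C counts towards every itemset that is a subset of {M, P, C}, which is why menu item 7 appears in each of the sums above.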

Association Rules: Rule confidence
2 Find association rules with minimum confidence 50%.
  M ⇒ P = s({M, P}) / s({M}) = 25 / 45 = 55.6%
  P ⇒ M = s({M, P}) / s({P}) = 25 / 42.5 = 58.8%
  C ⇒ M = s({M, C}) / s({C}) = 20 / 40 = 50%
All other associations are < 50%.
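The confidence values can be checked in the same way, directly from the sales counts. The sketch below considers only rules between single toppings; the names `SALES`, `sold`, `confidence`, `rules`, `strong` and `best` are assumptions made for this example.

```python
from itertools import permutations

# Pizza sales keyed by the set of extra toppings on each menu item.
SALES = {
    frozenset("M"): 100,
    frozenset("P"): 150,
    frozenset("C"): 200,
    frozenset("MP"): 400,
    frozenset("MC"): 300,
    frozenset("PC"): 200,
    frozenset("MPC"): 100,
    frozenset(): 550,  # no extra toppings
}


def sold(itemset):
    """Number of pizzas sold that include every topping in itemset."""
    itemset = frozenset(itemset)
    return sum(n for toppings, n in SALES.items() if itemset <= toppings)


def confidence(x, y):
    """Confidence of the rule X => Y, computed from raw sales counts."""
    return sold(frozenset(x) | frozenset(y)) / sold(x)


# All six rules between two distinct single toppings.
rules = {f"{x} => {y}": confidence(x, y) for x, y in permutations("MPC", 2)}
strong = {rule: c for rule, c in rules.items() if c >= 0.5}
best = max(rules, key=rules.get)

for rule, c in rules.items():
    print(rule, f"{c:.1%}")
print("strongest:", best)
```

Working from counts gives the same ratios as dividing the support percentages, since the common factor of 2000 cancels.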

Association Rules: Inference
3 What is the strongest inference we can make about consumer behaviour when choosing extra toppings?
People who order a pizza with extra pepperoni are likely to order extra mushrooms as well (P ⇒ M has the highest confidence, 58.8%).