Data Mining & Knowledge Discovery: Personalization and Profiling Technologies

Predictive Modeling and Knowledge Discovery via Data Mining

- A black box that makes predictions about the future based on information from the past and present
- Example: inputs such as age, balance, and income go into the model (a crystal ball?), which answers: How much will the customer spend on the next catalog order?

Preprocessing and Mining

Original Data -> [Data Integration and Selection] -> Target Data (Data Warehouse) -> [Preprocessing] -> Preprocessed Data -> [Model Construction] -> Patterns -> [Interpretation] -> Knowledge

Why DM for E-Businesses?

- CRM (Customer Relationship Management) is an important success factor in e-commerce
  - price differentiation is no longer enough
  - customer service is more important
- Links with suppliers already exist (B2B): JIT, joint forecasting, planning, procurement
- Current emphasis is on links with customers: feedback, input in design, etc.

CRM

- Goal: Positively interact with your customers and prospects
  - define customer segments
  - know profitable customers
  - better service for valuable customers
  - "lights out" execution of campaigns against segments
  - attribution and evaluation of responses

Personalization in E-Business

- Positive: much better chance of personalization
  - customer identification
  - tracking across visits and within a visit
  - ability to do "what if" experiments
- Negative:
  - cost of switching is much less
  - is web-based shopping good for "touchy feely" things?
  - price differentiation across geographies is not easy

Use of Personalization

- In addition to storing and retrieving information on the individual's profile on the fly, one can also use mining software to analyze the information in the database to make recommendations or comments specific to the individual
- Learn about your customers

Personalization (figure)

- Identify: login, credit card #
- Know the customer: questionnaires, past history, click streams, profile, mapping to peers
- Predicting the wants: extrapolation from past, extrapolation from peers (firefly.com)
- Give the customer his/her wants: look & feel, product selection & promotions, new product

Storefront Personalization

- Basic storefront product hierarchy: Clothes
  - Men's: Shirts, Pants
  - Women's: Casuals, Evening
  - Children's: Infants, Kids
- Personalized views of the hierarchy: John's view, Mary's view

Security and Privacy as Barrier to Personalization

- Large number of customers are concerned about personalization
- Will they pay more to preserve privacy?
- Some falsify info to preserve privacy
  - Can we still deduce patterns?
  - Guest lecture on privacy will address some of these issues
- Customers give more info to a trusted site
- Need a secure site with clear privacy policies stated at the site

Data Mining Steps and Techniques

- Data Warehouse
- On-Line Analytical Processing (OLAP)
- Mining techniques
  - Market Basket Analysis
  - Cluster Detection
  - Decision Trees and Rule Induction
  - Neural Networks
  - Genetic Algorithms

Data Warehouse

- A warehouse is a large database
- First collect the data that you will analyze
  - goal: find, summarize, and interpret large amounts of data effectively
  - data is distributed across many different databases
- Cannot get the data directly from the transaction processing (TP) system; cannot interfere with its timing
  - get it from an off-line source: old TP data
- Integrate data from various TPS into a single store, which can then be analyzed
- Is this integration/warehousing easy?
  - Noise, missing data, redundancy, data transformation (numeric to text, etc.)

Legal and Ethical Issues

- Privacy concerns are becoming more important
  - will impact the way that data can be used and analyzed
  - ownership issues
- Often data included in the data warehouse cannot legally be used in the decision-making process
  - race, gender, age
- Data contamination will become critical

OLAP: OnLine Analytical Processing

- Mainly a data presentation concept
- Way of presenting data to facilitate understanding any patterns inside it
  - provides for drilling down into the data, starting from summarized views
- OLAP lets you look at data and manipulate it interactively
- OLAP allows users to "slice and dice" data
  - allows the user to drill down into detail data
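The OLAP operations named above (summarized views, slicing, drilling down) can be illustrated with a minimal sketch. The `SALES` records, the dimension names, and the helpers `roll_up` and `slice_rows` are all hypothetical and chosen only for illustration; a real OLAP tool would work against a warehouse, not an in-memory list.

```python
from collections import defaultdict

# Hypothetical fact records: three dimensions plus one measure (amount).
SALES = [
    {"region": "East", "product": "Shirts", "quarter": "Q1", "amount": 100},
    {"region": "East", "product": "Pants", "quarter": "Q1", "amount": 50},
    {"region": "West", "product": "Shirts", "quarter": "Q1", "amount": 70},
    {"region": "West", "product": "Shirts", "quarter": "Q2", "amount": 30},
]


def roll_up(rows, dims):
    """Sum the measure over every combination of the chosen dimensions."""
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[d] for d in dims)] += row["amount"]
    return dict(totals)


def slice_rows(rows, dim, value):
    """Slice: keep only the rows where one dimension has a fixed value."""
    return [row for row in rows if row[dim] == value]


# Summarized view: totals per region.
by_region = roll_up(SALES, ["region"])

# Drill down: the same totals broken out by one more dimension.
by_region_product = roll_up(SALES, ["region", "product"])

# Slice on quarter = Q1, then summarize by product.
q1_by_product = roll_up(slice_rows(SALES, "quarter", "Q1"), ["product"])

print(by_region)
print(by_region_product)
print(q1_by_product)
```

Drilling down here is simply a roll-up over a longer list of dimensions, which is one way to read "starting from summarized views".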

Relational vs Multidimensional (figure)

OLAP Consolidations (figure)

Data Mining techniques

- How do you define a pattern of interest?
- Three common techniques
  - Prediction and Classification
  - Affinity Grouping
  - Clustering

Data Mining techniques: Prediction and Classification

- Prediction and Classification
  - Directed: if A then B
  - Examples:
    - Forecast: How many units will be sold on a given day? What will be the stock price on a given day?
    - Will a customer buy the product or not?
    - Who are the high-paying customers of T-Mobile?
    - Detect sequences:
      - balance increase -> missed payment -> default
      - paid in cash -> one-way ticket -> question the passenger

DM techniques: Affinity Grouping

- Affinity grouping
  - Undirected
  - Which products go together naturally?
  - Market basket analysis
  - Examples: Which products peak in demand simultaneously?

DM techniques: Clustering

- Clustering task
  - Undirected
  - Segmenting into similar clusters
  - Different from classification
  - Examples
    - Customers with similar buying profiles
    - Products with similar demand patterns
  - Classification and segmentation: yes/no

Data Mining Algorithms

- Commonly cited algorithms
  - Association Rule (used in over 90% of the cases!)
  - Nearest Neighbor: quick and easy, but models get large
  - Cluster detection: using geometric and statistical methods
  - Decision Tree: used for prediction techniques
  - Neural Network: difficult to interpret and takes a large amount of time

Decision Trees

- Series of if/then rules
  - easy to understand, complexity in implementation
- Example tree (figure): splits on Balance (< 10K vs > 10K) and Age (< 48 vs > 48), with yes/no leaves

Cluster Detection

- Undirected data mining
- Finds records that are similar to each other (clusters)
- Clusters are found using geometric methods, statistical methods, and neural networks
- Good way to start any analysis

Market Basket Analysis: Association Rule Mining

- Form of clustering used for finding items that occur together (in a transaction or market basket)
- Expresses the likelihood of different products being purchased together as rules
- Uses: planning store layouts, limiting specials to one of the products in a set, ...

Supermarket Transaction data

Customer  Products
1         Milk, Soda
2         Milk, Beer, Diapers
3         Milk, Cleaner
4         Beer, Diapers, Soda
5         Beer, Soda

Association Rules: Support and Confidence Parameters

- Looking for a rule that says: If A then B
- Support is defined as the ratio of the number of transactions that include both A and B to the total number of transactions
  - How useful are A and B?
- Confidence is defined as the ratio of the number of transactions that include both A and B to the number of transactions that include A
  - if A then B with probability p
- How do you specify significant support and confidence?
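The two definitions above can be checked directly against the five supermarket transactions. This is a minimal sketch; the function names `support` and `confidence` and the lowercase item spellings are illustrative choices, not part of the slides.

```python
# The five supermarket transactions from the slide, one set per customer.
TRANSACTIONS = [
    {"milk", "soda"},
    {"milk", "beer", "diapers"},
    {"milk", "cleaner"},
    {"beer", "diapers", "soda"},
    {"beer", "soda"},
]


def support(transactions, a, b):
    """Fraction of all transactions that contain both a and b."""
    both = sum(1 for t in transactions if a in t and b in t)
    return both / len(transactions)


def confidence(transactions, a, b):
    """Fraction of the transactions containing a that also contain b."""
    with_a = sum(1 for t in transactions if a in t)
    both = sum(1 for t in transactions if a in t and b in t)
    return both / with_a if with_a else 0.0


beer_diapers_support = support(TRANSACTIONS, "beer", "diapers")        # 2 of 5
beer_diapers_confidence = confidence(TRANSACTIONS, "beer", "diapers")  # 2 of 3

print(f"support    = {beer_diapers_support:.2f}")
print(f"confidence = {beer_diapers_confidence:.2f}")
```

Note that support is symmetric in A and B while confidence is not: "if diapers then beer" has the same support but a different denominator.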

Association Rule example

- Using the sample data, create a co-occurrence table
  - What items occur together?

Co-occurrence Matrix

          Milk  Soda  Beer  Diapers  Cleaner
Milk        3     1     1      1        1
Soda        1     3     2      1        0
Beer        1     2     3      2        0
Diapers     1     1     2      2        0
Cleaner     1     0     0      0        1

Association Rule example

- Using the sample data, create a co-occurrence table
- Let relevant Support = 25% and Confidence = 50%:
  - Beer and Diapers appear together in 2/5 = 40% of transactions
  - "If beer then diapers" has confidence of 2/3 = 67%
  - Thus, "If customer buys beer then customer buys diapers" satisfies 25% support and 50% confidence
- Conclusion drawn by mining system: Customers who buy beer also buy diapers

Applying the Results

- Is the relationship useful?
  - Beer and Diapers may not be of use
  - Retailers use transaction mining to send specific apparel to specific stores (Microstrategy software)
- Who defines usefulness?
  - only as good as the rules specified by humans/marketing workforce
  - NBA mining: designers of the software did not include height mismatches at first; coaches made the correction

Algorithm for Finding Association Rules

- Input is Min-Support and Min-Confidence
- Find all sets of items with Min-Support (frequent itemsets)
  - Can mean searching all subsets?
- Frequent Itemsets Property: Every subset of a frequent itemset must also be a frequent itemset
  - Iterative algorithm: start with frequent itemsets with one item, and construct larger itemsets using only smaller frequent itemsets

Data Mining: Summary

- "Using the new media of the one-to-one future, you will be able to communicate directly with customers individually.." (Don Peppers & Martha Rogers, One-to-One Future)
- "What are you afraid of?.. even if you're not afraid of these things, the beauty is, with proper marketing, we can make you afraid" (Michael Saylor, CEO Microstrategy)
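The iterative search described under "Algorithm for Finding Association Rules" can be sketched as follows. This level-wise approach is commonly known as the Apriori algorithm, although the slides do not name it. The function name `frequent_itemsets` is hypothetical, and the sketch recounts support by scanning every transaction, which is fine for five baskets but not tuned for large data.

```python
from itertools import combinations

# The supermarket transactions used earlier in the slides.
TRANSACTIONS = [
    frozenset({"milk", "soda"}),
    frozenset({"milk", "beer", "diapers"}),
    frozenset({"milk", "cleaner"}),
    frozenset({"beer", "diapers", "soda"}),
    frozenset({"beer", "soda"}),
]


def frequent_itemsets(transactions, min_support):
    """Return {itemset: support} for every itemset meeting min_support."""
    n = len(transactions)

    def sup(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent itemsets with a single item.
    items = sorted({item for t in transactions for item in t})
    current = [frozenset([i]) for i in items if sup(frozenset([i])) >= min_support]

    frequent = {}
    k = 1
    while current:
        for itemset in current:
            frequent[itemset] = sup(itemset)
        k += 1
        previous = set(current)
        # Build size-k candidates only from frequent itemsets of size k-1.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Frequent Itemsets Property: drop any candidate that has an
        # infrequent (k-1)-subset, without counting its support at all.
        candidates = [
            c for c in candidates
            if all(frozenset(s) in previous for s in combinations(c, k - 1))
        ]
        current = [c for c in candidates if sup(c) >= min_support]
    return frequent


result = frequent_itemsets(TRANSACTIONS, min_support=0.25)
for itemset, s in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), f"{s:.2f}")
```

The pruning step is what avoids "searching all subsets": a three-item candidate is only counted if all three of its two-item subsets were already found frequent.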

Information Retrieval

- A research field traditionally separate from Databases
  - Goes back to IBM, Rand and Lockheed in the 50's
- Products traditionally separate
  - Originally, document management systems for libraries, government, law, etc.
  - Gained prominence in recent years due to web search
- Multimedia and spatial data add a new dimension

Text Indexes

- When IR folks say "text index"
  - they usually mean more than what DB people mean
- In our terms, both "tables" and indexes
  - Really a logical schema (i.e., tables)
  - With a physical schema (i.e., indexes)
  - Usually not stored in a DBMS
    - Tables implemented as files in a file system

A Simple Relational Text Index

- Create and populate a table InvertedFile(term string, docurl string)
- Build a B+-tree or Hash index on InvertedFile.term
  - Note: URL instead of RID, the web is your "heap file"!
  - Can also cache pages and use RIDs
- This is often called an "inverted file" or "inverted index"
  - Maps from words -> docs
- Can now do single-word text search queries!

An Inverted File

- Search for "databases", "microsoft"
- Figure: a term column (data, database, date, day, dbms, decision, demonstrate, description, design, desire, developer, differ, disability, discussion, division, do, document) alongside a docURL column (e.g., http://www.microsoft.com)
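The InvertedFile(term, docurl) idea can be sketched in memory, with a Python dict standing in for the B+-tree or hash index on the term column. The document URLs, the sample texts, and the helper names `build_inverted_file` and `search` are hypothetical; tokenization is deliberately naive (lowercased runs of letters and digits).

```python
import re
from collections import defaultdict


def build_inverted_file(docs):
    """Map each term to the sorted list of document URLs that contain it.

    docs: mapping of docurl -> document text.
    """
    index = defaultdict(set)
    for url, text in docs.items():
        for term in re.findall(r"[a-z0-9]+", text.lower()):
            index[term].add(url)
    return {term: sorted(urls) for term, urls in index.items()}


def search(index, term):
    """Single-word text search: look the term up in the inverted file."""
    return index.get(term.lower(), [])


# Hypothetical documents, keyed by URL.
DOCS = {
    "http://example.com/a": "Databases and database design",
    "http://example.com/b": "Microsoft database developer documents",
    "http://example.com/c": "Glass door and windows",
}
INDEX = build_inverted_file(DOCS)

print(search(INDEX, "database"))
print(search(INDEX, "Microsoft"))
```

Keeping each posting list sorted is a design choice that pays off for Boolean queries, where lists are merged rather than probed.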

Handling Boolean Logic

- How to do "term1" OR "term2"?
  - Union of two DocURL sets!
- How to do "term1" AND "term2"?
  - Intersection of two DocURL sets!
  - Can be done by sorting both lists alphabetically and merging the lists
- How to do "term1" AND NOT "term2"?
  - Set subtraction, also done via sorting
- How to do "term1" OR NOT "term2"?
  - Union of "term1" and "NOT term2"
  - "NOT term2" = all docs not containing term2. Large set!!
  - Usually not allowed!
- Refinement: What order to handle terms if you have many ANDs/NOTs?

Boolean Search in SQL

- Query: "Windows" AND ("Glass" OR "Door") AND NOT "Microsoft"

    (SELECT docurl FROM InvertedFile WHERE term = 'windows'
     INTERSECT
     SELECT docurl FROM InvertedFile WHERE term = 'glass' OR term = 'door')
    EXCEPT
    SELECT docurl FROM InvertedFile WHERE term = 'Microsoft'
    ORDER BY relevance()
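The sort-and-merge approach to OR, AND, and AND NOT can be sketched over sorted, duplicate-free posting lists. The `POSTINGS` data and the document names are hypothetical; the three merge helpers assume their inputs are already sorted.

```python
def union(a, b):
    """term1 OR term2: merge two sorted lists, keeping each doc once."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def intersect(a, b):
    """term1 AND term2: keep docs that appear in both sorted lists."""
    i = j = 0
    out = []
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def difference(a, b):
    """term1 AND NOT term2: keep docs of a that do not appear in b."""
    i = j = 0
    out = []
    while i < len(a):
        if j == len(b) or a[i] < b[j]:
            out.append(a[i])
            i += 1
        elif a[i] == b[j]:
            i += 1
            j += 1
        else:
            j += 1
    return out


# Hypothetical sorted posting lists, one per term.
POSTINGS = {
    "windows": ["doc1", "doc2", "doc3", "doc4"],
    "glass": ["doc1", "doc3"],
    "door": ["doc2", "doc5"],
    "microsoft": ["doc2", "doc4"],
}

# "Windows" AND ("Glass" OR "Door") AND NOT "Microsoft"
glass_or_door = union(POSTINGS["glass"], POSTINGS["door"])
hits = difference(intersect(POSTINGS["windows"], glass_or_door), POSTINGS["microsoft"])
print(hits)
```

Each merge makes a single pass over both lists, so cost grows with the combined list lengths; this is one reason the order in which many ANDs and NOTs are evaluated matters.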

Boolean Search in SQL

- Really only one SQL query in Boolean Search IR: single-table selects, UNION, INTERSECT, EXCEPT
- relevance() is the "secret sauce" in the search engines:
  - Combos of statistics, linguistics, and graph theory tricks!
  - Unfortunately, not easy to compute this efficiently using a typical DBMS implementation

Relevance/Ranking?

- Lots of variation here
  - Often messy; details proprietary and fluctuating
- Relevance calculation involves how often search terms appear in the doc, and how often they appear in the collection:
  - More search terms found in doc -> doc is more relevant
  - Greater importance attached to finding rare terms
  - Doing this efficiently in current SQL engines is not easy
- Combining subsets of:
  - IR-style relevance: based on term frequencies, proximities, position (e.g., in title), font, etc.
  - Popularity information
  - Link analysis information
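One simple way to combine "how often search terms appear in the doc" with "greater importance attached to finding rare terms" is TF-IDF weighting. The slides do not say which formula any engine uses, so this is an assumed stand-in and not the relevance() function of a real system. The documents, the function name `tf_idf_scores`, and the particular weighting (raw term count times log(N / document frequency)) are illustrative choices.

```python
import math
from collections import Counter


def tf_idf_scores(docs, query_terms):
    """Score each document against the query terms.

    docs: mapping of doc id -> list of lowercase terms.
    Score = sum over query terms of (count in doc) * log(N / docs containing term).
    """
    n = len(docs)
    doc_freq = Counter()
    for terms in docs.values():
        doc_freq.update(set(terms))  # count each term once per document

    scores = {}
    for doc_id, terms in docs.items():
        term_freq = Counter(terms)
        score = 0.0
        for q in query_terms:
            if term_freq[q]:
                # Rare terms (small doc_freq) get a larger weight.
                score += term_freq[q] * math.log(n / doc_freq[q])
        scores[doc_id] = score
    return scores


# Hypothetical tokenized documents.
DOCS = {
    "d1": ["glass", "door", "glass"],
    "d2": ["door", "window"],
    "d3": ["window", "frame"],
}

scores = tf_idf_scores(DOCS, ["glass", "door"])
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

In this toy collection "glass" occurs in one document and "door" in two, so a match on "glass" contributes more to the score than a match on "door".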

Relevance: Going Beyond IR

- Page "popularity" (e.g., DirectHit)
  - Frequently visited pages (in general)
  - Frequently visited pages as a result of a query
- Link "co-citation" (e.g., Google)
  - Which sites are linked to by other sites?
  - Draws upon sociology research on bibliographic citations to identify "authoritative sources"

Standard Web Search Engine Architecture (figure)

- Crawl the web
- Check for duplicates, store the documents (DocIds)
- Create an inverted index
- Search engine servers take the user query, consult the inverted index, and show results to the user