Data Mining as Part of Knowledge Discovery in Databases (KDD)



Similar documents
Data Mining. Vera Goebel. Department of Informatics, University of Oslo

Knowledge Discovery in Databases (KDD) - Data Mining (DM)

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP

Introduction. A. Bellaachia Page: 1

Introduction to Data Mining

A Review of Data Mining Techniques

Week 3 lecture slides

Data W a Ware r house house and and OLAP II Week 6 1

Course DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Foundations of Business Intelligence: Databases and Information Management

Data Warehouse design

Chapter 5. Warehousing, Data Acquisition, Data. Visualization

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Data Mining: Concepts and Techniques. Jiawei Han. Micheline Kamber. Simon Fräser University К MORGAN KAUFMANN PUBLISHERS. AN IMPRINT OF Elsevier

from Larson Text By Susan Miertschin

not possible or was possible at a high cost for collecting the data.

DATA WAREHOUSING AND OLAP TECHNOLOGY

Database Marketing, Business Intelligence and Knowledge Discovery

Data Mining Solutions for the Business Environment

Fluency With Information Technology CSE100/IMT100

Data Mining is sometimes referred to as KDD and DM and KDD tend to be used as synonyms

BUILDING BLOCKS OF DATAWAREHOUSE. G.Lakshmi Priya & Razia Sultana.A Assistant Professor/IT

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.

Building Data Cubes and Mining Them. Jelena Jovanovic

An Overview of Knowledge Discovery Database and Data mining Techniques

5.5 Copyright 2011 Pearson Education, Inc. publishing as Prentice Hall. Figure 5-2

Importance or the Role of Data Warehousing and Data Mining in Business Applications

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Data Warehousing and Data Mining in Business Applications

An Overview of Database management System, Data warehousing and Data Mining

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH

Information Management course

Data Mining System, Functionalities and Applications: A Radical Review

Outlines. Business Intelligence. What Is Business Intelligence? Data mining life cycle

DATA WAREHOUSE E KNOWLEDGE DISCOVERY

OLAP & DATA MINING CS561-SPRING 2012 WPI, MOHAMED ELTABAKH

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 16 - Data Warehousing

DATA WAREHOUSING - OLAP

Data Mining Algorithms Part 1. Dejan Sarka

Foundations of Business Intelligence: Databases and Information Management

Distance Learning and Examining Systems

Week 13: Data Warehousing. Warehousing

Data Mining for Fun and Profit

SPATIAL DATA CLASSIFICATION AND DATA MINING

Chapter 6 - Enhancing Business Intelligence Using Information Systems

OLAP and OLTP. AMIT KUMAR BINDAL Associate Professor M M U MULLANA

IT and CRM A basic CRM model Data source & gathering system Database system Data warehouse Information delivery system Information users

Chapter 6 FOUNDATIONS OF BUSINESS INTELLIGENCE: DATABASES AND INFORMATION MANAGEMENT Learning Objectives

European Archival Records and Knowledge Preservation Database Archiving in the E-ARK Project

Anwendersoftware Anwendungssoftwares a. Data-Warehouse-, Data-Mining- and OLAP-Technologies. Online Analytic Processing

Nine Common Types of Data Mining Techniques Used in Predictive Analytics

Web Data Mining: A Case Study. Abstract. Introduction

Introduction to Data Warehousing. Ms Swapnil Shrivastava

Data Warehousing. Overview, Terminology, and Research Issues. Joachim Hammer. Joachim Hammer

IBM Cognos 8 Business Intelligence Analysis Discover the factors driving business performance

2074 : Designing and Implementing OLAP Solutions Using Microsoft SQL Server 2000

CHAPTER 5: BUSINESS ANALYTICS

Bussiness Intelligence and Data Warehouse. Tomas Bartos CIS 764, Kansas State University

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

White Paper. How Streaming Data Analytics Enables Real-Time Decisions

Data Mining Techniques

Mobile Phone APP Software Browsing Behavior using Clustering Analysis

Hexaware E-book on Predictive Analytics

Foundations of Business Intelligence: Databases and Information Management

Data Warehouse: Introduction

Concept and Applications of Data Mining. Week 1

Foundations of Business Intelligence: Databases and Information Management

Course MIS. Foundations of Business Intelligence

Chapter 14: Databases and Database Management Systems

1. What are the uses of statistics in data mining? Statistics is used to Estimate the complexity of a data mining problem. Suggest which data mining

Data Mining + Business Intelligence. Integration, Design and Implementation

B.Sc (Computer Science) Database Management Systems UNIT-V

Data Mining for Everyone

Subject Description Form

Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

Data Mining for Successful Healthcare Organizations

CHAPTER 4: BUSINESS ANALYTICS

Data Warehousing and Data Mining

DATA MINING AND WAREHOUSING CONCEPTS

M Designing and Implementing OLAP Solutions Using Microsoft SQL Server Day Course

Introduction to Data Mining

DATA MINING TECHNIQUES AND APPLICATIONS

Data Mining Analytics for Business Intelligence and Decision Support

DATA MINING CONCEPTS AND TECHNIQUES. Marek Maurizio E-commerce, winter 2011

A Survey on Web Research for Data Mining

Data Mining: Overview. What is Data Mining?

Introduction to Data Mining and Business Intelligence Lecture 1/DMBI/IKI83403T/MTI/UI

Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data

TIM 50 - Business Information Systems

DBTech Pro Workshop. Knowledge Discovery from Databases (KDD) Including Data Warehousing and Data Mining. Georgios Evangelidis

Data Warehousing. Outline. From OLTP to the Data Warehouse. Overview of data warehousing Dimensional Modeling Online Analytical Processing

Data Warehousing and OLAP Technology for Knowledge Discovery

Data Mining. 1 Introduction 2 Data Mining methods. Alfred Holl Data Mining 1

BUILDING OLAP TOOLS OVER LARGE DATABASES

SQL Server 2012 Business Intelligence Boot Camp

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Winter 2009 Lecture 15 - Data Warehousing: Cubes

Transcription:

Mining as Part of Knowledge Discovery in bases (KDD) Presented by Naci Akkøk as part of INF4180/3180, Advanced base Systems, fall 2003 (based on slightly modified foils of Dr. Denise Ecklund from 6 November 2002) KDD-1 Lecture Contents KDD: Knowledge Discovery in bases Definition and Applications OLAP: On-Line Analytical Processing Architectures for OLAP and KDD KDD Life Cycle Mining Mechanisms Implications for the base System Literature: Elmasri & Navathe - Section 26.2 Garcia-Molina, Ullman, & Widom - Section 20.6 KDD-2 2002 Denise Ecklund KDD-1

Definition - KDD Knowledge Discovery in bases (KDD) the non-trivial extraction of implicit, previously unknown and potentially useful knowledge from data Why? To find trends and correlations in existing databases, that help you change structures and processes in human organizations to be more effective to save time to make more money to improve product quality to etc. KDD-3 Applications - KDD Marketing Find the most important segmentation for classifying my customers Predict future sales for a specific product Product Maintenance Define a service maintenance contract of interest to a majority of my customers Human Resource Management Define an employee compensation package to increase employee retention to at least 5 years of service For each university department, predict the number of new students that will major in that subject area Finance Detect fraudulent use of credit cards KDD-4 2002 Denise Ecklund KDD-2

Where SQL Stops Short Observe Analyze Theoritize Predict Mining Tools Creates new knowledge in the form of predictions, classifications, associations, time-based patterns OLAP Tools DSS-Extended SQL Queries Analyzes a multi-dimensional view of existing knowledge Extends the DB schema to a limited, multi-dimensional view for Decision Support Systems (DSS) Complex SQL Queries DB schema describes the structure you already know base Successful KDD uses the entire tool hierarchy KDD-5 OnLine Analytical Processing (OLAP) OLAP the dynamic synthesis, analysis, and consolidation of large volumes of multi-dimensional data Focuses on multi-dimensional relationships among existing data records Differs from extended SQL for data warehouses DW operations: roll-up, drill-down, slice, dice, pivot OLAP packages add application-specific analysis Finance depreciation, currency conversion,... Building regulations usable space, electrical loading,... Computer Capacity Planning disk storage requirements, failure rate estimation... KDD-6 2002 Denise Ecklund KDD-3

OnLine Analytic Processing (OLAP) OLAP differs from data mining OLAP tools provide quantitative analysis of multi-dimensional data relationships mining tools create and evaluate a set of possible problem solutions (and rank them) Ex: Propose 3 marketing strategies and order them based on marketing cost and likely sales income Three system architectures are used for OLAP Relational OLAP (ROLAP) Multi-dimensional OLAP (MOLAP) Managed Query Environment (MQE) KDD-7 Relational OLAP Relational base Relational base System On-Demand Queries Result Relations ROLAP Server and Analysis App Relation Cube Runtime Steps: Translate client request into 1 or more SQL queries Present SQL queries to a back-end RDBMS Convert result relations into multi-dimensional datacubes Execute the analysis application and return the results Analysis Request Analysis Output Optional Cache Advantages: No special data storage structures, use standard relational storage structures Uses most current data from the OLTP server Disadvantages: Typically accesses only one RDBMS => no data integration over multiple DBSs Conversion from flat relations to memory-based datacubes is slow and complex Poor performance if large amounts of data are retrieved from the RDBMS KDD-8 2002 Denise Ecklund KDD-4

Multi-dimensional OLAP base System Source Fetch Source Fetch Source Create Cubes Warehouse MOLAP Server and Analysis App Analysis Request Analysis Output Preprocessing Steps: Extract data from multiple data sources Store as a data warehouse, using custom storage structures Runtime Steps: Access datacubes through special index structures Execute the analysis application and returns the results Advantages: Special, multi-dimensional storage structures give good retrieval performance Warehouse integrates clean data from multiple data sources Disadvantages: Inflexible, multi-dimensional storage structures support only one application well Requires people and software to maintain the data warehouse KDD-9 Managed Query Environment (MQE) Relational base Relational base System Cubes MOLAP Server Client Runtime Steps: Fetch data from MOLAP Server, or RDBMS directly Build memory-based data structures, as required Execute the analysis application Advantages: Distributes workload to the clients, offloading the servers Simple to install, and maintain => reduced cost Retrieve Retrieve Cube Cache Analysis App Disadvantages: Provides limited analysis capability (i.e., client is less powerful than a server) Lots of redundant data stored on the client systems Client-defined and cached datacubes can cause inconsistent data Uses lots of network bandwidth KDD-10 2002 Denise Ecklund KDD-5

KDD System Architecture Cubes Fetch Cubes Warehouse Query Multi-dim KDD Server DM-T1 DM-T2 Knowledge Request Transaction DB Multimedia DB Web Web Web Fetch Relations Fetch Relations Fetch Fetch ROLAP Server MOLAP Server Return Multi-dim Values DM-T3 OLAP T1 SQL i/f DM-T4 OLAP T2 nd SQL New Knowledge Need to mine all types of data sources DM tools require multi-dimensional input data General KDD systems also hosts OLAP tools, SQL interface, and SQL for DWs (e.g., nd SQL) KDD-11 Typical KDD Deployment Architecture Cubes Fetch cubes Warehouse Server Query Warehouse Return Cubes KDD Server T1 T2 Knowledge Request Knowledge Runtime Steps: Submit knowledge request Fetch datacubes from the warehouse Execute knowledge acquisition tools Return findings to the client for display Advantages: warehouse provides clean, maintained, multi-dimensional data retrieval is typically efficient warehouse can be used by other applications Easy to add new KDD tools Disadvantages: KDD is limited to data selected for inclusion in the warehouse If DW is not available, use MOLAP server or provide warehouse on KDD server KDD-12 2002 Denise Ecklund KDD-6

KDD Life Cycle Selection and Extraction Cleaning Enrichment Coding Commonly part of data warehouse preparation Additional preparation for KDD Preparation Phases Mining Result Reporting preparation and result reporting are 80% of the work! Based on results you may decide to: Get more data from internal data sources Get additional data from external sources Recode the data KDD-13 Enrichment Integrating additional data from external sources Sources are public and private agencies Government, Credit bureau, Research lab, Typical data examples: Average income by city, occupation, or age group Percentage of homeowners and car owners by A person s credit rating, debt level, loan history, Geographical density maps (for population, occupations, ) New data extends each record from internal sources base issues: More heterogenous data formats to be integrated Understand the semantics of the external source data KDD-14 2002 Denise Ecklund KDD-7

Coding Goal: to streamline the data for effective and efficient processing by the target KDD application Steps: 1) Delete records with many missing values But in fraud detection, missing values are indicators of the behavior you want to discover! 2) Delete extra attributes Ex: delete customer names if looking at customer classes 3) Code detailed data values into categories or ranges based on the types of knowledge you want to discover Ex: divide specific ages into age ranges, 0-10, 11-20, map home addresses to regional codes convert homeownership to yes or no convert purchase date to a month number starting from Jan. 1990 KDD-15 The Mining Process Based on the questions being asked and the required form of the output 1) Select the data mining mechanisms you will use 2) Make sure the data is properly coded for the selected mechnisms Ex. A tool may accept numeric input only 3) Perform rough analysis using traditional tools Create a naive prediction using statistics, e.g., averages The data mining tools must do better than the naive prediction or you are not learning more than the obvious! 4) Run the tool and examine the results KDD-16 2002 Denise Ecklund KDD-8

Mining - Mechanisms Purpose/Use To define classes and predict future behavior of existing instances. To define classes and categorize new instances based on the classes. To identify a cause and predict the effect. To define classes and detect new and old instances that lie outside the classes. Knowledge Type Predictive Modeling base Segmentation Link Analysis Deviation Detection Mechanisms Decision Trees, Neural Networks, Regression Analysis Genetic Algorithms K-nearest Neighbors, Neural Networks, Kohonen Maps Negative Association, Association Discovery, Sequential Pattern Discovery, Matching Time Sequence Discovery Statistics, Visualization, Genetic Algorithms KDD-17 Configuring the KDD Server mining mechanisms are not application-specific, they depend on the target knowledge type The application area impacts the type of knowledge you are seeking, so the application area guides the selection of data mining mechanisms that will be hosted on the KDD server. To configure a KDD server Select an application area Select (or build) a data source Select N knowledge types (types of questions you will ask) For each knowledge type Do Select 1 or more mining tools for that knowledge type Example: Application area: marketing Source: data warehouse of current customer info Knowledge types: classify current customers, predict future sales, predict new customers Mining Tools: decision trees, and neural networks KDD-18 2002 Denise Ecklund KDD-9

Mining Example - base Segmentation Given: a coded database with 1 million records on subscriptions to five types of magazines Goal: to define a classification of the customers that can predict what types of new customers will subscribe to which types of magazines Client# Age Income Credit Car House Owner Owner Home Area House Sports Magaz Car Magaz Magaz Music Comic Magaz Magaz 23 20 18.5 17.8 0 0 1 1 1 0 1 1 29 25 36.0 26.6 1 0 1 0 0 0 0 1 2313 42 24.0 20.8 0 0 1 0 0 1 0 0 2327 31 48.2 24.5 1 1 1 0 1 1 1 0 2328 56 94.0 40.0 1 1 1 1 0 1 0 1 23 43 18.0 7.0 1 0 1 0 0 1 0 0 2333 22 36.3 15.8 1 0 1 1 0 1 0 1........................ KDD-19 Decision Trees for DB Segmentation Which attribute(s) best predicts which magazine(s) a customer subscribes to (sensitivity analysis) Attributes: age, income, credit, car-owner, house-owner, area Classify people who subscribe to a car magazine Car Magaz Only 1% of the people over 44.5 years of age buys a car magazine Age > 44.5 Age <= 44.5 99% 62% of the people under 44.5 years of age buys a car magazine 38% Target advertising segment for car magazines Age > 48.5 Age <= 48.5 Income > 34.5 Income <= 34.5 100% 92% 100% 47% No one over 48.5 years of age buys a car magazine Age > 31.5 Age <= 31.5 Everyone in this DB who has an income of less than $34, 500 AND buys a car magazine is less than 31.5 years of age 46% 0% KDD-20 2002 Denise Ecklund KDD-10

Neural Networks Car Input nodes are connected to output nodes by a set of hidden nodes and edges House Sports Music Inputs describe DB instances Outputs are the categories we want to recognize Input Nodes Hidden Layer Nodes Output Nodes Comic Hidden nodes assign weights to each edge so they represent the weight of relationships between the input and the output of a large set of training data KDD-21 Training and Mining Training Phase (learning) Age < Age < 50 Age 50 Income < 20 20 Income < 35 35 Income < 50 Income 50 Car House Input Nodes Hidden Layer Nodes Output Nodes Car House Sports Music Comic Code all DB data as 1 s and 0 s Set all edge weights to prob = 0 Input each coded database record Check that the output is correct The system adjusts the edge weights to get the correct answer Mining Phase (recognition) Input a new instance coded as 1 s and 0 s Output is the classification of the new instance Issues: Training sample must be large to get good results Network is a black box, it does not tell why an instance gives a particular output (no theory) KDD-22 2002 Denise Ecklund KDD-11

An Explosion of Mining Results Mining tools can output thousands of rules and associations (collectively called patterns) When is a pattern interesting? 1) If it can be understood by humans 2) The pattern is strong (i.e., valid for many new data records) 3) It is potentially useful for your business needs 4) It is novel (i.e., previously unknown) Metrics of interestingness Simplicity: Rule Length for (A B) is the number of conjunctive conditions or the number of attributes in the rule Confidence: Rule Strength for (A B) is the conditional probability that A implies B (#recs with A & B)/(#recs with A) Support: Support for (A B) is the number or percentage of DB records that include A and B Novelty: Rule Uniqueness for (A B) means no other rule includes or implies this rule KDD-23 Genetic Algorithms Based on Darwin s theory of survival of the fittest Living organisms reproduce, individuals evolve/mutate, individuals survive or die based on fitness The output of a genetic algorithm is the set of fittest solutions that will survive in a particular environment The input is an initial set of possible solutions The process Produce the next generation (by a cross-over function) Evolve solutions (by a mutation function) Discard weak solutions (based on a fitness function) KDD-24 2002 Denise Ecklund KDD-12

Classification Using Genetic Algorithms Suppose we have money for 5 marketing campaigns, so we wish to cluster our magazine customers into 5 target marketing groups Customers with similar attribute values form a cluster (assumes similar attributes => similar behavior) Preparation: Define an encoding to represent solutions (i.e., use a character sequence to represent a cluster of customers) Create 5 possible initial solutions (and encode them as strings) Define the 3 genetic functions to operate on a cluster encoding Cross-over( ), Mutate( ), Fitness_Test( ) KDD-25 Genetic Algorithms - Initialization Define 5 initial solutions Use a subset of the database to create a 2-dim scatter plot Map customer attributes to 2 dimensions Divide the plot into 5 regions Calculate an initial solution point (guide point) in each region Equidistant from region lines Define an encoding for the solutions Strings for the customer attribute values Encode each guide point Voronoi Diagram 40 55 80 10 20 28 50 16 Guide Point #1 10 Attribute Values KDD-26 2002 Denise Ecklund KDD-13

Genetic Algorithms Evolution 1st Gen 10 40 20 55 28 50 80 16 Solution 1 35 18 42 20 55 15 40 48 75 16 Solution 2 15 37 15 50 28 50 75 16 Solution 3 24 40 55 80 10 20 28 50 45 16 35 42 50 75 18 20 28 50 16 15 37 50 40 75 15 28 48 16 Mutation Cross-over Fitness 16 creating 2 children 2nd Gen 24 55 80 10 20 28 45 16 35 18 42 20 50 28 50 75 16 15 37 50 40 75 28 48 16 10 90 33 10 48 28 26 50 57 80 Solution 4 35 42 55 40 75 18 20 15 48 16 10 90 33 10 37 28 26 37 57 46 Solution 5 Fitness 14.5 Fitness 17 Fitness Fitness 33 Fitness 13 Fitness 25 15 15 37 50 75 15 28 50 16 Cross-over function Create 2 children Take 6 attribute values from one parent and 4 from the other Mutate function Randomly switch several attribute values from values in the sample subspace Fitness function: Average distance between the solution point and all the points in the sample subspace Stop when the solutions change very little KDD-27 Selecting a Mining Mechanism Multiple mechanisms can be used to answer a question select the mechanism based on your requirements Many recs Many Numeric String attrs values values Learn rules Learn incre Est stat L-Perf signif disk L-Perf cpu A-Perf disk A-Perf cpu Decision Trees Neural Networks Genetic Algs Good Avg Poor Quality of Input Quality of Output Learning Performance Application Performance KDD-28 2002 Denise Ecklund KDD-14

So... Artificial Intelligence is fun,... but what does this have to do with database? mining is just another database application mining applications have requirements The database system can help mining applications meet their requirements So, what are the challenges for data mining systems and how can the database system help? KDD-29 base Support for DM Challenges Mining Challenge Support many data mining mechanisms in a KDD system Support interactive mining and incremental mining Guide the mining process by integrity constraints Determine usefulness of a data mining result Help humans to better understand mining results (e.g., visualization) base Support Support multiple DB interfaces (for different DM mechanisms) Intelligent caching and support for changing views over query results Make integrity constraints query-able (meta-data) Gather and output runtime statistics on DB support, confidence, and other metrics Prepare output data for selectable presentation formats KDD- 2002 Denise Ecklund KDD-15

base Support for DM Challenges Mining Challenge Accurate, efficient, and flexible methods for data cleaning and data encoding Improved performance for data mining mechanisms Ability to mine a wide variety of data types Support web mining Define a data mining query language base Support Programmable data warehouse tools for data cleaning and encoding; Fast, runtime re-encoding Parallel data retrieval and support for incremental query processing New indexing and access methods for non-traditional data types Extend XML and WWW database technology to support large, long running queries Extend SQL, OQL and other interfaces to support data mining KDD-31 Commercial Mining Products and Tools Some DB companies with Mining products: Oracle Oracle 9i, with BI-Beans (an OLAP toolset) IBM Miner for and Miner for Text NCR TeraMiner for the Tera warehouse Companies with Mining tools COGNOS Scenario, a set of DM and OLAP tools Elseware Classpad (classification) and Previa (prediction) Logic Programming Associates, Ltd. mite (clustering) Prudential System Software GmbH credit card fraud RedShed Software Dowser (association discovery) http://www.kdnuggets.com/companies/products.html KDD-32 2002 Denise Ecklund KDD-16

Conclusions To support data mining, we should Enhance database technologies developed for warehouses Very large database systems Object-oriented (navigational) database systems View Management mechanisms Heterogeneous databases Create new access methods basedonminingaccess Develop query languages or other interfaces to better support data mining mechanisms Select and integrate the needed DB technologies KDD-33 Studying for the exam? Send questions to the appropriate lecturer. ellenmk@ifi.uio.no ragnarn@ifi.uio.no nacia@ifi.uio.no There will be a review session 19.11.2003, hopefully with questions and answers. For the success of the last session, please send questions in advance, preferably by mail and preferably to the right lecturer. KDD-34 2002 Denise Ecklund KDD-17