Knowledge Discovery and Data Mining



Similar documents
Introduction to Data Mining

Data Warehousing and Data Mining

Information Management course

Syllabus. HMI 7437: Data Warehousing and Data/Text Mining for Healthcare

Data Mining: Partially from: Introduction to Data Mining by Tan, Steinbach, Kumar

Data Mining Apriori Algorithm

Data Mining: Concepts and Techniques. Jiawei Han. Micheline Kamber. Simon Fräser University К MORGAN KAUFMANN PUBLISHERS. AN IMPRINT OF Elsevier

Objectives of Lecture 1. Labs and TAs. Class and Office Hours. CMPUT 391: Introduction. Introduction

Data Mining Introduction

Knowledge Discovery from Data Bases Proposal for a MAP-I UC

Data Warehousing and Data Mining

Data Mining and Business Intelligence CIT-6-DMB. Faculty of Business 2011/2012. Level 6

Introduction. A. Bellaachia Page: 1

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Data Mining Solutions for the Business Environment

SPATIAL DATA CLASSIFICATION AND DATA MINING

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

CAS CS 565, Data Mining

Dynamic Data in terms of Data Mining Streams

Introduction to Data Mining

Objectives of Lecture 1. Class and Office Hours. Labs and TAs. CMPUT 391: Introduction. Introduction

Data Mining Analytics for Business Intelligence and Decision Support

AMIS 7640 Data Mining for Business Intelligence

CS 5890: Introduction to Data Science Syllabus, Utah State University, Fall

Introduction to Data Mining Techniques

Association Analysis: Basic Concepts and Algorithms

How To Solve The Kd Cup 2010 Challenge

Static Data Mining Algorithm with Progressive Approach for Mining Knowledge

Data Mining System, Functionalities and Applications: A Radical Review

DATA MINING CONCEPTS AND TECHNIQUES. Marek Maurizio E-commerce, winter 2011

International Journal of World Research, Vol: I Issue XIII, December 2008, Print ISSN: X DATA MINING TECHNIQUES AND STOCK MARKET

Unique column combinations

DATA MINING FOR BUSINESS INTELLIGENCE. Data Mining For Business Intelligence: MIS 382N.9/MKT 382 Professor Maytal Saar-Tsechansky

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Mining Association Rules. Mining Association Rules. What Is Association Rule Mining? What Is Association Rule Mining? What is Association rule mining

Introduction to Data Mining

College of Health and Human Services. Fall Syllabus

Data Mining Association Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 6. Introduction to Data Mining

Reference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors

Data Mining and Soft Computing. Francisco Herrera

Principles of Dat Da a t Mining Pham Tho Hoan hoanpt@hnue.edu.v hoanpt@hnue.edu. n

Integrated Data Mining and Knowledge Discovery Techniques in ERP

A Web-based Interactive Data Visualization System for Outlier Subspace Analysis

2.1. Data Mining for Biomedical and DNA data analysis

Market Basket Analysis and Mining Association Rules

College information system research based on data mining

MAXIMAL FREQUENT ITEMSET GENERATION USING SEGMENTATION APPROACH

CPSC 340: Machine Learning and Data Mining. Mark Schmidt University of British Columbia Fall 2015

How To Use Data Mining For Knowledge Management In Technology Enhanced Learning

Healthcare Measurement Analysis Using Data mining Techniques

(b) How data mining is different from knowledge discovery in databases (KDD)? Explain.

A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS

Mining Online GIS for Crime Rate and Models based on Frequent Pattern Analysis

A New Approach for Evaluation of Data Mining Techniques

Association Rule Mining

Data Mining and Exploration. Data Mining and Exploration: Introduction. Relationships between courses. Overview. Course Introduction

1. What are the uses of statistics in data mining? Statistics is used to Estimate the complexity of a data mining problem. Suggest which data mining

More details >>> HERE <<<

Quick Introduction of Data Mining Techniques

Introduction to Data Mining. Lijun Zhang

Data Mining: Introduction. Lecture Notes for Chapter 1. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

Prediction of Heart Disease Using Naïve Bayes Algorithm

DATA MINING - SELECTED TOPICS

CHAPTER-24 Mining Spatial Databases

Search and Data Mining: Techniques. Applications Anya Yarygina Boris Novikov

Knowledge Discovery from Databases

Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

Data mining in the e-learning domain

DATA MINING TECHNIQUES AND APPLICATIONS

Cleveland State University

AMIS 7640 Data Mining for Business Intelligence

City University of Hong Kong. Information on a Course offered by the Department of Management Sciences with effect from Semester A in 2012 / 2013

Classification and Prediction

Association Rule Mining: A Survey

CSCI-599 DATA MINING AND STATISTICAL INFERENCE

Database Marketing, Business Intelligence and Knowledge Discovery

Mobile Phone APP Software Browsing Behavior using Clustering Analysis

Subject Description Form

Data Warehousing and Data Mining. A.A Datawarehousing & Datamining 1

COURSE RECOMMENDER SYSTEM IN E-LEARNING

Principles of Data Mining

An Overview of Knowledge Discovery Database and Data mining Techniques

MINING THE DATA FROM DISTRIBUTED DATABASE USING AN IMPROVED MINING ALGORITHM

Mining an Online Auctions Data Warehouse

Explanation-Oriented Association Mining Using a Combination of Unsupervised and Supervised Learning Algorithms

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

ISQS 3358 BUSINESS INTELLIGENCE FALL 2014

COURSE SYLLABUS. Enterprise Information Systems and Business Intelligence

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.

Data Mining: An Overview. David Madigan

Foundations of Business Intelligence: Databases and Information Management

International Journal of Scientific & Engineering Research, Volume 5, Issue 4, April ISSN

University of Southern California MARSHALL SCHOOL OF BUSINESS Spring, 2004 Course Guidelines & Syllabus

Data Mining Part 5. Prediction

Transcription:

Knowledge Discovery and Data Mining Data Mining in a nutshell Dr. Osmar R. Zaïane Fall 2007 Extract interesting knowledge (rules, regularities, patterns, constraints) from data in large collections. Knowledge Data Actionable knowledge KDD at the Confluence of Many Disciplines DBMS Query processing Datawarehousing OLAP Database Systems Artificial Intelligence Machine Learning Neural Networks Agents Knowledge Representation Indexing Inverted files Information Retrieval High Performance Computing Statistics Visualization Computer graphics Human Computer Interaction 3D representation Parallel and Distributed Computing Other Statistical and Mathematical Modeling

Who Am I? UNIVERSITY OF ALBERTA Osmar R. Zaïane, Ph.D. Associate Professor Department of Computing Science 221 Athabasca Hall Edmonton, Alberta Canada T6G 2E8 Research Interests: Data Mining, Web Mining, Multimedia Mining, Data Visualization, Information Retrieval. Telephone: Office +1 (780) 492 2860 Fax +1 (780) 492 1071 E-mail: zaiane@cs.ualberta.ca http://www.cs.ualberta.ca/~zaiane/ Applications: Analytic Tools, Adaptive Systems, Intelligent Systems, Diagnostic and Categorization, Recommender Systems Database Laboratory PhD on Web Mining & Multimedia Mining With Dr. Jiawei Han at Simon Fraser University, Canada Achievements: (in last 6 years): 2 PhD and 18 MSc, 90+ publications, ICDM PC co-chair 2007, ADMA co-chair 2007, WEBKDD and MDM/KDD co-chair (2000 to 2003) Currently: 3 PhD + 1 MSc Since 1999 (3 Ph.D., Dr. Osmar R. Zaïane, 1 M.Sc.) 1999-2007 Who Am I? Research Activities Pattern Discovery for Intelligent Systems Currently: 4 graduate students UNIVERSITY OF ALBERTA Osmar R. Zaïane, Ph.D. Associate Professor Department of Computing Science 352 Athabasca Hall Edmonton, Alberta Canada T6G 2E8 Past 6 years Telephone: Office +1 (780) 492 2860 Fax +1 (780) 492 1071 E-mail: zaiane@cs.ualberta.ca http://www.cs.ualberta.ca/~zaiane/ 2001 2002 2003 MSc 5 5 1 PhD 0 0 0 Principles Total of Knowledge 5Discovery 5in Data 1 Database Laboratory 2004 2005 2006 Total 4 1 2 18 1 0 1 2 5 University 1 of Alberta 3 20 Research Interests: Data Mining, Web Mining, Multimedia Mining, Data Visualization, Information Retrieval. Where I Stand Artificial Intelligence HCI Graphics Applications: Analytic Tools, Adaptive Systems, Intelligent Systems, Diagnostic and Categorization, Recommender Systems Database Management Systems Achievements: 90+ publications, IEEE-ICDM PC-chair & ADMA chair -2007 WEBKDD and MDM/KDD co-chair (2000 to 2003), WEBKDD 05 co-chair, Associate Editor for ACM- SIGKDD Explorations Thanks To Without my students my research work wouldn t have been possible. Current: (No particular order) Maria-Luiza Antonie JiyangChen Andrew Foss Seyed-Vahid Jazayeri Past: (No particular order) Stanley Oliveira Yang Wang Lisheng Sun JiaLi Alex Strilets William Cheung Andrew Foss Yue Zhang Chi-Hoon Lee Weinan Wang Ayman Ammoura Hang Cui Jun Luo Jiyang Chen Yuan Ji Yi Li Yaling Pei Yan Jin Mohammad El-Hajj Maria-Luiza Antonie

Knowledge Discovery and Data Mining Class and Office Hours Class: Tuesday and Thursdays from 9:30 to 10:50 Office Hours: Tuesdays from 13:00 to 14:00 By mutually agreed upon appointment: E-mail: zaiane@cs.ualberta.ca Tel: 492 2860 - Office: ATH 3-52 Course Requirements Understand the basic concepts of database systems Understand the basic concepts of artificial intelligence and machine learning Be able to develop applications in C/C ++ and/or Java But I prefer Class: Once a week from 9:00 to 11:50 (with one break) 10 Course Objectives To provide an introduction to knowledge discovery in databases and complex data repositories, and to present basic concepts relevant to real data mining applications, as well as reveal important research issues germane to the knowledge discovery domain and advanced mining applications. Students will understand the fundamental concepts underlying knowledge discovery in databases and gain hands-on experience with implementation of some data mining algorithms applied to real world cases. Evaluation and Grading There is no final exam for this course, but there are assignments, presentations, a midterm and a project. I will be evaluating all these activities out of 100% and give a final grade based on the evaluation of the activities. The midterm is either a take-home exam or an oral exam. Assignments 20% (2 assignments) Midterm 25% Project 39% Quality of presentation + quality of report and proposal + quality of demos Preliminary project demo (week 11) and final project demo (week 15) have the same weight (could be week 16) Class presentations 16% Quality of presentation + quality of slides + peer evaluation A+ will be given only for outstanding achievement. 11 12

Choice Implement data mining project Projects Deliverables Project proposal + project pre-demo + final demo + project report Examples and details of data mining projects will be posted on the course web site. Assignments 1- Competition in one algorithm implementation (in C/C ++ ) 2- Devising Exercises with solutions More About Projects Students should write a project proposal (1 or 2 pages). project topic; implementation choices; Approach, references; schedule. All projects are demonstrated at the end of the semester. December 11-12 to the whole class. Preliminary project demos are private demos given to the instructor on week November 19. Implementations: C/C ++ or Java, OS: Linux, Window XP/2000, or other systems. 13 More About Evaluation Re-examination. About Plagiarism None, except as per regulation. Collaboration. Collaborate on assignments and projects, etc; do not merely copy. Plagiarism. Work submitted by a student that is the work of another student or any other person is considered plagiarism. Read Sections 26.1.4 and 26.1.5 of the calendar. Cases of plagiarism are immediately referred to the Dean of Science, who determines what course of action is appropriate. Plagiarism, cheating, misrepresentation of facts and participation in such offences are viewed as serious academic offences by the University and by the Campus Law Review Committee (CLRC) of General Faculties Council. Sanctions for such offences range from a reprimand to suspension or expulsion from the University. 15

Notes and Textbook Course home page: http://www.cs.ualberta.ca/~zaiane/courses/cmput695/ We will also have a mailing list and newsgroup for the course. No Textbook but recommended books: Data Mining: Concepts and Techniques Jiawei Han and Micheline Kamber Morgan Kaufmann Publisher http://www-faculty.cs.uiuc.edu/~hanj/bk2/ 2006 ISBN 1-55860-901-6 800 pages 2001 ISBN 1-55860-489-8 550 pages Other Books Principles of Data Mining David Hand, Heikki Mannila, Padhraic Smyth, MIT Press, 2001, ISBN 0-262-08290-X, 546 pages Data Mining: Introductory and Advanced Topics Margaret H. Dunham, Prentice Hall, 2003, ISBN 0-13-088892-3, 315 pages Dealing with the data flood: Mining data, text and multimedia Edited by Jeroen Meij, SST Publications, 2002, ISBN 90-804496-6-0, 896 pages Introduction to Data Mining Pang-Ning Tan, Michael Steinbach, Vipin Kumar Addison Wesley, ISBN: 0-321-32136-7, 769 pages 17 Course Schedule There are 13 weeks from Sept 6 th to December 4 th. Week 1: Sept 6 : Introduction to Data Mining Week 2: Sept 11-13 : Association Rules Week 3: Sept 18-20 : Association Rules (advanced topics) Week 4: Sept 25-27 : Sequential Pattern Analysis Week 5: Oct 2-4 : Classification (Neural Networks) Week 6: Oct 9-11 : Classification (Decision Trees and +) Week 7: Oct 16-18 : Data Clustering Week 8: Oct 23-25 : Outlier Detection Week 9: Oct 30-Nov 1 : Data Clustering in subspaces Week 10: Nov 6-8 : Contrast sets + Web Mining Week 11: Nov 13-15 : Web Mining + Class Presentations Week 12: Nov 20-22 : Class Presentations Week 12: Nov 27-29 : Class Presentations Week 13: Dec 4 : Class Presentations Week 15: Dec 11 : Project Demos (Tentative, subject to changes) Due dates -Midterm week 8 -Assignment 1 week 6 -Assignment 2 variable dates Course Content Introduction to Data Mining Association analysis Sequential Pattern Analysis Classification and prediction Contrast Sets Data Clustering Outlier Detection Web Mining Other topics if time permits (spatial data, biomedical data, etc.) 19 20

For those of you who watch what you eat... Here's the final word on nutrition and health. It's a relief to know the truth after all those conflicting medical studies. The Japanese eat very little fat and suffer fewer heart attacks than the British or Americans. The Mexicans eat a lot of fat and suffer fewer heart attacks than the British or Americans. The Japanese drink very little red wine and suffer fewer heart attacks than the British or Americans The Italians drink excessive amounts of red wine and suffer fewer heart attacks than the British or Americans. The Germans drink a lot of beer and eat lots of sausages and fats and suffer fewer heart attacks than the British or Americans. CONCLUSION: Eat and drink what you like. Speaking English is apparently what kills you. What Is Association Mining? Association Rules Clustering Classification Outlier Detection Association rule mining searches for relationships between items in a dataset: Finding association, correlation, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories. Rule form: Body Ηead [support, confidence]. Examples: buys(x, bread ) buys(x, milk ) [0.6%, 65%] major(x, CS ) ^ takes(x, DB ) grade(x, A ) [1%, 75%]

Basic Concepts A transaction is a set of items: T={i a, i b,i t } Association Rule Mining T I, where I is the set of all possible items {i 1, i 2,i n } D, the task relevant data, is a set of transactions. An association rule is of the form: P Q, where P I, Q I, and P Q = P Q holds in D with support s and P Q has a confidence c in the transaction set D. Support(P Q) = Probability(P Q) Confidence(P Q)=Probability(Q/P) FIM Frequent Itemset Mining Association Rules Generation abc Bound by a support threshold 1 2 ab c b ac Bound by a confidence threshold Frequent itemset generation is still computationally expensive Frequent Itemset Generation null A B C D E AB AC AD AE BC BD BE CD CE DE ABC ABD ABE ACD ACE ADE BCD BCE BDE CDE ABCD ABCE ABDE ACDE BCDE ABCDE Given d items, there are 2 d possible candidate itemsets Frequent Itemset Generation Brute-force approach (Basic approach): Each itemset in the lattice is a candidate frequent itemset Count the support of each candidate by scanning the database Transactions List of Candidates N TID Items 1 Bread, Milk 2 Bread, Diaper, Beer, Eggs 3 Milk, Diaper, Beer, Coke 4 Bread, Milk, Diaper, Beer 5 Bread, Milk, Diaper, Coke w Match each transaction against every candidate Complexity ~ O(NMw) => Expensive since M = 2 d!!! M

Grouping Grouping Grouping Grouping Clustering Clustering Partitioning Partitioning We need a notion of similarity or closeness (what features?) Should we know apriori how many clusters exist? How do we characterize members of groups? How do we label groups? What about objects that belong to different groups? We need a notion of similarity or closeness (what features?) Should we know apriori how many clusters exist? How do we characterize members of groups? How do we label groups? Classification Classification Categorization What is Classification? The goal of data classification is to organize and categorize data in distinct classes. A model is first created based on the data distribution. The model is then used to classify new data. Given the model, a class can be predicted for new data. With classification, I can predict in which bucket to put the ball, but I can t predict the weight of the ball. 1 2 3 4 n Predefined buckets i.e. known labels? 1 2 3 4 n

Classification = Learning a Model Training Set (labeled) Framework Classification Model Labeled Data Training Data Testing Data Derive Classifier (Model) Estimate Accuracy New unlabeled data Labeling=Classification Unlabeled New Data Classification Methods Decision Tree Induction Neural Networks Bayesian Classification K-Nearest Neighbour Support Vector Machines Associative Classifiers Case-Based Reasoning Genetic Algorithms Rough Set Theory Fuzzy Sets Etc. Outlier Detection To find exceptional data in various datasets and uncover the implicit patterns of rare cases Inherent variability - reflects the natural variation Measurement error (inaccuracy and mistakes) Long been studied in statistics An active area in data mining in the last decade Many applications Detecting credit card fraud Discovering criminal activities in E-commerce Identifying network intrusion Monitoring video surveillance

Outliers are Everywhere Data values that appear inconsistent with the rest of the data. Some types of outliers We will see: Statistical methods; distance-based methods; density-based methods; resolution-based methods, etc. Protein Localization Contrasting Sequence Sets Collaborative e-learning E-commerce sites Brassica Napus (Canola) A B C D E F G H R S T Class Class What contrasts genes or proteins? What makes a buyer buy & a non-buyer leave empty handed? Class Finding emerging sequences and contrasting the sets of sequences would give insight about what behaviour Dr. Osmar R. leads Zaïane, 1999-2007 to success Principles and of Knowledge what doesn t. Discovery in Data

Mammography Breast Cancer Recommender Systems Automatic cropping and enhancement Automatic segmentation and feature extraction Training Classification model Transaction (IID, class, F 1, F 2, F 3, F f ) F α F β F γ F δ class Normal Malignant benign Question Query expansion 1,2,3,4,5,6,7,8,9,.. 1,2,3,4,5,6,7,8,9,.. 1,2,3,4,5,6,7,8,9,.. Recommend brief answers Concise summaries Text Summarization WebViz Clustering with Constraints How to model constraints? Polygons, lines? How to include the constraint verification in the clustering phase?

d 1 d 2 d n How to Apply This? Crime events with descriptors d 1 d 2 d n Identification of Serial crimes d 1 d 2 d n Construction 1 Communication 2 Visualization 3 Interaction 4 CORBA+XML DIVE-ON Project Data mining in an Immersed Virtual Environment CORBA+XML Over a Network We need to define the notion of similarity Distributed databases Federated multidimensional data warehouses Local 3 dimensional data cube Visualization room (CAVE) Immersed user 3D rotating torus menu Hand signs Dr. Osmar to R. Zaïane, operate 1999-2007 the CAVE Principles with OLAP of Knowledge operations. Discovery in Data VWV Project Virtual Web View Mediator WebML VWV 1 VWV 2 VWV n Private onthology