Data Mining: Concepts and Techniques



Similar documents
Introduction to Data Mining

Introduction. A. Bellaachia Page: 1

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 1

Data Mining: Concepts and Techniques

Introduction. What is Data Mining?

Data Mining: Concepts and Techniques

Information Management course

Knowledge Discovery Process and Data Mining - Final remarks

Introduction to Data Mining

What is Data Mining?

Data Mining: Concepts and Techniques Chapter 1 Introduction

Massive Data Analytics

Introduction to Data Mining and Business Intelligence Lecture 1/DMBI/IKI83403T/MTI/UI

Data Warehousing and Data Mining

DATA MINING - 1DL105, 1Dl111

Data Mining. Introduction to Modern Information Retrieval from Databases and the Web. Administrivia

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health

Data Mining Introduction


Data Mining System, Functionalities and Applications: A Radical Review

ECLT 5810 E-Commerce Data Mining Techniques - Introduction. Prof. Wai Lam

Data Warehousing and Data Mining

International Journal of Scientific & Engineering Research, Volume 5, Issue 4, April ISSN

Data Mining. Knowledge Discovery, Data Warehousing and Machine Learning Final remarks. Lecturer: JERZY STEFANOWSKI

Data Mining and Exploration. Data Mining and Exploration: Introduction. Relationships between courses. Overview. Course Introduction

DATA MINING CONCEPTS AND TECHNIQUES. Marek Maurizio E-commerce, winter 2011

Data Mining Solutions for the Business Environment

Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

CS490D: Introduction to Data Mining Chris Clifton. What Is Data Mining?

Knowledge Discovery from Data Bases Proposal for a MAP-I UC

Search and Data Mining: Techniques. Applications Anya Yarygina Boris Novikov

Concept and Applications of Data Mining. Week 1

Data Mining Analytics for Business Intelligence and Decision Support

Assessing Data Mining: The State of the Practice

Data Mining: Concepts and Techniques. Jiawei Han. Micheline Kamber. Simon Fräser University К MORGAN KAUFMANN PUBLISHERS. AN IMPRINT OF Elsevier

Lluis Belanche + Alfredo Vellido. Intelligent Data Analysis and Data Mining

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Data Mining is sometimes referred to as KDD and DM and KDD tend to be used as synonyms

Web Mining Seminar CSE 450. Spring 2008 MWF 11:10 12:00pm Maginnes 113

How To Use Data Mining For Knowledge Management In Technology Enhanced Learning

An Overview of Database management System, Data warehousing and Data Mining

Database Marketing, Business Intelligence and Knowledge Discovery

Dynamic Data in terms of Data Mining Streams

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

2.1. Data Mining for Biomedical and DNA data analysis

EFFECTIVE USE OF THE KDD PROCESS AND DATA MINING FOR COMPUTER PERFORMANCE PROFESSIONALS

Data Mining: Introduction. Lecture Notes for Chapter 1. Slides by Tan, Steinbach, Kumar adapted by Michael Hahsler

The Role of Visualization in Effective Data Cleaning

SPATIAL DATA CLASSIFICATION AND DATA MINING

Data Mining: Concepts and Techniques

Introduction to Data Mining. Lijun Zhang

Overview of Data Mining

Introduction to Data Mining

Introduction to Data Mining

Data Security and Privacy in Data Mining: Research Issues & Preparation

DATA MINING AND WAREHOUSING CONCEPTS

Healthcare Measurement Analysis Using Data mining Techniques

Machine Learning, Data Mining, and Knowledge Discovery: An Introduction

Chapter 2 Literature Review

DATA MINING: AN OVERVIEW

Data Mining - Introduction

A RESEARCH STUDY ON DATA MINING TECHNIQUES AND ALGORTHMS

Three Perspectives of Data Mining

Data Mining. Vera Goebel. Department of Informatics, University of Oslo

DATA MINING AND KNOWLEDGE DISCOVERY FROM RESEARCH PROBLEMS

Data Mining: Motivations and Concepts

Data Warehouse: Introduction

FREQUENT PATTERN MINING FOR EFFICIENT LIBRARY MANAGEMENT

Data Mining: Opportunities and Challenges

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

A STUDY OF DATA MINING ACTIVITIES FOR MARKET RESEARCH

Chapter 5. Warehousing, Data Acquisition, Data. Visualization

International Journal of World Research, Vol: I Issue XIII, December 2008, Print ISSN: X DATA MINING TECHNIQUES AND STOCK MARKET

Perspectives on Data Mining

Introduction. Chapter Bibliographic Notes

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH

The Scientific Data Mining Process

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

DATA MINING TECHNIQUES SUPPORT TO KNOWLEGDE OF BUSINESS INTELLIGENT SYSTEM

Data Mining course Master in Information Technologies Enginyeria Informàtica Tomàs Aluja. LIAM EIO. UPC Lluis Belanche LSI. UPC

Horizontal Aggregations in SQL to Prepare Data Sets for Data Mining Analysis

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP

CAS CS 565, Data Mining

Data Mining as Part of Knowledge Discovery in Databases (KDD)

not possible or was possible at a high cost for collecting the data.

Customer Classification And Prediction Based On Data Mining Technique

A Review of Data Mining Techniques

The KDD Process for Extracting Useful Knowledge from Volumes of Data

Knowledge Discovery from Databases

Vishnu Swaroop Computer Science and Engineering College Madan Mohan Malaviya Engineering College Gorakhpur, India

Introduction of Information Visualization and Visual Analytics. Chapter 4. Data Mining

Syllabus. HMI 7437: Data Warehousing and Data/Text Mining for Healthcare

Sanjeev Kumar. contribute

Data Warehousing and Data Mining for improvement of Customs Administration in India. Lessons learnt overseas for implementation in India

Data Mining: Introduction and a Health Care Application

Course DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

MINING CLICKSTREAM-BASED DATA CUBES

A Brief Tutorial on Database Queries, Data Mining, and OLAP

Data Mining for Fun and Profit

Designing an Object Relational Data Warehousing System: Project ORDAWA * (Extended Abstract)

Transcription:

Data Mining: Concepts and Techniques Chapter 1 Introduction SURESH BABU M ASST PROF IT DEPT VJIT 1

Chapter 1. Introduction Motivation: Why data mining? What is data mining? Data Mining: On what kind of data? Data mining functionality Are all the patterns interesting? Classification of data mining systems Data Mining Task Primitives Integration of data mining system with a DB and DW System Major issues in data mining 2

Why Data Mining? The Explosive Growth of Data: from terabytes to petabytes Data collection and data availability Automated data collection tools, database systems, Web, computerized society Major sources of abundant data Business: Web, e-commerce, transactions, stocks, Science: Remote sensing, bioinformatics, scientific simulation, Society and everyone: news, digital cameras, We are drowning in data, but starving for knowledge! Necessity is the mother of invention Data mining Automated analysis of massive data sets 3

Evolution of Database Technology 1960s: Data collection, database creation, IMS and network DBMS 1970s: Relational data model, relational DBMS implementation 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.) Application-oriented DBMS (spatial, scientific, engineering, etc.) 1990s: Data mining, data warehousing, multimedia databases, and Web databases 2000s Stream data management and mining Data mining and its applications Web technology (XML, data integration) and global information systems 4

What Is Data Mining? Data mining (knowledge discovery from data) Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data Data mining: a misnomer? Knowledge mining from data Alternative names Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, business intelligence, etc. 5

Why Data Mining? Potential Applications Data analysis and decision support Market analysis and management Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation Risk analysis and management Forecasting, customer retention, improved underwriting, quality control, competitive analysis Fraud detection and detection of unusual patterns (outliers) Other Applications Text mining (news group, email, documents) andweb mining Stream data mining Bioinformatics and bio-data analysis 6

Ex. 1: Market Analysis and Management 7 Where does the data come from? Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies Target marketing Find clusters of model customers who share the same characteristics: interest, income level, spending habits, etc., Determine customer purchasing patterns over time Cross-market analysis Find associations/co-relations between product sales, & predict based on such association Customer profiling What types of customers buy what products (clustering or classification) Customer requirement analysis Identify the best products for different customers Predict what factors will attract new customers Provision of summary information Multidimensional summary reports Statistical summary information (data central tendency and variation)

Ex. 2: Corporate Analysis & Risk Management Finance planning and asset evaluation cash flow analysis and prediction contingent claim analysis to evaluate assets cross-sectional and time series analysis (financial-ratio, trend analysis, etc.) Resource planning summarize and compare the resources and spending Competition monitor competitors and market directions group customers into classes and a class-based pricing procedure set pricing strategy in a highly competitive market 8

Ex. 3: Fraud Detection & Mining Unusual Patterns Approaches: Clustering & model construction for frauds, outlier analysis Applications: Health care, retail, credit card service, telecomm. Auto insurance: ring of collisions Money laundering: suspicious monetary transactions Medical insurance Professional patients, ring of doctors, and ring of references Unnecessary or correlated screening tests Telecommunications: phone-call fraud Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm Retail industry Analysts estimate that 38% of retail shrink is due to dishonest employees Anti-terrorism 9

Knowledge Discovery (KDD) Process Data mining core of knowledge discovery process Pattern Evaluation Task-relevant Data Data Mining Data Warehouse Selection Data Cleaning Data Integration 10 Databases

KDD Process: Several Key Steps Learning the application domain relevant prior knowledge and goals of application Creating a target data set: data selection Data cleaning and preprocessing: (may take 60% of effort!) Data reduction and transformation Find useful features, dimensionality/variable reduction, invariant representation Choosing functions of data mining summarization, classification, regression, association, clustering Choosing the mining algorithm(s) Data mining: search for patterns of interest Pattern evaluation and knowledge presentation visualization, transformation, removing redundant patterns, etc. Use of discovered knowledge 11

Data Mining and Business Intelligence Increasing potential to support business decisions Decision Making Data Presentation Visualization Techniques Data Mining Information Discovery End User Business Analyst Data Analyst 12 Data Exploration Statistical Summary, Querying, and Reporting Data Preprocessing/Integration, Data Warehouses Data Sources Paper, Files, Web documents, Scientific experiments, Database Systems DBA

Why Not Traditional Data Analysis? Tremendous amount of data Algorithms must be highly scalable to handle such as tera-bytes of data High-dimensionality of data Micro-array may have tens of thousands of dimensions High complexity of data Data streams and sensor data Time-series data, temporal data, sequence data Structure data, graphs, social networks and multi-linked data Heterogeneous databases and legacy databases Spatial, spatiotemporal, multimedia, text and Web data Software programs, scientific simulations New and sophisticated applications 13

Data Mining: On What Kinds of Data? Database-oriented data sets and applications Relational database, data warehouse, transactional database Advanced data sets and advanced applications Data streams and sensor data Time-series data, temporal data, sequence data (incl. bio-sequences) Structure data, graphs, social networks and multi-linked data Object-relational databases Heterogeneous databases and legacy databases Spatial data and spatiotemporal data Multimedia database Text databases The World-Wide Web 14

Data Mining Functionalities Multidimensional concept description: Characterization and discrimination Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions Frequent patterns, association, correlation vs. causality Diaper Beer [0.5%, 75%] (Correlation or causality?) Classification and prediction Construct models (functions) that describe and distinguish classes or concepts for future prediction E.g., classify countries based on (climate), or classify cars based on (gas mileage) Predict some unknown or missing numerical values 15

Data Mining Functionalities (2) Cluster analysis Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns Maximizing intra-class similarity & minimizing interclass similarity Outlier analysis Outlier: Data object that does not comply with the general behavior of the data Noise or exception? Useful in fraud detection, rare events analysis Trend and evolution analysis Trend and deviation: e.g., regression analysis Sequential pattern mining: e.g., digital camera large SD memory Periodicity analysis Similarity-based analysis Other pattern-directed or statistical analyses 16

Are All the Discovered Patterns Interesting? Data mining may generate thousands of patterns: Not all of them are interesting Suggested approach: Human-centered, query-based, focused mining Interestingness measures A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm Objective vs. subjective interestingness measures Objective: based on statistics and structures of patterns, e.g., support, confidence, etc. Subjective: based on user s belief in the data, e.g., unexpectedness, novelty, actionability, etc. 17

Architecture: Typical Data Mining System Graphical User Interface Pattern Evaluation Data Mining Engine Knowl edge- Base Database or Data Warehouse Server data cleaning, integration, and selection Database Data Warehouse World-Wide Web Other Info Repositories 18

Major Issues in Data Mining Mining methodology Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web Performance: efficiency, effectiveness, and scalability Pattern evaluation: the interestingness problem Incorporation of background knowledge Handling noise and incomplete data Parallel, distributed and incremental mining methods Integration of the discovered knowledge with existing one: knowledge fusion User interaction Data mining query languages and ad-hoc mining Expression and visualization of data mining results Interactive mining of knowledge at multiple levels of abstraction Applications and social impacts Protection of data security, integrity, and privacy 19

Summary Data mining: Discovering interesting patterns from large amounts of data A natural evolution of database technology, in great demand, with wide applications A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc. Data mining systems and architectures Major issues in data mining 20

Conferences and Journals on Data Mining KDD Conferences 21 ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD) SIAM Data Mining Conf. (SDM) (IEEE) Int. Conf. on Data Mining (ICDM) Conf. on Principles and practices of Knowledge Discovery and Data Mining (PKDD) Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD) Other related conferences ACM SIGMOD VLDB (IEEE) ICDE WWW, SIGIR ICML, CVPR, NIPS Journals Data Mining and Knowledge Discovery (DAMI or DMKD) IEEE Trans. On Knowledge and Data Eng. (TKDE) KDD Explorations ACM Trans. on KDD

Recommended Reference Books S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertex and Semi-Structured Data. Morgan Kaufmann, 2002 R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification, 2ed., Wiley-Interscience, 2000 T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003 U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996 U. Fayyad, G. Grinstein, and A. Wierse, Information Visualization in Data Mining and Knowledge Discovery, Morgan Kaufmann, 2001 J. Han and M. Kamber.. Morgan Kaufmann, 2 nd ed., 2006 D. J. Hand, H. Mannila, and P. Smyth, Principles of Data Mining, MIT Press, 2001 T. Hastie, R. Tibshirani, and J. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer-Verlag, 2001 T. M. Mitchell, Machine Learning, McGraw Hill, 1997 G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991 P.-N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Wiley, 2005 S. M. Weiss and N. Indurkhya, Predictive Data Mining, Morgan Kaufmann, 1998 I. H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2 nd ed. 2005 22