Data Warehousing and Data Mining

Similar documents
Data Warehousing and Data Mining

Data Mining Introduction

Introduction to Data Mining

Information Management course

DATA MINING CONCEPTS AND TECHNIQUES. Marek Maurizio E-commerce, winter 2011

Introduction. A. Bellaachia Page: 1

Introduction to Data Mining

Data Mining: Concepts and Techniques. Jiawei Han. Micheline Kamber. Simon Fräser University К MORGAN KAUFMANN PUBLISHERS. AN IMPRINT OF Elsevier

Data Mining - Introduction

(b) How data mining is different from knowledge discovery in databases (KDD)? Explain.

International Journal of Scientific & Engineering Research, Volume 5, Issue 4, April ISSN

Introduction to Data Mining

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Data mining in the e-learning domain

Data Mining: Concepts and Techniques

1. What are the uses of statistics in data mining? Statistics is used to Estimate the complexity of a data mining problem. Suggest which data mining

Data Mining System, Functionalities and Applications: A Radical Review

Data Mining: Concepts and Techniques Chapter 1 Introduction

ECLT 5810 E-Commerce Data Mining Techniques - Introduction. Prof. Wai Lam

Introduction to Data Mining and Machine Learning Techniques. Iza Moise, Evangelos Pournaras, Dirk Helbing

Syllabus. HMI 7437: Data Warehousing and Data/Text Mining for Healthcare

SPATIAL DATA CLASSIFICATION AND DATA MINING

More details >>> HERE <<<

Data Mining and Exploration. Data Mining and Exploration: Introduction. Relationships between courses. Overview. Course Introduction

Data Mining and Business Intelligence CIT-6-DMB. Faculty of Business 2011/2012. Level 6

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 1

Dynamic Data in terms of Data Mining Streams

Introduction to Data Mining and Business Intelligence Lecture 1/DMBI/IKI83403T/MTI/UI

OLAP and Data Mining. Data Warehousing and End-User Access Tools. Introducing OLAP. Introducing OLAP

Database Marketing, Business Intelligence and Knowledge Discovery

Search and Data Mining: Techniques. Applications Anya Yarygina Boris Novikov

An Overview of Database management System, Data warehousing and Data Mining

Applications and Trends in Data Mining

DATA MINING TECHNIQUES AND APPLICATIONS

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.

IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH

Building Data Cubes and Mining Them. Jelena Jovanovic

Concept and Applications of Data Mining. Week 1

Data Warehousing and Data Mining. A.A Datawarehousing & Datamining 1

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

AMIS 7640 Data Mining for Business Intelligence

Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10

John R. Vacca INSIDE

Data Mining Analytics for Business Intelligence and Decision Support

Principles of Data Mining by Hand&Mannila&Smyth

Data Mining: Concepts and Techniques

Data Mining for Fun and Profit

Integrated Data Mining and Knowledge Discovery Techniques in ERP

Data Warehousing and OLAP Technology for Knowledge Discovery

Data Mining: Overview. What is Data Mining?

DATA WAREHOUSING AND OLAP TECHNOLOGY

An Overview of Knowledge Discovery Database and Data mining Techniques

Data Mining as Part of Knowledge Discovery in Databases (KDD)

Data Mining is sometimes referred to as KDD and DM and KDD tend to be used as synonyms

Copyright 2007 Ramez Elmasri and Shamkant B. Navathe. Slide 29-1

Data Warehousing and Data Mining in Business Applications

Data, Measurements, Features

The Scientific Data Mining Process

Foundations of Business Intelligence: Databases and Information Management

PRACTICAL DATA MINING IN A LARGE UTILITY COMPANY

Data Mining Solutions for the Business Environment

Importance or the Role of Data Warehousing and Data Mining in Business Applications

Data Mining. Vera Goebel. Department of Informatics, University of Oslo

Alexander Nikov. 5. Database Systems and Managing Data Resources. Learning Objectives. RR Donnelley Tries to Master Its Data

DATA MINING TECHNOLOGY. Keywords: data mining, data warehouse, knowledge discovery, OLAP, OLAM.

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Course DSS. Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

Introduction to Data Mining

Data Mining. Introduction to Modern Information Retrieval from Databases and the Web. Administrivia

DATA MINING: AN OVERVIEW

Data Mining: Concepts and Techniques

What is Data Mining?

Prediction of Heart Disease Using Naïve Bayes Algorithm

An Empirical Study of Application of Data Mining Techniques in Library System

5.5 Copyright 2011 Pearson Education, Inc. publishing as Prentice Hall. Figure 5-2

Chapter 5. Warehousing, Data Acquisition, Data. Visualization

DATA MINING AND WAREHOUSING CONCEPTS

Chapter 5 Business Intelligence: Data Warehousing, Data Acquisition, Data Mining, Business Analytics, and Visualization

Data Warehouse: Introduction

ORACLE OLAP. Oracle OLAP is embedded in the Oracle Database kernel and runs in the same database process

Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health

CHAPTER-24 Mining Spatial Databases

Chapter 6. Foundations of Business Intelligence: Databases and Information Management

How To Use Data Mining For Knowledge Management In Technology Enhanced Learning

The basic data mining algorithms introduced may be enhanced in a number of ways.

Sanjeev Kumar. contribute

Statistical Models in Data Mining

Chapter ML:XI. XI. Cluster Analysis

Static Data Mining Algorithm with Progressive Approach for Mining Knowledge

Azure Machine Learning, SQL Data Mining and R

Web Data Mining: A Case Study. Abstract. Introduction

How To Solve The Kd Cup 2010 Challenge

Data Mining: Motivations and Concepts

When to consider OLAP?

CSE 544 Principles of Database Management Systems. Magdalena Balazinska Fall 2007 Lecture 16 - Data Warehousing

3/17/2009. Knowledge Management BIKM eclassifier Integrated BIKM Tools

Customer Classification And Prediction Based On Data Mining Technique

Transcription:

Data Warehousing and Data Mining Winter Semester 2012/2013 Free University of Bozen, Bolzano DM Lecturer: Mouna Kacimi mouna.kacimi@unibz.it http://www.inf.unibz.it/dis/teaching/dwdm/index.html

Organization } Lectures Tuesdays From 10:30 To 12:30 Office hours Dr. Kacimi: (appointment by email) } Projects Lab hours on Tuesdays from 14:00 to 16:00 } Announcements There will be no class on and no lab on Tuesday, October 30 The lab will be recovered on November 19 from 10:30 to 12:30 } Textbooks Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Second Edition, 2006 Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, "Introduction to Data Mining", Pearson Addison Wesley, 2008, ISBN: 0-32-134136-7 Margaret H. Dunham, Data Mining: Introductory and Advanced Topics, Prentice Hall, 2003

Exam Procedure } Requirements: obtain 18 credit points in each of the following g Project g Exam Final grade= Project_grade 0.4+ Exam_grade 0.6 } Important g A successful project remains valid g If a project is unsuccessful until the day of the exam, its validity expires g Students can do a new project until the next exam session. In this case, the teaching assistant does not guarantee support for supervising the students.

Project Procedure } There will be two types of projects: Data warehouse Project Data Mining Project g you can do a DW project g you can do a DM project g you can do ½ DM project & ½ DW project } Structure DW Project start date 09.10.12 6 labs part1 DM Project start date 16.10.12 6 labs part1 start date 27.11.12 start date 27.11.12 6 labs part2 6 labs part2

Data Mining Project } Part 1 (Mining interesting patterns in data) " Get to know data choose a dataset you like analyze and represent its properties define knowledge to be extracted " Extract frequent patterns from your dataset store data for efficient processing implement one of the well established algorithms in the area integrate optimization techniques to scale analyze the extracted knowledge and show how it can be useful for decision making report encountered problems, challenges and how did you face them

Data Mining Project } Part 2 (Supervised & Unsupervised Learning) " Define a Data Mining Task choose a dataset you like define a suitable DM task for it (supervised or unsupervised) motivate your choice with a real application " Implement your data mining task based on the properties of your dataset and the data mining task you have proposed, choose the most suitable algorithm implement the algorithm (or compare different algorithms) enhance the algorithm with necessary optimization techniques apply the algorithm to your dataset report the findings and show their usefulness report encountered problems, challenges and how did you face them

Data Mining Project } Part 2 (Free choice option) " Choose any problem in the data mining area " Your choice must be first approved by the teaching assistant

Data Mining

Outline } Introduction to Data Mining } Data Analysis and Uncertainty } Frequent Pattern Mining } Cluster Analysis } Classification & Prediction Part I: Introduction & Foundations Part II: Patterns Extraction Part III: Unsupervised Learning Part IV: Supervised Learning

Chapter I: Introduction & Foundations } 1.1 Introduction 1.1.1 Definitions & Motivations 1.1.2 Data to be Mined 1.1.3 Knowledge to be discovered 1.1.4 Techniques Utilized 1.1.5 Applications Adapted 1.1.6 Major Issues in Data Mining } 1.2 Getting to Know Your Data 1.2.1 Data Objects and Attribute Types 1.2.2 Basic Statistical Descriptions of Data } 1.2.3 Measuring Data Similarity and Dissimilarity 1.3 Basics from Probability Theory and Statistics 13.1 Probability Theory 1.3.2 Statistical Inference: Sampling and Estimation 1.3.3 Statistical Inference: Hypothesis Testing and Regression

1.1 Definitions & Motivations Why Data Mining? } Explosive Growth of Data: from terabytes to petabytes } Data Collections and Data Availability " Crawlers, database systems, Web, etc. " Sources Business: Web, e-commerce, transactions, etc. Science: Remote sensing, bioinformatics, etc. Society and everyone: news, YouTube, etc. } Problem: We are drowning in data, but starving for knowledge! } Solution: Use Data Mining tools for Automated Analysis of massive data sets

What is Data Mining? } Data mining (knowledge discovery from data) " Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data " Data mining: a misnomer? Stone Gold Mining Not Stone Mining Data Knowledge Knowledge Mining? } Alternative names " Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Knowledge Discovery (KDD) Process } Data Mining as a step in the knowledge discovery process Knowledge Evaluation & Presentation Data Mining Patterns Selection & transformation Task-relevant Data Data Cleaning & Integration Data Warehouse } Data mining plays an essential role in the knowledge discovery process Databases

Knowledge Discovery (KDD) Process } } } } } } } Data Cleaning " Remove noise and inconsistent data Data Integration " Combine multiple data sources Data Selection " Data relevant to analysis tasks are retrieved form the data Data transformation " Transform data into appropriate form for mining (summary, aggregation, etc.) Data mining " Extract data patterns Pattern Evaluation " Identify truly interesting patterns Knowledge representation " Use visualization and knowledge representation tools to present the mined data to the user

Typical Architecture of a Data Mining System } Knowledge Base " Guide the search " Evaluate interestingness of the results } Include " Concept hierarchies " User believes " Constraints, thresholds, metadata, etc. User Interface Pattern Evaluation Data Mining Engine Database or Data Warehouse Server Knowledge Base Data cleaning, Integration and selection Database Data Warehouse World Wide Web Other Info Repositories

Confluence of Multiple Disciplines Database Technology Statistics Machine Learning Data Mining Visualization Pattern Recognition Algorithm Other Disciplines

Why Confluence of Multiple Disciplines? } Tremendous amount of data " Scalable algorithms to handle terabytes of data (e.g., Flickr hits 6 billion images but facebook does that every 2 weeks http://thenextweb.com/socialmedia/2011/08/05/flickr-hits-6-billion-total-photos-but-facebook-does-thatevery-2-months//) } High dimensionality of data " Data can have tens of thousands of features (e,g., DNA microarray) } High complexity of data } " Data can be highly complex, can be of different types, and can include different descriptors Images can be described using text and visual features such as color, texture, contours, etc. Videos can be described using text, images and their descriptors, audio phonemes, etc. Social networks can have a complex structure New and sophisticated applications " Applications can be difficult (e.g., medical applications).

} Data View Different Views of Data Mining " Kinds of data to be mined } Knowledge view " Kinds of knowledge to be discovered } Method view " Kinds of techniques utilized } Application view " Kinds of applications

1.1.2 Data to be Mined } In principle, data mining should be applicable to any data repository } This lecture includes examples about: " Relational databases " Data warehouses " Transactional databases " Advanced database systems

} Database System Relational Databases " Collection of interrelated data, known as database " A set of software programs that manage and access the data } Relational Databases (RD) " A collection of tables. Each one has a unique name " A table contains a set of attributes (columns) & tuples (rows). " Each object in a relational table has a unique key and is described by a set of attribute values. Costumers " Data are accessed using database queries (SQL): projection, join, etc. } Data Mining applied to RD " Search for trends or data patterns " Example: cust_id Name age income predict the credit risk of costumers based on their income, age and expenses. 152... Purchases Anna... 27... 24000... trans_id cust_id method Amount T156... 152... Visa... 1357...

Data Warehouses } A data warehouse (DW) is a repository of information collected from multiple sources, stored under a unified schema. Data source in Bolzano Data source in Paris Data source in Madrid Clean Integrate Transform Load refresh Data Warehouse Query and Analysis Tools Client Client } Data organized around major subjects (using summarization) } Multidimensional database structure (e.g., data Cube) " Dimension = one attribute or a set of attributes " Cell = stores the value of some aggregated measures. } Data Mining applied to DW " Data warehouse tools help data analysis " Data Mining tools are required to allow more in-depth and automated analysis

Transactional Databases } A transactional database (TD) consists of a file where each record represents a transaction. } A transaction includes a unique transaction identifier (trans_id) and a list of the items making the transaction. } A transaction database may include other tables containing other information regarding the sale trans_id List of items_ids (customer_id, location, etc.) } Basic analysis (examples) " Show me all the items purchased by David Winston? " How many transactions include item number 5? } Data Mining on TD " Perform a deeper analysis " Example: Which items sold well together? " Basically, data mining systems can identify frequent sets in transactional databases and perform market basket data analysis. T100 T200 I1,I3,I8,I16 I2,I8......

Advanced Database Systems(1) } Advanced database systems provide tools for handling complex data " Spatial data (e.g., maps) " Engineering design data (e.g., buildings, system components) " Hypertext and multimedia data (text, image, audio, and video) " Time-related data (e.g., historical records) " Stream data (e.g., video surveillance and sensor data) " World Wide Web, a huge, widely distributed information repository made available by Internet } Require efficient data structures and scalable methods to handle " Complex object structures and variable length records " Semi structured or unstructured data " Multimedia and spatiotemporal data " Database schema with complex and dynamic structures

Advanced Database Systems(2) } } Example: World Wide Web " Provide rich, worldwide, online and distributed information services. " Data objects are linked together " Users traverse from one object via links to another " Problems Data can be highly unstructured Difficult to understand the semantic of web pages and their context. Data Mining on WWW " Web usage Mining (user access pattern) Improve system design (efficiency) Better marketing decisions (adverts, user profile) " Authoritative Web page Analysis Ranking web pages based on their importance " Automated Web page clustering and classification Group and arrange web pages based on their content " Web community analysis Identify hidden web social networks and observe their evolution

1.1.3 Knowledge to be Discovered } Data mining functionalities are used to specify the kind of patterns to be found in data mining tasks } Data mining tasks can be classified into two categories " Descriptive : Characterize the general properties of the data " Predictive : Perform inference on the current data to make predictions } What to extract? " Users may not have an idea about what kinds of patterns in their data can be interesting } What to do? " Have a data mining system that can mine multiple types of patterns to handle different user and application needs. " Discover patterns at various granularities (levels of abstraction) Street City Country " Allow users to guide the search for interesting patterns Example of different granularities

Characterization and Discrimination (1) } Data can be associated with classes or concepts Example of data from a store Classes Concepts } Class/Concept descriptions: describe individual classes and concepts in summarized, concise, and precise way. " Data characterization Summarize the data of the class under study (target class) " Data Discrimination Compare the target class with a set of comparative classes (contrasting classes) " Data characterization & Discrimination printers computers Big-Spenders Budget-Spenders Perform both analysis

Characterization and Discrimination (2) } Data Characterization " Summarize the general features of a target class of data " Tools: statistical measures, data cube-based OLAP roll-up, etc. " Output: charts, curves, multidimensional data cubes, etc. " Example Summarize the characteristics of costumers who spend more than 1000 } Data Discrimination " Comparison of the general features of a target class with the general features of contrasting classes " Output: similar to characterization + comparative measures " Example Compare customers who shop for computer products regularly( more than 2 times a month) with those who rarely shop for such products(less then three times a year) Frequent costumers Costumers profile 80% Are between 20 and 40 Have university education 40-50 years old Employed excellent credit ratings Comparative profile Rare costumers 60% Are senior or youths Have no university degree

Frequent Patterns, Associations, Correlations } Frequent patterns are patterns occurring frequently in the data (e.g., item-sets, sub-sequences, and substructures) " Frequent item-sets: items that frequently appear together Example in a transactional data set: bred and milk " Frequent Sequential pattern: a frequently occurring subsequence Example in a transactional data set: buy first PC, second digital camera, third memory card } Association Analysis " Derive some association rules buys(x, computer ) buys (X, software ) [support =1%, confidence=50%] age(x, 20...29 ) income(x, 20K...29K ) buys (X, CD player ) [support =2%, confidence=60%] } Correlation Analysis " Uncover interesting statistical correlations between associated attribute-value pairs

Classification and Prediction } Construct models (functions) based on some training examples } Describe and distinguish classes or concepts for future prediction } Predict some unknown class labels Training examples Age Income Class label 27 28K Budget-Spenders 35 36K Big-Spenders 65 45K Budget-Spenders Supervised Learning Classification model (function) Unlabeled data Age Income 29 25K } Typical methods " Decision trees, naive Bayesian classification, logistic regression, support vector machine, neural networks, etc. } Typical Applications Classifier Class label [Budget Spender] Numeric value " Credit card fraud detection, classifying web pages, stars, diseases, etc. [Budget Spender (0.8)]

Cluster Analysis } Unsupervised learning (i.e., Class label is unknown) } Group data to form new categories (i.e., clusters), e.g., cluster houses to find distribution patterns } Principle: Maximizing intra-class similarity & minimizing interclass similarity } Typical methods " Hierarchical methods, density-based methods, Grid-based methods, Model-Based methods, constraint-based methods, etc. } Typical Applications " WWW, social networks, Marketing, Biology, Library, etc.

Outlier Analysis } Outlier: A data object that does not comply with the general behavior of the data learning (i.e., Class label is unknown) } Noise or exception? One person s garbage could be another person s treasure } Typical methods " Product of clustering or regression analysis, etc. } Typical Applications " Useful in fraud detection " Example Or? How to uncover fraudulent usage of credit card? Detect purchases of extremely large amounts for a given account number in comparison to regular charges incurred by the same account Outliers may also be detected with respect to the location and type of purchase, or the frequency.

Evolution Analysis } Evolution Analysis describes trends of data objects whose behavior changes over time } It includes " Characterization and discrimination analysis " Association and correlation analysis " Classification and prediction " Clustering of time related data } Distinct features for such analysis " Time-series data analysis " Sequence or periodicity pattern matching e.g., first buy digital camera, then buy large SD memory cards " Similarity-based analysis

1.1.4 Techniques Utilized } Data-intensive } Data warehouse (OLAP) } Machine learning } Statistics } Pattern recognition } Visualization } High-performance }...

1.1.5 Applications Adapted } Web page analysis: from web page classification, clustering to PageRank & HITS algorithms } Collaborative analysis & recommender systems } Basket data analysis to targeted marketing } Biological and medical data analysis: classification, cluster analysis (microarray data analysis), biological sequence analysis, biological network analysis } From major dedicated data mining systems/tools (e.g., SAS, MS SQL-Server Analysis Manager, Oracle Data Mining Tools) to invisible data mining

1.1.6 Major Challenges in Data Mining } Efficiency and scalability of data mining algorithms } Parallel, distributed, stream, and incremental mining methods } Handling high-dimensionality } Handling noise, uncertainty, and incompleteness of data } Incorporation of constraints, expert knowledge, and background knowledge in data mining } Pattern evaluation and knowledge integration } Mining diverse and heterogeneous kinds of data: e.g., bioinformatics, Web, software/system engineering, information networks } Application-oriented and domain-specific data mining } Invisible data mining (embedded in other functional modules) } Protection of security, integrity, and privacy in data mining

Summary of Section 1.1 } Data Mining is a process of extracting knowledge from data } Data to be mined can be of any type " Relational Databases, Advanced databases, etc. } Knowledge to be discovered " Frequent patterns, correlations, associations, classification, prediction, clustering } Techniques to be used " Statistics, machine learning, visualization, etc. } Data Mining is interdisciplinary " Large amount of complex data and sophisticated applications } Challenges of data Mining " Efficiency, scalability, parallel and distributed mining, handling high dimensionality, handling noisy data, mining heterogeneous data, etc.