Knowledge Discovery in Databases

Javier Béjar (CS - MIA)
AMLT - 2016/2017

Outline
1 Knowledge Discovery in Databases
    Introduction
    Definitions of KDD
2 The KDD process
    Steps of KDD
    Discovery goals
    Mining Methodologies
3 Applications
4 Tools
5 Challenges

1 Knowledge Discovery in Databases

Knowledge Discovery in Databases
- Practical application of methodologies from machine learning and statistics to large amounts of data
- The main problem addressed is the impossibility of manually analyzing (making sense of) all the data we are systematically collecting
- These methodologies are useful for automating or assisting the process of analysis and discovery
- The final goal is to extract, (semi)automatically, actionable and useful knowledge
- We are drowning in information and starving for knowledge

Knowledge Discovery in Databases
- The high point of KDD starts around the late 1990s
- Many companies show interest in obtaining the (possibly) valuable information stored in their databases (purchase transactions, e-commerce, web data, ...)
- The goal is to obtain information that can lead to better commercial strategies and practices through a better understanding of consumers' preferences and behaviour
- Many companies are putting a lot of effort into the development and use of this kind of technology (analysis and tools)
- Several buzzwords have appeared: Business Intelligence, Business Analytics, Predictive Analytics, Data Science, Big Data, ...

Knowledge Discovery in Databases
- Business data are not the only data in need of these kinds of techniques
- The analysis of scientific data has been an important driver:
  - Space probes
  - Remote sensors on satellites
  - Astronomical observations (big array observatories)
  - Large scientific experiments (LHC, ITER)
  - Genome Project, microarray data
  - Bioinformatics
  - Neuroscience (Human Brain Project)
- Data grows faster than the ability to analyze it

KDD: Machine learning
- Inductive machine learning: discovery of patterns/models from data
- Supervised discovery/unsupervised discovery
- Unstructured/structured representations
- Logic representations/probabilistic representations
- Scalability

KDD: Statistics/Data Analysis
- Statistical data modeling: fitting of probability models to data
- Supervised/unsupervised modeling
- Structured models
- Probabilistic representation/interpretation of data
- Scalability
- Statistical machine learning

KDD: Databases/Algorithmics/Visualization
- Data access: SQL vs NoSQL
- Distributed file systems
- Redundancy/fault tolerance/parallelism
- Databases for structured data: transactions, graphs, time sequences
- Distributed processing paradigms/scalability: MapReduce, Hadoop, Spark, ...
- Data visualization: from data cubes to structured data representation

KDD definitions
- It is the search for valuable information in great volumes of data
- It is the exploration and analysis, by automatic or semiautomatic tools, of great volumes of data in order to discover patterns and rules
- It is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data

Elements of KDD
- Pattern: any representation formalism capable of describing the common characteristics of a group of instances
- Valid: a pattern is valid if it is able to predict the behaviour of new information with a degree of certainty
- Novel: knowledge is novel if it is not already known with respect to the domain knowledge and any previously discovered knowledge
- Useful: new knowledge is useful if it allows us to perform actions that yield some benefit according to an established criterion
- Understandable: the knowledge discovered must be analyzed by an expert in the domain; consequently, the interpretability of the result is important

2 The KDD process

KDD as a process
- The actual discovery of patterns is only one part of a more complex process
- Raw data is not always ready for processing (80/20 project effort)
- Some general methodologies have been defined for the whole process (CRISP-DM or SEMMA)
- These methodologies address KDD as an engineering process; despite being business oriented, they are general enough to be applied to any data discovery domain

The KDD process (I)
Steps of the Knowledge Discovery in Databases process:
1. Domain study
2. Creating the dataset
3. Data preprocessing
4. Dimensionality reduction
5. Selection of the discovery goal
6. Selection of the adequate methodologies
7. Data Mining
8. Result assessment and interpretation
9. Using the knowledge

The KDD process (II)
1. Study of the domain
   Gather information about the domain: its characteristics and the goal of the discovery process (attributes, representative examples, types of patterns, sources of data)
2. Creating the dataset
   From the information gathered in the previous step, decide which sources of data will be used, which attributes will describe the data, and which examples are needed for the goals of the discovery process

The KDD process (III)
3. Data preprocessing and cleaning
   The circumstances that affect the quality of the data have to be studied:
   - Outliers
   - Noise (does it exist? does it present any pattern? can it be reduced?)
   - Missing values
   - Discretization of continuous values
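The preprocessing concerns listed for step 3 can be sketched with the Python standard library alone. This is a minimal illustration, not a method prescribed by the course: median imputation, the 1.5 x IQR outlier rule and equal-width binning are assumed choices, and the `ages` column is made-up data.

```python
from statistics import median, quantiles

def impute_missing(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = median(observed)
    return [fill if v is None else v for v in values]

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]

def discretize(values, bins=3):
    """Map each value to an equal-width bin index in 0..bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant columns
    return [min(int((v - lo) / width), bins - 1) for v in values]

# Hypothetical attribute with one missing value and one suspicious value.
ages = [23, 25, None, 31, 29, 27, 120, 26]

clean = impute_missing(ages)
outliers = iqr_outliers(clean)
kept = [v for v in clean if v not in outliers]
print("imputed:    ", clean)
print("outliers:   ", outliers)
print("discretized:", discretize(kept))
```

Whether a flagged value is an error or a legitimate extreme case is a domain question, which is why this step usually needs the information gathered in step 1.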

The KDD process (V)
4. Data reduction and projection
   We have to study which attributes are relevant to our goal (depending on the task, some techniques can be used to measure the relevance of the attributes) and how many examples are needed. Not all data mining algorithms are scalable.
   - Instance selection (do we need all the examples? sampling techniques)
   - Attribute selection (what is really relevant?)
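As one possible reading of "sampling techniques" for instance selection, the sketch below draws a stratified random sample, so that a rare class keeps its proportion in the reduced dataset. The function name `stratified_sample`, the `fraction` parameter and the fraud/ok data are illustrative assumptions, not part of the course material.

```python
import random
from collections import defaultdict

def stratified_sample(rows, label_of, fraction, seed=0):
    """Sample roughly `fraction` of the rows from each class separately."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    by_class = defaultdict(list)
    for row in rows:
        by_class[label_of(row)].append(row)
    sample = []
    for members in by_class.values():
        k = max(1, round(len(members) * fraction))  # keep at least one per class
        sample.extend(rng.sample(members, k))
    return sample

# Made-up dataset: 1000 transactions, 10% of them labelled as fraud.
rows = [(i, "fraud" if i % 10 == 0 else "ok") for i in range(1000)]
subset = stratified_sample(rows, label_of=lambda r: r[1], fraction=0.1)
print(len(subset), sum(1 for r in subset if r[1] == "fraud"))
```

A plain uniform sample of the same size would also work on average, but with very rare classes it can easily miss most of the interesting examples.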

The KDD process (VI) - Attribute selection
It is very important to use methods for attribute selection:
- Reduces the dimensionality of the data (curse of dimensionality)
- Eliminates/reduces irrelevant and redundant information
- The result of the process is easier to interpret
Attribute selection techniques:
- Mathematical/statistical techniques: Principal Component Analysis (PCA), projection pursuit, multidimensional scaling
- Heuristics for attribute relevance evaluation (ranking of attributes, search in the space of subsets of attributes)
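A simple instance of the "ranking of attributes" heuristic is to score each attribute by the absolute Pearson correlation with the target and sort. The sketch below assumes numeric attributes and a numeric target; the column names and values are invented for illustration. It is a univariate filter, so it does not detect redundancy between attributes.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient; 0.0 if either column is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def rank_attributes(columns, target):
    """Return (name, score) pairs sorted by decreasing |correlation|."""
    scores = {name: abs(pearson(vals, target)) for name, vals in columns.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Made-up data: one informative, one nearly irrelevant, one constant attribute.
columns = {
    "income": [20, 35, 50, 65, 80, 95],
    "shoe_size": [42, 38, 44, 39, 43, 41],
    "constant": [1, 1, 1, 1, 1, 1],
}
target = [1, 2, 3, 4, 5, 6]

ranking = rank_attributes(columns, target)
for name, score in ranking:
    print(f"{name:10s} {score:.3f}")
```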

The KDD process (VII)
5. Selecting the discovery goal
   The characteristics of the data, the domain and the aim of the project determine which kinds of analysis are feasible (group partitioning, summarization, classification, discovery of attribute relations, ...)
6. Selecting the adequate methodologies
   The goal and the characteristics of the data determine the most adequate methodologies

The KDD process (VIII)
7. Applying the methodologies (Data Mining)
   The parameters of the chosen methodologies have to be adjusted by experimentation and analysis in order to obtain the best possible results
8. Interpreting the results
   The relevance and importance of the results are assessed using the knowledge of the domain (expert). This interpretation step can feed back into the previous steps: some adjustments may be needed, or some earlier decisions may have to be changed
9. Incorporating the new knowledge
   The new knowledge is used to perform the task that was the goal of the discovery process
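Step 7 says that parameters are adjusted by experimentation. A minimal sketch of that loop, assuming a k-nearest-neighbours classifier and leave-one-out validation as the evaluation protocol (both are illustrative choices, and the two-dimensional points are made up), could look like this:

```python
from collections import Counter

def knn_predict(train, query, k):
    """Majority label among the k training points closest to `query`."""
    nearest = sorted(
        train,
        key=lambda item: (item[0][0] - query[0]) ** 2 + (item[0][1] - query[1]) ** 2,
    )[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def loo_accuracy(data, k):
    """Leave-one-out accuracy: predict each point from all the others."""
    hits = 0
    for i, (point, label) in enumerate(data):
        rest = data[:i] + data[i + 1:]
        hits += knn_predict(rest, point, k) == label
    return hits / len(data)

# Made-up data: two separated groups plus one noisy label inside group "a".
data = [
    ((1.0, 1.0), "a"), ((1.0, 2.0), "a"), ((2.0, 1.0), "a"), ((2.0, 2.0), "a"),
    ((8.0, 8.0), "b"), ((8.0, 9.0), "b"), ((9.0, 8.0), "b"), ((9.0, 9.0), "b"),
    ((1.5, 1.5), "b"),  # noise
]

scores = {k: loo_accuracy(data, k) for k in (1, 3, 5)}
best_k = max(scores, key=scores.get)
print(scores, "best k:", best_k)
```

With k = 1 the single noisy point misleads all of its neighbours, which is the kind of effect that only shows up when several parameter values are actually tried and compared.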

Goals of the KDD process
There are different goals that can be pursued as the result of the discovery process, among them:
- Classification: we need models that can discriminate instances that belong to a previously known set of groups (the model may or may not be interpretable)
- Clustering/Partitioning/Segmentation: we need to discover models that cluster the data into groups with common characteristics (a characterization of the groups is desirable)
- Regression: we look for models that predict the behaviour of continuous variables as a function of others

Goals of the KDD process
- Summarization: we look for a compact description that summarizes the characteristics of the data
- Causal dependence: we need models that reveal the causal dependence among the variables and assess the strength of this dependence
- Structure dependence: we need models that reveal patterns among the relations that describe the structure of the data
- Change: we need models that discover patterns in data that has temporal or spatial dependence
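To make the regression goal from the list of goals concrete, the sketch below fits a straight line by ordinary least squares, predicting one continuous variable as a function of another. The closed-form formulas are standard; the variable names and the data points are assumptions made for the example.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx

# Made-up measurements of a roughly linear relation.
xs = [1, 2, 3, 4, 5]
ys = [3.1, 4.9, 7.2, 9.0, 10.8]

slope, intercept = fit_line(xs, ys)
predict = lambda x: slope * x + intercept
print(f"y = {slope:.2f} * x + {intercept:.2f}; prediction for x=6: {predict(6):.2f}")
```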

Methodologies for KDD
Many methodologies can be applied in the discovery process; the most usual are:
- Decision trees, decision rules:
  - Usually interpretable models
  - Can be used for: classification, regression, and summarization
  - Trees: C4.5, CART, QUEST; rules: RIPPER, CN2, ...
- Classifiers, regression:
  - Low interpretability but good accuracy
  - Can be used for: classification and regression
  - Statistical regression, function approximation, neural networks, Support Vector Machines, k-NN, Locally Weighted Regression, ...
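The interpretability of tree and rule models comes from the fact that each node is a readable test on one attribute. The sketch below shows only the core step of growing a tree: choosing the threshold on a numeric attribute that minimizes the weighted Gini impurity (the criterion used by CART). It is a one-level illustration on made-up data, not an implementation of C4.5 or CART.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 for a pure or empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best rule `value <= threshold`."""
    best = (None, float("inf"))
    for t in sorted(set(values))[:-1]:  # the largest value cannot split the data
        left = [lab for v, lab in zip(values, labels) if v <= t]
        right = [lab for v, lab in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical customers: age and whether they bought a product.
ages = [22, 25, 30, 35, 40, 52, 60]
buys = ["no", "no", "no", "yes", "yes", "yes", "yes"]

threshold, impurity = best_split(ages, buys)
print(f"rule: if age <= {threshold} then 'no' else 'yes' (impurity {impurity:.2f})")
```

A full tree learner would apply the same search recursively to each side of the split and over every attribute.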

Methodologies for KDD
- Clustering:
  - Its goal is to partition datasets or discover groups
  - Can be used for: clustering, summarization
  - Statistical clustering, unsupervised machine learning, unsupervised neural networks (Self-Organizing Maps)
- Dependency models (attribute dependence, temporal dependence, graph substructures):
  - Their goal is to obtain models (some of them interpretable) of the dependence relations (structural, causal, temporal) among attributes/instances
  - Can be used for: causal dependence discovery, temporal change, substructure discovery
  - Bayesian networks, association rules, Markov models, graph algorithms, ...
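As a minimal sketch of a partitioning method, the code below implements the basic k-means loop (assign each point to the nearest centroid, then recompute the centroids). k-means is used here only as a familiar representative of statistical clustering; passing the initial centroids explicitly and running a fixed number of iterations are simplifying assumptions, and the points are made up.

```python
def dist2(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coord) / len(points) for coord in zip(*points))

def kmeans(points, centroids, iterations=10):
    """Basic k-means: returns the final centroids and the last assignment."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [
            mean_point(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Made-up data with two visible groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
centroids, clusters = kmeans(points, centroids=[points[0], points[-1]])
for c, members in zip(centroids, clusters):
    print(c, members)
```

The centroids double as a compact characterization of each group, which is why clustering is also listed as a tool for summarization.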

3 Applications

Applications
Business:
- Customer segmentation, customer profiling, customer transaction data, customer churn
- Fraud detection
- Control/analysis of industrial processes
- e-commerce, on-line recommendation
- Financial data (stock market analysis)
Web mining:
- Text mining, document search/organization
- Social network analysis
- User behavior

Applications
Scientific applications:
- Medicine (patient data, MRI scans, ECG, EEG, ...)
- Pharmacology (drug discovery, screening, in silico testing)
- Astronomy (identification of astronomical bodies)
- Genetics (gene identification, DNA microarrays, bioinformatics)
- Satellite/probe data (meteorology, astronomy, geology, ...)
- Large scientific experiments (CERN LHC, ITER)

4 Tools

Tools for KDD
- There are a lot of tools available for KDD
- Some tools were developed at universities (C5.0, CART/MARS) and have become commercial products; others remain open source (Weka, R, scikit-learn)
- Big fish eats little fish (C5.0 Clementine SPSS-clementine IBM DBMiner)
- Data analysis software companies incorporate KDD techniques into classical data analysis tools (SPSS, SAS)
- Companies selling databases add KDD tools as added value (IBM DB2 (Intelligent Miner), SQL Server, Oracle)
- Machine Learning as a Service (Amazon, Microsoft, Google, IBM Watson, BigML, ...)

Tools for the course
- Python
  - General programming language, easy to learn
  - numpy, scipy, pandas
  - scikit-learn (http://scikit-learn.org)
  - Data preprocessing, clustering algorithms, association rules, ...
- R (http://cran.r-project.org/)
  - Language oriented to statistical analysis, with a steeper learning curve
  - Many packages
  - Data preprocessing, clustering algorithms, association rules, ...

5 Challenges

Open problems
- Scalability (more data, more attributes)
- Overfitting (patterns with low interest)
- Statistical significance of the results
- Methods for temporal data/relational data/structured data
- Methods for data cleaning (missing data and noise)
- Pattern comprehensibility
- Use of domain knowledge
- Integration with other techniques (OLAP, Data Warehousing, Business Intelligence, Intelligent Decision Support Systems)
- Privacy