Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data




Knowledge Discovery and Data Mining, Unit 2

Structured vs. Non-Structured Data

Most business databases contain structured data, consisting of well-defined fields with numeric or alphanumeric values. Semi-structured data has some organization but no rigid record format; examples are electronic images of business documents, medical reports, executive summaries, etc. The majority of web documents also fall into this category. An example of unstructured data is a video recorded by a surveillance camera in a department store. This form of data generally requires extensive processing to extract and structure the information it contains.

Structured vs. Non-Structured Data (Cont'd)

Structured data is often referred to as traditional data, while semi-structured and unstructured data are lumped together as non-traditional data. Most current data mining methods and commercial tools are applied to traditional data.

SQL vs. Data Mining

SQL (Structured Query Language) is a standard relational database language that is good for queries that impose some kind of constraints on data in the database in order to extract an answer. In contrast, data mining methods are good for queries that are exploratory in nature, trying to extract hidden, not so obvious information. SQL is useful when we know exactly what we are looking for and can describe it formally; we use data mining methods when we know only vaguely what we are looking for.

OLAP vs. Data Mining

OLAP tools make it very easy to look at dimensional data from any angle or to slice-and-dice it. The derivation of answers from data in OLAP is analogous to calculations in a spreadsheet, because they use simple, given-in-advance calculations. OLAP tools do not learn from data, nor do they create new knowledge. They are usually special-purpose visualization tools that help end-users draw their own conclusions and decisions, based on graphically condensed data.

Statistics vs. Machine Learning

Data mining has its origins in various disciplines, of which the two most important are statistics and machine learning. Statistics has its roots in mathematics, and therefore there has been an emphasis on mathematical rigor, a desire to establish that something is sensible on theoretical grounds before testing it in practice. In contrast, the machine learning community has its origins very much in computer practice. This has led to a practical orientation, a willingness to test something out to see how well it performs, without waiting for a formal proof of effectiveness.

Statistics vs. Machine Learning (Cont'd)

Modern statistics is entirely driven by the notion of a model: a postulated structure, or an approximation to a structure, which could have led to the data. In place of the statistical emphasis on models, machine learning tends to emphasize algorithms.

Types of Attributes

There are different types of attributes:
- Nominal. Examples: ID numbers, eye color, zip codes.
- Ordinal. Examples: rankings (e.g., taste of potato chips on a scale from 1 to 10), grades, height in {tall, medium, short}.
- Interval. Examples: calendar dates, temperature in Celsius or Fahrenheit.
- Ratio. Examples: temperature in Kelvin, length, time, counts.

Properties of Attribute Values

The type of an attribute depends on which of the following properties it possesses:
- Distinctness: =, !=
- Order: <, >
- Addition: +, -
- Multiplication: *, /

A nominal attribute has distinctness only; an ordinal attribute has distinctness and order; an interval attribute has distinctness, order, and addition; a ratio attribute has all four properties.

Missing Values

Reasons for missing values:
- Information is not collected (e.g., people decline to give their age and weight).
- Attributes may not be applicable to all cases (e.g., annual income is not applicable to children).

Ways of handling missing values:
- Eliminate data objects.
- Estimate missing values.
- Ignore the missing value during analysis.
- Replace with all possible values (weighted by their probabilities).
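The first two handling strategies can be sketched in a few lines of Python (the ages below are invented for illustration, with None marking a missing value):

```python
import statistics

# Hypothetical records: age is missing (None) when a person declined to give it.
ages = [23, None, 45, 31, None, 52, 38]

# Option 1: eliminate data objects that have a missing value.
complete = [a for a in ages if a is not None]

# Option 2: estimate the missing value, e.g. with the mean of the observed values.
mean_age = statistics.mean(complete)
imputed = [a if a is not None else mean_age for a in ages]

print(complete)  # [23, 45, 31, 52, 38]
print(mean_age)  # 37.8
```

Mean imputation is only one possible estimator; a class-conditional mean or a regression-based estimate usually distorts the distribution less.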

Duplicate Data

A data set may include data objects that are duplicates, or almost duplicates, of one another. This is a major issue when merging data from heterogeneous sources. Example: the same person with multiple email addresses. Data cleaning is the process of dealing with duplicate data issues.

Data Preprocessing

The main preprocessing tasks:
- Aggregation
- Sampling
- Dimensionality reduction
- Feature subset selection
- Discretization
- Attribute transformation
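A minimal cleaning sketch for the duplicate-person case, keeping the first record seen for a chosen key attribute (the names and email addresses below are invented):

```python
# Toy customer records; the same person appears under two email addresses.
records = [
    {"name": "Ana Diaz", "email": "ana@example.com"},
    {"name": "Bo Chen",  "email": "bo@example.com"},
    {"name": "Ana Diaz", "email": "a.diaz@example.com"},
]

# Keep the first record seen for each key attribute (here: name).
seen = set()
deduped = []
for r in records:
    if r["name"] not in seen:
        seen.add(r["name"])
        deduped.append(r)

print(len(deduped))  # 2
```

Real deduplication is harder than this: "almost duplicates" require fuzzy matching on several attributes, not an exact key comparison.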

Aggregation

Aggregation combines two or more attributes (or objects) into a single attribute (or object). Purposes:
- Data reduction: reduce the number of attributes or objects.
- Change of scale: cities aggregated into regions, states, countries, etc.
- More stable data: aggregated data tends to have less variability.

Data Normalization

Some data mining methods, typically those based on distance computation between points in an n-dimensional space, may need normalized data for best results. If the values are not normalized, the distance measures will overweight those features that have, on average, larger values.
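The change-of-scale purpose of aggregation can be sketched as follows (the city names and sales figures are invented):

```python
from collections import defaultdict

# Hypothetical city-level sales, aggregated up to state level (change of scale).
city_sales = [
    ("Springfield", "IL", 120.0),
    ("Chicago",     "IL", 340.0),
    ("Austin",      "TX", 210.0),
    ("Dallas",      "TX", 180.0),
]

state_sales = defaultdict(float)
for city, state, amount in city_sales:
    state_sales[state] += amount

print(dict(state_sales))  # {'IL': 460.0, 'TX': 390.0}
```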

Normalization Techniques

- Decimal scaling: v'(i) = v(i) / 10^k, for the smallest k such that max |v'(i)| < 1.
- Min-max normalization: v'(i) = [v(i) - min(v)] / [max(v) - min(v)].
- Standard deviation normalization: v'(i) = [v(i) - mean(v)] / sd(v).

Normalization Example

Given the one-dimensional data set X = {-5.0, 23.0, 17.6, 7.23, 1.11}, normalize the data set using:
- Decimal scaling on the interval [-1, 1].
- Min-max normalization on the interval [0, 1].
- Min-max normalization on the interval [-1, 1].
- Standard deviation normalization.
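A sketch of the three techniques applied to the example data set (sample standard deviation is used here; the slides do not say whether sd(v) is the sample or population form):

```python
import math

X = [-5.0, 23.0, 17.6, 7.23, 1.11]

# Decimal scaling: divide by 10^k for the smallest k with max |v'(i)| < 1.
k = 0
while max(abs(v) for v in X) / 10 ** k >= 1:
    k += 1
decimal_scaled = [v / 10 ** k for v in X]

# Min-max normalization onto [0, 1].
lo, hi = min(X), max(X)
min_max = [(v - lo) / (hi - lo) for v in X]

# Standard deviation (z-score) normalization, using the sample std. dev.
mean = sum(X) / len(X)
sd = math.sqrt(sum((v - mean) ** 2 for v in X) / (len(X) - 1))
z_scores = [(v - mean) / sd for v in X]

print(k)           # 2, so every value is divided by 100
print(min_max[1])  # 1.0 (23.0 is the maximum)
```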

Outlier Detection

Statistics-based methods (for one-dimensional data): Threshold = Mean + K x Standard Deviation. Example: Age = {3, 56, 23, 39, 156, 52, 41, 22, 9, 28, 139, 31, 55, 20, -67, 37, 11, 55, 45, 37}.

Distance-based methods (for multidimensional data): distance-based outliers are those samples which do not have enough neighbors, where neighbors are defined through the multidimensional distance between samples.

Deviation-based methods: these methods are based on dissimilarity functions. For a given set of n samples, a possible dissimilarity function is the total variance of the sample set. The task is then to find the smallest subset of samples whose removal results in the greatest reduction of the dissimilarity function. This is, in theory, an NP-hard problem.

Outlier Detection Example

S = {s1, s2, s3, s4, s5, s6, s7} = {(2, 4), (3, 2), (1, 1), (4, 3), (1, 6), (5, 3), (4, 2)}, with threshold values p > 4 and d > 3.

Pairwise distances:

       s2     s3     s4     s5     s6     s7
s1    2.236  3.162  2.236  2.236  3.162  2.828
s2           2.236  1.414  4.472  2.236  1.000
s3                  3.605  5.000  4.472  3.162
s4                         4.242  1.000  1.000
s5                                5.000  5.000
s6                                       1.414

Number of non-neighbors (distance greater than d) for each sample:

Sample  p
s1      2
s2      1
s3      5
s4      2
s5      5
s6      3
s7      2

Samples s3 and s5 have p > 4, so they are the outliers.
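The distance-based procedure for the example above can be sketched directly; it reproduces the p counts in the table:

```python
import math

# Points from the worked example above.
S = {"s1": (2, 4), "s2": (3, 2), "s3": (1, 1), "s4": (4, 3),
     "s5": (1, 6), "s6": (5, 3), "s7": (4, 2)}

D_THRESH = 3.0  # distance threshold d
P_THRESH = 4    # outlier if more than p samples are non-neighbors

p = {}
for name, pt in S.items():
    # Count samples farther away than the distance threshold (non-neighbors).
    p[name] = sum(1 for other, q in S.items()
                  if other != name and math.dist(pt, q) > D_THRESH)

outliers = sorted(name for name, count in p.items() if count > P_THRESH)
print(p)         # {'s1': 2, 's2': 1, 's3': 5, 's4': 2, 's5': 5, 's6': 3, 's7': 2}
print(outliers)  # ['s3', 's5']
```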

Outlier Detection Example II

The number of children for different patients in a database is given by the vector C = {3, 1, 0, 2, 7, 3, 6, 4, -2, 0, 0, 10, 15, 6}. Find the outliers in the set C using the standard statistical parameters mean and variance. If the threshold value is changed from ±3 standard deviations to ±2 standard deviations, what additional outliers are found?

Outlier Detection Example III

For a given data set X of three-dimensional samples, X = [{1, 2, 0}, {3, 1, 4}, {2, 1, 5}, {0, 1, 6}, {2, 4, 3}, {4, 4, 2}, {5, 2, 1}, {7, 7, 7}, {0, 0, 0}, {3, 3, 3}]:
- Find the outliers using the distance-based technique if (a) the threshold distance is 4 and the threshold fraction p for non-neighbor samples is 3, and (b) the threshold distance is 6 and the threshold fraction p is 2.
- Describe the procedure and interpret the results of outlier detection based on mean values and variances for each dimension separately.
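A possible solution sketch for Example II, assuming the population standard deviation (the slide does not specify which form to use; with the sample form the conclusion is the same). Note that the statistical test does not flag the impossible value -2; that one would only be caught by a domain check:

```python
import math

# Example II data: number of children per patient.
C = [3, 1, 0, 2, 7, 3, 6, 4, -2, 0, 0, 10, 15, 6]

mean = sum(C) / len(C)
sd = math.sqrt(sum((v - mean) ** 2 for v in C) / len(C))  # population std. dev.

def outliers(k):
    # Values more than k standard deviations away from the mean.
    return [v for v in C if abs(v - mean) > k * sd]

print(outliers(3))  # []
print(outliers(2))  # [15]
```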

Data Reduction

The three basic operations in a data-reduction process are:
- Delete a row.
- Delete a column (dimensionality reduction).
- Reduce the number of values in a column (smooth a feature).

The main advantages of data reduction are:
- Computing time: simpler data can hopefully lead to a reduction in the time taken for data mining.
- Predictive/descriptive accuracy: we generally expect that, by using only relevant features, a data mining algorithm can not only learn faster but also achieve higher accuracy; irrelevant data may mislead a learning process.
- Representation of the data-mining model: simplicity of representation often means that a model can be better understood.

Sampling

The key principle for effective sampling is the following: using a sample will work almost as well as using the entire data set if the sample is representative. A sample is representative if it has approximately the same property (of interest) as the original set of data.

Types of Sampling

- Simple random sampling: there is an equal probability of selecting any particular item.
- Stratified sampling: split the data into several partitions, then draw random samples from each partition.
- Sampling without replacement: as each item is selected, it is removed from the population.
- Sampling with replacement: objects are not removed from the population as they are selected, so the same object can be picked more than once.

Sample Size

[Figure: the same data set drawn with 8000, 2000, and 500 points.]
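The sampling types above can be sketched with the standard library's random module (the population here is synthetic):

```python
import random

random.seed(0)  # reproducible illustration
population = list(range(100))

# Simple random sampling without replacement: items are removed once selected,
# so the sample is guaranteed to contain distinct items.
without = random.sample(population, 10)

# Sampling with replacement: the same object can be picked more than once.
with_repl = [random.choice(population) for _ in range(10)]

# Stratified sampling: split into partitions, then sample each partition.
strata = {"low": list(range(0, 50)), "high": list(range(50, 100))}
stratified = [v for part in strata.values() for v in random.sample(part, 5)]

print(len(set(without)))  # 10 distinct items, guaranteed
```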

Dimensionality Reduction

Purposes:
- Avoid the curse of dimensionality.
- Reduce the amount of time and memory required by data mining algorithms.
- Allow data to be more easily visualized.
- May help to eliminate irrelevant features or reduce noise.

Feature Subset Selection

Feature subset selection is another way to reduce the dimensionality of data, by removing redundant and irrelevant features. Redundant features duplicate much or all of the information contained in one or more other attributes; for example, the purchase price of a product and the amount of sales tax paid. Irrelevant features contain no information that is useful for the data mining task at hand; for example, a student's ID is often irrelevant to the task of predicting the student's GPA.
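One standard dimensionality-reduction technique, not detailed on the slide, is principal component analysis. A sketch via SVD of mean-centered data, on a synthetic set constructed so that a single direction carries almost all of the variance (NumPy is assumed to be available):

```python
import numpy as np

rng = np.random.default_rng(0)
# 100 samples in 3-D that vary almost entirely along one direction,
# plus a little noise: a 1-D representation loses very little.
t = rng.normal(size=(100, 1))
X = np.hstack([t, 2 * t, 0.5 * t]) + 0.01 * rng.normal(size=(100, 3))

# PCA via SVD of the mean-centered data: project onto the top component.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:1].T  # 100 x 1 instead of 100 x 3

# Fraction of total variance captured by the first component.
explained = s[0] ** 2 / (s ** 2).sum()
print(X_reduced.shape)  # (100, 1)
```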

Feature Subset Selection Techniques

- Brute-force approach: try all possible feature subsets as input to the data mining algorithm.
- Embedded approaches: feature selection occurs naturally as part of the data mining algorithm.
- Filter approaches: features are selected before the data mining algorithm is run.
- Wrapper approaches: use the data mining algorithm as a black box to find the best subset of attributes.

Mean- and Variance-Based Feature Selection

Suppose A and B are the sets of values of one feature measured for two different classes, and n1 and n2 are the corresponding numbers of samples. The standard error of the difference in means is

SE(A - B) = sqrt(var(A)/n1 + var(B)/n2)

TEST: |mean(A) - mean(B)| / SE(A - B) > threshold value

It is assumed that the given feature is independent of the others.

Mean-Variance Example

X    Y    C
0.3  0.7  A
0.2  0.9  B
0.6  0.6  A
0.5  0.5  A
0.7  0.7  B
0.4  0.9  B

SE(X_A - X_B) = 0.4678
SE(Y_A - Y_B) = 0.0875
|mean(X_A) - mean(X_B)| / SE(X_A - X_B) = 0.0375 < 0.5
|mean(Y_A) - mean(Y_B)| / SE(Y_A - Y_B) = 2.2667 > 0.5

With a threshold of 0.5, feature X fails the test and is discarded, while feature Y passes and is kept.

Feature Ranking Example

Given the data set X with three input features and one output feature representing the classification of samples:

I1   I2   I3   O
2.5  1.6  5.9  0
7.2  4.3  2.1  1
3.4  5.8  1.6  1
5.6  3.6  6.8  0
4.8  7.2  3.1  1
8.1  4.9  8.3  0
6.3  4.8  2.4  1

Rank the features using a comparison of means and variances.