Introduction
The data production rate has increased dramatically (Big Data), and we are able to store much more data than before, e.g., purchase data, social media data, mobile phone data. Businesses and customers need useful or actionable knowledge and want to gain insight from raw data for various purposes; it is not just about searching data or databases. The process of extracting useful patterns from raw data is known as Knowledge Discovery in Databases (KDD).
KDD Process
Data Mining
The process of discovering hidden patterns in large data sets, utilizing methods at the intersection of artificial intelligence, machine learning, statistics, and database systems:
- Extracting or mining knowledge from large amounts of data (big data)
- Data-driven discovery and modeling of hidden patterns in big data
- Extracting implicit, previously unknown, unexpected, and potentially useful information/knowledge from data
Data
Data Instances
In the KDD process, data is represented in a tabular format. Each instance is a collection of properties and features related to an object or person, e.g., a patient's medical record, a user's profile, or a gene's information. Instances are also called points, data points, or observations.
A data instance consists of feature values (features are also called attributes or measurements) and, optionally, a class label (the value of the class attribute).
Data Instances
Example task: predicting whether an individual who visits an online bookseller is going to buy a specific book.
- Continuous feature: values are numeric, e.g., money spent: $25
- Discrete feature: takes one of a finite set of values, e.g., money spent: {high, normal, low}
An instance without a class label is an unlabeled example; an instance with a class label is a labeled example.
Data Types + Permissible Operations (statistics)
- Nominal (categorical). Operations: mode (most common feature value), equality comparison. E.g., {male, female}
- Ordinal. Feature values have an intrinsic order to them, but the differences between values are not defined. Operations: same as nominal, plus ranking of feature values. E.g., {low, medium, high}
- Interval. Operations: addition and subtraction are allowed, whereas division and multiplication are not. E.g., 3:08 PM, calendar dates
- Ratio. Operations: division and multiplication are also allowed. E.g., height, weight, money quantities
Sample Dataset

outlook   temperature  humidity  windy  play
sunny     85           85       FALSE  no
sunny     80           90       TRUE   no
overcast  83           86       FALSE  yes
rainy     70           96       FALSE  yes
rainy     68           80       FALSE  yes
rainy     65           70       TRUE   no
overcast  64           65       TRUE   yes
sunny     72           95       FALSE  no
sunny     69           70       FALSE  yes
rainy     75           80       FALSE  yes
sunny     75           70       TRUE   yes
overcast  72           90       TRUE   yes
overcast  81           75       FALSE  yes
rainy     71           91       TRUE   no

(Feature types annotated on the slide: interval, ratio, ordinal, nominal.)
Text Representation
The most common way to model documents is to transform them into sparse numeric vectors and then handle them with linear algebraic operations. This representation is called the "Bag of Words" model. Methods: vector space model; TF-IDF.
Vector Space Model
In the vector space model, we start with a set of documents, D. Each document is a set of words. The goal is to convert these textual documents to vectors:
d_i: document i; w_{j,i}: the weight for word j in document i.
We can set w_{j,i} to 1 when word j exists in document i and 0 when it does not. We can also set this weight to the number of times word j is observed in document i.
Vector Space Model: An Example
Documents:
d1: data mining and social media mining
d2: social network analysis
d3: data mining
Reference vector: (social, media, mining, network, analysis, data)
Vector representation:

     analysis  data  media  mining  network  social
d1   0         1     1      1       0        1
d2   1         0     0      0       1        1
d3   0         1     0      1       0        0
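Not on the original slide: a minimal Python sketch that builds the binary vectors above; the vocabulary ordering follows the table's columns, and "and" is treated as a stop word.

```python
# Binary vector space model for the slide's three example documents.
docs = {
    "d1": "data mining and social media mining",
    "d2": "social network analysis",
    "d3": "data mining",
}
# Vocabulary in the table's column order ("and" dropped as a stop word).
vocab = ["analysis", "data", "media", "mining", "network", "social"]

for name, text in docs.items():
    words = set(text.split())
    vector = [1 if term in words else 0 for term in vocab]
    print(name, vector)
# d1 [0, 1, 1, 1, 0, 1]
# d2 [1, 0, 0, 0, 1, 1]
# d3 [0, 1, 0, 1, 0, 0]
```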
TF-IDF (Term Frequency-Inverse Document Frequency)
The tf-idf weight of term t in document d, given a document corpus D, is calculated as follows:
tf-idf(t, d, D) = tf(t, d) x idf(t, D)
where tf(t, d) is the frequency of term t in document d, and
idf(t, D) = log2( |D| / |{d in D : t in d}| )
with |D| the total number of documents in the corpus and |{d in D : t in d}| the number of documents in which term t appears.
TF-IDF: An Example
Consider the words "apple" and "orange" that appear 10 and 20 times, respectively, in document d1, which contains 100 words. Let |D| = 20, and assume the word "apple" appears only in document d1 while the word "orange" appears in all 20 documents. Then
tf-idf(apple, d1, D) = 10 x log2(20/1) = 43.22
tf-idf(orange, d1, D) = 20 x log2(20/20) = 0
TF-IDF: An Example
Documents:
d1: social media mining
d2: social media data
d3: financial market data

TF values (term counts per document):

     social  media  mining  data  financial  market
d1   1       1      1       0     0          0
d2   1       1      0       1     0          0
d3   0       0      0       1     1          1

TF-IDF values (using idf(t, D) = log2(3 / n_t), where n_t is the number of documents containing t):

     social  media  mining  data  financial  market
d1   0.58    0.58   1.58    0     0          0
d2   0.58    0.58   0       0.58  0          0
d3   0       0      0       0.58  1.58       1.58
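A short sketch (not from the slides) that reproduces the TF-IDF table above, assuming raw term counts for tf and a base-2 logarithm in idf, as in the apple/orange example.

```python
import math

docs = {
    "d1": "social media mining".split(),
    "d2": "social media data".split(),
    "d3": "financial market data".split(),
}
N = len(docs)

def tf(term, doc):
    # Raw count of the term in the document.
    return doc.count(term)

def idf(term):
    # log2 of (number of documents / number of documents containing the term).
    n_t = sum(1 for doc in docs.values() if term in doc)
    return math.log2(N / n_t)

for name, doc in docs.items():
    weights = {t: round(tf(t, doc) * idf(t), 2) for t in doc}
    print(name, weights)
# d1 {'social': 0.58, 'media': 0.58, 'mining': 1.58}
# d2 {'social': 0.58, 'media': 0.58, 'data': 0.58}
# d3 {'financial': 1.58, 'market': 1.58, 'data': 0.58}
```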
Data Quality
When making data ready for data mining algorithms, data quality needs to be assured:
- Noise: noise is the distortion of the data
- Outliers: outliers are data points that are considerably different from other data points in the dataset
- Missing values: missing feature values in data instances. To solve this problem: (1) remove instances that have missing values, (2) estimate the missing values, or (3) ignore the missing values when running the data mining algorithm
- Duplicate data
Data Preprocessing
- Aggregation: performed when multiple features need to be combined into a single one, or when the scale of the features changes. Example: image width, image height -> image area (width x height)
- Discretization: from continuous values to discrete values. Example: money spent -> {low, normal, high}
- Feature selection: choose relevant features
- Feature extraction: creating new features from the original features; often more complicated than aggregation
- Sampling:
  - Random sampling
  - Sampling with or without replacement
  - Stratified sampling: useful when there is class imbalance
  - Social network sampling
Data Preprocessing
Sampling social networks: start with a small set of nodes (seed nodes) and sample (a) the connected components they belong to; (b) the set of nodes (and edges) connected to them directly; or (c) the set of nodes and edges that are within an n-hop distance from them. A sketch of strategy (c) follows.
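To make strategy (c) concrete, here is a small sketch (not from the slides) of n-hop breadth-first sampling over an adjacency-list graph; the graph and seed set are made up for illustration.

```python
from collections import deque

def sample_n_hop(adj, seeds, n):
    """Return the set of nodes within n hops of any seed node."""
    sampled = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        node, depth = frontier.popleft()
        if depth == n:          # do not expand beyond n hops
            continue
        for neighbor in adj.get(node, ()):
            if neighbor not in sampled:
                sampled.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return sampled

# Hypothetical friendship graph as an adjacency list.
adj = {1: [2, 3], 2: [1, 4], 3: [1], 4: [2, 5], 5: [4]}
print(sample_n_hop(adj, seeds={1}, n=2))  # {1, 2, 3, 4}
```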
Data Mining Algorithms
- Supervised learning algorithms
  - Classification (the class attribute is discrete): assign data into predefined classes, e.g., spam detection, fraudulent credit card detection
  - Regression (the class attribute takes real values): predict a real value for a given data instance, e.g., predict the price of a given house
- Unsupervised learning algorithms
  - Group similar items together into clusters, e.g., detect communities in a given social network
Supervised Learning
Classification Example
Learning patterns from labeled data and classifying new data with labels (categories). For example, we want to classify an e-mail as "legitimate" or "spam".
Supervised Learning: The Process
We are given a set of labeled examples. These examples are records/instances of the form (x, y), where x is a vector of feature values and y is the class attribute, commonly a scalar. The supervised learning task is to build a model that maps x to y, i.e., to find a mapping m such that m(x) = y. Given an unlabeled instance (x', ?), we compute m(x'), e.g., spam/non-spam prediction.
A Twitter Example
Supervised Learning Algorithms
Classification:
- Decision tree learning
- Naive Bayes classifier
- k-nearest neighbor classifier
- Classification with network information
Regression:
- Linear regression
- Logistic regression
Decision Tree
A decision tree is learned from the dataset (training data with known classes) and later applied to predict the class attribute value of new data (test data with unknown classes) for which only the feature values are known.
Decision Tree: Example
(Figure: a dataset with class labels and two learned decision trees.)
Decision Tree Construction
Decision trees are constructed recursively from training data using a top-down greedy approach in which features are sequentially selected. After a feature is selected for a node, different branches are created based on its values. The training set is then partitioned into subsets based on the feature values, each of which falls under the respective feature-value branch; the process continues for these subsets at the other nodes.
When selecting features, we prefer features that partition the set of instances into subsets that are more "pure." A pure subset is one in which all instances have the same class attribute value.
Decision Tree Construction
When a pure subset is reached under a branch, the decision tree construction process no longer partitions that subset; it creates a leaf under the branch and assigns the class attribute value of the subset's instances as the leaf's predicted class attribute value.
To measure purity, we can use (and minimize) entropy. Over a subset of training instances T with a binary class attribute (values in {+, -}), the entropy of T is defined as:
entropy(T) = -p+ log2(p+) - p- log2(p-)
where p+ is the proportion of positive examples in T and p- is the proportion of negative examples in T.
Entropy Example
Assume a subset T contains 10 instances, of which seven have a positive class attribute value and three have a negative one: [7+, 3-]. The entropy of subset T is
entropy(T) = -(7/10) log2(7/10) - (3/10) log2(3/10) = 0.881
In a pure subset, all instances have the same class attribute value, and the entropy is 0. If the subset being measured contains an unequal number of positive and negative instances, the entropy is between 0 and 1.
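A quick sketch (not in the slides) that verifies the [7+, 3-] calculation:

```python
import math

def entropy(pos, neg):
    """Entropy of a subset with pos positive and neg negative instances."""
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:                       # 0 * log2(0) is taken to be 0
            result -= p * math.log2(p)
    return result

print(entropy(7, 3))    # ~0.881 (the [7+, 3-] example)
print(entropy(10, 0))   # 0.0 -> pure subset
print(entropy(5, 5))    # 1.0 -> maximally impure
```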
Naive Bayes Classifier
For two random variables X and Y, Bayes' theorem states that
P(Y | X) = P(X | Y) P(Y) / P(X)
where Y is the class variable and X = (x_1, x_2, ..., x_m) holds the instance features. The predicted class attribute value for instance X is then
y = argmax_y P(y | X)
Assuming the features are independent given the class attribute,
y = argmax_y P(y) * product over j of P(x_j | y)
NBC: An Example
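The slide's worked example is an image; as a stand-in, here is a minimal naive Bayes sketch under stated assumptions: it uses only the two nominal features outlook and windy from the sample dataset earlier, with no smoothing.

```python
from collections import Counter

# (outlook, windy, play) rows from the sample dataset slide.
data = [
    ("sunny", False, "no"),    ("sunny", True, "no"),   ("overcast", False, "yes"),
    ("rainy", False, "yes"),   ("rainy", False, "yes"), ("rainy", True, "no"),
    ("overcast", True, "yes"), ("sunny", False, "no"),  ("sunny", False, "yes"),
    ("rainy", False, "yes"),   ("sunny", True, "yes"),  ("overcast", True, "yes"),
    ("overcast", False, "yes"), ("rainy", True, "no"),
]

def predict(outlook, windy):
    class_counts = Counter(play for _, _, play in data)
    scores = {}
    for y, n_y in class_counts.items():
        # P(y) * P(outlook | y) * P(windy | y): features assumed
        # conditionally independent given the class.
        p = n_y / len(data)
        p *= sum(1 for o, _, play in data if play == y and o == outlook) / n_y
        p *= sum(1 for _, w, play in data if play == y and w == windy) / n_y
        scores[y] = p
    return max(scores, key=scores.get), scores

print(predict("sunny", True))  # predicts 'no' (scores: no ~0.129, yes ~0.048)
```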
Nearest Neighbor Classifier
k-nearest neighbor (kNN), as the name suggests, utilizes the neighbors of an instance to perform classification. In particular, it uses the k nearest instances, called neighbors: the instance being classified is assigned the label (class attribute value) held by the majority of its k neighbors. When k = 1, the closest neighbor's label is used as the predicted label for the instance being classified. To determine the neighbors of an instance, we measure its distance to all other instances using some distance metric; commonly, Euclidean distance is employed.
k-NN: Algorithm
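The algorithm slide is an image; here is a compact sketch (not the original pseudocode) using Euclidean distance and majority vote, on made-up 2-D instances:

```python
import math
from collections import Counter

def knn_predict(train, query, k):
    """train: list of (feature_vector, label) pairs; query: feature vector."""
    # Sort training instances by Euclidean distance to the query, keep k nearest.
    neighbors = sorted(train, key=lambda inst: math.dist(inst[0], query))[:k]
    # Majority vote among the k nearest labels.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical 2-D instances with two labels.
train = [((1, 1), "square"), ((1, 2), "square"), ((2, 1), "square"),
         ((5, 5), "triangle"), ((5, 6), "triangle")]
print(knn_predict(train, (2, 2), k=3))  # square
print(knn_predict(train, (5, 5), k=1))  # triangle
```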
k-NN Example
When k = 5, the predicted label is: triangle. When k = 9, the predicted label is: square.
Classification with Network Information
Consider a friendship network on social media and a product being marketed to this network. The product seller wants to know who the potential buyers are for this product. Assume we are given the network along with the list of individuals who have decided to buy or not buy the product. Our goal is to predict the decision of the undecided individuals.
This problem can be formulated as a classification problem based on features gathered from individuals. In this case, however, we have additional friendship information that may be helpful in building better classification models.
Let y_i denote the label of node i. We can assume that the label of a node depends only on the labels of its neighbors N(i):
P(y_i | all other nodes) = P(y_i | N(i))
Weighted-Vote Relational Neighbor (wvRN)
To find the label of a node, we can perform a weighted vote among its neighbors:
P(y_i = 1 | N(i)) = (1 / |N(i)|) * sum over j in N(i) of P(y_j = 1 | N(j))
Note that we need to compute these probabilities in some order until convergence, i.e., until they no longer change.
wvRN Example
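A minimal sketch (not from the slides) of the wvRN update on an unweighted graph, iterating until the probabilities stabilize; the graph and the buy/not-buy labels are made up for illustration.

```python
def wvrn(adj, labels, iterations=50):
    """adj: node -> list of neighbors; labels: node -> 0/1 for labeled nodes."""
    # Labeled nodes keep probability 0 or 1; unlabeled nodes start at 0.5.
    prob = {v: labels.get(v, 0.5) for v in adj}
    for _ in range(iterations):
        for v in adj:
            if v not in labels:  # only update unlabeled nodes
                prob[v] = sum(prob[u] for u in adj[v]) / len(adj[v])
    return prob

# Hypothetical network: nodes 1 and 2 bought (label 1), node 5 did not (label 0).
adj = {1: [2, 3], 2: [1, 3], 3: [1, 2, 4], 4: [3, 5], 5: [4]}
print(wvrn(adj, labels={1: 1, 2: 1, 5: 0}))
# converges to P(y_3 = 1) = 0.8 and P(y_4 = 1) = 0.4
```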
Regression
Regression
In regression, the class attribute takes real values; in classification, class values or labels are categories.
y = f(x), where y is the class attribute (dependent variable), y in R, and x_1, x_2, ..., x_m are the features (regressors).
Regression finds the relation between y and the vector (x_1, x_2, ..., x_m).
Linear Regression
In linear regression, we assume the relation between the class attribute y and the feature vector x is linear:
y = w_1 x_1 + w_2 x_2 + ... + w_m x_m, or in matrix form over the whole dataset, Y = XW
where W represents the vector of regression coefficients. The problem of regression can be solved by estimating W using the provided dataset X and the labels Y; least squares is often used to solve this problem.
Solving Linear Regression Problems
The problem of regression can be solved by estimating W using the dataset X provided and the labels Y. Least squares is a popular method for solving regression problems.
Least Squares
Find W that minimizes ||Y - XW||^2 for regressors X and labels Y.
Least Squares
Setting the gradient of ||Y - XW||^2 with respect to W to zero yields the closed-form solution W = (X^T X)^{-1} X^T Y, assuming X^T X is invertible.
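A sketch (not from the slides) solving the least-squares problem with NumPy on synthetic data, both via the normal equations above and via the library solver:

```python
import numpy as np

# Synthetic data: y = 3 + 2*x1 - x2 plus noise; a column of ones for the intercept.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=(100, 2))])
Y = X @ np.array([3.0, 2.0, -1.0]) + rng.normal(scale=0.1, size=100)

# Closed form via the normal equations: W = (X^T X)^{-1} X^T Y.
W_normal = np.linalg.inv(X.T @ X) @ X.T @ Y

# Numerically preferable in practice: lstsq minimizes ||Y - XW||^2 directly.
W_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)

print(W_normal)  # close to [3, 2, -1]
print(W_lstsq)   # same solution
```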
Evaluating Supervised Learning
To evaluate, we use a training-testing framework: a training dataset (i.e., one where the labels are known) is used to train a model, and the model is evaluated on a test dataset. Since the correct labels of the test dataset are unknown, in practice the training set is divided into two parts, one used for training and the other used for testing. When testing, the labels from this test set are removed (masked); after these labels are predicted using the model, the predicted labels are compared with the masked labels (the ground truth).
Evaluating Supervised Learning
Dividing the training set into train/test sets (see the sketch below):
- Divide the training set into k equally sized partitions, or folds, then use all folds but one for training and the one left out for testing. This technique is called leave-one-out training.
- Divide the training set into k equally sized folds and run the algorithm k times: in round i, use all folds except fold i for training and fold i for testing. The average performance of the algorithm over the k rounds measures its performance. This robust technique is known as k-fold cross-validation.
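A minimal sketch (not from the slides) of generating the k train/test index splits for k-fold cross-validation:

```python
def k_fold_splits(n_instances, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation."""
    indices = list(range(n_instances))
    fold_size = n_instances // k
    for i in range(k):
        # Fold i is held out for testing; the remaining folds form the train set.
        start = i * fold_size
        end = start + fold_size if i < k - 1 else n_instances
        test = indices[start:end]
        train = indices[:start] + indices[end:]
        yield train, test

for train, test in k_fold_splits(10, k=5):
    print("train:", train, "test:", test)
```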
Evaluating Supervised Learning
As the class labels are discrete, we can measure accuracy by dividing the number of correctly predicted labels (C) by the total number of instances (N):
Accuracy = C / N
Error rate = 1 - Accuracy
More sophisticated evaluation approaches will be discussed later.
Evaluating Regression Performance
The labels cannot be predicted precisely, so we need to set a margin within which predictions are accepted or rejected. For example, when the observed temperature is 71, any prediction in the range 71 +/- 0.5 can be considered a correct prediction.
Unsupervised Learning
Unsupervised Learning
Unsupervised learning is the division of instances into groups of similar objects; clustering is a form of unsupervised learning. Clustering algorithms do not have examples showing how the samples should be grouped together (the data is unlabeled); they group similar items together.
Measuring Distance/Similarity in Clustering Algorithms
The goal of clustering is to group together similar items. Instances are put into different clusters based on their distance to other instances, so any clustering algorithm requires a distance measure. The most popular (dis)similarity measures for continuous features are Euclidean distance and Pearson linear correlation.
Similarity Measures: More Definitions
Once a distance measure is selected, instances are grouped using it.
Clustering
Clusters are usually represented by compact and abstract notations; cluster centroids are one common example of such an abstract notation.
Partitional algorithms partition the dataset into a set of clusters: each instance is assigned to exactly one cluster, and no instance remains unassigned. Example: k-means.
k-means for k = 6
(Figure: an example k-means clustering with six clusters.)
k-means
k-means is the most commonly used clustering algorithm and is based on the idea of Expectation Maximization in statistics. The procedure: (1) select k initial centroids at random; (2) assign each instance to the cluster with the nearest centroid; (3) recompute each centroid as the mean of the instances assigned to it; repeat steps 2 and 3.
When Do We Stop?
This procedure is repeated until convergence. The most common criterion for convergence is that the centroids no longer change, which is equivalent to the clustering assignments of the data instances stabilizing. In practice, the algorithm's execution can be stopped when the Euclidean distance between the centroids in two consecutive iterations is bounded above by some small positive value.
k-means
Alternatively, k-means implementations try to minimize an objective function. A well-known objective function in these implementations is the squared distance error:
sum over clusters i of ( sum over j = 1 to n(i) of ||x_j^(i) - c_i||^2 )
where x_j^(i) is the jth instance of cluster i, n(i) is the number of instances in cluster i, and c_i is the centroid of cluster i. The process stops when the difference between the objective function values of two consecutive iterations of the k-means algorithm is bounded by some small value.
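A compact k-means sketch (not from the slides) implementing the assign/recompute loop, stopping when the centroids no longer change; the 2-D points are made up for illustration.

```python
import math
import random

def kmeans(points, k, max_iters=100):
    centroids = random.sample(points, k)            # step 1: random initial centroids
    clusters = []
    for _ in range(max_iters):
        # Step 2: assign each instance to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            tuple(sum(c) / len(c) for c in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:              # converged: centroids unchanged
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
print(kmeans(points, k=2)[0])  # centroids near (1.33, 1.33) and (8.33, 8.33)
```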
k-means
Finding the global optimum of the k partitions is computationally expensive (NP-hard); this is equivalent to finding the optimal centroids that minimize the objective function. However, there are efficient heuristic algorithms that are commonly employed and converge quickly to an optimum that might not be global. A common practice is running k-means multiple times and selecting the clustering assignment that is observed most often or is most desirable based on an objective function, such as the squared error.
Evaluating the Clusterings
When we are given objects of two different kinds, the perfect clustering would be one in which objects of the same type are clustered together. Evaluation can be done with or without ground truth.
Evaluation with Ground Truth
When ground truth is available, the evaluator has prior knowledge of what the clustering should be; that is, we know the correct clustering assignments. We will discuss these methods in the community analysis chapter.
Evaluation without Ground Truth
- Cohesiveness: we are interested in clusters that exhibit cohesiveness; in cohesive clusters, the instances inside the clusters are close to each other.
- Separateness: we are also interested in clusterings of the data that generate clusters that are well separated from one another.
Cohesiveness
In statistical terms, cohesiveness is equivalent to having a small standard deviation, i.e., instances being close to the mean value. In clustering, this translates to instances being close to the centroid of their cluster.
Separateness
In statistics, separateness can be measured by standard deviation: standard deviation is maximized when instances are far from the mean. In clustering terms, this is equivalent to cluster centroids being far from the mean of the entire dataset.
Separateness Example
In general, we are interested in clusters that are both cohesive and separate -> the silhouette index.
Silhouette Index
The silhouette index combines both cohesiveness and separateness. It compares the average distance between instances in the same cluster with the average distance between instances in different clusters. In a well-clustered dataset, the average distance between instances in the same cluster is small (cohesiveness) and the average distance between instances in different clusters is large (separateness).
Silhouette Index
For any instance x that is a member of cluster C:
- Compute a(x), the within-cluster average distance, i.e., the average distance between x and the other members of C.
- Compute b(x), the average distance between x and the instances of the cluster G that is closest to x in terms of the average distance between x and the members of G.
The silhouette of x is then
s(x) = (b(x) - a(x)) / max(a(x), b(x))
Silhouette Index
Clearly, we are interested in clusterings where a(x) < b(x). The silhouette can take values in [-1, 1]; the best case happens when, for all x, a(x) = 0 and b(x) > a(x), giving s(x) = 1.
Silhouette Index: An Example
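The example slide is an image; as a stand-in, here is a small sketch (not from the slides) computing s(x) for one instance of a hypothetical two-cluster assignment:

```python
import math

def silhouette(x, own_cluster, other_clusters):
    """Silhouette s(x) = (b - a) / max(a, b) for a single instance x."""
    # a(x): average distance to the other members of x's own cluster.
    a = sum(math.dist(x, y) for y in own_cluster if y != x) / (len(own_cluster) - 1)
    # b(x): smallest average distance to the members of any other cluster.
    b = min(sum(math.dist(x, y) for y in c) / len(c) for c in other_clusters)
    return (b - a) / max(a, b)

c1 = [(1, 1), (1, 2), (2, 1)]     # hypothetical cluster containing x
c2 = [(8, 8), (9, 8)]             # the other cluster
print(silhouette((1, 1), c1, [c2]))  # ~0.9, close to 1 -> well clustered
```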