Chapter ML:XI (continued)


1 Chapter ML:XI (continued)

XI. Cluster Analysis: Data Mining Overview, Cluster Analysis Basics, Hierarchical Cluster Analysis, Iterative Cluster Analysis, Density-Based Cluster Analysis, Cluster Evaluation, Constrained Cluster Analysis

ML:XI-274 Cluster Analysis STEIN

2–3 Person Resolution Task

4 Person Resolution Task (figure: documents annotated with the target name Michael Jordan along with other names)

5 Person Resolution Task (figure: the documents for the target name Michael Jordan are to be split among referent 1, ..., referent r, e.g. the basketball player and the statistician, each with their own other names)

6 Person Resolution Task

Multi-document resolution task:
- Names: N = {n_1, ..., n_l}, target names T ⊆ N
- Referents: R = {r_1, ..., r_m}, τ : R → T, |R| ≥ |T|
- Documents: D = {d_1, ..., d_n}, ν : D → P(N), |ν(d_i) ∩ T| = 1
- A solution: γ : D → R, such that τ(γ(d_i)) ∈ ν(d_i)
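The formalization above can be sketched in a few lines of Python. This is a hypothetical illustration: all names, referents, and documents are invented, and the check only verifies the stated constraint that each document is assigned a referent whose target name actually occurs in it.

```python
# Hypothetical instance of the multi-document resolution formalization.
# Target names T, referent-to-target mapping tau, document-to-names mapping nu.
T = {"Michael Jordan"}
tau = {"mj_basketball": "Michael Jordan", "mj_statistician": "Michael Jordan"}
nu = {
    "doc1": {"Michael Jordan", "Chicago Bulls"},
    "doc2": {"Michael Jordan", "Berkeley"},
}

# A candidate solution gamma: document -> referent
gamma = {"doc1": "mj_basketball", "doc2": "mj_statistician"}

def is_valid_solution(gamma, tau, nu):
    """A solution must map each document d to a referent whose
    target name occurs in d: tau(gamma(d)) in nu(d)."""
    return all(tau[gamma[d]] in nu[d] for d in nu)

print(is_valid_solution(gamma, tau, nu))  # True
```

Note that the constraint alone does not decide which referent is correct; it only rules out assignments to referents whose target name is absent from the document.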

7 Person Resolution Task

Facts about the Spock data mining challenge:
- Target names: |T| = 44
- Referents: |R| =
- Documents: |D_train| = (labeled, 2.3 GB); |D_test| = (unlabeled, 7.8 GB)

8 Person Resolution Task

Referents per target name: up to 105 referents for a single target name; about 25 referents on average per target name.
Documents per referent: about 23 documents on average per referent.

9–14 Applied to Multi-Document Resolution

(Figure: documents grouped into referent 1 and referent 2.)

1. Model similarities (new and established retrieval models): global and context-based vector space models, explicit semantic analysis, ontology alignment
2. Learn class memberships (supervised): logistic regression
3. Find equivalence classes (unsupervised) via cluster analysis: (a) adaptive graph thinning, (b) multiple, density-based cluster analysis, (c) clustering selection by expected density maximization
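The idea behind step 3 can be sketched as follows. This is a deliberately minimal stand-in, not the deck's adaptive graph thinning or expected-density clustering selection: it thresholds the pairwise similarity graph and takes connected components as equivalence classes (referents). All documents, similarities, and the threshold are invented.

```python
from collections import defaultdict

def cluster_by_threshold(docs, sim, theta):
    """Keep edges with similarity >= theta and return the connected
    components of the remaining graph as clusters."""
    adj = defaultdict(set)
    for (a, b), s in sim.items():
        if s >= theta:
            adj[a].add(b)
            adj[b].add(a)
    seen, clusters = set(), []
    for d in docs:
        if d in seen:
            continue
        comp, stack = set(), [d]       # depth-first traversal of one component
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            stack.extend(adj[u] - comp)
        seen |= comp
        clusters.append(sorted(comp))
    return clusters

docs = ["d1", "d2", "d3", "d4"]
sim = {("d1", "d2"): 0.90, ("d2", "d3"): 0.10, ("d3", "d4"): 0.80}
print(cluster_by_threshold(docs, sim, 0.725))  # [['d1', 'd2'], ['d3', 'd4']]
```

Lowering the threshold merges components, which is why the choice of θ (discussed later in the deck) matters so much.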

15 Idealized Class-Membership Distribution over Similarities

(Figure: idealized similarity histograms over the intervals <0.05, <0.10, ..., <1.00 for document pairs from different referents and from the same referent.)

Logistic regression task:
- sample size:
- class imbalance: non-target class : target class = 25:1
- items are drawn uniformly distributed wrt. non-targets and targets
- items are uniformly distributed over the groups of target names
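The supervised step can be sketched with a plain logistic regression trained by gradient descent on pair similarities. This is a hedged illustration, not the deck's setup: the data are synthetic, a single similarity feature is used, and the hyperparameters (learning rate, epochs) are invented.

```python
import math
import random

def train_logreg(xs, ys, lr=0.5, epochs=2000):
    """Fit w, b for P(same referent | similarity x) = sigmoid(w*x + b)
    by full-batch gradient descent on the logistic loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        gw = gb = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            gw += (p - y) * x
            gb += (p - y)
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

random.seed(0)
# Synthetic pair similarities: same-referent pairs tend to be higher.
xs = [random.uniform(0.5, 1.0) for _ in range(40)] + \
     [random.uniform(0.0, 0.5) for _ in range(40)]
ys = [1] * 40 + [0] * 40
w, b = train_logreg(xs, ys)

def predict(x):
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

print(predict(0.9) > 0.5, predict(0.1) < 0.5)  # True True
```

In the real task the 25:1 class imbalance would additionally call for reweighting or stratified sampling, which this sketch omits.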

16 Membership Distribution under tf·idf Vector Space Model

(Figure: similarity histograms, 0–20%, over the intervals <0.05 to <1.00, for different vs. same referents.)

Model details:
- corpus size: documents
- dictionary size: 1.2 million terms
- stopword number: 850
- stopword volume: 36%
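A tf·idf vector space model with cosine similarity can be sketched on a toy corpus. The documents and the tiny stopword list are invented; the real model uses a dictionary of about 1.2 million terms and 850 stopwords.

```python
import math
from collections import Counter

docs = [
    "jordan played basketball for the bulls",
    "jordan won titles playing basketball",
    "jordan works on statistics and machine learning",
]
stopwords = {"the", "for", "on", "and"}

tokenized = [[t for t in d.split() if t not in stopwords] for d in docs]
n = len(tokenized)
# Document frequency of each term (count each term once per document).
df = Counter(t for doc in tokenized for t in set(doc))

def tfidf(doc):
    """Raw term frequency weighted by log inverse document frequency."""
    tf = Counter(doc)
    return {t: tf[t] * math.log(n / df[t]) for t in tf}

def cosine(u, v):
    num = sum(u[t] * v.get(t, 0.0) for t in u)
    den = math.sqrt(sum(x * x for x in u.values())) * \
          math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0

vecs = [tfidf(d) for d in tokenized]
# The two basketball documents are more similar than basketball vs. statistics.
print(cosine(vecs[0], vecs[1]) > cosine(vecs[0], vecs[2]))  # True
```

Note how the shared term "jordan" receives idf 0 because it occurs in every document, so it contributes nothing to any similarity; this is exactly why the target name alone cannot separate referents.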

17 Membership Distribution under Context-Based Vector Space Model

(Figure: similarity histograms, 0–20%, over the intervals <0.05 to <1.00, for different vs. same referents.)

Model details:
- corpus size: documents
- dictionary size: 1.2 million terms
- stopword number: 850
- stopword volume: 36%

18 Membership Distribution under Ontology Alignment Model

(Figure: similarity histograms, 0–60%, for different vs. same referents.)

Model details:
- DMOZ open directory project: > 5 million documents
- 12 top-level categories (top level: Arts, Business, Computers, Games, ..., World)
- 31 second-level categories (second level: Algorithms, AI, Virtual Reality)
- ML: hierarchical Bayes (top classifier)
- training set: pages

19 In-Depth: Multi-Class Hierarchical Classification

Flat (big-bang) classification:
+ simple realization
- loss of discriminative power with an increasing number of categories

Hierarchical (top-down) classification:
+ specialized classifiers (divide and conquer)
- misclassifications at higher levels can never be repaired

20 In-Depth: Multi-Class Hierarchical Classification

State of the art of effectiveness analyses:
1. independence assumption between categories
2. neglect of both the hierarchical structure and the degree of misclassification

Improvements:
(a) Consider the similarity φ(C_i, C_j) between the correct and the wrong category.
(b) Consider the graph distance d(C_i, C_j) between the correct and the wrong category.

21 In-Depth: Multi-Class Hierarchical Classification

Improvements continued:

Multi-label (multi-path) classification:
- traverse more than one path and return all labels
- employ probabilistic classifiers with a threshold (p > α, p > β): split a path or not

Multi-classifier (ensemble) classification:
- the classification result is a majority decision
- employ different classifiers (different types or differently parameterized)
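The multi-path traversal can be sketched as a recursive top-down walk: at each inner node, follow every child whose probability exceeds the threshold α, so a path is split whenever several children qualify. The category tree, the per-node probabilities, and α below are invented for illustration.

```python
def classify_topdown(tree, probs, node, alpha):
    """tree: node -> list of children; probs: node -> P(category | document).
    Returns all leaf labels reached on paths where every step has p > alpha."""
    children = tree.get(node, [])
    if not children:           # leaf reached: this node is a returned label
        return [node]
    labels = []
    for c in children:
        if probs.get(c, 0.0) > alpha:
            labels += classify_topdown(tree, probs, c, alpha)
    return labels

# Hypothetical two-level category tree in the style of the DMOZ top levels.
tree = {"root": ["Computers", "Sports"],
        "Computers": ["AI", "Virtual Reality"]}
probs = {"Computers": 0.7, "Sports": 0.4, "AI": 0.8, "Virtual Reality": 0.2}
print(classify_topdown(tree, probs, "root", 0.3))  # ['AI', 'Sports']
```

Raising α prunes paths: with α = 0.5 only the Computers → AI path survives in this example, which shows how the threshold trades recall of labels against precision.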

22–26 Membership Distribution under Optimized Retrieval Model Combination

(Figure: similarity histograms, 0–20%, for different vs. same referents; subsequent slides add the resulting clusters for referent 1, referent 2, ..., referent m.)

Retrieval Model               F_1/3-Measure
tf·idf vector space           0.39
context-based vector space    0.32
ESA Wikipedia persons         0.30
phrase structure grammar      0.17
ontology alignment            0.15
optimized combination         0.42

In the example: precision = , recall = 0.43, F_1/3 = 0.41 (if false negatives are uniformly distributed).
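The F_1/3 measure weights precision over recall. One common parameterization, assumed here to match the threshold formula later in the deck, is F_α = (1 + α) · P · R / (α · P + R). The precision value below is hypothetical (the slide's value was not preserved); the recall of 0.43 is the slide's.

```python
def f_alpha(precision, recall, alpha=1/3):
    """Precision-weighted F measure; alpha < 1 emphasizes precision
    (assumed parameterization: F_alpha = (1+a)*P*R / (a*P + R))."""
    if precision == 0 and recall == 0:
        return 0.0
    return (1 + alpha) * precision * recall / (alpha * precision + recall)

# Hypothetical precision of 0.40 with the slide's recall of 0.43.
print(round(f_alpha(0.40, 0.43), 2))  # 0.41
```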

27 In-Depth: Analysis of Classifier Effectiveness

(Figures: the similarity histograms for different vs. same referents, and a second plot considering the class imbalance, per similarity interval.)

28 In-Depth: Analysis of Classifier Effectiveness

Class imbalance factor (CIF) of 25; precision in the interval [0.725; 1] for edges between same referents:

How can F_1/3 = 0.42 be achieved via cluster analysis?

29–35 In-Depth: Analysis of Classifier Effectiveness

Assumption: uniform distribution of referents over documents (here: 25 clusters with |C| = 23).
- TP: true 1-similarities per cluster (here: threshold 0.725)
- TP/|C|: degree of true positives per node (here: 11)
- TP · (1/precision - 1): false 1-similarities per cluster (here: 760)

Density-based cluster analysis:
- effective false positives, FP, connect to the same cluster
- analyze P(|FP| > k | D, R iid) (here: E(|FP|) = 2.7)
- the edge tie factor (ETF) specifies the excess of true positives until tie (here: ): ETF = (TP/|C|) / E(|FP|)
- effective precision = precision · ETF / CIF

(Figure: absolute precision vs. effective precision per similarity interval, for different and same referents.)

Determine the optimum similarity threshold for the class-membership function:

θ* = argmax_{θ ∈ [0;1]} ((1 + α) · (ETF/CIF) · precision_θ · recall_θ) / (α · (ETF/CIF) · precision_θ + recall_θ)

θ* considers covariate shift, model formation bias, and sample selection bias.
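The effectiveness analysis above can be sketched numerically. This is a hedged reconstruction: the ETF and effective-precision formulas are reassembled from the garbled slide, the worked numbers (TP degree 11, E(|FP|) = 2.7, CIF = 25) are the slide's running example, and the precision/recall curves over candidate thresholds are entirely invented.

```python
def edge_tie_factor(tp_degree, expected_fp):
    """Excess of true-positive edges per node over expected false positives
    (reconstructed as ETF = (TP/|C|) / E(|FP|))."""
    return tp_degree / expected_fp

def effective_precision(precision, etf, cif):
    """Reconstructed adjustment: discount by the class imbalance factor,
    recover by the edge tie factor."""
    return precision * etf / cif

def f_alpha(p, r, alpha=1/3):
    return (1 + alpha) * p * r / (alpha * p + r) if (p or r) else 0.0

etf = edge_tie_factor(11, 2.7)   # slide's running example
print(round(etf, 1))             # 4.1

# Invented precision/recall pairs per candidate threshold theta.
curves = {0.5: (0.20, 0.80), 0.6: (0.30, 0.60),
          0.725: (0.45, 0.40), 0.8: (0.60, 0.20)}
theta_star = max(curves, key=lambda t: f_alpha(
    effective_precision(curves[t][0], etf, 25), curves[t][1]))
print(theta_star)  # 0.8
```

Because the effective precision is heavily discounted by the imbalance, the optimum shifts toward higher, more precise thresholds than a plain F_α over raw precision would pick.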

36 Model Selection: Our Risk Minimization Strategy

Retrieval Model               F_1/3-Measure
tf·idf vector space           0.39
context-based vector space    0.32
ESA Wikipedia persons         0.30
phrase structure grammar      0.17
ontology alignment            0.15
optimized combination         0.42
ensemble cluster analysis     0.40

Ensemble cluster analysis: higher bias, better generalization.
(1) Do we speculate on a better fit for D_test?
(2) Do we expect a significant covariate shift, more noise, etc. in D_test?
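The deck does not fix the exact ensemble procedure, so here is a sketch of one standard way to combine multiple clusterings: count, for every document pair, how many clusterings put them together (a co-association vote), keep only majority pairs, and take connected components. All clusterings below are invented.

```python
from collections import defaultdict
from itertools import combinations

def ensemble_cluster(clusterings, items):
    """Majority-vote consensus: keep a pair edge iff more than half of the
    input clusterings place the pair in the same cluster, then return the
    connected components of the kept edges."""
    votes = defaultdict(int)
    for clustering in clusterings:
        for cluster in clustering:
            for a, b in combinations(sorted(cluster), 2):
                votes[(a, b)] += 1
    majority = len(clusterings) / 2
    adj = defaultdict(set)
    for (a, b), v in votes.items():
        if v > majority:
            adj[a].add(b)
            adj[b].add(a)
    seen, result = set(), []
    for i in items:
        if i in seen:
            continue
        comp, stack = set(), [i]
        while stack:
            u = stack.pop()
            if u not in comp:
                comp.add(u)
                stack.extend(adj[u] - comp)
        seen |= comp
        result.append(sorted(comp))
    return result

clusterings = [
    [{"d1", "d2"}, {"d3", "d4"}],
    [{"d1", "d2", "d3"}, {"d4"}],
    [{"d1", "d2"}, {"d3"}, {"d4"}],
]
print(ensemble_cluster(clusterings, ["d1", "d2", "d3", "d4"]))
# [['d1', 'd2'], ['d3'], ['d4']]
```

Requiring a majority makes the consensus conservative: only the pair (d1, d2), supported by all three clusterings, survives, which illustrates the higher bias / better generalization trade-off stated above.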

37 Recap

1. Multi-document resolution can be tackled with constrained cluster analysis.
2. Constraints are derived from labeled examples.
3. The class-membership function ties the constraints to multiple retrieval models.
4. Advanced density-based clustering technology is key.

38 References

R. Bekkerman, A. McCallum. Disambiguating Web Appearances of People in a Social Network. WWW 2005.
H. Daumé III, D. Marcu. A Bayesian Model for Supervised Clustering with the Dirichlet Process Prior. Journal of Machine Learning Research, 2005.
E. Gabrilovich, S. Markovitch. Computing Semantic Relatedness Using Wikipedia-based Explicit Semantic Analysis. IJCAI 2007.
T. Pedersen, A. Kulkarni. Unsupervised Discrimination of Person Names in Web Contexts. CICLing 2007.
S. Meyer zu Eissen. On Information Need and Categorizing Search. Dissertation, Paderborn University, 2007.
B. Stein, S. Meyer zu Eissen. Weighted Experts: A Solution for the Spock Data Mining Challenge. I-KNOW 2008.
L. Jiang, W. Shen, J. Wang, N. An. GRAPE: A System for Disambiguating and Tagging People Names in Web Search. WWW 2010.


More information

CLASSIFICATION AND CLUSTERING. Anveshi Charuvaka

CLASSIFICATION AND CLUSTERING. Anveshi Charuvaka CLASSIFICATION AND CLUSTERING Anveshi Charuvaka Learning from Data Classification Regression Clustering Anomaly Detection Contrast Set Mining Classification: Definition Given a collection of records (training

More information

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus 1. Introduction Facebook is a social networking website with an open platform that enables developers to extract and utilize user information

More information

Building a Question Classifier for a TREC-Style Question Answering System

Building a Question Classifier for a TREC-Style Question Answering System Building a Question Classifier for a TREC-Style Question Answering System Richard May & Ari Steinberg Topic: Question Classification We define Question Classification (QC) here to be the task that, given

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

More information

MACHINE LEARNING IN HIGH ENERGY PHYSICS

MACHINE LEARNING IN HIGH ENERGY PHYSICS MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!

More information

8. Machine Learning Applied Artificial Intelligence

8. Machine Learning Applied Artificial Intelligence 8. Machine Learning Applied Artificial Intelligence Prof. Dr. Bernhard Humm Faculty of Computer Science Hochschule Darmstadt University of Applied Sciences 1 Retrospective Natural Language Processing Name

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information

LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu 10-30-2014

LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING. ----Changsheng Liu 10-30-2014 LABEL PROPAGATION ON GRAPHS. SEMI-SUPERVISED LEARNING ----Changsheng Liu 10-30-2014 Agenda Semi Supervised Learning Topics in Semi Supervised Learning Label Propagation Local and global consistency Graph

More information

A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering

A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering A Two-Pass Statistical Approach for Automatic Personalized Spam Filtering Khurum Nazir Junejo, Mirza Muhammad Yousaf, and Asim Karim Dept. of Computer Science, Lahore University of Management Sciences

More information

Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer

Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer Machine Learning Chapter 18, 21 Some material adopted from notes by Chuck Dyer What is learning? Learning denotes changes in a system that... enable a system to do the same task more efficiently the next

More information

Chapter 6. The stacking ensemble approach

Chapter 6. The stacking ensemble approach 82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

More information

Unsupervised Learning: Clustering with DBSCAN Mat Kallada

Unsupervised Learning: Clustering with DBSCAN Mat Kallada Unsupervised Learning: Clustering with DBSCAN Mat Kallada STAT 2450 - Introduction to Data Mining Supervised Data Mining: Predicting a column called the label The domain of data mining focused on prediction:

More information

A Review of Performance Evaluation Measures for Hierarchical Classifiers

A Review of Performance Evaluation Measures for Hierarchical Classifiers A Review of Performance Evaluation Measures for Hierarchical Classifiers Eduardo P. Costa Depto. Ciências de Computação ICMC/USP - São Carlos Caixa Postal 668, 13560-970, São Carlos-SP, Brazil Ana C. Lorena

More information

CS 2750 Machine Learning. Lecture 1. Machine Learning. CS 2750 Machine Learning.

CS 2750 Machine Learning. Lecture 1. Machine Learning.  CS 2750 Machine Learning. Lecture 1 Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x-5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

More information

A SURVEY OF TEXT CLASSIFICATION ALGORITHMS

A SURVEY OF TEXT CLASSIFICATION ALGORITHMS Chapter 6 A SURVEY OF TEXT CLASSIFICATION ALGORITHMS Charu C. Aggarwal IBM T. J. Watson Research Center Yorktown Heights, NY charu@us.ibm.com ChengXiang Zhai University of Illinois at Urbana-Champaign

More information

Chapter 12 Discovering New Knowledge Data Mining

Chapter 12 Discovering New Knowledge Data Mining Chapter 12 Discovering New Knowledge Data Mining Becerra-Fernandez, et al. -- Knowledge Management 1/e -- 2004 Prentice Hall Additional material 2007 Dekai Wu Chapter Objectives Introduce the student to

More information

Machine Learning and Pattern Recognition Logistic Regression

Machine Learning and Pattern Recognition Logistic Regression Machine Learning and Pattern Recognition Logistic Regression Course Lecturer:Amos J Storkey Institute for Adaptive and Neural Computation School of Informatics University of Edinburgh Crichton Street,

More information

Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup

Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup Network Anomaly Detection A Machine Learning Perspective Dhruba Kumar Bhattacharyya Jugal Kumar KaKta»C) CRC Press J Taylor & Francis Croup Boca Raton London New York CRC Press is an imprint of the Taylor

More information

Christfried Webers. Canberra February June 2015

Christfried Webers. Canberra February June 2015 c Statistical Group and College of Engineering and Computer Science Canberra February June (Many figures from C. M. Bishop, "Pattern Recognition and ") 1of 829 c Part VIII Linear Classification 2 Logistic

More information

Evaluation & Validation: Credibility: Evaluating what has been learned

Evaluation & Validation: Credibility: Evaluating what has been learned Evaluation & Validation: Credibility: Evaluating what has been learned How predictive is a learned model? How can we evaluate a model Test the model Statistical tests Considerations in evaluating a Model

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 5, Sep-Oct 2015

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 5, Sep-Oct 2015 RESEARCH ARTICLE Multi Document Utility Presentation Using Sentiment Analysis Mayur S. Dhote [1], Prof. S. S. Sonawane [2] Department of Computer Science and Engineering PICT, Savitribai Phule Pune University

More information

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Fabian Grüning Carl von Ossietzky Universität Oldenburg, Germany, fabian.gruening@informatik.uni-oldenburg.de Abstract: Independent

More information

Robert Collins CSE598G. More on Mean-shift. R.Collins, CSE, PSU CSE598G Spring 2006

Robert Collins CSE598G. More on Mean-shift. R.Collins, CSE, PSU CSE598G Spring 2006 More on Mean-shift R.Collins, CSE, PSU Spring 2006 Recall: Kernel Density Estimation Given a set of data samples x i ; i=1...n Convolve with a kernel function H to generate a smooth function f(x) Equivalent

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Sheeba J.I1, Vivekanandan K2

Sheeba J.I1, Vivekanandan K2 IMPROVED UNSUPERVISED FRAMEWORK FOR SOLVING SYNONYM, HOMONYM, HYPONYMY & POLYSEMY PROBLEMS FROM EXTRACTED KEYWORDS AND IDENTIFY TOPICS IN MEETING TRANSCRIPTS Sheeba J.I1, Vivekanandan K2 1 Assistant Professor,sheeba@pec.edu

More information

Machine Learning CS 6830. Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu

Machine Learning CS 6830. Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Machine Learning CS 6830 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu What is Learning? Merriam-Webster: learn = to acquire knowledge, understanding, or skill

More information

Statistical Validation and Data Analytics in ediscovery. Jesse Kornblum

Statistical Validation and Data Analytics in ediscovery. Jesse Kornblum Statistical Validation and Data Analytics in ediscovery Jesse Kornblum Administrivia Silence your mobile Interactive talk Please ask questions 2 Outline Introduction Big Questions What Makes Things Similar?

More information

Introduction to Support Vector Machines. Colin Campbell, Bristol University

Introduction to Support Vector Machines. Colin Campbell, Bristol University Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.

More information

Data mining knowledge representation

Data mining knowledge representation Data mining knowledge representation 1 What Defines a Data Mining Task? Task relevant data: where and how to retrieve the data to be used for mining Background knowledge: Concept hierarchies Interestingness

More information

The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2

The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 2nd International Conference on Advances in Mechanical Engineering and Industrial Informatics (AMEII 2016) The multilayer sentiment analysis model based on Random forest Wei Liu1, Jie Zhang2 1 School of

More information

Open Domain Information Extraction. Günter Neumann, DFKI, 2012

Open Domain Information Extraction. Günter Neumann, DFKI, 2012 Open Domain Information Extraction Günter Neumann, DFKI, 2012 Improving TextRunner Wu and Weld (2010) Open Information Extraction using Wikipedia, ACL 2010 Fader et al. (2011) Identifying Relations for

More information

Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov

Search and Data Mining: Techniques. Text Mining Anya Yarygina Boris Novikov Search and Data Mining: Techniques Text Mining Anya Yarygina Boris Novikov Introduction Generally used to denote any system that analyzes large quantities of natural language text and detects lexical or

More information

Machine Learning. Mausam (based on slides by Tom Mitchell, Oren Etzioni and Pedro Domingos)

Machine Learning. Mausam (based on slides by Tom Mitchell, Oren Etzioni and Pedro Domingos) Machine Learning Mausam (based on slides by Tom Mitchell, Oren Etzioni and Pedro Domingos) What Is Machine Learning? A computer program is said to learn from experience E with respect to some class of

More information

Web Mining. Margherita Berardi LACAM. Dipartimento di Informatica Università degli Studi di Bari berardi@di.uniba.it

Web Mining. Margherita Berardi LACAM. Dipartimento di Informatica Università degli Studi di Bari berardi@di.uniba.it Web Mining Margherita Berardi LACAM Dipartimento di Informatica Università degli Studi di Bari berardi@di.uniba.it Bari, 24 Aprile 2003 Overview Introduction Knowledge discovery from text (Web Content

More information

TDWI Best Practice BI & DW Predictive Analytics & Data Mining

TDWI Best Practice BI & DW Predictive Analytics & Data Mining TDWI Best Practice BI & DW Predictive Analytics & Data Mining Course Length : 9am to 5pm, 2 consecutive days 2012 Dates : Sydney: July 30 & 31 Melbourne: August 2 & 3 Canberra: August 6 & 7 Venue & Cost

More information

Data Mining with SQL Server Data Tools

Data Mining with SQL Server Data Tools Data Mining with SQL Server Data Tools Data mining tasks include classification (directed/supervised) models as well as (undirected/unsupervised) models of association analysis and clustering. 1 Data Mining

More information

An Introduction to Data Mining

An Introduction to Data Mining An Introduction to Intel Beijing wei.heng@intel.com January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail

More information

C19 Machine Learning

C19 Machine Learning C9 Machine Learning 8 Lectures Hilary Term 25 2 Tutorial Sheets A. Zisserman Overview: Supervised classification perceptron, support vector machine, loss functions, kernels, random forests, neural networks

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining 1 Why Data Mining? Explosive Growth of Data Data collection and data availability Automated data collection tools, Internet, smartphones, Major sources of abundant data Business:

More information

Constrained Classification of Large Imbalanced Data by Logistic Regression and Genetic Algorithm

Constrained Classification of Large Imbalanced Data by Logistic Regression and Genetic Algorithm Constrained Classification of Large Imbalanced Data by Logistic Regression and Genetic Algorithm Martin Hlosta, Rostislav Stríž, Jan Kupčík, Jaroslav Zendulka, and Tomáš Hruška A. Imbalanced Data Classification

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes

More information

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal Learning Example Chapter 18: Learning from Examples 22c:145 An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients. A decision is needed: whether

More information

COPYRIGHTED MATERIAL. Contents. List of Figures. Acknowledgments

COPYRIGHTED MATERIAL. Contents. List of Figures. Acknowledgments Contents List of Figures Foreword Preface xxv xxiii xv Acknowledgments xxix Chapter 1 Fraud: Detection, Prevention, and Analytics! 1 Introduction 2 Fraud! 2 Fraud Detection and Prevention 10 Big Data for

More information

Email Spam Detection A Machine Learning Approach

Email Spam Detection A Machine Learning Approach Email Spam Detection A Machine Learning Approach Ge Song, Lauren Steimle ABSTRACT Machine learning is a branch of artificial intelligence concerned with the creation and study of systems that can learn

More information

Robotics 2 Clustering & EM. Giorgio Grisetti, Cyrill Stachniss, Kai Arras, Maren Bennewitz, Wolfram Burgard

Robotics 2 Clustering & EM. Giorgio Grisetti, Cyrill Stachniss, Kai Arras, Maren Bennewitz, Wolfram Burgard Robotics 2 Clustering & EM Giorgio Grisetti, Cyrill Stachniss, Kai Arras, Maren Bennewitz, Wolfram Burgard 1 Clustering (1) Common technique for statistical data analysis to detect structure (machine learning,

More information

A Content based Spam Filtering Using Optical Back Propagation Technique

A Content based Spam Filtering Using Optical Back Propagation Technique A Content based Spam Filtering Using Optical Back Propagation Technique Sarab M. Hameed 1, Noor Alhuda J. Mohammed 2 Department of Computer Science, College of Science, University of Baghdad - Iraq ABSTRACT

More information

Web Document Clustering

Web Document Clustering Web Document Clustering Lab Project based on the MDL clustering suite http://www.cs.ccsu.edu/~markov/mdlclustering/ Zdravko Markov Computer Science Department Central Connecticut State University New Britain,

More information

from Larson Text By Susan Miertschin

from Larson Text By Susan Miertschin Decision Tree Data Mining Example from Larson Text By Susan Miertschin 1 Problem The Maximum Miniatures Marketing Department wants to do a targeted mailing gpromoting the Mythic World line of figurines.

More information

CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York

CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York BME I5100: Biomedical Signal Processing Linear Discrimination Lucas C. Parra Biomedical Engineering Department CCNY 1 Schedule Week 1: Introduction Linear, stationary, normal - the stuff biology is not

More information

T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577

T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier. Santosh Tirunagari : 245577 T-61.3050 : Email Classification as Spam or Ham using Naive Bayes Classifier Santosh Tirunagari : 245577 January 20, 2011 Abstract This term project gives a solution how to classify an email as spam or

More information

Introduction to Machine Learning. Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011

Introduction to Machine Learning. Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011 Introduction to Machine Learning Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011 1 Outline 1. What is machine learning? 2. The basic of machine learning 3. Principles and effects of machine learning

More information

Fig. 1 A typical Knowledge Discovery process [2]

Fig. 1 A typical Knowledge Discovery process [2] Volume 4, Issue 7, July 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Review on Clustering

More information

TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP

TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző csaba.fozo@lloydsbanking.com 15 October 2015 CONTENTS Introduction 04 Random Forest Methodology 06 Transactional Data Mining Project 17 Conclusions

More information

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning SAMSI 10 May 2013 Outline Introduction to NMF Applications Motivations NMF as a middle step

More information

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. ~ Spring~r

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. ~ Spring~r Bing Liu Web Data Mining Exploring Hyperlinks, Contents, and Usage Data With 177 Figures ~ Spring~r Table of Contents 1. Introduction.. 1 1.1. What is the World Wide Web? 1 1.2. ABrief History of the Web

More information

Using Artificial Intelligence to Manage Big Data for Litigation

Using Artificial Intelligence to Manage Big Data for Litigation FEBRUARY 3 5, 2015 / THE HILTON NEW YORK Using Artificial Intelligence to Manage Big Data for Litigation Understanding Artificial Intelligence to Make better decisions Improve the process Allay the fear

More information