Crowdclustering with Sparse Pairwise Labels: A Matrix Completion Approach




Jinfeng Yi, Rong Jin, Anil K. Jain, Shaili Jain (2012)
Presented by: Khalid Alkobayer

Outline
- Crowdsourcing and crowdclustering
- Ensemble clustering
- Bayesian generative model
- Novel crowdclustering approach
- Data sets
- Results and analysis: full annotations, sampled annotations
- Conclusion and comments

What is crowdsourcing?
A business model that provides an easy and inexpensive way to:
1. Accomplish small-scale tasks, e.g., Human Intelligence Tasks (HITs).
2. Utilize human capabilities to solve difficult problems.

Scenario:
- Each human worker is asked to solve a part of a big problem.
- A computational algorithm combines the partial solutions into an integrated one (a single data partition).
Examples: classification, clustering, and segmentation.
Crowdsourcing provides a similarity measure between objects based on manual annotations, which is exactly what the data clustering problem needs.

Crowdclustering
- Addresses a key challenge in data clustering: defining the similarity between objects.
- Applies the crowdsourcing technique to data clustering: pairwise similarities are obtained by asking each worker to cluster a subset of the objects.
- The similarity measure is based on the percentage of workers who put a pair of objects into the same cluster.

Ensemble Clustering
Scenario: a collection of objects needs to be clustered.
- Divide the objects into subsets and sample a subset of objects into each HIT.
- Each worker annotates the subset of objects in a HIT, producing a partial clustering.
- Combine the partial clustering results generated by individual workers into a single, complete data partition (C).

Challenges:
1. Each worker annotates only a subset of the objects, so each result is only a partial clustering.
2. Different workers produce different partial clusterings, introducing noise and inter-worker variation into the clustering results.

Annotation types: grouping objects (similarity) or describing individual objects (keywords). Both lead to uncertain pairwise labels being observed, which can produce inappropriate data partitions.

Novel Crowdclustering Approach
Bayesian generative model (the baseline algorithm):
- Addresses the large variation in the pairwise annotation labels provided by different workers by explicitly modeling the hidden factors that individual workers use to group objects into the same cluster.
- However, it requires a large number of manual annotations (HITs) to discover the hidden factors, which makes computation and annotation costly and limits its scalability to large data sets.

The proposed approach is instead based on the theory of matrix completion: a matrix of low rank can be recovered from only a few of its entries. This overcomes the limitation of the Bayesian approach.

Basic idea:
1. Compute a partially observed similarity matrix based only on the reliable pairwise annotation labels (uncertain data pairs are treated as unobserved).
2. Use a matrix completion algorithm to fill in the unobserved entries of the partially observed similarity matrix.
3. Apply a spectral clustering algorithm to the completed similarity matrix to obtain the final data partition (the final clustering).
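As a concrete illustration of the first step, here is a minimal sketch (in Python, not the authors' MATLAB implementation) of how one worker's partial clustering yields pairwise labels; the cluster contents and object count are hypothetical:

```python
def pairwise_labels(partial_clusters, n):
    """Pairwise label matrix for one HIT.

    partial_clusters: list of clusters (sets of object indices) produced by
    one worker for the subset of objects shown in this HIT.
    Entry W[i][j] is 1 if i and j share a cluster, 0 if they are placed in
    different clusters, and -1 if the label cannot be derived (at least one
    of the two objects was not shown to this worker).
    """
    cluster_of = {}
    for c, members in enumerate(partial_clusters):
        for obj in members:
            cluster_of[obj] = c
    W = [[-1] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i in cluster_of and j in cluster_of:
                W[i][j] = 1 if cluster_of[i] == cluster_of[j] else 0
    return W

# Hypothetical HIT covering objects {0, 1, 2} out of n = 4 objects:
W = pairwise_labels([{0, 1}, {2}], 4)
# W[0][1] == 1 (same cluster), W[0][2] == 0 (different clusters),
# W[0][3] == -1 (object 3 was not annotated in this HIT)
```

Each HIT produces one such matrix; averaging them across workers gives the similarity estimate that the filtering step operates on.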

Novel Crowdclustering Approach (continued)
Similarity matrix from the partial clustering result of the k-th HIT:
W_ij^k = 1 if objects i and j are assigned to the same cluster, 0 if they are assigned to different clusters, and -1 if the pairwise label for the two objects cannot be derived from the partial clustering result. Here N is the total number of objects (i, j = 1, ..., N), and A is the matrix of the average of W_ij^k over the m HITs (k = 1, ..., m).

Two key steps:
- Filtering step: remove the entries associated with uncertain data pairs from the similarity matrix, yielding the partially observed matrix Ã. The thresholds d0 and d1 used to decide which entries are reliable depend on the quality of the annotations.
- Matrix completion step: complete the partially observed matrix, yielding the completed similarity matrix A*.

Advantages:
1. Only a small number of pairwise annotations is needed to construct the partially observed similarity matrix.
2. By filtering out the uncertain data pairs, the approach is less sensitive to noisy labels, yielding a more robust clustering of the data.

Image data sets:
1. Scenes data set: 1,001 images, 13 categories, 131 workers (data source: Gomes et al. 2011). HIT: group images into multiple clusters, where the number of clusters is determined by each individual worker. Pairwise labels: derived from the partial clustering results generated in the HITs.
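The averaging and filtering steps above can be sketched as follows (pure Python; the thresholds d0 and d1 follow the paper's notation, while the example matrices are made up):

```python
def filter_similarity(Ws, d0, d1):
    """Average per-HIT pairwise label matrices and keep only reliable entries.

    Ws: list of W^k matrices with entries 1 / 0 / -1; the -1 (underived)
    entries are skipped when averaging. A_ij is then the fraction of workers,
    among those who labeled the pair (i, j), who put both objects in the
    same cluster. Filtering keeps A_ij >= d1 as 1 and A_ij <= d0 as 0;
    everything in between is an uncertain pair, left unobserved (None)
    for the matrix completion step to fill in.
    """
    n = len(Ws[0])
    A_tilde = [[None] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            votes = [W[i][j] for W in Ws if W[i][j] != -1]
            if not votes:
                continue  # pair never labeled by any worker: unobserved
            a = sum(votes) / len(votes)
            if a >= d1:
                A_tilde[i][j] = 1
            elif a <= d0:
                A_tilde[i][j] = 0
    return A_tilde

# Two hypothetical HITs over the same two objects: both workers keep each
# object with itself, but they disagree on the pair (0, 1), so that entry
# is filtered out as uncertain.
Ws = [[[1, 1], [1, 1]], [[1, 0], [0, 1]]]
A_tilde = filter_similarity(Ws, d0=0.0, d1=0.9)
# A_tilde[0][0] == 1; A_tilde[0][1] is None (only half the workers agree)
```

The unobserved (None) entries are exactly the ones the matrix completion algorithm recovers before spectral clustering is applied.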

2. Tattoo data set: 3,000 images, 3 categories (human, animal, and plant). HIT: annotate tattoo images with keywords of the worker's choice; on average, each image is annotated by 3 workers. Pairwise labels: obtained by comparing the number of matched keywords between two images to a threshold (= 1).

Baseline: the Bayesian approach for crowdclustering.

Metrics for clustering accuracy:
1. Normalized mutual information (NMI).
2. Pairwise F-measure (PWF).
Both take values in [0, 1], where 1 indicates a perfect match and 0 a complete mismatch. Efficiency is evaluated by measuring the running time. Software: MATLAB.

Normalized mutual information: let C = {C_1, C_2, ..., C_r} be the ground-truth partition of r clusters and C' = {C'_1, C'_2, ..., C'_r'} the partition generated by a clustering algorithm. In its standard form, NMI(C, C') = MI(C, C') / sqrt(H(C) H(C')), where MI(X, Y) is the mutual information between random variables X and Y and H(X) is the Shannon entropy of X.

Pairwise F-measure: let A be the set of data pairs that share the same class label according to the ground truth, and B the set of data pairs assigned to the same cluster by the clustering algorithm. With precision P = |A ∩ B| / |B| and recall R = |A ∩ B| / |A|, PWF = 2PR / (P + R).
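Both metrics can be computed directly from two flat label vectors; a small sketch (pure Python, written here as an illustration; the example partitions are made up, and flat label vectors replace the paper's set notation):

```python
import math
from collections import Counter

def nmi(labels_true, labels_pred):
    """NMI(C, C') = MI(C, C') / sqrt(H(C) * H(C')), in [0, 1]."""
    n = len(labels_true)
    pc, pk = Counter(labels_true), Counter(labels_pred)
    joint = Counter(zip(labels_true, labels_pred))
    mi = sum((v / n) * math.log((v / n) / ((pc[c] / n) * (pk[k] / n)))
             for (c, k), v in joint.items())
    entropy = lambda cnt: -sum((v / n) * math.log(v / n) for v in cnt.values())
    denom = math.sqrt(entropy(pc) * entropy(pk))
    return mi / denom if denom > 0 else 1.0

def pwf(labels_true, labels_pred):
    """Pairwise F-measure: harmonic mean of precision |A∩B|/|B| and
    recall |A∩B|/|A| over same-cluster object pairs."""
    n = len(labels_true)
    A = {(i, j) for i in range(n) for j in range(i + 1, n)
         if labels_true[i] == labels_true[j]}
    B = {(i, j) for i in range(n) for j in range(i + 1, n)
         if labels_pred[i] == labels_pred[j]}
    inter = len(A & B)
    if inter == 0:
        return 0.0
    p, r = inter / len(B), inter / len(A)
    return 2 * p * r / (p + r)

# Both metrics are invariant to relabeling of cluster ids: a partition that
# matches the ground truth up to a permutation of ids scores 1 on both.
print(nmi([0, 0, 1, 1], [1, 1, 0, 0]), pwf([0, 0, 1, 1], [1, 1, 0, 0]))
```

Note the invariance to cluster ids: unlike classification accuracy, neither metric cares which integer names a cluster, only which objects end up together.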

First experiment (full annotations)
- Run both algorithms on the Scenes and Tattoo data sets, using all the pairwise labels.
- For both data sets d0 = 0; d1 = 0.9 for Scenes and 0.5 for Tattoo.

Two criteria for choosing d1:
- d1 should be large enough to ensure that most pairwise labels are consistent with the cluster assignments.
- d1 should be small enough to obtain a sufficiently large number of entries with value 1 in the partially observed matrix Ã.

Accuracy and efficiency:
- Scenes data set: the proposed algorithm achieves similar, slightly lower accuracy than the Bayesian baseline, but with significantly lower running time.
- Tattoo data set: the proposed algorithm outperforms the Bayesian baseline on both accuracy and efficiency.
- The higher efficiency is due to the proposed algorithm using only a subset of reliable pairwise labels, while the Bayesian approach needs all of them.

Significance of the filtering step: a large portion of the pairwise labels derived from the manual annotation process are inconsistent with the cluster assignments (about 80% for the Scenes data set), so discarding uncertain pairs matters.

Condition to examine: a majority of the reliable pairwise labels derived from manual annotation should be consistent with the cluster assignments. This holds for both data sets: 95% for Scenes and 71% for Tattoo.

Effect of noisy labels on the proposed algorithm: fix d0 = 0 and vary d1 from 0.1 to 0.9 (d1 determines which pairwise labels are treated as reliable). Table 2 shows that the higher the percentage of consistent pairwise labels, the better the performance.

Second experiment (sampled annotations)
Objective: verify that an accurate clustering result can be obtained even with a small number of manual annotations.
- Scenes data set: use the annotations provided by 20, 10, 7, and 5 randomly sampled workers.
- Tattoo data set: randomly sample 10%, 5%, 2%, and 1% of all annotations (3 annotators per image).
- Run both algorithms on the sampled annotations; repeat the experiment 5 times and report the average performance (NMI).

Results: as expected, reducing the number of annotations lowers performance. The proposed algorithm is more robust and performs better at all sampling levels, because it needs only a small number of reliable pairwise labels to recover the cluster assignment matrix, whereas the baseline algorithm needs a large number of annotations to overcome the noisy labels and discover the hidden factors. As the number of annotations decreases, the performance of the baseline algorithm drops significantly.

Conclusion and Comments
- Crowdclustering uses the crowdsourcing technique to solve data clustering problems.
- The matrix completion approach improves the performance of the crowdclustering assignments.
- To derive the full similarity matrix, only a subset of data pairs with reliable pairwise labels is needed as input to a matrix completion algorithm.
- A sufficient number of workers is needed to determine the reliable data pairs.
- The proposed algorithm needs far fewer pairwise labels than the baseline algorithm, which lowers the cost of both computation and annotation.
- To reduce the number of pairwise labels, we can reduce the number of workers or the number of HITs per worker.

Questions?