Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012





Outline. Big Data. How to extract information? Data clustering. Clustering Big Data. Kernel K-means & approximation. Summary.

How Big is Big Data? Big is a fast-moving target: kilobytes, megabytes, gigabytes, terabytes (10^12), petabytes (10^15), exabytes (10^18), zettabytes (10^21), ... Over 1.8 ZB created in 2011; ~8 ZB by 2015. [Chart: data size in exabytes.] Source: IDC's Digital Universe study, sponsored by EMC, June 2011. http://idcdocserv.com/1142 http://www.emc.com/leadership/programs/digital-universe.htm As of June 2012. Nature of Big Data: Volume, Velocity and Variety.

Big Data on the Web. Twitter: over 225 million users generating over 800 tweets per second. Facebook: ~900 million users, 2.5 billion content items, 105 terabytes of data each half hour, 300M photos and 4M videos posted per day. http://techcrunch.com/2012/08/22/how-big-is-facebooks-data-2-5-billion-pieces-of-content-and-500-terabytes-ingestedevery-day/ http://royal.pingdom.com/2012/01/17/internet-2011-in-numbers/ http://www.dataversity.net/the-growth-of-unstructured-data-what-are-we-going-to-do-with-all-those-zettabytes/

Big Data on the Web. Over 50 billion pages indexed and more than 2 million queries/min. Articles from over 10,000 sources in real time. ~4.5 million photos uploaded/day. 48 hours of video uploaded/min; more than 1 trillion video views. No. of mobile phones will exceed the world's population by the end of 2012.

What to do with Big Data? Extract information to make decisions. Evidence-based decision: data-driven vs. analysis based on intuition & experience. Analytics, business intelligence, data mining, machine learning, pattern recognition. Big Data computing: IBM is promoting Watson (Jeopardy champion) to tackle Big Data in healthcare, finance, drug design, ... Steve Lohr, Amid the Flood, A Catchphrase is Born, NY Times, August 12, 2012.

Decision Making. Data. Representation: features and similarity. Learning: classification (labeled data), clustering (unlabeled data). Most big data problems have unlabeled objects!

Pattern Matrix. n x d pattern matrix.

Similarity Matrix. Polynomial kernel: n x n similarity matrix.
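
The kernel formula itself did not survive in the transcript, so the following is only a sketch of how an n x n polynomial-kernel similarity matrix could be built from an n x d pattern matrix. The function name `polynomial_kernel` and the `degree` and `coef0` parameters are assumptions chosen for illustration, not values taken from the slides.

```python
import numpy as np

def polynomial_kernel(X, degree=2, coef0=1.0):
    """Return the n x n matrix K[i, j] = (x_i . x_j + coef0) ** degree.

    X is the n x d pattern matrix; degree and coef0 are illustrative defaults.
    """
    X = np.asarray(X, dtype=float)
    return (X @ X.T + coef0) ** degree

# Three 2-D patterns give a 3 x 3 symmetric similarity matrix.
X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
K = polynomial_kernel(X)
print(K.shape)
```

Storing K costs n^2 entries regardless of d, which is why the similarity matrix becomes the bottleneck later in the talk.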

Classification (dogs vs. cats). Given a training set of labeled objects, learn a decision rule.

Clustering. Given a collection of (unlabeled) objects, find meaningful groups.

Semi-supervised Clustering. [Figure panels: supervised (dogs, cats), unsupervised, semi-supervised.] Pairwise constraints improve the clustering performance.

What is a cluster? A group of the same or similar elements gathered or occurring closely together. Examples: galaxy clusters, birdhouse clusters, cluster munition, cluster computing, cluster lights, Hongkeng Tulou cluster.

Clusters in 2D

Challenges in Data Clustering. Measure of similarity. No. of clusters. Cluster validity. Outliers.

Data Clustering. Organize a collection of n objects into a partition or a hierarchy (nested set of partitions). "Data clustering" returned ~6,100 hits for 2011 (Google Scholar).

Clustering is the Key to Big Data Problem. Not feasible to label a large collection of objects. No prior knowledge of the number and nature of groups (clusters) in data. Clusters may evolve over time. Clustering provides efficient browsing, search, recommendation and organization of data.

Clustering Users on Facebook. ~300,000 status updates per minute on tens of thousands of topics. Cluster users based on topic of status messages. http://www.insidefacebook.com/2011/08/08/posted-about-page/ http://searchengineland.com/by-the-numbers-twitter-vs-facebook-vs-google-buzz-36709

Clustering Articles on Google News. [Screenshot labels: topic cluster, article listings.] http://blogoscoped.com/archive/2006-07-28-n49.html

Clustering Videos on Youtube. Keywords. Popularity. Viewer engagement. User browsing history. http://www.strutta.com/blog/blog/six-degrees-of-youtube

Clustering for Efficient Image Retrieval. [Figure panels: retrieval without clustering, retrieval with clustering.] Fig. 1. Upper-left image is the query. Numbers under the images on left side: image ID and cluster ID; on the right side: image ID, matching score, number of regions. Retrieval accuracy for the food category (average precision): without clustering, 47%; with clustering, 61%. Chen et al., CLUE: cluster-based retrieval of images by unsupervised learning, IEEE Trans. on Image Processing, 2005.

Clustering Algorithms. Hundreds of clustering algorithms are available; many are admissible, but no algorithm is optimal. Examples: K-means, Gaussian mixture models, kernel K-means, spectral clustering, nearest neighbor, Latent Dirichlet Allocation. A.K. Jain, Data Clustering: 50 Years Beyond K-Means, PRL, 2011.

K-means Algorithm. Randomly assign cluster labels to the data points. Compute the center of each cluster. Assign points to the nearest cluster center. Re-compute the centers. Repeat until there is no change in the cluster labels.
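
The K-means loop on this slide can be sketched in NumPy as follows. This is a minimal illustration rather than the speakers' code: the function name `kmeans`, the `max_iter` cap, the `seed` argument, and the rule of re-seeding an empty cluster with a random data point are all assumptions.

```python
import numpy as np

def kmeans(X, n_clusters, max_iter=100, seed=0):
    """Plain K-means started from random labels; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    labels = rng.integers(n_clusters, size=n)        # random initial labels
    for _ in range(max_iter):
        # Compute (or re-compute) the center of each cluster.
        centers = np.empty((n_clusters, d))
        for k in range(n_clusters):
            members = X[labels == k]
            # Assumption: an empty cluster is re-seeded with a random point.
            centers[k] = members.mean(axis=0) if len(members) else X[rng.integers(n)]
        # Assign every point to its nearest center (squared Euclidean distance).
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):       # no change in labels: stop
            break
        labels = new_labels
    return labels, centers
```

Each pass costs on the order of n x K x d operations, which matches the O(nKd) figure quoted later for K-means.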

K-means: Limitations. Prefers compact and isolated clusters.

Gaussian Mixture Model. Figueiredo & Jain, Unsupervised Learning of Finite Mixture Models, PAMI, 2002.

Kernel K-means. Non-linear mapping to find clusters of arbitrary shapes. Polynomial kernel representation.

Spectral Clustering. Represent data using the top K eigenvectors of the kernel matrix; equivalent to Kernel K-means.

K-means vs. Kernel K-means. [Figure panels: data, K-means, kernel K-means.] Kernel clustering is able to find complex clusters. How to choose the right kernel? RBF kernel is the default.
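
As a rough sketch of the representation step only, the top K eigenvectors of a symmetric kernel matrix can be extracted as shown here. The helper name `top_eigenvectors` is hypothetical, and no normalisation of the kernel matrix is applied because the slide does not mention one; the rows of the returned matrix would then be clustered, for instance with K-means.

```python
import numpy as np

def top_eigenvectors(K, n_components):
    """Return an n x n_components matrix whose columns are the eigenvectors
    of the symmetric kernel matrix K with the largest eigenvalues."""
    K = np.asarray(K, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(K)              # eigenvalues ascending
    order = np.argsort(eigvals)[::-1][:n_components]  # indices of the largest
    return eigvecs[:, order]

# Toy kernel with two obvious groups: points {0, 1} and points {2, 3}.
K = np.array([[2.0, 2.0, 0.0, 0.0],
              [2.0, 2.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0, 1.0]])
E = top_eigenvectors(K, n_components=2)
print(E.shape)
```

A full eigendecomposition is cubic in n, so this direct form is only practical for small data sets.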
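
A minimal sketch of kernel K-means with an RBF kernel is given below. It works only with the n x n kernel matrix, using the identity ||phi(x_i) - c_k||^2 = K_ii - 2 mean_j K_ij + mean_jl K_jl over the members j, l of cluster k. The names `rbf_kernel` and `kernel_kmeans`, the `gamma` bandwidth, and the random-label start are assumptions for illustration; results on non-convex data depend on the bandwidth and the initialisation.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """RBF kernel block exp(-gamma * ||a - b||^2) for rows a of A and b of B."""
    sq = (A ** 2).sum(axis=1)[:, None] - 2.0 * A @ B.T + (B ** 2).sum(axis=1)[None, :]
    return np.exp(-gamma * np.maximum(sq, 0.0))      # clamp tiny negative values

def kernel_kmeans(K, n_clusters, max_iter=100, seed=0):
    """Kernel K-means on a precomputed n x n kernel matrix K; returns labels."""
    rng = np.random.default_rng(seed)
    K = np.asarray(K, dtype=float)
    n = K.shape[0]
    labels = rng.integers(n_clusters, size=n)
    for _ in range(max_iter):
        dist = np.full((n, n_clusters), np.inf)      # empty clusters stay at inf
        for k in range(n_clusters):
            mask = labels == k
            size = mask.sum()
            if size == 0:
                continue
            # Squared feature-space distance of every point to center k.
            dist[:, k] = (np.diag(K)
                          - 2.0 * K[:, mask].sum(axis=1) / size
                          + K[np.ix_(mask, mask)].sum() / size ** 2)
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

# Usage: K = rbf_kernel(X, X, gamma=0.5); labels = kernel_kmeans(K, n_clusters=2)
```

Both the memory for K and the work per iteration grow with n^2, which is the cost the approximation methods in the rest of the talk try to avoid.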

Kernel K-means is Expensive. No. of operations, with d = 10,000 and K = 10:
No. of objects (n) | K-means, O(nKd) | Kernel K-means, O(n^2 K)
1M | 10^13 (6412*) | 10^16
10M | 10^14 | 10^18
100M | 10^15 | 10^20
1B | 10^16 | 10^22
* Runtime in seconds on Intel Xeon 2.8 GHz processor using 40 GB memory.
A petascale supercomputer (IBM Sequoia, June 2012) with ~1 exabyte memory is needed to run kernel K-means on 1 billion points!

Clustering Big Data. [Diagram labels: sampling; data; n x n similarity matrix; pre-processing; clustering; summarization; incremental; distributed; approximation; cluster labels.]

Distributed Clustering. Clustering 100,000 2-D points with 2 clusters on 2.3 GHz quad-core Intel Xeon processors, with 8 GB memory, in the intel07 cluster. Speedup:
Number of processors | K-means | Kernel K-means
2 | 1.1 | 1.3
3 | 2.4 | 1.5
4 | 3.1 | 1.6
5 | 3.0 | 3.8
6 | 3.1 | 1.9
7 | 3.3 | 1.5
8 | 1.2 | 1.5
Network communication cost increases with the no. of processors.

Approximate kernel K-means. Tradeoff between clustering accuracy and running time. Given n points in d-dimensional space. Obtain the final cluster labels. Linear runtime and memory complexity. Chitta, Jin, Havens & Jain, Approximate Kernel k-means: solution to Large Scale Kernel Clustering, KDD, 2011.
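
The intermediate steps of the method are not legible in the transcript, so the following is a hedged sketch of one way to realise the idea, assuming the cluster centers are restricted to the span of m randomly sampled points so that only an n x m block of the kernel matrix is ever formed. The function names, the `gamma` bandwidth, the use of a pseudo-inverse, and the random-label start are assumptions, not details confirmed by the slides.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """RBF kernel block exp(-gamma * ||a - b||^2) for rows a of A and b of B."""
    sq = (A ** 2).sum(axis=1)[:, None] - 2.0 * A @ B.T + (B ** 2).sum(axis=1)[None, :]
    return np.exp(-gamma * np.maximum(sq, 0.0))

def approx_kernel_kmeans(X, n_clusters, m, kernel=rbf_kernel, max_iter=100, seed=0):
    """Kernel K-means with centers constrained to the span of m sampled points."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    sample = rng.choice(n, size=m, replace=False)    # m << n sampled points
    K_B = kernel(X, X[sample])                       # n x m block, never n x n
    K_hat = K_B[sample]                              # m x m kernel among samples
    K_hat_pinv = np.linalg.pinv(K_hat)
    labels = rng.integers(n_clusters, size=n)
    for _ in range(max_iter):
        # Row-normalised membership matrix (K x n); empty rows stay all-zero.
        U = np.zeros((n_clusters, n))
        U[labels, np.arange(n)] = 1.0
        U /= np.maximum(U.sum(axis=1, keepdims=True), 1.0)
        # Center k is sum_j alpha[k, j] * phi(sample_j): least-squares fit.
        alpha = U @ K_B @ K_hat_pinv                 # K x m coefficients
        center_norm = ((alpha @ K_hat) * alpha).sum(axis=1)
        # Distance to each center, up to the per-point constant K_ii.
        dist = center_norm[None, :] - 2.0 * K_B @ alpha.T
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```

With m fixed, memory is O(nm) and each iteration costs O(nmK), which is where the linear dependence on n comes from; a larger m trades running time for accuracy.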

Approximate Kernel K-Means. Running time (seconds) and clustering accuracy (%); approximate kernel K-means uses m = 100; 2.8 GHz processor, 40 GB memory.
No. of objects (n) | Time: kernel K-means | Time: approx. kernel K-means | Time: K-means | Accuracy: kernel K-means | Accuracy: approx. kernel K-means | Accuracy: K-means
10K | 3.09 | 0.20 | 0.03 | 100 | 93.8 | 50.1
100K | 320.10 | 1.18 | 0.17 | 100 | 93.7 | 49.9
1M | - | 15.06 | 0.72 | - | 95.1 | 50.0
10M | - | 234.49 | 12.14 | - | 91.6 | 50.0

Tiny Image Data set. ~80 million 32x32 images from ~75K classes (bamboo, fish, mushroom, leaf, mountain, ...); image represented by 384-dim. GIST descriptors. Fergus et al., 80 million tiny images: a large dataset for non-parametric object and scene recognition, PAMI 2008.

Tiny Image Data set. 10-class subset (CIFAR-10): 60K manually annotated images. Classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck. Krizhevsky, Learning multiple layers of features from tiny images, 2009.

Clustering Tiny Images. Average clustering time (100 clusters): approximate kernel K-means (m = 1,000), 8.5 hours; K-means, 6 hours. [Figure: example clusters C1 to C5.] 2.3 GHz, 150 GB memory.

Clustering Tiny Images. Best supervised classification accuracy on CIFAR-10: 54.7%.
Method | Clustering accuracy
Kernel K-means | 29.94%
Approximate kernel K-means (m = 5,000) | 29.76%
Spectral clustering | 27.09%
K-means | 26.70%
Ranzato et al., Modeling pixel means and covariances using factorized third-order Boltzmann machines, CVPR 2010. Fowlkes et al., Spectral grouping using the Nystrom method, PAMI 2004.

Distributed Approx. Kernel K-means. For better scalability and faster clustering. Given n points in d-dimensional space. Randomly sample m points (m << n). Split the remaining n - m points randomly into p partitions and assign partition P_t to task t. Run approximate kernel K-means in each task t and find the cluster centers. Assign each point in task s (s ≠ t) to the closest center from task t. Combine the labels from each task using the ensemble clustering algorithm.

Distributed Approximate kernel K-means. 2-D data set with 2 concentric circles. [Plot: running time.] 2.3 GHz quad-core Intel Xeon processors, with 8 GB memory, in the intel07 cluster.
Size of data set | Speedup
10K | 3.8
100K | 4.8
1M | 3.8
10M | 6.4

Limitations of Approx. kernel K-means. Clustering data with more than 10 million points will require terabytes of memory! Sample and Cluster Algorithm (SnC): sample s points from the data; run approximate kernel K-means on the s points; assign the remaining points to the nearest cluster center.
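
The three SnC steps might be sketched as follows. To keep the example short, the clustering of the sample is delegated to a caller-supplied `cluster_fn` (a hypothetical hook that would wrap approximate kernel K-means in practice), and the remaining points are assigned in chunks so that only a chunk-by-s kernel block is held in memory at a time. All names and the `chunk` parameter are assumptions.

```python
import numpy as np

def sample_and_cluster(X, n_clusters, s, kernel, cluster_fn, chunk=10_000, seed=0):
    """Sample s points, cluster them, then label every point by nearest center.

    kernel(A, B) returns the kernel block between rows of A and rows of B.
    cluster_fn(X_sample, n_clusters) returns integer labels for the sample.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    sample = rng.choice(n, size=s, replace=False)    # step 1: sample s points
    Xs = X[sample]
    sample_labels = np.asarray(cluster_fn(Xs, n_clusters))   # step 2: cluster

    # Size-normalised one-hot membership of the sampled points (s x K).
    member = np.zeros((s, n_clusters))
    member[np.arange(s), sample_labels] = 1.0
    sizes = member.sum(axis=0)
    member /= np.maximum(sizes, 1.0)
    # Squared feature-space norm of each center; empty clusters are excluded.
    center_norm = np.einsum('ik,ij,jk->k', member, kernel(Xs, Xs), member)
    center_norm[sizes == 0] = np.inf

    # Step 3: assign all points, chunk by chunk, to the nearest center.
    labels = np.empty(n, dtype=int)
    for start in range(0, n, chunk):
        block = X[start:start + chunk]
        cross = kernel(block, Xs) @ member           # <phi(x), c_k> per cluster
        labels[start:start + chunk] = (center_norm - 2.0 * cross).argmin(axis=1)
    return labels
```

Note that in this sketch the sampled points are also re-labelled by the nearest-center rule, so their final labels can differ from what `cluster_fn` returned.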

Clustering one billion points. Sample and Cluster (s = 1 million, m = 100).
Method | Running time | Average clustering accuracy
K-means | 53 minutes | 50%
SnC | 1.2 hours | 85%
SnC distributed (8 cores) | 45 minutes | -

Clustering billions of points. Work in progress. Application to real data sets: Yahoo! AltaVista Web Page Hyperlink Connectivity Graph (2002), containing URLs and hyperlinks for over 1.4 billion public web pages. Challenges. Graph sparsity: reduce the dimensionality using random projection, PCA. Cluster evaluation: no ground truth available; internal measures such as link density of clusters.
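
As an illustration of the random-projection option mentioned above, a Gaussian random projection can be sketched in a few lines. The function name `random_projection`, the Gaussian choice of projection matrix, and the dense-matrix representation are assumptions; a real hyperlink graph would be stored as a sparse matrix.

```python
import numpy as np

def random_projection(X, n_components, seed=0):
    """Project the n x d matrix X to n x n_components with a Gaussian matrix.

    Entries of the projection are N(0, 1 / n_components), so squared lengths
    are preserved in expectation.
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    d = X.shape[1]
    R = rng.standard_normal((d, n_components)) / np.sqrt(n_components)
    return X @ R

# Usage: reduce 2,000-dimensional rows to 500 dimensions.
X = np.random.default_rng(1).standard_normal((5, 2000))
Z = random_projection(X, n_components=500)
print(Z.shape)
```

Unlike PCA, the projection matrix is data-independent, so it can be generated once and applied to the data in a single streaming pass.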

Summary. Clustering is an exploratory technique; used in every scientific field that collects data. Choice of clustering algorithm & its parameters is data dependent. Clustering is essential for the Big Data problem. Approximate kernel K-means provides a good tradeoff between scalability & clustering accuracy. Challenges: scalability, very large no. of clusters, heterogeneous data, streaming data, validity.

Big Data. http://dilbert.com/strips/comic/2012-07-29/