Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University September 19, 2012



Similar documents
Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Mining Big Data. Pang-Ning Tan. Associate Professor Dept of Computer Science & Engineering Michigan State University

The Scientific Data Mining Process

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Unsupervised Data Mining (Clustering)

Doing Multidisciplinary Research in Data Science

Big Data: Image & Video Analytics

BIG DATA CHALLENGES AND PERSPECTIVES

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

Big Data Analytics. An Introduction. Oliver Fuchsberger University of Paderborn 2014

Fast Matching of Binary Features

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Crowdclustering with Sparse Pairwise Labels: A Matrix Completion Approach

Predict Influencers in the Social Network

Mobile Phone APP Software Browsing Behavior using Clustering Analysis

. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns

Distance Metric Learning in Data Mining (Part I) Fei Wang and Jimeng Sun IBM TJ Watson Research Center

The Big Data Paradigm Shift. Insight Through Automation

Active Learning SVM for Blogs recommendation

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Neural Networks Lesson 5 - Cluster Analysis

Big Data Analytics. Genoveva Vargas-Solar French Council of Scientific Research, LIG & LAFMIA Labs

Recognizing Cats and Dogs with Shape and Appearance based Models. Group Member: Chu Wang, Landu Jiang

Social Media Mining. Data Mining Essentials

Map-Reduce for Machine Learning on Multicore

Feature Selection vs. Extraction

Classification and Prediction

Steven C.H. Hoi School of Information Systems Singapore Management University

Distributed forests for MapReduce-based machine learning

Introduction to the Mathematics of Big Data. Philippe B. Laval

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

SIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs

CLASSIFYING NETWORK TRAFFIC IN THE BIG DATA ERA

Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014

Foundations of Artificial Intelligence. Introduction to Data Mining

How To Cluster

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Information Management course

Digital Earth: Big Data, Heritage and Social Science

Keywords Big Data Analytic Tools, Data Mining, Hadoop and MapReduce, HBase and Hive tools, User-Friendly tools.

CIS492 Special Topics: Cloud Computing د. منذر الطزاونة

AN EFFICIENT SELECTIVE DATA MINING ALGORITHM FOR BIG DATA ANALYTICS THROUGH HADOOP

Rolling the Dice on Big Data. Ilse Ipsen Department of Mathematics

Keywords Big Data; OODBMS; RDBMS; hadoop; EDM; learning analytics, data abundance.

A Novel Cloud Based Elastic Framework for Big Data Preprocessing

IJCSES Vol.7 No.4 October 2013 pp Serials Publications BEHAVIOR PERDITION VIA MINING SOCIAL DIMENSIONS

What happens when Big Data and Master Data come together?

Scalable and High Performance Computing for Big Data Analytics in Understanding the Human Dynamics in the Mobile Age

Data, Measurements, Features

Concept and Project Objectives

Environmental Remote Sensing GEOG 2021

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

Statistical Models in Data Mining

Standardization and Its Effects on K-Means Clustering Algorithm

BIG Big Data Public Private Forum

Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28

Use of Data Mining Techniques to Improve the Effectiveness of Sales and Marketing

International Journal of Advanced Engineering Research and Applications (IJAERA) ISSN: Vol. 1, Issue 6, October Big Data and Hadoop

Introduction to Data Mining

Mammoth Scale Machine Learning!

Introduction to Data Mining

SACOC: A spectral-based ACO clustering algorithm

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS

Machine Learning for Data Science (CS4786) Lecture 1

Data Mining Analytics for Business Intelligence and Decision Support

An Overview of Database management System, Data warehousing and Data Mining

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques

Survey of Big Data Benchmarking

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics

Big Data Analytics. The Hype and the Hope* Dr. Ted Ralphs Industrial and Systems Engineering Director, Laboratory

Big Data Analytics. Lucas Rego Drumond

Hadoop SNS. renren.com. Saturday, December 3, 11

Introduction to Predictive Analytics. Dr. Ronen Meiri

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

Machine Learning with MATLAB David Willingham Application Engineer

Data Mining and Knowledge Discovery in Databases (KDD) State of the Art. Prof. Dr. T. Nouri Computer Science Department FHNW Switzerland

Advances in Natural and Applied Sciences

Accelerating BIRCH for Clustering Large Scale Streaming Data Using CUDA Dynamic Parallelism

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

Deep learning applications and challenges in big data analytics

Large-scale Data Mining: MapReduce and Beyond Part 2: Algorithms. Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Now, Next and the Future: IT, Big Data and other Implications for RIM. Presented by Michael S. Smith /

Four Orders of Magnitude: Running Large Scale Accumulo Clusters. Aaron Cordova Accumulo Summit, June 2014

ADVANCED MACHINE LEARNING. Introduction

An analysis of suitable parameters for efficiently applying K-means clustering to large TCPdump data set using Hadoop framework

An Introduction to Data Mining

Two Heads Better Than One: Metric+Active Learning and Its Applications for IT Service Classification

Exploiting Data at Rest and Data in Motion with a Big Data Platform

The Data Mining Process

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. ~ Spring~r

Learning to Process Natural Language in Big Data Environment

A Review of Data Mining Techniques

Specific Usage of Visual Data Analysis Techniques

RevoScaleR Speed and Scalability

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Transcription:

Clustering Big Data Anil K. Jain (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University September 19, 2012

E-Mart No. of items sold per day = 139x2000x20 = ~6 million items/day (Walmart sells ~276 million items/day) Find groups (clusters) of customers with similar purchasing pattern Co-cluster customers and items to find which items are purchased together; useful for shelf management, marketing promotion,..

Outline Big Data How to extract information? Data clustering Clustering Big Data Kernel K-means & approximation Summary

Big Data http://dilbert.com/strips/comic/2012-07-29/

How Big is Big Data? Big is a fast moving target: kilobytes, megabytes, gigabytes, terabytes (10 12 ), petabytes (10 15 ), exabytes (10 18 ), zettabytes (10 21 ), Over 1.8 zb created in 2011; ~8 zb by 2015 D a t a s i z e E x a b y t e s Source: IDC s Digital Universe study, sponsored by EMC, June 2011 http://idcdocserv.com/1142 http://www.emc.com/leadership/programs/digital-universe.htm As of June 2012 Nature of Big Data: Volume, Velocity and Variety

Big Data on the Web Over 225 million users generating over 800 tweets per second ~900 million users, 2.5 billion content items, 105 terabytes of data each half hour, 300M photos and 4M videos posted per day http://techcrunch.com/2012/08/22/how-big-is-facebooks-data-2-5-billion-pieces-of-content-and-500-terabytes-ingestedevery-day/ http://royal.pingdom.com/2012/01/17/internet-2011-in-numbers/ http://www.dataversity.net/the-growth-of-unstructured-data-what-are-we-going-to-do-with-all-those-zettabytes/

Big Data on the Web Over 50 billion pages indexed and more than 2 million queries/min Articles from over 10,000 sources in real time ~4.5 million photos uploaded/day 48 hours of video uploaded/min; more than 1 trillion video views

What to do with Big Data? Extract information to make decisions Evidence-based decision: data-driven vs. analysis based on intuition & experience Analytics, business intelligence, data mining, machine learning, pattern recognition Big Data computing: IBM is promoting Watson (Jeopardy champion) to tackle Big Data in healthcare, finance, drug design,.. Steve Lohr, Amid the Flood, A Catchphrase is Born, NY Times, August 12, 2012

Decision Making Data Representation Features and similarity Learning Classification (supervised learning) Clustering (unsupervised learning) Semi-supervised clustering 9

Pattern Matrix n x d pattern matrix

Similarity Matrix Polynomial kernel: T x y 4 K( x, y) 1 16 15 14 4 6 6 4 3 1 15 16 14 4 5 5 6 4 3 14 14 16 9 9 9 8 7 4 4 4 9 16 15 15 9 10 6 6 5 9 15 16 16 7 8 4 6 5 9 15 16 16 7 8 4 4 6 8 9 7 7 16 16 14 3 4 7 10 8 8 16 16 14 1 3 4 6 4 4 14 14 16 n x n similarity matrix

Classification Dogs Cats Given a training set of labeled objects, learn a decision rule

Clustering Given a collection of (unlabeled) objects, find meaningful groups

Semi-supervised Clustering Supervised Unsupervised Dogs Cats Semi-supervised Pairwise constraints improve the clustering performance

What is a cluster? A group of the same or similar elements gathered or occurring closely together Galaxy clusters Birdhouse clusters Cluster munition Cluster computing Cluster lights Hongkeng Tulou cluster

Clusters in 2D

Clusters in 2D

Clusters in 2D

Clusters in 2D

Clusters in 2D

Challenges in Data Clustering Measure of similarity No. of clusters Cluster validity Outliers

Data Clustering Organize a collection of n objects into a partition or a hierarchy (nested set of partitions) Data clustering returned ~6,100 hits for 2011 alone (Google Scholar)

Clustering is the Key to Big Data Problem Not feasible to label large collection of objects No prior knowledge of the number and nature of groups (clusters) in data Clusters may evolve over time Clustering leads to more efficient browsing, searching, recommendation and classification/organization of Big Data

Clustering Users on Facebook ~300,000 status updates per minute on tens of thousands of topics Cluster status messages from different users based on the topic http://www.insidefacebook.com/2011/08/08/posted-about-page/ http://searchengineland.com/by-the-numbers-twitter-vs-facebook-vs-google-buzz-36709

Clustering Articles on Google News Topic cluster Article Listings http://blogoscoped.com/archive/2006-07-28-n49.html

Clustering Videos on Youube Keywords Popularity Viewer engagement User browsing history http://www.strutta.com/blog/blog/six-degrees-of-youtube

Clustering for Efficient Image retrieval Retrieval without clustering Retrieval with clustering Fig. 1. Upper-left image is the query. Numbers under the images on left side: image ID and cluster ID; on the right side: Image ID, matching score, number of regions. Retrieval accuracy for the food category (average precision): Without clustering: 47% With clustering: 61% Chen et al., CLUE: cluster-based retrieval of images by unsupervised learning, IEEE Tans. On Image Processing, 2005.

Clustering Algorithms Hundreds of clustering algorithms are available; many are admissible, but no algorithm is optimal K-means Gaussian mixture models Kernel K-means Spectral Clustering Nearest neighbor Latent Dirichlet Allocation

K-means Algorithm

K-means Algorithm Randomly assign cluster labels to the data points

K-means Algorithm Compute the center of each cluster

K-means Algorithm Assign points to the nearest cluster center

K-means Algorithm Re-compute centers

K-means Algorithm Repeat until there is no change in the cluster labels

K-means: Limitations Good at finding compact and isolated clusters n K min u ik x i c k 2 i=1 k=1

Gaussian Mixture Model Figueiredo & Jain, Unsupervised Learning of Finite Mixture Models, PAMI, 2002

Kernel K-means Non-linear mapping to find clusters of arbitrary shapes n K min Trace u ik φ x i c k φ φ x i c k φ T i=1 k=1 φ φ x, y = x 2, 2xy, y 2 T Polynomial kernel representation

Spectral Clustering Represent data using the top K eigenvectors of the similarity matrix; equivalent to Kernel K-means

K-means vs. Kernel K-means Data K-means Kernel K-means Kernel clustering is able to find complex clusters. But how to choose the right kernel?

Kernel K-means is Expensive No. of operations No. of Objects (n) K-means Kernel K-means O(nKd) O(n 2 K) 1M 10 13 (6412) 10 16 10M 10 14 10 18 100M 10 15 10 20 1B 10 16 10 22 d = 10,000; K=10 * Runtime in seconds on Intel Xeon 2.8 GHz processor using 40 GB memory A petascale supercomputer (IBM Sequoia, June 2012) with ~1 exabyte memory is needed to run kernel K-means on1 billion points!

Clustering Big Data Sampling Data n x n similarity matrix Pre-processing Clustering Summarization Incremental Distributed Approximation Cluster labels

(s) Distributed Clustering K-means on 1 billion 2-D points with 2 clusters 2.3 GHz quad-core Intel Xeon processors, with 8GB memory in the HPCC intel07 cluster Number of processors Speedup 2 1.9 3 2.5 4 3.8 5 4.9 6 5.2 7 5.5 8 6.3 9 5.5 10 5.1 Network communication cost increases with the no. of processors

Approximate kernel K-means Tradeoff between clustering accuracy and running time Given n points in d-dimensional space Chitta, Jin, Havens & Jain, Approximate Kernel k-means: solution to Large Scale Kernel Clustering, KDD, 2011

Approximate kernel K-means Tradeoff between clustering accuracy and running time Randomly sample m points y 1, y 2, y m, m n and compute the kernel similarity matrices K A (m m) and K B (n m) (K A ) ij = φ(y i ) T φ(y j ) (K B ) ij = φ(x i ) T φ(y j ) Chitta, Jin, Havens & Jain, Approximate Kernel k-means: solution to Large Scale Kernel Clustering, KDD, 2011

Approximate kernel K-means Tradeoff between clustering accuracy and running time Iteratively optimize for the cluster centers min U max α K n k=1 i=1 u ik φ x i α jk φ(y j ) Chitta, Jin, Havens & Jain, Approximate Kernel k-means: solution to Large Scale Kernel Clustering, KDD, 2011 m j=1 (equivalent to running K-means on K B K A 1 K B T )

Approximate kernel K-means Tradeoff between clustering accuracy and running time Obtain the final cluster labels Linear runtime and memory complexity Chitta, Jin, Havens & Jain, Approximate Kernel k-means: solution to Large Scale Kernel Clustering, KDD, 2011

Approximate Kernel K-Means Running time (seconds) Clustering accuracy (%) No. of objects (n) Kernel K- means Approximate kernel K- means (m=100) K-means Kernel K-means Approximate kernel K- means (m=100) K-means 10K 3.09 0.20 0.03 100 93.8 50.1 100K 320.10 1.18 0.17 100 93.7 49.9 1M - 15.06 0.72-95.1 50.0 10M - 234.49 12.14-91.6 50.0 2.8 GHz processor, 40 GB

Tiny Image Data set ~80 million 32x32 images from ~75K classes (bamboo, fish, mushroom, leaf, mountain, ); image represented by 384- dim. GIST descriptors Fergus et al., 80 million tiny images: a large dataset for non-parametric object and scene recognition, PAMI 2008

Tiny Image Data set 10-class subset (CIFAR-10): 60K manually annotated images Airplane Automobile Bird Cat Deer Dog Frog Horse Ship Truck Krizhevsky, Learning multiple layers of features from tiny images, 2009

Clustering Tiny Images C 1 C 2 C 3 Average clustering time (100 clusters) Approximate kernel K- means (m=1,000) 8.5 hours C 4 K-means 6 hours C 5 Example Clusters 2.3GHz, 150GB memory

Clustering Tiny Images Best Supervised Classification Accuracy on CIFAR-10: 54.7% Clustering accuracy Kernel K-means 29.94% Approximate kernel K-means (m = 5,000) 29.76% Spectral clustering 27.09% K-means 26.70% Out of 6,000 automobile images, K-means places 4,090 images in wrong clusters; for approximate Kernel k-means this number is 3,788 images Ranzato et. Al., Modeling pixel means and covariances using factorized third-order boltzmann machines, CVPR 2010 Fowlkes et al., Spectral grouping using the Nystrom method, PAMI 2004

Distributed Approx. Kernel K-means For better scalability and faster clustering Given n points in d-dimensional space

Distributed Approx. Kernel K-means For better scalability and faster clustering Randomly sample m points (m << n)

Distributed Approx. Kernel K-means For better scalability and faster clustering Split the remaining n - m randomly into p partitions and assign partition P t to task t

Distributed Approx. Kernel K-means For better scalability and faster clustering Run approximate kernel K-means in each task t and find the cluster centers

Assign each point in task s (s t) to the closest center from task t Distributed Approx. Kernel K-means For better scalability and faster clustering

Distributed Approx. Kernel K-means For better scalability and faster clustering Combine the labels from each task using ensemble clustering algorithm

Distributed Approximate kernel K-means 2-D data set with 2 concentric circles 7000 Running time 6000 5000 4000 Size of data set Speedup 3000 2000 1000 10K 3.8 100K 4.8 0 10K 100K 1M 10M 1M 3.8 10M 6.4 Distributed approximate Kernel K-means (8 nodes) Approximate Kernel K-means 2.3 GHz quad-core Intel Xeon processors, with 8GB memory in the HPCC intel07 cluster

Summary Clustering is an exploratory technique; used in every scientific field that collects data Choice of clustering algorithm & its parameters is data dependent Clustering is essential to tackle Big Data Approximate kernel K-means provides good tradeoff between scalability & clustering accuracy Challenges: Scalability, very large no. of clusters, heterogeneous data, streaming data