Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University September 19, 2012

Size: px
Start display at page:

Download "Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University September 19, 2012"

Transcription

1 Clustering Big Data Anil K. Jain (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University September 19, 2012

2 E-Mart No. of items sold per day = 139x2000x20 = ~6 million items/day (Walmart sells ~276 million items/day) Find groups (clusters) of customers with similar purchasing pattern Co-cluster customers and items to find which items are purchased together; useful for shelf management, marketing promotion,..

3 Outline Big Data How to extract information? Data clustering Clustering Big Data Kernel K-means & approximation Summary

4 Big Data

5 How Big is Big Data? Big is a fast moving target: kilobytes, megabytes, gigabytes, terabytes (10 12 ), petabytes (10 15 ), exabytes (10 18 ), zettabytes (10 21 ), Over 1.8 zb created in 2011; ~8 zb by 2015 D a t a s i z e E x a b y t e s Source: IDC s Digital Universe study, sponsored by EMC, June As of June 2012 Nature of Big Data: Volume, Velocity and Variety

6 Big Data on the Web Over 225 million users generating over 800 tweets per second ~900 million users, 2.5 billion content items, 105 terabytes of data each half hour, 300M photos and 4M videos posted per day

7 Big Data on the Web Over 50 billion pages indexed and more than 2 million queries/min Articles from over 10,000 sources in real time ~4.5 million photos uploaded/day 48 hours of video uploaded/min; more than 1 trillion video views

8 What to do with Big Data? Extract information to make decisions Evidence-based decision: data-driven vs. analysis based on intuition & experience Analytics, business intelligence, data mining, machine learning, pattern recognition Big Data computing: IBM is promoting Watson (Jeopardy champion) to tackle Big Data in healthcare, finance, drug design,.. Steve Lohr, Amid the Flood, A Catchphrase is Born, NY Times, August 12, 2012

9 Decision Making Data Representation Features and similarity Learning Classification (supervised learning) Clustering (unsupervised learning) Semi-supervised clustering 9

10 Pattern Matrix n x d pattern matrix

11 Similarity Matrix Polynomial kernel: T x y 4 K( x, y) n x n similarity matrix

12 Classification Dogs Cats Given a training set of labeled objects, learn a decision rule

13 Clustering Given a collection of (unlabeled) objects, find meaningful groups

14 Semi-supervised Clustering Supervised Unsupervised Dogs Cats Semi-supervised Pairwise constraints improve the clustering performance

15 What is a cluster? A group of the same or similar elements gathered or occurring closely together Galaxy clusters Birdhouse clusters Cluster munition Cluster computing Cluster lights Hongkeng Tulou cluster

16 Clusters in 2D

17 Clusters in 2D

18 Clusters in 2D

19 Clusters in 2D

20 Clusters in 2D

21 Challenges in Data Clustering Measure of similarity No. of clusters Cluster validity Outliers

22 Data Clustering Organize a collection of n objects into a partition or a hierarchy (nested set of partitions) Data clustering returned ~6,100 hits for 2011 alone (Google Scholar)

23 Clustering is the Key to Big Data Problem Not feasible to label large collection of objects No prior knowledge of the number and nature of groups (clusters) in data Clusters may evolve over time Clustering leads to more efficient browsing, searching, recommendation and classification/organization of Big Data

24 Clustering Users on Facebook ~300,000 status updates per minute on tens of thousands of topics Cluster status messages from different users based on the topic

25 Clustering Articles on Google News Topic cluster Article Listings

26 Clustering Videos on Youube Keywords Popularity Viewer engagement User browsing history

27 Clustering for Efficient Image retrieval Retrieval without clustering Retrieval with clustering Fig. 1. Upper-left image is the query. Numbers under the images on left side: image ID and cluster ID; on the right side: Image ID, matching score, number of regions. Retrieval accuracy for the food category (average precision): Without clustering: 47% With clustering: 61% Chen et al., CLUE: cluster-based retrieval of images by unsupervised learning, IEEE Tans. On Image Processing, 2005.

28 Clustering Algorithms Hundreds of clustering algorithms are available; many are admissible, but no algorithm is optimal K-means Gaussian mixture models Kernel K-means Spectral Clustering Nearest neighbor Latent Dirichlet Allocation

29 K-means Algorithm

30 K-means Algorithm Randomly assign cluster labels to the data points

31 K-means Algorithm Compute the center of each cluster

32 K-means Algorithm Assign points to the nearest cluster center

33 K-means Algorithm Re-compute centers

34 K-means Algorithm Repeat until there is no change in the cluster labels

35 K-means: Limitations Good at finding compact and isolated clusters n K min u ik x i c k 2 i=1 k=1

36 Gaussian Mixture Model Figueiredo & Jain, Unsupervised Learning of Finite Mixture Models, PAMI, 2002

37 Kernel K-means Non-linear mapping to find clusters of arbitrary shapes n K min Trace u ik φ x i c k φ φ x i c k φ T i=1 k=1 φ φ x, y = x 2, 2xy, y 2 T Polynomial kernel representation

38 Spectral Clustering Represent data using the top K eigenvectors of the similarity matrix; equivalent to Kernel K-means

39 K-means vs. Kernel K-means Data K-means Kernel K-means Kernel clustering is able to find complex clusters. But how to choose the right kernel?

40 Kernel K-means is Expensive No. of operations No. of Objects (n) K-means Kernel K-means O(nKd) O(n 2 K) 1M (6412) M M B d = 10,000; K=10 * Runtime in seconds on Intel Xeon 2.8 GHz processor using 40 GB memory A petascale supercomputer (IBM Sequoia, June 2012) with ~1 exabyte memory is needed to run kernel K-means on1 billion points!

41 Clustering Big Data Sampling Data n x n similarity matrix Pre-processing Clustering Summarization Incremental Distributed Approximation Cluster labels

42 (s) Distributed Clustering K-means on 1 billion 2-D points with 2 clusters 2.3 GHz quad-core Intel Xeon processors, with 8GB memory in the HPCC intel07 cluster Number of processors Speedup Network communication cost increases with the no. of processors

43 Approximate kernel K-means Tradeoff between clustering accuracy and running time Given n points in d-dimensional space Chitta, Jin, Havens & Jain, Approximate Kernel k-means: solution to Large Scale Kernel Clustering, KDD, 2011

44 Approximate kernel K-means Tradeoff between clustering accuracy and running time Randomly sample m points y 1, y 2, y m, m n and compute the kernel similarity matrices K A (m m) and K B (n m) (K A ) ij = φ(y i ) T φ(y j ) (K B ) ij = φ(x i ) T φ(y j ) Chitta, Jin, Havens & Jain, Approximate Kernel k-means: solution to Large Scale Kernel Clustering, KDD, 2011

45 Approximate kernel K-means Tradeoff between clustering accuracy and running time Iteratively optimize for the cluster centers min U max α K n k=1 i=1 u ik φ x i α jk φ(y j ) Chitta, Jin, Havens & Jain, Approximate Kernel k-means: solution to Large Scale Kernel Clustering, KDD, 2011 m j=1 (equivalent to running K-means on K B K A 1 K B T )

46 Approximate kernel K-means Tradeoff between clustering accuracy and running time Obtain the final cluster labels Linear runtime and memory complexity Chitta, Jin, Havens & Jain, Approximate Kernel k-means: solution to Large Scale Kernel Clustering, KDD, 2011

47 Approximate Kernel K-Means Running time (seconds) Clustering accuracy (%) No. of objects (n) Kernel K- means Approximate kernel K- means (m=100) K-means Kernel K-means Approximate kernel K- means (m=100) K-means 10K K M M GHz processor, 40 GB

48 Tiny Image Data set ~80 million 32x32 images from ~75K classes (bamboo, fish, mushroom, leaf, mountain, ); image represented by 384- dim. GIST descriptors Fergus et al., 80 million tiny images: a large dataset for non-parametric object and scene recognition, PAMI 2008

49 Tiny Image Data set 10-class subset (CIFAR-10): 60K manually annotated images Airplane Automobile Bird Cat Deer Dog Frog Horse Ship Truck Krizhevsky, Learning multiple layers of features from tiny images, 2009

50 Clustering Tiny Images C 1 C 2 C 3 Average clustering time (100 clusters) Approximate kernel K- means (m=1,000) 8.5 hours C 4 K-means 6 hours C 5 Example Clusters 2.3GHz, 150GB memory

51 Clustering Tiny Images Best Supervised Classification Accuracy on CIFAR-10: 54.7% Clustering accuracy Kernel K-means 29.94% Approximate kernel K-means (m = 5,000) 29.76% Spectral clustering 27.09% K-means 26.70% Out of 6,000 automobile images, K-means places 4,090 images in wrong clusters; for approximate Kernel k-means this number is 3,788 images Ranzato et. Al., Modeling pixel means and covariances using factorized third-order boltzmann machines, CVPR 2010 Fowlkes et al., Spectral grouping using the Nystrom method, PAMI 2004

52 Distributed Approx. Kernel K-means For better scalability and faster clustering Given n points in d-dimensional space

53 Distributed Approx. Kernel K-means For better scalability and faster clustering Randomly sample m points (m << n)

54 Distributed Approx. Kernel K-means For better scalability and faster clustering Split the remaining n - m randomly into p partitions and assign partition P t to task t

55 Distributed Approx. Kernel K-means For better scalability and faster clustering Run approximate kernel K-means in each task t and find the cluster centers

56 Assign each point in task s (s t) to the closest center from task t Distributed Approx. Kernel K-means For better scalability and faster clustering

57 Distributed Approx. Kernel K-means For better scalability and faster clustering Combine the labels from each task using ensemble clustering algorithm

58 Distributed Approximate kernel K-means 2-D data set with 2 concentric circles 7000 Running time Size of data set Speedup K K K 100K 1M 10M 1M M 6.4 Distributed approximate Kernel K-means (8 nodes) Approximate Kernel K-means 2.3 GHz quad-core Intel Xeon processors, with 8GB memory in the HPCC intel07 cluster

59 Summary Clustering is an exploratory technique; used in every scientific field that collects data Choice of clustering algorithm & its parameters is data dependent Clustering is essential to tackle Big Data Approximate kernel K-means provides good tradeoff between scalability & clustering accuracy Challenges: Scalability, very large no. of clusters, heterogeneous data, streaming data

Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012

Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012 Clustering Big Data Anil K. Jain (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012 Outline Big Data How to extract information? Data clustering

More information

Using Very Deep Autoencoders for Content-Based Image Retrieval

Using Very Deep Autoencoders for Content-Based Image Retrieval Using Very Deep Autoencoders for Content-Based Image Retrieval Alex Krizhevsky and Georey E. Hinton University of Toronto - Department of Computer Science 6 King's College Road, Toronto, M5S 3H5 - Canada

More information

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015 An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content

More information

Semi-supervised learning

Semi-supervised learning Semi-supervised learning Learning from both labeled and unlabeled data Semi-supervised learning Learning from both labeled and unlabeled data Motivation: labeled data may be hard/expensive to get, but

More information

Clustering / Unsupervised Methods

Clustering / Unsupervised Methods Clustering / Unsupervised Methods Jason Corso, Albert Chen SUNY at Buffalo J. Corso (SUNY at Buffalo) Clustering / Unsupervised Methods 1 / 41 Clustering Introduction Until now, we ve assumed our training

More information

An Introduction to Text Data Mining

An Introduction to Text Data Mining An Introduction to Text Data Mining Adam Zimmerman The Ohio State University Data Mining and Statistical Learning Discussion Group September 6, 2013 Adam Zimmerman (OSU) DMSL Group Intro to Text Mining

More information

Mining Big Data. Pang-Ning Tan. Associate Professor Dept of Computer Science & Engineering Michigan State University

Mining Big Data. Pang-Ning Tan. Associate Professor Dept of Computer Science & Engineering Michigan State University Mining Big Data Pang-Ning Tan Associate Professor Dept of Computer Science & Engineering Michigan State University Website: http://www.cse.msu.edu/~ptan Google Trends Big Data Smart Cities Big Data and

More information

Content-Based Image Retrieval ---Challenges & Opportunities

Content-Based Image Retrieval ---Challenges & Opportunities Content-Based Image Retrieval ---Challenges & Opportunities Jianping Fan Department of Computer Science University of North Carolina at Charlotte Charlotte, NC 28223 Networks How can I access image/video

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

More information

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin

More information

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. ref. Chapter 9. Introduction to Data Mining

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. ref. Chapter 9. Introduction to Data Mining Data Mining Cluster Analysis: Advanced Concepts and Algorithms ref. Chapter 9 Introduction to Data Mining by Tan, Steinbach, Kumar 1 Outline Prototype-based Fuzzy c-means Mixture Model Clustering Density-based

More information

The Scientific Data Mining Process

The Scientific Data Mining Process Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In

More information

Unsupervised Data Mining (Clustering)

Unsupervised Data Mining (Clustering) Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in

More information

Current Topics in Information-theoretic Data Mining

Current Topics in Information-theoretic Data Mining Current Topics in Information-theoretic Data Mining NINA HUBIG, ANNIKA TONCH Helmholtz-Hochschul research group ikdd Applications: Neuroscience, Diabetes research. Outline 1. Introduction 2. General Information

More information

Linear Algebra Methods for Data Mining

Linear Algebra Methods for Data Mining Linear Algebra Methods for Data Mining Saara Hyvönen, Saara.Hyvonen@cs.helsinki.fi Spring 2007 Overview of some topics covered and some topics not covered on this course Linear Algebra Methods for Data

More information

Concepts in Machine Learning, Unsupervised Learning & Astronomy Applications

Concepts in Machine Learning, Unsupervised Learning & Astronomy Applications Data Mining In Modern Astronomy Sky Surveys: Concepts in Machine Learning, Unsupervised Learning & Astronomy Applications Ching-Wa Yip cwyip@pha.jhu.edu; Bloomberg 518 Human are Great Pattern Recognizers

More information

Doing Multidisciplinary Research in Data Science

Doing Multidisciplinary Research in Data Science Doing Multidisciplinary Research in Data Science Assoc.Prof. Abzetdin ADAMOV CeDAWI - Center for Data Analytics and Web Insights Qafqaz University aadamov@qu.edu.az http://ce.qu.edu.az/~aadamov 16 May

More information

BIG DATA CHALLENGES AND PERSPECTIVES

BIG DATA CHALLENGES AND PERSPECTIVES BIG DATA CHALLENGES AND PERSPECTIVES Meenakshi Sharma 1, Keshav Kishore 2 1 Student of Master of Technology, 2 Head of Department, Department of Computer Science and Engineering, A P Goyal Shimla University,

More information

Big Data: Image & Video Analytics

Big Data: Image & Video Analytics Big Data: Image & Video Analytics How it could support Archiving & Indexing & Searching Dieter Haas, IBM Deutschland GmbH The Big Data Wave 60% of internet traffic is multimedia content (images and videos)

More information

KERNEL-BASED CLUSTERING OF BIG DATA. Radha Chitta A DISSERTATION

KERNEL-BASED CLUSTERING OF BIG DATA. Radha Chitta A DISSERTATION KERNEL-BASED CLUSTERING OF BIG DATA By Radha Chitta A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science Doctor of Philosophy

More information

An Enhanced Clustering Algorithm to Analyze Spatial Data

An Enhanced Clustering Algorithm to Analyze Spatial Data International Journal of Engineering and Technical Research (IJETR) ISSN: 2321-0869, Volume-2, Issue-7, July 2014 An Enhanced Clustering Algorithm to Analyze Spatial Data Dr. Mahesh Kumar, Mr. Sachin Yadav

More information

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining Data Mining Clustering (2) Toon Calders Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining Outline Partitional Clustering Distance-based K-means, K-medoids,

More information

Fast Matching of Binary Features

Fast Matching of Binary Features Fast Matching of Binary Features Marius Muja and David G. Lowe Laboratory for Computational Intelligence University of British Columbia, Vancouver, Canada {mariusm,lowe}@cs.ubc.ca Abstract There has been

More information

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics

Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics Dr. Liangxiu Han Future Networks and Distributed Systems Group (FUNDS) School of Computing, Mathematics and Digital Technology,

More information

Distance Metric Learning in Data Mining (Part I) Fei Wang and Jimeng Sun IBM TJ Watson Research Center

Distance Metric Learning in Data Mining (Part I) Fei Wang and Jimeng Sun IBM TJ Watson Research Center Distance Metric Learning in Data Mining (Part I) Fei Wang and Jimeng Sun IBM TJ Watson Research Center 1 Outline Part I - Applications Motivation and Introduction Patient similarity application Part II

More information

Clusters With Core-Tail Hierarchical Structure And Their Applications To Machine Learning Classification

Clusters With Core-Tail Hierarchical Structure And Their Applications To Machine Learning Classification Clusters With Core-Tail Hierarchical Structure And Their Applications To Machine Learning Classification Dmitriy Fradkin Ilya B. Muchnik Dept. of Computer Science DIMACS dfradkin@paul.rutgers.edu muchnik@dimacs.rutgers.edu

More information

Bag-of-features for category classification. Cordelia Schmid

Bag-of-features for category classification. Cordelia Schmid Bag-of-features for category classification Cordelia Schmid Category recognition Image classification: assigning a class label to the image Car: present Cow: present Bike: not present Horse: not present

More information

Crowdclustering with Sparse Pairwise Labels: A Matrix Completion Approach

Crowdclustering with Sparse Pairwise Labels: A Matrix Completion Approach Outline Crowdclustering with Sparse Pairwise Labels: A Matrix Completion Approach Jinfeng Yi, Rong Jin, Anil K. Jain, Shaili Jain 2012 Presented By : KHALID ALKOBAYER Crowdsourcing and Crowdclustering

More information

Big Data Analytics. An Introduction. Oliver Fuchsberger University of Paderborn 2014

Big Data Analytics. An Introduction. Oliver Fuchsberger University of Paderborn 2014 Big Data Analytics An Introduction Oliver Fuchsberger University of Paderborn 2014 Table of Contents I. Introduction & Motivation What is Big Data Analytics? Why is it so important? II. Techniques & Solutions

More information

Classification and Prediction

Classification and Prediction Classification and Prediction Slides for Data Mining: Concepts and Techniques Chapter 7 Jiawei Han and Micheline Kamber Intelligent Database Systems Research Lab School of Computing Science Simon Fraser

More information

Efficient visual search of local features. Cordelia Schmid

Efficient visual search of local features. Cordelia Schmid Efficient visual search of local features Cordelia Schmid Visual search change in viewing angle Matches 22 correct matches Image search system for large datasets Large image dataset (one million images

More information

Part II: Web Content Mining Chapter 3: Clustering

Part II: Web Content Mining Chapter 3: Clustering Part II: Web Content Mining Chapter 3: Clustering Learning by Example and Clustering Hierarchical Agglomerative Clustering K-Means Clustering Probability-Based Clustering Collaborative Filtering Slides

More information

Predict Influencers in the Social Network

Predict Influencers in the Social Network Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

More information

CPSC 340: Machine Learning and Data Mining. K-Means Clustering Fall 2015

CPSC 340: Machine Learning and Data Mining. K-Means Clustering Fall 2015 CPSC 340: Machine Learning and Data Mining K-Means Clustering Fall 2015 Admin Assignment 1 solutions posted after class. Tutorials for Assignment 2 on Monday. Random Forests Random forests are one of the

More information

Recognition: A machine learning approach. Slides adapted from Fei-Fei Li, Rob Fergus, Antonio Torralba, Kristen Grauman, and Derek Hoiem

Recognition: A machine learning approach. Slides adapted from Fei-Fei Li, Rob Fergus, Antonio Torralba, Kristen Grauman, and Derek Hoiem Recognition: A machine learning approach Slides adapted from Fei-Fei Li, Rob Fergus, Antonio Torralba, Kristen Grauman, and Derek Hoiem The machine learning framework Apply a prediction function to a feature

More information

Mobile Phone APP Software Browsing Behavior using Clustering Analysis

Mobile Phone APP Software Browsing Behavior using Clustering Analysis Proceedings of the 2014 International Conference on Industrial Engineering and Operations Management Bali, Indonesia, January 7 9, 2014 Mobile Phone APP Software Browsing Behavior using Clustering Analysis

More information

Topic and Trend Detection in Text Collections using Latent Dirichlet Allocation

Topic and Trend Detection in Text Collections using Latent Dirichlet Allocation Topic and Trend Detection in Text Collections using Latent Dirichlet Allocation Levent Bolelli 1, Şeyda Ertekin 2, and C. Lee Giles 3 1 Google Inc., 76 9 th Ave., 4 th floor, New York, NY 10011, USA 2

More information

. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns

. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns Outline Part 1: of data clustering Non-Supervised Learning and Clustering : Problem formulation cluster analysis : Taxonomies of Clustering Techniques : Data types and Proximity Measures : Difficulties

More information

Active Learning SVM for Blogs recommendation

Active Learning SVM for Blogs recommendation Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the

More information

Watson A System Designed for Answers

Watson A System Designed for Answers IBM Systems and Technology February 2011 An IBM White Paper Watson A System Designed for Answers The future of workload optimized systems design 2 Watson A System Designed for Answers Executive summary

More information

Face Recognition using SIFT Features

Face Recognition using SIFT Features Face Recognition using SIFT Features Mohamed Aly CNS186 Term Project Winter 2006 Abstract Face recognition has many important practical applications, like surveillance and access control.

More information

A Survey on Issues and Challenges in Handling Big Data

A Survey on Issues and Challenges in Handling Big Data A Survey on Issues and Challenges in Handling Big Data Sandeep K N Usha R G Dept of Information Science and Engineering Dept of Information Science and Engineering JSS Academy of Technical Education, Bangalore,

More information

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca Clustering Adrian Groza Department of Computer Science Technical University of Cluj-Napoca Outline 1 Cluster Analysis What is Datamining? Cluster Analysis 2 K-means 3 Hierarchical Clustering What is Datamining?

More information

Fig. 1 A typical Knowledge Discovery process [2]

Fig. 1 A typical Knowledge Discovery process [2] Volume 4, Issue 7, July 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Review on Clustering

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 by Tan, Steinbach, Kumar 1 What is Cluster Analysis? Finding groups of objects such that the objects in a group will

More information

Statistical Models in Data Mining

Statistical Models in Data Mining Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of

More information

Steven C.H. Hoi School of Information Systems Singapore Management University Email: chhoi@smu.edu.sg

Steven C.H. Hoi School of Information Systems Singapore Management University Email: chhoi@smu.edu.sg Steven C.H. Hoi School of Information Systems Singapore Management University Email: chhoi@smu.edu.sg Introduction http://stevenhoi.org/ Finance Recommender Systems Cyber Security Machine Learning Visual

More information

Mixture Models. Jia Li. Department of Statistics The Pennsylvania State University. Mixture Models

Mixture Models. Jia Li. Department of Statistics The Pennsylvania State University. Mixture Models Mixture Models Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Clustering by Mixture Models General bacground on clustering Example method: -means Mixture model based

More information

CLASSIFYING NETWORK TRAFFIC IN THE BIG DATA ERA

CLASSIFYING NETWORK TRAFFIC IN THE BIG DATA ERA CLASSIFYING NETWORK TRAFFIC IN THE BIG DATA ERA Professor Yang Xiang Network Security and Computing Laboratory (NSCLab) School of Information Technology Deakin University, Melbourne, Australia http://anss.org.au/nsclab

More information

Distributed forests for MapReduce-based machine learning

Distributed forests for MapReduce-based machine learning Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication

More information

Lecture 13: Clustering and K-means

Lecture 13: Clustering and K-means Lecture 13: Clustering and K-means CSE 6040 Fall 2014 CSE 6040 K-means 1 / 22 What is clustering? Unsupervised learning: Given an unlabeled data set (no prior knowledge), how can we find structures/patterns

More information

A Map Reduce Framework for Programming Graphics Processors. Bryan Catanzaro Narayanan Sundaram Kurt Keutzer

A Map Reduce Framework for Programming Graphics Processors. Bryan Catanzaro Narayanan Sundaram Kurt Keutzer A Map Reduce Framework for Programming Graphics Processors Bryan Catanzaro Narayanan Sundaram Kurt Keutzer Overview Map Reduce is a good abstraction to map to GPUs It is easy for programmers to understand

More information

Map-Reduce for Machine Learning on Multicore

Map-Reduce for Machine Learning on Multicore Map-Reduce for Machine Learning on Multicore Chu, et al. Problem The world is going multicore New computers - dual core to 12+-core Shift to more concurrent programming paradigms and languages Erlang,

More information

Feature Selection vs. Extraction

Feature Selection vs. Extraction Feature Selection In many applications, we often encounter a very large number of potential features that can be used Which subset of features should be used for the best classification? Need for a small

More information

Clustering. Unsupervised learning Generating classes Distance/similarity measures Agglomerative methods Divisive methods. CS 5751 Machine Learning

Clustering. Unsupervised learning Generating classes Distance/similarity measures Agglomerative methods Divisive methods. CS 5751 Machine Learning Clustering Unsupervised learning Generating classes Distance/similarity measures Agglomerative methods Divisive methods Data Clustering 1 What is Clustering? Form of unsupervised learning - no information

More information

Machine Learning with MATLAB U. M. Sundar Senior Application Engineer MathWorks India

Machine Learning with MATLAB U. M. Sundar Senior Application Engineer MathWorks India Machine Learning with MATLAB U. M. Sundar Senior Application Engineer MathWorks India 2013 The MathWorks, Inc. 1 Agenda Machine Learning Overview Machine Learning with MATLAB Unsupervised Learning Clustering

More information

Information Management course

Information Management course Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 01 : 06/10/2015 Practical informations: Teacher: Alberto Ceselli (alberto.ceselli@unimi.it)

More information

The Big Data Paradigm Shift. Insight Through Automation

The Big Data Paradigm Shift. Insight Through Automation The Big Data Paradigm Shift Insight Through Automation Agenda The Problem Emcien s Solution: Algorithms solve data related business problems How Does the Technology Work? Case Studies 2013 Emcien, Inc.

More information

SIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs

SIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs SIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs Fabian Hueske, TU Berlin June 26, 21 1 Review This document is a review report on the paper Towards Proximity Pattern Mining in Large

More information

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS Dwijesh C. Mishra I.A.S.R.I., Library Avenue, New Delhi-110 012 dcmishra@iasri.res.in What is Learning? "Learning denotes changes in a system that enable

More information

Analyzing the Dynamics of Research by Extracting Key Aspects of Scientific Papers

Analyzing the Dynamics of Research by Extracting Key Aspects of Scientific Papers Analyzing the Dynamics of Research by Extracting Key Aspects of Scientific Papers Sonal Gupta Christopher Manning Natural Language Processing Group Department of Computer Science Stanford University Columbia

More information

MODULE 15 Clustering Large Datasets LESSON 34

MODULE 15 Clustering Large Datasets LESSON 34 MODULE 15 Clustering Large Datasets LESSON 34 Incremental Clustering Keywords: Single Database Scan, Leader, BIRCH, Tree 1 Clustering Large Datasets Pattern matrix It is convenient to view the input data

More information

arxiv: v1 [cs.cv] 4 Feb 2016

arxiv: v1 [cs.cv] 4 Feb 2016 RANDOM FEATURE MAPS VIA A LAYERED RANDOM PROJECTION (LARP) FRAMEWORK FOR OBJECT CLASSIFICATION A. G. Chung, M. J. Shafiee, and A. Wong Vision & Image Processing Research Group, System Design Engineering

More information

Neural Networks Lesson 5 - Cluster Analysis

Neural Networks Lesson 5 - Cluster Analysis Neural Networks Lesson 5 - Cluster Analysis Prof. Michele Scarpiniti INFOCOM Dpt. - Sapienza University of Rome http://ispac.ing.uniroma1.it/scarpiniti/index.htm michele.scarpiniti@uniroma1.it Rome, 29

More information

Overview. Introduction to Information Retrieval. K means. Evaluation. How many clusters? TDT4215 Web intelligence Based on slides by:

Overview. Introduction to Information Retrieval. K means. Evaluation. How many clusters? TDT4215 Web intelligence Based on slides by: Introduction to Information Retrieval TDT4215 Web intelligence Based on slides by: Hinrich Schütze and Christina Lioma Chapter 16: Flat Clustering 1 Overview ❶ Clustering: Introduction ti ❸ Clustering

More information

Classifying Runtime Performance with SVM

Classifying Runtime Performance with SVM Classifying Runtime Performance with SVM David Eliahu Shaddi Hasan Omer Spillinger May 14, 2013 Abstract We present a machine-learning based technique for the problem of algorithm selection, specifically

More information

Foundations of Artificial Intelligence. Introduction to Data Mining

Foundations of Artificial Intelligence. Introduction to Data Mining Foundations of Artificial Intelligence Introduction to Data Mining Objectives Data Mining Introduce a range of data mining techniques used in AI systems including : Neural networks Decision trees Present

More information

9.54 Class 13. Unsupervised learning Clustering. Shimon Ullman + Tomaso Poggio Danny Harari + Daneil Zysman + Darren Seibert

9.54 Class 13. Unsupervised learning Clustering. Shimon Ullman + Tomaso Poggio Danny Harari + Daneil Zysman + Darren Seibert 9.54 Class 13 Unsupervised learning Clustering Shimon Ullman + Tomaso Poggio Danny Harari + Daneil Zysman + Darren Seibert Outline Introduction to clustering K-means Bag of words (dictionary learning)

More information

Introduction to the Mathematics of Big Data. Philippe B. Laval

Introduction to the Mathematics of Big Data. Philippe B. Laval Introduction to the Mathematics of Big Data Philippe B. Laval Fall 2015 Introduction In recent years, Big Data has become more than just a buzz word. Every major field of science, engineering, business,

More information

Data, Measurements, Features

Data, Measurements, Features Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are

More information

Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014

Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014 Beyond Watson: Predictive Analytics and Big Data U.S. National Security Agency Research Directorate - R6 Technical Report February 3, 2014 300 years before Watson there was Euler! The first (Jeopardy!)

More information

Classification and Prediction

Classification and Prediction Classification and Prediction Classification and Prediction What is classification? What is regression? Issues regarding classification and prediction Classification by decision tree induction Bayesian

More information

Recognizing Cats and Dogs with Shape and Appearance based Models. Group Member: Chu Wang, Landu Jiang

Recognizing Cats and Dogs with Shape and Appearance based Models. Group Member: Chu Wang, Landu Jiang Recognizing Cats and Dogs with Shape and Appearance based Models Group Member: Chu Wang, Landu Jiang Abstract Recognizing cats and dogs from images is a challenging competition raised by Kaggle platform

More information

A Novel Framework for Incorporating Labeled Examples into Anomaly Detection

A Novel Framework for Incorporating Labeled Examples into Anomaly Detection A Novel Framework for Incorporating Labeled Examples into Anomaly Detection Jing Gao Haibin Cheng Pang-Ning Tan Abstract This paper presents a principled approach for incorporating labeled examples into

More information

Big Data Analytics. Genoveva Vargas-Solar http://www.vargas-solar.com/big-data-analytics French Council of Scientific Research, LIG & LAFMIA Labs

Big Data Analytics. Genoveva Vargas-Solar http://www.vargas-solar.com/big-data-analytics French Council of Scientific Research, LIG & LAFMIA Labs 1 Big Data Analytics Genoveva Vargas-Solar http://www.vargas-solar.com/big-data-analytics French Council of Scientific Research, LIG & LAFMIA Labs Montevideo, 22 nd November 4 th December, 2015 INFORMATIQUE

More information

Scalable and High Performance Computing for Big Data Analytics in Understanding the Human Dynamics in the Mobile Age

Scalable and High Performance Computing for Big Data Analytics in Understanding the Human Dynamics in the Mobile Age Scalable and High Performance Computing for Big Data Analytics in Understanding the Human Dynamics in the Mobile Age Xuan Shi GRA: Bowei Xue University of Arkansas Spatiotemporal Modeling of Human Dynamics

More information

SymNMF: Symmetric Nonnegative Matrix Factorization for Graph Clustering

SymNMF: Symmetric Nonnegative Matrix Factorization for Graph Clustering SymNMF: Symmetric Nonnegative Matrix Factorization for Graph Clustering Da Kuang 1, Chris Ding 2, Haesun Park 1 1 Georgia Institute of Technology 2 University of Texas, Arlington Apr. 26, 2012 2012 SIAM

More information

Lecture 20: Clustering

Lecture 20: Clustering Lecture 20: Clustering Wrap-up of neural nets (from last lecture Introduction to unsupervised learning K-means clustering COMP-424, Lecture 20 - April 3, 2013 1 Unsupervised learning In supervised learning,

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining Jay Urbain Credits: Nazli Goharian & David Grossman @ IIT Outline Introduction Data Pre-processing Data Mining Algorithms Naïve Bayes Decision Tree Neural Network Association

More information

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototype-based clustering Density-based clustering Graph-based

More information

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 http://intelligentoptimization.org/lionbook Roberto Battiti

More information

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 20, NO. 7, JULY 2009 1181

IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 20, NO. 7, JULY 2009 1181 IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 20, NO. 7, JULY 2009 1181 The Global Kernel k-means Algorithm for Clustering in Feature Space Grigorios F. Tzortzis and Aristidis C. Likas, Senior Member, IEEE

More information

LEARNING TECHNIQUES AND APPLICATIONS

LEARNING TECHNIQUES AND APPLICATIONS MSc Course MACHINE LEARNING TECHNIQUES AND APPLICATIONS Classification with GMM + Bayes 1 Clustering, semi-supervised clustering and classification Clustering No labels for the points! Semi-supervised

More information

6.867 Machine learning

6.867 Machine learning 6.867 Machine learning Mid-term exam October 15, 3 ( points) Your name and MIT ID: SOLUTIONS Problem 1 Suppose we are trying to solve an active learning problem, where the possible inputs you can select

More information

An Algorithm for Automated View Reduction in Weighted Clustering of Multiview Data

An Algorithm for Automated View Reduction in Weighted Clustering of Multiview Data An Algorithm for Automated View Reduction in Weighted Clustering of Multiview Data N. Aparna PG Scholar Department of Computer Science Sri Ramakrishna College of Engineering, Coimbatore M. Kalaiarasu Associate

More information

CHAPTER 5 SOFT CLUSTERING BASED MULTIPLE DICTIONARY BAG OF WORDS FOR IMAGE RETRIEVAL

CHAPTER 5 SOFT CLUSTERING BASED MULTIPLE DICTIONARY BAG OF WORDS FOR IMAGE RETRIEVAL 84 CHAPTER 5 SOFT CLUSTERING BASED MULTIPLE DICTIONARY BAG OF WORDS FOR IMAGE RETRIEVAL Object classification is a highly important area of computer vision and has many applications including robotics,

More information

COMPARATIVE INVESTIGATIONS AND PERFORMANCE ANALYSIS OF FCM AND MFPCM ALGORITHMS ON IRIS DATA

COMPARATIVE INVESTIGATIONS AND PERFORMANCE ANALYSIS OF FCM AND MFPCM ALGORITHMS ON IRIS DATA COMPARATIVE INVESTIGATIONS AND PERFORMANCE ANALYSIS OF FCM AND MFPCM ALGORITHMS ON IRIS DATA VUDA SREENIVASA RAO * Research Scholar, CSIT Department JNT University, Hyderabad. Andhra Pradesh, India. ABSTRACT:

More information

IJCSES Vol.7 No.4 October 2013 pp.165-168 Serials Publications BEHAVIOR PERDITION VIA MINING SOCIAL DIMENSIONS

IJCSES Vol.7 No.4 October 2013 pp.165-168 Serials Publications BEHAVIOR PERDITION VIA MINING SOCIAL DIMENSIONS IJCSES Vol.7 No.4 October 2013 pp.165-168 Serials Publications BEHAVIOR PERDITION VIA MINING SOCIAL DIMENSIONS V.Sudhakar 1 and G. Draksha 2 Abstract:- Collective behavior refers to the behaviors of individuals

More information

Prototype Methods: K-Means

Prototype Methods: K-Means Prototype Methods: K-Means Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Prototype Methods Essentially model-free. Not useful for understanding the nature of the

More information

Concept and Project Objectives

Concept and Project Objectives 3.1 Publishable summary Concept and Project Objectives Proactive and dynamic QoS management, network intrusion detection and early detection of network congestion problems among other applications in the

More information

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With

More information

Data Clustering: K-means and Hierarchical Clustering

Data Clustering: K-means and Hierarchical Clustering Data Clustering: K-means and Hierarchical Clustering Piyush Rai CS5350/6350: Machine Learning October 4, 2011 (CS5350/6350) Data Clustering October 4, 2011 1 / 24 What is Data Clustering? Data Clustering

More information

Fast Visual Vocabulary Construction for Image Retrieval using Skewed-Split k-d trees

Fast Visual Vocabulary Construction for Image Retrieval using Skewed-Split k-d trees Fast Visual Vocabulary Construction for Image Retrieval using Skewed-Split k-d trees Ilias Gialampoukidis, Stefanos Vrochidis, and Ioannis Kompatsiaris Information Technologies Institute, CERTH, Thessaloniki,

More information

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

More information

Data Clustering. Dec 2nd, 2013 Kyrylo Bessonov

Data Clustering. Dec 2nd, 2013 Kyrylo Bessonov Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms k-means Hierarchical Main

More information

Pit Pattern Classification with Support Vector Machines and Neural Networ

Pit Pattern Classification with Support Vector Machines and Neural Networ Pit Pattern Classification with s and s February 1, 2007 Pit Pattern Classification with s and Neural Networ Introduction Pit Pattern Classification with s and Neural Networ Objectives Investigation of

More information

Introduction to Weka- A Toolkit for Machine Learning

Introduction to Weka- A Toolkit for Machine Learning 1. Introduction Weka is open source software under the GNU General Public License. System is developed at the University of Waikato in New Zealand. Weka stands for the Waikato Environment for Knowledge

More information

Edge Detection. Computer Vision (CS 543 / ECE 549) University of Illinois Derek Hoiem 02/05/15. Magritte, Decalcomania

Edge Detection. Computer Vision (CS 543 / ECE 549) University of Illinois Derek Hoiem 02/05/15. Magritte, Decalcomania Edge Detection 02/05/15 Magritte, Decalcomania Computer Vision (CS 543 / ECE 549) University of Illinois Derek Hoiem Many slides from Lana Lazebnik, Steve Seitz, David Forsyth, David Lowe, Fei-Fei Li Last

More information