Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University September 19, 2012


 Alexis Bishop
 1 years ago
 Views:
Transcription
1 Clustering Big Data Anil K. Jain (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University September 19, 2012
2 EMart No. of items sold per day = 139x2000x20 = ~6 million items/day (Walmart sells ~276 million items/day) Find groups (clusters) of customers with similar purchasing pattern Cocluster customers and items to find which items are purchased together; useful for shelf management, marketing promotion,..
3 Outline Big Data How to extract information? Data clustering Clustering Big Data Kernel Kmeans & approximation Summary
4 Big Data
5 How Big is Big Data? Big is a fast moving target: kilobytes, megabytes, gigabytes, terabytes (10 12 ), petabytes (10 15 ), exabytes (10 18 ), zettabytes (10 21 ), Over 1.8 zb created in 2011; ~8 zb by 2015 D a t a s i z e E x a b y t e s Source: IDC s Digital Universe study, sponsored by EMC, June As of June 2012 Nature of Big Data: Volume, Velocity and Variety
6 Big Data on the Web Over 225 million users generating over 800 tweets per second ~900 million users, 2.5 billion content items, 105 terabytes of data each half hour, 300M photos and 4M videos posted per day
7 Big Data on the Web Over 50 billion pages indexed and more than 2 million queries/min Articles from over 10,000 sources in real time ~4.5 million photos uploaded/day 48 hours of video uploaded/min; more than 1 trillion video views
8 What to do with Big Data? Extract information to make decisions Evidencebased decision: datadriven vs. analysis based on intuition & experience Analytics, business intelligence, data mining, machine learning, pattern recognition Big Data computing: IBM is promoting Watson (Jeopardy champion) to tackle Big Data in healthcare, finance, drug design,.. Steve Lohr, Amid the Flood, A Catchphrase is Born, NY Times, August 12, 2012
9 Decision Making Data Representation Features and similarity Learning Classification (supervised learning) Clustering (unsupervised learning) Semisupervised clustering 9
10 Pattern Matrix n x d pattern matrix
11 Similarity Matrix Polynomial kernel: T x y 4 K( x, y) n x n similarity matrix
12 Classification Dogs Cats Given a training set of labeled objects, learn a decision rule
13 Clustering Given a collection of (unlabeled) objects, find meaningful groups
14 Semisupervised Clustering Supervised Unsupervised Dogs Cats Semisupervised Pairwise constraints improve the clustering performance
15 What is a cluster? A group of the same or similar elements gathered or occurring closely together Galaxy clusters Birdhouse clusters Cluster munition Cluster computing Cluster lights Hongkeng Tulou cluster
16 Clusters in 2D
17 Clusters in 2D
18 Clusters in 2D
19 Clusters in 2D
20 Clusters in 2D
21 Challenges in Data Clustering Measure of similarity No. of clusters Cluster validity Outliers
22 Data Clustering Organize a collection of n objects into a partition or a hierarchy (nested set of partitions) Data clustering returned ~6,100 hits for 2011 alone (Google Scholar)
23 Clustering is the Key to Big Data Problem Not feasible to label large collection of objects No prior knowledge of the number and nature of groups (clusters) in data Clusters may evolve over time Clustering leads to more efficient browsing, searching, recommendation and classification/organization of Big Data
24 Clustering Users on Facebook ~300,000 status updates per minute on tens of thousands of topics Cluster status messages from different users based on the topic
25 Clustering Articles on Google News Topic cluster Article Listings
26 Clustering Videos on Youube Keywords Popularity Viewer engagement User browsing history
27 Clustering for Efficient Image retrieval Retrieval without clustering Retrieval with clustering Fig. 1. Upperleft image is the query. Numbers under the images on left side: image ID and cluster ID; on the right side: Image ID, matching score, number of regions. Retrieval accuracy for the food category (average precision): Without clustering: 47% With clustering: 61% Chen et al., CLUE: clusterbased retrieval of images by unsupervised learning, IEEE Tans. On Image Processing, 2005.
28 Clustering Algorithms Hundreds of clustering algorithms are available; many are admissible, but no algorithm is optimal Kmeans Gaussian mixture models Kernel Kmeans Spectral Clustering Nearest neighbor Latent Dirichlet Allocation
29 Kmeans Algorithm
30 Kmeans Algorithm Randomly assign cluster labels to the data points
31 Kmeans Algorithm Compute the center of each cluster
32 Kmeans Algorithm Assign points to the nearest cluster center
33 Kmeans Algorithm Recompute centers
34 Kmeans Algorithm Repeat until there is no change in the cluster labels
35 Kmeans: Limitations Good at finding compact and isolated clusters n K min u ik x i c k 2 i=1 k=1
36 Gaussian Mixture Model Figueiredo & Jain, Unsupervised Learning of Finite Mixture Models, PAMI, 2002
37 Kernel Kmeans Nonlinear mapping to find clusters of arbitrary shapes n K min Trace u ik φ x i c k φ φ x i c k φ T i=1 k=1 φ φ x, y = x 2, 2xy, y 2 T Polynomial kernel representation
38 Spectral Clustering Represent data using the top K eigenvectors of the similarity matrix; equivalent to Kernel Kmeans
39 Kmeans vs. Kernel Kmeans Data Kmeans Kernel Kmeans Kernel clustering is able to find complex clusters. But how to choose the right kernel?
40 Kernel Kmeans is Expensive No. of operations No. of Objects (n) Kmeans Kernel Kmeans O(nKd) O(n 2 K) 1M (6412) M M B d = 10,000; K=10 * Runtime in seconds on Intel Xeon 2.8 GHz processor using 40 GB memory A petascale supercomputer (IBM Sequoia, June 2012) with ~1 exabyte memory is needed to run kernel Kmeans on1 billion points!
41 Clustering Big Data Sampling Data n x n similarity matrix Preprocessing Clustering Summarization Incremental Distributed Approximation Cluster labels
42 (s) Distributed Clustering Kmeans on 1 billion 2D points with 2 clusters 2.3 GHz quadcore Intel Xeon processors, with 8GB memory in the HPCC intel07 cluster Number of processors Speedup Network communication cost increases with the no. of processors
43 Approximate kernel Kmeans Tradeoff between clustering accuracy and running time Given n points in ddimensional space Chitta, Jin, Havens & Jain, Approximate Kernel kmeans: solution to Large Scale Kernel Clustering, KDD, 2011
44 Approximate kernel Kmeans Tradeoff between clustering accuracy and running time Randomly sample m points y 1, y 2, y m, m n and compute the kernel similarity matrices K A (m m) and K B (n m) (K A ) ij = φ(y i ) T φ(y j ) (K B ) ij = φ(x i ) T φ(y j ) Chitta, Jin, Havens & Jain, Approximate Kernel kmeans: solution to Large Scale Kernel Clustering, KDD, 2011
45 Approximate kernel Kmeans Tradeoff between clustering accuracy and running time Iteratively optimize for the cluster centers min U max α K n k=1 i=1 u ik φ x i α jk φ(y j ) Chitta, Jin, Havens & Jain, Approximate Kernel kmeans: solution to Large Scale Kernel Clustering, KDD, 2011 m j=1 (equivalent to running Kmeans on K B K A 1 K B T )
46 Approximate kernel Kmeans Tradeoff between clustering accuracy and running time Obtain the final cluster labels Linear runtime and memory complexity Chitta, Jin, Havens & Jain, Approximate Kernel kmeans: solution to Large Scale Kernel Clustering, KDD, 2011
47 Approximate Kernel KMeans Running time (seconds) Clustering accuracy (%) No. of objects (n) Kernel K means Approximate kernel K means (m=100) Kmeans Kernel Kmeans Approximate kernel K means (m=100) Kmeans 10K K M M GHz processor, 40 GB
48 Tiny Image Data set ~80 million 32x32 images from ~75K classes (bamboo, fish, mushroom, leaf, mountain, ); image represented by 384 dim. GIST descriptors Fergus et al., 80 million tiny images: a large dataset for nonparametric object and scene recognition, PAMI 2008
49 Tiny Image Data set 10class subset (CIFAR10): 60K manually annotated images Airplane Automobile Bird Cat Deer Dog Frog Horse Ship Truck Krizhevsky, Learning multiple layers of features from tiny images, 2009
50 Clustering Tiny Images C 1 C 2 C 3 Average clustering time (100 clusters) Approximate kernel K means (m=1,000) 8.5 hours C 4 Kmeans 6 hours C 5 Example Clusters 2.3GHz, 150GB memory
51 Clustering Tiny Images Best Supervised Classification Accuracy on CIFAR10: 54.7% Clustering accuracy Kernel Kmeans 29.94% Approximate kernel Kmeans (m = 5,000) 29.76% Spectral clustering 27.09% Kmeans 26.70% Out of 6,000 automobile images, Kmeans places 4,090 images in wrong clusters; for approximate Kernel kmeans this number is 3,788 images Ranzato et. Al., Modeling pixel means and covariances using factorized thirdorder boltzmann machines, CVPR 2010 Fowlkes et al., Spectral grouping using the Nystrom method, PAMI 2004
52 Distributed Approx. Kernel Kmeans For better scalability and faster clustering Given n points in ddimensional space
53 Distributed Approx. Kernel Kmeans For better scalability and faster clustering Randomly sample m points (m << n)
54 Distributed Approx. Kernel Kmeans For better scalability and faster clustering Split the remaining n  m randomly into p partitions and assign partition P t to task t
55 Distributed Approx. Kernel Kmeans For better scalability and faster clustering Run approximate kernel Kmeans in each task t and find the cluster centers
56 Assign each point in task s (s t) to the closest center from task t Distributed Approx. Kernel Kmeans For better scalability and faster clustering
57 Distributed Approx. Kernel Kmeans For better scalability and faster clustering Combine the labels from each task using ensemble clustering algorithm
58 Distributed Approximate kernel Kmeans 2D data set with 2 concentric circles 7000 Running time Size of data set Speedup K K K 100K 1M 10M 1M M 6.4 Distributed approximate Kernel Kmeans (8 nodes) Approximate Kernel Kmeans 2.3 GHz quadcore Intel Xeon processors, with 8GB memory in the HPCC intel07 cluster
59 Summary Clustering is an exploratory technique; used in every scientific field that collects data Choice of clustering algorithm & its parameters is data dependent Clustering is essential to tackle Big Data Approximate kernel Kmeans provides good tradeoff between scalability & clustering accuracy Challenges: Scalability, very large no. of clusters, heterogeneous data, streaming data
Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012
Clustering Big Data Anil K. Jain (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012 Outline Big Data How to extract information? Data clustering
More informationUsing Very Deep Autoencoders for ContentBased Image Retrieval
Using Very Deep Autoencoders for ContentBased Image Retrieval Alex Krizhevsky and Georey E. Hinton University of Toronto  Department of Computer Science 6 King's College Road, Toronto, M5S 3H5  Canada
More informationAn Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015
An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content
More informationSemisupervised learning
Semisupervised learning Learning from both labeled and unlabeled data Semisupervised learning Learning from both labeled and unlabeled data Motivation: labeled data may be hard/expensive to get, but
More informationClustering / Unsupervised Methods
Clustering / Unsupervised Methods Jason Corso, Albert Chen SUNY at Buffalo J. Corso (SUNY at Buffalo) Clustering / Unsupervised Methods 1 / 41 Clustering Introduction Until now, we ve assumed our training
More informationAn Introduction to Text Data Mining
An Introduction to Text Data Mining Adam Zimmerman The Ohio State University Data Mining and Statistical Learning Discussion Group September 6, 2013 Adam Zimmerman (OSU) DMSL Group Intro to Text Mining
More informationMining Big Data. PangNing Tan. Associate Professor Dept of Computer Science & Engineering Michigan State University
Mining Big Data PangNing Tan Associate Professor Dept of Computer Science & Engineering Michigan State University Website: http://www.cse.msu.edu/~ptan Google Trends Big Data Smart Cities Big Data and
More informationContentBased Image Retrieval Challenges & Opportunities
ContentBased Image Retrieval Challenges & Opportunities Jianping Fan Department of Computer Science University of North Carolina at Charlotte Charlotte, NC 28223 Networks How can I access image/video
More informationInternational Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, MayJun 2014
RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer
More informationLargeScale Data Sets Clustering Based on MapReduce and Hadoop
Journal of Computational Information Systems 7: 16 (2011) 59565963 Available at http://www.jofcis.com LargeScale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE
More informationEM Clustering Approach for MultiDimensional Analysis of Big Data Set
EM Clustering Approach for MultiDimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin
More informationData Mining Cluster Analysis: Advanced Concepts and Algorithms. ref. Chapter 9. Introduction to Data Mining
Data Mining Cluster Analysis: Advanced Concepts and Algorithms ref. Chapter 9 Introduction to Data Mining by Tan, Steinbach, Kumar 1 Outline Prototypebased Fuzzy cmeans Mixture Model Clustering Densitybased
More informationThe Scientific Data Mining Process
Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In
More informationUnsupervised Data Mining (Clustering)
Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in
More informationCurrent Topics in Informationtheoretic Data Mining
Current Topics in Informationtheoretic Data Mining NINA HUBIG, ANNIKA TONCH HelmholtzHochschul research group ikdd Applications: Neuroscience, Diabetes research. Outline 1. Introduction 2. General Information
More informationLinear Algebra Methods for Data Mining
Linear Algebra Methods for Data Mining Saara Hyvönen, Saara.Hyvonen@cs.helsinki.fi Spring 2007 Overview of some topics covered and some topics not covered on this course Linear Algebra Methods for Data
More informationConcepts in Machine Learning, Unsupervised Learning & Astronomy Applications
Data Mining In Modern Astronomy Sky Surveys: Concepts in Machine Learning, Unsupervised Learning & Astronomy Applications ChingWa Yip cwyip@pha.jhu.edu; Bloomberg 518 Human are Great Pattern Recognizers
More informationDoing Multidisciplinary Research in Data Science
Doing Multidisciplinary Research in Data Science Assoc.Prof. Abzetdin ADAMOV CeDAWI  Center for Data Analytics and Web Insights Qafqaz University aadamov@qu.edu.az http://ce.qu.edu.az/~aadamov 16 May
More informationBIG DATA CHALLENGES AND PERSPECTIVES
BIG DATA CHALLENGES AND PERSPECTIVES Meenakshi Sharma 1, Keshav Kishore 2 1 Student of Master of Technology, 2 Head of Department, Department of Computer Science and Engineering, A P Goyal Shimla University,
More informationBig Data: Image & Video Analytics
Big Data: Image & Video Analytics How it could support Archiving & Indexing & Searching Dieter Haas, IBM Deutschland GmbH The Big Data Wave 60% of internet traffic is multimedia content (images and videos)
More informationKERNELBASED CLUSTERING OF BIG DATA. Radha Chitta A DISSERTATION
KERNELBASED CLUSTERING OF BIG DATA By Radha Chitta A DISSERTATION Submitted to Michigan State University in partial fulfillment of the requirements for the degree of Computer Science Doctor of Philosophy
More informationAn Enhanced Clustering Algorithm to Analyze Spatial Data
International Journal of Engineering and Technical Research (IJETR) ISSN: 23210869, Volume2, Issue7, July 2014 An Enhanced Clustering Algorithm to Analyze Spatial Data Dr. Mahesh Kumar, Mr. Sachin Yadav
More informationData Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining
Data Mining Clustering (2) Toon Calders Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining Outline Partitional Clustering Distancebased Kmeans, Kmedoids,
More informationFast Matching of Binary Features
Fast Matching of Binary Features Marius Muja and David G. Lowe Laboratory for Computational Intelligence University of British Columbia, Vancouver, Canada {mariusm,lowe}@cs.ubc.ca Abstract There has been
More informationSurfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics
Surfing the Data Tsunami: A New Paradigm for Big Data Processing and Analytics Dr. Liangxiu Han Future Networks and Distributed Systems Group (FUNDS) School of Computing, Mathematics and Digital Technology,
More informationDistance Metric Learning in Data Mining (Part I) Fei Wang and Jimeng Sun IBM TJ Watson Research Center
Distance Metric Learning in Data Mining (Part I) Fei Wang and Jimeng Sun IBM TJ Watson Research Center 1 Outline Part I  Applications Motivation and Introduction Patient similarity application Part II
More informationClusters With CoreTail Hierarchical Structure And Their Applications To Machine Learning Classification
Clusters With CoreTail Hierarchical Structure And Their Applications To Machine Learning Classification Dmitriy Fradkin Ilya B. Muchnik Dept. of Computer Science DIMACS dfradkin@paul.rutgers.edu muchnik@dimacs.rutgers.edu
More informationBagoffeatures for category classification. Cordelia Schmid
Bagoffeatures for category classification Cordelia Schmid Category recognition Image classification: assigning a class label to the image Car: present Cow: present Bike: not present Horse: not present
More informationCrowdclustering with Sparse Pairwise Labels: A Matrix Completion Approach
Outline Crowdclustering with Sparse Pairwise Labels: A Matrix Completion Approach Jinfeng Yi, Rong Jin, Anil K. Jain, Shaili Jain 2012 Presented By : KHALID ALKOBAYER Crowdsourcing and Crowdclustering
More informationBig Data Analytics. An Introduction. Oliver Fuchsberger University of Paderborn 2014
Big Data Analytics An Introduction Oliver Fuchsberger University of Paderborn 2014 Table of Contents I. Introduction & Motivation What is Big Data Analytics? Why is it so important? II. Techniques & Solutions
More informationClassification and Prediction
Classification and Prediction Slides for Data Mining: Concepts and Techniques Chapter 7 Jiawei Han and Micheline Kamber Intelligent Database Systems Research Lab School of Computing Science Simon Fraser
More informationEfficient visual search of local features. Cordelia Schmid
Efficient visual search of local features Cordelia Schmid Visual search change in viewing angle Matches 22 correct matches Image search system for large datasets Large image dataset (one million images
More informationPart II: Web Content Mining Chapter 3: Clustering
Part II: Web Content Mining Chapter 3: Clustering Learning by Example and Clustering Hierarchical Agglomerative Clustering KMeans Clustering ProbabilityBased Clustering Collaborative Filtering Slides
More informationPredict Influencers in the Social Network
Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons
More informationCPSC 340: Machine Learning and Data Mining. KMeans Clustering Fall 2015
CPSC 340: Machine Learning and Data Mining KMeans Clustering Fall 2015 Admin Assignment 1 solutions posted after class. Tutorials for Assignment 2 on Monday. Random Forests Random forests are one of the
More informationRecognition: A machine learning approach. Slides adapted from FeiFei Li, Rob Fergus, Antonio Torralba, Kristen Grauman, and Derek Hoiem
Recognition: A machine learning approach Slides adapted from FeiFei Li, Rob Fergus, Antonio Torralba, Kristen Grauman, and Derek Hoiem The machine learning framework Apply a prediction function to a feature
More informationMobile Phone APP Software Browsing Behavior using Clustering Analysis
Proceedings of the 2014 International Conference on Industrial Engineering and Operations Management Bali, Indonesia, January 7 9, 2014 Mobile Phone APP Software Browsing Behavior using Clustering Analysis
More informationTopic and Trend Detection in Text Collections using Latent Dirichlet Allocation
Topic and Trend Detection in Text Collections using Latent Dirichlet Allocation Levent Bolelli 1, Şeyda Ertekin 2, and C. Lee Giles 3 1 Google Inc., 76 9 th Ave., 4 th floor, New York, NY 10011, USA 2
More information. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns
Outline Part 1: of data clustering NonSupervised Learning and Clustering : Problem formulation cluster analysis : Taxonomies of Clustering Techniques : Data types and Proximity Measures : Difficulties
More informationActive Learning SVM for Blogs recommendation
Active Learning SVM for Blogs recommendation Xin Guan Computer Science, George Mason University Ⅰ.Introduction In the DH Now website, they try to review a big amount of blogs and articles and find the
More informationWatson A System Designed for Answers
IBM Systems and Technology February 2011 An IBM White Paper Watson A System Designed for Answers The future of workload optimized systems design 2 Watson A System Designed for Answers Executive summary
More informationFace Recognition using SIFT Features
Face Recognition using SIFT Features Mohamed Aly CNS186 Term Project Winter 2006 Abstract Face recognition has many important practical applications, like surveillance and access control.
More informationA Survey on Issues and Challenges in Handling Big Data
A Survey on Issues and Challenges in Handling Big Data Sandeep K N Usha R G Dept of Information Science and Engineering Dept of Information Science and Engineering JSS Academy of Technical Education, Bangalore,
More informationClustering. Adrian Groza. Department of Computer Science Technical University of ClujNapoca
Clustering Adrian Groza Department of Computer Science Technical University of ClujNapoca Outline 1 Cluster Analysis What is Datamining? Cluster Analysis 2 Kmeans 3 Hierarchical Clustering What is Datamining?
More informationFig. 1 A typical Knowledge Discovery process [2]
Volume 4, Issue 7, July 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Review on Clustering
More informationData Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 by Tan, Steinbach, Kumar 1 What is Cluster Analysis? Finding groups of objects such that the objects in a group will
More informationStatistical Models in Data Mining
Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of
More informationSteven C.H. Hoi School of Information Systems Singapore Management University Email: chhoi@smu.edu.sg
Steven C.H. Hoi School of Information Systems Singapore Management University Email: chhoi@smu.edu.sg Introduction http://stevenhoi.org/ Finance Recommender Systems Cyber Security Machine Learning Visual
More informationMixture Models. Jia Li. Department of Statistics The Pennsylvania State University. Mixture Models
Mixture Models Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Clustering by Mixture Models General bacground on clustering Example method: means Mixture model based
More informationCLASSIFYING NETWORK TRAFFIC IN THE BIG DATA ERA
CLASSIFYING NETWORK TRAFFIC IN THE BIG DATA ERA Professor Yang Xiang Network Security and Computing Laboratory (NSCLab) School of Information Technology Deakin University, Melbourne, Australia http://anss.org.au/nsclab
More informationDistributed forests for MapReducebased machine learning
Distributed forests for MapReducebased machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication
More informationLecture 13: Clustering and Kmeans
Lecture 13: Clustering and Kmeans CSE 6040 Fall 2014 CSE 6040 Kmeans 1 / 22 What is clustering? Unsupervised learning: Given an unlabeled data set (no prior knowledge), how can we find structures/patterns
More informationA Map Reduce Framework for Programming Graphics Processors. Bryan Catanzaro Narayanan Sundaram Kurt Keutzer
A Map Reduce Framework for Programming Graphics Processors Bryan Catanzaro Narayanan Sundaram Kurt Keutzer Overview Map Reduce is a good abstraction to map to GPUs It is easy for programmers to understand
More informationMapReduce for Machine Learning on Multicore
MapReduce for Machine Learning on Multicore Chu, et al. Problem The world is going multicore New computers  dual core to 12+core Shift to more concurrent programming paradigms and languages Erlang,
More informationFeature Selection vs. Extraction
Feature Selection In many applications, we often encounter a very large number of potential features that can be used Which subset of features should be used for the best classification? Need for a small
More informationClustering. Unsupervised learning Generating classes Distance/similarity measures Agglomerative methods Divisive methods. CS 5751 Machine Learning
Clustering Unsupervised learning Generating classes Distance/similarity measures Agglomerative methods Divisive methods Data Clustering 1 What is Clustering? Form of unsupervised learning  no information
More informationMachine Learning with MATLAB U. M. Sundar Senior Application Engineer MathWorks India
Machine Learning with MATLAB U. M. Sundar Senior Application Engineer MathWorks India 2013 The MathWorks, Inc. 1 Agenda Machine Learning Overview Machine Learning with MATLAB Unsupervised Learning Clustering
More informationInformation Management course
Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 01 : 06/10/2015 Practical informations: Teacher: Alberto Ceselli (alberto.ceselli@unimi.it)
More informationThe Big Data Paradigm Shift. Insight Through Automation
The Big Data Paradigm Shift Insight Through Automation Agenda The Problem Emcien s Solution: Algorithms solve data related business problems How Does the Technology Work? Case Studies 2013 Emcien, Inc.
More informationSIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs
SIGMOD RWE Review Towards Proximity Pattern Mining in Large Graphs Fabian Hueske, TU Berlin June 26, 21 1 Review This document is a review report on the paper Towards Proximity Pattern Mining in Large
More informationUNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS
UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS Dwijesh C. Mishra I.A.S.R.I., Library Avenue, New Delhi110 012 dcmishra@iasri.res.in What is Learning? "Learning denotes changes in a system that enable
More informationAnalyzing the Dynamics of Research by Extracting Key Aspects of Scientific Papers
Analyzing the Dynamics of Research by Extracting Key Aspects of Scientific Papers Sonal Gupta Christopher Manning Natural Language Processing Group Department of Computer Science Stanford University Columbia
More informationMODULE 15 Clustering Large Datasets LESSON 34
MODULE 15 Clustering Large Datasets LESSON 34 Incremental Clustering Keywords: Single Database Scan, Leader, BIRCH, Tree 1 Clustering Large Datasets Pattern matrix It is convenient to view the input data
More informationarxiv: v1 [cs.cv] 4 Feb 2016
RANDOM FEATURE MAPS VIA A LAYERED RANDOM PROJECTION (LARP) FRAMEWORK FOR OBJECT CLASSIFICATION A. G. Chung, M. J. Shafiee, and A. Wong Vision & Image Processing Research Group, System Design Engineering
More informationNeural Networks Lesson 5  Cluster Analysis
Neural Networks Lesson 5  Cluster Analysis Prof. Michele Scarpiniti INFOCOM Dpt.  Sapienza University of Rome http://ispac.ing.uniroma1.it/scarpiniti/index.htm michele.scarpiniti@uniroma1.it Rome, 29
More informationOverview. Introduction to Information Retrieval. K means. Evaluation. How many clusters? TDT4215 Web intelligence Based on slides by:
Introduction to Information Retrieval TDT4215 Web intelligence Based on slides by: Hinrich Schütze and Christina Lioma Chapter 16: Flat Clustering 1 Overview ❶ Clustering: Introduction ti ❸ Clustering
More informationClassifying Runtime Performance with SVM
Classifying Runtime Performance with SVM David Eliahu Shaddi Hasan Omer Spillinger May 14, 2013 Abstract We present a machinelearning based technique for the problem of algorithm selection, specifically
More informationFoundations of Artificial Intelligence. Introduction to Data Mining
Foundations of Artificial Intelligence Introduction to Data Mining Objectives Data Mining Introduce a range of data mining techniques used in AI systems including : Neural networks Decision trees Present
More information9.54 Class 13. Unsupervised learning Clustering. Shimon Ullman + Tomaso Poggio Danny Harari + Daneil Zysman + Darren Seibert
9.54 Class 13 Unsupervised learning Clustering Shimon Ullman + Tomaso Poggio Danny Harari + Daneil Zysman + Darren Seibert Outline Introduction to clustering Kmeans Bag of words (dictionary learning)
More informationIntroduction to the Mathematics of Big Data. Philippe B. Laval
Introduction to the Mathematics of Big Data Philippe B. Laval Fall 2015 Introduction In recent years, Big Data has become more than just a buzz word. Every major field of science, engineering, business,
More informationData, Measurements, Features
Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are
More informationAsking Hard Graph Questions. Paul Burkhardt. February 3, 2014
Beyond Watson: Predictive Analytics and Big Data U.S. National Security Agency Research Directorate  R6 Technical Report February 3, 2014 300 years before Watson there was Euler! The first (Jeopardy!)
More informationClassification and Prediction
Classification and Prediction Classification and Prediction What is classification? What is regression? Issues regarding classification and prediction Classification by decision tree induction Bayesian
More informationRecognizing Cats and Dogs with Shape and Appearance based Models. Group Member: Chu Wang, Landu Jiang
Recognizing Cats and Dogs with Shape and Appearance based Models Group Member: Chu Wang, Landu Jiang Abstract Recognizing cats and dogs from images is a challenging competition raised by Kaggle platform
More informationA Novel Framework for Incorporating Labeled Examples into Anomaly Detection
A Novel Framework for Incorporating Labeled Examples into Anomaly Detection Jing Gao Haibin Cheng PangNing Tan Abstract This paper presents a principled approach for incorporating labeled examples into
More informationBig Data Analytics. Genoveva VargasSolar http://www.vargassolar.com/bigdataanalytics French Council of Scientific Research, LIG & LAFMIA Labs
1 Big Data Analytics Genoveva VargasSolar http://www.vargassolar.com/bigdataanalytics French Council of Scientific Research, LIG & LAFMIA Labs Montevideo, 22 nd November 4 th December, 2015 INFORMATIQUE
More informationScalable and High Performance Computing for Big Data Analytics in Understanding the Human Dynamics in the Mobile Age
Scalable and High Performance Computing for Big Data Analytics in Understanding the Human Dynamics in the Mobile Age Xuan Shi GRA: Bowei Xue University of Arkansas Spatiotemporal Modeling of Human Dynamics
More informationSymNMF: Symmetric Nonnegative Matrix Factorization for Graph Clustering
SymNMF: Symmetric Nonnegative Matrix Factorization for Graph Clustering Da Kuang 1, Chris Ding 2, Haesun Park 1 1 Georgia Institute of Technology 2 University of Texas, Arlington Apr. 26, 2012 2012 SIAM
More informationLecture 20: Clustering
Lecture 20: Clustering Wrapup of neural nets (from last lecture Introduction to unsupervised learning Kmeans clustering COMP424, Lecture 20  April 3, 2013 1 Unsupervised learning In supervised learning,
More informationIntroduction to Data Mining
Introduction to Data Mining Jay Urbain Credits: Nazli Goharian & David Grossman @ IIT Outline Introduction Data Preprocessing Data Mining Algorithms Naïve Bayes Decision Tree Neural Network Association
More informationData Mining. Cluster Analysis: Advanced Concepts and Algorithms
Data Mining Cluster Analysis: Advanced Concepts and Algorithms Tan,Steinbach, Kumar Introduction to Data Mining 4/18/2004 1 More Clustering Methods Prototypebased clustering Densitybased clustering Graphbased
More informationROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015
ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 http://intelligentoptimization.org/lionbook Roberto Battiti
More informationIEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 20, NO. 7, JULY 2009 1181
IEEE TRANSACTIONS ON NEURAL NETWORKS, VOL. 20, NO. 7, JULY 2009 1181 The Global Kernel kmeans Algorithm for Clustering in Feature Space Grigorios F. Tzortzis and Aristidis C. Likas, Senior Member, IEEE
More informationLEARNING TECHNIQUES AND APPLICATIONS
MSc Course MACHINE LEARNING TECHNIQUES AND APPLICATIONS Classification with GMM + Bayes 1 Clustering, semisupervised clustering and classification Clustering No labels for the points! Semisupervised
More information6.867 Machine learning
6.867 Machine learning Midterm exam October 15, 3 ( points) Your name and MIT ID: SOLUTIONS Problem 1 Suppose we are trying to solve an active learning problem, where the possible inputs you can select
More informationAn Algorithm for Automated View Reduction in Weighted Clustering of Multiview Data
An Algorithm for Automated View Reduction in Weighted Clustering of Multiview Data N. Aparna PG Scholar Department of Computer Science Sri Ramakrishna College of Engineering, Coimbatore M. Kalaiarasu Associate
More informationCHAPTER 5 SOFT CLUSTERING BASED MULTIPLE DICTIONARY BAG OF WORDS FOR IMAGE RETRIEVAL
84 CHAPTER 5 SOFT CLUSTERING BASED MULTIPLE DICTIONARY BAG OF WORDS FOR IMAGE RETRIEVAL Object classification is a highly important area of computer vision and has many applications including robotics,
More informationCOMPARATIVE INVESTIGATIONS AND PERFORMANCE ANALYSIS OF FCM AND MFPCM ALGORITHMS ON IRIS DATA
COMPARATIVE INVESTIGATIONS AND PERFORMANCE ANALYSIS OF FCM AND MFPCM ALGORITHMS ON IRIS DATA VUDA SREENIVASA RAO * Research Scholar, CSIT Department JNT University, Hyderabad. Andhra Pradesh, India. ABSTRACT:
More informationIJCSES Vol.7 No.4 October 2013 pp.165168 Serials Publications BEHAVIOR PERDITION VIA MINING SOCIAL DIMENSIONS
IJCSES Vol.7 No.4 October 2013 pp.165168 Serials Publications BEHAVIOR PERDITION VIA MINING SOCIAL DIMENSIONS V.Sudhakar 1 and G. Draksha 2 Abstract: Collective behavior refers to the behaviors of individuals
More informationPrototype Methods: KMeans
Prototype Methods: KMeans Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Prototype Methods Essentially modelfree. Not useful for understanding the nature of the
More informationConcept and Project Objectives
3.1 Publishable summary Concept and Project Objectives Proactive and dynamic QoS management, network intrusion detection and early detection of network congestion problems among other applications in the
More informationMapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research
MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With
More informationData Clustering: Kmeans and Hierarchical Clustering
Data Clustering: Kmeans and Hierarchical Clustering Piyush Rai CS5350/6350: Machine Learning October 4, 2011 (CS5350/6350) Data Clustering October 4, 2011 1 / 24 What is Data Clustering? Data Clustering
More informationFast Visual Vocabulary Construction for Image Retrieval using SkewedSplit kd trees
Fast Visual Vocabulary Construction for Image Retrieval using SkewedSplit kd trees Ilias Gialampoukidis, Stefanos Vrochidis, and Ioannis Kompatsiaris Information Technologies Institute, CERTH, Thessaloniki,
More informationFUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM
International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 3448 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT
More informationData Clustering. Dec 2nd, 2013 Kyrylo Bessonov
Data Clustering Dec 2nd, 2013 Kyrylo Bessonov Talk outline Introduction to clustering Types of clustering Supervised Unsupervised Similarity measures Main clustering algorithms kmeans Hierarchical Main
More informationPit Pattern Classification with Support Vector Machines and Neural Networ
Pit Pattern Classification with s and s February 1, 2007 Pit Pattern Classification with s and Neural Networ Introduction Pit Pattern Classification with s and Neural Networ Objectives Investigation of
More informationIntroduction to Weka A Toolkit for Machine Learning
1. Introduction Weka is open source software under the GNU General Public License. System is developed at the University of Waikato in New Zealand. Weka stands for the Waikato Environment for Knowledge
More informationEdge Detection. Computer Vision (CS 543 / ECE 549) University of Illinois Derek Hoiem 02/05/15. Magritte, Decalcomania
Edge Detection 02/05/15 Magritte, Decalcomania Computer Vision (CS 543 / ECE 549) University of Illinois Derek Hoiem Many slides from Lana Lazebnik, Steve Seitz, David Forsyth, David Lowe, FeiFei Li Last
More information