Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by: Limou Wang



Overview
  SVM Overview
  Motivation
  Hierarchical micro-clustering algorithm
  Clustering-Based SVM (CB-SVM)
  Experimental Evaluation
  Conclusion & Future work

SVM Overview
  What is a Support Vector Machine? A method that analyzes data and recognizes patterns.
  How does an SVM work?
    Training phase: given a set of labeled training examples, find a hyperplane that separates the data points of the two classes.
    Prediction phase: when a new data point arrives, classify it by the side of the hyperplane on which it falls.

SVM Overview
  Example: H1 (blue) and H2 (red) separate the data points; H3 (green) does not.

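SVM Overview
  Maximum margin principle: among the separating hyperplanes, choose the one with the largest distance (margin) to the nearest training samples.
  Samples on the margin are called support vectors.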
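To make the margin concrete, here is a minimal sketch that measures the margin of a given hyperplane and reports which samples lie on it. It does not search for the maximum-margin hyperplane; the hyperplane (w, b), the sample points, and all function names are illustrative assumptions, not part of the presentation.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from point x to the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def margin_and_support(w, b, points, labels, tol=1e-9):
    """Return the margin of (w, b) and the indices of the samples on it."""
    # Multiplying by the label makes the distance positive for correctly
    # classified samples, whichever side of the hyperplane they are on.
    dists = [y * signed_distance(w, b, x) for x, y in zip(points, labels)]
    margin = min(dists)
    support = [i for i, d in enumerate(dists) if abs(d - margin) <= tol]
    return margin, support

# Hypothetical toy data: two points per class, separated by the line x1 = 0.
points = [(2.0, 0.0), (3.0, 1.0), (-2.0, 0.0), (-4.0, 2.0)]
labels = [1, 1, -1, -1]
margin, support = margin_and_support((1.0, 0.0), 0.0, points, labels)
```

For this toy hyperplane the closest samples of each class are the ones that would act as support vectors.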

Motivation
  SVMs have been a promising method.
  However, the training time of a standard SVM is O(N^3), where N is the number of training examples.
  Therefore, standard SVMs do not scale to very large data sets.

Hierarchical micro-clustering algorithm
  Goals
    Minimize running time and the number of data scans, so that the method is practical for large data sets.
    Make clustering decisions without scanning the whole data set.
    Exploit the non-uniformity of the data: treat dense areas as one cluster, and remove outliers (noise).

Hierarchical micro-clustering algorithm
  Clustering Feature (CF)
    A CF is a compact summary of the data points in a cluster.
    It holds enough information to calculate intra-cluster distances.
    The additivity theorem allows subclusters to be merged by adding their CFs.

Hierarchical micro-clustering algorithm
  CF (contd.)
    Given N d-dimensional data points in a cluster, {X_i} where i = 1, 2, ..., N:
      CF = (N, LS, SS)
    N is the number of data points in the cluster.
    LS is the linear sum of the N data points: LS = X_1 + X_2 + ... + X_N.
    SS is the square sum of the N data points: SS = X_1^2 + X_2^2 + ... + X_N^2.
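The following sketch shows how a CF can be built, merged by additivity, and used to recover a centroid and radius without revisiting the points. It assumes SS is stored as a single scalar (the sum of squared norms) and that the radius is the root-mean-square distance of the points to the centroid; the function names are illustrative.

```python
import math

def make_cf(points):
    """Build CF = (N, LS, SS) from a list of d-dimensional points."""
    n = len(points)
    d = len(points[0])
    ls = [sum(p[k] for p in points) for k in range(d)]  # linear sum, a vector
    ss = sum(x * x for p in points for x in p)          # square sum, a scalar (assumption)
    return (n, ls, ss)

def merge_cf(a, b):
    """Additivity: the CF of a union of disjoint clusters is the component-wise sum."""
    return (a[0] + b[0], [x + y for x, y in zip(a[1], b[1])], a[2] + b[2])

def centroid(cf):
    n, ls, _ = cf
    return [x / n for x in ls]

def radius(cf):
    """Root-mean-square distance from the points to the centroid."""
    n, ls, ss = cf
    c2 = sum((x / n) ** 2 for x in ls)
    return math.sqrt(max(ss / n - c2, 0.0))  # clamp tiny negative rounding errors

left = make_cf([(0.0, 0.0), (2.0, 0.0)])
right = make_cf([(0.0, 2.0), (2.0, 2.0)])
merged = merge_cf(left, right)
```

Merging the two subcluster CFs gives the same triple as summarizing all four points directly, which is what lets a CF tree maintain its summaries incrementally.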

Hierarchical micro-clustering algorithm
  CF Tree
    Consists of a root node, non-leaf nodes, and leaf nodes.
    Two parameters: branching factor b and threshold t.
    A non-leaf node has at most b entries.
    The radius of a leaf node entry must be less than t.

Hierarchical micro-clustering algorithm
  CF Tree (contd.)
    Example
    Thus, a CF tree is a compact representation of a large data set.

Hierarchical micro-clustering algorithm
  CF Tree (contd.)
    Algorithm:
    1. Identify the appropriate leaf: starting from the root, descend by following the closest centroid.
    2. Modify the leaf: absorb the point into an existing entry, add a new entry, or split the leaf.
    3. Modify the path to the leaf: update the entries recursively back to the root.
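A heavily simplified sketch of the "absorb or add an entry" decision in step 2 follows. It works on a flat list of leaf entries only: the descent through non-leaf nodes, node splitting, and the update of the path to the root are all omitted. The scalar SS, the radius definition, and the names are assumptions made for illustration.

```python
import math

def cf_of(point):
    """CF of a single point: (N, LS, SS) with a scalar SS (assumption)."""
    return (1, list(point), sum(x * x for x in point))

def merge(a, b):
    return (a[0] + b[0], [x + y for x, y in zip(a[1], b[1])], a[2] + b[2])

def centroid(cf):
    return [x / cf[0] for x in cf[1]]

def radius(cf):
    n, ls, ss = cf
    return math.sqrt(max(ss / n - sum((x / n) ** 2 for x in ls), 0.0))

def insert_point(entries, point, t):
    """Absorb the point into the closest entry if the radius stays below t,
    otherwise start a new entry. Returns the index of the entry used."""
    new = cf_of(point)
    if entries:
        i = min(range(len(entries)),
                key=lambda j: math.dist(centroid(entries[j]), point))
        candidate = merge(entries[i], new)
        if radius(candidate) < t:
            entries[i] = candidate  # absorb
            return i
    entries.append(new)             # add an entry
    return len(entries) - 1

entries = []
for p in [(0.0, 0.0), (0.2, 0.0), (5.0, 5.0), (5.1, 5.0)]:
    insert_point(entries, p, t=0.5)
counts = [e[0] for e in entries]
```

With threshold t = 0.5 the four hypothetical points collapse into two entries of two points each.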

Hierarchical micro-clustering algorithm
  CF Tree (contd.)
    Outlier handling: after building the CF tree, remove the entries that contain far fewer data points than the average.
    Running time: considering only the dependence on the size of the data set, the computational complexity of the algorithm is O(N).
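The outlier rule can be sketched as a filter over leaf entries. The slide does not say how much fewer than average counts as "far fewer", so the cutoff parameter `fraction` below is a hypothetical choice, as are the example counts.

```python
def remove_outlier_entries(entries, fraction=0.25):
    """Keep entries whose point count is at least `fraction` of the average.

    `entries` is a list of CF triples (N, LS, SS); `fraction` is an
    assumed cutoff, not a value given in the presentation.
    """
    average = sum(n for n, _, _ in entries) / len(entries)
    return [e for e in entries if e[0] >= fraction * average]

# Hypothetical leaf entries; only the counts matter for this rule.
leaf_entries = [
    (100, [10.0, 20.0], 900.0),
    (120, [-30.0, 5.0], 1500.0),
    (2, [40.0, 40.0], 1600.0),   # sparse entry, likely noise
    (90, [0.0, -15.0], 700.0),
]
kept = remove_outlier_entries(leaf_entries)
kept_counts = [e[0] for e in kept]
```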

Clustering-Based SVM (CB-SVM)
  Training with the hierarchical micro-clusters:
  1. Construct two CF trees, one for each class.
  2. Train an SVM boundary function from the centroids of the root entries.
  3. Decluster the entries near the boundary into the next level.
  4. Update the SVM boundary function from the centroids of the entries in the training set; repeat from step 3 until nothing new is accumulated.
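The control flow of steps 2 to 4 can be sketched as a loop around a pluggable trainer. Several things here are assumptions for illustration only: the `Entry` class, the stand-in trainer (a hyperplane halfway between the class means, which is NOT an SVM), the use of the closest entry's distance as the support distance, and the nearness test "distance minus radius is below the support distance".

```python
from dataclasses import dataclass, field
import math

@dataclass
class Entry:
    centroid: tuple
    radius: float
    children: list = field(default_factory=list)

def train_stand_in(points, labels):
    """Placeholder trainer, NOT an SVM: hyperplane halfway between class means."""
    d = len(points[0])
    pos = [p for p, y in zip(points, labels) if y > 0]
    neg = [p for p, y in zip(points, labels) if y < 0]
    mu_p = [sum(p[k] for p in pos) / len(pos) for k in range(d)]
    mu_n = [sum(p[k] for p in neg) / len(neg) for k in range(d)]
    w = [a - c for a, c in zip(mu_p, mu_n)]
    b = -sum(wk * (a + c) / 2 for wk, a, c in zip(w, mu_p, mu_n))
    return w, b

def boundary_distance(w, b, x):
    norm = math.sqrt(sum(wk * wk for wk in w))
    return abs(sum(wk * xk for wk, xk in zip(w, x)) + b) / norm

def cb_train(pos_roots, neg_roots, train=train_stand_in):
    training = [(e, 1) for e in pos_roots] + [(e, -1) for e in neg_roots]
    while True:
        w, b = train([e.centroid for e, _ in training], [y for _, y in training])
        dists = [boundary_distance(w, b, e.centroid) for e, _ in training]
        d_s = min(dists)  # assumption: closest entry acts as the support cluster
        next_training, declustered = [], False
        for (e, y), d in zip(training, dists):
            if e.children and d - e.radius < d_s:
                next_training.extend((c, y) for c in e.children)  # decluster
                declustered = True
            else:
                next_training.append((e, y))
        training = next_training
        if not declustered:  # nothing new accumulated: stop
            return w, b, training

pos = Entry((2.0, 0.0), 1.0, [Entry((1.5, 0.0), 0.0), Entry((2.5, 0.0), 0.0)])
neg = Entry((-2.0, 0.0), 1.0, [Entry((-1.5, 0.0), 0.0), Entry((-2.5, 0.0), 0.0)])
w, b, final_set = cb_train([pos], [neg])
```

Replacing `train_stand_in` with a real linear SVM trainer would keep the loop unchanged; only the boundary and the choice of support entries would differ.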

Clustering-Based SVM (CB-SVM)
  Declustering the low-margin clusters:
    Let D_s be the distance from the boundary to the centroid of a support cluster,
    D_i the distance from the boundary to the centroid of a cluster E_i,
    and R_i the radius of E_i.
    Then, if D_i - R_i < D_s, E_i is considered to be a low-margin cluster.
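A small sketch of the low-margin test, taking R_i to be the radius of cluster E_i and reading the criterion as D_i - R_i < D_s. The linear boundary w.x + b = 0, the numeric values, and the function names are hypothetical.

```python
import math

def distance_to_boundary(w, b, x):
    """Unsigned distance from point x to the boundary w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(wi * xi for wi, xi in zip(w, x)) + b) / norm

def is_low_margin(w, b, centroid, radius, d_s):
    """True if the cluster may reach closer to the boundary than the support cluster."""
    d_i = distance_to_boundary(w, b, centroid)
    return d_i - radius < d_s

w, b, d_s = (1.0, 0.0), 0.0, 1.0
wide_cluster = is_low_margin(w, b, (3.0, 0.0), 2.5, d_s)   # 3.0 - 2.5 = 0.5 < 1.0
tight_cluster = is_low_margin(w, b, (3.0, 0.0), 1.0, d_s)  # 3.0 - 1.0 = 2.0 >= 1.0
```

Both clusters share a centroid, but only the wide one could contain points nearer to the boundary than the support cluster, so only it would be declustered.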

Clustering-Based SVM (CB-SVM)
  Next, we look at how CB-SVM works.


Clustering-Based SVM (CB-SVM)
  Running time:
    Let r = s / b, where s is the average number of support entries and b is the branching factor; 0 < r << 1.
    Training from the leaf entries is O(b^(2h)), where h is the height of the CF tree.
    If the number of leaf entries equals the number of data points, CB-SVM trains 1/r^(2h-2) times faster than the standard SVM, an advantage that grows as the data set becomes larger.
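To get a feel for the claimed factor, the arithmetic can be worked through for sample values. The values of r and h below are hypothetical, and h is taken to be the height of the CF tree.

```python
def speedup_factor(r, h):
    """Claimed speedup of CB-SVM over a standard SVM: 1 / r^(2h - 2)."""
    return (1.0 / r) ** (2 * h - 2)

# Hypothetical settings: half, or a tenth, of the entries per node are
# support entries, in trees of height 3.
modest = speedup_factor(0.5, 3)   # (1 / 0.5)^4
larger = speedup_factor(0.1, 3)   # (1 / 0.1)^4
```

Because the exponent grows with the tree height, a taller tree (that is, a larger data set for a fixed b) increases the factor quickly even when r stays the same.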

Experimental Evaluation
  Environment: all experiments were run on a Pentium III 800 MHz machine with 906 MB of memory.
  Premise: binary classification on two-dimensional data sets.
  Note: training and testing data are drawn from the same distribution.


Conclusion & Future work
  CB-SVM is very scalable for large data sets while generating high classification accuracy.
  However, it is limited to linear kernels: radii and distances are not preserved in a high-dimensional feature space.
  Future work: constructing an efficient indexing structure for nonlinear kernels.

Questions? Thank you!