Top 10 Algorithms in Data Mining

Similar documents
Top Top 10 Algorithms in Data Mining

Data Mining: Concepts and Techniques

Big Data: Introduction


The Effect of Clustering in the Apriori Data Mining Algorithm: A Case Study

Random forest algorithm in big data environment

An Overview of Knowledge Discovery Database and Data mining Techniques

Model Trees for Classification of Hybrid Data Types

A Comparative study of Techniques in Data Mining

A Holistic Analysis of Cloud Based Big Data Mining

Top 10 algorithms in data mining

Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods

CLASSIFYING NETWORK TRAFFIC IN THE BIG DATA ERA

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Fuzzy Logic -based Pre-processing for Fuzzy Association Rule Mining

Static Data Mining Algorithm with Progressive Approach for Mining Knowledge

2009 by Taylor & Francis Group, LLC

A Survey of Graph Pattern Mining Algorithm and Techniques

Web Mining Seminar CSE 450. Spring 2008 MWF 11:10 12:00pm Maginnes 113

Reference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors

Comparison of Data Mining Techniques for Money Laundering Detection System

Predicting time-to-event from Twitter messages

Detecting Spam Using Spam Word Associations

How To Solve The Kd Cup 2010 Challenge

CS Data Science and Visualization Spring 2016

CSCI-599 DATA MINING AND STATISTICAL INFERENCE

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. ~ Spring~r

Survey on Multiclass Classification Methods

IncSpan: Incremental Mining of Sequential Patterns in Large Database

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA

Comparison of Data Mining Techniques used for Financial Data Analysis

Link Mining: A New Data Mining Challenge

Assessing Data Mining: The State of the Practice

MD - Data Mining

On Mining Group Patterns of Mobile Users

Meta Learning Algorithms for Credit Card Fraud Detection

On the effect of data set size on bias and variance in classification learning

A Way to Understand Various Patterns of Data Mining Techniques for Selected Domains

Explanation-Oriented Association Mining Using a Combination of Unsupervised and Supervised Learning Algorithms

An Introduction to Data Mining

REPORT DOCUMENTATION PAGE

Discovering Sequential Rental Patterns by Fleet Tracking

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification

Business Lead Generation for Online Real Estate Services: A Case Study

Dynamical Clustering of Personalized Web Search Results

Fine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms

Inductive Learning in Less Than One Sequential Data Scan

TOWARDS BIG DATA MINING AND DISCOVERY

Building an Iris Plant Data Classifier Using Neural Network Associative Classification

Data Mining. Concepts, Models, Methods, and Algorithms. 2nd Edition

Directed Graph based Distributed Sequential Pattern Mining Using Hadoop Map Reduce

ISSN: (Online) Volume 2, Issue 7, July 2014 International Journal of Advance Research in Computer Science and Management Studies

Data Mining: Opportunities and Challenges

An Experimental Study on Ensemble of Decision Tree Classifiers

SPMF: a Java Open-Source Pattern Mining Library

Application of Data Mining Technique for Fraud Detection in Health Insurance Scheme Using Knee-Point K-Means Algorithm

International Journal of World Research, Vol: I Issue XIII, December 2008, Print ISSN: X DATA MINING TECHNIQUES AND STOCK MARKET

Selecting Data Mining Model for Web Advertising in Virtual Communities

How To Learn Data Analytics

Ensemble Approach for the Classification of Imbalanced Data

Model Combination. 24 Novembre 2009

INTRODUCTION TO KNOWLEDGE DISCOVERY IN DATABASES

A Hybrid Decision Tree Approach for Semiconductor. Manufacturing Data Mining and An Empirical Study

Database Preprocessing and Comparison between Data Mining Methods

An Analysis of Machine Learning Methods for Spam Host Detection

GraphZip: A Fast and Automatic Compression Method for Spatial Data Clustering

HACE-CSA: Heterogeneous, Autonomous, Complex and Evolving based Contextual Structure Analysis of the big data for Data Mining

Data Mining: Concepts and Techniques

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Outlier Detection in Stream Data by Machine Learning and Feature Selection Methods

Data Mining: Concepts and Techniques. (3 rd ed.) Chapter 1

College information system research based on data mining

Searching frequent itemsets by clustering data

Parallel Methods for Scaling Data Mining Algorithms to Large Data Sets

Cluster Analysis for Optimal Indexing

Improve Teaching Method of Data Mining Course

D-optimal plans in observational studies

How To Identify A Churner

BIG DATA IN HEALTHCARE THE NEXT FRONTIER

Community Mining from Multi-relational Networks

FrIDA Free Intelligent Data Analysis Toolbox.

Performance Analysis of Sequential Pattern Mining Algorithms on Large Dense Datasets

Mining Concept-Drifting Data Streams

MAXIMAL FREQUENT ITEMSET GENERATION USING SEGMENTATION APPROACH

Ensemble Data Mining Methods

Student Profiling on Academic Performance Using Cluster Analysis

Data Mining to Recognize Fail Parts in Manufacturing Process

A Mechanism for Selecting Appropriate Data Mining Techniques

Combining SVM classifiers for anti-spam filtering

MALLET-Privacy Preserving Influencer Mining in Social Media Networks via Hypergraph

Data Mining: Concepts and Techniques. Jiawei Han. Micheline Kamber. Simon Fräser University К MORGAN KAUFMANN PUBLISHERS. AN IMPRINT OF Elsevier

Data Mining as an Automated Service

Predictive Data Mining in Very Large Data Sets: A Demonstration and Comparison Under Model Ensemble

An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset

MS1b Statistical Data Mining

Autoregressive HMM Based Trajectory Modeling on Social Network

The Role of Visualization in Effective Data Cleaning

Support Vector Machines with Clustering for Training with Very Large Datasets

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics

Transcription:

Top 10 Algorithms in Data Mining Xindong Wu ( 吴 信 东 ) Department of Computer Science University of Vermont, USA; 合 肥 工 业 大 学 计 算 机 与 信 息 学 院 1

Top 10 Algorithms in Data Mining by the IEEE ICDM Conference 1. The 3-step identification process 2. 18 identified candidates in 10 data mining topics 3. The top 10 algorithms 4. Follow-up actions 2

The 3-Step Identification Process 1. Nominations. ACM KDD Innovation Award and IEEE ICDM Research Contributions Award winners were invited in September 2006 for nominations Each nomination was asked to come with the following information: a) the algorithm name b) a brief justification c) a representative publication reference Up to 10 nominations from each nominator The nominations as a group should have a reasonable representation of the different areas in data mining All except one in this distinguished set of award winners responded. 3

The 3-Step Identification Process (2) 2. Verification. Each nomination was verified for its citations on Google Scholar in late October 2006, and those nominations that did not have at least 50 citations were removed. 18 nominations survived and were then organized in 10 topics. 3. Voting by the wider community. (a) Program Committee members of KDD-06, ICDM '06, and SDM '06 and (b) ACM KDD Innovation Award and IEEE ICDM Research Contributions Award winners The top 10 algorithms are ranked by their number of votes, and when there is a tie, the alphabetic order is used. 4

Agenda 1. The 3-step identification process 2. 18 identified candidates (in 10 data mining topics) 3. The top 10 algorithms 4. Follow-up actions 5

18 Identified Candidates Classification #1. C4.5: Quinlan, J. R. 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann Publishers Inc. #2. CART: L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification and Regression Trees. Wadsworth, Belmont, CA, 1984. #3. K Nearest Neighbours (knn): Hastie, T. and Tibshirani, R. 1996. Discriminant Adaptive Nearest Neighbor Classification. IEEE Trans. Pattern Anal. Mach. Intell. (TPAMI). 18, 6 (Jun. 1996), 607-616. #4. Naive Bayes: Hand, D.J., Yu, K., 2001. Idiot's Bayes: Not So Stupid After All? Internat. Statist. Rev. 69, 385-398. Statistical Learning #5. SVM: Vapnik, V. N. 1995. The Nature of Statistical Learning Theory. Springer-Verlag New York, Inc. #6. EM: McLachlan, G. and Peel, D. (2000). Finite Mixture Models. J. Wiley, New York. Association Analysis #7. Apriori: Rakesh Agrawal and Ramakrishnan Srikant. Fast Algorithms for Mining Association Rules. In VLDB '94. #8. FP-Tree: Han, J., Pei, J., and Yin, Y. 2000. Mining frequent patterns without candidate generation. In SIGMOD '00. Link Mining #9. PageRank: Brin, S. and Page, L. 1998. The anatomy of a large-scale hypertextual Web search engine. In WWW-7, 1998. #10. HITS: Kleinberg, J. M. 1998. Authoritative sources in a hyperlinked environment. In Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms, 1998. 6

18 Candidates (2) Clustering #11. K-Means: MacQueen, J. B., Some methods for classification and analysis of multivariate observations, in Proc. 5th Berkeley Symp. Mathematical Statistics and Probability, 1967. #12. BIRCH: Zhang, T., Ramakrishnan, R., and Livny, M. 1996. BIRCH: an efficient data clustering method for very large databases. In SIGMOD '96. Bagging and Boosting #13. AdaBoost: Freund, Y. and Schapire, R. E. 1997. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55, 1 (Aug. 1997), 119-139. Sequential Patterns #14. GSP: Srikant, R. and Agrawal, R. 1996. Mining Sequential Patterns: Generalizations and Performance Improvements. In Proceedings of the 5th International Conference on Extending Database Technology, 1996. #15. PrefixSpan: J. Pei, J. Han, B. Mortazavi-Asl, H. Pinto, Q. Chen, U. Dayal and M-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. In ICDE '01. Integrated Mining #16. CBA: Liu, B., Hsu, W. and Ma, Y. M. Integrating classification and association rule mining. KDD- 98. Rough Sets #17. Finding reduct: Zdzislaw Pawlak, Rough Sets: Theoretical Aspects of Reasoning about Data, Kluwer Academic Publishers, Norwell, MA, 1992. Graph Mining #18. gspan: Yan, X. and Han, J. 2002. gspan: Graph-Based Substructure Pattern Mining. In ICDM '02. 7

Agenda 1. The 3-step identification process 2. 18 identified candidates 3. The top 10 algorithms 4. Follow-up actions 8

The Top 10 Algorithms #1: C4.5, presented by Hiroshi Motoda #2: K-Means, presented by Joydeep Ghosh #3: SVM, presented by Qiang Yang #4: Apriori, presented by Christos Faloutsos #5: EM, presented by Joydeep Ghosh #6: PageRank, presented by Christos Faloutsos #7: AdaBoost, presented by Zhi-Hua Zhou #7: knn, presented by Vipin Kumar #7: Naive Bayes, presented by Qiang Yang #10: CART, presented by Dan Steinberg 9

Agenda 1. The 3-step identification process 2. 18 identified candidates 3. The top 10 algorithms 4. Follow-up actions 10

Open Votes for Top Algorithms Top 3 Algorithms: C4.5 SVM Apriori Top 10 Algorithms The top 10 algorithms voted from the 18 candidates at the panel are the same as the voting results from the 3-step identification process. 11

Follow-Up Actions A survey paper on Top 10 Algorithms in Data Mining (X. Wu, V. Kumar, J.R. Quinlan, et al., Knowledge and Information Systems, 14(1), 2008, 1~37) Written by the original authors and presenters Cited 1841 times on Google Scholar as of 1/19/2016 How to make a good use of these top 10 algorithms? Curriculum development A textbook on The Top 10 Algorithms in Data Mining, Chapman and Hall/CRC Press, April 2009 Various questions on these 10 algorithms? Why not this algorithm or that topic? Will the votes change in the future? Sure, let s work together to make positive changes! 12