Stream data state of the art and new results
|
|
|
- Conrad Elliott
- 10 years ago
- Views:
Transcription
1 Stream data state of the art and new results Leszek Rutkowski e mail: [email protected] Cooperation with Ph. D. Students: Piotr Duda, Maciej Jaworski, Lena Pietruczuk, Czestochowa University of Technology, Department of Computer Engineering, Poland 1
2 Table of contents Introduction Data Stream Applications Examples of Concept Drift Decision Trees for Stream Data Mining Challenges References 2
3 Czym są dane strumieniowe? strumień danych to potencjalnie nieskończony ciąg przychodzących danych Cechy danych strumieniowych: Dane docierają do systemu nieustannie, teoretycznie ich rozmiar jest nieskończony Algorytm działa na paczkach danych lub po każdej danej (rekurencyjnie) Algorytm musi analizować daną (paczkę) na tyle szybko, aby zdążyć przed przybyciem kolejnej danej (paczki) Nie ma miejsca w pamięci na przechowywanie danych historycznych Informacja z przeszłości przechowywana w postaci przybliżonych struktur podsumowujących Rozkład danych może ulegać zmianie w czasie (concept drift)
4 Główne problemy jakie niesie ze sobą badanie danych strumieniowych to: - ograniczenie pamięci - możliwość wyłącznie jednorazowego odczytu danych - duża szybkość przychodzenia danych - zjawisko zwane concept drift Co oznacza concept drift? Concept drift jest zjawiskiem oznaczającym zmianę rozkładu danych, ich charakteru lub znaczenia w czasie. Może ono następować powoli albo zmiana taka może nastąpić nagle w pewnym momencie w czasie.
5 Przykłady strumieni danych: -monitorowanie sieci -dane pochodzące z sensorów -ruch uliczny -informacje o transakcjach płatniczych -sygnały radiowe pochodzące z przestrzeni kosmicznej -rozmowy telefoniczne
6 Review of the literature 1. Oza N. C., Online Ensemble Learning, Ph.D. Thesis, The University of California, Berkeley, CA, (2001) 6
7 Review of the literature 2. Tsymbal A., The Problem of Concept Drift: Definitions and Related Work, Techn. Report of Dept. of Comp. Sci, Trinity College Dublin, Ireland, (2004) 7
8 Review of the literature 3. Han J. and Kamber M., Data Mining: Concepts and Techniques The Morgan Kaufmann Series in Data Management Systems Series Elsevier Science \& Tech, (2006) 8
9 Review of the literature 4. Aggraval C. C., Data Streams: Models and Algorithms, Springer, (2007) 9
10 Review of the literature 5. Kirkby R., Improving Hoeffding trees, PhD thesis,. University of Waikato, November, (2007) 10
11 Review of the literature 6. Bifet A., and Kirkby R., Data Stream Mining - A Practical approach, University of WAIKATO, New Zeland, (2009) 11
12 Review of the literature 7. Franke C. Adaptivity in Data Stream Mining, PhD thesis, University of California at Davis, (2009) 12
13 Review of the literature 8. Bifet A., Adaptive Stream Mining: Pattern Learning and Mining from Evolving Data Streams, Frontiers in Artificial Intelligence and Applications Frontiers in Artificial Intelligence and Applications IOS Press, (2010) 13
14 Review of the literature 9. Brzeziński D., Mining Data Streams with Concept Drifts, Master's thesis supervised by J. Stefanowski, Poznan University of Technology, Poznan, Poland, (2010) 14
15 Review of the literature 10. Žliobaitė I., Adaptive Training Set Formation, Vilnius University, Lithuania, (2010) 15
16 DECISION TREES for MINING DATA STREAMS based on Leszek Rutkowski, Lena Pietruczuk, Piotr Duda, Maciej Jaworski, Decision Trees for Mining Data Streams Based on the McDiarmid's Bound, IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 6, pp , Leszek Rutkowski, Maciej Jaworski, Lena Pietruczuk, Piotr Duda, Decision Trees for Mining Data Streams Based on the Gaussian Approximation, IEEE Transactions on Knowledge and Data Engineering, vol. 99, no. to be published, Leszek Rutkowski, Maciej Jaworski, Lena Pietruczuk, Piotr Duda, The CART Decision Tree for Mining Data Streams, submitted for publication. Leszek Rutkowski, Maciej Jaworski, Lena Pietruczuk, Piotr Duda, A new method for data stream mining based on the misclassification error, submitted for publication.
17 DECISION TREES
18 The central choice in the ID3 algorithm is selecting which attribute to test at each node in the tree, according to the first task in step 4 of the CLS algorithm. We would like to select the attribute that is most useful for classifying examples. What is a good quantitative measure of the worth of an attribute?
19 We will define a statistical property called information gain that measures how well a given attribute separates the training examples according to their target classification. ID3 uses this information gain measure to select among the candidate attributes at each step while growing the tree.
20 Entropy The entropy of S is defined as follows M Ent(S) = i=1 p i log p i, S training set M number of classes p i - probability that element belongs to class i (proportion of S belonging to class i ) When all data a set belong to a single class, there is no uncertainty and the entropy is zero. If the target attribute can take on M possible values, the entropy can be as large as log 2 M.
21 Given entropy as a measure of the impurity in a collection of training examples, we can now define a measure of the effectiveness of an attribute in classifying the training data. The measure we will use, called, information gain, is simply the expected reduction in entropy caused by partitioning the examples according to this attribute.
22 Information gain H S a = Ent S N i=1 w i Ent(S i ), where a an attribute with values in set {a 1,, a N } N number of different values of attribute a w i - fraction of number of elements with value a i (probability that elements in S take value a i ) S i - subset of set S, with data elements for which the value of attribute a was equal a i
23 Set S married age uses computer at work buy computer Yes young Yes Yes Yes young Yes Yes No young Yes Yes Yes middle Yes Yes No middle Yes No No old Yes Yes Yes old No No Yes old No No Yes young No Yes Yes middle No Yes Ent S = 7 10 log log = 0,88129
24 Attribute a use computer at work S Yes is married age uses computer at work by computer Yes young Yes Yes Yes young Yes Yes No young Yes Yes Yes middle Yes Yes No middle Yes No No old Yes Yes S No is married age uses computer at work by computer Yes old No No Yes old No No Yes young No Yes Yes middle No Yes
25 H(S use computer at work) S Yes is married age uses computer at work by computer Yes young Yes Yes Yes young Yes Yes No young Yes Yes Yes middle Yes Yes No middle Yes Yes No old Yes Yes S No is married age uses computer at work by computer Yes old No No Yes old No No Yes young No Yes Yes middle No No Pr a = Yes = 6 10 Ent S Yes = 0 Pr a = No = 4 10 Ent S No = 0, H S a = Ent S , = 0,55678
26 Decision tree attribute a H(S a) is married 0, age 0, uses computer at work 0,55678
27 Gini Index (the CART algorithm) Gini S = p i 1 p i k or equivalently i=1 Gini S = 1 p i 2 k i=1 Misclassification error MIS S = 1 p max, where p max - the fraction of data elements with majority class in set S
28 The Hoeffding's bound The commonly known algorithm called Hoeffdings Tree was introduced by P. Domingos and G. Hulten in [1]. The main mathematical tool used in this algorithm was the Hoeffding s bound [2] in the form Theorem 1: If X 1, X 2,, X n are independent random variables and a i X i b i (i = 1, 2,, n), then for ε > 0 where X = 1 n P X E[X] ε e 2n2 ε 2 n / i=1 (b i a i ) 2 n i=1 X i and E[X] is expected value of X. [1] P. Domingos and G. Hulten, Mining high-speed data streams, Proc. 6th ACM SIGKDD Internat. Conf. on Knowledge Discovery and Data Mining, pp , [2] W. Hoeffding, Probability inequalities for sums of bounded random variables, Journal of the American Statistical Association, vol. 58, issue 301, pp , March 1963.
29 The Hoeffding's bound For a i = a and b i = b, (i = 1, 2,, n) it states that after n observations the true mean of the random variable of range R = b a does not differ from the estimated mean by more than ε H = R2 ln 1/α 2n with probability 1 α. [1] P. Domingos and G. Hulten, Mining high-speed data streams, Proc. 6th ACM SIGKDD Internat. Conf. on Knowledge Discovery and Data Mining, pp , [2] W. Hoeffding, Probability inequalities for sums of bounded random variables, Journal of the American Statistical Association, vol. 58, issue 301, pp , March 1963.
30 For independent random variables X 1, X 2,, X n, X = 1 n The Hoeffding's bound n i=1 X i and E[X] is expected value of X P X E[X] R2 ln 1/α 2n 1 α
31 The Hoeffding's bound The Hoeffding's bound is wrong tool to solve the problem of choosing the best attribute to make a split in the node (see [3]). This observation follows from the fact that the split measures, like information gain and Gini index, can not be presented as a sum of elements they are using only frequency of elements. Theorem 1 is applicable only for numerical data. Therefore the idea presented in [1] violates the assumptions of Theorem 1 (see [2]) and the concept of Hoeffding Trees has no theoretical justification. [1] P. Domingos and G. Hulten, Mining high-speed data streams, Proc. 6th ACM SIGKDD Internat. Conf. on Knowledge Discovery and Data Mining, pp , [2] W. Hoeffding, Probability inequalities for sums of bounded random variables, Journal of the American Statistical Association, vol. 58, issue 301, pp , March [3] L. Rutkowski, L. Pietruczuk, P. Duda, M. Jaworski, Decision trees for mining data streams based on the McDiarmid s bound, IEEE Transactions on Knowledge and Data Engineering, vol. 99, no. PrePrints, 2012
32 Algorithms based on the Hoeffding bound a) Hoeffding Tree Algorithm The Hoeffding tree algorithm is a decision tree learning method for stream data classification. It uses Hoeffding trees and the Hoeffding bound, which exploit the idea that a small sample can often be enough to choose an optimal splitting attribute. 32
33 Algorithms based on the Hoeffding bound b) Very Fast Decision Tree (VFDT) The VFDT (Very Fast Decision Tree) algorithm makes several modifications to the Hoeffding tree algorithm to improve both speed and memory utilization. The modifications include: breaking near-ties during attribute selection more aggressively, computing the G function after a number of training examples, deactivating the least promising leaves whenever memory running low, dropping poor splitting attributes, improving the initialization method. 33
34 Algorithms based on the Hoeffding bound c) Concept-adapting Very Fast Decision Tree (CVFDT) CVFDT uses a sliding window approach, the main features are the following: it does not construct a new model from scatch each time, it updates statistics at the modes by incrementing the counts associated with new examples and decrementing the counts associated with old ones, if there is a concept drift, some nodes may no longer pass the Hoeffding bound. When this happens, an alternate subtree will be grown, with the new best splitting attribute at the root. As new examples stream in, the alternate subtree will continue to develop, without yet being used for classification. Once the alternate subtree becomes more accurate than the existing one, the old 34 subtree is replaced.
35 Challenges in Stream Data Mining a) Develop a theoretical tool to deal with data streams, in particular replace the Hoeffding bound (wrong technique) by another approach. 35
36 Challenges in Stream Data Mining b) Design a decision tree learning system for stream data such that its output is nearly identical to that of a conventional learner. 36
37 Challenges in Stream Data Mining c) Find (analytically) a number of samples of the infinite stream data such that a split in decision tree learning system can be made. 37
38 Decision Trees for Mining Data Streams 38
39 Decision Trees for Mining Data Streams 39
40 Decision Trees for Mining Data Streams 40
41 Decision Trees for Mining Data Streams 41
42 Decision Trees for Mining Data Streams 42
43 Decision Trees for Mining Data Streams Heoffding inequality Obviously Hoeffding inequality holds only for real valued data 43
44 Unfortunately, by analyzing descriptions it can be easily seen that they do not fit the Hoeffding s inequality probabilistic model. Firstly, only numerical data are applicable to the Hoeffding s inequality. In the general case the data do not have to be numerical. 44
45 Secondly, split measures, like information gain and Gini index, can not be expressed as a sum S of elements Y i. Moreover they are using only the frequency of elements. 45
46 McDIARMID S INEQUALITY 46
47 Z X,..., X,..., X (2) 1 i N 47
48 McDiarmid s inequality 48
49 ID3 ALGORITHM ADOPTED for STREAM DATA MINING McDiarmid s bound for information gain 49
50 50
51 Theorem 51
52 CART ALGORITHM ADOPTED for STREAM DATA MINING McDiarmid s bound for Gini index 52
53 53
54 54
55 55
56 56
57 Theorem 57 57
58 Corollary 58
59 Example 59
60 Example 60
61 61
62 The McDIARMID CART TREE ALGORITHM 62
63 63
64 64
65 65
66 66
67 67
68 Accuracy as a function of parameter 68
69 Training dataset size N 69
70 Training dataset size 70
71 ID3 ALGORITHM ADOPTED for STREAM DATA MINING Gaussian based approximation for information gain 71
72 72
73 73
74 74
75 Theorem 75
76 Example 76
77 Example 77
78 Example 78
79 Example 79
80 Example 80
81 Example 81
82 Example 82
83 ID3 ALGORITHM ADOPTED for STREAM DATA MINING The Gaussian decision tree algorithm - GDT 83
84 84
85 85
86 86
87 87
88 The dependence between the value of parameter and the accuracy of the Gaussian Decision Tree and the McDiarmid Tree algorithms. 88
89 The dependence between the number of training data and the accuracy of the Gaussian Decision Tree and the ID3 algorithm. 89
90 The dependence between the number of training data and the processing time of Gaussian Decision Tree and the ID3 algorithms. 90
91 Example for the GDT algorithm Data elements described by 3 nominal attributes: attribute a, 3 possible values: a 1, a 2, a 3 attribute b, 2 possible values: b 1, b 2 attribute c, 2 possible values: c 1, c 2 Two class problem (True (T), False (F)) Split measure function: Information gain Parameters: th = 0.01, α = 0.05, which gives: C = 99, Q(C) , z 1 α ε bound from the GDT algorithm ε = z 1 α 2Q(C) N N
92 Example for the GDT algorithm Let us assume that the statistics in the considered tree node are given as follows: attribute a The numbers of elements from class T and F are equal: the entropy is maximal: E = 1 The weighted entropy for attributes a, b, and c: attribute b E(a) = E(b) = E(c) = This gives the following values of information gain: attribute c G(a) = E E(a) = G(b) = E E(b) = G(c) = E E(c) =
93 Example for the GDT algorithm Let us assume that the statistics in the considered tree node are given as follows: attribute a Information gain values: G(a) = G(b) = G(c) = Attributes a and c provide the highest value of information gain. The difference between them is: attribute b G(c) G(a) = The bound ε for N = elements: ε = attribute c Since the difference is lower than ε:, i.e. G(c) G(a) < ε, The considered node can not be split at the moment
94 Example for the GDT algorithm Now the following data element reaches the considered node: [a 3, b 2, c 2, F]. The statistics are changed as follows: attribute a Current value of entropy: E = Current values of weighted entropy: attribute b E(a) = E(b) = E(c) = Current values of inormation gain: attribute c G(a) = G(b) = G(c) =
95 Example for the GDT algorithm Now the following data element reaches the considered node: [a 3, b 2, c 2, F]. The statistics are changed as follows: attribute a Previous values G(a) = G(b) = G(c) = Current values G(a) = G(b) = G(c) = attribute b G(c) G(a) = G(c) G(a) = ε = ε = attribute c Now the difference is higher than ε, i.e. G(c) G(a) > ε The considered node is split according to the attribute c
96 Conclusions a) We developed a theoretical tool to deal with data streams, in particular we replaced the Hoeffding bound (wrong technique) by another approaches. b) We designed a decision tree learning system for stream data such that its output is nearly identical to that of a conventional learner. c) We found analytically a number of samples of the infinite stream data such that a split in decision tree learning system can be made. 96
97 ICAISC th International Conference Artificial Intelligence and Soft Computing Zakopane, Poland, June 1-5, 2014 organized by the Polish Neural Network Society and the Academy of Management in Łódź (SWSPiZ) in cooperation with the Department of Computer Engineering at Częstochowa University of Technology and the IEEE Computational Intelligence Society Poland Chapter
98 Dziękuję bardzo za uwagę 98
Adaptive Classification Algorithm for Concept Drifting Electricity Pricing Data Streams
Adaptive Classification Algorithm for Concept Drifting Electricity Pricing Data Streams Pramod D. Patil Research Scholar Department of Computer Engineering College of Engg. Pune, University of Pune Parag
Data Mining on Streams
Data Mining on Streams Using Decision Trees CS 536: Machine Learning Instructor: Michael Littman TA: Yihua Wu Outline Introduction to data streams Overview of traditional DT learning ALG DT learning ALGs
Proposal of Credit Card Fraudulent Use Detection by Online-type Decision Tree Construction and Verification of Generality
Proposal of Credit Card Fraudulent Use Detection by Online-type Decision Tree Construction and Verification of Generality Tatsuya Minegishi 1, Ayahiko Niimi 2 Graduate chool of ystems Information cience,
Extension of Decision Tree Algorithm for Stream Data Mining Using Real Data
Fifth International Workshop on Computational Intelligence & Applications IEEE SMC Hiroshima Chapter, Hiroshima University, Japan, November 10, 11 & 12, 2009 Extension of Decision Tree Algorithm for Stream
Data Mining & Data Stream Mining Open Source Tools
Data Mining & Data Stream Mining Open Source Tools Darshana Parikh, Priyanka Tirkha Student M.Tech, Dept. of CSE, Sri Balaji College Of Engg. & Tech, Jaipur, Rajasthan, India Assistant Professor, Dept.
Predictive Analytics. Omer Mimran, Spring 2015. Challenges in Modern Data Centers Management, Spring 2015 1
Predictive Analytics Omer Mimran, Spring 2015 Challenges in Modern Data Centers Management, Spring 2015 1 Information provided in these slides is for educational purposes only Challenges in Modern Data
Data Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
Using One-Versus-All classification ensembles to support modeling decisions in data stream mining
Using One-Versus-All classification ensembles to support modeling decisions in data stream mining Patricia E.N. Lutu Department of Computer Science, University of Pretoria, South Africa [email protected]
Performance Analysis of Decision Trees
Performance Analysis of Decision Trees Manpreet Singh Department of Information Technology, Guru Nanak Dev Engineering College, Ludhiana, Punjab, India Sonam Sharma CBS Group of Institutions, New Delhi,India
DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES
DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES Vijayalakshmi Mahanra Rao 1, Yashwant Prasad Singh 2 Multimedia University, Cyberjaya, MALAYSIA 1 [email protected]
NEW TECHNIQUE TO DEAL WITH DYNAMIC DATA MINING IN THE DATABASE
www.arpapress.com/volumes/vol13issue3/ijrras_13_3_18.pdf NEW TECHNIQUE TO DEAL WITH DYNAMIC DATA MINING IN THE DATABASE Hebah H. O. Nasereddin Middle East University, P.O. Box: 144378, Code 11814, Amman-Jordan
Data mining techniques: decision trees
Data mining techniques: decision trees 1/39 Agenda Rule systems Building rule systems vs rule systems Quick reference 2/39 1 Agenda Rule systems Building rule systems vs rule systems Quick reference 3/39
Data Mining Classification: Decision Trees
Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt s (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous
COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction
COMP3420: Advanced Databases and Data Mining Classification and prediction: Introduction and Decision Tree Induction Lecture outline Classification versus prediction Classification A two step process Supervised
D A T A M I N I N G C L A S S I F I C A T I O N
D A T A M I N I N G C L A S S I F I C A T I O N FABRICIO VOZNIKA LEO NARDO VIA NA INTRODUCTION Nowadays there is huge amount of data being collected and stored in databases everywhere across the globe.
BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL
The Fifth International Conference on e-learning (elearning-2014), 22-23 September 2014, Belgrade, Serbia BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL SNJEŽANA MILINKOVIĆ University
Introducing diversity among the models of multi-label classification ensemble
Introducing diversity among the models of multi-label classification ensemble Lena Chekina, Lior Rokach and Bracha Shapira Ben-Gurion University of the Negev Dept. of Information Systems Engineering and
Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification
Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification Tina R. Patil, Mrs. S. S. Sherekar Sant Gadgebaba Amravati University, Amravati [email protected], [email protected]
A Study of Detecting Credit Card Delinquencies with Data Mining using Decision Tree Model
A Study of Detecting Credit Card Delinquencies with Data Mining using Decision Tree Model ABSTRACT Mrs. Arpana Bharani* Mrs. Mohini Rao** Consumer credit is one of the necessary processes but lending bears
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,
Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier
International Journal of Recent Technology and Engineering (IJRTE) ISSN: 2277-3878, Volume-1, Issue-6, January 2013 Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing
Ensemble Data Mining Methods
Ensemble Data Mining Methods Nikunj C. Oza, Ph.D., NASA Ames Research Center, USA INTRODUCTION Ensemble Data Mining Methods, also known as Committee Methods or Model Combiners, are machine learning methods
Data Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
Visualization of large data sets using MDS combined with LVQ.
Visualization of large data sets using MDS combined with LVQ. Antoine Naud and Włodzisław Duch Department of Informatics, Nicholas Copernicus University, Grudziądzka 5, 87-100 Toruń, Poland. www.phys.uni.torun.pl/kmk
Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.
Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví Pavel Kříž Seminář z aktuárských věd MFF 4. dubna 2014 Summary 1. Application areas of Insurance Analytics 2. Insurance Analytics
Performance and efficacy simulations of the mlpack Hoeffding tree
Performance and efficacy simulations of the mlpack Hoeffding tree Ryan R. Curtin and Jugal Parikh November 24, 2015 1 Introduction The Hoeffding tree (or streaming decision tree ) is a decision tree induction
The Optimality of Naive Bayes
The Optimality of Naive Bayes Harry Zhang Faculty of Computer Science University of New Brunswick Fredericton, New Brunswick, Canada email: hzhang@unbca E3B 5A3 Abstract Naive Bayes is one of the most
DATA MINING TECHNIQUES AND APPLICATIONS
DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,
Data Mining Methods: Applications for Institutional Research
Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014
Data Mining Techniques Chapter 6: Decision Trees
Data Mining Techniques Chapter 6: Decision Trees What is a classification decision tree?.......................................... 2 Visualizing decision trees...................................................
Decision Trees. Andrew W. Moore Professor School of Computer Science Carnegie Mellon University. www.cs.cmu.edu/~awm [email protected].
Decision Trees Andrew W. Moore Professor School of Computer Science Carnegie Mellon University www.cs.cmu.edu/~awm [email protected] 42-268-7599 Copyright Andrew W. Moore Slide Decision Trees Decision trees
Massive Online Analysis Manual
Massive Online Analysis Manual Albert Bifet and Richard Kirkby August 2009 Contents 1 Introduction 1 1.1 Data streams Evaluation..................... 2 2 Installation 5 3 Using the GUI 7 4 Using the command
Research Article www.ijptonline.com EFFICIENT TECHNIQUES TO DEAL WITH BIG DATA CLASSIFICATION PROBLEMS G.Somasekhar 1 *, Dr. K.
ISSN: 0975-766X CODEN: IJPTFI Available Online through Research Article www.ijptonline.com EFFICIENT TECHNIQUES TO DEAL WITH BIG DATA CLASSIFICATION PROBLEMS G.Somasekhar 1 *, Dr. K.Karthikeyan 2 1 Research
How To Use Data Mining For Knowledge Management In Technology Enhanced Learning
Proceedings of the 6th WSEAS International Conference on Applications of Electrical Engineering, Istanbul, Turkey, May 27-29, 2007 115 Data Mining for Knowledge Management in Technology Enhanced Learning
Keywords Data mining, Classification Algorithm, Decision tree, J48, Random forest, Random tree, LMT, WEKA 3.7. Fig.1. Data mining techniques.
International Journal of Emerging Research in Management &Technology Research Article October 2015 Comparative Study of Various Decision Tree Classification Algorithm Using WEKA Purva Sewaiwar, Kamal Kant
Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal
Learning Example Chapter 18: Learning from Examples 22c:145 An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients. A decision is needed: whether
Credit Card Fraud Detection and Concept-Drift Adaptation with Delayed Supervised Information
Credit Card Fraud Detection and Concept-Drift Adaptation with Delayed Supervised Information Andrea Dal Pozzolo, Giacomo Boracchi, Olivier Caelen, Cesare Alippi, and Gianluca Bontempi 15/07/2015 IEEE IJCNN
A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier
A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,
Classification and Prediction
Classification and Prediction Slides for Data Mining: Concepts and Techniques Chapter 7 Jiawei Han and Micheline Kamber Intelligent Database Systems Research Lab School of Computing Science Simon Fraser
Random forest algorithm in big data environment
Random forest algorithm in big data environment Yingchun Liu * School of Economics and Management, Beihang University, Beijing 100191, China Received 1 September 2014, www.cmnt.lv Abstract Random forest
Mobile Phone APP Software Browsing Behavior using Clustering Analysis
Proceedings of the 2014 International Conference on Industrial Engineering and Operations Management Bali, Indonesia, January 7 9, 2014 Mobile Phone APP Software Browsing Behavior using Clustering Analysis
Healthcare Data Mining: Prediction Inpatient Length of Stay
3rd International IEEE Conference Intelligent Systems, September 2006 Healthcare Data Mining: Prediction Inpatient Length of Peng Liu, Lei Lei, Junjie Yin, Wei Zhang, Wu Naijun, Elia El-Darzi 1 Abstract
IDENTIFYING BANK FRAUDS USING CRISP-DM AND DECISION TREES
IDENTIFYING BANK FRAUDS USING CRISP-DM AND DECISION TREES Bruno Carneiro da Rocha 1,2 and Rafael Timóteo de Sousa Júnior 2 1 Bank of Brazil, Brasília-DF, Brazil [email protected] 2 Network Engineering
ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION
ISSN 9 X INFORMATION TECHNOLOGY AND CONTROL, 00, Vol., No.A ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION Danuta Zakrzewska Institute of Computer Science, Technical
AUTO CLAIM FRAUD DETECTION USING MULTI CLASSIFIER SYSTEM
AUTO CLAIM FRAUD DETECTION USING MULTI CLASSIFIER SYSTEM ABSTRACT Luis Alexandre Rodrigues and Nizam Omar Department of Electrical Engineering, Mackenzie Presbiterian University, Brazil, São Paulo [email protected],[email protected]
Efficient Integration of Data Mining Techniques in Database Management Systems
Efficient Integration of Data Mining Techniques in Database Management Systems Fadila Bentayeb Jérôme Darmont Cédric Udréa ERIC, University of Lyon 2 5 avenue Pierre Mendès-France 69676 Bron Cedex France
Data Mining Algorithms Part 1. Dejan Sarka
Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka ([email protected]) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses
Data Mining for Knowledge Management. Classification
1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh
How To Solve The Kd Cup 2010 Challenge
A Lightweight Solution to the Educational Data Mining Challenge Kun Liu Yan Xing Faculty of Automation Guangdong University of Technology Guangzhou, 510090, China [email protected] [email protected]
ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA
ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA D.Lavanya 1 and Dr.K.Usha Rani 2 1 Research Scholar, Department of Computer Science, Sree Padmavathi Mahila Visvavidyalayam, Tirupati, Andhra Pradesh,
How To Classify Data Stream Mining
JOURNAL OF COMPUTERS, VOL. 8, NO. 11, NOVEMBER 2013 2873 A Semi-supervised Ensemble Approach for Mining Data Streams Jing Liu 1,2, Guo-sheng Xu 1,2, Da Xiao 1,2, Li-ze Gu 1,2, Xin-xin Niu 1,2 1.Information
Predicting Students Final GPA Using Decision Trees: A Case Study
Predicting Students Final GPA Using Decision Trees: A Case Study Mashael A. Al-Barrak and Muna Al-Razgan Abstract Educational data mining is the process of applying data mining tools and techniques to
Static Data Mining Algorithm with Progressive Approach for Mining Knowledge
Global Journal of Business Management and Information Technology. Volume 1, Number 2 (2011), pp. 85-93 Research India Publications http://www.ripublication.com Static Data Mining Algorithm with Progressive
Simple and efficient online algorithms for real world applications
Simple and efficient online algorithms for real world applications Università degli Studi di Milano Milano, Italy Talk @ Centro de Visión por Computador Something about me PhD in Robotics at LIRA-Lab,
Prediction of Heart Disease Using Naïve Bayes Algorithm
Prediction of Heart Disease Using Naïve Bayes Algorithm R.Karthiyayini 1, S.Chithaara 2 Assistant Professor, Department of computer Applications, Anna University, BIT campus, Tiruchirapalli, Tamilnadu,
Data Quality Mining: Employing Classifiers for Assuring consistent Datasets
Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Fabian Grüning Carl von Ossietzky Universität Oldenburg, Germany, [email protected] Abstract: Independent
Gerry Hobbs, Department of Statistics, West Virginia University
Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit
Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing
www.ijcsi.org 198 Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing Lilian Sing oei 1 and Jiayang Wang 2 1 School of Information Science and Engineering, Central South University
EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER. Copyr i g ht 2013, SAS Ins titut e Inc. All rights res er ve d.
EXPLORING & MODELING USING INTERACTIVE DECISION TREES IN SAS ENTERPRISE MINER ANALYTICS LIFECYCLE Evaluate & Monitor Model Formulate Problem Data Preparation Deploy Model Data Exploration Validate Models
Uplift Modeling in Direct Marketing
Paper Uplift Modeling in Direct Marketing Piotr Rzepakowski and Szymon Jaroszewicz National Institute of Telecommunications, Warsaw, Poland Abstract Marketing campaigns directed to randomly selected customers
Knowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right
A Data Generator for Multi-Stream Data
A Data Generator for Multi-Stream Data Zaigham Faraz Siddiqui, Myra Spiliopoulou, Panagiotis Symeonidis, and Eleftherios Tiakas University of Magdeburg ; University of Thessaloniki. [siddiqui,myra]@iti.cs.uni-magdeburg.de;
ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS
ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS Abstract D.Lavanya * Department of Computer Science, Sri Padmavathi Mahila University Tirupati, Andhra Pradesh, 517501, India [email protected]
CLOUDS: A Decision Tree Classifier for Large Datasets
CLOUDS: A Decision Tree Classifier for Large Datasets Khaled Alsabti Department of EECS Syracuse University Sanjay Ranka Department of CISE University of Florida Vineet Singh Information Technology Lab
A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE
A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE Kasra Madadipouya 1 1 Department of Computing and Science, Asia Pacific University of Technology & Innovation ABSTRACT Today, enormous amount of data
Predicting required bandwidth for educational institutes using prediction techniques in data mining (Case Study: Qom Payame Noor University)
260 IJCSNS International Journal of Computer Science and Network Security, VOL.11 No.6, June 2011 Predicting required bandwidth for educational institutes using prediction techniques in data mining (Case
Selecting Data Mining Model for Web Advertising in Virtual Communities
Selecting Data Mining for Web Advertising in Virtual Communities Jerzy Surma Faculty of Business Administration Warsaw School of Economics Warsaw, Poland e-mail: [email protected] Mariusz Łapczyński
SVM Ensemble Model for Investment Prediction
19 SVM Ensemble Model for Investment Prediction Chandra J, Assistant Professor, Department of Computer Science, Christ University, Bangalore Siji T. Mathew, Research Scholar, Christ University, Dept of
CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.
Lecture Machine Learning Milos Hauskrecht [email protected] 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht [email protected] 539 Sennott
Data quality in Accounting Information Systems
Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania
Agenda. Interface Agents. Interface Agents
Agenda Marcelo G. Armentano Problem Overview Interface Agents Probabilistic approach Monitoring user actions Model of the application Model of user intentions Example Summary ISISTAN Research Institute
How To Find Influence Between Two Concepts In A Network
2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation Influence Discovery in Semantic Networks: An Initial Approach Marcello Trovati and Ovidiu Bagdasar School of Computing
MINING THE DATA FROM DISTRIBUTED DATABASE USING AN IMPROVED MINING ALGORITHM
MINING THE DATA FROM DISTRIBUTED DATABASE USING AN IMPROVED MINING ALGORITHM J. Arokia Renjit Asst. Professor/ CSE Department, Jeppiaar Engineering College, Chennai, TamilNadu,India 600119. Dr.K.L.Shunmuganathan
Prediction of Stock Performance Using Analytical Techniques
136 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 5, NO. 2, MAY 2013 Prediction of Stock Performance Using Analytical Techniques Carol Hargreaves Institute of Systems Science National University
Classification On The Clouds Using MapReduce
Classification On The Clouds Using MapReduce Simão Martins Instituto Superior Técnico Lisbon, Portugal [email protected] Cláudia Antunes Instituto Superior Técnico Lisbon, Portugal [email protected]
PREDICTIVE MODELING OF INTER-TRANSACTION ASSOCIATION RULES A BUSINESS PERSPECTIVE
International Journal of Computer Science and Applications, Vol. 5, No. 4, pp 57-69, 2008 Technomathematics Research Foundation PREDICTIVE MODELING OF INTER-TRANSACTION ASSOCIATION RULES A BUSINESS PERSPECTIVE
An Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset
P P P Health An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset Peng Liu 1, Elia El-Darzi 2, Lei Lei 1, Christos Vasilakis 2, Panagiotis Chountas 2, and Wei Huang
Machine Learning for Medical Image Analysis. A. Criminisi & the InnerEye team @ MSRC
Machine Learning for Medical Image Analysis A. Criminisi & the InnerEye team @ MSRC Medical image analysis the goal Automatic, semantic analysis and quantification of what observed in medical scans Brain
HUAWEI Advanced Data Science with Spark Streaming. Albert Bifet (@abifet)
HUAWEI Advanced Data Science with Spark Streaming Albert Bifet (@abifet) Huawei Noah s Ark Lab Focus Intelligent Mobile Devices Data Mining & Artificial Intelligence Intelligent Telecommunication Networks
International Journal of World Research, Vol: I Issue XIII, December 2008, Print ISSN: 2347-937X DATA MINING TECHNIQUES AND STOCK MARKET
DATA MINING TECHNIQUES AND STOCK MARKET Mr. Rahul Thakkar, Lecturer and HOD, Naran Lala College of Professional & Applied Sciences, Navsari ABSTRACT Without trading in a stock market we can t understand
Big Data & Scripting Part II Streaming Algorithms
Big Data & Scripting Part II Streaming Algorithms 1, Counting Distinct Elements 2, 3, counting distinct elements problem formalization input: stream of elements o from some universe U e.g. ids from a set
Data Mining. 1 Introduction 2 Data Mining methods. Alfred Holl Data Mining 1
Data Mining 1 Introduction 2 Data Mining methods Alfred Holl Data Mining 1 1 Introduction 1.1 Motivation 1.2 Goals and problems 1.3 Definitions 1.4 Roots 1.5 Data Mining process 1.6 Epistemological constraints
Decision Tree Learning on Very Large Data Sets
Decision Tree Learning on Very Large Data Sets Lawrence O. Hall Nitesh Chawla and Kevin W. Bowyer Department of Computer Science and Engineering ENB 8 University of South Florida 4202 E. Fowler Ave. Tampa
A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery
A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery Runu Rathi, Diane J. Cook, Lawrence B. Holder Department of Computer Science and Engineering The University of Texas at Arlington
Mining Wiki Usage Data for Predicting Final Grades of Students
Mining Wiki Usage Data for Predicting Final Grades of Students Gökhan Akçapınar, Erdal Coşgun, Arif Altun Hacettepe University [email protected], [email protected], [email protected]
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.
CI6227: Data Mining Lesson 11b: Ensemble Learning Sinno Jialin PAN Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore Acknowledgements: slides are adapted from the lecture notes
!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"
!"!!"#$$%&'()*+$(,%!"#$%$&'()*""%(+,'-*&./#-$&'(-&(0*".$#-$1"(2&."3$'45"!"#"$%&#'()*+',$$-.&#',/"-0%.12'32./4'5,5'6/%&)$).2&'7./&)8'5,5'9/2%.%3%&8':")08';:
Financial Trading System using Combination of Textual and Numerical Data
Financial Trading System using Combination of Textual and Numerical Data Shital N. Dange Computer Science Department, Walchand Institute of Rajesh V. Argiddi Assistant Prof. Computer Science Department,
Decision-Tree Learning
Decision-Tree Learning Introduction ID3 Attribute selection Entropy, Information, Information Gain Gain Ratio C4.5 Decision Trees TDIDT: Top-Down Induction of Decision Trees Numeric Values Missing Values
An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015
An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content
TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM
TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam
A Review of Online Decision Tree Learning Algorithms
A Review of Online Decision Tree Learning Algorithms Johns Hopkins University Department of Computer Science Corbin Rosset June 17, 2015 Abstract This paper summarizes the most impactful literature of
Fine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms
Fine Particulate Matter Concentration Level Prediction by using Tree-based Ensemble Classification Algorithms Yin Zhao School of Mathematical Sciences Universiti Sains Malaysia (USM) Penang, Malaysia Yahya
