Extension of Decision Tree Algorithm for Stream Data Mining Using Real Data
|
|
|
- Austin Hawkins
- 10 years ago
- Views:
Transcription
1 Fifth International Workshop on Computational Intelligence & Applications IEEE SMC Hiroshima Chapter, Hiroshima University, Japan, November 10, 11 & 12, 2009 Extension of Decision Tree Algorithm for Stream Data Mining Using Real Data Tatsuya Minegishi, Masayuki Ise, Ayahiko Niimi, Osamu Konishi Graduate School of Future University-Hakodate, Systems Information Science Future University Hakodate. 2, 116 Kameda-nakano-cho, Hakodate, Hokkaido, Japan {g , Future University-Hakodate, School of Systems Information Science Future University Hakodate. 2, 116 Kameda-nakano-cho, Hakodate, Hokkaido, Japan {niimi, Abstract Recently, because of increasing amount of data in the society, data stream mining targeting large scale data has attracted attention. The data mining is a technology of discovery new knowledge and patterns from the massive amounts of data, and what the data correspond to data stream is data stream mining. In this paper, we propose the feature selection with online decision tree. At first, we construct online type decision tree to regard credit card transaction data as data stream on data stream mining. At second, we select attributes thought to be important for detection of illegal use. We apply VFDT (Very Fast Decision Tree learner) algorithm to online type decision tree construction. I. INTRODUCTIONS In recently network society, the development of information processing technique enables us to collect and utilize massive amount of data, and the data mining is technology of discovery new knowledge and patterns in those data has been paid attention. But those data are changing from moment to moment, and have become new type large scale data. Record of financial and distributional transactions, telecommunications records and network access logs are typical examples, and those data are called data stream. By data stream, it is that the conditions temporally-changed massive amount of data record are generated, cumulative and consumed are looked on as flow of data (stream). In the real world, the requirement that whenever we need information, we want to elicit from those large scale data stream has been growing. At first glance data mining seems to be effective, but data stream has following dynamic properties: (i) (ii) (iii) (iv) massive amount of data are coming over high-speed stream temporally-changing continue to arrive permanently, and there is a limitation applied data stream to data mining intending static data. Data mining to efficiently deal in large scale data stream, therefore data stream mining technology has been developed. In this paper, we conduct data stream mining to deal in real data and propose online type decision tree construction as algorithm of machine learning. For dealing in real data, we conducted verification experiments to use credit card transaction data. We say that those data fit into concepts of data stream from amount of data and contents, and because we use to think that those are most suitable in our experiments. Additionally we discuss that credit card transaction data have problem of massiveness on analysis. When we detect illegal use on data mining, the massiveness of data has a problem. It s important how to decrease amount of data using analysis to keep detection accuracy. In this paper, we propose attributes selection technique that we construct decision tree from credit card transaction data for preprocessing of data is a process of data mining and analyze using appearing attributes. Therefore we select attributes thought to be important for detection of illegal use from the online type decision tree. And we consider those attributes. We compare the difference of attributes selected from decision tree to attributes using for analysis of existing technique. The next section presents data mining, data stream mining and credit card transactions. In section III we describe proposed algorithm, and in section IV we describe their verification experiments. From the results, in section V and VI we discuss them results. The last section we conclude our proposed method and discuss our future works. II. RELATED WORK A. Offline Type Decision Tree Construction Decision tree is graphic representation described by tree diagram hierarchized multistage branching process when it multistage and repeatedly executes decision-making or classification of stuff. It started out root node with given data. Then it is pursuing and answering questions concerned attributes of data. It classifies by making value having finally arriving leaf node into class of the data. Offline type decision tree construction is the method of constructing decision tree after taking all examples as input. The most well-known decision 208
2 tree algorithm includes C4.5[1]. C4.5 is decision tree construction algorithm developed by J. Ross Quinlan who is Australian researcher. It recurrently splits data while selecting the attribute and the question that classifies data best based on entropy info = p i log 2 p i (1) where the p i is the occurrence probability of the i th event for the information gain Gain = (average entropy bef ore splitting average entropy af ter splitting), (2) and recurrently constructs the decision tree at the same time. However, the error rate is high because it becomes very complicated tree only the structure was caught disregarding meaning content. Therefore, decision tree is pruned to minimize error rate and becomes more simply and easily understandable. B. Online Type Decision Tree Construction Offline type decision tree construction in II.A is a given fact that it will take all examples as input first. However, this method cannot start constructing without all examples and needs to access randomly to them. Therefore, it cannot apply to data stream. Decision tree construction method to improve weakness of this offline type is called online type decision tree construction. Representative example includes VFDT(Very Fast Decision Tree learner)[2]. VFDT is decision tree construction algorithm for data stream and adaptively grows the tree without waiting for all examples arrival. It does not store any examples in main memory, requiring only space proportional to the size of the tree and associated sufficient statistics. It can learn by seeing each example at once, and therefore does not require examples from an online stream to ever be stored. VFDT is the same as C4.5 to grow from root node in sequence but every time it takes new data, it sets up leaf node which the data arrive at current tree, and stores arrival data there. Then, the data of often visible type accumulate at leaf nodes, after data to satisfy enough statistical criterion accumulate, it grows the leaf using those data to make a more detail prediction. The statistical criterion to grow leaf moreover includes Hoeffding bound. The data accumulated at leaf is part of all available data, therefore there is a possibility that the leaf has error. However, in consideration of the infinitely-long data stream produced stochastically based on stationary distribution, data sets arriving at each leaves are considered as ideal them in offline case. Consider a real-valued random variable r whose range is R. Suppose we have made n independent observations of this variable, and computed their mean r. The Hoeffding bound states that, with probability 1 δ, the true mean of the variable is at least r ɛ, where R2 ln(1/δ) ɛ = (3) 2n If the difference between the best standard level at one leaf and the next standard level is bigger, then it makes branching from the leaf. C. Credit Card Transaction Data In this paper, we use the credit card transaction data as real data. In actual credit card transactions, the data are complex changing and arriving continuously online. The data are following: (i) (ii) (iii) (iv) arrive around one million transactions per day, the speed of less than one second per one transaction, arrive around one hundred transactions per one second at the peak time, 24 hours a day, every day, continue to arrive permanently. Therefore, the credit card transaction data can be exactly called a data stream. However, even if we use data mining for those data, around 2,000 transactions per day which people can accommodate by monitoring are generally. Therefore, we have to detect suspicious transactions data effectively under the rigid conditions of 0.02% detection numbers of the total. In addition, there is an issue people detect extremely low illegal use from massive amount of transaction data because real illegal use is extremely low rate that is from 0.02% to 0.05% to all transaction data. The data we use in this paper is described one transaction data as CSV format in time order and the data exists as attributes. Credit card transaction data have 124 attributes: 84 are called transaction data include one attribute to discriminate whether the data is illegal use, and the others are called behavior data calculated by user s usage. The file size is about 700 MB per one month. As we said that the illegal use rate is from 0.02% to 0.05% before, this data re-sampled to about 0.5%. III. PROPOSAL TECHNIQUE A. Attribute Selection from Offline Type Decision Tree Construction When we detect illegal use on data mining, the massiveness of data has a problem. It s important how to decrease amount of data using analysis to keep detection accuracy. In this paper, we propose attributes selection technique that we construct decision tree from credit card transaction data for preprocessing of data is a process of data mining and analyze using appearing attributes. Additionally, we change number of attributes and type of attributes using analysis to consider some attribute selections. There are following technique as an existing research[3][4]. (i) Constructing decision tree. We construct decision tree using data set of (a) in IV. Those trees have about the same attribute type and positions of attributes until rank of top five therefore we regard as stable rank. 209
3 (ii) Gathering data that failed in classification of the tree. We collect only data failed classifying until stable rank. (iii) Constructing decision tree using only gathered data again. We construct decision tree using only those data again (iv) Adding the attributes for analysis too. When the decision tree constructed using only data failed classifying contains never seen before attributes, we add those attributes to analytic attributes. We compare the difference of attributes selected from decision tree to attributes using for analysis of existing technique. B. Attribute Selection from Online Type Decision Tree Construction In III.A we proposed offline type decision tree construction from credit card transaction data, and attribute selection. Here III.B, we propose online type decision tree construction using real data. When a certain level of decision tree constructs, we select attributes in order near from root node of the tree. We regard the real data using III.A as stream data like real credit card transaction. We apply VFDT we described in II.B to this algorithm. Moreover when we use VFDT, we use VFML(Very Fast Machine Learning)[5] is implementation code of machine learning for data stream, and construct VFDT by it. In this paper, we definitely don t decide how to select attributes from VFDT. But we have two methods of selection. The first method, for constructing VFDT in arriving data through stream, we select attributes when the branches grow from certain leaves. This algorithm started from root node. Then it grows the VFDT gradually. When new attributes appear, we select those attributes as analytic attributes. The other, for constructing VFDT in arriving data through stream, we select appeared attributes when we stop growing VFDT after a certain period of time. In both cases, it is important to determine when attributes are selected. It is whether decision tree change around stopping growing VFDT is important. These are changed by data arrived through stream. As data for construction are over the long term, there is a possibility that usage tendency between past data and latest data changes significantly. But we got a result that VFDT has no big differences if we construct VFDT using data for about a year. But as data for construction are over the long term, there is a possibility that usage tendency between past data and latest data changes significantly. As a result, it is expected to change VFDT. A. Construction of Offline Type Decision Tree We use following data sets in the experiment. The number of attributes of data - 57 transaction data attributes and 42 behavior data attributes Sampling rate of illegal use (three ways) (a) 0.02% is actual illegal use rate (b) 0.5% is sampling rate of data provided (c) 10% is setup in this experiment Usually provided data have around 120 attribute but we except some attributes are irrelevant for construction decision tree and has low relation for illegal use models. We also use around fifty thousand data. We construct decision tree using three ways data by J48 algorithm based on C4.5 implemented in data mining tool software called Weka[6]. B. Construction of Online Type Decision Tree In this experiment, we use the 10% data described as (c) in IV.A. As we discuss later, this data was the best results on offline type decision tree construction in IV.A. We construct VFDT by VFML using this data. V. EXPERIMENTAL RESULTS In the case of offline type decision tree, (a) became tree split two leaf nodes by root node. (b) became tree that has 101 nodes including 51 leaves. In Fig.1, we hold showing real attribute names back as confidential information. So we set low resolution consciously. Both (a) and (b) have more than 99 % accuracy but they classify almost all illegal data as normal data. This result shows that the accuracy of decision tree is high, but actually it cannot classify illegal data exactly. Fig.2 has 1,221 nodes including 611 leaves. Fig.2 is set low resolution as well as Fig.1. Also its accuracy is %. In tis experiment, this tree is able to classify illegal data well because the 10% illegal use rate is higher than another two cases. In the case of online type decision tree, we constructed it using data (c) for offline type. The accuracy is % and the size is 91. VI. EVALUATIONS We show results of offline and online type decision tree. TABLE I RESULT OF NUMBER OF ATTRIBUTES All Transaction Behavior C VFDT IV. EXPERIMENTS We constructed online type decision tree from credit card transaction data provided by described technique, and compare accuracy, size and attributes. 210
4 Fig. 1. Decision Tree (b) TABLE II RESULTS OF ACCURACY AND SIZE Accuracy(%) Size C ,221 VFDT We constructed both tree random-sampled 10 data without changing illegal use rate. In this Table I and Table II, it is average of 10 trees. Each tree set up the results using 10-folds cross validation, therefore each method constructs actually 100 trees. In the accuracy of classification, VFDT is lower than C4.5. It is because not construction of decision tree after receiving all examples as input data like C4.5 but growing tree using Hoeffding bound at the point of accumulating adequate data in nodes. The size of VFDT is less than 1/10 of C4.5. In terms of attributes, 10 offline type decision trees have almost the same feature and positions until 5th depth. Therefore there are no major differences and attributes selected from their trees are important in classification of illegal use. C4.5 has 55 attributes from 99 input attributes using construction. Especially, there are many behavior attributes around the top. However, in spite of having 1,200 nodes, appeared attributes are 55 out of about half of using attributes. As a result, same specific attributes are near root node appear many times. Also we constructed 10 online type decision trees, but there are 3 types as a root. Moreover, in case of tree having same root, attributes that follow are different from C4.5 s attributes and positions. Because those attributes as root node show paying division and limit of amount of usage, they have some features of illegal use. And we asked about this results to experts, and proved that those attributes are appropriate. In online type decision tree, the number of attribute is an average of 31. As well as C4.5, number of attributes got fewer than all data used by construction. Also behavior attributes are more than transaction attributes. This is the reason that C4.5 and VFDT adopt many behavior attributes as nodes. Therefore behavior attributes affect detection of illegal use. In this paper, in case of the data that are large-scale and very low illegal use rate for constructing, we don t have a concrete Fig. 2. Decision Tree (c) 211
5 plan for improving accuracy of classification without growing tree structure size. The size of tree of C4.5 is about 1,200 but appeared attributes until stable rank are 14 from 99 input attributes using construction. Also appeared attributes from all nodes of tree are 55. TABLE III RESULT OF NUMBER OF ATTRIBUTES All Transaction Behavior Stable rank All nodes Table III shows the number of transaction attributes, behavior attributes, and all attributes. For this reason as well as C4.5, appeared attributes are expected to drastically decrease than input attributes using construction in spite of tree structure size for VFDT. Also the detection rate of illegal use keeps more than 80% if analytic attributes decrease to 25%. Therefore it is known that type of attributes is more important rather than the number of attributes.[4] VII. CONCLUSION AND FUTURE WORK In this paper, we constructed online type decision tree using credit card transaction data with data stream mining. And we proposed that we select attributes thought to be important for detection of illegal use from the tree. At first, we applied VFDT algorithm to online type decision tree construction. At second, we compared VFDT to C4.5 from focused on accuracy, size and attributes. As a result, the accuracy of VFDT is inferior to C4.5. However, it is simple tree whose size is less than 1/10 to C4.5. In future work, we are aiming to construct VDFT without using all examples like C4.5. It would appear that we can cut the time for analysis to select attributes at proper time. We need to define rule of feature selection from VFDT to keep decline accuracy constant. REFERENCES [1] J. Ross Quinlan. C4.5 : programs for machine learning, Morgan Kaufmann, San Mateo, Calif., [2] P. Domingos. G. Hulten. Mining High-Speed Data Streams, Proceedings of the ACM Sixth International Conference on Knowledge Discovery and Data Mining, ACM Press, pp.71-80,2000. [3] T. Minegishi. M. ISE. A. Niimi. O. Konishi. A proposal of abusing credit cards detecting systems using attribute selection method with multistage decision tree construction, Information Processing Society of Japan, The 71st National Convention of IPSJ, pp ,2009. [4] T. Minegishi. M. ISE. A. Niimi. O. Konishi. Comparison with two attribute selection methods using actual data, stepwise procedure in logistic regression analysis and selection by decision, Japan Society for Fuzzy Theory and Intelligent Informatics, The 25th Fuzzy System Symposium, 1A2-02 (6 pages in CD-ROM),2009. [5] P. Domingos. G. Hulten. VFML - a toolkit for mining high-speed timechanging data streams, [6] Ian. H Witten. Eibe. Frank. DATA MINING, Morgan Kaufmann, pp ,
Proposal of Credit Card Fraudulent Use Detection by Online-type Decision Tree Construction and Verification of Generality
Proposal of Credit Card Fraudulent Use Detection by Online-type Decision Tree Construction and Verification of Generality Tatsuya Minegishi 1, Ayahiko Niimi 2 Graduate chool of ystems Information cience,
Data Mining on Streams
Data Mining on Streams Using Decision Trees CS 536: Machine Learning Instructor: Michael Littman TA: Yihua Wu Outline Introduction to data streams Overview of traditional DT learning ALG DT learning ALGs
IDENTIFYING BANK FRAUDS USING CRISP-DM AND DECISION TREES
IDENTIFYING BANK FRAUDS USING CRISP-DM AND DECISION TREES Bruno Carneiro da Rocha 1,2 and Rafael Timóteo de Sousa Júnior 2 1 Bank of Brazil, Brasília-DF, Brazil [email protected] 2 Network Engineering
Data Mining & Data Stream Mining Open Source Tools
Data Mining & Data Stream Mining Open Source Tools Darshana Parikh, Priyanka Tirkha Student M.Tech, Dept. of CSE, Sri Balaji College Of Engg. & Tech, Jaipur, Rajasthan, India Assistant Professor, Dept.
Adaptive Classification Algorithm for Concept Drifting Electricity Pricing Data Streams
Adaptive Classification Algorithm for Concept Drifting Electricity Pricing Data Streams Pramod D. Patil Research Scholar Department of Computer Engineering College of Engg. Pune, University of Pune Parag
Unit 4 DECISION ANALYSIS. Lesson 37. Decision Theory and Decision Trees. Learning objectives:
Unit 4 DECISION ANALYSIS Lesson 37 Learning objectives: To learn how to use decision trees. To structure complex decision making problems. To analyze the above problems. To find out limitations & advantages
Data quality in Accounting Information Systems
Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania
Data Mining with Weka
Data Mining with Weka Class 1 Lesson 1 Introduction Ian H. Witten Department of Computer Science University of Waikato New Zealand weka.waikato.ac.nz Data Mining with Weka a practical course on how to
Performance and efficacy simulations of the mlpack Hoeffding tree
Performance and efficacy simulations of the mlpack Hoeffding tree Ryan R. Curtin and Jugal Parikh November 24, 2015 1 Introduction The Hoeffding tree (or streaming decision tree ) is a decision tree induction
Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100
Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100 Erkan Er Abstract In this paper, a model for predicting students performance levels is proposed which employs three
An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015
An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content
Web Document Clustering
Web Document Clustering Lab Project based on the MDL clustering suite http://www.cs.ccsu.edu/~markov/mdlclustering/ Zdravko Markov Computer Science Department Central Connecticut State University New Britain,
DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES
DECISION TREE INDUCTION FOR FINANCIAL FRAUD DETECTION USING ENSEMBLE LEARNING TECHNIQUES Vijayalakshmi Mahanra Rao 1, Yashwant Prasad Singh 2 Multimedia University, Cyberjaya, MALAYSIA 1 [email protected]
An innovative application of a constrained-syntax genetic programming system to the problem of predicting survival of patients
An innovative application of a constrained-syntax genetic programming system to the problem of predicting survival of patients Celia C. Bojarczuk 1, Heitor S. Lopes 2 and Alex A. Freitas 3 1 Departamento
Random forest algorithm in big data environment
Random forest algorithm in big data environment Yingchun Liu * School of Economics and Management, Beihang University, Beijing 100191, China Received 1 September 2014, www.cmnt.lv Abstract Random forest
Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News
Analysis of WEKA Data Mining Algorithm REPTree, Simple Cart and RandomTree for Classification of Indian News Sushilkumar Kalmegh Associate Professor, Department of Computer Science, Sant Gadge Baba Amravati
Predicting Student Performance by Using Data Mining Methods for Classification
BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 13, No 1 Sofia 2013 Print ISSN: 1311-9702; Online ISSN: 1314-4081 DOI: 10.2478/cait-2013-0006 Predicting Student Performance
Automatic Resolver Group Assignment of IT Service Desk Outsourcing
Automatic Resolver Group Assignment of IT Service Desk Outsourcing in Banking Business Padej Phomasakha Na Sakolnakorn*, Phayung Meesad ** and Gareth Clayton*** Abstract This paper proposes a framework
A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery
A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery Runu Rathi, Diane J. Cook, Lawrence B. Holder Department of Computer Science and Engineering The University of Texas at Arlington
Getting Even More Out of Ensemble Selection
Getting Even More Out of Ensemble Selection Quan Sun Department of Computer Science The University of Waikato Hamilton, New Zealand [email protected] ABSTRACT Ensemble Selection uses forward stepwise
How To Understand How Weka Works
More Data Mining with Weka Class 1 Lesson 1 Introduction Ian H. Witten Department of Computer Science University of Waikato New Zealand weka.waikato.ac.nz More Data Mining with Weka a practical course
BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL
The Fifth International Conference on e-learning (elearning-2014), 22-23 September 2014, Belgrade, Serbia BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL SNJEŽANA MILINKOVIĆ University
Data Mining Classification: Decision Trees
Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt s (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous
A comparative study of data mining (DM) and massive data mining (MDM)
A comparative study of data mining (DM) and massive data mining (MDM) Prof. Dr. P K Srimani Former Chairman, Dept. of Computer Science and Maths, Bangalore University, Director, R & D, B.U., Bangalore,
A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE
A NEW DECISION TREE METHOD FOR DATA MINING IN MEDICINE Kasra Madadipouya 1 1 Department of Computing and Science, Asia Pacific University of Technology & Innovation ABSTRACT Today, enormous amount of data
Data mining techniques: decision trees
Data mining techniques: decision trees 1/39 Agenda Rule systems Building rule systems vs rule systems Quick reference 2/39 1 Agenda Rule systems Building rule systems vs rule systems Quick reference 3/39
Lecture 10: Regression Trees
Lecture 10: Regression Trees 36-350: Data Mining October 11, 2006 Reading: Textbook, sections 5.2 and 10.5. The next three lectures are going to be about a particular kind of nonlinear predictive model,
Predicting Students Final GPA Using Decision Trees: A Case Study
Predicting Students Final GPA Using Decision Trees: A Case Study Mashael A. Al-Barrak and Muna Al-Razgan Abstract Educational data mining is the process of applying data mining tools and techniques to
Data Quality Mining: Employing Classifiers for Assuring consistent Datasets
Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Fabian Grüning Carl von Ossietzky Universität Oldenburg, Germany, [email protected] Abstract: Independent
Interactive Exploration of Decision Tree Results
Interactive Exploration of Decision Tree Results 1 IRISA Campus de Beaulieu F35042 Rennes Cedex, France (email: pnguyenk,[email protected]) 2 INRIA Futurs L.R.I., University Paris-Sud F91405 ORSAY Cedex,
Prediction of Stock Performance Using Analytical Techniques
136 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 5, NO. 2, MAY 2013 Prediction of Stock Performance Using Analytical Techniques Carol Hargreaves Institute of Systems Science National University
Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang
Classifying Large Data Sets Using SVMs with Hierarchical Clusters Presented by :Limou Wang Overview SVM Overview Motivation Hierarchical micro-clustering algorithm Clustering-Based SVM (CB-SVM) Experimental
Method of Fault Detection in Cloud Computing Systems
, pp.205-212 http://dx.doi.org/10.14257/ijgdc.2014.7.3.21 Method of Fault Detection in Cloud Computing Systems Ying Jiang, Jie Huang, Jiaman Ding and Yingli Liu Yunnan Key Lab of Computer Technology Application,
EMPIRICAL STUDY ON SELECTION OF TEAM MEMBERS FOR SOFTWARE PROJECTS DATA MINING APPROACH
EMPIRICAL STUDY ON SELECTION OF TEAM MEMBERS FOR SOFTWARE PROJECTS DATA MINING APPROACH SANGITA GUPTA 1, SUMA. V. 2 1 Jain University, Bangalore 2 Dayanada Sagar Institute, Bangalore, India Abstract- One
Introduction Predictive Analytics Tools: Weka
Introduction Predictive Analytics Tools: Weka Predictive Analytics Center of Excellence San Diego Supercomputer Center University of California, San Diego Tools Landscape Considerations Scale User Interface
AnalysisofData MiningClassificationwithDecisiontreeTechnique
Global Journal of omputer Science and Technology Software & Data Engineering Volume 13 Issue 13 Version 1.0 Year 2013 Type: Double Blind Peer Reviewed International Research Journal Publisher: Global Journals
Predicting Critical Problems from Execution Logs of a Large-Scale Software System
Predicting Critical Problems from Execution Logs of a Large-Scale Software System Árpád Beszédes, Lajos Jenő Fülöp and Tibor Gyimóthy Department of Software Engineering, University of Szeged Árpád tér
ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS
ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS Abstract D.Lavanya * Department of Computer Science, Sri Padmavathi Mahila University Tirupati, Andhra Pradesh, 517501, India [email protected]
Feature Construction for Time Ordered Data Sequences
Feature Construction for Time Ordered Data Sequences Michael Schaidnagel School of Computing University of the West of Scotland Email: [email protected] Fritz Laux Faculty of Computer Science
Decision Trees from large Databases: SLIQ
Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values
Data Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION
ISSN 9 X INFORMATION TECHNOLOGY AND CONTROL, 00, Vol., No.A ON INTEGRATING UNSUPERVISED AND SUPERVISED CLASSIFICATION FOR CREDIT RISK EVALUATION Danuta Zakrzewska Institute of Computer Science, Technical
An Introduction to WEKA. As presented by PACE
An Introduction to WEKA As presented by PACE Download and Install WEKA Website: http://www.cs.waikato.ac.nz/~ml/weka/index.html 2 Content Intro and background Exploring WEKA Data Preparation Creating Models/
EFFICIENT CLASSIFICATION OF BIG DATA USING VFDT (VERY FAST DECISION TREE)
EFFICIENT CLASSIFICATION OF BIG DATA USING VFDT (VERY FAST DECISION TREE) Sourav Roy 1, Brina Patel 2, Samruddhi Purandare 3, Minal Kucheria 4 1 Student, Computer Department, MIT College of Engineering,
Studying Auto Insurance Data
Studying Auto Insurance Data Ashutosh Nandeshwar February 23, 2010 1 Introduction To study auto insurance data using traditional and non-traditional tools, I downloaded a well-studied data from http://www.statsci.org/data/general/motorins.
How To Predict Web Site Visits
Web Site Visit Forecasting Using Data Mining Techniques Chandana Napagoda Abstract: Data mining is a technique which is used for identifying relationships between various large amounts of data in many
Decision Tree Learning on Very Large Data Sets
Decision Tree Learning on Very Large Data Sets Lawrence O. Hall Nitesh Chawla and Kevin W. Bowyer Department of Computer Science and Engineering ENB 8 University of South Florida 4202 E. Fowler Ave. Tampa
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
Financial Trading System using Combination of Textual and Numerical Data
Financial Trading System using Combination of Textual and Numerical Data Shital N. Dange Computer Science Department, Walchand Institute of Rajesh V. Argiddi Assistant Prof. Computer Science Department,
Less naive Bayes spam detection
Less naive Bayes spam detection Hongming Yang Eindhoven University of Technology Dept. EE, Rm PT 3.27, P.O.Box 53, 5600MB Eindhoven The Netherlands. E-mail:[email protected] also CoSiNe Connectivity Systems
Weather forecast prediction: a Data Mining application
Weather forecast prediction: a Data Mining application Ms. Ashwini Mandale, Mrs. Jadhawar B.A. Assistant professor, Dr.Daulatrao Aher College of engg,karad,[email protected],8407974457 Abstract
Ensemble Data Mining Methods
Ensemble Data Mining Methods Nikunj C. Oza, Ph.D., NASA Ames Research Center, USA INTRODUCTION Ensemble Data Mining Methods, also known as Committee Methods or Model Combiners, are machine learning methods
Predictive Analytics. Omer Mimran, Spring 2015. Challenges in Modern Data Centers Management, Spring 2015 1
Predictive Analytics Omer Mimran, Spring 2015 Challenges in Modern Data Centers Management, Spring 2015 1 Information provided in these slides is for educational purposes only Challenges in Modern Data
2 Decision tree + Cross-validation with R (package rpart)
1 Subject Using cross-validation for the performance evaluation of decision trees with R, KNIME and RAPIDMINER. This paper takes one of our old study on the implementation of cross-validation for assessing
Data Mining Application in Direct Marketing: Identifying Hot Prospects for Banking Product
Data Mining Application in Direct Marketing: Identifying Hot Prospects for Banking Product Sagarika Prusty Web Data Mining (ECT 584),Spring 2013 DePaul University,Chicago [email protected] Keywords:
Integration Misuse and Anomaly Detection Techniques on Distributed Sensors
Integration Misuse and Anomaly Detection Techniques on Distributed Sensors Shih-Yi Tu Chung-Huang Yang Kouichi Sakurai Graduate Institute of Information and Computer Education, National Kaohsiung Normal
Classification of Learners Using Linear Regression
Proceedings of the Federated Conference on Computer Science and Information Systems pp. 717 721 ISBN 978-83-60810-22-4 Classification of Learners Using Linear Regression Marian Cristian Mihăescu Software
Data Mining based on Rough Set and Decision Tree Optimization
Data Mining based on Rough Set and Decision Tree Optimization College of Information Engineering, North China University of Water Resources and Electric Power, China, [email protected] Abstract This paper
Introducing diversity among the models of multi-label classification ensemble
Introducing diversity among the models of multi-label classification ensemble Lena Chekina, Lior Rokach and Bracha Shapira Ben-Gurion University of the Negev Dept. of Information Systems Engineering and
Data Mining Techniques Chapter 6: Decision Trees
Data Mining Techniques Chapter 6: Decision Trees What is a classification decision tree?.......................................... 2 Visualizing decision trees...................................................
131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10
1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom
Open-Source Machine Learning: R Meets Weka
Open-Source Machine Learning: R Meets Weka Kurt Hornik Christian Buchta Achim Zeileis Weka? Weka is not only a flightless endemic bird of New Zealand (Gallirallus australis, picture from Wekapedia) but
TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM
TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam
Online Failure Prediction in Cloud Datacenters
Online Failure Prediction in Cloud Datacenters Yukihiro Watanabe Yasuhide Matsumoto Once failures occur in a cloud datacenter accommodating a large number of virtual resources, they tend to spread rapidly
Analyzing PETs on Imbalanced Datasets When Training and Testing Class Distributions Differ
Analyzing PETs on Imbalanced Datasets When Training and Testing Class Distributions Differ David Cieslak and Nitesh Chawla University of Notre Dame, Notre Dame IN 46556, USA {dcieslak,nchawla}@cse.nd.edu
Comparison of K-means and Backpropagation Data Mining Algorithms
Comparison of K-means and Backpropagation Data Mining Algorithms Nitu Mathuriya, Dr. Ashish Bansal Abstract Data mining has got more and more mature as a field of basic research in computer science and
Machine Learning Algorithms and Predictive Models for Undergraduate Student Retention
, 225 October, 2013, San Francisco, USA Machine Learning Algorithms and Predictive Models for Undergraduate Student Retention Ji-Wu Jia, Member IAENG, Manohar Mareboyana Abstract---In this paper, we have
Measuring the Performance of an Agent
25 Measuring the Performance of an Agent The rational agent that we are aiming at should be successful in the task it is performing To assess the success we need to have a performance measure What is rational
Investigation of Adherence Degree of Agile Requirements Engineering Practices in Non-Agile Software Development Organizations
Investigation of Adherence Degree of Agile Requirements Engineering Practices in Non-Agile Software Development Organizations Mennatallah H. Ibrahim Department of Computers and Information Sciences Institute
Keywords Data mining, Classification Algorithm, Decision tree, J48, Random forest, Random tree, LMT, WEKA 3.7. Fig.1. Data mining techniques.
International Journal of Emerging Research in Management &Technology Research Article October 2015 Comparative Study of Various Decision Tree Classification Algorithm Using WEKA Purva Sewaiwar, Kamal Kant
Data Mining. Knowledge Discovery, Data Warehousing and Machine Learning Final remarks. Lecturer: JERZY STEFANOWSKI
Data Mining Knowledge Discovery, Data Warehousing and Machine Learning Final remarks Lecturer: JERZY STEFANOWSKI Email: [email protected] Data Mining a step in A KDD Process Data mining:
Roulette Sampling for Cost-Sensitive Learning
Roulette Sampling for Cost-Sensitive Learning Victor S. Sheng and Charles X. Ling Department of Computer Science, University of Western Ontario, London, Ontario, Canada N6A 5B7 {ssheng,cling}@csd.uwo.ca
D A T A M I N I N G C L A S S I F I C A T I O N
D A T A M I N I N G C L A S S I F I C A T I O N FABRICIO VOZNIKA LEO NARDO VIA NA INTRODUCTION Nowadays there is huge amount of data being collected and stored in databases everywhere across the globe.
Hadoop Technology for Flow Analysis of the Internet Traffic
Hadoop Technology for Flow Analysis of the Internet Traffic Rakshitha Kiran P PG Scholar, Dept. of C.S, Shree Devi Institute of Technology, Mangalore, Karnataka, India ABSTRACT: Flow analysis of the internet
Choosing the Best Classification Performance Metric for Wrapper-based Software Metric Selection for Defect Prediction
Choosing the Best Classification Performance Metric for Wrapper-based Software Metric Selection for Defect Prediction Huanjing Wang Western Kentucky University [email protected] Taghi M. Khoshgoftaar
Research of Postal Data mining system based on big data
3rd International Conference on Mechatronics, Robotics and Automation (ICMRA 2015) Research of Postal Data mining system based on big data Xia Hu 1, Yanfeng Jin 1, Fan Wang 1 1 Shi Jiazhuang Post & Telecommunication
Decision Trees. Andrew W. Moore Professor School of Computer Science Carnegie Mellon University. www.cs.cmu.edu/~awm [email protected].
Decision Trees Andrew W. Moore Professor School of Computer Science Carnegie Mellon University www.cs.cmu.edu/~awm [email protected] 42-268-7599 Copyright Andrew W. Moore Slide Decision Trees Decision trees
COURSE RECOMMENDER SYSTEM IN E-LEARNING
International Journal of Computer Science and Communication Vol. 3, No. 1, January-June 2012, pp. 159-164 COURSE RECOMMENDER SYSTEM IN E-LEARNING Sunita B Aher 1, Lobo L.M.R.J. 2 1 M.E. (CSE)-II, Walchand
How To Make A Credit Risk Model For A Bank Account
TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző [email protected] 15 October 2015 CONTENTS Introduction 04 Random Forest Methodology 06 Transactional Data Mining Project 17 Conclusions
Uplift Modeling in Direct Marketing
Paper Uplift Modeling in Direct Marketing Piotr Rzepakowski and Szymon Jaroszewicz National Institute of Telecommunications, Warsaw, Poland Abstract Marketing campaigns directed to randomly selected customers
First Semester Computer Science Students Academic Performances Analysis by Using Data Mining Classification Algorithms
First Semester Computer Science Students Academic Performances Analysis by Using Data Mining Classification Algorithms Azwa Abdul Aziz, Nor Hafieza IsmailandFadhilah Ahmad Faculty Informatics & Computing
Classification On The Clouds Using MapReduce
Classification On The Clouds Using MapReduce Simão Martins Instituto Superior Técnico Lisbon, Portugal [email protected] Cláudia Antunes Instituto Superior Técnico Lisbon, Portugal [email protected]
Experiments in Web Page Classification for Semantic Web
Experiments in Web Page Classification for Semantic Web Asad Satti, Nick Cercone, Vlado Kešelj Faculty of Computer Science, Dalhousie University E-mail: {rashid,nick,vlado}@cs.dal.ca Abstract We address
HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION
HYBRID PROBABILITY BASED ENSEMBLES FOR BANKRUPTCY PREDICTION Chihli Hung 1, Jing Hong Chen 2, Stefan Wermter 3, 1,2 Department of Management Information Systems, Chung Yuan Christian University, Taiwan
Classification/Decision Trees (II)
Classification/Decision Trees (II) Department of Statistics The Pennsylvania State University Email: [email protected] Right Sized Trees Let the expected misclassification rate of a tree T be R (T ).
Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski [email protected]
Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trajkovski [email protected] Ensembles 2 Learning Ensembles Learn multiple alternative definitions of a concept using different training
Overview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set
Overview Evaluation Connectionist and Statistical Language Processing Frank Keller [email protected] Computerlinguistik Universität des Saarlandes training set, validation set, test set holdout, stratification
Massive Online Analysis Manual
Massive Online Analysis Manual Albert Bifet and Richard Kirkby August 2009 Contents 1 Introduction 1 1.1 Data streams Evaluation..................... 2 2 Installation 5 3 Using the GUI 7 4 Using the command
Data Mining Project Report. Document Clustering. Meryem Uzun-Per
Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...
ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA
ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA D.Lavanya 1 and Dr.K.Usha Rani 2 1 Research Scholar, Department of Computer Science, Sree Padmavathi Mahila Visvavidyalayam, Tirupati, Andhra Pradesh,
Enhanced Boosted Trees Technique for Customer Churn Prediction Model
IOSR Journal of Engineering (IOSRJEN) ISSN (e): 2250-3021, ISSN (p): 2278-8719 Vol. 04, Issue 03 (March. 2014), V5 PP 41-45 www.iosrjen.org Enhanced Boosted Trees Technique for Customer Churn Prediction
ISSN: 2321-7782 (Online) Volume 2, Issue 10, October 2014 International Journal of Advance Research in Computer Science and Management Studies
ISSN: 2321-7782 (Online) Volume 2, Issue 10, October 2014 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online
Optimization of C4.5 Decision Tree Algorithm for Data Mining Application
Optimization of C4.5 Decision Tree Algorithm for Data Mining Application Gaurav L. Agrawal 1, Prof. Hitesh Gupta 2 1 PG Student, Department of CSE, PCST, Bhopal, India 2 Head of Department CSE, PCST, Bhopal,
COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction
COMP3420: Advanced Databases and Data Mining Classification and prediction: Introduction and Decision Tree Induction Lecture outline Classification versus prediction Classification A two step process Supervised
