Mining Wiki Usage Data for Predicting Final Grades of Students
Gökhan Akçapınar, Erdal Coşgun, Arif Altun
Hacettepe University

Abstract

This study aims to predict students' final grades (A, B, C, D and F) from their wiki usage data. With default settings, the wiki database stores usage data only in a limited way, so an extension was developed to log students' login and navigation data. A tool was also developed to extract information from these data and preprocess it. The dataset comprises the server-side wiki usage logs of 81 students over 3 months. The classification performance of the Random Forest, Support Vector Machine (SVM), Naive Bayes and Boosted Classification Tree algorithms is compared. Tenfold cross-validation is used to evaluate the models. According to our findings, SVM achieves the best classification performance.

Keywords: wiki, classification, educational data mining, predicting final grade.

Main Conference Topic: New Trends and Experiences, Educational Data Mining

Introduction

The use of wikis in online learning environments has increased in recent years, especially with the growing demand for collaborative learning. Wikis can support learning in the following ways: in-class collaboration, group projects outside of class, a collaborative environment for learning from peers, peer and teacher feedback and review, and assessment and management of group performance [1]. Although wikis have great potential for online learning environments, assessing individual contributions with traditional methods is difficult and time consuming, because many students can contribute to the creation of the same content. On the other hand, like other online learning environments (e.g. forums, LMSs, VLEs), a wiki stores a large amount of student-system interaction data in its database. By analyzing these data with statistical and data mining (DM) techniques, much useful information can be extracted for tutoring, assessment, or understanding learning and learner behavior [2, 3].

Educational data mining is a research area that has emerged in recent years. It is defined as the application of data mining techniques to datasets that come from educational settings in order to address important educational questions [4]. According to Romero and Ventura [2], these questions include analysis and visualization of data, providing feedback to support instructors, recommendations for students, predicting student performance, student modeling, and grouping students. To answer these questions, educational data mining research uses DM methods such as prediction, clustering, relationship mining, discovery with models, and distillation of data for human judgment [5].

One of the key applications of educational data mining is predicting student performance. It is one of the oldest and most popular applications of DM in education, and different techniques and models have been applied so far [2]. In a recent study, Lopez et al. [6] demonstrated the potential of the classification via clustering approach to predict students' final marks (passed or failed) on the basis of their participation in forums. Their results showed that student participation in the course forum was a good predictor of the final marks for the course. Fausett and Elwasif [7] found that neural networks can be trained to predict students' grades in Calculus I based on their placement test responses. They used each student's test response pattern as input and the grade in Calculus I as the target response. Martinez [8] suggested that students' pre-college assessment data can be used with discriminant function analysis to predict academic success (a grade of A, B, or C) in community college courses. Minaei-Bidgoli and Punch [9] presented an approach that uses genetic algorithms to classify students and predict their final grades from data logged in an online learning environment. Superby et al. [10] used classification to identify factors influencing the academic success of first-year university students, by means of discriminant analysis, neural networks, random forests and decision trees. Kotsiantis et al. [11] compared six machine learning algorithms for predicting students' marks (pass or fail) on Hellenic Open University data. They also compared six regression algorithms for predicting students' marks on similar data [12]. Delgado et al. [13] trained a neural network on Moodle access logs to predict whether students would pass a course. Their model showed that it is possible to identify students who are likely to have problems passing a course. Two recent studies compared different data mining methods and techniques for classifying students on the basis of their Moodle interaction data in order to predict the final marks obtained in the course [14, 15].

In this study, we examined the extent to which students' course grades (A, B, C, D and F) can be predicted from their wiki usage. MediaWiki was used as the wiki engine. MediaWiki is a free, open-source and easy-to-use wiki engine for creating wiki-based web sites.
We developed an extension to log students' login and navigation data, which are not tracked in the default configuration.

Background

We applied four of the most commonly used classification algorithms to predict students' final grades and compared their prediction performance. The following paragraphs describe these methods briefly.

Random Forest: A random forest (RF) is a decision tree ensemble classifier in which each tree is grown using some type of randomization. Random forests, which are based on Classification and Regression Trees (CART), can process huge amounts of data with high training speed [16]. CART is a simple statistical tool that applies recursive binary partitioning of the feature space. CART is well known for its efficiency in coping with large data sets. However, as the data become noisier and less information is contained in each variable, the predictive ability of CART diminishes. RF overcomes this problem by introducing random elements into the model: subsets of variables are chosen at random, and bootstrap samples are selected with replacement for tree growing [17]. For each classification tree, a bootstrap sample is drawn from the original samples [18]. At each non-leaf node of a classification tree, the best split feature is selected from a small random subset of the original features. When the forest receives an input vector, each classification tree casts one vote, and the final prediction is determined by the majority vote of all the trees in the random forest. Because the bootstrap sample is drawn with replacement, some of the original samples are left out of it; these are called out-of-bag (OOB) data [18, 19].

Boosted Classification Tree: The algorithm for boosting trees evolved from the application of boosting methods to regression trees. The general idea is to compute a sequence of simple CARTs, where each successive tree is built for the prediction residuals of the preceding tree. This method builds binary trees, i.e., it partitions the data into two samples at each split node. Suppose that the user limits the complexity of the trees to 3 nodes only: a root node and two child nodes, i.e., a single split. Then, at each step of the boosting trees algorithm, a simple (best) partitioning of the data is determined, and the deviations of the observed values from the respective means (the residuals for each partition) are computed. The next 3-node tree is then fitted to those residuals to find another partition that further reduces the residual (error) variance for the data, given the preceding sequence of trees. It can be shown that such "additive weighted expansions" of trees can eventually produce an excellent fit of the predicted values to the observed values, even if the relationships between the predictor variables and the dependent variable of interest are very complex (nonlinear in nature). Hence, the method of gradient boosting, which fits a weighted additive expansion of simple trees, represents a very general and powerful machine learning algorithm [20].

Support Vector Machines: SVMs are a relatively new computational learning method based on the statistical learning theory presented by Vapnik [21]. In SVMs, the original input space is mapped into a high-dimensional dot product space called a feature space, and in the feature space the optimal hyperplane is determined to maximize the generalization ability of the classifier. The optimal hyperplane is found by exploiting optimization theory while respecting insights provided by statistical learning theory [22].

Naïve Bayes: Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given sample belongs to a particular class. A Bayesian classifier is based on Bayes' theorem. Naive Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class conditional independence. It is made to simplify the computation involved and, in this sense, is considered naïve [23]. Let X = (x1, x2, ..., xn) be a sample whose components represent measurements made on a set of n attributes. In Bayesian terms, X is considered evidence. Let H be some hypothesis, such as that the data X belongs to a specific class C. For classification problems, our goal is to determine P(H|X), the probability that the hypothesis H holds given the evidence (i.e. the observed data sample X). In other words, we are looking for the probability that sample X belongs to class C, given that we know the attribute description of X. According to Bayes' theorem, the probability that we want to compute, P(H|X), can be expressed in terms of the probabilities P(H), P(X|H), and P(X) as

P(H|X) = P(X|H) P(H) / P(X) [23].

Description of the Data Used

The dataset used in this study was gathered from a wiki used by university students during a third-year course. Students used the wiki to write reflections on the concepts they learned in the Computer Network and Communication course. The variables selected for this experiment were extracted from two different tables. One was the wiki's revision table, which stores all changes made by students. The other was a table that stores students' login and navigation data, logged via the extension. The revision table included more than 1900 records, and we used the WikLog tool developed by Akçapınar and Aşkar [24] to extract information automatically from this table. The usage data included the server-side wiki usage log of 81 students, with a total of 1800 sessions and page requests over 3 months. The tool was developed to extract information from these data and preprocess it. The variables extracted from these two tables are shown in Table 1. Table 2 shows summary statistics for the extracted variables.
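The kind of aggregation described in this section can be illustrated with a small sketch. It is not the WikLog tool or the authors' logging extension: the log layout (student, session id, minute offset, page title), the sample rows, and the function name `extract_features` are all assumptions made for illustration. Only four of the usage-log variables are computed, and `n_revisits` is assumed here to mean the number of pages a student requested more than once.

```python
from collections import defaultdict

# Hypothetical log rows: (student, session_id, minute_offset, page_title).
# This layout is an assumption, not the schema of the authors' extension.
LOG = [
    ("s1", 1, 0.0, "Main_Page"),
    ("s1", 1, 4.0, "OSI_Model"),
    ("s1", 1, 10.0, "Main_Page"),
    ("s1", 2, 100.0, "OSI_Model"),
    ("s1", 2, 106.0, "TCP"),
    ("s2", 3, 50.0, "Main_Page"),
    ("s2", 3, 52.0, "TCP"),
]


def extract_features(log):
    """Aggregate page requests into per-student usage variables."""
    sessions = defaultdict(lambda: defaultdict(list))  # student -> session -> times
    page_hits = defaultdict(lambda: defaultdict(int))  # student -> page -> hits
    for student, session_id, minute, page in log:
        sessions[student][session_id].append(minute)
        page_hits[student][page] += 1

    features = {}
    for student, by_session in sessions.items():
        # Session length is measured from the first to the last request.
        durations = [max(times) - min(times) for times in by_session.values()]
        hits = page_hits[student]
        features[student] = {
            "n_session": len(by_session),
            "a_time": sum(durations) / len(durations),
            "n_uniquepage": len(hits),
            # Assumed meaning: pages requested more than once.
            "n_revisits": sum(1 for count in hits.values() if count > 1),
        }
    return features


FEATURES = extract_features(LOG)
```

Each student ends up as one row of numeric variables, which is the shape a classifier needs. Variables taken from the revision table, such as edit and word counts, would be joined on the student identifier in the same way.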
Table 1. Variables of a student in a wiki

Name                Domain          Description
n_session           Usage log       Total session count
a_time              Usage log       Average time in one session
n_mainpagereturn    Usage log       Main page return rate
n_uniquepage        Usage log       Unique page visits
n_revisits          Usage log       Total number of revisited web pages
n_edit              MediaWiki db    Total number of edits
n_word              MediaWiki db    Total word count
f_grade             Class           Final grade of the student

Table 2. Descriptive statistics for variables

Variable            mean      sd        median    min     max
n_session           22,69     20,77     18,00     1,00    143,00
a_time              17,81     7,57      17,38     1,47    46,49
n_mainpagereturn    25,19     13,99     22,00     6,00    80,00
n_uniquepage        143,25    77,83     146,00    2,00    265,00
n_revisits          56,15     18,91     60,00     0,00    87,00
n_edit              21,90     29,24     9,00      0,00    130,00
n_word              161,98    251,66    60,00     0,      ,00

Results

Naive Bayes, Support Vector Machines, Boosted Classification Tree (BCT) and Random Forest (RF) were implemented in R. We used the gbm package for BCT, the randomForest package for RF, and the e1071 package for SVM and Naive Bayes. The models were evaluated with 10-fold cross-validation (CV). We compared the true classification rates of the four data mining techniques for classifying students. Table 3 shows the classification accuracy of these techniques. According to these results, the best method for our data is SVM.

Table 3. Classification accuracy of the classification algorithms

Algorithm                        Classification accuracy (%)
Random Forest ¹                  63,3
Support Vector Machine ²         67,1
Naïve Bayes ³                    59,6
Boosted Classification Tree ⁴    61,4

¹ RF: 1000 trees, mtry = 5.
² SVM: radial basis function kernel.
³ Naive Bayes: Threshold: 0.100, Sub-Sample Rate: 0,30.
⁴ Boosted Classification Tree: 1000 trees, Number of Additive Terms: 200, Learning Rate:
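To make the evaluation procedure concrete, the following is a minimal sketch of 10-fold cross-validation wrapped around a small Gaussian Naive Bayes classifier. It is written in plain Python and is not the authors' R code or the e1071 implementation. The function names, the Gaussian likelihood, the small variance floor, and the synthetic two-class data are all assumptions made for illustration, so the accuracy it produces says nothing about the study's results.

```python
import math
import random
from collections import defaultdict


def fit_gaussian_nb(rows, labels):
    """Estimate the prior P(C) and a per-attribute mean/variance for each class."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    model = {}
    for label, members in grouped.items():
        stats = []
        for column in zip(*members):
            mean = sum(column) / len(column)
            # Small floor keeps the variance strictly positive (an assumption).
            var = sum((v - mean) ** 2 for v in column) / len(column) + 1e-6
            stats.append((mean, var))
        model[label] = (len(members) / len(rows), stats)
    return model


def predict(model, row):
    """Pick the class maximizing log P(C) + sum_i log P(x_i | C).

    P(X) is omitted because it is the same for every class.
    """
    def log_posterior(label):
        prior, stats = model[label]
        score = math.log(prior)
        for value, (mean, var) in zip(row, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (value - mean) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)


def cross_validated_accuracy(rows, labels, k=10, seed=0):
    """Shuffle once, split into k folds, test each fold on a model trained on the rest."""
    order = list(range(len(rows)))
    random.Random(seed).shuffle(order)
    folds = [order[i::k] for i in range(k)]
    correct = 0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in order if i not in held]
        model = fit_gaussian_nb([rows[i] for i in train], [labels[i] for i in train])
        correct += sum(predict(model, rows[i]) == labels[i] for i in held_out)
    return correct / len(rows)


# Synthetic stand-in for two usage variables and two grade groups; not the study's data.
rng = random.Random(42)
ROWS = [(rng.gauss(30, 5), rng.gauss(40, 8)) for _ in range(40)]
ROWS += [(rng.gauss(10, 5), rng.gauss(5, 8)) for _ in range(41)]
LABELS = ["A"] * 40 + ["F"] * 41
ACCURACY = cross_validated_accuracy(ROWS, LABELS)
```

Every sample is used for testing exactly once, and the model that classifies it never saw it during training. This is what makes the resulting accuracy an estimate of performance on new students rather than a measure of fit.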
Conclusions

Although mining educational data to predict students' performance is not new, there has so far been no published paper on the use of data mining techniques to predict student performance from wiki usage data. This paper reports a comparison of Random Forest, Support Vector Machines, Naive Bayes, and Boosted Classification Tree for classifying students in order to predict the final grades obtained in an undergraduate course on the basis of their wiki usage data. In recent years, these methods have become popular and robust choices for prediction problems. We compared different classification algorithms because no single algorithm obtains the best classification accuracy in all cases and on all datasets [15, 25]. According to our findings, SVM outperforms the other methods. A possible reason for this result is that our classification problem is nonlinear. On the other hand, the tree-based methods also perform adequately for prediction. These findings show that data mining methods can help researchers assess students' individual contributions to a wiki, provided that the necessary information is stored in a database or in log files. The present study also shows that students' navigation logs and wiki usage data are good predictors of their course performance. In future work, instructors can use the extracted knowledge for decision making and for classifying new students [15]. Feedback is an important variable in changing behavior, and studies suggest that many students will respond appropriately to feedback that they understand [26]. The extracted knowledge can also be used as feedback to help students who are potentially at risk and to intervene in their problems early enough to allow them to change their behavior.

References

1. Ben-Zvi, D., Using Wiki to Promote Collaborative Learning in Statistics Education. Technology Innovations in Statistics Education, (1).
2. Romero, C. and S. Ventura, Educational Data Mining: A Review of the State of the Art. Systems, Man, and Cybernetics, Part C: Applications and Reviews, IEEE Transactions on, (6).
3. Rudas, I.J. and P. Tóth. Web Mining Usage in Course Development. in The SEFI Annual Conference. Lisbon, Portugal.
4. Romero, C. and S. Ventura, Data mining in education. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, (1).
5. Baker, R.S.J.d., Data Mining for Education, in International Encyclopedia of Education, 3rd Ed., B. McGaw, P. Peterson, and E. Baker, Editors. 2011, Oxford, UK: Elsevier.
6. Lopez, M.I., et al. Classification via clustering for predicting final marks based on student participation in forums. in 5th International Conference on Educational Data Mining, EDM. Chania, Greece.
7. Fausett, L.V. and W. Elwasif. Predicting performance from test scores using backpropagation and counterpropagation. in Neural Networks, IEEE World Congress on Computational Intelligence, 1994 IEEE International Conference on.
8. Martinez, D. Predicting Student Outcomes Using Discriminant Function Analysis.
9. Minaei-Bidgoli, B. and W. Punch, Using Genetic Algorithms for Data Mining Optimization in an Educational Web-Based System, in Genetic and Evolutionary Computation GECCO 2003, E. Cantú-Paz, et al., Editors. 2003, Springer Berlin / Heidelberg.
10. Superby, J.F., J.P. Vandamme, and N. Meskens. Determination of Factors Influencing the Achievement of the First-year University Students using Data Mining Methods. in Workshop on Educational Data Mining.
11. Kotsiantis, S., C. Pierrakeas, and P. Pintelas, Predicting Students' Performance in Distance Learning Using Machine Learning Techniques. Applied Artificial Intelligence, (5).
12. Kotsiantis, S.B. and P.E. Pintelas. Predicting students marks in Hellenic Open University. in Advanced Learning Technologies, ICALT, Fifth IEEE International Conference on.
13. Delgado, M., et al. Predicting Students Marks from Moodle Logs using Neural Network Models. in Current Developments in Technology-Assisted Education. Badajoz.
14. Romero, C., et al. Data mining algorithms to classify students. in Proc. Int. Conf. Educ. Data Mining. Montreal, Canada.
15. Romero, C., et al., Web usage mining for predicting final marks of students that use Moodle courses. Computer Applications in Engineering Education, 2010.
16. Ko, B., S. Kim, and J.-Y. Nam, X-ray Image Classification Using Random Forests with Local Wavelet-Based CS-Local Binary Patterns. Journal of Digital Imaging, (6).
17. Chen, C.C.M., et al., Methods for Identifying SNP Interactions: A Review on Variations of Logic Regression, Random Forest and Bayesian Logistic Regression. Computational Biology and Bioinformatics, IEEE/ACM Transactions on, (6).
18. Breiman, L., Random Forests. Machine Learning, (1).
19. Lin, X., et al., A method for handling metabonomics data from liquid chromatography/mass spectrometry: combinational use of support vector machine recursive feature elimination, genetic algorithm and random forest for feature selection. Metabolomics, (4).
20. StatSoft, I., Electronic Statistics Textbook. 2011, StatSoft: Tulsa.
21. Vapnik, V., Statistical learning theory. 1998: Wiley.
22. Widodo, A., B.-S. Yang, and T. Han, Combination of independent component analysis and support vector machines for intelligent faults diagnosis of induction motors. Expert Systems with Applications, (2).
23. Han, J., M. Kamber, and J. Pei, Data Mining: Concepts and Techniques, Second Edition (The Morgan Kaufmann Series in Data Management Systems). 2006: Morgan Kaufmann.
24. Akçapınar, G. and P. Aşkar. Measuring Author Contributions to the Mediawiki. in IADIS International Conference WWW/Internet. Rome, Italy.
25. Osmanbegović, E. and M. Suljić, Data Mining Approach for Predicting Student Performance. Economic Review, (1).
26. Bienkowski, M., M. Feng, and B. Means, Enhancing Teaching and Learning Through Educational Data Mining and Learning Analytics: An Issue Brief. 2012: Washington, D.C.
Enhanced Boosted Trees Technique for Customer Churn Prediction Model
IOSR Journal of Engineering (IOSRJEN) ISSN (e): 2250-3021, ISSN (p): 2278-8719 Vol. 04, Issue 03 (March. 2014), V5 PP 41-45 www.iosrjen.org Enhanced Boosted Trees Technique for Customer Churn Prediction
Knowledge Discovery from patents using KMX Text Analytics
Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs [email protected] Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers
ENHANCED CONFIDENCE INTERPRETATIONS OF GP BASED ENSEMBLE MODELING RESULTS
ENHANCED CONFIDENCE INTERPRETATIONS OF GP BASED ENSEMBLE MODELING RESULTS Michael Affenzeller (a), Stephan M. Winkler (b), Stefan Forstenlechner (c), Gabriel Kronberger (d), Michael Kommenda (e), Stefan
131-1. Adding New Level in KDD to Make the Web Usage Mining More Efficient. Abstract. 1. Introduction [1]. 1/10
1/10 131-1 Adding New Level in KDD to Make the Web Usage Mining More Efficient Mohammad Ala a AL_Hamami PHD Student, Lecturer m_ah_1@yahoocom Soukaena Hassan Hashem PHD Student, Lecturer soukaena_hassan@yahoocom
Comparison of K-means and Backpropagation Data Mining Algorithms
Comparison of K-means and Backpropagation Data Mining Algorithms Nitu Mathuriya, Dr. Ashish Bansal Abstract Data mining has got more and more mature as a field of basic research in computer science and
Distributed forests for MapReduce-based machine learning
Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication
Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier
International Journal of Recent Technology and Engineering (IJRTE) ISSN: 2277-3878, Volume-1, Issue-6, January 2013 Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing
Information Management course
Università degli Studi di Milano Master Degree in Computer Science Information Management course Teacher: Alberto Ceselli Lecture 01 : 06/10/2015 Practical informations: Teacher: Alberto Ceselli ([email protected])
COMP3420: Advanced Databases and Data Mining. Classification and prediction: Introduction and Decision Tree Induction
COMP3420: Advanced Databases and Data Mining Classification and prediction: Introduction and Decision Tree Induction Lecture outline Classification versus prediction Classification A two step process Supervised
Comparison of machine learning methods for intelligent tutoring systems
Comparison of machine learning methods for intelligent tutoring systems Wilhelmiina Hämäläinen 1 and Mikko Vinni 1 Department of Computer Science, University of Joensuu, P.O. Box 111, FI-80101 Joensuu
Data Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
Predicting required bandwidth for educational institutes using prediction techniques in data mining (Case Study: Qom Payame Noor University)
260 IJCSNS International Journal of Computer Science and Network Security, VOL.11 No.6, June 2011 Predicting required bandwidth for educational institutes using prediction techniques in data mining (Case
An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset
P P P Health An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset Peng Liu 1, Elia El-Darzi 2, Lei Lei 1, Christos Vasilakis 2, Panagiotis Chountas 2, and Wei Huang
Principles of Data Mining by Hand&Mannila&Smyth
Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences
Scalable Developments for Big Data Analytics in Remote Sensing
Scalable Developments for Big Data Analytics in Remote Sensing Federated Systems and Data Division Research Group High Productivity Data Processing Dr.-Ing. Morris Riedel et al. Research Group Leader,
EFFICIENCY OF DECISION TREES IN PREDICTING STUDENT S ACADEMIC PERFORMANCE
EFFICIENCY OF DECISION TREES IN PREDICTING STUDENT S ACADEMIC PERFORMANCE S. Anupama Kumar 1 and Dr. Vijayalakshmi M.N 2 1 Research Scholar, PRIST University, 1 Assistant Professor, Dept of M.C.A. 2 Associate
Predictive Data modeling for health care: Comparative performance study of different prediction models
Predictive Data modeling for health care: Comparative performance study of different prediction models Shivanand Hiremath [email protected] National Institute of Industrial Engineering (NITIE) Vihar
Azure Machine Learning, SQL Data Mining and R
Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:
Data Mining for Knowledge Management. Classification
1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh
Predicting borrowers chance of defaulting on credit loans
Predicting borrowers chance of defaulting on credit loans Junjie Liang ([email protected]) Abstract Credit score prediction is of great interests to banks as the outcome of the prediction algorithm
Prediction of Heart Disease Using Naïve Bayes Algorithm
Prediction of Heart Disease Using Naïve Bayes Algorithm R.Karthiyayini 1, S.Chithaara 2 Assistant Professor, Department of computer Applications, Anna University, BIT campus, Tiruchirapalli, Tamilnadu,
Decision Trees from large Databases: SLIQ
Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values
Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang
Classifying Large Data Sets Using SVMs with Hierarchical Clusters Presented by :Limou Wang Overview SVM Overview Motivation Hierarchical micro-clustering algorithm Clustering-Based SVM (CB-SVM) Experimental
Knowledge Discovery and Data Mining
Knowledge Discovery and Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Evaluating the Accuracy of a Classifier Holdout, random subsampling, crossvalidation, and the bootstrap are common techniques for
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R
Practical Data Science with Azure Machine Learning, SQL Data Mining, and R Overview This 4-day class is the first of the two data science courses taught by Rafal Lukawiecki. Some of the topics will be
Predict Influencers in the Social Network
Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, [email protected] Department of Electrical Engineering, Stanford University Abstract Given two persons
The Operational Value of Social Media Information. Social Media and Customer Interaction
The Operational Value of Social Media Information Dennis J. Zhang (Kellogg School of Management) Ruomeng Cui (Kelley School of Business) Santiago Gallino (Tuck School of Business) Antonio Moreno-Garcia
Chapter 6. The stacking ensemble approach
82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described
An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015
An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content
Application of Event Based Decision Tree and Ensemble of Data Driven Methods for Maintenance Action Recommendation
Application of Event Based Decision Tree and Ensemble of Data Driven Methods for Maintenance Action Recommendation James K. Kimotho, Christoph Sondermann-Woelke, Tobias Meyer, and Walter Sextro Department
USE DATA MINING TO IMPROVE STUDENT RETENTION IN HIGHER EDUCATION A CASE STUDY
USE DATA MINING TO IMPROVE STUDENT RETENTION IN HIGHER EDUCATION A CASE STUDY Ying Zhang, Samia Oussena Thames Valley University, London,UK [email protected], [email protected] Tony Clark, Hyeonsook
College of Health and Human Services. Fall 2013. Syllabus
College of Health and Human Services Fall 2013 Syllabus information placement Instructor description objectives HAP 780 : Data Mining in Health Care Time: Mondays, 7.20pm 10pm (except for 3 rd lecture
Predicting Students Final GPA Using Decision Trees: A Case Study
Predicting Students Final GPA Using Decision Trees: A Case Study Mashael A. Al-Barrak and Muna Al-Razgan Abstract Educational data mining is the process of applying data mining tools and techniques to
A Methodology for Predictive Failure Detection in Semiconductor Fabrication
A Methodology for Predictive Failure Detection in Semiconductor Fabrication Peter Scheibelhofer (TU Graz) Dietmar Gleispach, Günter Hayderer (austriamicrosystems AG) 09-09-2011 Peter Scheibelhofer (TU
STATISTICA. Financial Institutions. Case Study: Credit Scoring. and
Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table of Contents INTRODUCTION: WHAT
Ensemble Data Mining Methods
Ensemble Data Mining Methods Nikunj C. Oza, Ph.D., NASA Ames Research Center, USA INTRODUCTION Ensemble Data Mining Methods, also known as Committee Methods or Model Combiners, are machine learning methods
Model Combination. 24 Novembre 2009
Model Combination 24 Novembre 2009 Datamining 1 2009-2010 Plan 1 Principles of model combination 2 Resampling methods Bagging Random Forests Boosting 3 Hybrid methods Stacking Generic algorithm for mulistrategy
REVIEW OF ENSEMBLE CLASSIFICATION
Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology ISSN 2320 088X IJCSMC, Vol. 2, Issue.
EMPIRICAL STUDY ON SELECTION OF TEAM MEMBERS FOR SOFTWARE PROJECTS DATA MINING APPROACH
EMPIRICAL STUDY ON SELECTION OF TEAM MEMBERS FOR SOFTWARE PROJECTS DATA MINING APPROACH SANGITA GUPTA 1, SUMA. V. 2 1 Jain University, Bangalore 2 Dayanada Sagar Institute, Bangalore, India Abstract- One
COURSE RECOMMENDER SYSTEM IN E-LEARNING
International Journal of Computer Science and Communication Vol. 3, No. 1, January-June 2012, pp. 159-164 COURSE RECOMMENDER SYSTEM IN E-LEARNING Sunita B Aher 1, Lobo L.M.R.J. 2 1 M.E. (CSE)-II, Walchand
Research Phases of University Data Mining Project Development
Research Phases of University Data Mining Project Development Dorina Kabakchieva 1, Kamelia Stefanova 2, Valentin Kissimov 3, and Roumen Nikolov 4 1 Sofia University St. Kl. Ohridski, 125 Tzarigradsko
