Transfer Learning for Predictive Models in Massive Open Online Courses
Sebastien Boyer and Kalyan Veeramachaneni
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology

Abstract. Data recorded while learners interact with Massive Open Online Course (MOOC) platforms provide a unique opportunity to build predictive models that can help anticipate future behaviors and develop interventions. But since most useful prediction problems are defined in a real-time framework, using knowledge drawn from previous courses becomes crucial. To address this challenge, we designed a set of processes that take advantage of knowledge from both previous courses and previous weeks of the same course to make real-time predictions about learners' behavior. In particular, we evaluate multiple transfer learning methods. In this article, we present our results for the stopout prediction problem (predicting which learners are likely to stop engaging in the course). We believe this paper is a first step towards addressing the need to transfer knowledge across courses.

1 Introduction

Data recorded while learners interact with a MOOC platform provide a unique opportunity to learn about the efficacy of the different resources, build predictive models that can help develop interventions, and propose/recommend strategies for the learner. Consider, for example, a model built to predict stopout, that is, how likely a learner is to stop engaging with the course in the coming weeks. One can learn the model retrospectively on data generated from a finished course by following a typical data mining procedure: splitting the data into train and test sets, learning the model and tuning its parameters via cross-validation on the training data, and measuring the model's accuracy on the test data, which serves as a proxy for how the model may perform on unseen data.
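The retrospective procedure above can be sketched end to end. This is an illustrative toy only: synthetic data, a minimal gradient-descent logistic model, and a rank-based AUC stand in for the paper's actual pipeline.

```python
# Toy sketch of the retrospective evaluation loop: split, fit, report AUC on
# the held-out set. Data and the tiny logistic model are illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for per-learner covariates and stopout labels.
X = rng.normal(size=(500, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=500) > 0).astype(int)

# Split into train/test; the test set is the "proxy for unseen data".
idx = rng.permutation(len(X))
train, test = idx[:400], idx[400:]

def fit_logistic(X, y, lr=0.1, steps=500, ridge=0.01):
    """Ridge-regularized logistic regression via plain gradient descent."""
    Xb = np.hstack([np.ones((len(X), 1)), X])       # prepend bias column
    w = np.zeros(Xb.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(Xb @ w)))
        w -= lr * (Xb.T @ (p - y) / len(y) + ridge * w)
    return w

def predict_proba(w, X):
    Xb = np.hstack([np.ones((len(X), 1)), X])
    return 1 / (1 + np.exp(-(Xb @ w)))

def auc(y_true, scores):
    # Rank-based AUC: probability a random positive outscores a random negative.
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = y_true == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

w = fit_logistic(X[train], y[train])
print(round(auc(y[test], predict_proba(w, X[test])), 2))
```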
In this paper, we focus on operationalizing predictive models for real-time use. We raise a fundamental question: would models trained on previous courses perform well on a new course (perhaps a subsequent offering of the same course)? This question is of utmost importance in MOOCs due to the following observations:

Offerings could be subtly different: Due to the very nascent nature of online learning platforms, many aspects of a course evolve. Learners are thus placed in a different environment each time, so they may interact and behave differently.
Offerings have a different learner/instructor population: Subsequent offerings of a course have a different population of students, teaching assistants, and, in some cases, different instructors.

Some variables may not transfer: Some of the variables developed for one offering may not exist in a subsequent offering. For example, if a variable is the time spent on a specific tutorial, the variable cannot be computed if the tutorial is not offered in the subsequent offering.

To develop methods for operationalizing models for real-time prediction, in this paper we study three different offerings of a course offered by MIT via edX called Circuits and Electronics. We formulate two versions of the prediction problem. One version, in a very limited number of prediction scenarios, allows us to use data gathered in an ongoing course to make predictions in the same course. We call this in-situ learning. In the second version we employ multiple transfer learning methods and examine their performance. In our models we rely on a set of variables that we consider universal, and employ techniques to overcome differences in their ranges.

The paper is organized as follows. In the next section we present the datasets we are working with and the features/variables we defined and extracted. In Section 3 we define the stopout prediction problem and multiple versions of the problem formulation. In Section 4 we present the different transfer learning approaches we employ. Section 5 presents the results. Section 6 presents the related work in this area. Section 7 concludes.

2 Dataset

In a MOOC, every mouse click learners make on the course website is recorded, their submissions are collected, and their interactions on forums are recorded. In this paper, our data come from three consecutive offerings of an MITx course called 6.002x: Circuits and Electronics, offered via the edX platform.
We chose three offerings of this course as an ideal test case for transfer learning, since we expect that different offerings of the same course may have statistically similar learner-behavior data. We name the offerings A (offered in spring 2012), B (offered in fall 2012) and C (offered in spring 2013). The numbers of learners who registered/certified in these three offerings were /7157, 51,394/2,987 and 29,050/1,099, respectively.

Per-learner longitudinal variables: Even though each course could be different in terms of content, student population and global environment, the data gathered during students' interaction with the platform allow us to extract a common set of longitudinal variables, as shown in Table 1 (more details about these features are presented in [12]).

3 Stopout prediction in MOOCs

We consider the problem of predicting stopout (a.k.a. dropout) for learners in MOOCs. The prediction problem is described as follows: for a course of duration k weeks, given a set of longitudinal observations (covariates) that describe the learner's behavior and performance in the course up until a week i, predict whether or not the learner will persist in weeks i + j to k, where j ∈ {1, ..., k − i} is the lead.
Table 1. Features derived per learner per week for the above-mentioned courses

x1: Total time spent on all resources
x2: Number of distinct problems attempted
x3: Number of submissions
x4: Number of distinct correct problems
x5: Average number of submissions per problem
x6: Ratio of total time spent to number of distinct correct problems
x7: Ratio of number of problems attempted to number of distinct correct problems
x8: Duration of longest observed event
x9: Total time spent on lecture resources
x10: Total time spent on book resources
x11: Total time spent on wiki resources
x12: Number of correct submissions
x13: Percentage of the total submissions that were correct
x14: Average time between a problem submission and the problem due date [10]

A learner is said to persist in week i if s/he attempts at least one problem presented in the course during that week. In the context of MOOCs, making real-time predictions about stopout would provide a tremendous opportunity to make interventions, collect surveys and improve course outcomes.

3.1 Problem formulation and learning possibilities

Given the learner interaction data up until week i, we formulate two different types of prediction problems. These formulations have implications for how training data can be assembled and how model learning can incorporate information from a previous course.

Formulations

Entire history: In this formulation, we use all the information available about the learner up until week i to make predictions beyond week i.

Moving window: In this formulation, we use a fixed amount of historical information (parameterized by a window size) about the learner to make predictions. That is, if the window size is set to 2, then for any week i we only use information from weeks i − 1 and i − 2.

Learning

In-situ learning: In-situ learning attempts to learn a predictive model from the data of the ongoing course itself.
To do so, we have to assemble data corresponding to learners who stopped out (negative exemplars) along with learners who are persisting (positive exemplars). This is only possible under the moving-window formulation. Under this formulation, when the course is at week i, with window size w and lead j such that w + j < i, we can assemble training examples by following a sliding-window protocol over the data up until week i.

Transfer learning: Through transfer learning, we attempt to transfer information (training data samples or models) from a previous course to establish a predictive model for an ongoing course.
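The labeling rule and the sliding-window protocol above can be sketched as follows; the data layout (`weekly_x`, `attempts`) and helper names are illustrative assumptions, not the paper's code:

```python
# Sketch of in-situ training-set assembly under the moving-window formulation.
# attempts[week] = number of problems a learner attempted that week; a learner
# "persists" in a week if this count is positive (the paper's labeling rule).
# weekly_x[week] = that learner's feature vector (x1..x14) for the week.

def persists(attempts, week):
    """True if the learner attempted at least one problem in `week` (1-indexed)."""
    return attempts[week - 1] > 0

def sliding_window_examples(weekly_x, attempts, i, w, j):
    """With the course at week i, window size w and lead j, return
    (features, label) pairs whose labels are already observed."""
    examples = []
    for end in range(w, i - j + 1):     # `end` = last week inside the window
        feats = [v for week in range(end - w + 1, end + 1)
                 for v in weekly_x[week - 1]]
        label = int(persists(attempts, end + j))   # persistence at week end + j
        examples.append((feats, label))
    return examples

# Course at week i = 4, window w = 2, lead j = 1: two training examples,
# covariates from weeks 1-2 (label = week 3) and weeks 2-3 (label = week 4).
weekly_x = [[0.5, 3], [0.2, 1], [0.4, 2], [0.0, 0]]
attempts = [3, 1, 2, 0]
print(sliding_window_examples(weekly_x, attempts, i=4, w=2, j=1))
# -> [([0.5, 3, 0.2, 1], 1), ([0.2, 1, 0.4, 2], 0)]
```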
Performance metric: We note that the positive examples in our dataset (learners who do not stop out) represent only around 10% of the total number of learners. A simple baseline could therefore achieve very high accuracy (e.g., predicting that every learner dropped out gives an accuracy of nearly 90% on the test samples). We instead use Area Under the Curve (AUC) as the metric to capture and compare the discriminatory performance of our models.

4 Transfer learning

Notationally, we call the past offering S (source) and the current ongoing offering T (target). n_S and n_T are, respectively, the number of samples in the source and target offerings, and D_S = {(x_Si, y_Si)}_{i=1:n_S} represents the data from the source. We distinguish two main scenarios for transfer learning, as defined in [8] and [13]: inductive transfer learning, where some labels are available for the target course, given by D_T = {(x_Ti, y_Ti)}_{i=1:n_T}, and transductive transfer, where no labels are available for the target course, given by D_T = {x_Ti}_{i=1:n_T}.

4.1 A naive approach

When using samples from a previous course to help predict in a new course, we first wonder whether the two tasks (predicting in the first course and predicting in the second) are similar enough that applying a model learnt on the first course to the second would give satisfying results. To answer this question, we train a logistic regression model with an optimized ridge regularization parameter on S and apply it to T. We call this the naive transfer method.

4.2 Inductive learning approach

Multi-task learning method: In the multi-task learning method (MT) [2], two sets of weights are learnt for samples from the source and the target. The weights are coupled by a common component that is shared between the source and target weights. This sharing is represented by:

w_S = v_S + c_0
w_T = v_T + c_0

where v_S, v_T, c_0 ∈ R^d, with d = m + 1 for m covariates.
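A minimal numpy sketch of this weight coupling: each domain's weight vector is a shared component c_0 plus a domain-specific part, and the two logistic models are scored jointly with separate penalties on the two kinds of components. The data and function names are ours, not the paper's implementation.

```python
# Sketch of the coupled-weight multi-task objective: w_S = v_S + c_0 and
# w_T = v_T + c_0, with lambda_1 penalizing the domain-specific parts and
# lambda_2 penalizing the shared part. Toy data only.
import numpy as np

def logistic_loss(X, y, w):
    """Sum of logistic losses; y in {-1, +1}; w[0] is the bias term."""
    z = y * (w[0] + X @ w[1:])
    return np.log1p(np.exp(-z)).sum()

def mt_objective(params, XS, yS, XT, yT, d, lam1=0.2, lam2=1.0):
    """Joint objective over stacked parameters [v_S, v_T, c_0], each length d."""
    vS, vT, c0 = params[:d], params[d:2 * d], params[2 * d:]
    wS, wT = vS + c0, vT + c0                 # coupled weight vectors
    return (logistic_loss(XS, yS, wS) + logistic_loss(XT, yT, wT)
            + lam1 / 2 * (vS @ vS + vT @ vT) + lam2 * (c0 @ c0))

rng = np.random.default_rng(1)
XS, XT = rng.normal(size=(20, 2)), rng.normal(size=(8, 2))
yS, yT = np.sign(XS[:, 0] + 0.1), np.sign(XT[:, 0] - 0.1)
d = 3                                         # bias + 2 covariates
# At all-zero parameters each point contributes log(2): 28 * log(2) ~= 19.41
print(mt_objective(np.zeros(3 * d), XS, yS, XT, yT, d))
```

Minimizing this objective (e.g., with any gradient-based optimizer) recovers the coupled vectors w_S and w_T described in the text.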
From a logistic regression perspective, this method requires learning the two coupled weight vectors while regularizing the two kinds of components separately:

(w_S, w_T) = argmin_{v_S, v_T, c_0} Σ_{i=1:n_S} l(x_Si, y_Si, w_S) + Σ_{i=1:n_T} l(x_Ti, y_Ti, w_T) + (λ_1/2)(||v_S||² + ||v_T||²) + λ_2 ||c_0||²

where l(x, y, w) = log(1 + exp(−y(w_0 + xᵀw))) is the logistic loss (the negative log-likelihood of a data point). Regularization allows us to set the degree of similarity between courses via the ratio λ_2/λ_1, which captures the relative significance of the independent parts of the weights over the common part. For small
values of λ_2/λ_1, we expect the two models to have very similar weight vectors (the regularization constraints are higher on the distinct parts v_S and v_T than on the common part c_0). In our experiments we set λ_1 = 0.2 and λ_2 = .

Logistic Regression with Prior method: In the MT method, the regularization parameters impose the same common/particular penalization ratio on all components of the weight vector, i.e., on all covariates. However, in most cases we would like to differentiate between covariates, as their importance is likely to vary between the source and the target course. Hence we resort to a method that places a prior distribution on the weights of the target model [3]. The Logistic Regression with Prior method (PM) first estimates the prior distribution (assumed Gaussian) on the weights by splitting the data from the source course into D = 10 sub-samples (without replacement). For each sub-sample, we fit the usual logistic regression classifier with ridge regularization. We obtain D sets of weights {{w_j^k}_{j=1:d}}_{k=1:D} and use them to compute our prior belief about the weights we expect to derive from any new course, as follows:

μ_j = (1/D) Σ_{k=1:D} w_j^k,  for j = 1 : d
σ_j² = (1/(D − 1)) Σ_{k=1:D} (w_j^k − μ_j)²,  for j = 1 : d

When building a model for the target domain, the usual logistic regression cost function

NLL(w_0, w) = Σ_{i=1:n_T} l(x_Ti, y_Ti, w) + λ Σ_{j=0:d} w_j²

becomes

NLL(w_0, w) = Σ_{i=1:n_T} l(x_Ti, y_Ti, w) + λ Σ_{j=1:d} (w_j − μ_j)² / σ_j²

where μ = [μ_1, ..., μ_d], σ = [σ_1, ..., σ_d], l(x_i, y_i, w) = log(1 + exp(−y_i(w_0 + x_iᵀw))), and the penalty corresponds to a multivariate Gaussian prior on the weights. The effect of such a strategy is to allow higher flexibility (fewer constraints) for weights that vary significantly across source sub-domains, and to constrain more strongly the weights whose variation across source sub-domains is smaller.
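The prior-estimation step can be sketched as follows, with a toy gradient-descent fit standing in for the ridge-regularized logistic regression; all names and data are illustrative, and the bias term is omitted for brevity:

```python
# Sketch of the PM prior estimation: split the source course into D disjoint
# sub-samples, fit a ridge-penalized logistic model on each, and turn the D
# weight vectors into per-coordinate Gaussian priors (mu_j, sigma_j) that then
# penalize the target model's NLL.
import numpy as np

def fit_ridge_logistic(X, y, lam=0.1, lr=0.1, steps=300):
    """Toy ridge-regularized logistic fit; y in {-1, +1}, no bias term."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1 / (1 + np.exp(-(X @ w)))
        w -= lr * (X.T @ (p - (y > 0)) / len(y) + lam * w)
    return w

def estimate_prior(XS, yS, D=10, seed=0):
    """Fit one model per disjoint sub-sample; return per-weight mean and std."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(XS))
    W = np.array([fit_ridge_logistic(XS[part], yS[part])
                  for part in np.array_split(idx, D)])    # D weight vectors
    return W.mean(axis=0), W.std(axis=0, ddof=1)          # mu_j, sigma_j

def prior_nll(w, XT, yT, mu, sigma, lam=1.0):
    """Target-course NLL with the Gaussian prior penalty on the weights."""
    z = (yT > 0) * 1.0
    p = 1 / (1 + np.exp(-(XT @ w)))
    nll = -(z * np.log(p) + (1 - z) * np.log(1 - p)).sum()
    return nll + lam * (((w - mu) / sigma) ** 2).sum()

rng = np.random.default_rng(2)
XS = rng.normal(size=(200, 3))
yS = np.sign(XS[:, 0] + 0.2 * rng.normal(size=200))
mu, sigma = estimate_prior(XS, yS)
print(mu.shape, sigma.shape)   # one prior mean/std per weight coordinate
```

Weights that come out similar across sub-samples get a small sigma_j and are therefore held close to mu_j on the target course, matching the behavior described above.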
In our experiments, we set λ = .

4.3 Transductive learning approach

In this section we focus on the scenario where no labeled samples are available in the target course. This is the case in the entire-history setting (no labeled samples from the ongoing course are available, because the week of the prediction is in the future). To transfer the model from source to target, we follow our belief that the covariates from the two courses may overlap to a significant degree, and we use a sample selection bias correction [5]. This importance sampling (IS) method is equivalent to assuming that the covariates are drawn from a common distribution and that their differences come from a selection bias during the sampling process (out of this general common distribution). In
order to correct this sample selection bias, the idea is to give more weight to the learners in the source course who are close to the learners in the target course. In doing so, the classification algorithm takes advantage of the similar learners in the source course and barely considers the significantly different ones. For each target sample we predict

ŷ_Ti = argmax_{y ∈ {+1, −1}} P(y | x_Ti, w*)

where the weights are estimated from the source data by optimizing

w* = argmin_w Σ_{i=1:n_S} β_i l(x_Si, y_Si, w)

Note that each learner's data is reweighted using a parameter β_i in the log-likelihood objective function. Finding the optimal β_i for such a reweighting procedure would be straightforward if we knew the two distributions from which the source and target learners' data are drawn. To find them in practice, we use a Gaussian kernel k(x, z) = exp(−||x − z||²/σ²), for all x, z ∈ S ∪ T, to measure the distance between data points. In our experiments σ = 1. We then compute, for each source data point, an average similarity to the target domain:

κ_i = (1/n_T) Σ_{j=1:n_T} k(x_Si, x_Tj)

and use the quadratic formulation found in [5] to evaluate the optimal coefficients β_i*:

argmin_β (1/2) βᵀKβ − n_S κᵀβ   s.t.   β_i ∈ [0, B] for i = 1:n_S, and |Σ_{i=1:n_S} β_i − n_S| ≤ n_S ε

where K_ij = k(x_Si, x_Sj). The second term in the optimization problem ensures that β_i is chosen high when the average distance of x_Si to all the x_Tj is low.

5 Experimental settings and results

Fig. 1. Different ways we can learn and execute a model depending on the formulation we choose. MW stands for Moving Window, EH stands for Entire History, T stands for Transductive model, I stands for Inductive model and N stands for Naive model.

Problem settings: As per Figure 1, there are a number of ways a prediction problem can be solved. For inductive transfer learning, we have two methods
- prior and multi-task. In total, the same prediction problem can be solved via 7 different methods. As a reminder, we denote the courses as follows: 6.002x spring 2012 as A, 6.002x fall 2012 as B, and 6.002x spring 2013 as C. First we consider the entire-history setting and then present results for the moving-window setting. For comparison we define an a-posteriori model.

Definition 1 (a-posteriori model): The model built retrospectively using the labeled data from the target course itself. That is, the data from the target course are split into train and test sets, a model is trained on the training data, and its performance is reported on the test data. We note that this is typically how results on dropout prediction have been reported in papers up until now.

Entire-history formulation: Within the entire-history setting, two different methods can be applied - naive and importance sampling. When applying these two methods we normalize features (using min-max normalization) independently within each course. Intuitively this makes sense, as normalizing all data using the normalization parameters calculated on the training set/previous course could misrepresent variables. For example, the variable "number of distinct correct problems" could have different ranges, as the total number of problems offered during a particular week may vary from offering to offering. Course-wise normalization allows us to compare learners to their peers, thus making variables comparable across courses.

Fig. 2. Performance of transfer learning models for the Entire History formulation, for (Source: A, Target: B), (Source: A, Target: C) and (Source: B, Target: C). The solid black line plots the AUC obtained for the hypothetical case where we could access data from the future of the ongoing course (a-posteriori model). The naive method stands for the AUC obtained by naively using the source course as the training dataset and the target course as the test set.
IS stands for the importance sampling method.

Figure 2 shows the results achieved for three different transfer learning scenarios. The x-axis is the lead for the prediction problem, and the y-axis is the AUC averaged across different amounts of history (the same lead at different weeks). For example, at lead = 1, the y-axis value is the average of the AUC values for the 13 prediction problems defined at weeks 1 through 13. To summarize our results, we use a question-answer format.

Is the performance of an a-posteriori model a good indicator of real-time prediction performance?: The a-posteriori model's performance in most
cases presents an optimistic estimate. Models trained on a previous offering struggle to achieve the same performance when transferred. This is especially true for one-week-ahead prediction. In two out of three cases in Figure 2, we notice a drop of 0.1 or more in AUC when a model from a previous offering is used. Hence, we argue that when developing stopout models for MOOCs for real-time use, one must evaluate the models on successive offerings and report that performance.

Which is better: naive or importance-sampling-based transfer?: Based on our experiments, we are not able to say conclusively whether naive transfer or importance-sampling-based transfer is better. In some prediction problems one is better than the other, and vice versa. We hypothesize that the performance of the importance sampling approach can be further improved by tuning its parameters, and that fusing the predictions of both methods may yield better, more robust performance.

What could explain the slightly better performance of transferred models on C?: The number of learners in course C is significantly smaller than in A or B. We posit that having more data in A or B perhaps enabled the development of models that transferred well to C. This is especially true for the B-to-C transfer. Additionally, since B is chronologically closer to C, we hypothesize that not much may have changed in the platform or the course structure between these two offerings. This, together with the fact that B has more data, enabled models trained on B to perform well on C (when compared to the a-posteriori model developed on C).

Moving-window formulation: We next consider the moving-window formulation. We consider that the target course is at week 4, and our goal is to predict whether or not learners who are in week 4 will stop out or stay in the course. Fixing the window size at 2, we can generate training examples for lead 1 by assembling covariates from weeks 1 and 2 and from weeks 2 and 3.
For lead 2, we can form training examples by assembling covariates from weeks 1 and 2. This allows us to generate labeled exemplars because we have knowledge about stopouts for weeks 3 and 4 in the ongoing course. Thus we are able to apply both in-situ learning and the inductive transfer learning methods. However, we are only able to solve two prediction problems - lead 1 and lead 2. We present the results achieved via multiple methods in Figure 3, and summarize them below.

How did in-situ learning do?: In-situ learning did surprisingly well when compared to the transfer learning methods under the moving-window formulation. For both B and C, the performance for leads 1 and 2 is between - (at week 4). This performance is also comparable to the entire-history version of the same problem. In our subsequent work we intend to evaluate in-situ learning for a wide variety of prediction problems, not just at week 4.

6 Related Work

Dropout prediction and analysis of the reasons for dropout are of great interest to the educational research community in a variety of contexts: e-learning, distance education, and online courses offered by community colleges. However, most studies focus on identifying aspects of students' behavior, progress, and performance
that correlate with dropout. These studies inform policy or give a broader understanding of why students drop out [11, 9]. Real-time predictions are rarely studied [7, 6], and to the best of our knowledge, no studies examine whether models transfer well or whether they could be deployed. Yet MOOCs provide an ideal use case for transfer learning; with multiple offerings of the same course during a year, it is paramount that we use what we learned from a previous course to make predictions and design interventions in the next one. Within the MOOC literature, there is an increasing amount of interest in predicting stopout [4, 1]. In most cases, methods have focused on predicting one week ahead and have not attempted to use models trained on one course on another. Through this study, we take the first steps toward understanding the different situations in which one can transfer models or data samples from one course to another. We have succeeded in (a) defining different ways real-time predictions can be achieved for a new offering, (b) presenting multiple ways in which transfer learning can be achieved, and (c) demonstrating transfer learning performance.

Fig. 3. Performance of transfer learning models for the Moving Window formulation, for (Source: A, Target: B), (Source: A, Target: C) and (Source: B, Target: C); each panel compares the in-situ, naive, multi-task and prior methods.

7 Conclusion

In this paper we presented different methods that allow us to make real-time predictions for learners in MOOCs. In particular, we emphasized transfer learning techniques, which form the foundation for using models in real time. The key ideas of these methods were to formulate two different settings for the same prediction problem (using aggregate data from a previous course or partial data from the ongoing course) and to implement advanced machine learning techniques for both.
From an engineering and scientific perspective, we believe these are the first steps towards building high-fidelity predictive systems for MOOCs (beyond retrospectively analyzing the stopout prediction problem). To complete the picture, we expect further work on designing new transfer learning methods, designing methodologies to tune the parameters of the current transfer learning approaches, evaluating the methods systematically on a number of courses, defining more covariates for learners, and deploying the predictive system in an actual course.
Bibliography

[1] Girish Balakrishnan and Derrick Coetzee. Predicting student retention in massive open online courses using hidden Markov models. Technical report, EECS Department, University of California, Berkeley, 2013.
[2] Rich Caruana. Multitask learning. Machine Learning, 28(1):41-75, 1997.
[3] Ciprian Chelba and Alex Acero. Adaptation of maximum entropy capitalizer: Little data can help a lot. Computer Speech & Language, 20(4):382-399, 2006.
[4] Sherif Halawa, Daniel Greene, and John Mitchell. Dropout prediction in MOOCs using learner activity features. In Proceedings of the European MOOC Summit (EMOOCs), 2014.
[5] Jiayuan Huang, Arthur Gretton, Karsten M. Borgwardt, Bernhard Schölkopf, and Alex J. Smola. Correcting sample selection bias by unlabeled data. In Advances in Neural Information Processing Systems, 2007.
[6] Eitel J. M. Lauría, Joshua D. Baron, Mallika Devireddy, Venniraiselvi Sundararaju, and Sandeep M. Jayaprakash. Mining academic data to improve college student retention: An open source perspective. In Proceedings of the 2nd International Conference on Learning Analytics and Knowledge. ACM, 2012.
[7] Ioanna Lykourentzou, Ioannis Giannoukos, Vassilis Nikolopoulos, George Mpardis, and Vassili Loumos. Dropout prediction in e-learning courses through the combination of machine learning techniques. Computers & Education, 53(3):950-965, 2009.
[8] Sinno Jialin Pan and Qiang Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345-1359, 2010.
[9] Hannah D. Street. Factors influencing a learner's decision to drop-out or persist in higher education distance learning. Online Journal of Distance Learning Administration, 13(4), 2010.
[10] Colin Taylor, Kalyan Veeramachaneni, and Una-May O'Reilly. Likely to stop? Predicting stopout in massive open online courses. arXiv preprint, 2014.
[11] Keith Tyler-Smith. Early attrition among first time eLearners: A review of factors that contribute to drop-out, withdrawal and non-completion rates of adult learners undertaking eLearning programmes. Journal of Online Learning and Teaching, 2(2):73-85, 2006.
[12] Kalyan Veeramachaneni, Una-May O'Reilly, and Colin Taylor. Towards feature engineering at scale for data from massive open online courses. arXiv preprint arXiv:1407.5238, 2014.
[13] Qiang Yang and Sinno Jialin Pan. Transfer learning and applications. In Intelligent Information Processing, page 2, 2012.
A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier
A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier G.T. Prasanna Kumari Associate Professor, Dept of Computer Science and Engineering, Gokula Krishna College of Engg, Sullurpet-524121,
CSCI567 Machine Learning (Fall 2014)
CSCI567 Machine Learning (Fall 2014) Drs. Sha & Liu {feisha,yanliu.cs}@usc.edu September 22, 2014 Drs. Sha & Liu ({feisha,yanliu.cs}@usc.edu) CSCI567 Machine Learning (Fall 2014) September 22, 2014 1 /
TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM
TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam
Social Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
Data Quality Mining: Employing Classifiers for Assuring consistent Datasets
Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Fabian Grüning Carl von Ossietzky Universität Oldenburg, Germany, [email protected] Abstract: Independent
BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES
BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 123 CHAPTER 7 BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 7.1 Introduction Even though using SVM presents
Learning outcomes. Knowledge and understanding. Competence and skills
Syllabus Master s Programme in Statistics and Data Mining 120 ECTS Credits Aim The rapid growth of databases provides scientists and business people with vast new resources. This programme meets the challenges
Machine Learning and Data Mining. Fundamentals, robotics, recognition
Machine Learning and Data Mining Fundamentals, robotics, recognition Machine Learning, Data Mining, Knowledge Discovery in Data Bases Their mutual relations Data Mining, Knowledge Discovery in Databases,
W6.B.1. FAQs CS535 BIG DATA W6.B.3. 4. If the distance of the point is additionally less than the tight distance T 2, remove it from the original set
http://wwwcscolostateedu/~cs535 W6B W6B2 CS535 BIG DAA FAQs Please prepare for the last minute rush Store your output files safely Partial score will be given for the output from less than 50GB input Computer
Machine Learning in FX Carry Basket Prediction
Machine Learning in FX Carry Basket Prediction Tristan Fletcher, Fabian Redpath and Joe D Alessandro Abstract Artificial Neural Networks ANN), Support Vector Machines SVM) and Relevance Vector Machines
ASSESSING DECISION TREE MODELS FOR CLINICAL IN VITRO FERTILIZATION DATA
ASSESSING DECISION TREE MODELS FOR CLINICAL IN VITRO FERTILIZATION DATA Technical Report TR03 296 Dept. of Computer Science and Statistics University of Rhode Island LEAH PASSMORE 1, JULIE GOODSIDE 1,
The Optimality of Naive Bayes
The Optimality of Naive Bayes Harry Zhang Faculty of Computer Science University of New Brunswick Fredericton, New Brunswick, Canada email: hzhang@unbca E3B 5A3 Abstract Naive Bayes is one of the most
Data Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
Supervised Learning (Big Data Analytics)
Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used
Gaussian Processes in Machine Learning
Gaussian Processes in Machine Learning Carl Edward Rasmussen Max Planck Institute for Biological Cybernetics, 72076 Tübingen, Germany [email protected] WWW home page: http://www.tuebingen.mpg.de/ carl
A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions
A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois at Urbana-Champaign IBM T. J. Watson Research Center
BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts
BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an
Data Mining Yelp Data - Predicting rating stars from review text
Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University [email protected] Chetan Naik Stony Brook University [email protected] ABSTRACT The majority
Cell Phone based Activity Detection using Markov Logic Network
Cell Phone based Activity Detection using Markov Logic Network Somdeb Sarkhel [email protected] 1 Introduction Mobile devices are becoming increasingly sophisticated and the latest generation of smart
MapReduce Approach to Collective Classification for Networks
MapReduce Approach to Collective Classification for Networks Wojciech Indyk 1, Tomasz Kajdanowicz 1, Przemyslaw Kazienko 1, and Slawomir Plamowski 1 Wroclaw University of Technology, Wroclaw, Poland Faculty
How can we discover stocks that will
Algorithmic Trading Strategy Based On Massive Data Mining Haoming Li, Zhijun Yang and Tianlun Li Stanford University Abstract We believe that there is useful information hiding behind the noisy and massive
Introduction to Logistic Regression
OpenStax-CNX module: m42090 1 Introduction to Logistic Regression Dan Calderon This work is produced by OpenStax-CNX and licensed under the Creative Commons Attribution License 3.0 Abstract Gives introduction
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree
Predicting the Risk of Heart Attacks using Neural Network and Decision Tree S.Florence 1, N.G.Bhuvaneswari Amma 2, G.Annapoorani 3, K.Malathi 4 PG Scholar, Indian Institute of Information Technology, Srirangam,
arxiv:1506.04135v1 [cs.ir] 12 Jun 2015
Reducing offline evaluation bias of collaborative filtering algorithms Arnaud de Myttenaere 1,2, Boris Golden 1, Bénédicte Le Grand 3 & Fabrice Rossi 2 arxiv:1506.04135v1 [cs.ir] 12 Jun 2015 1 - Viadeo
Forecaster comments to the ORTECH Report
Forecaster comments to the ORTECH Report The Alberta Forecasting Pilot Project was truly a pioneering and landmark effort in the assessment of wind power production forecast performance in North America.
Regularized Logistic Regression for Mind Reading with Parallel Validation
Regularized Logistic Regression for Mind Reading with Parallel Validation Heikki Huttunen, Jukka-Pekka Kauppi, Jussi Tohka Tampere University of Technology Department of Signal Processing Tampere, Finland
CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York
BME I5100: Biomedical Signal Processing Linear Discrimination Lucas C. Parra Biomedical Engineering Department CCNY 1 Schedule Week 1: Introduction Linear, stationary, normal - the stuff biology is not
DATA PREPARATION FOR DATA MINING
Applied Artificial Intelligence, 17:375 381, 2003 Copyright # 2003 Taylor & Francis 0883-9514/03 $12.00 +.00 DOI: 10.1080/08839510390219264 u DATA PREPARATION FOR DATA MINING SHICHAO ZHANG and CHENGQI
Probabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur
Probabilistic Linear Classification: Logistic Regression Piyush Rai IIT Kanpur Probabilistic Machine Learning (CS772A) Jan 18, 2016 Probabilistic Machine Learning (CS772A) Probabilistic Linear Classification:
Detecting Corporate Fraud: An Application of Machine Learning
Detecting Corporate Fraud: An Application of Machine Learning Ophir Gottlieb, Curt Salisbury, Howard Shek, Vishal Vaidyanathan December 15, 2006 ABSTRACT This paper explores the application of several
Knowledge Discovery from patents using KMX Text Analytics
Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs [email protected] Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers
CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.
Lecture Machine Learning Milos Hauskrecht [email protected] 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht [email protected] 539 Sennott
Recognizing Informed Option Trading
Recognizing Informed Option Trading Alex Bain, Prabal Tiwaree, Kari Okamoto 1 Abstract While equity (stock) markets are generally efficient in discounting public information into stock prices, we believe
ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING
ANALYSIS, THEORY AND DESIGN OF LOGISTIC REGRESSION CLASSIFIERS USED FOR VERY LARGE SCALE DATA MINING BY OMID ROUHANI-KALLEH THESIS Submitted as partial fulfillment of the requirements for the degree of
Cluster Analysis for Evaluating Trading Strategies 1
CONTRIBUTORS Jeff Bacidore Managing Director, Head of Algorithmic Trading, ITG, Inc. [email protected] +1.212.588.4327 Kathryn Berkow Quantitative Analyst, Algorithmic Trading, ITG, Inc. [email protected]
An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015
An Introduction to Data Mining for Wind Power Management Spring 2015 Big Data World Every minute: Google receives over 4 million search queries Facebook users share almost 2.5 million pieces of content
Credit Card Fraud Detection and Concept-Drift Adaptation with Delayed Supervised Information
Credit Card Fraud Detection and Concept-Drift Adaptation with Delayed Supervised Information Andrea Dal Pozzolo, Giacomo Boracchi, Olivier Caelen, Cesare Alippi, and Gianluca Bontempi 15/07/2015 IEEE IJCNN
MAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS
MAXIMIZING RETURN ON DIRET MARKETING AMPAIGNS IN OMMERIAL BANKING S 229 Project: Final Report Oleksandra Onosova INTRODUTION Recent innovations in cloud computing and unified communications have made a
Comparison of K-means and Backpropagation Data Mining Algorithms
Comparison of K-means and Backpropagation Data Mining Algorithms Nitu Mathuriya, Dr. Ashish Bansal Abstract Data mining has got more and more mature as a field of basic research in computer science and
A Logistic Regression Approach to Ad Click Prediction
A Logistic Regression Approach to Ad Click Prediction Gouthami Kondakindi [email protected] Satakshi Rana [email protected] Aswin Rajkumar [email protected] Sai Kaushik Ponnekanti [email protected] Vinit Parakh
Class-specific Sparse Coding for Learning of Object Representations
Class-specific Sparse Coding for Learning of Object Representations Stephan Hasler, Heiko Wersing, and Edgar Körner Honda Research Institute Europe GmbH Carl-Legien-Str. 30, 63073 Offenbach am Main, Germany
STATISTICA Formula Guide: Logistic Regression. Table of Contents
: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary
Data Mining Solutions for the Business Environment
Database Systems Journal vol. IV, no. 4/2013 21 Data Mining Solutions for the Business Environment Ruxandra PETRE University of Economic Studies, Bucharest, Romania [email protected] Over
Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C
Statistical Models in Data Mining
Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of
Learning with Local and Global Consistency
Learning with Local and Global Consistency Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf Max Planck Institute for Biological Cybernetics, 7276 Tuebingen, Germany
Three types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type.
Chronological Sampling for Email Filtering Ching-Lung Fu 2, Daniel Silver 1, and James Blustein 2 1 Acadia University, Wolfville, Nova Scotia, Canada 2 Dalhousie University, Halifax, Nova Scotia, Canada
Faculty of Science School of Mathematics and Statistics
Faculty of Science School of Mathematics and Statistics MATH5836 Data Mining and its Business Applications Semester 1, 2014 CRICOS Provider No: 00098G MATH5836 Course Outline Information about the course
Learning with Local and Global Consistency
Learning with Local and Global Consistency Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf Max Planck Institute for Biological Cybernetics, 7276 Tuebingen, Germany
In this presentation, you will be introduced to data mining and the relationship with meaningful use.
In this presentation, you will be introduced to data mining and the relationship with meaningful use. Data mining refers to the art and science of intelligent data analysis. It is the application of machine
Spam detection with data mining method:
Spam detection with data mining method: Ensemble learning with multiple SVM based classifiers to optimize generalization ability of email spam classification Keywords: ensemble learning, SVM classifier,
A Learning Based Method for Super-Resolution of Low Resolution Images
A Learning Based Method for Super-Resolution of Low Resolution Images Emre Ugur June 1, 2004 [email protected] Abstract The main objective of this project is the study of a learning based method
Getting Even More Out of Ensemble Selection
Getting Even More Out of Ensemble Selection Quan Sun Department of Computer Science The University of Waikato Hamilton, New Zealand [email protected] ABSTRACT Ensemble Selection uses forward stepwise
Network Machine Learning Research Group. Intended status: Informational October 19, 2015 Expires: April 21, 2016
Network Machine Learning Research Group S. Jiang Internet-Draft Huawei Technologies Co., Ltd Intended status: Informational October 19, 2015 Expires: April 21, 2016 Abstract Network Machine Learning draft-jiang-nmlrg-network-machine-learning-00
A STUDY REGARDING INTER DOMAIN LINKED DOCUMENTS SIMILARITY AND THEIR CONSEQUENT BOUNCE RATE
STUDIA UNIV. BABEŞ BOLYAI, INFORMATICA, Volume LIX, Number 1, 2014 A STUDY REGARDING INTER DOMAIN LINKED DOCUMENTS SIMILARITY AND THEIR CONSEQUENT BOUNCE RATE DIANA HALIŢĂ AND DARIUS BUFNEA Abstract. Then
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms
Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail
Intrusion Detection System using Log Files and Reinforcement Learning
Intrusion Detection System using Log Files and Reinforcement Learning Bhagyashree Deokar, Ambarish Hazarnis Department of Computer Engineering K. J. Somaiya College of Engineering, Mumbai, India ABSTRACT
KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics
ROCHESTER INSTITUTE OF TECHNOLOGY COURSE OUTLINE FORM KATE GLEASON COLLEGE OF ENGINEERING John D. Hromi Center for Quality and Applied Statistics NEW (or REVISED) COURSE (KGCOE- CQAS- 747- Principles of
Machine Learning. Mausam (based on slides by Tom Mitchell, Oren Etzioni and Pedro Domingos)
Machine Learning Mausam (based on slides by Tom Mitchell, Oren Etzioni and Pedro Domingos) What Is Machine Learning? A computer program is said to learn from experience E with respect to some class of
