CHARACTERISTICS IN FLIGHT DATA ESTIMATION WITH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINES

Save this PDF as:

Size: px
Start display at page:

Download "CHARACTERISTICS IN FLIGHT DATA ESTIMATION WITH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINES"

Transcription

1 CHARACTERISTICS IN FLIGHT DATA ESTIMATION WITH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINES Claus Gwiggner, Ecole Polytechnique, LIX, Palaiseau, France Gert Lanckriet, University of Berkeley, EECS, Berkeley, California Abstract We analyze data from flight sectors. The questions are whether there are differences between weekend and weekdays and among sectors. We compare expected prediction errors of linear logistic regression and of linear and non linear kernel classifiers. Linear decision boundaries impose an average prediction error of around around 26 % for the weekend data and around 15 % for the sector name data. Non linear boundaries do not improve the predictive accuracy by more than 4 %. Thus, there is some characteristic in the data which is identified by both methods. General Background Airspace is divided into geographical regions, called sectors. For safety reasons, no more than a certain number of aircraft is allowed to enter certain sectors during one hour. Such numbers are called sector capacities. Airlines pose a demand to enter sectors before take off by submitting a flight plan to a control center. A flight plan is essentially a time stamped list of way points. When demand is higher than capacity either take off is delayed or aircraft are rerouted. We speak of initial demand and regulated demand of a sector. Although pilots have to follow their flight plans, there are differences between the number of aircraft planned to enter sectors and the number that really entered them (the real demand). By consequence, safety is not always guaranteed and available capacity is not always optimally used. We call these differences planning differences. They are consequences of uncertain events like weather conditions, delays, en air reroutings or more. Such events are not taken into account by the current traffic planning. If there are regularities in planning differences, they can be used to improve current traffic planning. Data Description We focus on four sectors in the upper Berlin airspace where planning differences are reported to occur. The sectors are roughly equal in size. The average traversal time of a sector is ten minutes. We use regulated demand (number of aircraft planned to enter a sector) and real demand data (number of aircraft that really entered a sector) counted in intervals of 60 minutes for a total of 141 weekdays and 68 weekend days in the period June 2003 April 2004 of the four sectors EDBBUR1 4. Approach to Uncertainty We consider the data as a finite number of realizations of random variables 1. More precisely, we define S REAL t1;t2 = 'number of aircraft entering sector S between t1 and t2' for the real demand. Similarly, we define REG for regulated demand and DIFF =REAL REG for the planning differences. A sector is thus represented by a vector of random variables, one variable for each time interval. Hypothesis of the paper Planning differences show irregular patterns in every sector: Figures 1 and 3 display eleven days of planning differences for the sectors EDBBUR2 and 3 respectively. However, their empirical probability distributions turn out to have similar shapes invariantly of time and sector [1] (figure 2 shows histograms of the first twelve hours of the day of planning differences for sector EDBBUR2, the shapes of distributions of the other three sectors are similar). Our hypothesis is that planning differences have the same characteristics between sectors and on various days. If this is the case, we 1 for a definition of terms from probability theory and statistics we refer to [4] or any introductory book of the subject

2 Illustration 1 Planning Differences UR2 Illustration 3 Planning Differences UR3 can reasonably simplify any future model of planning differences. We formulate the question as a binary classification problem. If we find classes, the hypothesis is unlikely to be true. We are in an exploratory phase of analysis. Formal statistical inference is not intended. This paper is organized as follows: in the next section we explain briefly the ideas behind logistic regression and support vector machines and why we compare these two classification techniques against each other. We then explain the experimental settings and our results. Finally, we give conclusions and ideas for future work. Binary Classification In a binary classification problem, one has a set of known items belonging to one of two classes 1 and 1 and one likes to predict for a new item, to which class it might belong. We speak of the vector of predictor variables X (the 24 hourly variables representing a sector) and the class variable Y (representing the question whether a vector is from a weekday or weekend or from a given sector). There are different approaches to this problem. Some of them refer to a probabilistic and others to a geometric interpretation. In this section we review the ideas behind one out of each; logistic regression and support vector machines. The different approaches are related to each other. For more detailed information about classification we refer to [2] or [3]. Logistic Regression Logistic regression explicitly assumes a functional form for class probabilities: Illustration 2 Similar Shape of Distributions over time P Y=1 X=x = ef X 1 e F X The equation means that a functional relationship F between an item x and the probability, that it belongs to class 1 is assumed. Since there are only two classes, the probability for class 1 follows directly. The fraction on the right side is a transformation that guarantees that the probability

3 estimates lie in [0,1] for any value of F(x). This transformation is well known in statistics and discussed for example in [5], [6]. Linear logistic regression [7] is a special case, where F x = x is chosen to be a linear function. The parameters of the linear model are usually estimated by binomial maximum likelihood. Other forms for F are studied in [4]. Prediction with logistic regression models translates into the question: given a new measurement, what is the probability that it belongs to class 1? Logistic regression is a popular technique in data analysis and known since the early 19th century [5]. The goal is to understand the role of the predictor variables in explaining the class. Support Vector Machines (SVMs) It is also possible to characterize classes by their boundaries that they draw in Euclidean space. For the binary classification problem, only one boundary characterizes the whole solution. Class 1 lies on the one side and class 1 on the other side of the boundary. The idea behind support vector machines is to directly estimate decision boundaries as a function of the predictor variables. The result is a classifier mapping input values to the one or the other side of the boundary. SVMs overcome a number of limitations of other related techniques, namely the linearity of the decision boundaries and the problem of overlapping classes [2]. We only characterize informally how SVMs work and refer to literature for detailed information (e.g. [2],[8]). Nonlinear decision boundaries are found by a transformation of the predictor variables in a high, sometimes infinite dimensional space in which a linear boundary is sought. This transformation is calculated efficiently by the use of a kernel function. In the case of overlapping classes, the condition that points of one class have to lie on the one side and those of the other class on the other side of the boundary can be relaxed. In both cases, finding an optimally classifying boundary is formulated as a quadratic optimization problem and solved by standard techniques. Model selection is originally based on the Vapnik Chernovenkis theory [8]. It is not known, however, whether this technique has an advantage over cross validation [2]. Prediction with SVMs is done by evaluating the obtained classifier on a new point, with class 1 or 1 as a direct result. This classifier generally depends only on few training examples; the support vectors. Since their invention in 1988 [8], Support vector machines have been successfully applied in different domains [9]. Despite their promising technical strength, they are sometimes criticized. Their results cannot be easily interpreted and one does not know the role of the predictor variables [9]. Comparison between Logistic Regression and Support Vector Machines Logistic regression and SVMs are related: a SVM can be seen as an estimator of the class probabilities [2]. When a linear separation is possible, logistic regression will always find it [2]. Logistic regression implies linear decision boundaries, which can be seen by equaling the two class equations. It is not the scope of this report to elaborate this relationship. We are mainly interested in the question whether these two approaches a traditional, linear approach, and a newer, non linear approach can give us different insight in hidden structures in the sector data. Experiments We conduct several classification experiments in order to gain insight into how planning differences behave in different sectors and on different days. For this, we create data in three stages: the randomly permuted sector data, sector data where the number of positive and negative instances is and data, where only a subset of variables is selected. As described above, predictive accuracy of SVMs is critical to free parameters used. We combine thus a large number of SVM parameters systematically. These are the Kernel Functions: linear, Gaussian, polynomial, each in raw and in centered and normalized form, their associated parameters and the loss functions (one norm and two norm). In total, more than 800 SVM models are estimated per experiment. Parameters of the SVMs are estimated by cross validation and of the logistic regression by standard techniques [10]. Expected prediction errors for both are estimated by cross validation. For this, we split our data into 15 parts of equal size, train a model on 14 parts and calculate prediction error for the remaining one. To obtain an estimate of the prediction error, the average for 15 runs for each model is taken. We compare our results with a

4 Wilcoxon Mann Whitney test. We use the best 10 runs for the Logistic regression and for the SVM. Two categories of classification experiments have been carried out: Weekend/Weekday Experiment Data from one sector is classified according to whether it is from a weekday or the weekend. Best classification results are obtained on the raw data, that is, un. Here, performance of logistic regression and SVM do not differ significantly. EPE are around 26 %. In more detail, SVM perform significantly worse on data and on data with variable selection. Logistic regression performance does not differ between raw data and data but is significantly better on raw data than on data with variable selection. As a baseline, we assigned class attributes arbitrarily with EPE ~ 50 % (tables 1,2,3). Sector Name Experiment In this experiment we are interested whether data from a sector S differs from that of the other sectors. Data is attributed to group 1 if belonging to sector S and to group 0 otherwise. The classification experiment is run for each of the sectors. Best SVM performs significantly better than best Logistic Regression but no more than 4 % (on a 10 % level). Average EPE for logistic regression is 18.4 % and 12.8 % for SVM. Kernel Statistics: The top ten performances in the raw data are achieved exclusively by Gauss and Gauss CN Kernels. For the dataset, the situation is similar, but 3 Poly CN appear in the list, as well. No linear kernel appears in the list. Balancing the sample decreases quality significantly for logistic regression and has no impact on SVM. As above, our baseline resulted in EPE ~ 50 % (tables 4,5). Conclusions and Future Work We conducted 20 classification experiments on the two datasets 'Weekend/Weekday' and 'Sector Name'. Taking the best results of each method tells us that the expected prediction errors lie around 26 % for the weekend data and around 15 % for the sector name data. Thus, there is some characteristic in the data, which is identified by both methods. Comparing logistic regression and SVM tells us that there is no significant difference in predictive accuracy on the weekend data and less than 4 % of better accuracy for the SVM than for the logistic regression in the sector name experiment. The best kernels are Gauss and Gauss CN in 98 % and Poly CN in the remaining. No linear kernel appears in the top ten performances of every experiment. We conclude that SVMs do not promise a major improvement on predictive accuracy, even if more parameter tuning experiments should follow. For future work, we shall identify the reasons for the differences, such as traffic density or sector complexity. The implication of the existence of classifiers for the hypothesis of different underlying probability distributions has to be studied in further detail. Acknowledgments The authors like to thank Alexandre d'aspremont, Laurent el Ghaoui and Devan Sohier for their helpful comments. Bibliography [1] Some Spatio Temporal Characteristics of the Planning Error in European ATFM. C. Gwiggner, P. Baptiste, V. Duong. Proceedings of the 7th International IEEE Conference on Intelligent Transportation Systems. ITSC [2] The Elements of Statistical Learning. Data Mining, Inference, and Prediction, T. Hastie, R. Tibshirani, J. Friedman, Springer Series in Statistics, [3] Probabilites, Analyse des Donnees et Statistique. G. Saporta. Editions Technip. Paris [4] Additive logistic regression: a statistical view of boosting. J. Friedman, T. Hastie, and R. Tibshirani. Ann. Statist. 28 (2000), no. 2, [5] The Origins of Logistic Regression. J.S. Cramer. Tinbergen Institute Discussion Papers /4. Tinbergen Institute [6] Why the logistic function? A tutorial discussion on probabilities and neural networks. M. I. Jordan. MIT Computational Cognitive Science Report 9503, [7] Generalized Linear Models, 2nd Edition. P. McCullagh, J. A. Nelder, Chapman \& Hall, Monographs on Statistics and Applied Probability

5 [8] Support Vector and Kernel Methods. N. Christianini, J. S. Shawe Taylor. In Intelligent Data Analysis. An Introduction, 2nd Edition. M. Berthold, D. J. Hand (Eds). Springer [9] Recent advances in predictive (machine) learning. J. Friedman, Nov Technical Report. Stanford University. [10] R: A language and environment for statistical computing, R Development Core Team, R Foundation for Statistical Computing, Vienna, Austria, Annex The following tables contain the expected prediction errors in columns LReg and SVM and the result of the comparison in the column Comp. Here, the symbol ~ is used for no significant difference. raw UR SVM UR SVM UR ~ UR ~ Table 1 Results Weekend raw UR ~ UR SVM UR LReg UR ~ Table 2 Results Weekend 6 19 UR LReg UR LReg UR SVM UR LReg Table 3 Results Weekend 6 19 h Name raw UR SVM UR SVM UR ~ UR ~ Table 4 Results Named raw Name UR SVM UR ~ UR ~ UR SVM Table 5 Results Named

Support Vector Machines with Clustering for Training with Very Large Datasets

Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France theodoros.evgeniou@insead.fr Massimiliano

C19 Machine Learning

C9 Machine Learning 8 Lectures Hilary Term 25 2 Tutorial Sheets A. Zisserman Overview: Supervised classification perceptron, support vector machine, loss functions, kernels, random forests, neural networks

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail

Classification of Bad Accounts in Credit Card Industry

Classification of Bad Accounts in Credit Card Industry Chengwei Yuan December 12, 2014 Introduction Risk management is critical for a credit card company to survive in such competing industry. In addition

Introduction to Support Vector Machines. Colin Campbell, Bristol University

Introduction to Support Vector Machines Colin Campbell, Bristol University 1 Outline of talk. Part 1. An Introduction to SVMs 1.1. SVMs for binary classification. 1.2. Soft margins and multi-class classification.

A Learning Algorithm For Neural Network Ensembles

A Learning Algorithm For Neural Network Ensembles H. D. Navone, P. M. Granitto, P. F. Verdes and H. A. Ceccatto Instituto de Física Rosario (CONICET-UNR) Blvd. 27 de Febrero 210 Bis, 2000 Rosario. República

Employer Health Insurance Premium Prediction Elliott Lui

Employer Health Insurance Premium Prediction Elliott Lui 1 Introduction The US spends 15.2% of its GDP on health care, more than any other country, and the cost of health insurance is rising faster than

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

203.4770: Introduction to Machine Learning Dr. Rita Osadchy

203.4770: Introduction to Machine Learning Dr. Rita Osadchy 1 Outline 1. About the Course 2. What is Machine Learning? 3. Types of problems and Situations 4. ML Example 2 About the course Course Homepage:

Introduction to machine learning and pattern recognition Lecture 1 Coryn Bailer-Jones

Introduction to machine learning and pattern recognition Lecture 1 Coryn Bailer-Jones http://www.mpia.de/homes/calj/mlpr_mpia2008.html 1 1 What is machine learning? Data description and interpretation

Lecture 6: Logistic Regression

Lecture 6: CS 194-10, Fall 2011 Laurent El Ghaoui EECS Department UC Berkeley September 13, 2011 Outline Outline Classification task Data : X = [x 1,..., x m]: a n m matrix of data points in R n. y { 1,

Statistical Models in Data Mining

Statistical Models in Data Mining Sargur N. Srihari University at Buffalo The State University of New York Department of Computer Science and Engineering Department of Biostatistics 1 Srihari Flood of

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence

Artificial Neural Networks and Support Vector Machines CS 486/686: Introduction to Artificial Intelligence 1 Outline What is a Neural Network? - Perceptron learners - Multi-layer networks What is a Support

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets Fabian Grüning Carl von Ossietzky Universität Oldenburg, Germany, fabian.gruening@informatik.uni-oldenburg.de Abstract: Independent

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct

Bootstrapping Big Data

Bootstrapping Big Data Ariel Kleiner Ameet Talwalkar Purnamrita Sarkar Michael I. Jordan Computer Science Division University of California, Berkeley {akleiner, ameet, psarkar, jordan}@eecs.berkeley.edu

DATA ANALYTICS USING R

DATA ANALYTICS USING R Duration: 90 Hours Intended audience and scope: The course is targeted at fresh engineers, practicing engineers and scientists who are interested in learning and understanding data

Knowledge Discovery from patents using KMX Text Analytics

Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs anton.heijs@treparel.com Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers

Regression Using Support Vector Machines: Basic Foundations

Regression Using Support Vector Machines: Basic Foundations Technical Report December 2004 Aly Farag and Refaat M Mohamed Computer Vision and Image Processing Laboratory Electrical and Computer Engineering

Package acrm. R topics documented: February 19, 2015

Package acrm February 19, 2015 Type Package Title Convenience functions for analytical Customer Relationship Management Version 0.1.1 Date 2014-03-28 Imports dummies, randomforest, kernelfactory, ada Author

Lecture 3: Linear methods for classification

Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,

Making Sense of the Mayhem: Machine Learning and March Madness

Making Sense of the Mayhem: Machine Learning and March Madness Alex Tran and Adam Ginzberg Stanford University atran3@stanford.edu ginzberg@stanford.edu I. Introduction III. Model The goal of our research

Predicting Flight Delays

Predicting Flight Delays Dieterich Lawson jdlawson@stanford.edu William Castillo will.castillo@stanford.edu Introduction Every year approximately 20% of airline flights are delayed or cancelled, costing

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.

Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott

SAS Software to Fit the Generalized Linear Model

SAS Software to Fit the Generalized Linear Model Gordon Johnston, SAS Institute Inc., Cary, NC Abstract In recent years, the class of generalized linear models has gained popularity as a statistical modeling

Predict the Popularity of YouTube Videos Using Early View Data

000 001 002 003 004 005 006 007 008 009 010 011 012 013 014 015 016 017 018 019 020 021 022 023 024 025 026 027 028 029 030 031 032 033 034 035 036 037 038 039 040 041 042 043 044 045 046 047 048 049 050

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

Machine Learning in FX Carry Basket Prediction

Machine Learning in FX Carry Basket Prediction Tristan Fletcher, Fabian Redpath and Joe D Alessandro Abstract Artificial Neural Networks ANN), Support Vector Machines SVM) and Relevance Vector Machines

II. RELATED WORK. Sentiment Mining

Sentiment Mining Using Ensemble Classification Models Matthew Whitehead and Larry Yaeger Indiana University School of Informatics 901 E. 10th St. Bloomington, IN 47408 {mewhiteh, larryy}@indiana.edu Abstract

THE SVM APPROACH FOR BOX JENKINS MODELS

REVSTAT Statistical Journal Volume 7, Number 1, April 2009, 23 36 THE SVM APPROACH FOR BOX JENKINS MODELS Authors: Saeid Amiri Dep. of Energy and Technology, Swedish Univ. of Agriculture Sciences, P.O.Box

Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier

Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier D.Nithya a, *, V.Suganya b,1, R.Saranya Irudaya Mary c,1 Abstract - This paper presents,

Using Data Mining for Mobile Communication Clustering and Characterization

Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer

Penalized Logistic Regression and Classification of Microarray Data

Penalized Logistic Regression and Classification of Microarray Data Milan, May 2003 Anestis Antoniadis Laboratoire IMAG-LMC University Joseph Fourier Grenoble, France Penalized Logistic Regression andclassification

Classification Problems

Classification Read Chapter 4 in the text by Bishop, except omit Sections 4.1.6, 4.1.7, 4.2.4, 4.3.3, 4.3.5, 4.3.6, 4.4, and 4.5. Also, review sections 1.5.1, 1.5.2, 1.5.3, and 1.5.4. Classification Problems

Notes on Support Vector Machines

Notes on Support Vector Machines Fernando Mira da Silva Fernando.Silva@inesc.pt Neural Network Group I N E S C November 1998 Abstract This report describes an empirical study of Support Vector Machines

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

Predicting required bandwidth for educational institutes using prediction techniques in data mining (Case Study: Qom Payame Noor University)

260 IJCSNS International Journal of Computer Science and Network Security, VOL.11 No.6, June 2011 Predicting required bandwidth for educational institutes using prediction techniques in data mining (Case

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics http://www.ccmb.med.umich.edu/node/1376

Course Director: Dr. Kayvan Najarian (DCM&B, kayvan@umich.edu) Lectures: Labs: Mondays and Wednesdays 9:00 AM -10:30 AM Rm. 2065 Palmer Commons Bldg. Wednesdays 10:30 AM 11:30 AM (alternate weeks) Rm.

E-commerce Transaction Anomaly Classification

E-commerce Transaction Anomaly Classification Minyong Lee minyong@stanford.edu Seunghee Ham sham12@stanford.edu Qiyi Jiang qjiang@stanford.edu I. INTRODUCTION Due to the increasing popularity of e-commerce

Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods

Mining Direct Marketing Data by Ensembles of Weak Learners and Rough Set Methods Jerzy B laszczyński 1, Krzysztof Dembczyński 1, Wojciech Kot lowski 1, and Mariusz Paw lowski 2 1 Institute of Computing

Intrusion Detection via Machine Learning for SCADA System Protection

Intrusion Detection via Machine Learning for SCADA System Protection S.L.P. Yasakethu Department of Computing, University of Surrey, Guildford, GU2 7XH, UK. s.l.yasakethu@surrey.ac.uk J. Jiang Department

CS 2750 Machine Learning. Lecture 1. Machine Learning. CS 2750 Machine Learning.

Lecture 1 Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x-5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott

Pattern Analysis. Logistic Regression. 12. Mai 2009. Joachim Hornegger. Chair of Pattern Recognition Erlangen University

Pattern Analysis Logistic Regression 12. Mai 2009 Joachim Hornegger Chair of Pattern Recognition Erlangen University Pattern Analysis 2 / 43 1 Logistic Regression Posteriors and the Logistic Function Decision

Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools 2009-2010

Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools 2009-2010 Week 1 Week 2 14.0 Students organize and describe distributions of data by using a number of different

The Artificial Prediction Market

The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory

The Data Mining Process

Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data

Machine Learning in Automatic Music Chords Generation

Machine Learning in Automatic Music Chords Generation Ziheng Chen Department of Music zihengc@stanford.edu Jie Qi Department of Electrical Engineering qijie@stanford.edu Yifei Zhou Department of Statistics

Ensemble Approach for the Classification of Imbalanced Data

Ensemble Approach for the Classification of Imbalanced Data Vladimir Nikulin 1, Geoffrey J. McLachlan 1, and Shu Kay Ng 2 1 Department of Mathematics, University of Queensland v.nikulin@uq.edu.au, gjm@maths.uq.edu.au

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

Least Squares Estimation

Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Henrik Boström School of Humanities and Informatics University of Skövde P.O. Box 408, SE-541 28 Skövde

Predict Influencers in the Social Network

Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

Data Mining. Nonlinear Classification

Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

Face Recognition using SIFT Features

Face Recognition using SIFT Features Mohamed Aly CNS186 Term Project Winter 2006 Abstract Face recognition has many important practical applications, like surveillance and access control.

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 123 CHAPTER 7 BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 7.1 Introduction Even though using SVM presents

Support Vector Machine. Tutorial. (and Statistical Learning Theory)

Support Vector Machine (and Statistical Learning Theory) Tutorial Jason Weston NEC Labs America 4 Independence Way, Princeton, USA. jasonw@nec-labs.com 1 Support Vector Machines: history SVMs introduced

Detecting Corporate Fraud: An Application of Machine Learning

Detecting Corporate Fraud: An Application of Machine Learning Ophir Gottlieb, Curt Salisbury, Howard Shek, Vishal Vaidyanathan December 15, 2006 ABSTRACT This paper explores the application of several

MACHINE LEARNING. Introduction. Alessandro Moschitti

MACHINE LEARNING Introduction Alessandro Moschitti Department of Computer Science and Information Engineering University of Trento Email: moschitti@disi.unitn.it Course Schedule Lectures Tuesday, 14:00-16:00

Acknowledgments. Data Mining with Regression. Data Mining Context. Overview. Colleagues

Data Mining with Regression Teaching an old dog some new tricks Acknowledgments Colleagues Dean Foster in Statistics Lyle Ungar in Computer Science Bob Stine Department of Statistics The School of the

On the Path to an Ideal ROC Curve: Considering Cost Asymmetry in Learning Classifiers

On the Path to an Ideal ROC Curve: Considering Cost Asymmetry in Learning Classifiers Francis R. Bach Computer Science Division University of California Berkeley, CA 9472 fbach@cs.berkeley.edu Abstract

SURVIVABILITY OF COMPLEX SYSTEM SUPPORT VECTOR MACHINE BASED APPROACH

1 SURVIVABILITY OF COMPLEX SYSTEM SUPPORT VECTOR MACHINE BASED APPROACH Y, HONG, N. GAUTAM, S. R. T. KUMARA, A. SURANA, H. GUPTA, S. LEE, V. NARAYANAN, H. THADAKAMALLA The Dept. of Industrial Engineering,

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

The Scientific Data Mining Process

Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In

Statistics for BIG data

Statistics for BIG data Statistics for Big Data: Are Statisticians Ready? Dennis Lin Department of Statistics The Pennsylvania State University John Jordan and Dennis K.J. Lin (ICSA-Bulletine 2014) Before

Big Data Big Knowledge?

EBPI Epidemiology, Biostatistics and Prevention Institute Big Data Big Knowledge? Torsten Hothorn 2015-03-06 The end of theory The End of Theory: The Data Deluge Makes the Scientific Method Obsolete (Chris

Prediction of Stock Performance Using Analytical Techniques

136 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 5, NO. 2, MAY 2013 Prediction of Stock Performance Using Analytical Techniques Carol Hargreaves Institute of Systems Science National University

Question 2 Naïve Bayes (16 points)

Question 2 Naïve Bayes (16 points) About 2/3 of your email is spam so you downloaded an open source spam filter based on word occurrences that uses the Naive Bayes classifier. Assume you collected the

Benchmarking of different classes of models used for credit scoring

Benchmarking of different classes of models used for credit scoring We use this competition as an opportunity to compare the performance of different classes of predictive models. In particular we want

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!

Regularized Logistic Regression for Mind Reading with Parallel Validation

Regularized Logistic Regression for Mind Reading with Parallel Validation Heikki Huttunen, Jukka-Pekka Kauppi, Jussi Tohka Tampere University of Technology Department of Signal Processing Tampere, Finland

Introduction to Machine Learning. Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011

Introduction to Machine Learning Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011 1 Outline 1. What is machine learning? 2. The basic of machine learning 3. Principles and effects of machine learning

An Introduction to Statistical Machine Learning - Overview -

An Introduction to Statistical Machine Learning - Overview - Samy Bengio bengio@idiap.ch Dalle Molle Institute for Perceptual Artificial Intelligence (IDIAP) CP 592, rue du Simplon 4 1920 Martigny, Switzerland

Azure Machine Learning, SQL Data Mining and R

Azure Machine Learning, SQL Data Mining and R Day-by-day Agenda Prerequisites No formal prerequisites. Basic knowledge of SQL Server Data Tools, Excel and any analytical experience helps. Best of all:

Support Vector Machines Explained

March 1, 2009 Support Vector Machines Explained Tristan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introduction This document has been written in an attempt to make the Support Vector Machines (SVM),

Classification: Basic Concepts, Decision Trees, and Model Evaluation. General Approach for Building Classification Model

10 10 Classification: Basic Concepts, Decision Trees, and Model Evaluation Dr. Hui Xiong Rutgers University Introduction to Data Mining 1//009 1 General Approach for Building Classification Model Tid Attrib1

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012

Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts

Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

Government of Russian Federation. Faculty of Computer Science School of Data Analysis and Artificial Intelligence

Government of Russian Federation Federal State Autonomous Educational Institution of High Professional Education National Research University «Higher School of Economics» Faculty of Computer Science School

Data Mining Yelp Data - Predicting rating stars from review text

Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University rchada@cs.stonybrook.edu Chetan Naik Stony Brook University cnaik@cs.stonybrook.edu ABSTRACT The majority

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

RESEARCH ARTICLE OPEN ACCESS Data Mining Technology for Efficient Network Security Management Ankit Naik [1], S.W. Ahmad [2] Student [1], Assistant Professor [2] Department of Computer Science and Engineering

PRACTICAL DATA MINING IN A LARGE UTILITY COMPANY

QÜESTIIÓ, vol. 25, 3, p. 509-520, 2001 PRACTICAL DATA MINING IN A LARGE UTILITY COMPANY GEORGES HÉBRAIL We present in this paper the main applications of data mining techniques at Electricité de France,

BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL

The Fifth International Conference on e-learning (elearning-2014), 22-23 September 2014, Belgrade, Serbia BOOSTING - A METHOD FOR IMPROVING THE ACCURACY OF PREDICTIVE MODEL SNJEŽANA MILINKOVIĆ University

: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

Recognizing Informed Option Trading

Recognizing Informed Option Trading Alex Bain, Prabal Tiwaree, Kari Okamoto 1 Abstract While equity (stock) markets are generally efficient in discounting public information into stock prices, we believe

Hong Kong Stock Index Forecasting

Hong Kong Stock Index Forecasting Tong Fu Shuo Chen Chuanqi Wei tfu1@stanford.edu cslcb@stanford.edu chuanqi@stanford.edu Abstract Prediction of the movement of stock market is a long-time attractive topic

KEITH LEHNERT AND ERIC FRIEDRICH

MACHINE LEARNING CLASSIFICATION OF MALICIOUS NETWORK TRAFFIC KEITH LEHNERT AND ERIC FRIEDRICH 1. Introduction 1.1. Intrusion Detection Systems. In our society, information systems are everywhere. They

Chapter 6. The stacking ensemble approach

82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described

New Ensemble Combination Scheme

New Ensemble Combination Scheme Namhyoung Kim, Youngdoo Son, and Jaewook Lee, Member, IEEE Abstract Recently many statistical learning techniques are successfully developed and used in several areas However,

The Predictive Data Mining Revolution in Scorecards:

January 13, 2013 StatSoft White Paper The Predictive Data Mining Revolution in Scorecards: Accurate Risk Scoring via Ensemble Models Summary Predictive modeling methods, based on machine learning algorithms

Machine Learning in Spam Filtering

Machine Learning in Spam Filtering A Crash Course in ML Konstantin Tretyakov kt@ut.ee Institute of Computer Science, University of Tartu Overview Spam is Evil ML for Spam Filtering: General Idea, Problems.

Exploratory Data Analysis

Exploratory Data Analysis Johannes Schauer johannes.schauer@tugraz.at Institute of Statistics Graz University of Technology Steyrergasse 17/IV, 8010 Graz www.statistics.tugraz.at February 12, 2008 Introduction

Which Is the Best Multiclass SVM Method? An Empirical Study

Which Is the Best Multiclass SVM Method? An Empirical Study Kai-Bo Duan 1 and S. Sathiya Keerthi 2 1 BioInformatics Research Centre, Nanyang Technological University, Nanyang Avenue, Singapore 639798 askbduan@ntu.edu.sg

Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models.

Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models. Dr. Jon Starkweather, Research and Statistical Support consultant This month

Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague. http://ida.felk.cvut.

Machine Learning and Data Analysis overview Jiří Kléma Department of Cybernetics, Czech Technical University in Prague http://ida.felk.cvut.cz psyllabus Lecture Lecturer Content 1. J. Kléma Introduction,

Data, Measurements, Features

Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are

CLASSIFICATION OF TRADING STRATEGIES IN ADAPTIVE MARKETS

M A C H I N E L E A R N I N G P R O J E C T F I N A L R E P O R T F A L L 2 7 C S 6 8 9 CLASSIFICATION OF TRADING STRATEGIES IN ADAPTIVE MARKETS MARK GRUMAN MANJUNATH NARAYANA Abstract In the CAT Tournament,

The Probit Link Function in Generalized Linear Models for Data Mining Applications

Journal of Modern Applied Statistical Methods Copyright 2013 JMASM, Inc. May 2013, Vol. 12, No. 1, 164-169 1538 9472/13/\$95.00 The Probit Link Function in Generalized Linear Models for Data Mining Applications

Introduction to Logistic Regression

OpenStax-CNX module: m42090 1 Introduction to Logistic Regression Dan Calderon This work is produced by OpenStax-CNX and licensed under the Creative Commons Attribution License 3.0 Abstract Gives introduction