CNFSAT: Predictive Models, Dimensional Reduction, and Phase Transition

Neil P. Slagle
College of Computing
Georgia Institute of Technology
Atlanta, GA

Abstract

CNFSAT embodies the P versus NP computational dilemma. Most machine learning research on CNFSAT applies randomized optimization techniques to specific problem instances in search of a satisfying assignment. A novel approach is treating CNFSAT as a supervised learning and dimension reduction problem, seeking models capable of predicting satisfiability with a relatively low number of features. Herein we present empirical results after applying decision trees to three representations of CNFSAT over four and five variables, and linear regression to three representations of MAXSAT and the CNFSAT solution count (SATSOL) problem over four and five variables. The representations are bit vectors indicating the clauses included in the formulas, principal component analysis (PCA) applied to the previous representation, and a simple clause count and variable hit representation. Significantly, the first principal component essentially measures the number of clauses in an instance, and decision trees on CNFSAT and linear regression on MAXSAT and SATSOL empirically offer a 20% improvement in error rates after preprocessing the data with PCA, baselining against the full bit vector representation. The clause count and variable hit representation gives the lowest error rates observed in the experiments. Thus, feature selection demonstrates a reduction in error using decision trees and linear regression as prediction models. Notably, PCA and the clause count and variable hit representations reduce the representation size from O(3^n) to O(n^2) and O(n), respectively, where n is the number of variables, while reducing predictive error. Also of interest, the linear regression prediction model exhibits the phase transition of MAXSAT and SATSOL that appears in randomized optimization approaches applied to MAXSAT; that is, decision trees, linear regression, and PCA applied to these problems encounter difficulty in prediction over the same regions where randomized optimization algorithms encounter difficulty in solving formulas.
1 Background and related work

CNFSAT, or conjunctive normal form satisfiability, first shown to be NP-complete by Cook [1], is the problem of determining whether a Boolean formula in conjunctive normal form is satisfiable. Much of the research on CNFSAT applies randomized optimization techniques to specific problem instances [2]. MAXSAT is the problem of determining the maximum number of clauses satisfiable in a given instance of CNFSAT. The maximum number of satisfiable clauses in a given instance of CNFSAT is at least 50% of the number of clauses in the instance, so we can characterize the MAXSAT output as a fraction of clauses satisfiable. MAXSAT, like CNFSAT, is NP-hard [3], and similarly, much of the existing research applies stochastic optimization [4]. Counting the number of solutions (SATSOL) of a given instance of CNFSAT is a #P problem (defined in [5]).

The phase transition of CNFSAT and MAXSAT appears in [4], [6], and [7]. Phase transitions in combinatorial problems exhibit regions containing hard and easy problem instances with respect to some key parameter [6]. In CNFSAT, MAXSAT, and, based on results herein, SATSOL, a key parameter is the number of clauses [4].

Finally, motivating the approaches discussed herein is that existing machine learning research on these problems applies randomized optimization to instances of problems rather than building supervised learning models for, and applying feature selection algorithms to, large collections of instances as data sets [2], [4]. That is, little or no research exists attempting to transform these problems into instances of supervised learning. (We previously applied other supervised learning techniques, namely SVMs, kNN, boosting, and ANNs, with insignificant results; decision trees and linear regression offer attractive results.)

2 Representing CNFSAT, MAXSAT, and SATSOL as data

To represent CNFSAT, MAXSAT, and SATSOL as data, we create bit vectors indicating the hash values of the clauses present in a formula, a dimensionally compressed space over the bit vectors using PCA, and a restrictive clause count and variable hit representation.

2.1 The bit vectors

The bit vector representation is isomorphic to the formula space; that is, we can recover the original formula from this representation. Though the bit vectors capture the complexity of the formulas and thus become infeasible for even low variable counts, we can leverage them to generate the second representation and as a baseline for comparing performance of the predictive models.

The space of formulas containing up to n variables contains 3^n − 1 distinct clauses, assuming no clause contains a particular variable more than once. (Each variable can appear in a clause in one of three ways: absent, as a positive literal, or as a negative literal; subtracting one for the empty clause, we obtain 3^n − 1 possible clauses.) Bit vectors over four variables therefore require 80 entries in addition to the label; bit vectors over five variables require 242 entries in addition to the label.

To represent a particular formula as a bit vector, we hash the clauses of the formula so that clauses over, say, the j lowest variables, ordered by indices, exhibit the same hash values irrespective of the value of n. More formally, given the hash function H: clauses → Z+ and some clause C over x_1, …, x_{n−1},

H(C ∨ x_n) = H(C) + 3^(n−1),
H(C ∨ ¬x_n) = H(C) + 2(3^(n−1)),
H(x_n) = 3^(n−1),
H(¬x_n) = 2(3^(n−1)),

where H(x_1) = 1 and H(¬x_1) = 2. Defined recursively, this hashing mechanism allows formulas over n variables to subsume formulas over smaller subsets of more lowly indexed variables in a natural way. That is, if F_n is the bit vector representing a formula F over x_1, …, x_n, then in the space of formulas over x_1, …, x_{n+1}, F's bit vector F_{n+1} appears identical to F_n in the first 3^n − 1 entries.

As a bit vector example, suppose a formula over just three variables is (x_1 ∨ x_2) ∧ (¬x_1 ∨ x_3) ∧ ¬x_3. Under the hash above, the three clauses hash to 4, 11, and 18, respectively, so the bit vector representations are

0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0, 1 (CNFSAT),
0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0, 3/3 (MAXSAT),
0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0, 1 (SATSOL).

Since a three variable formula can contain any of 26 clauses, the length of the vector sans the label is 26. The locations of the ones in the bit vector indicate the hash values of the clauses in the given formula. The final entry is the label: for CNFSAT, this label is a zero or one, indicating whether the formula represented by the bit vector is satisfiable; for MAXSAT, the label is a floating-point number indicating the ratio of the maximum number of clauses satisfiable to the number of clauses in said formula; for SATSOL, the label indicates the number of solutions of said formula.
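To make the hashing concrete, the following Python sketch (not from the paper; the encoding of a clause as a dict of signed literals is an illustrative choice) implements the recurrence above in closed form and builds the corresponding bit vector:

```python
def clause_hash(clause):
    """Hash a clause into 1..3^n - 1 under the recurrence above.

    A clause is encoded as a dict mapping a 1-based variable index to
    True for a positive literal or False for a negative literal,
    e.g. {1: True, 2: True} encodes (x1 v x2).
    """
    h = 0
    for i, positive in clause.items():
        weight = 3 ** (i - 1)              # the offset the recurrence adds for x_i
        h += weight if positive else 2 * weight
    return h

def bit_vector(formula, n):
    """Encode a formula (a list of clauses) as a 3^n - 1 entry bit vector."""
    vec = [0] * (3 ** n - 1)
    for clause in formula:
        vec[clause_hash(clause) - 1] = 1   # ones sit at the clause hash values
    return vec

# The three variable example from the text: (x1 v x2) ^ (~x1 v x3) ^ ~x3
formula = [{1: True, 2: True}, {1: False, 3: True}, {3: False}]
print([clause_hash(c) for c in formula])   # [4, 11, 18]
```

Unrolling the recursion gives the closed form used here: each literal over x_i contributes 3^(i−1) if positive and 2(3^(i−1)) if negative, which reproduces the base cases H(x_1) = 1 and H(¬x_1) = 2.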
2.2 The bit vectors, reduced by PCA

We obtain the PCA representations by multiplying the bit vector by the principal component matrix PC, a p × 26 matrix in the three variable example, where p is the number of principal components and each row represents a principal component. Results below suggest good performance where p is O(n^2).

2.3 Clause count and variable hits

The clause count and variable hit representation for this instance is

3, 1, 1, 1, 1, 0, 1, 1 (CNFSAT),
3, 1, 1, 1, 1, 0, 1, 3/3 (MAXSAT),
3, 1, 1, 1, 1, 0, 1, 1 (SATSOL).

The first entry is the number of clauses, the second to fourth indicate the number of clauses containing the positive literals x_1, x_2, and x_3, respectively, and the fifth to seventh indicate the number of clauses containing the negative literals ¬x_1, ¬x_2, and ¬x_3, respectively. If n is the number of variables, this representation contains 2n + 1, or O(n), entries.
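A corresponding sketch of the O(n) clause count and variable hit features, reusing the hypothetical clause encoding from the previous sketch:

```python
import numpy as np

def clause_count_variable_hits(formula, n):
    """O(n) features: clause count, then positive and negative literal hits."""
    pos = np.zeros(n, dtype=int)
    neg = np.zeros(n, dtype=int)
    for clause in formula:
        for i, positive in clause.items():
            (pos if positive else neg)[i - 1] += 1
    return np.concatenate(([len(formula)], pos, neg))

formula = [{1: True, 2: True}, {1: False, 3: True}, {3: False}]
print(clause_count_variable_hits(formula, 3))   # [3 1 1 1 1 0 1]
```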
3 Collecting the data

As stated earlier, the space of formulas containing up to n variables contains 3^n − 1 possible clauses, assuming no clause contains a variable more than once. To generate data over n variables, we randomly select a number of clauses for a particular formula using an exponential distribution with some user-selected mean representing the average ratio of clauses present, say 0.3 or 0.4. The exponential distribution, defined over a positive domain, guarantees that all clause counts are possible while depressing the formula lengths more toward O(n^2), yielding a 30–35% ratio of positive examples. (In earlier experiments, we applied a uniform distribution; unfortunately, the spread of formula sizes significantly diminished the number of satisfiable formulas generated, frustrating supervised learning. Both distributions exhibit the significant PCA result described in the PCA section of the paper.) Next, we randomly select clauses to add to the formula from the set of 3^n − 1 clauses until the formula is of the specified length. We determine satisfiability, the maximum number of satisfiable clauses, and the number of solutions of each formula using a brute force search of the 2^n possible assignments. Finally, we collect 10,000 data points each from problem spaces in which n is four and five. (2SAT is not NP-complete and contains only 255 formulas; all formulas in 3SAT fit on a typical hard drive, whereas generating all of 4SAT, space notwithstanding, could require 2700 times the eight hours required to generate all of 3SAT. Thus, prediction becomes more essential once n > 3.)
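The generation and labeling procedure might look like the following sketch. Function and parameter names are illustrative, not from the paper, and the reading of the user-selected mean as mean_ratio times the clause pool size is an assumption:

```python
import random
from itertools import product

def all_clauses(n):
    """Enumerate all 3^n - 1 nonempty clauses over n variables."""
    clauses = []
    for signs in product((None, True, False), repeat=n):
        clause = {i + 1: s for i, s in enumerate(signs) if s is not None}
        if clause:
            clauses.append(clause)
    return clauses

def random_formula(n, mean_ratio=0.3):
    """Draw a clause count from an exponential distribution, then sample clauses."""
    pool = all_clauses(n)
    mean = mean_ratio * len(pool)      # assumed meaning of the user-selected mean
    m = max(1, min(len(pool), round(random.expovariate(1 / mean))))
    return random.sample(pool, m)      # distinct clauses, no repeats

def labels(formula, n):
    """Brute force the 2^n assignments for CNFSAT, MAXSAT, and SATSOL labels."""
    best, solutions = 0, 0
    for bits in product((False, True), repeat=n):
        sat = sum(any(bits[i - 1] == pos for i, pos in c.items()) for c in formula)
        best = max(best, sat)
        solutions += (sat == len(formula))
    return int(solutions > 0), best / len(formula), solutions
```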
4 The principal components

Significantly, the first principal component seems always to correspond linearly to the number of clauses in the formula, a value we can calculate easily in polynomial time. Based on empirical results, we postulate that the first principal component of the data set over n variables is approximately the (3^n − 1)-dimensional vector

PC_1 = (±1/√(3^n − 1), …, ±1/√(3^n − 1)).

After subtracting the approximate average vector X_avg = (1/2, …, 1/2) over the data X from X_i, a bit vector representing a formula with m clauses, projecting onto the principal component gives

PC_1 · (X_i − X_avg) = ±(2m − (3^n − 1)) / (2√(3^n − 1)).

Clearly, a simple linear transformation recovers m from this form.

The change in empirical variance before and after subtracting the first PC, which exhibits the eigenvalue for the given eigenvector (principal component), is often considerably larger than the corresponding changes for subsequently calculated PCs. For example, for the data set over four variables, the empirical differences in variances are 74295, 435, 40, 430, and 43 for the first five components. The other data sets exhibit similar behavior, suggesting that PCA interprets the number of clauses to be the most significant feature of a formula. Despite the easy empirical interpretation of the first PC, similar interpretations of higher order PCs are not so easily forthcoming, despite the natural subsumption of lower degree formulas into higher degree formulas (see the hash description above).

The number of PCs necessary to outperform the bit vector representation is surprisingly low. In fact, in the four variable case, PC counts between eight and 20 perform similarly, as exhibited in Illustration 1. Though this graph represents prediction error using linear regression on MAXSAT, decision trees on CNFSAT and linear regression on SATSOL exhibit similar convergence of error among various component counts. In the data presented subsequently, we apply 20 PCs.

Illustration 1: Various PC counts on four variable MAXSAT
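One way to check the postulate empirically is to correlate first principal component scores with clause counts. The sketch below assumes the bit_vector and random_formula helpers from the earlier sketches and uses scikit-learn's PCA as a stand-in for the paper's unspecified implementation:

```python
import numpy as np
from sklearn.decomposition import PCA

n = 4
formulas = [random_formula(n) for _ in range(10000)]
X = np.array([bit_vector(f, n) for f in formulas])
m = X.sum(axis=1)                       # clause count of each formula

pca = PCA(n_components=20).fit(X)
z1 = pca.transform(X)[:, 0]             # first principal component scores

# If PC1 measures the clause count, |corr(z1, m)| should be near 1.
print(np.corrcoef(z1, m)[0, 1])
print(pca.explained_variance_[:5])      # eigenvalue drop-off after PC1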
5 Decision trees

Upon discretizing the dimensionally reduced data sets, we apply decision trees using information gain and reduced error pruning, baselining against the bit vector representation. Illustration 2 exhibits training and validation errors using the bit vector representation on four variable CNFSAT over various training sizes. Illustration 3 exhibits training and validation errors on four variable CNFSAT using the PCA representation. Illustration 4 exhibits training and validation errors on four variable CNFSAT using the clause count and variable hit representation.

Interestingly, preprocessing the 10,000 records of four variable CNFSAT and five variable CNFSAT with PCA reduces the post-prune training and validation error with decision trees by 20%, even with meager training set sizes below 1,000 and a double split discretization per feature. Also of note is that the post-prune node counts remain relatively low for four and five variables using all three representations, as in Illustration 5, suggesting further that decision trees can capture the problem with relatively low model complexity. Finally, the validation error rates using the clause count and variable hit representation are significantly lower than those of either of the other representations, offering better generalization.

Illustration 2: Decision tree on four variable CNFSAT
Illustration 3: Decision tree on four variable CNFSAT, 20 PCs
Illustration 4: Decision tree on four variable CNFSAT using clause counts and variable hits
Illustration 5: Decision tree node counts on five variable CNFSAT
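A rough scikit-learn analogue of this experiment follows, reusing the earlier hypothetical helpers. Note that scikit-learn's entropy criterion gives information-gain splits, but it offers cost-complexity pruning rather than reduced error pruning, so ccp_alpha below only approximates the paper's pruning step:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

n = 4
formulas = [random_formula(n) for _ in range(10000)]
X = np.array([clause_count_variable_hits(f, n) for f in formulas])
y = np.array([labels(f, n)[0] for f in formulas])        # CNFSAT 0/1 label

X_tr, X_va, y_tr, y_va = train_test_split(X, y, test_size=0.3)

# Entropy = information-gain splits; ccp_alpha stands in for pruning.
tree = DecisionTreeClassifier(criterion="entropy", ccp_alpha=1e-3).fit(X_tr, y_tr)
print(1 - tree.score(X_tr, y_tr), 1 - tree.score(X_va, y_va))  # train/val error
print(tree.tree_.node_count)                                   # post-prune size
```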
6 Linear regression, phase transition

We apply linear regression to MAXSAT and SATSOL in both the four and five variable cases over the three representations. As before, the PCA dimensionally compressed representation outperforms the full bit vector representation, and the clause count and variable hit representation outperforms both of the former representations on validation error. Models based on the bit vector representation achieve the lowest training error but fail to generalize as well as the remaining two representations. Errors are square roots of averaged sums of squared errors. Illustration 6 exhibits linear regression errors on five variable MAXSAT with respect to training set size. Illustration 7 exhibits linear regression errors on five variable MAXSAT with respect to formula size. Illustration 8 exhibits linear regression errors on four variable SATSOL with respect to training set size. Illustration 9 exhibits linear regression errors on four variable SATSOL with respect to formula size.

Illustration 6: Linear regression on five variable MAXSAT
Illustration 7: Linear regression on five variable MAXSAT by formula length
Illustration 8: Linear regression on four variable SATSOL by training size
Illustration 9: Linear regression on four variable SATSOL by formula size

Of note in Illustration 7 and Illustration 9 is the empirical phase change discussed for MAXSAT in [4]: at a clause to variable ratio of approximately four, the error rates increase in both MAXSAT and SATSOL over four and five variables. The easy-hard-easy transition is more pronounced in SATSOL. Clearly, supervised linear regression encounters difficulty much like the optimization techniques in [4], irrespective of which of the three representations we apply. This suggests a uniformity of difficulty between isolating single solutions using randomized optimization and predicting labels using supervised learning.
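To reproduce the shape of Illustrations 7 and 9, one could bin validation error by the clause-to-variable ratio, as in this sketch (again reusing the earlier hypothetical helpers; the binning scheme is an assumption):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

n = 5
formulas = [random_formula(n) for _ in range(10000)]
X = np.array([clause_count_variable_hits(f, n) for f in formulas])
y = np.array([labels(f, n)[1] for f in formulas])      # MAXSAT ratio label
ratio = np.array([len(f) / n for f in formulas])       # clauses per variable

model = LinearRegression().fit(X[:7000], y[:7000])
sq_err = (model.predict(X[7000:]) - y[7000:]) ** 2

# RMSE binned by clause-to-variable ratio; a bump near ratio 4 would
# reproduce the phase transition visible in Illustrations 7 and 9.
for lo in range(8):
    mask = (ratio[7000:] >= lo) & (ratio[7000:] < lo + 1)
    if mask.any():
        print(lo, np.sqrt(sq_err[mask].mean()))
```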
7 Conclusions and further work

The results herein demonstrate that the novel approach of treating CNFSAT and its constituent problems as instances of supervised learning could be significant, not just in the reduction of space complexity but also in that this reduction in space complexity corresponds to a reduction in prediction error. That the first principal component essentially represents the number of clauses is significant, and further study might demonstrate similar interpretations of the subsequent principal components and whether this result generalizes to instances containing more than five variables.

The dimensionally reduced representations outperform the full bit vector representation in generalization for both decision trees and linear regression. And even if the models achieved merely comparable generalization error, the PCA dimensionally compressed representation and the clause count and variable hit representation would still be superior, since their respective feature spaces are considerably smaller; that is, if the number of features is essentially polynomial in the number of variables, say O(n^2), then comparable prediction error against the full, O(3^n), representation is significant. The PCA count of 20 realizes O(n^2); the clause count and variable hit representation realizes O(n). This suggests that prediction of satisfiability, the maximum number of satisfiable clauses, and the number of solutions might require much less complexity than that of the bit vector representation.

The phase transition appearing both in randomized optimization of specific instances and in the predictive models discussed herein suggests some uniformity of difficulty across approaches, entreating further study.

Acknowledgments

I would like to acknowledge Professor H. Venkateswaran of the Georgia Institute of Technology for his interest in these topics; he graciously participated in many discussions related to this paper.

References

[1] Cook, S. A. (1971). The complexity of theorem-proving procedures. STOC '71: Proceedings of the Third Annual ACM Symposium on Theory of Computing. ACM: New York, NY, USA.
[2] Schöning, U. (1999). A probabilistic algorithm for k-SAT and constraint satisfaction problems. Proceedings of the 40th Annual Symposium on Foundations of Computer Science. IEEE Computer Society: Washington, DC, USA.
[3] Krentel, M. (1986). The complexity of optimization problems. Journal of Computer and System Sciences; Structure in Complexity Theory Conference: Orlando, FL, USA.
[4] Qasem, M., Prügel-Bennett, A. (2008). Complexity of MAX-SAT using stochastic algorithms. GECCO '08. ACM: New York, NY, USA.
[5] Valiant, L. (1979). The complexity of computing the permanent. Theoretical Computer Science. Elsevier Science Publishers: Essex, UK.
[6] Istrate, G. (1999). The phase transition in random Horn satisfiability and its algorithmic implications. Random Structures & Algorithms. John Wiley & Sons, Inc.: New York, NY, USA.
[7] Istrate, G. (2005). Coarse and sharp thresholds of Boolean constraint satisfaction problems. Discrete Applied Mathematics, special issue: Typical case complexity and phase transitions. Elsevier Science Publishers: Amsterdam, The Netherlands.