Composite Kernel Machines on Kernel Locally Consistent Concept Factorization Space for Data Mining


International Journal of Signal Processing Systems Vol. 2, No. 1, June 2014

Composite Kernel Machines on Kernel Locally Consistent Concept Factorization Space for Data Mining

Shian-Chang Huang, National Changhua University of Education, Department of Business Administration, shhuang@cc.ncue.edu.tw
Lung-Fu Chang, National Taipei College of Business, Department of Finance, lfchang@webmail.ntcb.edu.tw
Tung-Kuang Wu, National Changhua University of Education, Department of Information Management, tkwu@im.ncue.edu.tw

Abstract — This paper proposes a novel approach to the main problems of high-dimensional data mining. We construct a composite kernel machine (CKM) on a special space, the kernel locally consistent concept factorization (KLCCF) space, to address three problems in high-dimensional data mining: the curse of dimensionality, data complexity, and nonlinearity. CKM exploits multiple data sources, with a strong capability to identify the relevant sources and their apposite kernel representations. KLCCF finds a compact representation of the data that uncovers hidden information while respecting the intrinsic geometric structure of the data manifold. The new system robustly overcomes the weaknesses of a plain CKM and outperforms many traditional classification systems.

Index Terms — data mining, multiple kernel learning, kernel locally consistent concept factorization, manifold learning, support vector machine

I. INTRODUCTION

Data sets of high dimensionality pose great challenges for efficient processing to most existing data mining algorithms (Witten and Frank [1]). Mining high-dimensional heterogeneous data is a crucial component of many information applications. Financial data mining has become a popular topic owing to the late-2000s financial crisis. Many techniques have been developed for bankruptcy prediction; popular methods include regression, discriminant analysis, logistic models, factor analysis, decision trees, neural networks, fuzzy logic, genetic algorithms, etc. However, their performance is usually not satisfactory.

A reliable high-dimensional data mining system for financial distress prediction is urgently demanded by banking and investment institutes to control their financial risk. Such institutes have invested heavily in establishing automatic decision support systems for evaluating the credit quality of their borrowers. The objective of this paper is to develop such a system, to prevent banking institutes from investing in a distressed company. In the recent literature, many advanced approaches from data mining and artificial intelligence have been developed to solve the problems mentioned above. These methods (Witten and Frank [1]) include inductive learning, case-based reasoning, neural networks, rough set theory (Ahn et al. [2]), and support vector machines (SVM) (Wu et al. [3]; Hua et al. [4]). SVM, a special form of kernel classifier, has become increasingly popular.

SVM considers the structural risk in system modeling and regularizes the model for good generalization and a sparse representation. SVMs are successful in many applications and outperform typical methods in classification. However, the success of SVM depends on a good choice of model parameters and of the kernel function, that is, the data representation. In kernel methods, the data representation is implicitly chosen through the so-called kernel, which plays two roles: it defines the similarity between two examples, and it defines an appropriate regularization term for the learning problem. The choice of kernel and features is typically handcrafted and fixed in advance. However, hand-tuning kernel parameters can be difficult, as can selecting and combining appropriate sets of features. Recent applications have also shown that using multiple kernels instead of a single one can enhance the interpretability of the decision function and improve performance (Lanckriet et al. [5]). Multiple kernel learning (MKL) addresses this issue by learning the kernel from training data; in particular, it focuses on how the kernel can be learnt as a linear combination of given base kernels.

Manuscript received January 24, 2014; revised April 21. 2014 Engineering and Technology Publishing. doi: /ijsps
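The idea of representing data through a linear combination of base kernels can be sketched as follows. This is a fixed, hand-chosen convex combination rather than a learned MKL weighting, and the toy data, RBF bandwidths, and weights are all illustrative assumptions; scikit-learn's precomputed-kernel interface stands in for a dedicated MKL solver.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics.pairwise import rbf_kernel

# Toy two-class data in 4 dimensions (illustrative only).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (20, 4)), rng.normal(2, 1, (20, 4))])
y = np.array([0] * 20 + [1] * 20)

# Base kernels: RBF kernels at several bandwidths (hypothetical choices).
gammas = [0.1, 1.0, 10.0]
base_kernels = [rbf_kernel(X, X, gamma=g) for g in gammas]

# A convex combination sigma (weights >= 0 summing to 1) defines the
# effective kernel K = sum_m sigma_m K_m.  MKL would learn sigma from
# the training data; here it is fixed by hand for illustration.
sigma = np.array([0.2, 0.5, 0.3])
K_eff = sum(s * K for s, K in zip(sigma, base_kernels))

# An SVM can consume the combined Gram matrix directly.
clf = SVC(kernel="precomputed").fit(K_eff, y)
print(clf.score(K_eff, y))
```

Because each base kernel is positive semidefinite and the weights are non-negative, the combined Gram matrix is itself a valid kernel, which is what makes the convex-combination search space well posed.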

The flat combination of kernels in MKL does not include any mechanism to cluster the kernels related to each source. In order to favor the selection or removal of kernels between or within predefined groups, one has to define a structure among the kernels that will guide the selection process. Composite kernel machines (CKM, Szafranski et al. [6]) address this problem by defining such a structure among kernels, which is particularly well suited to learning from multiple sources. Each source can then be represented by a group of kernels, and the algorithm aims at identifying the relevant sources and their apposite kernel representations.

In financial data mining, high-dimensional data from public financial statements and stock markets can be used for bankruptcy prediction. However, high-dimensional data make kernel classifiers infeasible due to the curse of dimensionality (Bellman [7]). Regarding dimensionality reduction, linear algorithms such as principal component analysis (PCA, Fukunaga [8]) and linear discriminant analysis (LDA, Fukunaga [8]) are the two most widely used methods due to their relative simplicity and effectiveness. However, such classical techniques are designed to operate when the submanifold is embedded linearly, or almost linearly, in the observation space, and they often fail when the nonlinear data structure cannot be regarded as a perturbation of a linear approximation. The task of nonlinear dimensionality reduction (NLDR) is to recover meaningful low-dimensional structures hidden in high-dimensional data. Recently, matrix factorization based techniques, such as non-negative matrix factorization (NMF, Lee and Seung [9]) and concept factorization (CF, Xu and Gong [10]), have yielded impressive results in dimensionality reduction. The non-negativity constraints of NMF allow only additive combinations among the basis vectors, which yields a parts-based representation (Lee and Seung [9]).
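A minimal sketch of plain NMF with scikit-learn follows; the data are synthetic, and note that scikit-learn factorizes a samples-by-features matrix as X ≈ WH, so its W plays the role of the coordinate matrix V above and its H the transposed basis Uᵀ.

```python
import numpy as np
from sklearn.decomposition import NMF

# Synthetic non-negative data matrix X (samples x features).
rng = np.random.default_rng(0)
X = rng.random((30, 10))

# Factorize X ~ W H with W, H >= 0.  The non-negativity constraint
# forces purely additive combinations of the basis rows of H.
model = NMF(n_components=5, init="nndsvda", random_state=0, max_iter=500)
W = model.fit_transform(X)   # low-dimensional coordinates of each sample
H = model.components_        # non-negative basis vectors

# Frobenius reconstruction error of the rank-5 approximation.
err = np.linalg.norm(X - W @ H)
print(err)
```

The five-dimensional rows of W are the reduced representation that a downstream classifier would consume.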
Financial data are probably sampled from a submanifold of the ambient Euclidean space; in fact, financial data cannot fill up the high-dimensional Euclidean space uniformly. Therefore, the intrinsic manifold structure needs to be considered while performing the matrix factorization. The major limitation of NMF is that it is unclear how to effectively perform it in a transformed data space, e.g. a reproducing kernel Hilbert space (RKHS). To remove this limitation while inheriting all of NMF's strengths, Xu and Gong [10] proposed concept factorization (CF); Li and Ding [11] also proposed several interesting variations of NMF. The major advantage of CF over NMF is that it can be performed on any data representation, either in the original space or in an RKHS. However, NMF and CF only concern the global Euclidean geometry; the local manifold geometry is not fully considered. Cai et al. [12] proposed a new version of CF, called locally consistent concept factorization (LCCF), to extract basis vectors that are consistent with the manifold geometry. Central to LCCF is a graph model which captures the local geometry of the data manifold; by using the graph Laplacian to smooth the data mapping, LCCF extracts features with respect to the intrinsic manifold structure. This study employs a kernel version of LCCF (KLCCF) to mine the underlying key features in high-dimensional financial data, and constructs CKMs on the submanifold created by KLCCF. Moreover, we incorporate label information into the graph model used in KLCCF to improve system performance.

The remainder of this paper is organized as follows. Section II describes the CKM classifiers and KLCCF. Section III describes the study data and discusses the empirical findings. Conclusions are given in Section IV.

II. THE PROPOSED METHODOLOGY

To reduce the computational loading of kernel machines and simultaneously enhance their performance, this study constructs CKMs on a nonlinear, graph-based KLCCF space.
A. Composite Multiple Kernel Machines

In multiple kernel learning (MKL), we are provided with M candidate kernels, K_1, ..., K_M, and wish to estimate the parameters of the SVM classifier together with the weights of a convex combination of the kernels that defines the effective kernel K_σ:

  M = { K = Σ_{m=1}^{M} σ_m K_m,  σ_m ≥ 0,  Σ_{m=1}^{M} σ_m = 1 }   (1)

Each kernel K_m is associated with an RKHS H_m whose elements are denoted f_m, and σ = (σ_1, ..., σ_M) is the weighting vector to be learned under the convex combination constraints.

In order to favor the selection/removal of kernels between or within predefined groups, Szafranski et al. [6] improved traditional MKL by considering a tree structure among the kernels, indexing the tree depth by h, with h = 0 for the root and h = 2 for the leaves. The leaf nodes represent the kernels at hand for the classification task; the nodes at depth 1 stand for the group-kernels formed by combining the kernels within each group; the root represents the global effective kernel merging the group-kernels. In the learning process, one would like to suppress the kernels and/or the groups that are irrelevant for the classification task. In the tree representation, this removal process consists in pruning the tree. When a branch is pruned at the leaf level, a single kernel is removed from the combination. When a subtree is pruned, a group-kernel is removed from the combination, and the corresponding group of kernels has no influence on the classifier. The M kernels situated at the leaves are indexed by {1, ..., m, ..., M}, and the group-kernels (at depth 1) are indexed by {1, ..., l, ..., L}. The set G_l, of cardinality d_l, indexes the leaf-kernels belonging to group-kernel l, that is, the children of node l. The groups form a partition of the leaf-kernels, that is, ∪_l G_l = {1, ..., m, ..., M} and Σ_l d_l = M. The CKM of Szafranski et al. [6] is formulated as follows:

  min_{f_1,...,f_M, b, ξ, σ_1, σ_2}  (1/2) Σ_m ‖f_m‖²_{H_m} + C Σ_i ξ_i   (2)
  s.t.  y_i ( Σ_l Σ_{m∈G_l} σ_{1,l} σ_{2,m} f_m(x_i) + b ) ≥ 1 − ξ_i,  i = 1, ..., n
        ξ_i ≥ 0,  i = 1, ..., n
        Σ_l d_l σ_{1,l}^{2/p} ≤ 1,  σ_{1,l} ≥ 0,  l = 1, ..., L
        Σ_m σ_{2,m}^{2/q} ≤ 1,  σ_{2,m} ≥ 0,  m = 1, ..., M

where σ_1 = (σ_{1,1}, ..., σ_{1,L}) and σ_2 = (σ_{2,1}, ..., σ_{2,M}) are weighting vectors, and the hyper-parameters p and q control the sparsity within and between groups. For details of the model, we refer to Szafranski et al. [6].

B. Kernel Locally Consistent Concept Factorization

Recently, non-negative matrix factorization (NMF) has yielded impressive results in dimensionality reduction. In general, given a nonnegative data matrix X, NMF tries to find reduced-rank nonnegative matrices U and V so that X ≈ UVᵀ. The column vectors of U can be thought of as basis vectors, and V contains the coordinates. NMF can only be performed in the original feature space of the data points. When the data are distributed in a highly nonlinear way, it is desirable to kernelize NMF and apply the powerful idea of the kernel method. To achieve this goal, Xu and Gong [10] proposed an extension of NMF called concept factorization (CF). In CF, each basis u_k is required to be a non-negative linear combination of the sample vectors x_j:

  u_k = Σ_{j=1}^{N} x_j h_{jk}   (3)

where h_{jk} ≥ 0. Let H = [h_{jk}]. CF essentially tries to find the approximation X ≈ XHVᵀ through minimization of O = ‖X − XHVᵀ‖²; that is, CF seeks a basis that is optimized for a linear approximation of the data.

Let z_jᵀ = [v_{j1}, ..., v_{jk}] denote the j-th row of V; it can be regarded as the representation of the j-th data point in the new basis. Cai et al. [12], [13] indicated that knowledge of the geometric structure of the data can be exploited for better discovery of this basis. A natural assumption is that if two data points x_i, x_j are close in the intrinsic geometry of the data distribution, then z_i and z_j, the representations of these two points in the new basis, are also close to each other. (Here z_i and z_j are equivalent to V_i and V_j in our formulation.) This assumption is usually referred to as the local consistency assumption, which plays an essential role in developing various kinds of algorithms, including dimensionality reduction algorithms (Belkin and Niyogi [14]) and semi-supervised learning algorithms (Belkin et al. [15]).

The local geometric structure can be effectively modeled through a nearest-neighbor graph on a scatter of data points. Consider a graph with N vertices, where each vertex corresponds to a data point. Define the edge weight matrix W as follows:

  W_{ij} = (x_iᵀ x_j) / (‖x_i‖ ‖x_j‖)  if x_i and x_j belong to the same class;  W_{ij} = 0  otherwise.

Here, prior class-label information is used to define W: the within-class geometric information is emphasized, and the similarity between two samples is set to zero if they belong to different classes. The optimal z minimizes the following objective:

  Σ_{i,j} (z_i − z_j)² W_{ij}   (4)

This objective function incurs a heavy penalty if neighboring vertices i and j are mapped far apart. With some simple algebraic manipulation, we have

  Σ_{i,j} (z_i − z_j)² W_{ij} = 2 zᵀ L z = 2 Tr(Vᵀ L V)   (5)

where L = D − W is the graph Laplacian (Chung [16]) and D is a diagonal matrix whose entries are the column (or row, since W is symmetric) sums of W, D_{ii} = Σ_j W_{ji}. Finally, the minimization problem reduces to finding the minimum of

  O = ‖X − XHVᵀ‖² + λ Tr(Vᵀ L V)   (6)

where λ weights the graph-smoothness term. Define K = XᵀX. We can rewrite the objective function:

  O = Tr((X − XHVᵀ)ᵀ (X − XHVᵀ)) + λ Tr(Vᵀ L V)
    = Tr((I − HVᵀ)ᵀ K (I − HVᵀ)) + λ Tr(Vᵀ L V)
    = Tr(K) − 2 Tr(V Hᵀ K) + Tr(V Hᵀ K H Vᵀ) + λ Tr(Vᵀ L V)   (7)

Next, we nonlinearly extend the formulation to a high-dimensional RKHS:

  u_k = Σ_{j=1}^{N} φ(x_j) h_{jk},  with  K = φ(X)ᵀ φ(X).

KLCCF then essentially tries to find the approximation φ(X) ≈ φ(X) H Vᵀ through minimization of

  O = ‖φ(X) − φ(X) H Vᵀ‖² + λ Tr(Vᵀ L V)   (8)
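To make the graph construction concrete, the following sketch builds the label-aware affinity matrix W, the graph Laplacian L = D − W, and evaluates the LCCF-style objective for arbitrary (not optimized) factors H and V; it also numerically checks the identity relating the pairwise penalty to the Laplacian quadratic form. The cosine form of W, the value of λ, and the data are illustrative assumptions.

```python
import numpy as np

# Toy data with samples as columns of X, matching the X ~ X H V^T convention.
rng = np.random.default_rng(0)
n, d, k = 12, 6, 3
X = rng.random((d, n))
labels = rng.integers(0, 2, size=n)

# Label-aware affinity: cosine similarity for same-class pairs, 0 otherwise.
Xn = X / np.linalg.norm(X, axis=0, keepdims=True)
W = (Xn.T @ Xn) * (labels[:, None] == labels[None, :])
np.fill_diagonal(W, 0.0)

# Graph Laplacian L = D - W, with D the diagonal degree matrix.
D = np.diag(W.sum(axis=1))
L = D - W

# Objective O = ||X - X H V^T||^2 + lambda * Tr(V^T L V)
# for random factors H, V; lambda is the smoothness weight (assumed 0.1).
H = rng.random((n, k))
V = rng.random((n, k))
lam = 0.1
O = np.linalg.norm(X - X @ H @ V.T) ** 2 + lam * np.trace(V.T @ L @ V)
print(O)

# Check sum_ij W_ij (z_i - z_j)^2 = 2 z^T L z on one column of V.
z = V[:, 0]
lhs = sum(W[i, j] * (z[i] - z[j]) ** 2 for i in range(n) for j in range(n))
assert np.isclose(lhs, 2 * z @ L @ z)
```

The assertion holds because W is symmetric: expanding the square gives 2 zᵀDz − 2 zᵀWz = 2 zᵀLz, which is exactly why minimizing the trace term keeps same-class neighbors close in the new coordinates.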
III. EXPERIMENTAL RESULTS AND ANALYSIS

This study used bankrupt companies listed on the Taiwan Stock Exchange (TSE) for analysis; their public financial information is used as the model input. These bankrupt companies were matched with normal

companies for comparison. The sample data covers a period beginning in 2000. For the balance of positive and negative samples, one company in financial crisis was matched with one or two normal companies in the same year and the same industry, running similar business lines; that is, the matched companies should produce the same products as the failed company and have a similar scale of operation. Additionally, the matched normal company's total assets or operating income should be close to the failed company's. In our sample, 50 failed firms and 100 non-failed firms were selected. The study traced the data backward up to 5 years, starting from the day a respective company fell into financial distress. The financial reports of the non-failed companies were matched (pooled together) with those of the failed company in the same year. For example, if company A failed in 2005 and company B failed in another year, we pool them and their matched companies A′, B′ in the same file, labeled C000, representing their financial status in the year of bankruptcy. Companies A and A′ (or companies B and B′) are then traced backward up to five years. These data were put in separate files, labeled C000, C111, C222, C333, and C444 respectively, for classification. The variables of this research are selected from the TEJ (Taiwan Economic Journal) financial database and cover five financial indexes: profitability, per-share rates, growth rates, debt-paying ability, and management ability. Altogether, 54 financial ratios are covered by the five indexes. If values of a ratio were missing for some firms, that ratio was deleted; as a result, 48 financial ratios were obtained for analysis.
This study tested five conventional classifiers and a kernel classifier (SVM) for bankruptcy prediction: a decision tree (J48), nearest neighbors with three neighbors (KNN), logistic regression, Bayesian networks (BayesianNet), a radial basis function network (RBFNetwork), and SVM. For the kernel classifier, this study selected a polynomial kernel of degree two owing to its good performance compared with other kernel types. The data set was randomly divided into ten parts, and ten-fold cross validation was applied to evaluate model performance. Table I shows that SVM outperforms the other classifiers; that is, kernel classifiers outperform traditional classifiers due to their flexibility in dealing with nonlinear and high-dimensional data. Consequently, this study implemented an advanced kernel classifier, the CKM, for the subsequent classifications.

In high-dimensional classification problems, some input variables or features may be irrelevant. Avoiding irrelevant features is important, because they generally deteriorate the performance of a classifier. There are two approaches to this problem: feature subset selection and dimensionality reduction. First, we tried two means of feature selection: chi-squared statistics (χ², Witten and Frank [1]) and information gain (IG, Witten and Frank [1]). After determination of the optimal feature subset, the selected input variables were fed into the six classification algorithms (J48, KNN, BayesianNet, Logistic, RBFNetwork, SVM) for distress prediction. Tables II and III show that χ² and IG slightly improve the performance of most classifiers; however, they significantly deteriorate the performance of RBFNetwork. RBFNetwork is a strong classifier capable of re-scaling feature weights (importance) internally, and external feature selection schemes, which use different criteria to select feature subsets, do not always match its needs.
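The two feature-selection filters can be sketched with scikit-learn as follows; mutual information serves as a stand-in estimator for information gain, and the data are a synthetic substitute for the 48 financial ratios, which are not public.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, mutual_info_classif
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the 48 financial ratios (150 firms).
X, y = make_classification(n_samples=150, n_features=48, n_informative=10,
                           random_state=0)

# chi2 requires non-negative inputs, so the ratios are min-max scaled first.
X_pos = MinMaxScaler().fit_transform(X)
X_chi2 = SelectKBest(chi2, k=10).fit_transform(X_pos, y)

# Information gain is approximated here by mutual information.
X_ig = SelectKBest(mutual_info_classif, k=10).fit_transform(X, y)

print(X_chi2.shape, X_ig.shape)
```

Both filters score each feature against the class label independently, which is why, as noted above, they can conflict with a classifier that already re-weights features internally.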
TABLE I. PERFORMANCE COMPARISON OF BASIC PREDICTION MODELS (ACCURACY %). Rows: J48, KNN, BayesianNet, Logistic, RBFNetwork, SVM. [Accuracy figures not recoverable from the source.]

TABLE II. PERFORMANCE ENHANCEMENTS BY CHI-SQUARED (χ²) STATISTICS. Rows: χ²+J48, χ²+KNN, χ²+BayesianNet, χ²+Logistic, χ²+RBFNetwork, χ²+SVM. [Accuracy figures not recoverable from the source.]

TABLE III. PERFORMANCE ENHANCEMENTS BY INFORMATION GAIN (IG). Rows: IG+J48, IG+KNN, IG+BayesianNet, IG+Logistic, IG+RBFNetwork, IG+SVM. [Accuracy figures not recoverable from the source.]
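The ten-fold evaluation protocol behind Tables I–III can be sketched as follows. Scikit-learn models serve as rough stand-ins for the original classifiers (e.g. a CART tree for J48, naive Bayes for the Bayesian network), and the data are synthetic with the study's 1 failed : 2 non-failed ratio.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

# Synthetic stand-in data: 50 failed vs 100 non-failed firms, 48 ratios.
X, y = make_classification(n_samples=150, n_features=48, n_informative=10,
                           weights=[1 / 3, 2 / 3], random_state=0)

classifiers = {
    "tree (J48 stand-in)": DecisionTreeClassifier(random_state=0),
    "3-NN": KNeighborsClassifier(n_neighbors=3),
    "naive Bayes (BayesianNet stand-in)": GaussianNB(),
    "logistic": LogisticRegression(max_iter=1000),
    "SVM (poly, degree 2)": SVC(kernel="poly", degree=2),
}

# Ten-fold cross validation, as in the experiments.
scores = {name: cross_val_score(clf, X, y, cv=10).mean()
          for name, clf in classifiers.items()}
for name, acc in scores.items():
    print(f"{name}: {acc:.3f}")
```

Averaging over ten folds gives each model one accuracy figure per data file, which is the form the tables report.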

TABLE IV. PERFORMANCE IMPROVEMENTS BY DIMENSIONALITY REDUCTION. Rows: ICA+SVM, PCA+SVM, LDA+SVM, KPCA+SVM, Isomap+SVM, KLCCF+CKM. [Accuracy figures not recoverable from the source.]

TABLE V. AVERAGE PERFORMANCE OF EACH CLASSIFIER. Rows: J48, KNN, BNet, Log, RBFNet, SVM; χ²+J48, χ²+KNN, χ²+BNet, χ²+Log, χ²+RBFNet, χ²+SVM; IG+J48, IG+KNN, IG+BNet, IG+Log, IG+RBFNet, IG+SVM; ICA+SVM, PCA+SVM, LDA+SVM, Isomap+SVM, KPCA+SVM, KLCCF+CKM. [Performance figures not recoverable from the source.] Note: BNet abbreviates Bayesian Network; Log abbreviates Logistic.

Next, we compare our method (CKM on KLCCF) with other dimensionality reduction methods, namely well-known subspace or manifold learning algorithms such as PCA, ICA (independent component analysis, Hyvärinen et al. [17]), LDA, kernel PCA (KPCA), and Isomap (Tenenbaum et al. [18]). The dimension of the subspace was set to five for all algorithms. Table IV shows that CKM on KLCCF significantly outperforms the other classifiers, achieving the highest accuracy. These results demonstrate that financial data are not sampled from a linear manifold; hence, linear algorithms such as PCA, ICA, and LDA fail to extract discriminative information from the data manifold, while graph-based nonlinear manifold learning algorithms such as KLCCF are more effective. On the other hand, our data come from diverse sources, and only multiple kernel machines such as CKM are powerful enough to handle the complex structure of the data. We also find in Table IV that nonlinear dimensionality reduction methods (such as kernel PCA) are not always better than linear algorithms (PCA, ICA, and LDA), since KPCA works in an unsupervised manner and lacks the information needed to guide a mapping that maintains most of the discriminant power. KLCCF, in contrast, is a supervised algorithm that nonlinearly forms a manifold which not only preserves the local geometry of the data samples but also incorporates label information to discriminate the data. Table V displays the average performance of each classifier.
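The dimensionality-reduction-plus-SVM baselines of Table IV can be sketched as pipelines. Only PCA and KPCA are shown, with synthetic data, default SVM settings, and an RBF kernel for KPCA as assumptions; the paper's choice of a five-dimensional subspace is kept.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, KernelPCA
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data for the 48 financial ratios.
X, y = make_classification(n_samples=150, n_features=48, n_informative=10,
                           random_state=0)

# Reduce to a five-dimensional subspace, then classify with an SVM.
for name, reducer in [("PCA", PCA(n_components=5)),
                      ("KPCA", KernelPCA(n_components=5, kernel="rbf"))]:
    pipe = make_pipeline(StandardScaler(), reducer, SVC())
    acc = cross_val_score(pipe, X, y, cv=10).mean()
    print(f"{name}+SVM: {acc:.3f}")
```

Fitting the reducer inside the cross-validation pipeline, rather than on the full data beforehand, avoids leaking test-fold information into the subspace.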
Table V clearly demonstrates the superiority of our new classifier. It substantially outperforms the other dimensionality-reduction-based classifiers, and it also outperforms typical SVM classifiers.

IV. CONCLUSIONS

From a geometric perspective, data are usually sampled from a low-dimensional manifold embedded in a high-dimensional ambient space. KLCCF finds a compact representation that uncovers the hidden information while respecting the intrinsic geometric structure. This study constructed a CKM on KLCCF to create a novel system for bankruptcy prediction. In KLCCF, an affinity graph is constructed to encode the geometric information, and KLCCF seeks a matrix factorization which respects the graph structure. CKM is an excellent framework for exploiting multiple data sources, with a strong capability to identify the relevant sources and their apposite kernel representations. Combining the two techniques makes our hybrid classifier powerful and robust. The empirical results confirmed the superiority of the proposed system: CKM on KLCCF is a robust and reliable framework for high-dimensional data mining. Future research may consider semi-supervised subspace or manifold learning algorithms to enhance system performance, or include more variables, such as non-financial and macroeconomic variables, to improve accuracy.

REFERENCES

[1] H. Witten and E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, Morgan Kaufmann.
[2] B. S. Ahn, S. S. Cho, and C. Y. Kim, "The integrated methodology of rough set theory and artificial neural network for business failure prediction," Expert Systems with Applications, vol. 18, no. 2.
[3] C. H. Wu, W. C. Fang, and Y. J. Goo, "Variable selection method affects SVM-based models in bankruptcy prediction," in Proc. 9th Joint International Conference on Information Sciences.
[4] Z. Hua, Y. Wang, X. Xu, B. Zhang, and L. Liang, "Predicting corporate financial distress based on integration of support vector machine and logistic regression," Expert Systems with Applications, vol. 33, no. 2.
[5] G. R. G. Lanckriet, N. Cristianini, L. E. Ghaoui, P. Bartlett, and M. I. Jordan, "Learning the kernel matrix with semidefinite programming," Journal of Machine Learning Research, vol. 5.
[6] M. Szafranski, Y. Grandvalet, and A. Rakotomamonjy, "Composite kernel learning," Machine Learning, vol. 79.
[7] R. Bellman, Adaptive Control Processes: A Guided Tour, Princeton University Press.
[8] K. Fukunaga, Introduction to Statistical Pattern Recognition, Academic Press.
[9] D. D. Lee and H. S. Seung, "Learning the parts of objects by non-negative matrix factorization," Nature, vol. 401.
[10] W. Xu and Y. Gong, "Document clustering by concept factorization," in Proc. Int. Conf. on Research and Development in Information Retrieval (SIGIR '04), Jul. 2004.

[11] T. Li and C. Ding, "The relationships among various nonnegative matrix factorization methods for clustering," in Proc. IEEE International Conference on Data Mining, 2006.
[12] D. Cai, X. He, and J. Han, "Locally consistent concept factorization for document clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 23, no. 6.
[13] D. Cai, X. He, K. Zhou, J. Han, and H. Bao, "Locality sensitive discriminant analysis," in Proc. International Joint Conference on Artificial Intelligence (IJCAI), Jan. 2007.
[14] M. Belkin and P. Niyogi, "Laplacian eigenmaps and spectral techniques for embedding and clustering," Advances in Neural Information Processing Systems, vol. 14.
[15] M. Belkin, P. Niyogi, and V. Sindhwani, "Manifold regularization: A geometric framework for learning from examples," Journal of Machine Learning Research, vol. 7.
[16] F. R. K. Chung, Spectral Graph Theory, American Mathematical Society.
[17] A. Hyvärinen, J. Karhunen, and E. Oja, Independent Component Analysis, Wiley-Interscience.
[18] J. Tenenbaum, V. de Silva, and J. Langford, "A global geometric framework for nonlinear dimensionality reduction," Science, vol. 290, no. 5500.

Shian-Chang Huang received his MS degree in electrical engineering from National Tsing Hua University and his PhD degree in financial engineering from National Taiwan University. He is currently a professor in the Department of Business Administration, National Changhua University of Education, Taiwan. His research interests include machine learning, soft computing, signal processing, data mining, computational intelligence, and financial engineering.

Lung-Fu Chang received his PhD degree in financial engineering from National Taiwan University. He is currently an assistant professor in the Department of Finance, National Taipei College of Business, Taiwan. His research interests include financial engineering, risk management, and asset pricing.

Tung-Kuang Wu received his PhD degree in computer engineering from the Department of Computer Science & Engineering at Pennsylvania State University. He is currently a professor in the Department of Information Management, National Changhua University of Education, Changhua, Taiwan. His current research interests include parallel processing, wireless networks, special education technologies, and e-learning.


A Simple Introduction to Support Vector Machines A Simple Introduction to Support Vector Machines Martin Law Lecture for CSE 802 Department of Computer Science and Engineering Michigan State University Outline A brief history of SVM Large-margin linear

More information

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

More information

Visualization of General Defined Space Data

Visualization of General Defined Space Data International Journal of Computer Graphics & Animation (IJCGA) Vol.3, No.4, October 013 Visualization of General Defined Space Data John R Rankin La Trobe University, Australia Abstract A new algorithm

More information

Data quality in Accounting Information Systems

Data quality in Accounting Information Systems Data quality in Accounting Information Systems Comparing Several Data Mining Techniques Erjon Zoto Department of Statistics and Applied Informatics Faculty of Economy, University of Tirana Tirana, Albania

More information

Manifold regularized kernel logistic regression for web image annotation

Manifold regularized kernel logistic regression for web image annotation Manifold regularized kernel logistic regression for web image annotation W. Liu 1, H. Liu 1, D.Tao 2*, Y. Wang 1, K. Lu 3 1 China University of Petroleum (East China) 2 *** 3 University of Chinese Academy

More information

Visualization by Linear Projections as Information Retrieval

Visualization by Linear Projections as Information Retrieval Visualization by Linear Projections as Information Retrieval Jaakko Peltonen Helsinki University of Technology, Department of Information and Computer Science, P. O. Box 5400, FI-0015 TKK, Finland jaakko.peltonen@tkk.fi

More information

DATA MINING-BASED PREDICTIVE MODEL TO DETERMINE PROJECT FINANCIAL SUCCESS USING PROJECT DEFINITION PARAMETERS

DATA MINING-BASED PREDICTIVE MODEL TO DETERMINE PROJECT FINANCIAL SUCCESS USING PROJECT DEFINITION PARAMETERS DATA MINING-BASED PREDICTIVE MODEL TO DETERMINE PROJECT FINANCIAL SUCCESS USING PROJECT DEFINITION PARAMETERS Seungtaek Lee, Changmin Kim, Yoora Park, Hyojoo Son, and Changwan Kim* Department of Architecture

More information

DATA MINING TECHNIQUES AND APPLICATIONS

DATA MINING TECHNIQUES AND APPLICATIONS DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,

More information

Geometric-Guided Label Propagation for Moving Object Detection

Geometric-Guided Label Propagation for Moving Object Detection MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Geometric-Guided Label Propagation for Moving Object Detection Kao, J.-Y.; Tian, D.; Mansour, H.; Ortega, A.; Vetro, A. TR2016-005 March 2016

More information

A Survey on Pre-processing and Post-processing Techniques in Data Mining

A Survey on Pre-processing and Post-processing Techniques in Data Mining , pp. 99-128 http://dx.doi.org/10.14257/ijdta.2014.7.4.09 A Survey on Pre-processing and Post-processing Techniques in Data Mining Divya Tomar and Sonali Agarwal Indian Institute of Information Technology,

More information

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining Extend Table Lens for High-Dimensional Data Visualization and Classification Mining CPSC 533c, Information Visualization Course Project, Term 2 2003 Fengdong Du fdu@cs.ubc.ca University of British Columbia

More information

How To Identify A Churner

How To Identify A Churner 2012 45th Hawaii International Conference on System Sciences A New Ensemble Model for Efficient Churn Prediction in Mobile Telecommunication Namhyoung Kim, Jaewook Lee Department of Industrial and Management

More information

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING

EFFICIENT DATA PRE-PROCESSING FOR DATA MINING EFFICIENT DATA PRE-PROCESSING FOR DATA MINING USING NEURAL NETWORKS JothiKumar.R 1, Sivabalan.R.V 2 1 Research scholar, Noorul Islam University, Nagercoil, India Assistant Professor, Adhiparasakthi College

More information

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang Classifying Large Data Sets Using SVMs with Hierarchical Clusters Presented by :Limou Wang Overview SVM Overview Motivation Hierarchical micro-clustering algorithm Clustering-Based SVM (CB-SVM) Experimental

More information

Prediction of Stock Performance Using Analytical Techniques

Prediction of Stock Performance Using Analytical Techniques 136 JOURNAL OF EMERGING TECHNOLOGIES IN WEB INTELLIGENCE, VOL. 5, NO. 2, MAY 2013 Prediction of Stock Performance Using Analytical Techniques Carol Hargreaves Institute of Systems Science National University

More information

Machine Learning in FX Carry Basket Prediction

Machine Learning in FX Carry Basket Prediction Machine Learning in FX Carry Basket Prediction Tristan Fletcher, Fabian Redpath and Joe D Alessandro Abstract Artificial Neural Networks ANN), Support Vector Machines SVM) and Relevance Vector Machines

More information

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with

More information

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning. Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott

More information

Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier

Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier D.Nithya a, *, V.Suganya b,1, R.Saranya Irudaya Mary c,1 Abstract - This paper presents,

More information

Data Mining: A Preprocessing Engine

Data Mining: A Preprocessing Engine Journal of Computer Science 2 (9): 735-739, 2006 ISSN 1549-3636 2005 Science Publications Data Mining: A Preprocessing Engine Luai Al Shalabi, Zyad Shaaban and Basel Kasasbeh Applied Science University,

More information

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal

Learning Example. Machine learning and our focus. Another Example. An example: data (loan application) The data and the goal Learning Example Chapter 18: Learning from Examples 22c:145 An emergency room in a hospital measures 17 variables (e.g., blood pressure, age, etc) of newly admitted patients. A decision is needed: whether

More information

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM

TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM TOWARDS SIMPLE, EASY TO UNDERSTAND, AN INTERACTIVE DECISION TREE ALGORITHM Thanh-Nghi Do College of Information Technology, Cantho University 1 Ly Tu Trong Street, Ninh Kieu District Cantho City, Vietnam

More information

Multiple Kernel Learning on the Limit Order Book

Multiple Kernel Learning on the Limit Order Book JMLR: Workshop and Conference Proceedings 11 (2010) 167 174 Workshop on Applications of Pattern Analysis Multiple Kernel Learning on the Limit Order Book Tristan Fletcher Zakria Hussain John Shawe-Taylor

More information

ADVANCED MACHINE LEARNING. Introduction

ADVANCED MACHINE LEARNING. Introduction 1 1 Introduction Lecturer: Prof. Aude Billard (aude.billard@epfl.ch) Teaching Assistants: Guillaume de Chambrier, Nadia Figueroa, Denys Lamotte, Nicola Sommer 2 2 Course Format Alternate between: Lectures

More information

PREDICTING STOCK PRICES USING DATA MINING TECHNIQUES

PREDICTING STOCK PRICES USING DATA MINING TECHNIQUES The International Arab Conference on Information Technology (ACIT 2013) PREDICTING STOCK PRICES USING DATA MINING TECHNIQUES 1 QASEM A. AL-RADAIDEH, 2 ADEL ABU ASSAF 3 EMAN ALNAGI 1 Department of Computer

More information

Comparison of K-means and Backpropagation Data Mining Algorithms

Comparison of K-means and Backpropagation Data Mining Algorithms Comparison of K-means and Backpropagation Data Mining Algorithms Nitu Mathuriya, Dr. Ashish Bansal Abstract Data mining has got more and more mature as a field of basic research in computer science and

More information

Duality of linear conic problems

Duality of linear conic problems Duality of linear conic problems Alexander Shapiro and Arkadi Nemirovski Abstract It is well known that the optimal values of a linear programming problem and its dual are equal to each other if at least

More information

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches PhD Thesis by Payam Birjandi Director: Prof. Mihai Datcu Problematic

More information

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING AAS 07-228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations

More information

Unsupervised Data Mining (Clustering)

Unsupervised Data Mining (Clustering) Unsupervised Data Mining (Clustering) Javier Béjar KEMLG December 01 Javier Béjar (KEMLG) Unsupervised Data Mining (Clustering) December 01 1 / 51 Introduction Clustering in KDD One of the main tasks in

More information

1 Spectral Methods for Dimensionality

1 Spectral Methods for Dimensionality 1 Spectral Methods for Dimensionality Reduction Lawrence K. Saul Kilian Q. Weinberger Fei Sha Jihun Ham Daniel D. Lee How can we search for low dimensional structure in high dimensional data? If the data

More information

Towards better accuracy for Spam predictions

Towards better accuracy for Spam predictions Towards better accuracy for Spam predictions Chengyan Zhao Department of Computer Science University of Toronto Toronto, Ontario, Canada M5S 2E4 czhao@cs.toronto.edu Abstract Spam identification is crucial

More information

SYMMETRIC EIGENFACES MILI I. SHAH

SYMMETRIC EIGENFACES MILI I. SHAH SYMMETRIC EIGENFACES MILI I. SHAH Abstract. Over the years, mathematicians and computer scientists have produced an extensive body of work in the area of facial analysis. Several facial analysis algorithms

More information

Maximum Margin Clustering

Maximum Margin Clustering Maximum Margin Clustering Linli Xu James Neufeld Bryce Larson Dale Schuurmans University of Waterloo University of Alberta Abstract We propose a new method for clustering based on finding maximum margin

More information

Pattern Recognition Using Feature Based Die-Map Clusteringin the Semiconductor Manufacturing Process

Pattern Recognition Using Feature Based Die-Map Clusteringin the Semiconductor Manufacturing Process Pattern Recognition Using Feature Based Die-Map Clusteringin the Semiconductor Manufacturing Process Seung Hwan Park, Cheng-Sool Park, Jun Seok Kim, Youngji Yoo, Daewoong An, Jun-Geol Baek Abstract Depending

More information

Selection of the Suitable Parameter Value for ISOMAP

Selection of the Suitable Parameter Value for ISOMAP 1034 JOURNAL OF SOFTWARE, VOL. 6, NO. 6, JUNE 2011 Selection of the Suitable Parameter Value for ISOMAP Li Jing and Chao Shao School of Computer and Information Engineering, Henan University of Economics

More information

Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics

Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics Part I: Factorizations and Statistical Modeling/Inference Amnon Shashua School of Computer Science & Eng. The Hebrew University

More information

Data Mining for Knowledge Management. Classification

Data Mining for Knowledge Management. Classification 1 Data Mining for Knowledge Management Classification Themis Palpanas University of Trento http://disi.unitn.eu/~themis Data Mining for Knowledge Management 1 Thanks for slides to: Jiawei Han Eamonn Keogh

More information

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Henrik Boström School of Humanities and Informatics University of Skövde P.O. Box 408, SE-541 28 Skövde

More information

The Data Mining Process

The Data Mining Process Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data

More information

Predict Influencers in the Social Network

Predict Influencers in the Social Network Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons

More information

COMPARISON OF OBJECT BASED AND PIXEL BASED CLASSIFICATION OF HIGH RESOLUTION SATELLITE IMAGES USING ARTIFICIAL NEURAL NETWORKS

COMPARISON OF OBJECT BASED AND PIXEL BASED CLASSIFICATION OF HIGH RESOLUTION SATELLITE IMAGES USING ARTIFICIAL NEURAL NETWORKS COMPARISON OF OBJECT BASED AND PIXEL BASED CLASSIFICATION OF HIGH RESOLUTION SATELLITE IMAGES USING ARTIFICIAL NEURAL NETWORKS B.K. Mohan and S. N. Ladha Centre for Studies in Resources Engineering IIT

More information

1816 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 15, NO. 7, JULY 2006. Principal Components Null Space Analysis for Image and Video Classification

1816 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 15, NO. 7, JULY 2006. Principal Components Null Space Analysis for Image and Video Classification 1816 IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 15, NO. 7, JULY 2006 Principal Components Null Space Analysis for Image and Video Classification Namrata Vaswani, Member, IEEE, and Rama Chellappa, Fellow,

More information

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S.

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S. AUTOMATION OF ENERGY DEMAND FORECASTING by Sanzad Siddique, B.S. A Thesis submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment of the Requirements for the Degree

More information

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014 RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer

More information

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet

More information

Combining SOM and GA-CBR for Flow Time Prediction in Semiconductor Manufacturing Factory

Combining SOM and GA-CBR for Flow Time Prediction in Semiconductor Manufacturing Factory Combining SOM and GA-CBR for Flow Time Prediction in Semiconductor Manufacturing Factory Pei-Chann Chang 12, Yen-Wen Wang 3, Chen-Hao Liu 2 1 Department of Information Management, Yuan-Ze University, 2

More information

The Scientific Data Mining Process

The Scientific Data Mining Process Chapter 4 The Scientific Data Mining Process When I use a word, Humpty Dumpty said, in rather a scornful tone, it means just what I choose it to mean neither more nor less. Lewis Carroll [87, p. 214] In

More information

Big Text Data Clustering using Class Labels and Semantic Feature Based on Hadoop of Cloud Computing

Big Text Data Clustering using Class Labels and Semantic Feature Based on Hadoop of Cloud Computing Vol.8, No.4 (214), pp.1-1 http://dx.doi.org/1.14257/ijseia.214.8.4.1 Big Text Data Clustering using Class Labels and Semantic Feature Based on Hadoop of Cloud Computing Yong-Il Kim 1, Yoo-Kang Ji 2 and

More information

Reference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors

Reference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors Classification k-nearest neighbors Data Mining Dr. Engin YILDIZTEPE Reference Books Han, J., Kamber, M., Pei, J., (2011). Data Mining: Concepts and Techniques. Third edition. San Francisco: Morgan Kaufmann

More information

Several Views of Support Vector Machines

Several Views of Support Vector Machines Several Views of Support Vector Machines Ryan M. Rifkin Honda Research Institute USA, Inc. Human Intention Understanding Group 2007 Tikhonov Regularization We are considering algorithms of the form min

More information

Soft Clustering with Projections: PCA, ICA, and Laplacian

Soft Clustering with Projections: PCA, ICA, and Laplacian 1 Soft Clustering with Projections: PCA, ICA, and Laplacian David Gleich and Leonid Zhukov Abstract In this paper we present a comparison of three projection methods that use the eigenvectors of a matrix

More information

New Ensemble Combination Scheme

New Ensemble Combination Scheme New Ensemble Combination Scheme Namhyoung Kim, Youngdoo Son, and Jaewook Lee, Member, IEEE Abstract Recently many statistical learning techniques are successfully developed and used in several areas However,

More information

Acknowledgments. Data Mining with Regression. Data Mining Context. Overview. Colleagues

Acknowledgments. Data Mining with Regression. Data Mining Context. Overview. Colleagues Data Mining with Regression Teaching an old dog some new tricks Acknowledgments Colleagues Dean Foster in Statistics Lyle Ungar in Computer Science Bob Stine Department of Statistics The School of the

More information

Machine Learning CS 6830. Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu

Machine Learning CS 6830. Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Machine Learning CS 6830 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu What is Learning? Merriam-Webster: learn = to acquire knowledge, understanding, or skill

More information

Evaluation of Feature Selection Methods for Predictive Modeling Using Neural Networks in Credits Scoring

Evaluation of Feature Selection Methods for Predictive Modeling Using Neural Networks in Credits Scoring 714 Evaluation of Feature election Methods for Predictive Modeling Using Neural Networks in Credits coring Raghavendra B. K. Dr. M.G.R. Educational and Research Institute, Chennai-95 Email: raghavendra_bk@rediffmail.com

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

More information

Chapter 12 Discovering New Knowledge Data Mining

Chapter 12 Discovering New Knowledge Data Mining Chapter 12 Discovering New Knowledge Data Mining Becerra-Fernandez, et al. -- Knowledge Management 1/e -- 2004 Prentice Hall Additional material 2007 Dekai Wu Chapter Objectives Introduce the student to

More information

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)

ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) Gabriela Ochoa http://www.cs.stir.ac.uk/~goc/ OUTLINE Preliminaries Classification and Clustering Applications

More information

Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier

Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing E-mail Classifier International Journal of Recent Technology and Engineering (IJRTE) ISSN: 2277-3878, Volume-1, Issue-6, January 2013 Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing

More information

Class-specific Sparse Coding for Learning of Object Representations

Class-specific Sparse Coding for Learning of Object Representations Class-specific Sparse Coding for Learning of Object Representations Stephan Hasler, Heiko Wersing, and Edgar Körner Honda Research Institute Europe GmbH Carl-Legien-Str. 30, 63073 Offenbach am Main, Germany

More information

Ranking on Data Manifolds

Ranking on Data Manifolds Ranking on Data Manifolds Dengyong Zhou, Jason Weston, Arthur Gretton, Olivier Bousquet, and Bernhard Schölkopf Max Planck Institute for Biological Cybernetics, 72076 Tuebingen, Germany {firstname.secondname

More information

Assessment. Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall

Assessment. Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall Automatic Photo Quality Assessment Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall Estimating i the photorealism of images: Distinguishing i i paintings from photographs h Florin

More information

International Journal of Advance Research in Computer Science and Management Studies

International Journal of Advance Research in Computer Science and Management Studies Volume 2, Issue 12, December 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

More information

Data, Measurements, Features

Data, Measurements, Features Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are

More information

Tree based ensemble models regularization by convex optimization

Tree based ensemble models regularization by convex optimization Tree based ensemble models regularization by convex optimization Bertrand Cornélusse, Pierre Geurts and Louis Wehenkel Department of Electrical Engineering and Computer Science University of Liège B-4000

More information

Supervised and unsupervised learning - 1

Supervised and unsupervised learning - 1 Chapter 3 Supervised and unsupervised learning - 1 3.1 Introduction The science of learning plays a key role in the field of statistics, data mining, artificial intelligence, intersecting with areas in

More information

Constrained Classification of Large Imbalanced Data by Logistic Regression and Genetic Algorithm

Constrained Classification of Large Imbalanced Data by Logistic Regression and Genetic Algorithm Constrained Classification of Large Imbalanced Data by Logistic Regression and Genetic Algorithm Martin Hlosta, Rostislav Stríž, Jan Kupčík, Jaroslav Zendulka, and Tomáš Hruška A. Imbalanced Data Classification

More information

TOWARD BIG DATA ANALYSIS WORKSHOP

TOWARD BIG DATA ANALYSIS WORKSHOP TOWARD BIG DATA ANALYSIS WORKSHOP 邁 向 巨 量 資 料 分 析 研 討 會 摘 要 集 2015.06.05-06 巨 量 資 料 之 矩 陣 視 覺 化 陳 君 厚 中 央 研 究 院 統 計 科 學 研 究 所 摘 要 視 覺 化 (Visualization) 與 探 索 式 資 料 分 析 (Exploratory Data Analysis, EDA)

More information

A Health Degree Evaluation Algorithm for Equipment Based on Fuzzy Sets and the Improved SVM

A Health Degree Evaluation Algorithm for Equipment Based on Fuzzy Sets and the Improved SVM Journal of Computational Information Systems 10: 17 (2014) 7629 7635 Available at http://www.jofcis.com A Health Degree Evaluation Algorithm for Equipment Based on Fuzzy Sets and the Improved SVM Tian

More information

DATA PREPARATION FOR DATA MINING

DATA PREPARATION FOR DATA MINING Applied Artificial Intelligence, 17:375 381, 2003 Copyright # 2003 Taylor & Francis 0883-9514/03 $12.00 +.00 DOI: 10.1080/08839510390219264 u DATA PREPARATION FOR DATA MINING SHICHAO ZHANG and CHENGQI

More information

Meta-learning. Synonyms. Definition. Characteristics

Meta-learning. Synonyms. Definition. Characteristics Meta-learning Włodzisław Duch, Department of Informatics, Nicolaus Copernicus University, Poland, School of Computer Engineering, Nanyang Technological University, Singapore wduch@is.umk.pl (or search

More information

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence Artificial Neural Networks and Support Vector Machines CS 486/686: Introduction to Artificial Intelligence 1 Outline What is a Neural Network? - Perceptron learners - Multi-layer networks What is a Support

More information

Hong Kong Stock Index Forecasting

Hong Kong Stock Index Forecasting Hong Kong Stock Index Forecasting Tong Fu Shuo Chen Chuanqi Wei tfu1@stanford.edu cslcb@stanford.edu chuanqi@stanford.edu Abstract Prediction of the movement of stock market is a long-time attractive topic

More information

EMPIRICAL STUDY ON SELECTION OF TEAM MEMBERS FOR SOFTWARE PROJECTS DATA MINING APPROACH

EMPIRICAL STUDY ON SELECTION OF TEAM MEMBERS FOR SOFTWARE PROJECTS DATA MINING APPROACH EMPIRICAL STUDY ON SELECTION OF TEAM MEMBERS FOR SOFTWARE PROJECTS DATA MINING APPROACH SANGITA GUPTA 1, SUMA. V. 2 1 Jain University, Bangalore 2 Dayanada Sagar Institute, Bangalore, India Abstract- One

More information

Unsupervised and supervised dimension reduction: Algorithms and connections

Unsupervised and supervised dimension reduction: Algorithms and connections Unsupervised and supervised dimension reduction: Algorithms and connections Jieping Ye Department of Computer Science and Engineering Evolutionary Functional Genomics Center The Biodesign Institute Arizona

More information

Less naive Bayes spam detection

Less naive Bayes spam detection Less naive Bayes spam detection Hongming Yang Eindhoven University of Technology Dept. EE, Rm PT 3.27, P.O.Box 53, 5600MB Eindhoven The Netherlands. E-mail:h.m.yang@tue.nl also CoSiNe Connectivity Systems

More information

Machine Learning in Computer Vision A Tutorial. Ajay Joshi, Anoop Cherian and Ravishankar Shivalingam Dept. of Computer Science, UMN

Machine Learning in Computer Vision A Tutorial. Ajay Joshi, Anoop Cherian and Ravishankar Shivalingam Dept. of Computer Science, UMN Machine Learning in Computer Vision A Tutorial Ajay Joshi, Anoop Cherian and Ravishankar Shivalingam Dept. of Computer Science, UMN Outline Introduction Supervised Learning Unsupervised Learning Semi-Supervised

More information

Towards applying Data Mining Techniques for Talent Mangement

Towards applying Data Mining Techniques for Talent Mangement 2009 International Conference on Computer Engineering and Applications IPCSIT vol.2 (2011) (2011) IACSIT Press, Singapore Towards applying Data Mining Techniques for Talent Mangement Hamidah Jantan 1,

More information