SVM Ensemble Model for Investment Prediction


Chandra J, Assistant Professor, Department of Computer Science, Christ University, Bangalore
Siji T. Mathew, Research Scholar, Department of Computer Science, Christ University, Bangalore

ABSTRACT

This paper analyses the use of an SVM ensemble for investment services. The objective of this work is to construct a model for investment prediction. Ensemble learning is a machine-learning paradigm in which multiple models are trained to solve the same problem. In this paper, a detailed study of the SVM ensemble is carried out. An insurance dataset obtained from the UCI Knowledge Discovery in Databases Archive was used. An AdaBoost-based multiclassifier SVM ensemble was created and tested with the insurance dataset. In this work, the SVM ensemble produced better accuracy than other ensembles. The knowledge flow of the SVM ensemble model was created using the Weka tool. This model helps the user to predict the best policy for investment.

Keywords

SVM Ensemble, AdaBoost, Multiclassifier, Accuracy, ROC.

1. INTRODUCTION

In this research, an SVM ensemble model that is highly reliable for the investment sector is created. We present a novel method based on SVM ensemble classification. Business intelligence helps companies with decision support. Business intelligence technology gives both historical and current views of business processes, and with this knowledge the decision-making system can produce good results. Different categories of policies were selected from the UCI Knowledge Discovery in Databases Archive.

Data mining is the process of analysing data from different perspectives and summarizing it into useful information. Data are any facts, text or numbers that can be processed. The resulting information is useful for increasing revenue and reducing the costs of the organization. Data mining is also called knowledge discovery from data.
Data mining is the process of finding correlations or knowledge among different fields in large relational databases, where the information is stored and available for mining. Data mining tools allow users to analyze the data from different dimensions. A data mining system draws on data, information and knowledge to extract useful information. Today, data mining is used mainly by companies with a strong consumer focus. It is applied in retail, the financial sector, communication media and marketing organizations. Data mining enables these companies to determine relationships among internal factors such as price, product positioning or staff skills, and external factors such as competition in products, economic indicators and customer demographics. It also lets them determine the impact of these factors on sales, corporate profits, product quality and customer satisfaction. With data mining, a retailer could use point-of-sale records of customer purchases to send targeted promotions based on an individual's purchase history [14]. By mining demographic data collected from various sources, the retailer could develop products and promotions that appeal to specific customer segments.

2. METHODOLOGY

In this research work, the multi-classifier SVM ensemble is created using the AdaBoost.M1 algorithm, C-SVC and the majority voting method.

A. AdaBoost Algorithm

AdaBoost is a machine learning algorithm used for improving the performance of a learning algorithm [18]. Here, AdaBoost is used to boost the accuracy of the Support Vector Machines. Let $D$ be the given dataset with $d$ class-labelled tuples $(X_1, y_1), (X_2, y_2), \dots, (X_d, y_d)$. In the initial step, AdaBoost assigns each training tuple an equal weight of $1/d$. To generate the ensemble, AdaBoost requires $k$ rounds through the rest of the algorithm. In round $i$, the tuples from $D$ are sampled to form a training set $D_i$ of size $d$. Sampling with replacement is used in AdaBoost.
The selection of each sample depends on its weight. A classifier model $M_i$ is derived from the training tuples of $D_i$, and its error is calculated using $D_i$ as the test set. If a tuple was incorrectly classified, its weight is increased. If a tuple was correctly classified, its weight is decreased. These weights are used to generate the classifier in the next round.

B. Support Vector Machines

A Support Vector Machine is a type of classifier that can handle both linear and non-linear data. A classifier takes a feature vector and assigns a class or label to the vector. The goal of the Support Vector Machine is to predict the target value of the data instances in the testing set with better accuracy. The number of elements in the feature vector corresponds to its dimensionality.

The Support Vector Machine consists of a set of related supervised learning methods. During the training stage, the Support Vector Machine finds the maximum-margin hyperplane between different classes. The best hyperplane is selected so that the distance from it to the nearest data point on each side is maximised. Such a hyperplane is called the maximum-margin hyperplane. It is a line in two dimensions, a plane in three dimensions or a hyperplane in higher dimensions that maximises the distance to the nearest data point. Cross-validation is used to reduce the chances of overfitting. The vectors, or data points, that are closest to the hyperplane are called the support vectors.

Let $D$ be the dataset for a linear SVM. It contains a set of points of the form

$D = \{(x_i, y_i) \mid x_i \in \mathbb{R}^p,\ y_i \in \{-1, 1\}\}$

where $y_i$ is either $1$ or $-1$ and indicates the class to which the point $x_i$ belongs. The equation of any hyperplane can be written as

$w \cdot x - b = 0$

where $\cdot$ indicates the dot product, $b$ is a constant and $w$ is the normal vector to the hyperplane. The two hyperplanes that separate the dataset with maximum distance between them can be described as

$w \cdot x - b = 1$ and $w \cdot x - b = -1$.

Support Vector Machines are linear functions of the form $f(x) = w^T x + b$, where $w$ is the weight vector and $x$ is the input vector. (In this form the offset is written with the opposite sign, so $b$ here corresponds to $-b$ in the hyperplane equations above.) Let the set of training examples be $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$, where $x_i$ is an input vector and $y_i$ is its class label, $y_i \in \{1, -1\}$. To find the linear function:

Minimize: $\frac{1}{2} w^T w$

Subject to the constraint: $y_i (w^T x_i + b) \ge 1$, $i = 1, 2, \dots, n$

where the index $i$ runs over the training cases.

The C-SVC classifier with a linear kernel is available in the LIBSVM package for WEKA. C-SVC is used for the experiments. The main features of LIBSVM include different SVM formulations, efficient multi-class classification, cross-validation for model selection, probability estimates, various kernel functions and weighted SVM for unbalanced data.

C. SVM Ensemble

An SVM ensemble classifier is a collection of several SVM classifiers whose individual decisions are combined to classify the test samples [4]. An ensemble shows better performance than the individual classifiers from which it is constructed. SVM ensemble classification includes two levels: construction of the classifier and use of the classifier. The model is constructed from the training set. Each sample in the training set is assumed to belong to a predefined class, as determined by the class attribute label [5]. Using the WEKA knowledge flow and classifier selection, a boosted SVM ensemble is created [10]. The created model is used for further prediction. The latter level involves using the SVM ensemble that was built to predict or classify the output.

The process starts by training an SVM classifier on a less imbalanced subset of the data and then classifying the entire training data set with that SVM to identify the incorrectly classified examples. Training another SVM classifier can reinforce the first classifier [15]. The process is repeated until we obtain an SVM ensemble in which each classifier tries to improve on the performance of the previous one. The trained models are aggregated using majority voting to obtain a collective outcome.

The SVM ensemble was created using C-SVC. SVM is a kernel-based algorithm. A kernel is a function that transforms the input data to a high-dimensional space where the problem is solved. Kernel functions can be linear or non-linear. SVM has built-in mechanisms that automatically choose appropriate settings based on the data. C-SVC is an SVM type available in LIBSVM, which uses the one-against-one approach for multi-class classification and builds $k(k-1)/2$ binary classifiers for $k$ classes. Each binary classifier is trained separately on one pair of classes.
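The two mechanisms described so far, the one-against-one decomposition and the AdaBoost reweighting of training tuples, can be illustrated with a minimal sketch. It is not the WEKA or LIBSVM implementation. The helper names `one_against_one_pairs` and `update_weights` are hypothetical, and the exact reweighting factor is an assumption: the text only says that weights of misclassified tuples go up and weights of correctly classified tuples go down, so the sketch uses the common textbook choice of scaling correct tuples by error / (1 - error) and renormalising.

```python
from itertools import combinations


def one_against_one_pairs(class_labels):
    """Return every unordered pair of classes; k classes give k*(k-1)/2 pairs."""
    return list(combinations(class_labels, 2))


def update_weights(weights, is_correct):
    """One AdaBoost-style reweighting step (hypothetical helper).

    Assumption: correctly classified tuples are scaled by error / (1 - error)
    and all weights are then renormalised to sum to 1, so misclassified
    tuples gain relative weight. The paper does not state the exact factor.
    """
    # Weighted error: total weight of the misclassified tuples.
    error = sum(w for w, ok in zip(weights, is_correct) if not ok)
    if not 0 < error < 0.5:
        raise ValueError("weighted error must lie strictly between 0 and 0.5")
    factor = error / (1 - error)
    scaled = [w * factor if ok else w for w, ok in zip(weights, is_correct)]
    total = sum(scaled)
    return [w / total for w in scaled]


# Three classes, as with the three fire policies, give three binary classifiers.
pairs = one_against_one_pairs(["RE-1", "RE-2", "RE-3"])

# Illustrative data only: d = 4 tuples with equal initial weight 1/d,
# of which the last one was misclassified in this round.
initial = [1 / 4] * 4
updated = update_weights(initial, [True, True, True, False])

print(pairs)
print(updated)
```

After the update, the misclassified tuple carries half of the total weight, so it is more likely to be drawn when the next round samples its training set with replacement.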
The testing stage is one of the most critical phases of any classification model development process, because it gives the model developer the most informative measure of classifier performance. These measures can help to justify the use of the classifier and can point to possible optimisation. The final decision of the ensemble is obtained by combining the individual predictions of the ensemble members. Majority voting is used for the aggregation.

D. Majority voting

Majority voting is a method for combining the outputs of several SVMs. A majority vote counts the votes for each class over the input classifiers and selects the majority class [15]. It is used for aggregating the results from the various ensemble models. Let $f_k$ $(k = 1, 2, \dots, K)$ be the decision function of the $k$-th SVM in the SVM ensemble, and let $C_j$ $(j = 1, 2, \dots, C)$ denote the label of the $j$-th class. Then the number of SVMs whose decision is the $j$-th class is

$N_j = \#\{k \mid f_k(x) = C_j\}$.

The final decision of the SVM ensemble, $f_{mv}(x)$, for a given test vector $x$ under majority voting is determined by

$f_{mv}(x) = \arg\max_j N_j$.

In majority voting, the models are trained and allowed to vote. The class that receives the most votes from the individual models is selected. By aggregating in this way, the accuracy of the SVM ensemble is increased. The final result is obtained and saved automatically into a file. This file is used in the developed investment prediction tool. The accuracy value varies from one policy to another. The policy with the maximum accuracy is taken as the best policy in its category.

3. EXPERIMENTAL DESIGN

A. Dataset Description

This research makes use of a dataset available from the UCI Knowledge Discovery in Databases Archive [20]. Different policies are added in order to make the necessary analysis.

B. Performance Measures

Performance metrics are used to assess how accurately the model predicts the known values. If the model performs well and meets the business requirements, it can then be applied to new data to predict the future. The confusion matrix, accuracy values and similar measures are used to evaluate the performance of a classifier. The accuracy is defined as

$\text{Accuracy} = (TP + TN) / (TP + FP + TN + FN)$

where $TP$, $FP$, $TN$ and $FN$ are the numbers of true positive, false positive, true negative and false negative predictions, respectively. Precision, recall, F-measure and other values are also listed by the WEKA classifier performance evaluator.

4. RESULT ANALYSIS

The SVM ensemble model was created in the WEKA knowledge flow.
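As an illustration of the majority voting rule of Section 2.D and the accuracy measure of Section 3.B, the following minimal sketch computes both in plain Python. It is not the WEKA implementation. The function names and all sample values are hypothetical, and the tie-breaking behaviour is an assumption because the paper does not specify one.

```python
from collections import Counter


def majority_vote(decisions):
    """Return the class label chosen by the most ensemble members.

    Implements f_mv(x) = argmax_j N_j, where N_j is the number of members
    voting for class j. Ties go to the label seen first (Counter keeps
    first-encountered order among equal counts); this is an assumption.
    """
    if not decisions:
        raise ValueError("need at least one decision")
    return Counter(decisions).most_common(1)[0][0]


def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + FP + TN + FN)."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("confusion counts must not all be zero")
    return (tp + tn) / total


# Made-up decisions of three ensemble members for one test vector.
winner = majority_vote(["RE-1", "RE-2", "RE-1"])

# Made-up confusion counts, for illustration only.
acc = accuracy(tp=40, tn=30, fp=20, fn=10)

print(winner, acc)
```

Note the parentheses around TP + TN: without them the division would apply to TN alone and the value would no longer be a proportion.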
The first experiment was done using the fire policy. The knowledge flow for the fire policy is given in Figure 2. There are three different fire policies. The data for each fire policy is loaded using a CSV loader data source. The loaded dataset is passed through a class assigner to a train-test split maker. The class assigner assigns the class label of each policy. The loaded dataset is divided into training data and testing data: 75% of the data is taken as training data and 25% as testing data. The training data is used to build the SVM ensemble classifier. The test data is used to test the built classifier. A multi-classifier SVM ensemble is created. Three C-SVC Support Vector Machines were selected and applied to AdaBoost.M1 to create the SVM ensemble. The kernel type is chosen as linear. This result is applied to the multiclassifier. In the multiclass classifier, the one-against-one method is used.

Figure 2.1. SVM Ensemble Model using Knowledge flow

Figure 4.1. Parameter settings for SVM Ensemble Model

The table below shows the detailed values obtained for the fire policies with the SVM ensemble model and with a single SVM.

Table 1. Single SVM vs SVM ensemble

                RE-1    RE-2    RE-3    F-Measure    ROC-Area
Single SVM
SVM ensemble
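The 75%/25% division into training and testing data described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the behaviour of the WEKA train-test split maker. The function name `train_test_split`, the fixed seed and the placeholder rows are all assumptions.

```python
import random


def train_test_split(rows, train_fraction=0.75, seed=0):
    """Shuffle a copy of the rows and cut it into training and testing parts.

    The seed is fixed only so that the illustration is repeatable.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Placeholder records standing in for the rows of one policy dataset.
rows = list(range(100))
train, test = train_test_split(rows)

print(len(train), len(test))
```

Every row ends up in exactly one of the two parts, so the classifier is never tested on a tuple it was trained on.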

The same policies were tested with a single SVM. From the values obtained after testing, it was found that the SVM ensemble gives better accuracy than a single SVM. The following figure shows that 38% of customers chose the RE-1 policy and 37% chose the RE-2 policy. Only 25% of the customers took the RE-3 policy.

Figure 4.2: Percentage of Policyholders (fire policies: RE-1 38%, RE-2 37%, RE-3 25%)

The best fire policy output is shown in the Investment Predictor Tool. All categories of policies are listed in the tool. To get the best fire policy, select the fire policy from the policy list and click the Get Result button. The analyzed results of the three fire policies are stored while the WEKA knowledge flow model is running. The policy with the maximum accuracy is shown as the best fire policy. The RE-1 policy is selected as the best policy. This model was also created and tested with other investment policies.

Figure 5.1: Best Policy prediction for FIRE

In this paper, the data is collected and analyzed using a multi-classifier SVM. Based on the SVM ensemble classifier, the best policies were predicted for each category. The prediction helps to choose the best policy from a given list. The detailed analysis of the insurance dataset was done using the suggested model, which was loaded in the WEKA knowledge flow. The policy with the maximum accuracy was taken as the best policy in its category. From this research work, it is found that the accuracy of the SVM ensemble is better than that of other ensemble methods and that it gives a better investment prediction for different categories of policies.

5. CONCLUSION

The experiment was carried out for different policies in the same way, and the SVM ensemble was found to perform better than the single classifier. A fast and accurate classifier is used for making investment predictions. The SVM ensemble classifier is well known as a state-of-the-art ensemble classifier for the insurance field. In future, this research work can be extended to develop a prediction tool for the investment field.

REFERENCES

[1] Siji T. Mathew, Chandra J., Estimating the performance of SVM using Personal equity plan, in Proceedings of International Conf. on Mathematics in Engineering & Business Management, pp , March 9-10, 2012.
[2] Siji T. Mathew, SVM Ensemble for Insurance Data Analysis, MPhil Dissertation, Christ University, Jun
[3] Chawla, Nitesh V. (2004), Learning Ensembles from Bites: A Scalable and Accurate Approach, Journal of Machine Learning Research, pp
[4] C. Cortes, V. Vapnik, Support vector network, Machine Learning, pp , 1995.
[5] Fumera, G., Roli, F., A theoretical and experimental analysis of linear combiners for multiple classifier systems, Pattern Analysis and Machine Intelligence, IEEE Transactions, vol. 27, no. 6, pp
[6] Harris Drucker et al., Support vector machines for spam categorization, IEEE Transactions on Neural Networks, 1999.
[7] Kuncheva L. I., Combining Pattern Classifiers: Methods and Algorithms, 2004.
[8] Leon Bottou and Chih-Jen Lin, Support vector machine Solvers, in Large Scale Kernel Machines, Weston editors, MIT Press, Cambridge, MA, pp
[9] Ma Chao, Chen Xihong, A New Algorithm of Support Vector Machine Ensemble and Its Application, Intelligent Human-Machine Systems and Cybernetics (IHMSC), 2nd International Conference, 2010.
[10] Mark Hall et al. (2009), The WEKA Data Mining Software: An Update, SIGKDD Explorations, vol. 11.
[11] Nicolas Garcia-Pedrajas, Constructing Ensembles of Classifiers by Means of Weighted Instance Selection, IEEE Transactions on Neural Networks, vol. 20, no. 2, 2009.
[12] Nikunj C. Oza and Kagan Tumer, Key Real-World Applications of Classifier Ensembles, Information Fusion, Special Issue on Applications of Ensemble Methods, pp. 4 20, 2008.
[13] Petr Hajek, Vladimir Olej, Municipal Revenue Prediction by Support Vector Machine ensemble, ICCOMP 10: Proc. of the 14th WSEAS International Conf. on Computers: part of the 14th WSEAS CSCC Multi Conf., vol. 1, 2010.
[14] Ravi V. H. et al., Soft computing system for bank performance prediction, Elsevier Applied Soft Computing,
[15] Robi Polikar, Ensemble based systems in decision making, IEEE Circuits and Systems Magazine,
[16] Shin, Kim, An application of support vector machine in bankruptcy prediction model, Expert Systems with Applications, vol , 2005.
[17] Bhardwaj M., Gupta T., Grover T., Bhatnagar V., An efficient classifier ensemble using SVM, Methods and Models in Computer Science, ICM2CS, Proceeding of International Conf., vol. 2, 2009.
[18] Yan-Shi Dong, Ke-Song Han, Boosting SVM Classifiers By Ensemble, Special interest tracks and posters of the 14th International Conf. on World Wide Web, May, pp. 10-14, 2005.
[19] Ye Li, Yun Ze cal, et al., Fault diagnosis based on support vector machine ensemble, Machine Learning and Cybernetics, pp , 2005.
[20] Dataset available