Journal of Modern Accounting and Auditing, ISSN 1548-6583
December 2011, Vol. 7, No. 12, 1362-1367

Data Mining in Financial Application

G. Cenk Akkaya
Dokuz Eylül University, Turkey

Ceren Uzar
Mugla University, Turkey

Data mining enables us to build forecasts and models of the future by making use of past data. Any method that helps to discover data can be used as a data mining method. Enterprises gain an important competitive advantage through data mining methods. Data mining is used in many different fields. In finance, it is used especially in portfolio management, fraud detection, payment prediction, loan risk analysis, mortgage scoring, detection of transaction manipulation, financial risk management, customer profiling, and the foreign exchange market. Gaining knowledge can be costly, risky, and time consuming for enterprises. Thus, enterprises today use data mining as an innovative competitive means. The aim of this study is to determine the importance of data mining in financial applications.

Keywords: data mining, data mining methods, prediction

Introduction

Data mining (DM) is defined as the process of discovering patterns in data. The process must be automatic or (more usually) semiautomatic (Witten & Frank, 2005, p. 5). DM may also be regarded as a natural stage in the development of information technologies. Large-scale data may be regarded as a data mine comprising valuable data within large-scale databases in different areas. DM is also defined as the process of producing substantive, previously unknown information from these data (Albayrak & Yılmaz, 2009, p. 32). The purpose of data mining is to create decision-making models for estimating future behavior based on the analysis of past activities (Koyuncugil & Özgülbaş, 2009, p. 24).
At this point, it is a means of supporting the decision process on the way to a solution, rather than a solution in itself, and it provides the information required for solving the problem. Data mining assists the analyst in finding the patterns and relations among the data produced during the work (Akgöbek & Çakır, 2009, p. 802). Therefore, data mining can be used for a variety of purposes in the private sector. Industries such as banking, insurance, and medicine commonly use data mining to reduce costs, enhance research, and increase sales. Data analysis techniques that have traditionally been used for such tasks include regression analysis, cluster analysis, numerical taxonomy, multidimensional analysis, other multivariate statistical methods, stochastic models, time series analysis, nonlinear estimation techniques, and others (Sumathi & Sivanandam, 2006, p. 24).

G. Cenk Akkaya, Ph.D., associate professor, Faculty of Economics and Administrative Sciences, Dokuz Eylül University.
Ceren Uzar, lecturer, Vocational School of Higher Education, Mugla University.

Building Data Models

Data mining builds models from data, using tools that vary both by the type of model built and, within each model domain, by the type of algorithm used. As Table 1 shows, at the highest level this taxonomy of data mining separates models into two classes: predictive and descriptive (Gerritsen, 1999, p. 17).

Table 1
Data Mining Taxonomy

Predictive models      Descriptive models
Classification         Clustering
Regression             Association

One of the most common learning tasks in data mining is classification. Classification is a supervised way of learning. The database contains one or more attributes that denote the class of a tuple; these are known as predicted attributes, whereas the remaining attributes are called predicting attributes (Sumathi & Sivanandam, 2006, p. 577). The popular algorithms of supervised learning include decision trees, artificial neural networks, and so on (Tsai, Lu, & Hsu, 2011, p. 131).

When the data are unlabelled and each instance does not have a given class label, the learning task is called unsupervised. If we still want to identify which instances belong together, that is, form natural clusters of instances, a clustering algorithm can be applied (Olafsson, Li, & Wu, 2008, p. 1439). Clustering techniques can be used to identify stable dependencies for risk management and investment management (Zhang & Zhou, 2004, p. 514).

Another unsupervised learning approach is association rule discovery, which aims to discover interesting correlations or other relationships among the attributes. Association rule mining was originally used for market basket analysis, where the items are articles in the customer's shopping cart and the supermarket manager is looking for associations among these purchases (Olafsson et al., 2008, p. 1441).

Suggestions for Implementation of Data Mining in Financial Application

Data mining is able to uncover hidden patterns and predict future trends and behaviors in financial markets.
It creates opportunities for companies to make proactive and knowledge-driven decisions in order to gain a competitive advantage (Zhang & Zhou, 2004, p. 513). In this respect, fraud detection, credit risk prediction, the forecasting accuracy of foreign exchange models, portfolio management, and financial failure prediction are illustrated below through related studies.

Fraud Detection

Fraud is increasing dramatically with the expansion of modern technology and the global superhighways of communication, resulting in the loss of billions of dollars worldwide each year. Although prevention technologies are the best way of reducing fraud, fraudsters are adaptive and, given time, will usually find ways to circumvent such measures (Bhargava & Manik, 2011, p. 4). Fraud detection studies have now become widespread. Fraud detection must be addressed across all industries, especially in sectors that are more vulnerable because many transactions are made, such as health care, retail, credit card services, and telecommunications (Koyuncugil & Özgülbaş, 2009, p. 234).

Credit card fraud detection has drawn on a number of techniques, with special emphasis on data mining. On this subject, the study conducted by Stolfo, Fan, Lee, and Prodromidis (1997) may be summarized as follows. The first step was the collection of data. The members of the Financial Services Technology Consortium provided data from real credit card transactions. In total, 500,000 records were obtained. Each record has 30 fields
and a total of 137 bytes. The data were already labeled by the banks as fraud or non-fraud. Among the 500,000 records, 20% are fraud transactions. The data were sampled from a 12-month period. The aim of the study is to compute a classifier that predicts whether a transaction is fraud or non-fraud. Accordingly, the training data were sampled from earlier months, and the validation data (for meta-learning) and the testing data were sampled from later months. Data from months 1 to 10 are used for training, data from month 11 for validation in meta-learning, and data from month 12 for testing. Then ID3 (Iterative Dichotomiser 3), CART (Classification and Regression Tree), BAYES, and RIPPER (Repeated Incremental Pruning to Produce Error Reduction) are each applied to every partition described above.

Stolfo et al. (1997) found that a 50%/50% distribution of fraud/non-fraud training data generated classifiers with the highest true positive rate (fraud catching rate) and a low false positive rate (false alarm rate). Meta-learning with BAYES as the meta-learner, combining the base classifiers with the highest true positive rates learned from the 50%/50% fraud distribution, was the best method found.

Predicting Credit Risk

With the rapid growth of the credit industry, credit scoring models have been used extensively for credit admission evaluation. In the last two decades, several quantitative methods have been developed for the credit admission decision (Huang, Chen, & Wang, 2007, p. 847). This problem, most commonly referred to as credit scoring, is an attempt to classify applicants for credit into either good or bad risk classes. It is to be distinguished from behavioral or performance scoring, in which the repayment behavior of the customer is tracked to make predictions once the credit has already been granted (Vellido, Lisboa, & Vaughan, 1999, p. 55).
On this subject, the study conducted by Kotsiantis (2007) may be summarized as follows. Kotsiantis (2007) used two public credit datasets: the Credit-a dataset and the Credit-g dataset.

(1) In the Credit-a dataset, each of the 690 cases represents an application for credit card facilities described by eight discrete and six continuous attributes (sex, age, mean time at addresses, home status, current occupation, current job status, mean time with employers, other investments, bank account, time with bank, liability reference, account reference, monthly housing expense, savings account balance), with two decision classes (accept/reject). In the UCI repository (University of California at Irvine), the original attribute names have been changed to meaningless symbols (A1-A14) in order to protect the confidentiality of the data; the repository also distinguishes between discrete (nominal) and continuously valued attributes.

(2) The German Credit dataset (Credit-g) contains observations on 20 variables (checking account status, duration of credit in months, credit history, purpose of credit, credit amount, average balance in savings account, present employment, installment rate as percentage of disposable income, personal status, other parties, present residence since (years), property magnitude, age in years, other payment plans, housing, number of existing credits at this bank, nature of job, number of people for whom liable to provide maintenance, applicant has phone in his or her name, foreign worker) for 1,000 past applicants for credit. Each applicant was rated as good credit (700 cases) or bad credit (300 cases).

The K2 algorithm, the C4.5 algorithm, the 3NN algorithm, the BP algorithm, the Ripper algorithm, the Sequential Minimal Optimization (SMO) algorithm, and the LR algorithm were applied. Using these methods, the evaluation samples of the Credit-a and Credit-g datasets were classified into good credit and bad credit cases.
These algorithms achieved different rates of correctly classified individuals.
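
As a rough illustration of what a "rate of correctly classified individuals" means for one of the listed methods, the sketch below implements a 3-nearest-neighbour (3NN) classifier in plain Python and measures its accuracy on a small held-out sample. The applicant records, the two features, and the labels are invented for illustration and are not taken from the Credit-a or Credit-g datasets. The Euclidean distance and the single train/test split are also assumptions; the summary above does not state which distance measure or evaluation protocol Kotsiantis (2007) used.

```python
from collections import Counter
from math import dist


def knn_predict(train, query, k=3):
    """Return the majority label among the k training points closest to query."""
    neighbours = sorted(train, key=lambda row: dist(row[0], query))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k=3):
    """Share of test cases whose predicted label matches the true label."""
    hits = sum(knn_predict(train, features, k) == label for features, label in test)
    return hits / len(test)


# Hypothetical applicants: (years with employer, savings in thousands) -> label.
train = [
    ((8.0, 12.0), "good"), ((6.5, 9.0), "good"), ((7.0, 15.0), "good"),
    ((1.0, 0.5), "bad"), ((0.5, 1.0), "bad"), ((2.0, 0.2), "bad"),
]
test = [((7.5, 10.0), "good"), ((1.5, 0.8), "bad"), ((6.0, 11.0), "good")]

rate = accuracy(train, test)
print(f"correctly classified: {rate:.0%}")
```

On real credit data the features would normally be rescaled first, since an unscaled Euclidean distance lets the attribute with the largest numeric range dominate the neighbour search.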
Then Kotsiantis (2007) compared the results of the proposed methodology, SelVot (selecting voting rule), on the Credit-a and Credit-g datasets with those of the Voting algorithm, the Best CV algorithm (B-cross validation), the Grading algorithm, and the Stacking algorithm. Kotsiantis (2007) found that the proposed selective voting methodology achieves better performance than any of the simple and ensemble methods examined.

Neural Network Foreign Exchange Model Forecasting Accuracy

Based on an early model of human brain function, neural networks do classification and regression equally well. More complex than other techniques, neural networks have often been described as a black box technology (Gerritsen, 1999, p. 18). In this respect, the dominant data mining technique used so far for foreign exchange market prediction is neural network modeling.

On this subject, the study conducted by Walczak (2001) may be summarized as follows. First, neural network models were developed to forecast the spot rates of the U.S. dollar against the British pound, German mark, and Japanese yen. Historic spot rate values for each of these currency exchange rates were collected from March 1, 1973 through June 30, 1995. This sample covered the period from the adoption of a floating exchange rate policy until the near present. Eleven different neural network models were applied to each currency, resulting in 33 different neural network forecasting models for the three foreign exchange rates. Walczak (2001) found that the dollar/yen neural network models are the only ones that have a similar (identical) performance result between the full 22-year training set and either of the 1- or 2-year training set models.
Modifications to the training data set alone are responsible for an 11.2% change in forecasting performance for the dollar/pound neural networks (comparing the 2-year training performance against the worst-case performance of larger training sets), a 6.4% change for the dollar/mark networks, and an 18.4% change in the dollar/yen neural network forecasting accuracy.

Portfolio Management

Portfolio management is a major issue in investment. Its goal is to choose a set of risky assets to form a portfolio that maximizes the return under a given risk or minimizes the risk for obtaining a given return (Hung, Liang, & Liu, 1996, p. 301). With data mining and optimization techniques, investors are able to allocate capital across trading activities to maximize profit or minimize risk. This feature supports the ability to generate trade recommendations and portfolio structuring from user-supplied profit and risk requirements (Dass, 2006, p. 6). In this respect, data mining has been applied to portfolio management, investment risk analysis, and so on.

On this subject, the study conducted by Hung et al. (1996) may be summarized as follows. The study includes three major components: APT (arbitrage pricing theory), ANN (artificial neural network), and a portfolio constructor. The data were collected from the Taiwan Stock Exchange. Fifty-one stocks were selected from among the more than 300 traded stocks. The data used in this study are from November 25 to October 23, 1993. The daily return of each stock during the period was obtained from the econometrics programming system (EPS) database maintained by the Ministry of Education. In the next phase, the whole time period was divided into 12 sets of training and evaluation periods.
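
One way to picture a division into 12 sets of training and evaluation periods is a sliding window over the weekly observations. The sketch below is only an assumption about how such a split could be arranged: it takes the 103 weekly returns and the four-week evaluation horizon mentioned in this summary and derives a fixed training length from them so that the 12 evaluation periods fit exactly. The function name `rolling_splits` and the window arithmetic are hypothetical; the window lengths Hung et al. (1996) actually used are not stated here.

```python
def rolling_splits(n_obs, n_sets, eval_len):
    """Yield (train_range, eval_range) index ranges for a fixed-length sliding window.

    The training length is whatever remains once n_sets evaluation periods of
    eval_len observations each are reserved (an assumption of this sketch).
    """
    train_len = n_obs - n_sets * eval_len
    if train_len <= 0:
        raise ValueError("not enough observations for the requested splits")
    for i in range(n_sets):
        start = i * eval_len
        train = range(start, start + train_len)
        evaluation = range(start + train_len, start + train_len + eval_len)
        yield train, evaluation


# 103 weekly observations, 12 sets, 4-week evaluation periods (assumed layout).
splits = list(rolling_splits(n_obs=103, n_sets=12, eval_len=4))
for train, evaluation in splits[:2]:
    print(f"train weeks {train.start}-{train.stop - 1}, "
          f"evaluate weeks {evaluation.start}-{evaluation.stop - 1}")
```

Each evaluation period starts immediately after its training window ends, so no model is ever evaluated on weeks it was fitted on.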
They first applied factor analysis to identify the key risk factors. Then multiple regression analysis was used to price the assets. ANN and ARIMA (Autoregressive Integrated Moving Average) methods were used to predict the future trend of each risk factor. One hundred and three weekly return observations were used at this step to shorten the model building time at a slight cost in precision. IANN and IARIMA models were built. Finally, candidate portfolios were built based on the risk and the predicted return of each stock. The optimal portfolio was then selected from the candidates, and its performance in the following four weeks was evaluated.

Predicting Corporate Failure

Predicting corporate failure is a critical task for government officials, stockholders, managers, employees, investors, and researchers, especially in today's competitive economic environment. Corporate failure prediction is basically a dichotomous decision: failure or not. Most statistical and AI methods estimate the probability of corporate failure.

On this subject, the study conducted by Nguyen (2005) may be summarized as follows. Data were collected from the Corporate Scorecard Group's (CSG) databases in Australia for the period 1988-2002. Two sets of data were collected. The first dataset, called the training or sample dataset, consists of 32 Australian companies. The second dataset consists of 200 companies that are coded by the CSG, with no additional default information revealed. To build the models, a set of financial ratios that have successfully predicted firm failure was selected, covering liquidity, profitability, solvency, and financial leverage. Three models were applied: a probabilistic neural network, a multi-layer neural network, and logistic regression. ANGOSS Knowledge Studio software was used to test the predictive power of these models.
In the training process, the probabilistic neural network model correctly predicts 93.75% of cases, the multi-layer neural network model correctly predicts 95.45%, and the logistic regression model correctly predicts 100%, both one year and two years before failure. Nguyen (2005) found that all three models are good at predicting the probability of corporate failure. Moreover, the probabilistic neural network model outperforms the others.

Conclusions

Data mining can be described as the disclosure of hidden value and usable information within a large amount of data. The purpose of data mining is to create decision-making models for future predictions based on the analysis of past activities. Data mining can be applied in almost every setting in which data are produced intensively and, consequently, databases are created. It can be used in the field of finance, particularly for signature and bank note verification, mortgage scoring, country risk rating, credit risk analysis, foreign exchange rate forecasting, portfolio management, bond rating and trading, early determination of financial failure, detection of financial information manipulation and insider trading, preventing law application in initial public offerings, determining market anomalies, analysis of investor risk perception depending on education, income level, gender and similar factors, analysis of the level of integration in financial markets, analysis of the extent to which capital raised in initial public offerings is transferred to real investments,
prediction of bear and bull periods in the stock exchanges, creating a risk (volatility) index for markets, company evaluation and equity price depth analysis of SMEs in the entrepreneur capital market, and forecasting of abnormal stock exchange returns.

In this study, information on the concept of data mining and on data mining methods has been presented. Furthermore, suggestions for applying data mining in capital markets have been described by means of related studies.

References

Akgöbek, Ö., & Çakır, F. (2009). Expert system design with data mining. Proceedings from: Akademic Information Conference (pp. 801-806). Harran University, Şanlıurfa.

Albayrak, A. S., & Yılmaz, Ş. K. (2009). Data mining: Decision tree algorithms and an application on ISE data. The Journal of Faculty of Economics and Administrative Sciences, 14(1), 31-52.

Bhargava, N., & Manik, G. (2011). Application of artificial neural networks in business applications. Retrieved from www.nikhilbhargava.com/papers/lahore.pdf

Dass, R. (2006). Data mining in banking and finance: A note for bankers. Retrieved from http://citeseerx.ist.psu.edu/viewdoc/summary?doi=?doi=10.1.1.102.7502

Gerritsen, R. (1999). Assessing loan risks: A data mining case study. IT Professional, 1, 16-21.

Huang, C. L., Chen, M. C., & Wang, C. J. (2007). Credit scoring with a data mining approach based on support vector machines. Expert Systems With Applications, 33, 847-856.

Hung, S. Y., Liang, T. P., & Liu, C. W. (1996). Integrating arbitrage pricing theory and artificial neural networks to support portfolio management. Decision Support Systems, 18(3/4), 301-316.

Kotsiantis, S. (2007). Credit risk analysis using a hybrid data mining model. International Journal of Intelligent Systems Technologies and Applications, 2(4), 345-356.

Koyuncugil, A. S., & Özgülbaş, N. (2009). Data mining: Using and applications in medicine and healthcare. Journal of Information Technology, 2(2), 21-32.

Nguyen, H. G. (2005). Using neutral network in predicting corporate failure. Journal of Social Sciences, 1(4), 199-202.

Olafsson, S., Li, X. N., & Wu, S. N. (2008). Operations research and data mining. European Journal of Operational Research, 187(3), 1429-1448.

Stolfo, S. J., Fan, W. D., Lee, W., & Prodromidis, A. (1997). Credit card fraud detection using meta-learning: Issues and initial results. Proceedings from: AAAI-97 Workshop on AI Methods in Fraud and Risk Management (pp. 83-90). Menlo Park.

Sumathi, S., & Sivanandam, S. N. (2006). Introduction to data mining and its applications (Vol. 29). Berlin Heidelberg: Springer.

Tsai, C. F., Lu, Y. H., & Hsu, Y. F. (2011). Bankruptcy prediction by supervised machine learning techniques: A comparative study. In A. S. Koyuncugil, & N. Özgülbaş (Eds.), Surveillance technologies and early warning systems (pp. 128-143). Hershey, New York: Information Science Reference.

Vellido, A., Lisboa, P. J. G., & Vaughan, J. (1999). Neural networks in business: A survey of applications (1992-1998). Expert Systems With Applications, 17, 51-70.

Walczak, S. (2001). An empirical analysis of data requirements for financial forecasting with neural networks. Journal of Management Information Systems, 17(4), 203-222.

Witten, I. H., & Frank, E. (2005). Data mining: Practical machine learning tools and techniques (2nd ed.). San Francisco: Morgan Kaufmann Publishers.

Zhang, D. S., & Zhou, L. N. (2004). Discovering golden nuggets: Data mining in financial application. IEEE Transactions on Systems, Man, and Cybernetics Part C: Applications and Reviews, 34(4), 513-522.