COMPARING NEURAL NETWORK ALGORITHM PERFORMANCE USING SPSS AND NEUROSOLUTIONS



Similar documents
Predicting Student Performance by Using Data Mining Methods for Classification

Chapter 6. The stacking ensemble approach

Keywords data mining, prediction techniques, decision making.

Role of Customer Response Models in Customer Solicitation Center s Direct Marketing Campaign

Social Media Mining. Data Mining Essentials

Journal of Information

Predicting the Risk of Heart Attacks using Neural Network and Decision Tree

First Semester Computer Science Students Academic Performances Analysis by Using Data Mining Classification Algorithms

How To Predict Diabetes In A Cost Bucket

Prediction of Stock Performance Using Analytical Techniques

Data Mining Analysis (breast-cancer data)

Benchmarking of different classes of models used for credit scoring

Performance Analysis of Naive Bayes and J48 Classification Algorithm for Data Classification

A Regression Approach for Forecasting Vendor Revenue in Telecommunication Industries

Data Mining. Nonlinear Classification

An Overview of Knowledge Discovery Database and Data mining Techniques

Data Mining Techniques for Prognosis in Pancreatic Cancer

Chapter 12 Discovering New Knowledge Data Mining

Towards better accuracy for Spam predictions

International Journal of Computer Trends and Technology (IJCTT) volume 4 Issue 8 August 2013

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Impact of Feature Selection on the Performance of Wireless Intrusion Detection Systems

DATA MINING TECHNIQUES AND APPLICATIONS

Data Mining - Evaluation of Classifiers

Identifying At-Risk Students Using Machine Learning Techniques: A Case Study with IS 100

Knowledge Discovery and Data Mining

IDENTIFIC ATION OF SOFTWARE EROSION USING LOGISTIC REGRESSION

Using Data Mining for Mobile Communication Clustering and Characterization

Introduction to Data Mining

A Content based Spam Filtering Using Optical Back Propagation Technique

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

Comparison of Data Mining Techniques used for Financial Data Analysis

Machine Learning Algorithms and Predictive Models for Undergraduate Student Retention

Experiments in Web Page Classification for Semantic Web

An exploratory neural network model for predicting disability severity from road traffic accidents in Thailand

FRAUD DETECTION IN ELECTRIC POWER DISTRIBUTION NETWORKS USING AN ANN-BASED KNOWLEDGE-DISCOVERY PROCESS

Denial of Service Attack Detection Using Multivariate Correlation Information and Support Vector Machine Classification

A Property & Casualty Insurance Predictive Modeling Process in SAS

Data quality in Accounting Information Systems

Artificial Neural Network, Decision Tree and Statistical Techniques Applied for Designing and Developing Classifier

Application of Data Mining Techniques to Model Breast Cancer Data

Data Mining Solutions for the Business Environment

Decision Support System on Prediction of Heart Disease Using Data Mining Techniques

Predictive Data modeling for health care: Comparative performance study of different prediction models

Mobile Phone APP Software Browsing Behavior using Clustering Analysis

Football Match Winner Prediction

Impact of Boolean factorization as preprocessing methods for classification of Boolean data

An Analysis of Missing Data Treatment Methods and Their Application to Health Care Dataset

ElegantJ BI. White Paper. The Competitive Advantage of Business Intelligence (BI) Forecasting and Predictive Analysis

REVIEW OF HEART DISEASE PREDICTION SYSTEM USING DATA MINING AND HYBRID INTELLIGENT TECHNIQUES

EFFICIENCY OF DECISION TREES IN PREDICTING STUDENT S ACADEMIC PERFORMANCE

Predicting earning potential on Adult Dataset

An Introduction to Data Mining

Classification algorithm in Data mining: An Overview

A Review of Data Mining Techniques

Data Mining Framework for Direct Marketing: A Case Study of Bank Marketing

Feature Subset Selection in Spam Detection

INTERNATIONAL JOURNAL FOR ENGINEERING APPLICATIONS AND TECHNOLOGY DATA MINING IN HEALTHCARE SECTOR.

TNS EX A MINE BehaviourForecast Predictive Analytics for CRM. TNS Infratest Applied Marketing Science

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

Active Learning SVM for Blogs recommendation

Manjeet Kaur Bhullar, Kiranbir Kaur Department of CSE, GNDU, Amritsar, Punjab, India

EMPIRICAL STUDY ON SELECTION OF TEAM MEMBERS FOR SOFTWARE PROJECTS DATA MINING APPROACH

Application of Predictive Model for Elementary Students with Special Needs in New Era University

ANALYSIS OF FEATURE SELECTION WITH CLASSFICATION: BREAST CANCER DATASETS

Equity forecast: Predicting long term stock price movement using machine learning

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA

Insurance Fraud Detection: MARS versus Neural Networks?

Healthcare Data Mining: Prediction Inpatient Length of Stay

An Efficient Way of Denial of Service Attack Detection Based on Triangle Map Generation

SURVIVABILITY ANALYSIS OF PEDIATRIC LEUKAEMIC PATIENTS USING NEURAL NETWORK APPROACH

Towards applying Data Mining Techniques for Talent Mangement

Cross-Validation. Synonyms Rotation estimation

Evaluation of Feature Selection Methods for Predictive Modeling Using Neural Networks in Credits Scoring

E-commerce Transaction Anomaly Classification

Data Mining Application in Direct Marketing: Identifying Hot Prospects for Banking Product

Knowledge Discovery and Data Mining

A Secured Approach to Credit Card Fraud Detection Using Hidden Markov Model

Easily Identify Your Best Customers

Issues in Information Systems Volume 16, Issue IV, pp , 2015

Predicting outcome of soccer matches using machine learning

Performance Evaluation of Intrusion Detection Systems using ANN

Modeling Suspicious Detection Using Enhanced Feature Selection

A STUDY ON DATA MINING INVESTIGATING ITS METHODS, APPROACHES AND APPLICATIONS

Course Syllabus. Purposes of Course:

testo dello schema Secondo livello Terzo livello Quarto livello Quinto livello

APPLICATION OF DATA MINING TECHNIQUES FOR DIRECT MARKETING. Anatoli Nachev

Choosing the Best Classification Performance Metric for Wrapper-based Software Metric Selection for Defect Prediction

Scalable Machine Learning to Exploit Big Data for Knowledge Discovery

Three types of messages: A, B, C. Assume A is the oldest type, and C is the most recent type.

Genetic Neural Approach for Heart Disease Prediction

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

Pricing Crowdsourcing-based Software Development Tasks

Enhancing Quality of Data using Data Mining Method

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets

Data Mining: A Preprocessing Engine

Classification of Bad Accounts in Credit Card Industry

2 Decision tree + Cross-validation with R (package rpart)

Supervised Learning (Big Data Analytics)

A Comparison of Variable Selection Techniques for Credit Scoring

Transcription:

COMPARING NEURAL NETWORK ALGORITHM PERFORMANCE USING SPSS AND NEUROSOLUTIONS AMJAD HARB and RASHID JAYOUSI Faculty of Computer Science, Al-Quds University, Jerusalem, Palestine Abstract This study exploits the Neural Network data mining algorithm to predict the value of the dependent variable under certain conditions in order to investigate the effect of the dependent variable values distribution on the prediction accuracy performance. The prediction models were designed using two modelling tools, viz., SPSS and NeuroSolutions where the prediction performance of the two tools was compared. The dependent variable values distribution in the training dataset was found to have an impact on the prediction performance of the Neural Network using the two tools. The two tools were found to have a different prediction accuracy percentages under the same conditions except when the dependent variable values distribution ratio was 1:1 the two tools achieved approximately the same results. Keywords: Data Mining, Prediction Model, Neural Network, Modelling Tool. 1. INTRODUCTION An important and challenging area of research nowadays is machine learning. Historical data was analyzed using several ways for hidden knowledge extraction that can help in decision making. This is called Knowledge Discovery or Data Mining. The popular goal from data mining is prediction and the popular data mining technique used for prediction is classification. Classification can be accomplished statistically or by data mining methods [ [9]]. Prediction techniques performance comparison issues is an interesting topic for many researchers. A comparative study by Lahiri R. [ [9]] compared the performance of three statistical and data mining techniques on Motor Vehicle Traffic Crash dataset, resulted that the data information content and dependent attribute distribution is the most affecting factor in prediction performance. Delen D. et al. [ [4]] targeted data mining methods comparison as a second objective in the study, while the main objective was to build the most accurate prediction model in a critical field, breast cancer survivability. In the same area, Artificial Intelligence in Medicine, Bellaachia A. et al. [ [2]] continued the work of [ [4]] and improved the research tools especially the dataset. An important application area that exploited data mining techniques heavily was the network security. Panda M. et al. [ [10]] also performed a comparative study to identify the best data mining technique in predicting network attacks and intrusion detection. Also the data contents and characteristics revealed as an affecting factor on the data mining and prediction algorithms performance. In a previous research [ [5]] we continued on the work of Lahiri R. [ [9]] and performed a performance comparison on the same data mining techniques, viz., Logistic Regression, Neural Network and Decision Tree, also we worked on Lahiri s future work recommendation and tried to identify more precise predictors that significantly define and affect the output. The results of our research showed a small outperformance for the Neural Network over the Logistic Regression and Decision Tree especially when we used different dependent variable values distributions (distribution of 0 s and 1 s) in the dataset. In this research we intended to use the Neural Network algorithm to predict the dependent variable values over different dependent variable values distributions (distribution of 0 s and 1 s) and using two specific modeling tools, viz., SPSS and NeuroSolutions. One objective of this is again to find the effect of the dependent variable values distributions (distribution of 0 s and 1 s) in the dataset using different modeling tools on the Neural Network prediction performance. A second objective is to compare the performance of the two modeling tools in predicting the dependent variable values. The experiment will exploit a historical dataset about the Palestinian Labor Force, the same dataset used in our previous research [ [5]]. The dependent attribute will be the individual s Labor Force Status, that have values: 0 as Employed and 1 as Unemployed. The source of such data is the Palestinian Central Bureau of Statistics (PCBS). This paper is organized as follows: The literature and related work will be discussed in Section 2. The research methodology to perform the experiment will be presented in Section 3. Experimental results are presented and discussed in Section 4. Finally, Conclusion is given in the last Section 5. 2. RELATED WORK Data mining is an important field of research because it can be exploited in many applications especially in the critical applications that data mining proved a high and efficient roles that served these applications in many ways. An example of these applications that benefited of data mining techniques were business and medical purposes. As data mining is a new technology field, it is important and very helpful in predicting and detecting underlying patterns from large volumes of data, many researches were published, comparing results of data 1 1

mining algorithms in several areas. Neural Network is one of the powerful and effective data mining techniques that many researchers studied it s performance and compare it to other statistical and data mining techniques. The applications of data mining were very wide and diverse. For example data mining techniques used heavily in predicting network intrusion detection systems to protect computing resources against unauthorized access. Several studies were performed in this area and some of them addressed the prediction performance comparison of different data mining techniques like the study by Panda M. et al. (2008). A dataset of 10% KDDCup 99 intrusion detection has been generated and used in the experiment. Three popular data mining algorithms had been used in the experiment: Decision Trees ID3, J48 and Naïve Bayes. The prediction performance metrics used in the study were the time taken to build the model and the prediction error rate. For the evaluation of prediction error rate, the 10-fold cross validation test was used. As a result of the experiment, the Decision Trees had proven their efficiency in both generalization and detection of new attacks more than the Naïve Bayes. But this maybe dependence on the contents and characteristics of the data which allows single algorithm to outperform others [ [10]]. Wahbeh A. H. et al. (2011) compared the classification accuracy performance of a number of free data mining tools, viz., WEKA toolkit, Orange, KNIME, and Tanagra. A set of different characteristics and different domains datasets, nine datasets downloaded from the UCI repository [ [14]], were used to compare the classification performance over several classification algorithms, viz., Naïve Bayes (NB) algorithm, K Nearest Neighbor (KNN) algorithm, Support Vector Machine (SVM) algorithm, C4.5 algorithm, One Rule (OneR) classifier, and Zero Rule (ZeroR) classifier. The results showed that the classification performance accuracy percentage of the tools were affected by the kind of dataset used and by the way the classification algorithms were implemented within the toolkits [ [15]]. The effect of Decision Tree, Neural Network, and Logistic Regression models prediction performance comparison for different sample sizes and sampling methods on three sets of data had been investigated by Rochana Lahiri (2006). The study concluded that a very large training dataset is not required to train a Decision Tree or a Neural Network model or even for Logistic Regression models to obtain high classification accuracy and the overall performance reached a steady value at the sample size of 1000, irrespective of the total population size. Another important result was that the sampling method has not affected the classification accuracy of the models. She tried to find the effect of the 0 s and 1 s distribution of dependent variable values in the dataset but because the data was very skewed, she failed to do this. As a future work, the study recommended to apply the same study on a dataset were the relationships between the dependent variable and the independent variables are more rigid [ [9]]. The data mining methods comparison were targeted as a second objective in some studies that mainly aimed to develop a prediction model in a critical fields, like medicine, by investigating several data mining methods, intending to get the model that have the highest prediction accuracy. This type of studies has been addressed by Delen D. et al. (2005) in the context of predicting breast cancer survivability. Multiple prediction models, using Artificial Neural Networks, Decision Trees, and Logistic Regression, for breast cancer survivability using a large dataset had been developed. The comparison among the three models had been conducted depending on measuring three prediction performance metrics: classification accuracy, sensitivity and specificity. The k-fold crossvalidation test was used to minimize the bias associated with the random sampling of the training and missing data. The results of the study showed that the Decision Tree (C5) preformed the best of the three models evaluated. Sensitivity analysis, which provides information about the relative importance of the input variables in predicting the output field, was applied on Neural Network models and provided them with the prioritized importance of the prediction factors used in the study [ [4]]. Another related study in medicine by Bellaachia A. et al. (2006) also in the context of predicting breast cancer survivability. The researchers took the study of Delen D. et al. [ [4]] as the starting point with the same dataset source but with a newer version and different set of data mining techniques. For modeling and comparison, three data mining techniques had been investigated: the Naïve Bayes, the back-propagated Neural Network, and the C4.5 Decision Tree algorithms. The main goal was to have a prediction model with high prediction accuracy, besides high precision and recall metrics for patients' data retrieval. They used other performance metrics: specificity and sensitivity to compare the prediction models. The results presented that C4.5 algorithm has a much better performance than the other two techniques. The obtained results differed from the study of Delen D. et al. [ [4]] due to the facts that they used a newer database (2000 vs. 2002), a different pre-classification (109,659 and 93,273 vs. 35,148 and 116,738) and different toolkits (industrial grade tools vs. Weka) [ [2]]. Amooee G. et al. (2011) used data mining techniques to identify defective parts manufactured in an industrial factory and to maintain high quality products. A data of 1000 records was collected from the factory and 10% (100 records) of the data was about a defective parts. Prediction accuracy and processing time of the prediction techniques were the comparison performance metrics. The results showed that SVM and Logistic regression prediction algorithms has the best processing time with high overall prediction accuracy. The decision tree with its tree different branching algorithms (CRT, CHAID, and QUEST) achieved the highest prediction accuracy rates but needed more time. Neural network achieved the 2 2

least prediction accuracy rate with medium processing time [[1]]. Data mining concept was the most appropriate to the study of student retention from sophomore to junior year than the classical statistical methods. This was one main objective of the study addressed by Ho Yu C. et al. (2010) in addition to another objective that identifying the most affecting predictors in a dataset. The statistical and data mining methods used were classification tree, multivariate adaptive regression splines (MARS), and neural network. The results showed that transferred hours, residency, and ethnicity are crucial factors to retention, which differs from previous studies that found high school GPA to be the most crucial contributor to retention. In Ho Yu C. et al. research, the neural network outperformed the other two techniques [[6]]. The prediction techniques RIPPER, decision tree, neural networks and support vector machine were used to predict cardiovascular disease patients. The performance comparison metrics were the Sensitivity, Specificity, Accuracy, Error Rate, True Positive Rate and False Positive Rate. Kumari M. et al. study showed that support vector machine model outperforms the other models for predicting cardiovascular disease [[8]]. The neural network was found to achieve better performance compared to the performance rates of Naive Bayes, K-NN, and decision tree prediction techniques in a study performed by Shailesh K R et. al. (2011) to predict the inpatient hospital length of stay in a super specialty hospital [[12]]. The same result was seen that the neural network outperformed both the decision tree and linear regression models when the performance for the students academic performance in the undergraduate degree program was measured by predicting the final cumulative grade point average (CGPA) of the students upon graduation. The correlation coefficient analysis was used to identify the relationship of the independent variables with the predictors, Ibrahim Z. et al. (2007) [[7]]. Social network data, using data mining techniques and the prediction error rates were the comparison metric, was studied by Nancy P. et al. (2011). The tree based algorithms such as RndTree, ID3, C-RT, CS- CRT, C4.5, CS-MC4 and the k-nearest neighbor (k- NN) algorithms were used in the study. The RndTree algorithm achieved least error rate and outperforms the other algorithms [[11]]. C. Deepa et al. (2011) compared the prediction accuracy and error rates for the compressive strength of high performance concrete using MLP neural network, Rnd tree models and CRT regression. The results showed that neural network and Rnd tree achieved the higher prediction accuracy rates and Rnd tree outperforms neural network regarding prediction error rates [[3]]. The Rand tree algorithm also outperforms the other algorithms, C4.5, C-RT, CS-MC4, decision list, ID3 and Naïve Bayes, in a study of vehicle collision patterns in road accidents by S.Shanthi et al. (2011). Selection algorithms were used including CFS, FCBF, Feature Ranking, MIFS and MODTree, to improve the prediction accuracy. Feature Ranking algorithm was found the best in improving the prediction accuracy for all algorithms [ [13]]. In a previous research we performed a performance comparison among data mining techniques, viz., Logistic Regression, Neural Network and Decision Tree and tried to identify more precise predictors that significantly define and affect the output. The results of our research showed a small outperformance for the Neural Network over the Logistic Regression and Decision Tree when the dependent variable values distribution ratio was 1:1. The prediction performance of the three prediction methods was almost the same and too close to each other when the dependent variable values distribution ratio were 2:1 and 3:1. The results also showed that if the prediction accuracy for both dependent variable values are requested to be comparable then the training data should not be skewed and the ratio of dependent variable values distribution should be equal and not more than 1:1 [ [5]]. The big question in this situation is: did the modeling tool has an effect on the prediction accuracy?. In this research we will try to answer this question and we have chosen the Neural Network algorithm to predict the dependent variable values over different dependent variable values distributions (distribution of 0 s and 1 s) datasets and using two specific modeling tools, viz., SPSS and NeuroSolutions to hold a prediction performance comparison between the two tools and explore the effect of the dependent variable values distribution on the prediction accuracy of the Neural Network. We have used a Labor Force Data (LFS 2009) in the Palestinian s territories as a training dataset with 38,037 records. For testing, we used the same dataset of Labor Force Data (LFS 2009) and Labor Force Data (LFS 2010). The two datasets was processed and cleaned against missing values and inconsistency. 3. METHODOLOGY To achieve the objectives of this research, we have started data preparation (LFS 2009 and LFS 2010) in a suitable form for the experiment. The data contains the following variables: person s labor status (dependent variable), Sex, Age at last Birthday, Does he currently attending school?, Years of schooling, Educational Attainment (higher qualification), Refugee Status, District, Locality Type, Region, and Marital Status. All these variables are numeric data type and nominal measure, see Table 1. We should mention that for the training of the Neural Network prediction models, using SPSS and NeuroSolutions tools, we used LFS 2009 dataset and for testing we used both LFS 2009 and LFS 2010. TABLE 1: SPECIFICATION OF THE LABOR FORCE DATASET VARIABLES Variable Values Measure 3 3

Sex 1 or 2 Nominal Age at last Birthday 0-100 Nominal Does he currently attending school 1-4 Nominal Years of schooling 0-40 Nominal Educational Attainment (higher 1-10 Nominal qualification) Refugee Status 1-3 Nominal District 16 Category Nominal Locality Type 1-3 Nominal Labor Force Status 0: Employed Nominal 1: Unemployed Region 1 or 2 Nominal Marital Status 1-3 Nominal We used this datasets in another study to compare the prediction accuracy performance among three prediction techniques, viz., Logistic Regression, Neural Network and Decision Tree [ [5]]. In this research we selected the person s labor status as the dependent variable that has only two expected values, 1 as Unemployed and 0 as Employed. Because we need to find the effect of dependent variable values distributions (distribution of 0 s and 1 s) on the prediction percentage agreement, we prepared different copies of the dataset and changed the dependent variable values distributions for three values ratios, see Table 2. TABLE 2: Dependent Variable Values Distribution Ratio 0: Employed 1: Unemployed Total 1:1 9,292 9,292 1,8584 2:1 20,122 9,292 29,414 3:1 28,745 9,292 38,037 The original dataset is the set with ratio 3:1, and from this dataset we extract the other two datasets by reducing the number of records of the Employed persons in a way to have the required ratio. This was done by deriving a stratified sample for each ratio type using District as stratifying variable. In our previous research [ [5]], where we used the same dataset in this study, and because the variables were nominal and not continuous the normality test was failed to find the most related variables to the dependent variable, labor status, so we used the correlation test, Spearman s Bi-variate correlation, which is suitable for this type of data measure. The prediction models were applied on the two types, the correlated variables and the whole variables. The results showed that difference between the whole data results and the correlated data results were too small and negligible because the number of variables in the dataset was too small, so we continued the analysis depending only on the whole variables dataset. In this research we also used the same dataset structure with whole variables as independent variables. The Neural Network prediction models were trained using both tools, viz., SPSS and NeruSolutions. The trained data were the three datasets that have different dependent variable value s distribution (0 Employed : 1 Unemployed) which are: 1:1, 2:1 and 3:1 as seen in Table 2. After the training of the models, we tested them using the data of LFS 2009 and LFS 2010. The PASW Statistics (SPSS Release 18.0.0) from IBM and the NeruSolutions 6.11 from NeuroDimensions Inc. were used in training and testing the Neural Network models to predict the value of person s labor status in a given dataset. The results were recorded to compare the Neural Network prediction performance over different dependent variable value s distribution using two different modeling techniques. 4. EXPERIMENTAL RESULTS For investigating the effect of dependent variable values distribution on the prediction accuracy rate we replicated the dataset into three categories each with different 0 to 1 ratio as in Table 2 in the previous section. The results of the analyses to find the effect of the dependent variable values distribution on prediction accuracy that performed on the three different dataset s ratios exploiting the Neural Network as prediction technique using the two modeling tools, SPSS and NeuroSolutions, have been trained on the LFS 2009 dataset and tested by both the LFS 2009 and LFS 2010 datasets. The prediction accuracy percentage was our performance metric of the Neural Network prediction technique. Table 3 shows summary of the prediction accuracy percentage of the Neural Network for the testing data LFS 2009 and LFS 2010 using SPSS and NeuroSolutions over the three datasets dependent variable value s ratios. TABLE 3: Prediction Accuracy Percentage Of The Neural Network For The Testing Data 2009 And 2010 Using SPSS and NeuroSolutions 2009 2010 Tool SPSS Neuro- Solutions Ratio 0:Employed 1:Unemployed Overall 0:Employed 1:Unemployed Overall 1:1 70.8 69.6 70.5 71.2 69.6 70.4 2:1 90.1 41.7 78.3 90.5 35.5 77.6 3:1 94.9 28.9 78.8 94.4 26.2 78.4 1:1 73.7 66.3 71.9 77.7 53.8 72.1 2:1 65.7 73.9 67.7 69.3 63.9 68.1 3:1 60 75.8 63.9 48.9 79.8 56.2 It is seen that the overall prediction percentage using SPSS in ratio 1:1 were 70.5% for 2009 testing data and 70.4% for the 2010 testing data. The same behavior can be seen for the other two ratios results in SPSS part and the difference between 2009 and 2010 testing data results was less than 1.0%. On the other hand using NeuroSolutions tool on the same testing data, LFS 2009 and LFS 2010, showed approximately the same behavior and the difference between 2009 and 2010 testing data results was less than 1.0% except in one case at ratio 3:1 at year 2010 where the difference was 6.7%. This results suggests that the testing data of LFS 2009 and LFS 2010 using the two modeling tools have comparable overall prediction percentage and we can focus the discussion on any one of them. To compare the modeling tool prediction performance Fig.1 plots two graphs one for each year, 4 4

2009 and 2010 datasets. Each graph plots the overall prediction accuracy percentage over the three dataset s ratios and using the two modeling tools SPSS and NeuroSolutions. It is seen that, in the two graphs, the prediction performance value at ratio 1:1 for the two modeling tools were approximately the same. Then at ratios 2:1 and 3:1 the prediction performance values for SPSS increased significantly forming increasing curve while it is on the contrary for the NeuroSolutions that the prediction performance values decreased significantly forming a decreasing curve. In other words by increasing the dependent variable values distribution ratios the overall prediction accuracy percentage value increases significantly using SPSS and decreasing significantly using NeuroSolutions. Fig. 1: The overall prediction accuracy percentage of the three ratio datasets exploiting the Neural Network prediction techniques using SPSS and NeuroSolutions The previous discussion explored the performance of the two modeling tools in producing overall prediction accuracy percentage, but what about the prediction accuracy percentages for the individual values of the dependent variable?. In other words, What about the effect of the dependent variable values distribution in the training dataset on the prediction performance?. To answer these questions Fig.2 plots three graphs one for each dependent variable values distribution ratios over the testing 2009 dataset. Fig. 2: The prediction accuracy percentages of the modeling tools over the three ratio datasets of 2009 testing data We explored the results of 2009 only because the results of 2010 has approximately the same behavior. It is seen that when the ratio was 1:1 the prediction accuracy percentage for both SPSS and NeuroSolutions were very comparable and ranging between 66% and 72% for predicting the two values of the dependent variable. When the ratio were 2:1 and 3:1 the situation was changed dramatically. Graph 2:1 showed that the SPSS succeeded to predict the dependent variable value 0 =Employed, that has the highest frequency in the dataset, with a high prediction percentage value 5 5

reached more than 90% but at the same time it failed to predict the other value of the dependent variable, that has the lowest frequency in the data, successfully with a high percentage and the percentage value settle around 40%. On the other hand the NeuroSolutions succeeded to predict the two dependent variable values successfully close to each other where the it predicted the dependent variable value 0 =Employed with an accepted prediction percentage value reached more than 65% and at the same time it succeeded to predict the other value of the dependent variable, 1 =Unemployed, successfully with a percentage value around 73% which is very comparable to the result of the first value. This results of SPSS and NeuroSolutions holds also when the ratio was 3:1. Thus in spite of that the SPSS outperformed the NeuroSolutions in predicting dependent variable values with a high overall prediction percentage, but the NeuroSolutions outperformed the SPSS in predicting both the individual values of the dependent variable with an accepted prediction accuracy percentage and a small difference between the two prediction accuracy percentages. 5. CONCLUSION In this paper we exploited the Neural Network as prediction algorithm and worked to achieve the objectives of this research. It is seen in Fig.1 that the SPSS modeling tool outperforms the NeuroSolutions modeling tool producing a higher overall prediction accuracy percentage in predicting the values of the dependent variable when it s distribution ratio in the training data were more than 1:1. In this situation and by increasing the dependent variable values distribution in the training data SPSS achieved an increasing overall prediction accuracy percentage and NeuroSolutions achieved a decreasing overall prediction accuracy percentage. Thus SPSS is more reliable than NeuroSolutions in achieving a higher overall prediction accuracy percentage. As an application for this is the breast cancer diagnosing in women that a high overall prediction accuracy is needed to check if the patient is infected like the studies of Delen D. [ [4]] and Bellaachia A. [ [2]]. On the other hand, as seen in Fig. 2, the NeuroSolutions outperformed the SPSS in predicting both the individual values of the dependent variable with an accepted prediction accuracy percentage for the two values when the dependent variable values distribution was more than 1:1. The difference between the two prediction accuracy percentages of the two dependent variable values was too small using NeuroSolutions while it is too large using SPSS. Consequently NeuroSolutions is more effective than SPSS in predicting the individual dependent variable values. possible application of this result is the study of the effects on the student s academic performance and retention as addressed by Ho Yu C. et al. [ [6]]. The effect of the dependent variable values distributions (distribution of 0 s and 1 s) in the training dataset on the prediction performance was negligible when the ratio is 1:1 for the two modeling tools, SPSS and NeuroSolutions. When the ratio is more than 1:1 NeuroSolutions approximately kept a steady prediction performance state as in 1:1 ratio while SPSS succeeded to predict one value with high prediction accuracy percentage and failed to predict other value successfully. We have two limitations in this study. One of them is the population size, that we investigate a dataset of the Palestinian Labor Force around 40,000 records each year and we haven t any idea about the results and consequences if the records count became hundreds of thousands or even millions. The other limitation was the small number of predictors, inputs or independent variables, that allow us successfully to identify the most related independent variables to the dependent variable and to check whether the results will be different as concluded here or not. Finally, as a future work we recommend to repeat this study on a larger population size of hundreds of thousands or millions records and larger number of predictors. References [1] Amooee G., Minaei-Bidgoli B. and Bagheri- Dehnavi M., A Comparison Between Data Mining Prediction Algorithms for Fault Detection (Case study: Ahanpishegan co.), IJCSI International Journal of Computer Science Issues, Vol. 8, Issue 6, No 3, November 2011. [2] Bellaachia A. and Guven E., Predicting Breast Cancer Survivability Using Data Mining Techniques, Ninth Workshop on Mining Scientific and Engineering Datasets in conjunction with the Sixth SIAM International Conference on Data Mining (SDM 2006). [3] Deepa C., Sathiya Kumari K., and Pream Sudha V., A Tree Based Model for High Performance Concrete Mix Design, International Journal of Engineering Science and Technology, Vol. 2(9), 4640-4646, 2010. [4] Delen D., Walker G., and Kadam A., Predicting breast cancer survivability: a comparison of three data mining methods, Artificial Intelligence in Medicine, 34(2):113-27, 2005. [5] Harb A. and Jayousi R., A Comparative Study of Statistical and Data Mining Algorithms for Prediction Performance, proceedings of the International Conference 6 6

on Information & Communications Technology (ICICT'2012). [6] Ho Yu C., DiGangi S., Jannasch-Pennell A., and Kaprolet C., A Data Mining Approach for Identifying Predictors of Student Retention from Sophomore to Junior Year, Journal of Data Science 307-325, 8(2010). [7] Ibrahim Z. and Rusli D., Predicting Students Academic Performance: Comparing Artificial Neural Network, Decision Tree and Linear Regression, 21 st Annual SAS Malaysia Forum, 5 th September 2007. [8] Kumari M. and Godara S., Comparative Study of Data Mining Classification Methods in Cardiovascular Disease Prediction, International Journal of Computer Science and Technology (IJCST) Vol. 2, Issue 2, June 2011. [9] Lahiri R., Comparison of Data Mining and Statistical Techniques for Classification Model, A Thesis submitted to the graduate faculty of the Louisiana State University in partial fulfilment of the requirements for the degree of Master of Science in The Department of Information Systems & Decision Sciences. (December 2006). [10] M. Panda and M. R. Patra, A comparative study of data mining algorithms for network intrusion detection, proc. of ICETET, India, 2008.pp.504-507. IEEE Xplore. [11] Nancy P. and Geetha Ramani R., A Comparison on Performance of Data Mining Algorithms in Classification of Social Network Data, International Journal of Computer Applications (0975 8887) Volume 32 No.8, October 2011. [12] Shailesh K R et. al., Comparison of Different Data Mining Techniques to Predict Hospital Length of Stay, JPBMS, 7 (15), 2011. [13] Shanthi S. and Geetha Ramani R., Classification of Vehicle Collision Patterns in Road Accidents using Data Mining Algorithms, International Journal of Computer Applications (0975 8887) Volume 35 No.12, December 2011. [14] UCI Machine Learning Repository: http://archive.ics.uci.edu/ml/. [15] Wahbeh A. H, Al-Radaideh Q. A., Al-Kabi M. N., and Al-Shawakfa E. M., A Comparison Study between Data Mining Tools over some Classification Methods, International Journal of Advanced Computer Science and Applications (IJACSA), Special Issue on Artificial Intelligence, September, 2011. 7 7