Using Incidence and Mortality Cancer Statistics Dataset for Cancer Control Organizations

Shivam Sidhu 1, Upendra Kumar Meena 2, Narina Thakur 3

1,2 Department of CSE, Student, Bharati Vidyapeeth's College of Engineering, New Delhi 110 063, India.
3 Department of CSE, Faculty of Technical Education, Bharati Vidyapeeth's College of Engineering, New Delhi 110 063, India.
e-mail: 1 shivam2040sidhu@gmail.com

Abstract. The ultimate goal of data mining is prediction, and predictive data mining is the most common type of data mining and the one with the most direct business applications. This paper discusses how data mining can help in predicting cancer counts from cancer statistics datasets, with neural networks as the prediction method. A neural network is an adaptive system that changes its structure during the learning phase and continuously refines its predictive behavior. The dataset is taken from the NPCR and NVSS government bodies. The precision of the tool shows significant promise for use as a benchmark by cancer societies. Various cancer control organizations can also utilize this tool for taking vital decisions on investment and for designing new strategies and policies to reduce cancer incidence and mortality.

Keywords: Data mining, NPCR, NVSS, Clustering, Neural network.

1. Introduction

In this section we aim to provide an overview of the paper and the dataset used. Section 2 discusses various data mining techniques: association, clustering, classification and prediction. Section 3 discusses artificial neural networks, including the concept of a hidden layer of neurons. The backpropagation mechanism is explained in Section 4. The implementation methodology is described in Section 5, covering the dataset, data preprocessing, training in MATLAB and how the prediction was done. Sections 6 and 7 present the analysis of results and the conclusion respectively.
The final section lists the references. The three most important stages that this paper specifically discusses are the processing of the dataset, followed by training, and finally the prediction of future values. In the processing of the dataset, we first assigned numeric values to each of the entries, then normalized one set of values and converted the Excel file into .csv format. After the processing stage, training and prediction were performed on the dataset using the MATLAB Neural Network Tool (nntool) with the Feed-Forward Backprop network type, followed by simulation of the network. Finally, we created a GUI using the GUIDE tool of MATLAB.

1.1 United States Cancer Statistics (USCS)

The dataset has 23 attributes and 671,641 records in total. It has been collected from several sources according to three different parameters, which are described below:

Incidence Data: For cancer incidence, the primary source of data is medical records. Staff at the health care facilities abstract data from the medical records of all patients, enter it into the facility's own cancer registry if it has one, and then transmit the data to a regional or state registry.

Corresponding author. Elsevier Publications 2013.
Mortality Data: Cancer mortality data is based on information from all death certificates filed in the 50 states and the District of Columbia and processed by the National Vital Statistics System (NVSS).

Population Denominator Data: The population estimates used as denominators of incidence and death rates are race-, ethnicity- and sex-specific county population estimates aggregated to the state or metropolitan-area level.

2. Data Mining

Data mining refers to the non-trivial extraction of implicit, previously unknown and potentially useful information from data in databases.

2.1 Association

In data mining, association rule learning is a popular and well-researched method for discovering interesting relations between variables in large databases. Based on the concept of strong rules, association rules were introduced for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. For example, the rule {milk, bread} => {butter} found in the sales data of a supermarket would indicate that if a customer buys milk and bread together, they are likely to buy butter as well. Today, association rules are employed in many application areas, including web usage mining and intrusion detection.

2.2 Clustering

At first the notion of a cluster seems obvious: a group of data objects. However, the notion varies between algorithms, and choosing an appropriate definition is one of the many decisions to take when selecting an algorithm for a specific problem. The clusters determined by different algorithms vary significantly in their properties. A clustering is essentially a set of clusters containing all objects in the data set.

2.3 Classification

Classification is a data mining function that assigns items in a collection to target categories or classes. The aim of classification is to accurately predict the target class for each case in the data.
A classification [1] task begins with a data set in which the class assignments are known. For example, a classification model that predicts credit risk could be developed from observed data on many loan applicants over a significant period of time.

2.4 Predictive Data Mining

The term predictive data mining is usually applied to data mining projects whose goal is to identify a statistical or neural network model, or set of models, that can be used to predict some response of interest. For example, a credit [3] card company may engage in predictive data mining to derive a (trained) model or set of models that can quickly identify transactions with a high probability of being fraudulent. Other data mining projects may be more exploratory in nature (e.g., identifying clusters or segments of customers). Data reduction is another possible objective (e.g., aggregating the information in very large data sets into useful and manageable chunks).

3. Neural Networks

Artificial neural networks (ANNs) are computational tools modeled on the interconnection of neurons in the nervous systems of the human brain and other organisms [4]. An ANN employs basic atomic units known as neurons. ANNs are a type of non-linear processing system ideally suited to a wide range of tasks, especially those for which no existing algorithm solves the problem. ANNs have various applications. They can be trained to solve certain problems using a teaching method and sample data, so identically constructed ANNs can be used to perform different tasks depending on the training received. If properly trained, ANNs are capable of generalization, with the ability to recognize similarities among different input patterns, including patterns that have been corrupted by noise.
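To make the layered structure described above concrete, here is a minimal Python sketch of one layer of sigmoid neurons. This is an illustration only, not the paper's MATLAB implementation; the inputs, weights and biases are arbitrary example values.

```python
import math

def sigmoid(x):
    # Logistic activation: squashes the weighted sum into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def neuron_layer(inputs, weights, biases):
    # Each neuron computes sigmoid(w . x + b); the layer returns one
    # activation per neuron
    return [sigmoid(sum(w * x for w, x in zip(ws, inputs)) + b)
            for ws, b in zip(weights, biases)]

# Two inputs feeding a layer of three neurons (illustrative weights)
activations = neuron_layer([0.5, -1.0],
                           [[0.2, 0.8], [-0.5, 0.3], [1.0, 1.0]],
                           [0.1, 0.0, -0.2])
print(activations)
```

Stacking such layers, with the outputs of one layer feeding the next, yields the input-hidden-output arrangement shown in Figure 1.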
Figure 1. A layer of neurons (input, hidden and output layers).

4. Backpropagation Mechanism

4.1 Backpropagation algorithm

The backpropagation algorithm is based on the generalized delta rule. Every iteration of training involves the following steps:

1) A particular training case is fed through the network in the forward direction, producing results at the output layer.

2) Based on the known target information, the error is determined at the output nodes, and this error calculation determines the required changes to the weights that lead into the output layer.

3) The changes to the weights leading into the preceding network layers are determined as a function of the properties of the neurons to which they directly connect (weight changes are calculated as a function of the errors determined for all following layers, working backward toward the input layer) until all necessary weight changes are calculated for the entire network.

The calculated weight changes are then implemented throughout the network, the subsequent iteration begins, and the entire procedure is repeated using the next training pattern.

5. Implementation

5.1 Data set

The dataset, obtained online, comprises a large collection of attributes: area, event type, site, sex, etc. It contains cancer cases from regions all around the United States, ranging from Alabama to Wyoming. The main parameters are as follows. EVENT TYPE categorizes data into incidence or mortality. SITE indicates which part of the human body contains the cancer cells, ranging from the brain to the urinary bladder. SEX can be male or female.

Figure 2. Cancer ranking by state, including all cancer sites, male and female, 1999-2009. Rates are per 100,000 persons and are age-adjusted to the 2000 U.S. standard population (19 age groups, Census P25 1130).
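The three backpropagation steps of Section 4 can be sketched end to end. The following Python snippet is a hand-rolled illustration under the generalized delta rule, not the MATLAB code the paper used; the network shape, weights, inputs, target and learning rate are all arbitrary example values.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Tiny illustrative network: 2 inputs -> 2 hidden neurons -> 1 output.
w_hidden = [[0.15, 0.20], [0.25, 0.30]]  # one row of weights per hidden neuron
w_output = [0.40, 0.45]                  # weights into the single output neuron
lr = 0.5                                 # learning rate

def train_step(x, target):
    # Step 1: forward pass through hidden and output layers
    h = [sigmoid(sum(w * xi for w, xi in zip(ws, x))) for ws in w_hidden]
    o = sigmoid(sum(w * hi for w, hi in zip(w_output, h)))

    # Step 2: error at the output node drives the weight changes into it
    delta_o = (target - o) * o * (1.0 - o)

    # Step 3: errors propagate backward to the preceding layer,
    # then all weight changes are applied throughout the network
    delta_h = [hi * (1.0 - hi) * w_out * delta_o
               for hi, w_out in zip(h, w_output)]
    for i, hi in enumerate(h):
        w_output[i] += lr * delta_o * hi
    for i, ws in enumerate(w_hidden):
        for j, xj in enumerate(x):
            ws[j] += lr * delta_h[i] * xj
    return (target - o) ** 2

errors = [train_step([0.05, 0.10], 0.9) for _ in range(200)]
print(errors[0] > errors[-1])  # the squared error shrinks as iterations repeat
```

Repeating the step over many training patterns is what gradually tunes the weights toward accurate predictions.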
Table 1. Compressed and analyzed input dataset: age-adjusted cancer incidence rates for the primary sites with the highest rates within race- and ethnicity-specific categories (rates per 100,000).

Rank | All Categories | White | Black | Asian/Pacific Islander | American Indian/Alaska Native | Hispanic
1 | Prostate (151.4) | Prostate (140.8) | Prostate (228.6) | Female Breast (85.3) | Prostate (76.8) | Prostate (124.4)
2 | Female Breast (122.0) | Female Breast (123.0) | Female Breast (118.0) | Prostate (76.9) | Female Breast (68.3) | Female Breast (93.1)
3 | Lung and Bronchus (67.2) | Lung and Bronchus (67.9) | Lung and Bronchus (70.6) | Lung and Bronchus (37.0) | Lung and Bronchus (45.1) | Colon and Rectum (39.2)
4 | Colon and Rectum (46.2) | Colon and Rectum (45.1) | Colon and Rectum (54.8) | Colon and Rectum (36.0) | Colon and Rectum (32.5) | Lung and Bronchus (34.7)

5.2 Data preprocessing

A numeric value was assigned to each of the input entries, so that all cancer cases belonging to every event type, cancer site, sex and area had a unique numeric identity. The Excel file was converted to a .csv (comma-separated values) file and fed as input to the system. Since non-normalized values in the target file would yield erroneous results, the values had to be normalized first. This was accomplished using Microsoft Excel, which provides useful functions such as STDEV, STANDARDIZE and AVERAGE. After normalization the values are distributed around a common central point (the mean): values below the mean become negative and values above it become positive. This data was again converted to a .csv file. We now possess an input file and a target file that can be used for making accurate predictions.

5.3 Training in MATLAB and prediction

We can now proceed to training the data and creating a GUI for the system.
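The preprocessing of Section 5.2, numeric coding of categorical entries followed by normalization, can be sketched in Python. The paper did this in Excel with STANDARDIZE, AVERAGE and STDEV; the category values and counts below are illustrative assumptions, not the real dataset.

```python
import statistics

# Hypothetical categorical entries (real attributes include EVENT TYPE,
# SITE, SEX and AREA); each distinct value gets a unique numeric code
sites = ["Prostate", "Female Breast", "Lung and Bronchus", "Prostate"]
codes = {s: i for i, s in enumerate(sorted(set(sites)))}
encoded = [codes[s] for s in sites]

def standardize(values):
    # Excel-style STANDARDIZE: z = (x - AVERAGE) / STDEV
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation, like STDEV
    return [(v - mean) / sd for v in values]

# Illustrative target counts; after standardization they are centred on
# zero, with values below the mean negative and those above it positive
z = standardize([120.0, 85.0, 60.0, 150.0, 95.0])
print(encoded, [round(v, 2) for v in z])
```

The encoded inputs and standardized targets would then be written out as .csv files, mirroring the pipeline described above.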
The neural network tool (nntool) permits us to train a network on a dataset so that it can intelligently predict future values. We selected four neurons for the first layer and one neuron for the second layer, with the network type set to Feed-Forward Backprop, and then simulated the network. To facilitate easy implementation, we loaded the network file into a function and called this function within the GUI. The GUI itself was created in MATLAB using GUIDE (the GUI development environment). The result of this endeavor was a system that can be used effectively to make precise and intelligent predictions to support the formulation of policies and investment in them.

6. Results and Analysis

At first glance, the results obtained were found to be comparable to the expected output. Further investigation shows that the outcome is indeed close to that anticipated. Additional comparison can be made with statistical techniques such as moving averages and regression, and by simple analysis of cancer cases and of incidence and mortality trends.

Figure 3. A view of the network created.
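As a concrete illustration of the moving-average comparison mentioned above, the following Python sketch computes a simple moving average, a baseline against which the network's predictions could be judged. The yearly counts are made-up illustrative numbers, not figures from the dataset.

```python
def moving_average(series, window=3):
    # Each output point is the mean of the current value and the
    # (window - 1) preceding values
    return [sum(series[i - window + 1:i + 1]) / window
            for i in range(window - 1, len(series))]

# Hypothetical yearly cancer counts (illustrative only)
counts = [100, 104, 99, 110, 108, 115]
print(moving_average(counts))
```

If the network's predictions track the smoothed trend more closely than the raw noise, that supports the claim that the outcome matches expectations.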
Figure 4. GUI of the cancer count prediction system.

Figure 5. Plot of the predicted cancer count.

7. Conclusion and Future Scope

In the world of finance and global commerce, predicting the returns of investing in a particular firm is a matter of the utmost importance [2]. Artificial neural networks have long been used in the field of prediction. They have sometimes been found to possess drawbacks when learning data patterns, and have been known to demonstrate inconsistent and unpredictable behavior when the data used is too massive or complex. However, since the overall percentage of errors or deviations from the expected result is low, it can safely be concluded that artificial neural networks have a vast future scope in the domains of economics and prediction. The National Program of Cancer Registries (NPCR) and the National Vital Statistics System (NVSS) could refer to this system to predict future values, which would in turn help the government formulate policies and programmes intended to lower cancer cases. Hence, we could see a greater involvement of artificial neural networks in forecasting cancer incidence and mortality trends in the future.

References

[1] Jharna Chopra and Sampada Satav, Privacy Preservation Techniques in Data Mining, 9 April 2013.
[2] K. S. Fu, Syntactic Pattern Recognition and Applications, Prentice-Hall, 1982.
[3] Chandrika Satyavolu and T. Y. Lin, Attribute (Feature) Completion: The Theory of Attributes from Data Mining Prospect, 18 December 2007.
[4] S. V. Kartalopoulos, Understanding Neural Networks and Fuzzy Logic, Prentice-Hall, 2000.
[5] W. H. Inmon and S. Osterfelt, Understanding Data Pattern Processing, QED Technical Publishing Group, 1991.
[6] Janmenjoy Nayak, Asanta Ranjan Routray and Hadibandhu Pattnayak, Integration of Soft Computing Tools in Data Mining: A Unified Approach, 20 October 2012.