Prediction of Cancer Count through Artificial Neural Networks Using Incidence and Mortality Cancer Statistics Dataset for Cancer Control Organizations




Shivam Sidhu 1,*, Upendra Kumar Meena 2, Narina Thakur 3
1,2 Department of CSE, Student, Bharati Vidyapeeth's College of Engineering, New Delhi 110 063, India.
3 Department of CSE, Faculty of Technical Education, Bharati Vidyapeeth's College of Engineering, New Delhi 110 063, India.
e-mail: 1 shivam2040sidhu@gmail.com
* Corresponding author. Elsevier Publications 2013.

Abstract. The ultimate goal of data mining is prediction; predictive data mining is the most common type of data mining and the one with the most direct business applications. This paper discusses how data mining can help predict cancer counts from cancer statistics datasets, using a neural network as the prediction method. A neural network is an adaptive system that changes its structure during the learning phase and continuously refines its predictive behavior. The data set is taken from the government bodies NPCR and NVSS. The precision of the tool shows significant promise for use as a benchmark by cancer societies. Cancer control organizations can also use the tool when making vital decisions on investment and when designing new strategies and policies for reducing cancer incidence and mortality.

Keywords: Data mining, NPCR, NVSS, Clustering, Neural network.

1. Introduction

This section gives an overview of the paper and the dataset used. Section 2 discusses the data mining techniques of association, clustering, classification and prediction. Section 3 discusses artificial neural networks, including the concept of a hidden layer of neurons. Section 4 explains the backpropagation mechanism. Section 5 describes the implementation methodology: the dataset, data preprocessing, training in MATLAB, and how the prediction was done. Sections 6 and 7 present the analysis of results and the conclusion, respectively, and the references are listed at the end.

The paper concentrates on three steps: processing the dataset, training, and predicting future values. In processing, we first assigned a numeric value to each entry, then normalized one set of values and converted the Excel file to .csv. Training and prediction were then done with the Neural Network Tool (nntool) using the Feed Forward Backprop network type, followed by simulation of the network. Finally, we created a GUI using the GUIDE tool of MATLAB.

1.1 United States Cancer Statistics (USCS)

The dataset has 23 attributes and 671641 records in total. It has been collected from several sources and comprises three kinds of data, described below:

Incidence Data: For cancer incidence, the primary source of data is medical records. Staff at health care facilities abstract data from the medical records of all patients, enter it into the facility's own cancer registry if it has one, and then transmit the data to the regional or state registry.

Mortality Data: Cancer mortality data is based on information from all death certificates filed in the 50 states and the District of Columbia and processed by the National Vital Statistics System (NVSS).

Population Denominator Data: The population estimates used as denominators of incidence and death rates are race-specific, ethnicity-specific, and sex-specific county population estimates aggregated to the state or metropolitan-area level.

2. Data Mining

Data mining refers to the non-trivial extraction of implicit, previously unknown and potentially useful information from data in databases.

2.1 Association

In data mining, association rule learning is a popular and well-researched method for discovering interesting relations between variables in large databases. Association rules, based on the concept of strong rules, were introduced for discovering regularities between products in large-scale transaction data recorded by point-of-sale (POS) systems in supermarkets. For example, the rule {milk, bread} => {butter} found in the sales data of a supermarket would indicate that a customer who buys milk and bread together is also likely to buy butter. Today, association rules are employed in many application areas, including web usage mining and intrusion detection.

2.2 Clustering

At first the notion of a cluster seems obvious: a group of data objects. However, the precise notion of a cluster varies between algorithms, and it is one of the many decisions to make when choosing the appropriate algorithm for a specific problem. The clusters determined by different algorithms vary significantly in their properties. A clustering is essentially a set of clusters, containing all objects in the data set.

2.3 Classification

Classification is a data mining function that assigns items in a collection to target categories or classes. The aim of classification is to accurately predict the target class for each case in the data. A classification [1] task begins with a data set in which the class assignments are known. For example, a classification model that predicts credit risk could be developed from observed data for many loan applicants over a significant period of time.

2.4 Predictive data mining

The term predictive data mining is usually applied to data mining projects whose goal is to identify a statistical or neural network model, or set of models, that can be used to predict some response of interest. For example, a credit card company [3] may want to engage in predictive data mining to derive a (trained) model or set of models that can quickly identify transactions with a high probability of being fraudulent. Other types of data mining projects may be more exploratory in nature (e.g., identifying clusters or segments of customers). Data reduction is another possible objective of data mining (e.g., aggregating the information in very large data sets into useful and manageable chunks).

3. Neural Networks

Artificial neural networks (ANNs) are computational tools modeled on the interconnection of neurons in the nervous systems of the human brain and of other organisms [4]. An ANN is built from basic atomic units known as neurons. ANNs are a type of non-linear processing system that is well suited to a wide range of tasks, especially those for which no existing algorithm completes the task, and they have various applications. They can be trained to solve certain problems using a teaching method and sample data, so identically constructed ANNs can perform different tasks depending on the training received. With proper training, ANNs are capable of generalization: they can recognize similarities among different input patterns, including patterns that have been corrupted by noise.

Figure 1. A layer of neurons (labels: Input, Hidden, Output).

4. Back Propagation Mechanism

4.1 Backpropagation algorithm

The backpropagation algorithm is based on the generalized delta rule. Every iteration of training involves the following steps:

1) A particular training case is fed through the network in the forward direction, producing results at the output layer.
2) Using the known target information, the error is determined at the output nodes, and the required changes to the weights leading into the output layer are determined from this error.
3) The changes to the weights leading into the preceding layers are determined as a function of the properties of the neurons to which they directly connect (weight changes are calculated as a function of the errors determined for all following layers, working backward toward the input layer) until all necessary weight changes have been calculated for the entire network.

The calculated weight changes are then applied throughout the network, the next iteration begins, and the entire procedure is repeated with the next training pattern.

5. Implementation

5.1 Data set

The dataset, obtained online, comprises a large collection of attributes, including area, event type, site and sex. It contains cancer cases from all across the United States, from Alabama to Wyoming. The main parameters are as follows. EVENT TYPE categorizes data as incidence or mortality. SITE indicates which part of the human body contains the cancer cells, ranging from Brain to Urinary Bladder. SEX can be male or female.

Figure 2. Cancer ranking by state, including all cancer sites, male and female, 1999-2009. Rates are per 100,000 persons and are age-adjusted to the 2000 U.S. standard population (19 age groups, Census P25-1130).
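
The three steps of the backpropagation iteration in Section 4.1 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the nntool implementation used in this paper: the sigmoid hidden units, the linear output unit, the squared-error measure, the learning rate, the layer sizes and the single training pattern are all hypothetical choices made for the example.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def init_network(n_in, n_hidden, seed=0):
    """Small random weights; the last entry of each weight row is the bias."""
    rng = random.Random(seed)
    hidden = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)]
              for _ in range(n_hidden)]
    output = [rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)]
    return hidden, output


def forward(net, x):
    """Step 1: feed one training case forward to the output layer."""
    hidden, output = net
    h = [sigmoid(sum(w * v for w, v in zip(row[:-1], x)) + row[-1])
         for row in hidden]
    y = sum(w * v for w, v in zip(output[:-1], h)) + output[-1]  # linear output
    return h, y


def train_step(net, x, target, lr=0.1):
    """One backpropagation iteration for a single training pattern."""
    hidden, output = net
    h, y = forward(net, x)
    # Step 2: the error at the output node gives the output-layer weight changes.
    delta_out = y - target
    grad_output = [delta_out * hj for hj in h] + [delta_out]
    # Step 3: propagate the error backward to the hidden layer.
    grad_hidden = []
    for j in range(len(hidden)):
        delta_h = delta_out * output[j] * h[j] * (1.0 - h[j])
        grad_hidden.append([delta_h * xi for xi in x] + [delta_h])
    # Apply all calculated weight changes throughout the network.
    for j, row in enumerate(hidden):
        for i in range(len(row)):
            row[i] -= lr * grad_hidden[j][i]
    for j in range(len(output)):
        output[j] -= lr * grad_output[j]
    return 0.5 * (y - target) ** 2  # error measured before the update


net = init_network(n_in=2, n_hidden=4)
pattern, target = [0.5, -1.0], 0.3  # hypothetical training pattern
errors = [train_step(net, pattern, target) for _ in range(200)]
```

In practice the loop would cycle through all training patterns rather than repeat a single one; the single pattern here only makes it easy to see the error shrink from one iteration to the next.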

Table 1. Compressed and analyzed input dataset. Age-adjusted cancer incidence rates for the primary sites with the highest rates within race- and ethnic-specific categories.

| Rank | All Categories | Native American | Migrated | Asian Islander | American Indian Native | Hispanic |
|------|----------------|-----------------|----------|----------------|------------------------|----------|
| 1 | Prostate 151.4 | Prostate 140.8 | Prostate 228.6 | Female Breast 85.3 | Prostate 76.8 | Prostate 124.4 |
| 2 | Female Breast 122.0 | Female Breast 123.0 | Female Breast 118.0 | Prostate 76.9 | Female Breast 68.3 | Female Breast 93.1 |
| 3 | Lung and Bronchus 67.2 | Lung and Bronchus 67.9 | Lung and Bronchus 70.6 | Lung and Bronchus 37.0 | Lung and Bronchus 45.1 | Colon and Rectum 39.2 |
| 4 | Colon and Rectum 46.2 | Colon and Rectum 45.1 | Colon and Rectum 54.8 | Colon and Rectum 36.0 | Colon and Rectum 32.5 | Lung and Bronchus 34.7 |

5.2 Data preprocessing

A numeric value was assigned to each of the input entries, so that every event type, cancer site, sex and area had a unique numeric identity. The Excel file was converted to a .csv (comma-separated values) file and fed as input to the system. Because non-normalized target values would yield erroneous results, the values of the target file had to be normalized first. This was done in Microsoft Excel, using functions such as STDEV, STANDARDIZE and AVERAGE. After standardization the values are centered on a common middle point (zero mean, unit standard deviation): values below the original mean become negative and values above it become positive. This data was again converted to a .csv file. We now have an input file and a target file that can be used for making predictions.

5.3 Training in MATLAB and prediction

With the data prepared, we proceed to train the network and create a GUI for the system.
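
Before turning to the training itself, the preprocessing of Section 5.2 can be sketched in Python. The paper performed these steps in Excel; the helper names (`encode_categories`, `standardize`) and the sample values below are hypothetical and serve only as an illustration. `standardize` computes (x - mean) / standard deviation, as Excel's STANDARDIZE does, and uses the sample standard deviation, as Excel's STDEV does.

```python
import statistics


def encode_categories(values):
    """Assign each distinct entry a unique integer ID, in order of first appearance."""
    codes = {}
    encoded = []
    for v in values:
        if v not in codes:
            codes[v] = len(codes) + 1
        encoded.append(codes[v])
    return encoded, codes


def standardize(values):
    """Return z-scores: (x - mean) / sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation, like Excel's STDEV
    return [(x - mean) / sd for x in values]


# Hypothetical rows for illustration; not taken from the USCS dataset.
sites = ["Prostate", "Female Breast", "Prostate", "Lung and Bronchus"]
counts = [120.0, 95.0, 130.0, 55.0]

site_ids, site_codes = encode_categories(sites)
z_counts = standardize(counts)
```

The standardized values have zero mean and unit standard deviation, so entries below the original mean come out negative and entries above it come out positive.
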
The neural network tool (nntool) allows us to train a network on a dataset so that it can predict future values. We selected four neurons for the first layer and one neuron for the second layer, with Feed Forward Backprop as the network type, and then simulated the network. To ease implementation, we loaded the network file into a function and called this function from within the GUI. The GUI itself was created in MATLAB using GUIDE (GUI development environment). The result of this work was a system that can be used to make predictions that inform the formulation of policies and investment in them.

6. Results and Analysis

At first glance, the results obtained were comparable to the expected output, and further investigation showed that the outcome is indeed close to that anticipated. Further comparison can be made against statistical techniques such as moving average and regression, and by simple analysis of cancer cases and of incidence and mortality trends.

Figure 3. A view of the network created.
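
One of the comparisons suggested in Section 6, a moving average, could serve as a simple baseline for the network's predictions. The sketch below is an assumption about how such a baseline might be set up, not a result from this paper; the window size, the function names and the yearly counts are hypothetical placeholders.

```python
def moving_average_forecast(series, window=3):
    """Forecast the next value as the mean of the last `window` observations."""
    if window < 1 or len(series) < window:
        raise ValueError("need at least `window` observations")
    return sum(series[-window:]) / window


def mean_absolute_error(actual, predicted):
    """Average absolute deviation between two equally long sequences."""
    if len(actual) != len(predicted):
        raise ValueError("sequences must have the same length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical yearly counts for illustration; not taken from the dataset.
yearly_counts = [100.0, 104.0, 103.0, 108.0, 110.0]

# One-step-ahead baseline forecasts for the 4th and 5th years, then the next year.
backtest = [moving_average_forecast(yearly_counts[:i])
            for i in range(3, len(yearly_counts))]
baseline_error = mean_absolute_error(yearly_counts[3:], backtest)
next_year = moving_average_forecast(yearly_counts)
```

Computing the same `mean_absolute_error` for the network's predictions over the same years would give a like-for-like comparison with this baseline.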

Figure 4. GUI of the cancer count prediction system.

Figure 5. Plot of the predicted cancer count.

7. Conclusion and Future Scope

In the world of finance and global commerce, predicting the returns of investing in a particular firm is a matter of the utmost importance [2]. Artificial neural networks have long been used in the field of prediction. They have drawbacks: they sometimes struggle to learn data patterns, and they have been known to behave inconsistently and unpredictably when the data is too massive or complex. However, since the overall percentage of errors, or deviations from the expected result, was low, we conclude that artificial neural networks have considerable future scope in the domain of economics and prediction. The National Program of Cancer Registries (NPCR) and the National Vital Statistics System (NVSS) could refer to this system to predict future values. This would in turn help governments formulate policies and programmes intended to lower the number of cancer cases. Hence, we may see greater involvement of artificial neural networks in forecasting cancer incidence and mortality trends in the future.

References

[1] Jharna Chopra and Sampada Satav, Privacy Preservation Techniques in Data Mining, 9th April, 2013.
[2] Fu K. S., Syntactic Pattern Recognition and Applications, Prentice-Hall, 1982.
[3] Chandrika Satyavolu and T. Y. Lin, [171] Attribute (Feature) Completion The Theory of Attributes from Data Mining Prospect, 18th December, 2007.
[4] Kartalopoulos S. V., Understanding Neural Networks and Fuzzy Logic, Prentice-Hall, 2000.
[5] Inmon W. H. and Osterfelt S., Understanding Data Pattern Processing, QED Technical Publishing Group, 1991.
[6] Janmenjoy Nayak, Asanta Ranjan Routray and Hadibandhu Pattnayak, Integration of Soft Computing Tools in Data Mining: A Unified Approach, 20th October, 2012.