Discovering Trends in Large Datasets Using Neural Networks




Khosrow Kaikhah, Ph.D. and Sandesh Doddameti
Department of Computer Science
Texas State University
San Marcos, Texas 78666

Abstract. A novel knowledge discovery technique using neural networks is presented. A neural network is trained to learn the correlations and relationships that exist in a dataset. The neural network is then pruned and modified to generalize the correlations and relationships. Finally, the neural network is used as a tool to discover all existing hidden trends in four different types of crimes (murder, rape, robbery, and auto theft) in US cities, as well as to predict trends based on the existing knowledge inherent in the network.

1 Introduction

Large datasets encompass hidden trends which convey valuable knowledge about the dataset. The acquired knowledge is helpful in understanding the domain which the data describe. The hidden trends, which can be expressed as rules or correlations, highlight the associations that exist in the data. Therefore, discovering these hidden trends, which are specific to the application, is extremely helpful and vital for analyzing the data [1], [2].

We define a machine learning process that uses artificial neural networks to discover trends in large datasets. A neural network is trained to learn the inherent relationships among the data. The neural network is then modified via pruning and hidden layer activation clustering. The modified neural network is then used as a tool to extract common trends that exist in the dataset, as well as to predict trends. The extraction phase can be regulated through several control parameters. The extraction process defines a method for analyzing the knowledge acquired by the neural network, which is encoded in its architecture, connections, and weights.

Our goal is to define an analytical approach using neural networks based on hidden unit activations and neural network training. The novelty of our process lies in relating the strength/predictability of occurrence of trends to the frequency of neural activations. The knowledge stored in the neural network is extracted only when rigorous frequency and consistency requirements are satisfied, thus providing a sound method for selecting the knowledge/rules from the vast possible combinations of data.

2 Rule Extraction Techniques

Andrews et al. in [3] discuss the difficulty in comprehending the internal process of how a neural network learns a hypothesis. According to their survey, rule extraction methods can be categorized into decompositional, pedagogical, and eclectic techniques. The distinguishing characteristic of the decompositional approach is that the focus is on extracting rules at the level of individual (hidden and output) units within the trained neural network. In the pedagogical approach to rule extraction, the trained neural network is treated as a black box; in other words, the view of the underlying trained artificial neural network is opaque. The eclectic approach combines decompositional and pedagogical techniques. They describe several rule extraction methods and conclude that no single rule extraction/rule refinement technique is currently in a dominant position to the exclusion of all others.

The Subset method uses the decompositional technique to extract rules at each individual hidden and output unit [4]. These rules are then aggregated to form the composite rule base for the neural network as a whole. The basic idea is to search for subsets of incoming weights to each hidden and output unit which exceed the bias on the unit. The key underlying assumption is that the signal strength associated with each connection to the unit is zero or one; in other words, the connection is maximally active or minimally active. This assumption is achieved by judicious selection of the unit's activation function. The Subset method is a general-purpose procedure that does not appear to be specific to a particular problem domain. However, since the solution time of the algorithm increases exponentially with the number of input units, it is suitable only for simple networks or so-called small problem domains.

The M-of-N method uses the decompositional technique [5]. The method proceeds as follows. For each hidden and output unit, identify groups of similarly-weighted links. Set the link weights of all group members to the average of the group. Eliminate any groups that do not significantly affect whether the unit will be active or inactive. Holding all link weights constant, optimize the biases of all hidden and output units using the back-propagation algorithm. Form a single rule for each hidden and output unit; the rule consists of a threshold given by the bias and weighted antecedents specified by the remaining links. Where possible, simplify rules to eliminate superfluous weights and thresholds.

The M-of-N method is designed as a general-purpose rule extraction method and is not limited to any particular class of problem domain.

The RULEX method uses the decompositional technique [6]. It is designed to exploit the manner of construction and consequent behavior of a particular type of multilayer perceptron, the Constrained Error Back-Propagation (CEBP) MLP, which is a type of local response neural network similar in performance to a Radial Basis Function (RBF) network. The hidden units of the CEBP network are sigmoid-based locally responsive units (LRUs) that have the effect of partitioning the training data into a set of disjoint regions, each region represented by a single hidden layer unit. Each LRU is composed of a set of ridges, one ridge for each dimension of the input. A ridge will produce an appreciable output only if the value presented as input lies within the active range of the ridge. The LRU output is the thresholded sum of the activations of the ridges. In order for a vector to be classified by an LRU, each component of the input vector must lie within the active region of its corresponding ridge. RULEX performs rule extraction by direct interpretation of the weight parameters as rules.

The VIA (Validity Interval Analysis) method uses the pedagogical technique to extract rules that map inputs directly to outputs [7]. VIA uses a generate-and-test procedure to extract symbolic rules from standard back-propagation trained neural networks which have not been specifically constructed to facilitate rule extraction. The approach is similar to sensitivity analysis in that it characterizes the output of the trained ANN by systematically varying the input patterns and examining the changes in the network classification. The validity interval of a unit specifies a maximum range for its activation. VIA is designed as a general-purpose rule extraction procedure and does not seem to be limited to any particular problem domain.

The BRAINNE system uses the pedagogical technique [8]. It extracts rules from an ANN trained using back-propagation. It requires a specialized training regime that takes an initial trained network with m inputs and n outputs and transforms it into a network with m+n inputs and n outputs. This transformed network is then retrained. The next phase in the process is to perform a pair-wise comparison of the weights for the links between each of the original m input units and the set of hidden units with the weights for the links between each of the n additional input units and the corresponding hidden units. The smaller the difference between the two values, the greater the contribution of the original input unit to the output.

A major innovation of the BRAINNE system is the capability to deal with continuous data as input without first having to employ a discretizing phase.

The Rule-Extraction-as-Learning method uses the eclectic technique [9]. The core idea is to view rule extraction as a learning task, where the target concept is the function computed by the network and the input features are simply the network's input features. The key is the procedure used to determine whether a given rule agrees with the network. This procedure accepts a class label c and a rule r, and returns true if all instances covered by r are classified as members of class c by the network. The method is designed as a general-purpose rule extraction procedure and its applicability is not limited to any specific class of problem domains.

The DEDEC method uses the eclectic technique [10]. It provides a general method for disgorging the information contained in existing trained neural network solutions already implemented in various problem domains. It extracts symbolic rules efficiently from a set of individual cases. The task of rule extraction is treated as a process similar to that of identifying the minimal set of information required to distinguish a particular object from other objects. In order to search the solution space in as optimal a fashion as possible, the DEDEC method ranks the cases to be examined in order of importance. This is achieved by using the magnitude of the weight vectors in the trained ANN to rank the input units according to the relative share of their contribution to the output units. The focus is on extracting rules from those cases that involve what are deemed to be the most important input units. The method also employs heuristics to terminate the process either as soon as a measure of stability appears in the extracted rule set, or when the relative significance of an input unit, selected for generating cases to be examined, falls below some threshold value.

3 Related Research

Setiono in [11] uses the decompositional technique and applies feedforward multilayer neural networks to the data mining classification problem. The overall process is as follows: a) A neural network is trained for a classification problem using the dataset, to the desired accuracy. b) The network is then pruned to obtain a minimal architecture, to improve generalization and to decrease the complexity. c) The acquired knowledge is then extracted in the form of rules.

An example which classifies persons based on their age and income is described in [11]. In addition, Setiono demonstrates the applicability and accuracy of the process on the Wisconsin breast cancer dataset in [12]. The rules extracted are in the if <conditions> then <result> form. The overall process is defined for classification problems with three-layer networks (one input, one hidden, and one output layer). There are no significant control parameters for data analysis.

Gupta et al. in [13] propose an algorithm (GLARE) to extract classification rules from feedforward and fully connected neural networks trained by back-propagation. The major characteristics of the GLARE algorithm are: (a) its analytic approach for rule extraction, (b) its applicability to standard network structure and training method, and (c) its rule extraction mechanism as a direct mapping between input and output neurons. This method is designed for a neural network with only one hidden layer. The approach uses the significance of connection strengths based on their absolute magnitude and uses only a few important connections (highest absolute values) to analyze the rules.

Wang et al. in [14] present an example of biological data mining using neural networks. They train a Bayesian neural network with protein sequences to classify them as belonging to a particular superfamily. The input to the network is the protein sequence encoded in a special form, and the output is a single neuron which is activated if the protein sequence is in a particular class. Bayesian neural networks use probability to represent biases, sampling distributions, noise, and other uncertainties. The Bayesian approach incorporates external knowledge (biases) about the target function in the form of prior probabilities of different hypothesis functions. In this work, no rule extraction is performed, and the process is a simple classification on the test data. The relevance of this article to our work lies in the applicability of neural networks to learn the hypothesis encompassed in the dataset.

Our knowledge discovery process is both decompositional and pedagogical. It is decompositional in nature, since we examine the weights for pruning and cluster the hidden unit activation values. It is pedagogical, since we use the neural network as a black box for knowledge discovery [15]. Our approach is neither limited by the complexity of the hidden layer, nor by the number of hidden layers. Therefore, our approach can be extended to networks with several hidden layers.

Our process also provides control parameters for data analysis. These parameters provide a mechanism to control the probability of occurrence and the accuracy of rules, similar to the Support and Confidence framework of associative data mining. However, there are no data mining or statistical techniques that are comparable to our knowledge discovery process.

4 Discovering Trends

We have developed a novel process for discovering trends in datasets, with m-dimensional input space and n-dimensional output space, utilizing neural networks. Our process is independent of the application. The significance of our approach lies in using neural networks for discovering knowledge, with control parameters. The control parameters influence the discovery process in terms of the importance and significance of the acquired knowledge. There are four phases in our approach: 1) neural network training and filtering, 2) pruning and re-training, 3) clustering the hidden neuron activation values, and 4) rule discovery and extraction. In phase one, the neural network is trained using a supervised learning method; the neural network learns the associations inherent in the dataset. In phase two, the neural network is pruned by removing all unnecessary connections and neurons. In phase three, the activation values of the hidden layer neurons are clustered using an adaptable clustering technique. In phase four, the modified neural network is used as a tool to extract and discover hidden trends. These four phases are described in more detail in the next four sections.

4.1 Neural Network Training and Filtering

Neural networks are able to solve highly complex problems due to the non-linear processing capabilities of their neurons. In addition, the inherent modularity of the neural network's structure makes them adaptable to a wide range of applications [16]. The neural network adjusts its parameters to accurately model the distribution of the provided dataset. Therefore, exploring the use of neural networks for discovering correlations and trends in data is prudent.

Neural networks learn to approximate complex functional relationships between input and output patterns. The types of mapping that can exist in a given dataset are depicted in Figure 1.

Figure 1: Types of mapping that exist in a given dataset (regions A, B, and C of the input space map to regions of the output space with n, p + m, and o + q pattern pairs, respectively)

In any dataset, there are three types of mappings. Type A mapping is from similar input patterns to similar output patterns. These are said to have consistent mapping, i.e., close input pairs (closely grouped in the input space) map to close output pairs (closely grouped in the output space). Type B mapping is from different regions in the input space to similar regions in the output space. Type C mapping is from one region in the input space to various regions in the output space. Type C is inconsistent mapping, since similar inputs are mapped to different outputs. In the example of Figure 1, there are n pairs of type A mapping, p + m pairs of type B mapping, and o + q pairs of type C mapping.

Consistent and frequent patterns of types A and B strengthen the learning. The frequencies of these patterns largely affect the learning and generalization capability of the neural network. However, inconsistent patterns of type C weaken the learning. In these cases, the inconsistent pairs with the lowest frequencies should be filtered out in favor of the higher-frequency pairs. The inconsistency of a particular pattern can be measured in terms of its error with respect to the mean squared error. For the patterns in Figure 1, the filtering process removes either set o or set q, whichever has the lower frequency, or both sets, if both sets have relatively low frequencies with respect to the whole dataset.
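To make the filtering step concrete, the sketch below flags a pattern as inconsistent when its individual squared error stays well above the mean squared error of the whole set, and drops the flagged patterns. This is a minimal illustration and not the authors' code; the function names and the threshold factor k are assumptions.

    import numpy as np

    def filter_inconsistent(inputs, targets, predict, k=2.0):
        # Per-pattern squared error under the partially trained network;
        # `predict` is assumed to be the network's forward pass.
        errors = np.array([np.mean((predict(x) - t) ** 2)
                           for x, t in zip(inputs, targets)])
        mse = errors.mean()
        keep = errors <= k * mse  # patterns far above the MSE are inconsistent
        return inputs[keep], targets[keep]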

By filtering the scattered and inconsistent data during the training phase, the neural network can achieve a higher performance rate. Figure 2 depicts the error rates obtained for a sample dataset with and without filtering.

Figure 2: Effects of Filtering (mean squared error versus training epoch for a sample set, without filtering and with filtering)

The input and output patterns may be real-valued or binary-valued. If the patterns are real-valued, each value is discretized and represented as a sequence of binary values, where each binary value represents a range of real values. For example, in a credit card transaction application, an attribute may represent the person's age (a value greater than 21). This value can be discretized into 4 different intervals: (21-30], (30-45], (45-65], and (65+]. Therefore, [0 1 0 0] would represent a customer between the ages of 31 and 45. The number of neurons in the input and output layers is determined by the application, while the number of neurons in the hidden layer is dependent on the number of neurons in the input and output layers.
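As a small illustration of this interval encoding, the following sketch one-hot encodes a real value into half-open intervals; the helper name and boundary representation are assumptions, but the intervals and the [0 1 0 0] result mirror the age example above.

    import numpy as np

    def discretize(value, boundaries):
        # boundaries define half-open intervals (b[i], b[i+1]];
        # exactly one output bit is set for the interval containing value.
        code = np.zeros(len(boundaries) - 1, dtype=int)
        for i in range(len(code)):
            if boundaries[i] < value <= boundaries[i + 1]:
                code[i] = 1
                break
        return code

    # Age intervals (21-30], (30-45], (45-65], (65+] from the example:
    print(discretize(38, [21, 30, 45, 65, float("inf")]))  # -> [0 1 0 0]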

We use an augmented gradient descent approach to train and update the connection strengths of the neural network. The gradient descent approach is an intelligent search for the global minima of the energy function. We use an energy function which is a combination of an error function and a penalty function [5]. The total energy function to be minimized during the training process is:

\theta(w, v) = E(w, v) + P(w, v)    (1)

The error function to be minimized is the mean squared error:

E(w, v) = \frac{1}{L} \sum_{l=1}^{L} \sum_{k=1}^{n} (o_{lk} - d_{lk})^2    (2)

where L is the number of patterns, n is the number of output neurons, o_{lk} is the output of the k-th output neuron for pattern l, and d_{lk} is the expected output of the k-th output neuron for pattern l.

The addition of the penalty function drives the associated weights of unnecessary connections to very small values while strengthening the rest of the connections. Therefore, the unnecessary connections and neurons can be pruned without affecting the performance of the network. The penalty function is defined as:

P(w, v) = \rho_{decay} \left( P_1(w, v) + P_2(w, v) \right)    (3)

P_1(w, v) = \epsilon_1 \left( \sum_{j=1}^{h} \sum_{i=1}^{m} \frac{\beta w_{ij}^2}{1 + \beta w_{ij}^2} + \sum_{j=1}^{h} \sum_{k=1}^{n} \frac{\beta v_{jk}^2}{1 + \beta v_{jk}^2} \right)

P_2(w, v) = \epsilon_2 \left( \sum_{j=1}^{h} \sum_{i=1}^{m} w_{ij}^2 + \sum_{j=1}^{h} \sum_{k=1}^{n} v_{jk}^2 \right)

where m is the number of input neurons, h is the number of hidden neurons, n is the number of output neurons, w_{ij} is the i-th input to j-th hidden layer connection strength, v_{jk} is the j-th hidden to k-th output layer connection strength, \epsilon_1 is a scaling factor typically set to 0.1, \epsilon_2 is a scaling factor typically set to 0.00001, \beta is a scaling factor typically set to 10, and \rho_{decay} is a scaling factor typically set to a value between 0.03 and 0.05.

The network is trained until it reaches a recall accuracy of 99% or higher.
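A minimal sketch of the total energy computation, assuming w is the m x h input-to-hidden weight matrix, v the h x n hidden-to-output matrix, and outputs/desired are L x n arrays; the scaling factors use the typical values stated above, with rho_decay taken from the stated 0.03-0.05 range.

    import numpy as np

    def total_energy(w, v, outputs, desired,
                     eps1=0.1, eps2=1e-5, beta=10.0, rho_decay=0.04):
        # Error term (2): mean over patterns of the summed squared output error.
        E = np.mean(np.sum((outputs - desired) ** 2, axis=1))
        # Penalty (3): saturating term P1 plus plain weight decay P2.
        P1 = eps1 * ((beta * w**2 / (1 + beta * w**2)).sum()
                     + (beta * v**2 / (1 + beta * v**2)).sum())
        P2 = eps2 * ((w**2).sum() + (v**2).sum())
        return E + rho_decay * (P1 + P2)  # total energy (1)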

4.2 Pruning and Re-Training

The neural network is trained with an energy function which includes a penalty function. The penalty function drives the strengths of unnecessary connections to approach zero very quickly. The insignificant connections having very small values can safely be removed without considerable impact on the performance of the network. For each input-to-hidden layer connection w_{ij}, if \max_k |v_{jk}| \cdot |w_{ij}| < 0.1, remove w_{ij}; and for each hidden-to-output layer connection v_{jk}, if |v_{jk}| \leq 0.1, remove v_{jk}.

After removing all weak connections, any input layer neuron having no outgoing connections can be removed. In addition, any hidden layer neuron having no incoming or outgoing connections can safely be removed. Finally, any output layer neuron having no incoming connections can be removed. Removal of input layer neurons corresponds to having irrelevant inputs in the data model; removal of hidden layer neurons reduces the complexity of the network and of the clustering phase; and removal of output layer neurons corresponds to having irrelevant outputs in the data model. Pruning the neural network results in a less complex network while improving its generalization.

Once the pruning step is complete, the network is re-trained with the same dataset as in phase one, to ensure that the recall accuracy of the network has not diminished significantly. If the recall accuracy of the network drops by more than 2%, the pruned connections and neurons are restored and a stepwise approach is pursued. In the stepwise pruning approach, the insignificant incoming and outgoing connections of the hidden layer neurons are pruned, one neuron at a time, and the network is re-trained and tested for recall accuracy.
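The pruning rule translates directly into array operations. This is a sketch under the assumption that w[i, j] connects input i to hidden j and v[j, k] connects hidden j to output k, with removed connections zeroed rather than physically deleted.

    import numpy as np

    def prune_connections(w, v, threshold=0.1):
        # Remove w_ij when max_k |v_jk| * |w_ij| < threshold.
        keep_w = (np.abs(v).max(axis=1)[None, :] * np.abs(w)) >= threshold
        # Remove v_jk when |v_jk| <= threshold.
        keep_v = np.abs(v) > threshold
        w, v = w * keep_w, v * keep_v
        # Neurons left without connections can then be dropped: input i with
        # w[i, :] all zero, hidden j with w[:, j] and v[j, :] all zero, and
        # output k with v[:, k] all zero.
        return w, v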

4.3 Clustering the Hidden Layer Neuron Activation Values

The activation values of each hidden layer neuron are dynamically clustered and re-clustered, with a cluster radius and a confidence radius, respectively. The clustering algorithm is adaptable; that is, the clusters are created dynamically as activation values are added into the cluster-space. Therefore, the number of clusters and the number of activation values in each cluster are not known a priori. The centroid of each cluster represents the mean of the activation values in the cluster and can be used as the representative value of the cluster, while the frequency of each cluster represents the number of activation values in that cluster. The centroid and frequency of a cluster are denoted by G and freq, respectively. By using the centroids of the clusters, each hidden layer neuron has a minimal set of activations. This helps with obtaining generalized outputs at the output layer. The centroid is adjusted dynamically as new elements e are added to the cluster:

G_{new} = \frac{(G_{old} \cdot freq) + e}{freq + 1}    (4)

Dist(G, e) is the numerical distance of an element e from the centroid G:

Dist(G, e) = |G - e|    (5)

The radius of a cluster defines the distance of the farthest element from the centroid:

|G - e_i| \leq r for any cluster element e_i    (6)

A cluster is a region having a radius of r which includes elements e_i, where 1 \leq i \leq n and n is the number of elements in the cluster. Clusters may be overlapping or disjoint. The cluster radius must be less than a predetermined upper bound in order to maintain the desired accuracy of the network. The upper bound for the cluster radius defines a range over which the hidden layer neuron activation values can fluctuate without compromising the network performance. The actual activation value of the j-th hidden layer neuron is defined as:

S_j = Sig\left( \sum_{l=1}^{n} x_l w_{lj} \right)    (7)

where Sig(x) = \frac{1}{1 + e^{-a \cdot x}}.

Each group of activation values is represented by a centroid value, which in the worst case is defined as:

G_j = S_j \pm r_j    (8)
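A sketch of the adaptable clustering for one hidden neuron's activation values, with the centroid update of equation (4); the list-based representation is an assumption. Re-clustering is the same pass rerun with the confidence radius (one-half the cluster radius).

    def dynamic_cluster(values, radius):
        clusters = []  # each cluster is [centroid G, frequency freq]
        for e in values:
            nearest = min(clusters, key=lambda c: abs(c[0] - e), default=None)
            if nearest is not None and abs(nearest[0] - e) <= radius:
                G, freq = nearest
                nearest[0] = (G * freq + e) / (freq + 1)  # equation (4)
                nearest[1] = freq + 1
            else:
                clusters.append([e, 1])  # seed a new cluster
        return clusters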

The output of the k-th output layer neuron is defined as:

Z_k = Sig\left( \sum_{j=1}^{m} G_j v_{jk} \right)    (9)

To maintain the accuracy of the network, the following must hold:

|Z_k^* - Z_k| \leq \rho    (10)

where Z_k^* is the desired value, and \rho (the tolerance factor) is typically set to a small value such as 0.01. The upper bound for the cluster radius is derived from (8), (9), and (10) as:

r_{max} = \frac{1}{\max_k \sum_{j=1}^{m} v_{jk}} \ln \frac{1}{\rho}    (11)

Since dynamic clustering is order sensitive, once the clusters are dynamically created with a cluster radius, all elements are re-clustered with a confidence radius of one-half the cluster radius.

Figure 3: Effects of Clustering

The benefits of re-clustering are twofold: 1) Due to the order sensitivity of dynamic clustering, some of the activation values may be misclassified; re-clustering alleviates this deficiency by classifying the activation values into appropriate clusters. 2) Re-clustering with the confidence radius (one-half the cluster radius) eliminates any possible overlaps among clusters. For example, an element which was determined to belong to one cluster in Figure 3 is, after re-clustering, determined to belong to a different cluster in Figure 4. In addition, during re-clustering, the frequency of each confidence cluster is calculated, which will be utilized in the extraction phase.

Figure 4: Effects of Re-Clustering
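Under the reconstruction of bound (11) given above, the maximum admissible cluster radius is a one-line computation. This is a sketch, and the use of absolute weight sums is an assumption.

    import numpy as np

    def max_cluster_radius(v, rho=0.01):
        # v[j, k]: hidden-to-output weights; rho is the tolerance factor of (10).
        return np.log(1.0 / rho) / np.abs(v).sum(axis=0).max()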

Clustering of the hidden layer neuron activation values helps to identify regions of activations, along with the frequency of such activities. In addition, clustering generates representative values for such regions, by which we can retrieve generalized outputs. Since we utilize a desired confidence frequency, we can examine the level of activities in all regions across the entire hidden layer. Only patterns which satisfy the desired confidence frequency across the entire hidden layer are considered. This ensures that inconsistent patterns, and those which fall within regions with a low level of activity, are not considered.

4.4 Rule Discovery and Extraction

In the final phase of the process, the knowledge acquired by the trained and modified neural network is extracted in the form of rules. This is done by utilizing the generalization of the hidden layer neuron activation values, as well as control parameters. The novelty of the extraction process is the use of the hidden layer as a filter, by performing vigilant tests on the clusters. Clusters identify common regions of activations along with the frequency of such activities. In addition, clusters provide representative values (the means of the clusters) that can be used to retrieve generalized outputs. The control parameters for the extraction process include: a) the cluster radius, b) the confidence frequency, and c) the hidden layer activation level. The cluster radius determines the coarseness of the clusters. The confidence radius, r_{conf}, is usually set to one-half of the cluster radius to remove any possible overlaps or misclassifications among clusters. The confidence frequency, Freq_{Conf}, defines the minimum required support, which reflects the required commonality among patterns. The hidden layer activation level defines the maximum level of tolerance for inactive hidden layer neurons.

Knowledge extraction is performed in two steps. First, the existing trends are discovered by presenting the input patterns in the dataset to the trained and modified neural network and by providing the desired control parameters. For a given hidden layer neuron activation value e_j: if \min_n Dist(G_n, e_j) \leq r_{conf} and freq_G \geq Freq_{Conf}, the hidden layer activation value is taken to be G, the centroid of the nearest confidence cluster; otherwise, the hidden layer neuron is inactive (a minimal sketch of this test appears at the end of this section).

If the total number of inactive hidden layer neurons does not exceed the maximum allowable tolerance for inactive hidden layer neurons, the network produces an output. The input patterns that satisfy the rigorous extraction phase requirements and produce an output pattern represent the generalizations and correlations that exist in the dataset. The level of generalization and correlation acceptance is regulated by the control parameters. This ensures that inconsistent patterns, which fall outside the confidence regions of hidden layer activations, or fall within regions with low levels of activity, are not considered.

There may be many duplicates among these accepted input-output pairs. In addition, several input-output pairs may have the same input pattern or the same output pattern. Those pairs having the same input patterns are combined, and those pairs having the same output patterns are combined. This post-processing is necessary to determine the minimal set of trends. Any input or output attribute not included in a discovered trend corresponds to an irrelevant attribute in the dataset.

Second, the predicted trends are extracted by providing all possible permutations of input patterns, as well as the desired control parameters. Any additional trends discovered in this step constitute the predicted knowledge, based on existing knowledge. This step is a direct byproduct of the generalizability of neural networks.
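The per-neuron extraction test can be sketched as follows, reusing the (centroid, frequency) clusters from the re-clustering pass; the names are illustrative and the cluster list is assumed non-empty. A pattern passes only if the number of neurons returning None stays within the hidden layer activation level tolerance.

    def generalized_activation(e_j, clusters, r_conf, freq_conf):
        # Nearest confidence cluster to this activation value.
        G, freq = min(clusters, key=lambda c: abs(c[0] - e_j))
        if abs(G - e_j) <= r_conf and freq >= freq_conf:
            return G  # replace the raw activation with the centroid
        return None   # neuron is treated as inactive for this pattern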

5 Discovering Trends in Crimes in US Cities

We compiled two datasets consisting of the latest two annual demographic and crime statistics for 6100 US cities. The data are derived from three different sources: 1) the US Census; 2) the Uniform Crime Reports (UCR), published annually by the Federal Bureau of Investigation (FBI); and 3) unemployment information from the Bureau of Labor Statistics. We used the first dataset to discover existing and predicted trends in crimes with respect to the demographic characteristics of the cities, and used the second dataset to verify the accuracy of the predicted trends.

We divided the datasets into three groups in terms of the population of cities: a) cities with populations of less than 20k (4706 cities), b) cities with populations of greater than 20k and less than 100k (1193 cities), and c) cities with populations of greater than 100k (201 cities). We divided the datasets in this manner since, otherwise, small cities (cities with less than 20k population) would dominate the process due to their high overall percentage. We then trained a neural network for each group of cities and each of the four types of crimes (murder, rape, robbery, and auto theft) using the first dataset. Table 1 includes the demographic characteristics and crime types we used for the knowledge discovery process.

Table 1: The Categories for the Process

POP: City Population
SINGP%: Percentage of Single-Parent Households
MINOR%: Percentage of Minority
YOUNG%: Percentage of Young People (between the ages of 15 and 24)
HOMEOW%: Percentage of Home Owners
SAMEHO%: Percentage of People Living in the Same House for Over 5 Years
UNEMPL%: Percentage of Unemployment
Murder: Number of Murders
Rape: Number of Rapes
Robbery: Number of Robberies
Auto-Theft: Number of Auto Thefts

Each category is discretized into several intervals to define the binary input/output patterns. The size of each interval is data dependent. The data should be discretized so as to preserve the integrity of the data. Therefore, the regions with high data density should be discretized into smaller intervals to characterize the data more accurately. In general, the granularity of the data analysis increases as the size of the intervals decreases. For each category, we discretized the regions having high data density into smaller intervals to preserve data integrity.

Table 2 represents the discrete intervals for each category. Each interval of each category is represented by a single neuron.

Table 2: Discrete Intervals

Category       Neurons   Intervals
POP (small)    5         [0-4k], (4k-8k], (8k-12k], (12k-16k], (16k-20k]
POP (medium)   5         (20k-40k], (40k-60k], (60k-80k], (80k-90k], (90k-100k]
POP (large)    5         (100k-130k], (130k-160k], (160k-200k], (200k-500k], 500k+
SINGP%         7         [0-5], (5-7], (7-9], (9-11], (11-14], (14-20], (20-100]
MINOR%         6         [0-5], (5-10], (10-20], (20-40], (40-70], (70-100]
YOUNG%         7         [0-12], (12-13], (13-14], (14-15], (15-17], (17-25], (25-100]
HOMEOW%        7         [0-40], (40-50], (50-60], (60-70], (70-80], (80-90], (90-100]
SAMEHO%        6         [0-45], (45-50], (50-55], (55-60], (60-65], (65-100]
UNEMPL%        6         [0-4], (4-6], (6-8], (8-12], (12-20], (20-100]
Murder         4         0, (1-5], (5-10], 10+
Rape           5         0, (1-5], (5-10], (10-70], 70+
Robbery        5         0, (1-5], (5-10], (10-100], 100+
Auto-Theft     5         [0-10], (10-100], (100-500], (500-1000], 1000+

5.1 Neural Network Architecture

For each crime type, three different feedforward neural networks with a single hidden layer are trained, using the modified back-propagation algorithm, for the three groups of cities (small, medium, and large), to an accuracy of 99% or higher. There are twelve neural networks in total. Each network consists of 44 input layer neurons, 60 hidden layer neurons, and 4 or 5 output layer neurons. The number of hidden layer neurons is arbitrary; we experimented with a wide range of values and chose the value that resulted in the best network recall performance. After the training phase, the networks are pruned and clustered. Although, for each network, about 30% of the connections as well as about 5% of the hidden layer neurons were pruned, none of the input neurons were pruned. This reflects the importance of all the demographic categories we used for discovering trends in crimes. After phases two and three, all twelve networks maintain an accuracy rate of 99% or higher. The networks were then used as tools to discover the existing, as well as predicted, trends.

5.2 Discovering Existing Trends

The existing trends are those that are present in the existing dataset. Therefore, all input patterns in the first dataset are used as input to the trained and modified networks to extract existing trends. For each type of crime, several trends were discovered; some of the discovered trends are represented in the following three tables. For each type of crime, either a lower-bound or an upper-bound percentage of population is calculated. The support percentage for each discovered trend represents the percentage of patterns in the first dataset that are similar to the trend. This is used for verification of the discovered trends and reflects the strength of each trend.

The dataset for small cities consists of 4706 records. Table 3 includes some of the existing trends discovered for small cities. For example, cities having a population between 4000 and 8000, minority between 0% and 10%, unemployment less than 6%, single-parent households less than 9%, people living in the same house for more than five years less than 55%, young people between 12% and 13%, and home owners between 60% and 70%, have an annual robbery rate of 1 to 5. This trend represents 30% of the patterns in the first dataset.

Table 3: The Existing Trends for Small Cities

Support%      30%      30%      30%        30%
POP           0-4k     0-4k     4k-8k      4k-12k
MINOR%        0-5      20-40    0-10       0-5
UNEMPL%       0-4      8-12     0-6        0-4
SINGP%        7-9      7-9      0-9        7-11
SAMEHO%       50-55    60-65    0-55       55-60
YOUNG%        14-15    0-12     12-13      13-14
HOMEOW%       70-80    70-80    60-70      60-70
Crime         Murder   Rape     Robbery    Auto-Theft
Crime Rate    0        0        1-5        1-5
Population%   0%       0%       0.00125%   0.00125%

The dataset for medium cities consists of 1193 records. Table 4 includes some of the existing trends discovered for medium cities. For example, cities having a population between 20000 and 40000, minority less than 5%, unemployment of 4% to 6%, single-parent households less than 5%, people living in the same house for more than five years between 65% and 100%, young people between 12% and 13%, and home owners between 80% and 90%, have an annual auto theft rate of 10 to 100. This trend represents 30% of the patterns in the first dataset.

Table 4: The Existing Trends for Medium Cities

Support%      30%        25%        25%        30%
POP           20k-40k    20k-40k    20k-40k    20k-40k
MINOR%        0-5        5-10       5-10       0-5
UNEMPL%       0-6        4-6        0-4        4-6
SINGP%        9-11       11-14      7-9        0-5
SAMEHO%       45-50      45-50      0-45       65-100
YOUNG%        12-13      14-15      17-25      12-13
HOMEOW%       60-90      40-50      60-70      80-90
Crime         Murder     Rape       Robbery    Auto-Theft
Crime Rate    0-5        1-10       5-10       10-100
Population%   0.00025%   0.0005%    0.0005%    0.005%

The dataset for large cities consists of 201 records. Table 5 includes some of the existing trends discovered for large cities. For example, cities having a population between 200000 and 500000, minority between 20% and 40%, unemployment of 8% to 12%, single-parent households between 11% and 14%, people living in the same house for more than five years between 50% and 55%, young people between 15% and 17%, and home owners between 50% and 60%, have an annual murder rate of 5 to 10. This trend represents 25% of the patterns in the first dataset.

Table 5: The Existing Trends for Large Cities

Support%      25%         25%         30%         20%
POP           200k-500k   200k-500k   130k-200k   100k-130k
MINOR%        20-40       40-70       20-70       10-20
UNEMPL%       8-12        8-12        6-8         0-4
SINGP%        11-14       14-20       11-14       5-7
SAMEHO%       50-55       50-55       45-50       0-45
YOUNG%        15-17       15-17       15-25       0-12
HOMEOW%       50-60       50-60       50-60       70-80
Crime         Murder      Rape        Robbery     Auto-Theft
Crime Rate    5-10        10-70       100+        500-1000
Population%   0.00005%    0.00035%    0.000769%   0.01%

5.3 Discovering Predicted Trends

The predicted trends are those that reflect the generalization property of the trained neural networks. Therefore, all possible permutations of the input categories are used as input to the trained and modified networks to extract predicted trends based on existing correlations. Since all possible permutations of the input categories include the existing dataset, the discovered trends in this step naturally include the existing trends that were discovered in the previous step. After removing the already discovered trends, the remaining trends constitute the predicted trends. The following three tables include predicted trends for the three types of cities (small, medium, and large). The predicted trends describe cities having certain demographic characteristics and their crime rates. These cities did not exist in the first dataset; however, if such cities were to exist, they could expect to have such crime rates. We used the second dataset to verify the validity of the predicted trends.

The predicted trends are reasonable expectations that result from the generalizability property of trained neural networks. The consistency percentage indicates the percentage of patterns in the second dataset (future data) that are similar to the predicted trend. A predicted trend will be consistent if all the necessary conditions of the trend become available. On the other hand, if the consistency of a predicted trend is 0%, this indicates that the necessary conditions required for this trend are not currently present. However, it is still possible for this predicted trend to be consistent if the required conditions become available. Therefore, the predicted trends can be used for monitoring the environment.

Table 6 includes some of the predicted trends discovered for small cities. For example, if cities having a population less than 4000, minority between 10% and 20%, unemployment of 4% to 6%, single-parent households of 9% to 11%, people living in the same house for more than five years between 65% and 100%, young people between 17% and 25%, and home owners between 40% and 50% were to exist, they would have an annual robbery rate of 1 to 5. This prediction is consistent with 5% of the patterns in the second dataset.

Table 6: The Predicted Trends for Small Cities

Consistency%   30%      10%        5%        0%
POP            0-4k     4k-8k      0-4k      4k-8k
MINOR%         0-5      5-10       10-20     5-10
UNEMPL%        4-6      4-6        4-6       20-40
SINGP%         5-7      5-7        9-11      7-9
SAMEHO%        45-50    60-65      65-100    60-65
YOUNG%         25-40    15-17      17-25     17-25
HOMEOW%        60-70    40-50      40-50     60-70
Crime          Murder   Rape       Robbery   Auto-Theft
Crime Rate     0        1-5        1-5       1-5
Population%    0%       0.00125%   0.005%    0.00125%

Table 7 includes some of the predicted trends discovered for medium cities. For example, if cities having a population between 20000 and 40000, minority between 70% and 100%, unemployment of 6% to 8%, single-parent households between 5% and 7%, people living in the same house for more than five years between 50% and 55%, young people between 13% and 14%, and home owners between 40% and 50% were to exist, they would have an annual rape rate of 1 to 10. This prediction is consistent with 12% of the patterns in the second dataset. It is worth noting that, for medium cities, no trend was predicted for auto theft.

Table 7: The Predicted Trends for Medium Cities

Consistency%   25%       12%       0%
POP            20k-40k   20k-40k   20k-40k
MINOR%         0-5       70-100    5-10
UNEMPL%        0-6       6-8       8-12
SINGP%         5-7       5-7       11-14
SAMEHO%        50-60     50-55     50-55
YOUNG%         12-13     13-14     14-15
HOMEOW%        80-90     40-50     40-50
Crime          Murder    Rape      Robbery
Crime Rate     0         1-10      10-100
Population%    0%        0.0005%   0.005%

(Auto-Theft: no strong predictor)

Table 8 includes some of the predicted trends discovered for large cities. For example, if cities having a population between 200000 and 500000, minority between 40% and 100%, unemployment of 8% to 12%, single-parent households between 11% and 14%, people living in the same house for more than five years between 50% and 55%, young people less than 12%, and home owners between 60% and 70% were to exist, they would have an annual auto theft rate of more than 1000. This prediction is consistent with 20% of the patterns in the second dataset.

Table 8: The Predicted Trends for Large Cities

Consistency%   11%         8%          10%         20%
POP            200k-500k   200k-500k   200k-500k   200k-500k
MINOR%         20-40       70-100      40-70       40-100
UNEMPL%        8-12        20-40       6-8         8-12
SINGP%         11-14       11-14       11-14       11-14
SAMEHO%        50-55       50-55       45-50       50-55
YOUNG%         15-17       13-14       17-25       0-12
HOMEOW%        50-60       60-70       50-60       60-70
Crime          Murder      Rape        Robbery     Auto-Theft
Crime Rate     5-10        70+         100+        1000+
Population%    0.00005%    0.00035%    0.0005%     0.005%

6 Conclusions

For each group of cities (small, medium, large), we were able to discover the existing trends for each type of crime (murder, rape, robbery, auto theft). These trends represent the hidden knowledge and are based on the high level of commonality inherent in the dataset. The desired level of commonality can be regulated through the control parameters. We were able to demonstrate the validity of the existing trends by determining the percentage of patterns in the first dataset that support each trend. In addition, by using the generalizability feature of neural networks, we were able to discover predicted trends. These predicted trends describe the demographic characteristics and crime rates of cities that do not exist in the dataset but could possibly exist. Once again, we were able to verify the validity of the predicted trends by determining the percentage of patterns in the second dataset (future data) that is consistent with each predicted trend.

The knowledge discovery technique offers two unique features that are not available in other knowledge discovery techniques. First, the control parameters provide a means to set the desired level of confidence for extracting existing and predicted trends. Second, the predicted trends provide reasonable expectations that can be used for monitoring the environment. The knowledge discovery technique can be applied to any application domain that deals with vast amounts of data, such as medicine, the military, business, and security. In medical fields, the data gathered from cancer patients can be used to discover the dominating factors and trends in the development of cancer. In military fields, the data gathered from the enemy can be used to predict their future movements. In business environments, the data gathered from customers can be used to model the transaction activities of the customers. In security applications, the data gathered can be used to predict and prevent potential intrusions.

7 References

[1] Joseph P. Bigus, Data Mining With Neural Networks: Solving Business Problems from Application Development to Decision Support, McGraw-Hill, NY, 1996.
[2] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, San Francisco, California, 2001.
[3] Robert Andrews, Joachim Diederich, and Alan Tickle, "A Survey and Critique of Techniques for Extracting Rules from Trained Artificial Neural Networks," Knowledge-Based Systems, Vol. 8, No. 6, pp. 373-389, 1995.
[4] L. Bochereau and P. Bourgine, "Extraction of semantic features and logical rules from a multilayer neural network," International Joint Conference on Neural Networks, Washington DC, Vol. 2, pp. 579-582, 1990.
[5] G. Towell and J. Shavlik, "The Extraction of Refined Rules From Knowledge Based Neural Networks," Machine Learning, Vol. 13, No. 1, pp. 71-101, 1993.
[6] R. Andrews and S. Geva, "Rule extraction from a constrained error back propagation MLP," Proc. 5th Australian Conference on Neural Networks, Brisbane, Queensland, pp. 9-12, 1994.
[7] S. B. Thrun, "Extracting Provably Correct Rules From Artificial Neural Networks," Technical Report IAI-TR-93-5, Institut fur Informatik III, Universitat Bonn, 1994.

[8] S. Sestito and T. Dillon, "Automated knowledge acquisition of rules with continuously valued attributes," Proc. 12th International Conference on Expert Systems and their Applications (AVIGNON 92), pp. 645-656, May 1992.
[9] M. W. Craven and J. W. Shavlik, "Using sampling and queries to extract rules from trained neural networks," Machine Learning: Proceedings of the Eleventh International Conference, pp. 176-183, 1994.
[10] A. B. Tickle, M. Orlowski, and J. Diederich, "DEDEC: decision detection by rule extraction from neural networks," QUT NRC, September 1994.
[11] Rudy Setiono, H. Lu, and H. Liu, "Effective data mining using neural networks," IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, pp. 957-961, December 1996.
[12] Rudy Setiono, "Extracting Rules from Pruned Neural Networks for Breast Cancer Diagnosis," Artificial Intelligence in Medicine, Vol. 8, No. 1, pp. 37-51, February 1996.
[13] Amit Gupta, Sang Park, and Siuva M. Lam, "Generalized Analytic Rule Extraction for Feedforward Neural Networks," IEEE Transactions on Knowledge and Data Engineering, Vol. 11, pp. 985-991, 1998.
[14] Jason T. L. Wang, Qicheng Ma, Dennis Shasha, and Cathy H. Wu, "Application of Neural Networks to Biological Data Mining: A Case Study in Protein Sequence Classification," The Sixth ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 305-309, August 20-23, 2000, Boston, MA, USA.
[15] Khosrow Kaikhah and Sandesh Doddameti, "Knowledge Discovery Using Neural Networks," Seventeenth International Conference on Industrial & Engineering Applications of Artificial Intelligence & Expert Systems (IEA/AIE), pp. 20-28, May 2004.
[16] Kishan Mehrotra, Chilukuri K. Mohan, and Sanjay Ranka, Elements of Artificial Neural Networks (Complex Adaptive Systems), Cambridge, MA: MIT Press, 1997.