Czech Technical University in Prague
Faculty of Electrical Engineering

Fully Automated Knowledge Extraction using Group of Adaptive Models Evolution

by Pavel Kordík

A thesis submitted to the Faculty of Electrical Engineering, Czech Technical University in Prague, in partial fulfilment of the requirements for the degree of Doctor.

PhD program: Electrical Engineering and Information Technology

September 2006
Thesis Supervisor: Miroslav Šnorek
Department of Computer Science and Engineering
Faculty of Electrical Engineering
Czech Technical University in Prague

Copyright © 2006 by Pavel Kordík
Abstract and contributions

Keywords such as data mining (DM) and knowledge discovery (KD) have appeared in several thousand articles in recent years. Such popularity is driven mainly by the demand of private companies. They need to analyze their data effectively to obtain new, useful knowledge that can be capitalized. This process is called knowledge discovery, and data mining is a crucial part of it. Although several methods and algorithms for data mining have been developed, there are still many gaps to fill. The problem is that real-world data are so diverse that no universal algorithm has been developed to mine all data effectively. Moreover, the stages of the knowledge discovery process require the full-time assistance of an expert on data preprocessing, data mining and knowledge extraction. These problems can be addressed by a KD environment capable of automatic data preprocessing, generation of regression and prediction models and classifiers, automatic identification of interesting relationships in data (even in complex and high-dimensional data sets) and presentation of the discovered knowledge in a comprehensible form. In order to develop such an environment, this thesis focuses on research of methods in the areas of data preprocessing, data mining and information visualization. The Group of Adaptive Models Evolution (GAME) is a data mining engine able to adapt itself and perform optimally on a large (but still limited) group of real-world data sets. The Fully Automated Knowledge Extraction using GAME (FAKE GAME) framework is proposed to automate the KD process and to eliminate the need for the assistance of a data mining expert. The GAME engine is the only GMDH-type algorithm capable of solving very complex problems (as demonstrated on the Spiral data benchmarking problem). It can handle irrelevant inputs and short, noisy data samples. It uses an evolutionary algorithm to find the optimal topology of models. Ensemble techniques are employed to estimate the quality and credibility of GAME models. Within the FAKE framework we designed and implemented several modules for data preprocessing, knowledge extraction and visual knowledge discovery.

Keywords: Data Mining, Knowledge Discovery, GMDH, Continuous Optimization, Niching Genetic Algorithm, Ensemble Models, Data Preprocessing, Visualization, Feature Ranking
Acknowledgements

First of all, I would like to express my gratitude to my thesis supervisor, Dr. Miroslav Šnorek. He managed to create a great environment where many ideas arise, evolve and are shared by people who are interested in soft computing. Thank you, Mirek, also for your personality. I thank prof. Jiřina, who read an early version of this thesis; his comments fundamentally influenced it. Thanks also to prof. Tvrdík for his constructive comments and for pushing me to finish the thesis. Thanks to the GMDH community (prof. Ivakhnenko, his son Gregorij, prof. Stepasko, many other GMDH people from Kiev and Dr. Frank Lemke from Berlin) for always being friendly and willing to discuss new ideas. Thanks to Phil Prendergast from Hort Research New Zealand for giving me the chance to model mandarin tree water consumption instead of picking mandarins. This initiated my interest in real-world applications of GMDH theory. Thanks to my friends from our department for their willingness to stay, collaborate on the research and not complain about bad weather during canoeing trips. I would like to thank the following people, who collaborate(d) on the FAKE GAME project:
Jan Saidl - classification plots, scatterplot matrix, GA search for interesting projections
Miroslav Čepek - application of the GAME engine to sleep stages classification data
Jiří Novák - application of the GAME engine to signal filtering
Jiří Kopsa - 3D visualization of classification boundaries
Jiří Nožka - 3D visualization of regression manifolds
Jan Šimáček - module for distribution transformation to the uniform one and back
Tomáš Černý - module for missing values imputation
Lukaš Trlida - parser for the PMML standard of GMDH models
Oleg Kovařík - ACO* and CACO optimization methods
Samuel Ferenčík - HGAPSO optimization method
Jan Drchal - Java implementation of the SADE algorithm
Miroslav Jánošík - DE (version 1) optimization method
Ondřej Filípek - SCG, OS, SOS, palde training modules
Aleš Pilný - SinusNeuron unit and experiments with the optimal Sin transfer function
Ondřej Zicha - PolyFractNeuron unit with rational transfer function
Michal Semler - ExpNeuron unit with exponential transfer function
David Sedláček - visualization of GAME models topology (connections)
Pavel Staněk - experiments with GAME engine settings (niching enabled/disabled)
Finally, my greatest thanks to my father, sister and our big family, whose support was of crucial importance.
Dedication

To my wife Jana and our daughter Anička.
Contents

1 Introduction: Problem statement; Goals of the thesis; Contributions of the thesis; Organization of the thesis
2 Background and survey of the state-of-the-art: The theory related to FAKE (Automated data preprocessing: Dealing with alpha variables, Imputing missing values, Data normalization, Distribution transformation, Data reduction; Visual data mining; Credibility estimation; Feature ranking); The theory related to GAME (Inductive modeling and the GMDH: The philosophy behind inductive modeling, Group method of data handling (GMDH), The state of the art in the GMDH related research; Neural networks; Optimization methods: Quasi-Newton method (QN), Conjugate gradient method (CG), Orthogonal Search (OS), Genetic algorithms (GA), Niching methods in the evolutionary computation, Differential Evolution (DE), Simplified Atavistic Differential Evolution (SADE), Particle swarm optimization (PSO), Ant colony optimization (ACO), Hybrid of the GA and the particle swarm optimization (HGAPSO); Ensemble methods: Bagging, Bias-variance decomposition, Simple ensemble and weighted ensemble, Ensembles - state of the art; Previous results and related work)
3 Overview of our approach - the FAKE GAME framework: The goal of the FAKE GAME environment; Research of methods in the area of data preprocessing; Automated data mining; Knowledge extraction and information visualization
4 The design of the GAME engine and related results: Heterogeneous units; Experiments with heterogeneous units; Optimization of GAME units (The analytic gradient of the Gaussian unit; The analytic gradient of the Sine unit; The experiment: analytic gradient saves error function evaluations); Heterogeneous learning methods; Experiments with heterogeneous learning methods; Structural innovations (Growth from a minimal form; Interlayer connections); Regularization in GAME (Regularization of Combi units on real world data; Evaluation of regularization criteria); Genetic algorithm (Niching methods; Evaluation of the distance computation; The performance tests of the Niching GA versus the Regular GA; The inheritance of unit properties - experimental results); Evolving units (active neurons) (CombiNeuron - evolving polynomial unit); Ensemble techniques in GAME; Benchmarking the GAME engine (Internet advertisements; Pima Indians data set; Spiral data benchmark); Summary
5 The FAKE interface and related results: Automated data preprocessing (Imputing missing values; Distribution transformation: The design of the transformation function, Experiments with artificial data sets, Mandarin data set distribution transformation; Data reduction); Knowledge extraction and information visualization (Math formula extraction; Feature ranking: Extracting significance of features from the niching GA used in GAME; Relationship of variables: Relationships in the Dyslexia data set, Relationships in the Building data set; Boundaries of classes: Classification boundaries and regression plots in 3D, GAME classifiers in the scatterplot matrix; Credibility estimation of GAME models: Credibility estimation - artificial data, Credibility estimation - real world data, Uncertainty signaling for visual knowledge mining, Credibility of GAME classifiers; The search for interesting behavior: Ensembling: what do we mean by interesting behavior?, Evolutionary search on simple synthetic data, Experiments with diversity, Study with more complex synthetic data, Experiments with real world data)
6 Applications of the FAKE GAME framework: Noise cancelation by means of GAME (Finite impulse response filter (FIR); Replacing FIR by the GAME network; Experiment with synthetic data); Sleep stages classification using the GAME engine (GMDH for classification purposes; Data acquisition and preprocessing; Classification of sleep stages; The configuration of the GAME engine; The configuration of WEKA methods; Comparison of different methods; Experiments with GAME configurations)
7 Summary and conclusions
8 Suggestions for the future work
9 Bibliography
10 Publications of the author
A Standardization of GMDH clones (A.1 PMML description of GMDH type polynomial networks)
B Data sets used in this thesis (B.1 Building data set; B.2 Boston data set; B.3 Mandarin data set; B.4 Dyslexia data set; B.5 Antro data set; B.6 UCI data sets)
C The FAKE GAME environment (C.1 The GAME engine - running the application: C.1.1 Loading and saving data, models; C.1.2 How to build models; C.2 Units so far implemented in the GAME engine; C.3 Optimization methods in the GAME engine; C.4 Configuration options of the GAME engine; C.5 Visual knowledge extraction support in GAME)
D Results of additional experiments
List of Figures

1.1 Fully Automated Knowledge Extraction (FAKE) using Group of Adaptive Models Evolution (GAME)
Real world data set with missing values and alpha variables
Pixel oriented scatterplot for variables relationship analysis
Some climatic models also use ensembling to estimate the uncertainty of the prediction
The original MIA GMDH network
Models are trained for the same task and then combined
The Bagging scheme: Models are constructed on bootstrapped samples
The Bias-Variance Decomposition
Reducing variance and bias part of the error by ensembling
The comparison of GMDH methods with various levels of modification
FAKE GAME environment for the automated knowledge extraction
Group of Adaptive Models Evolution (GAME)
The comparison: original MIA GMDH network and the GAME network
List of units implemented in the FAKE GAME environment
Units competition on the Building data set
Units competition on the Spiral data set
The process of GAME units optimization
Exponentially growing computational complexity eliminated by the analytic gradient
Learning methods on the Ecoli data set
Learning methods on the Boston data set
Learning methods on the Building data set
Models complexity, a minimum of the regularization criterion and the noise
The expected and the measured error for different regularization and noise
Regularization of the Combi units on the Antro data set
Regularization of the Combi units on the Building data set
The validation on both the training and the validation set is better for complex data with low noise
The relative performance of regularization methods on various levels of noise
Encoding of units when optimized in GAME layers
4.18 Regular GA versus Niching GA: non-correlated units can be preserved
The fitness can be higher when non-correlated inputs are used
The distance of two units in the GAME network
The visualization of units correlation and distances during the evolution
Results of experiment with different distance computation between GAME units
The GA versus the Niching GA with DC: the experiment proved our assumptions
For the WBHW and WBE variables, the GA with DC is significantly better than the regular GA
For simple data set, GA with DC attained a superior performance in all output variables
The fifty percent inheritance level is a reasonable choice for all three data sets
Encoding of the transfer function for the CombiNeuron unit
Bagging the GAME models
The ensemble of suboptimal models tops their accuracy
For optimally trained models the ensemble has no superior performance
Ensemble of two models exhibiting diverse errors can provide significantly better result
Two GAME networks solving the intertwined spirals problem
The comparison of imputing methods on the Stock market prediction data set
The principle of the distribution transformation
How to create an artificial distribution function
For data that are closer to uniform distribution, results are not significant
Histograms of the original data set and the data set transformed by the artificial distribution function
For highly non-uniform distribution, models built on the transformed data are significantly better
The artificial distribution functions for input features of the Mandarin data set
The scatterplot matrix of the Mandarin data before and after the transformation
How to extract the math equation from the GAME model
The extraction of a math formula from the GAME model on the Anthrop data
The number of units connected to a particular input signifies its importance
The feature ranking derived during the construction of the GAME inductive model
Visualizing relationship of variables derived by a model
Data projection into a regression plot
With data vectors displayed, quality of models can be evaluated
5.16 An overfitted nonlinear inductive model on the Dyslexia data set
The group of linear inductive models shows the relation of a reading speed to dyslexia
Relationship plots on the Building data set
The classification plot of the Pima Indians Diabetes data set
Visualization of 3D manifolds representing decision boundaries of a GAME model on the Iris data set
3D visualization of GAME models regression manifolds together with data vectors
The visualization of a class membership into a scatterplot matrix for the Ecoli data set
Responses of GAME models for a testing vector lying in the area insufficiently described by the training data set
The dependence of ρ on dy_i is linear for an artificial data without noise
The dependence of ρ on dy_i is quadratic for real world data
GAME models on the Building data set with the uncertainty signified by the envelope in background
The explanation how to combine ensemble models to get better defined class memberships
When models are multiplied, the classification into classes is better constrained
Multiplication is sensitive to anomalies in model behavior
Interesting behavior of models
Synthetic training data and ensemble models approximating it
The plot of fitness function for all possible individuals
The individual with the highest fitness dominated the population of genetic algorithm
Diversity in population for the standard genetic algorithm
Diversity in population for the niching genetic algorithm (Deterministic Crowding employed)
Three solutions found by the niching genetic algorithm
Input vectors used for generating training data are concentrated in clusters
The best individuals after ten and fifty generations for features x_1, x_2, x
Plots showing the relationship of the feature temp and outputs wbc, wbhw and wbe
The architecture of the FIR filter
The GAME network functioning as a filter
Frequency response of reference signal (left) and input signal (right)
6.4 The signal filtered by GAME network (left) corresponds better to the reference signal
GAME and adaptive FIR doing a feature ranking
An example of the GAME classifier of the REM Sleep Stage
A.1 An example of simple GMDH type model that is described in PMML below
C.1 The configuration of units in the GAME engine
D.1 The behavior of a GAME model consisting of Gaussian units almost resembles fractals
D.2 Characteristic response of the FIR filter
D.3 Characteristic response of the GAME network filter - noise is better inhibited in regions
D.4 The percentage of surviving units in GAME network according to their type (Motol, Ecoli data)
D.5 The type of surviving units - Mandarin and Iris data sets
D.6 Relationship plots on the Antro data set
D.7 The classification of the Spiral data by the GAME model evolved with all units enabled
D.8 Units competition on the Boston data set
D.9 Units competition on the Ecoli data set
D.10 Units competition on the Mandarin data set
D.11 Units competition on the Iris data set
D.12 The performance comparison of learning methods on the Mandarin data set
D.13 Regularization of the Combi units on the Antro data set
D.14 Performance of GAME ensembles on Advertising data set depending on number of member models
D.15 The configuration window of the CombiNeuron unit
1 Introduction

The Group Method of Data Handling (GMDH) was invented by A. G. Ivakhnenko in the late sixties [47]. He was looking for a computational instrument that would allow him to model real-world systems characterized by data with many inputs (dimensions) and few records. Such ill-posed problems could not be solved traditionally (ill-conditioned matrices) and therefore a different approach was needed. Prof. Ivakhnenko proposed the GMDH method, which avoided the solution of ill-conditioned matrices by decomposing them into submatrices of lower dimensionality that could be solved easily. The more important idea behind the GMDH is the adaptive process of combining these submatrices back into the final solution. The original GMDH method is called the Multilayered Iterative Algorithm (MIA GMDH). Many similar GMDH methods based on the principle of induction (problem decomposition and combination of partial results) have been developed since then.

Before the GMDH, the only possibility how to model real-world systems was to manually create a set of math equations mimicking the behavior of the system. This required a lot of time, domain expert knowledge and also experience with the synthesis of math equations. The GMDH allowed automatic generation of a set of these equations. A model of a real-world system can also be created by Data Mining (DM) algorithms, particularly by artificial Neural Networks (NNs). Some DM algorithms, such as decision trees, are simple to understand, whereas NNs often have such a complex structure that they are necessarily treated as black-box models. The MIA GMDH is something in between - it generates polynomial equations which are less comprehensible than a decision tree, but better interpretable than an NN model.

Figure 1.1: Fully Automated Knowledge Extraction (FAKE) using Group of Adaptive Models Evolution (GAME). (The diagram shows the FAKE interface and the GAME engine between automated data preprocessing and knowledge outputs such as feature ranking, math equations, credibility estimation, class boundaries and relationships of variables.)

In this thesis, we propose the Group of Adaptive Models Evolution
(GAME) engine, which evolved from the MIA GMDH. The ensemble of GAME models is more accurate than models generated by the GMDH, and it also outperforms NN models on the data sets we have experimented with. The consequence of improved accuracy is greater complexity and reduced interpretability of models. To deal with this problem, we propose several techniques for knowledge extraction from GAME models.

The goal of this thesis is to propose a framework for Fully Automated Knowledge Extraction (FAKE) using the GAME engine (Figure 1.1). It should help domain experts to extract useful knowledge from real-world data without the assistance of data miners or statisticians. Figure 1.1 shows that data collection, warehousing, integration, etc. are not in the scope of this thesis. These steps cannot be automated in general; they are highly dependent on the specific conditions of the data provider. Within the FAKE interface, we explore methods for automated data preprocessing for the GAME engine. Preprocessing methods are necessary to increase the number of data sets that can be processed by means of the GAME engine. The GAME engine itself is designed to be able to mine data automatically, without experimenting with the proper topology and optimal parameter settings. The knowledge extraction is supported by several methods (information visualization, feature ranking, formula extraction, etc.) belonging to the FAKE interface. The motivation why we propose the FAKE GAME framework can be found in the next Section.

1.1 Problem statement

With the continual development of sensors and computers, the amount of collected data dramatically increases. Data sets obtained from various domains of human activity are growing in size and diversity. Global data repositories include tiny data sets as well as extremely large multivariate ones having several billions of instances. The collected data sets can also differ in the complexity of the problem they describe. Several data sets include missing values and outliers and are affected by noise. In order to extract some useful knowledge from these data, they need to be manually processed first. The traditional techniques for knowledge extraction, like Exploratory Data Analysis followed by statistical techniques, are very demanding in terms of both statistical skills and time. Therefore most of the data collected by companies and institutions remain unexplored, and the valuable knowledge hidden inside is lost.

During the last decade, Data Mining and Knowledge Discovery became hot topics for private companies. They know the price of the knowledge in their data and they do not wish to lose it. Data Mining [32] employs machine learning algorithms to analyze real-world data. Probably the most popular DM methods are artificial neural networks (MLP, etc.). They are popular mainly thanks to their usability as a black box, without knowing how they work inside. They can also often quickly deliver superb results. However, there are also problems with DM methods and particularly with NNs. Some of the biggest drawbacks are listed below.

The first drawback is that one has to be experienced in NN theory to be able to get really
reliable results. One has to choose a proper neural network architecture and a suitable learning paradigm, preprocess the data for the neural network and interpret the results. Using NNs as a black box is often a source of serious mistakes (data overfitting, recalling patterns far from the training data, etc.).

The second drawback is that the knowledge of the system is hidden in the network weights (black box), as opposed to DM methods that generate formulas interpretable by man. A neural network can of course also be written in the form of a math equation, but even for simple networks the equation is several pages long and contains nested nonlinear functions.

The third major drawback is that users do not know when they can trust a neural network. Whereas for some input patterns the response is perfect, for other patterns the output is far from the target value. Especially for real-world data, it is very hard to distinguish regions where the NN is trained well from regions where the output is almost random.

This thesis targets all these drawbacks. We build the FAKE GAME environment to automate the process of knowledge extraction from data. Firstly, a user of the FAKE GAME environment does not need to be experienced in the theory of neural networks or the GMDH. The GAME engine incorporates a genetic algorithm to evolve optimal models with proper types of units (hybrid models) and optimal learning methods. GAME models grow during the learning process (their topology is not given in advance), so a model size proportional to the complexity of the problem is guaranteed. Secondly, the models evolved by GAME can be written in the form of math equations. To overcome the black-box disadvantage of more complex models, we use the visualization of model behavior. Within the FAKE interface, we developed several visualization techniques that can be directly used for visual knowledge mining. Plots showing model behavior are very useful especially for complex systems, where math equations are not interpretable any more. We also automated the search for interesting plots of variable relationships in the multidimensional space. The third above-mentioned drawback is the questionable plausibility of a neural network. The problem is that an NN can give plausible output just for cases it has been successfully trained for. Good output of the NN model cannot be ensured by constraining its input features or by computing the distance from the training data. The GAME engine solves this problem by evolving an ensemble of diverse models. The more they differ in their responses to the same input, the less plausible their output is. The difference in responses of ensemble models is also used in the definition of the interesting behavior of models (see Section 5.2.8).

1.2 Goals of the thesis

The goals of this thesis are the following:
- Propose the FAKE GAME framework.
- Describe the GAME engine.
- Evaluate the functionality of the improvements proposed.
- Benchmark the GAME engine against other data mining tools.
- Describe the FAKE interface.
- Present some applications of the described techniques.

1.3 Contributions of the thesis

The original results presented in this thesis are:
- The FAKE GAME framework for automated knowledge extraction from data.
- The GAME engine for automated data mining (it evolves ensembles of models for the purpose of regression, prediction and classification).
- Heterogeneous GAME units (hybrid models perform better than uniform ones) - Section 4.1.
- Optimization of GAME units (the analytic gradient significantly reduces the number of error evaluation calls needed to reach the optimum) - Section 4.2.
- Heterogeneous learning methods (several optimization methods compared on diverse real-world problems) - Section 4.3.
- Regularization of GAME units (regularization prevented the CombiNeuron unit from overfitting noisy data) - Section 4.5.
- A genetic algorithm evolves GAME units layer by layer (it evolves input connections as well as transfer functions, properties and the type of learning method used) - Sections 4.6, 4.7.
- Niching scheme employed (maintaining diversity increased the accuracy of models) - Section
- Ensembles of GAME models generated (the ensemble response is more reliable, and often more accurate, than single models) - Section 4.8.
- GAME benchmarks: GAME achieved superior results in all benchmarks we have performed; it solved the intertwined spirals problem as the only GMDH-type method we know of - Section 4.9.
- The FAKE interface, consisting of modules for automated data preprocessing and knowledge extraction support.
- Missing values imputing (promising performance of the Euclidean distance neighbor replacement method) - Section
- Transformation to uniform distribution (transformation of data using an artificial distribution function significantly improved the accuracy of GAME models on a simple synthetic data set) - Section
- Math formula extraction (regularized CombiNeuron GAME models can be serialized into simple polynomial equations suitable for knowledge extraction) - Section
- Feature ranking (three novel algorithms for the feature ranking) - Sections 5.2.2, ,
- Regression plots allow studying the relationship of variables under particular conditions - Section
- Classification plots enable studying the decision boundaries of classes estimated by models - Section
- Interactive 3D regression (classification) plots are helpful when there is more than one (two) important feature(s) in the modeled system - Section
- For multivariate problems we proposed the scatterplot matrix enriched by information on models' classification boundaries - Section
- The credibility of models was empirically found to be inversely proportional to the dispersion of ensemble models' responses - Section
- We use an ensemble of classifiers, where member models are multiplied, to visualize just the credible areas of class membership - Section
- To locate interesting regression plots in the multidimensional space automatically, we use a genetic search with a specific fitness function - Section
- The GAME engine outperformed the FIR filter and also classifiers from the WEKA environment on the Sleep stages classification problem - Sections 6.1,

1.4 Organization of the thesis

This thesis is organized as follows. After the introduction, the second chapter summarizes the state of the art in the domains connected to this thesis. The chapter is subdivided into two sections. The first section deals with the theory related to the Fully Automated Knowledge Extraction concept. Namely, advances in Data Preprocessing, Visual Data Mining, Feature Ranking and other methods related to the Knowledge Discovery process are briefly mentioned. The second section focuses on the theory related to the core of the FAKE GAME framework - the Group of Adaptive Models Evolution. The state of the art in the areas of Inductive Modeling, Continuous Optimization Techniques, Neural Networks and Ensemble Methods is described in this section. In the third chapter we propose the FAKE GAME framework for automated knowledge extraction. The fourth chapter describes the design of the GAME engine, the core of the framework. Several improvements of state-of-the-art methods and their empirical evaluation can be found in separate sections of this chapter. Benchmarks of the GAME engine conclude this chapter. The FAKE interface is described in chapter five. Several methods are described and their application to real-world problems is demonstrated. Chapter six presents two case studies. The GAME engine and feature ranking methods from the FAKE interface were applied to real-world problems. After the conclusion, future work and bibliography chapters, there are three chapters in the appendix of this thesis. The first proposes the PMML standard for exchanging GMDH models. In the second chapter of the appendix we document the FAKE GAME environment. The last chapter contains additional figures and graphs, extending experiments made within the thesis.
Figure 2.1: An example of a real-world data set with missing values and alpha variables. (The figure shows raw comma-separated records in which unknown values are marked by the "?" symbol.)

2 Background and survey of the state-of-the-art

We divided this chapter into two sections. The first section discusses the background and state of the art related to the FAKE interface. The second section deals with the knowledge needed to understand the process of design and evolution of the GAME engine.

2.1 The theory related to FAKE

Fully Automated Knowledge Extraction is an interface to the GAME engine. It combines methods from several scientific domains. We designed this interface because the extraction of knowledge is a time-consuming task demanding expert skills. For the knowledge extraction, we need a data set describing the behavior of the system under investigation. This raw data set needs to be preprocessed first. The task of data preprocessing is of crucial importance, and it often takes more time to preprocess data than to mine it [8].

Automated data preprocessing

The data preprocessing phase can be divided into several steps [38]. Not all steps necessarily need to be performed. Which steps are required depends on the quality of the data and on the requirements of the data mining method used. To preprocess data for the GAME engine, the following steps can be performed.

Dealing with alpha variables

The FAKE GAME environment works with numerical data. All characters, strings or symbols (alpha variables) have to be encoded into numerical variables. The encoding can be performed automatically (adding a new binary variable for each discovered string in the data set). However, the dimensionality of the resulting data set might become huge when there are many unique strings in the data set. Then it is wise to let the domain expert reduce the dimensionality by utilizing his expert knowledge of the strings (several strings can be encoded in one new variable - e.g. variable Pressure: low = 0.1, medium = 0.4, etc.).
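As a small illustration of the two encoding options, the following Python sketch contrasts automatic binary (one-hot) encoding of an alpha variable with an expert-supplied ordinal mapping. The records, column names and numeric values are hypothetical and are not taken from the FAKE GAME implementation.

```python
# Illustrative sketch only: automatic one-hot encoding of an alpha (string)
# variable versus an expert-supplied ordinal mapping. Names are hypothetical.

records = [
    {"Shape": "COIL",  "Pressure": "low"},
    {"Shape": "SHEET", "Pressure": "medium"},
    {"Shape": "COIL",  "Pressure": "high"},
]

def one_hot(records, column):
    """Automatic encoding: one new binary variable per discovered string."""
    values = sorted({r[column] for r in records})
    rows = [[1.0 if r[column] == v else 0.0 for v in values] for r in records]
    return rows, values

encoded, columns = one_hot(records, "Shape")
print(columns)   # ['COIL', 'SHEET']
print(encoded)   # [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]

# Expert encoding: several strings mapped into a single numerical variable.
pressure_map = {"low": 0.1, "medium": 0.4, "high": 0.9}
pressure = [pressure_map[r["Pressure"]] for r in records]
print(pressure)  # [0.1, 0.4, 0.9]
```

The expert mapping keeps the dimensionality low, while the automatic one-hot option introduces one new column per distinct string.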
Imputing missing values

We assume that missing values have been identified by the domain expert and replaced e.g. by the "?" symbol, as shown in Figure 2.1. Some neural networks, such as the Self Organizing Map (SOM) [55], can work with data containing missing values [23]. Neither the GMDH nor the GAME engine can deal with an incomplete data set. The problem of missing data can be overcome by the following techniques. The easiest way to deal with missing data is to delete the records containing missing values. However, this will not work for data with a higher percentage of missing values. Also, some potentially useful information can be lost by leaving out all records with missing values. A better approach is to fill in the missing values (missing values imputing). We can replace all missing values by zero, or, if we assume that the distribution of the missing values is the same as that of the non-missing values, we can replace them by a mean value. These techniques are fast and simple to implement, but they introduce bias and they do not take into account interrelationships in the data [25]. Another technique is based on similarity measures. It assumes that if two records match in the non-missing values, they probably have all values identical. For ordinal variables, the similarity can be computed e.g. by the Euclidean distance. Missing values are then imputed by the values taken from the most similar records. Even more sophisticated techniques model the relationship between attributes (features) and use these models (Decision Tree, Linear Regression, Neural Networks or GMDH) to impute the missing values. Other methods we can utilize for imputing are Markov Chain Monte Carlo (MCMC) or Propensity Scores, which use an approximate Bayesian bootstrap to estimate missing values from non-missing ones [25].

Data normalization

For neural networks with a nonlinear sigmoid transfer function in neurons (Multi Layered Perceptron, Cascade Correlation Neural Network, etc.), at least the output variable has to be normalized into the [0, 1] range. The GMDH does not require normalization because it uses polynomial transfer functions with an unlimited range. The GAME engine uses normalization to be able to utilize units with limited-range transfer functions, but it can also work without it.

A variable can be normalized by the Min-max normalization $v_{norm} = \frac{v - v_{min}}{v_{max} - v_{min}}$, by the Z-score normalization $v_{norm} = \frac{v - v_{mean}}{v_{stddev}}$, or by the Decimal scaling $v_{norm} = \frac{v}{10^j}$, where $j$ is the smallest integer such that $\max(|v_{norm}|) < 1$ [38]. The Softmax Scaling, which uses a logistic function to project variables into the [0, 1] interval, is also frequently used. According to the article [61], a logarithmic normalization improved clustering results of gene expression data (it was found superior to other normalization techniques such as Min-max scaling). However, on different data, the results are likely to be different.
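The three normalization formulas above map directly to code. The short Python sketch below is illustrative only (it is not the FAKE GAME preprocessing module); the sample values are arbitrary.

```python
import math

def min_max(values):
    """Min-max normalization: maps values into the [0, 1] range."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def z_score(values):
    """Z-score normalization: zero mean, unit standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / std for v in values]

def decimal_scaling(values):
    """Decimal scaling: divide by 10^j, the smallest j with max(|v_norm|) < 1."""
    j = math.floor(math.log10(max(abs(v) for v in values))) + 1
    return [v / 10 ** j for v in values]

v = [2.0, 15.0, 61.0, 132.0, 385.1]
print(min_max(v))
print(z_score(v))
print(decimal_scaling(v))
```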
When normalizing variables, it is important to check whether the data set contains outliers. An outlier is a record that is extremely abnormal and, for certain types of normalization (e.g. Min-max), can project almost all records into a tiny interval around zero. Outliers can be detected, for example, by means of the Semi-Discrete Decomposition (SDD), as described in [68]. The Softmax Scaling, when properly configured, can also deal with outliers.

Distribution transformation

According to [8], the best distribution for data mining is the uniform one. The more the distribution of a data set differs from the uniform distribution, the worse results we are likely to encounter. In this thesis we propose a technique that can transform data from whatever distribution to a distribution close to the uniform one (see Section 5.1.2).

Data reduction

When a data set contains several thousands of records, it is likely that some of the records are redundant and some are almost identical. The more records we use for a data mining method, the more time it consumes. The computational time of certain DM methods grows almost exponentially with an increasing number of records. Therefore some data reduction mechanism has to be applied to large data sets in the preprocessing stage. We need to reduce the data set in volume, but at the same time preserve the knowledge hidden inside. We have implemented an application that reduces the data set in order to obtain a uniform distribution of the output variable (see Section 5.1.3).
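To give a concrete impression of transforming a variable toward a uniform distribution (the idea behind both of the preceding subsections), the sketch below uses a rank-based empirical CDF. This is only a generic stand-in, not the artificial distribution function proposed in Section 5.1.2; the sample data are made up.

```python
# Illustrative sketch of a distribution transformation to (approximately)
# uniform. A rank-based empirical CDF is used here as a stand-in for the
# thesis's artificial distribution function (Section 5.1.2).

def to_uniform(values):
    """Map each value to its empirical CDF value in (0, 1)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    n = len(values)
    uniform = [0.0] * n
    for rank, i in enumerate(order):
        uniform[i] = (rank + 0.5) / n   # ranks scaled into (0, 1)
    return uniform

skewed = [0.1, 0.2, 0.2, 0.3, 5.0, 9.0, 120.0]   # highly non-uniform sample
print(to_uniform(skewed))
# [0.071..., 0.214..., 0.357..., 0.5, 0.642..., 0.785..., 0.928...]
```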
Visual data mining

Real-world data sets are usually of medium size (5+ attributes, 1+ observations). When viewing such a data set as columns of numbers, the knowledge can hardly be extracted. The most straightforward technique making the knowledge accessible is to visualize the data vectors as points in a plot. The problem is the dimensionality of the data set. The scientific discipline dealing with multivariate data is called Information Visualization, and it is becoming very popular. One area of Information Visualization is Visual Data Mining (VDM). The task of VDM [18, 12] is to let the user explore data in the multidimensional system space to find relationships among the system attributes. VDM utilizes several methods to display the data in a form that is easier to comprehend for man. Here, evolutionary computation can also be employed to display data [24]. Plots and scenes are the most effective views for the exploration and the knowledge extraction process. There are several methodologies how these plots or scenes can be rendered.

Figure 2.2: Pixel oriented scatterplot for variables relationship analysis

The common visualization techniques for VDM are:
- Scatterplots
- Parallel coordinates
- Grand tour techniques
- Pixel mapping
- Relationship-based visualization ([99])

Real data sets describing complex systems have many more than three features. Each feature adds one degree of freedom (one dimension) to the data space. It is hard for man to cope with more than a three-dimensional space. Therefore a data set with more than three attributes has to be somehow mapped or projected into two or three dimensions. Techniques that can be used for multidimensional data visualization are listed below:
- Scatterplot matrix
- Parallel coordinate plots (see [44])
- 3-D stereoscopic scatterplots
- Grand tour on all plot devices
- Circle Segments (see [14])
- Density plots
- Linked views
- Saturation brushing
- Pruning and cropping
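As a quick illustration of the first technique in the list above, a scatterplot matrix for a small multivariate data set can be rendered with a few lines of Python using pandas and matplotlib. This is a generic sketch with hypothetical feature names; the FAKE GAME environment has its own (Java-based) visualization modules.

```python
# Illustration only: rendering a scatterplot matrix for a small multivariate
# data set. Feature names and values are hypothetical.
import pandas as pd
from pandas.plotting import scatter_matrix
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "temp":  [10.2, 12.1, 15.3, 9.8, 14.4, 11.0],
    "rh":    [55.0, 60.2, 71.3, 48.9, 66.1, 58.4],
    "wind":  [1.2, 0.4, 2.2, 3.1, 0.9, 1.7],
    "water": [0.8, 1.1, 1.9, 0.5, 1.6, 1.0],
})

scatter_matrix(df, diagonal="hist", figsize=(6, 6))
plt.show()
```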
All these techniques allow studying complex systems by utilizing the information carried by the location of data vectors. The most convenient area of application for these methods is cluster analysis [13]. However, when one would like to study the relationship among input and output variables of the system, the yield of these methods is not very high. Figure 2.2 shows a pixel oriented scatterplot that can be used for variables relationship analysis. The relation between the price and promotion variables is visible, but we can only guess what the exact relationship looks like under particular conditions.

In this thesis we describe an approach to streamline the VDM process. Together with the projection of data vectors, we employ the GAME engine to construct an ensemble of inductive models from these data vectors. The behavior of these models is expressed by the background color in existing scatterplots (see Figures 5.8 and 5.19). The behavior of models enriches pixel oriented scatterplots so the knowledge extraction is faster and more efficient. In our approach, the advantages of inductive models (generalization, dimensionality reduction, error resistance, complex relationship expression, etc.) can be exploited. By employing the ensemble of inductive models we are also able to automatically locate interesting areas of the multidimensional space of system variables (see Section 5.2.8). This can also be used as the projection pursuit for the Data Driven Guided Tours [15]. A very popular method of multidimensional data projection is the Self Organizing Map (SOM) [54].

Credibility estimation

In environmental science, there are attempts to estimate the credibility of models. Researchers combine several models of different kinds and plot the empirical probability of their predictions (Figure 2.3). Another possibility is to test the sensitivity of a model to its initial conditions and parameters. Several models with slightly different values can be produced and the credibility estimated (for further information see [97]). These techniques have a tight connection to the theory of ensembles discussed below. The credibility estimation for GAME models and classifiers is also based on similar ideas.

Figure 2.3: Some climatic models also use ensembling to estimate the uncertainty of the prediction.
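A rough illustration of this ensemble idea is sketched below: several simple models are fitted on random subsets of the data, and the spread of their predictions at a query point serves as an uncertainty (inverse credibility) indicator. The models, data and subset size are toy choices made up for the example; this is not the GAME implementation.

```python
# Illustrative sketch (not the GAME implementation): several simple models
# are fitted on random data subsets; the spread of their predictions at a
# query point serves as a rough credibility/uncertainty indicator.
import random
import statistics

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

def ensemble_predict(xs, ys, x_query, n_models=20, subset=4):
    preds = []
    for _ in range(n_models):
        idx = random.sample(range(len(xs)), k=subset)      # random data subset
        a, b = fit_line([xs[i] for i in idx], [ys[i] for i in idx])
        preds.append(a * x_query + b)
    return statistics.mean(preds), statistics.stdev(preds)  # spread ~ uncertainty

xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0.1, 0.9, 2.2, 2.8, 4.1, 5.2]
print(ensemble_predict(xs, ys, 2.5))    # inside the training region: small spread
print(ensemble_predict(xs, ys, 15.0))   # far from the training data: larger spread
```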
Feature ranking

When modeling a real-world system, it is necessary to preselect a set of features from the available information that may have an impact on the behavior of the system. By recording these features in particular cases, a data set suitable for modelling can be produced [53]. The goal of feature selection is to avoid selecting more or fewer variables than necessary. In practical applications, it is impossible to obtain a complete set of relevant variables. Siedlecki and Sklansky [91] used genetic algorithms for feature selection by encoding the initial set of n variables into a chromosome, where 1 and 0 represent the presence and absence, respectively, of variables in the final subset. They used classification accuracy as the fitness function and obtained good neural network results. These methods are usually used as a data preprocessing tool for further system modeling and classification by means of neural networks. Some neural networks are very sensitive to the presence of irrelevant features in a data set. The GAME engine, on the other hand, is designed to deal with irrelevant features (by ignoring them). Some features can be relevant just in a small subspace of the state space of all input features. Later in this thesis, we present three algorithms ranking features according to their significance (see Sections 5.2.2, and ).

2.2 The theory related to GAME

This section describes the core algorithm for building the ensemble of inductive models - the GAME engine. The models are subsequently used in the FAKE GAME knowledge extraction process. Firstly, the theory of inductive modeling has to be explained. It is necessary for the reader of this thesis to understand where the GAME method comes from; therefore the related inductive algorithms will be described in a broader scope.

Inductive modeling and the GMDH

Inductive modeling uses machine learning techniques to derive models from data. Deductive modeling, on the other hand, uses domain expertise to derive a mathematical model of a system. There have been many discussions between supporters of both approaches about whose approach is better. The answer is: there are enough data for both of them, but only inductive modeling can be applied massively with limited human resources.

The philosophy behind inductive modeling

The capability of induction is fundamental to human thinking. It is the next human ability that can be utilized in soft computing, besides those of learning and generalization. Induction means gathering small pieces of information, combining them, and using the already collected information at a higher abstraction level to get a complex overview of the studied object or process. Inductive modeling methods utilize the process of induction to construct models of studied systems. The construction process is highly efficient; it starts from a minimal form and the model grows according to the system complexity. It also works well for systems with many inputs.
Figure 2.4: The original MIA GMDH network

Where traditional modeling methods fail due to the curse of dimensionality phenomenon, inductive methods are capable of building reliable models. The problem is decomposed into small subtasks. At first, the information from the most important inputs is analyzed in a subspace of low dimensionality; later, the abstracted information is combined to get a global knowledge of the relationship of the system variables.

Group method of data handling (GMDH)

There are several methods for inductive model construction commonly known as the Group Method of Data Handling (GMDH), introduced by the Ukrainian scientist Ivakhnenko in 1966 [47, 31, 63]. The GMDH theory or polynomial networks are called Statistical Learning Networks [17] in the United States of America. They were developed more or less independently. Their disadvantage is that they do not use the external regularization criterion and can therefore overfit noisy data.

The GAME engine presented in this thesis was inspired by the Multilayered Iterative Algorithm (MIA GMDH). It uses a data set to construct a model of a complex system. The model is represented by a network (see Figure 2.4). Layers of units transfer the input signals to the output of the network. The coefficients of the units' transfer functions are estimated using the data set describing the modeled system. Networks are constructed layer by layer during the learning stage. The original MIA algorithm works as follows. First, an initial population of units with a given polynomial transfer function is generated. Units have two inputs, and therefore all pair-wise combinations of input variables are employed. Then the coefficients of the units' transfer functions are estimated using stepwise regression or any other optimization method. Units are sorted by their error in modeling the output variable. A few of the best performing units are selected and serve as inputs for the next layer. Further layers are generated identically as long as the error of modeling decreases. Which units perform best, and therefore should survive in a layer, is decided using an external criterion [46] of regularity (CR). There are several possible criteria applicable.
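To make the layer-by-layer construction concrete, the following simplified Python sketch builds one MIA-style layer: a quadratic polynomial unit is fitted for every pair of inputs on a training set A, and the units are ranked by their error on an external set B (the criterion of regularity given in Eq. (2.1) below). It is a didactic sketch under simplifying assumptions (least-squares fitting via numpy, fixed number of survivors, no stopping logic), not the GAME or original GMDH code.

```python
# Simplified, illustrative MIA GMDH layer construction (not the GAME code).
# Each candidate unit is a quadratic polynomial of two inputs,
#   y = a0 + a1*xi + a2*xj + a3*xi*xj + a4*xi^2 + a5*xj^2,
# fitted by least squares on set A and ranked by its error on set B.
import itertools
import numpy as np

def design(xi, xj):
    return np.column_stack([np.ones_like(xi), xi, xj, xi * xj, xi ** 2, xj ** 2])

def build_layer(XA, yA, XB, yB, survivors=3):
    candidates = []
    for i, j in itertools.combinations(range(XA.shape[1]), 2):
        coeffs, *_ = np.linalg.lstsq(design(XA[:, i], XA[:, j]), yA, rcond=None)
        out_B = design(XB[:, i], XB[:, j]) @ coeffs
        crit = np.mean((out_B - yB) ** 2)        # external criterion on set B
        out_A = design(XA[:, i], XA[:, j]) @ coeffs
        candidates.append((crit, out_A, out_B))
    candidates.sort(key=lambda c: c[0])
    best = candidates[:survivors]                # best units survive ...
    # ... and their outputs become the inputs of the next layer:
    return (np.column_stack([c[1] for c in best]),
            np.column_stack([c[2] for c in best]),
            best[0][0])

rng = np.random.default_rng(0)
XA, XB = rng.uniform(-1, 1, (80, 4)), rng.uniform(-1, 1, (40, 4))
f = lambda X: X[:, 0] * X[:, 1] + 0.5 * X[:, 2] ** 2
XA_next, XB_next, err = build_layer(XA, f(XA), XB, f(XB))
print(err)   # error of the best unit in the first layer
```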
The most popular is the criterion of regularity based on validation using an external data set:

$AB = \frac{1}{N_B} \sum_{i=1}^{N_B} \left( y_i(A) - d_i \right)^2 \rightarrow \min$    (2.1)

where $y_i(A)$ is the output of a GMDH model trained on the A data set. An additional criterion used to discriminate units that will be deleted is the variation accuracy criterion (VAC) [19, 48]:

$\delta^2 = \frac{\sum_{i=1}^{N} (y_i - d_i)^2}{\sum_{i=1}^{N} (d_i - \bar{d})^2} \rightarrow \min$    (2.2)

where $y_i$ is the output of a GMDH model, $d_i$ is the target variable and $\bar{d}$ is the mean of the target variable. With $\delta^2 < 0.5$ the model is good, and when $\delta^2 > 1$, the modeling failed (the output unit should be deleted). Proper regularization [18] is of crucial importance in the GMDH theory. More information on the GMDH topic can be found in the book by A. Muller and F. Lemke [69] or on the GMDH website [2].

The state of the art in the GMDH related research

Polynomial Neural Networks (PNN) [77] are also GMDH-type networks. The units, called partial descriptions, having different transfer functions of polynomial type [75], are evolved by a genetic algorithm (GA). In [76] the structural optimization of the fuzzy polynomial neural network (FPNN) is realized via a standard GA, whereas in the case of the parametric optimization a standard least-squares-based learning is used. The article [7] uses a GA to optimize the structure of the original MIA GMDH neural network, whereas the coefficients are solved for by the Singular Value Decomposition (SVD) method. A hybrid architecture of the network is employed in the polynomial harmonic GMDH (phgMDH) [72], where the harmonic inputs are passed to a polynomial network whose architecture is built using the MIA GMDH algorithm. In this thesis, we do not limit the use of harmonic functions to the input layer of the GMDH network. Any neuron in the GAME network can have a harmonic transfer function. A novel algorithm based on the GMDH for designing MLP neural networks can be found in [34]. This idea is similar to the one presented in [85], where Cascade Correlation Networks are enhanced by the GMDH. In [72], an iterative gradient descent training algorithm is offered for improving the performance of polynomial neural networks. The Back-Propagation algorithm is derived for multilayered networks with polynomial activation functions. We believe that if powerful enough optimization techniques are used during the construction stage, it is not necessary to readjust parameters after the polynomial network is built. This readjustment of parameters might suggest they were not set optimally. The AIC and PSS criteria are used in the revised GMDH-type algorithm [56] to find the optimal number of neurons and layers of the GMDH networks. Such regularization takes into account just the complexity of the GMDH network. Outputs from neurons in a layer can be highly
correlated, resulting in a redundant GMDH network. We propose a different approach in the Section of this thesis. The recent article [9] introduces a GMDH-based feature ranking and selection algorithm. This algorithm builds GMDH networks of gradual complexity, rewarding features selected by smaller networks. In this thesis we propose three different feature ranking algorithms that can also supply the proportional significance of features.

Neural networks

Neural networks are closely connected to the GMDH theory, although they have a different background. The GMDH evolved from the mathematical description of a system by means of the Kolmogorov-Gabor polynomial [63]. Neural networks were biologically oriented at the beginning. Later, powerful optimization methods for neural networks (Back-Propagation of error) were invented, allowing multilayered networks of neurons (MLP) capable of solving nonlinear problems to be built. It has been shown [59] that MLPs are equivalent to the mathematical description of a system by means of the Kolmogorov theorem^1. In recent times both neural networks and GMDH algorithms are optimized by genetic algorithms, and it is even harder to distinguish the boundary between the two theories. Of course, some neural networks are very different from the GMDH (recurrent, modular, spiking neural networks, etc.) [57].

The article [66] shows that the problem of two intertwined spirals can be successfully solved by the MultiLayered Perceptron (MLP), where the weights are evolved by the Genetic Algorithm. The number of function evaluations is in this case much higher than when using standard Back-Propagation, because the information about the gradient of the error is not utilized in the GA. On the other hand, in some applications the genetic approach gives better results than Back-Propagation [89]. The Cascade Correlation algorithm [3] is capable of solving extremely difficult problems. It performs optimally on the spiral benchmarking problem (a network consisting of less than 2 neurons is generated). According to experiments on real-world data performed in [19], the algorithm has difficulties avoiding premature convergence to complex topological structures. The main advantage of the Cascade Correlation algorithm is also its main disadvantage. It easily solves extremely difficult problems and is therefore likely to overfit. Also, the GAME engine is able to generate models solving the spiral problem only when the built-in validation and regularization mechanisms are disabled. The recent article [85] proposes an improvement of the Cascade Correlation Algorithm [3]. The original algorithm assumes a fully connected network. Each neuron is connected to all features and all previously built neurons. The improvement, called Evolving Cascade Correlation Networks (ECCN) [85], uses techniques from the GMDH theory [47] to choose just the relevant inputs for each neuron. Cascade networks evolved by ECCN overfit data less than fully connected cascade networks. Recently, a very interesting algorithm for designing recurrent neural networks was proposed in [94, 95]. The NeuroEvolution Through Augmenting Topologies (NEAT) algorithm is designed for solving reinforcement learning tasks [93], but it can be applied to supervised learning problems as well.

^1 although the inner functions are very complex and they have an almost fractal character
Similarly to the GMDH, NEAT networks grow from a minimal structure up to the optimal complexity. The topology and also the weights of the NEAT networks are evolved using a niching genetic algorithm [65]. We have applied NEAT to the two intertwined spirals problem [51], but it failed. The reason why NEAT is unable to evolve successful networks solving the spiral problem is probably that a) the chromosome is too big when the architecture and weights of the network are evolved simultaneously, and b) niching alone is unable to protect complex structures.

Optimization methods

The understanding of optimization methods is crucial in both mathematical and machine learning modeling. To build successful models, it is necessary to adjust their parameters. The GAME engine uses optimization methods to adjust the weights and coefficients of units. The optimization of weights and coefficients leads to the following problem of nonlinear programming:

$\min f(\vec{x}), \quad \vec{x} \in R^n,$    (2.3)

where $f(\vec{x})$ is a differentiable function of a vector (weights or coefficients) defined in $R^n$. From the initial value $\vec{x}_0$, the sequence of elements

$\vec{x}_{k+1} = \vec{x}_k + \alpha_k \vec{d}_k$    (2.4)

is computed when iterating $k = 1, 2, 3, \dots$, where $\alpha_k$ is the step length. To be able to achieve convergence, we need to find the proper direction $\vec{d}_k$ of the search in the state space of all possible coefficient values. The easiest way to find this direction is to use some gradient method of steepest descent:

$\vec{d}_k = -\nabla f(\vec{x}_k), \quad \nabla f(\vec{x}_k) = \left( \frac{\partial f(\vec{x}_k)}{\partial x_1}, \dots, \frac{\partial f(\vec{x}_k)}{\partial x_n} \right)^T.$    (2.5)

Theoretical analysis proved that the first order gradient methods are not very effective. Especially when the function $f(\vec{x})$ is simple, many iterations are needed to find the optimal solution. The more effective second order methods compute (or estimate) the second derivatives of $f(\vec{x})$, revealing more information about the extremes of the error function, and they often provide faster convergence to the optimal solution.

Quasi-Newton method (QN)

The most popular second order optimization method of nonlinear programming is the Quasi-Newton method [88, 86]. It computes the search direction $\vec{d}_k$ from equation 2.4 as

$\vec{d}_k = -\left[ \nabla^2 f(\vec{x}_k) \right]^{-1} \nabla f(\vec{x}_k)$    (2.6)

so we can expect quadratic convergence of the method, and that is perfect. On the other hand, computing the second derivatives of the function $f$ is both computationally expensive and inaccurate. A better approach is to make a compromise and use the first derivatives of $f$ to compute the search direction $\vec{d}$ more precisely.
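Before the quasi-Newton refinement, the basic first order iteration of Eqs. (2.4)-(2.5) can be shown with a minimal sketch. The error function, starting point and fixed step length below are arbitrary toy choices (not a GAME unit error), intended only to illustrate the update rule.

```python
# Minimal illustration of the iteration x_{k+1} = x_k + alpha_k * d_k with
# the steepest-descent direction d_k = -grad f(x_k) (Eqs. 2.4-2.5).
# Toy error function and fixed step length; not the GAME unit training code.

def f(x):
    return (x[0] - 3.0) ** 2 + 10.0 * (x[1] + 1.0) ** 2

def grad_f(x):
    # gradient of f(x) = (x1 - 3)^2 + 10*(x2 + 1)^2
    return [2.0 * (x[0] - 3.0), 20.0 * (x[1] + 1.0)]

x = [0.0, 0.0]            # initial coefficients x_0
alpha = 0.05              # fixed step length alpha_k
for k in range(200):
    d = [-g for g in grad_f(x)]                       # steepest-descent direction
    x = [xi + alpha * di for xi, di in zip(x, d)]     # update step

print(x, f(x))            # approaches the minimum at (3, -1)
```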
the search direction \vec{d} more precisely. This can be realized e.g. by the following formula

\vec{d}_k = -H_k \nabla f(\vec{x}_k), \quad H_k \approx \left( \frac{\partial^2 f(\vec{x})}{\partial x_i \partial x_j} \right)^{-1}_{i,j=1,\ldots,n},   (2.7)

where H_k \in R^{n \times n} approximates the inverse of the so-called Hessian matrix. This matrix can be built up from first derivatives, provided that all the following assumptions are fulfilled:

1. H_k must be positive definite, given that we start from an H_0 which is positive definite.
2. H_k fulfills the so-called quasi-Newton condition: H_{k+1} \left( \nabla f(\vec{x}_{k+1}) - \nabla f(\vec{x}_k) \right) = \vec{x}_{k+1} - \vec{x}_k.
3. H_{k+1} can be computed from H_k, \vec{x}_{k+1}, \vec{x}_k, \nabla f(\vec{x}_{k+1}), \nabla f(\vec{x}_k) in the following way: H_{k+1} = H_k + \beta_k \vec{u}_k \vec{u}_k^T + \gamma_k \vec{v}_k \vec{v}_k^T, where \beta_k, \gamma_k \in R and \vec{u}_k, \vec{v}_k \in R^n.

There exist several formulas fulfilling these conditions. Very popular is the Davidon-Fletcher-Powell (DFP) formula [83]:

H_{k+1} = H_k + \frac{\vec{p}_k \vec{p}_k^T}{\vec{p}_k^T \vec{q}_k} - \frac{H_k \vec{q}_k \vec{q}_k^T H_k}{\vec{q}_k^T H_k \vec{q}_k},   (2.8)

where \vec{q}_k = \nabla f(\vec{x}_{k+1}) - \nabla f(\vec{x}_k) and \vec{p}_k = \vec{x}_{k+1} - \vec{x}_k. Another popular approach is to construct the approximate Hessian matrix by the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method [16].

In our application, \vec{x} is the vector of weights or coefficients^2 of a unit in the GAME network we are optimizing. The function f(\vec{x}) is the error of the GAME unit on the training data and therefore has to be minimized. By computing the gradient and the (approximate) Hessian matrix of the function f in each learning iteration, we get the optimal direction \vec{d} in the state space of coefficients \vec{x}. From the initial setting of coefficients \vec{x}_0, with a probably large error f(\vec{x}_0) on the training data, we change the coefficients \vec{x}_k of the GAME unit in the direction \vec{d}_k computed from Equation 2.7. After several steps (learning iterations), the error of the GAME unit is much lower than the initial error (f(\vec{x}_n) \ll f(\vec{x}_0)). The number of learning iterations needed to find an optimal solution \vec{x}_n depends on the complexity of the data set and on the transfer function of the GAME unit.

Conjugate gradient method (CG)

The Conjugate Gradient method [17] is an iterative method for solving the linear system Ax = b, where A is an n \times n matrix, A^T = A > 0 (symmetric positive definite), and b \in R^n is given. The pseudocode of the CG algorithm is given below.

^2 Note that we use a different notation here: coefficients of units are labeled a, while the vector of inputs is labeled \vec{x}.
Given \vec{x}_0, generate \vec{x}_k, k = 1, 2, \ldots by:
  r_0 = b - A x_0, p_0 = r_0.
  For k = 0, 1, 2, \ldots:
    If p_k \neq 0, then:
      a_k = (r_k^T r_k) / (p_k^T A p_k)
      x_{k+1} = x_k + a_k p_k
      r_{k+1} = r_k - a_k A p_k
      b_k = (r_{k+1}^T r_{k+1}) / (r_k^T r_k)
      p_{k+1} = r_{k+1} + b_k p_k.
    End If.
  End For.
End.

For a detailed explanation of the CG algorithm principles see [9].

Orthogonal Search (OS)

The Orthogonal Search (OS) optimizes a multivariate problem by selecting one dimension at a time, minimizing the error at each step. The OS can be used [11] to train single-layered neural networks. We use it to minimize a real-valued function of several variables without using the gradient, optimizing the variables one by one. The Stochastic Orthogonal Search differs from OS just by the random selection of variables.

Genetic algorithms (GA)

Genetic Algorithms (GA) [36, 41] are inspired by Darwin's theory of evolution. A population of individuals is evolved according to simple rules of evolution. Each individual has a fitness that is computed from its genetic information. Individuals are crossed and mutated by genetic operators, and the fittest individuals are selected to survive. After several generations, the mean fitness of the individuals is maximized. The GA can be used as an optimization method, e.g. for learning neural structures or to set up the weights and architecture of an ANN.

The Inductive Genetic Programming (igp) applied to construct Multivariate Trees, as described in [73], has a tight connection to the topic of this thesis (although it uses a different terminology). Multivariate Trees are in fact inductive polynomial models similar to those generated by the Multilayered Iterative Algorithm GMDH. They use second order polynomials (programs) to construct the final model. The topology of the models (trees with polynomials in the internal nodes, features x_i in the leaves and the dependent variable y as the root) is evolved by a Genetic Algorithm. Later, a niching version of the genetic algorithm [71] proved to evolve better performing models. These findings fully correspond to the results obtained by [95] and also to the ones presented in this thesis (see Section 4.6.1). Unfortunately, just maintaining diversity by niching is not enough to preserve more complex topologies that are able to solve very hard problems such as the two intertwined spirals [51]. Our experiments showed [26] that NEAT is unable to evolve models capable of solving the spiral problem. In this case the layer-by-layer evolution as implemented in GAME (see Section
4.6) enables topologies to be evolved that are complex enough. The problem with NEAT is that it evolves the topology and the weights simultaneously. The dimensionality of chromosomes in Topology and Weight Evolving Artificial Neural Networks (TWEANN) [94] grows fast with the increasing complexity of the networks. Solving the spiral problem requires a very complex network with an extremely long chromosome. The curse of dimensionality phenomenon prevents TWEANN networks from solving the spiral problem.

Niching methods in the evolutionary computation

Niching methods [65] extend genetic algorithms to domains that require the location of multiple solutions. They promote the formation and maintenance of stable subpopulations in genetic algorithms (GAs). The niching method Fitness Sharing was recently used to maintain diverse individuals when evolving neural networks [95]. A similar niching method is Deterministic Crowding (DC) [64]. The big advantage of this method is that it does not need any extra parameters, as many others do. The basic idea of deterministic crowding is that an offspring is usually most similar to its parents. The offspring replaces the parent it is most similar to, provided the offspring has a higher fitness. DC works as follows. First it groups all population elements into n/2 pairs. Then it crosses all pairs and mutates the offspring. Each offspring competes against one of the parents that produced it. For each pair of offspring, two sets of parent-child tournaments are possible. DC holds the set of tournaments that forces the most similar elements to compete. Similarity can be measured using either genotypic or phenotypic distances. The pseudocode of a simple genetic algorithm and of the niching GA with deterministic crowding can be compared below.

Genetic algorithm (no niching):
  Generate initial population of n individuals
  repeat for m generations
    repeat n/2 times
      Select two fit individuals p_1, p_2
      Cross them, yielding c_1, c_2
      Apply mutation, yielding c'_1, c'_2
      if f(c'_1) > f(p_1) then replace p_1 with c'_1
      if f(c'_2) > f(p_2) then replace p_2 with c'_2
    end
  end

Niching GA (Deterministic Crowding):
  Generate initial population of n individuals
  repeat for m generations
    repeat n/2 times
      Select two individuals p_1, p_2 randomly
      Cross them, yielding c_1, c_2
      Apply mutation, yielding c'_1, c'_2
      if [d(p_1, c'_1) + d(p_2, c'_2)] <= [d(p_1, c'_2) + d(p_2, c'_1)]
        if f(c'_1) > f(p_1) then replace p_1 with c'_1
        if f(c'_2) > f(p_2) then replace p_2 with c'_2
      else
        if f(c'_2) > f(p_1) then replace p_1 with c'_2
        if f(c'_1) > f(p_2) then replace p_2 with c'_1
    end
  end

The distance between individuals, e.g. d(p_1, c'_1), can be based on their phenotypic or genotypic difference. In the case of neurons, the difference can be computed from the connections of their inputs or from the difference of their weights.
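To make the replacement rule concrete, the following is a minimal code sketch of one deterministic crowding generation on real-valued genomes. The DCIndividual class, its fitness function and the crossover/mutation operators are illustrative placeholders only (this is not the GAME implementation); the tournament logic follows the pseudocode above.

import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Minimal sketch of one Deterministic Crowding generation on real-valued genomes.
 *  The genome encoding, fitness and variation operators are illustrative placeholders. */
class DCIndividual {
    final double[] genome;
    DCIndividual(double[] g) { genome = g; }
    double fitness() {                                      // placeholder fitness: higher is better
        double s = 0;
        for (double v : genome) s -= v * v;
        return s;
    }
    static double dist(DCIndividual a, DCIndividual b) {    // genotypic distance
        double s = 0;
        for (int i = 0; i < a.genome.length; i++) s += Math.abs(a.genome[i] - b.genome[i]);
        return s;
    }
}

class DeterministicCrowdingSketch {
    static final Random RND = new Random();

    static void generation(List<DCIndividual> pop) {
        Collections.shuffle(pop, RND);                       // group the n individuals into n/2 random pairs
        for (int i = 0; i + 1 < pop.size(); i += 2) {
            DCIndividual p1 = pop.get(i), p2 = pop.get(i + 1);
            DCIndividual[] children = cross(p1, p2);         // yields c1, c2
            DCIndividual c1 = mutate(children[0]), c2 = mutate(children[1]);
            // hold the set of tournaments that forces the most similar elements to compete
            if (DCIndividual.dist(p1, c1) + DCIndividual.dist(p2, c2)
                    <= DCIndividual.dist(p1, c2) + DCIndividual.dist(p2, c1)) {
                if (c1.fitness() > p1.fitness()) pop.set(i, c1);
                if (c2.fitness() > p2.fitness()) pop.set(i + 1, c2);
            } else {
                if (c2.fitness() > p1.fitness()) pop.set(i, c2);
                if (c1.fitness() > p2.fitness()) pop.set(i + 1, c1);
            }
        }
    }

    static DCIndividual[] cross(DCIndividual a, DCIndividual b) {   // uniform crossover, two offspring
        double[] g1 = new double[a.genome.length], g2 = new double[a.genome.length];
        for (int i = 0; i < g1.length; i++) {
            boolean swap = RND.nextBoolean();
            g1[i] = swap ? b.genome[i] : a.genome[i];
            g2[i] = swap ? a.genome[i] : b.genome[i];
        }
        return new DCIndividual[] { new DCIndividual(g1), new DCIndividual(g2) };
    }

    static DCIndividual mutate(DCIndividual a) {                     // small Gaussian perturbation of one gene
        double[] g = a.genome.clone();
        g[RND.nextInt(g.length)] += 0.1 * RND.nextGaussian();
        return new DCIndividual(g);
    }
}

For GAME units, the distance would typically be phenotypic or based on the set of connected inputs rather than on a raw genome, as noted above.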
There exist several other niching strategies such as islands, restrictive competition, semantic niching [5], etc.

Differential Evolution (DE)

Differential Evolution (DE) [98] is a genetic algorithm with a special crossover scheme. It adds the weighted difference between two individuals to a third individual. For each individual in the population, an offspring is created using the weighted difference of parent solutions. The offspring replaces the parent in case it is fitter. Otherwise, the parent survives and is copied to the next generation. The pseudocode showing how offspring are created can be found e.g. in [16].

Simplified Atavistic Differential Evolution (SADE)

The Simplified Atavistic Differential Evolution (SADE) algorithm [42] is a genetic algorithm improved by one crossover operator taken from differential evolution. It also prevents premature convergence by using so-called radiation fields. These fields have an increased probability of mutation and they are placed at local minima of the energy function. When individuals reach a radiation field, they are very likely to be strongly mutated. At the same time, the diameter of the radiation field is decreased. The global minimum of the energy is found when the diameter of some radiation field descends to zero. This algorithm was also applied to optimize the weights of a neural network [27].

Particle swarm optimization (PSO)

The entry in Wikipedia, the free encyclopedia, defines the basic, canonical PSO algorithm as follows. Let f : R^m \to R be the objective function. Let there be n particles, each with an associated position x_i \in R^m and velocity v_i \in R^m, i = 1, \ldots, n. Let \hat{x}_i be the current best position of each particle and let \hat{g} be the global best.

  Initialize x_i and v_i for all i. One common choice is to take x_{ij} \sim U[a_j, b_j] and v_i = 0 for all i and j = 1, \ldots, m, where a_j, b_j are the limits of the search domain in each dimension.
  \hat{x}_i \leftarrow x_i, i = 1, \ldots, n. Set \hat{g} to the position with the smallest objective value.
  While not converged:
    For 1 \leq i \leq n:
      x_i \leftarrow x_i + v_i.
      v_i \leftarrow \omega v_i + c_1 r_1 (\hat{x}_i - x_i) + c_2 r_2 (\hat{g} - x_i).
      If f(x_i) < f(\hat{x}_i), \hat{x}_i \leftarrow x_i.
      If f(x_i) < f(\hat{g}), \hat{g} \leftarrow x_i.
    End For.
  End While.
34 2 SECTION 2. BACKGROUND AND SURVEY OF THE STATE-OF-THE-ART Note the following about the above algorithm: ω is an inertial constant. Good values are usually slightly less than 1, c 1 and c 2 are constants that say how much the particle is directed towards good positions. They represent a cognitive and a social component, respectively. Usually, we take c 1, c 2 1, usually, r 1, r 2 U[, 1]. By studying this algorithm, we see that we are essentially carrying out something like a discrete-time simulation where each iteration of it represents a tic of time. The particles communicate information they find about each other by updating their velocities in terms of local and global bests; when a new best is found, the particles will change their positions accordingly so that the new information is broadcast to the swarm. The particles are always drawn back both to their own personal best positions and also to the best position of the entire swarm. They also have stochastic exploration capability via the use of the random constants r 1, r Ant colony optimization (ACO) The Ant colony optimization (ACO) algorithm is primary used for discrete problems (e.g. Traveling Salesman Problem, packet routing). Recently, several approaches have been proposed to extend the application of this algorithm for continuous space problems [13, 14]. We have so far implemented two of them. The first algorithm Continuous Ant colony optimization (CACO) was proposed in [21]. It works as follows. There is an ant nest in a center of a search space. Ants exits the nest in a direction given by quantity of pheromone. When an ant reaches the position of the best ant in the direction, it moves randomly (the step is limited by decreasing diameter of search. If the ant find better solution, it increases the quantity of pheromone in the direction of search [58]. The second algorithm is similarly called Ant Colony Optimization for Continuous Spaces (ACO*) [2]. It was designed for the training of feed forward neural networks. Each ant represents a point in the search space. The position of new ants is computed from the distribution of existing ants in the state space [21, 92] Hybrid of the GA and the particle swarm optimization (HGAPSO) The HGAPSO algorithm was proposed in [49] and it is based on the following ideas. Since PSO and GA both work with a population of solutions, combining the searching abilities of both methods seems to be a good approach. Originally, PSO works based on social adaptation of knowledge, and all individuals are considered to be of the same generation. On the contrary, GA works based on evolution from generation to generation, so the changes of individuals in a single generation are not considered.
35 SECTION 2. BACKGROUND AND SURVEY OF THE STATE-OF-THE-ART 21 Input variables Model 1 Model 2 Model 3 Model N Output variable Output variable combination Ensemble output Figure 2.5: Models are trained for the same task and then combined In the reproduction and crossover operation of GAs, individuals are reproduced or selected as parents directly to the next generation without any enhancement. However, in nature, individuals will grow up and become more suitable to the environment before producing offspring. To incorporate this phenomenon into GA, PSO is adopted to enhance the topranking individuals on each generation. Then, these enhanced individuals are reproduced and selected as parents for crossover operation. Offsprings produced by the enhanced individuals are expected to perform better than some of those in original population, and the poorperformed individuals will be weeded out from generation to generation. For detailed description of the HGAPSO algorithm see [49] Ensemble methods Ensemble techniques [22] are based on the idea that a collection of a finite number of models (eg. neural networks) is trained for the same task. Neural network ensemble [111] is a learning paradigm where a collection of a finite number of neural networks is trained for the same task. It originates from Hansen and Salamons work [39], which shows that the generalization ability of a neural network system can be significantly improved through ensembling a number of neural networks, i.e., training many neural networks and then combining their predictions. Since this technology behaves remarkably well, recently it has become a very hot topic in both neural networks and machine learning communities, and has already been successfully applied to diversified areas such as face recognition, optical character recognition, scientific image analysis, medical diagnosis, seismic signals classification, etc. In general, a neural network ensemble is constructed in two steps, i.e., training a number of component neural networks and then combining the component predictions (see Figure 2.5). As for training component neural networks, the most prevailing approaches are Bagging and Boosting. Boosting generates a series of component neural networks whose training sets are determined by the performance of former ones. Training instances that are wrongly predicted by former networks will play more important roles in the training of later networks. For detailed description of the boosting approach see [33].
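As an illustration of the reweighting idea, the sketch below shows an AdaBoost-style update of instance weights. This is a generic textbook scheme rather than the exact variant referenced in [33], and the Model/Learner interfaces are hypothetical placeholders.

import java.util.Arrays;

/** Minimal sketch of boosting-style reweighting of training instances.
 *  The Learner/Model interfaces are hypothetical placeholders. */
interface Model { int classify(double[] input); }
interface Learner { Model train(double[][] inputs, int[] targets, double[] weights); }

class BoostingSketch {
    static Model[] boost(Learner learner, double[][] x, int[] y, int rounds) {
        int m = x.length;
        double[] w = new double[m];
        Arrays.fill(w, 1.0 / m);                          // start with uniform instance weights
        Model[] ensemble = new Model[rounds];
        for (int t = 0; t < rounds; t++) {
            Model h = learner.train(x, y, w);             // later models see a reweighted training set
            double err = 0;
            for (int j = 0; j < m; j++)
                if (h.classify(x[j]) != y[j]) err += w[j];
            double beta = err / (1.0 - err + 1e-12);      // small error -> strong down-weighting below
            double sum = 0;
            for (int j = 0; j < m; j++) {
                if (h.classify(x[j]) == y[j]) w[j] *= beta;   // correctly predicted instances lose weight
                sum += w[j];
            }
            for (int j = 0; j < m; j++) w[j] /= sum;      // renormalize so the weights form a distribution
            ensemble[t] = h;
        }
        return ensemble;
    }
}

Wrongly predicted instances keep their weight while correctly predicted ones are scaled down, so after renormalization the hard instances dominate the training of the next component model.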
Figure 2.6: The Bagging scheme: models are constructed on bootstrapped samples (drawn from the training data by sampling with replacement) and their outputs are combined by averaging or voting.

Bagging

Bagging is based on bootstrap sampling [45]. It generates several training sets from the original training set and then trains a component neural network on each of those training sets. Not every ensemble of models gives a more accurate prediction. When e.g. identical models are combined, we cannot achieve any improvement. Models in the ensemble must be diverse - they must exhibit diverse errors [22]. Bagging introduces diversity by varying the training data.

Bias-variance decomposition

The theoretical tool for studying how the training data affect the performance of models is the bias-variance decomposition. It decomposes the error of the ensemble into the bias part and the variance part [22]. Bias is the part of the expected error caused by the fact that the model is not perfect, whereas variance is the part of the expected error due to the nature of the training set. The proof of the following formal definition can be found in [22].

E_T \{ (f - d)^2 \} = E_T \{ (f - E_T\{f\})^2 \} + (E_T\{f\} - d)^2   (2.9)

where E_T is the expectation operator in the decomposition, taken with respect to all possible training sets of fixed size N and all possible parameter initializations, and d is the target value of a testing data point^3.

During the training, the part of the error caused by the model bias decreases as the model approaches the training data. At the same time, the part of the error caused by the variance increases due to the overfitting phenomenon. Bagging reduces mainly the variance (see Figure 2.8), whereas Boosting reduces mainly the bias part (training vectors that are wrongly modeled by former models play more important roles in the training of later models).

^3 A noise level of zero is assumed; in the case of a non-zero noise component, d in the decomposition would be replaced by its expected value, and a constant (irreducible) term \sigma^2_e would be added, representing the variance of the noise.
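The bootstrap step that gives bagging its variance-reducing diversity can be sketched as follows. The RegLearner/RegModel interfaces are illustrative placeholders, and the combination shown is a simple average of the member outputs.

import java.util.Random;

/** Minimal sketch of bootstrap sampling as used by bagging.
 *  The RegLearner/RegModel interfaces are illustrative placeholders. */
interface RegModel { double predict(double[] input); }
interface RegLearner { RegModel train(double[][] inputs, double[] targets); }

class BaggingSketch {
    static RegModel bag(RegLearner learner, double[][] x, double[] y, int members, long seed) {
        Random rnd = new Random(seed);
        final RegModel[] models = new RegModel[members];
        for (int m = 0; m < members; m++) {
            // sample with replacement: the bootstrapped set has the same size as the original one
            double[][] bx = new double[x.length][];
            double[] by = new double[y.length];
            for (int j = 0; j < x.length; j++) {
                int k = rnd.nextInt(x.length);
                bx[j] = x[k];
                by[j] = y[k];
            }
            models[m] = learner.train(bx, by);   // diversity comes only from the varied training data
        }
        // the ensemble output is the simple average of the member outputs
        return input -> {
            double sum = 0;
            for (RegModel model : models) sum += model.predict(input);
            return sum / models.length;
        };
    }
}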
Figure 2.7: The error of the model decomposed into bias and variance; the training should be stopped at the optimum, the minimum of the generalization error. Training time can be replaced by model complexity as well.

Figure 2.8: Ensemble methods reduce variance a) and bias b), giving the resulting model better generalization abilities and thus improving the accuracy on testing data.

Simple ensemble and weighted ensemble

A single output can be created from a set of model outputs via simple averaging, or by means of a weighted average that takes account of the relative accuracy of the models to be combined [37]. The output \Phi_m(x) of the weighted ensemble is given by the equation below:

\Phi_m(x) = \sum_{i=1}^{m} w_i f_i(x), \quad w_i = \frac{e^{-\alpha e_i}}{\sum_j e^{-\alpha e_j}}   (2.10)

where f_i(x) are the outputs of the member models given that the vector x is provided to their inputs. The weights w_i are determined according to the performance e_i of the individual models on the training and validation set.
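A direct implementation of Equation 2.10 takes only a few lines. The example below assumes the convention that members with larger errors e_i receive exponentially smaller weights; \alpha is a free smoothing parameter chosen by the user, and the member outputs and errors are supplied by the caller.

/** Minimal sketch of the weighted ensemble combination from Equation 2.10. */
class WeightedEnsembleSketch {
    static double combine(double[] memberOutputs, double[] memberErrors, double alpha) {
        double[] w = new double[memberErrors.length];
        double norm = 0;
        for (int i = 0; i < w.length; i++) {
            w[i] = Math.exp(-alpha * memberErrors[i]);   // accurate members (small e_i) get larger weights
            norm += w[i];
        }
        double output = 0;
        for (int i = 0; i < w.length; i++)
            output += (w[i] / norm) * memberOutputs[i];  // normalized weights sum to one
        return output;
    }

    public static void main(String[] args) {
        double[] f = {1.2, 0.9, 1.1};     // outputs of three member models for one input vector
        double[] e = {0.05, 0.20, 0.10};  // their errors on training/validation data
        System.out.println(combine(f, e, 10.0));
    }
}

With \alpha = 0 the scheme degenerates to simple averaging; larger values of \alpha concentrate the weight on the most accurate members.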
38 24 SECTION 2. BACKGROUND AND SURVEY OF THE STATE-OF-THE-ART Ensembles - state of the art There are concerns about complexity of the final solution. Building ensemble of models is in the contradiction to the Occam s razor [12], because the model should be kept as simple as possible, when the accuracy is comparable. The study [29] shows that in some cases, the output function of the ensemble is in fact simpler than the output function of individual models (although it has much more parameters). Therefore from this point of view the ensemble of models is in the accordance with Occam s razor. Error-Correcting Output Codes (ECOC) according to [67] improves the accuracy of ensemble classifiers. This method uses a different output coding for each base classifier. Instead of one classifier with k outputs in a k-class problem (one-per-class coding), b classifiers with only one output are used, each classifier only deciding between two super-classes that partition the k classes. The output of the single base classifiers can then be interpreted as bits in a codeword of length b, transmitting the class of the classified pattern. In [4] we can find similar result to the one obtained in this section. If the base classifiers are very accurate or the number of input features is too large, Adaboost.ECC cannot improve the classification performance on the test patterns compared to ECOC classifiers of comparable size and complexity. Only if the number of input features is small and few hidden neurons are used some improvements can be achieved. In [1] MLP neural network committees proved to be superior to individual MLP networks. The committee of GMDH networks generated by the Abductory Inductive Mechanism (AIM) gave better performance just when individual networks varied in complexity - a result similar to the conclusion in this thesis (see Section 4.8). 2.3 Previous results and related work Our previous method ModGMDH incorporated some improvements (heterogeneous units, interlayer connections, growing complexity, etc.) of the MIA GMDH. It proved to generate more accurate models on various data sets. We published these results in the paper [A.15] (conference on Inductive Modeling ICIM 22 in Lviv). Bellow we present the most important results from the paper. We designed an experiment and evaluated results of the original and the Modified GMDH method showing the advantages of the ModGMDH topology. For the purpose of the experiment we introduced more methods with partial modifications of the structure to explore the effects of particular modifications of the original method. We run the experiment with the following methods: Original version of GMDH method (OGMDH) Extended Original GMDH method (EOGMDH) - homogenous network with polynomial transfer function and extra inputs in units Perceptron GMDH method (HPGMDH) - homogenous network with units of perceptron type and extra inputs
39 SECTION 2. BACKGROUND AND SURVEY OF THE STATE-OF-THE-ART 25 Figure 2.9: The comparison of GMDH methods with various levels of modification. Modified GMDH method (MGMDH) - heterogeneous units of growing complexity (all modifications) Figure 2.9 compares the root mean square error (RMS) of models for full data set. The error was averaged for three runs. We can see that the response of models constructed by the original GMDH method is the worst for all data sets, whereas the modified GMDH method gives models with the best results. The EOGMDH method differs from OGMDH just in using extra inputs. This modification generally improves approximation attributes of the original GMDH method, according to results showed in Figure 2.9. Methods HPGMDH and EOGMDH differ in the type of unit, both are homogenous and using extra inputs. The HPGMDH gives better results than EOGMDH on the Mandarin data set but for the other data sets it is exactly the opposite. It seems that perceptron units are suitable for higher density of data in the input space (mandarin data set), whereas units with polynomial transfer function give better result on thin data sets. The idea of data-dependent approximation ability is supported by the fact that surviving units in the network constructed by MGMDH (initial population of a layer contains units of all types) on the Mandarin data set are almost just of the perceptron type. On the Artificial data set there is a majority of units with polynomial transfer function. When developing networks by MGMDH on the thinnest Nuclear data set, the winning units are mostly that with polynomial or linear transfer function. The presence of heterogeneous units in MGMDH networks has positive influence on the stability of the learning process and the function approximation. This initial experiment opened the door. We entered and followed the way for three years. In this thesis we present what we have found on the way.
40 26SECTION 3. OVERVIEW OF OUR APPROACH - THE FAKE GAME FRAMEWORK 3 Overview of our approach - the FAKE GAME framework Knowledge discovery and data mining are popular research topics in recent times. It is mainly due to the fact that the amount of collected data significantly increases. Manual analysis of all data is no longer possible. This is where the data mining and the knowledge discovery (or extraction) can help. The process of knowledge discovery [32] is defined as the non-trivial process of finding valid, potentially useful, and ultimately understandable patterns. The problem is that this process still needs a lot of human involvement in all its phases in order to extract some useful knowledge. This thesis focuses on methods aimed at significant reduction of expert decisions needed during the process of knowledge extraction. Within the FAKE GAME environment we develop methods for automatic data preprocessing, adaptive data mining and for the knowledge extraction (see Figure 3.1). The data preprocessing is very important and time consuming KNOWLEDGE EXTRACTION and INFORMATION VISUALIZATION KNOWLEDGE INPUT DATA AUTOMATED DATA PREPROCESSING AUTOMATED DATA MINING FAKE INTERFACE GAME ENGINE Figure 3.1: FAKE GAME environment for the automated knowledge extraction. phase of the knowledge extraction process. According to [8] it accounts for almost 6% of total time of the process. The data preprocessing involves dealing with non-numeric variables (alpha values coding), missing values replacement (imputing), outlier detection, noise reduction, variables redistribution, etc. The data preprocessing phase cannot be fully automated for every possible data set. Each data have unique character and each data mining method requires different preprocessing. Existing data mining software packages support just very simple methods of data preprocessing [7]. There are new data mining environments [8, 5] trying to focus more on data preprocessing, but their methods are still very limited and give no hint which preprocessing would be the best for your data. It is mainly due to the fact that the theory of data preprocessing is not very developed. Although some preprocessing methods seem to be simple, to decide which method would be the most appropriate for some data might be very complicated. Within the
41 SECTION 3. OVERVIEW OF OUR APPROACH - THE FAKE GAME FRAMEWORK27 FAKE interface we develop more sophisticated methods for data preprocessing and we study which methods are most appropriate for particular data. The final goal is to automate the data preprocessing phase as much as possible. In the knowledge extraction process, the data preprocessing phase is followed by the phase of data mining. In the data mining phase, it is necessary to choose appropriate data mining method for your data and problem. The data mining method usually generates a predictive, regressive model or a classifier on your data. Each method is suitable for different task and different data. To select the best method for the task and the data, the user has to experiment with several methods, adjust parameters of these methods and often also estimate suitable topology (e.g. number of neurons in a neural network). This process is very time consuming and presumes strong expert knowledge of data mining methods by the user. In the new version of one commercial data mining software [96], an evolutionary algorithm is used to select the best data mining method with optimal parameters for actual data set and a problem specified. This is really significant step towards the automation of the data mining phase. We propose a different approach. The ensemble of predictive, regressive models or classifiers is generated automatically using the GAME engine. Models adapt to the character of a data set so that they have an optimal topology. We develop methods eliminating the need of parameters adjustment so that the GAME engine performs independently and optimally on bigger range of different data. The results of data mining methods can be more or less easily transformed into the knowledge, finalizing the knowledge extraction process. Results of methods such as simple decision tree are easy to interpret. Unfortunately majority of data mining methods (neural networks, etc.) are almost black boxes - the knowledge is hidden inside the model and it is difficult to extract it. Almost all data mining tools bound the knowledge extraction from complex data mining methods to statistical analysis of their performance. More knowledge can be extracted using the techniques of information visualization. Recently, some papers [15] on this topic had been published. In this thesis, we propose techniques based on methods such as scatterplot matrix, regression plots, multivariate data projection, etc. to extract additional useful knowledge from the ensemble of GAME models. We also develop evolutionary search methods to deal with the state space dimensionality and to find interesting projections automatically. 3.1 The goal of the FAKE GAME environment The ultimate goal of our research is to automate the process of knowledge extraction from data. It is clear that some parts of the process still need the involvement of expert user. We build the FAKE GAME environment to limit the user involvement during the process of knowledge extraction. To automate the knowledge extraction process, we research in the following areas: data preprocessing, data mining, knowledge extraction and information visualization (see Figure 3.1).
42 28SECTION 3. OVERVIEW OF OUR APPROACH - THE FAKE GAME FRAMEWORK Research of methods in the area of data preprocessing In order to automate the data preprocessing phase, we develop more sophisticated methods for data preprocessing.we focus on data imputing (missing values replacement), that is in existing data mining environments [8, 5] realized by zero or mean value replacement although more sophisticated methods already exist [8]. We also developed a method for automate nonlinear redistribution of variables (see Section 5.1.2). Dealing with non-numeric variables (alpha values coding), outlier detection, noise reduction and other techniques of data preprocessing will be in the scope of our further research. This thesis is not focusing on data warehousing, because this process is very difficult to automate in general. It is very dependent on particular conditions (structure of databases, information system, etc.) We assume that source data are already collected cleansed and integrated (Figure 3.1) Automated data mining To automate the data mining phase, we develop an engine that is able to adapt itself to the character of data. This is necessary to eliminate the need of parameter tuning. The GAME engine autonomously generates the ensemble of predictive, regressive models or classifiers. Models adapt to the character of data set so that they have optimal topology. Unfortunately, the class of problems where the GAME engine performs optimally is still limited. To make the engine more versatile, we need to add more types of building blocks, more learning algorithms, improve the regularization criteria, etc Knowledge extraction and information visualization To extract the knowledge from complex data mining models is very difficult task. Visualization techniques are promising way how to achieve it. Recently, some papers [15] on this topic had been published. In our case, we need to extract information from an ensemble of GAME inductive models. To do that we enriched methods such as scatterplot matrix, regression plots by the information about the behavior of models (see Section 5.2.3). For data with many features (input variables) we have to deal with curse of dimensionality. The state space is so big, that it is very difficult to find some interesting behavior (relationship of system variables) manually. For this purpose, we developed evolutionary search methods to find interesting projections automatically (see Section 5.2.8). Along with the basic research, we implement proposed methods in Java programming language and integrate it into the FAKE GAME environment [1] so we can directly test the performance of proposed methods, adjust their parameters, etc. Based on the research and experiments performed within this dissertation, we would like to develop an open source software FAKE GAME. This software should be able to automatically preprocess various data, to generate regressive, predictive models and classifiers (by means of GAME engine), to automatically identify interesting relationships in data (even in highdimensional ones) and to present discovered knowledge in a comprehensible form. The software should fill gaps which are not covered by existing open source data mining environments [7, 8].
43 SECTION 4. THE DESIGN OF THE GAME ENGINE AND RELATED RESULTS 29 4 The design of the GAME engine and related results The Group of Adaptive Models Evolution (GAME) is a data mining method. It can generate models for classification, prediction, identification or regression purposes (see Figure 4.1). It works with both continuous and discrete variables. The topology of GAME models adapts to the nature of a data set supplied. The GAME is highly resistant to irrelevant and redundant features, suitable for short and noisy data samples. The GAME engine further develops REGRESSION PREPROCESSED DATA MO DEL MO DEL MO DEL GAME MO DEL MO DEL MO DEL OR CLASSIFICATION OR IDENTIFICATION OR PREDICTION Figure 4.1: Group of Adaptive Models Evolution (GAME) the MIA GMDH algorithm [47] and is even more sophisticated than it s predecessor the ModGMDH [A.9]. The GAME generates group of inductive models adapting themselves on data set character and on it s complexity. An inductive model (network) grows as big as needed to solve a problem with sufficient accuracy. It consists of units (neurons) that have been most successful in modeling interrelationships within the data set. A GAME model (see Figure 4.2) has more degrees of freedom (units with more inputs, interlayer connections, transfer functions etc.) than MIA GMDH models. When a data set is multidimensional, it is impossible to search the huge state space of model s possible topologies without any heuristic, as implemented in the ModGMDH. Therefore the GAME method incorporates niching genetic algorithm to evolve models of optimal structure. Improvements of the MIA GMDH are discussed bellow in more detailed form. Improvements of the GAME method over the original MIA GMDH:
44 3 SECTION 4. THE DESIGN OF THE GAME ENGINE AND RELATED RESULTS Figure 4.2: The comparison: original MIA GMDH network and the GAME network Heterogeneous units Optimization of units Heterogeneous learning methods Structural innovations Regularization Genetic algorithm Niching methods Evolving units (active neurons) Ensemble of models generated Several types of units compete to survive in GAME models. Efficient gradient based training algorithm developed for hybrid networks. Several optimization methods compete to build most successful units. Growth from a minimal form, interlayer connections, etc. Regularization criteria are employed to reduce a complexity of transfer functions. A heuristic construction of GAME models. Inputs of units are evolved. Diversity promoted to maintain less fit but more useful units. Units such as the CombiNeuron evolve their transfer functions. Ensemble improves accuracy, the credibility of models can be estimated. 4.1 Heterogeneous units In MIA GMDH models, all units have the same polynomial transfer function. In the PNN [75] models, there are more types of polynomials used within single model. Our previous research showed that employing heterogeneous units within a model gives better results than when using units just of single type [A.15]. Hybrid models are often more accurate than homogeneous ones, even if the homogeneous model has an optimal transfer function
45 SECTION 4. THE DESIGN OF THE GAME ENGINE AND RELATED RESULTS 31 x 1 Linear (LinearNeuron) x 1 Sin (SinusNeuron) x 2... x n n y = aixi + an i= x 2... x n y n = an + 1 sin an+ 2 aixi + an+ 3 + a i= 1 x 1 x 2... x n Gaussian (GaussianNeuron) y ( + a ) 2 ( xi ai ) i= 1 2 ( 1+ an 2 ) n + 1 * e + = + 1 a n x 1 Polynomial (CombiNeuron) x 2 n m r... y = a x i j + a i= 1 j= 1 x n x 1 Logistic (SigmNeuron) x y = n + a i i 1 1 e a x i = + x n x 1 x 2... x n Rational (PolyFractNeuron) a y = n n a x + a x x + a i= 1 i i i= 1 j= 1 2 n n n* i+ j i j 2 n + 1 a x 1 x 2... x n Exponential (ExpNeuron) y n an+ 1* ai xi i= 1 = an + 2 e + * a x 1 x 2... x n Universal (BPNetwork) n ( pq( xp) ) 2n+ 1 ψ q p= 1 q= 1 y = φ Figure 4.3: Units are building blocks of GAME models. Transfer functions of units can be combined in a single model (then we call it a hybrid model with heterogeneous units). The list of units includes some of units implemented in the FAKE GAME environment. appropriate for modeled system (see results in the Section 2.3). In GAME models, units within single model can have several types of transfer functions (Hybrid Inductive Model). Transfer functions can be linear, polynomial, logistic, exponential, gaussian, rational, perceptron network, etc. (see Table 4.3 and Figure 4.3). The motivation, why we implemented so many different units was following. Each problem or data set is unique. Our previous experiments showed (see Section 2.3) that for simple problems, models with simple units were superior. Whereas for complex problems winning models were these with units having complex transfer function. The best performance on all tested problems was achieved by models, where units were mixed Experiments with heterogeneous units To prove our assumptions and to support our preliminary results [A.15], we designed and performed following experiments. We used several real world data sets of various complexity and noise levels. For each data set, we built ensembles of 1 models. Each ensemble had a different configuration. In homogeneous ensembles, there was just single type of unit allowed (eg. Exp stands for an ensemble of 1 models consisting of ExpNeuron units only). The ensemble of 1 heterogeneous inductive models, where all units are allowed to participate in the evolution stage of models is labeled all (all-simple and all-fast are configurations, where
46 32 SECTION 4. THE DESIGN OF THE GAME ENGINE AND RELATED RESULTS Table 4.1: The summary of unit types appearing in GAME models Class of the unit Abbrevation Transfer function Learning method LinearNeuron Linear Linear - any method - CombiNeuron Combi Polynomial - any method - CombiNeuron CombiR3 Polynomial - any method + R3 - PolySimpleNeuron Polynomial Polynomial - any method - PolySimpleNRNeuron PolyNR Polynomial - any method + GL5 - SigmNeuron Sigm Sigmoid - any method - ExpNeuron Exp Exponential - any method - PolyFractNeuron Fract Rational - any method - SinusNeuron Sin Sinus - any method - GaussNeuron Gauss Gaussian1 - any method - GaussianNeuron Gaussian Gaussian2 - any method - MultiGauss MultiGauss Gaussian3 - any method - BPNetwork Perceptron Universal BackPropagation algorithm NRBPNetwork PerceptNR Universal BP alg. + GL5 stop.crit. all units with simple transfer function respectively all units with short learning time, were allowed). The first experiment was performed on the Building data set. This data set has three output variables. One of these variables is considerably noisy (Energy consumption) and the other two output variables have a low noise level. The results are consistent with this observation. The Combi and the Polynomial ensembles perform very well on variables with low noise levels, but for the third noisy variable they overfitted the training data having huge error on the testing data set. Notice, that the configuration all has an excellent performance for all three variables, no matter the level of noise (Figure 4.4). In the Figure 4.5, we present results of the experiment on the Spiral data set [51]. As you can see, Perceptron ensemble learned to tell two spirals apart without any mistake. The second best performing configuration was all with almost hundred percent accuracy 1. The worst performing ensembles were Linear and Sigm (units with linear and logistic transfer functions). Their 5% classification accuracy signifies that these units absolutely failed to learn the Spiral data set. This is not surprising for linear units, but the failure of units with logistic transfer function is under our investigation (there might be a bug in the implementation of the analytic gradient for the SigmNeuron unit). The behavior of the all ensemble on the Spiral data set is demonstrated in the appendix (see Figure D.7). We performed number of similar experiments with other real world data sets. You can find results of these experiments in the appendix. We can conclude the all ensemble performed extremely well for almost all data sets under the investigation. It showed the best performance for the Mandarin data set (Figure D.1), Boston data set (Figure D.8) and the well known Iris data set (Figure D.11). Only for the Ecoli data set, the error of the all ensemble was 1 We have to mention that building the ensemble of all models took just a fraction of time needed to build the ensemble of Perceptron models (consisting of BPnetwork units)
Figure 4.4: The performance comparison of GAME units on the Building data set (errors for the three output variables: hot water consumption, cold water consumption and energy consumption). In the all-pf configuration all units except the Perceptron and Fract units were enabled; similarly, in all-p just the Perceptron unit was excluded.

Figure 4.5: The performance comparison of GAME units on the Spiral data set (classification accuracy; the Perceptron and all configurations score highest, the Sigm and Linear configurations lowest).
relatively high (see Figure D.9). It is not clear why it apparently overfitted the data, when other ensembles which are more likely to overfit (Combi, Polynomial) did not. We will focus on this data set in our further research.

The conclusion of our experiments is that for the best results we recommend enabling all units that have so far been implemented in the GAME engine. The more types of transfer functions we have, the more different relationships we can model. Which types of units are selected to make up a model depends just on the nature of the data modeled. The main advantage of using units of various types in a single model is that models automatically adapt themselves to the character of the modeled system. Only units with an appropriate transfer function survive. Hybrid models also better approximate relationships that can be expressed by a superposition of different functions (e.g. polynomial * sigmoid * linear). In the same sense, we also use several types of learning algorithms. Many authors agree that gradient methods are convenient for simpler data; for more difficult data (e.g. multidimensional data with many local optima) it is better to estimate the weights or coefficients of the unit's transfer function by a genetic algorithm. We study this problem in detail in Section 4.3.

4.2 Optimization of GAME units

The GMDH methods can be divided into two classes: parametric and non-parametric methods. The MIA GMDH and also the GAME belong to the parametric class. Parametric models contain parameters that are optimized during the learning (training) stage. The optimal values of the parameters are the values minimizing the difference between the behavior of a real system and its model. This difference is typically measured by the root mean squared error. The error of a model^2 on the training data is the sum of errors on the particular training vectors

E = \sum_{j=0}^{m} (y_j - d_j)^2,   (4.1)

where y_j is the output of the model for the j-th training vector and d_j is the corresponding target output value.

The coefficients of a GAME unit a_1, a_2, \ldots, a_n can be estimated during the training stage by any optimization method implemented (see Table 4.3). In each iteration, the optimization method proposes values of the coefficients and the GAME unit returns its error on the training data with these proposed coefficients (see Figure 4.6a). If the analytic gradient of the error can be computed, the number of iterations is significantly reduced, because we know in which direction the coefficients should be adjusted (see Figure 4.6b). The gradient of the error E in the error surface of a GAME unit can be written as

\nabla E = \left( \frac{\partial E}{\partial a_1}, \frac{\partial E}{\partial a_2}, \ldots, \frac{\partial E}{\partial a_n} \right),   (4.2)

where \frac{\partial E}{\partial a_i} is the partial derivative of the error in the direction of the coefficient a_i. It tells us how to adjust the coefficient to get a smaller error E on the training data. This partial derivative

^2 The output of an optimized unit can be treated as an output of the model, so the terms error of the model and error of the unit can be used interchangeably.
Figure 4.6: Optimization of the coefficients a_1, a_2, \ldots, a_n of a unit can be performed without the analytic gradient a) or with the gradient supplied b). In both cases the optimization method repeatedly proposes new coefficient values and the unit returns its error on the training data; in b) the unit also returns the gradient of the error, so the method does not have to estimate it. Utilization of the analytic gradient significantly reduces the number of iterations needed for the optimization of coefficients.

can be computed as

\frac{\partial E}{\partial a_i} = \sum_{j=0}^{m} \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial a_i},   (4.3)

where m is the number of training vectors. The first part of the summand is easily derived from Equation 4.1 as

\frac{\partial E}{\partial y_j} = 2 (y_j - d_j).   (4.4)

The second part of the summand from Equation 4.3 is unique for each unit, because it depends on its transfer function. We demonstrate the computation of the analytic gradient for the Gaussian unit and for the Sin unit. For other units the gradient is computed in a similar way.

The analytic gradient of the Gaussian unit

Gaussian functions are very important and they can be found almost everywhere. The most common distribution in nature follows the Gaussian probability density function

f(x) = \frac{1}{\sqrt{2\pi}\,\sigma} e^{-\frac{(x - \mu)^2}{2\sigma^2}}.

Neurons with a Gaussian transfer function are typically used in Radial Basis Function Networks. We have modified the function for our purposes by adding coefficients that allow the function to be scaled and shifted. The first version of the transfer function, as implemented in the GaussianNeuron, is the following:

y_j = (1 + a_{n+1}) \, e^{-\frac{\sum_{i=1}^{n} (x_{ij} - a_i)^2}{(1 + a_{n+2})^2}} + a_0   (4.5)

The second version (GaussNeuron) proved to perform better on several low dimensional real world data sets:

y_j = (1 + a_{n+1}) \, e^{-\frac{\left( \sum_{i=1}^{n} a_i x_{ij} - a_{n+3} \right)^2}{(1 + a_{n+2})^2}} + a_0   (4.6)

Finally, the third version (MultiGaussNeuron), a combination of the transfer functions above, showed the best performance, but sometimes exhibited almost fractal behavior (see
Figure D.1).

y_j = (1 + a_{2n+1}) \, e^{\rho_j} + a_0, \quad \rho_j = -\frac{\sum_{i=1}^{n} (a_i x_{ij} - a_{n+i})^2}{(1 + a_{2n+2})^2}   (4.7)

We computed gradients for all these transfer functions. In this thesis, we derive the gradient of the error (see Equation 4.2) for the third version of the Gaussian transfer function (Equation 4.7). We need to derive the partial derivatives of the error function according to Equation 4.3. The easiest partial derivative to compute is the one in the direction of the a_0 coefficient, for which the second term \frac{\partial y_j}{\partial a_0} is equal to 1. Therefore we can write

\frac{\partial E}{\partial a_0} = 2 \sum_{j=0}^{m} (y_j - d_j).

In the case of the coefficient a_{2n+1}, the equation becomes more complicated:

\frac{\partial E}{\partial a_{2n+1}} = 2 \sum_{j=0}^{m} (y_j - d_j) \, e^{-\frac{\sum_{i=1}^{n} (a_i x_{ij} - a_{n+i})^2}{(1 + a_{2n+2})^2}}.   (4.8)

The remaining coefficients are in the exponential part of the transfer function. Therefore the second summand of Equation 4.3 cannot be formulated directly. We have to rewrite Equation 4.3 as

\frac{\partial E}{\partial a_i} = \sum_{j=0}^{m} \left[ \frac{\partial E}{\partial y_j} \frac{\partial y_j}{\partial \rho_j} \frac{\partial \rho_j}{\partial a_i} \right],   (4.9)

where \rho_j is the exponent of the transfer function 4.7. Now we can formulate the partial derivatives for the remaining coefficients as

\frac{\partial E}{\partial a_{2n+2}} = 2 \sum_{j=0}^{m} (y_j - d_j) \left[ (1 + a_{2n+1}) e^{\rho_j} \, \frac{2 \sum_{i=1}^{n} (a_i x_{ij} - a_{n+i})^2}{(1 + a_{2n+2})^3} \right]   (4.10)

\frac{\partial E}{\partial a_i} = -2 \sum_{j=0}^{m} (y_j - d_j) \left[ (1 + a_{2n+1}) e^{\rho_j} \, \frac{2 (a_i x_{ij}^2 - a_{n+i} x_{ij})}{(1 + a_{2n+2})^2} \right]   (4.11)

\frac{\partial E}{\partial a_{n+i}} = 2 \sum_{j=0}^{m} (y_j - d_j) \left[ (1 + a_{2n+1}) e^{\rho_j} \, \frac{2 (a_i x_{ij} - a_{n+i})}{(1 + a_{2n+2})^2} \right].   (4.12)

We have derived the gradient of the error on the training data for the unit with the Gaussian transfer function. An optimization method often requests these partial derivatives in every iteration to be able to adjust the parameters in the proper direction. This mechanism (as described in Figure 4.6b) can significantly reduce the number of error evaluations needed (see Figure 4.7).

The analytic gradient of the Sine unit

We also list the gradient for the Sine unit with the following transfer function:

y_j = a_{n+1} \sin \left[ \underbrace{a_{n+2} \left( \sum_{i=1}^{n} a_i x_{ij} \right) + a_{n+3}}_{\rho_j} \right] + a_0   (4.13)
51 SECTION 4. THE DESIGN OF THE GAME ENGINE AND RELATED RESULTS 37 Table 4.2: Number of evaluations saved by supplying gradient depending on the complexity of the energy function. Complexity Avg. evaluations Avg. evals. Avg. gradient Evaluations Computation energy fnc. without grad. with grad. calls saved time saved % 13.15% % 33.12% % 37.41% % 48.75% % 62.87% Partial derivatives of the error function can be derived similarly as for the gaussian unit above. E m = 2 (y j d j ), (4.14) a j= E m [ ] = 2 (y j d j ) sin (ρ j ), (4.15) a n+1 j= E [ m ( n ) ] = 2 (y j d j ) a n+1 cos (ρ j ) sin a i x ij, (4.16) a n+2 j= i=1 E m [ ] = 2 (y j d j ) a n+1 cos (ρ j ) sin (a n+2 ), (4.17) a n+3 j= E m [ ] = 2 (y j d j ) a n+1 cos (ρ j ) sin (a n+2 x ij ). (4.18) a i j= The experiment: analytic gradient saves error function evaluations We performed an experiment to evaluate the effect of analytic gradient computation. The Quasi-Newton optimization method was used to optimize the SigmNeuron unit (a logistic transfer function). In the first run the analytic gradient was provided and in the second run, the gradient was not provided so the QN method was forced estimate the gradient itself. We measured the number of function evaluation calls and for the first run, we recorded also the number of gradient computation requests. The results are displayed in the Graph 4.7 and in the Table B.1. In the second run, without the analytic gradient provided, the number of error function evaluation calls increased exponentially with rising complexity of the error function. For the first run, when the analytic gradient is provided, number of error function evaluation calls increases just linearly and the number of gradient computations grows also linearly. The computation of gradient is almost equally time-consuming as the error function evaluation. When we sum up these two numbers for the first run, we still get growth increasing linearly with the number of layer (increasing complexity of the error surface). This is superb result, because some models of
52 38 SECTION 4. THE DESIGN OF THE GAME ENGINE AND RELATED RESULTS Evaluation calls f_eval (no grad) f_eval (grad) g_eval (grad) No. of GAME layer (increas. complexity) Figure 4.7: When the gradient have to be estimated by the optimization method, number of function evaluation calls grows exponentially with an increasing complexity of the problem. When the analytic gradient is computed, the growth is almost linear. complex problems can have 2 layers, the computational time saved by providing the analytic gradient is huge. Unfortunately some optimization methods such as genetic algorithms and swarm methods are not designed to use the analytic gradient of the error surface. On the other hand, for some data sets, the usage of analytic gradient can worsen a convergence characteristic of optimization methods (getting stuck in local minima). The training algorithm described in this Section enables efficient training of hybrid neural networks. 4.3 Heterogeneous learning methods The question Which optimization method is the best for our problem? has not a simple answer. There is no method superior to others for all possible optimization problems. However there are popular methods performing well on the whole range of problems. Among these popular methods, we can include so called gradient methods - the Quasi Newton method, the Conjugate Gradient method and the Levenberg-Marquardt method. They use the analytical gradient (or its estimation) of the problem error surface. The gradient brings them faster convergence, but in cases when the error surface is jaggy, they are likely to get stuck in a local optima. Other popular optimization methods are genetic algorithms. They search the error surface by jumping on it with several individuals. Such search is usually slower, but more prone to get stuck in a local minima. The search performed by swarm methods can be imagined as a swarm of birds flying over the error surface, looking for food in deep valleys. You can also imagine that for certain types of terrain, they might miss the deepest valley. Each data set have different complexity. The surface of a model s RMS error depends on the data set, transfer functions of optimized unit and also on preceding units in the network.
53 SECTION 4. THE DESIGN OF THE GAME ENGINE AND RELATED RESULTS 39 Table 4.3: Learning methods summary Name of the class Abbrv. Search Learning method UncminTrainer QN Gradient Quasi-Newton method SADETrainer SADE Genetic SADE genetic method PSOTrainer PSO Swarm Particle Swarm Optimization HGAPSOTrainer HGAPSO Hybrid Hybrid of GA and PSO PALDifferentialEvolutionTr. PalDE Genetic Differtial Evolution version 1 DifferentialEvolutionTrainer DE Genetic Differtial Evolution version 2 StochasticOSearchTrainer SOS Random Stochastic Orthogonal Search OrthogonalSearchTrainer OS Empirical Orthogonal Search ConjugateGradientTrainer CG Gradient Conjugate Gradient method ACOTrainer ACO Swarm Ant Colony Optimization CACOTrainer CACO Swarm Cont. Ant Colony Optimization Therefore we might expect, there is no universal optimization method performing optimally on every data set. Each unit has different error surface even within a single network. In GAME, each unit can use arbitrary learning algorithm to estimate it s coefficients (Quasi-Newton, Conjugate Gradient, Differential Evolution, SADE genetic algorithm [27], Particle Swarm Optimization, Back-Propagation algorithm, etc.). Learning methods are also evolved by the niching genetic algorithm (see Section 4.6.1), therefore methods training successful units ale selected more often than methods training poor performers on a particular data set. Learning methods we have so far implemented to the GAME engine are summarised in the Table Experiments with heterogeneous learning methods For validation of the assumption that no universal optimization method exists performing optimally for any data set, we designed following experiments. Again several different real world data set were involved. For each data set, we generated models where just units with simple transfer functions (Linear, Sigm, Combi, Polynomial, Exp) were enabled. Coefficients of these units were optimized by a single method from the Table 4.3. In the configuration all, all methods were enabled. Because these experiments were computationally expensive (optimization methods not utilizing the analytic gradient need many more iterations to converge), we built the ensemble of 5 models for each configuration and data set. For the Ecoli data set, ensembles are formed just by three member models 3. Therefore results are not as significant as in the previous section where we experimented with GAME units. Compare results of two trials on the Ecoli data set in the Figure 4.8. If you focus on the Quasi Newton method (QN) in the First trial (Figure 4.8 left) ensemble of models optimized 3 The Ecoli data set has six output variables so six times more time was needed to produce models. Therefore we reduced the number of models in the ensemble to three.
54 4 SECTION 4. THE DESIGN OF THE GAME ENGINE AND RELATED RESULTS Ecoli data set classification accuracy single models, two trials 1% 1% Training set Testing set 9% 9% 8% 8% 7% 7% 6% 6% HGAPSO OS all PSO SADE SOS ACO palde CG CACO QN DE HGAPSO SOS QN palde all ACO CG DE OS SADE CACO PSO Figure 4.8: The performance comparison of learning methods on the Ecoli data set. Boston dataset RMS error CG PSO all CACO ACO HGAPSO DE SADE QN SOS palde OS Training set Testing set Figure 4.9: The performance comparison of learning methods on the Boston data set. by the QN method overfitted the data, but in the second trial (with the same configuration) it s result on the testing data set was much better. The hybrid of the Genetic Algorithm and the Particle Swarm Optimization (HGAPSO) performed best in both trials. Ensembles optimized by the HGAPSO did not overfitted the Ecoli data at all. Very different results we obtained from experiments with the Boston data set (see Figure 4.9). For all optimization methods the difference between their error on training and testing data set was almost the same. It signifies that this data set is not very noisy so the overfitting did not occurred. The best performance showed the Conjugate Gradient method, but all methods except the worst performing one (Orthogonal Search) achieved similar results. The results on the Building data set for it s three output variables are shown in the Figure 4.1. There is no significant difference between results for the noisy variable (Energy consumption) and the other two. We can divide optimization methods into the good and bad performing classes. Good performers are Conjugate Gradient, Quasi Newton, SADE genetic algorithm, Differential Evolution, and the all configuration standing for all methods participation in models evolution. On the other hand badly performing optimization methods for
In accordance with the results published in [16], our version of differential evolution outperformed the swarm optimization methods.

Figure 4.10: The performance comparison of learning methods on the Building data set (RMS error for the Energy consumption, Cold water consumption and Hot water consumption output variables).

The conclusion from our experiments with the optimization methods is not fully in accordance with our expectations. We predicted that the "all" configuration (all methods employed) should perform best for all data sets. The experiments showed that gradient methods such as Quasi-Newton and Conjugate Gradient performed very well for all data sets we experimented with (including the Mandarin data set - Figure D.12 in the appendix). When all methods are used, good performance is guaranteed, but the computation is significantly slower (some methods need many iterations to converge). At this stage of the research and implementation, we recommend using the Quasi-Newton (QN) optimization method only, because it is the fastest and very reliable. In our further research we plan to use the analytic gradient to improve the performance of gradient and swarm methods. We also plan to experiment with the switching of optimization methods (switching to a different method when convergence is slow).

4.4 Structural innovations

The structure of the original MIA GMDH was adapted to the computational capabilities of the early seventies. Thirty years later, computing has made a big step forward: experiments that can nowadays be run on every personal computer were intractable even on the most advanced supercomputers of that time. To make the computation of an inductive model possible, several restrictions on the structure of the model had to be imposed. Thanks to the growing computational power and the development of heuristic methods capable of solving NP-hard problems, we can leave out some of these restrictions. The first restriction of the original MIA GMDH is the fixed number of unit inputs (two) and a polynomial transfer function that is constant (except for its coefficients) for all units in a model.

^4 palde is the second version of the Differential Evolution algorithm implemented in the GAME engine. The result where the first version of DE performed well and the second version badly is peculiar. It signifies that the implementation and the proper configuration of a method are of crucial importance.
The second restriction is the absence of layer breakthrough connections: in the original version, the inputs of a unit can come from the previous layer only.

Growth from a minimal form

GAME models grow from a minimal form. There is a strong parallel with state-of-the-art neural networks such as NEAT [93]. In the default configuration of the GAME engine, units are restricted to have at least one input, and the maximum number of inputs must not exceed the number of the hidden layer the unit belongs to. The number of inputs to a unit thus increases with the depth of the unit in the model, and the transfer functions of units reflect the growing number of inputs. We showed in [A.15] that increasing the number of a unit's inputs and allowing interlayer connections play a significant role in improving the accuracy of inductive models.

The growing limit on the number of inputs a unit is allowed to have is crucial for inductive networks. It helps to overcome the curse of dimensionality. According to the induction principle, it is easier to decompose a problem into one-dimensional interpolation problems and then combine the solutions in two and more dimensions than to start with a multi-dimensional problem (for fully connected networks the dimensionality is proportional to the number of input features). To improve the modeling accuracy of neural networks, artificial input features are often added to the training data set. These features are synthesized from the original features by mathematical operations and can possibly reveal more about the modeled output. This is exactly what GAME does automatically (units of the first hidden layer serve as additional synthesized features for the units deeper in the network).

Our additional experiments showed that the restriction on the maximum number of inputs to a unit has a negative effect on the accuracy of models. However, when the restriction is enabled^5, the process of model generation is much faster and the accuracy of the produced models is more stable than without the restriction. Without the restriction we would need many more epochs of the genetic algorithm evolving units in a layer (only then would the accuracy of models be stable and would the feature ranking algorithm, which derives significance from the proportional numbers of units connected to a particular feature, work properly - see Section 5.2.2).

Interlayer connections

Units no longer take inputs just from the previous layer. Inputs can be connected to the output of any unit from any previous layer as well as to any input feature. This modification greatly increases the state space of possible model topologies, but the improvement in the accuracy of models is rather big [A.15]. The GMDH algorithm implemented in the KnowledgeMiner software [69] can also generate models with layer breakthrough connections.

^5 No restriction on the maximal number of inputs does not mean a fully connected network!
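The two structural innovations above can be illustrated with a short sketch. The class below is not part of the GAME engine; its names and the random selection strategy are illustrative only, assuming a unit in hidden layer l may connect to original features and to outputs of units from any previous layer, with at most l inputs.

import java.util.*;

// Illustrative sketch: picking input connections for a unit in hidden layer `layer` (1-based).
// Candidates are the original features plus the outputs of all units in previous layers
// (interlayer connections); the unit has at least one and at most `layer` inputs.
public class ConnectionSketch {
    static List<Integer> pickInputs(int layer, int nFeatures, int nPreviousUnits, Random rnd) {
        int maxInputs = layer;                        // grows with the depth of the unit
        int candidates = nFeatures + nPreviousUnits;  // features + outputs of previous layers
        int k = 1 + rnd.nextInt(Math.min(maxInputs, candidates)); // at least one input
        List<Integer> pool = new ArrayList<>();
        for (int i = 0; i < candidates; i++) pool.add(i);
        Collections.shuffle(pool, rnd);
        return pool.subList(0, k);                    // indices < nFeatures are original features
    }

    public static void main(String[] args) {
        System.out.println(pickInputs(3, 11, 8, new Random(1)));
    }
}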
4.5 Regularization in GAME

Regularization is a methodology allowing us to create models that are neither too simple nor too complex for the task at hand. Without any form of regularization, models often overfit the training data, losing their generalization ability, and their performance on new unseen data becomes extremely bad. The GMDH methods usually regularize models using an external data set called a validation set^6. The criterion of regularity (in this case the error on the validation set) should be minimized:

CR_{RMS\,val} = \frac{1}{N_B} \sum_{i=1}^{N_B} \left( y_i(A) - d_i \right)^2    (4.19)

In some cases (e.g. few data records), it is possible to validate also on the training set:

CR_{RMS\,tr\&val} = \frac{1}{N} \sum_{i=1}^{N} \left( y_i(A) - d_i \right)^2    (4.20)

For the experiments performed in the previous sections, the criterion CR_{RMS tr&val} was used. If you look at the results in Figure 4.8, you can see that for some models the accuracy on the training data is far better than that on the testing data. Also, the error of Polynomial and Combi units when modeling the Energy consumption (Figure 4.4) is huge. These are indicators of overtraining. The CR_{RMS tr&val} regularization was unable to prevent overtraining on noisy data.

Another, very straightforward form of regularization is penalization for complexity. Several criteria developed in information theory (AIC, BIC, etc.) can be applied to our problem. We experimented with a regularization that can be written as:

CR_{RMS\,R\,val} = \frac{1}{N_B} \sum_{i=1}^{N_B} \left( y_i(A) - d_i \right)^2 \left( 1 + \frac{1}{R}\,\mathrm{unit.penaltyForComplexity()} \right)    (4.21)

and with the version validating also on the training data, CR_{RMS R tr&val}. The value of the R coefficient is very important. When you look at Figure 4.11, you can see that the minimum of the criterion changes its position with the value of the R parameter. For noisy data it is better to have a stronger regularization (R = 12), whereas for data without noise no regularization is needed (the penalty term vanishes as R grows). To validate our assumptions, we designed an experiment with an artificial data set where we adjusted the level of noise from zero up to high levels. We generated 3 models for each noise level and each strength of the regularization (from R=12 to 3). Theoretically (Figure 4.12 left), the lowest error on the testing data should be achieved when the strength of the regularization matches the level of noise in the data. The experimental result (Figure 4.12 right) matched our expectations except that for a low regularization (R = 3) the testing error was low also for data with the medium level of noise. This deviation from the theoretical expectations can be caused by the fact that we used just the error of the simple ensemble of 3 models for each configuration (instead of the mean and standard deviation of their errors). The ensemble techniques reduced the overfitting of models for medium noise, but were unable to correct extremely overfitted models trained on highly noisy data.

^6 The validation set can be created by splitting the training data set into a part used for the optimization of a model and a part used for the validation.
Figure 4.11: As described in [48], the regularization criterion (CR) should be sensitive to the presence of noise in data. The complexity of models increases during training as new layers are added. Training should be stopped in the minimum of the CR. The penalization for complexity (the R coefficient) can be part of the CR, but its value should be adjusted according to the noise level in the data.

Figure 4.12: The stronger the penalization for complexity we use, the worse performance we can expect on complex problems with low noise. On the other hand, stronger penalization should perform better for noisy data (left graph). Experimental measurements (right graph) are in accordance with the theoretical assumptions, except for low regularization and medium noise.
Figure 4.13: Regularization of the Combi units by various strengths of the penalization for complexity (RMS error on the Antro training and testing data sets, two splits). The error on the Antro training data decreases, and the best result is achieved for almost no penalization (R3). The optimal regularization on the testing data set is R3, where the errors of individual models are the lowest and their variance is still reasonable.

Regularization of Combi units on real world data

We were also interested in how the regularization affects the performance of GAME models on real world data sets. We chose the Antro data set because our previous experiments showed that this data set is considerably noisy. We used two different splits into training and testing sets (training1, testing1, training2, testing2) to reduce the bias of selecting unrepresentative data. For our experiments we enabled just the CombiNeuron units (see Section 4.7 for a detailed description of this unit). We generated 3 models for each strength of penalization on both training sets. In Figure 4.14 you can see the minimum, maximum and mean error of these models. The results signify that on the Antro data set the optimal value of the R parameter in Equation 4.21 is around 3. When you look at Figure D.13 in the Appendix, you can see the results when an ensemble is used instead of 3 individual models. The overfitting was reduced according to expectations.

The same experiment with the Building data set turned out quite differently for the two variables with a low level of noise (Cold and Hot water consumption). The best accuracy was achieved for the lowest regularization (R3). For the third output variable (Energy consumption), which is considerably noisy, the error also decreased with a lower penalty for complexity. In the configuration R16, two out of 3 models significantly overfitted the data, and also the output of the simple ensemble demonstrated a large error on the testing data. Therefore a stronger penalization (R725) is optimal for the Energy consumption variable. Both experiments with real world data sets showed that each output variable requires a different degree of regularization, depending on the amount of noise present in the data vectors.
Figure 4.14: The error of the GAME ensemble on the Building training data decreases with declining penalization (shown for the WBE, WBCW and WBHW output variables). The results on the testing data set show that no regularization is needed for the WBHW and WBCW variables. The WBE variable is much noisier than the other two, and for low penalization levels the models are overfitted.

Evaluation of regularization criteria

In [48] it was proposed that the criterion of regularity should change its minimum with a changing level of noise in a data set. Such a clever criterion involves taking into account the noise level of an output variable. However, the level of noise can hardly be estimated. The problem is that we cannot say whether the variance is caused by noise or by a complex relationship. We can assume that the noise level is correlated with the variance of the output variable in a data set^7

\sigma^2 = \sum_{i=1}^{N} \left( d_i - \bar{d} \right)^2,    (4.22)

where d_i are the target values of the output variable and \bar{d} is the mean of these values. Then the regularity criterion can be written as

CR_{RMS\,p\,n\,val} = \frac{1}{N_B} \sum_{i=1}^{N_B} \left( \left( y_i(A) - d_i \right)^2 + \sigma^2\,\mathrm{unit.penaltyForComplexity()} \right).    (4.23)

The penalty for complexity is then stronger for a higher variance in the data set. We have been experimenting with the above proposed criteria on the artificial data set with various levels of noise. The results in Figure 4.15 (left) indicate that for low levels of noise in data, it is better to validate on both the training and the validation set.

^7 This assumption is not true for several regression data sets and for almost all classification problems.
Figure 4.15: For low noise levels in data, it is better to validate on both the training and the validation set. For noisy data, just the validation set should be used to prevent overtraining (RMS error on the testing data for various noise levels; left graph: training & validation set vs. validation set; right graph: the RMS-tr&val, R3-tr&val and RMS-p-n-tr&val criteria).

For highly noisy data, this regularization (Equation 4.20) fails and it is better to validate on the validation set only (Equation 4.19), or to use other criteria with the penalization for complexity (e.g. CR_{RMS p n tr&val}). The difference between the criterion with and without the variance considered is not significant (Figure 4.15 right). An overview of the criteria performance relative to the best results found during the experiment in Figure 4.12 can be found in Figure 4.16. The results showed that the regularization taking into account the variance of the output variable (Equation 4.23) is not better than the medium penalization for complexity (Equation 4.21 with R = 3). The problem is that we cannot say whether the variance of the output variable is caused by noise or by a complex relationship. Additional research is needed to improve the results of the adaptive regularization. At this state of the research, we recommend using the medium penalization for complexity. An even better option is when the noise level in the data set is supplied by a domain expert as external information. Then we can adjust the coefficient R from Equation 4.21 to the appropriate value.

4.6 Genetic algorithm

The genetic algorithm is frequently used to optimize the topology of neural networks [66, 89, 95]. Also in GMDH related research, recent papers [77, 7] report improving the accuracy of models by employing genetic search to find their optimal structure. In the GAME engine, we also use genetic search to optimize the topology of models as well as the configuration and shapes of transfer functions within their units. An individual in the genetic algorithm represents one particular unit of the GAME network. Inputs of a unit are encoded into a binary string chromosome. The transfer function can also be encoded into the chromosome (see Figure 4.17 and the section Evolving units below). The chromosome can also include configuration options, such as strategies utilized during the optimization of parameters.
Figure 4.16: We can observe similar results as in the previous figure. Regularization methods using both the training and validation sets are better with no noise but fail with high noise levels in the data. The R3 regularization criterion proved to perform surprisingly well for all levels of noise. The RMS-p-n criterion should be further adjusted to penalize less on low noise levels and more on high noise levels.

Figure 4.17: Chromosomes of a linear transfer unit (LinearNeuron) and of a polynomial transfer unit (CombiNeuron), each consisting of an Inputs part, a Transfer function part and a Settings part. If two units of different types are crossed, just the Inputs part of the chromosome comes into play. If two CombiNeuron units cross over, the second part of the chromosome is also involved.
Figure 4.18: GAME units in the first layer are encoded into chromosomes (Inputs, Type, transfer function and Other parts); then the GA is applied to evolve the best performing units. With the regular Genetic Algorithm (no niching), after a few epochs all selected units are connected to the most significant input and are therefore correlated. When the Niching GA with Deterministic Crowding is used instead of the basic variant of the GA, units connected to different inputs survive (the best individual is selected from each niche).

The length of the inputs part of the chromosome equals the number of input variables plus the number of units from previous layers the unit can be connected to. An existing connection is represented by 1 in the corresponding gene. The number of ones is restricted to the maximal number of the unit's inputs. An example of how the transfer function can be encoded is shown in Figure 4.17. Note that the coefficients of the transfer functions (a_0, a_1, ..., a_n) are not encoded in the chromosome (Figure 4.17). These coefficients are set separately by the optimization methods (Section 4.3). This is a crucial difference from the Topology and Weight Evolving Artificial Neural Network (TWEANN) approach [95]. The fitness of an individual (e.g. f(P1)) is inversely proportional to the criterion of regularity described in the previous section.

The application of the genetic algorithm in the GAME engine is depicted in Figure 4.18. The left schema describes the process of a single GAME layer evolution when the standard genetic algorithm is applied. Units are randomly initialized and encoded into chromosomes. Then the genetic algorithm is run. After several epochs of the evolution, individuals with the highest fitness (units connected to the most significant input) dominate the population. The best solution, represented by the best individual, is found. The other individuals (units) have very similar or the same chromosomes as the winning individual. This is also the reason why all units surviving in the population (after several epochs of evolution by the regular genetic algorithm) are highly correlated.

The regular genetic algorithm finds one best solution. We want to find also multiple suboptimal solutions (e.g. units connected to the second and the third most important input). By using less significant features we get more additional information than by using several best individuals connected to the most significant feature, which are in fact highly correlated (as shown in Figure 4.19). Therefore we employ a niching method described below. It maintains diversity in the population and therefore units connected to less significant inputs are allowed to survive, too (see Figure 4.18 right).
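As an illustration of the encoding described above, the following minimal sketch (the class and field names are ours, not the actual GAME classes) represents the inputs part of a unit's chromosome as a binary string and computes a fitness inversely proportional to an error-based criterion of regularity; the concrete criterion used here (plain validation RMS error) is a simplifying assumption.

import java.util.Arrays;

// Illustrative sketch: the inputs part of a unit's chromosome and a fitness
// that is inversely proportional to the regularization criterion (validation RMS error).
public class UnitChromosomeSketch {
    boolean[] inputs;          // gene i == true -> unit is connected to candidate input i

    UnitChromosomeSketch(boolean[] inputs) { this.inputs = inputs; }

    static double fitness(double[] predictions, double[] targets) {
        double sum = 0;
        for (int i = 0; i < targets.length; i++) {
            double e = predictions[i] - targets[i];
            sum += e * e;
        }
        double rms = Math.sqrt(sum / targets.length); // criterion: validation RMS error
        return 1.0 / (rms + 1e-12);                   // fitness ~ inverse of the criterion
    }

    public static void main(String[] args) {
        UnitChromosomeSketch u = new UnitChromosomeSketch(new boolean[]{true, false, true, false});
        System.out.println(Arrays.toString(u.inputs) + " fitness=" +
                fitness(new double[]{0.9, 0.1}, new double[]{1.0, 0.0}));
    }
}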
Figure 4.19: Fitness of unit Z is higher than that of unit C, although Z has less fit inputs.

Niching methods

The major difference between the regular genetic algorithm and a niching genetic algorithm is that in the niching GA a distance among individuals is defined. The distance of two individuals (e.g. d(p1, c1) from the pseudocode of Deterministic Crowding) can be based on the phenotypic or the genotypic difference of units. In the GAME engine, the distance of units is computed from both differences.

Figure 4.20: The distance of two units in the GAME network: Distance(P1,P2) = genotypic distance + correlation of errors. The correlation is computed from the units' deviations on the training & validation set; the genotypic distance is the sum of the normalized distance of Inputs (Hamming distance), the normalized distance of Transfer functions (Euclidean distance of coefficients) and the normalized distance of Other attributes (configuration variables).

Figure 4.20 shows that the distance of units is partly computed from the correlation of their errors and partly from their genotypic difference. The genotypic difference consists of the obligatory part - the difference in inputs; some units add the difference in transfer functions, and a difference in configurations can also be defined. Units that survive in layers of GAME networks are chosen according to the following algorithm: after the niching genetic algorithm finishes the evolution of units, a multi-objective algorithm sorts units according to their RMS error, genotypic distance and the correlation of errors. Surviving units have low RMS errors, high mutual distances and low correlations of errors. Niches in GAME are formed by units with similar inputs, similar transfer functions, similar configurations and a high correlation of errors.
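A minimal sketch of such a combined distance measure follows. Only the overall structure (a normalized Hamming distance of the inputs chromosomes plus a distance derived from the correlation of errors) follows the description above; the equal weighting of the two components and all names are our own simplifying assumptions.

// Illustrative sketch: distance of two units combining the genotypic difference of
// their input connections with the correlation of their errors on training vectors.
public class UnitDistanceSketch {
    // Normalized Hamming distance of two binary "Inputs" chromosomes.
    static double inputsDistance(boolean[] a, boolean[] b) {
        int diff = 0;
        for (int i = 0; i < a.length; i++) if (a[i] != b[i]) diff++;
        return (double) diff / a.length;
    }

    // Distance derived from the Pearson correlation of the units' errors on the
    // training & validation vectors: highly correlated units are "close" (same niche).
    static double errorCorrelationDistance(double[] errA, double[] errB) {
        int n = errA.length;
        double ma = 0, mb = 0;
        for (int i = 0; i < n; i++) { ma += errA[i]; mb += errB[i]; }
        ma /= n; mb /= n;
        double cov = 0, va = 0, vb = 0;
        for (int i = 0; i < n; i++) {
            cov += (errA[i] - ma) * (errB[i] - mb);
            va  += (errA[i] - ma) * (errA[i] - ma);
            vb  += (errB[i] - mb) * (errB[i] - mb);
        }
        double corr = cov / (Math.sqrt(va * vb) + 1e-12);
        return 1.0 - Math.abs(corr);
    }

    static double distance(boolean[] inA, boolean[] inB, double[] errA, double[] errB) {
        return inputsDistance(inA, inB) + errorCorrelationDistance(errA, errB);
    }

    public static void main(String[] args) {
        boolean[] a = {true, false, true}, b = {true, true, false};
        double[] ea = {0.10, -0.20, 0.05}, eb = {0.12, -0.18, 0.02};
        System.out.println(distance(a, b, ea, eb));
    }
}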
The next idea is that units should inherit their type and the optimization method used to estimate their coefficients. This improvement allows us to reduce the time wasted by optimizing units with an improper transfer function, or by optimization methods not suitable for the processed data.

Evaluation of the distance computation

The GAME engine enables the visual inspection of complex processes that are normally impossible to control. One of these processes is displayed in Figure 4.21. From the left we can see the matrix of genotypic distances computed from the chromosomes of individual units during the evolution of a GAME layer. Note that this distance is computed as a sum of three components: the distance of inputs, of transfer functions and of configuration variables, where the last two components are optional. A darker background color signifies a higher distance of the corresponding individuals and vice versa. The next matrix visualizes distances of units based on the correlation of their errors. A darker background signifies less correlated errors. The next graph shows the deviations of the units' outputs from the target values of individual training vectors. From these deviations the correlation is computed. The rightmost graph of Figure 4.21 shows the normalized RMS error of units on the training data. All these graphs are updated as the evolution proceeds from epoch to epoch. When the niching genetic algorithm finishes, you can observe how units are sorted (a multi-objective sorting algorithm based on Bubble sort) and which units are finally selected to survive in the layer.

Figure 4.21: During the GAME layer evolution, distances of units can be visually inspected (rows: Epoch 1 - start of the niching Genetic Algorithm, units are randomly initialized, trained and their error is computed; Epoch 3 - the niching Genetic Algorithm terminates; Sorted - units are sorted according to their RMSE, chromosome difference and the correlation). The first graph shows their distance based on the genotypic difference. The second graph derives the distance from their correlation. The third graph shows deviations of units on individual training vectors and the rightmost graph displays their RMS error on the training data.

Using this visual inspection tool, we have evaluated and tuned the distance computation in the niching genetic algorithm. The next goal was to evaluate whether the distance computation is well defined. The results in Figure 4.22 show that the best performing models can be evolved with the proposed combination of the genotypic difference and the correlation as the distance measure. The worst results are achieved when the distance is set to zero for all units. Medium accuracy models are generated by either the genotypic-difference based distance or the correlation-of-errors based distance.
Figure 4.22: The best results are obtained with the setting where the distance of units is computed as a combination of their genotypic distance and the correlation of their errors on training vectors (RMS error of the simple and weighted ensembles on the Boston testing data set, and the average, minimum and maximum RMS error of ensemble models, for the None, Genome, Correlation and Gen&Corr. distance settings).

The performance tests of the Niching GA versus the Regular GA

In Figure 4.23 there is a comparison of the regular genetic algorithm and the niching GA with the Deterministic Crowding scheme. The data set used to model the output variable (Mandarin tree water consumption) has eleven input features. Units in the first hidden layer of the GAME network have a single input, so they are connected to a single feature. The population of 2 units in the first layer was initialized randomly (genes are uniformly distributed - approximately the same number of units connected to each feature). After 25 epochs of the regular genetic algorithm, the fittest individuals (units connected to the most significant feature) dominated the population. On the other hand, the niching GA with DC maintained diversity in the population; individuals of three niches survived. As Figure 4.23 shows, the functionality of the niching genetic algorithm in the GAME engine has been proved.

When you look at Figure 4.23 you can also observe that the number of individuals (units) in each niche is proportional to the significance of the feature the units are connected to. From each niche the fittest individual is selected and the construction goes on with the next layer. The fittest individuals in the next layers of the GAME network are those connected to features which bring the maximum of additional information. Individuals connected to features that are significant, but highly correlated with features already used, will not survive. By monitoring which individuals endured in the population we can estimate the significance of each feature for the modelling of the output variable. This can be subsequently used for feature ranking (see Section 5.2.2).

We also compared the performance (the inverse of the RMS error on testing data) of GAME models evolved by means of the regular GA and the niching GA with Deterministic Crowding, respectively. Extensive experiments were executed on complex data (the Building dataset) and on small simple data (the On-ground nuclear tests dataset). The statistical test proved that at the 95% level of significance, the GA with DC performs better than the simple GA for the energy and hot water consumption.
Figure 4.23: The experiment proved that the regular Genetic Algorithm approaches an optimum relatively quickly. Niching preserves different units for many more iterations, so we can choose the best unit from each niche at the end. Niching also increases the probability that the global minimum is not missed.

Figure 4.24: RMS error of GAME models evolved by means of the regular GA and the GA with Deterministic Crowding, respectively (on the complex data; energy, cold water and hot water consumption). For the hot water and the energy consumption, the GA with DC is significantly better than the regular GA.
Figure 4.25: Average RMS error of GAME models evolved by means of the simple GA (DC off) and the GA with Deterministic Crowding (DC on), respectively (on the simple data: crater depth, crater diameter, fire radius, instant radiation, suma radiation and wave pressure). Here, for all variables, the Deterministic Crowding attained superior performance.

Figure 4.24 shows the RMS errors of several models evolved by means of the regular GA and the GA with Deterministic Crowding, respectively. The results are more significant for the On-ground nuclear dataset. Figure 4.25 shows the average RMS error of 2 models evolved for each output attribute. Leaving out the models of the fire radius attribute, the performance of all other models is significantly better with Deterministic Crowding enabled. We can conclude that niching strategies significantly improved the evolution of GAME models. The generated models are more accurate than models evolved by the regular GA, as our experiments with real world data showed.

The inheritance of unit properties - experimental results

We designed experiments to test our assumption that the configuration of the GAME engine where offspring units inherit the type and the optimization method used by the more successful parent would perform better than the configuration where the type and the optimization method are assigned randomly. We prepared configurations of the GAME engine with several different inheritance settings. In the configuration p0%, no offspring gets its type or optimization method assigned randomly. In the configuration p50%, offspring have a 50% chance to get a random type or method assigned. In the configuration p100%, nothing is inherited and all types and optimization methods are set randomly. We have been experimenting with the Mandarin, Antro and Boston data sets. For each configuration, 3 models were evolved. The maximum, minimum and mean of their RMS errors for each configuration are displayed in Figure 4.26. The results are very similar for all configurations and data sets. There is no configuration significantly better than the others. For all data sets we can observe that the p50% and the p100% configurations have slightly better mean error values and a lower dispersion of errors. We chose the p50% configuration to be the default in the GAME engine. It means offspring units have a 50% chance to get a random type and optimization method assigned; otherwise their type and methods are inherited from the parent units.
Figure 4.26: The experiments with the inheritance of the transfer function and the learning method (Mandarin, Antro and Boston inheritance tests for the p0% to p100% configurations). For all three data sets, the fifty percent inheritance level is a reasonable choice.

4.7 Evolving units (active neurons)

Input connections of units are evolved by means of the niching genetic algorithm described above. It is possible to evolve, at the same time, also the transfer functions of units and their configuration. In the current version of the GAME engine, only the CombiNeuron unit supports the evolution of its transfer function. We are working on extending the support of transfer function evolution to other GAME units as well.

CombiNeuron - evolving polynomial unit

The GAME engine in the configuration generating homogeneous models with PolySimpleNeuron or CombiNeuron units only can be classified as a Multiplicative-additive (generalized) GMDH algorithm [63]. The transfer function of a polynomial unit can be either pseudo-randomly generated (as implemented in the PolySimpleNeuron unit^8) or evolved (as implemented in the CombiNeuron unit). To be able to evolve the transfer function, we have to encode it into the chromosome first. The encoding designed for the CombiNeuron unit is displayed in Figure 4.27. The advantage of the encoding is that it keeps track of the degrees of input features (the degree field) even though particular features are disabled for some units. This encoding is also used in the proposal of the PMML GMDH standard (see Appendix A).

When the transfer function was added into the chromosome^9, it was necessary to define evolutionary operators. The mutation operator can add/delete one element of the transfer function. It can also mutate the degree of an arbitrary input feature in an arbitrary element of the transfer function. The crossover operator simply combines elements of the transfer functions of both parents. We plan to experiment with more sophisticated crossover techniques utilizing e.g. historical marking [95]. In case of noisy data, the CombiNeuron unit should be penalized for complexity to avoid the training data overfitting phenomenon (see Section 4.5).

^8 In PolySimpleNeuron units, multiplicative-additive polynomials of complexity increasing with the number of the layer in the network are generated pseudo-randomly.
^9 The Genome class was overridden by the CombiGenome class encoding both the inputs and the transfer function.
Figure 4.27: Encoding of the transfer function for the CombiNeuron unit. Each element of the polynomial is encoded by its coefficient, a used_field marking which input features are enabled, and a degree_field holding the degrees of the input features.

Figure 4.28: The Bagging approach is used to build an ensemble of GAME models: samples of the training data are drawn with replacement, a GAME model is built on each sample, and the models are then combined by simple or weighted averaging (or voting) into the GAME ensemble output.

When more types of units are enabled and the number of epochs of the niching GA together with the size of the population is small, CombiNeuron units do not have the opportunity to evolve their transfer functions, resulting in their poor performance and absence in the final models. In this case we advise reducing the number of unit types and changing the inheritance configuration to p0% (all offspring inherit their type from their parents).

4.8 Ensemble techniques in GAME

The GAME method generates models of similar accuracy on the training data set. They are built and validated on random subsets of the training set (this technique is known as bagging [45]). The models also have similar types of units and similar complexity. It is difficult to choose the best model - several models have the same (or very similar) performance on the testing data set. We therefore do not choose one best model, but several optimal models - ensemble models [22]. Figure 4.28 illustrates the principle of how GAME ensemble models are generated using bootstrap samples of the training data and later combined into a simple ensemble or a weighted ensemble. This technique is called Bagging and it helps the member models to demonstrate diverse errors on testing data.
Other techniques that promote diversity in the ensemble of models play a significant role in increasing the accuracy of the ensemble output. The diversity in the ensemble of GAME models is supported by the following techniques:

- Input data varies (Bagging)
- Input features vary (using a subset of features)
- Initial parameters vary (random initialization of weights)
- Model architecture varies (heterogeneous units used)
- Training algorithm varies (several training methods used)
- Stochastic method used (niching GA used to evolve models)

We assumed that the ensemble of GAME models would be more accurate than any of the individual models. This assumption appeared to be true just for GAME models whose construction was stopped before they reached the optimal complexity (Figure 4.29 left).

Figure 4.29: The Root Mean Square error of the simple ensemble is significantly lower than the RMS error of individual suboptimal models on testing data (left graph, cold water consumption). For optimal GAME models this is not the case (right graph, age estimation).

We performed several experiments on both synthesized and real world data sets. These experiments demonstrated that an ensemble of optimal GAME models is seldom significantly better than the single best performing model from the ensemble (Figure 4.29 right). The problem is that we cannot say in advance which single model will perform best on testing data. The best performing model on training data can be the worst performing one on testing data and vice versa. Usually, models performing badly on training data perform badly also on testing data. Such models can impair the accuracy of the ensemble model. To limit the influence of bad models on the output of the ensemble, models can be weighted according to their performance on the training data set. Such an ensemble is called the weighted ensemble and we discuss its performance below.
Contrary to the approach introduced in [37], we do not use the whole data set to determine the performances (Root Mean Square Errors) of the individual models in the weighted ensemble. Also, the optimal value of the coefficient α was experimentally determined to be 2·10^6, instead of 1 used in [37].

Figure 4.30: Performance of the simple ensemble, the weighted ensemble and individual GAME models on a very noisy data set (skeleton age estimation based on senescence indicators), on the training & validation data set and on the testing data set.

In Figure 4.30 you can see that the weighted ensemble has a tendency to overfit the data - stronger than the simple ensemble. While its performance is superior on the training and validation data, on the testing data there are several individual models performing better. The theoretical explanation for such behavior might be the following. Figure 4.31 a) shows an ensemble of two models that are not complex enough to reflect the variance of the data (weak learners). The error of the ensemble is lower than that of the individual models, similarly to the first experiment mentioned above.

Figure 4.31: An ensemble of models exhibiting diverse errors can provide a significantly better result (panel a: ensemble of two weak models; panel b: ensemble of three models of optimal complexity).

In Figure 4.31 b), there is an ensemble of three models having the optimal complexity. It is apparent that the accuracy of the ensemble cannot be significantly better than that of the individual models. The negative result of the second experiment is therefore caused by the fact that the bias of optimal models cannot be further reduced. We can conclude that by using the simple ensemble instead of a single GAME model, we can in some cases improve the accuracy of modeling. The accuracy improvement is not the only advantage of using ensembles. There is highly interesting information encoded in the ensemble behavior.
It is the information about the credibility of the member models. A later section describes how this information can be extracted. The member models approximate the data similarly, and their behavior differs outside the areas where the system can be successfully modeled (insufficient data vectors present, etc.). In well defined areas, all models give a compromise response. We use this fact for model quality evaluation purposes and for the estimation of modeling plausibility in particular areas of the input space. The applications of these techniques to real-world problems are discussed in Chapter ?? of this report.

4.9 Benchmarking the GAME engine

In this section we compare the performance of the GAME engine to the most popular NN techniques and state-of-the-art DM methods.

Internet advertisements

The Internet advertisements data set is available in the UCI archive. It has many records and many input features. To be able to experiment with this data set in the JavaNNS software [3], we used the Independent Component Analysis (ICA) [43] to reduce the number of input features.

Table 4.4: Experiments with the Internet advertisements data set in the JavaNNS software (columns: RMSE, Yes[%], No[%], Correct[%], winning MLP topology; rows: ica2.dat, ica5.dat, ica5.dat, ica7.dat, ica7.dat).

Table 4.4 shows the accuracy of the classification (MLP in the JavaNNS software) on data where 2, 5 and 7 input features were extracted from several hundreds of original binary features by means of ICA [43]. Of the several MLP network topologies tested, the best results were obtained with the winning topologies listed in Table 4.4 (numbers of neurons in layers). The best advertisement classification (87.9% accuracy) was achieved for the data set with five input features. The same data set (ica5.dat) was used in our experiments with the GAME engine. The best classification accuracy (93.13%) was achieved when the All 2^10 and All fast 5^11 configurations were used.

^10 All 2: All units enabled, ensemble of two models for both the Yes and No classes.
^11 All fast 5: All units with the analytic gradient implemented were enabled, ensemble of five models for both the Yes and No classes.
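A separate model (or ensemble) is built for the Yes and the No class; the decision rule that turns the two model responses into a class label is not spelled out at this point in the text, so the minimal sketch below assumes the common scheme in which the class whose model gives the larger response wins. It is an illustration only.

// Illustrative sketch: turning the outputs of per-class GAME models into a class label.
// The decision rule (largest response wins) is our assumption for illustration.
public class ClassDecisionSketch {
    static String classify(double yesModelOutput, double noModelOutput) {
        return yesModelOutput >= noModelOutput ? "Yes" : "No";
    }

    public static void main(String[] args) {
        System.out.println(classify(0.81, 0.27)); // prints "Yes"
    }
}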
Table 4.5: Experiments with the Internet advertisements data set (ica5.dat) in the GAME software (columns: RMSE and Accuracy of the Yes model, RMSE and Accuracy of the No model, and overall Correct classification). Overall correct classification: All 93.13%, All-fast 93.13%, Standard 92.44%, Combi 91.41%.

We also experimented with the optimal number of models in the GAME ensemble (see Figure D.14 in the appendix). The results indicate that 5 models in the ensemble is the optimum, but the experiment should be repeated several times to get more reliable results. On the Internet advertisements data set, the GAME engine generates 5% more accurate classifiers than the best classifiers from the JavaNNS software.

Table 4.6: The error of the GAME engine on each fold of the Pima data set for various configurations (rows: Std-1-tv, All-1-tv, Std-5-tv, Std-5-v, CombiR; columns: fold1 to fold10 and the average).

Pima Indians data set

The Pima Indians data set is a well known data set used in many machine learning benchmarks. It can be obtained from the UCI repository. We used 10 fold cross validation [52] to find out which configuration of the GAME engine generates more accurate classifiers. The accuracy of GAME classifiers (the percentage of correctly classified examples) for each fold can be found in Table 4.6. The winning configuration is the default in the GAME engine^12, followed by the All-1-tv configuration^13. The three top configurations validated models on both the training and the testing set. The winning configurations do not use an ensemble of classifiers, just individual models. We compared our results to the results recently published in [82]. Table 4.7 shows that the GAME engine can classify the Pima Indians data set better than other state-of-the-art soft computing methods.

^12 Units LinearNeuron, PolySimpleNeuron, ExpNeuron, SigmNeuron, SinusNeuron, GaussianNeuron, MultiGaussian and GaussNeuron enabled.
^13 All units enabled.
Table 4.7: The 10-fold cross validation classification error and the standard deviation of errors for the Pima data set (methods compared: Bayes, MLP-BP, Lazy training, Soft propagation and GAME; rows: Accuracy[%], Std.dev.[%]).

Spiral data benchmark

The Spiral data set [6] is frequently used for benchmarking machine learning methods. It describes a very difficult classification problem: two intertwined spirals are to be told apart. It is shown in [3] that simple backpropagation MLP networks are not capable of solving this problem. More advanced neural networks, such as the Cascade Correlation Network (CCN) [3], are able to solve it. We have found that the original MIA GMDH is not capable of solving this complex problem at all. The motivation of this section is to prove that GAME models solve the Spiral data set problem successfully.

Inductive methods traditionally use part of the training data set as a validation set (to decide which units have the best generalization ability). For this experiment, all training data were used to modify the weights and coefficients of GAME units, and the fitness of these units was computed from their performance on the training data. This decreases the generalization ability of the network, but in this case it is the only option how to get one hundred percent classification accuracy (few training data for a complex problem).

Figure 4.32: Two GAME networks solving the intertwined spirals problem.

Figure 4.32 depicts two GAME networks solving the intertwined spirals problem.
The dark background signifies 1 on the output of the network. All points making up one spiral are covered by the dark background, whereas all points on the second spiral are classified as 0 (white background). The majority of units in the GAME networks are small perceptron networks optimised by the BP algorithm. The average network has 12 layers with approximately four units in each layer. This signifies that solving the intertwined spirals problem is very difficult: a small perceptron network optimized by the BP algorithm is not able to solve this problem individually. GAME demonstrates the power of induction - cooperating networks interconnected in one GAME network are able to solve the Spiral problem successfully.

When the complexity of the GAME network on this benchmarking problem is compared to the complexity of the Cascade Correlation Network, we must state that GAME networks are redundant. The CCN network can solve this problem with approx. 16 hidden sigmoid neurons. The solution is to implement the CCN network as a unit of the GAME network. This would reduce the complexity of the resulting GAME network when solving difficult problems. Most important is that GAME allows solving complex multivariate problems where CCN fails because of the curse of dimensionality.

4.10 Summary

In this chapter we listed results related to the core of our research - the GAME engine. We proposed and evaluated the usefulness of using heterogeneous units in GAME models. A big range of various units is evolved and only units adapted to the nature of a data set survive. We derived and implemented an analytic gradient for several types of units. We showed that it dramatically reduces the number of error function evaluation calls and saves computational time. We also focused on various optimization methods and performed experiments showing that there is no universally applicable superior method, but the Quasi-Newton and Conjugate Gradient methods perform sufficiently well on a big range of different problems.

Experiments with the regularization of GAME units showed that a modest penalization for complexity can prevent overfitting even for highly noisy data. Penalized models are simpler and can be interpreted by a human. We showed that the niching genetic algorithm is superior to the regular genetic algorithm when generating the topology of GAME models. The polynomial transfer functions of units are evolved, too. When combined with proper regularization, we can get really simple and accurate polynomial models. We showed that the simple ensemble of GAME models is seldom better than the best of the individual models. However, it has very stable behavior and its error is one of the lowest. Therefore it is often worth using an ensemble instead of the individual model with the best performance on the training and validation data.
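Since the simple and the weighted ensemble are referred to throughout this chapter, the following minimal sketch shows one way to combine member model outputs. The weighting used here (inverse of the training RMS error, normalized to sum to one) is an illustrative assumption; the actual weighting used in GAME is based on [37] with an experimentally tuned coefficient α (Section 4.8).

// Minimal sketch of combining GAME member models into a simple or weighted ensemble.
public class EnsembleSketch {
    static double simpleEnsemble(double[] memberOutputs) {
        double sum = 0;
        for (double y : memberOutputs) sum += y;
        return sum / memberOutputs.length;              // plain average of member outputs
    }

    static double weightedEnsemble(double[] memberOutputs, double[] trainingRmse) {
        double[] w = new double[memberOutputs.length];
        double wSum = 0;
        for (int i = 0; i < w.length; i++) {
            w[i] = 1.0 / (trainingRmse[i] + 1e-12);     // better models get larger weights
            wSum += w[i];
        }
        double out = 0;
        for (int i = 0; i < w.length; i++) out += (w[i] / wSum) * memberOutputs[i];
        return out;
    }

    public static void main(String[] args) {
        double[] y = {0.42, 0.47, 0.51};
        double[] rmse = {0.10, 0.05, 0.20};
        System.out.println(simpleEnsemble(y) + " vs " + weightedEnsemble(y, rmse));
    }
}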
5 The FAKE interface and related results

In this chapter we present results related to the Fully Automated Knowledge Extraction (FAKE) interface. It is a long way from a raw data set to the knowledge that is encoded inside it. To extract the knowledge from data, several techniques are proposed and discussed in this chapter. Some limited knowledge can be extracted directly from raw data. More valuable knowledge can be accessed by means of sophisticated data mining techniques such as the GAME engine. Without data preprocessing, data mining is often impossible. If we aim at automating the knowledge extraction process, we need to automate the data preprocessing stage as well.

5.1 Automated data preprocessing

The results in this section are related to a very important part of the FAKE interface: the interface between raw data and the GAME engine (see Figure 3.1). We still have not addressed many problems from this area; the research of automated data preprocessing techniques is in an early stage.

Imputing missing values

The majority of real world data sets contain some missing values. To be able to process these data in the FAKE GAME environment, we need to deal with the missing values. We have implemented an application allowing us to impute missing values in data by the different techniques summarized below (a minimal sketch of two of these strategies is given below).

Leave out - Records containing missing values are deleted.
Replace by zero - All missing values are replaced by zeros.
Replace by mean - Missing values are replaced by the mean value of the corresponding variable (their column).
Text match similarity - If the text of the non-missing values in the record matches another similar record, the missing values are replaced by values from the similar record.
Euclid distance similarity - Missing values are replaced by values from the most similar record, based on the Euclidean distance.
Dot Product similarity - The same as the previous item, except the distance is computed as a dot product of n-dimensional vectors (where n is the number of variables or columns).

The implemented techniques can solve the missing values problem automatically, without user involvement. To examine the properties of these techniques, we designed the following experiment. The Stock market prediction data set does not contain any missing values. We artificially introduced missing values into this data set. Various mechanisms that produce missing values can be distinguished [62].
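A minimal sketch of two of the listed strategies (replace by mean and Euclid distance similarity) follows. Missing values are marked as NaN; the method names and the handling of ties are our own illustrative choices, not the exact FAKE GAME implementation.

// Illustrative sketch of two imputation strategies:
// "replace by mean" and "Euclid distance similarity" (nearest sufficiently complete record).
public class ImputationSketch {
    // Replace every missing value by the mean of its column (computed from known values).
    static void replaceByMean(double[][] data) {
        int cols = data[0].length;
        for (int c = 0; c < cols; c++) {
            double sum = 0; int n = 0;
            for (double[] row : data) if (!Double.isNaN(row[c])) { sum += row[c]; n++; }
            double mean = n > 0 ? sum / n : 0;
            for (double[] row : data) if (Double.isNaN(row[c])) row[c] = mean;
        }
    }

    // Replace missing values in `row` by values from the most similar record,
    // where similarity is the Euclidean distance over columns known in both rows.
    static void imputeFromNearest(double[] row, double[][] data) {
        double best = Double.MAX_VALUE; double[] nearest = null;
        for (double[] other : data) {
            if (other == row) continue;
            double d = 0; int shared = 0;
            for (int c = 0; c < row.length; c++)
                if (!Double.isNaN(row[c]) && !Double.isNaN(other[c])) {
                    d += (row[c] - other[c]) * (row[c] - other[c]); shared++;
                }
            if (shared > 0 && d < best) { best = d; nearest = other; }
        }
        if (nearest != null)
            for (int c = 0; c < row.length; c++)
                if (Double.isNaN(row[c]) && !Double.isNaN(nearest[c])) row[c] = nearest[c];
    }

    public static void main(String[] args) {
        double[][] data = {{1, 2}, {Double.NaN, 4}, {3, Double.NaN}};
        replaceByMean(data);
        System.out.println(java.util.Arrays.deepToString(data));
    }
}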
Figure 5.1: The performance of the imputing methods on the Stock market prediction data set with different volumes of missing values (RMS error on the complete data set for the leave out, replace by zero, replace by mean, text match similarity, euclid distance neighbour and dot product neighbour methods, compared to the RMS error of a model trained on the complete data set).

We used the MCAR (Missing Completely At Random) mechanism to produce data sets with 5%, 1%, 5% and 8% of missing values in the Stock market prediction data. MCAR [62] means that the probability that a value is missing is independent of the values of any of the variables. After that, we used the above listed imputing techniques to correct the data sets with the various degrees of missing values. Then we trained GAME ensembles on these corrected data sets. Finally, the error of the ensembles on the original Stock market prediction data is shown in Figure 5.1.

The results signify that replacement by zero is not suitable for the Stock market prediction data; it is much better to replace by the mean value. The leave out strategy is superior to the other methods up to 2% of missing values. The imputing based on the Euclidean distance gives very promising results, specifically for high percentages of missing values in the data set. It is hard to draw general conclusions from the results of this small experiment. With a data set of different properties, we would likely get different results. Our further work is to explore the relationship between the properties of a data set and the proper imputing method to use. We also plan to develop more sophisticated algorithms for the replacement of missing values.

Distribution transformation

The best distribution of data for data mining techniques should be the uniform one (as suggested in [8]). We proposed and implemented an algorithm that transforms data from whatever distribution to a distribution close to the uniform one. The inverse transformation is also possible. The transformation of the distribution should be transparent to an expert user, because it changes the physical meaning of features. Therefore we use the transformation as depicted in Figure 5.2. The transformation algorithm is based on an artificial distribution function (see Figure 5.3). The design of the distribution function is inspired by [84], where receptive fields are used for local approximations of a function; summing them together using Gaussian weights produces the final approximation.
Figure 5.2: The principle of the distribution transformation. Inputs are transformed (T) before entering the neural network (e.g. GAME) in both the training and recall phases; outputs are transformed (T) in the training phase and inversely transformed (IT) in the recall phase by the Data Transformation and Inverse Transformation modules.

Figure 5.3: The artificial distribution function (slopes and offsets of lines, and the resulting transformation function) is computed from a histogram of values in a data set.

The design of the transformation function

For each feature, we compute its own transformation function. The slope of the transformation function should be at any point equal to the probability density function. This function is unknown, but it can be approximated by the histogram of the processed data set. We divide the \langle 0, 1 \rangle range into M bins. The empirical probability in each bin, r_i = N_i / N, is computed from the number of values N_i of the feature hitting the i-th bin and the total number of data vectors N. Then we have to map the range of definition \langle 0, 1 \rangle into the \langle 0, \infty) interval. This can be realized e.g. by the tangent function:

k_i = \tan\left( r_i \, \frac{\pi}{2} \right).

The k_i is the slope of the transformation function in the center of the i-th bin. We can construct the whole transformation function as a weighted sum of lines representing the probability density of each bin. The only thing missing is the offset o_i of these lines. We can compute it step by step,
assuming that o_0 = 0 and

o_i = o_{i-1} + k_{i-1} \frac{1}{M}.

Having the offsets o_i and slopes k_i of the lines, we can finally sum them up to create the transformation function:

f'(x) = \sum_{i=0}^{M-1} (k_i x + o_i) \left\{ H\left(x - \frac{i}{M}\right) - H\left(x - \frac{i+1}{M}\right) \right\},

where H is the Heaviside function. The final transformation function must be normalized to the \langle 0, 1 \rangle range. Therefore we have to divide it by its maximal value:

f(x) = \frac{1}{f'(1)} f'(x).

Figure 5.4: The distribution of the data set plotted in the left graph is displayed in Figure 5.3. The results of the experiment with this data set were not statistically significant - in other words, the transformation had no impact on the quality of models. The reason might be that this distribution is closer to a uniform one, so the difference between the data sets is not important for the GAME engine.

Experiments with artificial data sets

From the artificial data set in Figure 5.4 left, we have synthesized the transformation function shown in Figure 5.3. You can see the slopes and offsets of lines computed from the empirical probability density of the data set. The right plot in Figure 5.4 shows the original data set transformed using our function (Figure 5.3). Histograms of both the original and the transformed data sets are displayed in Figure 5.5. The distribution of the transformed data set is almost uniform. You can also observe minor disturbances of the spiral shapes in Figure 5.4, caused by the linear segmentation of the transformation function.

We performed experiments to check whether the transformed data set with the uniform distribution improves the quality (classification accuracy) of models. Several GAME ensembles were produced for the original data set and for the transformed data set, and their classification accuracies were compared. Surprisingly, the difference was not significant. We repeated the experiments with a data set with a highly non-uniform distribution (see Figure 5.6). Here the GAME engine generated significantly more accurate models for the transformed data than for the original highly non-uniform data^1.
Figure 5.5: The histogram of the original data set (left) and the histogram of the data set transformed from the original using the artificial distribution function. You can observe that the distribution of the transformed data set is almost uniform.

Figure 5.6: It is clear that the transformation has a positive influence on the results of GAME models trained on the original and the transformed data. The result is significant on a 98% confidence level.
data than for the original highly non-uniform data 1. The Figure 5.6 shows that the mean classification accuracy for 3 GAME ensembles for the transformed data was about 4% higher than for ensembles generated on the original data.

Figure 5.7: The artificial distribution functions for input features of the Mandarin data set (Day, Time, Rs, Rn, PAR, Tair, RH, u, SatVap, VapPress, Battery).

Mandarin data set distribution transformation

The idea to apply the distribution transformation to the Mandarin data set was motivated by the highly nonlinear character of some of its input features. Particularly the distributions of the features Rs, PAR and Rn (see Figure 5.7) are highly non-uniform. On the other hand, some input features have a uniform distribution (Day, Time) and their transformation functions in the Figure 5.7 are identities. Again, we generated GAME ensembles for the original and the transformed Mandarin data set. We compared their RMS errors on testing data and the difference was not significant. To explore the properties of the transformation, we used the scatterplot matrix to project the original and transformed data (see Figure 5.8). The reason why the transformation does not improve the performance of models on the Mandarin data might be the following. The data set contains measurements of mandarin tree water consumption. During the night, the tree is not consuming any water and also the input features related to the solar radiation are close to zero (PAR, Rs). By transforming these features into the uniform distribution, we only amplify the noise around zero (Figure 5.8). Our future work is to experiment with real world data sets containing clusters. Such data sets seem to be more suitable for the transformation of distribution using our approach. We also work on smoothing the transformation function by using Gaussian kernels instead of Heaviside functions. Weak learners should benefit more from the distribution transformation, therefore we scheduled some experiments in this direction. The improved performance of GAME ensemble models on the Mandarin data set can be achieved by a better selection of the training data set by means of an intelligent data reduction.

1 Note that this sort of data transformation can hardly be achieved by standard transformation techniques such as Softmax scaling (there is more than one dense cluster in the data set).
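The construction of the transformation function described above is simple to implement. The following is a minimal sketch in Java (illustrative class and method names, not the actual FAKE GAME module; the feature values are assumed to be already normalized to the ⟨0, 1⟩ range):

    /**
     * A minimal sketch of the histogram-based distribution transformation
     * described above. Names are illustrative, not taken from the FAKE GAME code.
     */
    public class DistributionTransform {
        private final double[] k;   // slopes k_i, one per bin
        private final double[] o;   // offsets o_i, one per bin
        private final int M;        // number of bins
        private final double fMax;  // f'(1), used for normalization

        public DistributionTransform(double[] feature, int bins) {
            M = bins;
            k = new double[M];
            o = new double[M];
            // empirical probability r_i = N_i / N of the i-th bin
            int[] counts = new int[M];
            for (double x : feature) {
                int bin = Math.min(M - 1, (int) (x * M));   // x assumed to lie in <0,1>
                counts[bin]++;
            }
            for (int i = 0; i < M; i++) {
                double r = (double) counts[i] / feature.length;
                k[i] = Math.tan(r * Math.PI / 2.0);          // map <0,1> to <0,inf)
            }
            // offsets computed step by step: o_0 = 0, o_i = o_{i-1} + k_{i-1}/M
            o[0] = 0.0;
            for (int i = 1; i < M; i++) {
                o[i] = o[i - 1] + k[i - 1] / M;
            }
            fMax = raw(1.0);                                  // f'(1) for normalization
        }

        // f'(x): value of the un-normalized piecewise-linear transformation
        private double raw(double x) {
            int bin = Math.min(M - 1, (int) (x * M));
            return k[bin] * x + o[bin];
        }

        // f(x) = f'(x) / f'(1), the normalized transformation to <0,1>
        public double transform(double x) {
            return raw(x) / fMax;
        }
    }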
Figure 5.8: The scatterplot matrix of the Mandarin data generated by the Sumatra TT software [5] before and after the transformation. Some disturbances caused by the linearization of the artificial distribution function can be observed.

Data reduction

A data set can be reduced in both dimensionality and size. The dimensionality reduction is important for several data mining methods. For example, a fully connected MLP network suffers from the curse of dimensionality when too many input features and few data vectors are provided. To solve this problem, input features can be reduced by a feature selection or a feature extraction algorithm. Feature selection algorithms look for the best subset of input features in order to preserve maximum information. Other, less interesting features are removed from the data set. Feature extraction algorithms such as Principal Component Analysis (PCA) or Independent Component Analysis (ICA) project data to a low dimensional space while preserving the maximum information possible. The GAME engine selects the most interesting input features automatically, while ignoring irrelevant ones. The dimensionality reduction is therefore necessary only for data sets with hundreds of input features or more 2. A different problem is the reduction of size in the case of large data sets. The volume of vectors can be reduced by randomly selecting a subset of data vectors. This is however not the best approach. The selected data subset has to be representative. It means it should contain data vectors representing all possible system states. We have implemented a data filter in the FAKE GAME environment allowing us to choose a representative data subset. The main criterion is the value of the output variable in the record. A record is added to the subset with a probability inverse to the empirical probability

2 We are working on a version of the GAME engine allowing us to process data sets with more than a thousand input features. For such large data sets it is necessary to increase the population size and the number of epochs of the genetic algorithm evolving GAME units. Some changes in the memory structures of the application also have to be performed.
of the output variable. Later, the probability should also respect the difference in input features among records. Unfortunately, results of experiments with the data filter were not available at the time of publication of this thesis.

5.2 Knowledge extraction and information visualization

This section summarizes results related to the most important part of the FAKE interface - the interface between the GAME engine and an expert user. We aim at extracting as much knowledge as possible in a comprehensible form.

Math formula extraction

Math formulas are often used to describe the behavior of a system. The knowledge can be decoded from math formulas by users with some mathematical background. The extraction of math formulas from inductive models is depicted in the Figure 5.9. For a model with linear units, a simple linear equation can be extracted. Some formulas are too complicated to be useful.

Figure 5.9: How to extract the math equation from the GAME model (units P_i have linear transfer functions, e.g. LinearNeuron; the model output y = P_3(P_2(x_1, x_3), P_1(x_1, x_2)) is expanded into a single linear equation in x_1, x_2, x_3).

The example below is a math formula extracted from the GAME classifier built on the Pima Indians data set.

diabetes= *sin(-.91*-.395*.311*(.353*e^(-(-.381*.85* e^(-(.866* /( *age-4.236*age*age) *plasma_conc-1.778)^2/(.64)^2) *plasma_conc-1.159*diab_pedigree )^2/(.866)^2)+-.86)+.76*( *sin(-1.542*-.31*.85* e^(-(.866* /( *age-4.236*age*age) *plasma_conc-1.778)^2/(.64)^2) *serum_insulin-.258*diab_pedigree+.2))-.29*(triceps_thickness)-.171*(serum_insulin) * BPNetwork(.85*e^(-(.866* /( *age-4.236*age*age) *plasma_conc-1.778)^2/(.64)^2)+-.44,diab_pedigree)-1.26)

The only information we can gain from this equation is that diabetes is dependent on the features age, plasma_conc, diab_pedigree and serum_insulin; other features were not selected and therefore are less significant or irrelevant. We can also see that the classifier contains linear, polynomial, rational, sin, gaussian and BPNetwork units. The short equation with a few complex units signifies that the decision boundary is not very complex. For the Spiral data set, the equation would be several pages long, containing many complex units (e.g. BPNetwork).
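The principle in Figure 5.9 can be illustrated by a small sketch: every unit renders its transfer function as text and recursively expands the formulas of its inputs. This is a simplified illustration with hypothetical class names, not the actual FAKE GAME implementation:

    import java.util.Arrays;
    import java.util.List;

    /** A simplified sketch of formula extraction: each unit renders its transfer
     *  function as text and recursively expands its inputs. Names are illustrative. */
    abstract class Unit {
        abstract String toFormula();
    }

    class FeatureUnit extends Unit {
        final String name;
        FeatureUnit(String name) { this.name = name; }
        String toFormula() { return name; }
    }

    class LinearUnit extends Unit {
        final List<Unit> inputs;
        final double[] a;                       // coefficients, the last one is the bias
        LinearUnit(List<Unit> inputs, double... a) { this.inputs = inputs; this.a = a; }
        String toFormula() {
            StringBuilder sb = new StringBuilder("(");
            for (int i = 0; i < inputs.size(); i++)
                sb.append(a[i]).append("*").append(inputs.get(i).toFormula()).append(" + ");
            return sb.append(a[inputs.size()]).append(")").toString();
        }
    }

    class FormulaExtractionDemo {
        public static void main(String[] args) {
            Unit x1 = new FeatureUnit("x1"), x2 = new FeatureUnit("x2"), x3 = new FeatureUnit("x3");
            Unit p1 = new LinearUnit(Arrays.asList(x1, x2), 0.5, -0.2, 0.1);
            Unit p2 = new LinearUnit(Arrays.asList(x1, x3), 0.3, 0.7, -0.4);
            Unit p3 = new LinearUnit(Arrays.asList(p2, p1), 1.2, 0.8, 0.05);
            // prints the nested equation y = P3(P2(x1,x3), P1(x1,x2)) in terms of x1, x2, x3
            System.out.println("y = " + p3.toFormula());
        }
    }

For models composed of linear units only, the nested expression printed by this sketch can be multiplied out into a single linear equation, as in Figure 5.9; for models mixing sigmoid, gaussian or BPNetwork units the expansion quickly becomes incomprehensible, as the example above shows.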
Figure 5.10: The formula extracted from a GAME model on the Anthrop data set (the Age variable expressed as a linear combination of the features PUSB, SSPIB, SSPIC and Spain) is not in an optimal form (it can be simplified). When you copy/paste it into a cell of some Excel-like application, you can integrate it into your calculations.

The second equation was generated from the GAME model of the Age variable in the Anthrop data set. This data set is considerably noisy, so linear or simple polynomial models are of the optimal complexity. The Figure 5.10 shows the simple polynomial equation extracted from the GAME network depicted there. This formula is quite easy to interpret. When a simple equation is needed, it is possible to use strong penalization for complexity (e.g. the CombiR12 configuration) to get a simple model even for a complex problem. Of course, the accuracy of such a model will be proportional to its complexity. For complex problems it is better to extract knowledge by means of the visualization techniques introduced below.

Feature ranking

The knowledge of which input features play the most significant role in influencing the output variable is very valuable. In this thesis we propose three different feature ranking algorithms. The first technique uses the information extracted from the inductive model. It counts how many units are connected to a particular feature. The ranking of this feature is subsequently derived taking into account the attributes of the connected units (e.g. the error of units on the validation set). The algorithm is shown below in the form of pseudocode.
Feature ranking from inductive models

    function compute significance of features (Inductive model model) {
        for (j) all features do
            feature significance[j] = 0;
        layer index = model.get last layer index ();
        while (layer index >= 0) {
            layer = model.layer[layer index];
            for (i) all units in layer do {
                unit = layer.unit[i];
                for (j) all features do
                    if (feature[j].is used by (unit))
                        feature significance[j] += unit.get unit score();
            }
            layer index = layer index - 1;
        }
    }

The most important function get unit score() can be implemented in various ways. It should take into account the importance of the unit in the model. The score can be for example computed as (best error in the layer) / (unit's error on the validation set) + 1 / (result of the statistical correlation or mutual information test with other units in the layer) + ... . We have implemented this algorithm in the FAKE GAME environment [1] and it was also already implemented in the well known software for inductive modeling KnowledgeMiner [4].

Extracting significance of features from niching GA used in GAME

The second approach utilizes the information gained during the run of the niching genetic algorithm in the GAME engine. The significance of features can be extracted as follows. The number of individuals (units) in each niche is proportional to the significance of the feature the units are connected to. From each niche the fittest individual is selected and the construction goes on with the next layer. The fittest individuals in the next layers of the GAME network are those connected to features which bring the maximum of additional information. Individuals connected to features that are significant, but highly correlated with features already used, will not survive. By monitoring which individuals endured in the population, we can estimate the significance of each feature for the modeling of the output variable. This can subsequently be used for the feature extraction. In each layer of the network, after the last epoch of the GA with DC, before the best individual from each niche is selected, we count how many individuals are connected to each feature. This number is accumulated for each feature and, when divided by the sum of the accumulated numbers for all features, we get the proportional significance of each feature. The ranking of features is extracted from their proportional significance. A more detailed description of this method can be found in [A.12]. As an example we apply the above described method to the selection of features significant for dyslectic children classification. The Figure 5.11 shows the number of units connected to particular features while evolving the first layer of the GAME network. At the beginning,
Figure 5.11: The significance of each feature is correlated with the number of units connected to this feature during the evolution process.

all features had on average the same number of units connected to them. After several epochs, irrelevant features were not used any more. The number of units connected to a feature was proportional to its significance. In the next layers of the GAME network, the proportional significance of features is updated as less significant features appear in the population of the niching genetic algorithm (see Figure 5.12). In this case the six most significant features (EE1, WE3, WW3, RS3, FV2, and FV3) for dyslectic children classification were selected. Using these features, the SOM clustering of patients was performed and the separation of dyslectic patients from other subjects improved. For a detailed description of the problem see [A.7]. The proposed method of feature extraction takes into account three factors. The first is the significance of a feature for modeling the output variable. The second is the correlation of features (the distance of units in the niching GA is computed from their correlation). The third is the amount of information additional to the information carried by already extracted features. This resembles state of the art methods based on mutual information analysis. These methods extract a set of features with the highest mutual information with the output variable while minimizing the mutual information among the selected variables [78]. The third feature ranking algorithm is described later in this chapter. We are comparing the performance of our methods with the state of the art feature selection algorithm [1].

Relationship of variables

Instead of trying to extract a math formula to get some useful knowledge from complex models, visualization techniques can be successfully utilized.
Figure 5.12: The significance of each feature as it developed during the construction of the GAME inductive model for dyslectic patients. In total 9 layers were used (x-axis: features, y-axis: percent of overall connections for each feature, z-axis: inductive model layers 1 to 9).

The GAME, GMDH and neural network models are often complex and multidimensional - this significantly complicates understanding the influence of input features on the output variable. Knowledge concerning the relationship of variables cannot be extracted from such models, nor from their math description. To overcome this problem, we visualize the behavior of models. By visualization of model responses we can access the information abstracted by the model from a data set. The easiest way to visualize how the model approximates the data set is to change the values of input variables and record the output of the network.

Figure 5.13: The IO relationship scatterplot produced by GAME model sensitivity analysis.

When we vary just one input variable whereas the others stay constant, we can plot a curve (see Figure 5.13). The curve shows us the influence of the selected input variable on the output variable in the configuration specified by the other input variables. If we change the input configuration,
Figure 5.14: The projection of data vectors into the IO relationship plot.

the shape of the curve often changes too. If we vary two of the input variables whereas the others stay constant, we can plot a surface (see Figure 5.13). The surface represents the relationship between two input variables and the response of the model in the configuration defined by the constant inputs. More precisely, the curve in the Figure 5.13 expresses the relationship between the input variable x_2 and the dependent output variable y for the configuration x_1 = X1, x_3 = X3. To be able to see how models approximate the training data, we need to find a projection of training vectors to the input-output relationship plot. Each data vector V([A1, A2, A3]; Y) consists of two parts. The first is the input vector and the second is the target output for this input vector. We plot the crosses representing data vectors into the graph. The position of the cross [A2, Y] is given by the value of the vector for the dimension of intersection (x_2) and by the target value Y. We compute the Euclidean distance of the input part of each data vector V from the axis of the input space intersection. The size of the cross that represents the data vector is inversely related to its distance from the axis of intersection:

Size = 1 / max(H, Dist),   H > 0,   Dist = sqrt((A1 − X1)² + (A3 − X3)²),

where
H is a small number limiting the maximal cross size,
Size is the size of the cross in the graph,
Dist is the Euclidean distance of the vector from the axis of intersection,
A1, ..., A3 are the values of the data vector in the input dimensions,
X1, ..., X3 are the constant values of model inputs (the input configuration).

The upper edge of the curve in the right graph of the Figure 5.14 shows responses of the model for input vectors located on the axis of intersection. The thickness of the curve represents the density of data vectors in the input space. The more vectors are present in the defined neighborhood, the higher the density is.
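The projection itself is a one-line computation per data vector. A minimal sketch (illustrative names; the varied input is excluded from the distance, exactly as in the formula above):

    /**
     * A small sketch of the projection of data vectors into the IO relationship
     * plot: the cross size is inversely related to the distance of the vector's
     * input part from the axis of intersection. Names are illustrative.
     */
    public class DataVectorProjection {
        /**
         * @param vectorInputs  input part of the data vector (A1, A2, ..., An)
         * @param configuration constant values of model inputs (X1, X2, ..., Xn)
         * @param intersection  index of the input that is varied (the axis of intersection)
         * @param h             small constant limiting the maximal cross size
         * @return size of the cross representing the data vector
         */
        public static double crossSize(double[] vectorInputs, double[] configuration,
                                       int intersection, double h) {
            double sum = 0.0;
            for (int i = 0; i < vectorInputs.length; i++) {
                if (i == intersection) continue;          // the varied input does not count
                double d = vectorInputs[i] - configuration[i];
                sum += d * d;
            }
            double dist = Math.sqrt(sum);                 // Euclidean distance from the axis
            return 1.0 / Math.max(h, dist);               // Size = 1 / max(H, Dist)
        }
    }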
Figure 5.15: With data vectors displayed, the quality of models can be evaluated (responses y_1 to y_9 of nine ensemble models for the intersections x_2 = 0.1 and x_1 = 0.25; the regions covered by training data are marked).

You can assess the quality of the model simply by looking at the IO relationship graph and checking how it approximates the training or testing vectors. In the Figure 5.15, you can study the relationship of variables of an artificial problem defined as

y = 1/2 (sinh(x_1 x_2) + x_1² (x_2 − 0.5)²),

where training data vectors were distributed uniformly in the area x_1, x_2 ∈ ⟨0, 1⟩. An ensemble of 9 models was generated. Notice that their responses differ in areas where training data are absent. This property will later in this thesis be utilized for credibility estimation and also as a criterion in the search for interesting behavior of models.

Relationships in the Dyslexia data set

As an example we demonstrate relationship plots for models trained on the Dyslexia data set. The full data set (26 input variables, 3 output states corresponding to 3 classes, 49 vectors corresponding to 49 patients) was used to build 3 ensembles of inductive models. The first ensemble classified healthy patients, the second patients suffering from a reading dysfunction and the third patients with dyslexia. In the Figure 5.16 there is a relationship plot for the dyslexia output variable and the most significant input feature - reading speed. This inductive model apparently overfitted the data; there are some dyslectic patients reading faster than patients with a reading dysfunction. Therefore we decided to use just linear units in the inductive models. The classification accuracy dropped significantly, but we were able to study the influence of input variables on the output. Figure 5.17 shows that according to all linear inductive models, growing reading speed increases the probability of the patient being healthy. We showed that using relationship plots we can extract knowledge from GAME classifiers. However, these plots are more suitable for regression data sets.
Figure 5.16: Nonlinear inductive model (majority of sigmoidal units) - some healthy patients have a lower reading speed than dyslectic patients.

Figure 5.17: The group of linear inductive models. Lines are the responses of the models, crosses are data vectors mapped into the scatterplot.
Relationships in the Building data set

On the Building data set we evolved an ensemble of GAME models for each output (wbcw, wbhw, wbe). For the wbcw output variable (cold water consumption), the left plot in the Figure 5.18 shows the relationship of wbcw with the TEMP (temperature of the air) and SOLAR (solar radiation) input features for low humidity and medium wind strength.

Figure 5.18: Relationship of temperature and solar radiation with cold water consumption (left) and solar radiation with the energy consumption variable (right).

In these conditions increasing temperature clearly leads to growing cold water consumption. The change in solar radiation does not affect the output variable at all. This input feature can therefore be considered irrelevant (for low humidity and medium wind strength). For models of wbe the noise level is extremely high (see Figure 5.18). It is partly caused by the fact that information about the time of measurement was excluded from the data set. As you can see, models properly generalized the problem and avoided the overfitting. In the appendix D.6, you can see that this technique is also applicable to data sets with discrete features, although the best performance can be expected for continuous features.

Boundaries of classes

The GAME is very versatile, because generated models adapt to the character of the system modeled. In the previous section we gave a short description of GAME models used as regression models. The water consumption variable that is modeled is continuous and its relationships with input variables can be easily expressed in a polynomial form. That is the reason why many of the surviving units in GAME models have a polynomial transfer function. On the other hand, modeling the membership to a class involves a binary encoding of the output variable (e.g. 1 = member, 0 = non-member). A sigmoid function is more usable for classification tasks, because it can simply divide two groups with a decision boundary. This is the reason why the majority of successful units in GAME membership models (classifiers) are of the sigmoid type. It further complicates the extraction of math formulas from a GAME model (a formula with nested sigmoid functions is not comprehensible and cannot be simplified). We developed a visualization technique for GAME membership modeling. It is just slightly different from the one used for regression models in the previous section. Data are projected
Figure 5.19: The Pima Indians Diabetes data set [6] - crosses represent healthy/patient subjects, the dark background signifies membership to the class diabetics modeled by the GAME network.

into a two-dimensional plane as crosses or rectangles of a color indicating the membership to a particular class. The size of a rectangle is inversely proportional to the distance of the corresponding vector from the plane of intersection in the input space. To indicate responses of inductive models, we added a background color to the scatterplot planes. If the model works well, then big rectangles of a particular color should have the same color in their background. A background of a different color under a big rectangle signifies a misclassified vector. Figure 5.19 shows how the GAME model separates one class (diabetics) from another (healthy). We can clearly see the decision boundary of the model. Visual knowledge mining as well as quality evaluation of the model is straightforward.

Classification boundaries and regression plots in 3D

We have extended the visualization techniques described above into the third dimension. One extra degree of freedom can be utilized, so we can study the relationship of the output variable with two (regression) and three (classification) input features. We implemented all 3D visualization modules in 3D Java, so the user can turn, zoom and shift objects in the 3D space. This allows a better investigation of decision boundaries and regression manifolds. Figure 5.20 shows decision boundaries of two models of the classes Versicolor and Virginica of the Iris data set. You can study the behavior of these models in 3 dimensions (petal length, petal width and sepal width). Regression manifolds of models on the Iris data set and the Boston data set are shown in the Figure 5.21. The class Setoza is linearly separable from the other two as visible in the projection. With increasing crime and distance from big employment centers, the value of housing in Boston increases (Figure 5.21 right).
Figure 5.20: Visualization of 3D manifolds representing decision boundaries of a GAME model on the Iris data set. The left plot shows the decision boundary of the Virginica class. The right plot shows the decision boundary of the Versicolor class. The tube character of the boundary indicates that the sepal width feature is not very significant.

Figure 5.21: The left plot shows the behavior of the GAME model of the Setoza class from the Iris data set. You can see the decision boundary dividing member records of the Setoza class from records of the Virginica and Versicolor classes. The right plot displays the relationship of housing value, crime and distance to employment centers as modeled by the GAME network trained on the Boston data set.
GAME classifiers in the scatterplot matrix

The scatterplot matrix [12] is a popular technique for multivariate data visualisation. Data are projected into several two-dimensional graphs (axes are all pair-wise combinations of input features). Again, crosses represent data vectors and their color signifies the membership to a particular class. The GAME model is trained to separate member vectors of its class from other vectors. The dark background signifies areas where the output of the GAME model is close to 1 - all vectors in these areas are classified as members of the class modeled.

Figure 5.22: The scatterplot matrix showing the GAME network modeling the membership to the class im (the Ecoli data set).

The Figure 5.22 shows one GAME model of the im class in the Ecoli data set [6]. By looking at the scatterplot matrix graph, we can decide which scatterplot best separates the classes (axes alm2, aac) and choose the more detailed graph.

Credibility estimation of GAME models

The traditional techniques for model credibility estimation (e.g. testing set, cross validation) are not sufficient to evaluate the quality of inductive models. Inductive models often deal with irrelevant features and short noisy data sets and it is hard to estimate their credibility. The main disadvantage of a black-box model is that it generates a random output for patterns it has not been taught on. It is hard to determine whether the configuration of inputs is inside the region where the model was taught properly, or outside of it. Especially for black-box models with irrelevant or redundant inputs it is unacceptable to state that the model is invalid whenever the values of its inputs are far from the training data. Therefore we developed a technique allowing us to estimate the credibility of GAME models
for any configuration of inputs. Let us have an ensemble of GAME models for a single output variable. These models disagree where not taught properly, giving a compromise response in areas well defined by training data. Random behavior of models can also be expected far from the areas of data presence. The dispersion of responses of GAME models can give us an estimate of model credibility for any configuration of input features. The following experiments helped us to explore the relationship between the dispersion of model responses in the GAME ensemble and the credibility of models. Given a training data set L and a testing data set T, suppose that (x_1, x_2, ..., x_m, y) is a single testing vector from T, where x_1, ..., x_m are input values and y is the corresponding output value. Let G be an ensemble of n GAME models evolved on L using the Bagging technique [111]. When we apply the values x_1, x_2, ..., x_m to the input of each model, we receive the models' outputs y_1, y_2, ..., y_n. Ideally, all responses would match the required output (y_1 = y_2 = ... = y_n = y). This can be valid just for certain areas of the input space that are well described by L and just for data without noise. In most cases the models' outputs will differ from the ideal value y.

Figure 5.23: Responses of GAME models for a testing vector lying in the area insufficiently described by the training data set.

Figure 5.23 illustrates a case when a testing vector is lying in an area insufficiently defined by L. Responses of the GAME models y_1, y_2, ..., y_n significantly differ. The mean of responses is defined as µ = (1/n) Σ_{i=1}^{n} y_i, the distance of the i-th model from the mean is dy_i = µ − y_i and the distance of the mean response from the required output is ρ = µ − y (see Figure 5.23). We observed that there may be a relationship between dy_1, dy_2, ..., dy_n and ρ. If we could express this relationship, we would be able to compute for any input vector not just the estimate of the output value (µ), but also the interval ⟨µ − ρ, µ + ρ⟩ where the real output should lie with a certain significant probability.

Credibility estimation - artificial data

We designed an artificial data set to explore this relationship by means of inductive modeling. We generated 14 random training vectors (x_1, x_2, y) in the range x_1, x_2 ∈ ⟨0.1, 0.9⟩ and 2 testing vectors in the range x_1, x_2 ∈ ⟨0, 1⟩, with y = 1/2 (sinh(x_1 x_2) + x_1² (x_2 − 0.5)²). Then, using the training data and the Bagging scheme, we evolved n inductive models G by the GAME method. The density of the training data in the input space is low, therefore there are several testing vectors far from the training vectors. For these vectors the responses of GAME models considerably differ (similarly to the situation depicted in the Figure 5.23). For each testing vector, we
Figure 5.24: The dependence of ρ on dy_i is linear for artificial data without noise.

Figure 5.25: The dependence of ρ on dy_i is quadratic for real world data.

computed dy_1, dy_2, ..., dy_n and ρ. These data (x_1 = dy_1, ..., x_n = dy_n, y = ρ) we used to train a GAME model D to explore the relationship between dy_i and ρ. In the Figure 5.24 there are responses of the model D for input vectors lying on the dimension axes of the input space. Each curve expresses the sensitivity of ρ to the change of one particular input whereas the other inputs are zero. We can see that with the growing deviation of model G_i from the required value y, the estimate of ρ given by the model D increases with a linear trend. There exists a coefficient a_max that limits the maximal slope of the linear dependence of ρ on dy_i. If we present an input vector 3 to the models from G, we can approximately limit (upper bound) the maximal deviation of their mean response from the real output as

ρ ≤ (a_max / n) Σ_{i=1}^{n} |dy_i|.

Credibility estimation - real world data

We repeated the same experiment with a real world data set (mandarin tree water consumption data - see Appendix B.3 for the description). Again, a GAME model D was trained on the distances of GAME models from µ on the testing data (25 vectors). Figure 5.25 shows the sensitivity of D to the deviation of each GAME model from the mean response. Contrary to the artificial data set, the dependence of ρ on dy_i is quadratic. We can approximately limit the maximal error of the models' mean response for this real world data set as

ρ ≤ (a_max / n) Σ_{i=1}^{n} (dy_i)²,

so the credibility of the models is inversely proportional to the size of this interval.

3 We show the relationship between dy_i and ρ just on the dimension axes of the input space (Fig. 5.24), but the linear relationship was observed for the whole input space (inputs are independent), therefore the derived equation can be considered valid for any input vector.
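Once an ensemble is available, the credibility estimate is cheap to compute for any input vector. Below is a minimal sketch assuming the coefficient a_max has already been estimated (e.g. from the model D above) and using the quadratic bound derived for real world data; all names are illustrative, not the FAKE GAME API:

    /**
     * A minimal sketch of ensemble-based credibility estimation described above.
     * The coefficient aMax is assumed to be estimated beforehand; names are illustrative.
     */
    public class CredibilityEstimate {
        public final double mean;       // mu, the mean ensemble response
        public final double rhoBound;   // estimated upper bound of |mu - y|

        public CredibilityEstimate(double[] ensembleResponses, double aMax) {
            int n = ensembleResponses.length;
            double sum = 0.0;
            for (double y : ensembleResponses) sum += y;
            mean = sum / n;                              // mu = (1/n) * sum(y_i)
            double acc = 0.0;
            for (double y : ensembleResponses) {
                double dy = mean - y;                    // dy_i = mu - y_i
                acc += dy * dy;                          // quadratic bound for real world data
            }
            rhoBound = aMax * acc / n;                   // rho <= (aMax/n) * sum(dy_i^2)
        }

        public double lower() { return mean - rhoBound; }
        public double upper() { return mean + rhoBound; }
    }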
Figure 5.26: GAME models of the cold water (left) and hot water (right) consumption variable. Left: Models are not credible for low temperatures of the air. With increasing temperature, the consumption of cold water grows. Right: When it is too hot outside, the consumption of hot water drops down. Nothing else is clear - we need more data or to include more relevant inputs.

Uncertainty signaling for visual knowledge mining

To demonstrate the credibility estimation on a real world problem, we have chosen the Building data set again. On this data set we evolved an ensemble of GAME models for each output (wbcw, wbhw, wbe). Figure 5.26 left shows the relationship of the cold water consumption variable and the temperature outside the building under conditions given by the other weather variables. The relationship of the variables can be clearly seen within the area of the models' compromise response. We have insufficient information for modeling in the areas where the models' responses differ. By the thickness of the dark background in the y direction we signal the uncertainty of the models' response for particular values of inputs. It is computed according to the equation derived above for real world data:

⟨ y_wbcw − (1/n) Σ_{i=1}^{n} (dy_i)², y_wbcw + (1/n) Σ_{i=1}^{n} (dy_i)² ⟩.

The real value of the output variable should be in this interval with a significant degree of probability. In Figure 5.26 right we show that the level of uncertainty for wbhw is significantly higher than for wbcw. For these specific conditions (values of humid, solar, wind) our GAME models are credible just in a thin area where the consumption of hot water drops down. We presented a method for the credibility estimation of GAME models. The proposed approach can also be used for many similar algorithms generating black-box models. The main problem of black-box models is that the user does not know when one can trust the model. The proposed techniques allow estimating the credibility of the model's response, reducing the risk of misinterpretation of model outputs.

Credibility of GAME classifiers

Similar techniques as for the regression models can be applied to GAME classifiers. Again, a group of adaptive models is evolved for a single output variable (membership to a particular class). Each model should have 1 on the output when it classifies patterns of its member
Figure 5.27: When we multiply the responses of several GAME networks modeling the same class, we get the membership area just for those configurations of inputs where the output of all models is 1.

class, and 0 when the input vector belongs to another class. For regions far from training vectors, the output of the model is random. But the random output is usually close to 1 or 0. This behavior has the following reason. When GAME networks are used to classify data into membership classes (output 1 or 0), units with the sigmoid transfer function outweigh other types of units (e.g. polynomial), so GAME networks are mainly formed by sigmoidal units. It affects their behavior in regions far from training vectors. Far from the decision boundary the sigmoid function is either close to 1 or 0, and so are the outputs of the networks. Consider the data about apples and pears. If we evolve an ensemble of GAME classifiers for the apple class, their outputs are 1 for objects similar to apples and 0 for those similar to pears. For an object that is different from both apple and pear, each model from the group can give a different output. Some can classify it as an apple (1); some can respond that it is not an apple (0). When the outputs of all models are multiplied, the result is 1 just for objects classified as an apple by the whole group. This simple idea is extended below to filter out artifacts and unimportant information from the classification by GAME models. The Iris data set [6] is often used to test the performance of classifiers. Iris plants are to be classified into three classes (Setoza, Virginica and Versicolor) given measurements of their sepal width and length and petal width and length. We evolved three ensembles of GAME models - one ensemble for each class. Figure 5.27 shows three models from each ensemble. A dark background signifies 1 on the output of the model, a light background the 0 output. When these three models are multiplied for each class, the results can be observed in the scatterplots
Figure 5.28: Three groups of twelve GAME models for the classes Iris Setoza, Virginica and Versicolor. When all models are displayed in one scatterplot (left), regions of class membership are overlapping. The right scatterplot shows the proposed improvement, where the outputs of models within one group are first multiplied and then the result for each class is displayed.

Figure 5.29: The same plots as in the case of the Iris data set in the previous picture. Ten GAME models classifying the Advertising data set are multiplied. The behavior of the resulting classifier is sensitive to anomalies in the behavior of individual models. The majority operator instead of multiplication might be a better approach to combining individual models.

of the fourth column. Each resulting scatterplot classifies as members of the class just plants similar to those present in the training set. In the Figure 5.28, you can see how the proposed method improved the classification. Outputs of twelve GAME models for each class are displayed in one scatterplot (left). Especially plants more distant from those present in the training data are classified as members of several classes. When the twelve models for each class were multiplied first and then the results for the three classes were displayed in one scatterplot (right), the boundaries of the membership areas are clearly visible. The second example shows how this technique works on the Advertising data set (Figure 5.29). The multiplication of models should probably be changed to the majority operator. It would reduce the disturbances caused by the misbehavior of a few individual models. We are going to experiment with the properties of the multiplication and the majority operators on several real world data sets.
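The two combination schemes discussed above differ only in how the per-class ensemble outputs are aggregated. A small sketch of both operators (the 0.5 threshold used for voting is an assumed convention, not taken from the thesis):

    /** A sketch of combining per-class ensemble outputs by multiplication or majority voting. */
    public class EnsembleCombination {

        /** Product of ensemble outputs: close to 1 only where all models agree on membership. */
        public static double byMultiplication(double[] modelOutputs) {
            double product = 1.0;
            for (double y : modelOutputs) product *= y;
            return product;
        }

        /** Majority voting: robust to the misbehavior of a few individual models.
         *  Outputs are thresholded at 0.5 (an assumed convention) before voting. */
        public static double byMajority(double[] modelOutputs) {
            int members = 0;
            for (double y : modelOutputs)
                if (y >= 0.5) members++;
            return members > modelOutputs.length / 2 ? 1.0 : 0.0;
        }
    }

Multiplication shrinks the membership area to regions where all models agree, while majority voting tolerates the misbehavior of a few individual models.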
The search for interesting behavior

A very useful outcome of a neural network model is that the relationship of input and output variables can be plotted, revealing some potentially interesting information about the modeled system. However, this approach is not often used because several problems appear on a closer look. First, there is a problem with the curse of dimensionality; second, the problem of model credibility arises when the system state space is not fully covered by training data. There are also problems with irrelevant input variables, with the time needed to find some useful plot in a multidimensional state space, etc. In this section we show that all these problems can be successfully overcome using modern techniques of evolutionary computation and ensemble modeling. The result of our research is an application that is able to automatically locate interesting plots of system behavior.

Ensembling: what do we mean by interesting behavior?

Real-world data usually do not cover the whole state space of features (we do not have measurements for all possible combinations of input variables). Features are seldom independent and the correlated data are normally distributed around some cluster centers. The rest of the input state space is empty. When we use such training data to build a model, the responses of the model would be random in areas where training data are absent. The problem is that we cannot simply find the border of model credibility. If we limit the credibility just to areas of training data presence, we cannot deal with irrelevant features. Therefore we use ensemble techniques to find out areas where models are credible and their behavior is interesting.

Figure 5.30: Interesting behavior of models.

When you look at the Figure 5.30, you can see how we defined the term interesting. Each curve represents the output of one ensemble model y_i when a feature x_i is changed. The more rapid the change we observe (y_size → max), the more interesting the displayed behavior is. The second criterion for the importance of a projection is the credibility of models. We need to look for rapid changes of the output variable just within the areas where models are credible. This can be achieved by a simple assumption. The random (different) output of ensemble
models signifies that we are outside the area of credible behavior. An equal output means that all models converged to a single (credible) value that is based on training data. The second criterion can therefore be computed from the dispersion of ensemble model outputs - the envelope p should be minimized:

p = Σ_{j=x_start}^{x_start+x_size} ( max_{0<i≤m} y_i(j) − min_{0<i≤m} y_i(j) )   (5.1)

The third criterion, x_size → max, helps us to privilege bigger areas of interesting behavior.

Evolutionary search on simple synthetic data

The optimization problem of how to find optimal values of the constant features (x_j, where j ∈ ⟨1, n⟩, j ≠ i and n is the number of features), the start (x^i_start) and the size (x^i_size) of the area can be solved by evolutionary computation. We performed an experiment with synthetic data to find the best properties of the search method: the genetic algorithm [36]. The synthetic data we generated are depicted in the Figure 5.31.

Figure 5.31: Synthetic training data and ensemble models approximating it.

In the left part of the Figure 5.31 you can see the training data used to generate the ensemble of GAME models. The outputs of ensemble models (curves) are in the right part of the figure, together with the training data. To be able to use the genetic algorithm to find areas of interesting behavior, we have to construct the genotype of individuals, the genetic operators and the fitness function. In this simple problem, models have just one input x. The genotype will therefore contain just two genes (x_start, x_size). As genetic operators we employed the commonly used mutation and crossover for real valued genes - with the constraint that x_start + x_size < 1 for normalized features. The fitness function is derived from the three criteria of an interesting projection discussed above. To find a suitable fitness function, we had been experimenting with several equations. The best properties, as you can see in the Figure 5.32, has the fitness function of the following
type:

fitness = y_size · (1/p) · x_size,

where

y_size = max(ȳ(t)) − min(ȳ(t)),   t ∈ ⟨x_start, x_start + x_size⟩.

The term ȳ(t) is in fact the Simple Ensemble of models [111] defined as ȳ(t) = (1/m) Σ_{i=1}^{m} y_i(t) for t ∈ ⟨0, 1⟩. This term represents the first criterion in the fitness function. The second criterion (the term p) can be computed according to Equation 5.1 and the third criterion (x_size) is directly encoded in the genotype. (A small sketch of this fitness evaluation is given at the end of this section.)

Figure 5.32: The plot of the fitness function for all possible individuals. Darker background signifies a higher value of the fitness function.

We started the genetic algorithm and after 5 generations all individuals had the same genes (x_start = 0.4, x_size = 0.5). As you can check in the Figure 5.32, the global maximum of the fitness function was found. The individual with such properties is also shown in the Figure 5.33.

Experiments with diversity

We demonstrated that using a genetic algorithm and a well chosen fitness function, we are able to locate the area of the most interesting behavior according to the definition in the subsection above. We also want to locate other interesting areas that are not the best, but also carry very important information about the input-output relationship. For this reason we needed to promote diversity in the population of the genetic algorithm. Once the population preserves also suboptimal solutions, other interesting areas can be found. The technique allowing us to preserve diversity in a genetic algorithm is called niching [64]. To be able to say whether two individuals are diverse, a distance operator has to be introduced. We used
Figure 5.33: The individual with the highest fitness dominated the population of the genetic algorithm.

the so called genotypic distance of individuals that can be computed as d(x, z) = sqrt( Σ_{i=0}^{n} (x_i − z_i)² ), where x and z are the genes of two individuals. Figure 5.34 shows this distance (diversity) in a matrix for all individuals in the population of the standard genetic algorithm. The darker a point in the matrix is, the more diverse are the individuals with indexes corresponding to the particular row and column of the matrix. You can observe that after 5 generations there is no diversity among the individuals - all are equal to the one depicted in the Figure 5.33. Then we used a niching technique in the genetic algorithm. We employed Deterministic Crowding, which has many advantages over other niching methods [64]. The results can be observed in the Figure 5.35. After 5 generations there are still three different individuals in the population. One represents the optimal solution, the two others represent suboptimal solutions that are apparent also in the Figure 5.32. The properties of all three individuals are shown in the Figure 5.36. It is necessary to say that niching does not preserve suboptimal solutions in the population forever. It is important to stop the algorithm before the diversity of individuals drops to zero. Niching is particularly important for more complex problems, where it prevents the algorithm from early convergence and also maintains stable subpopulations.

Study with more complex synthetic data

To prove that our algorithm works also for higher dimensional data, we prepared a synthetic data set with three input variables. We generated vectors just in four clusters (see Figure 5.37) to emulate real world data with a partial definition of the input state space. In total, 16 training vectors t = (x_1, x_2, x_3, f(x_1, x_2, x_3)) were generated, where x_1, x_2, x_3 ∈ (0, 1) and f(x_1, x_2, x_3) = sin(2πx_1) + (2x_2 − 1)² + x_3. The training data were used again to generate an ensemble of GAME models. After that we executed the genetic search for interesting behavior again. For this experiment, we had to run the algorithm three times - for each feature separately. For the first feature (x_1) the genotype
Figure 5.34: Diversity in population for the standard genetic algorithm.

Figure 5.35: Diversity in population for the niching genetic algorithm (Deterministic Crowding employed).
Figure 5.36: Three solutions found by the niching genetic algorithm.

Figure 5.37: Input vectors used for generating training data are concentrated in clusters.
of individuals was (x_2, x_3, x^1_start, x^1_size); for the other two features the genotype was similar.

Figure 5.38: The best individuals after ten and fifty generations for the features x_1, x_2, x_3.

The Figure 5.38 shows the best individuals from the population after ten and fifty generations for each feature. We can observe a dramatic improvement in the quality of the projection. The resulting plots are much more interesting than the best plots in the earlier generations.

Experiments with real world data

One more time the Building data set was used as an example. On this data set we generated ensembles of GAME models for each output (wbcw, wbhw, wbe). Then we executed the genetic algorithm four times (4 features) for each output. The most important genes of the best individuals found in the 12 runs are summarised in the Table 5.1. In the Figure 5.39 there are the projections with the highest fitness (the most interesting), showing the relationship of the feature temp and the output variables wbcw, wbhw and wbe. We found out,
Table 5.1: Chromosomes of the best individuals and their fitness. Columns: Output, temp, humid, solar, wind, Fitness; one row for each of the 12 runs (outputs wbcw, wbe, wbhw for each of the four varied features).

Figure 5.39: Three most interesting plots showing the relationship of the feature temp and the outputs wbc (fitness 17.1), wbhw (fitness 38.29) and wbe (fitness 3.96).
Table 5.2: Comparison of results to the second feature ranking method in GAME.

              WBCW              WBE               WBHW
Ranking:   GAME-3  GAME-2   GAME-3  GAME-2   GAME-3  GAME-2
Temp        55%     41%      25%     26%      86%     51%
Humid       29%     31%      11%      9%       6%      3%
Solar        8%     11%      31%     41%       3%     13%
Wind         8%     15%      33%     22%       4%      4%

that the values of fitness found for each feature significantly correlate with the significance of features we derived by means of the two other techniques (Section 5.2.2) and can therefore also be used for the feature ranking [78]. In this section we presented a technique that can be used for the automatic extraction of interesting plots showing the input-output relationship of a system modeled by the GAME engine or ensemble neural networks. This technique can make data analysis much more effective. It also minimizes the need to interact with the neural network and reveals some useful knowledge hidden inside it. All experiments were performed in the FAKE GAME environment.
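The sketch promised earlier in this section shows how the fitness of one individual (one area of a projection) can be evaluated from ensemble responses sampled along the varied feature. All names are illustrative and the sampling of the models over a grid of points is assumed to be done elsewhere; this is not the FAKE GAME implementation:

    /** A sketch of the fitness evaluation used in the search for interesting behavior.
     *  responses[i][j] holds the output of ensemble model i at the j-th sample of the
     *  varied feature (the other features are fixed by the individual's genes). */
    public class InterestingnessFitness {

        public static double fitness(double[][] responses, int start, int size) {
            int m = responses.length;
            double p = 0.0;                       // envelope (Equation 5.1), to be minimized
            double yMax = Double.NEGATIVE_INFINITY, yMin = Double.POSITIVE_INFINITY;
            for (int j = start; j <= start + size; j++) {
                double max = Double.NEGATIVE_INFINITY, min = Double.POSITIVE_INFINITY, mean = 0.0;
                for (int i = 0; i < m; i++) {
                    max = Math.max(max, responses[i][j]);
                    min = Math.min(min, responses[i][j]);
                    mean += responses[i][j];
                }
                p += max - min;                   // dispersion of ensemble responses
                mean /= m;                        // Simple Ensemble response at sample j
                yMax = Math.max(yMax, mean);
                yMin = Math.min(yMin, mean);
            }
            double ySize = yMax - yMin;           // first criterion: change of the output
            if (p == 0.0) p = Double.MIN_VALUE;   // guard against division by zero
            // fitness = y_size * (1/p) * x_size; the sampled size stands in for x_size here
            return ySize * (1.0 / p) * size;
        }
    }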
6 Applications of the FAKE GAME framework

In this chapter we would like to demonstrate successful applications of the GAME method. We show that GAME outperforms commonly used techniques such as FIR filters, MLP networks, Logistic Regression classifiers, etc. in several real world applications. This fact is caused mainly by the ability of GAME networks to adapt themselves to problems of various complexity. The utilization of FAKE methods brings additional knowledge about real-world systems and reduces the time needed for data preprocessing, the configuration of the data mining tools and the analysis of their results.

6.1 Noise cancelation by means of GAME

In this section, we demonstrate a successful application of GAME in the domain of noise cancelation [11]. Noise cancelation methods are needed when dealing with noisy signals. These methods are able - more or less successfully - to filter out the noise from the useful signal. Digital filters may be categorized as Infinite (IIR) or Finite (FIR) Impulse Response Filters. The most commonly used method is FIR and its adaptive version, Adaptive FIR. These noise cancelation methods are briefly described below.

Finite impulse response filter (FIR)

FIR filters may be regarded as a more stable design compared to IIR filters due to the lack of feedback in the design. The topology of a FIR filter consists of a delay line followed by an accumulator, as shown in Figure 6.1.

Figure 6.1: The architecture of the FIR filter.

The incoming signal to the filter is applied to the delay line, resulting in a vector of samples delayed by increasing time. In our experiments we used 50 delay blocks, resulting in the same number of samples (the history of the incoming signal). Each tap consists of an incoming sample and a coefficient b_i. The coefficient is applied to the incoming sample to shape the output sample from the tap. These output samples are accumulated to produce the filter output y(n).
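The structure in Figure 6.1 translates directly into code. A minimal, non-adaptive sketch (illustrative names, not the implementation used in the experiments):

    /** A sketch of a non-adaptive FIR filter: y(n) = b_0*x(n) + b_1*x(n-1) + ... + b_M*x(n-M). */
    public class FirFilter {
        protected final double[] b;        // precalculated tap coefficients b_0..b_M
        protected final double[] delay;    // delay line holding x(n), x(n-1), ..., x(n-M)

        public FirFilter(double[] coefficients) {
            b = coefficients.clone();
            delay = new double[coefficients.length];
        }

        /** Push one input sample through the delay line and return the filter output y(n). */
        public double filter(double x) {
            // shift the delay line by one sample (x(n-M) falls out)
            System.arraycopy(delay, 0, delay, 1, delay.length - 1);
            delay[0] = x;
            double y = 0.0;
            for (int i = 0; i < delay.length; i++)
                y += b[i] * delay[i];      // accumulate the weighted taps
            return y;
        }
    }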
In non-adaptive filters, the coefficients are precalculated and are built into the filter design. In adaptive filters these coefficients need to be calculated at run time.

Replacing FIR by the GAME network

When you look at the design of the FIR filter, it strongly resembles a simple inductive model (single layer, linear transfer units with one input). Our question was whether the more complex architecture of the filter would be more effective or not. Therefore we performed the following experiment. We replaced the FIR filter's multiplication and additive elements by the GAME network. The inputs of the network are samples of the original signal delayed by increasing time (see Figure 6.2). The output of the network gives the filtered signal. The network is trained in the same way as we precalculate the coefficients for the FIR filter.

Figure 6.2: The GAME network functioning as a filter. L are units with a linear transfer function, P are units with simple polynomials (usually with degree not higher than 2).

We use a so called reference signal - this is the signal without noise. To get the input signal we can add some artificial noise - with characteristics similar to the noise we want to filter out - and this signal will be provided to the input of the filter. The proper values of the FIR coefficients can be computed by means of some simple optimization method (usually the Least Squares Method) when the input signal and the reference signal are provided to the input and to the output of the FIR filter, respectively. The goal of the optimization method is to minimize the energy given as

E = (1/N) Σ_{n=0}^{N} e²(n) = (1/N) Σ_{n=0}^{N} [δ(n) − y(n)]²   (6.1)

where δ(n) is the sample of the reference signal and y(n) is the sample of the output signal, which is given as y(n) = b_0 x(n) + b_1 x(n−1) + ... + b_M x(n−M) for the FIR filter depicted in the Figure 6.1. The coefficients b_i can be iteratively estimated using the LMS method:

b_i(n) = b_i(n−1) + α e(n−1) x_i(n−1).   (6.2)
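Continuing the sketch above, the adaptive variant updates the coefficients at run time according to Equation 6.2 (again only an illustrative sketch; α is the learning rate and e(n) = δ(n) − y(n)):

    /** A sketch of an adaptive FIR filter whose coefficients are updated by the LMS rule
     *  b_i(n) = b_i(n-1) + alpha * e(n-1) * x_i(n-1), with e(n) = delta(n) - y(n). */
    public class AdaptiveFirFilter extends FirFilter {
        private final double alpha;        // learning rate of the LMS method
        private double previousError;      // e(n-1)
        private double[] previousTaps;     // x_i(n-1), the delay line from the previous step

        public AdaptiveFirFilter(int taps, double alpha) {
            super(new double[taps]);       // coefficients start at zero
            this.alpha = alpha;
            this.previousTaps = new double[taps];
        }

        /** Filter one sample and adapt the coefficients using the reference sample delta(n). */
        public double adapt(double x, double delta) {
            // LMS update based on the previous step
            for (int i = 0; i < b.length; i++)
                b[i] += alpha * previousError * previousTaps[i];
            double y = filter(x);          // shifts the delay line and computes y(n)
            previousError = delta - y;     // e(n) = delta(n) - y(n)
            previousTaps = delay.clone();  // remember x_i(n) for the next update
            return y;
        }
    }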
Equation 6.2 is derived and described in more detail in [74].

Experiment with synthetic data

To test the ability of the GAME network to work as a filter, we synthesized artificial signals. The signal was sampled with a frequency of 11 kHz. The frequency response of the first signal (the reference signal) is depicted in the Figure 6.3 left. This signal was mixed with an artificial noise, producing the input signal with the frequency response depicted in the Figure 6.3 right.

Figure 6.3: Frequency response of the reference signal (left) and the input signal (right), which was generated from the reference signal by adding noises of frequencies 5 Hz, 1.7 kHz and 3.3 kHz.

It is apparent that the frequency characteristics of the noise and that of the reference signal are not overlapping. In this case filtering is simpler than for problems where the noise frequency overlaps with the reference signal. We used the GAME simulator to build the model (filter) where the inputs were 50 samples of the delayed input signal and the output was the actual value of the reference signal. The training set consisted of several thousand vectors produced by the sliding window method applied to the input and reference signals. The results of filtering are again best visible in the frequency domain. When you compare the graphs in the Figure 6.4, it is apparent that the GAME network preserved the reference signal much better than the FIR filter (compare with the reference signal - the left graph in the Figure 6.3). This result is also confirmed by graphs showing the transfer and phase characteristics of the GAME and FIR filters (see Figures D.2, D.3 in the appendix). The GAME networks which were generated had just three layers on average and approximately 10 units with linear and low order polynomial transfer functions. We were surprised that the more complex units did not assert themselves at all. Moreover, just a few inputs were used (6 on average) compared to all 50 in the case of the FIR filter. A very interesting piece of information is how important the individual inputs (delayed samples of the signal) are for the filtration. Figure 6.5 left shows the importance of inputs derived by the GAME feature ranking algorithm no. 2 (see Section 5.2.2). The most important input is the one with delay 1, then with delay 12, 23, etc. The period of the important inputs is 11 delay elements
Figure 6.4: The results of filtering show that the signal filtered by the GAME network (left) corresponds better to the reference signal (Fig. 6.3 left). The signal filtered by the FIR filter successfully canceled the noise, but also distorted the useful signal.

- this corresponds with the frequency of the useful signal, 1 kHz (the signal was sampled at 11 kHz). The same period can be observed in the right graph of the Figure 6.5. The values of the FIR filter coefficients plotted in the graph also reflect the importance of inputs. When a coefficient is zero, the corresponding input (tap) has no effect in the filtration process. The bigger the coefficient, the more impact it has on the output of the filter, so the input can be considered important. When we compare the results of the GAME network and the FIR filter in this experiment, we can conclude that GAME networks used as a filter:
- use far fewer inputs (taps) than the FIR filter (6 in comparison with 50),
- provide better filtration results (the noise is canceled while the useful signal is better preserved than in the case of the FIR filter),
- need more time to train their parameters (the LMS used by the FIR filter is several times faster compared to the Genetic Algorithm and sophisticated optimization techniques such as the Quasi-Newton Method).
In general we can say that the GAME network proved to be significantly better than the FIR filter in this experiment. It is mainly due to the fact that the optimization techniques used by GAME are much more effective than the LMS method. If you need a non-adaptive filter, where the coefficients are trained in the beginning and later stay constant, then the GAME network would be a good choice. It will generate a structure which is much simpler and far more effective than FIR and can be embedded into hardware. In case you need an adaptive filter, which is able to adjust its coefficients during filtration, the GAME network is not the right choice (GAME would have to be modified to allow incremental training so that the coefficients can be updated in real time - see [74]).
Figure 6.5: The significance of the inputs obtained by the GAME feature ranking algorithm (left). The right graph shows the absolute values of the FIR coefficients b_i. Both graphs are similar, and the importance of the features depends on the period of the useful signal.

6.2 Sleep stages classification using the GAME engine

We used the GAME engine for sleep stages classification and compare its results to well-known classification methods implemented in the data mining tool Weka [11].

GMDH for classification purposes

The drawback of the GMDH is that it restricts the transfer function of units (neurons) to be of polynomial type. Even recent improvements of the theory, such as Polynomial Neural Networks (PNN) [76, 7] and the GMDH with active neurons implemented in KnowledgeMiner [69, 6], use polynomial transfer functions (although of different types) for all units in the model. This limitation leads to a poor classification ability of GMDH models. Polynomials are well suited for regression, where the mutual relationship of the system variables has a smooth character. To represent the decision boundaries needed for classification problems, polynomial functions are not a good choice: even high-order polynomials oscillate when a constant (0 or 1) output is needed to express class membership. The advantage of the GAME engine is that it can build the model (see Figure 6.6) from units that are more suitable for expressing the classification boundary (logistic transfer function, etc.), as the short sketch below illustrates.
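The following Java sketch demonstrates the contrast described above: a logistic unit saturates towards the constant 0/1 levels required for class membership, while a polynomial with fixed coefficients keeps changing its value over the same range. All coefficient values here are illustrative, not taken from any GAME model.

/** Compares a logistic transfer function with a polynomial one on a 1-D input. */
public final class TransferFunctions {

    /** Logistic (sigmoid) transfer function y = 1 / (1 + exp(-(a*x + b))). */
    static double sigmoid(double a, double b, double x) {
        return 1.0 / (1.0 + Math.exp(-(a * x + b)));
    }

    /** Polynomial evaluated by the Horner scheme; coef[i] is the coefficient of x^i. */
    static double polynomial(double[] coef, double x) {
        double y = 0.0;
        for (int i = coef.length - 1; i >= 0; i--) {
            y = y * x + coef[i];
        }
        return y;
    }

    public static void main(String[] args) {
        double[] polyCoef = {0.5, 0.9, 0.0, -0.4};   // illustrative coefficients
        for (double x = -2.0; x <= 2.0; x += 0.5) {
            System.out.printf("x=%5.2f  sigmoid=%.3f  polynomial=%.3f%n",
                    x, sigmoid(6.0, 0.0, x), polynomial(polyCoef, x));
        }
    }
}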
Data acquisition and preprocessing

The data we process in our work come from the MIT PhysioBank database [35], which offers all-night sleep recordings of several patients. The advantage of these data is that they have expert-assigned sleep stages. The stages were assigned to the signal according to the sleep scoring manual [81], which is widely used for sleep stage classification around the world.

Figure 6.6: An example of the GAME classifier of the REM sleep stage, built from features such as EEG_Fpz-Cz_cwt_MEAN, EOG_horizontal_windowed_FFT_band_alpha, EOG_horizontal_hjorth_complexity, EMG_submental_fft_VAR and EMG_submental_SPECTRAL_CENTROID. A niching genetic algorithm evolves the units in layers. Two types of units are used in the model: sigmoidal units S_x (logistic transfer function), e.g. S_1 operating on EMG_submental_fft_VAR, and linear units L_x, e.g. L_5 combining the output of S_1 with EOG_horizontal_hjorth_complexity. Units with other transfer functions did not survive the evolution. The surviving units were optimized by the Quasi-Newton method and the SADE genetic algorithm.

The sleep scoring manual [81] defines six actual sleep stages: Wake, Stage 1, Stage 2, Stage 3, Stage 4 and Stage REM. All of these stages are present in the signal classification. In addition, the recordings contain two artifact stages, Patient motion and Unclassified intervals. The first step in preprocessing was to remove these artifact stages.

Classification of sleep stages

For the classification of the Sleep stages data we used several different classification methods. The methods taken from the Weka software are described in detail in [11] or in [28].

The configuration of the GAME engine

We use three different configurations. All configurations validate on both the training set and the validation set, so that overfitting becomes apparent for the Strong configuration. The first configuration, which we call Strong, enables all units: units with linear, polynomial, exponential and sigmoid transfer functions and neurons with a built-in perceptron neural network 1. Their parameters were left at their default values.

1 Strong configuration, units enabled: LinearNeuron, CombiNeuron, PolyHornerNeuron, PolySimpleNeuron, PolySimpleNRNeuron, ExpNeuron, SigmNeuron, PolyFracNeuron, BPNetwork, NRBPNetwork.
Data / Classifier     Bayes Net   Decision tab.   J48 Tree   Naïve Bayes   S. Logistic   GAME
All features          31.86%      28.86%          33.7%      39.68%        26.45%        19%
CfsSubset BestFirst   26.45%      3.86%           28.26%     31.66%        26.65%        24%
CfsSubset Genetic     26.21%      24.74%          25.79%     24.74%        27.88%        2%
χ2 Ranker             28.42%      26.95%          22.74%     36.84%        25.89%        22%
GainRatio Ranker      29.65%      26.99%          26.58%     37.22%        23.72%        18%
InfoGain Ranker       27.84%      28.95%          29.18%     34.52%        26.73%        2%
Game ranking          24.53%      25.79%          26.21%     31.24%        22.1%         16%

Table 6.1: Error rate achieved by the classifiers for Sleep stage 4.

The second configuration we call Medium. The enabled units are units with linear, polynomial, exponential and sigmoid transfer functions 2. We also adjusted the number of epochs for each layer. The third configuration we call Weak; in this configuration we enable only the simplest units with sigmoid, linear and low-order polynomial transfer functions 3. All other parameters were left at their default values.

The configuration of WEKA methods

The settings of the classification methods used in our experiments - Decision Table, J48 Decision Tree, Simple Logistic, Support vector machine (SMO in Weka), Naïve Bayes Classifier and Bayes Net - were left at their default values. To prevent overfitting of the classifiers implemented in Weka, we used cross-validation with 10 folds (a minimal sketch of this evaluation through the Weka Java API is given below).

Comparison of different methods

We experimented with several different input data sets. Each data set contains a subset of input features selected by a different feature ranking method. Our results show that the best classifier is the GAME engine; it achieves the best accuracy for all sleep stages. Second in order is the Simple Logistic classifier from the Weka software (see Table 6.1).

2 Medium configuration, units enabled: LinearNeuron, CombiNeuron, PolyHornerNeuron, PolySimpleNeuron, PolySimpleNRNeuron, ExpNeuron, SigmNeuron, PolyFracNeuron.
3 Weak configuration, units enabled: LinearNeuron, PolySimpleNeuron, SigmNeuron.
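The sketch below shows how a 10-fold cross-validation of one of the compared classifiers (Simple Logistic) can be run through the Weka Java API. The file name and the assumption that the class attribute is the last column are illustrative; the thesis experiments were run through the Weka GUI with default parameters.

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.functions.SimpleLogistic;
import weka.core.Instances;

public class SleepStageCrossValidation {
    public static void main(String[] args) throws Exception {
        // Load the sleep-stage feature table (ARFF export assumed)
        Instances data = new Instances(new BufferedReader(new FileReader("sleep_stage4.arff")));
        data.setClassIndex(data.numAttributes() - 1);        // class = last attribute

        SimpleLogistic classifier = new SimpleLogistic();     // default parameters
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(classifier, data, 10, new Random(1));  // 10 folds
        System.out.printf("Error rate: %.2f%%%n", eval.pctIncorrect());
    }
}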
                     Strong configuration      Medium configuration      Weak configuration
Data set             Training / Testing        Training / Testing        Training / Testing
All features         6.9%  / 23.7%             13.1% / 21.4%             16.1% / 23.7%
BestFirst+Cfs        4.7%  / 25.2%             2.5%  / 21.9%             2.5%  / 23.5%
InfoGain+Ranker      2.8%  / 27.5%             14.4% / 22.5%             17.7% / 25.9%
GeneticSearch+Cfs    3.3%  / 2.8%              18.5% / 18.5%             2.5%  / 2.8%
GAME                 4.9%  / 2.2%              13.5% / 17.8%             14.2% / 17.1%
GainRatio+Ranker     3.3%  / 23.5%             16.4% / 2.4%              16.6% / 2.3%
Chi2+Ranker          2%    / 3%                16.2% / 25.9%             17.3% / 29.2%

Table 6.2: Percentage of misclassified instances for the three GAME configurations. Although the Strong configuration can reduce the error on the training set greatly, its generalization ability is slightly worse than that of the Medium configuration.

Experiments with GAME configurations

We divided the data sets for Sleep stage 4 into a training set and a testing set; the training set contained 80% of all instances and the testing set the remaining 20% (a minimal sketch of such a random split is given at the end of this subsection). Three configurations of the GAME neural network were used, referred to as Strong, Medium and Weak; their exact settings can be found above (Section 6.2.3). The results are shown in Table 6.2.

The experiment showed that the Strong configuration overfits the data. The classification errors on the training sets are around 4%, while the errors on the testing set range from 20% to 30%. This difference indicates that all classifiers trained with this configuration are overfitted (partly due to the validation on both the training and the validation set). The Medium configuration gives much better results: the classification error on the training set almost matches the classification error on the testing set, which means that the classifiers are trained correctly and do not overfit. The Medium configuration was therefore used in the comparison of the different classifiers (Table 6.1). The last configuration, Weak, does not overfit either, but its disadvantage is a much worse classification accuracy than that of the Medium configuration.

The GAME engine outperformed all other classifiers (Decision Table, J48 Decision Tree, Simple Logistic, Support vector machine (SMO in Weka), Naïve Bayes Classifier and Bayes Net) on the Sleep stages classification data.
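The random 80/20 split mentioned above can be sketched as follows; the method operates on instance indices only and its names are illustrative.

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

/** Random split of instance indices into a training part and a testing part. */
public final class TrainTestSplit {
    public static List<List<Integer>> split(int numInstances, double trainFraction, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < numInstances; i++) indices.add(i);
        Collections.shuffle(indices, new Random(seed));
        int cut = (int) Math.round(trainFraction * numInstances);
        List<List<Integer>> parts = new ArrayList<>();
        parts.add(indices.subList(0, cut));              // training indices (e.g. 80 %)
        parts.add(indices.subList(cut, numInstances));   // testing indices (e.g. 20 %)
        return parts;
    }
}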
7 Summary and conclusions

The FAKE GAME framework proposed in this thesis aims at simplifying the knowledge discovery process. The core of the framework is the GAME engine, which enables automated data mining from diverse data sets. Within the FAKE interface we proposed several methods focused on automated data preprocessing and knowledge extraction.

Specifically, we showed that hybrid GAME models perform better than uniform ones; moreover, hybrid models adapt to the problem being solved without the need for parameter adjustment.

The analytic gradients we derived for GAME units significantly reduce the number of error evaluation calls needed to reach the optimum. The performance of the optimization methods was improved: they are able to locate a proper parameter setting of GAME units in fewer steps and more accurately. We also compared the performance of several optimization methods on diverse real-world problems. The experiments performed indicate that the Quasi-Newton method is superior (both faster and more accurate) to all other methods on several real-world data sets. Among the nature-inspired methods, Differential Evolution showed very good properties; the big disadvantage of the nature-inspired methods was their slower convergence, caused by ignoring the gradient of the error surface.

Some GAME units evolve their transfer functions, and regularization (the external criterion) prevents them from overfitting noisy data. This arrangement further contributes to automated data mining. A niching scheme was employed in the genetic algorithm evolving the units; it maintained diversity among units and the accuracy of the resulting models increased.

Several GAME models are combined in an ensemble. The ensemble response is more reliable and often more accurate than the best of the single models, and the compromise (agreement) of the member models' responses can indicate well-trained regions. Models generated by the GAME engine were benchmarked on several data sets and outperformed all other methods. The GAME engine outperformed the FIR filter and also the classifiers from the WEKA environment on the Sleep stages classification problem. With regularization disabled, GAME also solved the very hard two intertwined spirals problem.

The FAKE interface consists of modules for automated data preprocessing and for knowledge extraction support. We compared the performance of several imputation methods; replacement of missing values by Euclidean-distance neighbors achieved very promising results. Transformation of the data using an artificial distribution function significantly improved the accuracy of GAME models on a simple synthetic data set. For real-world data the improvement was not significant, so this problem remains a challenge for our future research. The math formulas extracted from regularized CombiNeuron GAME models can be serialized into simple polynomial equations suitable for knowledge extraction. Three novel algorithms for feature ranking were proposed, consistently estimating the significance of input features. Regression plots allow studying the relationship of variables under particular conditions; they incorporate information about the behaviour of GAME models in the neighborhood of data vectors.
Classification plots enable studying the decision boundaries of the classes estimated by the models. Interactive 3D regression (classification) plots are helpful when there is more than one (two) important feature in the modeled system. For multivariate data sets with many significant features, we proposed a scatterplot matrix enriched by information on the classification boundaries of the models.

The credibility of the models can be estimated from the behavior of the GAME ensemble; it was experimentally derived to be inversely proportional to the dispersion of the ensemble member models' responses. We used an ensemble of classifiers, where the member model responses are multiplied, to visualize only the credible areas of class membership and to filter out unimportant information (the random behavior of models outside the credible areas). To locate interesting regression plots in multidimensional space automatically, we designed and implemented a genetic search with a specific fitness function. This technique proved to perform well when applied to a real-world problem.

All these results allowed building the first version of the FAKE GAME environment within the framework proposed in this thesis. We are currently working on improving the user friendliness of the environment, so that it can be released as open source software.
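The credibility estimate summarized above can be sketched as follows, under the assumption that credibility is taken as inversely proportional to the standard deviation of the member models' responses at a given point; the constant k and the small epsilon are illustrative.

/** Credibility of an ensemble output, inversely proportional to the
 *  dispersion of the member models' responses at the evaluated point. */
public final class EnsembleCredibility {
    public static double credibility(double[] memberResponses, double k) {
        double mean = 0.0;
        for (double r : memberResponses) mean += r;
        mean /= memberResponses.length;

        double var = 0.0;
        for (double r : memberResponses) var += (r - mean) * (r - mean);
        double std = Math.sqrt(var / memberResponses.length);

        return k / (std + 1e-9);      // small epsilon avoids division by zero
    }
}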
8 Suggestions for the future work

We are going to improve the GAME engine to be more versatile (able to model data with very different attributes) and easier to use. Configuration options in the FAKE GAME environment will be eliminated in the standard mode: pressing a single green button will run automated preprocessing, and an ensemble of GAME models will then be generated on the preprocessed data. FAKE methods will prepare information from the GAME models for knowledge extraction; the results of modeling will be summarized in an HTML file (pictures of the most interesting behavior found by the GA, feature ranking, error on the testing set, the derived math formula, estimates of model plausibility, etc.). To fulfill these goals we need to:

- Implement more types of units (evolving perceptron, cascade correlation unit, etc.) and improve the optimization methods to make the GAME engine more versatile.
- Evolve the parameters of the optimization methods.
- Experiment with regularization criteria with respect to the level of noise in the data.
- Implement the GAME booster and the Negative Correlation method to improve the accuracy of the ensemble.
- Compare GAME with state-of-the-art modeling methods on more benchmarking problems.
- Improve the visualization in three dimensions (Java 3D), which appears to be very attractive.
- Automate the data preprocessing step; preprocessing can never be fully independent of user assistance, but many preprocessing steps already run independently in the GAME software.
- Design and implement a genetic search for interesting classification plots, 3D regression manifolds, etc.
121 SECTION 9. BIBLIOGRAPHY 17 9 Bibliography [1] The fake game environment for the automatic knowledge extraction. available online at: September 26. [2] The gmdh website. September 26. [3] The javanns simulator software. available at [4] The knowledgeminer software. Available at September 26. [5] The sumatra tt data preprocessing tool. available online at September 26. [6] Uci machine learning repository. available at mlearn/mlsummary.html, September 26. [7] Weka open source data mining software. available online at September 26. [8] The yale open source learning environment. available online at September 26. [9] R. Abdel-Aal. Gmdh-based feature ranking and selection for improved classification of medical data. Journal of Biomedical Informatics, 38: , April 25. [1] R. Abdel-Aal. Improving electric load forecasts using network committees. Electric Power Systems Research, (74):83 94, 25. [11] K. Adeney and M. Korenberg. An easily calculated bound on condition for orthogonal algorithms. In IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN ), volume 3, page 362, 2. [12] M. Ankerst. Visual Data Mining. PhD thesis, University of Munich, 2. [13] M. Ankerst. Visual Data Mining with Pixel-oriented Visualization Techniques. The Boeing Company, Seatle, USA, 21. [14] M. Ankerst, D. Keim, and H. Kriegel. Circle segments: A technique for visually exploring large multidimensional data sets. In Visualization 96, Hot Topic Session, San Francisco, CA, [15] D. Asimov. The grand tour: a tool for viewing multidimensional data. SIAM Journal on Scientific and Statistical Computing, 6: , [16] Avriel and Mordecai. Nonlinear Programming: Analysis and Methods. Dover Publishing, 23.
122 18 SECTION 9. BIBLIOGRAPHY [17] R. L. Barron, A. N. Mucciardi, F. J. Cook, J. N. Craig, and A. R. Barron. Adaptive Learning Networks: Development and Application in the United States of Algorithms Related to GMDH, volume Self-Organizing Methods in Modeling: GMDH Type Algorithms, pages Marcel Dekker, New York, s.j. farlow edition, [18] P. Bearse and H. Bozdogan. Subset selection in vector autoregressive models using the genetic algorithm with informational complexity as the fitness function. Systems Analysis, Modelling, and Simulation (SAMS), 31:61 91, [19] V. P. Belogurov. A criterion of model suitabilityfor forcasting quantitative processes. Soviet Journal of Automation and Information Sciences, 23(3):21 25, 199. [2] G. Bilchev and I. C. Parmee. The ant colony metaphor for searching continuous design spaces. In Selected Papers from AISB Workshop on Evolutionary Computing, pages 25 39, London, UK, Springer-Verlag. [21] C. Blum and K. Socha. Training feed-forward neural networks with ant colony optimization: An application to pattern classification. In Proceedings of Hybrid Intelligent Systems Conference, HIS-25, pages , Los Alamitos, CA, USA, 25. IEEE Computer Society. [22] G. Brown. Diversity in Neural Network Ensembles. PhD thesis, The University of Birmingham, School of Computer Science, Birmingham B15 2TT, United Kingdom, January 24. [23] R. Browse, D. Skillicorn, and S. McConnell. Using competitive learning to handle missing values in astrophysical datasets. Technical Report 22458, Department of Computing and Information Science Queen s University, Kingston, [email protected], 22. [24] T. Chemaly and C. Aldrich. Visualization of process data by use of evolutionary computation. Computers and Chemical Engineering, (25): , 21. [25] Dasu and Johnson. Exploratory Data Mining and Data Quality. John Wiley and Sons, 24. [26] J. Drchal. Evolution of recurrent neural networks. Master s thesis, Czech Technical University in Prague, 26. [27] J. Drchal, A. Kučerová, and J. Němeček. Using a genetic algorithm for optimizing synaptic weights of neural networks. Technical Report 7(1): , Czech Technical University in Prague, FEE, CTU Prague, Czech Republic, 23. [28] R. Duda, P. Hart, and D. Stork. Pattern Classification. Wiley-Interscience, 21. [29] J. F. Elder. The generalization paradox of ensembles. Journal of Computational and Graphical Statistics, 12(4): , 23. [3] S. E. Fahlman and C. Lebiere. The cascade-correlation learning architecture. Technical Report CMU-CS-9-1, Carnegie Mellon University Pittsburgh, USA, 1991.
123 SECTION 9. BIBLIOGRAPHY 19 [31] S. J. Farlow. Self-Organizing Methods in Modeling: GMDH Type Algorithms. Marcel Dekker, New York, USA, [32] U. Fayyad, G. Shapiro, and P. Smyth. From data mining to knowledge discovery in databases. AI Magazine, 17(3):37 54, [33] Y. Freund and R. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In Proceedings of the Second European Conference on Computational Learning Theory, pages Springer Verlag, [34] Gilbar and C. Thomas. A new GMDH type algorithm for the development of neural networks for pattern recognition. Isbn: , FLORIDA ATLANTIC UNI- VERSITY, 22. [35] A. Goldberg, L. Amaral, L. Glass, J. Hausdorff, P. Ivanov, R. Mark, J. Miteus, C. Peng, and S. H. Physiobank, physiontoolkit and physionet: Components of a new research resource for complex physiologic signals. Circulation, 11(23):e215 e22, 2. [36] D. Goldberg. Genetic Algorithms in Search. In [36], [37] P. Granitto, P. Verdes, and H. Ceccatto. Neural network ensembles: evaluation of aggregation algorithms. Artificial Intelligence, 163: , 25. [38] J. Han and M. Kamber. Data Mining: Concepts and Techniques. Intelligent Database Systems Research Lab, School of Computing Science,Simon Fraser University, Canada, [39] L. Hansen and P. Salamon. Neural network ensembles. IEEE Trans. Pattern Anal. Machine Intelligence, 12(1):993 11, 199. [4] S. Hauger. Ensemble learned neural networks using error-correcting output codes and boosting. Master s thesis, University of Surrey, Guildford, Surrey, GU2 7XH, UK, 23. [41] J. Holland. Adaptation in Neural and Artificial Systems. University of Michigan Press, [42] O. Hrstka and A. Kučerová. Improvements of real coded genetic algorithms based on differential operators preventing premature convergence. Advances in Engineering Software, 35(3-4): , March-April 24. [43] A. Hyvärinen, J. Karhunen, and E. Oja. Independent Component Analysis. John Wiley & Sons, 21. [44] A. Inselberg. The plane with parallel coordinates. The Visual Computer, 1:69 91, [45] M. Islam, X. Yao, and K. Murase. A constructive algorithm for training cooperative neural network ensembles. IEEE Transitions on Neural Networks, 14(4), July 23. [46] A. Ivakhnenko, E. Savchenko, and G. Ivakhnenko. Gmdh algorithm for optimal model choice by the external error criterion with the extension of definition by model bias and its applications to the committees and neural networks. Pattern Recognition and Image Analysis, 12(4):347353, 22.
124 11 SECTION 9. BIBLIOGRAPHY [47] A. G. Ivakhnenko. Polynomial theory of complex systems. IEEE Transactions on Systems, Man, and Cybernetics, SMC-1(1): , [48] A. G. Ivakhnenko, G. Ivakhnenko, and J. Muller. Self-organization of neural networks with active neurons. Pattern Recognition and Image Analysis, 4(2): , [49] C.-F. Juang and Y.-C. Liou. On the hybrid of genetic algorithm and particle swarm optimization for evolving recurrent neural network. In Proceedings of the IEEE International Joint Conference on Neural Networks, volume 3, pages , Dept. of Electr. Eng., Nat. Chung-Hsing Univ., Taichung, Taiwan, July 24. [5] H. Juille and J. B. Pollack. Semantic niching and coevolution in optimization problems. [51] H. Juille and J. B. Pollack. Co-evolving intertwined spirals. In P. J. A. Lawrence J. Fogel and T. Baeck, editors, Proceedings of the Fifth Annual Conference on Evolutionary Programming, Evolutionary Programming V, pages MIT Press, [52] R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of International Joint Conference on Artificial Intelligence, [53] R. Kohavi. Wrappers for Performance Enhancement and Oblivious Decision Graphs. PhD thesis, Stanford University, [54] T. Kohonen. Self-organized formation of topologically correct feature maps. In J. W. Shavlik and T. G. Dietterich, editors, Readings in Machine Learning. Morgan Kaufmann Publishers, Inc., 199. [55] T. Kohonen. Self-Organizing Maps. Springer, [56] T. Kondo, J. Ueno, and K. Kondo. Revised gmdh-type neural networks using aic or pss criterion and their application to medical image recognition. Journal of Advanced Computational Intelligence and Intelligent Informatics, 9(3): , March 25. [57] J. Koutník. Modular neural networks for analysis and recognition of real data. Technical Report DC-PSR-24-8, Czech Technical University in Prague, FEE, CTU Prague, Czech Republic, 24. [58] L. Kuhn. Ant Colony Optimization for Continuous Spaces. PhD thesis, The Department of Information Technology and Electrical Engineering The University of Queensland, October 22. [59] V. Kurkova. Kolmogorov s theorem is relevant. Neural Computation, 3: , [6] F. Lemke, E. Benfenati, and J.-A. Mueller. Data-driven modeling and prediction of acute toxicity of pesticide residues. SIGKDD Explorations Newsletter, 8(1):71 79, June 26. Special Issue: Successful Real-World Data Mining Applications. [61] A. Lindlöf and B. Olsson. Genetic network inference: the effects of preprocessing. BioSystems, 72: , 23. [62] R. Little and D. Rubin. Statistical analysis with missing data. John Wiley and Sons, New York, 1987.
125 SECTION 9. BIBLIOGRAPHY 111 [63] H. Madala and A. Ivakhnenko. Inductive Learning Algorithm for Complex System Modelling. CRC Press, Boca Raton. [64] S. W. Mahfoud. A comparison of parallel and sequential niching methods. In Sixth International Conference on Genetic Algorithms, pages , [65] S. W. Mahfoud. Niching methods for genetic algorithms. Technical Report 951, Illinois Genetic Algorithms Laboratory (IlliGaL), University of Ilinios at Urbana- Champaign, May [66] M. Mandischer. A comparison of evolution strategies and backpropagation for neural network training. Neurocomputing, (42):87 117, 22. [67] F. Masulli and G. Valentini. Effectiveness of error correcting output codes in multiclass learning problems. In In Lecture Notes in Computer Science, volume 1857, pages , Berlin, Heidelberg, 2. Springer-Verlag. [68] S. McConnell and D. B. Skillicorn. Outlier detection using semidiscrete decomposition. Technical Report Tech. Report 21452, Queen s University, Department of Computing and Information Science, November 21. [69] J. A. Muller and F. Lemke. Self-Organising Data Mining. Berlin, 2. ISBN [7] N. Nariman-Zadeh, A. Darvizeh, A. Jamali, and A. Moeini. Evolutionary design of generalized polynomial neural networks for modelling and prediction of explosive forming process. Journal of Materials Processing Technology, (165): , 25. [71] N. I. Nikolaev, H. Iba, and V. Slavov. Inductive gentetic programming with immune network dynamics. In Advances in Gentic Programming 3, chapter 15. The MIT Press, [72] N. Y. Nikolaev and H. Iba. Polynomial harmonic gmdh learning networks for time series modeling. Neural Networks, (16): , May 23. [73] N. Nikolayev and V. Slavov. Concepts of inductive genetic programming. In P. R. Banzhaf W., editor, EuroGP 98: First European Workshop on Genetic Programming, LNCS-1391, pages 49 59, Berlin, Springer. [74] J. Novák. Srovnání metod pro potlačení šumu. Master s thesis, Czech Technical University in Prague, 25. [75] S. K. Oh and W. Pedrycz. The design of self-organizing polynomial neural networks. Inf. Sci., 141: , 22. [76] S.-K. Oh and W. Pedrycz. A new approach to self-organizing fuzzy polynomial neural networks guided by genetic optimization. Physics Letters, A(345):88 1, July 25. [77] S.-K. Oh, W. Pedrycz, and B.-J. Park. Polynomial neural networks architecture: analysis and design. Computers and Electrical Engineering, 29(29):73 725, 23. [78] S. Piramuthu. Evaluating feature selection methods for learning in data mining applications. European Journal of Operational Research, 156: , 24.
126 112 SECTION 9. BIBLIOGRAPHY [79] L. Prechelt. A set of neural network benchmark problems and rules. Technical Report 21/94, Karlsruhe, Germany, [8] D. Pyle. Data Preparation for Data Mining. Morgan Kaufman, Fondi di Ricerca Salvatore Ruggieri - Numero 421 d inventario. [81] A. Rechtschaffen and A. Kales. A Manual of Standardized Terminology, Technique and Scoring System for Sleep Stages of Human Subjects. U.S. Government Printing Office, Washington DC, public health service edition, [82] M. Rimer and T. Martinez. Softprop: Softmax neural network backpropagation learning. In Proceedings of the 24 IEEE International Joint Conference on Neural Networks, volume 2, pages , Dept. of Comput. Sci., Brigham Young Univ., Provo, UT, USA, 24. [83] Salane and Tewarson. A unified derivation of symmetric quasi-newton update formulas. Applied Math, 25:29 36, 198. [84] S. Schaal and C. G. Atkeson. Receptive field weighted regression. Technical Report TR-H-29, ATR Human Information Processing Laboratories, 2-2 Hikaridai, Seika-cho, Soraku-gu, Kyoto 619-2, Japan, [85] V. Schetinin. A learning algorithm for evolving cascade neural networks. Neural Processing Letters, (17):21 31, 23. [86] K. Schittkowsk and C. Zillober. Nonlinear programming. Technical report, D-9544 Bayreuth, Germany. [87] A. Schmitt, P. Murail, E. Cunha, and D. Rougé. Variability of the pattern of aging on the human skeleton : evidence from bone indicators and implications on age at death estimation. Journal of Forensic Sciences, 47: , 22. [88] R. Schnabel, J. Koontz, and B. Weiss. A modular system of algorithms for unconstrained minimization. Technical Report CU-CS-24-82, Comp. Sci. Dept., University of Colorado at Boulder, [89] R. S. Sexton and J. Gupta. Comparative evaluation of genetic algorithm and backpropagation for training neural networks. Information Sciences, (129):45 59, 2. [9] J. R. Shewchuk. An introduction to the conjugate gradient method without the agonizing pain. Technical report, School of Computer Science Carnegie Mellon University, Pittsburgh, PA 15213, August [91] W. Siedlecki and J. Sklansky. On automatic feature selection. International Journal of Pattern Recognition, 2:197 22, [92] K. Socha. Aco for continuous and mixed-variable optimization. In Proceedings of 4th International Workshop on Ant Colony Optimization and Swarm Intelligence (ANTS 24), Brussels, Belgium, September 24. [93] K. Stanley, B. Bryant, and R. Miikkulainen. Real-time neuroevolution in the nero video game. Evolutionary Computation, IEEE Transactions on, 9(6): , Dec. 25.
127 SECTION 9. BIBLIOGRAPHY 113 [94] K. O. Stanley and R. Miikkulainen. Continual coevolution through complexification. In W. B. L. et al., editor, GECCO 22: Proceedings of the Genetic and Evolutionary Computation Conference, New York, USA, 9-13 July 22, pages Morgan Kaufmann, 22. [95] K. O. Stanley and R. Miikkulainen. Evolving neural networks through augmenting topologies. Evolutionary Computation, 1(2):99 127, 22. Massachusetts Institute of Technology. [96] Statsoft. Statistica neural networks software. More information at nn.html, September 26. [97] T. Stocker. Long-term perspectives on the earth system looking from the past into the future. Pleanry Talk IGBP Open Science Conference, Banff 23, 23. [98] R. Storn and K. Price. Differential evolution - a simple and efficient heuristic for global optimization over continuous spaces. Journal of Global Optimization, 11: , [99] A. Strehl and J. Ghosh. Relationship-based clustering and visualization for highdimensional data mining. INFORMS Journal on Computing, pages 1 23, 22. [1] M. Tesmer and P. Estevez. Amifs: adaptive feature selection by using mutual information. In Proceedings of the 24 IEEE International Joint Conference on Neural Networks, volume 1, page 38, Dept. of Electr. Eng., Chile Univ., Santiago, Chile, July 24. [11] R. Thomson and T. Arslan. Evolvable hardware for the generation of sequential filter circuits. In Proceedings of NASA/DoD Conference on Evolvable Hardware (EH 2), pages 17 23, 22. [12] W. M. Thorburn. The myth of occam s razor. Mind, 17(27): , [13] S. Tsutsui. Ant colony optimisation for continuous domains with aggregation pheromones metaphor. In Proceedings of the 5th International Conference on Recent Advances in Soft Computing (RASC-4), pages , 24. [14] S. Tsutsui, M. Pelikan, and A. Ghosh. Performance of aggregation pheromone system on unimodal and multimodal problems. In The IEEE Congress on Evolutionary Computation, 25 (CEC25), volume 1, pages IEEE, 2-5 September 25. [15] F.-Y. Tzeng and K.-L. Ma. Opening the black box - data driven visualization of neural networks. In Proceedings of IEEE Visualization 5 Conference, pages 23 28, Minneapolis, USA, October 25. [16] J. Vesterstrom and R. Thomsen. A comparative study of differential evolution, particle swarm optimization, and evolutionary algorithms on numerical benchmark problems. In Proceedings of the 24 Congress on Evolutionary Computation, volume 2, pages , 24. [17] J. G. Wade. Convergence properties of the conjugate gradient method. available at www-math.bgsu.edu/ gwade/tex examples/example2.txt, September 26.
128 114 SECTION 9. BIBLIOGRAPHY [18] E. J. Wegman. Visual data mining. Statistics in Medicine, 22: , 23. [19] D. Wickera, M. M. Rizkib, and L. A. Tamburinoa. E-net:evolutionary neural network synthesis. Neurocomputing, 42: , 22. [11] I. Witten and E. Frank. Data Mining Practical Machine Learning Tools and Techniques, Second Edition. Elsevier, 25. [111] Z.-H. Zhou, J. Wu, and W. Tang. Ensembling neural networks: Many could be better than all. Artificial Intelligence, 137: , 22.
129 SECTION 1. PUBLICATIONS OF THE AUTHOR Publications of the author [A.1] Kordík P., Saidl J., Šnorek, M.: Evolutionary Search for Interesting Behavior of Neural Network Ensembles In: Proceeding of 26 IEEE World Congress on Computational Intelligence, July 26, Vancouver, Canada. [A.2] Kordík P.: GAME - Group of Adaptive Models Evolution Technical report DCSE- DTP-25-7, Czech Technical University in Prague, 25; [A.3] Kordík P., Šnorek, M.: Deterministic Crowding Helps to Evolve Non-correlated Active Neurons In: Proceedings of the International Workshop on Inductive Modeling IWIM-25, Academy of Sciences, Glushkov Institute, p Kiev, Ukraine 25; ISBN [A.4] Kordík P., Šnorek, M.: Ensemble Techniques for Credibility Estimation of GAME Model In: Artificial Neural Networks: Formal Models and Their Applications - ICANN 25. Berlin: Springer, 25, vol. 2, s ISBN [A.5] [A.6] Kordík P.: Why Bagging of GAME Inductive Models Does Not Futher Improve their Accuracy? In: Proceedings of the International Workshop on Inductive Modeling IWIM-25, Academy of Sciences, Glushkov Institute, p Kiev, Ukraine 25; ISBN Kordík P., Šnorek M., Genyk-Berezovskyj M.: Hybrid Inductive Models: Deterministic Crowding Employed In: Proceedings of the International Joint Conference on Neural Networks, Piscataway: IEEE, p , 24. ISBN [A.7] Novák D., Kordík P., Macaš M., Vyhnálek M., Brzezný R., Lhotská L.: School Children Dyslexia Analysis using Self Organizing Maps In: 26th Annual International Conference of the IEEE Engineering in Medicine and Biology Society, Los Alamitos: IEEE Computer Society Press, p. 1-4., 24. ISBN [A.8] Kordík P., Šnorek M.: Progress in inductive modelling In: Proceedings of the 5th EUROSIM Congres Modelling and Simulation, Vienna: EUROSIM-FRANCOSIM- ARGESIM, vol. II, 24. ISBN [A.9] Kordík, P. - Náplava, P. - Šnorek, M. - Genyk-Berezovskyj, M.: A Modified GMDH Method and Model Quality Evaluation by Visualization, In: Control Systems and Computers, no. 2, p , 23. ISSN [A.1] Kordík P.: Inductive Modelling: Detection of System States Validity, In: Proceedings of the International Conference and Competition STUDENT EEICT 23, prize awarded, Brno: Vysoke uceni technicke v Brne, Fakulta elektrotechniky a komunikacnich technologii, p , 23. ISBN X [A.11] Kordík P.: Inductive Modelling: Detection of System States Validity, In: Poster 23, the winning poster, Prague: CTU, Faculty of Electrical Engineering, p. IC2, 23. [A.12] Kordík P.: Selecting Subset of Relevant Variables by Means of Niching Genetic Algorithm, In: Poster 24, Prague: CTU, Faculty of Electrical Engineering, p. IC21, 24. [A.13] Kordík P., Šnorek M.: Multivariate Models Evaluation by way of Visualization, In: ASIS 23. Ostrava: MARQ, p , 23. ISBN [A.14] Šnorek M., Kordík P.: Automatic Model Generation Based on Arificial Intelligence Methods, In: ASIS 23. Ostrava: MARQ, p , 23. ISBN
130 116 SECTION 1. PUBLICATIONS OF THE AUTHOR [A.15] Kordík, P. - Náplava, P. - Šnorek, M. - Genyk-Berezovskyj, M.: The Modified GMDH Method Applied to Model Complex Systems, In: International Conference on Inductive Modeling, ICIM 22, Lviv: State Scientific and Research Institute of Information Infrastructure, p ISSN [A.16] Kordík P.: Visualising Models of Complex Systems, In: POSTER 22 - Book of Extended Abstracts, Prague: CTU, Faculty of Electrical Engineering, p. IC28, 22.
A Standardization of GMDH clones

To be able to share inductive models among different DM tools, a standard description of GMDH-like networks has to be developed. Such a description not only allows sharing models between different applications, but also improves compatibility within one application. For example, GAME models are stored into files using Java serialization routines; every time there is a change in a memory structure (e.g. adding a new parameter or method), previously saved models cannot be loaded any more. This absence of backward compatibility is a very unpleasant issue. The solution could be to store GAME models using XML 1 serialization.

Dr. Frank Lemke, the author of the KnowledgeMiner software, proposed to create a common standard in the PMML language. The Predictive Model Markup Language (PMML) is an XML markup language for describing statistical and data mining models. PMML describes the inputs to data mining models, the transformations used to prepare data for data mining, and the parameters which define the models themselves. PMML is complementary to many other data mining standards; its XML interchange format is supported by several other standards, such as XML for Analysis, JSR 73, and SQL/MM.

We designed the first version of the PMML GMDH standard and implemented a parser capable of loading and displaying a GMDH network from its PMML description. A simple example of a GMDH network is shown in Figure A.1 (a model with inputs such as age of car, gender and the number of claims, polynomial units P1, P2 and P3, and the output amount of claims). Its PMML description can be found in the next section.

Figure A.1: An example of a simple GMDH-type model. This model can be generated by the GAME engine when only units with a polynomial transfer function are enabled.

1 Extensible Markup Language (XML) is a simple, very flexible text format derived from SGML (ISO 8879). XML plays an increasingly important role in the exchange of a wide variety of data on the Web and elsewhere.
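A minimal sketch of how such a PMML description can be loaded with the standard Java DOM parser is shown below. The file name is illustrative, and only the NeuralLayer/Neuron structure that appears in the listing of Section A.1 is inspected; the actual FAKE GAME parser is more elaborate.

import java.io.File;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Loads a PMML description of a GMDH network and prints its layer structure. */
public final class PmmlGmdhLoader {
    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder().parse(new File("gmdh-model.pmml"));  // file name illustrative

        NodeList layers = doc.getElementsByTagName("NeuralLayer");
        for (int i = 0; i < layers.getLength(); i++) {
            Element layer = (Element) layers.item(i);
            NodeList neurons = layer.getElementsByTagName("Neuron");
            System.out.println("Layer " + i + ": " + neurons.getLength() + " neuron(s)");
            for (int j = 0; j < neurons.getLength(); j++) {
                Element neuron = (Element) neurons.item(j);
                System.out.println("  neuron id=" + neuron.getAttribute("id")
                        + ", polynomial elements=" + neuron.getElementsByTagName("Element").getLength());
            }
        }
    }
}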
132 118 APPENDIX A. STANDARDIZATION OF GMDH CLONES A.1 PMML description of GMDH type polynomial networks <?xml version="1."?> <PMML version="3.1" xmlns=" xmlns:xsi=" <NeuralNetwork modelname="neural Insurance" functionname="regression" activationfunction="polynomial" numberoflayers="2"> <NeuralInputs numberofinputs="6"> <NeuralInput id="1"> <DerivedField optype="continuous" datatype="double"> <NormContinuous field="age of car"> <LinearNorm orig=".1" norm=""/> <LinearNorm orig="3.7897" norm=".5"/> <LinearNorm orig="11.44" norm="1"/> </NormContinuous> </DerivedField> </NeuralInput> <NeuralInput id="2"> <DerivedField optype="continuous" datatype="double"> <NormDiscrete field="gender" value=" male"/> </DerivedField> </NeuralInput> <NeuralInput id="3"> <DerivedField optype="continuous" datatype="double"> <NormDiscrete field="no of claims" value=" "/> </DerivedField> </NeuralInput> <NeuralInput id="4"> <DerivedField optype="continuous" datatype="double"> <NormDiscrete field="no of claims" value=" 1"/> </DerivedField> </NeuralInput> <NeuralInput id="5"> <DerivedField optype="continuous" datatype="double"> <NormDiscrete field="no of claims" value=" 3"/> </DerivedField> </NeuralInput> <NeuralInput id="6"> <DerivedField optype="continuous" datatype="double"> <NormDiscrete field="no of claims" value=" > 3"/> </DerivedField> </NeuralInput> </NeuralInputs> <NeuralLayer numberofneurons="2"> <Neuron id="7"> <Element> <Coefficient value=".73"/> <Array_Item> <Item_Bin value="true"/> <Item_Grade value="3"/> </Array_Item> <Array_Item> <Item_Bin value="false"/> <Item_Grade value="2"/> </Array_Item> <Array_Item> <Item_Bin value="false"/> <Item_Grade value=""/> </Array_Item> <Array_Item> <Item_Bin value="false"/> <Item_Grade value="5"/> </Array_Item> <Array_Item> <Item_Bin value="true"/> <Item_Grade value="6"/> </Array_Item> <Array_Item> <Item_Bin value="false"/> <Item_Grade value="2"/> </Array_Item> </Element> <Element> <Coefficient value="3.87"/> <Array_Item> <Item_Bin value="false"/> <Item_Grade value="1"/> </Array_Item> <Array_Item> <Item_Bin value="false"/> <Item_Grade value=""/> </Array_Item> <Array_Item> <Item_Bin value="false"/> <Item_Grade value="2"/> </Array_Item> <Array_Item> <Item_Bin value="false"/> <Item_Grade value="4"/> </Array_Item> <Array_Item> <Item_Bin value="true"/> <Item_Grade value="1"/> </Array_Item> <Array_Item> <Item_Bin value="false"/> <Item_Grade value="1"/> </Array_Item> </Element> <Bias value=""/> </Neuron> <Neuron id="8"> <Element> <Coefficient value="2.62"/> <Array_Item> <Item_Bin value="true"/> <Item_Grade value="1"/> </Array_Item> <Array_Item> <Item_Bin value="false"/> <Item_Grade value="8"/> </Array_Item> <Array_Item> <Item_Bin value="true"/> <Item_Grade value="4"/> </Array_Item> <Array_Item> <Item_Bin value="true"/> <Item_Grade value="2"/> </Array_Item> <Array_Item> <Item_Bin value="false"/> <Item_Grade value="8"/> </Array_Item> <Array_Item> <Item_Bin value="false"/> <Item_Grade value="3"/> </Array_Item> </Element> <Bias value="-.16"/> </Neuron> </NeuralLayer> <NeuralLayer numberofneurons="1"> <Neuron id="9"> <Element> <Coefficient value=".91"/> <Array_Item> <Item_Bin value="false"/> <Item_Grade value="2"/> </Array_Item> <Array_Item> <Item_Bin value="false"/> <Item_Grade value="3"/> </Array_Item> <Array_Item> <Item_Bin value="false"/> <Item_Grade value="4"/> </Array_Item> <Array_Item> <Item_Bin value="false"/> <Item_Grade value="6"/> </Array_Item> <Array_Item> <Item_Bin value="false"/> <Item_Grade value="2"/> </Array_Item> <Array_Item> <Item_Bin value="true"/> <Item_Grade 
value="1"/> </Array_Item> <Array_Item> <Item_Bin value="true"/> <Item_Grade value="2"/> </Array_Item> <Array_Item> <Item_Bin value="true"/> <Item_Grade value="3"/> </Array_Item>
133 APPENDIX A. STANDARDIZATION OF GMDH CLONES 119 \begin{onecolumn} </Element> <Bias value="1.73"/> </Neuron> </NeuralLayer> <NeuralOutputs numberofoutputs="1"> <NeuralOutput outputneuron="9"> <DerivedField optype="continuous" datatype="double"> <NormContinuous field="amount of claims"> <LinearNorm orig="" norm=".1"/> </NormContinuous> </DerivedField> </NeuralOutput> </NeuralOutputs> </NeuralNetwork> </PMML>
Table B.1: Features of the Boston housing data set.

Feature   Description
CRIM      per capita crime rate by town
ZN        proportion of residential land zoned for lots over 25,000 sq.ft.
INDUS     proportion of non-retail business acres per town
CHAS      Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
NOX       nitric oxides concentration (parts per 10 million)
RM        average number of rooms per dwelling
AGE       proportion of owner-occupied units built prior to 1940
DIS       weighted distances to five Boston employment centers
RAD       index of accessibility to radial highways
TAX       full-value property-tax rate per $10,000
PTRATIO   pupil-teacher ratio by town
B         1000(Bk - 0.63)^2, where Bk is the proportion of blacks by town
LSTAT     % lower status of the population
MEDV      median value of owner-occupied homes in $1000's

B Data sets used in this thesis

B.1 Building data set

The Building data set is frequently used for benchmarking modeling methods [79]. It consists of several thousand measurements inside the building (hot water (wbhw), cold water (wbc), energy consumption (wbe)) under specific weather conditions outside (temperature (temp), humidity of the air (humid), solar radiation (solar), wind strength (wind)). We excluded the information about the time of measurement and transformed this data set from the time series domain to the regression domain.

B.2 Boston data set

The Boston housing data set was taken from the StatLib library, which is maintained at Carnegie Mellon University. It concerns housing values in the suburbs of Boston. The number of instances in the data set is 506. There are 13 continuous input features and one output variable, MEDV (see Table B.1).

B.3 Mandarin data set

The Mandarin tree data set (provided by HortResearch, New Zealand) describes the water consumption of a mandarin tree.
The mandarin tree is a complex system influenced by many input variables (water, temperature, sunshine, humidity of the air, etc.). Our data set consists of measurements of these input variables and of one output variable, the water consumption of the tree; it describes how much water the tree needs under specific conditions. We used 25 training vectors (11 input variables, 1 output variable).

B.4 Dyslexia data set

The eye movements of 49 female subjects were recorded using the iview videooculography system at the Department of Neurology, 2nd Medical Faculty, Charles University, Czech Republic. The frame rate of the camera used was 1 Hz. 22 subjects were healthy, 18 subjects suffered from a reading dysfunction and 9 subjects were dyslectic children. The average group age was 11 years, the age variance was 0.5 years. The subjects were stimulated by two non-verbal stimuli and one verbal stimulus. The non-verbal stimuli consist of two images with different graphical patterns: the first image was a grid of blue dots (stimulus number 1), and the subject was asked to inspect the image dot by dot; the second image was composed of a grid of faces, and the subject was asked to count all smiling faces (stimulus number 2). The verbal stimulus was a Czech text to read (stimulus number 3).

B.5 Antro data set

The data 1 represent a set of observations of the skeletal indicators studied for the proposal of methods of age-at-death assessment from the human skeleton (see [87]). They are the result of visual scoring of the morphological changes of the features in two pelvic joint surfaces, defined and described by a text accompanied with photos. The material consists of 955 subjects from 9 human skeletal series of subjects of known age and sex. These collections (populations) are dispersed over 4 continents (Europe, North America, Africa, Asia). The age at death of the individuals varies between 19 and 100 years. Three features are scored on the pubic symphysis in the pelvis: (A) Posterior plate (PUSA), scored in three phases (1-2-3); (B) Anterior plate (PUSB), observed in three phases (1-2-3); (C) Posterior lip (PUSC), scored in two phases (1-2). Four features on the sacro-pelvic surface of the ilium were observed: (A) Transverse organisation (SSPIA), evaluated in two phases (1-2); (B) Modification of the articular surface (SSPIB), scored in four phases; (C) Modification of the apex (SSPIC), observed in two phases (1-2); (D) Modification of the iliac tuberosity (SSPID), estimated in two phases (1-2).

1 Thanks to Dr. Jaroslav Bruzek who provided us with this data set and with many valuable comments on our results.
B.6 UCI data sets

Other data sets used in this thesis come from the UCI machine learning repository [6]. Detailed descriptions of the Pima Indian Diabetes data set, the Iris data set, the Ecoli data set, etc. can be found at [6].
C The FAKE GAME environment

The FAKE GAME environment is still under development; do not expect a single-green-button application yet (this is our vision). The core of the environment is the GAME engine, which offers many configuration options. These options were necessary during the development of the engine; in future versions they will be available in expert mode only. Many components of the system are developed and tested independently (the 3D visualization engine, the search for interesting plots, data preprocessing modules, etc.) and will be released in later versions of the FAKE GAME environment. Below we shortly describe how to work with the core of the environment, the GAME engine, and we also mention the visual knowledge extraction support in the environment.

C.1 The GAME engine - running the application

The GAME engine is implemented in the Java programming language. To run the application, a Java virtual machine must be installed on your computer; you can download the Java runtime environment (JRE) from the Sun website for free. To run the application, type

java -Xmx512M -Xms128M -jar autogen.jar

in the directory you have unpacked the archive to.

C.1.1 Loading and saving data, models

Select File / Load raw data from the menu of the application window to import a data set in the following format (the name of an output attribute starts with "!"); an illustrative file is shown below this listing:

sepal_l sepal_w petal_l petal_w !setosa !versicolor !virginica

Select File / Load to import a data set (data together with models) in the following format:

Input_factor_name  type        max min med
sepal_length       continuous
sepal_width        continuous
petal_length       continuous
petal_width        continuous
Attribute_name     polarity    significance max min
Iris-setosa        positive    1..
Iris-versicolor    positive    1..
Iris-virginica     positive    1..
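For illustration, the beginning of a raw data file for the Iris data in the first format above might look as follows. The measurement values are well-known samples from the Iris data set, the class membership is encoded as 1-of-3, and whitespace separation of the columns is assumed here rather than prescribed by the GAME loader.

sepal_l sepal_w petal_l petal_w !setosa !versicolor !virginica
5.1 3.5 1.4 0.2 1 0 0
7.0 3.2 4.7 1.4 0 1 0
6.3 3.3 6.0 2.5 0 0 1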
Group_name  input_factors_values  output_attributes_values

The advantage of this format is that you can specify the minimum and maximum value of a particular input or output. For the previous format, both values are computed from the data. It is possible that some vectors in the data you plan to use later have values out of this range; in such a case, you need to use this data format.

Select File / Save to save a data set. When File / Save responses is enabled, the responses of the models are appended to the data vectors. Models are serialized to an independent file with the extension .net. Select File / Save training and testing set to split your data set randomly into a training part (file "training") and a testing part (file "testing"). Then you can load the training file, build models, load the testing file and save the responses of the models on the testing set.

C.1.2 How to build models

Select File / Create / Single GAME model to build one GAME model for each output variable selected in the right panel of the application. The item File / Create / Repeat generating GMDH allows you to build an ensemble of GAME models for each selected output. The Repeat generating option is much faster than single model generation (graphics is reduced); use this option preferably, unless you need to examine the process of GAME model evolution.

C.2 Units so far implemented in the GAME engine

LinearNeuron - unit with a simple linear transfer function; coefficients can be estimated by any training method (very fast, accurate)

LinearGJNeuron - unit with a simple linear transfer function; coefficients are estimated by the Gauss-Jordan method (extremely fast, inaccurate)

CombiNeuron - unit with a polynomial transfer function that is evolved during the run of the GA (for full functionality it needs all other units to be disabled, so that they do not disturb the evolution of the chromosomes encoding its transfer function); coefficients can be estimated by any training method (fast in the beginning, slowing with increasing complexity of the transfer function; accurate only if all other units are disabled, otherwise it won't survive)

PolyHornerNeuron - unit with a simple polynomial transfer function computed by the Horner scheme; coefficients can be estimated by any training method (fast, accuracy limited by the simplicity of the transfer function)

PolySimpleNeuron - unit with a randomly generated polynomial transfer function whose complexity depends on the number of the layer; coefficients can be estimated by any training method (fast, accurate)

PolySimpleGJNeuron - unit with a randomly generated polynomial transfer function whose complexity depends on the number of the layer; coefficients are estimated by a simplified Gauss-Jordan method (extremely fast, inaccurate; survives just by accident :-))
PolySimpleNRNeuron - unit with a randomly generated polynomial transfer function whose complexity depends on the number of the layer; coefficients can be estimated by any training method; training can be stopped early to prevent overtraining (GL5 criterion) (fast, accurate)

ExpNeuron - unit with an exponential transfer function; coefficients can be estimated by any training method (fast, accurate)

SigmNeuron - unit with a logistic transfer function (sigmoid); coefficients can be estimated by any training method (fast, accurate especially for classification problems)

SinusNeuron - unit with a sine transfer function; coefficients can be estimated by any training method (fast, accurate)

GaussianNeuron - unit with a Gaussian transfer function (see Equation 4.5); coefficients can be estimated by any training method (fast, accurate)

GaussNeuron - unit with a Gaussian transfer function (see Equation 4.6); coefficients can be estimated by any training method (fast, accurate)

MultiGaussianNeuron - unit with a Gaussian transfer function (see Equation 4.7); coefficients can be estimated by any training method (fast, accurate)

GaussPDFNeuron - unit with a Gaussian transfer function (Gaussian conditional probability); coefficients can be estimated by any training method (slow, accurate)

PolyFractNeuron - unit with a randomly generated rational transfer function whose complexity depends on the number of the layer; coefficients can be estimated by any training method (slow learning, very accurate)

BPNetwork - unit with a small MLP network trained by the Backpropagation algorithm; the topology of the networks is generated randomly and can be configured by regulating the number of hidden neurons or the number of training epochs, so a compromise between slowness and accuracy can be found (extremely slow, very accurate especially for complex data; can solve the two intertwined spirals problem)

BPNRNetwork - the same as the BPNetwork unit; training can be stopped early to prevent overtraining (GL5 criterion) (extremely slow, very accurate)

C.3 Optimization methods in the GAME engine

QuasiNewtonTrainer (Quasi-Newton method) - a gradient method; it can be even faster if explicitly supplied with the gradient and the Hessian matrix of the error surface (very fast, accurate); translated to Java from Fortran by Steve Verrill

SADETrainer (SADE genetic method) - a genetic method mixed with Tabu search (the concept of radiation fields) to increase the search radius, preventing it from getting stuck in local optima (moderate speed, accurate); translated to Java by Jan Drchal

PSOTrainer (Particle Swarm Optimization) - a swarm optimization method inspired by birds; needs some parameter tuning (slow, moderate accuracy)

HGAPSOTrainer (Hybrid of GA and PSO) - a genetic method mixed with swarm optimization; for a few generations the chromosomes fly as birds, then they are mutated and crossed, and so on (moderate speed, accurate)
PALDifferentialEvolutionTrainer (Differential Evolution, version 1) - a genetic method using special crossover and mutation; each offspring has four parents. This version was implemented by the PAL Development Core Team and is distributed under the GPL (moderate speed, moderate accuracy)

DifferentialEvolutionTrainer (Differential Evolution, version 2) - a genetic method using special crossover and mutation; each offspring has four parents (moderate speed, accurate)

StochasticOSearchTrainer (Stochastic Orthogonal Search) - an orthogonal search with dimensions selected randomly. This version was implemented by the PAL Development Core Team and is distributed under the GPL (moderate speed, accurate)

OrthogonalSearchTrainer (Orthogonal Search) - a simple iterative search method; it does not use the gradient and performs the search in the directions of the dimensions (one by one). This version was implemented by the PAL Development Core Team and is distributed under the GNU GPL (moderate speed, accurate)

ConjugateGradientTrainer (Conjugate Gradient method) - an iterative method with good convergence properties. This version was implemented by the WEKA development team and is distributed under the GNU GPL (moderate speed, accurate)

ACOTrainer (Ant Colony Optimization) - this nature-inspired method was extended for continuous problems (moderate speed, moderate accuracy)

CACOTrainer (Continuous Ant Colony Optimization) - this nature-inspired method was extended for continuous problems (moderate speed, moderate accuracy)

C.4 Configuration options of the GAME engine

The GAME engine is still under development and offers several options that can be configured. In this section we describe some options that can be accessed by selecting Options / Configure GAME from the main menu of the GAME application; the configuration window will then appear. There are several tabs in the configuration window. The first tab is called Complexity; the meaning of its abbreviations is the following:

Populat. size - the size of the population in each layer (how many chromosomes (units) are evolved within one mating pool by the genetic algorithm)

Max.surv.units - the maximal number of units that are selected to survive in the layer after the GA finishes

Epochs - the number of epochs of the genetic algorithm in each layer of the GAME network

The next tab in the configuration window is the Unit types tab (see Figure C.1). Its purpose is to allow selecting the units that will be used during the construction of the GAME network. You can enable/disable units with a certain type of transfer function (linear, polynomial, perceptron networks and others); disabled units won't be generated into the population of the GA during the construction of the GAME network. To select an individual type of unit, choose the proper transfer function in the tree of configuration options. To configure a unit, you have to
141 APPENDIX C. THE FAKE GAME ENVIRONMENT 127 Figure C.1: The configuration of units in the GAME engine. expand the tree with configuration option and click on the name of unit. Types of units that have been implemented so far are listed in the section C.2. In the next tab, you can enable or disable training methods listed in the section C.3. Enabled methods are used to estimate coefficients of units supporting the training by any method. In the Evolution tab it is possible to configure parameters of genetic algorithm that optimizes units in layers of the GAME network. If the Deterministic Crowding is enabled, the distance of two units needs to be computed - either by difference of inputs (first checkbox) or by the correlation of units responses on the training data (second checkbox) or by the combination of both. Diversity is derived from the distance of units - the threshold of minimal diversity can be set to prevent unifying population. When crossing two units, their offsprings can be of the same type as parents, or purely random. This allows evolving just successful types of units, but some amount of randomness is recommended. If the importance of individuals distance is set to zero, the algorithm works faster - especially the computation of units correlation is time demanding. The more important is the distance, the more preferred are diverse units having lower accuracy. It is possible to set the size of the training and the validation set in the Data configuration tab. The training set will be used to estimate coefficients/weights of units. The validation set will be used to compute the fitness function of units when evolving them. The implicit size of the validation set is one third of training data. The more noisy data you process, the bigger proportion of data you should use as validation set. Percentage of data set used for learning can be decreased to make the computation faster - but remember the accuracy of resulting model can dramatically fall down, with just the fraction of data used for training.
It is possible to set the sizes of the training and the validation set in the Data configuration tab. The training set is used to estimate the coefficients (weights) of units. The validation set is used to compute the fitness of units during their evolution. The default size of the validation set is one third of the training data. The noisier the data you process, the bigger the proportion of data you should use as the validation set. The percentage of the data set used for learning can be decreased to make the computation faster - but remember that the accuracy of the resulting model can drop dramatically when only a fraction of the data is used for training.

If the "validate on training set, too" checkbox is enabled, the fitness of units is computed on both the training and the validation set (this can sometimes lead to overtraining, but it is useful when only a small sample of data is available). Bootstrap sampling is designed to introduce diversity into the ensemble of models, sometimes giving a more accurate ensemble output.

In the Connection tab, you can set whether the number of inputs to a unit should be limited by the number of its layer. GAME networks consisting of units with an unlimited number of inputs can be more accurate than networks where the number of inputs grows with the layer number. On the other hand, in this case the genetic algorithm needs far more epochs to evolve a proper topology (the search space is far bigger).

The Others tab allows configuring whether layers of the GAME network will be added regardless of how big the increase in accuracy is. You can also disable the normalization of variables - units with linear and polynomial transfer functions are able to handle data that are not normalized. You can then use the Model equation button to serialize the model into a string and use it directly in a spreadsheet processor (e.g. Excel).

C.5 Visual knowledge extraction support in GAME

There are several visualization modules accessible via the Graph menu item. For optimal graph properties, tune their parameters in Options - Graph Properties. Visualization modules can be simply added to the FAKE GAME environment. The modules implemented so far are summarized below.

Module                Functionality                                       Dim.  Models
G2D                   IO relationship plots                               2     single
G2Dmulti              IO relationship plots                               2     multiple
Graph3D               Simple fast projection of the model's output        3     single
Cut3D                 IO relationship regression plots                    3     multiple
Clasification3d       Decision manifold visualization                     3     single
Starplot              Behavior of the model in the neighborhood           N     single
Clasification2D       Decision boundaries visualization                   2     single
ClasificationMulti2D  Decision boundaries visualization                   2     multiple
ScatterplotMatrix     Matrix of 2x2 decision boundaries                   N     single
GA                    Visualization of the search for interesting plots   -     multiple

The additional configuration options are the following. With Graph - Combine models for clasif. enabled, the responses of models of the same class are multiplied; responses in uncertain areas drop to zero. The result is similar to using neural networks with local units (RBF), but with immunity to the curse of dimensionality and to irrelevant inputs. Select Graph - Visible response areas - Models responses accuracy to show the estimate of uncertainty, signified by a dark background. You can change the values of the inputs and study the input-output relationship under conditions given by the values of all input features except the one being studied.
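A minimal sketch of the Combine models for clasif. behaviour described above, assuming each class is represented by several models whose responses lie in the interval 0..1: the responses of all models of the same class are multiplied, so a single uncertain model suppresses the combined response in that region. The class and method names are hypothetical, not the FAKE GAME API.

/**
 * Sketch of combining classifier responses: for each class the responses of all
 * models predicting that class are multiplied, so uncertain regions drop to zero.
 * Illustrative only; not the actual FAKE GAME API.
 */
public class CombineForClassificationSketch {

    /**
     * @param responsesPerClass responsesPerClass[c][m] is the response (0..1) of the
     *                          m-th model of class c for the evaluated input vector
     * @return index of the class with the highest combined response
     */
    static int classify(double[][] responsesPerClass) {
        int best = -1;
        double bestScore = -1;
        for (int c = 0; c < responsesPerClass.length; c++) {
            double combined = 1.0;
            for (double r : responsesPerClass[c]) {
                combined *= r;                 // a single uncertain model pulls the product toward zero
            }
            if (combined > bestScore) {
                bestScore = combined;
                best = c;
            }
        }
        return best;
    }
}

The multiplication is what produces the localized, RBF-like response areas mentioned above: the combined response stays high only where all models of a class agree.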
D Results of additional experiments

Figure D.1: The behavior of a GAME model consisting of Gaussian units almost resembles fractals.
Figure D.2: Characteristic response of the FIR filter (magnitude in dB and phase in degrees versus normalized angular frequency in π rads/sample).

Figure D.3: Characteristic response of the GAME network filter (magnitude in dB and phase in degrees versus normalized angular frequency in π rads/sample) - noise is better inhibited in regions.
Figure D.4: The percentage of surviving units in the GAME network according to their type (several sinus variants, gauss, linear, polynomial, perceptron, rational, exponential and sigmoid units) on the Motol questionnaire data and the Ecoli data. This experiment was performed to find the best form of the transfer function in the SinNeuron unit.

Figure D.5: The percentage of surviving units in the GAME network according to their type on the Mandarin data and the Iris data (Setosa, Versicolor, Virginica). This experiment was performed to find the best form of the transfer function in the SinNeuron unit.
Figure D.6: Relationship of the age at death and features obtained from the skeleton. Higher values of the SSPIB feature signify older people. The SSPIA feature is considered irrelevant. Values of the PUSC feature increase with growing age, but for older people the relationship is not clearly defined.

Figure D.7: The classification of the Spiral data by the GAME model evolved with all units enabled.
Figure D.8: The performance comparison of GAME units on the Boston data set (RMS error on the training and testing sets for the all, Fract, Gaussian, MultiGauss, all-fast, Perceptron, all-simple, Gauss, Sigm, Polynomial, Sin, Exp, CombiR3, Linear and Combi configurations).

Figure D.9: The performance comparison of GAME units on the Ecoli data set (classification accuracy on the training and testing sets).

Figure D.10: The performance comparison of GAME units on the Mandarin data set (RMS error on the training and testing sets).
Figure D.11: The performance comparison of GAME units on the Iris data set (classification accuracy on the training and testing sets).

Figure D.12: The performance comparison of learning methods (CG, DE, QN, SADE, all, HGAPSO, SOS, palDE, CACO, PSO, ACO and OS) on the Mandarin data set (RMS error on the training and testing sets).
Figure D.13: Regularization of the Combi units by various strengths of a penalization for complexity (RMS error of the Simple and Weighted ensembles on the Antro training and testing data sets). Compare this figure to Figure 4.14, which displays the maximum, mean and minimum errors of the individual GAME models. When these models are combined into an ensemble, the overfitting is reduced (more for the Simple ensemble than for the Weighted ensemble).

Figure D.14: The optimal number of models in the ensemble is 5, according to results on the Advertising data set. Surprisingly, the ensemble with 1 member models overfitted the data. This experiment should be repeated several times to get reliable results, but we currently lack the computational resources.
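For reference, a minimal sketch of the two ensemble outputs compared in Figure D.13, assuming the Simple ensemble averages member outputs with equal weights and the Weighted ensemble uses externally supplied weights (for example, derived from validation errors); the exact weighting scheme is not shown in this appendix, so it is treated here as an assumption.

/**
 * Sketch of the two ensemble outputs compared in Figure D.13. The Simple ensemble
 * averages member outputs equally; the Weighted ensemble uses externally supplied
 * weights. The weighting scheme is an assumption for illustration.
 */
public class EnsembleOutputSketch {

    static double simpleEnsemble(double[] memberOutputs) {
        double sum = 0;
        for (double y : memberOutputs) sum += y;
        return sum / memberOutputs.length;
    }

    static double weightedEnsemble(double[] memberOutputs, double[] weights) {
        double sum = 0, weightSum = 0;
        for (int i = 0; i < memberOutputs.length; i++) {
            sum += weights[i] * memberOutputs[i];
            weightSum += weights[i];
        }
        return sum / weightSum;                // weights are normalized to sum to one
    }
}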
Figure D.15: The configuration window of the CombiNeuron unit.
