AN ALGORITHM FOR REDUCING THE DIMENSION AND SIZE OF A SAMPLE FOR DATA EXPLORATION PROCEDURES

Transcription

1 Int. J. Appl. Math. Coput. Sci., 2014, Vol. 24, No. 1, DOI: /acs AN ALGORITHM FOR REDUCING THE DIMENSION AND SIZE OF A SAMPLE FOR DATA EXPLORATION PROCEDURES PIOTR KULCZYCKI,, SZYMON ŁUKASIK, Systes Research Institute Polish Acadey of Sciences, ul. Newelska 6, Warsaw, Poland e-ail: {kulczycki,slukasik}@ibspan.waw.pl Departent of Autoatic Control and Inforation Technology Cracow University of Technology, ul. Warszawska 24, Cracow, Poland The paper deals with the issue of reducing the diension and size of a data set (rando saple) for exploratory data analysis procedures. The concept of the algorith investigated here is based on linear transforation to a space of a saller diension, while retaining as uch as possible the sae distances between particular eleents. Eleents of the transforation atrix are coputed using the etaheuristics of parallel fast siulated annealing. Moreover, eliination of or a decrease in iportance is perfored on those data set eleents which have undergone a significant change in location in relation to the others. The presented ethod can have universal application in a wide range of data exploration probles, offering flexible custoization, possibility of use in a dynaic data environent, and coparable or better perforance with regards to the principal coponent analysis. Its positive features were verified in detail for the doain s fundaental tasks of clustering, classification and detection of atypical eleents (outliers). Keywords: diension reduction, saple size reduction, linear transforation, siulated annealing, data ining. 1. Introduction Conteporary data analysis avails of a broad and varied ethodology, based on both traditional and odern, often specialized statistical procedures, currently ever ore supported by the significant possibilities of coputational intelligence. Apart fro the classical ethods fuzzy logic and neural networks etaheuristics such as genetic algoriths, siulated annealing, particle swar optiization, and ants algoriths (Gendreau and Potvin, 2010) are also applied here ore widely. Proper cobination and exploitation of the advantages of these techniques allows in practice an effective solution of fundaental probles in knowledge engineering, particularly those connected with exploratory data analysis. More and ore frequently the process of knowledge acquisition is realized using ultidiensional data sets of a large size. This stes fro the dynaic growth in the aount of inforation collected in database systes requiring peranent processing. The extraction of knowledge fro extensive data sets is a highly coplex task. Here difficulties are ainly related to liits in the efficiency of coputer systes for large size saples and probles exclusively connected with the analysis of ultidiensional data. The latter arise ostly fro a nuber of phenoena occurring in data sets of this type, known in the literature as the curse of ultidiensionality. Above all, this includes the exponential growth in the saple size necessary to achieve appropriate effectiveness of data analysis ethods with an increasing diension (the epty space phenoenon), as well as the vanishing difference between near and far points (nor concentration) using standard Minkowski distances (François et al., 2007). As previously entioned, the data set size can be reduced ainly to speed up calculations or ake the possible at all (Czarnowski and Jędrzejowicz, 2011). In the classical approach, this is realized ostly with sapling ethods or advanced data condensation techniques. Useful algoriths have also been worked out allowing the proble to be siplified by decreasing its diensionality. Therefore, let X denote a data atrix of diension n: X =[x 1 x 2... x ] T, (1)

2 134 P. Kulczycki and S. Łukasik with particular rows representing realizations of an n-diensional rando variable 1. The ai of reducing a diension is to transfor the data atrix in order to obtain its new representation of the diension N, where N is considerably (fro the point of view of the conditioning of a proble in question) saller than n. This reduction can be achieved in two ways, either by choosing N ost significant coordinates/features (feature selection) or through the construction of a reduced data set, based on initial features (feature extraction) (Inza et al., 2000; Xu and Wunsch, 2009). The latter can be treated as ore general the selection is a particularly siple case of extraction. Noteworthy aong extraction procedures are linear ethods, where the resulting data set Y is obtained through the linear transforation of the initial set (1), therefore using the forula Y = XA, (2) where A is a atrix of diension n N, aswellas nonlinear ethods for which the transforation can be described by a nonlinear function g : R n R N. This group also contains the ethods for which such a functional dependence, expressed explicitly, does not exist. Studies on the efficiency of extraction procedures carried out in the literature on the subject show that nonlinear ethods, despite having a ore general atheatical apparatus and higher efficiency in the case of artificially generated specific sets of data, frequently achieve significantly worse results for real saples (Maaten, 2009). The goal of this paper is to develop a universal ethod of reducing the diension and size of a saple designed for use in data exploration procedures. The reduction of the diension will be ipleented using a linear transforation on the condition that it affects as little as possible the utual positions of the original and resulting saples eleents. The focus of this contribution is to offer an alternative for the state-of-the-art linear ethod of principal coponent analysis (Sui et al., 2012), with unsupervised character and the possibility to custoize the algorith s features. To this ai a novel version of the heuristic ethod of parallel fast siulated annealing will be researched. Moreover, those eleents of a rando saple which significantly change their position following transforation will be eliinated or assigned less weight for the purposes of further analysis. This concept achieves an iproveent in the quality of knowledge discovery and, possibly, a reduction in the saple size. The effectiveness of the presented ethod will be verified for fundaental procedures in exploratory 1 Particular coordinates of a rando variable clearly constitute onediensional rando variables, and if the probabilistic aspects are not the subject of research then in data analysis these variables are given the ters feature or attribute. data analysis: clustering, classification and detection of atypical eleents (outliers). Many aspects considered in this paper were initially proposed by Łukasik and Kulczycki (2011) as well as Kulczycki and Łukasik (2014) in their basic for. 2. Preliinaries 2.1. Reduction in the diension and saple size. The diension can be reduced in any ways. Correctly sorting the procedures applied here requires, therefore, a wide range of criteria to be taken into account. Firstly, the aforeentioned systeatic for linear and nonlinear ethods is associated with the character of the dependence between the initial and reduced data sets. Most iportant of these, being even a reference linear procedure for diension reduction, is Principal Coponent Analysis (PCA). Aong linear ethods ost often entioned is MultiDiensional Scaling (MDS). Reduction procedures are often considered with respect to the ease of description of the apping between the initial and reduced data sets. This can be defined explicitly (which allows generalizing the reduction procedure on points not belonging to the initial data set), as well as given only iplicitly, i.e., through a reduced representation of eleents of the initial data set. The type of ethod chosen has particular significance in the cases of data analysis tasks, where a continuous influx of new inforation is present. In this for of the proble, reduction ethods belonging to the first of the above groups are preferred. The third division of transforation procedures is related to their level of relationship with the data analysis algoriths used in the next step. It is worth noting here universal techniques which, through an analogy to achine learning ethods, can be tered as unsupervised. These work autonoously, without using results of exploration procedures (Bartenhagen et al., 2010; Aswani Kuar and Srinivas, 2006). The second category concerns algoriths dedicated to particular techniques in data analysis, in particular considering class labels. Here statistical ethods as well as heuristic procedures of optiization, e.g., evolutionary algoriths or siulated annealing, are often used (Tian et al., 2010). For an overview of the existing diension reduction techniques one could refer to Cunningha (2007) or Aswani Kuar (2009). A reduction in the data set size can be realized with a wide range of sapling or grouping ethods. The forer ost often use rando procedures or stratified sapling (Han and Kaber, 2006). The latter apply either classical clustering techniques or special procedures for data condensation probles. There also exists a significant nuber of ethods for size reduction which take into account additional knowledge, for exaple, concerning

3 An algorith for reducing the diension and size of a saple for data exploration procedures 135 whether eleents belong to particular classes (Kulczycki and Kowalski, 2011; Wilson and Martinez, 2000). Moreover, ethods dedicated to particular analytical techniques, for exaple, kernel estiators (Kulczycki, 2005; Wand and Jones, 1995), have been developed (see, e.g., Deng et al., 2008). The ethod presented in this paper is based on a concept of diension reduction which is linear explicitly defined and of universal purpose. Its closest equivalents can be the principal coponent analysis ethod (due to its linear and unsupervised nature), feature selection using evolutionary algoriths (Saxena et al., 2010) and the projection ethod with preserved distances (Saon, 1969; Strickert et al., 2005; Vanstru and Starks, 1981), with respect to the siilar quality perforance criterion. A natural priority for the diension reduction procedure is aintaining distances between particular data set eleents a wide range of ethods treat this as a quality index. Typical for this group of algoriths is the classic ultidiensional scaling, also known as principal coordinates analysis. It is a linear ethod which creates the analytical for of the transforation atrix A, iniizing the perforance index S(A) = 1 i=1 j=i+1 ( d 2 ij δ ij (A) 2), (3) where d ij denotes the distance between the eleents x i and x j of the initial data set, while δ ij (A) are respective distances in the reduced data set. A different strategy is required when searching for a solution with different structural characteristics or the perforance index, or a nonlinear relation between the initial and reduced data sets. This type of procedure is tered ultidiensional scaling, as entioned before. A odel exaple of this is nonlinear Saon apping, which (due to the application of a siple gradient algorith) allows us to find a reduced representation of the investigated data set, ensuring the iniization of the so-called Saon stress: S S (A) = 1 i=1 1 1 j=i+1 d ij i=1 j=i+1 (d ij δ ij (A)) 2 d ij. (4) Such a criterion enables a ore hoogeneous treatent of sall and large distances (Cox and Cox, 2000), while the value S S (A) is further noralized to the interval [0, 1]. An alternative index, also considered in the context of MDS, is the so-called raw stress, defined by S R (A) = 1 i=1 j=i+1 (d ij δ ij (A)) 2. (5) Multidiensional scaling ethods are ostly nonlinear procedures. However, the task was undertaken to forulate the proble of iniization of the indexes (4) and (5) with the assued linear for of transforation. The first exaple of this technique is the algorith for finding a linear projection described by (Vanstru and Starks, 1981). They apply an iterative ethod of steepest descent, which gives in consequence better results than PCA in the sense of the index (4). A siilar procedure was investigated for the function (5), with the additional possibility to successively suppleent a data set (Strickert et al., 2005). In both the cases the approach applied did not account for the ultiodality of the stress function. To avoid becoing trapped in a local iniu, one can use the appropriate heuristic optiization strategy. In particular, for iniization of the index (4), Saxena et al. (2010) use the evolutionary algorith. The solution for this investigation is, however, only to choose the reduced features set. A ore effective approach sees to be the concept of their extraction being ore general, it will be the subject of investigation in this paper. In the construction of the algorith presented here, an auxiliary role is played by an unsupervised technique of feature selection using to this ai an appropriate easure of the siilarity index of axial copression of inforation (Pal and Mitra, 2004). It is based on the concept of dividing features into clusters, with the siilarity criterion of features defined by the aforeentioned index. This division is based on the algorith of k-nearest neighbors, where it is recoended that k = n N. The nuber of clusters achieved then approaches N, although it is not strictly fixed, but in a ore natural anner adapted to a real data structure. Another aspect of the procedure presented here is a reduction in the size of the saple (1). Conceptually, the closest technique is the condensation ethod (Mitra et al., 2002). It is unsupervised, and to establish the iportance of eleents it takes into account their respective distances. In this case the algorith of k-nearest neighbors is also applied, where the siilarity easure between saple eleents is the Euclidean distance. Within this algorith, in the data set there are iteratively found prototype points, or points for which the distance r to the k-th nearest neighbor is the sallest. With every iteration, eleents closer than 2r fro the nearest prototype point are eliinated Siulated annealing algorith. Siulated Annealing (SA) is a heuristic optiization algorith, based on the iterative technique of local search with an appropriate criterion for accepting solutions. This allows establishing a valid solution for every iteration, ostly using the perforance index value for the previous and current iteration, and a variable paraeter called the

4 136 P. Kulczycki and S. Łukasik annealing teperature, which decreases in tie. Within the schee of the algorith it is possible to accept a valid solution a worse than the previous one, which reduces the danger of the algorith getting stuck at local inius. It is assued, however, that the probability of accepting such worse solution should decrease over tie. All of the above traits contain the so-called Metropolis rule, which is ost often applied as an acceptance criterion in siulated annealing algoriths. Let therefore Z R t denote the set of adissible solutions to a certain optiization proble, while the function h : Z R is its perforance index, hereinafter referred to as cost. Furtherore, let k =0, 1,... stand for the nuber of iteration, whereas T (k) R, z(k) Z, c(k) =h(z(k)), z 0 (k) Z, c 0 (k) =h(z 0 (k)) denote respectively the teperature and the solution valid for the iteration k and its cost, and also the best solution found to date and its cost. Under the above assuptions the basic variant of the SA algorith can be described as Algorith 1. The procedure for the Metropolis rule is realized by Algorith 2. Algorith 1. Siulated annealing. 1: Generate(T(1),z(0)) 2: c(0) Evaluate_quality(z(0)) 3: z 0 (0) z(0) 4: c 0 (0) c(0) 5: k 1 6: repeat 7: z(k) Generate_neighbor(z(k 1)) 8: c(k) Evaluate_quality(z(k)) 9: Δc c(k) c(k 1) 10: z(k) =Metropolis_rule(Δc, z(k),z(k 1),T(k)) 11: if c(k) <c 0 (k 1) then 12: z 0 (k) z(k) 13: c 0 (k) c(k) 14: else 15: z 0 (k) z 0 (k 1) 16: c 0 (k) c 0 (k 1) 17: end if 18: Calculate(T (k +1)) 19: stop_condition = Check_stop_condition() 20: k k +1 21: until stop_condition = false 22: return k stop = k 1, z 0 (k stop ), c 0 (k stop ) The SA algorith requires in the general case the assuption of the appropriate initial teperature value, a forula of its changes associated with an accepted ethod of generating a neighboring solution, and a condition for ending the procedure. However, in particular applications one should also define other functional eleents, such as the ethod of generating the initial solution and the for of the quality index. The first group of tasks will Algorith 2. Metropolis rule. 1: if Δc <0then 2: return z(k) 3: else 4: if Rando_nuber_in_(0,1) < exp( Δc/T (k)) then 5: return z(k) 6: else 7: return z(k 1) 8: end if 9: end if now be discussed, while the second as specific for the application of the SA algorith investigated here will be the subject of a detailed analysis in Section 3. Nuerous fundaental and applicational works have resulted in the creation of any variants of the algorith described here. Their ain difference is the schee for teperature changes and the ethod for obtaining a neighboring solution. A standard approach is the classical siulated annealing algorith, also known as the Boltzann Annealing (BA) algorith. This assues an iterative change in teperature according to a logarithic schedule and generation of a subsequent solution by adding to the current one the value of step Δz R t, which is a realization of a t-diensional pseudorando vector with the noral distribution. The BA algorith, although effective in the general case, has a large probability of acceptance of worse solutions, even in the final phase of the search process. This allows effective escape fro local inia of a cost function and guarantees asyptotic convergence to a global one (Gean and Gean, 1984), while also causing that the procedure represents, in soe sense, rando search of the space of adissible solutions. For the SA algorith to be ore deterinistic in character, and at the sae tie to keep convergence to the optial solution, the following schee for the teperature change can be applied: T (k +1)= T (1) k +1, (6) together with the generation of neighboring solution using the Cauchy distribution, g(δz) = T (k). (7) (Δz 2 + T (k) 2 )(t+1)/2 The procedure defined by the above eleents is called Fast Siulated Annealing (FSA) (Szu and Hartley, 1987). It will be a base in the fraework of this paper for the diension reduction algorith. The proble of practical ipleentation of FSA is effective generation of rando nubers with a

5 An algorith for reducing the diension and size of a saple for data exploration procedures 137 ultidiensional Cauchy distribution. The siplest solution is the application, for each diension of the vector, of a one-diensional nuber generator with the sae distribution. This strategy was used in the Very Fast Siulated Annealing (VFSA) algorith, expanded later within the fraework of a coplex procedure of adaptive siulated annealing (Ingber, 1996). Such a concept has, however, a fundaental flaw: the step vectors generated here concentrate near the axes of the coordinate syste. An alternative could be to use a ultidiensional generator based on the transforation of the Cartesian coordinate syste to a spherical one. It is suggested here that the step vector Δz = [Δz 1, Δz 2,...,Δz t ] should be obtained by generating first the radius r of the hypersphere, using the ethod of inverting the Cauchy distribution function described with the spherical coordinates, and then selecting the appropriate point on the t-diensional hypersphere. The second phase is realized by randoly generating the vector u =[u 1,u 2,u t ] T with coordinates originating fro the one-diensional noral distribution u i N(0, 1), and then the step vector Δz is obtained by Δz i = r u i, i =1, 2,...,t. (8) u The presented procedure ensures a syetric and ultidirectional generation schee, with heavy tails of distribution, which in consequence causes effective exploration of a solution space (Na et al., 2004). Taking the above into account, it has been applied in the algorith investigated in this paper. Establishing an initial teperature is vital for correct functioning of the siulated annealing algorith. It iplies the probability of acceptance of a worse solution at subsequent stages of the search in the solution space. The subject literature tends to suggest choosing the initial teperature so that the probability of acceptance of a worse solution at the first iteration, denoted hereinafter as P (1), is relatively large. These recoendations are not absolute, however, and different proposals can be found in the literature, for exaple, close to 1.0 (Aarts et al., 1997), around 0.8 (Ben-Aeur, 2004) or even only 0.5 (Kuo, 2010). Often in practical applications of SA algoriths the teperature value is fixed during nuerical experients. An alternative is to choose a teperature according to a calculational criterion which has the goal of obtaining the T (1) value on the basis of a set of pilot iterations, consisting of generating the neighbor solution z(1) so that the assued P (1) value is ensured. For this purpose one can, analyzing the ean difference in cost between the solutions z(1) and z(0), denoted by Δc, calculate the teperature T (1) value by substituting Δc to the right-side of the inequality in the Metropolis rule defining the probability of the worse solutions acceptance: ( P (1) = exp Δc ), (9) T (1) thus in consequence T (1) = Δc ln P (1). (10) The ean difference in cost can be replaced with, e.g., the standard deviation of the cost function value, arked as σ c, also estiated on the basis of the set of pilot iterations (Sait and Youssef, 2000). A proble which appears in the case of SA algoriths dedicated to iniizing functions with real arguents (including the aforeentioned BA, FSA, VFSA and ASA) is the dependence of the strategy for generating a neighbor solution on teperature. Therefore, both the standard deviation σ c and the ean Δc are directly dependent on it. The application of the forula (10) is not possible here, and in the case of these algoriths the initial teperature value is usually arbitrary. This paper proposes a different strategy based on the generation of a set of pilot iterations. It allows the value T (1) to be obtained with the assuption of any initial probability of worse solution acceptance. As iportant as the choice of the initial teperature is the deterination of the iteration at which the algorith should be terinated. The siplest (although not flexible and often requiring too detailed knowledge of the investigated task) stop criterion is reaching a previously assued nuber of iterations or a satisfactory cost function value. An alternative could be to finish the algorith when following a certain nuber of iterations the best obtained solution is not iproved, or the use of an appropriate statistical ethod based on the analysis of cost function values as they are obtained. The last concept is universal and (desirable aong heuristic algoriths stop criteria) related to the expected result of their works. This usually consists of calculating the estiator of the expected value of the global iniu ĉ in and finishing the algorith in the iteration k, when the difference between it and the discovered sallest value c 0 (k) is not greater than the assued positive ε,soif c 0 (k) ĉ in ε. (11) One recent technique using this type of strategy is the algorith proposed by Bartkuté and Sakalauskas (2009). In order to calculate the value ĉ in an estiator is applied here based on order statistics (David and Nagaraja, 2003). This algorith constitutes a universal and effective tool for a wide range of stochastic optiization techniques. Such a ethod, used as part of the FSA procedure, will now be described.

6 138 P. Kulczycki and S. Łukasik Let therefore {c 0 (k),c 1 (k),c 2 (k),...,c r (k)} denote the ordered non-decreasing set of r lowest cost function values, obtained during k iterations of the algorith. In the case of an algorith convergent to a global iniu, the condition li k c j (k) =c in is fulfilled for every j N, while the sequences c j (k) can be applied to construct the aforeentioned estiator value ĉ in.this estiator akes use of the assuption of asyptotic convergence of the order statistic distribution to the Weibull distribution, and in the iteration k takes the general for: 2t β 1 ĉ in (k) =c 0 (k) r (c r(k) c 0 (k)). (12) The paraeter β, occurring above, is tered a hoogenous coefficient of the cost function h around its iniu. On additional assuptions, in calculational practice one can take β =2(Zhigljavsky and Žilinskas, 2008). The confidence interval for the cost function iniu, at the assued significance level δ (0, 1), is of the for [c 0 (k) (1 (1 δ)1/r ) β/t 1 (1 (1 δ) 1/r ) (c r(k) β/t ] c 0 (k)),c 0 (k). (13) Bartkuté and Sakalauskas (2009) suggest that the point estiator (12) can be replaced by the confidence interval (13) with the algorith being stopped when the confidence interval width is less than the aforeentioned, assued value ε. Such an idea, odified for the specific proble under investigation, will be applied here. The siulated annealing procedure can be easily parallelized, whether for required calculations or in the schee of establishing subsequent solutions. While parallelizing the SA algorith is not a new idea and was already investigated a few years after its creation (Azencot, 1992), it needs to be adapted for particular applicational tasks (Alba, 2005). At present the suitability of the Parallel Siulated Annealing (PSA) algorith continuously increases with the coon availability of ulticore systes. In the algorith worked out in this paper, a variant will be taken with parallel generation of neighbor solutions, assuing that the nuber of SA threads equals that of available processing units. 3. Procedure for reducing the diension and saple size The algorith investigated in this paper consists of two functional coponents: a procedure for reducing the diension and a way of allowing the saple size to be decreased. They are ipleented sequentially, with the latter dependent on the results of the forer. The reduction of the saple size is optional here Procedure for diension reduction. The ai of the algorith under investigation is a decrease in the diensionality of the data set eleents, represented by the atrix X with the for specified by the forula (1), so of the diension n, where eans the data set size and n the diension of its eleents. In consequence, the reduced for of this set is represented by the data atrix Y of the diension N, while N denotes the assued reduced diension of eleents, appropriately less than n. The procedure for reducing the diension is based on the linear transforation (2). For the purposes of notation used in the siulated annealing algorith, atrix A is denoted as the row vector [ a 11,a 12,...,a 1N,a 21,a 22,...,a 2N,...,a n1, a n2,...,a nn ], (14) which represents the current solution z(k) R (nn) in any iteration k. In order to generate neighbor solutions, a strategy was used based on the ultidiensional generator of the Cauchy distribution (forulas (7) and (8)). The quality of the obtained solution can be evaluated with the application of the cost function h, which is the function of the raw stress S R given by the dependence (5), where the eleents of the atrix Y are defined on the basis of Eqn. (2). The alternative possibility of using the Saon stress (4) for this purpose was also exained. The developed procedure requires firstly that the basic paraeters should be specified: the diension of the reduced space N, a coefficient defining directly the axiu allowed width of the confidence interval ε w for the stop criterion based on the order statistics, the nuber of processing threads of the FSA procedure p thread, the initial scaling coefficient (step length) for the ultidiensional Cauchy generator T scale, as well as the probability of acceptance of a worse solution P (1) in the first iteration of the FSA algorith. Starting the algorith requires oreover the generation of the initial solution z(0). Tothisai,the feature selection procedure of Pal and Mitra (2004), described in the previous section, was realized. Here k = n N should be assued, which in consequence usually results in obtaining approxiately N clusters. The aforeentioned procedure leads to obtaining the auxiliary vector b R n, whose particular coordinates characterize the nuber of the cluster to which the coordinate fro the original space was apped, as well as the vector b r R n of binary values b r (i) {0, 1} for i = 1, 2,...,n, defining whether a given feature was chosen as a representative of the cluster to which it belongs, in which case b r (i) =1, or not, then b r (i) =0. The auxiliary vectors b and b r can be used in the algorith considered for generating the initial solution in two ways: 1. Each of N features of the initial solution is a

7 An algorith for reducing the diension and size of a saple for data exploration procedures 139 linear cobination of features apped to one of N clusters to define the for of the atrix A one can use { 1 if b(i) =j, a ij = (15) 0 if b(j) j for i =1, 2,...,nand j =1, 2,...,N. 2. Each of N features of the initial solution is given as a representative for one of N clusters the for of the atrix A is then defined as { 1 if b r (i) =1 and b(i) =j, a ij = (16) 0 if b r (i) =0 for i =1, 2,...,nand j =1, 2,...,N. The possibility of applying both of the above ways of generating an initial solution the first called a linear cobination of features and the second referred to as feature selection is a subject of detailed experiental analysis concerning diensional reduction, described in Section 4. After generating the initial solution, in order to carry out the siulated annealing algorith, the teperature T (1) should be established for the first iteration. To this ai, the technique presented in the previous section can be followed, allowing us at the start to obtain the assued initial value of the probability of worse solution acceptance P (1). In the case of the algorith for generating neighbor solutions, it is not recoended to use the relation resulting fro the equality (9). As entioned in the previous section, this is iplied by the dependence of a forula for generating a neighbor solution on the annealing teperature. In order to avoid this inconvenience, an additionalcoefficient T scale was introduced, being the paraeter of the Cauchy distribution in the first iteration of the FSA algorith (also known as an initial step length), and furtherore the teperature occurring in the generating distribution was scaled. The coefficient T scale is thus used as a paraeter of the rando nubers generator, with the ai of calculating a set of pilot iterations (the size of this set is assued to be 100). These iterations consist of the generation of an appropriate nuber of transitions fro z(0) worse in the sense of the cost function used to the neighbor solution z(1), and the establishent of the ean value of the cost difference δc between z(1) and z(0). This value is inserted to the forula (10), through which the initial teperature can be obtained. Moreover, in order to find the assued shape of the generated distribution, in the first iteration of the FSA algorith an additional scaling coefficient is calculated: c tep = Δc ln P (1)T scale. (17) In consequence, in the first iteration of the actual algorith, in order to generate a neighbor solution, the scaled teperature T (1)/c tep (therefore T scale )is used, and for the Metropolis rule it is just the value T (1). Siilar scaling takes place during the generation of neighbor solutions in each consecutive iteration of the FSA algorith. Thanks to this kind of operation it becoes possible to fix the initial probability of acceptance of a worse solution, deterined by the coefficient P (1), while retaining the additional possibility of establishing by assuing the value T scale the paraeter of the initial spread of values obtained fro a pseudorando nubers generator. All iterations of the FSA algorith have been parallelized using a strategy with parallel generation of neighbor solutions. So each of p thread threads creates a solution neighboring the one established in the previous iteration z(k 1). This occurs with the application of a rando generator with a ultidiensional Cauchy distribution. For all threads, the annealing teperature is identical and equals T (k)/c tep. Furtherore, every thread realizes the procedure for the Metropolis rule, accepting or rejecting its own obtained neighbor solutions. The next two steps of the algorith are perfored in sequence. So, first the current solution is established for the SA algorith. The procedure for this is to choose as a current solution either the best fro those better than that the ones found in the previous iteration obtained by different threads, or if such a solution does not exist a rando selection of one of the worse solutions. Thus, the current solution, together with the teperature updated according to the forula (6), is used in the next iteration of the FSA algorith as the current solution. This kind of strategy can be classified as a ethod of parallel processing based on speculative decoposition. The last step perfored as part of a single iteration is verifying the criterion for stopping the SA procedure. To this ai, the confidence interval for the iniu value of the cost function, given by the forula (13), is calculated. The order statistics used for interval estiation have the order r assued to be 20, in accordance with the proposals of Bartkuté and Sakalauskas (2009). As a significance level δ for the confidence interval defined by the forula (13), a typical value 0.99 is assued. The width of the confidence interval is copared with the threshold value ε calculated at every iteration with ε =10 εw c 0 (k). (18) Finally, the siulated annealing procedure is terinated if (1 (1 δ) 1/r ) β/t 1 (1 (1 δ) 1/r ) (c r(k) c β/t 0 (k)) >ε, (19) with the notation introduced at the end of Section 2. Finding the threshold value ε based on the forula (19)

8 140 P. Kulczycki and S. Łukasik allows the adaptation of such a criterion to the structure of a specific data set under investigation. The sensitivity of the above procedure can be set by assuing the value of the exponent ε w N, one of the arbitrarily fixed paraeters of the procedure worked out in this paper. It is worth noting that the nature of the ethod presented here for diension reduction enables the establishent of the contribution which particular eleents of the data set Y ake to the final value c 0 (k stop ). This fact will be used in the procedure for reducing the saple size, which will be presented in the next section. A graphical description of the algorith is suarized in Fig. 1. Fig. 1. Diension reduction algorith Procedure for saple size reduction. In the case of the diension reduction ethod presented above, soe saple eleents ay be subject to an undesired shift with respect to the others and, as a result, ay noticeably worsen the results of an exploratory data analysis procedure in the reduced space R N. A easure of the location deforation of the single saple eleent x i copared to the others, resulting fro the transforation (2), is the corresponding stress value c 0 (k stop ) calculated for this point (called the stress per point) (Borg and Groenen, 2005). In the case of the raw stress it is given by c 0 (k stop ) i = S R (A) i = (d ij δ ij (A)) 2, (20) j=1 j i whereas for the Saon stress it takes the for c 0 (k stop ) i = S S (A) i = 1 i=1 1 j=i+1 d ij j=1 j i (d ij δ ij (A)) 2 d ij. (21) It is worth noting that in both cases these values are nonzero, except for the case (unattainable in practice) of perfect atching of respective distances of eleents in the original and reduced spaces. The values c 0 (k stop ) i for particular eleents can be used to construct a set of weights, defining the adequacy of their location in a reduced space. Let therefore w i represent nonnegative weight apped with the eleent x i. Taking the above into account, it is calculated according to the following forula: w i = 1 c 0(k stop) i 1. (22) i=1 c 0(k stop) i The noralization which occurs in the above dependence guarantees the condition = w i. (23) i=1 The weights in this for contain inforation as to the degree to which a given saple eleent changed its relative location copared with the rest, where the larger the weight, the ore relatively adequate its location, and its significance should be greater for procedures of exploratory data analysis carried out in a space of a reduced diension. The weights values which are calculated on the basis of the above forulas can be used for further procedures of data analysis. They also allow the following ethod of reducing the saple size. Fro the reduced data set Y one can reove those el eleents for which their respective weights fulfill the condition w i < W with assued W > 0. Intuitively, W = 1 is justified taking into account the forula (23), and this results in the eliination of eleents corresponding to the values w i less than the ean. In conclusion, conjoining the ethods fro Sections 3.1 and 3.2 enables a data set with a reduced diension as well as size to be obtained, with the degree of copression iplied by the values of paraeters N and W.

9 An algorith for reducing the diension and size of a saple for data exploration procedures Coents and suggestions. In the case of the procedure for reducing the diension and saple size presented here, efforts were ade to liit the nuber of paraeters, whose arbitrary selection is always a significant practical proble for heuristic algoriths. At the sae tie, conditioning data analysis tasks, in which the procedure investigated here will be applied, eans that, fro a practical point of view, it is useful to propose specific values of those paraeters with an analysis of the influence of their potential changes. One of the ost iportant arbitrarily assued paraeters is the reduced space diension N. It can be fixed initially using one of the ethods for estiating a hidden diension (Caastra, 2003), or by taking a value resulting fro other requireents, for exaple, N =2 or N = 3, to enable a suitable visualization of the investigated data set. It is worth reebering that the procedure applied earlier for generating an initial solution with the fixed paraeter k = n N creates a solution which does not always have a diensionality identical to the assued (as entioned in the previous section). If a strictly defined diension of the reduced data set is required, one should adjust the paraeter k by repeating the feature selection algorith with its correctly odified value, or use the initial solution, generated randoly, of the assued diension of reduced space. It is also worth entioning the proble of coputational coplexity of the procedure worked out here, in particular regarding calculation of the cost function value. In practice, the coputational tie for the PSA algorith increases exponentially with an increase in the saple size. So, despite the heuristic algorith being the only ethod available in practice to iniize the stress function S S or S R for data sets of large diensionality and size, its application ust, nonetheless, be liited to those cases which are in fact feasible. Therefore, although the nuber of siulated annealing treads p thread can be fixed at will, it should take the available nuber of processing units into account. This allows efficient parallel evaluation of the cost function by particular threads. It should also be noted that the subject algorith, due to its universal character, can be applied to a broad range of probles in statistics and data analysis. An exaple, fro the case of statistical kernel estiators (Kulczycki, 2005; Wand and Jones, 1995), is the introduction of generalization of the basic definition of the estiator of probability distribution density to the following for: f(ŷ) = 1 h n i=1 w i ( ) y yi w i K. (24) h i=1 Such a concept allows not only a reduction in the saple size (for reoved eleents, w i = 0 is assued), but also alternatively an iproveent in the quality of estiation in the reduced space without eliinating any eleents fro the initial data set. In the forer case care should also be taken to noralize the weights after eliinating parts of eleents, to fulfill the condition (23). This approach ay be used, e.g., in the coplete gradient clustering technique (Kulczycki and Charytanowicz, 2010). Weights w i calculated in the above anner can also be introduced to odified classical ethods of data analysis, such as a weighted k-eans algorith (Kerdprasop et al., 2005), or a weighted technique of k-nearest neighbors (Parvin et al., 2010). In the first case, the weights are activated in the procedure for deterining centers of clusters. The location of the center of the cluster C i, denoted by s i =[s i1,s i2,s in ] T, is updated in every iteration if y l C i w l 0, according to the forula 1 s ij = w l y lj (25) y l C i w l y l C i for j =1, 2,...,N.Inthek-nearest neighbors procedure, however, each distance fro neighbors of any eleent fro the learning data set is scaled using the appropriate weight. 4. Nuerical verification The presented ethodology underwent detailed nuerical testing. It consisted of an analysis of the ain functional aspects of the investigated algorith, in particular the dependence of its efficiency on the values of arbitrarily assued paraeters. Another subject of the nuerical analysis was the quality of reduction of the diension and saple size, also in coparison with alternative solutions available in the literature. Exaination was ade with both data sets obtained fro repositories and the literature, as well as those generated pseudorandoly with various diensionalities and configurations. The forer consist of the data sets, denoted hereinafter as W1, W2, W3, W4 and W5, with the first four taken fro the Machine Learning Repository aintained by the Center for Machine Learning and Intelligent Systes, University of California Irvine, on the website (UC Irvine Machine Learning Repository, 2013), while the fifth one coes fro the authors own research (Charytanowicz et al., 2010). Thus W1: glass, contains the results of cheical and visual analysis of glass saples, fro 6 weakly separable (Ishibuchi et al., 2001) classes, characterized by 9 attributes; W2: wine, represents the results of analysis of saples fro 3 different producers, aking up strongly separable (Cortez et al., 2009) classes, with 13 attributes;

10 142 P. Kulczycki and S. Łukasik W3: Wisconsin Breast Cancer (WBC), obtained during oncological exainations (Mangasarian and Wolberg, 1990), has 2 classes and 9 attributes; W4: vehicle, presents easureents of vehicle shapes taken with a caera (Oliveira and Pedrycz, 2007), grouped into 4 classes, described by 18 attributes; W5: seeds, lists the diensions of 7 geoetrical properties of 3 species of wheat seeds, obtained using the X-ray technology (Charytanowicz et al., 2010). For classification purposes the above data sets are divided into learning and testing saples in the proportion 4:1. The latter, data sets R1, R2, R3, R4 and R5, were obtained using a pseudorando nuber generator with the noral distribution, with the assued expected value and the covariance atrix Σ. The tested sets were fored as linear cobinations of such coponents with coefficients p k (0, 1). Their list is presented in Table 1. For classification purposes, every coponent of a given distribution is treated as an individual class, while in the case of clustering it is assued that subsequent clusters contain eleents of particular coponents of linear cobinations. Siilarly for the needs of identification of outliers, atypical eleents are considered to be coponents with arginal values of linear cobination coefficients. The case of clustering uses the classic procedure of k-eans with the assued nuber of clusters equal the nuber of classes appearing in the tested set. Its quality is assessed with the Rand index (Parvin et al., 1971): I c = a ( + b ) 100%, (26) 2 where a and b denote the nuber of pairs of eleents properly assigned to the sae and different clusters. If the results concern the initial space, then the above index specifies itself to I cinit, whereas I cred corresponds to the reduced space. Next, the classification procedure will be realized applying the standard nearest neighbor algorith. Its results are judged with a natural indicator which is given as a percentage of correctly qualified eleents, denoted in the initial space as I kinit and in the reduced as I kred. Finally, an identification of outliers is ade by a testing procedure based on statistical kernel estiators (Kulczycki, 2008). The quality of the effects can be appraised in the sae way as for classification, where the indicators are tered I oinit and I ored, respectively. All perforance indices used in the paper are outlined in Table 2. The diension of the reduced space N is arbitrarily assued following suggestions fro the appropriate subject literature (Charytanowicz et al., 2010; Saxena et al., 2010). For pseudorando data sets its value was selected to ensure a convenient visualization of results. Table 3 presents a suary of these. In every case, Table 3. Diensions of reduced spaces. Data set N Data set N W1 4 R1 1 W2 5 R2 1 W3 4 R3 2 W4 10 R4 2 W5 2 R5 2 the tested procedure was repeated 30 ties, following which, for further inference, the ean values and standard deviations of the introduced indexes were used. Thus, preliinary research concerned the functionality of the investigated procedure, in particular its sensitivity to versions and paraeters with arbitrarily assued values. During the conducted research, the options entioned below were taken into account: (i) two versions of generating the initial solution: feature selection (version A) and the linear cobination of features (version B); in addition, a randoly obtained initial for of the transforation atrix was also considered, which was, however, disissed at the preliinary stage, due to its ineffectiveness; (ii) two possible fors of the cost function, described by the forulas (5) and (4); (iii) four levels of sensitivity for the stop criterion ε w = {1, 2, 3, 4}, whose range resulted fro practical conditioning of the test carried out; (iv) three possible values for the paraeter T scale = { }. Furtherore, in this phase it was also assued that P (1) = 0.7 and p thread =4. In general, it can be ascertained that the results obtained for every option did not differ significantly, therefore (worth stressing) the algorith worked out is not very sensitive to the selected paraeters. In practice, this is highly valuable and considerably increases its applicational potential. The preferred strategy for generating an initial solution sees to be the version based on feature selection (version A). The for of the cost function, however, does not appear to have a great influence on the quality of the transforation. The difference, while negligible in the sense of the exained indexes, can only be seen following the analysis for ε w = 4. In probles of identifying outliers and clustering, using the function S R is recoended, and for classification the function S S. However, it is difficult to definitely suggest a value for

11 An algorith for reducing the diension and size of a saple for data exploration procedures 143 Table 1. Characteristics of data sets R1 R5 of pseudorando nature. Data set n Paraeters of distribution coponents k p k μ k Σ k R1(,a,o) 2 1 o/2 [0 a] [ ] o/2 [aa] o [a 0] R2(,a,o) o [000] 2 o/4 [a 00] 3 o/4 [a 0-a] 4 o/4 [aa0] 5 o/4 [0 0 a] R3(,a,o) 5 1 o/5 [a 0000] 2 o/5 [0 a 000] 3 o/5 [0 0 a 00] 4 o/5 [000a 0] 5 o/5 [0000a] 6 1-o [aaaaa] [σ ij] i, j =1, 2,...,n i = j, σ ij =1 i j, σ ij =0 R4(,a,o) 10 1 o/2 [a ] [σ ij] 2 o/2 [ a] i, j =1, 2,...,n 3 (1-o)/2 [ ] i = j, σ ij =1 4 (1-o)/2 [ a/2] i j, σ ij =0 R5(,a,o) o [5a -aa-aa-aa-aa-aa-aa-a 5a] 2 o/2 [a -aa-aa-aa-aa-aa-aa-aa] 3 o/2 [-aaa-aa-aa-aa-aa-aa-aa-a] [σ ij] i, j =1, 2,...,n i = j, σ ij =1 i j, σ ij =0 the coefficient T scale. The ost stable in the sense of presented efficiency sees to be the version with T scale = 0.1. Lastly, the ost advantageous results were obtained for two versions of the FSA algorith based on generating an initial solution with selection of adequate features, the sensitivity of the stop criterion ε w = 4, the coefficient T scale =0.1, as well as the cost function S S or S R.These two versions, called standard in the following part of this paper, were used in later research involving nuerical evaluation of the concept presented here. In the next stage of the experiental research, the dependence of the algorith s efficiency on the value P (1) was exained. Appropriate tests were carried out considering the cases P (1) 0, P (1) = 0.1, 0.2,...,0.9 and also P (1) 1. Above all it can be inferred that the FSA algorith applied is highly efficient with respect to parallel local search (which corresponds to P (1) 0), and also priarily a rando parallel exploration of the solution space (when P (1) 1). In the case of the algorith used, low probabilities in the interval [0.1, 0.4] are preferred. This results fro the eployed schee of generating neighbor solutions, which, together with the procedure of teperature change according to the forula (6), allows distant jups in the solutions space under investigation. It copensates for the influence of the low probability of the acceptance of worse solutions. The next subject of the research was the presented algorith for reducing the saple size. Verification of the obtained results consisted in defining the percentage reduction of the saple for particular values of W = 0, 0.1,...,2.0. The dependence of the nuber of reoved eleents on the paraeter W is not easy to express with an analytical forula. It is, however, possible to continuously adjust the degree of reduction achieved. The value W close to zero iplies the reoval of a sall nuber of data set eleents, while even W =2results in a reduction greater than 80%. Fixing W =1usually causes a saple size reduced by about half, because the eleents with weights less than average are eliinated. An increase in the intensity of saple reduction in the case of soe data analysis procedures ay have a positive influence on the quality of the results in the reduced space. This was ainly observed for the identification of outliers, where W = 0.9 produced the best results. In this event, reoval fro the data set of a significant aount of its eleents did not usually lower the difference between typical eleents and outliers, but often actually exaggerated it. In clustering and classification, one can use the research to argue that such a high degree of reduction is advisable, as it leads to the eliination of saple eleents which are fundaental to the creation of a grouping structure.

12 144 P. Kulczycki and S. Łukasik Table 2. Perforance indices used for evaluating results of the data analysis tasks considered. Task Cluster analysis Classification Outlier detection Solution quality indicator and its I c = a+b ( 2 ) 100% I k = a 100% Io = a 100% description :saplesize :saplesize :saplesize a: nuber of saple eleents a: nuber of saple eleents a: nuber of saple eleents properly assigned assigned to the right class assigned to the right class to the sae cluster (outlier/typical) b: nuber of saple eleents properly assigned to different clusters In conclusion it can be suggested that where other conditions are not present for these probles, the copression coefficient should be assued fro the range between 0.1 and 0.2. This often enables the eliination of such eleents which, as a result of transforation, undergo significant displaceent with respect to the rest, causing an iproveent in clustering and classification quality in the reduced space. All suggested values of the algorith s paraeters and recoended options are ultiately suarized in Table 4. Table 4. Suggested values and options of the algorith. Paraeter/option Suggested value Initial solution generation Version A Cost function S S for classification S R otherwise Sensitivity of stop criterion ɛ w =4 Teperature scaling T scale =0.1 Initial probability of P (1) [0.1, 0.4] worse solution acceptance Copression coefficient W = 1 for outlier detection W [0.1, 0, 2] otherwise Finally, the effectiveness of the algorith under investigation was copared with other reference ethods. The first verification was ade only for the diension reduction procedure, and then together with the saple size reduction algorith. Research was carried out using the options given above, whereas for the identification of outliers and clustering the cost function S R was used, and for classification the function S S. In the case of diension reduction, the classic PCA linear algorith as well as the aforeentioned features selection procedure using evolutionary algoriths and the Saon stress function were used as reference techniques. Research of the latter focused only on the efficiency of classification, for which this strategy was confired earlier in the work (Saxena et al., 2010). The results of experients are shown in Tables 5 7. The reported execution ties include both diensionality reduction and the given data ining task. In interpreting the results of identifying outliers shown in Table 5, it ust be underlined that the proposed ethod for diension reduction ensures, in ost cases, a greater efficiency of the data analysis procedure used. This is also true when coparing results of this procedure in an unreduced space, as well as when juxtaposing the result with that obtained using principal coponent algorith. Reducing the diensionality of a data set with the subject algorith, also in the case of classification (as shown in Table 6), ensures greater average efficiency, both when the reference constitutes the result obtained on the basis of a reduced data set using the principal coponent algorith, as well as in the consequent analysis of a features set, chosen with an evolutionary strategy. For clustering (Table 7), a reduced for of a data set is usually achieved, for which the k-eans algorith produces a result which is ore copatible (in the sense of the I cred index) with the data structure than for the PCA ethod. Table 7. Coparison of results of diension reduction for clustering. W1 W2 W3 W4 W5 glass wine WBC vehicle seeds I cinit ±σ(i cred ) ±1.39 ±0.68 ±0.38 ±1.08 ±0.53 Reduction using FSA I cred ±σ(i cred ) ±0.61 ±0.76 ±0.31 ±0.28 ±0.78 ax(i cred ) tie[s] Reduction using PCA I cred ±σ(i cred ) ±0.21 ±0.88 ±0.21 ±0.32 ±0.55 In the next stage, a coplete coparison, siilar to the one presented above, was conducted, additionally taking into account the procedure for saple size reduction. According to previous inferences, for the identification of outliers, the copression coefficient W

13 An algorith for reducing the diension and size of a saple for data exploration procedures 145 Table 5. Coparison of results of diension reduction for identification of outliers. R1 R2 R3 R4 R5 (300, 4, 0.1) (300, 4, 0.1) (500, 3, 0.1) (500, 4, 0.1) (500, 2, 0.1) I oinit Reduction using FSA I ored ±σ(i ored) ±0.00 ±1.91 ±0.87 ±1.20 ±0.73 ax(i ored) tie[s] Reduction using PCA I ored tie[s] Table 6. Coparison of results of diension reduction for classification. W1 W2 W3 W4 W5 glass wine WBC vehicle seeds I kinit ±σ(i kinit ) ±8.10 ±5.29 ±1.35 ±3.34 ±2.85 Reduction using FSA I kred ±σ(i kred ) ±3.79 ±6.46 ±1.12 ±2.98 ±2.31 tie[s] Reduction using PCA I kred ±σ(i kred ) ±6.37 ±7.22 ±2.06 ±3.84 ±7.31 Reduction using evolutionary algoriths I kred ±σ(i kred ) ±4.43 ±1.02 ±0.80 ±1.51 N/A value was fixed at the level 0.9, for classification W = 0.1, and for clustering W = 0.2. To illustrate, in the latter two, a selected version with ore intensive reduction was additionally considered. In order to verify the effectiveness of cobining the investigated procedures of diension and saple size reduction, the possibility was also exained of using clustering together with the principal coponent algorith, and the proposed saple size reduction with a siilar reduction intensity. Moreover, in the case of identifying outliers, the prospect was also considered of applying the PCA ethod with the algorith for condensing data presented by Mitra et al. (2002). The results obtained are presented sequentially in Tables The results displayed in Tables 8 10 lead us to conclude that the strategy of saple size reduction is a worthy suppleent to the ethod for diension reduction based on the FSA algorith. For identifying outliers, treated data sets saw 30 50% of saple eleents reoved and, as expected, the quality of identification of outliers iproved greatly. In the cases of clustering and classification, the first series of experients did not apply intensive reduction of the saple size. However, eliination of even a sall part of stray eleents transfored during diension reduction leads to an increase in the average efficiency of the procedures applied in the reduced space. This is especially obvious in classification with the nearest neighbor algorith, which is very sensitive to disturbances in the data set structure introduced by the transforation. An increase in the copression coefficient W value, realized in the second stage of the nuerical experients, usually sees the worsening of stability and/or average efficiency of classification and clustering. This ainly concerns the first procedure, where the growth of the value W above 0.5 has decidedly negative iplications. In the case of clustering, one can achieve high efficiency of conducted analysis ost often in single experients only, with an additional 20 40% reduction in the saple size, although the iproveent was peranent for the data set seeds. Despite the results obtained for sets reduced by the PCA algorith, and the procedure for saple size reduction proposed here usually being significantly worse than the two-stage ethod investigated in this paper, a cobination of these two algoriths also sees to be a valuable concept in practical applications and is

14 146 P. Kulczycki and S. Łukasik broadly studied within a separate contribution (Łukasik and Kulczycki, 2013). In conclusion of this experiental study, let us illustrate the advantages of using the proposed algorith for additional ultidiensional dataset fro the UCI Machine Learning Repository. It concerns an iage segentation proble and consists of 2310 instances with 19 attributes each, representing 7 classes of textures. The execution of cluster analysis for PCA-reduced two feature set results in the average accuracy of 79.9%. For the sae feature set size, PFSA-based algorith reaches on average 82.3% of the Rand index value. The obtained cluster structure is illustrated on Fig Suary and final rearks The subject of this paper was a coplete algorith for reducing the diension and saple size, ready to be used in a wide range of exploratory data analysis probles. It constitutes a universal, unsupervised linear transforation of a features space, with the ai of best aintaining distances between saple eleents, additionally suppleented by a reduction in significance of those eleents whose locations in relation to the rest changed considerably. The foundation for this algorith is an innovative version of the parallel fast siulated annealing procedure, with the stop criterion based on order statistics, autoatic generation of initial teperature and a ultidiensional generator of pseudorando nubers with the Cauchy distribution. The saple size was reduced as a result of calculating weights for particular eleents, with the possibility of continuous adjustent of this procedure s intensity, through establishing an appropriate for an investigated proble value for the copression coefficient. Besides presenting the concept of the algorith and selected applicational properties, diverse nuerical experients were also included. They led to the conclusions regarding the recoended values of paraeters which can be fixed within this algorith, and also allowed its properties to be exained. In addition, during the research the investigated procedure was copared with the principle coponent ethod and a technique using evolutionary algoriths. Generally the obtained results of nuerical verification, conducted for both pseudorando and real data sets, confired the validity of the proposed concept as well as its any positive properties. Furtherore, the particular functional coponents of the procedure presented here can be applied in other tasks of inforation processing. Thus, the parallel fast siulated annealing algorith ay be used successfully in a wide range of optiization probles, due to its universal structure and relatively intuitive selection of arbitrarily assued paraeters. What is ore, the proposed procedure for reducing the saple size can be applied together with other, also nonlinear, strategies for diension reduction. Finally, it is worth stressing that, regardless of the calculational coplexity of the proposed algorith, as its execution tie is guided by the adaptive stop criterion it corresponds to the difficulty of the proble at hand. The use of linear transforation, which is easy to generalize, and the siple idea of a set of weights, akes it possible to use the investigated ethod effectively in a wide range of conteporary data analysis probles, also those involving data sets characterized by the influx of new saple eleents. Fig. 2. Result of cluster analysis for the two-diensional representation of the segentation dataset. Acknowledgent The study is co-funded by the European Union fro resources of the European Social Fund (project PO KL Inforation technologies: Research and their interdisciplinary applications, agreeent UDA-POKL /10-00). References Aarts, E., Korst, J. and van Laarhoven, P. (1997). Siulated annealing, in E. Aarts and J. Lenstra (Eds.), Local Search in Cobinatorial Optiization, Wiley, Chichester, pp Alba, E. (2005). Parallel Metaheuristics: A New Class of Algoriths, Wiley, New York, NY. Aswani Kuar, C. and Srinivas, S. (2006). Latent seantic indexing using eigenvalue analysis for efficient inforation retrieval, International Journal of Applied Matheatics and Coputer Science 16(4): Aswani Kuar, C. (2009). Analysis of unsupervised diensionality techniques, Coputer Science and Inforation Systes 6(2):

15 An algorith for reducing the diension and size of a saple for data exploration procedures 147 Table 8. Coparison of results of diension and saple size reduction for identification of outliers. R1 R2 R3 R4 R5 (300, 4, 0.1) (300, 4, 0.1) (500, 3, 0.1) (500, 4, 0.1) (500, 2, 0.1) Reduction of diension using FSA and saple size (W =0.9) I ored ±σ(i ored) ±0.00 ±0.49 ±1.91 ±1.35 ±1.59 el el el 100% Reduction of diension using PCA and saple size (approxiate size of reduced saple) I ored % Reduction of diension using PCA and saple condensation (approxiate size of reduced saple) I ored % Table 9. Coparison of results of diension and saple size reduction for clustering. W1 W2 W3 W4 W5 glass wine WBC vehicle seeds Reduction of diension using FSA and saple size (W =0.2) I cred ±σ(i cred) ±0.62 ±0.76 ±0.62 ±0.24 ±1.57 el 100% Reduction of diension using PCA and saple size (approxiate size of reduced saple) I cred ±σ(i cred) ±0.62 ±0.76 ±0.62 ±0.24 ±1.57 Reduction of diension using FSA and saple size (selected W ) I cred ±σ(i cred) ±1.69 ±2.08 ±2.03 ±0.40 ±0.62 W el 100% Azencot, R. (1992). Siulated Annealing: Parallelization Techniques, Wiley, New York, NY. Bartenhagen, C., Klein, H.-U., Ruckert, C., Jiang, X. and Dugas, M. (2010). Coparative study of unsupervised diension reduction techniques for the visualization of icroarray gene expression data, BMC Bioinforatics 11, paper no Bartkuté, V. and Sakalauskas, L. (2009). Statistical inferences for terination of Markov type rando search algoriths, Journal of Optiization Theory and Applications 141(3): Ben-Aeur, W. (2004). Coputing the initial teperature of siulated annealing, Coputational Optiization and Applications 29(3): Borg, I. and Groenen, P. (2005). Modern Multidiensional Scaling. Theory and Applications, Springer-Verlag, Berlin. Caastra, F. (2003). Data diensionality estiation ethods: Asurvey,Pattern Recognition 36(12): Charytanowicz, M., Niewczas, J., Kulczycki, P., Kowalski, P., Łukasik, S. and Żak, S. (2010). Coplete gradient clustering algorith for features analysis of x-ray iages, in E. Piątka and J. Kawa (Eds.), Inforation Technologies in Bioedicine, Vol. 2, Springer-Verlag, Berlin, pp Cortez, P., Cerdeira, A., Aleida, F., Matos, T. and Reis, J. (2009). Modeling wine preferences by data ining fro physicocheical properties, Decision Support Systes 47(4): Cox, T. and Cox, M. (2000). Multidiensional Scaling, Chapan and Hall, Boca Raton, FL. Cunningha, P. (2007). Diension reduction, Technical report, UCD School of Coputer Science and Inforatics, Dublin. Czarnowski, I. and Jędrzejowicz, P. (2011). Application of agent-based siulated annealing and tabu search procedures to solving the data reduction proble, Inter-

16 148 P. Kulczycki and S. Łukasik Table 10. Coparison of results of diension and saple size reduction for classification. W1 W2 W3 W4 W5 glass wine WBC vehicle seeds Reduction of diension using FSA and saple size (W =0.1) I kred ±σ(i kred ) ±7.02 ±4.86 ±1.43 ±2.66 ±3.18 el 100% Reduction of diension using PCA and saple size (approxiate size of reduced saple) I kred ±σ(i kred ) ±8.90 ±8.08 ±1.46 ±3.79 ±5.91 el 100% Reduction of diension using FSA and saple size (selected W ) I kred ±σ(i kred ) ±8.46 ±4.75 ±1.48 ±2.63 ±3.95 el W % national Journal of Applied Matheatics and Coputer Science 21(1): 57 68, DOI: /v David, H. and Nagaraja, H. (2003). Order Statistics, Wiley, New York, NY. Deng, Z., Chung, F.-L. and Wang, S. (2008). FRSDE: Fast reduced set density estiator using inial enclosing ball approxiation, Pattern Recognition 41(4): François, D., Wertz, V. and Verleysen, M. (2007). The concentration of fractional distances, IEEE Transactions on Knowledge and Data Engineering 19(7): Gean, S. and Gean, D. (1984). Stochastic relaxation, Gibbs distribution and the Bayesian restoration in iages, IEEE Transactions on Pattern Analysis and Machine Intelligence 6: Gendreau, M. and Potvin, J.-Y. (2010). Handbook of Metaheuristics, Springer, New York, NY. Han, J. and Kaber, M. (2006). Data Mining: Concepts and Techniques, Morgan Kaufann, San Francisco, CA. Ingber, L. (1996). Adaptive siulated annealing (ASA): Lessons learned, Control and Cybernetics 25(1): Inza, I., Larranaga, P., Etxeberria, R. and Sierra, B. (2000). Feature subset selection by Bayesian network-based optiization, Artificial Intelligence 123(1 2): Ishibuchi, H., Nakashia, T. and Murata, T. (2001). Three-objective genetics-based achine learning for linguistic rule extraction, Inforation Sciences 136(1 4): Kerdprasop, K., Kerdprasop, N. and Sattayatha, P. (2005). Weighted k-eans for density-biased clustering, in A.Tjoa and J. Trujillo (Eds.), Data Warehousing and Knowledge Discovery, Lecture Notes in Coputer Science, Vol. 3589, Springer-Verlag, Berlin pp Kulczycki, P. (2005). Kernel Estiators in Syste Analysis, WNT, Warsaw, (in Polish). Kulczycki, P. (2008). Kernel estiators in industrial applications, in B. Prasad (Ed.), Soft Coputing Applications in Industry, Springer-Verlag, Berlin, pp Kulczycki, P. and Charytanowicz, M. (2010). A coplete gradient clustering algorith fored with kernel estiators, International Journal of Applied Matheatics and Coputer Science 20(1): , DOI: /v Kulczycki, P. and Kowalski, P. (2011). Bayes classification of iprecise inforation of interval type, Control and Cybernetics 40(1): Kulczycki, P. and Łukasik, S. (2014). Reduction of diension and size of data set by parallel fast siulated annealing, in L.T. Koczy, C.R. Pozna, R. Claudiu and J. Kacprzyk (Eds.), Issues and Challenges of Intelligent Systes and Coputational Intelligence, Springer-Verlag, Berlin, pp Kuo, Y. (2010). Using siulated annealing to iniize fuel consuption for the tie-dependent vehicle routing proble, Coputers & Industrial Engineering 59(1): Łukasik, S. and Kulczycki, P. (2011). An algorith for saple and data diensionality reduction using fast siulated annealing, in J. Tang, I. King, L. Chen and J. Wang (Eds.), Advanced Data Mining and Applications, Lecture Notes in Coputer Science, Vol. 7120, Springer-Verlag, Berlin, pp Łukasik, S. and Kulczycki, P. (2013). Using topology preservation easures for ultidiensional intelligent data analysis in the reduced feature space, in L.Rutkowski, M. Korytkowski, R. Scherer, R. Tadeusiewicz, L. Zadeh and J. Zurada (Eds.), Artificial Intelligence and Soft Coputing, Lecture Notes in Coputer Science, Vol. 7895, Springer-Verlag, Berlin, pp Maaten, van der, L. (2009). Feature Extraction fro Visual Data, Ph.D. thesis, Tilburg University, Tilburg.

17 An algorith for reducing the diension and size of a saple for data exploration procedures 149 Mangasarian, O. and Wolberg, W. (1990). Cancer diagnosis via linear prograing, SIAM News 23(5): Mitra, P., Murthy, C. and Pal, S. (2002). Density-based ultiscale data condensation, IEEE Transactions on Pattern Analysis and Machine Intelligence 24(6): Na, D., Lee, J.-S. and Park, C. (2004). n-diensional Cauchy neighbor generation for the fast siulated annealing, IEICE Transactions on Inforation and Systes E87- D(11): Oliveira, J. and Pedrycz, W. (Eds.) (2007). Advances in Fuzzy Clustering and Its Applications, Wiley, Chichester. Pal, S. and Mitra, P. (2004). Pattern Recognition Algoriths for Data Mining, Chapan and Hall, Boca Raton, FL. Parvin, H., Alizadeh, H. and Minati, B. (1971). Objective criteria for the evaluation of clustering ethods, Journal of the Aerican Statistical Association 66(336): Parvin, H., Alizadeh, H. and Minati, B. (2010). A odification on k-nearest neighbor classifier, Global Journal of Coputer Science and Technology 10(14): Sait, S. and Youssef, H. (2000). Iterative Coputer Algoriths with Applications in Engineering: Solving Cobinatorial Optiization Probles, IEEE Coputer Society Press, Los Alaitos, CA. Saon, J. (1969). A nonlinear apping for data structure analysis, IEEE Transactions on Coputers 18(5): Saxena, A., Pal, N. and Vora, M. (2010). Evolutionary ethods for unsupervised feature selection using Saon s stress function, Fuzzy Inforation and Engineering 2(3): Strickert, M., Teichann, S., Sreenivasulu, N. and Seiffert, U. (2005). DIPPP online self-iproving linear ap for distance-preserving data analysis, 5th Workshop on Self- Organizing Maps, WSOM 05, Paris, France, pp Sui, S.M., Zaan, M.F. and Hirose, H. (2012). A rainfall forecasting ethod using achine learning odels and its application to the Fukuoka city case, International Journal of Applied Matheatics and Coputer Science 22(4): , DOI: /v Szu, H. and Hartley, R. (1987). Fast siulated annealing, Physics Letters A 122(3 4): Tian, T., Wilcox, R. and Jaes, G. (2010). Data reduction in classification: A siulated annealing based projection ethod, Statistical Analysis and Data Mining 3(5): UC Irvine Machine Learning Repository (2013). Vanstru, M. and Starks, S. (1981). An algorith for optial linear aps, Southeastcon Conference, Huntsville, AL, USA, pp Wand, M. and Jones, M. (1995). Kernel Soothing, Chapan and Hall, London. Wilson, D. and Martinez, T. (2000). Reduction techniques for instance-based learning algoriths, Machine Learning 38(3): Xu, R. and Wunsch, D. (2009). Clustering, Wiley, Hoboken, NJ. Zhigljavsky, A. and Žilinskas, A. (2008). Stochastic Global Optiization, Springer-Verlag, Berlin. Piotr Kulczycki is a professor at the Systes Research Institute of the Polish Acadey of Sciences and the head of its Center for Inforation Technology for Data Analysis Methods, as well as at the Cracow University of Technology, where he is the head of the Departent of Autoatic Control and Inforation Technology. The field of his scientific activity to date covers applicational aspects of inforation technology as well as data ining and analysis, ostly connected with the use of odern statistical ethods and fuzzy logic in diverse issues of conteporary systes research and control engineering. Szyon Łukasik is an assistant professor both at the Systes Research Institute of the Polish Acadey of Sciences and at the Departent of Autoatic Control and Inforation Technology of the Cracow University of Technology. His scientific interests focus on various aspects of data analysis, nature-inspired algoriths, social networks and innovative applications of collective intelligence in distributed knowledge discovery. Received: 6 April 2013 Revised: 10 Noveber 2013