Generative Probability Density Model in the Self-Organizing Map
Generative Probability Density Model in the Self-Organizing Map

Jouko Lampinen and Timo Kostiainen
Laboratory of Computational Engineering, Helsinki University of Technology, P.O. Box 9400, FIN-02015 ESPOO, FINLAND

Abstract. The Self-Organizing Map, SOM, is a widely used tool in exploratory data analysis. A theoretical and practical challenge in the SOM has been the difficulty of treating the method as a statistical model fitting procedure. In this chapter we give a short review of statistical approaches to the SOM. Then we present the probability density model for which the SOM training gives the maximum likelihood estimate. The density model can be used to choose the neighborhood width of the SOM so as to avoid overfitting and to improve the reliability of the results. The density model also gives tools for systematic analysis of the SOM. A major application of the SOM is the analysis of dependencies between variables. We discuss some difficulties in the visual analysis of the SOM and demonstrate how quantitative analysis of the dependencies can be carried out by calculating conditional distributions from the density model.

1 Introduction

The self-organizing map, SOM, is a widely used tool in data mining, visualization of high-dimensional data, and analysis of relations between variables. For a review of SOM applications see other chapters in this volume, and []. The most characteristic property of the SOM algorithm [8] is the preservation of topology, that is, the neighborhood relationships of the input data are maintained in the mapping. A large part of the theoretical work on the SOM has focused on the definition and quantification of topology preservation, but a mathematically rigorous treatment is not yet complete. See [6] for an up-to-date discussion of topology preservation in the SOM.

The roots of the SOM are in simplified models of the self-organization process in biological neural networks [7]. In related engineering problems the SOM offers considerable potential, for example in the automatic formation of categories in larger artificial neural systems. Currently, a very active application domain of the SOM is exploratory data analysis, where a database is searched for any phenomena that are important in the studied application. In normal statistical data analysis there is usually a set of hypotheses that are validated in the analysis, while in exploratory data analysis the hypotheses are generated from the data in a data-driven exploratory phase and validated in a confirmatory phase.
The SOM is mainly used in the exploratory phase, by visually searching for potentially dependent variables. There may be some problems where the exploratory phase alone is sufficient, such as visualization of data without more quantitative statistical inference upon it. However, in practical data analysis problems the found hypotheses need to be validated with well understood methods, in order to assess the confidence of the conclusions and to reject those that are not statistically significant.

When using the SOM in data analysis, an obvious criterion for model selection should be generalization of the conclusions to new data, just as it is for any other statistical method. The preservation of topology is also important, to facilitate the visual analysis by grouping similar states to neighboring map units, but if the positions of the map units are not statistically reliable the map is useless for any generalizing inference.

In this chapter we present the SOM as a probability density estimation method, in contrast to the standard view of the SOM as a method for mapping high-dimensional data vectors to a lower dimensional space. There are several benefits in associating a probability density, or a generative model, with a mapping method (see [4] for discussion of a generative model for the PCA mapping):

- The density model enables computation of the likelihood of any data sample (training data or test data), facilitating statistical testing and comparison with other density estimation techniques.
- The hyperparameters of the model (e.g., the width of the neighborhood) can be chosen with standard methods, such as cross-validation, to avoid overfitting, in the same way as with other statistical methods.
- The density model facilitates quantitative analysis of the model, for example, by computing conditional densities to test the visually found hypotheses.

In principle, Bayesian methods could be used for model complexity control and model comparison (see [] for a review of the Bayesian approach to neural networks). However, as shown later, the normalization of the probability density in the original SOM requires a numerical procedure that seems to render the Bayesian approach impractical.

The organization of this chapter is the following. In section 2 we discuss the problem of finding dependencies between variables using visual inspection of the SOM, to demonstrate the need for quantitative analysis tools for the SOM. In section 3 we briefly review some results related to the existence of an error function in the SOM. The SOM algorithm is not defined in terms of an error function, but directly via the training rule, and unfortunately the training rule is not the gradient of any global error function [5]. This makes exact mathematical analysis of the SOM algorithm fairly difficult. For a discrete data sample the algorithm may converge to a local minimum of an error function [3], which may exist only in a small volume in the parameter space (the error function changes if the best-matching unit of any data sample changes).
In section 3 we briefly review the results about the existence of error functions in the SOM and some modifications that make the error function exist more generally. The probability density model in the SOM, derived in this chapter, consists of kernels of non-regular shape, whose positions are weighted averages over the neighboring units' receptive fields, and thus the model is close to many mixture models where the kernels are confined to a low dimensional latent space. In section 4 we review some constrained mixture models that are similar to the SOM. In section 5 we derive the exact probability density model for which the converged state of the SOM training gives the maximum likelihood estimate. In section 6 we discuss the selection of the SOM hyperparameters to avoid overfitting, and demonstrate how quantitative analysis can be carried out with the aid of the probability density model. In section 7 we present conclusions and point out some directions for further study.

2 SOM and Dependence Between Variables

In practical data analysis problems a common task is to search for dependencies between variables. Statistical dependence means that the conditional distribution of a variable depends on the values of other (explanatory) variables, and thus the analysis of dependencies is closely related to the estimation of the probability density or conditional probability densities. In regression analysis the goal is to estimate the dependence of the conditional mean of the target variable on the explanatory variables, using, for example, standard least squares fitting of neural network outputs to the targets. In real data analysis problems, the shape of the conditional distribution needs to be considered also in regression models, by means of, e.g., error bars or confidence intervals, in order to assess the statistical significance of the dependence of the conditional mean on the explanatory variables.

The simplest goal is to look for pairwise dependencies, where a variable is assumed to depend only on one other variable. For such a problem the advantage of the SOM is rather marginal, as simple correlation analysis is sufficient in the linear case, and in the non-linear case there exist plenty of methods for directly estimating the conditional density, and thus the dependencies, in such a low dimensional case (see e.g. [] for a review). The tough problem in exploratory data analysis is to search for non-linear dependencies between multiple variables.

With the SOM, the analysis of dependencies is based on visual inspection of the SOM structure. Several visualization methods have been developed for interpreting the SOM, see, e.g., [4]. The basic procedure is to visually search for regions on the map where the values of two or more variables coincide, e.g., have large or small values in the same units.
Such a region is interpreted as a hypothesis that the variables are dependent in such a way that, for example, a low value for one variable is an indication of a low value for the other variable, given that the rest of the variables are close to the corresponding values in the reference vectors. Clearly, efficient visualization methods are necessary, as the number of variable pairs is proportional to the square of the number of variables, and there may be dozens of distinguishable regions in the SOM.

It is very important to notice that any conclusions drawn from models overfitted to the data sample are not guaranteed to generalize to any other situation. In the case of the SOM, overfitting means that the reference vectors of some units have been determined by too few data points, so that the reference vectors are not representative of the underlying probability density. Any conclusions based on such units are prone to fail for new data, and thus the analysis of statistical dependencies requires some way, heuristic or more disciplined, to avoid overfitting, as in all statistical modeling. On the other hand, it should be noted that when the SOM is used for analyzing the whole population, and measurement errors are considered negligible, there is no need to generalize the conclusions to other data, and overfitting is not an issue. This important distinction between analyzing the population, and analyzing a sample from the population and generalizing the conclusions to the population, often seems to be ignored in the SOM framework.

The main problem in visual inspection of the SOM is that, in general, the lack of dependence between variables is difficult to observe visually from the SOM. That is, even if two variables, say x_1 and x_2, both have high values at a map unit M_ij, that alone does not show that the variables have any mutual dependence. As a simple example, consider the two-dimensional uniform distribution x_1, x_2 ~ U(0, 1). A SOM with zero neighborhood would have component planes (in any order of the columns and rows) in which one variable is constant along the rows and the other along the columns, e.g.,

M_1 = [ low  low ; high  high ],    M_2 = [ low  high ; low  high ].

The coincidence of high values in M_1 and low values in M_2 in some units is only a result of the vector quantization. To see that high values in M_1 do not indicate dependence between x_1 and x_2, one must observe that a high value for x_1 occurs also in units with a low value for x_2 (i.e., tallied over the map, a high value for x_1 occurs with both high and low values of x_2). In a high dimensional space the visual inspection of the dependencies becomes more difficult, as the map folds into the data space and the range of values of each variable is distributed around the map. Fig. 1 illustrates this for random data with no dependencies between the variables, and Fig. 2 shows an example from a real data analysis project, where all hypotheses were later rejected in careful analysis.
Fig. 1. Example of a SOM trained on purely random data. The independence of the variables in the component level display is not trivial to observe. One might, for example, erroneously conclude that high values of x_3 would indicate low values of some other variable. Here the neighborhood is trained down to zero.

Fig. 2. Example of real data analysis (component planes: Allergic history, Relative humidity, Enthalpy, Air freshness, Ergonomics, Carbon dioxide). In the case study, the dependence of Air freshness on the other variables was investigated. In the final analysis all hypotheses were rejected using methods like RBF models, Bayesian neural networks, hierarchical generalized linear models using Bayesian inference, etc. [17]. One evident conclusion, from the lower right corner of the map, is that a high value for the variable Ergonomics appears only with a low value for Air freshness, but careful analysis showed that this was just an effect of the vector quantization.

3 Error functions in the SOM

The converged state of the SOM is a local minimum of the error function which is given by [3]

E(X) = \sum_{n=1}^{N} \sum_{j=1}^{M} H_{bj} \|x_n - m_j\|^2,    (1)

where X = {x_n}, n = 1, ..., N, is the discrete data sample, j is the index (or position) of a unit in the SOM with reference vector m_j, and H_{bj} = H(b(x_n), j) is the neighborhood function, b(x) being the index of the best matching unit for x. The error function is not defined at the boundaries of the receptive fields, or Voronoi cells, so the function does not exist in the continuous case. In the case of a discrete data set, the probability of any sample lying at a boundary is zero, so in practice the error function can always be computed for any data set. The error function changes if any sample changes its best matching unit. That is why the error function is only consistent with the SOM training rule when the algorithm has converged. In practical data analysis the data set is always discrete and the algorithm is allowed to converge, so analysis of the error function is thereby justified.
The SOM training rule involves assigning to each data sample one reference vector, the best matching unit b(x). The matching criterion is the Euclidean distance:

b(x) = \arg\min_i \|x - m_i\|.    (2)

When a sample is near a boundary of two or more receptive fields, a small change in the position of one reference vector can change the best matching unit of that sample. Hence the gradient of the error function with respect to the reference vectors is infinite. The error function (1) can be thought of as the sum of the distortion function

D(x) = \sum_j H_{bj} \|x - m_j\|^2    (3)

over the data set. The distortion function is also discontinuous at the Voronoi cell boundaries. The discontinuity is a result of the winner selection rule of the training algorithm. Luttrell [11] has shown that exact minimization of Eq. 1 leads to an approximation of the original training rule, where, instead of the nearest neighbor winner rule, the best matching unit is taken to be the one that minimizes the value of the distortion function (3):

b(x) = \arg\min_i \sum_j H_{ij} \|x - m_j\|^2.    (4)

The minimum distortion rule avoids many theoretical problems associated with the original rule, without compromising any desirable properties of the SOM except for an increase in the computational burden [6]. The gradient of the error function becomes continuous at the boundaries of the receptive fields (which are no longer the same as the Voronoi tessellation). The distortion function with the modified winner selection rule is also continuous across the unit boundaries.
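To make these definitions concrete, below is a minimal NumPy sketch of the two winner selection rules (Eqs. 2 and 4) and the error function (Eq. 1). It is an illustration written for this text, not the authors' implementation; the function names and the dense M x M neighborhood matrix H are assumptions of the sketch.

```python
import numpy as np

def nearest_winner(x, m):
    """Original winner rule (Eq. 2): index of the reference vector closest to x."""
    return int(np.argmin(np.sum((m - x) ** 2, axis=1)))

def min_distortion_winner(x, m, H):
    """Minimum distortion rule (Eq. 4): unit i minimizing sum_j H[i, j] * ||x - m_j||^2."""
    d2 = np.sum((m - x) ** 2, axis=1)      # squared distances from x to all M units
    return int(np.argmin(H @ d2))          # H is the M x M neighborhood matrix

def som_error(X, m, H):
    """Error function (Eq. 1), using the nearest neighbor winner rule."""
    total = 0.0
    for x in X:
        b = nearest_winner(x, m)
        total += H[b] @ np.sum((m - x) ** 2, axis=1)
    return total
```

With H equal to the identity matrix the two rules coincide, since the weighted distortion then reduces to the squared distance to a single unit.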
4 Constrained Mixture Models

The main aspects of the SOM algorithm that make the analysis difficult are 1) the hard assignment of the input samples to the nearest units, which makes the receptive fields irregularly shaped according to the Voronoi tessellation, and 2) the regularizing effect, defined as updating the parameters of the neighboring units towards each input, instead of regularization applied directly to the positions of the reference vectors.

It is worth noticing that the first issue is due to the shortcut algorithm [7] devised to speed up the computation and to enhance the organization of the map from a completely random initial state, and is not a characteristic of the assumed model for self-organization in biological networks. The original on-center off-surround lateral feedback mechanism in [7] produces a possibly multimodal pattern of activity on the map, which was approximated by a single activity bubble around the best-matching unit, BMU, with the shape of the neighborhood function. An equal Hebbian learning rule in each unit then produces the concerted updating towards the input in the neighborhood of the BMU. An interpretation of the mapping of an input data point onto the SOM lattice that would be consistent with the issues above, and with the minimum distortion rule in Eq. 4, would thus be that a point in the input space is mapped to an activity bubble H_{ib} around the BMU b, rather than to a single unit. By ignoring the winner-take-all mechanism the SOM can be approximated by a kernel density estimator, where the activity of a unit depends only on the match between the data point and the unit's reference vector. This is often called soft assignment of data samples to the map units, in contrast to the hard assignment of a data point to only the best-matching unit.

The second characteristic of the SOM training, the way of constraining the unit positions, is dictated by the biological origin of the method. A regularizer directly on the reference vector positions would require a means for the neurons to update their weights towards the weights of the neighboring neurons, while in the SOM rule all learning is towards the input data. The biological plausibility is obviously not relevant in data analysis applications, even though it may have a role in building larger neural systems with the SOM as a building block.

In the approach taken by Utsugi [15], small approximations are made to render the model more easily analyzable: the winner-take-all rule is replaced by soft assignment, and the neighborhood effect is approximated by a smoothing prior directly on the reference vector positions. The model is then a Gaussian mixture model with kernels constrained by the smoothing prior. This approach yields a very efficient way to set the hyperparameters of the model, that is, the widths of the kernels and the weighting coefficient of the smoothing prior, by an empirical Bayesian approach. For any values of the hyperparameters, the evidence, or conditional marginal probability of the values given the data and the priors, can be computed by integrating over the posterior probability of the model parameters (the kernel positions). The values with the maximum evidence are then chosen as the most likely values. Actually, a proper Bayesian approach would be to integrate over the posterior distribution of the hyperparameters (see [] for a discussion), but clearly the empirical Bayes approach is a notable advance in the SOM theory.

Another model close to the SOM is the Generative Topographic Mapping [1]. In that approach, the Gaussian mixture density model is constrained by a nonlinear mapping from a regularly organized distribution in a latent space to the component centroids in the data space. Hyperparameters of the model, which control the noise variance, the stiffness of the nonlinear mapping and the prior distribution of the mapping parameters, can be optimized using a Bayesian evidence approximation [3], similar to the one used by Utsugi.
5 Probability density model in the self-organizing map

In this section we derive the probability density model for the original SOM, with no approximations in the effect of the neighborhood or in the posterior probability of the units given an input sample (the activity of the units). The density model is based on the mean-square type error function (1) discussed in section 3. The error function is specific to the given neighborhood parameters, so it cannot be used directly to compare maps which have different neighborhoods.

The maximum likelihood (ML) estimate is based on maximizing the likelihood of the data given the model. We wish to find a likelihood function which is consistent with the error function. This can be achieved by making the error function proportional to the negative logarithm of the likelihood of the data. Assuming the training samples x_n independent, the likelihood of the training set X = {x_n}, n = 1, ..., N, is the product of the probabilities of the samples,

p(X | m, H) = \prod_n p(x_n | m, H),    (5)

where m denotes the codebook (the set of reference vectors) and H is the neighborhood. The negative log-likelihood is L = -\log p(X | m, H), and setting it proportional to Eq. 1 yields

p(X | m, H) = Z^{-N} \exp(-\beta E) = Z^{-N} \exp(-\beta \sum_n \sum_j H_{bj} \|x_n - m_j\|^2).    (6)

Here we have introduced two constants, Z and \beta, which are not needed in the ML estimate of the codebook m but which are necessary for the complete density model. The probability density function in Eq. 6 is given by

p(x | m, H) = Z^{-1} \exp(-\beta \sum_j H_{bj} \|x - m_j\|^2),    (7)

which is a product of Gaussian densities centered at the m_j, whose variances are inversely proportional to the neighborhood function values H_{bj}. Note that the discontinuity of the density is due to the discontinuity of the best-matching unit index b for the input x. Inside a Voronoi cell, or the receptive field of unit m_b, the density function has Gaussian form:

p(x | x \in V_b) = Z^{-1} e^{-\beta W_b} \exp(-\frac{1}{2 s_b} \|x - \mu_b\|^2),    (8)

where V_b denotes the Voronoi cell around the unit m_b. The position and the variance of the kernel are denoted by \mu_b and s_b, respectively, and W_b is a weighting coefficient. The values of the parameters are

\mu_b = \sum_j H_{bj} m_j / \sum_j H_{bj},    (9)

s_b = 1 / (2 \beta \sum_j H_{bj}),    (10)

W_b = \sum_j H_{bj} \|m_j - \mu_b\|^2.    (11)
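As an illustration of Eqs. 9-11, the following NumPy sketch computes the kernel parameters of every unit. It assumes the variance convention used above, s_b = 1/(2 beta sum_j H_bj); the function and variable names are illustrative only, not the authors' code.

```python
import numpy as np

def kernel_parameters(m, H, beta):
    """Kernel parameters of Eqs. 9-11 for every unit b.

    m    : (M, d) array of reference vectors
    H    : (M, M) neighborhood matrix, H[b, j] = H_bj
    beta : inverse noise-variance parameter
    """
    Hsum = H.sum(axis=1, keepdims=True)                  # sum_j H_bj, shape (M, 1)
    mu = (H @ m) / Hsum                                  # Eq. 9: neighborhood-weighted means
    s = 1.0 / (2.0 * beta * Hsum[:, 0])                  # Eq. 10: kernel variances
    diff2 = np.sum((m[None, :, :] - mu[:, None, :]) ** 2, axis=2)  # ||m_j - mu_b||^2
    W = np.sum(H * diff2, axis=1)                        # Eq. 11: height-controlling weights
    return mu, s, W
```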
The density model consists of cut Gaussian kernels, which are centered at the neighborhood-weighted means of the reference vectors and clipped by the Voronoi cell boundaries. The parameter W_b controls the height of the kernel; it depends on the density of the neighboring reference vectors near the centroid \mu_b. The density function is not continuous at the boundaries of the Voronoi cells. See Figs. 3, 4 and 5 for examples of the density models. The variances of the kernels depend on the parameter \beta and they are equal if the neighborhood is normalized (see section 6.1 for further discussion). In the standard SOM formulation, the border units with incomplete neighborhoods have larger variances, as can be seen in Figs. 5 and 8, allowing the map to shrink into the middle of the training data distribution.

The normalizing constant Z and the noise variance parameter \beta are bound together by the constraint that the integral of the density over the data space must equal one. That integral can be written as

\int p(x) dx = Z^{-1} \sum_r e^{-\beta W_r} \int_{V_r} \exp(-\frac{1}{2 s_r} \|x - \mu_r\|^2) dx = 1,    (12)

where the integration over the data space is decomposed into a sum of integrals over the Voronoi cells. The integrals cannot be computed in closed form, but they can be approximated numerically using Monte Carlo sampling. A simple way to do this is the following algorithm:

1. For each cell r, draw L samples from the normal distribution N(\mu_r, s_r).
2. Compute q_r = L_r / L, the fraction of samples that are inside the cell r.
3. The integral over V_r in Eq. 12 then equals q_r (2\pi s_r)^{d/2}, where d is the dimension of the data space.

For a map that contains M units, this algorithm requires the computation of distances between M L samples and the M reference vectors. Thus if M is large, the computational cost of the normalization procedure exceeds that of the training algorithm itself. In an efficient implementation the number of samples L should be chosen according to the desired accuracy.
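The normalization algorithm can be sketched as follows. This is an illustrative Monte Carlo estimate of Z assuming the kernel parameters of Eqs. 9-11 and Voronoi cells defined by the nearest reference vector; it is not the authors' Matlab implementation.

```python
import numpy as np

def normalization_constant(m, mu, s, W, beta, L=1000, rng=None):
    """Monte Carlo estimate of Z from the constraint that the density integrates to one (Eq. 12).

    m        : (M, d) reference vectors, defining the Voronoi cells
    mu, s, W : kernel parameters of Eqs. 9-11
    """
    rng = np.random.default_rng() if rng is None else rng
    M, d = m.shape
    Z = 0.0
    for r in range(M):
        # 1. draw L samples from N(mu_r, s_r I)
        samples = mu[r] + np.sqrt(s[r]) * rng.standard_normal((L, d))
        # 2. fraction of samples whose nearest reference vector is m_r, i.e. inside V_r
        d2 = np.sum((samples[:, None, :] - m[None, :, :]) ** 2, axis=2)
        q_r = np.mean(np.argmin(d2, axis=1) == r)
        # 3. contribution of cell r to the sum in Eq. 12
        Z += np.exp(-beta * W[r]) * q_r * (2.0 * np.pi * s[r]) ** (d / 2.0)
    return Z
```

Given Z, the likelihood of a validation set can be evaluated and \beta optimized numerically, as discussed below.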
Fig. 3. Example of the density model in a 3 x 8 SOM. Top: training data and the resulting SOM lattice, using a Gaussian neighborhood. Middle: the density model of the SOM. Bottom: zoomed part of the density model above, with contours and Voronoi cell boundaries added. From the figure it is clear how the Gaussian kernels are located at the neighborhood-weighted averages of the reference vectors.
Fig. 4. Training data and density estimates for different SOM topologies. L denotes the negative log-likelihood of test data. The optimal value for the parameter \beta is so small that the model is close to a normal Gaussian mixture. Among these alternatives, the 3 x 6 topology (middle) produces the best model judging by the likelihood criterion.

The acceptance ratio q_r varies in a large range according to the neighborhood size. When the neighborhood is small, the neighborhood-weighted center \mu_r is close to the reference vector m_r and s_r is likely to be small, so q_r is high. When the neighborhood is large the situation is the opposite, and to achieve an equivalent accuracy L has to be much greater. A detailed analysis of the dependence of the accuracy on L is presented in Appendix A.

The maximum likelihood estimate for \beta can only be found by numerical optimization, by maximizing the likelihood of validation data. It is worth noting that when a numerical method such as bisection search is applied, savings can be made by allowing the accuracy to vary. Initial estimates can be very coarse, corresponding to a small L, if the accuracy gradually increases towards the convergence of the search. The final accuracy should reflect the size of the validation data sample.

By equating the partial derivative of the likelihood function, \partial p(X) / \partial \beta, with zero, an interpretation of the maximum likelihood solution \beta_{ml} can be found in terms of the neighborhood-weighted distortion function D(x) (3):

\frac{1}{N} \sum_{n=1}^{N} D(x_n) - \frac{\int D(x) \exp(-\beta_{ml} D(x)) dx}{\int \exp(-\beta_{ml} D(x)) dx} = 0.    (13)

Observe that the estimated input distribution is \hat{p}(x | \beta, H) \propto \exp(-\beta D(x)). Heuristically, Eq. 13 says that, at the ML estimate with \beta equal to \beta_{ml}, the mean value of D(x) over the estimated input distribution equals the sample average of D(x_n) over the input data x_n, n = 1, ..., N.

6 Model selection

The SOM algorithm produces a model of the input data. The complexity of this model is determined by the number of units and the width of the neighborhood, which has a regularizing effect on the model.
When the input data is a sample from a larger population, the objective is to choose the complexity such that the model generalizes as well as possible to new samples from that population. See [9] for discussion and examples of overfitting of the SOM model. The likelihood function provides a consistent way to compare the goodness of different models. In this section we discuss how this can be used for model selection in the self-organizing map.

Let us first regard the number of units as given, so that the neighborhood width \sigma is the sole control parameter. The density model allows us to select the neighborhood width \sigma by maximizing the likelihood of the data, p(X | m, H). In the course of SOM training, \sigma is gradually decreased in some pre-specified manner, i.e. \sigma = \sigma(t), t = 1, ..., K; \sigma(t+1) < \sigma(t). We trust that the training algorithm will find an ML estimate for the map codebook at each value of the neighborhood width \sigma(t), if it is allowed to converge every time. To construct the density model for each of these K candidate maps, we numerically optimize \beta(t) as described in the previous section. This yields K different density models to compare. To choose between these we compute the likelihood values p(X_V | m(t), \sigma(t), \beta_{ml}(t)) for validation data X_V (which should ideally be different from the data used to select \beta_{ml}(t)). Cross-validation can also be applied. An example of model selection is shown in Fig. 5, where the map maximizing the likelihood of validation data is chosen. This approach extends directly to the comparison of maps of different sizes as well as different topologies (see Fig. 4).

If one wishes to have a large map, it may be advisable to ease the computational requirements by finding the correct \sigma for a smaller map first and then simply scaling it up in proportion to the dimensions of the maps. (For example, if \sigma_{KL} is the optimal neighborhood width for a K x L map, then 5\sigma_{KL} is probably a reasonable value for a 5K x 5L map.)

Because the exact value of the density function cannot be computed in closed form, it is difficult to apply methods such as Bayesian evidence to parameter selection. If the values of the function itself are approximations, then the derivatives will be even more inaccurate, and due to the numerical normalization procedure the approach would be computationally too expensive in practice.

A common application of the SOM is to look for dependencies between variables by visual inspection. In that context, the density model can be used to select the complexity of the model, but it also enables quantitative analysis. Regression or conditional expectations can be computed directly from the joint density in Eq. 7 by numerical integration. For example, the conditional distribution of variable x_j equals

p(x_j | x_{\setminus j}, m, H) = \frac{p(x | m, H)}{\int p(x | m, H) dx_j},    (14)

where x_{\setminus j} denotes the vector x with element j excluded. Likewise, the regression of x_j on the other variables can be computed as the conditional mean E[x_j | x_{\setminus j}, m, H].
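A hedged sketch of Eq. 14: the conditional density of one variable is obtained by evaluating the joint density on a grid and normalizing numerically. Here joint_density stands for any density estimate, for example the SOM density of Eq. 7 with its numerically determined Z; the grid-based integration is an illustrative choice, not the authors' procedure.

```python
import numpy as np

def conditional_density(joint_density, x_rest, j, grid):
    """Eq. 14: p(x_j | x_rest) on a grid, by normalizing the joint density numerically.

    joint_density : callable returning p(x) for a full d-dimensional point x
    x_rest        : fixed values of all variables except x_j (length d-1)
    j             : index of the free variable
    grid          : 1-D array of x_j values at which the density is evaluated
    """
    vals = np.array([joint_density(np.insert(np.asarray(x_rest, float), j, g)) for g in grid])
    p = vals / np.trapz(vals, grid)          # normalize: divide by the integral over x_j
    mean = np.trapz(grid * p, grid)          # conditional mean E[x_j | x_rest]
    return p, mean
```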
Fig. 5. SOM density models for four different widths \sigma of the Gaussian neighborhood (\sigma = 8.3, 4.9, and two smaller values; the panel labels also give \beta and the negative log-likelihood L of validation data, per sample). From the total likelihood of validation data the optimal neighborhood can be chosen to avoid overfitting.

It should be noted that the SOM density model may not give the best possible description of the input distribution. We have included this discussion here so as to illustrate the value of model selection.

Reducing the variance parameter to zero, \beta \to \infty, gives an important special case. The conditional density is then sharply peaked at the values of the outputs x_j in the best matching unit for the inputs x_{\setminus j}. The conditional mean E[x_j | x_{\setminus j}] then gives the same value as nearest neighbor (NN) regression with the SOM reference vectors, with the neighborhood-weighted reference vectors (Eq. 9) as output values, producing a piecewise constant estimate. Comparison with the NN rule is interesting, because it is a close quantitative counterpart of the popular visual analysis of the SOM.

Fig. 6 illustrates the difference between computing the conditional mean from the density model and using the nearest neighbor rule.
A random 3D data set (~ N(0, I)) is analyzed by a 6 x 6 SOM. We attempt to infer E(x_2 | x_1, x_3 = 0), the expected value of the variable x_2 given x_1, with x_3 zero. As the variables are truly independent, the answer should be E(x_2 | x_1, x_3 = 0) = E(x_2) = 0. The optimal width of the Gaussian neighborhood function is relatively large (\sigma \approx 4), suggesting independent variables (a simple distribution). At zero neighborhood, the model is badly overfitted. Clearly, neglecting to select the correct model complexity would give unreliable results. When the complexity is right, the nearest neighbor rule can give a good approximation to the mean, though the lack of confidence intervals limits the reliability of the analysis.

Fig. 6. Conditional densities from a SOM trained on random independent data. Upper row: the conditional density p(x_2 | x_1, x_3 = 0) and the nearest neighbor prediction for the optimal neighborhood \sigma \approx 4. Lower row: the conditional density and the nearest neighbor prediction for a small neighborhood. The black lines show the means and standard deviations computed from the densities.

An example of using the conditional distributions is shown in Fig. 7. The neighborhood width was chosen based on the maximum likelihood of test data. The data is three dimensional; there is a dependence between two of the variables, and one is independent, as follows:

x_2 = \sin(\omega x_1) + \epsilon,    x_3 ~ N(0, 1).
Fig. 7. Example of the use of the SOM for data analysis. Top: all three component levels of a SOM trained down to the optimal neighborhood width. Middle row: training data and the means and standard deviations of the conditional densities p(x_2 | x_1) and p(x_3 | x_1), integrated over x_3 and x_2, respectively. Bottom: nearest neighbor estimates based on the best matching units, using five different values for x_3 and x_2, respectively.

This kind of a distribution can easily be modeled by means of a two-dimensional SOM; that is why no severe overfitting is observed and a small neighborhood width gives the best fit to test data. Yet it is not easy to observe the dependence from the component level display.
The conditional densities, on the other hand, are easy to interpret. Nearest neighbor regression also works relatively well, since the model complexity is correct. By visual inspection of the map it is difficult to perceive the mean or shape of the conditional distributions, and thus the reliability of the conclusions is practically impossible to assess.

Choosing parameter values to optimize the density estimate may not result in a mapping that is also optimal for visual display. However, the examples shown in Figs. 5 and 6 indicate that this method will outperform any prefixed heuristic rule. In any case, the results of visual inspection should be validated by other, more reliable techniques.

6.1 Border effects

In typical implementations of the SOM, the neighborhood function is the same for each map unit. This causes problems near the borders of the map, where the neighborhood function gets clipped and thus becomes asymmetric. The effect is that, for no obvious reason, data samples which are outside the map are given less significance than those within the map. As a result, units close to the border have larger kernels and allow data points to reside farther away from the map units. Consequently, the border units are pulled towards the center of the map, and the map does not extend close to the edges of the input distribution until the neighborhood is relatively small and the regularization is loose. This leads to a decrease of the likelihood for maps with a large neighborhood (or an increase of the quantization error), biasing the optimal width of the neighborhood towards smaller values.

This effect can be alleviated by normalizing the neighborhood function at the edges of the map. In the case of the sequential algorithm, it suffices to normalize the neighborhood function such that its sum is the same in each part of the map. When using the batch algorithm, the portion of the neighborhood function that gets clipped off due to the finite size of the map lattice can be transferred to the nearest edge units. Normalization of the neighborhood function is of particular importance if the minimum distortion rule (4) is applied to winner selection. We see from Eq. 10 that when the sum of the neighborhood function is constant throughout the map, all cells have equal noise variance. In practice we find that the minimum distortion rule produces very similar results to the original rule in terms of model selection. The equalization of the neighborhood volume near the map borders smoothes out the density model somewhat, as it allows a better fit to the data sample with a larger neighborhood, as can be seen from Fig. 8. Obviously, the continuous density model, with the minimum distortion rule, is favorable in many respects.
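As an illustration of the neighborhood normalization discussed above, the following sketch rescales a clipped Gaussian neighborhood so that its sum is the same for every map unit. The choice of rescaling each row to the maximum row sum is an assumption of this sketch, not a prescription from the chapter.

```python
import numpy as np

def normalized_neighborhood(grid, sigma):
    """Gaussian neighborhood matrix with an equal sum for every map unit.

    grid  : (M, 2) array of unit coordinates on the map lattice
    sigma : neighborhood width
    """
    d2 = np.sum((grid[:, None, :] - grid[None, :, :]) ** 2, axis=2)
    H = np.exp(-d2 / (2.0 * sigma ** 2))
    # Near the borders the Gaussian gets clipped, so the row sums differ;
    # rescale every row so that sum_j H[i, j] is constant over the map.
    row_sums = H.sum(axis=1)
    return H * (row_sums.max() / row_sums)[:, None]
```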
Fig. 8. Effect of the minimum distortion training rule on the density model (panels: training data, original rule, normalized neighborhood, minimum distortion rule). The neighborhood width \sigma is the same for all the maps, and larger than would be optimal, as the purpose of the figure is to highlight the differences between these cases. Note also how the density kernel positions do not coincide with the reference vector positions; especially in the middle of the maps the kernels are shifted upwards from the unit positions, according to Eq. 9. The jaggedness of the Voronoi cell boundaries is due to the discrete grid on which the density is evaluated.

7 Conclusions

We have presented the probability density model associated with the SOM algorithm. We have also discussed some difficulties that arise in the application of the SOM to statistical data analysis. We have shown how the probability density model can be used to find the maximum likelihood estimate of the SOM parameters, in order to optimize the generalization of the model to new data. The parameter search involves a considerable increase in the computational cost, but in serious data analysis the major concern is the reliability of the conclusions.

It should be stressed that although the density model is based on the error function, which is not defined in all cases, in practice the only restriction in the application of the density function to data analysis is that the algorithm should be allowed to converge.
Unfortunately, maximizing the likelihood of the data is not directly related to ensuring the reliability of the visual representation of the SOM. Especially when the data dimension is high, the units that code the co-occurrences of all correlated variables cannot be grouped together on the map, and thus the conditional densities become distributed around the map. Such effects cannot be observed by visual inspection of the component levels, no matter how the model hyperparameters are set. Those effects can be revealed, to some extent, by calculating the conditional densities, or the conditional means and the confidence intervals.

The association of a generative probability density model with the SOM enables comparison of the SOM with other similar methods, like the Generative Topographic Mapping. If the theoretical difficulties of the SOM are avoided by adopting the minimum distortion winner selection rule, the main difference that remains is the hard vs. soft assignment of data to the units. The hard assignments of the SOM are perhaps easier to interpret and visualize. In the SOM the activation of the units (i.e., the posterior probability of the kernels given one data point) is always one for the winning unit and zero for the others, or a unimodal activation bubble of the shape of the neighborhood around the winning unit, depending on the interpretation. With soft assignments the posterior probability may be multimodal (when two distant regions in the latent space are folded close to each other in the input space), and thus the activation is more difficult to visualize. Note, however, that this multimodal response gives a visual indication of the folding, which may also be valuable. Apparently, the choice of methods depends on the application goals, and in real data analysis it is reasonable to apply different methods to decrease the effect of the artefacts of the methods.

Matlab routines for evaluating the SOM probability density are available at the authors' web site.
Bibliography

[1] C. M. Bishop, M. Svensen, and C. K. I. Williams. GTM: a principled alternative to the self-organizing map. In C. von der Malsburg, W. von Seelen, J. C. Vorbruggen, and B. Sendhoff, editors, Artificial Neural Networks - ICANN 96, International Conference Proceedings. Springer-Verlag, Berlin, Germany, 1996.
[2] Christopher M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, 1995.
[3] Christopher M. Bishop, Markus Svensen, and Christopher K. I. Williams. Developments of the generative topographic mapping. Neurocomputing, 21:203-224, 1998.
[4] M. Cottrell, P. Gaubert, P. Letremy, and P. Rousset. Analyzing and representing multidimensional quantitative and qualitative data: demographic study of the Rhone valley, the domestic consumption of the Canadian families. In E. Oja and S. Kaski, editors, Kohonen Maps. Elsevier, Amsterdam, 1999.
[5] Ed Erwin, Klaus Obermayer, and Klaus Schulten. Self-organizing maps: ordering, convergence properties and energy functions. Biological Cybernetics, 67(1):47-55, 1992.
[6] Tom Heskes. Energy functions for self-organizing maps. In Erkki Oja and Samuel Kaski, editors, Kohonen Maps. Elsevier, 1999.
[7] Teuvo Kohonen. Self-organized formation of topologically correct feature maps. Biological Cybernetics, 43(1):59-69, 1982.
[8] Teuvo Kohonen. Self-Organizing Maps, volume 30 of Springer Series in Information Sciences. Springer, Berlin, Heidelberg, 1995. (Second Extended Edition 1997).
[9] Jouko Lampinen and Timo Kostiainen. Overtraining and model selection with the self-organizing map. In Proc. IJCNN'99, Washington, DC, USA, July 1999.
[10] Jouko Lampinen and Aki Vehtari. Bayesian approach for neural networks - review and case studies. Neural Networks (invited article, to appear).
[11] Stephen P. Luttrell. Code vector density in topographic mappings: scalar case. IEEE Transactions on Neural Networks, 2(4):427-436, July 1991.
[12] E. Oja and S. Kaski, editors. Kohonen Maps. Elsevier, Amsterdam, 1999.
[13] H. Ritter and K. Schulten. Kohonen self-organizing maps: exploring their computational capabilities. In Proc. ICNN'88, International Conference on Neural Networks, volume I, pages 109-116, Piscataway, NJ, 1988. IEEE Service Center.
[14] Michael E. Tipping and Christopher M. Bishop. Mixtures of principal component analysers. Technical report, Aston University, Birmingham B4 7ET, U.K., 1997.
[15] Akio Utsugi. Hyperparameter selection for self-organizing maps. Neural Computation, 9(3):623-635, 1997.
[16] Thomas Villmann, Ralf Der, Michael Herrmann, and Thomas M. Martinetz. Topology preservation in self-organizing feature maps: exact definition and measurement. IEEE Transactions on Neural Networks, 8(2):256-266, 1997.
[17] Irma Welling, Erkki Kähkönen, Marjaana Lahtinen, Kari Salmi, Jouko Lampinen, and Timo Kostiainen. Modelling of occupants' subjective responses and indoor air quality in office buildings. In Proceedings of Ventilation 2000, 6th International Symposium on Ventilation for Contaminant Control, pages 45-49, Helsinki, Finland, June 2000.

A APPENDIX: Accuracy of the MCMC method

In the algorithm described in Section 5 one picks L samples and computes S, the number of samples that satisfy a certain condition. The task is to estimate q, the probability of a single sample satisfying the condition, based on S and L. S follows the binomial distribution

p(S | q, L) = \binom{L}{S} q^S (1-q)^{L-S}.    (15)

In Bayesian terms, the posterior distribution of q for given S and L is

p(q | S, L) = \frac{p(S | q, L) p(q)}{\int p(S | q, L) p(q) dq} = \frac{q^S (1-q)^{L-S}}{\int q^S (1-q)^{L-S} dq},    (16)

where the prior distribution p(q) is uniform. The integral in the denominator yields

\int_0^1 q^S (1-q)^{L-S} dq = \sum_{k=0}^{L-S} \frac{(-1)^k}{S+k+1} \binom{L-S}{k}.    (17)

Moments of the posterior can be written as similar series, for example

E(q) = \int q \, p(q | S, L) dq = \frac{\sum_{k=0}^{L-S} \frac{(-1)^k}{S+k+2} \binom{L-S}{k}}{\sum_{k=0}^{L-S} \frac{(-1)^k}{S+k+1} \binom{L-S}{k}},    (18)

and from the first and second moments we can derive an expression for the variance of the estimate \hat{q} = E(q). Unfortunately, the computation easily runs into numerical difficulties due to the alternating sign (-1)^k. Exact values of the binomial coefficients are required, and these can be difficult to obtain when L is large.
To consider an approximation to the variance, observe that when S = L the formulas simplify considerably and the variance can be written as

\nu_0 = Var(\hat{q} | S \in \{0, L\}) = \frac{L+1}{(L+3)(L+2)^2}.    (19)

This is the minimum value of the variance for a given L. The variance of the binomial distribution at the maximum likelihood estimate q_{ml} = S/L equals q_{ml}(1 - q_{ml})/L. We can combine these results to get a fairly good approximation

Var(\hat{q}) \approx \nu(\hat{q}, L) = \nu_0 + \hat{q}(1 - \hat{q})/L,    (20)

which slightly over-estimates the variance. A more precise approximation can be obtained directly from Eq. 16 by numerical integration. Picking Monte Carlo samples from the distribution N[\hat{q}, \nu(\hat{q}, L)] truncated to the range [0, 1] produces good results, except maybe at the very edges of the range, but there the exact value \nu_0 can be applied.

Let us write the sum in Eq. 12 as F = \sum_i w_i q_i, where w_i = e^{-\beta W_i} (2\pi s_i)^{d/2}. The relative standard error of an estimate of F equals

\epsilon = \frac{\sigma_{\hat{F}}}{\hat{F}} = \frac{\sqrt{\sum_i w_i^2 Var(\hat{q}_i)}}{\sum_i w_i \hat{q}_i}.    (21)

Hence, to achieve a given accuracy \epsilon, L should be increased until

\frac{\sqrt{\sum_i w_i^2 Var(\hat{q}_i)}}{\sum_i w_i \hat{q}_i} < \epsilon.    (22)
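These expressions are straightforward to evaluate numerically; the following short sketch (an illustration, not code from the chapter) implements the approximate variance of Eq. 20 and the accuracy criterion of Eqs. 21-22.

```python
import numpy as np

def qhat_variance(S, L):
    """Approximate posterior variance of q (Eq. 20)."""
    qhat = S / L
    nu0 = (L + 1.0) / ((L + 3.0) * (L + 2.0) ** 2)   # exact variance when S = 0 or S = L (Eq. 19)
    return nu0 + qhat * (1.0 - qhat) / L

def relative_error(w, qhat, L):
    """Relative standard error of the estimate of F = sum_i w_i q_i (Eq. 21)."""
    var = qhat_variance(qhat * L, L)                 # vectorized over the Voronoi cells
    return np.sqrt(np.sum(w ** 2 * var)) / np.sum(w * qhat)
```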
Logistic Regression Department of Statistics The Pennsylvania State University Email: [email protected] Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max
Principles of Data Mining by Hand&Mannila&Smyth
Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences
NEUROEVOLUTION OF AUTO-TEACHING ARCHITECTURES
NEUROEVOLUTION OF AUTO-TEACHING ARCHITECTURES EDWARD ROBINSON & JOHN A. BULLINARIA School of Computer Science, University of Birmingham Edgbaston, Birmingham, B15 2TT, UK [email protected] This
The Variability of P-Values. Summary
The Variability of P-Values Dennis D. Boos Department of Statistics North Carolina State University Raleigh, NC 27695-8203 [email protected] August 15, 2009 NC State Statistics Departement Tech Report
Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data
Gaussian Process Latent Variable Models for Visualisation of High Dimensional Data Neil D. Lawrence Department of Computer Science, University of Sheffield, Regent Court, 211 Portobello Street, Sheffield,
Comparing large datasets structures through unsupervised learning
Comparing large datasets structures through unsupervised learning Guénaël Cabanes and Younès Bennani LIPN-CNRS, UMR 7030, Université de Paris 13 99, Avenue J-B. Clément, 93430 Villetaneuse, France [email protected]
Accurately and Efficiently Measuring Individual Account Credit Risk On Existing Portfolios
Accurately and Efficiently Measuring Individual Account Credit Risk On Existing Portfolios By: Michael Banasiak & By: Daniel Tantum, Ph.D. What Are Statistical Based Behavior Scoring Models And How Are
Exploratory Data Analysis Using Radial Basis Function Latent Variable Models
Exploratory Data Analysis Using Radial Basis Function Latent Variable Models Alan D. Marrs and Andrew R. Webb DERA St Andrews Road, Malvern Worcestershire U.K. WR14 3PS {marrs,webb}@signal.dera.gov.uk
BNG 202 Biomechanics Lab. Descriptive statistics and probability distributions I
BNG 202 Biomechanics Lab Descriptive statistics and probability distributions I Overview The overall goal of this short course in statistics is to provide an introduction to descriptive and inferential
Neural Networks Lesson 5 - Cluster Analysis
Neural Networks Lesson 5 - Cluster Analysis Prof. Michele Scarpiniti INFOCOM Dpt. - Sapienza University of Rome http://ispac.ing.uniroma1.it/scarpiniti/index.htm [email protected] Rome, 29
Introduction to Machine Learning and Data Mining. Prof. Dr. Igor Trajkovski [email protected]
Introduction to Machine Learning and Data Mining Prof. Dr. Igor Trakovski [email protected] Neural Networks 2 Neural Networks Analogy to biological neural systems, the most robust learning systems
Adaptive Demand-Forecasting Approach based on Principal Components Time-series an application of data-mining technique to detection of market movement
Adaptive Demand-Forecasting Approach based on Principal Components Time-series an application of data-mining technique to detection of market movement Toshio Sugihara Abstract In this study, an adaptive
Towards running complex models on big data
Towards running complex models on big data Working with all the genomes in the world without changing the model (too much) Daniel Lawson Heilbronn Institute, University of Bristol 2013 1 / 17 Motivation
Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data
CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear
Spatial Statistics Chapter 3 Basics of areal data and areal data modeling
Spatial Statistics Chapter 3 Basics of areal data and areal data modeling Recall areal data also known as lattice data are data Y (s), s D where D is a discrete index set. This usually corresponds to data
In mathematics, there are four attainment targets: using and applying mathematics; number and algebra; shape, space and measures, and handling data.
MATHEMATICS: THE LEVEL DESCRIPTIONS In mathematics, there are four attainment targets: using and applying mathematics; number and algebra; shape, space and measures, and handling data. Attainment target
Linear Classification. Volker Tresp Summer 2015
Linear Classification Volker Tresp Summer 2015 1 Classification Classification is the central task of pattern recognition Sensors supply information about an object: to which class do the object belong
A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution
A Primer on Mathematical Statistics and Univariate Distributions; The Normal Distribution; The GLM with the Normal Distribution PSYC 943 (930): Fundamentals of Multivariate Modeling Lecture 4: September
Classification by Pairwise Coupling
Classification by Pairwise Coupling TREVOR HASTIE * Stanford University and ROBERT TIBSHIRANI t University of Toronto Abstract We discuss a strategy for polychotomous classification that involves estimating
CITY UNIVERSITY OF HONG KONG 香 港 城 市 大 學. Self-Organizing Map: Visualization and Data Handling 自 組 織 神 經 網 絡 : 可 視 化 和 數 據 處 理
CITY UNIVERSITY OF HONG KONG 香 港 城 市 大 學 Self-Organizing Map: Visualization and Data Handling 自 組 織 神 經 網 絡 : 可 視 化 和 數 據 處 理 Submitted to Department of Electronic Engineering 電 子 工 程 學 系 in Partial Fulfillment
Gerry Hobbs, Department of Statistics, West Virginia University
Decision Trees as a Predictive Modeling Method Gerry Hobbs, Department of Statistics, West Virginia University Abstract Predictive modeling has become an important area of interest in tasks such as credit
From the help desk: Bootstrapped standard errors
The Stata Journal (2003) 3, Number 1, pp. 71 80 From the help desk: Bootstrapped standard errors Weihua Guan Stata Corporation Abstract. Bootstrapping is a nonparametric approach for evaluating the distribution
Regression III: Advanced Methods
Lecture 4: Transformations Regression III: Advanced Methods William G. Jacoby Michigan State University Goals of the lecture The Ladder of Roots and Powers Changing the shape of distributions Transforming
How to Win the Stock Market Game
How to Win the Stock Market Game 1 Developing Short-Term Stock Trading Strategies by Vladimir Daragan PART 1 Table of Contents 1. Introduction 2. Comparison of trading strategies 3. Return per trade 4.
http://www.jstor.org This content downloaded on Tue, 19 Feb 2013 17:28:43 PM All use subject to JSTOR Terms and Conditions
A Significance Test for Time Series Analysis Author(s): W. Allen Wallis and Geoffrey H. Moore Reviewed work(s): Source: Journal of the American Statistical Association, Vol. 36, No. 215 (Sep., 1941), pp.
2DI36 Statistics. 2DI36 Part II (Chapter 7 of MR)
2DI36 Statistics 2DI36 Part II (Chapter 7 of MR) What Have we Done so Far? Last time we introduced the concept of a dataset and seen how we can represent it in various ways But, how did this dataset came
Advanced Computer Graphics. Rendering Equation. Matthias Teschner. Computer Science Department University of Freiburg
Advanced Computer Graphics Rendering Equation Matthias Teschner Computer Science Department University of Freiburg Outline rendering equation Monte Carlo integration sampling of random variables University
Applications to Data Smoothing and Image Processing I
Applications to Data Smoothing and Image Processing I MA 348 Kurt Bryan Signals and Images Let t denote time and consider a signal a(t) on some time interval, say t. We ll assume that the signal a(t) is
DYNAMIC RANGE IMPROVEMENT THROUGH MULTIPLE EXPOSURES. Mark A. Robertson, Sean Borman, and Robert L. Stevenson
c 1999 IEEE. Personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or
Overview of Factor Analysis
Overview of Factor Analysis Jamie DeCoster Department of Psychology University of Alabama 348 Gordon Palmer Hall Box 870348 Tuscaloosa, AL 35487-0348 Phone: (205) 348-4431 Fax: (205) 348-8648 August 1,
Using Mixtures-of-Distributions models to inform farm size selection decisions in representative farm modelling. Philip Kostov and Seamus McErlean
Using Mixtures-of-Distributions models to inform farm size selection decisions in representative farm modelling. by Philip Kostov and Seamus McErlean Working Paper, Agricultural and Food Economics, Queen
STATISTICA Formula Guide: Logistic Regression. Table of Contents
: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary
BINOMIAL OPTIONS PRICING MODEL. Mark Ioffe. Abstract
BINOMIAL OPTIONS PRICING MODEL Mark Ioffe Abstract Binomial option pricing model is a widespread numerical method of calculating price of American options. In terms of applied mathematics this is simple
Lecture 3: Linear methods for classification
Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,
