Visual Data Mining: Integrating Machine Learning with Information Visualization
Dharmesh M. Maniyar and Ian T. Nabney
NCRG, Aston University, Birmingham B4 7ET, United Kingdom

ABSTRACT

Today, the data available to tackle many scientific challenges is vast in quantity and diverse in nature. The exploration of heterogeneous information spaces requires suitable mining algorithms as well as effective visual interfaces. Most existing systems concentrate either on mining algorithms or on visualization techniques. Though visual methods developed in information visualization have been helpful, improved understanding of a complex, large, high-dimensional dataset requires an effective projection of that dataset onto a lower-dimensional (2D or 3D) manifold. This paper introduces a flexible visual data mining framework which combines advanced projection algorithms developed in the machine learning domain with visual techniques developed in the information visualization domain. The framework follows Shneiderman's mantra to provide an effective user interface. The advantage of such an interface is that the user is directly involved in the data mining process. We integrate principled projection methods, such as the Generative Topographic Mapping (GTM) and Hierarchical GTM (HGTM), with powerful visual techniques, such as magnification factors, directional curvatures, parallel coordinates, billboarding, and user interaction facilities, to provide an integrated visual data mining framework. Results on a real-life high-dimensional dataset from the chemoinformatics domain are also reported and discussed. Projection results of the GTM are analytically compared with the projection results from other traditional projection methods, and it is also shown that the HGTM algorithm provides additional value for large datasets. The computational complexity of these algorithms is discussed to demonstrate their suitability for the visual data mining framework.
Keywords: Visual Data Mining, Data Visualization, Principled Projection Algorithms, Information Visualization Techniques.

1. INTRODUCTION

The wide availability of ever-growing data sets from different domains has created a need for effective knowledge discovery and data mining. For data mining to be effective, it is important to include the domain expert in the data exploration process and combine the flexibility, creativity, and general knowledge of the domain expert with automated machine learning algorithms to obtain useful results [17]. The principal purpose of visual data exploration is to present the data in a visual form provided with interactive exploration facilities, allowing the domain expert to gain insight into the data, draw conclusions, and understand its structure. Exploration of complex information spaces is an important research topic in many fields, including computer graphics, data mining, pattern recognition, and other areas of statistics, as well as database management and data warehousing. Visual techniques on their own cannot entirely replace analytic non-visual mining algorithms to represent a large high-dimensional dataset in a meaningful way. Rather, it is useful to combine multiple methods from different domains for effective data exploration [20][41][14]. Recently, the core research in visual data mining has focused on the combination of visual and non-visual techniques as well as on integrating the user in the exploration process.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. MDM 2006, Philadelphia, USA. Copyright 200X ACM X-XXXXX-XX-X/XX/XX ...$5.00.
Integrating visual and non-visual methods in order to support a variety of exploration tasks, such as identifying patterns in large unstructured heterogeneous information, identifying clusters, or studying the data in different clusters in detail, requires sophisticated machine learning algorithms, visual methods, and interaction techniques. Ankerst [1] classified visual data mining approaches into three categories. Methods of the first group apply visual methods independently of data mining algorithms. The second group uses visual methods in order to represent patterns and results from mining algorithms graphically. The third category tightly integrates both mining algorithms and visual methods in such a way that intermediate steps of the mining algorithms can be visualized and further guided by the domain expert. This tight integration allows users to control and steer the mining process directly based on the given visual feedback. The approach we present here belongs to the third category: we introduce a tight integration between principled projection algorithms and powerful information visualization techniques. Shneiderman's mantra of "Overview first, zoom and filter, details-on-demand" [30] nicely summarizes the design philosophy of modern information visualization systems. First,
the user needs to get an overview of the data. In the second stage, the user identifies interesting patterns and focuses on one or more of them. Finally, to analyze patterns in detail, the user needs to drill down and access details of the data. Information visualization technology may be used for all three steps of the data exploration process [20]. For a complex, large, high-dimensional dataset, where clear clustering is difficult or inappropriate, grouping data points in soft clusters and then using visual aids to explore them further can reveal insight that may prove useful in data mining [36]. Principled projection of high-dimensional data onto a lower-dimensional space is an important step in obtaining an effective grouping and clustering of a complex high-dimensional dataset [10]. Here, we use the term projection to mean any method of mapping data into a lower-dimensional space in such a way that the projected data keeps most of the topographic properties (i.e., "structure") and makes it easier for the users to interpret the data to gain useful information from it. As mentioned by Ferreira et al. [10], traditional projection methods such as Principal Component Analysis (PCA) [4], Factor Analysis [13], Multidimensional Scaling [42], Sammon's mapping [29], the Self-Organizing Map (SOM) [18], and FastMap [12] are all used in the knowledge discovery and data mining domain [15][37][38][19]. For many real-life large high-dimensional datasets, the Generative Topographic Mapping (GTM) [8], a principled projection algorithm, provides better projections than those obtained from traditional methods, such as PCA, Sammon's mapping, and the SOM [5][26]. Moreover, since the GTM provides a probabilistic representation of the projection manifold, it is possible to analytically describe (local) geometric properties anywhere on the manifold.
For example, we can calculate the local magnification factors [6], which describe how small regions in the visualization space are stretched or compressed when mapped to the data space. Note that it is not possible to obtain magnification factors for PCA and Sammon's mapping. For the SOM, the magnification factors can only be approximated [7]. It is also possible in the GTM to calculate analytically the local directional curvatures of the projection manifold, providing the user with a facility for monitoring the amount of folding and neighborhood preservation in the projection manifold [34]. The details of how these geometric properties of the manifold can be used during visual data mining are presented in Section 3.1. Moreover, it has been argued that a single two-dimensional projection, even if it is non-linear, is not usually sufficient to capture all of the interesting aspects of a large high-dimensional dataset. Hierarchical extensions of visualization methods allow the user to drill down into the data; each plot covers a smaller region and it is therefore easier to discern the structure of the data. Hierarchical GTM (HGTM) is a hierarchical visualization system which allows the user to explore interesting regions in more detail [35]. The work presented here describes a flexible framework for visual data mining which combines principled projection algorithms developed in the machine learning domain with visual techniques developed in the information visualization domain to achieve a better understanding of the information (data) space. The framework follows Shneiderman's mantra to provide an effective user interface (a software tool). The advantage of such an interface is that the user is directly involved in the data mining process, taking advantage of powerful and principled machine learning algorithms.
The results presented, on a real-life dataset from the chemoinformatics domain, clearly show that this interface provides a useful platform for visual data mining of large high-dimensional datasets. Projection results of the GTM are analytically compared with projection results from other methods traditionally used in the visual data mining domain. Using the hierarchical data visualization output, the tool also supports the development of new mixture of local experts models [25]. The remainder of this paper is organized as follows. In Section 2, we provide an overview of GTM and HGTM. The main information visualization and interaction techniques applied are described in Section 3. The integrated visual data mining framework we propose is discussed in Section 4. The experiments are presented in Section 5. In Section 6, we discuss computational costs for the projection algorithms. Finally, we draw the main conclusions in Section 7.

2. PROJECTION ALGORITHMS

This section provides a short overview of the projection algorithms we use.

2.1 Generative Topographic Mapping (GTM)

The GTM models a probability distribution in the (observable) high-dimensional data space, D = ℝ^D, by means of low-dimensional latent, or hidden, variables [8]. The data is visualized in the latent space, H ⊂ ℝ^L.

Figure 1: Schematic representation of the GTM model.

As demonstrated in Figure 1 (adapted from [8]), we cover the latent space, H, with an array of K latent space centres, x_i ∈ H, i = 1, 2, ..., K. The non-linear GTM transformation, f : H → D, from the latent space to the data space is defined using an RBF network. To this end, we cover the latent space with a set of M fixed non-linear basis functions (here we use Gaussian functions of the same width σ), φ_j : H → ℝ, j = 1, 2, ..., M, centred on a regular grid in the latent space. Given a point x ∈ H in the latent space, its image under the map f is

  f(x) = W φ(x),   (1)

where W is a D × M matrix of weight parameters and φ(x) = (φ_1(x), ..., φ_M(x))^T.
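Equation (1) and the responsibilities R_{i,n} of equation (2) below are straightforward to sketch numerically. The following is a minimal illustration with toy sizes, not the Netlab implementation used in the paper; all function names are our own, and the training of W and β by EM is omitted.

```python
import numpy as np

def rbf_basis(x, basis_centres, sigma):
    # Gaussian basis functions phi_j(x) of common width sigma, cf. eq. (1)
    d2 = np.sum((basis_centres - x) ** 2, axis=1)
    return np.exp(-d2 / (2.0 * sigma ** 2))            # shape (M,)

def gtm_map(x, W, basis_centres, sigma):
    # Image of a latent point under f(x) = W phi(x), eq. (1)
    return W @ rbf_basis(x, basis_centres, sigma)      # shape (D,)

def responsibilities(T, F, beta):
    # R[i, n] of eq. (2): posterior that the Gaussian centred at F[i]
    # (inverse variance beta) generated data point T[n]
    d2 = ((T[None, :, :] - F[:, None, :]) ** 2).sum(axis=2)   # (K, N)
    log_p = -0.5 * beta * d2
    log_p -= log_p.max(axis=0, keepdims=True)                 # numerical stability
    R = np.exp(log_p)
    return R / R.sum(axis=0, keepdims=True)

def project(T, F, X, beta):
    # Posterior-mean projection of each data point: sum_i R[i, n] x_i
    return responsibilities(T, F, beta).T @ X                 # (N, L)
```

Here `X` holds the K latent centres and `F` their images under `gtm_map`; in the real algorithm both W and β come from EM training on the data.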
The GTM creates a generative probabilistic model in the data space by placing a radially symmetric Gaussian with zero mean and inverse variance β around the images, under f, of the latent space centres x_i ∈ H, i = 1, 2, ..., K. We refer to the Gaussian density associated with the centre x_i
by P(t | x_i, W, β). Defining a uniform prior over the x_i, the density model in the data space provided by the GTM is

  P(t | W, β) = (1/K) Σ_{i=1}^{K} P(t | x_i, W, β).

For the purpose of data visualization, we use Bayes' theorem to invert the transformation f from the latent space H to the data space D. The posterior distribution on H, given a data point t_n ∈ D, is a sum of delta functions centred at the centres x_i, with coefficients equal to the posterior probability R_{i,n} that the i-th Gaussian (corresponding to the latent space centre x_i) generated t_n [8],

  R_{i,n} = P(t_n | x_i, W, β) / Σ_{j=1}^{K} P(t_n | x_j, W, β).   (2)

The latent space representation of the point t_n, i.e. the projection of t_n, is taken to be the mean, Σ_{i=1}^{K} R_{i,n} x_i, of the posterior distribution on H. The parameters of the GTM (weights W and inverse variance β) are learned from data using an Expectation Maximization (EM) algorithm [4]. The f-image of the latent space H, Ω = f(H) = {f(x) ∈ ℝ^D | x ∈ H}, forms a smooth L-dimensional manifold in the data space. We refer to the manifold Ω as the projection manifold of the GTM.

2.2 Hierarchical GTM (HGTM)

The hierarchical GTM (HGTM) arranges a set of GTMs and their corresponding plots in a tree structure T [33]. An example HGTM structure is shown in Figure 2.

Figure 2: An example structure for HGTM.

The Root of the hierarchy is at level 1, i.e. Level(Root) = 1. Children of a model N with Level(N) = l are at level l + 1, i.e. Level(M) = l + 1 for all M ∈ Children(N). Each model M in the hierarchy, except for Root, has an associated parent-conditional mixture coefficient, or prior, π(M | Parent(M)). The priors are non-negative and satisfy the consistency condition Σ_{M ∈ Children(N)} π(M | N) = 1. Unconditional priors for the models are recursively calculated as follows: π(Root) = 1, and for all other models

  π(M) = Π_{i=2}^{Level(M)} π(Path(M)_i | Path(M)_{i−1}),   (3)

where Path(M) = (Root, ..., M) is the N-tuple (N = Level(M)) of nodes defining the path in T from Root to M.
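The recursion in equation (3), together with the consistency condition, can be illustrated with a small sketch (the dictionary encoding of the tree and the node names are our own, purely illustrative):

```python
def unconditional_priors(node, prior=1.0, out=None):
    """Equation (3): pi(M) is the product of the parent-conditional priors
    pi(M | Parent(M)) along the path from Root down to M."""
    if out is None:
        out = {}
    out[node['name']] = prior
    for child in node.get('children', []):
        unconditional_priors(child, prior * child['pi'], out)
    return out

# Toy hierarchy: Root with children A (pi = 0.3) and B (pi = 0.7);
# A has children A1 and A2 with conditional priors 0.5 each.
root = {'name': 'Root', 'children': [
    {'name': 'A', 'pi': 0.3, 'children': [
        {'name': 'A1', 'pi': 0.5},
        {'name': 'A2', 'pi': 0.5}]},
    {'name': 'B', 'pi': 0.7}]}

priors = unconditional_priors(root)
```

The unconditional priors of the leaves (A1, A2, B) sum to one, which is exactly what makes the leaf mixture of equation (4) a valid density.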
The distribution given by the hierarchical model is a mixture of the leaf models of T,

  P(t | T) = Σ_{M ∈ Leaves(T)} π(M) P(t | M).   (4)

We obtain a soft segmentation of the input space from the HGTM model. Non-leaf models not only play a role in the process of creating the hierarchical model, but in the context of data visualization can be useful for determining the relationship between related subplots in the hierarchy. The HGTM is trained using the EM algorithm to maximize its likelihood with respect to the data sample ς = {t_1, t_2, ..., t_N}. Training of a hierarchy of GTMs proceeds in a recursive fashion. Visualization and interaction are provided to the domain experts during the intermediate steps of training an HGTM model. A detailed description of user interaction and construction of HGTM models is provided in Section 4.

3. VISUAL TECHNIQUES

A large number of visual techniques developed in the information visualization domain can be used for visualizing data or results. In addition to standard 2D/3D graphs, there are a number of more sophisticated visualization techniques. Keim [17] provided an informative overview of different classes of information visualization techniques. In addition to the usual facilities such as zoom, rotate, etc., the following specialized information visualization aids support exploration of the projection obtained from the principled machine learning algorithms to create a powerful visual data mining framework.

3.1 Magnification Factors and Directional Curvatures

One of the main advantages of using GTM-based models is that it is possible to analytically calculate the Magnification Factors (MF) [6] and the Directional Curvatures (DC) [34] of the GTM projection manifold. MFs of a GTM projection manifold, Ω, are calculated as the determinant of the Jacobian of the GTM map f [6].
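For intuition, this quantity can also be approximated numerically: for a map from L latent to D data dimensions, the "determinant of the Jacobian" is sqrt(det(JᵀJ)), with J the D × L Jacobian of f. The sketch below uses central finite differences on an arbitrary map; the closed-form expression of [6] instead differentiates the RBF basis functions analytically. The function name is ours.

```python
import numpy as np

def magnification_factor(f, x, eps=1e-5):
    # Approximate the (D x L) Jacobian of f at latent point x by central
    # finite differences, then return sqrt(det(J^T J)).
    x = np.asarray(x, dtype=float)
    L = x.size
    f0 = np.asarray(f(x), dtype=float)
    J = np.empty((f0.size, L))
    for j in range(L):
        step = np.zeros(L)
        step[j] = eps
        J[:, j] = (np.asarray(f(x + step)) - np.asarray(f(x - step))) / (2 * eps)
    return float(np.sqrt(np.linalg.det(J.T @ J)))
```

For a linear map f(x) = A x that stretches the two latent directions by factors 2 and 3, the magnification factor is 6 everywhere on the manifold.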
Magnification factor plots are used to observe the amount of stretching in the GTM manifold at different parts of the latent space, which helps in understanding the data space, outlier detection, and cluster separation. Tiňo et al. [34] derived a closed-form formula for the directional curvatures of the GTM projection manifold, Ω, for a latent space point x ∈ H and a directional vector h ∈ H. Directional curvature plots allow the user to observe the direction and amount of folding in the GTM manifold. This can help the user detect regions where the GTM manifold does not fit the data well. It is possible that groups of data points far apart when projected onto the projection manifold are close together in the data space due to high folding in the manifold. This loss of neighborhood preservation can be spotted as a strong curvature band on the corresponding directional curvature plot. The magnification factor is represented by color shading in the projection manifold (e.g., see Figure 4(c)). The lighter the color, the more stretch in the projection manifold. The direction of folding in the projection manifold is presented using a small line for each part of the projection manifold in the directional curvature plots (e.g., see Figure 4(d)). The length of the line and the shade of the background color represent the magnitude of folding. The longer the
line and the lighter the background color, the higher the folding (curvature).

3.2 Local Parallel Coordinates

The parallel coordinates technique [16] maps the D-dimensional data space onto two display dimensions by using D equidistant axes which are parallel to one of the display axes. It displays each multi-dimensional data point as a polygonal line which intersects the horizontal dimension axes at the positions corresponding to the data values for the corresponding dimensions. Instead of displaying parallel coordinates for all the data points together, which is impractical for a large dataset, we provide an interactive facility that lets the user select a point on the projection manifold and displays parallel coordinates for the n nearest neighbors of that selected point. Figure 3(a) displays an example of parallel coordinates used on a GTM projection: when the user clicks on a point in the projection (upper plot), the data space visualization graph shows a color-coded plot of normalized property values for a group of points close in the projection space. We call this a local parallel coordinates technique. This facility has proved very useful for the domain experts at Pfizer¹ to understand large high-dimensional datasets. Using this facility, the user can study properties of a high-dimensional data point in the data space while working with the lower-dimensional latent (projection) space. A detailed example discussing how local parallel coordinates are used to explore a projection manifold is presented in Section 5.2.

3.3 Billboarding

For many real-life datasets which have a natural representation, e.g. chemical compound structure, handwritten digit recognition, face recognition, galaxy classification, etc., using this natural representation of data points in the projection is more useful for understanding the data than representing data points by labelled and/or colored dots.
Here the term billboarding means visualizing a natural representation of a data point in the form of an image, in such a way that the image always faces the viewer (even in 3D). A chemical compound structure or a handwritten digit image is certainly more user-friendly than a dot. Partiview [22] is an interactive 3D visualization tool supporting a billboarding facility, primarily created for astronomy-related applications, but recently it has been successfully used for visualizing the output of some machine learning algorithms [31]. The number of pictures that can be displayed at a time depends on how much graphics memory is present. Figure 3(b) presents a close-up of the points visualized for the MNIST database [21] using Laplacian eigenmaps [3]. Billboarding presentation of images of the handwritten digits provides an intuitive visualization and can help us to identify why certain data points are misclassified (e.g., notice that in Figure 3(b), the images of 7s and 9s in the top left corner of the plot are quite similar). Partiview also provides many useful interaction facilities, such as 3D zooming and traversal, selective plotting of classes, property display, etc. [22].

4. THE INTEGRATED VISUAL DATA MINING FRAMEWORK

¹ Pfizer Global Research and Development, Kent, UK.

The integrated visual data mining framework combines the principled projection algorithms discussed in Section 2 and the visual techniques discussed in Section 3 to achieve a better understanding of the data space. It follows Shneiderman's mantra [30], "Overview first, zoom and filter, details on demand", to develop an effective interface. To support the "overview first" stage of Shneiderman's mantra, the output of the projection algorithms and basic visualization aids such as highlight, rotate, etc., are provided for exploring a large high-dimensional dataset.
For the second stage, "zoom and filter", visualization aids such as zooming and filtering interesting regions on the projection manifold with the use of magnification factor and directional curvature plots are provided. This allows the user to identify and concentrate on interesting subsets of the projection obtained in the first stage. The third stage, "details-on-demand", is supported using local parallel coordinates and billboarding. Integration with other visualization tools is also possible at various stages. Moreover, a single two-dimensional projection, even if it is non-linear, is not usually sufficient to capture all of the interesting aspects of a large high-dimensional dataset. A hierarchical system which allows the user to interactively drill down in the projection can be useful. Interactive visual methods support the construction of HGTM models and allow the user to interactively explore interesting regions in more detail. The visual aids described in Section 3 are provided at each stage of the HGTM model development. First, a base (Root) GTM is trained and used to visualize the data. Then the user identifies interesting regions on the visualization plot that they would like to explore in greater detail. In particular, the user chooses a collection of points, c_i ∈ H, by clicking on the projection plot. The regions of interest given by these points (centres) are then transformed into the data space as Voronoi compartments [2] defined by the mapped points f_Root(c_i) ∈ D, where f_Root is the map of the Root GTM. The child GTMs are initialized by local PCA in the corresponding Voronoi compartments. After training the child GTMs and seeing the lower-level visualization plots, the user may decide to proceed further and model in greater detail some portions of the lower-level plots, etc. Thus, HGTM allows domain experts to segment the input space interactively using data visualization.
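The Voronoi compartments used to seed the child GTMs reduce to a nearest-mapped-centre assignment. A minimal sketch (names are illustrative; in the tool, the centres f_Root(c_i) come from the user's clicks mapped through the Root GTM):

```python
import numpy as np

def voronoi_compartments(T, mapped_centres):
    # T: (N, D) data points; mapped_centres: (C, D) points f_Root(c_i).
    # Each data point joins the compartment of its nearest mapped centre.
    d2 = ((T[:, None, :] - mapped_centres[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)            # (N,) compartment index per point
```

Each child GTM is then initialized by local PCA on the points of its own compartment.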
When the dataset is very large, the higher-level projection plots may be cluttered and confused (with densely clustered and overlapping projections). This makes it difficult for the user to select locations for submodels at the next level. In such cases, an alternative semi-automatic submodel initialization algorithm [28], based on the minimum message length (MML) criterion, which decides both the number of submodels and their locations, can be used for the higher-level projections of the visualization hierarchy; the domain expert can then take control to guide the lower-level projections. We have developed an interactive software tool that supports this framework [24]. The interface is developed in Matlab² using the Netlab [27] toolbox. The tool supports other projection methods, such as PCA, Sammon's mapping, and the SOM. The interface has proved useful for domain experts to understand and mine large high-dimensional

² © The MathWorks, Inc.
datasets.

(a) The projection interface and the local parallel coordinates. (b) Billboarding example.
Figure 3: Information visualization facilities.

Visualization is a valuable tool for exploring and understanding data, but in many applications the fundamental task is one of prediction. It has been argued that a single global classification/regression model can rarely capture the full variability of a huge multi-dimensional dataset. Instead, local models, each focused on a separate area of input space (a cluster), often work better, since the mapping in different areas may vary. The tool also supports the guided mixture of local experts model, which uses the soft segmentation obtained using probabilistic hierarchical visualization algorithms, such as HGTM, to formulate the guided local mixture of experts model [25]. Thus the visual mining framework is not just a visual exploration tool but also supports guided modelling where the domain expert is closely involved.

5. EXPERIMENTS

Experiments were carried out on a real-life dataset from the pharmaceutical domain. A typical challenge in the early stages of the drug discovery process is to understand and explore large datasets containing High Throughput Screening (HTS) results (biological activity) alongside some whole-molecular properties [11]. A chemist/biologist is interested in studying and exploring clusters of active compounds to understand the data and make informed decisions for future screens. The chemoinformatics dataset provided by the chemists at Pfizer is described in Section 5.1. The results are discussed in Section 5.2.

5.1 The Dataset

The chemoinformatics dataset we used is composed of 23,600 compounds with biological activity data for five different biological targets and 11 whole-molecular physicochemical properties. Thus, the dataset has, in total, 16 variables (dimensions) in the data space, and we want to effectively visualize it on a 2-dimensional manifold.
Of these five biological targets, two are peptidergic G-protein coupled receptor (GPCR) targets, two are aminergic GPCR targets, and one is a kinase target. The four GPCR targets are of related receptor types whilst the kinase is a completely unrelated enzyme target class. Table 1 lists the label information and the distribution of compounds across the labels. In addition to the biological activity values, 11 whole-molecule physicochemical properties were included for each compound in the dataset. Table 2 lists the physicochemical properties used.

Table 1: Label information and compound distribution across labels.
  Label description                Compounds
  Not active in any screen
  Active for peptidergic type 1
  Active for peptidergic type 2    362
  Active for aminergic type 1      100
  Active for aminergic type 2      818
  Active for kinase                412
  Active for more than 1 screen    132

Table 2: Molecular physicochemical properties.
  AlogP; Molecular solubility; Number of atoms; Number of bonds; Number of Hydrogens; Number of ring bonds; Number of rotatable ring bonds; Number of Hydrogen acceptors; Number of Hydrogen donors; Molecular polar surface area; Molecular weight.

Since different input variables in the dataset have different ranges, before the development of visualization models we apply a linear transformation (Z-score transformation) so that all variables have similar ranges. Each variable is treated independently and is rescaled as follows:

  μ_i = (1/N) Σ_{n=1}^{N} x_i^n,   (5)

  σ_i² = (1/(N−1)) Σ_{n=1}^{N} (x_i^n − μ_i)²,   (6)

where n = 1, ..., N indexes the patterns, and μ_i and σ_i² represent the mean and variance of variable i respectively. Then the values are scaled by

  x̃_i^n = (x_i^n − μ_i) / σ_i,   (7)

where x̃_i^n is the scaled value of variable i for pattern n. Histograms of the variables after the scaling show approximately normal distributions (the plot is not included in the paper due to the space constraint). 50% of the dataset was used as the training set and the remaining 50% was used as the test set.

5.2 Results

Figure 4(a) shows the projection using the Neuroscale [23] algorithm, a neural network implementation of Sammon's mapping. The projection in Figure 4(a) is like a blob and does not help us to understand the structure of the data in the data space. GTM visualization results are shown in Figure 4(b). The GTM plot shows clear clusters for the compounds active for different targets.
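The Z-score rescaling of equations (5)-(7) amounts to a per-column standardization; a minimal sketch (the helper name is our own):

```python
import numpy as np

def zscore(X):
    # Columns = variables, rows = patterns; subtract the mean (5) and
    # divide by the (N - 1)-normalized standard deviation (6), giving (7).
    mu = X.mean(axis=0)
    sigma = X.std(axis=0, ddof=1)       # ddof=1 matches the 1/(N-1) in (6)
    return (X - mu) / sigma
```

In practice the μ_i and σ_i fitted on the training half of the data would also be applied unchanged to the held-out test half.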
The GTM visualization is certainly more informative than the visualization results obtained from PCA, Sammon's mapping, and the SOM (the PCA and SOM projection results are similar to those for Sammon's mapping; they are not shown here because of the space constraint). Though we can easily observe the effectiveness of the GTM projection on this dataset visually, it is useful to obtain an analytical measurement of the separation between the different data classes in the projections. To obtain such a measurement, we first fit a Gaussian mixture model (GMM) [4] to each class in the projection space and then calculate the Kullback-Leibler (KL) divergence [9] between the fitted GMMs as

  KL(p_a, p_b) = Σ_x p_a(x) log( p_a(x) / p_b(x) ),   (8)

where p_a and p_b are the GMMs for classes a and b respectively. The higher the value of the KL divergence, the greater the separation between the classes. Table 3 presents the sum of the entries in the KL divergence matrix for the different visualization model outputs. The GTM projection has a much higher KL divergence matrix sum, which is evidence for better separation using the GTM.

Table 3: KL divergence matrix sum for different projection models.
  Visualization model    KL divergence matrix sum
  PCA
  Neuroscale
  SOM
  GTM

Once the GTM projection is obtained, the user can study the dataset in more detail using the different interactive visual aids. The corresponding magnification factor (MF) and directional curvature (DC) plots of the GTM projection manifold (Figure 4(b)) are presented in Figure 4(c) and Figure 4(d) respectively. Using the MF plot displayed in Figure 4(c), we can observe the stretching in the data space. For example, the higher value of the MF in the bottom-right corner of the MF plot shows that compounds active for the peptidergic type 2 target are clustered far away from the bulk of the data.
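Equation (8) has no closed form for general GMMs, so in practice it is evaluated over a sample. The sketch below estimates KL(p_a ‖ p_b) by Monte Carlo for the simplest one-component case, a single full-covariance Gaussian per class; the paper fits a GMM [4] per class, and all function names here are our own.

```python
import numpy as np

def fit_gaussian(X):
    # Single full-covariance Gaussian: the one-component case of a GMM.
    return X.mean(axis=0), np.cov(X, rowvar=False)

def log_pdf(S, mu, cov):
    # Log density of a multivariate Gaussian at each row of S.
    d = S.shape[1]
    diff = S - mu
    inv = np.linalg.inv(cov)
    quad = np.einsum('ni,ij,nj->n', diff, inv, diff)
    return -0.5 * (quad + d * np.log(2.0 * np.pi) + np.log(np.linalg.det(cov)))

def kl_estimate(Xa, Xb, n_samples=20000, seed=0):
    # Monte Carlo estimate of KL(p_a || p_b), cf. equation (8):
    # E_{x ~ p_a}[ log p_a(x) - log p_b(x) ].
    mu_a, cov_a = fit_gaussian(Xa)
    mu_b, cov_b = fit_gaussian(Xb)
    S = np.random.default_rng(seed).multivariate_normal(mu_a, cov_a, size=n_samples)
    return float(np.mean(log_pdf(S, mu_a, cov_a) - log_pdf(S, mu_b, cov_b)))
```

Well-separated classes give a large estimate and identical classes give zero, mirroring the use of the KL divergence matrix sum in Table 3.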
Visual inspection of the direction lines and background color in the DC plot presented in 4(d) can help the user to understand the curvature directions and magnitude of the folding in the projection manifold. Magnification factor and directional curvature plots are also useful for making decisions about number and the positions of the centers for GTM subplots during the training of an HGTM model. For example, the lighter bands at the bottom right corner in the directional curvature plot (see 4(d)) reveals a huge folding in the projection manifold to cover the data space. This helps us to understand that there could be a cluster there even though the data points are not marked (labeled) differently. Magnification factors and directional curvatures plots are mainly used to understand projection space and data space in detail. But if the data are not colored (labeled) (for example if we do the analysis on a virtual compound library), magnification factor and directional cur-
7 (a) Neuroscale projection. (b) GTM projection. (c) Magnification Factors for the GTM projection. (d) Directional Curvatures for the GTM projection. Figure 4: Neuroscale and GTM projections for the chemoinformatics dataset with MF and DC plots for the GTM projection. Legend information for (a) and (b) is given in Table 1. vature plots can be used to observe clusters in the projection and data space. Local parallel coordinate plots help us to observe variations in the patterns in different regions of a projection plot. Figure 5 shows how patterns of physicochemical properties vary in different regions of the GTM projection. A careful study with the parallel coordinate technique reveals interesting structures in the projection space. It can be observed that the active compounds for different targets are nicely clustered. Close study with the help of local parallel coordinate technique reveals that the compounds marked as, present in the clusters for peptidergic type 1 and peptidergic type 2, are the active compounds for both of the peptidergic targets. That is in line with the fact that some compounds are active for both of the peptidergic targets. Billboarding is an effective way to display chemical compound structures in the projection manifold. Because of the confidentiality agreement with Pfizer, we are not able produce billboarding results on the chemoinformatics dataset. The HGTM visualization results are presented in Figure 6. Active compounds can be seen in different clusters in the root GTM. The deeper level plots clearly separate interesting local regions. At each level, the magnification factor plots, directional curvature plots, parallel coordinates, and the basic visual exploration facilities are used to make decisions about where to place the center of a subplot for the next level. 6. 
Figure 5: Local parallel coordinates demonstrating variations in the patterns in different regions of the GTM projection (plot 4(d)). Legend information is given in Table 1.

Figure 6: HGTM projection of the chemoinformatics dataset. Legend information is given in Table 1.

6. COMPUTATIONAL COST

Although the rapid development of high-performance computing has to some extent altered our perception of computational complexity, this issue cannot be ignored in a visual data mining framework where user interaction is important. The computational cost of PCA scales linearly, O(N), in the number of data points N. Neuroscale suffers from the fact that its computational demands grow with the square of the number of data points, O(N^2), because each evaluation of the Stress error requires the computation of N(N-1)/2 inter-point distances. In practice, for large datasets, it is common to apply an initial clustering phase (using, for example, the K-means algorithm) to generate a set of K prototype vectors (where K << N); Neuroscale can then be applied to these prototype vectors at a much reduced computational cost. Here we used Neuroscale with the fast shadow targets training algorithm [32]. The distance calculation between data points and reference vectors or mixture components, respectively, is identical in the SOM and GTM training algorithms. Updating the parameters in SOM training depends on the neighborhood function; in the experiments presented here it was continuous on the latent space, so the parameter updating scales as O(L^2 ND + L^2), where L is the number of grid points in the SOM map and D is the dimension of the data space. When updating parameters, the GTM requires the inversion of an M x M matrix, where M is the number of basis functions, followed by a set of matrix multiplications. The matrix inversion scales as O(M^3), while the matrix multiplications scale as O(LND) (more precisely, O(LMD + LND), but normally the number of data points, N, exceeds the number of basis functions, M), where L is the number of grid points in the GTM latent space.

Table 4 shows the time taken to train the different projection models on the training set using an Intel Pentium 4 2.4 GHz machine with 2 GB of RAM. Implementing the algorithms in C/C++ instead of Matlab could further improve the speed.

Table 4: Training time for different projection models (N = 11800, 20 iterations).

Model        Time (seconds)   Architecture
PCA          1                -
Neuroscale
SOM          36               L = 256
GTM          42               L = 256, M = 64

Once the models are trained, the computational cost of projecting a subsequent test set scales linearly, O(N), in the number of test points N.
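The costs quoted for GTM can be seen directly in its published weight update, W = (Phi^T G Phi + alpha*I)^-1 Phi^T R X, where the solve of the M x M system is the O(M^3) term and forming the right-hand side is the O(LMD + LND) term. A minimal sketch of that step (variable names and the toy sizes are illustrative, not from the paper):

```python
import numpy as np

def gtm_m_step(Phi, R, X, alpha=1e-3):
    """One GTM weight update: solve (Phi^T G Phi + alpha*I) W = Phi^T R X.

    Phi: (L, M) basis-function activations at the L latent grid points,
    R:   (L, N) responsibilities of grid points for the N data points,
    X:   (N, D) data.  The (M, M) solve is the O(M^3) term; forming
    Phi^T (R X) costs O(LND + LMD).
    """
    G = np.diag(R.sum(axis=1))  # diagonal of total responsibilities per grid point
    A = Phi.T @ G @ Phi + alpha * np.eye(Phi.shape[1])
    B = Phi.T @ (R @ X)
    return np.linalg.solve(A, B)  # (M, D) updated weight matrix

rng = np.random.default_rng(1)
L, M, N, D = 64, 16, 200, 11
Phi = rng.normal(size=(L, M))
R = rng.random(size=(L, N))
R /= R.sum(axis=0)  # each data point's responsibilities sum to one
X = rng.normal(size=(N, D))
W = gtm_m_step(Phi, R, X)
```

Because M is fixed by the chosen basis and typically small, the per-iteration cost grows only linearly in N, which is what makes GTM (and hence HGTM) acceptable inside an interactive framework.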
7. CONCLUSIONS

To understand a large high-dimensional dataset, close integration of principled projection methods and information visualization techniques is needed to develop an effective visual data mining framework. The interface developed using this framework, following Shneiderman's mantra for design, provided us with a useful tool for visual data mining of large high-dimensional datasets. Traditional projection algorithms used in the data mining domain, such as PCA, Neuroscale, and SOM, are not powerful enough to give a good projection for many real-life scientific problems where one has to deal with a complex high-dimensional dataset. The GTM algorithm, a principled projection algorithm, provided a better projection (in terms of cluster separation) than the projections obtained from the traditional methods such as PCA, Neuroscale, and SOM. For the chemoinformatics dataset analysed here, the KL divergence matrix sum results presented in Table 3 clearly show the strength of the class grouping we obtain. With the interactive visual aids provided on the projection output, we can not only characterize hit populations from different target classes (i.e., peptidergic GPCRs vs. aminergic GPCRs vs. kinases) but also understand areas of overlap. Magnification factor and directional curvature plots of the GTM helped to provide a better understanding of the projection manifold and how it fits the data in the data space. The local parallel coordinates technique proved to be a useful tool for understanding data points in interesting regions of the projection manifold in more detail.
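The paper's exact KL-divergence matrix sum formulation (Table 3) is not reproduced in this excerpt, so the following is only a generic sketch of the idea: fit a Gaussian to each class in the 2-D projected coordinates and sum the symmetric KL divergences between class pairs, so that better-separated classes score higher. All names and data are illustrative.

```python
import numpy as np

def gauss_kl(mu_p, S_p, mu_q, S_q):
    """Closed-form KL(N_p || N_q) between two multivariate Gaussians."""
    d = mu_p.size
    Sq_inv = np.linalg.inv(S_q)
    diff = mu_q - mu_p
    return 0.5 * (np.trace(Sq_inv @ S_p) + diff @ Sq_inv @ diff - d
                  + np.log(np.linalg.det(S_q) / np.linalg.det(S_p)))

def class_separation(Z, y):
    """Sum of symmetric KL divergences between per-class Gaussians.

    Z: (N, 2) projected coordinates, y: (N,) integer class labels.
    A small ridge keeps the sample covariances invertible.
    """
    stats = {c: (Z[y == c].mean(axis=0),
                 np.cov(Z[y == c].T) + 1e-6 * np.eye(2))
             for c in np.unique(y)}
    classes = sorted(stats)
    total = 0.0
    for i, a in enumerate(classes):
        for b in classes[i + 1:]:
            total += gauss_kl(*stats[a], *stats[b]) + gauss_kl(*stats[b], *stats[a])
    return total

# Well-separated toy clusters should score higher than overlapping ones.
rng = np.random.default_rng(2)
y = np.repeat([0, 1], 100)
apart = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(8, 1, (100, 2))])
mixed = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(0.5, 1, (100, 2))])
sep_apart, sep_mixed = class_separation(apart, y), class_separation(mixed, y)
```

A measure of this kind gives a single number per projection method, which is how projections from PCA, Neuroscale, SOM, and GTM can be compared quantitatively rather than only by eye.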
Since the structure of compounds is very important in drug discovery, billboarding could be a useful feature for the domain experts (chemists and screening scientists) at Pfizer to visualize chemical structures on the projection manifold. HGTM models are useful for exploring clusters and interesting local regions of a large dataset in detail. Note that these results are only for 11,800 compounds; the number of compounds one has to consider during the drug discovery process is enormous. In such situations a single GTM projection can look cluttered, but a well-trained HGTM model could be very useful in providing a better grouping. The computational cost of training a GTM is acceptable for inclusion in the visual data mining framework. The GTM and HGTM algorithms are scalable, so a large number of data points during training causes no difficulty beyond increased computational cost. The interface we developed supports powerful machine learning algorithms and information visualization techniques. It is also flexible enough to let a user export data points from a selected region of a projection manifold and analyze them further using industry-standard data mining tools such as PipeLine Pilot [39] and SpotFire [40]. In the future we plan to develop data visualization using GTM with simultaneous feature selection and model selection for unsupervised learning problems.

8. ACKNOWLEDGEMENT

D. M. Maniyar is grateful to Pfizer Central Research and the Overseas Research Scholarship award for funding his PhD. We thank Bruce Williams and Andreas Sewing of Pfizer for useful discussions and feedback on the results.

9. REFERENCES

[1] M. Ankerst. Visual data mining with pixel-oriented visualization techniques. Proc. Workshop on Visual Data Mining.
[2] F. Aurenhammer. Voronoi diagrams: a survey of a fundamental geometric data structure. ACM Computing Surveys, 3.
[3] M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6).
[4] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, New York, 1st edition, 1995.
[5] C. M. Bishop, M. Svensén, and C. K. I. Williams. GTM: a principled alternative to the self-organizing map. In Proc. of ICANN'96. Springer-Verlag, Berlin, Germany.
[6] C. M. Bishop, M. Svensén, and C. K. I. Williams. Magnification factors for the GTM algorithm. Proc. IEE Fifth International Conference on Artificial Neural Networks, pages 64-69.
[7] C. M. Bishop, M. Svensén, and C. K. I. Williams. Magnification factors for the SOM and GTM algorithms. Workshop Proc. on Self-Organizing Maps.
[8] C. M. Bishop, M. Svensén, and C. K. I. Williams. GTM: The generative topographic mapping. Neural Computation, 10.
[9] T. Cover and J. Thomas. Elements of Information Theory. Wiley, New York, 1st edition.
[10] M. C. F. de Oliveira and H. Levkowitz. From visual data exploration to visual data mining: A survey. IEEE Trans. on Vis. and Comp. Graphics, 9(3).
[11] P. Englebienne. High throughput screening: Will the past meet the future? Frontiers in Drug Design & Discovery, 1:69-86.
[12] C. Faloutsos and K. Lin. FastMap: a fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data.
[13] H. H. Harman. Modern Factor Analysis. Univ. of Chicago Press.
[14] A. Hinneburg, D. A. Keim, and M. Wawryniuk. HD-Eye: Visual mining of high-dimensional data. IEEE Comp. Graph. and App., 19(5):22-31.
[15] P. Hoffman, G. Grinstein, K. Marx, I. Grosse, and E. Stanley. DNA visual and analytic data mining. Proc. of the 8th IEEE Visualization Conf.
[16] A. Inselberg and B. Dimsdale. Parallel coordinates: A tool for visualizing multi-dimensional geometry. Proc. IEEE Visualization '90.
[17] D. A. Keim. Information visualization and visual data mining. IEEE Transactions on Visualization and Computer Graphics, 8(3).
[18] T. Kohonen. Self-Organizing Maps. Springer-Verlag, Berlin.
[19] E. L. Koua and M.-J. Kraak. Geovisualization to support the exploration of large health and demographic survey data. International Journal of Health Geographics, 3(12):1-13.
[20] M. Kreuseler and H. Schumann. A flexible approach for visual data mining. IEEE Transactions on Visualization and Comp. Graphics, 8(1):39-51.
[21] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proc. of the IEEE, 86(11).
[22] S. Levy. Interactive 3-D visualization of particle systems with Partiview. Proc. of the Int. Astronomical Union Symposium, 208:85-91.
[23] D. Lowe and M. E. Tipping. Neuroscale: Novel topographic feature extraction with radial basis function networks. Advances in Neural Information Processing Systems, 9.
[24] D. M. Maniyar and I. T. Nabney. DVMS 1.5: A user manual. Technical Report NCRG/2005/003, Neural Computing Research Group, UK.
[25] D. M. Maniyar and I. T. Nabney. Guiding local regression using visualisation. In Deterministic and Statistical Methods in Machine Learning, LNAI 3635, pages 98-109. Springer-Verlag.
[26] D. M. Maniyar, I. T. Nabney, B. S. Williams, and A. Sewing. Data visualization during the early stages of drug discovery. Journal of Chemical Information and Modelling, to appear in vol. 46.
[27] I. T. Nabney. Netlab: Algorithms for Pattern Recognition. Springer, London, 1st edition.
[28] I. T. Nabney, Y. Sun, P. Tiňo, and A. Kabán. Semisupervised learning of hierarchical latent trait models for data visualization. IEEE Trans. on Know. and Data Eng., 17(3).
[29] J. W. Sammon. A nonlinear mapping for data structure analysis. IEEE Trans. on Computers, C-18.
[30] B. Shneiderman. The eyes have it: A task by data type taxonomy for information visualizations. Proc. IEEE Symp. on Visual Languages, Sep.
[31] D. Surendran and S. Levy. Visualizing high dimensional datasets using Partiview. IEEE Symp. on Info. Visualization, INFOVIS '04.
[32] M. E. Tipping and D. Lowe. Shadow targets: A novel algorithm for topographic projections by radial basis functions. Neurocomputing, 19.
[33] P. Tiňo and I. T. Nabney. Constructing localized non-linear projection manifolds in a principled way: hierarchical generative topographic mapping. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24.
[34] P. Tiňo, I. T. Nabney, and Y. Sun. Using directional curvatures to visualize folding patterns of the GTM projection manifolds. Int. Conf. on Artificial Neural Networks.
[35] P. Tiňo, I. T. Nabney, Y. Sun, and B. S. Williams. A principled approach to interactive hierarchical non-linear visualization of high-dimensional data. Computing Science and Statistics, 33.
[36] E. R. Tufte. The Visual Display of Quantitative Information. Graphics Press, Cheshire, 2nd edition.
[37] J. Vesanto. SOM-based data visualization method. Intelligent Data Analysis, 3(2).
[38] S. Wang and H. Wang. Knowledge discovery through self-organizing maps: data visualization and query processing. Knowledge and Information Systems, 4(1):31-45.
[39] Website. SciTegic.
[40] Website. Spotfire.
[41] P. C. Wong. Visual data mining. IEEE Computer Graphics and Applications, 19(5):20-21.
[42] F. Young. Multidimensional Scaling: History, Theory, and Applications. Lawrence Erlbaum Assoc., Hillsdale, NJ, 1987.
3D Interactive Information Visualization: Guidelines from experience and analysis of applications Richard Brath Visible Decisions Inc., 200 Front St. W. #2203, Toronto, Canada, [email protected] 1. EXPERT
Character Image Patterns as Big Data
22 International Conference on Frontiers in Handwriting Recognition Character Image Patterns as Big Data Seiichi Uchida, Ryosuke Ishida, Akira Yoshida, Wenjie Cai, Yaokai Feng Kyushu University, Fukuoka,
CITY UNIVERSITY OF HONG KONG 香 港 城 市 大 學. Self-Organizing Map: Visualization and Data Handling 自 組 織 神 經 網 絡 : 可 視 化 和 數 據 處 理
CITY UNIVERSITY OF HONG KONG 香 港 城 市 大 學 Self-Organizing Map: Visualization and Data Handling 自 組 織 神 經 網 絡 : 可 視 化 和 數 據 處 理 Submitted to Department of Electronic Engineering 電 子 工 程 學 系 in Partial Fulfillment
Principles of Data Mining by Hand&Mannila&Smyth
Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences
Bayesian networks - Time-series models - Apache Spark & Scala
Bayesian networks - Time-series models - Apache Spark & Scala Dr John Sandiford, CTO Bayes Server Data Science London Meetup - November 2014 1 Contents Introduction Bayesian networks Latent variables Anomaly
Support Vector Machines Explained
March 1, 2009 Support Vector Machines Explained Tristan Fletcher www.cs.ucl.ac.uk/staff/t.fletcher/ Introduction This document has been written in an attempt to make the Support Vector Machines (SVM),
STATISTICA. Financial Institutions. Case Study: Credit Scoring. and
Financial Institutions and STATISTICA Case Study: Credit Scoring STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table of Contents INTRODUCTION: WHAT
ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)
ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) Gabriela Ochoa http://www.cs.stir.ac.uk/~goc/ OUTLINE Preliminaries Classification and Clustering Applications
