Implementation and Use of The K-Means Algorithm
Implementation and Use of The K-Means Algorithm

Thilina Hewaragamunige

September 11, 2014

Contents

1 K-Means Algorithm
  1.1 Introduction
  1.2 Performance measure of K-Means
2 Python Implementation of K-Means
  2.1 Implementation
  2.2 Demonstrations
      Impact of k on J value
3 Experiments
  3.1 Data Set
  3.2 Objective
  3.3 Results
4 Discussion
Appendices
A Complete Python Code for K-Means algorithm

Abstract

A brief introduction to the K-Means algorithm is provided, followed by an implementation in Python. The implementation is demonstrated using a synthetic data set. Finally, it is applied to the Wisconsin breast cancer data set to see whether the K-Means algorithm can cluster benign and malignant entries separately. The results of this experiment are discussed along with the performance measures of the K-Means algorithm.

1 K-Means Algorithm

This section briefly introduces the K-Means algorithm and its underpinning principle.

1.1 Introduction

The K-Means algorithm is used to find a given number of clusters (i.e. K clusters) in a particular multidimensional data set. A cluster is a group of data points located in close proximity to each other in terms of Euclidean distance. Each cluster has a central point, which is called the centroid. The K-Means algorithm uses an iterative approach to find a locally optimal solution. It starts by choosing an initial set of centroids; these may be chosen randomly from the data points. Each data point is then assigned to the closest centroid, which in turn forms the initial set of clusters. A new centroid is calculated for each cluster by taking the mean, along each dimension, of the data points belonging to the cluster. This results in a new set of centroids. The data points are then reassigned to the new centroids following the same criterion as before, and the same procedure continues until the centroids do not change. In certain cases, a fixed number of iterations may be used irrespective of whether the centroids have stabilized.

1.2 Performance measure of K-Means

The underlying principle of K-Means is to find a set of clusters such that the total distance from the data points to the centroids of their corresponding clusters is minimized. The sum of squared distances from the data points to the centroids of their clusters is called the performance measure of K-Means, or the J value [1, p. 424]. The formal definition of the J value is as follows:

   J = Σ_{n=1}^{N} Σ_{k=1}^{K} r_{nk} ||x_n − µ_k||²

The number of data points is N and the number of clusters is K. x_n is the n-th data point, whereas µ_k is the centroid of the k-th cluster. r_{nk} is known as a binary indicator variable, defined as follows:

   r_{nk} = 1 if x_n belongs to cluster k with the centroid µ_k, and 0 otherwise.

Usually the K-Means algorithm is executed until the J value reaches a minimum and remains unchanged. This happens when the centroids are stabilized after a number of iterations.

2 Python Implementation of K-Means

This section discusses the Python implementation of K-Means in detail. The latter half of the section demonstrates the implementation by applying it to a synthetic data set.

2.1 Implementation

The K-Means algorithm was implemented in a generic way to ensure that it can be used with a data set of any dimension and with any number of clusters. The K-Means function in this implementation accepts two parameters: the data object, which is a Numpy array, and the K value. In this implementation, the K-Means algorithm executes until the centroids are stabilized. In this discussion, the number of data points is referred to as n and the number of clusters as k.
The number of dimensions in the data set is represented by m.

1. The first step is to select the initial set of centroids. As per the terminology outlined above, the input data array is of size n x m. Numpy's random choice function is used without replacement to pick k random numbers between 0 and n. These random numbers are used as indexes into the data array in order to retrieve the initial k centroids. The following code snippet implements this functionality:

   # get k random numbers between 0 and the number of rows in the data set
   centroid_indexes = np.random.choice(range(data.shape[0]), k, replace=False)
   # get the corresponding data points
   centroids = data[centroid_indexes, :]

2. Now the code enters a while loop that runs until the centroids are stabilized. The next step is to calculate the Euclidean distance between each data point and each centroid. The centroid array is of shape k x m whereas the data array is of shape n x m. Since Numpy is used to calculate the Euclidean distance directly, it should be possible to broadcast these multidimensional arrays to a common shape. In order to do that, a new axis is introduced into the data array to change its shape to n x 1 x m.

   data_ex = data[:, np.newaxis, :]

The multidimensional arrays data_ex and centroids are now compatible for broadcasting, and the Euclidean distance can be calculated using per-element operations in Numpy. The resulting array is of size n x k x m.

   euclidean_dist = (data_ex - centroids) ** 2

As the final part of this step, the individual squared distances corresponding to each dimension are summed to obtain the distance between a data point and a centroid. The summation has to be done along the 3rd axis, which holds the per-dimension distances for a given data point and a given centroid.

   distance_arr = np.sum(euclidean_dist, axis=2)

The shape of distance_arr will be n x k.

3. The next step is to identify the closest centroid for each data point. This is achieved by taking the minimum of the distances between the data point and each of the k centroids, i.e. the minimum distance is calculated along the 2nd axis of distance_arr for each data point.

   np.argmin(distance_arr, axis=1)

To represent the cluster assignments, an n x k binary array is used. If the i-th data point belongs to the cluster corresponding to the j-th column, then element {i, j} of this array is set to 1; otherwise it is set to 0. After combining this minimum-location array with the previous code snippet, the following code segment results:

   min_location = np.zeros(distance_arr.shape)
   min_location[range(distance_arr.shape[0]), np.argmin(distance_arr, axis=1)] = 1

4. Now it is possible to calculate the J value for this iteration. It is stored in a list for later use.

   j_val = np.sum(distance_arr[min_location == True])

5. Next, the new centroids are calculated by taking the mean of the data points that are clustered together. The indexes of the data points belonging to a particular cluster can be identified using the corresponding column of the min_location matrix.
The mean values are then calculated along each dimension of the data points in the cluster to identify the new centroid for that cluster. In the Python implementation, a for loop is used to iterate through each cluster.

   new_centroids = np.empty(centroids.shape)
   for col in range(0, k):
       new_centroids[col] = np.mean(data[min_location[:, col] == True, :], axis=0)

6. Then the terminating condition is tested. If the centroids have not changed, it is safe to assume that the J value has stabilized. To check whether the new and old centroids are equal, they are sorted and Numpy's array_equal method is invoked. If the centroids are different, the process is repeated from step 2.

   np.array_equal(np.sort(new_centroids, axis=0), np.sort(centroids, axis=0))

7. In this implementation, even after the terminating condition is met, three more iterations are run just to collect J values for plotting. Finally, the clusters are plotted as scatter plots. Each cluster is plotted in a different color using a random color palette, with k + 1 colors for the k clusters and the centroids.

   colors = iter(cm.rainbow(np.linspace(0, 1, k + 1)))

The cluster plotting code snippet is shown below.

Figure 1: Demonstration: Scatter plot of the data points (before clusters are identified)

   for col in range(0, k):
       plt.scatter(data[min_location[:, col] == True, :][:, 0],
                   data[min_location[:, col] == True, :][:, 1],
                   color=next(colors))
   centroid_leg = plt.scatter(new_centroids[:, 0], new_centroids[:, 1],
                              color=next(colors), marker='x')
   plt.legend([centroid_leg], ['Centroids'], scatterpoints=1, loc='best')
   plt.savefig('cluster.png')

Finally, the J value is plotted against the iterations. The complete code of the K-Means algorithm is available in Appendix A.

2.2 Demonstrations

To demonstrate the correct functionality of the implementation, the following synthetic data set is used:

   [[1.1, 2], [1, 2], [0.9, 1.9], [1, 2.1], [9, 9], [8.9, 9], [8.7, 9.2], [9.1, 9]]

As can be noticed, there are two clearly visible clusters in this data set: the first four data points can be clustered together, and the remaining four can be considered the second cluster. A scatter plot of these data points is depicted in Figure 1. The K-Means algorithm is executed over this data set with k = 2 to identify the two clusters. Figure 2 depicts the clustered data set: the markers of the data points of each cluster appear in different colors, and the centroids are marked with a different marker. Please refer to Figure 3 for the J value plot; the J value stabilized in this particular instance.
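The clustering outcome on this synthetic set can also be verified without any plotting. The following is a condensed sketch of the same NumPy steps, with the initial centroid indexes fixed at 0 and 4 (instead of a random choice) so that the run is deterministic; it is an illustration, not the Appendix A code itself:

```python
import numpy as np

data = np.array([[1.1, 2], [1, 2], [0.9, 1.9], [1, 2.1],
                 [9, 9], [8.9, 9], [8.7, 9.2], [9.1, 9]])
k = 2
centroids = data[[0, 4], :]           # fixed initial centroids for reproducibility
while True:
    # squared Euclidean distance of every point to every centroid: shape n x k
    dist = ((data[:, np.newaxis, :] - centroids) ** 2).sum(axis=2)
    assign = np.argmin(dist, axis=1)  # closest centroid index per point
    new_centroids = np.array([data[assign == j].mean(axis=0) for j in range(k)])
    if np.array_equal(new_centroids, centroids):
        break
    centroids = new_centroids

print(assign)                                              # -> [0 0 0 0 1 1 1 1]
print(round(dist[np.arange(len(data)), assign].sum(), 4))  # stabilized J value
```

With this initialization the loop converges in two passes, assigning the first four points to one cluster and the last four to the other.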

Figure 2: Demonstration: Scatter plot of the clustered data points

Figure 3: Demonstration: J value plot

Figure 4: Demonstration: J vs. k plot

Impact of k on J value

To observe the effect of the value of k on the J value, a set of random data points was used. The motivation behind using random data points is to generalize the input to the K-Means algorithm as much as possible. The stabilized J value was recorded and plotted against different k values, incremented from 1 to the number of data points in the random sample. Figure 4 depicts the results of this test with 100 random data points.

When k is set to 1, every data point belongs to a single cluster and the final centroid is the mean of all data points. This may result in a high J value, as the majority of the data points may be located at a significant distance from the centroid. As k increases, data points get a chance to cluster with a centroid in closer proximity, so the J value is expected to decrease. When k is set to 100, i.e. the number of data points, every data point becomes a centroid, resulting in a J value of 0.

The shape of this plot can be different for real data sets which do contain clusters. In such cases, it is possible to have an optimal J value in the middle, before the curve goes to zero as k approaches the sample size. A plot like this is therefore useful for identifying a good k value. However, it is not possible to place high confidence in this plot, because it depends on the randomness of selecting the initial set of centroids.

3 Experiments

This section discusses using the K-Means implementation outlined in the previous section on a real data set: the Wisconsin breast cancer data set [2].

3.1 Data Set

The Wisconsin breast cancer data set [2] contains a set of features computed from digitized images of fine needle aspirates (FNA) of breast masses. It contains 699 entries, each with 11 attributes. One of the attributes is the sample code number, which identifies a sample uniquely. The Class attribute identifies whether an entry is benign or malignant.
The table below provides a brief description of each attribute.

Figure 5: Experiment: Plot of all dimensions

   Attribute No.   Attribute                       Domain
   1               Sample code number              id number
   2               Clump Thickness                 1 - 10
   3               Uniformity of Cell Size         1 - 10
   4               Uniformity of Cell Shape        1 - 10
   5               Marginal Adhesion               1 - 10
   6               Single Epithelial Cell Size     1 - 10
   7               Bare Nuclei                     1 - 10
   8               Bland Chromatin                 1 - 10
   9               Normal Nucleoli                 1 - 10
   10              Mitoses                         1 - 10
   11              Class                           2 for benign, 4 for malignant

There were missing attribute values in 16 records, denoted by "?" signs. Those records were removed from the processing set, leaving 683 records. Data was available from the first row onwards, so the file was manually modified to have the abbreviations of the attribute names as the first row. The first and last columns were skipped before running the K-Means algorithm. The following code listing shows how Pandas was used to import the data into a Numpy array.

   d = pandas.read_csv(open('data/breast-cancer-wisconsin.data'), na_values='?')
   d_clean = d[d.isnull().any(axis=1) == False]
   data = d_clean.iloc[:, 1:10].values

This results in 683 data points, each with 9 dimensions. Figure 5 shows this data plotted to see how it varies on each dimension. Since this plot is too cluttered for the naked eye, some of the attributes were plotted separately for the first 100 data entries. Figure 6 depicts the resulting set of plots. As can be seen, the data has an oscillating behavior along each axis.
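The cleaning step above can equivalently be written with Pandas' dropna. A minimal self-contained sketch, using toy rows as a stand-in for the real file (only the "?"-handling matters here, so the ids and values are made up):

```python
import pandas as pd

# toy stand-in rows in the same layout: id, 9 feature columns, class label
rows = [[1, 5, 1, 1, 1, 2, '?', 3, 1, 1, 2],
        [2, 5, 4, 4, 5, 7, 10, 3, 2, 1, 2],
        [3, 3, 1, 1, 1, 2, 2, 3, 1, 1, 2]]
d = pd.DataFrame(rows)
d = d.apply(pd.to_numeric, errors='coerce')  # '?' becomes NaN, like na_values='?'
d_clean = d.dropna()                         # equivalent to the isnull().any() filter
data = d_clean.iloc[:, 1:10].values          # drop the id and class columns
print(data.shape)                            # the '?' row is removed -> (2, 9)
```

The result is the same n x 9 array of feature values that the K-Means function expects.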

Figure 6: Experiment: Plot of 4 dimensions for the first 100 data points

3.2 Objective

Since the original data set labels each record as benign or malignant, the aim was to test whether, with k set to 2, the two resulting clusters would resemble those two categories.

3.3 Results

In the data set of 683 entries, the composition of benign and malignant entries based on the Class attribute (label) is shown below.

   Class       No. of Entries   Percentage
   Benign                       %
   Malignant                    %

The numbers of benign and malignant entries in the clusters produced by the K-Means algorithm were counted. If the majority of a cluster's entries are benign, it was considered the benign cluster; similarly, if malignant entries occupy the majority of a cluster, it was labeled the malignant cluster. The percentage of true positives in each cluster was then calculated, i.e. how many genuine benign records ended up in the benign cluster and how many malignant records ended up in the malignant cluster. Tabulated results of multiple iterations of the K-Means algorithm are given below. They contain the total number of entries, the number of benign entries, and the number of malignant entries in each cluster, along with the positive predictive value [3] for the overall result. The positive predictive value (PPV) is the proportion of true positive results:

   PPV = number of true positives / (number of true positives + number of false positives)
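The majority-vote labeling and the PPV computation can be written as a small helper. The counts below are placeholders for illustration, not the experiment's actual numbers:

```python
def ppv(cluster_counts):
    """cluster_counts: list of (benign, malignant) counts, one pair per cluster.
    Each cluster is labeled by its majority class, so its majority-class members
    are true positives and its minority-class members are false positives."""
    true_positives = sum(max(b, m) for b, m in cluster_counts)
    total = sum(b + m for b, m in cluster_counts)
    return true_positives / total

# placeholder counts: cluster 1 mostly benign, cluster 2 mostly malignant
print(round(ppv([(430, 10), (14, 229)]), 2))  # -> 0.96
```

Note that this overall PPV treats both clusters together, exactly as in the worked example in the next section.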

Figure 7: Experiment: J value plot over multiple iterations in a single execution of K-Means

   Iteration   Cluster 1: Size  Benign  Malignant   Cluster 2: Size  Benign  Malignant   PPV

For instance, in iteration 1, the PPV is calculated as follows:

   PPV = (number of true positives in cluster 1 + number of true positives in cluster 2)
         / ((number of true positives + number of false positives) in both clusters)
       = 0.96

Based on these observations, sufficient evidence exists to conclude that the clusters generated by the K-Means algorithm closely resemble the two categories of entries.

4 Discussion

A manual scan of the data set suggests that, for a given data point, the values of the majority of features are either low (around 1-2) or relatively high. If the feature values of a record are low, the record is categorized as benign, whereas relatively high values make it a malignant entry. This particular characteristic of the data set should be the main reason behind the observed results: data points with low values for each feature are clustered together by the K-Means algorithm, which creates the cluster of benign entries, and the same applies to entries with relatively high feature values. The J value plot for one iteration is depicted in Figure 7; the J value stabilized in this iteration.
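This low-versus-high pattern can be checked directly by computing per-feature means of each cluster from the assignment matrix. A minimal sketch with toy stand-in data (the real records are not reproduced here), using the same n x k min_location format as the implementation:

```python
import numpy as np

# toy stand-in: rows 0-2 have low feature values, rows 3-5 high values
data = np.array([[1, 2, 1], [2, 1, 2], [1, 1, 1],
                 [8, 9, 7], [9, 8, 8], [7, 9, 9]], dtype=float)

# assignment matrix: first three rows in cluster 0, last three in cluster 1
min_location = np.zeros((6, 2))
min_location[:3, 0] = 1
min_location[3:, 1] = 1

for col in range(2):
    centroid = data[min_location[:, col] == 1].mean(axis=0)
    print(col, centroid)  # every feature low in cluster 0, high in cluster 1
```

When applied to the real clustering result, the same loop makes the "all features low" versus "all features high" centroid split visible without any plotting.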

Figure 8: Experiment: Stabilized J values in 5 executions of K-Means

Figure 9: Experiment: Final centroids resulting from 5 executions of K-Means

Figure 10: Experiment: Scatter plot of data points before and after being clustered (with Clump Thickness and Uniformity of Cell Size)

Figure 11: Experiment: Scatter plot of data points before and after being clustered (with Clump Thickness and Uniformity of Cell Shape)

Figure 8 shows the stabilized J values of multiple executions of the K-Means algorithm. These J values have very little variance over the executions. Figure 9 is a plot of the final centroids of 5 executions of the K-Means algorithm; adjacent centroid pairs belong to a single execution (for instance, the first two centroids belong to the first execution and the last two to the fifth). Each feature is plotted in a different color. Observing the distribution of the values of a particular feature across these centroids, it can be seen that they vary between two distinct, narrow ranges of values; for instance, the feature plotted in green varies between 1.2 and 2.6. It can also be seen that the two centroids of a single execution lie in different ranges, never both in the same range. Continuing the previous example, the first centroid is approximately equal to 1.2 and the second centroid of the same execution is approximately equal to 2.6. Another observation is that, considering the entire set of features, at each centroid all of them show either relatively high values or relatively low values. As mentioned previously, if the feature values are low, there is a high chance that the record is a benign entry, and vice versa. This graphical evidence again justifies the high accuracy of the results.

The observations made from Figure 8 are supported by the observations made from Figure 9: the final sets of centroids produced by each execution are very close to each other in terms of Euclidean distance, hence the stabilized J values are approximately equal to each other.

Figures 10 and 11 show the results of the K-Means algorithm when only two features are considered. As discussed before, two clusters are formed, one containing smaller values in each dimension and the other containing larger values.
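The repeated-execution observation in Figures 8 and 9 can be illustrated in miniature. The sketch below runs a condensed, plot-free version of the Appendix A loop five times with random initial centroids; the synthetic set from Section 2.2 stands in for the real data, since the point is only that the stabilized J values barely vary across runs:

```python
import numpy as np

# the synthetic set from Section 2.2 stands in for the real data set here
data = np.array([[1.1, 2], [1, 2], [0.9, 1.9], [1, 2.1],
                 [9, 9], [8.9, 9], [8.7, 9.2], [9.1, 9]])

def kmeans_j(data, k, rng):
    # condensed version of the Appendix A loop: returns only the stabilized J value
    centroids = data[rng.choice(data.shape[0], k, replace=False), :]
    while True:
        dist = ((data[:, np.newaxis, :] - centroids) ** 2).sum(axis=2)
        assign = np.argmin(dist, axis=1)
        new_c = np.array([data[assign == j].mean(axis=0) for j in range(k)])
        if np.array_equal(np.sort(new_c, axis=0), np.sort(centroids, axis=0)):
            return dist[np.arange(len(data)), assign].sum()
        centroids = new_c

rng = np.random.default_rng(0)
j_vals = [kmeans_j(data, 2, rng) for _ in range(5)]
print(j_vals)  # nearly identical stabilized J values across executions
```

Because the two groups are well separated, every initialization converges to the same partition, so the five J values coincide; on real but well-clustered data, the same stability shows up as the flat curve of Figure 8.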
One of the biggest challenges was getting familiar with the Numpy library. It requires some practice to be comfortable with handling multidimensional arrays and some of Numpy's advanced features such as broadcasting. The IPython notebooks provided in the class and the book Python for Data Analysis [4] were very useful in this assignment.

The K-Means algorithm proves to be effective against certain data sets; the Wisconsin breast cancer data set is one such example. If the feature values of the data are inherently clustered, the K-Means algorithm can generate very accurate results. Also, if the correct K value is known, as in this particular example, the challenge of identifying a good k value does not arise.

Appendices

A Complete Python Code for K-Means algorithm

The complete code for the K-Means algorithm is given below. It does not contain the code used to prepare the data.

   import numpy as np
   import matplotlib.pyplot as plt
   from matplotlib import cm

   def kmeans(data, k=2):
       # get the initial set of centroids:
       # get k random numbers between 0 and the number of rows in the data set
       centroid_indexes = np.random.choice(range(data.shape[0]), k, replace=False)
       # get the corresponding data points
       centroids = data[centroid_indexes, :]
       print('Initial Centroids:', centroids)

       # color map for coloring clusters. One extra color for the centroids
       colors = iter(cm.rainbow(np.linspace(0, 1, k + 1)))

       j_values = []
       extra_iterations = 3
       while extra_iterations > 0:
           # find the euclidean distance between a center and a data point
           # centroids array shape = k x m
           # data array shape = n x m
           # in order to broadcast it, we have to introduce a third dimension
           # into data: the data array becomes n x 1 x m
           # as a result of broadcasting, both array sizes will be n x k x m
           data_ex = data[:, np.newaxis, :]
           euclidean_dist = (data_ex - centroids) ** 2
           # take the summation of all distances along the 3rd axis (length m).
           # This will be the total distance from each centroid for each data point.
           # The resulting array will be of size n x k
           distance_arr = np.sum(euclidean_dist, axis=2)

           # find out to which cluster each data point belongs.
           # Use a matrix of n x k where [i, j] = 1 if the ith data point
           # belongs to cluster j
           min_location = np.zeros(distance_arr.shape)
           min_location[range(distance_arr.shape[0]), np.argmin(distance_arr, axis=1)] = 1

           # calculate J
           j_val = np.sum(distance_arr[min_location == True])
           print('J Value:', j_val)
           j_values.append(j_val)

           # calculate the new centroids
           new_centroids = np.empty(centroids.shape)
           for col in range(0, k):
               new_centroids[col] = np.mean(data[min_location[:, col] == True, :], axis=0)
           print(new_centroids)

           # compare centroids to see if they are equal or not
           if np.array_equal(np.sort(new_centroids, axis=0), np.sort(centroids, axis=0)):
               # the iteration has resulted in the same centroids.
               # Run it for extra iterations just to plot the J value.
               print('Centroids are stabilized. Going for an extra iteration.')
               extra_iterations = extra_iterations - 1
               if extra_iterations == 0:
                   # plot the centroids and the assigned data points as a scatter plot
                   for col in range(0, k):
                       plt.scatter(data[min_location[:, col] == True, :][:, 0],
                                   data[min_location[:, col] == True, :][:, 1],
                                   color=next(colors))
                   centroid_leg = plt.scatter(new_centroids[:, 0], new_centroids[:, 1],
                                              color=next(colors), marker='x')
                   plt.legend([centroid_leg], ['Centroids'], scatterpoints=1, loc='best')
                   plt.savefig('cluster.png')

                   # plot J values
                   fig = plt.figure()
                   j_plot = fig.add_subplot(1, 1, 1)
                   j_plot.plot(range(len(j_values)), np.array(j_values))
                   j_plot.set_title('J Value Variation With No. of Iterations')
                   j_plot.set_xlabel('Iterations')
                   j_plot.set_ylabel('J Value')
                   plt.savefig('j_vals.png')
                   return min_location, j_val
           centroids = new_centroids

References

[1] Christopher Bishop. Pattern Recognition and Machine Learning. Springer.
[2] Center for Machine Learning and Intelligent Systems, UC Irvine. Breast cancer wisconsin (diagnostic) data set.
[3] Wikipedia. Positive and negative predictive values.
[4] Wes McKinney. Python for Data Analysis. O'Reilly.


More information

Session 7 Bivariate Data and Analysis

Session 7 Bivariate Data and Analysis Session 7 Bivariate Data and Analysis Key Terms for This Session Previously Introduced mean standard deviation New in This Session association bivariate analysis contingency table co-variation least squares

More information

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS Dwijesh C. Mishra I.A.S.R.I., Library Avenue, New Delhi-110 012 dcmishra@iasri.res.in What is Learning? "Learning denotes changes in a system that enable

More information

Chapter ML:XI (continued)

Chapter ML:XI (continued) Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained

More information

IBM SPSS Direct Marketing 22

IBM SPSS Direct Marketing 22 IBM SPSS Direct Marketing 22 Note Before using this information and the product it supports, read the information in Notices on page 25. Product Information This edition applies to version 22, release

More information

Statistical Databases and Registers with some datamining

Statistical Databases and Registers with some datamining Unsupervised learning - Statistical Databases and Registers with some datamining a course in Survey Methodology and O cial Statistics Pages in the book: 501-528 Department of Statistics Stockholm University

More information

Supervised and unsupervised learning - 1

Supervised and unsupervised learning - 1 Chapter 3 Supervised and unsupervised learning - 1 3.1 Introduction The science of learning plays a key role in the field of statistics, data mining, artificial intelligence, intersecting with areas in

More information

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin

More information

Week 4: Standard Error and Confidence Intervals

Week 4: Standard Error and Confidence Intervals Health Sciences M.Sc. Programme Applied Biostatistics Week 4: Standard Error and Confidence Intervals Sampling Most research data come from subjects we think of as samples drawn from a larger population.

More information

Analytics on Big Data

Analytics on Big Data Analytics on Big Data Riccardo Torlone Università Roma Tre Credits: Mohamed Eltabakh (WPI) Analytics The discovery and communication of meaningful patterns in data (Wikipedia) It relies on data analysis

More information

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca Clustering Adrian Groza Department of Computer Science Technical University of Cluj-Napoca Outline 1 Cluster Analysis What is Datamining? Cluster Analysis 2 K-means 3 Hierarchical Clustering What is Datamining?

More information

Chapter 11. Correspondence Analysis

Chapter 11. Correspondence Analysis Chapter 11 Correspondence Analysis Software and Documentation by: Bee-Leng Lee This chapter describes ViSta-Corresp, the ViSta procedure for performing simple correspondence analysis, a way of analyzing

More information

Formulas, Functions and Charts

Formulas, Functions and Charts Formulas, Functions and Charts :: 167 8 Formulas, Functions and Charts 8.1 INTRODUCTION In this leson you can enter formula and functions and perform mathematical calcualtions. You will also be able to

More information

Efficient Data Structures for Decision Diagrams

Efficient Data Structures for Decision Diagrams Artificial Intelligence Laboratory Efficient Data Structures for Decision Diagrams Master Thesis Nacereddine Ouaret Professor: Supervisors: Boi Faltings Thomas Léauté Radoslaw Szymanek Contents Introduction...

More information

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet

More information

Using multiple models: Bagging, Boosting, Ensembles, Forests

Using multiple models: Bagging, Boosting, Ensembles, Forests Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or

More information

6.4 Normal Distribution

6.4 Normal Distribution Contents 6.4 Normal Distribution....................... 381 6.4.1 Characteristics of the Normal Distribution....... 381 6.4.2 The Standardized Normal Distribution......... 385 6.4.3 Meaning of Areas under

More information

203.4770: Introduction to Machine Learning Dr. Rita Osadchy

203.4770: Introduction to Machine Learning Dr. Rita Osadchy 203.4770: Introduction to Machine Learning Dr. Rita Osadchy 1 Outline 1. About the Course 2. What is Machine Learning? 3. Types of problems and Situations 4. ML Example 2 About the course Course Homepage:

More information

CALCULATIONS & STATISTICS

CALCULATIONS & STATISTICS CALCULATIONS & STATISTICS CALCULATION OF SCORES Conversion of 1-5 scale to 0-100 scores When you look at your report, you will notice that the scores are reported on a 0-100 scale, even though respondents

More information

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/8/2004 Hierarchical

More information

A Review of Sudoku Solving using Patterns

A Review of Sudoku Solving using Patterns International Journal of Scientific and Research Publications, Volume 3, Issue 5, May 2013 1 A Review of Sudoku Solving using Patterns Rohit Iyer*, Amrish Jhaveri*, Krutika Parab* *B.E (Computers), Vidyalankar

More information

Probability Distributions

Probability Distributions CHAPTER 5 Probability Distributions CHAPTER OUTLINE 5.1 Probability Distribution of a Discrete Random Variable 5.2 Mean and Standard Deviation of a Probability Distribution 5.3 The Binomial Distribution

More information

Microsoft Excel 2010 Charts and Graphs

Microsoft Excel 2010 Charts and Graphs Microsoft Excel 2010 Charts and Graphs Email: training@health.ufl.edu Web Page: http://training.health.ufl.edu Microsoft Excel 2010: Charts and Graphs 2.0 hours Topics include data groupings; creating

More information

Summary of important mathematical operations and formulas (from first tutorial):

Summary of important mathematical operations and formulas (from first tutorial): EXCEL Intermediate Tutorial Summary of important mathematical operations and formulas (from first tutorial): Operation Key Addition + Subtraction - Multiplication * Division / Exponential ^ To enter a

More information

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING AAS 07-228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations

More information

Clustering UE 141 Spring 2013

Clustering UE 141 Spring 2013 Clustering UE 141 Spring 013 Jing Gao SUNY Buffalo 1 Definition of Clustering Finding groups of obects such that the obects in a group will be similar (or related) to one another and different from (or

More information

ABSORBENCY OF PAPER TOWELS

ABSORBENCY OF PAPER TOWELS ABSORBENCY OF PAPER TOWELS 15. Brief Version of the Case Study 15.1 Problem Formulation 15.2 Selection of Factors 15.3 Obtaining Random Samples of Paper Towels 15.4 How will the Absorbency be measured?

More information

Using Excel for inferential statistics

Using Excel for inferential statistics FACT SHEET Using Excel for inferential statistics Introduction When you collect data, you expect a certain amount of variation, just caused by chance. A wide variety of statistical tests can be applied

More information

Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics

Descriptive statistics Statistical inference statistical inference, statistical induction and inferential statistics Descriptive statistics is the discipline of quantitatively describing the main features of a collection of data. Descriptive statistics are distinguished from inferential statistics (or inductive statistics),

More information

Representing Vector Fields Using Field Line Diagrams

Representing Vector Fields Using Field Line Diagrams Minds On Physics Activity FFá2 5 Representing Vector Fields Using Field Line Diagrams Purpose and Expected Outcome One way of representing vector fields is using arrows to indicate the strength and direction

More information

Solving Mass Balances using Matrix Algebra

Solving Mass Balances using Matrix Algebra Page: 1 Alex Doll, P.Eng, Alex G Doll Consulting Ltd. http://www.agdconsulting.ca Abstract Matrix Algebra, also known as linear algebra, is well suited to solving material balance problems encountered

More information

Cluster Analysis. Alison Merikangas Data Analysis Seminar 18 November 2009

Cluster Analysis. Alison Merikangas Data Analysis Seminar 18 November 2009 Cluster Analysis Alison Merikangas Data Analysis Seminar 18 November 2009 Overview What is cluster analysis? Types of cluster Distance functions Clustering methods Agglomerative K-means Density-based Interpretation

More information

Scatter Plots with Error Bars

Scatter Plots with Error Bars Chapter 165 Scatter Plots with Error Bars Introduction The procedure extends the capability of the basic scatter plot by allowing you to plot the variability in Y and X corresponding to each point. Each

More information

Portal Connector Fields and Widgets Technical Documentation

Portal Connector Fields and Widgets Technical Documentation Portal Connector Fields and Widgets Technical Documentation 1 Form Fields 1.1 Content 1.1.1 CRM Form Configuration The CRM Form Configuration manages all the fields on the form and defines how the fields

More information

Analyzing the Effect of Treatment and Time on Gene Expression in Partek Genomics Suite (PGS) 6.6: A Breast Cancer Study

Analyzing the Effect of Treatment and Time on Gene Expression in Partek Genomics Suite (PGS) 6.6: A Breast Cancer Study Analyzing the Effect of Treatment and Time on Gene Expression in Partek Genomics Suite (PGS) 6.6: A Breast Cancer Study The data for this study is taken from experiment GSE848 from the Gene Expression

More information

STATISTICA Formula Guide: Logistic Regression. Table of Contents

STATISTICA Formula Guide: Logistic Regression. Table of Contents : Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary

More information

Univariate Regression

Univariate Regression Univariate Regression Correlation and Regression The regression line summarizes the linear relationship between 2 variables Correlation coefficient, r, measures strength of relationship: the closer r is

More information

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 10 th, 2013 Wolf-Tilo Balke and Kinda El Maarry Institut für Informationssysteme Technische Universität Braunschweig

More information

One-Way ANOVA using SPSS 11.0. SPSS ANOVA procedures found in the Compare Means analyses. Specifically, we demonstrate

One-Way ANOVA using SPSS 11.0. SPSS ANOVA procedures found in the Compare Means analyses. Specifically, we demonstrate 1 One-Way ANOVA using SPSS 11.0 This section covers steps for testing the difference between three or more group means using the SPSS ANOVA procedures found in the Compare Means analyses. Specifically,

More information

Projects Involving Statistics (& SPSS)

Projects Involving Statistics (& SPSS) Projects Involving Statistics (& SPSS) Academic Skills Advice Starting a project which involves using statistics can feel confusing as there seems to be many different things you can do (charts, graphs,

More information

Exercise 1: How to Record and Present Your Data Graphically Using Excel Dr. Chris Paradise, edited by Steven J. Price

Exercise 1: How to Record and Present Your Data Graphically Using Excel Dr. Chris Paradise, edited by Steven J. Price Biology 1 Exercise 1: How to Record and Present Your Data Graphically Using Excel Dr. Chris Paradise, edited by Steven J. Price Introduction In this world of high technology and information overload scientists

More information

Neural Networks Lesson 5 - Cluster Analysis

Neural Networks Lesson 5 - Cluster Analysis Neural Networks Lesson 5 - Cluster Analysis Prof. Michele Scarpiniti INFOCOM Dpt. - Sapienza University of Rome http://ispac.ing.uniroma1.it/scarpiniti/index.htm michele.scarpiniti@uniroma1.it Rome, 29

More information

STATISTICS AND DATA ANALYSIS IN GEOLOGY, 3rd ed. Clarificationof zonationprocedure described onpp. 238-239

STATISTICS AND DATA ANALYSIS IN GEOLOGY, 3rd ed. Clarificationof zonationprocedure described onpp. 238-239 STATISTICS AND DATA ANALYSIS IN GEOLOGY, 3rd ed. by John C. Davis Clarificationof zonationprocedure described onpp. 38-39 Because the notation used in this section (Eqs. 4.8 through 4.84) is inconsistent

More information

The correlation coefficient

The correlation coefficient The correlation coefficient Clinical Biostatistics The correlation coefficient Martin Bland Correlation coefficients are used to measure the of the relationship or association between two quantitative

More information

m i: is the mass of each particle

m i: is the mass of each particle Center of Mass (CM): The center of mass is a point which locates the resultant mass of a system of particles or body. It can be within the object (like a human standing straight) or outside the object

More information

Machine Learning using MapReduce

Machine Learning using MapReduce Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

More information

Data Mining: A Hybrid Approach on the Clinical Diagnosis of Breast Tumor Patients

Data Mining: A Hybrid Approach on the Clinical Diagnosis of Breast Tumor Patients Data Mining: A Hybrid Approach on the Clinical Diagnosis of Breast Tumor Patients Onuodu F. E. 1, Eke B. O. 2 2 bathoyol@gmail.com, University of Port Harcourt, Port Harcourt, Nigeria 1 University of Port

More information

EXAM #1 (Example) Instructor: Ela Jackiewicz. Relax and good luck!

EXAM #1 (Example) Instructor: Ela Jackiewicz. Relax and good luck! STP 231 EXAM #1 (Example) Instructor: Ela Jackiewicz Honor Statement: I have neither given nor received information regarding this exam, and I will not do so until all exams have been graded and returned.

More information

The Bending Strength of Pasta

The Bending Strength of Pasta The Bending Strength of Pasta 1.105 Lab #1 Louis L. Bucciarelli 9 September, 2003 Lab Partners: [Name1] [Name2] Data File: Tgroup3.txt On the cover page, include your name, the names of your lab partners,

More information

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical

More information

Data representation and analysis in Excel

Data representation and analysis in Excel Page 1 Data representation and analysis in Excel Let s Get Started! This course will teach you how to analyze data and make charts in Excel so that the data may be represented in a visual way that reflects

More information

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts BIDM Project Predicting the contract type for IT/ITES outsourcing contracts N a n d i n i G o v i n d a r a j a n ( 6 1 2 1 0 5 5 6 ) The authors believe that data modelling can be used to predict if an

More information

LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING

LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING LAB 4 INSTRUCTIONS CONFIDENCE INTERVALS AND HYPOTHESIS TESTING In this lab you will explore the concept of a confidence interval and hypothesis testing through a simulation problem in engineering setting.

More information

TIBCO Spotfire Network Analytics 1.1. User s Manual

TIBCO Spotfire Network Analytics 1.1. User s Manual TIBCO Spotfire Network Analytics 1.1 User s Manual Revision date: 26 January 2009 Important Information SOME TIBCO SOFTWARE EMBEDS OR BUNDLES OTHER TIBCO SOFTWARE. USE OF SUCH EMBEDDED OR BUNDLED TIBCO

More information

Exploratory Data Analysis

Exploratory Data Analysis Exploratory Data Analysis Johannes Schauer johannes.schauer@tugraz.at Institute of Statistics Graz University of Technology Steyrergasse 17/IV, 8010 Graz www.statistics.tugraz.at February 12, 2008 Introduction

More information

Advanced Microsoft Excel 2010

Advanced Microsoft Excel 2010 Advanced Microsoft Excel 2010 Table of Contents THE PASTE SPECIAL FUNCTION... 2 Paste Special Options... 2 Using the Paste Special Function... 3 ORGANIZING DATA... 4 Multiple-Level Sorting... 4 Subtotaling

More information

Tutorial Customer Lifetime Value

Tutorial Customer Lifetime Value MARKETING ENGINEERING FOR EXCEL TUTORIAL VERSION 150211 Tutorial Customer Lifetime Value Marketing Engineering for Excel is a Microsoft Excel add-in. The software runs from within Microsoft Excel and only

More information

Chapter 5. Random variables

Chapter 5. Random variables Random variables random variable numerical variable whose value is the outcome of some probabilistic experiment; we use uppercase letters, like X, to denote such a variable and lowercase letters, like

More information

A Non-Linear Schema Theorem for Genetic Algorithms

A Non-Linear Schema Theorem for Genetic Algorithms A Non-Linear Schema Theorem for Genetic Algorithms William A Greene Computer Science Department University of New Orleans New Orleans, LA 70148 bill@csunoedu 504-280-6755 Abstract We generalize Holland

More information

Creating Contrast Variables

Creating Contrast Variables Chapter 112 Creating Contrast Variables Introduction The Contrast Variable tool in NCSS can be used to create contrasts and/or binary variables for use in various analyses. This chapter will provide information

More information

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not. Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C

More information

Section IV.1: Recursive Algorithms and Recursion Trees

Section IV.1: Recursive Algorithms and Recursion Trees Section IV.1: Recursive Algorithms and Recursion Trees Definition IV.1.1: A recursive algorithm is an algorithm that solves a problem by (1) reducing it to an instance of the same problem with smaller

More information

Similarity Search in a Very Large Scale Using Hadoop and HBase

Similarity Search in a Very Large Scale Using Hadoop and HBase Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France

More information

Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering

Engineering Problem Solving and Excel. EGN 1006 Introduction to Engineering Engineering Problem Solving and Excel EGN 1006 Introduction to Engineering Mathematical Solution Procedures Commonly Used in Engineering Analysis Data Analysis Techniques (Statistics) Curve Fitting techniques

More information

Creating Charts in Microsoft Excel A supplement to Chapter 5 of Quantitative Approaches in Business Studies

Creating Charts in Microsoft Excel A supplement to Chapter 5 of Quantitative Approaches in Business Studies Creating Charts in Microsoft Excel A supplement to Chapter 5 of Quantitative Approaches in Business Studies Components of a Chart 1 Chart types 2 Data tables 4 The Chart Wizard 5 Column Charts 7 Line charts

More information

RADIOTHERAPY COST ESTIMATOR USER GUIDE

RADIOTHERAPY COST ESTIMATOR USER GUIDE RADIOTHERAPY COST ESTIMATOR USER GUIDE The originating section of this publication in the IAEA was: Applied Radiation Biology and Radiotherapy Section International Atomic Energy Agency Wagramerstrasse

More information

Tutorial for proteome data analysis using the Perseus software platform

Tutorial for proteome data analysis using the Perseus software platform Tutorial for proteome data analysis using the Perseus software platform Laboratory of Mass Spectrometry, LNBio, CNPEM Tutorial version 1.0, January 2014. Note: This tutorial was written based on the information

More information

Cluster Analysis using R

Cluster Analysis using R Cluster analysis or clustering is the task of assigning a set of objects into groups (called clusters) so that the objects in the same cluster are more similar (in some sense or another) to each other

More information

MACHINE LEARNING IN HIGH ENERGY PHYSICS

MACHINE LEARNING IN HIGH ENERGY PHYSICS MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!

More information

Evaluation & Validation: Credibility: Evaluating what has been learned

Evaluation & Validation: Credibility: Evaluating what has been learned Evaluation & Validation: Credibility: Evaluating what has been learned How predictive is a learned model? How can we evaluate a model Test the model Statistical tests Considerations in evaluating a Model

More information

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining Data Mining: Exploring Data Lecture Notes for Chapter 3 Introduction to Data Mining by Tan, Steinbach, Kumar What is data exploration? A preliminary exploration of the data to better understand its characteristics.

More information