Two Heads Better Than One: Metric+Active Learning and Its Applications for IT Service Classification
2009 Ninth IEEE International Conference on Data Mining

Fei Wang 1, Jimeng Sun 2, Tao Li 1, Nikos Anerousis 2
1 School of Computing and Information Sciences, Florida International University, Miami, FL
2 IBM T. J. Watson Research Center, Service Department, Yorktown Heights, NY

Abstract: Large IT service providers track service requests and their execution through problem/change tickets. It is important to classify the tickets based on the problem/change description in order to understand service quality and to optimize service processes. However, two challenges arise in solving this classification problem: 1) ticket descriptions from different classes have highly diverse characteristics, which invalidates most standard distance metrics; 2) it is very expensive to obtain high-quality labeled data. To address these challenges, we develop two seemingly independent methods, 1) Discriminative Neighborhood Metric Learning (DNML) and 2) Active Learning with Median Selection (ALMS), both of which are, however, based on the same core technique: iterated representative selection. A case study on a real IT service classification application is presented to demonstrate the effectiveness and efficiency of our proposed methods.

I. INTRODUCTION

In IT outsourcing environments, thousands of problem and change requests are generated every day on a diverse set of issues related to all kinds of software and hardware. Those requests need to be resolved correctly and quickly in order to meet service level agreements (SLAs). In this environment, when devices fail or applications need to be upgraded, the person recovering the failure or applying the patch is likely to be sitting thousands of miles from the affected component. The reliance on outsourcing of technology support has fundamentally shifted the dependencies between participating organizations.
In such a complex environment, all service requests are handled and tracked through tickets by various problem and change management systems. A ticket is opened with a certain symptom description and routed to the appropriate Service Matter Experts (SMEs) for resolution. The solution is documented when the ticket is closed. The job of SMEs is to resolve tickets quickly and correctly in order to meet SLAs. Another important job role is that of Quality Analysts (QAs), whose responsibility is to analyze recent tickets in order to identify opportunities for service-quality improvement. For example, frequent password resets on one system may be due to an inconsistent password period on that system; instead of resetting the password every time, the password period should be adjusted properly. Or sometimes a patch or fix for a particular server should also be applied to all servers with the same configuration, instead of creating multiple tickets with exactly the same work order for each of those systems. Identifying such optimization opportunities requires QAs to have a thorough understanding of the current ticket distribution. In other words, it is important to classify those tickets, based on their description and resolution, accurately and in a timely manner. In this paper, we address the ticket classification problem that lies at the core of IT service quality improvement, with significant practical impact and great challenges:

- Standard distance metrics do not apply, due to the diverse characteristics of the raw features. More specifically, the ticket descriptions and solutions are highly diverse and noisy: different SMEs can describe the same problem quite differently, the descriptions are typically short and noisy, and, depending on the type of problem, the description can vary significantly.

- There are almost no high-quality labeled data.
Tickets are handled by SMEs, who often do not have the incentive or ability to classify the tickets accurately, due to heavy workload and incomplete information. On the other hand, QAs, who have the right ability and incentive, do not have the cycles to manually label all the tickets.

To address these two challenges, we propose a novel hybrid approach that leverages both active learning and metric learning. The contributions of this paper are the following:

- We propose Discriminative Neighborhood Metric Learning (DNML), which learns a domain-specific distance metric using the overall data distribution and a small set of labeled data.
- We propose Active Learning with Median Selection (ALMS), which progressively selects the representative data points that need to be labeled, and which is naturally a multi-class algorithm.
- We combine the metric and active learning steps into a unified process over the data. Moreover, our algorithm can automatically detect the number of classes contained in the data set.
- We demonstrate the effectiveness and efficiency of DNML and ALMS on several data sets in comparison with several existing methods.
The rest of the paper is organized as follows: Section II presents the methods for metric and active learning. Section III demonstrates a practical case study of the proposed methods on an IT ticket classification application. Section IV reviews related work. Finally, Section V concludes.

II. METRIC+ACTIVE LEARNING

In this section we introduce our algorithm in detail. First we give an overview of the algorithm.

A. Algorithm Overview

The basic procedure of our algorithm is to iterate the following steps:

1) Learn a distance metric from the labeled data set, and then classify the unlabeled data points using the nearest neighbor classifier with the learned metric. We call the data points in the same class a cluster.

2) Select the median from each cluster. For each cluster X_i, we partition it into a labeled set X_i^L (whose labels are given initially) and an unlabeled set X_i^U (whose labels are predicted by the nearest neighbor classifier). Then the median for X_i is defined as

    m_i = arg min_{x ∈ X_i^U} ||x − c_i||^2    (1)

where c_i is the mean of X_i.

3) Add the selected points into the labeled set.

Fig. 1 shows a graphical view of the basic algorithm flowchart.

Figure 1. The basic algorithm flowchart, where the red blobs represent the labeled points and the gray blobs correspond to unlabeled points.

B. Discriminative Neighborhood Metric Learning

A good distance metric plays a central role in many data mining and machine learning algorithms (e.g., the nearest neighbor classifier and the k-means algorithm). Usually the Euclidean distance cannot satisfy our requirements, because of its homogeneity assumption over all feature dimensions. Therefore many researchers propose to learn a Mahalanobis distance, which measures the distance between data points x_i and x_j by

    d_m(x_i, x_j) = (x_i − x_j)^T C (x_i − x_j)    (2)

where C ∈ R^{d×d} is a positive semi-definite covariance matrix used to incorporate the correlations between different feature dimensions.
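The iterated procedure above can be sketched as follows. This is a minimal NumPy sketch, not the paper's implementation: it assumes a low-rank metric factor `W` (so that C = W W^T) has already been learned by some metric-learning step, and implements the 1-NN classification and the median selection of Eq. (1).

```python
import numpy as np

def nn_classify(X, y, labeled_mask, W):
    """Step 1: classify unlabeled points with 1-NN under the learned
    metric d(x_i, x_j) = ||W^T (x_i - x_j)||^2, i.e., C = W W^T."""
    Z = X @ W                          # project once; Euclidean distance in Z-space
    out = y.copy()
    labeled = np.where(labeled_mask)[0]
    for i in np.where(~labeled_mask)[0]:
        nearest = labeled[np.argmin(((Z[labeled] - Z[i]) ** 2).sum(axis=1))]
        out[i] = y[nearest]
    return out

def select_medians(X, labels, labeled_mask):
    """Step 2: for each cluster X_i, pick the unlabeled point closest to
    the cluster mean c_i, i.e., m_i = argmin_{x in X_i^U} ||x - c_i||^2 (Eq. 1)."""
    medians = []
    for c in np.unique(labels):
        cluster = np.where(labels == c)[0]
        unlabeled = cluster[~labeled_mask[cluster]]
        if len(unlabeled) == 0:
            continue
        c_i = X[cluster].mean(axis=0)  # mean of the whole cluster X_i
        medians.append(unlabeled[np.argmin(((X[unlabeled] - c_i) ** 2).sum(axis=1))])
    return medians
```

Step 3 then simply marks each selected median as labeled (after querying its label) before the next iteration.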
In this paper, we consider learning a low-rank covariance matrix C such that, under the learned distance metric, within-class data become compact while between-class data become scattered. Unlike linear discriminant analysis [1], which seeks a discriminant subspace in a global sense, our algorithm aims to learn a distance metric with enhanced local discriminability. To define such local discriminability, we first introduce two types of neighborhoods [8]:

Definition 1 (Homogeneous Neighborhood). The homogeneous neighborhood of x_i, denoted N_i^o, is the set of the |N_i^o| nearest data points of x_i with the same label, where |N_i^o| is the size of N_i^o.

Definition 2 (Heterogeneous Neighborhood). The heterogeneous neighborhood of x_i, denoted N_i^e, is the set of the |N_i^e| nearest data points of x_i with different labels, where |N_i^e| is the size of N_i^e.

Based on the above two definitions, we can define the local compactness of point x_i as

    C_i = Σ_{j: x_j ∈ N_i^o} d_m^2(x_i, x_j)    (3)

and the local scatterness of point x_i as

    S_i = Σ_{k: x_k ∈ N_i^e} d_m^2(x_i, x_k)    (4)

Then the local discriminability of the data set X with respect to the distance metric d_m can be defined as

    J = (Σ_i C_i) / (Σ_i S_i)
      = (Σ_i Σ_{j: x_j ∈ N_i^o} (x_i − x_j)^T C (x_i − x_j)) / (Σ_i Σ_{k: x_k ∈ N_i^e} (x_i − x_k)^T C (x_i − x_k))    (5)

The goal of our algorithm is to minimize J, which is equivalent to simultaneously minimizing the local compactness and maximizing the local scatterness. Fig. 2 provides an intuitive graphical illustration of the idea behind our algorithm.

However, minimizing J in Eq. (5) to obtain an optimal C is not an easy task, as there are d(d+1)/2 variables to solve for, given that C is symmetric. Recall that we require C to be a low-rank positive semi-definite matrix; by incomplete Cholesky factorization, we can decompose C as

    C = WW^T    (6)

where W ∈ R^{d×r} and r is the rank of C. In this way, we only need to solve for W instead of the entire C.
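The two neighborhoods and the numerator/denominator of J in Eq. (5) can be assembled as follows. This is a dense O(n^2 d) sketch for small data sets (not the paper's implementation); the neighborhoods are computed under the plain Euclidean distance, as in an initial iteration before any metric has been learned.

```python
import numpy as np

def neighborhoods(X, y, n_o, n_e):
    """For every x_i, return the indices of its n_o nearest same-label
    points (homogeneous neighborhood N_i^o) and its n_e nearest
    different-label points (heterogeneous neighborhood N_i^e)."""
    D = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    homo, hetero = [], []
    for i in range(len(X)):
        same = np.array([j for j in range(len(X)) if y[j] == y[i] and j != i])
        diff = np.array([k for k in range(len(X)) if y[k] != y[i]])
        homo.append(same[np.argsort(D[i, same])[:n_o]])
        hetero.append(diff[np.argsort(D[i, diff])[:n_e]])
    return homo, hetero

def scatter_matrices(X, homo, hetero):
    """Accumulate the compactness and scatterness matrices
    M_C = sum_i sum_{j in N_i^o} (x_i - x_j)(x_i - x_j)^T,
    M_S = sum_i sum_{k in N_i^e} (x_i - x_k)(x_i - x_k)^T,
    so that J = tr(W^T M_C W) / tr(W^T M_S W) when C = W W^T."""
    d = X.shape[1]
    M_C, M_S = np.zeros((d, d)), np.zeros((d, d))
    for i, x in enumerate(X):
        for j in homo[i]:
            v = x - X[j]
            M_C += np.outer(v, v)
        for k in hetero[i]:
            v = x - X[k]
            M_S += np.outer(v, v)
    return M_C, M_S
```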
(a) The neighborhoods of x_i under the Euclidean distance. (b) The neighborhoods of x_i under the learned distance.

Figure 2. The homogeneous and heterogeneous neighborhoods of x_i with the regular Euclidean and learned Mahalanobis distance metrics. The dark green blob in the middle of the circle is x_i, and the green and blue blobs correspond to the points in N_i^o and N_i^e. The goal of DNML is to learn a distance metric that pulls the points in N_i^o towards x_i while pushing the points in N_i^e away from x_i.

Table I
DECOMPOSED NEWTON'S PROCEDURE FOR SOLVING THE TRACE RATIO PROBLEM

Input: matrices M_C and M_S, precision ε, dimension d
Output: trace ratio value λ* and matrix W
Procedure:
1. Initialize λ_0 = 0, t = 0
2. Perform an eigenvalue decomposition of M_C − λ_t M_S
3. Let (β_k(λ), w_k(λ)) be the k-th eigenvalue/eigenvector pair obtained in step 2, and define the first-order Taylor expansion β̂_k(λ) = β_k(λ_t) + β'_k(λ_t)(λ − λ_t), where β'_k(λ_t) = −w_k(λ_t)^T M_S w_k(λ_t)
4. Define f̂_d(λ) to be the sum of the d smallest β̂_k(λ); solve f̂_d(λ) = 0 and set the root to be λ_{t+1}
5. If |λ_{t+1} − λ_t| < ε, go to step 6; otherwise set t = t + 1 and go to step 2
6. Output λ* = λ_t, and W as the eigenvectors corresponding to the d smallest eigenvalues of M_C − λ* M_S

Combining Eq. (6) and Eq. (5), we can derive the following optimization problem:

    min_W tr(W^T M_C W) / tr(W^T M_S W)    (7)

where

    M_C = Σ_i Σ_{j: x_j ∈ N_i^o} (x_i − x_j)(x_i − x_j)^T    (8)

    M_S = Σ_i Σ_{k: x_k ∈ N_i^e} (x_i − x_k)(x_i − x_k)^T    (9)

are the compactness and scatterness matrices, respectively. Problem (7) is thus a trace quotient minimization problem, and we can make use of the decomposed Newton's method [3], summarized in Table I.

C. Active Learning with Median Selection

Recently a novel active learning method called Transductive Experimental Design (TED) [11] was proposed, which aims to select the k most representative points in the data set.
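The trace ratio problem (7) can also be solved by a simpler fixed-point iteration, which is a sketch of the same idea as Table I but without the Taylor-expansion root-finding step: at each iteration, take the d eigenvectors of M_C − λ_t M_S with smallest eigenvalues, then update λ as the resulting trace ratio. The ratio decreases monotonically, so the iteration converges.

```python
import numpy as np

def trace_ratio(M_C, M_S, d, eps=1e-8, max_iter=100):
    """Minimize tr(W^T M_C W) / tr(W^T M_S W) over orthonormal W with d
    columns, via fixed-point iteration: W_t spans the d smallest
    eigenvectors of M_C - lambda_t * M_S, then lambda is re-evaluated."""
    lam = 0.0
    W = None
    for _ in range(max_iter):
        vals, vecs = np.linalg.eigh(M_C - lam * M_S)   # ascending eigenvalues
        W = vecs[:, :d]                                # d smallest eigenvectors
        new_lam = np.trace(W.T @ M_C @ W) / np.trace(W.T @ M_S @ W)
        if abs(new_lam - lam) < eps:
            break
        lam = new_lam
    return lam, W
```

On diagonal inputs the optimum is easy to read off: with M_C = diag(1, 4, 9) and M_S = I, the best single direction is the first axis with ratio 1, and the best two-dimensional subspace attains ratio (1 + 4)/2 = 2.5.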
Despite the theoretical soundness and empirical success of TED, there are still some limitations:

- Although the name suggests TED is transductive, it does not make use of any label information contained in the data set. In fact, TED just uses the whole data set (labeled and unlabeled) to select the k most representative points, such that the linear reconstruction loss of the whole data set from the selected points is minimized. In this sense, TED is an unsupervised method.
- As the authors analyzed in [11], TED tends to select data points with large norms (which the authors argue are hard to predict). However, such points lie on the border of the data distribution, and could be outliers that mislead the classification process.

Based on the above analysis, we propose to (1) make use of the label information and (2) select the representative points locally. Specifically, we first learn a distance metric using the DNML method introduced in the last section and then apply the nearest neighbor classifier to classify the unlabeled points. In this way, the whole data set is partitioned into several classes, and for each class we simply select the median point as defined in Eq. (1).

Figure 3. Active learning results on a toy example with four classes. (a) shows the results of Transductive Experimental Design; (b) shows the results of our local median selection method.

Fig. 3 illustrates a toy example of the difference between TED and our local median selection method. The data set here is generated from 4 Gaussians, and we treat each Gaussian as a class. Initially we randomly label 1% of the data points and use TED to select the 4 most representative data points, shown as black triangles in Fig. 3(a), from which we can see that these points all lie on the border of the Gaussians.
Fig. 3(b) shows the results of our local median selection method: we first apply DNML to learn a proper distance metric from the labeled points, then use that metric to classify the whole data set, and finally select one median from each class. From the figure we observe that the selected points are representative of each Gaussian.

An issue worth mentioning here is that our algorithm can in fact be viewed as an approximated version of
local TED: we first partition the data set into several local regions using the learned distance metric, and then select exactly ONE representative point in each region. As the data mean is the most representative point for a set of data in the sense of Euclidean loss, we select the median, i.e., the candidate point closest to the data mean. The whole algorithm procedure is summarized in Table II.

Table II
THE METRIC+ACTIVE LEARNING ALGORITHM

Inputs: training data, |N_i^o|, |N_i^e|, precision ε, dimension d, number of iteration steps T
Outputs: the selected points and the learned W
Procedure:
for t = 1 : T
1. Construct M_S and M_C from the training data
2. Learn a proper distance metric
3. Count the number of classes k in the training data, and apply the learned metric to classify the unlabeled data using the nearest neighbor classifier
4. Select the median in each class and add the medians to the training data pool
end

III. TICKET CLASSIFICATION: A CASE STUDY

In this section we present detailed experimental results on applying our proposed active learning scheme to ticket classification. First we describe the basic characteristics of the data set.

A. The Data Set

There are 4182 tickets in total, from 27 classes. We use bag-of-words features, which results in a 3882-dimensional feature space. After eliminating duplicate and null tickets, 2222 tickets remain. The class distribution is shown in Fig. 4(a), from which we can observe that the classes are highly imbalanced and there are many rare classes with only a few data points. We identify a class as a rare class if and only if the number of data points it contains is less than 20. In our experiments, we eliminate those rare classes, which results in a data set of size 2161 from the remaining classes; the class distribution is shown in Fig. 4(b). Besides rare classes, we also observe that the data set is highly sparse and contains a set of rare features.
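The preprocessing described in this section (dropping rare classes, rare features, and the tickets left empty by the filtering) can be sketched as follows on a nonnegative bag-of-words count matrix. The thresholds are illustrative parameters, not necessarily the exact values used in the experiments.

```python
import numpy as np

def filter_rare(X, y, min_class_size=20, min_feature_count=10):
    """Drop classes with fewer than min_class_size tickets, then features
    whose total count over the data set is below min_feature_count, and
    finally tickets left with no surviving feature (all-zero rows)."""
    classes, counts = np.unique(y, return_counts=True)
    keep_cls = np.isin(y, classes[counts >= min_class_size])
    X, y = X[keep_cls], y[keep_cls]
    keep_feat = X.sum(axis=0) >= min_feature_count   # total appearances per feature
    X = X[:, keep_feat]
    keep_row = X.sum(axis=1) > 0                     # tickets with only rare features
    return X[keep_row], y[keep_row]
```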
The original feature distribution is shown in Fig. 5(a), where we accumulate the number of times each feature appears in each class. We identify a feature as a rare feature if and only if its total number of appearances in the data set is less than 10. After eliminating those rare features, we obtain a data set with 669 features, whose distribution is shown in Fig. 5(b). Finally, we also eliminate the tickets containing only rare features, which leaves a final data set of 2130 tickets with 669 features.

Figure 4. Class distribution of the original data (a) and of the data after eliminating the rare classes (b).

Figure 5. Feature distribution of the original data (a) and of the data with rare features eliminated (b).

B. Distance Metric Learning

In this part of the experiments, we first test the effectiveness of our DNML algorithm on the ticket data set: we use our algorithm to obtain a distance metric, and then use that metric to perform nearest neighbor classification and obtain the final classification accuracy. This procedure is repeated over multiple independent runs, and we report the average classification accuracies and standard deviations in Fig. 6. The sizes of the homogeneous and heterogeneous neighborhoods are set to 3 manually, and the rank of the covariance matrix C is set to 4. From the figure we observe the superiority of our metric learning method: with the learned metric, DNML clearly outperforms the original NN method, which validates that DNML can learn a better distance metric. There are two sets of parameters in our DNML method: the rank r of the covariance matrix C, and the sizes of the homogeneous neighborhood N^o and heterogeneous neighborhood N^e (denoted n_o and n_e).
Therefore we also conducted a set of experiments to test the sensitivity of DNML with respect to those parameters. Fig. 7 shows how the algorithm's performance varies with the rank of the covariance matrix C, where we randomly label half of the tickets as the training set and use the remaining tickets for testing. We set the sizes of N^o and N^e to 3. The results in Fig. 7 are summarized over independent runs. From the figure we can see that
the final ticket classification results are stable with respect to the choice of the rank of the covariance matrix, except when the rank is too small (i.e., 1 in our case), since then too much information is lost. When the rank becomes too large, some noise contained in the data set may be retained, so the performance of our algorithm drops slightly; choices of r ∈ [2, 4] are all reasonable.

Figure 6. Classification accuracy comparison with different supervised learning methods (DNML, NN, NB, RLS, SVM). The x-axis represents the percentage of randomly labeled tickets, and the y-axis denotes the averaged classification accuracy.

Figure 7. The sensitivity of our algorithm's performance with respect to the rank r of the covariance matrix C. We set |N_i^o| = |N_i^e| = 3, and half of the data set is labeled as training data.

We also test the sensitivity of our algorithm with respect to the sizes of N_i^o and N_i^e; the results are shown in Fig. 8, where the x-axis and y-axis correspond to the sizes of N_i^o and N_i^e, and the z-axis denotes the classification accuracy averaged over independent runs. Here we assume that the sizes of the homogeneous and heterogeneous neighborhoods are the same for all data points. For each run, we randomly label 50% of the tickets as training data and use the rest as testing data. From Fig. 8 we can clearly see that the whole surface z = f(x, y) is flat, which means that the performance of our algorithm is not that sensitive to the variation of N_i^o and N_i^e. We can also see that when the neighborhood sizes are small, the algorithm performs better than when they are large. This is possibly because the distribution of the data set is complicated and data in different classes highly overlap; when we enlarge the neighborhoods to include more data points, the learned distance metric may be corrupted by noisy points, making the final classification results inaccurate.

Figure 8. The sensitivity of our algorithm's performance with respect to the choices of N_i^o and N_i^e; half of the data set is labeled as training data.

C. Integrated Active Learning and Distance Metric Learning

In our implementation, we initially label 2% of the data set, and then apply the various active learning methods. For each method, we select as many points from the unlabeled set in each round as there are classes in the ticket data set. For all approaches that use DNML, we set |N^o| = |N^e| = 3, and the rank of the covariance matrix is set to 4. Fig. 9 shows the results of these algorithms summarized over independent runs, where the x-axis represents the percentage of selected points, and the y-axis denotes the averaged classification accuracy along with the standard deviation. From the figure we can clearly see that with our DNML+LMED method, the classification accuracy ascends faster than with the other methods.

IV. RELATED WORK

In this section we briefly review previous work closely related to our metric+active learning method.

A. Distance Metric Learning

Distance metric learning plays a central role in real-world applications. According to [10], these approaches can mainly be categorized into two classes: unsupervised and supervised. Here we mainly review the supervised methods, which learn a distance metric from the data set using some supervised information. Usually the information takes the
Figure 9. Classification accuracy vs. number of selected tickets for DNML+LMED, LMED, DNML+Rand, and DNML+TED. The x-axis represents the percentage of actively labeled tickets, and the y-axis represents the classification accuracy averaged over independent runs.

form of pairwise constraints, which indicate whether a pair of data points belongs to the same class (usually referred to as must-link constraints) or to different classes (cannot-link constraints). These algorithms aim to learn a proper distance metric under which the data with must-link constraints are as compact as possible, while the data with cannot-link constraints are far apart from each other. Some typical approaches include the side-information method [9], Relevant Component Analysis (RCA) [5], and Discriminant Component Analysis (DCA) [2]. Our Discriminative Neighborhood Metric Learning (DNML) method can also be viewed as a supervised method; however, we make use of the labeled data together with their labels, which differs from using pairwise constraints.

B. Active Learning

In many real-world problems, unlabeled data are abundant but labels are expensive to obtain (e.g., in text classification it is expensive and time-consuming to ask users to label documents manually, while it is quite easy to obtain a large amount of unlabeled documents by crawling the web). In such a scenario the learning algorithm can actively query the user/teacher for labels. This type of iterative supervised learning is called active learning. Since the learner chooses the examples, the number of examples needed to learn a concept can often be much lower than in normal supervised learning. Two classical active learning algorithms are Tong and Koller's Simple SVM algorithm [6] and Seung et al.'s Query By Committee (QBC) algorithm [4].
However, the Simple SVM algorithm is coupled with the Support Vector Machine (SVM) classifier [7] and is only applicable to two-class problems. For the QBC algorithm, one needs to construct a committee of models that represent different regions of the version space and define some measure of disagreement among committee members, which is usually difficult in real-world applications. Recently, Yu et al. [11] proposed another active learning algorithm called Transductive Experimental Design (TED), which aims to find the most representative points, i.e., those that can optimally reconstruct the whole data set in the Euclidean sense. Our median selection strategy introduced in this paper is similar in spirit to TED, and we analyzed the advantages of our algorithm over TED in Section II-C.

V. CONCLUSIONS

We presented a novel metric+active learning method for IT service ticket classification in this paper. Our method combines the strengths of metric learning and active learning. Finally, experimental results on both benchmark and real ticket data sets were presented to demonstrate the effectiveness of the proposed method.

ACKNOWLEDGEMENT

The work is partially supported by NSF CAREER Award IIS-4628 and a 2008 IBM Faculty Award.

REFERENCES

[1] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, San Diego, California, 1990.
[2] S. Hoi, W. Liu, M. Lyu, and W. Ma. Learning distance metrics with contextual constraints for image retrieval. In Proceedings of CVPR, 2006.
[3] Y. Jia, F. Nie, and C. Zhang. Trace ratio problem revisited. IEEE Transactions on Neural Networks, 2009.
[4] H. S. Seung, M. Opper, and H. Sompolinsky. Query by committee. In Proceedings of COLT, 1992.
[5] N. Shental, T. Hertz, D. Weinshall, and M. Pavel. Adjustment learning and relevant component analysis. In Proceedings of ECCV, 2002.
[6] S. Tong and D. Koller. Support vector machine active learning with applications to text classification.
Journal of Machine Learning Research, 2:45-66, 2001.
[7] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, NY, USA, 1995.
[8] F. Wang and C. Zhang. Feature extraction by maximizing the average neighborhood margin. In Proceedings of CVPR, 2007.
[9] E. P. Xing, A. Y. Ng, M. I. Jordan, and S. Russell. Distance metric learning, with application to clustering with side-information. In Advances in Neural Information Processing Systems, volume 15, 2003.
[10] L. Yang. Distance metric learning: A comprehensive survey. Technical report, Michigan State University, 2006.
[11] K. Yu, J. Bi, and V. Tresp. Active learning via transductive experimental design. In Proceedings of ICML, 2006.
More informationSupport Vector Machines with Clustering for Training with Very Large Datasets
Support Vector Machines with Clustering for Training with Very Large Datasets Theodoros Evgeniou Technology Management INSEAD Bd de Constance, Fontainebleau 77300, France theodoros.evgeniou@insead.fr Massimiliano
More informationA Survey on Outlier Detection Techniques for Credit Card Fraud Detection
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661, p- ISSN: 2278-8727Volume 16, Issue 2, Ver. VI (Mar-Apr. 2014), PP 44-48 A Survey on Outlier Detection Techniques for Credit Card Fraud
More informationData Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
More informationUnsupervised learning: Clustering
Unsupervised learning: Clustering Salissou Moutari Centre for Statistical Science and Operational Research CenSSOR 17 th September 2013 Unsupervised learning: Clustering 1/52 Outline 1 Introduction What
More informationTensor Methods for Machine Learning, Computer Vision, and Computer Graphics
Tensor Methods for Machine Learning, Computer Vision, and Computer Graphics Part I: Factorizations and Statistical Modeling/Inference Amnon Shashua School of Computer Science & Eng. The Hebrew University
More informationRecognizing Cats and Dogs with Shape and Appearance based Models. Group Member: Chu Wang, Landu Jiang
Recognizing Cats and Dogs with Shape and Appearance based Models Group Member: Chu Wang, Landu Jiang Abstract Recognizing cats and dogs from images is a challenging competition raised by Kaggle platform
More informationAPPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder
APPM4720/5720: Fast algorithms for big data Gunnar Martinsson The University of Colorado at Boulder Course objectives: The purpose of this course is to teach efficient algorithms for processing very large
More informationAssessment. Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall
Automatic Photo Quality Assessment Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall Estimating i the photorealism of images: Distinguishing i i paintings from photographs h Florin
More informationA Survey on Pre-processing and Post-processing Techniques in Data Mining
, pp. 99-128 http://dx.doi.org/10.14257/ijdta.2014.7.4.09 A Survey on Pre-processing and Post-processing Techniques in Data Mining Divya Tomar and Sonali Agarwal Indian Institute of Information Technology,
More informationData, Measurements, Features
Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are
More information203.4770: Introduction to Machine Learning Dr. Rita Osadchy
203.4770: Introduction to Machine Learning Dr. Rita Osadchy 1 Outline 1. About the Course 2. What is Machine Learning? 3. Types of problems and Situations 4. ML Example 2 About the course Course Homepage:
More informationFinal Project Report
CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes
More informationADVANCED MACHINE LEARNING. Introduction
1 1 Introduction Lecturer: Prof. Aude Billard (aude.billard@epfl.ch) Teaching Assistants: Guillaume de Chambrier, Nadia Figueroa, Denys Lamotte, Nicola Sommer 2 2 Course Format Alternate between: Lectures
More informationAdvanced Ensemble Strategies for Polynomial Models
Advanced Ensemble Strategies for Polynomial Models Pavel Kordík 1, Jan Černý 2 1 Dept. of Computer Science, Faculty of Information Technology, Czech Technical University in Prague, 2 Dept. of Computer
More informationA fast multi-class SVM learning method for huge databases
www.ijcsi.org 544 A fast multi-class SVM learning method for huge databases Djeffal Abdelhamid 1, Babahenini Mohamed Chaouki 2 and Taleb-Ahmed Abdelmalik 3 1,2 Computer science department, LESIA Laboratory,
More informationChapter 6. The stacking ensemble approach
82 This chapter proposes the stacking ensemble approach for combining different data mining classifiers to get better performance. Other combination techniques like voting, bagging etc are also described
More informationData Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining
Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 by Tan, Steinbach, Kumar 1 What is Cluster Analysis? Finding groups of objects such that the objects in a group will
More informationSubspace Analysis and Optimization for AAM Based Face Alignment
Subspace Analysis and Optimization for AAM Based Face Alignment Ming Zhao Chun Chen College of Computer Science Zhejiang University Hangzhou, 310027, P.R.China zhaoming1999@zju.edu.cn Stan Z. Li Microsoft
More informationE-commerce Transaction Anomaly Classification
E-commerce Transaction Anomaly Classification Minyong Lee minyong@stanford.edu Seunghee Ham sham12@stanford.edu Qiyi Jiang qjiang@stanford.edu I. INTRODUCTION Due to the increasing popularity of e-commerce
More informationCalculation of Minimum Distances. Minimum Distance to Means. Σi i = 1
Minimum Distance to Means Similar to Parallelepiped classifier, but instead of bounding areas, the user supplies spectral class means in n-dimensional space and the algorithm calculates the distance between
More informationDistance Metric Learning for Large Margin Nearest Neighbor Classification
Journal of Machine Learning Research 10 (2009) 207-244 Submitted 12/07; Revised 9/08; Published 2/09 Distance Metric Learning for Large Margin Nearest Neighbor Classification Kilian Q. Weinberger Yahoo!
More informationSTATISTICA. Clustering Techniques. Case Study: Defining Clusters of Shopping Center Patrons. and
Clustering Techniques and STATISTICA Case Study: Defining Clusters of Shopping Center Patrons STATISTICA Solutions for Business Intelligence, Data Mining, Quality Control, and Web-based Analytics Table
More informationLearning a Metric during Hierarchical Clustering based on Constraints
Learning a Metric during Hierarchical Clustering based on Constraints Korinna Bade and Andreas Nürnberger Otto-von-Guericke-University Magdeburg, Faculty of Computer Science, D-39106, Magdeburg, Germany
More informationCLASSIFYING NETWORK TRAFFIC IN THE BIG DATA ERA
CLASSIFYING NETWORK TRAFFIC IN THE BIG DATA ERA Professor Yang Xiang Network Security and Computing Laboratory (NSCLab) School of Information Technology Deakin University, Melbourne, Australia http://anss.org.au/nsclab
More informationComponent Ordering in Independent Component Analysis Based on Data Power
Component Ordering in Independent Component Analysis Based on Data Power Anne Hendrikse Raymond Veldhuis University of Twente University of Twente Fac. EEMCS, Signals and Systems Group Fac. EEMCS, Signals
More informationClassification algorithm in Data mining: An Overview
Classification algorithm in Data mining: An Overview S.Neelamegam #1, Dr.E.Ramaraj *2 #1 M.phil Scholar, Department of Computer Science and Engineering, Alagappa University, Karaikudi. *2 Professor, Department
More informationInternational Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014
RESEARCH ARTICLE OPEN ACCESS A Survey of Data Mining: Concepts with Applications and its Future Scope Dr. Zubair Khan 1, Ashish Kumar 2, Sunny Kumar 3 M.Tech Research Scholar 2. Department of Computer
More informationFeature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier
Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier D.Nithya a, *, V.Suganya b,1, R.Saranya Irudaya Mary c,1 Abstract - This paper presents,
More informationHow To Solve The Kd Cup 2010 Challenge
A Lightweight Solution to the Educational Data Mining Challenge Kun Liu Yan Xing Faculty of Automation Guangdong University of Technology Guangzhou, 510090, China catch0327@yahoo.com yanxing@gdut.edu.cn
More informationKnowledge Discovery from patents using KMX Text Analytics
Knowledge Discovery from patents using KMX Text Analytics Dr. Anton Heijs anton.heijs@treparel.com Treparel Abstract In this white paper we discuss how the KMX technology of Treparel can help searchers
More informationGoing Big in Data Dimensionality:
LUDWIG- MAXIMILIANS- UNIVERSITY MUNICH DEPARTMENT INSTITUTE FOR INFORMATICS DATABASE Going Big in Data Dimensionality: Challenges and Solutions for Mining High Dimensional Data Peer Kröger Lehrstuhl für
More informationHow To Identify A Churner
2012 45th Hawaii International Conference on System Sciences A New Ensemble Model for Efficient Churn Prediction in Mobile Telecommunication Namhyoung Kim, Jaewook Lee Department of Industrial and Management
More informationPerformance Metrics for Graph Mining Tasks
Performance Metrics for Graph Mining Tasks 1 Outline Introduction to Performance Metrics Supervised Learning Performance Metrics Unsupervised Learning Performance Metrics Optimizing Metrics Statistical
More informationPredict Influencers in the Social Network
Predict Influencers in the Social Network Ruishan Liu, Yang Zhao and Liuyu Zhou Email: rliu2, yzhao2, lyzhou@stanford.edu Department of Electrical Engineering, Stanford University Abstract Given two persons
More informationMaximum Margin Clustering
Maximum Margin Clustering Linli Xu James Neufeld Bryce Larson Dale Schuurmans University of Waterloo University of Alberta Abstract We propose a new method for clustering based on finding maximum margin
More informationStandardization and Its Effects on K-Means Clustering Algorithm
Research Journal of Applied Sciences, Engineering and Technology 6(7): 399-3303, 03 ISSN: 040-7459; e-issn: 040-7467 Maxwell Scientific Organization, 03 Submitted: January 3, 03 Accepted: February 5, 03
More informationDistances, Clustering, and Classification. Heatmaps
Distances, Clustering, and Classification Heatmaps 1 Distance Clustering organizes things that are close into groups What does it mean for two genes to be close? What does it mean for two samples to be
More informationDomain Classification of Technical Terms Using the Web
Systems and Computers in Japan, Vol. 38, No. 14, 2007 Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J89-D, No. 11, November 2006, pp. 2470 2482 Domain Classification of Technical Terms Using
More informationHow To Perform An Ensemble Analysis
Charu C. Aggarwal IBM T J Watson Research Center Yorktown, NY 10598 Outlier Ensembles Keynote, Outlier Detection and Description Workshop, 2013 Based on the ACM SIGKDD Explorations Position Paper: Outlier
More informationCluster Analysis. Isabel M. Rodrigues. Lisboa, 2014. Instituto Superior Técnico
Instituto Superior Técnico Lisboa, 2014 Introduction: Cluster analysis What is? Finding groups of objects such that the objects in a group will be similar (or related) to one another and different from
More informationFUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM
International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT
More informationLearning with Local and Global Consistency
Learning with Local and Global Consistency Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf Max Planck Institute for Biological Cybernetics, 7276 Tuebingen, Germany
More informationLearning with Local and Global Consistency
Learning with Local and Global Consistency Dengyong Zhou, Olivier Bousquet, Thomas Navin Lal, Jason Weston, and Bernhard Schölkopf Max Planck Institute for Biological Cybernetics, 7276 Tuebingen, Germany
More informationMachine Learning for Data Science (CS4786) Lecture 1
Machine Learning for Data Science (CS4786) Lecture 1 Tu-Th 10:10 to 11:25 AM Hollister B14 Instructors : Lillian Lee and Karthik Sridharan ROUGH DETAILS ABOUT THE COURSE Diagnostic assignment 0 is out:
More informationMulticlass Classification. 9.520 Class 06, 25 Feb 2008 Ryan Rifkin
Multiclass Classification 9.520 Class 06, 25 Feb 2008 Ryan Rifkin It is a tale Told by an idiot, full of sound and fury, Signifying nothing. Macbeth, Act V, Scene V What Is Multiclass Classification? Each
More informationJava Modules for Time Series Analysis
Java Modules for Time Series Analysis Agenda Clustering Non-normal distributions Multifactor modeling Implied ratings Time series prediction 1. Clustering + Cluster 1 Synthetic Clustering + Time series
More informationInternational Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015
RESEARCH ARTICLE OPEN ACCESS Data Mining Technology for Efficient Network Security Management Ankit Naik [1], S.W. Ahmad [2] Student [1], Assistant Professor [2] Department of Computer Science and Engineering
More informationA General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions
A General Framework for Mining Concept-Drifting Data Streams with Skewed Distributions Jing Gao Wei Fan Jiawei Han Philip S. Yu University of Illinois at Urbana-Champaign IBM T. J. Watson Research Center
More informationData Mining Project Report. Document Clustering. Meryem Uzun-Per
Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...
More informationThe Data Mining Process
Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data
More informationSearch Taxonomy. Web Search. Search Engine Optimization. Information Retrieval
Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!
More informationInteractive Machine Learning. Maria-Florina Balcan
Interactive Machine Learning Maria-Florina Balcan Machine Learning Image Classification Document Categorization Speech Recognition Protein Classification Branch Prediction Fraud Detection Spam Detection
More informationIN this paper we focus on the problem of large-scale, multiclass
IEEE TRANSACTIONS ON PATTERN RECOGNITION AND MACHINE INTELLIGENCE 1 Distance-Based Image Classification: Generalizing to new classes at near-zero cost Thomas Mensink, Member IEEE, Jakob Verbeek, Member,
More informationIMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH
IMPROVING DATA INTEGRATION FOR DATA WAREHOUSE: A DATA MINING APPROACH Kalinka Mihaylova Kaloyanova St. Kliment Ohridski University of Sofia, Faculty of Mathematics and Informatics Sofia 1164, Bulgaria
More informationUsing Data Mining for Mobile Communication Clustering and Characterization
Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer
More informationData Exploration and Preprocessing. Data Mining and Text Mining (UIC 583 @ Politecnico di Milano)
Data Exploration and Preprocessing Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann
More informationCross-Validation. Synonyms Rotation estimation
Comp. by: BVijayalakshmiGalleys0000875816 Date:6/11/08 Time:19:52:53 Stage:First Proof C PAYAM REFAEILZADEH, LEI TANG, HUAN LIU Arizona State University Synonyms Rotation estimation Definition is a statistical
More informationManifold Learning Examples PCA, LLE and ISOMAP
Manifold Learning Examples PCA, LLE and ISOMAP Dan Ventura October 14, 28 Abstract We try to give a helpful concrete example that demonstrates how to use PCA, LLE and Isomap, attempts to provide some intuition
More informationReference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors
Classification k-nearest neighbors Data Mining Dr. Engin YILDIZTEPE Reference Books Han, J., Kamber, M., Pei, J., (2011). Data Mining: Concepts and Techniques. Third edition. San Francisco: Morgan Kaufmann
More information. Learn the number of classes and the structure of each class using similarity between unlabeled training patterns
Outline Part 1: of data clustering Non-Supervised Learning and Clustering : Problem formulation cluster analysis : Taxonomies of Clustering Techniques : Data types and Proximity Measures : Difficulties
More informationAUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S.
AUTOMATION OF ENERGY DEMAND FORECASTING by Sanzad Siddique, B.S. A Thesis submitted to the Faculty of the Graduate School, Marquette University, in Partial Fulfillment of the Requirements for the Degree
More informationAn Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
More informationNetwork Intrusion Detection using Semi Supervised Support Vector Machine
Network Intrusion Detection using Semi Supervised Support Vector Machine Jyoti Haweliya Department of Computer Engineering Institute of Engineering & Technology, Devi Ahilya University Indore, India ABSTRACT
More informationClustering. 15-381 Artificial Intelligence Henry Lin. Organizing data into clusters such that there is
Clustering 15-381 Artificial Intelligence Henry Lin Modified from excellent slides of Eamonn Keogh, Ziv Bar-Joseph, and Andrew Moore What is Clustering? Organizing data into clusters such that there is
More informationSupport Vector Machine (SVM)
Support Vector Machine (SVM) CE-725: Statistical Pattern Recognition Sharif University of Technology Spring 2013 Soleymani Outline Margin concept Hard-Margin SVM Soft-Margin SVM Dual Problems of Hard-Margin
More informationDimensionality Reduction: Principal Components Analysis
Dimensionality Reduction: Principal Components Analysis In data mining one often encounters situations where there are a large number of variables in the database. In such situations it is very likely
More informationIntroducing diversity among the models of multi-label classification ensemble
Introducing diversity among the models of multi-label classification ensemble Lena Chekina, Lior Rokach and Bracha Shapira Ben-Gurion University of the Negev Dept. of Information Systems Engineering and
More informationCategorical Data Visualization and Clustering Using Subjective Factors
Categorical Data Visualization and Clustering Using Subjective Factors Chia-Hui Chang and Zhi-Kai Ding Department of Computer Science and Information Engineering, National Central University, Chung-Li,
More informationIJCSES Vol.7 No.4 October 2013 pp.165-168 Serials Publications BEHAVIOR PERDITION VIA MINING SOCIAL DIMENSIONS
IJCSES Vol.7 No.4 October 2013 pp.165-168 Serials Publications BEHAVIOR PERDITION VIA MINING SOCIAL DIMENSIONS V.Sudhakar 1 and G. Draksha 2 Abstract:- Collective behavior refers to the behaviors of individuals
More informationARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING)
ARTIFICIAL INTELLIGENCE (CSCU9YE) LECTURE 6: MACHINE LEARNING 2: UNSUPERVISED LEARNING (CLUSTERING) Gabriela Ochoa http://www.cs.stir.ac.uk/~goc/ OUTLINE Preliminaries Classification and Clustering Applications
More informationLearning Gaussian process models from big data. Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu
Learning Gaussian process models from big data Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu Machine learning seminar at University of Cambridge, July 4 2012 Data A lot of
More information