Mert Emin Kalender, CS 533 Homework 3


Question 1

The cluster hypothesis states that closely associated documents tend to be relevant to the same requests. This hypothesis makes sense. The document collections used in information retrieval are huge. At the same time, there is a certain amount of similarity among these documents; it does not hold across the whole collection, but rather within groups of documents of various sizes. These similar groups can be gathered together for efficiency (e.g. performance) and effectiveness (e.g. similar results for similar queries). Thus, the idea of grouping similar documents together, proposed by the cluster hypothesis, has initiated significant research in the information retrieval area.

Question 2

Clustering methods are automatic classification methods that associate relevant documents independently of the data. A clustering algorithm, in contrast, is the implementation of a clustering method, i.e. its low-level realisation in software. Thus, while a clustering method is more abstract and conceptual, aiming to increase efficiency and effectiveness in retrieval independently of data and system characteristics, a clustering algorithm is the application of the method to particular data and a particular system.

Question 3

Cosine, Dice and Euclidean distances are used as similarity measures. All three measures consider two documents, whose similarity is to be measured, represented in matrix form. So expressing documents in matrix form is an issue. This issue is solved by parsing documents word by word and using word occurrences to form the matrices. Regarding the idea of similarity, these metrics measure the similarity between two documents independently of the order in which words occur. So using these metrics we cannot get a similarity value based on both word occurrences and their order. Another issue is that these metrics behave differently for various patterns of word occurrences, which might affect the similarity value computed. To illustrate, the Dice coefficient retains sensitivity in more heterogeneous data and gives less weight to outliers.

Question 4

The similarity matrix, S, is computed using the Dice coefficient. Table 1 presents the document pairs in decreasing order of the similarity between them. The document pairs with 0 similarity are presented.
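The Dice-based ranking described in Question 4 can be sketched as follows. This is a minimal illustration only: the five documents and their term sets are hypothetical stand-ins, not the homework's D matrix, and `dice` and `ranked_pairs` are names introduced here.

```python
from itertools import combinations


def dice(a, b):
    """Dice coefficient of two documents given as sets of terms."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def ranked_pairs(docs):
    """Return (doc, doc, similarity) triples sorted by decreasing similarity."""
    pairs = [
        (x, y, dice(docs[x], docs[y]))
        for x, y in combinations(sorted(docs), 2)
    ]
    return sorted(pairs, key=lambda p: -p[2])


# Hypothetical binary indexing: each document is the set of terms it contains.
docs = {
    "d1": {"t1", "t2", "t4"},
    "d2": {"t2", "t3"},
    "d3": {"t5"},
    "d4": {"t2", "t5"},
    "d5": {"t5", "t6"},
}

ranking = ranked_pairs(docs)
for x, y, s in ranking:
    print(f"{x}, {y}: {s:.3f}")
```

Because the measure only looks at set overlap, shuffling the terms inside a document leaves every value unchanged, which is the order-independence noted in Question 3.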
Table 1: Document pairs with similarity values in decreasing order

Document Pair    Similarity Value
D3, D
D3, D
D2, D
D4, D
D1, D
D1, D

Figure 1: Single-link clustering structure for given S

Figure 2: Complete-link clustering structure for given S

(c) To compute the product-moment correlation coefficient, the matrices $S$ and $S'$ are first flattened and then substituted into the following formula:

$r = \frac{\operatorname{cov}(S, S')}{\left(\operatorname{var}(S)\,\operatorname{var}(S')\right)^{1/2}}$
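The product-moment correlation in (c) can be sketched as below. The two 3x3 matrices are hypothetical placeholders rather than the matrices of this homework, and `flatten` and `pearson_r` are names chosen for the illustration.

```python
from math import sqrt


def flatten(matrix):
    """Turn a matrix (list of rows) into one flat list of values."""
    return [value for row in matrix for value in row]


def pearson_r(xs, ys):
    """Product-moment correlation: cov(x, y) / sqrt(var(x) * var(y))."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical similarity matrices (assumed values, for illustration only).
S = [[1.0, 0.4, 0.0], [0.4, 1.0, 0.5], [0.0, 0.5, 1.0]]
S_prime = [[1.0, 0.4, 0.4], [0.4, 1.0, 0.5], [0.4, 0.5, 1.0]]

r = pearson_r(flatten(S), flatten(S_prime))
print(round(r, 3))
```

Whether the full matrix or only the upper triangle is flattened changes the value of r, so the choice should match whatever the assignment specifies.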
(d) Monte Carlo simulation provides a range of possible outcomes over many runs. The simulation is carried out using an interactive tool for creating confidence intervals for correlation coefficients. It is run with the r value 0.161 found in (c), a sample size of 100 and a desired confidence level of 95% over repetitions. The result is shown in Figure 3.

Figure 3: Monte Carlo distribution of r

Question 5

The order dependence of complete-link clustering can be shown by exchanging the first two rows in Table 1. Since the pairs in the first two rows have the same similarity value, the exchange is safe to apply. The following dendrogram is obtained in that situation:

Figure 4: Complete-link clustering structure for altered S

Unlike complete-link, single-link is not order-dependent. In single-link clustering, the similarity of two clusters is the similarity of their most similar members. This makes single-link clustering depend only on the pair of members at which the two clusters come closest. The order in which documents arrive does not change these two most similar members. However, in complete-link clustering, the similarity of two clusters is the similarity of their most dissimilar members. This makes the merge decision for a document depend on the current clustering structure, because the decision can be completed only after all the similarity values between the documents in the cluster and the document to be merged are known.

Question 6

$n_c = \sum_{i=1}^{m} c_{ii}$

(c) The seed powers are:

$P_1 = (0.77)(0.23)(3)$
$P_2 = (0.66)(0.34)(2)$
$P_3 = (0.33)(0.67)(1)$
$P_4 = (0.33)(0.67)(2)$
$P_5 = (0.66)(0.34)(2)$

(d) There will be 3 clusters, as computed above, and the seed powers sorted in decreasing order are as follows: $P_1 > P_2, P_5 > P_4 > P_3$. Thus, the documents with seed powers $P_1$, $P_2$ and $P_5$ are the cluster seeds.

(e) The inverted index for the cluster seed documents is as follows:

t1: <d1, 1>
t2: <d1, 1>, <d2, 1>
t3: <d2, 1>
t4: <d1, 1>
t5: <d5, 1>
t6: <d5, 1>

(f) Consider $d_3$, whose only term with nonzero frequency is $t_5$. If we check the IISD created in (e), it can be seen that $t_5$ appears only in $d_5$. Thus, $d_3$ is clustered with $d_5$ without any computation. In a similar manner, when we check $d_4$ with terms $t_2$ and $t_5$, this document can be clustered with any of the seeds. Further, the C matrix shows that $c_{41} = c_{42} = c_{45} = 0.16$, which supports the previous assertion.
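The cover coefficient computations of Question 6 can be sketched as follows, using the usual C3M definition $c_{ij} = \alpha_i \sum_k d_{ik} \beta_k d_{jk}$ with $\alpha_i$ and $\beta_k$ the reciprocals of the row and column sums of D. The D matrix below is an assumption: it was chosen to agree with the seed inverted index in (e) and the term counts in (c), and is not copied from the homework. The function names are also introduced here.

```python
def cover_coefficients(D):
    """Cover coefficient matrix C for a binary document-term matrix D."""
    m, n = len(D), len(D[0])
    alpha = [1 / sum(row) for row in D]  # reciprocal row sums
    beta = [1 / sum(D[i][k] for i in range(m)) for k in range(n)]  # column sums
    return [
        [alpha[i] * sum(D[i][k] * beta[k] * D[j][k] for k in range(n))
         for j in range(m)]
        for i in range(m)
    ]


def best_seeds(C, doc, seeds):
    """Seeds that cover `doc` the most; several are returned on a tie."""
    top = max(C[doc][s] for s in seeds)
    return [s for s in seeds if abs(C[doc][s] - top) < 1e-12]


# Assumed D: rows d1..d5, columns t1..t6 (illustrative, see the note above).
D = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 1],
]

C = cover_coefficients(D)
n_c = sum(C[i][i] for i in range(len(C)))  # estimated number of clusters
seeds = [0, 1, 4]  # d1, d2, d5 as stated in part (d), zero-based

print(round(n_c, 2))
print(best_seeds(C, 2, seeds))  # d3
print(best_seeds(C, 3, seeds))  # d4
```

Under this assumed D, d3 is covered by d5 alone while d4 is covered equally by all three seeds, which matches the reasoning in (f).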
(g) The cluster seeds are determined in (d) as $d_1$, $d_2$ and $d_5$. $d_3$ is clustered with $d_5$ in (f). $d_4$ can be clustered with any of the seeds, as stated in (f). Thus, one example clustering is $\{d_1\}$, $\{d_2, d_4\}$, $\{d_3, d_5\}$.

(h) (1): The number of C entries we have to calculate is $m + (m - n_c)\,n_c$. The term $m$ refers to the diagonal entries needed to decide the cluster seeds. The term $(m - n_c)\,n_c$ is the number of entries to be calculated for each non-seed document ($m - n_c$ of them) against the cluster seeds ($n_c$ of them).

(2): For the D matrix, $m = 5$ and $n_c = 3$. Then, the number of C entries to be computed is $5 + (5 - 3) \cdot 3 = 11$.

Question 7

$n$: number of terms
$m$: number of documents
$t$: number of nonzero entries in the D matrix
$t_g$: average number of documents described by a term ($t/n$)
$x_d$: average number of terms describing a document ($t/m$)

$n_c = \frac{n}{x_d} = \frac{mn}{t}$

Question 8

$n_c = \frac{mn}{t} = \frac{n}{t/m} = \frac{n}{x_d} = 3$, $d_c = \frac{m}{n_c} = \frac{5}{3} = 1.67$

The clustering-indexing relationships implied by the cover coefficient concept provide the number of clusters and the average cluster size beforehand. This information can be used to allocate and manage memory better before documents arrive and clusters are created, by placing the documents of the same cluster together, or perhaps by placing similar clusters together as well. Dynamic memory allocation is a hard problem and consumes a lot of time, and it can be handled better with the clustering-indexing relationships. Another use case is an easier and better translation of clustering methods into clustering algorithms, given the precomputed number of clusters and cluster-size information.

Question 9

In $C^3M$, cluster maintenance is efficient with respect to time and space, and the complexity of maintaining clusters over a series of updates is low, because the definition of clusters is concrete thanks to the cover coefficient concept and the clustering-indexing relationships. The clustering is order-independent, so the order in which documents arrive has no effect on the clustering structure. This independence improves cluster maintenance, because updates to the current structure do not directly change the clusters. Additions and removals can be handled by (re)computing cluster seeds and sizes without a degradation in performance.
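The indexing-clustering relationships from Questions 6(h), 7 and 8 are simple enough to check numerically. In the sketch below, m = 5 comes from the text, while n = 6 and t = 10 are assumed values picked so that the result agrees with the stated $n_c = 3$. The function names are hypothetical.

```python
def c3m_estimates(m, n, t):
    """Cluster count and average cluster size from indexing statistics.

    m: number of documents, n: number of terms,
    t: number of nonzero entries in the document-term matrix.
    """
    x_d = t / m        # average number of terms per document
    t_g = t / n        # average number of documents per term
    n_c = m * n / t    # estimated number of clusters, equal to n / x_d
    d_c = m / n_c      # average cluster size, equal to t_g
    return {"x_d": x_d, "t_g": t_g, "n_c": n_c, "d_c": d_c}


def c_entries_to_compute(m, n_c):
    """Diagonal entries plus one entry per (non-seed document, seed) pair."""
    return m + (m - n_c) * n_c


estimates = c3m_estimates(m=5, n=6, t=10)  # n and t are assumptions
print(estimates)
print(c_entries_to_compute(m=5, n_c=3))
```

Since both quantities depend only on m, n and t, they are available before any clustering is run, which is what makes the memory pre-allocation idea in Question 8 possible.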
Question 10

Parallel implementations of clustering algorithms can enhance clustering performance significantly, depending on the clustering algorithm and on how the computations are distributed over the processing elements. Faster and more efficient grouping of documents over a series of updates, analysis of documents for a query, and data mining operations over a large data set are some of the examples that can benefit from a parallel implementation of clustering.

Data mining is the search for and extraction of useful information from huge amounts of data. Similar to clustering, data mining approaches are implemented via different ways of exploring data. In that sense, clustering is used in data mining approaches, and these approaches can benefit from efficient and effective clustering techniques.

(c) This paper summarises, evaluates, compares and applies the proposed clustering approaches. Although this may seem trivial, it is not, because it requires a lot of work to find, learn, understand and apply the various studies in the literature. The work presented in this paper is useful for a scientist who wants to review recent approaches critically and choose the one that fits her problem best, probably using the reasoning given in the paper, and that increases the number of citations.

(d) Rijsbergen's book, Information Retrieval, is another highly cited publication in this area. It provides the essentials of IR using an approach similar to that of a survey paper. These are the publications that a researcher always needs, refers to and uses for any study in the same area, which is why the citation count is high.

Question 11

k-means is a clustering method that partitions n documents into k clusters. It starts with randomly chosen clusters, then (re)assigns documents to clusters according to the similarity between cluster and document until a convergence criterion is met (e.g. no reassignment from one cluster to another, or no significant decrease in the squared error computed for similarity).

Apache Hadoop enables us to create data-intensive distributed applications that run across a varying number of processing elements. Regarding a possible implementation of the k-means algorithm, a Hadoop application can be created to exploit parallelism as mentioned before. Due to limited time, a pseudo-algorithm is given in the following section. This algorithm distributes the assignment of non-seed documents to clusters among different processing elements. This assignment seems to be the only part that can be parallelised or distributed. After the reassignments are finished in the different processing elements, a synchronisation point is used before checking the convergence criterion. Here are the details:
while convergence criterion is not met do
    if there are no partitions yet then
        choose k random cluster seeds;
    else
        (re)decide the cluster seeds;
    end
    partition non-seed documents among available processing elements;
    (re)assign documents to clusters (parallel section);
    synchronise;
end

Algorithm 1: Pseudo-parallel implementation for k-means

Question 12

$m\left(1 - \prod_{i=1}^{k} \frac{n(1 - 1/m) - i + 1}{n - i + 1}\right)$

The following reference points to a paper with a similar purpose in terms of block accesses: Brad T. Vander Zanden, Howard M. Taylor, Dina Bitton, Estimating Block Accesses when Attributes are Correlated, Proceedings of the 12th International Conference on Very Large Data Bases, August 25-28, 1986.

References

If not explicitly stated otherwise, please refer to the references in the homework description.
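Algorithm 1 from Question 11 can be sketched in plain Python as follows. This is not a Hadoop program: a standard-library thread pool stands in for the processing elements, the seeds are passed in rather than chosen at random so that the run is repeatable, and documents are assumed to be numeric tuples compared by squared Euclidean distance. All function names are introduced here for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def assign_chunk(chunk, centroids):
    """Assign every point of one partition to its nearest centroid."""
    return [
        min(range(len(centroids)),
            key=lambda c: squared_distance(point, centroids[c]))
        for point in chunk
    ]


def kmeans(points, seeds, workers=2, max_iter=100):
    centroids = [tuple(s) for s in seeds]
    # Partition the documents among the available processing elements.
    size = max(1, -(-len(points) // workers))  # ceiling division
    chunks = [points[i:i + size] for i in range(0, len(points), size)]
    labels = None
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(max_iter):
            # Parallel section: each partition is assigned independently.
            parts = pool.map(assign_chunk, chunks, [centroids] * len(chunks))
            # Synchronisation point: all partial results are collected here.
            new_labels = [label for part in parts for label in part]
            if new_labels == labels:  # convergence: no reassignment
                break
            labels = new_labels
            # (Re)decide the cluster centres from the current members.
            for c in range(len(centroids)):
                members = [p for p, l in zip(points, labels) if l == c]
                if members:
                    centroids[c] = tuple(
                        sum(axis) / len(members) for axis in zip(*members)
                    )
    return labels, centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, seeds=[(0, 0), (10, 10)])
print(labels)
print(centroids)
```

In a MapReduce setting the assignment step would naturally play the role of the map phase and the centroid update the reduce phase. Threads in CPython mainly illustrate the structure here rather than deliver a real speed-up.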
Credibility and Pooling Applications to Group Life and Group Disability Insurance Presented by Paul L. Correia Consulting Actuary paul.correia@milliman.com (207) 7711204 May 20, 2014 What I plan to cover
More informationMathQuest: Linear Algebra. 1. Which of the following matrices does not have an inverse?
MathQuest: Linear Algebra Matrix Inverses 1. Which of the following matrices does not have an inverse? 1 2 (a) 3 4 2 2 (b) 4 4 1 (c) 3 4 (d) 2 (e) More than one of the above do not have inverses. (f) All
More informationUsing Data Mining for Mobile Communication Clustering and Characterization
Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer
More informationRecognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28
Recognition Topics that we will try to cover: Indexing for fast retrieval (we still owe this one) History of recognition techniques Object classification Bagofwords Spatial pyramids Neural Networks Object
More informationCHAPTER 3 DATA MINING AND CLUSTERING
CHAPTER 3 DATA MINING AND CLUSTERING 3.1 Introduction Nowadays, large quantities of data are being accumulated. The amount of data collected is said to be almost doubled every 9 months. Seeking knowledge
More informationE3: PROBABILITY AND STATISTICS lecture notes
E3: PROBABILITY AND STATISTICS lecture notes 2 Contents 1 PROBABILITY THEORY 7 1.1 Experiments and random events............................ 7 1.2 Certain event. Impossible event............................
More informationGLM, insurance pricing & big data: paying attention to convergence issues.
GLM, insurance pricing & big data: paying attention to convergence issues. Michaël NOACK  michael.noack@addactis.com Senior consultant & Manager of ADDACTIS Pricing Copyright 2014 ADDACTIS Worldwide.
More informationCSCI 5417 Information Retrieval Systems Jim Martin!
CSCI 5417 Information Retrieval Systems Jim Martin! Lecture 9 9/20/2011 Today 9/20 Where we are MapReduce/Hadoop Probabilistic IR Language models LM for ad hoc retrieval 1 Where we are... Basics of ad
More informationData Mining Cluster Analysis: Basic Concepts and Algorithms. Clustering Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining
Data Mining Cluster Analsis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining b Tan, Steinbach, Kumar Clustering Algorithms Kmeans and its variants Hierarchical clustering
More informationLinear Dependence Tests
Linear Dependence Tests The book omits a few key tests for checking the linear dependence of vectors. These short notes discuss these tests, as well as the reasoning behind them. Our first test checks
More informationTopics in basic DBMS course
Topics in basic DBMS course Database design Transaction processing Relational query languages (SQL), calculus, and algebra DBMS APIs Database tuning (physical database design) Basic query processing (ch
More informationData Mining and Clustering Techniques
DRTC Workshop on Semantic Web 8 th 10 th December, 2003 DRTC, Bangalore Paper: K Data Mining and Clustering Techniques I. K. Ravichandra Rao Professor and Head Documentation Research and Training Center
More information2. DATA AND EXERCISES (Geos2911 students please read page 8)
2. DATA AND EXERCISES (Geos2911 students please read page 8) 2.1 Data set The data set available to you is an Excel spreadsheet file called cyclones.xls. The file consists of 3 sheets. Only the third is
More informationNVIVO 10 WORKSHOP II. Hui Bian Office for Faculty Excellence
NVIVO 10 WORKSHOP II Hui Bian Office for Faculty Excellence Memo Memos are a type of document that enable you to record the ideas, insights, interpretations or growing understanding of the material in
More informationBuilding Data Cubes and Mining Them. Jelena Jovanovic Email: jeljov@fon.bg.ac.yu
Building Data Cubes and Mining Them Jelena Jovanovic Email: jeljov@fon.bg.ac.yu KDD Process KDD is an overall process of discovering useful knowledge from data. Data mining is a particular step in the
More informationBig Data Processing with Google s MapReduce. Alexandru Costan
1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:
More informationDistances, Clustering, and Classification. Heatmaps
Distances, Clustering, and Classification Heatmaps 1 Distance Clustering organizes things that are close into groups What does it mean for two genes to be close? What does it mean for two samples to be
More informationClustering Data Streams
Clustering Data Streams Mohamed Elasmar Prashant Thiruvengadachari Javier Salinas Martin gtg091e@mail.gatech.edu tprashant@gmail.com javisal1@gatech.edu Introduction: Data mining is the science of extracting
More information