Mert Emin Kalender, CS 533 Homework 3


Question 1

The cluster hypothesis states that closely associated documents tend to be relevant to the same requests. This hypothesis makes sense. Document collections in information retrieval are huge. At the same time, there is a certain amount of similarity among these documents; it does not hold across the whole collection, but within groups of documents of various sizes. These similar groups can be gathered together for efficiency (e.g. performance) and effectiveness (e.g. similar results for similar queries). Thus, the idea of grouping similar documents together, as proposed by the cluster hypothesis, has initiated a significant body of research in information retrieval.

Question 2

Clustering methods are automatic classification methods that associate relevant documents independently of the data. A clustering algorithm, in contrast, is the implementation of a clustering method at the software level. Thus, a clustering method is more abstract and conceptual, aiming to increase efficiency and effectiveness in retrieval independently of data and system characteristics, whereas a clustering algorithm is the application of the method to particular data and a particular system.

Question 3

Cosine, Dice and Euclidean distance are used as similarity measures. All three measures compare two documents that are represented in matrix form, so expressing documents in matrix form is an issue. This issue is solved by parsing documents word by word and using word occurrences to form the matrix. Regarding the idea of similarity, these measures compare two documents independently of the order in which words occur, so they cannot yield a similarity value that reflects both word occurrences and word order. Another issue is that these measures behave differently for various patterns of word occurrences, which may affect the computed similarity value.
To illustrate, the Dice coefficient handles more heterogeneous data and outliers by giving them less weight.

Question 4

The similarity matrix, S, is computed using the Dice coefficient:

S = [matrix values missing in the source]

Table 1 presents the document pairs in decreasing order of the similarity between them. Document pairs with 0 similarity are omitted.
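The Dice computation behind S can be sketched as follows. This is a minimal illustration, not the original calculation: the binary matrix `D` is an assumption, reconstructed from the term lists that appear in Question 6 (d1: t1, t2, t4; d2: t2, t3; d3: t5; d4: t2, t5; d5: t5, t6), and the helper names `dice` and `nonzero_pairs` are hypothetical.

```python
from itertools import combinations

# Hypothetical binary document-term matrix (rows d1..d5, columns t1..t6),
# reconstructed from the term lists in Question 6; an assumption.
D = [
    [1, 1, 0, 1, 0, 0],  # d1: t1, t2, t4
    [0, 1, 1, 0, 0, 0],  # d2: t2, t3
    [0, 0, 0, 0, 1, 0],  # d3: t5
    [0, 1, 0, 0, 1, 0],  # d4: t2, t5
    [0, 0, 0, 0, 1, 1],  # d5: t5, t6
]


def dice(x, y):
    """Dice coefficient 2|X and Y| / (|X| + |Y|) for binary term vectors."""
    common = sum(a * b for a, b in zip(x, y))
    total = sum(x) + sum(y)
    return 2 * common / total if total else 0.0


def nonzero_pairs(matrix):
    """Return ((i, j), similarity) for pairs with nonzero similarity,
    sorted by decreasing similarity (ties keep their original order)."""
    pairs = []
    for i, j in combinations(range(len(matrix)), 2):
        s = dice(matrix[i], matrix[j])
        if s > 0:
            pairs.append(((i + 1, j + 1), s))
    return sorted(pairs, key=lambda p: -p[1])


pairs = nonzero_pairs(D)
for (i, j), s in pairs:
    print(f"D{i}, D{j}: {s:.2f}")
```

Under this assumed matrix, six document pairs have nonzero similarity, and the first members of the sorted pairs come out in the same order as the surviving entries of Table 1 (D3, D3, D2, D4, D1, D1).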

Table 1: Document pairs with similarity values in decreasing order

Document Pair    Similarity Value
D3, D[?]         [?]
D3, D[?]         [?]
D2, D[?]         [?]
D4, D[?]         [?]
D1, D[?]         [?]
D1, D[?]         [?]

Figure 1: Single-link clustering structure for given S

Figure 2: Complete-link clustering structure for given S

(c) S' = [matrix values missing in the source]

To compute the product-moment correlation coefficient, the matrices S and S' are first flattened and then applied to the following formula:

$r = \frac{\mathrm{cov}(S, S')}{(\mathrm{var}(S)\,\mathrm{var}(S'))^{1/2}} = 0.161$
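The formula in (c) can be sketched in a few lines. The two small matrices below are made-up illustrative values, not the S and S' of this homework, and the function names are hypothetical.

```python
import math


def flatten(matrix):
    """Flatten a matrix (list of rows) into a single list of values."""
    return [v for row in matrix for v in row]


def pearson_r(xs, ys):
    """Product-moment correlation r = cov(x, y) / sqrt(var(x) * var(y))."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Illustrative values only (assumed, not taken from the homework).
S_example = [[1.0, 0.5, 0.2], [0.5, 1.0, 0.3], [0.2, 0.3, 1.0]]
S_prime_example = [[1.0, 0.5, 0.2], [0.5, 1.0, 0.2], [0.2, 0.2, 1.0]]

r = pearson_r(flatten(S_example), flatten(S_prime_example))
print(f"r = {r:.3f}")
```

The sketch flattens the full matrices, as the text describes; restricting the comparison to the upper triangle would be an alternative design choice and would generally give a different r.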

(d) Monte Carlo simulation provides a range of possible outcomes over many runs. The simulation is carried out using an interactive tool for creating confidence intervals for correlation coefficients. It is run with the r value 0.161 found in (c), a sample size of 100 and a desired confidence level of 95%, over [count missing in the source] repetitions. The result is shown in Figure 3.

Figure 3: Monte Carlo distribution of r

Question 5

Complete-link's order dependence can be shown by exchanging the first two rows of Table 1. Since the pairs in the first two rows have the same similarity value, the exchange is safe to apply. The following dendrogram is obtained in that situation:

Figure 4: Complete-link clustering structure for altered S

Unlike complete-link, single-link is not order-dependent. In single-link clustering, the similarity of two clusters is the similarity of their most similar members. This makes single-link clustering depend only on the pair of members at which the two clusters come closest, and the order in which documents arrive does not change these two most similar members. In complete-link clustering, however, the similarity of two clusters is the similarity of their most dissimilar members. As a result, the current clustering structure affects the merge decision for a document, which can be made only after all similarity values between the documents in the cluster and the document to be merged are known.

Question 6

C = [matrix values missing in the source]

$n_c = \sum_{i=1}^{m} c_{ii}$ [numeric evaluation missing in the source]

(c) The seed powers are:

P_1 = (0.77)(0.23)(3) ≈ 0.53
P_2 = (0.66)(0.34)(2) ≈ 0.45
P_3 = (0.33)(0.67)(1) ≈ 0.22
P_4 = (0.33)(0.67)(2) ≈ 0.44
P_5 = (0.66)(0.34)(2) ≈ 0.45

(d) There will be 3 clusters, as computed above, and the seed powers sorted in decreasing order are P_1 > P_2 = P_5 > P_4 > P_3. Thus, d_1, d_2 and d_5 are the cluster seeds.

(e) The inverted index for the cluster seed documents is as follows:

t_1: <d_1, 1>
t_2: <d_1, 1>, <d_2, 1>
t_3: <d_2, 1>
t_4: <d_1, 1>
t_5: <d_5, 1>
t_6: <d_5, 1>

(f) Consider d_3, whose only term with nonzero frequency is t_5. Checking the IISD created in (e) shows that t_5 appears only in d_5. Thus, d_3 is clustered with d_5 without any computation. In a similar manner, d_4 has terms t_2 and t_5, so it can be clustered with any of the seeds. Further, the C matrix computed above gives c_41 = c_42 = c_45 = 0.16, which supports this assertion.
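The cover coefficient quantities used in Question 6 can be checked with a short sketch. It assumes the standard C3M definitions for a binary matrix: c_ij = α_i Σ_k d_ik β_k d_jk, with α_i and β_k the reciprocals of the row and column sums, and seed power P_i = δ_i (1 - δ_i) Σ_j d_ij, where δ_i = c_ii. The matrix `D` is an assumption, reconstructed from the inverted index in (e) and the term lists of d_3 and d_4 in (f); the function name is hypothetical.

```python
from fractions import Fraction

# Hypothetical D matrix (rows d1..d5, columns t1..t6), reconstructed from
# parts (e) and (f); the original matrix is not shown in the text.
D = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 1, 1],
]


def cover_coefficients(d):
    """Return the m x m cover coefficient matrix C for a binary matrix d.

    Assumes every row and every column has at least one nonzero entry.
    """
    m, n = len(d), len(d[0])
    alpha = [Fraction(1, sum(row)) for row in d]
    beta = [Fraction(1, sum(d[i][k] for i in range(m))) for k in range(n)]
    return [
        [alpha[i] * sum(d[i][k] * beta[k] * d[j][k] for k in range(n))
         for j in range(m)]
        for i in range(m)
    ]


C = cover_coefficients(D)
delta = [C[i][i] for i in range(len(D))]           # decoupling coefficients
n_c = sum(delta)                                   # estimated cluster count
seed_power = [delta[i] * (1 - delta[i]) * sum(D[i]) for i in range(len(D))]

print("delta:", [float(x) for x in delta])
print("n_c:", float(n_c), "rounded:", round(n_c))
print("seed powers:", [float(p) for p in seed_power])
print("c41, c42, c45:", float(C[3][0]), float(C[3][1]), float(C[3][4]))
```

With exact fractions and this reconstructed matrix, the seed powers of d_2, d_4 and d_5 coincide, so the position of P_4 below P_2 and P_5 in (d) appears to depend on the two-decimal truncation of the coefficients used there.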

(g) The cluster seeds are determined in (d) as d_1, d_2 and d_5. d_3 is clustered with d_5 in (f), and d_4 can be clustered with any of the seeds, as stated in (f). Thus, one example clustering is {d_1}, {d_2, d_4}, {d_3, d_5}.

(h) (1): The number of C entries to be calculated is $m + (m - n_c) n_c$. Here m counts the diagonal entries used to decide the cluster seeds, and $(m - n_c) n_c$ is the number of entries to be calculated for the non-seed documents ($m - n_c$ of them) against the cluster seeds ($n_c$ of them).
(2): For the D matrix, m = 5 and n_c = 3. Then, the number of C entries to be computed is 5 + (5 - 3) · 3 = 11.

Question 7

n: number of terms
m: number of documents
t: number of nonzero entries in the D matrix
t_g: average number of documents described by a term (t/n)
x_d: average number of terms describing a document (t/m)

$n_c = \frac{m}{t_g} = \frac{mn}{t}$

Question 8

$n_c = \frac{mn}{t} = \frac{n}{t/m} = \frac{n}{x_d} = 3, \qquad d_c = \frac{m}{n_c} = \frac{5}{3} = 1.67$

The clustering-indexing relationships implied by the cover coefficient concept provide the number of clusters and the average cluster size beforehand. This information can be used to allocate and manage memory before documents arrive and clusters are created, for example by placing the documents of a cluster together, or by placing similar clusters near each other. Dynamic memory allocation is a hard and time-consuming problem, and it can be handled better with the clustering-indexing relationships. Another use case is an easier and better translation of clustering methods into clustering algorithms, given the precomputed number of clusters and cluster-size information.

Question 9

In C3M, cluster maintenance is efficient with respect to time and space, and the complexity of maintaining clusters over a series of updates is low, because the definition of clusters is concrete thanks to the cover coefficient concept and the clustering-indexing relationships.
The clustering is order-independent, so the order in which documents arrive has no effect on the clustering structure. This independence improves cluster maintenance, because updates to the current structure do not directly change the clusters. Additions and removals can be handled by (re)computing cluster seeds and sizes without a degradation in performance.
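The clustering-indexing relationships from Questions 7 and 8, together with the entry count from Question 6 (h), can be checked numerically. In this sketch m = 5 and n_c = 3 come from the text, while n = 6 and t = 10 are assumptions inferred from the terms t_1 to t_6 and the per-document term counts in Question 6; the function names are hypothetical.

```python
def clustering_indexing(m, n, t):
    """Derive indexing and clustering quantities from m documents,
    n terms and t nonzero entries of the D matrix."""
    x_d = t / m        # average number of terms per document
    t_g = t / n        # average number of documents per term
    n_c = m * n / t    # estimated number of clusters
    d_c = m / n_c      # average cluster size
    return {"x_d": x_d, "t_g": t_g, "n_c": n_c, "d_c": d_c}


def c_entries_needed(m, n_c):
    """Number of C entries to compute: m diagonal entries plus one entry
    per (non-seed document, seed) pair."""
    return m + (m - n_c) * n_c


stats = clustering_indexing(m=5, n=6, t=10)  # n and t are assumed values
print(stats)
print("C entries:", c_entries_needed(5, 3))
```

The same n_c follows from m / t_g and from n / x_d, which is the identity stated in Question 8.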

Question 10

Parallel implementations of clustering algorithms can enhance clustering performance significantly, depending on the clustering algorithm and on how computations are distributed over processing elements. Faster and more efficient grouping of documents over a series of updates, analysis of documents for a query, and data mining operations over a large data set are some of the examples that can benefit from a parallel implementation of clustering.

Data mining is searching for and extracting useful information from huge amounts of data. Similar to clustering, data mining approaches are implemented via different ways of exploring data. In that sense, clustering is used in data mining approaches, and these approaches can benefit from efficient and effective clustering techniques.

(c) This paper summarises, evaluates, compares and applies the proposed clustering approaches. Although this may seem trivial, it is not, because it takes a lot of work to find, learn, understand and apply the various studies in the literature. The work presented in this paper helps a scientist to review recent approaches critically and to choose the one that best fits her problem, probably using the reasoning given in the paper, and that increases the number of citations.

(d) van Rijsbergen's book, Information Retrieval, is another highly cited publication in this area. It provides the essentials of IR using an approach similar to that of a survey paper. These are the publications that a researcher always needs, refers to and uses for any study in the same area, which is why their citation counts are high.

Question 11

k-means is a clustering method that partitions n documents into k clusters. It starts with randomly chosen cluster seeds, then (re)assigns documents to clusters according to the similarity between cluster and document until a convergence criterion is met (e.g. no reassignment from one cluster to another, or no significant decrease in the squared error computed for similarity).

Apache Hadoop enables us to create data-intensive distributed applications that run across a varying number of processing elements. A Hadoop application can therefore be created for k-means to exploit parallelism, as mentioned before. Due to limited time, only a pseudo-algorithm is given here. The algorithm distributes the assignment of non-seed documents to clusters among different processing elements. This assignment seems to be the only part that can be parallelised or distributed. After reassignments are finished in the different processing elements, a synchronisation point is used before checking the convergence criterion. Here are the details:

    while convergence criterion is not met do
        if there are no partitions yet then
            choose k random cluster seeds;
        else
            (re)decide the cluster seeds;
        end
        partition non-seed documents among available processing elements;
        (re)assign documents to clusters (parallel section);
        synchronise;
    end

Algorithm 1: Pseudo-parallel implementation for k-means

Question 12

$m \left( 1 - \prod_{i=1}^{k} \frac{n(1 - 1/m) - i + 1}{n - i + 1} \right)$

The following reference points to a paper with a similar purpose in terms of block accesses: Brad T. Vander Zanden, Howard M. Taylor, Dina Bitton, Estimating Block Accesses when Attributes are Correlated, Proceedings of the 12th International Conference on Very Large Data Bases, August 25-28, 1986.

References

If not explicitly stated otherwise, please refer to the references in the homework description.
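Returning to Algorithm 1 from Question 11, its structure can be sketched on a single machine. This is only an illustration under stated assumptions: it uses a thread pool from the Python standard library instead of Hadoop, squared Euclidean distance instead of a document similarity measure, and it reassigns all points (not only non-seed documents) in each round. All function names and the sample points are hypothetical.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_chunk(chunk, centroids):
    """Assign each point in one partition to its nearest centroid."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(p, centroids[c]))
            for p in chunk]


def mean(points):
    """Component-wise mean of a non-empty list of points."""
    return tuple(sum(col) / len(points) for col in zip(*points))


def kmeans(points, k, workers=2, max_iter=100, seed=0):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)       # choose k random cluster seeds
    labels = None
    for _ in range(max_iter):
        # Partition the points among the available workers.
        size = -(-len(points) // workers)   # ceiling division
        chunks = [points[i:i + size] for i in range(0, len(points), size)]
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # Parallel section; collecting the results is the sync point.
            parts = list(pool.map(assign_chunk, chunks,
                                  [centroids] * len(chunks)))
        new_labels = [label for part in parts for label in part]
        if new_labels == labels:            # convergence: no reassignment
            break
        labels = new_labels
        for c in range(k):                  # (re)decide the cluster seeds
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = mean(members)
    return labels, centroids


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

As in the pseudo-algorithm, only the assignment step runs in parallel; seed selection and the convergence check stay sequential.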

Clustering & Association

Clustering - Overview What is cluster analysis? Grouping data objects based only on information found in the data describing these objects and their relationships Maximize the similarity within objects

Information Retrieval and Web Search Engines

Information Retrieval and Web Search Engines Lecture 7: Document Clustering December 10 th, 2013 Wolf-Tilo Balke and Kinda El Maarry Institut für Informationssysteme Technische Universität Braunschweig

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Clustering Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016 1 Supervised learning vs. unsupervised learning Supervised learning: discover patterns in the data that relate data attributes with

10-810 /02-710 Computational Genomics. Clustering expression data

10-810 /02-710 Computational Genomics Clustering expression data What is Clustering? Organizing data into clusters such that there is high intra-cluster similarity low inter-cluster similarity Informally,

Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

CLASSIFICATION AND CLUSTERING. Anveshi Charuvaka

CLASSIFICATION AND CLUSTERING Anveshi Charuvaka Learning from Data Classification Regression Clustering Anomaly Detection Contrast Set Mining Classification: Definition Given a collection of records (training

Text Clustering. Clustering

Text Clustering 1 Clustering Partition unlabeled examples into disoint subsets of clusters, such that: Examples within a cluster are very similar Examples in different clusters are very different Discover

Appendix A: Sampling Methods

Appendix A: Sampling Methods What is Sampling? Sampling is used in an @RISK simulation to generate possible values from probability distribution functions. These sets of possible values are then used to

A K-means-like Algorithm for K-medoids Clustering and Its Performance

A K-means-like Algorithm for K-medoids Clustering and Its Performance Hae-Sang Park*, Jong-Seok Lee and Chi-Hyuck Jun Department of Industrial and Management Engineering, POSTECH San 31 Hyoja-dong, Pohang

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Data Mining Project Report Document Clustering Meryem Uzun-Per 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. K-means algorithm...

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Tan,Steinbach, Kumar Introduction to Data Mining 4/8/2004 Hierarchical

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Clustering Adrian Groza Department of Computer Science Technical University of Cluj-Napoca Outline 1 Cluster Analysis What is Datamining? Cluster Analysis 2 K-means 3 Hierarchical Clustering What is Datamining?

Cluster Analysis: Basic Concepts and Algorithms

Cluster Analsis: Basic Concepts and Algorithms What does it mean clustering? Applications Tpes of clustering K-means Intuition Algorithm Choosing initial centroids Bisecting K-means Post-processing Strengths

Machine Learning using MapReduce

Machine Learning using MapReduce What is Machine Learning Machine learning is a subfield of artificial intelligence concerned with techniques that allow computers to improve their outputs based on previous

FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT MINING SYSTEM

International Journal of Innovative Computing, Information and Control ICIC International c 0 ISSN 34-48 Volume 8, Number 8, August 0 pp. 4 FUZZY CLUSTERING ANALYSIS OF DATA MINING: APPLICATION TO AN ACCIDENT

Data Mining Clustering (2) Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining

Data Mining Clustering (2) Toon Calders Sheets are based on the those provided by Tan, Steinbach, and Kumar. Introduction to Data Mining Outline Partitional Clustering Distance-based K-means, K-medoids,

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS 1 AND ALGORITHMS Chiara Renso KDD-LAB ISTI- CNR, Pisa, Italy WHAT IS CLUSTER ANALYSIS? Finding groups of objects such that the objects in a group will be similar

Chapter 7. Hierarchical cluster analysis. Contents 7-1

7-1 Chapter 7 Hierarchical cluster analysis In Part 2 (Chapters 4 to 6) we defined several different ways of measuring distance (or dissimilarity as the case may be) between the rows or between the columns

Data Mining Cluster Analysis: Advanced Concepts and Algorithms. ref. Chapter 9. Introduction to Data Mining

Data Mining Cluster Analysis: Advanced Concepts and Algorithms ref. Chapter 9 Introduction to Data Mining by Tan, Steinbach, Kumar 1 Outline Prototype-based Fuzzy c-means Mixture Model Clustering Density-based

Text Analytics. Text Clustering. Ulf Leser

Text Analytics Text Clustering Ulf Leser Content of this Lecture (Text) clustering Cluster quality Clustering algorithms Application Ulf Leser: Text Analytics, Winter Semester 2010/2011 2 Clustering Clustering

Clustering UE 141 Spring 2013

Clustering UE 141 Spring 013 Jing Gao SUNY Buffalo 1 Definition of Clustering Finding groups of obects such that the obects in a group will be similar (or related) to one another and different from (or

Lecture 20: Clustering

Lecture 20: Clustering Wrap-up of neural nets (from last lecture Introduction to unsupervised learning K-means clustering COMP-424, Lecture 20 - April 3, 2013 1 Unsupervised learning In supervised learning,

Analysis of MapReduce Algorithms

Analysis of MapReduce Algorithms Harini Padmanaban Computer Science Department San Jose State University San Jose, CA 95192 408-924-1000 harini.gomadam@gmail.com ABSTRACT MapReduce is a programming model

Computational Complexity between K-Means and K-Medoids Clustering Algorithms for Normal and Uniform Distributions of Data Points

Journal of Computer Science 6 (3): 363-368, 2010 ISSN 1549-3636 2010 Science Publications Computational Complexity between K-Means and K-Medoids Clustering Algorithms for Normal and Uniform Distributions

Unsupervised learning: Clustering

Unsupervised learning: Clustering Salissou Moutari Centre for Statistical Science and Operational Research CenSSOR 17 th September 2013 Unsupervised learning: Clustering 1/52 Outline 1 Introduction What

A Relevant Document Information Clustering Algorithm for Web Search Engine

A Relevant Document Information Clustering Algorithm for Web Search Engine Y.SureshBabu, K.Venkat Mutyalu, Y.A.Siva Prasad Abstract Search engines are the Hub of Information, The advances in computing

CLUSTER ANALYSIS FOR SEGMENTATION

CLUSTER ANALYSIS FOR SEGMENTATION Introduction We all understand that consumers are not all alike. This provides a challenge for the development and marketing of profitable products and services. Not every

SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING

AAS 07-228 SPECIAL PERTURBATIONS UNCORRELATED TRACK PROCESSING INTRODUCTION James G. Miller * Two historical uncorrelated track (UCT) processing approaches have been employed using general perturbations

Big Data and Scripting map/reduce in Hadoop

Big Data and Scripting map/reduce in Hadoop 1, 2, parts of a Hadoop map/reduce implementation core framework provides customization via indivudual map and reduce functions e.g. implementation in mongodb

Machine Learning for NLP

Natural Language Processing SoSe 2015 Machine Learning for NLP Dr. Mariana Neves May 4th, 2015 (based on the slides of Dr. Saeedeh Momtazi) Introduction Field of study that gives computers the ability

Data Mining for Model Creation. Presentation by Paul Below, EDS 2500 NE Plunkett Lane Poulsbo, WA USA 98370 paul.below@eds.

Sept 03-23-05 22 2005 Data Mining for Model Creation Presentation by Paul Below, EDS 2500 NE Plunkett Lane Poulsbo, WA USA 98370 paul.below@eds.com page 1 Agenda Data Mining and Estimating Model Creation

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS

UNSUPERVISED MACHINE LEARNING TECHNIQUES IN GENOMICS Dwijesh C. Mishra I.A.S.R.I., Library Avenue, New Delhi-110 012 dcmishra@iasri.res.in What is Learning? "Learning denotes changes in a system that enable

Developing MapReduce Programs

Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2015/16 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

Chapter ML:XI (continued)

Chapter ML:XI (continued) XI. Cluster Analysis Data Mining Overview Cluster Analysis Basics Hierarchical Cluster Analysis Iterative Cluster Analysis Density-Based Cluster Analysis Cluster Evaluation Constrained

Clustering Connectionist and Statistical Language Processing

Clustering Connectionist and Statistical Language Processing Frank Keller keller@coli.uni-sb.de Computerlinguistik Universität des Saarlandes Clustering p.1/21 Overview clustering vs. classification supervised

An introduction to Value-at-Risk Learning Curve September 2003

An introduction to Value-at-Risk Learning Curve September 2003 Value-at-Risk The introduction of Value-at-Risk (VaR) as an accepted methodology for quantifying market risk is part of the evolution of risk

Fig. 1 A typical Knowledge Discovery process [2]

Volume 4, Issue 7, July 2014 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com A Review on Clustering

Movie Classification Using k-means and Hierarchical Clustering

Movie Classification Using k-means and Hierarchical Clustering An analysis of clustering algorithms on movie scripts Dharak Shah DA-IICT, Gandhinagar Gujarat, India dharak_shah@daiict.ac.in Saheb Motiani

! Two Fundamental Methods in Machine Learning! Supervised Learning ( learn from my example )

Supervised vs. Unsupervised Learning Basic Machine Learning: Clustering CS 315 Web Search and Data Mining! Two Fundamental Methods in Machine Learning! Supervised Learning ( learn from my example ) n Goal:

Java Modules for Time Series Analysis

Java Modules for Time Series Analysis Agenda Clustering Non-normal distributions Multifactor modeling Implied ratings Time series prediction 1. Clustering + Cluster 1 Synthetic Clustering + Time series

Hadoop Design and k-means Clustering Kenneth Heafield Google Inc January 15, 2008 Example code from Hadoop 0.13.1 used under the Apache License Version 2.0 and modified for presentation. Except as otherwise

Clustering Hierarchical clustering and k-mean clustering

Clustering Hierarchical clustering and k-mean clustering Genome 373 Genomic Informatics Elhanan Borenstein The clustering problem: A quick review partition genes into distinct sets with high homogeneity

Chapter 7. Cluster Analysis

Chapter 7. Cluster Analysis. What is Cluster Analysis?. A Categorization of Major Clustering Methods. Partitioning Methods. Hierarchical Methods 5. Density-Based Methods 6. Grid-Based Methods 7. Model-Based

Overview. Clustering. Clustering vs. Classification. Supervised vs. Unsupervised Learning. Connectionist and Statistical Language Processing

Overview Clustering Connectionist and Statistical Language Processing Frank Keller keller@coli.uni-sb.de Computerlinguistik Universität des Saarlandes clustering vs. classification supervised vs. unsupervised

Data Mining 資 料 探 勘. 分 群 分 析 (Cluster Analysis)

Data Mining 資 料 探 勘 Tamkang University 分 群 分 析 (Cluster Analysis) DM MI Wed,, (:- :) (B) Min-Yuh Day 戴 敏 育 Assistant Professor 專 任 助 理 教 授 Dept. of Information Management, Tamkang University 淡 江 大 學 資

An Enhanced Clustering Algorithm to Analyze Spatial Data

International Journal of Engineering and Technical Research (IJETR) ISSN: 2321-0869, Volume-2, Issue-7, July 2014 An Enhanced Clustering Algorithm to Analyze Spatial Data Dr. Mahesh Kumar, Mr. Sachin Yadav

Scheduling Algorithms in MapReduce Distributed Mind

Scheduling Algorithms in MapReduce Distributed Mind Karthik Kotian, Jason A Smith, Ye Zhang Schedule Overview of topic (review) Hypothesis Research paper 1 Research paper 2 Research paper 3 Project software

Robotics 2 Clustering & EM. Giorgio Grisetti, Cyrill Stachniss, Kai Arras, Maren Bennewitz, Wolfram Burgard

Robotics 2 Clustering & EM Giorgio Grisetti, Cyrill Stachniss, Kai Arras, Maren Bennewitz, Wolfram Burgard 1 Clustering (1) Common technique for statistical data analysis to detect structure (machine learning,

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015

ROBERTO BATTITI, MAURO BRUNATO. The LION Way: Machine Learning plus Intelligent Optimization. LIONlab, University of Trento, Italy, Apr 2015 http://intelligentoptimization.org/lionbook Roberto Battiti

CONTENTS PREFACE 1 INTRODUCTION 1 2 DATA VISUALIZATION 19

PREFACE xi 1 INTRODUCTION 1 1.1 Overview 1 1.2 Definition 1 1.3 Preparation 2 1.3.1 Overview 2 1.3.2 Accessing Tabular Data 3 1.3.3 Accessing Unstructured Data 3 1.3.4 Understanding the Variables and Observations

Introduction to Parallel Programming and MapReduce

Introduction to Parallel Programming and MapReduce Audience and Pre-Requisites This tutorial covers the basics of parallel programming and the MapReduce programming model. The pre-requisites are significant

Dynamical Clustering of Personalized Web Search Results

Dynamical Clustering of Personalized Web Search Results Xuehua Shen CS Dept, UIUC xshen@cs.uiuc.edu Hong Cheng CS Dept, UIUC hcheng3@uiuc.edu Abstract Most current search engines present the user a ranked

Social Media Mining. Data Mining Essentials

Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers

A Statistical Text Mining Method for Patent Analysis

A Statistical Text Mining Method for Patent Analysis Department of Statistics Cheongju University, shjun@cju.ac.kr Abstract Most text data from diverse document databases are unsuitable for analytical

DATA CLUSTERING USING MAPREDUCE

DATA CLUSTERING USING MAPREDUCE by Makho Ngazimbi A project submitted in partial fulfillment of the requirements for the degree of Master of Science in Computer Science Boise State University March 2009

SEARCH ENGINE WITH PARALLEL PROCESSING AND INCREMENTAL K-MEANS FOR FAST SEARCH AND RETRIEVAL

SEARCH ENGINE WITH PARALLEL PROCESSING AND INCREMENTAL K-MEANS FOR FAST SEARCH AND RETRIEVAL Krishna Kiran Kattamuri 1 and Rupa Chiramdasu 2 Department of Computer Science Engineering, VVIT, Guntur, India

Large-Scale Data Cleaning Using Hadoop. UC Irvine

Chen Li UC Irvine Joint work with Michael Carey, Alexander Behm, Shengyue Ji, Rares Vernica 1 Overview Importance of information Importance of information quality Data cleaning Large scale Hadoop 2 Data

Map-Reduce for Machine Learning on Multicore

Map-Reduce for Machine Learning on Multicore Chu, et al. Problem The world is going multicore New computers - dual core to 12+-core Shift to more concurrent programming paradigms and languages Erlang,

Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2

Advanced Engineering Forum Vols. 6-7 (2012) pp 82-87 Online: 2012-09-26 (2012) Trans Tech Publications, Switzerland doi:10.4028/www.scientific.net/aef.6-7.82 Research on Clustering Analysis of Big Data

Clustering and Data Mining in R

Clustering and Data Mining in R Workshop Supplement Thomas Girke December 10, 2011 Introduction Data Preprocessing Data Transformations Distance Methods Cluster Linkage Hierarchical Clustering Approaches

MODULE 15 Clustering Large Datasets LESSON 34

MODULE 15 Clustering Large Datasets LESSON 34 Incremental Clustering Keywords: Single Database Scan, Leader, BIRCH, Tree 1 Clustering Large Datasets Pattern matrix It is convenient to view the input data

Territorial Analysis for Ratemaking. Philip Begher, Dario Biasini, Filip Branitchev, David Graham, Erik McCracken, Rachel Rogers and Alex Takacs

Territorial Analysis for Ratemaking by Philip Begher, Dario Biasini, Filip Branitchev, David Graham, Erik McCracken, Rachel Rogers and Alex Takacs Department of Statistics and Applied Probability University

Cluster analysis Cosmin Lazar. COMO Lab VUB

Cluster analysis Cosmin Lazar COMO Lab VUB Introduction Cluster analysis foundations rely on one of the most fundamental, simple and very often unnoticed ways (or methods) of understanding and learning,

2. Norm, distance, angle

L. Vandenberghe EE133A (Spring 2016) 2. Norm, distance, angle norm distance angle hyperplanes complex vectors 2-1 Euclidean norm (Euclidean) norm of vector a R n : a = a 2 1 + a2 2 + + a2 n = a T a if

Big Data from a Database Theory Perspective

Big Data from a Database Theory Perspective Martin Grohe Lehrstuhl Informatik 7 - Logic and the Theory of Discrete Systems A CS View on Data Science Applications Data System Users 2 Us Data HUGE heterogeneous

General Framework for an Iterative Solution of Ax b. Jacobi s Method

2.6 Iterative Solutions of Linear Systems 143 2.6 Iterative Solutions of Linear Systems Consistent linear systems in real life are solved in one of two ways: by direct calculation (using a matrix factorization,

Similarity Search in a Very Large Scale Using Hadoop and HBase

Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France

Example: Document Clustering. Clustering: Definition. Notion of a Cluster can be Ambiguous. Types of Clusterings. Hierarchical Clustering

Overview Prognostic Models and Data Mining in Medicine, part I Cluster Analsis What is Cluster Analsis? K-Means Clustering Hierarchical Clustering Cluster Validit Eample: Microarra data analsis 6 Summar

EFFICIENT K-MEANS CLUSTERING ALGORITHM USING RANKING METHOD IN DATA MINING

EFFICIENT K-MEANS CLUSTERING ALGORITHM USING RANKING METHOD IN DATA MINING Navjot Kaur, Jaspreet Kaur Sahiwal, Navneet Kaur Lovely Professional University Phagwara- Punjab Abstract Clustering is an essential

Monte Carlo analysis used for Contingency estimating.

Monte Carlo analysis used for Contingency estimating. Author s identification number: Date of authorship: July 24, 2007 Page: 1 of 15 TABLE OF CONTENTS: LIST OF TABLES:...3 LIST OF FIGURES:...3 ABSTRACT:...4

K-Means Clustering. Clustering and Classification Lecture 8

K-Means Clustering Clustering and Lecture 8 Today s Class K-means clustering: What it is How it works What it assumes Pitfalls of the method (locally optimal results) 2 From Last Time If you recall the

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda

Clustering 1/46 Agenda Introduction Distance K-nearest neighbors Hierarchical clustering Quick reference 2/46 1 Introduction It seems logical that in a new situation we should act in a similar way as in

Distance based clustering

// Distance based clustering Chapter ² ² Clustering Clustering is the art of finding groups in data (Kaufman and Rousseeuw, 99). What is a cluster? Group of objects separated from other clusters Means

Distributed computing: index building and use

Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput

OLTP Compared With OLAP

OLTP Compared With OLAP On Line Transaction Processing OLTP Maintains a database that is an accurate model of some realworld enterprise. Supports day-to-day operations. Characteristics: Short simple transactions

Operation Count; Numerical Linear Algebra

10 Operation Count; Numerical Linear Algebra 10.1 Introduction Many computations are limited simply by the sheer number of required additions, multiplications, or function evaluations. If floating-point

Entropy and Information Gain

Entropy and Information Gain The entropy (very common in Information Theory) characterizes the (im)purity of an arbitrary collection of examples Information Gain is the expected reduction in entropy caused

MATH 304 Linear Algebra Lecture 8: Inverse matrix (continued). Elementary matrices. Transpose of a matrix.

MATH 304 Linear Algebra Lecture 8: Inverse matrix (continued). Elementary matrices. Transpose of a matrix. Inverse matrix Definition. Let A be an n n matrix. The inverse of A is an n n matrix, denoted

SoSe 2014: M-TANI: Big Data Analytics

SoSe 2014: M-TANI: Big Data Analytics Lecture 4 21/05/2014 Sead Izberovic Dr. Nikolaos Korfiatis Agenda Recap from the previous session Clustering Introduction Distance measures Hierarchical Clustering

The PageRank Citation Ranking: Bring Order to the Web

The PageRank Citation Ranking: Bring Order to the Web presented by: Xiaoxi Pang 25.Nov 2010 1 / 20 Outline Introduction A ranking for every page on the Web Implementation Convergence Properties Personalized

Descriptive Data Summarization

Descriptive Data Summarization (Understanding Data) First: Some data preprocessing problems... 1 Missing Values The approach of the problem of missing values adopted in SQL is based on nulls and three-valued

Clustering. 15-381 Artificial Intelligence Henry Lin. Organizing data into clusters such that there is

Clustering 15-381 Artificial Intelligence Henry Lin Modified from excellent slides of Eamonn Keogh, Ziv Bar-Joseph, and Andrew Moore What is Clustering? Organizing data into clusters such that there is

Credibility and Pooling Applications to Group Life and Group Disability Insurance

Credibility and Pooling Applications to Group Life and Group Disability Insurance Presented by Paul L. Correia Consulting Actuary paul.correia@milliman.com (207) 771-1204 May 20, 2014 What I plan to cover

MathQuest: Linear Algebra. 1. Which of the following matrices does not have an inverse?

MathQuest: Linear Algebra Matrix Inverses 1. Which of the following matrices does not have an inverse? 1 2 (a) 3 4 2 2 (b) 4 4 1 (c) 3 4 (d) 2 (e) More than one of the above do not have inverses. (f) All

Using Data Mining for Mobile Communication Clustering and Characterization

Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer

Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28

Recognition Topics that we will try to cover: Indexing for fast retrieval (we still owe this one) History of recognition techniques Object classification Bag-of-words Spatial pyramids Neural Networks Object

CHAPTER 3 DATA MINING AND CLUSTERING

CHAPTER 3 DATA MINING AND CLUSTERING 3.1 Introduction Nowadays, large quantities of data are being accumulated. The amount of data collected is said to be almost doubled every 9 months. Seeking knowledge

E3: PROBABILITY AND STATISTICS lecture notes

E3: PROBABILITY AND STATISTICS lecture notes. Contents: 1 PROBABILITY THEORY; 1.1 Experiments and random events; 1.2 Certain event. Impossible event

GLM, insurance pricing & big data: paying attention to convergence issues.

GLM, insurance pricing & big data: paying attention to convergence issues. Michaël NOACK - michael.noack@addactis.com Senior consultant & Manager of ADDACTIS Pricing Copyright 2014 ADDACTIS Worldwide.

CSCI 5417 Information Retrieval Systems Jim Martin

CSCI 5417 Information Retrieval Systems Jim Martin Lecture 9 9/20/2011 Today 9/20 Where we are MapReduce/Hadoop Probabilistic IR Language models LM for ad hoc retrieval 1 Where we are... Basics of ad

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Clustering Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Data Mining Cluster Analysis: Basic Concepts and Algorithms Lecture Notes for Chapter 8 Introduction to Data Mining by Tan, Steinbach, Kumar Clustering Algorithms K-means and its variants Hierarchical clustering

Linear Dependence Tests

Linear Dependence Tests The book omits a few key tests for checking the linear dependence of vectors. These short notes discuss these tests, as well as the reasoning behind them. Our first test checks

Topics in basic DBMS course

Topics in basic DBMS course Database design Transaction processing Relational query languages (SQL), calculus, and algebra DBMS APIs Database tuning (physical database design) Basic query processing (ch

Data Mining and Clustering Techniques

DRTC Workshop on Semantic Web 8th-10th December, 2003 DRTC, Bangalore Paper: K Data Mining and Clustering Techniques I. K. Ravichandra Rao Professor and Head Documentation Research and Training Center

2. DATA AND EXERCISES (Geos2911 students please read page 8) 2.1 Data set The data set available to you is an Excel spreadsheet file called cyclones.xls. The file consists of 3 sheets. Only the third is

NVIVO 10 WORKSHOP II. Hui Bian Office for Faculty Excellence

NVIVO 10 WORKSHOP II Hui Bian Office for Faculty Excellence Memo Memos are a type of document that enable you to record the ideas, insights, interpretations or growing understanding of the material in

Building Data Cubes and Mining Them. Jelena Jovanovic Email: jeljov@fon.bg.ac.yu

Building Data Cubes and Mining Them Jelena Jovanovic Email: jeljov@fon.bg.ac.yu KDD Process KDD is an overall process of discovering useful knowledge from data. Data mining is a particular step in the

Big Data Processing with Google's MapReduce. Alexandru Costan

Big Data Processing with Google's MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google: