Clustering Big Data. Efficient Data Mining Technologies. J Singh and Teresa Brooks. June 4, 2015
2 Hello Bulgaria
- A website with thousands of pages...
  - Some pages are identical to other pages
  - Some pages are nearly identical to other pages: same text, different pictures
- We want smart indexing of the collection
  - Save just one copy of the duplicate pages
  - Save one copy of the nearly duplicate pages
  - Filter out similar documents when returning search results
- And we want to keep the index up to date
3 The Naïve Way to Address this Challenge
- Represent each document as a dot in d-dimensional space
- Run a k-means algorithm on the document set
  - Resulting in k clusters
- When presented with a new document
  - Find the nearest cluster
  - Find the documents within the nearest cluster that are nearest to the document in question
    - Can be skipped if the cluster is small enough, i.e., k is large enough that everything in the cluster is close!
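The lookup described on this slide can be sketched in a few lines. This is only an illustration, assuming the k-means centroids and cluster memberships have already been computed; the names `nearest_centroid`, `nearest_docs`, and the toy 2-dimensional vectors are hypothetical and not taken from the talk.

```python
import math

def nearest_centroid(doc_vec, centroids):
    """Return the index of the centroid closest to doc_vec (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(doc_vec, centroids[i]))

def nearest_docs(doc_vec, cluster_docs, top_n=3):
    """Rank the (name, vector) pairs of one cluster by distance to doc_vec."""
    ranked = sorted(cluster_docs, key=lambda nv: math.dist(doc_vec, nv[1]))
    return [name for name, _ in ranked[:top_n]]

# Toy data: two precomputed clusters in 2-dimensional space (assumed, not real output).
centroids = [(0.0, 0.0), (10.0, 10.0)]
clusters = {
    0: [("a", (0.5, 0.2)), ("b", (1.0, 1.5))],
    1: [("c", (9.0, 9.5)), ("d", (11.0, 10.0))],
}

new_doc = (8.5, 9.0)
cluster_id = nearest_centroid(new_doc, centroids)                # step 1: nearest cluster
matches = nearest_docs(new_doc, clusters[cluster_id], top_n=1)   # step 2: nearest docs in it
print(cluster_id, matches)
```

Note that step 1 compares the new document against all k centroids, which is one reason a large k becomes expensive.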
4 The Naïve Way has conceptual problems
- No good way to decide optimal k
  - All documents have to be re-clustered if we want to change k
- Forces a document to be in a single cluster
  - In practice, a document can be similar to multiple clusters
- All clusters are roughly the same size
  - In practice, this terrain is lumpy: some documents are one-of-a-kind and others are similar to many others.
5 The Naïve Way has technical problems
- End result is subject to initial choice of centroids
  - Leads to results not being repeatable
- Performance is O(nk), or worse!
  - Especially unfortunate because we want k to be large
- Algorithm is not easily adapted to map/reduce
  - We need a pipeline of map/reduce jobs to compute it
6 Any Alternatives?
- Clustering has been picked over quite well due to its combination of interesting math and wide applicability
- Two dominant types have emerged:
  - Hierarchical clustering
  - Partitional clustering (e.g., k-means)
- k-means variations based on
  - Choice of initial centroids
  - Choice of k
  - Parameters at each iteration
7 Another line of inquiry: Nearest Neighbor
- Based on partitioning the search space
  - Quad trees
  - kd-trees
  - Locality-Sensitive Hashing
- Hash functions are locality-sensitive if, for a random hash function h and any pair of points p, q:
  - Pr[h(p)=h(q)] is high if p is close to q
  - Pr[h(p)=h(q)] is low if p is far from q
8 More on Nearest Neighbor
- Locality-Sensitive Hashing
  - Hash functions are locality-sensitive if, for a random hash function h and any pair of points p, q, we have:
    - Pr[h(p)=h(q)] is high if p is close to q
    - Pr[h(p)=h(q)] is low if p is far from q
  - Indyk-Motwani
9 The LSH Idea
- Treat items as vectors in d-dimensional space.
- Draw k random hyperplanes in that space.
- For each hyperplane: is each vector on the (0) side of the hyperplane or the (1) side?
  - Hash(Item 1) = 000
  - Hash(Item 3) = 101
- Hashes each item into a number
- The magic is in choosing h1, h2, h3
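A minimal sketch of the hyperplane idea, assuming each hyperplane passes through the origin and is represented by a random Gaussian normal vector. The helper names `make_hyperplanes` and `hyperplane_hash` and the sample vectors are hypothetical; the slide does not say how the hyperplanes are drawn.

```python
import random

def make_hyperplanes(k, d, seed=0):
    """Draw k random hyperplanes through the origin, each given by its normal vector."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(k)]

def hyperplane_hash(vec, planes):
    """Build a k-bit number: one bit per hyperplane, 1 if vec is on its positive side."""
    bits = 0
    for normal in planes:
        dot = sum(n * x for n, x in zip(normal, vec))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

planes = make_hyperplanes(k=3, d=4)
item1 = [1.0, 0.5, -0.2, 0.3]
item2 = [2.0, 1.0, -0.4, 0.6]     # same direction as item1, so same side of every plane
h1 = hyperplane_hash(item1, planes)
h2 = hyperplane_hash(item2, planes)
print(format(h1, "03b"), format(h2, "03b"))
```

Vectors pointing in similar directions tend to fall on the same side of most hyperplanes, so they tend to share hash bits; this particular family is associated with cosine similarity rather than Jaccard similarity.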
10 The LSH Hash Code Idea
- Breaks d-dimensional space into proximity-polyhedra.
- Each purple block represents a document
- Each bucket represents a group of alike docs
- Docs within each bucket still need to be compared to see which ones are the closest
11 A Brief History of LSH
- Origins at Stanford (1998)
- Continuing research in universities
  - Stanford, MIT, Rutgers, Cornell, ...
- Continuing research in industry
  - Intel, Microsoft, Google, ...
- Textbook: A. Rajaraman and J. Ullman (2010).
- Our contribution: An extensible implementation for large datasets
12 Choosing hash functions
Introducing minhash
1. Sample each document to get its shingles (small fragments)
   - "Mary had a" → "mary", "ary ", "ry h", "y ha", " had", ...
   - "CTAGTATAAA" → "CTAGTATA", "TAGTATAA", "AGTATAAA"
   - "now is the time" → "now is", "is the", "the time"
2. Calculate the hash value for every shingle.
3. Store the minimum hash value found in step 2.
4. Repeat steps 2 and 3 with different hash algorithms 199 more times to get a total of 200 minhash values.
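The four steps can be sketched as follows. This is an illustration under stated assumptions, not the OpenLSH code: the "different hash algorithms" are simulated by prefixing a seed to each shingle before a SHA-1 hash, shingles are 4 characters long, and the function names `char_shingles` and `minhash_signature` are hypothetical. The text must be at least `size` characters long so that the shingle set is not empty.

```python
import hashlib

def char_shingles(text, size=4):
    """Step 1: the set of all overlapping character fragments of the given size."""
    text = text.lower()
    return {text[i:i + size] for i in range(len(text) - size + 1)}

def seeded_hash(shingle, seed):
    """One member of a family of hash functions, selected by seed (an assumption)."""
    digest = hashlib.sha1(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(shingles, num_hashes=200):
    """Steps 2 to 4: for each hash function, keep the minimum hash over all shingles."""
    return [min(seeded_hash(s, seed) for s in shingles)
            for seed in range(num_hashes)]

shingles = char_shingles("Mary had a")
sig_a = minhash_signature(char_shingles("Mary had a little lamb"))
sig_b = minhash_signature(char_shingles("Mary had a little lamp"))
matching = sum(a == b for a, b in zip(sig_a, sig_b))
print(sorted(shingles))
print(f"{matching} of {len(sig_a)} minhash values agree")
```

The two sample sentences differ in a single character, so most of their shingles coincide and most of the 200 minhash values should agree.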
13 Interesting thing about minhashes
- The resulting minhashes are 200 integer values representing a random selection of shingles.
- Property of minhashes:
  - If the minhashes for two docs are the same, their shingles are likely to be the same
  - If the shingles for two docs are the same, the docs themselves are likely to be the same
- Beware: minhash is specific to a particular similarity measure, Jaccard similarity
  - Other hash families exist for other similarity measures
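The property on this slide can be stated more precisely. Assuming each hash function h behaves like a random permutation of the shingle universe, the probability that two documents share a minhash value equals their Jaccard similarity, so the fraction of agreeing minhashes estimates it. The symbols S_A, S_B (shingle sets) and m_i (the i-th minhash) are notation introduced here, not taken from the slides.

```latex
\Pr\Bigl[\min_{x \in S_A} h(x) = \min_{x \in S_B} h(x)\Bigr]
  = \frac{|S_A \cap S_B|}{|S_A \cup S_B|}
  = J(A, B),
\qquad
\hat{J}(A, B) = \frac{\bigl|\{\, i : m_i(A) = m_i(B) \,\}\bigr|}{200}
```

With 200 minhashes, two documents whose shingle sets have Jaccard similarity 0.9 would be expected to agree on about 180 values.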
14 All 200 minhashes must match?
- If all minhashes match, it implies a strong similarity between docs.
- To catch most cases with weaker similarity
  - Don't compare all minhashes at once; compare them in bands.
  - Candidate pairs are those that hash to the same bucket for at least 1 band.
  - Sometimes one band will reject a pair and another band will consider it a candidate
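A small sketch of banding, assuming a signature is a plain list of integers and that a band's bucket is simply the tuple of its values together with the band index. The names `band_keys` and `candidate_pairs` and the 6-value toy signatures are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def band_keys(signature, bands):
    """Split a signature into equal bands; each band yields one bucket key."""
    rows = len(signature) // bands
    return [(b, tuple(signature[b * rows:(b + 1) * rows])) for b in range(bands)]

def candidate_pairs(signatures, bands):
    """Pairs of documents that land in the same bucket for at least one band."""
    buckets = defaultdict(set)
    for name, sig in signatures.items():
        for key in band_keys(sig, bands):
            buckets[key].add(name)
    pairs = set()
    for names in buckets.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs

signatures = {
    "doc1": [1, 2, 3, 4, 5, 6],
    "doc2": [1, 2, 9, 9, 5, 6],   # agrees with doc1 in bands 0 and 2, not in band 1
    "doc3": [7, 7, 7, 7, 7, 7],
}
pairs = candidate_pairs(signatures, bands=3)
print(pairs)
```

Here band 1 rejects the pair (doc1, doc2) while bands 0 and 2 accept it, which is the situation the last bullet describes.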
15 LSH Involves a Tradeoff
- Pick the number of minhashes, the number of bands, and the number of rows per band to balance false positives/negatives.
- False positives: need to examine more pairs that are not really similar. More processing resources, more time.
- False negatives: failed to examine pairs that were similar, didn't find all similar results. But got done faster!
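The tradeoff can be made quantitative with the standard banding analysis, as presented in the Rajaraman and Ullman textbook. Assume two documents have Jaccard similarity s and the signature is split into b bands of r rows each, with each minhash agreeing independently with probability s. The symbols s, b, r, and s* are notation introduced here.

```latex
P(\text{candidate} \mid s) = 1 - \bigl(1 - s^{r}\bigr)^{b},
\qquad
s^{*} \approx \left(\frac{1}{b}\right)^{1/r}
```

The curve is S-shaped in s, and s* is roughly the similarity at which it rises most steeply. More bands (smaller r) push the threshold down and admit more false positives; fewer bands (larger r) push it up and cause more false negatives.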
16 Summary
- Mine the data and place members into hash buckets
- When you need to find a match, hash it and possible nearest neighbors will be in one of b buckets.
- Algorithm performance: O(n)
17 Going Beyond k-means Demo J Singh and Teresa Brooks March 17, 2015
18 Peerbelt Results Example
19 Database Architecture Requirements
- Need a very large range of bucket numbers
  - Bucket Numbers in our implementation are to
  - Most buckets are empty
  - Empty buckets must not take any space in the database
- Some buckets have a lot of documents in them; we need to be able to locate all of them
- To find documents similar to a given document,
  - Bucketize the document, then find other documents in the same buckets
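These requirements amount to a sparse mapping from bucket number to the set of documents in it. The sketch below is an in-memory stand-in, not the OpenLSH schema: the class `BucketStore` and its methods are hypothetical, and a real deployment would keep the mapping in a database such as the ones named on the next slide.

```python
from collections import defaultdict

class BucketStore:
    """In-memory stand-in for a sparse bucket table (hypothetical API)."""

    def __init__(self):
        # Only buckets that hold at least one document ever get a key.
        self._buckets = defaultdict(set)

    def add(self, doc_id, bucket_ids):
        """Record that doc_id falls into each of the given buckets."""
        for bkt in bucket_ids:
            self._buckets[bkt].add(doc_id)

    def bucket_count(self):
        """Number of non-empty buckets actually stored."""
        return len(self._buckets)

    def similar_to(self, bucket_ids, exclude=None):
        """All documents sharing at least one bucket with the query."""
        found = set()
        for bkt in bucket_ids:
            # .get() avoids creating an entry for an empty bucket.
            found |= self._buckets.get(bkt, set())
        found.discard(exclude)
        return found

store = BucketStore()
store.add("a", [2**40 + 1, 17])   # bucket numbers may be very large
store.add("b", [17, 99])
store.add("c", [5])
neighbours = store.similar_to([17, 3], exclude="a")
print(neighbours, store.bucket_count())
```

Storage grows with the number of occupied buckets, not with the size of the bucket-number range, which is the property the slide asks for.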
20 Implementation: OpenLSH
- We started OpenLSH to provide a framework for LSH
- Factor out the database
  - Started on Google App Engine
  - Virtualized interface to make it work on Cassandra
- Factor out the calculation engine
  - Started on Google App Engine
  - Can plug in Google MapReduce
  - Ported to run in batch mode on Cassandra
21 Using OpenLSH
- We're looking for one or two interesting use cases
- Application areas:
  - Near de-duplication (covered with Peerbelt's data)
  - Stocks that move independent of the herd
  - Filtering unique stories from the news
- Contact us to discuss
22 What you can do
- For more information:
  - Links to code and data set are included
- Run on App Engine
  - Minimum setup required
- Adapt it to your environment and need
  - If you need help, send or create a GitHub issue.
  - Send us a pull request for any improvements you make
23 Thank you
- J Singh
  - Principal, DataThinks
  - Algorithms for j. datathinks. org
  - Adj. Prof, Computer Science, WPI
- Teresa Brooks
  - Senior Software Xero
24 Going Beyond k-means Appendix Slides J Singh and Teresa Brooks June 4, 2015
25 Running LSH on a cluster of machines
- Can be implemented on a Map Reduce architecture

Map step:

    def map(String docname, String doc):
        # [ skipped ]
        for bkt in buckets:
            emit(bkt, docname)

Reduce step:

    def reduce(String bkt, Iterator docnames):
        # [ skipped ]
        for dn in docnames:
            emit(bkt, dn)
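The map and reduce steps can be simulated in a single process to see what the job emits. This is a sketch under assumptions: the part marked "[ skipped ]" (computing a document's buckets) is replaced by precomputed bucket lists, the shuffle phase is a dictionary, and the names `map_doc`, `reduce_bucket`, and `run_job` are hypothetical rather than part of any MapReduce framework.

```python
from collections import defaultdict

def map_doc(docname, buckets):
    """Map step: emit one (bucket, docname) pair per bucket the document falls into."""
    for bkt in buckets:
        yield bkt, docname

def reduce_bucket(bkt, docnames):
    """Reduce step: emit every document that ended up in this bucket."""
    for dn in sorted(docnames):
        yield bkt, dn

def run_job(doc_buckets):
    """Single-process stand-in for the framework: map, shuffle by key, reduce."""
    shuffled = defaultdict(list)
    for docname, buckets in doc_buckets.items():
        for bkt, dn in map_doc(docname, buckets):
            shuffled[bkt].append(dn)
    output = []
    for bkt in sorted(shuffled):
        output.extend(reduce_bucket(bkt, shuffled[bkt]))
    return output

# Bucket lists are assumed inputs; in the slide they come from the skipped hashing code.
result = run_job({"doc1": [3, 7], "doc2": [7], "doc3": [3, 9]})
print(result)
```

After the shuffle, each reducer sees all documents of one bucket, so candidate pairs can be formed locally without any reducer needing the full collection.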
26 Extending OpenLSH (p1)
- Distance Measures
  - The minhash family of functions using Jaccard Distance is just one of several families of functions that can be used with the LSH technique.
  - Jaccard Similarity is a measure of how close sets are. The real distance (closeness) measure for sets is Jaccard Distance, which is 1 minus the Jaccard Similarity.
- Other Distance Measures:
  - Euclidean Distance (used in spaces with dimensions)
  - Cosine Distance (used in spaces with dimensions)
  - Edit Distance (used when two points are strings)
  - Hamming Distance (cat → kat → kit)
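Three of these measures are short enough to sketch directly. The function names are hypothetical, and the edit distance shown is the common Levenshtein variant (insertions, deletions, and substitutions each cost 1), which is an assumption since the slide does not say which variant is meant.

```python
def jaccard_distance(a, b):
    """1 minus the Jaccard similarity of two sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def hamming_distance(s, t):
    """Number of positions at which two equal-length strings differ."""
    if len(s) != len(t):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(c1 != c2 for c1, c2 in zip(s, t))

def edit_distance(s, t):
    """Levenshtein distance via dynamic programming, one row at a time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # delete cs
                            curr[j - 1] + 1,            # insert ct
                            prev[j - 1] + (cs != ct)))  # substitute or keep
        prev = curr
    return prev[-1]

print(jaccard_distance({"a", "b", "c"}, {"b", "c", "d"}))
print(hamming_distance("cat", "kat"), hamming_distance("cat", "kit"))
print(edit_distance("kitten", "sitting"))
```

Each measure needs its own locality-sensitive hash family; minhash only applies to the Jaccard case.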
27 Extending OpenLSH (p2)
- Parallelize it
  - We suggested a potential map/reduce algorithm
  - Another paper: Streaming Similarity Search over one Billion Tweets using Parallel Locality-Sensitive Hashing, Sundaram et al., 2014
  - App Engine provides the map reduce infrastructure to serve as foundation
28 LSH Tradeoff Example
- If we had fewer than 20 bands (and more rows per band),
  - fewer pairs would be selected for comparison,
  - the number of false positives would go down,
  - but the number of false negatives would go up.
- Performance would go up, but so would the error rate!
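A quick numeric check of this example, using the standard banding formula 1 - (1 - s^r)^b for the probability that a pair with Jaccard similarity s becomes a candidate under b bands of r rows. The 20-band case is assumed to split the 200 minhashes into 10 rows per band; the 10-band, 20-row alternative is a hypothetical comparison point, not a configuration from the talk.

```python
def candidate_probability(s, bands, rows):
    """Chance that a pair with similarity s shares a bucket in at least one band."""
    return 1.0 - (1.0 - s ** rows) ** bands

table = []
for s in (0.6, 0.7, 0.8, 0.9):
    p20 = candidate_probability(s, bands=20, rows=10)   # 20 bands x 10 rows = 200
    p10 = candidate_probability(s, bands=10, rows=20)   # 10 bands x 20 rows = 200
    table.append((s, p20, p10))
    print(f"s={s:.1f}  20 bands: {p20:.3f}  10 bands: {p10:.3f}")
```

At every similarity level the 10-band setting selects fewer pairs, which lowers false positives but also misses more pairs that really are similar.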
29 A Brief History of LSH
- Origins at Stanford
  - Indyk, Piotr; Motwani, Rajeev (1998).
  - Gionis, A.; Indyk, P.; Motwani, R. (1999).
- Continuing work at MIT (Indyk)
  - Parallel LSH
- Textbook: A. Rajaraman and J. Ullman (2010).
- Our contribution: An extensible implementation for large datasets
More informationSpark and the Big Data Library
Spark and the Big Data Library Reza Zadeh Thanks to Matei Zaharia Problem Data growing faster than processing speeds Only solution is to parallelize on large clusters» Wide use in both enterprises and
More informationSocial Media Mining. Data Mining Essentials
Introduction Data production rate has been increased dramatically (Big Data) and we are able store much more data than before E.g., purchase data, social media data, mobile phone data Businesses and customers
More informationLambda Architecture. CSCI 5828: Foundations of Software Engineering Lecture 29 12/09/2014
Lambda Architecture CSCI 5828: Foundations of Software Engineering Lecture 29 12/09/2014 1 Goals Cover the material in Chapter 8 of the Concurrency Textbook The Lambda Architecture Batch Layer MapReduce
More informationEstimating PageRank Values of Wikipedia Articles using MapReduce
Estimating PageRank Values of Wikipedia Articles using MapReduce Due: Sept. 30 Wednesday 5:00PM Submission: via Canvas, individual submission Instructor: Sangmi Pallickara Web page: http://www.cs.colostate.edu/~cs535/assignments.html
More informationCloud and Big Data Summer School, Stockholm, Aug. 2015 Jeffrey D. Ullman
Cloud and Big Data Summer School, Stockholm, Aug. 2015 Jeffrey D. Ullman 2 In a DBMS, input is under the control of the programming staff. SQL INSERT commands or bulk loaders. Stream management is important
More informationStep 5: This is the final step in which I observe how many times each word is associated to a word. And
Algorithm: First I decide some random vectors in a mapreduce program. So for each words context I will make a vector (we decided not to use TF as in most of the cases TF will be 1 and hence using it doesnt
More informationValidity Measure of Cluster Based On the Intra-Cluster and Inter-Cluster Distance
International Journal of Electronics and Computer Science Engineering 2486 Available Online at www.ijecse.org ISSN- 2277-1956 Validity Measure of Cluster Based On the Intra-Cluster and Inter-Cluster Distance
More informationMachine learning for algo trading
Machine learning for algo trading An introduction for nonmathematicians Dr. Aly Kassam Overview High level introduction to machine learning A machine learning bestiary What has all this got to do with
More informationCluster Analysis for Optimal Indexing
Proceedings of the Twenty-Sixth International Florida Artificial Intelligence Research Society Conference Cluster Analysis for Optimal Indexing Tim Wylie, Michael A. Schuh, John Sheppard, and Rafal A.
More informationTHE concept of Big Data refers to systems conveying
EDIC RESEARCH PROPOSAL 1 High Dimensional Nearest Neighbors Techniques for Data Cleaning Anca-Elena Alexandrescu I&C, EPFL Abstract Organisations from all domains have been searching for increasingly more
More informationGoing Big in Data Dimensionality:
LUDWIG- MAXIMILIANS- UNIVERSITY MUNICH DEPARTMENT INSTITUTE FOR INFORMATICS DATABASE Going Big in Data Dimensionality: Challenges and Solutions for Mining High Dimensional Data Peer Kröger Lehrstuhl für
More information