RANDOM PROJECTIONS FOR SEARCH AND MACHINE LEARNING


1 RANDOM PROJECTIONS FOR SEARCH AND MACHINE LEARNING Stefan Savev, Berlin Buzzwords, June 2015
2 KEYWORD-BASED SEARCH Document data: 300 unique words per document; words in vocabulary; data sparsity: 99.9%. Words are strong query features. STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE
3 IMAGE SEARCH BY EXAMPLE Image data: 800 pixels; 200 black pixels; data sparsity: 80%. Pixels are weak query features.
4 TWO KINDS OF SEARCH
Data: sparse vs. dense
Query size: short vs. long
Use cases: keyword search vs. image search, semantic search, recommendations
Search method: inverted index (Lucene) vs. random projections
5 OUTLINE 1. High-dimensional data (images, text, clicks) 2. Random Projections: Why? What? How? 3. Image Search Benchmark
6 HIGH-DIMENSIONAL DATA Images
7 HIGH-DIMENSIONAL DATA Images, Pixels
8 HIGH-DIMENSIONAL DATA Images, Pixels, Dimensions
9 MATRIX REPRESENTATION OF DATASET
10 MATRIX REPRESENTATION OF DATASET
11 HIGH-DIMENSIONAL DATA Text: "A beer can do everyone a favor." Dimensions: a, abstract, be, bee, beer, can, do, everyone
12 HIGH-DIMENSIONAL DATA Clicks
13 DENSE VS SPARSE MATRIX IMAGES 1000 PIXELS TEXT/CLICKS WORDS img_1 img_3 doc_1 doc_3 img_n doc_n CANNOT SEARCH WITH LUCENE CAN SEARCH WITH LUCENE
14 DENSE VS SPARSE MATRIX IMAGES 1000 PIXELS TEXT/CLICKS WORDS TEXT/CLICKS 500 DIMENSIONS img_1 img_3 doc_1 doc_3 img_n doc_n CANNOT SEARCH WITH LUCENE CAN SEARCH WITH LUCENE WITH LUCENE
15 DENSE VS SPARSE MATRIX IMAGES 1000 PIXELS TEXT/CLICKS WORDS TEXT/CLICKS 500 DIMENSIONS img_1 img_3 doc_1 doc_3 img_n doc_n CANNOT SEARCH WITH LUCENE CAN SEARCH WITH LUCENE WITH LUCENE WITH RANDOM PROJECTIONS WITH RANDOM PROJECTIONS WITH RANDOM PROJECTIONS
16 DIMENSIONALITY REDUCTION SVD: documents (doc_1, doc_3, doc_n) over word dimensions (beer, wine) are reduced to 500 dimensions (beer + wine)
17 DIMENSIONALITY REDUCTION = PATTERN DISCOVERY
18 APPLICATIONS Random Forest, Search (Nearest Neighbor), Random Projections, SVM, SVD, Boosting
19 RANDOM PROJECTIONS IN INDUSTRY Spotify for music recommendations: https://github.com/spotify/annoy. Etsy for user/product recommendations: "Style in the Long Tail: Discovering Unique Interests with Latent Variable Models in Large Scale Social E-commerce", Diane Hu, Rob Hall, Josh Attenberg. Referred to as Locality Sensitive Hashing.
20 OUTLINE 1. High-dimensional data (images, text, clicks) 2. Random Projections: Why? What? How? 3. Image Search Benchmark
21 HOW DOES IT WORK?
22 ILLUSTRATION ON ARTIFICIAL DATASET id_1 id_3 id_n
23 ILLUSTRATION ON ARTIFICIAL DATASET Nearest neighbor problem: find the closest point
24 ILLUSTRATION ON ARTIFICIAL DATASET Approximation: neighbors are in a circle with small radius
25 ILLUSTRATION ON ARTIFICIAL DATASET Core idea: random grids; dynamic grids (variably sized); random = cheap
26 PARTITION BY RANDOM SPLITS Hyperplane splits the dataset
27 PROJECTION = ONE-DIMENSIONAL VIEW OF DATASET A random direction (1 dimension)
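The two slides above (a hyperplane split, and a projection as a one-dimensional view) can be sketched together in a few lines. This is a minimal illustration, not the speaker's implementation: the Gaussian sampling of the direction, the median split point, and all function names are assumptions made for the example.

```python
import math
import random

def random_direction(dim, rng):
    """Unit vector with Gaussian components (assumed way to pick a random direction)."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def project(point, direction):
    """One-dimensional view of a point: its dot product with the direction."""
    return sum(p * d for p, d in zip(point, direction))

rng = random.Random(0)
points = [[rng.uniform(-1.0, 1.0) for _ in range(10)] for _ in range(100)]
direction = random_direction(10, rng)
projections = [project(p, direction) for p in points]

# Split at the median projection (an assumption): this corresponds to a
# hyperplane orthogonal to the random direction that halves the dataset.
threshold = sorted(projections)[len(projections) // 2]
left = [i for i, value in enumerate(projections) if value < threshold]
right = [i for i, value in enumerate(projections) if value >= threshold]
```

Because the direction has unit length, two projected values can never be farther apart than the original points were, which is why nearby points tend to stay on the same side of the split.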
28 PLATO'S ALLEGORY OF THE CAVE A random direction (1 dimension)
29 PARTITIONING BY PROJECTING ON RANDOM LINES
30 PARTITIONING BY PROJECTING ON RANDOM LINES
31 ALGORITHM VISUALIZATION
32 ALGORITHM VISUALIZATION
33 ALGORITHM VISUALIZATION
34 ALGORITHM VISUALIZATION
35 ALGORITHM VISUALIZATION When two points are close, they are likely to end up in the same partition, even under random partitioning
36 ALGORITHM VISUALIZATION When two points are close, they are LIKELY to end up in the same partition, even under random partitioning
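Repeating the random split recursively gives a tree whose leaves are the partitions shown in the visualization. The sketch below is one possible reading of the slides, with several assumptions: Gaussian random directions, a median split at every node, a `leaf_size` stopping rule, and dictionary-based nodes. None of these details are confirmed by the talk.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_tree(points, ids, leaf_size, rng):
    """Recursively split ids at the median projection on a random direction."""
    if len(ids) <= leaf_size:
        return {"leaf": list(ids)}
    dim = len(points[ids[0]])
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    ordered = sorted(ids, key=lambda i: dot(points[i], direction))
    mid = len(ordered) // 2
    return {
        "direction": direction,
        "threshold": dot(points[ordered[mid]], direction),
        "left": build_tree(points, ordered[:mid], leaf_size, rng),
        "right": build_tree(points, ordered[mid:], leaf_size, rng),
    }

def find_leaf(tree, query):
    """Follow the splits down to the partition that contains the query."""
    while "leaf" not in tree:
        goes_left = dot(query, tree["direction"]) < tree["threshold"]
        tree = tree["left"] if goes_left else tree["right"]
    return tree["leaf"]

rng = random.Random(0)
points = [[rng.uniform(-1.0, 1.0) for _ in range(5)] for _ in range(200)]
tree = build_tree(points, list(range(len(points))), leaf_size=10, rng=rng)
neighbors = find_leaf(tree, points[0])  # ids sharing a partition with point 0
```

A single tree only returns candidates from one leaf, so a true neighbor that fell just across a split is missed; that is the noise the next slides address with more trees.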
37 MULTIPLE TREES Random tree partitioning is noisy. Build more trees.
38 ADDING MORE TREES Tree 1; Trees 1 and 2; Trees 1, 2 and 3
39 ADDING MORE TREES Trees 150 Trees
40 CONSTRAINING THE NEAREST NEIGHBOR REGION THRESHOLD = in how many trees a NN candidate appears. Threshold = 1; Threshold = 60; Threshold = 20
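The threshold idea can be sketched as vote counting over many trees: a point is kept as a nearest-neighbor candidate only if it shares a leaf with the query in at least `threshold` trees. The tree construction below (median splits on Gaussian directions) and the names `build_tree`, `find_leaf` and `candidates` are assumptions for illustration, not the speaker's code.

```python
import random
from collections import Counter

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def build_tree(points, ids, leaf_size, rng):
    """Random projection tree with median splits (assumed construction)."""
    if len(ids) <= leaf_size:
        return {"leaf": list(ids)}
    direction = [rng.gauss(0.0, 1.0) for _ in range(len(points[ids[0]]))]
    ordered = sorted(ids, key=lambda i: dot(points[i], direction))
    mid = len(ordered) // 2
    return {
        "direction": direction,
        "threshold": dot(points[ordered[mid]], direction),
        "left": build_tree(points, ordered[:mid], leaf_size, rng),
        "right": build_tree(points, ordered[mid:], leaf_size, rng),
    }

def find_leaf(tree, query):
    while "leaf" not in tree:
        goes_left = dot(query, tree["direction"]) < tree["threshold"]
        tree = tree["left"] if goes_left else tree["right"]
    return tree["leaf"]

def candidates(trees, query, threshold):
    """Ids that share a leaf with the query in at least `threshold` trees."""
    votes = Counter()
    for tree in trees:
        votes.update(find_leaf(tree, query))
    return {i for i, count in votes.items() if count >= threshold}

rng = random.Random(1)
points = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(500)]
all_ids = list(range(len(points)))
trees = [build_tree(points, all_ids, 20, rng) for _ in range(50)]

query = [0.1, -0.2]
loose = candidates(trees, query, threshold=1)   # union of all leaves: wide region
tight = candidates(trees, query, threshold=20)  # frequent co-occurrence: narrow region
```

Raising the threshold can only shrink the candidate set, which matches the slide's picture of the region tightening around the query.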
41 WE'VE GOT HIM
42 OUTLINE 1. High-dimensional data (images, text, clicks) 2. Random Projections: Why? What? How? 3. Image Search Benchmark
43 IMAGE SEARCH FOR PREDICTION
44 IMAGE SEARCH FOR PREDICTION Results: Image, Label
45 STAGE 1: INDEXING examples with labels; 784 pixels per image
46 STAGE 1: INDEXING examples with labels; 784 pixels per image
47 STAGE 1: INDEXING examples with labels; 784 pixels per image
    index = Index(...)
    for image in training_examples:
        labels[image.id] = image.label
        index.add_item(image.id, image.vector)
    index.build(...)
48 STAGE 2: SEARCH TO PREDICT THE LABEL Results: Image, Label; test examples
    index.load('.../file.index')
    ...
    image_vec = read_image(...)
    nn = 100  # number of nearest neighbors
    results = index.get_nns_by_vector(image_vec, nn)
    top_result = results[0]
    predicted_label = labels[top_result.id]
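The two stages can be tried end to end with a small stand-in for the index. `BruteForceIndex` below is hypothetical: it borrows the method names from the slide's pseudocode but performs exact linear search with no trees, and it returns plain ids rather than result objects with an `.id` field. The three-dimensional vectors and string labels are made-up toy data, not MNIST.

```python
import math

class BruteForceIndex:
    """Hypothetical stand-in for the slide's index: exact search, no trees."""

    def __init__(self):
        self.items = {}

    def add_item(self, item_id, vector):
        self.items[item_id] = list(vector)

    def build(self):
        pass  # brute force needs no precomputation

    def get_nns_by_vector(self, vector, n):
        def distance(item_id):
            stored = self.items[item_id]
            return math.sqrt(sum((a - b) ** 2 for a, b in zip(stored, vector)))
        return sorted(self.items, key=distance)[:n]

# Stage 1: indexing (toy data in place of labeled images).
training_examples = [
    (0, [0.0, 0.0, 1.0], "zero"),
    (1, [0.1, 0.0, 0.9], "zero"),
    (2, [1.0, 1.0, 0.0], "one"),
]
index = BruteForceIndex()
labels = {}
for image_id, vector, label in training_examples:
    labels[image_id] = label
    index.add_item(image_id, vector)
index.build()

# Stage 2: search, then read the label of the top result.
results = index.get_nns_by_vector([0.9, 1.0, 0.1], 2)
predicted_label = labels[results[0]]
```

Swapping the brute-force class for a tree-based index would keep the calling code the same while trading a little accuracy for speed, which is the trade-off the benchmark slides measure.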
49 TOOLS stefansavev.com/randomtrees
50 BENCHMARK IMAGE SEARCH Columns: dimensions, # trees, # NN candidates, accuracy on test, search time [ms/query]. Callout: bottleneck. Reference: https://github.com/stefansavev/randomprojectionstalk/
51 BENCHMARK IMAGE SEARCH Columns: dimensions, # trees, # NN candidates, accuracy on test, search time [ms/query]. Row label: after SVD. Reference: https://github.com/stefansavev/randomprojectionstalk/
52 BENCHMARK IMAGE SEARCH Columns: dimensions, # trees, # NN candidates, accuracy on test, search time [ms/query]. Row label: after SVD. Callout: bottleneck. Reference: https://github.com/stefansavev/randomprojectionstalk/
53 BENCHMARK IMAGE SEARCH stefansavev.com/randomtrees Columns: dimensions, # trees, # NN candidates, accuracy on test, search time [ms/query]. Row labels: after SVD; 100, after SVD (mean). Reference: https://github.com/stefansavev/randomprojectionstalk/
54 SAME FOUNDATION FOR SEARCH AND MACHINE LEARNING + Data point id
55 SAME FOUNDATION FOR SEARCH AND MACHINE LEARNING Histogram + label frequency
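One way to read these two slides: search returns the ids of the points in the matching partition, while prediction turns the same ids into a histogram of label frequencies and picks the most frequent label. The sketch below assumes that reading; the function name and the toy labels are illustrative only.

```python
from collections import Counter

def predict_from_neighbors(neighbor_ids, labels):
    """Build a label histogram over the neighbors and return the top label."""
    histogram = Counter(labels[i] for i in neighbor_ids)
    label, _count = histogram.most_common(1)[0]
    return label, histogram

# Toy data: labels of indexed points, and the ids a search returned.
labels = {0: "3", 1: "8", 2: "3", 3: "3", 4: "5"}
neighbor_ids = [0, 1, 2, 4]

predicted, histogram = predict_from_neighbors(neighbor_ids, labels)
```

The histogram also gives a rough confidence estimate: here two of the four neighbors agree on the predicted label.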
56 CONCLUSION Search in dense data; CHEAP method to explore data; makes algorithms FASTER (e.g. SVD, SVM)
57 THANK YOU! Stefan Savev. Blog: stefansavev.com
58 EXTRA SLIDES
59 REFERENCES
60 ADVANTAGES/DISADVANTAGES DIMENSIONALITY REDUCTION Advantages: semantic similarity; concentrate the signal. Disadvantages: no exact matches possible; adds noise.
61 RANDOM PROJECTIONS AS HASH FUNCTIONS
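A common way to turn random projections into a hash function is to keep only the sign of each projection, giving one bit per random hyperplane. The slide's own formula did not survive transcription, so the sketch below is an assumption based on the standard sign-of-projection scheme from locality sensitive hashing; `make_hash` and its parameters are hypothetical names.

```python
import random

def make_hash(dim, n_bits, rng):
    """Return a function mapping a vector to n_bits sign bits."""
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

    def hash_vector(v):
        # One bit per hyperplane: which side of the plane the vector lies on.
        return tuple(
            1 if sum(a * b for a, b in zip(plane, v)) >= 0 else 0
            for plane in planes
        )

    return hash_vector

rng = random.Random(42)
hash_vector = make_hash(dim=4, n_bits=8, rng=rng)

a = [1.0, 0.5, -0.2, 0.3]
scaled_a = [2.0 * x for x in a]   # same direction, different length
opposite = [-x for x in a]        # opposite direction

code_a = hash_vector(a)
code_scaled = hash_vector(scaled_a)
code_opposite = hash_vector(opposite)
```

Vectors pointing in the same direction get identical codes and opposite vectors get complementary codes; vectors at a small angle agree on most bits, so they are likely to land in the same bucket.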
62 EXAMPLE-BASED IMAGE SEARCH MNIST image dataset of digits: images are digits from 0 to 9; images for training; images for test; 28 x 28 pixels with values from 0 to; 784 (=28*28) features; dataset is already preprocessed. Goal is to predict the label of an image.
63 SOME IMPLEMENTATION TRICKS sparse random projections (combine not all dimensions but a small number of them); apply dimensionality reduction first; pick better random projections (the projected histogram is wide); project on multiple lines simultaneously (Hadamard matrix trick); reuse random projections via recombination; cut out a ball from the densest region; PCA may help; use the neighbors of neighbors of a point as additional similarity candidates; advanced search inside the trees with backtracking (k-d tree style); generate multiple versions of query/training data (e.g. for image data translate or rotate the image) [Link to Blog]
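The first trick, sparse random projections, can be illustrated by a direction that touches only a handful of dimensions. This is a sketch under assumptions: the slide does not say how the dimensions or weights are chosen, so the uniform sampling of `nonzeros` dimensions and the random +1/-1 weights below are illustrative choices, as are the function names and the synthetic 784-value vector.

```python
import random

def sparse_direction(dim, nonzeros, rng):
    """Pick a few dimensions and give each a random sign (assumed scheme)."""
    chosen = rng.sample(range(dim), nonzeros)
    return {i: rng.choice((-1.0, 1.0)) for i in chosen}

def project(point, direction):
    """Projection cost is proportional to nonzeros, not to the full dimension."""
    return sum(sign * point[i] for i, sign in direction.items())

rng = random.Random(7)
direction = sparse_direction(dim=784, nonzeros=16, rng=rng)

# Synthetic stand-in for a 784-pixel image with values in 0..255.
image = [float(i % 256) for i in range(784)]
value = project(image, direction)
```

With 16 nonzero entries, each projection reads 16 pixels instead of 784, which is where the speed-up would come from.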
64 PROJECTION Dot product (cosine, similarity); view of the dataset; overlap with pattern; geometry of search; foundation for search and machine learning
65 WHY RANDOM? Cheap (replace optimization with randomization): simply build more trees. If we build the perfect tree, it's just one tree; we need more and DIVERSE trees. In high dimensions all algorithms will have problems, so do the cheapest. With a few points/dimensions random is not the best choice.
66 DIMENSIONALITY REDUCTION SVD: mixing the columns of the original matrix to maximize signal to noise. Documents (doc_1, doc_3, doc_n) over word dimensions (beer, wine) are reduced to 500 dimensions (beer + wine).
67 PROJECTION = OVERLAP WITH A PATTERN OVERLAP = SUM = 1, more red; 1, more green
68 PROJECTION = OVERLAP WITH A PATTERN OVERLAP = SUM
69 PROJECTION = OVERLAP WITH A PATTERN OVERLAP (, ) = SUM
70 PROJECTION = OVERLAP WITH A PATTERN OVERLAP (, ) = SUM = 1, if more red pixels than white; 1, if more white pixels than red
71 PROJECTION = OVERLAP WITH A PATTERN OVERLAP (, ) = SUM = 1, if more red pixels than white; 1, if more white pixels than red
72 DATA POINT; SVD FEATURE; RANDOM FEATURE
73 PLATO'S ALLEGORY OF THE CAVE He then explains how the philosopher is like a prisoner who is freed from the cave and comes to understand that the shadows on the wall do not make up reality at all, as he can perceive the true form of reality rather than the mere shadows seen by the prisoners. Source:
74 SECRET SAUCE How to make it faster? How to make it more accurate? When does it work? Is there something better?
Multidimensional index structures Part I: motivation 144 Motivation: Data Warehouse A definition A data warehouse is a repository of integrated enterprise data. A data warehouse is used specifically for
More informationKnowledge Discovery and Data Mining 1 (VO) (707.003)
Knowledge Discovery and Data Mining 1 (VO) (707.003) Denis Helic KTI, TU Graz Oct 1, 2015 Denis Helic (KTI, TU Graz) KDDM1 Oct 1, 2015 1 / 55 Lecturer Name: Denis Helic Office: IWT, Inffeldgasse 13, 5th
More informationTopographic Change Detection Using CloudCompare Version 1.0
Topographic Change Detection Using CloudCompare Version 1.0 Emily Kleber, Arizona State University Edwin Nissen, Colorado School of Mines J Ramón Arrowsmith, Arizona State University Introduction CloudCompare
More informationLearning. CS461 Artificial Intelligence Pinar Duygulu. Bilkent University, Spring 2007. Slides are mostly adapted from AIMA and MIT Open Courseware
1 Learning CS 461 Artificial Intelligence Pinar Duygulu Bilkent University, Slides are mostly adapted from AIMA and MIT Open Courseware 2 Learning What is learning? 3 Induction David Hume Bertrand Russell
More informationAdaptive information source selection during hypothesis testing
Adaptive information source selection during hypothesis testing Andrew T. Hendrickson (drew.hendrickson@adelaide.edu.au) Amy F. Perfors (amy.perfors@adelaide.edu.au) Daniel J. Navarro (daniel.navarro@adelaide.edu.au)
More informationIntroduction to Machine Learning. Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011
Introduction to Machine Learning Speaker: Harry Chao Advisor: J.J. Ding Date: 1/27/2011 1 Outline 1. What is machine learning? 2. The basic of machine learning 3. Principles and effects of machine learning
More informationBig Data Analytics CSCI 4030
High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Web advertising
More informationImpact of Boolean factorization as preprocessing methods for classification of Boolean data
Impact of Boolean factorization as preprocessing methods for classification of Boolean data Radim Belohlavek, Jan Outrata, Martin Trnecka Data Analysis and Modeling Lab (DAMOL) Dept. Computer Science,
More informationExtend Table Lens for HighDimensional Data Visualization and Classification Mining
Extend Table Lens for HighDimensional Data Visualization and Classification Mining CPSC 533c, Information Visualization Course Project, Term 2 2003 Fengdong Du fdu@cs.ubc.ca University of British Columbia
More informationRecommender Systems: Contentbased, Knowledgebased, Hybrid. Radek Pelánek
Recommender Systems: Contentbased, Knowledgebased, Hybrid Radek Pelánek 2015 Today lecture, basic principles: contentbased knowledgebased hybrid, choice of approach,... critiquing, explanations,...
More informationEM Clustering Approach for MultiDimensional Analysis of Big Data Set
EM Clustering Approach for MultiDimensional Analysis of Big Data Set Amhmed A. Bhih School of Electrical and Electronic Engineering Princy Johnson School of Electrical and Electronic Engineering Martin
More informationTracking and Recognition in Sports Videos
Tracking and Recognition in Sports Videos Mustafa Teke a, Masoud Sattari b a Graduate School of Informatics, Middle East Technical University, Ankara, Turkey mustafa.teke@gmail.com b Department of Computer
More informationFeature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier
Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier D.Nithya a, *, V.Suganya b,1, R.Saranya Irudaya Mary c,1 Abstract  This paper presents,
More informationAnalysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j
Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j What is Kiva? An organization that allows people to lend small amounts of money via the Internet
More informationGrid Density Clustering Algorithm
Grid Density Clustering Algorithm Amandeep Kaur Mann 1, Navneet Kaur 2, Scholar, M.Tech (CSE), RIMT, Mandi Gobindgarh, Punjab, India 1 Assistant Professor (CSE), RIMT, Mandi Gobindgarh, Punjab, India 2
More informationPredictive Indexing for Fast Search
Predictive Indexing for Fast Search Sharad Goel Yahoo! Research New York, NY 10018 goel@yahooinc.com John Langford Yahoo! Research New York, NY 10018 jl@yahooinc.com Alex Strehl Yahoo! Research New York,
More informationMonday Morning Data Mining
Monday Morning Data Mining Tim Ruhe Statistische Methoden der Datenanalyse Outline:  data mining  IceCube  Data mining in IceCube Computer Scientists are different... Fakultät Physik Fakultät Physik
More informationText Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and CoFounder, Adsurgo LLC
Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and CoFounder, Adsurgo LLC 1. Introduction A popular rule of thumb suggests that
More informationRandom Forest Based Imbalanced Data Cleaning and Classification
Random Forest Based Imbalanced Data Cleaning and Classification Jie Gu Software School of Tsinghua University, China Abstract. The given task of PAKDD 2007 data mining competition is a typical problem
More informationAn Optical Sudoku Solver
An Optical Sudoku Solver Martin Byröd February 12, 07 Abstract In this report, a visionbased sudoku solver is described. The solver is capable of solving a sudoku directly from a photograph taken with
More informationAn Overview of Knowledge Discovery Database and Data mining Techniques
An Overview of Knowledge Discovery Database and Data mining Techniques Priyadharsini.C 1, Dr. Antony Selvadoss Thanamani 2 M.Phil, Department of Computer Science, NGM College, Pollachi, Coimbatore, Tamilnadu,
More informationComparison of Kmeans and Backpropagation Data Mining Algorithms
Comparison of Kmeans and Backpropagation Data Mining Algorithms Nitu Mathuriya, Dr. Ashish Bansal Abstract Data mining has got more and more mature as a field of basic research in computer science and
More informationBases de données avancées Bases de données multimédia
Bases de données avancées Bases de données multimédia Université de CergyPontoise Master Informatique M1 Cours BDA Multimedia DB Multimedia data include images, text, video, sound, spatial data Data of
More informationFinal Project Report
CPSC545 by Introduction to Data Mining Prof. Martin Schultz & Prof. Mark Gerstein Student Name: Yu Kor Hugo Lam Student ID : 904907866 Due Date : May 7, 2007 Introduction Final Project Report Pseudogenes
More informationVisualization of large data sets using MDS combined with LVQ.
Visualization of large data sets using MDS combined with LVQ. Antoine Naud and Włodzisław Duch Department of Informatics, Nicholas Copernicus University, Grudziądzka 5, 87100 Toruń, Poland. www.phys.uni.torun.pl/kmk
More informationAn introduction to Global Illumination. Tomas AkenineMöller Department of Computer Engineering Chalmers University of Technology
An introduction to Global Illumination Tomas AkenineMöller Department of Computer Engineering Chalmers University of Technology Isn t ray tracing enough? Effects to note in Global Illumination image:
More informationData Mining Project Report. Document Clustering. Meryem UzunPer
Data Mining Project Report Document Clustering Meryem UzunPer 504112506 Table of Content Table of Content... 2 1. Project Definition... 3 2. Literature Survey... 3 3. Methods... 4 3.1. Kmeans algorithm...
More informationLearning Binary Hash Codes for LargeScale Image Search
Learning Binary Hash Codes for LargeScale Image Search Kristen Grauman and Rob Fergus Abstract Algorithms to rapidly search massive image or video collections are critical for many vision applications,
More information