RANDOM PROJECTIONS FOR SEARCH AND MACHINE LEARNING

Similar documents
BUILDING A PREDICTIVE MODEL AN EXAMPLE OF A PRODUCT RECOMMENDATION ENGINE

Data Mining Practical Machine Learning Tools and Techniques

Active Learning SVM for Blogs recommendation

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence

Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Supporting Online Material for

Multimedia Databases. Wolf-Tilo Balke Philipp Wille Institut für Informationssysteme Technische Universität Braunschweig

The Scientific Data Mining Process

Unsupervised Data Mining (Clustering)

Going Big in Data Dimensionality:

Content-Based Recommendation

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

Local features and matching. Image classification & object localization

In this presentation, you will be introduced to data mining and the relationship with meaningful use.

9. Text & Documents. Visualizing and Searching Documents. Dr. Thorsten Büring, 20. Dezember 2007, Vorlesung Wintersemester 2007/08

Cluster Analysis: Advanced Concepts

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Big Data Text Mining and Visualization. Anton Heijs

DATA MINING CLUSTER ANALYSIS: BASIC CONCEPTS

Knowledge Discovery from patents using KMX Text Analytics

Clustering Big Data. Efficient Data Mining Technologies. J Singh and Teresa Brooks. June 4, 2015

Support Vector Machines with Clustering for Training with Very Large Datasets

Data Mining Cluster Analysis: Basic Concepts and Algorithms. Lecture Notes for Chapter 8. Introduction to Data Mining

Clustering. Adrian Groza. Department of Computer Science Technical University of Cluj-Napoca

Knowledge Discovery and Data Mining. Structured vs. Non-Structured Data

MACHINE LEARNING IN HIGH ENERGY PHYSICS

Recognition. Sanja Fidler CSC420: Intro to Image Understanding 1 / 28

Clustering. Data Mining. Abraham Otero. Data Mining. Agenda

So which is the best?

Data Mining Yelp Data - Predicting rating stars from review text

Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012

MHI3000 Big Data Analytics for Health Care Final Project Report

A Partially Supervised Metric Multidimensional Scaling Algorithm for Textual Data Visualization

Data Warehousing und Data Mining

Search Engines. Stephen Shaw 18th of February, Netsoc

Recognizing Cats and Dogs with Shape and Appearance based Models. Group Member: Chu Wang, Landu Jiang

How To Cluster

Machine Learning using MapReduce

Large-scale Data Mining: MapReduce and Beyond Part 2: Algorithms. Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook

Active Learning with Boosting for Spam Detection

Soft Clustering with Projections: PCA, ICA, and Laplacian

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

Collaborative Filtering. Radek Pelánek

Pentaho Data Mining Last Modified on January 22, 2007

Topographic Change Detection Using CloudCompare Version 1.0

Data Mining. Nonlinear Classification

Using multiple models: Bagging, Boosting, Ensembles, Forests

Clustering. Danilo Croce Web Mining & Retrieval a.a. 2015/201 16/03/2016

Enhanced Boosted Trees Technique for Customer Churn Prediction Model

Introduction to Data Mining

Decision Trees from large Databases: SLIQ

A Survey on Pre-processing and Post-processing Techniques in Data Mining

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Building a Question Classifier for a TREC-Style Question Answering System

Programming Exercise 3: Multi-class Classification and Neural Networks

Knowledge Discovery and Data Mining

Tracking and Recognition in Sports Videos

Data Preprocessing. Week 2

Monday Morning Data Mining

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

Big Data: What (you can get out of it), How (to approach it) and Where (this trend can be applied) natalia_ponomareva sobachka yahoo.

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Final Project Report

Fast Matching of Binary Features

Protein Protein Interaction Networks

TIETS34 Seminar: Data Mining on Biometric identification

Social Media Mining. Data Mining Essentials

Automatic 3D Reconstruction via Object Detection and 3D Transformable Model Matching CS 269 Class Project Report

Classification algorithm in Data mining: An Overview

Multi-dimensional index structures Part I: motivation

Adaptive information source selection during hypothesis testing

Data, Measurements, Features

Recommender Systems: Content-based, Knowledge-based, Hybrid. Radek Pelánek

EM Clustering Approach for Multi-Dimensional Analysis of Big Data Set

Form Design Guidelines Part of the Best-Practices Handbook. Form Design Guidelines

Extend Table Lens for High-Dimensional Data Visualization and Classification Mining

Feature Selection using Integer and Binary coded Genetic Algorithm to improve the performance of SVM Classifier

Big Data Analytics CSCI 4030

Predictive Indexing for Fast Search

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j

Grid Density Clustering Algorithm

An Overview of Knowledge Discovery Database and Data mining Techniques

Text Mining in JMP with R Andrew T. Karl, Senior Management Consultant, Adsurgo LLC Heath Rushing, Principal Consultant and Co-Founder, Adsurgo LLC

Visualization of large data sets using MDS combined with LVQ.

Comparison of K-means and Backpropagation Data Mining Algorithms

An introduction to Global Illumination. Tomas Akenine-Möller Department of Computer Engineering Chalmers University of Technology

Bases de données avancées Bases de données multimédia

Learning Binary Hash Codes for Large-Scale Image Search

Clustering & Visualization

Data Mining Project Report. Document Clustering. Meryem Uzun-Per

Specific Usage of Visual Data Analysis Techniques

Random Forest Based Imbalanced Data Cleaning and Classification

The Need for Training in Big Data: Experiences and Case Studies

Designing a Schematic and Layout in PCB Artist

The Value of Visualization 2

Data Mining: Exploring Data. Lecture Notes for Chapter 3. Introduction to Data Mining

Data quality in Accounting Information Systems

Transcription:

= + RANDOM PROJECTIONS FOR SEARCH AND MACHINE LEARNING Stefan Savev Berlin Buzzwords June 2015

KEYWORD-BASED SEARCH Document Data 300 unique words per document 300 000 words in vocabulary Data sparsity: 99.9% Words are strong query features STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 2

IMAGE SEARCH BY EXAMPLE Image Data 800 pixels 200 black pixels Data sparsity 80% Pixels are weak query features STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 3

TWO KINDS OF SEARCH Data Sparse Dense Query Size Short Long Use Cases Keyword Search Image Search, Semantic Search, Recommendations SEARCH METHOD INVERTED INDEX (LUCENE) RANDOM PROJECTIONS STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 4

OUTLINE 1. High dimensional data (images, text, clicks) 2. Random Projections: Why? What? How? 3. Image Search Benchmark STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 5

HIGH-DIMENSIONAL DATA Images STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 6

HIGH-DIMENSIONAL DATA Images Pixels STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 7

HIGH-DIMENSIONAL DATA Images 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Pixels 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 Dimensions STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 8

MATRIX REPRESENTATION OF DATASET STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 9

MATRIX REPRESENTATION OF DATASET STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 10

HIGH-DIMENSIONAL DATA Text A beer can do everyone a favor. a abstract be bee beer can do everyone 2 0 0 0 1 1 Dimensions STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 11

HIGH-DIMENSIONAL DATA Clicks 1 0 0 STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 12

DENSE VS SPARSE MATRIX IMAGES 1000 PIXELS TEXT/CLICKS 300000 WORDS img_1 img_3 doc_1 doc_3 img_n doc_n CANNOT SEARCH WITH LUCENE CAN SEARCH WITH LUCENE STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 13

DENSE VS SPARSE MATRIX IMAGES 1000 PIXELS TEXT/CLICKS 300000 WORDS TEXT/CLICKS 500 DIMENSIONS img_1 img_3 doc_1 doc_3 img_n doc_n CANNOT SEARCH WITH LUCENE CAN SEARCH WITH LUCENE WITH LUCENE STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 14

DENSE VS SPARSE MATRIX IMAGES 1000 PIXELS TEXT/CLICKS 300000 WORDS TEXT/CLICKS 500 DIMENSIONS img_1 img_3 doc_1 doc_3 img_n doc_n CANNOT SEARCH WITH LUCENE CAN SEARCH WITH LUCENE WITH LUCENE WITH RANDOM PROJECTIONS WITH RANDOM PROJECTIONS WITH RANDOM PROJECTIONS STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 15

DIMENSIONALITY REDUCTION beer wine beer + wine doc_1 doc_3 SVD doc_n 300 000 dimensions 500 dimensions STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 16

DIMENSIONALITY REDUCTION = PATTERN DISCOVERY = + STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 17

APPLICATIONS Random Forest Search (Nearest Neighbor) Random Projections SVM SVD Boosting STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 18

RANDOM PROJECTIONS IN INDUSTRY Spotify for music recommendations https://github.com/spotify/annoy Etsy for user/product recommendations Style in the Long Tail: Discovering Unique Interests with Latent Variable Models in Large Scale Social E-commerce, Diane Hu, Rob Hall, Josh Attenberg Referred to as Locality Sensitive Hashing STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 19

OUTLINE 1. High dimensional data (images, text, clicks) 2. Random Projections: Why? What? How? 3. Image Search Benchmark STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 20

HOW DOES IT WORK? = + STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 21

ILLUSTRATION ON ARTIFICIAL DATASET id_1 id_3 id_n STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 22

ILLUSTRATION ON ARTIFICIAL DATASET Nearest neighbor problem: Find the closest point STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 23

ILLUSTRATION ON ARTIFICIAL DATASET Approximation: Neighbors are in a circle with small radius STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 24

ILLUSTRATION ON ARTIFICIAL DATASET Core idea: random grids Dynamic grids (variably sized) Random = cheap STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 25

PARTITION BY RANDOM SPLITS Hyperplane splits the dataset STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 26

PROJECTION = ONE DIMENSIONAL VIEW OF DATASET A Random Direction (1 Dimension) STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 27

PLATO S ALEGORY OF THE CAVE A Random Direction (1 Dimension) STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 28

PARTITIONING BY PROJECTING ON RANDOM LINES STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 29

PARTITIONING BY PROJECTING ON RANDOM LINES STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 30

ALGORITHM VISUALIZATION STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 31

ALGORITHM VISUALIZATION STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 32

ALGORITHM VISUALIZATION STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 33

ALGORITHM VISUALIZATION STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 34

ALGORITHM VISUALIZATION When two points are close, they are likely to end up in the same partition, even under random partitioning STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 35

ALGORITHM VISUALIZATION When two points are close, they are LIKELY to end up in the same partition, even under random partitioning STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 36

MULTIPLE TREES Random tree partitioning is noisy. Build more trees. STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 37

ADDING MORE TREES Tree 1 Trees 1 and 2 Trees 1, 2 and 3 STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 38

ADDING MORE TREES Trees 1-50 Trees 1-100 STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 39

CONSTRAINING THE NEAREST NEIGHBOR REGION THRESHOLD = In how many trees a NN candidate appears Threshold = 1 Threshold = 60 Threshold = 20 STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 40

WE VE GOT HIM STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 41

OUTLINE 1. High dimensional data (images, text, clicks) 2. Random Projections: Why? What? How? 3. Image Search Benchmark STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 42

IMAGE SEARCH FOR PREDICTION STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 43

IMAGE SEARCH FOR PREDICTION Results Image Label 5 3 3 STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 44

STAGE 1: INDEXING 42 000 examples with labels 784 pixels per image STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 45

STAGE 1: INDEXING 42 000 examples with labels 784 pixels per image STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 46

STAGE 1: INDEXING 42 000 examples with labels 784 pixels per image index = Index(...) for image in training_examples: labels[image.id] = image.label index.add_item(image.id, image.vector) index.build(...) STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 47

STAGE 2: SEARCH TO PREDICT THE LABEL Results Image Label 5 3 3 index.load('.../file.index')... image_vec = read_image(...) nn = 100 #number of nearest neighbors results = index.get_nns_by_vector(image_vec,nn) top_result = results[0] predicted_label = labels[top_result.id] 28 000 test examples STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 48

TOOLS stefansavev.com/randomtrees STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 49

BENCHMARK IMAGE SEARCH DIMENSIONS # TREES # NN CANDIDATES ACCURACY ON TEST 784 10 1000-1010 97.0 4.85 SEARCH TIME [ms/query] BOTTLENECK Reference: https://github.com/stefansavev/random-projections-talk/ STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 50

BENCHMARK IMAGE SEARCH DIMENSIONS # TREES # NN CANDIDATES ACCURACY ON TEST 784 10 1000-1010 97.0 4.85 100, after SVD 10 1000-1010 97.5 0.7 SEARCH TIME [ms/query] Reference: https://github.com/stefansavev/random-projections-talk/ STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 51

BENCHMARK IMAGE SEARCH DIMENSIONS # TREES # NN CANDIDATES ACCURACY ON TEST 784 10 1000-1010 97.0 4.85 100, after SVD 10 1000-1010 97.5 0.7 SEARCH TIME [ms/query] BOTTLENECK Reference: https://github.com/stefansavev/random-projections-talk/ STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 52

BENCHMARK IMAGE SEARCH stefansavev.com/ randomtrees DIMENSIONS # TREES # NN CANDIDATES ACCURACY ON TEST 784 10 1000-1010 97.0 4.85 100, after SVD 100, after SVD 10 1000-1010 97.5 0.7 40 180 (mean) 97.5 0.36 SEARCH TIME [ms/query] Reference: https://github.com/stefansavev/random-projections-talk/ STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 53

SAME FOUNDATION FOR SEARCH AND MACHINE LEARNING + Data point id 32 56 89 STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 54

SAME FOUNDATION FOR SEARCH AND MACHINE LEARNING Histogram + label frequency 0 4 1 51 7 38 STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 55

CONCLUSION = + Search in Dense Data CHEAP method to explore data; Makes algorithms FASTER (e.g. SVD, SVM) STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 56

THANK YOU! Stefan Savev Email: info@stefansavev.com Blog: stefansavev.com STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 57

EXTRA SLIDES STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 58

REFERENCES STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 59

ADVANTAGES/DISADVANTAGES DIMENSIONALITY REDUCTION Advantages Semantic Similarity Concentrate the signal Disadvantages No exact matches possible Adds noise STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 60

RANDOM PROJECTIONS AS HASH FUNCTIONS 0 0 1 0 0 1 1 0 0 0 1 0 0 0 1 0-1 1 1-1 1-1 1 1 * = 1 1-1 -1-1 1-1 1 0 0 1 0 0-1 1 0 0 0-1 0 0 0-1 0 1 +1-1 1 1 = -1 STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 61

EXAMPLE-BASED IMAGE SEARCH MNIST image dataset of digits Images are digits from 0 to 9 42 000 images for training 28 000 images for test 28 x 28 pixels with values from 0 to 255 784 (=28*28) features Dataset is already preprocessed Goal is to predict the label of an image STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 62

SOME IMPLEMENTATION TRICKS sparse random projections combine not all dimension but a small number of them apply dimensionality reduction first pick better random projections the projected histogram is wide project on multiple lines simultaneously Hadamard matrix trick reuse random projections via recombination cut out a ball from the densest region PCA may help use the neighbors of neighbors of a point as additional similarity candidates advanced search inside the trees with backtracking (KD-tree style) generate multiple versions of query/training data (i.e. for image data translate or rotate the image) [Link to Blog] STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 63

PROJECTION Dot product (cosine, similarity) View of the dataset Overlap with Pattern Geometry of Search Foundation for Search and Machine Learning STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 64

WHY RANDOM? Cheap (replace optimization with randomization); Simply build more trees If we build the perfect tree, it s just one tree. We need more and DIVERSE trees In high dimensions all algorithms will have problems, so do the cheapest With a few points/dimensions random is not the best choice STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 65

DIMENSIONALITY REDUCTION doc_1 doc_3 doc_n beer wine SVD beer + wine MIXING THE COLUMNS OF THE ORIGINAL MATRIX TO MAXIMIZE SIGNAL TO NOISE 300000 WORDS 500 DIMENSIONS STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 66

PROJECTION = OVERLAP WITH A PATTERN OVERLAP, =SUM = 1, more red -1, more green STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 67

PROJECTION = OVERLAP WITH A PATTERN OVERLAP, =SUM STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 68

PROJECTION = OVERLAP WITH A PATTERN OVERLAP (, ) = SUM STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 69

PROJECTION = OVERLAP WITH A PATTERN OVERLAP (, ) = SUM = 1, if more red pixels than white -1, if more white pixels than red STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 70

PROJECTION = OVERLAP WITH A PATTERN OVERLAP (, ) = SUM = 1, if more red pixels than white -1, if more white pixels than red STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 71

DATA POINT SVD FEATURE RANDOM FEATURE STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 72

PLATO S ALEGORY OF THE CAVE He then explains how the philosopher is like a prisoner who is freed from the cave and comes to understand that the shadows on the wall do not make up reality at all, as he can perceive the true form of reality rather than the mere shadows seen by the prisoners Source: http://en.wikipedia.org/wiki/allegory_of_the_cave STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 73

SECRET SAUCE How to make it faster? How to make it more accurate? When does it work? Is there something better? http://stefansavev.com/random-projections-secretsouce.html STEFAN SAVEV ~ RANDOM PROJECTIONS FOR BIG DATA ~ BERLIN BUZZWORDS ~ JUNE 2015 74