Search Engines. Stephen Shaw <stesh@netsoc.tcd.ie> 18th of February, 2014. Netsoc




Me
M.Sc. Artificial Intelligence, University of Edinburgh (would recommend)
B.A. (Mod.) Computer Science, Linguistics, French, TCD (would recommend)
One-time Netsoc sysadmin (would recommend)
Talk inspired by the 'Text Technologies' lecture course given by Dr Victor Lavrenko at the University of Edinburgh (would recommend)
http://www.inf.ed.ac.uk/teaching/courses/tts/

Information Retrieval
A field of AI: perfecting the task of getting information from data
Two concerns: efficiency and accuracy
Hot topic right now: 'Big Data' and other buzzwords

The task
Match a query, Q, to a document, D
Not so clear what 'matching' is
A tricky supervised learning problem
Easy to get feedback (they clicked on it = good)
But by then it's too late (they've already got the results)
Adversarial task ('search engine optimization')
Relational algebra sucks: too slow
Natural language processing sucks: too slow, doesn't generalize across domains

Bag of words
Revolutionary idea: if it matches, it's similar
Represent documents and queries as a 'bag of words'
Word order doesn't make much of a difference
Example document: we clawed we chained our hearts in vain
B-o-w: {'we': 2, 'clawed': 1, 'chained': 1, 'our': 1, 'hearts': 1, 'in': 1, 'vain': 1}
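
The bag-of-words step is a one-liner in Python; a minimal sketch, assuming tokenization is just lowercasing and whitespace splitting:

```python
from collections import Counter

def bag_of_words(text):
    """Tokenize on whitespace and count term frequencies."""
    return Counter(text.lower().split())

doc = "we clawed we chained our hearts in vain"
print(bag_of_words(doc))
# Counter({'we': 2, 'clawed': 1, 'chained': 1, 'our': 1, 'hearts': 1, 'in': 1, 'vain': 1})
```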

Vector space model
How to compare bags of words?
Each word ('term') is a dimension
Each document is a vector
Frequency of term w in D is its magnitude in dimension w
Euclidean distance? s(Q, D) = sqrt( sum_{i=1}^{n} (Q_i - D_i)^2 )

Vector space model
Documents x terms:

           D1   D2   D3   ...  Dn
  love      1    1    2         3
  in        0    1    1         1
  one       1    0    0         0
  vain      2    2    0         0
  clawed    1    1    2         2

Cosine similarity
Euclidean distance is sensitive to length: a poor inductive bias
Cosine similarity: s(Q, D) = ( sum_w Q_w D_w ) / ( sqrt(sum_w Q_w^2) sqrt(sum_w D_w^2) )
Best match = D with smallest angle between it and Q
Problem?
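
Cosine similarity over two bags of words, as a sketch (the example query and document are made up):

```python
import math
from collections import Counter

def cosine_similarity(q, d):
    """Cosine of the angle between two term-frequency vectors."""
    dot = sum(q[w] * d[w] for w in q)    # Counter returns 0 for absent terms
    norm_q = math.sqrt(sum(c * c for c in q.values()))
    norm_d = math.sqrt(sum(c * c for c in d.values()))
    return dot / (norm_q * norm_d)

q = Counter("wrecking ball".split())
d = Counter("i came in like a wrecking ball".split())
print(cosine_similarity(q, d))   # higher = smaller angle = better match
```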

Favouring important terms
Too much weight given to semantically vacuous terms: a, the, one, ...
Zipf's law: f(w) is proportional to 1 / R(w)
tf-idf: penalize terms that appear in a lot of documents
tf_{w,D}: # times term w appears in document D
df_w: # documents w appears in

tf-idf weighted sum
Term frequency, inverse document frequency
s(Q, D) = sum_{w in V} tf_{w,Q} * ( tf_{w,D} / (tf_{w,D} + k |D| / avg_d) ) * log( |C| / df_w )
|C|: # documents in corpus C
avg_d: average length of a document d in C
k: free parameter (lets you tune against document length)
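
A direct Python transcription of the scoring formula above (as reconstructed here); the function name and the k = 1.5 default are assumptions, not from the talk:

```python
import math

def score(query_tf, doc_tf, doc_len, avg_len, df, n_docs, k=1.5):
    """tf-idf weighted sum of one document against a query.
    k is the free length-tuning parameter from the slide."""
    s = 0.0
    for w, tfq in query_tf.items():
        tfd = doc_tf.get(w, 0)
        if tfd == 0 or df.get(w, 0) == 0:
            continue                                   # term contributes nothing
        tf_part = tfd / (tfd + k * doc_len / avg_len)  # length-normalized tf
        idf_part = math.log(n_docs / df[w])            # rarer term = bigger boost
        s += tfq * tf_part * idf_part
    return s
```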

Curse of dimensionality
Each term is a dimension of the vector space: that's a lot of dimensions
Lots of dimensions = lots of headache
Huge, sparse matrix
Inaccurate matches due to vocabulary mismatch

Dimensionality reduction
Stopword suppression: delete semantically vacuous words
Stemming: map all inflected word forms to their stems
Clever tokenization: strip punctuation
Soundex fingerprinting: helps overcome badly-spelled queries
n-gram models: suggest completions from previous queries before the user finishes typing
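
The first three reductions can be sketched as a small pipeline. The stopword list is a toy assumption, and the suffix-stripper is a stand-in for a real stemmer (e.g. Porter):

```python
import re

STOPWORDS = {'a', 'the', 'one', 'in', 'of', 'to'}    # toy list, an assumption

def stem(word):
    """Crude suffix stripper standing in for a real stemming algorithm."""
    for suffix in ('ing', 'ed', 's'):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def normalize(text):
    tokens = re.findall(r"[a-z]+", text.lower())     # strips punctuation
    return [stem(t) for t in tokens if t not in STOPWORDS]

print(normalize("The crawler, crawling links..."))   # ['crawler', 'crawl', 'link']
```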

Executing queries: first pass
1. Index user's query Q
2. Compute s(Q, D_i) for all D_i in C
3. Toss results into a priority queue with s(Q, D_i) as priority number
4. Dequeue first n results
Good luck: there are essentially infinite documents, and finite memory and time
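
The naive first pass, sketched with a heap for the priority queue; the overlap scoring function here is a toy stand-in for the real s(Q, D):

```python
import heapq

def top_n(query, corpus, score, n=10):
    """Score every document, keep only the best n (score, doc_id) pairs."""
    scored = ((score(query, doc), doc_id) for doc_id, doc in enumerate(corpus))
    return heapq.nlargest(n, scored)

corpus = [{'love', 'vain'}, {'love'}, {'wrecking', 'ball'}]
overlap = lambda q, d: len(q & d)                      # toy scoring function
print(top_n({'love', 'vain'}, corpus, overlap, n=2))   # [(2, 0), (1, 1)]
```

The problem the slide points at: this still touches every document in C, which is exactly what a real engine cannot afford.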

Rethinking document representation
Need a cleverer data structure
Bag of words records which terms appear in each document: a document-centric representation
We need a term-centric representation: instead record which documents contain each term

Term-centric vector space
Terms x documents:

           D1   D2   D3   ...  Dn
  love      1    1    2         3
  in        0    1    1         1
  one       1    0    0         0
  vain      2    2    0         0
  clawed    1    1    2         2

Postings lists
Representation still very sparse (remember Zipf's law)
Specialized data structure needed
Postings lists fit the bill:
key-value structure
keys: terms
values: linked list of tuples (document ID, frequency)

Postings lists
love    1:1  2:1  3:2  n:3
in      1:0  2:1  3:1  n:1
one     1:1  2:0  3:0  n:0
vain    1:2  2:2  3:0  n:0
clawed  1:1  2:1  3:2  n:2
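
Building postings lists from a corpus is a single inversion pass; a sketch using a dict of Python lists in place of linked lists:

```python
from collections import Counter, defaultdict

def build_index(corpus):
    """Invert a corpus: term -> list of (doc_id, frequency) postings,
    sorted by document ID because documents are processed in order."""
    index = defaultdict(list)
    for doc_id, text in enumerate(corpus, start=1):
        for term, freq in sorted(Counter(text.split()).items()):
            index[term].append((doc_id, freq))
    return index

index = build_index(["love in vain", "love love one"])
print(index['love'])   # [(1, 1), (2, 2)]
```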

Query by merge
Execute queries by merging postings lists
Q = {wrecking, ball}
Look up postings lists for wrecking and ball
Set pointers at beginning of each list
Merge left-to-right to construct new list
Read off results
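
The left-to-right merge with two pointers, sketched for the conjunctive case (keep only documents containing both terms — one possible merge policy, an assumption here):

```python
def merge_postings(p1, p2):
    """Merge two doc-id-sorted postings lists, keeping docs present in both."""
    i = j = 0
    merged = []
    while i < len(p1) and j < len(p2):
        d1, f1 = p1[i]
        d2, f2 = p2[j]
        if d1 == d2:
            merged.append((d1, f1 + f2))   # combine frequencies
            i += 1
            j += 1
        elif d1 < d2:
            i += 1                         # advance the pointer that lags
        else:
            j += 1
    return merged

wrecking = [(1, 2), (4, 1), (7, 3)]
ball     = [(2, 1), (4, 2), (7, 1)]
print(merge_postings(wrecking, ball))      # [(4, 3), (7, 4)]
```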

doc-at-a-time
Record the query-term postings entries for each document and merge to compute s
Works well, but can degenerate when documents are of vastly different lengths
Merging lots of lists is slow: O(|C| |Q|)
Memory-expensive: O(|Q|)

term-at-a-time
Initialize a new postings list for each query term
Look up each term's postings list in turn
Update aggregate score for each term
Linear runtime
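
Term-at-a-time scoring with a score accumulator per document; the weight function is left pluggable (here a toy one that just uses raw frequency):

```python
from collections import defaultdict

def term_at_a_time(query_terms, index, weight):
    """Walk one postings list at a time, accumulating per-document scores."""
    scores = defaultdict(float)
    for term in query_terms:
        for doc_id, freq in index.get(term, []):
            scores[doc_id] += weight(term, freq)   # e.g. a tf-idf contribution
    return scores

index = {'love': [(1, 1), (2, 2)], 'vain': [(1, 2)]}
print(dict(term_at_a_time(['love', 'vain'], index, lambda t, f: f)))
# {1: 3.0, 2: 2.0}
```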

Construction and storage
Amenable to distributed computation (e.g. Map-Reduce)
Postings lists still take up a lot of space
Usually stored on disk, unless you have a stupid amount of RAM
Clever compression techniques
Can add additional pointers to speed up linear access

Link analysis
The Internet is a special kind of corpus: documents link to each other, forming a directed graph
Relevant to ranking: favour popular pages over unpopular pages in results

Extracting content
Crawler starts on a random page, follows outgoing links
Statistical, layout-independent methods for extracting page content
Duplicate detection important: news articles, Wikipedia clones

Duplicate detection
Define a clever hash function
Cryptographic hashing: collisions = bad
Duplicate-detection hashing: cook up collisions such that duplicates collide
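
One way to "cook up collisions" is a min-hash-style fingerprint over word n-grams (shingles). This is a sketch of the general idea, not the specific scheme from the talk: identical documents always collide, and near-duplicates collide with probability equal to their shingle overlap:

```python
import hashlib

def fingerprint(text, n=3):
    """Hash every word n-gram and keep the minimum hash as the fingerprint."""
    words = text.lower().split()
    shingles = {' '.join(words[i:i + n]) for i in range(len(words) - n + 1)}
    return min(hashlib.md5(s.encode()).hexdigest() for s in shingles)

a = fingerprint("the quick brown fox jumps over the lazy dog")
b = fingerprint("the quick brown fox jumps over a lazy dog")
# a == b whenever the two documents share the minimum-hashed shingle
```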

Hubs and authorities
Hub: page that links to lots of other pages (e.g. Google)
Authority: page that a lot of pages link to (e.g. Wikipedia)
HITS algorithm iteratively rates pages as hubs or authorities
Useful in ranking, useful for driving crawlers, but vulnerable to SEO

Hubs and authorities
[diagram: several hub pages (H) linking to a central authority page (A)]

PageRank
'Rank by consensus'
If P_1 links to P_2, then P_1 likes P_2
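
The 'rank by consensus' idea can be sketched as power iteration over the link graph. This is a minimal illustration (fixed iteration count, no handling of dangling pages); the damping factor 0.85 is the conventional choice, not stated in the talk:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict: page -> list of outgoing links."""
    pages = set(links) | {p for out in links.values() for p in out}
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # everyone gets a baseline share, plus endorsements from linkers
        new = {p: (1 - damping) / len(pages) for p in pages}
        for p, out in links.items():
            for q in out:
                new[q] += damping * rank[p] / len(out)
        rank = new
    return rank

# P2 and P3 both 'like' P1, so P1 ends up ranked highest
ranks = pagerank({'P1': ['P2'], 'P2': ['P1', 'P3'], 'P3': ['P1']})
```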

That's it
Questions?