
This lecture: Introduction to information retrieval. Making money with information retrieval. Some technical basics. Link analysis. CSC401/2511 Spring 2015

Information retrieval systems Information retrieval (IR): n. searching for documents, or for information in documents. Question answering: respond with a specific answer to a question (e.g., Wolfram Alpha). Document retrieval: find documents relevant to a query, ranked by relevance (e.g., Bing or Google). Text analytics/data mining: general organization of large textual databases (e.g., Lexis-Nexis, OpenText, MedSearch).

Terminology Information retrieval has slightly different terminology than the tasks we've seen previously: Document: a book, article, web page, or paragraph (depending on the task and data). Collection: a corpus of documents. Term: a word type. Stop word: a functional (non-content) word (e.g., the).

Query types Different kinds of questions can be asked. Factoid questions, e.g., How often were the peace talks in Ireland delayed or disrupted as a result of acts of violence? Narrative (open-ended) questions, e.g., Can you tell me about contemporary interest in the Greek philosophy of stoicism? Complex/hybrid questions, e.g., Who was involved in the Schengen agreement to eliminate border controls in Western Europe, and what did they hope to accomplish?

Question answering (QA) Which woman has won more than one Nobel Prize? (Marie Curie) Question answering (QA) usually involves returning a specific answer to a question.

Document retrieval vs. IR One strategy is to turn question answering into information retrieval (IR) and let the human complete the task.

Question answering (QA)

Knowledge-based QA 1. Build a structured semantic representation of the query: extract times, dates, locations, and entities using regular expressions; fit to well-known templates. 2. Query databases with these semantics: ontologies (Wikipedia infoboxes), restaurant review databases, calendars, movie schedules.

IR-based QA

IR-based QA (diagram contrasting information retrieval with question answering)

IBM's Watson (Architecture diagram: the Jeopardy! game control system feeds clues, categories, scores, and other game data to Watson's game controller, which passes them to Watson's QA engine, running on 2,880 IBM Power750 compute cores with 15 TB of memory and content equivalent to ~1,000,000 books; the engine returns answers and confidences, decisions to buzz and bet, and text-to-speech output. Source: A Brief Overview and Thoughts for Healthcare Education and Performance Improvement, by the IBM Watson team.)

IBM's Watson: search This man became the 44th President of the United States in 2008

IBM's Watson: search Title-oriented search: in some cases, the solution is in the title of highly-ranked documents. E.g., This pizza delivery boy celebrated New Year's at Applied Cryogenics.

IBM's Watson: selection Once candidates have been gathered from various sources and methods, rank them according to various scores (IBM Watson uses >50 scoring metrics). In cell division, mitosis splits the nucleus & cytokinesis splits this liquid cushioning the nucleus

IBM's Watson: selection One aspect of Jeopardy! is that answers are often posed with puns that have to be disambiguated. Bilbo shouldn't have played riddles in the dark with this shady character from WordNet's synonym-sets

How to make money out of this?

Making money before search Advertisers used to pay for banner ads that did not depend on user queries. CPM (cost per mille): pay for each ad display. CPC (cost per click): pay when a user clicks an ad. CTR (click-through rate): the fraction of ad displays that result in click-throughs. CPA (cost per action): pay only when the user makes an online purchase after a click-through.

Making money with search Advertisers now bid on keywords. Ads are displayed for the highest bidders when a query contains those keywords. PPC (pay per click): CPC for ads served based on a ranking of bid keywords and user interest (e.g., Google AdWords). (It's a bit more complicated than that.)

How are ads ranked? Today, a two-step process is typical. First, organizations bid on keywords; by itself, this can lead to abuse, monopolization, and irrelevant content. Second, ads are re-ranked by relevance, approximated by click-through rate.

How are ads ranked?

Advertiser  Bid    CTR   Ad rank  Rank  Paid
A           $4.00  0.01  0.04     4     (minimum)
B           $3.00  0.03  0.09     2     $2.68
C           $2.00  0.06  0.12     1     $1.51
D           $1.00  0.08  0.08     3     $0.51

Bid: amount determined by the advertiser for the keyword. CTR: click-through rate, an approximation of relevance. Ad rank: Bid × CTR, trading off advertiser and user interests. Rank: actual rank. Paid: minimum amount necessary to maintain the rank over the next advertiser, plus $0.01.

How are ads ranked? Paid is set so that the advertiser's effective spend just beats the ad ranked below it:

Paid_r × CTR_r = Bid_{r+1} × CTR_{r+1}, i.e.,
Paid_r = (Bid_{r+1} × CTR_{r+1}) / CTR_r + $0.01

E.g., Paid_1 = ($3.00 × 0.03) / 0.06 + $0.01 = $1.51
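The pricing rule above can be checked with a short script. This is a sketch of a generalized second-price auction, not any real ad platform's implementation; the function name `rank_ads` and the `(name, bid, ctr)` layout are invented for illustration.

```python
# Rank ads by Bid x CTR, then charge each advertiser the minimum needed
# to keep its rank over the next-ranked ad, plus $0.01 (as on the slide).

def rank_ads(ads):
    """ads: list of (name, bid, ctr). Returns list of (name, ad_rank, paid)."""
    ranked = sorted(ads, key=lambda a: a[1] * a[2], reverse=True)
    results = []
    for i, (name, bid, ctr) in enumerate(ranked):
        if i + 1 < len(ranked):
            _, next_bid, next_ctr = ranked[i + 1]
            paid = round(next_bid * next_ctr / ctr + 0.01, 2)
        else:
            paid = None  # lowest-ranked ad pays only a minimum reserve
        results.append((name, bid * ctr, paid))
    return results

ads = [("A", 4.00, 0.01), ("B", 3.00, 0.03),
       ("C", 2.00, 0.06), ("D", 1.00, 0.08)]
for name, ad_rank, paid in rank_ads(ads):
    print(name, ad_rank, paid)
```

Run on the table's numbers, this reproduces the Paid column: C pays $1.51, B pays $2.68, D pays $0.51.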

Aside: highest-paying search terms (according to http://www.cwire.org/highest-paying-search-terms)
$69.10 mesothelioma treatment options
$66.46 mesothelioma risk
$65.85 personal injury lawyer michigan
$65.74 michigan personal injury attorney
$62.59 student loans consolidation
$61.44 car accident attorney los angeles
$61.26 mesothelioma survival rate
$60.96 treatment of mesothelioma
$59.44 online car insurance quotes
$59.39 arizona dui lawyer

Back to basics. How do we find the right documents for a query?

Queries A query is a textual key which selects and orders a specific subset of documents (or answers) in a collection. Historically, queries were highly structured expressions in a logical language, but in modern search engines they are more often streams of syntactically disconnected keywords. A Boolean query is a logical combination of Boolean membership predicates: Brutus AND Caesar AND NOT Calpurnia

Term-document incidence

            Anthony and  Julius  The      Hamlet  Othello  Macbeth
            Cleopatra    Caesar  Tempest
ANTHONY     1            1       0        0       0        1
BRUTUS      1            1       0        1       0        0
CAESAR      1            1       0        1       1        1
CALPURNIA   0            1       0        0       0        0
CLEOPATRA   1            0       0        0       0        0
MERCY       1            0       1        1       1        1
WORSER      1            0       1        1       1        0

For the query Brutus AND Caesar AND NOT Calpurnia:
110100 (Brutus)
AND 110111 (Caesar)
AND 101111 (NOT Calpurnia)
= 100100 (bitwise AND: Anthony and Cleopatra, Hamlet)
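As a minimal sketch, the bitwise evaluation above can be reproduced with Python integers standing in for the incidence rows (the variable names are illustrative):

```python
# Three rows of the term-document incidence matrix as 6-bit integers,
# leftmost bit = first play in the column order below.

docs = ["Anthony and Cleopatra", "Julius Caesar", "The Tempest",
        "Hamlet", "Othello", "Macbeth"]

incidence = {
    "Brutus":    0b110100,
    "Caesar":    0b110111,
    "Calpurnia": 0b010000,
}

mask = (1 << len(docs)) - 1  # 0b111111, keeps NOT within six bits
result = (incidence["Brutus"]
          & incidence["Caesar"]
          & ~incidence["Calpurnia"] & mask)

# bit i from the left is set -> document i matches the query
matches = [d for i, d in enumerate(docs)
           if (result >> (len(docs) - 1 - i)) & 1]
print(matches)  # ['Anthony and Cleopatra', 'Hamlet']
```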

Boolean queries and big collections If we have 1 million documents, each with 1,000 tokens, that is 1 billion tokens, and hence at most 1 billion 1s in the matrix. If we have 500,000 distinct terms, the term-document incidence matrix has 500,000,000,000 elements, so far fewer than 1 in 500 entries is a 1. The matrix is very sparse, and storing it explicitly wastes space. Can there be a better way?

Inverted index Given a query word, the inverted index for that word gives us all documents that contain that word in the title, the abstract (summary), some hidden metadata, or the entire text. More sophisticated versions also include the frequency and positions of the query word in each document. How does one construct such indices?

Inverted index construction 1. Collect the documents to be indexed: Friends, Romans, countrymen. So let it be with Caesar. 2. Tokenize the text: Friends | Romans | countrymen | So 3. Preprocess and normalize, producing the indexing terms: friend | roman | countryman | so 4. Create a dictionary (hash) mapping terms to documents.
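The four steps above can be sketched on a two-document toy collection. The `normalize` function here (lowercasing plus crude "s"-stripping) is a stand-in for real tokenization and stemming, not an actual IR pipeline:

```python
from collections import defaultdict

docs = {1: "Friends, Romans, countrymen.",    # step 1: the collection
        2: "So let it be with Caesar."}

def normalize(token):
    """Toy normalizer: strip punctuation, lowercase, drop a trailing 's'."""
    token = token.strip(".,").lower()
    return token[:-1] if token.endswith("s") else token

index = defaultdict(set)                      # step 4: term -> posting set
for doc_id, text in docs.items():
    for token in text.split():                # step 2: tokenize
        index[normalize(token)].add(doc_id)   # step 3: normalize, then index

print(sorted(index["friend"]))  # [1]
print(sorted(index["caesar"]))  # [2]
```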

Simple conjunctive query Given the query Brutus AND Calpurnia: 1. Locate Brutus in the dictionary and retrieve its documents list. 2. Locate Calpurnia in the dictionary and retrieve its documents list. 3. Intersect the two document lists and return the result to the user. The intersection is linear in the lengths of the document lists (if the lists are sorted).
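The merge-style intersection of two sorted postings lists can be sketched like this; the document IDs below are illustrative:

```python
# Walk both sorted lists with two pointers; every element is visited at
# most once, so the cost is linear in the combined list length.

def intersect(p1, p2):
    answer, i, j = [], 0, 0
    while i < len(p1) and j < len(p2):
        if p1[i] == p2[j]:
            answer.append(p1[i])
            i += 1
            j += 1
        elif p1[i] < p2[j]:
            i += 1
        else:
            j += 1
    return answer

brutus = [1, 2, 4, 11, 31, 45, 173, 174]
calpurnia = [2, 31, 54, 101]
print(intersect(brutus, calpurnia))  # [2, 31]
```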

Constructing indices Spiders (a.k.a. robots, bots, crawlers) start from root (seed) URLs and follow all links on those pages recursively. Novel pages are processed and indexed. Despite its memory footprint growing exponentially with depth, breadth-first search is quite popular; depth-first search needs memory only linear in depth, but can get lost. Trivia: if you click on the first contentful link in any Wikipedia page and keep repeating, you will eventually be led to the Philosophy article.

Increasing entropy? Boolean retrieval is precise and was very popular for decades (it is still used for structured data, like desktop file search). The amount and value of unstructured data (i.e., text) has grown faster than that of structured data on the web. (Chart: data volume and market capitalization for unstructured vs. structured data, 1996 vs. 2006; data from Chris Manning)

Zipf's law on the web These variables have Zipfian distributions: the number of links to and from a page, the length of web pages, and the number of web page hits. (graph from Ray Mooney)

New challenges for IR on the web Distributed data: documents spread over millions of web servers. Volatile data: documents change or disappear frequently and rapidly. Large volume: petabytes of data. Poor quality: no editorial control, false information, poor writing, typographic errors. Heterogeneity: various media, languages, encodings. Unstructured: no uniform structure, HTML errors, duplicate documents.

Detecting (near-)duplicates The user will become annoyed when many top-ranking hits are identical or similar. Nearly identical pages can be detected by hashing; e.g., don't index en.m.wikipedia.org/wiki/ if you've indexed en.wikipedia.org/wiki/. Zero marginal relevance occurs when a highly relevant document becomes irrelevant by being ranked below a (near-)duplicate.

Detecting (near-)duplicates Compute similarity with some edit-distance measure. Syntactic similarity (e.g., overlap of bigrams) is easier to measure than semantic similarity. If this measure is above some threshold θ for a pair of documents, we consider them duplicates. Jaccard coefficient: J(A, B) = |A ∩ B| / |A ∪ B| is a measure of similarity on [0..1]: J(A, A) = 1, and J(A, B) = 0 iff A ∩ B = ∅.

Jaccard coefficient on 2-grams Documents:
d1: Jack London went to Toronto
d2: Jack London went to the city of Toronto
d3: Jack went from Toronto to London
J(d1, d2) = 3/8 = 0.375
J(d1, d3) = 0
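A small script, assuming word-level 2-grams, reproduces these numbers:

```python
# Jaccard similarity over word bigrams: |A & B| / |A | B|.

def bigrams(text):
    words = text.split()
    return {(words[i], words[i + 1]) for i in range(len(words) - 1)}

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 1.0

d1 = "Jack London went to Toronto"
d2 = "Jack London went to the city of Toronto"
d3 = "Jack went from Toronto to London"

# d1 and d2 share 3 of 8 distinct bigrams; d1 and d3 share none.
print(jaccard(bigrams(d1), bigrams(d2)))  # 0.375
print(jaccard(bigrams(d1), bigrams(d3)))  # 0.0
```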

Link analysis When we're crawling the web and indexing, we want to retain some record of similarity between (non-duplicate) documents in terms of their link structure. This will help in searching.

Bibliometrics: citation analysis Impact factor: developed in 1972 to measure the quality and influence of scientific journals; it measures how often their articles are cited. Bibliographic coupling: a measure of similarity between two documents A and B according to the intersection of their citations (Kessler, 1963).

Bibliometrics: citation analysis Co-citation: a measure of similarity between two documents A and B according to the intersection of the documents that cite them (Small, 1973).

Links are not citations Many links are navigational within a website. Many pages with high in-degree are portals without much content. Some links are not necessarily endorsements. Relevance of citations in scientific settings is (theoretically) enforced by peer review. Can we mimic the enforcement of relevance usually done by human experts in scientific articles?

Authorities and hubs Authorities are pages recognized as significant, trustworthy, and useful for a topic. In-degree (the number of incoming links) is an estimate of authority. Should incoming links from authoritative pages count more than others? Hubs are index pages that provide lots of links to relevant content pages; e.g., reddit.com is a hub page for recycled memes.

HITS The HITS algorithm (Kleinberg, 1998) attempts to learn hubs and authorities on a given topic from relevant web subgraphs. Hubs and authorities tend to form bipartite graphs, with hubs on one side linking to authorities on the other.

HITS First, find the (top N) most relevant pages for a query; this is the root set, R. (We'll see how to do this next lecture.) Next, look at the link structure relative to R: the base set S is R plus all pages that link to, or are linked from, pages in R.

HITS: authorities and in-degree Even within S, nodes with high in-degree may not be authorities; they may just be generically popular pages. Authority should be determined by strong hubs. Iteratively (slowly) converge on a mutually reinforcing set of hubs and authorities. For every page p ∈ S, maintain an authority score a_p and a hub score h_p (each initialized to 1/√|S|), subject to Σ_{p∈S} a_p² = 1 = Σ_{p∈S} h_p².

HITS update rules Authorities p are pointed to by lots of good hubs q:

a_p = Σ_{q: q→p} h_q      e.g., a_4 = h_1 + h_2 + h_3

Hubs q point to lots of good authorities p:

h_q = Σ_{p: q→p} a_p      e.g., h_4 = a_1 + a_2 + a_3
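A minimal sketch of these updates on a toy graph, renormalizing each pass so the squared scores sum to 1; the page names and edges are invented for illustration:

```python
import math

# Toy web subgraph: three hub pages all link to one page, which links nowhere.
links = {"h1": ["a4"], "h2": ["a4"], "h3": ["a4"], "a4": []}
nodes = list(links)

# Initialize so that the squared scores sum to 1.
auth = {p: 1 / math.sqrt(len(nodes)) for p in nodes}
hub = {p: 1 / math.sqrt(len(nodes)) for p in nodes}

for _ in range(20):
    # a_p = sum of h_q over hubs q that point to p
    auth = {p: sum(hub[q] for q in nodes if p in links[q]) for p in nodes}
    # h_q = sum of a_p over authorities p that q points to
    hub = {q: sum(auth[p] for p in links[q]) for q in nodes}
    # renormalize both score vectors to unit length
    na = math.sqrt(sum(v * v for v in auth.values())) or 1.0
    nh = math.sqrt(sum(v * v for v in hub.values())) or 1.0
    auth = {p: v / na for p, v in auth.items()}
    hub = {q: v / nh for q, v in hub.items()}

print(max(auth, key=auth.get))  # a4 emerges as the authority
```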

Page similarity using HITS Given honda.com, we also get: toyota.com, ford.com, bmwusa.com, saturn.com, nissanmotors.com. This method can have trouble with ambiguous queries, however.

PageRank PageRank (Brin & Page, 1998) is an alternative to HITS that does not distinguish between hubs and authorities.

PageRank: initial idea Assume that in-degree alone does not account for the authority of the source of a link. For page p, the page rank is

R(p) = c Σ_{q: q→p} R(q) / N_q

where N_q is the number of out-links of page q, and c is a normalizing constant. A page's rank flows out equally among its outgoing links.

PageRank: flow of authority PageRank iteratively adjusts all R(p) until the overall page ranking converges. (Diagram: rank flowing along the links of a small example graph to a steady state, with steady-state ranks of 0.4 and 0.2.)

PageRank problem Groups of purely self-referential pages (linked to from the outside) are sinks that absorb all of the rank in the system during the iterative rank-assignment process.

PageRank: rank source An ethereal rank source E continually replenishes the rank of each page p by a fixed amount E(p):

R(p) = c [ Σ_{q: q→p} R(q) / N_q + E(p) ]

Complete ranking A complete ranking involves combining: PageRank; preferences using HTML tags (e.g., the title or abstract are often highly informative); and the similarity of query words and documents. How do we relate query words and documents in the first place?

Next lecture How to relate query terms and documents: the vector space model. How to generalize query terms: latent semantic indexing. How to rank documents: singular value decomposition. How to evaluate different search engines.

Misc Some slides and material are based on those of Ray J. Mooney (UTexas, CS371R); Hinrich Schütze, Christina Lioma, and Chris Manning (Stanford, CS276); and Dan Jurafsky (Stanford, CS124).

Aside: PageRank algorithm Given the total set of pages S:

Let E(p) = α / |S| for all p ∈ S, for some 0 ≤ α ≤ 1
Initialize R(p) = 1 / |S| for all p ∈ S
Until convergence:
    For each p ∈ S:
        R′(p) ← (1 − α) Σ_{q: q→p} R(q) / N_q + E(p)
    c ← 1 / Σ_{p∈S} R′(p)
    For each p ∈ S:
        R(p) ← c R′(p)    // normalize
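A straightforward Python rendering of this algorithm on an invented three-page graph; this is a sketch of the slide's pseudocode, not a production implementation:

```python
# Out-degree-weighted rank propagation plus a uniform rank source
# E(p) = alpha/|S|, with the rank vector renormalized after each pass.

def pagerank(out_links, alpha=0.15, iters=50):
    pages = list(out_links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}
    e = alpha / n                       # E(p), uniform over all pages
    for _ in range(iters):
        new = {}
        for p in pages:
            incoming = sum(rank[q] / len(out_links[q])
                           for q in pages if p in out_links[q])
            new[p] = (1 - alpha) * incoming + e
        c = 1 / sum(new.values())       # normalizing constant
        rank = {p: c * new[p] for p in pages}
    return rank

# Toy graph: both a and c link to b; b links back to a.
graph = {"a": ["b"], "b": ["a"], "c": ["b"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # b collects the most rank
```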