Generating Query Substitutions



Similar documents
Generating Query Substitutions

Course Overview (subject to change)

Search and Information Retrieval

Yifan Chen, Guirong Xue and Yong Yu Apex Data & Knowledge Management LabShanghai Jiao Tong University

Quick Start Guide. what is property owners liability insurance

Building a Question Classifier for a TREC-Style Question Answering System

POSBIOTM-NER: A Machine Learning Approach for. Bio-Named Entity Recognition

Automatic Generation of Bid Phrases for Online Advertising

Dublin City University at CLEF 2004: Experiments with the ImageCLEF St Andrew s Collection

Data Mining in CRM & Direct Marketing. Jun Du The University of Western Ontario jdu43@uwo.ca

Help Sheet on the Use of RIA-Checkpoint Database

Machine Learning Final Project Spam Filtering

SEARCH ENGINE OPTIMIZATION USING D-DICTIONARY

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter

Data Mining on Streams

Domain Classification of Technical Terms Using the Web

MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.

Projektgruppe. Categorization of text documents via classification

What Is This, Anyway: Automatic Hypernym Discovery

Insurance Solutions. 17 October Risk Solutions

Mammoth Scale Machine Learning!

Keywords the Most Important Item in SEO

University of Glasgow Terrier Team / Project Abacá at RepLab 2014: Reputation Dimensions Task

Overview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set

The Winchester School Family Learning Newsletter (FS 2) March 2015

Finding Advertising Keywords on Web Pages. Contextual Ads 101

Measuring the online experience of auto insurance companies

TS3: an Improved Version of the Bilingual Concordancer TransSearch

Classification of Bad Accounts in Credit Card Industry

ATLAS.ti 5.5: A Qualitative Data Analysis Tool

3a Order the words to make questions. b Check your answers. c Practise saying the questions.

quote for non owners auto insurance Quick Start Guide This quote for non owners auto insurance is as independently produced user guides.

The Artificial Prediction Market

Facebook Friend Suggestion Eytan Daniyalzade and Tim Lipus

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) ( ) Roman Kern. KTI, TU Graz

How to undertake a literature search and review for dissertations and final year projects

Protein Protein Interaction Networks

Using Edit-Distance Functions to Identify Similar Addresses Howard Schreier, U.S. Dept. of Commerce, Washington DC

Teens, Online Stranger Contact and Cyberbullying What the research is telling us

Bing Liu. Web Data Mining. Exploring Hyperlinks, Contents, and Usage Data. With 177 Figures. ~ Spring~r

OvidSP Quick Reference Guide

QUANTIFIERS (7) Countable vs. Uncountable (01)

MyOra 3.0. User Guide. SQL Tool for Oracle. Jayam Systems, LLC

Introduction to Bayesian Classification (A Practical Discussion) Todd Holloway Lecture for B551 Nov. 27, 2007

Extraction of Hypernymy Information from Text

TF-IDF. David Kauchak cs160 Fall 2009 adapted from:

IBM SPSS Direct Marketing 22

Justin Kribs, CFP. Manager, Student Debt Counseling and Financial Management

Why Evaluation? Machine Translation. Evaluation. Evaluation Metrics. Ten Translations of a Chinese Sentence. How good is a given system?

Add external resources to your search by including Universal Search sites in either Basic or Advanced mode.

Information Extraction for Standardization of Tourism Products

I Want To Start A Business : Getting Recommendation on Starting New Businesses Based on Yelp Data

CSE 473: Artificial Intelligence Autumn 2010

Machine Translation. Why Evaluation? Evaluation. Ten Translations of a Chinese Sentence. Evaluation Metrics. But MT evaluation is a di cult problem!

Contents EARLY START SPANISH - TÚ Y YO. Introduction 4. 1 Hola Greetings Adiós Saying goodbye Qué tal? Asking people how they are 26

Online Marketing Briefing: Google Instant

Data Mining Practical Machine Learning Tools and Techniques

Math 141. Lecture 2: More Probability! Albyn Jones 1. jones/courses/ Library 304. Albyn Jones Math 141

Answers to the Problems Chapter 3

A Deeper Look Inside Generalized Linear Models

M15_BERE8380_12_SE_C15.7.qxd 2/21/11 3:59 PM Page Analytics and Data Mining 1

what is property owners liability insurance

Hector s World Lesson plan Episode: Computer security: Oops Upper primary

Durango Real Estate: Revisiting California and Price-to-Rent by Luke T Miller, PhD

How to set up a campaign with Admedo Premium Programmatic Advertising. Log in to the platform with your address & password at app.admedo.

SAMPLE. Grammar, punctuation and spelling. Paper 2: short answer questions. English tests KEY STAGE LEVEL. Downloaded from satspapers.org.

Google Instant: Potential Impact on SEM and SEO

Topics in Computational Linguistics. Learning to Paraphrase: An Unsupervised Approach Using Multiple-Sequence Alignment

Support Vector Machine (SVM)

Constituency. The basic units of sentence structure

Supervised Learning (Big Data Analytics)

Some Statistical Applications In The Financial Services Industry

CS 2112 Spring Instructions. Assignment 3 Data Structures and Web Filtering. 0.1 Grading. 0.2 Partners. 0.3 Restrictions

Lecture 10: Regression Trees

Collecting Polish German Parallel Corpora in the Internet

Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model

Rami: Alright Simon, we need to learn how to make money in the stock market. Simon: Ok, fine. What do you know about the stock market?

IBM SPSS Direct Marketing 23

Accurately and Efficiently Measuring Individual Account Credit Risk On Existing Portfolios

Machine Learning CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

A LEVEL H446 COMPUTER SCIENCE. Code Challenges (1 20) August 2015

Text Analytics Illustrated with a Simple Data Set

UNDERSTANDING THE EFFECTIVENESS OF BANK DIRECT MARKETING Tarun Gupta, Tong Xia and Diana Lee

Mining Term Association Patterns from Search Logs for Effective Query Reformulation

An Overview of Computational Advertising

Probability and Statistics

A picture is worth five captions

Chapter Two: Your Financial Future

TREC 2003 Question Answering Track at CAS-ICT

Gerry Hobbs, Department of Statistics, West Virginia University

Intro to Linear Equations Algebra 6.0

NYC + SAN DIEGO MEASUREMENTS 1. Demographics. Media

Making Sense of the Mayhem: Machine Learning and March Madness

Predicting the Stock Market with News Articles

Fun Learning Activities for Mentors and Tutors

More information >>> HERE <<<

Active Learning SVM for Blogs recommendation

Statistical Validation and Data Analytics in ediscovery. Jesse Kornblum

Non-life insurance mathematics. Nils F. Haavardsson, University of Oslo and DNB Skadeforsikring

Transcription:

1 Generating Query Substitutions Rosie Jones Benjamin Rey, Omid Madani and Wiley Greiner

Substitutes 2

Query Substitutions 3

Query Substitutions 4

Functions of Rewriting Enhance meaning Spell correction Corpus appropriate terminology Cat cancer feline cancer Change meaning Narrow [ lexical entailment: fruit apple] Broaden [ alternatives, common interests] Conference proceedings textbooks 5

Trying to Find Nathan Welsh, who lives and works in Edinburgh nathan welsh edinburg scotland nathan welsh edinburgh scotland financial consultants edinburg scotland financial consultants edinburgh scotland financial consultants nathan welsh 16 18 pennwell place edinburgh nathan welsh 16 18 pennywell place edinburgh international phone directory white pages edinburgh scotland phone directory edinburgh scotland uk nathan welsh investment consultant edinburg nathan welsh investment consultant edinburgh investment consultants edinburgh scotland nathan welsh kansas virginia herndon virginia Spell correction Name profession Spell correction Delete terms, generalize Spell correction Try looking up addresses rephrase specialization Generalize to location Switch to new topic Try second approach, using his address 6

Half of Query Pairs are Related Type non rewrite insertions substitutions deletions spell correction mixture specialization generalization Example mic amps > create taxi game codes > video game codes john wayne bust > john wayne statue skateboarding pics skateboarding real eastate > real estate huston's restaurant > houston's jobs > marine employment gm reabtes > show me all the current auto rebates % 53.2% 9.1% 8.7% 5.0% 7.0% 6.2% 4.6% 3.2% other thansgiving > dia de acconde gracias 2.4% 7

Substitutions are repeated car insurance auto insurance 5086 times in a sample car insurance car insurance quotes 4826 times car insurance geico [ brand of car insurance ] 2613 times car insurance progressive auto insurance 1677 times car insurance carinsurance 428 times Different Users, Different Days 8

Test whether Statistical Test to Find Significant Rewrites p q2 q1 >> p q2 P(britney spears brittney spears) >> P(britney spears) 8% >> 0.01% Log likelihood ratio test (GLRT) gives χ 2 distributed score About 90% of query pairs are related after filtering with LLR > 100 9

Many Types of Substitutable Rewrites dog > dogs dog > cat dog > dog breeds dog > dog pictures dog > 80 dog > pets dog > puppy dog > dog picture dog > animals dog > pet 9185 5942 5567 5292 2420 1719 1553 1416 1363 920 pluralization both instances of 'pet generalization more specific random junk in query processing generalization hypernym specification hyponym more specific generalization hypernym generalization hypernym 10

Defining Categories of Relatedness for Sponsored Search 1 Precise Match A near certain match. E.g.: automotive insurance automobile insurance; 2 Approximate Match A probable, but inexact match with user intent. E.g.: apple music player ipod shuffle 3 Marginal Match A distant, but plausible match to a related topic. E.g.: glasses contact lenses 4 Mismatch A clear mismatch. We will call {1,2} Precise and {1,2,3} Broad 11

Increase Tail Coverage with Query Segmentation segmentation Q' Q Q'' ' p p 1 2 '' p p 1 2 p p ' 1 2 p p 1 2 ' ' p p 1 2 Query segmented using high mutual information terms Most frequent queries: replace whole query Infrequent queries: replace constituent phrases 12

Generating Query Substitutions Q1 > {q2,q3,q4,q5,q6} catholic baby names > {christian baby names, christian baby boy names, catholic names, } All are statistically relevant (log likelihood ratio on successive queries) Find a model to rank substitutions, to be able to pick the best ones score Q u 1 '' u 2 score Q Q''... associate a probability of correctness P Q Q' is correct score Q Q' 13

Train/Test Data Sample 1000 queries (q1) Select a single substitution for each (q2) Manually label the <q1,q2> pairs Learn to score <q1,q2> pairs Order by score Assess Precision/Recall Precise task {1,2} vs {3,4} Broad task {1,2,3} vs {4} 14

Predicting High Quality Query Suggestions Used labels to fit model Tried 37 features for model: Lexical features including Levenshtein character edit distance Prefix overlap Porter stem Jaccard score on words Statistical features including Probability of rewrite Frequency of rewrite Other Number of substitutions (numsubst) Whole query = 0 Replace one phrase = 1 Replace two phrases = 2 Query length, existence of sponsored results 15

Baseline: Most suggestions broadly related Random Suggestion Maximize LLR Minimize numsubst Precise {1,2} 55% 66% Broad {1,2,3} 87.5% 16

Simple Decision Tree wordsincommon > 0 Yes No Class={1,2} Yes prefixoverlap>0 No Class={1,2} Class={3,4} Interpretation of the decision tree: substitution must have at least 1 word in common with initial query the beginning of the query should stay unchanged 17

Linear Regression Model Regression: continuous output in [1,4] LMScore=in tercept f = features w f. f Classification: If(LMScore < T) then Good, else Bad For each T, we have a precision and a recall Evaluation: Average precision / recall on 100 times 10 fold cross validation 18

Learned Function f q 1, q 2 =0.74 1.88 editdist q 1, q 2 0. 71 worddist q 1, q 2 0.36 numsubst q 1, q 2 Outputs continuous score [1..4] Like decision tree Prefer few edits Prefer few word changes Prefer whole query or few phrase changes Normalize output to a probability of correctness using sigmoid fit 19

SVM, Bags of Trees, Linear Model Give Similar Trade offs 100% 95% 90% 2 levels DT bag of 100 DTs SVM Linear model 85% precision 80% 75% 70% 65% 60% 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100% recall 20

Results Reranking Query Suggestion Pairs Breakeven Max F1 Av. Prec Baseline 0.66 0.66 0.66 2 level DT 0.71 0.83 0.71 SVM 0.81 0.83 0.86 Linear Model Bag of 100 DTs 0.80 0.83 0.84 0.84 0.87 0.88 21

Generate Best Candidate on New Sample: Precision at 100% Recall Precise {1,2} Broad {1,2,3} Random Suggestion Maximize LLR Minimize numsubst Linear Regression 55% 66% 74% 87.5% 87.5% 22

Example Query Substitutions Initial Query Substitution Handlabel Alg. Prob anne klien watches anne klein watches 1 92% sea world san diego sea world san diego tickets 2 90% restaurants in washington dc restaurants in washington 2 89% nash county wilson county 3 66% frank sinatra birth certificate elvis presley birth 4 17% 23

Applications Sponsored Search Query Expansion Assisted Search Lexical entailment 24

Other Sources of Phrase Similarity Data Source Queries Cooccurence Pet dogs Distributional Similarity dog food pet food Query Sequence Sentences Pets dogs Pets such as dogs dogs petshop pets petshop pets like their owners dogs like their owners Documents pets dogs dogs.. owners.. food pets.. owners.. food 25

Sample Query Substitution Types meaning of dreams interpretation of dreams (synonym) furniture etegere furniture etagere (spelling correction) venetian hotel venetian hotel las vegas (expansion) delta employment credit union decu (acronym) lyrics finder mp3 finder (related term) national car rental alamo car rental (related brand) amanda peet saving silverman (actress in) 26

Future Work Add other sources of similarity Try additional features IR experiments 27

Substitutes for Edinburgh Castle edinburgh castle scotland (0.86) edinburgh (0.79) stirling castle (0.76) dudley castle (0.56) 28

Substitutes for Mars Bar mars candy bar (0.83) mars candy (0.77) Substitutes for deep fried mars bar? 29