Finding Advertising Keywords on Web Pages. Contextual Ads 101




Finding Advertising Keywords on Web Pages
Scott Wen-tau Yih, Joshua Goodman (Microsoft Research)
Vitor R. Carvalho (Carnegie Mellon University)

Contextual Ads 101
- Publisher's website: Digital Camera Review. "The new flagship of Canon's S-series, PowerShot S80 digital camera, incorporates 8 megapixels for shooting still images and a movie mode that records an impressive 1024 x 768 pixels."
- Sponsored Links: Digital Camera for $99, http://buycamera.com
- Display Ads
- Google's AdSense program: more than 40% of its revenues
- Advertisers bid on keywords: digital camera powershot s80 canon camera movie video football
- Ad Agency (Company G): find relevant keywords, match keywords

Keyword Extraction is the Key!!
- Company M wants to copy this business model
- Publisher's website: Chinese Restaurant Review. "Yen Ching's menu is of daunting length and enormous breadth. For example, a lot of vegetarians like their Braised Fungus and Winter Bamboo Shoots, while others love the special Stewed Duck and Iron Plate Beef."
- Sponsored Links: Eliminate Nail Fungus, http://pharmacy.com/nail-fungus
- Keywords extracted are more relevant
  - More useful and interesting to readers
  - Higher click-through rate, more revenue

Introduction
- A machine learning based system
  - Significantly better than simple TF×IDF baseline
  - Better than an existing system, KEA
- Explore different frameworks of choosing keyword candidates
  - Phrases vs. Words: looking at whole phrases monolithically is better
  - Combined vs. Separate: will show that looking at all instances of a phrase together (combined) is better
- Extensive feature study
  - TF and DF: instead of TF×IDF, use them as separate features
  - Search Query Log: keywords that people use to query are good features to find keywords people like

Outline
- System Architecture
  - Preprocessor
  - Candidate selector
  - Postprocessor
- Experiments
  - Data preparation
  - Performance measures
  - Results
- Related Work

System Architecture
[Diagram: HTML Document, Candidate Selector; example output: PowerShot 0.17, Canon 0.14, Canon's S-series 0.06, Digital Camera 0.07]

Preprocessor
- Facilitate keyword candidate selection and feature extraction
- Transform HTML documents into sentence-split plain-text documents
  - No sophisticated parsing
  - No block detection
- Preserve/augment some information
  - Some HTML tags
  - Linguistic analysis: POS tagging

Monolithic (1/2)
- Consider consecutive words up to length 5 as candidates
- Candidates do not cross sentence boundaries
- Example page: Digital Camera Review. "The new flagship of Canon's S-series, PowerShot S80 digital camera, incorporates 8 megapixels for shooting still images and a movie mode that records an impressive 1024 x 768 pixels."
- Some candidates: The; The new; The new flagship; The new flagship of; The new flagship of Canon; new; new flagship; ...
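The monolithic candidate rule above (every run of up to 5 consecutive words, never crossing a sentence boundary) can be sketched as follows. This is a minimal illustration, not the authors' code: the regex sentence splitter, the whitespace tokenizer, and the names `split_sentences` and `candidates` are assumptions made for the example.

```python
import re

MAX_LEN = 5  # maximum candidate length in words, as stated on the slide


def split_sentences(text):
    """Naive sentence splitter (an assumption; the slides do not describe theirs)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def candidates(sentence, max_len=MAX_LEN):
    """Yield every run of 1..max_len consecutive words inside one sentence."""
    words = sentence.rstrip(".!?").split()  # whitespace tokenization, also an assumption
    for start in range(len(words)):
        for length in range(1, max_len + 1):
            if start + length > len(words):
                break
            yield " ".join(words[start:start + length])


# Shortened example text; candidates are generated per sentence,
# so no candidate spans the boundary between the two sentences.
text = "The new flagship of Canon arrived. It records video."
all_candidates = [c for s in split_sentences(text) for c in candidates(s)]
```

Generating candidates per sentence is what enforces the boundary rule: a phrase such as "arrived It" can never be produced.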

Monolithic (2/2)
- Combined vs. Separate
  - Information extraction community usually looks at candidate phrases separately, while previous work in this area has combined all instances together
- Example page: the Digital Camera Review text shown earlier

Once we have candidates, must determine which ones are the best
- Two steps:
  - For each phrase, extract its features
    - Indications of whether a candidate phrase is relevant to the document
    - Use both binary and real-valued features
  - From features, determine score of the phrase
    - Learn the weights of features

Important Features
- Example page: the Digital Camera Review text shown earlier
- Term Frequency & Document Frequency (IR features)
- Search Query Log
  - Most frequent 7.5 million query terms from MSN search
  - Whether the phrase is in the query log, as well as the frequency
- Whether the phrase appears in TITLE
- Sentence Length (of the sentence the phrase is in)
- Capitalization (whether the phrase is capitalized)
- Location (relative to the whole document and sentence)
- Linguistics (noun or proper noun)
- MetaSec (keywords, description, etc.)

Logistic Regression
- Need to combine the features to get a score for each phrase
  - For each feature, compute a weight
  - For a given phrase, find weighted sum of features, add them up
- Need to find the weights
  - Use training data (more later) with list of correct keyphrases for each document
  - Use logistic regression to find best weights

$$p(y = 1 \mid x) = \frac{\exp(x \cdot w)}{1 + \exp(x \cdot w)}$$

- y is 1 if word/phrase is relevant
- x is the features of the word/phrase (a vector of numbers)
- w is the vector of feature weights
- Learning: find weights that match the labeled training data
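The scoring step above (weighted sum of features passed through the logistic function) can be sketched as follows. The feature order, the feature values, and the weights are all made up for illustration; in the system described, the weights are learned from the annotated training pages.

```python
import math


def phrase_probability(features, weights):
    """p(y = 1 | x) = exp(x.w) / (1 + exp(x.w)) for one candidate phrase."""
    z = sum(x * w for x, w in zip(features, weights))
    # 1 / (1 + exp(-z)) is algebraically the same as exp(z) / (1 + exp(z)).
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical feature order: [bias, term frequency, document frequency,
# in query log, in title]. TF and DF are separate features, each with its
# own weight, rather than one combined TF×IDF value.
weights = [-2.0, 0.8, -0.5, 1.2, 0.9]    # made-up values, not learned weights
powershot = [1.0, 2.0, 0.1, 1.0, 1.0]    # rare across documents, in query log and title
the = [1.0, 3.0, 5.0, 0.0, 0.0]          # frequent on the page but common everywhere

p_powershot = phrase_probability(powershot, weights)
p_the = phrase_probability(the, weights)
```

With these illustrative numbers the phrase that is rare across documents and present in the query log and title scores well above the common word, which is the behaviour the learned weights are meant to capture.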

Postprocessor
- Monolithic Combined (consider identical phrases as one candidate)
  - Directly output what the classifier predicts
- Monolithic Separate
  - Output the largest probability estimate among identical candidates

Experiments
- How do we collect data to train and evaluate our system?
- How good is our system?
  - How to measure performance
  - Which framework is the best?
  - Compare it with other systems
  - Feature contribution
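For the Monolithic Separate framework, each occurrence of a phrase is scored on its own and the largest probability is reported for the phrase. A minimal sketch of that merge step, with made-up per-occurrence probabilities and a hypothetical helper name; treating phrases as identical after lower-casing is an assumption.

```python
from collections import defaultdict


def merge_separate_scores(instance_scores):
    """Collapse per-occurrence probabilities to one score per phrase (the max)."""
    best = defaultdict(float)
    for phrase, prob in instance_scores:
        key = phrase.lower()  # case-folding is an assumption about "identical"
        best[key] = max(best[key], prob)
    # Highest-scoring phrases first.
    return sorted(best.items(), key=lambda item: item[1], reverse=True)


# Made-up classifier outputs for individual occurrences on one page.
instance_scores = [
    ("PowerShot", 0.12),
    ("Canon", 0.14),
    ("PowerShot", 0.17),
    ("canon", 0.09),
]
ranked = merge_separate_scores(instance_scores)
```

In the Monolithic Combined framework this step is unnecessary, because all occurrences are already one candidate with one feature vector and one predicted probability.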

Data Annotation
- Raw data: 828 web pages
  - Have content-targeted advertising
  - Remove advertisements
- 5 annotators pick keywords
  - Asked them to choose only words/phrases that occurred in the documents
  - Asked them to label phrases about things they might want to buy when reading this page
- 10-fold cross validation for experiments

Performance Measures
- Accuracy or Recall is not very meaningful
  - Hard to define/pick a complete set of keywords
  - Rank of keywords is also important
- Top-n scores
  - We return our top n phrases
  - Get 1 point for each correct phrase we return (annotator listed that keyphrase)
  - Divide by maximum points any system could possibly get
  - Score is between 0 and 1 (1 is best)
- K_i: set of top n keywords chosen by the system for page i
- A_i: keywords selected by the annotators for page i

$$\text{Score} = \frac{\sum_i |K_i \cap A_i|}{\sum_i \min(|A_i|, n)} \times 100\%$$

Top-n Score for 1 Document
- Annotator keywords: Digital Camera; PowerShot S80; Canon's S-series; S80
- System output (phrase, score): S80 0.23; PowerShot 0.17; Canon 0.14; Canon's S-series 0.06; Digital Camera 0.07; S-series 0.04
- Top-1 score? 1/1 = 1.0
- Top-5 score? 3/4 = 0.75
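The worked example above can be checked with a short script. It is a sketch under one assumption: the split of the flattened slide text into an annotator list of four phrases and a system list of six scored phrases is a reading of the slide, chosen because it reproduces the stated 1/1 and 3/4 results. The function name `top_n_score` is hypothetical.

```python
def top_n_score(system_ranked, annotated, n):
    """Top-n score over pages: sum |K_i ∩ A_i| / sum min(|A_i|, n)."""
    hits = 0
    best_possible = 0
    for ranked, gold in zip(system_ranked, annotated):
        gold_set = set(gold)
        hits += len(set(ranked[:n]) & gold_set)       # correct phrases in the top n
        best_possible += min(len(gold_set), n)        # most points any system could get
    return hits / best_possible if best_possible else 0.0


# System scores as read from the slide (phrase-to-score pairing is an assumption).
scores = {
    "S80": 0.23,
    "PowerShot": 0.17,
    "Canon": 0.14,
    "Canon's S-series": 0.06,
    "Digital Camera": 0.07,
    "S-series": 0.04,
}
ranked = sorted(scores, key=scores.get, reverse=True)
annotated = ["Digital Camera", "PowerShot S80", "Canon's S-series", "S80"]

top1 = top_n_score([ranked], [annotated], 1)
top5 = top_n_score([ranked], [annotated], 5)
```

The denominator uses min(|A_i|, n) so that a page with only four annotated keywords does not penalize a system for being unable to get five right.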

Performance Comparison
- Combining identical phrases as candidates is the best framework
- Better than KEA

[Bar chart: Top-1 and Top-10 scores (%), phrase candidates]
System                 Top-1   Top-10
Monolithic Combined    30.06   46.97
Monolithic Separate    27.95   44.13
KEA                    23.57   38.21
IR features (MoC)      13.63   25.67
TFIDF (MoC)            13.01   19.03

Performance Comparison
- Learning weights for TF and DF separately is better than TF×IDF
  - IR features (MoC): Top-1 13.63, Top-10 25.67
  - TFIDF (MoC): Top-1 13.01, Top-10 19.03

IR + One Set of Features
[Bar chart: Top-1 and Top-10 scores (%)]
Features     Top-1   Top-10
IR           13.63   25.67
+Query       22.36   35.88
+Title       19.9    34.17
+Length      19.22   33.43
+Capital     17.41   33.16
+Location    17.01   32.76
+Ling        18.2    32.26
+MetaSec     19.02   31.9

Search Engine Query Log
- 2nd most useful feature
- Size could be too large, especially for client-side applications
  - 7.5 million queries, 20 bytes per query
  - 20 languages
  - 3 GB of query log files
- Effects of
  - Using a smaller query log file
  - Restricting candidates by query log

Using Different Sizes of Query Log File
[Chart: score (0 to 50) against query log frequency threshold (ticks at 10, 100, 1000, 10000, 100000); series: restop1, restop10]
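The 3 GB figure follows directly from the three numbers on the slide, and the two size-reduction ideas can be sketched in a few lines. The query counts, the threshold, and the helper names `shrink_query_log` and `restrict_candidates` are hypothetical; only the three constants come from the slide.

```python
QUERIES = 7_500_000      # most frequent queries kept (from the slide)
BYTES_PER_QUERY = 20     # from the slide
LANGUAGES = 20           # from the slide

per_language_bytes = QUERIES * BYTES_PER_QUERY   # 150,000,000 bytes, about 150 MB
total_bytes = per_language_bytes * LANGUAGES     # 3,000,000,000 bytes
total_gb = total_bytes / 1e9                     # decimal gigabytes


def shrink_query_log(query_counts, min_frequency):
    """Keep only queries seen at least min_frequency times (smaller log file)."""
    return {q: c for q, c in query_counts.items() if c >= min_frequency}


def restrict_candidates(cands, query_log):
    """Keep only candidate phrases that appear in the query log."""
    return [c for c in cands if c.lower() in query_log]


# Made-up query counts for illustration.
query_counts = {"digital camera": 12000, "powershot s80": 450, "new flagship of": 3}
small_log = shrink_query_log(query_counts, min_frequency=100)
kept = restrict_candidates(["Digital Camera", "The new", "PowerShot S80"], small_log)
```

Raising the frequency threshold trades file size against coverage, which is the trade-off the chart on this slide plots.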

Related Work
Exciting field: Researchers tend to be rich!
- Extracting keywords (from scientific papers)
  - GenEx: rules + GA [Turney IR-00]
  - KEA: Naïve Bayes using 3 features [Frank et al. IJCAI-99]
    - Craig Nevill-Manning, Engineering Director, Google NYC
- Query-Free News Search [Henzinger, et al. WWW-03]
  - Extract keywords from TV news captions
  - Using TF×IDF and its variations to score phrases
  - Sergey Brin, 1 of the 2 billionaires who published in WWW
- Impedance coupling [Ribeiro-Neto et al. SIGIR-05]
  - Match advertisements to web pages directly
  - Berthier Ribeiro-Neto, Google Latin America R&D Center
- Implicit Queries from Emails [Goodman&Carvalho CEAS-05]
  - Joshua Goodman, Poor Researcher, Microsoft Research

Conclusions
- Keyword extraction drives content-targeted advertising
  - Foundation of free web services
  - Very successful business model
- Extensive experimental study
  - TF, DF, Search Query Log are the three most useful features
  - Machine learning is important in tuning the weights
  - Monolithic combined (combine identical phrases together) is the best approach
- Our system is substantially better than KEA, the only publicly available keyword extraction system
- Just a start; hope to see more papers