Ranking Methods in Machine Learning

Similar documents
Large Scale Learning to Rank

Multiple Clicks Model for Web Search of Multi-clickable Documents

Learning to Rank Revisited: Our Progresses in New Algorithms and Tasks

BALANCE LEARNING TO RANK IN BIG DATA. Guanqun Cao, Iftikhar Ahmad, Honglei Zhang, Weiyi Xie, Moncef Gabbouj. Tampere University of Technology, Finland

International Journal of Advanced Research in Computer Science and Software Engineering

How To Rank With Ancient.Org And A Random Forest

A Structural SVM Based Approach for Optimizing Partial AUC

A General Boosting Method and its Application to Learning Ranking Functions for Web Search

Yahoo! Learning to Rank Challenge Overview

Learning to Rank for Information Retrieval

How To Cluster On A Search Engine

Business Analytics for Non-profit Marketing and Online Advertising. Wei Chang. Bachelor of Science, Tsinghua University, 2003

Network Big Data: Facing and Tackling the Complexities Xiaolong Jin

A ranking SVM based fusion model for cross-media meta-search engine *

Context Models For Web Search Personalization

Learning to Rank with Nonsmooth Cost Functions

Identifying Best Bet Web Search Results by Mining Past User Behavior

Revenue Optimization with Relevance Constraint in Sponsored Search

An Alternative Ranking Problem for Search Engines

Listwise Approach to Learning to Rank - Theory and Algorithm

An Unsupervised Learning Algorithm for Rank Aggregation

Learning to Rank By Aggregating Expert Preferences

Scaling Learning to Rank to Big Data

Relationship Identification for Social Network Discovery

Support Vector Machines with Clustering for Training with Very Large Datasets

Semi-Supervised Support Vector Machines and Application to Spam Filtering

Beating the NCAA Football Point Spread

Predictive Data Mining in Very Large Data Sets: A Demonstration and Comparison Under Model Ensemble

Finding the Right Facts in the Crowd: Factoid Question Answering over Social Media

Data Mining in Web Search Engine Optimization and User Assisted Rank Results

The Enron Corpus: A New Dataset for Classification Research

Computational Drug Repositioning by Ranking and Integrating Multiple Data Sources

DATA PREPARATION FOR DATA MINING

Parallel Data Mining. Team 2 Flash Coders Team Research Investigation Presentation 2. Foundations of Parallel Computing Oct 2014

Big Data - Lecture 1 Optimization reminders

Active Learning in the Drug Discovery Process

Introduction to Data Mining

Data, Measurements, Features

Machine Learning CS Lecture 01. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Towards Inferring Web Page Relevance An Eye-Tracking Study

Social Media Mining. Data Mining Essentials

Social Business Intelligence Text Search System

A Support Vector Method for Multivariate Performance Measures

Learning to Rank for Why-Question Answering

Steven C.H. Hoi. School of Computer Engineering Nanyang Technological University Singapore

Domain Classification of Technical Terms Using the Web

Intrusion Detection via Machine Learning for SCADA System Protection

WE DEFINE spam as an message that is unwanted basically

Ensemble Approach for the Classification of Imbalanced Data

Adapting Codes and Embeddings for Polychotomies

An Introduction to Data Mining

The Artificial Prediction Market

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Active Learning of Label Ranking Functions

Robust Ranking Models via Risk-Sensitive Optimization

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Theme-based Retrieval of Web News

Biomedical Big Data and Precision Medicine

Travis Goodwin & Sanda Harabagiu

A Novel Feature Selection Method Based on an Integrated Data Envelopment Analysis and Entropy Mode

Learning from Diversity

MACHINE LEARNING BASICS WITH R

Metaheuristics in Big Data: An Approach to Railway Engineering

Term extraction for user profiling: evaluation by the user

Predict Influencers in the Social Network

RecSys Challenge 2014: an ensemble of binary classifiers and matrix factorization

A Logistic Regression Approach to Ad Click Prediction

Top 10 Algorithms in Data Mining

How To Filter Spam Image From A Picture By Color Or Color

Ming-Wei Chang. Machine learning and its applications to natural language processing, information retrieval and data mining.

MA2823: Foundations of Machine Learning

Sentiment analysis for news articles

An Optimization Framework for Weighting Implicit Relevance Labels for Personalized Web Search

Search and Data Mining: Techniques. Applications Anya Yarygina Boris Novikov

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Constrained Classification of Large Imbalanced Data by Logistic Regression and Genetic Algorithm

Comparison of Data Mining Techniques used for Financial Data Analysis

Comparing Support Vector Machines, Recurrent Networks and Finite State Transducers for Classifying Spoken Utterances

Using Random Forest to Learn Imbalanced Data

Integrating DNA Motif Discovery and Genome-Wide Expression Analysis. Erin M. Conlon

Invited Applications Paper

MAXIMIZING RETURN ON DIRECT MARKETING CAMPAIGNS

Active Learning with Boosting for Spam Detection

Inner Classification of Clusters for Online News

Spam detection with data mining method:

Analysis Tools and Libraries for BigData

Implementing a Resume Database with Online Learning to Rank

The Data Mining Process

Mathematical Models of Supervised Learning and their Application to Medical Diagnosis

ENSEMBLE DECISION TREE CLASSIFIER FOR BREAST CANCER DATA

Personalized Ranking Model Adaptation for Web Search

Advanced analytics at your hands

IDENTIFIC ATION OF SOFTWARE EROSION USING LOGISTIC REGRESSION

How to Win at the Track

Data Integration. Lectures 16 & 17. ECS289A, WQ03, Filkov

An Overview of Knowledge Discovery Database and Data mining Techniques

Towards Recency Ranking in Web Search

ENHANCED WEB IMAGE RE-RANKING USING SEMANTIC SIGNATURES

Classification and Regression by randomforest

Transcription:

Ranking Methods in Machine Learning A Tutorial Introduction Shivani Agarwal Computer Science & Artificial Intelligence Laboratory Massachusetts Institute of Technology

Example 1: Recommendation Systems

Example 2: Information Retrieval

Example 2: Information Retrieval information

Example 2: Information Retrieval information

Example 3: Drug Discovery Problem: Millions of structures in a chemical library. How do we identify the most promising ones?

Example 4: Bioinformatics Human genetics is now at a critical juncture. The molecular methods used successfully to identify the genes underlying rare mendelian syndromes are failing to find the numerous genes causing more common, familial, nonmendelian diseases... Nature 405:847 856, 2000

Example 4: Bioinformatics With the human genome sequence nearing completion, new opportunities are being presented for unravelling the complex genetic basis of nonmendelian disorders based on large-scale genomewide studies... Nature 405:847 856, 2000

Types of Ranking Problems Instance Ranking Label Ranking Subset Ranking Rank Aggregation?

Instance Ranking doc1 > doc2, 10 doc3 > doc5, 20

Label Ranking doc1 doc2 sports > politics health > money science > sports money > politics

Subset Ranking > query 1 doc1 > doc2, doc3 doc5 query 2 doc2 > doc4, doc11 > doc3

Rank Aggregation query 1 results of search engine 1 results of search engine 2 desired ranking query 2 results of search engine 1 results of search engine 2 desired ranking

Types of Ranking Problems Instance Ranking Label Ranking Subset Ranking Rank Aggregation? This tutorial

Tutorial Road Map Part I: Theory & Algorithms Bipartite Ranking k-partite Ranking Ranking with Real-Valued Labels General Instance Ranking RankSVM RankBoost RankNet Part II: Applications Applications to Bioinformatics Applications to Drug Discovery Subset Ranking and Applications to Information Retrieval Further Reading & Resources

Part I Theory & Algorithms [for Instance Ranking]

Bipartite Ranking Relevant (+) doc1+ doc2+ doc3+ Irrelevant (-) doc1- doc2- doc3- doc4-

Bipartite Ranking

Bipartite Ranking

Is Bipartite Ranking Different from Binary Classification?

Is Bipartite Ranking Different from Binary Classification?

Is Bipartite Ranking Different from Binary Classification?

Is Bipartite Ranking Different from Binary Classification?

Is Bipartite Ranking Different from Binary Classification?

Is Bipartite Ranking Different from Binary Classification?

Bipartite Ranking: Basic Algorithmic Framework

Bipartite RankSVM Algorithm [Herbrich et al, 2000; Joachims, 2002; Rakotomamonjy, 2004]

Bipartite RankSVM Algorithm

Bipartite RankBoost Algorithm [Freund et al, 2003]

Bipartite RankBoost Algorithm

Bipartite RankNet Algorithm [Burges et al, 2005]

k-partite Ranking Rating k doc1 k doc2 k doc3 k Rating 2 doc1 2 doc2 2 doc3 2 doc4 2 Rating 1 doc1 1 doc2 1 doc3 1 doc4 1

k-partite Ranking

k-partite Ranking: Basic Algorithmic Framework

Ranking with Real-Valued Labels doc1 y1 doc2 y2 doc3 y3

Ranking with Real-Valued Labels

Ranking with Real-Valued Labels: Basic Algorithmic Framework

General Instance Ranking doc1 > doc1, r1 doc2 > doc2, r2

General Instance Ranking r i

General Instance Ranking

General Instance Ranking: Basic Algorithmic Framework

General RankSVM Algorithm [Herbrich et al, 2000; Joachims, 2002]

General RankBoost Algorithm [Freund et al, 2003]

General RankNet Algorithm [Burges et al, 2005]

Tutorial Road Map Part I: Theory & Algorithms Bipartite Ranking k-partite Ranking Ranking with Real-Valued Labels General Instance Ranking RankSVM RankBoost RankNet Part II: Applications Applications to Bioinformatics Applications to Drug Discovery Subset Ranking and Applications to Information Retrieval Further Reading & Resources

Part II Applications [and Subset Ranking]

Application to Bioinformatics Human genetics is now at a critical juncture. The molecular methods used successfully to identify the genes underlying rare mendelian syndromes are failing to find the numerous genes causing more common, familial, nonmendelian diseases... Nature 405:847 856, 2000

Application to Bioinformatics With the human genome sequence nearing completion, new opportunities are being presented for unravelling the complex genetic basis of nonmendelian disorders based on large-scale genomewide studies... Nature 405:847 856, 2000

Identifying Genes Relevant to a Disease Using Microarrray Gene Expression Data

Identifying Genes Relevant to a Disease Using Microarrray Gene Expression Data

Identifying Genes Relevant to a Disease Using Microarrray Gene Expression Data

Identifying Genes Relevant to a Disease Using Microarrray Gene Expression Data

Identifying Genes Relevant to a Disease Using Microarrray Gene Expression Data

Identifying Genes Relevant to a Disease Using Microarrray Gene Expression Data

Formulation as a Bipartite Ranking Problem Relevant Not relevant

Microarray Gene Expression Data Sets [Golub et al, 1999; Alon et al, 1999]

Selection of Training Genes

Top-Ranking Genes for Leukemia Returned by RankBoost [Agarwal & Sengupta, 2009]

Biological Validation [Agarwal et al, 2010]

Application to Drug Discovery Problem: Millions of structures in a chemical library. How do we identify the most promising ones?

Formulation as a Ranking Problem with Real-Valued Labels

Cheminformatics Data Sets [Sutherland et al, 2004]

DHFR Results Using RankSVM [Agarwal et al, 2010]

Application to Information Retrieval (IR)

Learning to Rank in IR

Learning to Rank in IR

Learning to Rank in IR

Learning to Rank in IR

Learning to Rank in IR

Learning to Rank in IR

Learning to Rank in IR

Learning to Rank in IR

Learning to Rank in IR

Learning to Rank in IR

General Subset Ranking > query 1 doc1 > doc2, doc3 doc5 query 2 doc2 > doc4, doc11 > doc3

General Subset Ranking

Subset Ranking with Real-Valued Relevance Labels query 1 doc1 y1, doc2 y2, doc3 y3 query 2 doc1 y1, y2, y3 doc2 doc3

Subset Ranking with Real-Valued Relevance Labels

RankSVM Applied to IR/Subset Ranking Standard RankSVM [Joachims, 2002]

RankSVM Applied to IR/Subset Ranking RankSVM with Query Normalization & Relevance Weighting [Agarwal & Collins, 2010; also Cao et al, 2006]

Ranking Performance Measures in IR Mean Average Precision (MAP)

Ranking Performance Measures in IR Normalized Discounted Cumulative Gain (NDCG)

Ranking Algorithms for Optimizing MAP/NDCG

LETOR 3.0/OHSUMED Data Set [Liu et al, 2007]

OHSUMED Results NDCG

OHSUMED Results MAP

Further Reading & Resources [Incomplete!]

Early Papers on Ranking W. W. Cohen, R. E. Schapire, and Y. Singer, Learning to order things, Journal of Artificial Intelligence Research, 10:243 270, 1999. R. Herbrich, T. Graepel, and K. Obermayer, Large margin rank boundaries for ordinal regression. Advances in Large Margin Classifiers, 2000. T. Joachims, Optimizing search engines using clickthrough data, KDD 2002. Y. Freund, R. Iyer, R. E. Schapire, and Y. Singer, An efficient boosting algorithm for combining preferences. Journal of Machine Learning Research, 4:933 969, 2003. C.J.C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, G. Hullender, Learning to rank using gradient descent, ICML 2005.

Generalization Bounds for Ranking S. Agarwal, T. Graepel, R. Herbrich, S. Har-Peled, D. Roth, Generalization bounds for the area under the ROC curve, Journal of Machine Learning Research, 6:393 425, 2005. S. Agarwal, P. Niyogi, Generalization bounds for ranking algorithms via algorithmic stability, Journal of Machine Learning Research, 10:441 474, 2009. C. Rudin, R. Schapire, Margin-based ranking and an equivalence between AdaBoost and RankBoost, Journal of Machine Learning Research, 10: 2193 2232, 2009

Bioinformatics/Drug Discovery Applications S. Agarwal and S. Sengupta, Ranking genes by relevance to a disease, CSB 2009. S. Agarwal, D. Dugar, and S. Sengupta, Ranking chemical structures for drug discovery: A new machine learning approach. Journal of Chemical Information and Modeling, DOI 10.1021/ci9003865, 2010.

Other Applications Natural Language Processing M. Collins and T. Koo, Discriminative reranking for natural language parsing, Computational Linguistics, 31:25 69, 2005. Collaborative Filtering M. Weimer, A. Karatzoglou, Q. V. Le, and A. Smola, CofiRank - Maximum margin matrix factorization for collaborative ranking, NIPS 2007. Manhole Event Prediction C. Rudin, R. Passonneau, A. Radeva, H. Dutta, S. Ierome, and D. Isaac, A process for predicting manhole events in Manhattan, Machine Learning, DOI 10.1007/s10994-009-5166, 2010.

IR Ranking Algorithms Y. Cao, J. Xu, T.-Y. Liu, H. Li, Y. Hunag, and H.W. Hon, Adapting ranking SVM to document retrieval, SIGIR 2006. C.J.C. Burges, R. Ragno, and Q.V. Le, Learning to rank with non-smooth cost functions. NIPS 2006. J. Xu and H. Li, AdaRank: A boosting algorithm for information retrieval. SIGIR 2007. Y. Yue, T. Finley, F. Radlinski, and T. Joachims, A support vector method for optimizing average precision. SIGIR 2007. M. Taylor, J. Guiver, S. Robertson, T. Minka, Softrank: optimizing non-smooth rank metrics. WSDM 2008.

IR Ranking Algorithms S. Chakrabarti, R. Khanna, U. Sawant, and C. Bhattacharyya, Structured learning for nonsmooth ranking losses. KDD 2008. D. Cossock and T. Zhang, Statistical analysis of Bayes optimal subset ranking, IEEE Transactions on Information Theory, 54:5140 5154, 2008. T. Qin, X.D. Zhang, M.F. Tsai, D.S. Wang, T.Y. Liu, and H. Li. Query-level loss functions for information retrieval. Information Processing and Management, 44:838 855, 2008. O. Chapelle and M. Wu, Gradient descent optimization of smoothed information retrieval metrics. Information Retrieval (To appear), 2010. S. Agarwal and M. Collins, Maximum margin ranking algorithms for information retrieval, ECIR 2010.

NIPS Workshop 2005 Learning to Rank SIGIR Workshops 2007-2009 Learning to Rank for Information Retrieval NIPS Workshop 2009 Advances in Ranking American Institute of Mathematics Workshop in Summer 2010 The Mathematics of Ranking

Tutorial Articles & Books Tie-Yan Liu, Learning to Rank for Information Retrieval, Foundations & Trends in Information Retrieval, 2009. Shivani Agarwal, A Tutorial Introduction to Ranking Methods in Machine Learning, In preparation. Shivani Agarwal (Ed.), Advances in Ranking Methods in Machine Learning, Springer-Verlag, In preparation.