Content-Based Recommendation

Similar documents
Search Taxonomy. Web Search. Search Engine Optimization. Information Retrieval

Data mining techniques: decision trees

Social Media Mining. Data Mining Essentials

Machine Learning Logistic Regression

LCs for Binary Classification

Linear Threshold Units

Artificial Neural Networks and Support Vector Machines. CS 486/686: Introduction to Artificial Intelligence

Spam Detection A Machine Learning Approach

Data Mining Classification: Decision Trees

Collaborative Filtering. Radek Pelánek

Recommender Systems: Content-based, Knowledge-based, Hybrid. Radek Pelánek

Classification algorithm in Data mining: An Overview

An Introduction to Data Mining

Employer Health Insurance Premium Prediction Elliott Lui

Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.

A SURVEY OF TEXT CLASSIFICATION ALGORITHMS

Logistic Regression. Vibhav Gogate The University of Texas at Dallas. Some Slides from Carlos Guestrin, Luke Zettlemoyer and Dan Weld.

Anti-Spam Filter Based on Naïve Bayes, SVM, and KNN model

Big Data Analytics CSCI 4030

VCU-TSA at Semeval-2016 Task 4: Sentiment Analysis in Twitter

Analysis of kiva.com Microlending Service! Hoda Eydgahi Julia Ma Andy Bardagjy December 9, 2010 MAS.622j

Modelling, Extraction and Description of Intrinsic Cues of High Resolution Satellite Images: Independent Component Analysis based approaches

Machine Learning using MapReduce

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

MACHINE LEARNING IN HIGH ENERGY PHYSICS

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION

Machine learning for algo trading

Support Vector Machines with Clustering for Training with Very Large Datasets

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

Decompose Error Rate into components, some of which can be measured on unlabeled data

Machine Learning in Spam Filtering

Active Learning SVM for Blogs recommendation

Bagged Ensemble Classifiers for Sentiment Classification of Movie Reviews

Supervised Learning (Big Data Analytics)

Machine Learning Final Project Spam Filtering

Analysis Tools and Libraries for BigData

Machine Learning and Data Mining. Regression Problem. (adapted from) Prof. Alexander Ihler

Université de Montpellier 2 Hugo Alatrista-Salas : hugo.alatrista-salas@teledetection.fr

Introduction to Support Vector Machines. Colin Campbell, Bristol University

Data Mining. Nonlinear Classification

The Need for Training in Big Data: Experiences and Case Studies

Data Mining Tools. Jean- Gabriel Ganascia LIP6 University Pierre et Marie Curie 4, place Jussieu, Paris, Cedex 05 Jean-

Statistical Machine Learning

Knowledge Discovery from patents using KMX Text Analytics

STA 4273H: Statistical Machine Learning

Data Mining Practical Machine Learning Tools and Techniques

Statistical Feature Selection Techniques for Arabic Text Categorization

An Overview of Knowledge Discovery Database and Data mining Techniques

Supervised Feature Selection & Unsupervised Dimensionality Reduction

BIDM Project. Predicting the contract type for IT/ITES outsourcing contracts

Data Mining - Evaluation of Classifiers

Mining a Corpus of Job Ads

1. Classification problems

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Experiments in Web Page Classification for Semantic Web

RANDOM PROJECTIONS FOR SEARCH AND MACHINE LEARNING

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Data Mining Yelp Data - Predicting rating stars from review text

Assessment. Presenter: Yupu Zhang, Guoliang Jin, Tuo Wang Computer Vision 2008 Fall

Support Vector Machine (SVM)

Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.

Probabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur

Distances, Clustering, and Classification. Heatmaps

Reference Books. Data Mining. Supervised vs. Unsupervised Learning. Classification: Definition. Classification k-nearest neighbors

Projektgruppe. Categorization of text documents via classification

Data Mining for Knowledge Management. Classification

Large-scale Data Mining: MapReduce and Beyond Part 2: Algorithms. Spiros Papadimitriou, IBM Research Jimeng Sun, IBM Research Rong Yan, Facebook

Recommendations in Mobile Environments. Professor Hui Xiong Rutgers Business School Rutgers University. Rutgers, the State University of New Jersey

Data Mining Techniques for Prognosis in Pancreatic Cancer

A Survey on Pre-processing and Post-processing Techniques in Data Mining

Defending Networks with Incomplete Information: A Machine Learning Approach. Alexandre

Research on Sentiment Classification of Chinese Micro Blog Based on

A Proposed Algorithm for Spam Filtering s by Hash Table Approach

The Data Mining Process

Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup

Statistical Models in Data Mining

3/17/2009. Knowledge Management BIKM eclassifier Integrated BIKM Tools

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University

Lecture 6: Logistic Regression

Analytics on Big Data

Applying Machine Learning to Stock Market Trading Bryce Taylor

Data Mining Analytics for Business Intelligence and Decision Support

Data Mining Essentials

Local features and matching. Image classification & object localization

Data Mining for Customer Service Support. Senioritis Seminar Presentation Megan Boice Jay Carter Nick Linke KC Tobin

International Journal of Computer Science Trends and Technology (IJCST) Volume 3 Issue 3, May-June 2015

Machine Learning. Chapter 18, 21. Some material adopted from notes by Chuck Dyer

Data Mining for Web Personalization

W. Heath Rushing Adsurgo LLC. Harness the Power of Text Analytics: Unstructured Data Analysis for Healthcare. Session H-1 JTCC: October 23, 2015

Lecture 3: Linear methods for classification

Machine Learning and Pattern Recognition Logistic Regression

11 Linear and Quadratic Discriminant Analysis, Logistic Regression, and Partial Least Squares Regression

CS 2750 Machine Learning. Lecture 1. Machine Learning. CS 2750 Machine Learning.

Data Mining. Cluster Analysis: Advanced Concepts and Algorithms

Summary Data Mining & Process Mining (1BM46) Content. Made by S.P.T. Ariesen

The Optimality of Naive Bayes

Bayes and Naïve Bayes. cs534-machine Learning

Final Project Report

Transcription:

Content-Based Recommendation

Content-based? Item descriptions to identify items that are of particular interest to the user

Example

Example

Comparing with Noncontent based Items User-based CF Searches for similar users in user-item rating matrix No rating Item-feature matrix Ratings

Item Representation Structured Unstructured Semi-structured

Attribute - value Structured

Unstructured Full-text No attributes formally defined Other complicated problems - such as synonymy, polysymy

Semi-structured Structured + unstructured Well defined attributes/values + free text

Conversion of Unstructured Data Need to convert to structured form IR techniques VSM, TF-IDF, stemming, etc.

User Profile A model of user s preference A history of the user s interactions Recently viewed items Already viewed items Training data machine learning

User Profile Tear Rate Reduced Normal No Young Age Pre-presb Yes Presbyoptic Prescription Yes Hypermetrope Myope Astigmatic Yes Yes No No

User Profiling Collects information about individual user User Modeling side Adaptive System User Model Provides adaptation effect Adaptation side User profiling == user modeling

User Profile Customization Manual Limitations user efforts (interest change), no ordering Rule-based recommendation history based rules

User Profile

User Model Learning Classification problem Classifying to Like or Dislike Training data feedbacks Probability of classification Unstructured data conversion Feature selection high/low dimensions

User Model Learning Training Data Target Data Train/Learn Classify/Recommen

UM Learning Feedbacks Implicit feedback Indirect interaction Opened document, Reading time, etc. Large data, high uncertainty

UM Learning Feedbacks Explicit Directly from users No noise, hard to obtain

User Model Learning Feature Selection Problem of high dimensional input vectors Overfit (especially when a dataset is small) Document frequency thresholding, Information gain, Mutual information, Chi square statistic, Term strength

Overfitting Overfit Underfit

User Model Learning Feature Selection Mutual Information A = number of times t and c co-occur B = number of times t occurs without c C = number of times c occurs without t N = number of total documents I(t,c) = log A! N (A + C)! (A + B)

User Model Learning Feature Selection Austrian train fire accident After learning 5 documents Train Fire Alps Austria People A 5 5 2 5 5 B 5873 8092 93 974 34501 C 0 0 3 0 0

Decision Tree Partitioning dataset into trees Ideal for structured, small data Performance, simplicity, understandability ID3 (Iterative Dichotomiser 3)

Decision Tree Example Using WEKA http://www.cs.waikato.ac.nz/ml/weka/ Machine learning algorithm package JAVA API, interactive UI

Decision Tree Example - dataset Training Testing attribute as inputs attribute to be es

Decision Tree Example - tree

Decision Tree Example - tree Tear Rate Reduced Normal Young Age Pre-presbyoptic es Presbyoptic Prescription Yes Hypermetrope s Myope Astigmatic Yes No

Decision Tree Example - evaluation

k Nearest Neighbor Prepare training data (classification labels) Extract k most similar items Decide the label of the test data by looking at knn s

k Nearest Neighbor Example k=3 k=5

Linear Classifier aining data Tries to find out a hyperplane that best separates classes

Linear Classifier SVN Support Vector Machine Maximizes the distance between decision boundary & support vector (closest training instance) Avoids overfitting

Linear Classifier SVM

Naive Bayes VSM lack of theoretical justification Probabilistic text classification method Naive = term independence Probability document d is classified to category c

Naive Bayes Multivariate Bernoulli Document probability = of term probability term independence assumption naive

Naive Bayes Multinomial Non-binary

Naive Bayes Example

Conclusions User model learns from content (description, fulltext, etc) itself Implicit, explicit method Classifying like, dislike Limitations not enough content Hybrid content + collaborative