Big Data in Web Search. Claudio Lucchese claudio.lucchese@isti.cnr.it hpc.isti.cnr.it

High Performance Computing Lab
- 7 Researchers
- 3 Post-doc fellows
- 6 PhD students
- 3 Research Associates from U. of Venice and Pisa

Main Research Topics
Web Search & Scalable DM/ML
- Responsiveness of large-scale search systems; storage, analysis and indexing of large amounts of data
- Machine learning and Web mining techniques for Ranking, Prediction, Recommendation, Diversification, Social media analysis, Entity Linking and Semantic Enrichment
Cloud and Distributed computing
- Cloud federations, Resource Management
- Network overlays for P2P and Big Data
- Scalable data analysis with Hadoop MapReduce, Giraph, Spark, etc.

Outline
Some recent Learning to Rank activities:
- User Task Discovery in Query Logs. Lucchese, C., Orlando, S., Perego, R., Silvestri, F., & Tolomei, G. Discovering tasks from search engine query logs. ACM Transactions on Information Systems (TOIS), 31(3), 14. 2014. ACM Notable Article.
- Entity Linking. D. Ceccarelli, C. Lucchese, S. Orlando, R. Perego, S. Trani. Learning relatedness measures for entity linking. In Proceedings of CIKM '13: ACM Int. Conference on Information and Knowledge Management, pp. 139-148, Oct. 2013.
- News Recommendation. G. De Francisci Morales, A. Gionis, C. Lucchese. From Chatter to Headlines: Harnessing the Real-Time Web for Personalized News Recommendation. In Proceedings of WSDM '12: ACM Int. Conference on Web Search and Data Mining, Seattle, Washington, USA, February 2012.
- Tour planning. Brilhante, I., Macedo, J. A., Nardini, F. M., Perego, R., & Renso, C. Where shall we go today?: planning touristic tours with TripBuilder. In Proceedings of CIKM '13: ACM Int. Conference on Information and Knowledge Management, pp. 757-762, Oct. 2013.

Ranking
- Ranking is (one of) the most important challenges in Web Search
- We define Ranking as the problem of sorting a set of documents according to their relevance to the user query
- This is a typical Big Data task: users' feedback is very relevant

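Learning to Rank is:
Training set of (query, document, relevance label) triples:
$(q_1, d_{11}, r_{11}),\ (q_1, d_{12}, r_{12}),\ \ldots,\ (q_1, d_{1k}, r_{1k}),\ \ldots,\ (q_m, d_{m1}, r_{m1}),\ \ldots,\ (q_m, d_{mk}, r_{mk})$
- Learning: the training set is used to learn a scoring function $h$
- Document scoring: an unseen pair $(q^*, d^*, ?)$ becomes $(q^*, d^*, h(q^*, d^*))$
The goal is to learn the ranking, not the label!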

Is it easy?
Same setting: learn a scoring function $h$ from the labeled triples $(q_i, d_{ij}, r_{ij})$, then score each unseen pair $(q^*, d^*)$ with $h(q^*, d^*)$.
- Not so easy when optimizing typical Information Retrieval measures
- One simple reason is that they imply sorting (of documents), which is not a differentiable function of the scores
- Therefore we cannot directly apply gradient descent or similar methods
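To make the sorting issue concrete, here is a minimal sketch showing that a rank-based measure only changes when two documents swap places, so it is flat almost everywhere as a function of the scores. It assumes the common NDCG definition (gain 2^r - 1, log2 discount); the slides do not say which measure or variant was optimized, and the labels and scores are made up.

```python
import math

def dcg(labels):
    """Discounted cumulative gain of relevance labels given in ranked order."""
    return sum((2 ** r - 1) / math.log2(pos + 2) for pos, r in enumerate(labels))

def ndcg(scores, labels):
    """NDCG of the ranking obtained by sorting documents by descending score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ideal = dcg(sorted(labels, reverse=True))
    return dcg([labels[i] for i in order]) / ideal if ideal > 0 else 0.0

labels = [0, 2, 1]          # hypothetical relevance labels for three documents
scores = [0.9, 0.5, 0.1]    # hypothetical scores h(q, d)

base = ndcg(scores, labels)
nudged = ndcg([0.9, 0.5 + 1e-3, 0.1], labels)   # tiny score change, same order
swapped = ndcg([0.4, 0.5, 0.1], labels)         # document 1 now outranks document 0
```

Here `nudged` equals `base` because the order is unchanged, while `swapped` jumps to a higher value: the measure has zero gradient wherever it is defined and jumps at the swap points.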

QuickScore
- A Learning-to-Rank function is typically implemented as a forest of thousands of decision trees
- QuickScore is a cache-aware algorithm improving the scoring efficiency of tree-based ranking models
- It is 2x to 6.5x faster than state-of-the-art implementations
- It visits a tree by touching a smaller number of nodes
- We also implemented an efficient multi-threaded learning toolkit, named QuickLearn, implementing a few variants of Gradient Boosted Regression Trees
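For context, the sketch below shows the plain root-to-leaf traversal of an additive tree forest, which is the kind of baseline that cache-aware scoring aims to beat. It is not QuickScore itself; the `Node` layout and the toy trees are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: int = -1            # index of the tested feature (internal nodes)
    threshold: float = 0.0       # go left when x[feature] <= threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    value: float = 0.0           # prediction, used only at leaves

    def is_leaf(self):
        return self.left is None and self.right is None

def score_tree(root, x):
    """Walk one tree from root to leaf; return (leaf value, nodes touched)."""
    node, touched = root, 1
    while not node.is_leaf():
        node = node.left if x[node.feature] <= node.threshold else node.right
        touched += 1
    return node.value, touched

def score_forest(trees, x):
    """Additive ensemble: the document score is the sum of the leaf values."""
    total, touched = 0.0, 0
    for tree in trees:
        value, n = score_tree(tree, x)
        total += value
        touched += n
    return total, touched

tree_a = Node(0, 0.5, Node(value=0.1), Node(value=0.4))
tree_b = Node(1, 2.0, Node(value=-0.2),
              Node(0, 0.8, Node(value=0.3), Node(value=0.6)))
score, nodes_touched = score_forest([tree_a, tree_b], [0.7, 3.0])
```

With thousands of trees, the per-document cost is dominated by these pointer-chasing traversals, which is why the number of touched nodes and the memory access pattern matter.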

Big Data in Web Search
- Europeana is the biggest European Cultural Heritage portal
- Within the ASSETS EU CIP Project we designed a new ranking function, which was actually deployed on the portal!
- Users' feedback (result clicks) is exploited to learn document relevance

Big Data in Web Search
- Ranking design: define a ranking architecture, learn a ranking function, feature tuning, efficient scoring, etc.
- Data cleaning: remove near-duplicate documents from a collection of 6 billion documents with ~1000 CPU cores
- NoSQL storage, massive Hadoop MapReduce computations

User Task Discovery (UTD)
- Users have interleaving multi-tasking behavior
- Thanks to User Task Discovery, it would be possible to better understand users, to recommend tasks, etc.
(Claudio Lucchese, 2nd HPC Workshop - Playing with Learning to Rank)

User Task Discovery (UTD)
Easy process:
1. Put weights on the similarity graph
2. Remove low-weighted edges
3. Extract Connected Components
How to find a good query similarity function?
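The three steps above can be sketched with a union-find structure. This is a minimal illustration, assuming queries are numbered 0..n-1 and the similarity weights are already computed; the edge weights and the threshold below are made up.

```python
def discover_tasks(n_queries, weighted_edges, threshold):
    """Drop edges lighter than `threshold`, then return connected components.

    weighted_edges: iterable of (i, j, weight) over query indices.
    """
    parent = list(range(n_queries))

    def find(x):
        # Follow parent links to the representative, halving paths on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j, weight in weighted_edges:
        if weight >= threshold:          # keep only sufficiently similar pairs
            parent[find(i)] = find(j)

    groups = {}
    for q in range(n_queries):
        groups.setdefault(find(q), []).append(q)
    return sorted(groups.values())

edges = [(0, 1, 0.9), (1, 2, 0.2), (2, 3, 0.8), (3, 4, 0.1), (0, 4, 0.7)]
tasks = discover_tasks(5, edges, threshold=0.5)
```

Each returned group is one candidate task; the quality of the result depends entirely on the similarity function behind the weights.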

User Task Discovery (UTD)
Binary classification approach:
- Given a set of task-annotated queries
- Build a training set of query pairs labeled as same task vs. different tasks
- Define the feature set:
  - edlevgt2: edit distance between q_i and q_j
  - wordr: Jaccard distance between the sets of words of q_i and q_j
  - char_suf: number of common characters in q_i and q_j
  - nsubst_q_j_X: related to the probability of q_j being reformulated
  - time_diff: inter-query time gap between q_i and q_j
  - sequential: binary feature that is positive if q_i and q_j are issued sequentially
  - prisma: cosine between the two vectors of the top-50 pages returned by a SE for q_i and q_j
  - entropy_q_i_X: measures the rewrite probabilities from q_i
  - σ_jaccard_url: Jaccard similarity between the top-20 URLs returned by a SE for q_i and q_j
  - σ_wikipedia: cosine based on Wikipedia articles containing q_i and q_j
- Learn a classifier optimizing classification accuracy
  - Logistic Regression works sufficiently well
  - Decision trees are slightly better
- We improved over the Query-Flow-Graph approach
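As an illustration of how such a pair could be turned into a feature vector, here is a minimal sketch of three of the listed features (wordr, time_diff, sequential). The query record layout `(text, timestamp_seconds, position_in_session)` and the whitespace tokenization are assumptions, not the authors' implementation.

```python
def word_jaccard_distance(q_i, q_j):
    """1 - |A ∩ B| / |A ∪ B| over the word sets of the two queries."""
    a, b = set(q_i.lower().split()), set(q_j.lower().split())
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def pair_features(query_i, query_j):
    """Features for one query pair; each query is (text, time, position)."""
    text_i, time_i, pos_i = query_i
    text_j, time_j, pos_j = query_j
    return {
        "wordr": word_jaccard_distance(text_i, text_j),
        "time_diff": abs(time_j - time_i),
        "sequential": 1 if abs(pos_j - pos_i) == 1 else 0,
    }

feats = pair_features(("cheap flights rome", 100.0, 0), ("rome hotels", 160.0, 1))
```

A classifier is then trained on such dictionaries (one per query pair) with the same-task label as target.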

Learning to Rank for UTD
Learning to rank setting:
- A query is a query q_i in the user session
- A document is any other query q_j in the same user session
- Objective: rank same-task queries higher than different-task queries
Clustering quality:
- Sample of the AOL 2006 query log: ~8,800 queries by 127 users, manually annotated into ~1350 tasks, ~6.5 queries per task

Algo          Rand Index   Jacc. Index   Avg. F-measure
Log.Reg.      0.896        0.571         0.845
Lambda-MART   0.943        0.660         0.899
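For readers unfamiliar with the first two measures, the sketch below computes the pair-counting Rand and Jaccard indices between a predicted clustering and a ground-truth one. It assumes the usual pair-counting definitions, which the slides do not spell out, and the toy assignments are made up.

```python
from itertools import combinations

def pair_counts(pred, truth):
    """Count item pairs by whether they share a cluster in pred and in truth."""
    both = pred_only = truth_only = neither = 0
    for a, b in combinations(range(len(pred)), 2):
        same_pred = pred[a] == pred[b]
        same_truth = truth[a] == truth[b]
        if same_pred and same_truth:
            both += 1
        elif same_pred:
            pred_only += 1
        elif same_truth:
            truth_only += 1
        else:
            neither += 1
    return both, pred_only, truth_only, neither

def rand_index(pred, truth):
    both, pred_only, truth_only, neither = pair_counts(pred, truth)
    return (both + neither) / (both + pred_only + truth_only + neither)

def jaccard_index(pred, truth):
    both, pred_only, truth_only, _ = pair_counts(pred, truth)
    return both / (both + pred_only + truth_only)

truth = [0, 0, 0, 1, 1]   # hypothetical task labels of five queries
pred = [0, 0, 1, 1, 1]    # hypothetical clustering output
ri = rand_index(pred, truth)
ji = jaccard_index(pred, truth)
```

The Jaccard index ignores pairs that are separated in both clusterings, which is why it is typically much lower than the Rand index when there are many small tasks.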

Entity Linking
The goal is to identify relevant entities mentioned by fragments of text. Entities are taken from a given catalogue, e.g. Wikipedia.

Entity Linking
State-of-the-art approaches run three steps:
1. Spotting
   - Given a document, find fragments of text potentially referring to entities, a.k.a. spots
   - A common approach is to match anchors in Wikipedia
   - Some spots are ambiguous, e.g. Michael Collins
2. Disambiguation
   - Given a set of spots in a document, find the correct entity for each spot
   - Steps 1 and 2 are sometimes referred to as Word Sense Disambiguation
3. Link Detection
   - Given a document, its terms and their senses, decide where to put links, e.g., Bank of London
The main research question is: how to improve the disambiguation step?
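The spotting step can be sketched as a greedy longest-match lookup against an anchor dictionary. This is a minimal illustration: the tiny `anchors` dictionary, the whitespace tokenization and the `max_len` limit are assumptions, and a real system would build the dictionary from Wikipedia anchor text.

```python
def spot(text, anchors, max_len=3):
    """Return (spot, candidate entities) pairs, preferring the longest match."""
    tokens = text.lower().split()
    spots, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in anchors:
                spots.append((phrase, anchors[phrase]))
                i += n               # skip past the matched fragment
                break
        else:
            i += 1                   # no anchor starts at this token
    return spots

# Hypothetical anchor dictionary: anchor text -> candidate entities.
anchors = {
    "michael collins": ["Michael Collins (astronaut)", "Michael Collins (Irish leader)"],
    "apollo 11": ["Apollo 11"],
}
found = spot("Michael Collins flew on Apollo 11", anchors)
```

The first spot comes back with two candidates, which is exactly the ambiguity the disambiguation step has to resolve, typically by looking at the other entities in the same document.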

Saliency Driven Entity Linking
- Mentioned entities do not all have the same importance in the given document
- We defined 3 levels of saliency
- We trained a model to rank entities according to their expected saliency

Table 2: Entity linking and saliency prediction performance.

                   CoNLL                   Wikinews
                   Rec     Prec    F1      Rec     Prec    F1      NDCG    F_top1
GBDT-F             0.696   0.774   0.717   0.755   0.704   0.715   0.724   0.448
GBRT-F             0.753   0.734   0.728   0.727   0.737   0.719   0.733   0.316
LRC-F              0.692   0.720   0.685   0.733   0.708   0.707   0.705   0.434
Tagme              0.677   0.588   0.611   0.773   0.665   0.701   0.654   0.182
Wikiminer          0.549   0.433   0.458   0.776   0.534   0.620   0.639   0.159
Spotlight          0.508   0.261   0.311   0.564   0.310   0.382   0.469   0.103
1-Step GBRT-F l    0.687   0.690   0.666   0.695   0.731   0.694   0.713   0.335

What's Next
- The software we developed is open-source and available at http://dexter.isti.cnr.it/
- Endless applications: news stream analysis (for understanding, summarization, sentiment), annotation of Web queries and Web documents

News Recommendation
How can we recommend the most relevant news to a given user?
- There are a lot of information sources
- Users are different
- Timeliness is crucial!

Time is crucial
[Figure: delay between news publication and clicks]

Time is crucial: Earthquakes and Twitter
There are some arguments about Twitter waves being faster than seismic waves

Time is crucial: News Agencies vs. Twitter
- Tweets about Osama Bin Laden's death
- Twitter is sometimes faster than other media

Twitter vs. News streams
The main research questions are:
- Streaming analysis of the Twitter stream
- Detect trending topics early in Twitter
- Model the information spreading process in Twitter
- Detect trending topics early in news
- Recommend news to users according to their tastes
Our research question is: can we use information both from the Twitter stream and from the news stream to provide personalized news recommendation?

When is a news article relevant?
Our assumptions. A news article is interesting if:
1. It discusses topics of interest to the user (e.g., computer science)
2. It discusses topics of interest to the social network of the user (e.g., computer science, art exhibitions in Tuscany)
3. It does not fall in the user's interests, but it is of general relevance (e.g., Ukraine crisis)
We need a way to detect topics in text:
- Topics provide a higher-level view
- Topics can bridge the gap between Twitter and news streams
We model the relation between news, tweets and users in terms of the topics/entities they discuss

Tweets and news relatedness
- We extracted topics from tweets and from news
- Let Z be the set of topics; in our case, Z is the set of Wikipedia pages
- Let T(i, j) be the relevance of topic z_j for tweet t_i
- Let N(j, k) be the relevance of topic z_j for news n_k
- The product M = TN is used to estimate the relatedness between tweets and news: M(i, k) is the relatedness between the tweet t_i and the news n_k based on the co-cited topics

Content and Social relevance
- Content relatedness is based on the user's history of tweets
- Let the binary matrix A be such that A(u, i) = 1 iff user u tweeted tweet t_i
- Content relatedness is defined by the matrix Γ = AM, where Γ(u, k) is the relevance of news n_k for user u based on the entities mentioned in u's tweets and in n_k
- Social relatedness is a function of the tweets in the social network S of the user:
  $\Sigma = \left( \sum_{i=1}^{d} \sigma^{i} S^{i} \right) A M$
- σ is a damping factor, d is the max distance in the social network
- Σ(u, k) is the relevance of news n_k for users in the social circles of u
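The three matrix products (tweet-news relatedness, content relatedness and social relatedness) can be sketched on toy data as follows. This is a minimal pure-Python illustration: the matrices are made up, S is assumed to be a plain adjacency matrix with S(u, v) = 1 when u follows v, and any normalization used in the original work is not reproduced.

```python
def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def matadd(X, Y):
    return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]

def scale(c, X):
    return [[c * x for x in row] for row in X]

def social_reach(S, sigma, d):
    """Damped sum of the powers of S: sigma^1 S^1 + ... + sigma^d S^d."""
    power, total = S, scale(sigma, S)
    for i in range(2, d + 1):
        power = matmul(power, S)
        total = matadd(total, scale(sigma ** i, power))
    return total

T = [[1.0, 0.0],        # tweet 0 is about topic 0
     [0.0, 1.0]]        # tweet 1 is about topic 1
N = [[1.0, 0.0, 0.5],   # relevance of topic 0 for news 0, 1, 2
     [0.0, 1.0, 0.5]]   # relevance of topic 1 for news 0, 1, 2
A = [[1, 0],            # user 0 wrote tweet 0
     [0, 1]]            # user 1 wrote tweet 1
S = [[0, 1],            # user 0 follows user 1 (assumed convention)
     [0, 0]]

M = matmul(T, N)                                        # tweets x news
Gamma = matmul(A, M)                                    # content relatedness
Sigma = matmul(matmul(social_reach(S, 0.5, 2), A), M)   # social relatedness
```

In this toy case user 0 gets content relevance only for news about topic 0, while the social term adds damped relevance for the news related to what user 1 tweeted about.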

Topic popularity over time
- Let Z be a (row) vector where Z(j) is the popularity of topic z_j:
  $Z_{\tau} = \lambda Z_{\tau-1} + w_T H_T + w_N H_N$
- Three components contribute to the popularity of a topic:
  - The popularity at the previous time-step τ-1 (exponential forgetting, with factor λ)
  - The estimated popularity in the stream of tweets (first order derivative)
  - The estimated popularity in the stream of news (first order derivative)
- Weights w_T and w_N are set equal in our experiments
- Topic popularity is defined by the vector Π = ZN, where Π(k) is the popularity of news article n_k based on the topics it discusses
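The popularity update described above (previous popularity discounted, plus weighted trend terms from tweets and news) can be sketched as below. The parameter name `decay` and all numeric values are assumptions for illustration; the trend vectors are taken as given rather than estimated from streams.

```python
def update_popularity(z_prev, h_tweets, h_news, decay, w_t, w_n):
    """One time-step: discounted old popularity plus weighted trend estimates."""
    return [decay * z + w_t * ht + w_n * hn
            for z, ht, hn in zip(z_prev, h_tweets, h_news)]

def news_popularity(z, N):
    """Popularity of each news article from the topics it discusses (Z times N)."""
    return [sum(z[j] * N[j][k] for j in range(len(z))) for k in range(len(N[0]))]

# Two topics: topic 0 was popular before, topic 1 is trending now.
z = update_popularity([1.0, 0.0], [0.0, 2.0], [0.0, 1.0],
                      decay=0.5, w_t=0.5, w_n=0.5)
pi = news_popularity(z, [[1.0, 0.0],    # news 0 discusses topic 0
                         [0.0, 1.0]])   # news 1 discusses topic 1
```

After one step the trending topic already outweighs the previously popular one, so the article discussing it receives the higher popularity.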

Learning to rank formulation
- We say that the relevance of a news article n for a user u is a linear combination of social relevance, content relevance and topic popularity:
  $R_{\tau}(u, n) = \alpha\, \Sigma_{\tau}(u, n) + \beta\, \Gamma_{\tau}(u, n) + \gamma\, \Pi_{\tau}(n)$
- This can be thought of as a ranking problem: given a set of news at time τ, sort them according to their relevance R_τ(u, n) and propose the best to the user u
- Learn a ranking/relevance function that promotes clicked news
- Find the best α, β, γ such that clicked news are ranked higher than non-clicked ones
- This can be mapped into a Support Vector Machine formulation
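One common way to map "clicked above non-clicked" into an SVM-style problem is a pairwise hinge loss over (clicked, non-clicked) pairs. The sketch below is an assumption-laden illustration, not the authors' training procedure: each article is a tuple of its three component scores, the pairs are made up, and plain subgradient descent stands in for a real SVM solver.

```python
def learn_weights(pairs, epochs=200, lr=0.1, reg=0.01):
    """Pairwise hinge loss with L2 regularization, by subgradient descent.

    pairs: list of (clicked_features, non_clicked_features), 3 values each.
    """
    w = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        for clicked, other in pairs:
            diff = [c - o for c, o in zip(clicked, other)]
            margin = sum(wi * di for wi, di in zip(w, diff))
            for k in range(3):
                # The hinge term is active only while the pair margin is below 1.
                grad = reg * w[k] - (diff[k] if margin < 1.0 else 0.0)
                w[k] -= lr * grad
    return w

def relevance(w, features):
    """Linear combination of the three component scores."""
    return sum(wi * fi for wi, fi in zip(w, features))

# Hypothetical component scores for (clicked, non-clicked) article pairs.
pairs = [
    ((0.9, 0.2, 0.1), (0.1, 0.3, 0.1)),
    ((0.7, 0.1, 0.5), (0.2, 0.1, 0.6)),
]
w = learn_weights(pairs)
ranked_ok = all(relevance(w, c) > relevance(w, o) for c, o in pairs)
```

Only score differences within a pair matter here, which is what makes the formulation a ranking problem rather than a click-probability estimate.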

Dataset used
- 1 million English tweets by 3,214 users (May 2011)
- 40,000 news articles from the Yahoo! News portal containing at least one topic mentioned in the tweet stream
- Yahoo! Toolbar data for clicks on news articles
- We also used toolbar data to link Web users with Twitter users, assuming that a user's own Twitter account is the one they visit most (filtering out popular public persons, celebrities, etc.)
- The training and test sets are built under the assumption that a clicked news article must be the top ranked at the time of its publication

T.Rex: Twitter-based News Recommendation System
Predicting clicked entities
$\mathrm{DCG}(N) = \sum_{j=1}^{N} \frac{G[j]}{\log(j + 1)}$
where G[j] is the entity intersection with the clicked news (rescaled in 0, ..., 4)
[Figure: Average DCG vs. Rank (1 to 20) for T.Rex+, T.Rex, Popularity, Content, Social, Recency and Click count]

What's Next
- The mining of Twitter data requires large-scale tools; everything was implemented with MapReduce
- Streaming data mining tools should be adopted
- More complex/interesting models instead of a linear combination
- Correlation of Twitter streams with other streams, e.g. query logs

(credits to David Crandall et al., Cornell University)

Research Challenges
The analysis of a large and noisy collection of social geo-tagged photos poses several challenges:
1. Clean and organize the collection in semantically coherent clusters
2. Associate relevant PoIs with these clusters
3. Devise routes of tourists through these PoIs and characterize as precisely as possible the behaviors of tourists
4. Extract and exploit such knowledge
Example photo tags: pontevecchio, trip, firenze, palazzo vecchio, canon, florence
Our research question is: how to provide personalized PoI recommendations?

How much can we understand from photos?

Visual Clustering
Goal: to reduce the cost of computing similarity
- H. Bay, T. Tuytelaars, and L. Van Gool. SURF: Speeded up robust features. Computer Vision, ECCV 2006.
- G. Csurka, C. Dance, L. Fan, J. Willamowski, and C. Bray. Visual categorization with bags of keypoints. ECCV, 2004.

Labeling with tags
Two key ideas:
- Using the spatial relevance of tags. Measure: ratio between the tag area and the overall geographical area analyzed
- Using the social relevance of tags. Measure: number of different users using a given tag

Clustering and enriching Flickr photos
Example tags: Lucca, Toscana, Vacanze 2011, Tuscany, Italy, Summer 2011, Oval Square, Piazza dell'Anfiteatro, Piazza del mercato, square, happy new year, windows, balcony, canon, nikon

Enriching older photos

Mining Trajectories from Flickr
- Colosseum: 3 photos, 01/07/2013, 9:00-12:00
- Ruins: 2 photos, 01/07/2013, 13:30-15:00
- Trevi Fountain: 2 photos, 01/07/2013, 15:42-16:00
Devise patterns of tourists' behavior...
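A trajectory like the one above can be derived from one user's photos by sorting them by time and collapsing consecutive photos taken at the same PoI into a single visit. The sketch below assumes each photo has already been mapped to a PoI, reads 01/07/2013 as 1 July 2013, and invents the intermediate timestamp of 10:30.

```python
from datetime import datetime

def build_trajectory(photos):
    """photos: list of (timestamp, poi) for one user. Returns a list of visits."""
    visits = []
    for ts, poi in sorted(photos):
        if visits and visits[-1]["poi"] == poi:
            visits[-1]["end"] = ts           # extend the current visit
            visits[-1]["photos"] += 1
        else:
            visits.append({"poi": poi, "start": ts, "end": ts, "photos": 1})
    return visits

def at(hour, minute):
    # Date format assumed to be day/month/year.
    return datetime(2013, 7, 1, hour, minute)

photos = [
    (at(13, 30), "Ruins"), (at(9, 0), "Colosseum"), (at(15, 42), "Trevi Fountain"),
    (at(12, 0), "Colosseum"), (at(10, 30), "Colosseum"), (at(15, 0), "Ruins"),
    (at(16, 0), "Trevi Fountain"),
]
trajectory = build_trajectory(photos)
```

The time between the first and last photo of a visit gives a lower bound on the time spent at the PoI, which is the kind of cost a tour planner needs.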

Planning Sightseeing Tours with TripBuilder
What should I visit in San Francisco?
Given:
- Time: 2 days
- My preferences
How do other tourists visit such places? How many of these trajectories can I enjoy?
[Figure: map with durations of 4 h, 4 h and 8 h over the PoIs Golden Gate Bridge, Golden Gate Park, California Academy of Sciences, de Young Museum, San Francisco Museum of Modern Art, Aquarium of the Bay and Alcatraz]

The TripCover Problem
Given:
- A set of popular trajectories crossing a set of PoIs and their time cost
- The relevance of the trajectories w.r.t. the category set
- The Time Budget and Preferences of a user
- A measure of PoI-User interest
Find: the subset of trajectories that maximizes user interest and fits in the time budget
TripCover is an instance of the Generalized Maximum Coverage (GMC) problem: NP-hard, with an (e/(e-1))-approximation algorithm.
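To give a feel for the problem, here is a simple greedy heuristic that repeatedly picks the affordable trajectory with the best ratio of newly covered interest to time cost. It is only an illustrative sketch: it is not the (e/(e-1))-approximation algorithm mentioned above and carries no guarantee, and the trajectories, costs and interest values are hypothetical.

```python
def greedy_trip_cover(trajectories, interest, budget):
    """trajectories: name -> (time_cost, set of PoIs); interest: PoI -> score.

    Each PoI counts once, so only PoIs not yet covered add interest.
    Time costs are assumed to be strictly positive.
    """
    chosen, covered, remaining = [], set(), budget
    while True:
        best, best_ratio = None, 0.0
        for name, (cost, pois) in trajectories.items():
            if name in chosen or cost > remaining:
                continue
            gain = sum(interest.get(p, 0.0) for p in pois - covered)
            if gain / cost > best_ratio:
                best, best_ratio = name, gain / cost
        if best is None:                     # nothing useful fits any more
            return chosen, covered
        cost, pois = trajectories[best]
        chosen.append(best)
        covered |= pois
        remaining -= cost

trajectories = {
    "bridge_and_park": (4, {"Golden Gate Bridge", "Golden Gate Park"}),
    "museums": (4, {"California Academy of Sciences", "de Young Museum"}),
    "bay": (8, {"San Francisco Museum of Modern Art", "Aquarium of the Bay", "Alcatraz"}),
}
interest = {
    "Golden Gate Bridge": 0.9, "Golden Gate Park": 0.4,
    "California Academy of Sciences": 0.8, "de Young Museum": 0.3,
    "San Francisco Museum of Modern Art": 0.5, "Aquarium of the Bay": 0.2,
    "Alcatraz": 0.9,
}
chosen, covered = greedy_trip_cover(trajectories, interest, budget=8)
```

With an 8-hour budget the heuristic prefers the two short trajectories over the single long one, because together they cover more interest per hour.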

TrajSP: joining the trajectories
- The TripCover solution is a set of trajectories fitting user interest and time budget
- Local search heuristics connect the solution into a single sightseeing tour
[Figure: panels (a)-(c) for l(i, k) = 4 and panels (d)-(f) for l(i, k) = 3, over the nodes i, k, n(i), n(k), e(i), e(k) and n(e(i))]

http://tripbuilder.isti.cnr.it

The End!
"Data! Data! Data!" he cried impatiently. "I can't make bricks without clay."